PyTorch 2 GPU Performance

A benchmark-based performance comparison of the new PyTorch 2 with the well-established PyTorch 1. The benchmarks cover different areas of deep learning, such as image classification and language models. The results show that PyTorch 2 generally outperforms PyTorch 1 and scales well across multiple GPUs.

PyTorch has become one of the most popular deep learning frameworks since its release in 2016. Developed by Facebook's AI Research (FAIR) team, PyTorch has gained a reputation for its ease of use and extensive support for neural network architectures. It has been adopted by researchers and industry practitioners alike for a wide range of applications, from image and speech recognition to natural language processing and reinforcement learning. In September 2022, PyTorch became part of the Linux Foundation through the newly established PyTorch Foundation.

The release of PyTorch 2.0 in March 2023 introduced a number of significant changes to improve performance and to support dynamic shapes and distributed training. One major feature is torch.compile, the main API of PyTorch 2.0. It wraps your model and returns a compiled model that in many cases performs considerably better. The feature is fully additive and optional, which keeps PyTorch 2 backward compatible. In most cases, compiling a model requires adding only one line of code:

model = torch.compile(model)
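As a minimal sketch of how this one-line change can sit in existing code, the snippet below compiles a model when torch.compile is available and otherwise leaves it unchanged. The helper name maybe_compile and the small example network are illustrative assumptions and are not taken from the benchmark code.

```python
import torch
import torch.nn as nn


def maybe_compile(model: nn.Module) -> nn.Module:
    """Hypothetical helper: compile on PyTorch 2.x, pass through on PyTorch 1.x."""
    if hasattr(torch, "compile"):
        # The wrapper is returned immediately; the actual compilation is
        # deferred until the first forward pass.
        return torch.compile(model)
    return model


# Small stand-in network, used only for illustration.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model = maybe_compile(model)
```

Because the compiled object is still a regular module, the rest of a training loop can usually stay as it is.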

This blog article aims to analyze and compare the performance of PyTorch 1 and PyTorch 2. To accomplish this, we used the models ResNet-50 and BERT, which are described in the following sections.

To get started, some important terms and settings used for the benchmark training process are explained first:


A Residual Neural Network, or ResNet, was first introduced in 2015 for image classification. ResNet is considered one of the first truly deep networks. It addressed the problem of vanishing/exploding gradients that occurred in previously used network structures when the number of intermediate layers was increased (see Deep Residual Learning for Image Recognition). The characteristic feature of residual networks is the use of "skip connections", which let the input of a group of layers bypass those layers and be added to their output. This makes it possible to train much deeper networks.
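The skip connection can be sketched in a few lines of PyTorch. The block below is a simplified two-convolution residual block, chosen for readability. ResNet-50 itself uses three-convolution bottleneck blocks, so the class name, channel count and input size here are illustrative assumptions only.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Simplified residual block: output = ReLU(F(x) + x)."""

    def __init__(self, channels: int) -> None:
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x  # the skip connection keeps the unmodified input
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Adding the input back gives gradients a direct path past the convolutions.
        return self.relu(out + identity)


block = ResidualBlock(channels=8)
x = torch.randn(2, 8, 16, 16)  # (batch, channels, height, width)
y = block(x)
```

Since the block's output has the same shape as its input, such blocks can be stacked to build deep networks.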

The ResNet-50 model version 1.5, used for our benchmarks, consists of 49 convolutional layers and one fully connected layer, for a total of 49+1=50 weighted layers with about 25 million parameters (the MaxPool and Average-Pool layers carry no weights and are not counted). As it is used in many benchmarks, a highly optimized implementation is available that gets the most out of the GPU and shows where the actual performance limits of the hardware lie.


BERT stands for Bidirectional Encoder Representations from Transformers and is a deep learning model for natural language processing, developed by Google in 2018. It uses the Transformer architecture to learn a context-sensitive representation of words in a text. BERT was trained on a large amount of unlabeled text to gain a general understanding of natural language. Through this training, BERT can understand context-specific meanings of words and distinguish ambiguous expressions. The special feature of BERT is its bidirectional modeling. Unlike previous language models that only considered the preceding words as context, BERT models the context bidirectionally. This means that BERT uses both the preceding and following words in the text to understand the meaning of a particular word.

We used the two variants "BERT large cased" and "BERT base cased" for our benchmarks. BERT large consists of 24 layers, 1024 hidden dimensions, 16 attention heads, and a total of 335 million parameters, while BERT base consists of 12 layers, 768 hidden dimensions, 12 attention heads, and 110 million parameters. Both versions differentiate between uppercase and lowercase letters.

Batch Size ('bs')

One of the most important hyperparameters in deep learning training is the batch size. It refers to the number of samples from the dataset that are processed in one training iteration. The training dataset is divided into "batches" that are propagated through the system one after another until the entire dataset has been processed, completing an epoch.
In the case of small datasets fitting entirely into the GPU memory, the batch size can also include the entire dataset, resulting in only one pass per epoch. However, this is typically not the case, which is why dividing the dataset into smaller batches is useful. In literature, the batch is often referred to as a mini-batch to differentiate it from the entire dataset. Since the model weights are updated based on the insights from the current batch after each iteration, the choice of batch size also affects training results and performance.
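The relation between dataset size, batch size and the number of iterations per epoch can be written down as a small calculation. The function name and the example numbers below are hypothetical and are not the values used in the benchmarks.

```python
import math


def iterations_per_epoch(dataset_size: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of training iterations needed to pass over the dataset once."""
    if dataset_size < 0 or batch_size <= 0:
        raise ValueError("dataset_size must be >= 0 and batch_size must be > 0")
    if drop_last:
        # An incomplete final batch is discarded.
        return dataset_size // batch_size
    # An incomplete final batch still counts as one iteration.
    return math.ceil(dataset_size / batch_size)


# Hypothetical example: 50,000 samples with a batch size of 64.
steps = iterations_per_epoch(dataset_size=50_000, batch_size=64)

# If the batch holds the entire dataset, one epoch is a single pass.
single_pass = iterations_per_epoch(dataset_size=1_000, batch_size=1_000)
```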

In our benchmark results the batch size used for the respective GPU is indicated by 'bs'.

Numerical Precision: fp32/AMP

The numerical precision used for computing the weights and associated values in deep learning models plays a significant role in the training process. Higher precision enables finer weight adjustments, but it also requires more memory and slows down computation.

In our benchmarks, we examine the performance of "fp32" data types and calculations that utilize the "Automatic Mixed Precision" technique.

The "fp32" (floating point 32-bit) data type is the most widely used standard in deep learning. It uses a 32-bit encoding, consisting of 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa.
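To make the 1/8/23 bit layout concrete, the following sketch splits a value into its three fp32 fields using only the Python standard library. The function name fp32_fields is an illustrative assumption.

```python
import struct
from typing import Tuple


def fp32_fields(value: float) -> Tuple[int, int, int]:
    """Return (sign, exponent, mantissa) of `value` encoded as a 32-bit float."""
    # Pack as big-endian fp32, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF      # 23 bits, fraction without the implicit leading 1
    return sign, exponent, mantissa


sign, exponent, mantissa = fp32_fields(-0.15625)  # -0.15625 = -1.25 * 2**-3
```

For this input the stored exponent is 124, which is the true exponent -3 plus the bias of 127.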

Automatic Mixed Precision (AMP) is a technique that is becoming increasingly popular in the deep learning community. It involves using different numerical precisions (such as fp8, fp16, fp32, and fp64) during the training process to make training more efficient while preserving the accuracy of the model. The idea behind AMP is that some parts of the model are more sensitive to numerical precision than others. By using higher precision where necessary and lower precision where it matters less, calculations become faster and more memory-efficient overall without compromising the model's accuracy. However, implementing Automatic Mixed Precision can be complex and often requires special hardware support.
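In PyTorch, AMP is typically enabled with torch.autocast and a gradient scaler. The following is a minimal sketch of a single training step, assuming a toy linear model and random data. It is not the benchmark code, and it falls back to bfloat16 on the CPU when no GPU is present.

```python
import torch
import torch.nn as nn

# Assumption for this sketch: use the GPU if present, otherwise the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

model = nn.Linear(64, 10).to(device)  # toy model, illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Loss scaling guards fp16 gradients against underflow; it is disabled on CPU.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

inputs = torch.randn(8, 64, device=device)
targets = torch.randint(0, 10, (8,), device=device)

optimizer.zero_grad()
# Inside autocast, eligible operations run in lower precision, while
# precision-sensitive ones stay in fp32.
with torch.autocast(device_type=device, dtype=amp_dtype):
    logits = model(inputs)
    loss = nn.functional.cross_entropy(logits, targets)

scaler.scale(loss).backward()  # scale the loss, then backpropagate
scaler.step(optimizer)         # unscale gradients and update the weights
scaler.update()                # adapt the scale factor for the next step
```

The model weights themselves remain in fp32. Only the computations inside the autocast region are cast to the lower precision.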


To fully utilize the capacity of all GPUs involved in multi-GPU training, we used the DistributedDataParallel module of PyTorch with one subprocess per GPU. This results in significantly better performance than DataParallel, where all GPUs are driven from a single process.
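A rough sketch of the one-process-per-GPU setup is shown below. It assumes the script is started by a launcher such as torchrun, which sets the RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT environment variables. The function name setup_ddp_model and the script name train.py are hypothetical, and the benchmark tool may organize this differently.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp_model(model: nn.Module) -> nn.Module:
    """Join the process group and wrap `model` for data-parallel training.

    Expects the environment variables that `torchrun` provides.
    """
    use_cuda = torch.cuda.is_available()
    # NCCL is the usual backend for Nvidia GPUs; Gloo serves as a CPU fallback.
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    if use_cuda:
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)  # bind this process to one GPU
        return DDP(model.to(local_rank), device_ids=[local_rank])
    return DDP(model)


# Launch with one process per GPU, for example (hypothetical script name):
#   torchrun --nproc_per_node=4 train.py
```

Each process then runs the same training loop on its own share of the data, and DDP averages the gradients across processes during the backward pass.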

Testing environment

The benchmarks were conducted using the AIME benchmark tool, which can be downloaded from GitHub (pytorch-benchmark). The following PyTorch versions were used for the Nvidia GPUs:

  • PyTorch 1.12.1 with CUDA 11.3.109 and Python 3.9.12
  • PyTorch 1.13.1 with CUDA 11.7.99 and Python 3.8.10
  • PyTorch 2.0.0 with CUDA 11.8.89 and Python 3.8.10
  • PyTorch 2.0.0 with CUDA 11.8.89 and Python 3.8.10 using torch.compile for the model
  • PyTorch 2.1.0 with CUDA 11.8.89 and Python 3.10.12 using torch.compile for the model

The AMD GPU was tested on PyTorch 1.12.1 with ROCm 5.4.2 and Python 3.8.10.

Depending on the type of GPU, two different AIME systems were used, as some GPUs are suitable for workstations, while others, especially those with passive cooling, should only be used in servers.

The following GPUs were tested using the AIME Workstation G400:

  • Nvidia GeForce RTX 4090
  • Nvidia GeForce RTX 3090
  • Nvidia GeForce RTX 3090 Ti
  • Nvidia GeForce RTX 2080 Ti
  • Nvidia GeForce RTX 4060 Ti
  • Nvidia RTX A5000
  • Nvidia RTX A5500

The following GPUs were tested using the AIME Server A4000:

  • Nvidia A100 40GB
  • Nvidia A100 80GB
  • Nvidia H100
  • AMD Instinct MI100
  • Nvidia Tesla V100
  • Nvidia RTX 6000 Ada
  • Nvidia RTX 5000 Ada
  • Nvidia RTX A6000

Benchmark results

The following bar charts show, for each GPU, the training performance of the respective model across several PyTorch versions.

ResNet-50, datatype fp32
ResNet-50, automatic mixed precision
BERT base cased, datatype fp32
BERT base cased, automatic mixed precision
BERT large cased, datatype fp32
BERT large cased, automatic mixed precision

The charts show that training performance improves significantly for all models on all GPUs, especially those of the latest generation, when PyTorch 2 is used with compiled models. In some cases, for instance on the Nvidia A100 with the model BERT base cased, performance improves by a factor of 3. The promise mentioned above of only needing to add one line of code held for the models we used.

Next, we examine the performance of multi-GPU training with the Nvidia A100 40GB, Nvidia A100 80GB, Nvidia RTX 6000 Ada, Nvidia RTX 4090 and Nvidia RTX 5000, using ResNet-50 with fp32 precision and BERT large cased with AMP as examples. As mentioned above, the DistributedDataParallel module of PyTorch was used.

On ResNet-50, the performance scale factor is between 0.9 and 0.95, meaning that each additional GPU adds between 90% and 95% of the single-GPU performance. A similar behavior is observed for the model BERT base cased using automatic mixed precision.
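Read this way, the scale factor translates into an expected overall speedup of 1 + (n - 1) * s for n GPUs and scale factor s. The short sketch below works this out. The function name and the choice of four GPUs are illustrative assumptions and not measured results.

```python
def multi_gpu_speedup(num_gpus: int, scale_factor: float) -> float:
    """Expected speedup over one GPU if every additional GPU contributes
    `scale_factor` times the single-GPU performance."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return 1.0 + (num_gpus - 1) * scale_factor


# Hypothetical example: four GPUs at both ends of the reported range.
low = multi_gpu_speedup(num_gpus=4, scale_factor=0.90)   # 1 + 3 * 0.90
high = multi_gpu_speedup(num_gpus=4, scale_factor=0.95)  # 1 + 3 * 0.95
print(f"4 GPUs: {low:.2f}x to {high:.2f}x the single-GPU throughput")
```

Under this reading, four GPUs would be expected to deliver roughly 3.7 to 3.85 times the throughput of a single GPU.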


The introduction of PyTorch 2 makes it possible to optimize the effectiveness of your own deep learning models and achieve significant increases in performance.

Use PyTorch 2 torch.compile and automatic mixed precision to get the best possible GPU performance out of PyTorch

Thanks to its compiled models, PyTorch 2 significantly increases performance compared to all previous PyTorch versions, even compared to special NVIDIA-optimized PyTorch versions.

Multi-GPU training scales performance almost linearly

With a scale factor of 90-95 percent, multi-GPU training is a good way to scale performance.

Updating your application to PyTorch 2.0 is worth it! You get a free performance boost for all current GPU models without much effort.
