PyTorch 2 GPU Performance Benchmarks (Update)

An overview of PyTorch performance on the latest GPU models. The benchmarks cover the training of LLMs and image classification models. They show the possible GPU performance improvements gained by using newer PyTorch versions and features, and compare the achievable GPU performance and scaling across multiple GPUs.

PyTorch has become the most popular deep learning framework since its release in 2016. It has been adopted by researchers and industry practitioners alike for a wide range of applications, from image and speech recognition to large language model processing and reinforcement learning.

This article aims to analyze and track the performance improvements of various PyTorch versions over time. To accomplish this, we use the models BERT and ResNet-50, which are explained in more detail in the sections below.

To get started, here is a brief explanation of some important terms and settings used for the benchmark training process:

BERT Model

We used the variant "BERT large cased" for our benchmarks. BERT large is a transformer model consisting of 24 layers, 1024 hidden dimensions, 16 attention heads, and a total of 335 million parameters. "Cased" means that this BERT variant differentiates between upper- and lowercase characters in the input.

BERT stands for Bidirectional Encoder Representations from Transformers and is a deep learning model for natural language processing, developed by Google in 2018. It uses the Transformer architecture to learn a context-sensitive representation of words in a text. BERT was trained on a large amount of unlabeled text to gain a general understanding of natural language. Through this training, BERT can understand context-specific meanings of words and distinguish ambiguous expressions. The special feature of BERT is its bidirectional modeling. Unlike previous language models that only considered the preceding words as context, BERT models the context bidirectionally. This means that BERT uses both the preceding and following words in the text to understand the meaning of a particular word.
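
For illustration, the following sketch shows one common way to instantiate this variant with the Hugging Face transformers library and to verify the numbers above; this is not necessarily how the benchmark tool builds the model:

# Illustrative only: loading "BERT large cased" via the Hugging Face transformers library.
# The AIME benchmark tool may construct the model differently.
from transformers import BertModel

model = BertModel.from_pretrained("bert-large-cased")

print(model.config.num_hidden_layers)    # 24 transformer layers
print(model.config.hidden_size)          # 1024 hidden dimensions
print(model.config.num_attention_heads)  # 16 attention heads
print(sum(p.numel() for p in model.parameters()))  # total parameter count (roughly the 335 million mentioned above)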

ResNet-50 Model

The ResNet-50 model version 1.5, used for our benchmarks, consists of 48 convolutional layers as well as a MaxPool and an AveragePool layer, for a total of 48 + 1 + 1 = 50 layers with about 25 million parameters. Since it is used in many benchmarks, a near-optimal implementation is available that draws maximum performance from the GPU and shows where the actual compute limits of the hardware lie.

A Residual Neural Network, or ResNet, was first introduced in 2015 for image classification. ResNet is considered one of the first truly deep learning networks. It solved the problem of vanishing/exploding gradients that occurred in earlier plain network architectures when the number of intermediate layers was increased (see Deep Residual Learning for Image Recognition). The characteristic feature of residual networks is the use of "skip connections" between layers, allowing individual layers to be skipped. This makes much deeper networks feasible while keeping the gradients stable.
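
As a small sketch (again illustrative, not necessarily how the benchmark tool builds the model), ResNet-50 is available in torchvision and its parameter count can be checked directly:

# Illustrative only: instantiating ResNet-50 with torchvision and checking its size.
import torchvision

model = torchvision.models.resnet50(weights=None)   # randomly initialized ResNet-50
num_params = sum(p.numel() for p in model.parameters())
print(f"ResNet-50 parameters: {num_params / 1e6:.1f} M")   # about 25.6 M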

Batch Size ('bs')

One of the most important hyperparameters in deep learning training is the batch size. It refers to the number of samples from the dataset that are processed together in one training iteration. The training dataset is divided into "batches" that are propagated through the network one after another until the entire dataset has been processed, completing an epoch.
If a small dataset fits entirely into GPU memory, the batch size can even cover the whole dataset, resulting in only one pass per epoch. However, this is typically not the case, which is why dividing the dataset into smaller batches is useful. In the literature, a batch is often referred to as a mini-batch to differentiate it from the entire dataset. Since the model weights are updated after each iteration based on the insights from the current batch, the choice of batch size also affects training results and performance.

In our benchmark results, the batch size used for the respective GPU is indicated by 'bs'.
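
As a minimal sketch (with a small random placeholder dataset instead of the benchmark data), the batch size is simply the batch_size argument of a PyTorch DataLoader:

# Minimal sketch: the batch size ('bs') is the batch_size argument of the DataLoader.
# The dataset is a small random placeholder, not the benchmark data.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)   # bs = 64 -> 16 iterations per epoch

for features, labels in loader:   # each iteration processes one batch of 64 samples
    pass                          # forward/backward pass and weight update would go here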

Numerical Precision: fp32/AMP

The numerical precision used for computing the weights and associated values in deep learning models plays a significant role in the training process. Higher precision enables finer weight adjustments, but it also requires more memory and slows down computation.

In our benchmarks, we examine the performance of the "fp32" data type and of calculations that use the "Automatic Mixed Precision" (AMP) technique.

The "fp32" (floating point 32-bit) data type is the most widely used standard in deep learning. It uses a 32-bit encoding, consisting of 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa.

Automatic Mixed Precision (AMP) is a technique that is becoming increasingly popular in the deep learning community. It involves using different numerical precisions (in PyTorch typically fp16 or bf16 alongside fp32) during the training process to improve the efficiency of training deep learning models. The idea behind AMP is that some parts of the model are more sensitive to numerical precision than others. By using higher precision where necessary and lower precision where it matters less, calculations can be made faster and more efficient overall without compromising the model's accuracy. However, implementing mixed precision by hand can be complex, and the speedup requires hardware support such as the Tensor Cores found in recent NVIDIA GPUs.
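
In PyTorch, AMP is available out of the box via torch.autocast and a gradient scaler; a minimal training-step sketch (with a placeholder model and random data instead of BERT or ResNet-50) could look like this:

# Minimal AMP training-step sketch; the model and data are placeholders,
# not the BERT/ResNet-50 setups used in the benchmarks.
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()   # scales the loss to avoid fp16 gradient underflow

inputs = torch.randn(64, 128, device="cuda")
targets = torch.randint(0, 10, (64,), device="cuda")

optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.float16):   # ops run in fp16 where safe, fp32 otherwise
    loss = nn.functional.cross_entropy(model(inputs), targets)
scaler.scale(loss).backward()   # backward pass on the scaled loss
scaler.step(optimizer)          # unscales the gradients, then runs the optimizer step
scaler.update()                 # adapts the scale factor for the next iteration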

Compile Mode

The release of PyTorch 2.0 in March 2023 introduced a number of significant changes to improve performance and to support dynamic shapes and distributed training. One major performance feature of PyTorch 2 is torch.compile as its main new API. It wraps your model and returns a compiled model with much better performance. The feature is fully additive and optional, which keeps PyTorch 2 backward compatible. In most cases it can be enabled by adding a single line of code:

model = torch.compile(model)
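
A slightly more complete sketch with a placeholder model; the optional mode argument trades longer compile time for potentially faster kernels:

# Sketch: compiling a placeholder model before running it.
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
model = torch.compile(model)   # default mode; also accepts mode="reduce-overhead" or mode="max-autotune"

x = torch.randn(64, 128, device="cuda")
y = model(x)   # the first call triggers compilation, later calls reuse the compiled graph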

Multi-GPU Training

To scale across multiple GPUs and fully utilize the capacity of all GPUs involved, PyTorch introduced the DistributedDataParallel module. With DistributedDataParallel, each GPU is handled by a dedicated subprocess, which results in significantly better performance than with the older DataParallel module, where all GPUs were handled in a single process.
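
A minimal DistributedDataParallel sketch with a placeholder model; the script is typically launched with torchrun, for example torchrun --nproc_per_node=4 train.py:

# Minimal DistributedDataParallel sketch; launch with e.g.:
#   torchrun --nproc_per_node=4 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")            # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])         # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 10).cuda(local_rank)  # placeholder model
model = DDP(model, device_ids=[local_rank])        # gradients are synchronized across processes

# ... regular training loop; each process works on its own shard of the data ...

dist.destroy_process_group()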

Testing environment

The benchmarks were conducted using the AIME benchmark tool, which can be downloaded from GitHub (pytorch-benchmark).

The following PyTorch, Python and CUDA versions were used for the NVIDIA GPUs:

  • PyTorch 2.0.0 with CUDA 11.8.89 and Python 3.8.10
  • PyTorch 2.5.1 with CUDA 12.1.105 and Python 3.10.12

The AMD Instinct GPU was tested with:

  • PyTorch 2.0.0 with ROCm 5.4.2 and Python 3.8.10
  • PyTorch 2.5.1 with ROCm 6.2 and Python 3.10.12

Depending on the type of GPU, two different AIME systems were used, as some GPUs are suitable for workstations, while others, especially those with passive cooling, should only be used in specialized rack servers that provide the necessary air flow.

The following GPUs were tested using the AIME G500 Workstation:

  • NVIDIA RTX 6000 Ada
  • NVIDIA RTX 5000 Ada
  • NVIDIA RTX 4500 Ada
  • NVIDIA GeForce RTX 4090
  • NVIDIA GeForce RTX 4060 Ti
  • NVIDIA GeForce RTX 3090
  • NVIDIA GeForce RTX 2080 Ti
  • NVIDIA RTX A6000
  • NVIDIA RTX A5000

With the AIME A4004 Rack Server these GPUs have been tested:

  • NVIDIA H200 NVL 141GB
  • NVIDIA H100 NVL 94GB
  • NVIDIA H100 80GB
  • NVIDIA A100 80GB
  • NVIDIA A100 40GB
  • AMD Instinct MI100
  • NVIDIA Tesla V100

Benchmark results

The following bar charts show, for each GPU, the training performance of the respective model with several PyTorch versions, with and without compiling the model.

BERT Training with fp32

[Bar chart: BERT large cased training performance, datatype fp32]

BERT Training, automatic mixed precision

[Bar chart: BERT large cased training performance, automatic mixed precision]

ResNet-50 Training with fp32

[Bar chart: ResNet-50 training performance, datatype fp32]

ResNet-50 Training, automatic mixed precision

[Bar chart: ResNet-50 training performance, automatic mixed precision]

It is clearly visible that the training performance improves significantly for both models on all GPUs when using a newer version of PyTorch. The GPUs of the Ada and Hopper generations in particular benefit from PyTorch 2.5.1 and compile mode. In some cases, for instance on the NVIDIA H200 with the BERT model, an improvement of nearly a factor of 4 can be obtained by using compile mode.

PyTorch Multi-GPU Training

Next, we examine the PyTorch performance for multi-GPU training with the NVIDIA A100 80GB, NVIDIA A100 40GB, NVIDIA RTX 6000 Ada, NVIDIA RTX 4090 and NVIDIA RTX A5000, using BERT base cased and ResNet-50 with AMP as examples. The PyTorch DistributedDataParallel module was used to implement the multi-GPU training and achieve the best performance.

On ResNet-50, the performance scaling factor is between 0.9 and 0.95, meaning that each additional GPU adds between 90% and 95% of the single-GPU performance. A similar behaviour is observed for BERT base cased with automatic mixed precision.
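
As a quick illustration of what such a scaling factor means in practice, the expected speedup for n GPUs can be estimated as 1 + (n - 1) * s:

# Rough estimate of multi-GPU speedup from a per-GPU scaling factor s.
def expected_speedup(num_gpus: int, scale_factor: float) -> float:
    return 1 + (num_gpus - 1) * scale_factor

print(expected_speedup(4, 0.90))   # 3.7  -> four GPUs deliver about 3.7x single-GPU performance
print(expected_speedup(4, 0.95))   # 3.85 -> four GPUs deliver about 3.85x single-GPU performance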

Conclusions

Thanks to PyTorch's compile mode, the training performance can be increased significantly by utilizing the GPU hardware more effectively.

Compared to float32, PyTorch 2 AMP nearly doubles the achievable performance on almost all GPU models.

💡
Use PyTorch 2 torch.compile and automatic mixed precision to get the best possible GPU performance out of PyTorch

With a scaling factor of 90-95 percent, multi-GPU training is a straightforward way to scale performance further.

💡
Multi-GPU training scales performance almost linearly

Updating your application to the latest PyTorch version is worth it! You get a free performance increase for all current GPU models without much effort.

💡
Keep your PyTorch version up to date to benefit from the latest performance improvements
