Deep Learning GPU Benchmarks

An overview of current high-end GPUs and compute accelerators best suited for deep learning and machine learning tasks. Included are the latest offerings from NVIDIA: the Hopper and Ada Lovelace GPU generations. The performance of multi-GPU setups is also evaluated.

Overview of the benchmarked GPUs

Although we only tested a small selection of all the available GPUs, we think we covered the GPUs that are currently best suited for deep learning training, model fine-tuning and inference tasks, owing to their compute and memory capabilities and their compatibility with the current deep learning frameworks, namely PyTorch and TensorFlow.

For reference, the iconic deep learning GPUs Geforce GTX 1080 Ti, RTX 2080 Ti, RTX 3090 and Tesla V100 are also included to visualize the increase in compute performance over recent years.

GTX 1080 Ti

Suitable for: Workstations

Launch Date: 2017.03

Architecture: Pascal

VRAM Memory (GB): 11 (GDDR5X)

Cuda Cores: 3584

Tensor Cores: -

Power Consumption (Watt): 250

Memory Bandwidth (GB/s): 484

Geforce RTX 2080 Ti

Suitable for: Workstations

Launch Date: 2018.09

Architecture: Turing

VRAM Memory (GB): 11 (GDDR6)

Cuda Cores: 4352

Tensor Cores: 544

Power Consumption (Watt): 260

Memory Bandwidth (GB/s): 616

Quadro RTX 5000

Suitable for: Workstations/Servers

Launch Date: 2018.08

Architecture: Turing

VRAM Memory (GB): 16 (GDDR6)

Cuda Cores: 3072

Tensor Cores: 384

Power Consumption (Watt): 230

Memory Bandwidth (GB/s): 448

Geforce RTX 3090

Suitable for: Workstations

Launch Date: 2020.09

Architecture: Ampere

VRAM Memory (GB): 24 (GDDR6X)

Cuda Cores: 10496

Tensor Cores: 328

Power Consumption (Watt): 350

Memory Bandwidth (GB/s): 936

RTX A5000

Suitable for: Workstations/Servers 

Launch Date: 2021.04 

Architecture: Ampere 

VRAM Memory (GB): 24 (GDDR6) 

Cuda Cores: 8192 

Tensor Cores: 256 

Power Consumption (Watt): 230 

Memory Bandwidth (GB/s): 768

RTX A5500

Suitable for: Workstations/Servers 

Launch Date: 2022.03 

Architecture: Ampere 

VRAM Memory (GB): 24 (GDDR6) 

Cuda Cores: 10240 

Tensor Cores: 320

Power Consumption (Watt): 230 

Memory Bandwidth (GB/s): 768

RTX 5000 Ada

Suitable for: Servers

Launch Date: 2023.08

Architecture: Ada Lovelace

VRAM Memory (GB): 32 (GDDR6)

Cuda Cores: 12800

Tensor Cores: 400

Power Consumption (Watt): 250

Memory Bandwidth (GB/s): 576

RTX A6000

Suitable for: Workstations/Servers 

Launch Date: 2020.10 

Architecture: Ampere 

VRAM Memory (GB): 48 (GDDR6) 

Cuda Cores: 10752 

Tensor Cores: 336 

Power Consumption (Watt): 300 

Memory Bandwidth (GB/s): 768

AMD Instinct MI100

Suitable for: Servers

Launch Date: 2020.11

Architecture: CDNA (1)

VRAM Memory (GB): 32 (HBM2)

Stream Processors: 7680

Power Consumption (Watt): 300

Memory Bandwidth (GB/s): 1200

Geforce RTX 4090

Suitable for: Workstations 

Launch Date: 2022.10 

Architecture: Ada Lovelace 

VRAM Memory (GB): 24 (GDDR6X) 

Cuda Cores: 16384 

Tensor Cores: 512 

Power Consumption (Watt): 450 

Memory Bandwidth (GB/s): 1008

RTX 6000 Ada

Suitable for: Workstations/Servers 

Launch Date: 2022.09 

Architecture: Ada Lovelace 

VRAM Memory (GB): 48 (GDDR6) 

Cuda Cores: 18176 

Tensor Cores: 568 

Power Consumption (Watt): 300 

Memory Bandwidth (GB/s): 960

NVIDIA L40S

Suitable for: Servers

Launch Date: 2023.08

Architecture: Ada Lovelace

VRAM Memory (GB): 48 (GDDR6)

Cuda Cores: 18176

Tensor Cores: 568

Power Consumption (Watt): 350

Memory Bandwidth (GB/s): 864

Tesla V100

Suitable for: Servers 

Launch Date: 2017.05 

Architecture: Volta 

VRAM Memory (GB): 16 (HBM2) 

Cuda Cores: 5120 

Tensor Cores: 640 

Power Consumption (Watt): 250 

Memory Bandwidth (GB/s): 900

A100

Suitable for: Servers 

Launch Date: 2020.05 

Architecture: Ampere 

VRAM Memory (GB): 40/80 (HBM2) 

Cuda Cores: 6912 

Tensor Cores: 432 

Power Consumption (Watt): 300 

Memory Bandwidth (GB/s): 1935 (80 GB PCIe)

H100

Suitable for: Servers 

Launch Date: 2022.10 

Architecture: Hopper 

VRAM Memory (GB): 80 (HBM2e) 

Cuda Cores: 14592 

Tensor Cores: 456 

Power Consumption (Watt): 350 

Memory Bandwidth (GB/s): 2000

The Deep Learning Benchmark

The visual recognition ResNet50 model (version 1.5) is used for our benchmark. As the classic deep learning network, with its 50-layer architecture of convolutional and residual blocks, it is still a good network for comparing achievable deep learning performance. Because it is used in many benchmarks, a close-to-optimal implementation is available that drives the GPU to maximum performance and shows where the performance limits of the devices are.
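
For orientation, such a model can be instantiated directly from Keras. Note that tf.keras.applications ships the original ResNet50 (v1), not the v1.5 variant, so the following is only a structural sketch and not the benchmark code itself (see the scripts linked below):

```python
import tensorflow as tf

# Structural sketch only: Keras provides ResNet50 v1, while the benchmark
# uses the v1.5 variant (stride-2 moved into the 3x3 convolutions).
model = tf.keras.applications.ResNet50(
    weights=None,              # train from scratch, as in a training benchmark
    input_shape=(224, 224, 3),
    classes=1000,
)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
    loss="sparse_categorical_crossentropy",
)
```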

The comparison of the GPUs has been made using synthetic random image data to minimize the influence of external elements such as the type of dataset storage (SSD or HDD), the data loader and the data format.
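
Such a synthetic input pipeline can be sketched with tf.data as follows; the image shape and batch size are illustrative assumptions, not necessarily the values used in our runs:

```python
import tensorflow as tf

def synthetic_dataset(batch_size, image_size=224, num_classes=1000):
    """Random images and labels generated in memory, so storage, data
    loading and data format do not influence the measurement."""
    images = tf.random.uniform((batch_size, image_size, image_size, 3))
    labels = tf.random.uniform((batch_size,), maxval=num_classes, dtype=tf.int32)
    return tf.data.Dataset.from_tensors((images, labels)).repeat().prefetch(2)

dataset = synthetic_dataset(batch_size=64)  # batch size chosen per GPU memory
```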

Regarding the setup used, two important points have to be noted. The first is the XLA feature, a TensorFlow performance feature that was declared stable a while ago but is still turned off by default. XLA (Accelerated Linear Algebra) optimizes the network graph by just-in-time compiling parts of the network into kernels tuned for the specific device. Compared to the statically crafted TensorFlow kernels for the different layer types, this can yield performance benefits of 10% to 30%. The feature can be turned on with a simple option or environment flag and maximizes execution performance.
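
In TensorFlow 2, XLA can be switched on with one call or one environment variable. A minimal sketch (the benchmark scripts linked below use their own training loop):

```python
import tensorflow as tf

# Enable XLA auto-clustering for the whole process ...
tf.config.optimizer.set_jit(True)

# ... or JIT-compile individual functions explicitly:
@tf.function(jit_compile=True)
def train_step(model, optimizer, images, labels):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                labels, model(images, training=True)))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Equivalent environment flag, set before starting the process:
#   export TF_XLA_FLAGS=--tf_xla_auto_jit=2
```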

The second is the use of mixed precision. For inference jobs, a lower floating point precision is the standard way to improve performance. For most training situations, 16-bit float precision can also be applied with negligible loss in training accuracy and can speed up training jobs dramatically. Applying 16-bit float precision is not entirely trivial, as the model layers have to be adjusted to use it. Since not all calculation steps should be performed at the lower bit precision, the mixing of different bit resolutions for the calculation is referred to as mixed precision.
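
With Keras, the required adjustments largely come down to setting a global policy; a minimal sketch, assuming the Keras mixed precision API:

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Run most layer computations in float16 while keeping the variables
# (weights) in float32; as not every step should run at lower precision,
# the model's output layer is typically kept at float32 as well.
mixed_precision.set_global_policy("mixed_float16")

# For custom training loops the optimizer should be wrapped so gradients
# are loss-scaled, which avoids float16 underflow; Keras model.fit()
# applies loss scaling automatically.
optimizer = mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
)
```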

The Python scripts used for the benchmark are available on Github here.

The Testing Environment

As AIME offers server and workstation solutions for deep learning tasks, we used our AIME A4000 server and our AIME G400 Workstation for the benchmark.

The AIME A4000 server and AIME G400 workstation are elaborate environments for running multiple high-performance GPUs: they provide the sophisticated power delivery and cooling necessary to achieve and sustain maximum performance, and the ability to run each GPU in a PCIe 4.0 x16 slot connected directly to the CPU.

The technical specs to reproduce our benchmarks are:

For server compatible GPUs: AIME A4000, AMD EPYC 7543 (32 cores), 128 GB ECC RAM

For GPUs only available for workstations: G400, AMD Threadripper Pro 5955WX (16 cores), 128 GB ECC RAM

Using the AIME Machine Learning Container (MLC) management framework with the following setup:

  • Ubuntu 20.04
  • NVIDIA driver version 520.61.5
  • CUDA 11.2
  • CUDNN 8.2.0
  • Tensorflow 2.9.0 (official build)

The NVIDIA H100, RTX 6000 Ada, L40S, RTX 4090, RTX 5000 Ada, RTX A6000, RTX A5500 and RTX A5000 were tested with the following setup:

  • CUDA 11.8
  • CUDNN 8.6.0
  • Tensorflow 2.13.1 (official build)

The AMD GPU in the benchmark, the AMD Instinct MI100, was tested with:

  • ROCM 5.4
  • MIOpen 2.19.0
  • Tensorflow 2.11.0 (AMD build)

Single GPU Performance

The result of our measurements is the average number of images per second that could be trained while running for 50 steps at the specified batch size. The average of three runs was taken, and the starting temperature of all GPUs was below 50° Celsius.
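
The throughput figure itself is straightforward to derive. A hedged sketch of the measurement (function and variable names are illustrative, not those of the benchmark scripts):

```python
import time

def measure_images_per_second(model, dataset, batch_size, steps=50):
    """Run `steps` training steps and return the average images per second."""
    iterator = iter(dataset)
    images, labels = next(iterator)
    model.train_on_batch(images, labels)   # warm-up step, not timed
    start = time.time()
    for _ in range(steps):
        images, labels = next(iterator)
        model.train_on_batch(images, labels)
    return steps * batch_size / (time.time() - start)
```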

The GPU speed-up compared to a 32-core CPU reaches several orders of magnitude, making GPU computing not only feasible but mandatory for high-performance deep learning tasks.

Next are the results using mixed precision.

One can see that using the mixed precision option can increase performance by up to three times.

Multi GPU Deep Learning Training Performance

The next level of deep learning performance is to distribute the work and training loads across multiple GPUs. The AIME A4000 and the AIME G400 support up to four server capable GPUs.

Deep learning scales very well across multiple GPUs. The method of choice for multi-GPU scaling is to spread the batch across the GPUs, so the effective (global) batch size is the sum of the local batch sizes of each GPU in use. Each GPU calculates the backpropagation for its slice of the batch. The backpropagation results of all GPUs are then summed and averaged, the weights of the model are adjusted accordingly, and the updated weights have to be distributed back to all GPUs.
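
In TensorFlow, this data-parallel scheme corresponds to tf.distribute.MirroredStrategy, which mirrors the model on every GPU and all-reduces the gradients. A minimal sketch, assuming the model and the synthetic dataset helper sketched earlier:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()        # one replica per visible GPU
local_batch_size = 64                              # illustrative per-GPU batch
global_batch_size = local_batch_size * strategy.num_replicas_in_sync

with strategy.scope():
    # Variables created here are mirrored on all GPUs; after each step the
    # gradients are summed across replicas, averaged, and applied so every
    # copy of the weights stays identical.
    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
        loss="sparse_categorical_crossentropy",
    )

model.fit(synthetic_dataset(global_batch_size), steps_per_epoch=50, epochs=1)
```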

Concerning the data exchange, there is a peak of communication when the results of a batch are collected and the weights are adjusted before the next batch can be calculated. While the GPUs are busy calculating a batch, little or no communication takes place between them.

In this standard solution for multi-GPU scaling, one has to make sure that all GPUs run at the same speed, otherwise the slowest GPU becomes the bottleneck all other GPUs have to wait for. Therefore, mixing different GPU types is not useful.

The next two graphs show how well the RTX A5000 scales by using single and mixed precision.

A good, constant scale factor of around 0.93 is reached, meaning that each additional GPU adds around 93% of its theoretical linear performance. A similar scale factor is obtained when employing mixed precision.
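
The scale factor quoted here is simply the measured multi-GPU throughput divided by the ideal linear throughput. A quick illustration with made-up numbers (not our measurements):

```python
def scale_factor(throughput_multi, throughput_single, num_gpus):
    """Ratio of measured to ideal linear multi-GPU throughput."""
    return throughput_multi / (num_gpus * throughput_single)

# Example: if one GPU trains 1000 images/s and four GPUs train 3720 images/s,
# the scale factor is 3720 / (4 * 1000) = 0.93.
print(scale_factor(3720, 1000, 4))   # 0.93
```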

With the RTX 4090, the situation is different. While the single-GPU performance is decent, the multi-GPU performance of the RTX 4090 falls short of expectations. As shown in the diagram below, the scale factor of the second RTX 4090 is only 0.76 - not enough for a reasonable multi-GPU setup.

The reason for this is probably intentional market segmentation by NVIDIA, separating the Pro GPUs from the cheaper GeForce 'consumer' series, which is not meant to be used in multi-GPU setups.

Conclusions

Mixed precision can speed up training by more than a factor of 2

A feature definitely worth a look with regard to performance is switching training from float 32-bit precision to mixed precision. Getting a performance boost by adjusting the software to your constraints is probably the most efficient way to double the performance.

Multi GPU scaling is more than feasible

Deep learning performance scales well with multiple GPUs, at least up to 4 GPUs: 2 GPUs can often outperform the next more powerful GPU model in terms of price and performance.

Mixing different GPU types in multi-GPU setups is not useful

The slowest GPU becomes the bottleneck all other GPUs have to wait for.

The best GPU for Deep Learning?

As in most cases, there is no simple answer to that question. Performance is certainly the most important aspect of a GPU used for deep learning tasks, but it is not the only one.

So it highly depends on your requirements. Here are our assessments for the most promising deep learning GPUs:

RTX A5000

The RTX A5000 is a good entry card for deep learning training and inference tasks. It has very good energy efficiency, with performance similar to the legendary but more power-hungry graphics card flagship of the NVIDIA Ampere generation, the RTX 3090.

It is also not to be underestimated: as seen in the diagrams above, a 4x RTX A5000 setup can beat a single NVIDIA H100 80GB accelerator, which costs more than 10 times as much, in performance, possible batch size and acquisition costs!

RTX A6000

The bigger brother of the RTX A5000 and an efficient way to get 48 GB of GDDR6 GPU memory on a single card. Especially for inference jobs that need a larger memory configuration, the RTX A6000 is an interesting option.

RTX 4090

The first NVIDIA GPU of the Ada Lovelace generation. The single-GPU performance is outstanding due to its high power budget of 450 Watts, but the multi-GPU performance falls short of expectations, as shown above. Apart from inference jobs, where a single GPU can deliver its full performance and no communication between GPUs is necessary, a multi-GPU setup with more than two RTX 4090s does not seem to be an efficient way to scale.

RTX 6000 Ada / L40S

The pro versions of the RTX 4090, with double the GPU memory (48 GB) and very solid performance at moderate power requirements. The fastest scalable all-round GPUs with the best performance/price ratio, also capable of being used in large language model setups.

The RTX 6000 Ada is the fastest available card for Multi-GPU workstation setups.

At first glance, the L40S seems to be the server-only, passively cooled version of the RTX 6000 Ada, since the GPU processor specifications appear to be the same. But it has a slight disadvantage: about 10% lower memory bandwidth, which affects its achievable performance in this benchmark.

RTX 5000 Ada

The RTX 5000 Ada is one of the later additions to the NVIDIA Ada Lovelace series and is positioned as the successor of the RTX A5000. It is a good replacement but does not offer a significant performance upgrade over the RTX A5000. The biggest disadvantage of the RTX 5000 Ada is its lower memory bandwidth of 576 GB/s. On the plus side, it has 32 GB of GDDR6 memory, which can enable larger batch sizes in some use cases.

NVIDIA A100 40GB / 80GB

The first dedicated deep learning accelerator from NVIDIA to deliver the highest performance density. The 40 GB models are no longer produced and hardly available, but the 80 GB model is still going strong.

The 80 GB of HBM2 memory is the key asset for scaling to larger models.

With the lower power consumption of 250/300 Watt of the PCIe version, compared to the 400 Watt SXM version used in DGX/HGX servers at comparable performance under sustained load, the difference in energy and cooling costs can become a factor to consider.

An octa (8x) NVIDIA A100 setup, as possible with the AIME A8000, catapults one into the multi-petaFLOPS HPC computing range.

NVIDIA H100

If maximum performance regardless of price and the highest performance density are needed, the NVIDIA H100 is currently the first choice: it delivers high-end deep learning performance.

The biggest advantage over the NVIDIA A100, besides its roughly 40% higher base performance, is the capability for FP8 computing, which can double the compute throughput, especially in inference tasks.

It is also the first PCIe 5.0 capable accelerator, so the card is best suited for PCIe 5.0 ready servers like the AIME A4004 and AIME A8004, providing a comfortable 128 GB/s of data interchange between accelerators.

This article will be updated with new additions and corrections as soon as available.

Questions or remarks? Please contact us at: hello@aime.info
