Overview of the benchmarked GPUs
Although we only tested a small selection of all available GPUs, we think we covered the GPUs that are currently best suited for deep learning training and development due to their compute and memory capabilities and their compatibility with current deep learning frameworks.
The GTX 1080 TI, NVIDIA's classic GPU for deep learning, was released in 2017. With 11 GB of GDDR5X memory and 3584 CUDA cores, it was designed for compute workloads. It has been out of production for a while and was added only as a reference point.
The RTX 2080 TI was released Q4 2018. It comes with 4352 CUDA cores and 544 NVIDIA Turing mixed-precision Tensor Cores delivering 107 Tensor TFLOPS of AI performance, plus 11 GB of ultra-fast GDDR6 memory. Production was discontinued in September 2020, and the card is now hard to obtain.
The Titan RTX is powered by the largest version of the Turing™ architecture. It delivers 130 Tensor TFLOPs of performance through its 576 Tensor Cores and comes with 24 GB of ultra-fast GDDR6 memory.
Quadro RTX 6000
The Quadro RTX 6000 is the server edition of the popular Titan RTX, with improved multi-GPU blower ventilation, additional virtualization capabilities and ECC memory. It is powered by the same Turing™ core as the Titan RTX, with 576 Tensor Cores delivering 130 Tensor TFLOPs of performance, and 24 GB of ultra-fast GDDR6 ECC memory.
Quadro RTX 8000
The Quadro RTX 8000 is the big brother of the RTX 6000: it has the same GPU processor but double the GPU memory, 48 GB of GDDR6 ECC. It is currently the GPU with the largest available GPU memory, which makes it best suited for the most memory-demanding tasks.
The RTX 3080 is one of the first GPU models powered by the NVIDIA Ampere™ architecture, featuring enhanced RT and Tensor Cores and new streaming multiprocessors. It is equipped with 10 GB of ultra-fast GDDR6X memory and 8704 CUDA cores.
The GeForce RTX™ 3090 is the TITAN-class card of NVIDIA's Ampere™ GPU generation. It is powered by 10496 CUDA cores, 328 third-generation Tensor Cores, and new streaming multiprocessors. Like the Titan RTX, it features 24 GB of memory, in this case GDDR6X.
NVIDIA RTX A6000
The NVIDIA RTX A6000 is the Ampere-based refresh of the Quadro RTX 6000. It features the same GPU processor (GA102) as the RTX 3090 but with all processor cores enabled, which yields 10752 CUDA cores and 336 third-generation Tensor Cores. On top of that, it has twice the GPU memory of an RTX 3090: 48 GB of GDDR6 ECC.
With 640 Tensor Cores, the Tesla V100 was the world’s first GPU to break the 100 teraFLOPS (TFLOPS) barrier of deep learning performance. It comes with 16 GB of high-bandwidth HBM2 memory and is based on the Volta GPU processor, which is only available in NVIDIA's professional GPU series.
The NVIDIA A100 is the flagship of NVIDIA's Ampere processor generation. It comes with 6912 CUDA cores, 432 third-generation Tensor Cores and 40 GB of high-bandwidth HBM2 memory. A single A100 breaks the peta-TOPS performance barrier.
Getting the best performance out of TensorFlow
We took several measures to get the most performance out of TensorFlow for benchmarking.
One of the most important settings for optimizing the workload for each type of GPU is the batch size. The batch size specifies how many propagations of the network are done in parallel. The results of each propagation are averaged over the batch, and the result is then applied to adjust the weights of the network. The best batch size in terms of performance is directly related to the amount of GPU memory available.
A larger batch size increases the parallelism and improves the utilization of the GPU cores. But the batch size should not exceed the available GPU memory: otherwise memory swapping mechanisms have to kick in and reduce performance, or the application simply crashes with an 'out of memory' exception.
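The relationship between GPU memory and batch size can be illustrated with a rough sketch. It assumes a very simple memory model that is not part of our benchmark: a fixed amount of memory for the model itself plus a constant amount per sample. The function name and all numbers below are hypothetical; real memory use depends on the network, the framework and the precision.

```python
def largest_batch_size(gpu_mem_gb, model_mem_gb, per_sample_mem_gb, max_batch=4096):
    """Return the largest power-of-two batch size whose estimated footprint fits.

    Assumption (hypothetical model): memory = model_mem_gb + batch * per_sample_mem_gb.
    """
    budget = gpu_mem_gb - model_mem_gb
    if budget < per_sample_mem_gb:
        raise ValueError("the model does not fit even with a single sample")
    batch = 1
    # Keep doubling while the next batch size still fits into the budget.
    while batch * 2 <= max_batch and batch * 2 * per_sample_mem_gb <= budget:
        batch *= 2
    return batch


# Hypothetical footprint: 1 GB for the model, 0.1 GB per sample.
print(largest_batch_size(11, 1.0, 0.1))  # 11 GB card
print(largest_batch_size(24, 1.0, 0.1))  # 24 GB card
```

In practice the usable batch size is usually found by trial: increase it until the framework reports an out-of-memory error, then step back.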
Up to a point, a large batch size has no negative effect on the training results. On the contrary, a large batch size can help to get more generalized results. An example is BigGAN, where batch sizes as high as 2,048 are suggested to deliver the best results. A further interesting read about the influence of the batch size on the training results was published by OpenAI.
A TensorFlow performance feature that was declared stable a while ago, but is still turned off by default, is XLA (Accelerated Linear Algebra). It optimizes the network graph by dynamically compiling parts of the network into kernels optimized for the specific device. This can bring performance benefits of 10% to 30% compared to the statically crafted TensorFlow kernels for the different layer types.
This feature can be turned on by a simple option or environment flag and has a direct effect on execution performance. For how to enable XLA in your projects, read here.
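As a minimal sketch of the environment flag route: TensorFlow reads the `TF_XLA_FLAGS` environment variable, and `--tf_xla_auto_jit=2` requests automatic XLA clustering. The helper name `enable_xla_auto_jit` is our own, and whether this is the exact setting used for the benchmark runs is not stated here, so treat it as an illustration only.

```python
import os


def enable_xla_auto_jit(level=2):
    """Request XLA auto-clustering through the TF_XLA_FLAGS environment variable.

    Must be called before TensorFlow is imported, because the flag is
    read when the library initializes.
    """
    flag = f"--tf_xla_auto_jit={level}"
    existing = os.environ.get("TF_XLA_FLAGS", "")
    if flag not in existing:
        # Append to any flags that are already set instead of overwriting them.
        os.environ["TF_XLA_FLAGS"] = (existing + " " + flag).strip()
    return os.environ["TF_XLA_FLAGS"]


enable_xla_auto_jit()
# import tensorflow as tf   # import only after the flag has been set
# In-code alternative in recent TensorFlow versions:
# tf.config.optimizer.set_jit(True)
```

The same effect can be had without touching the code by exporting the variable in the shell before starting the training script.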
Float 16bit / Mixed Precision Learning
For inference jobs, a lower floating point precision, and even 8 or 4 bit integer resolution, is acceptable and is used to improve performance. For most training situations, float 16 bit precision can also be applied with negligible loss in training accuracy and can speed up training jobs dramatically. Applying float 16 bit precision is not trivial, as the model has to be adjusted to use it. Because not all calculation steps should be done at a lower bit precision, the mixing of different bit resolutions for calculation is referred to as "mixed precision".
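Why some steps should stay at higher precision can be shown with a small, framework-independent sketch. It rounds values to IEEE 754 half precision using Python's `struct` module and applies many small weight updates. The update size and step count are hypothetical, and Python's 64 bit float stands in for the higher-precision copy of the weights.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def apply_updates(weight, update, steps, half_precision):
    """Add `update` to `weight` `steps` times, optionally storing it as float16."""
    for _ in range(steps):
        weight = weight + update
        if half_precision:
            weight = to_float16(weight)  # weight is stored in 16 bit
    return weight


# Near 1.0 the spacing between float16 values is about 0.001,
# so an update of 0.0001 is rounded away on every single step.
print(apply_updates(1.0, 1e-4, 1000, half_precision=True))
print(apply_updates(1.0, 1e-4, 1000, half_precision=False))
```

With the weight stored in float16 the updates are lost entirely, while the higher-precision weight moves as expected. This is one reason mixed precision setups typically keep the weights and their updates at higher precision and use 16 bit only for the bulk of the computation.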
The full potential of mixed precision learning will be better explored with TensorFlow 2.x and will probably be the development trend for improving deep learning framework performance.
We provide benchmarks for both float 32bit and 16bit precision as a reference to demonstrate the potential.
The Deep Learning Benchmark
The visual recognition model ResNet50 in version 1.0 is used for our benchmark. As a classic deep learning network with a complex 50-layer architecture of convolutional and residual layers, it is still a good network for comparing achievable deep learning performance. Because it is used in many benchmarks, a close to optimal implementation is available, which drives the GPU to maximum performance and shows where the performance limits of the devices are.
The Testing Environment
We used our AIME A4000 server for testing. It is an environment built for running multiple high-performance GPUs: it provides optimal cooling and can run each GPU in a PCIe 4.0 x16 slot directly connected to the CPU.
The NVIDIA Ampere generation benefits from the PCIe 4.0 capability, which doubles the data transfer rate to 31.5 GB/s to the CPU and between the GPUs.
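The 31.5 GB/s figure follows from the link parameters: PCIe 4.0 runs at 16 GT/s per lane with 128b/130b encoding, and an x16 slot has 16 lanes. The short sketch below reproduces that arithmetic for one direction of the link; the function name is our own.

```python
def pcie_x16_bandwidth_gb_s(gigatransfers_per_sec, lanes=16):
    """Usable one-directional bandwidth in GB/s for a PCIe 3.0 or 4.0 link."""
    encoding_efficiency = 128 / 130  # 128b/130b line encoding
    bits_per_sec = gigatransfers_per_sec * 1e9 * lanes * encoding_efficiency
    return bits_per_sec / 8 / 1e9    # bits -> bytes -> GB


print(round(pcie_x16_bandwidth_gb_s(16), 2))  # PCIe 4.0 x16
print(round(pcie_x16_bandwidth_gb_s(8), 2))   # PCIe 3.0 x16
```

PCIe 3.0 uses the same encoding at 8 GT/s per lane, which is where the factor of two comes from.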
The connectivity has a measurable influence on deep learning performance, especially in multi-GPU configurations.
The AIME A4000 also provides sophisticated cooling, which is necessary to achieve and hold maximum performance.
The technical specs to reproduce our benchmarks:
- AIME A4000, Epyc 7402 (24 cores), 128 GB ECC RAM
- Ubuntu 20.04
- NVIDIA driver version 455.45
- CUDA 11.1.74
- cuDNN 8.0.5
- TensorFlow 1.15.4
The Python scripts used for the benchmark are available on GitHub at: Tensorflow 1.x Benchmark
Single GPU Performance
The result of our measurements is the average number of images per second that could be trained while running for 100 batches at the specified batch size.
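The metric can be sketched as follows. This is not the benchmark script itself, only an illustration of how an images-per-second average over a fixed number of batches is computed. `train_step` is a hypothetical callable that trains on one batch, and the sketch does not exclude warm-up batches.

```python
import time


def images_per_second(train_step, batch_size, num_batches=100):
    """Average training throughput over `num_batches` calls of `train_step`."""
    start = time.perf_counter()
    for _ in range(num_batches):
        train_step()  # hypothetical: runs one forward/backward pass on one batch
    elapsed = time.perf_counter() - start
    return num_batches * batch_size / elapsed


# Stand-in for a real training step: sleep for one millisecond.
throughput = images_per_second(lambda: time.sleep(0.001), batch_size=64, num_batches=10)
print(f"{throughput:.0f} images/s")
```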
The NVIDIA Ampere generation is clearly leading the field, with the A100 outclassing all other models.
When training with float 16 bit precision, the compute accelerators A100 and V100 increase their lead. The RTX 3090 can also more than double its performance compared to float 32 bit calculations.
The GPU speed-up compared to a CPU rises here to 167x the speed of a 32 core CPU, making GPU computing not only feasible but mandatory for high performance deep learning tasks.
Multi GPU Deep Learning Training Performance
The next level of deep learning performance is to distribute the work and training loads across multiple GPUs. The AIME A4000 supports up to 4 GPUs of any type.
Deep learning scales well across multiple GPUs. The method of choice for multi-GPU scaling in at least 90% of the cases is to spread the batch across the GPUs. The effective batch size is therefore the sum of the batch sizes of all GPUs in use.
Each GPU calculates the backpropagation for the inputs of its own batch slice. The results of all GPUs are then exchanged and averaged, the weights of the model are adjusted accordingly, and the updated weights are distributed back to all GPUs.
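The scheme described above can be sketched in plain Python with a toy model. Everything here is an illustrative assumption: the model is a single weight `w` fitted with squared error, the "GPUs" are simulated by a loop, and the function names are our own. A real setup would use the framework's distribution mechanism and an all-reduce between devices.

```python
def split_batch(batch, num_gpus):
    """Split a global batch into equally sized slices, one per GPU."""
    if len(batch) % num_gpus != 0:
        raise ValueError("batch size must be divisible by the number of GPUs")
    size = len(batch) // num_gpus
    return [batch[i * size:(i + 1) * size] for i in range(num_gpus)]


def local_gradient(w, batch_slice):
    """Mean gradient of the squared error (w * x - y)**2 over one slice."""
    return sum(2 * (w * x - y) * x for x, y in batch_slice) / len(batch_slice)


def data_parallel_step(w, batch, num_gpus, learning_rate):
    """One synchronous training step with the batch spread across the GPUs."""
    slices = split_batch(batch, num_gpus)
    grads = [local_gradient(w, s) for s in slices]  # computed in parallel on real hardware
    averaged = sum(grads) / num_gpus                # exchange and average the results
    return w - learning_rate * averaged             # identical update on every GPU


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # toy data for y = 2x
print(data_parallel_step(0.0, batch, num_gpus=4, learning_rate=0.01))
print(data_parallel_step(0.0, batch, num_gpus=1, learning_rate=0.01))
```

With equally sized slices, averaging the per-GPU gradients gives the same update as processing the whole batch on one device, which is why the effective batch size is the per-GPU batch size times the number of GPUs.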
Concerning the data exchange, communication peaks when the results of a batch are collected and the weights are adjusted before the next batch can start. While the GPUs are working on a batch, little or no communication happens between them.
In this standard solution for multi-GPU scaling, one has to make sure that all GPUs run at the same speed; otherwise the slowest GPU becomes the bottleneck that all other GPUs have to wait for. Mixing different GPU types is therefore not useful.
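The bottleneck effect can be made concrete with a small model. It assumes that every GPU processes the same per-GPU batch size, that a step finishes only when the slowest GPU is done, and that communication time is ignored. The throughput numbers are hypothetical and not taken from our measurements.

```python
def synchronous_throughput(per_gpu_images_per_sec, per_gpu_batch_size):
    """Combined images/s when all GPUs wait for the slowest one after each batch."""
    step_times = [per_gpu_batch_size / ips for ips in per_gpu_images_per_sec]
    step_time = max(step_times)  # the slowest GPU determines the step time
    images_per_step = len(per_gpu_images_per_sec) * per_gpu_batch_size
    return images_per_step / step_time


# Hypothetical: four equal GPUs versus three fast GPUs and one at half speed.
print(synchronous_throughput([600, 600, 600, 600], per_gpu_batch_size=64))
print(synchronous_throughput([600, 600, 600, 300], per_gpu_batch_size=64))
```

In this model the mixed setup delivers only four times the throughput of its slowest card, so the three faster cards idle for half of every step.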
Training Performance put into Perspective
To get a better picture of how the measurement of images per second translates into turnaround and waiting times when training such networks, we look at a real use case of training such a network with a large dataset.
For example, the ImageNet 2017 dataset consists of 1,431,167 images. Processing each image of the dataset once, a so-called epoch of training, on ResNet50 would take about:
| Configuration | float 32 training | float 16 training |
| CPU (32 cores) | 27 hours | 27 hours |
| Single RTX 2080 TI | 69 minutes | 29 minutes |
| Single RTX 3080 | 53 minutes | 22 minutes |
| Single RTX 3090 | 41 minutes | 18 minutes |
| Single RTX A6000 | 41 minutes | 16 minutes |
| Single A100 | 23 minutes | 8.5 minutes |
| 4 x RTX 2080 TI | 19 minutes | 8 minutes |
| 4 x Tesla V100 | 15 minutes | 4.5 minutes |
| 4 x RTX 3090 | 11.5 minutes | 5 minutes |
| 4 x A100 | 6.5 minutes | 3 minutes |
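The epoch times above are the dataset size divided by the measured throughput. The sketch below shows that conversion in both directions. The function names are our own; the 69 minutes and the factor of 50 epochs are only used as sample inputs.

```python
IMAGENET_IMAGES = 1_431_167  # dataset size stated above


def epoch_minutes(images_per_sec, num_images=IMAGENET_IMAGES):
    """Minutes needed to process every image of the dataset once."""
    return num_images / images_per_sec / 60


def implied_images_per_sec(minutes_per_epoch, num_images=IMAGENET_IMAGES):
    """Throughput that corresponds to a given epoch time."""
    return num_images / (minutes_per_epoch * 60)


def training_hours(minutes_per_epoch, epochs):
    """Total training time in hours for a number of epochs."""
    return minutes_per_epoch * epochs / 60


# An epoch time of 69 minutes corresponds to roughly this many images per second:
print(round(implied_images_per_sec(69)))
# And 50 epochs at 69 minutes each take this many hours:
print(training_hours(69, 50))
```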
Usually at least 50 training epochs are required, so one could have a result to evaluate after:
| Configuration | float 32 training | float 16 training |
| CPU (32 cores) | 55 days | 55 days |
| Single RTX 2080 TI | 57 hours | 24 hours |
| Single RTX 3080 | 44 hours | 18 hours |
| Single RTX 3090 | 34 hours | 14.5 hours |
| Single RTX A6000 | 34 hours | 14.5 hours |
| Single A100 | 19 hours | 8 hours |
| 4 x RTX 2080 TI | 16 hours | 6.5 hours |
| 4 x Tesla V100 | 12 hours | 4 hours |
| 4 x RTX 3090 | 9.5 hours | 4 hours |
| 4 x A100 | 5.5 hours | 2.5 hours |
This shows that the correct setup can change the duration of a training task from weeks to a single day or even just hours. In most cases, a training time that allows running the training overnight and having the results the next morning is probably desired.
Mixed precision can speed up the training by more than a factor of 2
A feature definitely worth a look in terms of performance is switching training from float 32 precision to mixed precision. Adjusting the software in this way, depending on your constraints, can be a very efficient move to double the performance.
Multi GPU scaling is more than feasible
Deep learning performance scales well with multiple GPUs, at least up to 4 GPUs: 2 GPUs can often outperform the next more powerful GPU in terms of price and performance.
This is true, for example, when comparing 2 x RTX 3090 to an NVIDIA A100.
Best GPU for Deep Learning?
As in most cases, there is no simple answer to this question. Performance is certainly the most important aspect of a GPU used for deep learning tasks, but not the only one.
So it depends heavily on what your requirements are. Here are our assessments of the most promising deep learning GPUs:
The RTX 3080 delivers the most bang for the buck. If you are looking for a price-conscious solution, a 4 GPU setup can play in the high-end league at acquisition costs below those of a single top-end GPU.
But be aware of the step back in available GPU memory: the RTX 3080 has 1 GB less memory than the long-standing 11 GB configuration of the GTX 1080 TI and RTX 2080 TI. This will probably make it necessary to reduce the default batch size of many applications.
Maybe there will be an RTX 3080 TI that fixes this bottleneck?
The RTX 3090 is currently the real step up from the RTX 2080 TI. With its 24 GB of memory and a clear performance increase over the RTX 2080 TI, it sets the standard for this generation of deep learning GPUs.
A dual RTX 3090 setup can outperform a 4 x RTX 2080 TI setup in deep learning turnaround times, with lower power demand and a lower price tag.
If the highest performance regardless of price and the highest performance density are needed, the NVIDIA A100 is the first choice: it delivers the most compute performance in all categories.
The A100 is a big performance improvement over the Tesla V100, which makes its price/performance ratio much more attractive.
Its lower power consumption of 250 Watt, compared to the 700 Watt of a dual RTX 3090 setup with comparable performance, also reaches a range where, under sustained full load, the difference in energy costs might become a factor to consider.
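To get a feeling for the size of that difference, the power figures can be converted into energy per year. The sketch assumes continuous full load around the clock, which is an upper bound, and leaves the electricity price as a parameter because it varies widely; the example price is a hypothetical value.

```python
def annual_energy_kwh(power_watts, hours_per_day=24.0, days=365):
    """Energy drawn per year in kWh at a constant power level."""
    return power_watts / 1000 * hours_per_day * days


def annual_cost_difference(power_a_watts, power_b_watts, price_per_kwh):
    """Yearly energy cost difference between two setups under the same load profile."""
    delta_kwh = annual_energy_kwh(power_a_watts) - annual_energy_kwh(power_b_watts)
    return delta_kwh * price_per_kwh


# 700 Watt (dual RTX 3090) versus 250 Watt (A100), both running 24/7.
print(round(annual_energy_kwh(700) - annual_energy_kwh(250)))  # kWh per year
print(round(annual_cost_difference(700, 250, price_per_kwh=0.30), 2))  # hypothetical price
```

The figures cover the GPUs only; the rest of the system and the cooling add to the total in both setups.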
Moreover, for solutions that need virtualization to run under a hypervisor, for example for cloud renting services, it is currently the best choice for high-end deep learning training tasks.
A quad NVIDIA A100 setup, as is possible with the AIME A4000, catapults one into the petaFLOPS HPC computing area.
Questions or remarks? Please contact us at: firstname.lastname@example.org