The results can differ from older benchmarks, as the latest TensorFlow versions include new optimizations and reveal new trends for achieving the best training performance and turnaround times.
Overview of the benchmarked GPUs
Although we tested only a small selection of the available GPUs, we believe we covered the GPUs that were best suited for deep learning training and development in 2019, based on their compute and memory capabilities and their compatibility with current deep learning frameworks.
GTX 1080 TI
NVIDIA's classic GPU for deep learning. With 11 GB of GDDR5X memory and 3584 CUDA cores, it was designed for compute workloads. It is no longer produced, so we included it only as a reference point.
RTX 2080 TI
The RTX 2080 TI comes with 4352 CUDA cores and 544 NVIDIA Turing mixed-precision Tensor Cores, delivering 107 Tensor TFLOPS of AI performance, plus 11 GB of ultra-fast GDDR6 memory.
Titan RTX
Powered by the award-winning Turing™ architecture, the Titan RTX delivers 130 Tensor TFLOPS of performance with 576 Tensor Cores and 24 GB of ultra-fast GDDR6 memory.
Tesla V100
With 640 Tensor Cores, the Tesla V100 was the world's first GPU to break the 100 teraFLOPS (TFLOPS) barrier of deep learning performance. It comes with 16 GB of high-bandwidth HBM2 memory.
Getting the best performance out of TensorFlow
Several measures were taken to get the most performance out of TensorFlow for benchmarking.
One of the most important settings for optimizing the workload for each type of GPU is the batch size. The batch size specifies how many samples are propagated through the network in parallel; the results of the individual backpropagations are averaged over the batch, and the averaged result is applied to adjust the weights of the network. The best batch size in terms of performance is directly related to the amount of GPU memory available.
The basic rule is to increase the batch size so that the complete GPU memory is used.
A larger batch size increases parallelism and improves the utilization of the GPU cores. However, the batch size should not exceed the available GPU memory: otherwise memory swapping mechanisms have to kick in and reduce performance, or the application simply crashes with an 'out of memory' exception.
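The rule of filling the GPU memory can be sketched as a simple doubling search. This is only an illustration: the `fits` probe is a hypothetical callable (in practice it would run one training step and report whether an out-of-memory error occurred), and the memory figures in the simulated probe are made-up values, not measurements from our benchmark.

```python
def largest_fitting_batch_size(fits, start=1, limit=4096):
    """Return the largest power-of-two multiple of `start` for which fits() is True.

    `fits` is a caller-supplied probe (hypothetical): it should try one
    training step with the given batch size and return False on out-of-memory.
    Returns 0 if even the starting batch size does not fit.
    """
    if not fits(start):
        return 0
    batch_size = start
    # Keep doubling while the next candidate is within the limit and still fits.
    while batch_size * 2 <= limit and fits(batch_size * 2):
        batch_size *= 2
    return batch_size


# Made-up memory model for illustration only (not measured values).
GPU_MEMORY_MB = 11 * 1024   # an 11 GB card
MODEL_MB = 1500             # assumed fixed cost: weights and optimizer state
PER_SAMPLE_MB = 120         # assumed activation memory per sample


def simulated_fits(batch_size):
    """Stand-in for a real out-of-memory probe."""
    return MODEL_MB + batch_size * PER_SAMPLE_MB <= GPU_MEMORY_MB


best = largest_fitting_batch_size(simulated_fits)
print(best)  # 64 with the assumed numbers above
```

Restricting the search to powers of two keeps the number of probes small; a finer search between the last fitting and the first failing size could squeeze out the remaining memory.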
Up to a point, a large batch size has no negative effect on the training results; on the contrary, a large batch size can help to obtain more generalized results. An example is BigGAN, where batch sizes as high as 2,048 are suggested to deliver the best results. A further interesting read about the influence of the batch size on training results was published by OpenAI.
A TensorFlow performance feature that was recently declared stable is XLA (Accelerated Linear Algebra). It optimizes the network graph by dynamically compiling parts of the network into kernels optimized for the specific device. This can yield performance benefits of 10% to 30% compared to the statically crafted TensorFlow kernels for the different layer types.
This feature can be turned on with a simple option or environment flag and has a direct effect on execution performance. Read here how to enable XLA in your projects.
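As a minimal sketch of the environment-flag route, the helper below sets `TF_XLA_FLAGS` to `--tf_xla_auto_jit=2`, the documented TensorFlow flag for XLA auto-clustering. The helper name `enable_xla_auto_jit` is our own and not part of TensorFlow, and whether XLA actually speeds up a given model should be verified by measurement.

```python
import os


def enable_xla_auto_jit(env=os.environ):
    """Add --tf_xla_auto_jit=2 to TF_XLA_FLAGS without dropping existing flags.

    Helper name is hypothetical. To be safe, call it before TensorFlow is
    imported so the flag is in place when TensorFlow reads it.
    """
    flag = "--tf_xla_auto_jit=2"
    existing = env.get("TF_XLA_FLAGS", "")
    if flag not in existing.split():
        env["TF_XLA_FLAGS"] = (existing + " " + flag).strip()
    return env["TF_XLA_FLAGS"]


enable_xla_auto_jit()
# import tensorflow as tf   # import only after the flag has been set
# Alternative inside the program: tf.config.optimizer.set_jit(True)
```

The same effect can be had without code changes by exporting the variable in the shell that launches the training script.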
Float 16bit / Mixed Precision Learning
For inference jobs, lower floating point precision and even 8 or 4 bit integer resolution are already accepted and used to improve performance. Studies suggest that float 16bit precision can also be applied to training tasks with negligible loss in training accuracy and can speed up training jobs dramatically. Applying float 16bit precision is not trivial, as the model has to be adjusted to use it. Because not all calculation steps should be done at lower precision, the mixing of different bit resolutions for calculation is referred to as "mixed precision".
The full potential of mixed precision learning will be better explored with TensorFlow 2.X and will probably be the development trend for improving deep learning framework performance.
We provide benchmarks for both float 32bit and 16bit precision as a reference to demonstrate the potential.
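To illustrate why not every calculation step should run in float 16bit, the following sketch emulates a repeated small weight update in both precisions using only Python's `struct` module. The weight, the update size and the helper names are arbitrary choices for this illustration; it does not reproduce how TensorFlow implements mixed precision.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def apply_updates(weight, update, steps, rounding):
    """Add `update` to `weight` `steps` times, rounding after every addition."""
    for _ in range(steps):
        weight = rounding(weight + rounding(update))
    return weight


# Illustrative values: a weight of 1.0 receiving 1000 updates of 0.0001.
w16 = apply_updates(1.0, 1e-4, 1000, to_float16)
w32 = apply_updates(1.0, 1e-4, 1000, to_float32)

print(w16)  # stays at 1.0: each update is smaller than half the float16 spacing near 1.0
print(w32)  # close to 1.1: float32 keeps the accumulated updates
```

Near 1.0 the spacing between neighbouring float 16bit values is about 0.001, so every single update is rounded away. This is one reason why mixed precision setups typically keep the weight update in float 32bit while running the bulk of the computation in float 16bit.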
The Deep Learning Benchmark
The visual recognition model ResNet50 is used for our benchmark. As a classic deep learning network with a complex 50-layer architecture of convolutional and residual layers, it is still a good network for comparing achievable deep learning performance. Because it is used in many benchmarks, a close to optimal implementation is available, which drives the GPU to maximum performance and shows where the performance limits of the devices are.
The Testing Environment
We used our AIME R400 server for testing. It is an environment designed to run high performance GPUs, providing optimal cooling and the ability to run each GPU in a PCIe 3.0 x16 slot directly connected to the CPU. The PCIe connectivity has a measurable influence on deep learning performance, especially in multi GPU configurations. Sophisticated cooling is necessary to achieve and hold maximum performance.
The technical specs to reproduce our benchmarks:
- AIME R400, Threadripper 1950X, 64 GB RAM
- NVIDIA driver version 418.87
- CUDA 10.1.243
- CUDNN 18.104.22.168
- TensorFlow 1.15
The Python scripts used for the benchmark are available on GitHub at: Tensorflow 1.x Benchmark
Single GPU Performance
The result of our measurements is the average number of images per second that could be trained while running for 100 batches.
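The metric itself can be sketched in a few lines. This is not the benchmark script linked above; `train_step` is a hypothetical callable that trains on one batch, and the optional warm-up phase is our own assumption to keep one-time start-up costs out of the timing.

```python
import time


def measure_images_per_second(train_step, batch_size, num_batches=100, warmup_batches=0):
    """Average training throughput in images per second over `num_batches` batches.

    `train_step` is a caller-supplied function (hypothetical) that processes
    exactly one batch of `batch_size` images.
    """
    for _ in range(warmup_batches):
        train_step()  # not timed
    start = time.perf_counter()
    for _ in range(num_batches):
        train_step()
    elapsed = time.perf_counter() - start
    return batch_size * num_batches / elapsed


# Stand-in for a real training step, for demonstration only.
calls = []


def dummy_step():
    calls.append(1)
    time.sleep(0.001)


rate = measure_images_per_second(dummy_step, batch_size=64, num_batches=100, warmup_batches=5)
print(f"{rate:.0f} images/s")
```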
One can clearly see an up to 30x speed-up compared to a 32 core CPU. There is already a clear gap to the GTX 1080TI, which was introduced in 2017. The difference between an RTX 2080TI and the Tesla V100 is only a little more than 25% when looking at float 32 performance.
When training with float 16bit precision, the field spreads further apart. The CPU and the GTX 1080TI do not natively support float 16bit and therefore gain little performance from the lower precision.
In contrast, the Tesla V100 shows its potential: it increases its lead over the RTX GPUs, delivers more than 3 times its float 32bit performance, and reaches nearly 5 times the performance of a GTX 1080TI.
The RTX 2080TI and Titan RTX at least double their performance compared to float 32bit calculations.
Multi GPU Deep Learning Training Performance
The next level of deep learning performance is to distribute the work and training loads across multiple GPUs. The AIME R400 supports up to 4 GPUs of any type.
Deep learning scales well across multiple GPUs. The method of choice for multi GPU scaling in at least 90% of cases is to spread the batch across the GPUs. The effective batch size is therefore the sum of the batch sizes of all GPUs in use.
Each GPU computes the backpropagation for the inputs of its batch slice. The results of all GPUs are then exchanged and averaged, the weights of the model are adjusted accordingly, and the updated weights are distributed back to all GPUs.
Concerning the data exchange, communication peaks when the results of a batch are collected and the weights are adjusted before the next batch can start. While the GPUs are working on a batch, little or no communication happens between them.
In this standard solution for multi GPU scaling, one has to make sure that all GPUs run at the same speed; otherwise the slowest GPU becomes the bottleneck that all other GPUs have to wait for. Mixing different GPU types is therefore not useful.
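The batch-splitting scheme described above can be sketched without any GPU at all. The toy model (a single weight fitted with squared error), the function names and the timing values are assumptions for illustration; a real setup would use the framework's distribution mechanism instead.

```python
def gradient(w, samples):
    """Mean gradient of the squared error (w*x - y)**2 with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


def data_parallel_step(w, batch, num_gpus, learning_rate):
    """One training step with the batch split evenly across `num_gpus` workers."""
    assert len(batch) % num_gpus == 0, "batch must split evenly across GPUs"
    slice_size = len(batch) // num_gpus
    slices = [batch[i * slice_size:(i + 1) * slice_size] for i in range(num_gpus)]
    local_grads = [gradient(w, s) for s in slices]  # computed independently per GPU
    avg_grad = sum(local_grads) / num_gpus          # exchange and average
    return w - learning_rate * avg_grad             # same new weight goes to every GPU


def synchronized_step_time(per_gpu_seconds, sync_seconds=0.0):
    """All GPUs wait for the slowest one before the weights are exchanged."""
    return max(per_gpu_seconds) + sync_seconds


batch = [(float(x), 3.0 * x) for x in range(1, 9)]  # toy data for y = 3x
w_parallel = data_parallel_step(0.0, batch, num_gpus=4, learning_rate=0.01)
w_single = 0.0 - 0.01 * gradient(0.0, batch)
print(w_parallel, w_single)  # identical: averaging equal slices equals the full-batch mean

# Made-up step times: three fast GPUs and one slower card.
print(synchronized_step_time([0.10, 0.10, 0.10, 0.25]))  # 0.25
```

With equally sized slices the averaged gradient matches the gradient of the full batch, which is why the effective batch size is the sum of the per-GPU batch sizes. The second function shows why one slower card dictates the step time for all of them.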
Training Performance put into Perspective
To get a better picture of how the measured images per second translate into turnaround and waiting times, we look at a real use case of training such a network with a large dataset.
For example, the ImageNet 2017 dataset consists of 1,431,167 images. Processing each image of the dataset once on ResNet50, a so-called epoch of training, would take about:
| Configuration | float 32 training | float 16 training |
|---|---|---|
| CPU (32 cores) | 26 hours | 26 hours |
| Single RTX 2080TI | 69 minutes | 29 minutes |
| Single Tesla V100 | 55 minutes | 17 minutes |
| 4 x RTX 2080TI | 19 minutes | 8 minutes |
| 4 x Tesla V100 | 15 minutes | 4.5 minutes |
Usually at least 50 training epochs are required, so one could have a result to evaluate after:
| Configuration | float 32 training | float 16 training |
|---|---|---|
| CPU (32 cores) | 55 days | 55 days |
| Single RTX 2080TI | 57 hours | 24 hours |
| Single Tesla V100 | 46 hours | 14 hours |
| 4 x RTX 2080TI | 16 hours | 6.5 hours |
| 4 x Tesla V100 | 12 hours | 4 hours |
This shows that the correct setup can shorten a training task from weeks to the next working day or even just a few hours. In most cases, the desired training time is probably one that lets the training run overnight, with the results available the next morning.
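The arithmetic behind the two tables is straightforward and can be sketched as follows. The function names are our own, and the throughput printed here is back-computed from the rounded 55 minutes per epoch in the table, so it is only an approximation of the measured value.

```python
IMAGENET_2017_IMAGES = 1_431_167  # dataset size stated above


def epoch_minutes(images_per_second, num_images=IMAGENET_2017_IMAGES):
    """Minutes needed to process every image of the dataset once."""
    return num_images / images_per_second / 60


def implied_images_per_second(minutes_per_epoch, num_images=IMAGENET_2017_IMAGES):
    """Throughput implied by a given epoch time (inverse of epoch_minutes)."""
    return num_images / (minutes_per_epoch * 60)


def training_hours(minutes_per_epoch, epochs=50):
    """Total training time in hours for the given number of epochs."""
    return minutes_per_epoch * epochs / 60


# Single Tesla V100 with float 32: 55 minutes per epoch according to the table.
rate = implied_images_per_second(55)
hours = training_hours(55)
print(round(rate))   # 434 images per second, back-computed
print(round(hours))  # 46 hours for 50 epochs, matching the second table
```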
Deep Learning GPU Price Performance Comparison
Another important aspect is to put the achieved performance of a GPU in relation to its price. We therefore show the current retail price of each GPU in relation to its achievable float 16bit performance. The bars are normalized to the most efficient GPU.
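The normalization can be sketched like this. The entries `gpu_a` to `gpu_c` and their throughput and price figures are placeholders chosen for easy arithmetic; they are not the measured results or retail prices behind the chart.

```python
def normalized_price_performance(gpus):
    """Map each GPU name to its performance per price, scaled so the best is 1.0.

    `gpus` maps a name to a tuple (images_per_second, price).
    """
    efficiency = {name: perf / price for name, (perf, price) in gpus.items()}
    best = max(efficiency.values())
    return {name: value / best for name, value in efficiency.items()}


# Placeholder figures for illustration only.
example = {
    "gpu_a": (600.0, 1200.0),
    "gpu_b": (1000.0, 2500.0),
    "gpu_c": (1500.0, 9000.0),
}

scores = normalized_price_performance(example)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

In this made-up example the cheapest card is also the most efficient one and sets the reference value of 1.0, even though it has the lowest absolute throughput.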
Mixed Precision can speed up training by a factor of 3
A feature definitely worth a look in terms of performance is switching training from float 32 precision to mixed precision. Depending on your constraints, an up to 3 times performance boost from a software adjustment alone could be a very efficient move.
To utilize the full potential of the latest high-end GPUs, switching to mixed precision is strongly recommended: looking at float 32 performance alone, the gain of Turing GPUs over the GTX 1080TI, 3-year-old hardware, is not that impressive.
Conversely, an up-to-date GPU is necessary to profit from mixed precision training, and upgrading, for example from a GTX 1080TI, can be rewarded with a more than 4 times performance gain overall.
Multi GPU scaling is more than feasible
Deep learning performance scales well with multiple GPUs, at least up to 4 GPUs: 2 GPUs can often outperform the next more powerful GPU in terms of price and performance.
This holds for 2 x RTX 2080TI compared to a Titan RTX and for 2 x Titan RTX compared to a Tesla V100.
Best GPU for Deep Learning?
As in most cases, there is no simple answer to this question. Performance is certainly one of the most important aspects of a GPU used for deep learning tasks, but not the only one.
So it depends heavily on your requirements. Here are our assessments of the most promising deep learning GPUs:
RTX 2080 TI
It clearly delivers the most bang for the buck. If you are looking for a price-conscious solution, a 4 GPU setup can play in the high-end league at acquisition costs below those of a single top-end GPU. The only drawback is that it cannot be used in virtualization environments, because of a limitation imposed by NVIDIA that prohibits the use of this type of GPU in cloud services. For workstations or in-house servers, however, it is a very interesting GPU.
Titan RTX
The bigger brother of the RTX 2080 TI. It shines when a lot of GPU memory is needed. With its 24 GB of memory it can load even the most demanding models currently in research. The additional performance comes at a price, and it has the same limitation imposed by NVIDIA that it may not be used in cloud renting services.
Tesla V100
If the highest performance regardless of price, or the highest performance density, is needed, the Tesla V100 is the first choice: it delivers the most compute performance in all categories. For solutions that need virtualization under a hypervisor, for example for cloud renting services, it is currently the only choice for high-end deep learning training tasks.
Questions or remarks? Please contact us at: firstname.lastname@example.org