Deep Learning GPU Benchmarks 2019

Overview of the benchmarked GPUs

Although we only tested a small selection of all available GPUs, we think we covered the GPUs that are currently best suited for deep learning training and development, based on their compute and memory capabilities and their compatibility with current deep learning frameworks.

  • GTX 1080TI

    NVIDIA's classic GPU for deep learning. With 11 GB of GDDR5X memory and 3584 CUDA cores, it was designed for compute workloads. It is no longer produced, so we include it here only as a reference point.
  • RTX 2080TI

    The RTX 2080 TI comes with 4352 CUDA cores and 544 NVIDIA Turing mixed-precision Tensor Cores, delivering 107 Tensor TFLOPS of AI performance, plus 11 GB of ultra-fast GDDR6 memory.
  • Titan RTX

    Powered by the award-winning Turing™ architecture, the Titan RTX delivers 130 Tensor TFLOPS of performance with 576 Tensor Cores and 24 GB of ultra-fast GDDR6 memory.
  • Tesla V100

    With 640 Tensor Cores, the Tesla V100 was the world’s first GPU to break the 100 teraFLOPS (TFLOPS) barrier of deep learning performance. It comes with 16 GB of high-bandwidth HBM2 memory.

Getting the best performance out of TensorFlow

Several measures were taken to get the most performance out of TensorFlow for benchmarking.

Batch Size

One of the most important settings for optimizing the workload for each type of GPU is the batch size. The batch size specifies how many backpropagation passes of the network are computed in parallel; the results of these passes are averaged across the batch, and the average is then applied to adjust the weights of the network. The best batch size in terms of performance is directly related to the amount of GPU memory available.

The basic rule is to increase the batch size so that the complete GPU memory is used.

A larger batch size increases parallelism and improves the utilization of the GPU cores. However, the batch size must not exceed the available GPU memory: otherwise memory swapping mechanisms kick in and reduce performance, or the application simply crashes with an 'out of memory' exception.
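
The rule above can be automated with a simple search. The following is a minimal sketch and not part of the benchmark scripts: `try_step` is a hypothetical callable standing in for one training step at a given batch size, and it is assumed to raise `MemoryError` when the batch does not fit (a real framework raises its own out-of-memory exception instead). The memory figures in `simulated_step` are made up for illustration.

```python
def find_max_batch_size(try_step, start=1, limit=65536):
    """Return the largest batch size <= limit for which try_step succeeds.

    try_step(batch_size) is a hypothetical callable that runs one training
    step and raises MemoryError when the batch does not fit into GPU memory.
    Returns 0 if even `start` does not fit.
    """
    def fits(batch_size):
        try:
            try_step(batch_size)
            return True
        except MemoryError:
            return False

    if not fits(start):
        return 0

    # Phase 1: double the batch size until it no longer fits (or hits limit).
    good = start
    bad = None
    while good < limit:
        candidate = min(good * 2, limit)
        if fits(candidate):
            good = candidate
        else:
            bad = candidate
            break

    if bad is None:
        return good

    # Phase 2: binary search between the last good and the first bad size.
    while bad - good > 1:
        mid = (good + bad) // 2
        if fits(mid):
            good = mid
        else:
            bad = mid
    return good


def simulated_step(batch_size, memory_mb=11 * 1024, per_sample_mb=90, fixed_mb=1500):
    """Stand-in for a real training step; all memory numbers are made up."""
    if fixed_mb + batch_size * per_sample_mb > memory_mb:
        raise MemoryError("out of memory")


print(find_max_batch_size(simulated_step))
```

In practice one would probably stay slightly below the value found this way, since memory use can vary between steps.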

Within limits, a large batch size has no negative effect on the training results; on the contrary, a large batch size can help to obtain more generalized results. An example is BigGAN, where batch sizes as high as 2,048 are suggested to deliver the best results. A further interesting read about the influence of the batch size on the training results was published by OpenAI.

TensorFlow XLA

A TensorFlow performance feature that was recently declared stable is XLA (Accelerated Linear Algebra). It optimizes the network graph by dynamically compiling parts of the network into kernels optimized for the specific device. This can bring performance benefits of 10% to 30% compared to the static, hand-crafted TensorFlow kernels for the different layer types.

This feature can be turned on with a simple option or environment flag and has a direct effect on execution performance. For how to enable XLA in your projects, read here.
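
As a rough illustration of the environment-flag route: the XLA documentation describes requesting auto-clustering through the `TF_XLA_FLAGS` environment variable with the value `--tf_xla_auto_jit=2`. The sketch below only prepares such an environment for a child process. The helper name `with_xla_auto_jit` and the script name `train.py` are illustrative assumptions, and the flag should be checked against the documentation of the TensorFlow version in use.

```python
import os


def with_xla_auto_jit(env=None, level=2):
    """Return a copy of env with XLA auto-clustering requested via TF_XLA_FLAGS.

    Other flags already present in TF_XLA_FLAGS are preserved; an existing
    --tf_xla_auto_jit setting is replaced.
    """
    env = dict(os.environ if env is None else env)
    flags = [f for f in env.get("TF_XLA_FLAGS", "").split()
             if not f.startswith("--tf_xla_auto_jit=")]
    flags.append("--tf_xla_auto_jit={}".format(level))
    env["TF_XLA_FLAGS"] = " ".join(flags)
    return env


# Usage (not executed here): launch a hypothetical training script with XLA.
# import subprocess
# subprocess.run(["python", "train.py"], env=with_xla_auto_jit(), check=True)
```

Setting the variable for a child process keeps the change local to one training run instead of altering the environment of the whole shell.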

Float 16bit / Mixed Precision Learning

For inference jobs, a lower floating point precision and even 8 or 4 bit integer resolution is already well established and used to improve performance. Studies suggest that float 16bit precision can also be applied to training tasks with negligible loss in training accuracy and can speed up training jobs dramatically. Applying float 16bit precision is not trivial, as the model has to be adjusted to use it. Because not all calculation steps should be done at a lower bit precision, the mixing of different bit resolutions for calculation is referred to as "mixed precision".
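
One way to see why not every step should run in 16 bit is to look at small weight updates. The sketch below is an illustration only, not the benchmark's implementation: it uses Python's `struct` module to round values to half and single precision and adds a small update to a weight many times. The weight, update size, and step count are arbitrary example values.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(weight, update, steps, rounder):
    """Add `update` to `weight` `steps` times, rounding after every addition."""
    weight = rounder(weight)
    for _ in range(steps):
        weight = rounder(weight + update)
    return weight


# Illustrative values: a weight of 1.0 receiving 1000 small updates of 0.0001.
w16 = accumulate(1.0, 1e-4, 1000, to_float16)
w32 = accumulate(1.0, 1e-4, 1000, to_float32)
print("float16 weight:", w16)
print("float32 weight:", w32)
```

The float16 weight never moves away from 1.0, because the spacing between adjacent float16 values just above 1.0 is about 0.001, so each update of 0.0001 is rounded away. The float32 weight ends close to 1.1. Keeping such accumulations in 32 bit while running the bulk of the arithmetic in 16 bit is one example of mixing precisions.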

The full potential of mixed precision learning will be better explored with TensorFlow 2.x, and it will probably be the development trend for improving deep learning framework performance.

For reference, we provide benchmarks for both float 32bit and float 16bit precision to demonstrate the potential.

The Deep Learning Benchmark

For our benchmark, the visual recognition model ResNet50 is used. As a classic deep learning network, with its complex 50 layer architecture of convolutional and residual layers, it is still a good network for comparing achievable deep learning performance. Because it is used in many benchmarks, a near-optimal implementation is available that drives the GPU to maximum performance and shows where the performance limits of the devices are.

The Testing Environment

For testing we used our AIME R400 server. It is a sophisticated environment for running high performance GPUs: it provides optimal cooling and the ability to run each GPU in a PCIe 3.0 x16 slot directly connected to the CPU. The PCIe connectivity has a measurable influence on deep learning performance, especially in multi GPU configurations. Sophisticated cooling is necessary to reach and hold maximum performance.

The technical specs to reproduce our benchmarks:

  • AIME R400, Threadripper 1950X, 64 GB RAM
  • NVIDIA driver version 418.87
  • CUDA 10.1.243
  • cuDNN 7.6.4.38
  • TensorFlow 1.15

The Python scripts used for the benchmark are available on GitHub at: Tensorflow 1.x Benchmark

Single GPU Performance

The result of our measurements is the average number of images per second that could be trained while running for 100 batches.
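
The metric itself is simple to compute. The sketch below shows one way to do it and is not taken from the benchmark scripts: `train_step` is a hypothetical zero-argument callable that trains one batch and blocks until the GPU has finished, and the untimed warm-up phase is an assumption added here so that one-time setup costs do not distort the average.

```python
import time


def images_per_second(batch_size, batch_durations):
    """Average throughput over the measured batches.

    batch_durations holds the wall-clock time in seconds of each batch.
    """
    if not batch_durations:
        raise ValueError("at least one batch duration is required")
    total_images = batch_size * len(batch_durations)
    return total_images / sum(batch_durations)


def measure(train_step, batch_size, num_batches=100, warmup=10):
    """Time num_batches calls of train_step after a few untimed warm-up calls.

    train_step is a hypothetical callable that trains one batch and returns
    only once the work is done. The warm-up count is an assumption.
    """
    for _ in range(warmup):
        train_step()
    durations = []
    for _ in range(num_batches):
        start = time.perf_counter()
        train_step()
        durations.append(time.perf_counter() - start)
    return images_per_second(batch_size, durations)
```

For example, 100 batches of 64 images that each take 0.5 seconds correspond to 6400 images in 50 seconds, or 128 images per second.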

One can clearly see an up to 30x speed-up compared to a 32 core CPU. There is already a clear gap to the GTX 1080TI, which was introduced in 2017. The difference between an RTX 2080TI and the Tesla V100 is only a little more than 25% when looking at float32 performance.

When training with float 16bit precision, the field spreads further apart. The CPU and the GTX 1080TI do not natively support float 16bit resolution and therefore gain little performance from using a lower bit resolution.

In contrast, the Tesla V100 shows its potential: it increases its lead over the RTX GPUs, delivers more than 3 times its float 32 bit performance and reaches nearly 5 times the performance of a GTX 1080TI.

The RTX 2080TI and Titan RTX at least double their performance compared to float 32 bit calculations.

Multi GPU Deep Learning Training Performance

The next level of deep learning performance is to distribute the work and training loads across multiple GPUs. The AIME R400 supports up to 4 GPUs of any type.

Deep learning scales well across multiple GPUs. The method of choice for multi GPU scaling, in at least 90% of the cases, is to spread the batch across the GPUs. The effective batch size is therefore the sum of the batch sizes of all GPUs in use.

Each GPU calculates the backpropagation for the inputs of its slice of the batch. The results of all GPUs are then exchanged and averaged, the weights of the model are adjusted accordingly, and the updated weights have to be distributed back to all GPUs.

Data exchange therefore peaks when the results of a batch are collected and the weights are adjusted before the next batch can start. While the GPUs are working on a batch, little or no communication happens across the GPUs.
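
The scheme described above can be sketched in plain Python with a toy one-parameter model. This is a simplified illustration under stated assumptions, not the benchmark code: the "GPUs" are simulated by list slices, the model is `y = w * x` with a squared-error loss, and all function names and data values are made up for the example.

```python
def gradient(w, samples):
    """Mean gradient of the squared error (w*x - y)**2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


def split_batch(batch, num_gpus):
    """Split a batch into num_gpus equally sized slices."""
    if len(batch) % num_gpus != 0:
        raise ValueError("batch size must be divisible by the number of GPUs")
    size = len(batch) // num_gpus
    return [batch[i * size:(i + 1) * size] for i in range(num_gpus)]


def data_parallel_step(w, batch, num_gpus, learning_rate=0.01):
    """One training step: per-GPU gradients, averaged, then one weight update."""
    # Each simulated GPU computes the gradient for its own slice,
    # without any communication with the others.
    local_grads = [gradient(w, piece) for piece in split_batch(batch, num_gpus)]
    # Communication peak: the per-GPU results are exchanged and averaged.
    avg_grad = sum(local_grads) / num_gpus
    # The averaged gradient adjusts the weight; the returned value is what
    # would be distributed back to all GPUs.
    return w - learning_rate * avg_grad


# Toy data for y = 3x; the values are purely illustrative.
batch = [(float(x), 3.0 * x) for x in range(1, 9)]
w_parallel = data_parallel_step(0.0, batch, num_gpus=4)
w_single = data_parallel_step(0.0, batch, num_gpus=1)
print(w_parallel, w_single)
```

With equally sized slices, averaging the per-GPU gradients gives the same update as processing the whole batch on one device, which is why the effective batch size is the sum of the per-GPU batch sizes.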

In this standard solution for multi GPU scaling, one has to make sure that all GPUs run at the same speed; otherwise the slowest GPU becomes the bottleneck that all other GPUs have to wait for. Mixing different GPU types is therefore not useful.

With the AIME R400 a very good scale factor of 0.92 is reached, so each additional GPU adds about 92% of its possible performance to the total performance.
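
Read literally, this gives a simple estimate for the total throughput of a multi GPU setup. The sketch below encodes that reading; the formula is our interpretation of the scale factor, and the function name is made up for the example.

```python
def multi_gpu_throughput(single_gpu_rate, num_gpus, scale_factor=0.92):
    """Estimated total throughput when each additional GPU adds
    scale_factor times the single-GPU rate."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return single_gpu_rate * (1.0 + scale_factor * (num_gpus - 1))


# Relative speed-up over one GPU (single_gpu_rate = 1.0).
for n in (1, 2, 3, 4):
    print(n, "GPUs ->", round(multi_gpu_throughput(1.0, n), 2), "x")
```

Under this reading, four GPUs deliver roughly 3.76 times the throughput of a single GPU.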

Training Performance put into Perspective

To get a better picture of how the measurement of images per second translates into turnaround and waiting times when training such networks, we look at a real use case: training such a network with a large dataset.

For example, the ImageNet 2017 dataset consists of 1,431,167 images. Processing each image of the dataset once, the so-called 1 epoch of training, on ResNet50 would take about:

Configuration        float 32 training    float 16 training
CPU (32 cores)       26 hours             26 hours
Single RTX 2080TI    69 minutes           29 minutes
Single Tesla V100    55 minutes           17 minutes
4 x RTX 2080TI       19 minutes           8 minutes
4 x Tesla V100       15 minutes           4.5 minutes

Usually at least 50 training epochs are required, so one could have a result to evaluate after:

Configuration        float 32 training    float 16 training
CPU (32 cores)       55 days              55 days
Single RTX 2080TI    57 hours             24 hours
Single Tesla V100    46 hours             14 hours
4 x RTX 2080TI       16 hours             6.5 hours
4 x Tesla V100       12 hours             4 hours

This shows that the correct setup can cut a training task from weeks down to the next working day or even just hours. In most cases, what is probably desired is a training time that allows the training to run overnight, with the results available the next morning.
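
The arithmetic behind these tables is straightforward. The sketch below, with function names chosen for the example, derives the throughput implied by an epoch time and scales one epoch up to 50, using the single RTX 2080TI float 32 entry as input.

```python
IMAGENET_2017_IMAGES = 1431167  # dataset size quoted in the text


def implied_images_per_second(epoch_minutes, dataset_size=IMAGENET_2017_IMAGES):
    """Throughput implied by the time needed for one epoch."""
    return dataset_size / (epoch_minutes * 60.0)


def training_hours(epoch_minutes, epochs=50):
    """Total training time in hours for the given number of epochs."""
    return epoch_minutes * epochs / 60.0


# Single RTX 2080TI, float 32: 69 minutes per epoch (from the epoch table).
print(round(implied_images_per_second(69)))  # about 346 images per second
print(training_hours(69))                    # 57.5 hours for 50 epochs
```

The 57.5 hours computed here correspond to the 57 hours listed in the 50 epoch table.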

Deep Learning GPU Price Performance Comparison

Another important aspect is to put the achieved performance of a GPU in relation to its price. We therefore show the current retail price of each GPU in relation to its achievable float 16 bit performance. The bars are normalized to the most efficient GPU.

Clearly the 2080TI is the most efficient GPU in terms of performance per price. It is four times more price efficient than a Tesla V100, even when comparing float 16 bit performance, a feature where the Tesla V100 excels.
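
The normalization used for such a comparison can be sketched as follows. The inputs are placeholders in arbitrary units and are neither the measured performance values nor real retail prices; the function name is likewise made up for the example.

```python
def normalized_price_performance(gpus):
    """Map each GPU name to performance per price, relative to the best one.

    gpus maps a name to a (performance, price) pair in any consistent units.
    """
    ratios = {name: perf / price for name, (perf, price) in gpus.items()}
    best = max(ratios.values())
    return {name: ratio / best for name, ratio in ratios.items()}


# Placeholder inputs in arbitrary units, NOT measured values or real prices.
example = {
    "GPU A": (1.0, 1.0),
    "GPU B": (2.0, 8.0),
}
for name, score in normalized_price_performance(example).items():
    print(name, score)
```

In this placeholder example, GPU B is twice as fast but eight times as expensive, so its normalized score is 0.25: GPU A is four times more price efficient.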

Conclusions

Mixed Precision can speed up training by a factor of 3

A feature definitely worth a look in terms of performance is switching training from float 32 precision to mixed precision. Depending on your constraints, getting an up to 3 times performance boost just by adjusting software could be a very efficient move.

The switch to mixed precision is also nearly necessary to take full advantage of the potential of the latest high end GPUs: when looking only at float 32 performance, their performance increase over a GTX 1080TI, a 3 year old piece of hardware, is not that impressive.

Conversely, an up-to-date GPU is necessary to profit from mixed precision training, and it is rewarded with a more than 4 times overall performance gain when upgrading, for example, from a GTX 1080TI.

Multi GPU scaling is more than feasible

Deep learning performance scales well with multiple GPUs, at least for up to 4 GPUs: 2 GPUs can often easily outperform the next more powerful GPU in terms of price and performance.

This is true when comparing 2 x RTX 2080TI to a Titan RTX, and 2 x Titan RTX to a Tesla V100.

Best GPU for Deep Learning?

As in most cases, there is no simple answer to this question. Performance is certainly one of the most important aspects of a GPU used for deep learning tasks, but not the only one.

It therefore depends highly on your requirements. Here are our assessments of the most promising deep learning GPUs:

RTX 2080 TI

It clearly delivers the most bang for the buck. If you are looking for a price-conscious solution, a 4 GPU setup can play in the high end league at acquisition costs below those of a single top-of-the-line GPU. The only drawback is that it is not usable in virtualization environments, a limitation imposed by NVIDIA to keep this type of GPU out of cloud services. But for workstations or in-house servers it is a very interesting GPU to use.

Titan RTX

The bigger brother of the RTX 2080 TI. It shines when a lot of GPU memory is needed: with its 24 GB of memory it can load even the most demanding models currently in research. The additional performance comes at a price, and it has the same limitation imposed by NVIDIA against use in cloud rental services.

Tesla V100

If the most performance regardless of price, or the highest performance density, is needed, the Tesla V100 is the first choice: it delivers the most compute performance in all categories. For solutions that need virtualization to run under a hypervisor, for example cloud rental services, it is also currently the only choice for high end deep learning training tasks.

Questions or remarks? Please contact us at: hello@aime.info