Efficient deep learning training - using the practical example of training the ResNet50 model on the ImageNet data set
To achieve good results in the shortest possible training time, it is essential to find suitable values for training parameters such as learning rate and batch size. Suitable values depend on the model to be trained, the amount of data used, and the available hardware. Finding them can be quite time-consuming, because a single training run can take up to several days, depending on the model and the training data.
Using the example of the ResNet-50 model with the ImageNet dataset, this article shows how suitable values for the training parameters can be determined and describes the influence of the various parameters on the training progress. Practical advice is given on the choice of parameters.
For this purpose, we first present the ResNet-50 model and the ImageNet dataset used for our experiments and explain a few relevant basic terms.
Residual neural networks, ResNet for short, were first introduced in 2015 to classify images. ResNet is known as one of the first deep learning architectures to address the vanishing/exploding gradient problem that occurs in previously used network structures when the number of intermediate layers is increased, see Deep Residual Learning for Image Recognition. The characteristic feature of residual networks is the use of skip connections (also called shortcut connections), through which the input of a group of layers bypasses those layers and is added to their output. This makes it possible to build much deeper networks. The ResNet-50 model used for this experiment has 50 weighted layers: one initial convolutional layer, 48 convolutional layers arranged in 16 bottleneck blocks, and one final fully connected layer (1+48+1=50). The MaxPool and Average Pool layers have no weights and are not counted. The deeper network structure does achieve better detection rates than the shallower network structures used previously.
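The idea of a skip connection can be illustrated with a minimal sketch in plain Python. The functions `halve` and `zero` are hypothetical stand-ins for the convolutional layers inside a block and are not part of the ResNet-50 model. The sketch only shows that the block output is the transformed input plus the unchanged input.

```python
from typing import Callable, List

Vector = List[float]

def residual_block(x: Vector, transform: Callable[[Vector], Vector]) -> Vector:
    """Return transform(x) + x: the layer output with the skip connection added."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("the skip connection needs matching shapes")
    return [f + xi for f, xi in zip(fx, x)]

# Hypothetical stand-ins for the layers inside a block.
def halve(x: Vector) -> Vector:
    return [0.5 * v for v in x]

def zero(x: Vector) -> Vector:
    return [0.0 for _ in x]

x = [1.0, -2.0, 4.0]
print(residual_block(x, halve))  # [1.5, -3.0, 6.0]
print(residual_block(x, zero))   # [1.0, -2.0, 4.0]: the input passes through unchanged
```

The second call shows why skipped layers do no harm: if the layers inside the block contribute nothing, the block simply passes its input on.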
A version of the ResNet-50 model pre-trained on the ImageNet dataset can be downloaded from the PyTorch library. However, we started from an untrained ResNet-50 model because we wanted to investigate how the training on ImageNet itself can be optimized.
The full ImageNet dataset consists of around 14 million annotated images. Since 2010, a subset of it has been used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) to study image classification and object recognition. This subset, which is the one commonly used for ResNet-50 training, contains about 1.28 million training images assigned to 1000 different classes. Because of its size, quality, and accessibility, the ImageNet dataset is well suited to study the training of models for image classification. It can be downloaded for free from image-net.org or Kaggle.
When evaluating a model's detection rate, it is common to look at the TopK-accuracy. With K=1, the Top1-accuracy states the fraction of images in the test dataset for which the class the model assigns the highest probability matches the correct class. The test dataset consists of images that are not contained in the training dataset, so the Top1-accuracy measures how well the trained model classifies unknown images. However, some images are difficult to assign clearly to one class, e.g. if several objects can be seen in one image, so it often makes sense to examine the TopK-accuracy for larger K as well. Here it is considered whether the actual class is among the K classes to which the model assigns the highest probability. The usual case is K=5, the Top5-accuracy.
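A minimal sketch of the TopK-accuracy in plain Python follows. The scores and labels are made-up toy values for four images and four classes, not measurements, and the function name `topk_accuracy` is chosen here for illustration.

```python
from typing import Sequence

def topk_accuracy(scores: Sequence[Sequence[float]],
                  labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Toy model outputs (invented for illustration).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1: Top1 hit
    [0.5, 0.3, 0.1, 0.1],  # true class 1: only a Top2 hit
    [0.2, 0.2, 0.5, 0.1],  # true class 2: Top1 hit
    [0.4, 0.3, 0.2, 0.1],  # true class 3: miss for k=1 and k=2
]
labels = [1, 1, 2, 3]

top1 = topk_accuracy(scores, labels, k=1)  # 0.5
top2 = topk_accuracy(scores, labels, k=2)  # 0.75
```

The second image shows the effect described above: the correct class is not the first choice, but it is counted as soon as K is increased.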
When training deep learning models, the main goal is to adjust the weights of the neural network so that the loss function becomes as small as possible, ideally reaching its global minimum. The optimizer takes care of this task. Most optimizers in common use are based on the gradient descent method, in which the weight adjustments are calculated using the gradient of the loss function. The optimizer SGD (Stochastic Gradient Descent) is particularly popular because it estimates the gradient from a randomly drawn batch of samples instead of the whole dataset, which makes each update comparatively cheap. For this reason we used it for the measurements in this article.
Probably the most important parameter when training deep learning models with the gradient descent method is the learning rate of the optimizer. It controls how far the weights move towards a minimum of the loss function in each iteration. The direction of the step is given by the negative gradient of the loss function, and the step size is the magnitude of the gradient multiplied by the learning rate. If the learning rate is too large, the desired minimum can be overshot and thus missed. With very narrow valleys, the step can even be so large that it jumps out of the valley. On the other hand, a learning rate that is too small can leave the training stuck in an undesirable local minimum. In addition, the system converges more slowly, which increases training time.
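The effect of the learning rate can be tried out on a toy problem. The following sketch minimizes f(w) = w**2 with plain gradient descent. The function and the three learning rates are chosen for illustration only and have nothing to do with the ResNet-50 loss. Since this toy function has a single minimum, it shows slow convergence and overshooting, but not getting stuck in a local minimum.

```python
def gradient_descent(lr: float, steps: int = 20, w0: float = 1.0) -> float:
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w     # derivative of w**2
        w = w - lr * grad  # step = learning rate * gradient
    return w

too_small = gradient_descent(lr=0.01)  # after 20 steps still far from the minimum at 0
suitable = gradient_descent(lr=0.4)    # ends very close to 0
too_large = gradient_descent(lr=1.1)   # every step overshoots, w moves away from 0
```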
Another improvement to training with the gradient descent method is the momentum. Described in the 1999 paper On the Momentum Term in Gradient Descent Learning Algorithms, it is now used in almost all common training algorithms that rely on the gradient descent method. In simple terms, the momentum m (0<m<1) adds a fraction m of the previous weight change to the current weight change. If the gradient keeps pointing in the same direction, the step size towards the minimum grows and the training is accelerated. If the direction of the gradient changes constantly, the fluctuations are smoothed out somewhat and the movement is more strongly oriented towards the desired minimum. In the literature, the value m = 0.9 has proven suitable for most training processes. We did not investigate other values, so m = 0.9 was used in all runs.
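The momentum rule described above can be written down in a few lines. This is a minimal sketch on the toy function f(w) = w**2, assuming the simple form "new change = gradient step + m * previous change". Deep learning libraries may arrange the terms slightly differently, and the learning rate of 0.01 is chosen only to make the effect visible.

```python
def sgd_momentum(lr: float, momentum: float, steps: int, w0: float = 1.0) -> float:
    """Minimize f(w) = w**2; each weight change adds momentum * previous change."""
    w = w0
    delta = 0.0  # previous weight change
    for _ in range(steps):
        grad = 2.0 * w
        delta = -lr * grad + momentum * delta
        w = w + delta
    return w

plain = sgd_momentum(lr=0.01, momentum=0.0, steps=10)          # about 0.82
with_momentum = sgd_momentum(lr=0.01, momentum=0.9, steps=10)  # about 0.31
```

Because the gradient points in the same direction in every step here, the accumulated weight changes bring the run with m = 0.9 much closer to the minimum at 0 in the same number of steps. In PyTorch, the value is passed to `torch.optim.SGD` through its `momentum` argument.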
The batch size defines how many samples from the dataset are processed simultaneously in one training iteration. The training dataset is thus divided into batches, which are propagated through the network one after the other until the entire dataset has been processed and the epoch is complete. For small datasets that fit completely into memory, the batch can also consist of the entire dataset, in which case the epoch consists of only a single iteration. However, this is usually not the case, which is why the division into smaller batches makes sense. In the literature, the batch is often called a mini-batch to distinguish it from the complete dataset. Since the weights of the model are adjusted after each iteration based on the batch just processed, the choice of batch size also influences the training results.
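The relationship between dataset, batch, iteration, and epoch can be sketched with a toy dataset of ten samples. The helper `iterate_batches` is a hypothetical name used for illustration, and real data loaders usually also shuffle the samples.

```python
from typing import Iterator, List, Sequence

def iterate_batches(samples: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive batches; the last one may be smaller than batch_size."""
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])

dataset = list(range(10))  # toy dataset with 10 samples

batches = list(iterate_batches(dataset, batch_size=4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]] -> 3 iterations per epoch

full = list(iterate_batches(dataset, batch_size=10))
# a single batch containing the whole dataset -> 1 iteration per epoch
```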
When training models for image classification, overfitting of the model to the training data is a common problem. The model recognizes the training images better and better, but loses the desired ability to correctly classify images it has not been trained on. One way to reduce this is to slightly alter the images with image processing methods each time they are fed to the model, e.g. by varying their size and orientation. The changes are applied automatically and randomly within a defined parameter range. Since the training dataset is artificially enlarged as a result, this is referred to as "augmentation".
With augmentation, objects are recognized better and more reliably regardless of their size and orientation.
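A minimal sketch of such a random alteration in plain Python follows. It assumes two common transforms, a random horizontal flip and a random crop. The exact transforms and parameter ranges used in the experiments are not stated here, and the nested list is only a stand-in for a real image.

```python
import random
from typing import List

Image = List[List[int]]  # rows of pixel values, a stand-in for a real image

def random_flip_and_crop(img: Image, crop: int, rng: random.Random) -> Image:
    """Randomly mirror the image horizontally, then cut out a random crop x crop window."""
    if rng.random() < 0.5:
        img = [row[::-1] for row in img]
    top = rng.randint(0, len(img) - crop)       # randint includes both end points
    left = rng.randint(0, len(img[0]) - crop)
    return [row[left:left + crop] for row in img[top:top + crop]]

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy image

# Each pass over the data sees a slightly different version of the same image.
variants = [random_flip_and_crop(image, crop=3, rng=rng) for _ in range(5)]
```

In a PyTorch pipeline, transforms such as `torchvision.transforms.RandomResizedCrop` and `torchvision.transforms.RandomHorizontalFlip` play this role.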
Finding the right learning rate and batch size
To find a good combination of learning rate and batch size, we tried different values and compared the training progress using the Top1-accuracy. We used an AIME T600 Workstation with a total of four NVIDIA RTX 3090 GPUs, each with 24 GB of GPU memory. On this workstation, one epoch over the ImageNet training data took a relatively short 14 minutes.
In addition, the combined memory of the four GPUs allowed us to choose a higher batch size, which benefits the training progress. First, we tried different constant learning rates with the maximum batch size (in this case 768). For comparison, we plotted the respective Top1-accuracy over the training epochs (see Fig. 1).
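The numbers given so far allow a quick back-of-the-envelope calculation. This sketch assumes that the batch is split evenly across the four GPUs, that a run covers 90 epochs, and that the training set has 1,281,167 images (the usual ILSVRC-2012 training set size, which is an assumption and not stated in this article).

```python
import math

num_gpus = 4
global_batch_size = 768
minutes_per_epoch = 14
epochs = 90
train_images = 1_281_167  # assumption: ILSVRC-2012 training set size

per_gpu_batch = global_batch_size // num_gpus                       # 192 images per GPU
iterations_per_epoch = math.ceil(train_images / global_batch_size)  # 1669 weight updates
total_hours = epochs * minutes_per_epoch / 60                       # 21.0 hours per run
```

A single 90-epoch run therefore takes roughly 21 hours under these assumptions, which shows why each additional parameter combination is costly to test.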
Fig. 1 clearly shows that with very low learning rates, such as 0.0001, the accuracy grows much more slowly and has not reached a satisfactory value even after 90 training epochs. At higher learning rates, such as 0.001 and 0.01, the curve grows faster but stagnates after a certain number of epochs. In this case, it is likely that an undesired local minimum was reached instead of the desired global minimum. With a very high learning rate of 1, the curve grows very quickly, but only reaches a recognition rate of around 30% because the step size is too coarse. In addition, strong fluctuations can be seen here. They probably occur because the high learning rate makes the system "jump out" of the valleys again, so that it reaches only the flank of a valley and not the associated minimum.
Dynamic adjustment of learning rate
The best results were obtained with a learning rate of 0.1. However, even here only accuracies below 60% were achieved. Presumably, the valley belonging to the global minimum was found, but the minimum itself was not reached because the step size is too large and the resolution is therefore too coarse. It therefore makes sense to reduce the learning rate in the course of the training in order to get closer to the associated minimum: after the correct valley has been found with coarse resolution, the minimum of this valley is searched for with finer resolution. There are several ways to adjust the learning rate. In our experiments, we reduced the learning rate by a factor of 10 after a certain number of epochs. The following figure shows two training runs in which the learning rate was reduced after 20 epochs and after 30 epochs.
In Fig. 2 it can be seen that the accuracy increases significantly immediately after the reduction of the learning rate. In this case, it doesn't seem to make much of a difference whether the reduction occurs after 20 or 30 epochs. However, this is difficult to generalize because the accuracy can increase more slowly with another set of training parameters. To make sure that the adjustment doesn't happen too early, we decided to adjust every 30 epochs for the following experiments. As a next step, we examined different batch sizes and different learning rates on the order of 0.1 and again compared the training progress.
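The step schedule chosen here, a reduction by a factor of 10 every 30 epochs, can be sketched as a small function. The name `step_lr` and the 0-based epoch counting are assumptions made for this illustration.

```python
def step_lr(base_lr: float, epoch: int, step_size: int = 30, gamma: float = 0.1) -> float:
    """Learning rate for a 0-based epoch: multiplied by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

schedule = [step_lr(0.1, e) for e in (0, 29, 30, 59, 60, 89)]
# approximately [0.1, 0.1, 0.01, 0.01, 0.001, 0.001] (up to floating-point rounding)
```

PyTorch offers the same kind of schedule as `torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)`.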
Again it can be seen that the accuracy improves significantly shortly after the learning rate is reduced at epoch 30. While the curves for the various parameters still differ greatly in the first 30 epochs, they become more and more similar as the training progresses. However, the graphs suggest that the model is overfitting to the training data, since the accuracy decreases as training continues after the learning rate has been adjusted.
Avoid overfitting with augmentation
In addition, according to Image Classification on ImageNet, ResNet-50 training on the ImageNet dataset has already reached a higher accuracy (specifically 75.3%) than the ~70% we achieved here. For this reason, we applied the augmentation discussed above to the training data and again compared various parameters with each other.
After this adjustment, we get higher accuracies of up to 76%, which appear to remain stable even over a longer period of training. The augmentation has therefore probably prevented overfitting to the training data. Again, there are big differences at the beginning of the training, which become smaller as training continues and after the learning rate is reduced. However, it turns out that the model seems to train faster with larger batch sizes than with smaller ones. The fastest training progress was observed at a learning rate of 0.1 with a batch size of 768. Here the learning rate could even be reduced after fewer than 30 epochs in order to further accelerate the training.
From our training experiments, we were able to determine the following practical tips for finding good parameters for deep learning training of the ResNet50 model, especially for the learning rate and batch size:
- To choose the right learning rate, you have to search over a wide range. It is worth starting with several learning rates and continuing with the one that achieves the best accuracy in the first epochs
- It makes sense to start the search with a high learning rate
- Starting with a learning rate that is too low does not lead to an optimal result, even after a long training session
- Reducing the learning rate as training progresses leads to better results than training with a constant learning rate and is essential to get the best result
- A large batch size does not harm the learning results and leads to similar or slightly better results. So it makes sense to use the largest batch size that fits into GPU memory
- Augmentation is a good way to counteract overfitting of the model
- With more GPUs or when using GPUs with higher memory capacity, the batch size can be increased even more and the training can probably be further accelerated
- The accuracy of around 76% achieved with the methods described above corresponds to the accuracy of the pre-trained model from the PyTorch library
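The first tip, a coarse search over several learning rates, could be sketched as follows. The function `toy_trial` is a hypothetical stand-in that runs a few gradient steps on f(w) = w**2 and returns a score where higher is better. In practice it would be replaced by a function that trains the model for a few epochs and returns the Top1-accuracy.

```python
from typing import Callable, Dict, Sequence

def pick_learning_rate(candidates: Sequence[float],
                       early_score: Callable[[float], float]) -> float:
    """Return the candidate whose short trial run reaches the highest score."""
    results: Dict[float, float] = {lr: early_score(lr) for lr in candidates}
    return max(results, key=lambda lr: results[lr])

def toy_trial(lr: float, steps: int = 5) -> float:
    """Stand-in for a short training run: a few gradient steps on f(w) = w**2."""
    w = 1.0
    for _ in range(steps):
        w -= lr * 2.0 * w
    return -(w ** 2)  # negative loss, so that higher is better, like an accuracy

# Candidates spread over several orders of magnitude.
best = pick_learning_rate([0.0001, 0.001, 0.01, 0.1, 1.0], toy_trial)
# best is 0.1 for this toy problem
```

Spacing the candidates by factors of 10 keeps the number of trial runs small while still covering a wide range.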