Bookmark and Share

Case Study: ResNet50 with DALI

This blog post was originally published at NVIDIA's website. It is reprinted here with the permission of NVIDIA.

Let’s imagine a situation. You buy a brand-new, cutting-edge, Volta-powered DGX-2 server. You’ve done your math right, expecting a 2x performance increase in ResNet50 training over the DGX-1 you had before. You plug it into your rack cabinet and run the training. That’s when an unpleasant surprise pops up. Even though your math is correct, the speedup you’re getting lower than expected. Why? What can you do about it? Can DALI, aka the NVIDIA Data Loading Library, help you here?

When deep learning engineers hear “image classification”, they think about ResNet. Since its publication in 2015 ResNet became a widely recognized standard, and despite numerous descendants and later works, it still encompasses most classification tasks. Even though the residual architecture is considered computationally lighter than the classic deep neural network, ResNet still carries out a lot of processing, especially for CPU-based dataset augmentations. One reason why you are not observing the anticipated speed boost might be the large amount of computations CPU needs to perform, resulting in a bottleneck in this part of hardware.

The Bottleneck

To verify that CPU is indeed the bottleneck in our training, let’s conduct a simple experiment. Normally the preprocessing of dataset is performed on the CPU (using libraries like NumPy or Pillow) while the training runs on the GPU. Furthermore, they work in parallel – the CPU processing of a batch runs at the same time that the GPU conducts forward and back propagation of the previous batch.

Let’s compare the speed of a regular ResNet50 data processing pipeline and a synthetic pipeline, in which no data augmentation occurs, only training. At this point we don’t care that the neural network won’t train; all we need is to compare the performance of the two pipelines.

If we observe no performance difference between these two pipelines in our experiment, then the CPU workload proves irrelevant. On the other hand, if we discover lower speed in the regular pipeline, it means that the CPU is the likely bottleneck for training.

Indeed, as table 1 shows, we observe a difference in the performance of these two pipelines. This observation indicates a bottleneck within the preprocessing pipeline occuring on the CPU. To address this situation, we can make use of DALI – the library exclusively designed to offload the preprocessing from CPU to GPU, thereby reducing or eliminating the CPU bottleneck.

Table 1. CPU-bottleneck experiment. Training throughput of ResNet50

Benchmark Type

DGX-2

Regular ResNet50 training speed
(images/sec)

12,586

Synthetic ResNet50 training speed
(images/sec)

22,656

Benchmarking

Throughout this article we measure the time performance (in images per second) of ResNet50 training. The script that used is available on NVIDIA’s DeepLearningExamples GitHub page. The script allows the training to be run on both DALI and native data preprocessing pipelines.

We base our training on the ImageNet dataset from the original ResNet50. Every epoch consists of 1,281,167 images. The MXNet framework is used to conduct the training. We use NGC Container, version 19.04 to maintain a consistent software configuration.

In addition, the hardware setups used for training are shown in Table 2:

Table 2. Test System Configurations
System Number of GPUs Batch Size
NVIDIA DGX-2 16 256
AWS P3 8 192
NVIDIA TITAN RTX 1 128

We conducted trainings using the same set of hyperparameters to achieve the most accurate training speed comparison. The batch size and number of GPUs differ only from one hardware setup to the other, as shown above. Table 3 highlights the performance metrics for regular and synthetic ResNet50, using the hardware setups specified above.

Table 3. Training performance of ResNet50 depending on preprocessing pipeline

Pipeline Type

DGX-2

AWS P3

TITAN RTX

Standard Preprocessing Pipeline
(images/sec)

12,586

5,740

553

Synthetic Preprocessing Pipeline

(images/sec)

22,656

10,738

585

Augmentations for ResNet50

Prior to plugging the DALI in, let’s list all the data processing operations, including augmentations, that occur within the ResNet50 preprocessing pipeline:

  1. Read raw image data
  2. Decode the image
  3. Resize
  4. Random crop
  5. Random horizontal flip
  6. Normalize
  7. Transpose

Every image is stored in a compressed format (such as JPEG). We need to first decode the image in order to process it. Although image decoding is not widely recognized as a preprocessing operation, it always occurs and consumes significant portion of total preprocessing time.

Without delving into a detailed description of every operation, let’s mention a few things. First, both crop and horizontal flip operations are performed randomly. In other words,  determine on a given probability distribution whether to perform such operation and to what extent. Next, transposing (changing NHWC data layout to NCHW) is not part of ResNet50 itself. Instead, it is a part of the framework-dependent implementation – we want to transform the layout to a format that MXNet works best with.

Last, notice that the training script actually used two different data pipelines: one for training and another for validation, as shown in figures 2 and 3. The latter is just a subset of the former. Validation only performs operations required for formatting the input data to match the network’s input. This article focuses only on DALI pipeline for training.


Figure 1: ResNet50 training data preprocessing pipeline


Figure 2: ResNet50 validation data preprocessing pipeline

Applying Augmentations in DALI

Now that we have listed all the augmentations, let’s set up the DALI pipeline and plug it into ResNet50 training. For the sake of readability, we present simplified code snippets. See the script on GitHub for a complete working example.

Let’s declare the required augmentations:

class HybridTrainPipe(Pipeline):
     def __init__(self):
          super(HybridTrainPipe, self).__init__()

          self.input = ops.MXNetReader(path = [rec_path])
          self.decode = ops.nvJPEGDecoder(device = "mixed")
          self.rrc = ops.RandomResizedCrop(device = "gpu", size = crop_shape)
          self.cmnp = ops.CropMirrorNormalize(device = "gpu", crop = crop_shape)

          self.coin = ops.CoinFlip()

And define the preprocessing graph:

     def define_graph(self):
          rng = self.coin()
          self.jpegs, self.labels = self.input(name = "Reader")

          images = self.decode(self.jpegs)
          images = self.rrc(images)
          output = self.cmnp(images, mirror = rng)
          return [output, self.labels]

The code is mostly self-explanatory; more detail can be found in the DALI Developer Guide.  Images get decoded and then piped through all the augmentations listed in the previous section. Note the CoinFlip operator, which is a special case of an operator. It takes no input and produces output used to drive those operators that perform random operations. Also, you can see that we use two types of devices when declaring operations: gpu and mixed. The first one is quite obvious, whereas the second one denotes an operation which works on both CPU and GPU. Image decoding represents such an operation, since the CPU has to read image from disk and transfer it to GPU memory, where the image is decoded.

Now that we have DALI defined and ready to go, let’s plug it into the training and check the time measurements; you can see the results in table 4. Refer to our examples to see how to plug DALI into a deep learning framework.

Table 4: ResNet50 training performance with DALI

 

DGX-2

AWS P3

TITAN RTX

Performance
(images/sec)

21,034

10,138

575

Fusing Operations

Take a closer look at the HybridTrainPipe. You may notice that some operations are fused together into one operator, such as Resize combined with Crop. This represents a common pattern with an underlying simple idea. Performing multiple operations in one GPU CUDA kernel reduces the number of memory accesses, making overall training more performant.

Fused operators bring yet another optimization trick to the table. The ResNet50 preprocessing pipeline contains an image crop operation. Is it really necessary to decode the whole image, including parts discarded at a later stage? Simply put, no. A little assistance from nvJPEG, allows us to perform region-of-interest (ROI) decoding – cropping an image and decoding only its undiscarded parts.

Let’s introduce nvJPEGDecoderRandomCrop operator into the pipeline:

class HybridTrainPipe(Pipeline):
     def __init__(self):
          super(HybridTrainPipe, self).__init__()

          self.input = ops.MXNetReader(path = [rec_path])
          self.decode = ops.nvJPEGDecoderRandomCrop(device = "mixed",
               random_aspect_ratio = [min_random_aspect_ratio, max_random_aspect_ratio])
          self.resize = ops.Resize(device="gpu", resize_x = crop_shape[0], resize_y = crop_shape[1])
          self.cmnp = ops.CropMirrorNormalize(device = "gpu")

          self.coin = ops.CoinFlip(probability = 0.5)

     def define_graph(self):
          rng = self.coin()

          self.jpegs, self.labels = self.input(name = "Reader")
          images = self.decode(self.jpegs)
          images = self.resize(images)
          output = self.cmnp(images, mirror = rng)
          return [output, self.labels]

Table 5 shows the resulting performance after adding nvJPEGDecoderRandomCrop:

Table 5: ResNet50 training speed with fused CropDecoder

 

DGX-2

AWS P3

TITAN RTX

Performance
(images/sec)

21,592

10,418

578

Analysis

Let’s recap of all the measurements that we gathered so far in this article, shown in table 6.

 

Table 6: Measurements gathered in this article

 

Without DALI

With DALI

DALI with fused decoder

 

Performance
(samples/sec)

Speedup

Performance
(samples/sec)

Speedup

Performance
(sample/sec)

Speedup

DGX-2

12,586

N/A

21,034

67.12%

21,592

71.56%

AWS P3

5,740

N/A

10,138

76.62%

10,418

81.50%

TITAN RTX

553

N/A

575

3.98%

578

4.52%

ResNet50 without DALI is the reference. “Speedup” column shows the performance boost regarding the reference setup. In every hardware configuration tested, we can observe clear performance increases.

The numbers look pretty impressive. We achieved significant speedup for the configurations that represent state-of-the-art hardware for training. Thanks to DALI, ResNet50 trains more than 2 times faster. When training on high-end AWS, minutes shaved off the run are worth every penny (literally). Just using DALI can save 80% of your training costs.

A 5% boost might using TITAN RTX doesn’t seem like an impressive number. Let’s look at the underlying numbers a little more deeply. A rough estimate of training ResNet50 on TITAN RTX using ImageNet takes around 70 hours. Considering that implementing DALI saves you around 5% of that time, overall training shaves off 3.5 hours, which is a significant amount of time.

When to use DALI?

What indicators suggest taking a look at DALI to achieve improved training performance?

Let’s look back at the plain (no-DALI) training of ResNet50. The first test we did checked how the training behaved without preprocessing. We found a significant performance boost for the case without any preprocessing, indicating that the CPU is the bottleneck of the training. However, such observation doesn’t imply that using DALI will enhance the training speed.

Since DALI transfers preprocessing operations to the GPU, it consumes the resources that could otherwise be used for the training. So does consuming some portion of the GPU resources to speed up preprocessing adversely affect overall training performance? Analyzing the GPU utilization during training answers this dilemma. Figure 3 shows such analysis using nvidia-smi stats -d gpuUtil command during training.


Figure 3. GPU utilization histogram during training without DALI

We observe that the GPU in the no-DALI case often fails to achieve 100% utilization. This means that headroom exists for DALI, which can engage the underutilized GPU resources for preprocessing computations. Figure 4 highlights GPU utilization once we add DALI to the mix.


Figure 4. GPU utilization histogram during training with DALI

Combining both utilization charts helps us make a better comparison, shown in figure 5.


Figure 5. GPU utilization histogram during training with (green) and without (blue) DALI

We observe significant differences in GPU utilization. DALI enables you to really use everything your GPU has to offer in terms of performance.

Hence, consider taking the following steps:

  1. Make sure that the CPU is in fact the bottleneck of the training (in other words, try the synthetic data preprocessing pipeline).
  2. If so, also verify that GPU is underutilized throughout the training (nvidia-smi stats -d gpuUtil might come in handy for it).

What happens if any of these conditions isn’t met? You will still find reasons to use our cross-framework library, which supports many data formats. Stay tuned for future articles which elaborate more on why DALI is beneficial and what can be achieved with it.

Tell us what you think!

DALI is available as open source. We encourage your submissions and we’ll be happy to assist you with your contribution. See the DALI Web page and our GTC19 tech talk for more information and for next steps. If you like to use DALI for inference, check out this talk.

DALI is also available in NGC containers so you can download it using pip. We value your feedback, so please leave a comment below if you have any questions or feedback.

Michał Szołucha
Software Engineer, NVIDIA