keras: Why are Keras apps using multi_gpu_model slower than a single GPU?

multi_gpu_model comes from keras.utils and wraps an application model so that it trains on multiple GPUs. However, using multi_gpu_model seems to make training heavier and slower. Is this expected? The GPU I am using is an NVIDIA Tesla P100.

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 38 (19 by maintainers)

Most upvoted comments

Hi, I found some explanations for this issue.

It seems that not all models benefit from multi_gpu_model. Different models scale differently because of the overhead of weight synchronization.

ResNetV1 and ResNetV2 are a typical pair of examples: ResNetV2 scales better than ResNetV1.

There is a trade-off between the cost of training one mini-batch and the cost of weight synchronization. InceptionV3 has a heavy computational cost per mini-batch but relatively few weights to synchronize, so it gets a decent boost from multi_gpu_model.

However, any model with large Dense layers usually scales badly, like mnist_mlp: it has a light computational cost per mini-batch, but its weights are too large to synchronize efficiently. In the mnist_mlp example, the time spent on one weight synchronization is enough for a single GPU to finish MANY mini-batches, so mnist_mlp does not benefit from multi_gpu_model; its dense design results in bad scalability.

It also means that the same model trained on different GPU architectures can get a different answer about whether multi_gpu_model helps, since it largely depends on whether training one mini-batch takes longer than one weight synchronization. Another conclusion is that the faster a GPU is, the less likely multi_gpu_model is to speed up your model.
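
A rough back-of-the-envelope model of this trade-off (my own simplification for illustration, not a measurement from this thread): synchronous data parallelism only pays off when the per-batch compute time dominates the weight-synchronization time.

    def estimated_speedup(batch_compute_time, sync_time, n_gpus):
        """Idealized throughput gain of n_gpus GPUs over a single GPU."""
        # Single GPU: one mini-batch every batch_compute_time seconds.
        # n GPUs: n mini-batches every (batch_compute_time + sync_time) seconds.
        return n_gpus * batch_compute_time / (batch_compute_time + sync_time)

    # Compute-heavy model with few weights (InceptionV3-like): sync is cheap.
    print(estimated_speedup(batch_compute_time=1.0, sync_time=0.1, n_gpus=4))   # ~3.6x
    # Light compute, huge Dense weights (mnist_mlp-like): sync dominates, multi-GPU loses.
    print(estimated_speedup(batch_compute_time=0.05, sync_time=0.5, n_gpus=4))  # ~0.36x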

I ran the benchmarks myself, and the examples are taken verbatim from the keras/examples folder. For mnist_cnn.py, I added the following line before model.compile:

model = keras.utils.training_utils.multi_gpu_model(model, 4)
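
For context, a minimal sketch of where that line sits relative to model.compile. The network below is a small CNN in the spirit of keras/examples/mnist_cnn.py, not a verbatim copy of that script:

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
    from keras.utils import multi_gpu_model   # documented import path for multi_gpu_model

    # A small MNIST-style CNN standing in for the model defined in mnist_cnn.py.
    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),
        Dropout(0.25),
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.5),
        Dense(10, activation='softmax'),
    ])

    model = multi_gpu_model(model, 4)    # the single added line, placed before compile
    model.compile(loss='categorical_crossentropy',
                  optimizer='adadelta',
                  metrics=['accuracy'])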

And here is the benchmark for mnist_cnn.py on NVIDIA Tesla P100s (4 GPUs, 16 GB memory per device):

original_single: gpu=1, perf = 5s/epoch 75us/step
multi_gpu_model: gpu=2, perf = 5s/epoch 74us/step
multi_gpu_model: gpu=3, perf = 5s/epoch 81us/step
multi_gpu_model: gpu=4, perf = 6s/epoch 103us/step

As for cifar_cnn.py, I also added the same line, and the performance is as follows:

# By default, data_augmentation = True in cifar_cnn.py
original_single: gpu=1, perf = 23s 15ms/step
multi_gpu_model: gpu=2, perf = 23s 15ms/step
multi_gpu_model: gpu=3, perf = 24s 15ms/step
multi_gpu_model: gpu=4, perf = 22s 14ms/step

There is hardly any difference, because CPU-side data augmentation is the real bottleneck. If we turn off data_augmentation, the performance is as follows:

# data_augmentation = False
original_single: gpu=1, perf = 14s 286us/step
multi_gpu_model: gpu=2, perf = 16s 325us/step
multi_gpu_model: gpu=3, perf = 19s 389us/step
multi_gpu_model: gpu=4, perf = 22s 445us/step

In every case the performance of multi_gpu_model is not better but worse.

@mohapatras It seems that CPU-side data preprocessing can be one of the reasons that greatly slows down multi-GPU training. Could you try disabling some preprocessing options such as data augmentation and see whether you get any boost?

Besides, the current version of multi_gpu_model seems to benefit only large models such as Xception, where weight synchronization is not the bottleneck. When it wraps a simple model such as mnist_cnn or cifar_cnn, weight synchronization happens very frequently and makes the whole training much slower.

I’m also working on a customized multi-GPU implementation to see whether it can do better.

I also experienced this problem yesterday. When I increased batch_size, multi-GPU became faster than a single GPU. Maybe this is because increasing batch_size makes the GPU computational cost larger while the communication cost between CPU and GPU stays the same.

@TristanJM Yes. Regardless of whether a model gets a compute-scaling boost from multiple GPUs, another case where we have to use multi_gpu_model is when a single GPU cannot fit a large batch_size; there the model benefits from multi_gpu_model through memory scaling.

multi_gpu_model with cpu_merge=True will allow the data to be split in place before it is pushed to GPU memory

The optimal solution is not to split at all. As I put it in the tensorpack docs:

Splitting a tensor for data-parallel training makes no sense at all, only to put unnecessary shape constraints on the data. By letting each GPU train on its own input tensors, they can train on inputs of different shapes simultaneously.

It would be better if there were an example showing how to use TensorFlow native data tensors (TFRecord) as the input to Keras multi_gpu_model, which might be faster than the current inefficient get_slice approach.

    import tensorflow as tf
    from keras.applications import ResNet50
    from keras.utils import multi_gpu_model
    import numpy as np

    num_samples = 1000
    height = 224
    width = 224
    num_classes = 1000

    # Instantiate the base model on the CPU so its weights live in host memory.
    with tf.device('/cpu:0'):
        model = ResNet50(weights=None,
                         input_shape=(height, width, 3),
                         classes=num_classes)

    # Replicate the model on 2 GPUs; each GPU gets half of every batch.
    parallel_model = multi_gpu_model(model, 2)
    parallel_model.compile(loss='categorical_crossentropy',
                           optimizer='rmsprop')

    # Random dummy data, just to measure throughput.
    x = np.random.random((num_samples, height, width, 3))
    y = np.random.random((num_samples, num_classes))
    parallel_model.fit(x, y, epochs=20, batch_size=256)
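
For what it’s worth, here is a rough sketch of what the TFRecord-fed variant asked for above might look like with standalone Keras on the TF 1.x backend. The file name train.tfrecords, the feature spec in parse_example, and the shapes are all assumptions, and I have not verified that multi_gpu_model’s slicing plays nicely with tensor-fed inputs, so treat it as a starting point rather than a working recipe:

    import tensorflow as tf
    from keras.applications import ResNet50
    from keras.layers import Input
    from keras.utils import multi_gpu_model

    height, width, num_classes = 224, 224, 1000
    batch_size = 256

    # Hypothetical parser; the feature names and shapes are assumptions.
    def parse_example(serialized):
        features = tf.parse_single_example(
            serialized,
            features={'image': tf.FixedLenFeature([height * width * 3], tf.float32),
                      'label': tf.FixedLenFeature([num_classes], tf.float32)})
        image = tf.reshape(features['image'], (height, width, 3))
        return image, features['label']

    dataset = (tf.data.TFRecordDataset('train.tfrecords')   # hypothetical file
               .map(parse_example)
               .repeat()
               .batch(batch_size))
    images, labels = dataset.make_one_shot_iterator().get_next()

    with tf.device('/cpu:0'):
        inputs = Input(tensor=images)    # feed the TF tensor directly instead of numpy arrays
        model = ResNet50(weights=None, input_tensor=inputs, classes=num_classes)

    parallel_model = multi_gpu_model(model, 2)
    parallel_model.compile(loss='categorical_crossentropy',
                           optimizer='rmsprop',
                           target_tensors=[labels])   # targets also come from the input pipeline
    parallel_model.fit(epochs=20, steps_per_epoch=1000 // batch_size)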

I also experienced this problem yesterday. When I increased batch_size, multi-GPU became faster than a single GPU. Maybe this is because increasing batch_size makes the GPU computational cost larger while the communication cost between CPU and GPU stays the same.

That’s because Keras will split each batch according to the total number of gpus used in multi_gpu_model. A model trained with a single gpu and batch size of 64 will be approx. as fast as that same model trained using multi_gpu_model with 2 gpus and the same batch size, as each gpu will process 32 samples at once. So to be able to compare results you should multiply the original batch size by the number of gpus used.

Thus, if 2 gpus are used, your batch size should be 128. Keras will split the samples into 2 groups of 64 samples for each gpu. That way your processing should be ~2x faster than using a single gpu.
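
A minimal sketch of that rule, reusing the model, x, and y from the ResNet50 snippet above (the per-GPU batch of 64 is just an example value):

    from keras.utils import multi_gpu_model

    N_GPUS = 2
    PER_GPU_BATCH = 64

    parallel_model = multi_gpu_model(model, gpus=N_GPUS)
    parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    # Scale the global batch so each GPU still sees PER_GPU_BATCH samples per step.
    parallel_model.fit(x, y, epochs=20, batch_size=PER_GPU_BATCH * N_GPUS)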

That’s because you’re looking at the wrong metric: seconds per step. It makes sense that this is slower than for non-distributed training, because distributed training does more work per step. Each individual GPU has to do backpropagation on a batch, send the gradients back to RAM, apply the gradients to the parameters stored in RAM, and finally sync the parameter values stored in RAM with those stored in GPU memory. With non-distributed training it only has to apply gradients to the parameters stored in GPU memory, so of course a single batch takes less time.

Distributed training sees a speedup when you look at the number of global steps / sec (i.e. the number of batches trained by all the GPU workers). This will cause the model to converge faster.
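
Some illustrative arithmetic (the numbers are made up, not taken from this thread) showing how per-step time can regress while global throughput still improves:

    per_gpu_batch = 64
    t_single = 0.10   # seconds per batch on 1 GPU
    t_multi = 0.13    # seconds per batch per GPU on 4 GPUs, including gradient sync

    single_gpu_throughput = per_gpu_batch / t_single        # 640 samples/s
    four_gpu_throughput = 4 * per_gpu_batch / t_multi       # ~1969 samples/s

    # The per-step time got worse (0.13 s > 0.10 s), but global samples/s is
    # roughly 3x higher, which is what determines how fast the model converges.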

@ppwwyyxx I found some of the reasons why Keras is around 2x slower than TensorFlow when training ResNet50. Firstly, Keras Conv2D adds bias weights to every conv layer by default, which brings extra computation for the bias forward and backward passes.

Secondly, even when Keras is configured for the channels_first image format, it still does extra GPU work to frequently transpose between NCHW and NHWC. In other words, it is not computing fully in NCHW, and all of this tensor conversion adds further overhead.

Thirdly, since Keras 2.2.0, multi_gpu_model with cpu_merge=True allows the data to be split in place before it is pushed to GPU memory, which only reduces the overhead slightly, from about 2x to 1.8x~1.9x.

There may still be other reasons I haven’t found yet.

In my experiments with the NHWC format, it finishes training 1000 samples in 5 s/epoch using 2 GPUs, and 3 s/epoch using 4 GPUs.

So it’s already 2x slower than it should be, and then of course it scales better. The slower the model is, the better it scales.

it says that channels_last for Tensorflow will have the best performance.

This is definitely not true for TensorFlow on a P100. The cuDNN implementation on every GPU architecture before Volta favors NCHW over NHWC. It may be true for Keras, however. For example, there is a recent performance fix in https://github.com/keras-team/keras/pull/8785 that makes it use a faster batchnorm kernel for NCHW. I wouldn’t be surprised if NHWC was faster than NCHW before this PR, but the PR seems to suggest that NCHW is faster now.

Your code uses the NHWC image layout, which is slower than NCHW. As you’ve pointed out, the slower the code is, the better it scales. https://www.tensorflow.org/performance/benchmarks shows 422 images/s for ResNet50 training on two P100s, so you should expect your code to finish each epoch (1000 samples) in about 2.36s. I assume that is not the case.
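
For anyone who wants to act on the conv-bias and NCHW points above in a model they define themselves (the prebuilt keras.applications models cannot easily be changed this way), here is a minimal sketch of the relevant knobs; it is my own illustration, not a fix proposed in this thread:

    import keras.backend as K
    from keras.layers import Conv2D, BatchNormalization

    # Stay in NCHW, which cuDNN favors on pre-Volta GPUs such as the P100.
    K.set_image_data_format('channels_first')

    # When a conv is immediately followed by BatchNormalization, its bias is
    # redundant, so dropping it avoids the extra bias forward/backward cost.
    conv = Conv2D(64, (3, 3), use_bias=False)
    bn = BatchNormalization(axis=1)    # axis=1 is the channels axis in NCHW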