tensorflow: TensorFlow 60-80% slower than PyTorch on training Wide ResNet
cc @tfboyd
From https://github.com/tensorflow/tensorflow/issues/7187#issuecomment-295502315
On an AWS p2.xlarge, using the tensorflow/tensorflow:1.0.1-devel-gpu Docker image as a base, I see ~270 ms per batch while training a WRN-16-4 without dropout on CIFAR-10.
Using the PyTorch implementation from https://github.com/xternalz/WideResNet-pytorch, I instead see ~150 ms per batch for the same model.
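For reference, the WRN-16-4 naming follows the Wide ResNet paper's depth = 6n + 4 convention; a quick sketch of the resulting block counts and channel widths (the helper name is mine, not from either repository):

```python
# Depth/width arithmetic for a WRN-depth-k (Zagoruyko & Komodakis), here WRN-16-4.
# The initial conv plus three groups of residual blocks give depth = 6*n + 4,
# where n is the number of basic blocks per group.
def wrn_config(depth, widen_factor):
    assert (depth - 4) % 6 == 0, "WRN depth must be 6n + 4"
    n = (depth - 4) // 6  # basic blocks per group
    widths = [16] + [16 * widen_factor * (2 ** i) for i in range(3)]
    return n, widths

n, widths = wrn_config(16, 4)
print(n, widths)  # 2 blocks per group; channel widths [16, 64, 128, 256]
```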
My implementation of Wide ResNets uses NCHW and fused batch norm. It does use feed_dict for data loading, but I’ve observed with nvidia-smi that my GPU utilization stays near 100%.
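A minimal illustration of the NHWC-versus-NCHW layout difference mentioned above, using NumPy in place of tf.transpose (the variable names are illustrative):

```python
import numpy as np

# NHWC (TensorFlow's historical default) vs. NCHW (the layout cuDNN kernels
# prefer, and what the implementation above uses). The tensors hold the same
# values; only the axis order / memory layout differs.
batch, height, width, channels = 2, 32, 32, 3
x_nhwc = np.arange(batch * height * width * channels, dtype=np.float32)
x_nhwc = x_nhwc.reshape(batch, height, width, channels)

# Equivalent of tf.transpose(x, [0, 3, 1, 2]):
x_nchw = x_nhwc.transpose(0, 3, 1, 2)

print(x_nhwc.shape)  # (2, 32, 32, 3)
print(x_nchw.shape)  # (2, 3, 32, 32)
```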
To reproduce:
- Clone https://github.com/4Catalyzer/dl-papers, and go to that directory.
- Check out the benchmark branch.
- Build the Docker image, which is based on the Docker Hub image above:
$ docker build -t dl-papers .
- Run the Docker image using NVIDIA Docker:
$ nvidia-docker run --rm -it dl-papers /bin/bash
- Run the TF WRN-16-4 training:
# python -m dl_papers.wide_resnet.train cifar10
- Observe the logged batch timings, then kill the process.
- In the same Docker container, set up the PyTorch Wide ResNet example:
# cd ..
# pip install http://download.pytorch.org/whl/cu80/torch-0.1.11.post5-cp27-none-linux_x86_64.whl
# pip install torchvision tensorboard_logger
# git clone https://github.com/xternalz/WideResNet-pytorch.git
# cd WideResNet-pytorch
- Run PyTorch training:
# python train.py --dataset cifar10 --layers 16 --widen-factor 4 -p 1
- Observe logged batch timings.
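To compare the two sets of logged timings on equal footing, a framework-agnostic sketch of per-batch wall-clock measurement (train_step is a placeholder for either framework's training step, not a function from either repo; on a GPU you would also need to synchronize before reading the clock):

```python
import time

def time_batches(train_step, num_batches=10, warmup=2):
    """Average wall-clock milliseconds per batch, skipping warmup iterations
    (the first batches include one-off costs such as cuDNN autotuning)."""
    timings = []
    for i in range(num_batches):
        start = time.perf_counter()
        train_step()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if i >= warmup:
            timings.append(elapsed_ms)
    return sum(timings) / len(timings)

# Stand-in step for illustration; replace with a real training step.
avg = time_batches(lambda: sum(range(10000)))
print("%.2f ms/batch" % avg)
```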
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 39 (39 by maintainers)
I’d like to highlight that I think the biggest gap here is in DX. We started migrating from Theano to TensorFlow the first week of this year. This is partly my frustration speaking, but the process looked something like this:
1. Port the code naively from Theano. Performance is poor.
2. Switch from tf.layers.batch_normalization to tf.contrib.layers.batch_norm. This shows a performance improvement.
3. Stop using feed_dict. See negligible speedup.
As a developer, this is a really suboptimal experience.
My impression is that the modal TF example or published code is at something like our step (1) above, in that it uses NHWC, uses unfused batch norm, and doesn’t enable the non-fused Winograd convolution. Correspondingly, performance is quite far from optimal.
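For context on why the Winograd path matters: Winograd convolution trades multiplications for cheap transforms. A back-of-the-envelope count for the F(2x2, 3x3) variant (generic arithmetic for the transform family behind cuDNN's WINOGRAD algorithms, not cuDNN's actual implementation):

```python
# Multiplication counts for a 3x3 convolution, direct vs. Winograd F(2x2, 3x3).
def direct_muls(out_tile, kernel=3):
    # Each of the out_tile*out_tile outputs needs kernel*kernel multiplies.
    return (out_tile ** 2) * (kernel ** 2)

def winograd_muls(out_tile, kernel=3):
    # F(m x m, r x r) computes an m x m output tile with (m + r - 1)^2
    # element-wise multiplies in the transformed domain.
    return (out_tile + kernel - 1) ** 2

d = direct_muls(2)    # 36 multiplies per 2x2 output tile
w = winograd_muls(2)  # 16 multiplies per 2x2 output tile
print(d, w, d / w)    # 36 16 2.25
```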
By contrast, though with a smaller sample size, the PyTorch examples I’ve seen generally seem to do “the right thing” performance-wise, and seem to run quickly out-of-the-box. (Also, the status of the built-in PyTorch layer API makes separate PyTorch examples far more consistent in terms of how the code reads.)
I’m very grateful for your help in tracking down these issues, but I really wish the out-of-the-box experience were better, and that it didn’t take so much work to get to this point.
I managed to get a set of traces from PyTorch and TensorFlow of which convolution algorithms and input shapes they use:
PyTorch:
TensorFlow:
With TF_ENABLE_WINOGRAD_NONFUSED=1, TensorFlow always chooses the same algorithm as PyTorch. But the log shows that TensorFlow sometimes calls cuDNN with irregular input shapes when stride=2, which may cause a performance issue.
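One plausible reading of the irregular shapes: with SAME padding and stride 2, TensorFlow puts any odd leftover padding on the bottom/right, so downsampling layers can hand cuDNN asymmetrically padded inputs. A sketch of that padding arithmetic (the helper is illustrative, not TF's code):

```python
import math

# TensorFlow's SAME padding for a strided convolution: output size is
# ceil(in / stride), and any odd total padding goes on the bottom/right,
# so stride-2 layers can present cuDNN with asymmetrically padded inputs.
def same_padding(in_size, kernel, stride):
    out_size = math.ceil(in_size / stride)
    pad_total = max((out_size - 1) * stride + kernel - in_size, 0)
    pad_before = pad_total // 2
    pad_after = pad_total - pad_before
    return out_size, pad_before, pad_after

# The stride-2 downsampling convs in WRN on CIFAR-10 (32x32 -> 16x16 -> 8x8):
for size in (32, 16):
    print(size, same_padding(size, kernel=3, stride=2))  # padding is (0, 1)
```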