vision: Unable to reproduce classification accuracy using the reference scripts

🐛 Bug

I have been trying to reproduce the reported 79.312% ImageNet accuracy of resnext101_32x8d using the reference scripts, but I could only obtain 75%-76%. I tried two different training runs on 64 GPUs:

  • 16 nodes of 4 V100 GPUs
  • 8 nodes of 8 V100 GPUs

but obtained similar results.

To Reproduce

Clone the master branch of torchvision, cd into vision/references/classification, and submit a training job on 64 GPUs with the arguments --model resnext101_32x8d --epochs 100.
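For reference, the per-node launch looks roughly like the command below (a sketch only: the ImageNet path is a placeholder, and the multi-node submission that spreads this over 64 GPUs depends on the cluster scheduler and rendezvous settings):

python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py --model resnext101_32x8d --epochs 100 --data-path /path/to/imagenet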

The training logs (including std logs) are attached for your information: log.txt and resnext101_32x8d_reproduced.log

Expected behavior

Final top-1 accuracy should be around 79%.

Environment

  • PyTorch / torchvision Version (e.g., 1.0 / 0.4.0): 1.8.1
  • OS (e.g., Linux): Linux
  • How you installed PyTorch / torchvision (conda, pip, source): pip
  • Python version: 3.8
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: V100

cc @vfdev-5

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

I was able to reproduce the results:

Acc@1 79.314 Acc@5 94.566 

Here is my output log: resnext101_32x8d_logs.txt

Comparing the logs shared above with mine, I see that the number of workers and the world size differ: workers=16, world_size=8 (mine) vs. workers=10, world_size=64.

What did you pass for --gpus-per-node? At the top of the log file, it says 4 GPUs per node. I guess the reported results are with gpus-per-node=8.

This is the command I ran:

srun -p train --cpus-per-task=16 -t 110:00:00 --gpus-per-node=8 python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --model resnext101_32x8d --epochs 100 --output-dir logs/run2 > logs/run2/resnext101_32x8d_logs.txt 2>&1
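(For context on the difference: the world size is nodes × processes per node, so 16 × 4 = 64 in the runs reported above versus 1 × 8 = 8 here; at a per-GPU batch size of 32 that is an effective batch of 2048 versus 256. This framing is my reading of the logs rather than something stated explicitly in the thread.)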

@datumbox Thank you so much for the detailed response and for your transparency! In the issue that you mentioned, there appears to be enough information to reproduce the results (except maybe one detail; let me post a question there).

Thanks for the results. This confirms that 0.1 is the correct learning rate for 8 GPUs (with batch size 32). If we train on 64 GPUs (as stated in the documentation) without scaling the learning rate, we obtain results similar to those I posted above. I will soon create a PR to clarify the documentation.
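For example, under the linear scaling rule (an assumption on my part; the thread only establishes that lr=0.1 is correct for world_size=8 with batch size 32 per GPU), a 64-GPU run at the same per-GPU batch size would use lr = 0.1 × 64 / 8 = 0.8, i.e. passing something like --batch-size 32 --lr 0.8 to train.py on each node.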

Hi @netw0rkf10w

Sorry for the delay in replying.

IIRC the accuracies for ResNet-50 were obtained after training the model and recomputing the batch norm statistics.

For ResNeXt, we just report the numbers after training. FYI, here are the training logs for resnext101_32x8d that we provide in torchvision: https://gist.github.com/fmassa/4ce4a8146dbbdbf6e1f9a3e0ec49e3d8 (we report results for the checkpoint at epoch 96).

@datumbox once you are back, can you try kicking off the runs for ResNet-50 and ResNeXt to double-check?
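As a footnote on the batch-norm recalibration mentioned above, here is a minimal sketch of what recomputing the running statistics could look like (my illustration only; the thread does not show the exact procedure used for the ResNet-50 numbers):

import torch

def recompute_bn_stats(model, data_loader, device, num_batches=200):
    # Reset running statistics and switch to a cumulative moving average.
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None
    # BatchNorm layers only update their running stats in train mode.
    model.train()
    with torch.no_grad():
        for i, (images, _) in enumerate(data_loader):
            if i >= num_batches:
                break
            model(images.to(device))
    model.eval()
    return model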