vision: Unable to reproduce classification accuracy using the reference scripts
🐛 Bug
I have been trying to reproduce the reported 79.312% top-1 accuracy of resnext101_32x8d on ImageNet using the reference scripts, but I could obtain only 75%-76%. I tried two different trainings on 64 GPUs:
- 16 nodes of 4 V100 GPUs
- 8 nodes of 8 V100 GPUs
but obtained similar results in both cases.
To Reproduce
Clone the master branch of torchvision, then `cd vision/references/classification` and submit a training on 64 GPUs with arguments `--model resnext101_32x8d --epochs 100`.
The training logs (including std logs) are attached for your information: log.txt and resnext101_32x8d_reproduced.log
Expected behavior
Final top-1 accuracy should be around 79%.
Environment
- PyTorch / torchvision version: 1.8.1
- OS: Linux
- How you installed PyTorch / torchvision (conda, pip, source): pip
- Python version: 3.8
- CUDA/cuDNN version: 10.2
- GPU models and configuration: V100
cc @vfdev-5
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 16 (16 by maintainers)
I was able to reproduce the results:
Here is my output log: resnext101_32x8d_logs.txt
Comparing the logs shared above with mine, I see that the workers and world size are different:
workers=16, world_size=8
vs workers=10, world_size=64
What did you pass for --gpus-per-node? At the top of the log file, it says 4 GPUs per node. I guess the reported results are with gpus-per-node=8.
This is the command I ran:
@datumbox Thank you so much for the detailed response and for your transparency! In the issue that you mentioned, there appears to be enough information to reproduce the results (except perhaps one detail; let me post a question there).
Thanks for the results. This confirms that 0.1 is the correct learning rate for 8 GPUs (with batch size 32 per GPU). If we train on 64 GPUs (as stated in the documentation) without scaling the learning rate, then we obtain results similar to those I posted above. I will soon create a PR to clarify the documentation.
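For reference, the linear learning-rate scaling implied here can be sketched as follows (a minimal sketch; only the base configuration of 8 GPUs with per-GPU batch size 32 and the 0.1 learning rate come from this thread, the function name is illustrative):

```python
def scale_lr(base_lr, base_world_size, world_size):
    """Linearly scale the learning rate with the global batch size.

    With a fixed per-GPU batch size, the global batch grows with the
    number of GPUs, so the learning rate should grow proportionally.
    """
    return base_lr * world_size / base_world_size

# 0.1 is tuned for 8 GPUs x batch size 32 (global batch 256);
# on 64 GPUs the global batch is 8x larger:
print(scale_lr(0.1, 8, 64))  # 0.8
```

Without this scaling, a 64-GPU run effectively trains with a learning rate 8x too small relative to its global batch size, which is consistent with the accuracy gap reported above.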
Hi @netw0rkf10w
Sorry for the delay in replying.
IIRC the accuracies for ResNet-50 were obtained after training the model and recomputing the batch norm statistics.
For ResNeXt, we just report the numbers after training. FYI, here are the training logs for resnext101_32x8d that we provide in torchvision https://gist.github.com/fmassa/4ce4a8146dbbdbf6e1f9a3e0ec49e3d8 (we report results for checkpoint at epoch 96)
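The batch-norm recomputation mentioned for ResNet-50 can be sketched roughly like this (an illustrative sketch, not the actual script used; `recompute_bn_stats` and the `(images, targets)` loader interface are assumptions):

```python
import torch
import torch.nn as nn

def recompute_bn_stats(model, loader, device="cpu"):
    """Reset BatchNorm running statistics and re-estimate them with a
    cumulative moving average over one pass of the training data."""
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # None => cumulative moving average
    model.train()  # BN uses batch stats and updates running stats
    with torch.no_grad():
        for images, _ in loader:
            model(images.to(device))
    model.eval()
```

Setting `momentum=None` makes PyTorch's BatchNorm accumulate an exact running average over the pass instead of an exponential one, so the final statistics do not depend on batch ordering.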
@datumbox once you are back, can you try kicking the runs for ResNet-50 and ResNeXt to double-check?