examples: training randomly freezes when training AlexNet from scratch.

Sometimes the training process simply gets stuck during testing:

Epoch: [0][5000/5005]   Time 0.100 (0.335)      Data 0.000 (0.244)      Loss 5.9800 (6.5614)    Prec@1 1.953 (0.735)    Prec@5 7.812 (2.896)
Test: [0/196]   Time 7.905 (7.905)      Loss 4.1344 (4.1344)    Prec@1 16.016 (16.016)  Prec@5 51.562 (51.562)

Or, more frequently, the line Test: [0/196] never appears and the whole process gets stuck at the line Epoch: [0][5000/5005].

It has stayed like this for several hours, and top shows no process using any CPU.

To train the network, I ran CUDA_VISIBLE_DEVICES=1 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 20 --lr 0.01 --workers 20 --batch-size 256 /ssd/cv_datasets/ILSVRC2015/Data/CLS-LOC 2>&1 | tee alexnet_train.log

This happens on both a CentOS 6 machine and an Ubuntu 14.04 machine.
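For a hang like the one described above, where the process sits idle in top, one way to see where it is stuck is Python's standard faulthandler module. The following is only a minimal sketch and not part of the example script: the helper names enable_hang_diagnostics and heartbeat, and the 600-second timeout, are assumptions made for illustration. It dumps the stack of every thread in the current process if no heartbeat arrives within the timeout, and also on demand via kill -USR1 <pid>.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s=600.0, log_file=None):
    """Arm a watchdog that dumps all thread stacks if the process stalls.

    Hypothetical helper for illustration; timeout_s is an arbitrary default.
    log_file must be a real file object (with fileno()) that stays open.
    """
    log_file = log_file if log_file is not None else sys.stderr
    # On-demand dump: `kill -USR1 <pid>` prints every thread's stack.
    # SIGUSR1 does not exist on Windows, so guard the registration.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)
    # Automatic dump: fires every timeout_s seconds until re-armed or cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


def heartbeat(timeout_s=600.0, log_file=None):
    """Re-arm the watchdog; a new call replaces the previous timer."""
    log_file = log_file if log_file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
```

Assuming a loop like the one in main.py, enable_hang_diagnostics() would be called once at startup and heartbeat() once per training or validation iteration. The dump covers only the process that called it; data-loading workers started with --workers run as separate processes and are not included.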

About this issue

State: open
Created 7 years ago
Comments: 18

Most upvoted comments

This hit me this week. On an Ubuntu 16 machine everything works fine, but in a Docker container it freezes randomly. It also completed successfully once.

umariqb on Dec 6, 2017