ssd.pytorch: StopIteration ERROR during training

My environment is:

  • 8 GB RAM
  • Ubuntu 16.04 LTS
  • PyTorch 0.4 with CUDA 9.0 and cuDNN v7
  • Python 3.5
  • GeForce GTX 1080 (8 GB)

Since I have a GeForce GTX 1080 with 8 GB of memory, I tried to train the network with a batch size of 16. I ran the training with python3 train.py --batch_size=16, and after about 1030 iterations I got:

.....
iter 1020 || Loss: 9.2115 || timer: 0.1873 sec.
iter 1030 || Loss: 8.1139 ||
Traceback (most recent call last):
  File "train.py", line 255, in <module>
    train()
  File "train.py", line 165, in train
    images, targets = next(batch_iterator)
  File "/home/han/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 326, in __next__
    raise StopIteration
StopIteration

So I tried training with other batch sizes such as 8 and 20, and every time it raised the same error. I also calculated batch_size * iteration step, and the result is around 16,480 every time, regardless of the batch size and number of iterations.
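A minimal sketch, independent of ssd.pytorch (the toy dataset and its size below are made up), showing that a plain DataLoader iterator raises the same StopIteration once batch_size * iterations reaches the dataset size:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # hypothetical toy dataset of 100 samples (images, targets)
    dataset = TensorDataset(torch.zeros(100, 3), torch.zeros(100))
    data_loader = DataLoader(dataset, batch_size=16)

    batch_iterator = iter(data_loader)
    steps = 0
    while True:
        try:
            images, targets = next(batch_iterator)   # fine for the first len(data_loader) == 7 calls
            steps += 1
        except StopIteration:
            # raised as soon as one full pass over the dataset is exhausted,
            # i.e. once batch_size * steps covers the whole dataset
            print("StopIteration after", steps, "iterations")
            break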

The part of the PyTorch DataLoader where the error is raised is

    if self.batches_outstanding == 0:
        self._shutdown_workers()
        raise StopIteration

in /home/han/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py.

Is it possible that my 8 GB of RAM is causing this problem? (But I checked with the Ubuntu System Monitor, and there was enough free RAM.)

Is there anybody who can solve this problem? Please help me, guys šŸ˜ƒ

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 16

Most upvoted comments

I fixed this problem. I changed the code in train.py from

    images, targets = next(batch_iterator)

to

    try:
        images, targets = next(batch_iterator)
    except StopIteration:
        # the iterator is exhausted after one epoch, so re-create it and retry
        batch_iterator = iter(data_loader)
        images, targets = next(batch_iterator)

Now I have checked that all the losses are going down šŸ˜ƒ

Thanks, @chenxinyang123.

I think it is because "batch_iterator" is used up. You can get len(batch_iterator) (that is, the number of samples divided by the batch size) and use it to decide when to define a new batch_iterator. In other words, you can call batch_iterator = iter(data_loader) again after len(batch_iterator) iterations.
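A minimal sketch of that counter-based approach, assuming names along the lines of ssd.pytorch's train.py (data_loader, batch_iterator; max_iter here is just an illustrative stand-in for the configured iteration count):

    epoch_size = len(data_loader)   # batches per epoch, same value as len(batch_iterator)
    batch_iterator = iter(data_loader)

    for iteration in range(max_iter):
        # refresh the iterator once an epoch's worth of batches has been consumed
        if iteration != 0 and iteration % epoch_size == 0:
            batch_iterator = iter(data_loader)
        images, targets = next(batch_iterator)
        # ... forward pass, loss, backward pass, optimizer step ...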

@seongkyun Unluckily, in my case the training process sometimes diverges, i.e. the losses stop falling and become nan, and it seems to happen randomly.

@d-li14 Maybe the batch size is the problem. I don't know why this happens, but sometimes a nan loss occurs with the wrong batch size.