ssd.pytorch: StopIteration ERROR during training
My environment is:
8 GB RAM, Ubuntu 16.04 LTS, PyTorch 0.4 with CUDA 9.0 and cuDNN v7, Python 3.5, GeForce GTX 1080 8 GB.
Since I have a GeForce GTX 1080 with 8 GB, I tried to train the network with a batch size of 16.
I ran the training with
python3 train.py --batch_size=16
After 1030 iterations, it crashes:
.....
iter 1020 || Loss: 9.2115 || timer: 0.1873 sec.
iter 1030 || Loss: 8.1139 || Traceback (most recent call last):
File "train.py", line 255, in <module>
train()
File "train.py", line 165, in train
images, targets = next(batch_iterator)
File "/home/han/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 326, in __next__
raise StopIteration
StopIteration
So I tried training with other batch sizes like 8 and 20, and every time it raises the same error.
I calculated
batch_size * iteration step
and every time the result is around 16,480, regardless of the batch size and number of iteration steps.
The part of the PyTorch dataloader where the problem occurs is
if self.batches_outstanding == 0:
    self._shutdown_workers()
    raise StopIteration
in /home/han/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py
Could my 8 GB of RAM be causing this problem? (I checked with the Ubuntu System Monitor and there was enough RAM.)
Is there anybody who can solve this problem? Please help me, guys!
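(A minimal toy sketch of what those numbers suggest, independent of the VOC data itself: a DataLoader iterator only yields about len(dataset) / batch_size batches before raising StopIteration, so batch_size * iterations at the crash is roughly the dataset size rather than a RAM limit.)
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(100, 3))  # toy dataset with 100 samples
loader = DataLoader(dataset, batch_size=16)

batch_iterator = iter(loader)
for _ in range(len(loader)):   # ceil(100 / 16) = 7 batches exist
    batch = next(batch_iterator)
next(batch_iterator)           # one call too many: raises StopIteration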
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 16
Commits related to this issue
- fixed issue #214, changed default lr to a very low value — committed to kentaroy47/ssd.pytorch by kentaroy47 6 years ago
- Add #214 fix — committed to sakaia/ssd.pytorch by deleted user 5 years ago
- fix issue https://github.com/amdegroot/ssd.pytorch/issues/214 — committed to yodhcn/ssd.pytorch by yodhcn 2 years ago
I fixed this problem. I've changed the code in
train.py
from
images, targets = next(batch_iterator)
to
try:
    images, targets = next(batch_iterator)
except StopIteration:
    batch_iterator = iter(data_loader)
    images, targets = next(batch_iterator)
Now I checked that all the losses keep going down.
Thanks, @chenxinyang123.
I think it is because "batch_iterator" is used up. You can get len(batch_iterator) (which is the number of data samples divided by the batch size) and use it to define a new batch_iterator. That is to say, you can call batch_iterator = iter(data_loader) after every len(batch_iterator) iterations, as in the sketch below.
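A minimal sketch of that suggestion, assuming a training loop roughly like the one in train.py (data_loader and max_iter are placeholder names here, not necessarily the ones in the repo):
batch_iterator = iter(data_loader)
epoch_size = len(data_loader)               # batches per full pass over the dataset

for iteration in range(max_iter):
    if iteration % epoch_size == 0 and iteration > 0:
        batch_iterator = iter(data_loader)  # re-create the iterator once it is used up
    images, targets = next(batch_iterator)
    # ... forward pass, loss, backward and optimizer step as in train.py ...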
@seongkyun Unluckily, in my case the training process sometimes diverges, i.e. the losses stop falling and become nan, and it seems to happen randomly.
@d-li14 Maybe the batch size is the problem. I don't know why this happens, but sometimes the nan loss error occurs with the wrong batch size.
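This also lines up with the commit above that "changed default lr to a very low value": when the loss turns into nan after changing the batch size, lowering or rescaling the learning rate is a common first thing to try. A minimal sketch of the linear scaling heuristic; the base values below are assumptions, not necessarily the repo's defaults:
base_lr = 1e-3           # assumed learning rate tuned for the base batch size
base_batch_size = 32     # assumed batch size that base_lr was tuned for

def scaled_lr(batch_size, base_lr=base_lr, base_batch=base_batch_size):
    # Linear scaling rule: shrink the learning rate together with the batch size
    # so smaller batches do not take overly large, destabilizing update steps.
    return base_lr * batch_size / base_batch

print(scaled_lr(16))     # 0.0005 for batch_size=16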