ssd.pytorch: StopIteration ERROR during training
My environment is:
8 GB RAM, Ubuntu 16.04 LTS, PyTorch 0.4 with CUDA 9.0 and cuDNN v7, Python 3.5, GeForce GTX 1080 8 GB.
Since I have a GeForce GTX 1080 with 8 GB, I tried to train the network with a batch size of 16.
I ran the training with
python3 train.py --batch_size=16
After 1030 iterations, it crashes:
.....
iter 1020 || Loss: 9.2115 || timer: 0.1873 sec.
iter 1030 || Loss: 8.1139 || Traceback (most recent call last):
File "train.py", line 255, in <module>
train()
File "train.py", line 165, in train
images, targets = next(batch_iterator)
File "/home/han/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 326, in __next__
raise StopIteration
StopIteration
So I tried training with other batch sizes like 8 and 20, and every time it raises the same error.
I calculated
batch_size * iteration step
and every time the result is around 16,480, regardless of the batch size and number of iteration steps.
The part of the PyTorch dataloader where the problem occurs is
if self.batches_outstanding == 0:
    self._shutdown_workers()
    raise StopIteration
in /home/han/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py
Could my 8 GB of RAM be causing this problem? (I checked with the Ubuntu System Monitor and there was enough RAM.)
Is there anybody who can solve this problem? Please help me, guys!
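(A minimal toy sketch of what those numbers suggest, independent of the VOC data itself: a DataLoader iterator only yields about len(dataset) / batch_size batches before raising StopIteration, so batch_size * iterations at the crash is roughly the dataset size rather than a RAM limit.)
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(100, 3))  # toy dataset with 100 samples
loader = DataLoader(dataset, batch_size=16)

batch_iterator = iter(loader)
for _ in range(len(loader)):   # ceil(100 / 16) = 7 batches exist
    batch = next(batch_iterator)
next(batch_iterator)           # one call too many: raises StopIteration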
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 16
Commits related to this issue
- fixed issue #214, changed default lr to a very low value — committed to kentaroy47/ssd.pytorch by kentaroy47 6 years ago
- Add #214 fix — committed to sakaia/ssd.pytorch by deleted user 5 years ago
- fix issue https://github.com/amdegroot/ssd.pytorch/issues/214 — committed to yodhcn/ssd.pytorch by yodhcn 2 years ago
I fixed this problem. I've changed the code in
train.py
from
images, targets = next(batch_iterator)
to
try:
    images, targets = next(batch_iterator)
except StopIteration:
    batch_iterator = iter(data_loader)
    images, targets = next(batch_iterator)
Now I checked that all the losses keep going down.
Thanks, @chenxinyang123.
I think it is because "batch_iterator" is used up. You can get len(batch_iterator) (which is the number of data samples divided by the batch size) and use it to define a new batch_iterator. That is to say, you can call batch_iterator = iter(data_loader) after every len(batch_iterator) iterations, as in the sketch below.
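A minimal sketch of that suggestion, assuming a training loop roughly like the one in train.py (data_loader and max_iter are placeholder names here, not necessarily the ones in the repo):
batch_iterator = iter(data_loader)
epoch_size = len(data_loader)               # batches per full pass over the dataset

for iteration in range(max_iter):
    if iteration % epoch_size == 0 and iteration > 0:
        batch_iterator = iter(data_loader)  # re-create the iterator once it is used up
    images, targets = next(batch_iterator)
    # ... forward pass, loss, backward and optimizer step as in train.py ...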
@seongkyun Unluckily, in my case the training process sometimes diverges, i.e. the losses stop falling and become nan, and it seems to happen randomly.
@d-li14 Maybe the batch size is the problem. I don't know why this happens, but sometimes the nan loss error occurs with the wrong batch size.
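This also lines up with the commit above that "changed default lr to a very low value": when the loss turns into nan after changing the batch size, lowering or rescaling the learning rate is a common first thing to try. A minimal sketch of the linear scaling heuristic; the base values below are assumptions, not necessarily the repo's defaults:
base_lr = 1e-3           # assumed learning rate tuned for the base batch size
base_batch_size = 32     # assumed batch size that base_lr was tuned for

def scaled_lr(batch_size, base_lr=base_lr, base_batch=base_batch_size):
    # Linear scaling rule: shrink the learning rate together with the batch size
    # so smaller batches do not take overly large, destabilizing update steps.
    return base_lr * batch_size / base_batch

print(scaled_lr(16))     # 0.0005 for batch_size=16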