simpletransformers: Multiprocessing Errors when Training for Multiple Epochs
Bug Description
When training RoBERTa on a text regression task for more than 1 epoch, I get an OSError and a ValueError from the multiprocessing package (see details below).
Reproduction Info
The error persists even when using the exact same code as presented in the simpletransformers documentation, except for setting model_args.num_train_epochs = 4 (other values > 1 fail as well). I also tried fixing model_args.process_count and then setting model_args.use_multiprocessing = False, neither of which helped. The code runs without errors when model_args.num_train_epochs = 1. A minimal sketch of the setup is shown below.
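For reference, a minimal sketch of the setup, closely following the text-regression example from the simpletransformers documentation (the toy data and the exact model name "roberta-base" are placeholders, not necessarily what was used):

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Placeholder toy data; the real task uses a larger text-regression dataset.
train_df = pd.DataFrame(
    [["first example text", 0.8], ["second example text", 0.2]],
    columns=["text", "labels"],
)

model_args = ClassificationArgs()
model_args.regression = True
model_args.num_train_epochs = 4        # any value > 1 triggers the error
# Attempted workarounds, neither of which helped:
# model_args.process_count = 1
# model_args.use_multiprocessing = False

model = ClassificationModel(
    "roberta",
    "roberta-base",
    num_labels=1,
    args=model_args,
)
model.train_model(train_df)
```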
Setup Details
- Debian Linux 10
- GTX 1080 Ti / RTX 2080 Ti with CUDA 11
- Python 3.7, PyTorch 1.7.0, simpletransformers 0.48.14.
Full Error Message
Exception in thread QueueFeederThread:
Traceback (most recent call last):
File ".../anaconda3/lib/python3.7/multiprocessing/queues.py", line 232, in _feed
close()
File ".../anaconda3/lib/python3.7/multiprocessing/connection.py", line 177, in close
self._close()
File ".../anaconda3/lib/python3.7/multiprocessing/connection.py", line 361, in _close
_close(self._handle)
OSError: [Errno 9] Bad file descriptor
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File ".../anaconda3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File ".../anaconda3/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File ".../anaconda3/lib/python3.7/multiprocessing/queues.py", line 263, in _feed
queue_sem.release()
ValueError: semaphore or lock released too many times
Edit: Added GPU, PyTorch, and CUDA specs.
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (4 by maintainers)
Most upvoted comments
+1
ThilinaRajapakse on Nov 26, 2020