FARM: Deadlock in DataSilo._get_dataset when using docker
UPDATE/SOLUTION: PSA FOR POSTERITY
If you hit a deadlock running FARM in a Docker container, make sure you run the container with --ipc=host: the container then shares the host's IPC namespace and its full /dev/shm instead of Docker's default 64 MB of shared memory.
SOLUTION END
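For reference, the launch then looks roughly like this (image name and training command are placeholders):

# Share the host's IPC namespace, and with it the host's full /dev/shm
docker run --ipc=host my-farm-image python train.py

# Alternative: keep the IPC namespace isolated but raise the 64 MB default cap
docker run --shm-size=8g my-farm-image python train.py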
Describe the bug
There seems to be a multiprocessing-related deadlock in DataSilo._get_dataset, which converts TSV lines into dicts and then into datasets, chunk by chunk. Reading a moderately sized training set (~18k docs) stalls with zero CPU activity after around two thirds of the data.
Error message
The trace is rather unhelpful, as with so many multiprocessing deadlocks:
Process ForkPoolWorker-23:
Process ForkPoolWorker-20:
Process ForkPoolWorker-22:
Process ForkPoolWorker-21:
Process ForkPoolWorker-19:
Process ForkPoolWorker-17:
Process ForkPoolWorker-18:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
with self._rlock:
File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
with self._rlock:
File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
Traceback (most recent call last):
Traceback (most recent call last):
KeyboardInterrupt
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
with self._rlock:
File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
with self._rlock:
File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
with self._rlock:
File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
with self._rlock:
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
File "/usr/lib/python3.6/multiprocessing/queues.py", line 335, in get
res = self._reader.recv_bytes()
File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
KeyboardInterrupt
File "/usr/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt
60%|██████████████████████████████████████████████████████▋ | 10752/17908 [04:04<02:42, 44.04 Dicts/s]
Process ForkPoolWorker-24:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
with self._rlock:
File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Expected behavior
Dataset loading should successfully finish after processing the last chunk.
Additional context
I have verified that manually loading the data works:
# This works
train_dicts = processor.file_to_dicts("train.tsv")
train_dataset, tensor_names = processor.dataset_from_dicts(dicts=train_dicts)
Loading a subset of the first 10k docs works, too.
One hunch is that grouper(dicts, multiprocessing_chunk_size)
under some condition produces a pathological chunk size.
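For reference, the classic itertools grouper recipe pads the last chunk with a fill value, which is the kind of edge case I would look at first (a standalone sketch with a made-up chunk size, not FARM's actual implementation):

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # itertools recipe: collect data into fixed-length chunks;
    # the final chunk is padded with fillvalue if len(iterable) % n != 0
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

dicts = [{"text": f"doc {i}"} for i in range(17908)]
chunk_size = 650  # hypothetical; log the real multiprocessing_chunk_size instead
sizes = [sum(d is not None for d in chunk) for chunk in grouper(dicts, chunk_size)]
print(len(sizes), sizes[-1])  # 28 chunks, the last one holds only 358 real dicts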
To Reproduce
I’ll try and see if I can come up with a synthetic reproducer that doesn’t include my data (which I can’t give out).
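A sketch of what such a reproducer might look like, assuming (and this is only an assumption) that the stall comes from shipping largish tensors from pool workers back to the parent through shared memory; it uses neither FARM nor my data:

import torch
import torch.multiprocessing as mp  # importing this registers shared-memory reducers for tensors

def make_chunk(i):
    # stand-in for converting one chunk of dicts into a dataset: a ~4 MB tensor
    return torch.randn(1000, 1000)

if __name__ == "__main__":
    datasets = []
    with mp.Pool(processes=8) as pool:
        for t in pool.imap(make_chunk, range(200)):
            # keep every result around, like DataSilo does with the per-chunk datasets;
            # each tensor shipped back to the parent is backed by shared memory,
            # so a 64 MB /dev/shm should be exhausted within the first few chunks
            datasets.append(t)
    print(len(datasets))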
System:
- OS: Ubuntu 18.04 with nvidia-docker2 and a CUDA 10.0 image
- GPU/CPU: GTX 1080 / Xeon 4-core, 120GB RAM
- FARM version: master
The system is otherwise idle: no file system contention, no excessive context switches, plenty of free RAM.
Yes, this was it! The reproducer runs through without any issue after launching the docker container with --ipc=host.
Thanks a lot, @tanaysoni, for the quick help.
Since this issue was basically nothing but a time sink for you (sorry!), I want it to end on some constructive note 😃
It seems that the best way of detecting this issue would be to query the size of /dev/shm. The difference is obvious: on the host system, /dev/shm defaults to roughly half of physical RAM (tens of gigabytes on this machine), whereas inside a default Docker container it is capped at 64 MB.
Running a check against that could issue a warning. Although, to be honest, that is probably something that would be worth including in torch upstream rather than here.
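Something along these lines would do (a minimal sketch; the threshold is arbitrary, 64 MB is merely Docker's default):

import shutil
import warnings

def warn_if_small_shm(path="/dev/shm", min_bytes=1 << 30):
    # Docker containers default to a 64 MB /dev/shm, which is far too small for
    # shipping tensors between worker processes through shared memory
    try:
        total = shutil.disk_usage(path).total
    except OSError:
        return
    if total < min_bytes:
        warnings.warn(
            f"/dev/shm is only {total / 2**20:.0f} MB; inside Docker, consider "
            "--ipc=host or a larger --shm-size to avoid dataloading deadlocks."
        )

warn_if_small_shm()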