FARM: Deadlock in DataSilo._get_dataset when using docker
UPDATE/SOLUTION: PSA FOR POSTERITY
If you hit a deadlock running FARM in a Docker container, make sure you run the container with --ipc=host: the container then shares the host's IPC namespace and its full /dev/shm instead of Docker's default 64 MB of shared memory.
SOLUTION END
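For reference, the launch then looks roughly like this (image name and training command are placeholders):

# Share the host's IPC namespace, and with it the host's full /dev/shm
docker run --ipc=host my-farm-image python train.py

# Alternative: keep the IPC namespace isolated but raise the 64 MB default cap
docker run --shm-size=8g my-farm-image python train.py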
Describe the bug
There seems to be a multiprocessing-related deadlock in DataSilo._get_dataset, which converts TSV lines into dicts and then into datasets, chunk by chunk. Reading a moderately sized training set (~18k docs) stalls with zero CPU activity after around two thirds of the data.
Error message
The trace is rather unhelpful, as with so many multiprocessing deadlocks:
Process ForkPoolWorker-23:
Process ForkPoolWorker-20:
Process ForkPoolWorker-22:
Process ForkPoolWorker-21:
Process ForkPoolWorker-19:
Process ForkPoolWorker-17:
Process ForkPoolWorker-18:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
with self._rlock:
File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
with self._rlock:
File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
Traceback (most recent call last):
Traceback (most recent call last):
KeyboardInterrupt
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
with self._rlock:
File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
with self._rlock:
File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
with self._rlock:
File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
with self._rlock:
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
File "/usr/lib/python3.6/multiprocessing/queues.py", line 335, in get
res = self._reader.recv_bytes()
File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
KeyboardInterrupt
File "/usr/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
KeyboardInterrupt
60%|██████████████████████████████████████████████████████▋ | 10752/17908 [04:04<02:42, 44.04 Dicts/s]
Process ForkPoolWorker-24:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
task = get()
File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
with self._rlock:
File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt
Expected behavior
Dataset loading should successfully finish after processing the last chunk.
Additional context
I have verified that manually loading the data works:
# This works
train_dicts = processor.file_to_dicts("train.tsv")
train_dataset, tensor_names = processor.dataset_from_dicts(dicts=train_dicts)
Loading a subset of the first 10k docs works, too.
One hunch is that grouper(dicts, multiprocessing_chunk_size)
under some condition produces a pathological chunk size.
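For reference, the classic itertools grouper recipe pads the last chunk with a fill value, which is the kind of edge case I would look at first (a standalone sketch with a made-up chunk size, not FARM's actual implementation):

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # itertools recipe: collect data into fixed-length chunks;
    # the final chunk is padded with fillvalue if len(iterable) % n != 0
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

dicts = [{"text": f"doc {i}"} for i in range(17908)]
chunk_size = 650  # hypothetical; log the real multiprocessing_chunk_size instead
sizes = [sum(d is not None for d in chunk) for chunk in grouper(dicts, chunk_size)]
print(len(sizes), sizes[-1])  # 28 chunks, the last one holds only 358 real dicts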
To Reproduce
I’ll try and see if I can come up with a synthetic reproducer that doesn’t include my data (which I can’t give out).
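A sketch of what such a reproducer might look like, assuming (and this is only an assumption) that the stall comes from shipping largish tensors from pool workers back to the parent through shared memory; it uses neither FARM nor my data:

import torch
import torch.multiprocessing as mp  # importing this registers shared-memory reducers for tensors

def make_chunk(i):
    # stand-in for converting one chunk of dicts into a dataset: a ~4 MB tensor
    return torch.randn(1000, 1000)

if __name__ == "__main__":
    datasets = []
    with mp.Pool(processes=8) as pool:
        for t in pool.imap(make_chunk, range(200)):
            # keep every result around, like DataSilo does with the per-chunk datasets;
            # each tensor shipped back to the parent is backed by shared memory,
            # so a 64 MB /dev/shm should be exhausted within the first few chunks
            datasets.append(t)
    print(len(datasets))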
System:
- OS: Ubuntu 18.04 with nvidia-docker2 and a CUDA 10.0 image
- GPU/CPU: GTX 1080 / Xeon 4-core, 120GB RAM
- FARM version: master
The system is otherwise idle: no file system contention, no excessive context switches, plenty of free RAM.
Yes, this was it! The reproducer runs through without any issue after launching the docker container with --ipc=host.
Thanks a lot, @tanaysoni, for the quick help.
Since this issue was basically nothing but a time sink for you (sorry!), I want it to end on some constructive note 😃
It seems that the best way of detecting this issue would be to query the size of /dev/shm. The difference is obvious: on the host system, /dev/shm defaults to roughly half of physical RAM (tens of gigabytes on this machine), whereas inside a default Docker container it is capped at 64 MB.
Running a check against that could issue a warning. Although, to be honest, that is probably something that would be worth including in torch upstream rather than here.
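Something along these lines would do (a minimal sketch; the threshold is arbitrary, 64 MB is merely Docker's default):

import shutil
import warnings

def warn_if_small_shm(path="/dev/shm", min_bytes=1 << 30):
    # Docker containers default to a 64 MB /dev/shm, which is far too small for
    # shipping tensors between worker processes through shared memory
    try:
        total = shutil.disk_usage(path).total
    except OSError:
        return
    if total < min_bytes:
        warnings.warn(
            f"/dev/shm is only {total / 2**20:.0f} MB; inside Docker, consider "
            "--ipc=host or a larger --shm-size to avoid dataloading deadlocks."
        )

warn_if_small_shm()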