hover_net: Run inference script crashes

Hi all,

I have tried to run the PyTorch version after initially trying the TensorFlow version. I ran the inference script in WSI mode with an .ndpi image. It starts correctly, but mid-way through the process I got this error:

Process Chunk 48/99:  61%|#############5        | 35/57 [02:19<01:11,  3.23s/it]|2021-01-06|13:06:15.182| [ERROR] Crash
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/local/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/local/lib/python3.7/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 185, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 803, in _try_get_data
    fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 803, in <listcomp>
    fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
  File "/usr/local/lib/python3.7/tempfile.py", line 547, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/usr/local/lib/python3.7/tempfile.py", line 258, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 24] Too many open files: '/tmp/tmpxrmts9vn'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 746, in process_wsi_list
    self.process_single_file(wsi_path, msk_path, self.output_dir)
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 550, in process_single_file
    self.__get_raw_prediction(chunk_info_list, patch_info_list)
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 374, in __get_raw_prediction
    chunk_patch_info_list[:, 0, 0], pbar_desc
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 287, in __run_model
    for batch_idx, batch_data in enumerate(dataloader):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 974, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 941, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 807, in _try_get_data
    "Too many open files. Communication with the"
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code
Process Chunk 48/99:  61%|#############5        | 35/57 [02:19<01:27,  4.00s/it]
/usr/local/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))

Do you know why this error might occur?

I'm running this on an Ubuntu 20.04 machine with a conda environment containing the requirements.
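
The error message itself points at two possible workarounds: raising the open-file limit (`ulimit -n`) or switching PyTorch's sharing strategy. As an untested sketch (not something taken from the repo), I understand both could be applied at the very top of the entry script, e.g. run_infer.py:

import resource
import torch.multiprocessing

# Workaround 1: raise the soft open-file limit up to the hard limit
# (the Python equivalent of running `ulimit -n <hard>` in the shell).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Workaround 2: share tensors between DataLoader workers through the
# filesystem instead of passing file descriptors, so the per-process
# descriptor limit is not exhausted.
torch.multiprocessing.set_sharing_strategy('file_system')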

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 40 (7 by maintainers)

Most upvoted comments

@Mengflz @JMBokhorst Can you guys share with us your system specs?

CPU: i7-7700K
MEM: 32 GB
GPU: 1080 Ti
HDD: 500 GB SSD
OS: Ubuntu 20.04

I run this script on the server:
CPU: Intel® Xeon® Platinum 8165 CPU @ 2.30GHz
MEM: 378 GB
GPU: Tesla K80
OS: Ubuntu 18.04.2

I ran into a crash from running out of file descriptors. The error message on the terminal was: “RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using ulimit -n in the shell or change the sharing strategy by calling torch.multiprocessing.set_sharing_strategy('file_system') at the beginning of your code”
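
For reference, a quick way to check the per-process descriptor limit and how many descriptors the process currently has open (a small standalone Linux-only snippet, not part of hover_net):

import os
import resource

# Current soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft} hard={hard}")

# /proc/self/fd lists one entry per open descriptor (Linux only).
print(f"currently open descriptors: {len(os.listdir('/proc/self/fd'))}")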

Yes; I created a new conda environment and installed the pip requirements file. In addition, I had to install PyTorch and openslide-python.

I sourced a computer with more memory (64GB), and there the post-processing also seems to work correctly and finish without any issues 😃 Let me know if I can help by running it on more slides!

Thanks for the help and quick responses! 😃

@simongraham and @vqdang,

Thanks for the quick response. I will check out the PR now and see if it fixes the issue.

I’m trying to run the script on a folder containing a single ndpi image. I use this command (based on the run_wsi.sh script):

python3.7 run_infer.py --gpu='0,1' --nr_types=6 --type_info_path=type_info.json --batch_size=64 --model_mode=fast --model_path=/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hovernet_fast_pannuke_type_tf2pytorch.tar --nr_inference_workers=8 --nr_post_proc_workers=16 wsi --input_dir=/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/test_image/ --output_dir=/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/result/ --save_thumb --save_mask

Below is the output of the debug log; is this the log file you are referring to?

|2021-01-06|11:46:06.636| [INFO] ................ Process: TB_S02_P005_C0001_L15_A15
|2021-01-06|11:46:11.858| [INFO] ................ WARNING: No mask found, generating mask via thresholding at 1.25x!
|2021-01-06|11:46:23.762| [INFO] ........ Preparing Input Output Placement: 17.12366568017751
|2021-01-06|13:06:15.182| [ERROR] Crash
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/local/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/local/lib/python3.7/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 185, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 803, in _try_get_data
    fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 803, in <listcomp>
    fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
  File "/usr/local/lib/python3.7/tempfile.py", line 547, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/usr/local/lib/python3.7/tempfile.py", line 258, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 24] Too many open files: '/tmp/tmpxrmts9vn'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 746, in process_wsi_list
    self.process_single_file(wsi_path, msk_path, self.output_dir)
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 550, in process_single_file
    self.__get_raw_prediction(chunk_info_list, patch_info_list)
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 374, in __get_raw_prediction
    chunk_patch_info_list[:, 0, 0], pbar_desc
  File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 287, in __run_model
    for batch_idx, batch_data in enumerate(dataloader):
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 974, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 941, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 807, in _try_get_data
    "Too many open files. Communication with the"
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code