hover_net: Run inference script crashes
Hi all,
I initially tried the Tensorflow version and then switched to the Pytorch version. I tried to run the inference script in wsi mode on an ndpi image. It starts correctly, but mid-way through the process I got this error:
Process Chunk 48/99: 61%|#############5 | 35/57 [02:19<01:11, 3.23s/it]|2021-01-06|13:06:15.182| [ERROR] Crash
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/local/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/usr/local/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
fd = df.detach()
File "/usr/local/lib/python3.7/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 185, in recv_handle
return recvfds(s, 1)[0]
File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 161, in recvfds
len(ancdata))
RuntimeError: received 0 items of ancdata
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 803, in _try_get_data
fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 803, in <listcomp>
fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
File "/usr/local/lib/python3.7/tempfile.py", line 547, in NamedTemporaryFile
(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "/usr/local/lib/python3.7/tempfile.py", line 258, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
OSError: [Errno 24] Too many open files: '/tmp/tmpxrmts9vn'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 746, in process_wsi_list
self.process_single_file(wsi_path, msk_path, self.output_dir)
File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 550, in process_single_file
self.__get_raw_prediction(chunk_info_list, patch_info_list)
File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 374, in __get_raw_prediction
chunk_patch_info_list[:, 0, 0], pbar_desc
File "/mnt/netcache/pathology/projects/colon-budding-he/nuclei_detection/hover_pytorch/hover_net-master/infer/wsi.py", line 287, in __run_model
for batch_idx, batch_data in enumerate(dataloader):
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
data = self._next_data()
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 974, in _next_data
idx, data = self._get_data()
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 941, in _get_data
success, data = self._try_get_data()
File "/usr/local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 807, in _try_get_data
"Too many open files. Communication with the"
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code
Process Chunk 48/99: 61%|#############5 | 35/57 [02:19<01:27, 4.00s/it]
/usr/local/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
len(cache))
Do you know why this error might occur?
I am running this on an Ubuntu 20 machine with a conda env containing the requirements.
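For reference, the error message itself names two workarounds. A minimal sketch of both is below; the `resource` call is a stdlib, in-process equivalent of running `ulimit -n` in the shell, and the commented-out `torch` lines are the sharing-strategy change the message suggests (this is a generic sketch, not hover_net's own code):

```python
import resource

# The crash happens when the DataLoader workers exhaust the per-process
# limit on open file descriptors.

# Current (soft, hard) limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Lift the soft limit up to the hard limit -- the in-process equivalent
# of running `ulimit -n` in the shell before launching the script.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# The alternative fix from the error message: make PyTorch share tensors
# through the filesystem instead of file descriptors. Uncomment if torch
# is installed; it must run before any DataLoader is created:
# import torch.multiprocessing
# torch.multiprocessing.set_sharing_strategy('file_system')
```

The `file_system` strategy sidesteps the descriptor limit entirely, but raising the limit is often the simpler first thing to try.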
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 40 (7 by maintainers)
CPU: i7-7700K, MEM: 32 GB, GPU: 1080 Ti, HDD: 500 GB SSD, OS: Ubuntu 20.04
I ran this script on a server: CPU: Intel Xeon Platinum 8165 @ 2.30 GHz, MEM: 378 GB, GPU: Tesla K80, OS: Ubuntu 18.04.2.
I hit a crash from running out of file descriptors. The error message on the terminal was: "RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code".
Yes, I created a new conda environment and installed from the pip requirements file. In addition I had to install pytorch and openslide-python.
I found a computer with more memory (64 GB), and there the post-processing also runs correctly and finishes without any issues 😃 Let me know if I can help by running it on more slides!
Thanks for the help and quick responses! 😃
@simongraham and @vqdang,
Thanks for the quick response. I will check out the PR now and see if it fixes the issue.
I’m trying to run the script on a folder containing a single ndpi image, using this command (based on the run_wsi.sh script):
Below is the output of the debug log; is this the log file you are referring to?