dask-cuda: System Monitor Error
I’m getting an error when running the local_cudf_merge benchmark with the latest distributed/dask (main branch).
python local_cudf_merge.py -p tcp -d 0 --profile foo.html
bzaitlen@prm-dgx-06:~$ python $CONDA_PREFIX/lib/python3.8/site-packages/dask_cuda/benchmarks/local_cudf_merge.py -p tcp -d 0 --profile foo.html
distributed.utils - ERROR - deque index out of range
Traceback (most recent call last):
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/utils.py", line 671, in log_errors
    yield
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/dashboard/components/shared.py", line 581, in update
    self.source.stream(self.get_data(), 1000)
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/dashboard/components/shared.py", line 573, in get_data
    d = self.worker.monitor.range_query(start=self.last_count)
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/system_monitor.py", line 123, in range_query
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/system_monitor.py", line 123, in <dictcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
  File "/gpfs/fs1/bzaitlen/miniconda3/envs/20210601-nightly-21.08/lib/python3.8/site-packages/distributed/system_monitor.py", line 123, in <listcomp>
    d = {k: [v[i] for i in seq] for k, v in self.quantities.items()}
IndexError: deque index out of range
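One way this IndexError can arise: the dict comprehension in SystemMonitor.range_query indexes every quantity's deque with the same sequence of (negative) indices, so an index that is valid for one deque is out of range for a shorter one, for example a metric that only starts being recorded after the monitor has already taken samples. A minimal sketch of that mechanism (metric names and values are illustrative):

```python
from collections import deque

# Deques of different lengths, e.g. a GPU metric that only starts being
# recorded once NVML / the CUDA context is initialized.
quantities = {
    "cpu": deque([10.0, 12.0, 11.0], maxlen=10),
    "gpu_utilization": deque([5.0], maxlen=10),
}

seq = range(-3, 0)  # indices the dashboard wants to stream since its last query
try:
    {k: [v[i] for i in seq] for k, v in quantities.items()}
except IndexError as err:
    print(err)  # deque index out of range
```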
@charlesbluca do you have time to investigate what is happening here? I think you are familiar with the system_monitor section of distributed.
Sure, I can do that! Would such a test be skipped by the Distributed CI right now?
EDIT: Actually, it looks like these Distributed tests already handle the general issue here (checking for GPU info in the WorkerState/worker monitors), so we should be good there.

Maybe it makes sense to add the test to Distributed as well, with a pytest.importorskip?

Thanks @pentschev! I’ll submit a test for this in Dask-CUDA (maybe something like @quasiben’s perf report snippet; a rough sketch follows below), but it would be nice to have an equivalent test in Distributed if/when we are able to test NVML stuff there.
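For reference, a minimal sketch of what such a Dask-CUDA test could look like, assuming pynvml is importable and a local GPU is available; the test name and workload are illustrative, not the test that was actually submitted:

```python
import pytest

pytest.importorskip("pynvml")  # skip entirely when NVML is unavailable

import dask.array as da
from dask.distributed import Client, performance_report
from dask_cuda import LocalCUDACluster


def test_performance_report(tmp_path):
    # Generating a performance report exercises the dashboard components that
    # call SystemMonitor.range_query, the code path that raised
    # "IndexError: deque index out of range" above.
    with LocalCUDACluster(n_workers=1) as cluster:
        with Client(cluster):
            path = str(tmp_path / "report.html")
            with performance_report(filename=path):
                da.random.random((10_000, 1_000), chunks=1_000).sum().compute()
            assert (tmp_path / "report.html").exists()
```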
I opened https://github.com/dask/distributed/pull/4866 to fix this. It reintroduces the change that broke https://github.com/rapidsai/dask-cuda/issues/564, but also includes a fix for that specifically in https://github.com/dask/distributed/pull/4866/commits/d860e585c8455f285f5e5ca0d1470cb2c255e281.
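This is not the diff in that PR, just a rough sketch of one kind of guard that keeps the comprehension's indices in range when the quantity deques have different lengths (the function name and signature are made up for illustration):

```python
from collections import deque


def clamped_range_query(quantities, count, start):
    """Rough illustration only -- not the actual change in dask/distributed#4866.

    ``count`` is the total number of samples taken so far and ``start`` is the
    first sample the caller has not yet seen, loosely mirroring
    SystemMonitor.range_query.  Clamping the query range to the shortest deque
    keeps every negative index valid for every quantity.
    """
    shortest = min((len(v) for v in quantities.values()), default=0)
    istart = max(start - count, -shortest)
    seq = range(istart, 0)
    return {k: [v[i] for i in seq] for k, v in quantities.items()}


# A GPU metric deque shorter than the CPU one no longer triggers
# "deque index out of range".
q = {"cpu": deque([1.0, 2.0, 3.0], maxlen=10), "gpu": deque([4.0], maxlen=10)}
print(clamped_range_query(q, count=3, start=0))  # {'cpu': [3.0], 'gpu': [4.0]}
```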
Note this also happens in “basic” usage:
We used to initialize a CUDA context when starting the workers, but I think we changed things slightly. @pentschev mentioned some of those changes here: https://github.com/rapidsai/dask-cuda/issues/632
@pentschev do you have an idea of what’s going on here? It seems like there is another initialization/timing issue around CUDA contexts.
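For context, a minimal sketch of eagerly creating a CUDA context on each worker so that GPU metrics exist from the first monitor sample. This uses Numba purely for illustration and is not dask-cuda's actual initialization path; the scheduler address is a placeholder for an already-running cluster:

```python
from dask.distributed import Client
import numba.cuda


def create_cuda_context():
    # Touching the current context forces CUDA initialization on this worker.
    numba.cuda.current_context()


client = Client("tcp://scheduler:8786")  # placeholder address
client.run(create_cuda_context)          # run on every worker
```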