dask: Dask crashes or hangs during out-of-core dataframe sort
This is essentially the same issue as this one on dask/community, but I thought it would be worth a try to see if anyone can help here. Please let me know if I should close one of these!
What happened: An out-of-core sort of a large dataframe hangs or crashes.
What you expected to happen: The sort should finish.
Minimal Complete Verifiable Example:
I set up a single-node dask distributed cluster on a machine with 32 vCPUs and 128GB RAM. I started the workers with this command (note that I used nprocs=8 because it seemed to reduce some of the errors compared to nprocs=32):
dask-worker localhost:8786 --nthreads 1 --nprocs=8 --memory-limit=10000000000  # 10GB per worker
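For reference, the equivalent setup in Python (a sketch mirroring the CLI flags above, not what the benchmark actually ran):

```python
from dask.distributed import Client, LocalCluster

# Sketch only: mirrors `--nprocs=8 --nthreads 1 --memory-limit=10000000000`.
cluster = LocalCluster(
    n_workers=8,           # one process per worker, like --nprocs=8
    threads_per_worker=1,  # like --nthreads 1
    memory_limit="10GB",   # per-worker limit, like --memory-limit
)
client = Client(cluster)
```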
Then, I ran this script, which benchmarks a full sort of a large dataframe. The script generates partitions on disk, then shuffles and blocks on the result using this line:
print(df.set_index('a').head(10, npartitions=-1)) # Force `head` to wait on all partitions.
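For context, here is a stripped-down sketch of that flow (paths, sizes, and the second column are illustrative; the actual benchmark script is linked above):

```python
import os

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client("localhost:8786")

# Generate ~100MB partitions on disk as Parquet (illustrative sizes; the
# on-disk size will differ from the in-memory size).
os.makedirs("data", exist_ok=True)
npartitions = 100
rows = 6_250_000  # ~100MB per partition at two float64 columns
for i in range(npartitions):
    rng = np.random.default_rng(i)
    pd.DataFrame({"a": rng.random(rows), "b": rng.random(rows)}).to_parquet(
        f"data/part-{i:04d}.parquet"
    )

# Read the partitions back and force a full shuffle via set_index.
df = dd.read_parquet("data/")
print(df.set_index("a").head(10, npartitions=-1))  # waits on all partitions
```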
On 10GB with 100 partitions (100MB each), I started to get a lot of errors with nprocs=32 (stdout here). Then I tried nprocs=8 and the script finished successfully (results here).
However, I haven’t been able to get the script to work with 100GB yet. Here is an example of the output. So far, in addition to changing nprocs and nthreads, I’ve also tried partitions=100 and 1000, and lowering the target and memory config parameters described here.
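For anyone following along, those thresholds live under distributed.worker.memory in the dask config. A sketch of lowering them (the values here are arbitrary examples, not recommendations):

```python
import dask

# Fractions of the per-worker memory limit. Workers read these at startup,
# so they must be set in the worker processes (e.g., via the dask config
# file or environment variables); dask.config.set only affects the current
# process.
dask.config.set({
    "distributed.worker.memory.target": 0.50,     # start spilling to disk
    "distributed.worker.memory.spill": 0.60,      # spill more aggressively
    "distributed.worker.memory.pause": 0.70,      # pause worker threads
    "distributed.worker.memory.terminate": 0.95,  # restart the worker
})
```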
Anything else we need to know?:
Environment:
- Dask version: 2021.4.1
- Python version: 3.7.7
- Operating System: Ubuntu 18.04
- Install method (conda, pip, source): conda
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 19 (9 by maintainers)
Ah right - my mistake! You are correct that each worker gets a 10GB limit, so nprocs=8 comes to 80GB in total, within the machine’s 128GB of RAM.
This might also be a use case that benefits from some of the spilling optimizations that we’ve done in RAPIDS (cc @madsbk 🙂)
Ah, my bad! I just upgraded to 2021.4.1 and tried again. The error messages are different, but the overall behavior seems about the same (exception on 10GB with nprocs=32, success on 10GB with nprocs=8, exception on 100GB with nprocs=8).
I updated the issue description accordingly. It does look like the script now crashes reliably instead of hanging. I’ll try adjusting the memory parameters for this version of dask too and report back.