ray: [Core] [Nightly] [Flaky] `many_drivers` test failed
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core
What happened + What you expected to happen
(run_driver pid=3156456) ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::f() (pid=3167980, ip=172.31.62.177)
(run_driver pid=3156456) ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-62-177 is used (29.1 / 30.57 GB). The top 10 memory consumers are:
(run_driver pid=3156456)
(run_driver pid=3156456) PID    MEM       COMMAND
(run_driver pid=3156456) 755    21.37GiB  /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/dash
(run_driver pid=3156456) 729    0.41GiB   /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s
(run_driver pid=3156456) 800    0.11GiB   /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_m
(run_driver pid=3156456) 51     0.09GiB   /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session web_terminal_server --deploy
(run_driver pid=3156456) 378    0.09GiB   /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session auth_start
(run_driver pid=3156456) 691    0.09GiB   python workloads/many_drivers.py
(run_driver pid=3156456) 834    0.09GiB   /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
(run_driver pid=3156456) 890    0.08GiB   /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
(run_driver pid=3156456) 1014   0.08GiB   /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
(run_driver pid=3156456) 953    0.08GiB   /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen

Traceback (most recent call last):
  File "workloads/many_drivers.py", line 95, in <module>
    ray.get(ready_id)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1742, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(CalledProcessError): ray::run_driver() (pid=3136504, ip=172.31.62.177)
  File "workloads/many_drivers.py", line 80, in run_driver
    output = run_string_as_driver(driver_script)
subprocess.CalledProcessError: Command '['/home/ray/anaconda3/bin/python', '-']' returned non-zero exit status 1.
It seems the cluster is OOMing, and the dashboard process (21.37 GiB resident) appears to be the culprit.
Versions / Dependencies
master
Reproduction script
Run the weekly tests. Example output: https://buildkite.com/ray-project/periodic-ci/builds/2247#309c72ce-6ceb-4e86-9470-3c429fb1bf81
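For context, here is a minimal sketch of the pattern this workload stresses: repeatedly launching short-lived driver scripts against a long-running cluster. This is illustrative only, not the actual test code; the names (`run_driver`, `DRIVER_SCRIPT`, the batch size) are assumptions based on the traceback above.

```python
# Hypothetical sketch of a "many drivers" stress loop (not the real test script).
import subprocess
import sys

import ray

DRIVER_SCRIPT = """
import ray
ray.init(address="auto")

@ray.remote
def f():
    return 1

assert ray.get(f.remote()) == 1
"""


@ray.remote
def run_driver():
    # Each task spawns a fresh Python process that acts as its own Ray driver,
    # matching the `python -` command seen in the CalledProcessError above.
    subprocess.run(
        [sys.executable, "-"], input=DRIVER_SCRIPT.encode(), check=True
    )


if __name__ == "__main__":
    ray.init(address="auto")
    iteration = 0
    while True:
        # Run a small batch of drivers per iteration and wait for them.
        ray.get([run_driver.remote() for _ in range(4)])
        iteration += 1
        print(f"finished iteration {iteration}")
```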
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
About this issue
- State: closed
- Created 2 years ago
- Comments: 36 (33 by maintainers)
Lowering the priority to P1. The reason it failed is that master is faster after this PR.
The memory leak still exists in the dashboard, but without this PR every worker basically needs to import a lot of things, which slows down the driver's performance.
We can see from this log that one iteration takes a long time to finish; within 24h, it only completed about 2k iterations.
With this PR, it looks like this: in about 8h, it has already run 4k iterations.
The reason the slope of memory consumption gets smaller over time without the PR is that the test runs slower as time goes on.
There are a couple of options here:
I disabled the actor info in the dashboard ad hoc, and the issue is still there. If we trust tracemalloc, then the leak has to be in the C++ layer. I'll profile the memory there.
It seems like I was wrong. The memory just increases slowly.
The weird thing here is that the tracemalloc snapshot only claims about 700 MB of memory used.
One thing that is not released is the actor info in the dashboard, but by my calculation it shouldn't use more than 1 GB.
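For reference, this is a small sketch of how one can total what tracemalloc actually tracks and compare it against the process RSS; a large gap points at allocations outside the Python allocator (e.g. in the C++ layer). It is illustrative, not the exact profiling code used here, and assumes `psutil` is available.

```python
import os
import tracemalloc

import psutil  # assumed available; only used to read the process RSS

tracemalloc.start()

# ... run the dashboard / workload for a while ...

snapshot = tracemalloc.take_snapshot()
python_total = sum(stat.size for stat in snapshot.statistics("lineno"))
rss = psutil.Process(os.getpid()).memory_info().rss

print(f"tracemalloc-tracked: {python_total / 1e6:.1f} MB")
print(f"process RSS:         {rss / 1e6:.1f} MB")
# If RSS is far above the tracked total, the remainder is likely native
# (non-Python) memory that tracemalloc cannot see.
```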
Sorry, this ticket was created during standup just as a reminder; the description was meant to be added later.
Sure, I will.
Overall it looks like a memory leak in the dashboard. The plan here is to do some profiling and track the memory usage.
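As a rough illustration of the tracking side, one could periodically sample the dashboard process's RSS with psutil. The command-line matching string below is an assumption, not something taken from the Ray source.

```python
# Hypothetical RSS sampler for the dashboard process; adjust the match string
# to whatever the dashboard command line looks like on the head node.
import time

import psutil  # assumed available on the head node


def find_dashboard_process():
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "ray/dashboard/dashboard.py" in cmdline:  # assumed match string
            return proc
    return None


if __name__ == "__main__":
    proc = find_dashboard_process()
    while proc is not None and proc.is_running():
        rss_gib = proc.memory_info().rss / 1024 ** 3
        print(f"{time.strftime('%H:%M:%S')} dashboard RSS: {rss_gib:.2f} GiB")
        time.sleep(60)
```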
@iycheng Not a big deal, but we already had an issue open for this! In the future, don’t forget to check for existing issues before opening a new one.