ray: [Core] [Nightly] [Flaky] `many_drivers` test failed

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

(run_driver pid=3156456) ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::f() (pid=3167980, ip=172.31.62.177)
(run_driver pid=3156456) ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-62-177 is used (29.1 / 30.57 GB). The top 10 memory consumers are:
(run_driver pid=3156456)
(run_driver pid=3156456) PID    MEM       COMMAND
(run_driver pid=3156456) 755    21.37GiB  /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/dash
(run_driver pid=3156456) 729    0.41GiB   /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s
(run_driver pid=3156456) 800    0.11GiB   /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_m
(run_driver pid=3156456) 51     0.09GiB   /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session web_terminal_server --deploy
(run_driver pid=3156456) 378    0.09GiB   /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session auth_start
(run_driver pid=3156456) 691    0.09GiB   python workloads/many_drivers.py
(run_driver pid=3156456) 834    0.09GiB   /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
(run_driver pid=3156456) 890    0.08GiB   /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
(run_driver pid=3156456) 1014   0.08GiB   /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
(run_driver pid=3156456) 953    0.08GiB   /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen

Traceback (most recent call last):
  File "workloads/many_drivers.py", line 95, in <module>
    ray.get(ready_id)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1742, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(CalledProcessError): ray::run_driver() (pid=3136504, ip=172.31.62.177)
  File "workloads/many_drivers.py", line 80, in run_driver
    output = run_string_as_driver(driver_script)
subprocess.CalledProcessError: Command '['/home/ray/anaconda3/bin/python', '-']' returned non-zero exit status 1.

It seems the node is running out of memory, and the dashboard process (21.37 GiB of the 30.57 GB node) is the main culprit.
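
For reference, the per-process table above can be reproduced on a node with something like the following psutil-based snippet (illustrative only; this is not the memory monitor's exact code):

# Rough equivalent of the "top 10 memory consumers" table printed by
# ray._private.memory_monitor above. Illustrative sketch, not the
# monitor's actual implementation.
import psutil

def top_memory_consumers(n=10):
    procs = []
    for proc in psutil.process_iter(["pid", "memory_info", "cmdline"]):
        info = proc.info
        if info["memory_info"] is None:  # process we are not allowed to inspect
            continue
        cmd = " ".join(info["cmdline"] or [])[:100]  # truncate long command lines
        procs.append((info["pid"], info["memory_info"].rss, cmd))
    procs.sort(key=lambda item: item[1], reverse=True)
    print("PID\tMEM\tCOMMAND")
    for pid, rss, cmd in procs[:n]:
        print(f"{pid}\t{rss / 1024 ** 3:.2f}GiB\t{cmd}")

if __name__ == "__main__":
    top_memory_consumers()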

Versions / Dependencies

master

Reproduction script

Run the weekly tests. Example output: https://buildkite.com/ray-project/periodic-ci/builds/2247#309c72ce-6ceb-4e86-9470-3c429fb1bf81
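
For context, the workload keeps a pool of short-lived Ray drivers running in a loop. The sketch below is inferred from the traceback above (`run_driver`, `run_string_as_driver`, `driver_script`, and `f` appear there); the driver script body and the loop details are assumptions, not the actual contents of workloads/many_drivers.py:

# Simplified sketch of the many_drivers workload, reconstructed from the
# traceback above. Not the actual test code.
import subprocess
import sys

import ray

driver_script = """
import ray
ray.init(address="auto")

@ray.remote
def f():
    return 1

assert ray.get(f.remote()) == 1
"""

def run_string_as_driver(script):
    # Pipe the script into a fresh Python interpreter; a non-zero exit
    # status raises subprocess.CalledProcessError, as seen in the log.
    result = subprocess.run(
        [sys.executable, "-"],
        input=script,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

@ray.remote
def run_driver():
    output = run_string_as_driver(driver_script)
    return output

if __name__ == "__main__":
    ray.init(address="auto")
    in_flight = [run_driver.remote() for _ in range(10)]
    while True:
        # Wait for one driver to finish, surface its error (if any), and
        # immediately launch a replacement so the number of concurrent
        # drivers stays constant.
        [ready_id], in_flight = ray.wait(in_flight, num_returns=1)
        ray.get(ready_id)
        in_flight.append(run_driver.remote())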

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 36 (33 by maintainers)

Most upvoted comments

Lowering the priority to P1. The reason the test failed is that master is faster after this PR.

The memory leak still exists in the dashboard. But without this PR, every worker basically has to import a lot of modules, which slows down driver performance.

We can see this from the logs. Without the PR:

Iteration 2221:
  - Iteration time: 91.22895789146423.
  - Absolute time: 1639977131.3817475.
  - Total elapsed time: 82505.75467967987.
Iteration 2222:
  - Iteration time: 89.64419484138489.
  - Absolute time: 1639977221.0259423.
  - Total elapsed time: 82595.39887452126.
Iteration 2223:
  - Iteration time: 70.22041773796082.
  - Absolute time: 1639977291.24636.
  - Total elapsed time: 82665.61929225922.
Iteration 2224:
  - Iteration time: 141.53751230239868.
  - Absolute time: 1639977432.7838724.
  - Total elapsed time: 82807.15680456161.

Each iteration takes a long time to finish; within 24 hours, the test only completed about 2k iterations.

With this PR, it looks like this:

Iteration 4001:
  - Iteration time: 0.95485520362854.
  - Absolute time: 1643077687.9250314.
  - Total elapsed time: 26973.94828104973.
Iteration 4002:
  - Iteration time: 20.370080947875977.
  - Absolute time: 1643077708.2951124.
  - Total elapsed time: 26994.318361997604.
Iteration 4003:
  - Iteration time: 4.543852806091309.
  - Absolute time: 1643077712.8389652.
  - Total elapsed time: 26998.862214803696.
Iteration 4004:
  - Iteration time: 1.0939826965332031.
  - Absolute time: 1643077713.9329479.
  - Total elapsed time: 26999.95619750023.

This is after about 8 hours, and it has already run 4k iterations.

The reason the slope of the memory consumption curve gets smaller over time without the PR is that the test itself runs more and more slowly as time goes on.

There are a couple of options here (a sketch of the second one follows the list):

  • remove the expensive fields from the actor info (I'll probably do this)
  • limit the number of retained actors (we lose the actor history over time)
  • store the info in a disk-based DB (a long-term plan, maybe)
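
To make the second option concrete, here is a minimal sketch of a size-bounded actor table that evicts the oldest entries (hypothetical; the class name, the `MAX_ACTORS` value, and the eviction policy are assumptions, not the dashboard's actual data structure):

# Hypothetical sketch of option 2: cap how many actor entries the
# dashboard keeps, evicting the oldest ones first. This is where the
# actor history would be lost, as noted above.
from collections import OrderedDict

MAX_ACTORS = 10_000  # illustrative limit

class BoundedActorTable:
    def __init__(self, max_entries=MAX_ACTORS):
        self._max_entries = max_entries
        self._actors = OrderedDict()  # actor_id -> actor info dict

    def put(self, actor_id, info):
        if actor_id in self._actors:
            self._actors.move_to_end(actor_id)  # refresh recency on update
        self._actors[actor_id] = info
        while len(self._actors) > self._max_entries:
            self._actors.popitem(last=False)  # drop the oldest entry

    def get(self, actor_id):
        return self._actors.get(actor_id)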

I disabled the actor info in the dashboard ad hoc, and the issue is still there. If we trust tracemalloc, then the leak has to be in the C++ layer. I'll profile the memory there.

[attached plot: memory usage over time]

It seems like I was wrong. The memory just increases slowly.

The weird thing here is that the tracemalloc snapshot only claims about 700 MB of memory used.

One thing that is never released is the actor info in the dashboard, but based on my calculation it shouldn't use more than 1 GB.
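
For reference, the ~700 MB number comes from a tracemalloc snapshot taken roughly like the sketch below (generic tracemalloc usage, not the dashboard's actual instrumentation). tracemalloc only sees allocations made through Python's allocator, which is why a leak in the C++ layer would not show up in it:

# Generic tracemalloc usage for attributing Python-level allocations.
# Memory allocated by C/C++ extension code is invisible to it.
import tracemalloc

tracemalloc.start(25)  # keep 25 frames per allocation for better attribution

# ... run the dashboard / workload for a while ...

snapshot = tracemalloc.take_snapshot()
total_bytes = sum(stat.size for stat in snapshot.statistics("filename"))
print(f"Total traced memory: {total_bytes / 1024 ** 2:.1f} MiB")

# Top allocation sites by line.
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)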

Description?

Sorry, this ticket was created during standup just as a reminder; the description was planned to be added later.

@iycheng Not a big deal, but we already had an issue open for this! In the future, don’t forget to check for existing issues before opening a new one.

Sure, I will.


Overall it looks like a memory leak in the dashboard. The plan here is to do some profiling and track the memory usage.
