ray: [Core] Job gets stuck when submitting multiple jobs to the Ray cluster

What happened + What you expected to happen

I interact with the Ray cluster by port forwarding from my local machine. Step 1: Run the following script locally, which submits the jobs via JobSubmissionClient:

from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

for i in range(20):
    job_id = client.submit_job(
        entrypoint="python script.py",
    )
    print(job_id)
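As an aside, the stuck submission can also be identified without the dashboard by polling job statuses from the same client. The following is a minimal sketch of my own (it uses only the public JobSubmissionClient / JobStatus API; the 10-second polling interval is an arbitrary choice):

import time

from ray.job_submission import JobSubmissionClient, JobStatus

client = JobSubmissionClient("http://127.0.0.1:8265")

# Submit the jobs and keep their submission IDs.
job_ids = [client.submit_job(entrypoint="python script.py") for _ in range(20)]

# Poll until every job reaches a terminal state; a job that never leaves
# RUNNING is the stuck one.
TERMINAL = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}
while True:
    statuses = {job_id: client.get_job_status(job_id) for job_id in job_ids}
    print(statuses)
    if all(status in TERMINAL for status in statuses.values()):
        break
    time.sleep(10)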

Step 2: Each submitted job runs the following script.py on the head node (pod):

import ray
import time

# max_calls=1 tells Ray to shut down the worker process after it has executed
# a single invocation of this task, so every call starts a fresh Python worker.
@ray.remote(num_cpus=1, max_calls=1)
def hello_world():
    time.sleep(60)
    return "hello world"

# Inside a submitted job, ray.init() connects the driver to the existing cluster.
ray.init()
print(ray.get(hello_world.remote()))

Step 3: One job stays stuck in the RUNNING status (screenshot from 2023-02-03 13-41-57).

Logs from the head node (image).

There is a gap in the timestamps of the dashboard_agent.log entries for that worker:

2023-02-03 12:57:43,048	INFO job_manager.py:683 -- Head node ID found in GCS; scheduling job driver on head node 039b7cf8253bd6f32069bc7b76140448e7437836a6b9d772a9035f79
2023-02-03 12:57:43,052	INFO web_log.py:206 -- 10.244.0.9 [03/Feb/2023:20:57:43 +0000] "POST /api/job_agent/jobs/ HTTP/1.1" 200 244 "-" "Python/3.7 aiohttp/3.8.3"
2023-02-03 12:58:00,266	WARNING worker.py:1851 -- WARNING: 4 PYTHON worker processes have been started on node: 039b7cf8253bd6f32069bc7b76140448e7437836a6b9d772a9035f79 with address: 10.244.0.9. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
2023-02-03 12:58:04,126	WARNING worker.py:1851 -- WARNING: 6 PYTHON worker processes have been started on node: 039b7cf8253bd6f32069bc7b76140448e7437836a6b9d772a9035f79 with address: 10.244.0.9. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
2023-02-03 12:58:07,926	WARNING worker.py:1851 -- WARNING: 8 PYTHON worker processes have been started on node: 039b7cf8253bd6f32069bc7b76140448e7437836a6b9d772a9035f79 with address: 10.244.0.9. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
2023-02-03 12:58:11,963	WARNING worker.py:1851 -- WARNING: 10 PYTHON worker processes have been started on node: 039b7cf8253bd6f32069bc7b76140448e7437836a6b9d772a9035f79 with address: 10.244.0.9. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
2023-02-03 12:58:16,235	WARNING worker.py:1851 -- WARNING: 12 PYTHON worker processes have been started on node: 039b7cf8253bd6f32069bc7b76140448e7437836a6b9d772a9035f79 with address: 10.244.0.9. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
2023-02-03 13:17:04,143	INFO web_log.py:206 -- 10.244.0.9 [03/Feb/2023:21:17:04 +0000] "GET /logs HTTP/1.1" 200 629 "http://localhost:8265/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
2023-02-03 13:17:05,825	INFO web_log.py:206 -- 10.244.0.9 [03/Feb/2023:21:17:05 +0000] "GET /logs/dashboard_agent.log HTTP/1.1" 200 239 "http://localhost:8265/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
2023-02-03 13:21:44,888	INFO web_log.py:206 -- 10.244.0.9 [03/Feb/2023:21:21:44 +0000] "GET /logs HTTP/1.1" 200 629 "http://localhost:8265/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
2023-02-03 13:21:46,667	INFO web_log.py:206 -- 10.244.0.9 [03/Feb/2023:21:21:46 +0000] "GET /logs/dashboard_agent.log HTTP/1.1" 200 239 "http://localhost:8265/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
2023-02-03 13:22:52,427	INFO web_log.py:206 -- 10.244.0.9 [03/Feb/2023:21:22:52 +0000] "GET /logs HTTP/1.1" 200 629 "http://localhost:8265/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
2023-02-03 13:22:55,041	INFO web_log.py:206 -- 10.244.0.9 [03/Feb/2023:21:22:55 +0000] "GET /logs/debug_state.txt HTTP/1.1" 200 238 "http://localhost:8265/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
2023-02-03 13:22:56,167	INFO web_log.py:206 -- 10.244.0.9 [03/Feb/2023:21:22:56 +0000] "GET /logs HTTP/1.1" 200 629 "http://localhost:8265/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"

Versions / Dependencies

  • Ray 2.2.0
  • Python 3.7.13
  • 2 workers, 1 CPU each
  • 1 head node, 1 CPU

Reproduction script

  • The reproduction script is shown above. I took a look at https://github.com/ray-project/ray/issues/3644, but in this case there are no nested Ray remote functions (the executed function is very simple).

  • Things seem to get worse with the time.sleep(60) in the remote function. I am guessing this is because it leads to more concurrently running jobs?

  • Is it expected that there are 6 jobs in RUNNING status, given that I only have 2 workers and call the remote function with @ray.remote(num_cpus=1, max_calls=1)? (screenshot: running)

  • Is there a way to resolve this issue (prevent jobs from getting stuck) when there are a lot of pending tasks on the head node? A possible client-side workaround is sketched after this list.
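For that last point, one client-side workaround (not a fix for the underlying bug) is to throttle submission so that only a bounded number of jobs are in a non-terminal state at any time. This is a rough sketch of my own; MAX_IN_FLIGHT and the 5-second polling interval are arbitrary choices:

import time

from ray.job_submission import JobSubmissionClient, JobStatus

client = JobSubmissionClient("http://127.0.0.1:8265")

MAX_IN_FLIGHT = 2  # arbitrary cap, roughly matching the 2 worker nodes
TERMINAL = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}
in_flight = []

for i in range(20):
    # Block until fewer than MAX_IN_FLIGHT jobs are still pending or running.
    while True:
        in_flight = [j for j in in_flight if client.get_job_status(j) not in TERMINAL]
        if len(in_flight) < MAX_IN_FLIGHT:
            break
        time.sleep(5)

    job_id = client.submit_job(entrypoint="python script.py")
    in_flight.append(job_id)
    print(job_id)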

Issue Severity

High: It blocks me from completing my task.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 18 (18 by maintainers)

Most upvoted comments

It seems like the job is submitted correctly by the job supervisor, but the driver is not started for some reason. The user has an unconventional setup in which the min~max worker port range is very small (only 19 ports), and I suspect that's the root cause. @YQ-Wang will try with a larger port range (e.g., by widening the min-worker-port / max-worker-port ray start parameters).

This issue has been resolved by #32831. Please feel free to close.

Is it expected that there are 6 jobs in running status given the fact that I only have 2 workers and I call the remote function using @ray.remote(num_cpus=1, max_calls=1)?

Regarding this, I think it’s expected. The job entrypoint script has started running, so the job is considered “RUNNING” even if the remote tasks it launches have not started executing yet.
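To make that distinction concrete, here is an illustrative variant of script.py (my own sketch, not from the issue): the job is reported as RUNNING as soon as the driver reaches ray.init() and the first print, even while hello_world.remote() may still be waiting for a free CPU.

import ray
import time

@ray.remote(num_cpus=1, max_calls=1)
def hello_world():
    time.sleep(60)
    return "hello world"

ray.init()

# From this point on the job already shows up as RUNNING, even though the
# task below may still be queued while it waits for a free CPU.
print("driver started; cluster resources:", ray.cluster_resources())
print("available resources:", ray.available_resources())

ref = hello_world.remote()
print(ray.get(ref))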