ray: Ray workers unable to register when used with "venv"-created virtual environment on Windows with Python 3.7.3+

What is the problem?

Ray workers are unable to register, causing ray.init to hang forever, when Ray is used in a virtual environment created with the built-in venv module on Windows with Python 3.7.3+.

Reproduction

  1. Install Python 3.7.3+ (3.8 and later are all affected as well)
  2. Create a virtual environment using the built-in venv (in my case I just used PyCharm). Note that environments created using virtualenv are not affected.
  3. Install Ray and run import ray; ray.init(num_cpus=5)

The main clue as to why ray.init hangs can be found in the raylet.out log file. For example, if we start Ray with 5 workers, we'll see the following in the log:

[2021-01-29 16:24:53,703 W 12780 6432] worker_pool.cc:471: Received a register request from an unknown worker 13356
[2021-01-29 16:24:53,710 W 12780 6432] worker_pool.cc:471: Received a register request from an unknown worker 3604
[2021-01-29 16:24:53,728 W 12780 6432] worker_pool.cc:471: Received a register request from an unknown worker 17228
[2021-01-29 16:24:53,733 W 12780 6432] worker_pool.cc:471: Received a register request from an unknown worker 13076
[2021-01-29 16:24:53,744 W 12780 6432] worker_pool.cc:471: Received a register request from an unknown worker 18208
[2021-01-29 16:25:23,028 I 12780 6432] worker_pool.cc:375: Some workers of the worker process(8012) have not registered to raylet within timeout.
[2021-01-29 16:25:23,036 I 12780 6432] worker_pool.cc:375: Some workers of the worker process(12484) have not registered to raylet within timeout.
[2021-01-29 16:25:23,045 I 12780 6432] worker_pool.cc:375: Some workers of the worker process(6292) have not registered to raylet within timeout.
[2021-01-29 16:25:23,053 I 12780 6432] worker_pool.cc:375: Some workers of the worker process(5660) have not registered to raylet within timeout.
[2021-01-29 16:25:23,070 I 12780 6432] worker_pool.cc:375: Some workers of the worker process(13064) have not registered to raylet within timeout.

I.e., the workers are started and attempt to register promptly, but their PIDs are not recognized by the parent process. This suggests that the process that registers is not the one the parent actually created.
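To make the mismatch explicit, the log excerpt above can be checked mechanically. The following is only a small sketch: the regular expressions and the helper name `split_pids` are my own, written against the exact wording of the log lines shown here, and may not match other Ray versions.

```python
import re

# Log lines copied from the raylet.out excerpt in this issue.
LOG = """\
[2021-01-29 16:24:53,703 W 12780 6432] worker_pool.cc:471: Received a register request from an unknown worker 13356
[2021-01-29 16:24:53,710 W 12780 6432] worker_pool.cc:471: Received a register request from an unknown worker 3604
[2021-01-29 16:24:53,728 W 12780 6432] worker_pool.cc:471: Received a register request from an unknown worker 17228
[2021-01-29 16:24:53,733 W 12780 6432] worker_pool.cc:471: Received a register request from an unknown worker 13076
[2021-01-29 16:24:53,744 W 12780 6432] worker_pool.cc:471: Received a register request from an unknown worker 18208
[2021-01-29 16:25:23,028 I 12780 6432] worker_pool.cc:375: Some workers of the worker process(8012) have not registered to raylet within timeout.
[2021-01-29 16:25:23,036 I 12780 6432] worker_pool.cc:375: Some workers of the worker process(12484) have not registered to raylet within timeout.
[2021-01-29 16:25:23,045 I 12780 6432] worker_pool.cc:375: Some workers of the worker process(6292) have not registered to raylet within timeout.
[2021-01-29 16:25:23,053 I 12780 6432] worker_pool.cc:375: Some workers of the worker process(5660) have not registered to raylet within timeout.
[2021-01-29 16:25:23,070 I 12780 6432] worker_pool.cc:375: Some workers of the worker process(13064) have not registered to raylet within timeout.
"""

# Hypothetical patterns, matched against the message text seen above.
UNKNOWN_RE = re.compile(r"register request from an unknown worker (\d+)")
SPAWNED_RE = re.compile(r"worker process\((\d+)\) have not registered")


def split_pids(log_text):
    """Return (PIDs that tried to register, PIDs the raylet was waiting for)."""
    registering = {int(pid) for pid in UNKNOWN_RE.findall(log_text)}
    spawned = {int(pid) for pid in SPAWNED_RE.findall(log_text)}
    return registering, spawned


registering, spawned = split_pids(LOG)
print("PIDs that tried to register:", sorted(registering))
print("PIDs the raylet waited for: ", sorted(spawned))
print("PIDs in both sets:          ", sorted(registering & spawned))
```

For this excerpt the two sets have no PID in common: five processes tried to register, and five different processes were the ones the raylet had launched.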

Let’s take a step back and just run Python from our venv and check Task Manager for the process(es):

./venv/Scripts/python -c "import time; time.sleep(30)"


To our surprise, there are actually two processes running! Wtf? Quick googling leads to this SO post with helpful comments, and the root cause is discussed in detail in Issue38905.

In a nutshell, starting with Python 3.7.3 the behavior of the built-in venv on Windows changed so that it always creates two Python processes, the first being a simple redirector that spawns the base interpreter. The same happens when Ray launches child Python processes (workers): the PID it gets is that of the redirector, not the actual worker, causing worker registration to fail.

The issue only affects virtual environments created with built-in venv, but not the ones created with virtualenv, and only for Python 3.7.3+ (including 3.8 and 3.9, I believe).

It also affects virtual environments created by PyCharm for 3.7.3+ since it’s using built-in venv.

What confused me the most was that I had an old project using Poetry which was affected by this as well. I then installed Poetry on a new machine and everything worked fine… turns out, Poetry switched from using venv to virtualenv in July 2020, so new Poetry users are not affected. Or at least - not until virtualenv is changed to use venv internally and this issue suddenly resurfaces…

A workaround for anyone affected by this is to not use venv but rather use virtualenv to create the virtual environment. If you use Poetry, you may need to upgrade Poetry and re-create your environment.
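If you are not sure which tool created an existing environment, the pyvenv.cfg file in the environment root can give a hint. The sketch below rests on an assumption that is not from this issue: that recent virtualenv releases (20 and later) write a `virtualenv = <version>` line into pyvenv.cfg, while the built-in venv does not. The helper names `parse_pyvenv_cfg` and `classify_cfg` are made up for illustration.

```python
import sys
from pathlib import Path


def parse_pyvenv_cfg(text):
    """Parse the simple 'key = value' lines of a pyvenv.cfg file into a dict."""
    cfg = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' sign
            cfg[key.strip().lower()] = value.strip()
    return cfg


def classify_cfg(text):
    """Guess the creator of an environment from its pyvenv.cfg contents.

    Assumption: virtualenv 20+ records a 'virtualenv' key; built-in venv does not.
    """
    return "virtualenv" if "virtualenv" in parse_pyvenv_cfg(text) else "venv"


# Check the environment this interpreter is running in, if it has a pyvenv.cfg.
cfg_path = Path(sys.prefix) / "pyvenv.cfg"
if cfg_path.is_file():
    print("created by:", classify_cfg(cfg_path.read_text(encoding="utf-8")))
else:
    print("no pyvenv.cfg found under", sys.prefix)
```

Environments created by older virtualenv versions may have no pyvenv.cfg at all, in which case this check says nothing.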

While this venv behavior change definitely seems like a regression to me, it seems like it's here to stay, so it would be great if Ray handled this. You're going to need to pipe the PID from the child process, or something along those lines…
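To illustrate the "pipe the PID from the child" idea, here is a minimal sketch. It is not how Ray launches workers and not Ray's actual fix; the function name `spawn_and_compare` is hypothetical. The child prints its own `os.getpid()` to stdout, and the parent compares that value with the PID it got back from `subprocess.Popen`.

```python
import subprocess
import sys


def spawn_and_compare():
    """Start a child interpreter that reports its own PID over a pipe.

    Returns (PID the parent sees via Popen, PID the child reports for itself).
    """
    child_code = "import os; print(os.getpid())"
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdout=subprocess.PIPE,
        text=True,
    )
    out, _ = proc.communicate(timeout=30)
    return proc.pid, int(out.strip())


launcher_pid, reported_pid = spawn_and_compare()
print("PID seen by parent:   ", launcher_pid)
print("PID reported by child:", reported_pid)
print("same process:         ", launcher_pid == reported_pid)
```

Based on the redirector behavior described above, I would expect the two values to differ inside an affected venv on Windows and to match elsewhere. A parent that trusts the PID reported by the child, rather than the one returned by the launcher, would not be confused by the extra process.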

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

P.S. Symptoms of this issue were first described in #12481, but I decided to raise a new issue with a clear description and title.

ping @richardliaw

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 15 (2 by maintainers)

Most upvoted comments

Thank you for this descriptive explanation. I'm experiencing this same issue when using both venv and virtualenv in Python 3.8.7.

I can confirm.

On Windows, I have Python 3.8.5 and a venv, and installed ray using pip install ray. When executing ray.init() it hangs. Ctrl+C does not work; only closing the console does. It prints the message showing the dashboard address, though.

Then I installed miniconda and updated to Python 3.8.8 and created a conda environment. Then installed ray again using pip install ray. Now executing ray.init() works as expected.

I would love to have it solved since we normally work in venv.

I think this issue has now been fully fixed on master and will be available with the 1.10 release which is to be released in Jan 2022! I’m closing this issue but please let us know if it still happens after 1.10 is released.

We think this is the same issue as https://github.com/ray-project/ray/issues/18951, it should be fixed when https://github.com/ray-project/ray/pull/19014 is merged 😃

Push, same issue here, nothing to add to my predecessors 😉