grpc: [fork][tracking issue] grpc thread pool hanging on fork

What version of gRPC and what language are you using?

grpc 1.51.1

What operating system (Linux, Windows,…) and version?

macOS Catalina 10.15

What runtime / compiler are you using (e.g. python version or version of gcc)

python 3.8.15

What did you do?

Please provide either 1) a unit test for reproducing the bug or 2) specific steps for us to follow to reproduce the bug. If there's not enough information to debug the problem, the gRPC team may close the issue at their discretion. You're welcome to re-open the issue once you have a reproduction.

  1. Install Ray 2.1.0:
pip install ray==2.1.0
  2. Run a simple workload that repeatedly starts a Ray cluster with two CPUs and launches tasks:
import ray

def run(): 
    ray.init(num_cpus=2)

    @ray.remote
    def g():
        pass

    ray.get(g.remote())
    ray.get(g.remote())

    ray.shutdown()

# Repeatedly start and shut down a local Ray cluster.
for _ in range(10000):
    run()

What did you expect to see?

Script runs ok.

What did you see instead?

Script hanging with

E1213 01:47:43.538549000 4544900544 thread_pool.cc:253]                Waiting for thread pool to idle before forking
E1213 01:47:46.545305000 4544900544 thread_pool.cc:253]                Waiting for thread pool to idle before forking
E1213 01:47:49.551290000 4544900544 thread_pool.cc:253]                Waiting for thread pool to idle before forking
E1213 01:47:52.554189000 4544900544 thread_pool.cc:253]                Waiting for thread pool to idle before forking
E1213 01:47:55.556100000 4544900544 thread_pool.cc:253]                Waiting for thread pool to idle before forking
....

Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs).

See TROUBLESHOOTING.md for how to diagnose problems better.

Anything else we should know about your project / environment?
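
For what it's worth, the hang looks like it lives in gRPC's own fork handling rather than in anything Ray-specific. Below is a rough, Ray-free sketch of the same pattern (create a channel in the parent, then fork); the address is arbitrary, no server needs to be listening, and whether it actually wedges depends on the grpcio version and timing, so treat it as an illustration of the pattern the log line points at rather than a guaranteed reproduction.

import os
import grpc

# Create a channel and kick off a background connection attempt so gRPC's
# worker threads exist in the parent before the fork. No server is required;
# the connection attempt simply fails in the background.
channel = grpc.insecure_channel("localhost:50051")
grpc.channel_ready_future(channel)  # do not wait on it; nothing is listening

pid = os.fork()
if pid == 0:
    # Child: exit immediately. The interesting part is the parent-side prefork
    # handler that logs "Waiting for thread pool to idle before forking".
    os._exit(0)

os.waitpid(pid, 0)
channel.close()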

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 27
  • Comments: 130 (39 by maintainers)

Most upvoted comments

Hello, this is the Google Cloud Storage Python client library team again. This issue has reappeared but is now affecting OS X only (I’m running on Intel, don’t have Apple Silicon handy to test). Our grpc prerelease checker didn’t catch this issue in advance because it runs on Linux; it is impacting released versions.

This happens with both grpcio==1.55.0 and grpcio==1.55.0rc1. Repro steps are as per my previous comments, except with the newer versions of grpcio.

This is a pretty nasty issue; when the process hangs, it hangs hard, doesn't respond to Ctrl-C, and must be killed from outside with kill -9. As previously discussed, the stderr output doesn't clearly indicate which package the issue is coming from, making it very hard for users to debug. Pytest captures stderr and doesn't display it at all by default (it would normally show it after the test fails, but because the test hangs so hard, it never fails); that capturing can be disabled with the -s flag when running pytest.
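
For anyone else hitting this in a test suite, here is a rough sketch of how one might keep such a hang from silently wedging the whole run while still surfacing the C-core output. It assumes the pytest-timeout plugin is installed, and the channel-then-fork body is only a stand-in for whatever actually triggers the hang in your code; the timeout may not be able to interrupt every hard hang, but it at least marks where things stopped.

import os

import grpc
import pytest


@pytest.mark.timeout(60)  # fail (or at least flag) the test instead of wedging the suite
def test_fork_after_grpc_use(capfd):
    # Stand-in workload: touch gRPC in the parent, then fork.
    channel = grpc.insecure_channel("localhost:50051")
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    os.waitpid(pid, 0)
    channel.close()

    # capfd captures at the file-descriptor level, so it also sees lines the
    # cygrpc C extension writes directly to fd 2, bypassing Python logging.
    _, err = capfd.readouterr()
    assert "Waiting for thread pool to idle" not in err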

Quick update here. I have a Python-level test of fork that reproduces the hang and several other issues deterministically. I’ve made several implementation changes that resolve all known fork issues. I will clean up this change and get it merged to master either today or Monday, January 23rd. Once that’s in, I would like your help in testing out the nightly builds to verify that this resolves your issue.

If these changes resolve the issue, I will perform backports to all affected versions of the library. Namely, 1.52, 1.51, 1.50, 1.49, and 1.48.

1.52.0 has been yanked. It seemed to somehow increase the rate of the issues reported in this thread. We’ll backport the fixes mentioned earlier in this thread and release them as 1.52.1 soon.

Just jumping in to say that this issue is now causing some of our users to be unable to use our package since our pin is incompatible with a pin in newer versions of tensorflow.

The offer still stands to do another live debugging session, with a debugger attached to a reproduction of the hang, if that would be at all helpful in getting to the bottom of this (or to run any additional logging against our test suite that reproduces the problem).

As a meta point, the quality control on gRPC is quite lacking from an OSS perspective unfortunately

We’re sorry to hear that. This seems to be a fork-related issue, which we are aware of and are actively working toward a resolution for.

We can for example offer to test your RCs more thoroughly in our CI and communicate failures before the release.

Yes please. We put out RCs before every release in the hope that we will catch such issues. As you point out, this is especially important for Python packages because of how quickly they end up throughout the ecosystem. Please let us know if you need any guidance/help on how to do this.

@gibsondan Sorry for the pause. I’ve been fighting with CI on old branches. M1 support has been backported to 1.47 as 1.47.5. I’ll get back to tracking down the rest of the fork issues in the next day or two.

Can also confirm setting grpcio==1.48.2 and grpcio-status==1.48.2 resolves it for me as well.

I am similarly experiencing this on Mac, but I cannot get an earlier install to work. Despite various attempts (updating setuptools, updating pip, etc.), I always end up with ERROR: Failed building wheel for grpcio … any suggestion for how I can get around this?

What architecture are you on? If it’s ARM, I think macOS ARM isn’t supported yet.

Edit: Yup, no ARM wheels for macOS ☹️ https://pypi.org/project/grpcio/#files, someone will need to provide guidance on building it for your arch.

Yet another edit: I was wrong, https://stackoverflow.com/a/74424435/9852671 claims there’s Apple Silicon support in newer versions. I didn’t see the universal wheels. For older versions you’ll want to take a look in https://github.com/pietrodn/grpcio-mac-arm-build @LukasWallrich.

Experiencing this as well on macOS 13.3.1 (a), Python 3.9.5, grpcio 1.55.0 when calling google.auth.transport.requests.Request() from google.auth, except now it points at line 230:

E0522 17:36:45.653282000 140704439178048 thread_pool.cc:230] Waiting for thread pool to idle before forking

https://github.com/grpc/grpc/pull/32196 will hopefully resolve these issues.

Following those steps, the problem actually occurs a lot more frequently and earlier in the test suite on master than it does in 1.51.1.

We were able to find and hotfix the problem.

In order to fix it, you have to reset the deadlock_graph_mu SpinLock in absl/synchronization/mutex.cc, and absl does not provide such a method out of the box.

I can upstream the fixes, but I see no way for them to be merged.

Thanks! The repo by pietrodn only covers versions past 1.50, so it won't help here … but you pushed me in the right direction. Conda has wheels for 1.48.1 (though not 1.48.2), and with those, the fork errors disappear.

We merged a pull request (https://github.com/grpc/grpc/pull/32935) yesterday that should fix at least some of the Python fork hanging issues. If you're still experiencing the issue, please try out the latest master branch and let us know if the hanging still occurs.

This message arises from the C extension that forms the core I/O library of gRPC. As a result, it is logged directly to stderr, not through the Python logging system.

Additionally, this message does not indicate that it is related to gRPC in any way.

I agree that this could be improved. @drfloob and @veblush to comment on the possibility of somehow making gpr_log calls clearly visible as originating from the gRPC library. Perhaps we could have a global variable that enables all gpr_log invocations to add a prefix?
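
In the meantime, one way to at least attribute the line is to capture it at the descriptor level. Because the message is written by the C core straight to file descriptor 2, Python's logging module never sees it, but duplicating the descriptor does; a minimal sketch (the log file name is arbitrary):

import os

# Redirect file descriptor 2 itself (not just sys.stderr) so that messages the
# cygrpc C extension writes straight to stderr are captured as well. Python's
# own stderr wraps fd 2, so it lands in the same file rather than being lost.
log_fd = os.open("grpc-stderr.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.dup2(log_fd, 2)

import grpc  # any C-core log lines from here on end up in grpc-stderr.log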

the source code for thread_pool.cc does not seem to be public

It is indeed public.

I think it would help a lot if this message was somehow more visible and clearer on what package is responsible and what to do.

Under normal circumstances, a user should not be seeing this message at all. My first priority is to make sure that this case never arises.

@andrewsg I pulled down your repro. I was able to reproduce at 1.52.0 (as expected), but 1.53.0rc2 passed 10/10 times. This looks like it’s been resolved.

I synced with @gibsondan offline. After our various fixes since 1.52.0, the rate of incidence of fork-related issues has dropped low enough that we can go forward with the 1.53.0 release. It does sound like there are some issues that may arise under stress-test-like scenarios, but this should not have a large impact on users.

@aredden Yes, those versions are known to be affected. Can you please try the most recent nightly build or the 1.53 prerelease? I've not been able to reproduce at all with these.

If it helps, I’m also seeing this with the grpcio Python pip package, both the current version 1.51.3 and the yanked 1.52.0. Downgrading to 1.51.1 fixes it and makes the Waiting for thread pool to idle before forking hang go away, like @danw-mpl reported in https://github.com/grpc/grpc/issues/31772#issuecomment-1423869639. macOS 13.2.1 (on an M1), Python 3.9.16.

This repros for me:

git clone git@github.com:snarfed/granary.git
cd granary
source local/bin/activate
pip install -r requirements.txt
pip install grpcio==1.51.3
GAE_ENV=localdev FLASK_ENV=development flask run -p 8080

I saw that a fix landed for this yesterday, so I ran the nightly build through our test suite again. Unfortunately, two of the test runs failed in familiar spots.

Yeah, I didn’t ping about the most recent fix PR because I was aware of the hang. But I have a fix for this hang and I now can’t reproduce anything with either of the two reproduction cases you’ve supplied. I need to bang against it a bit more before putting up a PR but hopefully, that’ll be the last necessary change 🤞

By the way, thank you very much for how responsive you’ve been on this issue. I know it must be frustrating dealing with an issue like this. Without someone keeping us in check with actual usage patterns in the wild, it’s quite difficult to synthesize the proper test cases.

Got it! Here’s the assert:

E0207 20:46:23.064340000 123145545011200 thread_pool.cc:225]           ASSERTION FAILED: state_ == State::kRunning
Process 43178 stopped
* thread #2311, stop reason = signal SIGABRT
    frame #0: 0x00007ff80e89600e libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`:
->  0x7ff80e89600e <+10>: jae    0x7ff80e896018            ; <+20>
    0x7ff80e896010 <+12>: movq   %rax, %rdi
    0x7ff80e896013 <+15>: jmp    0x7ff80e8901c5            ; cerror_nocancel
    0x7ff80e896018 <+20>: retq   

Here’s the stack trace of that thread:

* thread #2311, stop reason = signal SIGABRT
  * frame #0: 0x00007ff80e89600e libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007ff80e8cc1ff libsystem_pthread.dylib`pthread_kill + 263
    frame #2: 0x00007ff80e817d24 libsystem_c.dylib`abort + 123
    frame #3: 0x0000000164c6e21a cygrpc.cpython-38-darwin.so`grpc_core::Crash(absl::lts_20230125::string_view, grpc_core::SourceLocation) + 138
    frame #4: 0x0000000164c6ad7f cygrpc.cpython-38-darwin.so`gpr_assertion_failed + 111
    frame #5: 0x0000000164c6720c cygrpc.cpython-38-darwin.so`grpc_event_engine::experimental::ThreadPool::Quiesce() + 124
    frame #6: 0x0000000164c59137 cygrpc.cpython-38-darwin.so`grpc_event_engine::experimental::PosixEventEngine::~PosixEventEngine() + 151
    frame #7: 0x0000000164c5937e cygrpc.cpython-38-darwin.so`grpc_event_engine::experimental::PosixEventEngine::~PosixEventEngine() + 14
    frame #8: 0x0000000164c4f403 cygrpc.cpython-38-darwin.so`grpc_core::ChannelArgTypeTraits<std::__1::shared_ptr<grpc_event_engine::experimental::EventEngine>, void>::VTable()::'lambda0'(void*)::__invoke(void*) + 51
    frame #9: 0x0000000164c1bdca cygrpc.cpython-38-darwin.so`grpc_core::AVL<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, absl::lts_20230125::variant<int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, grpc_core::ChannelArgs::Pointer> >::Node::~Node() + 154
    frame #10: 0x0000000164c17051 cygrpc.cpython-38-darwin.so`grpc_core::ChannelArgs::~ChannelArgs() + 49
    frame #11: 0x0000000164b57ffe cygrpc.cpython-38-darwin.so`grpc_core::Subchannel::~Subchannel() + 462
    frame #12: 0x0000000164b5808e cygrpc.cpython-38-darwin.so`grpc_core::Subchannel::~Subchannel() + 14
    frame #13: 0x0000000164b5ae5e cygrpc.cpython-38-darwin.so`void absl::lts_20230125::internal_any_invocable::LocalInvoker<false, void, grpc_core::Subchannel::OnConnectingFinishedLocked(absl::lts_20230125::Status)::$_3&>(absl::lts_20230125::internal_any_invocable::TypeErasedState*) + 478
    frame #14: 0x0000000164c5a642 cygrpc.cpython-38-darwin.so`grpc_event_engine::experimental::PosixEventEngine::ClosureData::Run() + 82
    frame #15: 0x0000000164c66fa0 cygrpc.cpython-38-darwin.so`grpc_event_engine::experimental::ThreadPool::Queue::Step() + 464
    frame #16: 0x0000000164c67de8 cygrpc.cpython-38-darwin.so`grpc_event_engine::experimental::ThreadPool::StartThread(std::__1::shared_ptr<grpc_event_engine::experimental::ThreadPool::State>, grpc_event_engine::experimental::ThreadPool::StartThreadReason)::$_0::__invoke(void*) + 248
    frame #17: 0x0000000164c7279c cygrpc.cpython-38-darwin.so`grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::'lambda'(void*)::__invoke(void*) + 140
    frame #18: 0x00007ff80e8cc4e1 libsystem_pthread.dylib`_pthread_start + 125
    frame #19: 0x00007ff80e8c7f6b libsystem_pthread.dylib`thread_start + 15

Here’s the full output of bt all: https://gist.github.com/gibsondan/548d75e11ed8c9e4a7d805a97113b2ce

Wow, ChatGPT of all things may have unblocked me… I should have that stack trace for you shortly, assuming I can successfully reproduce in lldb.

@movy I think your issue is the same as the one in the OP and will be resolved by https://github.com/grpc/grpc/pull/32196. The most recent issue that @gibsondan reported and which we are still tracking down seems separate. My plan is to backport all fork fixes to several different branches as soon as we’re reasonably confident that we have them all wrangled. Previously, I had hoped it would be last week, until @gibsondan reported further issues, which we are still looking into.

Hopefully, we’ll resolve the last of them and do the backports within the next week.

@gibsondan That stacktrace was actually exactly what I needed. Thank you very much for that. I think the issue here is basically what https://github.com/grpc/grpc/pull/31994 was trying to solve, but with slightly different timing. I have an idea for how we might be able to fix the issue.

Thanks @rickyyx. My fix in #31969 removed the most commonly seen cause of that deadlock (an ExecCtx created during a fork event), but I would not be surprised if there were others. Though that change alone should not increase the flake rate.

I don’t think it’s the right solution here, but we can try modifying the thread pool to skip running any callbacks during fork events. I’ll start a discussion on a PR.

@rickyyx You should be able to test with the nightly builds as well.

I think this is the most relevant tracking issue at the moment. We just merged a fix for what seems to be the highest percentage fork offender. Can you please try out the artifacts here to see if it fixes the issue for you?

Edit: I see this report was for 3.8. We don’t build 3.8 artifacts for PRs. I’ll comment again when the master job has run, which should include 3.8 artifacts.