grpc: segfault in python `1.31.0`
What version of gRPC and what language are you using?
Python 1.31.0
What operating system (Linux, Windows,…) and version?
Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 x86_64 x86_64 GNU/Linux
What runtime / compiler are you using (e.g. python version or version of gcc)?
Python 3.5.9
What did you do?
Upgraded to 1.31.0
What did you expect to see?
No segfault
What did you see instead?
We started seeing segfaults in both our celery workers and our uWSGI logs:
!!! uWSGI process 20 got Segmentation Fault !!!
*** backtrace of 20 ***
uwsgi(uwsgi_backtrace+0x2a) [0x56079b42b2ea]
uwsgi(uwsgi_segfault+0x23) [0x56079b42b6d3]
/lib/x86_64-linux-gnu/libc.so.6(+0x3efd0) [0x7fa99c451fd0]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x270245) [0x7fa987cda245]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x267473) [0x7fa987cd1473]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x268212) [0x7fa987cd2212]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x16e6a4) [0x7fa987bd86a4]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x26cc84) [0x7fa987cd6c84]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x281fcb) [0x7fa987cebfcb]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x25ccc8) [0x7fa987cc6cc8]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa99e5b36db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa99c534a3f]
*** end of backtrace ***
Tue Aug 11 09:44:16 2020 - uWSGI worker 3 screams: UAAAAAAH my master disconnected: i will kill myself !!!
Tue Aug 11 09:44:16 2020 - uWSGI worker 4 screams: UAAAAAAH my master disconnected: i will kill myself !!!
Segmentation fault (core dumped)
We tested 1.30.0 and did not observe the segfaults anymore.

When this happens with our Celery workers, what seems to trigger the segfault is the worker being restarted after it has executed the maximum number of tasks specified by `--maxtasksperchild`, for example: `celery -A the_app worker -Q a_queue --concurrency=1 --maxtasksperchild=100 -l info`

I can't currently provide cleaned-up code for you to reproduce, but I believe any code making gRPC calls should trigger this after enough time.
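For illustration, a hypothetical sketch of the shape of such a setup; the task body, broker URL, queue, and RPC method below are invented placeholders, not the reporter's code:

```python
# the_app.py -- hypothetical Celery worker whose tasks make gRPC calls.
import grpc
from celery import Celery

app = Celery("the_app", broker="redis://localhost:6379/0")  # broker URL is an assumption

@app.task
def call_backend() -> None:
    # Each task opens a channel and issues one RPC. With --maxtasksperchild=100
    # the worker child process is recycled after 100 tasks, which is when the
    # segfault was observed.
    with grpc.insecure_channel("localhost:50051") as channel:
        ping = channel.unary_unary("/example.Echo/Ping")  # hypothetical method path
        ping(b"", timeout=5)

# Run with:
#   celery -A the_app worker -Q a_queue --concurrency=1 --maxtasksperchild=100 -l info
```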
Commits related to this issue
- downgrade grpcio to 1.30.0 following https://github.com/grpc/grpc/issues/23796 — committed to getsentry/sentry by joshuarli 3 years ago
- Revert "downgrade grpcio to 1.30.0 following https://github.com/grpc/grpc/issues/23796" This reverts commit 2e6c7256d67059963db4d3e9493704b925695ebd. — committed to getsentry/sentry by joshuarli 3 years ago
- Add additional dependency `grpcio<1.31.0` Current version of Airflow image uses grpcio==1.31.0, which causes segfaults: b/174948982. Depedency added to allow only versions up to 1.30.0. It hasn't be... — committed to GoogleCloudPlatform/composer-airflow by a-googler 4 years ago
- Explicitly set multiprocessing to use spawn not fork. Enable all db and flow tests. On linux, multiprocessing's default is fork, which causes gRPC to fail because its default polling mechanism is epol... — committed to meadowdata/meadowflow by kurtschelfthout 2 years ago
It looks like the default polling strategy changed from `epoll1` in v1.30 to `epollex` in v1.31. Unfortunately `epollex` doesn't have fork support, so it doesn't work with Celery: "Fork support is only compatible with the epoll1 and poll polling strategies". We were able to work around this issue by setting `GRPC_POLL_STRATEGY=epoll1` on our Celery workers.

We have had this issue too and this thread was helpful for us. We have a workaround, so I thought I would provide details in the hope that it helps. What we did was, as suggested, to set `GRPC_POLL_STRATEGY=epoll1` on our workers that use multiprocessing with `fork` on Linux-based images. As a long-term strategy, we have been gradually getting rid of `fork` in general (it was the cause of many other issues). This happened on workers running Python 3.7.6. We have also found that workers running on Python 3.8.6 do not see this issue; however, we have seen errors like these instead, which is similar to what @cpaulik is seeing.
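A minimal sketch of the `GRPC_POLL_STRATEGY=epoll1` workaround described in the two comments above, assuming the variable has to be set before `grpc` is first imported, so it lives at the very top of the worker entry point (or in the worker's environment):

```python
# Force the epoll1 poller for processes that will fork (e.g. Celery prefork
# workers). Assumption: this must happen before grpc is imported anywhere.
import os

os.environ.setdefault("GRPC_POLL_STRATEGY", "epoll1")

import grpc  # noqa: E402  -- imported only after the poll strategy is fixed
```

Setting `GRPC_POLL_STRATEGY=epoll1` in the worker's environment (systemd unit, container spec, etc.) achieves the same without touching code.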
If I had to guess, I would naively suspect that both the parent and child processes are attempting to use the same socket connection (as `fork`ing provides a copy), resulting in data corruption. Not sure if this helps, or if this is the wrong direction.

Hi, 2022 checking in. It appears that gRPC doesn't work with any kind of pre-fork model system (e.g. uWSGI) unless you specify the environment variable noted above. Perhaps a solution would be to add some documentation?
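Related to the longer-term strategy of moving away from `fork` mentioned above (and to the meadowflow commit listed earlier), a hedged sketch of using the `spawn` start method so child processes never inherit the parent's gRPC threads and epoll state; the worker body and address are placeholders:

```python
# Use the "spawn" start method so each worker starts a fresh interpreter
# instead of inheriting gRPC state via fork (the default on Linux).
import multiprocessing as mp

def worker(task_id: int) -> None:
    # Import grpc inside the child so every process builds its own channel.
    import grpc
    channel = grpc.insecure_channel("localhost:50051")  # hypothetical address
    # ... make RPCs here ...
    channel.close()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```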
We're using a Flask application with the Google Pub/Sub emulator and the Google Datastore emulator (both communicating via gRPC through client libraries) to run tests. With grpcio==1.31.0 we get segmentation fault errors, but fortunately with 1.30.0 it works ok. Not sure what the issue could be. We're on Python 3.7.5.
This is happening to me too with grpcio-1.33.2-cp36-cp36m-manylinux2014_x86_64 on Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-1030-aws x86_64). I get constant segfaults. Downgrading to 1.30.0 solves the problem.
I have the same problem. After upgrading the Python dependencies in our project, our Kubernetes pods running Celery workers started restarting. There were no logs, no tracebacks, no exceptions. We enabled the Python faulthandler (https://blog.richard.do/2018/03/18/how-to-debug-segmentation-fault-in-python/) and got different tracebacks, not connected with gRPC, in different parts of the code. By excluding the latest updates in the working branch step by step, we found that the grpc package was the source of the segfaults. Downgrading to 1.30.0 solved the issue.
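For anyone else chasing crashes with no Python traceback, the stdlib `faulthandler` module mentioned above can be enabled in one line:

```python
# Print a Python-level traceback when the process receives SIGSEGV, SIGFPE,
# SIGABRT, or SIGBUS (written to stderr by default).
import faulthandler

faulthandler.enable()
```

The same effect is available without code changes by setting `PYTHONFAULTHANDLER=1` in the environment or running `python -X faulthandler`.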
Same here. Python multiprocessing.Process died randomly, leaving a segfault message in dmesg like `[6558262.243117] grpc_global_tim[42632]: segfault at 7f172819d038 ip 00007f1789caff59 sp 00007f1739bb6c40 error 4 in cygrpc.cpython-36m-x86_64-linux-gnu.so[7f1789a60000+5e8000]`. Downgrading to 1.30.0 solves it.

Installing `grpcio-1.33.2-cp36-cp36m-manylinux2010_x86_64.whl` does not help, still getting the same error.

After deploying `1.30.0` on our production systems, the number of segfaults went down to 0. So, although I was able to create a segfault with `1.30.0` on my own machines, it does seem `1.30.0` is more stable than `1.31.0`.
Note: the `epollex` poller was removed in gRPC 1.46.0, released in May 2022. So I don't think the suggestion above to force the poller to `epoll1` is still relevant today, as `epoll1` is now the default.

Thank you so much @jrmlhermitte, I've been trying to find a solution to this segfault issue for a very long time. This worked like a charm.
We faced the exact same issue: multiprocessing with gRPC leading to segmentation faults. We were using `1.34.0` in our case, but downgrading to `1.30.0` resolved the issue. We haven't tried it with 1.32.0 and 1.33.0 yet.
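Putting the poller notes together, a hedged sketch that applies the `GRPC_POLL_STRATEGY` workaround only on grpcio builds old enough to still ship `epollex` (the 1.46.0 cutoff comes from the note above); package metadata is used so nothing gRPC-related is imported before the decision:

```python
# Version-gated variant of the epoll1 workaround: skip it on grpcio >= 1.46.0,
# where epollex is gone and epoll1 is already the default.
import os
from importlib.metadata import version  # stdlib, Python 3.8+

major, minor = (int(part) for part in version("grpcio").split(".")[:2])
if (major, minor) < (1, 46):
    os.environ.setdefault("GRPC_POLL_STRATEGY", "epoll1")

import grpc  # noqa: E402
```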