grpc: segfault in python `1.31.0`

What version of gRPC and what language are you using?

Python 1.31.0.

What operating system (Linux, Windows,…) and version?

Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 x86_64 x86_64 GNU/Linux

What runtime / compiler are you using (e.g. python version or version of gcc)

Python 3.5.9

What did you do?

Upgraded to 1.31.0

What did you expect to see?

No segfault

What did you see instead?

We started seeing segfaults both in our Celery workers and in our uWSGI logs:

!!! uWSGI process 20 got Segmentation Fault !!!
*** backtrace of 20 ***
uwsgi(uwsgi_backtrace+0x2a) [0x56079b42b2ea]
uwsgi(uwsgi_segfault+0x23) [0x56079b42b6d3]
/lib/x86_64-linux-gnu/libc.so.6(+0x3efd0) [0x7fa99c451fd0]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x270245) [0x7fa987cda245]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x267473) [0x7fa987cd1473]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x268212) [0x7fa987cd2212]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x16e6a4) [0x7fa987bd86a4]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x26cc84) [0x7fa987cd6c84]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x281fcb) [0x7fa987cebfcb]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x25ccc8) [0x7fa987cc6cc8]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa99e5b36db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa99c534a3f]
*** end of backtrace ***
Tue Aug 11 09:44:16 2020 - uWSGI worker 3 screams: UAAAAAAH my master disconnected: i will kill myself !!!
Tue Aug 11 09:44:16 2020 - uWSGI worker 4 screams: UAAAAAAH my master disconnected: i will kill myself !!!
Segmentation fault (core dumped)

We tested 1.30.0 and did not observe the segfaults anymore.

With our Celery workers, the segfault seems to be triggered when a worker is restarted after executing the maximum number of tasks specified by --maxtasksperchild, e.g. celery -A the_app worker -Q a_queue --concurrency=1 --maxtasksperchild=100 -l info.

I can’t currently provide cleaned-up code to reproduce this, but I believe any code making gRPC calls should trigger it after enough time.
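
Purely for illustration, here is a hypothetical minimal sketch of the shape of code that seems to hit this (the broker URL, addresses, and gRPC service path are placeholders, not our real setup): a channel created at import time in the parent process and then reused inside tasks running in forked Celery workers.

import grpc
from celery import Celery

app = Celery("the_app", broker="redis://localhost:6379/0")  # broker URL is a placeholder

# Channel created at import time, i.e. in the parent before Celery forks its
# worker children; every forked child inherits a copy of the same socket.
channel = grpc.insecure_channel("localhost:50051")
say_hello = channel.unary_unary(
    "/helloworld.Greeter/SayHello",      # placeholder service/method path
    request_serializer=lambda b: b,      # identity serializers: payload is raw bytes
    response_deserializer=lambda b: b,
)

@app.task
def call_backend(payload: bytes) -> bytes:
    # Runs inside a forked worker process and reuses the parent's channel.
    return say_hello(payload, timeout=5)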

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 16
  • Comments: 24 (4 by maintainers)

Most upvoted comments

It looks like the default polling strategy changed from epoll1 in v1.30 to epollex in v1.31. Unfortunately epollex doesn’t have fork support so it doesn’t work with Celery: Fork support is only compatible with the epoll1 and poll polling strategies. We were able to work around this issue by setting GRPC_POLL_STRATEGY=epoll1 on our Celery workers.
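
A minimal sketch of that workaround (assuming the variable has to be set before grpc is first imported, since the poller is chosen at initialization; explicitly enabling the fork handlers is an extra assumption on my part):

import os

# Set the polling strategy before grpc/cygrpc is imported anywhere, e.g. at the
# very top of the Celery worker's entry module.
os.environ.setdefault("GRPC_POLL_STRATEGY", "epoll1")
os.environ.setdefault("GRPC_ENABLE_FORK_SUPPORT", "1")  # assumption: also opting into gRPC's fork handlers

import grpc  # imported only after the environment is configured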

We have had this issue too and this thread was helpful for us. We have a workaround, so I thought I would provide details, in hopes it may help.

What we did, as suggested, was to set GRPC_POLL_STRATEGY=epoll1 on our workers that use multiprocessing with fork on Linux-based images. As a long-term strategy, we have been gradually getting rid of fork in general (it was the cause of many other issues). This happened on workers running Python 3.7.6.

We have also found that workers running on Python 3.8.6 do not see this issue. However, we have seen errors like these instead:

2021-02-01 12:16:49.489 EST "Corruption detected."
2021-02-01 12:16:49.490 EST "error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT"
2021-02-01 12:16:49.490 EST "error:1000008b:SSL routines:OPENSSL_internal:DECRYPTION_FAILED_OR_BAD_RECORD_MAC"
2021-02-01 12:16:49.490 EST " Decryption error: TSI_DATA_CORRUPTED"
2021-02-01 12:16:49.490 EST "SSL_write failed with error SSL_ERROR_SSL."

which is similar to what @cpaulik is seeing.

If I had to guess, I would naively suspect that both the parent and child processes are attempting to use the same socket connection (as forking provides a copy), resulting in data corruption. Not sure if this helps, or if this is the wrong direction.
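
If that guess is right, another mitigation (besides the poll-strategy variable) would be to make sure each forked worker opens its own channel rather than inheriting the parent's. A rough sketch using Celery's worker_process_init signal; the address and names are placeholders, and this is just the direction I would try, not something we run in production:

import grpc
from celery.signals import worker_process_init

channel = None  # one channel per worker process, created after the fork

@worker_process_init.connect
def init_grpc_channel(**kwargs):
    # Runs in each worker child right after it is forked, so the channel (and
    # its underlying socket) is never shared with the parent or siblings.
    global channel
    channel = grpc.insecure_channel("backend:50051")  # placeholder address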

Hi, 2022 checking in. It appears that gRPC doesn’t work with any kind of pre-fork server model (e.g. uWSGI) unless you set the environment variables noted above.

Perhaps a solution would be to add some documentation?

We’re using a Flask application together with the Google Pub/Sub emulator and the Google Datastore emulator (both communicating via gRPC through client libraries) to run tests. With grpcio==1.31.0 we are getting segmentation fault errors but, fortunately, with 1.30.0 it works fine. Not sure what the issue could be. We’re on Python 3.7.5.

This is happening to me too with grpcio-1.33.2-cp36-cp36m-manylinux2014_x86_64 on Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-1030-aws x86_64). I get constant segfaults. Downgrading to 1.30.0 solves the problem.

I have the same problem. After upgrading the Python dependencies in our project, our Kubernetes pods running Celery workers started restarting. There were no logs, no tracebacks, no exceptions. We enabled Python’s faulthandler (https://blog.richard.do/2018/03/18/how-to-debug-segmentation-fault-in-python/) and got various tracebacks in different parts of the code, none obviously connected to gRPC. By reverting the recent updates in the working branch step by step, we found that the grpc package was the source of the segfaults. Downgrading to 1.30.0 solved the issue.
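
For reference, enabling the fault handler is just a couple of lines from the standard library; it prints the Python-level stack of every thread when the process receives a fatal signal such as SIGSEGV:

import faulthandler

# Enable as early as possible, e.g. in the worker's entry module. The same can
# be done without code changes via PYTHONFAULTHANDLER=1 or `python -X faulthandler`.
faulthandler.enable()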

Same here. A Python multiprocessing.Process died randomly, leaving a segfault message in dmesg like [6558262.243117] grpc_global_tim[42632]: segfault at 7f172819d038 ip 00007f1789caff59 sp 00007f1739bb6c40 error 4 in cygrpc.cpython-36m-x86_64-linux-gnu.so[7f1789a60000+5e8000]. Downgrading to 1.30.0 solves it.

Installing grpcio-1.33.2-cp36-cp36m-manylinux2010_x86_64.whl does not help; still getting the same error.

After deploying 1.30.0 on our production systems, the number of segfaults went down to 0.

So, although I was able to create a segfault with 1.30.0 on my own machines, it does seem that 1.30.0 is more stable than 1.31.0.

Note: the epollex poller was removed in gRPC 1.46.0, released in May 2022. So I don’t think the suggestion above to force the poller to epoll1 is still relevant today, as epoll1 is now the default.

Thank you so much @jrmlhermitte, I’ve been trying to find a solution for this segfault issue for a very long time. This worked like a charm.

We faced the exact same issue: multiprocessing with gRPC leading to segmentation faults. We were using 1.34.0 in our case, but downgrading to 1.30.0 resolved the issue. Haven’t tried 1.32.0 or 1.33.0 yet.