grpc: segfault in python `1.31.0`

What version of gRPC and what language are you using?

Python 1.31.0.

What operating system (Linux, Windows,…) and version?

Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 x86_64 x86_64 GNU/Linux

What runtime / compiler are you using (e.g. python version or version of gcc)

Python 3.5.9

What did you do?

Upgraded to 1.31.0

What did you expect to see?

No segfault

What did you see instead?

We started seeing segfaults both in our Celery workers and in our uWSGI logs:

!!! uWSGI process 20 got Segmentation Fault !!!
*** backtrace of 20 ***
uwsgi(uwsgi_backtrace+0x2a) [0x56079b42b2ea]
uwsgi(uwsgi_segfault+0x23) [0x56079b42b6d3]
/lib/x86_64-linux-gnu/libc.so.6(+0x3efd0) [0x7fa99c451fd0]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x270245) [0x7fa987cda245]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x267473) [0x7fa987cd1473]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x268212) [0x7fa987cd2212]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x16e6a4) [0x7fa987bd86a4]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x26cc84) [0x7fa987cd6c84]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x281fcb) [0x7fa987cebfcb]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x25ccc8) [0x7fa987cc6cc8]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa99e5b36db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa99c534a3f]
*** end of backtrace ***
Tue Aug 11 09:44:16 2020 - uWSGI worker 3 screams: UAAAAAAH my master disconnected: i will kill myself !!!
Tue Aug 11 09:44:16 2020 - uWSGI worker 4 screams: UAAAAAAH my master disconnected: i will kill myself !!!
Segmentation fault (core dumped)

We tested 1.30.0 and did not observe the segfaults anymore.

With our Celery workers, the segfault seems to be triggered when a worker is restarted after executing the maximum number of tasks specified by --maxtasksperchild, e.g. celery -A the_app worker -Q a_queue --concurrency=1 --maxtasksperchild=100 -l info.

I can’t currently provide cleaned-up code to reproduce this, but I believe any code making gRPC calls should trigger it after enough time.
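
Purely for illustration, here is a hypothetical minimal sketch of the shape of code that seems to hit this (the broker URL, addresses, and gRPC service path are placeholders, not our real setup): a channel created at import time in the parent process and then reused inside tasks running in forked Celery workers.

import grpc
from celery import Celery

app = Celery("the_app", broker="redis://localhost:6379/0")  # broker URL is a placeholder

# Channel created at import time, i.e. in the parent before Celery forks its
# worker children; every forked child inherits a copy of the same socket.
channel = grpc.insecure_channel("localhost:50051")
say_hello = channel.unary_unary(
    "/helloworld.Greeter/SayHello",      # placeholder service/method path
    request_serializer=lambda b: b,      # identity serializers: payload is raw bytes
    response_deserializer=lambda b: b,
)

@app.task
def call_backend(payload: bytes) -> bytes:
    # Runs inside a forked worker process and reuses the parent's channel.
    return say_hello(payload, timeout=5)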

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 16
  • Comments: 24 (4 by maintainers)

Most upvoted comments

It looks like the default polling strategy changed from epoll1 in v1.30 to epollex in v1.31. Unfortunately epollex doesn’t have fork support so it doesn’t work with Celery: Fork support is only compatible with the epoll1 and poll polling strategies. We were able to work around this issue by setting GRPC_POLL_STRATEGY=epoll1 on our Celery workers.
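
A minimal sketch of that workaround (assuming the variable has to be set before grpc is first imported, since the poller is chosen at initialization; explicitly enabling the fork handlers is an extra assumption on my part):

import os

# Set the polling strategy before grpc/cygrpc is imported anywhere, e.g. at the
# very top of the Celery worker's entry module.
os.environ.setdefault("GRPC_POLL_STRATEGY", "epoll1")
os.environ.setdefault("GRPC_ENABLE_FORK_SUPPORT", "1")  # assumption: also opting into gRPC's fork handlers

import grpc  # imported only after the environment is configured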

We have had this issue too and this thread was helpful for us. We have a workaround, so I thought I would provide details, in hopes it may help.

What we did, as suggested, was to set GRPC_POLL_STRATEGY=epoll1 on our workers that use multiprocessing with fork on Linux-based images. As a long-term strategy, we have been gradually getting rid of fork in general (it was the cause of many other issues). This happened on workers running Python 3.7.6.

We have also found that workers running on Python 3.8.6 do not see this issue. However, we have seen errors like these instead:

2021-02-01 12:16:49.489 EST "Corruption detected."
2021-02-01 12:16:49.490 EST "error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT"
2021-02-01 12:16:49.490 EST "error:1000008b:SSL routines:OPENSSL_internal:DECRYPTION_FAILED_OR_BAD_RECORD_MAC"
2021-02-01 12:16:49.490 EST " Decryption error: TSI_DATA_CORRUPTED"
2021-02-01 12:16:49.490 EST "SSL_write failed with error SSL_ERROR_SSL."

which is similar to what @cpaulik is seeing.

If I had to guess, I would naively suspect that both the parent and child processes are attempting to use the same socket connection (as forking provides a copy), resulting in data corruption. Not sure if this helps, or if this is the wrong direction.
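
If that guess is right, another mitigation (besides the poll-strategy variable) would be to make sure each forked worker opens its own channel rather than inheriting the parent's. A rough sketch using Celery's worker_process_init signal; the address and names are placeholders, and this is just the direction I would try, not something we run in production:

import grpc
from celery.signals import worker_process_init

channel = None  # one channel per worker process, created after the fork

@worker_process_init.connect
def init_grpc_channel(**kwargs):
    # Runs in each worker child right after it is forked, so the channel (and
    # its underlying socket) is never shared with the parent or siblings.
    global channel
    channel = grpc.insecure_channel("backend:50051")  # placeholder address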

Hi, 2022 checking in. It appears that gRPC doesn’t work with any kind of pre-fork server model (e.g. uWSGI) unless you set the environment variables noted above.

Perhaps a solution would be to add some documentation?

We’re using a Flask application together with the Google Pub/Sub emulator and the Google Datastore emulator (both communicating via gRPC through client libraries) to run tests. With grpcio==1.31.0 we are getting segmentation fault errors but, fortunately, with 1.30.0 it works fine. Not sure what the issue could be. We’re on Python 3.7.5.

This is happening to me too with grpcio-1.33.2-cp36-cp36m-manylinux2014_x86_64 on Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-1030-aws x86_64). I get constant segfaults. Downgrading to 1.30.0 solves the problem.

I have the same problem. After upgrading the Python dependencies in our project, our Kubernetes pods running Celery workers started restarting. There were no logs, no tracebacks, no exceptions. We enabled Python’s faulthandler (https://blog.richard.do/2018/03/18/how-to-debug-segmentation-fault-in-python/) and got various tracebacks in different parts of the code, none obviously connected to gRPC. By reverting the recent updates in the working branch step by step, we found that the grpc package was the source of the segfaults. Downgrading to 1.30.0 solved the issue.
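
For reference, enabling the fault handler is just a couple of lines from the standard library; it prints the Python-level stack of every thread when the process receives a fatal signal such as SIGSEGV:

import faulthandler

# Enable as early as possible, e.g. in the worker's entry module. The same can
# be done without code changes via PYTHONFAULTHANDLER=1 or `python -X faulthandler`.
faulthandler.enable()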

Same here. A Python multiprocessing.Process died randomly, leaving a segfault message in dmesg like [6558262.243117] grpc_global_tim[42632]: segfault at 7f172819d038 ip 00007f1789caff59 sp 00007f1739bb6c40 error 4 in cygrpc.cpython-36m-x86_64-linux-gnu.so[7f1789a60000+5e8000]. Downgrading to 1.30.0 solves it.

Installing grpcio-1.33.2-cp36-cp36m-manylinux2010_x86_64.whl does not help; still getting the same error.

After deploying 1.30.0 on our production systems, the number of segfaults went down to 0.

So, although I was able to create a segfault with 1.30.0 on my own machines, it does seem that 1.30.0 is more stable than 1.31.0.

Note: the epollex poller was removed in gRPC 1.46.0, released in May 2022. So I don’t think the suggestion above to force the poller to epoll1 is still relevant today, as epoll1 is now the default.

Thank you so much @jrmlhermitte, I’ve been trying to find a solution for this segfault issue for a very long time. This worked like a charm.

We faced the exact same issue: multiprocessing with gRPC leading to segmentation faults. We were using 1.34.0 in our case, but downgrading to 1.30.0 resolved the issue. Haven’t tried 1.32.0 or 1.33.0 yet.