grpc: Ruby gRPC processes get stuck during termination in gpr_cv_wait()

We are running a Rails application with multiple unicorn processes and observed that after upgrading from gRPC 1.1.2 to 1.2.2, many processes were unable to terminate. It appears to be a deadlock condition in gpr_cv_wait():

```
Thread 1 (Thread 0x7f8ce75e5780 (LWP 20468)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f8cca856fa2 in gpr_cv_wait () from /opt/gitlab/embedded/lib/ruby/gems/2.3.0/gems/grpc-1.2.2-x86_64-linux/src/ruby/lib/grpc/2.3/grpc_c.so
#2  0x00007f8cca825272 in ?? () from /opt/gitlab/embedded/lib/ruby/gems/2.3.0/gems/grpc-1.2.2-x86_64-linux/src/ruby/lib/grpc/2.3/grpc_c.so
#3  0x00007f8cca8252ec in ?? () from /opt/gitlab/embedded/lib/ruby/gems/2.3.0/gems/grpc-1.2.2-x86_64-linux/src/ruby/lib/grpc/2.3/grpc_c.so
#4  0x00007f8ce6d3ea91 in run_final (objspace=0x7f8ce5a16000, zombie=140242726493960) at gc.c:2675
#5  finalize_list (objspace=objspace@entry=0x7f8ce5a16000, zombie=140242726493960) at gc.c:2691
#6  0x00007f8ce6d4a807 in rb_objspace_call_finalizer (objspace=0x7f8ce5a16000) at gc.c:2839
#7  rb_gc_call_finalizer_at_exit () at gc.c:2764
#8  0x00007f8ce6d29a8f in ruby_finalize_1 () at eval.c:131
#9  ruby_cleanup (ex=<optimized out>) at eval.c:222
#10 0x00007f8ce6d29d05 in ruby_run_node (n=0x7f8ce5069950) at eval.c:302
#11 0x000000000040086b in main (argc=10, argv=0x7ffde01903b8) at main.c:36
(gdb)
```

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

@stanhu 1.2.2 introduced a background thread that is involved in the life cycle of gRPC channel objects (which stubs use), and it is currently started as soon as require 'grpc' runs. What appears to be breaking is that this thread starts in the parent before the fork, but it is not present in the child after the fork, because fork() copies only the calling thread into the child.

For example, while the snippet below used to terminate, it now hangs:

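```ruby
require 'grpc'
require 'helloworld_services_pb'  # generated from the helloworld example protos

def run
  stub = Helloworld::Greeter::Stub.new('localhost:50051', :this_channel_is_insecure)
  user = ARGV.size > 0 ? ARGV[0] : 'world'
  message = stub.say_hello(Helloworld::HelloRequest.new(name: user)).message
  p "Greeting: #{message}"
end

def main
  # require 'grpc' has already started the background thread in the parent;
  # the forked child does not inherit that thread.
  pid = fork { run }
  Process.wait(pid)
end

main
```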

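
The underlying behavior is easy to see without gRPC at all: only the thread that calls fork exists in the child. A minimal sketch in plain Ruby, where the `Thread.new { sleep }` thread is only a stand-in for gRPC's background thread and `thread_counts_across_fork` is a hypothetical helper:

```ruby
# Returns [thread count in the parent, thread count in the forked child].
def thread_counts_across_fork
  background = Thread.new { sleep }  # stand-in for a library's background thread

  reader, writer = IO.pipe
  pid = fork do
    reader.close
    # Only the thread that called fork was copied into this process.
    writer.puts(Thread.list.size)
    writer.close
    exit!(0)  # skip at_exit handlers inherited from the parent
  end
  writer.close
  child_count = Integer(reader.read.strip)
  reader.close
  Process.wait(pid)

  parent_count = Thread.list.size  # the background thread is still alive here
  background.kill
  background.join
  [parent_count, child_count]
end

parent, child = thread_counts_across_fork
puts "parent sees #{parent} threads, child sees #{child}"
```

The child always reports a single thread, so any code in the child that waits for the parent's background thread to do something will wait forever.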

FWIW, we're experiencing a similar problem with 1.4.0, but with the PHP extension running under PHP-FPM (a forking process manager for PHP). It seems to get stuck in exactly the same place as the original report:

```
(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fa863ea1b9a in gpr_cv_wait () from /usr/lib/php/20151012/grpc.so
#2  0x00007fa863ecf0d3 in stop_threads () from /usr/lib/php/20151012/grpc.so
#3  0x00007fa863ecf132 in grpc_timer_manager_shutdown () from /usr/lib/php/20151012/grpc.so
#4  0x00007fa863ec1e98 in grpc_iomgr_shutdown () from /usr/lib/php/20151012/grpc.so
#5  0x00007fa863ea340b in grpc_shutdown () from /usr/lib/php/20151012/grpc.so
#6  0x00007fa863e9a3a7 in zm_shutdown_grpc () from /usr/lib/php/20151012/grpc.so
#7  0x000056471ab96207 in module_destructor (module=module@entry=0x56471c4110e0) at /usr/src/builddir/Zend/zend_API.c:2503
#8  0x000056471ab8e2cc in module_destructor_zval (zv=<optimized out>) at /usr/src/builddir/Zend/zend.c:620
#9  0x000056471aba1659 in _zend_hash_del_el_ex (prev=<optimized out>, p=<optimized out>, idx=<optimized out>, ht=<optimized out>) at /usr/src/builddir/Zend/zend_hash.c:1026
#10 _zend_hash_del_el (p=0x56471c4af220, idx=28, ht=0x56471b039d60 <module_registry>) at /usr/src/builddir/Zend/zend_hash.c:1050
#11 zend_hash_graceful_reverse_destroy (ht=ht@entry=0x56471b039d60 <module_registry>) at /usr/src/builddir/Zend/zend_hash.c:1506
#12 0x000056471ab9462c in zend_destroy_modules () at /usr/src/builddir/Zend/zend_API.c:1982
#13 0x000056471ab8f365 in zend_shutdown () at /usr/src/builddir/Zend/zend.c:856
#14 0x000056471ab2e79b in php_module_shutdown () at /usr/src/builddir/main/main.c:2360
#15 0x000056471aa13c85 in main (argc=475908082, argv=0x56471c5dc799) at /usr/src/builddir/sapi/fpm/fpm/fpm_main.c:2021
```
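
The stop_threads() and gpr_cv_wait() frames suggest a shutdown path that waits on a condition variable until background threads have exited. The following is a toy Ruby model of that failure mode, not gRPC's actual code: `ToyTimerManager` and `shutdown_after_fork` are hypothetical. The live-thread counter is copied into the child by fork, but the thread that would decrement it is not, so an unbounded wait would never return; the sketch uses a timeout so that it terminates.

```ruby
# Toy model: shutdown waits on a condition variable until every
# background thread has checked out. Not gRPC's implementation.
class ToyTimerManager
  def initialize
    @lock = Mutex.new
    @cv = ConditionVariable.new
    @live_threads = 0
    @stopping = false
  end

  def start_thread
    @lock.synchronize { @live_threads += 1 }
    Thread.new do
      @lock.synchronize do
        @cv.wait(@lock) until @stopping  # park until shutdown is requested
        @live_threads -= 1               # check out
        @cv.broadcast
      end
    end
  end

  # Returns true if all threads checked out within `timeout` seconds.
  def shutdown(timeout)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    @lock.synchronize do
      @stopping = true
      @cv.broadcast
      while @live_threads > 0
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        return false if remaining <= 0
        @cv.wait(@lock, remaining)
      end
      true
    end
  end
end

# Returns [child shutdown succeeded?, parent shutdown succeeded?].
def shutdown_after_fork
  manager = ToyTimerManager.new
  worker = manager.start_thread
  Thread.pass until worker.status == "sleep"  # let the worker park first

  pid = fork do
    # The counter still says one thread is live, but that thread was not
    # copied into the child, so nothing will ever decrement it.
    exit!(manager.shutdown(0.5) ? 0 : 1)
  end
  Process.wait(pid)
  child_ok = $?.exitstatus.zero?

  parent_ok = manager.shutdown(5)
  worker.join
  [child_ok, parent_ok]
end

p shutdown_after_fork
```

In the parent, shutdown completes because the worker is there to check out; in the child, the same call can only give up after the timeout, which is the situation a plain, untimed gpr_cv_wait() would turn into a hang.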

The latest release, 1.3.4, fixes the hang in the finalizer of a channel object created after a fork.
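
One way to avoid depending on a thread started before the fork, in the spirit of the delayed initialization mentioned in #10670, is to start the background thread lazily on first use and to start a fresh one whenever the current process ID differs from the one that started it. This is a hypothetical sketch (`LazyPoller` is not part of grpc), not a description of what 1.3.4 actually does.

```ruby
# Hypothetical lazy, fork-aware starter for a background thread.
module LazyPoller
  @lock = Mutex.new
  @thread = nil
  @owner_pid = nil

  def self.ensure_started
    @lock.synchronize do
      # Start on first use, or again if we are now running in a forked child.
      if @thread.nil? || @owner_pid != Process.pid || !@thread.alive?
        @owner_pid = Process.pid
        @thread = Thread.new { loop { sleep 0.05 } }  # stand-in for real polling work
      end
      @thread
    end
  end
end

# Returns [reused in parent?, fresh live thread in child?].
def lazy_start_demo
  first = LazyPoller.ensure_started
  same_in_parent = LazyPoller.ensure_started.equal?(first)

  reader, writer = IO.pipe
  pid = fork do
    reader.close
    child_thread = LazyPoller.ensure_started
    writer.puts(child_thread.alive? && !child_thread.equal?(first))
    writer.close
    exit!(0)
  end
  writer.close
  fresh_in_child = reader.read.strip == "true"
  reader.close
  Process.wait(pid)
  [same_in_parent, fresh_in_child]
end

p lazy_start_demo
```

Checking the PID, rather than only whether a thread object exists, matters because the parent's thread object is still referenced in the child's memory even though the thread itself is gone.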

The remaining problems with forking mentioned here are most likely due to gRPC's general lack of fork support. Closing this now, since fork support is discussed further in https://github.com/grpc/grpc/issues/8798

@apolcyn Would it make sense, in addition to the delayed initialization of the background thread introduced in #10670, to also have an explicit way to perform this initialization?

It’s nice that the library does the right thing automatically most of the time, but without an explicit ‘GRPC.after_fork!’ method call it is harder to keep this mistake (failing to initialize a background thread in a fork) from sneaking back into the code.
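
A rough sketch of what such an explicit hook could look like, assuming a hypothetical `ForkSafeState` module (not a grpc API; ‘GRPC.after_fork!’ is only being proposed here). It records the PID that owns its state and raises if it is used in a forked child before `after_fork!` is called, turning a silent hang into a loud error. `child_exit_status` is likewise a hypothetical helper.

```ruby
# Hypothetical fork guard with an explicit after-fork hook.
module ForkSafeState
  class NotResetAfterFork < StandardError; end

  @owner_pid = Process.pid
  @channel = nil

  def self.channel
    unless @owner_pid == Process.pid
      raise NotResetAfterFork, "call ForkSafeState.after_fork! in the child first"
    end
    @channel ||= Object.new  # stand-in for GRPC::Core::Channel.new(...)
  end

  # Meant to be called explicitly in every forked child.
  def self.after_fork!
    @owner_pid = Process.pid
    @channel = nil
  end
end

# Runs the block in a forked child and returns the child's exit status.
def child_exit_status
  pid = fork do
    begin
      yield
      exit!(0)
    rescue ForkSafeState::NotResetAfterFork
      exit!(2)
    rescue StandardError
      exit!(1)
    end
  end
  Process.wait(pid)
  $?.exitstatus
end

ForkSafeState.channel  # created in the parent, i.e. before forking
forgot_reset = child_exit_status { ForkSafeState.channel }
did_reset = child_exit_status do
  ForkSafeState.after_fork!
  ForkSafeState.channel
end
puts "without after_fork!: #{forgot_reset}, with after_fork!: #{did_reset}"
```

In a unicorn deployment the explicit call would naturally go in the after_fork block of the unicorn configuration file, so every worker resets its state before serving requests.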

For example, in GitLab we initialize certain global variables while loading the application, i.e. before forking. This is because we use a dual concurrency model: pre-forking for web requests and multi-threading for background jobs. Initializing globals while loading the app is a trick we use for the multi-threaded mode.

This means we also call GRPC::Core::Channel.new before forking. Does that ruin the fix in #10670? If we had an explicit GRPC.after_fork! (or something like it) I would not have to ask the question.