grpc: gRPC client deadlocks if TCP connection fails synchronously

What version of gRPC and what language are you using?

1.14.1, JavaScript

What operating system (Linux, Windows, …) and version?

Debian 8.11 (the node:8 Docker image).

What runtime / compiler are you using (e.g., Python version or version of gcc)?

Node.js 8.11

What did you do?

Please see https://github.com/alexeevg/grpc-deadlock for scripts reproducing the problem. In short, I tried making a gRPC call over IPv6 from within a Docker container that has IPv6 disabled.

const grpc = require('grpc');

const {Calculator} = grpc.load('./proto/math.proto').test;
const client = new Calculator('[::1]:1234', grpc.credentials.createInsecure()); // there is no server listening at [::1]:1234

client.add({a: 1, b: 2}, (err, result) => {
  // the callback is never called, the calling process hangs
});

What did you expect to see?

I expected the client to invoke the callback with a connection error. I also expected the process to remain responsive while the call is in progress: the event loop should not be blocked.
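
For reference, here is a minimal sketch of what I expected to observe, continuing from the snippet above (grpc.status.UNAVAILABLE is the status code, 14, that a failed connection normally surfaces as):

// A periodic timer shows whether the event loop stays responsive while the
// call is pending; it should keep ticking until the callback fires.
const tick = setInterval(() => console.log('event loop is alive'), 1000);

client.add({a: 1, b: 2}, (err, result) => {
  clearInterval(tick);
  // Expected: a prompt error such as err.code === grpc.status.UNAVAILABLE (14),
  // e.g. "Connect Failed", instead of a hang.
  console.log('callback fired:', err ? err.code + ' ' + err.message : result);
});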

What did you see instead?

The Node.js process deadlocked deep inside gRPC core native code. Attaching gdb to the hanging process produced the following stack trace:

(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fba9b8fb479 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fba9b8fb2a0 in __GI___pthread_mutex_lock (mutex=0x4052f18) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007fba992f6319 in gpr_mu_lock (mu=<optimized out>) at ../deps/grpc/src/core/lib/gpr/sync_posix.cc:47
#4  0x00007fba9928047f in connected (arg=0x4052f10, error=0x40541e0)
    at ../deps/grpc/src/core/ext/transport/chttp2/client/chttp2_connector.cc:172
#5  0x00007fba9925c2c1 in exec_ctx_run (error=0x40541e0, closure=<optimized out>)
    at ../deps/grpc/src/core/lib/iomgr/exec_ctx.cc:40
#6  grpc_core::ExecCtx::Flush (this=this@entry=0x7fff4adf27d0) at ../deps/grpc/src/core/lib/iomgr/exec_ctx.cc:128
#7  0x00007fba992e8ffe in ~ExecCtx (this=0x7fff4adf27d0, __in_chrg=<optimized out>)
    at ../deps/grpc/src/core/lib/iomgr/exec_ctx.h:108
#8  custom_connect_callback (socket=<optimized out>, error=0x40541e0)
    at ../deps/grpc/src/core/lib/iomgr/tcp_client_custom.cc:100
#9  0x00007fba9928016a in chttp2_connector_connect (con=0x4052f10, args=0x7fff4adf2940, result=0x4053290,
    notify=0x40532a0) at ../deps/grpc/src/core/ext/transport/chttp2/client/chttp2_connector.cc:215
#10 0x00007fba9928e7dc in continue_connect_locked (c=0x4053260)
    at ../deps/grpc/src/core/ext/filters/client_channel/subchannel.cc:408
#11 maybe_start_connecting_locked (c=0x4053260) at ../deps/grpc/src/core/ext/filters/client_channel/subchannel.cc:483
#12 0x00007fba9928f415 in grpc_subchannel_notify_on_state_change (c=0x4053260, interested_parties=0xdeafbeef,
    state=0x405281c, notify=<optimized out>) at ../deps/grpc/src/core/ext/filters/client_channel/subchannel.cc:532
#13 0x00007fba992d9e07 in grpc_core::SubchannelData<grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelList, grpc_core::(anonymous namespace)::PickFirst::PickFirstSubchannelData>::StartConnectivityWatchLocked (
    this=<optimized out>) at ../deps/grpc/src/core/ext/filters/client_channel/lb_policy/subchannel_list.h:304
#14 0x00007fba992dac70 in StartPickingLocked (this=0x40523d0)
    at ../deps/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc:249
#15 grpc_core::(anonymous namespace)::PickFirst::PickLocked (this=0x40523d0, pick=0x404ceb0)
    at ../deps/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc:270
#16 0x00007fba99289e2a in StartLocked (elem=<optimized out>)
    at ../deps/grpc/src/core/ext/filters/client_channel/client_channel.cc:2687
#17 process_service_config_and_start_lb_pick_locked (elem=0x404cd80)
    at ../deps/grpc/src/core/ext/filters/client_channel/client_channel.cc:2815
#18 grpc_core::ResolverResultWaiter::DoneLocked (arg=0x40514a0, error=0x0)
    at ../deps/grpc/src/core/ext/filters/client_channel/client_channel.cc:2920
#19 0x00007fba992a2332 in grpc_combiner_continue_exec_ctx () at ../deps/grpc/src/core/lib/iomgr/combiner.cc:260
#20 0x00007fba9925c2e4 in grpc_core::ExecCtx::Flush (this=this@entry=0x7fff4adf2b00)
    at ../deps/grpc/src/core/lib/iomgr/exec_ctx.cc:131
#21 0x00007fba992a364a in ~ExecCtx (this=0x7fff4adf2b00, __in_chrg=<optimized out>)
    at ../deps/grpc/src/core/lib/iomgr/exec_ctx.h:108
#22 grpc_custom_resolve_callback (r=0x3fc2bc0, result=0x3fa6e80, error=<optimized out>)
    at ../deps/grpc/src/core/lib/iomgr/resolve_address_custom.cc:85
#23 0x0000000001410b75 in uv__work_done (handle=0x2179ab0 <default_loop_struct+176>)
    at ../deps/uv/src/threadpool.c:251
#24 0x0000000001412c8b in uv__async_io (loop=0x2179a00 <default_loop_struct>, w=<optimized out>,
    events=<optimized out>) at ../deps/uv/src/unix/async.c:118
#25 0x0000000001424d48 in uv__io_poll (loop=loop@entry=0x2179a00 <default_loop_struct>, timeout=25)
    at ../deps/uv/src/unix/linux-core.c:400
#26 0x0000000001413616 in uv_run (loop=0x2179a00 <default_loop_struct>, mode=UV_RUN_DEFAULT)
    at ../deps/uv/src/unix/core.c:368
#27 0x00000000008d443d in node::Start(uv_loop_s*, int, char const* const*, int, char const* const*) ()
#28 0x00000000008cc8fd in node::Start(int, char**) ()
#29 0x00007fba9b567b45 in __libc_start_main (main=0x89b0c0 <main>, argc=2, argv=0x7fff4adf7208, init=<optimized out>,
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff4adf71f8) at libc-start.c:287
#30 0x000000000089b1b1 in _start ()

Running strace -e trace=network -f npm test produced the following log:

77    socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 12
77    setsockopt(12, SOL_TCP, TCP_NODELAY, [1], 4) = 0
77    connect(12, {sa_family=AF_INET, sin_port=htons(1234), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
77    getsockopt(12, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
77    socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 13
77    setsockopt(13, SOL_TCP, TCP_NODELAY, [1], 4) = 0
77    connect(13, {sa_family=AF_INET, sin_port=htons(1234), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
77    getsockopt(13, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
77    socket(PF_INET6, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 14
77    setsockopt(14, SOL_TCP, TCP_NODELAY, [1], 4) = 0
77    connect(14, {sa_family=AF_INET6, sin6_port=htons(1234), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EADDRNOTAVAIL (Cannot assign requested address)

Running the test with GRPC_TRACE=connectivity_state,tcp produced the following log:

D0822 16:19:46.249122621      25 dns_resolver.cc:280]        Start resolving.
I0822 16:19:46.249524376      25 connectivity_state.cc:92]   CONWATCH: 0x3936548 pick_first: get IDLE
I0822 16:19:46.249582095      25 connectivity_state.cc:116]  CONWATCH: 0x3936548 pick_first: from IDLE [cur=IDLE] notify=0x39a2788
I0822 16:19:46.249603812      25 connectivity_state.cc:164]  SET: 0x3921498 client_channel: IDLE --> IDLE [resolver_result] error=(nil) "No Error"
I0822 16:19:46.249648189      25 connectivity_state.cc:116]  CONWATCH: 0x3937360 subchannel: from IDLE [cur=IDLE] notify=0x39122b8
I0822 16:19:46.249664640      25 connectivity_state.cc:164]  SET: 0x3937360 subchannel: IDLE --> CONNECTING [state_change] error=(nil) "No Error"
I0822 16:19:46.249684799      25 connectivity_state.cc:190]  NOTIFY: 0x3937360 subchannel: 0x39122b8
I0822 16:19:46.249761199      25 tcp_client_custom.cc:139]   CLIENT_CONNECT: 0x39354b0 ipv6:[::1]:1234: asynchronously connecting
I0822 16:19:46.249906749      25 tcp_client_custom.cc:69]    CLIENT_CONNECT: ipv6:[::1]:1234: on_alarm: error="Cancelled"

I suppose the problem is that the TCP connection fails immediately with EADDRNOTAVAIL (Cannot assign requested address) and chttp2_connector is not prepared to handle synchronous connection errors. As a result, the connected callback attempts to acquire the mutex protecting the connector at https://github.com/grpc/grpc/blob/v1.14.1/src/core/ext/transport/chttp2/client/chttp2_connector.cc#L172, but the same thread already holds that mutex at https://github.com/grpc/grpc/blob/v1.14.1/src/core/ext/transport/chttp2/client/chttp2_connector.cc#L215, and the mutex is non-recursive, so the thread deadlocks against itself.
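
To illustrate the pattern, here is a minimal, self-contained sketch in plain Node.js (hypothetical names, not the actual gRPC code): the connector takes a lock before starting the TCP connect and assumes the connected callback will only ever run later, after the lock has been released. A synchronous failure runs the callback on the same thread while the lock is still held, and a non-recursive mutex then blocks forever. JavaScript has no blocking mutexes, so the sketch can only detect the re-entry rather than hang on it:

let lockHeld = false; // stands in for the non-recursive connector mutex (gpr_mu)

function connected(err) {
  // Corresponds to the gpr_mu_lock in connected() at chttp2_connector.cc:172.
  if (lockHeld) {
    // In the C core the same thread blocks here forever; in JavaScript we can
    // only detect the re-entry and report it.
    console.error('would deadlock: connector mutex already held');
    return;
  }
  lockHeld = true;
  console.log('connected callback ran, err =', err && err.message);
  lockHeld = false;
}

function connect(failSynchronously) {
  lockHeld = true; // corresponds to the lock taken at chttp2_connector.cc:215
  if (failSynchronously) {
    // Synchronous failure (e.g. EADDRNOTAVAIL): the callback re-enters while
    // the lock is still held.
    connected(new Error('EADDRNOTAVAIL'));
  } else {
    // Normal path: the callback is deferred and runs after the lock is freed.
    setImmediate(() => connected(null));
  }
  lockHeld = false;
}

connect(false); // OK: the callback runs later, after connect() released the lock
connect(true);  // prints "would deadlock": models frames #4 and #9 above

A general fix for this class of bug is to always schedule the connect callback asynchronously (the setImmediate branch above), so it never runs while the caller's lock is held.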

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 16 (5 by maintainers)

Most upvoted comments

@nicolasnoble helped test the fix on Docker, and it works! #16979 has been merged into gRPC Core.

Anyway, I was able to test this patch on grpc-node 1.15.1, and unfortunately it exposes other, similar deadlocks in subchannel.cc. I have posted details at:

https://github.com/grpc/grpc/issues/16427

I am surprised that there are no unit tests that cover these events happening synchronously (something like the sketch below).
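
For example (a hypothetical sketch mirroring the reproduction at the top of this issue, not an existing gRPC test), such a regression test could dial an address whose connect fails synchronously and assert that the callback still fires:

const assert = require('assert');
const grpc = require('grpc');

const {Calculator} = grpc.load('./proto/math.proto').test;

// In a container with IPv6 disabled, connecting to [::1] fails synchronously
// with EADDRNOTAVAIL, which is the trigger for this deadlock.
const client = new Calculator('[::1]:1234', grpc.credentials.createInsecure());

const deadline = Date.now() + 5000; // bound the test in case nothing fires

client.waitForReady(deadline, (err) => {
  // Expected once fixed: a timely error instead of a native-code deadlock.
  assert.ok(err, 'expected a connection error');
  console.log('regression test passed:', err.message);
});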

On Wed, Oct 24, 2018 at 4:37 PM Abhishek Parmar abhishek@carbon3d.com wrote:

I was trying to rebuild from source (what is the right way to do this?) using npm rebuild --build_from_source. For some reason my folder structure was missing the ‘deps’ folder. But I have it now and I am getting errors like:

make: *** No rule to make target ‘Release/obj.target/grpc/deps/grpc/src/cpp/ext/filters/census/grpc_context.o’, needed by ‘Release/obj.target/libgrpc.a’. Stop.

The instructions in the “Build from source” section of https://www.npmjs.com/package/grpc also do not work.

On Wed, Oct 24, 2018 at 2:27 PM Michael Lumish notifications@github.com wrote:

I’m not sure what you’re looking at, but that exact file is definitely distributed with the grpc node module. The simplest way to test this might be to apply that diff directly to that file in the installed package and then rebuild from source.


– Abhishek
