grpc: ev_poll_posix (Mac OS): fd_orphan fails to remove fd from fork list, postfork_child closes fd that no longer belongs to gRPC, breaks Python subprocess.Popen

What version of gRPC and what language are you using?

$ pip show grpcio Name: grpcio Version: 1.21.1

What operating system (Linux, Windows,…) and version?

Mac OS 10.14.5

What runtime / compiler are you using (e.g. python version or version of gcc)

$ python -V Python 2.7.15

What did you do?

If possible, provide a recipe for reproducing the error. Try being specific and include code snippets if helpful.

  1. Start examples/python/route_guide/route_guide_server.py. The bug reproduces with default options, but to reproduce faster, it helps to set options=[('grpc.max_connection_idle_ms', 10000)].
  2. Modify examples/python/route_guide/route_guide_client.py to add a short delay, then subprocess.Popen call:
def run():
    with grpc.insecure_channel('localhost:50051') as channel:
        stub = route_guide_pb2_grpc.RouteGuideStub(channel)
        print("-------------- GetFeature --------------")
        guide_get_feature(stub)

        time.sleep(20)  # wait for channel to go READY --> IDLE

        devnull = open(os.devnull)
        print('devnull fd: %d' % devnull.fileno())
        try:
            subprocess.Popen(['/bin/echo', 'It worked!'], stdin=devnull)
        except IOError as e:
            print('Exception: ', e)
            print('Child traceback: ', e.child_traceback)
            raise

What did you expect to see?

It worked!

What did you see instead?

Exception:  [Errno 9] Bad file descriptor
Child traceback:  Traceback (most recent call last):
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 973, in _execute_child
    _dup2(p2cread, 0)
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 972, in _dup2
    os.dup2(a, b)
OSError: [Errno 9] Bad file descriptor
Traceback (most recent call last):
  File "route_guide_client.py", line 129, in <module>
    run()
  File "route_guide_client.py", line 120, in run
    subprocess.Popen(['/bin/echo', 'It worked!'], stdin=devnull)
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 394, in __init__
    errread, errwrite)
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1047, in _execute_child
    raise child_exception
OSError: [Errno 9] Bad file descriptor

Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs).

Server log (nothing special):

$ python route_guide_server.py
D0602 21:01:49.878758000 4527408576 ev_posix.cc:174]                   Using polling engine: poll
D0602 21:01:49.878840000 4527408576 dns_resolver_ares.cc:480]          Using ares dns resolver
I0602 21:02:04.501015000 123145540087808 chttp2_transport.cc:1800]     ipv6:[::1]:59007: Sending goaway err={"created":"@1559523724.500976000","description":"max_idle","file":"src/core/ext/filters/max_age/max_age_filter.cc","file_line":259,"http2_error":0}

Client log (“GRPC_VERBOSITY=DEBUG GRPC_TRACE=all,-timer_check python route_guide_client.py”), interesting parts only:

# channel opening, created TCP socket fd 3
I0602 21:01:54.472020000 4501616064 ev_posix.cc:254]                   (fd-trace) fd_create(3, tcp-client:ipv6:[::1]:50051, 1)

# channel has been idle for 10 seconds, server sends GOAWAY
I0602 21:02:09.475174000 123145316352000 chttp2_transport.cc:1157]     ipv6:[::1]:50051: Got goaway [0] err={"created":"@1559523729.475129000","description":"GOAWAY received","file":"src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":1151,"grpc_status":14,"http2_error":0,"raw_bytes":"max_idle"}

# fd 3 shutting down
I0602 21:02:09.479882000 123145316352000 connectivity_state.cc:147]    SET: 0x7fc8fd07e2a8 client_transport: TRANSIENT_FAILURE --> SHUTDOWN [close_transport]
I0602 21:02:09.479894000 123145316352000 ev_posix.cc:274]              (fd-trace) fd_shutdown(3)
I0602 21:02:09.479925000 123145316352000 resource_quota.cc:547]        RU shutdown 0x7fc8fb64c120
I0602 21:02:09.479944000 123145316352000 ev_posix.cc:267]              (fd-trace) grpc_fd_orphan, fd:3 closed

# open('/dev/null') reuses fd 3
devnull fd: 3

# subprocess relies on fd inherited across fork, but grpc postfork thinks
# it still owns the fd and closes it:
I0602 21:02:14.504559000 4501616064 executor.cc:473]                   EXECUTOR Executor::SetThreadingAll(0) called
I0602 21:02:14.504602000 4501616064 executor.cc:154]                   EXECUTOR (default-executor) SetThreading(0) begin
I0602 21:02:14.504633000 123145314742272 executor.cc:249]              EXECUTOR (default-executor) [0]: shutdown
I0602 21:02:14.504712000 4501616064 executor.cc:202]                   EXECUTOR (default-executor) Thread 1 of 1 joined
I0602 21:02:14.504742000 4501616064 executor.cc:224]                   EXECUTOR (default-executor) SetThreading(0) done
I0602 21:02:14.504752000 4501616064 executor.cc:154]                   EXECUTOR (resolver-executor) SetThreading(0) begin
I0602 21:02:14.504795000 123145315278848 executor.cc:249]              EXECUTOR (resolver-executor) [0]: shutdown
I0602 21:02:14.504857000 4501616064 executor.cc:202]                   EXECUTOR (resolver-executor) Thread 1 of 1 joined
I0602 21:02:14.504879000 4501616064 executor.cc:224]                   EXECUTOR (resolver-executor) SetThreading(0) done
I0602 21:02:14.506219000 4501616064 executor.cc:473]                   EXECUTOR Executor::SetThreadingAll(1) called
I0602 21:02:14.506302000 4501616064 executor.cc:154]                   EXECUTOR (default-executor) SetThreading(1) begin
I0602 21:02:14.506467000 4501616064 executor.cc:224]                   EXECUTOR (default-executor) SetThreading(1) done
I0602 21:02:14.506505000 4501616064 executor.cc:154]                   EXECUTOR (resolver-executor) SetThreading(1) begin
I0602 21:02:14.506677000 4501616064 executor.cc:224]                   EXECUTOR (resolver-executor) SetThreading(1) done
I0602 21:02:14.506711000 123145315815424 executor.cc:238]              EXECUTOR (resolver-executor) [0]: step (sub_depth=0)
I0602 21:02:14.506751000 123145315278848 executor.cc:238]              EXECUTOR (default-executor) [0]: step (sub_depth=0)
I0602 21:02:14.507265000 4501616064 executor.cc:473]                   EXECUTOR Executor::SetThreadingAll(1) called
I0602 21:02:14.511119000 4501616064 executor.cc:154]                   EXECUTOR (default-executor) SetThreading(1) begin
I0602 21:02:14.512976000 4501616064 executor.cc:224]                   EXECUTOR (default-executor) SetThreading(1) done
I0602 21:02:14.513005000 123145315278848 executor.cc:238]              EXECUTOR (default-executor) [0]: step (sub_depth=0)
I0602 21:02:14.515061000 4501616064 executor.cc:154]                   EXECUTOR (resolver-executor) SetThreading(1) begin
I0602 21:02:14.519730000 4501616064 executor.cc:224]                   EXECUTOR (resolver-executor) SetThreading(1) done
I0602 21:02:14.519785000 123145315815424 executor.cc:238]              EXECUTOR (resolver-executor) [0]: step (sub_depth=0)

# subprocess.Popen fails in child on dup'ing stdin fd to fd 0
Exception:  [Errno 9] Bad file descriptor
Child traceback:  Traceback (most recent call last):
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 973, in _execute_child
    _dup2(p2cread, 0)
  File "/usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 972, in _dup2
    os.dup2(a, b)
OSError: [Errno 9] Bad file descriptor

Client DTruss (syscall trace), starting during the sleep:

dtrace: system integrity protection is on, some features will not be available
        PID/THRD  SYSCALL(args)                  = return
...
# fork in parent / child
86401/0x282e0f:  fork()          = 86414 0
86414/0x282e98:  fork()          = 0 0

# child closing a series of fd's, including 3
86414/0x282e98:  close(0x4)              = 0 0
86401/0x282e9b:  proc_info(0x5, 0x15181, 0x2)            = 0 0
86414/0x282e98:  close(0x5)              = 0 0
86414/0x282e98:  close(0x3)              = 0 0

# Python attempting to set up child stdin, getting EBADF
86414/0x282e98:  dup2(0x3, 0x0, 0x0)             = -1 Err#9

Client log (fork support disabled: “GRPC_VERBOSITY=DEBUG GRPC_TRACE=all,-timer_check GRPC_ENABLE_FORK_SUPPORT=0 python -u route_guide_client.py”):

...
It worked!

But probably if the program were to use naked os.fork(), something else would break badly.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 5
  • Comments: 16 (13 by maintainers)

Most upvoted comments

@lidizheng PR has been posted (20452)… please let me know if there’s anything I need to do on my end to ease the way through for it.

The accounting of file-descriptor reference counts is wrong. Here’s some debug logging from the prior python example script. This sequence shows an outbound socket being created, connection attempt, failure and disposal.

The socket is closed, but the ref count doesn’t drop to 0 (it’s at 4). The code that closes the file-descriptor doesn’t check the ref-count. But aside from that whatever caused the ref-count to bump up (via “multipoller”) isn’t being undone.

Since the ref-count doesn’t reach 0, the FD (3) is not unregistered from the fork list and is thus left to be closed at fork time.

I1002 14:11:06.169675000 4797703616 ev_posix.cc:255]                   (polling-api) fd_create(3, tcp-client:ipv6:[::1]:12345, 1)
I1002 14:11:06.169683000 4797703616 ev_posix.cc:256]                   (fd-trace) fd_create(3, tcp-client:ipv6:[::1]:12345, 1)
I1002 14:11:06.169744000 4797703616 ev_posix.cc:395]                   (polling-api) pollset_set_add_fd(0x7f835ce43e10, 3)
D1002 14:11:06.169755000 4797703616 ev_poll_posix.cc:320]              FD 3 0x7f835ce17800   ref 2 1 -> 3 [pollset_set; src/core/lib/iomgr/ev_poll_posix.cc:1287]
D1002 14:11:06.169762000 4797703616 ev_poll_posix.cc:320]              FD 3 0x7f835ce17800   ref 2 3 -> 5 [pollset_set; src/core/lib/iomgr/ev_poll_posix.cc:1287]
D1002 14:11:06.169771000 4797703616 ev_poll_posix.cc:320]              FD 3 0x7f835ce17800   ref 2 5 -> 7 [pollset_set; src/core/lib/iomgr/ev_poll_posix.cc:1287]
D1002 14:11:06.169777000 4797703616 ev_poll_posix.cc:320]              FD 3 0x7f835ce17800   ref 2 7 -> 9 [pollset_set; src/core/lib/iomgr/ev_poll_posix.cc:1287]
D1002 14:11:06.169784000 4797703616 ev_poll_posix.cc:320]              FD 3 0x7f835ce17800   ref 2 9 -> 11 [multipoller; src/core/lib/iomgr/ev_poll_posix.cc:878]
D1002 14:11:06.169791000 4797703616 ev_poll_posix.cc:320]              FD 3 0x7f835ce17800   ref 2 11 -> 13 [multipoller; src/core/lib/iomgr/ev_poll_posix.cc:878]
I1002 14:11:06.169877000 4797703616 ev_posix.cc:323]                   (polling-api) pollset_work(0x7f835cc2a650, 203) begin
I1002 14:11:06.169899000 4797703616 ev_posix.cc:326]                   (polling-api) pollset_work(0x7f835cc2a650, 203) end
I1002 14:11:06.169908000 4797703616 ev_posix.cc:323]                   (polling-api) pollset_work(0x7f835cc2a650, 203) begin
D1002 14:11:06.169915000 4797703616 ev_poll_posix.cc:320]              FD 3 0x7f835ce17800   ref 2 13 -> 15 [multipoller_start; src/core/lib/iomgr/ev_poll_posix.cc:993]
D1002 14:11:06.169922000 4797703616 ev_poll_posix.cc:320]              FD 3 0x7f835ce17800   ref 2 15 -> 17 [poll; src/core/lib/iomgr/ev_poll_posix.cc:597]
D1002 14:11:06.169928000 4797703616 ev_poll_posix.cc:337]              FD 3 0x7f835ce17800 unref 2 17 -> 15 [multipoller_start; src/core/lib/iomgr/ev_poll_posix.cc:1006]
I1002 14:11:06.169945000 4797703616 ev_poll_posix.cc:1017]             0x7f835cc2a650 poll=1
I1002 14:11:06.169952000 4797703616 ev_poll_posix.cc:1051]             0x7f835cc2a650 got_event: 3 r:1 w:1 [17]
D1002 14:11:06.169959000 4797703616 ev_poll_posix.cc:337]              FD 3 0x7f835ce17800 unref 2 15 -> 13 [poll; src/core/lib/iomgr/ev_poll_posix.cc:690]
I1002 14:11:06.169976000 4797703616 ev_posix.cc:401]                   (polling-api) pollset_set_del_fd(0x7f835ce43e10, 3)
D1002 14:11:06.169985000 4797703616 ev_poll_posix.cc:337]              FD 3 0x7f835ce17800 unref 2 13 -> 11 [pollset_set; src/core/lib/iomgr/ev_poll_posix.cc:1309]
D1002 14:11:06.169991000 4797703616 ev_poll_posix.cc:337]              FD 3 0x7f835ce17800 unref 2 11 -> 9 [pollset_set; src/core/lib/iomgr/ev_poll_posix.cc:1309]
D1002 14:11:06.169998000 4797703616 ev_poll_posix.cc:337]              FD 3 0x7f835ce17800 unref 2 9 -> 7 [pollset_set; src/core/lib/iomgr/ev_poll_posix.cc:1309]
D1002 14:11:06.170005000 4797703616 ev_poll_posix.cc:337]              FD 3 0x7f835ce17800 unref 2 7 -> 5 [pollset_set; src/core/lib/iomgr/ev_poll_posix.cc:1309]
I1002 14:11:06.170010000 4797703616 ev_posix.cc:268]                   (polling-api) fd_orphan(3, 0x0, 0x0, tcp_client_orphan)
I1002 14:11:06.170015000 4797703616 ev_posix.cc:269]                   (fd-trace) grpc_fd_orphan, fd:3 closed