grpc: Deadlocks using gRPC C# in Unity - Possible shutdown race - processing GOAWAY after shutdown

What version of gRPC and what language are you using?

C#

grpc_unity_package.2.27.0-dev.zip - build ID 44bbb740-e047-4210-8aea-0ab70d40ccb6

What operating system (Linux, Windows,…) and version?

Windows 10

What runtime / compiler are you using (e.g. python version or version of gcc)

Unity 2019.2

What did you do?

I’m trying to investigate a deadlock in Grpc.Core related to the Environment shutdown. The deadlock seems to happen in many cases after having used the Grpc API, after I switch out of Unity’s play mode and then later try and re-enter play mode Unity freezes. Another way of triggering the deadlock is to build a standalone application and the deadlock will manifest when I try and quit the application.

My gRPC client code in Unity is interacting with a game server written in Rust (Tonic) which seems to work fine when I test using a headless gRPC client written in Go.

I have since tried to come up with a minimal reproduction and found one example that seems to be a fairly reliable reproduction with a server streaming request where I dispose the stream without reading anything from it.

This is the protocol I'm testing with
syntax = "proto3";

package test.grpc;

message Empty {}

message Ping {
    uint64 epochSendTimestampNS = 1;
}

message Pong {
    uint64 epochSendTimestampNS = 1;
}

service Test {
    rpc OpenTestStream(Empty) returns (stream Empty);
    rpc SendPing(Ping) returns (Pong);
}
This is the C# test function I have
public async void RunGrpcTest()
    {
        var certUri = Path.Combine(Application.streamingAssetsPath, "ca.crt");
        var req = UnityWebRequest.Get(certUri);
        await req.SendWebRequest();
        var cacert = req.downloadHandler.text;

        var cancellationTokenSource = new CancellationTokenSource();

        var interceptor = new AsyncAuthInterceptor(async (context, metadata) =>
        {
            // dummy token, with a "username" claim
            metadata.Add("authorization", "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VybmFtZSI6InRlc3QifQ.hNQI_r8BATy1LyXPr6Zuo9X_V0kSED8ngcqQ6G-WV5w");
        });

        string address = "" + testAddress + ":" + testPort;
        ChannelCredentials credentials = null;

        if (useTls)
        {
            credentials = new SslCredentials(cacert);
            credentials = ChannelCredentials.Create(new SslCredentials(cacert),
                CallCredentials.FromInterceptor(interceptor));
        }
        else
        {
            credentials = ChannelCredentials.Insecure;
        }

        var channel = new Channel(address, credentials);

        var client = new Test.Grpc.Test.TestClient(channel);

        try {
            var sentTs = GetEpochTimestampNS();
            var pong = await client.SendPingAsync(new Test.Grpc.Ping { EpochSendTimestampNS = sentTs },
                                                    deadline: DateTime.UtcNow.AddSeconds(10),
                                                    cancellationToken: cancellationTokenSource.Token);
            var receivedTs = pong.EpochSendTimestampNS;
            var delta = receivedTs - sentTs;
            Debug.Log("Ping sent to Frontend (timestamp = " + sentTs + "),  Pong Received (timestamp = " + receivedTs + "), Round Trip = " + delta);

            Debug.Log("Opening test stream...");
            using (var streamCall = client.OpenTestStream(new Test.Grpc.Empty {},
                                                          new CallOptions(deadline: DateTime.UtcNow.AddSeconds(30),
                                                                          cancellationToken: cancellationTokenSource.Token)))
            {
                // With a Rust/Tonic server then if we don't read anything from the stream we tend to get a deadlock
#if false
                if (await streamCall.ResponseStream.MoveNext()) {
                    Debug.Log("Received Empty{} item on test stream");
                } else {
                    Debug.Log("Nothing read from test stream");
                }
#else
                Debug.Log("DEBUG: Immediately disposing create group stream");
#endif
            }
        }
        finally
        {
            Debug.Log("Shutting down channel...");
            await channel.ShutdownAsync();
        }

        Debug.Log("Done");
    }

TL;DR the test sends one unary SendPing request to the server followed by a server streaming OpenTestStream request but doesn’t wait to read anything from the stream before disposing the stream and shutting down the channel.

Looking at the logs from GRPC_TRACE="api,channel,connectivity_state,executor"; something that seems to reliably correlate with test runs that will lead to a deadlock is seeing that a GOAWAY is received after the Environment has shut everything down. (after tracing grpc\src\core\lib\surface\init.cc:212: grpc_shutdown(void))

I don’t know yet, but it seems surprising to see any log messages related to network I/O after the grpc state has been shutdown, so I wonder if maybe there is some kind of reference counting problem and something should be blocking the shutdown from completing while there is apparently still a connection to the server that’s open (or some buffered data that needs to be flushed somehow)?

This is a trace log for a deadlock case:

https://pastebin.com/7YZ94k0d

In this case I was able to run the test four times successfully, but on the fifth run I see this extra information after shutting down grpc (search for "grpc_shutdown(void)" in the log):

INFO: 2106225329392 C:\Users\Robert\src\grpc\src\core\ext\transport\chttp2\transport\chttp2_transport.cc:1088: ipv4:192.168.0.21:50051: Got goaway [1] err={"created":"@1578765853.378000000","description":"GOAWAY received","file":"C:\Users\Robert\src\grpc\src\core\ext\transport\chttp2\transport\chttp2_transport.cc","file_line":1077,"grpc_status":14,"http2_error":1,"raw_bytes":""}
INFO: 2106225329392 C:\Users\Robert\src\grpc\src\core\lib\transport\connectivity_state.cc:151: ConnectivityStateTracker client_transport[000001EA602B84D0]: READY -> TRANSIENT_FAILURE (got_goaway)
INFO: 2106225329392 C:\Users\Robert\src\grpc\src\core\lib\transport\connectivity_state.cc:158: ConnectivityStateTracker client_transport[000001EA602B84D0]: notifying watcher 000001EA56953890: READY -> TRANSIENT_FAILURE
INFO: 2106225329392 C:\Users\Robert\src\grpc\src\core\lib\transport\connectivity_state.cc:78: watcher 000001EA56953890: delivering async notification for TRANSIENT_FAILURE
INFO: 2106225329392 C:\Users\Robert\src\grpc\src\core\lib\transport\connectivity_state.cc:151: ConnectivityStateTracker client_channel[000001EA57C68950]: READY -> IDLE (helper)
INFO: 2106225329392 C:\Users\Robert\src\grpc\src\core\lib\transport\connectivity_state.cc:151: ConnectivityStateTracker client_transport[000001EA602B84D0]: TRANSIENT_FAILURE -> SHUTDOWN (close_transport)
INFO: 2106225329392 C:\Users\Robert\src\grpc\src\core\lib\transport\connectivity_state.cc:158: ConnectivityStateTracker client_transport[000001EA602B84D0]: notifying watcher 000001EA56953890: TRANSIENT_FAILURE -> SHUTDOWN
INFO: 2106225329392 C:\Users\Robert\src\grpc\src\core\lib\transport\connectivity_state.cc:78: watcher 000001EA56953890: delivering async notification for SHUTDOWN
INFO: 0 C:\Users\Robert\src\grpc\src\core\lib\surface\call.cc:573: grpc_call_unref(c=000001EA655903E0)

This is another trace log for reference that doesn’t trigger a deadlock with an equivalent server written in Go:

https://pastebin.com/xcASsX2F

Update: for a while I thought I wasn’t able to reproduce this with the Go-based server, but actually I can. It looks like this issue is timing sensitive and changes to the test (such as enabling/disabling TLS) that probably just change timing will affect how reliably this issue is reproduced

At this point, I’m not familiar with the Grpc.Core code to know how to try and fix this, so I’m really hoping someone here will recognise what’s probably wrong based on the log.

If it would be helpful, I can aim to push a git repo with a minimal Unity test and the Rust + Go servers I’ve used to reproduce this?

If you need me to rerun the test with additional logging information, please let me know.

For me this issue is a rather serious blocker for progressing my project, so I’ll continue trying to investigate myself too but any help from someone familiar with the Grpc.Core implementation would be hugely appreciated!

I haven’t figured out how to rebuild the unity packages for the grpc libraries but have at least been able to rebuild Grpc.Core and grpc_csharp_ext.dll so I can start to experiment with local changes.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 5
  • Comments: 20 (1 by maintainers)

Most upvoted comments

@EliCDavis Yeah, never heard any response from a maintainer here so pretty much just decided the Unity support is unmaintained and it would be too much of a liability to depend on in production. I switched to binding in my own plugin written in Rust where I could use https://github.com/hyperium/tonic instead.

Honestly I think I wish they never claimed to support Unity because it led me to waist a huge amount of time which was a big deal for me while I’m trying to boostrap a self-funded startup. Has led me to have general bad feelings about this project - to me it’s a huge red flag that it’s so difficult to get feedback from maintainers.

We are getting similar issues here as well. We have a stream we are starting, but it is also blocking ALL other calls through GRPC (cannot perform any unary calls once the stream has started). As well as it blocked, we cannot seem to shutdown/cancel/stop the stream, if we do further calls are still blocked (unary and streams).