dapr: Crash in daprd sidecar when multiple invocation requests are made without the server running

In what area(s)?

/area runtime

What version of Dapr?

1.1.x, edge

Expected Behavior

When the remote dapr instance is down, sending multiple simultaneous gRPC invocation requests should complete all of the requests with an error (after 3 retries).
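
From the client's point of view, the expected outcome is that each concurrent call eventually faults with a Dapr/gRPC error while the daprd sidecar keeps running. A minimal sketch of that expectation, reusing the helper methods from the reproduction code further below:

                try
                {
                    // Expected: each invocation throws after the sidecar exhausts its 3 retries
                    // (e.g. a Dapr.DaprException wrapping "rpc error: code = Unavailable"),
                    // while the daprd sidecar itself stays up.
                    await Task.WhenAll(
                        SendStoredDaprFastRequest(daprClient),
                        SendStoredDaprSlowRequest(daprClient));
                }
                catch (Exception ex)
                {
                    _logger.LogInformation($"Invocation failed as expected: {ex.Message}");
                }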

Actual Behavior

In Kubernetes mode, when the remote dapr instance is down, two concurrent gRPC invocation requests from a client app crash the local daprd sidecar process on the client instance (the pod goes into a crash loop), with the following log and call stack (kubectl logs deployments/testdaprclient daprd -f --since=30s | grep -v "placement"):

time="2021-05-05T09:25:28.762946026Z" level=info msg="starting Dapr Runtime -- version edge -- commit 200c7ab" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.762982831Z" level=info msg="log level set to: debug" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.764565203Z" level=info msg="metrics server started on :9090/" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.metrics type=log ver=edge
time="2021-05-05T09:25:28.76470558Z" level=info msg="loading default configuration" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.764757796Z" level=info msg="kubernetes mode configured" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.764772079Z" level=info msg="app id: testdaprclient" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.764917903Z" level=info msg="mTLS enabled. creating sidecar authenticator" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.765068248Z" level=info msg="trust anchors and cert chain extracted successfully" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime.security type=log ver=edge
time="2021-05-05T09:25:28.765085967Z" level=info msg="authenticator created" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.772377488Z" level=info msg="Initialized name resolution to kubernetes" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.772491359Z" level=debug msg="loading component. name: kubernetes, type: secretstores.kubernetes/v1" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.774680191Z" level=info msg="component loaded. name: kubernetes, type: secretstores.kubernetes/v1" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.779461173Z" level=debug msg="found component. name: statestore, type: state.redis/v1" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.779484524Z" level=debug msg="found component. name: statestore, type: state.redis/v1" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.779491922Z" level=debug msg="found component. name: pubsub, type: pubsub.redis/v1" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.779499519Z" level=info msg="waiting for all outstanding components to be processed" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.77950653Z" level=debug msg="loading component. name: statestore, type: state.redis/v1" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.786864823Z" level=info msg="component loaded. name: statestore, type: state.redis/v1" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.786892037Z" level=info msg="all outstanding components processed" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.786954044Z" level=info msg="enabled gRPC tracing middleware" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime.grpc.api type=log ver=edge
time="2021-05-05T09:25:28.786975636Z" level=info msg="enabled gRPC metrics middleware" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime.grpc.api type=log ver=edge
time="2021-05-05T09:25:28.786998577Z" level=info msg="API gRPC server is running on port 50001" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.787135597Z" level=info msg="enabled metrics http middleware" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime.http type=log ver=edge
time="2021-05-05T09:25:28.787154655Z" level=info msg="enabled tracing http middleware" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime.http type=log ver=edge
time="2021-05-05T09:25:28.787163049Z" level=info msg="http server is running on port 3500" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.787175342Z" level=info msg="The request body size parameter is: 4" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.787203457Z" level=info msg="enabled gRPC tracing middleware" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime.grpc.internal type=log ver=edge
time="2021-05-05T09:25:28.787215211Z" level=info msg="enabled gRPC metrics middleware" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime.grpc.internal type=log ver=edge
time="2021-05-05T09:25:28.787221566Z" level=info msg="sending workload csr request to sentry" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime.grpc.internal type=log ver=edge
time="2021-05-05T09:25:28.79895507Z" level=info msg="certificate signed successfully" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime.grpc.internal type=log ver=edge
time="2021-05-05T09:25:28.799064038Z" level=info msg="internal gRPC server is running on port 50002" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.799078365Z" level=info msg="application protocol: grpc. waiting on port 5001.  This will block until the app is listening on that port." app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.799156857Z" level=info msg="starting workload cert expiry watcher. current cert expires on: 2021-05-06 09:40:28 +0000 UTC" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime.grpc.internal type=log ver=edge
time="2021-05-05T09:25:28.799276278Z" level=info msg="application discovered on port 5001" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:28.801249289Z" level=info msg="actor runtime started. actor idle timeout: 1h0m0s. actor scan interval: 30s" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime.actor type=log ver=edge
time="2021-05-05T09:25:28.801290118Z" level=info msg="dapr initialized. Status: Running. Init Elapsed 36.537296ms" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime type=log ver=edge
time="2021-05-05T09:25:29.318432688Z" level=debug msg="retry count: 1, grpc call failed, ns: dapr-test, addr: testdaprserver-dapr.dapr-test.svc.cluster.local:50002, appid: testdaprserver, err: rpc error: code = Unavailable desc = last resolver error: produced zero addresses" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime.direct_messaging type=log ver=edge
time="2021-05-05T09:25:29.31847955Z" level=debug msg="retry count: 1, grpc call failed, ns: dapr-test, addr: testdaprserver-dapr.dapr-test.svc.cluster.local:50002, appid: testdaprserver, err: rpc error: code = Unavailable desc = last resolver error: produced zero addresses" app_id=testdaprclient instance=testdaprclient-d76948d59-bc6vf scope=dapr.runtime.direct_messaging type=log ver=edge
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x7a68cd]

goroutine 120 [running]:
google.golang.org/grpc/internal/resolver.(*SafeConfigSelector).SelectConfig(0xc0009db428, 0x3c70fd8, 0xc000b18c90, 0x3724b90, 0x34, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.34.0/internal/resolver/config_selector.go:92 +0x6d
google.golang.org/grpc.newClientStream(0x3c70fd8, 0xc000b18c90, 0x5561bc0, 0xc0009db180, 0x3724b90, 0x34, 0xc0009ab200, 0x2, 0x2, 0x0, ...)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.34.0/stream.go:178 +0x1b5
google.golang.org/grpc.invoke(0x3c70fd8, 0xc000b18c90, 0x3724b90, 0x34, 0x33e9d80, 0xc0009d6e10, 0x33e9e60, 0xc0009d7680, 0xc0009db180, 0xc0009ab200, ...)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.34.0/call.go:66 +0x99
github.com/dapr/dapr/pkg/diagnostics.(*grpcMetrics).UnaryClientInterceptor.func1(0x3c70fd8, 0xc000b18c90, 0x3724b90, 0x34, 0x33e9d80, 0xc0009d6e10, 0x33e9e60, 0xc0009d7680, 0xc0009db180, 0x37cbe78, ...)
        /home/runner/work/dapr/dapr/pkg/diagnostics/grpc_monitoring.go:186 +0x1a4
google.golang.org/grpc.(*ClientConn).Invoke(0xc0009db180, 0x3c70fd8, 0xc000b18c90, 0x3724b90, 0x34, 0x33e9d80, 0xc0009d6e10, 0x33e9e60, 0xc0009d7680, 0xc0009ab200, ...)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.34.0/call.go:35 +0x109
github.com/dapr/dapr/pkg/proto/internals/v1.(*serviceInvocationClient).CallLocal(0xc0009432e8, 0x3c70fd8, 0xc000b18c90, 0xc0009d6e10, 0xc0009ab200, 0x2, 0x2, 0x2, 0xc000b18c90, 0x0)
        /home/runner/work/dapr/dapr/pkg/proto/internals/v1/service_invocation_grpc.pb.go:46 +0xd4
github.com/dapr/dapr/pkg/messaging.(*directMessaging).invokeRemote(0xc0002fab00, 0x3c70fd8, 0xc000b18c90, 0xc0009e6e60, 0xe, 0xc000056022, 0x9, 0xc0009ec3c0, 0x35, 0xc0004a2f58, ...)
        /home/runner/work/dapr/dapr/pkg/messaging/direct_messaging.go:172 +0x3d0
github.com/dapr/dapr/pkg/messaging.(*directMessaging).invokeWithRetry(0xc0002fab00, 0x3c70fd8, 0xc0009cbd70, 0x3, 0x3b9aca00, 0xc0009e6e60, 0xe, 0xc000056022, 0x9, 0xc0009ec3c0, ...)
        /home/runner/work/dapr/dapr/pkg/messaging/direct_messaging.go:126 +0xcf
github.com/dapr/dapr/pkg/messaging.(*directMessaging).Invoke(0xc0002fab00, 0x3c70fd8, 0xc0009cbd70, 0xc0009e6e60, 0xe, 0xc0004a2f58, 0xc5bca5, 0xc01cb70252c60dc8, 0x2344019e)
        /home/runner/work/dapr/dapr/pkg/messaging/direct_messaging.go:100 +0x1d2
github.com/dapr/dapr/pkg/grpc.(*api).InvokeService(0xc000147300, 0x3c70fd8, 0xc0009cbd70, 0xc0009d2a40, 0xc000147300, 0x2fb1d40, 0xc000b15628)
        /home/runner/work/dapr/dapr/pkg/grpc/api.go:333 +0x17b
github.com/dapr/dapr/pkg/proto/runtime/v1._Dapr_InvokeService_Handler.func1(0x3c70fd8, 0xc0009cbd70, 0x336aac0, 0xc0009d2a40, 0x29, 0xa2, 0xc01cb70252c60dc8, 0x2344019e)
        /home/runner/work/dapr/dapr/pkg/proto/runtime/v1/dapr_grpc.pb.go:404 +0x89
github.com/dapr/dapr/pkg/diagnostics.(*grpcMetrics).UnaryServerInterceptor.func1(0x3c70fd8, 0xc0009cbd70, 0x336aac0, 0xc0009d2a40, 0xc0009aac20, 0xc0009a8a50, 0x0, 0x0, 0x0, 0x0)
        /home/runner/work/dapr/dapr/pkg/diagnostics/grpc_monitoring.go:172 +0x136
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x3c70fd8, 0xc0009cbd70, 0x336aac0, 0xc0009d2a40, 0x0, 0x0, 0x0, 0x0)
        /home/runner/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.2.2/chain.go:25 +0x63
github.com/dapr/dapr/pkg/diagnostics.GRPCTraceUnaryServerInterceptor.func1(0x3c70fd8, 0xc0009cbd70, 0x336aac0, 0xc0009d2a40, 0xc0009aac20, 0xc0009aac40, 0xc8a11a, 0x335d5e0, 0xc0009aac60, 0xc0009aac20)
        /home/runner/work/dapr/dapr/pkg/diagnostics/grpc_tracing.go:56 +0x2eb
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x3c70fd8, 0xc0009cbd10, 0x336aac0, 0xc0009d2a40, 0xc000076800, 0x0, 0xc000b15b30, 0x40e3d8)
        /home/runner/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.2.2/chain.go:25 +0x63
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x3c70fd8, 0xc0009cbd10, 0x336aac0, 0xc0009d2a40, 0xc0009aac20, 0xc0009a8a50, 0xc00033aba0, 0x4c1f06, 0x33f6260, 0xc0009cbd10)
        /home/runner/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.2.2/chain.go:34 +0xd7
github.com/dapr/dapr/pkg/proto/runtime/v1._Dapr_InvokeService_Handler(0x35dd380, 0xc000147300, 0x3c70fd8, 0xc0009cbd10, 0xc0009e5c20, 0xc000929200, 0x3c70fd8, 0xc0009cbd10, 0xc00031d550, 0xa2)
        /home/runner/work/dapr/dapr/pkg/proto/runtime/v1/dapr_grpc.pb.go:406 +0x150
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0000f4fc0, 0x3c9c438, 0xc000a1f380, 0xc0000d7600, 0xc000929350, 0x5543a60, 0x0, 0x0, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:1210 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc0000f4fc0, 0x3c9c438, 0xc000a1f380, 0xc0000d7600, 0x0)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:1533 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc0009e6e20, 0xc0000f4fc0, 0x3c9c438, 0xc000a1f380, 0xc0000d7600)
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:871 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /home/runner/go/pkg/mod/google.golang.org/grpc@v1.34.0/server.go:869 +0x1fd

Steps to Reproduce the Problem

Start a simple client container in Kubernetes mode using the C# ASP.NET Core SDK, with the remote server pod instance not started yet. The containers are Linux. Send multiple (two) invocation requests without waiting for the RPC responses, and only then wait until all the responses arrive (this is the simplest form that triggers the crash; see notes). The configuration for the client is the following (if you want to try it out):

kind: Service
apiVersion: v1
metadata:
  name: testdaprclient
  labels:
    app: testdaprclient
spec:
  selector:
    app: testdaprclient
  ports:
  - protocol: TCP
    port: 5001
    targetPort: 5001
  type: LoadBalancer

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: testdaprclient
  labels:
    app: testdaprclient
spec:
  replicas: 1
  selector:
    matchLabels:
      app: testdaprclient
  template:
    metadata:
      labels:
        app: testdaprclient
      annotations:
        dapr.io/enabled: "true"
        dapr.io/app-id: "testdaprclient"
        dapr.io/app-port: "5001"
        dapr.io/app-protocol: "grpc"
        dapr.io/log-level: "debug"
        dapr.io/sidecar-liveness-probe-delay-seconds: "10"
        dapr.io/sidecar-readiness-probe-delay-seconds: "10"
    spec:
      containers:
      - name: testdaprclient
        image: trubini/testdaprclient:latest
        ports:
        - containerPort: 5001
        imagePullPolicy: Always

The client app is an ASP.NET Core service with a local Dapr gRPC endpoint that waits asynchronously for the message answers from the server. The RPC protocol is Client->Request, Server->Ack or Server->Response, Client->Ack. The request loop is started from a hosted service dedicated to issuing the test requests against the remote server app, without any interaction with the Dapr gRPC service listening on port 5001 (that service is started after startup and is independent of this communication exchange). The problem arises during the first steps, without any interaction with the server (the server is not even started here). The relevant code used to invoke the remote server app is this:

                int count = 10; // debug
                int i = 0;
                while (i++ < count)
                {
                    if (_ct.IsCancellationRequested)
                    {
                        _logger.LogInformation($"Cancellation requested");
                        return;
                    }
                    await Task.Delay(1000);
                    _logger.LogInformation("Sending 2 multiple dapr client requests...");
                    var task1 = SendStoredDaprFastRequest(daprClient);
                    var task2 = SendStoredDaprSlowRequest(daprClient);
                    await Task.WhenAll(task1, task2); // await task1; //

                    _logger.LogInformation("Requests complete.");
                }
                //...

        private async Task SendStoredDaprFastRequest(DaprClient daprClient)
        {
            var req = new RpcFastRequest
            {
                Name = "RpcCallFastStored",
                Uuid = Guid.NewGuid().ToString()
            };

            try
            {
                // "testdaprserver" does not exist during this request
                var acknowledge = await daprClient.InvokeMethodGrpcAsync<RpcFastRequest, Acknowledge>("testdaprserver", "RpcCallFast", req);
                if (acknowledge.Ack != true)
                {
                    // todo retry manager
                    _logger.LogInformation($"Error - Cannot send request {acknowledge.Uuid}");
                }
            }
            catch (Dapr.DaprException ex)
            {
                _logger.LogInformation($"Error - Dapr Exception: {ex.Message} - Inner: {ex.InnerException.Message}");
                throw;
            }
            catch (Exception ex)
            {
                _logger.LogInformation($"Error - Unknown Exception: {ex.Message} - Inner: {ex.InnerException.Message}");
                throw;
            }
        }
        //...
        // SendStoredDaprSlowRequest is exactly the same but with a new RpcSlowRequest data structure.
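
For completeness, here is a minimal sketch of what SendStoredDaprSlowRequest looks like under that description. RpcSlowRequest and the "RpcCallSlow" / "RpcCallSlowStored" strings are assumed here by analogy with the fast variant above:

        // Hypothetical mirror of SendStoredDaprFastRequest; only the request type and the
        // method/name strings (assumed) differ, as described in the comment above.
        private async Task SendStoredDaprSlowRequest(DaprClient daprClient)
        {
            var req = new RpcSlowRequest
            {
                Name = "RpcCallSlowStored",
                Uuid = Guid.NewGuid().ToString()
            };

            try
            {
                // "testdaprserver" is not running during this request either
                var acknowledge = await daprClient.InvokeMethodGrpcAsync<RpcSlowRequest, Acknowledge>("testdaprserver", "RpcCallSlow", req);
                if (acknowledge.Ack != true)
                {
                    _logger.LogInformation($"Error - Cannot send request {acknowledge.Uuid}");
                }
            }
            catch (Exception ex)
            {
                _logger.LogInformation($"Error - Exception: {ex.Message}");
                throw;
            }
        }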

Notes:

  • This also happens at runtime (when the server is up and communication is flowing), after the server app has been restarted or stopped.
  • Before starting the request loop I waited 15 seconds to give the underlying infrastructure some time to become ready and available (this does not change the behavior).
  • It does not seem to happen if the requests are sent one at a time (waiting for the RPC Acknowledge synchronously), so it seems related to a race condition inside the sidecar during the retry loop (invokeWithRetry); a client-side workaround sketch follows this list.
  • I detected this problem with version 1.1.1, but I also tested with 1.1.2 and with the edge version of daprd; the call stack is the same.
  • It does not seem to happen in self-hosted mode, even with multiple outstanding requests.
  • I saw this on a local single-node Kubernetes cluster running on WSL2, but also, and foremost, on a production OpenShift cluster (4.7).
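
As a temporary client-side mitigation, based only on the observation above that strictly sequential requests do not trigger the crash, the concurrent invocations can be serialized, for example with a SemaphoreSlim. This is a hypothetical workaround sketch for the same hosted-service class (InvokeSerializedAsync is an assumed helper name), not a fix for the underlying sidecar bug:

        // Serialize outgoing invocations so that only one request is in flight at a time.
        // This only avoids the concurrent retry path that crashes the sidecar; it does not fix it.
        private static readonly SemaphoreSlim _invokeLock = new SemaphoreSlim(1, 1);

        private async Task<Acknowledge> InvokeSerializedAsync<TRequest>(DaprClient daprClient, string appId, string method, TRequest request)
            where TRequest : Google.Protobuf.IMessage
        {
            await _invokeLock.WaitAsync();
            try
            {
                return await daprClient.InvokeMethodGrpcAsync<TRequest, Acknowledge>(appId, method, request);
            }
            finally
            {
                _invokeLock.Release();
            }
        }

With that helper, the two calls in the loop above would be awaited one after the other instead of via Task.WhenAll, at the cost of losing the concurrency that the test is meant to exercise.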

Release Note

RELEASE NOTE: FIX Crash in daprd sidecar.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 21 (10 by maintainers)

Most upvoted comments

@ptrubini I was able to reproduce this. This looks like a bug in the upstream gRPC Go SDK. I’ll try to initiate a fix in the upstream repo and in parallel consider a workaround for Dapr if it’s available.

This will be released in 1.3, the upcoming version.

Testing a fix now, this will be done for the upcoming 1.3 release.

I’ve tested it with image yaron2/daprd1a:v1, and this bug has been fixed. Thank you!