linkerd2: Controller does not start cleanly on GKE
This appears to have changed since the v0.3.0 release.
When I install Conduit on GKE, the pods in the conduit namespace restart multiple times before stabilizing and entering the Running state. I’d expect them not to restart at all. For example:
$ kubectl -n conduit get po
NAME                          READY     STATUS    RESTARTS   AGE
controller-5b5c6c4846-6nxb2   6/6       Running   3          2m
prometheus-598fc79646-zl2dw   3/3       Running   0          2m
web-85799d759c-vz2bv          2/2       Running   0          2m
It’s hard to track down which of the containers is causing the pod to restart, but I see this in the proxy-api container’s logs:
$ kubectl -n conduit logs controller-5b5c6c4846-6nxb2 proxy-api
time="2018-02-28T00:36:17Z" level=info msg="running conduit version git-9ffe8b79"
time="2018-02-28T00:36:17Z" level=info msg="serving scrapable metrics on :9996"
time="2018-02-28T00:36:17Z" level=info msg="starting gRPC server on :8086"
time="2018-02-28T00:36:27Z" level=error msg="Report: rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
time="2018-02-28T00:36:28Z" level=error msg="Report: rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
time="2018-02-28T00:36:28Z" level=error msg="Report: rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
...
time="2018-02-28T00:36:57Z" level=error msg="Report: rpc error: code = Unknown desc = ResponseCtx is required"
I also see this in the conduit-proxy container’s logs:
$ kubectl -n conduit logs controller-5b5c6c4846-6nxb2 conduit-proxy
INFO conduit_proxy using controller at HostAndPort { host: DnsName("localhost"), port: 8086 }
INFO conduit_proxy routing on V4(127.0.0.1:4140)
INFO conduit_proxy proxying on V4(0.0.0.0:4143) to None
INFO conduit_proxy::transport::connect "controller-client", DNS resolved DnsName("localhost") to 127.0.0.1
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
WARN conduit_proxy::control::telemetry "controller-client", controller error: Grpc(Status { code: Unavailable })
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
WARN conduit_proxy::control::telemetry "controller-client", controller error: Grpc(Status { code: Unavailable })
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
WARN conduit_proxy::control::telemetry "controller-client", controller error: Grpc(Status { code: Unavailable })
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
WARN conduit_proxy::control::telemetry "controller-client", controller error: Grpc(Status { code: Unknown })
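To narrow down which container is behind the restarts, checking the per-container restart counts and the logs of the previous container instance should help (a sketch with standard kubectl, using the pod name from the listing above):
$ kubectl -n conduit get po controller-5b5c6c4846-6nxb2 \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
# logs from the container instance that ran before the most recent restart
$ kubectl -n conduit logs controller-5b5c6c4846-6nxb2 proxy-api --previous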
If I had to guess, these errors are likely the result of the Go processes trying to route traffic before the proxy has initialized, a behavior that changed in #365.
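A rough way to sanity-check that would be to compare each container's readiness and most recent start time, keeping in mind that after a restart the timestamp shows when the container came back up, not the original ordering:
# per-container ready flag and latest start time for the controller pod
$ kubectl -n conduit get po controller-5b5c6c4846-6nxb2 \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\t"}{.state.running.startedAt}{"\n"}{end}'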
Here’s the version I’m testing against:
$ ./bin/conduit version
Client version: git-9ffe8b79
Server version: git-9ffe8b79
About this issue
- State: closed
- Created 6 years ago
- Comments: 37 (35 by maintainers)
Commits related to this issue
- Install: Don't install buoyantio/kubectl into the prometheus pod. In the initial review for this code (preceding the creation of the runconduit/conduit repository), it was noted that this container i... — committed to linkerd/linkerd2 by briansmith 6 years ago
- Install: Don't install buoyantio/kubectl into the prometheus pod. (#509) In the initial review for this code (preceding the creation of the runconduit/conduit repository), it was noted that this con... — committed to linkerd/linkerd2 by briansmith 6 years ago
- Retry k8s watch endpoints on error (#510) Shortly after conduit is installed in k8s environment. The control plane component that establishes a watch endpoint with k8s run in to networking issues du... — committed to linkerd/linkerd2 by dadjeibaah 6 years ago
It would be interesting to see what the inner messages reveal with @hawkw’s changes to logging. I can take a look at this when I get the chance.
I filed #522 to track an additional issue with controller containers not starting cleanly. I think we should ship a separate fix for that issue.
@capathida interesting! I don’t think we should assume that your issue is the same as this one. Would you mind opening a new issue describing how you encountered this so that we can try to reproduce the behavior? Thanks!
The proxy can’t expect that every (any) proxied service will wait before it starts networking, so it would be ideal to find a solution that doesn’t require modifying the controller Go code.
@deebo91 I hope the improved log messages are helpful — if you still need help deciphering the proxy logs, let me know.
=> P0 then.
As I understand it, the controller doesn’t actually initiate any traffic on its own, whereas the proxy does spontaneously generate traffic (telemetry reports). Depending on startup order, the proxy could start sending reports before the telemetry service has started, which could explain these errors in the proxy-api service.
My question is: why does this result in pods getting restarted? I would hope that these errors would just result in those telemetry reports being lost.
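A quick way to find out would be to look at the exit codes and pod events, which should distinguish a container that crashes on its own from one the kubelet kills for failing its liveness probe:
$ kubectl -n conduit describe po controller-5b5c6c4846-6nxb2
$ kubectl -n conduit get events --sort-by=.lastTimestamp
If the events show liveness-probe failures, the restarts are the kubelet’s doing; if the containers are exiting on their own, the previous instance’s logs (--previous) should show what made them exit.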