linkerd2: Controller does not start cleanly on GKE
This appears to have changed since the v0.3.0 release.
When I install Conduit on GKE, the pods in the conduit namespace restart multiple times before stabilizing and entering the Running state. I’d expect them not to restart at all. For example:
$ kubectl -n conduit get po
NAME                          READY     STATUS    RESTARTS   AGE
controller-5b5c6c4846-6nxb2   6/6       Running   3          2m
prometheus-598fc79646-zl2dw   3/3       Running   0          2m
web-85799d759c-vz2bv          2/2       Running   0          2m
It’s hard to track down which of the containers is causing the pod to restart, but I see this in the proxy-api container’s logs:
$ kubectl -n conduit logs controller-5b5c6c4846-6nxb2 proxy-api
time="2018-02-28T00:36:17Z" level=info msg="running conduit version git-9ffe8b79"
time="2018-02-28T00:36:17Z" level=info msg="serving scrapable metrics on :9996"
time="2018-02-28T00:36:17Z" level=info msg="starting gRPC server on :8086"
time="2018-02-28T00:36:27Z" level=error msg="Report: rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
time="2018-02-28T00:36:28Z" level=error msg="Report: rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
time="2018-02-28T00:36:28Z" level=error msg="Report: rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
...
time="2018-02-28T00:36:57Z" level=error msg="Report: rpc error: code = Unknown desc = ResponseCtx is required"
I also see this in the conduit-proxy container’s logs:
$ kubectl -n conduit logs controller-5b5c6c4846-6nxb2 conduit-proxy
INFO conduit_proxy using controller at HostAndPort { host: DnsName("localhost"), port: 8086 }
INFO conduit_proxy routing on V4(127.0.0.1:4140)
INFO conduit_proxy proxying on V4(0.0.0.0:4143) to None
INFO conduit_proxy::transport::connect "controller-client", DNS resolved DnsName("localhost") to 127.0.0.1
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
WARN conduit_proxy::control::telemetry "controller-client", controller error: Grpc(Status { code: Unavailable })
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
WARN conduit_proxy::control::telemetry "controller-client", controller error: Grpc(Status { code: Unavailable })
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
WARN conduit_proxy::control::telemetry "controller-client", controller error: Grpc(Status { code: Unavailable })
ERR! conduit_proxy::map_err turning service error into 500: Inner(Upstream(Inner(Inner(Error { kind: Inner(Error { kind: Proto(INTERNAL_ERROR) }) }))))
WARN conduit_proxy::control::telemetry "controller-client", controller error: Grpc(Status { code: Unknown })
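To narrow down which container is behind the restarts, checking the per-container restart counts and the logs of the previous container instance should help (a sketch with standard kubectl, using the pod name from the listing above):
$ kubectl -n conduit get po controller-5b5c6c4846-6nxb2 \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
# logs from the container instance that ran before the most recent restart
$ kubectl -n conduit logs controller-5b5c6c4846-6nxb2 proxy-api --previous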
If I had to guess, these errors are likely the result of the Go processes trying to route traffic before the proxy has initialized, a behavior that changed in #365.
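A rough way to sanity-check that would be to compare each container's readiness and most recent start time, keeping in mind that after a restart the timestamp shows when the container came back up, not the original ordering:
# per-container ready flag and latest start time for the controller pod
$ kubectl -n conduit get po controller-5b5c6c4846-6nxb2 \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\t"}{.state.running.startedAt}{"\n"}{end}'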
Here’s the version I’m testing against:
$ ./bin/conduit version
Client version: git-9ffe8b79
Server version: git-9ffe8b79
About this issue
- State: closed
- Created 6 years ago
- Comments: 37 (35 by maintainers)
Commits related to this issue
- Install: Don't install buoyantio/kubectl into the prometheus pod. In the initial review for this code (preceding the creation of the runconduit/conduit repository), it was noted that this container i... — committed to linkerd/linkerd2 by briansmith 6 years ago
- Install: Don't install buoyantio/kubectl into the prometheus pod. (#509) In the initial review for this code (preceding the creation of the runconduit/conduit repository), it was noted that this con... — committed to linkerd/linkerd2 by briansmith 6 years ago
- Retry k8s watch endpoints on error (#510) Shortly after conduit is installed in k8s environment. The control plane component that establishes a watch endpoint with k8s run in to networking issues du... — committed to linkerd/linkerd2 by dadjeibaah 6 years ago
It would be interesting to see what the inner messages reveal with @hawkw’s changes to logging. I can take a look at this when I get the chance.
I filed #522 to track an additional issue with controller containers not starting cleanly. I think we should ship a separate fix for that issue.
@capathida interesting! I don’t think we should assume that your issue is the same as this one. Would you mind opening a new issue describing how you encountered this so that we can try to reproduce the behavior? Thanks!
The proxy can’t expect that every (any) proxied service will wait before it starts networking, so it would be ideal to find a solution that doesn’t require modifying the controller Go code.
@deebo91 I hope the improved log messages are helpful — if you still need help deciphering the proxy logs, let me know.
=> P0 then.
As I understand it, the controller doesn’t actually initiate any traffic on its own, whereas the proxy does spontaneously generate traffic (telemetry reports). Depending on startup order, the proxy could start sending reports before the telemetry service has started, which could explain these errors in the proxy-api service.
My question is: why does this result in pods getting restarted? I would hope that these errors would just result in those telemetry reports being lost.
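A quick way to find out would be to look at the exit codes and pod events, which should distinguish a container that crashes on its own from one the kubelet kills for failing its liveness probe:
$ kubectl -n conduit describe po controller-5b5c6c4846-6nxb2
$ kubectl -n conduit get events --sort-by=.lastTimestamp
If the events show liveness-probe failures, the restarts are the kubelet’s doing; if the containers are exiting on their own, the previous instance’s logs (--previous) should show what made them exit.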