linkerd2: Pod can't reliably establish watches
Bug Report
What is the issue?
I am running the latest version of Linkerd (edge-19.1.2) and I am getting this error:
WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for NameAddr { name: DnsName(DNSName("cs-ch-domain-manager-v1.content-hub-test.svc.cluster.local.")), port: 8080 }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })
How can it be reproduced?
I just deployed the latest version; nothing more.
Logs, error output, etc
Output of linkerd logs --control-plane-component controller:
linkerd linkerd-controller-7bc49fd77f-lwt8q linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
Output of linkerd logs --control-plane-component controller -c proxy-api:
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T13:54:55Z" level=info msg="Stopping watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api W0121 15:57:34.899318 1 reflector.go:341] k8s.io/client-go/informers/factory.go:130: watch of *v1beta2.ReplicaSet ended with: too old resource version: 3417120 (3420499)
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T17:25:43Z" level=info msg="Establishing watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T17:32:18Z" level=info msg="Stopping watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api W0121 17:49:54.531144 1 reflector.go:341] k8s.io/client-go/informers/factory.go:130: watch of *v1beta2.ReplicaSet ended with: too old resource version: 3437967 (3439015)
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T21:32:21Z" level=info msg="Establishing watch on endpoint linkerd-prometheus.linkerd:9090"
Output of linkerd check:
kubernetes-api
--------------
✔ can initialize the client
✔ can query the Kubernetes API
kubernetes-version
------------------
✔ is running the minimum Kubernetes API version
linkerd-existence
-----------------
✔ control plane namespace exists
✔ controller pod is running
✔ can initialize the client
✔ can query the control plane API
linkerd-api
-----------
✔ control plane pods are ready
✔ can query the control plane API
✔ [kubernetes] control plane can talk to Kubernetes
✔ [prometheus] control plane can talk to Prometheus
linkerd-service-profile
-----------------------
✔ no invalid service profiles
linkerd-version
---------------
✔ can determine the latest version
✔ cli is up-to-date
control-plane-version
---------------------
✔ control plane is up-to-date
Status check results are ✔
Environment
- Kubernetes Version:
- Cluster Environment: EKS
- Host OS: Amazon AMI
- Linkerd version: edge-19.1.2
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 43 (33 by maintainers)
I finished my test and everything works as expected. There are no more errors in the log of linkerd-proxy, and my service is able to connect to the external service.
This new version has fixed all the problems I had previously.
I have 45 minutes before my son’s hockey game starts, installing now! 😃
That’s great to hear! 😄
@bourquep Thanks for the additional info. I just wanted to chime in and say that those “connection refused” messages that appear prior to the “caches synced” message are (unfortunately) expected. They’re a result of the public-api trying to query the kubernetes API before the linkerd-proxy container in the same pod is ready to serve requests. They eventually succeed if you see the “caches synced” message.
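To make the sequencing concrete, here is a minimal, hypothetical client-go sketch (not the actual Linkerd controller code) of the pattern described: the informer reflectors retry their list/watch calls until the API is reachable, and only after that does a "caches synced" line get logged.

```go
// Illustrative only: a bare-bones client-go informer startup showing why
// early "connection refused" errors are retried until "caches synced".
package main

import (
	"log"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Inside a pod, client-go builds its config from the service account.
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	stopCh := make(chan struct{})
	defer close(stopCh)

	// The factory's reflectors keep retrying list/watch calls; failures
	// (e.g. while a sidecar proxy is not yet ready) are logged and retried.
	factory := informers.NewSharedInformerFactory(clientset, 0)
	endpoints := factory.Core().V1().Endpoints().Informer()
	factory.Start(stopCh)

	// This returns only once the initial list/watch has succeeded, which is
	// roughly when a "caches synced" message would be logged.
	if !cache.WaitForCacheSync(stopCh, endpoints.HasSynced) {
		log.Fatal("failed to sync caches")
	}
	log.Println("caches synced")
}
```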
For more context, we use k8s.io/client-go to query the kubernetes API, and that package uses glog to log errors when the API is unreachable, before retrying. We would be better off suppressing all of the glog logs, but we have to redirect them to stderr, due to all of the reasons mentioned in kubernetes/kubernetes#61006. Kubernetes recently swapped out glog with its own fork (called klog 🙄) that is apparently more configurable. So it's possible that by updating to a more recent version of client-go we could suppress those messages, but we haven't gotten around to it yet.
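As a rough, hypothetical sketch (not Linkerd's actual code), the knobs klog exposes look something like this, assuming a client-go version that has already switched to k8s.io/klog:

```go
// Hypothetical sketch of silencing client-go's logging once it uses klog.
// This is not Linkerd code; it only demonstrates the knobs klog exposes.
package main

import (
	"flag"
	"io/ioutil"

	"k8s.io/klog"
)

func main() {
	// klog registers its flags on a standard FlagSet, so verbosity and
	// destination can be configured before client-go logs anything.
	fs := flag.NewFlagSet("klog", flag.ExitOnError)
	klog.InitFlags(fs)
	_ = fs.Set("logtostderr", "false")     // don't mirror everything to stderr
	_ = fs.Set("stderrthreshold", "FATAL") // only fatal errors reach stderr

	// Or drop klog output entirely.
	klog.SetOutput(ioutil.Discard)
}
```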
Thanks @jon-walton, that confirms what I’ve been seeing.
The issue isn’t specific to gRPC services, as the proxy itself uses gRPC to talk to the control plane’s service discovery API.