linkerd2: Pod can't reliably establish watches
Bug Report
What is the issue?
I am running the latest version of Linkerd (edge-19.1.2) and I am getting this error:
WARN admin={bg=resolver} linkerd2_proxy::control::destination::background::destination_set Destination.Get stream errored for NameAddr { name: DnsName(DNSName("cs-ch-domain-manager-v1.content-hub-test.svc.cluster.local.")), port: 8080 }: Grpc(Status { code: Unknown, error_message: "", binary_error_details: b"" })
How can it be reproduced?
I just deployed the latest version; nothing more.
Logs, error output, etc
Output of linkerd logs --control-plane-component controller:
linkerd linkerd-controller-7bc49fd77f-lwt8q linkerd-proxy WARN linkerd2_proxy::app::profiles error fetching profile for linkerd-proxy-api.linkerd.svc.cluster.local:8086: Inner(Upstream(Inner(Inner(Error { kind: Timeout(3s) }))))
Output of linkerd logs --control-plane-component controller -c proxy-api:
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T13:54:55Z" level=info msg="Stopping watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api W0121 15:57:34.899318 1 reflector.go:341] k8s.io/client-go/informers/factory.go:130: watch of *v1beta2.ReplicaSet ended with: too old resource version: 3417120 (3420499)
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T17:25:43Z" level=info msg="Establishing watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T17:32:18Z" level=info msg="Stopping watch on endpoint cs-ch-domain-manager-v1.content-hub-test:8080"
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api W0121 17:49:54.531144 1 reflector.go:341] k8s.io/client-go/informers/factory.go:130: watch of *v1beta2.ReplicaSet ended with: too old resource version: 3437967 (3439015)
linkerd linkerd-controller-7bc49fd77f-lwt8q proxy-api time="2019-01-21T21:32:21Z" level=info msg="Establishing watch on endpoint linkerd-prometheus.linkerd:9090"
Output of linkerd check:
kubernetes-api
--------------
✔ can initialize the client
✔ can query the Kubernetes API
kubernetes-version
------------------
✔ is running the minimum Kubernetes API version
linkerd-existence
-----------------
✔ control plane namespace exists
✔ controller pod is running
✔ can initialize the client
✔ can query the control plane API
linkerd-api
-----------
✔ control plane pods are ready
✔ can query the control plane API
✔ [kubernetes] control plane can talk to Kubernetes
✔ [prometheus] control plane can talk to Prometheus
linkerd-service-profile
-----------------------
✔ no invalid service profiles
linkerd-version
---------------
✔ can determine the latest version
✔ cli is up-to-date
control-plane-version
---------------------
✔ control plane is up-to-date
Status check results are ✔
Environment
- Kubernetes Version:
- Cluster Environment: EKS
- Host OS: Amazon AMI
- Linkerd version: edge-19.1.2
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 43 (33 by maintainers)
I finished my test and everything works as expected. There are no more errors in the log of linkerd-proxy, and my service is able to connect to the external service.
This new version has fixed all the problems I had previously.
I have 45 minutes before my son’s hockey game starts, installing now! 😃
That’s great to hear! 😄
@bourquep Thanks for the additional info. I just wanted to chime in and say that those “connection refused” messages that appear prior to the “caches synced” message are (unfortunately) expected. They’re a result of the public-api trying to query the kubernetes API before the linkerd-proxy container in the same pod is ready to serve requests. They eventually succeed if you see the “caches synced” message.
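To make the sequencing concrete, here is a minimal, hypothetical client-go sketch (not the actual Linkerd controller code) of the pattern described: the informer reflectors retry their list/watch calls until the API is reachable, and only after that does a "caches synced" line get logged.

```go
// Illustrative only: a bare-bones client-go informer startup showing why
// early "connection refused" errors are retried until "caches synced".
package main

import (
	"log"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Inside a pod, client-go builds its config from the service account.
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	stopCh := make(chan struct{})
	defer close(stopCh)

	// The factory's reflectors keep retrying list/watch calls; failures
	// (e.g. while a sidecar proxy is not yet ready) are logged and retried.
	factory := informers.NewSharedInformerFactory(clientset, 0)
	endpoints := factory.Core().V1().Endpoints().Informer()
	factory.Start(stopCh)

	// This returns only once the initial list/watch has succeeded, which is
	// roughly when a "caches synced" message would be logged.
	if !cache.WaitForCacheSync(stopCh, endpoints.HasSynced) {
		log.Fatal("failed to sync caches")
	}
	log.Println("caches synced")
}
```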
For more context, we use k8s.io/client-go to query the kubernetes API, and that package uses glog to log errors when the API is unreachable, before retrying. We would be better off suppressing all of the glog logs, but we have to redirect them to stderr, due to all of the reasons mentioned in kubernetes/kubernetes#61006. Kubernetes recently swapped out glog with its own fork (called klog 🙄) that is apparently more configurable. So it's possible that by updating to a more recent version of client-go we could suppress those messages, but we haven't gotten around to it yet.
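As a rough, hypothetical sketch (not Linkerd's actual code), the knobs klog exposes look something like this, assuming a client-go version that has already switched to k8s.io/klog:

```go
// Hypothetical sketch of silencing client-go's logging once it uses klog.
// This is not Linkerd code; it only demonstrates the knobs klog exposes.
package main

import (
	"flag"
	"io/ioutil"

	"k8s.io/klog"
)

func main() {
	// klog registers its flags on a standard FlagSet, so verbosity and
	// destination can be configured before client-go logs anything.
	fs := flag.NewFlagSet("klog", flag.ExitOnError)
	klog.InitFlags(fs)
	_ = fs.Set("logtostderr", "false")     // don't mirror everything to stderr
	_ = fs.Set("stderrthreshold", "FATAL") // only fatal errors reach stderr

	// Or drop klog output entirely.
	klog.SetOutput(ioutil.Discard)
}
```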
Thanks @jon-walton, that confirms what I’ve been seeing.
The issue isn’t specific to gRPC services, as the proxy itself uses gRPC to talk to the control plane’s service discovery API.