linkerd2: linkerd-destination seemingly serving incorrect Pod IPs causing connection errors
What is the issue?
Slack thread here.
One of our application's proxies is logging a stream of failed-to-connect errors. Examples:
[ 584.018861s] WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.17.10:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 2104.773832s] WARN ThreadId(01) linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 1785.088002s] WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.19.9:4317}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
Debug logs aren’t much use:
DEBUG ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.18:4317}: linkerd_reconnect: Recovering
offer-price-index-service-ie-66bbbfcff-f8kbg [ 1851.149239s] DEBUG ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.18:4317}: linkerd_reconnect: Disconnected backoff=true
The service otel-agent.open-telemetry.svc.cluster.local:4317 points to a DaemonSet, so its pods are constantly being replaced as our preemptible nodes cycle.
I caught a failed request in Tap, where all other requests complete in ~1ms.
I then found this log entry:
2022-07-22 12:48:54
WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.4:4317}: linkerd_reconnect: Service failed error=channel closed
10.225.13.4 belonged to the pod otel-agent-6kg56 until yesterday (21/07) at 13:52, when the IP was reassigned to the pod tap-injector-7c4b6dc5f9-7wsxn. For some reason, the linkerd-proxy is still sending requests to IPs that no longer belong to that DaemonSet.
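For anyone retracing this, a quick way to see which pod currently holds a given IP (these are illustrative commands, not the ones used at the time; the field-selector form depends on the cluster supporting the status.podIP selector):
kubectl get pods -A -o wide | grep 10.225.13.4
kubectl get pods -A -o wide --field-selector status.podIP=10.225.13.4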
I run three replicas of linkerd-destination, which at the time of writing were:
linkerd-destination-8f9d689fc-lxl5f 4/4 Running 0 3d22h
linkerd-destination-8f9d689fc-mddv5 4/4 Running 0 21h
linkerd-destination-8f9d689fc-rn8dn 4/4 Running 0 21h
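A sketch of how those replicas' logs can be searched for the stale pod name (the label selector and container name are from a default control-plane install; adjust as needed):
for p in $(kubectl -n linkerd get pods -l linkerd.io/control-plane-component=destination -o name); do
  kubectl -n linkerd logs "$p" -c destination --since=48h | grep otel-agent-6kg56
done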
I searched the last two days of logs from the linkerd-destination pods for anything referencing the now-removed otel-agent pod, and found two entries:
2022-07-21 14:51:44
linkerd-destination-8f9d689fc-lxl5f time="2022-07-21T13:51:44Z" level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-6kg56: pod \"otel-agent-6kg56\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent
2022-07-21 14:51:44
linkerd-destination-8f9d689fc-zqxhr time="2022-07-21T13:51:44Z" level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-6kg56: pod \"otel-agent-6kg56\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent
zqxhr had since been deleted, but lxl5f was still alive. My theory is that this linkerd-destination pod entered an unhealthy state while still passing readiness, and kept serving "bad information" to the linkerd-proxy containers requesting destination information.
I then killed the linkerd-destination pod that logged these errors at 13:10 and have not seen the error come back since.
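That amounted to a plain pod deletion (assuming the default control-plane namespace):
kubectl -n linkerd delete pod linkerd-destination-8f9d689fc-lxl5f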
How can it be reproduced?
Not confirmed, but: remove pods from a DaemonSet (for example by deleting their node) and check whether the destination service drops the old endpoints; see the sketch below.
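Roughly (a sketch, not a confirmed reproduction; the node name is a placeholder):
# find a node currently running an otel-agent pod
kubectl -n open-telemetry get pods -o wide
# remove that node
kubectl delete node <node-name>
# then check whether the old pod IP ever drops out of the endpoint list
linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317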
Logs, error output, etc
See section above.
output of linkerd check -o short
I have a mismatched CLI version so ignore the errors:
Linkerd core checks
===================
linkerd-version
---------------
‼ cli is up-to-date
is running version 22.6.2 but the latest edge version is 22.7.2
see https://linkerd.io/2/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 22.7.1 but the latest edge version is 22.7.2
see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
control plane running edge-22.7.1 but cli running edge-22.6.2
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* alpine-curl-6cd54fb65d-j6z7p (edge-22.7.1)
* linkerd-destination-8f9d689fc-jrkr2 (edge-22.7.1)
* linkerd-destination-8f9d689fc-mddv5 (edge-22.7.1)
* linkerd-destination-8f9d689fc-rn8dn (edge-22.7.1)
* linkerd-identity-55c8d467bf-hr479 (edge-22.7.1)
* linkerd-identity-55c8d467bf-r4bd4 (edge-22.7.1)
* linkerd-identity-55c8d467bf-zvnvl (edge-22.7.1)
* linkerd-proxy-injector-84c59f7bf9-5rnj4 (edge-22.7.1)
* linkerd-proxy-injector-84c59f7bf9-r692g (edge-22.7.1)
* linkerd-proxy-injector-84c59f7bf9-wjh44 (edge-22.7.1)
* nginx-67888f898b-x9mz4 (edge-22.7.1)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
alpine-curl-6cd54fb65d-j6z7p running edge-22.7.1 but cli running edge-22.6.2
see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints
Linkerd extensions checks
=========================
linkerd-viz
-----------
‼ viz extension proxies are up-to-date
some proxies are not running the current version:
* metrics-api-794fbbf94b-z7b5r (edge-22.7.1)
* prometheus-6cd858c97b-rh9xr (edge-22.7.1)
* tap-68b97d8d8b-szrc9 (edge-22.7.1)
* tap-68b97d8d8b-tvj55 (edge-22.7.1)
* tap-68b97d8d8b-xhs69 (edge-22.7.1)
* tap-injector-7c4b6dc5f9-7wsxn (edge-22.7.1)
* web-6576dcfdb-g4mhd (edge-22.7.1)
see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
metrics-api-794fbbf94b-z7b5r running edge-22.7.1 but cli running edge-22.6.2
see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints
Status check results are √
Environment
Kubernetes version: 1.22
Host: GKE
Linkerd version: edge-22.7.1
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
No response
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 20 (7 by maintainers)
This seems to be quite easily reproducible by deleting a GKE node:
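Something along these lines (a sketch; node name and zone are placeholders):
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
# or, to simulate preemption, delete the underlying GCE instance
gcloud compute instances delete <node-name> --zone <zone>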
I have tried manually deleting two nodes this way (one preemptible, one with standard provisioning) and in both cases the stale pod still appears in the diagnostics endpoint. I left it a good 30+ minutes and the pod was still listed. Deleting otel-agent pods manually works: they disappear from the list immediately, as expected.
Now, interestingly: deleting a node and monitoring a StatefulSet or Deployment pod does not cause this issue. The StatefulSet/Deployment pods disappear from the diagnostics immediately and reappear when the replacement pod reaches Running status.
I then tried repeating this exercise for a different DaemonSet in the same cluster, deleting a node and watching diagnostics. That pod disappeared immediately.
The difference between the two DaemonSets is that one sits behind a regular ClusterIP Service and the other behind a headless ClusterIP Service. The Service for otel-agent is the regular (non-headless) one:
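A minimal sketch of a non-headless ClusterIP Service fronting this DaemonSet (the selector label and port name are assumptions, not the actual manifest):
apiVersion: v1
kind: Service
metadata:
  name: otel-agent
  namespace: open-telemetry
spec:
  selector:
    app: otel-agent        # assumed label
  ports:
  - name: otlp-grpc
    port: 4317
    targetPort: 4317
    protocol: TCP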
So, I created a headless service to sit alongside this:
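A minimal sketch of that headless twin (the name is hypothetical; the only material difference is clusterIP: None):
apiVersion: v1
kind: Service
metadata:
  name: otel-agent-headless   # hypothetical name
  namespace: open-telemetry
spec:
  clusterIP: None             # headless: DNS returns the pod IPs directly
  selector:
    app: otel-agent           # assumed label, same as the existing Service
  ports:
  - name: otlp-grpc
    port: 4317
    targetPort: 4317
    protocol: TCP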
and then repeated the same test:
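The test was along these lines (a sketch; the headless Service name is the hypothetical one from above):
kubectl delete node <node-running-an-otel-agent-pod>
linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317
linkerd diagnostics endpoints otel-agent-headless.open-telemetry.svc.cluster.local:4317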
The pod disappeared immediately from the diagnostics for the headless service, but remained in the ClusterIP service.
I'm going to keep testing this headless vs non-headless theory with Deployments / StatefulSets, since I can't remember whether the Deployment and STS I tried were behind headless or non-headless Services, but this is promising.
A quick update: I have tried to fix this myself, but my knowledge of the linkerd internals isn't enough; they are quite complex, with threading, caching, and so on.
So, linkerd syncs addresses from endpointslices.discovery.k8s.io (with -enable-endpoint-slices=true). Say the slice initially has a record for otel-agent-9j8n5 as our starting state. Once the DaemonSet is restarted with
kubectl -n open-telemetry rollout restart daemonset.apps/otel-agent
the EndpointSlices change and linkerd reconciles them, but it seemingly doesn't clean up the previous record for otel-agent-9j8n5. It may be related to the diffAddresses function, or to some code path that skips via a continue or bails out on an err condition; it seems the Delete function is simply never reached.
How to reproduce:
linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317
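For comparison, the EndpointSlice backing the Service can be inspected directly; the stale pod should already be gone from the slice even while the diagnostics output above still lists it (kubernetes.io/service-name is the standard label linking slices to their Service):
kubectl -n open-telemetry get endpointslices -l kubernetes.io/service-name=otel-agent -o yaml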
Open questions:
When, and why, does
pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: forcing resync
execute? And why has it never executed here?