linkerd2: linkerd-destination seemingly serving incorrect Pod IPs causing connection errors

What is the issue?

Slack thread here.

The proxy for one of our applications is full of failed to connect errors. Examples:

[   584.018861s]  WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.17.10:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s

 [  2104.773832s]  WARN ThreadId(01) linkerd_reconnect: Failed to connect error=connect timed out after 1s

 [  1785.088002s]  WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.19.9:4317}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

Debug logs aren’t much use:

DEBUG ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.18:4317}: linkerd_reconnect: Recovering

offer-price-index-service-ie-66bbbfcff-f8kbg [  1851.149239s] DEBUG ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.18:4317}: linkerd_reconnect: Disconnected backoff=true
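
For context, proxy output at this level of detail can generally be enabled by setting the proxy log level annotation on the workload's pod template; the deployment name below is only inferred from the pod name in the logs, so treat this as a sketch and adjust to your environment:

# Hedged sketch: raise the proxy log level on the affected workload. The
# config.linkerd.io/proxy-log-level annotation has to live on the pod template
# so the proxies pick it up when the pods are recreated.
kubectl -n <app-namespace> patch deploy offer-price-index-service-ie \
  -p '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"warn,linkerd=debug"}}}}}'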

The service otel-agent.open-telemetry.svc.cluster.local:4317 points to a DaemonSet, so pods are constantly being replaced as our preemptible nodes are recycled.
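
For reference, the DaemonSet's current pod IPs can be compared against the server.addr values in the errors above with something like the following (the component=otel-agent label matches the Service selector shown later in this thread; adjust if yours differs):

# List the current otel-agent pods with their IPs and nodes.
kubectl -n open-telemetry get pods -l component=otel-agent \
  -o custom-columns=NAME:.metadata.name,IP:.status.podIP,NODE:.spec.nodeName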

I caught a failed request in Tap, where all other requests are ~1ms:

[Tap screenshot showing the failed request]

I then found this log entry:

2022-07-22 12:48:54
WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.4:4317}: linkerd_reconnect: Service failed error=channel closed

10.225.13.4 belonged to pod otel-agent-6kg56 until yesterday (21/07) at 13:52; the IP was then reassigned to pod tap-injector-7c4b6dc5f9-7wsxn. For some reason, the linkerd-proxy is still sending requests to IPs that no longer belong to pods in that DaemonSet.
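
A quick way to confirm which pod currently owns one of these stale endpoint IPs is a plain grep across all namespaces:

# Show whichever pod currently holds the IP from the failing endpoint.
kubectl get pods -A -o wide | grep '10\.225\.13\.4'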

I run three replicas of linkerd-destination, which at the time of writing were:

linkerd-destination-8f9d689fc-lxl5f       4/4     Running   0          3d22h
linkerd-destination-8f9d689fc-mddv5       4/4     Running   0          21h
linkerd-destination-8f9d689fc-rn8dn       4/4     Running   0          21h

I searched the last two days of logs from the linkerd-destination pods for anything relating to the stale otel-agent pod and found two log entries:

2022-07-21 14:51:44
 linkerd-destination-8f9d689fc-lxl5f time="2022-07-21T13:51:44Z" level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-6kg56: pod \"otel-agent-6kg56\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent
2022-07-21 14:51:44
 linkerd-destination-8f9d689fc-zqxhr time="2022-07-21T13:51:44Z" level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-6kg56: pod \"otel-agent-6kg56\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent
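
For anyone wanting to repeat the search, something along these lines should work (the label selector and container name are the defaults for a standard Linkerd install; adjust if your setup is customised):

# Grep the last two days of each destination pod's logs for the stale pod name.
for p in $(kubectl -n linkerd get pods -l linkerd.io/control-plane-component=destination -o name); do
  echo "== $p"
  kubectl -n linkerd logs "$p" -c destination --since=48h | grep otel-agent-6kg56
done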

zqxhr had since been deleted, but lxl5f is still alive. My theory is that this linkerd-destination pod has entered an unhealthy state while still reporting ready, and is serving "bad information" to the linkerd-proxy containers that request destination information.

I then killed the linkerd-destination pod that was logging these errors at 13:10 and have not seen the error come back since.

How can it be reproduced?

Not confirmed, but presumably by removing pods from a DaemonSet and checking whether the destination controller updates.
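
As a rough sketch of that check (pod name taken from the logs above; note that the comments below found that deleting pods directly does clean up correctly, and it was node deletion that reproduced the issue):

# Delete one DaemonSet pod, then check whether its address is dropped from the
# destination controller's view of the service.
kubectl -n open-telemetry delete pod otel-agent-6kg56
linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent-6kg56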

Logs, error output, etc

See section above.

output of linkerd check -o short

I have a mismatched CLI version, so ignore the version warnings:

Linkerd core checks
===================

linkerd-version
---------------
‼ cli is up-to-date
    is running version 22.6.2 but the latest edge version is 22.7.2
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 22.7.1 but the latest edge version is 22.7.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-22.7.1 but cli running edge-22.6.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* alpine-curl-6cd54fb65d-j6z7p (edge-22.7.1)
	* linkerd-destination-8f9d689fc-jrkr2 (edge-22.7.1)
	* linkerd-destination-8f9d689fc-mddv5 (edge-22.7.1)
	* linkerd-destination-8f9d689fc-rn8dn (edge-22.7.1)
	* linkerd-identity-55c8d467bf-hr479 (edge-22.7.1)
	* linkerd-identity-55c8d467bf-r4bd4 (edge-22.7.1)
	* linkerd-identity-55c8d467bf-zvnvl (edge-22.7.1)
	* linkerd-proxy-injector-84c59f7bf9-5rnj4 (edge-22.7.1)
	* linkerd-proxy-injector-84c59f7bf9-r692g (edge-22.7.1)
	* linkerd-proxy-injector-84c59f7bf9-wjh44 (edge-22.7.1)
	* nginx-67888f898b-x9mz4 (edge-22.7.1)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    alpine-curl-6cd54fb65d-j6z7p running edge-22.7.1 but cli running edge-22.6.2
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

Linkerd extensions checks
=========================

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-794fbbf94b-z7b5r (edge-22.7.1)
	* prometheus-6cd858c97b-rh9xr (edge-22.7.1)
	* tap-68b97d8d8b-szrc9 (edge-22.7.1)
	* tap-68b97d8d8b-tvj55 (edge-22.7.1)
	* tap-68b97d8d8b-xhs69 (edge-22.7.1)
	* tap-injector-7c4b6dc5f9-7wsxn (edge-22.7.1)
	* web-6576dcfdb-g4mhd (edge-22.7.1)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-794fbbf94b-z7b5r running edge-22.7.1 but cli running edge-22.6.2
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

Kubernetes version: 1.22
Host: GKE
Linkerd version: edge-22.7.1

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 20 (7 by maintainers)


Most upvoted comments

This seems to be quite easily reproducible by deleting a GKE node:

>linkerd-edge diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent-j79jl
open-telemetry   10.225.1.3      4317   otel-agent-j79jl   otel-agent.open-telemetry

>k get pod -n open-telemetry otel-agent-j79jl -o yaml | grep nodeName
nodeName: gke-gke-web-gb-dev-system-pool-01-pre-a1b91517-hjf7

>k delete node gke-gke-web-gb-dev-system-pool-01-pre-a1b91517-hjf7
node "gke-gke-web-gb-dev-system-pool-01-pre-a1b91517-hjf7" deleted

> k get pod -n open-telemetry otel-agent-j79jl -w         
NAME               READY   STATUS    RESTARTS   AGE
otel-agent-j79jl   1/1     Running   0          6h50m
otel-agent-j79jl   1/1     Terminating   0          6h50m
otel-agent-j79jl   1/1     Terminating   0          6h50m

>k get pod -n open-telemetry otel-agent-j79jl   
Error from server (NotFound): pods "otel-agent-j79jl" not found

>k get node gke-gke-web-gb-dev-system-pool-01-pre-a1b91517-hjf7
Error from server (NotFound): nodes "gke-gke-web-gb-dev-system-pool-01-pre-a1b91517-hjf7" not found

>linkerd-edge diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent-j79jl
open-telemetry   10.225.1.3      4317   otel-agent-j79jl   otel-agent.open-telemetry

I have tried manually deleting two nodes with the above commands (one preemptible, one standard provisioning) and in both cases the deleted pod still appears in the diagnostics output. I've left it a good 30+ minutes and the pods are still listed. Deleting otel-agent pods directly works as expected: they disappear from the list immediately.


Now, interestingly: deleting a node and monitoring a StatefulSet or Deployment pod does not cause this issue. The StatefulSet/Deployment pods disappear from the diagnostics output immediately and reappear when the new pod reaches Running status.

I then repeated the exercise for a different DaemonSet in the same cluster, deleting a node and watching the diagnostics output. The pod disappeared immediately.

The difference between the two DaemonSets is that one is fronted by a regular ClusterIP service and the other by a headless ClusterIP service. The service for otel-agent:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: opentelemetry
    component: otel-agent
  name: otel-agent
  namespace: open-telemetry
spec:
  clusterIP: 10.224.28.33
  clusterIPs:
  - 10.224.28.33
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: otlp
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: zipkin
    port: 9411
    protocol: TCP
    targetPort: 9411
  - name: metrics
    port: 8888
    protocol: TCP
    targetPort: 8888
  selector:
    component: otel-agent
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

So, I created a headless service to sit alongside this:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2022-08-10T12:06:46Z"
  labels:
    app: opentelemetry
    component: otel-agent
  name: otel-agent-headless
  namespace: open-telemetry
  resourceVersion: "1216759460"
  uid: 9da722c7-4c58-4080-ad48-1c4603808384
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: otlp
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: zipkin
    port: 9411
    protocol: TCP
    targetPort: 9411
  - name: metrics
    port: 8888
    protocol: TCP
    targetPort: 8888
  selector:
    component: otel-agent
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

and then repeated the same test:

>linkerd-edge diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent-pv58f
open-telemetry   10.225.3.2      4317   otel-agent-pv58f   otel-agent.open-telemetry

>linkerd-edge diagnostics endpoints otel-agent-headless.open-telemetry.svc.cluster.local:4317 | grep otel-agent-pv58f
open-telemetry   10.225.3.2      4317   otel-agent-pv58f   otel-agent-headless.open-telemetry

>k get pod otel-agent-pv58f -n open-telemetry -o yaml | grep nodeName
gke-gke-web-gb-dev-web-wi-n2-01-4b906099-pqg8

>k delete node gke-gke-web-gb-dev-web-wi-n2-01-4b906099-pqg8
node "gke-gke-web-gb-dev-web-wi-n2-01-4b906099-pqg8" deleted

>k get pod otel-agent-pv58f -n open-telemetry -w                     
NAME               READY   STATUS    RESTARTS   AGE
otel-agent-pv58f   1/1     Running   0          7m48s
otel-agent-pv58f   1/1     Terminating   0          8m34s
otel-agent-pv58f   1/1     Terminating   0          8m34s

>k get pod otel-agent-pv58f -n open-telemetry -w
Error from server (NotFound): pods "otel-agent-pv58f" not found

>linkerd-edge diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent-pv58f
open-telemetry   10.225.3.2      4317   otel-agent-pv58f   otel-agent.open-telemetry

>linkerd-edge diagnostics endpoints otel-agent-headless.open-telemetry.svc.cluster.local:4317 | grep otel-agent-pv58f

>linkerd-edge diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | wc -l
34

>linkerd-edge diagnostics endpoints otel-agent-headless.open-telemetry.svc.cluster.local:4317 | wc -l
24

As you can see, the pod disappeared immediately from the headless service's diagnostics output, but remains listed for the ClusterIP service.
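
A side-by-side comparison of the two services makes the discrepancy easy to spot, assuming the pod name is in the fourth column of the diagnostics output as shown above:

# Pods only present in the ClusterIP service's view show up prefixed with '<'.
diff \
  <(linkerd-edge diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent | awk '{print $4}' | sort) \
  <(linkerd-edge diagnostics endpoints otel-agent-headless.open-telemetry.svc.cluster.local:4317 | grep otel-agent | awk '{print $4}' | sort)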


I’m going to keep testing this headless vs non-headless theory with Deployments and StatefulSets, since I can’t remember whether the Deployment and StatefulSet I tried earlier were headless or not, but this is promising.

Well, a quick update from me. I have tried to fix this, but my knowledge of Linkerd internals isn’t enough; the code is quite complex, with a lot of threading and caching involved.

So, Linkerd syncs addresses from endpointslices.discovery.k8s.io (with -enable-endpoint-slices=true), which contain records like the following (let’s say this is our initial state):

...
  targetRef:
    kind: Pod
    name: otel-agent-9j8n5
    namespace: open-telemetry
...

Once the DaemonSet is restarted with kubectl -n open-telemetry rollout restart daemonset.apps/otel-agent (so the EndpointSlices change), Linkerd reconciles the update but, it seems, does not clean up the previous record for otel-agent-9j8n5. This may be related to the diffAddresses function, or to some code path that skips via continue or bails out on an error condition; either way it never seems to reach the Delete function (see the check sketched after the stack trace below):

...
linkerd-destination-5687bc5cdb-bv9l4 destination time="2022-08-15T14:15:34Z" level=debug msg="Updating EndpointSlice for open-telemetry/otel-agent" addr=":8086" component=service-publisher ns=open-telemetry svc=otel-agent
...
linkerd-destination-5687bc5cdb-bv9l4 destination EP: updateEndpointSlice port.id: /
linkerd-destination-5687bc5cdb-bv9l4 destination EP: updateEndpointSlice port.targetPort: otlp
linkerd-destination-5687bc5cdb-bv9l4 destination EP: updateEndpointSlice port.srcPort: 4317
linkerd-destination-5687bc5cdb-bv9l4 destination EP: updateEndpointSlice oldSlice otel-agent-w4mq9
linkerd-destination-5687bc5cdb-bv9l4 destination EP: updateEndpointSlice newSlice otel-agent-w4mq9

...

linkerd-destination-5687bc5cdb-bv9l4 destination E0815 14:15:37.068327       1 runtime.go:77] Observed a panic: pod "otel-agent-9j8n5" not found
linkerd-destination-5687bc5cdb-bv9l4 destination goroutine 142 [running]:
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1a11c20?, 0xc00443ec80})
linkerd-destination-5687bc5cdb-bv9l4 destination 	/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/runtime/runtime.go:75 +0x99
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x0?})
linkerd-destination-5687bc5cdb-bv9l4 destination 	/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/runtime/runtime.go:49 +0x75
linkerd-destination-5687bc5cdb-bv9l4 destination panic({0x1a11c20, 0xc00443ec80})
linkerd-destination-5687bc5cdb-bv9l4 destination 	/usr/local/go/src/runtime/panic.go:838 +0x207
linkerd-destination-5687bc5cdb-bv9l4 destination github.com/linkerd/linkerd2/controller/api/destination/watcher.(*portPublisher).newPodRefAddress(0xc001c555e0, 0x10dd, {0xc001e888a0, 0x0?}, {0xc001e888d0, 0x10}, {0xc001e888c0, 0xe})
linkerd-destination-5687bc5cdb-bv9l4 destination 	/linkerd-build/controller/api/destination/watcher/endpoints_watcher.go:858 +0x2bc
linkerd-destination-5687bc5cdb-bv9l4 destination github.com/linkerd/linkerd2/controller/api/destination/watcher.(*portPublisher).endpointSliceToAddresses(0xc001c555e0, 0xc0033abe40)
linkerd-destination-5687bc5cdb-bv9l4 destination 	/linkerd-build/controller/api/destination/watcher/endpoints_watcher.go:756 +0x89c
linkerd-destination-5687bc5cdb-bv9l4 destination github.com/linkerd/linkerd2/controller/api/destination/watcher.(*portPublisher).updateEndpointSlice(0xc001c555e0, 0xc0033abe40, 0xc001de66e0)
linkerd-destination-5687bc5cdb-bv9l4 destination 	/linkerd-build/controller/api/destination/watcher/endpoints_watcher.go:645 +0x269
linkerd-destination-5687bc5cdb-bv9l4 destination github.com/linkerd/linkerd2/controller/api/destination/watcher.(*servicePublisher).updateEndpointSlice(0xc00095df40, 0xc002c2da90?, 0x43ae16?)
linkerd-destination-5687bc5cdb-bv9l4 destination 	/linkerd-build/controller/api/destination/watcher/endpoints_watcher.go:452 +0x2cb
linkerd-destination-5687bc5cdb-bv9l4 destination github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateEndpointSlice(0xc0006b5800, {0x1d83c60?, 0xc0033abe40?}, {0x1d83c60?, 0xc001de66e0?})
linkerd-destination-5687bc5cdb-bv9l4 destination 	/linkerd-build/controller/api/destination/watcher/endpoints_watcher.go:332 +0x10b
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
linkerd-destination-5687bc5cdb-bv9l4 destination 	/go/pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/controller.go:239
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/client-go/tools/cache.(*processorListener).run.func1()
linkerd-destination-5687bc5cdb-bv9l4 destination 	/go/pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/shared_informer.go:816 +0x137
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?)
linkerd-destination-5687bc5cdb-bv9l4 destination 	/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:155 +0x3e
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00006df38?, {0x207ba20, 0xc0006ecf90}, 0x1, 0xc0007885a0)
linkerd-destination-5687bc5cdb-bv9l4 destination 	/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:156 +0xb6
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0xc00006df88?)
linkerd-destination-5687bc5cdb-bv9l4 destination 	/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:133 +0x89
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/apimachinery/pkg/util/wait.Until(...)
linkerd-destination-5687bc5cdb-bv9l4 destination 	/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:90
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/client-go/tools/cache.(*processorListener).run(0xc0001dbb80?)
linkerd-destination-5687bc5cdb-bv9l4 destination 	/go/pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/shared_informer.go:812 +0x6b
linkerd-destination-5687bc5cdb-bv9l4 destination k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
linkerd-destination-5687bc5cdb-bv9l4 destination 	/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:73 +0x5a
linkerd-destination-5687bc5cdb-bv9l4 destination created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
linkerd-destination-5687bc5cdb-bv9l4 destination 	/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:71 +0x85
...
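
As referenced above, a quick way to narrow down where the stale record lives is to compare the Kubernetes objects with the destination controller's view; if the theory above is right, the EndpointSlice targetRefs and the live pods agree and only the controller's cached state is stale. A sketch (labels are the standard ones; adjust as needed):

# Three views of the otel-agent endpoints: EndpointSlice targetRefs, live pods,
# and what the destination controller is serving.
kubectl -n open-telemetry get endpointslices -l kubernetes.io/service-name=otel-agent \
  -o jsonpath='{range .items[*].endpoints[*]}{.targetRef.name}{"\n"}{end}' | sort
kubectl -n open-telemetry get pods -l component=otel-agent -o name | sed 's|pod/||' | sort
linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent | awk '{print $4}' | sort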

How to reproduce:

  1. Make sure enable-endpoint-slices=true;
  2. Continuously restart an existing DaemonSet/Deployment/StatefulSet until the unable to fetch pod error is thrown, then wait until the restart completes;
  3. Check for differences with linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 (a rough automation of these steps is sketched below).
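
A rough automation of those steps (a sketch only: it assumes bash, the component=otel-agent label from the Service above, and that the pod name is the fourth column of the diagnostics output):

# Repeatedly restart the DaemonSet and compare the destination controller's view
# with the pods that actually exist; lines prefixed with '<' are stale entries.
for i in $(seq 1 20); do
  kubectl -n open-telemetry rollout restart daemonset/otel-agent
  kubectl -n open-telemetry rollout status daemonset/otel-agent --timeout=5m
  sleep 10  # give the destination controller a moment to process the update
  diff \
    <(linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317 | grep otel-agent | awk '{print $4}' | sort) \
    <(kubectl -n open-telemetry get pods -l component=otel-agent -o name | sed 's|pod/||' | sort) \
    && echo "iteration $i: clean" || { echo "iteration $i: stale endpoints found"; break; }
done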

Open questions:

When and why does pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: forcing resync execute? And why has it never been executed here?