linkerd2: Proxy gets 503 response from K8s service

Bug Report

While debugging occasional failures in production, I observed connection errors between the outbound proxy and a pod that does not exist. This resembles issue 6184.

What is the issue?

We use Nginx ingress with Linkerd running in default (normal) mode. One of the ingress pods tries to connect to an IP (10.47.255.72:8080) that belongs to a nonexistent pod. After reviewing the cluster's state, I saw that none of the pods had restarted (neither source nor destination). The IP address is currently not assigned to any pod.
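The check described above (whether an IP is assigned to any pod) can be sketched as follows. This is a minimal sketch, assuming the pod list has already been exported with `kubectl get pods --all-namespaces -o json` and parsed into a dict; the function name `pods_with_ip` and the sample pod name and namespace are hypothetical and only serve as illustration.

```python
def pods_with_ip(pod_list, ip):
    """Return (namespace, name) for every pod whose status.podIP equals ip."""
    matches = []
    for pod in pod_list.get("items", []):
        # Pods that have not been scheduled yet may have no podIP at all.
        if pod.get("status", {}).get("podIP") == ip:
            meta = pod.get("metadata", {})
            matches.append((meta.get("namespace"), meta.get("name")))
    return matches


# Hypothetical stand-in for the parsed kubectl output; in practice this
# dict would come from json.load() on the exported pod list.
sample = {
    "items": [
        {
            "metadata": {"namespace": "ingress", "name": "nginx-ingress-example"},
            "status": {"podIP": "10.48.10.164"},
        },
        {
            # A pending pod without an assigned IP.
            "metadata": {"namespace": "default", "name": "pending-example"},
            "status": {},
        },
    ]
}

orphan_matches = pods_with_ip(sample, "10.47.255.72")
print(orphan_matches)
```

An empty result for the target address is consistent with the observation that no pod currently holds that IP. The same lookup could be pointed at the `spec.clusterIP` field of Services to rule out a Service address.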

How can it be reproduced?

I wasn’t able to reproduce it (it might be caused by a network glitch in GKE).

Logs, error output, etc

[ 272.767311s] INFO ThreadId(01) outbound:accept{peer.addr=10.44.5.238:46900 target.addr=10.47.255.72:8080}: linkerd2_app_core::serve: Connection closed error=Service in fail-fast
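To correlate many such entries, the span fields and the error text can be pulled out of each line. The following is a rough sketch, assuming the log lines keep the shape shown above (one `{key=value ...}` span and a trailing `error=` message); the helper name `parse_proxy_log_line` is hypothetical and not part of Linkerd.

```python
import re

SPAN_RE = re.compile(r"\{([^}]*)\}")   # contents of the first {...} span
ERROR_RE = re.compile(r"error=(.*)$")  # everything after "error=" to end of line


def parse_proxy_log_line(line):
    """Extract span fields (e.g. peer.addr, target.addr) and the error text."""
    fields = {}
    span = SPAN_RE.search(line)
    if span:
        for pair in span.group(1).split():
            key, sep, value = pair.partition("=")
            if sep:  # skip tokens that are not key=value
                fields[key] = value
    error = ERROR_RE.search(line)
    if error:
        fields["error"] = error.group(1).strip()
    return fields


line = (
    "[ 272.767311s] INFO ThreadId(01) "
    "outbound:accept{peer.addr=10.44.5.238:46900 target.addr=10.47.255.72:8080}: "
    "linkerd2_app_core::serve: Connection closed error=Service in fail-fast"
)
fields = parse_proxy_log_line(line)
print(fields["target.addr"], "->", fields["error"])
```

Grouping the parsed entries by `target.addr` would show whether the fail-fast errors concentrate on a single stale address or spread across several.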

https://gist.github.com/AlonGluz/82633391f432ef35158deb11c516fb8c

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 2.9.4 but the latest stable version is 2.10.2
    see https://linkerd.io/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.9.4 but the latest stable version is 2.10.2
    see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match

linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
√ multiple replicas of control plane pods

linkerd-multicluster
--------------------
√ Link CRD exists

Status check results are √

Environment

  • Kubernetes Version: 1.18.17-gke.1200
  • Cluster Environment: GKE
  • Host OS: cloud.google.com/gke-os-distribution: cos
  • Linkerd version: 2.9.4

Possible solution

Additional context

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 19 (8 by maintainers)

Most upvoted comments

Hey @mateiidavid ,

  1. A Linkerd upgrade requires maintenance time, and I prefer not to do it at this time unless I’m positive it will resolve the issue.
  2. The mentioned GKE network is routes-based, so there is no CNI.
  3. The log entry describes communication between the Nginx-ingress pod and the Auth service. We configured retries in the Linkerd service profile; without them, the problem is far worse. 10.51.245.134 is actually the IP address resolved from the Auth-service K8s service. Neither 10.48.10.164 (Nginx-ingress pod) nor 10.51.245.134 (Auth K8s service) was deleted; moreover, 10.51.245.43 (Auth K8s service) is reachable a couple of ms later.
  4. I’ll add the more verbose logs shortly.

Hey @mateiidavid, thanks for your response. It happens quite often; it’s just not easily reproducible. I’ll add the config and forward the output.