linkerd2: Upgrading from 2.10.2 -> 2.11.0 - Kube Client "error trying to connect: tls handshake eof" in Policy Controller

Bug Report

What is the issue?

We are having issues upgrading from 2.10.2 -> 2.11.0, and it seems to be related to the new Policy Controller. The controller fails on startup, apparently when its Kubernetes client tries to watch resources: “kube::client: failed with error error trying to connect: tls handshake eof”. Logs (a minimal repro sketch of the failing list call follows them):

2021-10-06T11:36:02.462254Z  INFO grpc{addr=0.0.0.0:8090 cluster_networks=[10.0.0.0/8, 100.64.0.0/10, 172.16.0.0/12, 192.168.0.0/16]}: linkerd_policy_controller: gRPC server listening addr=0.0.0.0:8090
2021-10-06T11:36:02.462275Z  INFO serve{addr=0.0.0.0:9990}: linkerd_policy_controller::admin: HTTP admin server listening addr=0.0.0.0:9990
2021-10-06T11:36:02.462913Z  INFO linkerd_policy_controller: Admission controller server listening addr=0.0.0.0:9443
2021-10-06T11:36:02.469200Z ERROR servers: kube::client: failed with error error trying to connect: tls handshake eof
2021-10-06T11:36:02.469221Z  INFO servers: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: tls handshake eof
2021-10-06T11:36:02.469455Z ERROR pods: kube::client: failed with error error trying to connect: tls handshake eof
2021-10-06T11:36:02.469476Z  INFO pods: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: tls handshake eof
2021-10-06T11:36:02.469513Z ERROR serverauthorizations: kube::client: failed with error error trying to connect: tls handshake eof
2021-10-06T11:36:02.469525Z  INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: tls handshake eof
2021-10-06T11:36:03.470663Z  INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Restarting
2021-10-06T11:36:03.470763Z  INFO pods: linkerd_policy_controller_k8s_api::watch: Restarting
2021-10-06T11:36:03.470804Z  INFO servers: linkerd_policy_controller_k8s_api::watch: Restarting
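
To help isolate this, here is a minimal sketch of the same kind of initial list the controller’s watches perform, written against the kube crate (the kube, k8s-openapi, tokio and anyhow crates are assumed; this is not Linkerd’s actual code). Running it as a pod under the policy controller’s service account should hit the same “tls handshake eof” if the problem lies on the cluster path rather than inside the controller:

// Hypothetical repro: perform the initial Pod list that the controller's
// watch does on startup, using in-cluster service account credentials.
use k8s_openapi::api::core::v1::Pod;
use kube::{api::ListParams, Api, Client};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // In-cluster config when run as a pod, local kubeconfig otherwise.
    let client = Client::try_default().await?;

    // The "initial object list" step that fails in the logs above.
    let pods: Api<Pod> = Api::all(client);
    let list = pods.list(&ListParams::default().limit(1)).await?;
    println!("listed {} pod(s)", list.items.len());
    Ok(())
}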

We are able to trace the request to our API Server, which responds with status 200 but indicates “Connection closed early”:

{
   "kind":"Event",
   "apiVersion":"audit.k8s.io/v1",
   "level":"Request",
   "auditID":"78cdd15d-3197-4437-99ae-c44504bea33b",
   "stage":"ResponseStarted",
   "requestURI":"/api/v1/endpoints?allowWatchBookmarks=true\u0026resourceVersion=477527\u0026timeout=5m14s\u0026timeoutSeconds=314\u0026watch=true",
   "verb":"watch",
   "user":{
      "username":"system:serviceaccount:linkerd:linkerd-destination",
      "uid":"edc0ceeb-f8df-4e52-b73b-65b2a420ee21",
      "groups":[
         "system:serviceaccounts",
         "system:serviceaccounts:linkerd",
         "system:authenticated"
      ],
      "extra":{
         "authentication.kubernetes.io/pod-name":[
            "linkerd-destination-9b7f77f5f-7dbj8"
         ],
         "authentication.kubernetes.io/pod-uid":[
            "51ee23ea-5b2c-46dd-bda6-2e998b8fb78c"
         ]
      }"​"
   },
   "sourceIPs":[
      "***"
   ],
   "userAgent":"controller/v0.0.0 (linux/amd64) kubernetes/$Format",
   "objectRef":{
      "resource":"endpoints",
      "apiVersion":"v1"
   },
   "responseStatus":{
      "metadata":{
         
      },
      "status":"Success",
      "message":"Connection closed early",
      "code":200
   },
   "requestReceivedTimestamp":"2021-10-06T11:30:55.194103Z",
   "stageTimestamp":"2021-10-06T11:36:09.196098Z",
   "annotations":{
      "authorization.k8s.io/decision":"allow",
      "authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"linkerd-linkerd-destination\" of ClusterRole \"linkerd-linkerd-destination\" to ServiceAccount \"linkerd-destination/linkerd\""
   }"​"
}

linkerd check output

We didn’t save the output from linkerd check and have since rolled back to 2.10.2. However, when we ran the check while the issue was occurring, it reported everything as ✅.

Environment

  • Kubernetes Version: 1.21.2
  • Cluster Environment: AKS
  • Host OS: Linux
  • Linkerd version: 2.11.0

Additional context

Linkerd 2.10.2 has been running in the same cluster without any issues. We had some trouble creating a certificate for the policy controller using cert-manager, as the Rust TLS HTTP client it uses apparently doesn’t support ECDSA. We solved this by switching to RSA, as discussed in this thread on Linkerd Slack.
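
As an aside on why the key format matters (our reading of it, not confirmed by the maintainers): rustls-based servers typically load keys via rustls-pemfile, whose pkcs8_private_keys helper only returns keys in PKCS#8 form (“BEGIN PRIVATE KEY”), so a SEC1-encoded ECDSA key (“BEGIN EC PRIVATE KEY”) is simply skipped. A small hypothetical check of what a mounted tls.key actually contains (the path is illustrative; assumes the rustls-pemfile 1.x crate):

// Hypothetical helper: list which PEM items a key file contains, to check
// whether cert-manager wrote the key as PKCS#8. The path is for illustration
// only; adjust it to wherever the policy controller mounts its credentials.
use std::{fs::File, io::BufReader};

fn main() -> std::io::Result<()> {
    let mut reader = BufReader::new(File::open("/var/run/linkerd/tls/tls.key")?);
    for item in rustls_pemfile::read_all(&mut reader)? {
        match item {
            rustls_pemfile::Item::PKCS8Key(_) => println!("PKCS#8 key (BEGIN PRIVATE KEY)"),
            rustls_pemfile::Item::RSAKey(_) => println!("PKCS#1 RSA key (BEGIN RSA PRIVATE KEY)"),
            _ => println!("other PEM item (certificate, SEC1 EC key, ...)"),
        }
    }
    Ok(())
}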

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 17 (10 by maintainers)

Most upvoted comments

@sigurdfalk Sorry, I don’t think we have any leads on this issue yet. It’s still on our radar, though.

From some very brief testing, this seems to work. We should update the docs at https://linkerd.io/2.11/tasks/automatically-rotating-webhook-tls-credentials/ (https://github.com/linkerd/website) – which doesn’t even document the policy controller webhook config at the moment – to reflect this.

@sigurdfalk Excellent. Thanks for confirming!

@olix0r The policy controller now accepts the cert with ECDSA; however, we are still seeing the same errors 🧐

I did a fresh install of v2.10.2 and then tried to upgrade to v2.11.0 with encoding: PKCS8 in the certificate. I still find this really strange, as all other Linkerd components seem to work as expected and are able to communicate with the API Server.

New dump of errors:

{"timestamp":"2021-10-21T08:08:47.991233Z","level":"INFO","fields":{"message":"gRPC server listening","addr":"0.0.0.0:8090"},"target":"linkerd_policy_controller","spans":[{"addr":"0.0.0.0:8090","cluster_networks":"[10.0.0.0/8, 100.64.0.0/10, 172.16.0.0/12, 192.168.0.0/16]","name":"grpc"}]}
{"timestamp":"2021-10-21T08:08:47.991257Z","level":"INFO","fields":{"message":"HTTP admin server listening","addr":"0.0.0.0:9990"},"target":"linkerd_policy_controller::admin","spans":[{"addr":"0.0.0.0:9990","name":"serve"}]}
{"timestamp":"2021-10-21T08:08:47.991365Z","level":"INFO","fields":{"message":"Admission controller server listening","addr":"0.0.0.0:9443"},"target":"linkerd_policy_controller"}
{"timestamp":"2021-10-21T08:08:48.000102Z","level":"ERROR","fields":{"message":"failed with error error trying to connect: tls handshake eof"},"target":"kube::client","spans":[{"name":"pods"}]}
{"timestamp":"2021-10-21T08:08:48.000235Z","level":"INFO","fields":{"message":"Failed","error":"failed to perform initial object list: HyperError: error trying to connect: tls handshake eof"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:48.000384Z","level":"ERROR","fields":{"message":"failed with error error trying to connect: tls handshake eof"},"target":"kube::client","spans":[{"name":"servers"}]}
{"timestamp":"2021-10-21T08:08:48.000448Z","level":"INFO","fields":{"message":"Failed","error":"failed to perform initial object list: HyperError: error trying to connect: tls handshake eof"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:48.000550Z","level":"ERROR","fields":{"message":"failed with error error trying to connect: tls handshake eof"},"target":"kube::client","spans":[{"name":"serverauthorizations"}]}
{"timestamp":"2021-10-21T08:08:48.000576Z","level":"INFO","fields":{"message":"Failed","error":"failed to perform initial object list: HyperError: error trying to connect: tls handshake eof"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:49.000943Z","level":"INFO","fields":{"message":"Restarting"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:49.001056Z","level":"INFO","fields":{"message":"Restarting"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}
{"timestamp":"2021-10-21T08:08:49.002210Z","level":"INFO","fields":{"message":"Restarting"},"target":"linkerd_policy_controller_k8s_api::watch","spans":[]}

And we still see the requests coming into our API Server, which responds with status 200 OK. Here is a line from the API server audit log traced back to the policy controller’s request for /apis/split.smi-spec.io/v1alpha1/trafficsplits:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"359d181f-47bc-4bc9-ba3a-d2b52c9099bb","stage":"ResponseComplete","requestURI":"/apis/split.smi-spec.io/v1alpha1/trafficsplits?limit=500\u0026resourceVersion=0","verb":"list","user":{"username":"system:serviceaccount:linkerd:linkerd-destination","uid":"4820cbb9-bb93-4315-8121-1a83dd678e0d","groups":["system:serviceaccounts","system:serviceaccounts:linkerd","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["linkerd-destination-756758f5b-xfn6f"],"authentication.kubernetes.io/pod-uid":["0fe069f2-6624-445d-82de-28720cc0b957"]}​},"sourceIPs":["***"],"userAgent":"controller/v0.0.0 (linux/amd64) kubernetes/$Format","objectRef":{"resource":"trafficsplits","apiGroup":"split.smi-spec.io","apiVersion":"v1alpha1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2021-10-21T08:08:09.650095Z","stageTimestamp":"2021-10-21T08:08:09.651179Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"linkerd-linkerd-destination\" of ClusterRole \"linkerd-linkerd-destination\" to ServiceAccount \"linkerd-destination/linkerd\""}​}

Not sure if it’s relevant, but we have an Azure Firewall between the cluster and the API Server. However, this has never caused issues before.

@olix0r That’s great, thank you! I’m going to try verifying this tomorrow 🙏🏻