linkerd2: Upgrading from 2.10.2 -> 2.11.0 - Kube Client "error trying to connect: tls handshake eof" in Policy Controller
Bug Report
What is the issue?
We are having issues upgrading from 2.10.2
-> 2.11.0
. It seems to be related to the new Policy Controller. The controller fails starting up, apparently when the Kube Client is trying to watch resources: “kube::client: failed with error error trying to connect: tls handshake eof”. Logs:
2021-10-06T11:36:02.462254Z INFO grpc{addr=0.0.0.0:8090 cluster_networks=[10.0.0.0/8, 100.64.0.0/10, 172.16.0.0/12, 192.168.0.0/16]}: linkerd_policy_controller: gRPC server listening addr=0.0.0.0:8090
2021-10-06T11:36:02.462275Z INFO serve{addr=0.0.0.0:9990}: linkerd_policy_controller::admin: HTTP admin server listening addr=0.0.0.0:9990
2021-10-06T11:36:02.462913Z INFO linkerd_policy_controller: Admission controller server listening addr=0.0.0.0:9443
2021-10-06T11:36:02.469200Z ERROR servers: kube::client: failed with error error trying to connect: tls handshake eof
2021-10-06T11:36:02.469221Z INFO servers: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: tls handshake eof
2021-10-06T11:36:02.469455Z ERROR pods: kube::client: failed with error error trying to connect: tls handshake eof
2021-10-06T11:36:02.469476Z INFO pods: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: tls handshake eof
2021-10-06T11:36:02.469513Z ERROR serverauthorizations: kube::client: failed with error error trying to connect: tls handshake eof
2021-10-06T11:36:02.469525Z INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Failed error=failed to perform initial object list: HyperError: error trying to connect: tls handshake eof
2021-10-06T11:36:03.470663Z INFO serverauthorizations: linkerd_policy_controller_k8s_api::watch: Restarting
2021-10-06T11:36:03.470763Z INFO pods: linkerd_policy_controller_k8s_api::watch: Restarting
2021-10-06T11:36:03.470804Z INFO servers: linkerd_policy_controller_k8s_api::watch: Restarting
We are able to track the request to our API Server which seem to respond with status 200
, but indicating that “Connection closed early”:
{
"kind":"Event",
"apiVersion":"audit.k8s.io/v1",
"level":"Request",
"auditID":"78cdd15d-3197-4437-99ae-c44504bea33b",
"stage":"ResponseStarted",
"requestURI":"/api/v1/endpoints?allowWatchBookmarks=true\u0026resourceVersion=477527\u0026timeout=5m14s\u0026timeoutSeconds=314\u0026watch=true",
"verb":"watch",
"user":{
"username":"system:serviceaccount:linkerd:linkerd-destination",
"uid":"edc0ceeb-f8df-4e52-b73b-65b2a420ee21",
"groups":[
"system:serviceaccounts",
"system:serviceaccounts:linkerd",
"system:authenticated"
],
"extra":{
"authentication.kubernetes.io/pod-name":[
"linkerd-destination-9b7f77f5f-7dbj8"
],
"authentication.kubernetes.io/pod-uid":[
"51ee23ea-5b2c-46dd-bda6-2e998b8fb78c"
]
}""
},
"sourceIPs":[
"***"
],
"userAgent":"controller/v0.0.0 (linux/amd64) kubernetes/$Format",
"objectRef":{
"resource":"endpoints",
"apiVersion":"v1"
},
"responseStatus":{
"metadata":{
},
"status":"Success",
"message":"Connection closed early",
"code":200
},
"requestReceivedTimestamp":"2021-10-06T11:30:55.194103Z",
"stageTimestamp":"2021-10-06T11:36:09.196098Z",
"annotations":{
"authorization.k8s.io/decision":"allow",
"authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"linkerd-linkerd-destination\" of ClusterRole \"linkerd-linkerd-destination\" to ServiceAccount \"linkerd-destination/linkerd\""
}""
}
linkerd check
output
We didn’t save the output from linkerd check
and have rolled back to 2.10.2
now. However, when we ran the check while having the issues, it reported all ✅
Environment
- Kubernetes Version: 1.21.2
- Cluster Environment: AKS
- Host OS: Linux
- Linkerd version: 2.11.0
Additional context
Linker 2.10.2
has been running in the same cluster without any issues. We had some trouble creating a certificate for the policy controller using Cert Manager as the Rust TLS http client used apparently dont support ECDSA
. Solved this by switching to RSA
as discussed in this thread on Linkerd Slack.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 17 (10 by maintainers)
@sigurdfalk Sorry, I don’t think we have any leads on this issue yet. It’s still on our radar, though.
From some very brief testing, this seems to work. We should update the docs at https://linkerd.io/2.11/tasks/automatically-rotating-webhook-tls-credentials/ (https://github.com/linkerd/website) – which doesn’t even document the policy controller webhook config at the moment – to reflect this.
@sigurdfalk Excellent. Thanks for confirming!
@olix0r The policy controller now accepts the cert with
ECDSA
, however, we are still seing the same errors 🧐I did a fresh install of
v2.10.2
and then tried to upgrade tov2.11.0
withencoding: PKCS8
in the certificate. I still think this is really strange as all other components of Linkerd seems to work as expected and are able to communicate with the API Server.New dump of errors:
And we still see the requests coming in to our API Server, which responds with status
200 OK
. Here is a line from the API server log tracing back to the policy controller requesting/apis/split.smi-spec.io/v1alpha1/trafficsplits
:Not sure if relevant, but we have a Azure Firewall between the cluster and the API Server. But this has never been an issue before.
@olix0r Thats great, thank you! I’m gonna try verifying this tomorrow 🙏🏻