linkerd2: Random errors: x509: certificate signed by unknown authority
Bug Report
What is the issue?
I don’t understand all the details, but I periodically see this error in different places, even though Linkerd works in general. The error appears randomly. Restarting the pods solves it, but I don’t think that’s a good workaround.
➜ linkerd top deployment/application --namespace default
Error: HTTP error, status Code [503] (unexpected API response: Error: 'x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "linkerd-tap.linkerd.svc")'
Trying to reach: 'https://10.56.79.145:8089/apis/tap.linkerd.io/v1alpha1/watch/namespaces/default/deployments/application/tap')
Usage:
linkerd top [flags] (RESOURCE)
#...
kubectl rollout restart -n linkerd deployment/linkerd-tap
# ...
linkerd top deployment/application --namespace default
# now it works, but after a while the problem returns
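For context, tap is served as an aggregated Kubernetes API (that is the /apis/tap.linkerd.io/v1alpha1 path in the error above), so the serving certificate of the linkerd-tap pods is verified against the caBundle pinned in the tap APIService registration. A rough sketch of that registration follows; the values are inferred from the error message, not copied from the cluster:

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.tap.linkerd.io
spec:
  group: tap.linkerd.io
  version: v1alpha1
  groupPriorityMinimum: 1000
  versionPriority: 100
  service:
    name: linkerd-tap          # service backing the tap pods
    namespace: linkerd
  # caBundle is the CA used to verify the certificate served by the tap pods.
  # If the running pods still present an older certificate, verification fails
  # with "x509: certificate signed by unknown authority", as shown above.
  caBundle: <base64-encoded CA certificate>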
How can it be reproduced?
Logs, error output, etc
[linkerd-tap-86c9f7cc98-p49b5 tap] 2019/09/30 14:58:25 http: TLS handshake error from 127.0.0.1:33188: remote error: tls: bad certificate
[linkerd-tap-86c9f7cc98-psztb tap] 2019/09/30 14:58:25 http: TLS handshake error from 127.0.0.1:37118: remote error: tls: bad certificate
[linkerd-tap-86c9f7cc98-psztb tap] 2019/09/30 14:58:26 http: TLS handshake error from 127.0.0.1:37198: remote error: tls: bad certificate
I didn’t find any other errors in the other Linkerd (l5d) pods.
NAME                                      READY   STATUS    RESTARTS   AGE
linkerd-controller-784c8ddfbd-6l7zv       2/2     Running   0          8h
linkerd-controller-784c8ddfbd-b67s2       2/2     Running   0          47m
linkerd-controller-784c8ddfbd-m95ll       2/2     Running   0          8h
linkerd-destination-7655c8bc7c-4zcxm      2/2     Running   0          8h
linkerd-destination-7655c8bc7c-q4jwz      2/2     Running   0          8h
linkerd-destination-7655c8bc7c-xlx9g      2/2     Running   0          8h
linkerd-grafana-86df8766f8-xlxld          2/2     Running   0          8h
linkerd-identity-59f8fbf6fc-ll597         2/2     Running   0          47m
linkerd-identity-59f8fbf6fc-wgcpj         2/2     Running   0          8h
linkerd-identity-59f8fbf6fc-z66p7         2/2     Running   0          8h
linkerd-prometheus-98c96c5d5-jc2lz        2/2     Running   0          8h
linkerd-proxy-injector-67f7db5566-9wdls   2/2     Running   0          8h
linkerd-proxy-injector-67f7db5566-hc2kv   2/2     Running   0          8h
linkerd-proxy-injector-67f7db5566-t225x   2/2     Running   0          8h
linkerd-sp-validator-c4c598c49-djhv7      2/2     Running   0          47m
linkerd-sp-validator-c4c598c49-ktmdw      2/2     Running   0          8h
linkerd-sp-validator-c4c598c49-lb7jv      2/2     Running   0          30m
linkerd-tap-86c9f7cc98-h8c2d              2/2     Running   0          7h31m
linkerd-tap-86c9f7cc98-p49b5              2/2     Running   0          7h31m
linkerd-tap-86c9f7cc98-psztb              2/2     Running   0          7h30m
linkerd-web-549f59496c-sm6p9              2/2     Running   0          47m
linkerd check output
➜ linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
is running version 19.9.3 but the latest edge version is 19.9.4
see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 19.9.3 but the latest edge version is 19.9.4
see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
Status check results are √
Environment
- Kubernetes Version: v1.12.10-eks-825e5d
- Cluster Environment: EKS
- Host OS: Amazon Linux
- Linkerd version: 19.9.3
Possible solution
Additional context
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 18 (17 by maintainers)
So it turns out that we need to make sure include is using the right scope (i.e. "$" instead of "."). The "." refers to the current scope and has been changed to {{.Values}} at the start of the template. Using "$", we make sure the tap-rbac.yaml template is included using the global scope; hence, all the variables are rendered correctly.

Also, we need to make sure the annotation is added to the pod template, not the deployment. This diff works for me:
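As a minimal sketch of that pattern (the annotation key, labels, and container fields here are illustrative, not the exact diff): the checksum of the rendered tap-rbac.yaml goes under the pod template's annotations, and the include uses "$" as its scope:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: linkerd-tap
  namespace: linkerd
spec:
  selector:
    matchLabels:
      linkerd.io/control-plane-component: tap
  template:
    metadata:
      labels:
        linkerd.io/control-plane-component: tap
      annotations:
        # Placed on the pod template (not the Deployment metadata) so that a
        # change in the rendered tap-rbac.yaml changes the pod spec and makes
        # helm upgrade roll the tap pods. "$" is the root scope; "." has been
        # rebound to .Values earlier in the template.
        checksum/tap-rbac: {{ include (print $.Template.BasePath "/tap-rbac.yaml") $ | sha256sum }}
    spec:
      containers:
      - name: tap
        image: gcr.io/linkerd-io/controller:edge-19.9.3   # illustrative
        args: ["tap"]                                      # illustrative

Because the hash changes whenever the rendered tap-rbac.yaml changes, a subsequent helm upgrade --install rolls the tap pods, which is the behaviour described below.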
To reproduce this problem, run:
With the new annotation, the tap pod will get restarted after the second upgrade --install command.

After kubectl rollout restart -n linkerd deployment/linkerd-tap
So, this is the difference:
And the new date is equal to the last chart update (LAST DEPLOYED: Tue Oct 15 12:28:58 2019), which makes me think that helm upgrade does not restart everything that must be restarted:

At this moment I can’t find any recently changed secrets:
So I see the correlation between the last deploy and the certificate issue date (Not Before), but I don’t see why the linkerd-tap.linkerd.svc certificate was changed.

I guess this can be fixed if Helm always restarts the tap pods (and maybe others).
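On the "why was the certificate changed" question: one common explanation, which fits the checksum fix above but is stated here as an assumption rather than a confirmed reading of the Linkerd chart, is that the TLS material is generated at template-render time, so every helm upgrade produces a fresh certificate while already-running pods keep serving the old one. A generic illustration using Helm's genSelfSignedCert:

{{- /* Generic illustration (not the Linkerd chart): a certificate generated at
       render time changes on every helm upgrade, so the Secret and any pinned
       caBundle are updated while running pods still serve the previous cert. */ -}}
{{- $cert := genSelfSignedCert "linkerd-tap.linkerd.svc" nil (list "linkerd-tap.linkerd.svc") 365 -}}
apiVersion: v1
kind: Secret
metadata:
  name: linkerd-tap-tls        # illustrative name
  namespace: linkerd
type: kubernetes.io/tls
data:
  tls.crt: {{ $cert.Cert | b64enc }}
  tls.key: {{ $cert.Key | b64enc }}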