linkerd2: Random errors: x509: certificate signed by unknown authority
Bug Report
What is the issue?
I don’t understand all the details, but I periodically see this error in different places, even though Linkerd works in general. The error appears randomly. Restarting the pods solves it, but I don’t think that’s a good workaround.
➜ linkerd top deployment/application --namespace default
Error: HTTP error, status Code [503] (unexpected API response: Error: 'x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "linkerd-tap.linkerd.svc")'
Trying to reach: 'https://10.56.79.145:8089/apis/tap.linkerd.io/v1alpha1/watch/namespaces/default/deployments/application/tap')
Usage:
linkerd top [flags] (RESOURCE)
#...
kubectl rollout restart -n linkerd deployment/linkerd-tap
# ...
linkerd top deployment/application --namespace default
# now it works, but after a while the problem returns
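For context, tap is served as an aggregated Kubernetes API (that is the /apis/tap.linkerd.io/v1alpha1 path in the error above), so the serving certificate of the linkerd-tap pods is verified against the caBundle pinned in the tap APIService registration. A rough sketch of that registration follows; the values are inferred from the error message, not copied from the cluster:

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.tap.linkerd.io
spec:
  group: tap.linkerd.io
  version: v1alpha1
  groupPriorityMinimum: 1000
  versionPriority: 100
  service:
    name: linkerd-tap          # service backing the tap pods
    namespace: linkerd
  # caBundle is the CA used to verify the certificate served by the tap pods.
  # If the running pods still present an older certificate, verification fails
  # with "x509: certificate signed by unknown authority", as shown above.
  caBundle: <base64-encoded CA certificate>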
How can it be reproduced?
Logs, error output, etc
[linkerd-tap-86c9f7cc98-p49b5 tap] 2019/09/30 14:58:25 http: TLS handshake error from 127.0.0.1:33188: remote error: tls: bad certificate
[linkerd-tap-86c9f7cc98-psztb tap] 2019/09/30 14:58:25 http: TLS handshake error from 127.0.0.1:37118: remote error: tls: bad certificate
[linkerd-tap-86c9f7cc98-psztb tap] 2019/09/30 14:58:26 http: TLS handshake error from 127.0.0.1:37198: remote error: tls: bad certificate
I didn’t find any other errors in the other Linkerd (l5d) pods.
NAME                                      READY   STATUS    RESTARTS   AGE
linkerd-controller-784c8ddfbd-6l7zv       2/2     Running   0          8h
linkerd-controller-784c8ddfbd-b67s2       2/2     Running   0          47m
linkerd-controller-784c8ddfbd-m95ll       2/2     Running   0          8h
linkerd-destination-7655c8bc7c-4zcxm      2/2     Running   0          8h
linkerd-destination-7655c8bc7c-q4jwz      2/2     Running   0          8h
linkerd-destination-7655c8bc7c-xlx9g      2/2     Running   0          8h
linkerd-grafana-86df8766f8-xlxld          2/2     Running   0          8h
linkerd-identity-59f8fbf6fc-ll597         2/2     Running   0          47m
linkerd-identity-59f8fbf6fc-wgcpj         2/2     Running   0          8h
linkerd-identity-59f8fbf6fc-z66p7         2/2     Running   0          8h
linkerd-prometheus-98c96c5d5-jc2lz        2/2     Running   0          8h
linkerd-proxy-injector-67f7db5566-9wdls   2/2     Running   0          8h
linkerd-proxy-injector-67f7db5566-hc2kv   2/2     Running   0          8h
linkerd-proxy-injector-67f7db5566-t225x   2/2     Running   0          8h
linkerd-sp-validator-c4c598c49-djhv7      2/2     Running   0          47m
linkerd-sp-validator-c4c598c49-ktmdw      2/2     Running   0          8h
linkerd-sp-validator-c4c598c49-lb7jv      2/2     Running   0          30m
linkerd-tap-86c9f7cc98-h8c2d              2/2     Running   0          7h31m
linkerd-tap-86c9f7cc98-p49b5              2/2     Running   0          7h31m
linkerd-tap-86c9f7cc98-psztb              2/2     Running   0          7h30m
linkerd-web-549f59496c-sm6p9              2/2     Running   0          47m
linkerd check output
➜ linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
is running version 19.9.3 but the latest edge version is 19.9.4
see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 19.9.3 but the latest edge version is 19.9.4
see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
Status check results are √
Environment
- Kubernetes Version: v1.12.10-eks-825e5d
- Cluster Environment: EKS
- Host OS: Amazon Linux
- Linkerd version: 19.9.3
Possible solution
Additional context
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 18 (17 by maintainers)
So it turns out that we need to make sure include is using the right scope (i.e. "$" instead of "."). The "." refers to the current scope and has been changed to {{.Values}} at the start of the template. Using "$", we make sure the tap-rbac.yaml template is included using the global scope; hence, all the variables are rendered correctly.

Also, we need to make sure the annotation is added to the pod template, not the deployment. This diff works for me:
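As a minimal sketch of that pattern (the annotation key, labels, and container fields here are illustrative, not the exact diff): the checksum of the rendered tap-rbac.yaml goes under the pod template's annotations, and the include uses "$" as its scope:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: linkerd-tap
  namespace: linkerd
spec:
  selector:
    matchLabels:
      linkerd.io/control-plane-component: tap
  template:
    metadata:
      labels:
        linkerd.io/control-plane-component: tap
      annotations:
        # Placed on the pod template (not the Deployment metadata) so that a
        # change in the rendered tap-rbac.yaml changes the pod spec and makes
        # helm upgrade roll the tap pods. "$" is the root scope; "." has been
        # rebound to .Values earlier in the template.
        checksum/tap-rbac: {{ include (print $.Template.BasePath "/tap-rbac.yaml") $ | sha256sum }}
    spec:
      containers:
      - name: tap
        image: gcr.io/linkerd-io/controller:edge-19.9.3   # illustrative
        args: ["tap"]                                      # illustrative

Because the hash changes whenever the rendered tap-rbac.yaml changes, a subsequent helm upgrade --install rolls the tap pods, which is the behaviour described below.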
To reproduce this problem, run:
With the new annotation, the tap pod will get restarted after the second upgrade --install command.

After kubectl rollout restart -n linkerd deployment/linkerd-tap
So, this is the difference:
And the new date is equal to the last chart update (LAST DEPLOYED: Tue Oct 15 12:28:58 2019), which makes me think that helm upgrade does not restart everything that must be restarted:

At this moment I can’t find any recently changed secrets:
So I see the correlation between the last deploy and the certificate issue date (Not Before), but I don’t see why the linkerd-tap.linkerd.svc certificate was changed.

I guess this can be fixed if Helm always restarts the tap pods (and maybe others).
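On the "why was the certificate changed" question: one common explanation, which fits the checksum fix above but is stated here as an assumption rather than a confirmed reading of the Linkerd chart, is that the TLS material is generated at template-render time, so every helm upgrade produces a fresh certificate while already-running pods keep serving the old one. A generic illustration using Helm's genSelfSignedCert:

{{- /* Generic illustration (not the Linkerd chart): a certificate generated at
       render time changes on every helm upgrade, so the Secret and any pinned
       caBundle are updated while running pods still serve the previous cert. */ -}}
{{- $cert := genSelfSignedCert "linkerd-tap.linkerd.svc" nil (list "linkerd-tap.linkerd.svc") 365 -}}
apiVersion: v1
kind: Secret
metadata:
  name: linkerd-tap-tls        # illustrative name
  namespace: linkerd
type: kubernetes.io/tls
data:
  tls.crt: {{ $cert.Cert | b64enc }}
  tls.key: {{ $cert.Key | b64enc }}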