calico: Failed to "KillPodSandbox" because the calico connection is unauthorized

After some period of time, Pods can no longer be created or deleted, and they show this message:

$ kubectl describe pod <name>
error killing pod: failed to "KillPodSandbox" for "9f91266a-70a9-428f-a1d6-a2ae8d5427d1" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"4657b77480472f4352e413d52e0c5d5545c675da862cc56c8e6f22d7b0577031\": plugin type=\"calico\" failed (delete): error getting ClusterInformation: connection is unauthorized: Unauthorized"

It seems to be related to the service account token policy change in Kubernetes v1.26.0: https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#manual-secret-management-for-serviceaccounts

Here is the workaround: make calico-node re-read its service account information by restarting the DaemonSet or deleting its pods.

$ kubectl rollout restart ds -n kube-system calico-node
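
If restarting the whole DaemonSet is not desirable, deleting the calico-node pods also works, since the DaemonSet recreates them with a freshly projected token. A minimal sketch, assuming the standard manifest label k8s-app=calico-node:

$ kubectl delete pod -n kube-system -l k8s-app=calico-node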

Expected Behavior

kubectl create and delete work fine.

Current Behavior

It does not work properly:

[root@m-k8s ~]# kubectl get po
NAME                                      READY   STATUS              RESTARTS      AGE
dpy-nginx-6564b9dbcc-d7jj5                0/1     ContainerCreating   0             17m
dpy-nginx-6564b9dbcc-vgjmw                0/1     ContainerCreating   0             17m
dpy-nginx-6564b9dbcc-wbr59                0/1     ContainerCreating   0             17m
nfs-client-provisioner-7596fb9c9c-gmpmn   0/1     Terminating         0             47h
nfs-client-provisioner-7596fb9c9c-jvmnm   1/1     Running             1 (46m ago)   42h
nginx-76d9fbf4fb-7xjgb                    0/1     Terminating         0             42h
nginx-76d9fbf4fb-dv48n                    1/1     Running             0             42h
nginx-76d9fbf4fb-kqp5j                    1/1     Running             0             42h
nginx-76d9fbf4fb-qrl4p                    1/1     Running             0             42h
nginx-76d9fbf4fb-wlpwd                    1/1     Running             0             42h

Possible Solution

Workaround: restart the calico-node DaemonSet or delete its pods.

OR

Possible solution: create a long-lived, Secret-based token for the service account instead of the short-lived projected token shown below, and use that Secret with the calico-node service account (related to #5712 and #6421). A sketch follows the token output below.

sh-4.4# cat /var/run/secrets/kubernetes.io/serviceaccount/token 
eyJhbGciOiJSUzI1NiIsImtpZCI6IjlpTFk5RXlJR29yb01VZjlXOGg0UGhvLWhLRGhtZnNvekdyeU0xdVlFUTAifQ.eyJhdWQiOlsiaHR0cHM6Ly9rdWJlcm5ldGVzLmRlZmF1bHQuc3ZjLmNsdXN0ZXIubG9jYWwiXSwiZXhwIjoxNzA1OTc1ODA5LCJpYXQiOjE2NzQ0Mzk4MDksImlzcyI6Imh0dHBzOi8va3ViZXJuZXRlcy5kZWZhdWx0LnN2Yy5jbHVzdGVyLmxvY2FsIiwia3ViZXJuZXRlcy5pbyI6eyJuYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsInBvZCI6eyJuYW1lIjoiY2FsaWNvLW5vZGUtOWRnZzIiLCJ1aWQiOiIxY2UwODRlYS1kNzIzLTQ5MDAtYjI1ZC00YzRhNTVmMmI0OWYifSwic2VydmljZWFjY291bnQiOnsibmFtZSI6ImNhbGljby1ub2RlIiwidWlkIjoiM2RhYmI5MmYtN2UzYy00ZTkyLWI4OTUtZmM3NzczM2RlMTBmIn0sIndhcm5hZnRlciI6MTY3NDQ0MzQxNn0sIm5iZiI6MTY3NDQzOTgwOSwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Omt1YmUtc3lzdGVtOmNhbGljby1ub2RlIn0.SC5WdggKDD-SE2ZnIfNYaMROXNvJVqqdKXdF6SCN_qrLBwmLwXbSHnQA_vkBBFHqi1qsQP2CuBx0beYUzm5VkcBt7LMZeDBHaOfDIfBvwMbzkAAMcSoqd6bnZi1mZa8Mf2ZTVEvhLOJSyb9npGAa0te6xfWAvEbTmGWTOvZaQ59y-RqJ9OfqAiYYWoEDCLpjjjG0F1-ke2_6eRx7m6Ri2Ne47WKGGURfMVvf2GAtV0xrYuI2tvA8UhivzhaPiJx56RfyVmVAnrl8qfBk0rG6J43TkPGA59R52vbvJkI_9k-kPw_OXJv35YDqgExn3i7CswGUZCX9TAGkET5mpm7u4w
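
As a minimal sketch of such a long-lived token (assuming the service account is named calico-node in kube-system; the Secret name is only illustrative), a Secret of type kubernetes.io/service-account-token can be created and Kubernetes will populate its token field, as described in the documentation linked above:

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: calico-node-long-lived-token
  namespace: kube-system
  annotations:
    kubernetes.io/service-account.name: calico-node
type: kubernetes.io/service-account-token
EOF

calico-node (or the CNI kubeconfig) would then have to be pointed at that token; the usual caveat applies that such tokens are never rotated.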

Steps to Reproduce (for bugs)

  1. Deploy native-kubernetes by vagrant-script (link)
  2. Wait for 1-2 days
  3. Deploy new deployment
[root@m-k8s ~]# k create deploy new-nginx --image=nginx --replicas=3
deployment.apps/new-nginx created
  4. Check deployment status
[root@m-k8s ~]# kubectl get po
NAME                                                       READY   STATUS              RESTARTS      AGE
new-nginx-6564b9dbcc-<hash>              0/1     ContainerCreating   0               15m
new-nginx-6564b9dbcc-<hash>              0/1     ContainerCreating   0               15m
new-nginx-6564b9dbcc-<hash>              0/1     ContainerCreating   0               15m

Context

The fix from #6218 (node/pkg/cni/token_watch.go) is already applied in the code:

const defaultCNITokenValiditySeconds = 24 * 60 * 60
const minTokenRetryDuration = 5 * time.Second
const defaultRefreshFraction = 4
func NewTokenRefresher(clientset *kubernetes.Clientset, namespace string, serviceAccountName string) *TokenRefresher {
	return NewTokenRefresherWithCustomTiming(clientset, namespace, serviceAccountName, defaultCNITokenValiditySeconds, minTokenRetryDuration, defaultRefreshFraction)
}

So I decoded the JWT in use on the calico-node; it confirmed that the token is valid for one year (365d).
JWT

sh-4.4# cat /var/run/secrets/kubernetes.io/serviceaccount/token 
eyJhbGciOiJSUzI1NiIsImtpZCI6IjlpTFk5RXlJR29yb01VZjlXOGg0UGhvLWhLRGhtZnNvekdyeU0xdVlFUTAifQ.eyJhdWQiOlsiaHR0cHM6Ly9rdWJlcm5ldGVzLmRlZmF1bHQuc3ZjLmNsdXN0ZXIubG9jYWwiXSwiZXhwIjoxNzA1OTc1ODA5LCJpYXQiOjE2NzQ0Mzk4MDksImlzcyI6Imh0dHBzOi8va3ViZXJuZXRlcy5kZWZhdWx0LnN2Yy5jbHVzdGVyLmxvY2FsIiwia3ViZXJuZXRlcy5pbyI6eyJuYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsInBvZCI6eyJuYW1lIjoiY2FsaWNvLW5vZGUtOWRnZzIiLCJ1aWQiOiIxY2UwODRlYS1kNzIzLTQ5MDAtYjI1ZC00YzRhNTVmMmI0OWYifSwic2VydmljZWFjY291bnQiOnsibmFtZSI6ImNhbGljby1ub2RlIiwidWlkIjoiM2RhYmI5MmYtN2UzYy00ZTkyLWI4OTUtZmM3NzczM2RlMTBmIn0sIndhcm5hZnRlciI6MTY3NDQ0MzQxNn0sIm5iZiI6MTY3NDQzOTgwOSwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Omt1YmUtc3lzdGVtOmNhbGljby1ub2RlIn0.SC5WdggKDD-SE2ZnIfNYaMROXNvJVqqdKXdF6SCN_qrLBwmLwXbSHnQA_vkBBFHqi1qsQP2CuBx0beYUzm5VkcBt7LMZeDBHaOfDIfBvwMbzkAAMcSoqd6bnZi1mZa8Mf2ZTVEvhLOJSyb9npGAa0te6xfWAvEbTmGWTOvZaQ59y-RqJ9OfqAiYYWoEDCLpjjjG0F1-ke2_6eRx7m6Ri2Ne47WKGGURfMVvf2GAtV0xrYuI2tvA8UhivzhaPiJx56RfyVmVAnrl8qfBk0rG6J43TkPGA59R52vbvJkI_9k-kPw_OXJv35YDqgExn3i7CswGUZCX9TAGkET5mpm7u4w

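For reference, the payload below was extracted by base64url-decoding the middle segment of the token; something like the following works (assuming jq is available on the host):

$ cut -d. -f2 /var/run/secrets/kubernetes.io/serviceaccount/token \
    | tr '_-' '/+' \
    | awk '{ while (length($0) % 4) $0 = $0 "="; print }' \
    | base64 -d | jq .
$ date -u -d @1705975809   # the "exp" claim
Tue Jan 23 02:10:09 UTC 2024
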
Decoded JWT's Payload

{
  "aud": [
    "https://kubernetes.default.svc.cluster.local"
  ],
  "exp": 1705975809,    <<<< Tue Jan 23 2024 02:10:09 GMT+0000 
  "iat": 1674439809,
  "iss": "https://kubernetes.default.svc.cluster.local",
  "kubernetes.io": {
    "namespace": "kube-system",
    "pod": {
      "name": "calico-node-9dgg2",
      "uid": "1ce084ea-d723-4900-b25d-4c4a55f2b49f"
    },
    "serviceaccount": {
      "name": "calico-node",
      "uid": "3dabb92f-7e3c-4e92-b895-fc77733de10f"
    },
    "warnafter": 1674443416
  },
  "nbf": 1674439809,
  "sub": "system:serviceaccount:kube-system:calico-node"
}

Thus the mounted token itself has not expired; this issue seems to involve a slightly different part of the authorization-verification logic in Kubernetes.
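
As a cross-check (this is an assumption for a manifest-based install): the CNI plugin authenticates with the token written into the CNI kubeconfig on each host, rather than the token mounted into the calico-node pod, so the token that actually receives the Unauthorized response can be inspected with something like:

$ grep 'token:' /etc/cni/net.d/calico-kubeconfig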


/var/log/messages on all nodes showed entries like the following when it happened.

[control-plane node]

Jan 23 09:10:35 m-k8s kubelet: E0123 09:10:35.298683    4180 server.go:299] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"
Jan 23 09:10:50 m-k8s kubelet: E0123 09:10:50.303499    4180 server.go:299] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"
Jan 23 09:11:05 m-k8s kubelet: E0123 09:11:05.308058    4180 server.go:299] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"
Jan 23 09:11:20 m-k8s kubelet: E0123 09:11:20.300704    4180 server.go:299] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"
Jan 23 09:11:35 m-k8s kubelet: E0123 09:11:35.290727    4180 server.go:299] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"
<snipped>

[worker node]

Jan 21 16:44:12 w2-k8s kubelet: E0121 16:44:12.656423    3630 server.go:299] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"
Jan 21 16:44:27 w2-k8s kubelet: E0121 16:44:27.650877    3630 server.go:299] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"

Your Environment

  • Calico version: v3.24.5, v3.25.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): native-kubernetes v1.26.0
[root@m-k8s ~]# kubectl get nodes -o wide 
NAME     STATUS   ROLES           AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
m-k8s    Ready    control-plane   2d19h   v1.26.0   192.168.1.10    <none>        CentOS Linux 7 (Core)   3.10.0-1127.19.1.el7.x86_64   containerd://1.6.10
w1-k8s   Ready    <none>          2d19h   v1.26.0   192.168.1.101   <none>        CentOS Linux 7 (Core)   3.10.0-1127.19.1.el7.x86_64   containerd://1.6.10
w2-k8s   Ready    <none>          2d19h   v1.26.0   192.168.1.102   <none>        CentOS Linux 7 (Core)   3.10.0-1127.19.1.el7.x86_64   containerd://1.6.10
w3-k8s   Ready    <none>          2d18h   v1.26.0   192.168.1.103   <none>        CentOS Linux 7 (Core)   3.10.0-1127.19.1.el7.x86_64   containerd://1.6.10
  • Operating System and version: CentOS 7.9 (3.10.0-1127.19.1.el7.x86_64)
  • Link to your project (optional):

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 7
  • Comments: 26 (5 by maintainers)

Most upvoted comments

Plus, the workaround is effective:

[root@m-k8s ~]# kubectl rollout restart ds -n kube-system calico-node
daemonset.apps/calico-node restarted
[root@m-k8s ~]# k get po -A
NAMESPACE        NAME                                        READY   STATUS      RESTARTS      AGE
default          new-nginx-d8b84d87b-jpzr9                   1/1     Running     0             21h
default          new-nginx-d8b84d87b-r245z                   1/1     Running     0             21h
default          new-nginx-d8b84d87b-xjc8k                   1/1     Running     0             21h
default          nfs-client-provisioner-7596fb9c9c-jvmnm     1/1     Running     1 (29h ago)   2d22h
default          synthetic-load-generator-554f846686-fxgms   1/1     Running     0             4h
example-hotrod   example-hotrod-6c5d878866-bbt7l             1/1     Running     0             5h2m
ingress-nginx    ingress-nginx-admission-create-bqvnp        0/1     Completed   0             5h18m
ingress-nginx    ingress-nginx-admission-patch-sdjbr         0/1     Completed   1             5h18m
ingress-nginx    ingress-nginx-controller-64f79ddbcc-7wltw   1/1     Running     0             5h16m
kube-system      calico-kube-controllers-57b57c56f-96j5s     1/1     Running     0             3d3h
kube-system      calico-node-fpmtb                           1/1     Running     0             77s
kube-system      calico-node-gmksz                           1/1     Running     0             66s
kube-system      calico-node-hzk7k                           1/1     Running     0             45s
kube-system      calico-node-zqd24                           1/1     Running     0             56s
kube-system      coredns-787d4945fb-n5z6g                    1/1     Running     0             3d3h
kube-system      coredns-787d4945fb-q6zj8                    1/1     Running     0             3d3h
kube-system      etcd-m-k8s                                  1/1     Running     0             3d3h
kube-system      kube-apiserver-m-k8s                        1/1     Running     0             3d3h
kube-system      kube-controller-manager-m-k8s               1/1     Running     0             3d3h
kube-system      kube-proxy-6wrc9                            1/1     Running     0             3d3h
kube-system      kube-proxy-drtcr                            1/1     Running     1 (27h ago)   3d2h
kube-system      kube-proxy-hmp89                            1/1     Running     0             3d3h
kube-system      kube-proxy-hnxrh                            1/1     Running     0             3d3h
kube-system      kube-scheduler-m-k8s                        1/1     Running     0             3d3h
kube-system      metrics-server-7948965fbb-56tct             1/1     Running     0             28h
metallb-system   controller-577b5bdfcc-tj6nq                 1/1     Running     0             28h
metallb-system   speaker-8szsl                               1/1     Running     0             3d3h
metallb-system   speaker-j4hsp                               1/1     Running     0             3d3h
metallb-system   speaker-pm9jj                               1/1     Running     0             3d3h
metallb-system   speaker-rg9wk                               1/1     Running     2 (27h ago)   3d2h
monitoring       jaeger-5dc997d86c-trhnb                     1/1     Running     0             4h40m
monitoring       tempo-0                                     2/2     Running     0             4h

We were able to fix this problem now. Our master node had an incorrect version, which ruined everything. An update of our master node was fortunately the solution, without any hacky workarounds. But thanks for the help - I appreciate it!

Last check k8s v1.27.2 + calico_v3.26.0 = Looking good after AGE 5D

[root@cp-k8s ~]# k get po,svc 
NAME                                          READY   STATUS    RESTARTS   AGE
pod/deploy-nginx-66df7dc8d9-6cd7l             1/1     Running   0          5d17h
pod/deploy-nginx-66df7dc8d9-77fnk             1/1     Running   0          5d17h
pod/deploy-nginx-66df7dc8d9-95mck             1/1     Running   0          5d17h
pod/deploy-nginx-66df7dc8d9-9fkzn             1/1     Running   0          5d17h
pod/deploy-nginx-66df7dc8d9-hnbh2             1/1     Running   0          5d17h
pod/deploy-nginx-66df7dc8d9-kh66b             1/1     Running   0          5d17h
pod/deploy-nginx-66df7dc8d9-q989q             1/1     Running   0          5d17h
pod/deploy-nginx-66df7dc8d9-qtvkq             1/1     Running   0          5d17h
pod/deploy-nginx-66df7dc8d9-xnvd8             1/1     Running   0          5d17h
pod/nfs-client-provisioner-597dbc5f74-7hw67   1/1     Running   0          5d18h
[root@cp-k8s ~]# k get ds -n kube-system -o yaml | grep -i image:
          image: docker.io/calico/node:v3.26.0
          image: docker.io/calico/cni:v3.26.0
          image: docker.io/calico/cni:v3.26.0
          image: docker.io/calico/node:v3.26.0
          image: registry.k8s.io/kube-proxy:v1.27.2

FYI k8s v1.27.2 + calico_v3.26.0 = Looking good after AGE 42H

[root@cp-k8s ~]# k get node 
NAME     STATUS   ROLES           AGE   VERSION
cp-k8s   Ready    control-plane   42h   v1.27.2
w1-k8s   Ready    <none>          42h   v1.27.2
w2-k8s   Ready    <none>          42h   v1.27.2
w3-k8s   Ready    <none>          42h   v1.27.2
[root@cp-k8s ~]# k get po,svc 
NAME                                          READY   STATUS              RESTARTS   AGE
pod/deploy-nginx-66df7dc8d9-8r545             1/1     Running             0          42h
pod/deploy-nginx-66df7dc8d9-bc9f6             1/1     Running             0          42h
pod/deploy-nginx-66df7dc8d9-cqfj6             1/1     Running             0          42h
pod/deploy-nginx-66df7dc8d9-fkf99             1/1     Running             0          42h
pod/deploy-nginx-66df7dc8d9-mrrl6             1/1     Running             0          42h
pod/deploy-nginx-66df7dc8d9-q6zgn             1/1     Running             0          42h

NAME                   TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)        AGE
service/deploy-nginx   LoadBalancer   10.101.73.62   192.168.1.11   80:31560/TCP   42h
service/kubernetes     ClusterIP      10.96.0.1      <none>         443/TCP        42h

I have the same problem; if you think it is clear on the master, maybe the problem can be solved on another node or worker.

Facing same issue.

We refrain from using the workaround, so are there any updates on how to get rid of this issue? How can we tackle the service account policy changes in Kubernetes v1.26 mentioned in the issue description?

I’m using - k8s v1.26.1 + calico_v3.25.0 + containerd 1.6.18

same behaviour

8m32s       Warning   FailedCreatePodSandBox   pod/hello-27927411-gk5nf                           (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "c9cf89858e821ef4eb9502deb09725cf8e88be7675d9861fa1a2d25cc03a596f": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized

cluster info

k get nodes -o wide
NAME    STATUS   ROLES                       AGE   VERSION          INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8sc1   Ready    control-plane,etcd,master   88d   v1.24.6+rke2r1   192.168.88.87   <none>        Ubuntu 22.04.1 LTS   5.15.0-58-generic   containerd://1.6.8-k3s1
k8sc2   Ready    <none>                      88d   v1.24.6+rke2r1   192.168.88.88   <none>        Ubuntu 22.04.1 LTS   5.15.0-58-generic   containerd://1.6.8-k3s1
k8sc3   Ready    <none>                      88d   v1.24.6+rke2r1   192.168.88.89   <none>        Ubuntu 22.04.1 LTS   5.15.0-58-generic   containerd://1.6.8-k3s1
k8sc4   Ready    <none>                      82d   v1.24.6+rke2r1   192.168.88.90   <none>        Ubuntu 22.04.1 LTS   5.15.0-58-generic   containerd://1.6.8-k3s1
k8sc5   Ready    <none>                      71d   v1.24.6+rke2r1   192.168.88.91   <none>        Ubuntu 22.04.1 LTS   5.15.0-58-generic   containerd://1.6.8-k3s1
k8sc6   Ready    <none>                      88d   v1.24.6+rke2r1   192.168.88.92   <none>        Ubuntu 22.04.1 LTS   5.15.0-58-generic   containerd://1.6.8-k3s1
k8sc7   Ready    <none>                      88d   v1.24.6+rke2r1   192.168.88.93   <none>        Ubuntu 22.04.1 LTS   5.15.0-58-generic   containerd://1.6.8-k3s1

@coutinhop Oh…? I am so sorry, I didn’t mean to upload this without any comment. (Did my cat push the button? Something like that… anyhow, OMG.) I have now updated everything I know so far. The trigger and the reproduction procedure are not clear yet, so I will clarify the procedure to reproduce it as soon as possible.

Thank you for letting me know about the empty issue that I uploaded.