azure-container-networking: FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox

What happened:

The CNI failed to set up the network:

Jan 29 06:41:25.617: INFO: At 2021-01-29 06:36:50 +0000 UTC - event for webserver-deployment-5575796f8f-k92dd: {kubelet 2884k8s001} Started: Started container httpd
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:08 +0000 UTC - event for webserver-deployment-5575796f8f-9jj9z: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "678fa65619f0dd4c0434d2b4e510bd99930bc62e8ea4b64ea8a5f79a73b20f6a": unexpected end of JSON input
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:10 +0000 UTC - event for webserver-deployment-5575796f8f-6rbcl: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "cb4c4d0e705d57ac9da1eae9a42e2bb421dca03484fa8e0260672425feb55f15": unexpected end of JSON input
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:10 +0000 UTC - event for webserver-deployment-5575796f8f-dtc4x: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a20980651ba231f652be02984d594aa661c613d440b8dd740d4073f2bfa4d50b": unexpected end of JSON input
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:10 +0000 UTC - event for webserver-deployment-5575796f8f-jjdlr: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "ba151015cc98dd643d424bf2eafaa71855fb4fb6b75fbb24bb4ea2616f964faf": unexpected end of JSON input
Jan 29 06:41:25.617: INFO: At 2021-01-29 06:37:10 +0000 UTC - event for webserver-deployment-5575796f8f-pst6x: {kubelet 2884k8s000} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5f40817e69457ddc2b907adfd182969bea93f24e337c84675d578aaeb17d2eb1": unexpected end of JSON input
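
For reference, "unexpected end of JSON input" is the error Go's encoding/json returns when handed empty input, which suggests the CNI plugin exited without writing a result to stdout. A minimal sketch that reproduces the message (cniResult is a hypothetical stand-in for the real CNI result type):

package main

import (
	"encoding/json"
	"fmt"
)

// cniResult loosely mirrors the JSON a CNI plugin prints on stdout;
// the exact fields don't matter for reproducing the error.
type cniResult struct {
	CNIVersion string `json:"cniVersion"`
}

func main() {
	var res cniResult
	// Empty plugin output reproduces the error seen in the events above.
	err := json.Unmarshal([]byte(""), &res)
	fmt.Println(err) // prints: unexpected end of JSON input
}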

Containerd failed to create the pod:

E0129 06:38:50.213490    4316 remote_runtime.go:251] StartContainer "1054e8a1180749dbb37097a5f985003ca3afb2036099f5df05020a5a9053a4ac" from runtime service failed: rpc error: code = Unknown desc = failed to create containerd task: hcsshim::CreateComputeSystem 1054e8a1180749dbb37097a5f985003ca3afb2036099f5df05020a5a9053a4ac: context deadline exceeded
(extra info: {"Owner":"containerd-shim-runhcs-v1.exe","SchemaVersion":{"Major":2,"Minor":1},"Container":{"GuestOs":{"HostName":"pod-adoption"},"Storage":{"Layers":[{"Id":"7b371bd9-de2e-5dc3-ae86-4d3d0378f6c6","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\113"},{"Id":"d83ac863-b163-5ceb-95c5-ac65bf7a39e3","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\112"},{"Id":"fd86a0ef-4ae8-5be6-b6bd-9e46685ad393","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\110"},{"Id":"9b944c46-c035-5757-9412-5fecc883941a","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\109"},{"Id":"9dee6370-72a7-5b65-af00-e08eeeaa1ef7","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\108"},{"Id":"77c8d266-544f-5418-b4e9-f49715060db9","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\107"},{"Id":"fa4d9b3a-212e-5b70-a9e6-bb1412fd1e53","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\106"},{"Id":"cc368b4b-70a7-5da3-aff7-b91d843db78a","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\105"},{"Id":"b7023ad6-4285-5a05-8998-e1c9c6bc6e49","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\104"},{"Id":"9c171b78-a39b-5e34-873e-04ab1b164bb7","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\79"},{"Id":"75b3f965-edc8-5443-ac0b-77f329589f05","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\78"},{"Id":"bc445432-f548-5054-b865-43ca4d3a32af","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\77"},{"Id":"e1a4a38f-7885-56a7-898b-96c1ceaf5cd5","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\76"},{"Id":"51c30dea-6447-5859-a549-0801b0fbf76c","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\75"},{"Id":"defbcf4b-cdf1-5e89-8198-0fe14410cffc","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\74"},{"Id":"b4e8aa06-2c0c-5738-80d8-af374eba772e","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\73"},{"Id":"59bb3c04-8829-58cb-ab1c-0aef684466cf","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\72"},{"Id":"a2ee96d5-66f6-5a9b-b0d4-384112c9e269","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\71"},{"Id":"2f347625-13a3-5633-badb-6868ea928ff9","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\70"},{"Id":"b03a06fa-412f-5917-a1f1-657de8a3bcc3","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\69"},{"Id":"83621030-7522-570b-ab7c-cf9c5898f09c","Path":"C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows\\snapshots\\62"}],"Path":"\\\\?\\Volume{f8cd70db-970c-40de-aaaf-c07998d35301}\\"},"MappedDirectories":[{"HostPath":"c:\\var\\lib\\kubelet\\pods\\8cacd155-a5f4-4c8a-b73c-9d3eb47469be\\containers\\pod-adoption\\c6e401e7","ContainerPath":"c:\\dev\\termination-log"},{"HostPath":"c:\\var\\lib\\kubelet\\pods\\8cacd155-a5f4-4c8a-b73c-9d3eb47469be\\volumes\\kubernetes.io~secret\\default-token-tmjhm","ContainerPath":"c:\\v
ar\\run\\secrets\\kubernetes.io\\serviceaccount","ReadOnly":true}],"Networking":{"Namespace":"36666c6e-cdde-4370-bfd1-a36911baa06a"}},"ShouldTerminateOnLastHandleClosed":true}): unknown
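
The "context deadline exceeded" part is Go's standard context timeout error: the CRI call ran out of time while hcsshim was still creating the compute system. A minimal sketch of the pattern (createComputeSystem here is a hypothetical stub standing in for the hcsshim call, and the durations are made up):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// createComputeSystem is a stand-in for the hcsshim call; it simply
// takes longer than the caller's deadline allows.
func createComputeSystem(ctx context.Context) error {
	select {
	case <-time.After(5 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	err := createComputeSystem(ctx)
	fmt.Println(err)                                      // context deadline exceeded
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // true
}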

kubelet logs: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-aks-engine-azure-master-windows-containerd/1355027170874888192/artifacts/2884k8s000/

K8s reference issue: https://github.com/kubernetes/kubernetes/issues/98574

What you expected to happen:
The pods are created and the network is set up.

How to reproduce it:
We use the following aks-engine template: https://raw.githubusercontent.com/kubernetes-sigs/windows-testing/master/job-templates/kubernetes_containerd_master.json

Orchestrator and Version (e.g. Kubernetes, Docker):
Kubernetes 1.21 alpha, containerd 1.4.3

Operating System (Linux/Windows):
Windows

Kernel (e.g. uname -a for Linux or $(Get-ItemProperty -Path "C:\windows\system32\hal.dll").VersionInfo.FileVersion for Windows): Microsoft Windows NT 10.0.17763.0

Source      Description      HotFixID   InstalledBy           InstalledOn
------      -----------      --------   -----------           -----------
2884k8s000  Update           KB4578966                        10/13/2020 12:00:00 AM
2884k8s000  Security Update  KB4577667                        10/13/2020 12:00:00 AM
2884k8s000  Security Update  KB4586793  2884k8s000\azureuser  11/12/2020 12:00:00 AM

Anything else we need to know?:

This appears to be a transient issue. The tests pass regularly: https://testgrid.k8s.io/sig-windows-releases#aks-engine-azure-windows-master-containerd

Unfortunately, the CNI logs under containerd end up in the wrong place. We made a fix to collect them, but it hasn't been rolled out yet.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 36 (16 by maintainers)

Most upvoted comments

The fix is already included in all Windows Server 2022 builds. The planned ETA for the Windows Server 2019 fix is the July 19th update, but it could slip (tentative date).

I managed to solve this; the problem for me was that I had both the calico CNI and linkerd at the same time. The main issue was that linkerd started before calico did, and even once calico came online, linkerd didn't pick that up, and pod restarts didn't resolve it.

In our case we added an init container to linkerd that waited for the calico CNI to add its conflist before linkerd could start (see the sketch after this comment). https://github.com/linkerd/linkerd2/pull/5916

In short, this error can mean many things, but I hope this is a hint for someone else. I would recommend starting by looking in /host/etc/cni/net.d/ and trying to understand what's happening there.
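
As an illustration of the workaround described above, such a wait can be a small polling loop. A minimal Go sketch, assuming the conflist is written under /host/etc/cni/net.d (the path, timeout, and helper name are assumptions, not the code from the linked PR):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// waitForConflist polls dir until a *.conflist file appears or the
// timeout elapses. Hypothetical helper, not the linkerd PR's code.
func waitForConflist(dir string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		matches, err := filepath.Glob(filepath.Join(dir, "*.conflist"))
		if err == nil && len(matches) > 0 {
			fmt.Println("found CNI config:", matches[0])
			return nil
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("no conflist in %s after %s", dir, timeout)
}

func main() {
	if err := waitForConflist("/host/etc/cni/net.d", 5*time.Minute); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}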

@pjohnst5 thanks for bearing with us. We are finally collecting the correct logs and have the repro here:

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-aks-engine-windows-containerd/1385422187535536128

kubelet logs:

Apr 23 03:26:22.164: INFO: At 2021-04-23 03:22:54 +0000 UTC - event for pod2: {kubelet 2726k8s001} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "620e1db49cbfe06e6b6db4fb887a23998523f2a8368e069efe525e96dafee7da": unexpected end of JSON input
Apr 23 03:26:22.164: INFO: At 2021-04-23 03:24:17 +0000 UTC - event for pod2: {kubelet 2726k8s001} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9a0d72403a2e3f96a16431fbbcc36bbebc340ae6ca6248d0b76ac7089fe1d418": unexpected end of JSON input
Apr 23 03:26:22.164: INFO: At 2021-04-23 03:25:43 +0000 UTC - event for pod2: {kubelet 2726k8s001} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox 

Full Node logs: https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/101357/pull-kubernetes-e2e-aks-engine-windows-containerd/1385422187535536128/artifacts/2726k8s001/

Azure cni logs around the time of the error:

2021/04/23 03:25:57 [12052] [cni] reboot time 2021-04-23 03:05:38 +0000 UTC storeLockFile mod time 2021-04-23 03:20:45.0425544 +0000 UTC
2021/04/23 03:26:00 [3812] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:00 [3812] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:00 [3812] Report plugin error
2021/04/23 03:26:00 [3812] [Azure-Utils] Get-Process -Id 3812
2021/04/23 03:26:02 [3812] Read from Lock file:10720
2021/04/23 03:26:02 [3812] [Azure-Utils] Get-Process -Id 10720|Format-List
2021/04/23 03:26:05 [3812] [CNI] Process name is azure-vnet
2021/04/23 03:26:05 [3812] Plugin store is nil
2021/04/23 03:26:05 [5084] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:05 [5084] [Azure-Utils] ver
2021/04/23 03:26:05 [5084] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:05 [5084] [cni] reboot time 2021-04-23 03:05:38 +0000 UTC storeLockFile mod time 2021-04-23 03:20:45.0425544 +0000 UTC
2021/04/23 03:26:20 [12988] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:20 [12988] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:20 [12988] Report plugin error
2021/04/23 03:26:20 [12988] [Azure-Utils] Get-Process -Id 12988
2021/04/23 03:26:21 [12956] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:21 [12956] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:21 [12956] Report plugin error
2021/04/23 03:26:21 [12052] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:21 [12052] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:21 [12956] [Azure-Utils] Get-Process -Id 12956
2021/04/23 03:26:21 [12052] Report plugin error
2021/04/23 03:26:21 [12052] [Azure-Utils] Get-Process -Id 12052
2021/04/23 03:26:27 [12988] Read from Lock file:10720
2021/04/23 03:26:27 [12988] [Azure-Utils] Get-Process -Id 10720|Format-List
2021/04/23 03:26:27 [12956] Read from Lock file:10720
2021/04/23 03:26:27 [12956] [Azure-Utils] Get-Process -Id 10720|Format-List
2021/04/23 03:26:31 [12988] [CNI] Process name is azure-vnet
2021/04/23 03:26:31 [12988] Plugin store is nil
2021/04/23 03:26:31 [12052] Read from Lock file:10720
2021/04/23 03:26:31 [12052] [Azure-Utils] Get-Process -Id 10720|Format-List
2021/04/23 03:26:32 [5084] [cni] Failed to lock store: timed out locking store.
2021/04/23 03:26:32 [5084] Failed to initialize key-value store of network plugin, err:timed out locking store.
2021/04/23 03:26:32 [5084] Report plugin error
2021/04/23 03:26:32 [5084] [Azure-Utils] Get-Process -Id 5084
2021/04/23 03:26:32 [12956] [CNI] Process name is azure-vnet
2021/04/23 03:26:32 [12956] Plugin store is nil
2021/04/23 03:26:33 [12336] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:33 [12336] [Azure-Utils] ver
2021/04/23 03:26:36 [12336] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:36 [12336] [cni] reboot time 2021-04-23 03:05:38 +0000 UTC storeLockFile mod time 2021-04-23 03:20:45.0425544 +0000 UTC
2021/04/23 03:26:36 [12052] [CNI] Process name is azure-vnet
2021/04/23 03:26:36 [12052] Plugin store is nil
2021/04/23 03:26:36 [9864] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:36 [9864] [Azure-Utils] ver
2021/04/23 03:26:37 [6260] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:37 [6260] [Azure-Utils] ver
2021/04/23 03:26:37 [9864] Formatted Boot time: 2021-04-23 03:05:38
2021/04/23 03:26:37 [9864] [cni] reboot time 2021-04-23 03:05:38 +0000 UTC storeLockFile mod time 2021-04-23 03:20:45.0425544 +0000 UTC
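
The repeated "timed out locking store" lines show several concurrent azure-vnet invocations (the distinct PIDs in brackets) all failing to take the plugin's store lock, which the lock file says is held by PID 10720. A simplified Go sketch of the exclusive-lock-file pattern the log suggests (not azure-vnet's actual implementation; names and durations are assumptions):

package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"time"
)

// tryLock creates the lock file exclusively and records the holder's
// PID in it, retrying until the timeout elapses.
func tryLock(path string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o644)
		if err == nil {
			fmt.Fprint(f, strconv.Itoa(os.Getpid()))
			return f.Close()
		}
		if time.Now().After(deadline) {
			// What each concurrent invocation reports in the log above.
			return errors.New("timed out locking store")
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	if err := tryLock("azure-vnet.lock", 10*time.Second); err != nil {
		fmt.Println(err)
	}
}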

Looks like we are not collecting them properly; I will fix that and come back once it reproduces again. Thanks for taking a look!

Any chance you could include azure-vnet.log* in those logs? Saw that you made a fix to collect them.

A new version would most likely not fix the issue. We have a hunch about where this error is coming from, but a repro cluster or azure-vnet.log files will help us confirm it.

We are using azure cni v1.2.2. We updated to this version for this fix: https://github.com/Azure/azure-container-networking/pull/755

/cc @abelhu