kubernetes: k8s reports pod as "Terminated: Error" with "Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container"

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):

“error syncing pod” “no such container”


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-03T20:44:38Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:22:08Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.1 LTS
  • Kernel (e.g. uname -a): 4.4.0-36-generic x86_64
  • Install tools: kops 1.6.0-beta.1
  • Others:

What happened:

A pod was restarted a few times (it was killed by the kernel due to running out of memory). After the last restart, the pod appears stuck in an Error state.

~$ kubectl -n master get pods dhrubacol11222268508-leaf-0-84612
NAME                                READY     STATUS    RESTARTS   AGE
dhrubacol11222268508-leaf-0-84612   0/1       Error     3          16h

kubectl describe shows a “no such container” error:

  FirstSeen     LastSeen        Count   From                                                    SubObjectPath   Type            Reason          Message
  ---------     --------        -----   ----                                                    -------------   --------        ------          -------
  1h            5s              346     kubelet, ip-172-20-39-143.us-west-2.compute.internal                    Warning         FailedSync      Error syncing pod, skipping: rpc error: code = 2 desc = Error: No such container: bedfcb3556065064b60471b8ebb73b09c1c450cfd4d07a087a45ba5d8e83dd2f
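
For reference, the events above come from a plain describe of the pod, roughly:

~$ kubectl -n master describe pod dhrubacol11222268508-leaf-0-84612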

kubectl logs shows the logs from the last iteration of the pod (the one that finished an hour ago).

Interestingly, the pod is actually running; kubectl exec lets me enter the pod.
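
For reference, these are roughly the commands used to observe that (the shell invoked inside the container is just an example):

~$ kubectl -n master logs dhrubacol11222268508-leaf-0-84612
~$ kubectl -n master exec -it dhrubacol11222268508-leaf-0-84612 -- /bin/sh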

What you expected to happen:

I expected the pod to be restarted.

How to reproduce it (as minimally and precisely as possible):

Does not appear reproducible. This has happened twice to us so far; deleting the pod by hand fixed the problem.
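
For completeness, “deleting the pod by hand” was just the usual delete:

~$ kubectl -n master delete pod dhrubacol11222268508-leaf-0-84612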

Anything else we need to know:

Note the following in the kubelet logs (more details below):

  • At 22:01:58, the pod dies; the container ID starts with bedfcb….
  • At 22:29:55, the pod dies again; the container ID starts with a102f9….
  • At 22:30:25, getPodContainerStatuses starts failing, but the container ID it complains about is from a previous iteration of the pod – the one that died at 22:01, not the one that died at 22:29.

May 10 22:01:58 ip-172-20-39-143 kubelet[9126]: I0510 22:01:58.521978    9126 kubelet.go:1842] SyncLoop (PLEG): "dhrubacol11222268508-leaf-0-84612_master(da61dd55-3551-11e7-b03c-0207e1349dc2)", event: &pleg.PodLifecycleEvent{ID:"da61dd55-3551-11e7-b03c-0207e1349dc2", Type:"ContainerDied", Data:"bedfcb3556065064b60471b8ebb73b09c1c450cfd4d07a087a45ba5d8e83dd2f"}
May 10 22:29:55 ip-172-20-39-143 kubelet[9126]: I0510 22:29:55.428792    9126 kubelet.go:1842] SyncLoop (PLEG): "dhrubacol11222268508-leaf-0-84612_master(da61dd55-3551-11e7-b03c-0207e1349dc2)", event: &pleg.PodLifecycleEvent{ID:"da61dd55-3551-11e7-b03c-0207e1349dc2", Type:"ContainerDied", Data:"a102f91130d14cd4c1040fa7957e174190cb6fe531136f939b436ff337af3082"}
May 10 22:30:03 ip-172-20-39-143 kubelet[9126]: I0510 22:30:03.368974    9126 kuberuntime_manager.go:742] checking backoff for container "leafagg" in pod "dhrubacol11222268508-leaf-0-84612_master(da61dd55-3551-11e7-b03c-0207e1349dc2)"
May 10 22:30:25 ip-172-20-39-143 kubelet[9126]: E0510 22:30:25.969445    9126 kuberuntime_manager.go:858] getPodContainerStatuses for pod "dhrubacol11222268508-leaf-0-84612_master(da61dd55-3551-11e7-b03c-0207e1349dc2)" failed: rpc error: code = 2 desc = Error: No such container: bedfcb3556065064b60471b8ebb73b09c1c450cfd4d07a087a45ba5d8e83dd2f
May 10 22:30:25 ip-172-20-39-143 kubelet[9126]: E0510 22:30:25.969466    9126 generic.go:239] PLEG: Ignoring events for pod dhrubacol11222268508-leaf-0-84612/master: rpc error: code = 2 desc = Error: No such container: bedfcb3556065064b60471b8ebb73b09c1c450cfd4d07a087a45ba5d8e83dd2f
May 10 22:30:27 ip-172-20-39-143 kubelet[9126]: E0510 22:30:27.304149    9126 kuberuntime_manager.go:858] getPodContainerStatuses for pod "dhrubacol11222268508-leaf-0-84612_master(da61dd55-3551-11e7-b03c-0207e1349dc2)" failed: rpc error: code = 2 desc = Error: No such container: bedfcb3556065064b60471b8ebb73b09c1c450cfd4d07a087a45ba5d8e83dd2f
May 10 22:30:27 ip-172-20-39-143 kubelet[9126]: E0510 22:30:27.304180    9126 generic.go:239] PLEG: Ignoring events for pod dhrubacol11222268508-leaf-0-84612/master: rpc error: code = 2 desc = Error: No such container: bedfcb3556065064b60471b8ebb73b09c1c450cfd4d07a087a45ba5d8e83dd2f
May 10 22:30:27 ip-172-20-39-143 kubelet[9126]: E0510 22:30:27.408508    9126 kuberuntime_manager.go:858] getPodContainerStatuses for pod "dhrubacol11222268508-leaf-0-84612_master(da61dd55-3551-11e7-b03c-0207e1349dc2)" failed: rpc error: code = 2 desc = Error: No such container: bedfcb3556065064b60471b8ebb73b09c1c450cfd4d07a087a45ba5d8e83dd2f

(and then the “getPodContainerStatuses… failed” messages spam the logs)

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 28
  • Comments: 62 (15 by maintainers)

Most upvoted comments

Please post to stack overflow for support questions

This appears to be a Docker issue, actually. docker ps -a lists these containers stuck in the Dead state, but the containers don’t exist. journalctl -u docker shows lots of errors like:

May 12 20:48:07 ip-10-0-114-119 dockerd[8767]: time="2017-05-12T20:48:07.051761235Z" level=error msg="Handler for GET /v1.24/containers/0a285b05d00344564e05cf9995a8a75479ff69a16eada920144f8bdb55446429/json returned error: open /var/lib/docker/overlay/ace5b2af622e39a87e76fe57077ece42616cadb10e0d4b0d2473a68a17cbcfe2/lower-id: no such file or directory"

Restarting docker fixed this, at least for now.
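
If you want to check whether a node is in this state, something along these lines should work (assuming a systemd-managed Docker; the status filter and journalctl flags are standard):

docker ps -a --filter status=dead
journalctl -u docker --since "1 hour ago" | grep lower-id
systemctl restart docker        # temporary fix, as noted above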

@philipn @bcorijn @cooper667 @thegranddesign @BradErz @greenkiwi @pavelhritonenko @andreychernih … my thesis, when I asked a bunch of you what the issue was (see https://github.com/kubernetes/kubernetes/issues/45626#issuecomment-319611102), is that part of the problem is the 4.4.65-k8s kernel. I’ve been running/testing 4.4.78-k8s (see https://github.com/kopeio/kubernetes-kernel/pull/8) for a couple of weeks and my cluster is behaving much, much better. Sorry guys, kinda got swamped and didn’t report back.

The common thread supporting this thesis is that the people having this issue are using kops 1.6, which comes with the 4.4.65-k8s kernel, i.e. AMI kope.io/k8s-1.6-debian-jessie-amd64-hvm-ebs-2017-05-02 (if you set up before the new AMI was built).

I submitted a PR to bump this AMI to 4.4.78-k8s in https://github.com/kopeio/kubernetes-kernel/pull/8, which is probably what @bcorijn, @pavelhritonenko and @cooper667 are using, and things seem to be going well.

You guys might be interested in https://github.com/kubernetes/kops/issues/2901 and https://github.com/kubernetes/kops/issues/2928.
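
If you want to confirm which kernel your nodes are actually on (short of running uname -a on each of them), kubectl exposes it via nodeInfo, e.g.:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kernelVersion}{"\n"}{end}'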

Having the same problem - a lot of containers in Dead state.

root@ip-10-51-43-214:~# kubelet --version
Kubernetes v1.6.2
root@ip-10-51-43-214:~# uname -a
Linux ip-10-51-43-214 4.4.65-k8s #1 SMP Tue May 2 15:48:24 UTC 2017 x86_64 GNU/Linux
root@ip-10-51-43-214:~# journalctl -u docker -f
-- Logs begin at Mon 2017-07-31 22:42:18 UTC. --
Aug 02 17:24:39 ip-10-51-43-214 dockerd[505]: time="2017-08-02T17:24:39.619180536Z" level=error msg="Handler for GET /v1.24/containers/d065f31293a48b300f19eaa3abc58467c802021115c20b1128f4055abfd68742/json returned error: open /var/lib/docker/overlay/fffdb06decb78d0d98853b5268f0815d1d9e04bb2449ce64897e5760e404973f/lower-id: no such file or directory"
Aug 02 17:24:39 ip-10-51-43-214 dockerd[505]: time="2017-08-02T17:24:39.640373298Z" level=error msg="Handler for GET /v1.24/containers/bf65075a9496b55346dbcac3c309cc34eae7cf91ab40803490fbdd6d6183ed32/json returned error: open /var/lib/docker/overlay/42e1f44c4292fdc540aba1820c402a4319d1da7c76a4b7123b42a47520c01696/lower-id: no such file or directory"
Aug 02 17:24:39 ip-10-51-43-214 dockerd[505]: time="2017-08-02T17:24:39.665034833Z" level=error msg="Handler for GET /v1.24/containers/244af01223b2e97b157edd9df508077bca40e37e964ce521aafaaff73611bc60/json returned error: open /var/lib/docker/overlay/16274a3631c480a5050c474bf42b83a963056e0c689aa53b3517acb205f83d96/lower-id: no such file or directory"

This is a fresh 1.6 installation.

It looks like there was a kernel panic on this node and the system rebooted:

[61852.265584] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[61852.269531] IP: [<ffffffff810aa389>] wakeup_preempt_entity.isra.62+0x9/0x50
[61852.269531] PGD af515067 PUD 60cb6067 PMD 0 
[61852.269531] Oops: 0000 [#1] SMP 
[61852.269531] Modules linked in: xt_statistic(E) binfmt_misc(E) nf_conntrack_netlink(E) ip6t_MASQUERADE(E) nf_nat_masquerade_ipv6(E) xt_conntrack(E) ip6table_nat(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6t_rpfilter(E) ip6table_filter(E) nf_nat_ipv6(E) ip6table_raw(E) ip6_tables(E) xt_set(E) iptable_raw(E) ip_set_hash_ip(E) ip_set_hash_net(E) ip_set(E) nfnetlink(E) ipip(E) tunnel4(E) ip_tunnel(E) veth(E) xt_nat(E) xt_tcpudp(E) xt_addrtype(E) ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) xt_comment(E) xt_mark(E) iptable_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) bridge(E) stp(E) llc(E) xfrm_user(E) xfrm_algo(E) overlay(E) ipt_REJECT(E) nf_reject_ipv4(E) nf_log_ipv4(E) nf_log_common(E) xt_LOG(E) xt_limit(E) xt_multiport(E) iptable_filter(E) xt_recent(E) ip_tables(E) x_tables(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) grace(E) fscache(E) sunrpc(E) intel_rapl(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) hmac(E) drbg(E) ppdev(E) ansi_cprng(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) evdev(E) parport_pc(E) 8250_fintek(E) parport(E) cirrus(E) ttm(E) drm_kms_helper(E) acpi_cpufreq(E) snd_pcsp(E) tpm_tis(E) drm(E) tpm(E) snd_pcm(E) i2c_piix4(E) processor(E) serio_raw(E) snd_timer(E) button(E) snd(E) soundcore(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) btrfs(E) xor(E) raid6_pq(E) dm_mod(E) ata_generic(E) xen_blkfront(E) ata_piix(E) libata(E) crc32c_intel(E) psmouse(E) scsi_mod(E) ixgbevf(E) fjes(E)
[61852.269531] CPU: 3 PID: 20879 Comm: exe Tainted: G            E   4.4.65-k8s #1
[61852.269531] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[61852.269531] task: ffff8801b4596400 ti: ffff88018d2d4000 task.ti: ffff88018d2d4000
[61852.269531] RIP: 0010:[<ffffffff810aa389>]  [<ffffffff810aa389>] wakeup_preempt_entity.isra.62+0x9/0x50
[61852.269531] RSP: 0018:ffff88018d2d7e00  EFLAGS: 00010086
[61852.269531] RAX: ffff8803f7fd54c0 RBX: 0000c1919b040844 RCX: 0000000000000000
[61852.269531] RDX: ffff88040fc75e30 RSI: 0000000000000000 RDI: 0000c1919b040844
[61852.269531] RBP: 0000000000000000 R08: 000000000001e32c R09: 0000000000000000
[61852.269531] R10: 0000000000001847 R11: 0000000000000000 R12: 0000000000000000
[61852.269531] R13: 0000000000000000 R14: ffff88040fc75dc0 R15: 0000000000000003
[61852.269531] FS:  00007f107b1fd700(0000) GS:ffff88040fc60000(0000) knlGS:0000000000000000
[61852.269531] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[61852.269531] CR2: 0000000000000050 CR3: 000000003642a000 CR4: 00000000001406e0
[61852.269531] Stack:
[61852.269531]  ffff8803f9ebea00 ffffffff810aa440 ffff8803f9ebea00 0000000000000000
[61852.269531]  0000000000015dc0 0000000000000000 ffffffff810b33bf ffff8801b4596400
[61852.269531]  ffff88040fc76840 ffffffff8100c291 0000000000015dc0 ffff88040fc75e30
[61852.269531] Call Trace:
[61852.269531]  [<ffffffff810aa440>] ? pick_next_entity+0x70/0x140
[61852.269531]  [<ffffffff810b33bf>] ? pick_next_task_fair+0x30f/0x4a0
[61852.269531]  [<ffffffff8100c291>] ? xen_clocksource_read+0x11/0x20
[61852.269531]  [<ffffffff8159fdcf>] ? __schedule+0xdf/0x960
[61852.269531]  [<ffffffff815a0681>] ? schedule+0x31/0x80
[61852.269531]  [<ffffffff810031cb>] ? exit_to_usermode_loop+0x6b/0xc0
[61852.269531]  [<ffffffff81003bcf>] ? syscall_return_slowpath+0x8f/0x110
[61852.269531]  [<ffffffff815a4598>] ? int_ret_from_sys_call+0x25/0x8f
[61852.269531] Code: 5b e9 3c f2 ff ff 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f7 e9 e3 fb ff ff 0f 1f 00 0f 1f 44 00 00 53 48 89 fb <48> 2b 5e 50 48 85 db 7e 2c 48 81 3e 00 04 00 00 8b 05 01 89 9a 
[61852.269531] RIP  [<ffffffff810aa389>] wakeup_preempt_entity.isra.62+0x9/0x50
[61852.269531]  RSP <ffff88018d2d7e00>
[61852.269531] CR2: 0000000000000050
[61852.269531] ---[ end trace c4ff4559f7c3464a ]---
[61852.269531] Kernel panic - not syncing: Fatal exception
[61852.269531] Shutting down cpus with NMI
[61852.269531] Kernel Offset: disabled

I found https://github.com/kubernetes/kops/issues/874, which looks related to this crash, and am trying to upgrade the kernel to at least 4.4.70. But it is still concerning that Kubernetes did not recover from this state correctly.

Restarting Docker does not help; the pods are still in the “Terminated” state and Kubernetes is not trying to restart them.

I ended up running the following command which fixed the problem:

docker ps -a|grep 'Dead'|awk '{ print $1 }'|xargs docker rm
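
A slightly tidier equivalent uses Docker’s built-in status filter instead of grepping the table output (xargs -r is a GNU extension that skips docker rm when nothing matches):

docker ps -aq --filter status=dead | xargs -r docker rm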

So far it’s been 12 days on 1.7 without seeing this issue, so I’m quite confident it is fixed for me. Not sure whether it was the Kube upgrade or the new AMI, however…

@aabed @armandocerna @bcorijn @philipn @tudor I have a few questions:

  1. Do you sometimes find that your node has been terminated and replaced in the ASG?
  2. What version of kops and kubernetes are you using?
  3. Is your cluster a fresh 1.6.x installation or an upgrade from 1.5.x?

Not sure if this helps, but I am only getting this issue with this Jupyter image. I haven’t dug in much to figure out why yet.

Seeing this in our cluster as well. Quite an annoying bug, as it will indeed get a deployment stuck with fewer replicas than expected until I manually intervene. Did you find any more permanent solution/workaround, @tudor?