containerd: container not starting on a few nodes: "standard_init_linux.go:219: exec user process caused: exec format error"

All our k8s nodes have the same architecture and run the same Linux image. We are trying to bring up the calico-node DaemonSet, but the pod fails to run on some nodes, and kubectl logs shows "standard_init_linux.go:219: exec user process caused: exec format error" for the failing pods.

Where can I get more information about the failure? Is it an image issue or a runtime issue?

Describe the results you received: on some nodes the pod went into CrashLoopBackOff.

Describe the results you expected: pods should be running on all nodes.
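One way to narrow down whether this is an image (architecture) problem or a node-local runtime problem, assuming ctr is available on the node and kubelet uses the default k8s.io containerd namespace:

# On an affected node: what architecture is the node, and for which
# platforms does containerd think the calico image was pulled?
uname -m
sudo ctr -n k8s.io images ls | grep calico

# If the PLATFORMS column matches the node (e.g. linux/amd64 on x86_64),
# the image architecture is fine and the error points at corrupted layer
# content on that node rather than a wrong-architecture image.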

What version of containerd are you using:

$ containerd --version
v1.4.4


# runc -version
runc version 1.0.0-rc93
commit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
spec: 1.0.2-dev
go: go1.13.15
libseccomp: 2.4.1



$ uname -a
Linux .... x86_64 x86_64 x86_64 GNU/Linux

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 4
  • Comments: 63 (28 by maintainers)


Most upvoted comments

I also encountered the same issue, but I was able to resolve it by following these instructions: I deployed one pod at a time, with a 15-second interval between each deployment, and the pods started up normally.
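A rough sketch of that workaround, assuming the workloads are applied from individual manifest files (the filenames here are placeholders):

# Apply manifests one at a time so no node has to pull and unpack
# many images concurrently.
for manifest in calico-node.yaml workload-a.yaml workload-b.yaml; do
  kubectl apply -f "$manifest"
  sleep 15   # let the current pull/unpack settle before starting the next one
done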

Running into the same issue after a dirty host reboot (power event). This is my workaround currently:

# corrupt cached images (exec /app/cmd/controller/controller: exec format error)
1. kubectl scale -n cert-manager deployment cert-manager --replicas=0
2. ssh into bad node
3. sudo crictl images
4. sudo crictl rmi 8eaca4249b016   (the IMAGE ID of the corrupt image from step 3)
5. sudo crictl rmi --prune
6. kubectl scale -n cert-manager deployment cert-manager --replicas=1

Would love to identify a fix for this too!

Here is the meta.db file

meta.db.zip

@jasonaliyetti Thanks! The information is really helpful. I will file a PR to fix this issue.

@jasonaliyetti

"…seem to have a gap in log collection around this as well due to whatever event occurred and will need to dive into why things seem to panic on the system."

The host did restart, right? The last command can show whether it rebooted, or you can use uptime. Would you mind comparing the output of last or uptime with the create time of the damaged snapshot? If the node did reboot, and the reboot happened after the snapshot was written, I think we lost the page cache 😞
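A minimal way to do that comparison on an affected node, assuming the default overlayfs snapshotter and the k8s.io namespace (exact ctr output varies by version, and the snapshot key below is a placeholder):

# When did the node last (re)boot?
last reboot | head -n 3
uptime -s

# When was the suspect snapshot created? Take the key from the listing;
# the info output includes Created/Updated timestamps.
sudo ctr -n k8s.io snapshots ls
sudo ctr -n k8s.io snapshots info <SNAPSHOT-KEY>

# If the snapshot was written shortly before the reboot, the unflushed
# page-cache theory fits.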

It’s still odd to me that I didn’t hit this with dockerd, but at least it makes a little more sense.

For the unpack logic, docker does the same thing. However, docker limits the total number of layers being downloaded at the same time:

dockerd -h | grep pull
      --max-concurrent-downloads int            Set the max concurrent downloads for each pull (default 3)
      --max-download-attempts int               Set the max download attempts for each pull (default 5)

By default, only three layers are downloaded at any given moment.
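For reference, the same limits can also be set persistently in /etc/docker/daemon.json instead of via flags (the values shown here are docker's defaults):

{
  "max-concurrent-downloads": 3,
  "max-download-attempts": 5
}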

containerd, on the other hand, only limits the number of layers being downloaded within a single image pull request. As a result, containerd puts more I/O pressure on the node when there are many concurrent pull requests for different images that don't share base layers.

https://github.com/containerd/containerd/blob/97480afdac09c947d48f5e3a134db86c78f4bfa6/pkg/cri/config/config.go#L299

I think we can align the MaxConcurrentDownloads scope with docker's and sync each file during unpack.
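For context, that option is exposed through the CRI plugin section of the containerd config, and today it only caps concurrent layer downloads within a single image pull, which is the scope gap described above. A sketch of where it lives in a containerd 1.4 config.toml:

# /etc/containerd/config.toml
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  # Caps concurrent layer downloads per image pull request, not globally.
  max_concurrent_downloads = 3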

@fuweid We did not see this issue occur after moving back to dockerd on Friday. Similar mix of workload, traffic, and instance types. When we run with containerd we see this issue occur every couple of hours.

I will try to reproduce this in the cluster again today, do a more in-depth check of the snapshots, and get the data you requested. From what I had captured, there were no completely empty directories, just empty files, but I did not check the layer info.

The containerd config is whatever is configured as the default on the EKS optimized AMI (see here). We aren't customizing it further.

I’ll update this issue once I have the requested info.

@dnwe sorry for the late reply. I have investigated this but have no clue yet. There are two cases where the snapshots would need to be regenerated:

  • The deleted image A is the base image for an image B that is still in use. Since the snapshots of image B are children of the snapshots of image A, deleting and re-pulling image A will not help until you also remove image B.
  • The deleted image A is still held by leases. containerd uses leases to prevent content and snapshot data from being deleted by GC. The default lease expiry is 24 hours, and leases only leak when containerd restarts in the middle of pulling an image. In this case you should clean up the leases (use ctr -n k8s.io leases ls to check; see the sketch below) or wait for them to expire.
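A hedged sketch of the lease check and cleanup (confirm the exact subcommand names with ctr leases --help on your version; the lease ID below is a placeholder):

# List leases in the CRI namespace; leftovers from an interrupted pull show up here.
sudo ctr -n k8s.io leases ls

# Remove a stale lease by ID, or simply wait out the 24-hour default expiry.
sudo ctr -n k8s.io leases rm <LEASE-ID>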

But I still think the root cause is an unexpected reboot without flushing the dirty pages. Would you mind comparing the output of last or uptime with the create time of the snapshot? Thanks.