containerd: container not starting on a few nodes: "standard_init_linux.go:219: exec user process caused: exec format error"
All our k8s nodes have the same architecture and the same Linux image. We are trying to bring up the calico-node DaemonSet, but the pods fail to run on some nodes, and `kubectl logs` shows "standard_init_linux.go:219: exec user process caused: exec format error" for the failing pods.
Where can I get more information about the failure? Is it an image issue or a runtime issue?
Describe the results you received: on some nodes the pod went into CrashLoopBackOff.
Describe the results you expected: pods should be running on all nodes.
What version of containerd are you using:
$ containerd --version
v1.4.4
# runc -version
runc version 1.0.0-rc93
commit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
spec: 1.0.2-dev
go: go1.13.15
libseccomp: 2.4.1
$ uname -a
Linux .... x86_64 x86_64 x86_64 GNU/Linux
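Not from the original report, but a hedged triage sketch: "exec format error" usually means the kernel was handed a wrong-architecture, truncated, or zero-byte binary, so one way to narrow down image-vs-runtime is to compare architectures and inspect the unpacked entrypoint (the image ref and rootfs path below are illustrative):

```sh
# Node architecture
uname -m

# Image platforms as containerd sees them (PLATFORMS column)
ctr -n k8s.io images ls | grep calico

# Inspect the entrypoint binary in the container's unpacked rootfs;
# a damaged layer often shows an empty file or "data" rather than
# "ELF 64-bit LSB executable, x86-64, ..."
file /path/to/rootfs/usr/bin/calico-node
```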
About this issue
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 63 (28 by maintainers)
Commits related to this issue
- cmd/ctr/commands/images: support usage subcommand The `ctr image usage` can display the usage of snapshots with a given image ref. It's easy for user to get chain snapshot IDs and unpack usage. And a... — committed to fuweid/containerd by fuweid a year ago
- cmd/ctr/commands/images: support usage subcommand The `ctr image usage` can display the usage of snapshots with a given image ref. It's easy for user to get chain snapshot IDs and unpack usage. And a... — committed to jsturtevant/containerd by fuweid a year ago
- *: introduce image_pull_with_sync_fs in CRI It's to ensure the data integrity during unexpected power failure. Background: Since release 1.3, in Linux system, containerD unpacks and writes files in... — committed to fuweid/containerd by fuweid 7 months ago
I also encountered the same issue, but I was able to resolve it by following these instructions. I deployed one pod at a time with a 15-second interval between each deployment, and the pods started up normally.
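A hedged sketch of scripting that workaround (the `kube-system` namespace and `k8s-app=calico-node` label are assumptions based on a typical Calico install): delete the failing pods one at a time so the DaemonSet controller recreates them with the image pulls staggered.

```sh
# Adjust the namespace and label selector to match your install.
for pod in $(kubectl -n kube-system get pods -l k8s-app=calico-node \
    --field-selector=status.phase!=Running -o name); do
  kubectl -n kube-system delete "$pod"  # DaemonSet recreates the pod
  sleep 15                              # stagger the image pulls
done
```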
Here is the meta.db file
meta.db.zip
@jasonaliyetti Thanks! The information is really helpful. I will file a PR to fix this issue.
@jasonaliyetti The host did restart, right? The `last` command can show whether it rebooted or not, or you can use `uptime`. Could you check the `last` or `uptime` output against the create-time of the damaged snapshot? If it did reboot, and the reboot happened after the snapshot was created, I think we lost the page cache 😞

For the unpack logic, docker does the same thing. However, docker limits the total number of layers being downloaded at the same time.
By default, only three layers are downloaded at any one moment. But containerd only limits the number of layers downloaded within a single image-pull request; it creates more IO pressure when there are many concurrent pull requests for different images that don't share base layers.
https://github.com/containerd/containerd/blob/97480afdac09c947d48f5e3a134db86c78f4bfa6/pkg/cri/config/config.go#L299
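For reference, that option surfaces in the CRI plugin section of `/etc/containerd/config.toml`; a minimal sketch (the default is 3, and the limit applies per image pull, not across all concurrent pulls):

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  # Maximum layers fetched in parallel for a single image pull.
  # Unlike dockerd's MaxConcurrentDownloads, this is not a global
  # cap across simultaneous pulls of different images.
  max_concurrent_downloads = 3
```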
I think we can align the `MaxConcurrentDownloads` scope with docker and sync each file during unpack.

@fuweid We did not see this issue occur after moving back to dockerd on Friday, with a similar mix of workload, traffic, and instance types. When we run with containerd we see this issue occur every couple of hours.
I will try to reproduce in this cluster again today, do a more in-depth check of the snapshots, and get the data you requested. From what I had captured, there were no completely empty directories, just empty files, but I did not check the layer info.
The containerd config is whatever is configured as the default on the EKS-optimized AMI (see here). We aren’t customizing it further.
I’ll update this issue once I have the requested info.
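For anyone repeating that check, a sketch assuming the default overlayfs snapshotter root:

```sh
# Zero-byte regular files under the unpacked snapshots would point at
# layer content that never made it to disk (default root assumed).
find /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots \
  -type f -size 0 -print | head -n 20
```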
@dnwe sorry for the late reply. I have investigated this but have no clue yet. There are two cases about regenerating the snapshots: check the active leases with `ctr -n k8s.io leases ls`, or wait for the expiry time.

But I still consider that the root cause is an unexpected reboot without flushing the dirty pages. Could you check the `last` or `uptime` output against the create-time of the snapshot? Thanks.
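A minimal sketch of that check, assuming you can identify the damaged snapshot's key (e.g. via `ctr -n k8s.io snapshots ls`):

```sh
# Reboot history vs. snapshot creation time
last reboot | head -n 5
uptime

# Created/Updated timestamps for a specific snapshot (key illustrative)
ctr -n k8s.io snapshots info <snapshot-key>
```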