podman: race condition yielding "Cannot get exit code: died not found: unable to find event"
/kind bug
Description
xref: sister report against FCOS: https://github.com/coreos/fedora-coreos-tracker/issues/966
In Fedora CoreOS we have a test that exercises podman with various options. It runs in a tight loop, and I believe it has exposed a race condition in which the following error is displayed:
ERRO[0000] Cannot get exit code: died not found: unable to find event
and the exit code of the podman run is 127.
Steps to reproduce the issue:
Currently we’re only seeing this on AWS aarch64 FCOS instances. Once you have access to an instance:
- build an echo container:
tmpdir=$(mktemp -d); cd $tmpdir; echo -e "FROM scratch\nCOPY . /" > Dockerfile;
b=$(which echo); libs=$(sudo ldd $b | grep -o /lib'[^ ]*' | sort -u);
sudo rsync -av --relative --copy-links $b $libs ./;
sudo podman build --network host --layers=false -t localhost/echo .
- run this script which loops until one of the commands fails:
$ cat /tmp/script.sh
#!/bin/bash
set -eux -o pipefail
while true; do
    sudo podman run --net=none --rm --memory=128m --memory-swap=128m echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --memory-reservation=10m echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpu-shares=100 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpu-period=1000 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpuset-cpus=0 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpuset-mems=0 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpu-quota=1000 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --blkio-weight=10 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --memory=128m echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --shm-size=1m echo echo 1 > /tmp/output.txt
done
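To record which iteration fails and with what exit status, rather than relying on `set -e` alone, a small wrapper along these lines could be used. This is only a sketch: `run_until_fail` is a hypothetical helper name, not something from podman or the FCOS test suite.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original report): run a command
# repeatedly and report the first failing iteration and its exit status.
# Usage: run_until_fail MAX_ITERATIONS COMMAND [ARGS...]
run_until_fail() {
    local max i rc
    max=$1; shift
    i=1
    while [ "$i" -le "$max" ]; do
        rc=0
        # Discard stdout only; stderr is kept so error lines stay visible.
        "$@" > /dev/null || rc=$?
        if [ "$rc" -ne 0 ]; then
            echo "iteration $i failed with exit code $rc"
            return "$rc"
        fi
        i=$((i + 1))
    done
    echo "no failure in $max iterations"
}
```

For example, `run_until_fail 1000 sudo podman run --net=none --rm --shm-size=1m echo echo 1` would stop at the first failing run and print its exit code (127 in the failure described above).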
Describe the results you received: The error shown above; the podman run exits with code 127.
Describe the results you expected: No Error
Additional information you deem important (e.g. issue happens only occasionally):
Output of podman version:
$ podman version
Version: 3.3.1
API Version: 3.3.1
Go Version: go1.16.6
Built: Mon Aug 30 20:45:47 2021
OS/Arch: linux/arm64
Output of podman info --debug:
$ sudo podman info --debug
host:
  arch: arm64
  buildahVersion: 1.22.3
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.29-2.fc34.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.29, commit: '
  cpus: 4
  distribution:
    distribution: fedora
    version: "34"
  eventLogger: journald
  hostname: ip-172-31-81-157
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.13.16-200.fc34.aarch64
  linkmode: dynamic
  memFree: 7625285632
  memTotal: 8154873856
  ociRuntime:
    name: crun
    package: crun-1.0-1.fc34.aarch64
    path: /usr/bin/crun
    version: |-
      crun version 1.0
      commit: 139dc6971e2f1d931af520188763e984d6cdfbf8
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.12-2.fc34.aarch64
    version: |-
      slirp4netns version 1.1.12
      commit: 7a104a101aa3278a2152351a082a6df71f57c9a3
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.0
  swapFree: 0
  swapTotal: 0
  uptime: 38m 7.93s
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageStore:
    number: 1
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 3.3.1
  Built: 1630356347
  BuiltTime: Mon Aug 30 20:45:47 2021
  GitCommit: ""
  GoVersion: go1.16.6
  OsArch: linux/arm64
  Version: 3.3.1
Package info (e.g. output of rpm -q podman or apt list podman):
podman-3.3.1-1.fc34.aarch64
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)
No
Additional environment details (AWS, VirtualBox, physical, etc.):
AWS aarch64 ami-0d04187158a93719f Fedora CoreOS 34.20210917.20.0 on c6g.xlarge instance type.
About this issue
- State: closed
- Created 3 years ago
- Comments: 20 (20 by maintainers)
Commits related to this issue
- Add a backoff and retries to retrieving exited event There's a potential race around extremely short-running containers and events with journald. Events may not be written for some time (small, but a... — committed to mheon/libpod by mheon 3 years ago
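The approach named in the commit message, retrying with a backoff, can be illustrated in shell as below. This is a generic sketch of the technique, not the code from the commit (which lives in podman's Go sources); `retry_with_backoff`, its argument order, and the doubling delay are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical illustration of retry-with-backoff; the actual fix may use
# different delays and attempt counts.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
    local attempts delay attempt
    attempts=$1; delay=$2; shift 2
    attempt=1
    while :; do
        # Succeed as soon as the command succeeds.
        "$@" && return 0
        # Give up once the attempt budget is spent.
        [ "$attempt" -ge "$attempts" ] && return 1
        sleep "$delay"
        # Double the wait before the next attempt.
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}
```

With these assumptions, `retry_with_backoff 5 1 lookup_exit_event` would try a (hypothetical) `lookup_exit_event` command up to five times, sleeping 1, 2, 4, and 8 seconds between attempts.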
@dustymabe If you want to fix this race for now in your tests, this should work: