podman: race condition yielding "Cannot get exit code: died not found: unable to find event"

/kind bug

Description

xref: sister report against FCOS: https://github.com/coreos/fedora-coreos-tracker/issues/966

In Fedora CoreOS we have a test that exercises various invocations of podman with different options. It runs in a tight loop, and I believe it has exposed a race condition in which the following error is displayed:

ERRO[0000] Cannot get exit code: died not found: unable to find event

and the exit code of the podman run is 127.
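For context, exit code 127 is the conventional POSIX shell status for "command not found"; here it is returned by podman itself when it cannot recover the container's real exit code from the journald event. The conventional meaning can be demonstrated without podman (this snippet is illustrative only; the command name is made up):

```shell
# POSIX shells return 127 when the command to execute cannot be found.
sh -c 'definitely-not-a-real-command-xyz' 2>/dev/null
echo "exit code: $?"   # prints "exit code: 127"
```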

Steps to reproduce the issue:

Currently we’re only seeing this on AWS aarch64 FCOS instances. Once you have access to an instance:

  • build echo container
tmpdir=$(mktemp -d) && cd $tmpdir
echo -e "FROM scratch\nCOPY . /" > Dockerfile
b=$(which echo)
libs=$(sudo ldd $b | grep -o /lib'[^ ]*' | sort -u)
sudo rsync -av --relative --copy-links $b $libs ./
sudo podman build --network host --layers=false -t localhost/echo .
  • run this script which loops until one of the commands fails:
$ cat /tmp/script.sh 
#!/bin/bash
set -eux -o pipefail
while true; do
    sudo podman run --net=none --rm --memory=128m --memory-swap=128m echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --memory-reservation=10m echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpu-shares=100 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpu-period=1000 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpuset-cpus=0 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpuset-mems=0 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --cpu-quota=1000 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --blkio-weight=10 echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --memory=128m echo echo 1 > /tmp/output.txt
    sudo podman run --net=none --rm --shm-size=1m echo echo 1 > /tmp/output.txt
done
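To make the failure easier to spot in logs, each command in the loop above could be wrapped in a small helper that flags the suspicious exit code. This `run_and_check` function is a hypothetical addition for diagnosis, not part of the FCOS test:

```shell
#!/bin/bash
# Hypothetical wrapper: runs the given command and flags exit code 127,
# which is what podman run returns when this race is hit.
run_and_check() {
    "$@"
    local rc=$?
    if [ "$rc" -eq 127 ]; then
        echo "race suspected (exit 127): $*" >&2
    fi
    return "$rc"
}

# Usage, with the same commands as the loop above, e.g.:
# run_and_check sudo podman run --net=none --rm --memory=128m echo echo 1 > /tmp/output.txt
```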

Describe the results you received: The error above is printed intermittently and the failing podman run exits with code 127, which aborts the loop.

Describe the results you expected: No error; every podman run exits 0.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

$ podman version
Version:      3.3.1
API Version:  3.3.1
Go Version:   go1.16.6
Built:        Mon Aug 30 20:45:47 2021
OS/Arch:      linux/arm64

Output of podman info --debug:

$ sudo podman info --debug                                                                                                                                                                                                  
host:
  arch: arm64
  buildahVersion: 1.22.3
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.29-2.fc34.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.29, commit: '
  cpus: 4
  distribution:
    distribution: fedora
    version: "34"
  eventLogger: journald
  hostname: ip-172-31-81-157
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.13.16-200.fc34.aarch64
  linkmode: dynamic
  memFree: 7625285632
  memTotal: 8154873856
  ociRuntime:
    name: crun
    package: crun-1.0-1.fc34.aarch64
    path: /usr/bin/crun
    version: |-
      crun version 1.0
      commit: 139dc6971e2f1d931af520188763e984d6cdfbf8
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.12-2.fc34.aarch64
    version: |-
      slirp4netns version 1.1.12
      commit: 7a104a101aa3278a2152351a082a6df71f57c9a3
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.0
  swapFree: 0
  swapTotal: 0
  uptime: 38m 7.93s
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageStore:
    number: 1
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 3.3.1
  Built: 1630356347
  BuiltTime: Mon Aug 30 20:45:47 2021
  GitCommit: ""
  GoVersion: go1.16.6
  OsArch: linux/arm64
  Version: 3.3.1

Package info (e.g. output of rpm -q podman or apt list podman):

podman-3.3.1-1.fc34.aarch64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/master/troubleshooting.md)

No

Additional environment details (AWS, VirtualBox, physical, etc.):

AWS aarch64 ami-0d04187158a93719f Fedora CoreOS 34.20210917.20.0 on c6g.xlarge instance type.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

@dustymabe If you want to fix this race for now in your tests, this should work:

...
sudo podman run --net=none --memory=128m --memory-swap=128m echo echo 1 > /tmp/output.txt
sudo podman rm -f -l
...