podman: Error forwarding signal 18 (sometimes 15) to container ...
Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)
/kind bug
Description
This is distilled down from a larger example; it reproduces with nothing more than coreutils xargs / timeout + sleep.
This is a pretty reliable example:
$ yes | head -10 | xargs --replace -P5 timeout 2 podman run --rm --init ubuntu:focal sleep 5
ERRO[0000] container not running
ERRO[0002] Error forwarding signal 18 to container d9ae2ff8a421a835771930f9743210c994e2a162df6d4099eded6f5609048c6c: error sending signal to container d9ae2ff8a421a835771930f9743210c994e2a162df6d4099eded6f5609048c6c: `/usr/lib/cri-o-runc/sbin/runc kill d9ae2ff8a421a835771930f9743210c994e2a162df6d4099eded6f5609048c6c 18` failed: exit status 1
ERRO[0000] container not running
ERRO[0002] Error forwarding signal 18 to container 69b4770e164abce77ecb0c0105414ed59a274296d13703eeaebc00b5f7873283: error sending signal to container 69b4770e164abce77ecb0c0105414ed59a274296d13703eeaebc00b5f7873283: `/usr/lib/cri-o-runc/sbin/runc kill 69b4770e164abce77ecb0c0105414ed59a274296d13703eeaebc00b5f7873283 18` failed: exit status 1
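For context on the two signals: when the limit expires, GNU timeout sends its child the termination signal (SIGTERM by default) and then a SIGCONT in case the child was stopped, and podman's signal proxy forwards both into the container. A rough hand-rolled equivalent of what timeout does (illustration only, not the exact coreutils implementation):

podman run --rm --init ubuntu:focal sleep 5 &
pid=$!
sleep 2
kill -TERM "$pid"   # forwarded into the container as signal 15
kill -CONT "$pid"   # forwarded as signal 18, racing with the container's exit
wait "$pid"         # collect the exit status

If the container exits from the SIGTERM before the SIGCONT forward lands, the runc kill fails with the errors above.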
Occasionally I’ll get output like this instead:
+ timeout 2 podman run --rm runner-image:latest python3 -c 'import time; time.sleep(5)'
ERRO[0002] Error forwarding signal 18 to container 473036e98db1884e7a6baac667a135db7e5176b9c34db0c885b6bc10efd6fd11: can only kill running containers. 473036e98db1884e7a6baac667a135db7e5176b9c34db0c885b6bc10efd6fd11 is in state stopped: container state improper
(runner-image is this one)
My actual use case, however, ends like this:
time="2020-10-19T15:07:21-07:00" level=error msg="container not running"
time="2020-10-19T15:07:21-07:00" level=error msg="Error forwarding signal 15 to container 31e767a031b7cd086dbda154b9df5adaff694fd111e7feedee748d91c7eedcdb: error sending signal to container 31e767a031b7cd086dbda154b9df5adaff694fd111e7feedee748d91c7eedcdb: `/usr/lib/cri-o-runc/sbin/runc kill 31e767a031b7cd086dbda154b9df5adaff694fd111e7feedee748d91c7eedcdb 15` failed: exit status 1"
With -i, it occasionally deadlocks, breaking most podman commands (such as podman ps for that user). The only way I’ve been able to recover from the deadlock is to kill -9 the stuck podman process.
+ timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'
(hung forever)
strace of podman ps shows it stuck in a futex wait loop:
$ strace podman ps
...
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL
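Since podman is a Go binary, those futex calls are just the Go runtime parking its threads, so strace can't show which goroutine is stuck. A goroutine dump is more informative: sending SIGQUIT makes the Go runtime print every goroutine's stack to stderr before exiting, so unlike kill -9 it both recovers the deadlock and leaves a trace to debug (this assumes a single hung podman process):

$ kill -QUIT "$(pgrep -xu "$USER" podman | head -1)"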
The signals in question are 15 (SIGTERM) and 18 (SIGCONT).
Steps to reproduce the issue:
- See commands above
Describe the results you received:
See output above
Describe the results you expected:
I expect things to exit gracefully and not deadlock.
Additional information you deem important (e.g. issue happens only occasionally):
Output of podman version:
$ podman version
Version:      2.1.1
API Version:  2.0.0
Go Version:   go1.15.2
Built:        Wed Dec 31 16:00:00 1969
OS/Arch:      linux/amd64
Output of podman info --debug:
$ podman info --debug
host:
  arch: amd64
  buildahVersion: 1.16.1
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.20, commit: '
  cpus: 5
  distribution:
    distribution: ubuntu
    version: "20.04"
  eventLogger: journald
  hostname: babibox
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.4.0-51-generic
  linkmode: dynamic
  memFree: 5532704768
  memTotal: 8348299264
  ociRuntime:
    name: runc
    package: 'cri-o-runc: /usr/lib/cri-o-runc/sbin/runc'
    path: /usr/lib/cri-o-runc/sbin/runc
    version: 'runc version spec: 1.0.2-dev'
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  rootless: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.1.4
      commit: unknown
      libslirp: 4.3.1-git
      SLIRP_CONFIG_VERSION_MAX: 3
  swapFree: 1964396544
  swapTotal: 1964396544
  uptime: 5h 13m 0.85s (Approximately 0.21 days)
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/asottile/.config/containers/storage.conf
  containerStore:
    number: 2
    paused: 0
    running: 2
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: 'fuse-overlayfs: /usr/bin/fuse-overlayfs'
      Version: |-
        fusermount3 version: 3.9.0
        fuse-overlayfs: version 1.1.0
        FUSE library version 3.9.0
        using FUSE kernel interface version 7.31
  graphRoot: /home/asottile/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 23
  runRoot: /run/user/1000/containers
  volumePath: /home/asottile/.local/share/containers/storage/volumes
version:
  APIVersion: 2.0.0
  Built: 0
  BuiltTime: Wed Dec 31 16:00:00 1969
  GitCommit: ""
  GoVersion: go1.15.2
  OsArch: linux/amd64
  Version: 2.1.1
Package info (e.g. output of rpm -q podman or apt list podman):
$ apt list podman
Listing... Done
podman/unknown,now 2.1.1~2 amd64 [installed]
podman/unknown 2.1.1~2 arm64
podman/unknown 2.1.1~2 armhf
podman/unknown 2.1.1~2 s390x
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?
Yes
Additional environment details (AWS, VirtualBox, physical, etc.):
VirtualBox + AWS
About this issue
- State: closed
- Comments: 63 (52 by maintainers)
crun returns that error when the container has already exited.
I think a fully loaded machine just makes it easier to trigger the race where Podman thinks the container is running and sends a signal.
On such failures, Podman should probably ask the runtime whether the container is still alive, and ignore the error when it has already terminated.
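A minimal sketch of that re-check in terms of the runc CLI (the real fix would live in Podman's signal-forwarding code, but the logic is the same; $ctr, $sig, and the jq call are illustrative):

if ! runc kill "$ctr" "$sig"; then
    status=$(runc state "$ctr" | jq -r .status)
    if [ "$status" = running ]; then
        echo "error forwarding signal $sig to container $ctr" >&2
    fi   # otherwise the container already exited: expected race, debug-level at most
fi

runc state prints the container's OCI state as JSON, including a status field, so a kill that failed because the container already stopped can be told apart from a real failure.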
This is still an issue.
Podman v2.2.1
I’m hitting this bug as well.
The “Error forwarding signal” part definitely reproduces.
A first step is probably to drop that down to an info-level log when it’s caused by the container being in a bad state; that will at least stop spamming the terminal with errors.