podman: Error forwarding signal 18 (sometimes 15) to container ...

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

This is distilled down from a larger example; it reproduces with nothing more than coreutils xargs / timeout plus sleep.

This is a pretty reliable example:

$ yes | head -10 | xargs --replace -P5 timeout 2 podman run --rm --init ubuntu:focal sleep 5
ERRO[0000] container not running                        
ERRO[0002] Error forwarding signal 18 to container d9ae2ff8a421a835771930f9743210c994e2a162df6d4099eded6f5609048c6c: error sending signal to container d9ae2ff8a421a835771930f9743210c994e2a162df6d4099eded6f5609048c6c: `/usr/lib/cri-o-runc/sbin/runc kill d9ae2ff8a421a835771930f9743210c994e2a162df6d4099eded6f5609048c6c 18` failed: exit status 1 
ERRO[0000] container not running                        
ERRO[0002] Error forwarding signal 18 to container 69b4770e164abce77ecb0c0105414ed59a274296d13703eeaebc00b5f7873283: error sending signal to container 69b4770e164abce77ecb0c0105414ed59a274296d13703eeaebc00b5f7873283: `/usr/lib/cri-o-runc/sbin/runc kill 69b4770e164abce77ecb0c0105414ed59a274296d13703eeaebc00b5f7873283 18` failed: exit status 1 

Occasionally I’ll get output like this:

+ timeout 2 podman run --rm runner-image:latest python3 -c 'import time; time.sleep(5)'
ERRO[0002] Error forwarding signal 18 to container 473036e98db1884e7a6baac667a135db7e5176b9c34db0c885b6bc10efd6fd11: can only kill running containers. 473036e98db1884e7a6baac667a135db7e5176b9c34db0c885b6bc10efd6fd11 is in state stopped: container state improper 

(runner-image is this one)

My actual use case, however, ends like this:

time="2020-10-19T15:07:21-07:00" level=error msg="container not running"
time="2020-10-19T15:07:21-07:00" level=error msg="Error forwarding signal 15 to container 31e767a031b7cd086dbda154b9df5adaff694fd111e7feedee748d91c7eedcdb: error sending signal to container 31e767a031b7cd086dbda154b9df5adaff694fd111e7feedee748d91c7eedcdb: `/usr/lib/cri-o-runc/sbin/runc kill 31e767a031b7cd086dbda154b9df5adaff694fd111e7feedee748d91c7eedcdb 15` failed: exit status 1"

With -i, it occasionally deadlocks, breaking most podman commands for that user (such as podman ps). The only way I’ve been able to recover from the deadlock is to kill -9 that podman process:

+ timeout 2 podman run --rm -i runner-image:latest python3 -c 'import time; time.sleep(5)'

(hung forever)

An strace of podman ps shows what looks like a spinlock with a timeout:

$ strace podman ps

...

futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x28849e8, FUTEX_WAIT_PRIVATE, 0, NULL

The signals in question are:

  • 15: SIGTERM
  • 18: SIGCONT
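The numbers can be cross-checked against the kernel's signal table (these mappings are for x86-64 Linux; a few signal numbers differ on other architectures):

```shell
# Map the numeric signals from the logs to their names.
kill -l 15   # prints TERM
kill -l 18   # prints CONT on x86-64 Linux
```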

Steps to reproduce the issue:

  1. See commands above

Describe the results you received:

See output above

Describe the results you expected:

I expect things to exit gracefully and not deadlock.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

$ podman version
Version:      2.1.1
API Version:  2.0.0
Go Version:   go1.15.2
Built:        Wed Dec 31 16:00:00 1969
OS/Arch:      linux/amd64

Output of podman info --debug:

$ podman info --debug
host:
  arch: amd64
  buildahVersion: 1.16.1
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.20, commit: '
  cpus: 5
  distribution:
    distribution: ubuntu
    version: "20.04"
  eventLogger: journald
  hostname: babibox
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.4.0-51-generic
  linkmode: dynamic
  memFree: 5532704768
  memTotal: 8348299264
  ociRuntime:
    name: runc
    package: 'cri-o-runc: /usr/lib/cri-o-runc/sbin/runc'
    path: /usr/lib/cri-o-runc/sbin/runc
    version: 'runc version spec: 1.0.2-dev'
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  rootless: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.1.4
      commit: unknown
      libslirp: 4.3.1-git
      SLIRP_CONFIG_VERSION_MAX: 3
  swapFree: 1964396544
  swapTotal: 1964396544
  uptime: 5h 13m 0.85s (Approximately 0.21 days)
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/asottile/.config/containers/storage.conf
  containerStore:
    number: 2
    paused: 0
    running: 2
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: 'fuse-overlayfs: /usr/bin/fuse-overlayfs'
      Version: |-
        fusermount3 version: 3.9.0
        fuse-overlayfs: version 1.1.0
        FUSE library version 3.9.0
        using FUSE kernel interface version 7.31
  graphRoot: /home/asottile/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 23
  runRoot: /run/user/1000/containers
  volumePath: /home/asottile/.local/share/containers/storage/volumes
version:
  APIVersion: 2.0.0
  Built: 0
  BuiltTime: Wed Dec 31 16:00:00 1969
  GitCommit: ""
  GoVersion: go1.15.2
  OsArch: linux/amd64
  Version: 2.1.1

Package info (e.g. output of rpm -q podman or apt list podman):

$ apt list podman
Listing... Done
podman/unknown,now 2.1.1~2 amd64 [installed]
podman/unknown 2.1.1~2 arm64
podman/unknown 2.1.1~2 armhf
podman/unknown 2.1.1~2 s390x

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

VirtualBox + AWS

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 63 (52 by maintainers)

Most upvoted comments

crun returns that error when the container has already exited.

I think a fully loaded machine just makes it easier to trigger the race where Podman thinks the container is running and sends a signal.

On such failures, Podman should probably ask the runtime whether the container is still alive, and ignore the error when it has already terminated.
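The check-then-ignore idea could be sketched like this, using a plain process and `kill -0` purely as a stand-in for querying the runtime (`runc state`) — an illustration of the pattern, not Podman's actual code:

```shell
# Sketch: check liveness before forwarding a signal, and treat
# "already exited" as success rather than an error.
sleep 0.1 &
pid=$!
wait "$pid"                        # by now the process has exited and been reaped
if kill -0 "$pid" 2>/dev/null; then
  kill -TERM "$pid"                # still alive: forward the signal
else
  echo "already exited; ignoring signal"
fi
```

Note the sketch still has a window between the check and the kill, so a real fix would also need to swallow an ESRCH-style error from the kill itself.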

This is still an issue.

Podman v2.2.1

Hitting this bug also.

ERRO[0000] container not running                        
ERRO[0000] Error forwarding signal 15 to container b260052814cadf180bb9be699a37193842259fbe1ed4b75e85130012cc047b0a: error sending signal to container b260052814cadf180bb9be699a37193842259fbe1ed4b75e85130012cc047b0a: `/usr/bin/runc kill b260052814cadf180bb9be699a37193842259fbe1ed4b75e85130012cc047b0a 15` failed: exit status 1 

The “Error forwarding signal” bit definitely reproduces.

A first step is probably to drop that down to an Info-level log when it’s caused by the container being in a bad state; that will at least stop spamming the terminal with errors.