podman: power-loss while creating containers may leave podman (storage) in a broken state

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

If a power loss occurs within a small time window while podman is creating containers, the container storage is left broken and no containers can be started or created anymore. Only podman system prune -a seems to resolve the issue; all other prune commands do not.

Steps to reproduce the issue (maybe in general):

  1. Set up systemd units that create containers on boot (see the sketch after this list)
  2. Disconnect the power source while the containers are being created on boot
  3. Restart and observe the units / containers
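
For illustration, here is a minimal sketch of the kind of unit involved. The unit name, container name, and image are hypothetical; the optional left-over-container removal and the --systemd flag match the setup described under "Additional environment details" below.

# /etc/systemd/system/container-example.service (hypothetical)
[Unit]
Description=Example podman container created on boot
Wants=network-online.target
After=network-online.target

[Service]
# Optional clean-up of a left-over container; the leading "-" makes a failure non-fatal
ExecStartPre=-/usr/bin/podman rm -f example
# Run in the foreground so systemd tracks the podman process directly
ExecStart=/usr/bin/podman run --rm --name example --systemd=always docker.io/library/alpine:latest sleep infinity
ExecStop=/usr/bin/podman stop -t 10 example

[Install]
WantedBy=multi-user.target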

Steps to reproduce the issue (specifically):

The following are the specific steps for my actual setup. The setup might matter: the Raspberry Pi 3B+ has limited resources, so image pulls and container creation take some time (especially when starting 5 containers in parallel), which could widen the time window for corruption.

  1. Install latest Fedora IoT on a Raspberry Pi 3B+
  2. Set up some containers that start via systemd on boot, including a pod. Use unit dependencies so that the pod starts first, then one container with a boot time of more than a minute (e.g. node-red), and then the four other containers (see the dependency sketch after this list)
  3. As soon as the containers are being created after a successful boot, disconnect the power source
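
A sketch of the unit ordering described in step 2 (all unit, pod, and container names here are hypothetical; only the Requires=/After= dependency pattern is the point):

# pod-mypod.service (hypothetical): creates the pod first
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStartPre=-/usr/bin/podman pod rm -f mypod
ExecStart=/usr/bin/podman pod create --name mypod

# container-nodered.service (hypothetical): slow-starting container, ordered after the pod
[Unit]
Requires=pod-mypod.service
After=pod-mypod.service
[Service]
ExecStart=/usr/bin/podman run --rm --name nodered --pod mypod docker.io/nodered/node-red:latest

# the four remaining container-*.service units declare the same Requires=/After=
# on container-nodered.service and therefore start in parallel after it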

Describe the results you received:

After rebooting following the power loss, all podman container units fail to start with the following error message:

Error: readlink /var/lib/containers/storage/overlay/l/ORYZLEWFSIV3UXAUDOB4OAH6SW: no such file or directory
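
For reference, dangling layer links like the one in the error can be listed with a small shell loop (a sketch, assuming root storage at the default graphRoot shown in the podman info output below):

# print overlay layer links whose symlink target no longer exists
for l in /var/lib/containers/storage/overlay/l/*; do
    [ -e "$l" ] || echo "dangling layer link: $l"
done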

Describe the results you expected:

I expect all containers to be created normally. My systemd units remove any left-over containers before attempting to create the new ones, and this should work in any case, even after a power loss. Podman should not enter a state that requires a manual podman system prune -a or other intervention when container creation fails.

Additional information you deem important (e.g. issue happens only occasionally):

I’m starting 5 containers in parallel, which slows down container creation considerably on a Raspberry Pi 3B+ and could widen a potential time window for corruption.

Output of podman version:

Version:      2.1.1
API Version:  2.0.0
Go Version:   go1.14.9
Built:        Wed Sep 30 21:31:36 2020
OS/Arch:      linux/arm64

Output of podman info --debug:

host:
  arch: arm64
  buildahVersion: 1.16.1
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.21-2.fc32.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.21, commit: 5c1a09d48bd2b912c29efe00ec956c8f84ae26b9'
  cpus: 4
  distribution:
    distribution: fedora
    version: "32"
  eventLogger: journald
  hostname: localhost
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.8.13-200.fc32.aarch64
  linkmode: dynamic
  memFree: 11911168
  memTotal: 981143552
  ociRuntime:
    name: crun
    package: crun-0.15-5.fc32.aarch64
    path: /usr/bin/crun
    version: |-
      crun version 0.15
      commit: 56ca95e61639510c7dbd39ff512f80f626404969
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  rootless: false
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 370003968
  swapTotal: 466997248
  uptime: 3h 14m 42.71s (Approximately 0.12 days)
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - registry.centos.org
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 6
    paused: 0
    running: 6
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageStore:
    number: 6
  runRoot: /var/run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 2.0.0
  Built: 1601494296
  BuiltTime: Wed Sep 30 21:31:36 2020
  GitCommit: ""
  GoVersion: go1.14.9
  OsArch: linux/arm64
  Version: 2.1.1

Package info (e.g. output of rpm -q podman or apt list podman):

podman-2.1.1-7.fc32.aarch64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

I’m using the aarch64 variant on a Raspberry Pi 3B+ (limited resources) running Fedora IoT 32. The containers are created automatically on boot via systemd units. Each unit first tries to remove any existing container via an optional command (allowed to fail) and then runs a podman container command with the --systemd flag.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 21 (16 by maintainers)

Most upvoted comments

Is there a way to work around this broken state without clearing the podman storage with system prune -a?

I’ve several podman deployments in the field on low-bandwidth or pay-per-byte connections and would like to keep the downloaded images while still recovering from this broken state. Any ideas?

EDIT:

Might be related to #5986 - at least there seems to be a valid workaround using a read-only fs: https://github.com/containers/podman/issues/5986#issuecomment-716376419