podman: [btrfs] Sporadic Found incomplete layer error results in broken container engine

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

A sporadically occurring “Found incomplete layer” error after the nightly automatic system updates on openSUSE MicroOS results in a broken podman container engine:

WARN[0000] Found incomplete layer "236fcd368394d7094f40012a131c301d615722e60b25cb459efa229a7242041b", deleting it 
Error: stat /var/lib/containers/storage/btrfs/subvolumes/236fcd368394d7094f40012a131c301d615722e60b25cb459efa229a7242041b: no such file or directory

Once the error occurs, nothing works anymore. Even a podman image prune fails with the same error. The only way to fix podman is to manually wipe the /var/lib/containers/storage/btrfs directory.
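The manual recovery can be sketched as follows. This is a destructive sketch, not a fix: it destroys all local images and containers. The paths are the ones from this report, and podman system reset is the supported way to achieve the same end as deleting the store by hand:

```shell
# WARNING: destroys all local images and containers.
# Stop the engine socket first so nothing holds the store open.
systemctl stop podman.socket

# Supported route; fall back to removing the btrfs store by hand,
# which is what this report describes.
podman system reset --force || \
    rm -rf /var/lib/containers/storage/btrfs \
           /var/lib/containers/storage/btrfs-*
```

Afterwards, containers have to be pulled and recreated from scratch.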

I’m having this issue on a MicroOS installation with the most recent podman version (4.3.1). I have a couple of containers running there, and this issue has now occurred for the second time in a month after the automatic nightly updates. A fellow redditor confirms the issue.

The issue arises after a round of automatic updates during the night. It is unclear whether the system update or a run of podman auto-update causes it; I have not been able to find a reproducer yet.

Steps to reproduce the issue:

A possible reproducer can be found below

Describe the results you received:

  • podman container engine broken after automatic system and container updates

Describe the results you expected:

  • podman keeps working

Additional information you deem important (e.g. issue happens only occasionally):

  • Issue happens only occasionally

Output of podman version:

Client:       Podman Engine
Version:      4.3.1
API Version:  4.3.1
Go Version:   go1.17.13
Built:        Tue Nov 22 00:00:00 2022
OS/Arch:      linux/amd64

Output of podman info:

host:
  arch: amd64
  buildahVersion: 1.28.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.5-2.1.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.5, commit: unknown'
  cpuUtilization:
    idlePercent: 98.92
    systemPercent: 0.36
    userPercent: 0.72
  cpus: 4
  distribution:
    distribution: '"opensuse-microos"'
    version: "20221217"
  eventLogger: journald
  hostname: starfury
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.0.12-1-default
  linkmode: dynamic
  logDriver: journald
  memFree: 309272576
  memTotal: 7366852608
  networkBackend: cni
  ociRuntime:
    name: runc
    package: runc-1.1.4-2.1.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.1.4
      commit: v1.1.4-0-ga916309fff0f
      spec: 1.0.2-dev
      go: go1.18.6
      libseccomp: 2.5.4
  os: linux
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-1.1.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: unknown
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 5
      libseccomp: 2.5.4
  swapFree: 0
  swapTotal: 0
  uptime: 3h 47m 14.00s (Approximately 0.12 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.opensuse.org
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 8
    paused: 0
    running: 8
    stopped: 0
  graphDriverName: btrfs
  graphOptions: {}
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 26834087936
  graphRootUsed: 9974857728
  graphStatus:
    Build Version: Btrfs v6.0.2
    Library Version: "102"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 8
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.3.1
  Built: 1669075200
  BuiltTime: Tue Nov 22 00:00:00 2022
  GitCommit: ""
  GoVersion: go1.17.13
  Os: linux
  OsArch: linux/amd64
  Version: 4.3.1

Package info (e.g. output of rpm -q podman or apt list podman or brew info podman):

podman-4.3.1-1.1.x86_64

Have you tested with the latest version of Podman and have you checked Podman Troubleshooting Guide?

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

  • KVM Virtual machine running openSUSE MicroOS
  • I’m using the btrfs overlay

A working hypothesis is that podman auto-update gets interrupted by a system reboot, resulting in dangling (corrupted) layers. On MicroOS, the start times of transactional-update (system updates) and podman auto-update are both randomized (i.e. systemd units with RandomizedDelaySec in place), so there is a chance that the podman auto-update service gets interrupted by a system reboot. I’m running about 8 containers on the host, so the vulnerable time slot is not negligible. This remains a hypothesis for now, as I have not yet been able to verify it.
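If the hypothesis holds, one way to shrink the vulnerable window is to pin the auto-update run to a fixed time well away from the transactional-update/reboot window instead of relying on the randomized delay. A hypothetical systemd drop-in (the timer name matches the unit podman ships; the time of day is an arbitrary example):

```ini
# /etc/systemd/system/podman-auto-update.timer.d/fixed-schedule.conf
# Hypothetical drop-in: run auto-update at a fixed time and remove the
# randomized delay so it cannot drift into the system-update window.
[Timer]
OnCalendar=
OnCalendar=*-*-* 02:00:00
RandomizedDelaySec=0
```

After adding the drop-in, run systemctl daemon-reload for it to take effect. This only narrows the race; it does not address the underlying incomplete-layer corruption.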

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 3
  • Comments: 15 (5 by maintainers)

Most upvoted comments

Sadly we have no expertise in ZFS File system as a storage driver. We would recommend using Overlay over a ZFS lower layer.

This happens to me a lot, but with ZFS, so the problem might not be the storage driver but Podman itself?

Same here, and I fixed it by removing the reference to the layer (which doesn’t exist) from the /var/lib/containers/storage/btrfs-layers/layers.json file.

I don’t know if there’s a better way to solve it, but now at least I can manage my containers without losing data.
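That workaround can be sketched with jq, assuming the broken layer’s ID is taken from the error message. The path is the one the comment names, and layers.json is a JSON array of layer records; stop podman and keep a backup before editing:

```shell
LAYERS=/var/lib/containers/storage/btrfs-layers/layers.json
# Layer ID from the error message above (substitute your own).
BAD=236fcd368394d7094f40012a131c301d615722e60b25cb459efa229a7242041b

# Back up the original, then drop the record whose "id" matches
# the incomplete layer and write the filtered array back.
cp "$LAYERS" "$LAYERS.bak"
jq --arg id "$BAD" 'map(select(.id != $id))' "$LAYERS.bak" > "$LAYERS"
```

This only removes the stale reference; any containers that actually used the missing layer will still need to be recreated.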