buildah: Unprivileged Rootless Buildah on Kubernetes fails due to newgidmap/newuidmap: Operation not permitted error

Description I have an unprivileged rootless Buildah container running on kubernetes/CRI-O on a Centos 7.9 host using VFS storage. When running any buildah command I receive the following output:

WARN[0000] error running newgidmap: exit status 1: newgidmap: write to gid_map failed: Operation not permitted 
WARN[0000] falling back to single mapping               
WARN[0000] error running newuidmap: exit status 1: newuidmap: write to uid_map failed: Operation not permitted 
WARN[0000] falling back to single mapping      

The buildah bud command eventually fails due to the following error:

ERRO[0001] Error while applying layer: ApplyLayer exit status 1 stdout:  stderr: potentially insufficient UIDs or GIDs available in user namespace (requested 192:192 for /run/systemd/netif): Check /etc/subuid and /etc/subgid: lchown /run/systemd/netif: invalid argument 
error creating build container: writing blob: adding layer with blob "sha256:b297047bc4b568dedc27cbeda009a435b0714fd2f681870601890a4b1c7531e8": ApplyLayer exit status 1 stdout:  stderr: potentially insufficient UIDs or GIDs available in user namespace (requested 192:192 for /run/systemd/netif): Check /etc/subuid and /etc/subgid: lchown /run/systemd/netif: invalid argument

This issue is not present when running as user build with privileged: true enabled on the pod or when running as the root user without privileged mode enabled.

Steps to reproduce the issue:

  1. Deploy buildah to kubernetes running as user build
  2. Run buildah bud

Describe the results you received: Buildah is unable to create new gid or uid maps:

[build@runner-le6vnazt-project-140-concurrent-026xwb /]$ buildah info
WARN[0000] error running newgidmap: exit status 1: newgidmap: write to gid_map failed: Operation not permitted 
WARN[0000] falling back to single mapping               
WARN[0000] error running newuidmap: exit status 1: newuidmap: write to uid_map failed: Operation not permitted 
WARN[0000] falling back to single mapping               

Describe the results you expected: Expected buildah to be able to create mappings

Output of rpm -q buildah or apt list buildah:

buildah-1.23.3-2.fc35.x86_64

Output of buildah version:

buildah version 1.23.3 (image-spec 1.0.1-dev, runtime-spec 1.0.2-dev)

Output of cat /etc/*release:

Fedora release 35 (Thirty Five)
NAME="Fedora Linux"
VERSION="35 (Container Image)"
ID=fedora
VERSION_ID=35
VERSION_CODENAME=""
PLATFORM_ID="platform:f35"
PRETTY_NAME="Fedora Linux 35 (Container Image)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:35"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f35/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=35
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=35
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Container Image"
VARIANT_ID=container
Fedora release 35 (Thirty Five)
Fedora release 35 (Thirty Five)

Output of uname -a:

Linux runner-le6vnazt-project-140-concurrent-026xwb 3.10.0-1160.25.1.el7.x86_64 #1 SMP Wed Apr 28 21:49:45 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Output of cat /etc/containers/storage.conf:

# This file is is the configuration file for all tools
# that use the containers/storage library. The storage.conf file
# overrides all other storage.conf files. Container engines using the
# container/storage library do not inherit fields from other storage.conf
# files.
#
#  Note: The storage.conf file overrides other storage.conf files based on this precedence:
#      /usr/containers/storage.conf
#      /etc/containers/storage.conf
#      $HOME/.config/containers/storage.conf
#      $XDG_CONFIG_HOME/containers/storage.conf (If XDG_CONFIG_HOME is set)
# See man 5 containers-storage.conf for more information
# The "container storage" table contains all of the server options.
[storage]

# Default Storage Driver, Must be set for proper operation.
driver = "vfs"

# Temporary storage location
runroot = "/run/containers/storage"

# Primary Read/Write location of container storage
# When changing the graphroot location on an SELINUX system, you must
# ensure  the labeling matches the default locations labels with the
# following commands:
# semanage fcontext -a -e /var/lib/containers/storage /NEWSTORAGEPATH
# restorecon -R -v /NEWSTORAGEPATH
graphroot = "/var/lib/containers/storage"


# Storage path for rootless users
#
# rootless_storage_path = "$HOME/.local/share/containers/storage"

[storage.options]
# Storage options to be passed to underlying storage drivers

# AdditionalImageStores is used to pass paths to additional Read/Only image stores
# Must be comma separated list.
additionalimagestores = [
"/var/lib/shared",
]

# Remap-UIDs/GIDs is the mapping from UIDs/GIDs as they should appear inside of
# a container, to the UIDs/GIDs as they should appear outside of the container,
# and the length of the range of UIDs/GIDs.  Additional mapped sets can be
# listed and will be heeded by libraries, but there are limits to the number of
# mappings which the kernel will allow when you later attempt to run a
# container.
#
#remap-uids = 0:1668442479:65536
#remap-gids = 0:1668442479:65536

# Remap-User/Group is a user name which can be used to look up one or more UID/GID
# ranges in the /etc/subuid or /etc/subgid file.  Mappings are set up starting
# with an in-container ID of 0 and then a host-level ID taken from the lowest
# range that matches the specified name, and using the length of that range.
# Additional ranges are then assigned, using the ranges which specify the
# lowest host-level IDs first, to the lowest not-yet-mapped in-container ID,
# until all of the entries have been used for maps.
#
# remap-user = "containers"
# remap-group = "containers"

# Root-auto-userns-user is a user name which can be used to look up one or more UID/GID
# ranges in the /etc/subuid and /etc/subgid file.  These ranges will be partitioned
# to containers configured to create automatically a user namespace.  Containers
# configured to automatically create a user namespace can still overlap with containers
# having an explicit mapping set.
# This setting is ignored when running as rootless.
# root-auto-userns-user = "storage"
#
# Auto-userns-min-size is the minimum size for a user namespace created automatically.
# auto-userns-min-size=1024
#
# Auto-userns-max-size is the minimum size for a user namespace created automatically.
# auto-userns-max-size=65536

[storage.options.overlay]
# ignore_chown_errors can be set to allow a non privileged user running with
# a single UID within a user namespace to run containers. The user can pull
# and use any image even those with multiple uids.  Note multiple UIDs will be
# squashed down to the default uid in the container.  These images will have no
# separation between the users in the container. Only supported for the overlay
# and vfs drivers.
#ignore_chown_errors = "false"

# Inodes is used to set a maximum inodes of the container image.
# inodes = ""

# Path to an helper program to use for mounting the file system instead of mounting it
# directly.
mount_program = "/usr/bin/fuse-overlayfs"

# mountopt specifies comma separated list of extra mount options
mountopt = "nodev,fsync=0"

# Set to skip a PRIVATE bind mount on the storage home directory.
# skip_mount_home = "false"

# Size is used to set a maximum size of the container image.
# size = ""

# ForceMask specifies the permissions mask that is used for new files and
# directories.
#
# The values "shared" and "private" are accepted.
# Octal permission masks are also accepted.
#
#  "": No value specified.
#     All files/directories, get set with the permissions identified within the
#     image.
#  "private": it is equivalent to 0700.
#     All files/directories get set with 0700 permissions.  The owner has rwx
#     access to the files. No other users on the system can access the files.
#     This setting could be used with networked based homedirs.
#  "shared": it is equivalent to 0755.
#     The owner has rwx access to the files and everyone else can read, access
#     and execute them. This setting is useful for sharing containers storage
#     with other users.  For instance have a storage owned by root but shared
#     to rootless users as an additional store.
#     NOTE:  All files within the image are made readable and executable by any
#     user on the system. Even /etc/shadow within your image is now readable by
#     any user.
#
#   OCTAL: Users can experiment with other OCTAL Permissions.
#
#  Note: The force_mask Flag is an experimental feature, it could change in the
#  future.  When "force_mask" is set the original permission mask is stored in
#  the "user.containers.override_stat" xattr and the "mount_program" option must
#  be specified. Mount programs like "/usr/bin/fuse-overlayfs" present the
#  extended attribute permissions to processes within containers rather then the
#  "force_mask"  permissions.
#
# force_mask = ""

[storage.options.thinpool]
# Storage Options for thinpool

# autoextend_percent determines the amount by which pool needs to be
# grown. This is specified in terms of % of pool size. So a value of 20 means
# that when threshold is hit, pool will be grown by 20% of existing
# pool size.
# autoextend_percent = "20"

# autoextend_threshold determines the pool extension threshold in terms
# of percentage of pool size. For example, if threshold is 60, that means when
# pool is 60% full, threshold has been hit.
# autoextend_threshold = "80"

# basesize specifies the size to use when creating the base device, which
# limits the size of images and containers.
# basesize = "10G"

# blocksize specifies a custom blocksize to use for the thin pool.
# blocksize="64k"

# directlvm_device specifies a custom block storage device to use for the
# thin pool. Required if you setup devicemapper.
# directlvm_device = ""

# directlvm_device_force wipes device even if device already has a filesystem.
# directlvm_device_force = "True"

# fs specifies the filesystem type to use for the base device.
# fs="xfs"

# log_level sets the log level of devicemapper.
# 0: LogLevelSuppress 0 (Default)
# 2: LogLevelFatal
# 3: LogLevelErr
# 4: LogLevelWarn
# 5: LogLevelNotice
# 6: LogLevelInfo
# 7: LogLevelDebug
# log_level = "7"

# min_free_space specifies the min free space percent in a thin pool require for
# new device creation to succeed. Valid values are from 0% - 99%.
# Value 0% disables
# min_free_space = "10%"

# mkfsarg specifies extra mkfs arguments to be used when creating the base
# device.
# mkfsarg = ""

# metadata_size is used to set the `pvcreate --metadatasize` options when
# creating thin devices. Default is 128k
# metadata_size = ""

# Size is used to set a maximum size of the container image.
# size = ""

# use_deferred_removal marks devicemapper block device for deferred removal.
# If the thinpool is in use when the driver attempts to remove it, the driver
# tells the kernel to remove it as soon as possible. Note this does not free
# up the disk space, use deferred deletion to fully remove the thinpool.
# use_deferred_removal = "True"

# use_deferred_deletion marks thinpool device for deferred deletion.
# If the device is busy when the driver attempts to delete it, the driver
# will attempt to delete device every 30 seconds until successful.
# If the program using the driver exits, the driver will continue attempting
# to cleanup the next time the driver is used. Deferred deletion permanently
# deletes the device and all data stored in device will be lost.
# use_deferred_deletion = "True"

# xfs_nospace_max_retries specifies the maximum number of retries XFS should
# attempt to complete IO when ENOSPC (no space) error is returned by
# underlying storage device.
# xfs_nospace_max_retries = "0"

Output of buildah info:

WARN[0000] error running newgidmap: exit status 1: newgidmap: write to gid_map failed: Operation not permitted 
WARN[0000] falling back to single mapping               
WARN[0000] error running newuidmap: exit status 1: newuidmap: write to uid_map failed: Operation not permitted 
WARN[0000] falling back to single mapping               
{
    "host": {
        "CgroupVersion": "v1",
        "Distribution": {
            "distribution": "fedora",
            "version": "35"
        },
        "MemFree": 468131840,
        "MemTotal": 16259915776,
        "OCIRuntime": "crun",
        "SwapFree": 8232628224,
        "SwapTotal": 8254386176,
        "arch": "amd64",
        "cpus": 8,
        "hostname": "runner-le6vnazt-project-140-concurrent-026xwb",
        "kernel": "3.10.0-1160.25.1.el7.x86_64",
        "os": "linux",
        "rootless": true,
        "uptime": "310h 24m 16.22s (Approximately 12.92 days)"
    },
    "store": {
        "ContainerStore": {
            "number": 0
        },
        "GraphDriverName": "vfs",
        "GraphOptions": null,
        "GraphRoot": "/home/build/.local/share/containers/storage",
        "GraphStatus": {},
        "ImageStore": {
            "number": 0
        },
        "RunRoot": "/var/tmp/containers-user-1000/containers"
    }
}

Output of capsh --print:

Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=i
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Ambient set =
Current IAB: cap_chown,cap_dac_override,!cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,!cap_linux_immutable,cap_net_bind_service,!cap_net_broadcast,!cap_net_admin,!cap_net_raw,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,cap_sys_chroot,!cap_sys_ptrace,!cap_sys_pacct,!cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,cap_mknod,!cap_lease,cap_audit_write,!cap_audit_control,cap_setfcap,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend
Securebits: 00/0x0/1'b0 (no-new-privs=1)
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=1000(build) euid=1000(build)
gid=1000(build)
groups=1000(build)
Guessed mode: UNCERTAIN (0)

Output of /etc/subgid and /etc/subuid:

build:2000:50000

Output of rpm -qa | grep shadow:

shadow-utils-4.9-9.fc35.x86_64

Output of getcap /usr/bin/newuidmap /usr/bin/newgidmap:

/usr/bin/newuidmap cap_setuid=ep
/usr/bin/newgidmap cap_setgid=ep

Output of cat /proc/sys/users/max_user_namespace on Centos 7.9 host::

15000

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 21 (5 by maintainers)

Most upvoted comments

From my understanding CAP_SETUID is equivalent to running as root, as a process with CAP_SETUID can set their UID to anything including 0 i.e. root. Are there any ways to build containers inside of a container (such as in kubernetes) without this capability?

buildah build --isolation=chroot … Might be helpful

No change when adding --isolation=chroot