sysbox: --cap-drop and --cap-add do not work as expected

Hi @ctalledo ,

I found some confusing behaviour in sysbox with the options --cap-add and --cap-drop. In many cases --cap-drop ALL is ignored entirely. I checked with capsh --print | grep Current in the container. Example:

$ docker run --rm  --runtime=sysbox-runc --cap-drop ALL -- x11docker/check capsh --print | grep Current
Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read+eip

Compare with --runtime=runc:

$ docker run --rm --runtime=runc --cap-drop ALL -- x11docker/check capsh --print | grep Current
Current: =

I found only one setup that indeed drops all capabilities:

$ docker run --rm --runtime=sysbox-runc --cap-drop ALL --security-opt=no-new-privileges  --user 1000:1000 -- x11docker/check capsh --print | grep Current
Current: =

However, adding a capability, e.g. with --cap-add SYS_BOOT, has no effect; the capability does not appear:

$ docker run --rm --runtime=sysbox-runc --cap-drop ALL --cap-add SYS_BOOT --security-opt=no-new-privileges --user 1000:1000 -- x11docker/check capsh --print | grep Current
Current: =

Dropping capabilities also fails unless I use both --security-opt=no-new-privileges and --user 1000:1000.

Expected behaviour: capsh --print | grep Current in the container should show exactly the capabilities that are specified on the CLI.
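
To make the expectation concrete, here is a minimal sketch of the capability set one would expect from a given combination of --cap-add and --cap-drop. It is a simplified model of how Docker resolves these flags, written as an assumption for illustration; it is not Docker's or Sysbox's actual code, and the names expected_caps, DOCKER_DEFAULT_CAPS and ALL_CAPS are hypothetical.

```python
# Simplified model (an assumption): which capabilities should a container
# get for a given --cap-add / --cap-drop combination under Docker's rules?

# Docker's default capability set for unprivileged containers.
DOCKER_DEFAULT_CAPS = set(
    "CHOWN DAC_OVERRIDE FOWNER FSETID KILL SETGID SETUID SETPCAP "
    "NET_BIND_SERVICE NET_RAW SYS_CHROOT MKNOD AUDIT_WRITE SETFCAP".split()
)

# The 38 capabilities listed in the capsh output above.
ALL_CAPS = DOCKER_DEFAULT_CAPS | set(
    "DAC_READ_SEARCH LINUX_IMMUTABLE NET_BROADCAST NET_ADMIN IPC_LOCK "
    "IPC_OWNER SYS_MODULE SYS_RAWIO SYS_PTRACE SYS_PACCT SYS_ADMIN SYS_BOOT "
    "SYS_NICE SYS_RESOURCE SYS_TIME SYS_TTY_CONFIG LEASE AUDIT_CONTROL "
    "MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM BLOCK_SUSPEND AUDIT_READ".split()
)


def expected_caps(cap_add=(), cap_drop=()):
    """Return the set of capability names expected inside the container."""
    add = {c.upper() for c in cap_add}
    drop = {c.upper() for c in cap_drop}
    if "ALL" in add:
        # --cap-add ALL: everything except what is explicitly dropped.
        return ALL_CAPS - drop
    if "ALL" in drop:
        # --cap-drop ALL: only what is explicitly added back.
        return add
    # Otherwise start from the defaults, drop first, then add.
    return (DOCKER_DEFAULT_CAPS - drop) | add


print(sorted(expected_caps(cap_add=["SYS_BOOT"], cap_drop=["ALL"])))
```

Under this model, --cap-drop ALL --cap-add SYS_BOOT should leave exactly SYS_BOOT, which is what the runc runtime reports and what the sysbox-runc runs above do not.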

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

In the end, we decided to apply the solution as follows:

  • If the sysbox container is passed the SYSBOX_HONOR_CAPS=TRUE environment variable, Sysbox will honor the capabilities passed by the higher level container manager (e.g., Docker) when launching the container. For example:
$ docker run --runtime=sysbox-runc -e SYSBOX_HONOR_CAPS=TRUE --rm -it alpine 
/ # cat /proc/self/status | grep -i cap
CapInh: 00000000a80425fb
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
  • Otherwise, Sysbox will assign default capabilities to the container to mimic those of a Linux host: if the container’s process is a root process, it will assign full capabilities; otherwise, it will assign no capabilities.

For example, container process is root:

$ docker run --runtime=sysbox-runc --rm alpine sh -c "cat /proc/self/status | grep -i cap"
CapInh: 0000003fffffffff
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000003fffffffff

Container process is non-root:

$ docker run --runtime=sysbox-runc --rm -u 1000:1000 alpine sh -c "cat /proc/self/status | grep -i cap"
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
  • Note that since Sysbox uses the Linux user-namespace on all containers, the capabilities are restricted within the container (i.e., the container has no capabilities at host level).

  • In general, users who want fine-grained control of the capabilities (e.g., for extra security) can use the SYSBOX_HONOR_CAPS=TRUE setting. The drawback is that the user must understand all the capabilities required by the processes inside the container.

  • Finally, SYSBOX_HONOR_CAPS=TRUE controls the behavior per container. Users who want this behavior for all containers can edit the sysbox-mgr systemd unit and add the --honor-caps flag to the sysbox-mgr command line. Once that is done, they no longer need to pass SYSBOX_HONOR_CAPS=TRUE to the containers, and they can still opt a single container out of the global setting by passing the SYSBOX_HONOR_CAPS=FALSE env var to it (i.e., the env var always overrides the global config).
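
The rules in the list above can be summarized in a short sketch. This is an illustration of the described behavior only, not Sysbox's implementation: the function names, the global_honor_caps parameter (standing in for the sysbox-mgr --honor-caps flag) and the case-insensitive parsing of the env var value are assumptions.

```python
# Illustrative sketch of the capability policy described above.
# All names here are hypothetical; this is not Sysbox source code.

def honor_caps(env, global_honor_caps=False):
    """Decide whether the caps requested by the container manager are honored.

    env               -- the container's environment variables (dict)
    global_honor_caps -- stands in for the sysbox-mgr --honor-caps flag
    """
    value = env.get("SYSBOX_HONOR_CAPS")
    if value is None:
        # No per-container setting: fall back to the global config.
        return global_honor_caps
    # The env var always overrides the global config.
    # Assumption: the value is compared case-insensitively.
    return value.upper() == "TRUE"


def container_caps(requested, uid, env, all_caps, global_honor_caps=False):
    """Return the capability set assigned to the container's init process."""
    if honor_caps(env, global_honor_caps):
        return set(requested)
    # Default: mimic a Linux host (root gets everything, non-root nothing).
    return set(all_caps) if uid == 0 else set()


ALL = {"CHOWN", "SYS_ADMIN", "SYS_BOOT"}  # shortened list for the example
print(container_caps({"CHOWN"}, 0, {}, ALL))
print(container_caps({"CHOWN"}, 0, {"SYSBOX_HONOR_CAPS": "TRUE"}, ALL))
```

In every case the resulting capabilities apply only inside the container's user namespace, as noted above.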

Special thanks to @mviereck for opening this issue and suggesting a good solution.

Many thanks for your solution! It works very well here.

"The drawback is that the user must understand all the capabilities required by the processes inside the container."

It might help to give some examples in the docs. There could be lists of needed capabilities for different tasks. A few are already given above in https://github.com/nestybox/sysbox/issues/453#issuecomment-1022097979. I could provide a few more.

Hi @mviereck,

Thanks for the very good feedback.

I get why this is a problem for x11docker, and we want to design it so that you don’t have to code-up any special changes in x11docker for Sysbox.

On the other hand, the reason we decided that Sysbox would give the root user all caps by default is that Sysbox is a specialized runtime to create “VM-like” environments in secure containers (via the Linux user-ns and other isolation features), and a root user in such environments has all caps enabled by default.

We felt that in the common case, most users (and most software inside the container) expect root to be all-powerful within the container (as it is on a real host or VM), so we wanted to spare users the burden of specifying --cap-add=ALL on pretty much every Sysbox container. Reverting that decision would add that burden and break most users of Sysbox at this time.

I am thinking we can thread the needle with an approach such as:

If user does not specify caps:
  if root user -> all caps
  else -> no caps
else 
  honor user-specified caps (for root or non-root)

The problem is that at Sysbox’s level, we may not be able to discern if the user specified the caps. But I’ll dig down further.

Would such an approach work for x11docker?

Thanks again!

I can confirm that your test setup looks good here, too. Tested with base images ubuntu:latest and debian:bullseye. The issue of capsh --print showing all caps occurs only with debian:buster, so that seems to be a capsh issue.

lauscher@debianlaptop:~/git2/test$ docker run --rm --user=1000:1000 --runtime=sysbox-runc test capsh --print
WARNING: libcap needs an update (cap=40 should have a name).
Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
Ambient set =
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=1000(???) euid=1000(???)
gid=1000(???)
groups=
Guessed mode: UNCERTAIN (0)

lauscher@debianlaptop:~/git2/test$ docker run --rm --user=1000:1000 --runtime=sysbox-runc test cat /proc/self/status | grep -i cap
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	0000003fffffffff
CapAmb:	0000000000000000

Without --user=1000:1000 the Current output of capsh --print looks odd (Current: =eip 38,39,40-eip). However, it looks the same with --runtime=runc --cap-add=ALL, so this is more likely a capsh issue.

lauscher@debianlaptop:~/git2/test$ docker run --rm --runtime=sysbox-runc test capsh --print
WARNING: libcap needs an update (cap=40 should have a name).
Current: =eip 38,39,40-eip
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
Ambient set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root) euid=0(root)
gid=0(root)
groups=
Guessed mode: UNCERTAIN (0)

lauscher@debianlaptop:~/git2/test$ docker run --rm --runtime=sysbox-runc test cat /proc/self/status | grep -i cap
CapInh:	0000003fffffffff
CapPrm:	0000003fffffffff
CapEff:	0000003fffffffff
CapBnd:	0000003fffffffff
CapAmb:	0000003fffffffff

--cap-add=CHOWN combined with --user=1000:1000 is ignored; CHOWN is not available:

lauscher@debianlaptop:~/git2/test$ docker run --rm --runtime=sysbox-runc  --user=1000:1000 --cap-add=CHOWN test cat /proc/self/status | grep -i cap
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	0000003fffffffff
CapAmb:	0000000000000000


--cap-drop=ALL for root is ignored, all capabilities are available:

lauscher@debianlaptop:~/git2/test$ docker run --rm --runtime=sysbox-runc --cap-drop=ALL test cat /proc/self/status | grep -i cap
CapInh:	0000003fffffffff
CapPrm:	0000003fffffffff
CapEff:	0000003fffffffff
CapBnd:	0000003fffffffff
CapAmb:	0000003fffffffff

So far, sysbox-runc seems to behave as you have intended.
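
For reading the hex masks from /proc/self/status shown above, a small decoder may help. This is a minimal sketch: the name table follows the Linux capability numbering (bit 0 is cap_chown, and so on), and the function name decode_caps is just chosen for this example.

```python
# Decode a capability bitmask as printed in /proc/<pid>/status
# (CapInh, CapPrm, CapEff, CapBnd, CapAmb) into capability names.

# Index in this list == capability number == bit position in the mask.
CAP_NAMES = (
    "cap_chown cap_dac_override cap_dac_read_search cap_fowner cap_fsetid "
    "cap_kill cap_setgid cap_setuid cap_setpcap cap_linux_immutable "
    "cap_net_bind_service cap_net_broadcast cap_net_admin cap_net_raw "
    "cap_ipc_lock cap_ipc_owner cap_sys_module cap_sys_rawio cap_sys_chroot "
    "cap_sys_ptrace cap_sys_pacct cap_sys_admin cap_sys_boot cap_sys_nice "
    "cap_sys_resource cap_sys_time cap_sys_tty_config cap_mknod cap_lease "
    "cap_audit_write cap_audit_control cap_setfcap cap_mac_override "
    "cap_mac_admin cap_syslog cap_wake_alarm cap_block_suspend cap_audit_read "
    "cap_perfmon cap_bpf cap_checkpoint_restore"
).split()


def decode_caps(mask_hex):
    """Return the names of all capabilities whose bit is set in mask_hex."""
    mask = int(mask_hex, 16)
    names = []
    for bit in range(mask.bit_length()):
        if (mask >> bit) & 1:
            # Fall back to a numeric name for bits newer than this table.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"cap_{bit}")
    return names


for mask in ("0000003fffffffff", "00000000a80425fb", "0000000000000000"):
    print(mask, len(decode_caps(mask)), "capabilities")
```

Where libcap is installed, capsh --decode=<mask> gives the same information.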