sysbox: --cap-drop and --cap-add do not work as expected
Hi @ctalledo ,
I found some confusing behaviour of sysbox with options --cap-add and --cap-drop.
In many cases --cap-drop ALL is disregarded entirely. I checked with capsh --print | grep Current in the container.
Example:
$ docker run --rm --runtime=sysbox-runc --cap-drop ALL -- x11docker/check capsh --print | grep Current
Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read+eip
Compare with --runtime=runc:
$ docker run --rm --runtime=runc --cap-drop ALL -- x11docker/check capsh --print | grep Current
Current: =
I found only one setup that indeed drops all capabilities:
$ docker run --rm --runtime=sysbox-runc --cap-drop ALL --security-opt=no-new-privileges --user 1000:1000 -- x11docker/check capsh --print | grep Current
Current: =
However, adding a capability on top of that, e.g. with --cap-add SYS_BOOT, fails: the capability does not appear.
$ docker run --rm --runtime=sysbox-runc --cap-drop ALL --cap-add SYS_BOOT --security-opt=no-new-privileges --user 1000:1000 -- x11docker/check capsh --print | grep Current
Current: =
Dropping capabilities also fails if I don’t use one of --security-opt=no-new-privileges or --user 1000:1000.
Expected behaviour: capsh --print | grep Current in the container should show exactly the capabilities that are defined on the CLI.
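To make the comparison between runtimes easier to read, the Current line can be split into one capability per line. The following is only a minimal sketch: current_caps is a hypothetical helper name, and it assumes the capsh output format shown in this report ("Current: = cap_a,cap_b+eip", or "Current: =" when empty). Other libcap versions print the line differently and would need a different pattern.

```shell
# Read `capsh --print` output on stdin and print the capabilities from the
# "Current:" line, one per line, sorted. Prints nothing if the set is empty.
# Assumed format (as seen in this report): Current: = cap_chown,cap_kill+eip
current_caps() {
    sed -n 's/^Current: *=* *//p' |   # keep only the Current line, drop the prefix
        sed 's/[+=-][eip]*$//' |      # drop the trailing flag suffix such as +eip
        tr ',' '\n' |                 # one capability per line
        sed '/^$/d' |                 # drop the blank line left by an empty set
        sort
}

# Usage (requires Docker and the runtime under test):
#   docker run --rm --runtime=sysbox-runc --cap-drop ALL -- x11docker/check capsh --print | current_caps
```

With --cap-drop ALL and no --cap-add, the expected output is empty; any line printed is a capability that was not dropped.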
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 18 (10 by maintainers)
In the end, we decided to apply the solution as follows:
With the SYSBOX_HONOR_CAPS=TRUE environment variable, Sysbox will honor the capabilities passed by the higher-level container manager (e.g., Docker) when launching the container. For example, when the container process is root:
Container process is non-root:
Note that since Sysbox uses the Linux user-namespace on all containers, the capabilities are restricted within the container (i.e., the container has no capabilities at host level).
In general, a user who wants fine control of the capabilities (e.g., for extra security) can use the SYSBOX_HONOR_CAPS=TRUE setting. The drawback is that the user must understand all the capabilities required by the processes inside the container.
Finally, SYSBOX_HONOR_CAPS=TRUE controls the per-container behavior. Users that want this behavior to apply to all containers can do so by editing the sysbox-mgr systemd unit to add the --honor-caps flag to the sysbox-mgr command line. If the user does this, she need not pass SYSBOX_HONOR_CAPS=TRUE to the containers anymore. And she can always start a container without this behavior by passing the SYSBOX_HONOR_CAPS=FALSE env var to the container (i.e., the env var always overrides the global config).
Special thanks to @mviereck for opening this issue and suggesting a good solution.
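As a rough illustration of the per-container setting, the sketch below builds a docker run command line that passes SYSBOX_HONOR_CAPS=TRUE, drops all capabilities, and adds back only the ones listed. The helper name honor_caps_cmd is hypothetical, the x11docker/check image is taken from the original report, and passing the variable with docker's -e option is an assumption about how the env var reaches the container. The command is printed rather than executed so it can be reviewed first.

```shell
# Print a `docker run` command for a Sysbox container that should keep only
# the capabilities given as arguments. Nothing is executed here.
# Usage: honor_caps_cmd IMAGE [CAPABILITY...]
honor_caps_cmd() {
    image=$1
    shift
    # SYSBOX_HONOR_CAPS=TRUE asks Sysbox to honor --cap-drop / --cap-add.
    cmd="docker run --rm --runtime=sysbox-runc -e SYSBOX_HONOR_CAPS=TRUE --cap-drop ALL"
    for cap in "$@"; do
        cmd="$cmd --cap-add $cap"
    done
    printf '%s %s capsh --print\n' "$cmd" "$image"
}

# Example: keep only SYS_BOOT.
honor_caps_cmd x11docker/check SYS_BOOT
```

Piping the printed line to sh runs it on a host with Sysbox installed; the Current line of the output should then list only the added capabilities.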
Many thanks for your solution! It works very well here.
It might help to give some examples in the docs. There could be lists of needed capabilities for different tasks. A few are already given above in https://github.com/nestybox/sysbox/issues/453#issuecomment-1022097979. I could provide a few more.
Hi @mviereck,
Thanks for the very good feedback.
I get why this is a problem for x11docker, and we want to design it so that you don’t have to code-up any special changes in x11docker for Sysbox.
On the other hand, the reason we decided that Sysbox would give the root user all caps by default is because Sysbox is a specialized runtime to create “VM-like” environments in secure containers (via the Linux user-ns and other isolation features), and a root user in such environments has all caps enabled by default.
We felt that in the common case, most users (and most software inside the container) expect root to be all-powerful within the container (as it is on a real host or VM), so we wanted to avoid the burden of users having to specify --cap-add=ALL on pretty much every Sysbox container. Reverting that decision would add burden and break most users of Sysbox at this time.
I am thinking we can thread the needle with an approach such as:
The problem is that at Sysbox’s level, we may not be able to discern if the user specified the caps. But I’ll dig down further.
Would such an approach work for x11docker?
Thanks again!
I can confirm that your test setup works well here, too. Tested with base images ubuntu:latest and debian:bullseye. The issue showing all caps in capsh --print occurs only with debian:buster. So that seems to be a capsh issue.
Without --user=1000:1000 the Current output of capsh --print looks odd: (Current: =eip 38,39,40-eip). However, it looks the same with --runtime=runc --cap-add=ALL, so this is rather a capsh issue.
--cap-add=CHOWN --user=1000:1000 is ignored, CHOWN is not available:
--cap-drop=ALL for root is ignored, all capabilities are available:
So far, sysbox-runc seems to behave as you have intended.