nsjail: Cannot make nsjail work on cgroupsv2 system

For example, when I run nsjail with --use_cgroupv2 --cgroupv2_mount /sys/fs/cgroup/NSJAIL, I still see errors like

writeBufToFile():95 Couldn't open '/sys/fs/cgroup/NSJAIL/NSJAIL.10/memory.max' for writing: No such file or directory

If I understand cgroups v2 correctly, it should look for /sys/fs/cgroup/NSJAIL/memory.max, not /sys/fs/cgroup/NSJAIL/NSJAIL.10/memory.max.

/sys/fs/cgroup/NSJAIL exists.

About this issue

State: open
Created 2 years ago
Comments: 23 (9 by maintainers)

Most upvoted comments

I looked at this a little more, since I know I’ve run into issues running on stock 22.04 as well. I tried this on a 22.04 desktop in virtualbox. I did see slightly different initial behavior in AWS, but I think what’s here should still be helpful.

Linux 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

For some reason, on boot the cpu controller is missing from cgroup.subtree_control (why? I have no idea):

$ cat /sys/fs/cgroup/cgroup.subtree_control
memory pids

Example 1: Running as root?

If you just straight up run nsjail now in the root cgroup (with sudo, so it can create its child cgroup), --cgroup_mem_max works fine, but if you set --cgroup_cpu_ms_per_sec you’ll get:

$ sudo ./nsjail -R /bin/ -R /lib/ -R /lib64 -R /usr/ -R /sbin/ --use_cgroupv2 --cgroup_cpu_ms_per_sec 500 -- /bin/bash -i
[I][2022-11-16T15:29:05-0500] Setting 'cpu.max' to '500000 1000000'
[E][2022-11-16T15:29:05-0500][4983] writeBufToFile():96 Couldn't open '/sys/fs/cgroup/NSJAIL.4984/cpu.max' for writing: No such file or directory
[W][2022-11-16T15:29:05-0500][4983] writeToCgroup():61 Could not update cpu.max
[E][2022-11-16T15:29:05-0500][4983] initParent():425 Couldn't initialize cgroup 2 user namespace for pid=4984
[F][2022-11-16T15:29:05-0500][1] runChild():483 Launching child process failed

Fix:

echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control

(If you try the same thing on my fork, it’ll do this last line for you. Whether this is desirable behavior in general or not, I am not sure.)
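A minimal sketch of how to check which controllers still need enabling before running nsjail. The function name missing_controllers is hypothetical (it is not part of nsjail); it only reads cgroup.subtree_control and prints the controllers that are absent.

```shell
# Sketch: list the controllers that are NOT yet enabled in a cgroup's
# cgroup.subtree_control file. The function name is hypothetical.
# usage: missing_controllers <cgroup_dir> <controller>...
missing_controllers() {
    cg_dir=$1
    shift
    # The file holds a single space-separated line, e.g. "memory pids".
    enabled=$(cat "$cg_dir/cgroup.subtree_control")
    for ctrl in "$@"; do
        case " $enabled " in
            *" $ctrl "*) ;;              # already enabled for child cgroups
            *) printf '%s\n' "$ctrl" ;;  # still needs "+ctrl" written
        esac
    done
}

# Possible use from a root shell (writes to the real cgroup tree):
# for c in $(missing_controllers /sys/fs/cgroup cpu memory pids); do
#     echo "+$c" > /sys/fs/cgroup/cgroup.subtree_control
# done
```

On the system shown above, where the file contains only "memory pids", asking for cpu, memory and pids would print just cpu.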

Example 2: Creating a cgroup, running non-root (by giving the user ownership of /sys/fs/cgroup/cgroup.procs)

OK, but what if instead of using the root cgroup, we want to make a new cgroup (as @mattgodbolt was trying), and give our user permissions to use it?

sudo cgcreate -a $USER -t $USER -g memory,cpu:jailtest

I think the permissions error @mattgodbolt was running into is due to the fact you don’t have permission to move processes out of the root cgroup? We can fix that:

sudo chown andrew:root /sys/fs/cgroup/cgroup.procs

Now nsjail can move its children into the appropriate cgroup, and we get a little further:

$ ./nsjail --cgroup_mem_max 1000000 -R /bin/ -R /lib/ -R /lib64 -R /usr/ -R /sbin/ --use_cgroupv2 --cgroup_cpu_ms_per_sec 500 --cgroupv2_mount /sys/fs/cgroup/jailtest/ -- /bin/bash -i
...
[E][2022-11-16T16:58:12-0500][3096] writeBufToFile():96 Couldn't open '/sys/fs/cgroup/jailtest//NSJAIL.3097/memory.max' for writing: No such file or directory
[W][2022-11-16T16:58:12-0500][3096] writeToCgroup():61 Could not update memory.max
[E][2022-11-16T16:58:12-0500][3096] initParent():425 Couldn't initialize cgroup 2 user namespace for pid=3097

We need just one more thing, since /sys/fs/cgroup/jailtest/cgroup.subtree_control is empty:

echo "+cpu +memory" > /sys/fs/cgroup/jailtest/cgroup.subtree_control

Now nsjail works 😃

Example 3: Exec into a cgroup, do everything from there (non-root)

sudo cgcreate -a $USER -t $USER -g memory,cpu:jailtest3
sudo cgexec -g memory,cpu:jailtest3 sudo -s -u andrew
andrew@andrew2204:~/nsjail$

Now we are in the child cgroup… let’s try to run nsjail

$ ./nsjail --cgroup_mem_max 1000000 -R /bin/ -R /lib/ -R /lib64 -R /usr/ -R /sbin/ --use_cgroupv2 --cgroup_cpu_ms_per_sec 500 --cgroupv2_mount /sys/fs/cgroup/jailtest3/ -- /bin/bash -i
...
[I][2022-11-16T17:13:58-0500] Setting 'memory.max' to '1000000'
[E][2022-11-16T17:13:58-0500][3251] writeBufToFile():96 Couldn't open '/sys/fs/cgroup/jailtest3//NSJAIL.3252/memory.max' for writing: No such file or directory
[W][2022-11-16T17:13:58-0500][3251] writeToCgroup():61 Could not update memory.max
[E][2022-11-16T17:13:58-0500][3251] initParent():425 Couldn't initialize cgroup 2 user namespace for pid=3252

Same issue… let’s do the same thing, right?

$ echo "+cpu +memory" > /sys/fs/cgroup/jailtest3/cgroup.subtree_control
bash: echo: write error: Device or resource busy

Why can’t we do this? It’s because the “no internal processes” rule won’t let us enable controllers in cgroup.subtree_control while our cgroup has processes in it. First, let’s see how to fix this manually:

$ cat /sys/fs/cgroup/jailtest3/cgroup.procs
3281
3282
3283
3299
$ mkdir /sys/fs/cgroup/jailtest3/lol/
$ echo "3281" >  /sys/fs/cgroup/jailtest3/lol/cgroup.procs
$ echo "3282" >  /sys/fs/cgroup/jailtest3/lol/cgroup.procs
$ echo "3283" >  /sys/fs/cgroup/jailtest3/lol/cgroup.procs
$ echo "+cpu +memory" > /sys/fs/cgroup/jailtest3/cgroup.subtree_control

(Since we spawned a shell, we have a couple of processes in the jailtest3 cgroup – these have to be moved before we can add to subtree_control)

Now nsjail works!
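The manual steps from Example 3 can be wrapped in a small helper. This is only a sketch: the function name move_procs_to_leaf and the leaf name are hypothetical, and it moves every process it finds, which is exactly the kind of blind move you may not want a tool to do on your behalf.

```shell
# Sketch: move every process out of a cgroup into a child ("leaf") cgroup
# so that controllers can then be enabled in the parent's
# cgroup.subtree_control. Function and leaf names are hypothetical.
# usage: move_procs_to_leaf <cgroup_dir> <leaf_name>
move_procs_to_leaf() {
    cg_dir=$1
    leaf=$cg_dir/$2
    mkdir -p "$leaf" || return 1
    # Snapshot the PID list first; cgroup.procs changes as processes move.
    pids=$(cat "$cg_dir/cgroup.procs")
    for pid in $pids; do
        # One PID per write. A process that has already exited makes the
        # write fail, which is harmless here.
        echo "$pid" > "$leaf/cgroup.procs" || true
    done
}

# Possible use, mirroring the commands above:
# move_procs_to_leaf /sys/fs/cgroup/jailtest3 lol
# echo "+cpu +memory" > /sys/fs/cgroup/jailtest3/cgroup.subtree_control
```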

The point of my PR is to add controllers to cgroup.subtree_control (the very last command I ran in each of these examples) if they are not present. For the last example, it doesn’t seem like nsjail ought to move all those processes into a subgroup blindly – so my PR only handles the case where nsjail is the only process in the group.

ndrewh on Nov 17, 2022

I’ll try to summarize (hopefully correctly) here in case someone finds this later:

--cgroupv2_mount is the root at which nsjail will create its individual child process cgroups. nsjail needs permission to create cgroups (i.e. make subdirectories) at this path, and the cgroup needs to contain either no processes or only nsjail itself (if nsjail is in this group, it will move itself into a subgroup for technical reasons).
The cgroup that nsjail is running in is also important, because nsjail needs permission to remove its child processes from that cgroup (that is, it needs write permission on its current cgroup’s cgroup.procs file). If you start nsjail via cgexec, you only need to chown the cgroup.procs file of that cgroup.

These groups do not have to be the same. It sounds like for many applications you could just as well create two cgroups:

One called jailparentgroup that you run nsjail in (via cgexec)
Another called jailchildgroup that you pass as --cgroupv2_mount=/sys/fs/cgroup/jailchildgroup/ which is where nsjail would make subgroups with all the restrictions and move the child processes.

$ sudo cgcreate -a $USER -t $USER -g cpu,pids,memory:jailparentgroup
$ sudo cgcreate -a $USER -t $USER -g cpu,pids,memory:jailchildgroup
$ sudo cgexec -g cpu,pids,memory:jailparentgroup ./nsjail --cgroup_mem_max 10000000 --cgroup_pids_max 50 --cgroup_cpu_ms_per_sec 500 --verbose --detect_cgroupv2 --cgroupv2_mount /sys/fs/cgroup/jailchildgroup/ -R /usr -R /bin -R /lib -R /lib64 -- /bin/bash

The user would need full ownership of /sys/fs/cgroup/jailchildgroup and additionally permission on /sys/fs/cgroup/jailparentgroup/cgroup.procs – if you cgcreate as above, you need no additional changes. (Creating separate groups as above also avoids any cgroup.subtree_control issues, since jailchildgroup would not have any processes, only sub-cgroups.)

Note I don’t think this trick improves the situation in a default (but privileged) Docker container, where your best bet is making sure that nsjail is the root process (and then nsjail will move itself to create a 2-group scenario similar to above).

ndrewh on Dec 1, 2022

@mattgodbolt

The point of --detect_cgroupv2 (at least, as I intended it) was to allow you to specify options for both v1 and v2, and nsjail would infer which options are valid at runtime. So if you specify the ‘cgroupv2_mount’ and ‘detect_cgroupv2: true’ in the config file, it should be backwards-compatible. It will check if the v2 mount is a valid cgroupv2 filesystem and will use v2 only if it is.

As for the permissions error, I think you’re closest to the “Example 2” in my previous comment. My guess is nsjail does not have permission to move the child out of the current cgroup. You can fix this by either (1) spawning nsjail inside a cgroup it has permissions to move children out of (e.g. via cgexec or Docker), or (2) modifying the permissions on the cgroup.procs file for nsjail’s current cgroup (probably either the root one or the one associated with your terminal).
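To find out which cgroup.procs file matters for option (2), you can look up the cgroup the current shell lives in. A minimal sketch, assuming the v2 hierarchy is mounted at /sys/fs/cgroup unless another mount point is passed in; the function name current_cgroup_dir is hypothetical.

```shell
# Sketch: print the cgroup v2 directory a process currently lives in.
# The function name is hypothetical.
# usage: current_cgroup_dir [proc_cgroup_file] [cgroup2_mount]
current_cgroup_dir() {
    proc_file=${1:-/proc/self/cgroup}
    mount_point=${2:-/sys/fs/cgroup}
    # The cgroup v2 entry has the form "0::<path>"; v1 entries carry a
    # non-zero hierarchy ID and are skipped.
    rel=$(sed -n 's/^0:://p' "$proc_file")
    [ -n "$rel" ] || return 1
    printf '%s%s\n' "$mount_point" "$rel"
}

# Possible use: check ownership of the file nsjail has to write to.
# ls -l "$(current_cgroup_dir)/cgroup.procs"
```

If the ls output shows the file owned by root and you run nsjail as a regular user, that is the file to chown (or the reason to start nsjail via cgexec instead).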

ndrewh on Nov 29, 2022

@disconnect3d thank you for such cool in-depth analysis.

I’m vaguely familiar with cgroups2 myself, but I guess I can take a look at what can be improved here.

Though, if anyone will beat me to that, I won’t complain 😃

robertswiecki on May 20, 2022

EDIT: below you can see some diagnosis of your issues, but I am wondering: is there any particular reason you want to use nsjail with cgroups v2 instead of v1?

Docker enables lots of options that may influence whether you can or cannot do a certain operation. For example, even if you use the --privileged flag, Docker still uses Linux namespaces, and specifically the cgroup namespace, so /sys/fs/cgroup/ shows only the cgroup hierarchy that was created for this container (or rather, for its namespaces). But yeah, as @mateuszlewko showed, bind mounting the “host” cgroup mount point should help here.

Fwiw it is hard to diagnose your issues without many details about what commands you executed or the environment you ran them in. But anyway, let’s try to help 😃.

I have tried to reproduce your issues on my side on Ubuntu 21.04 and my first issue was that /sys/fs/cgroup is read-only:

$ ./nsjail --cgroup_mem_max 104857600  --user 99999 --group 99999 --disable_proc --chroot / --time_limit 100 /bin/bash --use_cgroupv2
[I][2022-05-20T01:27:07+0200] Mode: STANDALONE_ONCE
[I][2022-05-20T01:27:07+0200] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/bin/bash', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:100, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-05-20T01:27:07+0200] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2022-05-20T01:27:07+0200] Uid map: inside_uid:99999 outside_uid:1000 count:1 newuidmap:false
[I][2022-05-20T01:27:07+0200] Gid map: inside_gid:99999 outside_gid:1000 count:1 newgidmap:false
[W][2022-05-20T01:27:07+0200][30182] createCgroup():49 mkdir('/sys/fs/cgroup/NSJAIL.30183', 0700) failed: Read-only file system
[E][2022-05-20T01:27:07+0200][30182] initParent():411 Couldn't initialize cgroup 2 user namespace for pid=30183
[F][2022-05-20T01:27:07+0200][1] runChild():469 Launching child process failed

$ sudo ./nsjail --cgroup_mem_max 104857600  --user 99999 --group 99999 --disable_proc --chroot / --time_limit 100 /bin/bash --use_cgroupv2
[I][2022-05-20T01:27:09+0200] Mode: STANDALONE_ONCE
[I][2022-05-20T01:27:09+0200] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/bin/bash', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:100, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-05-20T01:27:09+0200] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2022-05-20T01:27:09+0200] Uid map: inside_uid:99999 outside_uid:0 count:1 newuidmap:false
[W][2022-05-20T01:27:09+0200][30188] logParams():265 Process will be UID/EUID=0 in the global user namespace, and will have user root-level access to files
[I][2022-05-20T01:27:09+0200] Gid map: inside_gid:99999 outside_gid:0 count:1 newgidmap:false
[W][2022-05-20T01:27:09+0200][30188] logParams():275 Process will be GID/EGID=0 in the global user namespace, and will have group root-level access to files
[W][2022-05-20T01:27:09+0200][30188] createCgroup():49 mkdir('/sys/fs/cgroup/NSJAIL.30189', 0700) failed: Read-only file system
[E][2022-05-20T01:27:09+0200][30188] initParent():411 Couldn't initialize cgroup 2 user namespace for pid=30189
[F][2022-05-20T01:27:09+0200][1] runChild():469 Launching child process failed

On my side, this is because I have both cgroups v1 and v2, and v2 is mounted at a different path:

$ mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755,inode64)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
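Picking the v2 mount point out of such a listing can be scripted. A minimal sketch that reads /proc/mounts (whitespace-separated fields: device, mount point, filesystem type, options); the function name find_cgroup2_mount is hypothetical.

```shell
# Sketch: print the mount point of the first cgroup2 filesystem.
# The function name is hypothetical.
# usage: find_cgroup2_mount [mounts_file]
find_cgroup2_mount() {
    # Field 3 is the filesystem type, field 2 the mount point.
    awk '$3 == "cgroup2" { print $2; exit }' "${1:-/proc/mounts}"
}

# Possible use:
# find_cgroup2_mount
```

On a hybrid setup like the one above this would print /sys/fs/cgroup/unified, which is the value to hand to --cgroupv2_mount.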

I was able to resolve this issue with the --cgroupv2_mount=/sys/fs/cgroup/unified flag:

$ sudo ./nsjail --cgroup_mem_max 104857600  --user 99999 --group 99999 --disable_proc --chroot / --time_limit 100 /bin/bash --use_cgroupv2 --cgroupv2_mount=/sys/fs/cgroup/unified
[I][2022-05-20T01:29:02+0200] Mode: STANDALONE_ONCE
[I][2022-05-20T01:29:02+0200] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/bin/bash', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:100, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-05-20T01:29:02+0200] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2022-05-20T01:29:02+0200] Uid map: inside_uid:99999 outside_uid:0 count:1 newuidmap:false
[W][2022-05-20T01:29:02+0200][30304] logParams():265 Process will be UID/EUID=0 in the global user namespace, and will have user root-level access to files
[I][2022-05-20T01:29:02+0200] Gid map: inside_gid:99999 outside_gid:0 count:1 newgidmap:false
[W][2022-05-20T01:29:02+0200][30304] logParams():275 Process will be GID/EGID=0 in the global user namespace, and will have group root-level access to files
[I][2022-05-20T01:29:02+0200] Setting 'memory.max' to '104857600'
[E][2022-05-20T01:29:02+0200][30304] writeBufToFile():95 Couldn't open '/sys/fs/cgroup/unified/NSJAIL.30305/memory.max' for writing: No such file or directory
[W][2022-05-20T01:29:02+0200][30304] writeToCgroup():61 Could not update memory.max
[E][2022-05-20T01:29:02+0200][30304] initParent():411 Couldn't initialize cgroup 2 user namespace for pid=30305
[F][2022-05-20T01:29:02+0200][1] runChild():469 Launching child process failed

But as we can see, now I am getting the error that @carlbordum was getting:

[E][2022-05-20T01:29:02+0200][30304] writeBufToFile():95 Couldn't open '/sys/fs/cgroup/unified/NSJAIL.30305/memory.max' for writing: No such file or directory

So what happens here? Well, while the cgroup v2 memory controller does indeed expose such a file, it does not exist on my side because… I don’t have the memory cgroup v2 controller enabled or even available! 😦

We can see that here: according to the kernel cgroup v2 documentation, the cgroup.controllers file should list the available controllers (e.g. memory io cpu):

$ cat /sys/fs/cgroup/unified/cgroup.controllers 
$

But it shows nothing instead! So why is that? Why are there no cgroupv2 controllers available?

If I understand correctly, this is related to what they write here:

cgroup2 filesystem has the magic number 0x63677270 (“cgrp”). All controllers which support v2 and are not bound to a v1 hierarchy are automatically bound to the v2 hierarchy and show up at the root. Controllers which are not in active use in the v2 hierarchy can be bound to other hierarchies. This allows mixing v2 hierarchy with the legacy v1 multiple hierarchies in a fully backward compatible way.

A controller can be moved across hierarchies only after the controller is no longer referenced in its current hierarchy. Because per-cgroup controller states are destroyed asynchronously and controllers may have lingering references, a controller may not show up immediately on the v2 hierarchy after the final umount of the previous hierarchy. Similarly, a controller should be fully disabled to be moved out of the unified hierarchy and it may take some time for the disabled controller to become available for other hierarchies; furthermore, due to inter-controller dependencies, other controllers may need to be disabled too.

While useful for development and manual configurations, moving controllers dynamically between the v2 and other hierarchies is strongly discouraged for production use. It is recommended to decide the hierarchies and controller associations before starting using the controllers after system boot.

During transition to v2, system management software might still automount the v1 cgroup filesystem and so hijack all controllers during boot, before manual intervention is possible. To make testing and experimenting easier, the kernel parameter cgroup_no_v1= allows disabling controllers in v1 and make them always available in v2.

It seems that a given controller may be bound either to v1 or to v2, but never to both. I guess this kinda makes sense, and just to recap, my memory controller is indeed bound to v1, as my mount output showed:

cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)

So if you are in the same situation as me, I guess the easiest fix is to change the kernel boot parameters and add cgroup_no_v1=memory, and/or exclude other controllers the same way (I am not sure which ones nsjail uses). Removing all processes from the v1 hierarchies at runtime may be hard, since a lot of this may be managed by systemd and idk if it supports v1->v2 migration.
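To see which controllers are currently held by a v1 hierarchy, /proc/cgroups can be consulted: its second column is the hierarchy ID, and 0 means the controller is not attached to any v1 hierarchy. A minimal sketch; the function name v1_bound_controllers is hypothetical.

```shell
# Sketch: list controllers that are attached to a cgroup v1 hierarchy.
# The function name is hypothetical.
# usage: v1_bound_controllers [proc_cgroups_file]
v1_bound_controllers() {
    # Columns: subsys_name, hierarchy, num_cgroups, enabled.
    # Skip the "#..." header; a non-zero hierarchy ID means v1 owns it.
    awk '!/^#/ && $2 != 0 { print $1 }' "${1:-/proc/cgroups}"
}

# Possible use: build a comma-separated candidate list for the boot parameter.
# echo "cgroup_no_v1=$(v1_bound_controllers | paste -sd, -)"
```

The flags used in this thread (--cgroup_mem_max, --cgroup_pids_max, --cgroup_cpu_ms_per_sec) touch the memory, pids and cpu controllers, so those are probably the ones worth checking first; the full list printed here may be broader than what you actually need to exclude.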

disconnect3d on May 19, 2022