nsjail: Cannot make nsjail work on cgroupsv2 system
For example, when I run nsjail with --use_cgroupv2 --cgroupv2_mount /sys/fs/cgroup/NSJAIL, I still see errors like
writeBufToFile():95 Couldn't open '/sys/fs/cgroup/NSJAIL/NSJAIL.10/memory.max' for writing: No such file or directory
If I understand cgroups v2 correctly, it should look for /sys/fs/cgroup/NSJAIL/memory.max, not /sys/fs/cgroup/NSJAIL/NSJAIL.10/memory.max.
/sys/fs/cgroup/NSJAIL exists.
About this issue
- State: open
- Created 2 years ago
- Comments: 23 (9 by maintainers)
I looked at this a little more, since I know I’ve run into issues running on stock 22.04 as well. I tried this on a 22.04 desktop in virtualbox. I did see slightly different initial behavior in AWS, but I think what’s here should still be helpful.
Linux 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
For some reason, on boot the cpu controller is missing from cgroup.subtree_control (why? I have no idea):
Example 1: Running as root?
If you just straight up run nsjail now in the root cgroup (as sudo, so it can create its child cgroup), --cgroup_mem_max works fine, but if you set --cgroup_cpu_ms_per_sec you’ll get:
Fix:
(If you try the same thing on my fork, it’ll do this last line for you. Whether this is desirable behavior in general or not, I am not sure.)
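To make that last step concrete, here is a minimal Python sketch (not nsjail's actual code, and not the code from my fork) of checking which controllers are missing from cgroup.subtree_control and building the string you would write into it. The function names are made up for illustration, and the sketch assumes you pass it the directory of the cgroup whose children need the controllers.

```python
from pathlib import Path


def missing_controllers(cgroup_dir, wanted):
    """Return the controllers in `wanted` that are absent from cgroup.subtree_control."""
    enabled = (Path(cgroup_dir) / "cgroup.subtree_control").read_text().split()
    return [name for name in wanted if name not in enabled]


def enable_request(missing):
    """Build the text to write to cgroup.subtree_control, e.g. '+cpu +memory'."""
    return " ".join("+" + name for name in missing)


if __name__ == "__main__":
    # Hypothetical usage against the root cgroup; reading needs no privileges.
    todo = missing_controllers("/sys/fs/cgroup", ["memory", "cpu", "pids"])
    print(enable_request(todo) or "nothing to enable")
```

Actually writing the result into cgroup.subtree_control still needs write permission on that file, which is why the root example above was run with sudo.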
Example 2: Creating a cgroup, running non-root (by adding user to /sys/fs/cgroup/cgroup.procs)
OK, but what if instead of using the root cgroup, we want to make a new cgroup (as @mattgodbolt was trying), and give our user permissions to use it?
I think the permissions error @mattgodbolt was running into is due to the fact you don’t have permission to move processes out of the root cgroup? We can fix that:
Now nsjail can move its children into the appropriate cgroup, and we get a little further:
We need just one more thing, since /sys/fs/cgroup/jailtest/cgroup.subtree_control is empty:
Now nsjail works 😃
Example 3: Exec into a cgroup, do everything from there (non-root)
Now we are in the child cgroup… let’s try to run nsjail
Same issue… let’s do the same thing, right?
Why can’t we do this? It’s because the “no internal processes” rule won’t let us have controllers in cgroup.subtree_control if our cgroup currently has processes. First, let’s see how to fix this manually:
(Since we spawned a shell, we have a couple of processes in the jailtest3 cgroup – these have to be moved before we can add to subtree_control)
Now nsjail works!
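The manual fix boils down to: look at cgroup.procs, and if it is not empty, move every listed PID into a child cgroup before touching cgroup.subtree_control. Here is a rough Python sketch of that check. It only plans the moves and does not perform them; the helper names and the "leaf" subgroup name are my own invention, not anything nsjail uses.

```python
from pathlib import Path


def pids_in(cgroup_dir):
    """Return the PIDs currently listed in a cgroup's cgroup.procs file."""
    text = (Path(cgroup_dir) / "cgroup.procs").read_text()
    return [int(pid) for pid in text.split()]


def can_enable_subtree_control(cgroup_dir):
    """True when the cgroup holds no processes, so controllers may be added."""
    return not pids_in(cgroup_dir)


def plan_move_to_leaf(cgroup_dir, leaf_name="leaf"):
    """List the (pid, destination cgroup.procs) writes needed to empty the cgroup.

    `leaf_name` is a hypothetical child cgroup that has to be created first.
    """
    destination = str(Path(cgroup_dir) / leaf_name / "cgroup.procs")
    return [(pid, destination) for pid in pids_in(cgroup_dir)]
```

Each planned move corresponds to writing one PID into the destination cgroup.procs file. Once the plan is empty, the controllers can be added to cgroup.subtree_control.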
The point of my PR is to add controllers to cgroup.subtree_control (the very last command I ran in each of these examples) if they are not present. For the last example, it doesn’t seem like nsjail ought to move all those processes into a subgroup blindly – so my PR only handles the case where nsjail is the only process in the group.
I’ll try to summarize (hopefully correctly) here in case someone finds this later:
--cgroupv2_mount is the root at which nsjail will create its individual child process cgroups. nsjail needs to have permission to create cgroups (i.e. make subdirectories) at this path, and the cgroup needs to have either no processes in it, or just nsjail (in the case where nsjail is in this group, nsjail will move itself into a subgroup for technical reasons).
The cgroup that nsjail is running in is also important, because nsjail needs to have permission to remove its child processes from that cgroup (nsjail needs to have permissions for its current cgroup’s cgroup.procs file). By cgexecing you only need to chown the cgroup.procs file for that cgroup.
These groups do not have to be the same. It sounds like for many applications you could just as well create two cgroups:
- jailparentgroup that you run nsjail in (via cgexec)
- jailchildgroup that you pass as --cgroupv2_mount=/sys/fs/cgroup/jailchildgroup/ which is where nsjail would make subgroups with all the restrictions and move the child processes.
The user would need full ownership of /sys/fs/cgroup/jailchildgroup and additionally permission on /sys/fs/cgroup/jailparentgroup/cgroup.procs – if you cgcreate as above, you need no additional changes. (Creating separate groups as above also avoids any cgroup.subtree_control issues, since jailchildgroup would not have any processes, only sub-cgroups.)
Note I don’t think this trick improves the situation in a default (but privileged) Docker container, where your best bet is making sure that nsjail is the root process (and then nsjail will move itself to create a 2-group scenario similar to above).
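If you want to check the two permission requirements from this summary before launching nsjail, something like the following Python sketch might help. It is only a preflight check under my own assumptions: the function names are hypothetical, and it relies on the standard /proc/self/cgroup format, where the cgroup v2 entry is the line starting with "0::".

```python
import os


def current_cgroup_v2_path(proc_self_cgroup_text):
    """Extract the cgroup v2 path from the contents of /proc/self/cgroup."""
    for line in proc_self_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        hierarchy_id, controllers, path = parts
        # The v2 (unified) hierarchy always has ID 0 and an empty controller list.
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def preflight(cgroupv2_mount, current_cgroup_dir):
    """Return a list of permission problems; an empty list means both checks passed."""
    problems = []
    # Creating sub-cgroups means creating subdirectories under the mount path.
    if not os.access(cgroupv2_mount, os.W_OK | os.X_OK):
        problems.append(f"cannot create sub-cgroups in {cgroupv2_mount}")
    # Moving children out of the current cgroup needs write access to its cgroup.procs.
    procs_file = os.path.join(current_cgroup_dir, "cgroup.procs")
    if not os.access(procs_file, os.W_OK):
        problems.append(f"cannot write {procs_file}")
    return problems
```

For the two-cgroup setup above you would call preflight("/sys/fs/cgroup/jailchildgroup", "/sys/fs/cgroup/jailparentgroup"). Note that os.access only checks file permissions, so a passing result does not guarantee the kernel will accept the move.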
@mattgodbolt
The point of --detect_cgroupv2 (at least, as I intended it) was to allow you to specify options for both v1 and v2, and nsjail would infer which options are valid at runtime. So if you specify the ‘cgroupv2_mount’ and ‘detect_cgroupv2: true’ in the config file, it should be backwards-compatible. It will check if the v2 mount is a valid cgroupv2 filesystem and will use v2 only if it is.
As for the permissions error, I think you’re closest to the “Example 2” in my previous comment. My guess is nsjail does not have permission to move the child out of the current cgroup. You can fix this by either (1) spawning nsjail inside a cgroup it has permissions to move children out of (e.g. via cgexec or Docker), or (2) modifying the permissions on the cgroup.procs file for nsjail’s current cgroup (probably either the root one or the one associated with your terminal).
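For anyone who wants to do the same "is this path on a cgroup v2 filesystem?" check by hand, here is a small Python sketch. To be clear, this is not how nsjail implements --detect_cgroupv2; it is a stand-in that looks the path up in the contents of /proc/mounts, and the function names are made up.

```python
import os


def fstype_for(mounts_text, path):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is expected to be the contents of /proc/mounts.
    """
    path = os.path.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        inside = path == mount_point or path.startswith(mount_point.rstrip("/") + "/")
        # Prefer the longest matching mount point; on ties the later entry wins,
        # since it shadows the earlier one.
        if inside and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fstype
    return best_type


def is_cgroup2(mounts_text, path):
    """True when `path` lives on a filesystem of type cgroup2."""
    return fstype_for(mounts_text, path) == "cgroup2"


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        print(is_cgroup2(handle.read(), "/sys/fs/cgroup"))
```

On a hybrid v1/v2 system this should report False for /sys/fs/cgroup itself and True for the path where the unified hierarchy is mounted.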
@disconnect3d thank you for such cool in-depth analysis.
I’m vaguely familiar with cgroups2 myself, but I guess I can take a look at what can be improved here.
Though, if anyone will beat me to that, I won’t complain 😃
EDIT: below you can see some diagnosis of your issues, but I am wondering: is there any particular reason you want to use nsjail with cgroups v2 instead of v1?
Docker enables lots of options that may influence whether you can or cannot do a certain operation. For example, even if you use the --privileged flag, Docker will still use Linux namespaces, and specifically the cgroup namespace, which makes /sys/fs/cgroup/ show only the cgroup hierarchy that was created in this container (or rather: in the namespaces that were created for it). But yeah, what @mateuszlewko showed, bind mounting the “host” cgroup mount point, should help here.
Fwiw it is hard to diagnose your issues without more details about what commands you executed or the environment you ran them in. But anyway, let’s try to help 😃.
I have tried to reproduce your issues on my side on Ubuntu 21.04 and my first issue was that /sys/fs/cgroup is read-only:
On my side, this is because I have both cgroups v1 and v2 and v2 is mounted in a different path:
I was able to resolve this issue with the --cgroupv2_mount=/sys/fs/cgroup/unified flag:
But as we can see, now I am getting the error that @carlbordum was getting:
So what happens here? Well, while the cgroup v2 memory controller does expose such a file, it does not exist on my side because… I don’t have the memory cgroup v2 controller enabled or even available! 😦
We can see that here, as according to this kernel documentation page the cgroup.controllers file should list the available controllers (e.g. memory io cpu):
But it shows nothing instead! So why is that? Why are there no cgroupv2 controllers available?
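The link between the missing memory.max file and the empty cgroup.controllers file can be sketched in a few lines of Python. This relies on the cgroup v2 convention that a controller's interface files are prefixed with the controller name (memory.max belongs to memory); the function names are just for illustration.

```python
from pathlib import Path


def controller_for(interface_file):
    """Return the controller that owns an interface file, e.g. 'memory' for 'memory.max'."""
    return interface_file.split(".", 1)[0]


def controller_available(cgroup_dir, interface_file):
    """True when the owning controller is listed in cgroup.controllers of `cgroup_dir`."""
    available = Path(cgroup_dir, "cgroup.controllers").read_text().split()
    return controller_for(interface_file) in available


if __name__ == "__main__":
    # Hypothetical usage with the hybrid-layout path from this comment.
    print(controller_available("/sys/fs/cgroup/unified", "memory.max"))
```

With an empty cgroup.controllers this returns False for every controller file, which matches what I am seeing.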
If I understand correctly, this is related to what they write here:
It seems that a given controller may be bound either to v1 or to v2 but never to both of them. I guess this kinda makes sense, and just to recap, my memory controller is indeed bound to v1, as my mount output showed:
So if you are in the same situation as me, I guess the easiest fix is to change the kernel boot parameters and add cgroup_no_v1=memory there, and/or exclude other controllers (I’m not sure which ones nsjail uses), since removing all processes from cgroup v1 at runtime may be hard (e.g. lots of this may be managed by systemd, and idk if it supports v1->v2 migration).
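To figure out which controllers would need to go into cgroup_no_v1, you can look for the ones that are currently mounted as cgroup v1 hierarchies. Below is a rough Python sketch that parses the contents of /proc/mounts for that. The controller list you pass in is an assumption on my part (as said, I am not sure which ones nsjail needs), and the helper names are hypothetical.

```python
# Controller names that can appear as cgroup v1 mount options.
V1_CONTROLLERS = {
    "blkio", "cpu", "cpuacct", "cpuset", "devices", "freezer", "hugetlb",
    "memory", "net_cls", "net_prio", "perf_event", "pids", "rdma",
}


def v1_bound_controllers(mounts_text):
    """Return the set of controllers attached to cgroup v1 mounts."""
    bound = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 2 is the filesystem type; v1 hierarchies use 'cgroup', v2 uses 'cgroup2'.
        if len(fields) >= 4 and fields[2] == "cgroup":
            bound.update(opt for opt in fields[3].split(",") if opt in V1_CONTROLLERS)
    return bound


def cgroup_no_v1_param(needed, mounts_text):
    """Build a cgroup_no_v1=... boot parameter for the needed controllers held by v1."""
    blocked = sorted(set(needed) & v1_bound_controllers(mounts_text))
    return "cgroup_no_v1=" + ",".join(blocked) if blocked else ""


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        print(cgroup_no_v1_param(["memory", "cpu", "pids"], handle.read()))
```

An empty result means none of the controllers you asked about are bound to v1, so no boot parameter change should be needed for them.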