runc: Rootless containers don't work from unprivileged non-root Docker container (operation not permitted for mounting procfs)
Running a rootless container inside Docker under a non-root user fails with "operation not permitted" while mounting procfs:
container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/mycontainer/rootfs\\\" at \\\"/proc\\\" caused \\\"operation not permitted\\\"\""
The current master version of runc actually fails a bit earlier because it does not handle a read-only cgroup filesystem, but I managed to fix that in https://github.com/opencontainers/runc/pull/1657, so assume below that this PR is applied.
I built the following Docker image to reproduce this issue (using runc from master with https://github.com/opencontainers/runc/pull/1657 applied). The image contains a user with uid/gid 1000/1000, which matches my host user (for which I have entries in /etc/subuid and /etc/subgid). I start a Docker container from this image and run runc inside it as the 1000/1000 user via su.
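For context, rootless setups rely on subordinate ID ranges granted to the user. The entries have the format name:first-subordinate-id:count; the username and range below are illustrative placeholders, not copied from my machine:

```
# /etc/subuid (and analogously /etc/subgid); illustrative values only
bob:100000:65536
```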
Dockerfile:
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y strace gdb less vim jq
# Busybox rootfs of some version.
COPY busybox.tar /
# Patched runc from master (with applied https://github.com/opencontainers/runc/pull/1657).
ADD runc /usr/local/bin/
RUN chmod +x /usr/local/bin/runc
RUN groupadd user -g 1000
RUN useradd -d /mycontainer -m -g user user
COPY prepare.sh /
COPY start.sh /
prepare.sh:
#!/bin/bash -eux
su -l user -c "mkdir -p /mycontainer/rootfs"
su -l user -c "mkdir -p /mycontainer/containerroot"
su -l user -c "tar -C /mycontainer/rootfs -xf /busybox.tar"
su -l user -c "cd /mycontainer/; runc spec --rootless"
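For reference, the rootless spec written by runc spec --rootless in prepare.sh maps the container's root user onto the unprivileged host user. The relevant excerpt of /mycontainer/config.json should look roughly like this (a sketch based on my understanding of runc's rootless spec, not captured from this image):

```json
"linux": {
  "uidMappings": [ { "containerID": 0, "hostID": 1000, "size": 1 } ],
  "gidMappings": [ { "containerID": 0, "hostID": 1000, "size": 1 } ]
}
```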
start.sh:
#!/bin/bash -eux
su -l user -c "cd /mycontainer; runc --root /mycontainer/containerroot run mycontainerid"
This image is pushed as rutsky/runc-rootless-in-docker:bugreport.
Steps to reproduce:
$ sudo docker run --rm --cap-add SYS_ADMIN --security-opt seccomp:unconfined --security-opt=apparmor:unconfined -ti rutsky/runc-rootless-in-docker:bugreport
root@d4ff244031d9:/# ./prepare.sh
+ su -l user -c 'mkdir -p /mycontainer/rootfs'
+ su -l user -c 'mkdir -p /mycontainer/containerroot'
+ su -l user -c 'tar -C /mycontainer/rootfs -xf /busybox.tar'
+ su -l user -c 'cd /mycontainer/; runc spec --rootless'
root@d4ff244031d9:/# ./start.sh
+ su -l user -c 'cd /mycontainer; runc --root /mycontainer/containerroot run mycontainerid'
container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/mycontainer/rootfs\\\" at \\\"/proc\\\" caused \\\"operation not permitted\\\"\""
root@d4ff244031d9:/#
The relevant part of the strace output, including the failed mount:
[pid 68] mount("", "/", 0xc42001b2ca, MS_REC|MS_SLAVE, NULL <unfinished ...>
[pid 69] <... pselect6 resumed> ) = 0 (Timeout)
[pid 68] <... mount resumed> ) = 0
[pid 69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
[pid 68] openat(AT_FDCWD, "/proc/self/mountinfo", O_RDONLY|O_CLOEXEC) = 8</proc/68/mountinfo>
[pid 68] epoll_ctl(7<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 8</proc/68/mountinfo>, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1036353280, u64=140128639352576}}) = 0
[pid 69] <... pselect6 resumed> ) = 0 (Timeout)
[pid 68] fcntl(8</proc/68/mountinfo>, F_GETFL <unfinished ...>
[pid 69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
[pid 68] <... fcntl resumed> ) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
[pid 68] fcntl(8</proc/68/mountinfo>, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
[pid 68] read(8</proc/68/mountinfo>, <unfinished ...>
[pid 69] <... pselect6 resumed> ) = 0 (Timeout)
[pid 69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
[pid 68] <... read resumed> "263 239 0:119 / / rw,relatime - "..., 4096) = 3855
[pid 69] <... pselect6 resumed> ) = 0 (Timeout)
[pid 69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL) = 0 (Timeout)
[pid 69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL) = 0 (Timeout)
[pid 69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL) = 0 (Timeout)
[pid 69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
[pid 68] read(8</proc/68/mountinfo>, "", 4096) = 0
[pid 68] epoll_ctl(7<anon_inode:[eventpoll]>, EPOLL_CTL_DEL, 8</proc/68/mountinfo>, 0xc4200fab0c <unfinished ...>
[pid 69] <... pselect6 resumed> ) = 0 (Timeout)
[pid 69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
[pid 68] <... epoll_ctl resumed> ) = 0
[pid 68] close(8</proc/68/mountinfo>) = 0
[pid 68] mount("/mycontainer/rootfs", "/mycontainer/rootfs", 0xc42001b5d0, MS_BIND|MS_REC, NULL <unfinished ...>
[pid 69] <... pselect6 resumed> ) = 0 (Timeout)
[pid 69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
[pid 68] <... mount resumed> ) = 0
[pid 68] stat("/mycontainer/rootfs/proc", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 68] mount("proc", "/mycontainer/rootfs/proc", "proc", 0, NULL) = -1 EPERM (Operation not permitted)
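For context on this EPERM (my reading of the discussion below, where the kernel's mount_too_revealing() check is identified as the cause): Docker masks parts of /proc with overmounts, and the kernel refuses a fresh procfs mount from a user namespace while such overmounts hide parts of the caller's view of /proc. The masks show up as child mounts under /proc in mountinfo. The sample lines below are a typical illustration, not captured from this container:

```shell
# List mount points that overmount paths under /proc (field 5 of a
# /proc/self/mountinfo line is the mount point).  The heredoc is a
# typical sample from a Docker container, used here for illustration.
awk '$5 ~ /^\/proc\/./ {print $5}' <<'EOF'
693 686 0:59 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
700 693 0:5 /null /proc/kcore rw,nosuid - devtmpfs udev rw
701 693 0:5 /null /proc/timer_list rw,nosuid - devtmpfs udev rw
702 693 0:62 / /proc/scsi ro,relatime - tmpfs tmpfs ro
EOF
# Prints: /proc/kcore, /proc/timer_list, /proc/scsi (one per line)
```

Running the same awk filter against the real /proc/self/mountinfo inside the affected container lists the actual masked paths.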
Tested on Ubuntu 16.04 on my desktop and on Ubuntu 16.04 in GKE. docker info output from both:
# Desktop
$ sudo docker info
Containers: 12
Running: 1
Paused: 0
Stopped: 11
Images: 199
Server Version: 17.09.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.10.0-38-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 11.63GiB
Name: bob-vaio
ID: EQPL:4SC2:YOP2:Z7IM:VEWI:ZSYQ:G7LG:UWWW:G24T:GSKL:3EJU:JT6H
Docker Root Dir: /srv/docker-data
Debug Mode (client): false
Debug Mode (server): false
Username: rutsky
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
# GKE
$ sudo docker info
Containers: 27
Running: 25
Paused: 0
Stopped: 2
Images: 24
Server Version: 1.12.6
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 139
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge overlay null host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-1027-gke
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.755 GiB
Name: gke-cluster-1-default-pool-163751e2-sg48
ID: 46OX:MIU5:TESN:HGMY:KSKR:34H7:MLG6:GHVN:AOAZ:XN56:LFCF:AWBB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
10.0.0.0/8
127.0.0.0/8
If I run the Docker container with the --privileged option, runc works as expected.
If I run runc with the rootless configuration directly under my host user, it also works as expected.
I tried disabling AppArmor system-wide; it doesn't help.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 33 (25 by maintainers)
definitely agree with you @rhatdan
(obviously if I did it it would be fine, but if anyone else did it then I would be horrified hahaha)
Sure, but it also gives you a false sense of security: "I only opened Pandora's box a little, so I feel better about opening it." Perhaps I am sensitive because I keep getting people asking me how to change SELinux to allow a container to write to the Docker socket. When I tell users that they should just run a privileged container, they say no, since they want to lock it down a little. Then a security analyser comes by and runs some tool that says they are not running any privileged containers, so they are good to go…
tl;dr: I have no problem with the patches in runc & Kubernetes. I am just exploring different possible workarounds for the same problem.
@jessfraz I could reproduce the bug. I have not yet tried your fix, but I have no reason to believe it would behave differently on my computer 😃 I just wanted to really understand the underlying mechanism, in order to see whether an easier solution would be possible: I would like to have unprivileged builds in Kubernetes as well, and I would prefer if that were possible without having to use a new "rawproc" option in Docker and Kubernetes.
What I learned today:
- Adding any fully visible procfs to the outer container makes it work: it does not need to be the one located at /proc, and I don't need to remove the masked paths at all.
- By adding -v /proc:/newproc, it works without the "rawproc" branch. So we could use this without patching Kubernetes or Docker.
- That makes the host processes visible in the container, though. But this can be avoided with:
Here I am using a procfs mount that refers to a dead pidns, so it does not have any processes inside. A bit hacky, but it works fine 😃
I would prefer if there was kernel support for mounting a new procfs in the inner container with the same masked paths as the outer container, though…
/proc is needed to support generic Dockerfile builds with arbitrary commands in RUN, but I might be lucky: I might only need to support a subset of Dockerfile that plays nice with a missing /proc.
I can reproduce this; it's super weird. Trying to find the cause.
@ulm0 Arch Linux did not support user namespaces at all for a really long time, that’s what that wiki article is talking about. While you do need user namespaces for rootless containers, the issue reported here is more than just a lack of user namespaces support.
I just ran into this while trying to use unshare -Urmpf --mount-proc inside a Kubernetes container. It took a while to dig up why the mount was failing, but once I found mount_too_revealing, this issue explained it nicely.
As far as I can tell, this should work now after @jessfraz's pull requests moby/moby#36644 and (for those of us on k8s) kubernetes/kubernetes#64283, right? Or is there something pending in runc for this?
I agree it’d be nicer to have a “real” fix to this, either a way to mount a new /proc while preserving the hidden files or a way to do unrestricted mounts of a limited procfs, as discussed in the most recent comments. (Though for my use case, I need a writable /proc/sys, which is tricky because Docker mounts /proc/sys read-only. There should still be a way to write to namespace-specific sysctls like kernel.ns_last_pid within my namespace, even if Docker wants to block access to its namespace… I agree with @brauner’s comment on the mailing list thread that I don’t see the point of the restriction, root can unmount the hiding and non-root can’t write to things they lack capabilities for anyway.) But I think that the approach of starting a Docker container that’s unprivileged but has an unmasked /proc should work fine today.
Thanks to everyone on this issue for both explaining the problem nicely and all the work on it 😃
My work-in-progress attempt to make unprivileged new proc mounts possible: https://lists.linuxfoundation.org/pipermail/containers/2018-April/038840.html
As far as I understand, masked dirs are just bind mounts over paths like /proc/kcore (in the outer container), without any further mechanisms like seccomp. When preparing the inner container by mounting a new proc on /home/user/rootfs/proc, how does the kernel know that a bind mount on an unrelated directory (/proc/kcore) is supposed to block the mount() syscall by returning EPERM? /proc itself (in the outer container) is not masked or read-only, only some files inside it are. And the mountpoint /home/user/rootfs/proc (for the inner container) does not have anything masked inside. So obviously I'm missing a detail in the story.
It is my understanding that not masking those paths would allow an escape hatch out of containment, although I would figure most of these are also blocked by SELinux. If you set a flag that leads to easy breakout, then why not just use --privileged?
I do agree that allowing mounting over the masked paths seems to make sense; I could even see allowing processes to write to a tmpfs there so they are fooled.
The one advantage of having multiple flags to turn security features on and off is for developers trying to figure out whether SELinux, AppArmor, dropped capabilities, the device cgroup, seccomp, NO_NEW_PRIVS, read-only mounts, or masked mounts is causing a failure. Years ago we attempted to get a patch into the kernel, called FriendlyEperm, that would have written something into the logs telling a process why it was getting an access denial, but it could not be done without being racy.