sysbox: K8s projected volumes don't work when using systemd inside the container
I’m using an Ubuntu 20.04 node (kernel 5.15.0) running Kubernetes 1.26 and sysbox 0.6.3.
I’m trying to inject a ServiceAccount token into my pod using a Kubernetes projected volume:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sysbox-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: sysbox-test
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        io.kubernetes.cri-o.userns-mode: auto:size=65536
      labels:
        app.kubernetes.io/name: sysbox-test
    spec:
      containers:
      - command: ["sh", "-c", "exec /sbin/init"]
        image: nestybox/ubuntu-bionic-systemd:latest
        name: dev
        securityContext:
          allowPrivilegeEscalation: true
          privileged: false
          readOnlyRootFilesystem: false
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - mountPath: /var/run/secrets/serviceaccount
          name: token
          readOnly: true
      runtimeClassName: sysbox-runc
      securityContext:
        fsGroup: 0
        runAsNonRoot: false
        runAsUser: 0
      serviceAccount: my-service-account
      serviceAccountName: my-service-account
      volumes:
      - name: token
        projected:
          defaultMode: 420
          sources:
          - serviceAccountToken:
              expirationSeconds: 86400
              path: token
Systemd works fine from inside the container as expected, but if I try to get the injected token:
root@sysbox-test-d5846cdcb-7m75r:/# cat /var/run/secrets/serviceaccount/token
cat: /var/run/secrets/serviceaccount/token: No such file or directory
If I simply change the command to ["sh", "-c", "sleep 1000"], so that it doesn’t start systemd as PID 1, the token is injected successfully and I can read it.
I can see the mount with findmnt so I’m not sure why it’s failing to actually get mounted:
root@sysbox-test-d5846cdcb-7m75r:/# findmnt | grep serviceaccount
|-/run/secrets/serviceaccount /var/lib/sysbox/shiftfs/ef954be7-d6f7-492e-b448-f3b412a7399f shiftfs ro,relatime
I came across issue #728 while looking into this, so I checked the sysbox-mgr logs in case shiftfs wasn’t working properly, but it doesn’t seem to be the same issue reported there:
level=info msg="Starting ..."
level=info msg="Sysbox data root: /var/lib/sysbox"
level=info msg="Shiftfs module found in kernel: yes"
level=info msg="Shiftfs works properly: yes"
level=info msg="Shiftfs-on-overlayfs works properly: yes"
level=info msg="ID-mapped mounts supported by kernel: yes"
level=info msg="Overlayfs on ID-mapped mounts supported by kernel: no"
level=info msg="Operating in system container mode."
level=info msg="Inner container image preloading disabled."
level=info msg="Listening on /run/sysbox/sysmgr.sock"
level=info msg="Ready ..."
I don’t know why this would only happen when systemd is started as the container’s PID 1. Any insight is appreciated.
About this issue
- State: closed
- Created 5 months ago
- Comments: 18 (10 by maintainers)
Seems the bug is here in sysbox-runc.
That code ensures the mounts are ordered such that they don’t hide each other (e.g., mount /foo before /foo/bar). But it’s not doing it for a scenario where we have a tmpfs mount on /run and a bind-mount on /run/some/path.
Normally the higher-level container manager (e.g., Docker or K8s) sends the mounts in the correct order, but because Sysbox implicitly adds some mounts of its own (e.g., tmpfs on /run when systemd is PID 1), it needs to redo the ordering to take the implicit mounts into account. It seems it’s not doing that correctly for /run in systemd scenarios.
If it’s OK, I can try patching it and send you a new sysbox-runc binary that you can then use on the K8s node, to see if it fixes the problem. Unfortunately, I’ve not been able to reproduce this locally with Docker yet.
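To illustrate the ordering requirement described above, here is a minimal sketch in Python. It is not sysbox-runc's actual code (which is written in Go); the function name `order_mounts` and the sample mount list are made up for illustration, and the sketch assumes destinations are already absolute paths with symlinks such as /var/run resolved.

```python
from pathlib import PurePosixPath


def order_mounts(destinations):
    """Return mount destinations sorted so parents precede their children."""
    # A parent's path components are a prefix of its children's components,
    # and a prefix tuple always compares lower than the longer tuple, so
    # "/run" sorts before "/run/secrets/serviceaccount".
    return sorted(destinations, key=lambda d: PurePosixPath(d).parts)


# Hypothetical mount list: the bind mount comes from the container manager,
# and the tmpfs on /run is added implicitly afterwards.
mounts = ["/run/secrets/serviceaccount", "/proc", "/run"]
ordered = order_mounts(mounts)
print(ordered)  # ['/proc', '/run', '/run/secrets/serviceaccount']
```

If the implicit /run mount were simply appended without re-sorting, it would be mounted last and cover anything already mounted beneath /run.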
Thanks @jojonium, very helpful info.
I think I see the problem: in the above output, the /run/secrets/serviceaccount mount should have been a submount of the /run mount (similar to /run/lock), but it does not appear to be.
I suspect that Sysbox (incorrectly) did the /run mount after the /run/secrets/serviceaccount mount, and thus it’s hiding it. Let me check the code to see where the bug is.
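One way to check the submount relationship described above is to compare mount IDs and parent IDs in /proc/self/mountinfo, whose first, second, and fifth fields are the mount ID, parent ID, and mount point. The following Python sketch does that; the helper names and the sample lines (IDs, device numbers, and the shiftfs source path) are invented for illustration and are not taken from the reporter's system.

```python
def parse_mountinfo(text):
    """Map mount point -> (mount_id, parent_id) for each mountinfo line."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        # Fields: mount ID, parent ID, major:minor, root, mount point, ...
        # If several mounts share a mount point, the last line wins.
        table[fields[4]] = (int(fields[0]), int(fields[1]))
    return table


def is_submount(table, child, parent):
    """True if the parent ID of `child` is the mount ID of `parent`."""
    if child not in table or parent not in table:
        return False
    return table[child][1] == table[parent][0]


# Hypothetical mountinfo excerpt resembling the situation in this issue.
SAMPLE = """\
1500 1400 0:60 / / rw,relatime - overlay overlay rw
1510 1500 0:70 / /run rw,nosuid - tmpfs tmpfs rw
1511 1510 0:71 / /run/lock rw,nosuid - tmpfs tmpfs rw
1505 1500 0:65 / /run/secrets/serviceaccount ro,relatime - shiftfs /example/source ro
"""

table = parse_mountinfo(SAMPLE)
print(is_submount(table, "/run/lock", "/run"))                    # True
print(is_submount(table, "/run/secrets/serviceaccount", "/run"))  # False
```

Inside the container, the same functions could be applied to the contents of /proc/self/mountinfo instead of `SAMPLE`. A result of False for the service account path would be consistent with it having been mounted before /run and then hidden by it.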