sysbox: K8s projected volumes don't work when using systemd inside the container

I’m using an Ubuntu 20.04 (kernel 5.15.0) node running Kubernetes 1.26 and sysbox 0.6.3.

I’m trying to inject a ServiceAccount token into my pod using a Kubernetes projected volume.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sysbox-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: sysbox-test
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        io.kubernetes.cri-o.userns-mode: auto:size=65536
      labels:
        app.kubernetes.io/name: sysbox-test
    spec:
      containers:
      - command: ["sh", "-c", "exec /sbin/init"]
        image: nestybox/ubuntu-bionic-systemd:latest
        name: dev
        securityContext:
          allowPrivilegeEscalation: true
          privileged: false
          readOnlyRootFilesystem: false
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - mountPath: /var/run/secrets/serviceaccount
          name: token
          readOnly: true
      runtimeClassName: sysbox-runc
      securityContext:
        fsGroup: 0
        runAsNonRoot: false
        runAsUser: 0
      serviceAccount: my-service-account
      serviceAccountName: my-service-account
      volumes:
      - name: token
        projected:
          defaultMode: 420
          sources:
          - serviceAccountToken:
              expirationSeconds: 86400
              path: token

Systemd works fine from inside the container as expected, but if I try to get the injected token:

root@sysbox-test-d5846cdcb-7m75r:/# cat /var/run/secrets/serviceaccount/token
cat: /var/run/secrets/serviceaccount/token: No such file or directory

If I simply change the command to ["sh", "-c", "sleep 1000"], so that it doesn’t start systemd as PID 1, the token is injected successfully and I can read it.

I can see the mount with findmnt, so I’m not sure why the token isn’t visible at that path:

root@sysbox-test-d5846cdcb-7m75r:/# findmnt | grep serviceaccount
|-/run/secrets/serviceaccount  /var/lib/sysbox/shiftfs/ef954be7-d6f7-492e-b448-f3b412a7399f  shiftfs  ro,relatime

I came across issue #728 while looking into this, so I checked the sysbox-mgr logs in case shiftfs wasn’t working properly, but it doesn’t seem to be the same issue reported there:

level=info msg="Starting ..."
level=info msg="Sysbox data root: /var/lib/sysbox"
level=info msg="Shiftfs module found in kernel: yes"
level=info msg="Shiftfs works properly: yes"
level=info msg="Shiftfs-on-overlayfs works properly: yes"
level=info msg="ID-mapped mounts supported by kernel: yes"
level=info msg="Overlayfs on ID-mapped mounts supported by kernel: no"
level=info msg="Operating in system container mode."
level=info msg="Inner container image preloading disabled."
level=info msg="Listening on /run/sysbox/sysmgr.sock"
level=info msg="Ready ..."

I don’t know why this would only happen when systemd is started as the container’s PID 1. Any insight is appreciated.

About this issue

  • State: closed
  • Created 5 months ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

I suspect that Sysbox (incorrectly) did the /run mount after the /run/secrets/serviceaccount mount and thus it’s hiding it. Let me check the code to see where the bug is.

Seems the bug is here in sysbox-runc.

That code ensures the mounts are ordered such that they don’t hide each other (e.g., mount /foo before /foo/bar). But it’s not doing so for a scenario where we have a tmpfs mount on /run and a bind-mount on /run/some/path.

Normally the higher-level container manager (e.g., Docker or K8s) sends the mounts in the correct order, but because Sysbox implicitly adds some mounts of its own (e.g., tmpfs on /run when systemd is PID 1), it needs to redo the ordering to take the implicit mounts into account. It seems it’s not doing this correctly for /run in systemd scenarios.
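
To illustrate the ordering requirement, here is a minimal sketch in Python. It is not the actual sysbox-runc code (which is written in Go); the function name order_mounts and the sample mount list are hypothetical. The idea is simply to sort by path depth so that a parent such as /run is always mounted before anything beneath it, whether the mount came from the container manager or was added implicitly.

```python
import posixpath


def order_mounts(mounts):
    """Return mounts sorted so that parent destinations come before children.

    `mounts` is a list of (destination, fstype) tuples. Sorting by path depth
    puts /run ahead of /run/secrets/serviceaccount. Python's sort is stable,
    so mounts at the same depth keep their original relative order.
    """
    def depth(mount):
        dest = posixpath.normpath(mount[0])
        return 0 if dest == "/" else dest.count("/")

    return sorted(mounts, key=depth)


# Hypothetical input: the bind mount requested by Kubernetes arrives first,
# then the runtime appends its implicit tmpfs mounts for systemd.
mounts = [
    ("/run/secrets/serviceaccount", "bind"),
    ("/run", "tmpfs"),
    ("/run/lock", "tmpfs"),
]

ordered = order_mounts(mounts)
for dest, fstype in ordered:
    print(dest, fstype)
```

With this ordering the tmpfs on /run is set up first, so the later bind-mount lands inside it instead of being covered by it.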

If it’s OK, I can try patching it and send you a new sysbox-runc binary that you can then use on the K8s node, to see if it fixes the problem. I’ve not been able to reproduce locally with Docker yet unfortunately.

Thanks @jojonium, very helpful info.

This is from inside the container with systemd enabled:

root@sysbox-test-6f7b77dbd8-rc478:/# findmnt | grep run
|-/run                         tmpfs  tmpfs  rw,nosuid,nodev,mode=755,uid=296608,gid=296608,inode64
| `-/run/lock                  tmpfs  tmpfs  rw,nosuid,nodev,noexec,relatime,size=5120k,uid=296608,gid=296608,inode64
|-/run/.containerenv           /var/lib/sysbox/shiftfs/d8d04a15-949c-4f42-a2aa-9a42f286ed27[/.containerenv]  shiftfs  rw,relatime
|-/run/secrets/serviceaccount  /var/lib/sysbox/shiftfs/a8354163-5638-4430-900a-e0ffba0dbc6a  shiftfs  ro,relatime

I think I see the problem; in the above output, the /run/secrets/serviceaccount mount should have been a submount of the /run mount (similar to /run/lock), but it does not appear to be.
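
The same check can be done programmatically from /proc/self/mountinfo, where each line starts with the mount ID and parent ID and carries the mount point in the fifth field. The sketch below is a heuristic, not something Sysbox ships: the function names and the sample mountinfo text (including the mount IDs and the shiftfs source path) are made up for illustration. It flags a mount whose recorded parent sits higher in the tree than another mount that also lies on its path.

```python
def parse_mountinfo(text):
    """Parse mountinfo-style lines into {mount_id: (parent_id, mount_point)}."""
    table = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Field 1 is the mount ID, field 2 the parent ID, field 5 the mount point.
        table[int(fields[0])] = (int(fields[1]), fields[4])
    return table


def is_ancestor(parent, child):
    """True if `parent` is a proper ancestor directory of `child`."""
    return child.startswith(parent.rstrip("/") + "/")


def hidden_by(table, target):
    """Return the mount point that appears to hide `target`, or None.

    Heuristic: if some mount sits on an ancestor directory of `target` and is
    deeper than the mount point of target's recorded parent, then `target`
    was mounted before that ancestor and is now covered by it.
    """
    for parent_id, mount_point in table.values():
        if mount_point == target:
            parent_point = table.get(parent_id, (0, "/"))[1]
            covering = [
                mp for _, mp in table.values()
                if is_ancestor(mp, target) and len(mp) > len(parent_point)
            ]
            return max(covering, key=len) if covering else None
    return None


# Hypothetical mountinfo excerpt resembling the findmnt output above.
SAMPLE = """
100 90 0:50 / / rw,relatime - overlay overlay rw
101 100 0:51 / /run rw,nosuid,nodev - tmpfs tmpfs rw,mode=755
102 101 0:52 / /run/lock rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,size=5120k
103 100 0:53 / /run/secrets/serviceaccount ro,relatime - shiftfs /var/lib/sysbox/shiftfs/example ro
"""

table = parse_mountinfo(SAMPLE)
print(hidden_by(table, "/run/secrets/serviceaccount"))
print(hidden_by(table, "/run/lock"))
```

In this sample the serviceaccount mount is parented to the root mount rather than to the tmpfs on /run, so the first call reports /run as the covering mount, while /run/lock is a proper submount and the second call returns None.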
