rook: Rook CSI RBD Plugin can't start on host nodes that lack the findmnt binary
Is this a bug report or feature request?
- Bug Report
There is only one other issue that mentions findmnt (#3726), but I don't think it is related.
Deviation from expected behavior: csi-rbdplugin fails to start on any node in the cluster
Expected behavior: csi-rbdplugin should start on all nodes in the cluster
How to reproduce it (minimal and precise): Deploy Kubernetes via Rancher RKE onto some number of RancherOS (bare metal) nodes. Some things to note:
- This will deploy Kubernetes via Docker using hyperkube
- RancherOS (by default) doesn't supply a findmnt binary or a stat binary
- Using Rook via the Flex Volume system works fine on Rancher
Ensure you pass “extra_binds” arguments so kubelet gains access to host directories (much as is done when using the Flex volume system).
extra_binds:
- "/usr/libexec/kubernetes/kubelet-plugins/volume/exec:/usr/libexec/kubernetes/kubelet-plugins/volume/exec"
- "/var/lib/kubelet/plugins_registry:/var/lib/kubelet/plugins_registry"
- "/var/lib/kubelet/pods:/var/lib/kubelet/pods:shared,z"
I had a previously working Rook 1.0.6/Ceph 14.2.x installation and followed these upgrade instructions to move from 1.0.6 to 1.1.0. I had not used the Ceph CSI driver before now.
https://rook.io/docs/rook/v1.1/ceph-upgrade.html
At the step where the Rook operator is updated (I had to manually fix some minor RBAC issues, see #3868), csi-rbdplugin fails to deploy on each cluster host node. The related driver-registrar and liveness-prometheus containers deploy correctly.
File(s) to submit:
- Crashing pod(s) logs, if necessary
This is the log from csi-rbdplugin:
I0919 19:27:26.963369 19679 cephcsi.go:103] Driver version: v1.2.0 and Git version: c420ee6de9e2f90fcce97b2700c0fd57845225c3
E0919 19:27:26.965139 19679 cephcsi.go:128] Failed to get the PID limit, can not reconfigure: open /sys/fs/cgroup/pids/docker/e7caa2de7b1e1640df0b2382774a3bd653343364391fdfc99aa572dfaa6160f4/kubepods/besteffort/pod48425a55-daef-11e9-9ffd-e0db5570db32/8a9f453be111b13efed122df4bb56528d94f6c4a0662d56cd222ced80f4b3b39/pids.max: no such file or directory
I0919 19:27:26.965342 19679 cephcsi.go:158] Starting driver type: rbd with name: paas-rook-ceph.rbd.csi.ceph.com
I0919 19:27:26.976889 19679 mount_linux.go:170] Cannot run systemd-run, assuming non-systemd OS
I0919 19:27:26.976926 19679 mount_linux.go:171] systemd-run failed with: exit status 1
I0919 19:27:26.976953 19679 mount_linux.go:172] systemd-run output: Failed to create bus connection: No such file or directory
F0919 19:27:26.977175 19679 driver.go:145] failed to start node server, err unable to find findmnt
Discussion
I have tracked this down to the use of quay.io/cephcsi/cephcsi:v1.2.0 which eventually calls into Kubernetes code (in nsenter) that checks for the tools it needs:
binaries := []string{
    "mount",
    "findmnt",
    "umount",
    "systemd-run",
    "stat",
    "touch",
    "mkdir",
    "sh",
    "chmod",
    "realpath",
}
// search for the required commands in other locations besides /usr/bin
for _, binary := range binaries {
    // check for binary under the following directories
    for _, path := range []string{"/", "/bin", "/usr/sbin", "/usr/bin"} {
        binPath := filepath.Join(path, binary)
        if _, err := os.Stat(filepath.Join(ne.hostRootFsPath, binPath)); err != nil {
            continue
        }
        ne.paths[binary] = binPath
        break
    }
    // systemd-run is optional, bailout if we don't find any of the other binaries
    if ne.paths[binary] == "" && binary != "systemd-run" {
        return fmt.Errorf("unable to find %v", binary)
    }
}
I am fairly sure hostRootFsPath comes from the YAML Rook used to deploy csi-rbdplugin, so the check looks for each binary under the host's root filesystem (mounted at /rootfs) rather than inside the plugin container:
env:
  - name: HOST_ROOTFS
    value: /rootfs
...
volumeMounts:
  - mountPath: /rootfs
    name: host-rootfs
...
volumes:
  - hostPath:
      path: /
      type: ""
    name: host-rootfs
The binaries needed exist in the quay.io/cephcsi/cephcsi:v1.2.0 image. Is there a way to configure the CephCSI driver to look for the binaries in the container rather than in a filesystem mounted from the host?
Environment:
- OS (e.g. from /etc/os-release):
NAME="RancherOS"
VERSION=v1.5.1
ID=rancheros
ID_LIKE=
VERSION_ID=v1.5.1
PRETTY_NAME="RancherOS v1.5.1"
HOME_URL="http://rancher.com/rancher-os/"
SUPPORT_URL="https://forums.rancher.com/c/rancher-os"
BUG_REPORT_URL="https://github.com/rancher/os/issues"
BUILD_ID=
- Kernel (e.g. uname -a): Linux host-0 4.14.85-rancher #1 SMP Sat Dec 1 12:40:08 UTC 2018 x86_64 GNU/Linux
- Cloud provider or hardware configuration: Private bare metal
- Rook version (use rook version inside of a Rook Pod): rook: v1.1.0
- Storage backend version (e.g. for ceph do ceph -v): ceph version 14.2.3 (0f776cf838a1ae3130b2b73dc26be9c95c6ccc39) nautilus (stable)
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.9", GitCommit:"3e4f6a92de5f259ef313ad876bb008897f6a98f0", GitTreeState:"clean", BuildDate:"2019-08-05T09:22:00Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.6", GitCommit:"96fac5cd13a5dc064f7d9f4f23030a6aeface6cc", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:16Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Using Rancher 2.2.8
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
[root@host-0 /]# ceph health
HEALTH_OK
[root@host-0 /]#
About this issue
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 28 (11 by maintainers)
@Madhu-1: Great. Sorry, I didn't notice you had already added that.
I just did a fresh deployment from master, with the following added to operator.yaml:

After putting --containerized=false into the csi-rbdplugin daemonset, things seem to work fine. I ran a quick test with rbd.csi and cephfs.csi storage classes, and the PVs did provision and attach to pods as expected.
So, here's hoping for your fix for the original nsenter issue to make it into master soon.

@martinlindner Thanks for testing it out. Removal of nsenter is still under discussion in ceph-csi; will update the PR once it's fixed.

If you are using RKE / Rancher to deploy your Kubernetes on top of your Debian / RancherOS, then there is a solution from the Rancher/RKE side. If you are not using these, then this will not help.
You can tell Rancher / RKE to use the same kubelet root-dir setting, either by changing the configuration you give to RKE or by using Rancher to change the YAML that defines your cluster.
We force root-dir to be a uniform value, then we ensure that all the folders we need for flex and csi plugins get bind mounted into the kubelet container from the host.
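For example, the RKE-side configuration described here might look roughly like the following sketch; the services.kubelet keys follow the usual RKE cluster.yml layout, and the concrete root-dir value and bind list are placeholders you would replace with your own uniform value:

# cluster.yml (RKE) -- sketch, values are examples only
services:
  kubelet:
    extra_args:
      root-dir: /var/lib/kubelet
    extra_binds:
      - "/var/lib/kubelet:/var/lib/kubelet:shared,z"
      - "/usr/libexec/kubernetes/kubelet-plugins/volume/exec:/usr/libexec/kubernetes/kubelet-plugins/volume/exec"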
I have an issue open with the Rancher team that is related. https://github.com/rancher/rancher/issues/24886
@martinlindner Thanks for tackling where RKE puts stuff. I was still trying to use the extra_binds configuration with /var/lib/kubelet and had not even realized that RKE deployed kubelet to /opt/rke/var/lib/kubelet.

@Madhu-1 I can confirm that with the changes you provided for --containerized=false on csi-rbdplugin, and setting ROOK_CSI_KUBELET_DIR_PATH as provided by martinlindner, everything works as expected.

@Madhu-1 @martinlindner I can confirm that setting --containerized=false (from the current default of --containerized=true) does cause csi-rbdplugin to deploy properly. I will test that it correctly allocates volumes and report back.
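For reference, the combination that worked in these tests might look roughly like this; a sketch only, not the exact manifests (field placement depends on the Rook version, and the kubelet path is the RKE default mentioned above):

# rook-ceph-operator Deployment (operator.yaml) -- sketch
env:
  - name: ROOK_CSI_KUBELET_DIR_PATH
    value: "/opt/rke/var/lib/kubelet"

# csi-rbdplugin DaemonSet, csi-rbdplugin container -- sketch
args:
  - "--containerized=false"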