rook: CrashLoopBackOff error in OSD pods
I am trying to run rook-ceph in my AKS cluster, but my OSD pods are stuck in CrashLoopBackOff. I cloned the repo from https://github.com/rook/rook.git and my common.yaml is unchanged from the repo. Here is my operator.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rook-ceph-operator
  namespace: rook-ceph
  labels:
    operator: rook
    storage-backend: ceph
spec:
  selector:
    matchLabels:
      app: rook-ceph-operator
  replicas: 1
  template:
    metadata:
      labels:
        app: rook-ceph-operator
    spec:
      serviceAccountName: rook-ceph-system
      containers:
      - name: rook-ceph-operator
        image: rook/ceph:master
        args: ["ceph", "operator"]
        volumeMounts:
        - mountPath: /var/lib/rook
          name: rook-config
        - mountPath: /etc/ceph
          name: default-config-dir
        env:
        - name: ROOK_CURRENT_NAMESPACE_ONLY
          value: "false"
        - name: FLEXVOLUME_DIR_PATH
          value: "/etc/kubernetes/volumeplugins"
        - name: ROOK_ALLOW_MULTIPLE_FILESYSTEMS
          value: "false"
        - name: ROOK_LOG_LEVEL
          value: "INFO"
        - name: ROOK_CEPH_STATUS_CHECK_INTERVAL
          value: "60s"
        - name: ROOK_MON_HEALTHCHECK_INTERVAL
          value: "45s"
        - name: ROOK_MON_OUT_TIMEOUT
          value: "600s"
        - name: ROOK_DISCOVER_DEVICES_INTERVAL
          value: "60m"
        - name: ROOK_HOSTPATH_REQUIRES_PRIVILEGED
          value: "false"
        - name: ROOK_ENABLE_SELINUX_RELABELING
          value: "true"
        - name: ROOK_ENABLE_FSGROUP
          value: "true"
        - name: ROOK_DISABLE_DEVICE_HOTPLUG
          value: "false"
        - name: ROOK_ENABLE_FLEX_DRIVER
          value: "false"
        # Whether to start the discovery daemon to watch for raw storage devices on nodes in the cluster.
        # This daemon does not need to run if you are only going to create your OSDs based on
        # StorageClassDeviceSets with PVCs. --> CHANGED to false
        - name: ROOK_ENABLE_DISCOVERY_DAEMON
          value: "false"
        - name: ROOK_CSI_ENABLE_CEPHFS
          value: "true"
        - name: ROOK_CSI_ENABLE_RBD
          value: "true"
        - name: ROOK_CSI_ENABLE_GRPC_METRICS
          value: "true"
        - name: CSI_ENABLE_SNAPSHOTTER
          value: "true"
        - name: CSI_PROVISIONER_TOLERATIONS
          value: |
            - effect: NoSchedule
              key: storage-node
              operator: Exists
        - name: CSI_PLUGIN_TOLERATIONS
          value: |
            - effect: NoSchedule
              key: storage-node
              operator: Exists
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
      volumes:
      - name: rook-config
        emptyDir: {}
      - name: default-config-dir
        emptyDir: {}
And here is my cluster.yaml:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
    volumeClaimTemplate:
      spec:
        storageClassName: managed-premium
        resources:
          requests:
            storage: 10Gi
  cephVersion:
    image: ceph/ceph:v15.2.4
    allowUnsupported: false
  dashboard:
    enabled: true
    ssl: true
  network:
    hostNetwork: false
  placement:
    mon:
      tolerations:
      - key: storage-node
        operator: Exists
  storage:
    storageClassDeviceSets:
    - name: set1
      # The number of OSDs to create from this device set
      count: 4
      # IMPORTANT: If volumes specified by the storageClassName are not portable across nodes,
      # this needs to be set to false. For example, if using the local storage provisioner
      # this should be false.
      portable: true
      # Since the OSDs could end up on any node, an effort needs to be made to spread the OSDs
      # across nodes as much as possible. Unfortunately the pod anti-affinity breaks down
      # as soon as you have more than one OSD per node. If you have more OSDs than nodes, K8s may
      # choose to schedule many of them on the same node. What we need is the Pod Topology
      # Spread Constraints feature, which is alpha in K8s 1.16. This means that a feature gate must
      # be enabled for this feature, and Rook also still needs to add support for it.
      # Another approach for a small number of OSDs is to create a separate device set for each
      # zone (or other set of nodes with a common label) so that the OSDs will end up on different
      # nodes. This would require adding nodeAffinity to the placement here (see the sketch after
      # this manifest).
      placement:
        tolerations:
        - key: storage-node
          operator: Exists
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: agentpool
                operator: In
                values:
                - npstorage
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - rook-ceph-osd
                - key: app
                  operator: In
                  values:
                  - rook-ceph-osd-prepare
              topologyKey: kubernetes.io/hostname
      resources:
        limits:
          cpu: "500m"
          memory: "4Gi"
        requests:
          cpu: "500m"
          memory: "2Gi"
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          resources:
            requests:
              storage: 100Gi
          storageClassName: managed-premium
          volumeMode: Block
          accessModes:
          - ReadWriteOnce
  disruptionManagement:
    managePodBudgets: false
    osdMaintenanceTimeout: 30
    manageMachineDisruptionBudgets: false
    machineDisruptionBudgetNamespace: openshift-machine-api
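As the comment in the device set above notes, one alternative to topology spread constraints is a separate device set per zone. A minimal sketch of that approach, replacing the single count-4 set with two count-2 sets (the zone label key and the zone values here are assumptions; substitute whatever zone labels your nodes actually carry):

storageClassDeviceSets:
- name: set1-zone1
  count: 2
  portable: true
  placement:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: failure-domain.beta.kubernetes.io/zone   # assumed label key; check your nodes
            operator: In
            values:
            - eastus-1   # illustrative zone value
  # ...same tolerations, resources, and volumeClaimTemplates as the set above...
- name: set1-zone2
  count: 2
  portable: true
  placement:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: failure-domain.beta.kubernetes.io/zone   # assumed label key; check your nodes
            operator: In
            values:
            - eastus-2   # illustrative zone value
  # ...same tolerations, resources, and volumeClaimTemplates as the set above...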
This is my output:
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-6xzw6 3/3 Running 0 34m
csi-cephfsplugin-dtncx 3/3 Running 0 34m
csi-cephfsplugin-provisioner-67f9c99b5f-hvgtb 6/6 Running 0 34m
csi-cephfsplugin-provisioner-67f9c99b5f-xf2wx 6/6 Running 0 34m
csi-cephfsplugin-t7q9g 3/3 Running 0 34m
csi-cephfsplugin-tccnb 3/3 Running 0 34m
csi-cephfsplugin-tjxs7 3/3 Running 0 34m
csi-cephfsplugin-wxtsr 3/3 Running 0 34m
csi-rbdplugin-65z9v 3/3 Running 0 34m
csi-rbdplugin-6kdj4 3/3 Running 0 34m
csi-rbdplugin-9vlwn 3/3 Running 0 34m
csi-rbdplugin-dvsrq 3/3 Running 0 34m
csi-rbdplugin-phxjr 3/3 Running 0 34m
csi-rbdplugin-provisioner-5d5cfb887b-4f9vh 6/6 Running 0 34m
csi-rbdplugin-provisioner-5d5cfb887b-ww87t 6/6 Running 0 34m
csi-rbdplugin-qr9j2 3/3 Running 0 34m
rook-ceph-crashcollector-aks-agentpool-25228689-vmss0000007m2mh 1/1 Running 0 32m
rook-ceph-crashcollector-aks-npstorage-25228689-vmss000000j6fvg 1/1 Running 0 30m
rook-ceph-crashcollector-aks-npstorage-25228689-vmss0000016h7bl 1/1 Running 0 32m
rook-ceph-crashcollector-aks-rstudiomed-25228689-vmss000002pddf 1/1 Running 0 31m
rook-ceph-mgr-a-7575fdb658-7n4gn 1/1 Running 0 31m
rook-ceph-mon-a-6d44495c59-4rqh9 1/1 Running 0 33m
rook-ceph-mon-b-5d9cc8bc8d-47jdw 1/1 Running 0 32m
rook-ceph-mon-c-d4f6bcb45-s2dfp 1/1 Running 0 32m
rook-ceph-operator-78f46865d8-hgnbz 1/1 Running 0 36m
rook-ceph-osd-0-7989bc8b9-ndgzl 0/1 CrashLoopBackOff 10 30m
rook-ceph-osd-1-5f749bcd97-4cczm 0/1 CrashLoopBackOff 10 30m
rook-ceph-osd-2-58668bbb4b-68cxm 0/1 CrashLoopBackOff 10 30m
rook-ceph-osd-3-66844fbfb6-knvrn 0/1 CrashLoopBackOff 10 30m
rook-ceph-osd-prepare-set1-data-0-w6gf9-9phtm 0/1 Completed 0 31m
rook-ceph-osd-prepare-set1-data-1-2r2wj-kpqjm 0/1 Completed 0 31m
rook-ceph-osd-prepare-set1-data-2-mrdsz-2l84b 0/1 Completed 0 31m
rook-ceph-osd-prepare-set1-data-3-d2mr9-xbhjx 0/1 Completed 0 31m
And the output of kubectl describe for each CrashLoopBackOff pod is as follows:
kubectl describe pod -n rook-ceph rook-ceph-osd-0-7989bc8b9-ndgzl
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 31m default-scheduler Successfully assigned rook-ceph/rook-ceph-osd-0-7989bc8b9-ndgzl to aks-npstorage-25228689-vmss000000
Warning FailedAttachVolume 31m attachdetach-controller Multi-Attach error for volume "pvc-5b6acdb1-c524-4b55-ae60-549471369853" Volume is already used by pod(s) rook-ceph-osd-prepare-set1-data-0-w6gf9-9phtm
Normal SuccessfulAttachVolume 31m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-5b6acdb1-c524-4b55-ae60-549471369853"
Normal SuccessfulMountVolume 30m kubelet MapVolume.MapPodDevice succeeded for volume "pvc-5b6acdb1-c524-4b55-ae60-549471369853" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/azure-disk/volumeDevices/kubernetes-dynamic-pvc-5b6acdb1-c524-4b55-ae60-549471369853"
Normal SuccessfulMountVolume 30m kubelet MapVolume.MapPodDevice succeeded for volume "pvc-5b6acdb1-c524-4b55-ae60-549471369853" volumeMapPath "/var/lib/kubelet/pods/7b0f2572-132a-4b9b-912f-16ffe72238d9/volumeDevices/kubernetes.io~azure-disk"
Normal Pulled 30m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 30m kubelet Created container blkdevmapper
Normal Started 30m kubelet Started container blkdevmapper
Normal Pulled 30m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 30m kubelet Created container activate
Normal Started 30m kubelet Started container activate
Normal Pulled 30m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 30m kubelet Created container expand-bluefs
Normal Started 30m kubelet Started container expand-bluefs
Normal Pulled 30m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 30m kubelet Created container chown-container-data-dir
Normal Started 30m kubelet Started container chown-container-data-dir
Normal Started 30m (x2 over 30m) kubelet Started container osd
Normal Pulled 30m (x3 over 30m) kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 30m (x3 over 30m) kubelet Created container osd
Warning BackOff 40s (x148 over 30m) kubelet Back-off restarting failed container
kubectl describe pod -n rook-ceph rook-ceph-osd-1-5f749bcd97-4cczm
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 33m default-scheduler Successfully assigned rook-ceph/rook-ceph-osd-1-5f749bcd97-4cczm to aks-npstorage-25228689-vmss000001
Normal SuccessfulMountVolume 33m kubelet MapVolume.MapPodDevice succeeded for volume "pvc-f601df09-53e1-49d2-98ef-be7253d9153e" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/azure-disk/volumeDevices/kubernetes-dynamic-pvc-f601df09-53e1-49d2-98ef-be7253d9153e"
Normal SuccessfulMountVolume 33m kubelet MapVolume.MapPodDevice succeeded for volume "pvc-f601df09-53e1-49d2-98ef-be7253d9153e" volumeMapPath "/var/lib/kubelet/pods/d65bef28-17d4-47db-9058-6d772432ff64/volumeDevices/kubernetes.io~azure-disk"
Normal Created 33m kubelet Created container blkdevmapper
Normal Started 33m kubelet Started container blkdevmapper
Normal Pulled 33m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Pulled 33m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 33m kubelet Created container activate
Normal Started 33m kubelet Started container activate
Normal Started 33m kubelet Started container expand-bluefs
Normal Created 33m kubelet Created container expand-bluefs
Normal Pulled 33m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Pulled 33m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 33m kubelet Created container chown-container-data-dir
Normal Started 33m kubelet Started container chown-container-data-dir
Normal Started 33m (x2 over 33m) kubelet Started container osd
Normal Pulled 32m (x3 over 33m) kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 32m (x3 over 33m) kubelet Created container osd
Warning BackOff 3m13s (x147 over 33m) kubelet Back-off restarting failed container
kubectl describe pod -n rook-ceph rook-ceph-osd-2-58668bbb4b-68cxm
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 33m default-scheduler Successfully assigned rook-ceph/rook-ceph-osd-2-58668bbb4b-68cxm to aks-npstorage-25228689-vmss000001
Normal SuccessfulMountVolume 33m kubelet MapVolume.MapPodDevice succeeded for volume "pvc-b99de71f-411b-4922-a4f2-2d3b95a782be" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/azure-disk/volumeDevices/kubernetes-dynamic-pvc-b99de71f-411b-4922-a4f2-2d3b95a782be"
Normal SuccessfulMountVolume 33m kubelet MapVolume.MapPodDevice succeeded for volume "pvc-b99de71f-411b-4922-a4f2-2d3b95a782be" volumeMapPath "/var/lib/kubelet/pods/76b65d95-d78f-427a-913d-0bba65ea370e/volumeDevices/kubernetes.io~azure-disk"
Normal Pulled 33m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 33m kubelet Created container blkdevmapper
Normal Started 33m kubelet Started container blkdevmapper
Normal Pulled 33m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 33m kubelet Created container activate
Normal Started 33m kubelet Started container activate
Normal Pulled 33m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 33m kubelet Created container expand-bluefs
Normal Started 33m kubelet Started container expand-bluefs
Normal Pulled 33m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 33m kubelet Created container chown-container-data-dir
Normal Started 33m kubelet Started container chown-container-data-dir
Normal Started 33m (x2 over 33m) kubelet Started container osd
Normal Pulled 33m (x3 over 33m) kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 33m (x3 over 33m) kubelet Created container osd
Warning BackOff 3m43s (x148 over 33m) kubelet Back-off restarting failed container
kubectl describe pod -n rook-ceph rook-ceph-osd-3-66844fbfb6-knvrn
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 34m default-scheduler Successfully assigned rook-ceph/rook-ceph-osd-3-66844fbfb6-knvrn to aks-npstorage-25228689-vmss000000
Warning FailedAttachVolume 34m attachdetach-controller Multi-Attach error for volume "pvc-a4a1291f-743f-46fe-8234-5e67b8d053ee" Volume is already used by pod(s) rook-ceph-osd-prepare-set1-data-2-mrdsz-2l84b
Normal SuccessfulAttachVolume 33m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-a4a1291f-743f-46fe-8234-5e67b8d053ee"
Normal SuccessfulMountVolume 33m kubelet MapVolume.MapPodDevice succeeded for volume "pvc-a4a1291f-743f-46fe-8234-5e67b8d053ee" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/azure-disk/volumeDevices/kubernetes-dynamic-pvc-a4a1291f-743f-46fe-8234-5e67b8d053ee"
Normal SuccessfulMountVolume 33m kubelet MapVolume.MapPodDevice succeeded for volume "pvc-a4a1291f-743f-46fe-8234-5e67b8d053ee" volumeMapPath "/var/lib/kubelet/pods/5adfbb90-da86-41de-aa0e-50fd3f7104f1/volumeDevices/kubernetes.io~azure-disk"
Normal Pulled 33m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 33m kubelet Created container blkdevmapper
Normal Started 33m kubelet Started container blkdevmapper
Normal Pulled 33m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 33m kubelet Created container activate
Normal Started 33m kubelet Started container activate
Normal Pulled 33m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 33m kubelet Created container expand-bluefs
Normal Started 33m kubelet Started container expand-bluefs
Normal Pulled 33m kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 33m kubelet Created container chown-container-data-dir
Normal Started 33m kubelet Started container chown-container-data-dir
Normal Started 33m (x2 over 33m) kubelet Started container osd
Normal Pulled 33m (x3 over 33m) kubelet Container image "ceph/ceph:v15.2.4" already present on machine
Normal Created 33m (x3 over 33m) kubelet Created container osd
Warning BackOff 3m23s (x145 over 33m) kubelet Back-off restarting failed container
Any help would be appreciated!
Thanks @travisn for resolving the issue!
@vergilcw If in the same network, I would expect it to work, but haven’t tried it.
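For what it's worth, an untested sketch of what that could look like on AKS: an internal load balancer Service in front of the NFS server pods, so VMs on the same vnet can reach NFSv4 on port 2049. The selector label is an assumption based on Rook's rook-ceph-nfs naming; verify it against the actual pod labels before relying on it.

apiVersion: v1
kind: Service
metadata:
  name: rook-ceph-nfs-external
  namespace: rook-ceph
  annotations:
    # keep the load balancer on the private vnet rather than a public IP
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: rook-ceph-nfs   # assumed label; confirm with the labels on the running NFS pods
  ports:
  - name: nfs
    port: 2049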
@travisn thanks for helping my colleague @Siddhu1096 solve this problem!
To answer your question about NFS requirements, we need high performance NFS in the cluster for shared user home directories. The application (RStudio Server) has lots of I/O on small files in users’ home directories. We thought CephNFS looked promising for performance reasons.
There is a chance we may want to expose the NFS home directories to other VMs outside the AKS cluster but in the same network. Would that be possible/recommended with CephNFS?
I’d recommend the CephNFS CRD since it’s more directly integrated with Ceph. The other one is a more general NFS solution that can be backed by any storage.
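For reference, a minimal CephNFS manifest from that era of Rook looks roughly like this (a sketch; the pool name and server count are placeholders to adapt):

apiVersion: ceph.rook.io/v1
kind: CephNFS
metadata:
  name: my-nfs
  namespace: rook-ceph
spec:
  rados:
    # RADOS pool where ganesha stores its configuration and recovery state (placeholder name)
    pool: nfs-ganesha
    namespace: nfs-ns
  server:
    # number of active ganesha servers to run
    active: 1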
What is the need for NFS? Are you exposing the storage outside the AKS cluster?
The zone comes from the topology labels on the nodes. They are not set by Rook, Rook just consumes them.
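For example, the relevant labels on an AKS node look something like the following (illustrative values; newer clusters use the GA topology.kubernetes.io/* keys instead of the beta failure-domain ones):

labels:
  failure-domain.beta.kubernetes.io/region: eastus
  failure-domain.beta.kubernetes.io/zone: eastus-1
  kubernetes.io/hostname: aks-npstorage-25228689-vmss000000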
Are you also using the latest release instead of master, per my earlier suggestion?
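Concretely, that means changing the operator image in operator.yaml from the floating master tag to a pinned release tag (the tag below is a placeholder; take the actual latest tag from the Rook releases page):

containers:
- name: rook-ceph-operator
  image: rook/ceph:v1.4.x   # pinned release tag instead of rook/ceph:master; substitute the real tag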