ceph-csi: Need a workaround when the ceph-csi rbdplugin pod fails "fsck" on the disk.
Describe the bug
After a docker daemon restart, an application pod using an RBD-backed PVC was rescheduled, and the rescheduled pod fails to mount the volume: the rbdplugin maps the image successfully, but `fsck` finds errors on the device that it cannot correct, so NodeStageVolume fails and the pod stays in "ContainerCreating". Since the PVC is neither mounted nor mapped at that point, there is no obvious way to run `fsck` manually, so a workaround is needed.
Environment details
- Image/version of Ceph CSI driver : 1.2.1
- Helm chart version :
- Kernel version :
- Mounter used for mounting PVC (for cephfs its `fuse` or `kernel`, for rbd its `krbd` or `rbd-nbd`) : krbd
- Kubernetes cluster version : v1.15.4
- Ceph cluster version : v14.2.4
Steps to reproduce
Steps to reproduce the behavior:
- Setup details: a rook-ceph cluster is deployed, with an application using a CSI (RBD) volume.
- Deployment to trigger the issue: the docker daemon was restarted and the application pod was rescheduled. The rescheduled application pod failed to mount the volume.
- See error: the rbdplugin pod logs show the error messages below repeatedly. The complete log is attached.
I0727 17:06:41.056271 9037 utils.go:125] ID: 501047 GRPC response: {"usage":[{"available":3919605760,"total":5150212096,"unit":1,"used":1213829120},{"available":327195,"total":327680,"unit":2,"used":485}]}
I0727 17:06:53.137613 9037 utils.go:119] ID: 501048 GRPC call: /csi.v1.Node/NodeGetCapabilities
I0727 17:06:53.137636 9037 utils.go:120] ID: 501048 GRPC request: {}
I0727 17:06:53.138088 9037 utils.go:125] ID: 501048 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I0727 17:06:53.276368 9037 utils.go:119] ID: 501049 GRPC call: /csi.v1.Node/NodeStageVolume
I0727 17:06:53.276389 9037 utils.go:120] ID: 501049 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3c092a35-b824-46b5-a18f-c1e5db034cfd/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","imageFeatures":"layering","imageFormat":"2","pool":"csireplpool","storage.kubernetes.io/csiProvisionerIdentity":"1591172062273-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-b9a43413-a652-11ea-9a78-7ef490e8cee5"}
I0727 17:06:53.278296 9037 rbd_util.go:477] ID: 501049 setting disableInUseChecks on rbd volume to: false
I0727 17:06:53.325286 9037 rbd_util.go:140] ID: 501049 rbd: status csi-vol-b9a43413-a652-11ea-9a78-7ef490e8cee5 using mon 10.254.239.237:6789,10.254.1.104:6789,10.254.229.126:6789, pool csireplpool
W0727 17:06:53.387644 9037 rbd_util.go:162] ID: 501049 rbd: no watchers on csi-vol-b9a43413-a652-11ea-9a78-7ef490e8cee5
I0727 17:06:53.387673 9037 rbd_attach.go:202] ID: 501049 rbd: map mon 10.254.239.237:6789,10.254.1.104:6789,10.254.229.126:6789
I0727 17:06:53.452749 9037 nodeserver.go:147] ID: 501049 rbd image: 0001-0009-rook-ceph-0000000000000001-b9a43413-a652-11ea-9a78-7ef490e8cee5/csireplpool was successfully mapped at /dev/rbd9
I0727 17:06:53.452874 9037 mount_linux.go:515] Attempting to determine if disk "/dev/rbd9" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/rbd9])
I0727 17:06:53.462499 9037 mount_linux.go:518] Output: "DEVNAME=/dev/rbd9\nTYPE=ext4\n", err: <nil>
I0727 17:06:53.462524 9037 mount_linux.go:441] Checking for issues with fsck on disk: /dev/rbd9
E0727 17:06:53.487198 9037 nodeserver.go:345] ID: 501049 failed to mount device path (/dev/rbd9) to staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3c092a35-b824-46b5-a18f-c1e5db034cfd/globalmount/0001-0009-rook-ceph-0000000000000001-b9a43413-a652-11ea-9a78-7ef490e8cee5) for volume (0001-0009-rook-ceph-0000000000000001-b9a43413-a652-11ea-9a78-7ef490e8cee5) error 'fsck' found errors on device /dev/rbd9 but could not correct them: fsck from util-linux 2.23.2
/dev/rbd9: Superblock needs_recovery flag is clear, but journal has data.
/dev/rbd9: Run journal anyway
/dev/rbd9: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
.
E0727 17:06:53.565721 9037 utils.go:123] ID: 501049 GRPC error: rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd9 but could not correct them: fsck from util-linux 2.23.2
/dev/rbd9: Superblock needs_recovery flag is clear, but journal has data.
/dev/rbd9: Run journal anyway
/dev/rbd9: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
.
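For completeness, here is a quick way to confirm from the affected node what the plugin sees before it runs `fsck`. This is only a sketch assuming shell access to that node; the device name and blkid arguments are taken from the logs above.

```sh
# Assumption: shell access to the node where csi-rbdplugin mapped the image.
rbd showmapped                                    # list kernel-mapped rbd devices
blkid -p -s TYPE -s PTTYPE -o export /dev/rbd9    # same probe the plugin logs above
fsck -n /dev/rbd9                                 # read-only check; does not modify the device
```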
Actual results
The rescheduled app pod is stuck in the "ContainerCreating" state, failing to mount the volume.
Expected behavior
- The pod should be in "Running" state with the PVC attached.
- Currently the PVC is neither mounted nor mapped, so "fsck" cannot be run manually; a workaround is needed (see the sketch after this list).
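One possible manual recovery path, sketched below, is to map the image directly with `rbd` from a node (or the rook-ceph toolbox) that has the cluster credentials, repair the filesystem, unmap it, and let kubelet retry the mount. This is only a sketch based on the pool/image names in the logs above, not an official ceph-csi procedure; the exact `rbd` auth flags depend on the deployment.

```sh
# Assumption: run from a node or the rook-ceph toolbox that has the kernel rbd
# client and cluster credentials. Pool and image names come from the logs above.
POOL=csireplpool
IMAGE=csi-vol-b9a43413-a652-11ea-9a78-7ef490e8cee5

rbd status "$POOL/$IMAGE"        # confirm there are no watchers holding the image

DEV=$(rbd map "$POOL/$IMAGE")    # rbd map prints the mapped device path, e.g. /dev/rbd9
e2fsck -fy "$DEV"                # force a full check and repair of the ext4 filesystem
rbd unmap "$DEV"

# Delete the stuck application pod so kubelet retries NodeStageVolume.
kubectl delete pod <app-pod> -n <app-namespace>
```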
Logs
If the issue is in PVC mounting please attach complete logs of below containers.
- csi-rbdplugin/csi-cephfsplugin and driver-registrar container logs from the plugin pod on the node where the mount is failing: rook-logs07-27-18-45-rbdplugin.txt (see the log-collection sketch after this list)
- if required attach dmesg logs.
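For reference, a rough sketch of how the plugin-pod logs above can be collected in a rook-ceph deployment (the namespace and pod names are assumptions; adjust them for the cluster):

```sh
# Assumption: the CSI plugin pods run in the rook-ceph namespace.
kubectl -n rook-ceph get pods -o wide | grep csi-rbdplugin            # find the pod on the affected node
kubectl -n rook-ceph logs <csi-rbdplugin-pod> -c csi-rbdplugin     > rbdplugin.log
kubectl -n rook-ceph logs <csi-rbdplugin-pod> -c driver-registrar  > driver-registrar.log
```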
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 28
Our error is the same as yanchicago's, as follows:
`May 14 10:27:00 k8s-test-0-114 kubelet: E0514 10:27:00.723771 3657 csi_attacher.go:320] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd3 but could not correct them: fsck from util-linux 2.32.1
May 14 10:27:00 k8s-test-0-114 kubelet: /dev/rbd3 contains a file system with errors, check forced.
May 14 10:27:00 k8s-test-0-114 kubelet: /dev/rbd3: Inode 23 has an invalid extent node (blk 63775, lblk 5509)
May 14 10:27:00 k8s-test-0-114 kubelet: /dev/rbd3: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.`
The above error occurred after the physical machine went down; at the same time, a hard disk on that machine ended up in the same state and needed fsck to repair. We use the kernel (krbd) mounter, and restarting rbdplugin does not trigger this by itself (at least we have not seen it). We have only hit this once so far and have not been able to reproduce it since. We do not use Calico, and the CSI pods run with hostNetwork set to true.
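For anyone checking the same configuration, one way to confirm whether the CSI plugin pods use the host network is to query the daemonset. The rook-ceph namespace and csi-rbdplugin daemonset name below are assumptions for a default rook deployment:

```sh
# Assumption: default rook-ceph namespace and daemonset name.
kubectl -n rook-ceph get daemonset csi-rbdplugin \
  -o jsonpath='{.spec.template.spec.hostNetwork}'
```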
Many thanks for your support. 👍 We were able to recover the pod. Could you shed some light on how the IP address is selected for the "watcher"? We have a k8s cluster using Calico CNI in IPIP encapsulation mode. There are two subnets among the hosts, and the watcher IP seems to be allocated randomly between the two subnets. Is the watcher IP used in any way? Do you see any issues with this type of IP configuration?
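For reference, a hedged way to inspect the watcher address recorded for an image, and which source IP a node would pick to reach a monitor. With krbd the watcher address is simply the source address of the kernel client's connection to the cluster; the pool, image, and monitor values below are taken from the logs above and should be adjusted.

```sh
# Show the watchers (client addresses) currently registered on the image.
rbd status csireplpool/csi-vol-b9a43413-a652-11ea-9a78-7ef490e8cee5

# On the node, see which source IP routing would choose to reach a monitor;
# with krbd this is normally the address that shows up as the watcher.
ip route get 10.254.239.237
```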