ceph-csi: Need a workaround when the ceph-csi rbdplugin pod fails "fsck" on the disk.
Describe the bug
After a docker daemon restart, an application pod using an RBD-backed PVC was rescheduled, and the rescheduled pod fails to mount the volume: the rbdplugin maps the image successfully, but `fsck` finds errors on the device that it cannot correct, so NodeStageVolume fails and the pod stays in "ContainerCreating". Since the PVC is neither mounted nor mapped at that point, there is no obvious way to run `fsck` manually, so a workaround is needed.
Environment details
- Image/version of Ceph CSI driver : 1.2.1
- Helm chart version :
- Kernel version :
- Mounter used for mounting PVC (for cephfs its `fuse` or `kernel`, for rbd its `krbd` or `rbd-nbd`) : krbd
- Kubernetes cluster version : v1.15.4
- Ceph cluster version : v14.2.4
Steps to reproduce
Steps to reproduce the behavior:
- Setup details: a rook-ceph cluster is deployed, with an application using a CSI (RBD) volume.
- Deployment to trigger the issue: the docker daemon was restarted and the application pod was rescheduled. The rescheduled application pod failed to mount the volume.
- See error: the rbdplugin pod logs show the error messages below repeatedly. The complete log is attached.
I0727 17:06:41.056271 9037 utils.go:125] ID: 501047 GRPC response: {"usage":[{"available":3919605760,"total":5150212096,"unit":1,"used":1213829120},{"available":327195,"total":327680,"unit":2,"used":485}]}
I0727 17:06:53.137613 9037 utils.go:119] ID: 501048 GRPC call: /csi.v1.Node/NodeGetCapabilities
I0727 17:06:53.137636 9037 utils.go:120] ID: 501048 GRPC request: {}
I0727 17:06:53.138088 9037 utils.go:125] ID: 501048 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I0727 17:06:53.276368 9037 utils.go:119] ID: 501049 GRPC call: /csi.v1.Node/NodeStageVolume
I0727 17:06:53.276389 9037 utils.go:120] ID: 501049 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3c092a35-b824-46b5-a18f-c1e5db034cfd/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"rook-ceph","imageFeatures":"layering","imageFormat":"2","pool":"csireplpool","storage.kubernetes.io/csiProvisionerIdentity":"1591172062273-8081-rook-ceph.rbd.csi.ceph.com"},"volume_id":"0001-0009-rook-ceph-0000000000000001-b9a43413-a652-11ea-9a78-7ef490e8cee5"}
I0727 17:06:53.278296 9037 rbd_util.go:477] ID: 501049 setting disableInUseChecks on rbd volume to: false
I0727 17:06:53.325286 9037 rbd_util.go:140] ID: 501049 rbd: status csi-vol-b9a43413-a652-11ea-9a78-7ef490e8cee5 using mon 10.254.239.237:6789,10.254.1.104:6789,10.254.229.126:6789, pool csireplpool
W0727 17:06:53.387644 9037 rbd_util.go:162] ID: 501049 rbd: no watchers on csi-vol-b9a43413-a652-11ea-9a78-7ef490e8cee5
I0727 17:06:53.387673 9037 rbd_attach.go:202] ID: 501049 rbd: map mon 10.254.239.237:6789,10.254.1.104:6789,10.254.229.126:6789
I0727 17:06:53.452749 9037 nodeserver.go:147] ID: 501049 rbd image: 0001-0009-rook-ceph-0000000000000001-b9a43413-a652-11ea-9a78-7ef490e8cee5/csireplpool was successfully mapped at /dev/rbd9
I0727 17:06:53.452874 9037 mount_linux.go:515] Attempting to determine if disk "/dev/rbd9" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/rbd9])
I0727 17:06:53.462499 9037 mount_linux.go:518] Output: "DEVNAME=/dev/rbd9\nTYPE=ext4\n", err: <nil>
I0727 17:06:53.462524 9037 mount_linux.go:441] Checking for issues with fsck on disk: /dev/rbd9
E0727 17:06:53.487198 9037 nodeserver.go:345] ID: 501049 failed to mount device path (/dev/rbd9) to staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3c092a35-b824-46b5-a18f-c1e5db034cfd/globalmount/0001-0009-rook-ceph-0000000000000001-b9a43413-a652-11ea-9a78-7ef490e8cee5) for volume (0001-0009-rook-ceph-0000000000000001-b9a43413-a652-11ea-9a78-7ef490e8cee5) error 'fsck' found errors on device /dev/rbd9 but could not correct them: fsck from util-linux 2.23.2
/dev/rbd9: Superblock needs_recovery flag is clear, but journal has data.
/dev/rbd9: Run journal anyway
/dev/rbd9: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
.
E0727 17:06:53.565721 9037 utils.go:123] ID: 501049 GRPC error: rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd9 but could not correct them: fsck from util-linux 2.23.2
/dev/rbd9: Superblock needs_recovery flag is clear, but journal has data.
/dev/rbd9: Run journal anyway
/dev/rbd9: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
.
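For completeness, here is a quick way to confirm from the affected node what the plugin sees before it runs `fsck`. This is only a sketch assuming shell access to that node; the device name and blkid arguments are taken from the logs above.

```sh
# Assumption: shell access to the node where csi-rbdplugin mapped the image.
rbd showmapped                                    # list kernel-mapped rbd devices
blkid -p -s TYPE -s PTTYPE -o export /dev/rbd9    # same probe the plugin logs above
fsck -n /dev/rbd9                                 # read-only check; does not modify the device
```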
Actual results
The rescheduled app pod is stuck in the "ContainerCreating" state, failing to mount the volume.
Expected behavior
- The pod should be in "Running" state with the PVC attached.
- Currently the PVC is neither mounted nor mapped, so "fsck" cannot be run manually; a workaround is needed (see the sketch after this list).
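One possible manual recovery path, sketched below, is to map the image directly with `rbd` from a node (or the rook-ceph toolbox) that has the cluster credentials, repair the filesystem, unmap it, and let kubelet retry the mount. This is only a sketch based on the pool/image names in the logs above, not an official ceph-csi procedure; the exact `rbd` auth flags depend on the deployment.

```sh
# Assumption: run from a node or the rook-ceph toolbox that has the kernel rbd
# client and cluster credentials. Pool and image names come from the logs above.
POOL=csireplpool
IMAGE=csi-vol-b9a43413-a652-11ea-9a78-7ef490e8cee5

rbd status "$POOL/$IMAGE"        # confirm there are no watchers holding the image

DEV=$(rbd map "$POOL/$IMAGE")    # rbd map prints the mapped device path, e.g. /dev/rbd9
e2fsck -fy "$DEV"                # force a full check and repair of the ext4 filesystem
rbd unmap "$DEV"

# Delete the stuck application pod so kubelet retries NodeStageVolume.
kubectl delete pod <app-pod> -n <app-namespace>
```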
Logs
If the issue is in PVC mounting please attach complete logs of below containers.
- csi-rbdplugin/csi-cephfsplugin and driver-registrar container logs from the plugin pod on the node where the mount is failing: rook-logs07-27-18-45-rbdplugin.txt (see the log-collection sketch after this list)
- if required attach dmesg logs.
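For reference, a rough sketch of how the plugin-pod logs above can be collected in a rook-ceph deployment (the namespace and pod names are assumptions; adjust them for the cluster):

```sh
# Assumption: the CSI plugin pods run in the rook-ceph namespace.
kubectl -n rook-ceph get pods -o wide | grep csi-rbdplugin            # find the pod on the affected node
kubectl -n rook-ceph logs <csi-rbdplugin-pod> -c csi-rbdplugin     > rbdplugin.log
kubectl -n rook-ceph logs <csi-rbdplugin-pod> -c driver-registrar  > driver-registrar.log
```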
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 28
Our error is the same as yanchicago's, as follows:
`May 14 10:27:00 k8s-test-0-114 kubelet: E0514 10:27:00.723771 3657 csi_attacher.go:320] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = Internal desc = 'fsck' found errors on device /dev/rbd3 but could not correct them: fsck from util-linux 2.32.1
May 14 10:27:00 k8s-test-0-114 kubelet: /dev/rbd3 contains a file system with errors, check forced.
May 14 10:27:00 k8s-test-0-114 kubelet: /dev/rbd3: Inode 23 has an invalid extent node (blk 63775, lblk 5509)
May 14 10:27:00 k8s-test-0-114 kubelet: /dev/rbd3: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.`
The above error occurred after the physical machine went down; at the same time, a hard disk on that machine ended up in the same state and needed fsck to repair. We use the kernel (krbd) mounter, and restarting rbdplugin does not trigger this by itself (at least we have not seen it). We have only hit this once so far and have not been able to reproduce it since. We do not use Calico, and the CSI pods run with hostNetwork set to true.
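For anyone checking the same configuration, one way to confirm whether the CSI plugin pods use the host network is to query the daemonset. The rook-ceph namespace and csi-rbdplugin daemonset name below are assumptions for a default rook deployment:

```sh
# Assumption: default rook-ceph namespace and daemonset name.
kubectl -n rook-ceph get daemonset csi-rbdplugin \
  -o jsonpath='{.spec.template.spec.hostNetwork}'
```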
Many thanks for your support. 👍 We were able to recover the pod. Could you shed some light on how the IP address is selected for the "watcher"? We have a k8s cluster using Calico CNI in IPIP encapsulation mode. There are two subnets among the hosts, and the watcher IP seems to be allocated randomly between the two subnets. Is the watcher IP used in any way? Do you see any issues with this type of IP configuration?
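For reference, a hedged way to inspect the watcher address recorded for an image, and which source IP a node would pick to reach a monitor. With krbd the watcher address is simply the source address of the kernel client's connection to the cluster; the pool, image, and monitor values below are taken from the logs above and should be adjusted.

```sh
# Show the watchers (client addresses) currently registered on the image.
rbd status csireplpool/csi-vol-b9a43413-a652-11ea-9a78-7ef490e8cee5

# On the node, see which source IP routing would choose to reach a monitor;
# with krbd this is normally the address that shows up as the watcher.
ip route get 10.254.239.237
```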