longhorn: [BUG] Corruption using XFS after node restart or pod scale
Describe the bug
Upon either restarting a kubernetes worker node hosting Longhorn replicas, or during pod scaling (e.g. scaling a StatefulSet to 0), I’ve sometimes experienced silent data corruption. Following the Longhorn docs for detecting and deleting a failed replica does find a different hash, but deleting the replica doesn’t solve the problem. Ultimately restoring from a snapshot is the only fix.
Of note, I’m using XFS as the underlying filesystem, and all affected apps have used SQLite.
Unfortunately/fortunately, this problem seems to be non-deterministic. I’ve experienced it twice: once when restarting a node for maintenance, and once when scaling an app down and back up.
To Reproduce
Steps to reproduce the behavior:
- Use StatefulSets with VolumeClaimTemplates to define PVCs.
- Restart a node containing replicas, or scale StatefulSets utilizing the PVCs to 0 and back to n>0.
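For reference, a minimal sketch of the scale-down/up step; the namespace, StatefulSet name, and replica count are placeholders rather than values taken from this report:
kubectl -n <namespace> scale statefulset <app> --replicas=0
# wait for the pods to terminate and the Longhorn volume to show as detached
kubectl -n longhorn-system get volumes.longhorn.io
kubectl -n <namespace> scale statefulset <app> --replicas=1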
Expected behavior
The volumes return to service with no corruption.
Log or Support bundle
Example dmesg from a longhorn-manager pod. Note that while this shows the recovery completing, others do not. I unfortunately don’t have timestamps to match up a failed one, although I can confirm that I experienced issues with this linked PVC, requiring a snapshot recovery.
[ 1332.278329] XFS (sdc): Metadata CRC error detected at xfs_agfl_read_verify+0xa2/0xf0 [xfs], xfs_agfl block 0x3
[ 1332.278332] XFS (sdc): Unmount and run xfs_repair
[ 1332.278334] XFS (sdc): First 128 bytes of corrupted metadata buffer:
[ 1332.278343] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278344] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278345] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278346] 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278346] 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278347] 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278348] 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278348] 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278368] XFS (sdc): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x3 len 1 error 74
[ 1332.365389] XFS (sdc): xfs_do_force_shutdown(0x8) called from line 446 of file fs/xfs/libxfs/xfs_defer.c. Return address = 0000000061a87018
[ 1332.365393] XFS (sdc): Corruption of in-memory data detected. Shutting down filesystem
[ 1332.365394] XFS (sdc): Please unmount the filesystem and rectify the problem(s)
[ 1447.117969] cni0: port 13(veth6f28afba) entered disabled state
[ 1447.129033] device veth6f28afba left promiscuous mode
[ 1447.129059] cni0: port 13(veth6f28afba) entered disabled state
[ 1448.304291] XFS (sdc): Unmounting Filesystem
[ 1452.882691] sd 7:0:0:1: [sdc] Synchronizing SCSI cache
[ 1459.419198] scsi host7: iSCSI Initiator over TCP/IP
[ 1459.448202] scsi 7:0:0:0: RAID IET Controller 0001 PQ: 0 ANSI: 5
[ 1459.450009] scsi 7:0:0:0: Attached scsi generic sg3 type 12
[ 1459.451601] scsi 7:0:0:1: Direct-Access IET VIRTUAL-DISK 0001 PQ: 0 ANSI: 5
[ 1459.452849] sd 7:0:0:1: Attached scsi generic sg4 type 0
[ 1459.453233] sd 7:0:0:1: Power-on or device reset occurred
[ 1459.455747] sd 7:0:0:1: [sdc] 4194304 512-byte logical blocks: (2.15 GB/2.00 GiB)
[ 1459.456106] sd 7:0:0:1: [sdc] Write Protect is off
[ 1459.456111] sd 7:0:0:1: [sdc] Mode Sense: 69 00 10 08
[ 1459.456794] sd 7:0:0:1: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 1460.586295] sd 7:0:0:1: [sdc] Attached SCSI disk
[ 1469.044275] XFS (sdc): Mounting V5 Filesystem
[ 1471.123131] XFS (sdc): Starting recovery (logdev: internal)
[ 1471.572637] XFS (sdc): Ending recovery (logdev: internal)
[ 1471.624639] xfs filesystem being mounted at /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-81fb33a4-4b2e-425b-b6df-9dbd7af36034/globalmount supports timestamps until 2038 (0x7fffffff)
[ 1471.850475] xfs filesystem being remounted at /var/lib/kubelet/pods/f0f3ff6b-841c-4c4e-a0f9-28cbd9fbac80/volumes/kubernetes.io~csi/pvc-81fb33a4-4b2e-425b-b6df-9dbd7af36034/mount supports timestamps until 2038 (0x7fffffff)
/sys/devices/system/edac/mc/mc{0,1} on each node shows 0 corrected and 0 uncorrected errors.
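For reference, one way to read those EDAC counters, assuming the standard edac sysfs layout:
grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count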
Environment
- Longhorn version: v1.2.3
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s v1.21.5+k3s2
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 3
- Node config
  - OS type and version: k3OS v0.21.5-k3s2r1
  - CPU per node: E5-2650 v2
    - Manager: 4
    - Worker: 28
  - Memory per node: DDR3-PC10600R
    - Manager: 8 Gi
    - Worker: 24 Gi
  - Disk type (e.g. SSD/NVMe): SSD (Intel DC S3500 300 GB)
  - Network bandwidth between the nodes: 1 GbE
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): KVM (Proxmox)
- Number of Longhorn volumes in the cluster: 20
About this issue
- State: closed
- Created 2 years ago
- Comments: 39 (15 by maintainers)
We’ve identified the root cause of the filesystem corruption issue. In this report, we will explain the corruption occurring in an ext4 filesystem; the same issue is also seen in an XFS filesystem.
Reproduce steps
The script can help reproduce the issue.
mkfs.ext4 -b 1024 /dev/longhorn/<pvc-name> on the worker-1 node
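The full reproduction script is not included above; a minimal sketch of the same flow, assuming the volume is already attached to worker-1 and using placeholder device, mount-point, and replica names, could look like this:
# Format the Longhorn block device with a 1024-byte filesystem block size
mkfs.ext4 -b 1024 /dev/longhorn/<pvc-name>
mkdir -p /mnt/test && mount /dev/longhorn/<pvc-name> /mnt/test
# Keep issuing small synchronous writes in the background
dd if=/dev/urandom of=/mnt/test/data bs=1024 count=100000 oflag=direct,dsync &
DD_PID=$!
# Trigger a rebuild by deleting one replica (via the Longhorn UI, or directly against the CRD)
kubectl -n longhorn-system delete replicas.longhorn.io <replica-name>
# After the rebuild finishes, stop the writer and check the filesystem
kill $DD_PID 2>/dev/null; wait
umount /mnt/test
fsck.ext4 -fn /dev/longhorn/<pvc-name>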
Root Cause Analysis
The illustration shows how ext4 filesystem corruption and mount failure are caused. The root cause is a mismatch between the logical block size (1024 bytes) of the iSCSI block device and the copy-on-write (COW) block size (4096 bytes) in the Longhorn replica.
The full story is:
FIG 1 & 2: After one replica is deleted, Longhorn starts rebuilding it; the rebuilding replica is marked as WO (write-only) mode and a snapshot is taken. At this point the snapshot is still empty. A new write I/O (offset=1024, length=1024) arrives before the snapshot has been fully received from the healthy replica (FIG 2).
FIG 3: This write I/O also triggers a copy-on-write. The block size (replica.volume.sectorSize) used by the copy-on-write operation is 4096 bytes, so the data buffer contains two parts: the 1024 bytes of newly written data and the rest of the 4096-byte block.
FIG 4 & 5: After the snapshot has been fully received from the healthy replica, the prune operation removes the redundant blocks. As a result, the snapshots of the two replicas are identical, but the volume-heads are mismatched.
For an XFS filesystem, the default data block size is 4096 bytes, but the superblock size is 512 bytes. The block size information can be checked with xfs_info /dev/longhorn/<pvc-name>. Hence, the mismatch between the superblock size and the Longhorn COW logic causes the data corruption.
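For reference, the sizes involved can be compared with something like the following (the device path is a placeholder); xfs_info reports the filesystem sector and block sizes, while blockdev reports what the block device itself advertises:
xfs_info /dev/longhorn/<pvc-name> | grep -E 'sectsz|bsize'
blockdev --getss --getpbsz /dev/longhorn/<pvc-name>   # logical and physical sector sizes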
Impact
Solution
Related Issues
cc @shuo-wu @PhanLe1010 @joshimoo @innobead
I think this issue is addressed in https://github.com/longhorn/longhorn/issues/4354 and https://github.com/longhorn/longhorn/issues/4594
@derekbit Other than minor version differences, the only difference I note is that you’re using EBS, which would present a block device. I was using LVM, and had formatted the VG with XFS on my hypervisor, then passed it through to the worker node and pointed Longhorn at it.
@stephanGarland I tried to reproduce the XFS filesystem corruption issue on the AWS platform, but unfortunately the corruption still cannot be reproduced. The StatefulSet workload writes data to a SQLite database stored on the LH RWO volume. I scaled the StatefulSet utilizing the PVC to 0 and back to 3 repeatedly, as you mentioned. Every mount and XFS log recovery was always successful.
Here is my setup. Any suggestions are appreciated.
Host OS: Ubuntu 20.04 (kernel 5.4.0-1041-aws)
K8s: v1.24.4+rke2r1
Longhorn: v1.2.4
Storage Class
cc @PhanLe1010 @shuo-wu @innobead
Not really sure if there is a bug in XFS. I think users can switch to ext4 first and see if it still happens on ext4.
k3s_kubelet.log kw01_dmesg.log longhorn_shas.txt longhorn-support-bundle_a0d2f19d-cc3c-40ea-ac41-47baad381803_2022-02-10T00-00-40Z.zip
Here’s an organic event that just happened. I’m not sure what triggered it; I don’t believe I’ve touched this pod in a few days. Deleting the odd-one-out replica didn’t fix the problem. Weirdly, although the node dell03-k3s-worker-01 was the one reporting a different SHA (and thus the one I deleted), the dmesg error was from dell01-k3s-worker-01; the other two reported no issues in their logs. Unfortunately my most recent snapshot was also bad, so I had to delete the PVC and recreate it from a manual backup I performed a few weeks ago, copying the files in.
I checked the SSDs on the node for SMART errors but didn’t see any, FWIW.