longhorn: [BUG] Corruption using XFS after node restart or pod scale
Describe the bug
Upon either restarting a kubernetes worker node hosting Longhorn replicas, or during pod scaling (e.g. scaling a StatefulSet to 0), I’ve sometimes experienced silent data corruption. Following the Longhorn docs for detecting and deleting a failed replica does find a different hash, but deleting the replica doesn’t solve the problem. Ultimately restoring from a snapshot is the only fix.
Of note, I’m using XFS as the underlying filesystem, and all affected apps have used SQLite.
Unfortunately/fortunately, this problem seems to be non-deterministic. I’ve experienced it twice: once when restarting a node for maintenance, and once when scaling an app down and back up.
To Reproduce
Steps to reproduce the behavior:
- Use StatefulSets with VolumeClaimTemplates to define PVCs.
- Restart a node containing replicas, or scale StatefulSets utilizing the PVCs to 0 and back to n>0.
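For reference, a minimal sketch of the scale-down/up step; the namespace, StatefulSet name, and replica count are placeholders rather than values taken from this report:
kubectl -n <namespace> scale statefulset <app> --replicas=0
# wait for the pods to terminate and the Longhorn volume to show as detached
kubectl -n longhorn-system get volumes.longhorn.io
kubectl -n <namespace> scale statefulset <app> --replicas=1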
Expected behavior
The volumes return to service with no corruption.
Log or Support bundle
Example dmesg from a longhorn-manager pod. Note that while this shows the recovery completing, others do not. I unfortunately don’t have timestamps to match up a failed one, although I can confirm that I experienced issues with this linked PVC, requiring a snapshot recovery.
[ 1332.278329] XFS (sdc): Metadata CRC error detected at xfs_agfl_read_verify+0xa2/0xf0 [xfs], xfs_agfl block 0x3
[ 1332.278332] XFS (sdc): Unmount and run xfs_repair
[ 1332.278334] XFS (sdc): First 128 bytes of corrupted metadata buffer:
[ 1332.278343] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278344] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278345] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278346] 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278346] 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278347] 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278348] 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278348] 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 1332.278368] XFS (sdc): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x3 len 1 error 74
[ 1332.365389] XFS (sdc): xfs_do_force_shutdown(0x8) called from line 446 of file fs/xfs/libxfs/xfs_defer.c. Return address = 0000000061a87018
[ 1332.365393] XFS (sdc): Corruption of in-memory data detected. Shutting down filesystem
[ 1332.365394] XFS (sdc): Please unmount the filesystem and rectify the problem(s)
[ 1447.117969] cni0: port 13(veth6f28afba) entered disabled state
[ 1447.129033] device veth6f28afba left promiscuous mode
[ 1447.129059] cni0: port 13(veth6f28afba) entered disabled state
[ 1448.304291] XFS (sdc): Unmounting Filesystem
[ 1452.882691] sd 7:0:0:1: [sdc] Synchronizing SCSI cache
[ 1459.419198] scsi host7: iSCSI Initiator over TCP/IP
[ 1459.448202] scsi 7:0:0:0: RAID IET Controller 0001 PQ: 0 ANSI: 5
[ 1459.450009] scsi 7:0:0:0: Attached scsi generic sg3 type 12
[ 1459.451601] scsi 7:0:0:1: Direct-Access IET VIRTUAL-DISK 0001 PQ: 0 ANSI: 5
[ 1459.452849] sd 7:0:0:1: Attached scsi generic sg4 type 0
[ 1459.453233] sd 7:0:0:1: Power-on or device reset occurred
[ 1459.455747] sd 7:0:0:1: [sdc] 4194304 512-byte logical blocks: (2.15 GB/2.00 GiB)
[ 1459.456106] sd 7:0:0:1: [sdc] Write Protect is off
[ 1459.456111] sd 7:0:0:1: [sdc] Mode Sense: 69 00 10 08
[ 1459.456794] sd 7:0:0:1: [sdc] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 1460.586295] sd 7:0:0:1: [sdc] Attached SCSI disk
[ 1469.044275] XFS (sdc): Mounting V5 Filesystem
[ 1471.123131] XFS (sdc): Starting recovery (logdev: internal)
[ 1471.572637] XFS (sdc): Ending recovery (logdev: internal)
[ 1471.624639] xfs filesystem being mounted at /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-81fb33a4-4b2e-425b-b6df-9dbd7af36034/globalmount supports timestamps until 2038 (0x7fffffff)
[ 1471.850475] xfs filesystem being remounted at /var/lib/kubelet/pods/f0f3ff6b-841c-4c4e-a0f9-28cbd9fbac80/volumes/kubernetes.io~csi/pvc-81fb33a4-4b2e-425b-b6df-9dbd7af36034/mount supports timestamps until 2038 (0x7fffffff)
/sys/devices/system/edac/mc/mc{0,1} on each node shows 0 corrected and 0 uncorrected errors.
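For reference, one way to read those EDAC counters, assuming the standard edac sysfs layout:
grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count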
Environment
- Longhorn version: v1.2.3
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s v1.21.5+k3s2
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 3
- Node config
  - OS type and version: k3OS v0.21.5-k3s2r1
  - CPU per node: E5-2650 v2
    - Manager: 4
    - Worker: 28
  - Memory per node: DDR3-PC10600R
    - Manager: 8 Gi
    - Worker: 24 Gi
  - Disk type (e.g. SSD/NVMe): SSD (Intel DC S3500 300 GB)
  - Network bandwidth between the nodes: 1 GbE
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): KVM (Proxmox)
- Number of Longhorn volumes in the cluster: 20
About this issue
- State: closed
- Created 2 years ago
- Comments: 39 (15 by maintainers)
We’ve identified the root cause of the filesystem corruption issue. In this report, we will explain the corruption occurring in an ext4 filesystem; the same issue is also seen in an XFS filesystem.
Reproduce steps
The script can help reproduce the issue.
mkfs.ext4 -b 1024 /dev/longhorn/<pvc-name> on the worker-1 node
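The full reproduction script is not included above; a minimal sketch of the same flow, assuming the volume is already attached to worker-1 and using placeholder device, mount-point, and replica names, could look like this:
# Format the Longhorn block device with a 1024-byte filesystem block size
mkfs.ext4 -b 1024 /dev/longhorn/<pvc-name>
mkdir -p /mnt/test && mount /dev/longhorn/<pvc-name> /mnt/test
# Keep issuing small synchronous writes in the background
dd if=/dev/urandom of=/mnt/test/data bs=1024 count=100000 oflag=direct,dsync &
DD_PID=$!
# Trigger a rebuild by deleting one replica (via the Longhorn UI, or directly against the CRD)
kubectl -n longhorn-system delete replicas.longhorn.io <replica-name>
# After the rebuild finishes, stop the writer and check the filesystem
kill $DD_PID 2>/dev/null; wait
umount /mnt/test
fsck.ext4 -fn /dev/longhorn/<pvc-name>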
Root Cause Analysis
The illustration shows how ext4 filesystem corruption and mount failure are caused. The root cause is a mismatch between the logical block size (1024 bytes) of the iSCSI block device and the copy-on-write (COW) block size (4096 bytes) in the Longhorn replica.
The full story is:
FIG 1 & 2: After one replica is deleted, Longhorn starts rebuilding it; the rebuilding replica is marked as WO (write-only) mode and a snapshot is taken. At this point the snapshot is still empty. A new write I/O (offset=1024, length=1024) arrives before the snapshot has been fully received from the healthy replica (FIG 2).
FIG 3: This write I/O also triggers a copy-on-write. The block size (replica.volume.sectorSize) used by the copy-on-write operation is 4096 bytes, so the data buffer contains two parts: the 1024 bytes of newly written data and the rest of the 4096-byte block.
FIG 4 & 5: After the snapshot has been fully received from the healthy replica, the prune operation removes the redundant blocks. As a result, the snapshots of the two replicas are identical, but the volume-heads are mismatched.
For an XFS filesystem, the default data block size is 4096 bytes, but the superblock size is 512 bytes. The block size information can be checked with xfs_info /dev/longhorn/<pvc-name>. Hence, the mismatch between the superblock size and the Longhorn COW logic causes the data corruption.
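For reference, the sizes involved can be compared with something like the following (the device path is a placeholder); xfs_info reports the filesystem sector and block sizes, while blockdev reports what the block device itself advertises:
xfs_info /dev/longhorn/<pvc-name> | grep -E 'sectsz|bsize'
blockdev --getss --getpbsz /dev/longhorn/<pvc-name>   # logical and physical sector sizes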
Impact
Solution
Related Issues
cc @shuo-wu @PhanLe1010 @joshimoo @innobead
I think this issue is addressed in https://github.com/longhorn/longhorn/issues/4354 and https://github.com/longhorn/longhorn/issues/4594
@derekbit Other than minor version differences, the only difference I note is that you’re using EBS, which would present a block device. I was using LVM, and had formatted the VG with XFS on my hypervisor, then passed it through to the worker node and pointed Longhorn at it.
@stephanGarland I tried to reproduce the XFS filesystem corruption issue on the AWS platform, but unfortunately the corruption still cannot be reproduced. The StatefulSet workload writes data to a SQLite database stored on the LH RWO volume. I scaled the StatefulSet utilizing the PVC to 0 and back to 3 repeatedly, as you mentioned. Every mount and XFS log recovery was always successful.
Here is my setup. Any suggestions are appreciated.
Host OS: Ubuntu 20.04 (kernel 5.4.0-1041-aws)
K8s: v1.24.4+rke2r1
Longhorn: v1.2.4
Storage Class
cc @PhanLe1010 @shuo-wu @innobead
Not really sure if there is a bug in XFS. I think users can switch to ext4 first and see if it still happens on ext4.
k3s_kubelet.log kw01_dmesg.log longhorn_shas.txt longhorn-support-bundle_a0d2f19d-cc3c-40ea-ac41-47baad381803_2022-02-10T00-00-40Z.zip
Here’s an organic event that just happened. I’m not sure what triggered it; I don’t believe I’ve touched this pod in a few days. Deleting the odd-one-out replica didn’t fix the problem. Weirdly, although the node dell03-k3s-worker-01 was the one reporting a different SHA (and thus the one I deleted), the dmesg error was from dell01-k3s-worker-01; the other two reported no issues in their logs. Unfortunately my most recent snapshot was also bad, so I had to delete the PVC and recreate it from a manual backup I performed a few weeks ago, copying the files in.
I checked the SSDs on the node for SMART errors but didn’t see any, FWIW.