longhorn: [BUG] Rebooting a node during volume expansion causes the pod to get stuck in the creating state

Describe the bug (🐛 if you encounter this issue)

Rebooting a node during volume expansion causes the pod to get stuck in the creating state. This can be reproduced without https://github.com/longhorn/longhorn-manager/commit/3e27dc3561395dd9cab8c59c13618c564329fa59 (from @derekbit).

To Reproduce

Steps to reproduce the behavior:

  1. Deploy Longhorn v1.4.x
  2. Dynamically provision a volume via a StatefulSet (1 pod replica)
  3. Write data into the pod's mount point
  4. Edit the PVC volume size to trigger online volume expansion
  5. Reboot the node the volume is attached to while the expansion is still in progress (steps 4 and 5 are sketched below)
  6. After the node comes back up, the volume is healthy but the pod is stuck in the creating state
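
A minimal sketch of steps 4 and 5, assuming the StatefulSet is named web and its PVC is www-web-0 (both names are hypothetical, not taken from the support bundle):

# Step 4: grow the PVC to trigger online expansion (the target size is illustrative).
kubectl patch pvc www-web-0 -p '{"spec":{"resources":{"requests":{"storage":"12Gi"}}}}'
# Step 5: reboot the node the volume is currently attached to while the expansion runs.
ssh <attached-node> sudo reboot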

Expected behavior

The pod should come up and be able to read the data.

Log or Support bundle

supportbundle_d77756d5-0aed-4bc9-8609-47068143430f_2022-12-29T11-03-39Z.zip

Events:
  Type     Reason       Age   From               Message
  ----     ------       ----  ----               -------
  Normal   Scheduled    2m1s  default-scheduler  Successfully assigned default/web-0 to ip-172-31-95-56
  Warning  FailedMount  102s  kubelet            MountVolume.Setup failed while expanding volume for volume "pvc-60603d49-b693-4171-b7bb-c24a64ccf0a2" : Expander.NodeExpand failed to expand the volume : rpc error: code = Internal desc = failed to read size of filesystem on /dev/longhorn/pvc-60603d49-b693-4171-b7bb-c24a64ccf0a2: exit status 152: dumpe2fs 1.46.4 (18-Aug-2021)
dumpe2fs: Superblock checksum does not match superblock while trying to open /dev/longhorn/pvc-60603d49-b693-4171-b7bb-c24a64ccf0a2
Filesystem volume name:   <none>
Last mounted on:          /usr/share/nginx/html
Filesystem UUID:          2c5e1c0d-0ec4-4e34-bce9-d66661444e68
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              720896
Block count:              2883584
Reserved block count:     0
Overhead clusters:        54714
Free blocks:              2828863
Free inodes:              720884
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      126
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Thu Dec 29 09:40:08 2022
Last mount time:          Thu Dec 29 09:46:12 2022
Last write time:          Thu Dec 29 09:46:12 2022
Mount count:              2
Maximum mount count:      -1
Last checked:             Thu Dec 29 09:40:08 2022
Check interval:           0 (<none>)
Lifetime writes:          565 kB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:            256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      a5544af0-bd43-4953-bbf2-839aed407674
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0xfddfd606
Journal features:         journal_64bit journal_checksum_v3
Total journal size:       32M
Total journal blocks:     8192
Max transaction length:   8192
Fast commit length:       0
Journal sequence:         0x00000008
Journal start:            0
Journal checksum type:    crc32c
Journal checksum:         0x2b7e1b86
Events:
  Type     Reason       Age   From               Message
  ----     ------       ----  ----               -------
  Normal   Scheduled    89s   default-scheduler  Successfully assigned default/web-0 to ip-172-31-81-141
  Warning  FailedMount  70s   kubelet            MountVolume.MountDevice failed for volume "pvc-244c68d4-0caf-4443-b683-97b4e0c9284d" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o defaults /dev/longhorn/pvc-244c68d4-0caf-4443-b683-97b4e0c9284d /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/2602be959344fdd1c281ef2e078352c3e9f72c8e1a92f3cac0987f564a2d385a/globalmount
Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/2602be959344fdd1c281ef2e078352c3e9f72c8e1a92f3cac0987f564a2d385a/globalmount: cannot mount; probably corrupted filesystem on /dev/longhorn/pvc-244c68d4-0caf-4443-b683-97b4e0c9284d.

Environment

  • Longhorn version: v1.4.x head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s 1.24
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster:

Additional context

Disconnecting the node's network connection can reproduce the same error.
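
A hypothetical way to simulate that disconnection, assuming the node's data interface is eth0 (the interface name and timing are assumptions); run this on the node the volume is attached to:

sudo ip link set eth0 down   # cut the node off from the cluster
sleep 120                    # keep it unreachable long enough for the volume to fault
sudo ip link set eth0 up     # restore connectivity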

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 19 (19 by maintainers)

Most upvoted comments

To rule out the environment as the cause, I tested in a fresh environment with v1.4.x head images deployed. Out of 3 runs of the reproduce steps, the error was reproduced 2 times.

Hi @shuo-wu, after using the storage class below, which adds the mkfs parameter -O ^metadata_csum, I can no longer reproduce the volume corruption.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-test
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
  fromBackup: ""
  fsType: "ext4"
  mkfsParams: "-O ^metadata_csum"
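
A hypothetical usage sketch, assuming the manifest above is saved as longhorn-test-sc.yaml and the StatefulSet's volumeClaimTemplates reference storageClassName: longhorn-test:

kubectl apply -f longhorn-test-sc.yaml
# Volumes provisioned from this class are formatted without metadata_csum,
# which can be verified on the attached node with:
dumpe2fs -h /dev/longhorn/<pvc-name> | grep -i 'Filesystem features'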

@innobead Right now the conclusion is that we cannot directly fix this issue, and the filesystem corruption may be hard to repair (I did not reproduce this case). But as mentioned above, disabling metadata_csum can bypass the issue. My preference now is to write a KB article covering this case, all the workarounds (manual fsck, or disabling metadata_csum at creation time), and the upstream PR link, and then close this ticket.

I tried to restore the superblock from a backup copy following the usual steps, but the volume still was not repaired. Below is the output when executing sudo e2fsck -b block_number /dev/xxx:

e2fsck: Bad magic number in super-block while trying to open /dev/longhorn/pvc-1e4e47bb-a720-4438-88f4-ec3e719c9555
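
A sketch of that restore attempt, assuming the filesystem was created with the default 4096-byte block size (so the first backup superblock sits at block 32768); the device path is the one from this issue:

# List the backup superblock locations without actually rewriting the filesystem.
sudo mke2fs -n /dev/longhorn/pvc-1e4e47bb-a720-4438-88f4-ec3e719c9555
# Then point e2fsck at one of the reported backup superblocks.
sudo e2fsck -b 32768 /dev/longhorn/pvc-1e4e47bb-a720-4438-88f4-ec3e719c9555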

In addition, I calculated the md5sum of the volume head image and the volume snapshot images in all replicas, and the results were all identical. Because they matched, I did not perform a replica repair. Thank you.

It seems the mount point should be the target when running fsck as per the doc.

At least for ext4, running fsck <mount point> reports error code 16 (usage or syntax error). I don't think we can fix this issue by executing fsck before NodeExpandVolume.

Some notes

  1. Power outage right before superblock checksum update?
  2. Data mismatch between replicas?
  3. There are some patches recently fixing the superblock checksum error caused by ext4 volume online resize, e.g. PR. Not sure if it is related to this issue, because Longhorn re-expands the volume after the node comes back. Need more time to investigate it.

As a workaround, we can check whether the error can be fixed by fsck.

cc @chriscchien @shuo-wu @innobead

The PR mentions that the reproducing step is performing online resizing twice consecutively. But during the node reboot, we don't know (and it is hard to know) whether the node actually executes the filesystem resize before shutting down completely.


Based on my test, this checksum mismatch happens during the NodeStageVolume resizing. After that execution, running dumpe2fs <volume device path> encounters the error, for example:

longhorn-csi-plugin-7lpkp:/ # dumpe2fs -h /dev/longhorn/pvc-b229a6a6-6338-4118-809a-b7c9110047c9
dumpe2fs 1.46.4 (18-Aug-2021)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          3d32fa57-f873-49b0-9061-3e451ef70a52
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              196608
Block count:              786432
Reserved block count:     0
Overhead clusters:        30262
Free blocks:              756164
Free inodes:              196597
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      255
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Wed Jan  4 12:40:17 2023
Last mount time:          Wed Jan  4 12:45:57 2023
Last write time:          Wed Jan  4 12:45:57 2023
Mount count:              2
Maximum mount count:      -1
Last checked:             Wed Jan  4 12:40:17 2023
Check interval:           0 (<none>)
Lifetime writes:          1061 kB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      db0e4df7-de4b-4d21-b258-bb2a3d3b741d
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0x45d31649
Journal features:         journal_64bit journal_checksum_v3
Total journal size:       64M
Total journal blocks:     16384
Max transaction length:   16384
Fast commit length:       0
Journal sequence:         0x00000008
Journal start:            17
Journal checksum type:    crc32c
Journal checksum:         0xdb860f47

dumpe2fs: Superblock checksum does not match superblock while trying to open /dev/longhorn/pvc-b229a6a6-6338-4118-809a-b7c9110047c9
*** Run e2fsck now!

Actually, NodeStageVolume does execute fsck before mounting. But the resizing requires a mount point, and the resizing itself is the cause of the error. Besides, running fsck requires the volume to be unmounted, so we cannot execute fsck after the NodeStageVolume resizing or the NodeExpandVolume resizing… In other words, I haven't found a way to fix it.


BTW, the correct workaround is to scale the workload down and then scale it back up.
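
A sketch of that workaround, assuming the StatefulSet from the reproduce steps is named web:

kubectl scale statefulset web --replicas=0
# Wait until the pod is gone and the Longhorn volume has detached, then:
kubectl scale statefulset web --replicas=1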
