longhorn: [BUG] volume does not get reattached cleanly after an interruption
Describe the bug
After an interruption (possibly related to network issues), a volume stays read-only, resulting in a crash loop of the pod. In this case, the pod has a node affinity and is therefore always scheduled to the same node. After deleting the pod (which is part of a StatefulSet), it gets recreated and starts just fine: the kernel log shows the volume attaching, it comes up writable, and the workload is happy.
This issue: https://github.com/longhorn/longhorn/issues/3283 looks like it is in the same ballpark, but it is not quite the same problem.
To Reproduce
We don't know for sure where the initial interruption is coming from, so we are not able to reproduce it. It has now happened several days in a row, always at the same time, so if there is anything we can measure or log, we would be glad to try.
Expected behavior
The volume becomes usable (i.e. writable) again after an interruption, and workloads start cleanly.
Log or Support bundle
RabbitMQ operator:
2021-11-25 02:01:45.612 [error] <0.1612.0> ** Generic server <0.1612.0> terminating
2021-11-25 02:01:45.615 [error] <0.4698.0> CRASH REPORT Process <0.4698.0> with 1 neighbours exited with reason: no match of right hand value {error,eio} in file_handle_cache:soft_close/1 line 885 in gen_server2:terminate/3 line 1172
2021-11-25 02:01:55.678 [error] <0.634.0> CRASH REPORT Process channel_queue_exchange_metrics_metrics_collector with 0 neighbours crashed with reason: bad argument in call to ets:safe_fixtable(channel_queue_exchange_metrics, true) in ets:foldl/3 line 625
2021-11-25 02:01:55.717 [info] <0.11163.0> Mirrored queue 'celery' in vhost '[redacted]': Secondary replica of queue <rabbit@rabbitmqcluster-operator-server-2.rabbitmqcluster-operator-nodes.[redacted].1637752668.11163.0> detected replica <rabbit@rabbitmqcluster-operator-server-1.rabbitmqcluster-operator-nodes.[redacted].1637752774.15340.7> to be down
Then the RabbitMQ operator is stuck in a crash loop:
02:02:14.914 [warning] Failed to write PID file "/var/lib/rabbitmq/mnesia/rabbit@rabbitmqcluster-operator-server-1.rabbitmqcluster-operator-nodes.[redacted].pid": read-only file system
Node kernel:
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#25 Sense Key : Medium Error [current]
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#25 Add. Sense: Unrecovered read error
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#25 CDB: Write(10) 2a 00 01 44 7f 38 00 00 48 00
[Thu Nov 25 02:00:35 2021] blk_update_request: critical medium error, dev sdn, sector 21266232 op 0x1:(WRITE) flags 0x800 phys_seg 9 prio class 0
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#24 Sense Key : Medium Error [current]
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#24 Add. Sense: Unrecovered read error
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#24 CDB: Write(10) 2a 00 02 06 54 40 00 00 08 00
[Thu Nov 25 02:00:35 2021] blk_update_request: critical medium error, dev sdn, sector 33969216 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[Thu Nov 25 02:00:35 2021] EXT4-fs warning (device sdn): ext4_end_bio:309: I/O error 7 writing to inode 1048649 (offset 0 size 4096 starting block 4246153)
[Thu Nov 25 02:00:35 2021] Buffer I/O error on device sdn, logical block 4246152
[Thu Nov 25 02:00:35 2021] Aborting journal on device sdn-8.
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#21 Sense Key : Medium Error [current]
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#21 Add. Sense: Unrecovered read error
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#21 CDB: Write(10) 2a 00 02 06 1a 08 00 00 08 00
[Thu Nov 25 02:00:35 2021] blk_update_request: critical medium error, dev sdn, sector 33954312 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[Thu Nov 25 02:00:35 2021] EXT4-fs warning (device sdn): ext4_end_bio:309: I/O error 7 writing to inode 1048695 (offset 0 size 0 starting block 4244290)
[Thu Nov 25 02:00:35 2021] Buffer I/O error on device sdn, logical block 4244289
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#23 Sense Key : Medium Error [current]
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#23 Add. Sense: Unrecovered read error
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#23 CDB: Write(10) 2a 00 00 0a 65 28 00 00 08 00
[Thu Nov 25 02:00:35 2021] blk_update_request: critical medium error, dev sdn, sector 681256 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[Thu Nov 25 02:00:35 2021] EXT4-fs warning (device sdn): ext4_end_bio:309: I/O error 7 writing to inode 1048636 (offset 0 size 0 starting block 85158)
[Thu Nov 25 02:00:35 2021] Buffer I/O error on device sdn, logical block 85157
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#22 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#22 Sense Key : Medium Error [current]
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#22 Add. Sense: Unrecovered read error
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#22 CDB: Write(10) 2a 00 00 0a 54 70 00 00 08 00
[Thu Nov 25 02:00:35 2021] blk_update_request: critical medium error, dev sdn, sector 676976 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[Thu Nov 25 02:00:35 2021] EXT4-fs warning (device sdn): ext4_end_bio:309: I/O error 7 writing to inode 1048666 (offset 0 size 0 starting block 84623)
[Thu Nov 25 02:00:35 2021] Buffer I/O error on device sdn, logical block 84622
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#20 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#20 Sense Key : Medium Error [current]
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#20 Add. Sense: Unrecovered read error
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#20 CDB: Write(10) 2a 00 00 0a 2c d8 00 00 08 00
[Thu Nov 25 02:00:35 2021] blk_update_request: critical medium error, dev sdn, sector 666840 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[Thu Nov 25 02:00:35 2021] EXT4-fs warning (device sdn): ext4_end_bio:309: I/O error 7 writing to inode 1048681 (offset 0 size 0 starting block 83356)
[Thu Nov 25 02:00:35 2021] Buffer I/O error on device sdn, logical block 83355
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#26 Sense Key : Medium Error [current]
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#26 Add. Sense: Unrecovered read error
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#26 CDB: Write(10) 2a 00 00 0a 5d 30 00 00 08 00
[Thu Nov 25 02:00:35 2021] blk_update_request: critical medium error, dev sdn, sector 679216 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[Thu Nov 25 02:00:35 2021] EXT4-fs warning (device sdn): ext4_end_bio:309: I/O error 7 writing to inode 1048632 (offset 0 size 0 starting block 84903)
[Thu Nov 25 02:00:35 2021] Buffer I/O error on device sdn, logical block 84902
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#27 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#27 Sense Key : Medium Error [current]
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#27 Add. Sense: Unrecovered read error
[Thu Nov 25 02:00:35 2021] sd 15:0:0:1: [sdn] tag#27 CDB: Write(10) 2a 00 02 06 37 00 00 00 08 00
[Thu Nov 25 02:00:35 2021] blk_update_request: critical medium error, dev sdn, sector 33961728 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[Thu Nov 25 02:00:35 2021] EXT4-fs warning (device sdn): ext4_end_bio:309: I/O error 7 writing to inode 1048652 (offset 0 size 4096 starting block 4245217)
[Thu Nov 25 02:00:35 2021] Buffer I/O error on device sdn, logical block 4245216
[Thu Nov 25 02:00:47 2021] EXT4-fs error (device sdn): ext4_journal_check_start:61: Detected aborted journal
[Thu Nov 25 02:00:47 2021] EXT4-fs (sdn): Remounting filesystem read-only
[Thu Nov 25 02:00:56 2021] EXT4-fs error (device sdn): ext4_journal_check_start:61: Detected aborted journal
[Thu Nov 25 02:00:56 2021] EXT4-fs error (device sdn): ext4_journal_check_start:61: Detected aborted journal
[Thu Nov 25 02:00:56 2021] EXT4-fs error (device sdn): ext4_journal_check_start:61: Detected aborted journal
[Thu Nov 25 02:00:56 2021] EXT4-fs error (device sdn): ext4_journal_check_start:61: Detected aborted journal
Node kernel after deleting the pod:
[Thu Nov 25 09:13:05 2021] sd 15:0:0:1: [sdn] Synchronizing SCSI cache
[Thu Nov 25 09:13:12 2021] scsi host15: iSCSI Initiator over TCP/IP
[Thu Nov 25 09:13:12 2021] scsi 15:0:0:0: RAID IET Controller 0001 PQ: 0 ANSI: 5
[Thu Nov 25 09:13:12 2021] scsi 15:0:0:0: Attached scsi generic sg25 type 12
[Thu Nov 25 09:13:12 2021] scsi 15:0:0:1: Direct-Access IET VIRTUAL-DISK 0001 PQ: 0 ANSI: 5
[Thu Nov 25 09:13:12 2021] sd 15:0:0:1: Power-on or device reset occurred
[Thu Nov 25 09:13:12 2021] sd 15:0:0:1: Attached scsi generic sg26 type 0
[Thu Nov 25 09:13:12 2021] sd 15:0:0:1: [sdn] 41943040 512-byte logical blocks: (21.5 GB/20.0 GiB)
[Thu Nov 25 09:13:12 2021] sd 15:0:0:1: [sdn] Write Protect is off
[Thu Nov 25 09:13:12 2021] sd 15:0:0:1: [sdn] Mode Sense: 69 00 10 08
[Thu Nov 25 09:13:12 2021] sd 15:0:0:1: [sdn] Write cache: enabled, read cache: enabled, supports DPO and FUA
[Thu Nov 25 09:13:12 2021] sd 15:0:0:1: [sdn] Attached SCSI disk
[Thu Nov 25 09:13:24 2021] EXT4-fs (sdn): mounted filesystem with ordered data mode. Opts: (null)
longhorn-manager
time="2021-11-25T02:00:59Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-f052a410-f19d-4fca-a0ce-936d8f23b6cb-e-aa17071b\", UID:\"72002d87-82de-4d3a-aa53-fb4041760b73\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"98211882\", FieldPath:\"\"}): type: 'Warning' reason: 'Faulted' Detected replica pvc-f052a410-f19d-4fca-a0ce-936d8f23b6cb-r-cba0f30b (10.42.7.124:10270) in error"
time="2021-11-25T02:00:59Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-f052a410-f19d-4fca-a0ce-936d8f23b6cb\", UID:\"98228c58-1b7a-4969-9442-55b03f9bd440\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"98211881\", FieldPath:\"\"}): type: 'Normal' reason: 'Degraded' volume pvc-f052a410-f19d-4fca-a0ce-936d8f23b6cb became degraded"
time="2021-11-25T02:00:59Z" level=warning msg="pvc-f052a410-f19d-4fca-a0ce-936d8f23b6cb-r-cba0f30b: time=\"2021-11-24T11:17:12Z\" level=warning msg=\"Starting to execute registered shutdown func github.com/longhorn/longhorn-engine/app/cmd.startReplica.func4\""
longhorn-instance-manager
[longhorn-instance-manager] time="2021-11-25T02:00:59Z" level=debug msg="Process Manager: start getting logs for process pvc-f052a410-f19d-4fca-a0ce-936d8f23b6cb-r-cba0f30b"
[longhorn-instance-manager] time="2021-11-25T02:00:59Z" level=debug msg="Process Manager: got logs for process pvc-f052a410-f19d-4fca-a0ce-936d8f23b6cb-r-cba0f30b"
Environment
- Kernel: 5.4.0-90-generic #101-Ubuntu SMP Fri Oct 15 20:00:55 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Kubernetes: 1.20.6
- iSCSI: open-iscsi 2.0.874-7.1ubuntu6.2
- Longhorn images: longhornio/longhorn-instance-manager:v1_20210731, longhornio/longhorn-engine:v1.2.2, k8s.gcr.io/sig-storage/csi-attacher:v3.2.1, k8s.gcr.io/sig-storage/csi-provisioner:v2.1.2
- Installation method: Helm
- Kubernetes distro: RKE
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 3
- Node config
- OS type and version: Ubuntu 20.04.3 LTS
- CPU per node: 8
- Memory per node: 32GB
- Disk type (e.g. SSD/NVMe): unknown (virtualized). The underlying disk appears as "sdb".
- Network bandwidth between the nodes: 1Gbit
- Underlying Infrastructure: Hetzner Cloud
- Number of Longhorn volumes in the cluster: ~20
Additional context
- Networking between the nodes uses WireGuard tunnels. We are planning to switch to encryption built into the CNI plugin, but as of now the encryption is done outside of Kubernetes. The MTUs are also odd: the internal ethernet interfaces have an MTU of 1400, and on top of that we run the WireGuard tunnels.
- This happened every night at 2:00am for several days, and although we don't believe it is a coincidence, we have not identified any possible culprit. Longhorn backups etc. all run at different times. This is off-hours, and CPU and network load at the time this happens are relatively low.
About this issue
- State: open
- Created 3 years ago
- Comments: 21 (11 by maintainers)
@PhanLe1010 A liveness probe does not help because a pod restart does not trigger a volume unmount + remount on the host. Only pod delete + create does.
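To make that concrete, the only workaround right now is deleting the affected pod so the StatefulSet controller recreates it and the kubelet re-stages the volume. A minimal client-go sketch of that delete (the namespace is a placeholder, the pod name is taken from the logs above, and detecting the read-only condition is assumed to happen elsewhere):

```go
// Minimal sketch: delete the stuck pod so its StatefulSet recreates it and
// the kubelet tears down and re-stages the volume mount. A plain container
// restart keeps the old (read-only) mount, which is why it does not help.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Namespace "rabbitmq" is a placeholder; the pod name matches the
	// StatefulSet member seen in the kernel/operator logs above.
	err = client.CoreV1().Pods("rabbitmq").Delete(
		context.TODO(),
		"rabbitmqcluster-operator-server-1",
		metav1.DeleteOptions{},
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("pod deleted; the StatefulSet controller will recreate it")
}
```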
I just found a Bitnami Redis cluster (those have a liveness probe configured by default) in a boot loop where the pods were constantly restarting with this output:
and the volumes on the node were in a read-only state:
So I think Longhorn should be modified to use the mount parameter errors=continue; that way pods would work at least after the liveness probe restarts them. Or perhaps it should even be a configurable option, because in some environments users might want to use errors=panic, as it forces nodes to reboot after a significant network failure. Ref https://linux.die.net/man/8/mount
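Just to illustrate where such an option would take effect: the errors= policy is passed as mount data when the ext4 filesystem is mounted. This is not Longhorn's actual CSI code path, only a sketch of the underlying syscall with placeholder paths:

```go
// Sketch only: shows how an errors= policy is applied at mount time.
// Longhorn's CSI plugin performs the real mount; the device and target
// paths below are placeholders.
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	device := "/dev/longhorn/pvc-example"     // placeholder Longhorn device
	target := "/var/lib/kubelet/pods/example" // placeholder mount point

	// errors=continue keeps the filesystem writable after IO errors;
	// errors=remount-ro flips it read-only (the behaviour seen in the
	// kernel log above); errors=panic crashes the node so it reboots.
	if err := unix.Mount(device, target, "ext4", 0, "errors=continue"); err != nil {
		log.Fatalf("mount failed: %v", err)
	}
	log.Println("mounted with errors=continue")
}
```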
For the remount-ro mode, if there are many IO errors, the filesystem will be remounted read-only, and Longhorn is not aware of this. We can check this option with tune2fs -l /dev/longhorn/<name> if users encounter this issue. cc @innobead @shuo-wu @joshimoo @PhanLe1010
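Since the kernel flips the mount to read-only without telling Longhorn, one way to surface this would be to watch /proc/mounts for Longhorn devices that carry the ro flag. This is not existing Longhorn behaviour, just a rough sketch of the detection idea (the /dev/longhorn/ prefix match is an assumption):

```go
// Rough sketch: report Longhorn block devices whose filesystem has been
// remounted read-only, by scanning /proc/mounts for the "ro" option.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	data, err := os.ReadFile("/proc/mounts")
	if err != nil {
		panic(err)
	}
	for _, line := range strings.Split(string(data), "\n") {
		// /proc/mounts format: <device> <mountpoint> <fstype> <options> <dump> <pass>
		fields := strings.Fields(line)
		if len(fields) < 4 {
			continue
		}
		device, mountpoint, options := fields[0], fields[1], fields[3]
		if !strings.HasPrefix(device, "/dev/longhorn/") {
			continue
		}
		for _, opt := range strings.Split(options, ",") {
			if opt == "ro" {
				fmt.Printf("read-only mount detected: %s on %s\n", device, mountpoint)
			}
		}
	}
}
```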
I've checked the log. Both pvc-f052a410-f19d-4fca-a0ce-936d8f23b6cb and pvc-cbbdf292-cdca-47a8-a4e4 only had one disconnected replica, and the other two replicas were still running. The volumes did not crash. Not sure why the ext4 filesystem showed the IO errors.
Also, Automatically Delete Workload Pod when The Volume Is Detached Unexpectedly was not triggered, so the automatic recovery was not executed that time. cc @shuo-wu @PhanLe1010 @jenting, do you have any idea?
longhorn-support-bundle_2021-11-30.zip
Let's try this again … so here is the support bundle.
@derekbit Hmm, it's a bit difficult - while it is kind of reproducible, this is a production environment. At the moment, we have increased the replica count to 3, in the hope that we can avoid this issue entirely if the volume is available locally on the host that the pod is running on. I will see if we can make this happen, but I'm not too happy about intentionally breaking a production system (although the impact is not that severe, there are always risks).
As for the "delete workload…" option, I am relatively sure that it did not trigger, or if it did, then something else went wrong. Yesterday morning, I saw this behaviour again, and the pod still had the read-only filesystem. After manually deleting the pod, it came back just fine.