longhorn: [QUESTION] Volumes stuck in attaching state

Question: In our 3-node cluster, multiple volumes fail to attach. Rebooting nodes sometimes resolves the issue for some volumes, but causes others to fail.

Longhorn worked just fine for 5-6 months with exactly the same cluster configuration. We did, however, experience latency issues with our config store (galera-cluster) in the past that caused k3s components to crash. These problems are resolved, but they may have caused our issues with Longhorn.

We see the following events in the namespace longhorn-system:

LAST SEEN   TYPE      REASON      OBJECT                                                       MESSAGE
4m23s       Warning   Unhealthy   pod/share-manager-pvc-8c08f699-a65a-41c3-a1ce-4f43327c294a   Readiness probe failed: cat: /var/run/ganesha.pid: No such file or directory
4m21s       Warning   Unhealthy   pod/share-manager-pvc-0097611e-0a90-412e-bfa3-3d53e52723dd   Readiness probe failed: cat: /var/run/ganesha.pid: No such file or directory
3m11s       Warning   Unhealthy   pod/share-manager-pvc-b424093a-da21-4856-bd2d-c98003512418   Readiness probe failed: cat: /var/run/ganesha.pid: No such file or directory
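
For context, the failing pods above are the NFS (Ganesha) share-manager pods that back RWX volumes, and the probe failure means Ganesha never wrote its PID file. A minimal way to inspect one of them, using only standard kubectl and a pod name taken from the events above:

  # Show pod status and recent events for one of the failing share managers
  kubectl -n longhorn-system describe pod share-manager-pvc-8c08f699-a65a-41c3-a1ce-4f43327c294a

  # Check its logs for why ganesha never started (hence the missing /var/run/ganesha.pid)
  kubectl -n longhorn-system logs share-manager-pvc-8c08f699-a65a-41c3-a1ce-4f43327c294a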

Almost all replicas are stuck in state “stopped” for node “server914zx.mueller.de” (screenshot: Bildschirmfoto 2021-09-21 um 10 59 54).
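
A quick way to see the same replica state from the CLI (this assumes the Longhorn CRD plural replicas.longhorn.io; the columns shown depend on the CRD printer columns in your Longhorn version):

  # List all replica CRs; look for replicas stuck in "stopped" on that node
  kubectl -n longhorn-system get replicas.longhorn.io -o wide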

What should we do to resolve these issues?

Environment:

  • Longhorn version: 1.1.0
  • Kubernetes version: v1.19.7+k3s1
  • Node config
    • OS type and version: Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-65-generic x86_64)
    • CPU per node: 32
    • Memory per node: 252 GB
    • Disk type: LVM, XFS, local disks
    • Network bandwidth and latency between the nodes: 2x 1GBit LACP, 0.07ms ping average
  • Underlying Infrastructure: Baremetal

Additional context: longhorn-support-bundle_2044762c-c02e-416d-83eb-ae942f0224ab_2021-09-21T09-31-42Z.zip

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 23 (6 by maintainers)

Most upvoted comments

@haskell42 This seems to be the same as https://github.com/longhorn/longhorn/issues/3000

Can you try deleting this instance manager:

  kubectl delete instancemanager instance-manager-r-7b5c128d -n longhorn-system
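
After the deletion, Longhorn should recreate a fresh replica instance manager for that node. A rough way to verify (the CRD name instancemanagers.longhorn.io and the spec.nodeID/spec.type fields are assumptions based on the Longhorn CRDs and may differ between versions):

  # List instance managers together with the node and type they belong to
  kubectl -n longhorn-system get instancemanagers.longhorn.io \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID,TYPE:.spec.type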

https://github.com/longhorn/longhorn/issues/3000#issuecomment-918614544
Looks like the instance managers are OK. Did you try scaling down the workloads after the deletion?
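
For example, forcing a clean detach/reattach by scaling the consuming workload down and back up (my-app and my-namespace are placeholders for your actual workload):

  # Scale the workload down so the PVC detaches cleanly...
  kubectl -n my-namespace scale deployment my-app --replicas=0
  # ...wait for the Longhorn volume to show as detached, then scale back up
  kubectl -n my-namespace scale deployment my-app --replicas=1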

We still experience problems attaching the following volumes:

  • pvc-55d0f277-da48-4c33-b210-66492321f83f
  • pvc-30ffb664-2891-454c-b7d6-e84398758ecf

All replicas are running, but after about 5 minutes the pods get destroyed (and attaching is aborted) because of the missing volumes.

Yesterday this worked just fine.

longhorn-support-bundle_2044762c-c02e-416d-83eb-ae942f0224ab_2021-09-22T08-16-00Z.zip
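
One way to capture the state of these volumes while the attach is failing (assuming the Longhorn CRD plural volumes.longhorn.io; the status fields may differ between versions):

  # Dump the full Volume CR, including its state and robustness, for one of the stuck PVCs
  kubectl -n longhorn-system get volumes.longhorn.io pvc-55d0f277-da48-4c33-b210-66492321f83f -o yaml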

Thanks @timmy59100 for the great help! Yeah, we hit the same issue: multiple instance managers with the same spec.

This time it happened with Longhorn v1.1.0, so this issue was not introduced by the new version v1.2.0. We are still investigating the root cause in the longhorn/longhorn#3000 thread.
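
If you want to check whether you are hitting the same duplicate-instance-manager situation, a rough one-liner (again assuming the instancemanagers.longhorn.io CRD with spec.nodeID and spec.type fields):

  # Print node/type pairs that have more than one instance manager
  kubectl -n longhorn-system get instancemanagers.longhorn.io \
    -o custom-columns=NODE:.spec.nodeID,TYPE:.spec.type --no-headers | sort | uniq -d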