longhorn: [BUG] Failed Statefulset Pod Creation with RWX Workload on Longhorn v1.3.3 and SLES 15 SP5
Describe the bug (🐛 if you encounter this issue)
|             | Longhorn v1.3.3 | Longhorn v1.4.3 |
| ----------- | --------------- | --------------- |
| SLES 15-SP4 | PASSED          | PASSED          |
| SLES 15-SP5 | Failed          | PASSED          |
After deploying an RWX statefulset workload, the creation of statefulset pods fails on Longhorn v1.3.3 running on SLES 15-SP5.
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 6m6s default-scheduler 0/4 nodes are available: 4 pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
Warning FailedScheduling 6m5s default-scheduler 0/4 nodes are available: 4 pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
Normal Scheduled 6m3s default-scheduler Successfully assigned default/web-state-rwx-0 to ip-10-0-2-201
Normal SuccessfulAttachVolume 5m54s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8"
Warning FailedMount 113s (x2 over 3m53s) kubelet MountVolume.MountDevice failed for volume "pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning FailedMount 103s (x2 over 4m1s) kubelet Unable to attach or mount volumes: unmounted volumes=[www], unattached volumes=[www kube-api-access-ckb5b]: timed out waiting for the condition
NAME STATE ROBUSTNESS SCHEDULED SIZE NODE AGE
pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8 attached healthy 536870912 ip-10-0-2-135 2m32s
pvc-70aa68c1-0619-4c6a-af4a-c781d2991896 attached healthy 536870912 ip-10-0-2-135 2m23s
pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8 attached healthy 536870912 ip-10-0-2-135 23m
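The volume state above was presumably collected from the Longhorn volume custom resources; a minimal sketch, assuming the default longhorn-system namespace:

```bash
# List Longhorn volume CRs to check attachment state and robustness
kubectl -n longhorn-system get volumes.longhorn.io
```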
2023-08-09T14:22:23.103033741Z time="2023-08-09T14:22:23Z" level=error msg="NodeStageVolume: err: rpc error: code = Internal desc = mount failed: exit status 32\nMounting command: /usr/local/sbin/nsmounter\nMounting arguments: mount -t nfs -o vers=4.1,noresvport,soft,intr,timeo=30,retrans=3 10.43.222.49:/pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/11ec8ece01effb5207e0df7f96ec6950c28fafdda89dc1ef80557d721ed2b19b/globalmount\nOutput: mount.nfs: Connection timed out\n"
2023-08-09T14:23:54.307042007Z time="2023-08-09T14:23:54Z" level=info msg="ControllerUnpublishVolume: req: {\"node_id\":\"ip-10-0-2-135\",\"volume_id\":\"pvc-9acf3d8a-1912-4843-bed5-a9e81f72493e\"}"
2023-08-09T14:23:54.309342690Z time="2023-08-09T14:23:54Z" level=info msg="ControllerUnpublishVolume: volume pvc-9acf3d8a-1912-4843-bed5-a9e81f72493e no longer exists"
2023-08-09T14:23:54.309363073Z time="2023-08-09T14:23:54Z" level=info msg="ControllerUnpublishVolume: rsp: {}"
2023-08-09T14:23:54.320596132Z time="2023-08-09T14:23:54Z" level=info msg="ControllerUnpublishVolume: req: {\"node_id\":\"ip-10-0-2-135\",\"volume_id\":\"pvc-9acf3d8a-1912-4843-bed5-a9e81f72493e\"}"
2023-08-09T14:23:54.321298336Z time="2023-08-09T14:23:54Z" level=info msg="ControllerUnpublishVolume: volume pvc-9acf3d8a-1912-4843-bed5-a9e81f72493e no longer exists"
2023-08-09T14:23:54.321310144Z time="2023-08-09T14:23:54Z" level=info msg="ControllerUnpublishVolume: rsp: {}"
2023-08-09T14:24:23.168674384Z E0809 14:24:23.168466 17369 mount_linux.go:195] Mount failed: exit status 32
2023-08-09T14:24:23.168713885Z Mounting command: /usr/local/sbin/nsmounter
2023-08-09T14:24:23.168727245Z Mounting arguments: mount -t nfs -o vers=4.1,noresvport,soft,intr,timeo=30,retrans=3 10.43.222.49:/pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/11ec8ece01effb5207e0df7f96ec6950c28fafdda89dc1ef80557d721ed2b19b/globalmount
2023-08-09T14:24:23.168731951Z Output: mount.nfs: Connection timed out
2023-08-09T14:24:23.168735272Z
2023-08-09T14:24:23.168741309Z time="2023-08-09T14:24:23Z" level=error msg="NodeStageVolume: err: rpc error: code = Internal desc = mount failed: exit status 32\nMounting command: /usr/local/sbin/nsmounter\nMounting arguments: mount -t nfs -o vers=4.1,noresvport,soft,intr,timeo=30,retrans=3 10.43.222.49:/pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/11ec8ece01effb5207e0df7f96ec6950c28fafdda89dc1ef80557d721ed2b19b/globalmount\nOutput: mount.nfs: Connection timed out\n"
2023-08-09T14:24:23.789574784Z time="2023-08-09T14:24:23Z" level=info msg="NodeStageVolume: req: {\"staging_target_path\":\"/var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/11ec8ece01effb5207e0df7f96ec6950c28fafdda89dc1ef80557d721ed2b19b/globalmount\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":5}},\"volume_context\":{\"dataLocality\":\"disabled\",\"fromBackup\":\"\",\"fsType\":\"ext4\",\"numberOfReplicas\":\"3\",\"share\":\"true\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1691590697714-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8\"}"
2023-08-09T14:24:23.865847328Z time="2023-08-09T14:24:23Z" level=debug msg="trying to ensure mount point /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/11ec8ece01effb5207e0df7f96ec6950c28fafdda89dc1ef80557d721ed2b19b/globalmount"
2023-08-09T14:24:23.867643733Z time="2023-08-09T14:24:23Z" level=debug msg="mount point /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/11ec8ece01effb5207e0df7f96ec6950c28fafdda89dc1ef80557d721ed2b19b/globalmount try reading dir to make sure it's healthy"
2023-08-09T14:26:24.254955092Z E0809 14:26:24.254708 17369 mount_linux.go:195] Mount failed: exit status 32
2023-08-09T14:26:24.255001784Z Mounting command: /usr/local/sbin/nsmounter
2023-08-09T14:26:24.255012984Z Mounting arguments: mount -t nfs -o vers=4.1,noresvport,soft,intr,timeo=30,retrans=3 10.43.222.49:/pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/11ec8ece01effb5207e0df7f96ec6950c28fafdda89dc1ef80557d721ed2b19b/globalmount
2023-08-09T14:26:24.255017195Z Output: mount.nfs: Connection timed out
2023-08-09T14:26:24.255020464Z
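The timeout can be reproduced manually from the affected worker node; a minimal sketch using the same share-manager service IP, export path, and mount options that appear in the log above (the mount point is arbitrary):

```bash
# Manual NFS mount attempt against the share-manager service (values taken from the log above)
mkdir -p /mnt/rwx-check
mount -t nfs -o vers=4.1,noresvport,soft,intr,timeo=30,retrans=3 \
  10.43.222.49:/pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8 /mnt/rwx-check
# On the affected SP5 node this fails with: mount.nfs: Connection timed out
umount /mnt/rwx-check 2>/dev/null || true
```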
To Reproduce
- Deploy an SLES 15-SP5 cluster using our Terraform script: https://github.com/longhorn/longhorn-tests/tree/master/test_framework/terraform/aws/sles
- Install Longhorn v1.3.3 on SLES 15-SP5.
- Deploy RWX and RWO statefulset workloads using the manifest below:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-state-rwo
  labels:
    app: nginx-state-rwo
spec:
  ports:
  - port: 80
    name: web-state-rwo
  selector:
    app: nginx-state-rwo
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-state-rwo
spec:
  selector:
    matchLabels:
      app: nginx-state-rwo # has to match .spec.template.metadata.labels
  serviceName: "nginx-state-rwo"
  replicas: 1 # by default is 1
  template:
    metadata:
      labels:
        app: nginx-state-rwo # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx-state-rwo
        image: nginx:stable
        livenessProbe:
          exec:
            command:
            - ls
            - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web-state-rwo
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 0.5Gi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-state-rwx
  labels:
    app: nginx-state-rwx
spec:
  ports:
  - port: 80
    name: web-state-rwx
  selector:
    app: nginx-state-rwx
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-state-rwx
spec:
  selector:
    matchLabels:
      app: nginx-state-rwx # has to match .spec.template.metadata.labels
  serviceName: "nginx-state-rwx"
  replicas: 1 # by default is 1
  template:
    metadata:
      labels:
        app: nginx-state-rwx # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx-state-rwx
        image: nginx:stable
        livenessProbe:
          exec:
            command:
            - ls
            - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web-state-rwx
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteMany" ]
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 0.5Gi
```
- Observe that the creation of RWX statefulset pods fails.
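The failure can be confirmed with standard kubectl commands; a minimal sketch (pod names follow the StatefulSet naming convention):

```bash
# The RWX pod stays in ContainerCreating with FailedMount events
kubectl get pods -l app=nginx-state-rwx
kubectl describe pod web-state-rwx-0

# The RWO pod on the same cluster starts normally
kubectl get pods -l app=nginx-state-rwo
```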
Expected behavior
The statefulset pods should be created successfully without any errors.
Support bundle for troubleshooting
longhorn-support-bundle_e1038140-b5b4-47fa-8c01-78954bea9941_2023-08-09T14-27-37Z.zip
Environment
- Longhorn version: v1.3.3
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s v1.24.8+k3s1
- Number of management nodes in the cluster: 1
- Number of worker nodes in the cluster: 3
- Node config
- OS type and version: SLES 15-SP5
- Kernel version:
- CPU per node: 4
- Memory per node: 16GB
- Disk type (e.g. SSD/NVMe/HDD): SSD
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
- Number of Longhorn volumes in the cluster:
Additional context
CC @longhorn/qa
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 16 (11 by maintainers)
Likely related to:
Before disabling tx checksum offload:
With disabling tx checksum offload:
After disabling tx checksum offload on all nodes, the RWX successfully mounts:
This has been a known issue with a variety of CNI plugins using VXLAN for a long time. It’s not clear what changed between SP4 and SP5 that exacerbated it.
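For reference, a minimal sketch of disabling TX checksum offload, assuming a flannel VXLAN interface named flannel.1 (the interface name depends on the CNI and may differ):

```bash
# Run on every node; disables TX checksum offload on the VXLAN interface.
# Note: this setting does not persist across reboots.
ethtool -K flannel.1 tx-checksum-ip-generic off

# Verify the setting
ethtool -k flannel.1 | grep tx-checksum-ip-generic
```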
This is a rare and quite specific case, so a KB article should be good enough. @ejweber, can you help with that? Thanks.
Reopened. Let’s close this issue after the KB is merged.
One more related issue is https://github.com/flannel-io/flannel/issues/1679, which was fixed in flannel v0.20.2. Apparently, there is a double-NAT bug that can slow down the connection and even cause packet loss with some kernels. The version of k3s our Terraform scripts use (v1.24.8+k3s1) ships flannel v0.20.1. When I upgraded the scripts to use k3s v1.24.16+k3s1 (which uses flannel v0.22.0), the problem went away.

Note that this is NOT a Longhorn bug. It is an issue with the routing/networking in the Kubernetes distribution Longhorn is installed on. We are likely only hitting it in v1.3.3 testing because we test other Longhorn versions on an updated Kubernetes distribution. The regression from SP4 to SP5 is strange, but k3s v1.24 is not validated on SP5, so we probably shouldn't spend more time investigating the issue.

I will submit a PR that changes the Longhorn test infrastructure to use k3s v1.24.16 instead of v1.24.8.
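For reference, a minimal sketch of pinning the k3s version on a node; the longhorn-tests Terraform scripts set the version through their own variables, so this standalone install is only illustrative:

```bash
# Install a k3s release that bundles flannel >= v0.20.2
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.24.16+k3s1" sh -
```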