longhorn: [BUG] Failed Statefulset Pod Creation with RWX Workload on Longhorn v1.3.3 and SLES 15 SP5

Describe the bug (🐛 if you encounter this issue)

|             | Longhorn v1.3.3 | Longhorn v1.4.3 |
| ----------- | --------------- | --------------- |
| SLES 15-SP4 | PASSED          | PASSED          |
| SLES 15-SP5 | Failed          | PASSED          |

After deploying an RWX statefulset workload, the creation of statefulset pods fails on Longhorn v1.3.3 running on SLES 15-SP5.

QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Warning  FailedScheduling        6m6s                  default-scheduler        0/4 nodes are available: 4 pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
  Warning  FailedScheduling        6m5s                  default-scheduler        0/4 nodes are available: 4 pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
  Normal   Scheduled               6m3s                  default-scheduler        Successfully assigned default/web-state-rwx-0 to ip-10-0-2-201
  Normal   SuccessfulAttachVolume  5m54s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8"
  Warning  FailedMount             113s (x2 over 3m53s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             103s (x2 over 4m1s)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[www], unattached volumes=[www kube-api-access-ckb5b]: timed out waiting for the condition

NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE        NODE            AGE
pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8   attached   healthy                  536870912   ip-10-0-2-135   2m32s
pvc-70aa68c1-0619-4c6a-af4a-c781d2991896   attached   healthy                  536870912   ip-10-0-2-135   2m23s
pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8   attached   healthy                  536870912   ip-10-0-2-135   23m
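The volumes themselves report attached and healthy, so the failure is on the NFS export path rather than the block device. When debugging this, it can help to confirm that the share-manager pod and service Longhorn creates for the RWX volume exist in the longhorn-system namespace (a hedged sketch; the grep pattern is just the PVC name from the listing above):

```bash
# Share-manager pod that exports the RWX volume over NFS
kubectl -n longhorn-system get pods | grep share-manager-pvc-9fcf03ad

# ClusterIP service the CSI plugin mounts from (10.43.222.49 in the logs below)
kubectl -n longhorn-system get svc | grep pvc-9fcf03ad
```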

2023-08-09T14:22:23.103033741Z time="2023-08-09T14:22:23Z" level=error msg="NodeStageVolume: err: rpc error: code = Internal desc = mount failed: exit status 32\nMounting command: /usr/local/sbin/nsmounter\nMounting arguments: mount -t nfs -o vers=4.1,noresvport,soft,intr,timeo=30,retrans=3 10.43.222.49:/pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/11ec8ece01effb5207e0df7f96ec6950c28fafdda89dc1ef80557d721ed2b19b/globalmount\nOutput: mount.nfs: Connection timed out\n"
2023-08-09T14:23:54.307042007Z time="2023-08-09T14:23:54Z" level=info msg="ControllerUnpublishVolume: req: {\"node_id\":\"ip-10-0-2-135\",\"volume_id\":\"pvc-9acf3d8a-1912-4843-bed5-a9e81f72493e\"}"
2023-08-09T14:23:54.309342690Z time="2023-08-09T14:23:54Z" level=info msg="ControllerUnpublishVolume: volume pvc-9acf3d8a-1912-4843-bed5-a9e81f72493e no longer exists"
2023-08-09T14:23:54.309363073Z time="2023-08-09T14:23:54Z" level=info msg="ControllerUnpublishVolume: rsp: {}"
2023-08-09T14:23:54.320596132Z time="2023-08-09T14:23:54Z" level=info msg="ControllerUnpublishVolume: req: {\"node_id\":\"ip-10-0-2-135\",\"volume_id\":\"pvc-9acf3d8a-1912-4843-bed5-a9e81f72493e\"}"
2023-08-09T14:23:54.321298336Z time="2023-08-09T14:23:54Z" level=info msg="ControllerUnpublishVolume: volume pvc-9acf3d8a-1912-4843-bed5-a9e81f72493e no longer exists"
2023-08-09T14:23:54.321310144Z time="2023-08-09T14:23:54Z" level=info msg="ControllerUnpublishVolume: rsp: {}"
2023-08-09T14:24:23.168674384Z E0809 14:24:23.168466   17369 mount_linux.go:195] Mount failed: exit status 32
2023-08-09T14:24:23.168713885Z Mounting command: /usr/local/sbin/nsmounter
2023-08-09T14:24:23.168727245Z Mounting arguments: mount -t nfs -o vers=4.1,noresvport,soft,intr,timeo=30,retrans=3 10.43.222.49:/pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/11ec8ece01effb5207e0df7f96ec6950c28fafdda89dc1ef80557d721ed2b19b/globalmount
2023-08-09T14:24:23.168731951Z Output: mount.nfs: Connection timed out
2023-08-09T14:24:23.168735272Z 
2023-08-09T14:24:23.168741309Z time="2023-08-09T14:24:23Z" level=error msg="NodeStageVolume: err: rpc error: code = Internal desc = mount failed: exit status 32\nMounting command: /usr/local/sbin/nsmounter\nMounting arguments: mount -t nfs -o vers=4.1,noresvport,soft,intr,timeo=30,retrans=3 10.43.222.49:/pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/11ec8ece01effb5207e0df7f96ec6950c28fafdda89dc1ef80557d721ed2b19b/globalmount\nOutput: mount.nfs: Connection timed out\n"
2023-08-09T14:24:23.789574784Z time="2023-08-09T14:24:23Z" level=info msg="NodeStageVolume: req: {\"staging_target_path\":\"/var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/11ec8ece01effb5207e0df7f96ec6950c28fafdda89dc1ef80557d721ed2b19b/globalmount\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":5}},\"volume_context\":{\"dataLocality\":\"disabled\",\"fromBackup\":\"\",\"fsType\":\"ext4\",\"numberOfReplicas\":\"3\",\"share\":\"true\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1691590697714-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8\"}"
2023-08-09T14:24:23.865847328Z time="2023-08-09T14:24:23Z" level=debug msg="trying to ensure mount point /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/11ec8ece01effb5207e0df7f96ec6950c28fafdda89dc1ef80557d721ed2b19b/globalmount"
2023-08-09T14:24:23.867643733Z time="2023-08-09T14:24:23Z" level=debug msg="mount point /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/11ec8ece01effb5207e0df7f96ec6950c28fafdda89dc1ef80557d721ed2b19b/globalmount try reading dir to make sure it's healthy"
2023-08-09T14:26:24.254955092Z E0809 14:26:24.254708   17369 mount_linux.go:195] Mount failed: exit status 32
2023-08-09T14:26:24.255001784Z Mounting command: /usr/local/sbin/nsmounter
2023-08-09T14:26:24.255012984Z Mounting arguments: mount -t nfs -o vers=4.1,noresvport,soft,intr,timeo=30,retrans=3 10.43.222.49:/pvc-9fcf03ad-07d6-45d8-b00c-d8ef04928bf8 /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/11ec8ece01effb5207e0df7f96ec6950c28fafdda89dc1ef80557d721ed2b19b/globalmount
2023-08-09T14:26:24.255017195Z Output: mount.nfs: Connection timed out
2023-08-09T14:26:24.255020464Z 
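The mount consistently times out against the share-manager service IP. A direct connectivity check from the node that failed to mount, along the same lines as the nc test in the comments below, narrows this down to networking (a sketch; the IP and port are taken from the failing mount command above):

```bash
# Run on ip-10-0-2-201: test TCP reachability of the share-manager NFS endpoint.
# A healthy path returns almost immediately; here it hangs until the timeout.
time nc -z -w 5 10.43.222.49 2049
```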

To Reproduce

  1. Deploy an SLES 15-SP5 cluster using our Terraform script https://github.com/longhorn/longhorn-tests/tree/master/test_framework/terraform/aws/sles.
  2. Install Longhorn v1.3.3 on SLES 15-SP5.
  3. Deploy the RWX and RWO statefulset workloads using the manifest below (an example apply command follows it).
apiVersion: v1
kind: Service
metadata:
  name: nginx-state-rwo
  labels:
    app: nginx-state-rwo
spec:
  ports:
  - port: 80
    name: web-state-rwo
  selector:
    app: nginx-state-rwo
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-state-rwo
spec:
  selector:
    matchLabels:
      app: nginx-state-rwo # has to match .spec.template.metadata.labels
  serviceName: "nginx-state-rwo"
  replicas: 1 # by default is 1
  template:
    metadata:
      labels:
        app: nginx-state-rwo # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx-state-rwo
        image: nginx:stable
        livenessProbe:
          exec:
            command:
              - ls
              - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web-state-rwo
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 0.5Gi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-state-rwx
  labels:
    app: nginx-state-rwx
spec:
  ports:
  - port: 80
    name: web-state-rwx
  selector:
    app: nginx-state-rwx
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-state-rwx
spec:
  selector:
    matchLabels:
      app: nginx-state-rwx # has to match .spec.template.metadata.labels
  serviceName: "nginx-state-rwx"
  replicas: 1 # by default is 1
  template:
    metadata:
      labels:
        app: nginx-state-rwx # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx-state-rwx
        image: nginx:stable
        livenessProbe:
          exec:
            command:
              - ls
              - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web-state-rwx
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteMany" ]      
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 0.5Gi
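A minimal way to apply the manifest and watch the result (the filename is illustrative):

```bash
kubectl apply -f statefulset-rwx-rwo.yaml   # filename is illustrative
kubectl get pods -l 'app in (nginx-state-rwo, nginx-state-rwx)' -w
```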

  4. Observe that the creation of the RWX statefulset pods fails.

Expected behavior

The statefulset pods should be created successfully without any errors.

Support bundle for troubleshooting

longhorn-support-bundle_e1038140-b5b4-47fa-8c01-78954bea9941_2023-08-09T14-27-37Z.zip

Environment

  • Longhorn version: v1.3.3
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s v1.24.8+k3s1
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 3
  • Node config
    • OS type and version: SLES 15-SP5
    • Kernel version:
    • CPU per node: 4
    • Memory per node: 16GB
    • Disk type(e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster:

Additional context

CC @longhorn/qa

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 16 (11 by maintainers)

Most upvoted comments

Likely related to:

Before disabling tx checksum offload:

ec2-user@ip-10-0-1-153:~> time nc -z 10.43.78.204 2049

real    1m5.089s
user    0m0.000s
sys     0m0.002s

After disabling tx checksum offload:

ec2-user@ip-10-0-1-153:~> sudo ethtool -K flannel.1 tx-checksum-ip-generic off
Actual changes:
tx-checksum-ip-generic: off
tx-tcp-segmentation: off [not requested]
tx-tcp-ecn-segmentation: off [not requested]
tx-tcp-mangleid-segmentation: off [not requested]
tx-tcp6-segmentation: off [not requested]

ec2-user@ip-10-0-1-153:~> time nc -z 10.43.78.204 2049

real    0m0.005s
user    0m0.002s
sys     0m0.000s

After disabling tx checksum offload on all nodes, the RWX volume mounts successfully:

eweber@laptop:~/longhorn-tests/test_framework/terraform/aws/sles> k get pod
NAME              READY   STATUS    RESTARTS   AGE
web-state-rwo-0   1/1     Running   0          3h41m
web-state-rwx-0   1/1     Running   0          3h41m

This has been a known issue with a variety of CNI plugins using VXLAN for a long time. It’s not clear what changed between SP4 and SP5 that exacerbated it.
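For reference, a minimal sketch of applying the same workaround to every node, assuming SSH access and that flannel.1 is the VXLAN interface on each node (NODES is a placeholder; the setting does not persist across reboots):

```bash
# Illustrative only: NODES is a placeholder for the cluster's node addresses.
for node in $NODES; do
  ssh ec2-user@"$node" "sudo ethtool -K flannel.1 tx-checksum-ip-generic off"
done
```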

This is a rare and quite specific case. Let's have a KB article; that's good enough. @ejweber, can you help with that? Thanks.

Reopened. Let’s close this issue after the KB is merged.

One more related issue is https://github.com/flannel-io/flannel/issues/1679, which was fixed in flannel v0.20.2. Apparently, there is a double-NAT bug that can slow down the connection and even cause packet loss with some kernels. The version of k3s our Terraform scripts use (v1.24.8+k3s1) ships flannel v0.20.1. When I upgraded the scripts to use k3s v1.24.16+k3s1 (which ships flannel v0.22.0), the problem went away.
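For anyone reproducing against a fixed k3s, the standard installer can pin the version (a sketch; INSTALL_K3S_VERSION is the installer's documented environment variable):

```bash
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.24.16+k3s1" sh -
```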

Note that this is NOT a Longhorn bug. It is an issue with the routing/networking of the Kubernetes distribution Longhorn is installed on. We are likely only hitting it in v1.3.3 testing because we test other Longhorn versions on updated Kubernetes distributions. The regression from SP4 to SP5 is strange, but k3s v1.24 is not validated on SP5, so we probably shouldn't spend more time investigating the issue.

I will submit a PR that switches the Longhorn test infrastructure from k3s v1.24.8 to v1.24.16.