longhorn: [BUG] RWX doesn't work with release 1.4.0 due to end grace update error from recovery backend
Describe the bug (🐛 if you encounter this issue)
I reinstalled Longhorn 1.4.0 with k3s 1.25.5. Everything is fine, except that mounting RWX volumes fails repeatedly.
To Reproduce
Steps to reproduce the behavior:
- Make a volume with RWX
- Mount it with a pod
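For anyone scripting the two steps above, here is a minimal client-go sketch. The namespace default, the PVC/pod names, the longhorn storage class name, and the busybox image are assumptions for illustration; the Resources field type assumes a client-go release from the k8s 1.25 era.

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()
	sc := "longhorn" // assumed storage class name

	// Step 1: a PVC requesting ReadWriteMany from the Longhorn storage class.
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "shared-volume"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteMany},
			StorageClassName: &sc,
			// ResourceRequirements matches client-go for k8s ~1.25; newer releases use VolumeResourceRequirements.
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("1Gi")},
			},
		},
	}
	if _, err := cs.CoreV1().PersistentVolumeClaims("default").Create(ctx, pvc, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}

	// Step 2: a pod that mounts the RWX claim; the mount is what exercises the share-manager/NFS path.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "rwx-test"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:         "app",
				Image:        "busybox",
				Command:      []string{"sleep", "3600"},
				VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/data"}},
			}},
			Volumes: []corev1.Volume{{
				Name: "data",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ClaimName: "shared-volume"},
				},
			}},
		},
	}
	if _, err := cs.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
}
```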
Expected behavior
The RWX volume should mount, as it did in 1.3.2.
Log or Support bundle
Here is the log from share-manager-<volume-name>:
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_FILEHANDLE from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_DISPATCH from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_CACHE_INODE from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_CACHE_INODE_LRU from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_HASHTABLE from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_HASHTABLE_CACHE from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_DUPREQ from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_INIT from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_MAIN from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_IDMAPPER from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_NFS_READDIR from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_NFS_V4_LOCK from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_CONFIG from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_CLIENTID from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_SESSIONS from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_PNFS from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_RW_LOCK from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_NLM from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_RPC from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_TIRPC from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_NFS_CB from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_THREAD from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_NFS_V4_ACL from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_STATE from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_9P from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_9P_DISPATCH from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_FSAL_UP from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_DBUS from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] SetComponentLogLevel :LOG :NULL :LOG: Changing log level of COMPONENT_NFS_MSK from NIV_EVENT to NIV_INFO
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] nfs_set_param_from_conf :NFS STARTUP :EVENT :Configuration file successfully parsed
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] init_fds_limit :INODE LRU :EVENT :Setting the system-imposed limit on FDs to 1048576.
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] init_server_pkgs :NFS STARTUP :INFO :State lock layer successfully initialized
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] init_server_pkgs :NFS STARTUP :INFO :IP/name cache successfully initialized
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] init_server_pkgs :NFS STARTUP :EVENT :Initializing ID Mapper.
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] init_server_pkgs :NFS STARTUP :EVENT :ID Mapper successfully initialized.
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] nfs4_recovery_init :CLIENT ID :INFO :Recovery Backend Init for longhorn
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] longhorn_recov_init :CLIENT ID :EVENT :Initialize recovery backend 'share-manager-shared-volume'
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] longhorn_read_recov_clids :CLIENT ID :EVENT :Read clients from recovery backend share-manager-shared-volume
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] read_clids :CLIENT ID :EVENT :response={"actions":{},"clients":[],"hostname":"share-manager-shared-volume","id":"share-manager-shared-volume","links":{"self":"http://longhorn-re
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] nfs_start_grace :STATE :EVENT :grace reload client info completed from backend
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] longhorn_recov_end_grace :CLIENT ID :EVENT :End grace for recovery backend 'share-manager-shared-volume' version LUUZWL8T
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] http_call :CLIENT ID :EVENT :HTTP error: 500 (url=http://longhorn-recovery-backend:9600/v1/recoverybackend/share-manager-shared-volume, payload={"version": "LUUZWL8T"})
31/12/2022 22:40:52 : epoch 63b0ba74 : share-manager-shared-volume : nfs-ganesha-29[main] longhorn_recov_end_grace :CLIENT ID :FATAL :HTTP call error: res=-1 ((null))
time="2022-12-31T22:40:52Z" level=error msg="NFS server exited with error" encrypted=false error="ganesha.nfsd failed with error: exit status 2, output: " volume=shared-volume
W1231 22:40:52.523325 1 mount_helper_common.go:133] Warning: "/export/shared-volume" is not a mountpoint, deleting
time="2022-12-31T22:40:52Z" level=debug msg="Device /dev/mapper/shared-volume is not an active LUKS device" error="failed to run cryptsetup args: [status shared-volume] output: error: exit status 4"
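The FATAL line above is the NFS server giving up after the recovery backend answered the end-grace update with HTTP 500. To probe that endpoint manually, here is a hedged Go sketch: the URL and JSON payload are copied from the log, while the PUT method is an assumption not confirmed by the log.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// URL and body copied from the nfs-ganesha log line above; the HTTP
	// method (PUT) is an assumption.
	url := "http://longhorn-recovery-backend:9600/v1/recoverybackend/share-manager-shared-volume"
	body := []byte(`{"version": "LUUZWL8T"}`)

	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	// A 500 here matches the "HTTP error: 500" in the share-manager log.
	fmt.Printf("status=%d body=%s\n", resp.StatusCode, out)
}
```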
Environment
- Longhorn version: 1.4.0
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s 1.25.5
- Number of management nodes in the cluster: 2
- Number of worker nodes in the cluster: 2
- Node config
- OS type and version: Ubuntu 20.04
- CPU per node: 64
- Memory per node: 384Gi
- Disk type(e.g. SSD/NVMe): SSD
- Network bandwidth between the nodes: 10G + 10G (link aggregated)
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): On-prem
- Number of Longhorn volumes in the cluster: 7
About this issue
- State: closed
- Created 2 years ago
- Comments: 133 (73 by maintainers)
Thanks @jinserk! You are contributing heavily to Longhorn. Helpful feedback!
From the log and code flow, the full story is
To improve the resilience to the conflict error
Jinserk Baik @.***> wrote on Sunday, January 1, 2023, at 11:20:
Done, @derekbit.
Yes, I have used Longhorn since v1.3.1 and experienced this very frequently, especially every time I reinstall the k3s cluster.
I already applied the workaround from https://github.com/longhorn/longhorn/issues/3207 and found that it works in my case too. Every volume I use has 2 replicas across both nodes, which hits the case described in https://github.com/longhorn/longhorn/issues/3207.
@derekbit Wow, it looks like it's working! It is amazing! Did you resolve the 4-year-old issue?
The new image is applied:
The rwx-volume test is applied:
I'll do it with my JupyterHub env and test it with multiple pods mounting the volume!
Works as expected in my env, but I used the embedded k3s DB and no HA in this test. I can try HA + an external DB tomorrow.
Thank you for the update.
It is a workaround. Will fix it, as I mentioned, in the next release.
Jinserk Baik @.***> wrote on Sunday, January 1, 2023, at 4:12:
@derekbit It looks like it's working!!
Thank you so much, @derekbit, for the nice workaround. I guess this will resolve my problem, but if you need to fix the issue for the next release, I'll help with the testing.
@tbertenshaw Because your issue is different from the one @jinserk encountered, can you open a ticket for tracking what you encountered? Provide the env, steps, symptoms, and the latest support bundle. We will handle it in the new ticket. Thank you.
supportbundle_cf947ccc-9d7e-490f-8df5-f22db6f413f7_2023-01-06T14-28-35Z.zip @derekbit
Thanks, @tbertenshaw, for cooperating with us on testing to figure out this rare cause.
@innobead From the patched image.
@tbertenshaw You can fall back to longhornio/longhorn-manager:v1.4.0. Looks like the patched one has some issues.
@derekbit As well as the share manager restarting, I can see that one of the longhorn-manager pods has restarted 123 times: longhorn-manager-x5p6j.log
@tbertenshaw Sorry, my bad. Checked the wrong node before.
Yes, as you mentioned, the share-manager pod is created and deleted repeatedly. The share-manager controller complains cannot check IsNodeDownOrDeleted() when syncShareManagerPod because of the error no node name provided to check node down or deleted. I also checked the log, and the share-manager pod's spec.nodeName is actually set to aks-nodepoolres-14028323-vmss000001, which is ready.
I suspect this is probably caused by checking pod.spec.nodeName immediately after creating a share-manager pod.
I'm going to work on a patch, and I would appreciate it if you could help check it.
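One way to avoid racing against an unset spec.nodeName is to poll the freshly created pod until the scheduler has assigned a node, rather than checking immediately after Create. The following client-go sketch only illustrates that idea; it is not the actual longhorn-manager patch, and the names are placeholders.

```go
package sketch

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForNodeName polls the share-manager pod until spec.nodeName is set
// (or the timeout expires), instead of checking it right after Create.
func waitForNodeName(ctx context.Context, cs kubernetes.Interface, ns, name string) (string, error) {
	var nodeName string
	err := wait.PollImmediate(time.Second, 30*time.Second, func() (bool, error) {
		pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return false, nil // pod not visible yet, keep polling
		}
		if err != nil {
			return false, err
		}
		nodeName = pod.Spec.NodeName
		return nodeName != "", nil // done once the scheduler has assigned a node
	})
	return nodeName, err
}
```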
@innobead @derekbit Scaled the workload to 3 pods; this is the support bundle after an hour in this configuration.
supportbundle_cf947ccc-9d7e-490f-8df5-f22db6f413f7_2023-01-04T18-53-12Z.zip
share-manager-pvc-f5952e3a-d202-4e8f-9fcf-277d70bec191.log
Any more detail needed?
@jinserk Can you try the workaround mentioned in https://github.com/longhorn/longhorn/issues/3207? I will continue digging in the new issue. BTW, does the issue happen in v1.3.x as well?
@timbo Sorry, I didn't explain clearly. You should replace the longhorn-recovery-backend deployment's image with derekbit/longhorn-manager:v1.4.0-update, because the NFS recovery-backend logic is in this deployment.
@derekbit The volume name is jinserk-baik-volume and the PVC name is claim-jinserk-2ebaik. The timing is more than 10 min, as you can see in the pod description above.
I see, the functions in datastore/kubernetes do not contain verifyCreate or verifyUpdate.
No... I only just learned of the comments Sheng left in the ticket...
Hello @jinserk, I added a retry of the configmap update on conflict in the image derekbit/longhorn-manager:v1.4.0-update. Could you please try again? Thank you.
@longhorn/qa @chriscchien I cannot reproduce the issue in my env. Can you help check whether you can reproduce it using the v1.4.0 images?
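For context on the "retry of the configmap update on conflict" mentioned a few comments above: the standard client-go pattern looks roughly like the sketch below. The function name, namespace, and data key are placeholders, not the actual longhorn-manager change.

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateRecoveryBackendVersion re-reads the configmap and retries the update
// whenever the API server reports a resourceVersion conflict.
func updateRecoveryBackendVersion(ctx context.Context, cs kubernetes.Interface, ns, name, version string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		cm, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if cm.Data == nil {
			cm.Data = map[string]string{}
		}
		cm.Data["version"] = version // placeholder key
		_, err = cs.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{})
		return err // a Conflict error here triggers another attempt
	})
}
```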
No, my colleague added verification for some resource-creation operations as a workaround. I just thought of that verification and added it to the share-manager's configmap creation flow. Looking forward to your test results. Many thanks for your patience and help.
From this issue, I think we should verify creation for all resources to avoid any accident. cc @innobead
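A rough sketch of that create-then-verify idea for a configmap: create it, then poll a Get until the object is visible, so a lost or not-yet-visible write is caught right away. The helper name is illustrative, not Longhorn's actual verifyCreate.

```go
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// createConfigMapVerified creates a configmap and then polls until a Get
// succeeds, so the caller never proceeds on a creation that is not yet visible.
func createConfigMapVerified(ctx context.Context, cs kubernetes.Interface, ns string, cm *corev1.ConfigMap) error {
	if _, err := cs.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{}); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return wait.PollImmediate(time.Second, 30*time.Second, func() (bool, error) {
		_, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, cm.Name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return false, nil // not visible yet, keep polling
		}
		return err == nil, err
	})
}
```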
@innobead Do you mean the k3s database? No, I'm using a remote PostgreSQL server for k3s instead of SQLite or etcd.
/var/lib/rancher and /var/lib/kubelet reside on a separate SSD on both nodes.
Also, I'd like to mention that the physical volume used for Longhorn on each node is a software-RAID1 (mirroring) MD volume with two 10TB HDDs.
By the way, can you try the simple steps in #5183 (comment)? I'd like to check whether the simple test still hits the error.
Sure, I'll check it too!
One more thing: I'm using two master nodes (which also work as worker nodes) in an HA configuration with a remote PostgreSQL k3s database. I'm not sure whether this could affect Longhorn's operation.
I just tried creating 10 pods with the same RWX volume, and they were all created successfully.
My steps: change ReadWriteOnce to ReadWriteMany in pod_with_pvc.yaml and change the pod name, then create them with kubectl -f <all pod_with_pvc.yaml manifests>.
@innobead After successfully creating the share-manager's configmap, the recovery-backend somehow failed to get the configmap, hitting a not found error. Then a new pod was created and operated on the configmap. So I guess there is a concurrency issue, and I will try scaling down the replicas.
But I'm still not quite sure whether this is a valid workaround. Additionally, the very first not found error is very weird, because the configmap was actually created.
Need more time to investigate the issue.
No, we didnβt encounter the error.
Still thinking about how to reproduce it and check whether there is a workaround.
David Ko @.***> wrote on Sunday, January 1, 2023, at 1:13:
@derekbit Here are the logs from the pods: longhorn-recovery-backend-logs.zip
BTW, how do I generate the support bundle? Do you have any how-to doc?
@innobead It's my pleasure! I appreciate the great project!
Can you send us the support bundle? BTW, can you show the logs of longhorn-recovery-backend-864d6fb7c-c8wlv and longhorn-recovery-backend-864d6fb7c-z7ptn?
I have two nodes in my cluster, and the pods under the longhorn-system namespace are:
I couldn't find the matching logs in the backend pod logs, but I can see repeated errors with different version numbers:
Please let me know what logs you need to figure it out. Iβll do my best to provide the info. Thank you so much and Happy New Year!