longhorn: [BUG] Some volumes fail to backup
Describe the bug
Some Longhorn volumes fail to back up to S3 (Minio). When a backup is triggered, the instance ‘dies’ and restarts, which triggers a rebuild of the instance. One specific volume I tried it on had one backup succeed a month ago, but all subsequent backups fail. I deleted all old backups and tried again, to no avail.
I also tried backing up from an existing snapshot, with the same or a similar outcome.
Expected behavior
A backup should be taken to S3
Log or Support bundle
Attaching some logs. I’ve captured a full support bundle and am happy to provide it.
[longhorn-instance-manager] time="2022-09-08T11:13:11Z" level=debug msg="Getting snapshot clone status" serviceURL="10.42.10.197:10005"
[longhorn-instance-manager] time="2022-09-08T11:13:11Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:11Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:11Z" level=debug msg="Getting volume" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:11Z" level=debug msg="Getting replica rebuilding status" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:11Z" level=debug msg="Get snapshot purge status" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:11Z" level=debug msg="Getting backup restore status" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:11Z" level=debug msg="Getting snapshot clone status" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:12Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:12Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:12Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:12Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:12Z" level=debug msg="Getting volume" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:12Z" level=debug msg="Getting replica rebuilding status" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:12Z" level=debug msg="Get snapshot purge status" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:12Z" level=debug msg="Getting volume" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:12Z" level=debug msg="Getting replica rebuilding status" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:12Z" level=debug msg="Get snapshot purge status" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:12Z" level=debug msg="Getting backup restore status" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Getting backup restore status" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Getting snapshot clone status" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Getting snapshot clone status" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10003"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10003"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10001"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10001"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Getting volume" serviceURL="10.42.10.197:10003"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Getting replica rebuilding status" serviceURL="10.42.10.197:10003"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Get snapshot purge status" serviceURL="10.42.10.197:10003"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Getting volume" serviceURL="10.42.10.197:10001"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Getting replica rebuilding status" serviceURL="10.42.10.197:10001"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Get snapshot purge status" serviceURL="10.42.10.197:10001"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Getting backup restore status" serviceURL="10.42.10.197:10003"
[longhorn-instance-manager] time="2022-09-08T11:13:13Z" level=debug msg="Getting backup restore status" serviceURL="10.42.10.197:10001"
[longhorn-instance-manager] time="2022-09-08T11:13:14Z" level=debug msg="Getting snapshot clone status" serviceURL="10.42.10.197:10003"
[longhorn-instance-manager] time="2022-09-08T11:13:14Z" level=debug msg="Getting snapshot clone status" serviceURL="10.42.10.197:10001"
[longhorn-instance-manager] time="2022-09-08T11:13:15Z" level=debug msg="Snapshotting volume " serviceURL="10.42.10.197:10004"
[pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c-e-1d8a9715] time="2022-09-08T11:13:15Z" level=info msg="Starting snapshot" snapshot= volume=pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c
[pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c-e-1d8a9715] time="2022-09-08T11:13:15Z" level=info msg="Requesting system sync before snapshot" snapshot= volume=pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c
[longhorn-instance-manager] time="2022-09-08T11:13:15Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10005"
[longhorn-instance-manager] time="2022-09-08T11:13:15Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10005"
[pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c-e-1d8a9715] time="2022-09-08T11:13:15Z" level=info msg="Starting to snapshot: 10.42.8.141:10000 657dfc93-269b-45c0-aea6-05f1b57521e3 UserCreated true Created at 2022-09-08T11:13:15Z, Labels map[]"
[pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c-e-1d8a9715] time="2022-09-08T11:13:15Z" level=info msg="Finished to snapshot: 10.42.8.141:10000 657dfc93-269b-45c0-aea6-05f1b57521e3 UserCreated true Created at 2022-09-08T11:13:15Z, Labels map[]"
[pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c-e-1d8a9715] time="2022-09-08T11:13:15Z" level=info msg="Finished snapshot" snapshot= volume=pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c
[longhorn-instance-manager] time="2022-09-08T11:13:15Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:15Z" level=debug msg="Getting volume" serviceURL="10.42.10.197:10005"
[longhorn-instance-manager] time="2022-09-08T11:13:15Z" level=debug msg="Getting replica rebuilding status" serviceURL="10.42.10.197:10005"
[longhorn-instance-manager] time="2022-09-08T11:13:15Z" level=debug msg="Get snapshot purge status" serviceURL="10.42.10.197:10005"
[longhorn-instance-manager] time="2022-09-08T11:13:15Z" level=debug msg="Getting backup restore status" serviceURL="10.42.10.197:10005"
[longhorn-instance-manager] time="2022-09-08T11:13:15Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:16Z" level=debug msg="Getting snapshot clone status" serviceURL="10.42.10.197:10005"
[longhorn-instance-manager] time="2022-09-08T11:13:16Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:16Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:16Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:16Z" level=debug msg="Getting volume" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:16Z" level=debug msg="Getting replica rebuilding status" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:16Z" level=debug msg="Get snapshot purge status" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:16Z" level=debug msg="Getting backup restore status" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:16Z" level=debug msg="Getting snapshot clone status" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:17Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:17Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:17Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:17Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:17Z" level=debug msg="Getting volume" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:17Z" level=debug msg="Getting replica rebuilding status" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:17Z" level=debug msg="Get snapshot purge status" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:17Z" level=debug msg="Getting volume" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:17Z" level=debug msg="Getting replica rebuilding status" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:17Z" level=debug msg="Get snapshot purge status" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:17Z" level=debug msg="Getting backup restore status" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Getting backup restore status" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Getting snapshot clone status" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Getting snapshot clone status" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10003"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10003"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10001"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10001"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Getting volume" serviceURL="10.42.10.197:10003"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Getting replica rebuilding status" serviceURL="10.42.10.197:10003"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Get snapshot purge status" serviceURL="10.42.10.197:10003"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Getting volume" serviceURL="10.42.10.197:10001"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Getting replica rebuilding status" serviceURL="10.42.10.197:10001"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Get snapshot purge status" serviceURL="10.42.10.197:10001"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Getting backup restore status" serviceURL="10.42.10.197:10003"
[longhorn-instance-manager] time="2022-09-08T11:13:18Z" level=debug msg="Getting backup restore status" serviceURL="10.42.10.197:10001"
[longhorn-instance-manager] time="2022-09-08T11:13:19Z" level=debug msg="Getting snapshot clone status" serviceURL="10.42.10.197:10003"
[longhorn-instance-manager] time="2022-09-08T11:13:19Z" level=debug msg="Getting snapshot clone status" serviceURL="10.42.10.197:10001"
[pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c-e-1d8a9715] time="2022-09-08T11:13:19Z" level=error msg="Error reading from wire: EOF"
response_process: Receive error for response 3 of seq 297
tgtd: bs_longhorn_request(105) fail to read at 17371136 for 4096
tgtd: bs_longhorn_request(150) io error 0x2cac1d0 28 -14 4096 17371136, Success
[pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c-e-1d8a9715] time="2022-09-08T11:13:20Z" level=error msg="RemovePendingOp: OpID already removed0"
time="2022-09-08T11:13:20Z" level=error msg="Replicator.ReadAt:0 EOF"
response_process: Receive error for response 3 of seq 298
tgtd: bs_longhorn_request(91) fail to write at 0 for 4096
tgtd: bs_longhorn_request(150) io error 0x2cac1d0 2a -14 4096 0, Success
[pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c-e-1d8a9715] time="2022-09-08T11:13:20Z" level=error msg="Setting replica tcp://10.42.8.141:10000 to ERR due to: EOF"
time="2022-09-08T11:13:20Z" level=info msg="Set replica tcp://10.42.8.141:10000 to mode ERR"
time="2022-09-08T11:13:20Z" level=error msg="I/O error: tcp://10.42.8.141:10000: EOF"
time="2022-09-08T11:13:20Z" level=info msg="Monitoring stopped tcp://10.42.8.141:10000"
time="2022-09-08T11:13:20Z" level=error msg="I/O error: no backend available"
response_process: Receive error for response 3 of seq 299
tgtd: bs_longhorn_request(105) fail to read at 17371136 for 4096
tgtd: bs_longhorn_request(150) io error 0x2cac1d0 28 -14 4096 17371136, Success
[pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c-e-1d8a9715] time="2022-09-08T11:13:20Z" level=error msg="I/O error: no backend available"
[pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c-e-1d8a9715] time="2022-09-08T11:13:20Z" level=error msg="I/O error: no backend available"
response_process: Receive error for response 3 of seq 300
tgtd: bs_longhorn_request(91) fail to write at 0 for 4096
tgtd: bs_longhorn_request(150) io error 0x2cac1d0 2a -14 4096 0, Success
[pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c-e-1d8a9715] time="2022-09-08T11:13:20Z" level=error msg="I/O error: no backend available"
response_process: Receive error for response 3 of seq 301
tgtd: bs_longhorn_request(105) fail to read at 17371136 for 4096
tgtd: bs_longhorn_request(150) io error 0x2cac1d0 28 -14 4096 17371136, Success
[pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c-e-1d8a9715] time="2022-09-08T11:13:20Z" level=error msg="I/O error: no backend available"
response_process: Receive error for response 3 of seq 302
tgtd: bs_longhorn_request(91) fail to write at 0 for 4096
tgtd: bs_longhorn_request(150) io error 0x2cac1d0 2a -14 4096 0, Success
[longhorn-instance-manager] time="2022-09-08T11:13:20Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10005"
[longhorn-instance-manager] time="2022-09-08T11:13:20Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10005"
[longhorn-instance-manager] time="2022-09-08T11:13:20Z" level=debug msg="Getting volume" serviceURL="10.42.10.197:10005"
[longhorn-instance-manager] time="2022-09-08T11:13:20Z" level=debug msg="Getting replica rebuilding status" serviceURL="10.42.10.197:10005"
[longhorn-instance-manager] time="2022-09-08T11:13:20Z" level=debug msg="Get snapshot purge status" serviceURL="10.42.10.197:10005"
[longhorn-instance-manager] time="2022-09-08T11:13:20Z" level=debug msg="Getting backup restore status" serviceURL="10.42.10.197:10005"
[longhorn-instance-manager] time="2022-09-08T11:13:21Z" level=debug msg="Getting snapshot clone status" serviceURL="10.42.10.197:10005"
[longhorn-instance-manager] time="2022-09-08T11:13:21Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:21Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:21Z" level=debug msg="Getting volume" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:21Z" level=debug msg="Getting replica rebuilding status" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:21Z" level=debug msg="Get snapshot purge status" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:21Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:21Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:21Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:21Z" level=debug msg="Getting backup restore status" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:21Z" level=debug msg="Getting snapshot clone status" serviceURL="10.42.10.197:10004"
[longhorn-instance-manager] time="2022-09-08T11:13:21Z" level=debug msg="Process Manager: start getting logs for process pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c-e-1d8a9715"
[longhorn-instance-manager] time="2022-09-08T11:13:22Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:22Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10002"
[longhorn-instance-manager] time="2022-09-08T11:13:22Z" level=debug msg="Listing replicas" serviceURL="10.42.10.197:10000"
[longhorn-instance-manager] time="2022-09-08T11:13:22Z" level=debug msg="Listing snapshots" serviceURL="10.42.10.197:10000"
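Most of the excerpt above is debug-level polling, which makes the failure sequence hard to spot. The following is a minimal sketch, not part of the original report, that keeps only the non-debug logrus-style lines. It assumes the `time="..." level=... msg="..."` layout seen in the excerpt; the function name `non_debug_events` is hypothetical, and the `tgtd:` and `response_process:` lines are skipped because they do not follow that layout.

```python
import re

# Matches logrus-style fields as they appear in the excerpt:
# time="..." level=<word> msg="..."
LINE_RE = re.compile(
    r'time="(?P<time>[^"]+)"\s+level=(?P<level>\w+)\s+msg="(?P<msg>[^"]*)"'
)


def non_debug_events(lines):
    """Yield (time, level, msg) for each logrus-style line that is not debug."""
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("level") != "debug":
            yield match.group("time"), match.group("level"), match.group("msg")


# A few lines copied from the excerpt, used here as sample input.
sample = [
    '[longhorn-instance-manager] time="2022-09-08T11:13:19Z" level=debug '
    'msg="Getting snapshot clone status" serviceURL="10.42.10.197:10001"',
    '[pvc-7e9d77b1-8313-4cfd-bb7f-6550d317507c-e-1d8a9715] '
    'time="2022-09-08T11:13:19Z" level=error msg="Error reading from wire: EOF"',
    "tgtd: bs_longhorn_request(105) fail to read at 17371136 for 4096",
    'time="2022-09-08T11:13:20Z" level=info '
    'msg="Set replica tcp://10.42.8.141:10000 to mode ERR"',
]

events = list(non_debug_events(sample))
for time, level, msg in events:
    print(time, level, msg)
```

Run against the full log, this should leave the snapshot messages at 11:13:15 followed by the EOF and "no backend available" errors from 11:13:19 onward.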
Environment
- Longhorn version: 1.3.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s 1.23.10
- Number of management node in the cluster: 3
- Number of worker node in the cluster: 9
- Node config
- OS type and version: Ubuntu 20.04
- CPU per node: 4+
- Memory per node: 4+
- Disk type(e.g. SSD/NVMe): SSD
- Network bandwidth between the nodes: Variable
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): VMware
- Number of Longhorn volumes in the cluster: 15
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 17 (6 by maintainers)
I’m sending another log bundle. I tried to restore a volume from a backup and it crashes the entire node (all volumes on it restart).
@mateuszdrab Can you try to edit the longhorn-manager daemonset by
kubectl -n longhorn-system edit daemonset longhorn-manager
Add to env.
ref: https://github.com/longhorn/longhorn/issues/1768, https://github.com/longhorn/longhorn/issues/4415
@mateuszdrab, sorry about that, I did not receive the support bundle. Could you please send it again? You can send it to longhorn-support-bundle@rancher.com or longhorn-support-bundle@suse.com. Thank you.