longhorn: [BUG] Nil pointer dereference error on restoring the system backup

Describe the bug (🐛 if you encounter this issue)

A system backup taken on v1.4.0-rc1 gets stuck in the Pending state with a nil pointer dereference error when restored on v1.4.0-rc2.

To Reproduce

Steps to reproduce the behavior:

  1. Install Longhorn v1.4.0-rc1
  2. Create some volumes.
  3. Take the volume backup and system backup.
  4. Create a cluster with Longhorn v1.4.0-rc2.
  5. Restore the system backup.
  6. The restore gets stuck in the Pending state (see the status-check sketch after this list).
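
A hedged way to observe the stuck restore from the CLI (a sketch; it assumes the v1.4 SystemRestore CRD and the usual app=longhorn-manager pod label, and the restore name is illustrative):

# List system restores and their states; the stuck one stays in Pending
kubectl -n longhorn-system get systemrestores.longhorn.io

# Describe the stuck restore to see its conditions and events (name is illustrative)
kubectl -n longhorn-system describe systemrestore my-system-restore

# Search the longhorn-manager logs for the nil pointer dereference
kubectl -n longhorn-system logs -l app=longhorn-manager --tail=500 | grep -i "nil pointer"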

Expected behavior

Restore should complete.

Log or Support bundle

supportbundle_06317708-75d1-48d5-8092-1de9ee200fe7_2022-12-22T20-25-53Z.zip

Environment

  • Longhorn version: v1.4.0-rc2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl

Additional context

Creating a system backup in v1.4.0-rc2 and restoring it in the same version doesn’t cause any problem.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 19 (19 by maintainers)

Most upvoted comments

Verified passed on v1.4.x-head (longhorn-manager 9d57b4c). Different kinds of volumes (created from the Longhorn UI, dynamically provisioned, RWO, RWX, with a backing image, encrypted) can all be backed up and restored to the new cluster without problem.


This is a good test because it covers different types of volume creation. cc @longhorn/qa
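
For the dynamically provisioned entries in that matrix, a minimal PVC sketch (assuming the default longhorn StorageClass shipped with Longhorn; the name and size are illustrative, and the RWO case only needs accessModes changed to ReadWriteOnce):

# RWX PVC dynamically provisioned by Longhorn
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-rwx
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
EOF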

Tested on v1.4.x-head (longhorn-manager 9d57b4c). When restoring a volume with a backing image, data is missing from the restored volume:

In cluster A:
(1) Create a volume with a backing image.
(2) Create a pod to use this volume:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
spec:
  containers:
    - name: sleep
      image: busybox
      imagePullPolicy: IfNotPresent
      args: ["/bin/sh", "-c", "while true;do date;sleep 5; done"]
      volumeMounts:
        - name: pod-data
          mountPath: /data
  volumes:
    - name: pod-data
      persistentVolumeClaim:
        claimName: test-1

(3) Write some data to the volume:

dd if=/dev/urandom of=/data/test-1 bs=1M count=16

(4) Create system backup

In cluster B:
(5) Manually create the backing image.
(6) Restore the system backup.
(7) The volume with the backing image is restored and in the detached state.
(8) Create a pod to use this restored volume, as in step (2).
(9) Check the volume data with kubectl exec -it test-pod-1 -- /bin/sh: the test-1 file created in step (3) is not there.

  • Manually creating a volume backup and restoring that volume backup works: the test-1 file can be found in the restored volume and its content is intact (a checksum-comparison sketch follows below).
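
A simple way to verify whether the data survived the restore (a sketch; busybox ships md5sum, and the pod/file names are the ones used above):

# In cluster A, before taking the system backup
kubectl exec test-pod-1 -- md5sum /data/test-1

# In cluster B, after restoring and attaching the volume via the same pod spec
kubectl exec test-pod-1 -- md5sum /data/test-1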

Tested on v1.4.x-head, the following test scenario results in abnormal behavior:

(1) In cluster A, create volume "test-1" and a PV/PVC in the default namespace, all from the Longhorn UI.
(2) Create a pod using this volume:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
spec:
  containers:
    - name: sleep
      image: busybox
      imagePullPolicy: IfNotPresent
      args: ["/bin/sh", "-c", "while true;do date;sleep 5; done"]
      volumeMounts:
        - name: pod-data
          mountPath: /data
  volumes:
    - name: pod-data
      persistentVolumeClaim:
        claimName: test-1  

(3) Create a backup for this volume.
(4) Create a system backup (a hedged CR sketch for this step follows below).
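
A hedged way to create the system backup from the CLI instead of the UI (a sketch; it assumes the v1.4 SystemBackup CRD accepts a metadata-only manifest and that a backup target is already configured; the name is illustrative):

kubectl -n longhorn-system apply -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: SystemBackup
metadata:
  name: demo-system-backup
  namespace: longhorn-system
EOF

# Watch its state until it finishes (state names are an assumption)
kubectl -n longhorn-system get systembackups.longhorn.io -w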

(5) In cluster B, restore this system backup => volume test-1 can be found restored on the Longhorn UI, but it’s in the faulted state, and the job log keeps printing the following messages forever:

time="2022-12-26T03:48:36Z" level=debug msg="Restoring PersistentVolumeList" backupTargetURL="s3://yang-test-4@us-east-1/" controller=longhorn-system-rollout engineImage="longhornio/longhorn-engine:v1.4.x-head" error="volume is waiting to restore data" systemRestore=test-1
time="2022-12-26T03:48:37Z" level=debug msg="Restoring PersistentVolumeClaimList" backupTargetURL="s3://yang-test-4@us-east-1/" controller=longhorn-system-rollout engineImage="longhornio/longhorn-engine:v1.4.x-head" error="persistentvolume \"test-1\" not found" systemRestore=test-1

(6) Delete the restore job and the faulted volume, then retry the restore; somehow the restore can then be completed without problems.
(7) In cluster B, create a pod as in step (2) to use the restored volume test-1 => after the volume is attached and becomes healthy, an abnormal snapshot appears in the snapshot list (see attached screenshot).

Regarding the backing-image test above (data missing from the volume restored via system backup): it’s expected, because a volume backup has to be created in order to restore the data in another cluster.

From @c3y1huang: if the backed-up volume doesn’t have a volume backup, the data will not be restored in cluster B; instead, only the volume itself will be restored.
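
A quick way to check whether a volume actually has a volume backup on the backup target before running the system restore (a sketch; the label selector is an assumption, drop it to list everything):

# BackupVolume CRs track which volumes have backups on the backup target
kubectl -n longhorn-system get backupvolumes.longhorn.io

# Individual backups for the test-1 volume (label selector is an assumption)
kubectl -n longhorn-system get backups.longhorn.io -l backup-volume=test-1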

@c3y1huang is this issue also happening in 1.3.2 or 1.3.x? @yangchiu can you help check?

Yes, this also happens in v1.3.x-head.

@c3y1huang will this happen when restoring the volume in the same cluster?

Yes

(6) kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/statefulset.yaml => the volume state transitions from detached to healthy, and the volume that has a backup ends up with a snapshot with an invalid date

The snapshot is missing the creationTime:

status:
  checksum: ""
  children:
    volume-head: true
  creationTime: ""

This doesn’t look related to system backup/restore and should be reproducible without it:

cluster-A:

  1. Create volume, PV, PVC.
  2. Attach to workload and create volume backup.

cluster-B:

  1. Restore a volume from the volume backup.
  2. Create PV/PVC with the same name.
  3. Attach to the workload the same as cluster-A.

New issue created: https://github.com/longhorn/longhorn/issues/5153
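
For reference, a quick way to spot snapshot CRs that are missing creationTime, as in the status snippet above (a sketch; the jsonpath filter is illustrative and assumes the Snapshot CRD available since v1.3):

# Print the names of snapshots whose status.creationTime is empty
kubectl -n longhorn-system get snapshots.longhorn.io \
  -o jsonpath='{range .items[?(@.status.creationTime=="")]}{.metadata.name}{"\n"}{end}'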

Logs from cluster B during the faulted restore in step (5) above:

2022-12-27T04:04:27.822650926Z time="2022-12-27T04:04:27Z" level=debug msg="Instance handler updated instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf state, old state stopped, new state error"
2022-12-27T04:04:27.822662076Z time="2022-12-27T04:04:27Z" level=warning msg="Instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf crashed on Instance Manager instance-manager-r-cb97de6b6f131c858bbc075efdf49eef at ip-10-0-1-163, try to get log"
2022-12-27T04:04:27.932368856Z time="2022-12-27T04:04:27Z" level=warning msg="Instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf crashed on Instance Manager instance-manager-r-cb97de6b6f131c858bbc075efdf49eef at ip-10-0-1-163, try to get log"
2022-12-27T04:04:28.049084766Z time="2022-12-27T04:04:28Z" level=warning msg="Replica pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf that not in the engine mode map is marked as failed, current state error, engine name pvc-2938f36e-e080-4272-9483-66ce4397abee-e-d8f78540, active true" accessMode=rwo controller=longhorn-volume currentEngine=pvc-2938f36e-e080-4272-9483-66ce4397abee-e-d8f78540 frontend=blockdev migratable=false node=ip-10-0-1-163 owner=ip-10-0-1-163 state=attaching volume=pvc-2938f36e-e080-4272-9483-66ce4397abee
2022-12-27T04:04:28.171337705Z time="2022-12-27T04:04:28Z" level=warning msg="Try to get requested log for pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf in instance manager instance-manager-r-cb97de6b6f131c858bbc075efdf49eef"
2022-12-27T04:04:28.172695476Z time="2022-12-27T04:04:28Z" level=info msg="Deleting instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf"
2022-12-27T04:04:28.174251657Z time="2022-12-27T04:04:28Z" level=warning msg="Instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf crashed on Instance Manager instance-manager-r-cb97de6b6f131c858bbc075efdf49eef at ip-10-0-1-163, try to get log"
2022-12-27T04:04:28.174471006Z time="2022-12-27T04:04:28Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf\", UID:\"09449713-25f1-436f-b938-b591d3e85f79\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2300\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf"
2022-12-27T04:04:28.175421113Z time="2022-12-27T04:04:28Z" level=error msg="Failed to receive log for instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf" error="rpc error: code = NotFound desc = cannot find process pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf"
2022-12-27T04:04:28.175443082Z time="2022-12-27T04:04:28Z" level=warning msg="cannot get crash log for instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf on Instance Manager instance-manager-r-cb97de6b6f131c858bbc075efdf49eef at ip-10-0-1-163" error="rpc error: code = NotFound desc = cannot find process pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf"
2022-12-27T04:04:28.282717593Z time="2022-12-27T04:04:28Z" level=info msg="Deleting instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf"
2022-12-27T04:04:28.284004630Z time="2022-12-27T04:04:28Z" level=warning msg="Instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf crashed on Instance Manager instance-manager-r-cb97de6b6f131c858bbc075efdf49eef at ip-10-0-1-163, try to get log"
2022-12-27T04:04:28.284417742Z time="2022-12-27T04:04:28Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf\", UID:\"09449713-25f1-436f-b938-b591d3e85f79\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2305\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf"
2022-12-27T04:04:28.285121515Z time="2022-12-27T04:04:28Z" level=error msg="Failed to receive log for instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf" error="rpc error: code = NotFound desc = cannot find process pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf"
2022-12-27T04:04:28.285135831Z time="2022-12-27T04:04:28Z" level=warning msg="cannot get crash log for instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf on Instance Manager instance-manager-r-cb97de6b6f131c858bbc075efdf49eef at ip-10-0-1-163" error="rpc error: code = NotFound desc = cannot find process pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf"
2022-12-27T04:04:28.380823798Z time="2022-12-27T04:04:28Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-2938f36e-e080-4272-9483-66ce4397abee\", UID:\"5318b4f9-351e-479b-9cfb-f23bbbe95443\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2285\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Persistent Volume pvc-2938f36e-e080-4272-9483-66ce4397abee started to use/reuse Longhorn volume pvc-2938f36e-e080-4272-9483-66ce4397abee"
2022-12-27T04:04:28.392566140Z time="2022-12-27T04:04:28Z" level=info msg="Deleting instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf"
2022-12-27T04:04:28.393833673Z time="2022-12-27T04:04:28Z" level=warning msg="Instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf crashed on Instance Manager instance-manager-r-cb97de6b6f131c858bbc075efdf49eef at ip-10-0-1-163, try to get log"
2022-12-27T04:04:28.394019456Z time="2022-12-27T04:04:28Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf\", UID:\"09449713-25f1-436f-b938-b591d3e85f79\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2307\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf"
2022-12-27T04:04:28.395203717Z time="2022-12-27T04:04:28Z" level=error msg="Failed to receive log for instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf" error="rpc error: code = NotFound desc = cannot find process pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf"
2022-12-27T04:04:28.395256562Z time="2022-12-27T04:04:28Z" level=warning msg="cannot get crash log for instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf on Instance Manager instance-manager-r-cb97de6b6f131c858bbc075efdf49eef at ip-10-0-1-163" error="rpc error: code = NotFound desc = cannot find process pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf"
2022-12-27T04:04:28.778365206Z time="2022-12-27T04:04:28Z" level=warning msg="Replica pvc-2938f36e-e080-4272-9483-66ce4397abee-r-2ddda446 that not in the engine mode map is marked as failed, current state error, engine name pvc-2938f36e-e080-4272-9483-66ce4397abee-e-d8f78540, active true" accessMode=rwo controller=longhorn-volume currentEngine=pvc-2938f36e-e080-4272-9483-66ce4397abee-e-d8f78540 frontend=blockdev migratable=false node=ip-10-0-1-163 owner=ip-10-0-1-163 state=attaching volume=pvc-2938f36e-e080-4272-9483-66ce4397abee
2022-12-27T04:04:28.820124594Z time="2022-12-27T04:04:28Z" level=debug msg="Instance handler updated instance pvc-2938f36e-e080-4272-9483-66ce4397abee-r-eaa80eaf state, old state error, new state stopped"
2022-12-27T04:04:29.731408367Z time="2022-12-27T04:04:29Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-2938f36e-e080-4272-9483-66ce4397abee\", UID:\"5318b4f9-351e-479b-9cfb-f23bbbe95443\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2348\", FieldPath:\"\"}): type: 'Normal' reason: 'Detached' volume pvc-2938f36e-e080-4272-9483-66ce4397abee has been detached"
2022-12-27T04:04:57.213448461Z time="2022-12-27T04:04:57Z" level=debug msg="CheckEngineImageReadiness: nodes [ip-10-0-1-163] don't have the engine image longhornio/longhorn-engine:v1.4.0-rc2"
2022-12-27T04:04:57.411356801Z time="2022-12-27T04:04:57Z" level=debug msg="CheckEngineImageReadiness: nodes [ip-10-0-1-163] don't have the engine image longhornio/longhorn-engine:v1.4.0-rc2"
2022-12-27T04:04:57.434589657Z time="2022-12-27T04:04:57Z" level=debug msg="CheckEngineImageReadiness: nodes [ip-10-0-1-163] don't have the engine image longhornio/longhorn-engine:v1.4.0-rc2"
2022-12-27T04:04:57.990938625Z time="2022-12-27T04:04:57Z" level=debug msg="CheckEngineImageReadiness: nodes [ip-10-0-1-163] don't have the engine image longhornio/longhorn-engine:v1.4.0-rc2"
2022-12-27T04:04:57.999917446Z time="2022-12-27T04:04:57Z" level=debug msg="CheckEngineImageReadiness: nodes [ip-10-0-1-163] don't have the engine image longhornio/longhorn-engine:v1.4.0-rc2"
2022-12-27T04:04:58.010601153Z time="2022-12-27T04:04:58Z" level=debug msg="CheckEngineImageReadiness: nodes [ip-10-0-1-163] don't have the engine image longhornio/longhorn-engine:v1.4.0-rc2"
2022-12-27T04:04:58.022101246Z time="2022-12-27T04:04:58Z" level=debug msg="CheckEngineImageReadiness: nodes [ip-10-0-1-163] don't have the engine image longhornio/longhorn-engine:v1.4.0-rc2"
2022-12-27T04:04:58.040252915Z time="2022-12-27T04:04:58Z" level=debug msg="CheckEngineImageReadiness: nodes [ip-10-0-1-163] don't have the engine image longhornio/longhorn-engine:v1.4.0-rc2"
2022-12-27T04:04:58.061237978Z time="2022-12-27T04:04:58Z" level=debug msg="CheckEngineImageReadiness: nodes [ip-10-0-1-163] don't have the engine image longhornio/longhorn-engine:v1.4.0-rc2"
2022-12-27T04:04:58.084689929Z time="2022-12-27T04:04:58Z" level=debug msg="CheckEngineImageReadiness: nodes [ip-10-0-1-163] don't have the engine image longhornio/longhorn-engine:v1.4.0-rc2"
2022-12-27T04:05:02.544264665Z time="2022-12-27T04:05:02Z" level=debug msg="CheckEngineImageReadiness: nodes [ip-10-0-1-140] don't have the engine image longhornio/longhorn-engine:v1.4.0-rc2"
2022-12-27T04:05:02.651039867Z time="2022-12-27T04:05:02Z" level=debug msg="Engine Image Controller ip-10-0-1-163 picked up ei-6eab9564 (longhornio/longhorn-engine:v1.4.0-rc2)" controller=longhorn-engine-image engineImage=ei-6eab9564 image="longhornio/longhorn-engine:v1.4.0-rc2" node=ip-10-0-1-163
2022-12-27T04:05:03.239491348Z time="2022-12-27T04:05:03Z" level=info msg="Event(v1.ObjectReference{Kind:\"EngineImage\", Namespace:\"longhorn-system\", Name:\"ei-6eab9564\", UID:\"b7ef12e8-7eae-46f1-84e2-33df519ef34e\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"2495\", FieldPath:\"\"}): type: 'Normal' reason: 'ready' Engine image ei-6eab9564 (longhornio/longhorn-engine:v1.4.0-rc2) is fully deployed on all ready nodes"

The replica failed when executing the engine binary before the engine image was fully deployed on all ready nodes.
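
A hedged way to confirm whether the engine image is fully deployed before the volumes start attaching (a sketch using the EngineImage CR name seen in the logs above; the expected state name is an assumption):

# Engine images and their deployment state across nodes
kubectl -n longhorn-system get engineimages.longhorn.io

# Check a specific image, e.g. the one from the logs; expect "deployed"
kubectl -n longhorn-system get engineimages.longhorn.io ei-6eab9564 -o jsonpath='{.status.state}{"\n"}'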

For the abnormal behavior described above (faulted volume on the first restore and an abnormal snapshot after retrying), there is no need to create the volume/PV/PVC manually from the Longhorn UI; it can also be reproduced with dynamic provisioning using the following steps:

In cluster A:
(1) kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/statefulset.yaml
(2) Create a backup for a volume from the Longhorn UI.
(3) Create a system backup.

In cluster B:
(4) Restore this system backup => the volume that has a backup ends up in the faulted state.
(5) Delete the restored volume and the restore job, then retry the restore => the volumes can be restored without error and remain in the detached state.
(6) kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/statefulset.yaml => the volume state transitions from detached to healthy, and the volume that has a backup ends up with a snapshot with an invalid date (the watch sketch below can help observe this).
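
To observe the detached -> healthy transition and then look for the odd snapshot, a small watch sketch (assuming the Longhorn Volume and Snapshot CRDs):

# Watch volume state/robustness while the statefulset attaches the volumes
kubectl -n longhorn-system get volumes.longhorn.io -w

# Afterwards, list snapshot CRs and look for one without creationTime
kubectl -n longhorn-system get snapshots.longhorn.io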

@c3y1huang Am I missing anything, or are there any other restrictions or prerequisites for restore?

I don’t think it will. Because the system restore only leverages volume backup restore, the behavior will be the same as restoring a volume backup (the restored volume will have no snapshots).

Wait. We should only allow restoring a system backup to a cluster with the same Longhorn version. @c3y1huang correct?

Yes, see restore path.
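
A hedged way to double-check that both clusters run the same Longhorn version before attempting the system restore (a sketch; it reads the longhorn-manager image tag rather than any dedicated setting):

# Run in both clusters and compare the image tags
kubectl -n longhorn-system get daemonset longhorn-manager \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'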

The test is good, but it doesn’t seem valid enough on its own because RC1 is still a transient version. Does RC2 work well when restoring from a backup created by RC2? I think yes, and that is the better criterion.

However, this means we need to pay attention to backward compatibility of system backup and restore for system backups created by a previous version.

@khushboo-rancher Let’s add this test case to the regression testing if not yet.