longhorn: [BUG] Backend sizes do not match 5368709120 != 10737418240 in the engine initiation phase

Describe the bug

A volume that had been working fine suddenly started attaching and detaching in an endless loop. Inspecting the logs, I can see the following error in the corresponding instance-manager:

BUG: Backend sizes do not match 5368709120 != 10737418240 in the engine initiation phase

Expected behavior

The volume should be attached and mounted correctly.

Log or Support bundle

Some logs from the instance-manager:

[longhorn-instance-manager] time="2022-02-08T15:22:29Z" level=info msg="Process Manager: process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6 error out, error msg: exit status 1"
[longhorn-instance-manager] time="2022-02-08T15:22:29Z" level=debug msg="Process update: pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6: state error: Error: exit status 1"
[longhorn-instance-manager] time="2022-02-08T15:22:29Z" level=info msg="stop waiting for gRPC service of process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6 to start at localhost:10000"
[longhorn-instance-manager] time="2022-02-08T15:22:29Z" level=debug msg="Process update: pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6: state error: Error: exit status 1"
[longhorn-instance-manager] time="2022-02-08T15:22:29Z" level=debug msg="Process update: pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6: state error: Error: exit status 1"
[longhorn-instance-manager] time="2022-02-08T15:22:29Z" level=debug msg="Process Manager: start getting logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:29Z" level=debug msg="Process Manager: got logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:29Z" level=debug msg="Process Manager: start getting logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:29Z" level=debug msg="Process Manager: got logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:30Z" level=debug msg="Process Manager: start getting logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:30Z" level=debug msg="Process Manager: got logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:30Z" level=debug msg="Process Manager: prepare to delete process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:30Z" level=debug msg="Process update: pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6: state error: Error: exit status 1"
[longhorn-instance-manager] time="2022-02-08T15:22:30Z" level=debug msg="Process Manager: deleted process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:30Z" level=info msg="Process Manager: successfully unregistered process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:30Z" level=debug msg="Process update: pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6: state error: Error: exit status 1"
[longhorn-instance-manager] time="2022-02-08T15:22:30Z" level=debug msg="Process Manager: start getting logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:30Z" level=debug msg="Process Manager: prepare to delete process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:30Z" level=debug msg="Process Manager: start getting logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:36Z" level=info msg="Process Manager: prepare to create process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:36Z" level=debug msg="Process Manager: validate process path: /engine-binaries/longhornio-longhorn-engine-v1.2.3/longhorn dir: /engine-binaries/ image: longhornio-longhorn-engine-v1.2.3 binary: longhorn"
[longhorn-instance-manager] time="2022-02-08T15:22:36Z" level=info msg="Process Manager: created process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6] time="2022-02-08T15:22:36Z" level=info msg="Starting with replicas [\"tcp://10.42.2.201:10000\" \"tcp://10.42.0.95:10000\" \"tcp://10.42.1.178:10000\"]"
time="2022-02-08T15:22:36Z" level=info msg="Connecting to remote: 10.42.2.201:10000"
[pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6] time="2022-02-08T15:22:36Z" level=info msg="Opening: 10.42.2.201:10000"
[pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6] time="2022-02-08T15:22:36Z" level=info msg="Connecting to remote: 10.42.0.95:10000"
[pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6] time="2022-02-08T15:22:36Z" level=info msg="Opening: 10.42.0.95:10000"
[pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6] time="2022-02-08T15:22:36Z" level=info msg="Adding backend: tcp://10.42.2.201:10000"
time="2022-02-08T15:22:36Z" level=info msg="Adding backend: tcp://10.42.0.95:10000"
time="2022-02-08T15:22:36Z" level=info msg="Adding backend: tcp://10.42.1.178:10000"
2022/02/08 15:22:36 BUG: Backend sizes do not match 5368709120 != 10737418240 in the engine initiation phase
[longhorn-instance-manager] time="2022-02-08T15:22:36Z" level=info msg="Process Manager: process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6 error out, error msg: exit status 1"
[longhorn-instance-manager] time="2022-02-08T15:22:36Z" level=debug msg="Process update: pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6: state error: Error: exit status 1"
[longhorn-instance-manager] time="2022-02-08T15:22:36Z" level=info msg="stop waiting for gRPC service of process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6 to start at localhost:10000"
[longhorn-instance-manager] time="2022-02-08T15:22:37Z" level=debug msg="Process update: pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6: state error: Error: exit status 1"
[longhorn-instance-manager] time="2022-02-08T15:22:37Z" level=debug msg="Process update: pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6: state error: Error: exit status 1"
[longhorn-instance-manager] time="2022-02-08T15:22:37Z" level=debug msg="Process Manager: start getting logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:37Z" level=debug msg="Process Manager: got logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:37Z" level=debug msg="Process Manager: start getting logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:37Z" level=debug msg="Process Manager: got logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:39Z" level=debug msg="Process Manager: start getting logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:39Z" level=debug msg="Process Manager: got logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:39Z" level=debug msg="Process Manager: prepare to delete process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:39Z" level=debug msg="Process update: pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6: state error: Error: exit status 1"
[longhorn-instance-manager] time="2022-02-08T15:22:39Z" level=debug msg="Process Manager: deleted process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:39Z" level=info msg="Process Manager: successfully unregistered process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:39Z" level=debug msg="Process update: pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6: state error: Error: exit status 1"
[longhorn-instance-manager] time="2022-02-08T15:22:39Z" level=debug msg="Process Manager: start getting logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:39Z" level=debug msg="Process Manager: prepare to delete process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:39Z" level=debug msg="Process Manager: start getting logs for process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:44Z" level=info msg="Process Manager: prepare to create process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[longhorn-instance-manager] time="2022-02-08T15:22:44Z" level=debug msg="Process Manager: validate process path: /engine-binaries/longhornio-longhorn-engine-v1.2.3/longhorn dir: /engine-binaries/ image: longhornio-longhorn-engine-v1.2.3 binary: longhorn"
[longhorn-instance-manager] time="2022-02-08T15:22:44Z" level=info msg="Process Manager: created process pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6"
[pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6] time="2022-02-08T15:22:44Z" level=info msg="Starting with replicas [\"tcp://10.42.2.201:10000\" \"tcp://10.42.0.95:10000\" \"tcp://10.42.1.178:10000\"]"
[pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6] time="2022-02-08T15:22:44Z" level=info msg="Connecting to remote: 10.42.2.201:10000"
[pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6] time="2022-02-08T15:22:44Z" level=info msg="Opening: 10.42.2.201:10000"
[pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6] time="2022-02-08T15:22:44Z" level=info msg="Connecting to remote: 10.42.0.95:10000"
time="2022-02-08T15:22:44Z" level=info msg="Opening: 10.42.0.95:10000"
[pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6] time="2022-02-08T15:22:44Z" level=info msg="Adding backend: tcp://10.42.2.201:10000"
time="2022-02-08T15:22:44Z" level=info msg="Adding backend: tcp://10.42.0.95:10000"
time="2022-02-08T15:22:44Z" level=info msg="Adding backend: tcp://10.42.1.178:10000"
[pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-e-3a430bc6] 2022/02/08 15:22:44 BUG: Backend sizes do not match 5368709120 != 10737418240 in the engine initiation phase

Environment

  • Longhorn version: 1.2.3
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 18.04
    • CPU per node: 1
    • Memory per node: 6GB
    • Disk type (e.g. SSD/NVMe): HDD
    • Network bandwidth between the nodes: Unknown
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Unknown, virtual machines
  • Number of Longhorn volumes in the cluster: 20

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 20 (14 by maintainers)

Most upvoted comments

I list the following cases. (The case table was attached as an image in the original issue.)

From the cases, if we send e.Spec.Size and e.Status.CurrentSize to the engine, then we can figure out the crashed replicas from those two values and the replicas’ sizes. Make sense?

YES. If a replica fails to expand and the volume then gets detached, this issue may be triggered on the next attachment. In this user’s case, the replica on worker-1 contains the incorrect volume size. During initialization, the Longhorn engine may need to figure out which replicas crashed, mark them as ERROR, and continue the attachment rather than erroring out directly.

https://github.com/longhorn/longhorn-engine/blob/051ebde20b02cdeb11ff49588f5bf8e7d527c7eb/pkg/controller/control.go#L596-L627
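
For reference, the linked lines are where the engine currently gives up when backend sizes disagree. Below is a minimal sketch of that check and of the proposed tolerant variant; the types and function names are simplified assumptions, not the actual control.go code.

package main

import "fmt"

type backend struct {
	addr string
	size int64
}

// strictInit mirrors today's behavior: the first backend fixes the volume
// size, and any later backend with a different size aborts engine startup.
func strictInit(backends []backend) (int64, error) {
	var size int64
	for _, b := range backends {
		if size == 0 {
			size = b.size
		} else if b.size != size {
			return 0, fmt.Errorf("BUG: Backend sizes do not match %d != %d in the engine initiation phase", b.size, size)
		}
	}
	return size, nil
}

// tolerantInit sketches the proposal: with the expected size known up front
// (e.g. e.Status.CurrentSize passed in by the manager), mismatched replicas
// are marked as failed and startup continues with the healthy ones.
func tolerantInit(backends []backend, expected int64) (healthy, failed []backend) {
	for _, b := range backends {
		if b.size == expected {
			healthy = append(healthy, b)
		} else {
			failed = append(failed, b) // would be set to mode ERR instead of aborting
		}
	}
	return healthy, failed
}

func main() {
	bs := []backend{
		{"tcp://10.42.2.201:10000", 10737418240},
		{"tcp://10.42.0.95:10000", 10737418240},
		{"tcp://10.42.1.178:10000", 5368709120}, // the stale replica from this issue
	}
	if _, err := strictInit(bs); err != nil {
		fmt.Println(err) // reproduces the error message seen in the logs above
	}
	healthy, failed := tolerantInit(bs, 10737418240)
	fmt.Printf("healthy=%d failed=%d\n", len(healthy), len(failed))
}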

@derekbit May I ask a question about case 7? It seems that this is the case in which the replica successfully expanded but the engine crashed. In this case, how do we recover?

Good catch. I think in case 7, we can use spec.size as the ground truth. Then, we should send both spec.size and status.currentSize to the engine.

@derekbit May I ask a question about case 7? It seems that this is the case in which the replica successfully expanded but the engine crashed. In this case, how do we recover?

After checking the expansion flow:

v.Spec.Size is the nominal size of a volume. v.Status.ActualSize is the total size of the snapshot files and the volume head.

e.Spec.Size is also the nominal size of a volume. When an expansion is requested, e.Spec.Size becomes the new size, even though the expansion has not completed yet.

e.Status.CurrentSize is the nominal size the engine is currently serving. Until the expansion finishes, it still holds the old size.

So, I think we should use e.Status.CurrentSize as the ground truth. One exception: when a newly created volume is attached for the first time, e.Status.CurrentSize is zero, so we can use e.Spec.Size as the ground truth in that case. The logic would be

	// Use the engine's current size as the ground truth; fall back to the
	// spec size only for a newly created volume that has never been
	// attached, where CurrentSize is still zero.
	volumeSize := e.Status.CurrentSize
	if volumeSize == 0 {
		volumeSize = e.Spec.VolumeSize
	}

	return c.EngineProcessCreate(e.Name, e.Spec.VolumeName, volumeSize, e.Spec.EngineImage, frontend, e.Status.CurrentReplicaAddressMap, e.Spec.RevisionCounterDisabled, e.Spec.SalvageRequested)

WDYT? cc @shuo-wu

@shuo-wu Can we add a --vol-size flag to the engine controller start command? Then the volume size from the volume CR can be the ground truth.
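
A hypothetical sketch of what that could look like with urfave/cli (which longhorn-engine uses for its commands); the flag name and wiring are assumptions, not an existing option:

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/urfave/cli"
)

func main() {
	app := cli.NewApp()
	app.Name = "controller-sketch"
	app.Flags = []cli.Flag{
		cli.Int64Flag{
			// Hypothetical flag: the expected volume size in bytes, taken
			// from the volume CR as the ground truth.
			Name:  "vol-size",
			Usage: "expected volume size in bytes",
		},
	}
	app.Action = func(c *cli.Context) error {
		expected := c.Int64("vol-size")
		// The engine would validate each backend against this size instead
		// of trusting whichever replica happens to be added first.
		fmt.Printf("validating backends against %d bytes\n", expected)
		return nil
	}
	if err := app.Run(os.Args); err != nil {
		log.Fatal(err)
	}
}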

I solved the issue with the following steps:

At worker-2: mv /var/lib/longhorn/replicas/pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-bbefd2c2 /var/lib/longhorn/replicas/pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-bbefd2c2.bak

At worker-3: mv /var/lib/longhorn/replicas/pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-1841f12b /var/lib/longhorn/replicas/pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-1841f12b.bak

I tried to attach the volume, but the same error persisted. However, the new replicas at worker-2 and worker-3 seemed to have the same size.

Next, I executed the same command at worker-1: mv /var/lib/longhorn/replicas/pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-0bbe59c9 /var/lib/longhorn/replicas/pvc-e1b8218e-152b-4fcb-893c-09c6f5fd29a4-0bbe59c9.bak

I tried to attach the volume and it worked.

Thank you!
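
For anyone else hitting this: before moving replica directories aside, you can check which replica disagrees by reading the size recorded in each replica’s volume.meta file. A small diagnostic sketch, assuming the file is JSON with Size and Head fields as in a standard Longhorn replica directory:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// volumeMeta holds the fields we care about from a replica's volume.meta.
type volumeMeta struct {
	Size int64  `json:"Size"`
	Head string `json:"Head"`
}

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s /var/lib/longhorn/replicas/<replica>/volume.meta", os.Args[0])
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	var meta volumeMeta
	if err := json.Unmarshal(data, &meta); err != nil {
		log.Fatal(err)
	}
	// Compare this Size across all replicas of the volume; the odd one out
	// is the replica to move aside or delete.
	fmt.Printf("%s: Size=%d Head=%s\n", os.Args[1], meta.Size, meta.Head)
}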

@shuo-wu is there any other way the replica size could end up differing among replicas, apart from the volume meta file being modified accidentally? cc @joshimoo