longhorn: [BUG] Somehow the Rebuilding field inside volume.meta is set to true, causing the volume to get stuck in an attaching/detaching loop

Describe the bug

Somehow the Rebuilding field inside volume.meta is set to true, causing the volume to get stuck in an attaching/detaching loop.

To Reproduce

Not sure how to reproduce yet. There is a situation in which the volume is trying to attach and the replica process is running. However, when the engine process tries to connect to the replica, it hits this error:

2022-06-22T17:35:28.031550367Z [pvc-126d40e2-7bff-4679-a310-e444e84df267-e-f7e3192f] 2022/06/22 17:35:28 Invalid state rebuilding for getting revision counter

This indicates that the engine process pvc-126d40e2-7bff-4679-a310-e444e84df267-e-f7e3192f of the volume cannot start (and thus repeatedly crashes), because it failed to get the revision counter (GetRevisionCounter) of the replica, which is in the rebuilding state. That replica is the only replica of the volume.
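For context, that error comes from a state check on the replica side: a replica that is not fully open will not report its revision counter to the engine. Below is a minimal illustrative sketch of that kind of guard; the type and field names are assumptions for illustration, not the actual longhorn-engine source.

```go
package main

import "fmt"

// ReplicaState mirrors the kind of state machine a replica process keeps.
// The names here are illustrative assumptions, not longhorn-engine types.
type ReplicaState string

const (
	StateOpen       ReplicaState = "open"
	StateRebuilding ReplicaState = "rebuilding"
	StateError      ReplicaState = "error"
)

type Replica struct {
	State           ReplicaState
	RevisionCounter int64
}

// GetRevisionCounter refuses to answer unless the replica is fully open.
// An engine that picks a replica stuck in "rebuilding" therefore crash-loops
// with "Invalid state rebuilding for getting revision counter".
func (r *Replica) GetRevisionCounter() (int64, error) {
	if r.State != StateOpen {
		return -1, fmt.Errorf("Invalid state %v for getting revision counter", r.State)
	}
	return r.RevisionCounter, nil
}

func main() {
	r := &Replica{State: StateRebuilding}
	if _, err := r.GetRevisionCounter(); err != nil {
		fmt.Println(err) // Invalid state rebuilding for getting revision counter
	}
}
```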

So the question here is: how did the replica get into this non-recoverable state?

See here for more details

Expected behavior

The volume should not get stuck in an attaching/detaching loop.

Environment

  • Longhorn version: Longhorn 1.2.4

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 19 (18 by maintainers)

Most upvoted comments

To prevent the mistaken deletion of the engine image, we can count the number of replicas and engines using the image rather than the number of volumes.

Sounds good to me. However, we are not sure if this issue is caused by deleting the engine image. I think we can create a new ticket for the engine image reference count instead. What do you think, @innobead?

From another user’s report, this might be the root cause:

There was one volume with 2 replicas (A, B). A failed first, and B tried to rebuild A by syncing files. During the syncing, B failed to send the file, so both A and B failed:

time=\"2023-05-01T11:51:44Z\" level=error msg=\"Sync agent gRPC server failed to rebuild replica/sync files: replica tcp://10.42.3.11:10000 failed to send file volume-snap-c849fe9d-f938-403d-b0fc-e0cdd8d31467.img to 10.42.2.9:10021: failed to send file volume-snap-c849fe9d-f938-403d-b0fc-e0cdd8d31467.img to 10.42.2.9:10021: rpc error: code = Unavailable desc = transport is closing\""
ERROR: 2023/05/01 11:51:44 grpc: server failed to encode response:  rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil"

However, the engine chose A, which had just failed during rebuilding, as the salvage replica. But since replica A's state was still rebuilding: true, the engine failed to start with the error:

time=\"2023-05-01T11:52:09Z\" level=info msg=\"Starting with replicas [\\\"tcp://10.42.2.9:10090\\\"]\""
time=\"2023-05-01T11:52:09Z\" level=info msg=\"Connecting to remote: 10.42.2.9:10090\""
time=\"2023-05-01T11:52:09Z\" level=info msg=\"Opening: 10.42.2.9:10090\""
time=\"2023-05-01T11:52:09Z\" level=warning msg=\"backend tcp://10.42.2.9:10090 is in the invalid state rebuilding\""

And the reason why it got selected might be that in its replica spec it had healthyAt="XXX" and failedAt="", which looked like they had not been updated correctly.
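If that is the case, a simplified view of how a stale replica spec can make a just-failed replica look like the best salvage candidate is sketched below. The selection logic here is a hedged illustration of the idea, not the actual longhorn-manager code.

```go
package main

import "fmt"

// ReplicaInfo is a pared-down view of a replica spec; only the two fields
// discussed above are modeled.
type ReplicaInfo struct {
	Name      string
	Address   string
	HealthyAt string // last time the replica was known healthy, "" if never
	FailedAt  string // time the replica was marked failed, "" if not marked
}

// pickSalvageCandidates returns replicas that look usable on paper: they were
// healthy at some point and are not currently marked failed. If failedAt was
// never written after a crash, a bad replica slips through this filter.
func pickSalvageCandidates(replicas []ReplicaInfo) []ReplicaInfo {
	var candidates []ReplicaInfo
	for _, r := range replicas {
		if r.HealthyAt != "" && r.FailedAt == "" {
			candidates = append(candidates, r)
		}
	}
	return candidates
}

func main() {
	replicas := []ReplicaInfo{
		// Replica A: actually failed mid-rebuild, but failedAt was never updated.
		{Name: "replica-a", Address: "tcp://10.42.2.9:10090", HealthyAt: "2023-05-01T11:40:00Z", FailedAt: ""},
		// Replica B: correctly marked failed.
		{Name: "replica-b", Address: "tcp://10.42.3.11:10000", HealthyAt: "2023-05-01T11:40:00Z", FailedAt: "2023-05-01T11:51:44Z"},
	}
	for _, c := range pickSalvageCandidates(replicas) {
		// Prints replica-a, even though its on-disk state is still "rebuilding".
		fmt.Println("salvage candidate:", c.Name, c.Address)
	}
}
```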

Let’s close this to prevent confusion about the reason for reopening. Let’s use https://github.com/longhorn/longhorn/issues/6626 instead.

Yeah, the implementation is for resilience and the root cause is not identified.

Reopening this ticket because we have not identified the root cause. The PR here mitigates the consequence when the volume has multiple replicas, but some users with a single replica are still hitting this.

Ref: https://github.com/longhorn/longhorn/issues/4212#issuecomment-1200465946

cc @innobead

4. workload and the volume is attached: the problematic replica is marked as error and then deleted, a new replica is rebuilt, and all replicas come back to running. The volume will not be stuck in the attaching/detaching loop.

Yes. The fix will result in the case you described. What I mentioned is the case before the fix.

If a volume has multiple replicas and some replicas’ volume.meta files have illegal values, we can consider deleting these problematic replicas to avoid the failure.

I agree with this point

After investigating the logs of the three problematic volumes: all three volumes have only one replica, and their volume.meta’s Rebuilding field is set to true. Because the information in the logs is insufficient, I’m still not sure how the volumes ran into the error or how to reproduce it.

But for now we can make the volume attachment process more robust during attaching (a rough sketch follows this list):

  • If a volume has multiple replicas and some replicas’ volume.meta files have illegal values, we can consider deleting these problematic replicas to avoid the failure.
  • But if a volume has only one replica, we can also fix the illegal values and then attach the volume. However, even after fixing the illegal values, the data in the volume might be lost, so I am not sure if the fix is necessary.
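A rough sketch of what such an attach-time check could look like, assuming each replica data directory holds a JSON volume.meta with a boolean Rebuilding field; the helper names and the delete-vs-repair policy here are illustrative only, not the actual implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// readMeta loads <replicaDataDir>/volume.meta as a generic map so that
// fields we do not model here are preserved if the file is rewritten.
func readMeta(dir string) (map[string]interface{}, error) {
	data, err := os.ReadFile(filepath.Join(dir, "volume.meta"))
	if err != nil {
		return nil, err
	}
	meta := map[string]interface{}{}
	if err := json.Unmarshal(data, &meta); err != nil {
		return nil, err
	}
	return meta, nil
}

// sanitizeReplicasBeforeAttach applies the policy discussed above:
//   - multiple replicas: return the directories whose volume.meta still says
//     Rebuilding=true so the caller can delete those replicas and let them be
//     rebuilt from a healthy one;
//   - single replica: reset Rebuilding=false in place so the volume can at
//     least attach (the data may still be incomplete).
func sanitizeReplicasBeforeAttach(replicaDataDirs []string) (problematic []string, err error) {
	metas := map[string]map[string]interface{}{}
	for _, dir := range replicaDataDirs {
		meta, err := readMeta(dir)
		if err != nil {
			return nil, err
		}
		if rebuilding, ok := meta["Rebuilding"].(bool); ok && rebuilding {
			problematic = append(problematic, dir)
			metas[dir] = meta
		}
	}
	if len(problematic) == 0 || len(replicaDataDirs) > 1 {
		return problematic, nil
	}

	// Single replica: flip the flag, keep every other field untouched.
	dir := problematic[0]
	metas[dir]["Rebuilding"] = false
	out, err := json.Marshal(metas[dir])
	if err != nil {
		return nil, err
	}
	return nil, os.WriteFile(filepath.Join(dir, "volume.meta"), out, 0600)
}

func main() {
	// Hypothetical replica data directory; adjust when actually testing.
	bad, err := sanitizeReplicasBeforeAttach([]string{"/var/lib/longhorn/replicas/pvc-example-abc123"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("replicas to delete before attaching:", bad)
}
```

The single-replica branch only clears the flag so the engine can start; as noted above, that does not guarantee the data is intact.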

cc @shuo-wu @innobead

Agreed. BTW, if we are somehow still unable to figure out the root cause of this issue, we can revisit it after the new ref count is introduced and see if this issue still happens.

To prevent the mistaken deletion of the engine image, we can count the number of replicas and engines using the image rather than the number of volumes.

Sounds good to me. However, we are not sure if this issue is caused by deleting the engine image. I think we can create a new ticket for the engine image reference count instead. What do you think, @innobead?

Agree. I will open a new issue handling the refcount part.

To prevent the mistaken deletion of the engine image, we can count the number of replicas and engines using the image rather than the number of volumes.

Sounds good to me.

cc @shuo-wu @PhanLe1010

Did an investigation of the engineimage.status.refcount.

The refcount is calculated in the engineimage controller (here and here). It counts the volumes whose spec or status is using this engine image.

To prevent the mistaken deletion of the engine image, we can count the replicas and engines using the image rather than the volumes.
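A hedged sketch of what that replica/engine-based counting could look like, over pared-down objects; the type names and the image fields consulted here are assumptions, not the longhorn-manager datastore API.

```go
package main

import "fmt"

// Pared-down views of the objects that can pin an engine image. The real
// CRDs carry image references in both spec and status.
type Engine struct {
	SpecEngineImage    string
	StatusCurrentImage string
}

type Replica struct {
	SpecEngineImage    string
	StatusCurrentImage string
}

// engineImageRefCount counts engines and replicas (instead of volumes) that
// reference the image in either spec or status, so an image still used by a
// lingering replica process keeps a non-zero refcount.
func engineImageRefCount(image string, engines []Engine, replicas []Replica) int {
	count := 0
	for _, e := range engines {
		if e.SpecEngineImage == image || e.StatusCurrentImage == image {
			count++
		}
	}
	for _, r := range replicas {
		if r.SpecEngineImage == image || r.StatusCurrentImage == image {
			count++
		}
	}
	return count
}

func main() {
	oldImage := "longhornio/longhorn-engine:v1.2.4"
	newImage := "longhornio/longhorn-engine:v1.3.0"
	engines := []Engine{{SpecEngineImage: newImage, StatusCurrentImage: newImage}}
	// An old replica process is still running with the previous image.
	replicas := []Replica{{SpecEngineImage: newImage, StatusCurrentImage: oldImage}}

	fmt.Println("refcount for", oldImage, "=", engineImageRefCount(oldImage, engines, replicas))
	// Prints 1: a volume-based count might already be zero here, but the
	// replica-based count blocks deletion of the old image.
}
```

The point of counting at the engine/replica level is that a lingering old replica process keeps the image’s refcount non-zero even when every volume already points at a newer image.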

from @joshimoo @innobead

I feel we should have something in our validation webhook to prevent that deletion if the image is still referenced by any volume anyway.

This is probably related to the active engine image being deleted. I think there is a reference counter, but it only includes the volume.spec / engine.spec stuff, not necessarily the still-active old replica processes.

I agree we should prevent deletion of an engine image that is still required in any form or fashion.
The manual deletion mentioned above is just a hunch on the potential cause of the invalid meta stuff.

cc @shuo-wu @derekbit