longhorn: [BUG] Somehow the Rebuilding field inside volume.meta is set to true, causing the volume to get stuck in an attaching/detaching loop
Describe the bug
Somehow the Rebuilding field inside volume.meta is set to true, causing the volume to get stuck in an attaching/detaching loop.
To Reproduce
Not sure how to reproduce yet.
There is a situation in which the volume is trying to attach. The replica process is running. However, when the engine process tries to connect to the replica, it hits this error:

`2022-06-22T17:35:28.031550367Z [pvc-126d40e2-7bff-4679-a310-e444e84df267-e-f7e3192f] 2022/06/22 17:35:28 Invalid state rebuilding for getting revision counter`

This indicates that the engine process `pvc-126d40e2-7bff-4679-a310-e444e84df267-e-f7e3192f` of the volume cannot start (and thus repeatedly crashes) because it failed to get the revision counter (`GetRevisionCounter`) of the replica, which is in the rebuilding state. That replica is the only replica of the volume.
So the question here is: how did the replica get into this non-recoverable state?
See here for more details
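For context, the error message suggests a state guard on the replica server: the revision counter is only served while the replica is fully open. Below is a minimal Go sketch of that kind of guard; the `Replica` type and state names are illustrative assumptions, not the actual longhorn-engine code.

```go
package main

import "fmt"

// ReplicaState mirrors the kind of state machine a replica server keeps.
type ReplicaState string

const (
	StateOpen       ReplicaState = "open"
	StateRebuilding ReplicaState = "rebuilding"
)

// Replica is a simplified stand-in for a replica process.
type Replica struct {
	state           ReplicaState
	revisionCounter int64
}

// GetRevisionCounter refuses to serve the counter unless the replica is
// fully open, which is the guard the engine error message points at.
func (r *Replica) GetRevisionCounter() (int64, error) {
	if r.state != StateOpen {
		return 0, fmt.Errorf("invalid state %v for getting revision counter", r.state)
	}
	return r.revisionCounter, nil
}

func main() {
	r := &Replica{state: StateRebuilding}
	if _, err := r.GetRevisionCounter(); err != nil {
		// The engine treats this as fatal during startup, so it crashes and
		// the manager retries the attach, producing the loop described above.
		fmt.Println("engine cannot start:", err)
	}
}
```

Under such a guard, a replica whose metadata was left marked as rebuilding can never hand out its revision counter, so a single-replica volume has no way to start.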
Expected behavior
The volume should not be stuck in an attaching/detaching loop.
Environment
- Longhorn version: Longhorn 1.2.4
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 19 (18 by maintainers)
Sounds good to me. However, we are not sure if this issue is caused by deleting the engine image. I think we can create a new ticket for the engine image reference count instead. What do you think, @innobead?
From another user's report, this might be the root cause:
There was one volume with 2 replicas (A, B). A failed first, and B tried to rebuild A by syncing files. During the syncing, B failed to send the file, so both A and B failed.
However, the engine chose A, which had just failed during rebuilding, as the salvage replica. But since replica A's state was still `rebuilding: true`, the engine failed to start with the error. And the reason A got selected might be that its replica spec had `healthyAt="XXX"` and `failedAt=""`, which looked like they had not been updated correctly.
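One shape a more defensive salvage selection could take is sketched below in Go. The `replicaInfo` and `pickSalvageReplica` names are hypothetical, not the actual longhorn-manager implementation: skip replicas still marked as rebuilding when a clean candidate exists, and only reset the stale flag as a last resort.

```go
package main

import (
	"errors"
	"fmt"
)

// replicaInfo is an assumed stand-in for the fields the manager looks at
// when salvaging; it is not the real longhorn-manager type.
type replicaInfo struct {
	Name       string
	Rebuilding bool
	FailedAt   string
	HealthyAt  string
}

// pickSalvageReplica prefers a replica that is neither failed nor marked
// rebuilding; only as a last resort does it take a stale candidate and
// clear the rebuilding mark so the engine can start.
func pickSalvageReplica(replicas []replicaInfo) (*replicaInfo, error) {
	var fallback *replicaInfo
	for i := range replicas {
		r := &replicas[i]
		if r.FailedAt == "" && !r.Rebuilding {
			return r, nil
		}
		if fallback == nil {
			fallback = r
		}
	}
	if fallback == nil {
		return nil, errors.New("no salvage candidate")
	}
	fallback.Rebuilding = false // reset the stale flag instead of crash-looping
	return fallback, nil
}

func main() {
	replicas := []replicaInfo{
		{Name: "A", Rebuilding: true, FailedAt: "", HealthyAt: "XXX"},
	}
	r, err := pickSalvageReplica(replicas)
	fmt.Println(r, err)
}
```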
Let's close this to prevent confusion about the reason for reopening. Let's use https://github.com/longhorn/longhorn/issues/6626 instead.
Yeah, the implementation is for resilience, and the root cause has not been identified.
Reopening this ticket because we have not identified the root cause. The PR here mitigates the consequence when the volume has multiple replicas, but some users with a single replica are still hitting this.
Ref: https://github.com/longhorn/longhorn/issues/4212#issuecomment-1200465946
cc @innobead
Yes. The fix will result in the case you described. What I mentioned is the case before the fix.
I agree with this point
After investigating the logs of the three problematic volumes: all three volumes have only one replica, and the volume.meta's Rebuilding field is set to `true`. Because the information in the logs is insufficient, I'm still not sure how the volumes ran into the error or how to reproduce it. But for now, we can make the volume attachment process more robust during attaching.
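For illustration: volume.meta is a small JSON file on the replica's data path, and clearing a stale `Rebuilding` flag there is what unblocks such a volume. Here is a hedged Go sketch of that inspection; the field set and the path are assumptions, so verify them against the real file (and back it up) before changing anything.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// VolumeMeta models the replica's on-disk volume.meta file. The field set
// here is an assumption for illustration, not a guaranteed schema.
type VolumeMeta struct {
	Size       int64  `json:"Size"`
	Head       string `json:"Head"`
	Dirty      bool   `json:"Dirty"`
	Rebuilding bool   `json:"Rebuilding"`
	Parent     string `json:"Parent"`
	SectorSize int64  `json:"SectorSize"`
}

func main() {
	// Hypothetical replica data path; the real one depends on the node's disk layout.
	path := "/var/lib/longhorn/replicas/pvc-xxxx/volume.meta"

	data, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	var meta VolumeMeta
	if err := json.Unmarshal(data, &meta); err != nil {
		panic(err)
	}

	// A stale Rebuilding=true on the volume's only replica is exactly what
	// blocks the engine from starting.
	if meta.Rebuilding {
		meta.Rebuilding = false
		out, _ := json.MarshalIndent(&meta, "", "  ")
		if err := os.WriteFile(path, out, 0600); err != nil {
			panic(err)
		}
		fmt.Println("cleared stale Rebuilding flag in", path)
	}
}
```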
cc @shuo-wu @innobead
Agreed. BTW, if we are somehow still unable to figure out the root cause of this issue, we can revisit it after the new ref count introduction and then see if this issue still happens.
Agree. I will open a new issue handling the `refcount` part.
Sounds good to me.
cc @shuo-wu @PhanLe1010
Did an investigation of the `engineimage.status.refcount`. The refcount is calculated in the engine image controller (here and here). It counts the volumes whose spec or status are using this engine image.
To prevent mistaken deletion of the engine image, we can count the replicas and engines using the image rather than the volumes (see the sketch at the end of this thread).
from @joshimoo @innobead
cc @shuo-wu @derekbit
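A hedged Go sketch of the proposed counting; the `Engine`/`Replica` shapes here are assumed stand-ins for the real longhorn-manager types, not the actual controller code.

```go
package main

import "fmt"

// Assumed minimal stand-ins for the longhorn-manager types; only the
// image reference matters for this sketch.
type Engine struct{ Image string }
type Replica struct{ Image string }

// countEngineImageRefs computes a refcount from engines and replicas
// instead of volumes, so an engine image stays referenced as long as any
// process still uses it, even after the owning volume's spec has moved on.
func countEngineImageRefs(image string, engines []Engine, replicas []Replica) int {
	count := 0
	for _, e := range engines {
		if e.Image == image {
			count++
		}
	}
	for _, r := range replicas {
		if r.Image == image {
			count++
		}
	}
	return count
}

func main() {
	engines := []Engine{{Image: "longhornio/longhorn-engine:v1.2.4"}}
	replicas := []Replica{
		{Image: "longhornio/longhorn-engine:v1.2.4"},
		{Image: "longhornio/longhorn-engine:v1.3.0"},
	}
	fmt.Println(countEngineImageRefs("longhornio/longhorn-engine:v1.2.4", engines, replicas)) // 2
}
```

Counting at the engine/replica level keeps the image referenced while any running process still uses it, which is exactly the window in which deleting the image would be a mistake.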