longhorn: [BUG] Replica rebuilding is possibly issued to a wrong replica
Describe the bug (🐛 if you encounter this issue)
The engine controller iterates over and compares e.Status.ReplicaModeMap and e.Status.CurrentReplicaAddressMap, then triggers a replica rebuilding for any replica that appears in e.Status.CurrentReplicaAddressMap but not in e.Status.ReplicaModeMap.
Because e.Status.ReplicaModeMap and e.Status.CurrentReplicaAddressMap might not be the latest result from the engine monitor, there is a chance that a replica's IP address is stale. The rebuild is then issued to the wrong replica and leads to an unexpected volume expansion (code).
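For illustration, here is a minimal Go sketch (hypothetical types and names, not the actual longhorn-manager code) of the comparison described above; the danger is that an address pulled from a stale CurrentReplicaAddressMap may now belong to a different replica process:

```go
package sketch

import "fmt"

// EngineStatus is a simplified stand-in for the relevant engine CR status fields.
type EngineStatus struct {
	ReplicaModeMap           map[string]string // replica name -> RW/WO/ERR
	CurrentReplicaAddressMap map[string]string // replica name -> ip:port
}

// replicasToRebuild returns the replicas the controller would try to rebuild:
// those with an address entry but no mode entry. If the address map is stale,
// the rebuild request is sent to whatever process now listens on that ip:port.
func replicasToRebuild(s EngineStatus) map[string]string {
	toRebuild := map[string]string{}
	for name, addr := range s.CurrentReplicaAddressMap {
		if _, known := s.ReplicaModeMap[name]; !known {
			fmt.Printf("would issue rebuild for %s at %s\n", name, addr)
			toRebuild[name] = addr
		}
	}
	return toRebuild
}
```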
To Reproduce
Steps to reproduce the behavior:
- Go to '…'
- Click on '…'
- Perform '…'
- See error
Expected behavior
A clear and concise description of what you expected to happen.
Log or Support bundle
If applicable, add the Longhorn managers' log or support bundle when the issue happens. You can generate a Support Bundle using the link at the footer of the Longhorn UI.
Environment
- Longhorn version: v1.3.1+
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
- Number of management node in the cluster:
- Number of worker node in the cluster:
- Node config
- OS type and version:
- CPU per node:
- Memory per node:
- Disk type(e.g. SSD/NVMe):
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:
Additional context
Add any other context about the problem here.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 23 (20 by maintainers)
I think it is valuable to add this test case into our e2e tests https://github.com/longhorn/longhorn/issues/5709#issuecomment-1521889589
WDYT @longhorn/qa @innobead ?
The full issue can't easily be recreated while the volume controller's replenishment backoff is enabled because:
The long and short of it is that we need to be really unlucky on timing for an unexpected expansion to occur.
Apologies for the huge amount of text in this issue! I have been hoping to capture a good bit of analysis in case we need to revisit it.
@innobead @shuo-wu @PhanLe1010
At this point, I'm quite confident my last comment reflects the real race condition we are dealing with and the true root cause of the inappropriate expansions. In every set of logs showing the issue I have seen (including when I have recreated it), we see the same sequence of events:
I have prepared a PR (pending a bit more testing) that should mitigate this issue. The fix is quite small and self-contained, and hopefully easily backported. I recommend we remove the require/lep label and close this issue out with that PR if it passes review. If we want to proceed with my more complex proposal, I recommend we open a new issue and reconsider priority.
Recreate and analysis
As @PhanLe1010 showed in his code flow analysis, the potential to hit this will exist as long as instance-manager-r is allowed to choose a new replica process's ports (and thus switch which replica is listening on which port in the middle of an engine controller reconciliation). However, I found a quirk in the engine controller that makes hitting it far more likely, and I can reproduce it pretty consistently.
Recreate steps
In this example, the volume likely eventually recovers. The engine is running with a correctly sized replica, so rebuilding the incorrectly sized replica continues to fail until a brand new, correctly sized, replica is rebuilt. Recreating the complete failures some users have reported is harder, and likely involves additional poorly timed events.
Contributing factors
Analyzing logs and changes to Kubernetes resources during the recreate, we see a couple of contributing factors.
One minute delay in getting correct replica instance status
Longhorn-manager starts instance-manager-r and immediately begins monitoring it for instance status updates. The replica controller uses these updates to inform the changes it makes to a replica's status. Curiously, longhorn-manager and instance-manager-r only log attempts to delete replica processes in the first minute of instance-manager-r running. The replica instance is definitely not running, but longhorn-manager thinks it is.
Early instance-manager-r logs for a 4 GB replica involved in an incorrect expansion.
Replica status during the above time period.
Instance-manager-r status during this time period.
As a result of this behavior, longhorn-manager thinks the replica is running (on a different port than it will eventually use) for a minute while it is actually shut down. This is mostly harmless, except that when the replica eventually starts running, the engine controller may order a rebuild using the old port before it learns about the new one.
Ten second lag between ordering a rebuild and it actually starting
When the engine controller attempts to add a replica, it purges snapshots first, then adds the replica. It does not proceed past the snapshot purge for at least 10 seconds, during which time the status of the running replica may change, and it does not check any updated status before proceeding.
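A hedged sketch of that ordering (the interface and function names are illustrative, not the real engine client API): the replica address is captured before the purge and never re-checked afterwards.

```go
package sketch

import "time"

// engineClient is a hypothetical stand-in for the engine proxy client.
type engineClient interface {
	SnapshotPurge() error
	SnapshotPurgeInProgress() (bool, error)
	ReplicaAdd(address string) error
}

// rebuildReplica mirrors the sequence described above: purge, wait (>= 10s),
// then add the replica using an address that may have gone stale meanwhile.
func rebuildReplica(c engineClient, addr string) error {
	if err := c.SnapshotPurge(); err != nil {
		return err
	}
	for {
		inProgress, err := c.SnapshotPurgeInProgress()
		if err != nil {
			return err
		}
		if !inProgress {
			break
		}
		// Nothing in this loop refreshes the replica's current address.
		time.Sleep(2 * time.Second)
	}
	// By now instance-manager-r may have restarted the replica on a new port,
	// but we still use the address captured before the purge.
	return c.ReplicaAdd(addr)
}
```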
Longhorn-manager logs before and during replica expansion
2023-04-24T18:21:58.159364012Z [longhorn-instance-manager] time="2023-04-24T18:21:58Z" level=info msg="Purging snapshots" serviceURL="10.42.150.88:10004"
We wait 10 seconds (long after the rest of the system knows the port has changed).
Engine status during this time
The bad expansion has already occurred and the rebuild has failed. However, the engine controller won't time out until a full 30 seconds have passed, so the wrong port remains in its currentReplicaAddressMap.
Proposal 1: Move port allocation logic back to Longhorn manager
This is a stab at something like @PhanLe1010's "Possible fix" number 1.
Motivation
During our discussion of this issue, @PhanLe1010 and I came to the conclusion that no fix to the engine controller alone would be sufficient. Our information about which replica addresses (most importantly, ports) an engine can talk to comes (in a roundabout way) from other controllers. In particular, the replica controller is acting to start/stop replicas, and the ports used by those replicas are chosen by instance-manager. Unless we move port allocation up the management stack and stabilize it, there will always be the potential for a race between the replica controller / instance-manager and the engine controller.
Considerations
High level design
We introduce a new PortMap field to InstanceManagerSpec. It is a map of InstanceNames to PortRanges, where a PortRange consists of a StartingPort and NumPorts. For example:
"instanceManager.Spec.portMap": ["volume1-r-7bab29b8": {"startingPort": 10000, "numPorts": 15}].We deprecate instance-managerās port selection mechanism and do port selection in the engine and replica controllers instead.
Both the engine and replica controllers already call instanceHandler.ReconcileInstanceState, which already gets the appropriate InstanceManager. If a replica's or engine's DesireState is running, we run port selection logic within ReconcileInstanceState and try to update the InstanceManager. If our update is rejected (due to conflict), we do not start the instance process; instead, we requeue and try again next time. If our update is accepted, we start the instance process with the ports we allocated.
In the same InstanceHandler.ReconcileInstanceState, if DesireState is stopped and CurrentState is stopped, we remove the instance from the new InstanceManagerSpec.PortMap. If our update is rejected, we requeue and try again next time.
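A hedged sketch of the reserve/release contract described in the last two paragraphs. The imClient interface and the minimal types are assumptions (the real code would go through instanceHandler.ReconcileInstanceState and the Longhorn clientset); the key point is that a rejected update means we requeue rather than start the process.

```go
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// Minimal stand-ins for the real CRD types.
type PortRange struct{ StartingPort, NumPorts int32 }
type InstanceManager struct {
	Spec struct{ PortMap map[string]PortRange }
}

// imClient is a hypothetical client; Update fails with a conflict error if
// another controller modified the InstanceManager first.
type imClient interface {
	Update(ctx context.Context, im *InstanceManager) (*InstanceManager, error)
}

// reservePorts records the chosen range in spec.portMap. Only if the update
// is accepted may the caller start the instance process with these ports.
func reservePorts(ctx context.Context, c imClient, im *InstanceManager, name string, r PortRange) (bool, error) {
	im.Spec.PortMap[name] = r
	if _, err := c.Update(ctx, im); err != nil {
		if apierrors.IsConflict(err) {
			return false, nil // requeue and retry on the next sync
		}
		return false, err
	}
	return true, nil
}

// releasePorts removes the reservation once DesireState and CurrentState are
// both stopped; a conflict again means requeue and try later.
func releasePorts(ctx context.Context, c imClient, im *InstanceManager, name string) (bool, error) {
	delete(im.Spec.PortMap, name)
	if _, err := c.Update(ctx, im); err != nil {
		if apierrors.IsConflict(err) {
			return false, nil
		}
		return false, err
	}
	return true, nil
}
```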
Is this removal logic conservative enough? Could our old replica address still exist in Engine.Status.CurrentReplicaAddressMap? If so, we may need to move the port selection logic, or at least this deletion logic, even further up the stack to somewhere we can be sure an engine will never try to access an address before we remove its port from our mapping. We could also consider NEVER deleting this mapping until a replica is deleted. This would be the safest option, but would limit the number of replicas allowed on a node to the number of ports we are allowed to use (assuming we use ports from 10000-65535 and each replica needs 15 ports, the limit is ~3700 volumes).
At a high level, the port selection logic:
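One possible shape for that logic (an assumption for illustration, not necessarily the final algorithm): scan the ranges already reserved in spec.portMap and take the first gap wide enough within the allowed port window.

```go
package sketch

import (
	"errors"
	"sort"
)

type PortRange struct {
	StartingPort int32
	NumPorts     int32
}

// selectFreeRange picks a contiguous block of `need` ports in [min, max)
// that does not overlap any range already reserved in the port map.
func selectFreeRange(reserved map[string]PortRange, need, min, max int32) (PortRange, error) {
	ranges := make([]PortRange, 0, len(reserved))
	for _, r := range reserved {
		ranges = append(ranges, r)
	}
	sort.Slice(ranges, func(i, j int) bool { return ranges[i].StartingPort < ranges[j].StartingPort })

	cursor := min
	for _, r := range ranges {
		// Is the gap before this reservation big enough?
		if r.StartingPort-cursor >= need {
			return PortRange{StartingPort: cursor, NumPorts: need}, nil
		}
		if end := r.StartingPort + r.NumPorts; end > cursor {
			cursor = end
		}
	}
	// Check the tail of the window after the last reservation.
	if max-cursor >= need {
		return PortRange{StartingPort: cursor, NumPorts: need}, nil
	}
	return PortRange{}, errors.New("no contiguous port range available")
}
```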
Other port-related behavior continues to work as outlined by @PhanLe1010. The only real operational difference is that a crashed instance-manager doesn't get to select new ports for any processes when it starts up again. It uses the ports already decided on by longhorn-manager. These ports are predictable, and an engine does not accidentally talk to the wrong replica.
Potential advantages
Drawbacks
Extra notes
By default, a replica consumes 13 ports (1 for the gRPC server, 1 for the data server, 1 for the sync agent, and 10 for the sync agent "listen range"). Longhorn manager already defaults to requesting 15 ports.
The port selection algorithm being used by longhorn-instance-manager appears to be pretty highly optimized and based off a bitmap (https://github.com/longhorn/longhorn-instance-manager/blob/34e6650fb5922b961e7bc06205ce33dfc6899e45/pkg/util/bitmap.go#L32-L70). In this proposal, our selection routine will be less optimized, based off fields in the InstanceManager CR. My instinct is that this level of optimization is not necessary, so I have not included this as a drawback.
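For contrast, a rough illustration of the bitmap idea (not the actual longhorn-instance-manager code), tracking used ports as bits and allocating the first free run:

```go
package sketch

import (
	"errors"
	"math/big"
)

// PortBitmap tracks a window of ports as a bit set.
type PortBitmap struct {
	base int32    // first usable port
	size int32    // number of ports tracked
	used *big.Int // bit i set => port base+i is allocated
}

func NewPortBitmap(base, size int32) *PortBitmap {
	return &PortBitmap{base: base, size: size, used: new(big.Int)}
}

// Allocate finds the first run of `count` free bits, marks them used, and
// returns the starting port of the run.
func (b *PortBitmap) Allocate(count int32) (int32, error) {
	run := int32(0)
	for i := int32(0); i < b.size; i++ {
		if b.used.Bit(int(i)) == 0 {
			run++
			if run == count {
				start := i - count + 1
				for j := start; j <= i; j++ {
					b.used.SetBit(b.used, int(j), 1)
				}
				return b.base + start, nil
			}
		} else {
			run = 0
		}
	}
	return 0, errors.New("no free port range")
}
```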
@ejweber Please use Mergifyio to backport to 1.3.
@ejweber is this ready for testing? for e2e testing, we can follow up at https://github.com/longhorn/longhorn/issues/5850.
Sounds good, @innobead. I need to modify it slightly for 1.4.x since this one will now use new instanceManager.status fields. I will do it once #1887 is approved.
Yeah, we should. After this issue is moved to ready for testing, @ejweber can provide the test steps or even an e2e test case for that.
This will be part of resilience testing, node reboot (or IM pod restart).
cc @longhorn/qa
Test fixes for the above were not sufficient to avoid the issue.
It appears the actual underlying race exists between the instance manager controller, the replica controller, and the volume controller.
When instance-manager-r is knocked offline, its instanceManager.status.currentState changes to unknown, then starting, then error, then starting, then running. None of these transitions change the instance state recorded in instanceManager.status.instances, so old IP addresses and port numbers for instances are maintained.
The replica controller generally updates the replica.status.currentState/port/storageIP of each replica based on the instance state recorded in instanceManager.status.instances. As long as instanceManager.status.currentState != running, the replica controller ignores this state and keeps it blank. However, as soon as instanceManager.status.currentState == running, it copies whatever it finds to the replica.
Once the volume controller sees the replica is running, it updates the appropriate engine.spec.replicaAddressMap, kicking off a rebuild.
Here's the problem: even with the fix I mentioned above, there is some time before the instance manager controller's monitor updates instanceManager.status.instances. On a fresh start, there are no replica instances running. Once the monitor realizes this, it clears instanceManager.status.instances. This update eventually results in replica instances actually being started, generally with IPs that are different from the ones the engine controller is reconciling with. By then it's too late: engines are instructed to add the wrong replicas, and resizing occurs.
If we can clear instanceManager.status.instances when we know an instance-manager is down (so the replica controller can't get false information from it), or ensure we don't transition instanceManager.status.currentState to running before its monitor syncs, we can eliminate the race.
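A hedged sketch of those two guard options (field and function names are simplified stand-ins, not the real controller code):

```go
package sketch

type instanceManagerStatus struct {
	CurrentState string            // "unknown", "starting", "error", "running", ...
	Instances    map[string]string // instance name -> ip:port (simplified)
}

// Option 1: drop stale instance entries as soon as we know the
// instance-manager is not running, so the replica controller cannot copy
// old IPs/ports from them.
func markInstanceManagerDown(status *instanceManagerStatus, newState string) {
	status.CurrentState = newState
	if newState != "running" {
		status.Instances = map[string]string{}
	}
}

// Option 2: do not surface "running" until the instance-manager monitor has
// completed at least one sync after the restart, so whatever the replica
// controller reads is already fresh.
func effectiveState(status instanceManagerStatus, monitorSyncedSinceRestart bool) string {
	if status.CurrentState == "running" && !monitorSyncedSinceRestart {
		return "starting"
	}
	return status.CurrentState
}
```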
I prefer the 2nd proposal. Using a unique token (volume name, replica directory, …) for communication identity verification is straightforward and has fewer compatibility issues. As mentioned above, Longhorn can simply ignore the token for an old engine/replica, whereas proposal 1 requires more extra logic.
At this point, I'm leaning towards taking a stab at the second proposal first. It could eliminate the negative effects of the issue we are observing and provide logging to support our efforts to eliminate it entirely. It would require changes to the engine API, but I think they would be mostly uncomplicated: an additional optional(?) field in the replica server and an additional optional(?) field in most longhorn-engine gRPC requests. I'm going to try to get some buy-in for one of these ideas from the team before actually pursuing them.
Proposal 2: Verify volume identity in all gRPC requests to a longhorn-engine process
Motivation
During our discussion of this issue, @PhanLe1010 and I came to the conclusion that no fix to the engine controller alone would be sufficient. Our information about which replica addresses (most importantly, ports) an engine can talk to comes (in a roundabout way) from other controllers. In particular, the replica controller is acting to start/stop replicas, and the ports used by those replicas are chosen by instance-manager. Issues where we incorrectly talk to the wrong engine and/or replica are probably rare, though it's difficult to be certain, as the symptoms aren't always obvious in logs. In some cases, nothing appears to be wrong until a volume eventually detaches and reattaches.
@PhanLe1010 and I have some concern that the issue we have observed accidentally talking to the wrong replica could also occur when talking to the engine itself (i.e., we could have a stale engine address and send incorrect commands).
Goals and non-goals
High level design
A controller server knows its own name and returns it in the response to most requests (most notably, ControllerService.VolumeGet).
A replica server doesn't have a concept of a name. There is nothing in the response to ReplicaService.ReplicaGet that reveals any obvious association with the volume name, and nothing in the metadata file either.
There may not be a reason that a replica server couldn't or shouldn't have a name. We could add a name argument to the replica command line arguments. Then, the replica controller could start the replica process with the unique volume name associated with a volume (essentially, modify ReplicaProcessCreate so it more closely mirrors EngineProcessCreate in https://github.com/longhorn/longhorn-manager/blob/9893e969f1357bba95d01b995605e501d69d3eee/engineapi/instance_manager.go#L152-L241). The replica server could remember that name and return it in the response to most requests (like the controller server does).
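A minimal sketch of how the verification might look on the replica side, assuming a hypothetical server struct and an optional volume-name field in requests (the real change would live in longhorn-engine's replica gRPC service):

```go
package sketch

import "fmt"

// replicaServer remembers the volume name it was started with, e.g. passed
// as a new command line argument by ReplicaProcessCreate.
type replicaServer struct {
	volumeName string
}

// verifyCaller rejects any request whose (optional) volume-name field does
// not match the volume this replica belongs to. Old engines that do not send
// the field are allowed through for compatibility.
func (s *replicaServer) verifyCaller(requestedVolume string) error {
	if requestedVolume == "" {
		return nil // old client: nothing to verify
	}
	if requestedVolume != s.volumeName {
		return fmt.Errorf("request for volume %q sent to a replica of volume %q",
			requestedVolume, s.volumeName)
	}
	return nil
}
```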
Questions
Should the replica.Server.Name be unique to a replica, or should all replica.Server.Names associated with the same volume just use the volume name?
@derekbit, while discussing this, @PhanLe1010 and I came to the conclusion that we might be able to improve this issue by making changes in the engine controller, but we would not be able to eliminate a race condition like this without doing something elsewhere. In my mind, this is largely because, after an instance-manager-r crash, the replica controller is working to get replicas started (on any available ports) at the same time the engine controller is trying to rebuild replicas on previously known ports. I was curious whether you agreed, or if you can see a low touch way of dealing with the issue?
A small code analysis:
Normal replica rebuilding flow:
- The replica crashed (so it was removed from engine.status.replicaModeMap), so Longhorn manager marked the replica as failed by setting replica.Spec.FailedAt and set r.Spec.DesireState = longhorn.InstanceStateStopped (link).
- The failed replica is reused for rebuilding, so Spec.FailedAt and Spec.HealthyAt are cleared (link).
- The replica's address is added to engine.Spec.ReplicaAddressMap (link).
- The address is propagated to Status.CurrentReplicaAddressMap (link).
- The new replica now appears in e.Status.CurrentReplicaAddressMap (link).
- The engine controller rebuilds any replica that is in e.Status.CurrentReplicaAddressMap but not in engine.status.replicaModeMap (link).
Possible root cause
From the code flow, the issue can happen as:
Possible fix:
As discussed with @ejweber, I think