longhorn: [BUG] Replica rebuilding is possibly issued to a wrong replica

Describe the bug (šŸ› if you encounter this issue)

The engine controller iterates and compares e.Status.ReplicaModeMap and e.Status.CurrentReplicaAddressMap, and triggers a replica rebuild for any replica that appears in e.Status.CurrentReplicaAddressMap but has no entry in e.Status.ReplicaModeMap.

The e.Status.ReplicaModeMap and e.Status.CurrentReplicaAddressMap might not reflect the latest result from the engine monitor, so there is a chance that a replica’s recorded IP address or port is stale. The rebuild is then issued to the wrong replica and leads to an unintended volume expansion (code).
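For illustration, a minimal Go sketch of that comparison (function and parameter names are hypothetical; the actual check lives in longhorn-manager’s engine controller):

// A minimal sketch of the comparison described above. Only the map semantics
// mirror the description; nothing here is the real longhorn-manager code.
func replicasNeedingRebuild(currentReplicaAddressMap, replicaModeMap map[string]string) map[string]string {
    toRebuild := map[string]string{}
    for replicaName, addr := range currentReplicaAddressMap {
        if _, reported := replicaModeMap[replicaName]; !reported {
            // The engine is expected to serve this replica, but the engine monitor
            // has not reported a mode for it, so a rebuild would be issued to addr.
            // If addr is stale, the rebuild (and the size check/expansion that
            // precedes it) hits the wrong replica process.
            toRebuild[replicaName] = addr
        }
    }
    return toRebuild
}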

To Reproduce

Steps to reproduce the behavior:

  1. Go to ā€˜ā€¦ā€™
  2. Click on ā€˜ā€¦ā€™
  3. Perform ā€˜ā€¦ā€™
  4. See error

Expected behavior

A clear and concise description of what you expected to happen.

Log or Support bundle

If applicable, add the Longhorn managers’ log or support bundle when the issue happens. You can generate a Support Bundle using the link at the footer of the Longhorn UI.

Environment

  • Longhorn version: v1.3.1+
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

Add any other context about the problem here.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 23 (20 by maintainers)

Most upvoted comments


The full issue can’t easily be recreated while the volume controller’s replenishment backoff is enabled: the long and short of it is that we need to be really unlucky on timing for an unexpected expansion to occur.

Apologies for the huge amount of text in this issue! I have been hoping to capture a good bit of analysis in case we need to revisit it.

@innobead @shuo-wu @PhanLe1010

At this point, I’m quite confident my last comment reflects the real race condition we are dealing with and the true root cause of the inappropriate expansions. In every set of logs showing the issue I have seen (including when I have recreated it), we see the same sequence of events:

  1. Instance-manager pod stops while instances are still running.
  2. InstanceManager resource shows as in error or starting, but still has port information about its instances. Longhorn-manager tries to delete the instances, but they are no longer running.
  3. Instance-manager pod and InstanceManager resource transition to running. The instance-manager controller hasn’t yet had time to poll the instances running inside the instance-manager pod (there are none), but the running state of the InstanceManager resource causes the replica controller to transition Replica resource statuses to running and the volume controller to add replicas to engines.
  4. Instance-manager controller polls the instance-manager pod and removes all instances from the InstanceManager resource’s status.
  5. Replica controller ACTUALLY starts instances. Now instances are running and their correct port information eventually propagates to other components.
  6. It’s too late; the engine controller is already in the middle of a rebuild with the wrong information.

I have prepared a PR (pending a bit more testing) that should mitigate this issue. The fix is quite small and self-contained, and hopefully easily backported. I recommend we remove the require/lep label and close this issue out with that PR if it passes review. If we want to proceed with my more complex proposal, I recommend we open a new issue and reconsider priority.

Recreate and analysis

As @PhanLe1010 showed in his code flow analysis, the potential to hit this will exist as long as instance-manager-r is allowed to choose a new replica process’s ports (and thus switch which replica is listening on which port in the middle of an engine controller reconciliation). However, I found a quirk in the engine controller that makes hitting it far more likely and can reproduce it pretty consistently.

Recreate steps

  1. Disable replica replenishment backoff (https://github.com/longhorn/longhorn-manager/blob/9893e969f1357bba95d01b995605e501d69d3eee/controller/volume_controller.go#L1838). This causes the volume controller to immediately rerun failed replicas instead of waiting, making it much more likely to hit the issue. (I was able to recreate it 10/13 times with backoff disabled and 0/4 times with it enabled.)
  2. Disable Longhorn scheduling on all but two cluster nodes. This recreate requires replicas for all volumes to schedule to the same two nodes.
  3. Create five volumes with different sizes (e.g. 1GB, 2GB, 3GB, 4GB, and 5GB).
  4. Attach all volumes to one node.
  5. Monitor the instance-manager-e logs on the node volumes are attached to.
  6. Reboot the other node.
  7. Wait for the instance-manager-e to log: [longhorn-instance-manager] time="2023-04-24T14:35:40Z" level=info msg="Prepare to expand new replica to size 5368709120". You have recreated the issue!
  8. Check the Longhorn UI or logs to see which volume is now degraded because one of its snapshots has been expanded by the engine of another (in my recreate, a 2GB replica was expanded to 5GB).

In this example, the volume likely eventually recovers. The engine is running with a correctly sized replica, so rebuilding the incorrectly sized replica continues to fail until a brand new, correctly sized, replica is rebuilt. Recreating the complete failures some users have reported is harder, and likely involves additional poorly timed events.

Contributing factors

Analyzing logs and changes to Kubernetes resources during the recreate, we see a couple of contributing factors.

One minute delay in getting correct replica instance status

Longhorn-manager starts instance-manager-r and immediately begins monitoring it for instance status updates. The replica controller uses these updates to inform the changes it makes to a replica’s status. Curiously, longhorn-manager and instance-manager-r only log attempts to delete replica processes in the first minute of instance-manager-r running. The replica instance is definitely not running, but longhorn-manager thinks it is.

Early instance-manager-r logs for a 4gb replica involved in an incorrect expansion.

2023-04-24T18:20:56.318449496Z time="2023-04-24T18:20:56Z" level=info msg="Storing process logs at path: /var/log/instances"
2023-04-24T18:20:56.318529601Z [longhorn-instance-manager] time="2023-04-24T18:20:56Z" level=warning msg="Failed to add TLS key pair from /tls-files/" error="open /tls-files/tls.crt: no such file or directory"
2023-04-24T18:20:56.318536614Z [longhorn-instance-manager] time="2023-04-24T18:20:56Z" level=info msg="Creating gRPC server with no auth"
2023-04-24T18:20:56.318541016Z [longhorn-instance-manager] time="2023-04-24T18:20:56Z" level=info msg="Instance Manager proxy gRPC server listening to 0.0.0.0:8501"
2023-04-24T18:20:56.318545116Z [longhorn-instance-manager] time="2023-04-24T18:20:56Z" level=info msg="Instance Manager process gRPC server listening to 0.0.0.0:8500"
2023-04-24T18:20:57.110671536Z [longhorn-instance-manager] time="2023-04-24T18:20:57Z" level=info msg="Started new process manager update watch"
2023-04-24T18:20:57.176274620Z [longhorn-instance-manager] time="2023-04-24T18:20:57Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:20:57.343966728Z [longhorn-instance-manager] time="2023-04-24T18:20:57Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:07.761683795Z [longhorn-instance-manager] time="2023-04-24T18:21:07Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:07.772749817Z [longhorn-instance-manager] time="2023-04-24T18:21:07Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:07.935662832Z [longhorn-instance-manager] time="2023-04-24T18:21:07Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:07.965271521Z [longhorn-instance-manager] time="2023-04-24T18:21:07Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:08.141594846Z [longhorn-instance-manager] time="2023-04-24T18:21:08Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:08.147821855Z [longhorn-instance-manager] time="2023-04-24T18:21:08Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:08.193646014Z [longhorn-instance-manager] time="2023-04-24T18:21:08Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:16.015204433Z [longhorn-instance-manager] time="2023-04-24T18:21:16Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:27.843741381Z [longhorn-instance-manager] time="2023-04-24T18:21:27Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:38.184150101Z [longhorn-instance-manager] time="2023-04-24T18:21:38Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:46.021051790Z [longhorn-instance-manager] time="2023-04-24T18:21:46Z" level=info msg="Process Manager: prepare to delete process 4gb-r-768bdb99"
2023-04-24T18:21:57.304418332Z [longhorn-instance-manager] time="2023-04-24T18:21:57Z" level=info msg="Process Manager: prepare to create process 4gb-r-768bdb99"
2023-04-24T18:21:57.304791715Z [longhorn-instance-manager] time="2023-04-24T18:21:57Z" level=debug msg="Process Manager: validate process path: /host/var/lib/longhorn/engine-binaries/rancher-mirrored-longhornio-longhorn-engine-v1.4.1/longhorn dir: /host/var/lib/longhorn/engine-binaries/ image: rancher-mirrored-longhornio-longhorn-engine-v1.4.1 binary: longhorn"
2023-04-24T18:21:57.305052318Z [longhorn-instance-manager] time="2023-04-24T18:21:57Z" level=info msg="Process Manager: created process 4gb-r-768bdb99"
2023-04-24T18:21:57.335766556Z [4gb-r-768bdb99] time="2023-04-24T18:21:57Z" level=info msg="Listening on data server 0.0.0.0:10046"
2023-04-24T18:21:57.335798653Z time="2023-04-24T18:21:57Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10045"
2023-04-24T18:21:57.335804247Z time="2023-04-24T18:21:57Z" level=info msg="Listening on sync agent server 0.0.0.0:10047"

Replica status during the above time period.

18:20:45 (Correctly shows currentState = stopped.)
  desireState: stopped
  currentState: stopped
  - lastProbeTime: ""
    lastTransitionTime: "2023-04-24T18:19:41Z"
    message: 'proxyServer=10.42.150.88:8501 destination=10.42.150.88:10004: failed
      to add replica tcp://10.42.174.163:10015 for volume: rpc error: code = Unknown
      desc = failed to get replica 10.42.174.163:10015: rpc error: code = Unavailable
      desc = all SubConns are in TransientFailure, latest connection error: connection
      error: desc = "transport: Error while dialing dial tcp 10.42.174.163:10015:
      i/o timeout"'
    reason: KubernetesNodeNotReady
    status: "True"
    type: RebuildFailed  
  port: 0
  storageIP: ""

18:20:56 (Incorrectly shows currentState = running.)
  desireState: stopped
  failedAt: "2023-04-24T18:19:41Z"
  healthyAt: ""
  - lastProbeTime: ""
    lastTransitionTime: "2023-04-24T18:19:41Z"
    message: 'proxyServer=10.42.150.88:8501 destination=10.42.150.88:10004: failed
      to add replica tcp://10.42.174.163:10015 for volume: rpc error: code = Unknown
      desc = failed to get replica 10.42.174.163:10015: rpc error: code = Unavailable
      desc = all SubConns are in TransientFailure, latest connection error: connection
      error: desc = "transport: Error while dialing dial tcp 10.42.174.163:10015:
      i/o timeout"'
    reason: KubernetesNodeNotReady
    status: "True"
    type: RebuildFailed
  currentState: running
  port: 10015
  storageIP: 10.42.174.129

18:21:06
  failedAt: ""
  healthyAt: ""
  - lastProbeTime: ""
    lastTransitionTime: "2023-04-24T18:21:07Z"
    message: ""
    reason: ""
    status: "False"
    type: RebuildFailed
  currentState: running
  port: 10015
  storageIP: 10.42.174.129

18:21:07
  desireState: stopped
  failedAt:
  healthyAt: ""
    - lastProbeTime: ""
    lastTransitionTime: "2023-04-24T18:21:07Z"
    message: 'proxyServer=10.42.150.88:8501 destination=10.42.150.88:10004: failed
      to add replica tcp://10.42.174.129:10015 for volume: rpc error: code = Unknown
      desc = failed to get replica 10.42.174.129:10015: rpc error: code = Unavailable
      desc = all SubConns are in TransientFailure, latest connection error: connection
      error: desc = "transport: Error while dialing dial tcp 10.42.174.129:10015:
      connect: connection refused"'
    reason: Disconnection
    status: "True"
    type: RebuildFailed
  currentState: running
  port: 10015
  storageIP: 10.42.174.129
  
18:21:07
  desireState: stopped
  failedAt: "2023-04-24T18:21:08Z"
  healthyAt: ""
    - lastProbeTime: ""
    lastTransitionTime: "2023-04-24T18:21:07Z"
    message: 'proxyServer=10.42.150.88:8501 destination=10.42.150.88:10004: failed
      to add replica tcp://10.42.174.129:10015 for volume: rpc error: code = Unknown
      desc = failed to get replica 10.42.174.129:10015: rpc error: code = Unavailable
      desc = all SubConns are in TransientFailure, latest connection error: connection
      error: desc = "transport: Error while dialing dial tcp 10.42.174.129:10015:
      connect: connection refused"'
    reason: Disconnection
    status: "True"
    type: RebuildFailed
  currentState: running
  port: 10015
  storageIP: 10.42.174.129
  
18:21:26
  desireState: stopped
  failedAt: ""
  healthyAt: ""
  - lastProbeTime: ""
    lastTransitionTime: "2023-04-24T18:21:07Z"
    message: 'proxyServer=10.42.150.88:8501 destination=10.42.150.88:10004: failed
      to add replica tcp://10.42.174.129:10015 for volume: rpc error: code = Unknown
      desc = failed to get replica 10.42.174.129:10015: rpc error: code = Unavailable
      desc = all SubConns are in TransientFailure, latest connection error: connection
      error: desc = "transport: Error while dialing dial tcp 10.42.174.129:10015:
      connect: connection refused"'
    reason: Disconnection
    status: "True"
    type: RebuildFailed
  currentState: running
  port: 10015
  storageIP: 10.42.174.129
  
18:21:37
  desireState: stopped
  failedAt: ""
  healthyAt: ""
  - lastProbeTime: ""
    lastTransitionTime: "2023-04-24T18:21:38Z"
    message: ""
    reason: ""
    status: "False"
    type: RebuildFailed
  currentState: running
  port: 10015
  storageIP: 10.42.174.129

18:21:56  
  desireState: stopped
  failedAt: ""
  healthyAt: ""
  currentState: stopped
  port: 0
  storageIP: ""

18:21:56 (This is the first accurate status.)
  desireState: running
  failedAt: ""
  healthyAt: ""
  currentState: stopped
  port: 0
  storageIP: ""
  
18:21:57 (Now the replica is running, but not on 10015.)
  desireState: running
  failedAt: ""
  healthyAt: ""
  currentState: running
  port: 10045
  storageIP: 10.42.174.129

Instance-manager-r status during this time period.

2023-04-24T18:20:56Z (Includes the instance statuses from before the node shut down.)
  currentState: running
  instances:
    4gb-r-768bdb99:
      spec:
        name: 4gb-r-768bdb99
      status:
        endpoint: ""
        errorMsg: ""
        listen: ""
        portEnd: 10029
        portStart: 10015
        resourceVersion: 0
        state: running
        type: ""
		
18:21:55 (Status finally changes a minute later.)
  currentState: running
  instances: {}
  
18:21:56 (Now the replica is running, but not on 10015.)
  currentState: running
  instances:
    4gb-r-768bdb99:
      spec:
        name: 4gb-r-768bdb99
      status:
        endpoint: ""
        errorMsg: ""
        listen: ""
        portEnd: 10059
        portStart: 10045
        resourceVersion: 0
        state: running
        type: ""

As a result of this behavior, longhorn-manager thinks the replica is running (on a different port than it will eventually use) for a minute while it is actually shut down. This is mostly harmless, except that when the replica eventually starts running, the engine controller may order a rebuild using the old port before it learns about the new one.

Ten second lag between ordering a rebuild and it actually starting

When the engine controller attempts to add a replica, it purges snapshots first, then adds it. It does not proceed from the snapshot purge for at least 10 seconds, during which time the status of the running replica may change. It does not check any updated status before proceeding.
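A hedged sketch of that flow (hypothetical client interface and helper names, not the actual longhorn-manager code; assumes the standard time package is imported) shows where the stale address slips in:

// Hypothetical client interface used only for this sketch.
type engineClient interface {
    SnapshotPurge() error
    SnapshotPurgeInProgress() (bool, error)
    ReplicaAdd(replicaName, addr string) error
}

// rebuildAfterPurge reuses the replica address captured before the purge; the
// purge wait gives the replica time to restart on a different port, but the
// engine status is never re-read before the rebuild starts.
func rebuildAfterPurge(c engineClient, currentReplicaAddressMap map[string]string, replicaName string) error {
    addr := currentReplicaAddressMap[replicaName] // captured before the wait
    if err := c.SnapshotPurge(); err != nil {
        return err
    }
    for {
        inProgress, err := c.SnapshotPurgeInProgress()
        if err != nil {
            return err
        }
        if !inProgress {
            break
        }
        time.Sleep(2 * time.Second) // this polling loop is where the ~10 second lag comes from
    }
    // addr may now point at a different replica process, but it is still used.
    return c.ReplicaAdd(replicaName, addr)
}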

Longhorn-manager logs before and during replica expansion

2023-04-24T18:21:58.159364012Z [longhorn-instance-manager] time="2023-04-24T18:21:58Z" level=info msg="Purging snapshots" serviceURL="10.42.150.88:10004"

We wait 10 seconds (long after the rest of the system knows the port has changed).

2023-04-24T18:22:08.169385679Z time="2023-04-24T18:22:08Z" level=debug msg="Finished snapshot purge, will start rebuilding then" controller=longhorn-engine engine=4gb-e-928e8b72 node=eweber-v124-worker-1ae51dbb-4pngn volume=4gb
2023-04-24T18:22:08.181854871Z time="2023-04-24T18:22:08Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"4gb-e-928e8b72\", UID:\"1414aca5-50b1-4ee1-8352-210eeaf8f309\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"11921061\", FieldPath:\"\"}): type: 'Normal' reason: 'Rebuilding' Start rebuilding replica 4gb-r-768bdb99 with Address 10.42.174.129:10015 for normal engine 4gb-e-928e8b72 and volume 4gb"
2023-04-24T18:22:08.525500147Z time="2023-04-24T18:22:08Z" level=info msg="Marked failed rebuild on replica 4gb-r-768bdb99, backoff period is now 80 seconds" controller=longhorn-engine engine=4gb-e-928e8b72 node=eweber-v124-worker-1ae51dbb-4pngn volume=4gb

Engine status during this time

18:21:27
  replicaAddressMap:
    4gb-r-79e5f310: 10.42.150.70:10060
    4gb-r-768bdb99: 10.42.174.129:10015
  currentReplicaAddressMap:
    4gb-r-79e5f310: 10.42.150.70:10060
    4gb-r-768bdb99: 10.42.174.129:10015
  replicaModeMap:
    4gb-r-79e5f310: RW

18:21:57
  replicaAddressMap:
    4gb-r-79e5f310: 10.42.150.70:10060
    4gb-r-768bdb99: 10.42.174.129:10045
  currentReplicaAddressMap:
    4gb-r-79e5f310: 10.42.150.70:10060
    4gb-r-768bdb99: 10.42.174.129:10015
  replicaModeMap:
    4gb-r-79e5f310: RW

The bad expansion has already occurred and the rebuild has failed. However, the engine controller won’t time out until a full 30 seconds have passed, so the wrong port remains in its currentReplicaAddressMap.

18:22:07
  replicaAddressMap:
    4gb-r-79e5f310: 10.42.150.70:10060
  currentReplicaAddressMap:
    4gb-r-79e5f310: 10.42.150.70:10060
    4gb-r-768bdb99: 10.42.174.129:10015
  replicaModeMap:
    4gb-r-79e5f310: RW

2023-04-24T18:22:27Z (From here on out, things work as expected.)
  replicaAddressMap:
    4gb-r-79e5f310: 10.42.150.70:10060
  currentReplicaAddressMap:
    4gb-r-79e5f310: 10.42.150.70:10060
  replicaModeMap:
    4gb-r-79e5f310: RW

Proposal 1: Move port allocation logic back to Longhorn manager

This is a stab at something like @PhanLe1010’s "Possible fix" number 1.

Motivation

During our discussion of this issue, @PhanLe1010 and I came to the conclusion that no fix to the engine controller alone would be sufficient. Our information about which replica addresses (most importantly, ports) an engine can talk to comes (in a roundabout way) from other controllers. In particular, the replica controller is acting to start/stop replicas, and the ports used by those replicas are chosen by instance-manager. Unless we move port allocation up the management stack and stabilize it, there will always be the potential for a race between the replica controller / instance-manager and the engine controller.

Considerations

  1. Port allocation state must NOT be distributed amongst multiple CRs. If only one CR is responsible for port allocation state, we can rely on etcd and the Kubernetes API’s "Conflict" status to ensure different controllers (or the same controller reconciling different objects) don’t allocate the same port.
  2. Port allocations should be stable (unaffected by temporary changes to the status of a controller or replica process).

High level design

We introduce a new PortMap field to InstanceManagerSpec. It is a map of InstanceNames to PortRanges, where a PortRange consists of a StartingPort and NumPorts. For example: instanceManager.Spec.portMap: {"volume1-r-7bab29b8": {"startingPort": 10000, "numPorts": 15}}.

  • The field is in InstanceManager to guarantee that no two longhorn-managers can select the same ports for the same InstanceManager simultaneously. If they try to, Kubernetes will reject one of their selections.
  • The field is in InstanceManagerSpec instead of InstanceManagerStatus because other controllers (not the Instance Manager controller) will write it.
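A rough Go sketch of the proposed fields (hypothetical type names in the style of longhorn-manager’s API types; none of this exists today):

// PortRange describes a contiguous block of ports reserved for one instance.
type PortRange struct {
    StartingPort int32 `json:"startingPort"`
    NumPorts     int32 `json:"numPorts"`
}

// InstanceManagerSpec gains a PortMap keyed by instance name, for example:
//   portMap: {"volume1-r-7bab29b8": {"startingPort": 10000, "numPorts": 15}}
type InstanceManagerSpec struct {
    // ... existing spec fields elided ...
    PortMap map[string]PortRange `json:"portMap"`
}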

We deprecate instance-manager’s port selection mechanism and do port selection in the engine and replica controllers instead.

Both the engine and replica controllers already call an instanceHandler.ReconcileInstanceState. ReconcileInstanceState already gets the appropriate InstanceManager. If a replica’s or engine’s DesireState is running, we run port selection logic within ReconcileInstanceState and try to update InstanceManager. If our update is rejected (due to conflict), we do not start the instance process. Instead, we requeue and try again next time. If our update is accepted, we start the instance process with the ports we allocated.

In the same InstanceHandler.ReconcileInstanceState, if DesireState is stopped and CurrentState is stopped, we remove the instance from the new InstanceManagerSpec.PortMap. If our update is rejected, we requeue and try again next time.
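A hedged sketch of this claim/release pattern, using the API server’s conflict error as the serialization point. Names are hypothetical: updateInstanceManager stands in for a Kubernetes client update, selectPorts is the selection routine sketched further below, and apierrors is k8s.io/apimachinery/pkg/api/errors.

// claimPorts tries to reserve a port range for an instance by updating the
// InstanceManager spec. On a conflict (another controller updated the same
// object first), the caller should requeue and retry instead of starting the
// instance process.
func claimPorts(im *InstanceManager, instanceName string, numPorts int32) (PortRange, bool, error) {
    if im.Spec.PortMap == nil {
        im.Spec.PortMap = map[string]PortRange{}
    }
    r, err := selectPorts(im.Spec.PortMap, instanceName, numPorts, 10000, 30000)
    if err != nil {
        return PortRange{}, false, err
    }
    im.Spec.PortMap[instanceName] = r
    if err := updateInstanceManager(im); err != nil {
        if apierrors.IsConflict(err) {
            return PortRange{}, false, nil // lost the race; requeue and try again
        }
        return PortRange{}, false, err
    }
    return r, true, nil // safe to start the instance process on these ports
}

// releasePorts removes the reservation once DesireState and CurrentState are
// both stopped; a conflict again just means "requeue and retry".
func releasePorts(im *InstanceManager, instanceName string) error {
    delete(im.Spec.PortMap, instanceName)
    return updateInstanceManager(im)
}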

Is this removal logic conservative enough? Could our old replica address still exist in Engine.Status.CurrentReplicaAddressMap? If so, we may need to move the port selection logic, or at least this deletion logic, even further up the stack to somewhere we can be sure an engine will never try to access an address before we remove its port from our mapping. We could also consider NEVER deleting this mapping until a replica is deleted. This would be the safest option, but would limit the number of replicas allowed on a node to the number of ports we are allowed to use (assuming we use ports from 10000-65535 and each replica needs 15 ports, the limit is (65535 āˆ’ 10000) / 15 ā‰ˆ 3700 replicas per node).

At a high level, the port selection logic (sketched below):

  • Checks whether a port has already been selected previously. If so, use it.
  • Looks for a range of ports large enough to accommodate our process (much larger for replicas than controllers).
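That logic might look roughly like the following sketch (hypothetical names; assumes the fmt package is imported, and a real routine would scan more efficiently):

// selectPorts returns the existing reservation for instanceName if there is
// one, otherwise the first free contiguous range of numPorts between minPort
// and maxPort. A simple linear scan is used for illustration only.
func selectPorts(portMap map[string]PortRange, instanceName string, numPorts, minPort, maxPort int32) (PortRange, error) {
    if r, ok := portMap[instanceName]; ok {
        return r, nil // keep previously selected ports stable
    }
    for start := minPort; start+numPorts-1 <= maxPort; start++ {
        if !overlapsAny(portMap, start, numPorts) {
            return PortRange{StartingPort: start, NumPorts: numPorts}, nil
        }
    }
    return PortRange{}, fmt.Errorf("no free range of %d ports for %s", numPorts, instanceName)
}

func overlapsAny(portMap map[string]PortRange, start, numPorts int32) bool {
    end := start + numPorts - 1
    for _, r := range portMap {
        rEnd := r.StartingPort + r.NumPorts - 1
        if start <= rEnd && r.StartingPort <= end {
            return true
        }
    }
    return false
}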

Other port related behavior continues to work as outlined by @PhanLe1010. The only real operational difference is that a crashed instance-manager doesn’t get to select new ports for any processes when it starts up again. It uses the ports already decided on by longhorn-manager. These ports are predictable, and an engine does not accidentally talk to the wrong replica.

Potential advantages

  1. Remove the possibility of talking to the wrong replica altogether (instead of just limiting the effects).

Drawbacks

  1. Significant refactoring is required.
  2. Brought up by @PhanLe1010: There will be contention among replicas trying to set their ports on the appropriate InstanceManager. This could result in a lot of requeuing at times.
  3. I’m not yet 100% sure this fixes the problem (see my note in red).

Extra notes

By default, a replica consumes 13 ports (1 for the gRPC server, 1 for the data server, 1 for the sync agent, and 10 for the sync agent "listen range"). Longhorn manager already defaults to requesting 15 ports.

The port selection algorithm used by longhorn-instance-manager appears to be highly optimized and based on a bitmap (https://github.com/longhorn/longhorn-instance-manager/blob/34e6650fb5922b961e7bc06205ce33dfc6899e45/pkg/util/bitmap.go#L32-L70). In this proposal, our selection routine will be less optimized, based on fields in the InstanceManager CR. My instinct is that this level of optimization is not necessary, so I have not included this as a drawback.

@ejweber Please use Mergifyio to backport to 1.3.

@ejweber is this ready for testing? For e2e testing, we can follow up at https://github.com/longhorn/longhorn/issues/5850.

Let’s make this fix to 1.4.2.

Sounds good, @innobead. I need to modify it slightly for 1.4.x since this one will now use new instanceManager.status fields. I will do it once #1887 is approved.

I think it is valuable to add this test case into our e2e tests https://github.com/longhorn/longhorn/issues/5709#issuecomment-1521889589

WDYT @longhorn/qa @innobead ?

Yeah, we should. After this issue is moved to ready for testing, @ejweber can provide the test steps or even an e2e test case for that.

This will be part of resilience testing, node reboot (or IM pod restart).

cc @longhorn/qa

Test fixes for the above were not sufficient to avoid the issue.

  • The one minute instance-manager-r delay is problematic, but a race occurs whether instance-manager-r starts monitoring after one minute or after only a few seconds.
  • The issue is generally mitigated when we recheck the replicaAddressMap after 10 seconds, but this is cheating. The race is actually occurring before we attempt a rebuild (not during the 10-second window like I thought). However, the 10-second window gives other components time to resolve the race, so when we check back in, it no longer exists.

It appears the actual underlying race exists between the instance manager controller, the replica controller, and the volume controller.

When instance-manager-r is knocked offline, its instanceManager.status.currentState changes to unknown, then starting, then error, then starting, then running. None of these transitions change the instance state recorded in instanceManager.status.instances, so old IP addresses and port numbers for instances are maintained.

The replica controller generally updates the replica.status.currentState/port/storageIP of each replica based on the instance state recorded in instanceManager.status.instances. As long as instanceManager.status.currentState != running, the replica controller ignores this state and keeps it blank. However, as soon as instanceManager.status.currentState == running, it copies whatever it finds to the replica.

Once the volume controller sees the replica is running, it updates the appropriate engine.spec.replicaAddressMap, kicking off a rebuild.

Here’s the problem: even with the fix I mentioned above, there is some time before the instance manager controller’s monitor updates instanceManager.status.instances. On a fresh start, there are no replica instances running. Once the monitor realizes this, it clears instanceManager.status.instances. This update eventually results in replica instances actually being started, generally with IPs that are different from those the engine controller is reconciling with. It’s too late, however: engines are instructed to add the wrong replicas, and resizing occurs.

If we can clear instanceManager.status.instances when we know an instance-manager is down (so the replica controller can’t get false information from it) or ensure we don’t transition an instanceManager.status.currentState to running before its monitor syncs, we can eliminate the race.
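As a hedged sketch of the second idea (hypothetical fields and helper, not the actual fix), the instance manager controller could gate the running state on the monitor having synced at least once:

// instanceManagerView is a simplified stand-in for the fields the real
// controller would consult.
type instanceManagerView struct {
    PodState          string // state derived from the instance-manager pod, e.g. "running"
    MonitorSyncedOnce bool   // true once the instance monitor has polled the pod at least once
}

// effectiveState keeps the InstanceManager out of the running state until the
// monitor has refreshed status.instances, so the replica controller never
// copies stale port/IP information into replica statuses.
func effectiveState(im instanceManagerView) string {
    if im.PodState == "running" && !im.MonitorSyncedOnce {
        return "starting"
    }
    return im.PodState
}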

I prefer the 2nd proposal. Using a unique token (volume name/replica directory/…) for communication identity verification is straightforward, with fewer compatibility issues. As mentioned above, Longhorn can ignore the token for an old engine/replica, while proposal 1 requires more extra logic.

At this point, I’m leaning towards taking a stab at the second proposal first. It could eliminate the negative effects of the issue we are observing and provide logging to support our efforts to eliminate it entirely. It would require changes to the engine API, but I THINK they would be mostly uncomplicated: an additional optional(?) field in the replica server and an additional optional(?) field in most longhorn-engine gRPC requests. I’m going to try to get some buy-in for one of these ideas from the team before actually pursuing them.

Proposal 2: Verify volume identity in all gRPC requests to a longhorn-engine process

Motivation

During our discussion of this issue, @PhanLe1010 and I came to the conclusion that no fix to the engine controller alone would be sufficient. Our information about which replica addresses (most importantly, ports) an engine can talk to comes (in a roundabout way) from other controllers. In particular, the replica controller is acting to start/stop replicas, and the ports used by those replicas are chosen by instance-manager. Issues where we talk to the wrong engine and/or replica are probably rare, though it’s difficult to be certain, as the symptoms aren’t always obvious in logs. In some cases, nothing appears to be wrong until a volume eventually detaches and reattaches.

@PhanLe1010 and I are somewhat concerned that the accidental communication with the wrong replica we have observed could also occur when talking to the engine itself (i.e., we could have a stale engine address and send it incorrect commands).

Goals and non-goals

  • A goal of this proposal is NOT to eliminate the underlying issue of an engine controller potentially communicating with an incorrect replica.
  • A goal of this proposal IS to negate the negative effects of this communication.
  • A goal of this proposal IS to provide additional logging when inappropriate communication occurs. This logging can help when further implementing a fix to the underlying issue.

High level design

A controller server knows its own name and returns it in the response to most requests (most notably, ControllerService.VolumeGet).

  • The name the controller server knows is simply the name it was started with.
  • The engine controller always starts the controller process with the unique volume name associated with a volume.
  • Verifying this name in every gRPC ControllerService request could ensure that we always talked to the right controller.
  • We could make this verification optional to limit the risk of breaking something:
    • If a request includes the name and it matches the one the controller server knows, it proceeds.
    • If a request includes the name and it does not match the one the controller server knows, it fails.
    • If a request does not include the name, it proceeds.

A replica server doesn’t have a concept of a name. There is nothing in the response to ReplicaService.ReplicaGet that reveals any obvious association with the volume name, and nothing in the metadata file either.

There may not be a reason that a replica server couldn’t or shouldn’t have a name. We could add a name argument to the replica command line arguments. Then, the replica controller could start the replica process with the unique volume name associated with a volume (essentially, modify ReplicaProcessCreate so it slightly more closely mirrors EngineProcessCreate in https://github.com/longhorn/longhorn-manager/blob/9893e969f1357bba95d01b995605e501d69d3eee/engineapi/instance_manager.go#L152-L241). The replica server could remember that name and return it in the response to most requests (like the controller server does).

  • Verifying this name in every gRPC ReplicaService request could ensure that we always talked to the right replica.
  • We could make this verification optional to limit the risk of breaking something:
    • If a request includes the name and it matches the one the replica server knows, it proceeds.
    • If a request includes the name and it does not match the one the replica server knows, it fails.
    • If a request does not include the name, it proceeds.
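A minimal sketch of the optional check described in the two lists above (hypothetical names; the real change would live in the longhorn-engine gRPC handlers, and fmt is assumed to be imported):

// verifyRequestedName implements the optional check: old clients that send no
// name keep working, and a mismatched name fails the request instead of
// operating on the wrong volume's process.
func verifyRequestedName(requestedName, serverName string) error {
    if requestedName == "" {
        return nil
    }
    if requestedName != serverName {
        return fmt.Errorf("request is for %q but this server was started for %q", requestedName, serverName)
    }
    return nil
}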

Questions

Should the replica.Server.Name be unique to a replica, or should all replica.Server.Names associated with the same volume just use the volume name?

  • The latter is potentially much easier, as every component knows the name of the volume it is operating on and this name can be proxied through from engine controller to engine to replica.
  • To do the former, the engine controller (for example) would need to know the name of the replica it is trying to talk to. It generally only knows the address of the replica it is trying to talk to.

@derekbit, while discussing this, @PhanLe1010 and I came to the conclusion that we might be able to improve this issue by making changes in the engine controller, but we would not be able to eliminate a race condition like this without doing something elsewhere. In my mind, this is largely because, after an instance-manager-r crash, the replica controller is working to get replicas started (on any available ports) at the same time the engine controller is trying to rebuild replicas on previously known ports. I was curious whether you agree, or whether you see a low-touch way of dealing with the issue?

A small code analysis:

Normal replica rebuilding flow:

  1. Somehow, the replica process failed (reported by the instance manager or engine.status.replicaModeMap), so Longhorn manager marked the replica as failed by setting replica.Spec.FailedAt and setting r.Spec.DesireState = longhorn.InstanceStateStopped (link)
  2. The volume controller then tries to reuse the failed replica by clearing .Spec.FailedAt and .Spec.HealthyAt (link)
  3. The replica is then removed from engine.Spec.ReplicaAddressMap (link)
  4. The engine controller then removes the replica from Status.CurrentReplicaAddressMap (link)
  5. Engine monitoring then marks the replica as unknown because the replica doesn’t exist in the spec anymore
  6. The engine controller then removes the (unknown) replica from the controller’s backend list (link)
  7. The volume controller waits for the replica CR to be stopped, then tries to start it (here)
  8. The replica is then added to e.Spec.ReplicaAddressMap, which is then copied to e.Status.CurrentReplicaAddressMap (link)
  9. Finally, the rebuild is triggered because the replica exists in e.Status.CurrentReplicaAddressMap but not in engine.status.replicaModeMap (link)

Possible root cause

From the code flow, the issue can happen as:

  1. Let’s say the engine CR is currently in a state as in step 9 above:

    the replica exists in e.Status.CurrentReplicaAddressMap but not in engine.status.replicaModeMap link

  2. Now the IM of the replica crashes. When it comes back up, a replica of a new volume is started on the IM first, so the IM uses the lowest available port range for the new replica. Since the IM is not aware that this port range still belongs to the old replica CR, it might allocate the same port number to the new replica.
  3. The engine controller of the old volume then issues the rebuild command to the old replica’s IP and port. However, the new replica of the new volume is now listening at that IP and port. This messes up the new replica’s size, because the engine checks and expands the replica before trying to add it to its backend (a paraphrased sketch follows this list): https://github.com/longhorn/longhorn-engine/blob/4ff5fa158c6f84864774d794d7770b018dbc9dd5/pkg/sync/sync.go#L407-L409
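A paraphrased sketch of that ordering (hypothetical interface and function names, not the linked longhorn-engine code; see sync.go above for the real logic):

// replicaAPI is a hypothetical stand-in for the calls made against the target
// replica during a rebuild.
type replicaAPI interface {
    GetSize(addr string) (int64, error)
    Expand(addr string, size int64) error
    AddToBackend(addr string) error
}

// addRebuildingReplica resizes the target replica to the engine's volume size
// before adding it to the backend. If addr actually points at a replica of a
// different (smaller) volume, the expansion corrupts that volume's replica.
func addRebuildingReplica(c replicaAPI, engineSize int64, addr string) error {
    size, err := c.GetSize(addr)
    if err != nil {
        return err
    }
    if size < engineSize {
        if err := c.Expand(addr, engineSize); err != nil {
            return err
        }
    }
    return c.AddToBackend(addr)
}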

Possible fix:

As discussed with @ejweber, I think

  1. We could move the port allocation logic back to Longhorn manager to avoid the race condition (requires significant effort), OR
  2. Maybe the replica can send back information about its current volume identity, and the engine can check it, error out with something like "it is not me", and propagate the error up to longhorn-manager (requires less effort)