longhorn: [BUG] Volume restoration will never complete if attached node is down

Describe the bug (šŸ› if you encounter this issue)

From https://longhorn.github.io/longhorn-tests/manual/pre-release/backup-and-restore/restore-volume-node-down/

While a volume restoration is in progress, shutting down the attached node causes the restoration to never complete and the volume to get stuck in the attaching state.

This happened on Longhorn master-head e1ea3d and v1.4.1-rc1 but not on v1.4.0.

To Reproduce

Steps to reproduce the behavior:

  1. Create a backup.
  2. Restore the above backup.
  3. Power off the node the volume is attached to during the restore.
  4. Wait for the Longhorn node to be marked as down.
  5. Wait for the restoring volume to be reattached and to resume restoring in the Degraded state.
  6. The volume is stuck in the attaching state and the restore never completes.

Expected behavior

The volume is attached to another node and the restore completes after a while.

Log or Support bundle

supportbundle_2c216df5-e12c-4b2e-8373-0833d6802812_2023-03-02T03-46-21Z.zip

Environment

  • Longhorn version: master-head e1ea3d and v1.4.1-rc1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.24.7+k3s1
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 1

Additional context

N/A

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 31 (30 by maintainers)

Most upvoted comments

@weizhe0422 is the status of the engine also unknown? If yes, then because r.Status.CurrentState is also unknown, there is no chance to mark the replica as failed.

/controller/volume_controller.go#L543-L566

	if e.Status.CurrentState == longhorn.InstanceStateUnknown {
		if v.Status.Robustness != longhorn.VolumeRobustnessUnknown {
			v.Status.Robustness = longhorn.VolumeRobustnessUnknown
			vc.eventRecorder.Eventf(v, v1.EventTypeWarning, constant.EventReasonUnknown, "volume %v robustness is unknown", v.Name)
		}
		return nil
	}
	if e.Status.CurrentState != longhorn.InstanceStateRunning {
		// If a replica failed at attaching stage before engine become running,
		// there is no record in e.Status.ReplicaModeMap
		engineInstanceCreationCondition := types.GetCondition(e.Status.Conditions, longhorn.InstanceConditionTypeInstanceCreation)
		isNoAvailableBackend := strings.Contains(engineInstanceCreationCondition.Message, fmt.Sprintf("exit status %v", int(syscall.ENODATA)))
		for _, r := range rs {
			if isNoAvailableBackend || (r.Spec.FailedAt == "" && r.Status.CurrentState == longhorn.InstanceStateError) {
				log.Warnf("Replica %v that not in the engine mode map is marked as failed, current state %v, engine name %v, active %v",
					r.Name, r.Status.CurrentState, r.Spec.EngineName, r.Spec.Active)
				e.Spec.LogRequested = true
				r.Spec.LogRequested = true
				r.Spec.FailedAt = vc.nowHandler()
				r.Spec.DesireState = longhorn.InstanceStateStopped
			}
		}
		return nil
	}

It seems that when a node is down or not ready, the current implementation makes the status of the instance manager unknown (https://github.com/longhorn/longhorn/issues/2650), which causes the engine/replica on that node to have unknown status as well. I think that when a node is really down, the status should be error (InstanceManagerStateError) instead.

/controller/instance_manager_controller.go#L322-L330

	// Skip the cleanup when the node is suddenly down.
	isDown, err := imc.ds.IsNodeDownOrDeleted(im.Spec.NodeID)
	if err != nil {
		log.WithError(err).Warnf("cannot check IsNodeDownOrDeleted(%v) when syncInstanceManager ", im.Spec.NodeID)
	}
	if isDown {
		im.Status.CurrentState = longhorn.InstanceManagerStateUnknown
		return nil
	}

/controller/instance_handler.go#L56-L70

	if im == nil || im.Status.CurrentState == longhorn.InstanceManagerStateUnknown {
		if status.Started {
			if status.CurrentState != longhorn.InstanceStateUnknown {
				logrus.Warnf("The related node %v of instance %v is down or deleted, will mark the instance as state UNKNOWN", spec.NodeID, instanceName)
			}
			status.CurrentState = longhorn.InstanceStateUnknown
		} else {
			status.CurrentState = longhorn.InstanceStateStopped
			status.CurrentImage = ""
		}
		status.IP = ""
		status.StorageIP = ""
		status.Port = 0
		return
	}
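
For illustration only, a minimal sketch of that alternative in the syncInstanceManager path, reusing the IsNodeDownOrDeleted check quoted above (this is just the shape of the idea, not the actual patch):

	// Sketch: when the node is confirmed down, report the instance manager as
	// error instead of unknown, so the dependent engine/replica statuses are
	// not left in the unknown state.
	isDown, err := imc.ds.IsNodeDownOrDeleted(im.Spec.NodeID)
	if err != nil {
		log.WithError(err).Warnf("cannot check IsNodeDownOrDeleted(%v) when syncInstanceManager", im.Spec.NodeID)
	}
	if isDown {
		im.Status.CurrentState = longhorn.InstanceManagerStateError
		return nil
	}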

cc @derekbit @shuo-wu @c3y1huang

I’m wondering whether the following solution is reasonable.

When the engine is not attached, the replica can be marked as failed directly if its node is down. Except for the attached case, we should not rely on the data plane’s connectivity status. Does that make sense?

Update: this issue and https://github.com/longhorn/longhorn/issues/5477 both pass after applying the solution.

@weizhe0422 @shuo-wu @innobead Can you help check if there is any concern? Thank you.

You mean something like this, right?

	if r.Status.CurrentState != longhorn.InstanceStateRunning {
		nodeDown, err := vc.ds.IsNodeDownOrDeleted(r.Spec.NodeID)
		if err != nil || nodeDown {
			continue
		}
		return nil
	}

Will this cause #5340 to no longer be fixed, because a new replica will be rebuilt? @derekbit @shuo-wu

@innobead @derekbit Using a build that adds this logic to skip the replica on the down node, I executed the test steps below:

  • premise:
    • cluster spec: 1 master + 3 worker nodes
    • OS & k8s: v1.25.4+k3s1
  • steps
  1. Restore a 2G backup with 2 replicas: rep-a, rep-b
  2. During the restore, run the command service k3s-agent stop on the node that contains one of the replicas (rep-a) but does not host the volume engine.
  3. The replica rep_a becomes stopped and rep_b keeps restoring.
  4. After rep_b finishes restoring, both replicas become stopped.
  5. Run the command service k3s-agent start to recover k3s. rep_a becomes running for a second at first and then stops, but there is no rebuilding operation for rep_a, and another replica is not created on another node.

I think this is related to this change: the volume will keep waiting without attaching, since the status of the replica on the stopped node is stopped.

https://github.com/longhorn/longhorn-manager/blob/c29563ca46a476b4ebc52cdfe9e6b8126a31ec10/controller/volume_controller.go#L1609-L1611

if r.Status.CurrentState != longhorn.InstanceStateRunning {
	return nil
}

Before this modification, the time would be recorded in the FailedAt field of the replica on the stopped node: https://github.com/longhorn/longhorn-manager/blob/9e4630586a9484d63d0c6f6826d0568b10391afe/controller/volume_controller.go#L1498-L1500

if r.Spec.FailedAt == "" {
	r.Spec.FailedAt = vc.nowHandler()
}

so it would be skipped in the following check and the volume would then be attached according to the node where it is located: https://github.com/longhorn/longhorn-manager/blob/9e4630586a9484d63d0c6f6826d0568b10391afe/controller/volume_controller.go#L1521-L1523

if r.Spec.FailedAt != "" {
	continue
}

Maybe this check also needs to bypass the replica that is on the stopped node.
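
For example, the bypass could extend the check quoted above roughly as follows (a sketch only, reusing the vc.ds.IsNodeDownOrDeleted helper that already appears in volume_controller.go):

	if r.Spec.FailedAt != "" {
		continue
	}
	// Sketch: also skip replicas sitting on a node that is down or deleted,
	// the same way failed replicas are skipped, so they do not block the
	// volume from being attached elsewhere.
	if isDown, err := vc.ds.IsNodeDownOrDeleted(r.Spec.NodeID); err == nil && isDown {
		continue
	}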

cc @innobead @derekbit

My concern is: when the node is unknown (e.g. kubelet down or network disconnection) but the workload, engine, and replica are still storing new data, the question is whether we should mark the replica as Failed, rather than how the replica can be marked as Failed. In that case, I am wondering whether we could clean up the unknown replica only when it really blocks the attachment; in brief, mark it as Failed and then delete it during attachment.
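
Very roughly, that idea could look something like the following in the attachment flow (isBlockingAttachment is a hypothetical helper used only for illustration, not an existing function):

	// Sketch of the idea above: only when an unknown replica actually blocks
	// the attachment, mark it as failed so it can be cleaned up afterwards.
	if r.Status.CurrentState == longhorn.InstanceStateUnknown && isBlockingAttachment(r) {
		r.Spec.FailedAt = vc.nowHandler()
		r.Spec.DesireState = longhorn.InstanceStateStopped
	}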

I just had a discussion with Derek. He mentioned that the cluster is already messy when there are nodes like this, so emphasizing data integrity at that moment may be meaningless. With my idea, on the other hand, the risk in the attachment flow would increase. We have to make a trade-off; relying on the existing fix is good enough.

If the node is really down and it’s not the attached node, why won’t the engine process mark the replica as mode ERR?

In this case, the engine process marks the replica as mode ERR, as you mentioned, and it is later deleted.

I’m wondering whether the following solution is reasonable.

When the engine is not attached, the replica can be marked as failed directly if its node is down. Except for the attached case, we should not rely on the data plane’s connectivity status. Does that make sense?

Update: this issue and #5477 both pass after applying the solution. @weizhe0422 @shuo-wu @innobead Can you help check if there is any concern? Thank you.

Additionally, I wanted to ask: is it still necessary to check whether the replica’s node is abnormal when its status is not running?

The node-down case is detected here.

I’m wondering whether the following solution is reasonable.

When the engine is not attached, the replica can be marked as failed directly if its node is down. Except for the attached case, we should not rely on the data plane’s connectivity status. Does that make sense?

Well, after applying the patch, the unknown replica is deleted from engine.status.replicaModeMap. That means the IO cannot go to the unknown replica, which breaks the fix in #5340…

@innobead No, the engine state is stopped. I think it is because when the im-e status is synchronized to the engine, the engine is not started (status.Started == false).

And I think you might be right: if the node is down, then the status should be marked as error, so that the spec fields of the replica and engine can be updated. I’ll go into the details and do some testing.

Yeah, something like that.