longhorn: [BUG] Expansion error still occurs in 1.4.2

Describe the bug (šŸ› if you encounter this issue)

Issue #5513 was documented as resolved in the 1.4.2 release notes. Yesterday I upgraded from 1.4.1 to 1.4.2. This morning, after a node restart, the error occurred again.

Expansion Error: BUG: The expected size 2147483648 of engine pvc-97135cac-3890-42b1-bc32-80a0edec1b2e-e-6151baeb should not be smaller than the current size 5368709120
Note: You can cancel the expansion to avoid volume crash

Cancelling the expansion fails with an error stating that the expansion has not started yet. Due to the node restart we only have 1 replica instead of 3, because rebuilding cannot start, and we are not able to recover from this state except by restoring from backup. Because there still seems to be a chance that this issue can occur at any time and there is no good way to recover, we have to stop the rollout of version 1.4.2.
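For anyone in the same state: the error above claims the expected size (2147483648 bytes, i.e. 2 GiB) is smaller than the current size (5368709120 bytes, i.e. 5 GiB). A quick way to see this mismatch directly is to compare the requested and current sizes on the engine CRs (a minimal sketch; the field paths assume the default longhorn-system namespace and the v1beta2 Engine CRD):

# List every engine with its requested (spec) and current (status) size, so the
# engine named in the error can be checked for a spec-smaller-than-status mismatch.
kubectl -n longhorn-system get engines.longhorn.io \
  -o custom-columns=NAME:.metadata.name,SPEC_SIZE:.spec.volumeSize,CURRENT_SIZE:.status.currentSize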

To Reproduce

It’s not reliably reproducible; it happens from time to time after a node restart.

Expected behavior

The issue should not occur. At a minimum, it would be helpful to know how to recover from this state.

Log or Support bundle

Environment

  • Longhorn version: 1.4.2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE1 / K8S v1.24.13
    • Number of management node in the cluster: 3
    • Number of worker node in the cluster: 4
  • Node config
    • OS type and version: CentOS 7
    • CPU per node: 16 vCPU
    • Memory per node: 64GB
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes: 2 - 6 Gbit/s
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Cloud/KVM
  • Number of Longhorn volumes in the cluster: 110

Additional context

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 32 (15 by maintainers)

Most upvoted comments

cc @longhorn/dev

I was able to trigger this using the modifications I have made for #5845 on master-head. Because of those modifications, the ā€œtargetedā€ replica was not expanded. However, there is clearly still the potential for incorrect expansion in all released versions of Longhorn.

I triggered it in a three node cluster with 50 volumes (150 replicas), all actively mounted to a toy nginx workload. I used the script below to repeatedly kill an instance-manager pod. It took six iterations before a log message surfaced indicating that the modifications prevented an issue.

#!/bin/bash

# Record the start time so the log checks below only scan messages produced since this test run began.
current_time=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

while true; do
    # Kill the instance-manager pod with a short grace period.
    kubectl -n longhorn-system delete --grace-period=1 pod instance-manager-c090b6286e8431380a8cbe71c3fb43ec

    # Give volumes time to recover and replicas time to rebuild.
    sleep 300

    # Scan the instance-manager and longhorn-manager logs for the trigger keywords.
    kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager --tail=-1 --since-time="$current_time" | grep -i -e "invalid grpc metadata"
    log_result_1=$?

    kubectl -n longhorn-system logs -l app=longhorn-manager --tail=-1 --since-time="$current_time" | grep -i -e "invalid grpc metadata" -e "incorrect volume name" -e "incorrect instance name"
    log_result_2=$?

    # Stop as soon as either grep finds a match.
    if [[ $log_result_1 -eq 0 || $log_result_2 -eq 0 ]]; then
        echo "Execution stopped. Trigger keywords found in logs."
        break
    else
        echo "Execution completed successfully."
    fi
done
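The pod name above is hardcoded for my test cluster; on another cluster the instance-manager pod name will differ. A small variation (an assumption on my part, reusing the same label selector the log check already relies on) would be to look it up on each iteration:

# Hypothetical variant: pick one instance-manager pod by label instead of hardcoding its name.
target_pod=$(kubectl -n longhorn-system get pod -l longhorn.io/component=instance-manager -o name | head -n 1)
kubectl -n longhorn-system delete --grace-period=1 "$target_pod"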

The killed instance-manager pod logged the following:

2023-06-27T19:48:22.606088859Z [pvc-059a429b-28f7-4b21-a7cc-45c8632be109-r-cd8811d9] time="2023-06-27T19:48:22Z" level=error msg="Invalid gRPC metadata" clientVolumeName=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2 method=/ptypes.ReplicaService/ReplicaGet serverVolumeName=pvc-059a429b-28f7-4b21-a7cc-45c8632be109
2023-06-27T19:48:22.607799147Z [pvc-c1a6441c-bf6e-4a5d-8a6c-58b02da82938-r-03074ae7] time="2023-06-27T19:48:22Z" level=error msg="Invalid gRPC metadata" clientVolumeName=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf method=/ptypes.ReplicaService/ReplicaGet serverVolumeName=pvc-c1a6441c-bf6e-4a5d-8a6c-58b02da82938

The longhorn-manager pod on the same node logged the following:

2023-06-27T19:48:22.607532292Z time="2023-06-27T19:48:22Z" level=error msg="Failed to rebuild replica 10.42.150.121:10147" controller=longhorn-engine engine=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff error="proxyServer=10.42.150.121:8501 destination=10.42.150.121:10065: failed to add replica tcp://10.42.150.121:10147 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10147: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address" node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2
2023-06-27T19:48:22.607555900Z time="2023-06-27T19:48:22Z" level=info msg="Removing failed rebuilding replica 10.42.150.121:10147" controller=longhorn-engine engine=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-70c30a6e-c387-4352-aae1-86f94ad334d2
2023-06-27T19:48:22.607560110Z time="2023-06-27T19:48:22Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-70c30a6e-c387-4352-aae1-86f94ad334d2-e-799c7bff\", UID:\"fb4612d5-6ca8-4417-a829-e65fc1eab29a\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"52778902\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRebuilding' Failed rebuilding replica with Address 10.42.150.121:10147: proxyServer=10.42.150.121:8501 destination=10.42.150.121:10065: failed to add replica tcp://10.42.150.121:10147 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10147: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address"
2023-06-27T19:48:22.608620253Z time="2023-06-27T19:48:22Z" level=error msg="Failed to rebuild replica 10.42.150.121:10137" controller=longhorn-engine engine=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf-e-17a35c32 error="proxyServer=10.42.150.121:8501 destination=10.42.150.121:10062: failed to add replica tcp://10.42.150.121:10137 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10137: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address" node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf
2023-06-27T19:48:22.608673822Z time="2023-06-27T19:48:22Z" level=info msg="Removing failed rebuilding replica 10.42.150.121:10137" controller=longhorn-engine engine=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf-e-17a35c32 node=eweber-v124-worker-1ae51dbb-4pngn volume=pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf
2023-06-27T19:48:22.608734835Z time="2023-06-27T19:48:22Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-6ea5e001-3f33-4f2b-9c03-95271d95dccf-e-17a35c32\", UID:\"0aacfd87-07ee-47b3-b3bb-bd3a068765c7\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"52778904\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRebuilding' Failed rebuilding replica with Address 10.42.150.121:10137: proxyServer=10.42.150.121:8501 destination=10.42.150.121:10062: failed to add replica tcp://10.42.150.121:10137 for volume: rpc error: code = Unknown desc = failed to get replica 10.42.150.121:10137: rpc error: code = FailedPrecondition desc = Incorrect volume name; check replica address"

I have a complete support bundle from moments after it occurred, so hopefully I can understand what went wrong.

Longhorn v1.4.3 should be out tomorrow. It includes an additional fix for a reliably reproducible recreation of this issue that we were tracking in https://github.com/longhorn/longhorn/issues/6217. If you are able, please upgrade your v1.4.2 cluster to v1.4.3 and see if the problem persists.

To be clear, if a volume has already been hit by the inappropriate expansion issue, the new fix will not recover it. However, we are very interested to know whether the fix prevents new inappropriate expansions. If you have data (good or bad) after an upgrade, please let us know here.
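If it helps, one way to watch for a recurrence after upgrading (a rough sketch, reusing the label selectors and log messages already quoted in this thread; the exact strings may differ in your environment) is to periodically grep the manager logs:

# Scan longhorn-manager and instance-manager logs for signs of a new inappropriate expansion.
kubectl -n longhorn-system logs -l app=longhorn-manager --tail=-1 --since=24h \
  | grep -i -e "should not be smaller than the current size" -e "incorrect volume name"
kubectl -n longhorn-system logs -l longhorn.io/component=instance-manager --tail=-1 --since=24h \
  | grep -i -e "invalid grpc metadata"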

Upgraded our cluster to 1.4.3. Now observing…