longhorn: [BUG] Intermittent issue while rebuilding replica
Describe the bug There is an intermittent failure observed while rebuilding a replica. The rebuild later completes successfully.
To Reproduce Steps to reproduce the behavior:
- Deploy Longhorn on a K8s cluster of 3 nodes (1 etcd/control plane, 3 workers).
- Create volume with 3 replicas.
- Write some data and take some snapshots.
- Enable Data-locality and reduce the replica count to 1.
- After the 2 replicas get deleted, increase the replica count to 3 again.
- Two replicas will get rebuilt.
- Verify the event log at the bottom of the volume page; a rebuild-failed entry sometimes appears there. (A kubectl sketch of these steps follows below.)
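For reference, a minimal kubectl sketch of the enable-data-locality / scale-down / scale-up steps above. The volume name is taken from the log below and the field names (dataLocality, numberOfReplicas) are assumed from the Longhorn v1.1.x volumes.longhorn.io CRD; verify them against your release before running anything.

```
# Hypothetical volume name; substitute the Longhorn volume backing your PVC.
VOLUME=pvc-da52c6eb-1e57-4f25-a300-341df67459e6

# Enable data locality on the volume.
kubectl -n longhorn-system patch volumes.longhorn.io "$VOLUME" \
  --type merge -p '{"spec":{"dataLocality":"best-effort"}}'

# Reduce the replica count to 1, wait for the extra replicas to be removed,
# then raise it back to 3 to trigger the two rebuilds.
kubectl -n longhorn-system patch volumes.longhorn.io "$VOLUME" \
  --type merge -p '{"spec":{"numberOfReplicas":1}}'
kubectl -n longhorn-system patch volumes.longhorn.io "$VOLUME" \
  --type merge -p '{"spec":{"numberOfReplicas":3}}'

# Watch the replicas and the volume's events while the rebuild runs.
kubectl -n longhorn-system get replicas.longhorn.io | grep "$VOLUME"
kubectl -n longhorn-system get events --sort-by=.lastTimestamp | grep -i "$VOLUME"
```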
Log
Failed rebuilding replica with Address 10.42.1.25:10000: failed to add replica address='tcp://10.42.1.25:10000' to controller 'pvc-da52c6eb-1e57-4f25-a300-341df67459e6': failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.1.0-rc1/longhorn [--url 10.42.3.29:10000 add tcp://10.42.1.25:10000], output , stderr,
time="2020-12-08T23:42:29Z" level=info msg="Adding replica tcp://10.42.1.25:10000 in WO mode"
time="2020-12-08T23:42:29Z" level=info msg="Using replica tcp://10.42.3.31:10000 as the source for rebuild"
time="2020-12-08T23:42:29Z" level=info msg="Using replica tcp://10.42.1.25:10000 as the target for rebuild"
time="2020-12-08T23:44:16Z" level=fatal msg="Error running add replica command: failed to sync files [{FromFileName:volume-snap-c-1samky-da21b1c5.img ToFileName:volume-snap-c-1samky-da21b1c5.img ActualSize:69632} {FromFileName:volume-snap-c-1samky-da21b1c5.img.meta ToFileName:volume-snap-c-1samky-da21b1c5.img.meta ActualSize:0} {FromFileName:volume-snap-a6b3e8da-b9ff-4715-8cbc-9290617fd3ba.img ToFileName:volume-snap-a6b3e8da-b9ff-4715-8cbc-9290617fd3ba.img ActualSize:119762944} {FromFileName:volume-snap-a6b3e8da-b9ff-4715-8cbc-9290617fd3ba.img.meta ToFileName:volume-snap-a6b3e8da-b9ff-4715-8cbc-9290617fd3ba.img.meta ActualSize:0}] from tcp://10.42.3.31:10000: rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil"
, error exit status 1
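When the rebuild-failed event shows up, more context can usually be pulled from the instance-manager pods. A hedged starting point (the pod names below are placeholders; list the pods in longhorn-system to find the real ones):

```
# Recent Longhorn events mentioning the rebuild.
kubectl -n longhorn-system get events --sort-by=.lastTimestamp | grep -i rebuild

# Find the engine (instance-manager-e-*) and replica (instance-manager-r-*) pods
# on the nodes involved in the failed sync.
kubectl -n longhorn-system get pods -o wide | grep instance-manager

# Controller-side "failed to add replica" messages (placeholder pod name).
kubectl -n longhorn-system logs instance-manager-e-xxxxxxxx --tail=200

# Replica-side file-sync errors (placeholder pod name).
kubectl -n longhorn-system logs instance-manager-r-xxxxxxxx --tail=200
```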
Environment:
- Longhorn version: v1.1.0-rc1
- Kubernetes version: 18.2
- Node config
- OS type and version: Ubuntu 18.04
- CPU per node: 2 vcpus
- Memory per node: 4 GB
Hello, I'm also experiencing this issue on replica rebuilds. It tends to happen on larger volumes; I've only seen it on anything over 1GB. I'm running 1.1.0 on arm64. Anything I can do to help with debugging?
Had the same problem. I had 2 volumes of 20GB with an actual size of 4.5GB. I connect nodes located in different datacenters through Tailscale and Cilium. Today I observed that the two volumes were degraded and their rebuilds would fail halfway through. While rebuilding, the two nodes communicate directly over Tailscale at about 9M/s. Halfway through the rebuild the traffic suddenly drops, the rebuild fails, and the cycle repeats. This had been going on for about 17 hours before I noticed the high traffic, and I had to lower the number of replicas to end the endless loop. Node CPU and disk usage are low, so I wonder if Longhorn's tolerance for network instability is too low?
- Longhorn version: longhornio/longhorn-engine:v1.3.2
- Kubernetes distro: K8s v1.24.8
- Node OS: RHEL 8.6
- CPU per node: 12+ / 4
- Memory per node: 128GB / 4GB
- Disk type: NVMe SSD / SSD cloud drive
- Network bandwidth and latency between the nodes: through Tailscale at about 9M/s
- Underlying infrastructure: Baremetal / VPS
- Volumes in total: 4
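To follow up on the network-stability question above, one hedged way to rule the link in or out is to measure sustained throughput and latency between the two nodes outside of Longhorn. The address below is illustrative; use your own Tailscale node IPs.

```
# On the node hosting the rebuild source replica:
iperf3 -s

# On the node hosting the rebuild target replica (30 s sustained test):
iperf3 -c 100.64.0.1 -t 30
ping -c 20 100.64.0.1
```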
OK, I tried to reproduce this. I found that if I manually rebuild a single replica (by turning up the replica count), it completes fine. If I try to rebuild 2 replicas at once (rebuilding from two different nodes to the same node), it fails 100% of the time, almost immediately. That is not quite what I observed before, where the traffic would continue for a while; maybe the rebuilds of the two earlier replicas were staggered.
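If two simultaneous rebuilds onto one node are the trigger, one hedged workaround to try is serializing rebuilds per node. The setting name below assumes a release around v1.3.x that exposes concurrent-replica-rebuild-per-node-limit; verify it exists in your version before patching.

```
# Show the current per-node rebuild concurrency limit.
kubectl -n longhorn-system get settings.longhorn.io concurrent-replica-rebuild-per-node-limit

# Allow only one rebuild at a time per node.
kubectl -n longhorn-system patch settings.longhorn.io concurrent-replica-rebuild-per-node-limit \
  --type merge -p '{"value":"1"}'
```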
I recorded a video here:
https://user-images.githubusercontent.com/32383605/204445351-c5190b95-6668-49b6-b4d1-197b0981d265.mp4
longhorn-support-bundle_6322c316-fe96-496a-8f7d-fc5c808a4a25_2022-11-29T05-49-02Z.zip
I have also seen this issue recently (cluster using RPi 4B + HDDs, volume ~100Gi data size). Even with the v1.2.2 engine image the rebuilding process can sometimes fail. By increasing the Guaranteed Engine Manager CPU and Guaranteed Replica Manager CPU settings, some previously failed rebuilds seem to succeed.
@vinid223 Thanks! You can track it here: https://github.com/rancher/charts/pulls?q=is%3Apr+is%3Aopen+longhorn
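For anyone else trying the Guaranteed Engine/Replica Manager CPU workaround mentioned above, a sketch of bumping those reservations via kubectl. The setting names guaranteed-engine-manager-cpu and guaranteed-replica-manager-cpu are assumed from the v1.1.1–v1.2.x setting list and take a percentage of node CPU; check the Settings page of your version before applying.

```
kubectl -n longhorn-system patch settings.longhorn.io guaranteed-engine-manager-cpu \
  --type merge -p '{"value":"15"}'
kubectl -n longhorn-system patch settings.longhorn.io guaranteed-replica-manager-cpu \
  --type merge -p '{"value":"15"}'
```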
My rebuild never succeeds for large replicas.
It fails with:
rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil
longhorn-support-bundle_70cfb67f-d94c-494f-bbb9-17c10d783ca3_2021-04-12T15-59-14Z.zip
- Longhorn version: 1.1.0
- Kubernetes distro: K3s v1.18.9+k3s1
- Node OS: v0.11.1
- CPU per node: vary
- Memory per node: vary
- Disk type: vary (SSD & HDD)
- Network bandwidth and latency between the nodes: rebuild fails for nodes with connection latency ~50ms
- Underlying infrastructure: KVM
- Volumes in total: 10
I am experiencing the same issue right now, though I also see it with volumes smaller than 1GB, e.g. 100MB. However, it seems that smaller volumes will eventually be able to replicate, while bigger ones might not.