longhorn: [BUG] Intermittent issue while rebuilding replica
Describe the bug There is an intermittent failure observed while rebuilding a replica. The rebuild later completes successfully.
To Reproduce Steps to reproduce the behavior:
- Deploy Longhorn on a K8s cluster of 3 nodes (1 etcd/control plane, 3 workers).
- Create volume with 3 replicas.
- Write some data and take some snapshots.
- Enable Data-locality and reduce the replica count to 1.
- After the 2 replicas get deleted, increase the replica count to 3 again.
- Two replicas will get rebuilt.
- Verify the event log at the bottom of the volume page; a rebuild-failed entry sometimes appears there. (A kubectl sketch of these steps follows below.)
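For reference, a minimal kubectl sketch of the enable-data-locality / scale-down / scale-up steps above. The volume name is taken from the log below and the field names (dataLocality, numberOfReplicas) are assumed from the Longhorn v1.1.x volumes.longhorn.io CRD; verify them against your release before running anything.

```
# Hypothetical volume name; substitute the Longhorn volume backing your PVC.
VOLUME=pvc-da52c6eb-1e57-4f25-a300-341df67459e6

# Enable data locality on the volume.
kubectl -n longhorn-system patch volumes.longhorn.io "$VOLUME" \
  --type merge -p '{"spec":{"dataLocality":"best-effort"}}'

# Reduce the replica count to 1, wait for the extra replicas to be removed,
# then raise it back to 3 to trigger the two rebuilds.
kubectl -n longhorn-system patch volumes.longhorn.io "$VOLUME" \
  --type merge -p '{"spec":{"numberOfReplicas":1}}'
kubectl -n longhorn-system patch volumes.longhorn.io "$VOLUME" \
  --type merge -p '{"spec":{"numberOfReplicas":3}}'

# Watch the replicas and the volume's events while the rebuild runs.
kubectl -n longhorn-system get replicas.longhorn.io | grep "$VOLUME"
kubectl -n longhorn-system get events --sort-by=.lastTimestamp | grep -i "$VOLUME"
```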
Log
Failed rebuilding replica with Address 10.42.1.25:10000: failed to add replica address='tcp://10.42.1.25:10000' to controller 'pvc-da52c6eb-1e57-4f25-a300-341df67459e6': failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.1.0-rc1/longhorn [--url 10.42.3.29:10000 add tcp://10.42.1.25:10000], output , stderr,
time="2020-12-08T23:42:29Z" level=info msg="Adding replica tcp://10.42.1.25:10000 in WO mode"
time="2020-12-08T23:42:29Z" level=info msg="Using replica tcp://10.42.3.31:10000 as the source for rebuild"
time="2020-12-08T23:42:29Z" level=info msg="Using replica tcp://10.42.1.25:10000 as the target for rebuild"
time="2020-12-08T23:44:16Z" level=fatal msg="Error running add replica command: failed to sync files [{FromFileName:volume-snap-c-1samky-da21b1c5.img ToFileName:volume-snap-c-1samky-da21b1c5.img ActualSize:69632} {FromFileName:volume-snap-c-1samky-da21b1c5.img.meta ToFileName:volume-snap-c-1samky-da21b1c5.img.meta ActualSize:0} {FromFileName:volume-snap-a6b3e8da-b9ff-4715-8cbc-9290617fd3ba.img ToFileName:volume-snap-a6b3e8da-b9ff-4715-8cbc-9290617fd3ba.img ActualSize:119762944} {FromFileName:volume-snap-a6b3e8da-b9ff-4715-8cbc-9290617fd3ba.img.meta ToFileName:volume-snap-a6b3e8da-b9ff-4715-8cbc-9290617fd3ba.img.meta ActualSize:0}] from tcp://10.42.3.31:10000: rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil"
, error exit status 1
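When the rebuild-failed event shows up, more context can usually be pulled from the instance-manager pods. A hedged starting point (the pod names below are placeholders; list the pods in longhorn-system to find the real ones):

```
# Recent Longhorn events mentioning the rebuild.
kubectl -n longhorn-system get events --sort-by=.lastTimestamp | grep -i rebuild

# Find the engine (instance-manager-e-*) and replica (instance-manager-r-*) pods
# on the nodes involved in the failed sync.
kubectl -n longhorn-system get pods -o wide | grep instance-manager

# Controller-side "failed to add replica" messages (placeholder pod name).
kubectl -n longhorn-system logs instance-manager-e-xxxxxxxx --tail=200

# Replica-side file-sync errors (placeholder pod name).
kubectl -n longhorn-system logs instance-manager-r-xxxxxxxx --tail=200
```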
Environment:
- Longhorn version: v1.1.0-rc1
- Kubernetes version: 18.2
- Node config
- OS type and version: Ubuntu 18.04
- CPU per node: 2 vcpus
- Memory per node: 4 GB
Hello, I'm also experiencing this issue on replica rebuilds. It tends to happen on larger volumes; I've only seen it on anything over 1GB. I'm running 1.1.0 on arm64. Anything I can do to help with debugging?
Had the same problem. I had 2 volumes of 20GB with an actual size of 4.5GB. I connect nodes located in different datacenters through Tailscale and Cilium. Today I observed that the two volumes were degraded and their rebuilds would fail halfway through. While rebuilding, the two nodes communicate directly over Tailscale at about 9M/s. Halfway through the rebuild the traffic suddenly drops, the rebuild fails, and the cycle repeats. This had been going on for about 17 hours before I noticed the high traffic, and I had to lower the number of replicas to end the endless loop. Node CPU and disk usage are low, so I wonder if Longhorn's tolerance for network instability is too low?
- Longhorn version: longhornio/longhorn-engine:v1.3.2
- Kubernetes distro: K8s v1.24.8
- Node OS: RHEL 8.6
- CPU per node: 12+ / 4
- Memory per node: 128GB / 4GB
- Disk type: NVMe SSD / SSD cloud drive
- Network bandwidth and latency between the nodes: through Tailscale at about 9M/s
- Underlying infrastructure: Baremetal / VPS
- Volumes in total: 4
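To follow up on the network-stability question above, one hedged way to rule the link in or out is to measure sustained throughput and latency between the two nodes outside of Longhorn. The address below is illustrative; use your own Tailscale node IPs.

```
# On the node hosting the rebuild source replica:
iperf3 -s

# On the node hosting the rebuild target replica (30 s sustained test):
iperf3 -c 100.64.0.1 -t 30
ping -c 20 100.64.0.1
```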
OK, I tried to reproduce this. I found that if I manually rebuild a single replica (by turning up the replica count), it completes fine. If I try to rebuild 2 replicas at once (rebuilding from two different nodes to the same node), it fails 100% of the time, almost immediately. That is not quite what I observed before, where the traffic would continue for a while; maybe the rebuilds of the two earlier replicas were staggered.
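If two simultaneous rebuilds onto one node are the trigger, one hedged workaround to try is serializing rebuilds per node. The setting name below assumes a release around v1.3.x that exposes concurrent-replica-rebuild-per-node-limit; verify it exists in your version before patching.

```
# Show the current per-node rebuild concurrency limit.
kubectl -n longhorn-system get settings.longhorn.io concurrent-replica-rebuild-per-node-limit

# Allow only one rebuild at a time per node.
kubectl -n longhorn-system patch settings.longhorn.io concurrent-replica-rebuild-per-node-limit \
  --type merge -p '{"value":"1"}'
```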
I recorded a video here:
https://user-images.githubusercontent.com/32383605/204445351-c5190b95-6668-49b6-b4d1-197b0981d265.mp4
longhorn-support-bundle_6322c316-fe96-496a-8f7d-fc5c808a4a25_2022-11-29T05-49-02Z.zip
I have also seen this issue recently (cluster using RPi 4B + HDDs, volume ~100Gi data size). Even with the v1.2.2 engine image the rebuilding process can sometimes fail. By increasing the Guaranteed Engine Manager CPU and Guaranteed Replica Manager CPU settings, some previously failed rebuilds seem to succeed.
@vinid223 Thanks! You can track it here: https://github.com/rancher/charts/pulls?q=is%3Apr+is%3Aopen+longhorn
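For anyone else trying the Guaranteed Engine/Replica Manager CPU workaround mentioned above, a sketch of bumping those reservations via kubectl. The setting names guaranteed-engine-manager-cpu and guaranteed-replica-manager-cpu are assumed from the v1.1.1–v1.2.x setting list and take a percentage of node CPU; check the Settings page of your version before applying.

```
kubectl -n longhorn-system patch settings.longhorn.io guaranteed-engine-manager-cpu \
  --type merge -p '{"value":"15"}'
kubectl -n longhorn-system patch settings.longhorn.io guaranteed-replica-manager-cpu \
  --type merge -p '{"value":"15"}'
```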
My rebuild never succeeds for large replicas.
It fails with:
rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil
longhorn-support-bundle_70cfb67f-d94c-494f-bbb9-17c10d783ca3_2021-04-12T15-59-14Z.zip
- Longhorn version: 1.1.0
- Kubernetes distro: K3s v1.18.9+k3s1
- Node OS: v0.11.1
- CPU per node: vary
- Memory per node: vary
- Disk type: vary (SSD & HDD)
- Network bandwidth and latency between the nodes: rebuild fails for nodes with connection latency ~50ms
- Underlying infrastructure: KVM
- Volumes in total: 10
I am experiencing the same issue right now, though I also see it with volumes smaller than 1GB, e.g. 100MB. However, it seems that smaller volumes will eventually be able to replicate, while bigger ones might not.