longhorn: [BUG] Replica rebuild fail with context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Question
I created a 100GB volume with 3 replicas. Right after creation, I added 20GB of data to it. For some reason, only 1 replica is healthy. The Longhorn UI always shows a replica rebuilding, but the rebuild always fails partway through. Below are some logs that might be helpful. I also have a 2GB volume with 3 replicas and it is working fine.
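Besides the UI, I have been watching the rebuild from the CLI through the Longhorn CRDs (the volume name and label selector below are placeholders; exact columns depend on the Longhorn version):
# kubectl -n longhorn-system get volumes.longhorn.io
# kubectl -n longhorn-system get replicas.longhorn.io -o wide
# kubectl -n longhorn-system get replicas.longhorn.io -l longhornvolume=<volume-name> -o yaml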
- Replica (source/healthy)
# kubectl logs -f instance-manager-r-92cd6def -n longhorn-system
[...]
[volume-r-ec113baa] time="2021-07-02T05:44:03Z" level=info msg="Sending file volume-snap-f0beb2fc-2f13-4021-bc6a-2f2f4aabb0c9.img to 10.233.108.7:10006"
time="2021-07-02T05:44:03Z" level=info msg="source file size: 107374182400, setting up directIO: true"
[volume-r-ec113baa] time="2021-07-02T05:54:12Z" level=error msg="writeData for batchInterval: [ 4668128: 4668160]( 32) failed, err: writeData failed, err: Post \"http://10.233.108.7:10006/v1-ssync/writeData?begin=19120652288&end=19120783360\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
time="2021-07-02T05:54:12Z" level=error msg="syncFileContent failed: syncDataInterval [ 4651008: 4673536](22528) failed, err: writeData failed, err: Post \"http://10.233.108.7:10006/v1-ssync/writeData?begin=19120652288&end=19120783360\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
- Replica (destination/rebuilding)
# kubectl logs -f instance-manager-r-192f695e -n longhorn-system
[...]
[volume-r-cb38f33c] time="2021-07-02T05:44:05Z" level=info msg="Running ssync server for file volume-snap-f0beb2fc-2f13-4021-bc6a-2f2f4aabb0c9.img at port 10006"
time="2021-07-02T05:44:05Z" level=info msg="Creating Ssync service"
[volume-r-cb38f33c] time="2021-07-02T05:44:05Z" level=info msg="open: receiving fileSize: 107374182400, setting up directIO: true"
[volume-r-cb38f33c] time="2021-07-02T05:44:05Z" level=info msg="Ssync server opened and ready"
[volume-r-cb38f33c] time="2021-07-02T05:44:05Z" level=info msg="Closing ssync server"
[volume-r-cb38f33c] time="2021-07-02T05:44:05Z" level=info msg="Done running ssync server for file volume-snap-be46f111-c1b4-451b-9a6f-7124cb4656e6.img.meta at port 10005"
[volume-r-cb38f33c] time="2021-07-02T05:54:14Z" level=info msg="Closing ssync server"
[volume-r-cb38f33c] time="2021-07-02T05:54:14Z" level=info msg="Done running ssync server for file volume-snap-f0beb2fc-2f13-4021-bc6a-2f2f4aabb0c9.img at port 10006"
time="2021-07-02T05:54:14Z" level=error msg="Sync agent gRPC server failed to rebuild replica/sync files: replica tcp://10.233.95.9:10000 failed to send file volume-snap-f0beb2fc-2f13-4021-bc6a-2f2f4aabb0c9.img to 10.233.108.7:10006: failed to send file volume-snap-f0beb2fc-2f13-4021-bc6a-2f2f4aabb0c9.img to 10.233.108.7:10006: rpc error: code = Unknown desc = syncDataInterval [ 4651008: 4673536](22528) failed, err: writeData failed, err: Post \"http://10.233.108.7:10006/v1-ssync/writeData?begin=19120652288&end=19120783360\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
ERROR: 2021/07/02 05:54:14 grpc: server failed to encode response: rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil
[volume-r-cb38f33c] time="2021-07-02T05:54:14Z" level=info msg="Closing volume"
- Instance Manager
# kubectl logs -f -n longhorn-system instance-manager-e-3882fa60
[...]
[volume-e-ba6f199c] time="2021-07-02T05:43:34Z" level=info msg="Set revision counter of 10.233.108.7:10000 to : 0"
[volume-e-ba6f199c] time="2021-07-02T05:43:34Z" level=info msg="Set backend tcp://10.233.108.7:10000 revision counter to 0"
[volume-e-ba6f199c] time="2021-07-02T05:43:34Z" level=info msg="Synchronizing volume-head-003.img.meta to volume-head-001.img.meta@10.233.108.7:10003"
[volume-e-ba6f199c] time="2021-07-02T05:43:34Z" level=info msg="Done synchronizing volume-head-003.img.meta to volume-head-001.img.meta@10.233.108.7:10003"
[volume-e-ba6f199c] time="2021-07-02T05:54:12Z" level=info msg="Removing backend: tcp://10.233.108.7:10000"
[volume-e-ba6f199c] time="2021-07-02T05:54:12Z" level=info msg="Closing: 10.233.108.7:10000"
[volume-e-ba6f199c] time="2021-07-02T05:54:13Z" level=error msg="Error reading from wire: EOF"
[volume-e-ba6f199c] time="2021-07-02T05:54:14Z" level=error msg="Backend tcp://10.233.108.7:10000 monitoring failed, mark as ERR: EOF"
time="2021-07-02T05:54:14Z" level=info msg="Monitoring stopped tcp://10.233.108.7:10000"
[volume-e-ba6f199c] time="2021-07-02T05:54:15Z" level=info msg="Connecting to remote: 10.233.105.5:10000"
[volume-e-ba6f199c] time="2021-07-02T05:54:15Z" level=info msg="Opening: 10.233.105.5:10000"
[volume-e-ba6f199c] time="2021-07-02T05:54:15Z" level=info msg="Starting to snapshot: 10.233.95.9:10000 41f30916-84d5-4ee7-bd83-8cd6bea18749 UserCreated false Created at 2021-07-02T05:54:15Z, Labels map[]"
[volume-e-ba6f199c] time="2021-07-02T05:54:15Z" level=info msg="Finished to snapshot: 10.233.95.9:10000 41f30916-84d5-4ee7-bd83-8cd6bea18749 UserCreated false Created at 2021-07-02T05:54:15Z, Labels map[]"
[volume-e-ba6f199c] time="2021-07-02T05:54:15Z" level=info msg="Starting to snapshot: 10.233.105.5:10000 41f30916-84d5-4ee7-bd83-8cd6bea18749 UserCreated false Created at 2021-07-02T05:54:15Z, Labels map[]"
[volume-e-ba6f199c] time="2021-07-02T05:54:15Z" level=info msg="Finished to snapshot: 10.233.105.5:10000 41f30916-84d5-4ee7-bd83-8cd6bea18749 UserCreated false Created at 2021-07-02T05:54:15Z, Labels map[]"
time="2021-07-02T05:54:15Z" level=info msg="Adding backend: tcp://10.233.105.5:10000"
time="2021-07-02T05:54:15Z" level=info msg="Start monitoring tcp://10.233.105.5:10000"
[volume-e-ba6f199c] time="2021-07-02T05:54:15Z" level=info msg="Set revision counter of 10.233.105.5:10000 to : 0"
[volume-e-ba6f199c] time="2021-07-02T05:54:15Z" level=info msg="Set backend tcp://10.233.105.5:10000 revision counter to 0"
[volume-e-ba6f199c] time="2021-07-02T05:54:15Z" level=info msg="Synchronizing volume-head-004.img.meta to volume-head-001.img.meta@10.233.105.5:10003"
[volume-e-ba6f199c] time="2021-07-02T05:54:15Z" level=info msg="Done synchronizing volume-head-004.img.meta to volume-head-001.img.meta@10.233.105.5:10003"
Environment:
- Longhorn version: 1.1.1
- Kubernetes version: 1.21
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
Can you please give me some direction on how to investigate this kind of issue?
Thanks!
About this issue
- State: closed
- Created 3 years ago
- Comments: 21 (8 by maintainers)
So, the hard drives are external iSCSI storage, right?
Longhorn currently relies on a stable and fast underlying network infrastructure, so we did not create an option to let the user control the timeout value.
That said, we have noticed this kind of networking issue across nodes, and we'll try our best to fix it.
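In the meantime, a quick sanity check of the inter-node path during a rebuild can help confirm whether the network is the bottleneck. For example, with iperf3 run directly on two of the nodes (or from debug pods); this is an illustrative check, not a Longhorn tool:
On the node hosting the rebuilding replica:
# iperf3 -s
On the node hosting the healthy replica:
# iperf3 -c <rebuilding-node-ip> -t 30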
@jenting are you going to consider this as an issue or as working by design?