longhorn: [BUG] Failure under sustained write load with NFS backup target.

Describe the bug

I am seeing a relatively large number of replica failures during sustained write operations (lasting minutes to tens of minutes).

To Reproduce

Steps to reproduce the behavior:

  1. Run a MinIO cluster using the MinIO operator
  2. Wait for the cluster to come up
  3. Using the `mc` client (`mc cp *files* host/bucket`), send around 20 GB, broken into files of around 100 MB each (see the sketch after this list)
  4. Observe replica failures
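
For reference, a minimal sketch of the workload in step 3, assuming a hypothetical `mc` alias `myminio` and bucket `testbucket` (the endpoint, credentials, and paths are illustrative, not from the original report):

```sh
# Generate ~20 GB of test data as 200 files of 100 MB each.
mkdir -p /tmp/testdata
for i in $(seq 1 200); do
  dd if=/dev/urandom of=/tmp/testdata/file-$i.bin bs=1M count=100
done

# Register the MinIO endpoint and create a target bucket (names are hypothetical).
mc alias set myminio http://minio.example.svc.cluster.local:9000 ACCESS_KEY SECRET_KEY
mc mb myminio/testbucket

# Sustained write: push all files to the bucket in one run.
mc cp /tmp/testdata/* myminio/testbucket/
```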

Expected behavior

Replicas should stay up.

Log

root@instance-manager-r-cc03c099:/var/log/instances# cat pvc-04a60ec8-16f3-400d-841d-6f2911e23630-r-ca01d6ee.log 
time="2020-05-06T15:38:22Z" level=info msg="Creating volume /host/var/lib/longhorn/replicas/pvc-04a60ec8-16f3-400d-841d-6f2911e23630-25d075e0, size 42949672960/512"
time="2020-05-06T15:38:22Z" level=info msg="Listening on data server 0.0.0.0:10091"
time="2020-05-06T15:38:22Z" level=info msg="Listening on sync agent server 0.0.0.0:10092"
time="2020-05-06T15:38:22Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10090"
time="2020-05-06T15:38:22Z" level=info msg="Listening on sync 0.0.0.0:10092"
time="2020-05-06T15:38:25Z" level=info msg="New connection from: 10.244.12.196:50926"
time="2020-05-06T15:38:25Z" level=info msg="Opening volume /host/var/lib/longhorn/replicas/pvc-04a60ec8-16f3-400d-841d-6f2911e23630-25d075e0, size 42949672960/512"
time="2020-05-06T15:48:58Z" level=info msg="Closing volume"
time="2020-05-06T15:48:59Z" level=warning msg="Received signal interrupt to shutdown"
time="2020-05-06T15:48:59Z" level=warning msg="Starting to execute registered shutdown func github.com/longhorn/longhorn-engine/app/cmd.startReplica.func4"
time="2020-05-06T15:49:01Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10090"
time="2020-05-06T15:49:01Z" level=info msg="Listening on data server 0.0.0.0:10091"
time="2020-05-06T15:49:01Z" level=info msg="Listening on sync agent server 0.0.0.0:10092"
time="2020-05-06T15:49:01Z" level=info msg="Listening on sync 0.0.0.0:10092"
time="2020-05-06T15:49:04Z" level=info msg="New connection from: 10.244.10.198:46818"
time="2020-05-06T15:49:04Z" level=info msg="Opening volume /host/var/lib/longhorn/replicas/pvc-04a60ec8-16f3-400d-841d-6f2911e23630-25d075e0, size 42949672960/512"
time="2020-05-06T21:08:10Z" level=info msg="Replica server starts to snapshot [1a6e785c-e0e8-4f4b-b929-e66b11f6a88e] volume, user created false, created time 2020-05-06T21:08:10Z, labels map[]"
time="2020-05-06T21:08:10Z" level=info msg="Sending file volume-snap-1a6e785c-e0e8-4f4b-b929-e66b11f6a88e.img to 10.244.11.226:10259"
time="2020-05-06T21:08:10Z" level=info msg="source file size: 42949672960, setting up directIo: true"
time="2020-05-06T21:40:07Z" level=warning msg="Received signal interrupt to shutdown"
time="2020-05-06T21:40:07Z" level=warning msg="Starting to execute registered shutdown func github.com/longhorn/longhorn-engine/app/cmd.startReplica.func4"

Environment:

  • Longhorn version: 0.8.0
  • Kubernetes version: 1.16.0
  • Node OS type and version: Alpine Linux 3.10


About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 36 (36 by maintainers)

Most upvoted comments

@dmayle Understood. That would require automatic throttling when CPU is the bottleneck. In fact, you remind me of one thing: can you check and reduce `node.session.queue_depth` in `/etc/iscsi/iscsid.conf` on each node? It will result in lower bandwidth, of course, but it will also limit the requests sent to the backend, which might fix the issue.
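
For anyone trying this, a minimal sketch of that tuning, assuming open-iscsi's standard config path and a hypothetical reduced depth of 16 (the shipped default is typically 32; the right value depends on the node):

```sh
# Lower the per-session queue depth for sessions created from the config file.
sed -i 's/^node.session.queue_depth.*/node.session.queue_depth = 16/' /etc/iscsi/iscsid.conf

# Update already-discovered node records so future logins pick up the new value.
iscsiadm -m node -o update -n node.session.queue_depth -v 16
```

The new depth only takes effect on the next iSCSI login, e.g. after detaching and reattaching the affected Longhorn volumes on that node.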