longhorn: [BUG] Failure under sustained write load with NFS backup target.

Describe the bug

I am seeing a relatively large number of replica failures during sustained write operations (lasting minutes to tens of minutes).

To Reproduce

Steps to reproduce the behavior:

  1. Run a MinIO cluster using the MinIO operator
  2. Wait for the cluster to come up
  3. Using the `mc` client (`mc cp *files* host/bucket`), send around 20 GB, broken into files of around 100 MB each (see the sketch after this list)
  4. Observe replica failures
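
For reference, a minimal sketch of the workload in step 3, assuming a hypothetical `mc` alias `myminio` and bucket `testbucket` (the endpoint, credentials, and paths are illustrative, not from the original report):

```sh
# Generate ~20 GB of test data as 200 files of 100 MB each.
mkdir -p /tmp/testdata
for i in $(seq 1 200); do
  dd if=/dev/urandom of=/tmp/testdata/file-$i.bin bs=1M count=100
done

# Register the MinIO endpoint and create a target bucket (names are hypothetical).
mc alias set myminio http://minio.example.svc.cluster.local:9000 ACCESS_KEY SECRET_KEY
mc mb myminio/testbucket

# Sustained write: push all files to the bucket in one run.
mc cp /tmp/testdata/* myminio/testbucket/
```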

Expected behavior

Replicas should stay up.

Log

root@instance-manager-r-cc03c099:/var/log/instances# cat pvc-04a60ec8-16f3-400d-841d-6f2911e23630-r-ca01d6ee.log 
time="2020-05-06T15:38:22Z" level=info msg="Creating volume /host/var/lib/longhorn/replicas/pvc-04a60ec8-16f3-400d-841d-6f2911e23630-25d075e0, size 42949672960/512"
time="2020-05-06T15:38:22Z" level=info msg="Listening on data server 0.0.0.0:10091"
time="2020-05-06T15:38:22Z" level=info msg="Listening on sync agent server 0.0.0.0:10092"
time="2020-05-06T15:38:22Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10090"
time="2020-05-06T15:38:22Z" level=info msg="Listening on sync 0.0.0.0:10092"
time="2020-05-06T15:38:25Z" level=info msg="New connection from: 10.244.12.196:50926"
time="2020-05-06T15:38:25Z" level=info msg="Opening volume /host/var/lib/longhorn/replicas/pvc-04a60ec8-16f3-400d-841d-6f2911e23630-25d075e0, size 42949672960/512"
time="2020-05-06T15:48:58Z" level=info msg="Closing volume"
time="2020-05-06T15:48:59Z" level=warning msg="Received signal interrupt to shutdown"
time="2020-05-06T15:48:59Z" level=warning msg="Starting to execute registered shutdown func github.com/longhorn/longhorn-engine/app/cmd.startReplica.func4"
time="2020-05-06T15:49:01Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10090"
time="2020-05-06T15:49:01Z" level=info msg="Listening on data server 0.0.0.0:10091"
time="2020-05-06T15:49:01Z" level=info msg="Listening on sync agent server 0.0.0.0:10092"
time="2020-05-06T15:49:01Z" level=info msg="Listening on sync 0.0.0.0:10092"
time="2020-05-06T15:49:04Z" level=info msg="New connection from: 10.244.10.198:46818"
time="2020-05-06T15:49:04Z" level=info msg="Opening volume /host/var/lib/longhorn/replicas/pvc-04a60ec8-16f3-400d-841d-6f2911e23630-25d075e0, size 42949672960/512"
time="2020-05-06T21:08:10Z" level=info msg="Replica server starts to snapshot [1a6e785c-e0e8-4f4b-b929-e66b11f6a88e] volume, user created false, created time 2020-05-06T21:08:10Z, labels map[]"
time="2020-05-06T21:08:10Z" level=info msg="Sending file volume-snap-1a6e785c-e0e8-4f4b-b929-e66b11f6a88e.img to 10.244.11.226:10259"
time="2020-05-06T21:08:10Z" level=info msg="source file size: 42949672960, setting up directIo: true"
time="2020-05-06T21:40:07Z" level=warning msg="Received signal interrupt to shutdown"
time="2020-05-06T21:40:07Z" level=warning msg="Starting to execute registered shutdown func github.com/longhorn/longhorn-engine/app/cmd.startReplica.func4"

Environment:

  • Longhorn version: 0.8.0
  • Kubernetes version: 1.16.0
  • Node OS type and version: Alpine Linux 3.10


About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 36 (36 by maintainers)

Most upvoted comments

@dmayle Understood. That would require automatic throttling when CPU is the bottleneck. In fact, you remind me of one thing: can you check and reduce `node.session.queue_depth` in `/etc/iscsi/iscsid.conf` on each node? It will result in lower bandwidth, of course, but it will also limit the requests sent to the backend, which might fix the issue.
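
For anyone trying this, a minimal sketch of that tuning, assuming open-iscsi's standard config path and a hypothetical reduced depth of 16 (the shipped default is typically 32; the right value depends on the node):

```sh
# Lower the per-session queue depth for sessions created from the config file.
sed -i 's/^node.session.queue_depth.*/node.session.queue_depth = 16/' /etc/iscsi/iscsid.conf

# Update already-discovered node records so future logins pick up the new value.
iscsiadm -m node -o update -n node.session.queue_depth -v 16
```

The new depth only takes effect on the next iSCSI login, e.g. after detaching and reattaching the affected Longhorn volumes on that node.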