longhorn: [BUG] Failure under sustained write load with NFS backup target.
Describe the bug
I see a relatively large number of replica failures during sustained write operations (lasting minutes to tens of minutes).
To Reproduce Steps to reproduce the behavior:
- Run a MinIO cluster using minio operator
- wait for cluster to come up
- Using the mc client, send around 20GB of data, broken into files of around 100MB each (mc cp *files* host/bucket)
- Observe the replica failure
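The load in the repro steps above can be generated with a short script. This is a hedged sketch, not the reporter's exact commands: the `myminio/testbucket` alias and bucket are placeholders, and the file count is scaled down here (the report used roughly 200 x 100MB files for ~20GB total).

```shell
# Generate a few 100MB files of the kind used in the repro (scaled down to 3 files).
mkdir -p /tmp/load
for i in $(seq 1 3); do
  dd if=/dev/zero of="/tmp/load/file-$i" bs=1M count=100 status=none
done
du -sh /tmp/load

# Then push them to the MinIO cluster with the mc client, as in the repro:
#   mc cp /tmp/load/* myminio/testbucket/
```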
Expected behavior
Replicas should stay up.
Log
root@instance-manager-r-cc03c099:/var/log/instances# cat pvc-04a60ec8-16f3-400d-841d-6f2911e23630-r-ca01d6ee.log
time="2020-05-06T15:38:22Z" level=info msg="Creating volume /host/var/lib/longhorn/replicas/pvc-04a60ec8-16f3-400d-841d-6f2911e23630-25d075e0, size 42949672960/512"
time="2020-05-06T15:38:22Z" level=info msg="Listening on data server 0.0.0.0:10091"
time="2020-05-06T15:38:22Z" level=info msg="Listening on sync agent server 0.0.0.0:10092"
time="2020-05-06T15:38:22Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10090"
time="2020-05-06T15:38:22Z" level=info msg="Listening on sync 0.0.0.0:10092"
time="2020-05-06T15:38:25Z" level=info msg="New connection from: 10.244.12.196:50926"
time="2020-05-06T15:38:25Z" level=info msg="Opening volume /host/var/lib/longhorn/replicas/pvc-04a60ec8-16f3-400d-841d-6f2911e23630-25d075e0, size 42949672960/512"
time="2020-05-06T15:48:58Z" level=info msg="Closing volume"
time="2020-05-06T15:48:59Z" level=warning msg="Received signal interrupt to shutdown"
time="2020-05-06T15:48:59Z" level=warning msg="Starting to execute registered shutdown func github.com/longhorn/longhorn-engine/app/cmd.startReplica.func4"
time="2020-05-06T15:49:01Z" level=info msg="Listening on gRPC Replica server 0.0.0.0:10090"
time="2020-05-06T15:49:01Z" level=info msg="Listening on data server 0.0.0.0:10091"
time="2020-05-06T15:49:01Z" level=info msg="Listening on sync agent server 0.0.0.0:10092"
time="2020-05-06T15:49:01Z" level=info msg="Listening on sync 0.0.0.0:10092"
time="2020-05-06T15:49:04Z" level=info msg="New connection from: 10.244.10.198:46818"
time="2020-05-06T15:49:04Z" level=info msg="Opening volume /host/var/lib/longhorn/replicas/pvc-04a60ec8-16f3-400d-841d-6f2911e23630-25d075e0, size 42949672960/512"
time="2020-05-06T21:08:10Z" level=info msg="Replica server starts to snapshot [1a6e785c-e0e8-4f4b-b929-e66b11f6a88e] volume, user created false, created time 2020-05-06T21:08:10Z, labels map[]"
time="2020-05-06T21:08:10Z" level=info msg="Sending file volume-snap-1a6e785c-e0e8-4f4b-b929-e66b11f6a88e.img to 10.244.11.226:10259"
time="2020-05-06T21:08:10Z" level=info msg="source file size: 42949672960, setting up directIo: true"
time="2020-05-06T21:40:07Z" level=warning msg="Received signal interrupt to shutdown"
time="2020-05-06T21:40:07Z" level=warning msg="Starting to execute registered shutdown func github.com/longhorn/longhorn-engine/app/cmd.startReplica.func4"
Environment:
- Longhorn version: 0.8.0
- Kubernetes version: 1.16.0
- Node OS type and version: Alpine Linux 3.10
About this issue
- State: closed
- Created 4 years ago
- Comments: 36 (36 by maintainers)
@dmayle Understood. That would require automatically throttling when CPU becomes a bottleneck. In fact, you remind me of one thing: can you check and reduce node.session.queue_depth in /etc/iscsi/iscsid.conf on each node? It will lower bandwidth, of course, but it will also limit the number of in-flight requests to the backend, which might fix the issue.
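A sketch of the check-and-reduce step suggested above. To keep it self-contained it operates on a temporary copy; on a real node you would edit /etc/iscsi/iscsid.conf directly and re-login the iSCSI sessions for the change to take effect. The starting value of 32 and the lowered value of 16 are illustrative assumptions, not values from this thread.

```shell
# Work on a sample file so the transform is demonstrable anywhere;
# substitute /etc/iscsi/iscsid.conf on an actual node.
conf=$(mktemp)
printf 'node.session.queue_depth = 32\n' > "$conf"   # 32 is assumed here; check your nodes

# Inspect the current per-LUN queue depth:
grep '^node.session.queue_depth' "$conf"

# Reduce it (16 is an illustrative value):
sed -i 's/^node.session.queue_depth = .*/node.session.queue_depth = 16/' "$conf"
cat "$conf"

rm -f "$conf"
```

Lowering the queue depth caps how many SCSI commands the initiator keeps outstanding per LUN, which is why it trades peak bandwidth for less pressure on the replica backend.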