longhorn: [BUG] gRPC client needs to use a long timeout where the binary was previously executed without a timeout

We need to use a longer timeout, GRPCServiceLongTimeout = 24 * time.Hour, wherever the engine binary was previously executed with ExecuteEngineBinaryWithoutTimeout:

ReplicaAdd
ReplicaRebuildVerify
SnapshotPurge
SnapshotClone
SnapshotBackup
BackupRestore

The timeout is determined by https://github.com/longhorn/longhorn-engine/blob/0a69c6c3be34ffcd411d484774bbb992c12c93e0/pkg/sync/sync.go#L444 and should not exceed it.

https://github.com/longhorn/longhorn-instance-manager/pull/120#discussion_r893055231

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 3
  • Comments: 15 (14 by maintainers)

Most upvoted comments

Verified passed on master-head (with longhorn-instance-manager:v1_20220611) by:

(1) Create an 800GB volume.
(2) Set the Storage Minimal Available Percentage to 0 and write ~800GB of data to the volume.
(3) When the write has completed and the volume's actual size is ~800GB:
(3-1) delete a replica to trigger replica rebuilding
(3-2) create a snapshot
(3-3) create a backup

All of the above operations worked and completed without problems.

This is expected behavior. See https://github.com/longhorn/longhorn-manager/blob/master/scheduler/replica_scheduler.go#L611

Because the data written to each replica is 800GB, the space required for the failed replica is 800GB. The disk's free space is 862GB and the default MinimalAvailablePercentage is 25% (UI > Setting > General > Storage Minimal Available Percentage), so the available disk space is only 862GB × 75% = 646.5GB, which is less than the 800GB required. Thus, the error message shows "There's no available disk for replica...".

You can set the Storage Minimal Available Percentage to 0 to see if the rebuild can execute successfully.
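The arithmetic in this explanation can be sketched as follows. This only mirrors the numbers quoted in the comment (862GB free, 25% reserved, 800GB required); schedulableGB and fits are hypothetical names, and the real check in replica_scheduler.go may take more inputs into account.

```go
package main

import "fmt"

// schedulableGB (hypothetical) returns the space left for replicas after
// reserving minimalAvailablePct percent of the disk's free space.
func schedulableGB(freeGB, minimalAvailablePct float64) float64 {
	return freeGB * (1 - minimalAvailablePct/100)
}

// fits (hypothetical) reports whether a replica of requiredGB can be placed.
func fits(requiredGB, freeGB, minimalAvailablePct float64) bool {
	return requiredGB <= schedulableGB(freeGB, minimalAvailablePct)
}

func main() {
	fmt.Println(schedulableGB(862, 25)) // prints 646.5
	fmt.Println(fits(800, 862, 25))     // prints false: scheduling fails
	fmt.Println(fits(800, 862, 0))      // prints true: rebuild can be scheduled
}
```

Setting the percentage to 0 makes the full 862GB count as available, which is why the rebuild is expected to be schedulable in that case.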

cc @yangchiu

Is it possibly related to https://github.com/longhorn/longhorn-manager/commit/fedb7eb0303bbc509b995916b3f5460e519c4e2b?

It does not look related. The error message comes from the scheduler.

Updated: Confirmed. Not related to this PR.

I feel we can improve the proxy interface by using a naming convention to highlight which methods are async, so that it is clear the rest are synchronous.

Alternatively, in the future, every external API that involves a blocking operation should be async.

  • ReplicaRebuildVerify: In the proxy side implementation, there is no blocking call. But I am not sure if the locking would lead to any extra delay. In general, I don’t think we need to use a 24-hour timeout for this.
  • SnapshotPurge: This is an async call. There is no need to use a long timeout.
  • SnapshotBackup: This would take some time but should not take 24 hours. The most time-consuming operation for the backup is: https://github.com/longhorn/backupstore/blob/56ddc538b85950b02c37432e4854e74f2647ca61/deltablock.go#L185
  • BackupRestore: This is an async call. There is no need to use a long timeout.

The replica rebuilding test failed on master-head + longhorn-instance-manager:v1_20220611. After writing data and deleting a replica through the UI to trigger replica rebuilding, no replica rebuilding starts, and the volume is stuck in the Scheduling Failure state with a stopped replica.

Test steps:
(1) Set up a k3s cluster with c5d.12xlarge instances.
(2) On each instance, format and mount one of the SSD storage devices and add it to Longhorn.
(3) Create a volume and write a large amount of data:

for i in $(seq 1 724)
do
  exec dd if=/dev/zero of=/data/test-binary-${i} bs=1M count=1024 &
done

(4) After the write has completed, delete a replica through the UI to trigger replica rebuilding ==> no replica rebuilding starts: http://35.171.40.168:30007/#/volume/test-1

@yangchiu The fix is after RC3, so we need to verify it via the latest master build. Please help verify it again. Thanks.