alluxio: DistributedLoad fails in k8s env
Alluxio version: v2.7.2
In a k8s environment where Alluxio runs in Docker containers with 1 master and 1 worker, a distributed load on a 1.9 GB directory caches only 8 MB.
Worker log: worker.log
Job worker log: job_worker.log
This problem can be reproduced consistently in Fluid with Alluxio 2.7.2. Alluxio 2.7.0 doesn’t have the issue.
In the worker log, we found:
2022-01-27 00:59:20,564 WARN CacheRequestManager - Failed to async cache block 9110028288 from remote worker (ip-10-0-5-96.ec2.internal/10.0.5.96:20088) on copying the block: alluxio.exception.status.DeadlineExceededException: Timeout waiting for response after 300000ms. clientClosed: false clientCancelled: false serverClosed: false (Zero Copy GrpcDataReader)
The worker is sending the async cache request to itself and then trying to read the block from itself. https://github.com/Alluxio/alluxio/blob/d3e231a02ea4ef1415e623cfbc742c5f69e8ba8c/core/server/worker/src/main/java/alluxio/worker/block/CacheRequestManager.java#L225
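A quick way to confirm this kind of self-request from a log is to pull the "remote worker" hostname out of the warning and compare it with the worker's own hostname. The following is only a minimal sketch: the class name, the regular expression, and the idea of passing the local hostname in by hand are assumptions for illustration, not part of Alluxio.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AsyncCacheLogCheck {
  // Matches "block <id> from remote worker (<hostname>/<ip>:<port>)",
  // the shape of the WARN line quoted above. Hypothetical helper, not an Alluxio API.
  private static final Pattern REMOTE_WORKER = Pattern.compile(
      "block (\\d+) from remote worker \\(([^/\\s]+)/([^:\\s]+):(\\d+)\\)");

  /** Returns the hostname of the "remote" worker named in the log line, if present. */
  public static Optional<String> remoteWorkerHost(String logLine) {
    Matcher m = REMOTE_WORKER.matcher(logLine);
    return m.find() ? Optional.of(m.group(2)) : Optional.empty();
  }

  /** True if the log line shows the worker fetching the block from itself. */
  public static boolean isSelfRequest(String logLine, String localHostname) {
    return remoteWorkerHost(logLine).map(localHostname::equalsIgnoreCase).orElse(false);
  }

  public static void main(String[] args) {
    String line = "WARN CacheRequestManager - Failed to async cache block 9110028288 "
        + "from remote worker (ip-10-0-5-96.ec2.internal/10.0.5.96:20088) on copying the block";
    // Assumption: the worker that wrote this log is ip-10-0-5-96.ec2.internal.
    System.out.println(remoteWorkerHost(line).orElse("<none>"));
    System.out.println(isSelfRequest(line, "ip-10-0-5-96.ec2.internal"));
  }
}
```

If the second line printed is true for the host that produced the log, the "remote" worker in the warning is the local one.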
Looking through the commits between 2.7.0 (good version) and 2.7.2 (bad version), we found a possible culprit: https://github.com/Alluxio/alluxio/commit/e765c8436d36aaf911607f2fb4b772c52ac82f0a which contains the following code and comments:
// issues#11172: If the worker is in a container, use the container hostname
// to establish the connection.
if (!dataSource.getContainerHost().equals("")) {
  host = dataSource.getContainerHost();
}
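To make the effect of that condition concrete, here is a small standalone sketch. The `Address` class and the hostnames are made up for illustration and are not the Alluxio types; only the `if` condition mirrors the quoted snippet. It shows that once a container hostname is set, the connection target switches to it.

```java
public class HostSelectionSketch {
  /** Hypothetical stand-in for the data source address; not the Alluxio class. */
  static final class Address {
    final String host;
    final String containerHost;

    Address(String host, String containerHost) {
      this.host = host;
      this.containerHost = containerHost;
    }
  }

  /** Mirrors the quoted condition: prefer a non-empty container hostname. */
  static String resolveHost(Address dataSource) {
    String host = dataSource.host;
    if (!dataSource.containerHost.equals("")) {
      host = dataSource.containerHost;
    }
    return host;
  }

  public static void main(String[] args) {
    // Made-up hostnames: a worker outside a container and one inside a pod.
    System.out.println(resolveHost(new Address("node-a", "")));             // node-a
    System.out.println(resolveHost(new Address("node-a", "worker-pod-0"))); // worker-pod-0
  }
}
```

Whether the container hostname is resolvable from the connecting side is a separate question that this sketch does not answer.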
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (15 by maintainers)
@ZhuTopher @jja725 The root cause is that Fluid by default sets alluxio.job.worker.threadpool.size to 164. The 164 threads seem to flood the connection, which then errors out with "cannot connect to remote block worker". I set this property to 10 in Fluid, which is the default value in the Alluxio docs, and the errors no longer occur. @apc999 @yuzhu FYI
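The effect of the pool size on the number of simultaneous requests can be illustrated with a plain Java fixed thread pool. This is a generic sketch, not the Alluxio job worker implementation: the class name, the 20 ms sleep standing in for an open connection, and the task count are all assumptions chosen for the demo.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadPoolFloodSketch {
  /** Runs simulated requests on a fixed pool and returns the peak number in flight. */
  static int peakConcurrency(int poolSize, int tasks) {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    AtomicInteger inFlight = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    for (int i = 0; i < tasks; i++) {
      pool.execute(() -> {
        int now = inFlight.incrementAndGet();
        peak.accumulateAndGet(now, Math::max); // record the highest concurrency seen
        try {
          Thread.sleep(20); // stand-in for holding a connection to a block worker
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          inFlight.decrementAndGet();
        }
      });
    }
    pool.shutdown();
    try {
      pool.awaitTermination(30, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return peak.get();
  }

  public static void main(String[] args) {
    // With 164 threads, up to 164 simulated connections can be open at once;
    // with 10 threads, the peak can never exceed 10.
    System.out.println("pool=164 peak=" + peakConcurrency(164, 164));
    System.out.println("pool=10  peak=" + peakConcurrency(10, 164));
  }
}
```

The peak can never exceed the pool size, so lowering the size from 164 to 10 caps how many requests hit the worker at the same time, which is consistent with the errors disappearing after the change.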