alluxio: DistributedLoad fails in k8s env

Alluxio version: v2.7.2

In a k8s environment where Alluxio runs in Docker containers with 1 master and 1 worker, a distributed load on a 1.9 GB directory caches only 8 MB.

worker log: worker.log
job worker log: job_worker.log

This problem reproduces consistently in Fluid with Alluxio 2.7.2. Alluxio 2.7.0 does not have the issue.

In the worker log, we found:

2022-01-27 00:59:20,564 WARN  CacheRequestManager - Failed to async cache block 9110028288 from remote worker (ip-10-0-5-96.ec2.internal/10.0.5.96:20088) on copying the block: alluxio.exception.status.DeadlineExceededException: Timeout waiting for response after 300000ms. clientClosed: false clientCancelled: false serverClosed: false (Zero Copy GrpcDataReader)

The worker is sending the async cache request to itself and then trying to read the block from itself. See https://github.com/Alluxio/alluxio/blob/d3e231a02ea4ef1415e623cfbc742c5f69e8ba8c/core/server/worker/src/main/java/alluxio/worker/block/CacheRequestManager.java#L225

Looking at the commits between 2.7.0 (good version) and 2.7.2 (bad version), we found a possible culprit: https://github.com/Alluxio/alluxio/commit/e765c8436d36aaf911607f2fb4b772c52ac82f0a, which carries this code comment:

    // issues#11172: If the worker is in a container, use the container hostname
    // to establish the connection.
    if (!dataSource.getContainerHost().equals("")) {
      host = dataSource.getContainerHost();
    }

https://github.com/Alluxio/alluxio/blob/d3e231a02ea4ef1415e623cfbc742c5f69e8ba8c/job/server/src/main/java/alluxio/job/util/JobUtils.java#L215
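To make the effect of that branch easier to reason about, here is a minimal, self-contained sketch of the host selection it performs. `WorkerAddressSketch` and `HostSelectionSketch` are hypothetical stand-ins written for this illustration, not Alluxio classes, and the fallback to `getHost()` is an assumption about what `host` holds before the branch runs. Only the `getContainerHost()` check is taken from the snippet above.

```java
import java.util.Objects;

// Hypothetical stand-in for the worker address type used in the snippet above.
final class WorkerAddressSketch {
  private final String host;
  private final String containerHost;

  WorkerAddressSketch(String host, String containerHost) {
    this.host = Objects.requireNonNull(host);
    this.containerHost = Objects.requireNonNull(containerHost);
  }

  String getHost() {
    return host;
  }

  String getContainerHost() {
    return containerHost;
  }
}

final class HostSelectionSketch {
  // Mirrors the branch from the commit: prefer the container hostname when set.
  static String pickHost(WorkerAddressSketch dataSource) {
    // Assumption: the default is the regular host of the data source.
    String host = dataSource.getHost();
    if (!dataSource.getContainerHost().equals("")) {
      host = dataSource.getContainerHost();
    }
    return host;
  }

  public static void main(String[] args) {
    // Hypothetical addresses, chosen only for illustration.
    WorkerAddressSketch bareMetal = new WorkerAddressSketch("10.0.5.96", "");
    WorkerAddressSketch inContainer = new WorkerAddressSketch("10.0.5.96", "worker-pod-0");
    System.out.println(pickHost(bareMetal));
    System.out.println(pickHost(inContainer));
  }
}
```

With an empty container host the regular host is kept. Once a container host is set, every connection to that worker goes through the container hostname, including the case where the worker is connecting to itself.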


About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

@ZhuTopher @jja725 The root cause is that Fluid sets alluxio.job.worker.threadpool.size to 164 by default. The 164 threads seem to flood the connections, which then error out with "cannot connect to remote block worker". I set this property to 10 in Fluid, which is the default value in the Alluxio docs, and the errors no longer occur. @apc999 @yuzhu FYI
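As a rough illustration of why the pool size matters, the sketch below measures the peak number of tasks that run at the same time on a fixed-size thread pool. It is a generic `java.util.concurrent` example, not Alluxio's job worker code: `PoolSizeSketch` and `peakConcurrency` are hypothetical names, and the short sleep only stands in for a block read. The point is that the pool size is an upper bound on how many requests can hit the block worker at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class PoolSizeSketch {
  // Runs taskCount short tasks on a pool of poolSize threads and returns
  // the highest number of tasks observed in flight at the same moment.
  static int peakConcurrency(int poolSize, int taskCount) {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    AtomicInteger inFlight = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    List<Future<?>> futures = new ArrayList<>();
    try {
      for (int i = 0; i < taskCount; i++) {
        futures.add(pool.submit(() -> {
          int now = inFlight.incrementAndGet();
          peak.accumulateAndGet(now, Math::max);
          try {
            Thread.sleep(5); // stand-in for one block read
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          } finally {
            inFlight.decrementAndGet();
          }
        }));
      }
      for (Future<?> f : futures) {
        f.get(); // wait for every task to finish
      }
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
    return peak.get();
  }

  public static void main(String[] args) {
    System.out.println("peak with pool of 10:  " + peakConcurrency(10, 500));
    System.out.println("peak with pool of 164: " + peakConcurrency(164, 500));
  }
}
```

The peak can never exceed the pool size, so a pool of 10 caps the load at 10 concurrent requests, while a pool of 164 allows up to 164 against a single worker.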