tensorflow-upstream: Hang on building tensorflow rccl packages

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NA
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: master branch of this repo
  • Python version: 3.5
  • Installed using virtualenv? pip? conda?: NA
  • Bazel version (if compiling from source): 1.19.2
  • GCC/Compiler version (if compiling from source): gcc-5
  • CUDA/cuDNN version: NA
  • GPU model and memory: NA

Describe the problem

Hi, I am building the tensorflow-upstream source code inside the ROCm docker image on a machine without any GPU. All ROCm libraries (rocBLAS, etc.) are installed inside the container.

But when I build TensorFlow inside this container, the following build step hangs and never completes:

[2,905 / 9,076] 96 actions, 9 running
    Linking external/nasm/nasm [for host]; 1599s local
    Linking external/protobuf_archive/python/google/protobuf/internal/_api_implementation.so [for host]; 1599s local
    Linking tensorflow/python/framework/fast_tensor_util.so [for host]; 1598s local
    Compiling external/rccl_archive/src/rcclTracker.cpp [for host]; 1598s local
    Compiling external/rccl_archive/src/rcclReduce.cpp [for host]; 1598s local
    Compiling external/rccl_archive/src/rcclAllReduce.cpp [for host]; 1598s local
    Compiling external/rccl_archive/src/rcclAllGather.cpp [for host]; 1598s local
    Compiling external/rccl_archive/src/rcclBcast.cpp [for host]; 1598s local ...

How can I get past this?

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 20

Most upvoted comments

@ghostplant, there’s a config file inside the docker image: /opt/rocm/bin/target.lst. Edit that file if you want to add more GPU targets for your build. The current content of the file:

/root# cat /opt/rocm/bin/target.lst
gfx803
gfx900
gfx906
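
For anyone who wants to script that edit, here is a minimal sketch of appending a target to the list only if it is not already there. The helper name add_target is made up for illustration, and gfx908 is only a placeholder target (substitute the one for your GPU); the only thing taken from this thread is the file path and its one-target-per-line layout.

```shell
#!/bin/sh
# Hypothetical helper (not part of ROCm): append a GPU target to a
# target list file unless that exact line is already present.
# Usage: add_target <file> <target>
add_target() {
    file=$1
    target=$2
    # -F: fixed string, -x: match the whole line, -q: no output
    if grep -qxF "$target" "$file" 2>/dev/null; then
        echo "$target already listed in $file"
    else
        echo "$target" >> "$file"
        echo "added $target to $file"
    fi
}

# Example, run as root inside the container (gfx908 is a placeholder):
# add_target /opt/rocm/bin/target.lst gfx908
# cat /opt/rocm/bin/target.lst
```

The check matches whole lines, so running it twice does not duplicate an entry. It assumes the file ends with a newline, as the cat output above suggests.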

@ghostplant, the tip of the r1.12-rocm branch already has that workaround applied. Please try it with the recommended docker image with the dev environment.