tensorflow-upstream: Hang on building tensorflow rccl packages
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NA
- TensorFlow installed from (source or binary): source
- TensorFlow version: master branch of this repo
- Python version: 3.5
- Installed using virtualenv? pip? conda?: NA
- Bazel version (if compiling from source): 1.19.2
- GCC/Compiler version (if compiling from source): gcc-5
- CUDA/cuDNN version: NA
- GPU model and memory: NA
Describe the problem
Hi, I am building the tensorflow-upstream source code inside the ROCm docker image on a machine without any GPU. All ROCm libraries (rocBLAS, etc.) are installed inside the container.
But when I build TensorFlow inside this container, the following build step hangs and never completes:
[2,905 / 9,076] 96 actions, 9 running
Linking external/nasm/nasm [for host]; 1599s local
Linking external/protobuf_archive/python/google/protobuf/internal/_api_implementation.so [for host]; 1599s local
Linking tensorflow/python/framework/fast_tensor_util.so [for host]; 1598s local
Compiling external/rccl_archive/src/rcclTracker.cpp [for host]; 1598s local
Compiling external/rccl_archive/src/rcclReduce.cpp [for host]; 1598s local
Compiling external/rccl_archive/src/rcclAllReduce.cpp [for host]; 1598s local
Compiling external/rccl_archive/src/rcclAllGather.cpp [for host]; 1598s local
Compiling external/rccl_archive/src/rcclBcast.cpp [for host]; 1598s local ...
How can I get past this?
About this issue
- State: closed
- Created 5 years ago
- Comments: 20
@ghostplant, there’s a config file inside the docker image; edit that file if you want to add more GPU targets for your build: /opt/rocm/bin/target.lst. The current content of the file:
@ghostplant, the tip of the r1.12-rocm branch already has that workaround applied. Please try it with the recommended docker image with the dev environment.
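For anyone scripting the target.lst edit mentioned above, here is a rough sketch of one way to do it. It assumes the file holds one GPU target name per line, which this thread does not confirm. The helper name `add_gpu_target` and the example target `gfx900` are purely illustrative and do not come from the ROCm tooling.

```python
from pathlib import Path


def add_gpu_target(target_file, target):
    """Append `target` to the file unless it is already listed.

    Assumption: the file stores one target name per line.
    Returns True if the file was modified, False otherwise.
    """
    path = Path(target_file)
    # Read existing entries, ignoring blank lines and surrounding whitespace.
    lines = path.read_text().splitlines() if path.exists() else []
    entries = [line.strip() for line in lines if line.strip()]
    if target in entries:
        return False
    entries.append(target)
    # Rewrite the file with one entry per line and a trailing newline.
    path.write_text("\n".join(entries) + "\n")
    return True


# Hypothetical usage inside the container (the target name is only an example):
# add_gpu_target("/opt/rocm/bin/target.lst", "gfx900")
```

The function is idempotent, so running it twice with the same target leaves the file unchanged. It may be worth backing up the original file before editing it.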