tensorflow: Build fails if Nvidia nccl doc files (NCCL-SLA.txt) are relocated
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): NO
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Slackware 14.2+ (-current)
- TensorFlow installed from (source or binary): Source
- TensorFlow version (use command below):
- Python version: 2.7.15
- Bazel version (if compiling from source): 0.13.1- (@non-git)
- GCC/Compiler version (if compiling from source): gcc (GCC) 7.3.0
- CUDA/cuDNN version: CUDA 9.2/cuDNN 7.1
- GPU model and memory: Titan X Pascal 16 GB
- Exact command to reproduce: bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
Describe the problem
The TensorFlow build process with NCCL enabled is too "picky" about the location of the Nvidia NCCL doc file(s), for example NCCL-SLA.txt.
It expects to find the text file(s) in the root of the NCCL install directory (in my case /opt/nvidia/nccl), and the build fails if I relocate them, for example to a doc directory inside the NCCL install directory (/opt/nvidia/nccl/doc).
It would be great if the build process also looked for the file(s) in subdirectories of the NCCL install directory. This would also make it possible to install NCCL under a prefix such as /usr/ and put the docs in /usr/doc. Not a big deal though, considering there are always more important issues to worry about.
Source code / logs
ERROR: missing input file '@local_config_nccl//:nccl/NCCL-SLA.txt'
ERROR: /usr/local/src/tensorflow/tensorflow-git/tensorflow/tools/pip_package/BUILD:167:1: //tensorflow/tools/pip_package:build_pip_package: missing input file '@local_config_nccl//:nccl/NCCL-SLA.txt'
Target //tensorflow/tools/pip_package:build_pip_package failed to build
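The error above only names the missing file, not where it has ended up. A minimal sketch of a pre-build check follows, assuming a POSIX shell with `find` available. The function name `check_nccl_sla` is hypothetical and not part of TensorFlow or NCCL. It only reports whether NCCL-SLA.txt is in the root of the given install directory and, if not, lists copies found in subdirectories.

```shell
#!/bin/sh
# Hypothetical helper, not part of TensorFlow: report where NCCL-SLA.txt
# sits relative to an NCCL install directory before starting a build.
check_nccl_sla() {
    nccl_root=$1   # e.g. /opt/nvidia/nccl, as in the report above

    if [ -f "$nccl_root/NCCL-SLA.txt" ]; then
        echo "ok: $nccl_root/NCCL-SLA.txt"
        return 0
    fi

    echo "missing: $nccl_root/NCCL-SLA.txt" >&2
    # List copies in subdirectories (for example doc/) that could be
    # copied back into the root of the install directory.
    find "$nccl_root" -type f -name 'NCCL-SLA.txt' 2>/dev/null
    return 1
}

# Example call (path taken from the report above):
#   check_nccl_sla /opt/nvidia/nccl
```

A non-zero return status means the file is not in the root of the install directory, which is where the report says the build expects it.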
About this issue
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 20 (1 by maintainers)
In my case, there was no NCCL-SLA.txt file. Instead, there was a LICENCE.txt file in /usr/local/cuda-9.2/targets/x86_64-linux. I created an identical copy of it in the same directory under the name NCCL-SLA.txt, and that solved the problem: I managed to compile TensorFlow with CUDA 9.2.
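The copy described in this comment could be scripted roughly as below. This is only a sketch of that one workaround. The function name `ensure_nccl_sla` is made up for illustration, and the file names are the ones the commenter reports, which may differ on other installs.

```shell
#!/bin/sh
# Hypothetical helper: if NCCL-SLA.txt is absent but LICENCE.txt exists in
# the same directory, copy the licence file under the expected name.
ensure_nccl_sla() {
    dir=$1   # e.g. /usr/local/cuda-9.2/targets/x86_64-linux (from the comment)

    # Nothing to do if the file is already there.
    [ -f "$dir/NCCL-SLA.txt" ] && return 0

    if [ -f "$dir/LICENCE.txt" ]; then
        cp "$dir/LICENCE.txt" "$dir/NCCL-SLA.txt"
    else
        echo "no LICENCE.txt found in $dir" >&2
        return 1
    fi
}

# Example call (may need root privileges for a system CUDA directory):
#   ensure_nccl_sla /usr/local/cuda-9.2/targets/x86_64-linux
```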
And for me on Manjaro, it must be in /opt/cuda.
@jimfcarroll everything works. There is only one minor addition: while running ./configure, the nccl directory should be changed.
Please specify the location where NCCL 2 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: /usr/local/cuda/nccl
In TensorFlow version 1.9.0, set_tf_nccl_install_path in configure.py isn't robust enough to handle NCCL installs under CUDA 9.2 standards. NCCL (version 2.x.x) is installed in standard system locations. That is, nccl.h is in /usr/include and the shared library is in /usr/lib/x86_64-linux-gnu/. However, the Python configure code expects NCCL to have its own install directory with lib and include subdirectories.
My fix for this was to create an NCCL directory under the CUDA root install (in my case, /usr/local/cuda-9.2) and add symbolic links to /usr/include and /usr/lib/x86_64-linux-gnu/ (called lib in this latter case).
EDIT: I should note that I also STILL need to put a copy of NCCL-SLA.txt into the /usr/local/cuda-9.2/nccl directory.
It seems to work for me after copying NCCL-SLA.txt from /usr/share/doc/libnccl2/ to /usr, where I installed nccl2.
ref: https://github.com/tensorflow/serving/issues/327
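The symlink workaround and the NCCL-SLA.txt copy from the last two comments can be combined roughly as follows. This is a sketch under stated assumptions, not a supported procedure. The function name `make_nccl_prefix` is hypothetical, all four paths are supplied by the caller, and `ln -sfn` relies on the `-n` option found in GNU and BSD `ln`.

```shell
#!/bin/sh
# Hypothetical helper: build a self-contained NCCL "install directory"
# (include/, lib/, NCCL-SLA.txt) out of a system-wide NCCL install.
make_nccl_prefix() {
    nccl_dir=$1      # directory to create, e.g. /usr/local/cuda-9.2/nccl
    include_dir=$2   # directory holding nccl.h, e.g. /usr/include
    lib_dir=$3       # directory holding the shared library,
                     # e.g. /usr/lib/x86_64-linux-gnu
    sla_file=$4      # e.g. /usr/share/doc/libnccl2/NCCL-SLA.txt

    mkdir -p "$nccl_dir" || return 1
    # Symbolic links named include and lib inside the new directory.
    ln -sfn "$include_dir" "$nccl_dir/include" || return 1
    ln -sfn "$lib_dir" "$nccl_dir/lib" || return 1
    # The licence file still has to sit in the root of the directory.
    cp "$sla_file" "$nccl_dir/NCCL-SLA.txt" || return 1
}

# Example call using the paths mentioned in the comments above:
#   make_nccl_prefix /usr/local/cuda-9.2/nccl /usr/include \
#       /usr/lib/x86_64-linux-gnu /usr/share/doc/libnccl2/NCCL-SLA.txt
```

The resulting directory would then be the one given to ./configure when it asks where the NCCL 2 library is installed.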