tensorflow: Build fails if Nvidia nccl doc files (NCCL-SLA.txt) are relocated

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): NO
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Slackware 14.2+ (-current)
TensorFlow installed from (source or binary): Source
TensorFlow version (use command below):
Python version: 2.7.15
Bazel version (if compiling from source): 0.13.1- (@non-git)
GCC/Compiler version (if compiling from source): gcc (GCC) 7.3.0
CUDA/cuDNN version: CUDA 9.2/cuDNN 7.1
GPU model and memory: Titan X Pascal 16 GB
Exact command to reproduce: bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the problem

The TensorFlow build process with NCCL enabled is too "picky" about the location of the Nvidia NCCL doc file(s), e.g. NCCL-SLA.txt. It expects to find the text file(s) in the root of the NCCL install dir (in my case /opt/nvidia/nccl), and the build fails if I relocate them, e.g. to a doc dir inside the NCCL install dir (/opt/nvidia/nccl/doc).

It would be great if the build process also looked for the file(s) in subdirs of the NCCL install directory. This would also make it possible to install NCCL in a prefix such as /usr/ and put the docs in /usr/doc. Not a big deal though, considering there are always more important issues to worry about.
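
The lookup requested here could be sketched roughly as follows. This is not how TensorFlow's configure step is implemented; find_nccl_doc is a hypothetical helper, and the search order (install root first, then subdirectories) is an assumption made for illustration.

```python
import os


def find_nccl_doc(install_dir, filename="NCCL-SLA.txt"):
    """Return the path to `filename` under `install_dir`, or None.

    Hypothetical helper: checks the install root first (where the build
    looks for the file, per this report), then walks the subdirectories.
    """
    root_candidate = os.path.join(install_dir, filename)
    if os.path.isfile(root_candidate):
        return root_candidate
    for dirpath, dirnames, filenames in os.walk(install_dir):
        dirnames.sort()  # keep the traversal order deterministic
        if filename in filenames:
            return os.path.join(dirpath, filename)
    return None
```

With the layout described in this report, find_nccl_doc("/opt/nvidia/nccl") would be expected to return the copy under /opt/nvidia/nccl/doc once the file has been moved there.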

Source code / logs

ERROR: missing input file '@local_config_nccl//:nccl/NCCL-SLA.txt'
ERROR: /usr/local/src/tensorflow/tensorflow-git/tensorflow/tools/pip_package/BUILD:167:1: //tensorflow/tools/pip_package:build_pip_package: missing input file '@local_config_nccl//:nccl/NCCL-SLA.txt'
Target //tensorflow/tools/pip_package:build_pip_package failed to build

About this issue

State: closed
Created 6 years ago
Reactions: 1
Comments: 20 (1 by maintainers)

Most upvoted comments

In my case, there was no NCCL-SLA.txt file. Instead, there was a LICENCE.txt file in /usr/local/cuda-9.2/targets/x86_64-linux. I created an identical copy of it in the same directory under the name NCCL-SLA.txt, and that solved the problem: I managed to compile tensorflow with cuda-9.2.
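
This workaround amounts to a single file copy. A minimal sketch, assuming the file names from this comment; copy_license_as_sla is a hypothetical helper name, and writing under the CUDA directory will usually require root privileges.

```python
import os
import shutil


def copy_license_as_sla(directory, source_name="LICENCE.txt",
                        target_name="NCCL-SLA.txt"):
    """Copy source_name to target_name inside directory.

    Hypothetical helper: an existing target file is left untouched.
    Returns the path of the target file.
    """
    source = os.path.join(directory, source_name)
    target = os.path.join(directory, target_name)
    if not os.path.exists(target):
        shutil.copyfile(source, target)
    return target

# Example call, using the directory named in this comment:
# copy_license_as_sla("/usr/local/cuda-9.2/targets/x86_64-linux")
```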

+19

yrefanid on Sep 30, 2018

And for me on Manjaro, it must be in /opt/cuda.

dev-michael-schmidt on Jun 14, 2018

@jimfcarroll everything works. There is only one minor addition: when running ./configure, the NCCL directory should be changed at this prompt:

Please specify the location where NCCL 2 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: /usr/local/cuda/nccl

madcat1991 on Aug 3, 2018

In TensorFlow version 1.9.0, set_tf_nccl_install_path in configure.py isn't robust enough to handle NCCL installs under CUDA 9.2 standards. NCCL (version 2.x.x) is installed in standard system locations. That is, nccl.h is in /usr/include and the shared library is in /usr/lib/x86_64-linux-gnu/. However, the Python configure code expects NCCL to have its own install directory with lib and include subdirectories.

My fix for this was to create an NCCL directory under the CUDA root install (in my case, /usr/local/cuda-9.2) and add symbolic links to /usr/include and /usr/lib/x86_64-linux-gnu/ (called lib in the latter case).

EDIT: I should note that I also STILL need to place a copy of NCCL-SLA.txt in the /usr/local/cuda-9.2/nccl directory.
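
The manual steps above could be scripted along these lines. This is only a sketch: build_nccl_prefix is a hypothetical helper, the link names include and lib follow the layout the configure code is said to expect, and the location of the original NCCL-SLA.txt is left as a parameter because it depends on how NCCL was packaged.

```python
import os
import shutil


def build_nccl_prefix(prefix, include_dir, lib_dir, license_file):
    """Create prefix/include and prefix/lib symlinks plus a license copy.

    Hypothetical helper mirroring the manual workaround: the symlinks
    point at the system include and library directories, and the license
    file is copied into the prefix root as NCCL-SLA.txt.
    """
    if not os.path.isdir(prefix):
        os.makedirs(prefix)
    for name, target in (("include", include_dir), ("lib", lib_dir)):
        link = os.path.join(prefix, name)
        if not os.path.lexists(link):  # do not clobber an existing entry
            os.symlink(target, link)
    shutil.copyfile(license_file, os.path.join(prefix, "NCCL-SLA.txt"))
    return prefix

# Example call with the paths from this comment (license path is a placeholder):
# build_nccl_prefix("/usr/local/cuda-9.2/nccl", "/usr/include",
#                   "/usr/lib/x86_64-linux-gnu", "/path/to/NCCL-SLA.txt")
```

The resulting directory is then the one to enter at the NCCL location prompt of ./configure.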

jimfcarroll on Jul 27, 2018

It seems to work for me after copying NCCL-SLA.txt from /usr/share/doc/libnccl2/ to /usr, where I installed nccl2.

ref: https://github.com/tensorflow/serving/issues/327

shenwei356 on Jun 19, 2018