serving: bazel GPU build error with fatal error: external/nccl_archive/src/nccl.h: No such file or directory

We are trying to build TensorFlow Serving 0.5.1 with TensorFlow 1.0.0@07bb8ea.

The environment is CUDA 7.5, cuDNN 5, and Bazel 0.4.4.

cd serving && bazel build -c opt --config=cuda tensorflow_serving/...
ERROR: /root/.cache/bazel/_bazel_root/f8d1071c69ea316497c31e40fe01608c/external/org_tensorflow/tensorflow/contrib/nccl/BUILD:23:1: C++ compilation of rule '@org_tensorflow//tensorflow/contrib/nccl:python/ops/_nccl_ops.so' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE '-D_FORTIFY_SOURCE=1' -fstack-protector -fPIE -Wall -Wunused-but-set-parameter ... (remaining 76 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
In file included from external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.cc:15:0:
external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_manager.h:23:44: fatal error: external/nccl_archive/src/nccl.h: No such file or directory
 #include "external/nccl_archive/src/nccl.h"
                                            ^
compilation terminated.
INFO: Elapsed time: 147.378s, Critical Path: 107.11s

I’m able to find nccl.h on disk, but it can’t be found during the bazel build. Any suggestions? Thanks in advance.

find / -name nccl.h
/root/.cache/bazel/_bazel_root/5071e8dca1385fb776f72b33971bf157/external/nccl_archive/src/nccl.h
/root/.cache/bazel/_bazel_root/f8d1071c69ea316497c31e40fe01608c/external/nccl_archive/src/nccl.h

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 44 (4 by maintainers)

Most upvoted comments

git clone https://github.com/NVIDIA/nccl.git
cd nccl/
make CUDA_HOME=/usr/local/cuda
sudo make install
sudo mkdir -p /usr/local/include/external/nccl_archive/src
sudo ln -s /usr/local/include/nccl.h /usr/local/include/external/nccl_archive/src/nccl.h
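
The symlink step in the commands above can be wrapped so that it is safe to re-run and fails early when the header is missing. This is only a minimal sketch, assuming `make install` placed nccl.h under /usr/local/include as in those commands; the function name `link_nccl_header` and its prefix argument are illustrative and not part of NCCL or TensorFlow.

```shell
# Hypothetical helper: recreate the path that the TensorFlow source includes
# ("external/nccl_archive/src/nccl.h") underneath a given include prefix.
link_nccl_header() {
  prefix="${1:-/usr/local/include}"   # assumed install location of nccl.h
  if [ ! -f "$prefix/nccl.h" ]; then
    echo "nccl.h not found in $prefix" >&2
    return 1
  fi
  mkdir -p "$prefix/external/nccl_archive/src"
  # -f replaces an existing link, so the helper can be run more than once
  ln -sf "$prefix/nccl.h" "$prefix/external/nccl_archive/src/nccl.h"
}

# Usage (needs write access to the prefix, e.g. run as root):
#   link_nccl_header /usr/local/include
```

Taking the prefix as an argument keeps the helper usable when NCCL was installed somewhere other than /usr/local.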

Hi, @perdasilva

I have successfully compiled TensorFlow 1.8 with NCCL2. The problem is that if you installed NCCL on your system from the deb package, the package contents are split across different locations:

  • lib content goes to the /usr/lib/x86_64-linux-gnu/ folder
  • include content goes to the /usr/include/ folder
  • NCCL-SLA (software license agreement) goes somewhere in Python’s site-packages folder

However, the TensorFlow configuration expects a single root path for all of this content, which is why the compilation fails.

To solve this you can:

  1. Create a folder and put symlinks in it that reproduce exactly this structure (for example, for the latest version, 2.1.15):
  nccl2 (or any name you like)
   ├── include
   │     └── nccl.h
   ├── lib
   │     ├── libnccl.so -> libnccl.so.2*
   │     ├── libnccl.so.2 -> libnccl.so.2.1.15*
   │     ├── libnccl.so.2.1.15*
   │     └── libnccl_static.a
   ├── NCCL-SLA.txt
   └── COPYRIGHT.txt
  2. Or download one of these packages, according to your CUDA version, and extract it somewhere. You can easily find them on the web:
  • nccl_2.1.15-1+cuda8.0_x86_64.txz
  • nccl_2.1.15-1+cuda9.0_x86_64.txz
  • nccl_2.1.15-1+cuda9.1_x86_64.txz
  3. Point to that path during the ./configure process when asked, or set the environment variables for it beforehand and it will not ask (/usr/local/nccl2 is my preferred path):
  export TF_NCCL_VERSION='2.1.15'
  export NCCL_INSTALL_PATH=/usr/local/nccl2
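
The symlink layout from the first option can be scripted. What follows is a rough sketch rather than a tested recipe: the function name `make_nccl_root` and its three arguments are made up for illustration, and the example directories are the deb locations listed earlier in this comment.

```shell
# Hypothetical helper: build a single NCCL root out of symlinks to the
# places a deb install scatters its files. All three arguments are
# assumptions to adjust for your system.
make_nccl_root() {
  root="$1"      # e.g. /usr/local/nccl2
  inc_dir="$2"   # e.g. /usr/include
  lib_dir="$3"   # e.g. /usr/lib/x86_64-linux-gnu
  if [ ! -f "$inc_dir/nccl.h" ]; then
    echo "no nccl.h in $inc_dir" >&2
    return 1
  fi
  mkdir -p "$root/include" "$root/lib"
  ln -sf "$inc_dir/nccl.h" "$root/include/nccl.h"
  # Link every libnccl file (shared objects and the static archive)
  for f in "$lib_dir"/libnccl*; do
    [ -e "$f" ] || continue   # the glob matched nothing
    ln -sf "$f" "$root/lib/$(basename "$f")"
  done
}

# Usage (needs write access to the root, e.g. run as root):
#   make_nccl_root /usr/local/nccl2 /usr/include /usr/lib/x86_64-linux-gnu
#   export TF_NCCL_VERSION='2.1.15'
#   export NCCL_INSTALL_PATH=/usr/local/nccl2
```

The license files shown in the tree (NCCL-SLA.txt, COPYRIGHT.txt) are not handled here; if your configure step expects them, they can be linked into the root the same way.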

To get around it, you can comment out the dependency on nccl in tensorflow/tensorflow/contrib/BUILD

Line 42 iirc

@skonto removing the prefix /external/nccl_archive in the files nccl_ops.cc and nccl_manager.h, which are in the folder tensorflow/tensorflow/contrib/nccl/kernels, fixes the issue.
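
The edit described in this comment could be scripted roughly as follows. This is a sketch under assumptions: the helper name `strip_nccl_prefix` is invented here, and it assumes the include line looks like the one in the error log (`#include "external/nccl_archive/src/nccl.h"`). Whether the shortened include then resolves depends on the include paths of your build, which the comment does not spell out.

```shell
# Hypothetical helper: rewrite the NCCL include in one source file,
# dropping the "external/nccl_archive/" prefix.
strip_nccl_prefix() {
  file="$1"
  # Write to a temp file and move it back; this avoids `sed -i`, whose
  # syntax differs between GNU and BSD sed.
  sed 's|#include "external/nccl_archive/|#include "|' "$file" > "$file.tmp" &&
    mv "$file.tmp" "$file"
}

# Usage, from the directory that holds the tensorflow checkout:
#   cd tensorflow/tensorflow/contrib/nccl/kernels
#   strip_nccl_prefix nccl_manager.h
#   strip_nccl_prefix nccl_ops.cc
```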

NVIDIA changes the locations of its packages from time to time (because they think it’s funny) 😃 If you investigate a little, depending on your CUDA version some files go to some places and others go elsewhere… I believe NVIDIA doesn’t have a stable idea of exactly where to put these things, and TensorFlow cannot follow them into that hell.

I solved it by removing the prefix /external/nccl_archive.

65: "//tensorflow/contrib/nccl:nccl_py",

I believe…
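
Since the line number seems to differ between versions (42 and 65 are both mentioned in this thread), matching on the dependency label may be more robust than editing by line number. A minimal sketch, with a made-up helper name `comment_out_nccl_dep` and assuming the entry appears on its own line exactly as quoted above:

```shell
# Hypothetical helper: comment out the nccl_py dependency line in a BUILD file.
comment_out_nccl_dep() {
  build_file="$1"
  # Prefix the matching line with "# " while keeping its indentation.
  # Lines that are already commented do not match, so re-running is harmless.
  sed 's|^\([[:space:]]*\)\("//tensorflow/contrib/nccl:nccl_py",\)|\1# \2|' \
    "$build_file" > "$build_file.tmp" && mv "$build_file.tmp" "$build_file"
}

# Usage, from the directory that holds the tensorflow checkout:
#   comment_out_nccl_dep tensorflow/tensorflow/contrib/BUILD
```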

It seems there’s now a --config=nonccl option you can add to a bazel command, e.g. bazel build --config=opt --config=cuda --config=nonccl //tensorflow/tools/pip_package:build_pip_package (dunno if this will work entirely, but it seems to get me past this error …)

Thanks @jlertle.

Thanks, @jlertle

We don’t have any official support for macOS and nccl builds currently, though feel free to file a new issue specifically for macOS, we welcome any community support here!

Wouldn’t the right approach be for TensorFlow to just look in the standard directories? /usr/include/ is the place for header files on Linux; I don’t get why it looks somewhere else.