LightGBM: R package install with GPU support fails

This used to work:

FROM nvidia/cuda:11.0-devel-ubuntu20.04

RUN apt-get update && \
    DEBIAN_FRONTEND="noninteractive" apt-get install -y software-properties-common apt-transport-https

RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
    add-apt-repository 'deb [arch=amd64] https://cran.rstudio.com/bin/linux/ubuntu focal-cran40/' && \
    apt-get update && \
    apt-get install -y r-base

RUN apt-get install -y git wget libcurl4-openssl-dev default-jdk-headless libssl-dev libxml2-dev cmake

ENV MAKE="make -j$(nproc)"

RUN R -e 'install.packages(c("R6","data.table","jsonlite"), repos = "https://cran.rstudio.com/")'

RUN apt-get install -y libboost-dev libboost-system-dev libboost-filesystem-dev ocl-icd-opencl-dev opencl-headers clinfo

RUN mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd   ## otherwise lightgbm segfaults at runtime (compiles fine without it)

RUN git clone --recursive https://github.com/microsoft/LightGBM && \
    cd LightGBM && \
    Rscript build_r.R --use-gpu

Now, I get this error:

Cloning into 'LightGBM'...
Submodule 'include/boost/compute' (https://github.com/boostorg/compute) registered for path 'compute'
Submodule 'eigen' (https://gitlab.com/libeigen/eigen.git) registered for path 'eigen'
Submodule 'external_libs/fast_double_parser' (https://github.com/lemire/fast_double_parser.git) registered for path 'external_libs/fast_double_parser'
Submodule 'external_libs/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'external_libs/fmt'
Cloning into '/LightGBM/compute'...
Cloning into '/LightGBM/eigen'...
Cloning into '/LightGBM/external_libs/fast_double_parser'...
Cloning into '/LightGBM/external_libs/fmt'...
Submodule path 'compute': checked out '36c89134d4013b2e5e45bc55656a18bd6141995a'
Submodule path 'eigen': checked out '8ba1b0f41a7950dc3e1d4ed75859e36c73311235'
Submodule path 'external_libs/fast_double_parser': checked out 'ace60646c02dc54c57f19d644e49a61e7e7758ec'
Submodule 'benchmark/dependencies/abseil-cpp' (https://github.com/abseil/abseil-cpp.git) registered for path 'external_libs/fast_double_parser/benchmarks/dependencies/abseil-cpp'
Submodule 'benchmark/dependencies/double-conversion' (https://github.com/google/double-conversion.git) registered for path 'external_libs/fast_double_parser/benchmarks/dependencies/double-conversion'
Cloning into '/LightGBM/external_libs/fast_double_parser/benchmarks/dependencies/abseil-cpp'...
Cloning into '/LightGBM/external_libs/fast_double_parser/benchmarks/dependencies/double-conversion'...
Submodule path 'external_libs/fast_double_parser/benchmarks/dependencies/abseil-cpp': checked out 'd936052d32a5b7ca08b0199a6724724aea432309'
Submodule path 'external_libs/fast_double_parser/benchmarks/dependencies/double-conversion': checked out 'f4cb2384efa55dee0e6652f8674b05763441ab09'
Submodule path 'external_libs/fmt': checked out 'cc09f1a6798c085c325569ef466bcdcffdc266d4'
* checking for file '/LightGBM/lightgbm_r/DESCRIPTION' ... OK
* preparing 'lightgbm':
* checking DESCRIPTION meta-information ... OK
* cleaning src
Warning in system2(command, args, stdout = NULL, stderr = NULL, ...) :
  error in running command
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
WARNING: directory 'lightgbm/src/compute/test' is empty
* looking to see if a 'data/datalist' file should be added
* building 'lightgbm_3.1.1.99.tar.gz'

* installing to library '/usr/local/lib/R/site-library'
* installing *source* package 'lightgbm' ...
** using staged installation
** libs
installing via 'install.libs.R' to /usr/local/lib/R/site-library/00LOCK-lightgbm/00new/lightgbm
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- R version passed into FindLibR.cmake: 4.0.3
-- Found LibR: /usr/lib/R
-- LIBR_EXECUTABLE: /usr/bin/R
-- LIBR_INCLUDE_DIRS: /usr/share/R/include
-- LIBR_CORE_LIBRARY: /usr/lib/R/lib/libR.so
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
  Could NOT find OpenCL (missing: OpenCL_LIBRARY) (found version "2.2")
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.16/Modules/FindOpenCL.cmake:150 (find_package_handle_standard_args)
  CMakeLists.txt:138 (find_package)


-- Configuring incomplete, errors occurred!
See also "/tmp/RtmpvcXiAX/R.INSTALL14755eba078/lightgbm/src/build/CMakeFiles/CMakeOutput.log".
Error in .run_shell_command("cmake", c(cmake_args, "..")) :
  Command failed with exit code: 1
* removing '/usr/local/lib/R/site-library/lightgbm'
Error in .run_shell_command(install_cmd, install_args) :
  Command failed with exit code: 1
Execution halted
The command '/bin/sh -c git clone --recursive https://github.com/microsoft/LightGBM &&     cd LightGBM &&     Rscript build_r.R --use-gpu' returned a non-zero code: 1

If I build the docker image with the last RUN entry commented out:

FROM nvidia/cuda:11.0-devel-ubuntu20.04

RUN apt-get update && \
    DEBIAN_FRONTEND="noninteractive" apt-get install -y software-properties-common apt-transport-https

RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
    add-apt-repository 'deb [arch=amd64] https://cran.rstudio.com/bin/linux/ubuntu focal-cran40/' && \
    apt-get update && \
    apt-get install -y r-base

RUN apt-get install -y git wget libcurl4-openssl-dev default-jdk-headless libssl-dev libxml2-dev cmake

ENV MAKE="make -j$(nproc)"

RUN R -e 'install.packages(c("R6","data.table","jsonlite"), repos = "https://cran.rstudio.com/")'

RUN apt-get install -y libboost-dev libboost-system-dev libboost-filesystem-dev ocl-icd-opencl-dev opencl-headers clinfo

RUN mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd   ## otherwise lightgbm segfaults at runtime (compiles fine without it)

#RUN git clone --recursive https://github.com/microsoft/LightGBM && \
#    cd LightGBM && \
#    Rscript build_r.R --use-gpu

with

sudo docker build -t gbmperf_gpu .

and then run it:

sudo nvidia-docker run --rm -ti gbmperf_gpu /bin/bash

then I can run things manually:

git clone --recursive https://github.com/microsoft/LightGBM && \
    cd LightGBM && \
    Rscript build_r.R --use-gpu

This gives the same error.

However, just compiling lightgbm (not the R package) seems fine:

git clone --recursive https://github.com/microsoft/LightGBM  && \
cd LightGBM  &&  mkdir build  &&  cd build  &&  cmake -DUSE_GPU=1 ..  &&  make -j4

as here:

...
Submodule path 'external_libs/fast_double_parser/benchmarks/dependencies/abseil-cpp': checked out 'd936052d32a5b7ca08b0199a6724724aea432309'
Submodule path 'external_libs/fast_double_parser/benchmarks/dependencies/double-conversion': checked out 'f4cb2384efa55dee0e6652f8674b05763441ab09'
Submodule path 'external_libs/fmt': checked out 'cc09f1a6798c085c325569ef466bcdcffdc266d4'
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
-- Found OpenCL: /usr/lib/x86_64-linux-gnu/libOpenCL.so (found version "2.2")
-- OpenCL include directory: /usr/include
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found suitable version "1.71.0", minimum required is "1.56.0") found components: filesystem system
-- Performing Test MM_PREFETCH
-- Performing Test MM_PREFETCH - Success
-- Using _mm_prefetch
-- Performing Test MM_MALLOC
-- Performing Test MM_MALLOC - Success
-- Using _mm_malloc
-- Configuring done
-- Generating done
-- Build files have been written to: /LightGBM/LightGBM/build
make[1]: warning: -j0 forced in submake: resetting jobserver mode.
Scanning dependencies of target lightgbm
Scanning dependencies of target _lightgbm
[  1%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/boosting.cpp.o
[  2%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/gbdt.cpp.o
[  4%] Building CXX object CMakeFiles/lightgbm.dir/src/boosting/gbdt.cpp.o
[  7%] Building CXX object CMakeFiles/lightgbm.dir/src/boosting/boosting.cpp.o
[  7%] Building CXX object CMakeFiles/lightgbm.dir/src/main.cpp.o
[  8%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/prediction_early_stop.cpp.o
[ 10%] Building CXX object CMakeFiles/lightgbm.dir/src/application/application.cpp.o
...

though I also see

/usr/include/CL/cl_version.h:34:104: note: #pragma message: cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 220 (OpenCL 2.2)
   34 | #pragma message("cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 220 (OpenCL 2.2)")

but it compiles anyway:

[ 98%] Linking CXX shared library ../lib_lightgbm.so
[100%] Linking CXX executable ../lightgbm
[100%] Built target _lightgbm
[100%] Built target lightgbm

So the problem seems to be specific to how the R package build looks for OpenCL(?) cc @jameslamb
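
As a minimal sketch of one way to narrow this down: the standalone CMake run above reports libOpenCL.so under /usr/lib/x86_64-linux-gnu, so it may be worth confirming that the file is present in the image before the R build runs. The helper name find_opencl_lib is hypothetical, and the second candidate directory is an assumption, not something taken from the logs.

```shell
#!/bin/sh
# Hypothetical helper: print the first libOpenCL.so found in the given directories.
find_opencl_lib() {
    for dir in "$@"; do
        if [ -e "$dir/libOpenCL.so" ]; then
            echo "$dir/libOpenCL.so"
            return 0
        fi
    done
    return 1
}

# /usr/lib/x86_64-linux-gnu is where the standalone CMake run found the library;
# /usr/local/cuda/lib64 is only an assumed extra location for the CUDA image.
find_opencl_lib /usr/lib/x86_64-linux-gnu /usr/local/cuda/lib64 \
    || echo "libOpenCL.so not found in the candidate directories"
```

If the library is printed here but the R package build still reports OpenCL_LIBRARY as missing, that would point at the R build's CMake invocation rather than at the image.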

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 29 (17 by maintainers)

Most upvoted comments

or use nvidia-smi 😉

Well, as I corrected myself later, the sed version does not actually work properly anymore either (it compiles, but the resulting build has no GPU support).

Yeah, my Dockerfile has a history of additions over the years (and hacks like the echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd thing), I’ll see if I can clean it up with your suggestions @StrikerRUS.

However, lightgbm compiles fine outside the R package, so it seems it’s only the R package that gets confused about OpenCL.

Thanks @StrikerRUS, I fixed it now. Yeah, it is strange indeed that it compiled with the == as well.

@szilard I’m afraid you have a typo (duplicated = sign) in the commit you’ve linked:

--boost-librarydir==/usr/lib/x86_64-linux-gnu
------------------^--------------

Quite strange that compilation succeeds even with the typo.

Thanks @jameslamb for the fix and for merging it into LightGBM master. I changed the Dockerfile in my repo GBM-perf to take advantage of this fix (replaced the sed hack with flags to the build script): https://github.com/szilard/GBM-perf/commit/3b56bf0b474edd5dcf8039c9ddd86cddb9c1d845 Thanks.

Thanks to both of you for all the great information, and a nice reproducible example!

I’ve proposed what I think could be a fix, in https://github.com/microsoft/LightGBM/pull/3779. It wouldn’t “just work”, but would at least allow you to pass in these paths as command-line args like you can in the Python package, so no one would need to use sed to re-write install.libs.R.

Thanks for such nice reproducible examples @szilard! I can look into this this weekend, and probably expose more options via the build_r.R command-line args, so you don’t have to use sed.

Error in lgb.last_error() : api error: No OpenCL device found

Nice, given that the error happens on a non-GPU machine! That is indeed a good sign!

But please note that even with a successfully compiled GPU version and device_type='gpu' in params, training may still run on the CPU. This can happen on machines whose CPU has onboard graphics, with some combinations of system-wide default platform and device (refer to gpu_platform_id and gpu_device_id). So to be 100% sure LightGBM uses a real GPU, please take a look at the training log and find this line:

[LightGBM] [Info] Using GPU Device: GeForce MX150, Vendor: NVIDIA Corporation
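
A rough sketch of automating that check, assuming the training output has been captured to a file: the helper name used_gpu and the default file name train.log are hypothetical, and only the log line prefix quoted above is taken from this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the log contains the GPU device line.
used_gpu() {
    grep -q "\[LightGBM\] \[Info\] Using GPU Device:" "$1"
}

# train.log is an assumed file name; pass a different path as the first argument.
log="${1:-train.log}"
if [ -f "$log" ]; then
    if used_gpu "$log"; then
        echo "GPU device line found"
    else
        echo "no GPU device line found; training may have run on the CPU"
    fi
fi
```

The log could be captured with something like `Rscript train.R 2>&1 | tee train.log`, where train.R stands in for whatever script runs the training.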

All this is strange, because last time I ran the benchmarks (September 2020) it was all working.

@jameslamb I believe the R-package needs the same additional command-line options for the GPU version as our Python-package:

- boost-root
- boost-dir
- boost-include-dir
- boost-librarydir
- opencl-include-dir
- opencl-library

https://github.com/microsoft/LightGBM/tree/master/python-package#build-gpu-version
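
As a sketch only: if build_r.R accepts options spelled like the names in the list above (an assumption here; check the script's own documentation for the exact flags), a GPU build in the Docker image from this issue might be assembled as follows. The paths are the ones reported by the standalone CMake run in the original post, and the helper name gpu_build_cmd is hypothetical.

```shell
#!/bin/sh
# Paths taken from the standalone CMake output earlier in this issue.
OPENCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libOpenCL.so
OPENCL_INCLUDE_DIR=/usr/include
BOOST_LIBRARYDIR=/usr/lib/x86_64-linux-gnu

# Hypothetical helper: print the build command so it can be reviewed before running.
# The flag names are assumed to match the option list above; note the single "=".
gpu_build_cmd() {
    printf 'Rscript build_r.R --use-gpu --opencl-library=%s --opencl-include-dir=%s --boost-librarydir=%s\n' \
        "$OPENCL_LIBRARY" "$OPENCL_INCLUDE_DIR" "$BOOST_LIBRARYDIR"
}

gpu_build_cmd
```

Printing the command first makes a stray duplicated = sign easier to spot; the printed line can then be run from the root of the LightGBM checkout.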