DeepSpeed: [BUG] Fails to compile DeepSpeed 0.9.0 with CUDA 11.7 and PyTorch 1.13.1 in Docker
Describe the bug
I am building a Docker image via GitHub Actions, with PyTorch 1.13.1 and CUDA 11.7 installed. When I then try to install DeepSpeed 0.9.0 with DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6" /opt/conda/envs/dev/bin/pip install --no-cache-dir deepspeed --global-option="build_ext" --global-option="-j8", it fails with the message "error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1".
Since it only returns the nvcc error without any further information, I have no idea how to correct it.
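Because the final nvcc message carries no detail, one way to dig further is to rerun the same install with pip's -v flag, save the full output, and scan it for earlier compiler diagnostics. The following is a minimal sketch, not a confirmed fix: the pip path, environment variables, and architecture list are copied from the command above, while the helper names (install_command, install_env, error_lines, run_build) and the build.log file name are hypothetical.

```python
import os
import subprocess

# Values copied from the command in the report above.
PIP = "/opt/conda/envs/dev/bin/pip"
ARCH_LIST = "6.0;7.0;7.5;8.0;8.6"


def install_command(verbose=True):
    """Return the reported pip command, optionally with pip's -v flag."""
    cmd = [PIP, "install", "--no-cache-dir"]
    if verbose:
        cmd.append("-v")  # ask pip to print the output of the build step
    cmd += ["deepspeed", "--global-option=build_ext", "--global-option=-j8"]
    return cmd


def install_env():
    """Return the current environment plus the variables from the report."""
    env = dict(os.environ)
    env.update(
        DS_BUILD_OPS="1",
        DS_BUILD_SPARSE_ATTN="0",
        TORCH_CUDA_ARCH_LIST=ARCH_LIST,
    )
    return env


def error_lines(log_text):
    """Return (line_number, line) pairs for lines mentioning 'error'."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if "error" in line.lower()
    ]


def run_build(log_path="build.log"):
    """Hypothetical helper: run the install, save the log, return the matches."""
    result = subprocess.run(
        install_command(),
        env=install_env(),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr so nvcc output stays in order
        text=True,
    )
    with open(log_path, "w") as log_file:
        log_file.write(result.stdout)
    return result.returncode, error_lines(result.stdout)
```

Calling run_build() inside the image would write build.log and return the exit code together with the matching lines. The first match is usually more informative than the last one, since the closing "failed with exit code 1" line only reports that some earlier nvcc invocation failed.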
To Reproduce Steps to reproduce the behavior:
- Build this Dockerfile https://gitlab.com/chenyaofo/dockerfiles/-/raw/main/deepspeed/Dockerfile
- See error
Expected behavior
It would fail with error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1.
ds_report output
I cannot run ds_report because the compilation fails.
Screenshots I am building the Docker image via GitHub Actions; the build log is available at https://github.com/chenyaofo/docker-image-open-builder/actions/runs/4700207656/jobs/8334587489
System info (please complete the following information):
I am building on the GitHub Actions platform with the ubuntu-latest environment; the detailed workflow.yml can be found at https://github.com/chenyaofo/docker-image-open-builder/blob/main/.github/workflows/build.yml
Launcher context N/A
Docker context N/A
Additional context N/A
About this issue
- State: closed
- Created a year ago
- Comments: 15 (7 by maintainers)
For those coming to this issue while trying to build a wheel (installing works, but it does not build a wheel): using Python's build module instead of setup.py seems to make it work, with CMAKE_POSITION_INDEPENDENT_CODE=ON and NVCC_PREPEND_FLAGS="--forward-unknown-opts" to get past the gcc build (fPIC) problems and the options nvcc does not recognize. Note that the last architecture depends on the CUDA version in the development environment, e.g. CUDA 11.7 -> last supported = 8.7, CUDA 11.8 -> 8.9 (Lovelace), CUDA 12 -> 9.0 (Hopper).
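The wheel-building suggestion above can be sketched roughly as follows. This is an illustration under assumptions, not the commenter's exact procedure: the two environment variables and the CUDA-to-architecture limits are taken from the comment, while the helper names, the use of "python -m build --wheel" as the concrete invocation, and running it from a DeepSpeed source checkout are assumptions.

```python
import os
import sys

# Highest compute capability per CUDA version, as stated in the comment above.
MAX_ARCH_BY_CUDA = {"11.7": "8.7", "11.8": "8.9", "12": "9.0"}


def _as_tuple(arch):
    """Turn '8.6' into (8, 6) so architectures compare numerically."""
    major, minor = arch.split(".")
    return int(major), int(minor)


def max_arch(cuda_version):
    """Look up the limit by exact version first, then by major version."""
    for key in (cuda_version, cuda_version.split(".")[0]):
        if key in MAX_ARCH_BY_CUDA:
            return MAX_ARCH_BY_CUDA[key]
    raise ValueError(f"no architecture limit known for CUDA {cuda_version}")


def supported_arch_list(candidates, cuda_version):
    """Drop architectures newer than the CUDA toolkit's limit; join with ';'."""
    limit = _as_tuple(max_arch(cuda_version))
    return ";".join(a for a in candidates if _as_tuple(a) <= limit)


def wheel_build_env(candidates, cuda_version):
    """Environment for the wheel build, using the variables from the comment."""
    env = dict(os.environ)
    env.update(
        CMAKE_POSITION_INDEPENDENT_CODE="ON",
        NVCC_PREPEND_FLAGS="--forward-unknown-opts",
        TORCH_CUDA_ARCH_LIST=supported_arch_list(candidates, cuda_version),
    )
    return env


def wheel_build_command():
    """Assumed invocation of the 'build' package; the comment gives no command."""
    return [sys.executable, "-m", "build", "--wheel"]
```

A caller would pass both return values to subprocess.run(wheel_build_command(), env=wheel_build_env(...), cwd=...) with cwd pointing at the source tree. Filtering the architecture list first avoids asking nvcc for a target that the installed toolkit does not know about.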
I encountered the same issue, and I built a new Docker image to solve it. You can use it as follows:
@chenyaofo - I tried making my own dockerfile to test this, and I’m able to get the below working. I’m not familiar with the needs of your system, but I believe something in the conda setup isn’t working properly with the build, since I’m also able to modify yours to work.
To modify yours to work, I’ve replaced the build line with this:
This slows the build down slightly, but does compile properly.
Successful output is here:
A sample Dockerfile that works fine: