DeepSpeed: [BUG] Fail to compile Deepspeed 0.9.0 with CUDA 11.7 and PyTorch 1.13.1 with Docker.

Describe the bug I am building a Docker image via GitHub Actions, with PyTorch 1.13.1 and CUDA 11.7 installed. When I then try to install DeepSpeed 0.9.0 with DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6" /opt/conda/envs/dev/bin/pip install --no-cache-dir deepspeed --global-option="build_ext" --global-option="-j8", the build fails with: error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1.

Since it only reports the nvcc failure with no further information, I have no idea how to fix it.
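One way to get more than the final nvcc line is to run pip with its -v flag, save the output, and look for the first compiler errors in it. The snippet below is only a sketch of that idea: first_build_errors and build.log are hypothetical names, and the grep patterns are an assumption about what the compiler messages look like, not something taken from the actual build log.

```shell
# Hypothetical helper: print the first lines of a saved build log that look
# like compiler errors, prefixed with their line numbers.
# The patterns below are an assumption about typical gcc/nvcc messages.
first_build_errors() (
    log_file="$1"
    max_lines="${2:-20}"   # show at most 20 matches unless told otherwise
    grep -n -E 'error:|fatal error|nvcc fatal' "$log_file" | head -n "$max_lines"
)

# Usage (build.log is a hypothetical file name):
#   DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 pip install -v deepspeed 2>&1 | tee build.log
#   first_build_errors build.log
```

The last "error: command ... failed" line is usually only the summary, so the earlier matches are the ones worth reading first.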

To Reproduce Steps to reproduce the behavior:

  1. Build this Dockerfile https://gitlab.com/chenyaofo/dockerfiles/-/raw/main/deepspeed/Dockerfile
  2. See error

Expected behavior It would fail with error: command '/usr/local/cuda/bin/nvcc' failed with exit code 1.

ds_report output I cannot run ds_report because the build fails.

Screenshots I am building the Docker image via GitHub Actions; the build log is available at https://github.com/chenyaofo/docker-image-open-builder/actions/runs/4700207656/jobs/8334587489

System info (please complete the following information): I am building on the GitHub Actions platform with the ubuntu-latest environment; the detailed workflow.yml can be found at https://github.com/chenyaofo/docker-image-open-builder/blob/main/.github/workflows/build.yml

Launcher context N/A

Docker context N/A

Additional context N/A

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

For those coming to this issue while trying to build a wheel (install works, but it does not build a wheel): using Python's build module instead of setup.py seems to make it work, with CMAKE_POSITION_INDEPENDENT_CODE=ON and NVCC_PREPEND_FLAGS="--forward-unknown-opts" to get past the gcc build (-fPIC) problems and nvcc's unknown-option errors. Note that the last architecture depends on the CUDA version in the development environment, e.g. CUDA 11.7 -> last supported = 8.7, CUDA 11.8 -> 8.9 (Lovelace), CUDA 12 -> 9.0 (Hopper).


DEEPSPEED_VERSION=v0.9.1
git clone https://github.com/microsoft/DeepSpeed.git \
     && cd DeepSpeed \
     && git checkout ${DEEPSPEED_VERSION} \
     && pip install --upgrade "pydantic<2.0.0" \
     && pip install build==0.10.0 \
     && CMAKE_POSITION_INDEPENDENT_CODE=ON NVCC_PREPEND_FLAGS="--forward-unknown-opts" CUDA_PATH=/usr/local/cuda TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6" DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 python -m build --wheel --no-isolation
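The note about the last supported architecture can be sketched as a small filter that drops entries above a given maximum before they go into TORCH_CUDA_ARCH_LIST. filter_arch_list is a hypothetical helper, not part of DeepSpeed or PyTorch, and it assumes plain "major.minor" entries with a single-digit minor version (no "+PTX" suffixes).

```shell
# Hypothetical helper: keep only the entries of a semicolon-separated
# architecture list that do not exceed a maximum compute capability.
# Assumes plain "major.minor" entries with single-digit minor versions.
filter_arch_list() (
    arch_list="$1"
    max_num=$(printf '%s' "$2" | tr -d '.')   # "8.7" -> "87"
    kept=""
    IFS=';'                                   # split the list on semicolons
    for arch in $arch_list; do
        arch_num=$(printf '%s' "$arch" | tr -d '.')
        # Compare as integers after dropping the dot.
        if [ "$arch_num" -le "$max_num" ]; then
            if [ -n "$kept" ]; then
                kept="$kept;$arch"
            else
                kept="$arch"
            fi
        fi
    done
    printf '%s\n' "$kept"
)

# Usage, taking 8.7 as the maximum for CUDA 11.7 as stated in the comment above:
#   TORCH_CUDA_ARCH_LIST="$(filter_arch_list "6.0;7.0;7.5;8.0;8.6;8.9" 8.7)"
```

The function body runs in a subshell, so the IFS change and the temporary variables do not leak into the calling shell.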

I encountered the same issue and built a new Docker image to solve it. You can use it as follows:

docker pull jockeyyan/deepspeed:torch113_cuda117v2.0

@chenyaofo - I tried making my own dockerfile to test this, and I’m able to get the below working. I’m not familiar with the needs of your system, but I believe something in the conda setup isn’t working properly with the build, since I’m also able to modify yours to work.

To modify yours to work, I’ve replaced the build line with this:

RUN DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 /opt/conda/envs/dev/bin/pip install deepspeed

This slows the build down slightly, but does compile properly.

Successful output is here:

[+] Building 982.1s (10/10) FINISHED
 => [internal] load .dockerignore                                                                                  0.0s
 => => transferring context: 2B                                                                                    0.0s
 => [internal] load build definition from Dockerfile                                                               0.1s
 => => transferring dockerfile: 1.68kB                                                                             0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.7.0-devel-ubuntu22.04                                    0.0s
 => [1/6] FROM docker.io/nvidia/cuda:11.7.0-devel-ubuntu22.04                                                      0.0s
 => CACHED [2/6] RUN APT_INSTALL="apt-get install -y --no-install-recommends --no-install-suggests" &&     GIT_CL  0.0s
 => CACHED [3/6] RUN wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_  0.0s
 => CACHED [4/6] RUN /opt/conda/bin/mamba create -n dev python=3.10 &&     CONDA_INSTALL="/opt/conda/bin/mamba in  0.0s
 => CACHED [5/6] RUN PIP_INSTALL="/opt/conda/envs/dev/bin/pip install --no-cache-dir" &&     $PIP_INSTALL torch==  0.0s
 => [6/6] RUN DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 /opt/conda/envs/dev/bin/pip install deepspeed

A sample Dockerfile that works fine:

FROM nvidia/cuda:11.7.1-devel-ubuntu20.04

ARG DEBIAN_FRONTEND=noninteractive

SHELL [ "/bin/bash","-c" ]

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,video,utility

RUN apt update -y \
&& apt upgrade -y

RUN apt install wget -y \
&& apt install git -y \
&& apt install libaio-dev -y \
&& apt install libaio1 -y

RUN apt install python3.9 -y \
&& apt install python3-pip -y \
&& apt install python-is-python3 -y

RUN pip install --upgrade pip setuptools wheel

RUN pip install ninja

RUN pip install torch torchvision torchaudio

RUN DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 pip install git+https://github.com/microsoft/DeepSpeed.git@v0.9.1

CMD ["bash"]