DeepSpeed: Error in building Transformer kernel

I am using the deepspeed/deepspeed:latest container (I also tried installing DeepSpeed with DS_BUILD_OPS=1 pip install deepspeed, but got the same error) and am trying to use the Transformer kernel provided by DeepSpeed as follows:

from deepspeed import DeepSpeedTransformerLayer, DeepSpeedTransformerConfig

if __name__ == "__main__":
    transformer_config = DeepSpeedTransformerConfig(
        batch_size=40,
        hidden_size=768,
        heads=768 // 64,
        intermediate_size=768 * 4,
        attn_dropout_ratio=0.0,
        hidden_dropout_ratio=0.0,
        num_hidden_layers=4,
        initializer_range=0.02,
        fp16=True,
        pre_layer_norm=True,
        stochastic_mode=True,
    )
    layer = DeepSpeedTransformerLayer(config=transformer_config)

But the layer fails to initialize, with the following error:

DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 40, 'hidden_size': 768, 'intermediate_size': 3072, 'heads': 12, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 4, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': True, 'huggingface': False}
Using /root/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/stochastic_transformer/build.ninja...
Building extension module stochastic_transformer...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/8] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=stochastic_transformer -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -D__STOCHASTIC_MODE__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/cublas_wrappers.cu -o cublas_wrappers.cuda.o
[2/8] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=stochastic_transformer -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -D__STOCHASTIC_MODE__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/dropout_kernels.cu -o dropout_kernels.cuda.o
FAILED: dropout_kernels.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=stochastic_transformer -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -D__STOCHASTIC_MODE__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/dropout_kernels.cu -o dropout_kernels.cuda.o
/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/dropout_kernels.cu(102): error: no operator "*" matches these operands
            operand types are: __half2 * const __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/dropout_kernels.cu(103): error: no operator "*" matches these operands
            operand types are: __half2 * const __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/dropout_kernels.cu(216): error: no operator "*" matches these operands
            operand types are: __half2 * const __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/dropout_kernels.cu(217): error: no operator "*" matches these operands
            operand types are: __half2 * const __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/dropout_kernels.cu(335): error: no operator "*" matches these operands
            operand types are: __half2 * const __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/dropout_kernels.cu(336): error: no operator "*" matches these operands
            operand types are: __half2 * const __half2

6 errors detected in the compilation of "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/dropout_kernels.cu".
[3/8] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=stochastic_transformer -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -D__STOCHASTIC_MODE__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu -o normalize_kernels.cuda.o
FAILED: normalize_kernels.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=stochastic_transformer -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -D__STOCHASTIC_MODE__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu -o normalize_kernels.cuda.o
/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(880): error: no operator "*=" matches these operands
            operand types are: __half2 *= __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(883): error: no operator "-" matches these operands
            operand types are: const __half2 - const __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(885): error: ambiguous "?" operation: second operand of type "<error-type>" can be converted to third operand type "const __half2", and vice versa

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(890): error: no operator "*=" matches these operands
            operand types are: __half2 *= __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(892): error: no operator "-" matches these operands
            operand types are: const __half2 - const __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(893): error: ambiguous "?" operation: second operand of type "<error-type>" can be converted to third operand type "const __half2", and vice versa

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(901): error: no operator "*" matches these operands
            operand types are: __half2 * __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(901): error: identifier "h2sqrt" is undefined

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(905): error: identifier "h2rsqrt" is undefined

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(927): error: no operator "-" matches these operands
            operand types are: - __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1189): error: no operator "*=" matches these operands
            operand types are: __half2 *= __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1194): error: no operator "*=" matches these operands
            operand types are: __half2 *= __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1205): error: no operator "-" matches these operands
            operand types are: const __half2 - __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1206): error: no operator "*" matches these operands
            operand types are: __half2 * __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1210): error: identifier "h2rsqrt" is undefined

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1232): error: no operator "-" matches these operands
            operand types are: - __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1232): error: identifier "h2rsqrt" is undefined

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1621): error: no operator "*=" matches these operands
            operand types are: __half2 *= __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1624): error: no operator "-" matches these operands
            operand types are: const __half2 - const __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1626): error: ambiguous "?" operation: second operand of type "<error-type>" can be converted to third operand type "const __half2", and vice versa

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1631): error: no operator "*=" matches these operands
            operand types are: __half2 *= __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1633): error: no operator "-" matches these operands
            operand types are: const __half2 - const __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1634): error: ambiguous "?" operation: second operand of type "<error-type>" can be converted to third operand type "const __half2", and vice versa

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1642): error: no operator "*" matches these operands
            operand types are: __half2 * __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1642): error: identifier "h2sqrt" is undefined

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1646): error: identifier "h2rsqrt" is undefined

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1668): error: no operator "-" matches these operands
            operand types are: - __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1703): error: no operator "+" matches these operands
            operand types are: __half2 + const __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1710): error: no operator "+" matches these operands
            operand types are: __half2 + const __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1940): error: no operator "*=" matches these operands
            operand types are: __half2 *= __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1946): error: no operator "*=" matches these operands
            operand types are: __half2 *= __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1959): error: no operator "-" matches these operands
            operand types are: __half2 - __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1960): error: no operator "*" matches these operands
            operand types are: __half2 * __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1964): error: identifier "h2rsqrt" is undefined

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1986): error: no operator "-" matches these operands
            operand types are: - __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(1986): error: identifier "h2rsqrt" is undefined

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(2021): error: no operator "+" matches these operands
            operand types are: __half2 + const __half2

/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu(2027): error: no operator "+" matches these operands
            operand types are: __half2 + const __half2

38 errors detected in the compilation of "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/normalize_kernels.cu".
[4/8] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=stochastic_transformer -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -D__STOCHASTIC_MODE__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/general_kernels.cu -o general_kernels.cuda.o
[5/8] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=stochastic_transformer -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -D__STOCHASTIC_MODE__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/transform_kernels.cu -o transform_kernels.cuda.o
[6/8] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=stochastic_transformer -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -D__STOCHASTIC_MODE__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/gelu_kernels.cu -o gelu_kernels.cuda.o
[7/8] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=stochastic_transformer -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -D__STOCHASTIC_MODE__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/transformer/softmax_kernels.cu -o softmax_kernels.cuda.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1549, in _run_ninja_build
    subprocess.run(
  File "/opt/conda/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "experimentation.py", line 17, in <module>
    layer = DeepSpeedTransformerLayer(config=transformer_config)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/transformer.py", line 543, in __init__
    stochastic_transformer_cuda_module = StochasticTransformerBuilder().load()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 180, in load
    return self.jit_load(verbose)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 208, in jit_load
    op_module = load(
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 999, in load
    return _jit_compile(
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1204, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1308, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1565, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'stochastic_transformer'
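
The nvcc command lines in the log above are long, so it can help to pull out just the architectures they target. The following is a minimal sketch, not part of DeepSpeed or PyTorch: the helper name gencode_archs and the sample string are made up for illustration, and the sample is a shortened version of the flags seen in the log.

```python
import re

# Matches flags such as -gencode=arch=compute_52,code=sm_52 and
# -gencode=arch=compute_80,code=compute_80, capturing the virtual arch number.
_GENCODE = re.compile(r"-gencode[= ]arch=compute_(\d+),code=(?:sm|compute)_\d+")


def gencode_archs(nvcc_command):
    """Return the sorted, de-duplicated compute architectures in an nvcc command."""
    return sorted({int(arch) for arch in _GENCODE.findall(nvcc_command)})


# Shortened sample of the flags from the build log (hypothetical excerpt).
sample = (
    "nvcc -gencode=arch=compute_60,code=sm_60 "
    "-gencode=arch=compute_52,code=sm_52 "
    "-gencode=arch=compute_80,code=sm_80 "
    "-gencode=arch=compute_80,code=compute_80 -c dropout_kernels.cu"
)
print(gencode_archs(sample))  # [52, 60, 80]
```

Running this over the full command from the log would list 52 among the targets, which matches the TORCH_CUDA_ARCH_LIST value reported in the comments.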

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 28 (13 by maintainers)

Most upvoted comments

root@x8a100-0000:/workspace# env | grep -i arch
TORCH_CUDA_ARCH_LIST=5.2 6.0 6.1 7.0 7.5 8.0 8.6+PTX

export TORCH_CUDA_ARCH_LIST=7.0
DS_BUILD_OPS=1 pip3 install deepspeed

Worked, thank you.

unset TORCH_CUDA_ARCH_LIST fixed the problem for me.
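
The same two workarounds (unsetting the variable, or pinning it to one architecture) can be applied from inside a Python script before the layer is constructed, so the JIT build does not inherit the container's value. This is only a sketch under assumptions: pin_arch_list and format_arch are hypothetical helper names, and it assumes the JIT build reads TORCH_CUDA_ARCH_LIST from the process environment at build time.

```python
import os


def format_arch(capability):
    """Turn a (major, minor) compute capability into a 'major.minor' string."""
    major, minor = capability
    return f"{major}.{minor}"


def pin_arch_list(capability=None, env=os.environ):
    """Remove an inherited TORCH_CUDA_ARCH_LIST, optionally pinning a single arch.

    Returns the previous value (or None) so the caller can restore it later.
    """
    previous = env.pop("TORCH_CUDA_ARCH_LIST", None)
    if capability is not None:
        env["TORCH_CUDA_ARCH_LIST"] = format_arch(capability)
    return previous


# Demonstration on a plain dict instead of the real environment.
demo_env = {"TORCH_CUDA_ARCH_LIST": "5.2 6.0 6.1 7.0 7.5 8.0 8.6+PTX"}
pin_arch_list((7, 0), env=demo_env)
print(demo_env["TORCH_CUDA_ARCH_LIST"])  # 7.0
```

In a real script, pin_arch_list() would be called before DeepSpeedTransformerLayer(...) is created. The capability tuple could come from torch.cuda.get_device_capability(), which returns a (major, minor) pair for the current device.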

There we go, that’s the missing piece 😃 We have now fully reproduced this with these two steps:

  1. use this docker image nvcr.io/nvidia/pytorch:20.12-py3
  2. DS_BUILD_OPS=1 pip install deepspeed

I believe we also know why this exact build error is happening, but not why it is being triggered. For some reason the build command is adding a gencode for compute capability (cc) 5.2, even though there are clearly no cc 5.2 GPUs on the box. We’ll dig into this further and report back once we have a fix.
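
Until a fix lands, one way to keep the rest of the architecture list is to drop only the entries that cannot compile the half-precision kernels. The sketch below assumes the __half2 operators and h2sqrt/h2rsqrt need compute capability 5.3 or higher; that threshold comes from the CUDA half-precision documentation, not from this thread. The helper names are hypothetical, and named entries such as "Pascal" are not handled.

```python
def parse_arch(entry):
    """Parse an entry like '8.6+PTX' into a (major, minor) tuple, e.g. (8, 6)."""
    number = entry.split("+", 1)[0]
    major, minor = number.split(".")
    return int(major), int(minor)


def drop_archs_below(arch_list, minimum=(5, 3)):
    """Keep only the entries whose compute capability is at least `minimum`.

    Accepts space- or semicolon-separated lists and returns a space-separated one.
    """
    entries = arch_list.replace(";", " ").split()
    kept = [entry for entry in entries if parse_arch(entry) >= minimum]
    return " ".join(kept)


# The value reported by `env | grep -i arch` in the comments.
print(drop_archs_below("5.2 6.0 6.1 7.0 7.5 8.0 8.6+PTX"))
# 6.0 6.1 7.0 7.5 8.0 8.6+PTX
```

The filtered string could then be exported as TORCH_CUDA_ARCH_LIST before running DS_BUILD_OPS=1 pip install deepspeed.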