DeepSpeed: Cifar-10 example - RuntimeError: Error building extension 'fused_adam'

Hey, I was trying out the CIFAR-10 tutorial (link).
Could you help with this runtime error?

On executing (run_ds.sh):


(dspeed) axe@axe-H270-Gaming-3:~/Downloads/DeepSpeedExamples/cifar$ sh run_ds.sh
[2021-01-26 05:43:56,524] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-01-26 05:43:56,554] [INFO] [runner.py:355:main] cmd = /home/axe/VirtualEnvs/dspeed/bin/python3.6 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
[2021-01-26 05:43:56,972] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2021-01-26 05:43:56,972] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=2, node_rank=0
[2021-01-26 05:43:56,972] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2021-01-26 05:43:56,972] [INFO] [launch.py:100:main] dist_world_size=2
[2021-01-26 05:43:56,973] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
0it [00:00, ?it/s]Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
 99%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉  | 168140800/170498071 [00:07<00:00, 28603271.23it/s]Extracting ./data/cifar-10-python.tar.gz to ./data
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
Files already downloaded and verified
170500096it [00:10, 16970356.67it/s]                                                                                                                                                                          
170500096it [00:10, 16911123.86it/s]                                                                                                                                                                          
horse plane   cat  bird
[2021-01-26 05:44:13,334] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.10, git-hash=unknown, git-branch=unknown
[2021-01-26 05:44:13,335] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
truck horse  ship  ship
[2021-01-26 05:44:14,857] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.10, git-hash=unknown, git-branch=unknown
[2021-01-26 05:44:14,857] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-26 05:44:18,027] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2021-01-26 05:44:18,028] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
Using /home/axe/.cache/torch_extensions as PyTorch extensions root...
Using /home/axe/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/axe/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda_10_1_7_6/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/TH -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda_10_1_7_6/include -isystem /home/axe/VirtualEnvs/dspeed/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++14 -c /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
FAILED: multi_tensor_adam.cuda.o 
/usr/local/cuda_10_1_7_6/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/TH -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda_10_1_7_6/include -isystem /home/axe/VirtualEnvs/dspeed/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++14 -c /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134:   required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6688:95:   required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’ without object
       __p->_M_set_sharable();
       ~~~~~~~~~^~
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24:   required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134:   required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6693:95:   required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’ without object
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/TH -isystem /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda_10_1_7_6/include -isystem /home/axe/VirtualEnvs/dspeed/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o 
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1539, in _run_ninja_build
    env=env)
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "cifar10_deepspeed.py", line 144, in <module>
    args=args, model=net, model_parameters=parameters, training_data=trainset)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/__init__.py", line 119, in initialize
    config_params=config_params)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 171, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 514, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 583, in _configure_basic_optimizer
    optimizer = FusedAdam(model_parameters, **optimizer_parameters)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 180, in load
    return self.jit_load(verbose)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 216, in jit_load
    verbose=verbose)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
    with_cuda=with_cuda)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
    error_prefix="Error building extension '{}'".format(name))
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'         # *******************************************************
Loading extension module fused_adam...
Traceback (most recent call last):
  File "cifar10_deepspeed.py", line 144, in <module>
    args=args, model=net, model_parameters=parameters, training_data=trainset)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/__init__.py", line 119, in initialize
    config_params=config_params)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 171, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 514, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 583, in _configure_basic_optimizer
    optimizer = FusedAdam(model_parameters, **optimizer_parameters)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 180, in load
    return self.jit_load(verbose)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py", line 216, in jit_load
    verbose=verbose)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1213, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1560, in _import_module_from_library
    file, path, description = imp.find_module(module_name, [path])
  File "/home/axe/VirtualEnvs/dspeed/lib/python3.6/imp.py", line 297, in find_module
    raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'fused_adam'    # *******************************************************

Here’s ds_report:

(dspeed) axe@axe-H270-Gaming-3:~/Downloads/DeepSpeedExamples/cifar$ ds_report 
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
 [WARNING]  sparse_attn requires the 'cmake' command, but it does not exist!
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/torch']
torch version .................... 1.7.1+cu101
torch cuda version ............... 10.1
nvcc version ..................... 10.1
deepspeed install path ........... ['/home/axe/VirtualEnvs/dspeed/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.3.10, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.7, cuda 10.1

Running with CUDA 10.1 on Ubuntu 18.04. Here’s the virtual environment:

(dspeed) axe@axe-H270-Gaming-3:~/Downloads/DeepSpeedExamples/cifar$ pip freeze
cycler==0.10.0
dataclasses==0.8
deepspeed==0.3.10
kiwisolver==1.3.1
matplotlib==3.3.3
ninja==1.10.0.post2
numpy==1.19.5
Pillow==8.1.0
protobuf==3.14.0
pyparsing==2.4.7
python-dateutil==2.8.1
six==1.15.0
tensorboardX==1.8
torch==1.7.1+cu101
torchaudio==0.7.2
torchvision==0.8.2+cu101
tqdm==4.56.0
typing-extensions==3.7.4.3

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 21 (6 by maintainers)

Most upvoted comments

In my case, the same issue happened even after I updated CUDA to version 10.1.243, and I could not update to CUDA 10.2 because my Ubuntu is 14.04. I found that my issue was caused by the old version of GCC (4.8). I followed this solution to update to GCC 6 and the problem was solved: https://gist.github.com/application2000/73fd6f4bf1be6600a2cf9f56315a2d91 Hope this helps someone ^^
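For anyone who wants to check their host compiler before retrying the JIT build, here is a minimal sketch. It assumes `gcc` is the compiler that nvcc picks up, and the threshold of 6 is only taken from the comment above (4.8 failed, 6 worked); it is not an official minimum. The function names are made up for illustration.

```python
import re
import shutil
import subprocess
from typing import Optional

# Assumption: 6 is used only because the comment above reports that moving
# from GCC 4.8 to GCC 6 resolved the build. It is not a documented minimum.
ASSUMED_MIN_GCC_MAJOR = 6


def parse_gcc_major(dumpversion_output: str) -> Optional[int]:
    """Extract the major version from `gcc -dumpversion` output, e.g. '7' or '4.8.5'."""
    match = re.match(r"\s*(\d+)", dumpversion_output)
    return int(match.group(1)) if match else None


def gcc_looks_too_old(major: Optional[int]) -> bool:
    """Return True only when the major version is known and below the assumed threshold."""
    return major is not None and major < ASSUMED_MIN_GCC_MAJOR


def detect_gcc_major() -> Optional[int]:
    """Run `gcc -dumpversion` if gcc is on PATH; return None when it is not available."""
    if shutil.which("gcc") is None:
        return None
    result = subprocess.run(
        ["gcc", "-dumpversion"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_gcc_major(result.stdout)


if __name__ == "__main__":
    major = detect_gcc_major()
    print("gcc major version:", major)
    if gcc_looks_too_old(major):
        print("gcc may be too old for the fused_adam JIT build; consider upgrading")
```

If the script reports an old major version, upgrading GCC as described in the linked gist is the route this commenter took.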

I had installation issues and was following the idea in https://github.com/microsoft/DeepSpeed/issues/629#issuecomment-753993124 to change CUDA from 10.1.105 to 10.1.243, but ended up installing 10.2 instead, which fixed this issue.

Sorry, I won’t have time to revert to 10.1 to look for the underlying cause, but in any case, that should be an easy fix in the meantime.
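To see which CUDA 10.1 build you are on, something like the following could help. It is only a sketch: it parses the `V<major>.<minor>.<patch>` token from `nvcc --version`, and it treats 10.1.243 as the first working build purely on the basis of the comments in this thread. The helper names are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

Version = Tuple[int, int, int]

# Assumption: taken from this thread, where moving from 10.1.105 to 10.1.243
# (or to 10.2) was reported to fix the build. Not an official compatibility bound.
ASSUMED_FIXED_NVCC: Version = (10, 1, 243)


def parse_nvcc_version(text: str) -> Optional[Version]:
    """Find the 'V10.1.105'-style token in `nvcc --version` output."""
    match = re.search(r"V(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)), int(match.group(3)))


def nvcc_predates_fix(version: Optional[Version]) -> bool:
    """Return True when the version is known and older than the assumed fixed build."""
    return version is not None and version < ASSUMED_FIXED_NVCC


def detect_nvcc_version() -> Optional[Version]:
    """Run `nvcc --version` if nvcc is on PATH; return None when it is not available."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_nvcc_version(result.stdout)


if __name__ == "__main__":
    version = detect_nvcc_version()
    print("nvcc version:", version)
    if nvcc_predates_fix(version):
        print("nvcc is older than 10.1.243; upgrading CUDA may fix the build")
```

Note that `ds_report` above only shows `nvcc version ... 10.1`, so the patch level has to be read from `nvcc --version` directly.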