DeepSpeed: Error building extension 'fused_adam' with DeepSpeed==0.3.13
Hi,
I upgraded DeepSpeed to 0.3.13 and Torch to 1.8.0, and while using DeepSpeed with HF (HuggingFace) I'm getting the error below:
RuntimeError: Error building extension 'fused_adam'
Here is the stack trace:
[2021-03-23 07:03:49,374] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.13, git-hash=unknown, git-branch=unknown
[2021-03-23 07:03:49,407] [INFO] [engine.py:77:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/jovyan/.cache/torch_extensions as PyTorch extensions root...
Creating extension directory /home/jovyan/.cache/torch_extensions/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jovyan/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
1672 check=True,
-> 1673 env=env)
1674 except subprocess.CalledProcessError as e:
/usr/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
437 raise CalledProcessError(retcode, process.args,
--> 438 output=stdout, stderr=stderr)
439 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-24-3435b262f1ae> in <module>
----> 1 trainer.train()
~/.local/lib/python3.6/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
901 delay_optimizer_creation = self.sharded_ddp is not None and self.sharded_ddp != ShardedDDPOption.SIMPLE
902 if self.args.deepspeed:
--> 903 model, optimizer, lr_scheduler = init_deepspeed(self, num_training_steps=max_steps)
904 self.model = model.module
905 self.model_wrapped = model # will get further wrapped in DDP
~/.local/lib/python3.6/site-packages/transformers/integrations.py in init_deepspeed(trainer, num_training_steps)
416 model=model,
417 model_parameters=model_parameters,
--> 418 config_params=config,
419 )
420
~/.local/lib/python3.6/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params)
123 dist_init_required=dist_init_required,
124 collate_fn=collate_fn,
--> 125 config_params=config_params)
126 else:
127 assert mpu is None, "mpu must be None with pipeline parallelism"
~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params, dont_change_device)
181 self.lr_scheduler = None
182 if model_parameters or optimizer:
--> 183 self._configure_optimizer(optimizer, model_parameters)
184 self._configure_lr_scheduler(lr_scheduler)
185 self._report_progress(0)
~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_optimizer(self, client_optimizer, model_parameters)
596 logger.info('Using client Optimizer as basic optimizer')
597 else:
--> 598 basic_optimizer = self._configure_basic_optimizer(model_parameters)
599 if self.global_rank == 0:
600 logger.info(
~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_basic_optimizer(self, model_parameters)
670 optimizer = FusedAdam(model_parameters,
671 **optimizer_parameters,
--> 672 adam_w_mode=effective_adam_w_mode)
673
674 elif self.optimizer_name() == LAMB_OPTIMIZER:
~/.local/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py in __init__(self, params, lr, bias_correction, betas, eps, adam_w_mode, weight_decay, amsgrad, set_grad_none)
70 self.set_grad_none = set_grad_none
71
---> 72 fused_adam_cuda = FusedAdamBuilder().load()
73 # Skip buffer
74 self._dummy_overflow_buf = torch.cuda.IntTensor([0])
~/.local/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in load(self, verbose)
213 return importlib.import_module(self.absolute_name())
214 else:
--> 215 return self.jit_load(verbose)
216
217 def jit_load(self, verbose=True):
~/.local/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in jit_load(self, verbose)
250 extra_cuda_cflags=self.nvcc_args(),
251 extra_ldflags=self.extra_ldflags(),
--> 252 verbose=verbose)
253 build_duration = time.time() - start_build
254 if verbose:
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
1089 is_python_module,
1090 is_standalone,
-> 1091 keep_intermediates=keep_intermediates)
1092
1093
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
1300 verbose=verbose,
1301 with_cuda=with_cuda,
-> 1302 is_standalone=is_standalone)
1303 finally:
1304 baton.release()
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_standalone)
1405 build_directory,
1406 verbose,
-> 1407 error_prefix=f"Error building extension '{name}'")
1408
1409
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
1681 if hasattr(error, 'output') and error.output: # type: ignore
1682 message += f": {error.output.decode()}" # type: ignore
-> 1683 raise RuntimeError(message) from e
1684
1685
RuntimeError: Error building extension 'fused_adam'
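In case it helps, the failure can be reproduced outside of the Trainer by forcing the JIT build of the op directly (a rough sketch; the cache path is from the log above, and the import path is inferred from the stack trace):

```bash
# Clear any cached (failed) build and retry just the fused_adam compile
rm -rf ~/.cache/torch_extensions/fused_adam
python -c "from deepspeed.ops.op_builder import FusedAdamBuilder; FusedAdamBuilder().load(verbose=True)"
```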
The versions I'm using:
Collecting environment information...
PyTorch version: 1.8.0+cu101
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
Nvidia driver version: 450.51.06
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] kubeflow-pytorchjob==0.1.3
[pip3] numpy==1.18.5
[pip3] torch==1.8.0+cu101
[pip3] torchvision==0.8.1
[conda] Could not collect
transformers==4.4.2
DeepSpeed==0.3.13
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
But I was able to run DeepSpeed 0.3.10 with HuggingFace 4.3.2 and Torch 1.7.1+cu101 without any issue.
Please suggest how to proceed further.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 17 (10 by maintainers)
Thanks a lot @stas00
Finally it worked. As Colab has Python 3.7, I replicated what you said on my AWS EC2 instance, where I had several CUDA versions installed, including 10.1 and 11.1.
Reiterating the steps I followed so that they can help someone with similar issues (a sketch of the corresponding commands follows this list):
- Created a conda environment with python-3.6.9 (because the target machine where I want to run DeepSpeed has 3.6.9).
- Changed `PATH` and `LD_LIBRARY_PATH` to point to CUDA-10.1 (again because of my target machine), as suggested in HF's installation notes here.
- Verified the torch versions with `python -m torch.utils.collect_env`.
- Checked whether compatible ops were installed with `ds_report`.
- Built the wheel, which ends up in `dist/`, and installed it on the target machine using `pip install deepspeed-0.3.13+7fcc891-cp36-cp36m-linux_x86_64.whl`.
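For reference, a rough sketch of the commands the environment steps above correspond to, assuming CUDA 10.1 is installed under /usr/local/cuda-10.1 (paths on your machine may differ):

```bash
# Point the toolchain at CUDA 10.1 (the install path is an assumption)
export PATH=/usr/local/cuda-10.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH

# Verify which CUDA torch sees, and which DeepSpeed ops are compatible
python -m torch.utils.collect_env
ds_report
```

(The wheel build itself is sketched further down in this thread.)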
That's the correct way: `major=7, minor=0` => `7.0`.
Also, you can find the full list of all archs at https://developer.nvidia.com/cuda-gpus
Incidentally, I have just added all this information to the docs; hopefully it will be merged in the next few days:
Yes.
In my case both my build and target machines are the same, so I didn't use `TORCH_CUDA_ARCH_LIST`. But yeah, it's always better to set it explicitly.
For reference, I used `torch.cuda.get_device_properties(device)` to check my device architecture, which gives output like `_CudaDeviceProperties(name='Tesla V100-SXM2-32GB', major=7, minor=0, total_memory=32510MB, multi_processor_count=80)`. I'm not completely sure, but from that output I took my device architecture to be `7.0`.
One can also check the list of CUDA architectures that the installed torch is compiled for using `torch.cuda.get_arch_list()`, which gives output like:
- `['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75']` for torch==1.7.1+cu101
- `['sm_37', 'sm_50', 'sm_60', 'sm_70']` for torch==1.8.1+cu101
Not sure whether this is the correct way to check. Maybe @stas00 can confirm. (A sketch of these checks is below.)
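For completeness, a minimal sketch of those two checks from the command line (device index 0 is an assumption; use whichever device you actually run on):

```bash
# Compute capability of the GPU (major/minor, e.g. 7.0 for a V100)
python -c "import torch; print(torch.cuda.get_device_properties(0))"

# CUDA architectures the installed torch build was compiled for
python -c "import torch; print(torch.cuda.get_arch_list())"
```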
BTW, I do recommend you use an explicit `TORCH_CUDA_ARCH_LIST` for your GPUs during the build, since from what I understand you may get better performance that way, especially if your build machine doesn't have the same GPUs as your target machine.

Awesome! Thank you for the report, @saichandrapandraju
Except you don't need step 4. Step 5 is all you need after you cloned the repo.
Step 4 is for when you want to install it locally; it is similar to steps 5+6, but you don't get a wheel to take to another machine (see the sketch below).
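For illustration, a minimal sketch of that local-install variant, assuming the DeepSpeed repo is already cloned and that you want the ops pre-built via `DS_BUILD_OPS=1` (adjust the flags to your setup):

```bash
# Build and install DeepSpeed directly into the current environment;
# unlike the wheel route, nothing is left in dist/ to copy elsewhere
cd DeepSpeed
DS_BUILD_OPS=1 pip install .
```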
I had the same issue with fairscale on several setups: no matter what I tried, it wouldn't build at runtime, but prebuilding it into a wheel and installing that worked.
OK, so since you prebuilt from source on Colab (thank you for sharing the outcomes), you now know what's involved. It'll install dependencies just like when you don't pre-build from source. So if you are able to do `pip install deepspeed` on your setup, you can also do the same here, i.e. preinstall all the dependencies while you have the network, just like you'd do normally.

Here is yet another approach to consider. Build a binary wheel on whatever normal machine where you have a similar CUDA setup, along these lines:
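A rough sketch of such a wheel build, assuming a fresh clone of the DeepSpeed repo, `DS_BUILD_OPS=1` to pre-compile the ops, and an illustrative arch list (7.0 = V100, per the discussion elsewhere in this thread):

```bash
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
rm -rf build

# Pre-compile the CUDA ops for the target arch and produce a wheel under dist/
TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_OPS=1 python setup.py build_ext -j8 bdist_wheel
```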
Adjust `TORCH_CUDA_ARCH_LIST` for the archs required by the target machine.
Now you have `dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl` (it will be a different name depending on the build).
Now you can install it on your VM and you don't need to build anything at run time, you just do something like:
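For example (the wheel filename is the one mentioned above; yours will differ):

```bash
# Install the prebuilt wheel on the target machine; no JIT compilation at run time
pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl
```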
I presume you will already have the other dependencies installed, since you already did that for `pip install deepspeed`.
I wonder if DeepSpeed should document this approach on their advanced install page.
This is odd, since I have just re-run my notebook on the free version of Colab and it didn't have any problems.
You may have noticed you made progress, though: you managed to build the DeepSpeed extensions using this notebook, but then something killed the process immediately after it built the extension. So now you have a correct combination of packages.
Try to re-run that last cell, since the extension is now built and cached (that is, if you're in the same session; if not, start a new one and re-run the cell a second time if it dies again the first time).
In theory everybody gets mostly the same environment, but perhaps that's not so. Could you monitor that your disk space and RAM are not at 100%? Perhaps the watchdog kills the process when resources are exhausted.
I'm curious what happens if you run the training cell a second time.
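In case it helps with the resource check mentioned above, a minimal sketch of commands to watch from within the session (prefix them with "!" when running in a Colab cell):

```bash
df -h /       # disk usage of the root filesystem
free -h       # RAM and swap usage
nvidia-smi    # GPU memory, in case that is what's being exhausted
```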