DeepSpeed: Error building extension 'fused_adam' with DeepSpeed==0.3.13
Hi,
I upgraded DeepSpeed to 0.3.13 and Torch to 1.8.0, and while using DeepSpeed with HF (HuggingFace) I'm getting the error below:
RuntimeError: Error building extension 'fused_adam'
Here is the stack trace:
[2021-03-23 07:03:49,374] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.13, git-hash=unknown, git-branch=unknown
[2021-03-23 07:03:49,407] [INFO] [engine.py:77:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/jovyan/.cache/torch_extensions as PyTorch extensions root...
Creating extension directory /home/jovyan/.cache/torch_extensions/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jovyan/.cache/torch_extensions/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
1672 check=True,
-> 1673 env=env)
1674 except subprocess.CalledProcessError as e:
/usr/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
437 raise CalledProcessError(retcode, process.args,
--> 438 output=stdout, stderr=stderr)
439 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-24-3435b262f1ae> in <module>
----> 1 trainer.train()
~/.local/lib/python3.6/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
901 delay_optimizer_creation = self.sharded_ddp is not None and self.sharded_ddp != ShardedDDPOption.SIMPLE
902 if self.args.deepspeed:
--> 903 model, optimizer, lr_scheduler = init_deepspeed(self, num_training_steps=max_steps)
904 self.model = model.module
905 self.model_wrapped = model # will get further wrapped in DDP
~/.local/lib/python3.6/site-packages/transformers/integrations.py in init_deepspeed(trainer, num_training_steps)
416 model=model,
417 model_parameters=model_parameters,
--> 418 config_params=config,
419 )
420
~/.local/lib/python3.6/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params)
123 dist_init_required=dist_init_required,
124 collate_fn=collate_fn,
--> 125 config_params=config_params)
126 else:
127 assert mpu is None, "mpu must be None with pipeline parallelism"
~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params, dont_change_device)
181 self.lr_scheduler = None
182 if model_parameters or optimizer:
--> 183 self._configure_optimizer(optimizer, model_parameters)
184 self._configure_lr_scheduler(lr_scheduler)
185 self._report_progress(0)
~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_optimizer(self, client_optimizer, model_parameters)
596 logger.info('Using client Optimizer as basic optimizer')
597 else:
--> 598 basic_optimizer = self._configure_basic_optimizer(model_parameters)
599 if self.global_rank == 0:
600 logger.info(
~/.local/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_basic_optimizer(self, model_parameters)
670 optimizer = FusedAdam(model_parameters,
671 **optimizer_parameters,
--> 672 adam_w_mode=effective_adam_w_mode)
673
674 elif self.optimizer_name() == LAMB_OPTIMIZER:
~/.local/lib/python3.6/site-packages/deepspeed/ops/adam/fused_adam.py in __init__(self, params, lr, bias_correction, betas, eps, adam_w_mode, weight_decay, amsgrad, set_grad_none)
70 self.set_grad_none = set_grad_none
71
---> 72 fused_adam_cuda = FusedAdamBuilder().load()
73 # Skip buffer
74 self._dummy_overflow_buf = torch.cuda.IntTensor([0])
~/.local/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in load(self, verbose)
213 return importlib.import_module(self.absolute_name())
214 else:
--> 215 return self.jit_load(verbose)
216
217 def jit_load(self, verbose=True):
~/.local/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in jit_load(self, verbose)
250 extra_cuda_cflags=self.nvcc_args(),
251 extra_ldflags=self.extra_ldflags(),
--> 252 verbose=verbose)
253 build_duration = time.time() - start_build
254 if verbose:
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
1089 is_python_module,
1090 is_standalone,
-> 1091 keep_intermediates=keep_intermediates)
1092
1093
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
1300 verbose=verbose,
1301 with_cuda=with_cuda,
-> 1302 is_standalone=is_standalone)
1303 finally:
1304 baton.release()
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_standalone)
1405 build_directory,
1406 verbose,
-> 1407 error_prefix=f"Error building extension '{name}'")
1408
1409
~/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _run_ninja_build(build_directory, verbose, error_prefix)
1681 if hasattr(error, 'output') and error.output: # type: ignore
1682 message += f": {error.output.decode()}" # type: ignore
-> 1683 raise RuntimeError(message) from e
1684
1685
RuntimeError: Error building extension 'fused_adam'
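In case it helps, the failure can be reproduced outside of the Trainer by forcing the JIT build of the op directly (a rough sketch; the cache path is from the log above, and the import path is inferred from the stack trace):

```bash
# Clear any cached (failed) build and retry just the fused_adam compile
rm -rf ~/.cache/torch_extensions/fused_adam
python -c "from deepspeed.ops.op_builder import FusedAdamBuilder; FusedAdamBuilder().load(verbose=True)"
```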
The versions I'm using:
Collecting environment information...
PyTorch version: 1.8.0+cu101
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
Nvidia driver version: 450.51.06
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] kubeflow-pytorchjob==0.1.3
[pip3] numpy==1.18.5
[pip3] torch==1.8.0+cu101
[pip3] torchvision==0.8.1
[conda] Could not collect
transformers==4.4.2
DeepSpeed==0.3.13
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
But I was able to run DeepSpeed 0.3.10 with HuggingFace 4.3.2 and Torch 1.7.1+cu101 without any issue.
Please suggest how to proceed further.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 17 (10 by maintainers)
Thanks a lot @stas00
Finally it worked. As Colab has Python 3.7, I replicated what you said on my AWS EC2 instance, where I had several CUDA versions installed, including 10.1 and 11.1.
Reiterating the steps I followed so that they can help someone with similar issues (a sketch of the corresponding commands follows this list):
- Created a conda environment with python-3.6.9 (because the target machine where I want to run DeepSpeed has 3.6.9).
- Changed `PATH` and `LD_LIBRARY_PATH` to point to CUDA-10.1 (again because of my target machine), as suggested in HF's installation notes here.
- Verified the torch versions with `python -m torch.utils.collect_env`.
- Checked whether compatible ops were installed with `ds_report`.
- Built the wheel, which ends up in `dist/`, and installed it on the target machine using `pip install deepspeed-0.3.13+7fcc891-cp36-cp36m-linux_x86_64.whl`.
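For reference, a rough sketch of the commands the environment steps above correspond to, assuming CUDA 10.1 is installed under /usr/local/cuda-10.1 (paths on your machine may differ):

```bash
# Point the toolchain at CUDA 10.1 (the install path is an assumption)
export PATH=/usr/local/cuda-10.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH

# Verify which CUDA torch sees, and which DeepSpeed ops are compatible
python -m torch.utils.collect_env
ds_report
```

(The wheel build itself is sketched further down in this thread.)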
That's the correct way: `major=7, minor=0` => `7.0`.
Also, you can find the full list of all archs at https://developer.nvidia.com/cuda-gpus
Incidentally, I have just added all this information to the docs; hopefully it will be merged in the next few days:
Yes.
In my case both my build and target machines are the same, so I didn't use `TORCH_CUDA_ARCH_LIST`. But yeah, it's always better to set it explicitly.
For reference, I used `torch.cuda.get_device_properties(device)` to check my device architecture, which gives output like `_CudaDeviceProperties(name='Tesla V100-SXM2-32GB', major=7, minor=0, total_memory=32510MB, multi_processor_count=80)`. I'm not completely sure, but from that output I took my device architecture to be `7.0`.
One can also check the list of CUDA architectures that the installed torch is compiled for using `torch.cuda.get_arch_list()`, which gives output like:
- `['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75']` for torch==1.7.1+cu101
- `['sm_37', 'sm_50', 'sm_60', 'sm_70']` for torch==1.8.1+cu101
Not sure whether this is the correct way to check. Maybe @stas00 can confirm. (A sketch of these checks is below.)
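For completeness, a minimal sketch of those two checks from the command line (device index 0 is an assumption; use whichever device you actually run on):

```bash
# Compute capability of the GPU (major/minor, e.g. 7.0 for a V100)
python -c "import torch; print(torch.cuda.get_device_properties(0))"

# CUDA architectures the installed torch build was compiled for
python -c "import torch; print(torch.cuda.get_arch_list())"
```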
BTW, I do recommend you use an explicit `TORCH_CUDA_ARCH_LIST` for your GPUs during the build, since from what I understand you may get better performance that way, especially if your build machine doesn't have the same GPUs as your target machine.

Awesome! Thank you for the report, @saichandrapandraju
Except you don't need step 4. Step 5 is all you need after you cloned the repo.
Step 4 is for when you want to install it locally; it is similar to steps 5+6, but you don't get a wheel to take to another machine (see the sketch below).
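For illustration, a minimal sketch of that local-install variant, assuming the DeepSpeed repo is already cloned and that you want the ops pre-built via `DS_BUILD_OPS=1` (adjust the flags to your setup):

```bash
# Build and install DeepSpeed directly into the current environment;
# unlike the wheel route, nothing is left in dist/ to copy elsewhere
cd DeepSpeed
DS_BUILD_OPS=1 pip install .
```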
I had the same issue with fairscale on several setups: no matter what I tried, it wouldn't build at runtime, but prebuilding it into a wheel and installing that worked.
OK, so since you prebuilt from source on Colab (thank you for sharing the outcomes), you now know what's involved. It'll install dependencies just like when you don't pre-build from source. So if you are able to do `pip install deepspeed` on your setup, you can also do the same here, i.e. preinstall all the dependencies while you have the network, just like you'd do normally.

Here is yet another approach to consider. Build a binary wheel on whatever normal machine where you have a similar CUDA setup, along these lines:
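A rough sketch of such a wheel build, assuming a fresh clone of the DeepSpeed repo, `DS_BUILD_OPS=1` to pre-compile the ops, and an illustrative arch list (7.0 = V100, per the discussion elsewhere in this thread):

```bash
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
rm -rf build

# Pre-compile the CUDA ops for the target arch and produce a wheel under dist/
TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_OPS=1 python setup.py build_ext -j8 bdist_wheel
```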
Adjust `TORCH_CUDA_ARCH_LIST` for the archs required by the target machine.
Now you have `dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl` (it will be a different name depending on the build).
Now you can install it on your VM and you don't need to build anything at run time, you just do something like:
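For example (the wheel filename is the one mentioned above; yours will differ):

```bash
# Install the prebuilt wheel on the target machine; no JIT compilation at run time
pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl
```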
I presume you will already have the other dependencies installed, since you already did that for `pip install deepspeed`.
I wonder if DeepSpeed should document this approach on their advanced install page.
This is odd, since I have just re-run my notebook on the free version of Colab and it didn't have any problems.
You may have noticed you made progress, though: you managed to build the DeepSpeed extensions using this notebook, but then something killed the process immediately after it built the extension. So now you have a correct combination of packages.
Try to re-run that last cell, since the extension is now built and cached (that is, if you're in the same session; if not, start a new one and re-run the cell a second time if it dies again the first time).
In theory everybody gets mostly the same environment, but perhaps that's not so. Could you monitor that your disk space and RAM are not at 100%? Perhaps the watchdog kills the process when resources are exhausted.
I'm curious what happens if you run the training cell a second time.
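In case it helps with the resource check mentioned above, a minimal sketch of commands to watch from within the session (prefix them with "!" when running in a Colab cell):

```bash
df -h /       # disk usage of the root filesystem
free -h       # RAM and swap usage
nvidia-smi    # GPU memory, in case that is what's being exhausted
```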