DeepSpeed: Error building extension 'cpu_adam'

Hey guys, I’m having a problem getting DeepSpeed working with XLM-Roberta. I’m trying to run it on an Amazon Linux machine, which is based on Red Hat. Here are the versions of the packages/dependencies I’m using:

cuda version: 10.2
transformers: 4.4.2
pytorch: 1.7.1
deepspeed: 0.3.13
gcc/c++/g++: (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)

I must admit I had some issues upgrading CUDA from the instance’s default 10.0 to 10.2, and GCC from 4.8.5 to 7.2.1. But since I no longer get the errors about the torch and installed CUDA versions being different, or about GCC being older than version 5, I’d assume I’m in the clear.

Here’s the essential part of the code I’m running (from a notebook):

import os
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9994' # modify if RuntimeError: Address already in use
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"

from transformers import Trainer, TrainingArguments, XLMRobertaForSequenceClassification, XLMRobertaTokenizer

model = XLMRobertaForSequenceClassification.from_pretrained('xlm-roberta-base')

training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    save_steps=500,
    save_total_limit=2,
    deepspeed="my_ds_config.json"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()

Here’s the content of my config file:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "cpu_offload": true
    },

    "optimizer": {
        "type": "Adam",
        "params": {
            "adam_w_mode": true,
            "lr": 3e-5,
            "betas": [ 0.9, 0.999 ],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    }
}

Here’s the output of ds_report:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
/bin/sh: line 0: type: llvm-config: not found
/bin/sh: line 0: type: llvm-config-9: not found
 [WARNING]  sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch']
torch version .................... 1.7.1
torch cuda version ............... 10.2
nvcc version ..................... 10.2
deepspeed install path ........... ['/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.3.13+22d5a1f, 22d5a1f, master
deepspeed wheel compiled w. ...... torch 1.7, cuda 10.2

And finally, here’s the stack trace:

[2021-03-24 15:29:36,478] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.13, git-hash=unknown, git-branch=unknown
[2021-03-24 15:29:36,494] [INFO] [engine.py:77:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/ec2-user/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-131-cc14ac05ecbb> in <module>
     30 )
     31 
---> 32 trainer.train()

~/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
    901         delay_optimizer_creation = self.sharded_ddp is not None and self.sharded_ddp != ShardedDDPOption.SIMPLE
    902         if self.args.deepspeed:
--> 903             model, optimizer, lr_scheduler = init_deepspeed(self, num_training_steps=max_steps)
    904             self.model = model.module
    905             self.model_wrapped = model  # will get further wrapped in DDP

~/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/integrations.py in init_deepspeed(trainer, num_training_steps)
    416         model=model,
    417         model_parameters=model_parameters,
--> 418         config_params=config,
    419     )
    420 

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params)
    123                                  dist_init_required=dist_init_required,
    124                                  collate_fn=collate_fn,
--> 125                                  config_params=config_params)
    126     else:
    127         assert mpu is None, "mpu must be None with pipeline parallelism"

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params, dont_change_device)
    181         self.lr_scheduler = None
    182         if model_parameters or optimizer:
--> 183             self._configure_optimizer(optimizer, model_parameters)
    184             self._configure_lr_scheduler(lr_scheduler)
    185             self._report_progress(0)

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_optimizer(self, client_optimizer, model_parameters)
    596                 logger.info('Using client Optimizer as basic optimizer')
    597         else:
--> 598             basic_optimizer = self._configure_basic_optimizer(model_parameters)
    599             if self.global_rank == 0:
    600                 logger.info(

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_basic_optimizer(self, model_parameters)
    665                     optimizer = DeepSpeedCPUAdam(model_parameters,
    666                                                  **optimizer_parameters,
--> 667                                                  adamw_mode=effective_adam_w_mode)
    668                 else:
    669                     from deepspeed.ops.adam import FusedAdam

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/adam/cpu_adam.py in __init__(self, model_params, lr, bias_correction, betas, eps, weight_decay, amsgrad, adamw_mode)
     76         DeepSpeedCPUAdam.optimizer_id = DeepSpeedCPUAdam.optimizer_id + 1
     77         self.adam_w_mode = adamw_mode
---> 78         self.ds_opt_adam = CPUAdamBuilder().load()
     79 
     80         self.ds_opt_adam.create_adam(self.opt_id,

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in load(self, verbose)
    213             return importlib.import_module(self.absolute_name())
    214         else:
--> 215             return self.jit_load(verbose)
    216 
    217     def jit_load(self, verbose=True):

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in jit_load(self, verbose)
    250             extra_cuda_cflags=self.nvcc_args(),
    251             extra_ldflags=self.extra_ldflags(),
--> 252             verbose=verbose)
    253         build_duration = time.time() - start_build
    254         if verbose:

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
   1089     if isinstance(cuda_sources, str):
   1090         cuda_sources = [cuda_sources]
-> 1091 
   1092     cpp_sources.insert(0, '#include <torch/extension.h>')
   1093 

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
   1315 
   1316 
-> 1317 def verify_ninja_availability():
   1318     r'''
   1319     Raises ``RuntimeError`` if `ninja <https://ninja-build.org/>`_ build system is not

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _import_module_from_library(module_name, path, is_python_module)
   1697                       sources,
   1698                       objects,
-> 1699                       ldflags,
   1700                       library_target,
   1701                       with_cuda) -> None:

~/anaconda3/envs/python3/lib/python3.6/imp.py in find_module(name, path)
    295         break  # Break out of outer loop when breaking out of inner loop.
    296     else:
--> 297         raise ImportError(_ERR_MSG.format(name), name=name)
    298 
    299     encoding = None

ImportError: No module named 'cpu_adam'

Thanks in advance for your help!

About this issue

State: closed
Created 3 years ago
Reactions: 1
Comments: 18 (8 by maintainers)

Most upvoted comments

I ran into this too. Before the AttributeError: ‘DeepSpeedCPUAdam’ object has no attribute ‘ds_opt_adam’, the log showed another error: it could not create a directory in /tmp/torch_extensions/build for cpu_adam. So I changed DEFAULT_TORCH_EXTENSION_PATH in the file /anaconda3/envs/XXXXX/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py, and then it worked.
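A less invasive variant of the same idea, as a minimal sketch: PyTorch's JIT extension loader reads the TORCH_EXTENSIONS_DIR environment variable to decide where to build, so pointing it at a writable directory before DeepSpeed is imported may avoid editing builder.py. That a given DeepSpeed version honors the variable for every op is an assumption here, and the helper name is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def use_writable_extensions_dir(candidate: Path) -> Path:
    """Point TORCH_EXTENSIONS_DIR at a directory we can actually write to."""
    candidate.mkdir(parents=True, exist_ok=True)
    # Fail early if the directory is not writable, instead of at build time.
    with tempfile.NamedTemporaryFile(dir=str(candidate)):
        pass
    os.environ["TORCH_EXTENSIONS_DIR"] = str(candidate)
    return candidate
```

Call it before importing deepspeed or starting the Trainer, for example with `use_writable_extensions_dir(Path.home() / ".cache" / "my_torch_extensions")` (a hypothetical location; any directory owned by the current user should do).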

ziweiji on Nov 17, 2021

In my case, this bug was caused by a ninja compile error. You can change directory to ~/.cache/torch_extensions/cpu_adam and run ninja -v there to see the error details.
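The same check can be scripted from a notebook, where switching to a shell is awkward. This is only a sketch: it assumes the default extensions root from the log above (~/.cache/torch_extensions, or TORCH_EXTENSIONS_DIR if set) and that the earlier JIT attempt left a build.ninja file behind. The function names are hypothetical.

```python
import os
import shutil
import subprocess
from pathlib import Path


def extension_build_dir(op_name: str = "cpu_adam") -> Path:
    """Return the directory where the JIT build for `op_name` is expected."""
    root = os.environ.get("TORCH_EXTENSIONS_DIR")
    base = Path(root) if root else Path.home() / ".cache" / "torch_extensions"
    return base / op_name


def rerun_ninja(build_dir: Path) -> str:
    """Run `ninja -v` in build_dir and return its combined stdout/stderr."""
    if not (build_dir / "build.ninja").is_file():
        return f"no build.ninja in {build_dir}; nothing to rebuild"
    if shutil.which("ninja") is None:
        return "ninja is not on PATH"
    result = subprocess.run(
        ["ninja", "-v"],
        cwd=str(build_dir),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # text mode, also works on Python 3.6
    )
    return result.stdout


if __name__ == "__main__":
    print(rerun_ninja(extension_build_dir()))
```

The compiler or linker error that the Python traceback hides should show up in the printed output.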

Chiang97912 on Apr 14, 2023

Thanks @stas00 for clarifying this : )

RezaYazdaniAminabadi on Mar 26, 2021

We already have examples for running for some transformer networks. For this argument, I think you might just add local_rank to your parser arguments the same as here.

This is no longer needed in deepspeed since https://github.com/microsoft/DeepSpeed/pull/825 and transformers master has been adjusted accordingly. You just need to have env LOCAL_RANK to be set.

I do have a follow-up question though: Correct me if I’m wrong, but the only way to use DeepSpeed would be to use the HuggingFace Trainer class?

Not at all. You can do your own integration and not rely on the HF Trainer.

If you do use the transformers Trainer for the time being, while this is all new, you must use the transformers master branch, as frequent deepspeed-related updates are made.

If you have build problems please make sure you read: https://huggingface.co/transformers/main_classes/trainer.html#installation-notes though looking at OP I think you have all the right components. Just check that PATH/LD_LIBRARY_PATH are good.
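A quick way to do that check from Python, as a rough sketch: it only reports which executables PATH resolves to and which LD_LIBRARY_PATH entries mention CUDA, so after an upgrade like the OP's (CUDA 10.0 to 10.2, GCC 4.8.5 to 7.2.1) a leftover older toolchain would be visible. The tool list and the "cuda" keyword are assumptions; adjust them to your setup.

```python
import os
import shutil


def toolchain_report(tools=("nvcc", "gcc", "g++", "c++", "ninja")):
    """Map each build tool to the executable PATH resolves it to (or None)."""
    return {tool: shutil.which(tool) for tool in tools}


def library_path_entries(keyword="cuda"):
    """List LD_LIBRARY_PATH entries whose path mentions `keyword`."""
    entries = os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    return [e for e in entries if e and keyword in e.lower()]


for tool, location in toolchain_report().items():
    print(f"{tool:6} -> {location or 'NOT FOUND on PATH'}")
print("CUDA entries in LD_LIBRARY_PATH:", library_path_entries() or "none")
```

Run it in the same kernel as the training code, since a notebook server can have a different environment than your login shell.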

Perhaps try to pre-build deepspeed: https://github.com/microsoft/DeepSpeed/issues/885#issuecomment-808339237
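The linked comment is not reproduced here. As an assumption, "pre-build" is taken to mean reinstalling DeepSpeed with its DS_BUILD_* install-time switches, so that cpu_adam is compiled during pip install rather than JIT-compiled at runtime. The sketch below only assembles the command; the helper name is hypothetical, and the exact pip flags are one reasonable choice, not the maintainers' recipe.

```python
import os
import sys


def prebuild_command(ops=("CPU_ADAM",)):
    """Return (env, argv) for a pip install that compiles `ops` ahead of time."""
    env = dict(os.environ)
    for op in ops:
        # e.g. DS_BUILD_CPU_ADAM=1 asks the installer to build that op.
        env[f"DS_BUILD_{op}"] = "1"
    argv = [
        sys.executable, "-m", "pip", "install",
        "--no-cache-dir",      # avoid reusing a previously built wheel
        "--force-reinstall",
        "--no-deps",           # leave torch and other dependencies untouched
        "deepspeed",
    ]
    return env, argv


env, argv = prebuild_command()
print("DS_BUILD_CPU_ADAM=1", " ".join(argv))
# To actually run it: subprocess.run(argv, env=env, check=True)
```

If the install-time build fails, the compiler error is printed directly by pip, which is usually easier to read than the ImportError from the JIT path.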

stas00 on Mar 26, 2021