DeepSpeed: [BUG] RuntimeError: Tensors must be contiguous error while finetuning with deepspeed.

I am just trying to fine-tune "EleutherAI/gpt-neo-1.3B" for causal LM on Google Colab. Without DeepSpeed it runs out of memory. While looking for options I found DeepSpeed, so I added deepspeed='ds_config.json' to my training arguments in the Jupyter notebook and used the "ds_config_zero2.json" configuration from the official page. After that I started getting the error below. I am running everything from the notebook, not via the command line.

To Reproduce: try fine-tuning GPT-Neo with the Trainer and a DeepSpeed ZeRO-2 config, roughly as in the sketch below.
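
A minimal sketch of that setup, assuming a standard Trainer fine-tuning loop; the model name and the DeepSpeed config path come from the report above, while the tiny dummy dataset, output_dir, and batch size are illustrative placeholders only:

# Minimal repro sketch (not the exact notebook): fine-tune GPT-Neo with the
# Hugging Face Trainer and a DeepSpeed ZeRO-2 JSON config.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny stand-in dataset so the snippet is self-contained.
raw = Dataset.from_dict({"text": ["hello world"] * 8})

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=32)
    out["labels"] = out["input_ids"].copy()
    return out

train_dataset = raw.map(tokenize, batched=True, remove_columns=["text"])

training_args = TrainingArguments(
    output_dir="gpt-neo-finetune",
    per_device_train_batch_size=1,
    deepspeed="ds_config.json",  # the ZeRO-2 config copied from the HF docs
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()  # fails with "RuntimeError: Tensors must be contiguous"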

This is the full error:

The following columns in the training set don't have a corresponding argument in `GPTNeoForCausalLM.forward` and have been ignored: text. If text are not expected by `GPTNeoForCausalLM.forward`,  you can safely ignore this message.
[2023-01-23 12:41:08,453] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
[<ipython-input-21-3435b262f1ae>](https://localhost:8080/#) in <module>
----> 1 trainer.train()

10 frames
[/usr/local/lib/python3.8/dist-packages/transformers/trainer.py](https://localhost:8080/#) in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1525             self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1526         )
-> 1527         return inner_training_loop(
   1528             args=args,
   1529             resume_from_checkpoint=resume_from_checkpoint,

[/usr/local/lib/python3.8/dist-packages/transformers/trainer.py](https://localhost:8080/#) in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1594         )
   1595         if args.deepspeed:
-> 1596             deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
   1597                 self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_checkpoint
   1598             )

[/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py](https://localhost:8080/#) in deepspeed_init(trainer, num_training_steps, resume_from_checkpoint, inference)
    342     )
    343 
--> 344     deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
    345 
    346     if resume_from_checkpoint is not None:

[/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py](https://localhost:8080/#) in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config, config_params)
    123 
    124     if not isinstance(model, PipelineModule):
--> 125         engine = DeepSpeedEngine(args=args,
    126                                  model=model,
    127                                  optimizer=optimizer,

[/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py](https://localhost:8080/#) in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config, config_params, dont_change_device)
    299 
    300         # Configure distributed model
--> 301         self._configure_distributed_model(model)
    302 
    303         self._get_model_parameters()

[/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py](https://localhost:8080/#) in _configure_distributed_model(self, model)
   1185 
   1186         if not self.amp_enabled():
-> 1187             self._broadcast_model()
   1188 
   1189     # check if parameters are duplicated in optimizer param_groups

[/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py](https://localhost:8080/#) in _broadcast_model(self)
   1100             else:
   1101                 if torch.is_tensor(p) and is_replicated(p):
-> 1102                     dist.broadcast(p,
   1103                                    groups._get_broadcast_src_rank(),
   1104                                    group=self.data_parallel_group)

[/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py](https://localhost:8080/#) in log_wrapper(*args, **kwargs)
    125         # Return the op, then stop the op's timer
    126         try:
--> 127             return func(*args, **kwargs)
    128         finally:
    129             if comms_logger.enabled:

[/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py](https://localhost:8080/#) in broadcast(tensor, src, group, async_op, prof, log_name, debug)
    230               debug=get_caller_func()):
    231     global cdb
--> 232     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
    233 
    234 

[/usr/local/lib/python3.8/dist-packages/deepspeed/comm/torch.py](https://localhost:8080/#) in broadcast(self, tensor, src, group, async_op)
     68 
     69     def broadcast(self, tensor, src, group=None, async_op=False):
---> 70         return torch.distributed.broadcast(tensor=tensor,
     71                                            src=src,
     72                                            group=group,

[/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py](https://localhost:8080/#) in broadcast(tensor, src, group, async_op)
   1402         group_src_rank = get_group_rank(group, src)
   1403         opts.rootRank = group_src_rank
-> 1404         work = group.broadcast([tensor], opts)
   1405     if async_op:
   1406         return work

RuntimeError: Tensors must be contiguous

ds_report output

DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.13.1+cu116
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.8.0, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6

System info: Google Colab.

Launcher context: running from a Jupyter notebook, not via the deepspeed launcher or MPI.


About this issue

  • State: open
  • Created a year ago
  • Comments: 24 (4 by maintainers)

Most upvoted comments

In line with @FarzanT's comment, you can try making the following change inside DeepSpeed (comm.py, around L214) as a low-risk workaround.

It's working, but I need some time to check whether the learning curve still makes sense.

# deepspeed/comm/comm.py
@timed_op
def broadcast(tensor, src, group=None, async_op=False, prof=False, log_name='broadcast', debug=get_caller_func()):
    global cdb
    # Work around "Tensors must be contiguous": make a contiguous copy before broadcasting.
    if not tensor.is_contiguous():
        tensor = tensor.contiguous()
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
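
If you prefer not to patch the installed package, an untested alternative (a sketch only, same idea applied on the model side) is to make every parameter contiguous before handing the model to the Trainer, so the broadcast never sees a non-contiguous tensor:

# Untested sketch: make all parameters contiguous up front instead of
# patching deepspeed/comm/comm.py; run this before creating the Trainer.
for param in model.parameters():
    if not param.data.is_contiguous():
        param.data = param.data.contiguous()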

Hello, I just faced the same issue. I found out that the problem lies in the device_map argument of Hugging Face's AutoModel... classes. Changing the argument from device_map="auto" to device_map=None fixed the issue for me. I hope this helps!
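
For reference, a minimal sketch of that change (the model name here is only an example): with device_map="auto" the weights are dispatched by accelerate, which appears to be what clashes with DeepSpeed's parameter broadcast.

# Sketch of the device_map fix; the model name is only an example.
from transformers import AutoModelForCausalLM

# model = AutoModelForCausalLM.from_pretrained(
#     "EleutherAI/gpt-neo-1.3B", device_map="auto")  # triggers the contiguity error with DeepSpeed
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-1.3B", device_map=None)      # let DeepSpeed handle device placement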

@KeeratKG Ah sorry, I don't recall; it should have been either huggyllama/llama-7b or Salesforce/codegen2-7B.