DeepSpeed: [BUG] RuntimeError: Tensors must be contiguous error while finetuning with deepspeed.
I am trying to fine-tune "EleutherAI/gpt-neo-1.3B" for causal LM on Google Colab. Without DeepSpeed it simply runs out of memory, so I looked for options and found DeepSpeed. I added deepspeed="ds_config.json" to my TrainingArguments in the Jupyter notebook and used the configuration from the official page ("ds_config_zero2.json"). After that I started getting this error. I am running everything inside the notebook, not via the deepspeed command-line launcher.
To Reproduce: try fine-tuning gpt-neo, roughly as in the sketch below.
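A minimal sketch of the setup described above. The dataset, collator, and hyperparameters are placeholders filled in for illustration; only the model name and the deepspeed="ds_config.json" argument come from the report, and the environment variables are the usual single-process setup for running DeepSpeed from a notebook rather than via the launcher.

```python
import os

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Single-process distributed env vars -- with no deepspeed launcher involved,
# these have to be set by hand when running from a notebook.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "9994")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

model_name = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder dataset; the real one just needs a tokenized text column.
train_dataset = Dataset.from_dict({"text": ["hello world"] * 8}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=32),
    batched=True,
)

training_args = TrainingArguments(
    output_dir="gpt-neo-finetuned",   # placeholder
    per_device_train_batch_size=1,    # placeholder
    fp16=True,                        # placeholder
    deepspeed="ds_config.json",       # copy of the official ds_config_zero2.json
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # fails with the RuntimeError shown below
```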
This is the full error
The following columns in the training set don't have a corresponding argument in `GPTNeoForCausalLM.forward` and have been ignored: text. If text are not expected by `GPTNeoForCausalLM.forward`, you can safely ignore this message.
[2023-01-23 12:41:08,453] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
[<ipython-input-21-3435b262f1ae>](https://localhost:8080/#) in <module>
----> 1 trainer.train()
10 frames
[/usr/local/lib/python3.8/dist-packages/transformers/trainer.py](https://localhost:8080/#) in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1525 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
1526 )
-> 1527 return inner_training_loop(
1528 args=args,
1529 resume_from_checkpoint=resume_from_checkpoint,
[/usr/local/lib/python3.8/dist-packages/transformers/trainer.py](https://localhost:8080/#) in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1594 )
1595 if args.deepspeed:
-> 1596 deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
1597 self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_checkpoint
1598 )
[/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py](https://localhost:8080/#) in deepspeed_init(trainer, num_training_steps, resume_from_checkpoint, inference)
342 )
343
--> 344 deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
345
346 if resume_from_checkpoint is not None:
[/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py](https://localhost:8080/#) in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config, config_params)
123
124 if not isinstance(model, PipelineModule):
--> 125 engine = DeepSpeedEngine(args=args,
126 model=model,
127 optimizer=optimizer,
[/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py](https://localhost:8080/#) in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config, config_params, dont_change_device)
299
300 # Configure distributed model
--> 301 self._configure_distributed_model(model)
302
303 self._get_model_parameters()
[/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py](https://localhost:8080/#) in _configure_distributed_model(self, model)
1185
1186 if not self.amp_enabled():
-> 1187 self._broadcast_model()
1188
1189 # check if parameters are duplicated in optimizer param_groups
[/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py](https://localhost:8080/#) in _broadcast_model(self)
1100 else:
1101 if torch.is_tensor(p) and is_replicated(p):
-> 1102 dist.broadcast(p,
1103 groups._get_broadcast_src_rank(),
1104 group=self.data_parallel_group)
[/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py](https://localhost:8080/#) in log_wrapper(*args, **kwargs)
125 # Return the op, then stop the op's timer
126 try:
--> 127 return func(*args, **kwargs)
128 finally:
129 if comms_logger.enabled:
[/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py](https://localhost:8080/#) in broadcast(tensor, src, group, async_op, prof, log_name, debug)
230 debug=get_caller_func()):
231 global cdb
--> 232 return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
233
234
[/usr/local/lib/python3.8/dist-packages/deepspeed/comm/torch.py](https://localhost:8080/#) in broadcast(self, tensor, src, group, async_op)
68
69 def broadcast(self, tensor, src, group=None, async_op=False):
---> 70 return torch.distributed.broadcast(tensor=tensor,
71 src=src,
72 group=group,
[/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py](https://localhost:8080/#) in broadcast(tensor, src, group, async_op)
1402 group_src_rank = get_group_rank(group, src)
1403 opts.rootRank = group_src_rank
-> 1404 work = group.broadcast([tensor], opts)
1405 if async_op:
1406 return work
RuntimeError: Tensors must be contiguous
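Not part of the original report, but for context: torch.distributed.broadcast, which DeepSpeed's _broadcast_model ends up calling here, expects the tensor's memory to be laid out contiguously. A quick illustration of what that means for a plain tensor:

```python
import torch

x = torch.randn(4, 8)
y = x.t()                  # transposed view: same storage, strided layout
print(y.is_contiguous())   # False -- the kind of tensor the broadcast rejects
print(y.contiguous().is_contiguous())  # True -- .contiguous() makes a compact copy
```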
ds_report output
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.13.1+cu116
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.8.0, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.2
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6
System info: Google Colab
Launcher context: launched from inside the notebook, not with the deepspeed launcher or MPI
Docker context: none (standard Google Colab environment)
Additional context: none
About this issue
- State: open
- Created a year ago
- Comments: 24 (4 by maintainers)
In line with @FarzanT's comment, you may try making this change (comm.py L214) within DeepSpeed to minimize the risk.
It's working, but I need some time to check whether the learning curve makes sense.
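The exact diff isn't quoted in the thread, so the following is only a sketch of the kind of one-line workaround being described: make the tensor contiguous before it reaches the backend broadcast in deepspeed/comm/comm.py (the call the traceback above points at). Note that .contiguous() returns a copy for non-contiguous inputs, which is presumably why it is worth re-checking the learning curve afterwards.

```python
# deepspeed/comm/comm.py, inside broadcast() -- sketch only, not the upstream fix.
#
# Original call (see the traceback above):
#     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
#
# Suggested workaround: hand the backend a contiguous tensor instead.
#     return cdb.broadcast(tensor=tensor.contiguous(), src=src, group=group, async_op=async_op)
```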
Hello, I just faced the same issue. I found out that the problem lies in the device_map argument of Hugging Face's AutoModel... classes. Changing the argument from device_map="auto" to device_map=None fixed the issue for me! I hope this helps!

@KeeratKG Ah sorry, I don't recall; it should have been either huggyllama/llama-7b or Salesforce/codegen2-7B.
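A minimal sketch of the device_map workaround described above (the model name is just a placeholder; the point is the from_pretrained argument):

```python
from transformers import AutoModelForCausalLM

model_name = "EleutherAI/gpt-neo-1.3B"  # placeholder checkpoint

# device_map="auto" hands parameter placement to accelerate, which is generally
# not meant to be combined with DeepSpeed (DeepSpeed manages placement itself).
# Per the comment above, loading with device_map=None avoided the
# "Tensors must be contiguous" broadcast error.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=None)
```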