accelerate: "IndexError: tuple index out of range" for the zero_stage=3

I am trying to integrate DeepSpeed into this script and have successfully run it with ZeRO stage 2, but with ZeRO stage 3 this error appears just after the first epoch completes. I made the changes to the finetune_using_clm.py file suggested in this huggingface/accelerate repo and created a new file, tuned.py.

The ZeRO stage 3 error points to accelerator.backward(loss) at line 398 of tuned.py. The full traceback is:

Traceback (most recent call last):
  File "tuned.py", line 398, in main
    accelerator.backward(loss)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1310, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/deepspeed.py", line 156, in backward
    self.engine.backward(loss)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1860, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 2070, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 51, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 144, in backward
    ctx.pre_backward_function(ctx.module)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _run_before_backward_function
    self.pre_sub_module_backward_function(sub_module)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 487, in pre_sub_module_backward_function
    param_coordinator.trace_prologue(sub_module)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 147, in trace_prologue
    if sub_module != self.__submodule_order[self.__step_id]:
IndexError: tuple index out of range

I don’t know why it gives this error, as the script runs fine with ZeRO stage 2.

Any help in this regard would be highly appreciated.

I am using Google Colab for the task.

Packages version: mpi4py-3.1.4 deepspeed-0.7.6 accelerate-0.15.0 transformers-4.25.1

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 23

Most upvoted comments

Hi, thank you so much, @pacman100! It is okay now. Thanks again for taking the time to look into the issue. Means a lot!

Hello @asifehmad, after the eval loop you aren’t calling model.train() before resuming training. Add model.train() on line 447 here https://github.com/asifehmad/clm_model_tuning/blob/main/tuned.py#L447 and things should work. Also, the way you are saving the model is wrong when using DeepSpeed stage 3. Please refer to https://github.com/huggingface/accelerate/blob/main/examples/by_feature/deepspeed_with_config_support.py#L708-L722 for the same.
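The loop structure behind the fix can be sketched as follows. This is a minimal, stdlib-only sketch: the Model class here is just a stand-in that tracks training/eval mode the way torch.nn.Module.train()/.eval() toggle it, and run_epochs is a hypothetical helper, not code from tuned.py. It shows why forgetting model.train() after the eval loop means every epoch after the first runs the training phase in eval mode, which is what trips up ZeRO stage 3's module-order trace.

```python
# Sketch of the train/eval loop fix described above.
# "Model" stands in for the accelerate-prepared model in tuned.py;
# it only tracks whether it is in training or eval mode.

class Model:
    def __init__(self):
        self.training = True  # torch modules start in training mode

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

def run_epochs(num_epochs, fix_applied):
    """Return the model's mode at the start of each epoch's training phase."""
    model = Model()
    modes = []
    for _ in range(num_epochs):
        modes.append(model.training)  # training phase begins here
        # ... forward pass / accelerator.backward(loss) / optimizer.step() ...
        model.eval()                  # switch to eval for the evaluation loop
        # ... evaluation loop ...
        if fix_applied:
            model.train()             # the fix: restore training mode

    return modes

# Without the fix, every epoch after the first starts in eval mode.
print(run_epochs(3, fix_applied=False))  # [True, False, False]
print(run_epochs(3, fix_applied=True))   # [True, True, True]
```

For the saving part, the linked deepspeed_with_config_support.py example shows the pattern to copy: with stage 3 the parameters are partitioned across processes, so the state dict has to be gathered through the accelerator rather than read directly off the wrapped model.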