DeepSpeed: AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

Hi, I want to use DeepSpeed to speed up my transformer, and I ran into the following problem:

  File "main.py", line 460, in <module>
    main(args)
  File "main.py", line 392, in main
    train_stats = train_one_epoch(
  File "/opt/ml/code/deepspeed/engine.py", line 57, in train_one_epoch
    loss_scaler(loss, optimizer, clip_grad=clip_grad, clip_mode=clip_mode,
  File "/usr/local/lib/python3.8/dist-packages/timm/utils/cuda.py", line 43, in __call__
    self._scaler.scale(loss).backward(create_graph=create_graph)
  File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 661, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 1104, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 724, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

My config.json is as follows:

{
  "gradient_accumulation_steps": 1,
  "train_micro_batch_size_per_gpu":1,
  "steps_per_print": 100,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00001,
      "weight_decay": 1e-2
    }
  },
  "flops_profiler": {
    "enabled": false,
    "profile_step": 100,
    "module_depth": -1,
    "top_modules": 3,
    "detailed": true
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 18,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 1,
    "cpu_offload": false,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 1e8,
    "allgather_bucket_size": 5e8
  },
  "activation_checkpointing": {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "zero_allow_untested_optimizer": true
}

About this issue

State: open
Created 3 years ago
Reactions: 1
Comments: 23 (7 by maintainers)

Most upvoted comments

I solved it by using DeepSpeedEngine.backward(loss) and DeepSpeedEngine.step() instead of the native torch loss.backward() and optimizer.step().

workingloong on Dec 13, 2023
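
The fix workingloong describes can be sketched roughly as follows. This is only an illustration, not code from the thread: `train_one_epoch` and `batches` are hypothetical names, `model_engine` is assumed to be the engine object returned as the first value of `deepspeed.initialize(...)`, and the forward pass is assumed to return the loss directly.

```python
def train_one_epoch(model_engine, batches):
    """Run one epoch, routing backward and step through the DeepSpeed engine.

    Assumptions (not from the thread): `model_engine` is the engine returned
    by `deepspeed.initialize(...)`, and calling it on a batch returns the loss.
    """
    losses = []
    for batch in batches:
        loss = model_engine(batch)    # forward pass through the engine
        model_engine.backward(loss)   # replaces loss.backward()
        model_engine.step()           # replaces optimizer.step()
        losses.append(loss)
    return losses
```

Every traceback in this thread reaches the ZeRO gradient hook from a plain tensor backward call (timm's loss scaler, grad_cache, lightning_lite, or torch.autograd.backward) rather than from the engine. That is consistent with this fix, although the thread does not confirm it for each case.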

Hi @jeffra, yes I’m experiencing the same issue. Here is the error I get:

  File "/root/envs/star/lib/python3.8/site-packages/grad_cache/grad_cache.py", line 242, in forward_backward
    surrogate.backward()
  File "/root/envs/star/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/envs/star/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 769, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1250, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 826, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

And here is my config file:

{
    "zero_optimization": {
       "stage": 2,
       "offload_optimizer": {
           "device": "cpu",
           "pin_memory": true
       },
       "allgather_partitions": true,
       "allgather_bucket_size": 2e8,
       "reduce_scatter": true,
       "reduce_bucket_size": 2e8,
       "overlap_comm": false,
       "contiguous_gradients": false
    },

    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

ant-louis on Jun 1, 2022

I got the same issue, but fixed it by removing a redundant backward call.

        outputs = model_engine(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        # loss.backward()   # remove this line
        model_engine.backward(loss)
        model_engine.step()

The code came from ChatGPT, so the mistake is excusable.

dabney777 on Jun 26, 2023

Well, let me join this thread too… I have the same issue as described above.

The code I run can be found here: https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/train.py

The configuration I use:

{
    "zero_allow_untested_optimizer": True,
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "allgather_partitions": True,
        "reduce_scatter": True,
        "allgather_bucket_size": 200000000,
        "reduce_bucket_size": 200000000,
        "sub_group_size": 1000000000000,
    },
    "activation_checkpointing": {
        "partition_activations": False,
        "cpu_checkpointing": False,
        "contiguous_memory_optimization": False,
        "synchronize_checkpoint_boundary": False,
    },
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
        "thread_count": 1,
    },
    "gradient_clipping": 1.0,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
}

Traceback:

Traceback (most recent call last):
  File "train.py", line 367, in <module>
    trainer.run(m_cfg, train_dataset, None, tconf)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 433, in _run_impl
    return self._strategy.launcher.launch(run_method, *args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 443, in _run_with_setup
    return run_method(*args, **kwargs)
  File "/home/alexkay28/RWKV-LM/RWKV-v4/src/trainer.py", line 177, in run
    run_epoch('train')
  File "/home/alexkay28/RWKV-LM/RWKV-v4/src/trainer.py", line 129, in run_epoch
    self.backward(loss)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 260, in backward
    self._precision.backward(tensor, module, *args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/plugins/precision/precision.py", line 68, in backward
    tensor.backward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 482, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 804, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1252, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 847, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(0, self.elements_in_ipg_bucket, param.numel())
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

AlexKay28 on Apr 20, 2023

Isn’t this problem solved? I’m currently facing a similar error. I’m using FusedAdam as an optimizer so I’m not using the FP16 option, but it’s similar.

Here is the error I get:

Traceback (most recent call last):
  File "/root/QuickDraw/train.py", line 244, in <module>
    train(opt)
  File "/root/QuickDraw/train.py", line 165, in train
    torch.autograd.backward(loss)
  File "/project/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 857, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1349, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 902, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

This is my deepspeed_config file:

{
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "steps_per_print": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    }
}

Changing "stage": 2 to "stage": 1 solved it.

heojeongyun on Mar 8, 2023