DeepSpeed: AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

Hi, I want to use DeepSpeed to speed up my transformer, and I ran into the following problem:

  File "main.py", line 460, in <module>
    main(args)
  File "main.py", line 392, in main
    train_stats = train_one_epoch(
  File "/opt/ml/code/deepspeed/engine.py", line 57, in train_one_epoch
    loss_scaler(loss, optimizer, clip_grad=clip_grad, clip_mode=clip_mode,
  File "/usr/local/lib/python3.8/dist-packages/timm/utils/cuda.py", line 43, in __call__
    self._scaler.scale(loss).backward(create_graph=create_graph)
  File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 661, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 1104, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 724, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

My config.json is as follows:

{
  "gradient_accumulation_steps": 1,
  "train_micro_batch_size_per_gpu":1,
  "steps_per_print": 100,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00001,
      "weight_decay": 1e-2
    }
  },
  "flops_profiler": {
    "enabled": false,
    "profile_step": 100,
    "module_depth": -1,
    "top_modules": 3,
    "detailed": true
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 18,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
      "stage": 1,
      "cpu_offload": false,
      "contiguous_gradients": true,
      "overlap_comm": true,
      "reduce_scatter": true,
      "reduce_bucket_size":1e8,
      "allgather_bucket_size": 5e8

  },
  "activation_checkpointing": {
      "partition_activations": false,
      "contiguous_memory_optimization": false,
      "cpu_checkpointing": false
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "zero_allow_untested_optimizer": true
}

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 23 (7 by maintainers)

Most upvoted comments

I solved it by calling DeepSpeedEngine.backward(loss) and DeepSpeedEngine.step() instead of the torch-native loss.backward() and optimizer.step().
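For reference, here is a minimal sketch of such a training loop. The names model, dataloader, loss_fn, and config.json are placeholders and not from the original code; deepspeed.initialize returns the engine whose backward() and step() should be called:

    import deepspeed

    # deepspeed.initialize wraps the model into a DeepSpeedEngine and sets up
    # the ZeRO optimizer according to config.json (placeholder path).
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config="config.json",
    )

    for inputs, labels in dataloader:
        outputs = model_engine(inputs)
        loss = loss_fn(outputs, labels)
        model_engine.backward(loss)   # instead of loss.backward()
        model_engine.step()           # instead of optimizer.step()

In particular, an external loss scaler such as timm's (as seen in the original traceback) calls the torch-native backward and bypasses the engine, which appears to be exactly the path that hits the missing ipg_index.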

Hi @jeffra, yes I’m experiencing the same issue. Here is the error I get:

  File "/root/envs/star/lib/python3.8/site-packages/grad_cache/grad_cache.py", line 242, in forward_backward
    surrogate.backward()
  File "/root/envs/star/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/envs/star/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 769, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1250, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/root/envs/star/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 826, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

And here is my config file:

{
    "zero_optimization": {
       "stage": 2,
       "offload_optimizer": {
           "device": "cpu",
           "pin_memory": true
       },
       "allgather_partitions": true,
       "allgather_bucket_size": 2e8,
       "reduce_scatter": true,
       "reduce_bucket_size": 2e8,
       "overlap_comm": false,
       "contiguous_gradients": false
    },

    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

I got the same issue, but fixed it by removing a redundant backward call.

        outputs = model_engine(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        # loss.backward()   # remove this line
        model_engine.backward(loss)
        model_engine.step()

And that code came from ChatGPT, so the mistake is excusable.

Well, let me join this thread too… I have the same issue as described above.

The code I run can be found here: https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v4/train.py

The configuration I use:

{
    "zero_allow_untested_optimizer": True,
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "allgather_partitions": True,
        "reduce_scatter": True,
        "allgather_bucket_size": 200000000,
        "reduce_bucket_size": 200000000,
        "sub_group_size": 1000000000000,
    },
    "activation_checkpointing": {
        "partition_activations": False,
        "cpu_checkpointing": False,
        "contiguous_memory_optimization": False,
        "synchronize_checkpoint_boundary": False,
    },
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
        "thread_count": 1,
    },
    "gradient_clipping": 1.0,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
}

Traceback:

Traceback (most recent call last):
  File "train.py", line 367, in <module>
    trainer.run(m_cfg, train_dataset, None, tconf)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 433, in _run_impl
    return self._strategy.launcher.launch(run_method, *args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 443, in _run_with_setup
    return run_method(*args, **kwargs)
  File "/home/alexkay28/RWKV-LM/RWKV-v4/src/trainer.py", line 177, in run
    run_epoch('train')
  File "/home/alexkay28/RWKV-LM/RWKV-v4/src/trainer.py", line 129, in run_epoch
    self.backward(loss)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/lite.py", line 260, in backward
    self._precision.backward(tensor, module, *args, **kwargs)
  File "/home/vscode/.local/lib/python3.8/site-packages/lightning_lite/plugins/precision/precision.py", line 68, in backward
    tensor.backward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 482, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 804, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1252, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/home/vscode/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 847, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(0, self.elements_in_ipg_bucket, param.numel())
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

Isn’t this problem solved yet? I’m currently facing a similar error. I’m using FusedAdam as the optimizer, so I’m not using the FP16 option, but the error is much the same.

Here is the error I get:

Traceback (most recent call last):
  File "/root/QuickDraw/train.py", line 244, in <module>
    train(opt)
  File "/root/QuickDraw/train.py", line 165, in train
    torch.autograd.backward(loss)
  File "/project/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 857, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1349, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/project/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 902, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index' 

This is my deepspeed_config file:

{
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true
    },

    "steps_per_print": 1,

    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    }
}

Changing "stage": 2 to "stage": 1 under zero_optimization solved it for me.
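For reference, a minimal sketch of that change against the zero_optimization block above (the stage value is switched from 2 to 1; offload_param is omitted here because parameter offload is only meaningful with ZeRO stage 3):

    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "cpu"
        },
        "overlap_comm": true,
        "contiguous_gradients": true
    }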