DeepSpeed: [BUG] RuntimeError: still have inflight params [.ds_summary of Parameter containing:

Describe the bug

What is the possible cause of the error below?

  File "/home/xihe/xinhe/distNAS/DeepspeedNAS/train.py", line 200, in train_zero
    engine.backward(loss)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/engine.py", line 1980, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 2088, in backward
    self._get_param_coordinator(training=True).reset_step()
  File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 185, in reset_step
    raise RuntimeError(
RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/cm/extra/Utils/CUDA/11.1.0.0_455.23.05'
DeepSpeed general environment info:
torch install path ............... ['/datasets/xihe/miniconda3/envs/colossal/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed']
deepspeed info ................... 0.8.3+3667758, 3667758, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

About this issue

  • State: closed
  • Created a year ago
  • Comments: 25 (6 by maintainers)

Most upvoted comments

@iamsile Hi, could you please tell me how to fix this? Many thanks.

Hello @justHungryMan @tobideusser. This issue has been fixed through a collaborative effort with the Lightning team. Please update both DeepSpeed and Lightning to pick up the fix. Thank you.
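
To confirm which versions are actually installed before and after upgrading, a minimal sketch like the one below may help. The helper name `installed_versions` is hypothetical, and the distribution names `deepspeed`, `lightning`, and `pytorch-lightning` are assumptions about how the packages were installed in your environment. The thread does not say which releases contain the fix, so this only reports what is present.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version,
    or to None if the distribution is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to match your setup.
    names = ["deepspeed", "lightning", "pytorch-lightning"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this inside the same environment used for training (the `colossal` conda env in the report above) avoids checking a different Python installation by mistake.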

@marsggbo If the error persists even with the latest DeepSpeed, please feel free to reopen this issue with a reproduction script.