DeepSpeed: [BUG] RuntimeError: still have inflight params [.ds_summary of Parameter containing:
Describe the bug
what’s the possible reason for error below
File "/home/xihe/xinhe/distNAS/DeepspeedNAS/train.py", line 200, in train_zero
engine.backward(loss)
File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/engine.py", line 1980, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 2088, in backward
self._get_param_coordinator(training=True).reset_step()
File "/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 185, in reset_step
raise RuntimeError(
RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/cm/extra/Utils/CUDA/11.1.0.0_455.23.05'
DeepSpeed general environment info:
torch install path ............... ['/datasets/xihe/miniconda3/envs/colossal/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/home/xihe/xinhe/deepspeed/DeepSpeed/deepspeed']
deepspeed info ................... 0.8.3+3667758, 3667758, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 25 (6 by maintainers)
Commits related to this issue
- fix #3156 — committed to microsoft/DeepSpeed by HeyangQin a year ago
@iamsile Hi, Could you please tell me how to fix this? Many Thanks.
Hello @justHungryMan @tobideusser. This issue has been fixed by a collaborative effort with the lightning team. Please update the deepspeed and lightning to apply the fix. Thank you.
@marsggbo if the error is still there even with the latest deepspeed. Please feel free to reopen this issue with a reproduce script.