DeepSpeed: [BUG] Deepspeed Zero 3 Inference InFlight Params with new HuggingFace Mixtral Model
Describe the bug: I tried running DeepSpeed ZeRO-3 inference on the new HuggingFace Mixtral model and got the following error:
[2023-12-13 04:12:18,837] [WARNING] [parameter_offload.py:86:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'transformers.cache_utils.DynamicCache'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
Invalidate trace cache @ step 14: expected module 19, but got module 34
Traceback (most recent call last):
File "/home/ubuntu/mixtral_hf/deepspeed_zero.py", line 36, in <module>
outputs = model.generate(inputs, max_new_tokens=20)
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/mixtral_hf/transformers/src/transformers/generation/utils.py", line 1731, in generate
return self.greedy_search(
File "/home/ubuntu/mixtral_hf/transformers/src/transformers/generation/utils.py", line 2592, in greedy_search
outputs = self(
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1581, in _call_impl
hook_result = hook(self, args, result)
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 350, in _end_of_forward_hook
self.get_param_coordinator(training=False).reset_step()
File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 203, in reset_step
raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [{'id': 9, 'status': 'AVAILABLE', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 11, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 10, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 15, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 17, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 16, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 21, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 23, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 22, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 27, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}]
To Reproduce: a simple inference script that reproduces the behavior:
# Imports added for completeness; HfDeepSpeedConfig lives in
# transformers.integrations in recent transformers releases.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"
ds_config = {
    "bf16": {
        "enabled": True,
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
        },
    },
    "train_micro_batch_size_per_gpu": 1,
}
# Must be created before from_pretrained so the model is loaded directly
# into the ZeRO-3 sharded/offloaded form.
hfdsc = HfDeepSpeedConfig(ds_config)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()
model = ds_engine.module

inputs = tokenizer.encode("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=20)
output_str = tokenizer.decode(outputs[0])
Required packages and their versions:
- HuggingFace Transformers 4.65
- DeepSpeed 0.12.4
- Torch 2.1
- CUDA 12.1
ds_report output
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch']
torch version .................... 2.1.1
deepspeed install path ........... ['/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.12.4, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 124.52 GB
System info:
- AWS g5.16xlarge instance
- OS: Ubuntu 22.04
- GPU: NVIDIA A10G
- GPU count: 1
- Python version: 3.10.13
About this issue
- State: open
- Created 7 months ago
- Reactions: 2
- Comments: 36 (12 by maintainers)
Commits related to this issue
- Add API to set a module as a leaf node when recursively setting Z3 hooks (#4966) ZeRO3 does not work with MoE models because the order of executing modules can change at every forward/backward pass (... — committed to microsoft/DeepSpeed by tohtana 5 months ago
Guys, thanks for the great debugging and collaboration here to understand this problem. The fundamental issue is that ZeRO-3 caches the parameter trace to enable parameter prefetching, which reduces all-gather latency. Unfortunately, since MoE layers can activate different experts across iterations, the parameter trace cache is invalidated whenever the set of active experts changes; the warning messages report these trace-cache invalidations. In this case the warning is avoidable, since prefetching is disabled by setting "stage3_prefetch_bucket_size": 0, so only a minor fix is required here. In general, however, inference speed will be very slow, as observed. We have not previously tested ZeRO-3 with MoE, but given the interest we will prioritize this investigation now.
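For reference, the ZeRO-3 config from the reproduction script above with prefetching disabled would look roughly like this (only the "stage3_prefetch_bucket_size": 0 entry is added; everything else is unchanged):

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 0,  # disable parameter prefetching
        "offload_param": {"device": "cpu"},
    },
    "train_micro_batch_size_per_gpu": 1,
}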
@ftgreat The root cause of this issue is that DeepSpeed tries to run reduce-scatter for only a subset of the experts.
ZeRO-3 sets hooks on parameters to run reduce-scatter, but a hook only fires if its expert is activated during the forward pass. Different data-parallel processes may activate different sets of experts. All processes need to join such a communication collective, but in this case the reduce-scatter is called on only some of them.
Since we have already implemented the API to set a leaf module for ZeRO-3, the solution will be to delay the reduce-scatter until the backward pass of the leaf module finishes. I will work in this direction.
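For readers hitting the same problem, a minimal sketch of how the leaf-module API from #4966 can be applied to Mixtral might look like the following. The import path and function name reflect my understanding of the merged API, and MixtralSparseMoeBlock comes from transformers' Mixtral implementation; verify both against your installed versions.

import torch
import deepspeed
from deepspeed.utils import set_z3_leaf_modules  # API added in #4966 (path assumed)
from transformers import AutoModelForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16
)
# Mark each sparse-MoE block as a ZeRO-3 "leaf": its parameters are gathered
# and released at the block boundary, so the data-dependent order in which
# individual experts execute no longer breaks the per-module hooks.
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
model.eval()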
@tohtana In my testing of the Mixtral fine-tuning phase with ZeRO-3, the training process hung at step 5 on the same dataset. This patch does not seem to fix my hang during training, although, as you said, it does fix the text-generation issue with ZeRO-3.
After debugging, I found that the hang is probably related to the following lines of the MixtralSparseMoeBlock implementation; it happens when some experts are assigned no tokens in the training batch. https://github.com/huggingface/transformers/blob/e547458c43dfdbbb8f6a7757237e234c44e20a8f/src/transformers/models/mixtral/modeling_mixtral.py#L823-L824
Could you please explain why this implementation causes a hang with ZeRO-3? (ZeRO-2 runs normally.) Thanks for your reply.
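To make the failure mode concrete, here is a hedged paraphrase of the kind of expert loop being referenced (simplified, not the actual transformers code; the routing-weight multiplication is omitted):

import torch
import torch.nn as nn

def sparse_expert_loop(hidden_states: torch.Tensor,
                       experts: nn.ModuleList,
                       expert_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (num_tokens, hidden_dim)
    # expert_mask:   (num_experts, top_k, num_tokens), one-hot token assignments
    out = torch.zeros_like(hidden_states)
    for expert_idx, expert in enumerate(experts):
        _, top_x = torch.where(expert_mask[expert_idx])
        if top_x.shape[0] == 0:
            # The data-dependent skip: on a rank whose batch routed no tokens
            # to this expert, the expert's forward (and later its backward)
            # never runs, so the ZeRO-3 hooks that all-gather its parameters
            # and reduce-scatter its gradients never fire on this rank, while
            # they do fire on ranks that used the expert. The collective then
            # waits forever for the missing participants.
            continue
        out.index_add_(0, top_x, expert(hidden_states[top_x]))
    return out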
@tohtana I wrote a monkey patch that uses a dense MoE implementation instead of Mixtral's sparse MoE. It tested fine for my cases, with no hangs: https://github.com/ftgreat/llmkit/blob/main/huggingface/mixtral/mixtral_dense_moe_monkey_patch.py
I would still like a detailed explanation of why the sparse MoE implementation causes this. Thanks.
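For readers who cannot use the linked patch, a minimal sketch of the dense-MoE idea might look like the following. It is not the linked patch; attribute names such as gate, experts, and top_k are assumptions based on transformers 4.36's MixtralSparseMoeBlock, and the approach trades extra compute for keeping all ranks synchronized.

import torch
import torch.nn.functional as F
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

def dense_moe_forward(self, hidden_states):
    batch, seq_len, hidden_dim = hidden_states.shape
    hidden_states = hidden_states.view(-1, hidden_dim)
    router_logits = self.gate(hidden_states)                       # (tokens, num_experts)
    routing_weights = F.softmax(router_logits, dim=-1, dtype=torch.float)
    # Keep only the top-k experts per token, but zero out the others instead
    # of skipping them, so every expert still runs on every rank.
    topk_idx = torch.topk(routing_weights, self.top_k, dim=-1).indices
    mask = torch.zeros_like(routing_weights).scatter_(1, topk_idx, 1.0)
    routing_weights = routing_weights * mask
    routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)
    routing_weights = routing_weights.to(hidden_states.dtype)

    final = torch.zeros_like(hidden_states)
    for i, expert in enumerate(self.experts):
        # Every expert processes all tokens; non-selected tokens get weight 0,
        # so the result matches top-k routing while all ranks stay in sync.
        final = final + routing_weights[:, i:i + 1] * expert(hidden_states)
    return final.view(batch, seq_len, hidden_dim), router_logits

# Apply as a monkey patch before model loading / deepspeed.initialize:
MixtralSparseMoeBlock.forward = dense_moe_forward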
I can fully fine-tune Mistral 7B*8 Instruct (Mixtral) with DeepSpeed ZeRO-3 on 2 A100-80GB instances; the code does not hang and runs smoothly. I didn't change anything except disabling the evaluation part that calculates ppl on the validation set. The fine-tuned model looks normal, but I still don't know why it works. I'm just providing my training environment for your reference. Transformers version: 4.36.2, DeepSpeed 0.12.5, DeepSpeed zero_3 config:
@BBerabi You can try it with xtuner: https://github.com/InternLM/xtuner/tree/main/xtuner/configs/mixtral
But remember to use deepspeed_zero3 instead of deepspeed_zero3_offload.
I am also observing the same issue, even with "stage_prefetch_bucket_size": 0. The runtime error about inflight parameters does not occur, but the process just hangs indefinitely and eventually crashes with a timeout. Did someone manage to fine-tune Mixtral with ZeRO-3 and HuggingFace? Could you share your DeepSpeed config? @K-Nick @LZHgrla @ryandeng1
I got the error with "stage_prefetch_bucket_size": 0 + ZeRO-3 as well.