DeepSpeed: [BUG] Deepspeed Zero 3 Inference InFlight Params with new HuggingFace Mixtral Model

Describe the bug: I tried running DeepSpeed ZeRO-3 inference on the new Hugging Face Mixtral model and got the following error:

      [2023-12-13 04:12:18,837] [WARNING] [parameter_offload.py:86:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'transformers.cache_utils.DynamicCache'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
      Invalidate trace cache @ step 14: expected module 19, but got module 34
      Traceback (most recent call last):
        File "/home/ubuntu/mixtral_hf/deepspeed_zero.py", line 36, in <module>
          outputs = model.generate(inputs, max_new_tokens=20)
        File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
          return func(*args, **kwargs)
        File "/home/ubuntu/mixtral_hf/transformers/src/transformers/generation/utils.py", line 1731, in generate
          return self.greedy_search(
        File "/home/ubuntu/mixtral_hf/transformers/src/transformers/generation/utils.py", line 2592, in greedy_search
          outputs = self(
        File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1581, in _call_impl
          hook_result = hook(self, args, result)
        File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
          ret_val = func(*args, **kwargs)
        File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 350, in _end_of_forward_hook
          self.get_param_coordinator(training=False).reset_step()
        File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 203, in reset_step
          raise RuntimeError(f"still have inflight params "
          RuntimeError: still have inflight params [{'id': 9, 'status': 'AVAILABLE', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 11, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 10, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 15, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 17, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 16, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 21, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 23, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 22, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 27, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}]

To Reproduce

A simple inference script that reproduces the behavior:

  model_id = "mistralai/Mixtral-8x7B-v0.1"
  ds_config = {
      "bf16": {
          "enabled": True,
      },
      "zero_optimization": {
          "stage": 3,
          "offload_param": {
              "device": "cpu",
          }
      },
      "train_micro_batch_size_per_gpu": 1,
  }
  
  # keep HfDeepSpeedConfig alive before from_pretrained so ZeRO-3 partitions the weights at load time
  hfdsc = HfDeepSpeedConfig(ds_config)
  
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
  model.eval()
  
  ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
  ds_engine.module.eval()
  model = ds_engine.module
  
  inputs = tokenizer.encode("DeepSpeed is", return_tensors="pt").to("cuda")
  outputs = model.generate(inputs, max_new_tokens=20)   
  output_str = tokenizer.decode(outputs[0])

What packages are required and their versions

  • HuggingFace Transformers 4.65
  • DeepSpeed 0.12.4
  • PyTorch 2.1
  • CUDA 12.1

ds_report output:

    DeepSpeed C++/CUDA extension op report
    --------------------------------------------------
    NOTE: Ops not installed will be just-in-time (JIT) compiled at
          runtime if needed. Op compatibility means that your system
          meet the required dependencies to JIT install the op.
    --------------------------------------------------
    JIT compiled ops requires ninja
    ninja .................. [OKAY]
    --------------------------------------------------
    op name ................ installed .. compatible
    --------------------------------------------------
    async_io ............... [NO] ....... [OKAY]
    fused_adam ............. [NO] ....... [OKAY]
    cpu_adam ............... [NO] ....... [OKAY]
    cpu_adagrad ............ [NO] ....... [OKAY]
    cpu_lion ............... [NO] ....... [OKAY]
     [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
    evoformer_attn ......... [NO] ....... [NO]
    fused_lamb ............. [NO] ....... [OKAY]
    fused_lion ............. [NO] ....... [OKAY]
    inference_core_ops ..... [NO] ....... [OKAY]
    cutlass_ops ............ [NO] ....... [OKAY]
    quantizer .............. [NO] ....... [OKAY]
    ragged_device_ops ...... [NO] ....... [OKAY]
    ragged_ops ............. [NO] ....... [OKAY]
    random_ltd ............. [NO] ....... [OKAY]
     [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
     [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
    sparse_attn ............ [NO] ....... [NO]
    spatial_inference ...... [NO] ....... [OKAY]
    transformer ............ [NO] ....... [OKAY]
    stochastic_transformer . [NO] ....... [OKAY]
    transformer_inference .. [NO] ....... [OKAY]
    --------------------------------------------------
    DeepSpeed general environment info:
    torch install path ............... ['/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch']
    torch version .................... 2.1.1
    deepspeed install path ........... ['/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed']
    deepspeed info ................... 0.12.4, unknown, unknown
    torch cuda version ............... 12.1
    torch hip version ................ None
    nvcc version ..................... 12.1
    deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
    shared memory (/dev/shm) size .... 124.52 GB

System info:

  • AWS g5.16xlarge instance
  • OS: Ubuntu 22.04
  • GPU: NVIDIA A10G
  • GPU count: 1
  • Python version: 3.10.13

About this issue

  • State: open
  • Created 7 months ago
  • Reactions: 2
  • Comments: 36 (12 by maintainers)

Most upvoted comments

Guys, thanks for the great debugging and collaboration here to understand this problem. The fundamental issue is that ZeRO-3 caches the parameter trace to enable parameter prefetching, which reduces all-gather latency. Unfortunately, since MoE layers can activate different experts across iterations, the parameter trace cache is invalidated whenever the active experts change; the warning messages report these trace-cache invalidations. In this case the warnings can be avoided by disabling prefetching (setting "stage3_prefetch_bucket_size": 0), so only a minor fix is required here. In general, however, inference speed will be very slow, as observed.

We have not previously tested ZeRO-3 with MoE models, but given the interest we will prioritize this investigation now.
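
For readers who want to try the workaround mentioned above, a minimal sketch of the reporter's config with prefetching disabled (the only change relative to the script above is the added "stage3_prefetch_bucket_size": 0 entry) looks like this:

  ds_config = {
      "bf16": {"enabled": True},
      "zero_optimization": {
          "stage": 3,
          "offload_param": {"device": "cpu"},
          # disable parameter prefetching so the MoE trace-cache invalidations are harmless
          "stage3_prefetch_bucket_size": 0,
      },
      "train_micro_batch_size_per_gpu": 1,
  }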

@ftgreat The root cause of this issue is that DeepSpeed tries to run reduce-scatter for only a subset of the experts.

ZeRO-3 sets hooks on parameters to run reduce-scatter. However, a hook is not fired unless the corresponding expert is activated in the forward pass, and our data-parallel processes may activate different sets of experts. All processes need to join such a communication collective, but in this case the reduce-scatter is called only on some of them.

Since we already implemented the API to set a leaf module for ZeRO-3, the solution will be to delay the reduce-scatter until the backward pass of the leaf module finishes. I will work in this direction.
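
For readers looking for the API referred to here: recent DeepSpeed releases expose a set_z3_leaf_modules helper in deepspeed.utils (check that your installed version includes it). A rough sketch, not an official recipe, of registering the Mixtral MoE block as a ZeRO-3 leaf so that its parameters are gathered and released as one unit on every rank:

  import torch
  import deepspeed
  from deepspeed.utils import set_z3_leaf_modules  # present in recent DeepSpeed releases
  from transformers import AutoModelForCausalLM
  from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

  ds_config = {"bf16": {"enabled": True}, "zero_optimization": {"stage": 3}, "train_micro_batch_size_per_gpu": 1}
  model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16)
  # Treat every sparse-MoE block as a single ZeRO-3 "leaf" so per-expert hooks
  # no longer fire on only a subset of data-parallel ranks.
  set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
  engine = deepspeed.initialize(model=model, config_params=ds_config)[0]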

@tohtana In my testing of the Mixtral fine-tuning phase with ZeRO-3, the training process hung at step 5 with the same datasets. This patch does not seem to fix my hang during training. As you stated, the patch should have fixed the text-generation issue with ZeRO-3.

After some debugging, I found that the hang is probably related to the following lines of the MixtralSparseMoeBlock implementation; it happens when some experts are assigned no tokens in a training batch. https://github.com/huggingface/transformers/blob/e547458c43dfdbbb8f6a7757237e234c44e20a8f/src/transformers/models/mixtral/modeling_mixtral.py#L823-L824

Could you please explain why this implementation causes a hang with ZeRO-3? (ZeRO-2 runs normally.) Thanks for your reply.
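
For context, the pattern being pointed at has roughly the following shape (a paraphrased sketch of the linked modeling_mixtral.py lines, not the exact source); the early continue is what lets a rank skip an expert entirely:

  import torch

  def sparse_routing_sketch(hidden_states, experts, expert_mask, routing_weights):
      # Paraphrased shape of the HF sparse routing loop.
      out = torch.zeros_like(hidden_states)
      for expert_idx, expert in enumerate(experts):
          idx, top_x = torch.where(expert_mask[expert_idx])
          if top_x.shape[0] == 0:
              # No tokens were routed to this expert on this rank, so its forward is
              # skipped here. Under ZeRO-3 that means this rank never enters the
              # expert's parameter all-gather (and, in training, its gradient
              # reduce-scatter), while ranks that did route tokens to it block
              # inside the collective, hence the hang.
              continue
          out[top_x] += expert(hidden_states[top_x]) * routing_weights[top_x, idx, None]
      return out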

Thank you for sharing the issue, @ftgreat. The same issue is reported at #4966. Let me take a look.

@tohtana I wrote a monkey patch that uses a dense MoE implementation instead of the Mixtral sparse MoE. It tested OK for my cases, with no hangs: https://github.com/ftgreat/llmkit/blob/main/huggingface/mixtral/mixtral_dense_moe_monkey_patch.py

I would still like a detailed explanation of why the sparse MoE implementation causes this. Thanks.
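
For readers who just want the idea behind such a patch rather than the linked file: a "dense" MoE forward runs every expert on every token and lets zero routing weights cancel the unused outputs, so all ranks execute all experts and ZeRO-3's per-expert collectives stay in lockstep, at the cost of extra compute. A rough sketch (not the linked implementation), assuming the block exposes gate, experts and top_k as in the HF MixtralSparseMoeBlock:

  import torch

  def dense_moe_forward(self, hidden_states):
      batch, seq, hidden = hidden_states.shape
      flat = hidden_states.view(-1, hidden)                     # (tokens, hidden)
      router_logits = self.gate(flat)                           # (tokens, num_experts)
      routing_weights = torch.softmax(router_logits, dim=-1, dtype=torch.float)
      topk_w, topk_idx = torch.topk(routing_weights, self.top_k, dim=-1)
      topk_w = (topk_w / topk_w.sum(dim=-1, keepdim=True)).to(flat.dtype)
      out = torch.zeros_like(flat)
      for e, expert in enumerate(self.experts):
          expert_out = expert(flat)                             # every expert runs on every rank
          w = (topk_w * (topk_idx == e)).sum(dim=-1, keepdim=True)  # 0 for tokens not routed to e
          out = out + expert_out * w
      return out.view(batch, seq, hidden), router_logits

  # Hypothetical usage as a monkey patch:
  # from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
  # MixtralSparseMoeBlock.forward = dense_moe_forward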

I can fully fine-tune Mixtral 8x7B Instruct with DeepSpeed ZeRO-3 on 2 A100-80GB instances; the code does not hang and runs smoothly. I didn't change anything except disabling the evaluation step that computes perplexity on the validation set. The fine-tuned model looks normal, but I still don't know why the issue can happen. I'm providing my training environment for reference: transformers 4.36.2, DeepSpeed 0.12.5, and the following DeepSpeed ZeRO-3 config:

  "gradient_accumulation_steps": 8,
  "train_micro_batch_size_per_gpu": 4,
  "prescale_gradients": false,
  "zero_allow_untested_optimizer": true,
  "zero_optimization": {
    "stage": 3, 
    "offload_param": {
        "device": "none"
    }, 
    "offload_optimizer": {
        "device": "none"
    }, 
    "stage3_param_persistence_threshold": 1.000000e+04, 
    "stage3_max_live_parameters": 3.000000e+07, 
    "stage3_prefetch_bucket_size": 3.000000e+07, 
    "memory_efficient_linear": false
  }, 
  "steps_per_print": 1,
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": true,
  "bf16": {
    "enabled": true
  }
}```

@BBerabi You can try it with xtuner: https://github.com/InternLM/xtuner/tree/main/xtuner/configs/mixtral

But remember to use deepspeed_zero3 instead of deepspeed_zero3_offload.

I am also observing the same issue, even with "stage3_prefetch_bucket_size": 0. The runtime error about inflight parameters does not occur, but the process just hangs indefinitely and eventually crashes with a timeout.

Did anyone manage to fine-tune Mixtral with ZeRO-3 and Hugging Face? Could you share your DeepSpeed config? @K-Nick @LZHgrla @ryandeng1

I got the error with "stage3_prefetch_bucket_size": 0 + ZeRO-3:

Invalidate trace cache @ step 1323: expected module 2476, but got module 2510                                    | 20/2466 [02:15<4:12:07,  6.18s/it, gpt_loss=1.28, loss_mean=1.22, balancing_loss=8]

[rank0]:[E ProcessGroupNCCL.cpp:754] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=151178, OpType=_ALLGATHER_BASE, NumelIn=65536, NumelOut=262144, Timeout(ms)=1800000) ran for 1800326 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:768] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:774] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1282] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=151178, OpType=_ALLGATHER_BASE, NumelIn=65536, NumelOut=262144, Timeout(ms)=1800000) ran for 1800326 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:756 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7f9ddd19c8f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(c10::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1f2 (0x7f9d7ef58142 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x178 (0x7f9d7ef5e538 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x8e (0x7f9d7ef5eb2e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7f9ddccb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f9e8ad78ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126660 (0x7f9e8ae0a660 in /usr/lib/x86_64-linux-gnu/libc.so.6)