DeepSpeed: [BUG] ValueError: max() arg is an empty sequence with bf16 and ZeRO stage 3

│ /opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py:307  │
│ in <listcomp>                                                                │
│                                                                              │
│    304 │   │   │   max([                                                     │
│    305 │   │   │   │   max(tensor.numel(),                                   │
│    306 │   │   │   │   │   tensor.ds_numel) for tensor in fp16_partitioned_g │
│ ❱  307 │   │   │   ]) for fp16_partitioned_group in self.fp16_partitioned_gr │
│    308 │   │   ])                                                            │
│    309 │   │   print_rank_0(                                                 │
│    310 │   │   │   f'Largest partitioned param numel = {largest_partitioned_ │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: max() arg is an empty sequence
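
For context, a minimal sketch of the failure mode (my own illustration, not DeepSpeed code): the inner max() runs once per partitioned parameter group, so a single empty group — for example one whose parameters are all frozen, as can happen in PEFT/LoRA setups — hands max() an empty sequence and produces exactly this error.

    # Hypothetical illustration of the crash, independent of DeepSpeed:
    # the inner max() is taken per parameter group, so one empty group is enough.
    fp16_partitioned_groups = [
        [11, 7],  # parameter numels of a normal group
        [],       # a group whose parameters were all frozen or filtered out
    ]

    try:
        largest = max(
            max(numel for numel in group)  # empty group -> empty sequence
            for group in fp16_partitioned_groups
        )
    except ValueError as err:
        print(err)  # max() arg is an empty sequence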

To Reproduce

Steps to reproduce the behavior: the error occurred while fine-tuning the FLAN 11B model. The full error trace is in this gist: https://gist.github.com/sujithjoseph/c410514acfccc76974a8130a8afd2169

Here is the DeepSpeed config: https://gist.github.com/sujithjoseph/92bf27de6bba704b57c3b9eb7aa00365

ds_report output: https://gist.github.com/sujithjoseph/c725de5fb38bb3c20e4fb6fd55f63848

System info (please complete the following information):

  • OS: Debian GNU/Linux 10 (buster)
  • GPU count and types: 1 machine with 4 x A100 (40 GB each)
  • Python version: 3.7

Launcher context (deepspeed launcher, MPI, or something else?): Accelerate + PEFT

deepspeed_config:
  deepspeed_config_file: zero_stage3_offload_config.json
  zero3_init_flag: true
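
For completeness, a sketch of the surrounding Accelerate config file this snippet would sit in; every field outside the deepspeed_config block is an assumption on my part rather than something taken from the report:

    # Hypothetical Accelerate config; only the deepspeed_config block is from the report.
    compute_environment: LOCAL_MACHINE
    distributed_type: DEEPSPEED
    deepspeed_config:
      deepspeed_config_file: zero_stage3_offload_config.json
      zero3_init_flag: true
    mixed_precision: bf16
    num_machines: 1
    num_processes: 4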

Additional context

I assumed that the bf16 and fp16 config sections are interchangeable, so the bf16 block below carries over the fp16 loss-scaling keys:

    "bf16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    }
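
Looking at the DeepSpeed docs again, that assumption may not hold: as far as I can tell, the loss-scaling options (loss_scale, loss_scale_window, initial_scale_power, hysteresis, min_loss_scale) are only meaningful under fp16, and the bf16 section takes just enabled. A trimmed bf16 block under that assumption would look like the following (not a confirmed fix for this particular crash):

    "bf16": {
        "enabled": true
    }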

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 20 (6 by maintainers)

Most upvoted comments

Same error here, any update?

Same error when using loralib with ZeRO stage 2 & 3.

Same error: RuntimeError: torch.cat(): expected a non-empty list of Tensors during accelerate.prepare. How can this be solved?

Got it. I don’t have experience with those memory restriction flags, which seem to be Accelerate flags. I don’t think those flags are hooked into deepspeed. Can you please pose this question on their forum? I think we can work with them to enable the desired feature.