DeepSpeed: [BUG] Error while training with DeepSpeed

Describe the bug

DeepSpeed fails with a RuntimeError while training a CodeLlama-34B model with QLoRA using this script.

To Reproduce

Run the script with the DeepSpeed config file passed in the parameters. The DeepSpeed config I used is given below:

{
  "bf16": {
    "enabled": "auto"
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 16777216,
    "stage3_prefetch_bucket_size": 15099494.4,
    "stage3_param_persistence_threshold": 40960,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
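
For context, the "auto" values in this config are resolved by the Hugging Face Trainer from its `TrainingArguments`. Below is a minimal, hedged sketch of how such a config is typically wired into the Trainer; the actual finetune_llama2_codegen.py is not shown here, so the argument names and values are assumptions (the launch command below only confirms that the JSON path is passed via `--deepspeed_path`):

# Hedged sketch, not the actual training script: handing the ZeRO-3 JSON to
# TrainingArguments lets the "auto" entries (bf16, batch sizes, scheduler,
# gradient accumulation) be filled in from the Trainer settings.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    bf16=True,                       # -> "bf16.enabled": "auto"
    per_device_train_batch_size=1,   # -> "train_micro_batch_size_per_gpu": "auto"
    gradient_accumulation_steps=1,   # -> "gradient_accumulation_steps": "auto"
    learning_rate=2e-5,              # -> WarmupLR "auto" values
    deepspeed="/workspace/deepspeed_config_stage3.json",
)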

Expected behavior

The expected behaviour is for DeepSpeed to train without any errors. Instead, the following error is raised (RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7ff729d61cb0>), with the full traceback given below:

[2023-09-08 19:26:04,877] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:07,007] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-09-08 19:26:07,007] [INFO] [runner.py:570:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None finetune_llama2_codegen.py --bf16 --per_device_train_batch_size 1 --per_device_eval_batch_size 2 --model_name /workspace/CodeLlama-34b-Python-hf --dataset_name llama_data --save_steps 100 --num_train_epochs 2 --learning_rate 2e-5 --weight_decay 0.01 --lora_alpha 256 --lora_r 32 --use_qlora True --max_seq_length 8192 --run_name CodeGen-34B-combined-train --deepspeed_path /workspace/deepspeed_config_stage3.json
[2023-09-08 19:26:08,340] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:10,452] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-09-08 19:26:10,452] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-09-08 19:26:10,452] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-09-08 19:26:10,452] [INFO] [launch.py:163:main] dist_world_size=2
[2023-09-08 19:26:10,452] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-09-08 19:26:12,700] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:12,745] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:19,891] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-08 19:26:19,891] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-09-08 19:26:20,266] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-08 19:27:56,714] [INFO] [partition_parameters.py:342:__exit__] finished initializing model - num_params = 435, num_elems = 33.74B
Loading checkpoint shards: 100%|█████████████| 7/7 [03:06<00:00, 26.65s/it]
Loading checkpoint shards: 100%|█████████████| 7/7 [03:06<00:00, 26.61s/it]
trainable params: 39,321,600 || all params: 33,783,291,904 || trainable%: 0.11639363064954678
Map:   0%|                                 | 0/2787 [00:00<?, ? examples/s]trainable params: 39,321,600 || all params: 33,783,291,904 || trainable%: 0.11639363064954678
Map: 100%|█████████████████████| 2787/2787 [00:04<00:00, 630.71 examples/s]
Map: 100%|█████████████████████| 2787/2787 [00:04<00:00, 656.34 examples/s]
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu118/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
[2/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o 
[3/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o -shared -lcurand -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 20.74338674545288 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 12.17668604850769 seconds
Parameter Offload: Total persistent parameters: 2367488 in 145 params
You're using a CodeLlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/workspace/finetune_llama2_codegen.py", line 545, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1835, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2690, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1923, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2080, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 141, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in custom_forward
    return module(*inputs, past_key_value, output_attentions)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
    self.__all_gather_params(params_to_fetch, forward)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
    self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in __all_gather_params_
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params,
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced
    dtype=get_only_unique_item(p.ds_tensor.dtype
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item
    raise RuntimeError(f"expected there to be only one unique element in {items}")
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f74ee477ed0>
Traceback (most recent call last):
  File "/workspace/finetune_llama2_codegen.py", line 545, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1835, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2690, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1923, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2080, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 141, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in custom_forward
    return module(*inputs, past_key_value, output_attentions)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
    self.__all_gather_params(params_to_fetch, forward)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
    self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in __all_gather_params_
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params,
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced
    dtype=get_only_unique_item(p.ds_tensor.dtype
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item
    raise RuntimeError(f"expected there to be only one unique element in {items}")
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7ff729d61cb0>
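
For what it's worth, the failing call is get_only_unique_item(p.ds_tensor.dtype for p in params) inside all_gather_coalesced, so the parameters being fetched for this module do not all share a single partitioned dtype; with QLoRA this is plausibly the 4-bit quantized base weights (stored as uint8) sitting alongside non-quantized bf16 parameters. A small diagnostic sketch of my own (not part of the report) that can confirm this after the model is initialized under ZeRO-3:

# Hypothetical diagnostic: count the ds_tensor dtypes that ZeRO-3 sees.
# Seeing more than one dtype here is consistent with the
# get_only_unique_item failure in the traceback above.
from collections import Counter

def summarize_ds_dtypes(model):
    counts = Counter()
    for p in model.parameters():
        ds_tensor = getattr(p, "ds_tensor", None)  # set by deepspeed.zero.Init
        if ds_tensor is not None:
            counts[str(ds_tensor.dtype)] += 1
    return counts

# e.g. print(summarize_ds_dtypes(model)) on each rank before trainer.train()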

ds_report output

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja … [OKAY]

op name … installed … compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io … [NO] … [NO]
fused_adam … [NO] … [OKAY]
cpu_adam … [NO] … [OKAY]
cpu_adagrad … [NO] … [OKAY]
fused_lamb … [NO] … [OKAY]
quantizer … [NO] … [OKAY]
random_ltd … [NO] … [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn … [NO] … [NO]
spatial_inference … [NO] … [OKAY]
transformer … [NO] … [OKAY]
stochastic_transformer … [NO] … [OKAY]
transformer_inference … [NO] … [OKAY]

DeepSpeed general environment info:
torch install path … ['/usr/local/lib/python3.10/dist-packages/torch']
torch version … 2.0.1+cu118
deepspeed install path … ['/usr/local/lib/python3.10/dist-packages/deepspeed']
deepspeed info … 0.10.3+542dc0d5, 542dc0d5, master
torch cuda version … 11.8
torch hip version … None
nvcc version … 11.8
deepspeed wheel compiled w. … torch 2.0, cuda 11.8
shared memory (/dev/shm) size … 188.00 GB


System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: 4x A100 80GB
  • Interconnects (if applicable): N/A
  • Python version: 3.10

Launcher context: deepspeed launcher with the Hugging Face integration.



About this issue

  • Original URL
  • State: closed
  • Created 10 months ago
  • Reactions: 4
  • Comments: 18 (8 by maintainers)


Most upvoted comments

  1. With zero.init enabled, I get the error below with the latest branches of Accelerate and Transformers and the latest release of DeepSpeed:
File "/raid/sourab/transformers/src/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    model = AutoModelForCausalLM.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    return model_class.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    return model_class.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    return model_class.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    ) = cls._load_pretrained_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    ) = cls._load_pretrained_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    ) = cls._load_pretrained_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
        raise ValueError(set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)

  File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
ValueError        : raise ValueError(raise ValueError(Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.


ValueErrorValueError: : Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
  2. Below is the memory usage with zero_init=False and QLoRA + DeepSpeed Stage 3 for Llama 70B. GPU memory usage per GPU: 20% of 80 GB = 16 GB. However, the initial memory per GPU during model loading is about 35 GB (0.5 bytes per parameter × 70B) because each GPU loads the pretrained model in 4 bits. If zero_init were enabled with QLoRA, one could fine-tune a 70B model on 8x 24 GB GPUs, which would be great (a sketch of the corresponding zero_init toggle follows after the command and screenshot). Code: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/sft/training Command:
accelerate launch --config_file "configs/deepspeed_config_z3_qlora.yaml"  train.py \
--seed 100 \
--model_name_or_path "meta-llama/Llama-2-70b-hf" \
--dataset_name "smangrul/ultrachat-10k-chatml" \
--chat_template_format "chatml" \
--add_special_tokens False \
--append_concat_token False \
--splits "train,test" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--packing True \
--learning_rate 1e-4 \
--lr_scheduler_type "cosine" \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--output_dir "mistral-sft-lora-ds" \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing True \
--use_reentrant True \
--dataset_text_field "content" \
--use_flash_attn True \
--use_peft_lora True \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.1 \
--lora_target_modules "all-linear" \
--use_4bit_quantization True \
--use_nested_quant True \
--bnb_4bit_compute_dtype "bfloat16"
[Screenshot (2024-03-05) showing the per-GPU memory usage described above]
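
For reference, the zero_init setting referred to above maps to Accelerate's `zero3_init_flag`. A hedged sketch of the programmatic equivalent follows (the actual configs/deepspeed_config_z3_qlora.yaml is not reproduced here, so file names and values are assumptions):

# Hedged sketch: programmatic equivalent of zero3_init_flag: false in the
# accelerate config, i.e. loading the quantized model without deepspeed.zero.Init.
from accelerate import Accelerator, DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    hf_ds_config="deepspeed_config_z3.json",  # a ZeRO-3 JSON like the one in this issue
    zero3_init_flag=False,                    # corresponds to zero_init=False above
)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)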

Hi @tohtana, as you said, using DeepSpeed’s API solved the problem.

Here’s how I solved it.

state_dict = self.engine._zero3_consolidated_16bit_state_dict()
lora_state_dict = get_peft_model_state_dict(self.model, state_dict)
self.model.save_pretrained(save_dir)
torch.save(lora_state_dict, os.path.join(save_dir, "adapter_model.bin"))
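
For context, `_zero3_consolidated_16bit_state_dict()` gathers the ZeRO-3 partitioned parameters into a consolidated 16-bit state dict (populated on rank 0), which is why `get_peft_model_state_dict` (from the peft package) can then extract complete adapter weights from it before saving `adapter_model.bin`.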

Thank you very much for your reply.

I submitted #4647 to address this issue. It works in my environment. I would appreciate it if anyone could try it.

Thank you for https://github.com/microsoft/DeepSpeed/pull/4647! It works well in my environment, too!