axolotl: The error maze of deepspeed + qlora + falcon
I’ve been trying to make the combination deepspeed + qlora + falcon work, but for reasons I can’t pin down I’m stuck in an error maze.
Setup
- Docker image: `winglian/axolotl-runpod:main-py3.9-cu118-2.0.0`
- Entry script: `bash -c "curl -H 'Cache-Control: no-cache' https://raw.githubusercontent.com/utensil/llm-playground/main/scripts/entry/prepare_ax.sh -sSf | bash"`
- `ds_config.json` (final version, modified from the default one in axolotl):
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "fp16": {
    "enabled": "auto",
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": "auto",
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```
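The `"auto"` values above are placeholders that the HF Trainer fills in from axolotl's training arguments at launch time, which is also why the error table further down notes that the axolotl config must not set `optimizer` and `lr_scheduler` itself. As a quick self-check, here is a minimal sketch (my own helper, not part of axolotl; it assumes the config is saved as `ds_config.json`) that lists every placeholder so nothing hard-coded slips in:

```python
# Minimal sketch, assuming the config above is saved as ds_config.json.
# Every "auto" found here is resolved by the HF Trainer from its TrainingArguments,
# so hard-coding any of them risks the "hf and ds configs mismatch" errors below.
import json

def find_auto(node, path=""):
    """Yield dotted paths of every value set to the literal string "auto"."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_auto(value, f"{path}.{key}" if path else key)
    elif node == "auto":
        yield path

with open("ds_config.json") as f:
    config = json.load(f)

for key_path in find_auto(config):
    print("resolved by HF Trainer:", key_path)
```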
- `examples/falcon/config-40b-qlora.yml`:
```yaml
# 1b: tiiuae/falcon-rw-1b
# 7b: tiiuae/falcon-7b
# 40b: tiiuae/falcon-40b
base_model: tiiuae/falcon-40b
base_model_config: tiiuae/falcon-40b
# required by falcon custom model code: https://huggingface.co/tiiuae/falcon-7b/tree/main
trust_remote_code: true
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
load_in_8bit: false
# enable 4bit for QLoRA
load_in_4bit: true
gptq: false
strict: false
push_dataset_to_hub: utensil
hf_use_auth_token: true
datasets:
  - path: QingyiSi/Alpaca-CoT
    data_files:
      - Chain-of-Thought/formatted_cot_data/gsm8k_train.json
    type: "alpaca:chat"
dataset_prepared_path: last_run_prepared
val_set_size: 0.01
# enable QLoRA
adapter: qlora
lora_model_dir:
sequence_len: 2048
max_packed_sequence_len:
# hyperparameters from QLoRA paper Appendix B.2
# "We find hyperparameters to be largely robust across datasets"
lora_r: 64
lora_alpha: 16
# 0.1 for models up to 13B
# 0.05 for 33B and 65B models
lora_dropout: 0.05
# add LoRA modules on all linear layers of the base model
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project: falcon-qlora
wandb_watch:
wandb_run_id:
wandb_log_model:
output_dir: /content/axolotl-trained/falcon-qlora-40b-gsm8k/
# QLoRA paper Table 9
# - 16 for 7b & 13b
# - 32 for 33b, 64 for 64b
# Max size tested on A6000
# - 7b: 40
# - 40b: 4
# decrease if OOM, increase for max VRAM utilization
micro_batch_size: 1
gradient_accumulation_steps: 1
num_epochs: 3
# Optimizer for QLoRA
# optimizer: paged_adamw_32bit
torchdistx_path:
# lr_scheduler: cosine
# QLoRA paper Table 9
# - 2e-4 for 7b & 13b
# - 1e-4 for 33b & 64b
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: true
gradient_checkpointing: true
# stop training after this many evaluation losses have increased in a row
# https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
early_stopping_patience: 3
resume_from_checkpoint:
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10
eval_steps: 5
save_steps: 10
debug:
deepspeed:
weight_decay: 0.01
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"
  bos_token: ">>ABSTRACT<<"
  eos_token: "<|endoftext|>"
```
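For context on what `load_in_4bit: true`, `adapter: qlora`, and `lora_target_linear: true` roughly translate to under the hood, here is a sketch using the transformers/peft/bitsandbytes APIs directly. It is my own approximation of the settings above, not axolotl's actual loading code; the NF4/double-quant choices and the Falcon target-module names are assumptions based on the QLoRA paper defaults and Falcon's custom model code:

```python
# Rough QLoRA loading sketch (assumptions: NF4 quantization, bf16 compute, and
# "all linear layers" approximated by Falcon's attention/MLP projection names).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load_in_4bit: true
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16: true
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    quantization_config=bnb_config,
    trust_remote_code=True,                 # required by Falcon's custom model code
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,            # lora_r
    lora_alpha=16,   # lora_alpha
    lora_dropout=0.05,
    target_modules=["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```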
- Environment reported by `ds_report`:
```
Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.3+52907a66, 52907a66, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
```
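According to `ds_report`, DeepSpeed's ops are compiled against CUDA 11.8, so the first error in the table below only shows up once torch is silently swapped for a cu117 build. A minimal check (a small script of my own, nothing DeepSpeed-specific; it assumes `nvcc` is on PATH inside the container) to confirm the two still agree:

```python
# Minimal sanity check: torch.version.cuda must match the toolkit that DeepSpeed
# JIT-compiles its ops against, otherwise the "CUDA version mismatch" and
# DeepSpeedCPUAdam errors below appear.
import subprocess
import torch

print("torch:", torch.__version__, "built against CUDA", torch.version.cuda)
nvcc = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
print("nvcc :", nvcc.strip().splitlines()[-1])
```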
Errors
| Error | Cause | Solution |
|---|---|---|
| `RuntimeError: CUDA version mismatch! DeepSpeed ops were compiled and installed with a different version than what is being used at runtime. Please re-install DeepSpeed or switch torch versions. Install CUDA version=11.8, Runtime CUDA version=11.7` plus `AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'` (#138) | torch 2.0.1 was reinstalled for CUDA 11.7 for an unknown reason ❓ | `pip3 install -U torch --index-url https://download.pytorch.org/whl/cu118` |
| `RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f0f04211eb0>` | Training started, then this error appeared during the forward pass for an unknown reason ❓ | ❌ |
| `ValueError: Found optimizer configured in the DeepSpeed config, but no scheduler. Please configure a scheduler in the DeepSpeed config.` | Need to add the optimizer and scheduler configs, though other setups work fine without them | Add them as in the final config above |
| Many mismatch errors | HF and DeepSpeed configs mismatch | The axolotl config must not set `optimizer` and `lr_scheduler`; many DeepSpeed values need to be set to `auto`; add the missing DeepSpeed config keys (see the final config above) |
| Compile errors when reinstalling DeepSpeed with `TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX" DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 pip install deepspeed --global-option="build_ext" --global-option="-j8" # --global-option="bdist_wheel"` | Maybe the image is not a complete environment for compilation ❓ | ❌ |
| `ValueError: Can't find a valid checkpoint at /content/axolotl-trained/falcon-qlora-40b-gsm8k/checkpoint-50` | If I disable DeepSpeed and use plain accelerate for multiple GPUs, training is normal but resuming from a checkpoint fails (tried each of the latest 3 checkpoints) ❓ | ❌ |
| `ValueError: ZeRO inference only makes sense with ZeRO Stage 3 - please adjust your config` | If I run DeepSpeed with 1 A100, training is normal but eval fails with this error because `is_zero3()` returns false during eval ❓ | ❌ |
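For the last row: transformers decides whether to wrap evaluation in ZeRO-3 parameter gathering by consulting its DeepSpeed integration. A small probe can log what that check returns around eval; this is a debugging sketch of my own, not axolotl code, and `Zero3Probe` is a hypothetical helper (the `is_zero3()` mentioned in the table is the method on the `HfDeepSpeedConfig` object that this transformers-side check consults):

```python
# Hedged debugging sketch: log whether transformers believes ZeRO-3 is active at eval.
from transformers import TrainerCallback
from transformers.deepspeed import is_deepspeed_zero3_enabled

class Zero3Probe(TrainerCallback):
    def on_evaluate(self, args, state, control, **kwargs):
        print(f"[eval @ step {state.global_step}] "
              f"is_deepspeed_zero3_enabled() = {is_deepspeed_zero3_enabled()}")

# Usage (hypothetical): trainer.add_callback(Zero3Probe()) before trainer.train().
```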
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 2
- Comments: 15 (9 by maintainers)
The title should be changed to include llama models; I get the

`RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f0f04211eb0>`

error on startup with 4x 4090s and a zero3 config.

Hi @utensil, I’m working with DeepSpeed and am having similar issues. Although my process is still broken, I’ll share my current config in case it helps. For testing, I have been able to start the training process on 1 node with 3x A6000s under ZeRO stage 2. Here is my Makefile target:

My accelerate config:

Deepspeed config:

Axolotl config: