LLaMA-Factory: ValueError: We need an `offload_dir` to dispatch this model according to this `device_map`

Reminder

  • I have read the README and searched the existing issues.

Reproduction

Issue #2420 is not fixed as of commit c901aa6.

Full error output

[2024-03-12 08:50:54,244] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[INFO|tokenization_utils_base.py:2044] 2024-03-12 08:52:37,843 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2044] 2024-03-12 08:52:37,843 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2044] 2024-03-12 08:52:37,844 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2044] 2024-03-12 08:52:37,844 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2044] 2024-03-12 08:52:37,844 >> loading file tokenizer.json
[INFO|configuration_utils.py:726] 2024-03-12 08:52:37,905 >> loading configuration file C:\LLaMA-Factory\checkpoints\Llama-2-13b-chat-hf\config.json
[INFO|configuration_utils.py:791] 2024-03-12 08:52:37,906 >> Model config LlamaConfig {
  "_name_or_path": "C:\\LLaMA-Factory\\checkpoints\\Llama-2-13b-chat-hf",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 13824,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "num_key_value_heads": 40,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.38.1",
  "use_cache": true,
  "vocab_size": 32000
}

[INFO|modeling_utils.py:3254] 2024-03-12 08:52:37,931 >> loading weights file C:\LLaMA-Factory\checkpoints\Llama-2-13b-chat-hf\model.safetensors.index.json
[INFO|modeling_utils.py:1400] 2024-03-12 08:52:37,931 >> Instantiating LlamaForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:845] 2024-03-12 08:52:37,932 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 3/3 [00:09<00:00,  3.01s/it]
[INFO|modeling_utils.py:3992] 2024-03-12 08:52:47,320 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4000] 2024-03-12 08:52:47,320 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at C:\LLaMA-Factory\checkpoints\Llama-2-13b-chat-hf.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:798] 2024-03-12 08:52:47,321 >> loading configuration file C:\LLaMA-Factory\checkpoints\Llama-2-13b-chat-hf\generation_config.json
C:\LLaMA-Factory\venv\lib\site-packages\transformers\generation\configuration_utils.py:410: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
C:\LLaMA-Factory\venv\lib\site-packages\transformers\generation\configuration_utils.py:415: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
C:\LLaMA-Factory\venv\lib\site-packages\transformers\generation\configuration_utils.py:410: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
C:\LLaMA-Factory\venv\lib\site-packages\transformers\generation\configuration_utils.py:415: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
[INFO|configuration_utils.py:845] 2024-03-12 08:52:47,322 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "max_length": 4096,
  "pad_token_id": 0,
  "temperature": 0.9,
  "top_p": 0.6
}

Traceback (most recent call last):
  File "C:\LLaMA-Factory\venv\lib\site-packages\gradio\queueing.py", line 407, in call_prediction
    output = await route_utils.call_process_api(
  File "C:\LLaMA-Factory\venv\lib\site-packages\gradio\route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "C:\LLaMA-Factory\venv\lib\site-packages\gradio\blocks.py", line 1550, in process_api
    result = await self.call_function(
  File "C:\LLaMA-Factory\venv\lib\site-packages\gradio\blocks.py", line 1199, in call_function
    prediction = await utils.async_iteration(iterator)
  File "C:\LLaMA-Factory\venv\lib\site-packages\gradio\utils.py", line 519, in async_iteration
    return await iterator.__anext__()
  File "C:\LLaMA-Factory\venv\lib\site-packages\gradio\utils.py", line 512, in __anext__
    return await anyio.to_thread.run_sync(
  File "C:\LLaMA-Factory\venv\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "C:\LLaMA-Factory\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 2106, in run_sync_in_worker_thread
    return await future
  File "C:\LLaMA-Factory\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 833, in run
    result = context.run(func, *args)
  File "C:\LLaMA-Factory\venv\lib\site-packages\gradio\utils.py", line 495, in run_sync_iterator_async
    return next(iterator)
  File "C:\LLaMA-Factory\venv\lib\site-packages\gradio\utils.py", line 649, in gen_wrapper
    yield from f(*args, **kwargs)
  File "C:\LLaMA-Factory\src\llmtuner\webui\components\export.py", line 71, in save_model
    export_model(args)
  File "C:\LLaMA-Factory\src\llmtuner\train\tuner.py", line 52, in export_model
    model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args)
  File "C:\LLaMA-Factory\src\llmtuner\model\loader.py", line 150, in load_model_and_tokenizer
    model = load_model(tokenizer, model_args, finetuning_args, is_trainable, add_valuehead)
  File "C:\LLaMA-Factory\src\llmtuner\model\loader.py", line 94, in load_model
    model = init_adapter(model, model_args, finetuning_args, is_trainable)
  File "C:\LLaMA-Factory\src\llmtuner\model\adapter.py", line 110, in init_adapter
    model: "LoraModel" = PeftModel.from_pretrained(model, adapter)
  File "C:\LLaMA-Factory\venv\lib\site-packages\peft\peft_model.py", line 353, in from_pretrained
    model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
  File "C:\LLaMA-Factory\venv\lib\site-packages\peft\peft_model.py", line 727, in load_adapter
    dispatch_model(
  File "C:\LLaMA-Factory\venv\lib\site-packages\accelerate\big_modeling.py", line 374, in dispatch_model
    raise ValueError(
ValueError: We need an `offload_dir` to dispatch this model according to this `device_map`, the following submodules need to be offloaded: base_model.model.model.layers.33, base_model.model.model.layers.34, base_model.model.model.layers.35, base_model.model.model.layers.36, base_model.model.model.layers.37, base_model.model.model.layers.38, base_model.model.model.layers.39, base_model.model.model.norm, base_model.model.lm_head.
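
The error shows that accelerate mapped the last decoder layers (33–39), the final norm, and `lm_head` off the GPU, but no `offload_dir` was supplied when PEFT re-dispatched the model while attaching the adapter. As a possible standalone workaround (not the LLaMA-Factory code path), the merge can be done with an explicit offload folder; all paths below are placeholders, and it assumes the installed `peft` version forwards `offload_folder` from `PeftModel.from_pretrained` to `accelerate.dispatch_model`:

```python
# Possible workaround sketch (not the LLaMA-Factory code path): give accelerate an
# offload folder so the layers that do not fit in 24 GB VRAM can be spilled to disk.
# All paths are placeholders; the adapter and export directories are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = r"C:\LLaMA-Factory\checkpoints\Llama-2-13b-chat-hf"
adapter_path = r"C:\LLaMA-Factory\saves\llama2-13b\lora"     # hypothetical adapter dir
export_path = r"C:\LLaMA-Factory\export\llama2-13b-merged"   # hypothetical export dir

tokenizer = AutoTokenizer.from_pretrained(base_path)
model = AutoModelForCausalLM.from_pretrained(
    base_path,
    torch_dtype="auto",
    device_map="auto",          # spills the trailing layers off the 24 GB GPU
    offload_folder="offload",   # the directory the ValueError is asking for
)
# Assumed to be accepted and forwarded to dispatch_model by this peft version.
model = PeftModel.from_pretrained(model, adapter_path, offload_folder="offload")
model = model.merge_and_unload()
model.save_pretrained(export_path, safe_serialization=True)
tokenizer.save_pretrained(export_path)
```

Alternatively, for export only, loading the model entirely on the CPU (no `device_map`) avoids `dispatch_model` altogether, at the cost of needing roughly 26 GB of system RAM for the fp16 weights.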

Expected behavior

  1. Start the WebUI and open the Export tab.
  2. Select the adapter (QLoRA 4-bit, trained on Llama2-13B) and set the base model via Model path.
  3. Set Export dir to an empty folder and click Export.

System Info

  • transformers version: 4.38.1
  • Platform: Windows-11
  • Python version: 3.10.8
  • Huggingface_hub version: 0.20.1
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Single GPU 24GB VRAM
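
For context (my own arithmetic, not taken from the logs): the float16 weights of a 13B-parameter model alone are about 26 GB (≈24.2 GiB), so together with the CUDA context and buffers they do not fit on a 24 GB card, which is why `device_map="auto"` pushes the tail of the network off the GPU:

```python
# Illustrative arithmetic only: estimated size of Llama-2-13B weights in float16.
params = 13.0e9                # ~13 billion parameters
weight_bytes = params * 2      # 2 bytes per parameter in float16
print(f"weights: ~{weight_bytes / 1024**3:.1f} GiB vs. 24 GiB of VRAM")  # ~24.2 GiB
```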

Others

Hi hiyouga!

About this issue

  • State: closed
  • Created 4 months ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

We have made a fix, please try again.