vllm: Loading Mixtral 8x7B AWQ model fails

I am using the latest vllm Docker image, trying to run the Mixtral 8x7B model quantized in AWQ format. I get the error message below:

INFO 12-24 09:22:55 llm_engine.py:73] Initializing an LLM engine with config: model='/models/openbuddy-mixtral-8x7b-v15.2-AWQ', tokenizer='/models/openbuddy-mixtral-8x7b-v15.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=awq, enforce_eager=False, seed=0)
(RayWorkerVllm pid=2491) /usr/local/lib/python3.10/dist-packages/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
(RayWorkerVllm pid=2491)   warnings.warn("Initializing zero-element tensors is a no-op")
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/vllm/entrypoints/openai/api_server.py", line 729, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/workspace/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/workspace/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/workspace/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/workspace/vllm/engine/llm_engine.py", line 108, in __init__
    self._init_workers_ray(placement_group)
  File "/workspace/vllm/engine/llm_engine.py", line 195, in _init_workers_ray
    self._run_workers(
  File "/workspace/vllm/engine/llm_engine.py", line 755, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/workspace/vllm/engine/llm_engine.py", line 732, in _run_workers_in_batch
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2563, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(KeyError): ray::RayWorkerVllm.execute_method() (pid=2492, ip=172.17.0.2, actor_id=ccdc00b5ccaf06b948a44c5301000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f3cba935990>)
  File "/workspace/vllm/engine/ray_utils.py", line 31, in execute_method
    return executor(*args, **kwargs)
  File "/workspace/vllm/worker/worker.py", line 79, in load_model
    self.model_runner.load_model()
  File "/workspace/vllm/worker/model_runner.py", line 57, in load_model
    self.model = get_model(self.model_config)
  File "/workspace/vllm/model_executor/model_loader.py", line 72, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/workspace/vllm/model_executor/models/mixtral.py", line 430, in load_weights
    param = params_dict[name]
KeyError: 'model.layers.26.block_sparse_moe.experts.0.w2.qweight'
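
The KeyError means the loader read a tensor named model.layers.26.block_sparse_moe.experts.0.w2.qweight from the checkpoint but found no matching entry in the model's params_dict, i.e. this vLLM build does not map per-expert AWQ tensors (qweight/qzeros/scales) onto its Mixtral implementation. A quick way to see which names the checkpoint actually ships is to list them from the safetensors shards; this is just a diagnostic sketch, assuming the checkpoint is stored as .safetensors files (the path is the one from the log above):

import glob
from safetensors import safe_open

checkpoint_dir = "/models/openbuddy-mixtral-8x7b-v15.2-AWQ"  # path from the log above

for shard in sorted(glob.glob(f"{checkpoint_dir}/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            # print only the MoE expert tensors, to compare against the name in the KeyError
            if "block_sparse_moe.experts" in name:
                print(shard, name)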

About this issue

  • State: open
  • Created 6 months ago
  • Comments: 28 (15 by maintainers)

Most upvoted comments

I'm running into a similar issue with the latest stable release on 2x 4090s:

python -m vllm.entrypoints.api_server --model TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
 --dtype auto --tokenizer mistralai/Mixtral-8x7B-Instruct-v0.1 \
 --quantization awq --trust-remote-code \
 --tensor-parallel-size 2 --gpu-memory-utilization 0.98 --enforce-eager

The server never fully loads; it just hangs on:

WARNING 01-05 17:01:33 config.py:175] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-05 17:01:34,848	INFO worker.py:1724 -- Started a local Ray instance.
INFO 01-05 17:01:36 llm_engine.py:70] Initializing an LLM engine with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, load_format=auto, tensor_parallel_size=2, quantization=awq, enforce_eager=True, seed=0)
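
In case it helps narrow down the hang: a minimal sketch that loads the same configuration through vLLM's offline LLM entrypoint instead of the OpenAI API server, to see whether the stall happens during engine/Ray initialization or only in the server path. The model names and settings simply mirror the command above; this is a diagnostic sketch, not a fix:

from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    tokenizer="mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization="awq",
    dtype="auto",
    trust_remote_code=True,
    tensor_parallel_size=2,
    gpu_memory_utilization=0.98,
    enforce_eager=True,
)

# if execution gets past construction, the hang is not in model loading itself
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))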

Hi @joennlae, thanks a lot for noticing this issue with the tokenizer and the prediction of the number 2. I am facing this issue too. Did you manage to find a fix?

It is not an issue with the tokenizer. I saw that there is a high chance of Mixtral generating an end-of-sequence token, especially when dates/numbers are involved. I tried to do some investigation, but stopped.

Some of the results from back then can be found here: https://github.com/joennlae/vllm/blob/019ee402923d43cb225afaf356d559556d615aef/write_up.md

Also I was not able to reproduce this issue with TGI.
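
For anyone who wants to reproduce the early-stop behaviour described above, a rough sketch: ask vLLM for per-token logprobs and look at how much probability the end-of-sequence token receives when the prompt contains dates/numbers. The prompt and model name are only examples, and the exact structure of the returned logprobs can differ between vLLM versions:

from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
          quantization="awq", enforce_eager=True)

params = SamplingParams(temperature=0.0, max_tokens=32, logprobs=5)
outputs = llm.generate(["The invoice is dated 2024-01-05 and the total is"], params)

# each entry holds the top candidate tokens and their logprobs for one generated position;
# check whether the EOS token shows up with high probability right after a number
for position in outputs[0].outputs[0].logprobs:
    print(position)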


@casper-hansen I can confirm that works, at least for me.

I used my own AWQ quantization. Try quantizing it yourself and maybe that will fix the problem.
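
For reference, re-quantizing with AutoAWQ looks roughly like the following. This is a minimal sketch of its standard flow; the output path and quantization config values are illustrative assumptions, and Mixtral may need model-specific settings from AutoAWQ's docs:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quant_path = "mixtral-8x7b-instruct-awq"  # output directory, just an example
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# load the fp16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# run AWQ calibration/quantization, then save the quantized checkpoint
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)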