vllm: Loading Mixtral 8x7B AWQ model fails
I am using the latest vLLM Docker image, trying to run a Mixtral 8x7B model quantized in AWQ format. I get the error message below:
INFO 12-24 09:22:55 llm_engine.py:73] Initializing an LLM engine with config: model='/models/openbuddy-mixtral-8x7b-v15.2-AWQ', tokenizer='/models/openbuddy-mixtral-8x7b-v15.2-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=awq, enforce_eager=False, seed=0)
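For context, the engine config in the log above corresponds to a launch roughly like the following (flags reconstructed from that log line; a sketch, not necessarily the exact command used):

```shell
# Reconstructed from the logged engine config; the model path is local
# to the container, and the run needs 2 GPUs for tensor parallelism.
python -m vllm.entrypoints.openai.api_server \
    --model /models/openbuddy-mixtral-8x7b-v15.2-AWQ \
    --quantization awq \
    --dtype float16 \
    --tensor-parallel-size 2 \
    --trust-remote-code
```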
(RayWorkerVllm pid=2491) /usr/local/lib/python3.10/dist-packages/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
(RayWorkerVllm pid=2491) warnings.warn("Initializing zero-element tensors is a no-op")
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/workspace/vllm/entrypoints/openai/api_server.py", line 729, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/workspace/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
File "/workspace/vllm/engine/async_llm_engine.py", line 269, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/workspace/vllm/engine/async_llm_engine.py", line 314, in _init_engine
return engine_class(*args, **kwargs)
File "/workspace/vllm/engine/llm_engine.py", line 108, in __init__
self._init_workers_ray(placement_group)
File "/workspace/vllm/engine/llm_engine.py", line 195, in _init_workers_ray
self._run_workers(
File "/workspace/vllm/engine/llm_engine.py", line 755, in _run_workers
self._run_workers_in_batch(workers, method, *args, **kwargs))
File "/workspace/vllm/engine/llm_engine.py", line 732, in _run_workers_in_batch
all_outputs = ray.get(all_outputs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2563, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(KeyError): ray::RayWorkerVllm.execute_method() (pid=2492, ip=172.17.0.2, actor_id=ccdc00b5ccaf06b948a44c5301000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x7f3cba935990>)
File "/workspace/vllm/engine/ray_utils.py", line 31, in execute_method
return executor(*args, **kwargs)
File "/workspace/vllm/worker/worker.py", line 79, in load_model
self.model_runner.load_model()
File "/workspace/vllm/worker/model_runner.py", line 57, in load_model
self.model = get_model(self.model_config)
File "/workspace/vllm/model_executor/model_loader.py", line 72, in get_model
model.load_weights(model_config.model, model_config.download_dir,
File "/workspace/vllm/model_executor/models/mixtral.py", line 430, in load_weights
param = params_dict[name]
KeyError: 'model.layers.26.block_sparse_moe.experts.0.w2.qweight'
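The KeyError suggests a name mismatch: the AWQ checkpoint contains quantized tensors such as `w2.qweight` for the MoE experts, but the parameter dict the engine built has no entry under that name (e.g. because the expert layers were constructed as unquantized `weight` parameters). A minimal sketch of the failing lookup, simplified from the `load_weights` pattern in the traceback (names illustrative):

```python
# Simplified sketch of the lookup in load_weights that raises the KeyError.
# If the engine built the experts with fp16 parameter names, an AWQ
# checkpoint tensor like "...w2.qweight" has no matching entry.
def load_weights_sketch(params_dict, checkpoint):
    for name, tensor in checkpoint.items():
        param = params_dict[name]  # raises KeyError on unexpected names
        # ... the real code would call param's weight_loader here ...

# Parameter names the model actually registered (illustrative):
params_dict = {"model.layers.26.block_sparse_moe.experts.0.w2.weight": None}
# Tensor names present in the AWQ checkpoint (illustrative):
checkpoint = {"model.layers.26.block_sparse_moe.experts.0.w2.qweight": None}

try:
    load_weights_sketch(params_dict, checkpoint)
except KeyError as e:
    print(f"KeyError: {e}")
```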
About this issue
- State: open
- Created 6 months ago
- Comments: 28 (15 by maintainers)
I'm running into a similar issue with the latest stable release on 2x 4090s.
The server never fully loads; it just hangs.
It is not an issue with the tokenizer. I saw that there is a high chance for Mixtral to generate an end token, especially when dates or numbers are involved. I tried to do some investigation, but I stopped.
Some of the results from back then can be found here: https://github.com/joennlae/vllm/blob/019ee402923d43cb225afaf356d559556d615aef/write_up.md
Also I was not able to reproduce this issue with TGI.
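To flag the failure pattern described above, i.e. a generation that stops with an end token right after a numeric token, a check could look roughly like this (a hypothetical helper; `EOS_ID = 2` is an assumption based on the `</s>` id in Llama-family tokenizers, which the number-2 confusion above hints at):

```python
# Hypothetical helper to flag generations that end with EOS immediately
# after a digit token -- the pattern described in the write-up above.
# EOS_ID = 2 is an assumption (Llama-family tokenizers map </s> to id 2).
EOS_ID = 2

def stops_after_number(token_texts, token_ids, eos_id=EOS_ID):
    """True if the stream ends with EOS and the preceding token is numeric."""
    if len(token_ids) < 2 or token_ids[-1] != eos_id:
        return False
    return token_texts[-2].strip().isdigit()

# Example stream: "... 19" "95" "</s>" (token ids are illustrative)
print(stops_after_number(["19", "95", "</s>"], [385, 1129, 2]))  # True
```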
Hi @joennlae, thanks a lot for noticing this issue with the tokenizer and the prediction of the number 2. I am facing this issue too. Did you manage to find a fix?

@casper-hansen I can confirm that works, for me at least.
I used my own AWQ quantization. Try quantizing it yourself and maybe that will fix the problem.
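For reference, re-quantizing with AutoAWQ looks roughly like this (a sketch assuming the AutoAWQ library and a base Mixtral checkpoint; the paths and `quant_config` values are illustrative, and the run needs substantial GPU memory):

```python
# Sketch of AWQ quantization with AutoAWQ; paths and quant_config
# values are illustrative assumptions, not the exact settings used above.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # base model (assumed)
quant_path = "mixtral-8x7b-instruct-awq"             # output directory
quant_config = {"zero_point": True, "q_group_size": 128,
                "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # runs calibration
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The resulting directory can then be passed to vLLM's `--model` flag with `--quantization awq`, as in the original report.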