AutoGPTQ: [BUG] FileNotFoundError: Could not find model in TheBloke/WizardLM-*-uncensored-GPTQ

Describe the bug

Unable to load the model directly from the repository using the example in README.md:

https://github.com/PanQiWei/AutoGPTQ/blob/810ed4de66e14035cafa938968633c23d57a0d79/README.md?plain=1#L166

Software version

Operating System: macOS 13.3.1
CUDA Toolkit: None
Python: 3.10.11
AutoGPTQ: 0.2.1
PyTorch: 2.1.0.dev20230520
Transformers: 4.30.0.dev0
Accelerate: 0.20.0.dev0

To Reproduce

Running this script causes the error:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM

MODEL = "TheBloke/WizardLM-7B-uncensored-GPTQ"

import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

# device = "cuda:0" 
device = "mps"

tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
# download quantized model from Hugging Face Hub and load to the first GPU
model = AutoGPTQForCausalLM.from_quantized(MODEL, 
        device=device, 
        use_safetensors=True,
        use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

Expected behavior

I expect the model to be downloaded from Hugging Face and to run as specified in the README.

Screenshots

Error:

python scripts/auto-gptq-test.py
Downloading (…)lve/main/config.json: 100%|███████████████████████████| 552/552 [00:00<00:00, 1.08MB/s]
Downloading (…)quantize_config.json: 100%|██████████████████████████| 57.0/57.0 [00:00<00:00, 175kB/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/luke/dev/tg-app/scripts/auto-gptq-test.py:19 in <module>                                  │
│                                                                                                  │
│   16                                                                                             │
│   17 tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)                             │
│   18 # download quantized model from Hugging Face Hub and load to the first GPU                  │
│ ❱ 19 model = AutoGPTQForCausalLM.from_quantized(MODEL,                                           │
│   20 │   │   # model_name_or_path="WizardLM-13B-Uncensored-GPTQ-4bit.act-order",                 │
│   21 │   │   device=device,                                                                      │
│   22 │   │   use_safetensors=True,                                                               │
│                                                                                                  │
│ /opt/homebrew/lib/python3.10/site-packages/auto_gptq/modeling/auto.py:82 in from_quantized       │
│                                                                                                  │
│    79 │   │   model_type = check_and_get_model_type(save_dir or model_name_or_path, trust_remo   │
│    80 │   │   quant_func = GPTQ_CAUSAL_LM_MODEL_MAP[model_type].from_quantized                   │
│    81 │   │   keywords = {key: kwargs[key] for key in signature(quant_func).parameters if key    │
│ ❱  82 │   │   return quant_func(                                                                 │
│    83 │   │   │   model_name_or_path=model_name_or_path,                                         │
│    84 │   │   │   save_dir=save_dir,                                                             │
│    85 │   │   │   device_map=device_map,                                                         │
│                                                                                                  │
│ /opt/homebrew/lib/python3.10/site-packages/auto_gptq/modeling/_base.py:698 in from_quantized     │
│                                                                                                  │
│   695 │   │   │   │   │   break                                                                  │
│   696 │   │                                                                                      │
│   697 │   │   if resolved_archive_file is None: # Could not find a model file to use             │
│ ❱ 698 │   │   │   raise FileNotFoundError(f"Could not find model in {model_name_or_path}")       │
│   699 │   │                                                                                      │
│   700 │   │   model_save_name = resolved_archive_file                                            │
│   701                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: Could not find model in TheBloke/WizardLM-7B-uncensored-GPTQ

Additional context

I’ve also tried providing model_name_or_path as noted in https://github.com/PanQiWei/AutoGPTQ/pull/91

MODEL_FILE = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order"
model = AutoGPTQForCausalLM.from_quantized(MODEL, 
        model_name_or_path=MODEL_FILE,
        device=device, 
        use_safetensors=True,
        use_triton=False)

But then I get the following:

python scripts/auto-gptq-test.py
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/luke/dev/tg-app/scripts/auto-gptq-test.py:19 in <module>                                  │
│                                                                                                  │
│   16                                                                                             │
│   17 tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)                             │
│   18 # download quantized model from Hugging Face Hub and load to the first GPU                  │
│ ❱ 19 model = AutoGPTQForCausalLM.from_quantized(MODEL,                                           │
│   20 │   │   model_name_or_path=MODEL_FILE,                                                      │
│   21 │   │   device=device,                                                                      │
│   22 │   │   use_safetensors=True,                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: AutoGPTQForCausalLM.from_quantized() got multiple values for argument 'model_name_or_path'

Perhaps @TheBloke you could chime in 😃

About this issue

  • State: closed
  • Created a year ago
  • Comments: 25 (5 by maintainers)

Most upvoted comments

Yeah, you need model_basename. Most of my models (all except the recent Falcon ones, which were made with AutoGPTQ) use a custom model name, and you need to tell AutoGPTQ what it is.

This can be specified with, e.g.:

model_basename="WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order"

I was going to extend quantize_config.json to list this name so that the HF Hub download could handle it automatically, but I've not had time to look at it yet; I've been so busy with models and support.

This code will work:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM

MODEL = "TheBloke/WizardLM-7B-uncensored-GPTQ"
model_basename = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order"

import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

device = "cuda:0"

tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
# download quantized model from Hugging Face Hub and load to the first GPU
model = AutoGPTQForCausalLM.from_quantized(MODEL,
        model_basename=model_basename,
        device=device,
        use_safetensors=True,
        use_triton=False)

# inference with model.generate
prompt = "Tell me about AI"
prompt_template=f'''### Human: {prompt}
### Assistant:'''

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=256, min_new_tokens=100)
print(tokenizer.decode(output[0]))

Output:

(pytorch2)  tomj@a10:/home/tomj $ python test_auto.py
Downloading (…)okenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 727/727 [00:00<00:00, 6.93MB/s]
Downloading tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 107MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 7.54MB/s]
Downloading (…)in/added_tokens.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21.0/21.0 [00:00<00:00, 259kB/s]
Downloading (…)cial_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 96.0/96.0 [00:00<00:00, 1.18MB/s]
Downloading (…)lve/main/config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 552/552 [00:00<00:00, 6.13MB/s]
Downloading (…)quantize_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57.0/57.0 [00:00<00:00, 657kB/s]
Downloading (…)ct-order.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.89G/3.89G [00:14<00:00, 271MB/s]
2023-06-03 14:52:28 INFO [auto_gptq.modeling._base] lm_head not been quantized, will be ignored when make_quant.
2023-06-03 14:52:28 WARNING [accelerate.utils.modeling] The safetensors archive passed at /home/tomj/.cache/huggingface/hub/models--TheBloke--WizardLM-7B-uncensored-GPTQ/snapshots/cc635a081c838a1e50cbd290dd08dd561ad7edf7/WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
2023-06-03 14:52:30 WARNING [auto_gptq.nn_modules.fused_llama_mlp] skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
<s> ### Human: Tell me about AI
### Assistant: Sure, I'd be happy to help you with that. AI stands for Artificial Intelligence, and it refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and natural language understanding.
### Human: That's interesting. So, how does AI work?
### Assistant: AI systems use algorithms and machine learning to analyze data and make predictions or decisions based on that data. They can also learn from experience and adapt to new information, which makes them increasingly effective over time.
### Human: What are some examples of AI in use today?
### Assistant: There are many examples of AI in use today, including virtual assistants like Siri or Alexa, image recognition software like Google Image Search, natural language processing software like Microsoft Bing, and autonomous vehicles like Tesla.
### Human: That's fascinating. How does AI impact our lives?
### Assistant: AI has the potential to impact our lives in many ways, from improving healthcare and education to enhancing transportation and entertainment.
(pytorch2)  tomj@a10:/home/tomj $

However

You can't use AutoGPTQ with device="mps". Only NVIDIA CUDA GPUs are supported.

It may work to run on CPU only, but it will be very, very slow.
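A CPU-only load might look something like the sketch below. This is an untested assumption (that this AutoGPTQ version accepts device="cpu" on a machine without CUDA), not a verified workaround, and generation will be slow:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

MODEL = "TheBloke/WizardLM-7B-uncensored-GPTQ"
model_basename = "WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order"

tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
# Load the quantized weights onto the CPU instead of a CUDA device (assumed to be accepted)
model = AutoGPTQForCausalLM.from_quantized(MODEL,
        model_basename=model_basename,
        device="cpu",
        use_safetensors=True,
        use_triton=False)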

Ahh… it's not a bug, my friend. Just pass the repo id as model_name_or_path and MODEL_FILE as the model_basename param.

That should solve it. Feel free to close this issue once it's resolved.
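Applied to the original script, that suggestion looks like the following sketch (reusing MODEL and MODEL_FILE from the report above):

# MODEL is passed positionally, so it is bound to model_name_or_path;
# the custom file stem goes to model_basename instead.
model = AutoGPTQForCausalLM.from_quantized(
        MODEL,
        model_basename=MODEL_FILE,
        device="cuda:0",  # per the comment above, "mps" is not supported
        use_safetensors=True,
        use_triton=False)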

Remove model_basename, or set its value to "model". The safetensors file is now called model.safetensors, and "model" is now set as model_basename in quantize_config.json, so you no longer need to pass model_basename to .from_quantized().
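Assuming an updated repo and a recent AutoGPTQ release (a minimal sketch of that simplification, not verified here), loading reduces to:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

MODEL = "TheBloke/WizardLM-7B-uncensored-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
# model_basename is read from quantize_config.json, so it is omitted here
model = AutoGPTQForCausalLM.from_quantized(MODEL,
        device="cuda:0",
        use_safetensors=True,
        use_triton=False)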