AutoGPTQ: Is This Inference Speed Slow?
So here is my script for inference:
import torch
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM
from huggingface_hub import hf_hub_download
from transformers import GenerationConfig
import time
#model_path = hf_hub_download(repo_id="TheBloke/WizardLM-Uncensored-Falcon-7B-GPTQ", filename="gptq_model-4bit-64g.safetensors")
# Download the model from HF and store it locally, then reference its location here:
#quantized_model_dir = model_path
tokenizer = AutoTokenizer.from_pretrained(
    "TheBloke/WizardLM-Uncensored-Falcon-7B-GPTQ",
    use_fast=False
)
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/WizardLM-Uncensored-Falcon-7B-GPTQ",
    use_triton=False,
    use_safetensors=True,
    device="cuda:0",
    trust_remote_code=True,
    max_memory={i: "13GiB" for i in range(torch.cuda.device_count())}
)
#pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device_map="auto")
prompt = "Write a story about alpaca"
prompt_template = f"### Instruction: {prompt}\n### Response:"
start = time.time()
tokens = tokenizer(prompt_template, return_tensors="pt").to(model.device)
gen_config = GenerationConfig(max_new_tokens=256, temperature=0.3, top_k=35, top_p=0.90, pad_token_id=tokenizer.eos_token_id)
output = model.generate(inputs=tokens.input_ids, generation_config=gen_config)
print(tokenizer.decode(output[0]))
end = time.time()
total_time = end - start  # elapsed time in seconds
# Count only the newly generated tokens, excluding the prompt tokens
num_new_tokens = output.shape[-1] - tokens.input_ids.shape[-1]
time_per_token = total_time / num_new_tokens
# Calculate tokens per second
tokens_per_second = num_new_tokens / total_time
# Print the results
print("Total inference time: {:.2f} s".format(total_time))
print("Number of tokens generated: {}".format(num_new_tokens))
print("Time per token: {:.2f} s/token".format(time_per_token))
print("Tokens per second: {:.2f} token/s".format(tokens_per_second))
This is what the output is:
### Instruction: Write a story about alpaca
### Response:Once upon a time, in a small village nestled in the mountains, there lived a young girl named Maya. Maya was known for her love of animals, especially alpacas. She spent most of her days tending to the village's small herd of alpacas, helping to groom and feed them.
One day, Maya received a letter in the mail from a far-off land. The letter was from a group of scientists who were studying alpacas and had discovered something amazing. They had found a way to use alpaca wool to create a new type of fabric that was both warm and waterproof.
Maya was thrilled at the prospect of using her beloved alpacas to help the world. She immediately set out to learn more about the new fabric and how it could be used. She spent months studying and experimenting, and eventually, she came up with a plan to create a line of clothing made entirely from the new fabric.
With the help of her friends and family, Maya began to weave the fabric into clothing, scarves, and even blankets. The clothing was not only warm and waterproof, but it was also incredibly soft and comfortable. Maya's designs were a hit, and soon her clothing was being sold all over the world.
Maya's success
Total inference time: 210.78 s
Number of tokens generated: 256
Time per token: 0.82 s/token
Tokens per second: 1.21 token/s
I think 1 token per second is too low for GPTQ on a GPU. Or is this normal? Is there anything I should adjust to increase the inference speed?
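(As a sanity check, a single-run timing like the one above also includes tokenization, decoding, and any first-run warm-up cost. Below is a minimal sketch of a tighter measurement, reusing the model, tokenizer, gen_config, and prompt_template defined in the script above; the warm-up pass and the number of runs are arbitrary choices, not anything AutoGPTQ requires.)

import time
import torch

def benchmark_generate(model, tokenizer, gen_config, prompt, runs=3):
    # Tokenize once, outside the timed region.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warm-up pass so one-off setup costs are not included in the timing.
    model.generate(inputs=inputs.input_ids, generation_config=gen_config)

    speeds = []
    for _ in range(runs):
        torch.cuda.synchronize()
        start = time.time()
        output = model.generate(inputs=inputs.input_ids, generation_config=gen_config)
        torch.cuda.synchronize()
        elapsed = time.time() - start

        # Count only the newly generated tokens, not the prompt tokens.
        new_tokens = output.shape[-1] - inputs.input_ids.shape[-1]
        speeds.append(new_tokens / elapsed)
        print(f"{new_tokens} tokens in {elapsed:.2f} s -> {new_tokens / elapsed:.2f} tokens/s")

    print(f"Mean: {sum(speeds) / len(speeds):.2f} tokens/s over {runs} runs")

benchmark_generate(model, tokenizer, gen_config, prompt_template)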
About this issue
- State: closed
- Created a year ago
- Comments: 33 (10 by maintainers)
I used your script exactly
I have like 40+ GPTQ models on my Hugging Face page. All of them should work with AutoGPTQ.
A model doesn’t need to be created with AutoGPTQ to work with AutoGPTQ; it is also compatible with models made with GPTQ-for-LLaMa.
Soon I will start making all models with AutoGPTQ. But you can use AutoGPTQ with all GPTQ models, so don’t worry about what made them. If you find a model that doesn’t work, ping me about it.
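For example, loading a model that was quantized with GPTQ-for-LLaMa uses the same from_quantized call. A rough sketch follows; the repo id and model_basename below are placeholders that you would replace with the actual repo and the safetensors filename (without extension) listed in that repo:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Placeholder repo/basename: substitute the repo you want and the actual
# quantized weights filename (without extension) from that repo.
repo_id = "TheBloke/some-GPTQ-model"
basename = "gptq_model-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)
model = AutoGPTQForCausalLM.from_quantized(
    repo_id,
    model_basename=basename,
    use_safetensors=True,
    device="cuda:0",
    trust_remote_code=True,
)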