unsloth: Unsloth breaks the inference?!

Hello, thanks for your contribution, it is really promising, but for some reason it breaks generation and inference. Here is an example:

from unsloth import FastLlamaModel
import torch
max_seq_length = 1024 # Can change to any number <= 4096
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# Load Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
    model_name = "TheBloke/Llama-2-7B-fp16", # Supports any llama model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = False,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
inputs = tokenizer.encode("the concept of ", return_tensors="pt", add_special_tokens = True).to(model.device)
answer = model.generate(inputs, max_new_tokens = 20)
tokenizer.batch_decode(answer, skip_special_tokens = False)

The output:

==((====))==  Unsloth: Fast Llama patching release 23.11
   \\   /|    GPU: A100-SXM4-40GB. Max memory: 39.587 GB
O^O/ \_/ \    CUDA compute capability = 8.0
\        /    Pytorch version: 2.1.0+cu118. CUDA Toolkit = 11.8
 "-____-"     bfloat16 support = TRUE

Loading checkpoint shards: 100% 2/2 [00:14<00:00, 6.61s/it]
['<s> the concept of 1<s> Tags\n \\\n \\\n \\\n \\\n \\\n \\\n \\\n \\\n']

I have tried more than four different Llama models, including yours, and I get the same issue.

About this issue

  • Original URL
  • State: open
  • Created 7 months ago
  • Comments: 17 (10 by maintainers)

Most upvoted comments

@namtranase Oh not yet πŸ˜ƒ I actually am working on making inference 2-4x faster, which I might push in an hour πŸ˜ƒ If you can, it would be wonderful if you could test it out πŸ˜ƒ

@namtranase @ammarali32 Just pushed 2x faster inference to the main branch!! πŸ˜ƒ Hope you can try it out πŸ˜ƒ It natively makes inference faster without any tricks - i.e. num_beams, batched generation, etc. are all faster πŸ˜ƒ

Call FastLanguageModel.for_inference(model) before doing inference to make it faster πŸ˜ƒ Call FastLanguageModel.for_training(model) to revert it back for finetuning.
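
For reference, a minimal sketch of that flow (assuming the model and tokenizer were loaded as in the snippet at the top of this issue):

from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)   # switch to the faster inference path
inputs = tokenizer("the concept of ", return_tensors = "pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens = 20)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True))

FastLanguageModel.for_training(model)    # revert before resuming finetuning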

https://github.com/unslothai/unsloth/assets/23090290/dab1ea44-34bc-4585-819f-3621614ff871

@namtranase Oh wait, did you get gibberish after finetuning or before finetuning? If after finetuning, I suggest you use the chat template directly for finetuning, and not Alpaca. That’s because the model you are using is already finetuned. I would use the non-chat version if you want Alpaca style.
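
To illustrate the chat-template route, here is a rough sketch of formatting one finetuning sample with the tokenizer's own chat template instead of an Alpaca-style prompt (the messages below are placeholders, not from this thread):

# Hypothetical sample: format it with the model's chat template rather than
# an Alpaca "### Instruction / ### Response" prompt.
messages = [
    {"role": "user", "content": "Summarise the article below."},
    {"role": "assistant", "content": "Here is a short summary ..."},
]
text = tokenizer.apply_chat_template(messages, tokenize = False)
# `text` is then the training string for this sample.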

@ammarali32 Oh it’s not yet supported - we’re working on getting it out in our next release πŸ˜ƒ

@namtranase Hey sorry back! For TinyLlama, you’ll have to follow exactly their prompt format:

# Install transformers from source - only needed for versions <= v4.34
# pip install git+https://github.com/huggingface/transformers.git
# pip install accelerate

import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16, device_map="auto")

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
# <|system|>
# You are a friendly chatbot who always responds in the style of a pirate.</s>
# <|user|>
# How many helicopters can a human eat in one sitting?</s>
# <|assistant|>
# ...

@ammarali32 I fixed it!!! It would be awesome if you could try it out! I also updated the Alpaca Colab example: https://colab.research.google.com/drive/1oW55fBmwzCOrBVX66RcpptL3a99qWBxb?usp=sharing
