unsloth: Unsloth breaks the inference?!
Hello, and thanks for your contribution - it is really promising, but for some reason it breaks generation and inference. Here is an example:
from unsloth import FastLlamaModel
import torch
max_seq_length = 1024 # Can change to any number <= 4096
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
# Load Llama model
model, tokenizer = FastLlamaModel.from_pretrained(
model_name = "TheBloke/Llama-2-7B-fp16", # Supports any llama model
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = False
# token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
inputs = tokenizer.encode("the concept of ", return_tensors="pt", add_special_tokens = True).to(model.device)
answer = model.generate(inputs, max_new_tokens = 20)
tokenizer.batch_decode(answer, skip_special_tokens = False)
The output:
==((====))== Unsloth: Fast Llama patching release 23.11
\\ /| GPU: A100-SXM4-40GB. Max memory: 39.587 GB
O^O/ \_/ \ CUDA compute capability = 8.0
\ / Pytorch version: 2.1.0+cu118. CUDA Toolkit = 11.8
"-____-" bfloat16 support = TRUE
Loading checkpoint shards: 100%
2/2 [00:14<00:00, 6.61s/it]
['<s> the concept of 1<s> Tags\n \\\n \\\n \\\n \\\n \\\n \\\n \\\n \\\n']
I have tried more than 4 different Llama models, including yours, and hit the same issue.
@namtranase Oh, not yet! I am actually working on making inference 2-4x faster, which I might push in an hour. If you can, it would be wonderful if you could test it out!
@namtranase @ammarali32 Just pushed 2x faster inference to the main branch! Hope you can try it out. It natively makes inference faster without any tricks - i.e. num_beams, batched generation, etc. are all faster.
Call
FastLanguageModel.for_inference(model)
before doing inference to make it faster. Call
FastLanguageModel.for_training(model)
to revert it back for finetuning.
https://github.com/unslothai/unsloth/assets/23090290/dab1ea44-34bc-4585-819f-3621614ff871
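Roughly, the call pattern looks like the sketch below. It reuses the loading arguments from the reproduction above (model name, sequence length, and generation settings are just the ones from the original report), and uses the FastLanguageModel loader named in this comment rather than the FastLlamaModel class from the issue:

from unsloth import FastLanguageModel

# Load the model as in the reproduction above
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "TheBloke/Llama-2-7B-fp16",
    max_seq_length = 1024,
    dtype = None,
    load_in_4bit = False,
)

# Switch to the fast native-inference path before generating
FastLanguageModel.for_inference(model)
inputs = tokenizer.encode("the concept of ", return_tensors = "pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens = 20)
print(tokenizer.batch_decode(outputs, skip_special_tokens = False))

# Switch back before resuming finetuning
FastLanguageModel.for_training(model)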
@namtranase Oh wait, did you get gibberish after finetuning or before finetuning? If after finetuning, I suggest you use the chat template directly for finetuning, not Alpaca. That's because the model you are using is already finetuned. I would use the non-chat version if you want the Alpaca style.
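As a rough illustration of "use the chat template directly", Hugging Face tokenizers expose apply_chat_template; this is only a sketch, the messages are made-up placeholders, and it only works if the tokenizer ships a chat template:

# Format one training example with the model's own chat template instead of the Alpaca prompt
messages = [
    {"role": "user", "content": "Explain the concept of attention."},                 # placeholder turn
    {"role": "assistant", "content": "Attention lets the model weigh input tokens."}, # placeholder turn
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,  # return the formatted training string rather than token ids
)
print(text)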
@ammarali32 Oh, it's not yet supported - we're working on getting it out in our next release.
@namtranase Hey, sorry, I'm back! For TinyLlama, you'll have to follow their prompt format exactly:
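A hedged sketch, assuming the chat checkpoint TinyLlama/TinyLlama-1.1B-Chat-v1.0 (which ships a Zephyr-style <|user|>/<|assistant|> template); letting the tokenizer build the prompt keeps the tags from being hand-written:

from unsloth import FastLanguageModel

# Assumed model id - swap in whichever TinyLlama checkpoint you are actually using
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_seq_length = 1024,
    dtype = None,
    load_in_4bit = False,
)

FastLanguageModel.for_inference(model)

# The tokenizer applies TinyLlama's own chat template, so the prompt tags match the model
messages = [{"role": "user", "content": "Explain the concept of attention."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt = True, return_tensors = "pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens = 64)
print(tokenizer.batch_decode(outputs, skip_special_tokens = False))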
@ammarali32 I fixed it! It would be awesome if you could try it out! I also updated the Alpaca Colab example: https://colab.research.google.com/drive/1oW55fBmwzCOrBVX66RcpptL3a99qWBxb?usp=sharing