transformers: KeyError: 'mistral' (transformers == 4.30) and ImportError: Using `load_in_8bit=True` requires Accelerate (transformers > 4.30)

System Info

transformers versions: 4.30, 4.31, 4.34, 4.35
Python versions: 3.8, 3.11.1, 3.11.5

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

1. # Loading packages
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, StoppingCriteria, StoppingCriteriaList
from langchain.document_loaders import UnstructuredFileLoader
from langchain.chains.summarize import load_summarize_chain
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
import accelerate

hf_auth = "<your Hugging Face access token>"  # placeholder: token used for the gated/hub downloads below

base_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
# Note: this loads a full-precision copy first; a 4-bit quantized copy of the same model is loaded again below
baseline = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
print("loaded all packages")

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
print("Printing Device...")
print(device)

print("loading model....")
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth,
    # load_in_8bit=False
)

# 4-bit quantization config (fixed typo: the keyword is bnb_4bit_quant_type, not bnb_4bit_quant_tyoe)
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16,
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    # load_in_8bit=False,
    device_map='auto',
    use_auth_token=hf_auth,
    offload_folder="save_folder"
)

# Load the tokenizer and build a text-generation pipeline
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=4096,
    repetition_penalty=1.1
)
print("loaded model")
llm = HuggingFacePipeline(pipeline=generate_text)

2. Run the script above.

Expected behavior

Inference should run and return generated text, instead of failing with `KeyError: 'mistral'` (transformers 4.30) or `ImportError: Using load_in_8bit=True requires Accelerate` (transformers > 4.30).
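For reference, a minimal call the pipeline above should be able to serve once loading succeeds (the prompt is purely illustrative):

# The text-generation pipeline returns a list of dicts with a "generated_text" key.
result = generate_text("Summarize what 4-bit quantization does in one sentence.")
print(result[0]["generated_text"])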

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Comments: 31 (10 by maintainers)

Most upvoted comments

Hi @jitender-cnvrg @Abhaycnvrg, thanks a lot for iterating. Loading a model in 4-bit / 8-bit should work out of the box on a free-tier Google Colab instance; make sure to select T4 as the runtime type. I made a quick example here and verified that it works: https://colab.research.google.com/drive/1zia3Q9FXhNHOhdwA9p8zD4qgPEWkZvHl?usp=sharing
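For readers who cannot open the notebook, the load it demonstrates boils down to roughly the following sketch (assuming transformers >= 4.34 plus accelerate and bitsandbytes are installed and a T4-class GPU is available; the notebook's exact contents may differ):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"

# NF4 4-bit quantization; note the keyword is bnb_4bit_quant_type
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # requires accelerate
)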

On CPU, yes, but it might be slow. Please consider running Mistral-7B on a free-tier Google Colab instance with bitsandbytes 4-bit instead:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ybelkada/Mistral-7B-v0.1-bf16-sharded"
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)

I advise using https://huggingface.co/ybelkada/Mistral-7B-v0.1-bf16-sharded, since its weights are split into smaller shards (~2 GB each); otherwise loading the Mistral weights on Google Colab will lead to a CPU OOM.
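If you want to produce a similarly small-sharded checkpoint yourself, `save_pretrained` accepts a `max_shard_size` argument. A sketch (the output directory name is arbitrary, and the re-sharding step itself must run on a machine with enough RAM to hold the full model once):

from transformers import AutoModelForCausalLM

# Re-save the checkpoint in ~2 GB shards so it can later be loaded
# on machines with limited CPU RAM (e.g. a free Colab instance).
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype="auto")
model.save_pretrained("Mistral-7B-v0.1-sharded", max_shard_size="2GB")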

python == 3.9.16 / transformers == 4.35.0 / accelerate 0.25.dev0 (from source) / bitsandbytes 0.41.1

Hi everyone! I used to run into this issue occasionally on Google Colab when the libraries were not installed correctly. If you are using a Kaggle or Google Colab notebook, can you try deleting the runtime and restarting it? If that does not help, retry everything in a fresh environment, making sure to install all the required packages: `pip install transformers accelerate bitsandbytes`.
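A quick way to confirm that the restarted kernel actually sees the freshly installed packages is to check that they import and print their versions (transformers needs to be recent enough to know the `mistral` architecture, i.e. >= 4.34):

from importlib.metadata import version

import accelerate      # must import cleanly for load_in_8bit / load_in_4bit
import bitsandbytes    # must import cleanly for 4-bit / 8-bit quantization
import transformers    # needs Mistral support, added in v4.34

for pkg in ("transformers", "accelerate", "bitsandbytes"):
    print(pkg, version(pkg))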