ipex-llm: RuntimeError: PyTorch is not linked with support for xpu devices

RuntimeError: PyTorch is not linked with support for xpu devices

I installed the BigDL GPU version on Windows 11 following https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html.
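
A quick sanity check that the XPU backend is registered with PyTorch (a minimal sketch, assuming intel_extension_for_pytorch was installed together with the BigDL GPU packages):

import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' backend with PyTorch

print(torch.xpu.is_available())       # expected to print True if the driver and oneAPI setup are correct
print(torch.xpu.device_count())       # number of visible XPU devices
print(torch.xpu.get_device_name(0))   # e.g. the name of the Arc GPU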

When I execute the code below (the model is chatglm3-6b):

import torch
import time
import argparse
import numpy as np

from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

# you could tune the prompt based on your own model,
# here the prompt tuning refers to https://github.com/THUDM/ChatGLM3/blob/main/PROMPT.md
CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ChatGLM3 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="d:/chatglm3-6b",
                        help='The huggingface repo id for the ChatGLM3 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="AI是什么?",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load the model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    model = AutoModel.from_pretrained(model_path,
                                      load_in_4bit=True,
                                      trust_remote_code=True)
    
    model.save_low_bit("bigdl_chatglm3-6b-q4_0.bin")
    # Run the optimized model on the Intel GPU
    model = model.to('xpu')

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)
    
    # Generate predicted tokens
    with torch.inference_mode():
        prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt=args.prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt")
        st = time.time()
        # if your selected model is capable of utilizing previous key/value attentions
        # to enhance decoding speed, but has `"use_cache": false` in its model config,
        # it is important to set `use_cache=True` explicitly in the `generate` function
        # to obtain optimal performance with BigDL-LLM INT4 optimizations
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        end = time.time()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output', '-'*20)
        print(output_str)

the following error occurs: RuntimeError: PyTorch is not linked with support for xpu devices

Does BigDL support running ChatGLM3-6B on an Arc GPU right now?

About this issue

  • Original URL
  • State: open
  • Created 6 months ago
  • Comments: 18 (9 by maintainers)

Most upvoted comments

Don’t worry, the stream_chat API will move the input tokens to the model’s device automatically (here), so you only need to move the model to xpu.
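
Roughly, the difference looks like this (a minimal sketch, assuming the model and tokenizer are already loaded and the model has been moved to 'xpu' as in the code below):

# stream_chat: pass the prompt as a plain string; the API moves the input
# tokens to the model's device (xpu) internally.
for response, history in model.stream_chat(tokenizer, "What is Intel?", history=[]):
    pass  # consume the streamed response

# generate: the encoded input_ids tensor has to be moved to xpu explicitly.
input_ids = tokenizer.encode("What is Intel?", return_tensors="pt").to('xpu')
output = model.generate(input_ids, max_new_tokens=32)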

Yes!! Thank you very much for the guidance! It works!!!

Tested sample code

import streamlit as st
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex
import torch

# Set the page title, icon, and layout
st.set_page_config(
    page_title="ChatGLM3-6B+BigDL-LLM演示",
    page_icon=":robot:",
    layout="wide"
)
# Specify the local path to chatglm3-6b
model_path = "d:/chatglm3-6b"

@st.cache_resource
def get_model():
    # Load the ChatGLM3-6B model with INT4 quantization
    model = AutoModel.from_pretrained(model_path,
                                      load_in_4bit=True,
                                      trust_remote_code=True)
    model = model.to('xpu')
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)
    return tokenizer, model

# Load the ChatGLM3 model and tokenizer
tokenizer, model = get_model()

# Initialize chat history and past key values
if "history" not in st.session_state:
    st.session_state.history = []
if "past_key_values" not in st.session_state:
    st.session_state.past_key_values = None

# Set max_length, top_p, and temperature
max_length = st.sidebar.slider("max_length", 0, 32768, 8192, step=1)
top_p = st.sidebar.slider("top_p", 0.0, 1.0, 0.8, step=0.01)
temperature = st.sidebar.slider("temperature", 0.0, 1.0, 0.6, step=0.01)

# Button to clear the chat history
buttonClean = st.sidebar.button("Clear chat history", key="clean")
if buttonClean:
    st.session_state.history = []
    st.session_state.past_key_values = None
    st.rerun()

# Render the chat history
for i, message in enumerate(st.session_state.history):
    if message["role"] == "user":
        with st.chat_message(name="user", avatar="user"):
            st.markdown(message["content"])
    else:
        with st.chat_message(name="assistant", avatar="assistant"):
            st.markdown(message["content"])

# Input and output placeholders
with st.chat_message(name="user", avatar="user"):
    input_placeholder = st.empty()
with st.chat_message(name="assistant", avatar="assistant"):
    message_placeholder = st.empty()

# Get user input
prompt_text = st.chat_input("Please enter your question")

# If the user entered something, generate a reply
if prompt_text:

    input_placeholder.markdown(prompt_text)
    history = st.session_state.history
    past_key_values = st.session_state.past_key_values
    for response, history, past_key_values in model.stream_chat(
        tokenizer,
        prompt_text,
        history,
        past_key_values=past_key_values,
        max_length=max_length,
        top_p=top_p,
        temperature=temperature,
        return_past_key_values=True,
    ):
        message_placeholder.markdown(response)

    # Update chat history and past key values
    st.session_state.history = history
    st.session_state.past_key_values = past_key_values


Sorry, on our Windows A770 machines the A770 is always the default xpu device, so we cannot reproduce this error. You can change 'xpu:1' back to 'xpu' and add optimize_model=False in from_pretrained to run it on the iGPU. Or you can change 'xpu:1' back to 'xpu' and set ONEAPI_DEVICE_SELECTOR=level_zero:1 before running this example; that should make the A770 the default device.
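
For reference, a minimal sketch of setting the selector and checking which xpu index maps to which GPU (the level_zero index and the device ordering may differ on your machine, so the "1" below is only an example):

import os
# Assumption: ONEAPI_DEVICE_SELECTOR has to be set before the Level Zero runtime
# initializes, i.e. before importing intel_extension_for_pytorch.
os.environ.setdefault("ONEAPI_DEVICE_SELECTOR", "level_zero:1")

import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' backend

# List the XPU devices PyTorch can see, so you know which index is the A770M.
for i in range(torch.xpu.device_count()):
    print(i, torch.xpu.get_device_name(i))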

Maybe we should test on a laptop, because the A770M is a laptop GPU. I’ll try to reproduce this error on a laptop.

My machine is a NUC12 Serpent Canyon (蝰蛇峡谷): i7-12700H + Arc A770M.

I changed ‘xpu:1’ back to ‘xpu’ and set ONEAPI_DEVICE_SELECTOR=level_zero:1, and it works!!! Thank you very much!!!

I ran the following code:

import time
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex
import torch

CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

# Specify the local path to chatglm3-6b
model_path = "d:/chatglm3-6b"

# Load the ChatGLM3-6B model with INT4 quantization
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True)
# run the optimized model on Intel GPU
model = model.to('xpu')

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
# Build the ChatGLM3-format prompt
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="What is Intel?")

# Encode the prompt and move it to the xpu device
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids = input_ids.to('xpu')
st = time.time()
# Run inference and generate tokens
output = model.generate(input_ids,max_new_tokens=32)
end = time.time()
# Decode the generated tokens and print them
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end-st} s')
print('-'*20, 'Prompt', '-'*20)
print(prompt)
print('-'*20, 'Output', '-'*20)
print(output_str)
