ipex-llm: Unable to run Speech T5 on XPU

Hello,

I am trying to run SpeechT5 on XPU but am unable to. The model is https://huggingface.co/microsoft/speecht5_tts, and here is my code:

from bigdl.llm.transformers import AutoModelForSpeechSeq2Seq, AutoModelForCausalLM
import intel_extension_for_pytorch as ipex
from bigdl.llm import optimize_model

model = AutoModelForCausalLM.from_pretrained("microsoft/speecht5_tts",
                                                    torch_dtype="auto",
                                                    trust_remote_code=True,
                                                    low_cpu_mem_usage=True
                                                    )
model = optimize_model(model)
model = model.to('xpu')

and I am getting the following error:

raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers.models.speecht5.configuration_speecht5.SpeechT5Config'> for this kind of AutoModel: AutoModelForCausalLM.
Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, CodeGenConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, ElectraConfig, ErnieConfig, FalconConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, LlamaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MusicgenConfig, MvpConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, TransfoXLConfig, TrOCRConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

Does BigDL support text-to-speech models? Or am I missing something?

Regards, Nedim

About this issue

  • Original URL
  • State: open
  • Created 5 months ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

Hi @nedo99,

For bigdl-llm>=2.5.0b20240204, you can run SpeechT5 with BigDL-LLM optimization as below 😃

Env (PyTorch 2.1 with oneAPI 2024.0):

conda create -n speecht5-test python=3.9
conda activate speecht5-test

pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install datasets soundfile 

Runtime configuration: follow the instructions here
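The linked runtime-configuration page is not quoted in this thread. As a rough sketch only (assuming a Linux machine with oneAPI installed under the default /opt/intel/oneapi path; adjust paths and variables to your own install and GPU), the usual BigDL-LLM XPU setup looks like:

```shell
# Source the oneAPI environment so the SYCL / Level Zero runtimes are on the path
# (the install path is an assumption; adjust to your oneAPI location)
source /opt/intel/oneapi/setvars.sh

# Variables commonly recommended for BigDL-LLM on Intel GPUs
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```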

Code:

import torch
from transformers import SpeechT5Processor, SpeechT5HifiGan, SpeechT5ForTextToSpeech
from datasets import load_dataset
import soundfile as sf
import time

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

from bigdl.llm import optimize_model
model = optimize_model(model, modules_to_not_convert=["speech_decoder_postnet.feat_out",
                                                      "speech_decoder_postnet.prob_out"]) 
model = model.to('xpu')
vocoder = vocoder.to('xpu')

text = "On a cold winter night, a lonely traveler found a shimmering stone in the snow, unaware that it would lead him to a world full of wonders."
inputs = processor(text=text, return_tensors="pt").to('xpu')

# load xvector containing speaker's voice characteristics from a dataset
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors",
                                  split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0).to('xpu')

with torch.inference_mode():
  # warmup: the first call pays one-off costs (kernel compilation, cache fills)
  st = time.perf_counter()
  speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
  torch.xpu.synchronize()  # XPU ops are asynchronous; wait before reading the clock
  print(f'Warmup time: {time.perf_counter() - st}')

  st1 = time.perf_counter()
  speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
  torch.xpu.synchronize()
  st2 = time.perf_counter()
  print(f"Inference time: {st2-st1}")

sf.write("speech_bigdl_llm.wav", speech.to('cpu').numpy(), samplerate=16000)
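As an aside, the warmup-then-measure pattern in the snippet above is worth keeping for any benchmark: the first call includes one-off costs that would otherwise skew the number. A device-agnostic sketch of the same idea (the `timed` helper below is illustrative, not part of BigDL-LLM):

```python
import time

def timed(fn, *args, warmup=1, repeats=3):
    """Average wall-clock seconds per call, after untimed warmup runs."""
    # Untimed warmup runs absorb one-off costs (JIT, caches, lazy init)
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

avg = timed(lambda: sum(range(100_000)))
print(f"avg seconds per call: {avg:.6f}")
```

On an accelerator you would additionally synchronize the device (e.g. `torch.xpu.synchronize()`) before each clock read, as in the snippet above.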

Please let us know for any further problems 😃