transformers: UMT5 incredibly slow in generating
System Info
- transformers version: 4.33.1
- Platform: Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.34
- Python version: 3.11.4
- Huggingface_hub version: 0.16.4
- Safetensors version: 0.3.1
- Accelerate version: 0.22.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu117 (True)
Who can help?
@ArthurZucker and @younesbelkada and @gante
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
import time

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

if __name__ == "__main__":
    timings = {}
    for model_name in ("facebook/mbart-large-50-many-to-one-mmt", "google/umt5-small"):
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map={"": "cuda"})
        print(model_name, model.num_parameters())
        # google/umt5-small 306601984
        # facebook/mbart-large-50-many-to-one-mmt 1122990080
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        gen_config = GenerationConfig.from_pretrained(
            model_name,
            max_new_tokens=200,
            max_length=None,
            num_beams=1,
        )
        text = "I would really like to eat some cookies now."
        if "t5" in model_name:
            text = f"translate English to Dutch: {text}"
        encoded = tokenizer(text, return_tensors="pt")
        encoded = {k: v.to(model.device) for k, v in encoded.items()}
        start_time = time.perf_counter_ns()
        for _ in range(100):
            _ = model.generate(**encoded, generation_config=gen_config)
        timings[model_name] = time.perf_counter_ns() - start_time

    for model_name, duration in timings.items():
        print(f"Generation duration for {model_name.split('/')[1]}:\t{duration}")
# Generation duration for mbart-large-50-many-to-one-mmt: 22413427363
# Generation duration for umt5-small: 207906791077
So despite UMT5-small having only about 27% of the parameters of the MBART-large model, it is 9-10x slower!
(I also tried with a gc.collect() after each generation.)
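As a quick sanity check on those two ratios, computed directly from the numbers printed above:

umt5_params, mbart_params = 306_601_984, 1_122_990_080
umt5_ns, mbart_ns = 207_906_791_077, 22_413_427_363
print(f"parameter ratio: {umt5_params / mbart_params:.1%}")  # ~27.3%
print(f"slowdown factor: {umt5_ns / mbart_ns:.1f}x")         # ~9.3x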
Expected behavior
Faster inference/generation speed. Training speed is fine, so I assume caching of past states is not (correctly) implemented, but I might be wrong. This PR on adding caching to T5 by @patrickvonplaten might be related: https://github.com/huggingface/transformers/pull/3682
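A minimal sketch of how one could probe that hypothesis (assuming use_cache is forwarded to the model by generate as usual; if past-state caching were broken or unused, the two timings below should come out roughly the same):

import time

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("google/umt5-small").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("google/umt5-small")
encoded = tokenizer(
    "translate English to Dutch: I would really like to eat some cookies now.",
    return_tensors="pt",
).to(model.device)

_ = model.generate(**encoded, max_new_tokens=5)  # warm-up run, excluded from timing

for use_cache in (True, False):
    torch.cuda.synchronize()
    start = time.perf_counter()
    _ = model.generate(**encoded, max_new_tokens=200, num_beams=1, use_cache=use_cache)
    torch.cuda.synchronize()
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")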
About this issue
- State: open
- Created 10 months ago
- Comments: 17
We should pre-compute all the positional bias w.r.t. the max sequence length of the model, cache it, and only fetch the ones we need! Same for T5, but it's already pretty fast. Will open a PR!
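To illustrate the idea (a minimal sketch only; CachedRelativeBias and the bucket_fn callable are hypothetical stand-ins for the bias embedding and bucketing logic inside the (U)MT5 attention layers, not the actual PR):

import torch
from torch import nn


class CachedRelativeBias(nn.Module):
    """Sketch: build the full (1, n_heads, max_len, max_len) position bias once,
    then slice out what each decoding step needs instead of recomputing it."""

    def __init__(self, bias_embedding: nn.Embedding, bucket_fn, max_len: int):
        super().__init__()
        self.bias_embedding = bias_embedding  # maps relative-position bucket id -> per-head bias
        self.bucket_fn = bucket_fn            # maps relative positions to bucket ids (long tensor)
        self.max_len = max_len
        self._cached = None                   # filled lazily on first use

    def _build(self) -> torch.Tensor:
        device = self.bias_embedding.weight.device
        positions = torch.arange(self.max_len, device=device)
        relative_position = positions[None, :] - positions[:, None]  # (max_len, max_len)
        buckets = self.bucket_fn(relative_position)                  # (max_len, max_len)
        bias = self.bias_embedding(buckets)                          # (max_len, max_len, n_heads)
        return bias.permute(2, 0, 1).unsqueeze(0)                    # (1, n_heads, max_len, max_len)

    def forward(self, query_len: int, key_len: int) -> torch.Tensor:
        if self._cached is None:
            self._cached = self._build()
        # fetch only the block covering the current query positions and all key positions
        return self._cached[:, :, key_len - query_len : key_len, :key_len]

During incremental decoding query_len is 1, so each step reduces to a cheap slice instead of re-running the bucketing and embedding lookup over the whole sequence.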
I can’t reproduce your results. Do you have accelerate installed? Can you share your transformers-cli env?

@ArthurZucker also, you can do encoded = tokenizer(text, return_tensors="pt").to(model.device) 😉

Hey, thanks for reporting, I’ll investigate! Not sure why you would need to run 100 iterations of the generate method this way, but for one generation: […] For 10 iterations: 16.204639673233032, so not sure if this is simply a bug in the time logging?
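For what it’s worth, here is a minimal sketch of a more careful timing loop that should rule out a logging artifact (the time_generate helper is mine and assumes the same model, encoded, and gen_config objects as in the reproduction above; warm-up runs plus torch.cuda.synchronize() keep lazy initialization and asynchronous kernel launches out of the measurement):

import time

import torch


def time_generate(model, encoded, gen_config, iterations=10, warmup=2):
    """Average seconds per generate() call, with warm-up and explicit GPU sync."""
    for _ in range(warmup):  # warm-up runs, excluded from the measurement
        model.generate(**encoded, generation_config=gen_config)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iterations):
        model.generate(**encoded, generation_config=gen_config)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iterations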