transformers: UMT5 incredibly slow at generation

System Info

  • transformers version: 4.33.1
  • Platform: Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.34
  • Python version: 3.11.4
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.22.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)

Who can help?

@ArthurZucker and @younesbelkada and @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

import time

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig


if __name__ == "__main__":
    timings = {}

    for model_name in ("facebook/mbart-large-50-many-to-one-mmt", "google/umt5-small"):
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map={"": "cuda"})
        print(model_name, model.num_parameters())
        # google/umt5-small                        306601984
        # facebook/mbart-large-50-many-to-one-mmt 1122990080
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        gen_config = GenerationConfig.from_pretrained(
            model_name,
            max_new_tokens=200,
            max_length=None,
            num_beams=1,
        )
        text = "I would really like to eat some cookies now."
        if "t5" in model_name:
            text = f"translate English to Dutch: {text}"

        encoded = tokenizer(text, return_tensors="pt")
        encoded = {k: v.to(model.device) for k, v in encoded.items()}
        start_time = time.perf_counter_ns()
        for _ in range(100):
            _ = model.generate(**encoded, generation_config=gen_config)

        timings[model_name] = time.perf_counter_ns() - start_time

    for model_name, duration_ns in timings.items():  # avoid shadowing the timings dict
        print(f"Generation duration for {model_name.split('/')[1]}:\t{duration_ns}")
        # Generation duration for mbart-large-50-many-to-one-mmt:  22413427363
        # Generation duration for umt5-small:                     207906791077

So despite UMT5-small having only about 27% of the parameters of the MBART-large model, it is 9-10x slower!
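For reference, converting the nanosecond totals printed above into seconds confirms the ratio:

```python
# Nanosecond totals from the run above (100 generations each).
mbart_ns = 22_413_427_363
umt5_ns = 207_906_791_077

mbart_s = mbart_ns / 1e9    # total seconds for mbart (~22.4 s)
umt5_s = umt5_ns / 1e9      # total seconds for umt5 (~207.9 s)
ratio = umt5_ns / mbart_ns  # how much slower umt5 is (~9.3x)

print(f"mbart: {mbart_s:.1f}s, umt5: {umt5_s:.1f}s, ratio: {ratio:.1f}x")
```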

(I also tried with a gc.collect() after each generation.)

Expected behavior

Faster inference/generation speed. Training speed is fine, so I assume caching of past states is not (correctly) implemented, but I might be wrong. This PR on adding caching to T5 by @patrickvonplaten might be related: https://github.com/huggingface/transformers/pull/3682

About this issue

  • State: open
  • Created 10 months ago
  • Comments: 17

Most upvoted comments

We should pre-compute all the positional bias with respect to the max sequence length of the model, cache it, and only fetch the ones we need! Same for T5, but it’s already pretty fast. Will open a PR!
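The idea can be sketched in pure Python. This is an illustrative re-implementation, not transformers' actual code: `relative_position_bucket` is a simplified T5-style bucketing function, and `PositionBiasCache` is a hypothetical helper that builds the bucket table once for the model's maximum length and then slices out what each decoding step needs instead of recomputing it every forward pass.

```python
import math

def relative_position_bucket(rel_pos, num_buckets=32, max_distance=128):
    # Simplified T5-style bucketing for the decoder (unidirectional) case:
    # small distances get exact buckets, larger ones log-spaced buckets.
    rel_pos = max(-rel_pos, 0)
    max_exact = num_buckets // 2
    if rel_pos < max_exact:
        return rel_pos
    log_bucket = max_exact + int(
        math.log(rel_pos / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return min(log_bucket, num_buckets - 1)

class PositionBiasCache:
    """Precompute the bucket table once for max_len, then slice per step."""

    def __init__(self, max_len, **kwargs):
        self.table = [
            [relative_position_bucket(k - q, **kwargs) for k in range(max_len)]
            for q in range(max_len)
        ]

    def fetch(self, query_len, key_len):
        # During incremental decoding query_len is 1, so this returns a
        # single precomputed row instead of rebuilding the whole bias.
        return [row[:key_len] for row in self.table[key_len - query_len:key_len]]
```

During generation, each step would call `cache.fetch(1, current_length)` rather than recomputing buckets for every layer and position.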

I can’t reproduce your results. Do you have accelerate installed? Can you share your transformers-cli env? @ArthurZucker

also you can do encoded = tokenizer(text, return_tensors="pt").to(model.device) 😉

Hey! Thanks for reporting, I’ll investigate! Not sure why you would need to run 100 iterations of the generate method this way, but for one generation:

  • umt5:
>>> start_time = time.time();model.generate(**encoded, generation_config=gen_config);print(time.time()-start_time)
1.5145587921142578
  • mbart:
>>> start_time = time.time();model.generate(**encoded, generation_config=gen_config);print(time.time()-start_time)
1.5777842998504639

For 10 iterations:

  • umt5: 16.204639673233032
  • mbart: 16.71877956390381

So I’m not sure if this is simply a bug in the time logging?
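To rule out measurement artifacts (e.g. CUDA kernel compilation or allocator warm-up dominating the first calls), a more robust pattern is to discard warm-up iterations and report a median per-call time. The `bench` helper below is a hypothetical sketch of that pattern, not part of transformers:

```python
import statistics
import time

def bench(fn, warmup=3, iters=10):
    # Discard warm-up calls so one-time costs (kernel compilation,
    # memory allocator growth) don't skew the measurement, then
    # return the median per-call duration in seconds.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```

Usage would be e.g. `bench(lambda: model.generate(**encoded, generation_config=gen_config))` for each model, making the per-generation times directly comparable.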