transformers: UMT5 incredibly slow at generation

System Info

  • transformers version: 4.33.1
  • Platform: Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.34
  • Python version: 3.11.4
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.22.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)

Who can help?

@ArthurZucker and @younesbelkada and @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

import time

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig


if __name__ == "__main__":
    timings = {}

    for model_name in ("facebook/mbart-large-50-many-to-one-mmt", "google/umt5-small"):
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map={"": "cuda"})
        print(model_name, model.num_parameters())
        # google/umt5-small                        306601984
        # facebook/mbart-large-50-many-to-one-mmt 1122990080
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        gen_config = GenerationConfig.from_pretrained(
            model_name,
            max_new_tokens=200,
            max_length=None,
            num_beams=1,
        )
        text = "I would really like to eat some cookies now."
        if "t5" in model_name:
            text = f"translate English to Dutch: {text}"

        encoded = tokenizer(text, return_tensors="pt")
        encoded = {k: v.to(model.device) for k, v in encoded.items()}
        start_time = time.perf_counter_ns()
        for _ in range(100):
            _ = model.generate(**encoded, generation_config=gen_config)

        timings[model_name] = time.perf_counter_ns() - start_time

    for model_name, duration_ns in timings.items():  # avoid shadowing the timings dict
        print(f"Generation duration for {model_name.split('/')[1]}:\t{duration_ns}")
        # Generation duration for mbart-large-50-many-to-one-mmt:  22413427363
        # Generation duration for umt5-small:                     207906791077

So despite UMT5-small having only about 27% of the parameters of the MBART-large model, it is 9-10x slower!
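For reference, converting the nanosecond totals printed above into seconds confirms the ratio:

```python
# Nanosecond totals from the run above (100 generations each).
mbart_ns = 22_413_427_363
umt5_ns = 207_906_791_077

mbart_s = mbart_ns / 1e9    # total seconds for mbart (~22.4 s)
umt5_s = umt5_ns / 1e9      # total seconds for umt5 (~207.9 s)
ratio = umt5_ns / mbart_ns  # how much slower umt5 is (~9.3x)

print(f"mbart: {mbart_s:.1f}s, umt5: {umt5_s:.1f}s, ratio: {ratio:.1f}x")
```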

(I also tried with a gc.collect() after each generation.)

Expected behavior

Faster inference/generation speed. Training speed is fine, so I assume caching of past states is not (correctly) implemented, but I might be wrong. This PR on adding caching to T5 by @patrickvonplaten might be related: https://github.com/huggingface/transformers/pull/3682

About this issue

  • State: open
  • Created 10 months ago
  • Comments: 17

Most upvoted comments

We should pre-compute all the positional bias with respect to the max sequence length of the model, cache it, and only fetch the ones we need! Same for T5, but it’s already pretty fast. Will open a PR!
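The idea can be sketched in pure Python. This is an illustrative re-implementation, not transformers' actual code: `relative_position_bucket` is a simplified T5-style bucketing function, and `PositionBiasCache` is a hypothetical helper that builds the bucket table once for the model's maximum length and then slices out what each decoding step needs instead of recomputing it every forward pass.

```python
import math

def relative_position_bucket(rel_pos, num_buckets=32, max_distance=128):
    # Simplified T5-style bucketing for the decoder (unidirectional) case:
    # small distances get exact buckets, larger ones log-spaced buckets.
    rel_pos = max(-rel_pos, 0)
    max_exact = num_buckets // 2
    if rel_pos < max_exact:
        return rel_pos
    log_bucket = max_exact + int(
        math.log(rel_pos / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return min(log_bucket, num_buckets - 1)

class PositionBiasCache:
    """Precompute the bucket table once for max_len, then slice per step."""

    def __init__(self, max_len, **kwargs):
        self.table = [
            [relative_position_bucket(k - q, **kwargs) for k in range(max_len)]
            for q in range(max_len)
        ]

    def fetch(self, query_len, key_len):
        # During incremental decoding query_len is 1, so this returns a
        # single precomputed row instead of rebuilding the whole bias.
        return [row[:key_len] for row in self.table[key_len - query_len:key_len]]
```

During generation, each step would call `cache.fetch(1, current_length)` rather than recomputing buckets for every layer and position.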

I can’t reproduce your results. Do you have accelerate installed? Can you share your transformers-cli env? @ArthurZucker

also you can do encoded = tokenizer(text, return_tensors="pt").to(model.device) 😉

Hey! Thanks for reporting, I’ll investigate! Not sure why you would need to run 100 iterations of the generate method this way, but for one generation:

  • umt5:
>>> start_time = time.time();model.generate(**encoded, generation_config=gen_config);print(time.time()-start_time)
1.5145587921142578
  • mbart:
>>> start_time = time.time();model.generate(**encoded, generation_config=gen_config);print(time.time()-start_time)
1.5777842998504639

For 10 iterations:

  • umt5: 16.204639673233032
  • mbart: 16.71877956390381

So I’m not sure if this is simply a bug in the time logging?
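To rule out measurement artifacts (e.g. CUDA kernel compilation or allocator warm-up dominating the first calls), a more robust pattern is to discard warm-up iterations and report a median per-call time. The `bench` helper below is a hypothetical sketch of that pattern, not part of transformers:

```python
import statistics
import time

def bench(fn, warmup=3, iters=10):
    # Discard warm-up calls so one-time costs (kernel compilation,
    # memory allocator growth) don't skew the measurement, then
    # return the median per-call duration in seconds.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```

Usage would be e.g. `bench(lambda: model.generate(**encoded, generation_config=gen_config))` for each model, making the per-generation times directly comparable.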