TensorRT-LLM: LLAVA is slow due to unnecessary output tokens

System Info

  • H100

Who can help?

@kaiy

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Use the official example code to run LLaVA-1.5-13B.

Expected behavior

Much higher throughput. Currently I get ~9.8 img/sec with batch size = 48, while sglang reaches 18.6 img/sec. TensorRT-LLM should be at least 2x faster than sglang or vllm.

Actual behavior

See above

Additional notes

I also benchmarked Llama 2, and its throughput is as expected. Looking into the code, I found that the output ids contain all the image tokens, whereas the official LLaVA code only contains the text tokens. Is it possible that the LLM part is predicting the image tokens as well, and that this causes the slowdown?
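To illustrate the concern, here is a minimal sketch (not the actual TensorRT-LLM runner code) of separating the copied image placeholder tokens from the newly generated text when inspecting output ids. The placeholder token id is an assumption and should be checked against the actual LLaVA-1.5 tokenizer:

```python
# Hypothetical sketch: count only the text tokens the LLM actually generated,
# excluding the echoed prompt (which includes the expanded image tokens) and
# any image placeholder ids copied into output_ids.
IMAGE_TOKEN_ID = 32000  # assumed <image> placeholder id for LLaVA-1.5; verify for your checkpoint

def generated_text_token_count(output_ids, input_length):
    """Return the number of newly generated text tokens for one sequence."""
    new_ids = output_ids[input_length:]  # drop the echoed prompt, image tokens included
    return sum(1 for tok in new_ids if tok != IMAGE_TOKEN_ID)
```

If the per-image count from something like this is much larger than expected, it would suggest the decoder is spending steps on image positions rather than only on new text.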

About this issue

  • Original URL
  • State: closed
  • Created 4 months ago
  • Comments: 18 (2 by maintainers)

Most upvoted comments

@symphonylyh Thanks for the explanation! In my experiments I used exactly the same images as input, so the output lengths should be the same for all methods. In my use case, I care more about throughput than latency. For your reference, here is the full table of my experiments: max_new_tokens=48, model=llava1.5-13b

  • vllm throughput: 19 img/sec
  • sglang throughput: 18.2 img/sec
  • trt-llm throughput: 17.7 img/sec
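For context, throughput here is simply images processed divided by wall-clock time. A minimal sketch of such a measurement loop, with a user-supplied `run_batch` callable standing in for the actual vllm/sglang/trt-llm generation call, might look like:

```python
import time

def measure_throughput(run_batch, batches):
    """Return images/sec, where run_batch(batch) runs one batch end-to-end."""
    start = time.perf_counter()
    n_images = 0
    for batch in batches:
        run_batch(batch)          # generation with max_new_tokens=48 in these experiments
        n_images += len(batch)
    elapsed = time.perf_counter() - start
    return n_images / elapsed
```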

I really appreciate your help and the trt-llm support. Will definitely try inflight batching once it’s available. I’m closing this issue since my concerns are all addressed, thank you so much!

@amukkara Thanks for the suggestions! I tried both but didn't see a clear speedup; I think inflight batching might be more helpful in my use case.