TensorRT-LLM: LLAVA is slow due to unnecessary output tokens
System Info
- H100
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Use the official code to run LLAVA1.5-13B
Expected behavior
Much higher throughput – currently I get ~9.8 img/sec with batch size = 48, whereas sglang reaches 18.6 img/sec. TensorRT-LLM should be at least 2x faster than sglang or vllm.
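For reference, a minimal sketch of how img/sec was measured here – assuming simple wall-clock timing over the whole dataset; `run_llava_batch` is a placeholder for the actual inference call, not a real TensorRT-LLM API:

```python
import time

def measure_throughput(run_llava_batch, images, batch_size=48):
    # Time end-to-end inference over all images and report images per second.
    # `run_llava_batch(batch)` stands in for whatever runs LLaVA on one batch.
    start = time.perf_counter()
    for i in range(0, len(images), batch_size):
        run_llava_batch(images[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(images) / elapsed
```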
Actual behavior
See above
Additional notes
I also benchmarked llama2 and its throughput is as expected. Looking into the code, I found that the output ids contain all the image tokens, whereas the official llava code contains only the text tokens. Is it possible that the LLM part is predicting the image tokens as well, and that this causes the slowdown?
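One way to check is to compare the prompt length (text plus image placeholder tokens) against the returned output ids and count only the newly generated part. A rough sketch, assuming the common convention that the output echoes the prompt followed by generated tokens and is right-padded; the names and pad id here are placeholders, not the actual TRT-LLM runner API:

```python
import torch

def count_generated_tokens(input_ids: torch.Tensor,
                           output_ids: torch.Tensor,
                           pad_id: int = 0) -> torch.Tensor:
    """Count freshly generated tokens per sample, excluding the echoed prompt.

    Assumes output_ids = [prompt (text + image placeholders), generated ...],
    right-padded with `pad_id` (an assumption; adjust to the real runner).
    """
    prompt_len = input_ids.shape[1]
    generated = output_ids[:, prompt_len:]   # slice off the echoed prompt
    return (generated != pad_id).sum(dim=1)  # non-pad tokens actually produced
```

If this count comes out far above max_new_tokens, the extra positions are likely the image placeholder tokens being carried along in (or re-predicted into) the output.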
About this issue
- Original URL
- State: closed
- Created 4 months ago
- Comments: 18 (2 by maintainers)
@symphonylyh Thanks for the explanation! In my experiments I used exactly the same images as input, so the output lengths should be the same for all methods. In my use case, I care more about throughput than latency. For your reference, here is the full table of my experiments:
max_new_tokens=48, model=llava1.5-13b
I really appreciate your help and the trt-llm support. Will definitely try inflight batching once it’s available. I’m closing this issue since my concerns are all addressed, thank you so much!
@amukkara Thanks for the suggestions! I tried both but didn't see a clear speedup; I think inflight batching might be more helpful in my use case.