lm-evaluation-harness: Problems with speeding up inference
I am trying to speed up benchmarking on an A100. Below are the times for one task, run in two configurations with Mistral.
Unfortunately, using torch.compile and flash_attention slows down inference. vLLM is also very slow for loglikelihood tasks.
Another issue is that the scores with batch size 1 and batch size 4 differ, tested both with and without logits_cache and with torch.use_deterministic_algorithms(True). Is it possible to obtain the same results? Maybe there is some problem with padding?
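For reference, a minimal sketch of how I compare the two batch sizes (assuming the Python API `lm_eval.simple_evaluate` and the `hf` backend; the task name and model path below are placeholders, not the exact ones from my runs):

```python
import os

# torch.use_deterministic_algorithms(True) needs this cuBLAS setting on CUDA,
# otherwise deterministic matmuls raise a RuntimeError.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch
import lm_eval

torch.use_deterministic_algorithms(True)


def run(batch_size):
    # Same model, same task; only the batch size changes between runs.
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=mistralai/Mistral-7B-v0.1,dtype=bfloat16",
        tasks=["hellaswag"],   # placeholder task
        batch_size=batch_size,
        device="cuda:0",
    )
    return out["results"]


print(run(1))
print(run(4))  # scores here come out different from the bs=1 run
```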
About this issue
- State: open
- Created 3 months ago
- Comments: 19 (19 by maintainers)
Commits related to this issue
- Add vLLM FAQs to README (#1625) (#1633) — committed to EleutherAI/lm-evaluation-harness by haileyschoelkopf 3 months ago
Thanks!

- vllm, bs=auto, max_model_len=4096: 01:33 (+01:30 for "Processed prompts"?), 0.3856

Like @haileyschoelkopf said, I think for a fair comparison you should use bs=auto to take advantage of vLLM's continuous batching. I don't know if it slows down when logprobs are returned, but most of the tweaks in vLLM are KV-cache related, so it makes sense that it doesn't do so well on non-generation tasks. They also have experimental support for prefix caching across batches (pass enable_prefix_caching=True to model_args; you might have to add it to the model init to format the boolean correctly), which might speed things up, especially for fewshot prompts. A sketch of that invocation follows after these comments.

Mistral was particularly sensitive to batch differences: see #1425. Not sure what the reason was. Llama, by comparison, not so much.
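As a rough illustration of the suggestion above, a sketch of a vLLM run with bs=auto and prefix caching (again assuming `lm_eval.simple_evaluate`, and assuming the `vllm` backend forwards `enable_prefix_caching` to the engine; model and task names are placeholders):

```python
import lm_eval

# Sketch only: enable_prefix_caching is experimental in vLLM, and the harness
# may need a change in the model init to parse the boolean from model_args.
out = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=mistralai/Mistral-7B-v0.1,"
        "max_model_len=4096,"
        "enable_prefix_caching=True"
    ),
    tasks=["hellaswag"],    # placeholder task
    batch_size="auto",      # lets vLLM use continuous batching
)
print(out["results"])
```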