lm-evaluation-harness: problems speeding up inference

I am trying to speed up benchmarking on an A100. Below are the timings for one task, in two versions, using Mistral.

[screenshot: table of run times and scores per configuration]

Unfortunately, using torch.compile and flash_attention slows down inference. Also, vLLM is very slow for the loglikelihood task.
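
For context, here is a minimal sketch (assuming standard transformers APIs, not the harness's exact code) of how those two options are typically enabled for an HF model; note that torch.compile pays its compilation cost lazily on the first forward passes, so on a short run the warm-up alone can outweigh any speedup:

```python
import torch
from transformers import AutoModelForCausalLM

# Assumption: the HF variants in the timings were configured roughly like this;
# the harness normally forwards such kwargs through model_args.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
).cuda()
model = torch.compile(model)  # compiles lazily; warm-up can dominate short evaluation runs
```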

Another issue is that the scores with batch size 1 and batch size 4 differ; I tested with and without logits_cache and with torch.use_deterministic_algorithms(True). Is it possible to obtain the same results? Maybe there is some problem with padding?
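
To illustrate where a bs=1 vs bs=4 gap can come from even with deterministic algorithms enabled, here is a minimal sketch outside the harness (model and prompts are placeholders): with right padding and a correct attention mask the batched logits are mathematically identical, but the fp16 kernels see different matrix shapes and therefore round differently.

```python
import os
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")  # required for deterministic cuBLAS matmuls
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.use_deterministic_algorithms(True)

name = "mistralai/Mistral-7B-v0.1"  # any causal LM shows the effect
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token
tok.padding_side = "right"          # pad on the right so the last real token index is easy to find
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda().eval()

prompts = ["The capital of France is", "Water boils at", "Two plus two equals", "The sky is"]

@torch.no_grad()
def last_token_logprobs(texts, batch_size):
    out = []
    for i in range(0, len(texts), batch_size):
        enc = tok(texts[i:i + batch_size], return_tensors="pt", padding=True).to(model.device)
        logits = model(**enc).logits                      # [batch, seq, vocab]
        last = enc.attention_mask.sum(dim=1) - 1          # index of the last real token per row
        for j in range(logits.size(0)):
            logprobs = torch.log_softmax(logits[j, last[j] - 1].float(), dim=-1)
            out.append(logprobs[enc.input_ids[j, last[j]]].item())
    return out

print(last_token_logprobs(prompts, batch_size=1))
print(last_token_logprobs(prompts, batch_size=4))  # expect tiny fp16 deviations caused by padded batching
```

If the two printed lists differ only in the last few decimal places, the score differences likely come from floating-point noise rather than a padding bug.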


Most upvoted comments

Thanks! vllm, bs=auto, max_model_len=4096: 01:33 (+01:30 for "Processed prompts"?), score 0.3856

Like @haileyschoelkopf said, I think for a fair comparison you should use bs auto to take advantage of vLLM’s continuous batching. I don’t know whether it slows down when logprobs are returned, but most of vLLM’s optimizations are KV-cache related, so it makes sense that it doesn’t do as well on non-generation tasks. They also have experimental support for prefix caching across batches (pass enable_prefix_caching=True in model_args; you might have to add it to the model init to format the boolean correctly), which might speed things up (especially for fewshot prompts).
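
For reference, a minimal sketch of that suggestion through the harness’s Python API (the task name is a placeholder, and whether the boolean in model_args is parsed correctly depends on the harness version, as noted above):

```python
import lm_eval

# Assumptions: lm_eval >= 0.4 Python API; "hellaswag" stands in for the actual task.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=mistralai/Mistral-7B-v0.1,"
        "max_model_len=4096,"
        "enable_prefix_caching=True"   # experimental vLLM flag discussed above
    ),
    tasks=["hellaswag"],
    batch_size="auto",                 # let vLLM's continuous batching pack requests itself
)
print(results["results"])
```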

Mistral was particularly sensitive to batch differences; see #1425. Not sure what the reason was. Llama, by comparison, not so much.