FlexGen: Question: FlexGen seems slower than simple CPU code, am I missing something? [see discussion]
Hi! I’m trying to reproduce FlexGen results and compare them with more naive methods, and I’m getting weird results. Can you please help me?
edit: added benchmark details and minimal code to reproduce my claims.
library versions (click to expand)
ubuntu-server 18.04
PyTorch and dependencies installed via anaconda 4.9.1, package versions:
pytorch 1.13.1 py3.8_cuda11.7_cudnn8.5.0_0
numpy 1.23.4 py38h14f4228_0
numpy-base 1.23.4 py38h31eccc5_0
Transformers, tqdm, and bitsandbytes installed via pip: transformers 4.25.1, tqdm 4.64.1, bitsandbytes 0.37.0
I ran OPT-175B on a similar machine:
- dual Xeon 6426Y (mid-range server CPU) and 256 GB RAM, which is slightly more than in the benchmark, but the code never uses more than 200 GB (the benchmark setup has 208 GB)
- prefix length 512 and output length 32, similar to the README benchmark, with a batch size of 8 (edited; thanks to @merrymercy for pointing out the discrepancy)
I am using standard Hugging Face code, with transformers.models.opt.modeling_opt.OPTForCausalLM.
The model was quantized to 8-bit using PyTorch PTDQ on linear layers with all default parameters.
Based on my measurements, I am getting 2.06 tokens per second in a basic CPU setup for an 8-bit model, or about 3.9 seconds per batch-step. This is basic Hugging Face + PyTorch PTDQ, no deepspeed / accelerate. Note: this does not account for prefill, so it is not a fair comparison; see the adjusted figures below.
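To make the setup concrete, here is a minimal sketch of the baseline (not my exact script: the model path, prompts, and timing loop are illustrative, and the quoted 2.06 tokens/s counts decoding only, while this naive timer includes prefill):

```python
import time
import torch
from transformers import AutoTokenizer, OPTForCausalLM

# Placeholder name: assumes the OPT-175B weights are available locally in HF format.
# In practice one would convert/quantize layer by layer to stay within RAM;
# it is shown in one go here for brevity.
model_name = "facebook/opt-175b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"  # left-pad for decoder-only generation
model = OPTForCausalLM.from_pretrained(model_name).eval()

# PyTorch post-training dynamic quantization (PTDQ) on all linear layers,
# default parameters, as described above.
model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

batch_size, prefix_len, output_len = 8, 512, 32
prompts = ["hello world"] * batch_size  # placeholder prompts
inputs = tokenizer(prompts, return_tensors="pt",
                   padding="max_length", max_length=prefix_len)

with torch.inference_mode():
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=output_len, do_sample=False)
    elapsed = time.perf_counter() - t0

# Includes prefill; decode-only throughput (the number quoted above) is higher.
print(f"{batch_size * output_len / elapsed:.2f} tokens/s")
```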
In turn, FlexGen reports 1.12 tokens per second for a 4-bit OPT-175B model.

And, weirdly, **simple 8-bit CPU inference beats both FlexGen and FlexGen (c)**, given the large-batch setup in question.
Did I understand the evaluation setup correctly? If not, can you please tell me what I am missing?
Summary and corrections from the discussion below:
Based on the suggestions by @merrymercy, it is inappropriate to compare against CPU with batch size 64 since it does not fit in the original testing environment. I have updated the metrics with batch size 8 (to be safe); the decoding throughput fell from 3.66 to 2.06 tokens/second.
Based on the discussion with @Ying1123: in Section 6.0, the generative throughput is defined as “the number of generated tokens / (prefill time + decoding time)”.
Here, prefill time stands for encoding the input sequence in parallel, layer by layer. If the baseline algorithm prefills naively on CPU, FlexGen (c) 4-bit does indeed outperform the CPU 8-bit baseline. For CPU, most of the time is spent on prefilling. For GPU, the situation is the opposite: prefill is quick since it can be done with one offloading cycle, while generation requires multiple offloading cycles and takes longer.
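To make the accounting explicit, the same definition written out as a tiny helper (the numbers are the GPU-prefill / CPU-decode figures quoted a couple of paragraphs below):

```python
# Generative throughput as defined in Section 6.0:
# generated tokens / (prefill time + decoding time).
def generative_throughput(batch_size, output_len, prefill_s, decode_s):
    return batch_size * output_len / (prefill_s + decode_s)

print(generative_throughput(8, 32, 91.18, 124.28))  # ~1.19 tokens/s
```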
In further discussion, we consider the option of running prefill on GPU (using simple offloading, streaming KVs to CPU), then running inference on the CPU.
On a single T4 GPU, you can prefill 8 samples of 512 tokens with the OPT-175B model in 8-bit precision (the CUDA 8-bit code uses Linear8bitLt from bitsandbytes 0.37.0 with threshold=6) in 91.2 seconds using naive overlapped offloading. The CPU decoding time is, in turn, 124.3 seconds on the dual 6426Y. The aggregate baseline throughput is 8 * 32 / (91.18 + 124.277) ~= 1.19 tokens / second.
While the naive code is still faster, the difference between FlexGen and the baseline is not as significant as I originally thought. Important: later in this thread, @Ying1123 provides their own evaluation on a somewhat weaker CPU (2 GHz, fewer cores, virtualized). For that setup, FlexGen 4-bit on GPU is indeed 1.6x faster than the 8-bit CPU baseline, even if we account for GPU prefill. I thank @Ying1123 and @merrymercy for pointing out the differences and apologize for taking up their time.
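For reference, a minimal sketch of how the 8-bit GPU prefill path above can be put together with bitsandbytes 0.37.0. The Linear8bitLt swap is the real mechanism; the helper name is mine, and the layer-by-layer offloading / KV streaming loop is only hinted at:

```python
import torch.nn as nn
import bitsandbytes as bnb

def swap_linear_to_int8(module: nn.Module, threshold: float = 6.0) -> nn.Module:
    """Recursively replace nn.Linear with bitsandbytes Linear8bitLt.
    Weights are quantized to int8 when the module is moved to CUDA."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            int8_layer = bnb.nn.Linear8bitLt(
                child.in_features, child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False, threshold=threshold,
            )
            int8_layer.weight = bnb.nn.Int8Params(
                child.weight.data, requires_grad=False, has_fp16_weights=False
            )
            if child.bias is not None:
                int8_layer.bias = child.bias
            setattr(module, name, int8_layer)
        else:
            swap_linear_to_int8(child, threshold)
    return module

# Prefill then moves one decoder layer at a time to the GPU, runs it on the
# 512-token prefix, and streams the resulting KV cache back to CPU memory;
# that offloading loop is omitted here.
```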
Limitations that I left unaddressed (click to expand)
- the baseline algorithm uses 8-bit compression, while FlexGen (c) uses a 4-bit compression algorithm; it would be better to evaluate with the same compression level. If the baseline switched to 4-bit compression, it would also make sense to increase the batch size.
- the throughput comparison depends on the chosen sequence length and CPU type. I have a hunch that shorter sequence lengths would benefit from GPU-side decoding while longer sequence lengths favour CPU to avoid transferring the attention cache. @Ying1123 correctly points out that it would be best to compare the two approaches more systematically.
- the GPU prefill was measured separately on a different machine, because the original 6426Y machine has no GPU attached. In turn, the machine with the T4 has a more powerful CPU (EPYC 7742) that decodes faster (1.67 t/s final throughput) but is significantly more expensive. For a pure academic comparison, it would be best to evaluate both setups on a number of identical machines with different CPU/GPU balances.
Happy to see that we are gradually reaching some common conclusions. Here are some thoughts on your questions.
Q1
The scaling cannot be as perfect as you calculated. We also ran your script on a GCP memory-optimized instance with 768 GB of memory. The CPU is an Intel® Xeon® CPU @ 2.60GHz with 96 vCPUs. The decoding throughput is 1.42 token/s, still less than the 1.71 token/s we got with FlexGen 4-bit. An instance with so many vCPUs is not cheap on GCP; it is 2x more expensive than the T4 instance we used.
Edited: The 96-vCPU instance has AVX512 and VNNI support.
Q2
You are right. We mentioned in the paper that these baseline systems optimize more for latency or directly use suboptimal strategies inherited from the training systems. If we just use them out of the box, they cannot use a batch size larger than 2 on our setup. This is the best we can get with these systems. Please see the benchmark scripts for baselines here.
One goal of the paper is to point out that they missed a huge room for improvement in terms of throughput.
Q3
Your surprise makes sense because this paper packs a lot of stuff together. Specifically, we use the “DeepSpeed ZeRO-Inference” feature in this paper, as suggested by this blog.
Note that “DeepSpeed Inference” and “DeepSpeed ZeRO-Inference” are two different things, as suggested by this huggingface repo. “DeepSpeed Inference” utilizes inference-optimized kernels but does not support offloading, so we cannot use it. “DeepSpeed ZeRO-Inference” is the one that uses the “ZeRO-3” technique to do offloading, and that is what we use.
Q4
I want to mention an additional note about the CPU-only baseline. We did think about this at the very beginning. However, as we have shown, it is slower than FlexGen even with a 96-vCPU instance on GCP, so we didn’t spend too much time working on it.
Why didn’t we include a CPU-only baseline? There is no paper or code that we could use or reference. No one had done accuracy-preserving int8 LLM inference on the CPU before. Your benchmark script is awesome, but it did not exist before we released this repo, and it does not work as an end-to-end example with correct outputs that we can verify. We thank you for your contribution and would like to see follow-ups in this direction.
We really appreciate these insightful discussions! Could we add you to the acknowledgment of our paper?
I think you are missing a GPU
I see. It is indeed more complicated. I cannot be sure, but there’s another thing that could be at play: the high-memory instances could have CPUs without VNNI support. GCP has older-gen special CPUs, just like it still has old GPUs. You can think of VNNI as “tensor cores, but smaller and for CPU”; they had already seen mass adoption by the time the T4 rolled out, so both your and my benchmarks likely have them. To repeat, this is a wild guess and I have no way of verifying it.
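For anyone who wants to check this guess on their own Linux box, a quick way is to look at the CPU flags. This is purely a convenience snippet, not part of any benchmark:

```python
# VNNI shows up as avx512_vnni (server parts) or avx_vnni (newer client parts).
with open("/proc/cpuinfo") as f:
    flags = {flag for line in f if line.startswith("flags")
             for flag in line.split(":", 1)[1].split()}

print("avx512_vnni:", "avx512_vnni" in flags)
print("avx_vnni:   ", "avx_vnni" in flags)
```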
Agreed. This checks out with deepspeed-inference’s abstract and intro, which emphasize latency over throughput.
Thank you for the clarification, I somehow missed that when I was reading 😃
True enough. It’s curious because the quantization library I refer to has been around since 2019 and CPU 8-bit acceleration has been around since 2018, but indeed there is no dedicated library for running OPT/BLOOM (specifically) on CPU in 8-bit.
From my understanding, LLM CPU inference is “pathologically underhyped”, even though many non-academic practitioners have asked about CPU inference (in general) in the past (example).
I appreciate the gesture, but, unfortunately, I cannot accept for personal reasons. Please don’t read the wrong implication into this: you guys have taken a great step towards making LLM inference affordable, which is, like, the most important problem in today’s DL engineering. I also appreciate our discussion; I had a few (more constructive) follow-up points in mind, but I will need a few hours to check the numbers.
Since others may be viewing this, let me make it clear: I have nothing but respect for the authors and even more respect for the problem they are trying to solve. We need a way to make running large models accessible, and this is a significant step in that direction. The way I see it, most pre-prints published today have oversights, and I am “guilty” of such oversights myself. What defines the few good papers is that the authors are willing to work on fixing these oversights, which is the case here.
Can’t agree with you more on this. The only thing I’d add is that it is not just inference but also fine-tuning. Making those large models accessible will unleash a lot of human creativity.
Thank you for running the benchmarks so quickly. If I read it correctly, the evaluated throughput is proportional to the relative CPU performance. In other words, “Yay! It all makes sense now!!”
For instance, if we compare the CPUs and their CPU decoding throughputs, we get:
Boring CPU stuff (click to expand)
Based on my (limited) understanding of GCP infrastructure, the “Intel® Xeon® CPU @ 2.00GHz” is a slice of a virtualized Xeon CPU from a physical server that has multiple T4 GPUs and more cores. I couldn’t find the exact CPU model from GCP, but similar AWS instances have 2nd gen Xeon Gold CPUs, and I’m using a newer 4th gen of the same Xeon Gold line.
If we assume that the GCP instance is also a 2nd gen Xeon, there is also a slight IPC (instructions per cycle) improvement from using a 4th gen Xeon in my case, but it should be at most a few tens of percent in total. In other words, it makes total sense that FlexGen 4-bit wins against this baseline on the first CPU.
True enough, but the FBGEMM code only affects linear layers, so the CPU baseline is not even fusing operators, not that fusing would help on CPU anyway 😃
It is also curious that all other evaluated “baseline” algorithms (accelerate, deepspeed, petals) are hopelessly slower than both FlexGen and the naive CPU baseline on the same machine.
I was particularly surprised that DeepSpeed Inference, which uses inference-optimized kernels and (according to the cited paper) targets latency, hopelessly lost to both FlexGen and the CPU.
// If I read the results correctly, the DeepSpeed-Inference baseline is not just 112x slower than FlexGen 4-bit, it is also ~27x slower than simply running on CPU, even with CPU-only prefill on the same test hardware. Please correct me if I misunderstood something.
p.s. to avoid misleading the accidental reader, I added the “Important! Authors evaluated…” in bold to the first message in the thread.
Hi @justheuristic, the update looks good to me! We also tried to run your scripts on our GCP instance, the same one used in the paper. Here is what I got. We will keep updating the results when we get more.
Setup
The CPU is an Intel® Xeon® CPU @ 2.00GHz with 32 vCPUs and 208 GB of DRAM. We tried to run the provided script with batch size = 8 on the GCP instance. It runs out of memory during weight initialization: we saw a “Killed” during the initialization of the 69th layer. To make it runnable, we changed the number of layers from 96 to 48. We then ran the script and scaled the latency printed by the script by 2 to get an estimate for the 175B model.
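For concreteness, a minimal sketch of this workaround (the hyperparameters below are assumed from the published OPT-175B architecture; the exact construction in our script may differ):

```python
from transformers import OPTConfig, OPTForCausalLM

# OPT-175B shape (12288 hidden, 96 heads), but with half the decoder layers
# so that initialization fits into 208 GB of DRAM, as described above.
config = OPTConfig(
    hidden_size=12288,
    ffn_dim=49152,
    num_attention_heads=96,
    num_hidden_layers=48,   # the real model has 96
    max_position_embeddings=2048,
)
model = OPTForCausalLM(config)  # randomly initialized; fine for timing

# ...quantize with PTDQ and time decoding as before, then multiply the
# measured latency by 2 to estimate the full 96-layer model.
```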
OPT-175B Results
PTDQ (int8)
FlexGen (int4)
If we take your approach of using the GPU for prefill and the CPU for decoding, we get a throughput of 8 * 32 / (91.18 + 235.2) = 0.78 token/s.
On our setup, the throughput of your proposed approaches is still lower than FlexGen's for both prefill and decoding, although they are definitely more efficient than the default offloading strategies in HF accelerate and deepspeed. Note that the int4 path in FlexGen is not well-optimized. We did not use any specialized CUDA kernel for quantization; we just composed the quantization from normal PyTorch operators without fusion. I guess the int8 kernels in PTDQ are better optimized with FBGEMM.
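For reference, a rough sketch of what such group-wise quantization looks like when composed from plain PyTorch ops (the group size, rounding, and lack of bit-packing here are illustrative, not our exact code):

```python
import torch

def quantize_4bit(x: torch.Tensor, group_size: int = 64):
    """Group-wise asymmetric 4-bit quantization built from plain torch ops."""
    g = x.reshape(-1, group_size)              # assumes numel % group_size == 0
    mn = g.min(dim=1, keepdim=True).values
    mx = g.max(dim=1, keepdim=True).values
    scale = (mx - mn).clamp(min=1e-8) / 15.0   # 4 bits -> 16 levels
    q = ((g - mn) / scale).round().clamp(0, 15).to(torch.uint8)
    return q, scale, mn                        # real code would pack 2 values per byte

def dequantize_4bit(q, scale, mn, shape):
    return (q.float() * scale + mn).reshape(shape)
```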
Conclusion
I believe the results also depend on the capabilities of the CPUs (both memory and compute). We mainly tested on normal cloud CPUs and desktop CPUs and found them too slow, so we ignored these CPU-only options. Another reason is that there is no good available implementation (as opposed to a simplified benchmark script) of the methods you propose. I think in the future we should study the whole space. For example, when the CPU memory is not enough for PTDQ to hold the whole model, we need offloading to disk anyway. In this case, the techniques in FlexGen will show a bigger win.
This is the point of FlexGen. It provides infrastructure that lets you easily try different offloading strategies and approximation methods.
Hi! Thanks for the answer.
Tried other model sizes?
Agreed. I will benchmark all the setups and reply here shortly, should take about a day. Also, I suppose it was rude of me to make performance claims without publishing the exact code. I will do so shortly.
Token/second/dollar:
That’s a good point, but they are about the same price 😃
The dual Xeon 6426Y setup costs marginally more than the T4 GPU by itself (i.e. without the computer you plug it into).
Xeon MSRP is about $1,500 per CPU, so $3,000 in total, while T4 MSRP is about $2,300 (plus whatever CPU you’re using). Both can be bought used for cheaper, and both have more cost-efficient analogues in the desktop segment.
My point is, the 6426Y is about what you typically get on a server that has enough RAM to load OPT-175B, and you gotta have a CPU there anyway. And even if you go out of your way to put a weak CPU in there, the rest of the platform (200 GB RAM, motherboard, PSU) will take up most of the price.
I have a hunch, though I am not 100% certain, that if a dual CPU gives you 3.66 tokens/sec, a single CPU would be about 50% of that, and it would still be faster than 1.12 (FlexGen table).