vllm: Mixtral tokens-per-second slower than expected, 10 tps

I’m observing slower TPS than expected with Mixtral.

Specifically, I’m seeing ~10-11 TPS.

It would be helpful to know what others have observed!

Here are some details about my configuration:

I’ve experimented with TP=2 and TP=4 on A100 80GB GPUs.

I’m running in a container with the following vllm and megablocks versions:

vllm @ git+https://github.com/vllm-project/vllm@d537c625cb039983a0bf61aa36ba8139a2905609
megablocks==0.5.0
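
To double-check what the container actually resolves at runtime, a quick version probe like this can be handy (a minimal sketch, nothing specific to my setup):

# Print the installed versions of the relevant packages inside the container.
from importlib.metadata import version

for pkg in ("vllm", "megablocks", "torch", "transformers"):
    print(pkg, version(pkg))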

nvidia-smi

Tue Dec 12 22:19:27 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off | 00000000:00:05.0 Off |                    0 |
| N/A   36C    P0              72W / 400W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off | 00000000:00:06.0 Off |                    0 |
| N/A   35C    P0              76W / 400W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          Off | 00000000:00:07.0 Off |                    0 |
| N/A   34C    P0              70W / 400W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          Off | 00000000:00:08.0 Off |                    0 |
| N/A   35C    P0              72W / 400W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I’m initializing the model and generating like this:

from vllm import LLM, SamplingParams

llm = LLM(MODEL_DIR, tensor_parallel_size=2)

sampling_params = SamplingParams(
    temperature=0.75,
    top_p=1,
    max_tokens=800,
    presence_penalty=1.15,
)

instructions = "Write a poem about open source machine learning."
template = """<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n{instructions} [/INST] """
prompt = template.format(system="", instructions=instructions)

result = llm.generate(prompt, sampling_params)
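
For reference, a minimal way to reproduce the TPS measurement is to time a single generate call and divide the number of generated tokens by the elapsed wall-clock time (a sketch continuing from the snippet above; the exact methodology may differ):

# Continuing from the snippet above (reuses llm, sampling_params, prompt).
import time

start = time.perf_counter()
outputs = llm.generate([prompt], sampling_params)
elapsed = time.perf_counter() - start

# Count only the generated (completion) tokens of the single request.
n_generated = len(outputs[0].outputs[0].token_ids)
print(f"{n_generated} tokens in {elapsed:.2f}s -> {n_generated / elapsed:.1f} tok/s")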

pip freeze

accelerate==0.25.0
aiohttp==3.9.1
aioprometheus==23.3.0
aiosignal==1.3.1
anyio==4.1.0
attrs==23.1.0
certifi==2023.11.17
charset-normalizer==3.3.2
click==8.1.7
cog @ file:///tmp/cog-0.0.1.dev-py3-none-any.whl#sha256=8769f6b9295f50c618f03f1fb334913222ba180d70c05624c4530eece5750259
einops==0.7.0
fastapi==0.98.0
filelock==3.13.1
frozenlist==1.4.0
fsspec==2023.12.2
h11==0.14.0
httptools==0.6.1
huggingface-hub==0.19.4
idna==3.6
Jinja2==3.1.2
jsonschema==4.20.0
jsonschema-specifications==2023.11.2
MarkupSafe==2.1.3
megablocks==0.5.0
mpmath==1.3.0
msgpack==1.0.7
multidict==6.0.4
networkx==3.2.1
ninja==1.11.1.1
numpy==1.26.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
orjson==3.9.10
packaging==23.2
pandas==2.1.4
peft==0.7.0
protobuf==4.25.1
psutil==5.9.6
pyarrow==14.0.1
pydantic==1.10.13
python-dateutil==2.8.2
python-dotenv==1.0.0
pytz==2023.3.post1
PyYAML==6.0.1
quantile-python==1.1
ray==2.8.1
referencing==0.32.0
regex==2023.10.3
requests==2.31.0
rpds-py==0.13.2
safetensors==0.4.1
sentencepiece==0.1.99
six==1.16.0
sniffio==1.3.0
stanford-stk==0.0.6
starlette==0.27.0
structlog==23.2.0
sympy==1.12
tokenizers==0.15.0
torch==2.1.1
tqdm==4.66.1
transformers==4.36.0
triton==2.1.0
typing_extensions==4.9.0
tzdata==2023.3
urllib3==2.1.0
uvicorn==0.24.0.post1
uvloop==0.19.0
vllm @ git+https://github.com/vllm-project/vllm@d537c625cb039983a0bf61aa36ba8139a2905609
watchfiles==0.21.0
websockets==12.0
xformers==0.0.23
yarl==1.9.4

About this issue

  • State: closed
  • Created 7 months ago
  • Reactions: 4
  • Comments: 17 (4 by maintainers)

Most upvoted comments

We updated the Docker image, built on 2x A100 and tested this afternoon; good perf (100+ tok/s). We’ll update the instructions shortly. Thanks!

@hamelsmu Hi Hamel, nice to see you again! We expect the performance issue to be solved once #2090 is merged. Please stay tuned!

How can we get 100+ tok/s? Any update?

Wow thanks @WoosukKwon 😍 that’s excellent news

What settings/setup are being used with the 2x H100s? I’m running the Docker image built from the main branch with the OpenAI API server and all default settings, and I’m getting at most 20 tok/s with TP=2.
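
(For comparing numbers: one simple way to get a single-request tok/s figure from the OpenAI-compatible server is to time one non-streaming completion and divide the reported completion_tokens by the elapsed time. Below is a rough sketch; the URL and model name are assumptions about a default local deployment, not the exact setup from this thread.)

import time
import requests

# Assumed defaults for a local vLLM OpenAI-compatible server; adjust to your deployment.
URL = "http://localhost:8000/v1/completions"
payload = {
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",  # whatever --model the server was launched with
    "prompt": "Write a poem about open source machine learning.",
    "max_tokens": 800,
    "temperature": 0.75,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600).json()
elapsed = time.perf_counter() - start

# completion_tokens comes from the usage block of the non-streaming response.
completion_tokens = resp["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.2f}s -> {completion_tokens / elapsed:.1f} tok/s")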