vllm: Mixtral tokens-per-second slower than expected, 10 tps
I’m observing slower TPS than expected with Mixtral: specifically, ~10-11 TPS. It would be helpful to know what others have observed!
Here are some details about my configuration.
I’ve experimented with TP=2 and TP=4 on A100 80GB GPUs.
I’m running in a container with the following vllm and megablocks versions:
vllm @ git+https://github.com/vllm-project/vllm@d537c625cb039983a0bf61aa36ba8139a2905609
megablocks==0.5.0
nvidia-smi
Tue Dec 12 22:19:27 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off | 00000000:00:05.0 Off |                    0 |
| N/A   36C    P0              72W / 400W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off | 00000000:00:06.0 Off |                    0 |
| N/A   35C    P0              76W / 400W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          Off | 00000000:00:07.0 Off |                    0 |
| N/A   34C    P0              70W / 400W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          Off | 00000000:00:08.0 Off |                    0 |
| N/A   35C    P0              72W / 400W |      2MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
I’m initializing the model and generating like this:

from vllm import LLM, SamplingParams

# Shard the model across GPUs with tensor parallelism.
llm = LLM(MODEL_DIR, tensor_parallel_size=2)

sampling_params = SamplingParams(
    temperature=0.75,
    top_p=1,
    max_tokens=800,
    presence_penalty=1.15,
)

instructions = "Write a poem about open source machine learning."
template = """<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n{instructions} [/INST] """
# Fill the template's {system} and {instructions} slots.
prompt = template.format(system="", instructions=instructions)

result = llm.generate(prompt, sampling_params)
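For reference, a minimal sketch of how a tok/s number like this can be measured with vLLM’s Python API (the timing wrapper is my own, not from the thread; it counts only completion tokens on the returned RequestOutput, excluding the prompt):

import time

start = time.perf_counter()
result = llm.generate(prompt, sampling_params)
elapsed = time.perf_counter() - start

# Count only generated tokens; prompt tokens are excluded.
n_generated = len(result[0].outputs[0].token_ids)
print(f"{n_generated} tokens in {elapsed:.1f}s -> {n_generated / elapsed:.1f} tok/s")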
pip freeze
accelerate==0.25.0
aiohttp==3.9.1
aioprometheus==23.3.0
aiosignal==1.3.1
anyio==4.1.0
attrs==23.1.0
certifi==2023.11.17
charset-normalizer==3.3.2
click==8.1.7
cog @ file:///tmp/cog-0.0.1.dev-py3-none-any.whl#sha256=8769f6b9295f50c618f03f1fb334913222ba180d70c05624c4530eece5750259
einops==0.7.0
fastapi==0.98.0
filelock==3.13.1
frozenlist==1.4.0
fsspec==2023.12.2
h11==0.14.0
httptools==0.6.1
huggingface-hub==0.19.4
idna==3.6
Jinja2==3.1.2
jsonschema==4.20.0
jsonschema-specifications==2023.11.2
MarkupSafe==2.1.3
megablocks==0.5.0
mpmath==1.3.0
msgpack==1.0.7
multidict==6.0.4
networkx==3.2.1
ninja==1.11.1.1
numpy==1.26.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
orjson==3.9.10
packaging==23.2
pandas==2.1.4
peft==0.7.0
protobuf==4.25.1
psutil==5.9.6
pyarrow==14.0.1
pydantic==1.10.13
python-dateutil==2.8.2
python-dotenv==1.0.0
pytz==2023.3.post1
PyYAML==6.0.1
quantile-python==1.1
ray==2.8.1
referencing==0.32.0
regex==2023.10.3
requests==2.31.0
rpds-py==0.13.2
safetensors==0.4.1
sentencepiece==0.1.99
six==1.16.0
sniffio==1.3.0
stanford-stk==0.0.6
starlette==0.27.0
structlog==23.2.0
sympy==1.12
tokenizers==0.15.0
torch==2.1.1
tqdm==4.66.1
transformers==4.36.0
triton==2.1.0
typing_extensions==4.9.0
tzdata==2023.3
urllib3==2.1.0
uvicorn==0.24.0.post1
uvloop==0.19.0
vllm @ git+https://github.com/vllm-project/vllm@d537c625cb039983a0bf61aa36ba8139a2905609
watchfiles==0.21.0
websockets==12.0
xformers==0.0.23
yarl==1.9.4
We updated the Docker image, built it on 2xA100s, and tested it this afternoon: good perf (100+ tok/s). We’ll update the instructions shortly. Thanks!
@hamelsmu Hi Hamel, nice to see you again! We expect the performance issue to be solved once #2090 is merged. Please stay tuned!
How can we get 100+ tok/s? Any update?
Wow thanks @WoosukKwon 😍 that’s excellent news
What settings/setup are being used with 2xH100s? I’m running the Docker image built from the main branch with the OpenAI API server, all default settings, and getting at most 20 tok/s with tp=2.
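For anyone comparing numbers through the OpenAI-compatible server, here is a rough sketch of measuring client-side throughput with streaming. It assumes the pre-1.0 `openai` Python client, vLLM’s default localhost:8000 endpoint, and a placeholder model name; none of these appear in the thread, and counting one token per streamed chunk is only an approximation:

import time
import openai

# Point the client at vLLM's OpenAI-compatible server (default port assumed).
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"  # vLLM does not check the key by default

start = time.perf_counter()
n_chunks = 0
for chunk in openai.ChatCompletion.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder; use the served model name
    messages=[{"role": "user", "content": "Write a poem about open source machine learning."}],
    max_tokens=800,
    stream=True,
):
    if chunk.choices[0].delta.get("content"):
        n_chunks += 1  # each streamed chunk carries roughly one token
elapsed = time.perf_counter() - start
print(f"~{n_chunks / elapsed:.1f} tok/s (end to end, including prompt processing)")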