TensorRT-LLM: Broken output for int4 weight-only quantized version of merged Llama2 70b model with more layers
System Info
- CPU architecture: x86_64
- CPU/Host memory size: 500GB total
- GPU properties (for quant)
- GPU name: 4x NVIDIA A100 80GB
- GPU memory size: 320GB total
- GPU properties (for runtime)
- GPU name: 4x NVIDIA RTX 4090
- GPU memory size: 96GB total
- Libraries
- tensorrt @ file:///usr/local/tensorrt/python/tensorrt-9.2.0.post12.dev5-cp310-none-linux_x86_64.whl
- tensorrt-llm==0.9.0.dev2024022700
- nvidia-cublas-cu12==12.1.3.1
- nvidia-cudnn-cu12==8.9.2.26
- nvidia-ammo==0.7.3
- Container used: nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
- NVIDIA driver version: 535.129.03
- OS: Ubuntu 22.04.4 LTS
- CUDA version: 12.2
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
- Launch the nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 container image
- Install tensorrt-llm according to the readme:
apt update
apt install openmpi-bin libopenmpi-dev
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
- Also clone the repo so we can use the scripts
git clone https://github.com/NVIDIA/TensorRT-LLM
- Download the model from huggingface
huggingface-cli download wolfram/miquliz-120b-v2.0 --local-dir /workspace/miquliz
- Prepare the dependencies for checkpoint conversion script
cd TensorRT-LLM/examples/llama
pip install -r requirements.txt
- Run the checkpoint conversion script as follows (a sanity-check sketch for the resulting checkpoint follows this list)
python3 convert_checkpoint.py --model_dir /workspace/miquliz/ --output_dir /workspace/miquliz-quantized/ --tp_size 4 --dtype float16 --use_weight_only --weight_only_precision int4 --fp8_kv_cache --enable_fp8
- Copy the quantized checkpoint to the inference server
- Build engine as follows
trtllm-build --checkpoint_dir /workspace/miquliz-quantized/ --output_dir /workspace/miquliz-engine/ --max_batch_size 1 --max_output_len 256 --weight_only_precision int4 --gemm_plugin float16 --paged_kv_cache enable --use_custom_all_reduce disable --multi_block_mode enable
- Run engine as follows
mpirun --allow-run-as-root -n 4 python3 ../run.py --max_output_len 256 --tokenizer_dir /workspace/miquliz/ --engine_dir /workspace/miquliz-engine/
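For reference, the sanity check mentioned at the conversion step is just a minimal sketch that prints the quantization settings recorded in the converted checkpoint. It assumes the unified checkpoint layout where convert_checkpoint.py writes a config.json (with a quantization section) next to the rank*.safetensors shards; the path is the one from the conversion step above:

```python
# Sanity-check sketch: print the quantization settings that ended up in the
# converted checkpoint's config.json. The "quantization" key is assumed from
# the unified checkpoint layout; adjust if your converter writes a different file.
import json

with open("/workspace/miquliz-quantized/config.json") as f:
    cfg = json.load(f)

print("dtype:", cfg.get("dtype"))
print(json.dumps(cfg.get("quantization", {}), indent=2))
```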
Expected behavior
The engine builds without warning messages and generates sensible output.
Actual behavior
The engine builds, but many warning messages of this form are printed for every tensor in every layer:
[TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/139/post_layernorm/CONSTANT_1_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
Running it produces garbage output:
Input [Text 0]: "<s> Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: "给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给 (...)
Build engine log: log (1).txt
Run engine log: log2.txt
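Given the Half/Float mismatch and the missing scale/zero-point warnings, one diagnostic I'm considering is to dump the dtypes of the layernorm and scale tensors in the converted checkpoint. This is only a sketch: it assumes the checkpoint is stored as rank*.safetensors and that the relevant tensor names contain "layernorm" or "scale":

```python
# Diagnostic sketch for the dtype warnings: list the dtype and shape of every
# layernorm / scale tensor in one shard of the converted checkpoint.
# Assumes rank*.safetensors layout; adjust the path for your output directory.
from safetensors import safe_open

path = "/workspace/miquliz-quantized/rank0.safetensors"
with safe_open(path, framework="pt") as f:
    for name in f.keys():
        if "layernorm" in name or "scale" in name:
            t = f.get_tensor(name)
            print(f"{name}: {t.dtype} {tuple(t.shape)}")
```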
Additional notes
Since the int4_awq quant format did not work at all, I am trying basic int4 weight-only quantization instead. I am still experimenting with the other options to see whether one of the settings is causing the issue, but iterating with the 120B model is extremely slow.
If I use this smaller model instead, the warning messages are still generated, but the engine does not seem to be broken and generates reasonable output.
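Since the build warnings go up to layers/139, the merged model's depth can be double-checked straight from its Hugging Face config (standard config.json field names assumed):

```python
# Confirm the merged model's depth, since the warnings reference layers/139
# (i.e. 140 decoder layers). Uses the standard Hugging Face config fields.
import json

with open("/workspace/miquliz/config.json") as f:
    hf_cfg = json.load(f)

print("num_hidden_layers:", hf_cfg.get("num_hidden_layers"))
print("hidden_size:", hf_cfg.get("hidden_size"))
print("vocab_size:", hf_cfg.get("vocab_size"))
```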
About this issue
- State: closed
- Created 4 months ago
- Comments: 40
Closing this since my engine built successfully!