TensorRT-LLM: Broken output for int4 weight-only quantized version of merged Llama2 70b model with more layers
System Info
- CPU architecture: x86_64
- CPU/Host memory size: 500GB total
- GPU properties (for quant)
- GPU name: 4x NVIDIA A100 80GB
- GPU memory size: 320GB total
- GPU properties (for runtime)
- GPU name: 4x NVIDIA RTX 4090
- GPU memory size: 96GB total
- Libraries
- tensorrt @ file:///usr/local/tensorrt/python/tensorrt-9.2.0.post12.dev5-cp310-none-linux_x86_64.whl
- tensorrt-llm==0.9.0.dev2024022700
- nvidia-cublas-cu12==12.1.3.1
- nvidia-cudnn-cu12==8.9.2.26
- nvidia-ammo==0.7.3
- Container used: nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
- NVIDIA driver version: 535.129.03
- OS: Ubuntu 22.04.4 LTS
- CUDA version: 12.2
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
- Launch the nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 container image
- Install tensorrt-llm according to the readme:
apt update
apt install openmpi-bin libopenmpi-dev
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
- Also clone the repo so we can use the scripts
git clone https://github.com/NVIDIA/TensorRT-LLM
- Download the model from huggingface
huggingface-cli download wolfram/miquliz-120b-v2.0 --local-dir /workspace/miquliz
- Prepare the dependencies for checkpoint conversion script
cd TensorRT-LLM/examples/llama
pip install -r requirements.txt
- Run the checkpoint conversion script as follows (a sanity-check sketch for the resulting checkpoint follows this list)
python3 convert_checkpoint.py --model_dir /workspace/miquliz/ --output_dir /workspace/miquliz-quantized/ --tp_size 4 --dtype float16 --use_weight_only --weight_only_precision int4 --fp8_kv_cache --enable_fp8
- Copy the quantized checkpoint to the inference server
- Build engine as follows
trtllm-build --checkpoint_dir /workspace/miquliz-quantized/ --output_dir /workspace/miquliz-engine/ --max_batch_size 1 --max_output_len 256 --weight_only_precision int4 --gemm_plugin float16 --paged_kv_cache enable --use_custom_all_reduce disable --multi_block_mode enable
- Run engine as follows
mpirun --allow-run-as-root -n 4 python3 ../run.py --max_output_len 256 --tokenizer_dir /workspace/miquliz/ --engine_dir /workspace/miquliz-engine/
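For reference, the sanity check mentioned at the conversion step is just a minimal sketch that prints the quantization settings recorded in the converted checkpoint. It assumes the unified checkpoint layout where convert_checkpoint.py writes a config.json (with a quantization section) next to the rank*.safetensors shards; the path is the one from the conversion step above:

```python
# Sanity-check sketch: print the quantization settings that ended up in the
# converted checkpoint's config.json. The "quantization" key is assumed from
# the unified checkpoint layout; adjust if your converter writes a different file.
import json

with open("/workspace/miquliz-quantized/config.json") as f:
    cfg = json.load(f)

print("dtype:", cfg.get("dtype"))
print(json.dumps(cfg.get("quantization", {}), indent=2))
```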
Expected behavior
The engine builds without warning messages and generates sensible output.
Actual behavior
The engine builds, but many warning messages of this form are printed for every tensor in every layer:
[TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[TRT] [W] Missing scale and zero-point for tensor LLaMAForCausalLM/transformer/layers/139/post_layernorm/CONSTANT_1_output_0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
Running it produces garbage output:
Input [Text 0]: "<s> Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: "给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给 (...)
Build engine log: log (1).txt
Run engine log: log2.txt
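Given the Half/Float mismatch and the missing scale/zero-point warnings, one diagnostic I'm considering is to dump the dtypes of the layernorm and scale tensors in the converted checkpoint. This is only a sketch: it assumes the checkpoint is stored as rank*.safetensors and that the relevant tensor names contain "layernorm" or "scale":

```python
# Diagnostic sketch for the dtype warnings: list the dtype and shape of every
# layernorm / scale tensor in one shard of the converted checkpoint.
# Assumes rank*.safetensors layout; adjust the path for your output directory.
from safetensors import safe_open

path = "/workspace/miquliz-quantized/rank0.safetensors"
with safe_open(path, framework="pt") as f:
    for name in f.keys():
        if "layernorm" in name or "scale" in name:
            t = f.get_tensor(name)
            print(f"{name}: {t.dtype} {tuple(t.shape)}")
```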
Additional notes
Since the int4_awq quant format did not work at all, I am trying basic int4 weight-only quantization instead. I am still experimenting with the other options to see whether one of the settings is causing the issue, but iterating with the 120B model is extremely slow.
If I use this smaller model instead, the warning messages are still generated, but the engine does not seem to be broken and generates reasonable output.
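Since the build warnings go up to layers/139, the merged model's depth can be double-checked straight from its Hugging Face config (standard config.json field names assumed):

```python
# Confirm the merged model's depth, since the warnings reference layers/139
# (i.e. 140 decoder layers). Uses the standard Hugging Face config fields.
import json

with open("/workspace/miquliz/config.json") as f:
    hf_cfg = json.load(f)

print("num_hidden_layers:", hf_cfg.get("num_hidden_layers"))
print("hidden_size:", hf_cfg.get("hidden_size"))
print("vocab_size:", hf_cfg.get("vocab_size"))
```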
About this issue
- State: closed
- Created 4 months ago
- Comments: 40
Closing this since my engine built successfully!