TensorRT-LLM: llama2-7b bad results for int8-kv-cache + per-channel-int8-weight

System Info

GPU: NVIDIA RTX 3090; TensorRT-LLM version: 0.7.1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

  1. Convert the HF checkpoint to the binary format: python hf_llama_convert.py -i /root/models/Llama-2-7b/ -o /root/TensorRT-LLM/examples/llama/llama2_7b_w8_int8_kv_cache/ --calibrate-kv-cache -t fp16
  2. Build the engine (per-channel int8 weight-only plus int8 KV cache; see the sketch after this list): python build.py --bin_model_dir /root/TensorRT-LLM/examples/llama/llama2_7b_w8_int8_kv_cache/bin_model_dir/ --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --output_dir /root/TensorRT-LLM/examples/llama/llama2_7b_w8_int8_kv_cache/1-gpu --int8_kv_cache --use_weight_only
  3. Evaluate the engine on MMLU: python mmlu.py --hf_model_dir /root/models/Llama-2-7b/ --engine_dir /root/TensorRT-LLM/examples/llama/llama2_7b_w8_int8_kv_cache/1-gpu/ --test_trt_llm (mmlu.py is the script provided by TensorRT-LLM at https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/mmlu.py)
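
For clarity, the --use_weight_only flag in step 2 corresponds to the per-channel int8 weight-only scheme mentioned in the title. Below is a minimal NumPy sketch of that symmetric per-output-channel quantization; it is a conceptual illustration only, not TensorRT-LLM's actual kernel code, and the [out_features, in_features] weight layout is an assumption.

```python
import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization of a weight matrix.

    w: float weight of shape [out_features, in_features] (layout assumed).
    Returns (w_q, scale) with w ~= w_q.astype(np.float32) * scale[:, None].
    """
    amax = np.abs(w).max(axis=1, keepdims=True)   # per-row (per-output-channel) absmax
    scale = amax / 127.0                          # one fp scale per output channel
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale.squeeze(1)

# Quick sanity check on random data: per-channel scales keep the rounding error
# small even when rows have very different dynamic ranges.
w = (np.random.randn(8, 16) * np.linspace(0.1, 10.0, 8)[:, None]).astype(np.float32)
w_q, scale = quantize_per_channel_int8(w)
w_hat = w_q.astype(np.float32) * scale[:, None]
print("max relative error per row:", np.abs(w - w_hat).max(axis=1) / np.abs(w).max(axis=1))
```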

Unfortunately, step 3 gives me:

Average accuracy 0.297 - math
Average accuracy 0.399 - health
Average accuracy 0.300 - physics
Average accuracy 0.519 - business
Average accuracy 0.361 - biology
Average accuracy 0.274 - chemistry
Average accuracy 0.299 - computer science
Average accuracy 0.349 - economics
Average accuracy 0.317 - engineering
Average accuracy 0.367 - philosophy
Average accuracy 0.513 - other
Average accuracy 0.439 - history
Average accuracy 0.404 - geography
Average accuracy 0.475 - politics
Average accuracy 0.380 - psychology
Average accuracy 0.512 - culture
Average accuracy 0.330 - law
Average accuracy 0.306 - STEM
Average accuracy 0.367 - humanities
Average accuracy 0.409 - social sciences
Average accuracy 0.457 - other (business, health, misc.)
**Average accuracy: 0.384**

The final MMLU accuracy is 38.4, while the FP16 baseline scores 45.9, so quantization costs about 7.5 points. According to several LLM quantization papers, accuracy should not drop nearly this much for int8 weight-only plus int8 KV cache.

The config.json generated by build.py looks like this:

{
  "builder_config": {
    "autopp_config": null,
    "gather_context_logits": false,
    "gather_generation_logits": false,
    "hf_modules_to_trtllm_modules": {
      "down_proj": "mlp_4h_to_h",
      "gate_proj": "mlp_h_to_4h",
      "k_proj": "attn_k",
      "o_proj": "attn_dense",
      "q_proj": "attn_q",
      "up_proj": "mlp_gate",
      "v_proj": "attn_v"
    },
    "hidden_act": "silu",
    "hidden_size": 4096,
    "int8": true,
    "lora_target_modules": null,
    "max_batch_size": 8,
    "max_beam_width": 1,
    "max_input_len": 2048,
    "max_num_tokens": null,
    "max_output_len": 512,
    "max_position_embeddings": 2048,
    "max_prompt_embedding_table_size": 0,
    "mlp_hidden_size": 11008,
    "name": "llama",
    "num_heads": 32,
    "num_kv_heads": 32,
    "num_layers": 32,
    "parallel_build": false,
    "pipeline_parallel": 1,
    "precision": "float16",
    "quant_mode": 66,
    "tensor_parallel": 1,
    "trtllm_modules_to_hf_modules": {
      "attn_dense": "o_proj",
      "attn_k": "k_proj",
      "attn_q": "q_proj",
      "attn_v": "v_proj",
      "mlp_4h_to_h": "down_proj",
      "mlp_gate": "up_proj",
      "mlp_h_to_4h": "gate_proj"
    },
    "use_refit": false,
    "vocab_size": 32000
  },
  "plugin_config": {
    "attention_qk_half_accumulation": false,
    "bert_attention_plugin": false,
    "context_fmha_type": 0,
    "enable_xqa": false,
    "gemm_plugin": "float16",
    "gpt_attention_plugin": "float16",
    "identity_plugin": false,
    "layernorm_plugin": false,
    "layernorm_quantization_plugin": false,
    "lookup_plugin": false,
    "lora_plugin": false,
    "multi_block_mode": false,
    "nccl_plugin": false,
    "paged_kv_cache": false,
    "quantize_per_token_plugin": false,
    "quantize_tensor_plugin": false,
    "remove_input_padding": false,
    "rmsnorm_plugin": false,
    "rmsnorm_quantization_plugin": false,
    "smooth_quant_gemm_plugin": false,
    "tokens_per_block": 0,
    "use_context_fmha_for_generation": false,
    "use_custom_all_reduce": false,
    "use_paged_context_fmha": false,
    "weight_only_groupwise_quant_matmul_plugin": false,
    "weight_only_quant_matmul_plugin": "float16"
  }
}
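
For reference, the "quant_mode": 66 value in builder_config can be decoded to confirm which quantization paths the engine was actually built with. A small sketch, assuming the 0.7.x Python API where QuantMode is an IntFlag with the helper methods shown; the method names come from the tensorrt_llm/quantization/mode.py source and are worth double-checking against the installed version.

```python
# Decode "quant_mode": 66 from config.json to see what the engine uses.
from tensorrt_llm.quantization import QuantMode

qm = QuantMode(66)               # value copied from the builder_config above
print(qm)                        # should name the set bits, e.g. INT8_WEIGHTS | INT8_KV_CACHE
print(qm.is_weight_only())       # expected True -> int8 weight-only GEMMs (per-channel scales)
print(qm.has_int8_kv_cache())    # expected True -> static per-tensor int8 KV cache
```

If this prints anything other than the int8 weight-only and int8 KV-cache bits, the engine was not built with the intended quantization configuration.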

Is there any bug in the quantization code?

Expected behavior

MMLU accuracy should stay close to the FP16 baseline (45.9) with int8 KV cache plus per-channel int8 weight-only quantization.

actual behavior

MMLU accuracy drops to 38.4, about 7.5 points below the FP16 baseline.

additional notes

None.

About this issue

  • State: open
  • Created 5 months ago
  • Comments: 32

Most upvoted comments

@Tracin I have tested int8 KV cache and SmoothQuant W8A8 separately on Llama-1-7b, and both gave good accuracy (close to the FP16 accuracy, about 35.5 on MMLU), matching what you reported before. So the problem seems specific to Llama-2-7b.

Just regard this as a cross-check.

@Tracin With separate per-tensor static scales for K and V, accuracy is good. I will test the merged K/V scale case later today.
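
For context, the difference between the two calibration options discussed above can be illustrated with a toy sketch. This is a conceptual illustration, not TensorRT-LLM's code; the merged scale is assumed here to be taken as the max over the K and V absmax values.

```python
def kv_cache_scales(k_amax: float, v_amax: float, merged: bool):
    """Static per-tensor int8 scales for the KV cache.

    Separate: K and V each get their own scale (amax / 127).
    Merged:   one shared scale for both; assumed here to be
              max(k_amax, v_amax) / 127, which wastes int8 range for the
              tensor with the smaller dynamic range.
    """
    if merged:
        s = max(k_amax, v_amax) / 127.0
        return s, s
    return k_amax / 127.0, v_amax / 127.0

# Toy calibration values where V has a much smaller dynamic range than K.
k_amax, v_amax = 40.0, 4.0
for merged in (False, True):
    k_scale, v_scale = kv_cache_scales(k_amax, v_amax, merged)
    v_levels = 2 * round(v_amax / v_scale)   # int8 levels effectively used by V
    print(f"merged={merged}: k_scale={k_scale:.4f}, v_scale={v_scale:.4f}, "
          f"V uses ~{v_levels} of 255 int8 levels")
```

With these toy numbers, the merged scale leaves V using only a few dozen of the 255 available int8 levels, which is one plausible way a shared K/V scale could hurt accuracy more than separate scales.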