onnxruntime: Incorrect/Garbage Responses for Llama-2-7b-hf with INT4 GPTQ/RTN Asymmetric Quantization

Describe the issue

I am trying to quantize and run the Llama-2-7b-hf model using the example here.

I successfully generated the INT4 model with GPTQ quantization by running the command given under "To reproduce". The quantization settings were:

Namespace(model_input='.\\llama2-7b-fp32\\', model_output='.\\Llama-2-7b-hf-gptq-asym', benchmark=False, quantize=True, batch_size=1, workspace='nc_workspace', algorithm='GPTQ', pad_max=196, seqlen=2048, tasks=['winogrande', 'copa', 'piqa', 'rte', 'hellaswag', 'openbookqa', 'lambada_openai', 'lambada_standard', 'wikitext'], dataset='NeelNanda/pile-10k', block_size=32, is_symmetric=False, accuracy_level=0, sampling_size=8)
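For readers unfamiliar with the settings above, `block_size=32` with `is_symmetric=False` means each group of 32 weights gets its own scale and zero point, and each weight is stored as a 4-bit integer. The following is a minimal pure-Python sketch of round-to-nearest (RTN) asymmetric quantization under that assumption. The function names are hypothetical, and this is not the ONNX Runtime or GPTQ implementation; it only illustrates the arithmetic.

```python
def quantize_block_asym(block, bits=4):
    """Round-to-nearest asymmetric quantization of one block of floats."""
    qmax = (1 << bits) - 1  # 15 for 4-bit storage
    # Assumption: the range is widened to include 0.0 so zero is representable.
    lo = min(min(block), 0.0)
    hi = max(max(block), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    # The zero point is the integer code that maps back to 0.0.
    zero_point = min(qmax, max(0, round(-lo / scale)))
    q = [min(qmax, max(0, round(x / scale) + zero_point)) for x in block]
    return q, scale, zero_point


def dequantize_block(q, scale, zero_point):
    """Map 4-bit codes back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]


def quantize_blockwise(weights, block_size=32):
    """Quantize a flat weight list in independent blocks of block_size."""
    return [
        quantize_block_asym(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]
```

With a symmetric scheme the zero point is fixed, so the asymmetric path exercises extra per-block data (the zero points) that the runtime has to load and apply correctly.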

However, when I try to run on CPU, I get garbage results for any prompt.

- Prompt: ONNX Runtime is
- Response: ONNX Runtime is  prisoner categorieпута Clientública одногоúblicaública одногоúblicaúblicaúblicapplyúblicaúblicaúblicaúblicaúblicaúblicaúblicażeública geometricúblicażeúblicaúblicaúblicaúblicaúblicaúblicaúblicaúblicaúblicaுúblicaúblicaúblicaże zou[ întRunública Stim cruelF

- Prompt: I want to book a vacation to Hawaii. First, I need to
- Response: I want to book a vacation to Hawaii. First, I need to Statusifier liesStatusifierDOCTYPEissenschaft schedulecmpyed optyed optultan")yed opt diferenелісляcompos into")ultan intoultan optultan \( into oderifierultan rappresentultanел diferenyedyedམła intoyed into")cloudflareел

- Prompt: A good workout routine is
- Response: A good workout routine is 今设 gewesen gewesenісляwardwardwardward musical pueblo gewesen gewesen gewesen gewesenove gewesenoveісля instant zouwardxisісляwardісля instantoveRemoteісля gewesen только estaven толькоxis instantіслярия Wahl только zou서іслярияottiottiaba

- Prompt: How are astronauts launched into space?
- Response: How are astronauts launched into space? emarkemarkemark기 Wahl------+ел기ел기기yed finsелeringелłyyed finsyedелел기othy기 fatyed기temperaturen기기temperaturen thouісляtemperaturen기othy기yed Agutemperaturenелелел thouелinental

Similar output is observed with the RTN asymmetric INT4 model as well.

To reproduce

I followed the onnxruntime-inference-examples weight-only quantization (WOQ) README:

python main.py --model_input .\llama2-7b-fp32\ --model_output .\Llama-2-7b-hf-gptq-asym --accuracy_level 0 --quantize --algorithm GPTQ

I used the inference code from here, with the following changes:

use_fp16 = False  # True when KV cache inputs/outputs are in float16
use_buffer_share = False  # True when --use_gqa was passed during export
device = torch.device("cpu")  # running on CPU

Urgency

No response

Platform

Windows

OS Version

Windows 11

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

v1.17.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

About this issue

State: closed
Created 5 months ago
Comments: 16 (6 by maintainers)

Commits related to this issue

fix memory mapping on Windows (#19623) ### Description  Windows memory map casts mapped_offset to DWORD directly. It will be truncated if it is larger than 2^32-1. W... — committed to microsoft/onnxruntime by yufenglee 4 months ago
Cherry pick memory map fix (#19835) ### Description  Windows memory map casts mapped_offset to DWORD directly. It will be truncated if it is larger than 2^32-1. We n... — committed to microsoft/onnxruntime by maggie1059 4 months ago
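The failure mode named in the commit message can be illustrated with plain integer arithmetic. This is a sketch, not the ONNX Runtime code: `truncate_to_dword` and `split_offset` are hypothetical helper names, and the 5 GiB offset is an invented example. It relies only on the fact that a DWORD is 32 bits and that the Windows `MapViewOfFile` API takes the file offset as separate high and low DWORDs.

```python
DWORD_MASK = 0xFFFFFFFF  # a Windows DWORD holds 32 bits


def truncate_to_dword(offset):
    """Mimic casting a 64-bit file offset straight to a 32-bit DWORD."""
    return offset & DWORD_MASK


def split_offset(offset):
    """Split a 64-bit offset into (high, low) DWORDs without losing bits."""
    return (offset >> 32) & DWORD_MASK, offset & DWORD_MASK


# Hypothetical offset: 5 GiB + 4 KiB into a large external weights file.
offset = 5 * 1024**3 + 4096

truncated = truncate_to_dword(offset)   # silently points 4 GiB too early
high, low = split_offset(offset)        # preserves the full offset

print(f"requested offset : {offset}")
print(f"after DWORD cast : {truncated}")
print(f"high/low DWORDs  : {high}, {low}")
```

A mapping made at the truncated offset would read valid but wrong bytes, which is consistent with a model that loads without error yet produces garbage, and with the problem appearing only on Windows.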

Most upvoted comments

For Wikitext

Task	Version	Metric	Value
wikitext	1	word_perplexity	9.1113
		byte_perplexity	1.5116
		bits_per_byte	0.5961

This also looks close to the published numbers.
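As a quick consistency check on the Wikitext table, the two byte-level metrics are related: in lm-evaluation-harness, bits_per_byte is the base-2 logarithm of byte_perplexity. The sketch below uses only the values reported above; the variable names are illustrative.

```python
import math

byte_perplexity = 1.5116        # from the table above
reported_bits_per_byte = 0.5961  # from the table above

# bits_per_byte = log2(byte_perplexity)
computed_bits_per_byte = math.log2(byte_perplexity)

print(f"computed bits_per_byte: {computed_bits_per_byte:.4f}")
print(f"reported bits_per_byte: {reported_bits_per_byte:.4f}")
```

The two values agree to four decimal places, so the reported metrics are internally consistent.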

VishalX on Feb 23, 2024

@yufenglee, I tried asymmetric blockwise, RTN and GPTQ quantization with the above fix. The responses for all of these include German sentences or words. Do you think this is due to quantization loss only? The earlier published numbers in #17390 suggest a GPTQ (G32Asym) accuracy of 0.7326 for lambada_openai. With responses like these, I’m not so sure that number can be reproduced. I’ll generate the score and see what I get. Could there be some other issue as well? What do you think?

@VishalX, it would be great if you could try to reproduce it and get the score.

@yufenglee I can reproduce the published numbers with a minor difference.

Task	Version	Metric	Value		Stderr
lambada_openai	0	ppl	3.5593	±	0.0714
		acc	0.7314	±	0.0062

Accuracy for lambada_openai is: 0.7314185911119736
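To put "minor difference" in numbers, the gap to the published figure of 0.7326 quoted in this thread can be compared with the reported standard error. This is a rough sketch using only values from the thread; the variable names are illustrative.

```python
published_acc = 0.7326   # G32Asym figure quoted from #17390
reproduced_acc = 0.7314  # from the table above
stderr = 0.0062          # reported standard error of the reproduced accuracy

gap = abs(published_acc - reproduced_acc)
gap_in_stderr = gap / stderr

print(f"gap: {gap:.4f} ({gap_in_stderr:.2f} standard errors)")
```

The gap is well under one standard error, so the reproduced accuracy is statistically indistinguishable from the published one.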

I’ll check for Wikitext as well.

VishalX on Feb 23, 2024

I can repro the issue locally.

For symmetric quantization, the “Hinweis: Die folgende Seite ist nur auf Englisch verfügbar.” in the 1st prompt is German and means “Note: The following page is only available in English.” I ran the same model with the CUDA EP and got the same result, so it is caused by the loss of accuracy from model quantization.
For asymmetric quantization on Windows, we need to investigate further.

yufenglee on Feb 21, 2024