transformers.js: [Bug] whisper transcription quality dropped between the 2.0.1 and 2.2.0 releases
I have observed a fairly significant degradation in transcription quality between versions 2.0.1 and 2.2.0 when using automatic-speech-recognition
(whisper), even though I am using the very same configuration.
Here is an example output for the same input:
2.0.1 result
0 -> 7 We have main engine start.
7 -> 10 4, 3, 2, 1.
10 -> 13 Yeah, whoa!
23 -> 27.08 You're a jerk, Tom. Look, Celia, we have to follow our passions.
27.08 -> 30.240000000000002 You have your robotics, and I just wanna be awesome in space.
30.24 -> 34.519999999999996 Why don't you just admit that you're freaked out by my robot hand?
34.52 -> 37.46 I'm not freaked out, but it's...
37.46 -> 39.36 All right, fine. I'm freaked out.
39.36 -> 42.36 I'm having nightmares that I'm being chased by these giant robotic claws.
42.36 -> 45.18 Oh, what in our tongue? We're done.
50.16 -> 53.78 - Robots memory, synced, and locked.
2.2.0 result
0 -> 3 [ [ ]
3 -> 4 [
4 -> 5 ]
5 -> 7 We have main engine start
7 -> 10 4, 3, 2, 1
10 -> 11 [ ]
11 -> 10 [
10 -> 12 ]
12 -> 11 [
11 -> 12 ]
12 -> 27.36 [ ] [ (thunder rumbling) - You're a jerk, Tom. - Look, Celia, we have to follow our passions.
27.36 -> 30.72 You have your robotics and I just wanna be awesome in space.
30.72 -> 32.68 - Why don't you just admit that you're
32.68 -> 34.68 freaked out by my robot hand?
34.68 -> 36.2 - I'm not freaked out, but it's...
37.68 -> 38.68 All right, fine.
38.68 -> 39.519999999999996 I'm freaked out.
39.52 -> 40.92 I'm having nightmares that I'm being chased
40.92 -> 42.44 but he's dying robotic class.
42.44 -> 44.36 - Oh, what in our dumb?
44.36 -> 45.18 We're done.
50.18 -> 53.68 - Robots memory, synced, and locked.
These are the observed issues:
- lowered timestamp precision, for example "You're a jerk, Tom." moved from 23 -> 27.08 to 12 -> 27.36
- words are recognized with less precision, for example from "chased by these giant robotic claws" to "chased ... but he's dying robotic class."
- random appearance of [ and ] characters (although these can be filtered in postprocessing; see the sketch after the worker script below)
I am wondering if these two versions use differently trained models?
Or is there any extra configuration I could pass into the 2.2.0 pipeline/pipe call to get results at least matching 2.0.1 without a drop in performance? (I am aware of num_beams, but that decreases performance heavily.)
I am attaching minimal repro testing scripts for evaluation; keep them running for 1-2 minutes and the output appears in the console log.
The worker is as simple as:
import { pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.2.0/dist/transformers.min.js";

const file = "tos.pcm";                    // raw 32-bit float PCM audio
const model = "Xenova/whisper-base.en";

// Load the audio and create the ASR pipeline.
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model);

// Transcribe in 30 s chunks with a 5 s stride and return chunk timestamps.
const result = await pipe(buffer, {
  chunk_length_s: 30,
  stride_length_s: 5,
  return_timestamps: true,
});

// Print "<start> -> <end> <text>" for every chunk.
let content = "2.2.0 result\n";
for (const { text, timestamp } of result.chunks) {
  content += `${timestamp[0]} -> ${timestamp[1]} ${text}\n`;
}
console.log(content);
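As noted above, the stray [ and ] chunks can be stripped in postprocessing. A minimal sketch, assuming the result.chunks structure used in the worker above (the filterBrackets helper is hypothetical, not part of transformers.js):

// Drop chunks whose text contains nothing but brackets and whitespace.
function filterBrackets(chunks) {
  return chunks.filter(({ text }) => text.replace(/[\[\]\s]/g, "").length > 0);
}

const cleaned = filterBrackets(result.chunks);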
About this issue
- State: closed
- Created a year ago
- Comments: 20 (11 by maintainers)
Commits related to this issue
- Allow user to set `per_channel` and `reduce_range` quantization parameters (#156) Also save quantization options — committed to xenova/transformers.js by xenova a year ago
- Allow user to set `per_channel` and `reduce_range` quantization params (#156) (#157) * Allow user to set `per_channel` and `reduce_range` quantization parameters (#156) Also save quantization opti... — committed to xenova/transformers.js by xenova a year ago
Great! Thanks so much for helping to investigate. This is the commit https://github.com/xenova/transformers.js/commit/ec00d4f540e8c1f298beadc4a71b05613568f6bc which changed the default quantization parameters in the conversion script.
This change was necessary for many text-only models, which showed significantly worse performance without these settings (both in JS and in Python). This issue was first noticed a few weeks ago when GitHub Actions started failing. Here’s some example code and output to show the problem:
With reduce_range=True, per_channel=True (correct):
With reduce_range=False, per_channel=False (incorrect):
This is due to “saturation issues” when using int8 for weights: https://docs.openvino.ai/2022.3/pot_saturation_issue.html
However, as you point out, it looks like this is not necessary for whisper models. I will do some more testing, but it might make sense to revert the model to use reduce_range=False, per_channel=False.
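For context, a minimal sketch of how these two flags are passed to ONNX Runtime's dynamic quantizer (in Python, since the conversion script is Python; the file names below are placeholders, not the actual paths the conversion script uses):

from onnxruntime.quantization import quantize_dynamic

# Placeholder paths; the real conversion script quantizes each exported
# encoder/decoder ONNX file of the model.
quantize_dynamic(
    "model.onnx",            # input: float32 ONNX graph
    "model_quantized.onnx",  # output: quantized graph
    per_channel=True,        # one scale/zero-point per output channel
    reduce_range=True,       # 7-bit weight range to avoid int8 saturation
)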
Thanks for the quick fix @xenova, I can confirm that models from the latest commit https://huggingface.co/Xenova/whisper-base.en/commit/86134155f8ad5593996868d4544b3d49ea0b1163 provide solid transcriptions.
Bingo! When enforcing the model revision https://huggingface.co/Xenova/whisper-base.en/commit/95502fc2ffd132c6859cf58a66f4977c3c6abac2
I am getting good results for both 2.0.1 and 2.2.0 (see the sketch below).
So I wonder if the new onnx files / conversion process can be somehow tuned to get good speed and good transcriptions at the same time?
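For reference, a minimal sketch of how a revision can be enforced when creating the pipeline, assuming the revision field of the pretrained options is honored by pipeline() (the commit hash is the one linked above):

import { pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.2.0/dist/transformers.min.js";

// Pin the model files to a specific repo revision (commit hash) so the
// pipeline fetches those ONNX files instead of the current "main" branch.
const pipe = await pipeline("automatic-speech-recognition", "Xenova/whisper-base.en", {
  revision: "95502fc2ffd132c6859cf58a66f4977c3c6abac2",
});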
The models’ onnx files were updated around 3 weeks ago: https://huggingface.co/Xenova/whisper-tiny.en/tree/main/onnx to be in line with the conversion process used by the rest of the models (which resulted in performance improvements for other seq2seq models). Specifically, this is due to the different quantization parameters used.
If you have a look at the attached scripts you will notice there is no revision specified for the pipeline. That makes me believe these are the very same models in use.