transformers.js: [Bug] whisper transcription quality dropped between the 2.0.1 and 2.2.0 releases
I have observed a fairly significant degradation in transcription quality between versions 2.0.1 and 2.2.0 when using automatic-speech-recognition
(whisper), even though I am using the very same configuration.
Here is an example output for the same input:
2.0.1 result
0 -> 7 We have main engine start.
7 -> 10 4, 3, 2, 1.
10 -> 13 Yeah, whoa!
23 -> 27.08 You're a jerk, Tom. Look, Celia, we have to follow our passions.
27.08 -> 30.240000000000002 You have your robotics, and I just wanna be awesome in space.
30.24 -> 34.519999999999996 Why don't you just admit that you're freaked out by my robot hand?
34.52 -> 37.46 I'm not freaked out, but it's...
37.46 -> 39.36 All right, fine. I'm freaked out.
39.36 -> 42.36 I'm having nightmares that I'm being chased by these giant robotic claws.
42.36 -> 45.18 Oh, what in our tongue? We're done.
50.16 -> 53.78 - Robots memory, synced, and locked.
2.2.0 result
0 -> 3 [ [ ]
3 -> 4 [
4 -> 5 ]
5 -> 7 We have main engine start
7 -> 10 4, 3, 2, 1
10 -> 11 [ ]
11 -> 10 [
10 -> 12 ]
12 -> 11 [
11 -> 12 ]
12 -> 27.36 [ ] [ (thunder rumbling) - You're a jerk, Tom. - Look, Celia, we have to follow our passions.
27.36 -> 30.72 You have your robotics and I just wanna be awesome in space.
30.72 -> 32.68 - Why don't you just admit that you're
32.68 -> 34.68 freaked out by my robot hand?
34.68 -> 36.2 - I'm not freaked out, but it's...
37.68 -> 38.68 All right, fine.
38.68 -> 39.519999999999996 I'm freaked out.
39.52 -> 40.92 I'm having nightmares that I'm being chased
40.92 -> 42.44 but he's dying robotic class.
42.44 -> 44.36 - Oh, what in our dumb?
44.36 -> 45.18 We're done.
50.18 -> 53.68 - Robots memory, synced, and locked.
These are the observed issues:
- lowered timestamp precision, for example "You're a jerk, Tom." moved from 23 -> 27.08 to 12 -> 27.36
- words are recognized with less precision, for example from "chased by these giant robotic claws" to "chased ... but he's dying robotic class."
- random appearance of [ and ] characters (although these can be filtered in postprocessing; see the sketch after the worker script below)
I am wondering if these two versions use differently trained models?
Or is there any extra configuration I could pass into the 2.2.0 pipeline/pipe call to get results at least matching 2.0.1 without a drop in performance? (I am aware of num_beams, but that decreases performance heavily.)
I am attaching minimal repro testing scripts for evaluation; keep them running for 1-2 minutes and the output appears in the console log.
The worker is as simple as:
import { pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.2.0/dist/transformers.min.js";

const file = "tos.pcm";                    // raw 32-bit float PCM audio
const model = "Xenova/whisper-base.en";

// Load the audio and create the ASR pipeline.
const buffer = new Float32Array(await (await fetch(file)).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition", model);

// Transcribe in 30 s chunks with a 5 s stride and return chunk timestamps.
const result = await pipe(buffer, {
  chunk_length_s: 30,
  stride_length_s: 5,
  return_timestamps: true,
});

// Print "<start> -> <end> <text>" for every chunk.
let content = "2.2.0 result\n";
for (const { text, timestamp } of result.chunks) {
  content += `${timestamp[0]} -> ${timestamp[1]} ${text}\n`;
}
console.log(content);
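As noted above, the stray [ and ] chunks can be stripped in postprocessing. A minimal sketch, assuming the result.chunks structure used in the worker above (the filterBrackets helper is hypothetical, not part of transformers.js):

// Drop chunks whose text contains nothing but brackets and whitespace.
function filterBrackets(chunks) {
  return chunks.filter(({ text }) => text.replace(/[\[\]\s]/g, "").length > 0);
}

const cleaned = filterBrackets(result.chunks);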
About this issue
- State: closed
- Created a year ago
- Comments: 20 (11 by maintainers)
Commits related to this issue
- Allow user to set `per_channel` and `reduce_range` quantization parameters (#156) Also save quantization options — committed to xenova/transformers.js by xenova a year ago
- Allow user to set `per_channel` and `reduce_range` quantization params (#156) (#157) * Allow user to set `per_channel` and `reduce_range` quantization parameters (#156) Also save quantization opti... — committed to xenova/transformers.js by xenova a year ago
Great! Thanks so much for helping to investigate. This is the commit https://github.com/xenova/transformers.js/commit/ec00d4f540e8c1f298beadc4a71b05613568f6bc which changed the default quantization parameters in the conversion script.
This change was necessary for many text-only models, which showed significantly worse performance without these settings (both in JS and in Python). This issue was first noticed a few weeks ago when GitHub Actions started failing. Here’s some example code and output to show the problem:
With reduce_range=True, per_channel=True (correct):
With reduce_range=False, per_channel=False (incorrect):
This is due to “saturation issues” when using int8 for weights: https://docs.openvino.ai/2022.3/pot_saturation_issue.html
However, as you point out, it looks like this is not necessary for whisper models. I will do some more testing, but it might make sense to revert the model to use reduce_range=False, per_channel=False.
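For context, a minimal sketch of how these two flags are passed to ONNX Runtime's dynamic quantizer (in Python, since the conversion script is Python; the file names below are placeholders, not the actual paths the conversion script uses):

from onnxruntime.quantization import quantize_dynamic

# Placeholder paths; the real conversion script quantizes each exported
# encoder/decoder ONNX file of the model.
quantize_dynamic(
    "model.onnx",            # input: float32 ONNX graph
    "model_quantized.onnx",  # output: quantized graph
    per_channel=True,        # one scale/zero-point per output channel
    reduce_range=True,       # 7-bit weight range to avoid int8 saturation
)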
Thanks for the quick fix @xenova, I can confirm that models from the latest commit https://huggingface.co/Xenova/whisper-base.en/commit/86134155f8ad5593996868d4544b3d49ea0b1163 provide solid transcriptions.
Bingo! When enforcing the model revision https://huggingface.co/Xenova/whisper-base.en/commit/95502fc2ffd132c6859cf58a66f4977c3c6abac2
I am getting good results for both 2.0.1 and 2.2.0 (see the sketch below).
So I wonder if the new onnx files / conversion process can be somehow tuned to get good speed and good transcriptions at the same time?
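For reference, a minimal sketch of how a revision can be enforced when creating the pipeline, assuming the revision field of the pretrained options is honored by pipeline() (the commit hash is the one linked above):

import { pipeline } from "https://cdn.jsdelivr.net/npm/@xenova/transformers@2.2.0/dist/transformers.min.js";

// Pin the model files to a specific repo revision (commit hash) so the
// pipeline fetches those ONNX files instead of the current "main" branch.
const pipe = await pipeline("automatic-speech-recognition", "Xenova/whisper-base.en", {
  revision: "95502fc2ffd132c6859cf58a66f4977c3c6abac2",
});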
The models’ onnx files were updated around 3 weeks ago: https://huggingface.co/Xenova/whisper-tiny.en/tree/main/onnx to be in line with the conversion process used by the rest of the models (which resulted in performance improvements for other seq2seq models). Specifically, this is due to the different quantization parameters used.
If you have a look at the attached scripts you will notice there is no revision specified for the pipeline. That makes me believe these are the very same models in use.