DeepSpeech: Language model incorrectly drops spaces for out-of-vocabulary words

Mozilla DeepSpeech will sometimes create long runs of text with no spaces:

omiokaarforfthelastquarterwastoget

This happens even with short audio clips (4 seconds) of a native American English speaker recorded with a high-quality microphone on a Mac OS X laptop. I’ve isolated the problem to the interaction with the language model rather than the acoustic model or the length of the audio clips, as the problem goes away when the language model is turned off.

The problem might be related to encountering out-of-vocabulary terms.

I’ve put together test files and results that show the issue is tied to the language model rather than to the length of the audio or the acoustic model.

I’ve provided 10 chunked WAV files (16 kHz, 16-bit), each 4 seconds long, taken from a fuller 15-minute audio file (I have not provided the full 15-minute file, as a few short chunks are sufficient to reproduce the problem):

https://www.dropbox.com/sh/3qy65r6wo8ldtvi/AAAAVinsD_kcCi8Bs6l3zOWFa?dl=0

The audio segments deliberately include occasional out-of-vocabulary terms, mostly technical, such as “OKR”, “EdgeStore”, and “CAPE”.

Also in that folder are text files showing the output with the standard language model enabled, where the garbled words are run together (chunks_with_language_model.txt):

Running inference for chunk 1
so were trying again a maybeialstart this time

Running inference for chunk 2
omiokaarforfthelastquarterwastoget

Running inference for chunk 3
to car to state deloedmarchinstrumnalha

Running inference for chunk 4
a tonproductcaseregaugesomd produce sidnelfromthat

Running inference for chunk 5
i am a to do that you know 

Running inference for chunk 6
we finish the kepehandlerrwend finished backfileprocessing 

Running inference for chunk 7
and is he teckdatthatwewould need to do to split the cape 

Running inference for chunk 8
out from sir handler and i are on new 

Running inference for chunk 9
he is not monolithic am andthanducotingswrat 

Running inference for chunk 10
relizationutenpling paws on that until it its a product signal

Then, I’ve provided similar output with the language model turned off (chunks_without_language_model.txt):

Running inference for chunk 1
so we're tryng again ah maybe alstart this time

Running inference for chunk 2
omiokaar forf the last quarter was to get

Running inference for chunk 3
oto car to state deloed march in strumn alha

Running inference for chunk 4
um ton product  caser egauges somd produc sidnel from that

Running inference for chunk 5
am ah to do that ou nowith

Running inference for chunk 6
we finishd the kepe handlerr wend finished backfile processinga

Running inference for chunk 7
on es eteckdat that we would need to do to split the kae ha

Running inference for chunk 8
rout frome sir hanler and ik ar on newh

Running inference for chunk 9
ch las not monoliic am andthan ducotings wrat 

Running inference for chunk 10
relization u en pling a pas on that until it its a product signal

I’ve included both these files in the shared Dropbox folder link above.

Here is the correct transcript, done manually (chunks_correct_manual_transcription.txt):

So, we're trying again, maybe I'll start this time.

So my OKR for the last quarter was to get AutoOCR to a state that we could
launch an external alpha, and product could sort of gauge some product signal
from that. To do that we finished the CAPE handler, we finished backfill 
processing, we have some tech debt that we would need to do to split the CAPE 
handler out from the search handler and make our own new handler so its not
monolithic, and do some things around CAPE utilization. We are kind of putting
a pause on that until we get some product signal.

This shows the language model is the source of the problem; I’ve also seen anecdotal reports on the official message board and in blog posts that this is a widespread problem. Perhaps when the language model hits an unknown n-gram, it ends up combining the surrounding words rather than retaining the spaces between them.
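One rough way to probe this hypothesis is to score the spaced and space-free versions of a chunk directly against the raw KenLM model. This is only a sanity check on the language model itself, not the decoder’s actual scoring path, and it assumes the kenlm Python bindings are installed and can load the release lm.binary:

import kenlm

# Path assumes the lm.binary shipped with the DeepSpeech 0.1.0 models.
model = kenlm.Model('models/lm.binary')

spaced = 'so my okr for the last quarter was to get'  # correct transcript of chunk 2
merged = 'omiokaarforfthelastquarterwastoget'         # what the decoder produced

# score() returns a log10 probability for the whole sentence.
print('spaced: %.2f' % model.score(spaced, bos=True, eos=True))
print('merged: %.2f' % model.score(merged, bos=True, eos=True))

# full_scores() additionally flags which words the model treats as OOV.
for prob, ngram_len, oov in model.full_scores(spaced, bos=True, eos=True):
    print('%.2f ngram=%d oov=%s' % (prob, ngram_len, oov))

If the single merged token (one unknown word) scores better than the ten spaced words, that would be consistent with the decoder preferring the space-free hypothesis.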

Discussion around this bug started on the standard DeepSpeech discussion forum: https://discourse.mozilla.org/t/text-produced-has-long-strings-of-words-with-no-spaces/24089/13 https://discourse.mozilla.org/t/longer-audio-files-with-deep-speech/22784/3

  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository):

The standard client.py was slightly modified to segment the longer 15-minute audio clip into 4-second blocks; a sketch of that segmentation step follows.
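I haven’t included the modified client here, but a minimal sketch of the fixed-length segmentation step (standard library only; the file names are placeholders, and my actual script differs) looks roughly like this:

import wave

CHUNK_SECONDS = 4

def split_wav(path, prefix):
    # Split a 16 kHz, 16-bit mono WAV into consecutive 4-second chunks.
    src = wave.open(path, 'rb')
    frames_per_chunk = src.getframerate() * CHUNK_SECONDS
    index = 1
    while True:
        frames = src.readframes(frames_per_chunk)
        if not frames:
            break
        dst = wave.open('%s_%d.wav' % (prefix, index), 'wb')
        dst.setnchannels(src.getnchannels())
        dst.setsampwidth(src.getsampwidth())
        dst.setframerate(src.getframerate())
        dst.writeframes(frames)
        dst.close()
        index += 1
    src.close()

split_wav('full_recording.wav', 'chunk')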

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):

Mac OS X 10.12.6 (16G1036)

  • TensorFlow installed from (our builds, or upstream TensorFlow):

Both Mozilla DeepSpeech and TensorFlow were installed into a virtualenv set up via the following requirements.txt file:

tensorflow==1.4.0
deepspeech==0.1.0
numpy==1.13.3
scipy==0.19.1
webrtcvad==2.0.10
  • TensorFlow version (use command below):
('v1.4.0-rc1-11-g130a514', '1.4.0')
  • Python version:
Python 2.7.13
  • Bazel version (if compiling from source):

Did not compile from source.

  • GCC/Compiler version (if compiling from source):

Did not compile from source.

  • CUDA/cuDNN version:

Used CPU only version

  • GPU model and memory:

Used CPU only version

  • Exact command to reproduce:

I haven’t provided my full modified client.py that segments longer audio, but to run the standard deepspeech command with a language model against one of the 4-second audio clips included in the Dropbox folder shared above, you can run the following:

# Set $DEEPSPEECH to where full Deep Speech checkout is; note that my own git checkout
# for the `deepspeech` runner is at git sha fef25e9ea6b0b6d96dceb610f96a40f2757e05e4
deepspeech $DEEPSPEECH/models/output_graph.pb chunk_2_length_4.0_s.wav $DEEPSPEECH/models/alphabet.txt $DEEPSPEECH/models/lm.binary $DEEPSPEECH/models/trie

# Similar command to run without language model -- spaces retained for unknown words:
deepspeech $DEEPSPEECH/models/output_graph.pb chunk_2_length_4.0_s.wav $DEEPSPEECH/models/alphabet.txt 

This is clearly a bug and not a feature 😃

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 29
  • Comments: 54 (21 by maintainers)

Most upvoted comments

I’ve implemented length normalization (word_count_weight was only a gross approximation) as well as switched to a fixed OOV score (which had been on the TODO list for a long time) as part of the streaming changes, which will be merged soon for our next release. When we have binaries available for testing I’ll comment here so anyone interested can check whether it improves the decoder behavior on the cases described here. Thanks a lot for the investigation and suggestions, @bernardohenz, @GeorgeFedoseev and @titardrew!

I am facing the same issue on a rather similar configuration to the one described above. Was there any progress on this? Thanks!

Hi @reuben when will you have the binaries available?

@reuben is currently working on moving to ctcdecode, which among other things should fix this issue

They will be available with our next release, v0.2, when it is ready 😃

What assumptions does the acoustic model make (i.e., what are the distribution and characteristics of the audio training data)? The audio I provided sounds pretty clear IMHO, but perhaps the audio training data doesn’t have enough diversity to help the deep net generalize (i.e., the deep net is essentially overfitting to the training data and isn’t generalizing well).

Hi @reuben, I am also seeing this problem in the master branch. Could you provide a patch with your implementation to deal with it?

Could anyone who’s seeing this issue test the new decoder on master?

There’s native client builds here: https://tools.taskcluster.net/groups/FyclewklSUqN6FXHavrhKQ

The acoustic model is the same as v0.2, and the trie is in data/lm/trie.ctcdecode after you update to latest master. Testing with some problematic examples I had shows much better results, but the links in this thread are all broken so I couldn’t test with your files.

Let me know how it goes.

@BradNeuberg @reuben is this issue closed? I am running the 0.2.0va7 version of DeepSpeech (with ldc93s1 and a new wav file), and the result (like hiieieddiitwenty) doesn’t match the language model. If there is any tweak to force it to respect the model, I’ll take it even if it is time-consuming.

@bernardohenz I think that in that part (if (!alphabet_.IsSpace(to_label))) it should simply be that an OOV word gets a lower score than any vocabulary word.

Try increasing word_count_weight_ from the default 1 to something like 3.5. This resulted in less concatenation for me and decreased my WER by 3-4%.
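For anyone who wants to try that with the 0.1.0 Python package: the weights are plain constants that the client passes to enableDecoderWithLM, so no rebuild is needed. The snippet below is a sketch from memory of the 0.1.0 client.py; double-check the constants and the exact enableDecoderWithLM signature against your own checkout:

import scipy.io.wavfile as wav
from deepspeech.model import Model

N_FEATURES = 26
N_CONTEXT = 9
BEAM_WIDTH = 500
LM_WEIGHT = 1.75
WORD_COUNT_WEIGHT = 3.5        # raised from the default 1.0, as suggested above
VALID_WORD_COUNT_WEIGHT = 1.00

ds = Model('models/output_graph.pb', N_FEATURES, N_CONTEXT, 'models/alphabet.txt', BEAM_WIDTH)
ds.enableDecoderWithLM('models/alphabet.txt', 'models/lm.binary', 'models/trie',
                       LM_WEIGHT, WORD_COUNT_WEIGHT, VALID_WORD_COUNT_WEIGHT)

fs, audio = wav.read('chunk_2_length_4.0_s.wav')
print(ds.stt(audio, fs))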

Probably the bug is somewhere in this function:

https://github.com/mozilla/DeepSpeech/blob/e34c52fcb98854c5ecc5639a8ace6196f5825fbd/native_client/beam_search.h#L56-L92

It seems the problem is that sequences containing out-of-vocabulary words receive a higher score without spaces than with them.
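A toy illustration of that asymmetry (made-up numbers, not the actual beam_search.h logic): if each completed out-of-vocabulary word is charged a roughly fixed language-model penalty, then dropping the space turns two penalties into one, so the space-free beam wins.

# Toy model only: every word below is assumed out-of-vocabulary and is
# charged one fixed LM penalty when it is completed.
OOV_LOG_PENALTY = -6.0  # hypothetical value

def toy_lm_score(words):
    return OOV_LOG_PENALTY * len(words)

with_space = ['omy', 'okr']  # beam that keeps the space: two OOV words
no_space = ['omyokr']        # beam that drops the space: one OOV word

print(toy_lm_score(with_space))  # -12.0
print(toy_lm_score(no_space))    # -6.0 -> scores higher, so the merged beam wins

Length normalization of the kind described in the top comment addresses exactly this advantage; in this toy model, dividing each score by the word count makes the two hypotheses tie.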

@spencer-brown On the recordings you made yourself, did you record directly to 16 kHz, 16-bit, mono audio? (The recordings sound like they were made at a lower sample rate and/or bit depth.)

Also, I’d tend to agree that the drop in the recording quality is likely largely to blame for the poor results on the recordings you made yourself. We’re currently training models that will be more robust to background noise.
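A quick way to check a recording’s format (a small sketch using only the standard library; the file name is a placeholder):

import wave

w = wave.open('recording.wav', 'rb')
print('sample rate: %d Hz' % w.getframerate())      # expect 16000
print('sample width: %d bytes' % w.getsampwidth())  # expect 2 (16-bit)
print('channels: %d' % w.getnchannels())            # expect 1 (mono)
w.close()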

I just tested the 0.2.0 release (deepspeech and models) and still get long run-together words that fall outside the English vocabulary.

This example is a phone call recording (one channel out of two). STT works well for the first sentence (a pre-recorded welcome message). The rest is part of a real conversation, where STT doesn’t work properly.

The command and output are:

(deepspeech-venv) jonathan@ubuntu:~$ deepspeech --model ~/deepspeech-0.2.0-models/models/output_graph.pb --audio ~/audio/C2AICXLGB3D2SMK4WPZF26KEZTRUA6OYR1.wav --alphabet ~/deepspeech-0.2.0-models/models/alphabet.txt --lm ~/deepspeech-0.2.0-models/models/lm.binary --trie ~/deepspeech-0.2.0-models/models/trie
Loading model from file /home/jonathan/deepspeech-0.2.0-models/models/output_graph.pb
TensorFlow: v1.6.0-18-g5021473
DeepSpeech: v0.2.0-0-g009f9b6
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-09-20 11:02:49.456955: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.134s.
Loading language model from files /home/jonathan/deepspeech-0.2.0-models/models/lm.binary /home/jonathan/deepspeech-0.2.0-models/models/trie
Loaded language model in 3.85s.
Running inference.
thank you for calling national storage your call may be recorded for coaching and quality the poses place let us not an if ye prefer we didn’t record your colt to day in wall constrashionalshordistigwisjemaigay am so it just so he put in your code held everything in a disbosmygriparsesnwygorighticame so she’s not like um that’s all good if you won’t care if i can just reserve something from my end over the foreign am i can reserve at the same on mine price you will looking out as well um which sent a and a unit is he looking out without which location for it an put it by sereerkapcoolofmijustrynorfrommians or a man we after the ground floor on the upper of a at
Inference took 33.947s for 58.674s audio file.

The audio can be found from https://s3.us-east-2.amazonaws.com/fonedynamicsuseast2/C2AICXLGB3D2SMK4WPZF26KEZTRUA6OYR1.wav

Hi @reuben Any update on these binaries? I too would like to test their impact on decoder behavior.

+1😉

The build requirements are here[1] and the build instructions are here[2].