DeepSpeech: CRLF in alphabet file breaks usage

python ./DeepSpeech.py --train_files ../zh-cn/clips/train.csv --dev_files ../zh-cn/clips/dev.csv --test_files ../zh-cn/clips/test.csv

Traceback (most recent call last):
  File "/mnt/d/FPProject_git/My_Work/speech/mozilla/DeepSpeech/util/text.py", line 33, in _label_from_string
    return self._str_to_label[string]
KeyError: '母'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/d/FPProject_git/My_Work/speech/mozilla/DeepSpeech/util/text.py", line 130, in text_to_char_array
    transcript = np.asarray(alphabet.encode(series['transcript']))
  File "/mnt/d/FPProject_git/My_Work/speech/mozilla/DeepSpeech/util/text.py", line 47, in encode
    res.append(self._label_from_string(char))
  File "/mnt/d/FPProject_git/My_Work/speech/mozilla/DeepSpeech/util/text.py", line 39, in _label_from_string
    ).with_traceback(e.__traceback__)
  File "/mnt/d/FPProject_git/My_Work/speech/mozilla/DeepSpeech/util/text.py", line 33, in _label_from_string
    return self._str_to_label[string]
KeyError: "ERROR: Your transcripts contain characters (e.g. '母') which do not occur in data/alphabet.txt! Use util/check_characters.py to see what characters are in your [train,dev,test].csv transcripts, and then add all these to data/alphabet.txt."

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/home/kms/anaconda3/envs/speechenv/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/kms/anaconda3/envs/speechenv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "./DeepSpeech.py", line 938, in main
    train()
  File "./DeepSpeech.py", line 438, in train
    train_phase=True)
  File "/mnt/d/FPProject_git/My_Work/speech/mozilla/DeepSpeech/util/feeding.py", line 101, in create_dataset
    df['transcript'] = df.apply(text_to_char_array, alphabet=Config.alphabet, result_type='reduce', axis=1)
  File "/home/kms/anaconda3/envs/speechenv/lib/python3.6/site-packages/pandas/core/frame.py", line 6928, in apply
    return op.get_result()
  File "/home/kms/anaconda3/envs/speechenv/lib/python3.6/site-packages/pandas/core/apply.py", line 186, in get_result
    return self.apply_standard()
  File "/home/kms/anaconda3/envs/speechenv/lib/python3.6/site-packages/pandas/core/apply.py", line 292, in apply_standard
    self.apply_series_generator()
  File "/home/kms/anaconda3/envs/speechenv/lib/python3.6/site-packages/pandas/core/apply.py", line 321, in apply_series_generator
    results[i] = self.f(v)
  File "/home/kms/anaconda3/envs/speechenv/lib/python3.6/site-packages/pandas/core/apply.py", line 112, in f
    return func(x, *args, **kwds)
  File "/mnt/d/FPProject_git/My_Work/speech/mozilla/DeepSpeech/util/text.py", line 136, in text_to_char_array
    raise ValueError('While processing: {}\n{}'.format(series['wav_filename'], e))
ValueError: ('While processing: /mnt/d/FPProject_git/My_Work/speech/mozilla/zh-cn/clips/common_voice_zh-CN_18782225.wav\n"ERROR: Your transcripts contain characters (e.g. \'母\') which do not occur in data/alphabet.txt! Use util/check_characters.py to see what characters are in your [train,dev,test].csv transcripts, and then add all these to data/alphabet.txt."', 'occurred at index 550')

I am training a DeepSpeech model for Chinese. I have already downloaded the Chinese dataset from voice.mozilla.org (Common Voice) and followed the training instructions in the project documentation (https://github.com/mozilla/DeepSpeech/blob/master/TRAINING.rst#training-your-own-model). But when I run the training with DeepSpeech.py, it fails with the errors above. data/alphabet.txt already contains ‘母’. I can’t find the reason. Please help me.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 26 (2 by maintainers)

Most upvoted comments

@JinZhuXing Just removing the CRLF line endings was enough. Training is only tested / supported on Linux so far, as documented. It looks like the alphabet file was prepared on a Windows system and saved with CRLF line endings.
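
For anyone hitting the same error, here is a minimal sketch of why a CRLF file can make '母' look "missing", and one way to convert the file to LF. The helper names (`has_crlf`, `convert_crlf_to_lf`, `naive_labels`) are made up for this illustration and are not part of DeepSpeech; `naive_labels` only assumes a loader that strips a single trailing newline per line, so with CRLF the stored label becomes '母\r' rather than '母'. The demo writes a temporary file instead of touching a real data/alphabet.txt.

```python
import os
import tempfile


def has_crlf(path):
    """Return True if the file contains at least one CRLF line ending."""
    with open(path, "rb") as f:
        return b"\r\n" in f.read()


def convert_crlf_to_lf(path):
    """Rewrite the file in place with LF endings; return how many CRLFs were replaced."""
    with open(path, "rb") as f:
        data = f.read()
    count = data.count(b"\r\n")
    if count:
        with open(path, "wb") as f:
            f.write(data.replace(b"\r\n", b"\n"))
    return count


def naive_labels(path):
    """Hypothetical loader (an assumption, not DeepSpeech code):
    drops only the last character of each line, so a '\r' survives."""
    # newline="" keeps the original line endings instead of translating them.
    with open(path, "r", encoding="utf-8", newline="") as f:
        return {line[:-1]: index for index, line in enumerate(f)}


# Demo on a throwaway file that mimics an alphabet saved on Windows.
demo_path = os.path.join(tempfile.mkdtemp(), "alphabet.txt")
with open(demo_path, "wb") as f:
    f.write("父\r\n母\r\n".encode("utf-8"))

before = naive_labels(demo_path)        # labels still carry the '\r'
replaced = convert_crlf_to_lf(demo_path)
after = naive_labels(demo_path)         # labels are the bare characters

print("'母' found before conversion:", "母" in before)
print("CRLF endings replaced:", replaced)
print("'母' found after conversion:", "母" in after)
```

To apply the same idea to a real file, call `convert_crlf_to_lf` on the path to your alphabet file after making a backup. Tools such as `dos2unix` do the same conversion from the command line.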