DeepSpeech: CRLF in alphabet file breaks usage
python ./DeepSpeech.py --train_files ../zh-cn/clips/train.csv --dev_files ../zh-cn/clips/dev.csv --test_files ../zh-cn/clips/test.csv
Traceback (most recent call last):
File "/mnt/d/FPProject_git/My_Work/speech/mozilla/DeepSpeech/util/text.py", line 33, in _label_from_string
return self._str_to_label[string]
KeyError: '母'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/mnt/d/FPProject_git/My_Work/speech/mozilla/DeepSpeech/util/text.py", line 130, in text_to_char_array
transcript = np.asarray(alphabet.encode(series['transcript']))
File "/mnt/d/FPProject_git/My_Work/speech/mozilla/DeepSpeech/util/text.py", line 47, in encode
res.append(self._label_from_string(char))
File "/mnt/d/FPProject_git/My_Work/speech/mozilla/DeepSpeech/util/text.py", line 39, in _label_from_string
).with_traceback(e.__traceback__)
File "/mnt/d/FPProject_git/My_Work/speech/mozilla/DeepSpeech/util/text.py", line 33, in _label_from_string
return self._str_to_label[string]
KeyError: "ERROR: Your transcripts contain characters (e.g. '母') which do not occur in data/alphabet.txt! Use util/check_characters.py to see what characters are in your [train,dev,test].csv transcripts, and then add all these to data/alphabet.txt."
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./DeepSpeech.py", line 965, in <module>
absl.app.run(main)
File "/home/kms/anaconda3/envs/speechenv/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/kms/anaconda3/envs/speechenv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "./DeepSpeech.py", line 938, in main
train()
File "./DeepSpeech.py", line 438, in train
train_phase=True)
File "/mnt/d/FPProject_git/My_Work/speech/mozilla/DeepSpeech/util/feeding.py", line 101, in create_dataset
df['transcript'] = df.apply(text_to_char_array, alphabet=Config.alphabet, result_type='reduce', axis=1)
File "/home/kms/anaconda3/envs/speechenv/lib/python3.6/site-packages/pandas/core/frame.py", line 6928, in apply
return op.get_result()
File "/home/kms/anaconda3/envs/speechenv/lib/python3.6/site-packages/pandas/core/apply.py", line 186, in get_result
return self.apply_standard()
File "/home/kms/anaconda3/envs/speechenv/lib/python3.6/site-packages/pandas/core/apply.py", line 292, in apply_standard
self.apply_series_generator()
File "/home/kms/anaconda3/envs/speechenv/lib/python3.6/site-packages/pandas/core/apply.py", line 321, in apply_series_generator
results[i] = self.f(v)
File "/home/kms/anaconda3/envs/speechenv/lib/python3.6/site-packages/pandas/core/apply.py", line 112, in f
return func(x, *args, **kwds)
File "/mnt/d/FPProject_git/My_Work/speech/mozilla/DeepSpeech/util/text.py", line 136, in text_to_char_array
raise ValueError('While processing: {}\n{}'.format(series['wav_filename'], e))
ValueError: ('While processing: /mnt/d/FPProject_git/My_Work/speech/mozilla/zh-cn/clips/common_voice_zh-CN_18782225.wav\n"ERROR: Your transcripts contain characters (e.g. \'母\') which do not occur in data/alphabet.txt! Use util/check_characters.py to see what characters are in your [train,dev,test].csv transcripts, and then add all these to data/alphabet.txt."', 'occurred at index 550')
I am training a DeepSpeech model for Chinese. I have already downloaded the Chinese dataset from voice.mozilla.org (Common Voice) and then tried training as described in the project documentation (https://github.com/mozilla/DeepSpeech/blob/master/TRAINING.rst#training-your-own-model). But when I train on the data using DeepSpeech.py, it fails with the errors above. data/alphabet.txt already contains '母'. I can't find the reason. Please help me.
About this issue
- State: closed
- Created 5 years ago
- Comments: 26 (2 by maintainers)
Commits related to this issue
- Enforce proper line ending removal when reading alphabet Fixes #2611 — committed to lissyx/STT by deleted user 4 years ago
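The commit title suggests the problem lies in how line endings are stripped when the alphabet is read. The following is a minimal sketch of that failure mode, not DeepSpeech's actual loader: `build_label_map` is a hypothetical helper, and the assumption is that a loader which removes only a trailing "\n" leaves a stray "\r" on every entry of a CRLF file, so the key becomes '母\r' and the lookup for '母' fails even though the character is visibly in the file.

```python
def build_label_map(lines, strip_chars):
    """Hypothetical helper: map each alphabet line to an integer label.

    strip_chars is the set of trailing characters removed from each line.
    Only line-ending characters are stripped, so an entry that is a single
    space survives as a legitimate alphabet symbol.
    """
    labels = {}
    for line in lines:
        char = line.rstrip(strip_chars)
        if char:  # skip lines that are empty once the ending is removed
            labels[char] = len(labels)
    return labels


# Two alphabet lines as they would appear in a file saved with CRLF endings.
crlf_lines = ["父\r\n", "母\r\n"]

naive = build_label_map(crlf_lines, "\n")     # keeps the "\r" on each key
robust = build_label_map(crlf_lines, "\r\n")  # removes both "\r" and "\n"

print("母" in naive)   # False
print("母" in robust)  # True
```

Stripping both characters makes the lookup independent of the platform the file was edited on.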
@JinZhuXing Just removing the CRLF line endings was enough. Training is only tested / supported on Linux so far, as documented. It looks like the alphabet file was prepared on a Windows system and so ended up with CRLF line endings.
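For anyone hitting the same error, one way to check a file for CRLF line endings and convert them to LF is sketched below. The helper names `has_crlf` and `convert_crlf_to_lf` are hypothetical and not part of DeepSpeech, and the demo runs on a temporary file rather than the real data/alphabet.txt.

```python
import tempfile
from pathlib import Path


def has_crlf(path):
    """Return True if the file contains at least one CRLF line ending."""
    return b"\r\n" in Path(path).read_bytes()


def convert_crlf_to_lf(path):
    """Rewrite the file in place with LF endings; return how many were changed."""
    p = Path(path)
    data = p.read_bytes()
    count = data.count(b"\r\n")
    if count:
        p.write_bytes(data.replace(b"\r\n", b"\n"))
    return count


# Demo on a throwaway file standing in for an alphabet saved on Windows.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "alphabet.txt"
    sample.write_bytes("父\r\n母\r\n".encode("utf-8"))

    print(has_crlf(sample))            # True
    print(convert_crlf_to_lf(sample))  # 2
    print(has_crlf(sample))            # False
```

Working on raw bytes avoids any newline translation by the text layer and leaves the UTF-8 encoded characters untouched. A tool such as dos2unix does the same job from the command line.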