fairseq: Errors running prepare_text.sh (and other preprocessing) from wav2vec-u in fresh environment

My Question:

How can I get prepare_text.sh running correctly in a fresh Ubuntu Jupyterlab environment? What needs to be installed, what variables set, etc.?

I’ve run into various issues attempting to run the prepare_text.sh script from https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/unsupervised/scripts/prepare_text.sh.

Right now, I’m stuck on preprocess.py: error: unrecognized arguments: --dict-only, but I’ve also run into several other errors that I’ve had to work around, detailed below.

Full current output:

After working through all the other issues detailed below, this is what I currently see when I run the script.

I cloned the https://github.com/pytorch/fairseq.git repo and navigated to the scripts folder (https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised/scripts) before running this.

(wav2vecu_pre) jovyan@user-ofmghcmafhv-jtfbeefyexclusive-0:~/work/fairseq/examples/wav2vec/unsupervised/scripts$ zsh prepare_text.sh sw /home/jovyan/work/WikiDumps/wiki_sw_head.txt /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
sw
sw
/home/jovyan/work/WikiDumps/wiki_sw_head.txt
/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
usage: preprocess.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format LOG_FORMAT] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu]
                     [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
                     [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE]
                     [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                     [--model-parallel-size MODEL_PARALLEL_SIZE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                     [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile]
                     [--criterion {masked_lm,nat_loss,sentence_ranking,ctc,composite_loss,cross_entropy,legacy_masked_lm_loss,sentence_prediction,adaptive_loss,label_smoothed_cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
                     [--tokenizer {moses,nltk,space}] [--bpe {sentencepiece,bytes,characters,byte_bpe,gpt2,hf_byte_bpe,fastbpe,subword_nmt,bert}]
                     [--optimizer {adam,adamax,adagrad,adafactor,adadelta,lamb,sgd,nag}]
                     [--lr-scheduler {triangular,fixed,reduce_lr_on_plateau,cosine,polynomial_decay,tri_stage,inverse_sqrt}] [--scoring {sacrebleu,bleu,wer,chrf}]
                     [--task TASK] [-s SRC] [-t TARGET] [--trainpref FP] [--validpref FP] [--testpref FP] [--align-suffix FP] [--destdir DIR] [--thresholdtgt N]
                     [--thresholdsrc N] [--tgtdict FP] [--srcdict FP] [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN] [--dataset-impl FORMAT] [--joined-dictionary]
                     [--only-source] [--padding-factor N] [--workers N]
preprocess.py: error: unrecognized arguments: --dict-only
cut: /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/dict.txt: No such file or directory
fatal error: PHONEMIZER_ESPEAK_PATH=espeak not found is not an executable file
fatal error: PHONEMIZER_ESPEAK_PATH=espeak not found is not an executable file
one is 
sed: can't read /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones.txt: No such file or directory
paste: /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones.txt: No such file or directory
usage: preprocess.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format LOG_FORMAT] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu]
                     [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
                     [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE]
                     [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE]
                     [--model-parallel-size MODEL_PARALLEL_SIZE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
                     [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile]
                     [--criterion {masked_lm,nat_loss,sentence_ranking,ctc,composite_loss,cross_entropy,legacy_masked_lm_loss,sentence_prediction,adaptive_loss,label_smoothed_cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
                     [--tokenizer {moses,nltk,space}] [--bpe {sentencepiece,bytes,characters,byte_bpe,gpt2,hf_byte_bpe,fastbpe,subword_nmt,bert}]
                     [--optimizer {adam,adamax,adagrad,adafactor,adadelta,lamb,sgd,nag}]
                     [--lr-scheduler {triangular,fixed,reduce_lr_on_plateau,cosine,polynomial_decay,tri_stage,inverse_sqrt}] [--scoring {sacrebleu,bleu,wer,chrf}]
                     [--task TASK] [-s SRC] [-t TARGET] [--trainpref FP] [--validpref FP] [--testpref FP] [--align-suffix FP] [--destdir DIR] [--thresholdtgt N]
                     [--thresholdsrc N] [--tgtdict FP] [--srcdict FP] [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN] [--dataset-impl FORMAT] [--joined-dictionary]
                     [--only-source] [--padding-factor N] [--workers N]
preprocess.py: error: unrecognized arguments: --dict-only
2021-06-03 16:39:42 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang=None, target_lang=None, trainpref='/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones/lm.phones.filtered.txt', validpref=None, testpref=None, align_suffix=None, destdir='/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones', thresholdtgt=0, thresholdsrc=0, tgtdict=None, srcdict='/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones/dict.phn.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=True, padding_factor=8, workers=70)
Traceback (most recent call last):
  File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 401, in <module>
    cli_main()
  File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 397, in cli_main
    main(args)
  File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 98, in main
    src_dict = task.load_dictionary(args.srcdict)
  File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/tasks/fairseq_task.py", line 54, in load_dictionary
    return Dictionary.load(filename)
  File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/data/dictionary.py", line 214, in load
    d.add_from_file(f)
  File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/data/dictionary.py", line 225, in add_from_file
    self.add_from_file(fd)
  File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/data/dictionary.py", line 249, in add_from_file
    raise RuntimeError(
RuntimeError: Duplicate word found when loading Dictionary: '<SIL>'. Duplicate words can overwrite earlier ones by adding the #fairseq:overwrite flag at the end of the corresponding row in the dictionary file. If using the Camembert model, please download an updated copy of the model file.
prepare_text.sh:49: command not found: lmplz
prepare_text.sh:50: command not found: build_binary
python: can't open file '/home/jovyan/work/fairseq/examples/wav2vec/unsupervised/scripts/examples/speech_recognition/kaldi/kaldi_initializer.py': [Errno 2] No such file or directory
python: can't open file '/home/jovyan/work/fairseq/examples/wav2vec/unsupervised/scripts/examples/speech_recognition/kaldi/kaldi_initializer.py': [Errno 2] No such file or directory
prepare_text.sh:54: command not found: lmplz
prepare_text.sh:55: command not found: build_binary
prepare_text.sh:56: command not found: lmplz
prepare_text.sh:57: command not found: build_binary
Primary config directory not found.
Check that the config directory '/home/jovyan/work/fairseq/examples/speech_recognition/kaldi/config' exists and readable
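The RuntimeError about a duplicate '<SIL>' entry near the end suggests its own workaround in the message: add the #fairseq:overwrite flag to the repeated row. Here is a minimal sketch of applying that, demonstrated on a stub file rather than the real dict.phn.txt (whether this just masks a deeper version problem in my setup, I don’t know):

```shell
# Stub input mimicking the failure: the dictionary contains '<SIL>' twice.
printf '<SIL> 1\n<UNK> 1\n<SIL> 1\n' > dict.demo.txt

# Append the "#fairseq:overwrite" flag to every repeated word, as the
# RuntimeError message suggests; first occurrences are left untouched.
awk 'seen[$1]++ { $0 = $0 " #fairseq:overwrite" } { print }' \
    dict.demo.txt > dict.fixed.txt

cat dict.fixed.txt
```

Only the second '<SIL>' line gains the flag; unique entries pass through unchanged.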

Fixed (?) Problem: Can’t seem to run it from the same folder as the README (workaround: run from scripts folder)

First, I can’t run it from the folder that the README at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised#preparation-of-speech-and-text-data says to run it from. If you try, you get path-not-found errors for the other scripts it calls, e.g.:

zsh scripts/prepare_text.sh sw /home/jovyan/work/WikiDumps/wiki_sw_head.txt /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
sw
sw
/home/jovyan/work/WikiDumps/wiki_sw_head.txt
/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
python: can't open file '/home/jovyan/work/fairseq/examples/wav2vec/unsupervised/normalize_and_filter_text.py': [Errno 2] No such file or directory

Fixed (?) Problem: “ValueError: lid.187.bin cannot be opened for loading!” (workaround: use lid.176.bin instead)

Solution: download a different language-ID model and edit the code to use it.

https://fasttext.cc/docs/en/language-identification.html has a different model, lid.176.bin

wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

and edit this portion of normalize_and_filter_text.py:

    parser.add_argument(
        "--fasttext-model",
        help="path to fasttext model",
        default="lid.176.bin",
    )
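The same edit can be made as a one-line substitution. Demonstrated here on a stub file, since the exact text in normalize_and_filter_text.py may differ slightly from this pattern; check before running sed -i on the real script:

```shell
# Stand-in for the relevant line of normalize_and_filter_text.py:
printf '        default="lid.187.bin",\n' > argdefault.demo

# Swap the missing lid.187.bin default for the downloadable lid.176.bin.
sed -i 's/lid\.187\.bin/lid.176.bin/' argdefault.demo

cat argdefault.demo
```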

Fixed (?) Problem: dependencies needed (phonemizer, fasttext, fairseq)

The script does not list which dependencies it needs. So far I’ve determined that phonemizer and fasttext are required, and I think fairseq is too. Am I missing any?
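A quick presence check for the dependencies I believe are needed (this list is my best guess, not from any official requirements file):

```shell
# Report ok/MISSING for each Python dependency I've identified so far.
deps_report=$(for pkg in phonemizer fasttext fairseq; do
    python -c "import $pkg" 2>/dev/null \
        && echo "$pkg: ok" || echo "$pkg: MISSING"
done)
echo "$deps_report"

# phonemizer also needs a system-level backend. One plausible (untested)
# fix for the "PHONEMIZER_ESPEAK_PATH=espeak not found" lines in the
# output above: point the variable at the espeak-ng binary explicitly.
command -v espeak-ng >/dev/null \
    && export PHONEMIZER_ESPEAK_PATH="$(command -v espeak-ng)"
```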

Fixed (?) Problem: can’t find files in fairseq_cli (solution: you need to set an environment variable, FAIRSEQ_ROOT).

I set it to point to the top level of the cloned repo; not sure if that’s right.

(I cloned the repo to ~/work/fairseq/)

export FAIRSEQ_ROOT=~/work/fairseq/
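A sanity check for that guess (the layout is assumed from the clone): the traceback above shows the script invoking $FAIRSEQ_ROOT/fairseq_cli/preprocess.py, so the variable should name the directory that directly contains fairseq_cli/:

```shell
# Assumed clone location; adjust if you cloned elsewhere.
export FAIRSEQ_ROOT="$HOME/work/fairseq"

# The script resolves preprocess.py relative to FAIRSEQ_ROOT, so this
# file existing is a reasonable proxy for the variable being correct.
if [ -f "$FAIRSEQ_ROOT/fairseq_cli/preprocess.py" ]; then
    echo "FAIRSEQ_ROOT looks right"
else
    echo "fairseq_cli/preprocess.py not found under FAIRSEQ_ROOT"
fi
```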

Fixed (?) Problem: Not sure what language code to use. (guessed sw)

I’ve got Swahili data. Not sure whether to use sw, swahili, or something else; I assume I should pick a code from https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md.
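One way to check the guess locally, on the assumption that espeak-ng is the phonemizer backend actually in use: ask it which voices it knows about and look for the language there.

```shell
# List espeak-ng's voices and search for Swahili; falls back to a
# message if espeak-ng is not installed or has no matching voice.
espeak-ng --voices 2>/dev/null | grep -i swahili \
    || echo "no match (or espeak-ng not installed)"
```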

Code

Here’s the command I use to invoke the script. Other than editing the default langid model, I haven’t changed anything else in the repo, so it should match https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised/scripts. git log shows c47a9b2eef0f41b0564c8daf52cb82ea97fc6548 as the commit.

zsh prepare_text.sh language /home/jovyan/work/WikiDumps/wiki_sw_head.txt /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out


What’s your environment?

I’m in a Jupyterlab in a Docker container, running Ubuntu.

OS is Ubuntu 20.04.2:

cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

pip list:

pip list
Package                Version
---------------------- -------------------
antlr4-python3-runtime 4.8
attrs                  21.2.0
certifi                2021.5.30
cffi                   1.14.5
clldutils              3.9.0
colorlog               5.0.1
csvw                   1.11.0
Cython                 0.29.23
dataclasses            0.6
editdistance           0.5.3
fairseq                0.10.0
fasttext               0.9.2
hydra-core             1.0.6
isodate                0.6.0
joblib                 1.0.1
numpy                  1.20.3
omegaconf              2.0.6
phonemizer             2.2.2
pip                    21.1.2
portalocker            2.0.0
pybind11               2.6.2
pycparser              2.20
python-dateutil        2.8.1
PyYAML                 5.4.1
regex                  2021.4.4
rfc3986                1.5.0
sacrebleu              1.5.1
segments               2.2.0
setuptools             49.6.0.post20210108
six                    1.16.0
tabulate               0.8.9
torch                  1.8.1
tqdm                   4.61.0
typing-extensions      3.10.0.0
uritemplate            3.0.1
wheel                  0.36.2

conda list:

conda list
# packages in environment at /opt/conda/envs/wav2vecu_pre:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
antlr4-python3-runtime    4.8                      pypi_0    pypi
attrs                     21.2.0                   pypi_0    pypi
ca-certificates           2021.5.30            ha878542_0    conda-forge
certifi                   2021.5.30        py39hf3d152e_0    conda-forge
cffi                      1.14.5                   pypi_0    pypi
clldutils                 3.9.0                    pypi_0    pypi
colorlog                  5.0.1                    pypi_0    pypi
csvw                      1.11.0                   pypi_0    pypi
cython                    0.29.23                  pypi_0    pypi
dataclasses               0.6                      pypi_0    pypi
editdistance              0.5.3                    pypi_0    pypi
fairseq                   0.10.0                   pypi_0    pypi
fasttext                  0.9.2                    pypi_0    pypi
hydra-core                1.0.6                    pypi_0    pypi
isodate                   0.6.0                    pypi_0    pypi
joblib                    1.0.1                    pypi_0    pypi
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 9.3.0               h2828fa1_19    conda-forge
libgomp                   9.3.0               h2828fa1_19    conda-forge
libstdcxx-ng              9.3.0               h6de172a_19    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
numpy                     1.20.3                   pypi_0    pypi
omegaconf                 2.0.6                    pypi_0    pypi
openssl                   1.1.1k               h7f98852_0    conda-forge
phonemizer                2.2.2                    pypi_0    pypi
pip                       21.1.2             pyhd8ed1ab_0    conda-forge
portalocker               2.0.0                    pypi_0    pypi
pybind11                  2.6.2                    pypi_0    pypi
pycparser                 2.20                     pypi_0    pypi
python                    3.9.4           hffdb5ce_0_cpython    conda-forge
python-dateutil           2.8.1                    pypi_0    pypi
python_abi                3.9                      1_cp39    conda-forge
pyyaml                    5.4.1                    pypi_0    pypi
readline                  8.1                  h46c0cb4_0    conda-forge
regex                     2021.4.4                 pypi_0    pypi
rfc3986                   1.5.0                    pypi_0    pypi
sacrebleu                 1.5.1                    pypi_0    pypi
segments                  2.2.0                    pypi_0    pypi
setuptools                49.6.0           py39hf3d152e_3    conda-forge
six                       1.16.0                   pypi_0    pypi
sqlite                    3.35.5               h74cdb3f_0    conda-forge
tabulate                  0.8.9                    pypi_0    pypi
tk                        8.6.10               h21135ba_1    conda-forge
torch                     1.8.1                    pypi_0    pypi
tqdm                      4.61.0                   pypi_0    pypi
typing-extensions         3.10.0.0                 pypi_0    pypi
tzdata                    2021a                he74cb21_0    conda-forge
uritemplate               3.0.1                    pypi_0    pypi
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge

I also apt-installed phonemizer dependencies:

sudo apt-get install festival espeak-ng mbrola

And finally, here’s what I get from apt list | grep installed: apt-list.txt

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 59 (8 by maintainers)

Most upvoted comments

I followed the instructions at https://github.com/kpu/kenlm/blob/master/BUILDING to install the dependencies for kenlm. What they don’t mention is that you then need to take the resulting binaries from kenlm/build/bin/ and copy them to /usr/bin.
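An alternative to copying into /usr/bin is to put the kenlm build output directory on PATH instead; that should equally make the lmplz and build_binary commands that prepare_text.sh calls resolvable. The clone location below is an assumption:

```shell
# Assumes kenlm was cloned and built under ~/kenlm per its BUILDING doc,
# leaving lmplz and build_binary in ~/kenlm/build/bin.
export PATH="$HOME/kenlm/build/bin:$PATH"

# Confirm the binaries now resolve (falls back to a message if not).
command -v lmplz >/dev/null \
    && echo "lmplz: ok" || echo "lmplz: still missing"
```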