NeMo no longer supports transcripts with diacritics

Describe the bug

I am training an Arabic model with diacritics. In the text, each diacritic is represented as a separate Unicode character, distinct from the base letter it modifies. Here’s an example of the vocabulary/text that we are using. There are 10,878 unique words in the vocabulary, so it is large enough for an SPE tokenizer with a vocab_size of 1024.
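
For context, a short Python snippet (using a hypothetical example word; any diacritized Arabic word behaves the same) shows that the diacritics are standalone combining codepoints rather than part of the base letters:

import unicodedata

# Hypothetical example word: "بِسْمِ" (bismi) -- three base letters, each followed by a combining diacritic.
word = "\u0628\u0650\u0633\u0652\u0645\u0650"  # ب + KASRA + س + SUKUN + م + KASRA

for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  combining={unicodedata.combining(ch)}")

# Base letters (e.g. ARABIC LETTER BEH) print combining=0, while the diacritics
# (ARABIC KASRA, ARABIC SUKUN) are separate combining codepoints, so any tokenizer
# sees them as distinct characters from the letters they attach to.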

I am able to generate an SPE-based tokenizer from our diacritized vocabulary, and I can confirm that the generated document.txt and vocab.txt represent the Arabic text correctly. However, somewhere during training the tokenizer fails to decode the text properly.

This is an example output from the WER metric:

[NeMo I 2022-03-04 09:48:36 wer_bpe:204] reference:غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇
[NeMo I 2022-03-04 09:48:36 wer_bpe:205] predicted:جعلنا غبعوثون الض تجري غ ض غ استكبروا غ استكبرواࣲ غ وللكافرين إ وإليه الض قلتم استكبروابعوثون غ وإلى أظلم استكبروا س استكبروا غ عنهم أحسن استكبروا جعلنا الض الجاه ]
تجري الهمد للذين غ استكبروا القرآن غر جعلنا غ رؤوس أصببعوثون ضآيات شديد شديد وإلى غࣲ استكبروا غ استكبروا غ جعلنا غ أحسن الض ألفافا أنفسكم والسماء آتينا لله غ ض آتينا غ جعلنا الض غ قلتم آتينا الجاه تجري غ وأخ جعلنا ]
⁇  ألفافا جعلنا غإ غ استكبروا وأخ وإلى وأخ غ إ غ جعلنا تجري وإلى استكبروابعوثون

Notice two things here:

  1. The reference text contains only two tokens: ⁇ and غ (a round-trip check that reproduces this is sketched after the second log excerpt below).
  2. The predicted text does not contain any diacritics, despite the tokenizer and vocab having them. After a few epochs the model “converges” and predicts only those two tokens; the loss drops to 0 in ~1k steps.
[NeMo I 2022-03-04 09:49:46 wer_bpe:204] reference:غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇
[NeMo I 2022-03-04 09:49:46 wer_bpe:205] predicted:غ ⁇  غ ⁇  جعلنا جعلنا ⁇  وإلى غ ⁇  ⁇  غ ⁇  غ ⁇  ⁇  غ ⁇  جعلنا عليها ⁇  ألفافا جعلنا ⁇  غ ⁇  غ ⁇  غ ⁇  غ ⁇  لله بأفواه
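
As a quick sanity check, a minimal diagnostic sketch (assuming asr_model is the EncDecCTCModelBPE instance built by the training script below; the sample line is a hypothetical transcript from the train manifest) reproduces the ⁇ tokens by round-tripping a reference through the model’s own tokenizer:

# Diagnostic sketch: round-trip one training transcript through the model's tokenizer.
# `asr_model` is assumed to be the EncDecCTCModelBPE instance from the script below,
# and `sample_text` is a hypothetical diacritized line from the train manifest.
sample_text = "قُلْ هُوَ اللَّهُ أَحَدٌ"

ids = asr_model.tokenizer.text_to_ids(sample_text)
decoded = asr_model.tokenizer.ids_to_text(ids)

print("ids:    ", ids)
print("decoded:", decoded)  # comes back full of ⁇ (unk) instead of the original text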

If I train a model without diacritics, the same behavior occurs. The last working NeMo version was 1.4.0 (I will also test with 1.5.0). Previously, the model would converge at a reasonable rate and achieve good results (~4% WER).

Steps/Code to reproduce bug

There’s nothing really special about my training script; it’s taken from the examples. The model is fine-tuned from the stt_en_citrinet_1024 model, and cfg.model_target is nemo.collections.asr.models.EncDecCTCModelBPE.

from omegaconf import OmegaConf
import pytorch_lightning as pl
import wandb

from nemo.collections.asr.models import ASRModel
from nemo.core.config import hydra_runner
from nemo.utils import logging, model_utils
from nemo.utils.exp_manager import exp_manager


@hydra_runner(config_path="conf/citrinet/", config_name="config")
def main(cfg):
    # Setup trainer and exp. manager
    trainer = pl.Trainer(**cfg.trainer)
    log_dir = exp_manager(trainer, cfg.get("exp_manager", None))
    # Setup Model
    model_class = model_utils.import_class_by_path(cfg.model_target)  # type: ASRModel
    asr_model = model_class.from_pretrained(model_name=cfg.init_from_pretrained_model)
    asr_model.cfg = cfg.model
    asr_model.set_trainer(trainer)
    asr_model.setup_training_data(cfg.model.train_ds)
    asr_model.setup_multiple_validation_data(cfg.model.validation_ds)
    asr_model.setup_optimization(cfg.model.optim)
    # Setup Augmentation
    asr_model.spec_augmentation = asr_model.from_config_dict(cfg.model.spec_augment)
    # Change vocab
    asr_model.change_vocabulary(
        new_tokenizer_dir=cfg.model.tokenizer.dir,
        new_tokenizer_type=cfg.model.tokenizer.type
    )
    trainer.fit(asr_model)


if __name__ == '__main__':
    main()

The tokenizer is generated using process_asr_text_tokenizer.py:

python process_asr_text_tokenizer.py --manifest=<path to train manifest files, separated by commas> \
         --data_root=tokenizers \
         --vocab_size=1024 \
         --tokenizer=spe \
         --log
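
To rule out the tokenizer artifacts themselves, the generated SentencePiece model can be round-tripped directly. This is only a sketch: the output directory name depends on the tokenizer type and vocab size, so the path below is an assumption and should be adjusted to whatever the script actually created.

import sentencepiece as spm

# Assumed path; process_asr_text_tokenizer.py writes the model under data_root,
# e.g. tokenizers/tokenizer_spe_bpe_v1024/ -- adjust to the generated directory.
sp = spm.SentencePieceProcessor(model_file="tokenizers/tokenizer_spe_bpe_v1024/tokenizer.model")

sample_text = "قُلْ هُوَ اللَّهُ أَحَدٌ"  # hypothetical diacritized line from the manifest

pieces = sp.encode(sample_text, out_type=str)
ids = sp.encode(sample_text, out_type=int)

print(pieces)          # the diacritics are preserved in the pieces here
print(sp.decode(ids))  # and the decoded text matches the input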

Expected behavior

The model should be able to converge and predict Arabic text accurately.

Environment overview (please complete the following information)

  • Environment location: Docker (nvcr.io/nvidia/pytorch:21.10-py3)
  • Method of NeMo install: pip install nemo_toolkit[all]==1.7.0
  • Docker build and run commands used: see below

Dockerfile:

FROM nvcr.io/nvidia/pytorch:21.10-py3

ARG DEBIAN_FRONTEND=noninteractive

COPY ./requirements.txt requirements.txt 

RUN apt update && \
    apt install -y ffmpeg libsndfile1 && \
    python3 -m pip install --upgrade pip && \
    python3 -m pip install -r requirements.txt

The container is run with:

docker run --rm -it --gpus all --shm-size 64G --ipc=host --env-file .env -v /home/$USER:/home/$USER train

Additional context

GPU: 8xV100 (AWS p3dn.24xlarge)

Most upvoted comments

@itzsimpl The ordering given there is what happens inside the constructor of the NeMo model itself. When you use init from pretrained, it first builds your model with your config, data loaders, etc., and only then loads the PyTorch checkpoint weights from the older model into your already-initialized model.

The older model’s tokenizer and dataloaders are not used at all; only its weights are copied into your new model.

Ditto, this diagram is helpful. Will leave comments on the PR.

If possible, updating the ASR_CTC_Language_Finetuning tutorial would also be helpful. Just a sentence in the “Update the vocabulary” section saying something along the lines of: “change_vocabulary() must be called before setting up your datasets, otherwise you might get decoding errors.”
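
For reference, here is the relevant portion of the reproduction script above, reordered so that change_vocabulary() runs before the data loaders are set up. This is only a sketch of the fix described here, not a tested patch; it reuses the calls already present in the script.

# Inside main(), after the trainer and exp_manager are created:
asr_model = model_class.from_pretrained(model_name=cfg.init_from_pretrained_model)
asr_model.cfg = cfg.model
asr_model.set_trainer(trainer)
# Change the vocabulary FIRST, so the datasets are built with the new tokenizer
asr_model.change_vocabulary(
    new_tokenizer_dir=cfg.model.tokenizer.dir,
    new_tokenizer_type=cfg.model.tokenizer.type,
)
# Only then set up the data loaders and optimization
asr_model.setup_training_data(cfg.model.train_ds)
asr_model.setup_multiple_validation_data(cfg.model.validation_ds)
asr_model.setup_optimization(cfg.model.optim)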

Agreed that we need to improve the documentation; such issues are subtle and incredibly hard to debug. There’s no way to even test for this easily in order to raise an error.