NeMo: NeMo no longer supports transcripts with diacritics
Describe the bug
I am training an Arabic model with diacritics.
Digitally, each diacritic is represented as a separate (unicode) character from the actual letter.
Here’s an example of the vocabulary/text that we are using.
There are 10,878 unique words in the vocabulary (so large enough for a SPE tokenizer with a vocab_size of 1024).
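To make the point about separate characters concrete, here is a small illustration (the example word is mine, not taken from our vocabulary):

import unicodedata

# Each diacritic (here, fatha) is its own code point following the base letter it modifies.
word = "كَتَبَ"  # "kataba" with short-vowel diacritics
for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0643  ARABIC LETTER KAF
# U+064E  ARABIC FATHA
# U+062A  ARABIC LETTER TEH
# U+064E  ARABIC FATHA
# U+0628  ARABIC LETTER BEH
# U+064E  ARABIC FATHA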
I am able to generate a SPE-based tokenizer using our diacritized vocabulary, and I can confirm that the generated document.txt and vocab.txt for the tokenizer have Arabic text represented correctly.
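As a sanity check outside of NeMo, the generated SentencePiece model itself can be round-tripped on diacritized text (the model path below is a placeholder for wherever the tokenizer was written):

import sentencepiece as spm

# Placeholder path: point this at the tokenizer.model generated for the experiment.
sp = spm.SentencePieceProcessor(model_file="tokenizers/<tokenizer_dir>/tokenizer.model")

text = "بِسْمِ اللَّهِ"  # diacritized sample
ids = sp.encode(text, out_type=int)
print(sp.decode(ids))  # should reproduce the diacritized text without ⁇ (unk) pieces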
However, somewhere during training the tokenizer fails to decode the text properly.
This is an example output from the WER metric:
[NeMo I 2022-03-04 09:48:36 wer_bpe:204] reference:غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇
[NeMo I 2022-03-04 09:48:36 wer_bpe:205] predicted:جعلنا غبعوثون الض تجري غ ض غ استكبروا غ استكبرواࣲ غ وللكافرين إ وإليه الض قلتم استكبروابعوثون غ وإلى أظلم استكبروا س استكبروا غ عنهم أحسن استكبروا جعلنا الض الجاه ]
تجري الهمد للذين غ استكبروا القرآن غر جعلنا غ رؤوس أصببعوثون ضآيات شديد شديد وإلى غࣲ استكبروا غ استكبروا غ جعلنا غ أحسن الض ألفافا أنفسكم والسماء آتينا لله غ ض آتينا غ جعلنا الض غ قلتم آتينا الجاه تجري غ وأخ جعلنا ]
⁇ ألفافا جعلنا غإ غ استكبروا وأخ وإلى وأخ غ إ غ جعلنا تجري وإلى استكبروابعوثون
Notice two things here:
- The reference text is only two characters: ⁇ and غ.
- The predicted text does not have any diacritics (despite the tokenizer and vocab having diacritics).
After a few epochs, the model “converges” and predicts only those two characters. Basically, the loss goes to 0 in ~1k steps.
[NeMo I 2022-03-04 09:49:46 wer_bpe:204] reference:غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇
[NeMo I 2022-03-04 09:49:46 wer_bpe:205] predicted:غ ⁇ غ ⁇ جعلنا جعلنا ⁇ وإلى غ ⁇ ⁇ غ ⁇ غ ⁇ ⁇ غ ⁇ جعلنا عليها ⁇ ألفافا جعلنا ⁇ غ ⁇ غ ⁇ غ ⁇ غ ⁇ لله بأفواه
If I train a model without diacritics, the same behavior occurs. The last working NeMo version was 1.4.0 (will test with 1.5.0).
Previously, the model would converge at a reasonable rate and achieve good results (~4% WER).
Steps/Code to reproduce bug
There’s nothing really special about my training script; it’s taken from the examples. The model is fine-tuned from the stt_en_citrinet_1024 model, and cfg.model_target is nemo.collections.asr.models.EncDecCTCModelBPE:
from omegaconf import OmegaConf
import pytorch_lightning as pl
import wandb
from nemo.collections.asr.models import ASRModel
from nemo.core.config import hydra_runner
from nemo.utils import logging, model_utils
from nemo.utils.exp_manager import exp_manager
@hydra_runner(config_path="conf/citrinet/", config_name="config")
def main(cfg):
    # Setup trainer and exp. manager
    trainer = pl.Trainer(**cfg.trainer)
    log_dir = exp_manager(trainer, cfg.get("exp_manager", None))

    # Setup Model
    model_class = model_utils.import_class_by_path(cfg.model_target)  # type: ASRModel
    asr_model = model_class.from_pretrained(model_name=cfg.init_from_pretrained_model)
    asr_model.cfg = cfg.model
    asr_model.set_trainer(trainer)
    asr_model.setup_training_data(cfg.model.train_ds)
    asr_model.setup_multiple_validation_data(cfg.model.validation_ds)
    asr_model.setup_optimization(cfg.model.optim)

    # Setup Augmentation
    asr_model.spec_augmentation = asr_model.from_config_dict(cfg.model.spec_augment)

    # Change vocab
    asr_model.change_vocabulary(
        new_tokenizer_dir=cfg.model.tokenizer.dir,
        new_tokenizer_type=cfg.model.tokenizer.type,
    )

    trainer.fit(asr_model)


if __name__ == '__main__':
    main()
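For context, the script is launched through Hydra, so the relevant fields come from the YAML config or from command-line overrides, roughly like this (the script name and paths are placeholders):

python train.py \
    init_from_pretrained_model=stt_en_citrinet_1024 \
    model.tokenizer.dir=tokenizers/<tokenizer_dir> \
    model.tokenizer.type=bpe \
    model.train_ds.manifest_filepath=<train manifest> \
    model.validation_ds.manifest_filepath=<val manifest>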
The tokenizer is generated using process_asr_text_tokenizer.py:
python process_asr_text_tokenizer.py --manifest=<path to train manifest files, separated by commas> \
--data_root=tokenizers \
--vocab_size=1024 \
--tokenizer=spe \
--log
Expected behavior
The model should be able to converge and predict Arabic text accurately.
Environment overview (please complete the following information)
- Environment location: Docker (nvcr.io/nvidia/pytorch:21.10-py3)
- Method of NeMo install: pip install nemo_toolkit[all]==1.7.0
- If method of install is [Docker], provide docker pull & docker run commands used
Dockerfile:
FROM nvcr.io/nvidia/pytorch:21.10-py3
ARG DEBIAN_FRONTEND=noninteractive
COPY ./requirements.txt requirements.txt
RUN apt update && \
    apt install -y ffmpeg libsndfile1 && \
    python3 -m pip install --upgrade pip && \
    python3 -m pip install -r requirements.txt
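The build command is not shown here; presumably the image tagged train in the run command below was built with something like:

docker build -t train .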
docker run --rm -it --gpus all --shm-size 64G --ipc=host --env-file .env -v /home/$USER:/home/$USER train
Additional context
GPU: 8xV100 (AWS p3dn.24xlarge)
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 28 (14 by maintainers)
@itzsimpl The ordering given there is what happens inside the constructor of the NeMo model itself. When you use init-from-pretrained, it will first build your model with your config, data loaders, etc., and only then load the PyTorch checkpoint weights from the older model into your already initialized model.
The older model’s tokenizer or dataloaders are not used at all, only its weights are copied into your new model.
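In other words (a simplified sketch of the behavior described above, not the actual NeMo internals):

from nemo.collections.asr.models import EncDecCTCModelBPE

# cfg and trainer as in the training script earlier in the issue.
# Build the new model from YOUR config (your tokenizer, your data loaders)...
new_model = EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer)

# ...then copy only the matching weights of the pretrained model into it;
# its tokenizer and data loaders are never used.
pretrained = EncDecCTCModelBPE.from_pretrained("stt_en_citrinet_1024")
new_model.load_state_dict(pretrained.state_dict(), strict=False)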
Ditto, this diagram is helpful. Will leave comments on the PR.
If possible, updating the ASR_CTC_Language_Finetuning tutorial would also be helpful. Just a sentence in the “Update the vocabulary” section saying something along the lines of “The change_vocabulary() call must be performed before setting up your datasets, otherwise you might get decoding errors.”

Agreed that we need to improve the documentation; such issues are subtle and incredibly hard to debug. There’s no way to even test this easily to raise an error.
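For reference, applying this advice to the training script in the issue means swapping the order of the vocabulary change and the dataset setup, roughly like so (same config fields as in the script above):

# Swap the tokenizer FIRST, so the data loaders are built with the new
# (Arabic) tokenizer instead of the pretrained English one.
asr_model.change_vocabulary(
    new_tokenizer_dir=cfg.model.tokenizer.dir,
    new_tokenizer_type=cfg.model.tokenizer.type,
)

# Only then set up the datasets and optimization.
asr_model.setup_training_data(cfg.model.train_ds)
asr_model.setup_multiple_validation_data(cfg.model.validation_ds)
asr_model.setup_optimization(cfg.model.optim)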