transformers: Windows: Can't find vocabulary file for MarianTokenizer

🐛 Bug

MarianTokenizer.from_pretrained() fails on Python 3.6.4 on Windows 10.

Information

Occurs when using the example here: https://huggingface.co/transformers/model_doc/marian.html?highlight=marianmtmodel#transformers.MarianMTModel

Model I am using (Bert, XLNet …): MarianMTModel

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • [X] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

The task I am working on is:

  • [ ] an official GLUE/SQUaD task: (give the name)
  • [X] my own task or dataset: (give details below)

To reproduce

Paste code from example and run:

from transformers import MarianTokenizer, MarianMTModel
from typing import List
src = 'fr'  # source language
trg = 'en'  # target language
sample_text = "où est l'arrêt de bus ?"
mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'

model = MarianMTModel.from_pretrained(mname)
tok = MarianTokenizer.from_pretrained(mname)
batch = tok.prepare_translation_batch(src_texts=[sample_text])  # don't need tgt_text for inference
gen = model.generate(**batch)  # for forward pass: model(**batch)
words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns "Where is the the bus stop ?"
print(words)

Steps to reproduce the behavior:

  1. Run the example
  2. The program terminates at tok = MarianTokenizer.from_pretrained(mname) with the following output:

stdbuf was not found; communication with perl may hang due to stdio buffering.
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1055, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 89, in __init__
    self._setup_normalizer()
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 95, in _setup_normalizer
    self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
  File "C:\Program Files\Python\lib\site-packages\mosestokenizer\punctnormalizer.py", line 47, in __init__
    super().__init__(argv)
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 64, in __init__
    self.start()
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 108, in start
    env=env,
  File "C:\Program Files\Python\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Program Files\Python\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Development/Research/COVID-19-Misinfo2/src/translate_test_2.py", line 9, in <module>
    tok = MarianTokenizer.from_pretrained(mname)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 902, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1058, in _from_pretrained
    "Unable to load vocabulary from file. "
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
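
A note on reading the two tracebacks: the first one ends inside subprocess with [WinError 2], which suggests that an external program could not be launched, and the later "Unable to load vocabulary" OSError is raised while handling that failure. The stdbuf warning mentions perl, so one plausible cause (an assumption, not confirmed in this report) is that perl is not on PATH. A minimal sketch to check this; the helper name find_missing_tools is hypothetical:

```python
import shutil


def find_missing_tools(tools=("perl", "stdbuf")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumption: the Moses punctuation normalizer shells out to perl.
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("perl and stdbuf are both on PATH")
```

If perl is reported as missing, that would be consistent with the FileNotFoundError above, although it does not prove that it is the only cause.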

Expected behavior

Prints ["Where is the the bus stop ?"]

Environment info

  • transformers version: 2.9.1
  • Platform: Windows-10-10.0.18362-SP0
  • Python version: 3.6.4
  • PyTorch version (GPU?): 1.5.0+cu101 (True)
  • Tensorflow version (GPU?): 2.1.0 (True)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (3 by maintainers)

Most upvoted comments

Just upgraded to version 3.0, and everything is working!
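
To check whether an installation is already at the version this comment reports as working, something like the following could be used. It is only a sketch: version_tuple and is_at_least are hypothetical helpers, and the parsing is deliberately simple (it stops at the first non-numeric component).

```python
def version_tuple(version):
    """Turn '2.9.1' into (2, 9, 1); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed, minimum="3.0"):
    """True if `installed` is the same as or newer than `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    try:
        import transformers
    except ImportError:
        print("transformers is not installed")
    else:
        print(transformers.__version__, is_at_least(transformers.__version__))
```

With the version from the environment info above, is_at_least("2.9.1") is False, so upgrading (for example with pip install --upgrade transformers) would be the step this comment describes.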