transformers: Windows: Can't find vocabulary file for MarianTokenizer
🐛 Bug: MarianTokenizer.from_pretrained() fails with Python 3.6.4 on Windows 10
Information
Occurs when running the example here: https://huggingface.co/transformers/model_doc/marian.html?highlight=marianmtmodel#transformers.MarianMTModel
Model I am using (Bert, XLNet …): MarianMTModel
Language I am using the model on (English, Chinese …): English
The problem arises when using:
- [x] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)

The task I am working on is:
- [ ] an official GLUE/SQuAD task: (give the name)
- [x] my own task or dataset: (give details below)
To reproduce
Paste the code from the example and run it:

```python
from transformers import MarianTokenizer, MarianMTModel
from typing import List

src = 'fr'  # source language
trg = 'en'  # target language
sample_text = "où est l'arrêt de bus ?"
mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'

model = MarianMTModel.from_pretrained(mname)
tok = MarianTokenizer.from_pretrained(mname)
batch = tok.prepare_translation_batch(src_texts=[sample_text])  # don't need tgt_text for inference
gen = model.generate(**batch)  # for forward pass: model(**batch)
words: List[str] = tok.batch_decode(gen, skip_special_tokens=True)  # returns "Where is the the bus stop ?"
print(words)
```
Steps to reproduce the behavior:
- Run the example.
- The program terminates on `tok = MarianTokenizer.from_pretrained(mname)`, after printing the warning `stdbuf was not found; communication with perl may hang due to stdio buffering.`
```
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1055, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 89, in __init__
    self._setup_normalizer()
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_marian.py", line 95, in _setup_normalizer
    self.punc_normalizer = MosesPunctuationNormalizer(self.source_lang)
  File "C:\Program Files\Python\lib\site-packages\mosestokenizer\punctnormalizer.py", line 47, in __init__
    super().__init__(argv)
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 64, in __init__
    self.start()
  File "C:\Program Files\Python\lib\site-packages\toolwrapper.py", line 108, in start
    env=env,
  File "C:\Program Files\Python\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Program Files\Python\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Development/Research/COVID-19-Misinfo2/src/translate_test_2.py", line 9, in <module>
    tok = MarianTokenizer.from_pretrained(mname)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 902, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "C:\Program Files\Python\lib\site-packages\transformers\tokenization_utils.py", line 1058, in _from_pretrained
    "Unable to load vocabulary from file. "
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.
```
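The first traceback points at the actual failure rather than the vocabulary file: in transformers 2.9.x, `MarianTokenizer._setup_normalizer` builds a `MosesPunctuationNormalizer` from the `mosestokenizer` package, which launches an external perl process, and `[WinError 2]` is `subprocess` failing to find that executable on PATH. A minimal pre-flight check can confirm this on an affected machine (`moses_perl_available` is a hypothetical helper written here for illustration, not part of transformers or mosestokenizer):

```python
import shutil

# mosestokenizer's MosesPunctuationNormalizer shells out to perl, so the
# [WinError 2] above usually means perl (and/or stdbuf) is missing from PATH.
def moses_perl_available() -> bool:
    # shutil.which returns the full path of the executable, or None if absent
    return shutil.which("perl") is not None

if not moses_perl_available():
    print("perl not found on PATH; the Moses punctuation normalizer "
          "cannot start on this machine.")
```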
Expected behavior
Prints `["Where is the the bus stop ?"]`
Environment info
- transformers version: 2.9.1
- Platform: Windows-10-10.0.18362-SP0
- Python version: 3.6.4
- PyTorch version (GPU?): 1.5.0+cu101 (True)
- Tensorflow version (GPU?): 2.1.0 (True)
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 17 (3 by maintainers)
Just upgraded to version 3.0, and everything is working!
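For context on why the upgrade helps: later transformers releases reportedly dropped the perl-backed `mosestokenizer` dependency in favour of a pure-Python punctuation normalizer, so no subprocess is spawned at tokenizer construction. A rough, stdlib-only sketch of the kind of normalization that step performs (illustrative only, not the actual library code):

```python
import re

# Toy stand-in for Moses-style punctuation normalization, in pure Python
# so no external perl process is involved.
def normalize_punct(text: str) -> str:
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace runs
    text = re.sub(r"\s+([?!:;,.])", r"\1", text)  # drop space before punctuation
    return text

print(normalize_punct("où est l'arrêt de bus  ?"))  # → où est l'arrêt de bus?
```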