transformers: Too many bugs in Version 2.5.0
- It cannot be installed on macOS. Running `pip install -U transformers` produces the following errors:
```
Building wheels for collected packages: tokenizers
  Building wheel for tokenizers (PEP 517) ... error
  ERROR: Command errored out with exit status 1:
   command: /anaconda/bin/python /anaconda/lib/python3.7/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /var/folders/5h/fr2vhgsx4jd8wz4bphzt22_8p1v0bf/T/tmpfh6km7na
       cwd: /private/var/folders/5h/fr2vhgsx4jd8wz4bphzt22_8p1v0bf/T/pip-install-fog09t3h/tokenizers
  Complete output (36 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib
  creating build/lib/tokenizers
  copying tokenizers/__init__.py -> build/lib/tokenizers
  creating build/lib/tokenizers/models
  copying tokenizers/models/__init__.py -> build/lib/tokenizers/models
  creating build/lib/tokenizers/decoders
  copying tokenizers/decoders/__init__.py -> build/lib/tokenizers/decoders
  creating build/lib/tokenizers/normalizers
  copying tokenizers/normalizers/__init__.py -> build/lib/tokenizers/normalizers
  creating build/lib/tokenizers/pre_tokenizers
  copying tokenizers/pre_tokenizers/__init__.py -> build/lib/tokenizers/pre_tokenizers
  creating build/lib/tokenizers/processors
  copying tokenizers/processors/__init__.py -> build/lib/tokenizers/processors
  creating build/lib/tokenizers/trainers
  copying tokenizers/trainers/__init__.py -> build/lib/tokenizers/trainers
  creating build/lib/tokenizers/implementations
  copying tokenizers/implementations/byte_level_bpe.py -> build/lib/tokenizers/implementations
  copying tokenizers/implementations/sentencepiece_bpe.py -> build/lib/tokenizers/implementations
  copying tokenizers/implementations/base_tokenizer.py -> build/lib/tokenizers/implementations
  copying tokenizers/implementations/__init__.py -> build/lib/tokenizers/implementations
  copying tokenizers/implementations/char_level_bpe.py -> build/lib/tokenizers/implementations
  copying tokenizers/implementations/bert_wordpiece.py -> build/lib/tokenizers/implementations
  copying tokenizers/__init__.pyi -> build/lib/tokenizers
  copying tokenizers/models/__init__.pyi -> build/lib/tokenizers/models
  copying tokenizers/decoders/__init__.pyi -> build/lib/tokenizers/decoders
  copying tokenizers/normalizers/__init__.pyi -> build/lib/tokenizers/normalizers
  copying tokenizers/pre_tokenizers/__init__.pyi -> build/lib/tokenizers/pre_tokenizers
  copying tokenizers/processors/__init__.pyi -> build/lib/tokenizers/processors
  copying tokenizers/trainers/__init__.pyi -> build/lib/tokenizers/trainers
  running build_ext
  running build_rust
  error: Can not find Rust compiler
  ERROR: Failed building wheel for tokenizers
  Running setup.py clean for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers which use PEP 517 and cannot be installed directly
```
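The failing step is the source build of the `tokenizers` wheel, which needs a Rust compiler; `transformers` itself never gets installed. As a small illustration (this check is my own sketch, not part of either library), you can confirm whether `rustc` is on `PATH` before retrying the install:

```python
# Illustrative pre-flight check (not part of transformers/tokenizers): when pip
# finds no prebuilt tokenizers wheel for the platform, it compiles the Rust
# extension from source, so a Rust compiler must be discoverable on PATH.
import shutil

rustc = shutil.which("rustc")
if rustc is None:
    print("No Rust compiler found; install one (e.g. via https://rustup.rs) "
          "and then retry `pip install -U transformers`.")
else:
    print(f"Found Rust compiler at {rustc}; the wheel build should get past this error.")
```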
- On Linux it can be installed, but the following code fails:
```python
import transformers
transformers.AutoTokenizer.from_pretrained("bert-base-cased").save_pretrained("./")
transformers.AutoModel.from_pretrained("bert-base-cased").save_pretrained("./")
transformers.AutoTokenizer.from_pretrained("./")
transformers.AutoModel.from_pretrained("./")
```
Actually, it is the second line that produces the following error:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 587, in save_pretrained
    return vocab_files + (special_tokens_map_file, added_tokens_file)
TypeError: unsupported operand type(s) for +: 'NoneType' and 'tuple'
```
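A possible interim workaround, following the maintainers' suggestion further down to set `use_fast` to `False` explicitly (a sketch only; it assumes the slow, pure-Python tokenizer path is unaffected by this bug):

```python
# Sketch of the suggested workaround: request the slow (pure-Python) tokenizer
# instead of the new fast one, then retry the same save/load round trip.
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
tokenizer.save_pretrained("./")  # the step that raised TypeError above
model = transformers.AutoModel.from_pretrained("bert-base-cased")
model.save_pretrained("./")
transformers.AutoTokenizer.from_pretrained("./")
transformers.AutoModel.from_pretrained("./")
```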
- The vocabulary size of xlm-roberta is wrong, so the following code fails (this bug also exists in version 2.4.1):
```python
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.convert_ids_to_tokens(range(tokenizer.vocab_size))
```
The error is actually caused by the wrong vocab size:
```
[libprotobuf FATAL /sentencepiece/src/…/third_party/protobuf-lite/google/protobuf/repeated_field.h:1506] CHECK failed: (index) < (current_size_):
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  CHECK failed: (index) < (current_size_):
zsh: abort      python
```
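A small diagnostic sketch that avoids the hard abort (it assumes the XLM-R tokenizer exposes its underlying SentencePiece model as `sp_model`, as the 2.x implementation does): compare the vocabulary size the tokenizer reports with the number of pieces the SentencePiece model actually contains, instead of converting every id.

```python
# Diagnostic sketch (assumption: the loaded tokenizer has an `sp_model` attribute
# wrapping its SentencePiece model, as XLMRobertaTokenizer does in transformers 2.x).
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("xlm-roberta-base")
reported = tokenizer.vocab_size
actual = tokenizer.sp_model.get_piece_size()
print(f"reported vocab_size: {reported}, SentencePiece pieces: {actual}")
# If the reported size exceeds the SentencePiece piece count (plus any special
# tokens layered on top), converting the highest ids triggers the protobuf
# CHECK failure shown above.
```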
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (4 by maintainers)
Hi! Indeed, there have been a few issues, as this was the first release incorporating `tokenizers` by default. A new version of `tokenizers` and `transformers` will be available either today or tomorrow and should fix most of these.

I cannot answer that, I don't know what the roadmap looks like.
`tokenizers` sits in its own repository. You can find it here and its Python bindings here.

I think that the fast tokenizers are tested to get the exact same output as the other ones.
`use_fast` uses the `tokenizers` library, which is a new, extremely fast implementation of different tokenizers. I agree that for the first few releases it might've been better to expose the argument but set it to `False` by default, so that errors would only be caught by early adopters. Now many errors are reported that could otherwise have been avoided. In the meantime, you can explicitly set it to `False`.

For future reference, when you say that some code "fails", please also provide the stack trace. This helps greatly when debugging.
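As a concrete illustration of the `use_fast=False` suggestion above (assuming the 2.5.0 release accepts the keyword as that comment describes):

```python
# Explicitly opt out of the fast tokenizers until the fixed release lands.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
```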