rasa: [Diet Classifier] ValueError: Number of examples should be the same for all data.
Rasa version: 1.9.2
Rasa SDK version (if used & relevant):
Rasa X version (if used & relevant):
Python version: 3.6
Operating system (windows, osx, …): linux
Issue: When training rasa nlu (i.e. rasa train nlu) there is an error from rasa/utils/tensorflow/model_data.py line 107
Error (including full traceback):
2020-03-27 13:13:06 INFO rasa.nlu.model - Starting to train component tokenizer_whitespace
2020-03-27 13:13:09 INFO rasa.nlu.model - Finished training component.
2020-03-27 13:13:09 INFO rasa.nlu.model - Starting to train component RegexFeaturizer
2020-03-27 13:13:09 INFO rasa.nlu.model - Finished training component.
2020-03-27 13:13:09 INFO rasa.nlu.model - Starting to train component CountVectorsFeaturizer
2020-03-27 13:13:09 INFO rasa.nlu.model - Finished training component.
2020-03-27 13:13:09 INFO rasa.nlu.model - Starting to train component DIETClassifier
Traceback (most recent call last):
  File "/home/gunsu/diet/bin/rasa", line 8, in <module>
    sys.exit(main())
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/__main__.py", line 91, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/cli/train.py", line 140, in train_nlu
    persist_nlu_training_data=args.persist_nlu_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 414, in train_nlu
    persist_nlu_training_data,
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 445, in _train_nlu_async
    persist_nlu_training_data=persist_nlu_training_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 474, in _train_nlu_with_validated_data
    persist_nlu_training_data=persist_nlu_training_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/train.py", line 86, in train
    interpreter = trainer.train(training_data, **kwargs)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/model.py", line 191, in train
    updates = component.train(working_data, self.config, **context)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 622, in train
    model_data = self.preprocess_train_data(training_data)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 601, in preprocess_train_data
    label_attribute=label_attribute,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 549, in _create_model_data
    model_data.add_features(LABEL_FEATURES, [Y_sparse, Y_dense])
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/utils/tensorflow/model_data.py", line 145, in add_features
    self.num_examples = self.number_of_examples()
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/utils/tensorflow/model_data.py", line 107, in number_of_examples
    f"Number of examples differs for keys '{data.keys()}'. Number of "
ValueError: Number of examples differs for keys 'dict_keys(['text_features', 'label_features'])'. Number of examples should be the same for all data.
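For readers unfamiliar with this error, the kind of consistency check that produces it can be sketched as follows. This is a minimal illustration, not Rasa's actual code: the function body, the plain-list data layout, and the sample values are all assumptions made for the example. Only the key names and the error message come from the traceback above.

```python
from typing import Dict, List, Sequence


def number_of_examples(data: Dict[str, List[Sequence]]) -> int:
    """Return the example count shared by all feature arrays.

    Hypothetical re-creation of the check: every array stored under every
    key must have the same number of rows (one row per training example).
    """
    counts = [len(features) for values in data.values() for features in values]
    if not counts:
        return 0
    if any(count != counts[0] for count in counts):
        raise ValueError(
            f"Number of examples differs for keys '{data.keys()}'. "
            "Number of examples should be the same for all data."
        )
    return counts[0]


# Made-up data: two text rows and two label rows, so the check passes.
consistent = {
    "text_features": [[[1, 0], [0, 1]]],
    "label_features": [[[0], [1]]],
}

# Made-up data: three text rows but only two label rows, so the check fails.
mismatched = {
    "text_features": [[[1, 0], [0, 1], [1, 1]]],
    "label_features": [[[0], [1]]],
}
```

Calling number_of_examples(mismatched) raises the same ValueError as in the traceback, which suggests looking for a pipeline step that yields a different number of feature rows for the text than for the labels.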
Command or request that led to error:
rasa train nlu -c ./bots/lib/config.yml -u ./bots/nlu_train.md --out ./models
Content of configuration file (config.yml) (if relevant):
language: "xx"
pipeline:
- name: "component.KoreanTokenizer"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_count_vectors"
  "token_pattern": '(?u)\b\w+\b' # regex changed so that single-character tokens are also recognized
- name: DIETClassifier
  intent_classification: True
  entity_recognition: False
  use_masked_language_model: False
  BILOU_flag: False
  number_of_transformer_layers: 0
  epochs: 100
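The effect of the token_pattern override in this config can be checked without Rasa. CountVectorsFeaturizer is built on scikit-learn's CountVectorizer, whose default token_pattern is (?u)\b\w\w+\b, and that vectorizer extracts tokens by applying the pattern with findall. The sketch below reproduces that step with the standard re module only. The Korean sample strings are made up for illustration, and whether dropped tokens are what triggers the error in a given setup is an assumption to verify against your own data.

```python
import re

# scikit-learn's CountVectorizer default: a token needs at least two word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# The pattern from the config above: a single word character is enough.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"


def tokens_seen(pattern: str, text: str) -> list:
    """Return the tokens a findall-based vectorizer would extract from text."""
    return re.findall(pattern, text)


# Hypothetical tokenizer output, joined by spaces.
text = "나 는 학교 에 간다"

default_tokens = tokens_seen(DEFAULT_PATTERN, text)
single_char_tokens = tokens_seen(SINGLE_CHAR_PATTERN, text)

print(default_tokens)       # single-character tokens are missing
print(single_char_tokens)   # all five tokens are kept

# An utterance made only of one-character tokens vanishes under the default.
print(tokens_seen(DEFAULT_PATTERN, "네"))
```

Running the same check over every line of the training file is a quick way to spot examples that end up with no tokens at all under the default pattern.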
Content of domain file (domain.yml) (if relevant):
About this issue
- State: closed
- Created 4 years ago
- Comments: 17 (8 by maintainers)
@robinsongh381 @JoaoVFelipe @tabergma @Ghostvv I am the colleague of @shfshf who provides the custom tokenizer component for his pipeline. I finally found that there are two root causes of this issue:
Solutions:
1. Set token_pattern to "(?u)\b\w+\b" for CountVectorsFeaturizer if you are using an East Asian language (I will try to make a PR to make it the default option for East Asian language settings).
2. … (the jieba tokenizer is a good one)

@tabergma It’s good to see that the official team has already taken action on problem 1. For problem 2, I am still working on rewriting the tokenizer, but since every problem goes away when we use jieba as the tokenizer, there is definitely something wrong with the custom tokenizer. I will keep you informed whether updating the custom tokenizer works or not.

Thanks @howl-anderson for the comment. We actually tackled problem 1 already in https://github.com/RasaHQ/rasa/issues/5905. It is already merged into master.
Just to be sure, if you update your custom tokenizer and solve the token_pattern issue, the problem is gone?

Hi! When I use rasa 1.10.1, the same error is still reported.