rasa: [Diet Classifier] ValueError: Number of examples should be the same for all data.

Rasa version: 1.9.2

Rasa SDK version (if used & relevant):

Rasa X version (if used & relevant):

Python version: 3.6

Operating system (windows, osx, …): linux

Issue: When training Rasa NLU (i.e. rasa train nlu), an error is raised from rasa/utils/tensorflow/model_data.py, line 107.

Error (including full traceback):

2020-03-27 13:13:06 INFO     rasa.nlu.model  - Starting to train component tokenizer_whitespace
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Finished training component.
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Starting to train component RegexFeaturizer
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Finished training component.
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Starting to train component CountVectorsFeaturizer
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Finished training component.
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Starting to train component DIETClassifier
Traceback (most recent call last):
  File "/home/gunsu/diet/bin/rasa", line 8, in <module>
    sys.exit(main())
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/__main__.py", line 91, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/cli/train.py", line 140, in train_nlu
    persist_nlu_training_data=args.persist_nlu_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 414, in train_nlu
    persist_nlu_training_data,
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 445, in _train_nlu_async
    persist_nlu_training_data=persist_nlu_training_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 474, in _train_nlu_with_validated_data
    persist_nlu_training_data=persist_nlu_training_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/train.py", line 86, in train
    interpreter = trainer.train(training_data, **kwargs)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/model.py", line 191, in train
    updates = component.train(working_data, self.config, **context)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 622, in train
    model_data = self.preprocess_train_data(training_data)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 601, in preprocess_train_data
    label_attribute=label_attribute,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 549, in _create_model_data
    model_data.add_features(LABEL_FEATURES, [Y_sparse, Y_dense])
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/utils/tensorflow/model_data.py", line 145, in add_features
    self.num_examples = self.number_of_examples()
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/utils/tensorflow/model_data.py", line 107, in number_of_examples
    f"Number of examples differs for keys '{data.keys()}'. Number of "
ValueError: Number of examples differs for keys 'dict_keys(['text_features', 'label_features'])'. Number of examples should be the same for all data.

Command or request that led to error:

rasa train nlu -c ./bots/lib/config.yml -u ./bots/nlu_train.md --out ./models

Content of configuration file (config.yml) (if relevant):

language: "xx"
pipeline:
  - name: "component.KoreanTokenizer"
  - name: "intent_entity_featurizer_regex"
  - name: "intent_featurizer_count_vectors"
    "token_pattern": '(?u)\b\w+\b' # regex changed so that single-character tokens are also recognized
  - name: DIETClassifier
    intent_classification: True
    entity_recognition: False
    use_masked_language_model: False
    BILOU_flag: False
    number_of_transformer_layers: 0
    epochs: 100

Content of domain file (domain.yml) (if relevant):

About this issue

State: closed
Created 4 years ago
Comments: 17 (8 by maintainers)

Most upvoted comments

@robinsongh381 @JoaoVFelipe @tabergma @Ghostvv I am a colleague of @shfshf, and I provide the custom tokenizer component for his pipeline. I finally found that there are two root causes of this issue:

1. The same issue as #1515. I have a very detailed explanation there, and I think it affects all East Asian languages (Chinese, Korean and more).
2. An out-of-date custom tokenizer: the tokenizer I provide is not compatible with the current Rasa (1.10.5). Rasa has changed the tokenizer protocol since 1.7.0 (https://github.com/RasaHQ/rasa/releases/tag/1.7.0):

"By default all tokenizer add a special token (CLS) to the end of the list of tokens. This token will be used to capture the features of the whole utterance."
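
To make the protocol change concrete, here is a minimal, library-free sketch of what "adding a special token to the end of the list of tokens" could look like. It does not use the Rasa API: the function names are hypothetical, and the literal token string `__CLS__` is an assumption (the release note quoted above only calls it "CLS").

```python
from typing import List

# Assumption: the exact spelling of the special token. The quoted release
# note only refers to it as "CLS".
CLS_TOKEN = "__CLS__"


def legacy_tokenize(text: str) -> List[str]:
    """Hypothetical old-style tokenizer: whitespace split, no CLS token."""
    return text.split()


def tokenize_with_cls(text: str) -> List[str]:
    """Hypothetical new-style tokenizer: same split, plus a trailing CLS token."""
    tokens = text.split()
    # The extra token stands in for the utterance as a whole.
    tokens.append(CLS_TOKEN)
    return tokens


if __name__ == "__main__":
    text = "book a table"
    print(legacy_tokenize(text))    # ['book', 'a', 'table']
    print(tokenize_with_cls(text))  # ['book', 'a', 'table', '__CLS__']
```

A tokenizer written in the old style returns one token fewer per utterance than downstream components expect, which is one plausible way for an outdated custom tokenizer to end up out of step with the rest of the pipeline.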

Solutions:

1. Set token_pattern to "(?u)\b\w+\b" for CountVectorsFeaturizer if you are using an East Asian language (I will try to open a PR to make it the default option for East Asian language settings).
2. If you are using a custom tokenizer, check whether it supports the new tokenizer protocol. If it does not, try rewriting it based on one of the official tokenizers; the jieba tokenizer is a good example.
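
The effect of the token_pattern change in solution 1 can be checked with plain `re`, without Rasa installed. The sketch below compares the pattern `(?u)\b\w\w+\b` (the default of scikit-learn's CountVectorizer, which I assume CountVectorsFeaturizer inherits) with the suggested `(?u)\b\w+\b`. The helper name and the sample utterances are made up for illustration.

```python
import re
from typing import List

# Default token_pattern of scikit-learn's CountVectorizer:
# a token needs at least two word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern suggested in solution 1: one word character is enough.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"


def extract_tokens(pattern: str, text: str) -> List[str]:
    """Hypothetical helper: return every substring matching the pattern."""
    return re.findall(pattern, text)


if __name__ == "__main__":
    # Sample utterances, already split into space-separated tokens.
    for text in ["네", "나 는 학교 에 간다"]:
        print(repr(text))
        print("  default:    ", extract_tokens(DEFAULT_PATTERN, text))
        print("  single-char:", extract_tokens(SINGLE_CHAR_PATTERN, text))
```

With the default pattern, the one-character utterance "네" yields no tokens at all, and the longer sentence keeps only "학교" and "간다". An utterance that produces no tokens gets no count-vector features, which may be how the number of text features falls out of step with the number of labels.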

howl-anderson on Jul 7, 2020

@tabergma It’s good to see that the official team has already taken action on problem 1. For problem 2, I am currently rewriting the tokenizer. Since all the problems disappear when we use jieba as the tokenizer, there is definitely something wrong with the custom tokenizer. I will keep you informed on whether updating the custom tokenizer works.

howl-anderson on Jul 7, 2020

Thanks @howl-anderson for the comment. We have actually already tackled problem 1 in https://github.com/RasaHQ/rasa/issues/5905. The fix is already merged into master.

Just to be sure: if you update your custom tokenizer and fix the token_pattern issue, is the problem gone?

tabergma on Jul 7, 2020

Hi! When I use Rasa 1.10.1, the same error is still reported.

shfshf on Jul 6, 2020