rasa: Capitalization throwing off the tensorflow_embedding classifier

Rasa NLU version: 0.12.2

Operating system: Windows 10

Content of model configuration file:

language: "en"

pipeline: "tensorflow_embedding"

Issue: Capitalization is seriously messing up intent classification for a model I trained using the new tensorflow_embedding pipeline. Example (I'm posting only the relevant output from the parser):

'text': 'no'
'intent': {'confidence': 0.9569746255874634, 'name': 'disagree'}
'text': 'No'
'intent': {'confidence': 0.6564008593559265, 'name': 'disagree'}
# See the lower confidence
#----
'text': 'yes'
'intent': {'confidence': 0.9270809888839722, 'name': 'agree'}
'text': 'Yes'
'intent': {'confidence': 0.6564008593559265, 'name': 'disagree'}
# It classifies this completely wrongly.
# (variations like 'yEs', 'yES', and 'YES' also give exactly the same confidence as 'Yes')
#----
'text': 'hi'
'intent': {'confidence': 0.8774316310882568, 'name': 'greet'}
'text': 'Hi'
'intent': {'confidence': 0.6564008593559265, 'name': 'disagree'}
# Again completely wrong!

I have no capital letters in any of my training data utterances. I trained another model on the same data with the spacy_sklearn pipeline, and it gives me the exact same confidences, down to the last digit, no matter how I capitalize my input.
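For anyone who wants to reproduce the comparison, a minimal script along these lines should work with Rasa NLU 0.12 (the model directory is a placeholder for wherever your trained model lives):

from rasa_nlu.model import Interpreter

# Load a model trained with the tensorflow_embedding pipeline
# ("./models/default/model_tf" is a placeholder path).
interpreter = Interpreter.load("./models/default/model_tf")

# Parse lowercase and capitalized variants side by side.
for text in ["no", "No", "yes", "Yes", "hi", "Hi"]:
    result = interpreter.parse(text)
    print(repr(text), "->", result["intent"])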

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 16 (15 by maintainers)

Most upvoted comments

This is an interesting usability issue. There are a number of ways you can remedy this, and we should document them better.

The simplest is to pass a preprocessor to the CountVectorsFeaturizer which just lowercases everything. Then “Hi” and “hi” get mapped to the same feature.
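Under the hood that featurizer wraps scikit-learn's CountVectorizer, so the effect of a lowercasing preprocessor can be sketched with plain scikit-learn (an illustration of the mechanism, not a drop-in Rasa component):

from sklearn.feature_extraction.text import CountVectorizer

# A preprocessor that lowercases the raw text before tokenization;
# scikit-learn also exposes the same behavior via lowercase=True.
vectorizer = CountVectorizer(preprocessor=lambda text: text.lower())
features = vectorizer.fit_transform(["hi", "Hi", "HI"])

print(vectorizer.vocabulary_)  # {'hi': 0} - all three variants share one feature
print(features.toarray())      # identical rows for "hi", "Hi", "HI"

That way the classifier never sees casing differences at all.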

Another approach is to add nlp_spacy and tokenizer_spacy to the pipeline, because if spaCy is present we will actually replace each token with its lemma. We didn't make that the default because then you would still have to load a spaCy model.
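To see why the spaCy route helps here, note that spaCy's lemmatizer normalizes case for most tokens. A quick check (assuming an English spaCy model is installed, e.g. via python -m spacy download en):

import spacy

# Assumes the English model is available; "en" was the usual
# shortcut name in the spaCy versions contemporary with Rasa NLU 0.12.
nlp = spacy.load("en")

for text in ["Hi", "hi", "YES", "yes"]:
    doc = nlp(text)
    print(text, "->", [token.lemma_ for token in doc])

# The capitalized variants lemmatize to the same lowercase forms,
# so downstream featurizers see identical tokens.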

Here is my pipeline; indeed, I load spaCy's language model for tokenization, which could be the reason why I am getting better results with tensorflow. Wasn't aware of that 👍

pipeline:
# a spaCy-based pipeline with duckling and the tensorflow embedding classifier added
# most components use their default values

- name: "nlp_spacy"
- name: "tokenizer_spacy"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_spacy"
- name: "ner_spacy"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "ner_duckling_http"
  locale: "nl_NL"
  url: "http://duckling:8000"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"

For our work, I switched to tensorflow for the moment because it is giving better results than spaCy's default model. But ours is a narrow-domain chatbot focused on a well-defined set of questions. There are some niche edge cases, but those are usually handled by the art of interrogation: asking your user the proper question.