rasa: MemoryError with tensorflow_embedding on ~73k dataset with 38 intents

As mentioned in the title, I am feeding in ~73k lines of training data classified into 38 intents, and I will eventually be using ~200k lines of messages to build my final model. But even with 73k, I get a MemoryError. This doesn't seem to be a RAM issue, as I don't see my RAM getting fully used up while the training code runs. Any input would be valuable. Below are the details:

Rasa NLU version: 0.13.8
Operating system: Windows Server 2016

Training the model as:

python -m rasa_nlu.train -c nlu_config.yml --data rasa_classification_train_set.md -o models --fixed_model_name nlu_classify_75k_38ctgy --project current --verbose

Content of model configuration file:

language: "en"

pipeline: "tensorflow_embedding"

Output / Issue:

2019-01-14 08:40:41 INFO     rasa_nlu.training_data.loading  - Training data format of rasa_classification_train_set.md is md
2019-01-14 08:40:43 INFO     rasa_nlu.training_data.training_data  - Training data stats:
        - intent examples: 73962 (38 distinct intents)
** removing entity names **
        - entity examples: 0 (0 distinct entities)
        - found entities:

2019-01-14 08:40:46 INFO     rasa_nlu.model  - Starting to train component tokenizer_whitespace
2019-01-14 08:40:55 INFO     rasa_nlu.model  - Finished training component.
2019-01-14 08:40:55 INFO     rasa_nlu.model  - Starting to train component ner_crf
2019-01-14 08:40:55 INFO     rasa_nlu.model  - Finished training component.
2019-01-14 08:40:55 INFO     rasa_nlu.model  - Starting to train component ner_synonyms
2019-01-14 08:40:55 INFO     rasa_nlu.model  - Finished training component.
2019-01-14 08:40:55 INFO     rasa_nlu.model  - Starting to train component intent_featurizer_count_vectors
Traceback (most recent call last):
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\rasa_nlu\train.py", line 184, in <module>
    num_threads=cmdline_args.num_threads)
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\rasa_nlu\train.py", line 154, in do_train
    interpreter = trainer.train(training_data, **kwargs)
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\rasa_nlu\model.py", line 196, in train
    **context)
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\rasa_nlu\featurizers\count_vectors_featurizer.py", line 214, in train
    X = self.vect.fit_transform(lem_exs).toarray()
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 947, in toarray
    out = self._process_toarray_args(order, out)
  File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\scipy\sparse\base.py", line 1184, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError

During this run, I don't see my RAM usage go above 6 GB, even though I have 16 GB of RAM. Thanks for your help!

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 46 (18 by maintainers)

Most upvoted comments

We’ve been experiencing some memory errors ourselves; it might just be that the array it’s about to create would be too big to fit into memory. The point where it breaks is when it converts a scipy sparse matrix into a numpy array – the numpy array is much bigger than the scipy sparse matrix, which is probably what’s causing this. We don’t really have a quick fix for that right now, but we may merge one in the future, as we’re working on optimising training for the tensorflow pipeline ourselves.
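For a sense of scale, here is a rough back-of-envelope sketch (not Rasa code; the example count, vocabulary size and density are assumptions) of how much bigger the dense array produced by .toarray() is than the sparse matrix it comes from:

import numpy as np
from scipy import sparse

# Assumed sizes, roughly matching ~74k examples and a mid-sized vocabulary.
n_examples, vocab_size = 74_000, 20_000
X_sparse = sparse.random(n_examples, vocab_size, density=0.0005,
                         format="csr", dtype=np.float64)

sparse_mb = (X_sparse.data.nbytes + X_sparse.indices.nbytes
             + X_sparse.indptr.nbytes) / 1e6
dense_mb = n_examples * vocab_size * 8 / 1e6   # what X_sparse.toarray() would need

print(f"sparse: ~{sparse_mb:.0f} MB, dense: ~{dense_mb:.0f} MB")
# Roughly 9 MB sparse vs ~11,800 MB dense with these numbers. The dense array
# is a single contiguous allocation, so np.zeros can raise MemoryError the
# moment it is requested, which is why RAM usage never climbs before the crash.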

Same problem here with 8k intents and 1-4 common_examples for each… 50 GiB of memory allocated, only 2 GiB used when it failed…

Using Docker on Ubuntu 18.04 (FROM python:3.6.8-slim-stretch)

rasa-config.yml:

language: "fr"

pipeline:
- name: "nlp_spacy"
- name: "tokenizer_spacy"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
  intent_tokenization_flag: true
  intent_split_symbol: "+"

From the Python console:

>>> import sys
>>> is_64bits = sys.maxsize > 2**32
>>> print(is_64bits)
True

From docker stats:

CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
494fdf122804 rasanlu_python_prod_1 0.01% 2.182GiB / 50GiB 4.36% 13.6MB / 135kB 0B / 54.2MB 82

CPU peaked at 2365% (24 cores); the 50 GiB limit was never reached (no difference with 120 GiB).

Error from the logs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/rasa_nlu/train.py", line 174, in <module>
    num_threads=cmdline_args.num_threads)
  File "/usr/local/lib/python3.6/site-packages/rasa_nlu/train.py", line 149, in do_train
    interpreter = trainer.train(training_data, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/rasa_nlu/model.py", line 190, in train
    **context)
  File "/usr/local/lib/python3.6/site-packages/rasa_nlu/classifiers/embedding_intent_classifier.py", line 446, in train
    training_data, intent_dict)
  File "/usr/local/lib/python3.6/site-packages/rasa_nlu/classifiers/embedding_intent_classifier.py", line 272, in _prepare_data_for_training
    all_Y = self._create_all_Y(X.shape[0])
  File "/usr/local/lib/python3.6/site-packages/rasa_nlu/classifiers/embedding_intent_classifier.py", line 256, in _create_all_Y
    all_Y = np.stack([self.encoded_all_intents for _ in range(size)])
  File "/usr/local/lib/python3.6/site-packages/numpy/core/shape_base.py", line 423, in stack
    return _nx.concatenate(expanded_arrays, axis=axis, out=out)
MemoryError

We have two branches:

  1. The following branch uses sparse matrices in the CountVectorsFeaturizer (see the sketch after this list for the general idea); the features are then used in the EmbeddingIntentClassifier. However, the code is not cleaned up yet. https://github.com/RasaHQ/rasa/tree/entity-recognition
  2. We are currently cleaning up the above branch and moving everything to https://github.com/RasaHQ/rasa/tree/combined-entity-intent-model. (It might take another 1-2 weeks until all the functionality of the first branch is on this branch.)
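
For illustration only (this is not the branch's actual code), scikit-learn's CountVectorizer already returns a scipy CSR matrix, so keeping the features sparse essentially means not densifying it with .toarray(); the example texts are made up:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["book a flight to paris", "cancel my order please", "what is the weather today"]
vect = CountVectorizer()

X_sparse = vect.fit_transform(texts)   # scipy CSR matrix; memory grows with non-zeros only
X_dense = X_sparse.toarray()           # dense ndarray; this is the step that fails at scale
print(type(X_sparse).__name__, type(X_dense).__name__)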

Yeah, it’s because we’re using numpy arrays here, which at this point take up a huge amount of memory – or are about to, as you said. The solution to this is using sparse arrays, which we do in a separate branch that isn’t quite ready to be merged yet. We will be merging it in the next few months, so you should have no more memory problems at that point.
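As a back-of-envelope estimate (the example count and encoding width below are assumptions, not taken from the reports), the np.stack call in _create_all_Y above builds one dense copy of encoded_all_intents per training example, so the allocation grows as n_examples × n_intents × intent_dim:

import numpy as np

n_intents, intent_dim = 8_000, 1_000          # assumed width of the intent encoding
encoded_all_intents = np.zeros((n_intents, intent_dim))   # stand-in for the real matrix

n_examples = 16_000                            # assumed number of training examples
per_copy_mb = encoded_all_intents.nbytes / 1e6   # ~64 MB for one copy
total_gb = encoded_all_intents.nbytes * n_examples / 1e9
print(f"one copy: ~{per_copy_mb:.0f} MB, np.stack total: ~{total_gb:.0f} GB")
# ~1,000 GB with these numbers -- far beyond any of the memory limits above.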

I have the same problem with a MemoryError and can’t train my model using tensorflow_embedding on a big training set. As a workaround, I train the model only on a small training set.

@suryavamsi1563 The branch https://github.com/RasaHQ/rasa/tree/combined-entity-intent-model is not ready yet; we faced some issues along the way. You should be able to use it at the beginning of next week.

I’m going to leave this open, because it is an issue and we are looking into it.

@akelad I had the same hypothesis given the point where it breaks. Thanks for your response, and I hope the Rasa team will fix this in the future, as working with the Rasa module has been very helpful. I ended up building an independent classification model (fine for my use case) and will be using Rasa for entity extraction (better usability than CRF).

@kenzydarioo A workaround would be to manually split your data and build sequential models for the additional intents. These additional intents can be tagged as ‘Others’ in the previous model, since the MemoryError seems to be caused mostly by the number of intents (a rough sketch of this splitting follows below). For example, I was able to create a model with 200k training examples and just 5 intents (though most of the data were duplicates). Let us know which approach worked for you!
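A rough sketch of that splitting workaround, assuming the .md file has already been parsed into (intent, text) pairs; the function name, chunk size and the "Others" relabeling are all hypothetical:

from collections import defaultdict

def split_into_sequential_datasets(examples, chunk_size=5):
    """examples: list of (intent, text) pairs; returns one dataset per intent chunk."""
    by_intent = defaultdict(list)
    for intent, text in examples:
        by_intent[intent].append(text)

    intents = sorted(by_intent)
    chunks = [intents[i:i + chunk_size] for i in range(0, len(intents), chunk_size)]

    datasets = []
    for chunk in chunks:
        data = []
        for intent, texts in by_intent.items():
            # Intents outside the current chunk are collapsed into "Others",
            # so each model only has to separate a handful of real intents.
            label = intent if intent in chunk else "Others"
            data.extend((label, t) for t in texts)
        datasets.append(data)
    return datasets

Each returned dataset can then be written back out as its own .md file and trained as a separate model.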

@akelad Is it possible to split the .md training data, train the parts separately, and somehow combine them into one model in the end? I’m experiencing the same thing with the tensorflow_embedding config. Thanks in advance.