rasa: MemoryError with tensorflow_embedding on ~73k dataset with 38 intents
As the title says, I am feeding in ~73k lines of training data classified into 38 intents, and I will eventually use ~200k lines of messages to build my final model. But even at 73k I get a MemoryError. This doesn't seem to be a RAM issue, as I don't see my RAM being fully used while the training code runs. Any input would be valuable. Details below:
Rasa NLU version: 0.13.8
Operating system: Windows Server 2016
Training the model as:
python -m rasa_nlu.train -c nlu_config.yml --data rasa_classification_train_set.md -o models --fixed_model_name nlu_classify_75k_38ctgy --project current --verbose
Content of model configuration file:
language: "en"
pipeline: "tensorflow_embedding"
Output / Issue:
2019-01-14 08:40:41 INFO rasa_nlu.training_data.loading - Training data format of rasa_classification_train_set.md is md
2019-01-14 08:40:43 INFO rasa_nlu.training_data.training_data - Training data stats:
- intent examples: 73962 (38 distinct intents)
** removing entity names **
- entity examples: 0 (0 distinct entities)
- found entities:
2019-01-14 08:40:46 INFO rasa_nlu.model - Starting to train component tokenizer_whitespace
2019-01-14 08:40:55 INFO rasa_nlu.model - Finished training component.
2019-01-14 08:40:55 INFO rasa_nlu.model - Starting to train component ner_crf
2019-01-14 08:40:55 INFO rasa_nlu.model - Finished training component.
2019-01-14 08:40:55 INFO rasa_nlu.model - Starting to train component ner_synonyms
2019-01-14 08:40:55 INFO rasa_nlu.model - Finished training component.
2019-01-14 08:40:55 INFO rasa_nlu.model - Starting to train component intent_featurizer_count_vectors
Traceback (most recent call last):
File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\rasa_nlu\train.py", line 184, in <module>
num_threads=cmdline_args.num_threads)
File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\rasa_nlu\train.py", line 154, in do_train
interpreter = trainer.train(training_data, **kwargs)
File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\rasa_nlu\model.py", line 196, in train
**context)
File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\rasa_nlu\featurizers\count_vectors_featurizer.py", line 214, in train
X = self.vect.fit_transform(lem_exs).toarray()
File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 947, in toarray
out = self._process_toarray_args(order, out)
File "C:\Users\harsh.khadloya\AppData\Local\Continuum\Anaconda3\lib\site-packages\scipy\sparse\base.py", line 1184, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
During this run, I don't see my RAM usage go above 6 GB, even though I have 16 GB of RAM. Thanks for your help!
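For context, here is a rough back-of-the-envelope estimate of what the np.zeros() call in the traceback is being asked to allocate. The example count comes from the training stats above and the 8-byte cell size from the CountVectorizer's default int64 counts; the vocabulary size is purely an assumption, not a number taken from the actual dataset.

```python
# Rough size of the dense matrix the count vectors featurizer tries to build.
# vocab_size is a guess -- the real value depends on the training data.
n_examples = 73962        # from the "intent examples" line in the training stats
vocab_size = 30000        # assumption: distinct tokens seen by the count vectorizer
bytes_per_cell = 8        # CountVectorizer counts default to int64

dense_bytes = n_examples * vocab_size * bytes_per_cell
print(dense_bytes / 1024.0 ** 3, "GiB")   # roughly 16.5 GiB with these numbers
```

Because numpy requests that whole block as a single contiguous allocation, the MemoryError is raised the moment the request is refused, which would explain why RAM usage never visibly climbs before the crash.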
About this issue
- State: closed
- Created 5 years ago
- Reactions: 3
- Comments: 46 (18 by maintainers)
Commits related to this issue
- Merge pull request #1621 from RasaHQ/none-route handle passing in None for a channels route — committed to RasaHQ/rasa by tmbo 5 years ago
We've been experiencing some memory errors ourselves. It might just be that the array it's about to create would be too big to fit into memory. The point where it breaks is where a scipy sparse array is converted into a numpy array; the numpy array is much bigger than the scipy sparse array, which is probably what's causing this. We don't really have a quick fix for that right now, but we may merge one in the future, as we're working on optimising training for the tensorflow pipeline ourselves.
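To make that concrete, here is a small, self-contained illustration of the line that fails in count_vectors_featurizer.py (X = self.vect.fit_transform(...).toarray()). The texts and counts are made up; the point is only that fit_transform() returns a compact scipy sparse matrix, and it is the .toarray() call that materialises the full examples-by-vocabulary dense matrix.

```python
# Mirrors the failing featurizer line on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["book me a flight to london",
         "what is the weather in berlin",
         "play some music please"] * 1000

vect = CountVectorizer()
X_sparse = vect.fit_transform(texts)   # scipy CSR matrix, stores only non-zero counts
X_dense = X_sparse.toarray()           # dense copy: n_examples x vocab_size cells

print("sparse payload bytes:", X_sparse.data.nbytes)
print("dense matrix bytes:  ", X_dense.nbytes)
# With a real vocabulary of tens of thousands of tokens, the dense copy is
# orders of magnitude larger than the sparse one, and that is the allocation
# that blows up.
```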
Same problem here with 8k intents and 1-4 common_examples each: 50 GiB of memory allocated, only 2 GiB in use when it failed.
Using Docker on Ubuntu 18.04 (FROM python:3.6.8-slim-stretch)
rasa-config.yml:
From the Python console:
From docker stats:
CPU peaked at 2365% (24 cores); the 50 GiB was never reached (no difference with 120 GiB).
Error from the logs:
We have two branches:
- https://github.com/RasaHQ/rasa/tree/entity-recognition: the features from the CountVectorsFeaturizer are then used in the EmbeddingIntentClassifier. However, the code is not cleaned up yet.

Yeah, it's because we're using numpy arrays here, which at this point take up a huge amount of memory, or are about to, as you said. The solution to this is using sparse arrays, which we are doing in a separate branch that isn't quite ready to be merged yet. We will be merging it in the next few months, so you should have no more problems with memory at that point.
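As an illustration of the direction described above (keeping the features sparse instead of densifying them), here is a minimal sketch using scikit-learn, whose estimators accept scipy sparse input directly. This is not Rasa's implementation, just a demonstration of the idea; the texts and labels are invented.

```python
# Minimal sketch: never call .toarray(), feed the sparse matrix straight to a
# classifier that supports sparse input. Purely illustrative, not Rasa code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["book a flight to london", "weather in berlin", "play a song"] * 100
labels = ["book_flight", "ask_weather", "play_music"] * 100

vect = CountVectorizer()
X = vect.fit_transform(texts)          # stays a scipy CSR matrix

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)                     # sparse-aware: the dense matrix is never built

print(clf.predict(vect.transform(["book a flight to paris"])))
```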
I have the same problem: a MemoryError when training my model with tensorflow_embedding on a big training set. As a workaround, I train the model on a small training set only.
@suryavamsi1563 The branch https://github.com/RasaHQ/rasa/tree/combined-entity-intent-model is not ready yet. We ran into some issues along the way. You should be able to use it at the beginning of next week.
I'm going to leave this open, because it is an issue and we are looking into it.
@akelad I had the same hypothesis, given the point where it breaks. Thanks for your response, and I hope the Rasa team will fix this in the future, as working with the Rasa module has been very helpful. I ended up building an independent classification model (fine for my use case) and will be using Rasa for entity extraction (better usability than plain CRF).
@kenzydarioo A workaround would be to manually split your data and build sequential models for the additional intents. Those additional intents can be tagged as 'Others' in the previous model, since the MemoryError seems to be driven mainly by the number of intents. For example, I was able to train a model on 200k training examples with just 5 intents (though most of the data were duplicates). Let us know which approach works for you!
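For anyone wanting to try that workaround, below is a rough sketch of splitting a Rasa NLU markdown training file into two tiers, collapsing the non-primary intents into an 'Others' intent for the first model. The file names, the choice of primary intents, and the very simple '## intent:' parsing are all assumptions for illustration, not anything provided by Rasa.

```python
# Split one Rasa NLU markdown training file into two tiers (sketch only).
import re

PRIMARY = {"greet", "goodbye", "book_flight", "ask_weather"}   # assumption

def intent_blocks(path):
    """Yield (intent_name, example_lines) pairs from a Rasa NLU markdown file."""
    intent, examples = None, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = re.match(r"##\s*intent:\s*(\S+)", line)
            if m:
                if intent is not None:
                    yield intent, examples
                intent, examples = m.group(1), []
            elif intent is not None and line.strip():
                examples.append(line)
    if intent is not None:
        yield intent, examples

others = []                      # examples for the catch-all intent of tier 1
with open("tier1.md", "w", encoding="utf-8") as t1, \
     open("tier2.md", "w", encoding="utf-8") as t2:
    for intent, examples in intent_blocks("rasa_classification_train_set.md"):
        if intent in PRIMARY:
            t1.write("## intent:%s\n" % intent)
            t1.writelines(examples)
            t1.write("\n")
        else:
            others.extend(examples)            # tier 1 only sees these as "Others"
            t2.write("## intent:%s\n" % intent)
            t2.writelines(examples)
            t2.write("\n")
    t1.write("## intent:Others\n")
    t1.writelines(others)
```

At inference time, anything the first model classifies as Others would be routed to the second model; whether that two-stage setup is acceptable obviously depends on the use case.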
@akelad Is it possible to split the .md training data, train it separately, and somehow append it to one model in the end? Because I'm experiencing the same thing using the tensorflow embedding config. Thanks in advance.