dvc.org: tutorials: caught MemoryError when "Running in bulk" in deep/define-ml-pipeline#running-in-bulk
Please provide information about your setup
- DVC version: 0.40.2 (installed by pip)
- OS: Ubuntu 18.04
- RAM: 8GB
~~I am following a tutorial in https://dvc.org/doc/tutorial/define-ml-pipeline.~~ UPDATE: This refers to http://localhost:3000/doc/tutorials/deep/define-ml-pipeline#running-in-bulk now.
In the “Running in bulk” section, the following command failed with a MemoryError:
$ dvc run -d code/featurization.py -d code/conf.py \
-d data/Posts-train.tsv -d data/Posts-test.tsv \
-o data/matrix-train.p -o data/matrix-test.p \
python code/featurization.py
Running command:
python code/featurization.py
The input data frame data/Posts-train.tsv size is (66999, 3)
Traceback (most recent call last):
File "code/featurization.py", line 48, in <module>
train_words = np.array(df_train.text.str.lower().values.astype('U'))
MemoryError
ERROR: failed to run command - stage 'matrix-train.p.dvc' cmd python code/featurization.py failed
Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
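A possible reading of the traceback: the failing line converts the text column to a NumPy fixed-width unicode array (`astype('U')`). That dtype stores every element at the width of the longest string, at 4 bytes per character, so a single very long post can inflate the whole array. The sketch below only estimates that size; the function name and the 30,000-character longest post are assumptions for illustration, not values measured on the tutorial dataset.

```python
def estimate_fixed_width_unicode_bytes(texts):
    """Estimate the size of a NumPy 'U' (fixed-width unicode) array.

    Every element is stored at the width of the longest string,
    using 4 bytes per character.
    """
    if not texts:
        return 0
    longest = max(len(t) for t in texts)
    return len(texts) * longest * 4


# The row count comes from the log above. The longest-post length is an
# assumed value for illustration only.
rows = 66999
assumed_longest_post = 30_000
estimated = rows * assumed_longest_post * 4
print(f"{estimated / 1e9:.1f} GB")  # prints "8.0 GB"
```

Under that assumption the array alone would need about as much memory as the whole machine has, before any vectorization starts.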
About this issue
- State: closed
- Created 5 years ago
- Comments: 30 (28 by maintainers)
@Naba7 I am working on a new tutorial. It will be up soon, with a smaller dataset and lower RAM requirements.
This issue affects me as well.
Paste from my Discord message:
What’s wrong? While running featurization.py I get some kind of buffer overflow. 16GB of RAM gets consumed in seconds, and the execution halts after the system freezes for a couple of seconds.
I get the output "The input data frame data/Posts-train.tsv size is (66999, 3)", so the code is valid up to that point. The next step most likely goes sideways, because an injected print(test) placed after train_words never shows up.
My setup includes 16GB of RAM. Unlike the earlier reports, I don’t get a MemoryError raised. I think dvc may not be verbose about Python errors.
@shcheklein @Naba7 @efiop This is just a matter of RAM requirements.
I ran it on my system, which has the following configuration:
It executed successfully, using almost 98% of RAM.
IMO, the RAM requirement is simply more than 16 GB then.
With 12 GB of RAM, it still does not execute. I think the problem is not memory but rather some implementation issue.
A user from Discord is running into the MemoryError on the same step, but now in the get-started guide. Discord context: https://discordapp.com/channels/485586884165107732/563406153334128681/581584115644629012
https://github.com/iterative/dvc.org/issues/380
Reopening this. As we discussed privately with @kurianbenoy, we need to find a way to modify it a bit so that it can run on a smaller machine. Ideas to try: filter the dataset artificially, try fewer features (it’s 5000 by default, try 2500), check if there is a way to use some optimized arrays.
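As a rough sketch of the "filter the dataset artificially" idea, one option is to sample a subset of rows and cap the length of each post, keeping the result as a plain Python list of strings rather than a fixed-width NumPy array. The helper name and the default limits below are hypothetical, not taken from the tutorial code.

```python
import random


def shrink_corpus(texts, max_rows=10_000, max_chars=2_000, seed=0):
    """Return a lowercased random sample of texts, each capped at max_chars.

    max_rows and max_chars are illustrative defaults (assumptions),
    not values used by the tutorial.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    if len(texts) > max_rows:
        texts = rng.sample(texts, max_rows)
    # A list of str objects costs only what each string needs,
    # instead of padding every row to the longest one.
    return [t[:max_chars].lower() for t in texts]


sample = shrink_corpus(
    ["Hello World", "DVC Pipelines", "Memory Error"],
    max_rows=2,
    max_chars=5,
)
print(len(sample))  # prints 2
```

scikit-learn's CountVectorizer accepts an iterable of strings, so a list like this could presumably be passed to it directly. Whether that fits the tutorial's featurization.py is untested here.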
Hi @mexeniz !
Looks like you are running out of memory 🙁 As opposed to our get-started guide, our tutorial has some beefy requirements on RAM. Have you tried get-started already? https://dvc.org/doc/get-started In essence, it is a simplified tutorial.
The tutorial got absorbed into get-started. Closing this. For get-started, we have a separate ticket.
As @shcheklein suggested, I tried reducing the number of features in CountVectorizer to 2500, 1000, 100, 50, and 1, and all of them gave a memory error. With 12 GB of RAM, I am still getting the memory error. @shcheklein