dvc.org: tutorials: caught MemoryError when "Running in bulk" in deep/define-ml-pipeline#running-in-bulk

Please provide information about your setup

DVC version: 0.40.2 (installed by pip)
OS: Ubuntu 18.04
RAM: 8GB

~~I am following a tutorial in https://dvc.org/doc/tutorial/define-ml-pipeline.~~ UPDATE: This refers to http://localhost:3000/doc/tutorials/deep/define-ml-pipeline#running-in-bulk now.

In the “Running in bulk” section, I ran this command and it failed with an error.

$ dvc run -d code/featurization.py -d code/conf.py \
            -d data/Posts-train.tsv -d data/Posts-test.tsv \
            -o data/matrix-train.p -o data/matrix-test.p \
            python code/featurization.py
Running command:
	python code/featurization.py
The input data frame data/Posts-train.tsv size is (66999, 3)
Traceback (most recent call last):
  File "code/featurization.py", line 48, in <module>
    train_words = np.array(df_train.text.str.lower().values.astype('U'))
MemoryError
ERROR: failed to run command - stage 'matrix-train.p.dvc' cmd python code/featurization.py failed

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
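A possible reading of the traceback, offered as a sketch and not as a confirmed diagnosis: `astype('U')` converts the text column to a fixed-width NumPy unicode array, in which every element is stored at the width of the longest string, at 4 bytes per character. The helper names below (`fixed_width_unicode_bytes`, `estimate_for_texts`) are hypothetical, and the longest-post length is an assumed figure, since the thread does not report it.

```python
# Rough estimate of the memory needed by a fixed-width NumPy unicode array.
# NumPy's 'U' dtype pads every element to the length of the longest string
# and stores 4 bytes per character (UCS-4).

BYTES_PER_CHAR = 4


def fixed_width_unicode_bytes(n_rows: int, max_len: int) -> int:
    """Approximate bytes for n_rows strings, each padded to max_len characters."""
    return n_rows * max_len * BYTES_PER_CHAR


def estimate_for_texts(texts) -> int:
    """Same estimate, computed from an actual iterable of strings."""
    texts = list(texts)
    if not texts:
        return 0
    return fixed_width_unicode_bytes(len(texts), max(len(t) for t in texts))


if __name__ == "__main__":
    n_rows = 66999         # row count reported in the log above
    longest_post = 30_000  # ASSUMPTION: the real maximum length is not given here
    needed = fixed_width_unicode_bytes(n_rows, longest_post)
    print(f"{needed / 2**30:.2f} GiB for the training texts alone")
```

Under that assumed maximum length, the estimate comes to roughly 7.5 GiB for a single array, which would be consistent with the failure on an 8 GB machine. A single very long post is enough to inflate the whole array, because the width is set by the longest element and not by the average.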

About this issue

State: closed
Created 5 years ago
Comments: 30 (28 by maintainers)

Most upvoted comments

@Naba7 I am working on a new tutorial. It will be up soon. With a smaller dataset and fewer RAM requirements.

ryokugyu on Jun 27, 2019

This issue affects me as well.

Paste from my Discord message:

What’s wrong? While running featurization.py I get some kind of buffer overflow. 16 GB of RAM gets consumed in seconds, and the execution halts after a couple of seconds of system freeze.

I get the output "The input data frame data/Posts-train.tsv size is (66999, 3)", so the code is valid up to that point. But the next step most likely goes sideways, because an injected print(test) placed after train_words does not show up.

My setup includes 16 GB of RAM. Unlike the earlier reports, I don’t get a MemoryError raised. I think dvc may not be verbose about Python errors.

depate on Mar 6, 2020

@shcheklein @Naba7 @efiop This is just a matter of RAM requirements.

I ran it on my system, which has the following configuration: [image: config_system]

And it executed successfully by using almost 98% of RAM.

IMO, the RAM requirement is just more than 16 GB then.

ryokugyu on Jun 22, 2019

@ryokugyu Unfortunately I don’t know specific minimal RAM requirements for running the tutorial 😦

With 12 GB of RAM, it is still not executing. I think the problem is not memory; rather, it’s some implementation issue.

ryokugyu on May 30, 2019

A user from Discord is running into the MemoryError on the same step, but now in the get-started guide. Discord context: https://discordapp.com/channels/485586884165107732/563406153334128681/581584115644629012

https://github.com/iterative/dvc.org/issues/380

efiop on May 24, 2019

Reopening this. As we discussed privately with @kurianbenoy, we need to find a way to modify it a bit so that we can run it on a smaller machine. Ideas to try: filter the dataset artificially, try fewer features (it’s 5000 now; try 2500 by default), and check whether there is a way to use some optimized arrays.
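The first idea above, filtering the dataset artificially, could look something like the following. This is a minimal sketch and not the change the maintainers made: `shrink_dataset`, its parameters, and the default limits are all hypothetical, and the sample data is synthetic. It caps both the number of rows and the length of each text, since a fixed-width string array is sized by the longest element.

```python
import random


def shrink_dataset(texts, max_rows=10_000, max_chars=2_000, seed=0):
    """Return at most max_rows randomly sampled texts, each cut to max_chars.

    All limits are illustrative defaults, not values taken from the tutorial.
    """
    texts = list(texts)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    if len(texts) > max_rows:
        texts = rng.sample(texts, max_rows)
    return [t[:max_chars] for t in texts]


if __name__ == "__main__":
    # Synthetic stand-in for the posts: many short texts plus one very long one.
    posts = ["short post"] * 50_000 + ["x" * 100_000]
    small = shrink_dataset(posts)
    print(len(small), max(len(t) for t in small))
```

Truncating long texts changes the features the model sees, so this trades some fidelity for a pipeline that fits on a smaller machine.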

shcheklein on May 17, 2019

Hi @mexeniz !

Looks like you are running out of memory 🙁 As opposed to our get-started guide, our tutorial has some beefy requirements on RAM. Have you tried get-started already? https://dvc.org/doc/get-started In essence, it is a simplified tutorial.

efiop on May 14, 2019

The tutorial got absorbed into get-started. Closing this. For get-started we have a separate ticket for this.

shcheklein on Jul 18, 2020

@ryokugyu Unfortunately I don’t know specific minimal RAM requirements for running the tutorial 😦

With 12 GB of RAM, it is still not executing. I think the problem is not memory; rather, it’s some implementation issue.

As @shcheklein suggested, I tried reducing the number of features in CountVectorizer to 2500, 1000, 100, 50, and 1, and all of them gave a memory error.

kurianbenoy on Jun 2, 2019

With 12 GB of RAM, I am still getting a memory error. @shcheklein

ryokugyu on May 26, 2019