dvc.org: tutorials: caught MemoryError when "Running in bulk" in deep/define-ml-pipeline#running-in-bulk

Please provide information about your setup

DVC version: 0.40.2 (installed by pip)
OS: Ubuntu 18.04
RAM: 8GB

~~I am following a tutorial in https://dvc.org/doc/tutorial/define-ml-pipeline.~~ UPDATE: This refers to http://localhost:3000/doc/tutorials/deep/define-ml-pipeline#running-in-bulk now.

In the “Running in bulk” section, I ran this command and got an error:

$ dvc run -d code/featurization.py -d code/conf.py \
            -d data/Posts-train.tsv -d data/Posts-test.tsv \
            -o data/matrix-train.p -o data/matrix-test.p \
            python code/featurization.py
Running command:
	python code/featurization.py
The input data frame data/Posts-train.tsv size is (66999, 3)
Traceback (most recent call last):
  File "code/featurization.py", line 48, in <module>
    train_words = np.array(df_train.text.str.lower().values.astype('U'))
MemoryError
ERROR: failed to run command - stage 'matrix-train.p.dvc' cmd python code/featurization.py failed

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
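
A note on why the failing line can be so memory-hungry: astype('U') asks NumPy for a fixed-width unicode array, in which every element is allocated 4 bytes per character of the longest string in the array. The sketch below estimates that size; the helper name and the toy data are hypothetical and are not taken from the tutorial's featurization.py.

```python
import numpy as np


def fixed_width_unicode_bytes(texts):
    """Estimate the bytes needed by np.array(texts).astype('U').

    NumPy '<U' arrays are fixed-width: each element takes
    4 bytes per character of the longest string in the array.
    """
    longest = max(len(t) for t in texts)
    return len(texts) * longest * 4


# Hypothetical toy data: three short posts and one long one.
toy = ["short", "tiny", "ok", "x" * 1000]

as_unicode = np.array(toy, dtype=object).astype("U")

# The single long string sets the width for every row.
print(as_unicode.dtype, as_unicode.nbytes, fixed_width_unicode_bytes(toy))
```

For the 66,999 rows reported above, the same estimate is 66999 × 4 × (length of the longest post) bytes, so one very long post may be enough to exhaust RAM at this line, before any later step runs.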

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 30 (28 by maintainers)

Most upvoted comments

@Naba7 I am working on a new tutorial. It will be up soon, with a smaller dataset and lower RAM requirements.

This issue affects me as well.

Paste from my Discord message:

What’s wrong? While running featurization.py I get some kind of buffer overflow. 16GB of RAM gets consumed in seconds, and the execution halts after a couple of seconds of system freeze.

I get the output "The input data frame data/Posts-train.tsv size is (66999, 3)", so the code is valid up to that point. But the next step most likely goes sideways, because an injected print(test) does not show up after train_words.

My setup includes 16GB of RAM. Unlike the earlier reports, I don’t get a MemoryError raised. I think dvc may not be verbose about Python errors.

@shcheklein @Naba7 @efiop This is just a matter of RAM requirements.

I ran it on my system, which has the following configuration: config_system

And it executed successfully, using almost 98% of RAM.

IMO, the RAM requirement is simply more than 16 GB, then.

@ryokugyu Unfortunately I don’t know specific minimal RAM requirements for running the tutorial 😦

With 12 GB of RAM, it is still not executing. I think the problem is not memory; rather, it's some implementation issue.

A user from Discord is running into the MemoryError on the same step, but now in the get-started guide. Discord context: https://discordapp.com/channels/485586884165107732/563406153334128681/581584115644629012

https://github.com/iterative/dvc.org/issues/380

Reopening this. As we discussed privately with @kurianbenoy, we need to find a way to modify it a bit so that we can run it on a smaller machine. Ideas to try: filter the dataset artificially, try fewer features (it’s 5000, try 2500 by default), and check if there is a way to use some optimized arrays.
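
Two of these ideas can be sketched together: subsample the rows, and keep the lower-cased text as ordinary Python str objects instead of a fixed-width array from astype('U'), which pads every row to the length of the longest one. This is only a sketch under assumptions: the function name, the max_rows and seed parameters, and the toy data are hypothetical and are not part of the tutorial code.

```python
import numpy as np


def lowercase_texts(texts, max_rows=None, seed=0):
    """Lower-case texts as Python str objects, optionally on a random subsample.

    An object array stores one reference per row, so a single very long
    text does not inflate the storage of every other row.
    """
    arr = np.asarray(texts, dtype=object)
    if max_rows is not None and max_rows < len(arr):
        # Artificially filter the dataset: keep a reproducible random subset,
        # preserving the original row order.
        rng = np.random.default_rng(seed)
        idx = np.sort(rng.choice(len(arr), size=max_rows, replace=False))
        arr = arr[idx]
    # str() turns missing values (e.g. NaN) into text, as astype('U') would.
    return np.array([str(t).lower() for t in arr], dtype=object)


# Hypothetical toy data standing in for the text column.
toy = ["Hello World", "x" * 1000, "DVC", "Pipeline"]
full = lowercase_texts(toy)
subset = lowercase_texts(toy, max_rows=2)
print(full.dtype, len(full), len(subset))
```

scikit-learn's CountVectorizer accepts any iterable of strings, so an object array like this should be usable as its input; whether that is enough to fit the tutorial on a smaller machine has not been verified here.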

Hi @mexeniz !

Looks like you are running out of memory 🙁 As opposed to our get-started guide, our tutorial has some beefy requirements on RAM. Have you tried get-started already? https://dvc.org/doc/get-started In essence, it is a simplified tutorial.

The tutorial got absorbed into get started. Closing this. For get started, we have a separate ticket for this.

As @shcheklein suggested, I tried reducing the number of features in CountVectorizer to 2500, 1000, 100, 50, and 1, and all of them gave a memory error.

With 12 GB of RAM, I am still getting a memory error. @shcheklein