joblib: "timeout or by a memory leak.", UserWarning
System and versions
macOS 10.14.4, Python 3.6, joblib 0.12.5
Usage
I have 32 big batches, and each batch contains an unequal number of tasks. I use joblib inside a loop over the batches, and within each batch the tasks are run in parallel. The warning appears when I run the loop; when I run the batches one by one, there is no such warning.
pseudo code:
from joblib import Parallel, delayed

results_list = []
for batch in batch_list:
    results = Parallel(n_jobs=num_cores, verbose=1)(
        delayed(func)(x[i]) for i in batch
    )
    results_list.append(results)
Error analysis
/Users/x/miniconda3/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py:700: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
"timeout or by a memory leak.", UserWarning
- I have tried gc.collect() to release memory, but it does not help.
- I checked memory usage: the process uses only about 10% of the total memory available.
- When I reduce the number of workers (from 20 to 10), more warnings appear.
- During the loop, the warning appears more often for batches that contain more tasks.
- Based on my validation of several cases, even with this warning the results are identical to those generated without it (when I run the batches individually).
About this issue
- State: open
- Created 5 years ago
- Reactions: 1
- Comments: 25 (6 by maintainers)
Thanks a lot @ogrisel for explaining/clarifying the way worker processes are restarted.
I agree, the backend (loky) should assume that user provided code does not leak memory instead of emitting a warning on “correct” implementations.
This happens to me often when my workers need to run memory-intensive tasks; the typical case is when each worker needs to fit a tensorflow/keras neural network or a catboost model that hits the RAM hard. TensorFlow in particular seems reluctant to give memory back after it is done. This has never caused any issue for me apart from the warning message.
I also would be glad if this could be addressed in a new joblib release.
According to https://github.com/joblib/joblib/issues/883#issuecomment-1521521711 any benign but sufficiently large memory allocation by a worker process will trigger the warning and the subsequent restart of that process.
Many (unnecessary) restarts of the worker processes will invariably slow down the parallel execution, even for (worker) code that is not leaking any memory.
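As a rough illustration (this snippet is not from the thread), a worker function whose only "offense" is a large but short-lived allocation can be enough to exceed the default threshold; the memory is released when the task returns, yet the worker may still be restarted and the warning printed.

```python
# Illustrative sketch, not taken from the thread: each task allocates
# roughly 400 MB (above the ~300 MB default threshold) and frees it on
# return. With the default loky backend this benign pattern can still
# trigger the "worker stopped" warning and a worker restart.
import numpy as np
from joblib import Parallel, delayed

def allocate_and_release(i):
    buf = np.ones(int(5e7))          # ~400 MB of float64, freed on return
    return float(buf.sum()) + i

out = Parallel(n_jobs=2, verbose=1)(
    delayed(allocate_and_release)(i) for i in range(8)
)
```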
@ogrisel I fully agree with you on this point.
One workaround that I found is to monkey-patch the _MAX_MEMORY_LEAK_SIZE variable before executing parallel code.
Thank you for the explanation (and the amazing work developing joblib)!
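A minimal sketch of the monkey-patch workaround described above, assuming the loky backend vendored inside joblib and its private _MAX_MEMORY_LEAK_SIZE constant (an internal detail that may move or change between releases; whether the patched value reaches the worker processes can also depend on the joblib version and process start method):

```python
from joblib.externals.loky import process_executor

# Raise the per-worker memory-growth threshold (in bytes) so that large
# but benign allocations no longer trip the check. The default is about
# 300 MB in the joblib versions discussed here.
process_executor._MAX_MEMORY_LEAK_SIZE = int(3e9)  # e.g. 3 GB
```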
I encountered the same problem when each subprocess compiles numba code. Either disabling the memory-leak check as @ogrisel suggested, or at least allowing the user to set _MAX_MEMORY_LEAK_SIZE, would be really convenient here.
Note that the threshold of 300 MB is more than twice the size of a Python process after importing a bunch of heavy libraries with many compiled extensions (e.g. numpy, scipy and such).
But importing torch on top of the previous libraries almost reaches the limit:
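As a rough, hypothetical illustration of the point above (not taken from the original comment), the resident memory of a fresh process after those imports can be checked with psutil; exact figures depend on library versions and platform:

```python
import os

import psutil
import numpy   # noqa: F401  heavy compiled extensions imported only to
import scipy   # noqa: F401  illustrate the baseline memory footprint
import torch   # noqa: F401

rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
print(f"RSS after imports: {rss_mb:.0f} MB")  # can approach the ~300 MB threshold
```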
When I use the ‘multiprocessing’ backend, this warning seems to go away.
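For reference, a sketch of switching the backend as described in the comment above; backend="multiprocessing" uses the standard library multiprocessing pool instead of loky:

```python
from joblib import Parallel, delayed

def work(i):
    return i * i

if __name__ == "__main__":
    # The multiprocessing backend does not apply loky's per-worker
    # memory-growth check, which matches the observation above that the
    # warning disappears with this backend.
    results = Parallel(n_jobs=4, backend="multiprocessing", verbose=1)(
        delayed(work)(i) for i in range(100)
    )
```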
Any update on the status of this issue? It has been open for some time. Thank you.
Hi, thanks a lot for making and maintaining joblib. I am also seeing this warning on code that does not leak memory (at least not to the best of my knowledge). This is probably due to the (somewhat arbitrary) threshold defined in process_executor.py, specifically the _MAX_MEMORY_LEAK_SIZE constant.
I assume this threshold is meant to safeguard against subprocesses/workers that actually leak memory. Unfortunately, it also triggers a warning on "correct" code that allocates a (somewhat large) chunk of memory inside the worker process. This allocation is often outside the user's control, e.g. because it happens in third-party tools, and receiving a warning in such a situation can be confusing/misleading. Also, restarting a worker process may in principle stall the (concurrent) execution of jobs, because worker processes are killed and started over and over again.
Please forgive me if my understanding here is wrong or incomplete; I'd be glad for any suggestions/explanations that help clear this up.
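Not mentioned in the thread, but as a related workaround for the confusing message: the warning itself can be silenced with the standard warnings machinery if the worker code is known not to leak. Note this only hides the message; workers are still restarted when the threshold is exceeded.

```python
import warnings

# Silence only this specific UserWarning emitted by the loky backend.
warnings.filterwarnings(
    "ignore",
    message="A worker stopped while some jobs were given to the executor",
    category=UserWarning,
)
```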
I have managed to recreate the issue repeatedly on a Linux machine.
Say we have 91 parquet files, each with a partition column numbered 1-100. Each file is 1-10 GB in size (mean 3 GB, std 2 GB). Now, from inside a Jupyter notebook, run the following:
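The exact code from that comment is not included here; the sketch below is only a hypothetical reconstruction of the kind of workload described (the file paths, the partition column name, and the aggregation are made up for illustration).

```python
import pandas as pd
from joblib import Parallel, delayed

paths = [f"data/part_{i:02d}.parquet" for i in range(91)]   # hypothetical paths

def load_and_summarize(path):
    df = pd.read_parquet(path)                  # each file is 1-10 GB
    return df.groupby("partition").size()       # hypothetical column name

summaries = Parallel(n_jobs=8, verbose=1)(
    delayed(load_and_summarize)(p) for p in paths
)
```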
This is expected. It’s just a warning stating that some workers have been preemptively restarted (between 2 tasks) because their memory usage grew too much.
Can you produce a standalone minimal reproduction case?