joblib: "timeout or by a memory leak.", UserWarning

System and versions

  • macOS 10.14.4
  • Python 3.6
  • joblib 0.12.5

Usage

I have 32 big batches, and each batch contains an unequal number of work items. I use joblib inside a loop over the big batches, and the joblib call runs the work items within one batch in parallel. The warnings appear when I run the loop; however, when I run the batches one by one, there is no such warning.

pseudo code:

from joblib import Parallel, delayed

results_list = []
for batch in batch_list:
    results = Parallel(n_jobs=num_cores, verbose=1)(delayed(func)(x[i]) for i in batch)
    results_list.append(results)

Error analysis

/Users/x/miniconda3/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py:700: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
  • I have tried gc.collect() to release memory (see the sketch after this list), but it does not help.
  • I checked memory usage: the process uses only about 10% of the total memory available.
  • When I reduce the number of workers (from 20 to 10), more warnings appear.
  • During the loop, the warning shows up more often for batches that contain more work items.
  • Based on validation of some cases, the results produced with this warning are the same as the ones produced without it (when I run the batches individually).
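
For reference, a minimal sketch (with a placeholder workload) of what calling gc.collect() inside the worker function looks like; it mirrors the pseudo code above and, as noted, did not make the warning go away:

import gc

import numpy as np

def func(item):
    # Placeholder for the real, memory-hungry work done for one item.
    buf = np.zeros((5000, 5000))
    result = float(buf.sum()) + item
    del buf
    gc.collect()  # explicitly release collectable objects before returning
    return result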

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 1
  • Comments: 25 (6 by maintainers)

Most upvoted comments

Thanks a lot @ogrisel for explaining/clarifying the way worker processes are restarted.

I agree: the backend (loky) should assume that user-provided code does not leak memory instead of emitting a warning on “correct” implementations.

This happens to me often when my workers run memory-intensive tasks; the typical case is when each worker needs to fit a tensorflow/keras neural network or a catboost model that hits the RAM hard. TensorFlow in particular seems to be a bit reluctant to give memory back after it is done. This has never caused any issue for me apart from the warning message.

I also would be glad if this could be addressed in a new joblib release.
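
For illustration, a minimal sketch of the kind of per-worker training job described above (fit_one, the synthetic data, and the tiny Keras model are placeholders; assumes tensorflow is installed). Nothing leaks here, but each call makes large, legitimate allocations inside the worker:

import numpy as np
from joblib import Parallel, delayed

def fit_one(seed):
    # Import TensorFlow inside the worker so each process carries its own footprint.
    import tensorflow as tf

    rng = np.random.default_rng(seed)
    X = rng.normal(size=(10_000, 100)).astype("float32")
    y = rng.normal(size=(10_000,)).astype("float32")

    # Tiny placeholder model standing in for a real, memory-hungry network.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=1, batch_size=256, verbose=0)
    return float(model.evaluate(X, y, verbose=0))

if __name__ == "__main__":
    scores = Parallel(n_jobs=4, verbose=1)(delayed(fit_one)(s) for s in range(8))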

According to https://github.com/joblib/joblib/issues/883#issuecomment-1521521711, any benign but sufficiently large memory allocation by a worker process will trigger the warning and the subsequent restart of that process.

Many (unnecessary) restarts of the worker processes will invariably slow down the parallel execution, even for (worker) code that is not leaking any memory.

@ogrisel I fully agree with you on

So maybe loky should just assume that the user provided code does not leak memory and not attempt to monitor memory usage of the workers. If it does leak (and the workers are killed by the OS) it would be up to the user to fix their code.

One workaround that I found is to monkey-patch the _MAX_MEMORY_LEAK_SIZE variable before executing the parallel code:

from joblib.externals.loky import process_executor

# Raise the threshold (default int(3e8), i.e. ~300 MB over the reference
# process size) so that large but benign allocations in the workers are
# no longer treated as leaks.
process_executor._MAX_MEMORY_LEAK_SIZE = int(3e10)

from joblib import Parallel, delayed
results = Parallel(...)(...)  # run the parallel code as usual
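
Note that _MAX_MEMORY_LEAK_SIZE is a private variable of a module vendored inside joblib, so this workaround is not covered by any public API and may stop working in a future joblib release.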

Thank you for the explanation (and the amazing work with developing joblib)!

I encountered the same problem when each subprocess compiles numba code. Either disabling the memory leak check as @ogrisel suggested, or at least allowing the user to set _MAX_MEMORY_LEAK_SIZE, would be really convenient here.

Note that the threshold of 300 MB is more than twice the size of a Python process after importing a bunch of heavy libraries with many compiled extensions (e.g. numpy, scipy and such).

>>> import psutil
>>> p = psutil.Process()
>>> p.memory_info().rss / 1e6
57.638912

>>> import numpy, scipy, pandas, pyarrow
>>> p.memory_info().rss / 1e6
136.347648

But importing torch on top of the previous libraries almost reaches the limit:

>>> import torch
>>> p.memory_info().rss / 1e6
265.601024

When I use ‘multiprocessing’ as the backend, this warning seems to go away.
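
For reference, a minimal sketch of selecting that backend (square is a placeholder task; the memory-growth check discussed here lives in loky’s process_executor, so other backends are presumably unaffected):

from joblib import Parallel, delayed, parallel_backend

def square(x):
    return x * x

if __name__ == "__main__":
    # Pass the backend directly to Parallel...
    r1 = Parallel(n_jobs=4, backend="multiprocessing")(delayed(square)(i) for i in range(100))

    # ...or select it for a whole block of code.
    with parallel_backend("multiprocessing"):
        r2 = Parallel(n_jobs=4)(delayed(square)(i) for i in range(100))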

Any update on the status of this issue? It has been open for some time. Thank you.

Hi, thanks a lot for making and maintaining joblib. I am also seeing this warning on code that doesn’t leak memory (at least not to the best of my knowledge).

import time
import numpy as np
import joblib

def job_():
    time.sleep(0.2)
    # ~392 MB temporary allocation (7000 * 7000 * 8 bytes), above the 3e8-byte threshold.
    X = np.zeros((7000, 7000))


if __name__ == "__main__":
    with joblib.parallel_backend("loky"):
        with joblib.Parallel(verbose=51) as parallel:
            parallel(
                joblib.delayed(job_)() for _ in range(20)
            )

This is probably due to the (somewhat arbitrary) threshold defined in process_executor.py, specifically

# Number of bytes of memory usage allowed over the reference process size.
_MAX_MEMORY_LEAK_SIZE = int(3e8)

I assume this threshold is meant to safeguard against subprocesses/workers that actually leak memory. Unfortunately, it also triggers a warning on “correct” code that allocates a (somewhat large) chunk of memory within the worker process. Such an allocation is often outside the user’s control, e.g. because it happens in third-party tools, and receiving a warning in that situation can be confusing/misleading. Also, restarting worker processes may in principle slow down the (concurrent) execution of jobs, because workers are killed and started over and over again.

Please forgive me if my understanding here is wrong or incomplete; I’d be glad for any suggestions or explanations that help clear this up.
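
To make that concrete, here is a hypothetical, simplified sketch of the kind of memory-growth check such a threshold implies; it is only an illustration built around the snippet quoted above, not loky’s actual implementation (psutil and all names below are assumptions):

import gc
import os

import psutil

# Simplified stand-in for the threshold quoted above.
_MAX_MEMORY_LEAK_SIZE = int(3e8)  # bytes allowed over the reference process size

def exceeds_leak_threshold(reference_rss: int) -> bool:
    """Return True if this process has grown by more than the threshold."""
    rss = psutil.Process(os.getpid()).memory_info().rss
    if rss - reference_rss <= _MAX_MEMORY_LEAK_SIZE:
        return False
    # Give the garbage collector a chance before flagging the process.
    gc.collect()
    rss = psutil.Process(os.getpid()).memory_info().rss
    return rss - reference_rss > _MAX_MEMORY_LEAK_SIZE

if __name__ == "__main__":
    # Record the reference size once, run some work, then check for growth.
    reference = psutil.Process(os.getpid()).memory_info().rss
    big = bytearray(500_000_000)  # a large but perfectly legitimate allocation
    print(exceeds_leak_threshold(reference))  # True: the "leak" heuristic fires anyway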

I have managed to recreate the issue repeatedly on a Linux machine.

Say we have 91 parquet files, each with a partition column numbered 1-100. Each file is 1-10 GB (mean 3 GB, std 2 GB). Now, from inside a Jupyter notebook, run the following:

from pathlib import Path

import joblib
import pandas as pd

PARTITION = 'partition'  # name of the partition column in the parquet files

def parallel_map(callable, iterable, n_jobs=2, progress=False, parallel=True, verbose=0):
    return joblib.Parallel(n_jobs=n_jobs, verbose=verbose)(joblib.delayed(callable)(arg) for arg in iterable)

def partition_stats(file_path: Path) -> dict:
    try:
        df = pd.read_parquet(file_path)
        counts = df[PARTITION].value_counts()
        return {
            'name': file_path.name,
            'n_partitions': counts.index.unique().size,
            'mean_per_part': counts.mean(),
            'std_per_part': counts.std(),
        }
    except Exception:
        # Fallback for files that cannot be read.
        return {
            'name': file_path.name,
            'n_partitions': 0,
            'mean_per_part': 0,
            'std_per_part': 0,
        }

file_paths = [ 
    # put the files here
]

df = pd.DataFrame(parallel_map(partition_stats, file_paths, n_jobs=10, progress=True))

Based on validation of some cases, the results produced with this warning are the same as the ones produced without it (when I run the batches individually).

This is expected. It’s just a warning stating that some workers have been preemptively restarted (between 2 tasks) because their memory usage grew too much.

Can you produce a stand-alone minimal reproduction case?