joblib: "timeout or by a memory leak.", UserWarning

System and versions

  • macOS 10.14.4
  • Python 3.6
  • joblib 0.12.5

Usage

I have 32 big batches, and each batch contains an unequal number of work items. I use joblib inside a loop over the big batches, and the joblib call runs the work items within one batch in parallel. The warnings appear when I run the loop; however, when I run the batches one by one, there is no such warning.

pseudo code:

from joblib import Parallel, delayed

results_list = []
for batch in batch_list:
    results = Parallel(n_jobs=num_cores, verbose=1)(delayed(func)(x[i]) for i in batch)
    results_list.append(results)

Error analysis

/Users/x/miniconda3/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py:700: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
  • I have tried gc.collect() to release memory (see the sketch after this list), but it does not help.
  • I checked memory usage: the process uses only about 10% of the total memory available.
  • When I reduce the number of workers (from 20 to 10), more warnings appear.
  • During the loop, the warning shows up more often for batches that contain more work items.
  • Based on validation of some cases, the results produced with this warning are the same as the ones produced without it (when I run the batches individually).
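
For reference, a minimal sketch (with a placeholder workload) of what calling gc.collect() inside the worker function looks like; it mirrors the pseudo code above and, as noted, did not make the warning go away:

import gc

import numpy as np

def func(item):
    # Placeholder for the real, memory-hungry work done for one item.
    buf = np.zeros((5000, 5000))
    result = float(buf.sum()) + item
    del buf
    gc.collect()  # explicitly release collectable objects before returning
    return result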

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 1
  • Comments: 25 (6 by maintainers)

Most upvoted comments

Thanks a lot @ogrisel for explaining/clarifying the way worker processes are restarted.

I agree: the backend (loky) should assume that user-provided code does not leak memory instead of emitting a warning on “correct” implementations.

This happens to me often when my workers run memory-intensive tasks; the typical case is when each worker needs to fit a tensorflow/keras neural network or a catboost model that hits the RAM hard. TensorFlow in particular seems to be a bit reluctant to give memory back after it is done. This has never caused any issue for me apart from the warning message.

I also would be glad if this could be addressed in a new joblib release.
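
For illustration, a minimal sketch of the kind of per-worker training job described above (fit_one, the synthetic data, and the tiny Keras model are placeholders; assumes tensorflow is installed). Nothing leaks here, but each call makes large, legitimate allocations inside the worker:

import numpy as np
from joblib import Parallel, delayed

def fit_one(seed):
    # Import TensorFlow inside the worker so each process carries its own footprint.
    import tensorflow as tf

    rng = np.random.default_rng(seed)
    X = rng.normal(size=(10_000, 100)).astype("float32")
    y = rng.normal(size=(10_000,)).astype("float32")

    # Tiny placeholder model standing in for a real, memory-hungry network.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=1, batch_size=256, verbose=0)
    return float(model.evaluate(X, y, verbose=0))

if __name__ == "__main__":
    scores = Parallel(n_jobs=4, verbose=1)(delayed(fit_one)(s) for s in range(8))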

According to https://github.com/joblib/joblib/issues/883#issuecomment-1521521711, any benign but sufficiently large memory allocation by a worker process will trigger the warning and the subsequent restart of that process.

Many (unnecessary) restarts of the worker processes will invariably slow down the parallel execution, even for (worker) code that is not leaking any memory.

@ogrisel I fully agree with you on

So maybe loky should just assume that the user provided code does not leak memory and not attempt to monitor memory usage of the workers. If it does leak (and the workers are killed by the OS) it would be up to the user to fix their code.

One workaround that I found is to monkey-patch the _MAX_MEMORY_LEAK_SIZE variable before executing the parallel code:

from joblib.externals.loky import process_executor

# Raise the threshold (default int(3e8), i.e. ~300 MB over the reference
# process size) so that large but benign allocations in the workers are
# no longer treated as leaks.
process_executor._MAX_MEMORY_LEAK_SIZE = int(3e10)

from joblib import Parallel, delayed
results = Parallel(...)(...)  # run the parallel code as usual
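
Note that _MAX_MEMORY_LEAK_SIZE is a private variable of a module vendored inside joblib, so this workaround is not covered by any public API and may stop working in a future joblib release.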

Thank you for the explanation (and the amazing work with developing joblib)!

I encountered the same problem when each subprocess compiles numba code. Either disabling the memory leak check as @ogrisel suggested, or at least allowing the user to set _MAX_MEMORY_LEAK_SIZE, would be really convenient here.

Note that the threshold of 300 MB is more than twice the size of a Python process after importing a bunch of heavy libraries with many compiled extensions (e.g. numpy, scipy and such).

>>> import psutil
>>> p = psutil.Process()
>>> p.memory_info().rss / 1e6
57.638912

>>> import numpy, scipy, pandas, pyarrow
>>> p.memory_info().rss / 1e6
136.347648

But importing torch on top of the previous libraries almost reaches the limit:

>>> import torch
>>> p.memory_info().rss / 1e6
265.601024

When I use ‘multiprocessing’ as the backend, this warning seems to go away.
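
For reference, a minimal sketch of selecting that backend (square is a placeholder task; the memory-growth check discussed here lives in loky’s process_executor, so other backends are presumably unaffected):

from joblib import Parallel, delayed, parallel_backend

def square(x):
    return x * x

if __name__ == "__main__":
    # Pass the backend directly to Parallel...
    r1 = Parallel(n_jobs=4, backend="multiprocessing")(delayed(square)(i) for i in range(100))

    # ...or select it for a whole block of code.
    with parallel_backend("multiprocessing"):
        r2 = Parallel(n_jobs=4)(delayed(square)(i) for i in range(100))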

Any update on the status of this issue? It has been open for some time. Thank you.

Hi, thanks a lot for making and maintaining joblib. I am also seeing this warning on code that doesn’t leak memory (at least not to the best of my knowledge).

import time
import numpy as np
import joblib

def job_():
    time.sleep(0.2)
    # ~392 MB temporary allocation (7000 * 7000 * 8 bytes), above the 3e8-byte threshold.
    X = np.zeros((7000, 7000))


if __name__ == "__main__":
    with joblib.parallel_backend("loky"):
        with joblib.Parallel(verbose=51) as parallel:
            parallel(
                joblib.delayed(job_)() for _ in range(20)
            )

This is probably due to the (somewhat arbitrary) threshold defined in process_executor.py, specifically

# Number of bytes of memory usage allowed over the reference process size.
_MAX_MEMORY_LEAK_SIZE = int(3e8)

I assume this threshold is meant to safeguard against subprocesses/workers that actually leak memory. Unfortunately, it also triggers a warning on “correct” code that allocates a (somewhat large) chunk of memory within the worker process. Such an allocation is often outside the user’s control, e.g. because it happens in third-party tools, and receiving a warning in that situation can be confusing/misleading. Also, restarting worker processes may in principle slow down the (concurrent) execution of jobs, because workers are killed and started over and over again.

Please forgive me if my understanding here is wrong or incomplete; I’d be glad for any suggestions or explanations that help clear this up.
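
To make that concrete, here is a hypothetical, simplified sketch of the kind of memory-growth check such a threshold implies; it is only an illustration built around the snippet quoted above, not loky’s actual implementation (psutil and all names below are assumptions):

import gc
import os

import psutil

# Simplified stand-in for the threshold quoted above.
_MAX_MEMORY_LEAK_SIZE = int(3e8)  # bytes allowed over the reference process size

def exceeds_leak_threshold(reference_rss: int) -> bool:
    """Return True if this process has grown by more than the threshold."""
    rss = psutil.Process(os.getpid()).memory_info().rss
    if rss - reference_rss <= _MAX_MEMORY_LEAK_SIZE:
        return False
    # Give the garbage collector a chance before flagging the process.
    gc.collect()
    rss = psutil.Process(os.getpid()).memory_info().rss
    return rss - reference_rss > _MAX_MEMORY_LEAK_SIZE

if __name__ == "__main__":
    # Record the reference size once, run some work, then check for growth.
    reference = psutil.Process(os.getpid()).memory_info().rss
    big = bytearray(500_000_000)  # a large but perfectly legitimate allocation
    print(exceeds_leak_threshold(reference))  # True: the "leak" heuristic fires anyway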

I have managed to recreate the issue repeatedly on a Linux machine.

Say we have 91 parquet files, each with a partition column numbered 1-100. Each file is 1-10 GB (mean 3 GB, std 2 GB). Now, from inside a Jupyter notebook, run the following:

from pathlib import Path

import joblib
import pandas as pd

PARTITION = 'partition'  # name of the partition column in the parquet files

def parallel_map(callable, iterable, n_jobs=2, progress=False, parallel=True, verbose=0):
    return joblib.Parallel(n_jobs=n_jobs, verbose=verbose)(joblib.delayed(callable)(arg) for arg in iterable)

def partition_stats(file_path: Path) -> dict:
    try:
        df = pd.read_parquet(file_path)
        counts = df[PARTITION].value_counts()
        return {
            'name': file_path.name,
            'n_partitions': counts.index.unique().size,
            'mean_per_part': counts.mean(),
            'std_per_part': counts.std(),
        }
    except Exception:
        # Fallback for files that cannot be read.
        return {
            'name': file_path.name,
            'n_partitions': 0,
            'mean_per_part': 0,
            'std_per_part': 0,
        }

file_paths = [ 
    # put the files here
]

df = pd.DataFrame(parallel_map(partition_stats, file_paths, n_jobs=10, progress=True))

Based on validation of some cases, the results produced with this warning are the same as the ones produced without it (when I run the batches individually).

This is expected. It’s just a warning stating that some workers have been preemptively restarted (between 2 tasks) because their memory usage grew too much.

Can you produce a stand-alone minimal reproduction case?