tensorflow: memory leak in tf.keras.Model.predict

https://stackoverflow.com/questions/64199384/tf-keras-model-predict-results-in-memory-leak

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary):
  • TensorFlow version (use command below):
  • Python version:
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:

  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior

Describe the expected behavior

Standalone code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 25 (13 by maintainers)


Most upvoted comments

@jvishnuvardhan Predict-in-a-loop is quite a recurrent issue; just a few weeks ago I triaged two of these tickets. How could we better expose this in the docs? /cc @lamberta @MarkDaoust

Lol, I’ve been fighting memory-leak problems in multiple TensorFlow services in production for years and have implemented various workarounds, such as watchers that check memory usage so our workers can be gracefully restarted before they OOM-crash mid-job, and adding tf.config.threading.set_inter_op_parallelism_threads(1); tf.config.threading.set_intra_op_parallelism_threads(1) to reduce the amount of leakage, etc.

Just yesterday I finally discovered this. Maybe one can prevent future users like me from wasting so much time/energy on this by adjusting “What’s the difference between Model methods predict() and __call__()?” in the Keras FAQ, which currently recommends using the memory-leaking way of doing predictions:

You should use model(x) when you need to retrieve the gradients of the model call, and you should use predict() if you just need the output value. In other words, always use predict() unless you’re in the middle of writing a low-level gradient descent loop (as we are now).

🙂
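To make the contrast concrete, here is a minimal sketch of the two patterns the FAQ is describing (the model and data are stand-ins, not taken from the original reports):

import numpy as np
import tensorflow as tf

# Stand-in single-output model used only for illustration.
inputs = tf.keras.Input(shape=(16,))
model = tf.keras.Model(inputs, tf.keras.layers.Dense(1)(inputs))

samples = [np.random.rand(1, 16).astype(np.float32) for _ in range(1000)]

# The pattern the FAQ recommends, which this thread reports leaking when
# repeated per sample in a long-running loop:
# outputs = [model.predict(s) for s in samples]

# Calling the model directly avoids predict's per-call overhead:
outputs = [model(tf.convert_to_tensor(s), training=False).numpy() for s in samples]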

My attempts:

  • TF 2.4.1: leak
  • TF 2.7.1: leak

How did the problem occur for me?

def predict(self, img: np.ndarray) -> np.ndarray:
    # Leaks: Model.predict is invoked once per image, typically inside a loop.
    return self._model.predict(np.expand_dims(img, axis=0))

How did I solve it?

def predict(self, img: np.ndarray) -> np.ndarray:
    # Call the model directly on a tensor instead of using Model.predict.
    return self._model(tf.convert_to_tensor(np.expand_dims(img, axis=0)), training=False).numpy()

@plooney model.predict is a high-level API designed for batch prediction outside of any loops. It automatically wraps your model in a tf.function and maintains graph-based execution. This means that if there is any change in the input signature (shape and dtype) passed to that function (here model.predict), it traces multiple graphs instead of the single one you are expecting.

In your case, inImm is a numpy input, which is treated as a different signature each time you pass it to the tf.function-wrapped function inside a for loop. Providing inImm as a tensor, however, results in the same input signature, so there is a single graph to which the inputs are fed and from which results are obtained. In the numpy case there are 60 static graphs (which is not what you want), and with that many static graphs the memory grows on each iteration of the for loop.

After adding one line to your code, it no longer crashes. Please check the gist here. Thanks!

inImm = tf.convert_to_tensor(inImm)

Please read 1, 2, 3, and 4. These resources will help you more. Thanks!

Please close the issue if this was resolved for you. Thanks!
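For context, the full pattern being recommended here looks roughly like this (a minimal sketch; the model is a stand-in, and inImm follows the naming used in the original report):

import numpy as np
import tensorflow as tf

# Stand-in model; the original issue used the reporter's own network.
inputs = tf.keras.Input(shape=(128,))
model = tf.keras.Model(inputs, tf.keras.layers.Dense(10)(inputs))

inImm = np.random.rand(32, 128).astype(np.float32)

# Convert once, outside the loop, so every predict call sees the same tensor
# signature instead of a fresh numpy array.
inImm = tf.convert_to_tensor(inImm)

for _ in range(60):
    out = model.predict(inImm)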

In case this helps:

If the dataset can fit in memory, then the following functions can replace the call to model.predict:

from __future__ import annotations

from typing import Iterator

import numpy as np
import tensorflow as tf


def generate_batches(
    x: np.ndarray | tf.Tensor, batch_size: int = 32
) -> Iterator[np.ndarray | tf.Tensor]:
    """Generate batches of test data for inference.

    Args:
        x (np.ndarray | tf.Tensor):
            Full test data set.
        batch_size (int, default=32):
            Batch size.

    Yields:
        np.ndarray | tf.Tensor:
            Yielded batches of test data.
    """
    for index in range(0, x.shape[0], batch_size):
        yield x[index : index + batch_size]


def predict(
    model: tf.keras.Model,
    x: np.ndarray | tf.Tensor,
    batch_size: int = 32,
) -> np.ndarray:
    """Predict using generated batched of test data.

    - Used instead of model.predict() due to memory leaks.
    - https://github.com/tensorflow/tensorflow/issues/44711

    Args:
        model (tf.keras.Model):
            The model to use for prediction.
        x (np.ndarray | tf.Tensor):
            Full test data set.
        batch_size (int, default=32):
            Batch size.

    Returns:
        np.ndarray:
            Predictions on the test data.
    """
    y_batches = []
    for x_batch in generate_batches(x=x, batch_size=batch_size):
        y_batch = model(x_batch, training=False).numpy()
        y_batches.append(y_batch)

    return np.concatenate(y_batches)


# instead of
# y_pred = model.predict(x_test)

# use
y_pred = predict(model=model, x=x_test, batch_size=32)

Otherwise, if the dataset does not fit in memory, consider using tf.data:

def create_tf_dataset(
    data_split: str,
    x: np.ndarray,
    y: np.ndarray,
    batch_size: int,
    use_mixed_precision: bool,
) -> tf.data.Dataset:
    """Create a TensorFlow dataset.

    - Cache train data before shuffling for performance (consider full dataset size).
    - Shuffle train data to increase accuracy (not needed for validation or test data).
    - Batch train data after shuffling for unique batches at each epoch.
    - Cache test data after batching as batches can be the same between epochs.
    - End pipeline with prefetching for performance.
    
    Args:
        data_split (str):
            The data split to create the dataset for.
            Supported are "train", "validation", and "test".
        x (np.ndarray):
            The feature data.
        y (np.ndarray):
            The target data.
        batch_size (int):
            The batch size.
        use_mixed_precision (bool):
            Whether to use mixed precision.

    Raises:
        ValueError: If the data split is not supported.

    Returns:
        tf.data.Dataset:
            The TensorFlow dataset.
    """
    if data_split not in {"train", "validation", "test"}:
        raise ValueError(f"Invalid data split: {data_split}")

    if use_mixed_precision:
        tf.keras.mixed_precision.set_global_policy("mixed_float16")
        x = x.astype(np.float16)
        y = y.astype(np.float16)

    ds = tf.data.Dataset.from_tensor_slices((x, y))

    if data_split == "train":
        ds = ds.cache()
        set_random_seed(seed=RANDOM_SEED)
        ds = ds.shuffle(number_of_samples, seed=RANDOM_SEED)
        ds = ds.batch(batch_size)
    else:
        ds = ds.batch(batch_size)
        ds = ds.cache()

    ds = ds.prefetch(AUTOTUNE)

    return ds


# need to do this call separately on a machine with enough memory
ds_test = create_tf_dataset(
    data_split="test",
    x=x_test,
    y=y_test,
    batch_size=32,
    use_mixed_precision=True,
)

# then use it
y_pred = model.predict(ds_test)

So, random thoughts (I re-opened just to leave these)

  1. As mentioned above, predict shouldn’t be used in a loop; sad things happen if you do. Ideally, call the model directly with training=False (manually wrapping the model call in a tf.function if needed for performance reasons; see the sketch after this list).

  2. Numpy inputs to predict/fit/evaluate get converted to tf.data datasets and then iterated over. The specific conversion implementation currently in place ends up copying the data and is poorly suited to large inputs (it is prone to OOMs). There are a number of other GitHub issues related to this floating around, but we have so far been unable to prioritize this in core TensorFlow. If you need performant numpy input to Keras fit/evaluate/predict, my current recommendation is TensorFlow I/O’s numpy inputs: https://www.tensorflow.org/io/api_docs/python/tfio/experimental/IODataset#from_numpy, because they should be more performant and avoid excess memory copies.

  3. The fact that gc.collect fixes this makes me think something about the numpy conversion is also creating cyclical references that the Python gc doesn’t trigger for (the memory is consumed by the GPU, which isn’t tracked by the Python gc, so the gc fails to trigger because it thinks there is still plenty of memory). We’ve seen cyclical references like this cause issues elsewhere (e.g. when creating multiple models), but due to the two points above we don’t have the bandwidth right now to prioritize tracking down and fixing this specific one.
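A minimal sketch of point 1, wrapping the direct model call in a tf.function (the model and data here are stand-ins, not taken from the original report):

import numpy as np
import tensorflow as tf

# Stand-in model for illustration.
inputs = tf.keras.Input(shape=(64,))
model = tf.keras.Model(inputs, tf.keras.layers.Dense(4)(inputs))

# Wrap the direct call once and reuse the wrapped function inside the loop.
@tf.function
def infer(batch):
    return model(batch, training=False)

x = tf.convert_to_tensor(np.random.rand(32, 64).astype(np.float32))
for _ in range(100):
    y = infer(x).numpy()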

I had the same issue with .predict: running on about 50,000 inputs over several hours, I saw a leak of around 0.35 GB and traced it back to the .predict method. Replacing it with the __call__ method solved the memory leak but was about 50% slower.

Switching to the __call__ function significantly reduced the amount of leakage, but it still leaked about 70 bytes per __call__ in my environment. Finally, converting the Keras model to a bare TensorFlow graph seems to have eliminated the leakage in my environment:

model: tf.keras.Model  # trained Keras model (assumed to exist)
x: np.ndarray          # input data (assumed to exist)

graph = tf.function(model)
# When processing large data, add logic to split it into small batches (see the sketch below).
result = graph(tf.convert_to_tensor(x))
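A minimal sketch of that batching logic, assuming model and x as annotated above and an illustrative batch_size:

import numpy as np
import tensorflow as tf

graph = tf.function(model)

batch_size = 256  # illustrative value
results = []
for start in range(0, x.shape[0], batch_size):
    batch = tf.convert_to_tensor(x[start:start + batch_size])
    results.append(graph(batch).numpy())

y_pred = np.concatenate(results)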

@jvishnuvardhan thanks for the clear explanation. If this were calling a tf.function in a loop, that would be 100% the correct answer. But Model.predict manages some of this to avoid that problem (in general, Keras fit/evaluate/predict never require the user to convert inputs to tensors). It looks like something more complicated is happening.

The first two clues that suggest it are:

  1. It’s not printing the frequent retracing warning.
  2. You’re creating a single numpy array and passing it multiple times, and it still goes OOM. Except for constants, the caching logic is based on object identity, so re-using the same object should reuse the same function trace.

Investigating a little further, you can find that model.predict_function is the @tf.function that runs here. Inspecting it, both its ._list_all_concrete_functions() and .pretty_printed_concrete_signatures() show that there is only one graph, and that predict is handling the conversion of the numpy array to a Tensor.
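For reference, that inspection can be reproduced roughly as follows (assuming the model and x from the reproduction above; _list_all_concrete_functions is a private helper):

_ = model.predict(x)  # run predict once so predict_function is built and traced

fn = model.predict_function  # the tf.function that Model.predict runs
print(fn.pretty_printed_concrete_signatures())  # human-readable traced signatures
print(len(fn._list_all_concrete_functions()))   # number of traced graphs (private API)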

So I agree that this is leaking memory somewhere. But I’ve confirmed that it’s not the tf.function cache causing it.

@tomerk, you’re pretty familiar with this code, do you have any ideas?