tensorflow: Memory Leak in tf.data.Dataset.from_generator

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.3 LTS
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version: v2.1.0-rc2-17-ge5bf8de 2.1.0
  • Python version: Python 3.6.6
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: CUDA Version: 10.1 cudnn-10.1
  • GPU model and memory: TITAN RTX 24190MiB

Describe the current behavior
tf.data.Dataset.from_generator leaks memory on every call, even when the dataset is deleted and gc.collect() is run afterwards.

Describe the expected behavior
Memory should be released once no references to the dataset remain.

Standalone code to reproduce the issue

import gc
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
import tensorflow as tf
import tracemalloc
import linecache


def display_top(snapshot, key_type='lineno', limit=3):
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)

    print("Top %s lines" % limit)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        # replace "/path/to/module/file.py" with "module/file.py"
        filename = os.sep.join(frame.filename.split(os.sep)[-2:])
        print("#%s: %s:%s: %.1f KiB"
              % (index, filename, frame.lineno, stat.size / 1024))
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print('    %s' % line)

    other = top_stats[limit:]
    if other:
        size = sum(stat.size for stat in other)
        print("%s other: %.1f KiB" % (len(other), size / 1024))
    total = sum(stat.size for stat in top_stats)
    print("Total allocated size: %.1f KiB" % (total / 1024))


def generator():
    yield tf.zeros(2, dtype=tf.int32)  # 1-D int32 tensor, matching output_types/output_shapes below


tracemalloc.start()
for i in range(1000):
    dataset = tf.data.Dataset.from_generator(generator, output_types=tf.int32, output_shapes=[None])
    del dataset
    gc.collect()
    snapshot = tracemalloc.take_snapshot()
    display_top(snapshot)

Other info / logs

Top 3 lines
#1: python3.6/_weakrefset.py:84: 159.5 KiB
    self.data.add(ref(item, self._remove))
#2: python3.6/_weakrefset.py:37: 38.2 KiB
    self.data = set()
#3: python3.6/_weakrefset.py:48: 32.4 KiB
    self._iterating = set()
461 other: 306.4 KiB
Total allocated size: 536.4 KiB
Top 3 lines
#1: python3.6/_weakrefset.py:84: 159.5 KiB
    self.data.add(ref(item, self._remove))
#2: python3.6/_weakrefset.py:37: 38.2 KiB
    self.data = set()
#3: python3.6/_weakrefset.py:48: 32.4 KiB
    self._iterating = set()
516 other: 343.1 KiB
Total allocated size: 573.1 KiB

...

Top 3 lines
#1: python3.6/weakref.py:335: 257.8 KiB
    self = ref.__new__(type, ob, callback)
#2: debug/tf_dataset_memory_leak.py:45: 189.7 KiB
    dataset = tf.data.Dataset.from_generator(generator, output_types=tf.int32, output_shapes=[None])
#3: ops/script_ops.py:257: 174.7 KiB
    return "pyfunc_%d" % uid
519 other: 2423.3 KiB
Total allocated size: 3045.5 KiB

The snippet above leaks about 3 MB over 1000 calls. In some real projects it can leak as much as 5 GB, and the usage keeps increasing.
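
For larger jobs where the tracemalloc report gets noisy, the growth can also be confirmed at the process level by logging the resident set size over time. Below is a minimal sketch, assuming the third-party psutil package is installed (it is not used in the report above):

import os
import psutil

process = psutil.Process(os.getpid())

def log_rss(step):
    # Resident set size in MiB; in an affected run this number keeps
    # climbing even though every dataset is deleted and gc.collect() runs.
    rss_mib = process.memory_info().rss / (1024 * 1024)
    print("step %d: RSS %.1f MiB" % (step, rss_mib))

Calling log_rss(i) right after gc.collect() in the loop above makes the per-iteration growth visible without any tracemalloc filtering.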

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 15
  • Comments: 29 (11 by maintainers)

Most upvoted comments

Four months have passed without any progress. I can do nothing but rewrite my entire project in PyTorch. I really love Keras and hope that one day the TensorFlow user experience will match it.

Found a workaround: del together with gc.collect() works fine for me:

images_list = np.vstack(images_to_load)
pred_result = self.model.predict(images_list, batch_size=10)

del pred_result
gc.collect()

@luvwinnie Actually, your snippet could be a different issue; this one tracks memory leaks when tf.data.Dataset.from_generator is called repeatedly. Could you file another issue, ideally with a reproducible Colab example like this one?
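
Since the leak tracked here is triggered by calling tf.data.Dataset.from_generator repeatedly, one mitigation (a sketch based on this thread, not a fix confirmed by the maintainers) is to build the dataset once and re-iterate it across epochs instead of reconstructing it inside the loop:

import tensorflow as tf

def generator():
    # Hypothetical data source; substitute your own records here.
    for i in range(4):
        yield tf.constant([i, i + 1], dtype=tf.int32)

# Construct the dataset a single time, outside the training loop.
dataset = tf.data.Dataset.from_generator(
    generator, output_types=tf.int32, output_shapes=[None])

for epoch in range(3):
    # Re-iterating the same dataset object re-runs the generator but avoids
    # the repeated from_generator calls that leak in this report.
    for batch in dataset:
        pass  # e.g. train_step(batch)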