astropy: Memory leak in Table indices

Description

Repeatedly accessing an indexed Table causes memory use to grow in an unexpected and undesired way. In a real-world application on a large table this was causing memory use to exceed 18 GB. Removing the table index and repeating the access code kept memory use below 1 GB. We used memray and saw memory climbing continuously during a loop that repeatedly accessed elements of an indexed table.

Expected behavior

Memory use should remain approximately constant after the first access.

How to Reproduce

The following should reproduce the problem. You can use a package like memray to monitor memory, or just watch a process monitor for the memory use of the Python process. For me this starts at about 180 MB of memory after the first table t is created. After running this, memory use is around 1 GB, while I would expect something under 400 MB.

import numpy as np
from astropy.table import MaskedColumn
from astropy.table.table_helpers import simple_table
from astropy.time import Time
from tqdm import tqdm

size = 250000
t = simple_table(size=size, cols=26)
idxs = Time(np.random.randint(0, size // 20, size=size), format="cxcsec").isot
t["idx"] = MaskedColumn(idxs)  # THIS IS THE PROBLEM
t.add_index(["idx"])

idxs = np.random.choice(t["idx"], size=100, replace=False)
for idx in tqdm(idxs):
    star_obs = t[t["idx"] == idx]
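(Not part of the original report: a minimal sketch of one way to watch resident memory from inside the loop, using the third-party psutil package.)

import psutil  # pip install psutil

proc = psutil.Process()  # current Python process
for i, idx in enumerate(tqdm(idxs)):
    star_obs = t[t["idx"] == idx]
    if i % 10 == 0:
        rss_mb = proc.memory_info().rss / 1e6  # resident set size in MB
        print(f"iteration {i}: RSS = {rss_mb:.0f} MB")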

Versions

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import astropy; print("astropy", astropy.__version__)
import numpy; print("Numpy", numpy.__version__)
import erfa; print("pyerfa", erfa.__version__)
try:
    import scipy
    print("Scipy", scipy.__version__)
except ImportError:
    print("Scipy not installed")
try:
    import matplotlib
    print("Matplotlib", matplotlib.__version__)
except ImportError:
    print("Matplotlib not installed")
macOS-14.2.1-x86_64-i386-64bit
Python 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:27:35) [Clang 14.0.6 ]
astropy 5.3.1
Numpy 1.23.5
pyerfa 2.0.0.1
Scipy 1.10.0
Matplotlib 3.6.3

About this issue

  • State: open
  • Created 4 months ago
  • Comments: 24 (22 by maintainers)

Most upvoted comments

👋 Hey, memray author and Python core dev here. I ran your example with native symbols and debug info and this is what I get: [screenshot (2024-02-27): memray flamegraph with native frames]

So it looks like most of the memory (2.1 GB) is allocated in astropy/table/column.py:529. You can easily get these results yourself using Docker:

$ docker run  --platform linux/amd64 -v $PWD:/src  --rm -it ubuntu:latest bash
$ apt-get update && apt-get install -y python3-numpy debuginfod python3-pip
$ python3 -m pip install astropy memray
$ export DEBUGINFOD_URLS="https://debuginfod.ubuntu.com" 
$ python3 -m memray run -o output.bin --trace-python-allocators --native example.py
$ python3 -m memray flamegraph output.bin --leaks

If you want to give the refcycle theory a go, you can use https://docs.python.org/3/library/gc.html#gc.set_debug with https://docs.python.org/3/library/gc.html#gc.DEBUG_LEAK to confirm which objects are stuck in cycles.
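(For anyone wanting to try that suggestion, a minimal sketch of what it looks like; gc.DEBUG_LEAK makes the collector report unreachable objects and keep them in gc.garbage for inspection.)

import gc

gc.set_debug(gc.DEBUG_LEAK)  # DEBUG_COLLECTABLE | DEBUG_UNCOLLECTABLE | DEBUG_SAVEALL

# ... run the indexed-table access loop from the reproducer here ...

gc.collect()
print(len(gc.garbage), "objects found in reference cycles")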

@neutrinoceros - assigning a MaskedArray to a table gives a MaskedColumn - so no joy there…
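(For context, a small hedged illustration of that behaviour, with hypothetical names:)

import numpy as np
from astropy.table import Table

t2 = Table({"a": [1, 2, 3]})
t2["b"] = np.ma.MaskedArray([1.0, 2.0, 3.0], mask=[False, True, False])
print(type(t2["b"]))  # expected to print astropy.table.column.MaskedColumn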

Ohhhh, this produces a beautiful leak. [screenshot (2024-02-26): memory profile showing steady growth]

It’s been seen in the wild on both Linux and Mac.

@neutrinoceros - one thing I just thought about: you might try using the SCEngine sorting engine (instead of the default sorted array) and see if the leak persists. That might help localize the problem. See: https://docs.astropy.org/en/stable/table/indexing.html#engines
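(A hedged sketch of what that would look like with the reproducer above; SCEngine requires the sortedcontainers package, and the import path follows the astropy.table API reference.)

from astropy.table import SCEngine  # needs sortedcontainers installed

# drop the default SortedArray-backed index and rebuild it with SCEngine
t.remove_indices("idx")
t.add_index(["idx"], engine=SCEngine)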

I can reproduce this (on macOS too), so that's a start. First, I attempted to run garbage collection every 10 iterations of the loop (just because it's easy to test): no change, so it seems safe to conclude that the problem isn't trivial (though unreachable reference cycles may still be generated, so garbage collection isn't completely out of the picture).

Memray indeed helps visualise a slow but steady growth in resident memory, while the heap size stays flat. I found that with memray's default behaviour you don't get much more detail (most of the allocations are not traced). Enabling the --native flag makes it possible to trace allocations from C/C++ extensions, which seems to be what we want here; however, there are a number of known limitations for this on macOS:

quoting https://bloomberg.github.io/memray/native_mode.html

For the best native mode experience, we recommend running your program on Linux using an interpreter and libraries built with as much debugging information as possible.

in general native mode symbolification will not work well in macOS.

While trying to use it, I was also unfortunate enough to stumble upon what now looks like a CPython bug (reported and discussed at https://github.com/bloomberg/memray/issues/553). Switching to Python 3.12.2 resolved the problem, so I'm now able to get a first view of native allocations. Here's the script I'm using (with t.py containing the MWE from @taldcroft):

# t.sh
set -euxo pipefail
rm report.bin memray-flamegraph-report.html || true
python -m memray run -o report.bin --native t.py
python -m memray flamegraph report.bin
open memray-flamegraph-report.html

This took me long enough to figure out, which is why I’m reporting at an early stage. I will now try to actually inspect the profile and see if it contains enough information to find the bug (or get a sense of where to look more closely).

If that doesn't suffice, running with CPython + numpy + astropy all compiled with debug symbols would be necessary; I know that for numpy this is significantly simpler on Linux (macOS is supposed to be supported too, but I could never get anything out of it). I was planning to set up a Linux VM at some point; this may be the excuse I've been waiting for. Before I do that, though, I want to ask @MridulS if he happens to be in a better starting position to try this.