bottleneck: Leaks memory when input is not a numpy array

If you run the following program, you can see that nansum leaks all the memory it is given when passed a Pandas object. If it is instead passed the ndarray underlying the Pandas object, there is no leak:

import gc
import os

import numpy as np
import pandas as pd
import psutil

import bottleneck

def f():
    x = np.zeros(10*1024*1024, dtype='f4')

    # Leaks 40MB/iteration
    bottleneck.nansum(pd.Series(x))
    # No leak:
    #bottleneck.nansum(x)

process = psutil.Process(os.getpid())

def _get_usage():
    gc.collect()
    # memory_info().private is Windows-specific; use .rss on other platforms
    return process.memory_info().private / (1024*1024)

last_usage = _get_usage()
print(last_usage)

for _ in range(10):
    f()
    usage = _get_usage()
    print(usage - last_usage)
    last_usage = usage

This affects not just nansum but apparently all of the reduction functions (with or without an axis specified), and at least some other functions such as move_max.

I’m not completely sure why this happens, but maybe it’s because PyArray_FROM_O allocates a new array in this case, and that array’s reference count is never decremented? https://github.com/kwgoodman/bottleneck/blob/master/bottleneck/src/reduce_template.c#L1237
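The ownership rule at play can be illustrated from Python with np.asarray, which mirrors PyArray_FROM_O: an ndarray input is returned unchanged, while any other input forces allocation of a fresh array whose reference the C caller then owns. This is only a sketch of the suspected mechanism, not the actual fix in the C template:

```python
import numpy as np

# np.asarray mirrors PyArray_FROM_O: an ndarray input is returned as-is,
# so no new reference needs releasing on that path.
x = np.zeros(10, dtype='f4')
assert np.asarray(x) is x

# A non-ndarray input (e.g. a list, or a pd.Series) forces allocation of
# a new array; in C, the caller owns that new reference and must
# Py_DECREF it, which is the step that appears to be missing here.
y = np.asarray([1.0, 2.0, 3.0])
assert y is not x
```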

I’m using Bottleneck 1.2.1 with Pandas 0.23.1. sys.version is 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)].

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 15 (3 by maintainers)

Most upvoted comments

OK, I merged the memory leak fix into master.