astropy: astropy.io.fits not returning memory
First, I'm sorry, I know nothing about memory management. I am using macOS 10.14.5 on a MacBook Pro 15" (2018).
While I was reducing 150 FITS files (each ~70 MB), I realized the pipeline I used got ~2-3 times slower as time went on. I have experienced tens of such cases but couldn't find any explanation. Then, two weeks ago, just by chance, I looked at the memory usage and found that the memory was going "full".
I asked one of my colleagues about it; they have actually experienced it almost every day but found no solution other than running `gc.collect()` inside an `except OSError:` block. They use Windows and Linux, but my Mac does not seem to raise any `OSError`, so that did not help. Calling `gc.collect()` at the end of the for loop did not return memory either.
Then I found from the astropy FAQ that I should do `del hdu.data` and `gc.collect()`. But as I mentioned above, the following simple code did not seem to return any memory:
```python
from pathlib import Path
import gc

from astropy.io import fits

TOPPATH = Path('./reduced')
allfits = list(TOPPATH.glob("*.fits"))
allfits.sort()

for fpath in allfits:
    hdul = fits.open(fpath)
    data = hdul[0].data
    test = data + 1
    hdul.close()
    del data
    del test
    del hdul[0].data
    # gc.collect()  # tested putting it at many different places in the code
```
or following exactly what is in the FAQ:
```python
for fpath in allfits:
    with fits.open(fpath) as hdul:
        for hdu in hdul:
            data = hdu.data
            test = data + 1
            del data
            del hdu.data
            del test
```
When I run these, memory usage quickly increases from 40% to 100%. Because the above FAQ says "In some extreme cases files are opened and closed fast enough", I put `time.sleep(xxx)` in the middle of the for loop for testing, but it didn't help. I also tried resetting the variables to `None` (`test = None`, `data = None`, `hdul[0].data = None`) at the end of the loop, as well as using `CCDData.read` instead of `fits.open`, etc., but found no hope.
As a temporary measure, I therefore let my Mac do the job without any `gc.collect()` or `del`, because that worked, although it was painfully slow. Then, to compare two algorithms I developed, I had to run similar reduction pipelines on identical files, so I used two Jupyter Notebook kernels to run the two nearly identical pipelines at the same time. Since the number of files to be processed (and thus the memory usage) doubled, it got much slower as time went on. Then it suddenly gave a "too many open files" error in the middle of processing, and Jupyter Notebook, Jupyter Lab, etc. never worked again. I spent maybe 3-4 hours googling, got hopeless, and had to clean-install Anaconda after removing it. It took almost a full working day.
I don't know what happened with Jupyter, but this shock was strong enough to make me cautious. As I don't know much about computers, I can't imagine what the fundamental problem is, even after reading the above FAQ. I believe there should be a way to solve this issue, but it is maybe just beyond my ability. I hope a workaround is made explicitly available to astropy users, so that people like me can be more careful.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 31 (23 by maintainers)
It is not doing nothing. The "main" file handle is closed, but not the memmap one. If the references are deleted before `.close`, then the memmap is closed as well.

Oh sorry, I was a bit confused. `memmap=False` solved the original issue I was having; I was mixing it up with a different one. Sorry. Closing the issue.
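For reference, a minimal sketch of that workaround (disabling memory mapping when opening each file); the loop body is only illustrative, not the original pipeline:

```python
from pathlib import Path

from astropy.io import fits

allfits = sorted(Path('./reduced').glob("*.fits"))

for fpath in allfits:
    # memmap=False reads the data into ordinary memory instead of
    # memory-mapping the file, so closing the HDUList releases it.
    with fits.open(fpath, memmap=False) as hdul:
        data = hdul[0].data
        test = data + 1  # placeholder processing step
```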
@ysBach - The cached memory is not a good indicator. I don't know how it works on macOS; on Linux that memory is used by the kernel to cache files as long as memory is available (https://linux-mm.org/Low_On_Memory), but if you need it, the cached memory is freed or put to swap (which adds some overhead, which may correspond to what you are seeing in your plot).
So the memory that actually counts is the resident memory ("Real Mem" in the macOS Activity Monitor). You could measure this memory usage with your script using psrecord, or directly with psutil in your loop:
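A minimal sketch of such a measurement, assuming the same file-reading loop as above (the loop body is illustrative, not the commenter's original snippet):

```python
import os

import psutil
from astropy.io import fits

proc = psutil.Process(os.getpid())

for fpath in allfits:
    with fits.open(fpath) as hdul:
        data = hdul[0].data
        test = data + 1  # placeholder processing step
    # Resident set size (RSS) is the "Real Mem" figure, reported in bytes.
    print(f"{fpath.name}: RSS = {proc.memory_info().rss / 1e6:.1f} MB")
```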
It would be better to delete the references before closing the file, as then `hdul.close()` would be able to close the memmap as well. Otherwise it relies on the garbage collector to close the memmap once there are no references left.

@MSeifert04 - I can't say I understand the results completely. I can simply say that the bulk of the memory is freed eventually. So I disagree that FITS does not release any memory, but I cannot say for sure that FITS does not leak memory. 😉
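A minimal sketch of the delete-before-close ordering suggested above, keeping the default memory mapping; the processing step is a placeholder:

```python
from astropy.io import fits

for fpath in allfits:
    hdul = fits.open(fpath)  # memmap=True is the default
    data = hdul[0].data
    test = data + 1          # placeholder processing step

    # Drop every reference to the mapped array *before* closing, so
    # close() can also close the underlying memmap instead of waiting
    # for the garbage collector to do it.
    del data, test
    del hdul[0].data
    hdul.close()
    del hdul
```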
I tried profiling using https://pypi.org/project/memory-profiler/ and here are my results: https://github.com/pllim/playpen/blob/master/measure_fits_memory.py
Memory gets freed at some point, but exactly where depends on `memmap`. Try also adding `del hdu` and `del hdul` to your code and see. And feel free to run that script locally to see if there is variation across hardware architecture, etc. I ran mine on RHEL 7 with Python 3.7.3, astropy 4.0dev, and Numpy 1.16.4, FWIW.
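A rough sketch of trying this locally with memory-profiler, adding `del hdu` and `del hdul` as suggested; this is only an illustration, not the linked script. Running the file prints per-line memory usage for the decorated function:

```python
from pathlib import Path

from astropy.io import fits
from memory_profiler import profile


@profile
def read_one(fpath, memmap=True):
    """Open one FITS file and drop every reference before returning."""
    hdul = fits.open(fpath, memmap=memmap)
    hdu = hdul[0]
    data = hdu.data
    test = data + 1  # placeholder processing step
    del data, test
    del hdu.data
    hdul.close()
    del hdu
    del hdul


if __name__ == "__main__":
    for fpath in sorted(Path("./reduced").glob("*.fits")):
        read_one(fpath, memmap=True)
```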