astropy: astropy.io.fits not returning memory
First, I'm sorry, I know nothing about memory management. I am using macOS 10.14.5 on a MacBook Pro 15" (2018).
While I was reducing 150 FITS files (each ~70 MB), I realized the pipeline I used got ~2-3 times slower as time went on. I have experienced tens of such cases but couldn't find any explanation. Then, two weeks ago, just by chance, I looked at the memory usage and found that the memory was going "full".
I asked one of my colleagues about it; they have actually experienced it almost every day but found no solution other than running `gc.collect()` inside an `except OSError:` block. They use Windows and Linux, but my Mac does not seem to raise any `OSError`, so that did not help. Calling `gc.collect()` at the end of the for loop did not return memory either.
Then I found from the astropy FAQ that I should do `del hdu.data` and `gc.collect()`. But as I mentioned above, the following simple code did not seem to return any memory:
```python
from pathlib import Path
import gc

from astropy.io import fits

TOPPATH = Path('./reduced')
allfits = list(TOPPATH.glob("*.fits"))
allfits.sort()

for fpath in allfits:
    hdul = fits.open(fpath)
    data = hdul[0].data
    test = data + 1
    hdul.close()
    del data
    del test
    del hdul[0].data
    # gc.collect()  # tested putting it at many different places in the code
```
or following exactly what is in the FAQ:
```python
for fpath in allfits:
    with fits.open(fpath) as hdul:
        for hdu in hdul:
            data = hdu.data
            test = data + 1
            del data
            del hdu.data
            del test
```
When I run these, memory usage quickly increases from 40% to 100%. Because the above FAQ says "In some extreme cases files are opened and closed fast enough", I put `time.sleep(xxx)` in the middle of the for loop for testing, but it didn't help. I also tried resetting the variables to `None` (`test = None`, `data = None`, `hdul[0].data = None`) at the end of the loop, as well as using `CCDData.read` instead of `fits.open`, etc., but found no hope.
As a temporary measure, I therefore let my Mac do the job without any `gc.collect()` or `del`, because that worked, although it was painfully slow. Then, to compare two algorithms I developed, I had to run similar reduction pipelines on identical files, so I used two Jupyter Notebook kernels to run the two nearly identical pipelines at the same time. Since the number of files to be processed (and thus the memory usage) doubled, it got much slower as time went on. Then it suddenly gave a "too many open files" error in the middle of processing, and Jupyter Notebook, Jupyter Lab, etc. never worked again. I spent maybe 3-4 hours googling, got hopeless, and had to clean-install Anaconda after removing it. It took almost a full working day.
I don't know what happened with Jupyter, but this shock was strong enough to make me cautious. As I don't know much about computers, I can't imagine what the fundamental problem is, even after reading the above FAQ. I believe there should be a way to solve this issue, but it is maybe just beyond my ability. I hope a workaround is made explicitly available to astropy users, so that people like me can be more careful.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 31 (23 by maintainers)
It is not doing nothing. The "main" file handle is closed, but not the memmap one. If the references are deleted before `.close`, then the memmap is closed as well.

Oh sorry, I was a bit confused. `memmap=False` solved the original issue I was having; I was mixing it up with a different one. Sorry. Closing the issue.
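For reference, a minimal sketch of that workaround (disabling memory mapping when opening each file); the loop body is only illustrative, not the original pipeline:

```python
from pathlib import Path

from astropy.io import fits

allfits = sorted(Path('./reduced').glob("*.fits"))

for fpath in allfits:
    # memmap=False reads the data into ordinary memory instead of
    # memory-mapping the file, so closing the HDUList releases it.
    with fits.open(fpath, memmap=False) as hdul:
        data = hdul[0].data
        test = data + 1  # placeholder processing step
```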
@ysBach - The cached memory is not a good indicator. I don't know how it works on macOS; on Linux that memory is used by the kernel to cache files as long as memory is available (https://linux-mm.org/Low_On_Memory), but if you need it, the cached memory is freed or put to swap (which adds some overhead, which may correspond to what you are seeing in your plot).
So the memory that actually counts is the resident memory ("Real Mem" in the macOS Activity Monitor). You could measure this memory usage with your script using psrecord, or directly with psutil in your loop:
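A minimal sketch of such a measurement, assuming the same file-reading loop as above (the loop body is illustrative, not the commenter's original snippet):

```python
import os

import psutil
from astropy.io import fits

proc = psutil.Process(os.getpid())

for fpath in allfits:
    with fits.open(fpath) as hdul:
        data = hdul[0].data
        test = data + 1  # placeholder processing step
    # Resident set size (RSS) is the "Real Mem" figure, reported in bytes.
    print(f"{fpath.name}: RSS = {proc.memory_info().rss / 1e6:.1f} MB")
```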
It would be better to delete the references before closing the file, as then `hdul.close()` would be able to close the memmap as well. Otherwise it relies on the garbage collector to close the memmap once there are no references left.

@MSeifert04 - I can't say I understand the results completely. I can simply say that the bulk of the memory is freed eventually. So I disagree that FITS does not release any memory, but I cannot say for sure that FITS does not leak memory. 😉
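A minimal sketch of the delete-before-close ordering suggested above, keeping the default memory mapping; the processing step is a placeholder:

```python
from astropy.io import fits

for fpath in allfits:
    hdul = fits.open(fpath)  # memmap=True is the default
    data = hdul[0].data
    test = data + 1          # placeholder processing step

    # Drop every reference to the mapped array *before* closing, so
    # close() can also close the underlying memmap instead of waiting
    # for the garbage collector to do it.
    del data, test
    del hdul[0].data
    hdul.close()
    del hdul
```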
I tried profiling using https://pypi.org/project/memory-profiler/ and here are my results: https://github.com/pllim/playpen/blob/master/measure_fits_memory.py
Memory gets freed at some point, but exactly where depends on `memmap`. Try also adding `del hdu` and `del hdul` to your code and see. And feel free to run that script locally to see if there is variation across hardware architecture, etc. I ran mine on RHEL 7 with Python 3.7.3, astropy 4.0dev, and Numpy 1.16.4, FWIW.
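A rough sketch of trying this locally with memory-profiler, adding `del hdu` and `del hdul` as suggested; this is only an illustration, not the linked script. Running the file prints per-line memory usage for the decorated function:

```python
from pathlib import Path

from astropy.io import fits
from memory_profiler import profile


@profile
def read_one(fpath, memmap=True):
    """Open one FITS file and drop every reference before returning."""
    hdul = fits.open(fpath, memmap=memmap)
    hdu = hdul[0]
    data = hdu.data
    test = data + 1  # placeholder processing step
    del data, test
    del hdu.data
    hdul.close()
    del hdu
    del hdul


if __name__ == "__main__":
    for fpath in sorted(Path("./reduced").glob("*.fits")):
        read_one(fpath, memmap=True)
```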