astropy: Cache locked errors when download cache dir is on NFS
Description
We use Astropy on the LIGO-Caltech computing cluster on which users’ home directories are shared across all worker nodes using NFS. We are finding that when users are running many Python scripts at once that use Astropy, a large fraction of the jobs are failing with the error:
RuntimeError: Cache is locked after 5.02 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /home/leo.singer/.astropy/cache/download/py3/lock. Lock claims to be held by process 14764.
The error even happens if we run the script once first on the submit node to pre-populate the cache before running it on the worker nodes.
We know that we could work around this issue by setting environment variables to put the cache directory on a non-NFS filesystem (e.g. /tmp). However, this is a regression in the most recent version of Astropy: we only see this error with Astropy 4.0, not with 3.2.3. So until now, Astropy has seemed robust on our clusters with the cache on an NFS mount, without any environment tuning.
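For completeness, here is the sort of per-job tuning we would like to avoid: a minimal sketch that puts the download cache on node-local disk, assuming astropy's set_temp_cache context manager behaves as documented (the XDG_CACHE_HOME environment variable is, as I understand it, the equivalent knob outside of Python):
#!/usr/bin/env python
# Workaround sketch: use node-local /tmp for the download cache instead of
# the NFS home directory. set_temp_cache is assumed from astropy.config.
import tempfile
from astropy.config import set_temp_cache
from astropy.coordinates import EarthLocation
scratch = tempfile.mkdtemp(dir='/tmp')   # node-local scratch directory
with set_temp_cache(scratch):            # downloads inside this block land in /tmp
    EarthLocation._get_site_registry()
print('All OK')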
I’ve created a minimal test case that is able to reproduce the failure by getting the EarthLocation site registry.
These symptoms are similar to #9642, but that was fixed in Astropy 4.0, and we are seeing this issue in Astropy 4.0.
Expected behavior
In every run, including the runs on the worker nodes, the test Python script runs to completion, prints the text All OK, and exits with status 0.
Actual behavior
In many runs on the worker nodes, the Python script raises this exception and crashes:
WARNING: leap-second auto-update failed due to the following exception: RuntimeError('Cache is locked after 5.02 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /home/leo.singer/.astropy/cache/download/py3/lock. Lock claims to be held by process 26260.',) [astropy.time.core]
Traceback (most recent call last):
File "/home/leo.singer/astropy-4.0-test/./test.py", line 3, in <module>
EarthLocation._get_site_registry()
File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/coordinates/earth.py", line 529, in _get_site_registry
reg = get_downloaded_sites()
File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/coordinates/sites.py", line 135, in get_downloaded_sites
content = get_pkg_data_contents('coordinates/sites.json', encoding='UTF-8')
File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/utils/data.py", line 701, in get_pkg_data_contents
cache=cache) as fd:
File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/utils/data.py", line 505, in get_pkg_data_fileobj
conf.dataurl_mirror + data_name],
File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/utils/data.py", line 236, in get_readable_fileobj
http_headers=http_headers)
File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/utils/data.py", line 1127, in download_file
with _cache(pkgname) as (dldir, url2hash):
File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/utils/data.py", line 1690, in _cache
with _cache_lock(pkgname), shelve.open(urlmapfn, flag="r") as url2hash:
File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/utils/data.py", line 1591, in _cache_lock
raise RuntimeError(msg)
RuntimeError: Cache is locked after 5.02 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /home/leo.singer/.astropy/cache/download/py3/lock. Lock claims to be held by process 14764.
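The lock that the traceback is waiting on is just a directory inside the cache (the error message tells you to remove it). A simplified sketch of this style of mkdir-based locking, not astropy's actual code, shows why a process killed with kill -9, or an NFS client with a stale view of the directory, leaves every other process stuck until someone removes the lock by hand:
# Simplified illustration of directory-based locking with a timeout.
# This is a sketch, not astropy's implementation.
import contextlib
import os
import time

@contextlib.contextmanager
def dir_lock(lockdir, timeout=5.0, poll=0.1):
    deadline = time.monotonic() + timeout
    while True:
        try:
            os.mkdir(lockdir)      # atomic create-if-absent acts as "acquire"
            break
        except FileExistsError:    # someone else holds the lock
            if time.monotonic() >= deadline:
                raise RuntimeError(f"Cache is locked; remove {lockdir} "
                                   "if the holding process is gone")
            time.sleep(poll)
    try:
        yield
    finally:
        os.rmdir(lockdir)          # release; never happens after kill -9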
Steps to Reproduce
- Clean all cache data:
$ rm -rf ~/.cache ~/.astropy
- Run test script once on the submit node to prime the cache:
$ ./test.py
Downloading https://hpiers.obspm.fr/iers/bul/bulc/Leap_Second.dat
|=======================================================================================| 1.3k/1.3k (100.00%) 0s
Downloading http://data.astropy.org/coordinates/sites.json
|=======================================================================================| 23k/ 23k (100.00%) 0s
All OK
- Submit 100 instances of the script to the HTCondor scheduler to farm them out to the worker nodes:
$ mkdir -p logs
$ condor_submit test.sub
Submitting job(s)....................................................................................................
100 job(s) submitted to cluster 141880176.
Python script test.py:
#!/usr/bin/env python
from astropy.coordinates import EarthLocation
EarthLocation._get_site_registry()
print('All OK')
HTCondor submit file test.sub:
accounting_group = ligo.dev.o3.cbc.pe.bayestar
universe = vanilla
executable = ./test.py
getenv = true
error = logs/$(Cluster).$(Process).err
queue 100
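After the jobs finish, a small helper script (a sketch that just scans the per-job stderr files named in the submit file) makes it easy to count how many of the 100 jobs hit the lock error:
#!/usr/bin/env python
# Tally how many HTCondor jobs failed with the cache lock error by scanning
# the stderr files that test.sub writes to logs/.
import glob

failed = []
for path in glob.glob('logs/*.err'):
    with open(path) as err:
        if 'Cache is locked' in err.read():
            failed.append(path)
print(f'{len(failed)} of the jobs failed with the cache lock error')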
System Details
Linux-3.10.0-1062.12.1.el7.x86_64-x86_64-with-redhat-7.7-Nitrogen
Python 3.6.8 (default, Aug 7 2019, 19:55:42) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
Numpy 1.18.1
astropy 4.0
(scipy not installed)
I will ask the LIGO Lab computing team to post relevant details about file system configuration.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 2
- Comments: 27 (22 by maintainers)
If anyone has an idea of how to fix this in https://github.com/astropy/astropy/blob/master/astropy/utils/data.py, please feel free to submit an exploratory PR (you can even mark it as a draft PR if you want). My drastic approach (if I ever find time to revisit this) would be to remove locking bit by bit until this works, but I understand that @aarchiba is not very thrilled about that. 😅
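One direction such an exploratory PR could take, sketched here only as an illustration and not as astropy's actual fix, is to make every cache write atomic so that readers never need a shared lock at all: the downloader writes to a temporary file in the cache directory and renames it into a path derived from the URL hash.
# Sketch of lock-free cache writes via atomic rename (illustrative only).
import hashlib
import os
import tempfile

def atomic_cache_store(cache_dir, url, payload):
    """Write payload (bytes) for url into cache_dir without any locking."""
    os.makedirs(cache_dir, exist_ok=True)
    final = os.path.join(cache_dir, hashlib.sha256(url.encode()).hexdigest())
    fd, tmp = tempfile.mkstemp(dir=cache_dir)   # same filesystem as the target
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(payload)
        os.replace(tmp, final)                  # rename publishes the file in one step
    except BaseException:
        os.unlink(tmp)
        raise
    return final
Readers simply open the final path; a half-written entry is never visible because the rename only happens after the temporary file is complete.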
We are seeing the same issue when accessing the astropy cache from multiple MPI processes; see https://github.com/healpy/healpy/issues/609.
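A common mitigation on the MPI side, sketched below with mpi4py, is to let only rank 0 touch the network and have the other ranks wait at a barrier; since astropy 4.0 also takes the lock for cache reads (see the traceback above), this reduces but may not eliminate the contention:
# Sketch (assuming mpi4py): rank 0 primes the astropy cache, everyone else
# waits, then all ranks read from the shared cache.
from mpi4py import MPI
from astropy.coordinates import EarthLocation

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    EarthLocation._get_site_registry()   # performs the downloads once
comm.Barrier()                           # other ranks wait until the cache is primed
EarthLocation._get_site_registry()       # now served from the cache on every rank
print('rank', comm.Get_rank(), 'OK')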
No, I don’t. No other jobs on the cluster, no cron jobs, etc.