astropy: Cache locked errors when download cache dir is on NFS

Description

We use Astropy on the LIGO-Caltech computing cluster, where users’ home directories are shared across all worker nodes over NFS. We are finding that when users run many Astropy-based Python scripts at once, a large fraction of the jobs fail with the error:

RuntimeError: Cache is locked after 5.02 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /home/leo.singer/.astropy/cache/download/py3/lock. Lock claims to be held by process 14764. 

The error happens even if we first run the script once on the submit node to pre-populate the cache before running it on the worker nodes.

We know that we could work around this issue by setting environment variables to put the cache directory on a non-NFS filesystem (e.g. in /tmp). However, this is a regression in the most recent version of Astropy: we only see this error with Astropy 4.0, not with 3.2.3. So in the past, Astropy was more robust on our clusters when running with the cache on an NFS mount, without any environment tuning.
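For reference, a minimal sketch of that workaround (untested on our cluster; it assumes astropy.config.set_temp_cache, astropy’s documented helper for temporarily relocating the download cache, and the /tmp path is only an illustration):

#!/usr/bin/env python
# Sketch: point the Astropy download cache at node-local (non-NFS) storage
# before touching anything that downloads. The /tmp location is an example.
import tempfile

from astropy.config import set_temp_cache
from astropy.coordinates import EarthLocation

# Per-job scratch directory on the worker node's local disk.
local_cache = tempfile.mkdtemp(prefix='astropy-cache-', dir='/tmp')

with set_temp_cache(local_cache):
    # Downloads inside this block go to the node-local cache, so no lock
    # is ever taken in the NFS-mounted ~/.astropy directory.
    EarthLocation._get_site_registry()

print('All OK')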

I’ve created a minimal test case that is able to reproduce the failure by getting the EarthLocation site registry.

These symptoms are similar to #9642, but that issue was fixed in Astropy 4.0, and we are seeing this one with Astropy 4.0.

Expected behavior

In every run, including the runs on the worker nodes, the test Python script runs to completion, prints the text All OK, and exits with status 0.

Actual behavior

In many runs on the worker nodes, the Python script raises this exception and crashes:

WARNING: leap-second auto-update failed due to the following exception: RuntimeError('Cache is locked after 5.02 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /home/leo.singer/.astropy/cache/download/py3/lock. Lock claims to be held by process 26260.',) [astropy.time.core]
Traceback (most recent call last):
  File "/home/leo.singer/astropy-4.0-test/./test.py", line 3, in <module>
    EarthLocation._get_site_registry()
  File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/coordinates/earth.py", line 529, in _get_site_registry
    reg = get_downloaded_sites()
  File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/coordinates/sites.py", line 135, in get_downloaded_sites
    content = get_pkg_data_contents('coordinates/sites.json', encoding='UTF-8')
  File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/utils/data.py", line 701, in get_pkg_data_contents
    cache=cache) as fd:
  File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/utils/data.py", line 505, in get_pkg_data_fileobj
    conf.dataurl_mirror + data_name],
  File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/utils/data.py", line 236, in get_readable_fileobj
    http_headers=http_headers)
  File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/utils/data.py", line 1127, in download_file
    with _cache(pkgname) as (dldir, url2hash):
  File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/utils/data.py", line 1690, in _cache
    with _cache_lock(pkgname), shelve.open(urlmapfn, flag="r") as url2hash:
  File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/leo.singer/astropy-4.0-test/env/lib64/python3.6/site-packages/astropy/utils/data.py", line 1591, in _cache_lock
    raise RuntimeError(msg)
RuntimeError: Cache is locked after 5.02 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /home/leo.singer/.astropy/cache/download/py3/lock. Lock claims to be held by process 14764.

Steps to Reproduce

  1. Clean all cache data:
$ rm -rf ~/.cache ~/.astropy
  2. Run the test script once on the submit node to prime the cache:
$ ./test.py 
Downloading https://hpiers.obspm.fr/iers/bul/bulc/Leap_Second.dat
|=======================================================================================| 1.3k/1.3k (100.00%)         0s
Downloading http://data.astropy.org/coordinates/sites.json
|=======================================================================================|  23k/ 23k (100.00%)         0s
All OK
  3. Submit 100 instances of the script to the HTCondor scheduler to farm them out to the worker nodes:
$ mkdir -p logs
$ condor_submit test.sub 
Submitting job(s)....................................................................................................
100 job(s) submitted to cluster 141880176.

Python script test.py:

#!/usr/bin/env python
from astropy.coordinates import EarthLocation
EarthLocation._get_site_registry()
print('All OK')

HTCondor submit file test.sub:

accounting_group = ligo.dev.o3.cbc.pe.bayestar
universe = vanilla
executable = ./test.py
getenv = true
error = logs/$(Cluster).$(Process).err
queue 100

System Details

Linux-3.10.0-1062.12.1.el7.x86_64-x86_64-with-redhat-7.7-Nitrogen
Python 3.6.8 (default, Aug 7 2019, 19:55:42) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
Numpy 1.18.1
astropy 4.0
(scipy not installed)

I will ask the LIGO Lab computing team to post relevant details about file system configuration.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 27 (22 by maintainers)

Most upvoted comments

If anyone has an idea of how to fix this in https://github.com/astropy/astropy/blob/master/astropy/utils/data.py, please feel free to submit an exploratory PR (you can even mark it as a draft PR if you want). My drastic approach (if I ever find time to revisit this) would be to remove the locking bit by bit until this works, but I understand that @aarchiba is not very thrilled about that. 😅
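Not astropy’s actual code, but for the record, a hypothetical sketch of the kind of lock-free update that could replace the lock directory: each writer stages its download in a temporary file on the same filesystem and then publishes it with an atomic rename, so concurrent readers never see a half-written entry (with the usual caveats about rename semantics on NFS). The store_in_cache helper and its arguments are made up for illustration.

import os
import shutil
import tempfile

def store_in_cache(cache_dir, key, source_path):
    """Atomically publish source_path as the cache entry for key."""
    os.makedirs(cache_dir, exist_ok=True)
    final_path = os.path.join(cache_dir, key)

    # Stage the file under a temporary name on the same filesystem as the
    # cache, so the final os.replace() is a rename rather than a copy.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, prefix=key + '.tmp.')
    os.close(fd)
    try:
        shutil.copyfile(source_path, tmp_path)
        # os.replace() is atomic on POSIX: if two jobs race, the last one
        # wins and both still end up with a complete, identical file.
        os.replace(tmp_path, final_path)
    finally:
        # Clean up the staging file if the replace never happened.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)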

We are seeing the same issue when accessing the astropy cache from multiple MPI processes; see https://github.com/healpy/healpy/issues/609

Thanks, @lpsinger. Your shutil.rmtree failing in prepare.py sounds a lot like ansible/ansible#34335 (comment). Do you have anything else accessing the cache at that time, unrelated to the scripts being run here?

No, I don’t. No other jobs on the cluster, no cron jobs, etc.