pex: Intermittent / nondeterministic failure building `requirements.pex` due to missing `site-packages`

Our CI will occasionally fail with an error like:

Traceback (most recent call last):
  File "/home/color/.cache/nce/99421dee8fedc336d5a6bb8322fbfa602bc68fe63303ae702ec7b9c5672cd086/bindings/venvs/2.15.0rc6/lib/python3.9/site-packages/pants/engine/process.py", line 289, in fallible_to_exec_result_or_raise
    raise ProcessExecutionFailure(
pants.engine.process.ProcessExecutionFailure: Process 'Building 1 requirement for requirements.pex from the 3rdparty/lockfiles/resolves/pants-plugins.lockfile resolve: pantsbuild.pants<2.16,>=2.15.0a0' failed with exit code 1.
stdout:

stderr:
The virtualenv at /tmp/pants/named_caches/pex_root/venvs/4c85013f478e0393bbf8db8fcf02e1def7ff5031/ba7a55164c2afb363895254bbb1063124dd74d5b.lck.work is not valid. No site-packages directory was found in its sys.path:
/opt/python/3.9.16/lib/python39.zip
/opt/python/3.9.16/lib/python3.9
/opt/python/3.9.16/lib/python3.9/lib-dynload
/opt/python/3.9.16/lib/python3.9/site-packages

We’ve seen it hit across different resolves, Python interpreter versions, and Pex versions - the error above happened on Pex v2.1.122.

It happens very infrequently (once every few weeks).

I captured the Pants caches & execution dir for the error above. The archive is too big to attach via GitHub, but I can share it via Slack / GDrive upload.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 22 (22 by maintainers)

Most upvoted comments

Thanks @danxmoran. I’ll be running some experiments using overlay2 today. The kernel notes indicate some POSIX non-compliance, the most interesting bit being that a file’s inode can differ if it starts in a lower layer and is then copied up to the upper layer. That would allow the same path to be treated as two different paths and foil locking. That said, the lower layer is read-only and would not have a lockfile in it IIUC, i.e., I think /tmp is fully in the upper layer in your setup. Also, the lock file is opened write-only, which IIUC means it has to be copied up to the upper layer during the open call of each and every attempt to lock it, so this bit of POSIX non-compliance seems to be ruled out as an issue here.
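For anyone who wants to check the inode concern directly on their own filesystem, here is a minimal sketch. It is not taken from Pex; the helper name `same_file_identity` is hypothetical. It opens one path twice in write-only mode (as described above for the lock file) and compares the device and inode numbers the kernel reports for each descriptor:

```python
import os


def same_file_identity(path):
    """Open `path` twice, write-only, and compare (st_dev, st_ino) of both fds.

    Hypothetical diagnostic helper, not part of Pex. A False result would mean
    the kernel reports two different identities for the same path.
    """
    fd1 = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    fd2 = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        s1, s2 = os.fstat(fd1), os.fstat(fd2)
        return (s1.st_dev, s1.st_ino) == (s2.st_dev, s2.st_ino)
    finally:
        os.close(fd1)
        os.close(fd2)
```

Running this against a scratch file under the pex_root directory inside the CI container would show whether both opens resolve to the same underlying file there. It only checks the write-only case, so it does not exercise the read-then-copy-up path from the kernel notes.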

Even though it should have no bearing on your issue, you might try upgrading to Pex 2.1.124 and reporting back in a few weeks if you’re game:

[pex-cli]
version = "v2.1.124"
known_versions = [
  "v2.1.124|macos_arm64|5088d00bc89cfaac537846413d8456caa3b2b021d9a5ce6b423635dd1a57b84c|4077988",
  "v2.1.124|macos_x86_64|5088d00bc89cfaac537846413d8456caa3b2b021d9a5ce6b423635dd1a57b84c|4077988",
  "v2.1.124|linux_x86_64|5088d00bc89cfaac537846413d8456caa3b2b021d9a5ce6b423635dd1a57b84c|4077988",
  "v2.1.124|linux_arm64|5088d00bc89cfaac537846413d8456caa3b2b021d9a5ce6b423635dd1a57b84c|4077988"
]

@danxmoran, if you can, run this in your CI container, substituting the full path of the Python interpreter your CI runs with. It provides a sanity check that this Python thinks it has flock and is not falling back to fcntl emulation:

python -c 'import sysconfig; print(sysconfig.get_config_var("HAVE_FLOCK"))'

It’s easy to get the same process owning a lock twice under fcntl, but I can find no way to get this to happen under flock, given that each thread opens its own private fd to lock with in the atomic_directory code.
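The difference between the two locking styles can be seen with a small standalone sketch. This is not the atomic_directory code; `second_lock_succeeds` is a hypothetical helper. It assumes Linux, a local (non-NFS) temp directory, and a Python built with real flock support. It takes an exclusive lock through one fd and then tries a non-blocking exclusive lock through a second fd in the same process:

```python
import fcntl
import os
import tempfile


def second_lock_succeeds(lock_fn):
    """Lock a file via one fd, then try to lock it again via a second fd.

    Returns True if the same process ends up holding the lock twice.
    Hypothetical demo helper; assumes Linux and a local filesystem.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.lck")
        fd1 = os.open(path, os.O_CREAT | os.O_WRONLY)
        fd2 = os.open(path, os.O_CREAT | os.O_WRONLY)
        try:
            lock_fn(fd1, fcntl.LOCK_EX | fcntl.LOCK_NB)
            try:
                lock_fn(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
                return True
            except OSError:
                # With LOCK_NB, a lock already held elsewhere raises instead of blocking.
                return False
        finally:
            os.close(fd1)
            os.close(fd2)


# fcntl-style (POSIX record) locks are owned by the process, so the second
# attempt is granted; flock locks belong to the open file description, so the
# second fd is refused.
print("fcntl double lock:", second_lock_succeeds(fcntl.lockf))
print("flock double lock:", second_lock_succeeds(fcntl.flock))
```

If the flock line also reported a double lock, that would suggest flock is being emulated with fcntl or converted by the filesystem, which are the two cases the sanity checks in this comment are aimed at.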

The other sanity check is to triple-confirm that no part of /tmp/pants/named_caches/pex_root is an NFS mount. Linux will silently convert a flock lock to a fcntl lock for NFS files / file descriptors.
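One way to run the NFS check from inside the container is to look the path up in the kernel's mount table. The sketch below assumes Linux and the usual `/proc/self/mounts` layout (device, mount point, filesystem type, ...); the helper names `fs_type_for` and `is_on_nfs` are hypothetical, and mount points containing spaces (which the kernel escapes) are not handled:

```python
import os


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path.

    `mounts_text` is the content of a mount table in /proc/self/mounts format.
    Hypothetical helper for illustration only.
    """
    target = os.path.normpath(path) + "/"
    best = ("", "")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # >= lets a later mount on the same mount point shadow an earlier one.
        if target.startswith(prefix) and len(mount_point) >= len(best[0]):
            best = (mount_point, fs_type)
    return best


def is_on_nfs(path, mounts_file="/proc/self/mounts"):
    """True if the mount containing `path` has an NFS filesystem type (nfs, nfs4)."""
    with open(mounts_file) as f:
        _, fs_type = fs_type_for(os.path.realpath(path), f.read())
    return fs_type.startswith("nfs")
```

Calling `is_on_nfs("/tmp/pants/named_caches/pex_root")` in the CI container, and again for each parent directory if any of them are symlinks or bind mounts, would give a quick yes/no answer for the path in the error above.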