pex: Intermittent / nondeterministic failure building `requirements.pex` due to missing `site-packages`
Our CI will occasionally fail with an error like:
Traceback (most recent call last):
File "/home/color/.cache/nce/99421dee8fedc336d5a6bb8322fbfa602bc68fe63303ae702ec7b9c5672cd086/bindings/venvs/2.15.0rc6/lib/python3.9/site-packages/pants/engine/process.py", line 289, in fallible_to_exec_result_or_raise
raise ProcessExecutionFailure(
pants.engine.process.ProcessExecutionFailure: Process 'Building 1 requirement for requirements.pex from the 3rdparty/lockfiles/resolves/pants-plugins.lockfile resolve: pantsbuild.pants<2.16,>=2.15.0a0' failed with exit code 1.
stdout:
stderr:
The virtualenv at /tmp/pants/named_caches/pex_root/venvs/4c85013f478e0393bbf8db8fcf02e1def7ff5031/ba7a55164c2afb363895254bbb1063124dd74d5b.lck.work is not valid. No site-packages directory was found in its sys.path:
/opt/python/3.9.16/lib/python39.zip
/opt/python/3.9.16/lib/python3.9
/opt/python/3.9.16/lib/python3.9/lib-dynload
/opt/python/3.9.16/lib/python3.9/site-packages
We’ve seen it hit across different resolves, Python interpreter versions, and Pex versions - the error above happened on Pex v2.1.122.
It happens very infrequently (once every few weeks).
I captured the Pants caches & execution dir for the error above. The archive is too big to attach via GitHub, but I can share it via Slack / GDrive upload.
About this issue
- State: closed
- Created a year ago
- Comments: 22 (22 by maintainers)
Commits related to this issue
- Wrap inter-process locks in in-process locks. This is needed to have independent POSIX fcntl locks in the same process by multiple threads and also needed whenever BSD flock locks silently use fcntl ... — committed to jsirois/pex by jsirois a year ago
- Wrap inter-process locks in in-process locks. (#2070) This is needed to have independent POSIX fcntl locks in the same process by multiple threads and also needed whenever BSD flock locks silently us... — committed to pex-tool/pex by jsirois a year ago
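The idea named in these commit messages can be sketched roughly as follows. This is a minimal illustration, not Pex's actual implementation: the names `locked`, `_REGISTRY_GUARD`, and `_IN_PROCESS_LOCKS` are hypothetical, and it assumes a POSIX system where Python's `fcntl` module is available.

```python
import fcntl
import os
import threading
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical registry: one in-process lock per lock-file path.
_REGISTRY_GUARD = threading.Lock()
_IN_PROCESS_LOCKS = defaultdict(threading.Lock)


@contextmanager
def locked(lock_path):
    """Hold an in-process lock, then a POSIX fcntl lock, for lock_path."""
    key = os.path.realpath(lock_path)
    with _REGISTRY_GUARD:
        thread_lock = _IN_PROCESS_LOCKS[key]
    # POSIX fcntl locks are owned by the process, not the thread or the fd,
    # so a second thread in the same process would be granted the lock
    # immediately. The threading.Lock is what serializes threads.
    with thread_lock:
        fd = os.open(key, os.O_CREAT | os.O_WRONLY, 0o644)
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks other processes
            yield
        finally:
            # Closing the fd releases the fcntl lock.
            os.close(fd)
```

Holding the thread lock around both the open and the close matters here, since closing any descriptor for the file releases the process's `fcntl` lock on it.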
Thanks @danxmoran. I’ll be running some experiments using overlay2 today. The kernel notes indicate some POSIX non-compliance, the most interesting bit being how an inode can differ for a file if it starts in a lower layer and is copied up to the upper layer -> this would allow the same path to be treated as two different paths and foil locking. That said, the lower layer is read only and would not have a lockfile in it IIUC - i.e.: I think `/tmp` is fully in the upper layer in your setup. Also, the lock file is opened write-only which would necessitate being copied up to the upper layer during the open call of each and every attempt to lock it IIUC; so this bit of POSIX non-compliance seems to be ruled out as an issue here.

Even though it should have no bearing on your issue, you might try upgrading to Pex 2.1.124 and reporting back in a few weeks if you’re game:

@danxmoran if you can run this in the CI container you use - substituting the full path of the python interpreter your CI runs with, this will provide a sanity check that that Python thinks it has `flock` and is not falling back to `fcntl` emulation:

It’s easy to get the same process owning a lock 2x under `fcntl` - I can find no way to get this to happen under `flock` given each thread opens its own private fd to lock with in the `atomic_directory` code.

The other sanity check is to triple-confirm that no part of `/tmp/pants/named_caches/pex_root` is an NFS mount. Linux will silently convert a `flock` lock to a `fcntl` lock for NFS files / file descriptors.
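Both sanity checks can be approximated in a few lines of Python. This is a hedged sketch, not the snippet from the original comment. It assumes that CPython records whether it was built against a native `flock` in the `HAVE_FLOCK` build variable, and that the mount table is available in the `/proc/mounts` format. The function names are hypothetical.

```python
import os
import sysconfig


def has_native_flock():
    """True if this CPython build reports being compiled with flock(2).

    Assumption: without HAVE_FLOCK, fcntl.flock is emulated via fcntl locks.
    """
    return bool(sysconfig.get_config_var("HAVE_FLOCK"))


def filesystem_type(path, mounts_text):
    """Return the fs type of the deepest mount point containing path.

    mounts_text is text in /proc/mounts format: device, mount point,
    fs type, options, ... (escaped spaces in mount points are not handled).
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Later entries for the same mount point shadow earlier ones.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type
```

In the CI container this could be run with the interpreter Pants uses, for example `filesystem_type("/tmp/pants/named_caches/pex_root", open("/proc/mounts").read())`, where a result of `nfs` or `nfs4` would be the thing to look for. It only checks the mount that contains the given path, so any separate mounts nested beneath `pex_root` would need the same check.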