pants: Nondeterministic `ImportError: No module named __pants_df_parser` while computing changed targets
(NOTE: There’s been a lot of discussion on this in Slack here, but I realized I never filed an issue for it)
Describe the bug
A few times per day (out of ~hundreds of jobs), our CI jobs will fail when trying to build dockerfile_parser.pex
. The error always hits at the beginning of goal execution, and affects all our goals (test
, lint
, check
). The error looks like:
13:28:44.08 [INFO] Starting: Resolving plugins: toolchain.pants.plugin==0.20.0
13:28:56.61 [INFO] Completed: Resolving plugins: toolchain.pants.plugin==0.20.0
13:28:57.16 [INFO] [auth-plugin] Successfully loaded Toolchain token from env var 'TOOLCHAIN_AUTH_TOKEN', expiration: 2022-12-20T20:52:09+00:00.
13:29:17.25 [INFO] Starting: Building dockerfile_parser.pex from dockerfile-parser_default.lock
13:29:18.32 [INFO] Canceled: Building dockerfile_parser.pex from dockerfile-parser_default.lock
Exception caught: (pants.engine.internals.scheduler.ExecutionError)
File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/bin/pants", line 8, in <module>
sys.exit(main())
File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 115, in main
PantsLoader.main()
File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 111, in main
cls.run_default_entrypoint()
File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 93, in run_default_entrypoint
exit_code = runner.run(start_time)
File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_runner.py", line 99, in run
runner = LocalPantsRunner.create(
File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/local_pants_runner.py", line 159, in create
specs = calculate_specs(
File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/init/specs_calculator.py", line 94, in calculate_specs
(changed_addresses,) = session.product_request(
File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 577, in product_request
return self.execute(request)
File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 521, in execute
self._raise_on_error([t for _, t in throws])
File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 498, in _raise_on_error
raise ExecutionError(
Exception message: 1 Exception encountered:
Engine traceback:
in select
in pants.vcs.changed.find_changed_owners
in pants.backend.project_info.dependees.find_dependees
in pants.backend.project_info.dependees.map_addresses_to_dependees
in pants.engine.internals.graph.resolve_dependencies (src/projects/rhp/operations:rhp-api)
in pants.backend.docker.util_rules.dependencies.inject_docker_dependencies (src/projects/rhp/operations:rhp-api)
in pants.backend.docker.subsystems.dockerfile_parser.parse_dockerfile
in pants.engine.process.fallible_to_exec_result_or_raise
Traceback (most recent call last):
File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/process.py", line 272, in fallible_to_exec_result_or_raise
raise ProcessExecutionFailure(
pants.engine.process.ProcessExecutionFailure: Process 'Parse Dockerfile.' failed with exit code 1.
stdout:
stderr:
Traceback (most recent call last):
File "/tmp/pants-sandbox-kVePcL/.cache/pex_root/venvs/e2b00a6620ed72ee63184ca7a64555f5551da8c5/f679563b742633f5edaba61ca7d405616ffc49b3/pex", line 234, in <module>
runpy.run_module(module_name, run_name="__main__", alter_sys=True)
File "/opt/python/3.8.12/lib/python3.8/runpy.py", line 203, in run_module
mod_name, mod_spec, code = _get_module_details(mod_name)
File "/opt/python/3.8.12/lib/python3.8/runpy.py", line 138, in _get_module_details
raise error("No module named %s" % mod_name)
ImportError: No module named __pants_df_parser
Pants version
- The error started appearing on
PANTS_SHA=5d8a328d72209863986c8959b20305505bc068ba
, and has continued appearing as we’ve upgraded along the2.13.x
branch. - As far as I can tell, the error was not appearing on
v2.13.0a0
OS
Linux
Additional info
Through debugging on Slack, we’ve found:
__pants_df_parser.py
truly isn’t in the venv when this error occurs- The error doesn’t occur if we use
--no-process-cleanup
- The sandbox PEX created for the dockerfile parser contains the file, even when the pex_root venv does not. Running the PEX creates a venv that contains the file.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 64 (62 by maintainers)
Commits related to this issue
- Fix venv re-population race. There was a race in venv re-population due to a non-atomic `rm`, create sequence. There was no real need for the `rm` and the create is atomic on its own; so just remove ... — committed to jsirois/pants by jsirois 2 years ago
- Fix venv re-population race. (#16931) There was a race in venv re-population due to a non-atomic `rm`, create sequence. There was no real need for the `rm` and the create is atomic on its own; so ... — committed to pantsbuild/pants by jsirois 2 years ago
- Fix venv re-population race. (Cherry-pick of #16931) There was a race in venv re-population due to a non-atomic `rm`, create sequence. There was no real need for the `rm` and the create is atomic on ... — committed to jsirois/pants by jsirois 2 years ago
- Fix venv re-population race. (Cherry-pick of #16931) There was a race in venv re-population due to a non-atomic `rm`, create sequence. There was no real need for the `rm` and the create is atomic on ... — committed to jsirois/pants by jsirois 2 years ago
- Fix venv re-population race. (Cherry-pick of #16931) (#16935) There was a race in venv re-population due to a non-atomic `rm`, create sequence. There was no real need for the `rm` and the create is ... — committed to pantsbuild/pants by jsirois 2 years ago
- Fix venv re-population race. (Cherry-pick of #16931) (#16936) There was a race in venv re-population due to a non-atomic `rm`, create sequence. There was no real need for the `rm` and the create is a... — committed to pantsbuild/pants by jsirois 2 years ago
- Ensure that sandboxed processes exit before their sandboxes are cleaned up (#18632) As @jsirois discovered and described in https://github.com/pantsbuild/pants/issues/16778#issuecomment-1491120170, ... — committed to pantsbuild/pants by stuhood a year ago
- Ensure that sandboxed processes exit before their sandboxes are cleaned up (Cherry-pick of #18632) As @jsirois discovered and described in https://github.com/pantsbuild/pants/issues/16778#issuecommen... — committed to jsirois/pants by stuhood a year ago
- Ensure that sandboxed processes exit before their sandboxes are cleaned up (Cherry-pick of #18632) As @jsirois discovered and described in https://github.com/pantsbuild/pants/issues/16778#issuecommen... — committed to jsirois/pants by stuhood a year ago
- Ensure that sandboxed processes exit before their sandboxes are cleaned up (Cherry-pick of #18632) As @jsirois discovered and described in https://github.com/pantsbuild/pants/issues/16778#issuecommen... — committed to jsirois/pants by stuhood a year ago
- Ensure that sandboxed processes exit before their sandboxes are clean… (#18641) …ed up (Cherry-pick of #18632) As @jsirois discovered and described in https://github.com/pantsbuild/pants/issues/167... — committed to pantsbuild/pants by jsirois a year ago
- Ensure that sandboxed processes exit before their sandboxes are clean… (#18640) …ed up (Cherry-pick of #18632) As @jsirois discovered and described in https://github.com/pantsbuild/pants/issues... — committed to pantsbuild/pants by jsirois a year ago
- Ensure that sandboxed processes exit before their sandboxes are clean… (#18642) …ed up (Cherry-pick of #18632) As @jsirois discovered and described in https://github.com/pantsbuild/pants/issues... — committed to pantsbuild/pants by jsirois a year ago
I think if the above is the explanation, then just switching to Pex’s
--sh-boot
mode and eliminating the wrapper script will do the trick and allow Pants to keep all the hard earned milliseconds saved by this old wrapper script pre-dating--sh-boot
.Ok, I started re-reviewing this from the top instead of focusing in on Pex. That leads here: https://github.com/pantsbuild/pants/blob/0bbf1fbb6078cacb1e3a75f5ea67a14d0398910c/src/python/pants/backend/python/util_rules/pex.py#L750-L757
Clearly? the rm -rf can be interrupted and since
rm -r
is implemented as a depth-1st delete, that can leave the rootvenv/pex
file present while portions of its subtree have been removed. So, given that, imagine this: *1. A process runs on a machine with no venv in the PEX_ROOT yet, and enters the outer guard. 2. Another process enters the guard and builds the venv and startsexec
’uting it. 3. The original process starts processing the rm -rf on the venv the process from step 2 is using.*N.B. 1 can only happen when the PEX in question is not built locally, since we pre
--seed
the PEX venv when building it. In CI, that means either the LMDB store is cached but the named_caches (PEX_ROOT) are not or that remote caching is enabled and that’s where the pre-built PEX comes from.@danxmoran can you check me on this logic / if it fully describes / comports with your setup where you are seeing failures?
Ok. The data @thejcannon provided was a second instance of similar data @danxmoran provided many months ago and this was finally enough to switch the focus to remote cache hit cancellation of the VenvPex build process more sharply. Thank you both for being willing to collect data.
My fundamental analysis error over these past 9 months was assuming the data loss was on the output side. It in fact looks like it was on the input side instead. I did not complete an analysis of the Pants code that sets up the sandbox the Pex CLI is run in to create the
__df_parser.py
VenvPex, but @stuhood provided expert input that the contents of that sandbox were all there if the Pex CLI process was launched.Taking that to heart, but still focusing on the input side, the missing file is a user code file (the same file) in both reported cases. The Pex is a packed VenvPex. A packed VenvPex is a directory containing 4 classes of items:
PEX-INFO
,main.py
,__pex__/
import hook package..bootstrap
zip of the Pex bootstrap code.deps/*.whl
installed wheel zips.When such a VenvPex is built, Pants passes
--seed
to ensure the PEX_ROOT caches are pre-seeded. In the course of seeding a VenvPex from a packed PEX, the packed PEX is 1st installed as an unzipped pex in the PEX root before re-execing from that to do the venv install in the venvs cache under the PEX_ROOT. The unzipped PEX installation has 3 atomic inputs and 1 non-atomic input - the non-atomic one being the user code. That is gathered via anos.walk
here: https://github.com/pantsbuild/pex/blob/bfea50d28f875d74607d6cc5ef93271744d9dd7b/pex/layout.py#L274-L289So, the
dest_dir
there is anatomic_directory
work dir - which is as things should be. But the walk assumes the inputs will be stable. During Pants remote cache speculation however, a remote cache hit beating a local VenvPex build and--seed
will issue aSIGKILL
to the Pex CLI process and then go through Rust tear down which includes RAII finalization of the sandbox/tmp
dir. The Pex CLI process can thus have the sandbox packed PEX dir open for reading while its contents are nuked by the Rust RAII tear-down leading to anos.walk
over an empty sandbox packed PEX directory that results in populating the unzipped PEXuser_code
cache with 0 user code files. From that point on the cache is poisoned and the user code (__df_parser.py
) will always be missing when a venv is created or re-created from that PEX_ROOT cache.@stuhood has concurred this is a likely cause and has volunteered to take over providing a fix for this uncontrolled asynchrony on the Pants side. The idea being a Pants
Process
should not have to come equipped with its own bespoke code to guard against its inputs being mutated out from under it.Another data-point:
We saw this error a few times yesterday. In every case:
pants publish
on a singledocker_image
target, so there were no concurrent Pex processesdockerfile_parser.pex/__pants_df_parser.py
is present! But there is no.pyc
for the file withindockerfile_parser.pex/__pycache__/
Could there possibly be a race condition between populating the sandbox & invoking the dockerfile parser? Such that sometimes the parser is invoked before
__pants_df_parser.py
lands in the right placeThanks for looking with a critical eye. I need more second eyes.
At a broad brush this is not an issue because Pex handles atomic directory population with the
AtomicDirectory
/atomic_directory
construct. So, yes, racy looking from the Pants side, but Pex handles it - supposedly! Pex only swaps an entire directory into existence from the readers point of view with one atomicrename
call from a working directory that has finished populating to a final resting visible target directory.I used this edit in a kept sandbox:
Which reveals:
So, the 1st command re-creates the venv if
./.cache/pex_root/venvs/d71cc2f587902b5a4d4703ad4a1698b69d856498/1c78e5d7c70b9dd98f5dbf924df6147a02ac3d9f
doesn’t exist. That dir is the venv dir created in anatomic_directory
guard here: https://github.com/pantsbuild/pex/blob/f0c162d7258ee9c5992ae588c266892f37d0d4de/pex/pex_bootstrapper.py#L477-L493The Exec command uses
./.cache/pex_root/venvs/d71cc2f587902b5a4d4703ad4a1698b69d856498/1c78e5d7c70b9dd98f5dbf924df6147a02ac3d9f/pex
and the rest of the venv content which is all underneath the atomic venv dir of./.cache/pex_root/venvs/d71cc2f587902b5a4d4703ad4a1698b69d856498/1c78e5d7c70b9dd98f5dbf924df6147a02ac3d9f/
; so if there is a race bug it lies squarely in the Pex atomic_directory code as near as I can tell. This is why I’ve spent all the effort over there. You’ve seen some of that in reviews. Its that code that seems like it must be broken somehow but where I can see no way it can be broken yet.That’s almost certainly https://github.com/pantsbuild/pex/issues/1968 and not related to the bsd toggle.
Thanks @danxmoran.
When you get a chance you can try the following in
pants.toml
to pull in the release here: https://github.com/pantsbuild/pex/releases/tag/v2.1.112:That should not fix any bugs but it should emit warnings to stderr when strange things are afoot. This may give more useful information. The warnings have two possible lines, but the 1st will always be of the form:
Assuming you get failures / the new warnings, I’d then like you to try adding the following additional content to
pants.toml
and see if this has any effect on the errors / warnings:Yes, happy to help 👍
Yeah, Eric’s suggestion need not be followed. My late realization (for the umpteenth time) of only processes being cached, not @rules, makes that avenue make no sense. Apparently Eric also gets confused by this!
In short, I’m not giving up on the
rm -rf
explanation yet because it actually does explain this perfectly.