pants: Nondeterministic `ImportError: No module named __pants_df_parser` while computing changed targets

(NOTE: There’s been a lot of discussion on this in Slack, but I realized I never filed an issue for it.)

Describe the bug

A few times per day (out of ~hundreds of jobs), our CI jobs will fail when trying to build dockerfile_parser.pex. The error always hits at the beginning of goal execution, and affects all our goals (test, lint, check). The error looks like:

13:28:44.08 [INFO] Starting: Resolving plugins: toolchain.pants.plugin==0.20.0
13:28:56.61 [INFO] Completed: Resolving plugins: toolchain.pants.plugin==0.20.0
13:28:57.16 [INFO] [auth-plugin] Successfully loaded Toolchain token from env var 'TOOLCHAIN_AUTH_TOKEN', expiration: 2022-12-20T20:52:09+00:00.
13:29:17.25 [INFO] Starting: Building dockerfile_parser.pex from dockerfile-parser_default.lock
13:29:18.32 [INFO] Canceled: Building dockerfile_parser.pex from dockerfile-parser_default.lock
Exception caught: (pants.engine.internals.scheduler.ExecutionError)
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/bin/pants", line 8, in <module>
    sys.exit(main())
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 115, in main
    PantsLoader.main()
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 111, in main
    cls.run_default_entrypoint()
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 93, in run_default_entrypoint
    exit_code = runner.run(start_time)
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_runner.py", line 99, in run
    runner = LocalPantsRunner.create(
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/local_pants_runner.py", line 159, in create
    specs = calculate_specs(
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/init/specs_calculator.py", line 94, in calculate_specs
    (changed_addresses,) = session.product_request(
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 577, in product_request
    return self.execute(request)
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 521, in execute
    self._raise_on_error([t for _, t in throws])
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 498, in _raise_on_error
    raise ExecutionError(
Exception message: 1 Exception encountered:
Engine traceback:
  in select
  in pants.vcs.changed.find_changed_owners
  in pants.backend.project_info.dependees.find_dependees
  in pants.backend.project_info.dependees.map_addresses_to_dependees
  in pants.engine.internals.graph.resolve_dependencies (src/projects/rhp/operations:rhp-api)
  in pants.backend.docker.util_rules.dependencies.inject_docker_dependencies (src/projects/rhp/operations:rhp-api)
  in pants.backend.docker.subsystems.dockerfile_parser.parse_dockerfile
  in pants.engine.process.fallible_to_exec_result_or_raise
Traceback (most recent call last):
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/process.py", line 272, in fallible_to_exec_result_or_raise
    raise ProcessExecutionFailure(
pants.engine.process.ProcessExecutionFailure: Process 'Parse Dockerfile.' failed with exit code 1.
stdout:
stderr:
Traceback (most recent call last):
  File "/tmp/pants-sandbox-kVePcL/.cache/pex_root/venvs/e2b00a6620ed72ee63184ca7a64555f5551da8c5/f679563b742633f5edaba61ca7d405616ffc49b3/pex", line 234, in <module>
    runpy.run_module(module_name, run_name="__main__", alter_sys=True)
  File "/opt/python/3.8.12/lib/python3.8/runpy.py", line 203, in run_module
    mod_name, mod_spec, code = _get_module_details(mod_name)
  File "/opt/python/3.8.12/lib/python3.8/runpy.py", line 138, in _get_module_details
    raise error("No module named %s" % mod_name)
ImportError: No module named __pants_df_parser

Pants version

  • The error started appearing on PANTS_SHA=5d8a328d72209863986c8959b20305505bc068ba, and has continued appearing as we’ve upgraded along the 2.13.x branch.
  • As far as I can tell, the error was not appearing on v2.13.0a0

OS

Linux

Additional info

Through debugging on Slack, we’ve found:

  1. __pants_df_parser.py truly isn’t in the venv when this error occurs
  2. The error doesn’t occur if we use --no-process-cleanup
  3. The sandbox PEX created for the dockerfile parser contains the file, even when the pex_root venv does not. Running the PEX creates a venv that contains the file.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 64 (62 by maintainers)


Most upvoted comments

I think, if the above is the explanation, then just switching to Pex’s --sh-boot mode and eliminating the wrapper script will do the trick, and allow Pants to keep all the hard-earned milliseconds saved by this old wrapper script that pre-dates --sh-boot.

Ok, I started re-reviewing this from the top instead of focusing in on Pex. That leads here: https://github.com/pantsbuild/pants/blob/0bbf1fbb6078cacb1e3a75f5ea67a14d0398910c/src/python/pants/backend/python/util_rules/pex.py#L750-L757

Clearly(?) the rm -rf can be interrupted, and since rm -r is implemented as a depth-first delete, an interruption can leave the root venv/pex file present while portions of its subtree have been removed. So, given that, imagine this:

  1.* A process runs on a machine with no venv in the PEX_ROOT yet, and enters the outer guard.
  2. Another process enters the guard, builds the venv, and starts executing it.
  3. The original process starts processing the rm -rf on the venv the process from step 2 is using.

*N.B.: Step 1 can only happen when the PEX in question was not built locally, since we pre-seed the PEX venv (via --seed) when building it. In CI, that means either the LMDB store is cached but the named_caches (PEX_ROOT) are not, or remote caching is enabled and that’s where the pre-built PEX comes from.
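The interruption hazard in step 3 is easy to demonstrate. Below is a toy sketch (not Pants or Pex code): fake_rm_rf is a made-up stand-in for an rm -rf that gets killed partway through its depth-first traversal. The root "venv" directory survives, so an existence check on it still passes, while the user-code file inside it is already gone.

```python
import os
import tempfile

def fake_rm_rf(root, interrupt_after):
    """Delete depth-first like `rm -rf`, but stop after `interrupt_after`
    removals to simulate the deleting process being cancelled mid-flight."""
    removed = 0
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            if removed == interrupt_after:
                return  # interrupted: ancestor dirs are still present
            os.unlink(os.path.join(dirpath, name))
            removed += 1
        for name in dirnames:
            if removed == interrupt_after:
                return
            os.rmdir(os.path.join(dirpath, name))
            removed += 1
    os.rmdir(root)

# Build a toy "venv": a root dir with an interpreter stub and a user module.
venv = tempfile.mkdtemp(prefix="toy-venv-")
os.makedirs(os.path.join(venv, "lib"))
for rel in ("pex", os.path.join("lib", "__pants_df_parser.py")):
    with open(os.path.join(venv, rel), "w") as f:
        f.write("# placeholder\n")

# Interrupt the delete after a single file has been removed. Depth-first
# order means the deepest file (the user module) goes first.
fake_rm_rf(venv, interrupt_after=1)

# A "does the venv dir exist?" check still passes...
print(os.path.isdir(venv))  # True
# ...but the venv is now missing its user code.
print(sorted(os.listdir(venv)))  # ['lib', 'pex']
```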

@danxmoran can you check me on this logic, and whether it fully comports with the setup where you are seeing failures?

Ok. The data @thejcannon provided was a second instance of data similar to what @danxmoran provided many months ago, and this was finally enough to switch the focus more sharply to remote-cache-hit cancellation of the VenvPex build process. Thank you both for being willing to collect data.

My fundamental analysis error over these past 9 months was assuming the data loss was on the output side. It in fact looks like it was on the input side instead. I did not complete an analysis of the Pants code that sets up the sandbox the Pex CLI is run in to create the __pants_df_parser.py VenvPex, but @stuhood provided expert input that the contents of that sandbox were all there when the Pex CLI process was launched.

Taking that to heart, but still focusing on the input side, the missing file is a user code file (the same file) in both reported cases. The Pex is a packed VenvPex. A packed VenvPex is a directory containing 4 classes of items:

  1. loose Pex-internal files: PEX-INFO, __main__.py, and the __pex__/ import hook package.
  2. loose user-code
  3. a .bootstrap zip of the Pex bootstrap code
  4. 0 or more .deps/*.whl installed wheel zips.
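The four classes above can be made concrete with a toy layout. The directory below is illustrative only - the wheel name and the exact file set are made up, not taken from a real packed PEX - but it shows why the loose user code is the odd one out: it is whatever remains after the fixed Pex-internal items are set aside.

```python
import os
import tempfile

# Build a toy directory mimicking the 4 classes of items in a packed VenvPex.
pex_dir = tempfile.mkdtemp(prefix="packed-pex-")
layout = [
    "PEX-INFO",                                  # 1. loose Pex-internal files
    "__main__.py",
    "__pex__/__init__.py",
    "__pants_df_parser.py",                      # 2. loose user code
    ".bootstrap",                                # 3. zip of the Pex bootstrap code
    ".deps/dockerfile-3.2.0-py3-none-any.whl",   # 4. installed wheel zips (name is made up)
]
for rel in layout:
    path = os.path.join(pex_dir, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write("")

def user_code_files(packed_pex_dir):
    """Everything at the top level that is not a known Pex-internal item
    is loose user code - the one non-atomic input discussed below."""
    internal = {"PEX-INFO", "__main__.py", "__pex__", ".bootstrap", ".deps"}
    return sorted(set(os.listdir(packed_pex_dir)) - internal)

print(user_code_files(pex_dir))  # ['__pants_df_parser.py']
```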

When such a VenvPex is built, Pants passes --seed to ensure the PEX_ROOT caches are pre-seeded. In the course of seeding a VenvPex from a packed PEX, the packed PEX is first installed as an unzipped PEX in the PEX_ROOT before re-execing from that to do the venv install in the venvs cache under the PEX_ROOT. The unzipped PEX installation has 3 atomic inputs and 1 non-atomic input - the non-atomic one being the user code, which is gathered via an os.walk here: https://github.com/pantsbuild/pex/blob/bfea50d28f875d74607d6cc5ef93271744d9dd7b/pex/layout.py#L274-L289

So, the dest_dir there is an atomic_directory work dir, which is as things should be. But the walk assumes the inputs will be stable. During Pants remote-cache speculation, however, a remote cache hit that beats a local VenvPex build and --seed will issue a SIGKILL to the Pex CLI process and then go through Rust teardown, which includes RAII finalization of the sandbox /tmp dir. The Pex CLI process can thus have the sandbox packed PEX dir open for reading while its contents are nuked by the Rust RAII teardown, leading to an os.walk over an empty sandbox packed PEX directory that populates the unzipped PEX user-code cache with 0 user-code files. From that point on the cache is poisoned, and the user code (__pants_df_parser.py) will always be missing whenever a venv is created or re-created from that PEX_ROOT cache.
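A toy simulation of that poisoning sequence, under stated assumptions: seed_user_code and all the paths are hypothetical stand-ins for the pex/layout.py seeding walk, not real Pex code. The key property is that os.walk over a directory emptied out from under it simply yields nothing, so the "atomic" commit publishes an empty, permanently wrong cache entry.

```python
import os
import shutil
import tempfile

def seed_user_code(packed_pex_dir, cache_dir):
    """Toy version of the seeding step: walk the packed PEX's user code and
    copy it into a work dir, then commit with one rename (atomic-directory
    style). The walk silently yields nothing if the inputs vanish."""
    workdir = cache_dir + ".work"
    os.makedirs(workdir)
    for dirpath, _, filenames in os.walk(packed_pex_dir):
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(workdir, os.path.relpath(src, packed_pex_dir))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)
    os.rename(workdir, cache_dir)  # commit: the (possibly empty) result now looks valid

# A sandbox holding the packed PEX's user code.
sandbox = tempfile.mkdtemp(prefix="sandbox-")
with open(os.path.join(sandbox, "__pants_df_parser.py"), "w") as f:
    f.write("print('parse')\n")

# Simulate the race: RAII teardown of the sandbox (after the SIGKILL)
# empties it before the seeding walk runs.
for name in os.listdir(sandbox):
    os.unlink(os.path.join(sandbox, name))

cache = os.path.join(tempfile.mkdtemp(prefix="pex-root-"), "unzipped_pexes", "abc123")
os.makedirs(os.path.dirname(cache))
seed_user_code(sandbox, cache)

# The cache entry exists and looks complete, but holds zero user-code files:
# every venv built from it will be missing __pants_df_parser.py.
print(os.listdir(cache))  # []
```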

@stuhood has concurred this is a likely cause and has volunteered to take over providing a fix for this uncontrolled asynchrony on the Pants side. The idea being a Pants Process should not have to come equipped with its own bespoke code to guard against its inputs being mutated out from under it.

Another data-point:

We saw this error a few times yesterday. In every case:

  1. The error occurred in a CI job running pants publish on a single docker_image target, so there were no concurrent Pex processes
  2. The “building Pex” process for the Dockerfile parser was cancelled after a remote-cache hit
  3. Looking in the post-failure sandboxes, dockerfile_parser.pex/__pants_df_parser.py is present! But there is no .pyc for the file within dockerfile_parser.pex/__pycache__/

Could there possibly be a race condition between populating the sandbox and invoking the dockerfile parser, such that sometimes the parser is invoked before __pants_df_parser.py lands in the right place?

Thanks for looking with a critical eye. I need more second eyes.

see the file-system in a state where the venv dir exists, but the venv isn’t fully populated yet

At a broad brush this is not an issue, because Pex handles atomic directory population with the AtomicDirectory / atomic_directory construct. So, yes, it looks racy from the Pants side, but Pex handles it - supposedly! Pex only swaps an entire directory into existence, from the reader’s point of view, with one atomic rename call from a working directory that has finished populating to the final visible target directory.
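For readers unfamiliar with the construct, here is a minimal sketch of the populate-then-rename idea; the real Pex AtomicDirectory additionally takes an exclusive file lock and handles far more edge cases, so this is an illustration of the invariant, not its implementation. Readers only ever observe a missing directory or a fully populated one.

```python
import os
import shutil
import tempfile

def atomic_directory(target_dir, populate):
    """Sketch of the atomic-directory pattern: do all the work in a private
    work dir, then publish it with a single atomic rename."""
    if os.path.isdir(target_dir):
        return  # another process already won the race
    workdir = tempfile.mkdtemp(dir=os.path.dirname(target_dir))
    populate(workdir)
    try:
        os.rename(workdir, target_dir)  # the one atomic publish step
    except OSError:
        # Lost the race to a concurrent creator; their result is equivalent.
        shutil.rmtree(workdir)

# Hypothetical venv cache path shaped like the ones in the logs above.
root = tempfile.mkdtemp(prefix="venvs-")
venv = os.path.join(root, "d71cc2f5", "1c78e5d7")
os.makedirs(os.path.dirname(venv))

def populate(workdir):
    # Stand-in for installing the venv's contents.
    with open(os.path.join(workdir, "pex"), "w") as f:
        f.write("#!/bin/sh\n")

atomic_directory(venv, populate)
print(sorted(os.listdir(venv)))  # ['pex']
```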

I used this edit in a kept sandbox:

--- orig.sh     2023-03-06 08:10:20.122821818 -0800
+++ pytest_runner.pex_pex_shim.sh       2023-03-06 08:15:18.662747777 -0800
@@ -38,6 +38,12 @@
 target_venv_executable="$(adjust_relative_paths .cache/pex_root/venvs/d71cc2f587902b5a4d4703ad4a1698b69d856498/1c78e5d7c70b9dd98f5dbf924df6147a02ac3d9f/pex)"
 venv_dir="$(adjust_relative_paths .cache/pex_root/venvs/d71cc2f587902b5a4d4703ad4a1698b69d856498/1c78e5d7c70b9dd98f5dbf924df6147a02ac3d9f)"

+echo >&2 "Potential command lines reified:"
+echo >&2 Venv dir: ${venv_dir}
+echo >&2 Re-create command: PEX_INTERPRETER=1 ${execute_pex_args} -c ''
+echo >&2 Exec command: exec "${target_venv_executable}" "$@"
+exit 0
+
 # Let PEX_TOOLS invocations pass through to the original PEX file since venvs don't come
 # with tools support.
 if [ -n "${PEX_TOOLS:-}" ]; then

Which reveals:

$ ./__run.sh
Potential command lines reified:
Venv dir: ./.cache/pex_root/venvs/d71cc2f587902b5a4d4703ad4a1698b69d856498/1c78e5d7c70b9dd98f5dbf924df6147a02ac3d9f
Re-create command: PEX_INTERPRETER=1 /home/jsirois/.pyenv/versions/3.7.15/bin/python3.7 ./pytest_runner.pex -c
Exec command: exec ./.cache/pex_root/venvs/d71cc2f587902b5a4d4703ad4a1698b69d856498/1c78e5d7c70b9dd98f5dbf924df6147a02ac3d9f/pex --no-header --color=yes --junit-xml=src.python.pants.util.strutil_test.py.tests.xml -o junit_family=xunit2 src/python/pants/util/strutil_test.py

So, the 1st command re-creates the venv if ./.cache/pex_root/venvs/d71cc2f587902b5a4d4703ad4a1698b69d856498/1c78e5d7c70b9dd98f5dbf924df6147a02ac3d9f doesn’t exist. That dir is the venv dir created in an atomic_directory guard here: https://github.com/pantsbuild/pex/blob/f0c162d7258ee9c5992ae588c266892f37d0d4de/pex/pex_bootstrapper.py#L477-L493

The Exec command uses ./.cache/pex_root/venvs/d71cc2f587902b5a4d4703ad4a1698b69d856498/1c78e5d7c70b9dd98f5dbf924df6147a02ac3d9f/pex and the rest of the venv content, which is all underneath the atomic venv dir of ./.cache/pex_root/venvs/d71cc2f587902b5a4d4703ad4a1698b69d856498/1c78e5d7c70b9dd98f5dbf924df6147a02ac3d9f/; so if there is a race bug, it lies squarely in the Pex atomic_directory code as near as I can tell. This is why I’ve spent all the effort over there; you’ve seen some of that in reviews. It’s that code that seems like it must be broken somehow, but where I can see no way it can be broken yet.

That’s almost certainly https://github.com/pantsbuild/pex/issues/1968 and not related to the bsd toggle.

Thanks @danxmoran.

When you get a chance you can try the following in pants.toml to pull in the release here: https://github.com/pantsbuild/pex/releases/tag/v2.1.112:

[pex-cli]
version = "v2.1.112"
known_versions = [
  "v2.1.112|macos_arm64|6b0be9f1dcf4d105aec06fd87d405ba46b45c266ef8549727b86541eea7c2f57|4065703",
  "v2.1.112|macos_x86_64|6b0be9f1dcf4d105aec06fd87d405ba46b45c266ef8549727b86541eea7c2f57|4065703",
  "v2.1.112|linux_x86_64|6b0be9f1dcf4d105aec06fd87d405ba46b45c266ef8549727b86541eea7c2f57|4065703",
  "v2.1.112|linux_arm64|6b0be9f1dcf4d105aec06fd87d405ba46b45c266ef8549727b86541eea7c2f57|4065703"
]

That should not fix any bugs but it should emit warnings to stderr when strange things are afoot. This may give more useful information. The warnings have two possible lines, but the 1st will always be of the form:

[pid:{pid}, tid:{thread id}, cwd:{cwd}]: After obtaining an exclusive lock on {lockfile}, failed to establish a work directory at {workdir} due to: {err}

Assuming you get failures / the new warnings, I’d then like you to try adding the following additional content to pants.toml and see if this has any effect on the errors / warnings:

[subprocess-environment]
env_vars.add = ["_PEX_FILE_LOCK_STYLE=bsd"]

Yes, happy to help 👍

Yeah, Eric’s suggestion need not be followed. My late realization (for the umpteenth time) that only processes are cached, not @rules, makes that avenue make no sense. Apparently Eric also gets confused by this!

In short, I’m not giving up on the rm -rf explanation yet because it actually does explain this perfectly.