bazel: --build_python_zip fails when runfiles include file containing '=' character (affects pyspark==2.4.6)
Description of the problem / feature request:
A Bazel Python ‘zipapp’ cannot be built using --build_python_zip when the underlying py_binary target depends on pyspark==2.4.6. I think this is because pyspark contains files that include the “=” character in their filenames, which breaks some logic in the --build_python_zip action.
Example Error:
INFO: Analyzed 2 targets (22 packages loaded, 708 targets configured).
INFO: Found 2 targets...
ERROR: /Users/jonathon/work/reproduce_zipapp_bug/spark_hello_world/BUILD:4:10: PythonZipper spark_hello_world/main.zip failed (Exit 255): zipper failed: error executing command external/bazel_tools/tools/zip/zipper/zipper cC bazel-out/darwin-fastbuild/bin/spark_hello_world/main.zip @bazel-out/darwin-fastbuild/bin/spark_hello_world/main.zip-0.params
Use --sandbox_debug to see verbose messages from the sandbox zipper failed: error executing command external/bazel_tools/tools/zip/zipper/zipper cC bazel-out/darwin-fastbuild/bin/spark_hello_world/main.zip @bazel-out/darwin-fastbuild/bin/spark_hello_world/main.zip-0.params
Use --sandbox_debug to see verbose messages from the sandbox
File kittens/date=2018-01/not-image.txt=external/pypi/pypi__pyspark/pyspark/data/mllib/images/partitioned/cls=kittens/date=2018-01/not-image.txt does not seem to exist.
INFO: Elapsed time: 2.814s, Critical Path: 0.35s
INFO: 2 processes: 2 internal.
FAILED: Build did NOT complete successfully
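The error text above is consistent with each entry in the zipper params file having the form zip_path=file_path and being split at the first "=". The sketch below illustrates that ambiguity under those assumptions; it is not the zipper's actual code, and the runfiles/ prefix on the zip path is a hypothetical value chosen for illustration.

```python
from typing import Tuple


def split_on_first_equals(entry: str) -> Tuple[str, str]:
    """Split a 'zip_path=file_path' entry at the FIRST '=' (assumed behaviour)."""
    dest, _sep, src = entry.partition("=")
    return dest, src


# File path taken from the error message above.
src_path = (
    "external/pypi/pypi__pyspark/pyspark/data/mllib/images/"
    "partitioned/cls=kittens/date=2018-01/not-image.txt"
)
# Hypothetical path inside the zip; only the 'cls=...' segment matters here.
zip_path = (
    "runfiles/pypi/pypi__pyspark/pyspark/data/mllib/images/"
    "partitioned/cls=kittens/date=2018-01/not-image.txt"
)

entry = f"{zip_path}={src_path}"
dest, src = split_on_first_equals(entry)

# 'dest' is cut off at '.../partitioned/cls', and 'src' starts with
# 'kittens/date=2018-01/...', which is the path the zipper reports as missing.
print(dest)
print(src)
```

Under these assumptions, the mis-parsed source path begins with kittens/date=2018-01/not-image.txt=external/..., matching the "does not seem to exist" line in the log.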
Bugs: what’s the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
I have a full public reproduction over in this repo: https://github.com/thundergolfer/bazel-build_python_zip-bug-reproduction (instructions in the README)
What operating system are you running Bazel on?
MacOS Catalina 10.15.7
What’s the output of bazel info release?
release 3.7.2
If bazel info release returns “development version” or “(@non-git)”, tell us how you built Bazel.
Replace this line with your answer.
What’s the output of git remote get-url origin ; git rev-parse master ; git rev-parse HEAD?
git remote get-url origin ; git rev-parse main ; git rev-parse HEAD
git@github.com:thundergolfer/bazel-build_python_zip-bug-reproduction.git
ff5d23b14ade117b74494ecd3a0ed5666b8f224e
ff5d23b14ade117b74494ecd3a0ed5666b8f224e
Have you found anything relevant by searching the web?
- GitHub issues: https://github.com/bazelbuild/rules_docker/issues/1254 seems relevant.
👋 I can look further into this and submit a fix + test, when time permits.
About this issue
- State: open
- Created 3 years ago
- Comments: 15 (10 by maintainers)
To add onto that note, the executable zip generated by --build_python_zip after I applied my patch has terrible cold start performance due to Bazel’s implementation not caching the extracted files after first run.
I ended up abandoning this approach for distributing my Python code.
Is there any progress on this? This issue breaks our build since we introduced a new dependency that itself has a sub-dependency on pyarrow (as described by @thundergolfer and @benjaminRomano).
The patch would be welcome. I don’t see any problems fixing zipper to support = in the filenames.
Same kind of issue in pyarrow==3.0.0. It includes a file with path: pyarrow/tests/data/feather/v0.17.0.version=2-compression=lz4.feather.
I’m also able to reproduce this issue with pyspark==3.0.2, so this issue has not gone away on their side.
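To check whether another dependency is likely to hit the same problem, one option is to scan its installed files for "=" in their paths. This is a minimal sketch; the function name find_paths_with_equals and the directory you pass to it are illustrative assumptions, not part of Bazel or any package mentioned here.

```python
import os
from typing import List


def find_paths_with_equals(root: str) -> List[str]:
    """Return sorted relative file paths under 'root' that contain '='."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            # A '=' in any directory segment or in the file name counts.
            if "=" in rel:
                hits.append(rel.replace(os.sep, "/"))
    return sorted(hits)


# Example (hypothetical location of an unpacked wheel):
# for path in find_paths_with_equals("/path/to/site-packages/pyarrow"):
#     print(path)
```

Any path it reports, such as the pyarrow feather test file mentioned above, is a candidate for triggering the zipper failure.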