bazel: TensorFlow is broken with Bazel@HEAD

https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/2640#01835384-a5a8-4cef-839a-92041e93f6fb

(03:29:23) ERROR: /var/lib/buildkite-agent/builds/bk-docker-062w/bazel-downstream-projects/tensorflow/tensorflow/BUILD:1409:19: Executing genrule //tensorflow:tf_python_api_gen_v2 failed: (Aborted): bash failed: error executing command (from target //tensorflow:tf_python_api_gen_v2)
  (cd /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/18887e87f83f3c86081828342b40a00a/execroot/org_tensorflow && \
  exec env - \
    PATH=/var/lib/buildkite-agent/.cache/bazelisk/local/-tmp-tmppfqvvf95-bazel/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
    PYTHON_BIN_PATH=/usr/bin/python3 \
    PYTHON_LIB_PATH=/usr/lib/python3/dist-packages \
    TF2_BEHAVIOR=1 \
  /bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2 --root_init_template=tensorflow/api_template.__init__.py --apidir=bazel-out/k8-opt/bin/tensorflow_api/v2/ --apiname=tensorflow --apiversion=2  --compat_apiversion=1 --compat_apiversion=2  -- 
...
...

Culprit finder: https://buildkite.com/bazel/culprit-finder/builds/3607#01834b10-64ec-4e64-812f-2357b4ccd39a Bisect shows dadc49e437018f482640ed76fae5307daf9911a8 is the culprit

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 40 (39 by maintainers)

Commits related to this issue

Most upvoted comments

@c-parsons @meteorcloudy I created this commit over half a year ago, feel free to change the logic in any way you see fit - including rolling it back. It only affects the non-Starlark implementation anyway.

cc_shared_library built-in code defines and exposes the provider PropagationHints. Your custom rule can instantiate this provider which the graph structure aspect will be able to see.

In the PropagationHints instance that you create, you place in a list field the attributes of your custom rule through which you want propagation to happen. I think this separates responsibilities cleanly. For example, if you have a new attribute you want to add to the list, you wouldn’t have to wait for a change to Bazel, you can just change what you put inside the PropagationHints provider.


    PropagationHints = provider(fields = {
          "attributes" = "Attributes through which the cc_shared_library aspect should propagate"
     })
    ....
    propagation_attributes = []
    if PropagationHints in target:    
        propagation_attributes.extend(target[PropagationHints].attributes)
    elif hasattr(ctx.rule.attr, "deps") and type(ctx.rule.attr.deps) != "Target": # Original code before dadc49e437018f482640ed76fae5307daf9911a8
        propagation_attributes.append("deps")

    for attribute in propagation_attributes:
        for dep in ctx.rule.attr[attribute]:
            if GraphNodeInfo in dep:
                 children.append(dep[GraphNodeInfo])
  .....

I didn’t test the code but that’s the idea.

Looking at this a bit more I can imagine why you sent the change in the first place. First just to document what the problem is, let me rephrase the problem that Chris found during debugging.

The cc_shared_library implementation decides what to link statically based on the structure of the build graph, it uses the aspect to figure out that structure. It tries to link statically every library in the transitive closure of the direct deps of the shared library (i.e. roots) that is reachable without going through dynamic libraries. That’s why we need the structure of the graph.

Chris’ original change is correct and wrong at the same time. It makes libraries be linked statically that should be linked statically but weren’t linked before (because his project has custom C++ rules with different attribute names which is a valid use case) but it also links libraries statically that shouldn’t be linked statically.

The proposed fix with hard-coding the attribute names “srcs” and “hdrs” cuts the chain that builds the structure of the graph that propagates via genrules, filegroups, etc… I think that chain can be cut earlier by using required_providers on the aspect definition. This will skip the genrule and filegroup altogether, it also forces the custom C++ rules to advertise the CcInfo provider, so Chris you might want to check that this works for your project too, not just for Tensorflow.

If it works, I think the whole line if fieldname == "_cc_toolchain" or fieldname == "target_compatible_with" or fieldname == "srcs" or fieldname == "hdrs": can be removed, including the field names that were already there: _cc_toolchain and target_compatible_with.

Note: There is still one edge case that a precompiled library coming from a filegroup or generated by a genrule isn’t linked statically when it should, but if any project in the future bumped into that, they should wrap that target in a cc_import which would be the principled thing to do. This shouldn’t happen with TF now since by using the old code they were definitely not relying on any filegroup or genrule being linked that anyway.

Thanks a bunch! Have reproed the error successfully.

Still don’t have any leads on the underlying failure – it may take me a bit.

I can confirm I can reproduce this issue at https://github.com/tensorflow/tensorflow/commit/007f31c872c9c74f1298b7389d324e3947ede5ce with Bazel at dadc49e437018f482640ed76fae5307daf9911a8, but not at the parent commit of https://github.com/tensorflow/tensorflow/commit/007f31c872c9c74f1298b7389d324e3947ede5ce