tensorflow: Wrong order of dependencies after running freeze_graph and/or optimize_for_inference

I haven’t found any mention of this anywhere online. It makes the graph serializations completely useless for inference.

Steps to reproduce:

  • create a graph that contains tf.contrib.layers.batch_norm with a tf.bool tensor as the is_training argument (to force the use of a Switch node; see the sketch after this list)
  • run freeze_graph.freeze_graph and optimize_for_inference_lib.optimize_for_inference
  • load resulting graph on Android via TensorFlowInferenceInterface
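
For context, a minimal sketch of the first step, assuming TF 1.x contrib APIs (layer names and shapes are illustrative, not taken from the original model):

import tensorflow as tf

# A bool placeholder as is_training makes batch_norm emit a tf.cond,
# which inserts Switch nodes into the GraphDef.
x = tf.placeholder(tf.float32, [None, 28, 28, 1], name="x")
is_training = tf.placeholder(tf.bool, name="is_training")

with tf.variable_scope("conv1"):
    net = tf.contrib.layers.conv2d(x, 16, kernel_size=3)
    net = tf.contrib.layers.batch_norm(net, is_training=is_training, scope="bn1")
y = tf.identity(net, name="y")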

What happened: ADB Logcat shows the error message:

E/TensorFlowInferenceInterface: Failed to load model from 'file:///android_asset/optimized_model.pb': java.io.IOException: Not a valid TensorFlow Graph serialization: Node 'conv1/bn1/BatchNorm/cond/AssignMovingAvg/BatchNorm/moving_mean/sub_1/x': Control dependencies must come after regular dependencies

Why did this happen: Inspecting the GraphDef, I found that the order of a node’s inputs becomes inconsistent after processing.

Dependencies before processing:

input: "^conv1/bn1/BatchNorm/cond/AssignMovingAvg/BatchNorm/moving_mean/BatchNorm/BatchNorm/moving_mean"
input: "^conv1/bn1/BatchNorm/cond/AssignMovingAvg/BatchNorm/moving_mean/AssignAdd"
input: "^conv1/bn1/BatchNorm/cond/switch_t"

Dependencies after processing:

input: "^conv1/bn1/BatchNorm/cond/AssignMovingAvg/BatchNorm/moving_mean/BatchNorm/BatchNorm/moving_mean"
input: "^conv1/bn1/BatchNorm/cond/AssignMovingAvg/BatchNorm/moving_mean/AssignAdd"
input: "conv1/bn1/BatchNorm/cond/Switch:1"

What is wrong: Control dependencies (inputs starting with ‘^’) must come after the regular dependencies, but here a regular input now follows the control inputs.

Expected behaviour: The tools should reorder each node’s inputs to preserve this invariant: regular inputs first, control dependencies last.

Expected order of dependencies:

input: "conv1/bn1/BatchNorm/cond/Switch:1"
input: "^conv1/bn1/BatchNorm/cond/AssignMovingAvg/BatchNorm/moving_mean/BatchNorm/BatchNorm/moving_mean"
input: "^conv1/bn1/BatchNorm/cond/AssignMovingAvg/BatchNorm/moving_mean/AssignAdd"

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 34 (12 by maintainers)

Most upvoted comments

Can you try using the new Graph Transform Tool approach to optimizing for inference? https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms/#optimizing-for-deployment

I’m hoping to deprecate the old optimize_for_inference Python script soon, so it would be helpful to know if this works better.
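
If staying in Python is easier, the same transforms are also exposed through a Python wrapper. A sketch, assuming TF 1.x and a frozen binary GraphDef (file names are illustrative; the input/output names match a later comment in this thread):

import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

gd = tf.GraphDef()
with tf.gfile.FastGFile("frozen_model.pb", "rb") as f:
    gd.ParseFromString(f.read())

transformed = TransformGraph(
    gd,
    inputs=["x"],
    outputs=["y_conv"],
    transforms=[
        'strip_unused_nodes(type=float, shape="1,49,257,1")',
        "remove_nodes(op=Identity, op=CheckNumerics)",
        "fold_constants(ignore_errors=true)",
        "fold_batch_norms",
        "fold_old_batch_norms",
    ])

with tf.gfile.FastGFile("transformed_model.pb", "wb") as f:
    f.write(transformed.SerializeToString())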

@ronny3050 This admittedly simple Python script should work for you:

import tensorflow as tf
from google.protobuf import text_format

gd = tf.GraphDef()

# Note: this expects a text-format GraphDef; use ParseFromString for binary .pb files.
with tf.gfile.FastGFile("model.pb", "r") as f:
    text_format.Merge(f.read(), gd)

# Rewrite every Switch into an Identity that forwards only the data input;
# the predicate input is dropped, removing the need for DT_BOOL Switch kernels.
for node in gd.node:
    if node.op == "Switch":
        node.op = "Identity"
        del node.input[1]

# write_graph writes text format by default, matching the input above.
tf.train.write_graph(gd, ".", "fixed_model.pb")

Feel free to ask if there are further issues!
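
To sanity-check the rewrite before shipping it, one can try re-importing the result; a small sketch (it will raise if a consumer still references the Switch’s second output, which is the failure mode seen later in this thread):

import tensorflow as tf
from google.protobuf import text_format

gd = tf.GraphDef()
with tf.gfile.FastGFile("fixed_model.pb", "r") as f:
    text_format.Merge(f.read(), gd)

with tf.Graph().as_default():
    tf.import_graph_def(gd, name="")  # raises ValueError on an invalid graph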

and the winner is…

bazel build -c opt \
  --copt="-DSELECTIVE_REGISTRATION" \
  --copt="-DSUPPORT_SELECTIVE_REGISTRATION" \
  //tensorflow/contrib/android:libtensorflow_inference.so \
  --crosstool_top=//external:android/crosstool \
  --host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
  --cpu=armeabi-v7a

I get a 3.9 MB libtensorflow_inference.so that doesn’t crash the app because of TensorFlow code (well, it crashes later on, but because of a bug in MY code, which is much less frustrating since I’ll be able to fix that quickly).

So either the TensorFlow developers intended to have two different flags (SELECTIVE_REGISTRATION for ops and SUPPORT_SELECTIVE_REGISTRATION for types) and this is simply not documented, or it is a bug in the TensorFlow code.

Please fix! @petewarden @andrewharp

Hmm, managed to build the Android app with make and found why it wouldn’t build with Bazel (I had used a wrong git clone command that didn’t recurse into submodules).

Will now try to build the android app with Bazel and then with custom ops (selective) and kernels

Manually replaced the placeholder with a constant op (potentially in a bad way) and re-ran the graph through transform_graph -> no change to the Switch ops that are fed with a constant false.

That’s it… let’s build CPU:DT_BOOL for android… 😕

OK, built TensorFlow from source (30 minutes O.O), retrained my model (one epoch took about as long as with the prebuilt TensorFlow binary, despite the added CPU instructions? I didn’t add any optimisation flags; there is no documentation for that anyway), and built summarize_graph (C++ takes soooo long to compile 😕, like 8 minutes for a simple utility).

Also tried this command:

bazel build tensorflow/tools/graph_transforms:transform_graph
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=tensorflow_inception_graph.pb \
  --out_graph=optimized_inception_graph.pb \
  --inputs='Mul' \
  --outputs='softmax' \
  --transforms='
    strip_unused_nodes(type=float, shape="1,299,299,3")
    remove_nodes(op=Identity, op=CheckNumerics)
    fold_constants(ignore_errors=true)
    fold_batch_norms
    fold_old_batch_norms'

Loaded the .pb file and exported it as .pbtxt to see the differences (went from 370 kB to 29 kB, nice). There should have been quite a lot of optimizations in there…

Yet the DT_BOOL and Keras learning-phase stuff is still in there (obviously, because the keras_learning_phase placeholder hasn’t been replaced by a constant op).

And… looking at the docs for the transform_graph tool, I don’t see any transform that replaces a placeholder op (holding a single bool) with a const op… 😕 sigh… Am I supposed to write a custom transform function for that?

And if I do, will the Switch ops disappear during an optimizing pass?

/me begins to think that he’ll throw in the towel and just build TensorFlow for Android with the CPU:BOOL kernel (why is there support for GPU:BOOL and not for CPU:BOOL anyway?)

I tried replacing a Placeholder node holding a boolean, namely

node {
  name: "dropout_1/keras_learning_phase"
  op: "Placeholder"
  attr {
    key: "dtype"
    value { type: DT_BOOL }
  }
  attr {
    key: "shape"
    value { shape { } }
  }
}

with this node (not sure if it is correct):

node {
  name: "dropout_1/keras_learning_phase"
  op: "Const"
  attr {
    key: "dtype"
    value { type: DT_BOOL }
  }
  attr {
    key: "value"
    value {
      tensor {
        dtype: DT_BOOL
        tensor_shape { }
        bool_val: false
      }
    }
  }
}

and then running it through

import tensorflow as tf
from tensorflow.python.framework import dtypes
from tensorflow.python.tools.optimize_for_inference_lib import optimize_for_inference
from google.protobuf import text_format

gd = tf.GraphDef()
with tf.gfile.FastGFile(self.exportPath + self.modelName + "Constant.pbtxt", "r") as f:
    text_format.Merge(f.read(), gd)

tf.train.write_graph(gd, self.exportPath, self.modelName + "Test.pbtxt")
optimized_graph_def = optimize_for_inference(
    input_graph_def=gd,
    input_node_names="conv2d_1_input".split(","),
    output_node_names="activation_6/Softmax".split(","),
    placeholder_type_enum=dtypes.float32.as_datatype_enum)
optimizedGraphPath = self.exportPath + self.modelName + "ConstantOptimized.pbtxt"
tf.train.write_graph(optimized_graph_def, self.exportPath,
                     self.modelName + "ConstantOptimized.pbtxt")
self.freeze(self.modelName + "ConstantOptimized", self.modelName + "-" + str(self.step))

But when freezing, I ended up with ValueError: graph_def is invalid at node u'dropout_1/cond/mul/y': More inputs specified ('dropout_1/cond/Switch:1') than the op expects…

Was my attempt correct?
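
For what it’s worth, the same Placeholder→Const swap can be scripted instead of hand-editing the pbtxt. A minimal sketch, assuming TF 1.x (the node name follows the example above; file names are illustrative):

import tensorflow as tf
from tensorflow.core.framework import attr_value_pb2
from google.protobuf import text_format

gd = tf.GraphDef()
with tf.gfile.FastGFile("modelConstant.pbtxt", "r") as f:
    text_format.Merge(f.read(), gd)

for node in gd.node:
    if node.name == "dropout_1/keras_learning_phase" and node.op == "Placeholder":
        node.op = "Const"
        del node.attr["shape"]  # a Const carries a 'value' attr instead of 'shape'
        node.attr["value"].CopyFrom(attr_value_pb2.AttrValue(
            tensor=tf.make_tensor_proto(False, dtype=tf.bool)))

tf.train.write_graph(gd, ".", "model_const.pbtxt")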

Next I’m going to try to build TensorFlow with Bazel… I really need those graph transform tools.

I happened to encounter the exact same issue (with DT_BOOL) crashing my app at run time. I’ve been trying to work around it for two days now (trying to remove the keras_learning_phase Switch node branch)… this is frustrating.

I had to manually turn Switch ops into Identity ops. Effectively, is_training is now permanently False. Seems to be issue #6124, maybe related to #5919

@petewarden Okay, I have used the new Graph Transform Tool with arguments:

--inputs='x' --outputs='y_conv' \
--transforms='
  strip_unused_nodes(type=float, shape="1,49,257,1")
  remove_nodes(op=Identity, op=CheckNumerics)
  fold_constants(ignore_errors=true)
  fold_batch_norms
  fold_old_batch_norms'

and all of the previous error messages are gone. The serialization works just fine now, but when trying to run the model:

Inference exception: java.lang.IllegalArgumentException: No OpKernel was registered to support Op 'Switch' with these attrs.  Registered devices: [CPU], Registered kernels:
        device='CPU'; T in [DT_FLOAT]
        device='CPU'; T in [DT_INT32]
        device='GPU'; T in [DT_STRING]
        device='GPU'; T in [DT_BOOL]
        device='GPU'; T in [DT_INT32]
        device='GPU'; T in [DT_FLOAT]
        
        [[Node: drop/cond/Switch = Switch[T=DT_BOOL](is_training/_0__cf__0, is_training/_0__cf__0)]]
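
A quick way to check whether a transformed graph still contains these problematic ops before deploying it (a sketch; the file name is illustrative):

import tensorflow as tf

gd = tf.GraphDef()
with tf.gfile.FastGFile("optimized_model.pb", "rb") as f:
    gd.ParseFromString(f.read())

# The Android runtime registers no CPU kernel for Switch with T=DT_BOOL,
# so any such node will fail at run time exactly as in the log above.
for node in gd.node:
    if node.op == "Switch" and node.attr["T"].type == tf.bool.as_datatype_enum:
        print("DT_BOOL Switch still present:", node.name)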