tensorflow: Wrong order of dependencies after running freeze_graph and/or optimize_for_inference

I haven’t found any mention of this anywhere online. It makes the graph serializations completely useless for inference.

Steps to reproduce:

  • create a graph that contains tf.contrib.layers.batch_norm with a tf.bool tensor as the is_training argument (to force the use of a Switch node; see the sketch after this list)
  • run freeze_graph.freeze_graph and optimize_for_inference_lib.optimize_for_inference
  • load resulting graph on Android via TensorFlowInferenceInterface
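
For context, a minimal sketch of the first step, assuming TF 1.x contrib APIs (layer names and shapes are illustrative, not taken from the original model):

import tensorflow as tf

# A bool placeholder as is_training makes batch_norm emit a tf.cond,
# which inserts Switch nodes into the GraphDef.
x = tf.placeholder(tf.float32, [None, 28, 28, 1], name="x")
is_training = tf.placeholder(tf.bool, name="is_training")

with tf.variable_scope("conv1"):
    net = tf.contrib.layers.conv2d(x, 16, kernel_size=3)
    net = tf.contrib.layers.batch_norm(net, is_training=is_training, scope="bn1")
y = tf.identity(net, name="y")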

What happened: ADB Logcat shows the error message:

E/TensorFlowInferenceInterface: Failed to load model from 'file:///android_asset/optimized_model.pb': java.io.IOException: Not a valid TensorFlow Graph serialization: Node 'conv1/bn1/BatchNorm/cond/AssignMovingAvg/BatchNorm/moving_mean/sub_1/x': Control dependencies must come after regular dependencies

Why did this happen: Inspecting the GraphDef, I found that the order of a node’s inputs becomes inconsistent after processing.

Dependencies before processing:

input: "^conv1/bn1/BatchNorm/cond/AssignMovingAvg/BatchNorm/moving_mean/BatchNorm/BatchNorm/moving_mean"
input: "^conv1/bn1/BatchNorm/cond/AssignMovingAvg/BatchNorm/moving_mean/AssignAdd"
input: "^conv1/bn1/BatchNorm/cond/switch_t"

Dependencies after processing:

input: "^conv1/bn1/BatchNorm/cond/AssignMovingAvg/BatchNorm/moving_mean/BatchNorm/BatchNorm/moving_mean"
input: "^conv1/bn1/BatchNorm/cond/AssignMovingAvg/BatchNorm/moving_mean/AssignAdd"
input: "conv1/bn1/BatchNorm/cond/Switch:1"

What is wrong: Control dependencies (inputs starting with ‘^’) must come after the regular dependencies, but here a regular input now follows the control inputs.

Expected behaviour: The tools should reorder each node’s inputs to preserve this invariant: regular inputs first, control dependencies last.

Expected order of dependencies:

input: "conv1/bn1/BatchNorm/cond/Switch:1"
input: "^conv1/bn1/BatchNorm/cond/AssignMovingAvg/BatchNorm/moving_mean/BatchNorm/BatchNorm/moving_mean"
input: "^conv1/bn1/BatchNorm/cond/AssignMovingAvg/BatchNorm/moving_mean/AssignAdd"

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 34 (12 by maintainers)

Most upvoted comments

Can you try using the new Graph Transform Tool approach to optimizing for inference? https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms/#optimizing-for-deployment

I’m hoping to deprecate the old optimize_for_inference Python script soon, so it would be helpful to know if this works better.
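
If staying in Python is easier, the same transforms are also exposed through a Python wrapper. A sketch, assuming TF 1.x and a frozen binary GraphDef (file names are illustrative; the input/output names match a later comment in this thread):

import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

gd = tf.GraphDef()
with tf.gfile.FastGFile("frozen_model.pb", "rb") as f:
    gd.ParseFromString(f.read())

transformed = TransformGraph(
    gd,
    inputs=["x"],
    outputs=["y_conv"],
    transforms=[
        'strip_unused_nodes(type=float, shape="1,49,257,1")',
        "remove_nodes(op=Identity, op=CheckNumerics)",
        "fold_constants(ignore_errors=true)",
        "fold_batch_norms",
        "fold_old_batch_norms",
    ])

with tf.gfile.FastGFile("transformed_model.pb", "wb") as f:
    f.write(transformed.SerializeToString())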

@ronny3050 This admittedly simple Python script should work for you:

import tensorflow as tf
from google.protobuf import text_format

gd = tf.GraphDef()

# Note: this expects a text-format GraphDef; use ParseFromString for binary .pb files.
with tf.gfile.FastGFile("model.pb", "r") as f:
    text_format.Merge(f.read(), gd)

# Rewrite every Switch into an Identity that forwards only the data input;
# the predicate input is dropped, removing the need for DT_BOOL Switch kernels.
for node in gd.node:
    if node.op == "Switch":
        node.op = "Identity"
        del node.input[1]

# write_graph writes text format by default, matching the input above.
tf.train.write_graph(gd, ".", "fixed_model.pb")

Feel free to ask if there are further issues!
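
To sanity-check the rewrite before shipping it, one can try re-importing the result; a small sketch (it will raise if a consumer still references the Switch’s second output, which is the failure mode seen later in this thread):

import tensorflow as tf
from google.protobuf import text_format

gd = tf.GraphDef()
with tf.gfile.FastGFile("fixed_model.pb", "r") as f:
    text_format.Merge(f.read(), gd)

with tf.Graph().as_default():
    tf.import_graph_def(gd, name="")  # raises ValueError on an invalid graph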

and the winner is…

bazel build -c opt \
  --copt="-DSELECTIVE_REGISTRATION" \
  --copt="-DSUPPORT_SELECTIVE_REGISTRATION" \
  //tensorflow/contrib/android:libtensorflow_inference.so \
  --crosstool_top=//external:android/crosstool \
  --host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
  --cpu=armeabi-v7a

I get a 3.9 MB libtensorflow_inference.so that doesn’t crash the app because of TensorFlow code (well, it crashes later on, but because of a bug in MY code, which is much less frustrating since I’ll be able to fix that quickly).

So either the TensorFlow developers intended to have two different flags (SELECTIVE_REGISTRATION for ops and SUPPORT_SELECTIVE_REGISTRATION for types) and this is simply not documented, or it is a bug in the TensorFlow code.

Please fix! @petewarden @andrewharp

Hmm, managed to build the Android app with make and found why it wouldn’t build with Bazel (I had used a wrong git clone command that didn’t recurse into submodules).

Will now try to build the android app with Bazel and then with custom ops (selective) and kernels

Manually replaced the placeholder with a constant op (potentially in a bad way) and re-ran the graph through transform_graph -> no change to the Switch ops that are fed with a constant false.

That’s it… let’s build CPU:DT_BOOL for android… 😕

OK, built TensorFlow from source (30 minutes O.O), retrained my model (one epoch took about as long as with the prebuilt TensorFlow binary, despite the added CPU instructions? I didn’t add any optimisation flags; there is no documentation for that anyway), and built summarize_graph (C++ takes soooo long to compile 😕, like 8 minutes for a simple utility).

Also tried this command:

bazel build tensorflow/tools/graph_transforms:transform_graph
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=tensorflow_inception_graph.pb \
  --out_graph=optimized_inception_graph.pb \
  --inputs='Mul' \
  --outputs='softmax' \
  --transforms='
    strip_unused_nodes(type=float, shape="1,299,299,3")
    remove_nodes(op=Identity, op=CheckNumerics)
    fold_constants(ignore_errors=true)
    fold_batch_norms
    fold_old_batch_norms'

Loaded the .pb file and exported it as .pbtxt to see the differences (went from 370 kB to 29 kB, nice). There should have been quite a lot of optimizations in there…

Yet the DT_BOOL and Keras learning-phase stuff is still in there (obviously, because the keras_learning_phase placeholder hasn’t been replaced by a constant op).

And… looking at the docs for the transform_graph tool, I don’t see any transform that replaces a placeholder op (holding a single bool) with a const op… 😕 sigh… Am I supposed to write a custom transform function for that?

And if I do, will the Switch ops disappear during an optimizing pass?

/me begins to think that he’ll throw in the towel and just build TensorFlow for Android with the CPU:BOOL kernel (why is there support for GPU:BOOL and not for CPU:BOOL anyway?)

I tried replacing a Placeholder node holding a boolean, namely

node {
  name: "dropout_1/keras_learning_phase"
  op: "Placeholder"
  attr {
    key: "dtype"
    value { type: DT_BOOL }
  }
  attr {
    key: "shape"
    value { shape { } }
  }
}

with this node (not sure if it is correct):

node {
  name: "dropout_1/keras_learning_phase"
  op: "Const"
  attr {
    key: "dtype"
    value { type: DT_BOOL }
  }
  attr {
    key: "value"
    value {
      tensor {
        dtype: DT_BOOL
        tensor_shape { }
        bool_val: false
      }
    }
  }
}

and then running it through

import tensorflow as tf
from tensorflow.python.framework import dtypes
from tensorflow.python.tools.optimize_for_inference_lib import optimize_for_inference
from google.protobuf import text_format

gd = tf.GraphDef()
with tf.gfile.FastGFile(self.exportPath + self.modelName + "Constant.pbtxt", "r") as f:
    text_format.Merge(f.read(), gd)

tf.train.write_graph(gd, self.exportPath, self.modelName + "Test.pbtxt")
optimized_graph_def = optimize_for_inference(
    input_graph_def=gd,
    input_node_names="conv2d_1_input".split(","),
    output_node_names="activation_6/Softmax".split(","),
    placeholder_type_enum=dtypes.float32.as_datatype_enum)
optimizedGraphPath = self.exportPath + self.modelName + "ConstantOptimized.pbtxt"
tf.train.write_graph(optimized_graph_def, self.exportPath,
                     self.modelName + "ConstantOptimized.pbtxt")
self.freeze(self.modelName + "ConstantOptimized", self.modelName + "-" + str(self.step))

But when freezing, I ended up with ValueError: graph_def is invalid at node u'dropout_1/cond/mul/y': More inputs specified ('dropout_1/cond/Switch:1') than the op expects…

Was my attempt correct?
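
For what it’s worth, the same Placeholder→Const swap can be scripted instead of hand-editing the pbtxt. A minimal sketch, assuming TF 1.x (the node name follows the example above; file names are illustrative):

import tensorflow as tf
from tensorflow.core.framework import attr_value_pb2
from google.protobuf import text_format

gd = tf.GraphDef()
with tf.gfile.FastGFile("modelConstant.pbtxt", "r") as f:
    text_format.Merge(f.read(), gd)

for node in gd.node:
    if node.name == "dropout_1/keras_learning_phase" and node.op == "Placeholder":
        node.op = "Const"
        del node.attr["shape"]  # a Const carries a 'value' attr instead of 'shape'
        node.attr["value"].CopyFrom(attr_value_pb2.AttrValue(
            tensor=tf.make_tensor_proto(False, dtype=tf.bool)))

tf.train.write_graph(gd, ".", "model_const.pbtxt")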

Next I’m going to try to build TensorFlow with Bazel… I really need those graph transform tools.

I happened to encounter the exact same issue (with DT_BOOL) crashing my app at run time. I’ve been trying to work around it for two days now (trying to remove the keras_learning_phase Switch node branch)… this is frustrating.

I had to manually turn Switch ops into Identity ops. Effectively, is_training is now permanently False. Seems to be issue #6124, maybe related to #5919

@petewarden Okay, I have used the new Graph Transform Tool with arguments:

--inputs='x' --outputs='y_conv' \
--transforms='
  strip_unused_nodes(type=float, shape="1,49,257,1")
  remove_nodes(op=Identity, op=CheckNumerics)
  fold_constants(ignore_errors=true)
  fold_batch_norms
  fold_old_batch_norms'

and all of the previous error messages are gone. The serialization works just fine now, but when trying to run the model:

Inference exception: java.lang.IllegalArgumentException: No OpKernel was registered to support Op 'Switch' with these attrs.  Registered devices: [CPU], Registered kernels:
        device='CPU'; T in [DT_FLOAT]
        device='CPU'; T in [DT_INT32]
        device='GPU'; T in [DT_STRING]
        device='GPU'; T in [DT_BOOL]
        device='GPU'; T in [DT_INT32]
        device='GPU'; T in [DT_FLOAT]
        
        [[Node: drop/cond/Switch = Switch[T=DT_BOOL](is_training/_0__cf__0, is_training/_0__cf__0)]]
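
A quick way to check whether a transformed graph still contains these problematic ops before deploying it (a sketch; the file name is illustrative):

import tensorflow as tf

gd = tf.GraphDef()
with tf.gfile.FastGFile("optimized_model.pb", "rb") as f:
    gd.ParseFromString(f.read())

# The Android runtime registers no CPU kernel for Switch with T=DT_BOOL,
# so any such node will fail at run time exactly as in the log above.
for node in gd.node:
    if node.op == "Switch" and node.attr["T"].type == tf.bool.as_datatype_enum:
        print("DT_BOOL Switch still present:", node.name)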