tensorflow: Unable to import frozen graph with batchnorm

Error when loading a frozen graph that uses tensorflow.contrib.layers.python.layers.batch_norm:

ValueError: graph_def is invalid at node u'BatchNorm/cond/AssignMovingAvg/Switch': Input tensor 'BatchNorm/moving_mean:0' Cannot convert a tensor of type float32 to an input of type float32_ref

freeze_graph.py doesn't seem to store moving_mean and moving_variance properly.

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 36
  • Comments: 80 (28 by maintainers)

Most upvoted comments

The full script I use to convert a checkpoint model to a protobuf graph is below, in case more people using batch norm layers find it useful.

"""
Convert model.ckpt to model.pb
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
from tensorflow.python.framework import graph_util

# create a session
sess = tf.Session()

# import best model
saver = tf.train.import_meta_graph('model.ckpt.meta') # graph
saver.restore(sess, 'model.ckpt') # variables

# get graph definition
gd = sess.graph.as_graph_def()

# fix batch norm nodes: freezing turns variables into constants (float32, not
# float32_ref), so the ref-typed ops in the moving-average updates must be rewritten
for node in gd.node:
  if node.op == 'RefSwitch':
    node.op = 'Switch'  # Switch accepts a plain tensor instead of a ref
    for index in xrange(len(node.input)):  # use range on Python 3
      if 'moving_' in node.input[index]:
        node.input[index] = node.input[index] + '/read'  # read the value, not the ref
  elif node.op == 'AssignSub':
    node.op = 'Sub'  # the moving-average update is dead code at inference time
    if 'use_locking' in node.attr: del node.attr['use_locking']

# generate protobuf ("logits_set" is this model's output node name; use your own)
converted_graph_def = graph_util.convert_variables_to_constants(sess, gd, ["logits_set"])
tf.train.write_graph(converted_graph_def, '/path/to/save/', 'model.pb', as_text=False)

@petewarden this is still a problem. It severely limits the ability to put models with batch norm into production (which is most models…).

This is hitting me too; it’s a bad bug that makes it hard to use BatchNorm in production settings.


@pavelgonchar This has worked for me:

# read graph definition
f = tf.gfile.FastGFile(model_path, 'rb')  # read the serialized GraphDef as bytes
graph_def = tf.GraphDef()
graph_def.ParseFromString(f.read())

# fix nodes
for node in graph_def.node:
  if node.op == 'RefSwitch':
    node.op = 'Switch'
    for index in xrange(len(node.input)):
      if 'moving_' in node.input[index]:
        node.input[index] = node.input[index] + '/read'
  elif node.op == 'AssignSub':
    node.op = 'Sub'
    if 'use_locking' in node.attr: del node.attr['use_locking']

# import graph into session
tf.import_graph_def(graph_def, name='')

I’ve only changed the inputs related to the “moving_variance” and “moving_mean”.

Fix batch norm nodes:

    for node in input_graph_def.node:
        if node.op == 'RefSwitch':
            node.op = 'Switch'
            for index in range(len(node.input)):
                if 'moving_' in node.input[index] and "Switch" not in node.input[index]:
                    node.input[index] = node.input[index] + '/read'
        elif node.op == 'AssignSub':
            node.op = 'Sub'
            if 'use_locking' in node.attr: del node.attr['use_locking']
        elif node.op == 'AssignAdd':
            node.op = 'Add'
            if 'use_locking' in node.attr: del node.attr['use_locking']

I added the condition "if 'moving_' in node.input[index] and 'Switch' not in node.input[index]:" and that solved my problem, thanks!

The workaround of @barbolo worked for me (for Python 3, change xrange to range). But it would be nice if native TensorFlow allowed freezing a graph with batch norm without these kinds of workarounds!


I found another workaround for this. Our implementation of batch norm used tf.cond() to distinguish between training-time and test-time behavior. At training time the variables in batch norm have to be updated, and this causes an error when those variables are converted to constants.

When freezing a graph for inference only, the update operations are still present in the frozen graph because tf.cond() chooses the behavior at run time, not at graph-construction time. The easiest solution for me was to generate two graphs that share all of their variables, one for training and one for testing. This way you can eliminate the call to tf.cond() and distinguish the behaviors at construction time. TensorFlow then correctly removes all the update operations when calling tf.graph_util.convert_variables_to_constants() on the inference output.

As other users have pointed out, one could also fix this using the 'blacklist' option of tf.graph_util.convert_variables_to_constants(). The downside of this is that unneeded ops are still present in the frozen graph.
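Here is a minimal sketch of that two-graph approach (TF 1.x, tf.layers; build_model and the placeholder shapes are hypothetical):

import tensorflow as tf

def build_model(x, is_training):
  net = tf.layers.conv2d(x, 32, 3, padding='same', name='conv1')
  # a Python bool for the training argument is resolved while the graph is
  # being built, so no tf.cond() ends up in the graph
  net = tf.layers.batch_normalization(net, training=is_training, name='bn1')
  net = tf.nn.relu(net)
  return tf.layers.dense(tf.layers.flatten(net), 10, name='logits')

train_images = tf.placeholder(tf.float32, [None, 224, 224, 3])
eval_images = tf.placeholder(tf.float32, [None, 224, 224, 3])

with tf.variable_scope('model'):
  train_logits = build_model(train_images, is_training=True)
with tf.variable_scope('model', reuse=True):  # same variables, inference behavior
  eval_logits = build_model(eval_images, is_training=False)

Calling tf.graph_util.convert_variables_to_constants() on the eval output then prunes the update ops automatically.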

Is there going to be a more comprehensive patch for this any time soon? I am surprised this issue has been open for almost a year with no action. This seems like a very big issue, since batch norm is so useful for training large, complex networks. It is not always practical to re-train the network without batch norm for deployment. Users should not be relying on hacks that edit the graph after the fact.

@pavelgonchar Your suggestion didn’t work for me:

# read graph definition
f = tf.gfile.FastGFile(model_path, 'rb')
graph_def = tf.GraphDef()
graph_def.ParseFromString(f.read())

# fix nodes (note: this appends '/read' to every input, not just the moving_* ones)
for node in graph_def.node:
  if node.op == 'RefSwitch':
    node.op = 'Switch'
    for index in xrange(len(node.input)):
      node.input[index] = node.input[index] + '/read'
  elif node.op == 'AssignSub':
    node.op = 'Sub'
    if 'use_locking' in node.attr: del node.attr['use_locking']

# import graph into session
tf.import_graph_def(graph_def, name='')

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 357, in import_graph_def
    % (input_name,)))
ValueError: graph_def is invalid at node u'conv1/BatchNorm/cond/AssignMovingAvg/Switch': Input tensor 'conv1/BatchNorm/cond/pred_id/read:0' not found in graph_def..

I solved this problem by using tf.layers.batch_normalization rather than tf.contrib.layers.batch_norm.
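A sketch of that swap (TF 1.x; net and is_training stand in for your own tensor and flag):

# before (contrib):
#   net = tf.contrib.layers.batch_norm(net, is_training=is_training)
# after (core layers):
net = tf.layers.batch_normalization(net, training=is_training)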

@XiaodanLi001 since I started using the TensorFlow Estimator API I have never had this problem again. However, there is still a trick: you must define your training op within a tf.control_dependencies context, like the snippet below. Although the batch_norm layer allocates some variables, they are not trainable but updatable, so you need to ensure that they are updated at every training step. I hope it helps.

# you must fetch this collection in order for the batch_norm variables to be updated
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(
        loss=total_loss, global_step=tf.train.get_global_step())

Note: I use tf.layers too.


Same problem. I trained a GoogLeNet model and get this error when importing the frozen graph:

ValueError: Input 0 of node save/Assign_41 was passed float from auxiliary_classifier_1/classifier/biases/Adam_1:0 incompatible with expected float_ref.

None of the answers I found on the net address this specific situation, where the problem is in the save op and the Adam optimizer.

Closing since @barbolo 's solution seems to work.

In general, the best route is to create a separate eval graph with is_training=False for batch norm, then freeze the training checkpoint into that graph; a sketch follows below.

Thanks!
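A minimal sketch of that separate-eval-graph route (TF 1.x), reusing the hypothetical build_model from the earlier sketch and assuming 'model.ckpt' was saved from a graph with matching variable names:

import tensorflow as tf
from tensorflow.python.framework import graph_util

tf.reset_default_graph()
images = tf.placeholder(tf.float32, [None, 224, 224, 3], name='input')
# rebuild the network in inference mode: no tf.cond and no update ops
logits = tf.identity(build_model(images, is_training=False), name='output')

with tf.Session() as sess:
    tf.train.Saver().restore(sess, 'model.ckpt')  # weights from the training run
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), ['output'])
    tf.train.write_graph(frozen, '.', 'model_eval.pb', as_text=False)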

@barbolo your answer works for me!!! You really saved my project… so many thanks!

I encountered the error ValueError: graph_def is invalid at node 'conv3_1/BatchNorm/AssignMovingAvg': Input tensor 'conv3_1/BatchNorm/moving_mean:0' Cannot convert a tensor of type float32 to an input of type float32_ref. when using the BatchNorm layer in slim. Not sure how to solve it…

I met the same problem when dealing with BN in ResNet. It seems that extract_sub_graph(input_graph_def, output_node_names), which graph_util.convert_variables_to_constants() calls internally, cuts off the moving_mean and moving_variance nodes. I solved it this way: save the model in train mode, then load the model in eval or test mode, run it, and save again; freeze that last model and you will find the moving_mean and moving_variance nodes. BTW, my goal was to get variables such as the mean, variance, and weights out of the frozen model, so I don't need to load it again.
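A sketch of that re-save step (TF 1.x; the checkpoint paths are hypothetical), after the graph has been rebuilt in eval/test mode:

saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, 'model_train.ckpt')  # weights saved in train mode
    saver.save(sess, 'model_eval.ckpt')      # checkpoint tied to the eval-mode graph

Freezing model_eval.ckpt then retains the moving_mean and moving_variance nodes.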

@drpngx @gunan - I think the state machine got confused here. @barbolo replied with his version but no tensorflower picked up the ball.