tensorflow: Error when using batch_norm with MirroredStrategy

System information

• Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
• OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
• TensorFlow installed from (source or binary): NVIDIA container: https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/rel_18.06.html#rel_18.06
• TensorFlow version (use command below): 1.8.0
• Python version: 3.5
• Bazel version (if compiling from source): N/A
• GCC/Compiler version (if compiling from source): N/A
• CUDA/cuDNN version: 9.0.176
• GPU model and memory: NVIDIA Tesla V100
• Exact command to reproduce: N/A

Describe the problem

I want to apply transfer learning with an existing pretrained network (Inception v4), using the tf-slim models. On a single GPU this works as expected; however, when using tf.contrib.distribute.MirroredStrategy I get an exception. Apparently MirroredStrategy has an issue with batch_norm.
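For context, the strategy is enabled through the Estimator's RunConfig, roughly along these lines (a simplified sketch; my_model_fn, train_spec and eval_spec stand in for the corresponding objects in my framework, it is not the exact code):

import tensorflow as tf

# Mirror the model across all visible GPUs on this machine.
distribution = tf.contrib.distribute.MirroredStrategy()
run_config = tf.estimator.RunConfig(train_distribute=distribution)

# my_model_fn builds the graph via construct_architecture() shown below.
estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)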

Source code / logs

I tried to extract the relevant part from my code:

import research.slim.nets.nets_factory as nets_factory

...

def construct_architecture(self, input_tensor, mode):
    # Build the Inception v4 topology via the tf-slim nets factory.
    network_fn = nets_factory.get_network_fn(
        'inception_v4',
        num_classes=self.configuration.get("nr_classes"),
        is_training=True)
    logits, _ = network_fn(input_tensor)

    if mode == tf.estimator.ModeKeys.PREDICT:
        predicted_classes = tf.argmax(logits, 1)
        predictions = {
            'class_ids': predicted_classes[:, tf.newaxis],
            'probabilities': tf.nn.softmax(logits),
            'logits': logits,
        }
        self.output_tensor = predictions
    else:
        self.output_tensor = logits
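The TRAIN/EVAL branches live elsewhere in the framework and are not shown here; a minimal sketch of what the training branch does, assuming the usual handling of slim's batch-norm update ops (placeholder names, not my exact code):

# Hypothetical sketch of the training branch: cross-entropy loss on the logits,
# with the train op made dependent on the batch-norm moving-average updates
# that slim registers in the UPDATE_OPS collection.
loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())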

This results in the following error trace:

Traceback (most recent call last):
  File "train.py", line 27, in <module>
    experimenter.run_training_experiment(config)
  File "/media/local/BDA_tf_framework/neuralnetwork/trainingexperimenter.py", line 39, in run_training_experiment
    tf.estimator.train_and_evaluate(self.trainer, self.training_specs, self.eval_specs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 451, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 590, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 691, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 376, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1143, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1255, in _train_model_distributed
    self.config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/distribute.py", line 777, in call_for_each_tower
    return self._call_for_each_tower(fn, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 308, in _call_for_each_tower
    coord.join(threads)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/onnx/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 519, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1133, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/media/local/BDA_tf_framework/neuralnetwork/trainingexperimenter.py", line 111, in get_model_fn
    self.configure_network(input_tensor=features, output_tensor=labels, mode=mode)
  File "/media/local/BDA_tf_framework/neuralnetwork/trainingexperimenter.py", line 75, in configure_network
    self.network.construct_network(input_tensor=input_tensor, output_tensor=output_tensor, mode=mode)
  File "/media/local/myfiles/mynetwork.py", line 64, in construct_network
    self.construct_architecture(input_tensor=input_tensor,mode=mode)
  File "/media/local/myfiles/mynetwork.py", line 48, in construct_architecture
    logits,_ = network_fn(input_tensor)
  File "/opt/tf-slim/models/research/slim/nets/nets_factory.py", line 141, in network_fn
    return func(images, num_classes, is_training=is_training, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/slim-0.1-py3.5.egg/nets/inception_v4.py", line 286, in inception_v4
  File "/usr/local/lib/python3.5/dist-packages/slim-0.1-py3.5.egg/nets/inception_v4.py", line 178, in inception_v4_base
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d
    conv_dims=2)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1066, in convolution
    outputs = normalizer_fn(outputs, **normalizer_params)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 650, in batch_norm
    outputs = layer.apply(inputs, training=is_training)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 805, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 362, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/normalization.py", line 158, in call
    return super(BatchNormalization, self).call(inputs, training=training)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/layers/normalization.py", line 514, in call
    outputs = self._fused_batch_norm(inputs, training=training)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/layers/normalization.py", line 420, in _fused_batch_norm
    momentum)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/layers/normalization.py", line 369, in _assign_moving_average
    with ops.colocate_with(variable):
  File "/usr/lib/python3.5/contextlib.py", line 59, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3939, in _colocate_with_for_gradient
    with self.colocate_with(op, ignore_existing):
  File "/usr/lib/python3.5/contextlib.py", line 59, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3992, in colocate_with
    op = internal_convert_to_tensor_or_indexed_slices(op, as_ref=True).op
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1255, in internal_convert_to_tensor_or_indexed_slices
    value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1094, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/values.py", line 414, in _tensor_conversion_mirrored
    assert not as_ref
AssertionError

Thanks for the help

Jonas

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

I’m seeing the same issue in 1.12.0 (from the tensorflow/tensorflow:1.12.0-gpu Docker image).

Minimal example to reproduce:

import tensorflow as tf
import numpy as np

print('tensorflow version: %s' % tf.__version__)

def input_fn():
    return (
        tf.data.Dataset.from_tensor_slices([0])
        .map(lambda _: tf.random_uniform([1], 0, np.pi * 2))
        .map(lambda x: (x, tf.sin(x)))
        .repeat()
        .batch(10)
    )

def model_fn(features, labels, mode):
    net = tf.layers.dense(features, units=20)
    net = tf.nn.tanh(net)
    net = tf.contrib.layers.batch_norm(net)  # this is the line that raises the AssertionError under MirroredStrategy
    net = tf.layers.dense(net, units=20)
    net = tf.nn.tanh(net)
    output = tf.layers.dense(net, units=1)

    if mode == tf.estimator.ModeKeys.TRAIN:
        loss = tf.reduce_mean(tf.pow(output - labels, 2))
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(tf.estimator.ModeKeys.TRAIN, loss=loss, train_op=train_op)

distribution = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=distribution)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)

estimator.train(input_fn=input_fn, steps=1000)

Output:

tensorflow version: 1.12.0
INFO:tensorflow:Initializing RunConfig with distribution strategies.
INFO:tensorflow:Not using Distribute Coordinator.
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmpF_Z_3F
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f29a0297d90>, '_model_dir': '/tmp/tmpF_Z_3F', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_experimental_distribute': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_evaluation_master': '', '_eval_distribute': None, '_train_distribute': <tensorflow.contrib.distribute.python.mirrored_strategy.MirroredStrategy object at 0x7f2a82dfe110>, '_master': '', '_distribute_coordinator_mode': None}
INFO:tensorflow:Device is available but not used by distribute strategy: /device:CPU:0
INFO:tensorflow:Device is available but not used by distribute strategy: /device:XLA_GPU:0
INFO:tensorflow:Device is available but not used by distribute strategy: /device:XLA_CPU:0
INFO:tensorflow:Configured nccl all-reduce.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Error reported to Coordinator: 
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 795, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "<ipython-input-114-26a55e49533e>", line 19, in model_fn
    net = tf.contrib.layers.batch_norm(net)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 596, in batch_norm
    scope=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 416, in _fused_batch_norm
    is_training, _delay_updates, moving_vars_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/utils.py", line 214, in smart_cond
    return static_cond(pred_value, fn1, fn2)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/utils.py", line 192, in static_cond
    return fn1()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 410, in _delay_updates
    moving_mean, mean, decay, zero_debias=zero_debias_moving_mean)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/moving_averages.py", line 84, in assign_moving_average
    with ops.colocate_with(variable):
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 4094, in _colocate_with_for_gradient
    with self.colocate_with(op, ignore_existing):
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 4146, in colocate_with
    op = internal_convert_to_tensor_or_indexed_slices(op, as_ref=True).op
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1307, in internal_convert_to_tensor_or_indexed_slices
    value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1146, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/python/values.py", line 439, in _tensor_conversion_mirrored
    assert not as_ref
AssertionError

I’ve replaced tf.contrib.layers.batch_norm(net) with tf.keras.layers.BatchNormalization()(net), but I’m seeing the same problem.
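For reference, the swap is just the one batch-norm line in the model_fn above, with everything else unchanged:

    net = tf.layers.dense(features, units=20)
    net = tf.nn.tanh(net)
    net = tf.keras.layers.BatchNormalization()(net)  # was: tf.contrib.layers.batch_norm(net); same AssertionError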

Same problem here as well, TF version 1.12.