tensorflow: Error when using batch_norm with MirroredStrategy
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- TensorFlow installed from (source or binary): NVIDIA container: https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/rel_18.06.html#rel_18.06
- TensorFlow version (use command below): 1.8.0
- Python version: 3.5
- Bazel version (if compiling from source): N/A
- GCC/Compiler version (if compiling from source): N/A
- CUDA/cuDNN version: 9.0.176
- GPU model and memory: NVIDIA Tesla V100
- Exact command to reproduce: N/A
Describe the problem
I want to apply transfer learning with an existing pretrained network (Inception v4), using the tf-slim models. When running this on a single GPU it works as expected, but when using tf.contrib.distribute.MirroredStrategy I get an exception. Apparently MirroredStrategy has some issues with batch_norm.
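For reference, this is roughly how the strategy is wired into the Estimator (a simplified sketch; the num_gpus value and the commented-out names are placeholders, not my exact code):

import tensorflow as tf

# MirroredStrategy replicates the model onto each GPU ("tower") and is
# handed to the Estimator through RunConfig.
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)  # example value
run_config = tf.estimator.RunConfig(train_distribute=strategy)

# The Estimator is then built as usual; model_fn ends up calling the
# construct_architecture() method shown below.
# estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
# tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)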
Source code / logs
I tried to extract the relevant part from my code:
import tensorflow as tf
import research.slim.nets.nets_factory as nets_factory
...

def construct_architecture(self, input_tensor, mode):
    # network topology: build Inception v4 from the tf-slim model zoo
    network_fn = nets_factory.get_network_fn(
        'inception_v4',
        num_classes=self.configuration.get("nr_classes"),
        is_training=True)
    logits, _ = network_fn(input_tensor)

    if mode == tf.estimator.ModeKeys.PREDICT:
        predicted_classes = tf.argmax(logits, 1)
        predictions = {
            'class_ids': predicted_classes[:, tf.newaxis],
            'probabilities': tf.nn.softmax(logits),
            'logits': logits,
        }
        self.output_tensor = predictions
    else:
        self.output_tensor = logits
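For context, this method is called from the Estimator's model_fn. A simplified sketch of that wiring (MyNetwork, the loss and the optimizer here are placeholders, not my exact setup):

def model_fn(features, labels, mode, params):
    network = MyNetwork(params)  # placeholder for my wrapper class holding the configuration
    network.construct_architecture(input_tensor=features, mode=mode)

    if mode == tf.estimator.ModeKeys.PREDICT:
        # output_tensor holds the predictions dict in this mode
        return tf.estimator.EstimatorSpec(mode, predictions=network.output_tensor)

    logits = network.output_tensor
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # batch_norm moving-average updates live in UPDATE_OPS and must run with the train op
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        train_op = tf.train.AdamOptimizer().minimize(
            loss, global_step=tf.train.get_global_step())

    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)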
This results in the following error trace:
Traceback (most recent call last):
File "train.py", line 27, in <module>
experimenter.run_training_experiment(config)
File "/media/local/BDA_tf_framework/neuralnetwork/trainingexperimenter.py", line 39, in run_training_experiment
tf.estimator.train_and_evaluate(self.trainer, self.training_specs, self.eval_specs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 451, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 590, in run
return self.run_local()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 691, in run_local
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 376, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1143, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1255, in _train_model_distributed
self.config)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/distribute.py", line 777, in call_for_each_tower
return self._call_for_each_tower(fn, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 308, in _call_for_each_tower
coord.join(threads)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/onnx/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 519, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1133, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/media/local/BDA_tf_framework/neuralnetwork/trainingexperimenter.py", line 111, in get_model_fn
self.configure_network(input_tensor=features, output_tensor=labels, mode=mode)
File "/media/local/BDA_tf_framework/neuralnetwork/trainingexperimenter.py", line 75, in configure_network
self.network.construct_network(input_tensor=input_tensor, output_tensor=output_tensor, mode=mode)
File "/media/local/myfiles/mynetwork.py", line 64, in construct_network
self.construct_architecture(input_tensor=input_tensor,mode=mode)
File "/media/local/myfiles/mynetwork.py", line 48, in construct_architecture
logits,_ = network_fn(input_tensor)
File "/opt/tf-slim/models/research/slim/nets/nets_factory.py", line 141, in network_fn
return func(images, num_classes, is_training=is_training, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/slim-0.1-py3.5.egg/nets/inception_v4.py", line 286, in inception_v4
File "/usr/local/lib/python3.5/dist-packages/slim-0.1-py3.5.egg/nets/inception_v4.py", line 178, in inception_v4_base
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1154, in convolution2d
conv_dims=2)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1066, in convolution
outputs = normalizer_fn(outputs, **normalizer_params)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 650, in batch_norm
outputs = layer.apply(inputs, training=is_training)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 805, in apply
return self.__call__(inputs, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 362, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/normalization.py", line 158, in call
return super(BatchNormalization, self).call(inputs, training=training)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/layers/normalization.py", line 514, in call
outputs = self._fused_batch_norm(inputs, training=training)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/layers/normalization.py", line 420, in _fused_batch_norm
momentum)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/layers/normalization.py", line 369, in _assign_moving_average
with ops.colocate_with(variable):
File "/usr/lib/python3.5/contextlib.py", line 59, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3939, in _colocate_with_for_gradient
with self.colocate_with(op, ignore_existing):
File "/usr/lib/python3.5/contextlib.py", line 59, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3992, in colocate_with
op = internal_convert_to_tensor_or_indexed_slices(op, as_ref=True).op
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1255, in internal_convert_to_tensor_or_indexed_slices
value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1094, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/values.py", line 414, in _tensor_conversion_mirrored
assert not as_ref
AssertionError
Thanks for the help
Jonas
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 15 (5 by maintainers)
I’m seeing the same issue in 1.12.0 (from the tensorflow/tensorflow:1.12.0-gpu Docker image). Minimal example to reproduce:

Output:

I’ve replaced tf.contrib.layers.batch_norm(net) with tf.keras.layers.BatchNormalization()(net) but I’m seeing the same problem.

Same problem here as well. tf version 1.12.
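A minimal repro along the lines described above (a sketch only; the model, input pipeline and num_gpus here are my assumptions, not the commenter's exact code) could look like this on TF 1.12:

import numpy as np
import tensorflow as tf

def model_fn(features, labels, mode):
    # tiny conv net with a contrib batch_norm layer, enough to exercise the fused path
    net = tf.layers.conv2d(features, 8, 3)
    net = tf.contrib.layers.batch_norm(
        net, is_training=(mode == tf.estimator.ModeKeys.TRAIN))
    logits = tf.layers.dense(tf.layers.flatten(net), 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    # random data, only used to drive graph construction and a few steps
    images = np.random.rand(16, 32, 32, 3).astype(np.float32)
    labels = np.random.randint(0, 10, size=16).astype(np.int64)
    return tf.data.Dataset.from_tensor_slices((images, labels)).batch(4).repeat()

strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn, steps=10)

The expectation, per the reports above, is that this fails with the same AssertionError during graph construction under MirroredStrategy, while the single-GPU run is fine.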