tensorflow: Error when trying to use tf.contrib.distribute.MirroredStrategy in tf.estimator

System information

  • OS Platform and Distribution: Linux Ubuntu 16.04.2
  • TensorFlow installed from: source
  • TensorFlow version: 1.7
  • Python version: 3.5
  • Bazel version: 0.11.1
  • GCC/Compiler version: 5.4.0
  • CUDA/cuDNN version: 9/7
  • GPU model and memory: TITAN X (Pascal) 11170 MB memory

I’m trying to add multi-gpu support to my tensorflow training code using tf.contrib.distribute.MirroredStrategy as a parameter to tf.estimator.RunConfig. I get the following error message:

Traceback (most recent call last):
  File "python3.5/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "python3.5/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 248, in _call_for_each_tower
    self, *merge_args, **merge_kwargs)
  File "python3.5/site-packages/tensorflow/python/training/optimizer.py", line 667, in _distributed_apply
    reduced_grads = distribution.batch_reduce("sum", grads_and_vars)
  File "python3.5/site-packages/tensorflow/python/training/distribute.py", line 801, in batch_reduce
    return self._batch_reduce(method_string, value_destination_pairs)
  File "python3.5/site-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 295, in _batch_reduce
    value_destination_pairs)
  File "python3.5/site-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 169, in batch_reduce
    raise ValueError("`value_destination_pairs` must be a list or a tuple of "
ValueError: `value_destination_pairs` must be a list or a tuple of tuples of PerDevice objects and destinations 

The following code produces the error (I omitted the code for parsing the tfrecord to image tensor as I don’t believe this code effects the error, but I can add it if necessary):

import glob, os
import tensorflow as tf
slim = tf.contrib.slim

# Initialization of args - arguments parser.
def init():
    pass


def input_fn():

    dataset = tf.data.TFRecordDataset(glob.glob(os.path.join(args.train_data_dir, 'train*')))
    dataset = dataset.map(
                lambda x: parse_and_preprocess_image(x, args.image_size),
                num_parallel_calls=2,
    )
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size=4)
    dataset = dataset.prefetch(1)

    return dataset


def model_fn(features, labels=None, mode=tf.estimator.ModeKeys.TRAIN, params=None):

    train_images_batch = features
    res = slim.conv2d(inputs=train_images_batch, kernel_size=9, stride=1, num_outputs=3, scope='conv1')
    loss = tf.reduce_mean((train_images_batch - res) ** 2)
    optimizer = tf.train.AdamOptimizer(0.001)
    train_op = slim.learning.create_train_op(loss, optimizer)
    return tf.estimator.EstimatorSpec(
        mode=tf.estimator.ModeKeys.TRAIN,
        loss=loss, train_op=train_op)


def train():

    init()

    distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=args.num_gpus)

    config = tf.estimator.RunConfig(
        model_dir=args.log_dir,
        train_distribute=distribution,
    )

    estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
    estimator.train(
            input_fn=input_fn,
            max_steps=args.train_steps,
        )


def main():
    train()


if __name__ == '__main__':
    main()

Thank you! Adva

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 4
  • Comments: 18 (10 by maintainers)

Most upvoted comments

Oh, I remembered this bug that we found during the 1.8RCs.

If you update to the latest stable release then the problem goes away.

Hi I tested on a multi-gpu machine and got a different error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Creating a partition for /device:CPU:7 which doesn't exist in the list of available devices. Available devices: /device:CPU:0,/device:GPU:0,/device:GPU:1,/device:GPU:2,/device:GPU:3,/device:GPU:4,/device:GPU:5,/device:GPU:6,/device:GPU:7

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "conditional_instance_norm/cin/bug.py", line 80, in <module>
    main()
  File "conditional_instance_norm/cin/bug.py", line 76, in main
    train()
  File "conditional_instance_norm/cin/bug.py", line 68, in train
    max_steps=args.train_steps,
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 363, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 841, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 977, in _train_model_distributed
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1059, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 567, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1043, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1134, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1119, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1191, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 971, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Creating a partition for /device:CPU:7 which doesn't exist in the list of available devices. Available devices: /device:CPU:0,/device:GPU:0,/device:GPU:1,/device:GPU:2,/device:GPU:3,/device:GPU:4,/device:GPU:5,/device:GPU:6,/device:GPU:7