tensorflow: multi-GPU training with MirroredStrategy hangs

System information

  • Have I written custom code: N/A
  • OS Platform and Distribution: CentOS Linux release 7.3.1611
  • TensorFlow installed from: (pip install tf-nightly-gpu)
  • TensorFlow version: Tensorflow(‘v1.9.0-rc2-5345-g57d31aa599’, ‘1.12.0-dev20181005’)
  • Bazel version: N/A
  • GPU model and memory: Tesla P40 24G
  • Exact command to reproduce: N/A
  • Mobile device: N/A
  • CUDA/cuDNN version: cuda 9.0 with cudnn7.1.4

I am training on multiple GPUs with TensorFlow, using MirroredStrategy and an Estimator. The problem: when I enable distribute mode with the following code, training gets stuck after running some training steps:

distribution = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=distribution)
estimator = tf.estimator.Estimator(model_fn=mymodel_fn, model_dir='logs',
        config=config)

But when I run without distribute mode, like this:

distribution = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig()
estimator = tf.estimator.Estimator(model_fn=mymodel_fn, model_dir='logs',
        config=config)

it runs fine. Why? Is this a bug in MirroredStrategy?

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 16 (3 by maintainers)

Most upvoted comments

With TF 1.12 I still have the issue that @magnofel encountered (the only difference from TF 1.11 is that it freezes before displaying INFO:tensorflow:Initialize system). @seemuch, do you have any update on this? Thanks a lot.

I have the same problem: training gets stuck after variable initialization.

  • Have I written custom code: N/A
  • OS Platform and Distribution: Ubuntu 18.04
  • TensorFlow installed from: (pip install tensorflow-gpu)
  • TensorFlow version: 1.11.0
  • Bazel version: N/A
  • GPU model and memory: 1080ti
  • Exact command to reproduce: see code below
  • Mobile device: N/A
  • CUDA/cuDNN version: cuda 10.0 with cudnn7.3.1.20

import tensorflow as tf
def model_fn(features, labels, mode):
  layer = tf.layers.Dense(1)
  logits = layer(features)

  if mode == tf.estimator.ModeKeys.PREDICT:
    predictions = {"logits": logits}
    return tf.estimator.EstimatorSpec(mode, predictions=predictions)

  loss = tf.losses.mean_squared_error(
      labels=labels, predictions=tf.reshape(logits, []))

  if mode == tf.estimator.ModeKeys.EVAL:
    return tf.estimator.EstimatorSpec(mode, loss=loss)

  if mode == tf.estimator.ModeKeys.TRAIN:
    train_op = tf.train.GradientDescentOptimizer(0.2).minimize(loss)
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
  features = tf.data.Dataset.from_tensors([[1.]]).repeat(100)
  labels = tf.data.Dataset.from_tensors(1.).repeat(100)
  return tf.data.Dataset.zip((features, labels))


distribution = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=distribution)
classifier = tf.estimator.Estimator(model_fn=model_fn, config=config)
classifier.train(input_fn=input_fn)
classifier.evaluate(input_fn=input_fn)

output:

INFO:tensorflow:Initializing RunConfig with distribution strategies.
INFO:tensorflow:Not using Distribute Coordinator.
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmpcqwt3jg0
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpcqwt3jg0', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.mirrored_strategy.MirroredStrategy object at 0x7fe0e5733e80>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe0e5733f98>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_distribute_coordinator_mode': None}
INFO:tensorflow:Device is available but not used by distribute strategy: /device:CPU:0
INFO:tensorflow:Configured nccl all-reduce.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:batch_all_reduce invoked for batches size = 2 with algorithm = nccl, num_packs = 1, agg_small_grads_max_bytes = 0 and agg_small_grads_max_group = 10
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpcqwt3jg0/model.ckpt.
INFO:tensorflow:Initialize system
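
The log above shows NCCL all-reduce being configured before the run stalls. One way to gather more detail in a case like this is to raise NCCL's own log level before TensorFlow is imported. This is only a diagnostic sketch: NCCL_DEBUG is an environment variable read by the NCCL library itself, and nothing in this thread confirms that it was used here or that it reveals the cause.

```python
import os

# NCCL reads its environment variables when it initializes, so set this
# before importing TensorFlow and before creating any distribution strategy.
os.environ["NCCL_DEBUG"] = "INFO"

# The rest of the reproduction script (import tensorflow, model_fn, input_fn,
# MirroredStrategy, Estimator) would follow unchanged from here.
print("NCCL_DEBUG =", os.environ["NCCL_DEBUG"])
```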

I also encountered the same problem. I use a dataset to read data, and the MirroredStrategy job hangs at the last batch if I use “shuffle->repeat->batch”. When I changed the order to “shuffle->batch->repeat”, the job finished correctly.

Setting drop_remainder in batch() does not solve the problem.
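
To make the difference between the two orderings concrete, here is a small pure-Python sketch (not TensorFlow code) that mimics how batch boundaries fall in each case. The helper names, the 10-element dataset, the 3 epochs and the batch size of 4 are all made up for illustration. It only shows where short batches appear; it does not claim to explain why one ordering hangs.

```python
import random


def batch(items, batch_size):
    """Split a flat list into consecutive batches; the last one may be short."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def shuffle_repeat_batch(data, epochs, batch_size, seed=0):
    """Mimic shuffle -> repeat -> batch: batches may span epoch boundaries."""
    rng = random.Random(seed)
    stream = []
    for _ in range(epochs):
        epoch = list(data)
        rng.shuffle(epoch)
        stream.extend(epoch)  # epochs are concatenated before batching
    return batch(stream, batch_size)


def shuffle_batch_repeat(data, epochs, batch_size, seed=0):
    """Mimic shuffle -> batch -> repeat: each epoch is batched on its own."""
    rng = random.Random(seed)
    batches = []
    for _ in range(epochs):
        epoch = list(data)
        rng.shuffle(epoch)
        batches.extend(batch(epoch, batch_size))
    return batches


data = list(range(10))
sizes_repeat_first = [len(b) for b in shuffle_repeat_batch(data, 3, 4)]
sizes_batch_first = [len(b) for b in shuffle_batch_repeat(data, 3, 4)]

print(sizes_repeat_first)  # [4, 4, 4, 4, 4, 4, 4, 2]
print(sizes_batch_first)   # [4, 4, 2, 4, 4, 2, 4, 4, 2]
```

With repeat before batch, the only short batch is the very last one of the whole run; with batch before repeat, every epoch ends with a short batch of the same size.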

@seemuch I have resolved my issue, and the fix may help others here (@jnd77 @magnofel @honeytidy). If you use an AMD Threadripper and a motherboard without PLX chips, go into the UEFI settings and disable IOMMU. NCCL is not compatible with it. You can find more details here and here

@patzm I think cloud providers test their instances for compatibility (at least GCP and AWS do, I think). And you can always disable it via the grub config, as @jnd77 did.

Thanks a lot @Luonic. With your links, we solved the issue. We disabled IOMMU via grub as mentioned here.
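
For anyone wanting to check whether IOMMU is active on a Linux machine before or after such a change, here is a minimal sketch. It rests on two assumptions about the system: that the kernel command line is readable at /proc/cmdline, and that the kernel exposes IOMMU groups under /sys/kernel/iommu_groups when an IOMMU is in use. The function names are hypothetical, and the thread does not say which exact boot parameter was set in grub; amd_iommu=off appears below only as sample input.

```python
import os


def iommu_params(cmdline):
    """Return the kernel command-line tokens that mention 'iommu'."""
    return [tok for tok in cmdline.split() if "iommu" in tok.lower()]


def iommu_group_count(path="/sys/kernel/iommu_groups"):
    """Count IOMMU groups under `path`; return 0 if it cannot be read."""
    try:
        return len(os.listdir(path))
    except OSError:
        return 0


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError:
        cmdline = ""  # not on Linux, or /proc is unavailable
    print("IOMMU-related boot parameters:", iommu_params(cmdline))
    print("IOMMU groups visible:", iommu_group_count())
```

A group count of zero suggests that no IOMMU is active; a non-zero count suggests it is still enabled.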