tensorflow: training with multiple GPUs using MirroredStrategy hangs
System information
- Have I written custom code: N/A
- OS Platform and Distribution: CentOS Linux release 7.3.1611
- TensorFlow installed from: pip install tf-nightly-gpu
- TensorFlow version: ('v1.9.0-rc2-5345-g57d31aa599', '1.12.0-dev20181005')
- Bazel version: N/A
- GPU model and memory: Tesla P40, 24 GB
- Exact command to reproduce: N/A
- Mobile device: N/A
- CUDA/cuDNN version: CUDA 9.0 with cuDNN 7.1.4
I train with TensorFlow on multiple GPUs using MirroredStrategy and Estimator. The problem: when I enable distributed mode with the following code, training gets stuck after running some steps:
```python
import tensorflow as tf

distribution = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=distribution)
estimator = tf.estimator.Estimator(model_fn=mymodel_fn, model_dir='logs',
                                   config=config)
```
But when I run without distributed mode, like this:
```python
distribution = tf.contrib.distribute.MirroredStrategy()  # created but not passed to RunConfig
config = tf.estimator.RunConfig()
estimator = tf.estimator.Estimator(model_fn=mymodel_fn, model_dir='logs',
                                   config=config)
```
it runs fine. Why? Is this a bug in MirroredStrategy?
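For reference, a minimal self-contained sketch of this setup against the TF 1.x API used in this issue; mymodel_fn and input_fn here are hypothetical stand-ins for the reporter's unshown code, and the shapes, learning rate, and step count are placeholders:

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for the reporter's unshown mymodel_fn.
def mymodel_fn(features, labels, mode):
    logits = tf.layers.dense(features, 2)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

# Hypothetical stand-in for the reporter's unshown input_fn; with a
# distribution strategy, input_fn must return a tf.data.Dataset.
def input_fn():
    x = np.random.rand(1024, 8).astype(np.float32)
    y = np.random.randint(0, 2, size=(1024,)).astype(np.int32)
    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    return dataset.shuffle(1024).repeat().batch(32)

distribution = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=distribution)
estimator = tf.estimator.Estimator(model_fn=mymodel_fn, model_dir='logs',
                                   config=config)
# The hang is reported to occur partway through training steps like these.
estimator.train(input_fn=input_fn, steps=1000)
```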
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 16 (3 by maintainers)
With TF 1.12 I still have the issue that @magnofel encountered (the only difference since TF 1.11 is that it now freezes before displaying `INFO:tensorflow:Initialize system`). @seemuch, do you have any update on this? Thanks a lot.

I have the same problem: stuck after variable initialization.
- Have I written custom code: N/A
- OS Platform and Distribution: Ubuntu 18.04
- TensorFlow installed from: pip install tensorflow-gpu
- TensorFlow version: 1.11.0
- Bazel version: N/A
- GPU model and memory: 1080 Ti
- Exact command to reproduce: see code below
- Mobile device: N/A
- CUDA/cuDNN version: CUDA 10.0 with cuDNN 7.3.1.20
I also encountered the same problem. I use tf.data to read the data; the MirroredStrategy job hangs on the last batch if I use shuffle -> repeat -> batch, but when I changed it to shuffle -> batch -> repeat, the job finished correctly.
Setting drop_remainder=True in batch() does not solve the problem.
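A minimal sketch of the two orderings being compared, assuming a TFRecord-based pipeline; make_input_fn, file_pattern, and the buffer size are hypothetical names, and only the relative order of repeat() and batch() is the point:

```python
import tensorflow as tf

def make_input_fn(file_pattern, batch_size, num_epochs):
    # Hypothetical input_fn factory for tf.estimator.
    def input_fn():
        files = tf.data.Dataset.list_files(file_pattern)
        dataset = tf.data.TFRecordDataset(files)
        dataset = dataset.shuffle(buffer_size=10000)
        # Ordering reported to hang on the last batch under MirroredStrategy:
        #   dataset = dataset.repeat(num_epochs).batch(batch_size)
        # Ordering reported to finish correctly:
        dataset = dataset.batch(batch_size).repeat(num_epochs)
        return dataset
    return input_fn
```

The report suggests only the relative order of batch() and repeat() determines whether the job hangs; the underlying cause is not identified in this thread.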
@seemuch I have resolved my issue, and it may help others here (@jnd77 @magnofel @honeytidy): if you use an AMD Threadripper and a motherboard without PLX chips, you should go into the UEFI and disable the IOMMU. NCCL is not compatible with it. You can find more here and here.
@patzm I think cloud providers test their instances for compatibility (at least GCP and AWS do, I believe). And you can always disable it with a grub config, as @jnd77 did.
Thanks a lot @Luonic. With your links, we solved the issue: we disabled the IOMMU via grub as mentioned here.
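For reference, the grub route the commenters describe generally amounts to adding a kernel boot parameter. A sketch for an Ubuntu-style system follows; the exact parameter (iommu=soft vs. amd_iommu=off) is an assumption to verify against the threads linked above:

```sh
# Sketch, not taken verbatim from this thread: disable the IOMMU at boot.
# In /etc/default/grub, extend the kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=soft"
# Then regenerate the grub config and reboot:
sudo update-grub
sudo reboot
```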