tensorflow: Running multi-worker with NCCL fails: NET/IB collective mismatch error

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): tensorflow/models
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): ubuntu 18.04.3
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): tensorflow:2.0.0-gpu (docker image)
  • TensorFlow version (use command below): 2.0.0
  • Python version: 2.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 10.1/7.6.0.64
  • NCCL version: 2.5.4
  • GPU model and memory: Nvidia P40

Describe the current behavior

  • worker A
CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO  NCCL_IB_GID_INDEX=3 NCCL_IB_HCA=mlx5_1:1 NCCL_IB_SL=3 NCCL_SOCKET_IFNAME=eth1 python -m resnet_imagenet_main --model_dir=/tmp/model_dir/resnet  --num_gpus=1   --batch_size=32  --use_synthetic_data=true --worker_hosts=A:2222,B:2222  --task_index=0 --distribution_strategy=multi_worker_mirrored --all_reduce_alg=nccl
  • worker B
CUDA_VISIBLE_DEVICES=0 NCCL_DEBUG=INFO  NCCL_IB_GID_INDEX=3 NCCL_IB_HCA=mlx5_1:1 NCCL_IB_SL=3 NCCL_SOCKET_IFNAME=eth1 python -m resnet_imagenet_main --model_dir=/tmp/model_dir/resnet  --num_gpus=1   --batch_size=32  --use_synthetic_data=true --worker_hosts=A:2222,B:2222 --task_index=1 --distribution_strategy=multi_worker_mirrored --all_reduce_alg=nccl
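For context: the models script above derives the cluster from `--worker_hosts`/`--task_index`; a stock `MultiWorkerMirroredStrategy` setup conveys the same two-worker topology through the `TF_CONFIG` environment variable. A minimal sketch (the hostnames `A`/`B` are the same placeholders as in the commands above; `make_tf_config` is a hypothetical helper, not part of the script):

```python
import json
import os

def make_tf_config(worker_hosts, task_index):
    """Build the TF_CONFIG JSON string for one worker in the cluster."""
    return json.dumps({
        "cluster": {"worker": worker_hosts},
        "task": {"type": "worker", "index": task_index},
    })

# Worker A corresponds to --task_index=0, worker B to --task_index=1:
os.environ["TF_CONFIG"] = make_tf_config(["A:2222", "B:2222"], 0)
```

Both workers must list the hosts in the same order; only the `task.index` differs per worker.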

The error is as follows:

...
2019-11-18 10:26:43.012421: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
:499:967 [0] NCCL INFO NET/Socket : Using [0]eth1:100.x.x.x<0>
:499:967 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
2019-11-18 10:26:43.248455: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
:499:967 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE ; OOB eth1:100.x.x.x<0>
:499:978 [0] NCCL INFO Setting affinity for GPU 0 to 03,fffff000,003fffff
:499:978 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance :  PHB
:499:978 [0] NCCL INFO Ring 00 : 0 -> 1 [receive] via NET/IB/0
:499:978 [0] NCCL INFO Ring 00 : 1 -> 0 [send] via NET/IB/0
:499:978 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
:499:978 [0] NCCL INFO NCCL_IB_SL set by environment to 3.
:499:978 [0] NCCL INFO comm 0x7f3e640025b0 rank 1 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE

:499:979 [0] external/nccl_archive/src/transport/net_ib.cc:651 NCCL WARN NET/IB : collective mismatch error local size 1048576 remote 32768 addr 7f7563811000 rkey 103480 seq 2/2
:499:979 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:31 -> 3
:499:979 [0] NCCL INFO external/nccl_archive/src/transport/net.cc:470 -> 3
:499:979 [0] NCCL INFO external/nccl_archive/src/transport.cc:163 -> 3 [Proxy Thread]
...

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 22 (14 by maintainers)

Most upvoted comments

Keras + MultiWorkerMirroredStrategy + NCCL + IB works fine in TF 2.1. It looks like there is a bug in 2.0.
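For reference, a minimal sketch of how the strategy in that combination is constructed in TF 2.x, assuming the `tf.distribute.experimental` API of that era; `build_strategy` and the flag-to-enum mapping are illustrative helpers mirroring the `--all_reduce_alg` flag above, not the actual models/ code:

```python
# Maps the --all_reduce_alg flag values seen in the commands above to
# CollectiveCommunication enum member names (assumed mapping, for illustration).
ALL_REDUCE_ALGS = {
    "ring": "RING",  # gRPC-based ring all-reduce
    "nccl": "NCCL",  # NVIDIA NCCL; the path that hits the NET/IB error in 2.0
}

def communication_name(all_reduce_alg):
    """Translate a flag value into a CollectiveCommunication member name."""
    try:
        return ALL_REDUCE_ALGS[all_reduce_alg]
    except KeyError:
        raise ValueError("unknown all_reduce_alg: %r" % (all_reduce_alg,))

def build_strategy(all_reduce_alg="nccl"):
    """Construct the strategy; needs a TF 2.x install (import kept local)."""
    import tensorflow as tf
    comm = getattr(tf.distribute.experimental.CollectiveCommunication,
                   communication_name(all_reduce_alg))
    return tf.distribute.experimental.MultiWorkerMirroredStrategy(
        communication=comm)
```

A Keras model built and compiled inside `build_strategy(...).scope()` then trains across the workers declared via `TF_CONFIG`.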