tensorflow: Large batch sizes with dense layers occasionally cause all_reduce to return wrong results

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux (Debian 9.12); kernel 4.9.0-12-amd64, #1 SMP Debian 4.9.210-1 (2020-01-20); platform Linux-4.9.0-12-amd64-x86_64-with-debian-9.12
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version: 2.2.0
  • Python version: 3.7.7
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: 10.1
  • GPU model and memory: Tesla P100-PCIE-16GB (on Google Cloud Platform)

Describe the current behavior
I tried to use SyncBatchNormalization (tf.keras.layers.experimental.SyncBatchNormalization) in my models and found that it sometimes produces NaNs, so I took a closer look. In the code below I implement a simple SyncBatchNormalization myself, and I found that when the batch size is very large (e.g. 262144) for dense layers (which is the case for a sub-module in my model), all_reduce occasionally returns wrong results.

Describe the expected behavior
all_reduce returns correct results every time.

Standalone code to reproduce the issue

import numpy as np
import tensorflow as tf
from tensorflow.python.distribute import distribution_strategy_context as ds
from tensorflow.python.distribute import reduce_util
from tensorflow.python.keras.layers import normalization


class SyncBatchNormalization(normalization.BatchNormalizationBase):
    """The SyncBatchNormalization in TF 2.2 seems causing NaN issue.
    We implement this one to avoid the issue.
    See https://github.com/google-research/simclr/blob/bfe07eed7f101ab51f3360100a28690e1bfbf6ec/resnet.py#L37-L85
    """

    def __init__(self,
                 axis=-1,
                 momentum=0.99,
                 epsilon=1e-3,
                 center=True,
                 scale=True,
                 beta_initializer='zeros',
                 gamma_initializer='ones',
                 moving_mean_initializer='zeros',
                 moving_variance_initializer='ones',
                 beta_regularizer=None,
                 gamma_regularizer=None,
                 beta_constraint=None,
                 gamma_constraint=None,
                 renorm=False,
                 renorm_clipping=None,
                 renorm_momentum=0.99,
                 trainable=True,
                 adjustment=None,
                 name=None,
                 **kwargs):
        # Currently we only support aggregating over the global batch size.
        super(SyncBatchNormalization, self).__init__(
            axis=axis,
            momentum=momentum,
            epsilon=epsilon,
            center=center,
            scale=scale,
            beta_initializer=beta_initializer,
            gamma_initializer=gamma_initializer,
            moving_mean_initializer=moving_mean_initializer,
            moving_variance_initializer=moving_variance_initializer,
            beta_regularizer=beta_regularizer,
            gamma_regularizer=gamma_regularizer,
            beta_constraint=beta_constraint,
            gamma_constraint=gamma_constraint,
            renorm=renorm,
            renorm_clipping=renorm_clipping,
            renorm_momentum=renorm_momentum,
            fused=False,
            trainable=trainable,
            virtual_batch_size=None,
            name=name,
            **kwargs)

    def _calculate_mean_and_var(self, inputs, reduction_axes, keep_dims):
        shard_mean, shard_variance = super(SyncBatchNormalization, self)._calculate_mean_and_var(
            inputs, reduction_axes, keep_dims=keep_dims)
        replica_ctx = ds.get_replica_context()
        if replica_ctx:
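            # Average the per-replica (shard) statistics across all replicas, then add
            # the mean squared deviation of the shard means from the group mean (law of
            # total variance) to obtain the group variance.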
            group_mean, group_variance = replica_ctx.all_reduce(
                reduce_util.ReduceOp.MEAN, [shard_mean, shard_variance])
            mean_distance = tf.math.squared_difference(tf.stop_gradient(group_mean), shard_mean)
            group_variance += replica_ctx.all_reduce(reduce_util.ReduceOp.MEAN, mean_distance)
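            # Debug check: if the aggregated variance is implausibly large (mean > 50),
            # print the per-replica inputs to the all_reduce for inspection.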
            tf.cond(tf.reduce_mean(group_variance) > 50,
                    lambda: tf.print(
                        f"\n{self.name} id", replica_ctx.replica_id_in_sync_group, "/",
                        replica_ctx.num_replicas_in_sync, "\n",
                        "local mean distance:", mean_distance, "mean local mean distance",
                        tf.reduce_mean(mean_distance), "\n",
                        "group var:", group_variance, "mean group var:", tf.reduce_mean(group_variance), "\n",
                        "local var:", shard_variance, "mean local var:", tf.reduce_mean(shard_variance), "\n",
                        "group mean:", group_mean, "mean group mean", tf.reduce_mean(group_mean), "\n",
                        "local mean:", shard_mean, "mean local mean", tf.reduce_mean(shard_mean), "\n",
                        "size:", tf.shape(shard_mean)),
                    lambda: tf.no_op()
                    )
            return group_mean, group_variance
        else:
            return shard_mean, shard_variance


class Test(tf.keras.models.Model):
    def __init__(self):
        super(Test, self).__init__()
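        # Ten parallel MLPs, each with two SyncBatchNormalization layers, so a single
        # forward pass issues many independent cross-replica all_reduce calls.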
        self.mlps = []
        for i in range(10):
            self.mlps.append(tf.keras.Sequential([
                tf.keras.layers.Dense(512),
                SyncBatchNormalization(),
                tf.keras.layers.ReLU(),
                tf.keras.layers.Dense(256),
                SyncBatchNormalization(),
                tf.keras.layers.ReLU(),
                tf.keras.layers.Dense(128),
            ]))
        self.head = tf.keras.layers.Dense(10)

    def call(self, inputs, training=None, mask=None):
        out = []
        for mlp in self.mlps:
            out.append(mlp(inputs))
        return self.head(tf.concat(out, axis=-1))


dummy_data = np.random.random((2621440, 3)).astype(np.float32) * 6 - 3
dummy_label = np.random.randint(0, 10, 2621440).astype(np.int32)
# print(dummy_label.shape)
dataset = tf.data.Dataset.from_tensor_slices((dummy_data, dummy_label)).batch(262144)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = Test()
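    # learning_rate=0 keeps the weights fixed; the batch-norm statistics are still
    # computed and all-reduced on every step.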
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.Adam(learning_rate=0)
    )
    model.fit(dataset, epochs=10000)
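
To isolate the collective from the Keras model, a stripped-down consistency check along the lines below may help (this sketch is not part of the original report; it assumes the same MirroredStrategy setup, uses only the public strategy.run / ReplicaContext.all_reduce APIs available in TF 2.2, and relies on equal-sized shards so that the mean of shard means equals the global mean; tiny floating-point differences against the NumPy reference are expected, but nothing like the discrepancies in the log below):

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

global_batch = 262144
features = 256
data = np.random.random((global_batch, features)).astype(np.float32) * 6 - 3

dataset = tf.data.Dataset.from_tensor_slices(data).batch(global_batch)
dist_dataset = strategy.experimental_distribute_dataset(dataset)


@tf.function
def step(x):
    def replica_fn(shard):
        ctx = tf.distribute.get_replica_context()
        shard_mean = tf.reduce_mean(shard, axis=0)
        # all_reduce(MEAN) should return the identical group mean on every replica.
        return ctx.all_reduce(tf.distribute.ReduceOp.MEAN, shard_mean)
    return strategy.run(replica_fn, args=(x,))


reference = data.mean(axis=0)  # CPU reference for the global per-feature mean
for x in dist_dataset:
    group_means = strategy.experimental_local_results(step(x))
    for i, group_mean in enumerate(group_means):
        err = np.abs(group_mean.numpy() - reference).max()
        print("replica", i, "max |group_mean - reference| =", err)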

Other info / logs

Here is the output when the code runs on a machine with 4 P100 GPUs:

$ python test_syncbn.py
2020-08-02 00:12:57.236317: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-08-02 00:12:58.708082: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:58.712891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s
2020-08-02 00:12:58.713072: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:58.714765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:00:05.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s
2020-08-02 00:12:58.714901: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:58.801883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties:
pciBusID: 0000:00:06.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s
2020-08-02 00:12:58.802048: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:58.803362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties:
pciBusID: 0000:00:07.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s
2020-08-02 00:12:58.803698: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-08-02 00:12:58.805523: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-08-02 00:12:58.807244: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-08-02 00:12:58.807614: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-08-02 00:12:58.809516: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-08-02 00:12:58.810620: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-08-02 00:12:58.814611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-02 00:12:58.814734: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:58.815664: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:58.816579: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:58.817486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:58.818413: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:58.819344: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:58.820235: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:58.821133: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:58.821967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3
2020-08-02 00:12:58.822388: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-08-02 00:12:58.830470: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2000185000 Hz
2020-08-02 00:12:58.831013: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c4178b0db0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-08-02 00:12:58.831039: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-08-02 00:12:59.256430: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.329113: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.360236: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.367112: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.368223: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c4145a2150 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-08-02 00:12:59.368249: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0
2020-08-02 00:12:59.368256: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): Tesla P100-PCIE-16GB, Compute Capability 6.0
2020-08-02 00:12:59.368267: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): Tesla P100-PCIE-16GB, Compute Capability 6.0
2020-08-02 00:12:59.368293: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (3): Tesla P100-PCIE-16GB, Compute Capability 6.0
2020-08-02 00:12:59.371553: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.372373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s
2020-08-02 00:12:59.372472: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.373300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:00:05.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s
2020-08-02 00:12:59.373389: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.374198: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties:
pciBusID: 0000:00:06.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s
2020-08-02 00:12:59.374265: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.375094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties:
pciBusID: 0000:00:07.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s
2020-08-02 00:12:59.375161: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-08-02 00:12:59.375183: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-08-02 00:12:59.375204: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-08-02 00:12:59.375223: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-08-02 00:12:59.375242: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-08-02 00:12:59.375260: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-08-02 00:12:59.375280: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-02 00:12:59.375341: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.376284: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.377135: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.378016: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.378845: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.379650: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.380503: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.381341: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.382160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3
2020-08-02 00:12:59.382205: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-08-02 00:12:59.386308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-02 00:12:59.386333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 1 2 3
2020-08-02 00:12:59.386341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N Y N N
2020-08-02 00:12:59.386350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 1:   Y N N N
2020-08-02 00:12:59.386355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 2:   N N N Y
2020-08-02 00:12:59.386363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 3:   N N Y N
2020-08-02 00:12:59.386614: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.387475: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.388358: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.389240: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.390107: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.390967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15056 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
2020-08-02 00:12:59.391525: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.392374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 15056 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:05.0, compute capability: 6.0)
2020-08-02 00:12:59.392906: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.393843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 15056 MB memory) -> physical GPU (device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:06.0, compute capability: 6.0)
2020-08-02 00:12:59.394386: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-02 00:12:59.395303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 15056 MB memory) -> physical GPU (device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0)
Epoch 1/10000
2020-08-02 00:13:52.468511: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
10/10 [==============================] - 3s 335ms/step - loss: 2.7836
Epoch 2/10000
10/10 [==============================] - 3s 335ms/step - loss: 2.7742
Epoch 3/10000
10/10 [==============================] - 3s 339ms/step - loss: 2.7742
Epoch 4/10000
10/10 [==============================] - 3s 341ms/step - loss: 2.7742
Epoch 5/10000
 1/10 [==>...........................] - ETA: 0s - loss: 2.7735
sync_batch_normalization_9 id 0 / 4
 local mean distance: [372190.406 373072.062 361857.031 ... 11281.4863 14100.6201 13597.5859] mean local mean distance 71088.7109
 group var: [278891.656 279530 271112.156 ... 8756.89258 10855.8447 10485.8672] mean group var: 53282.5
 local var: [0.216322735 0.351680636 0.654191434 ... 0.101593301 0.440946668 0.21738404] mean local var: 0.440011531
 group mean: [611.271729 611.215698 601.546692 ... 105.587341 118.7118 115.605087] mean group mean 91.9500732
 local mean: [1.1976552 0.419436097 0.00116307894 ... -0.626998782 -0.0342339203 -1.00359774] mean local mean -0.01634828
 size: [256]

sync_batch_normalization_9 id 1 / 4
 local mean distance: [0.806274891 0.0985621 1.88196e-05 ... 0.223046422 0.000869709416 0.567140639] mean local mean distance 0.132760748
 group var: [279142.844 279803.875 271389.75 ... 8461.29 10575.9697 10198.2939] mean group var: 53316.6094
 local var: [0.217430741 0.351392329 0.65093118 ... 0.102492645 0.439819723 0.220061317] mean local var: 0.439691663
 group mean: [0.299309373 0.104648665 0.00144605222 ... -0.15742597 -0.00983027834 -0.251029134] mean group mean -0.00408996362
 local mean: [1.19723749 0.418594658 0.00578420889 ... -0.629703879 -0.0393211134 -1.00411654] mean local mean -0.0163598545
 size: [256]

sync_batch_normalization_9 id 2 / 4
 local mean distance: [372190.406 373070.594 361849.094 ... 11281.6875 14101.8311 13597.8037] mean local mean distance 71088.5547
 group var: [278891.656 279530 271112.156 ... 8756.89258 10855.8447 10485.8672] mean group var: 53282.5
 local var: [0.214262575 0.350284219 0.650416493 ... 0.101662345 0.442076713 0.21601209] mean local var: 0.439108968
 group mean: [611.271729 611.215698 601.546692 ... 105.587341 118.7118 115.605087] mean group mean 91.9500732
 local mean: [1.19765222 0.420656025 0.00775553659 ... -0.627944 -0.0393257216 -1.00453138] mean local mean -0.0164480079
 size: [256]

sync_batch_normalization_9 id 3 / 4
 local mean distance: [372189.5 373072.375 361852.312 ... 11281.6631 14100.9863 13597.001] mean local mean distance 71088.6406
 group var: [278891.656 279530 271112.156 ... 8756.89258 10855.8447 10485.8672] mean group var: 53282.5
 local var: [0.218500063 0.352682829 0.651915729 ... 0.102855965 0.440146983 0.216531143] mean local var: 0.439344555
 group mean: [611.271729 611.215698 601.546692 ... 105.587341 118.7118 115.605087] mean group mean 91.9500732
 local mean: [1.19834054 0.419162512 0.00508273114 ... -0.627833903 -0.0357722044 -1.00109076] mean local mean -0.0162223056
 size: [256]
10/10 [==============================] - 3s 343ms/step - loss: 2.6909
Epoch 6/10000
10/10 [==============================] - 3s 337ms/step - loss: 2.6679
Epoch 7/10000
10/10 [==============================] - 3s 336ms/step - loss: 2.6742

As the log shows, the group_mean readouts on different replicas disagree: replica 1 sees values near zero (mean group mean -0.004), while replicas 0, 2, and 3 see values in the hundreds (mean group mean 91.95), even though all_reduce(MEAN) should return the identical result on every replica.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 25 (12 by maintainers)

Most upvoted comments

We’ve rolled back a commit, which appears to fix the issue, but we don’t understand the root cause yet.

@dubey Should we keep this issue open to investigate the root cause?

Same issue here. I got lower results with sync BN than with normal BN or a single GPU.

@anj-s can you take a look? I also got this problem when training multi-GPU with SyncBatchNorm in my framework (https://github.com/TensorSpeech/TensorFlowTTS). It seems you are the person who implemented this SyncBatchNorm :3

This seems like a duplicate of https://github.com/tensorflow/tensorflow/issues/41539. Incorrect NCCL all-reduce results have been reproducible since TF 1.15.

A bit more info: I tried to run my model with SyncBN on TPUs and it seems I did not encounter a similar issue, which might indicate that all_reduce on TPU is correct?