tensorflow: nccl_ops.all_sum does not correctly reduce gradients

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v2.2.0-rc3-33-g70087ab4f4 2.2.0-rc4
  • Python version: 3.7
  • Bazel version (if compiling from source): n/a
  • GCC/Compiler version (if compiling from source): n/a
  • CUDA/cuDNN version: 10.1/7.6.5
  • GPU model and memory:P100, V100

Describe the current behavior The allreduce operation nccl_ops.all_sum does not correctly sum gradients across GPUs: the values it returns can differ noticeably from the true elementwise sum.
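
For reference, the contract of nccl_ops.all_sum is that every returned tensor should equal the elementwise sum of all input tensors, one per GPU. A minimal sketch of that expectation (not part of the original reproduction; assumes a machine with at least 2 GPUs and TF1 graph mode, i.e. TF2_BEHAVIOR=0):

import numpy as np
from tensorflow.compat import v1 as tf
from tensorflow.python.ops import nccl_ops

# Place one constant tensor on each GPU: [1,1,1,1] on GPU 0, [2,2,2,2] on GPU 1.
tensors = []
for i in range(2):
    with tf.device('/gpu:{}'.format(i)):
        tensors.append(tf.constant(np.full([4], i + 1.0, dtype=np.float32)))

# all_sum returns one tensor per input device; each should hold the full sum.
summed = nccl_ops.all_sum(tensors)

with tf.Session() as sess:
    print(sess.run(summed))  # expected: two copies of [3. 3. 3. 3.]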

Standalone code to reproduce the issue

#!/usr/bin/env python
import argparse
from tensorflow.compat import v1 as tf
import tqdm

def split_grad_list(grad_list):
    g = []
    v = []
    for tower in grad_list:
        g.append([x[0] for x in tower])
        v.append([x[1] for x in tower])
    return g, v

def allreduce_grads(all_grads):
    # reduce gradients for N variables on K devices
    from tensorflow.python.ops import nccl_ops as nccl
    nr_tower = len(all_grads)
    assert nr_tower > 1
    new_all_grads = []  # N x K
    for grads in zip(*all_grads):
        # k grads
        summed = nccl.all_sum(grads)

        grads_for_devices = []  # K
        true_sum = tf.add_n(grads)  # reference sum computed without NCCL
        for g in summed:
            diff = tf.abs(true_sum - g)
            eql = diff < 1e-4
            nccl_res_correct = tf.reduce_all(eql, name="corr_" + grads[0].op.name)

            def flat(x):
                x = tf.reshape(x, [-1])
                x = tf.slice(x, [0], [tf.minimum(tf.size(x), 200)])
                return x

            assert_op = tf.debugging.Assert(nccl_res_correct, [
                tf.reduce_max(diff), flat(true_sum), flat(g)], summarize=1000,
                name='assert_' + grads[0].op.name)
            with tf.control_dependencies([assert_op]):
                g = tf.identity(g)
            grads_for_devices.append(g)
        new_all_grads.append(grads_for_devices)
    # transpose to K x N
    ret = list(zip(*new_all_grads))
    return ret

def build_graph(image, label, idx):
    v1 = tf.get_variable('aaa/W', shape=[3, 3, 3, 64], trainable=True)
    v2 = tf.get_variable('bbb/W', shape=[3, 3, 3, 64], trainable=True)
    v = v1 if idx == 0 else v2
    image = tf.nn.conv2d(image, v, 1, padding='SAME', data_format='NCHW')

    def conv(name, x, chan, stride=1):
        with tf.variable_scope(name):
            in_chan = x.shape[1]
            W = tf.get_variable('W', [3, 3, in_chan, chan])
            ret = tf.nn.conv2d(x, W, strides=stride, padding="SAME", data_format="NCHW")
            return tf.nn.relu(ret)

    x = conv('conv1', image, 64)
    x = conv('conv2', x, 64)
    x = conv('conv3', x, 1280, stride=2)
    x = conv('conv4', x, 1280, stride=2)
    x = conv('conv5', x, 10)
    logits = tf.reduce_mean(x, axis=[2, 3])
    cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label)
    cost = tf.reduce_mean(cost, name='cross_entropy_loss')
    return cost


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu', type=int)
    args = parser.parse_args()
    num_gpu = args.gpu

    with tf.Graph().as_default():
        opt = tf.train.GradientDescentOptimizer(0.001)

        grad_list = []
        for k in range(num_gpu):
            with tf.device("/gpu:{}".format(k)), tf.variable_scope("tower{}".format(k)):
                print("Building {} ...".format(k))
                image = tf.random.uniform([32, 3, 30, 30])
                label = tf.random.uniform([32], maxval=9, dtype=tf.int32)
                cost = build_graph(image, label, k)
                varlist = [x for x in tf.trainable_variables() if x.name.startswith("tower{}".format(k))]
                print("Varlist for tower {}: ".format(k), [x.name for x in varlist])
                wd_cost = [tf.reduce_sum(x) * 1e-3 for x in varlist]
                cost = tf.add_n([cost] + wd_cost)
                grads = opt.compute_gradients(cost, var_list=varlist)
                grad_list.append(grads)

        all_grads, all_vars = split_grad_list(grad_list)
        all_grads = allreduce_grads(all_grads)
        grad_list = [list(zip(gs, vs)) for gs, vs in zip(all_grads, all_vars)]

        train_ops = []
        for idx, grad_and_vars in enumerate(grad_list):
            with tf.device('/gpu:{}'.format(idx)):
                train_ops.append(opt.apply_gradients(
                    grad_and_vars, name='apply_grad_{}'.format(idx)))
        train_op = tf.group(*train_ops)

        sess = tf.Session()
        sess.run(tf.global_variables_initializer())
        print("Training ...")
        for k in tqdm.trange(5000):
            sess.run(train_op)

The above code trains a toy network on random data and all-reduces the gradients using nccl_ops.all_sum. It checks the allreduce results against the sum of gradients computed by a naive add_n, and asserts that the difference is reasonably small. However, the difference can sometimes be quite large, and the assertion usually fails within 100 steps of training.

The code above (written in TF1 style) can be run on a machine with >=2 GPUs using

$ TF2_BEHAVIOR=0 python a.py --gpu 2
Building 0 ...
 Varlist for tower 0:  ['tower0/aaa/W:0', 'tower0/bbb/W:0', 'tower0/conv1/W:0', 'tower0/conv2/W:0', 'tower0/conv3/W:0', 'tower0/conv4/W:0', 'tower0/conv5/W:0']                                      
Building 1 ...  
Varlist for tower 1:  ['tower1/aaa/W:0', 'tower1/bbb/W:0', 'tower1/conv1/W:0', 'tower1/conv2/W:0', 'tower1/conv3/W:0', 'tower1/conv4/W:0', 'tower1/conv5/W:0'] 
1%|▉                                                                    | 71/5000 [00:06<07:39, 10.73it/s]    
Traceback (most recent call last):                                                                                                                                                                  
  File "/private/home/yuxinwu/env/py37-tf2.2v2/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call                                                             
    return fn(*args)                                                                                                                                                                                
  File "/private/home/yuxinwu/env/py37-tf2.2v2/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn                                                              
    target_list, run_metadata)                                                                                                                                                                      
  File "/private/home/yuxinwu/env/py37-tf2.2v2/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun                                                  
    run_metadata)                                                                                                                                                                                   
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.                                                                                                                
  (0) Invalid argument: assertion failed: [0.00100000016] [0.00234295963 0.00230941921 0.00176228327 0.00197261758 0.00213356828 0.00188576151 0.00211580051 0.00221353304 

My initial investigation suggests (no proof, just a guess) that the bug might appear because the gradients are computed on each GPU in a different order, so the NCCL collectives may not be launched in the same order on every device.

The bug exists in TF 1.15 as well; I have not tested earlier versions. The bug rarely triggers if I revert https://github.com/tensorflow/tensorflow/pull/31481, a PR that makes allreduce ops get scheduled as early as possible. collective_ops.all_reduce with the ring implementation does not seem to have a similar issue, but it significantly slows down my training.
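
As a point of comparison, the ring-based path mentioned above replaces nccl.all_sum with collective_ops.all_reduce. A rough sketch of allreduce_grads rewritten that way (not from the original report; the group_key/instance_key values are illustrative and must be unique within the graph, and the exact keyword arguments may vary slightly between TF versions):

from tensorflow.python.ops import collective_ops

def allreduce_grads_ring(all_grads):
    # all_grads: K towers x N gradients, same layout as allreduce_grads() above
    nr_tower = len(all_grads)
    new_all_grads = []  # N x K
    for instance_key, grads in enumerate(zip(*all_grads), start=1):
        summed = []
        for g in grads:
            with tf.device(g.device):
                # All devices use the same group_key; each gradient slot shares
                # one instance_key across devices, identifying one collective.
                summed.append(collective_ops.all_reduce(
                    g, group_size=nr_tower, group_key=1,
                    instance_key=instance_key,
                    merge_op='Add', final_op='Id',
                    communication_hint='ring'))  # ring instead of NCCL
        new_all_grads.append(summed)
    # transpose to K x N
    return list(zip(*new_all_grads))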

cc @dubey @yuefengz @chsigg who may have context on this issue.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 24 (23 by maintainers)

Most upvoted comments

We have reverted #31481 to unblock users, but we don’t yet have a good handle on the root cause and will continue to investigate.

Any updates after a month?

I was able to reproduce the error, but not with much consistency. I have 4 P100s on my machine, and running TF2_BEHAVIOR=0 python a.py --gpu 4 doesn’t fail every time; when I did see the error, it was just after step 500. So far, running with 2 of the 4 GPUs hasn’t produced the error.