tensorflow: Tensorflow2.0 distributed training gives error :- A non-DistributedValues value 8 cannot be reduced with the given reduce op ReduceOp.SUM.

Getting the error when i am running distributed training as described in this tutorial (https://colab.research.google.com/github/tensorflow/docs/blob/r2.0rc/site/en/r2/guide/distribute_strategy.ipynb#scrollTo=IlYVC0goepdk)

System information

OS Platform: - Linux Ubuntu 16.04):
TensorFlow installed from (source or binary): binary
TensorFlow version: 2.0-beta
Python version: 3.7
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: 10.01
GPU model and memory: p100, 16GB

@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        features, labels = inputs

        with tf.GradientTape() as tape:
            logits = model(features)
            #print(logits)
            cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
                        logits=logits, labels=labels)
            loss = tf.reduce_sum(cross_entropy) * (1.0 / global_batch_size)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(list(zip(grads, model.trainable_variables)))
        return cross_entropy

    per_example_losses = mirrored_strategy.experimental_run_v2(
            step_fn, args=(dist_inputs,))
    mean_loss = mirrored_strategy.reduce(
            tf.compat.v2.distribute.ReduceOp.MEAN, per_example_losses, axis=0)
    return mean_loss

I am getting this error while running abobe distributed code. `ValueError: A non-DistributedValues value 8 cannot be reduced with the given reduce op ReduceOp.SUM.

But it works when I use tf.distribute.ReduceOp.SUM instead of tf.distribute.ReduceOp.MEAN `

About this issue

Original URL
State: closed
Created 5 years ago
Comments: 23 (8 by maintainers)

Most upvoted comments

Hi, @samuelmat19 You can still train your model on distributed GPU, I am able to train using ReduceOp.SUM only but after that, you have to divide by global batch size you will get the same value which ReduceOp.mean would give.

reduce_sum_out = self.mirrored_strategy.reduce(
            tf.distribute.ReduceOp.SUM, per_example_losses, axis=0)

loss  = reduce_sum_out / global_batch_size

akanyaani on Nov 14, 2019

Hi @gadagashwini

It works on colab because you are not running on multiple GPU, ReduceOp.MEAN is working when I am running on a single GPU, but not working when I am running on multiple GPU.

akanyaani on Sep 10, 2019

I also have this bug. Anyone gets proper solution or fixes yet?

samkoesnadi on Nov 14, 2019

@CheungZeeCn @akanyaani

The above code run perfectly in 3-GPU configuration The code fails when the global_batch_size is perfectly dividable by no.of GPUs.

For example, if the no. of GPUs is 4. with global_batch_size 20. Each replica will have batch size of 5 each. This will fail with error ValueError: A non-DistributedValues value 5 cannot be reduced with the given reduce op ReduceOp.SUM.

If 3-GPU batch size across replicas is 7, 7, 6. which will train perfectly Screenshot 2020-04-24 at 8 41 30 PM

Screenshot of code running in 3 and 4 GPU configuration 3-GPU screencapture-localhost-8080-notebooks-testing-multigpu-ipynb-2020-04-24-20_44_13 4-GPU

kamalkraj on Apr 24, 2020