tensorflow: Incorrect gradient for ctc_loss on GPU when using logit_length
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 9.12 (TF2.2 DeepLearning image on GCP)
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
- TensorFlow installed from (source or binary): Preinstalled
- TensorFlow version (use command below): v2.2.0-0-g2b96f36 2.2.0-dlenv
- Python version: 3.7.6
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: V10.1.243
- GPU model and memory: NVIDIA Tesla P100
Describe the current behavior
I have observed inconsistencies between the CPU and GPU implementations of tf.nn.ctc_loss when computing the gradient, whenever the logit_length argument contains something other than [num_frames] * batch_size.
Specifically, the gradient with respect to logits returned by the GPU implementation does not contain zeros for frames past the end of the sequence as given by logit_length, whereas the CPU implementation does zero out those frames and appears to behave correctly.
I have also noticed that the unit tests for this op do not cover this particular case (see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/kernel_tests/ctc_loss_op_test.py#L993).
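For reference, the zero-padding property described above can be checked directly on the returned gradient. The helper below is a minimal sketch (the name max_grad_past_length is mine, not a TensorFlow API); it reports the largest absolute gradient value over frames at or beyond logit_length, which should be exactly zero for a correct implementation:

    import tensorflow as tf

    def max_grad_past_length(grad, logit_lengths):
        # grad: [batch, max_time, num_labels] gradient w.r.t. batch-major logits.
        # logit_lengths: [batch] int32 frame counts per example.
        max_time = tf.shape(grad)[1]
        # True for valid frames t < logit_lengths[b], False for padded frames.
        valid = tf.sequence_mask(logit_lengths, maxlen=max_time)
        padded = tf.cast(tf.logical_not(valid), grad.dtype)[:, :, tf.newaxis]
        # Largest leaked gradient magnitude in the padded region (0.0 if none).
        return tf.reduce_max(tf.abs(grad) * padded)

Applied to the gradients computed by the reproduction script below, this should return 0.0 for the CPU result and, per the behaviour described above, a non-zero value for the GPU result whenever logits_lengths varies per example.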
Standalone code to reproduce the issue
    import tensorflow as tf

    use_logits_lengths = True
    batch_size = 8
    num_labels = 27
    max_labels_length = 32
    max_logits_length = 128

    labels = []
    labels_lengths = []
    logits = []
    logits_lengths = []
    for i in range(batch_size):
        labels_lengths.append(tf.random.uniform([], 1, max_labels_length, tf.int32))
        labels.extend(tf.random.uniform([labels_lengths[-1]], 0, num_labels - 1, tf.int32))
        # I multiply label_length by 2 to make sure there are enough frames
        logits_lengths.append(tf.random.uniform([], labels_lengths[-1].numpy() * 2, max_logits_length + 1, tf.int32))

    labels = tf.RaggedTensor.from_row_lengths(labels, labels_lengths).to_sparse()
    labels_lengths = tf.stack(labels_lengths, 0)  # lengths are scalars, so stack rather than concat
    logits = tf.random.uniform([batch_size, max_logits_length, num_labels])
    logits_lengths = tf.stack(logits_lengths, 0)
    logits_lengths_full = tf.constant([max_logits_length] * batch_size)

    def ctc_compare_cpu_gpu(logits_lengths):
        print("logits_lengths", logits_lengths.numpy())
        with tf.device("/gpu:0"):
            with tf.GradientTape() as t:
                t.watch(logits)
                gpu_loss = tf.nn.ctc_loss(labels, logits, labels_lengths, logits_lengths,
                                          logits_time_major=False, blank_index=-1)
            gpu_grad = t.gradient(gpu_loss, [logits])[0]
        with tf.device("/cpu:0"):
            with tf.GradientTape() as t:
                t.watch(logits)
                cpu_loss = tf.nn.ctc_loss(labels, logits, labels_lengths, logits_lengths,
                                          logits_time_major=False, blank_index=-1)
            cpu_grad = t.gradient(cpu_loss, [logits])[0]
        print("Max loss error", tf.math.abs(gpu_loss - cpu_loss).numpy().max())
        print("Max grad error", tf.math.abs(gpu_grad - cpu_grad).numpy().max())
        print()
        return cpu_loss, gpu_loss, cpu_grad, gpu_grad

    ctc_compare_cpu_gpu(logits_lengths_full)
    ctc_compare_cpu_gpu(logits_lengths)
Output:

    logits_lengths [128 128 128 128 128 128 128 128]
    Max loss error 0.00012207031
    Max grad error 0.00014734268

    logits_lengths [ 70  86  22  74 112 121 103 123]
    Max loss error 6.1035156e-05
    Max grad error 0.9669469
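As a possible mitigation (my own sketch, not something suggested in the issue): since the loss values above match between CPU and GPU, the forward pass appears to ignore frames past logit_length, so multiplying the logits by a sequence mask before calling tf.nn.ctc_loss should leave the loss unchanged while forcing the gradient in the padded region to zero through the chain rule. This only removes the leaked out-of-range gradient; it does not prove the in-range GPU gradient is correct, so comparing against the CPU kernel remains the authoritative check.

    import tensorflow as tf

    def ctc_loss_masked(labels, logits, label_length, logit_length):
        # Batch-major logits: [batch, max_time, num_labels].
        max_time = tf.shape(logits)[1]
        mask = tf.cast(tf.sequence_mask(logit_length, maxlen=max_time),
                       logits.dtype)[:, :, tf.newaxis]
        # d(logits * mask)/d(logits) = mask, so the gradient w.r.t. logits
        # is zeroed for frames at or beyond logit_length.
        return tf.nn.ctc_loss(labels, logits * mask, label_length, logit_length,
                              logits_time_major=False, blank_index=-1)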
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 21 (11 by maintainers)
Was able to replicate the issue in TF v2.5, please find the gist here… Thanks!
I tried the script on V100 a couple of times and I can see the flakiness: Run 1:
Run X:
Looking into it.