tensorflow: Infinity mask breaks gradient

I’m trying to compute a softmax over selected indices, using an infinity mask to silence the unwanted entries. However, the gradient for those unwanted entries becomes nan instead of 0.

The reason I didn’t use a boolean mask is that the masked indices differ across examples in my batch, so they don’t fit into a nice matrix form. If there’s a workaround here I’ll be more than happy to adopt it.

The code I used to test the infinity mask is:

import numpy as np
import tensorflow as tf

a = tf.placeholder(tf.float32, [5])
inf_mask = tf.placeholder(tf.float32, [5])

b = tf.multiply(a, inf_mask)
sf = tf.nn.softmax(b)

loss = (sf[2] - 0)
grad = tf.gradients(loss, a)

sess = tf.Session()

a_np = np.ones([5])
np_mask = np.ones([5]) * 4     # kept entries are scaled by 4
np_mask[1] = -np.inf           # "mask out" index 1 with -inf

print(sess.run([sf, grad], feed_dict={
    a: a_np,
    inf_mask: np_mask
}))

sess.close()

The output is [array([ 0.25, 0. , 0.25, 0.25, 0.25], dtype=float32), [array([-0.25, nan, 0.75, -0.25, -0.25], dtype=float32)]]

The mask is working, but the gradient contains a nan, which I think should have been 0.
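If I understand it right, the nan comes from the chain rule: the upstream gradient at the masked entry is exactly 0, while the local gradient of tf.multiply is the mask value -inf, and 0 * inf is nan in floating point. A quick check:

import numpy as np
# the backward pass of tf.multiply multiplies the incoming gradient (0 here) by inf_mask[1] (-inf)
print(np.float32(0.0) * np.float32(-np.inf))  # nan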

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 7
  • Comments: 17 (6 by maintainers)

Most upvoted comments

@hongzimao @sy2737 I think you guys were on the right track originally, you just didn’t debug things quite correctly. You wanted a - inf_mask, not a multiply. The second solution posted above is still dangerous: a stable softmax should compute e^(a - max(a)).

The key identities are exp(-inf)==0, max(a, -inf)==a and a-inf==-inf. Unfortunately, 0*inf==nan, so constructing the mask correctly is tricky.
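Here is a minimal sketch of that, reusing the TF1-style snippet from the question but with a subtracted inf_mask (0 where kept, +inf where masked) instead of a multiply; the gradient at the masked index comes out as 0 rather than nan:

import numpy as np
import tensorflow as tf

a = tf.placeholder(tf.float32, [5])
inf_mask = tf.placeholder(tf.float32, [5])   # 0 where kept, +inf where masked

b = a - inf_mask      # masked entries become -inf; no 0 * inf appears in the backward pass
sf = tf.nn.softmax(b)
grad = tf.gradients(sf[2], a)

np_mask = np.zeros([5], dtype=np.float32)
np_mask[1] = np.inf

with tf.Session() as sess:
    print(sess.run([sf, grad], feed_dict={a: np.ones([5]), inf_mask: np_mask}))
    # [0.25, 0., 0.25, 0.25, 0.25] and a gradient of 0.0 (not nan) at index 1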

The two most numerically stable options are either a -inf mask or a sparse softmax (which might be better depending on what you are doing).

Below is an example of using a -inf mask. It has some specifics because of broadcasting, but you should be able to adapt it to whatever you need. Note that if your intention is to use this for loss calculations, you should be doing something else; softmax itself should only be used for things like attention.

  • Use tf.sequence_mask to create a mask from sequence lengths
  • Create an infinity mask (this is the ugly part):
    – tf.where to get the indices
    – tf.tile to make as many infs as required (broadcasting doesn’t seem to work)
    – tf.scatter_nd to make the mask using the indices and the infs
  • Then just tf.nn.softmax(logits - infmask, axis=1)

def masked_softmax(logits, mask):
    """
    Masked softmax over dim 1, mask broadcasts over dim 2
    :param logits: (N, L, T)
    :param mask: (N, L) boolean, True for entries to keep
    :return: probabilities (N, L, T)
    """
    v = tf.shape(logits)[2]
    # indices of the masked-out (N, L) positions
    indices = tf.cast(tf.where(tf.logical_not(mask)), tf.int32)
    inf = tf.constant(np.array([[np.inf]], dtype=np.float32), dtype=tf.float32)
    # one row of T infs per masked position (tf.tile because broadcasting doesn't apply here)
    infs = tf.tile(inf, [tf.shape(indices)[0], v])
    # dense (N, L, T) tensor that is +inf at masked positions and 0 elsewhere
    infmask = tf.scatter_nd(
        indices=indices,
        updates=infs,
        shape=tf.shape(logits))
    _p = tf.nn.softmax(logits - infmask, axis=1)
    return _p
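For context, a hypothetical call might look like the following (N=2, L=3, T=4 are made-up shapes; as discussed further down, this np.inf-based version can still produce NaN gradients):

import numpy as np
import tensorflow as tf

logits = tf.placeholder(tf.float32, [None, 3, 4])   # (N, L, T)
mask = tf.placeholder(tf.bool, [None, 3])           # (N, L), True = keep
probs = masked_softmax(logits, mask)                # normalized over dim 1 at the kept positions

with tf.Session() as sess:
    p = sess.run(probs, feed_dict={
        logits: np.random.randn(2, 3, 4).astype(np.float32),
        mask: [[True, True, False], [True, False, False]],
    })
    print(p.shape)  # (2, 3, 4); masked positions come out as exactly 0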

My solution to this problem:

def maskedSoftmax(logits, mask):
    """
    Masked softmax over dim 1
    :param logits: (N, L)
    :param mask: (N, L) boolean, True for entries to keep
    :return: probabilities (N, L); masked entries are exactly 0
    """
    # gather the kept logits into a sparse tensor and softmax only over those entries
    indices = tf.where(mask)
    values = tf.gather_nd(logits, indices)
    denseShape = tf.cast(tf.shape(logits), tf.int64)
    sparseResult = tf.sparse_softmax(tf.SparseTensor(indices, values, denseShape))
    # scatter the sparse probabilities back into a dense (N, L) tensor
    result = tf.scatter_nd(sparseResult.indices, sparseResult.values, sparseResult.dense_shape)
    result.set_shape(logits.shape)
    return result

(Edit: My first proposal had problems with None in shape of logits)
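A hypothetical call (shapes made up), assuming the same TF1 session-style setup as above:

import numpy as np
import tensorflow as tf

logits = tf.placeholder(tf.float32, [None, 4])   # (N, L)
mask = tf.placeholder(tf.bool, [None, 4])        # (N, L), True = keep
probs = maskedSoftmax(logits, mask)

with tf.Session() as sess:
    print(sess.run(probs, feed_dict={
        logits: np.random.randn(2, 4).astype(np.float32),
        mask: [[True, True, False, False], [True, True, True, False]],
    }))
    # each row sums to 1 over the kept entries; masked entries are exactly 0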

Unfortunately, as @yaroslavvb mentioned, the masked_softmax implementation by @bstriner broke for me when computing gradients, producing NaNs in the loss.

A simple workaround that got it working for me was replacing np.inf with tf.float32.max. This, of course, incurs some penalty as the padded values will not be completely negligible, but I think it is the most numerically stable approach.

I’m also asking whether there are any other downsides to this approach, as I’m only just starting out with TensorFlow and machine learning in general, so I’d appreciate knowing if this approach is actually breaking anything.
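For concreteness, here is a minimal standalone sketch of that workaround; the function name masked_softmax_finite and the tf.where-based construction are mine, not from the original code:

import tensorflow as tf

def masked_softmax_finite(logits, mask, axis=-1):
    """mask: boolean tensor with the same shape as logits, True = keep."""
    # a very large negative finite value instead of -inf, so no inf ever enters the graph
    very_negative = -tf.float32.max * tf.ones_like(logits)
    masked_logits = tf.where(mask, logits, very_negative)
    # masked entries get a vanishingly small (but not exactly zero) probability
    return tf.nn.softmax(masked_logits, axis=axis)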

softmax is written to avoid numerical inaccuracy for ill-conditioned finite inputs. It does this by subtracting off the maximum value and doing the computation relative to that. That means injecting infinities into its arguments will give you nans, as you are seeing. This numerically robust computation is key for many models. I think if you can get away with the 0 to 1 solution that is pretty decent. You could look at some of the sparse softmax cross entropy with logits functions for maximum robustness and the ability to work with a sparse subset of values.
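One way to read that last suggestion, combining it with the finite-mask idea from the previous comment, is to mask the logits and hand them to the fused loss op rather than taking the softmax yourself; the helper name masked_sparse_xent is made up:

import tensorflow as tf

def masked_sparse_xent(logits, mask, labels):
    """logits, mask: (N, C); labels: (N,) int class indices pointing at unmasked classes."""
    # push masked classes down to a very large negative logit instead of -inf
    neg = -tf.float32.max * tf.ones_like(logits)
    masked_logits = tf.where(mask, logits, neg)
    # the fused op performs its own numerically stable log-softmax internally
    losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=masked_logits)
    return tf.reduce_mean(losses)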

@NickRyder You can adapt the sparse_logsoftmax below. Inputs are dense logits and sparse indices. It gives you the normalized logits in a dense matrix. You can then use the sparse_crossentropy_loss below to get the logits at the labels.

def sparse_logsoftmax(logits, idx):
    """Log-softmax of dense `logits` restricted to the sparse positions in `idx` (int64, row-major sorted)."""
    dense_shape = tf.cast(tf.shape(logits), tf.int64)
    logits_values = tf.gather_nd(params=logits, indices=idx)
    sparse_logits = tf.SparseTensor(indices=idx, values=logits_values, dense_shape=dense_shape)
    # subtract the per-row max of the selected logits for numerical stability
    lmax = tf.sparse_reduce_max(sp_input=sparse_logits, axis=-1, keep_dims=True)
    lmax = tf.stop_gradient(lmax)
    normed_logits = logits - lmax
    normed_exp_values = tf.exp(tf.gather_nd(params=normed_logits, indices=idx))
    sparse_normed_exp = tf.SparseTensor(indices=idx, values=normed_exp_values, dense_shape=dense_shape)
    # log of the normalizer, summed only over the selected entries
    normed_sum = tf.log(tf.sparse_reduce_sum(sp_input=sparse_normed_exp, axis=-1, keep_dims=True)) + lmax
    lsm = logits - normed_sum
    return lsm


def sparse_crossentropy_loss(logits, labels):
    """Mean negative log-likelihood of `labels` (int32) under the normalized logits above."""
    n = tf.shape(labels)[0]
    # (row, label) index pairs, one per example
    idx = tf.stack((tf.range(n), labels), axis=-1)
    nll = - tf.reduce_mean(tf.gather_nd(params=logits, indices=idx))
    return nll
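A hypothetical end-to-end usage (shapes and values made up; idx must be int64 and sorted row-major for the SparseTensor ops):

import numpy as np
import tensorflow as tf

logits = tf.placeholder(tf.float32, [None, 5])
idx = tf.placeholder(tf.int64, [None, 2])    # allowed (row, column) positions
labels = tf.placeholder(tf.int32, [None])    # true class per row, must be among the allowed columns

log_probs = sparse_logsoftmax(logits, idx)   # dense matrix, normalized over the allowed entries
loss = sparse_crossentropy_loss(log_probs, labels)

with tf.Session() as sess:
    print(sess.run(loss, feed_dict={
        logits: np.random.randn(2, 5).astype(np.float32),
        idx: [[0, 0], [0, 2], [1, 1], [1, 3]],   # row 0 allows {0, 2}, row 1 allows {1, 3}
        labels: [2, 1],
    }))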