tensorflow: Infinity mask breaks gradient
I'm trying to do a softmax over selected indices, using an infinity mask to silence the unwanted entries. However, the gradients of those unwanted entries become NaN instead of 0.
The reason I didn't use a boolean mask is that the mask indices differ across my batch, so they don't end up in a nice matrix form. If there's a workaround here I'll be more than happy to adopt it.
The code I used to test the infinity mask is:
```python
import numpy as np
import tensorflow as tf

a = tf.placeholder(tf.float32, [5])
inf_mask = tf.placeholder(tf.float32, [5])

b = tf.multiply(a, inf_mask)   # multiplicative mask: unwanted entries become -inf
sf = tf.nn.softmax(b)
loss = (sf[2] - 0)             # gradient of a single softmax output
grad = tf.gradients(loss, a)

sess = tf.Session()
a_np = np.ones([5])
np_mask = np.ones([5]) * 4
np_mask[1] = -np.inf           # mask out index 1
print(sess.run([sf, grad], feed_dict={
    a: a_np,
    inf_mask: np_mask
}))
sess.close()
```
The output is:
```
[array([ 0.25, 0.  , 0.25, 0.25, 0.25], dtype=float32), [array([-0.25, nan, 0.75, -0.25, -0.25], dtype=float32)]]
```
The mask is working, but the gradient has a NaN where I think it should have been 0.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 7
- Comments: 17 (6 by maintainers)
@hongzimao @sy2737 I think you guys were on the right track originally, you just didn't debug things quite correctly. You wanted `a - inf_mask`, not a multiply. The second solution posted above is still dangerous. A stable softmax should be `e^(a - max(a))`. The key is that `exp(-inf) == 0`, `max(a, -inf) == a` and `a - inf == -inf`. Unfortunately, `0 * inf == nan`, so constructing the mask correctly is tricky. The two most numerically stable options are either a `-inf` mask or a sparse softmax (which might be better depending on what you are doing).
Below is an example of using a `-inf` mask. It has some specifics because of broadcasting, but you should be able to adapt it to whatever you need. Note that if your intention is to use this for loss calculations, you should be doing something else; softmax itself should only be used for things like attention.
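The example code from that comment is not preserved in this excerpt, so here is a minimal sketch of an additive `-inf` mask (my own reconstruction, not the commenter's code, assuming a 1-D logits tensor and at least one unmasked entry):
```python
import numpy as np
import tensorflow as tf

def masked_softmax(logits, keep):
    """Softmax restricted to the positions where `keep` is True.

    Sketch only: the -inf is injected with tf.where rather than a multiply,
    so no 0 * inf term ever appears in the graph. Assumes at least one True
    per input, otherwise every logit is -inf and the result is NaN.
    """
    neg_inf = -np.inf * tf.ones_like(logits)
    masked_logits = tf.where(keep, logits, neg_inf)
    return tf.nn.softmax(masked_logits)

a = tf.placeholder(tf.float32, [5])
keep = tf.placeholder(tf.bool, [5])
sf = masked_softmax(a, keep)
grad = tf.gradients(sf[2], a)

with tf.Session() as sess:
    print(sess.run([sf, grad], feed_dict={
        a: np.ones([5], dtype=np.float32),
        keep: np.array([True, False, True, True, True]),
    }))
# -> probabilities [0.25, 0., 0.25, 0.25, 0.25], and the masked entry of the
#    gradient is 0 rather than nan.
```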
My solution to this problem:
- `tf.sequence_mask` to create a mask from sequence lengths
- `tf.where` to get the indices
- `tf.tile` to make as many infs as required (broadcasting doesn't seem to work)
- `tf.scatter_nd` to make the mask using the indices and the infs
- `tf.nn.softmax(logits - infmask, axis=1)`

(Edit: my first proposal had problems with `None` in the shape of the logits.)
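A sketch of how that recipe might be wired together (my reading of the steps above, not the commenter's exact code; the placeholder shapes are assumptions):
```python
import numpy as np
import tensorflow as tf

logits = tf.placeholder(tf.float32, [None, 7])   # [batch, max_len], shapes assumed
lengths = tf.placeholder(tf.int32, [None])       # true length of each row

max_len = tf.shape(logits)[1]
valid = tf.sequence_mask(lengths, maxlen=max_len)            # True for real entries
pad_idx = tf.where(tf.logical_not(valid))                    # [n_pad, 2] indices of padding
n_pad = tf.shape(pad_idx)[0]
infs = tf.tile(tf.constant([np.inf], dtype=tf.float32), [n_pad])  # one inf per padded slot
infmask = tf.scatter_nd(pad_idx, infs, tf.shape(logits, out_type=tf.int64))
probs = tf.nn.softmax(logits - infmask, axis=1)              # padded columns get weight 0
```
As the next comment notes, subtracting true infinities can still produce NaN gradients in some setups, which is what motivates the finite `tf.float32.max` variant below.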
Unfortunately, as @yaroslavvb mentioned, the `masked_softmax` implementation by @bstriner broke for me when computing gradients, producing NaNs in the loss. A simple workaround that got it working for me was replacing `np.inf` with `tf.float32.max`. This, of course, incurs some penalty, as the padded values will not be completely negligible, but I think it is the most numerically stable approach. I'm also asking whether there are any other downsides to this approach: I'm only just starting out with TensorFlow and machine learning in general, so I'd appreciate knowing if it actually breaks anything.
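Adapting the thread's 5-element repro to that workaround might look like this (a sketch; the 0/1 `mask` placeholder is an assumption):
```python
import numpy as np
import tensorflow as tf

a = tf.placeholder(tf.float32, [5])
mask = tf.placeholder(tf.float32, [5])   # 1.0 to keep an entry, 0.0 to drop it

# Large-but-finite additive mask: masked logits become roughly -3.4e38, so
# their softmax weight underflows to 0 while every intermediate value stays
# finite and the gradient stays NaN-free.
b = a - (1.0 - mask) * tf.float32.max
sf = tf.nn.softmax(b)
grad = tf.gradients(sf[2], a)

with tf.Session() as sess:
    print(sess.run([sf, grad], feed_dict={
        a: np.ones([5], dtype=np.float32),
        mask: np.array([1, 0, 1, 1, 1], dtype=np.float32),
    }))
```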
softmax is written to avoid numerical inaccuracy for ill-conditioned but finite values. It does this by subtracting off the max value and doing the computation relative to that. That means that injecting infinities into its arguments will give you NaNs, as you are seeing. This numerically robust computation is key for many models. I think if you can get away with the 0-to-1 solution, that is pretty decent. You could also look at some of the sparse softmax cross-entropy-with-logits functions for maximum robustness and the ability to work with a sparse subset of values.
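As a quick numpy illustration of the two ways infinities turn into NaNs here (just the textbook max-shift formulation, not TensorFlow's actual kernel):
```python
import numpy as np

def stable_softmax(x):
    # Shift by the max so exp() cannot overflow for finite inputs.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

# Fine for large but finite logits (a naive exp would overflow here):
print(stable_softmax(np.array([1000., 1001., 1002.])))

# With infinities the shift itself can produce nan (-inf - (-inf) == nan):
print(stable_softmax(np.array([-np.inf, -np.inf, -np.inf])))

# And in the original repro the nan comes from the chain rule through the
# multiply: d(a * mask)/da is the mask, which contains -inf, and 0 * -inf == nan.
print(0.0 * -np.inf)
```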
@NickRyder You can adapt the sparse_logsoftmax below. Inputs are dense logits and sparse indices. It gives you the normalized logits in a dense matrix. You can then use the sparse_crossentropy_loss below to get the logits at the labels.
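The `sparse_logsoftmax` and `sparse_crossentropy_loss` referenced there are not included in this excerpt, so the following is only a rough guess at the general shape of such helpers (dense logits plus a list of valid (row, col) indices), not @bstriner's actual code:
```python
import tensorflow as tf

def sparse_logsoftmax(logits, valid_idx):
    """Sketch: log-softmax over only the columns listed in valid_idx
    (an int64 [k, 2] tensor of (row, col) pairs), returned as a dense
    matrix with zeros in the invalid positions."""
    shape = tf.shape(logits, out_type=tf.int64)
    n_valid = tf.shape(valid_idx)[0]
    keep = tf.scatter_nd(valid_idx, tf.ones([n_valid], tf.float32), shape) > 0.5
    # Push invalid entries to a very negative but finite value before the
    # row-wise logsumexp, echoing the tf.float32.max workaround above.
    masked = tf.where(keep, logits, tf.fill(tf.shape(logits), -1e30))
    logz = tf.reduce_logsumexp(masked, axis=1, keepdims=True)
    return tf.where(keep, masked - logz, tf.zeros_like(logits))

def sparse_crossentropy_loss(logprobs, labels):
    """Sketch: mean negative log-probability of the integer label in each row."""
    rows = tf.range(tf.shape(labels)[0])
    picked = tf.gather_nd(logprobs, tf.stack([rows, labels], axis=1))
    return -tf.reduce_mean(picked)
```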