tensorflow: [Bug] Clip by norm NaN gradients
import tensorflow as tf

a = tf.zeros([3], dtype=tf.float32)
b = tf.clip_by_norm(a, 1.)
c = tf.gradients(b, a)
s = tf.Session()
s.run(c)
# [array([nan, nan, nan], dtype=float32)]
The gradient should obviously be [1,1,1] for all vectors a of norm smaller than 1, since this function should be the identity for those vectors.
- Have I written custom code:
- OS Platform and Distribution: Ubuntu 14.10
- TensorFlow installed from: pip3
- TensorFlow version: 1.10.1
- Bazel version:
- CUDA/cuDNN version:
- GPU model and memory:
- Exact command to reproduce: see above
- Mobile device:
About this issue
- State: closed
- Created 6 years ago
- Comments: 15 (8 by maintainers)
Not quite: you will find that you can reproduce this issue even with eager execution enabled. Dynamic graphs would only help if we just needed to support the case where the L2 norm is a scalar (that is, when no axis argument is passed).
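For illustration, a minimal eager-mode sketch of the same reproduction, assuming a TF 1.x build where eager execution and tf.GradientTape are available (in TF 2.x eager is the default and tf.enable_eager_execution is not needed):

import tensorflow as tf
tf.enable_eager_execution()  # TF 1.x only; eager is the default in TF 2.x

a = tf.zeros([3], dtype=tf.float32)
with tf.GradientTape() as tape:
    tape.watch(a)            # a is a plain tensor, so it must be watched explicitly
    b = tf.clip_by_norm(a, 1.)
g = tape.gradient(b, a)
print(g)                     # at the time of this issue (TF 1.10): [nan nan nan]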
The underlying issue is that an op like tf.maximum, which we use here to pick which coordinates of the tensor need to be divided by the norm, produces a gradient of 0 with respect to the inputs it did not use to compute the output. At the same time, an op like tf.sqrt produces a gradient of upstream_gradient * 1 / (2 * sqrt(input)). If 1 / (2 * sqrt(input)) is inf or NaN, multiplying it by 0 (the upstream gradient for the coordinates that were not used) results in NaN, which is what you're seeing here.
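A minimal graph-mode sketch of that mechanism in isolation:

import tensorflow as tf

x = tf.constant(0.0)
y = tf.sqrt(x)           # dy/dx = 1 / (2 * sqrt(x)), which is inf at x = 0
z = tf.maximum(y, 1.0)   # maximum picks 1.0, so the upstream gradient into y is 0
g = tf.gradients(z, x)   # 0 * inf = NaN

with tf.Session() as sess:
    print(sess.run(g))   # [nan]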
We are looking into fixing this overall issue, but it's tricky to do so without slowing down all operations whose gradients boil down to upstream_gradient * f(x) when f(x) can be inf or NaN.
The tool I used to debug this and partially fix it, at least for the zeros case, is tf.add_check_numerics_ops; it works pretty well for identifying these issues.
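A sketch of how it can be applied to the reproduction above (TF 1.x graph mode; tf.add_check_numerics_ops attaches a CheckNumerics op to every floating-point tensor in the graph built so far):

import tensorflow as tf

a = tf.zeros([3], dtype=tf.float32)
b = tf.clip_by_norm(a, 1.)
c = tf.gradients(b, a)

# Must be called after all ops, including the gradient ops, have been added.
check_op = tf.add_check_numerics_ops()

with tf.Session() as sess:
    # Raises InvalidArgumentError naming the first tensor that contains inf/NaN.
    sess.run([c, check_op])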
Note that you can work around this with a tf.cond-based version of clip_by_norm (which behaves the same as a dynamic graph) if you only care about a scalar norm.
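A minimal sketch of such a workaround, covering only the scalar-norm case (the helper name scalar_clip_by_norm is hypothetical, not part of the TensorFlow API):

import tensorflow as tf

def scalar_clip_by_norm(t, clip_norm):
    # Only the branch that tf.cond actually takes is executed and differentiated,
    # so the division by the norm is skipped entirely when no clipping is needed.
    def clip():
        return t * (clip_norm / tf.norm(t))
    def passthrough():
        return tf.identity(t)
    return tf.cond(tf.norm(t) > clip_norm, clip, passthrough)

a = tf.zeros([3], dtype=tf.float32)
b = scalar_clip_by_norm(a, 1.)
g = tf.gradients(b, a)

with tf.Session() as sess:
    print(sess.run(g))  # [array([1., 1., 1.], dtype=float32)]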