tensorflow: tf.keras.layers.LSTM + tf.function fails to compute jacobian with pfor on GPU
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
- TensorFlow installed from (source or binary): Binary
- TensorFlow version (use command below): v1.12.1-34938-g99fea8da0d 2.3.0-rc0
- Python version: 3.7
- CUDA/cuDNN version:
  $ conda list | grep cud
  cudatoolkit 10.1.243 h6bb024c_0
  cudnn 7.6.5 cuda10.1_0
- GPU model and memory: Nvidia GeForce GTX 1080 Ti
Describe the current behavior
TensorFlow crashes when computing GradientTape.jacobians for an output of tf.keras.layers.LSTM within a tf.function when running on GPU.
Describe the expected behavior
The graph compiles correctly and efficiently computes the Jacobian.
Standalone code to reproduce the issue
import tensorflow as tf

batch_size, sequence_length = 2, 3

# Simple masked LSTM model: (input, mask) -> per-step scalar output.
x_input = tf.keras.layers.Input(
    shape=(sequence_length, 1),
    name='input',
    dtype=tf.float32)
mask_input = tf.keras.layers.Input(
    shape=(sequence_length, ),
    name='mask',
    dtype=tf.bool)
out = tf.keras.layers.LSTM(
    units=8,
    return_sequences=True,
    return_state=False,
)(x_input, mask=mask_input)
out = tf.keras.layers.Dense(1, activation='linear')(out)
model = tf.keras.Model((x_input, mask_input), out)

# Random inputs and a reversed sequence mask (valid steps at the end).
x = tf.random.uniform(
    (batch_size, sequence_length, x_input.shape[-1]),
    dtype=x_input.dtype)
mask = tf.sequence_mask(
    tf.random.uniform(
        (batch_size, ), minval=0, maxval=sequence_length, dtype=tf.int32),
    maxlen=sequence_length,
)[..., ::-1]


@tf.function(experimental_relax_shapes=True)
def compute_jacobian():
    y_true = tf.zeros(batch_size)
    with tf.GradientTape() as tape:
        y = model((x, mask))
        y = tf.reduce_sum(y, axis=1)
        loss = tf.losses.MSE(y_pred=y, y_true=y_true)
    jacobian = tape.jacobian(
        loss, model.trainable_variables, experimental_use_pfor=True)
    return jacobian


jacobian = compute_jacobian()
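For reference, when the Jacobian computation succeeds (e.g. on CPU, or with pfor disabled), I would expect one Jacobian tensor per trainable variable, with shape loss.shape + variable.shape. A quick, purely illustrative way to check that (not part of the reproduction):

# Illustrative only: print the shape of each Jacobian entry. With the
# loss above having shape (batch_size,), each entry should have shape
# (batch_size,) + variable.shape.
for variable, jac in zip(model.trainable_variables, jacobian):
    print(variable.name, tuple(variable.shape), tuple(jac.shape))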
Other info / logs
Running the above code produces a long error trace that ends with:
NotImplementedError: Vectorization tried to stack variant tensor Tensor("gradients/while_grad/gradients/grad_ys_4/pfor/Identity:0", shape=(), dtype=variant). This is likely because vectorization of that variant is not fully supported yet.
I know that the pfor flag is experimental, and setting experimental_use_pfor=False makes the code run (a sketch of that fallback is included below). However, in that case the resulting graph is so slow that it is effectively unusable even for this simple 2-element Jacobian. Passing parallel_iterations=10, experimental_use_pfor=False results in the following warnings, which might have something to do with the slowness:
2020-07-03 08:31:26.889383: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] function_optimizer failed: Invalid argument: Input 0 of node while/enter/_15 was passed bool from functional_1/lstm/PartitionedCall:5 incompatible with expected int32.
2020-07-03 08:31:26.933046: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] layout failed: Out of range: src_output = 26, but num_outputs is only 26
2020-07-03 08:31:26.978710: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] function_optimizer failed: Invalid argument: Input 0 of node while/enter/_15 was passed bool from functional_1/lstm/PartitionedCall:5 incompatible with expected int32.
2020-07-03 08:31:27.036554: W tensorflow/core/common_runtime/process_function_library_runtime.cc:773] Ignoring multi-device function optimization failure: Invalid argument: Input 0 of node while/enter/_15 was passed bool from functional_1/lstm/PartitionedCall:5 incompatible with expected int32.
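For completeness, the slow fallback mentioned above looks roughly like this (a sketch only; compute_jacobian_no_pfor is just a renamed copy of the function above, and parallel_iterations=10 is simply a value I tried, not a recommendation):

# Sketch of the non-pfor fallback: it avoids the variant-stacking error,
# but the resulting graph is prohibitively slow.
@tf.function(experimental_relax_shapes=True)
def compute_jacobian_no_pfor():
    y_true = tf.zeros(batch_size)
    with tf.GradientTape() as tape:
        y = model((x, mask))
        y = tf.reduce_sum(y, axis=1)
        loss = tf.losses.MSE(y_pred=y, y_true=y_true)
    return tape.jacobian(
        loss,
        model.trainable_variables,
        experimental_use_pfor=False,
        parallel_iterations=10)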
Any workarounds would also be much appreciated, and I'd even be happy to contribute a fix for this if one is doable without much C++ experience.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 17 (13 by maintainers)
Closing this issue now since the original problem has been fixed. @hartikainen if these warnings cause any performance issues for you, feel free to open a new issue, as they do not seem to be related. Thanks everyone!
@nikitamaia thank you for the update! Confirmed that the fix in #37053 is working for calculating the Jacobian with parallel_for enabled.
Hi all, please see the last comment in #37053, which seems to be the same underlying issue and is now fixed. With nightly I was able to run the code originally posted by @hartikainen. Let me know if anyone is still seeing problems when running with tf-nightly.
After some investigation, it looks like this is a known issue and is being worked on. I can update this thread when there is a fix.
Was able to reproduce the issue with TF v2.2, TF v2.3.0-rc0 and TF-nightly (i.e. v2.4.0-dev20200703). Please find the attached gist. Thanks!