tensorflow: tf.keras.layers.LSTM + tf.function fails to compute jacobian with pfor on GPU
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
- TensorFlow installed from (source or binary): Binary
- TensorFlow version (use command below): v1.12.1-34938-g99fea8da0d 2.3.0-rc0
- Python version: 3.7
- CUDA/cuDNN version:
  $ conda list | grep cud
  cudatoolkit 10.1.243 h6bb024c_0
  cudnn 7.6.5 cuda10.1_0
- GPU model and memory: Nvidia GeForce GTX 1080 Ti
Describe the current behavior
TensorFlow crashes when computing GradientTape.jacobians for an output of tf.keras.layers.LSTM within a tf.function when running on GPU.
Describe the expected behavior
The graph compiles correctly and efficiently computes the Jacobian.
Standalone code to reproduce the issue
import tensorflow as tf

batch_size, sequence_length = 2, 3

# Simple masked LSTM model: (input, mask) -> per-step scalar output.
x_input = tf.keras.layers.Input(
    shape=(sequence_length, 1),
    name='input',
    dtype=tf.float32)
mask_input = tf.keras.layers.Input(
    shape=(sequence_length, ),
    name='mask',
    dtype=tf.bool)
out = tf.keras.layers.LSTM(
    units=8,
    return_sequences=True,
    return_state=False,
)(x_input, mask=mask_input)
out = tf.keras.layers.Dense(1, activation='linear')(out)
model = tf.keras.Model((x_input, mask_input), out)

# Random inputs and a reversed sequence mask (valid steps at the end).
x = tf.random.uniform(
    (batch_size, sequence_length, x_input.shape[-1]),
    dtype=x_input.dtype)
mask = tf.sequence_mask(
    tf.random.uniform(
        (batch_size, ), minval=0, maxval=sequence_length, dtype=tf.int32),
    maxlen=sequence_length,
)[..., ::-1]


@tf.function(experimental_relax_shapes=True)
def compute_jacobian():
    y_true = tf.zeros(batch_size)
    with tf.GradientTape() as tape:
        y = model((x, mask))
        y = tf.reduce_sum(y, axis=1)
        loss = tf.losses.MSE(y_pred=y, y_true=y_true)
    jacobian = tape.jacobian(
        loss, model.trainable_variables, experimental_use_pfor=True)
    return jacobian


jacobian = compute_jacobian()
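For reference, when the Jacobian computation succeeds (e.g. on CPU, or with pfor disabled), I would expect one Jacobian tensor per trainable variable, with shape loss.shape + variable.shape. A quick, purely illustrative way to check that (not part of the reproduction):

# Illustrative only: print the shape of each Jacobian entry. With the
# loss above having shape (batch_size,), each entry should have shape
# (batch_size,) + variable.shape.
for variable, jac in zip(model.trainable_variables, jacobian):
    print(variable.name, tuple(variable.shape), tuple(jac.shape))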
Other info / logs
Running the above code produces a long error trace that ends with:
NotImplementedError: Vectorization tried to stack variant tensor Tensor("gradients/while_grad/gradients/grad_ys_4/pfor/Identity:0", shape=(), dtype=variant). This is likely because vectorization of that variant is not fully supported yet.
I know that the pfor flag is experimental, and setting experimental_use_pfor=False makes the code run (a sketch of that fallback is included below). However, in that case the resulting graph is so slow that it is effectively unusable even for this simple 2-element Jacobian. Passing parallel_iterations=10, experimental_use_pfor=False results in the following warnings, which might have something to do with the slowness:
2020-07-03 08:31:26.889383: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] function_optimizer failed: Invalid argument: Input 0 of node while/enter/_15 was passed bool from functional_1/lstm/PartitionedCall:5 incompatible with expected int32.
2020-07-03 08:31:26.933046: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] layout failed: Out of range: src_output = 26, but num_outputs is only 26
2020-07-03 08:31:26.978710: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:581] function_optimizer failed: Invalid argument: Input 0 of node while/enter/_15 was passed bool from functional_1/lstm/PartitionedCall:5 incompatible with expected int32.
2020-07-03 08:31:27.036554: W tensorflow/core/common_runtime/process_function_library_runtime.cc:773] Ignoring multi-device function optimization failure: Invalid argument: Input 0 of node while/enter/_15 was passed bool from functional_1/lstm/PartitionedCall:5 incompatible with expected int32.
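For completeness, the slow fallback mentioned above looks roughly like this (a sketch only; compute_jacobian_no_pfor is just a renamed copy of the function above, and parallel_iterations=10 is simply a value I tried, not a recommendation):

# Sketch of the non-pfor fallback: it avoids the variant-stacking error,
# but the resulting graph is prohibitively slow.
@tf.function(experimental_relax_shapes=True)
def compute_jacobian_no_pfor():
    y_true = tf.zeros(batch_size)
    with tf.GradientTape() as tape:
        y = model((x, mask))
        y = tf.reduce_sum(y, axis=1)
        loss = tf.losses.MSE(y_pred=y, y_true=y_true)
    return tape.jacobian(
        loss,
        model.trainable_variables,
        experimental_use_pfor=False,
        parallel_iterations=10)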
Any workarounds would also be much appreciated, and I'd even be happy to contribute a fix for this if one is doable without much C++ experience.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 17 (13 by maintainers)
Closing this issue now since the original problem has been fixed. @hartikainen if these warnings cause any performance issues for you, feel free to open a new issue, as they do not seem to be related. Thanks everyone!
@nikitamaia thank you for the update! Confirmed that the fix in #37053 is working for calculating the Jacobian with parallel_for enabled.
Hi all, please see the last comment in #37053, which seems to be the same underlying issue and is now fixed. With nightly I was able to run the code originally posted by @hartikainen. Let me know if anyone is still seeing problems when running with tf-nightly.
After some investigation, it looks like this is a known issue and is being worked on. I can update this thread when there is a fix.
Was able to reproduce the issue with TF v2.2, TF v2.3.0-rc0 and TF-nightly (i.e. v2.4.0-dev20200703). Please find the attached gist. Thanks!