tensorflow: Keras LSTM does not work with tf.distribute [2.0]
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Minor tweak to tutorial code.
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.
- TensorFlow version (use command below): tf-nightly-gpu-2.0-preview
- Python version: 3.6
- CUDA/cuDNN version: 10.0/7.4
I copy-pasted this tutorial code (MNIST distributed training with TF 2.0) but used tf.distribute.MirroredStrategy() (instead of MultiWorkerMirroredStrategy). It worked. Then I changed the model to a simple Embedding -> LSTM -> Dense architecture. It broke with the following error:
Cannot place the graph because a reference or resource edge connects colocation groups with incompatible assigned devices: /job:worker/replica:0/task:0/device:GPU:0 vs /job:worker/replica:0/task:0/device:CPU:0. The edge src node is while_22/exit/_100 , and the dst node is while_0_RetVal
[[node sequential/lstm/StatefulPartitionedCall (defined at tf2_multiworker_tutorial/main.py:109) ]]
This was executed on a single remote machine with 2 GPUs. Note that I've been seeing this error ever since the initial 2.0 alpha release. The code is as follows:
import tensorflow as tf

BUFFER_SIZE = 10000
BATCH_SIZE = 64
LEARNING_RATE = 1e-4


def input_fn(mode, input_context=None):
    # Synthetic dataset of constant integer sequences, sharded per input pipeline.
    max_seq_len = 3
    rnn_dataset = (
        tf.data.Dataset.range(10)
        .repeat(10 * BUFFER_SIZE)
        .map(lambda x: (
            tf.ones(shape=(max_seq_len,), dtype=tf.int64),
            tf.ones(shape=(max_seq_len,), dtype=tf.int64))))
    if input_context:
        rnn_dataset = rnn_dataset.shard(
            input_context.num_input_pipelines,
            input_context.input_pipeline_id)
    return rnn_dataset.batch(BATCH_SIZE)


def model_fn(features, labels, mode):
    # Simple Embedding -> LSTM -> Dense model; with the LSTM removed, training works.
    vocab_size = 100
    embed_size = 16
    state_size = 7
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_size),
        tf.keras.layers.LSTM(units=state_size, return_sequences=True),
        tf.keras.layers.Dense(10, activation='softmax')])
    logits = model(features, training=False)
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            tf.estimator.ModeKeys.PREDICT,
            predictions={'logits': logits})
    optimizer = tf.compat.v1.train.GradientDescentOptimizer(
        learning_rate=LEARNING_RATE)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True,
        reduction=tf.keras.losses.Reduction.NONE)
    loss = tf.reduce_sum(loss(labels, logits)) * (1. / BATCH_SIZE)
    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss)
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss,
        train_op=optimizer.minimize(
            loss, tf.compat.v1.train.get_or_create_global_step()))


def main():
    # Single-machine, multi-GPU strategy passed to the Estimator via RunConfig.
    strategy = tf.distribute.MirroredStrategy()
    config = tf.estimator.RunConfig(
        train_distribute=strategy,
        log_step_count_steps=1)
    classifier = tf.estimator.Estimator(
        model_fn=model_fn, model_dir='/tmp/multiworker', config=config)
    tf.estimator.train_and_evaluate(
        classifier,
        train_spec=tf.estimator.TrainSpec(input_fn=input_fn, max_steps=10),
        eval_spec=tf.estimator.EvalSpec(input_fn=input_fn))


if __name__ == '__main__':
    main()
Again, the common theme I’ve observed is that if tf.keras.LSTM is part of my model and I’m using tf.distribute, it breaks with this error. Otherwise, it works just fine.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 34 (9 by maintainers)
UPDATE: something that might be useful to know: if I replace tf.keras.LSTM with tf.keras.RNN(cell=tf.compat.v1.nn.rnn_cell.LSTMCell(...)), everything works as expected for single-worker multi-GPU. The issue is definitely within tf.keras.LSTM itself. For multi-worker multi-GPU, I still get the collective ops error (which the tutorial suggests should only be a warning). That error goes away if I wrap things with tf.estimator.
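Concretely, the replacement amounts to swapping the single recurrent layer in the repro's model_fn. A minimal sketch, assuming the wrapped cell uses the same state_size as the original LSTM layer (the num_units argument is my reading of the elided "..."):

    import tensorflow as tf

    # Workaround described above: wrap a v1 LSTMCell in the generic Keras RNN layer
    # instead of using tf.keras.layers.LSTM directly. Sizes match the repro's model_fn.
    vocab_size, embed_size, state_size = 100, 16, 7
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_size),
        tf.keras.layers.RNN(
            tf.compat.v1.nn.rnn_cell.LSTMCell(num_units=state_size),
            return_sequences=True),
        tf.keras.layers.Dense(10, activation='softmax')])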
Overall, @qlzh727 @karmel, are these being worked on? I understand there are a ton of changes for TF 2.0, but having Keras LSTM be incompatible with tf.distribute seems like a fairly significant issue.
Yes.
@qlzh727 I have this issue when using a TimeDistributed LSTM with mask_zero=True. This is my model:
model = tf.keras.Sequential()
embeding_layer = layers.Embedding(self.vocab_size, self.word_vector_dim,
                                  weights=[word_embeding_matrix],
                                  trainable=False, mask_zero=True)
model.add(TimeDistributed(embeding_layer))
model.add(TimeDistributed(tf.keras.layers.LSTM(50)))
model.add(tf.keras.layers.Bidirectional(costumized_lstm.Costumized_LSTM(50)))
# model.add(tf.keras.layers.Bidirectional(costumized_lstm.Costumized_LSTM(100)))
model.add(layers.Dense(3, activation='softmax'))
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=opt, loss='categorical_crossentropy',
              metrics=['accuracy', self.f1_m, self.precision_m, self.recall_m])
self.model = model
and this is the error:

C:\Users\jalil\PycharmProjects\untitled1\venv\Scripts\python.exe C:/Users/jalil/PycharmProjects/untitled1/main_file.py
2019-11-14 14:11:36.983144: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2019-11-14 14:11:45.638679: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2019-11-14 14:11:46.216495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 970M major: 5 minor: 2 memoryClockRate(GHz): 1.038 pciBusID: 0000:01:00.0
2019-11-14 14:11:46.216676: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-14 14:11:46.217282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-14 14:11:50.885396: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-11-14 14:11:51.214275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 970M major: 5 minor: 2 memoryClockRate(GHz): 1.038 pciBusID: 0000:01:00.0
2019-11-14 14:11:51.214484: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-14 14:11:51.218182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-14 14:11:51.905201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-14 14:11:51.905307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2019-11-14 14:11:51.905366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2019-11-14 14:11:51.906228: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4757 MB memory) -> physical GPU (device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0, compute capability: 5.2)
WARNING:tensorflow:From C:\Users\jalil\PycharmProjects\untitled1\venv\lib\site-packages\tensorflow_core\python\keras\backend.py:3983: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Train on 35000 samples, validate on 6447 samples
Epoch 1/1000
2019-11-14 14:12:33.178251: W tensorflow/core/grappler/optimizers/implementation_selector.cc:310] Skipping optimization due to error while loading function libraries: Invalid argument: Functions '__inference___backward_cudnn_lstm_with_fallback_671418_672877' and '__inference___backward_cudnn_lstm_with_fallback_671418_672877_specialized_for_StatefulPartitionedCall_at___inference_distributed_function_675292' both implement 'lstm_81cdaa4a-fa6f-4675-abbb-02fb4cd0189b' but their signatures do not match.
2019-11-14 14:12:33.544669: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2019-11-14 14:12:34.397677: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll

32/35000 [...] - ETA: 2:03:42
2019-11-14 14:12:35.151804: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at cudnn_rnn_ops.cc:1498 : Unknown: CUDNN_STATUS_BAD_PARAM in tensorflow/stream_executor/cuda/cuda_dnn.cc(1424): 'cudnnSetRNNDataDescriptor( data_desc.get(), data_type, layout, max_seq_length, batch_size, data_size, seq_lengths_array, (void*)&padding_fill)'
2019-11-14 14:12:35.152137: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: CUDNN_STATUS_BAD_PARAM in tensorflow/stream_executor/cuda/cuda_dnn.cc(1424): 'cudnnSetRNNDataDescriptor( data_desc.get(), data_type, layout, max_seq_length, batch_size, data_size, seq_lengths_array, (void*)&padding_fill)'
[[{{node cond_64/then/_0/CudnnRNNV3}}]]
2019-11-14 14:12:35.152541: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: {{function_node __forward_cudnn_lstm_with_fallback_672876_specialized_for_sequential_time_distributed_1_lstm_StatefulPartitionedCall_at___inference_distributed_function_675292}} {{function_node __forward_cudnn_lstm_with_fallback_672876_specialized_for_sequential_time_distributed_1_lstm_StatefulPartitionedCall_at___inference_distributed_function_675292}} CUDNN_STATUS_BAD_PARAM in tensorflow/stream_executor/cuda/cuda_dnn.cc(1424): 'cudnnSetRNNDataDescriptor( data_desc.get(), data_type, layout, max_seq_length, batch_size, data_size, seq_lengths_array, (void*)&padding_fill)'
[[{{node cond_64/then/_0/CudnnRNNV3}}]]
[[sequential/time_distributed_1/lstm/StatefulPartitionedCall]]

32/35000 [...] - ETA: 2:25:41
Traceback (most recent call last):
  File "C:/Users/jalil/PycharmProjects/untitled1/main_file.py", line 102, in <module>
    main_model_instance.train_model(train_batch_data, train_batch_labels, test_batch_data, test_batch_labels)
  File "C:\Users\jalil\PycharmProjects\untitled1\main_model.py", line 103, in train_model
    history = self.model.fit(x=np.array(train_batch_data), y=np.array(train_batch_labels), validation_data=(np.array(test_batch_data), np.array(test_batch_labels)), epochs=1000, callbacks=[tensorboard_callback])
  File "C:\Users\jalil\PycharmProjects\untitled1\venv\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 734, in fit
    use_multiprocessing=use_multiprocessing)
  File "C:\Users\jalil\PycharmProjects\untitled1\venv\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 324, in fit
    total_epochs=epochs)
  File "C:\Users\jalil\PycharmProjects\untitled1\venv\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "C:\Users\jalil\PycharmProjects\untitled1\venv\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "C:\Users\jalil\PycharmProjects\untitled1\venv\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 439, in __call__
    return self._stateless_fn(*args, **kwds)
  File "C:\Users\jalil\PycharmProjects\untitled1\venv\lib\site-packages\tensorflow_core\python\eager\function.py", line 1822, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "C:\Users\jalil\PycharmProjects\untitled1\venv\lib\site-packages\tensorflow_core\python\eager\function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "C:\Users\jalil\PycharmProjects\untitled1\venv\lib\site-packages\tensorflow_core\python\eager\function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "C:\Users\jalil\PycharmProjects\untitled1\venv\lib\site-packages\tensorflow_core\python\eager\function.py", line 511, in call
    ctx=ctx)
  File "C:\Users\jalil\PycharmProjects\untitled1\venv\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: [Derived] CUDNN_STATUS_BAD_PARAM in tensorflow/stream_executor/cuda/cuda_dnn.cc(1424): 'cudnnSetRNNDataDescriptor( data_desc.get(), data_type, layout, max_seq_length, batch_size, data_size, seq_lengths_array, (void*)&padding_fill)'
[[{{node cond_64/then/_0/CudnnRNNV3}}]]
[[sequential/time_distributed_1/lstm/StatefulPartitionedCall]] [Op:__inference_distributed_function_675292]

Function call stack:
distributed_function -> distributed_function -> distributed_function
Do you have any suggestions, please?
Hi @mckinziebrandon, we also got a bug report about this internally and we are trying to address it.
For now, I think you can work around the issue by using tf.keras.RNN(cell=tf.keras.LSTMCell(...)), which should give the same numerical results.
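For reference, a minimal sketch of that suggestion in a Keras compile/fit setup (the layer sizes, optimizer, and MirroredStrategy scope here are illustrative assumptions, not taken from the thread):

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Embedding(input_dim=100, output_dim=16),
            # Generic RNN layer wrapping a Keras LSTMCell rather than tf.keras.layers.LSTM.
            tf.keras.layers.RNN(tf.keras.layers.LSTMCell(7), return_sequences=True),
            tf.keras.layers.Dense(10, activation='softmax')])
        model.compile(
            optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),
            loss=tf.keras.losses.SparseCategoricalCrossentropy())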
@karmel I can try that, sure. However, if that’s the case (that people should not mix them), then perhaps TensorFlow should not suggest doing so in their own tutorials (recall that this post is a fairly trivial modification to an official tutorial). It does look like a lot has been updated regarding the tutorials on distributed training, so I’ll check those out too. Will update here after I try your suggestion.
@gadagashwini Just confirmed that yes, this is still an issue on the current TF gpu-nightly (2.0).
@mckinziebrandon Ran the code in Colab using a GPU and got the error below: AttributeError: module 'tensorflow._api.v1.keras.losses' has no attribute 'SparseCategoricalCrossentropy'.
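That AttributeError points at a TF 1.x API build ('tensorflow._api.v1.keras.losses') rather than the 2.0 preview the report targets; a quick sanity check before rerunning (this snippet is my own, not from the thread):

    import tensorflow as tf

    # The repro targets tf-nightly-gpu-2.0-preview; on the 1.x runtime behind the
    # error above, tf.keras.losses.SparseCategoricalCrossentropy is missing.
    print(tf.__version__)
    print(hasattr(tf.keras.losses, 'SparseCategoricalCrossentropy'))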