tensorflow: [TF 2.0] Using keras.metrics in TPU training results in error
I am trying to train a BERT model from https://github.com/tensorflow/models/tree/master/official/nlp on a TPU in Google Colab. I changed the metrics list passed to the model in the compile method to:

```python
bert_model.compile(optimizer=optimizer, loss=loss_fn, metrics=get_metrics())
```

where get_metrics is a function that returns a list of metrics (“accuracy” plus instances of the Recall and Precision classes built into tensorflow.keras.metrics):
```python
from tensorflow.keras.metrics import Recall, Precision

def get_metrics():
    return ["accuracy",
            Recall(),
            Precision()]
```
Training results in the following error (after one epoch ends, before validation statistics are displayed):
```
I1018 16:34:07.313311 140541208393600 remote.py:151] Entering into master device scope: /job:worker/replica:0/task:0
2019-10-18 16:34:07.359467: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-10-18 16:34:07.465723: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-10-18 16:34:07.465842: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (7b6f1b4d4089): /proc/driver/nvidia/version does not exist
2019-10-18 16:34:07.466260: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-10-18 16:34:07.472748: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-10-18 16:34:07.473076: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3172f40 executing computations on platform Host. Devices:
2019-10-18 16:34:07.473114: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
2019-10-18 16:34:07.475920: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> 10.29.203.98:8470}
2019-10-18 16:34:07.475955: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30501}
2019-10-18 16:34:07.476742: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:30501
2019-10-18 16:34:07.497844: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> 10.29.203.98:8470}
2019-10-18 16:34:07.497905: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:30501}
INFO:tensorflow:Initializing the TPU system: 10.29.203.98:8470
I1018 16:34:07.499603 140541208393600 tpu_strategy_util.py:70] Initializing the TPU system: 10.29.203.98:8470
INFO:tensorflow:Clearing out eager caches
I1018 16:34:15.119202 140541208393600 tpu_strategy_util.py:94] Clearing out eager caches
INFO:tensorflow:Finished initializing TPU system.
I1018 16:34:15.121769 140541208393600 tpu_strategy_util.py:114] Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
I1018 16:34:15.128222 140541208393600 tpu_system_metadata.py:148] Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
I1018 16:34:15.128440 140541208393600 tpu_system_metadata.py:149] *** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
I1018 16:34:15.129121 140541208393600 tpu_system_metadata.py:150] *** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
I1018 16:34:15.129209 140541208393600 tpu_system_metadata.py:152] *** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I1018 16:34:15.129295 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I1018 16:34:15.129720 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
I1018 16:34:15.129811 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
I1018 16:34:15.129892 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
I1018 16:34:15.129969 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
I1018 16:34:15.130045 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
I1018 16:34:15.130121 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
I1018 16:34:15.130197 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
I1018 16:34:15.130281 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
I1018 16:34:15.130358 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
I1018 16:34:15.130436 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
I1018 16:34:15.130511 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I1018 16:34:15.130593 140541208393600 tpu_system_metadata.py:154] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
I1018 16:34:15.248266 140541208393600 train.py:212] Training using TF 2.0 Keras compile/fit API with distrubuted strategy.
WARNING:tensorflow:Expected a shuffled dataset but input dataset `x` is not shuffled. Please invoke `shuffle()` on input dataset.
W1018 16:35:33.236943 140541208393600 training_utils.py:1547] Expected a shuffled dataset but input dataset `x` is not shuffled. Please invoke `shuffle()` on input dataset.
Train on 129 steps, validate on 65 steps
Epoch 1/5
2019-10-18 16:38:03.018892: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
2019-10-18 16:38:03.020371: E tensorflow/core/platform/default/device_tracer.cc:70] CUDA error: CUDA_ERROR_NO_DEVICE
1/129 [..............................] - ETA: 5:12:28 - loss: 1.0083 - accuracy: 0.2031 - recall: 0.1719 - precision: 0.2619WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (1.610206). Check your callbacks.
W1018 16:38:06.456197 140541208393600 callbacks.py:244] Method (on_train_batch_end) is slow compared to the batch update (1.610206). Check your callbacks.
128/129 [============================>.] - ETA: 1s - loss: 0.5022 - accuracy: 0.7563 - recall: 0.5862 - precision: 0.81392019-10-18 16:38:45.271991: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:50] Unable to destroy remote tensor handles: Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 0
Additional GRPC error information:
{"created":"@1571416725.271891392","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 0","grpc_status":3}
2019-10-18 16:38:45.272429: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:50] Unable to destroy remote tensor handles: Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 1
Additional GRPC error information:
{"created":"@1571416725.272350919","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 1","grpc_status":3}
2019-10-18 16:38:45.272841: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:50] Unable to destroy remote tensor handles: Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 2
Additional GRPC error information:
{"created":"@1571416725.272756237","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 2","grpc_status":3}
2019-10-18 16:38:45.273165: E tensorflow/core/distributed_runtime/eager/remote_tensor_handle_data.cc:50] Unable to destroy remote tensor handles: Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 3
Additional GRPC error information:
{"created":"@1571416725.273105048","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 55877, Output num: 3","grpc_status":3}
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 340, in <module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 332, in main
run_bert(strategy, input_meta_data)
File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 287, in run_bert
use_keras_compile_fit=FLAGS.use_keras_compile_fit)
File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 226, in run_bert_classifier
custom_callbacks=None)
File "/gdrive/My Drive/DeepLearningBERT/nn/train.py", line 143, in run_keras_compile_fit
callbacks=custom_callbacks)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 728, in fit
use_multiprocessing=use_multiprocessing)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 685, in fit
steps_name='steps_per_epoch')
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 439, in model_iteration
steps_name='validation_steps')
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 299, in model_iteration
batch_outs = f(actual_inputs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/distribute/distributed_training_utils.py", line 878, in execution_function
return [out.numpy() for out in distributed_function(input_fn)]
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py", line 526, in _call
return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds) # pylint: disable=protected-access
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
self.captured_inputs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
ctx, args, cancellation_manager=cancellation_manager)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 511, in call
ctx=ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnimplementedError: Compilation failure: Asked to propagate a dynamic dimension from hlo %tuple.5198 = (pred[], f32[4,2]{1,0}) tuple(pred[] %convert.5196, f32[4,2]{1,0} %add.5004), metadata={op_type="If" op_name="metrics/precision/assert_greater_equal/Assert/AssertGuard"}@{1}@0 to hlo %conditional.5209 = (pred[]) conditional(pred[] %convert.5196, (pred[], f32[4,2]{1,0}) %tuple.5198, (pred[], f32[4,2]{1,0}) %tuple.5198), true_computation=%metrics_precision_assert_greater_equal_Assert_AssertGuard_true_127733_const_0__.5199, false_computation=%metrics_precision_assert_greater_equal_Assert_AssertGuard_false_127734_const_0__.5204, metadata={op_type="If" op_name="metrics/precision/assert_greater_equal/Assert/AssertGuard"}, which is not implemented.
TPU compilation failed
[[{{node tpu_compile_succeeded_assert/_6193329545322784681/_7}}]]
Additional GRPC error information:
{"created":"@1571416725.270929013","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":" Compilation failure: Asked to propagate a dynamic dimension from hlo %tuple.5198 = (pred[], f32[4,2]{1,0}) tuple(pred[] %convert.5196, f32[4,2]{1,0} %add.5004), metadata={op_type="If" op_name="metrics/precision/assert_greater_equal/Assert/AssertGuard"}@{1}@0 to hlo %conditional.5209 = (pred[]) conditional(pred[] %convert.5196, (pred[], f32[4,2]{1,0}) %tuple.5198, (pred[], f32[4,2]{1,0}) %tuple.5198), true_computation=%metrics_precision_assert_greater_equal_Assert_AssertGuard_true_127733_const_0__.5199, false_computation=%metrics_precision_assert_greater_equal_Assert_AssertGuard_false_127734_const_0__.5204, metadata={op_type="If" op_name="metrics/precision/assert_greater_equal/Assert/AssertGuard"}, which is not implemented.\n\tTPU compilation failed\n\t [[{{node tpu_compile_succeeded_assert/_6193329545322784681/_7}}]]","grpc_status":12} [Op:__inference_distributed_function_127913]
Function call stack:
distributed_function -> distributed_function
2019-10-18 16:38:53.401848: E tensorflow/core/distributed_runtime/rpc/eager/grpc_eager_client.cc:72] Remote EagerContext with id 6450803200035565614 does not seem to exist.
```
With only “accuracy” in the returned list, training works fine and finishes all epochs. With a custom metric like:
```python
import tensorflow as tf

eps = 1e-7  # small constant to avoid division by zero (value assumed; `eps` is defined elsewhere in my script)

def precision(y_true, y_pred):
    y_pred = tf.math.rint(y_pred)                    # round sigmoid outputs to 0/1
    TP = tf.math.reduce_sum(y_pred * y_true)         # true positives
    FP = tf.math.reduce_sum(y_pred * (1 - y_true))   # false positives
    _precision = tf.math.divide(TP, (TP + FP + eps))
    return _precision
```
it works as well, but the reported values are not correct. I suppose this happens because the TPU executes several steps per loop and somehow (I didn’t dig too deeply into it) the per-batch metric values get mixed up when they are aggregated. I tried the built-in metric classes to verify the behavior, but that resulted in the error shown above.
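A stateful metric that accumulates its counts across batches (instead of returning a per-batch value) should at least avoid the per-batch averaging problem. A minimal sketch of such a hand-rolled replacement, assuming binary 0/1 labels and sigmoid outputs; this is illustrative code I wrote, not something from the models repo:

```python
import tensorflow as tf

class StatefulPrecision(tf.keras.metrics.Metric):
    """Accumulates TP/FP across batches and reports TP / (TP + FP)."""

    def __init__(self, name="precision", **kwargs):
        super().__init__(name=name, **kwargs)
        self.tp = self.add_weight(name="tp", initializer="zeros")
        self.fp = self.add_weight(name="fp", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # sample_weight is ignored in this sketch.
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.math.rint(tf.cast(y_pred, tf.float32))  # threshold at 0.5
        self.tp.assign_add(tf.reduce_sum(y_pred * y_true))
        self.fp.assign_add(tf.reduce_sum(y_pred * (1.0 - y_true)))

    def result(self):
        return tf.math.divide_no_nan(self.tp, self.tp + self.fp)

    def reset_states(self):
        self.tp.assign(0.0)
        self.fp.assign(0.0)
```

It can then be passed in the metrics list just like the built-in classes, e.g. `metrics=["accuracy", StatefulPrecision()]`.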
Snippet of the training call (the function is called run_keras_compile_fit in the GitHub repo I linked; it can be found in bert/run_classifier.py, with almost no custom code added):
```python
with strategy.scope():
    training_dataset = train_input_fn()
    evaluation_dataset = eval_input_fn()
    bert_model, sub_model = model_fn()
    optimizer = bert_model.optimizer

    if init_checkpoint:
        checkpoint = tf.train.Checkpoint(model=sub_model)
        checkpoint.restore(init_checkpoint).assert_existing_objects_matched()

    bert_model.compile(optimizer=optimizer, loss=loss_fn, metrics=get_metrics())

    summary_dir = os.path.join(model_dir, 'summaries')
    summary_callback = tf.keras.callbacks.TensorBoard(summary_dir)
    checkpoint_path = os.path.join(model_dir, 'checkpoint')
    checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
        checkpoint_path, save_weights_only=True, save_best_only=True, mode='min')

    if custom_callbacks is not None:
        custom_callbacks += [summary_callback, checkpoint_callback]
    else:
        custom_callbacks = [summary_callback, checkpoint_callback]

    bert_model.fit(
        x=training_dataset,
        validation_data=evaluation_dataset,
        steps_per_epoch=steps_per_epoch,
        epochs=epochs,
        validation_steps=eval_steps,
        callbacks=custom_callbacks)

    return bert_model
```
In Colab I installed the stable release of TensorFlow 2.0, since the nightly version doesn’t work well with Colab’s TPUs for now. Are the Keras metrics supposed to work with TPUs, or is this not supported yet?
tf.keras.metrics are not working with TPUs on Colab with TF version 2.3:
https://colab.research.google.com/drive/1C09OUXP-7Es4KIthVA6daRcGq_bJKGe8#scrollTo=hbXc0o4p2W0a
It looks like tf.metrics.AUC is causing the error. @georgealexandruvlad
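For reference, a minimal sketch of the kind of setup that hits this, assuming a Colab TPU runtime; the model and data below are made up for illustration and are not the notebook's actual code:

```python
import os
import tensorflow as tf

# Usual Colab TPU setup (on newer versions, tf.distribute.TPUStrategy).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, activation='sigmoid', input_shape=(16,)),
    ])
    # Compiling with AUC (or Precision/Recall) appears to be what triggers the
    # TPU compilation failure; plain "accuracy" works.
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=[tf.keras.metrics.AUC()])

# Toy data just to drive fit(); real code would use a tf.data pipeline.
x = tf.random.uniform((128, 16))
y = tf.cast(tf.random.uniform((128, 1)) > 0.5, tf.float32)
model.fit(x, y, batch_size=32, epochs=1)
```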
Hi, sorry about the breakage. The internal version of this issue got routed to me yesterday and we should have a fix very soon today (at least on our nightly release).
The root cause is that our compiler had trouble handling conditionals with dynamic shapes, which are introduced by the “Assert” operation in Metric.
@rxsang also added an option to disable the dynamic-shapes behavior; IIRC you can do that by setting `strategy.experimental_enable_dynamic_batch_size = False`.
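Roughly, that would go right after creating the strategy, something like the sketch below (the usual Colab TPU boilerplate; treat the attribute name as version-dependent):

```python
import os
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])  # Colab TPU address
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

strategy = tf.distribute.experimental.TPUStrategy(resolver)
# Opt out of the dynamic-shapes behavior mentioned above
# (this attribute may not exist in every TF version).
strategy.experimental_enable_dynamic_batch_size = False

with strategy.scope():
    ...  # build and compile the model as usual
```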
TPUs are not working on Colab either.
tensorflow: 2.4
https://colab.research.google.com/drive/1slQKTzSOnE9U70QCQCoGJrcyXCdqRJwz#scrollTo=3Qz6XSPEDsyZ
Missed the notification. This should be fixed in the nightly releases; do you have access to those? I remember we also have a 1.x nightly release, which should also include the fix. cc @rxsang, who is more familiar with this than me.