tensorflow: TPU + TFRecords combination not working with model.evaluate

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab
  • TensorFlow installed from (source or binary):
  • TensorFlow version (use command below): 1.14
  • Python version: 3.6.8
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory: TPU Colab

Describe the current behavior

On a TPU (Colab), running model.evaluate on a tf.data.Dataset built from TFRecords throws:

Compilation failure: Dynamic Spatial reduce window is not supported: %reduce-window.21 = f32[1,127,127,3]{3,2,1,0} reduce-window(f32[1,256,256,3]{3,2,1,0} %reshape.12, f32[] %constant.16), window={size=1x3x3x1 stride=1x2x2x1}, to_apply=%max_F32.17, metadata={op_type="MaxPool" op_name="max_pooling2d_4/MaxPool"} TPU compilation failed

The fit method works perfectly with the same dataset. Evaluation also works perfectly if I rebuild the model and load the weights on a CPU/GPU instance.

I don't have this issue on the TPU if the tf.data.Dataset is not built from TFRecords.

Describe the expected behavior

model.evaluate should work and return a result close to that of the last fit iteration.

Code to reproduce the issue

Run it on Colab and replace GOOGLE_BUCKET_TO_DEFINE with a real bucket name (2 occurrences).

import tensorflow as tf
import numpy as np

#get an image as input data for the model: the TensorFlow logo
!curl https://avatars0.githubusercontent.com/u/15658638?s=256 --output tensor_logo.png

#build 8 TFRecord examples from the downloaded logo
def build_tf_records():
  def serialize_example_pyfunction(image, label):
    feature = {
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.numpy()])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label.numpy()]))
    }
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()

  def tf_serialize_example(image, label):
    tf_string = tf.py_function(
      serialize_example_pyfunction,
      (image, label),  # pass these args to the above function.
      tf.string)      # the return type is `tf.string`.
    return tf.reshape(tf_string, ())  # the result is a scalar

  dataset = tf.data.Dataset.from_tensor_slices(['/content/tensor_logo.png' for x in range(8)])
  dataset = dataset.map(lambda x : (tf.read_file(x), 0))
  dataset = dataset.map(tf_serialize_example)
  return dataset

#write the TFRecords to disk, then copy them to a Google Cloud Storage bucket
with tf.Session() as sess:
  filename = 'tfrecord.test'
  writer = tf.data.experimental.TFRecordWriter(filename)
  write_op = writer.write(build_tf_records())
  sess.run(write_op)

!ls

from google.colab import auth
auth.authenticate_user()

!gsutil cp tfrecord.test gs://GOOGLE_BUCKET_TO_DEFINE/

# the aim is to quickly build 8 items (x, y=0) with x of shape (256, 256, 3)
def train_input_fn_dummy_data():
  dataset = tf.data.Dataset.from_tensor_slices(([0 for x in range(8)]))
  dataset = dataset.map(lambda x : (tf.random.normal((256, 256, 3)), [0]))
  dataset = dataset.batch(8)
  return dataset
  

def train_input_fn_from_tf_records():
  # Create a description of the features.
  feature_description = {
    'image': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64, default_value=0)
  }

  def _parse_function(example_proto):
    # Parse the input tf.Example proto using the dictionary above.
    return tf.parse_single_example(example_proto, feature_description)

  def _process_string_image(dic):
    image_string = dic['image']
    image_decoded = tf.image.decode_png(image_string, channels=3)
    image_decoded = tf.cast(image_decoded, tf.float32)/255.
    return image_decoded, tf.cast([dic['label']], tf.int32)

  list_files = tf.data.Dataset.list_files('gs://GOOGLE_BUCKET_TO_DEFINE/tfrecord.test')
  raw_tfrecords = tf.data.TFRecordDataset(list_files)
  files_as_dict = raw_tfrecords.map(_parse_function)
  files = files_as_dict.map(_process_string_image)
  files = files.batch(8, drop_remainder=True)
  
  return files

#basic check to compare train_input_fn_dummy_data, train_input_fn_from_tf_records
#can't be run after TPU initialisation
with tf.Session() as sess:
  batch = train_input_fn_dummy_data().make_one_shot_iterator().get_next()
  while True:
      try:
          records = sess.run(batch)
          print('shape of dummy items :', records[0].shape, records[1].shape)
      except tf.errors.OutOfRangeError: break
  batch = train_input_fn_from_tf_records().make_one_shot_iterator().get_next()
  while True:
      try:
          records = sess.run(batch)
          print('shape of tfrecords items :', records[0].shape, records[1].shape)
      except tf.errors.OutOfRangeError: break

##shape of dummy items : (8, 256, 256, 3) (8, 1)
##shape of tfrecords items : (8, 256, 256, 3) (8, 1)


#initialize tpu only once
if not('strategy' in globals()):
  resolver = tf.contrib.cluster_resolver.TPUClusterResolver()
  tf.contrib.distribute.initialize_tpu_system(resolver)
  strategy = tf.contrib.distribute.TPUStrategy(resolver)

#build model and compile
with strategy.scope():
  inputs = tf.keras.layers.Input(shape=(256, 256, 3))
  x = tf.keras.layers.MaxPooling2D((3, 3), strides=(2, 2))(inputs)
  output = tf.keras.layers.GlobalAveragePooling2D()(x)
  output = tf.keras.layers.Dense(1, activation='sigmoid')(output)
  model = tf.keras.Model(inputs=inputs, outputs=output)
  model.compile('adam', loss='binary_crossentropy', metrics=['binary_accuracy'])

model.summary()

# fit the model, no issue
print('train dummy')
model.fit(train_input_fn_dummy_data(), epochs=1, steps_per_epoch=1)
print('train tfrecords')
model.fit(train_input_fn_from_tf_records(), epochs=1, steps_per_epoch=1)
print('evaluate dummy')
model.evaluate(train_input_fn_dummy_data(), steps=1)
print('evaluate tfrecords')
model.evaluate(train_input_fn_from_tf_records(), steps=1)

##train dummy
##WARNING:tensorflow:Expected a shuffled dataset but input dataset `x` is not shuffled. Please invoke `shuffle()` on input dataset.
##WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_distributed.py:411: Variable.load (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
##Instructions for updating:
##Prefer Variable.assign which has equivalent behavior in 2.X.
##1/1 [==============================] - 0s 397ms/step - loss: 0.3180 - binary_accuracy: 1.0000
##train tfrecords
##1/1 [==============================] - 1s 786ms/step - loss: 0.4993 - binary_accuracy: 1.0000
##evaluate dummy
##1/1 [==============================] - 1s 1s/step
##1/1 [==============================] - 1s 1s/step
##evaluate tfrecords
##---------------------------------------------------------------------------
##UnimplementedError                        Traceback (most recent call last)
##/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
##   1355     try:
##-> 1356       return fn(*args)
##   1357     except errors.OpError as e:
##
##10 frames
##UnimplementedError: From /job:worker/replica:0/task:0:
##Compilation failure: Dynamic Spatial reduce window is not supported: %reduce-window.21 = f32[1,127,127,3]{3,2,1,0} reduce-window(f32[1,256,256,3]{3,2,1,0} %reshape.12, f32[] %constant.16), window={size=1x3x3x1 stride=1x2x2x1}, to_apply=%max_F32.17, metadata={op_type="MaxPool" op_name="max_pooling2d_4/MaxPool"}
##	TPU compilation failed
##	 [[{{node TPUReplicateMetadata_3}}]]
##
##During handling of the above exception, another exception occurred:


Other info / logs

The fit method works perfectly with the same dataset. Evaluation also works perfectly if I rebuild the model and load the weights on a CPU/GPU instance.

I don't have this issue on the TPU if the tf.data.Dataset is not built from TFRecords.

Full stack trace:

UnimplementedError                        Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1355     try:
-> 1356       return fn(*args)
   1357     except errors.OpError as e:

10 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1340       return self._call_tf_sessionrun(
-> 1341           options, feed_dict, fetch_list, target_list, run_metadata)
   1342

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1428         self._session, options, feed_dict, fetch_list, target_list,
-> 1429         run_metadata)
   1430

UnimplementedError: From /job:worker/replica:0/task:0:
Compilation failure: Dynamic Spatial reduce window is not supported: %reduce-window.21 = f32[1,127,127,3]{3,2,1,0} reduce-window(f32[1,256,256,3]{3,2,1,0} %reshape.12, f32[] %constant.16), window={size=1x3x3x1 stride=1x2x2x1}, to_apply=%max_F32.17, metadata={op_type="MaxPool" op_name="max_pooling2d_4/MaxPool"}
	TPU compilation failed
	 [[{{node TPUReplicateMetadata_3}}]]

During handling of the above exception, another exception occurred:

UnimplementedError                        Traceback (most recent call last)
<ipython-input-8-22e1b9ed9dfe> in <module>()
      6 model.evaluate(train_input_fn_dummy_data(), steps=1)
      7 print('evaluate tfrecords')
----> 8 model.evaluate(train_input_fn_from_tf_records(), steps=1)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in evaluate(self, x, y, batch_size, verbose, sample_weight, steps, callbacks, max_queue_size, workers, use_multiprocessing)
    902         sample_weight=sample_weight,
    903         steps=steps,
--> 904         callbacks=callbacks)
    905
    906     batch_size = self._validate_or_infer_batch_size(batch_size, steps, x)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_distributed.py in evaluate_distributed(model, x, y, batch_size, verbose, sample_weight, steps, callbacks)
    168   if distributed_training_utils.is_tpu_strategy(model._distribution_strategy):
    169     return experimental_tpu_test_loop(
--> 170         model, dataset, verbose=verbose, steps=steps, callbacks=callbacks)
    171   else:
    172     return training_arrays.test_loop(

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_distributed.py in experimental_tpu_test_loop(model, dataset, verbose, steps, callbacks)
    562     callbacks._call_batch_hook(mode, 'begin', current_step, batch_logs)
    563     try:
--> 564       _, batch_outs = K.batch_get_value([test_op, output_tensors])
    565     except errors.OutOfRangeError:
    566       warning_msg = 'Make sure that your dataset can generate at least '

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py in batch_get_value(tensors)
   3008     raise RuntimeError('Cannot get value inside Tensorflow graph function.')
   3009   if tensors:
-> 3010     return get_session(tensors).run(tensors)
   3011   else:
   3012     return []

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    948     try:
    949       result = self._run(None, fetches, feed_dict, options_ptr,
--> 950                          run_metadata_ptr)
    951       if run_metadata:
    952         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1171     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1172       results = self._do_run(handle, final_targets, final_fetches,
-> 1173                              feed_dict_tensor, options, run_metadata)
   1174     else:
   1175       results = []

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1348     if handle is None:
   1349       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1350                            run_metadata)
   1351     else:
   1352       return self._do_call(_prun_fn, handle, feeds, fetches)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1368       pass
   1369     message = error_interpolation.interpolate(message, self._graph)
-> 1370     raise type(e)(node_def, op, message)
   1371
   1372   def _extend_graph(self):

UnimplementedError: From /job:worker/replica:0/task:0:
Compilation failure: Dynamic Spatial reduce window is not supported: %reduce-window.21 = f32[1,127,127,3]{3,2,1,0} reduce-window(f32[1,256,256,3]{3,2,1,0} %reshape.12, f32[] %constant.16), window={size=1x3x3x1 stride=1x2x2x1}, to_apply=%max_F32.17, metadata={op_type="MaxPool" op_name="max_pooling2d_4/MaxPool"}
	TPU compilation failed
	 [[node TPUReplicateMetadata_3 (defined at <ipython-input-8-22e1b9ed9dfe>:8) ]]

Original stack trace for 'TPUReplicateMetadata_3':
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py", line 477, in start
    ioloop.IOLoop.instance().start()
  File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 832, in start
    self._run_callback(self._callbacks.popleft())
  File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 605, in _run_callback
    ret = callback()
  File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 536, in <lambda>
    self.io_loop.add_callback(lambda : self._handle_events(self.socket, 0))
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2828, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-8-22e1b9ed9dfe>", line 8, in <module>
    model.evaluate(train_input_fn_from_tf_records(), steps=1)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 904, in evaluate
    callbacks=callbacks)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_distributed.py", line 170, in evaluate_distributed
    model, dataset, verbose=verbose, steps=steps, callbacks=callbacks)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training_distributed.py", line 520, in experimental_tpu_test_loop
    _test_step_fn, args=(test_input_data,))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/tpu_strategy.py", line 249, in experimental_run_v2
    return _tpu_run(self, fn, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/tpu_strategy.py", line 196, in _tpu_run
    maximum_shapes=maximum_shapes)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/tpu.py", line 592, in replicate
    maximum_shapes=maximum_shapes)[1]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/tpu.py", line 854, in split_compile_and_replicate
    num_replicas=num_replicas, use_tpu=use_tpu, **metadata_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_tpu_ops.py", line 6039, in tpu_replicate_metadata
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

That worked. There was a tf.image.decode_jpeg in the pipeline, so I added a tf.reshape after it. Thanks.
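
For reference, a minimal sketch of what that workaround can look like in a TF 1.x input pipeline. The decode op and the 256x256x3 target shape are illustrative, not taken from this issue; only reshape if every image really has that size, otherwise resize or pad to a fixed size first.

import tensorflow as tf

def decode_with_static_shape(image_bytes):
  # decode_jpeg/decode_png return a tensor whose spatial dims are unknown
  # at graph-construction time ((?, ?, 3)), which the TPU/XLA compiler rejects
  image = tf.image.decode_jpeg(image_bytes, channels=3)
  # force a fully static shape so downstream ops such as MaxPool see (256, 256, 3)
  image = tf.reshape(image, (256, 256, 3))
  return tf.cast(image, tf.float32) / 255.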

Dynamic image sizes are the next thing we are going to support. For now, making the image size static will make the test pass.
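
If the source images do not all share one size, one way to make the spatial shape static is to pad or crop to a fixed size first; a minimal sketch, assuming a 256x256 target (not from this issue's code):

import tensorflow as tf

def make_spatial_shape_static(image):
  # pad or crop to a fixed spatial size
  image = tf.image.resize_image_with_crop_or_pad(image, 256, 256)
  # assert the now-known static shape so graph shape inference (and XLA) can use it
  image.set_shape([256, 256, 3])
  return image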

It seems you get this error because your image size is dynamic (from the error message, there is a dimension with an upper bound of 262), which we don't support yet. Can you share your input pipeline code? If a dynamic image size is not intended, maybe some dataset ops produce a dynamic image size in your case.
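
A quick way to check whether an input pipeline produces a dynamic image size before handing it to the TPU; a TF 1.x sketch reusing train_input_fn_from_tf_records from the report above:

import tensorflow as tf

dataset = train_input_fn_from_tf_records()
# any unknown (None / ?) dimension here is "dynamic" and will trigger the
# "Dynamic Spatial reduce window is not supported" compilation failure
print(dataset.output_shapes)
# with the pipeline above this likely prints something like ((8, ?, ?, 3), (8, 1)),
# because tf.image.decode_png does not give the image a static spatial shape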