tensorflow: Keras Colab TPU Error when compiling and fitting a pre-trained model in 1.14


System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
TensorFlow installed from (source or binary):
TensorFlow version (use command below): 1.14
Python version: 3.6
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version:
GPU model and memory:


Describe the current behavior

When doing transfer learning with a pre-trained Keras model (Xception, for example) using the ImageNet weights and an added classification layer, fitting raises an error. If the model is created without pre-trained weights, the first fit runs without error, but if you then fine-tune the model, re-compile it, and fit again, the same error appears.

Describe the expected behavior

No error. The same code worked in 1.13.

Code to reproduce the issue

The easiest way to reproduce the problem is to run the official notebook at https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/fashion_mnist.ipynb and then re-compile and fit the model a second time:

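import os

import numpy as np
import tensorflow as tf

# create_model, x_train, y_train, x_test and y_test are assumed to come from
# earlier cells of the notebook linked above.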

resolver = tf.contrib.cluster_resolver.TPUClusterResolver('grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)

with strategy.scope():
  model = create_model()
  model.compile(
      optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
      loss='sparse_categorical_crossentropy',
      metrics=['sparse_categorical_accuracy'])

model.fit(
    x_train.astype(np.float32), y_train.astype(np.float32),
    epochs=17,
    steps_per_epoch=60,
    validation_data=(x_test.astype(np.float32), y_test.astype(np.float32)),
    validation_freq=17
)

##This part was added##
print('Fine tuning')

with strategy.scope():
  model.compile(
      optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
      loss='sparse_categorical_crossentropy',
      metrics=['sparse_categorical_accuracy'])

model.fit(
    x_train.astype(np.float32), y_train.astype(np.float32),
    epochs=17,
    steps_per_epoch=60,
    validation_data=(x_test.astype(np.float32), y_test.astype(np.float32)),
    validation_freq=17
)


model.save_weights('./fashion_mnist.h5', overwrite=True)

Other info / logs

And it leads to this error:

W0617 21:41:40.701483 140301503657856 tpu_strategy_util.py:56] TPU system %s has already been initialized. Reinitializing the TPU can cause previously created variables on TPU to be lost.
Epoch 1/17
60/60 [==============================] - 5s 81ms/step - loss: 1.0900 - sparse_categorical_accuracy: 0.6855
Epoch 2/17
60/60 [==============================] - 1s 24ms/step - loss: 0.5287 - sparse_categorical_accuracy: 0.8202
Epoch 3/17
60/60 [==============================] - 2s 25ms/step - loss: 0.4317 - sparse_categorical_accuracy: 0.8518
Epoch 4/17
60/60 [==============================] - 1s 25ms/step - loss: 0.3728 - sparse_categorical_accuracy: 0.8692
Epoch 5/17
60/60 [==============================] - 1s 25ms/step - loss: 0.3453 - sparse_categorical_accuracy: 0.8776
Epoch 6/17
60/60 [==============================] - 1s 24ms/step - loss: 0.3080 - sparse_categorical_accuracy: 0.8898
Epoch 7/17
60/60 [==============================] - 1s 24ms/step - loss: 0.2892 - sparse_categorical_accuracy: 0.8954
Epoch 8/17
60/60 [==============================] - 1s 24ms/step - loss: 0.2641 - sparse_categorical_accuracy: 0.9044
Epoch 9/17
60/60 [==============================] - 1s 25ms/step - loss: 0.2485 - sparse_categorical_accuracy: 0.9093
Epoch 10/17
60/60 [==============================] - 1s 24ms/step - loss: 0.2337 - sparse_categorical_accuracy: 0.9135
Epoch 11/17
60/60 [==============================] - 1s 25ms/step - loss: 0.2236 - sparse_categorical_accuracy: 0.9170
Epoch 12/17
60/60 [==============================] - 1s 25ms/step - loss: 0.2081 - sparse_categorical_accuracy: 0.9232
Epoch 13/17
60/60 [==============================] - 1s 25ms/step - loss: 0.1962 - sparse_categorical_accuracy: 0.9281
Epoch 14/17
60/60 [==============================] - 2s 25ms/step - loss: 0.1816 - sparse_categorical_accuracy: 0.9318
Epoch 15/17
60/60 [==============================] - 1s 24ms/step - loss: 0.1717 - sparse_categorical_accuracy: 0.9355
Epoch 16/17
60/60 [==============================] - 1s 24ms/step - loss: 0.1666 - sparse_categorical_accuracy: 0.9375
Epoch 17/17
10/10 [==============================] - 7s 720ms/step
10/10 [==============================] - 7s 720ms/step
60/60 [==============================] - 13s 216ms/step - loss: 0.1535 - sparse_categorical_accuracy: 0.9424 - val_loss: 0.2364 - val_sparse_categorical_accuracy: 0.9235
Fine tuning
Epoch 1/17

NotFoundError                             Traceback (most recent call last)
<ipython-input-5-ca82800c4c4f> in <module>()
     33 steps_per_epoch=60,
     34 validation_data=(x_test.astype(np.float32), y_test.astype(np.float32)),
---> 35 validation_freq=17
     36 )
     37

7 frames

/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training.pyc in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
    647 steps_per_epoch=steps_per_epoch,
    648 validation_steps=validation_steps,
--> 649 validation_freq=validation_freq)
    650
    651 batch_size = self._validate_or_infer_batch_size(

/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training_distributed.pyc in fit_distributed(model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq)
    126 steps_per_epoch=steps_per_epoch,
    127 validation_steps=validation_steps,
--> 128 validation_freq=validation_freq)
    129 else:
    130 return training_arrays.fit_loop(

/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training_distributed.pyc in experimental_tpu_fit_loop(model, dataset, epochs, verbose, callbacks, initial_epoch, steps_per_epoch, val_dataset, validation_steps, validation_freq)
    412 prev_step_count = step_count
    413 try:
--> 414 _, outputs = K.batch_get_value([train_op, output_tensors])
    415 except errors.OutOfRangeError:
    416 logging.warning('Your dataset iterator ran out of data; '

/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend.pyc in batch_get_value(tensors)
   3008 raise RuntimeError('Cannot get value inside Tensorflow graph function.')
   3009 if tensors:
-> 3010 return get_session(tensors).run(tensors)
   3011 else:
   3012 return []

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
    948 try:
    949 result = self._run(None, fetches, feed_dict, options_ptr,
--> 950 run_metadata_ptr)
    951 if run_metadata:
    952 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1171 if final_fetches or final_targets or (handle and feed_dict_tensor):
   1172 results = self._do_run(handle, final_targets, final_fetches,
-> 1173 feed_dict_tensor, options, run_metadata)
   1174 else:
   1175 results = []

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1348 if handle is None:
   1349 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1350 run_metadata)
   1351 else:
   1352 return self._do_call(_prun_fn, handle, feeds, fetches)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args)
   1368 pass
   1369 message = error_interpolation.interpolate(message, self._graph)
-> 1370 raise type(e)(node_def, op, message)
   1371
   1372 def _extend_graph(self):

NotFoundError: From /job:worker/replica:0/task:0: Resource worker/batch_normalization_3_1/moving_mean/replica_7/N10tensorflow3VarE does not exist. [[node TPUReplicateMetadata_5 (defined at <ipython-input-5-ca82800c4c4f>:35) ]]
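
When comparing reports like this one, it can help to pull the layer, variable and replica out of the "does not exist" message. The sketch below does that with a regular expression. The pattern is inferred only from the single message shown above, so the naming layout is an assumption and may differ in other TensorFlow versions; `parse_missing_resource` is a hypothetical helper, not a TensorFlow function. The trailing `N10tensorflow3VarE` appears to be the mangled C++ type name for `tensorflow::Var`.

```python
import re

# Pattern inferred from the one error message in this report (assumption):
# Resource <container>/<layer>/<variable>/replica_<n>/<type> does not exist
_RESOURCE_RE = re.compile(
    r"Resource (?P<container>[^/\s]+)/(?P<layer>[^/\s]+)/(?P<variable>[^/\s]+)"
    r"/replica_(?P<replica>\d+)/(?P<type>\S+) does not exist"
)


def parse_missing_resource(message):
    """Return the parts of a 'Resource ... does not exist' message, or None."""
    match = _RESOURCE_RE.search(message)
    if match is None:
        return None
    parts = match.groupdict()
    parts["replica"] = int(parts["replica"])
    return parts


message = (
    "NotFoundError: From /job:worker/replica:0/task:0: Resource "
    "worker/batch_normalization_3_1/moving_mean/replica_7/N10tensorflow3VarE "
    "does not exist."
)
info = parse_missing_resource(message)
print(info["layer"], info["variable"], info["replica"])
```

In this report the missing variable is a batch-normalization moving mean on replica 7.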

Original stack trace for u'TPUReplicateMetadata_5':
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python2.7/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelapp.py", line 477, in start
    ioloop.IOLoop.instance().start()
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2822, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-ca82800c4c4f>", line 35, in <module>
    validation_freq=17
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training.py", line 649, in fit
    validation_freq=validation_freq)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training_distributed.py", line 128, in fit_distributed
    validation_freq=validation_freq)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/training_distributed.py", line 367, in experimental_tpu_fit_loop
    initial_loop_values=initial_loop_values)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1501, in experimental_run_steps_on_iterator
    initial_loop_values)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/distribute/tpu_strategy.py", line 416, in _experimental_run_steps_on_iterator
    replicate_outputs = rewrite_fn()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/distribute/tpu_strategy.py", line 397, in rewrite_fn
    replicate_outputs = tpu.replicate(run_fn, replicate_inputs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/tpu/tpu.py", line 592, in replicate
    maximum_shapes=maximum_shapes)[1]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/tpu/tpu.py", line 854, in split_compile_and_replicate
    num_replicas=num_replicas, use_tpu=use_tpu, **metadata_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_tpu_ops.py", line 6039, in tpu_replicate_metadata
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

About this issue

State: closed
Created 5 years ago
Comments: 20 (3 by maintainers)

Most upvoted comments

I had a similar issue: when trying to fine-tune a pre-trained Keras model on a Colab TPU (using model.load_weights() to load the pre-trained weights), I got an error message saying “Resource worker/batch_normalization_1/moving_mean/replica_5/N10tensorflow3VarE does not exist”.

with strategy.scope():
  model = create_model()
  model.load_weights('fashion_mnist.h5')
  model.compile(
      optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
      loss='sparse_categorical_crossentropy',
      metrics=['sparse_categorical_accuracy'])

# Got `Resource worker/batch_normalization_10/moving_mean/replica_7/N10tensorflow3VarE does not exist` error after calling model.fit()

However, this issue was fixed when I changed my code to call model.load_weights() after model.compile() (rather than before), like so:

with strategy.scope():
  model = create_model()
  model.compile(
      optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
      loss='sparse_categorical_crossentropy',
      metrics=['sparse_categorical_accuracy'])
  model.load_weights('fashion_mnist.h5') # Load weights after model.compile()

# model trains without issue
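
The ordering above can be wrapped in a small helper so the two calls cannot be swapped by accident. This is only a sketch: `compile_then_load` and the `RecordingModel` stand-in are hypothetical names made up for illustration, not TensorFlow APIs. The stand-in just records call order so the snippet runs without a TPU.

```python
def compile_then_load(model, weights_path, **compile_kwargs):
    """Compile `model` first, then load weights (the order reported to work).

    `model` only needs compile(**kwargs) and load_weights(path) methods,
    which tf.keras models provide. Hypothetical helper, not a TensorFlow API.
    """
    model.compile(**compile_kwargs)
    model.load_weights(weights_path)
    return model


class RecordingModel:
    """Stand-in for a Keras model that only records the order of calls."""

    def __init__(self):
        self.calls = []

    def compile(self, **kwargs):
        self.calls.append("compile")

    def load_weights(self, path):
        self.calls.append("load_weights:" + path)


demo = compile_then_load(
    RecordingModel(), "fashion_mnist.h5", loss="sparse_categorical_crossentropy"
)
print(demo.calls)
```

With a real model, the helper would presumably be called inside `with strategy.scope():`, as in the snippet above.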

horacejlee on Jul 18, 2019

I can confirm this issue. In my case I was getting the same error when trying to specify weights for an Embedding layer, e.g.:

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(..., weights = [embedding_matrix], ...),
        ...
    ])
    model.compile(...)

If I specified them after compiling the model, the code would run fine, e.g.:

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(...),
        ...
    ])
    model.compile(...)
    model.layers[0].set_weights([embedding_matrix])

@jvishnuvardhan are you happy with the example provided above, or do you still want a standalone example to reproduce the issue?

dimitry-ishenko on Aug 19, 2019

Was able to reproduce the reported issue on Colab with Tensorflow 1.14.0-rc1. Thanks!

gadagashwini-zz on Jun 19, 2019

@DecentMakeover My earlier post:

Maybe look at my example above, where I (1) create the model, (2) compile the model and (3) set the weights.

Your SO post:

model.load_weights('saved_models/wieghts_ef5.h5',by_name = True)
model.compile(loss='mse', optimizer=RAdam(lr=0.00005), metrics=['mse', 'acc'])

So, one more time: (2) COMPILE the model, (3) SET THE WEIGHTS.
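
One way to catch the wrong order early is a thin wrapper that refuses to load weights until the model has been compiled. This is a sketch under stated assumptions: `CompileBeforeWeights` and `FakeModel` are hypothetical names for illustration, and the wrapper only checks call order; it does not address whatever causes the TPU error itself.

```python
class CompileBeforeWeights:
    """Hypothetical wrapper enforcing 'compile first, then set the weights'.

    Forwards everything to the wrapped model-like object, but raises if
    load_weights() is called before compile(). Not part of TensorFlow.
    """

    def __init__(self, model):
        self._model = model
        self._compiled = False

    def compile(self, *args, **kwargs):
        result = self._model.compile(*args, **kwargs)
        self._compiled = True
        return result

    def load_weights(self, *args, **kwargs):
        if not self._compiled:
            raise RuntimeError("compile() must be called before load_weights()")
        return self._model.load_weights(*args, **kwargs)

    def __getattr__(self, name):
        # Only reached for attributes not defined on the wrapper itself.
        return getattr(self._model, name)


class FakeModel:
    """Stand-in for a Keras model, used only to exercise the wrapper."""

    def __init__(self):
        self.calls = []

    def compile(self, **kwargs):
        self.calls.append("compile")

    def load_weights(self, path, by_name=False):
        self.calls.append("load_weights")


guarded = CompileBeforeWeights(FakeModel())
try:
    guarded.load_weights("weights.h5", by_name=True)  # wrong order
    wrong_order_rejected = False
except RuntimeError:
    wrong_order_rejected = True

guarded.compile(loss="mse")
guarded.load_weights("weights.h5", by_name=True)  # right order
print(wrong_order_rejected, guarded.calls)
```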

dimitry-ishenko on Sep 4, 2019

God! I don’t have my laptop near me. I’ll check this as soon as I’m back and let you know.

DecentMakeover on Sep 3, 2019