tensorflow: tf.tpu.experimental.initialize_tpu_system fails on nightly builds

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colaboratory (Linux)
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): tf-nightly_v2.3.0.dev20200619
  • Python version: v3.6.9
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A
  • TPU: Google Colab runtime with TPU accelerator

Describe the current behavior tf.tpu.experimental.initialize_tpu_system(tpu_cluster_resolver) raises an unexpected error. The full stack trace containing this error is provided below.

A sample notebook that trains an EfficientNetB0 model with TPUStrategy and reproduces the error message is provided here.

Describe the expected behavior A ResNet50 model (EfficientNetB0 is only available in tf-nightly) with similar code runs successfully under TPUStrategy, and no such errors are reported when calling tf.tpu.experimental.initialize_tpu_system. A notebook with the corresponding training code for TF v2.2 can be found here.

Standalone code to reproduce the issue Calling tf.tpu.experimental.initialize_tpu_system via the standard mechanism on a Colab runtime with a TPU accelerator is sufficient to reproduce the error.

import os
import tensorflow as tf

tpu_url = 'grpc://' + os.environ['COLAB_TPU_ADDR']
tpu_cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_url)

tf.config.experimental_connect_to_cluster(tpu_cluster_resolver)
tf.tpu.experimental.initialize_tpu_system(tpu_cluster_resolver)

strategy = tf.distribute.experimental.TPUStrategy(tpu_cluster_resolver)
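Note that `COLAB_TPU_ADDR` is only set on Colab runtimes with a TPU accelerator attached, so indexing `os.environ` directly raises a `KeyError` on CPU/GPU runtimes. A small guard makes that failure mode explicit (the `colab_tpu_address` helper below is a hypothetical sketch, not part of the original report):

```python
import os

def colab_tpu_address(env=None):
    """Return the gRPC address of the Colab TPU, or None when the
    runtime has no TPU accelerator (COLAB_TPU_ADDR is unset there)."""
    env = os.environ if env is None else env
    addr = env.get('COLAB_TPU_ADDR')
    return 'grpc://' + addr if addr else None

# Simulated TPU runtime environment, for illustration:
print(colab_tpu_address({'COLAB_TPU_ADDR': '10.57.138.26:8470'}))
# grpc://10.57.138.26:8470
```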

Other info / logs

Running on TPU  ['10.57.138.26:8470']
INFO:tensorflow:Initializing the TPU system: grpc://10.57.138.26:8470

INFO:tensorflow:Initializing the TPU system: grpc://10.57.138.26:8470

INFO:tensorflow:Clearing out eager caches

INFO:tensorflow:Clearing out eager caches

---------------------------------------------------------------------------

InvalidArgumentError                      Traceback (most recent call last)

<ipython-input-4-a42f01f7e70e> in <module>()
      6 
      7 tf.config.experimental_connect_to_cluster(tpu)
----> 8 tf.tpu.experimental.initialize_tpu_system(tpu)
      9 tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/tpu_strategy_util.py in initialize_tpu_system(cluster_resolver)
    101     context.context()._clear_caches()  # pylint: disable=protected-access
    102 
--> 103     serialized_topology = output.numpy()
    104 
    105     # TODO(b/134094971): Remove this when lazy tensor copy in multi-device

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in numpy(self)
   1061     """
   1062     # TODO(slebedev): Consider avoiding a copy for non-CPU or remote tensors.
-> 1063     maybe_arr = self._numpy()  # pylint: disable=protected-access
   1064     return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
   1065 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in _numpy(self)
   1029       return self._numpy_internal()
   1030     except core._NotOkStatusException as e:  # pylint: disable=protected-access
-> 1031       six.raise_from(core._status_to_exception(e.code, e.message), None)  # pylint: disable=protected-access
   1032 
   1033   @property

/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

InvalidArgumentError: NodeDef expected inputs 'string' do not match 0 inputs specified; Op<name=_Send; signature=tensor:T -> ; attr=T:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>; NodeDef: {{node _Send}}

This bug occurs only on nightly builds (master branch), not on the TF v2.2 release.

/cc: @tanzhenyu

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 23 (6 by maintainers)

Most upvoted comments

@swghosh Yeah, due to technical limitations, Colab always defaults to the latest stable TPU version. The Colab TPU currently runs version 2.2, so Colab TPUs are only supported with TensorFlow v2.2 by default.

In general, we only support TPUs when the same TF version is used on both the client and the TPU workers. In practice, different versions may work together if there are no changes to the protocol or op definitions (no new ops, no changed op signatures, etc.), but this is not supported.
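The version-matching rule above can be sketched as a simple major.minor comparison (a heuristic illustration, not an official TensorFlow API; `versions_compatible` is a hypothetical helper):

```python
def versions_compatible(client_version, tpu_version):
    """Heuristic check: treat versions as compatible only when the
    major.minor release matches (e.g. a 2.2.x client vs a 2.2 TPU runtime).

    Nightly version strings ("2.3.0.dev20200619") parse the same way,
    which is why a 2.3 nightly client against a 2.2 Colab TPU fails.
    """
    def major_minor(v):
        parts = v.split('.')
        return (int(parts[0]), int(parts[1]))
    return major_minor(client_version) == major_minor(tpu_version)

print(versions_compatible('2.2.0', '2.2'))              # True
print(versions_compatible('2.3.0.dev20200619', '2.2'))  # False
```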

Now, it seems to initialize fine with 2.3.0 and nightly on Colab. However, whenever I run gcloud ai-platform training scripts, it always produces the same error whichever version of TF I use… It is really frustrating. If it is all about a version mismatch, I can't find a single clue to a solution anywhere.

gcloud ai-platform now supports TF 2.3, so the issue should be resolved.
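For reference, AI Platform training jobs let you pin the server-side TF release with the `--runtime-version` flag, so it can be made to match the client's TF version; a sketch of a submission for a TF 2.3 client (the job name, module path, package path, and region are placeholders):

```shell
# Hypothetical submission: pin the AI Platform runtime to the same
# TF release that the training code imports (here, 2.3).
gcloud ai-platform jobs submit training tpu_job_1 \
  --runtime-version 2.3 \
  --python-version 3.7 \
  --scale-tier BASIC_TPU \
  --module-name trainer.task \
  --package-path trainer/ \
  --region us-central1
```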