tensorflow: tf.tpu.experimental.initialize_tpu_system fails to work on nightly builds
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colaboratory (Linux)
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): tf-nightly_v2.3.0.dev20200619
- Python version: v3.6.9
- Bazel version (if compiling from source): N/A
- GCC/Compiler version (if compiling from source): N/A
- CUDA/cuDNN version: N/A
- GPU model and memory: N/A
- TPU: Google Colab runtime with TPU accelerator
Describe the current behavior
tf.tpu.experimental.initialize_tpu_system(tpu_cluster_resolver)
raises an unexpected error; the full stack trace is provided below.
The sample notebook that triggers the error, which trains an EfficientNetB0
model with TPUStrategy, is provided here.
Describe the expected behavior
A ResNet50 model (EfficientNetB0 is only present in tf-nightly) with similar code runs successfully with TPUStrategy, and no such errors are reported when calling tf.tpu.experimental.initialize_tpu_system. A notebook with the corresponding training code for TF v2.2 can be found here.
Standalone code to reproduce the issue
Calling tf.tpu.experimental.initialize_tpu_system
via the standard mechanism on a Colab runtime with a TPU accelerator should suffice.
import os
import tensorflow as tf

tpu_url = 'grpc://' + os.environ['COLAB_TPU_ADDR']
tpu_cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_url)
tf.config.experimental_connect_to_cluster(tpu_cluster_resolver)
tf.tpu.experimental.initialize_tpu_system(tpu_cluster_resolver)
strategy = tf.distribute.experimental.TPUStrategy(tpu_cluster_resolver)
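The snippet above assumes COLAB_TPU_ADDR is set, which is only true when the Colab runtime type is TPU; on a CPU/GPU runtime the first line raises KeyError. A small sketch of a guard (the helper name `colab_tpu_address` is illustrative, not a TF or Colab API) could look like:

```python
import os

def colab_tpu_address(env=None):
    """Return the gRPC endpoint of the Colab TPU, or None when the
    runtime has no TPU accelerator attached.

    COLAB_TPU_ADDR is only present in the environment of TPU runtimes,
    so its absence means the resolver should not be constructed at all.
    """
    env = os.environ if env is None else env
    addr = env.get('COLAB_TPU_ADDR')
    return 'grpc://' + addr if addr else None
```

The returned URL (e.g. `grpc://10.57.138.26:8470`) would then be passed as the `tpu=` argument of TPUClusterResolver, exactly as in the repro snippet.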
Other info / logs
Running on TPU ['10.57.138.26:8470']
INFO:tensorflow:Initializing the TPU system: grpc://10.57.138.26:8470
INFO:tensorflow:Initializing the TPU system: grpc://10.57.138.26:8470
INFO:tensorflow:Clearing out eager caches
INFO:tensorflow:Clearing out eager caches
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-4-a42f01f7e70e> in <module>()
6
7 tf.config.experimental_connect_to_cluster(tpu)
----> 8 tf.tpu.experimental.initialize_tpu_system(tpu)
9 tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
3 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/tpu/tpu_strategy_util.py in initialize_tpu_system(cluster_resolver)
101 context.context()._clear_caches() # pylint: disable=protected-access
102
--> 103 serialized_topology = output.numpy()
104
105 # TODO(b/134094971): Remove this when lazy tensor copy in multi-device
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in numpy(self)
1061 """
1062 # TODO(slebedev): Consider avoiding a copy for non-CPU or remote tensors.
-> 1063 maybe_arr = self._numpy() # pylint: disable=protected-access
1064 return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
1065
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in _numpy(self)
1029 return self._numpy_internal()
1030 except core._NotOkStatusException as e: # pylint: disable=protected-access
-> 1031 six.raise_from(core._status_to_exception(e.code, e.message), None) # pylint: disable=protected-access
1032
1033 @property
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)
InvalidArgumentError: NodeDef expected inputs 'string' do not match 0 inputs specified; Op<name=_Send; signature=tensor:T -> ; attr=T:type; attr=tensor_name:string; attr=send_device:string; attr=send_device_incarnation:int; attr=recv_device:string; attr=client_terminated:bool,default=false; is_stateful=true>; NodeDef: {{node _Send}}
This bug occurs only on nightly builds (master branch), not on the TF v2.2 release.
/cc: @tanzhenyu
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 23 (6 by maintainers)
@swghosh Yeah, due to technical limitations Colab TPUs always run the latest stable TPU runtime. The current Colab TPU runtime is version 2.2, so Colab TPUs are only supported with TensorFlow v2.2 by default.
In general, we only support TPUs when the same TF version runs on both the client and the TPU workers. In practice, different versions may work together if there are no changes in protocol or op definitions (e.g. no new ops, no changed op signatures), but this is not supported.
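The version-matching rule above can be sketched as a simple client-side check; `versions_compatible` is an illustrative helper, not a real TensorFlow API, and the comparison on major.minor is an assumption about what "same version" means here:

```python
def versions_compatible(client_tf_version, tpu_runtime_version):
    """Supported TPU operation requires the client and the TPU runtime
    to run the same TensorFlow release (compared here on major.minor)."""
    major_minor = lambda v: tuple(v.split('.')[:2])
    return major_minor(client_tf_version) == major_minor(tpu_runtime_version)

# A nightly 2.3 client (tf.__version__ == '2.3.0.dev20200619') against the
# Colab 2.2 TPU runtime fails this check, which is consistent with the
# InvalidArgumentError reported above.
```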
gcloud ai-platform now supports TF 2.3, so the issue should be resolved.