tensorflow: TPU PyFunction results in UnavailableError: failed to connect to all addresses

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Modified Colab MNIST guide
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab
  • TensorFlow version (use command below): 2.2-rc3

Describe the current behavior When the processing pipeline of a tf.data.Dataset uses tf.py_function, UnavailableError: failed to connect to all addresses is thrown in a TPU environment.

Describe the expected behavior tf.py_function works in TPU environments.

Standalone code to reproduce the issue Colab notebook with a simplified example. In my original code the preprocessing function is more complicated.
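
For readers without access to the notebook, the kind of pipeline described here can be sketched roughly as follows. This is not the reporter's code: `python_preprocess`, `wrap_with_py_function`, `make_dataset`, the zero-filled MNIST-shaped arrays, and the trivial scaling step are all placeholders chosen for illustration. The snippet only builds the dataset and contains no TPU setup, so it should iterate normally on a CPU runtime; the error in this report concerns the same pattern once the runtime is connected to a TPU.

```python
import numpy as np
import tensorflow as tf


def python_preprocess(image, label):
    # Hypothetical preprocessing; inside tf.py_function this body runs as
    # ordinary eager Python rather than as part of the tf.data graph.
    image = tf.cast(image, tf.float32) / 255.0
    return image, label


def wrap_with_py_function(image, label):
    # tf.py_function embeds the Python callable as a single op in the pipeline.
    image, label = tf.py_function(
        python_preprocess, inp=[image, label], Tout=[tf.float32, tf.int64])
    # py_function outputs lose static shape information, so restore it.
    image.set_shape([28, 28])
    label.set_shape([])
    return image, label


def make_dataset(batch_size=4):
    # Placeholder data with MNIST-like shapes (assumption, not real MNIST).
    images = np.zeros((8, 28, 28), dtype=np.uint8)
    labels = np.arange(8, dtype=np.int64)
    dataset = tf.data.Dataset.from_tensor_slices((images, labels))
    return dataset.map(wrap_with_py_function).batch(batch_size)


dataset = make_dataset()
first_images, first_labels = next(iter(dataset))
```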

Other info / logs Related issue: 34346. Stacktrace:

---------------------------------------------------------------------------
UnavailableError                          Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py in execution_mode(mode)
   1985       ctx.executor = executor_new
-> 1986       yield
   1987     finally:

14 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py in _next_internal(self)
    660       except AttributeError:
--> 661         return structure.from_compatible_tensor_list(self._element_spec, ret)
    662 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/util/structure.py in from_compatible_tensor_list(element_spec, tensor_list)
    229       lambda spec, value: spec._from_compatible_tensor_list(value),
--> 230       element_spec, tensor_list)
    231 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/util/structure.py in _from_tensor_list_helper(decode_fn, element_spec, tensor_list)
    204     value = tensor_list[i:i + num_flat_values]
--> 205     flat_ret.append(decode_fn(component_spec, value))
    206     i += num_flat_values

/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/util/structure.py in <lambda>(spec, value)
    228   return _from_tensor_list_helper(
--> 229       lambda spec, value: spec._from_compatible_tensor_list(value),
    230       element_spec, tensor_list)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_spec.py in _from_compatible_tensor_list(self, tensor_list)
    176     assert len(tensor_list) == 1
--> 177     tensor_list[0].set_shape(self._shape)
    178     return tensor_list[0]

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in set_shape(self, shape)
   1103   def set_shape(self, shape):
-> 1104     if not self.shape.is_compatible_with(shape):
   1105       raise ValueError(

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in shape(self)
   1066       except core._NotOkStatusException as e:
-> 1067         six.raise_from(core._status_to_exception(e.code, e.message), None)
   1068 

/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

UnavailableError: failed to connect to all addresses
Additional GRPC error information:
{"created":"@1587494349.376555159","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3959,"referenced_errors":[{"created":"@1587494349.376552078","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}

During handling of the above exception, another exception occurred:

UnavailableError                          Traceback (most recent call last)
<ipython-input-8-f9a6a321af70> in <module>()
      1 train_dataset, test_dataset = get_dataset()
----> 2 list(train_dataset.take(1))

/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py in __next__(self)
    629 
    630   def __next__(self):  # For Python 3 compatibility
--> 631     return self.next()
    632 
    633   def _next_internal(self):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py in next(self)
    668     """Returns a nested structure of `Tensor`s containing the next element."""
    669     try:
--> 670       return self._next_internal()
    671     except errors.OutOfRangeError:
    672       raise StopIteration

/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py in _next_internal(self)
    659         return self._element_spec._from_compatible_tensor_list(ret)  # pylint: disable=protected-access
    660       except AttributeError:
--> 661         return structure.from_compatible_tensor_list(self._element_spec, ret)
    662 
    663   @property

/usr/lib/python3.6/contextlib.py in __exit__(self, type, value, traceback)
     97                 value = type()
     98             try:
---> 99                 self.gen.throw(type, value, traceback)
    100             except StopIteration as exc:
    101                 # Suppress StopIteration *unless* it's the same exception that

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py in execution_mode(mode)
   1987     finally:
   1988       ctx.executor = executor_old
-> 1989       executor_new.wait()
   1990 
   1991 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/executor.py in wait(self)
     65   def wait(self):
     66     """Waits for ops dispatched in this executor to finish."""
---> 67     pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
     68 
     69   def clear_error(self):

UnavailableError: failed to connect to all addresses
Additional GRPC error information:
{"created":"@1587494349.376555159","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3959,"referenced_errors":[{"created":"@1587494349.376552078","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 20
  • Comments: 31 (6 by maintainers)

Most upvoted comments

Facing the same issue when iterating over a dataset created with tf.data.Dataset.from_generator.

Here are some suggestions:

  • tf.data.Dataset.from_generator is known to be incompatible with TPU hardware (reference).
  • It’s better to apply any tf.py_function-based preprocessing to your dataset ahead of time and serialize the results as TFRecords in a GCS bucket, which you would use anyway when training on large datasets with TPUs. This eliminates the need for tf.py_function during data loading. It may even speed up the overall training pipeline slightly, since a non-graph operation is removed from it.
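
The second suggestion could look roughly like the sketch below. It is only an illustration under stated assumptions: `expensive_python_preprocess` is a hypothetical stand-in for whatever logic would otherwise sit inside tf.py_function, the feature names `features` and `label` are made up, and the file is written to a local temporary directory so the snippet runs anywhere. For actual TPU training the path would typically be a `gs://` location in a bucket the TPU can read.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


def expensive_python_preprocess(image):
    # Hypothetical stand-in for arbitrary Python preprocessing.
    return (image.astype(np.float32) / 255.0).reshape(-1)


def write_tfrecords(images, labels, path):
    # Run the Python preprocessing once, offline, and serialize the results.
    with tf.io.TFRecordWriter(path) as writer:
        for image, label in zip(images, labels):
            features = expensive_python_preprocess(image)
            example = tf.train.Example(features=tf.train.Features(feature={
                "features": tf.train.Feature(
                    float_list=tf.train.FloatList(value=features.tolist())),
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(label)])),
            }))
            writer.write(example.SerializeToString())


def parse_example(serialized, num_features):
    # Parsing uses only graph ops, so no tf.py_function is needed at load time.
    spec = {
        "features": tf.io.FixedLenFeature([num_features], tf.float32),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    return parsed["features"], parsed["label"]


def load_dataset(path, num_features, batch_size):
    dataset = tf.data.TFRecordDataset(path)
    dataset = dataset.map(lambda s: parse_example(s, num_features))
    return dataset.batch(batch_size, drop_remainder=True)


# Placeholder data with MNIST-like shapes (assumption, not real MNIST).
images = np.random.randint(0, 256, size=(8, 28, 28), dtype=np.uint8)
labels = np.arange(8) % 10
path = os.path.join(tempfile.mkdtemp(), "train.tfrecord")
write_tfrecords(images, labels, path)
dataset = load_dataset(path, num_features=28 * 28, batch_size=4)
batches = list(dataset)
```

The trade-off is that any change to the preprocessing means regenerating the files, so this fits best when the Python step is deterministic and does not change between epochs.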

I am having this issue as well. This issue should be higher priority because it makes it impossible to run Huggingface tokenization on TPUs with TensorFlow.

Seems to still be an issue using py_function with a TPU.

Same issue. Please resolve it quickly, as it drastically limits the use of the tf.data API for large datasets.

I have the same issue with tf.numpy_function.

Facing the same problem when using tf.py_function to define a custom layer; please fix it.

@oja and any others who were planning on doing tokenization with HuggingFace tokenizers:

The Wordpiece tokenizer from TF.Text can do tokenization that is compatible with graph mode (it doesn’t require py_function). Since HuggingFace is popular, I have a script that copies the vocab over from a pretrained HuggingFace tokenizer and uses it with a tensorflow_text.WordpieceTokenizer: https://gist.github.com/noahtren/6f9f6ecf2f81d0975c4f54afaeb95318

I’m having the exact same issue when using a generator for training on TPUs in Colab. When using TensorFlow 2.2 (Stable), I get this other issue.

However, when trying with a nightly version, I’m getting the error from this issue:

UnavailableError: {{function_node __inference_train_function_5896}} failed to connect to all addresses
Additional GRPC error information:
{"created":"@1589320590.316232748","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3937,"referenced_errors":[{"created":"@1589320590.316230089","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
	 [[IteratorGetNext]]

@oja I’m having the exact same problem so I thought I’d share what I found: https://github.com/huggingface/transformers/pull/1424/files#diff-5843fc9f06d46f05183ab24e6d139575R39

In this code they’re doing tokenization in advance and building a tf.data.Dataset from that. Of course this won’t work if your dataset won’t fit in memory, and it would still be great to have py_function working with TPUs. 😃
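
A minimal sketch of that tokenize-in-advance approach, assuming the whole tokenized dataset fits in host memory. `toy_tokenize` and the tiny vocabulary are hypothetical placeholders for a real tokenizer such as a HuggingFace one; they are not taken from the linked pull request.

```python
import numpy as np
import tensorflow as tf


def toy_tokenize(text, vocab, max_len):
    # Hypothetical stand-in for an arbitrary Python tokenizer:
    # lowercase, split on whitespace, map to ids, then truncate and pad.
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in text.lower().split()]
    ids = ids[:max_len]
    return ids + [vocab["[PAD]"]] * (max_len - len(ids))


def build_dataset(texts, labels, vocab, max_len, batch_size):
    # All Python work happens here, before any tf.data pipeline exists.
    token_ids = np.array(
        [toy_tokenize(t, vocab, max_len) for t in texts], dtype=np.int32)
    labels = np.array(labels, dtype=np.int32)
    # The pipeline itself then contains only graph-compatible ops.
    dataset = tf.data.Dataset.from_tensor_slices((token_ids, labels))
    return dataset.batch(batch_size, drop_remainder=True)


vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "tpu": 3, "world": 4}
texts = ["Hello TPU", "hello world again", "tpu", "world hello tpu"]
dataset = build_dataset(texts, [0, 1, 0, 1], vocab, max_len=4, batch_size=2)
first_ids, first_labels = next(iter(dataset))
```

Because from_tensor_slices holds the arrays in memory, this only scales to datasets that fit there; for larger ones, writing the tokenized output to TFRecords is the more likely route.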