tensorflow: TPU PyFunction results in UnavailableError: failed to connect to all addresses
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Modified Colab MNIST guide
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab
- TensorFlow version (use command below): 2.2-rc3
Describe the current behavior
When the processing pipeline for a `tf.data.Dataset` contains a use of `tf.py_function`, an `UnavailableError: failed to connect to all addresses` is thrown in a TPU environment.
Describe the expected behavior
`tf.py_function` works in TPU environments.
Standalone code to reproduce the issue
Colab notebook with simplified example. In my original code the preprocessing function is more complicated; a minimal sketch of the failing pattern is below.
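(The linked notebook is authoritative; this sketch is reconstructed from the traceback below, and names such as `get_dataset` and `scale` are assumptions, not the original code.)

```python
import os
import tensorflow as tf

# Connect to the Colab TPU runtime (TF 2.2-era Colab idiom).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu="grpc://" + os.environ["COLAB_TPU_ADDR"])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

def scale(image):
    # Arbitrary Python-side preprocessing; the real function is more complex.
    return tf.cast(image, tf.float32) / 255.0

def preprocess(image, label):
    # Wrapping the work in tf.py_function is what triggers the error on TPU.
    image = tf.py_function(scale, [image], tf.float32)
    return image, label

def get_dataset():
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    train = tf.data.Dataset.from_tensor_slices((x_train, y_train)).map(preprocess)
    test = tf.data.Dataset.from_tensor_slices((x_test, y_test)).map(preprocess)
    return train, test

train_dataset, test_dataset = get_dataset()
list(train_dataset.take(1))  # raises UnavailableError on a TPU runtime
```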
Other info / logs
Related issue: #34346. Stacktrace:
---------------------------------------------------------------------------
UnavailableError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py in execution_mode(mode)
1985 ctx.executor = executor_new
-> 1986 yield
1987 finally:
14 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py in _next_internal(self)
660 except AttributeError:
--> 661 return structure.from_compatible_tensor_list(self._element_spec, ret)
662
/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/util/structure.py in from_compatible_tensor_list(element_spec, tensor_list)
229 lambda spec, value: spec._from_compatible_tensor_list(value),
--> 230 element_spec, tensor_list)
231
/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/util/structure.py in _from_tensor_list_helper(decode_fn, element_spec, tensor_list)
204 value = tensor_list[i:i + num_flat_values]
--> 205 flat_ret.append(decode_fn(component_spec, value))
206 i += num_flat_values
/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/util/structure.py in <lambda>(spec, value)
228 return _from_tensor_list_helper(
--> 229 lambda spec, value: spec._from_compatible_tensor_list(value),
230 element_spec, tensor_list)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_spec.py in _from_compatible_tensor_list(self, tensor_list)
176 assert len(tensor_list) == 1
--> 177 tensor_list[0].set_shape(self._shape)
178 return tensor_list[0]
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in set_shape(self, shape)
1103 def set_shape(self, shape):
-> 1104 if not self.shape.is_compatible_with(shape):
1105 raise ValueError(
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in shape(self)
1066 except core._NotOkStatusException as e:
-> 1067 six.raise_from(core._status_to_exception(e.code, e.message), None)
1068
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)
UnavailableError: failed to connect to all addresses
Additional GRPC error information:
{"created":"@1587494349.376555159","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3959,"referenced_errors":[{"created":"@1587494349.376552078","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
During handling of the above exception, another exception occurred:
UnavailableError Traceback (most recent call last)
<ipython-input-8-f9a6a321af70> in <module>()
1 train_dataset, test_dataset = get_dataset()
----> 2 list(train_dataset.take(1))
/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py in __next__(self)
629
630 def __next__(self): # For Python 3 compatibility
--> 631 return self.next()
632
633 def _next_internal(self):
/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py in next(self)
668 """Returns a nested structure of `Tensor`s containing the next element."""
669 try:
--> 670 return self._next_internal()
671 except errors.OutOfRangeError:
672 raise StopIteration
/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py in _next_internal(self)
659 return self._element_spec._from_compatible_tensor_list(ret) # pylint: disable=protected-access
660 except AttributeError:
--> 661 return structure.from_compatible_tensor_list(self._element_spec, ret)
662
663 @property
/usr/lib/python3.6/contextlib.py in __exit__(self, type, value, traceback)
97 value = type()
98 try:
---> 99 self.gen.throw(type, value, traceback)
100 except StopIteration as exc:
101 # Suppress StopIteration *unless* it's the same exception that
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py in execution_mode(mode)
1987 finally:
1988 ctx.executor = executor_old
-> 1989 executor_new.wait()
1990
1991
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/executor.py in wait(self)
65 def wait(self):
66 """Waits for ops dispatched in this executor to finish."""
---> 67 pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
68
69 def clear_error(self):
UnavailableError: failed to connect to all addresses
Additional GRPC error information:
{"created":"@1587494349.376555159","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3959,"referenced_errors":[{"created":"@1587494349.376552078","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
Facing the same issue when iterating over a dataset created with `tf.data.Dataset.from_generator`.
Here are some suggestions:

- `tf.data.Dataset.from_generator` is known to be incompatible with TPU hardware (reference).
- Apply the `tf.py_function`-related functionality to your dataset ahead of time and serialize the results as TFRecords inside a GCS bucket, which you would use anyway when training on large datasets with TPUs. This effectively eliminates the need for `tf.py_function` during the data-loading phase of training. It might even speed up the overall training pipeline a bit, since a non-graph operation is removed from it. A sketch of this approach follows below.
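A minimal sketch of the TFRecord approach, assuming MNIST-style image/label pairs; the `preprocess` function, file path, and feature keys are illustrative stand-ins, not from the original notebook:

```python
import tensorflow as tf

def preprocess(image, label):
    # Hypothetical stand-in for the Python-side work previously done
    # inside tf.py_function; runs once, offline, in plain Python.
    return image.astype("float32") / 255.0, label

def serialize_example(image, label):
    # Pack one preprocessed example into a tf.train.Example proto.
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(
            value=[tf.io.serialize_tensor(image).numpy()])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(
            value=[int(label)])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()

# Write to GCS so the TPU workers can read the data directly.
with tf.io.TFRecordWriter("gs://your-bucket/mnist-train.tfrecord") as writer:
    for image, label in zip(x_train, y_train):
        image, label = preprocess(image, label)
        writer.write(serialize_example(image, label))

# At training time, parsing is pure graph ops -- no py_function needed.
def parse(record):
    parsed = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.parse_tensor(parsed["image"], tf.float32)
    image.set_shape([28, 28])
    return image, parsed["label"]

dataset = tf.data.TFRecordDataset("gs://your-bucket/mnist-train.tfrecord").map(parse)
```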
I am having this issue as well. This issue should be higher priority, because it makes it impossible to run Huggingface tokenization on TPUs with TensorFlow.
Seems to still be an issue using `py_function` with a TPU.

Same issue. Please resolve it quickly, as it is drastically limiting the usage of the `tf.data` API for large datasets.
I have the same issue with `tf.numpy_function`.
Facing the same problem when using `tf.py_function` to define a customized layer; please fix it.
@oja and any others who were planning on doing tokenization with HuggingFace tokenizers:

The Wordpiece tokenizer from TF.Text can do tokenization that is compatible with graph mode (it doesn't require `py_function`). Since HuggingFace is popular, I have a script that copies the vocab over from a pretrained HuggingFace tokenizer and uses it with a `tensorflow_text.WordpieceTokenizer`: https://gist.github.com/noahtren/6f9f6ecf2f81d0975c4f54afaeb95318 A rough sketch of the idea is below.
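A minimal sketch of graph-mode tokenization with TF.Text, assuming a `vocab.txt` already exported from a HuggingFace tokenizer (the linked gist is the authoritative version; the file path and example data here are illustrative):

```python
import tensorflow as tf
import tensorflow_text as tf_text  # pip install tensorflow-text

# Lookup table built from a vocab file exported from a HuggingFace
# tokenizer; "vocab.txt" is an illustrative path.
vocab_table = tf.lookup.StaticVocabularyTable(
    tf.lookup.TextFileInitializer(
        "vocab.txt",
        key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
        value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER),
    num_oov_buckets=1)

whitespace = tf_text.WhitespaceTokenizer()
wordpiece = tf_text.WordpieceTokenizer(vocab_table, token_out_type=tf.int64)

def tokenize(text, label):
    # Pure graph ops, so this is safe inside dataset.map() -- no py_function.
    words = whitespace.tokenize(text)
    ids = wordpiece.tokenize(words).merge_dims(-2, -1)
    return ids, label

dataset = (tf.data.Dataset.from_tensor_slices((["hello world"], [0]))
           .map(tokenize))
```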
I'm having the exact same issue when using a generator for training on TPUs in Colab. When using TensorFlow 2.2 (stable), I get this other issue. However, when trying with a nightly version, I'm getting the error from this issue:
@oja I'm having the exact same problem, so I thought I'd share what I found: https://github.com/huggingface/transformers/pull/1424/files#diff-5843fc9f06d46f05183ab24e6d139575R39

In this code they're doing tokenization in advance and building a `tf.data.Dataset` from that. Of course this won't work if your dataset won't fit in memory, and it would still be great to have `py_function` working with TPUs. 😃 A sketch of the in-advance approach follows below.
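A minimal sketch of that pre-tokenization approach, assuming the `transformers` library and a small in-memory corpus; the model name, texts, and labels are illustrative, and this uses the modern `transformers` call API rather than the exact code from the linked PR:

```python
import tensorflow as tf
from transformers import BertTokenizer  # assumption: HF transformers installed

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
texts = ["first example", "second example"]  # illustrative in-memory corpus
labels = [0, 1]

# Tokenize everything up front in plain Python, outside the tf.data graph.
encodings = tokenizer(texts, padding="max_length", truncation=True,
                      max_length=128, return_tensors="np")

# The resulting dataset contains only plain tensors, so no py_function is
# needed and it works on TPU.
dataset = tf.data.Dataset.from_tensor_slices(
    (dict(encodings), labels)).batch(2)
```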