tensorflow: Error "failed to connect to all addresses" when training on TPU with Colab
Hello!
Describe the current behavior
When running my Colab I get the following error:
```
UnavailableError: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1599555147.064735819","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"@1599555147.064732799","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
```
When I try to iterate over my dataset:
```
<ipython-input-5-9732b6b5faf1> in train(self)
     42
     43         for epoch_iter in range(1, 5):
---> 44             for step, batch in enumerate(train_ds):
     45                 self.global_step = iterations.numpy()
     46
```
Describe the expected behavior
Being able to iterate over the dataset.
Standalone code to reproduce the issue
The full Colab can be found here: https://colab.research.google.com/drive/1jZaFmrDmBGbGkg8oKEJ78cL2PgIB7UF5?usp=sharing
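For context, the failing pattern is roughly the following (a sketch only, not the actual notebook; the dataset and batch sizes here are placeholders):

```python
import tensorflow as tf

# Standard Colab TPU setup.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Placeholder input pipeline; the real notebook builds its own dataset.
ds = tf.data.Dataset.from_tensor_slices(tf.random.uniform((128, 10))).batch(16)
train_ds = strategy.experimental_distribute_dataset(ds)

for epoch_iter in range(1, 5):
    # In the notebook, the UnavailableError is raised as soon as this loop
    # pulls its first batch.
    for step, batch in enumerate(train_ds):
        pass
```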
About this issue
- State: closed
- Created 4 years ago
- Comments: 30 (18 by maintainers)
Unfortunately I stumbled upon this issue when trying with a generator. I tried an example from the docs here and it returns the same error. This happens when I initialize the TPU first and then create the dataset later.
Error: the same UnavailableError as above.
It’s on Google Colab btw
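The pattern in question is roughly this (a sketch; the exact docs example and output shapes are assumed):

```python
import tensorflow as tf

# Initialize the TPU first...
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# ...then build a dataset from a Python generator, as in the tf.data guide.
def gen():
    for i in range(5):
        yield i

ds = tf.data.Dataset.from_generator(gen, output_types=tf.int32, output_shapes=())

# Iterating the dataset at this point fails with the same
# "failed to connect to all addresses" gRPC error.
for element in ds:
    print(element)
```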
Awesome!! @bfontain's tip made it work!! I had to adapt my code to take into account some new changes, and now it works like a charm with TF 2.6.0 and the TPU client set to this same version with:
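Roughly the following (a sketch of the version-matching step, not the exact cell; the cloud_tpu_client calls are the assumed mechanism):

```python
# Pin the notebook's TF and the TPU client to matching versions, e.g.:
#   !pip install tensorflow==2.6.0 cloud-tpu-client==0.10
from cloud_tpu_client import Client

c = Client()
# Point the TPU runtime at the same TF version installed in the notebook.
c.configure_tpu_version('2.6.0', restart_type='always')
c.wait_for_healthy()
```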
BTW when I run print(c.run_time_version()) I get the error AttributeError: 'Client' object has no attribute 'run_time_version' with cloud_tpu_client==0.10.
The TensorFlow release may be done, but it may be a few days before the TPU runtime release is available. You can check the runtime version your associated TPU is on via the Client() object above.
Ideally a Colab user should not need to do this. We could detect a TF version change after running a !pip or %pip install (or something similar) and automatically make the runtime match if possible. An alternative would be to enhance %tensorflow_version (which does support TPU runtime version changes) to support more than just a handful of versions; in particular, it doesn't support nightly.
FYI, if you are using tf-nightly with TPUs you may need to make the TPU runtime version match the currently installed TF version. To do this:
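A sketch of that step (the exact commands from the original comment are not reproduced here; the cloud_tpu_client method names are assumed):

```python
# Assumed setup, e.g.:
#   !pip install tf-nightly cloud-tpu-client
import tensorflow as tf
from cloud_tpu_client import Client

c = Client()
print(c.runtime_version())  # version the TPU runtime is currently on (method name assumed)
c.configure_tpu_version(tf.__version__, restart_type='always')
c.wait_for_healthy()        # wait for the TPU to come back up on the new version
```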
@Smankusors: I believe the root cause for the error you are encountering is different from that of this initial bug report.
As for your case, could you instead materialize the dataset (instead of using the from_generator API)? The tf.data.Dataset.from_generator() API is not yet supported on TPUs, and I believe this is the reason for the failure.
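One way to do that, as a rough sketch (assuming the data fits in memory), is to drain the generator into tensors and build the dataset with from_tensor_slices instead:

```python
import tensorflow as tf

def gen():
    for i in range(5):
        yield i

# Materialize the generator's output on the host first...
values = tf.constant(list(gen()))

# ...then build a regular dataset from it, which the TPU can consume.
ds = tf.data.Dataset.from_tensor_slices(values)
for element in ds:
    print(element)
```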
@jplu Yes, there is no visibility into whether your issue reproduces in the @Saduf2019 gist, because this new issue could either be the only one (in which case your issue is already solved in nightly) or just an earlier failure that prevents triggering yours. The only way to "quickly" disambiguate the status on nightly is to resolve this new GraphDef incompatibility.