tensorflow: Error "failed to connect to all addresses" when training on TPU with Colab
Hello!
Describe the current behavior
When running my Colab I get the following error:
```
UnavailableError: failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1599555147.064735819","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"@1599555147.064732799","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
```
When I try to iterate over my dataset:
```
<ipython-input-5-9732b6b5faf1> in train(self)
     42
     43         for epoch_iter in range(1, 5):
---> 44             for step, batch in enumerate(train_ds):
     45                 self.global_step = iterations.numpy()
     46
```
Describe the expected behavior
Being able to iterate over the dataset.
Standalone code to reproduce the issue
The full Colab can be found here: https://colab.research.google.com/drive/1jZaFmrDmBGbGkg8oKEJ78cL2PgIB7UF5?usp=sharing
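For context, the failing pattern is roughly the following (a sketch only, not the actual notebook; the dataset and batch sizes here are placeholders):

```python
import tensorflow as tf

# Standard Colab TPU setup.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Placeholder input pipeline; the real notebook builds its own dataset.
ds = tf.data.Dataset.from_tensor_slices(tf.random.uniform((128, 10))).batch(16)
train_ds = strategy.experimental_distribute_dataset(ds)

for epoch_iter in range(1, 5):
    # In the notebook, the UnavailableError is raised as soon as this loop
    # pulls its first batch.
    for step, batch in enumerate(train_ds):
        pass
```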
About this issue
- State: closed
- Created 4 years ago
- Comments: 30 (18 by maintainers)
Unfortunately I stumbled upon this issue when trying with a generator. I tried an example from the docs here and it returns the same error. This happens when I initialize the TPU first and then create the dataset later.
Error: the same UnavailableError as above.
It’s on Google Colab btw
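The pattern in question is roughly this (a sketch; the exact docs example and output shapes are assumed):

```python
import tensorflow as tf

# Initialize the TPU first...
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# ...then build a dataset from a Python generator, as in the tf.data guide.
def gen():
    for i in range(5):
        yield i

ds = tf.data.Dataset.from_generator(gen, output_types=tf.int32, output_shapes=())

# Iterating the dataset at this point fails with the same
# "failed to connect to all addresses" gRPC error.
for element in ds:
    print(element)
```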
Awesome!! @bfontain's tip made it work!! I had to adapt my code to take into account some new changes, and now it works like a charm with TF 2.6.0 and the TPU client set to this same version with:
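Roughly the following (a sketch of the version-matching step, not the exact cell; the cloud_tpu_client calls are the assumed mechanism):

```python
# Pin the notebook's TF and the TPU client to matching versions, e.g.:
#   !pip install tensorflow==2.6.0 cloud-tpu-client==0.10
from cloud_tpu_client import Client

c = Client()
# Point the TPU runtime at the same TF version installed in the notebook.
c.configure_tpu_version('2.6.0', restart_type='always')
c.wait_for_healthy()
```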
BTW when I run print(c.run_time_version()) I get the error AttributeError: 'Client' object has no attribute 'run_time_version' with cloud_tpu_client==0.10.
The TensorFlow release may be done, but it may be a few days before the TPU runtime release is available. You can check the runtime version your associated TPU is on via the Client() object above.
Ideally a Colab user should not need to do this. We could detect a TF version change after running a !pip or %pip install (or something similar) and automatically make the runtime match if possible. An alternative would be to enhance %tensorflow_version (which does support TPU runtime version changes) to support more than just a handful of versions; in particular, it doesn't support nightly.
FYI, if you are using tf-nightly with TPUs you may need to make the TPU runtime version match the currently installed TF version. To do this:
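A sketch of that step (the exact commands from the original comment are not reproduced here; the cloud_tpu_client method names are assumed):

```python
# Assumed setup, e.g.:
#   !pip install tf-nightly cloud-tpu-client
import tensorflow as tf
from cloud_tpu_client import Client

c = Client()
print(c.runtime_version())  # version the TPU runtime is currently on (method name assumed)
c.configure_tpu_version(tf.__version__, restart_type='always')
c.wait_for_healthy()        # wait for the TPU to come back up on the new version
```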
@Smankusors: I believe the root cause for the error you are encountering is different from that of this initial bug report.
As for your case, could you instead materialize the dataset (instead of using the from_generator API)? The tf.data.Dataset.from_generator() API is not yet supported on TPUs, and I believe this is the reason for the failure.
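One way to do that, as a rough sketch (assuming the data fits in memory), is to drain the generator into tensors and build the dataset with from_tensor_slices instead:

```python
import tensorflow as tf

def gen():
    for i in range(5):
        yield i

# Materialize the generator's output on the host first...
values = tf.constant(list(gen()))

# ...then build a regular dataset from it, which the TPU can consume.
ds = tf.data.Dataset.from_tensor_slices(values)
for element in ds:
    print(element)
```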
@jplu Yes, there is no visibility into whether your issue reproduces in the @Saduf2019 gist, because this new issue could either be the only one (in which case your issue is already solved in nightly) or just an earlier failure that prevents triggering yours. The only way to "quickly" disambiguate the status on nightly is to resolve this new GraphDef incompatibility.