tensorflow: Google Colab error for TPU - UnavailableError: {{function_node __inference_train_function_99378}} failed to connect to all addresses
I am no longer able to train my model on a Google Cloud TPU in my gist; it was training fine 2 months ago, and now I get the following error:
UnavailableError: {{function_node __inference_train_function_99378}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1601903304.230958587","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"@1601903304.089639211","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
[[{{node IteratorGetNext}}]]
Seems related to this issue: https://github.com/tensorflow/tensorflow/issues/43037
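For context, this class of failure often means the TPU workers either were never initialized or cannot reach the input data on the Colab VM. Below is a minimal sketch of the standard Colab TPU setup, not the exact code from the gist; it assumes the `COLAB_TPU_ADDR` environment variable Colab sets when a TPU runtime is attached, and falls back to the default strategy when no TPU is present:

```python
import os

import tensorflow as tf

if 'COLAB_TPU_ADDR' in os.environ:
    # Resolve and connect to the Colab-provided TPU worker.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
        tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    # On TF 2.3 this was tf.distribute.experimental.TPUStrategy.
    strategy = tf.distribute.TPUStrategy(resolver)
else:
    # No TPU attached: fall back to the default (CPU/GPU) strategy.
    strategy = tf.distribute.get_strategy()

print('Replicas:', strategy.num_replicas_in_sync)
```

If initialization itself succeeds but training still fails at `IteratorGetNext`, the problem is usually in how the input pipeline is fed to the TPU rather than in the connection.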
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 4
- Comments: 18 (4 by maintainers)
Facing the same issue with tf 2.3.0
Still no solution? Should we take this to mean we should not use Python generators on TPU?
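One workaround for this class of error is to replace the Python generator with a graph-native tf.data pipeline, so the TPU workers can run `IteratorGetNext` themselves instead of calling back into the notebook's Python process for every batch. A minimal sketch with hypothetical stand-in data (the array shapes and names below are made up for illustration):

```python
import numpy as np
import tensorflow as tf

# Hypothetical in-memory data standing in for whatever the generator yielded.
features = np.random.rand(100, 8).astype('float32')
labels = np.random.randint(0, 2, size=(100,))

# Build the pipeline from tensors rather than a Python generator, so the
# data lives inside the TensorFlow graph and is reachable from TPU workers.
dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(100)
    .batch(32, drop_remainder=True)  # TPUs require static batch shapes
    .prefetch(tf.data.experimental.AUTOTUNE)
)
```

The resulting `dataset` can be passed directly to `model.fit` under a TPU strategy; `drop_remainder=True` matters because XLA compilation on TPU needs fixed shapes.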
Have just tried; using tf 2.3.0 gives the same error.
Have done so and receive the same error: GIST
@JessicaLopezEspejel Unfortunately not. I had to resort to using a GPU. Please let me know if you find a solution.
Has anybody been able to solve this?
I’m trying to run BERT on a Google Colab TPU, but I’m getting a similar error. TensorFlow version: 2.8.0. The code I’m using to load the TPU is largely based on Google’s original code for pre-training T5, taken from here:
Code I’m using to run BERT is this:
The run_mlm.py script can be seen here. The full error message can be seen here.
Any help is much appreciated, thanks.