tensorflow: Google Colab error for TPU - UnavailableError: {{function_node __inference_train_function_99378}} failed to connect to all addresses

I am no longer able to train a model using a Google Cloud TPU from my gist. It was training fine 2 months ago, and now I get the following error:

UnavailableError: {{function_node __inference_train_function_99378}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1601903304.230958587","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"@1601903304.089639211","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
	 [[{{node IteratorGetNext}}]]

Seems related to this issue: https://github.com/tensorflow/tensorflow/issues/43037

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 4
  • Comments: 18 (4 by maintainers)

Most upvoted comments

Facing the same issue with tf 2.3.0

Still no solution? Should we take this to mean we should not use generators on TPU?
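
If generators are the problem, one workaround worth trying is to wrap the generator in a tf.data.Dataset and pass that to model.fit; a rough sketch (the generator, shapes, and dtypes here are illustrative, not taken from the gist):

import numpy as np
import tensorflow as tf

# Illustrative stand-in for the real generator (e.g. data_generator(...)).
def example_generator():
    for _ in range(1000):
        x = np.random.randint(0, 10000, size=(128,), dtype=np.int32)  # token ids
        y = np.random.randint(0, 2, dtype=np.int32)                   # label
        yield x, y

# Wrap the Python generator in a tf.data pipeline with explicit types/shapes,
# then batch with drop_remainder=True so shapes stay static for the TPU.
dataset = (
    tf.data.Dataset.from_generator(
        example_generator,
        output_types=(tf.int32, tf.int32),
        output_shapes=((128,), ()),
    )
    .batch(32, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

# model.fit accepts the dataset directly, instead of model.fit_generator(generator, ...):
# model.fit(dataset, steps_per_epoch=steps, epochs=10, callbacks=callbacks_list)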

Have just tried it; using tf 2.3.0 gives the same error:

UnavailableError                          Traceback (most recent call last)
<ipython-input-8-6ac9763b188d> in <module>()
    129 
    130 generator = data_generator(texts, train_features, 1, max_sequence)
--> 131 model.fit_generator(generator, steps_per_epoch=steps, epochs=10, callbacks=callbacks_list, verbose=1)
    132 model.save(mydrive + '/output/weights.hdf5')

16 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

UnavailableError: {{function_node __inference_train_function_56800}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1602080688.607997979","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"@1602080688.315542576","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}
	 [[{{node IteratorGetNext}}]]

In the past this error message has indicated a version mismatch. Can you first try importing from tensorflow.keras instead of standalone keras?
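
For example, something along these lines (a generic check, not the gist's actual model):

import tensorflow as tf
# Import Keras from the TensorFlow package so the two versions cannot drift apart.
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

print("TF version:", tf.__version__)           # e.g. 2.3.0
print("Bundled Keras:", tf.keras.__version__)  # the Keras that ships with that TF release

inputs = Input(shape=(16,))
outputs = Dense(1, activation="sigmoid")(inputs)
model = Model(inputs, outputs)  # built entirely from tensorflow.keras objects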

Have done so and receive the same error, GIST

@JessicaLopezEspejel Unfortunately not. I had to resort to using GPU. Please let me know if you find a solution.

Has anybody been able to solve this?

I’m trying to run BERT on a Google Colab TPU, but I’m getting a similar error. TensorFlow version: 2.8.0. The code I’m using for loading the TPU is largely based on Google's original code for pre-training T5, taken from here:

print("Installing dependencies...")
%tensorflow_version 2.x

import functools
import os
import time
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import tensorflow.compat.v1 as tf
import tensorflow_datasets as tfds

BASE_DIR = "gs://bucket-xx" #@param { type: "string" }
if not BASE_DIR or BASE_DIR == "gs://":
  raise ValueError("You must enter a BASE_DIR.")
DATA_DIR = os.path.join(BASE_DIR, "data/text.csv")
MODELS_DIR = os.path.join(BASE_DIR, "models/bert")
ON_CLOUD = True


if ON_CLOUD:
  print("Setting up GCS access...")
  import tensorflow_gcs_config
  from google.colab import auth
  # Set credentials for GCS reading/writing from Colab and TPU.
  TPU_TOPOLOGY = "v2-8"
  try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
    TPU_ADDRESS = tpu.get_master()
    print('Running on TPU:', TPU_ADDRESS)
  except ValueError:
    raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')
  auth.authenticate_user()
  tf.enable_eager_execution()
  tf.config.experimental_connect_to_host(TPU_ADDRESS)
  tensorflow_gcs_config.configure_gcs_from_colab_auth()

tf.disable_v2_behavior()

# Improve logging.
from contextlib import contextmanager
import logging as py_logging

if ON_CLOUD:
  tf.get_logger().propagate = False
  py_logging.root.setLevel('INFO')

@contextmanager
def tf_verbosity_level(level):
  og_level = tf.logging.get_verbosity()
  tf.logging.set_verbosity(level)
  yield
  tf.logging.set_verbosity(og_level)
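
For comparison, my understanding of the TF 2.x-native way to attach to a Colab TPU (using experimental_connect_to_cluster and initialize_tpu_system rather than experimental_connect_to_host) is roughly the following; this is a generic sketch, not the code I am actually running:

import tensorflow as tf

# Detect the Colab TPU and run the full TF2 initialization handshake.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# Any model built and compiled inside this scope is placed on the TPU replicas.
strategy = tf.distribute.TPUStrategy(resolver)
print("Replicas:", strategy.num_replicas_in_sync)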

The code I’m using to run BERT is this:

!python /content/scripts/run_mlm.py \
--model_name_or_path bert-base-cased \
--tpu_num_cores 8 \
--validation_split_percentage 20 \
--line_by_line \
--learning_rate 2e-5 \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 256 \
--num_train_epochs 4 \
--output_dir MODELS_DIR \
--train_file /content/text.csv

The run_mlm.py script can be seen here.

The full error message can be seen here.

Any help is much appreciated, thanks.