tensorflow: TPU with tensorflow 2.0 -- 'DeleteIterator' OpKernel missing

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES
  • OS Platform and Distribution: Debian GNU/Linux 9
  • TensorFlow installed from (source or binary): /usr/bin/pip3
  • TensorFlow version (use command below): 2.0.0b1
  • Python version: 3.5.3
  • TPU type: v2-8
  • TPU software version: 1.14

Describe the current behavior

I am running a TPU allocated by ctpu up with TensorFlow 2.0 (I’m aware this isn’t fully supported at the moment). I have a simple training loop working, mostly following the guidelines here: https://www.tensorflow.org/beta/guide/distribute_strategy#using_tfdistributestrategy_with_custom_training_loops

To my surprise I have run into only a few issues along the way, but one that I can’t seem to remedy on my end is this: tensorflow.python.framework.errors_impl.NotFoundError: No registered 'DeleteIterator' OpKernel for TPU devices compatible with node {{node DeleteIterator}}

This error doesn’t seem to actually break anything, but I’m worried there could be a TPU memory leak or something similar, and that is difficult to verify without any usable TPU profiling tools for TF 2.0. EDIT: The bug does actually cause the program to crash after a few iterations.

Describe the expected behavior

I expect to be able to delete the iterator object within the TPU strategy scope.

Code to reproduce the issue

# Inside a training class: self.distribution_strategy is the TPUStrategy,
# self.fill_experience_buffer() builds the dataset consumed below, and
# train_step (defined elsewhere) runs one distributed training step.
for epoch in range(self.train_epochs):
    with tf.device(self.device), self.distribution_strategy.scope():
        dataset = self.fill_experience_buffer()
        exp_buff = iter(dataset)

        for step in tqdm(range(self.train_steps), "Training epoch {}".format(epoch)):
            train_step(next(exp_buff))

The issue occurs the second time around the loop, when the exp_buff variable is reassigned with iter(dataset). I’ve tried explicitly freeing the object with del exp_buff, both inside and outside the scope(), but the same error occurs regardless.
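The report doesn’t show train_step itself; for reference, a minimal sketch of what a train_step along the lines of the linked custom-training-loop guide might look like is below. model, optimizer, and loss_fn are hypothetical stand-ins (not from the report), and strategy stands for self.distribution_strategy.

import tensorflow as tf

@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        # One forward/backward pass on a single replica.
        features, labels = inputs
        with tf.GradientTape() as tape:
            logits = model(features, training=True)
            loss = loss_fn(labels, logits)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    # Run step_fn on every TPU replica and average the per-replica losses.
    per_replica_losses = strategy.experimental_run_v2(step_fn, args=(dist_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_losses, axis=None)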

Other info / logs

Full error message below (the message appears 8 times, once for each TPU device, but the messages are identical):

Exception ignored in: <bound method IteratorResourceDeleter.__del__ of <tensorflow.python.data.ops.iterator_ops.IteratorResourceDeleter object at 0x7f0bac1883c8>>
Traceback (most recent call last):
  File "/home/youngalou/.local/lib/python3.5/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 531, in __del__
    handle=self._handle, deleter=self._deleter)
  File "/home/youngalou/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 800, in delete_iterator
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'DeleteIterator' OpKernel for TPU devices compatible with node {{node DeleteIterator}}
        .  Registered:  device='CPU'
  device='GPU'

Additional GRPC error information:
{"created":"@1567559533.240191338","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"No registered 'DeleteIterator' OpKernel for TPU devices compatible with node {{node DeleteIterator}}\n\t.  Registered:  device='CPU'\n  device='GPU'\n","grpc_status":5} [Op:DeleteIterator]
Exception ignored in: <bound method IteratorResourceDeleter.__del__ of <tensorflow.python.data.ops.iterator_ops.IteratorResourceDeleter object at 0x7f0bac188518>>
Traceback (most recent call last):
  File "/home/youngalou/.local/lib/python3.5/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 531, in __del__
    handle=self._handle, deleter=self._deleter)
  File "/home/youngalou/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 800, in delete_iterator
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'DeleteIterator' OpKernel for TPU devices compatible with node {{node DeleteIterator}}
        .  Registered:  device='CPU'
  device='GPU'

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 10
  • Comments: 19 (10 by maintainers)

Most upvoted comments

As for the {{DeleteIterator}} issue: this error is raised when the code suddenly crashes and kills the execution. So while debugging TPU code, these errors are to be ignored (unless there’s nothing else in the traceback apart from them). Scroll up to find out what killed the code. In your case there’s a high chance the TPU cannot process it because two different context IDs are produced for the same training graph.

Sorry about that! This is my first time submitting a bug report. I’ve recreated a small standalone script that demonstrates the bug.

import numpy as np
import tensorflow as tf

tpu_address = 'youngalou'
device = '/job:worker'
train_epochs = 100
train_steps = 100
dataset_size = 1000
batch_size = 256

cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
tf.config.experimental_connect_to_host(cluster_resolver.master())
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
tpu_strategy = tf.distribute.experimental.TPUStrategy(cluster_resolver)

def get_dataset():
    # Build a dummy dataset of zero-filled features/labels and distribute it across the TPU replicas.
    dataset = tf.data.Dataset.from_tensor_slices(
        (np.zeros((dataset_size, 128), dtype=np.float32),
         np.zeros((dataset_size, 1), dtype=np.float32)))
    dataset = dataset.shuffle(dataset_size).repeat().batch(batch_size)
    return tpu_strategy.experimental_distribute_dataset(dataset)

for _ in range(train_epochs):
    # Re-entering the device/strategy scope and rebuilding the iterator each epoch
    # is what triggers the DeleteIterator error on the second pass through this loop.
    with tf.device(device), tpu_strategy.scope():
        dataset = get_dataset()
        exp_buff = iter(dataset)

        for _ in range(train_steps):
            train_batch = next(exp_buff)

Hey @youngalou! You don’t need to open the strategy and device scope every time; doing so produces a different context ID while training the model. So I’d ask you to move tf.device() and strategy.scope() outside the loop.

with tf.device(device), tpu_strategy.scope():
    for _ in range(train_epochs):
        dataset = get_dataset()
        exp_buff = iter(dataset)

        for _ in range(train_steps):
            train_batch = next(exp_buff)

This is supposed to work.
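Since get_dataset() already calls .repeat(), a further variant (just a sketch along the same lines, not something from the original report) is to build the distributed dataset and its iterator only once and reuse them across epochs, so no per-epoch iterator ever needs to be deleted on the TPU:

with tf.device(device), tpu_strategy.scope():
    dataset = get_dataset()    # .repeat() in get_dataset() makes this unbounded
    exp_buff = iter(dataset)   # single iterator shared across all epochs

    for _ in range(train_epochs):
        for _ in range(train_steps):
            train_batch = next(exp_buff)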