tensorflow: TPU with tensorflow 2.0 -- 'DeleteIterator' OpKernel missing
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES
- OS Platform and Distribution: Debian GNU/Linux 9
- TensorFlow installed from (source or binary): /usr/bin/pip3
- TensorFlow version (use command below): 2.0.0b1
- Python version: 3.5.3
- TPU type: v2-8
- TPU software version: 1.14
Describe the current behavior
I am running on a TPU allocated by ctpu up, under TensorFlow 2.0 (I'm aware this isn't fully supported at the moment). I have a simple training loop functioning, mostly following the guidelines here:
https://www.tensorflow.org/beta/guide/distribute_strategy#using_tfdistributestrategy_with_custom_training_loops
To my surprise I have encountered very few issues along the way, but one that I can't seem to remedy on my end is this:
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'DeleteIterator' OpKernel for TPU devices compatible with node {{node DeleteIterator}}
This error doesn't seem to break anything at first, but I'm worried there could be a TPU memory leak or similar, and that is difficult to verify without any usable TPU profiling tools for TF 2.0. EDIT: the bug does in fact cause the program to crash after a few iterations.
Describe the expected behavior
I expect to be able to delete the iterator object within the TPU strategy scope.
Code to reproduce the issue
for epoch in range(self.train_epochs):
    with tf.device(self.device), self.distribution_strategy.scope():
        dataset = self.fill_experience_buffer()
        exp_buff = iter(dataset)
        for step in tqdm(range(self.train_steps), "Training epoch {}".format(epoch)):
            train_step(next(exp_buff))
The issue occurs the second time around the loop, when the exp_buff variable is rebound with iter(dataset). I've tried explicitly freeing the object with del exp_buff, both inside and outside the scope(), but the same error occurs regardless.
Other info / logs
Full error message below (the message appears 8 times, once for each TPU device, but the messages are identical):
Exception ignored in: <bound method IteratorResourceDeleter.__del__ of <tensorflow.python.data.ops.iterator_ops.IteratorResourceDeleter object at 0x7f0bac1883c8>>
Traceback (most recent call last):
File "/home/youngalou/.local/lib/python3.5/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 531, in __del__
handle=self._handle, deleter=self._deleter)
File "/home/youngalou/.local/lib/python3.5/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 800, in delete_iterator
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'DeleteIterator' OpKernel for TPU devices compatible with node {{node DeleteIterator}}
. Registered: device='CPU'
device='GPU'
Additional GRPC error information:
{"created":"@1567559533.240191338","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"No registered 'DeleteIterator' OpKernel for TPU devices compatible with node {{node DeleteIterator}}\n\t. Registered: device='CPU'\n device='GPU'\n","grpc_status":5} [Op:DeleteIterator]
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 10
- Comments: 19 (10 by maintainers)
And about the {{DeleteIterator}} issue: this error is raised when the code suddenly crashes and kills the execution. So while debugging TPU code, these errors can be ignored (unless there is nothing else in the traceback apart from them). Scroll up to find out what actually killed the code. In your case there is a high chance the TPU cannot process it because two different context IDs are produced for the same training graph.
Hey @youngalou! You don't need to open the strategy and device scope every time; doing so produces a different context ID while training the model. So I'd ask you to move the tf.device() and strategy.scope() outside the loop. This is supposed to work.
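A minimal sketch of that restructuring. So the loop shape can run without a TPU, device_scope, fill_experience_buffer, and train_step below are hypothetical placeholders; in real code the context manager would be tf.device(self.device) combined with strategy.scope(), and the two functions would be the class's own methods:

```python
from contextlib import contextmanager

@contextmanager
def device_scope(name):
    # Placeholder for: with tf.device(name), strategy.scope():
    yield

def fill_experience_buffer():
    return [1, 2, 3]      # placeholder dataset

def train_step(batch):
    return batch * 2      # placeholder training step

results = []
with device_scope("tpu"):  # scopes entered ONCE, outside the epoch loop
    for epoch in range(2):
        dataset = fill_experience_buffer()
        exp_buff = iter(dataset)  # a fresh iterator per epoch is still fine
        for step in range(3):
            results.append(train_step(next(exp_buff)))
print(results)  # [2, 4, 6, 2, 4, 6]
```

The point is only where the scopes are entered: creating a new iterator each epoch stays as before, but the device and strategy contexts are opened a single time around the whole training run.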