tensorflow: Copying tensors to GPU fails non-deterministically
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.6 LTS (GNU/Linux 4.4.0-174-generic x86_64)
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
- TensorFlow installed from (source or binary): binary, pip install tensorflow==2.2.0rc1
- TensorFlow version (use command below): v2.2.0-rc0-43-gacf4951a2f 2.2.0-rc1
- Python version: Python 3.7.3
- CUDA/cuDNN version: 10.1, V10.1.243
- GPU model and memory: 2x GeForce GTX 1080 8GB
Describe the current behavior
I have a system with two GPUs. Since I need the fix from https://github.com/tensorflow/tensorflow/issues/33929, I upgraded from 2.1 to 2.2.0rc0 and then to 2.2.0rc1. In both cases, I sometimes get the following exception when trying to train on a GPU:
Traceback (most recent call last):
File "/data/personal/username/deployed/project/project/bin/train.py", line 86, in <module>
main()
File "/data/personal/username/deployed/project/project/bin/train.py", line 82, in main
train() # pragma: no cover
File "/data/personal/username/deployed/project/project/bin/train.py", line 75, in train
callbacks=callbacks,
File "/data/personal/username/deployed/project/project/objects/models/KerasModel.py", line 190, in train_on_generator
callbacks=callbacks,
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 65, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 783, in fit
tmp_logs = train_function(iterator)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
result = self._call(*args, **kwds)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
return self._stateless_fn(*args, **kwds)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1665, in _filtered_call
self.captured_inputs)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 598, in call
ctx=ctx)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run AddV2: Attempted to set tensor for existing mirror. [Op:AddV2]
Traceback (most recent call last):
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 241, in __call__
return func(device, token, args)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 130, in __call__
ret = self._func(*args)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 309, in wrapper
return func(*args, **kwargs)
File "/data/personal/username/deployed/project/project/objects/generators/GeneratorRetinaNet.py", line 85, in _getitem_pre_anchors
image, node = self.get_prepped_node(idx=idx)
File "/data/personal/username/deployed/project/project/objects/generators/Generator.py", line 252, in get_prepped_node
node = self.get_node(idx=idx)
File "/data/personal/username/deployed/project/project/objects/generators/GeneratorObjectDetection.py", line 97, in get_node
node = super(GeneratorObjectDetection, self).get_node(idx=idx)
File "/data/personal/username/deployed/project/project/objects/generators/Generator.py", line 376, in get_node
return self._dataFetcher.get_node(idx)
File "/data/personal/username/deployed/project/project/objects/generators/DataFetchers/DataFetcher.py", line 117, in get_node
nodes = self.get_nodes(node_index, node_index + 1)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 997, in binary_op_wrapper
return func(x, y, name=name)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 1276, in _add_dispatch
return gen_math_ops.add_v2(x, y, name=name)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 480, in add_v2
_ops.raise_from_not_ok_status(e, name)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run AddV2: Attempted to set tensor for existing mirror. [Op:AddV2]
The odd thing about this is that the behavior is not deterministic. At one moment it may not work; when I kill the process, wait a minute, and try again, it sometimes works. In general, retrying once or twice fixes the issue, but currently only about 10% of my training attempts succeed. Is this a known issue?
Note that in (TF1.13,) TF2.0 and TF2.1, this seemed to work fine for me.
Before training, I always set os.environ['CUDA_VISIBLE_DEVICES'] = '0' or '1', depending on the GPU I want to use. I have tried replacing this with tf.config.set_visible_devices, to no avail.
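For reference, this is roughly how I select the GPU per run (a minimal sketch; the index '0' is just an example of the per-run choice, and only one of the two options is used at a time):

import os

# Option 1: what I normally do. Must be set before TensorFlow
# initializes the GPUs (i.e. before any op runs).
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # or '1', depending on the run

import tensorflow as tf

# Option 2: the replacement I tried, to no avail.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[0], 'GPU')  # or gpus[1]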
For completeness, I do not use TPUs or distributed strategies.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 34 (10 by maintainers)
Update 2: Another part of my pipeline reproduced this issue. Instead of putting it in a with tf.device() block as above, I turned OFF num_parallel_calls in my dataset.map() call. The error went away. It seems there is something about the num_parallel_calls argument that triggers this error.

Update: If I put the problematic line (and related lines) inside a with tf.device('/device:cpu:0'): block, then I don't get the error and everything works well. The idea was to prevent a copy from CPU to GPU, and this achieves that. Not sure what happened, but it's a quick fix.

I am experiencing this issue as well. I am training with a tf.data.Dataset and using dataset.map() + a tf.py_function() to process the dataset elements. Can't share code. Maybe I'll try tf-nightly.
This is the error message:
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run Mul: Attempted to set tensor for existing mirror. [Op:Mul]
This happens when this line is executed inside the map() -> py_func() -> my_func():
img = tf.image.convert_image_dtype(img, tf.float32)
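To make the setup concrete, here is a minimal sketch of the pattern described above with both workarounds marked. The function names follow the map() -> py_func() -> my_func() chain mentioned earlier; the dummy image data is made up, since I can't share the real pipeline:

import numpy as np
import tensorflow as tf

# Dummy uint8 "images"; stands in for the real data.
images = np.random.randint(0, 255, size=(8, 64, 64, 3), dtype=np.uint8)

def my_func(img):
    # The failing line. Pinning it (and related lines) to the CPU avoids the
    # "Attempted to set tensor for existing mirror" error for me.
    with tf.device('/device:cpu:0'):
        img = tf.image.convert_image_dtype(img, tf.float32)
    return img

def py_func(img):
    return tf.py_function(my_func, inp=[img], Tout=tf.float32)

dataset = tf.data.Dataset.from_tensor_slices(images)

# Leaving out num_parallel_calls (sequential map) also made the error go away,
# as opposed to map(py_func, num_parallel_calls=tf.data.experimental.AUTOTUNE).
dataset = dataset.map(py_func)

for batch in dataset.batch(4):
    pass  # fed into model.fit(...) in the real pipeline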