tensorflow: Copying tensors to GPU fails non-deterministically
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.6 LTS (GNU/Linux 4.4.0-174-generic x86_64)
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
- TensorFlow installed from (source or binary): binary, pip install tensorflow==2.2.0rc1
- TensorFlow version (use command below): v2.2.0-rc0-43-gacf4951a2f 2.2.0-rc1
- Python version: Python 3.7.3
- CUDA/cuDNN version: 10.1, V10.1.243
- GPU model and memory: 2x GeForce GTX 1080 8GB
Describe the current behavior
I have a system with two GPUs. Since I need the fix from https://github.com/tensorflow/tensorflow/issues/33929, I upgraded from 2.1 to 2.2.0rc0 and then to 2.2.0rc1. In both cases, I sometimes get the following exception when trying to train on a GPU:
Traceback (most recent call last):
File "/data/personal/username/deployed/project/project/bin/train.py", line 86, in <module>
main()
File "/data/personal/username/deployed/project/project/bin/train.py", line 82, in main
train() # pragma: no cover
File "/data/personal/username/deployed/project/project/bin/train.py", line 75, in train
callbacks=callbacks,
File "/data/personal/username/deployed/project/project/objects/models/KerasModel.py", line 190, in train_on_generator
callbacks=callbacks,
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 65, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 783, in fit
tmp_logs = train_function(iterator)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
result = self._call(*args, **kwds)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
return self._stateless_fn(*args, **kwds)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1665, in _filtered_call
self.captured_inputs)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 598, in call
ctx=ctx)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run AddV2: Attempted to set tensor for existing mirror. [Op:AddV2]
Traceback (most recent call last):
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 241, in __call__
return func(device, token, args)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/ops/script_ops.py", line 130, in __call__
ret = self._func(*args)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 309, in wrapper
return func(*args, **kwargs)
File "/data/personal/username/deployed/project/project/objects/generators/GeneratorRetinaNet.py", line 85, in _getitem_pre_anchors
image, node = self.get_prepped_node(idx=idx)
File "/data/personal/username/deployed/project/project/objects/generators/Generator.py", line 252, in get_prepped_node
node = self.get_node(idx=idx)
File "/data/personal/username/deployed/project/project/objects/generators/GeneratorObjectDetection.py", line 97, in get_node
node = super(GeneratorObjectDetection, self).get_node(idx=idx)
File "/data/personal/username/deployed/project/project/objects/generators/Generator.py", line 376, in get_node
return self._dataFetcher.get_node(idx)
File "/data/personal/username/deployed/project/project/objects/generators/DataFetchers/DataFetcher.py", line 117, in get_node
nodes = self.get_nodes(node_index, node_index + 1)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 997, in binary_op_wrapper
return func(x, y, name=name)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 1276, in _add_dispatch
return gen_math_ops.add_v2(x, y, name=name)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 480, in add_v2
_ops.raise_from_not_ok_status(e, name)
File "/home/username/.local/share/virtualenvs/project-1BQtdZDZ/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run AddV2: Attempted to set tensor for existing mirror. [Op:AddV2]
The odd thing about this is that the behavior is not deterministic. At one moment it may not work; when I kill the process, wait a minute, and try again, it sometimes works. In general, retrying once or twice fixes the issue, but currently only about 10% of my training attempts succeed. Is this a known issue?
Note that in (TF1.13,) TF2.0 and TF2.1, this seemed to work fine for me.
Before training, I always set os.environ['CUDA_VISIBLE_DEVICES'] = '0' or '1', depending on the GPU I want to use. I have tried replacing this with tf.config.set_visible_devices, to no avail.
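For reference, this is roughly how I select the GPU per run (a minimal sketch; the index '0' is just an example of the per-run choice, and only one of the two options is used at a time):

import os

# Option 1: what I normally do. Must be set before TensorFlow
# initializes the GPUs (i.e. before any op runs).
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # or '1', depending on the run

import tensorflow as tf

# Option 2: the replacement I tried, to no avail.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[0], 'GPU')  # or gpus[1]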
For completeness, I do not use TPUs or distributed strategies.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 34 (10 by maintainers)
Update 2: Another part of my pipeline reproduced this issue. Instead of putting it in a with tf.device() block as above, I turned OFF num_parallel_calls in my dataset.map() call. The error went away. It seems there is something about the num_parallel_calls argument that triggers this error.

Update: If I put the problematic line (and related lines) inside a with tf.device('/device:cpu:0'): block, then I don't get the error and everything works well. The idea was to prevent a copy from CPU to GPU, and this achieves that. Not sure what happened, but it's a quick fix.

I am experiencing this issue as well. I am training with a tf.data.Dataset and using dataset.map() + a tf.py_function() to process the dataset elements. Can't share code. Maybe I'll try tf-nightly.
This is the error message:
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run Mul: Attempted to set tensor for existing mirror. [Op:Mul]
This happens when this line is executed inside the map() -> py_func() -> my_func():
img = tf.image.convert_image_dtype(img, tf.float32)
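To make the setup concrete, here is a minimal sketch of the pattern described above with both workarounds marked. The function names follow the map() -> py_func() -> my_func() chain mentioned earlier; the dummy image data is made up, since I can't share the real pipeline:

import numpy as np
import tensorflow as tf

# Dummy uint8 "images"; stands in for the real data.
images = np.random.randint(0, 255, size=(8, 64, 64, 3), dtype=np.uint8)

def my_func(img):
    # The failing line. Pinning it (and related lines) to the CPU avoids the
    # "Attempted to set tensor for existing mirror" error for me.
    with tf.device('/device:cpu:0'):
        img = tf.image.convert_image_dtype(img, tf.float32)
    return img

def py_func(img):
    return tf.py_function(my_func, inp=[img], Tout=tf.float32)

dataset = tf.data.Dataset.from_tensor_slices(images)

# Leaving out num_parallel_calls (sequential map) also made the error go away,
# as opposed to map(py_func, num_parallel_calls=tf.data.experimental.AUTOTUNE).
dataset = dataset.map(py_func)

for batch in dataset.batch(4):
    pass  # fed into model.fit(...) in the real pipeline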