tensorflow: MirroredStrategy fails with RaggedTensor

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Education N 1903
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on a mobile device: None
  • TensorFlow installed from (source or binary): pip install
  • TensorFlow version (use command below): 2.6
  • Python version: 3.9.7
  • Bazel version (if compiling from source): None
  • GCC/Compiler version (if compiling from source): None
  • CUDA/cuDNN version: 11.2 / 8.1.0
  • GPU model and memory: Nvidia GTX 1080 Ti 11GB
  • Exact command to reproduce:
python name_of_file.py

Describe the problem

Using any RaggedTensor in a model during training fails under MirroredStrategy, whereas the same code works with no strategy. The code below is a toy example: the ragged tensor is only pulled into the model graph by a tf.print, but the problem is the same (and is the one originally encountered) when the ragged tensor takes part in the computation of the model's output. A stripped-down sketch of the trigger follows.
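
For orientation, here is a minimal sketch of what appears to be the trigger (my reading of the error, not code from this issue): a variant-encoded RaggedTensor being materialized inside the GPU replica function.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(devices=['GPU:0'])

@tf.function
def step():
    # RaggedTensors are stored as variant tensors; formatting one for
    # tf.print inside the replica function forces a Host->Device copy.
    tf.print(tf.ragged.constant([[1], [1, 1]]))

strategy.run(step)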

Source code / logs

import tensorflow as tf

@tf.function
def loss(a, b):
    # Toy loss; the second argument is unused.
    return tf.reduce_mean(tf.abs(a), axis=1)

class FailL(tf.keras.layers.Layer):
    def __init__(self):
        super(FailL, self).__init__()
    
    def call(self, inputs):
        # Printing a RaggedTensor (a variant-encoded tensor) inside the
        # replica function is enough to trigger the failure.
        tf.print(tf.ragged.constant([[1], [1, 1]]))
        return inputs

class FailM(tf.keras.Model):
    def __init__(self, strategy):
        super(FailM, self).__init__()
        self.strategy = strategy
        if self.strategy is not None:
            # Create the layers under the strategy scope so their
            # variables are mirrored across replicas.
            with self.strategy.scope():
                self.layer1 = tf.keras.layers.Conv2D(1, [3, 3])
                self.layer2 = FailL()
        else:
            self.layer1 = tf.keras.layers.Conv2D(1, [3, 3])
            self.layer2 = FailL()
    
    @tf.function
    def call(self, inputs):
        return self.layer2(self.layer1(inputs))
    
    def compile(self):
        if self.strategy is not None:
            with self.strategy.scope():
                super(FailM, self).compile()
                self.loss = loss
                self.optimizer = tf.keras.optimizers.Adam()
        else:
            super(FailM, self).compile()
            self.loss = loss
            self.optimizer = tf.keras.optimizers.Adam()
            
    def train_step(self, data):
        with tf.GradientTape() as tape:
            rag = self.layer2(self.layer1(data))
            loss_value = self.loss(rag, 0)
        grads = tape.gradient(loss_value, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {"loss": loss_value}
    
    @tf.function
    def distributed_train_step(self, data):
        per_replica_losses = self.strategy.run(self.train_step, args=(data,))
        # Sum each per-replica value across replicas.
        return {key: self.strategy.reduce(tf.distribute.ReduceOp.SUM, value, axis=None)
                for key, value in per_replica_losses.items()}
    
    def choose_train_step(self, data):
        if self.strategy is None:
            return self.train_step(data)
        else:
            return self.distributed_train_step(data)

for choose_strat in [None,
                     tf.distribute.MirroredStrategy(devices=['GPU:0']),
                     ]:
    tf.print('Try with a strategy: ', type(choose_strat))
    model = FailM(choose_strat)
    model.compile()
    res = model.choose_train_step(tf.ones([3, 10, 10, 3]))
    tf.print('Result:', res)

Running the script produces the following output and traceback:

Try with a strategy:  <class 'tensorflow.python.distribute.mirrored_strategy.MirroredStrategy'>
Traceback (most recent call last):
  File "C:\..\trash_test_ragfail.py", line 74, in <module>
    res = model.choose_train_step(tf.ones([3,10,10,3]))
  File "C:\..\trash_test_ragfail.py", line 68, in choose_train_step
    return self.distributed_train_step(data)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\tf2_6\lib\site-packages\tensorflow\python\eager\def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\tf2_6\lib\site-packages\tensorflow\python\eager\def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\tf2_6\lib\site-packages\tensorflow\python\eager\function.py", line 3039, in __call__
    return graph_function._call_flat(
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\tf2_6\lib\site-packages\tensorflow\python\eager\function.py", line 1963, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\tf2_6\lib\site-packages\tensorflow\python\eager\function.py", line 591, in call
    outputs = execute.execute(
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\tf2_6\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  2 root error(s) found.
  (0) Invalid argument: 2 root error(s) found.
  (0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
  (1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
  (1) Invalid argument: 2 root error(s) found.
  (0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
  (1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
0 successful operations.
0 derived errors ignored.
         [[{{node test_l_1/StringFormat_1/AsString/map/TensorArrayUnstack/TensorListFromTensor/_18}}]]
         [[Func/test_l_1/StringFormat_1/AsString/map/while/body/_1/input/_59/_32]]
  (1) Invalid argument:  2 root error(s) found.
  (0) Invalid argument: 2 root error(s) found.
  (0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
  (1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
  (1) Invalid argument: 2 root error(s) found.
  (0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
  (1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
0 successful operations.
0 derived errors ignored.
         [[{{node test_l_1/StringFormat_1/AsString/map/TensorArrayUnstack/TensorListFromTensor/_18}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_train_step_687]

Function call stack:
distributed_train_step -> distributed_train_step
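
A workaround worth trying (an untested sketch inferred from the error message, not a fix confirmed in this thread) is to pin the ops that format the ragged tensor to the host, so the variant tensor never has to cross the host-to-device boundary. The class name WorkaroundL is hypothetical:

class WorkaroundL(tf.keras.layers.Layer):
    def call(self, inputs):
        # Hypothetical mitigation: placing the print on the CPU keeps the
        # variant (ragged) tensor on the host, avoiding the failing copy.
        with tf.device('/CPU:0'):
            tf.print(tf.ragged.constant([[1], [1, 1]]))
        return inputs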

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (6 by maintainers)

Most upvoted comments

Issue is reproducible in TF 2.7.0 with GPU. Here’s the gist
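
Another hedged mitigation (my sketch, not taken from the gist): densify the ragged tensor before it reaches GPU-placed ops, since plain dense tensors copy to the device normally.

import tensorflow as tf

rt = tf.ragged.constant([[1], [1, 1]])
# to_tensor() pads the ragged rows into a dense tensor, which does not
# use the variant encoding that the failing Host->Device copy chokes on.
tf.print(rt.to_tensor(default_value=0))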