tensorflow: MirroredStrategy fails with RaggedTensor
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Education N 1903
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on a mobile device: None
- TensorFlow installed from (source or binary): pip install
- TensorFlow version (use command below): 2.6
- Python version: 3.9.7
- Bazel version (if compiling from source): None
- GCC/Compiler version (if compiling from source): None
- CUDA/cuDNN version: 11.2 / 8.1.0
- GPU model and memory: Nvidia GTX 1080 Ti 11GB
- Exact command to reproduce: python name_of_file.py
Describe the problem
Using any RaggedTensor in the model being trained fails under MirroredStrategy, whereas it works when no strategy is used. The code below is a toy example: the ragged tensor only enters the model graph through a tf.print, but the problem is the same (and the original one) when the ragged tensor is part of the computation of the model's output.
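Independently of Keras, I believe the same failure reproduces with a minimal sketch like the following (my own reduction, not part of the original script):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(devices=['GPU:0'])

@tf.function
def step():
    # tf.print turns the RaggedTensor into string tensors (the
    # StringFormat/AsString nodes in the traceback below); copying
    # those to the GPU replica is what fails.
    tf.print(tf.ragged.constant([[1], [1, 1]]))

strategy.run(step)  # InvalidArgumentError: non-DMA-copy ... type: string
```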
Source code / logs
```python
import tensorflow as tf


@tf.function
def loss(a, b):
    return tf.reduce_mean(tf.abs(a), axis=1)


class FailL(tf.keras.layers.Layer):
    def __init__(self):
        super(FailL, self).__init__()

    def call(self, inputs):
        # The RaggedTensor only enters the graph through this tf.print;
        # the failure is the same when it is used in the real computation.
        tf.print(tf.ragged.constant([[1], [1, 1]]))
        return inputs


class FailM(tf.keras.Model):
    def __init__(self, strategy):
        super(FailM, self).__init__()
        self.strategy = strategy
        if self.strategy is not None:
            with self.strategy.scope():
                self.layer1 = tf.keras.layers.Conv2D(1, [3, 3])
                self.layer2 = FailL()
        else:
            self.layer1 = tf.keras.layers.Conv2D(1, [3, 3])
            self.layer2 = FailL()

    @tf.function
    def call(self, inputs):
        return self.layer2(self.layer1(inputs))

    def compile(self):
        if self.strategy is not None:
            with self.strategy.scope():
                super(FailM, self).compile()
                self.loss = loss
                self.optimizer = tf.keras.optimizers.Adam()
        else:
            super(FailM, self).compile()
            self.loss = loss
            self.optimizer = tf.keras.optimizers.Adam()

    def train_step(self, data):
        with tf.GradientTape() as tape:
            rag = self.layer2(self.layer1(data))
            loss = self.loss(rag, 0)
        grads = tape.gradient(loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {"loss": loss}

    @tf.function
    def distributed_train_step(self, data):
        per_replica_losses = self.strategy.run(self.train_step, args=(data,))
        return {prl: self.strategy.reduce(tf.distribute.ReduceOp.SUM,
                                          per_replica_losses[prl], axis=None)
                for prl in per_replica_losses}

    def choose_train_step(self, data):
        if self.strategy is None:
            return self.train_step(data)
        else:
            return self.distributed_train_step(data)


for choose_strat in [None,
                     tf.distribute.MirroredStrategy(devices=['GPU:0']),
                     ]:
    tf.print('Try with a strategy: ', type(choose_strat))
    model = FailM(choose_strat)
    model.compile()
    res = model.choose_train_step(tf.ones([3, 10, 10, 3]))
    tf.print('Result:', res)
```
Running the script, the first iteration (no strategy) succeeds; the second, with MirroredStrategy, fails with the following traceback:

```
Try with a strategy: <class 'tensorflow.python.distribute.mirrored_strategy.MirroredStrategy'>
Traceback (most recent call last):
  File "C:\..\trash_test_ragfail.py", line 74, in <module>
    res = model.choose_train_step(tf.ones([3,10,10,3]))
  File "C:\..\trash_test_ragfail.py", line 68, in choose_train_step
    return self.distributed_train_step(data)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\tf2_6\lib\site-packages\tensorflow\python\eager\def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\tf2_6\lib\site-packages\tensorflow\python\eager\def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\tf2_6\lib\site-packages\tensorflow\python\eager\function.py", line 3039, in __call__
    return graph_function._call_flat(
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\tf2_6\lib\site-packages\tensorflow\python\eager\function.py", line 1963, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\tf2_6\lib\site-packages\tensorflow\python\eager\function.py", line 591, in call
    outputs = execute.execute(
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\tf2_6\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: 2 root error(s) found.
(0) Invalid argument: 2 root error(s) found.
(0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
(1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
(1) Invalid argument: 2 root error(s) found.
(0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
(1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
0 successful operations.
0 derived errors ignored.
[[{{node test_l_1/StringFormat_1/AsString/map/TensorArrayUnstack/TensorListFromTensor/_18}}]]
[[Func/test_l_1/StringFormat_1/AsString/map/while/body/_1/input/_59/_32]]
(1) Invalid argument: 2 root error(s) found.
(0) Invalid argument: 2 root error(s) found.
(0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
(1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
(1) Invalid argument: 2 root error(s) found.
(0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
(1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
0 successful operations.
0 derived errors ignored.
[[{{node test_l_1/StringFormat_1/AsString/map/TensorArrayUnstack/TensorListFromTensor/_18}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_train_step_687]
Function call stack:
distributed_train_step -> distributed_train_step
```
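The error comes from the string tensors that tf.print builds from the RaggedTensor (see the StringFormat/AsString nodes above) being copied to the GPU replica. As a workaround sketch (my own, untested against this exact setup), pinning those ops to the host should avoid the Variant Host->Device copy:

```python
class FailL(tf.keras.layers.Layer):
    def call(self, inputs):
        # Hypothetical workaround: keep the RaggedTensor, and the string
        # tensors tf.print derives from it, on the host so no Variant
        # Host->Device copy is attempted on the GPU replica.
        with tf.device('/CPU:0'):
            tf.print(tf.ragged.constant([[1], [1, 1]]))
        return inputs
```

Calling tf.config.set_soft_device_placement(True) before creating the strategy may achieve a similar CPU fallback, though I have not verified it under MirroredStrategy.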
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 16 (6 by maintainers)
Issue is reproducible in TF 2.7.0 with GPU. Here's the gist