tensorflow: Input_signature of a tf.function decorator crashes when using multiple GPUs with MirroredStrategy

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS Linux 7.6.1810
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: /
TensorFlow installed from (source or binary): pip binary
TensorFlow version (use command below): tensorflow-gpu 2.0.0-beta0
Python version: 3.6.8
Bazel version (if compiling from source): /
GCC/Compiler version (if compiling from source): /
CUDA/cuDNN version: 10.0.130 / 7.6.0
GPU model and memory: Tesla P100-SXM2-16GB

Describe the current behavior

TensorFlow crashes while checking the input_signature of a tf.function decorator when multiple GPUs are used in a MirroredStrategy. A ValueError is raised because a PerReplica object cannot be converted to a Tensor (see the log below). The minimal code needed to reproduce the error is given below.

The code runs fine when I use only one GPU, i.e. strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0"]). The error also disappears if the optional input_signature argument is dropped (using only @tf.function()), even with multiple GPUs. Hence it is the specific combination of input_signature and multiple GPUs that causes the problem, and I need both for performance reasons in my work.

Describe the expected behavior

The code below should run without errors.

Code to reproduce the issue

import tensorflow as tf
import numpy as np
        
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
    
with strategy.scope():
    dataset = tf.data.Dataset.from_tensor_slices(np.ones([100, 12]).astype(np.float32))
    dataset = dataset.batch(4)
    dataset = strategy.experimental_distribute_dataset(dataset)
    
    def compute(input_data):
        return tf.reduce_sum(input_data, [1])
    
    @tf.function(input_signature = (tf.TensorSpec([None, 12], tf.float32),))
    def distributed_run(input_data):
        return strategy.experimental_run_v2(compute, args = (input_data,))

    for x in dataset:
        output = distributed_run(x)
        print(output)

Other info / logs

Traceback (most recent call last):
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1216, in _convert_inputs_to_signature
    value, dtype_hint=spec.dtype)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1100, in convert_to_tensor
    return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1158, in convert_to_tensor_v2
    as_ref=False)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1237, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 305, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 246, in constant
    allow_broadcast=True)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 254, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 115, in convert_to_eager_tensor
    return ops.EagerTensor(value, handle, device, dtype)
ValueError: Attempt to convert a value (PerReplica:{
  0 /job:localhost/replica:0/task:0/device:GPU:0: <tf.Tensor: id=107, shape=(2, 12), dtype=float32, numpy=
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)>,
  1 /job:localhost/replica:0/task:0/device:GPU:1: <tf.Tensor: id=108, shape=(2, 12), dtype=float32, numpy=
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)>
}) with an unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "issue2.py", line 19, in <module>
    output = distributed_run(x)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 432, in __call__
    *args, **kwds)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1169, in canonicalize_function_inputs
    self._flat_input_signature)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1222, in _convert_inputs_to_signature
    (str(inputs), str(input_signature)))
ValueError: When input_signature is provided, all inputs to the Python function must be convertible to tensors.Inputs ((PerReplica:{
  0 /job:localhost/replica:0/task:0/device:GPU:0: <tf.Tensor: id=107, shape=(2, 12), dtype=float32, numpy=
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)>,
  1 /job:localhost/replica:0/task:0/device:GPU:1: <tf.Tensor: id=108, shape=(2, 12), dtype=float32, numpy=
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)>
},)), input_signature((TensorSpec(shape=(None, 12), dtype=tf.float32, name=None),)).

About this issue

State: closed
Created 5 years ago
Comments: 20 (6 by maintainers)

Most upvoted comments

This issue has now been fixed. You can use the element_spec property on the dataset or iterator to specify the tf.TypeSpec. For example,

# For the `experimental_distribute_dataset` API
dataset = tf.data.Dataset(...)
dist_dataset = strategy.experimental_distribute_dataset(dataset)
# Use the `element_spec` of the distributed dataset
# (`dist_inputs` is a placeholder argument name)
@tf.function(input_signature=[dist_dataset.element_spec])
def train_step(dist_inputs):
  ...

# Use the `element_spec` of the distributed iterator
iterator = iter(dist_dataset)
@tf.function(input_signature=[iterator.element_spec])
def train_step(dist_inputs):
  ...

# For the `experimental_distribute_datasets_from_function` API
# (it takes a function that builds the dataset, not a dataset)
# Use the `element_spec` of the distributed iterator
def dataset_fn(input_context):
  return tf.data.Dataset(...)
dist_dataset = strategy.experimental_distribute_datasets_from_function(dataset_fn)
iterator = iter(dist_dataset)
@tf.function(input_signature=[iterator.element_spec])
def train_step(dist_inputs):
  ...

Please reopen if you run into issues.

anj-s on Dec 27, 2019
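
Applied to the reproduction script from the report, the suggestion above might look like the sketch below. It is not taken from the thread and rests on a few assumptions: a TensorFlow release that includes the fix (2.2 or later is assumed here), strategy.run in place of experimental_run_v2 (the name used by later releases), and the default device selection of MirroredStrategy rather than the explicit two-GPU list. The outputs list and the use of experimental_local_results are added only to collect the per-replica results for inspection.

```python
import numpy as np
import tensorflow as tf

# Assumption: default device selection (all visible GPUs, or the CPU if none),
# instead of the explicit devices=["/gpu:0", "/gpu:1"] from the report.
strategy = tf.distribute.MirroredStrategy()

dataset = tf.data.Dataset.from_tensor_slices(np.ones([100, 12], dtype=np.float32))
dataset = dataset.batch(4)
dist_dataset = strategy.experimental_distribute_dataset(dataset)


def compute(input_data):
    # Sum each row of the per-replica batch.
    return tf.reduce_sum(input_data, axis=1)


# Take the signature from the distributed dataset rather than writing a
# TensorSpec by hand, so that it matches the values the dataset yields.
@tf.function(input_signature=[dist_dataset.element_spec])
def distributed_run(input_data):
    # Assumption: strategy.run is the later name of experimental_run_v2.
    return strategy.run(compute, args=(input_data,))


outputs = []
for x in dist_dataset:
    per_replica_result = distributed_run(x)
    # Unpack the result into one tensor per local replica.
    outputs.extend(strategy.experimental_local_results(per_replica_result))

print(len(outputs), "per-replica result tensors collected")
```

Printing dist_dataset.element_spec before defining the function is a quick way to check what the signature contains on a given machine.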