tensorflow: Input_signature of a tf.function decorator crashes when using multiple GPUs with MirroredStrategy
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS Linux 7.6.1810
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: /
- TensorFlow installed from (source or binary): pip binary
- TensorFlow version (use command below): tensorflow-gpu 2.0.0-beta0
- Python version: 3.6.8
- Bazel version (if compiling from source): /
- GCC/Compiler version (if compiling from source): /
- CUDA/cuDNN version: 10.0.130 / 7.6.0
- GPU model and memory: Tesla P100-SXM2-16GB
Describe the current behavior
TensorFlow crashes while checking the input_signature of a tf.function decorator when multiple GPUs are used with MirroredStrategy. A ValueError is raised because a PerReplica object cannot be converted to a Tensor (see the log below). The minimal code needed to reproduce the error is given below. The code runs fine when I use only one GPU (strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0"])). The error also disappears, still with multiple GPUs, if the optional input_signature argument is dropped (using only @tf.function()). Hence the specific combination of input_signature and multiple GPUs causes the problem, and I need that combination for performance reasons in my work.
Describe the expected behavior
The code below should run without errors.
Code to reproduce the issue
import tensorflow as tf
import numpy as np
        
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
    
with strategy.scope():
    dataset = tf.data.Dataset.from_tensor_slices(np.ones([100, 12]).astype(np.float32))
    dataset = dataset.batch(4)
    dataset = strategy.experimental_distribute_dataset(dataset)
    
    def compute(input_data):
        return tf.reduce_sum(input_data, [1])
    
    @tf.function(input_signature = (tf.TensorSpec([None, 12], tf.float32),))
    def distributed_run(input_data):
        return strategy.experimental_run_v2(compute, args = (input_data,))
    for x in dataset:
        output = distributed_run(x)
        print(output)
Other info / logs
Traceback (most recent call last):
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1216, in _convert_inputs_to_signature
    value, dtype_hint=spec.dtype)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1100, in convert_to_tensor
    return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1158, in convert_to_tensor_v2
    as_ref=False)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1237, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 305, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 246, in constant
    allow_broadcast=True)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 254, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 115, in convert_to_eager_tensor
    return ops.EagerTensor(value, handle, device, dtype)
ValueError: Attempt to convert a value (PerReplica:{
  0 /job:localhost/replica:0/task:0/device:GPU:0: <tf.Tensor: id=107, shape=(2, 12), dtype=float32, numpy=
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)>,
  1 /job:localhost/replica:0/task:0/device:GPU:1: <tf.Tensor: id=108, shape=(2, 12), dtype=float32, numpy=
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)>
}) with an unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "issue2.py", line 19, in <module>
    output = distributed_run(x)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 432, in __call__
    *args, **kwds)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1169, in canonicalize_function_inputs
    self._flat_input_signature)
  File "/data/gent/gvo000/gvo00003/vsc41939/GENIUS/miniconda3/envs/tensorflow2/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1222, in _convert_inputs_to_signature
    (str(inputs), str(input_signature)))
ValueError: When input_signature is provided, all inputs to the Python function must be convertible to tensors.Inputs ((PerReplica:{
  0 /job:localhost/replica:0/task:0/device:GPU:0: <tf.Tensor: id=107, shape=(2, 12), dtype=float32, numpy=
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)>,
  1 /job:localhost/replica:0/task:0/device:GPU:1: <tf.Tensor: id=108, shape=(2, 12), dtype=float32, numpy=
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)>
},)), input_signature((TensorSpec(shape=(None, 12), dtype=tf.float32, name=None),)).
About this issue
- State: closed
- Created 5 years ago
- Comments: 20 (6 by maintainers)
This issue has now been fixed. You can use the element_spec property on the dataset or iterator to specify the tf.TypeSpec. Please reopen if you run into issues.
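As a minimal sketch of the element_spec approach (not the maintainer's original example, which is not reproduced here), the reproduction script could be adapted as follows. It assumes a TensorFlow release in which the distributed dataset exposes element_spec and in which strategy.run is available as the later name for experimental_run_v2. The variable names dist_dataset and outputs are illustrative only.

```python
import numpy as np
import tensorflow as tf

# Assumption: uses every visible GPU, or a single CPU replica if none is found.
strategy = tf.distribute.MirroredStrategy()

dataset = tf.data.Dataset.from_tensor_slices(np.ones([100, 12], dtype=np.float32))
dataset = dataset.batch(4)
dist_dataset = strategy.experimental_distribute_dataset(dataset)


def compute(input_data):
    # Per-replica computation: sum each row of the local batch.
    return tf.reduce_sum(input_data, axis=1)


# Take the signature from the distributed dataset itself instead of writing a
# plain tf.TensorSpec by hand, so it matches whatever the dataset yields
# (a per-replica structure when there are several replicas).
@tf.function(input_signature=[dist_dataset.element_spec])
def distributed_run(input_data):
    return strategy.run(compute, args=(input_data,))


outputs = []
for x in dist_dataset:
    per_replica_result = distributed_run(x)
    # Unpack the result into a tuple with one tensor per local replica.
    outputs.append(strategy.experimental_local_results(per_replica_result))

print(dist_dataset.element_spec)
print(len(outputs), "batches processed")
```

Because the signature is derived from the dataset, the same script should run unchanged on one device or several. Printing dist_dataset.element_spec shows which spec is in use on a given machine.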