tensorflow: Use of Keras `jit_compile` in a distribution strategy causes a `std::system_error`
Issue Type
Bug
Source
binary
Tensorflow Version
tf 2.9.1
Custom Code
No
OS Platform and Distribution
No response
Mobile device
No response
Python version
3.10
Bazel version
No response
GCC/Compiler version
No response
CUDA/cuDNN version
11.2/8.1.1.33
GPU model and memory
No response
Current Behaviour?
After a number of training steps / epochs, training aborts with the following error:
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
I am able to reproduce this error in Colab with the sample code below.
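For context on the message itself: "Resource temporarily unavailable" is the standard strerror text for EAGAIN, which std::thread surfaces as a std::system_error when the process cannot spawn another thread, for example when the RLIMIT_NPROC ulimit is exhausted. A small stdlib-only sketch for inspecting that limit (a diagnostic aid, not part of the original report; Linux/macOS only):

```python
import resource


def thread_limit_report() -> str:
    """Return a human-readable summary of the process/thread ulimit.

    On Linux, RLIMIT_NPROC also caps the number of threads a user may
    create, so exhausting it makes new thread creation fail with EAGAIN
    ("Resource temporarily unavailable").
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)

    def fmt(v: int) -> str:
        return "unlimited" if v == resource.RLIM_INFINITY else str(v)

    return f"RLIMIT_NPROC soft={fmt(soft)} hard={fmt(hard)}"


print(thread_limit_report())
```

If the soft limit is low, comparing it against the thread count of the training process (e.g. via `ls /proc/<pid>/task | wc -l`) can help confirm whether XLA's extra compilation threads are pushing the process over the limit.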
Standalone code to reproduce the issue
import keras
import tensorflow as tf


def build_model_() -> keras.Model:
    input = tf.keras.layers.Input(shape=(5,), name='input_a')
    x = tf.keras.layers.Dense(512, activation='relu')(input)
    x = tf.keras.layers.Dense(512, activation='relu')(x)
    output = tf.keras.layers.Dense(1, name='output')(x)
    model = tf.keras.models.Model(inputs=input, outputs=output)
    return model


strategy = tf.distribute.MirroredStrategy()
print(f"Can see {strategy.num_replicas_in_sync} gpus")

with strategy.scope():
    model = build_model_()
    model.compile(loss='mse', jit_compile=True)

BATCH_SIZE_PER_REPLICA = 1024
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA

dataset = tf.data.Dataset.from_tensors(
    (tf.ones(5), 1)
).repeat(10_000_000).batch(GLOBAL_BATCH_SIZE).with_options(options)

history = model.fit(
    x=dataset,
    epochs=7,
    verbose=1,
)
Relevant log output
No response
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 17 (5 by maintainers)