tensorflow: Use of Keras `jit_compile` in a distribution strategy causes a `std::system_error`


Issue Type

Bug

Source

binary

Tensorflow Version

tf 2.9.1

Custom Code

No

OS Platform and Distribution

No response

Mobile device

No response

Python version

3.10

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

11.2/8.1.1.33

GPU model and memory

No response

Current Behaviour?

The following error is thrown during training after a number of steps or epochs:

terminate called after throwing an instance of 'std::system_error'
what():  Resource temporarily unavailable

I can reproduce this error in Colab with the sample code below.

Standalone code to reproduce the issue

import tensorflow as tf


def build_model_() -> tf.keras.Model:
    # Small fully connected regression model: 5 inputs -> 512 -> 512 -> 1.
    inputs = tf.keras.layers.Input(shape=(5,), name='input_a')
    x = tf.keras.layers.Dense(512, activation='relu')(inputs)
    x = tf.keras.layers.Dense(512, activation='relu')(x)
    output = tf.keras.layers.Dense(1, name='output')(x)
    return tf.keras.models.Model(inputs=inputs, outputs=output)


strategy = tf.distribute.MirroredStrategy()
print(f"Can see {strategy.num_replicas_in_sync} gpus")
with strategy.scope():
    model = build_model_()
    # jit_compile=True enables XLA; the crash is reported with this setting.
    model.compile(loss='mse', jit_compile=True)

BATCH_SIZE_PER_REPLICA = 1024
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

# Shard the input by data (elements) rather than by file across replicas.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA

# Synthetic dataset: the same (features, label) pair repeated 10 million times.
dataset = tf.data.Dataset.from_tensors(
    (tf.ones(5), 1)
).repeat(10_000_000).batch(GLOBAL_BATCH_SIZE).with_options(options)

history = model.fit(
    x=dataset,
    epochs=7,
    verbose=1,
)

Relevant log output

No response

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 17 (5 by maintainers)

Most upvoted comments

This worked for me:

echo 409659100 | sudo tee -a /proc/sys/kernel/threads-max
echo 409659100 | sudo tee -a /proc/sys/vm/max_map_count
echo 10965910 | sudo tee -a /proc/sys/kernel/threads-max
# Also add DefaultTasksMax=100000 to /etc/systemd/system.conf
# Also add UserTasksMax=500000 to /etc/systemd/logind.conf
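
Before raising the limits above, it can help to see what they are currently set to. The workaround suggests (though the issue does not confirm) that the "Resource temporarily unavailable" error comes from the process hitting a thread or memory-map limit. The following is a minimal sketch, assuming a Linux host; the helper names `read_proc_int`, `current_thread_count`, and `report_limits` are hypothetical and not part of TensorFlow.

```python
import os
import resource
from pathlib import Path

# Kernel-wide limits that the commands above modify.
PROC_LIMITS = {
    "threads-max": "/proc/sys/kernel/threads-max",
    "max_map_count": "/proc/sys/vm/max_map_count",
}


def read_proc_int(path):
    """Return the integer stored in a /proc file, or None if it is unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def current_thread_count():
    """Count native threads of this process via /proc/self/task (Linux only)."""
    try:
        return len(os.listdir("/proc/self/task"))
    except OSError:
        return None


def report_limits():
    """Collect the kernel limits, the per-user process limit, and thread usage."""
    report = {name: read_proc_int(path) for name, path in PROC_LIMITS.items()}
    # RLIMIT_NPROC caps processes/threads per user; -1 means unlimited.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    report["rlimit_nproc_soft"] = soft
    report["rlimit_nproc_hard"] = hard
    report["threads_in_this_process"] = current_thread_count()
    return report


if __name__ == "__main__":
    for name, value in report_limits().items():
        print(f"{name}: {value}")
```

Running this in the same environment as the training job, before and after applying the workaround, shows whether the new values took effect. Calling `current_thread_count()` periodically from the training process would show whether the thread count keeps growing across steps.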