tensorflow: Keras.fit stuck/error in TensorFlow 2.13/2.14 (TPU is fine, inference on GPU is fine, 2.11 GPU is fine)

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

tf 2.14

Custom code

Yes

OS platform and distribution

Ubuntu 22.04

Mobile device

No response

Python version

3.9

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

Titan RTX * 2 or 4090 * 4

Current behavior?

Below is the observed behavior in the different environments:

  1. TPU + TensorFlow 2.10: OK
  2. TPU + TensorFlow 2.11: OK
  3. TPU + TensorFlow 2.14: OK
  4. GPU (Titan RTX * 2) + TensorFlow 2.10 (conda): OK
  5. GPU (Titan RTX * 2) + TensorFlow 2.11 (conda): OK
  6. GPU (Titan RTX * 2) + TensorFlow 2.11 (docker): OK
  7. GPU (Titan RTX * 2) + TensorFlow 2.12 (docker): OK
  8. GPU (Titan RTX * 2) + TensorFlow 2.13 (conda): OK
  9. GPU (Titan RTX * 2) + TensorFlow 2.13 (docker): Prediction OK. keras.fit hangs after logging "Loaded cuDNN version 8600"
  10. GPU (Titan RTX * 2) + TensorFlow 2.14 (pip): Prediction OK. keras.fit prints "Start cannot spawn child process: No such file or directory" after "Loaded cuDNN version 8600" and then hangs
  11. GPU (Titan RTX * 2) + TensorFlow 2.14 (docker): Prediction OK. keras.fit hangs after logging "Loaded cuDNN version 8600"
  12. GPU (4090 * 4) + TensorFlow 2.11 (conda): OK
  13. GPU (4090 * 4) + TensorFlow 2.14 (docker): Prediction OK. keras.fit hangs after logging "Loaded cuDNN version 8600"
  14. GPU (4090 * 4) + TensorFlow 2.14 (pip): Prediction OK. keras.fit prints "Start cannot spawn child process: No such file or directory" after "Loaded cuDNN version 8600" and then hangs

I am currently testing more environments to see whether I can narrow down the range of problematic TensorFlow commits.

Standalone code to reproduce the issue

See the "Current behavior?" section above; a minimal sketch of the training setup is included below.
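
The original repro code was not posted, so the following is only a minimal sketch of the kind of multi-GPU training setup that shows the behavior. The model, data shapes, and batch size are placeholders I made up, not the reporter's code:

import numpy as np
import tensorflow as tf

# Any Keras model trained under MirroredStrategy on a multi-GPU host
# should exercise the same code path (NCCL-based all-reduce).
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

x = np.random.rand(1024, 32).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

# Prediction works, but fit hangs right after "Loaded cuDNN version 8600"
# is logged on the affected TF 2.13/2.14 GPU builds.
model.predict(x, batch_size=64)
model.fit(x, y, batch_size=64, epochs=1)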

Relevant log output

No response

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 22 (15 by maintainers)

Most upvoted comments

@SuryanarayanaY Update: After I use

tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.RING
)

to replace

tf.distribute.MirroredStrategy

training on GPU with 2.13 and 2.14 is no longer stuck.

It seems some commits introduced issues in NCCL (and possibly XLA as well).
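
For completeness, here is how that workaround plugs into a training setup; the model and data below are the same placeholder toy setup as in the sketch above, and only the strategy construction changes. RING collectives avoid the NCCL all-reduce path that MirroredStrategy uses by default on multi-GPU hosts:

import numpy as np
import tensorflow as tf

# Workaround reported above: collective communication over RING
# instead of the default NCCL path.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.RING
)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

x = np.random.rand(1024, 32).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

# With RING collectives, fit reportedly no longer hangs on 2.13/2.14.
model.fit(x, y, batch_size=64, epochs=1)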