tensorflow: Keras.fit stuck/error in TensorFlow 2.13/2.14 (TPU is fine, inference on GPU is fine, 2.11 GPU is fine)

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

tf 2.14

Custom code

Yes

OS platform and distribution

Ubuntu 22.04

Mobile device

No response

Python version

3.9

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

Titan RTX * 2 or 4090 * 4

Current behavior?

Below is the observed behavior in the different environments:

  1. TPU + TensorFlow 2.10: OK
  2. TPU + TensorFlow 2.11: OK
  3. TPU + TensorFlow 2.14: OK
  4. GPU (Titan RTX * 2) + TensorFlow 2.10 (conda): OK
  5. GPU (Titan RTX * 2) + TensorFlow 2.11 (conda): OK
  6. GPU (Titan RTX * 2) + TensorFlow 2.11 (docker): OK
  7. GPU (Titan RTX * 2) + TensorFlow 2.12 (docker): OK
  8. GPU (Titan RTX * 2) + TensorFlow 2.13 (conda): OK
  9. GPU (Titan RTX * 2) + TensorFlow 2.13 (docker): Prediction OK. keras.fit hangs after logging "Loaded cuDNN version 8600"
  10. GPU (Titan RTX * 2) + TensorFlow 2.14 (pip): Prediction OK. keras.fit prints "Start cannot spawn child process: No such file or directory" after "Loaded cuDNN version 8600" and then hangs
  11. GPU (Titan RTX * 2) + TensorFlow 2.14 (docker): Prediction OK. keras.fit hangs after logging "Loaded cuDNN version 8600"
  12. GPU (4090 * 4) + TensorFlow 2.11 (conda): OK
  13. GPU (4090 * 4) + TensorFlow 2.14 (docker): Prediction OK. keras.fit hangs after logging "Loaded cuDNN version 8600"
  14. GPU (4090 * 4) + TensorFlow 2.14 (pip): Prediction OK. keras.fit prints "Start cannot spawn child process: No such file or directory" after "Loaded cuDNN version 8600" and then hangs

I am currently testing more environments to see whether I can narrow down the range of problematic TensorFlow commits.

Standalone code to reproduce the issue

See the "Current behavior?" section above; a minimal sketch of the training setup is included below.
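
The original repro code was not posted, so the following is only a minimal sketch of the kind of multi-GPU training setup that shows the behavior. The model, data shapes, and batch size are placeholders I made up, not the reporter's code:

import numpy as np
import tensorflow as tf

# Any Keras model trained under MirroredStrategy on a multi-GPU host
# should exercise the same code path (NCCL-based all-reduce).
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

x = np.random.rand(1024, 32).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

# Prediction works, but fit hangs right after "Loaded cuDNN version 8600"
# is logged on the affected TF 2.13/2.14 GPU builds.
model.predict(x, batch_size=64)
model.fit(x, y, batch_size=64, epochs=1)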

Relevant log output

No response

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 22 (15 by maintainers)

Most upvoted comments

@SuryanarayanaY Update: After I use

tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.RING
)

to replace

tf.distribute.MirroredStrategy

training on GPU with 2.13 and 2.14 is no longer stuck.

It seems some commits introduced issues in NCCL (and possibly XLA as well).
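
For completeness, here is how that workaround plugs into a training setup; the model and data below are the same placeholder toy setup as in the sketch above, and only the strategy construction changes. RING collectives avoid the NCCL all-reduce path that MirroredStrategy uses by default on multi-GPU hosts:

import numpy as np
import tensorflow as tf

# Workaround reported above: collective communication over RING
# instead of the default NCCL path.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.RING
)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

x = np.random.rand(1024, 32).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

# With RING collectives, fit reportedly no longer hangs on 2.13/2.14.
model.fit(x, y, batch_size=64, epochs=1)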