tensorflow: Keras.fit stuck/error in TensorFlow 2.13/2.14 (TPU is fine, inference on GPU is fine, 2.11 GPU is fine)
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
source
TensorFlow version
tf 2.14
Custom code
Yes
OS platform and distribution
Ubuntu 22.04
Mobile device
No response
Python version
3.9
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
Titan RTX * 2 or 4090 * 4
Current behavior?
Below are the results in different environments:
- TPU + TensorFlow 2.10: OK
- TPU + TensorFlow 2.11: OK
- TPU + TensorFlow 2.14: OK
- GPU (Titan RTX * 2) + TensorFlow 2.10 (conda): OK
- GPU (Titan RTX * 2) + TensorFlow 2.11 (conda): OK
- GPU (Titan RTX * 2) + TensorFlow 2.11 (docker): OK
- GPU (Titan RTX * 2) + TensorFlow 2.12 (docker): OK
- GPU (Titan RTX * 2) + TensorFlow 2.13 (conda): OK
- GPU (Titan RTX * 2) + TensorFlow 2.13 (docker): Prediction OK. keras.fit stuck at "Loaded cuDNN version 8600"
- GPU (Titan RTX * 2) + TensorFlow 2.14 (pip): Prediction OK. keras.fit stuck at "Start cannot spawn child process: No such file or directory" after "Loaded cuDNN version 8600"
- GPU (Titan RTX * 2) + TensorFlow 2.14 (docker): Prediction OK. keras.fit stuck at "Loaded cuDNN version 8600"
- GPU (4090 * 4) + TensorFlow 2.11 (conda): OK
- GPU (4090 * 4) + TensorFlow 2.14 (docker): Prediction OK. keras.fit stuck at "Loaded cuDNN version 8600"
- GPU (4090 * 4) + TensorFlow 2.14 (pip): Prediction OK. keras.fit stuck at "Start cannot spawn child process: No such file or directory" after "Loaded cuDNN version 8600"
I am currently testing more environments to see whether I can narrow down the search space of problematic TensorFlow commits.
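A rough sketch of how that narrowing could be automated, in case it helps others reproduce. The helper names `hangs` and `first_bad` and the build labels are hypothetical and not part of the report. The search assumes the ordered list of builds has a single good-to-bad transition. That may not hold here, since 2.13 behaves differently under conda and docker in the list above.

```python
import subprocess
from typing import Callable, Sequence


def hangs(cmd: Sequence[str], timeout_s: float) -> bool:
    """Return True if `cmd` does not finish within `timeout_s` seconds.

    subprocess.run kills the child process when the timeout expires,
    so a stuck training run does not linger between probes.
    """
    try:
        subprocess.run(
            cmd,
            timeout=timeout_s,
            check=False,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return True
    return False


def first_bad(builds: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search an ordered list of builds for the first bad one.

    Assumes builds[0] is known good, builds[-1] is known bad, and every
    build after the first bad one is also bad.
    """
    lo, hi = 0, len(builds) - 1  # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid
        else:
            lo = mid
    return builds[hi]
```

In practice `is_bad` would install the given build in a fresh environment and call `hangs` on a short training script with a generous timeout. Both the script and the install step are left out because they depend on the setup.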
Standalone code to reproduce the issue
See behavior
Relevant log output
No response
About this issue
- State: closed
- Created 8 months ago
- Comments: 22 (15 by maintainers)
@SuryanarayanaY Update: After I use
to replace
2.13 and 2.14 training on GPU is no longer stuck.
It seems some commits caused issues in NCCL (maybe also XLA).
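For anyone who wants to test the NCCL hypothesis: the exact snippets from the update above are not preserved in this copy, so the following is not necessarily the change that was made. It is a minimal sketch, assuming training runs under `tf.distribute.MirroredStrategy`, of switching the all-reduce implementation away from NCCL. `cross_device_ops_name` and `build_strategy` are hypothetical helper names.

```python
def cross_device_ops_name(avoid_nccl: bool) -> str:
    """Name of the tf.distribute cross-device-ops class to use."""
    return "HierarchicalCopyAllReduce" if avoid_nccl else "NcclAllReduce"


def build_strategy(avoid_nccl: bool = True):
    """Build a MirroredStrategy with an explicit all-reduce implementation.

    Assumption: the hang is related to the NCCL all-reduce, so a non-NCCL
    implementation is selected when avoid_nccl is True.
    """
    import tensorflow as tf  # imported lazily; requires TensorFlow installed

    ops_cls = getattr(tf.distribute, cross_device_ops_name(avoid_nccl))
    return tf.distribute.MirroredStrategy(cross_device_ops=ops_cls())
```

The model would then be built and compiled inside `with build_strategy().scope():` before calling `fit`. If training proceeds with `avoid_nccl=True` but hangs with `avoid_nccl=False`, that would point toward NCCL rather than XLA.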