tensorflow: UnavailableError: Socket closed while using custom training loop
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- TensorFlow version: 2.1.0
About hardware and software system information: I’m using Kaggle kernels, so more details can be found here: https://github.com/Kaggle/docker-python
Describe the current behavior: I’m getting "UnavailableError: Socket closed" errors during the training loop.

Describe the expected behavior: The model was supposed to train normally.
Standalone code to reproduce the issue: Link for the Kaggle kernel: https://www.kaggle.com/dimitreoliveira/bug-report-unavailableerror-socket-closed
Other info / logs: As described in the notebook linked above, this usually happens with some combination of the following:
- Long epochs
- Heavy models
- Some loops using too much memory
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (3 by maintainers)
Probably, but also probably not your fault. This was working in TF 2.1, so it should still work in TF 2.2.