tensorflow: UnavailableError: Socket closed while using custom training loop

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • TensorFlow version: 2.1.0

Hardware and software: I’m using Kaggle kernels, so details of the environment can be found here: https://github.com/Kaggle/docker-python

Describe the current behavior

I’m getting errors like the one below during the training loop.

[Image: Screenshot from 2020-03-21 09-41-52]

Describe the expected behavior

The model should train normally.

Standalone code to reproduce the issue

Link to the Kaggle kernel: https://www.kaggle.com/dimitreoliveira/bug-report-unavailableerror-socket-closed

Other info / logs

As described in the notebook linked above, this usually happens with some combination of the following:

  • Long epochs
  • Heavy models
  • Some loops using too much memory
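
Since long epochs are one of the triggers, one possible way to limit lost work is to run the training loop in shorter chunks and retry a chunk that fails. The sketch below is only an illustration under assumptions, not a fix confirmed in this issue. The names `run_with_resume`, `train_chunk` and `TransientError` are hypothetical. With TensorFlow, the exception to catch would be `tf.errors.UnavailableError`, and a real loop would also need to save and restore a checkpoint, and possibly reconnect to the accelerator, before retrying.

```python
# Hypothetical stand-in for the error reported in this issue. With TensorFlow
# installed, the class to catch would be tf.errors.UnavailableError.
class TransientError(Exception):
    pass


def run_with_resume(train_chunk, total_steps, chunk_size, max_retries=3):
    """Run `total_steps` steps in chunks, retrying a chunk that fails.

    `train_chunk(start, stop)` is a hypothetical callable that runs the
    training steps in [start, stop) and may raise TransientError.
    Returns the number of retries that were needed.
    """
    done = 0      # steps completed so far; stands in for a saved checkpoint
    retries = 0
    while done < total_steps:
        stop = min(done + chunk_size, total_steps)
        try:
            train_chunk(done, stop)
        except TransientError:
            retries += 1
            if retries > max_retries:
                raise  # give up after too many failures
            continue   # retry the same chunk from the last completed step
        done = stop
    return retries


# Usage: a fake chunk runner that fails once, on its second call.
calls = []

def flaky_chunk(start, stop):
    calls.append((start, stop))
    if len(calls) == 2:
        raise TransientError("Socket closed")

retries = run_with_resume(flaky_chunk, total_steps=10, chunk_size=4)
```

Smaller chunks mean less work is repeated after a failure, at the cost of more frequent checkpointing in a real training loop.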

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (3 by maintainers)

Most upvoted comments

Probably, but it is also probably not your fault. This was working in TF 2.1, so it should still work in TF 2.2.

On Mon, 6 Apr 2020 at 15:33, Dimitre Oliveira notifications@github.com wrote:

@martin-gorner do you think this might be because of the custom augmentation functions where I use if and else statements?


Martin Görner | ML Product Manager, TPU