tensorflow: hangs on model.fit

Although I can’t extract the code to reproduce the problem, I think that documenting it here will help improve this project and anyone who encounters the same problem.

System: Windows 10
Version: tf-nightly-gpu 2020.1.19

I use tf.data.Dataset to provide samples. With GPU + eager mode + batch_size > 16, training hangs in model.fit and the process keeps one CPU core busy. With the CPU, with eager mode turned off, or with batch_size <= 16, it runs normally.

It stays stuck at this point:

2020-01-21 01:20:24.249995: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-01-21 01:20:28.403903: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-01-21 01:20:29.708097: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Internal: Invoking GPU asm compilation is supported on Cuda non-Windows platforms only
Relying on driver to perform ptx compilation. Modify $PATH to customize ptxas location. This message will be only logged once.
9/1406 […] - ETA: 42:30 - loss: 30.3274 - accuracy: 0.0000e+00
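Since the hang depends on the combination of device, eager mode, and batch size, one way to narrow it down is to run each combination in a child process with a timeout, so a hang is reported instead of blocking forever. The following is only a rough sketch under stated assumptions: `train.py`, `REPRO_EAGER`, and `REPRO_BATCH_SIZE` are hypothetical names (the training script would have to read those variables itself), and the 600-second timeout is arbitrary. `CUDA_VISIBLE_DEVICES=-1` is the usual way to hide the GPU from TensorFlow.

```python
import itertools
import os
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, extra_env=None):
    """Run cmd in a child process; return "ok", "failed", or "hang"."""
    env = dict(os.environ)
    env.update(extra_env or {})
    try:
        # subprocess.run kills the child process if the timeout expires.
        result = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if result.returncode == 0 else "failed"


def configurations(batch_sizes=(16, 32)):
    """Yield ((device, eager, batch_size), env) for every combination."""
    for device, eager, batch in itertools.product(
        ("gpu", "cpu"), (True, False), batch_sizes
    ):
        env = {
            # Hypothetical variables: the training script must read them itself.
            "REPRO_EAGER": "1" if eager else "0",
            "REPRO_BATCH_SIZE": str(batch),
        }
        if device == "cpu":
            # Hiding all CUDA devices makes TensorFlow fall back to the CPU.
            env["CUDA_VISIBLE_DEVICES"] = "-1"
        yield (device, eager, batch), env


# Example loop; "train.py" is a placeholder for your own training script.
# for config, env in configurations():
#     print(config, run_with_timeout([sys.executable, "train.py"], 600, env))
```

A table of the results (which combinations come back as "hang") makes it easier to report the problem even when the original code cannot be shared.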

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 9
  • Comments: 34 (6 by maintainers)

Most upvoted comments

Just listing the steps that helped me fix this issue.

I faced the same problem while using TF 2.1 on a Windows laptop. The same code works in a Kaggle kernel and on Colab but hangs on my laptop, so I followed the steps below to isolate the issue.

  1. Upgraded to TensorFlow 2.2: didn’t fix it.
  2. Created a new virtual environment and installed all necessary libraries: didn’t fix it.

After the two steps above, I believed it had to do with my GPU drivers or CUDA version. It’s annoying that there was no error message or anything: model.fit does not progress, apart from some CPU usage by the Python process.

  3. So I updated my CUDA Toolkit, cuDNN, and my graphics card driver (just to be on the safe side): problem fixed. This seems in line with @liuxingbaoyu’s experience above.
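Before and after updating, it can help to record which versions are actually in use. The following is a minimal sketch, not a definitive check: it assumes `nvidia-smi` is on the PATH for the driver version, and it uses `tf.sysconfig.get_build_info()`, which as far as I know is only available from TF 2.3 onward (on older versions such as the 2.1/2.2 mentioned above, the CUDA and cuDNN fields simply stay empty). The helper names are made up for illustration.

```python
import shutil
import subprocess


def parse_driver_lines(text):
    """Turn "name, driver_version" CSV lines into a list of (name, version)."""
    rows = []
    for line in text.strip().splitlines():
        name, _, version = line.rpartition(",")
        rows.append((name.strip(), version.strip()))
    return rows


def collect_versions():
    """Gather TensorFlow, CUDA, cuDNN, and driver versions where available."""
    info = {"tensorflow": None, "cuda": None, "cudnn": None, "gpus": [], "driver": []}
    try:
        import tensorflow as tf
    except ImportError:
        tf = None
    if tf is not None:
        info["tensorflow"] = tf.__version__
        info["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
        # get_build_info is assumed to be missing on older TF releases.
        get_build_info = getattr(tf.sysconfig, "get_build_info", None)
        if get_build_info is not None:
            build = get_build_info()
            info["cuda"] = build.get("cuda_version")
            info["cudnn"] = build.get("cudnn_version")
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            info["driver"] = parse_driver_lines(result.stdout)
    return info


# Example: print(collect_versions())
```

Note that the CUDA and cuDNN values reported here are the ones TensorFlow was built against, which is not necessarily what is installed on the machine.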

Cool, I solved it by updating the graphics card driver. Thanks!

@liuxingbaoyu I found the solution to my problem at https://github.com/tensorflow/tensorflow/issues/37216#issue-573656418

This may help with this issue: I found that my GPU and SSD were getting too hot (somewhere around 80), so I decided to switch the system from air cooling to water cooling and to replace the thermal paste.

Previously, on that issue, I reported that changing the OS on the same hardware helped. However, in order to add another SSD, I had also moved my PC to another place that was much cooler than where it had been.

I recommend using any tool you are comfortable with to check the temperature of your system components.
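For NVIDIA GPUs, one option is to poll `nvidia-smi`. The sketch below assumes `nvidia-smi` is installed and on the PATH; it only covers the GPU (not the SSD or CPU), and the function names are made up for illustration. `temperature.gpu` is reported in degrees Celsius.

```python
import shutil
import subprocess


def parse_temperatures(text):
    """Parse one integer temperature (degrees Celsius) per output line."""
    # Lines such as "[N/A]" are skipped rather than treated as errors.
    return [int(line.strip()) for line in text.splitlines() if line.strip().isdigit()]


def gpu_temperatures():
    """Return a list of GPU temperatures, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return parse_temperatures(result.stdout)


# Example: call gpu_temperatures() every few seconds while model.fit is running.
```

Logging the values while training runs shows whether the hang coincides with the GPU heating up.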