tensorflow: Half precision training very slow and returning nan loss

System information

  • OS Platform and Distribution : Win10
  • TensorFlow version (use command below): 2.1.0
  • Python version: 3.6.8
  • CUDA/cuDNN version: 10.1 / 7.6.4
  • GPU model and memory: RTX2080ti (11GB)

Describe the current behavior

I used the “mixed_float16” policy to train an EfficientNet model (https://github.com/qubvel/efficientnet), but training became almost 10 times slower and returned NaN losses even though I set a large epsilon.

https://www.tensorflow.org/api_docs/python/tf/keras/mixed_precision/experimental/Policy?hl=en

Here is the code I used to set up mixed-precision training:

from tensorflow.keras import backend as K
from tensorflow.keras.mixed_precision import experimental as mixed_precision

# Compute in float16 while keeping variables in float32
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)
# Larger epsilon so small denominators don't underflow in float16
K.set_epsilon(1e-3)
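
Note: with model.fit, Keras applies dynamic loss scaling automatically once the mixed_float16 policy is set. For a custom training loop under the TF 2.1 experimental API, loss scaling has to be wired in explicitly. A minimal sketch, assuming a generic Adam optimizer and an arbitrary model/loss rather than the EfficientNet setup above:

import tensorflow as tf
from tensorflow.keras.mixed_precision import experimental as mixed_precision

opt = tf.keras.optimizers.Adam(epsilon=1e-3)                  # larger epsilon for float16
opt = mixed_precision.LossScaleOptimizer(opt, loss_scale='dynamic')

@tf.function
def train_step(model, loss_fn, x, y):
    with tf.GradientTape() as tape:
        preds = model(x, training=True)
        loss = loss_fn(y, preds)
        scaled_loss = opt.get_scaled_loss(loss)               # scale up to avoid gradient underflow
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = opt.get_unscaled_gradients(scaled_grads)          # undo the scaling before applying
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss

The mixed precision guide also recommends keeping the model's final softmax/output layer in float32 (e.g. layers.Activation('softmax', dtype='float32')), since a float16 output layer is another common source of NaN losses.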

P.S.: When I trained DenseNet121 from tf.keras.applications using mixed_float16, it ran fine.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

I noticed the same issue. I don’t have a reproducible example to share yet, but I can say that my network uses bilinear upsampling (among a lot of other things). More specifically, my training time is 6x longer for the epochs where no NaNs are present, and NaNs start appearing after roughly 2000 steps (which takes about 4 hours).

@Hazarapet I don’t know if you had seen this, but bilinear upsampling is indeed a cause of inefficiency for mixed precision in TensorFlow, as reported in this issue. The problem should be fixed in TF 2.4.

However, I am still wondering whether the NaN issue has the same cause, and therefore the same fix.
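
For anyone looking for a workaround in the meantime, here is a minimal sketch, assuming the cost comes from a bilinear UpSampling2D layer running in float16 (the layer sizes below are made up for illustration): pin that layer to float32 while the rest of the model stays under the mixed_float16 policy.

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.mixed_precision import experimental as mixed_precision

mixed_precision.set_policy(mixed_precision.Policy('mixed_float16'))

inputs = layers.Input(shape=(64, 64, 32))
x = layers.Conv2D(32, 3, padding='same', activation='relu')(inputs)      # float16 compute
x = layers.UpSampling2D(size=2, interpolation='bilinear',
                        dtype='float32')(x)                               # forced to float32
x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)           # back to float16
outputs = layers.Conv2D(1, 1, activation='sigmoid', dtype='float32')(x)  # float32 output
model = tf.keras.Model(inputs, outputs)

Whether pinning the upsampling to float32 also avoids the NaNs has not been verified here.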

Same thing here. Mixed precision with mixed_float16 is super slow, about 10 times slower than without it. I used the same batch size and the same model, and timed the following:

  1. Feedforward pass: ~10x worse
  2. Loss computation: ~6x worse
  3. Gradient computation: ~6x worse

All of them got worse.
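
For context, a rough sketch of how these three phases could be timed separately in eager TensorFlow (the DenseNet121 model, loss, and input shapes below are placeholders rather than the commenter's actual setup; the .numpy() calls force the asynchronous GPU ops to finish before the clock is read):

import time
import tensorflow as tf

model = tf.keras.applications.DenseNet121(weights=None)          # placeholder model
loss_fn = tf.keras.losses.CategoricalCrossentropy()
x = tf.random.normal((8, 224, 224, 3))
y = tf.one_hot(tf.random.uniform((8,), maxval=1000, dtype=tf.int32), depth=1000)

t0 = time.perf_counter()
with tf.GradientTape() as tape:
    preds = model(x, training=True)
    _ = preds.numpy()                                             # sync before timing
    t1 = time.perf_counter()                                      # end of forward pass
    loss = loss_fn(y, preds)
    _ = loss.numpy()
    t2 = time.perf_counter()                                      # end of loss computation
grads = tape.gradient(loss, model.trainable_variables)
_ = [g.numpy() for g in grads if g is not None]
t3 = time.perf_counter()                                          # end of gradient computation
print('forward %.3fs, loss %.3fs, grads %.3fs' % (t1 - t0, t2 - t1, t3 - t2))

Running the same measurement once with and once without the mixed_float16 policy set would give the per-phase slowdown factors quoted above.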