tensorflow: _MklSoftmax 2-2.5x Slower in 1.15 Compared to 1.14 and 1.13
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
intelaipg/intel-optimized-tensorflow:1.14.0-mkl-py3 and intelaipg/intel-optimized-tensorflow:1.15.2-mkl-py3
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): see OS
- Python version: 3.6
- Bazel version (if compiling from source): n/a
- GCC/Compiler version (if compiling from source): n/a
- CUDA/cuDNN version: n/a
- GPU model and memory: n/a
Describe the current behavior
We found that the _MklSoftmax operation is about 2-2.5x slower in TF 1.15 than it was in TF 1.13 and 1.14.
Describe the expected behavior
Comparable speed to previous versions.
Standalone code to reproduce the issue
n/a, but confirmed by @NeoZhangJianyu (see https://github.com/tensorflow/tensorflow/issues/39851#issuecomment-652150250)
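Although no standalone reproducer was attached, a microbenchmark for this regression would look roughly like the sketch below. The pure-Python softmax here is only a stand-in for `tf.nn.softmax` (TensorFlow is not assumed to be installed); what matters is the loop structure: many repeated calls on a small input, since the slowdown shows up mainly for small problem sizes.

```python
import math
import time

def softmax(logits):
    """Numerically stable softmax over a list of floats.

    Stand-in for tf.nn.softmax; subtract the max before exponentiating
    to avoid overflow.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def benchmark(n_iters, size=128):
    """Time n_iters repeated softmax calls on a small input vector."""
    logits = [float(i % 7) for i in range(size)]
    start = time.perf_counter()
    for _ in range(n_iters):
        probs = softmax(logits)
    elapsed = time.perf_counter() - start
    return elapsed, probs

N_ITERS = 1000
elapsed, probs = benchmark(N_ITERS)
print(f"{elapsed / N_ITERS * 1e6:.2f} us per softmax call")
```

To reproduce the actual issue, the same harness would be run against the 1.14 and 1.15 Intel-optimized images and the per-call times compared.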
Other info / logs
n/a
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 19 (9 by maintainers)
@NeoZhangJianyu Thanks a lot – this resolves our problem. The speed is even better than what we see with 1.14!
Edit: Almost a 250% improvement over stock 1.15.2 from the Intel image.

@pks My bazel is 3.1.0.
There is a binary release now; please install TF 1.15.0up1 via pip:

pip install https://storage.googleapis.com/intel-optimized-tensorflow/intel_tensorflow-1.15.0up1-cp36-cp36m-manylinux2010_x86_64.whl

@pks For machines without AVX-512, the issue is present too: they use AVX2. I will check for a workaround and report back later.
Additionally, it's recommended to upgrade to TF 2.x. Regarding TF 1.15:

> This is the last 1.x release for TensorFlow. We do not expect to update the 1.x branch with features, although we will issue patch releases to fix vulnerabilities for at least one year.

(refer to https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)

@pks We got the answer from the dev team:
TF 1.15 uses mkldnn 0.20, which takes a new code path (an AVX-512 implementation on CLX or SKX), while TF 1.14 with mkldnn 0.18 runs a reference implementation.
The new path spends much more time on primitive creation than the reference path. That is the root cause of the poor SoftMax performance on TF 1.15, especially for small problem sizes.
The TF master branch has enabled a primitive cache for SoftMax, so it does not have this performance issue.
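The idea behind that fix can be illustrated with a small sketch. The class and function names below are hypothetical, not the actual TensorFlow/mkldnn code: the expensive step (creating a softmax primitive for a given shape and dtype) is done once per key and reused, so repeated small ops skip the re-creation cost that caused the slowdown.

```python
# Illustrative sketch of a primitive cache (names are hypothetical).
# Creating an MKL-DNN softmax primitive is expensive on the AVX-512
# path; caching it per (shape, dtype) amortizes that one-time cost.

class PrimitiveCache:
    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, shape, dtype, create_fn):
        """Return a cached primitive, creating it on first use."""
        key = (tuple(shape), dtype)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = create_fn(shape, dtype)  # expensive path
        return self._cache[key]

def create_softmax_primitive(shape, dtype):
    # Stand-in for the costly primitive-descriptor creation.
    return ("softmax_primitive", tuple(shape), dtype)

cache = PrimitiveCache()
p1 = cache.get((32, 10), "f32", create_softmax_primitive)
p2 = cache.get((32, 10), "f32", create_softmax_primitive)  # cache hit
```

After the second call, `p2` is the same object as `p1`: the primitive was created once and reused, which is why the regression disappears on branches that carry the cache.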
This issue has been fixed in TF 2.2.0 and later; please use TF 2.2.0 or newer. We can't backport this fix to TF 1.15 because Google has stopped feature updates for that branch.