tensorflow: _MklSoftmax 2-2.5x Slower in 1.15 Compared to 1.14 and 1.13
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
intelaipg/intel-optimized-tensorflow:1.14.0-mkl-py3 and intelaipg/intel-optimized-tensorflow:1.15.2-mkl-py3
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): see OS
- Python version: 3.6
- Bazel version (if compiling from source): n/a
- GCC/Compiler version (if compiling from source): n/a
- CUDA/cuDNN version: n/a
- GPU model and memory: n/a
Describe the current behavior
We found that the _MklSoftmax operation is about 2-2.5x slower in TF 1.15 than it was in TF 1.13 and 1.14.
Describe the expected behavior
Comparable speed to previous versions.
Standalone code to reproduce the issue
n/a, but confirmed by @NeoZhangJianyu (see https://github.com/tensorflow/tensorflow/issues/39851#issuecomment-652150250)
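Although no standalone reproducer was attached, a microbenchmark for this regression would look roughly like the sketch below. The pure-Python softmax here is only a stand-in for `tf.nn.softmax` (TensorFlow is not assumed to be installed); what matters is the loop structure: many repeated calls on a small input, since the slowdown shows up mainly for small problem sizes.

```python
import math
import time

def softmax(logits):
    """Numerically stable softmax over a list of floats.

    Stand-in for tf.nn.softmax; subtract the max before exponentiating
    to avoid overflow.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def benchmark(n_iters, size=128):
    """Time n_iters repeated softmax calls on a small input vector."""
    logits = [float(i % 7) for i in range(size)]
    start = time.perf_counter()
    for _ in range(n_iters):
        probs = softmax(logits)
    elapsed = time.perf_counter() - start
    return elapsed, probs

N_ITERS = 1000
elapsed, probs = benchmark(N_ITERS)
print(f"{elapsed / N_ITERS * 1e6:.2f} us per softmax call")
```

To reproduce the actual issue, the same harness would be run against the 1.14 and 1.15 Intel-optimized images and the per-call times compared.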
Other info / logs
n/a
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 19 (9 by maintainers)
@NeoZhangJianyu Thanks a lot – this resolves our problem. The speed is even better than what we see with 1.14!
Edit: Almost a 250% improvement over stock 1.15.2 from the Intel image.

@pks My bazel is 3.1.0.
There is a binary release now; please install TF 1.15.0up1 via pip:

pip install https://storage.googleapis.com/intel-optimized-tensorflow/intel_tensorflow-1.15.0up1-cp36-cp36m-manylinux2010_x86_64.whl

@pks For machines without AVX-512, the issue is present too: they use AVX2. I will check for a workaround and report back later.
Additionally, it's recommended to upgrade to TF 2.x. Regarding TF 1.15:

> This is the last 1.x release for TensorFlow. We do not expect to update the 1.x branch with features, although we will issue patch releases to fix vulnerabilities for at least one year.

(refer to https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)

@pks We got the answer from the dev team:
TF 1.15 uses mkldnn 0.20, which takes a new code path (an AVX-512 implementation on CLX or SKX), while TF 1.14 with mkldnn 0.18 runs a reference implementation.
The new path spends much more time on primitive creation than the reference path. That is the root cause of the poor SoftMax performance on TF 1.15, especially for small problem sizes.
The TF master branch has enabled a primitive cache for SoftMax, so it does not have this performance issue.
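The idea behind that fix can be illustrated with a small sketch. The class and function names below are hypothetical, not the actual TensorFlow/mkldnn code: the expensive step (creating a softmax primitive for a given shape and dtype) is done once per key and reused, so repeated small ops skip the re-creation cost that caused the slowdown.

```python
# Illustrative sketch of a primitive cache (names are hypothetical).
# Creating an MKL-DNN softmax primitive is expensive on the AVX-512
# path; caching it per (shape, dtype) amortizes that one-time cost.

class PrimitiveCache:
    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, shape, dtype, create_fn):
        """Return a cached primitive, creating it on first use."""
        key = (tuple(shape), dtype)
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = create_fn(shape, dtype)  # expensive path
        return self._cache[key]

def create_softmax_primitive(shape, dtype):
    # Stand-in for the costly primitive-descriptor creation.
    return ("softmax_primitive", tuple(shape), dtype)

cache = PrimitiveCache()
p1 = cache.get((32, 10), "f32", create_softmax_primitive)
p2 = cache.get((32, 10), "f32", create_softmax_primitive)  # cache hit
```

After the second call, `p2` is the same object as `p1`: the primitive was created once and reused, which is why the regression disappears on branches that carry the cache.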
This issue has been fixed in TF 2.2.0 and later; please use TF 2.2.0 or newer. We can't backport this fix to TF 1.15 because Google has stopped feature updates for that branch.