tensorflow: dnnConversionCreate_F32 fails when running TF with optimized MKL
System information
- Have I written custom code: Yes, (https://github.com/jakubkarczewski/AlexNetF/blob/master/alexnet.py)
- OS Platform and Distribution: Linux Centos 7
- TensorFlow installed from: compiled from source from https://github.com/tensorflow/tensorflow/releases
- TensorFlow version: 1.6.0-rc0
- Python version: 2.7
- Bazel version: 0.10.0
- GCC/Compiler version: stock Centos 7 gcc
- Compilation command: bazel build --config=mkl --copt="-DINTEL_MKL_ML" --copt="-mfma" --copt="-mavx2" --copt="-march=broadwell" --copt="-O3" -s -c opt //tensorflow/tools/pip_package:build_pip_package
- Exact command to reproduce: python alexnet.py --training_epoch=1 --model_version=1 output/
- CUDA/cuDNN version: N/A
- GPU model and memory: N/A
Describe the problem
Running the training with TensorFlow compiled with the command above fails with the following error: 2018-02-12 23:40:38.088756: F tensorflow/core/kernels/mkl_lrn_op.cc:595] Check failed: dnnConversionCreate_F32( &convert_input, static_cast<dnnLayout_t>(inimage_shape.GetCurLayout()), lt_internal_input) == E_SUCCESS (-1 vs. 0). By contrast, the same script runs without any error and trains properly on the TensorFlow version installed with pip install tensorflow.
For training data I used the ImageNet 60 GB dataset (http://www.image-net.org/challenges/LSVRC/2012/) with 1000 classes.
What's more, the same error occurs when running with TensorFlow from the precompiled wheel files for both versions of Python. This makes me think that the way I compile TF is not the problem here.
About this issue
- State: closed
- Created 6 years ago
- Comments: 27 (15 by maintainers)
As you originally reported: tensorflow/core/kernels/mkl_lrn_op.cc:595] Check failed: dnnConversionCreate_F32( &convert_input, static_cast<dnnLayout_t>(inimage_shape.GetCurLayout()), lt_internal_input) == E_SUCCESS (-1 vs. 0)
The bug lies in the LRN op: commenting out the two LRN layers in your script avoids the error. We will fix the LRN MKL ops. Please stay tuned. Thank you!
I might have used a preprocessed ImageNet dataset because I got it from a friend of mine. I bet he used mapping for subdirectories such as: https://gist.github.com/aaronpolhamus/964a4411c0906315deb9f4a3723aac57
Thank you @jakubkarczewski for reporting and for going deeper into this debugging. We will reproduce and provide a fix for this. Please stay tuned.
A colleague suggests: The error appears to be E_INCORRECT_INPUT_PARAMETER. It sounds to me like it might be a shape mismatch of some kind, so perhaps add logging to see what the two layouts are?
They traced dnnConversionCreate_F32 to the Intel site, which has https://software.intel.com/en-us/mkl-developer-reference-c-dnnconversioncreate, and from there found dnnError_t, which has the following cases:
typedef enum {
  E_SUCCESS = 0,
  E_INCORRECT_INPUT_PARAMETER = -1,
  E_UNEXPECTED_NULL_POINTER = -2,
  E_MEMORY_ERROR = -3,
  E_UNSUPPORTED_DIMENSION = -4,
  E_UNIMPLEMENTED = -127
} dnnError_t;
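To make the "(-1 vs. 0)" suffix of the failed check easier to read, here is a minimal Python sketch that maps the two numbers back to the dnnError_t names quoted above. The helper name decode_check_failure and the lookup table are hypothetical and are not part of TensorFlow or MKL. It assumes the message ends with the left-hand value followed by the right-hand value of the comparison, as in the log line from this report.

```python
import re

# Values copied from the dnnError_t enum quoted above.
DNN_ERROR_NAMES = {
    0: "E_SUCCESS",
    -1: "E_INCORRECT_INPUT_PARAMETER",
    -2: "E_UNEXPECTED_NULL_POINTER",
    -3: "E_MEMORY_ERROR",
    -4: "E_UNSUPPORTED_DIMENSION",
    -127: "E_UNIMPLEMENTED",
}

# Matches a "(<int> vs. <int>)" suffix such as "(-1 vs. 0)".
_CHECK_SUFFIX = re.compile(r"\((-?\d+) vs\. (-?\d+)\)")


def decode_check_failure(log_line):
    """Hypothetical helper: return (actual_name, expected_name), or None
    when the line carries no "(a vs. b)" suffix."""
    match = _CHECK_SUFFIX.search(log_line)
    if match is None:
        return None
    actual, expected = (int(group) for group in match.groups())
    return (
        DNN_ERROR_NAMES.get(actual, "UNKNOWN(%d)" % actual),
        DNN_ERROR_NAMES.get(expected, "UNKNOWN(%d)" % expected),
    )


if __name__ == "__main__":
    # Log line taken from the report (timestamp omitted).
    line = (
        "F tensorflow/core/kernels/mkl_lrn_op.cc:595] Check failed: "
        "dnnConversionCreate_F32( &convert_input, "
        "static_cast<dnnLayout_t>(inimage_shape.GetCurLayout()), "
        "lt_internal_input) == E_SUCCESS (-1 vs. 0)"
    )
    print(decode_check_failure(line))
```

For the line in this report, the first name is the status returned by dnnConversionCreate_F32 and the second is the status the check expected.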