tensorflow: dnnConversionCreate_F32 fails when running TF with optimized MKL

System information

Have I written custom code: Yes (https://github.com/jakubkarczewski/AlexNetF/blob/master/alexnet.py)
OS Platform and Distribution: Linux Centos 7
TensorFlow installed from: compiled from source from https://github.com/tensorflow/tensorflow/releases
TensorFlow version: 1.6.0-rc0
Python version: 2.7
Bazel version: 0.10.0
GCC/Compiler version: stock Centos 7 gcc
Compilation command: bazel build --config=mkl --copt="-DINTEL_MKL_ML" --copt="-mfma" --copt="-mavx2" --copt="-march=broadwell" --copt="-O3" -s -c opt //tensorflow/tools/pip_package:build_pip_package;;
Exact command to reproduce: python alexnet.py --training_epoch=1 --model_version=1 output/
CUDA/cuDNN version: N/A
GPU model and memory: N/A

Describe the problem

Running the training script with TensorFlow compiled using the command above fails with the following error:

2018-02-12 23:40:38.088756: F tensorflow/core/kernels/mkl_lrn_op.cc:595] Check failed: dnnConversionCreate_F32( &convert_input, static_cast<dnnLayout_t>(inimage_shape.GetCurLayout()), lt_internal_input) == E_SUCCESS (-1 vs. 0)

The same script runs without any error and trains properly on the TensorFlow version installed with pip install tensorflow. For training data I used the 60 GB ImageNet dataset (http://www.image-net.org/challenges/LSVRC/2012/) with 1000 classes. What's more, the same error occurs when running with TensorFlow from the precompiled wheel files for both versions of Python. This makes me think that the way I compile TF is not the problem here.

About this issue

State: closed
Created 6 years ago
Comments: 27 (15 by maintainers)

Most upvoted comments

As you originally reported: tensorflow/core/kernels/mkl_lrn_op.cc:595] Check failed: dnnConversionCreate_F32( &convert_input, static_cast<dnnLayout_t>(inimage_shape.GetCurLayout()), lt_internal_input) == E_SUCCESS (-1 vs. 0)

The bug lies in the LRN op: commenting out the two LRN layers in your script avoids the error. We will fix the LRN MKL ops. Please stay tuned. Thank you!

wei-v-wang on Apr 6, 2018
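
To check the diagnosis above without editing the model by hand each time, the LRN layers can be made switchable. The following is only a minimal sketch: `maybe_lrn`, `lrn_fn`, and `use_lrn` are hypothetical names that do not come from alexnet.py, and the stand-in function below merely takes the place of a real LRN op so the snippet runs without TensorFlow.

```python
def maybe_lrn(x, lrn_fn, use_lrn=True):
    """Apply lrn_fn to x when use_lrn is True; otherwise return x unchanged.

    lrn_fn is whatever callable the model uses for local response
    normalization (hypothetical parameter, supplied by the caller).
    """
    return lrn_fn(x) if use_lrn else x


if __name__ == "__main__":
    # Stand-in for a real LRN op, used only to demonstrate the toggle.
    def fake_lrn(value):
        return value * 0.5

    print(maybe_lrn(4.0, fake_lrn, use_lrn=True))   # LRN applied
    print(maybe_lrn(4.0, fake_lrn, use_lrn=False))  # LRN skipped
```

In a TensorFlow 1.x script, `lrn_fn` would typically be `tf.nn.local_response_normalization` (or a lambda wrapping it with the model's parameters). Skipping LRN changes the network, so this is a diagnostic for isolating the failing op rather than a fix.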

I might have used a preprocessed ImageNet dataset because I got it from a friend of mine. I bet he used a mapping for subdirectories such as: https://gist.github.com/aaronpolhamus/964a4411c0906315deb9f4a3723aac57

jakubkarczewski on Apr 5, 2018

Thank you @jakubkarczewski for reporting and for going deeper into this debugging. We will reproduce and provide a fix for this. Please stay tuned.

wei-v-wang on Mar 22, 2018

A colleague suggests: The error appears to be E_INCORRECT_INPUT_PARAMETER. It sounds to me like it might be a shape mismatch of some kind, so perhaps add logging to see what the two layouts are?

(They traced dnnConversionCreate_F32 to the Intel site, https://software.intel.com/en-us/mkl-developer-reference-c-dnnconversioncreate, and then in turn found dnnError_t, which has the following cases.)

typedef enum {
    E_SUCCESS = 0,
    E_INCORRECT_INPUT_PARAMETER = -1,
    E_UNEXPECTED_NULL_POINTER = -2,
    E_MEMORY_ERROR = -3,
    E_UNSUPPORTED_DIMENSION = -4,
    E_UNIMPLEMENTED = -127
} dnnError_t;

cy89 on Feb 16, 2018
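
The "(-1 vs. 0)" suffix in the check failure is the returned status compared against E_SUCCESS, so it can be decoded mechanically using the dnnError_t values quoted above. The following is a rough sketch, not part of TensorFlow or MKL: `DnnError` and `decode_check_failure` are hypothetical helper names, the regular expression assumes the "(actual vs. expected)" format seen in the reported log line, and the `enum` module requires Python 3.4 or later (the report itself used Python 2.7).

```python
import re
from enum import IntEnum


class DnnError(IntEnum):
    """Mirror of the dnnError_t values quoted in this thread."""
    E_SUCCESS = 0
    E_INCORRECT_INPUT_PARAMETER = -1
    E_UNEXPECTED_NULL_POINTER = -2
    E_MEMORY_ERROR = -3
    E_UNSUPPORTED_DIMENSION = -4
    E_UNIMPLEMENTED = -127


# Matches the "(actual vs. expected)" suffix of a failed check.
_CHECK_RE = re.compile(r"\((-?\d+) vs\. (-?\d+)\)")


def decode_check_failure(log_line):
    """Return the dnnError_t name for the status in a 'Check failed' line.

    Returns None if the line has no '(a vs. b)' suffix, and
    'UNKNOWN(<code>)' if the status is not one of the listed values.
    """
    match = _CHECK_RE.search(log_line)
    if match is None:
        return None
    actual = int(match.group(1))
    try:
        return DnnError(actual).name
    except ValueError:
        return "UNKNOWN(%d)" % actual


if __name__ == "__main__":
    line = ("Check failed: dnnConversionCreate_F32( &convert_input, "
            "static_cast<dnnLayout_t>(inimage_shape.GetCurLayout()), "
            "lt_internal_input) == E_SUCCESS (-1 vs. 0)")
    print(decode_check_failure(line))
```

For the log line from this report the helper yields E_INCORRECT_INPUT_PARAMETER, which matches the reading above that the layouts passed to the conversion are the place to look.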