deeplearning4j: Libnd4j: conv2d op (and MKL-DNN-enabled conv2d) slower than DL4J implementation

Edit: test system is Windows 10, 8-core 5960X CPU

Here’s a simple benchmark comparing the DL4J ConvolutionLayer implementation (no MKL-DNN) with the conv2d custom op, both with and without MKL-DNN enabled: https://gist.github.com/AlexDBlack/31d2d2ce5fdb04e6dcbc0b80e6187f88

Updated 28/02/19:

DL4J ConvolutionLayer: average 7.79 ms
conv2d op (use mkl=false): average 19.04 ms
conv2d op (use mkl=true): average 689.96 ms
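
For context on what the "conv2d op" rows above measure, here is a rough sketch of invoking the conv2d custom op directly via DynamicCustomOp, using the shapes from the MKL-DNN verbose log further down (8x32x64x64 input, 2x2 kernel, stride 1). This is not the gist code; in particular the weight layout ([kH, kW, inChannels, outChannels]) and the integer-argument ordering are assumptions on my part, so refer to the gist above for the authoritative benchmark.

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.api.ops.DynamicCustomOp;
import org.nd4j.linalg.factory.Nd4j;

public class Conv2dOpSketch {
    public static void main(String[] args) {
        // NCHW input matching the benchmark shapes: [minibatch 8, channels 32, height 64, width 64]
        INDArray input = Nd4j.rand(new int[]{8, 32, 64, 64});
        // Assumed libnd4j weight layout: [kH, kW, inChannels, outChannels]
        INDArray weights = Nd4j.rand(new int[]{2, 2, 32, 32});
        INDArray bias = Nd4j.rand(new int[]{1, 32});
        INDArray output = Nd4j.createUninitialized(new int[]{8, 32, 64, 64});

        // Assumed integer argument order: kH, kW, sH, sW, pH, pW, dH, dW, isSameMode, data format flag
        DynamicCustomOp conv2d = DynamicCustomOp.builder("conv2d")
                .addInputs(input, weights, bias)
                .addOutputs(output)
                .addIntegerArguments(2, 2, 1, 1, 0, 0, 1, 1, 1, 0)
                .build();

        // Warm up, then time repeated executions of the same op instance
        for (int i = 0; i < 20; i++) {
            Nd4j.getExecutioner().exec(conv2d);
        }
        int runs = 100;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            Nd4j.getExecutioner().exec(conv2d);
        }
        System.out.println("conv2d op average: " + (System.nanoTime() - start) / (runs * 1e6) + " ms");
    }
}

Building the op once and calling exec() on it repeatedly keeps Java-side op construction out of the timed loop; as noted further down, the native nd4j::graph::Context is still recreated on each call.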


Edit (updated 28/02/19): Subsampling/pooling test + results: https://gist.github.com/AlexDBlack/b1f5f32e80b631321fe9936814fd8534

max pooling
DL4J SubsamplingLayer: average 1.1 ms
maxpool2d op (use mkl=false): average 1.09 ms
maxpool2d op (use mkl=true): average 16.86 ms
-----------------
avg pooling
DL4J SubsamplingLayer: average 0.85 ms
avgpool2d op (use mkl=false): average 3.43 ms
avgpool2d op (use mkl=true): average 14.37 ms


Batch norm forward pass test + results: https://gist.github.com/AlexDBlack/e46cf50de14252ac0d43e7a813d6a045

Updated 28/02/19: batchnorm is now faster for both DL4J and libnd4j than in the earlier results.

DL4J BatchNormalization: average 4.97 ms
batchnorm_new op (use mkl=false): average 2.33 ms
batchnorm_new op (use mkl=true): average 2.83 ms


Edit 28/02/19: LRN results https://gist.github.com/AlexDBlack/88ab2529a73166b9955c28e8f83a61ef

DL4J LRN: average 34.52 ms
lrn op (use mkl=false): average 14.67 ms
lrn op (use mkl=true): average 5.08 ms


27/02/19: DL4J LSTM vs. lstmBlock op (note: no MKL-DNN support yet):

DL4J LSTM layer: average 11.33 ms
lstmBlock op: average 6.53 ms

https://gist.github.com/AlexDBlack/8d01ee6e9c42ecf8fd8f988d16698bf6



28/02/19: Softmax op:

Legacy softmax op: average 2.225 ms
softmax custom op: average 0.321 ms

https://gist.github.com/AlexDBlack/88a02e91a9b8e9e93f8da5ce0901d3f6

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 18 (17 by maintainers)

Most upvoted comments

I found what the problem was: we need to add reorder() operations manually to the MKL-DNN streams, otherwise MKL-DNN falls back to reference (non-JIT) implementations of the other operations. For now, conv2d and conv2d_bp are done in sa_mkldnn, and with https://gist.github.com/AlexDBlack/31d2d2ce5fdb04e6dcbc0b80e6187f88 I get this kind of output:

DL4J ConvolutionLayer: average 9.05 ms
conv2d op (use mkl=false): average 7.96 ms
conv2d op (use mkl=true): average 3.12 ms

It appears, though, that the nd4j::graph::Context still gets recreated on each call to Nd4j.exec(op). If I hack it to use a single static stream, I get values below 2.5 ms.

To make sure we’re executing with JIT, we can set the MKLDNN_VERBOSE environment variable to 1. We should then see messages containing “jit”, like these:

mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nChw8c out:f32_blocked,num:1,8x32x64x64,0.538086
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_blocked out:f32_nChw8c,num:1,8x32x64x64,0.624023
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_blocked out:f32_OIhw8i8o,num:1,32x32x2x2,0.0620117
mkldnn_verbose,exec,convolution,jit:avx2,forward_training,fsrc:nChw8c fwei:OIhw8i8o fbia:x fdst:nChw8c,alg:convolution_direct,mb8_g1ic32oc32_ih64oh64kh2sh1dh0ph0_iw64ow64kw2sw1dw0pw0,1.15381

If we see “ref” instead of “jit”, those primitives are running the slow reference implementations.
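
Since MKLDNN_VERBOSE has to be set in the environment of the JVM that actually runs the benchmark, one way to do that from Java (a sketch; the jar and class names below are placeholders) is to launch it as a child process:

public class VerboseLauncher {
    public static void main(String[] args) throws Exception {
        // Launch the benchmark in a child JVM with MKLDNN_VERBOSE=1 in its environment.
        // "benchmarks.jar" and "ConvBenchmark" are placeholder names for illustration.
        ProcessBuilder pb = new ProcessBuilder("java", "-cp", "benchmarks.jar", "ConvBenchmark");
        pb.environment().put("MKLDNN_VERBOSE", "1");
        pb.inheritIO();   // forward the child's stdout/stderr so the mkldnn_verbose lines are visible
        System.exit(pb.start().waitFor());
    }
}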