deeplearning4j: CPU - sgemm deadlock?
Running snapshots (latest DL4J master) on this branch of the examples: https://github.com/deeplearning4j/dl4j-examples/pull/589, specifically GravesLSTMCharModellingExample.
System: Windows 10 with MKL.
Output gets to here and just stops…
21:55:57,139 INFO ~ Loaded [CpuBackend] backend
21:55:58,371 INFO ~ Number of threads used for NativeOps: 8
21:56:16,944 INFO ~ Number of threads used for BLAS: 8
21:56:16,966 INFO ~ Backend used: [CPU]; OS: [Windows 10]
21:56:16,966 INFO ~ Cores: [16]; Memory: [7.1GB];
21:56:16,966 INFO ~ Blas vendor: [MKL]
21:56:17,734 INFO ~ Starting MultiLayerNetwork with WorkspaceModes set to [training: SEPARATE; inference: SEPARATE]
Number of parameters in layer 0: 223000
Number of parameters in layer 1: 321400
Number of parameters in layer 2: 15477
Total number of network parameters: 559877
YourKit profiler picks this up as a deadlock (screenshot not preserved).

Note that the 0.9.1 version runs fine… and a bunch of the other examples run fine also. I’m only seeing issues on this one (so far).
Edit: weirdly GravesLSTMCharModellingExample consistently deadlocks for me, but CompGraphLSTMExample runs fine…
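For anyone without a profiler at hand, a rough first check for this kind of hang is to list which JVM threads are currently sitting in native code. The sketch below is only an illustration using standard JDK calls (`Thread.getAllStackTraces()` and `StackTraceElement.isNativeMethod()`); the class name `StuckThreadReport` is made up for this example and is not part of DL4J or ND4J. A thread whose top frame is native is not necessarily stuck, so the output only narrows down where to look.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of DL4J/ND4J: lists threads whose top frame is native.
public class StuckThreadReport {

    /** Returns one "threadName -> topFrame" entry per live thread currently in a native method. */
    public static List<String> threadsInNativeCode() {
        List<String> report = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = e.getValue();
            // An empty stack means the thread has not started or has already finished.
            if (stack.length > 0 && stack[0].isNativeMethod()) {
                report.add(e.getKey().getName() + " -> " + stack[0]);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Run this from a watchdog thread (or a debugger) while the training call hangs.
        threadsInNativeCode().forEach(System.out::println);
    }
}
```

If the same thread shows the same native BLAS frame across several snapshots taken a few seconds apart, that is consistent with a hang inside the native call rather than in Java code.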
About this issue
- State: closed
- Created 6 years ago
- Comments: 52 (51 by maintainers)
Commits related to this issue
- Correct loading order of `mkl_core` to fix MKL issues on Windows (issue deeplearning4j/deeplearning4j#4776), committed to bytedeco/javacpp-presets by saudet 6 years ago
- Prevent `Loader` from loading twice copies of the same DLL (issue deeplearning4j/deeplearning4j#4776), committed to bytedeco/javacpp by saudet 6 years ago
OK, so I guess we merge https://github.com/deeplearning4j/nd4j/pull/2734 and then we can close this issue 😄
Success! Now that I’m using the correct ND4J branch - looks like it’s only loading mkl_core.dll once https://gist.github.com/AlexDBlack/6f95f9be21f9ef0e3e59cdb4f0d1d49a
I can confirm that we’re no longer seeing deadlocks when run via IntelliJ or java -cp (run 20 times each). 🎉 👏
So - at this point, I guess the only remaining question is MKL 2017… do you want to do some testing there to make sure it still works as expected after your changes?
@AlexDBlack Thanks, but don’t worry about spending more time on that for now. Loading the exact same DLLs twice in memory is probably causing problems. Windows isn’t being helpful here. Let me try to rectify that in JavaCPP first…
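To make the "same DLL loaded twice" suspicion concrete, here is a minimal sketch that takes a list of loaded module paths (as one might copy out of a process inspection tool) and reports any file name that appears under more than one path. Everything here is an assumption for illustration: the class name `DuplicateLibraryFinder`, the method `findDuplicates`, and the example paths are hypothetical, and this is not how JavaCPP's `Loader` detects duplicates.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: groups loaded module paths by file name and keeps the duplicates.
public class DuplicateLibraryFinder {

    /** Maps each lower-cased file name to its paths, keeping only names seen under 2+ paths. */
    public static Map<String, Set<String>> findDuplicates(List<String> loadedPaths) {
        Map<String, Set<String>> byName = new TreeMap<>();
        for (String path : loadedPaths) {
            // Accept both Windows and Unix separators when extracting the file name.
            int cut = Math.max(path.lastIndexOf('\\'), path.lastIndexOf('/'));
            // Windows file names are case-insensitive, so compare them lower-cased.
            String name = path.substring(cut + 1).toLowerCase(Locale.ROOT);
            byName.computeIfAbsent(name, k -> new TreeSet<>()).add(path);
        }
        byName.values().removeIf(paths -> paths.size() < 2);
        return byName;
    }

    public static void main(String[] args) {
        // Example paths are invented for illustration only.
        List<String> loaded = Arrays.asList(
                "C:\\cache\\a\\mkl_core.dll",
                "C:\\cache\\b\\mkl_core.dll",
                "C:\\cache\\a\\mkl_rt.dll");
        findDuplicates(loaded).forEach((name, paths) ->
                System.out.println(name + " loaded from " + paths.size() + " locations: " + paths));
    }
}
```

A non-empty result for `mkl_core.dll` would match the situation described above, where two copies of the same library end up in one process.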
OK, so I’ve double checked and fixed some things - now I’m sure that (a) I’m no sa_mkl, and (b) my path is correct.
Result: Both IntelliJ and maven/java -cp are loading MKL correctly. Both are still deadlocking, however.
Looks like it’s legitimately bad args (well, single bad arg)… fix incoming…
Well, maybe it’s threading related. But why does it work totally fine for all subsequent calls - and not the first one? If it was threading related, shouldn’t it happen all the time?