deeplearning4j: Deadlock during training with OMP_NUM_THREADS >= 8

Issue Description

When training a CNN for text classification, training hangs when OMP_NUM_THREADS is 8 or higher. With lower thread counts, throughput increases almost linearly:

OMP_NUM_THREADS | Batches/sec
--- | ---
1 | 2.117
2 | 3.815
4 | 7.006
6 | 9.539
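To put a number on "almost linearly", the throughput figures above can be turned into speedup and parallel efficiency relative to the single-thread run. The following is a small sketch for that arithmetic only; the class and method names (`ScalingCheck`, `speedup`, `efficiency`) are illustrative and not part of DL4J.

```java
import java.util.Locale;

final class ScalingCheck {
    /** Speedup of a measured throughput relative to the single-thread baseline. */
    public static double speedup(double baseline, double measured) {
        return measured / baseline;
    }

    /** Parallel efficiency: speedup divided by thread count (1.0 means perfectly linear). */
    public static double efficiency(double baseline, double measured, int threads) {
        return speedup(baseline, measured) / threads;
    }

    public static void main(String[] args) {
        // Figures copied from the issue description.
        int[] threads = {1, 2, 4, 6};
        double[] batchesPerSec = {2.117, 3.815, 7.006, 9.539};

        for (int i = 0; i < threads.length; i++) {
            double s = speedup(batchesPerSec[0], batchesPerSec[i]);
            double e = efficiency(batchesPerSec[0], batchesPerSec[i], threads[i]);
            System.out.printf(Locale.ROOT, "%d threads: speedup %.2fx, efficiency %.0f%%%n",
                    threads[i], s, e * 100);
        }
    }
}
```

With the reported numbers, efficiency comes out at roughly 90%, 83%, and 75% for 2, 4, and 6 threads, so scaling is good but already tapering before the hang appears at 8.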

The (simple) network:

        MultiLayerConfiguration config = new NeuralNetConfiguration.Builder()
                .weightInit(WeightInit.RELU)
                .activation(Activation.LEAKYRELU)
                .updater(new Adam(0.01))
                .convolutionMode(ConvolutionMode.Same)
                .l2(0.001)
                .list()
                .layer(new ConvolutionLayer.Builder()
                        .kernelSize(3, 50)
                        .stride(1, 50)
                        .nIn(1)
                        .nOut(100)
                        .build())
                .layer(new GlobalPoolingLayer.Builder()
                        .poolingType(PoolingType.MAX)
                        .dropOut(0.7)
                        .build())
                .layer(new OutputLayer.Builder()
                        .lossFunction(LossFunctions.LossFunction.MCXENT)
                        .activation(Activation.SOFTMAX)
                        .nIn(100)
                        .nOut(AgeGroup.values().length - 1)
                        .build())
                .build();

Output of kill -3 in this gist: https://gist.github.com/tschut/730ebeff7039baed44e52d623c841334.
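As a complement to the kill -3 dump, the JVM can also be asked directly whether it sees a Java-level deadlock. The sketch below uses the standard `java.lang.management` API; `DeadlockProbe` is a hypothetical helper name, not something from DL4J. Note the limitation: it only detects cycles on Java monitors and `java.util.concurrent` locks, so if the hang is inside native code (for example in OpenMP worker threads), it would report nothing.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

final class DeadlockProbe {
    /**
     * Returns a description of Java-level deadlocked threads,
     * or an empty string if the JVM detects none.
     */
    public static String describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock is detected
        if (ids == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        // Include locked monitors and synchronizers in the per-thread report.
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info != null) { // a thread may have exited in the meantime
                sb.append(info);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String report = describeDeadlocks();
        System.out.println(report.isEmpty() ? "No Java-level deadlock detected" : report);
    }
}
```

In practice this would be called from a watchdog thread inside the training process; an empty result while training is stuck would point towards the native side rather than Java locks.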

Version Information

Snapshot versions of DL4J and ND4J
Running on CPU (no GPU) on Ubuntu 18.04

$ uname -a
Linux gpu-instance2 4.15.0-1029-gcp #31-Ubuntu SMP Thu Mar 21 09:40:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

processor info

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) CPU @ 2.00GHz
Stepping:            3
CPU MHz:             2000.180
BogoMIPS:            4000.36
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            56320K
NUMA node0 CPU(s):   0-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat arch_capabilities

About this issue

State: closed
Created 5 years ago
Comments: 28 (13 by maintainers)

Most upvoted comments

The dependency should look more like:

        <dependency>
            <groupId>org.nd4j</groupId>
            <artifactId>nd4j-native</artifactId>
            <version>${dl4j.version}</version>
            <classifier>linux-x86_64-avx512</classifier>
        </dependency>

treo on May 1, 2019
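An AVX-512 build of the native backend presumably only makes sense on a CPU that reports AVX-512 support, as the lscpu flags in this report do (avx512f). The following sketch checks for that flag on Linux by reading /proc/cpuinfo. The class name `Avx512Check` and the choice to test only for `avx512f` are assumptions made for illustration, not ND4J's own detection logic.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

final class Avx512Check {
    /** True if a whitespace-separated CPU flags string contains the avx512f flag. */
    public static boolean hasAvx512f(String flags) {
        return Arrays.asList(flags.trim().split("\\s+")).contains("avx512f");
    }

    public static void main(String[] args) throws IOException {
        Path cpuinfo = Paths.get("/proc/cpuinfo"); // Linux only
        for (String line : Files.readAllLines(cpuinfo)) {
            if (line.startsWith("flags")) {
                // The line looks like "flags : fpu vme de ...".
                String flags = line.substring(line.indexOf(':') + 1);
                System.out.println(hasAvx512f(flags)
                        ? "avx512f present"
                        : "avx512f not reported");
                return;
            }
        }
        System.out.println("no flags line found");
    }
}
```

Matching on whole tokens rather than a substring avoids treating related flags such as avx512dq as evidence of the foundation instruction set.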