deeplearning4j: Deadlock using ParagraphVector

Hi, I'm trying to create a word embedding (using ParagraphVectors) with:

  • 0.8.1-SNAPSHOT
  • Ubuntu 16.04.2 LTS
  • Hardware: 16 GB RAM, Intel® Xeon® CPU E5-2620 v2 @ 2.10GHz
  • Input file contains 513,316 lines and is 873.5 MB. Each training document is a 7-bit ASCII string roughly 10 to 50,000 characters long.
  • DefaultTokenizerFactory with a custom MyPreprocessor

The log is:

11:35:10.937 [Thread-11] INFO  o.d.m.s.SequenceVectors - Starting vocabulary building...
11:35:10.938 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
11:35:10.991 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Trying source iterator: [0]
11:35:10.991 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
11:35:56.217 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [100000]; Current vocabulary size: [385141]; Sequences/sec: 2208.58; Words/sec: 962757.81;
11:58:56.372 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [200000]; Current vocabulary size: [588558]; Sequences/sec: 72.46; Words/sec: 26455.10;
11:59:41.377 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [300000]; Current vocabulary size: [667109]; Sequences/sec: 2221.98; Words/sec: 902804.93;
11:59:59.203 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Wating till all processes stop...
11:59:59.204 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size before truncation: [764577],  NumWords: [142093275], sequences parsed: [366918], counter: [142093265]
11:59:59.658 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Scavenger: Words before: 764577; Words after: 342429;
11:59:59.658 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size after truncation: [342429],  NumWords: [140970370], sequences parsed: [366918], counter: [142093265]
12:00:07.468 [Thread-11] INFO  o.d.m.w.wordstore.VocabConstructor - Sequences checked: [366918], Current vocabulary size: [342429]; Sequences/sec: [245.18];
12:00:07.486 [Thread-11] INFO  o.d.m.e.loader.WordVectorSerializer - Projected memory use for model: [391.88 MB]
12:00:07.806 [Thread-11] INFO  o.d.m.e.inmemory.InMemoryLookupTable - Initializing syn1...
12:00:08.038 [Thread-11] INFO  o.d.m.s.SequenceVectors - Building learning algorithms:
12:00:08.038 [Thread-11] INFO  o.d.m.s.SequenceVectors -           building ElementsLearningAlgorithm: [SkipGram]
12:00:08.042 [Thread-11] INFO  o.d.m.s.SequenceVectors - Starting learning process...
12:24:22.518 [VectorCalculationsThread 5] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [43254792];  Lines vectorized so far: [100000]; Seq/sec: [68.76]; Words/sec: [29740.19]; learningRate: [0.01732922544852691]
13:05:27.656 [VectorCalculationsThread 10] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [79385098];  Lines vectorized so far: [200000]; Seq/sec: [40.57]; Words/sec: [20253.57]; learningRate: [0.01092213763943002]
13:24:03.599 [VectorCalculationsThread 0] INFO  o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [111279196];  Lines vectorized so far: [300000]; Seq/sec: [89.61]; Words/sec: [22098.93]; learningRate: [0.005265719332773217]

Then nothing happens: CPU usage sits at 0.3% and memory at 52% (with -Xmx10000m in the Java options).

I have already used the same classes and hardware with an input file of 17,984 rows, and everything worked well.
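
When a run stalls like this, a thread dump helps distinguish a true lock deadlock from worker threads that are merely parked waiting for input (which is what the eventual diagnosis below suggests). A minimal sketch using the standard java.lang.management API; this is generic JVM diagnostics, not part of dl4j, and running jstack <pid> from outside the process gives the same information:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    // Call this from a watchdog thread inside the stuck JVM.
    public static void dump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Returns ids of threads deadlocked on monitors/ownable synchronizers, or null.
        long[] ids = bean.findDeadlockedThreads();
        if (ids == null) {
            System.out.println("No lock deadlock; threads are likely parked waiting for work.");
            return;
        }
        for (ThreadInfo info : bean.getThreadInfo(ids, Integer.MAX_VALUE)) {
            System.out.println(info);
        }
    }
}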

[MyPreprocessor]

import org.apache.commons.lang3.StringUtils;
import org.deeplearning4j.text.tokenization.tokenizer.TokenPreProcess;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.StringCleaning;

public class MyPreprocessor implements TokenPreProcess {
    public MyPreprocessor() { }

    @Override
    public String preProcess(String token) {

        // Strip punctuation and lowercase
        token = StringCleaning.stripPunct(token).toLowerCase();

        // Remove accents
        token = StringUtils.stripAccents(token);

        // Drop anything that is not alphanumeric or a space
        token = token.replaceAll("[^A-Za-z0-9 ]", "");

        // Require at least one letter
        if (!token.matches(".*[a-zA-Z]+.*")) return "";

        // Drop one-character "words"
        if (token.length() <= 1) return "";

        return token;
    }
}
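
For context, a minimal sketch of how a preprocessor like this is typically wired into DefaultTokenizerFactory and ParagraphVectors; the file path and hyperparameters here are placeholders, not the reporter's actual settings:

import java.io.File;

import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class TrainParagraphVectors {
    public static void main(String[] args) throws Exception {
        // Placeholder corpus path: one document per line
        SentenceIterator iter = new BasicLineIterator(new File("corpus.txt"));

        TokenizerFactory tf = new DefaultTokenizerFactory();
        // Every token produced by the tokenizer passes through preProcess()
        tf.setTokenPreProcessor(new MyPreprocessor());

        ParagraphVectors vec = new ParagraphVectors.Builder()
                .minWordFrequency(5)   // placeholder hyperparameters
                .layerSize(100)
                .iterate(iter)
                .tokenizerFactory(tf)
                .build();
        vec.fit();
    }
}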

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 21 (9 by maintainers)

Most upvoted comments

To reproduce on dl4j-examples, open the file Word2VecRawTextExample, change iterations(1) to epochs(100), recompile, and run the example.

dl4j-examples/src/main/java/org/deeplearning4j/examples/nlp/word2vec/Word2VecRawTextExample.java
@@ -45,7 +45,7 @@ public class Word2VecRawTextExample {
         log.info("Building model....");
         Word2Vec vec = new Word2Vec.Builder()
                 .minWordFrequency(5)
-                .iterations(1)
+                .epochs(100)
                 .layerSize(100)
                 .seed(42)
                 .windowSize(5)

Deadlock will look something like:

o.d.e.n.w.Word2VecRawTextExample - Load & Vectorize Sentences....
o.d.e.n.w.Word2VecRawTextExample - Building model....
o.n.l.f.Nd4jBackend - Loaded [CpuBackend] backend
o.n.n.NativeOpsHolder - Number of threads used for NativeOps: 4
o.n.n.Nd4jBlas - Number of threads used for BLAS: 4
o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CPU]; OS: [Linux]
o.n.l.a.o.e.DefaultOpExecutioner - Cores: [8]; Memory: [7.0GB];
o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [OPENBLAS]
o.d.e.n.w.Word2VecRawTextExample - Fitting Word2Vec model....
o.d.m.s.SequenceVectors - Starting vocabulary building...
o.d.m.w.w.VocabConstructor - Sequences checked: [97162], Current vocabulary size: [242]; Sequences/sec: [20369.39];
o.d.m.e.l.WordVectorSerializer - Projected memory use for model: [0.18 MB]
o.d.m.e.i.InMemoryLookupTable - Initializing syn1...
o.d.m.s.SequenceVectors - Building learning algorithms:
o.d.m.s.SequenceVectors -           building ElementsLearningAlgorithm: [SkipGram]
o.d.m.s.SequenceVectors - Starting learning process...
o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [2]; Words vectorized so far: [634299];  Lines vectorized so far: [97161]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [3]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [4]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [5]; Words vectorized so far: [634296];  Lines vectorized so far: [97161]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [6]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [7]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [8]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [9]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [10]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [11]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [12]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]
o.d.m.s.SequenceVectors - Epoch: [13]; Words vectorized so far: [634303];  Lines vectorized so far: [97162]; learningRate: [1.0E-4]

  1. Closing this since the issue was fixed long ago.
  2. Not necessary; it depends on corpus size etc. And no, an epoch ends once the training corpus ends: 1 epoch = 1 pass over the training corpus (see the sketch below).
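
For context (my gloss, not part of the thread): in the Word2Vec.Builder from the example above, iterations(n) controls passes over each batch of sequences, while epochs(m) controls full passes over the whole corpus, so the two knobs are not interchangeable:

// iter and t are the SentenceIterator and TokenizerFactory
// already defined in Word2VecRawTextExample.
Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)
        .iterations(1)   // passes over each batch of sequences
        .epochs(5)       // full passes over the entire training corpus
        .layerSize(100)
        .seed(42)
        .windowSize(5)
        .iterate(iter)
        .tokenizerFactory(t)
        .build();
vec.fit();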

Updates here?

That’s bad 😦

Okay, I'll check that part of the code too.

As discussed on Gitter, the issue is a preprocessor that returns empty strings as tokens, making w2v very sad.
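
Given that diagnosis, a minimal workaround sketch (an illustration only, not the actual fix that landed upstream): never return an empty string from preProcess; collapse junk tokens into a single sentinel instead, so the pipeline never sees zero-length tokens.

import org.apache.commons.lang3.StringUtils;
import org.deeplearning4j.text.tokenization.tokenizer.TokenPreProcess;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.StringCleaning;

public class SafePreprocessor implements TokenPreProcess {
    // Hypothetical sentinel: junk collapses into one vocab entry
    // instead of becoming an empty-string token.
    private static final String JUNK = "__junk__";

    @Override
    public String preProcess(String token) {
        String cleaned = StringUtils.stripAccents(
                StringCleaning.stripPunct(token).toLowerCase())
                .replaceAll("[^a-z0-9 ]", "");
        // Empty tokens are what stalled training in this issue; never emit them.
        if (cleaned.length() <= 1 || !cleaned.matches(".*[a-z].*")) {
            return JUNK;
        }
        return cleaned;
    }
}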