deeplearning4j: Deadlock using ParagraphVector
Hi, I'm trying to create a word embedding (I use ParagraphVector) with:
- 0.8.1-SNAPSHOT
- Ubuntu 16.04.2 LTS
- Hardware RAM 16 GB, Intel® Xeon® CPU E5-2620 v2 @ 2.10GHz
- Input file contains 513316 lines, 873.5 MB in total. Each training line is a 7-bit ASCII string with a length roughly in the range [10, 50 000].
DefaultTokenizerFactory with custom MyPreProcessor
The log is:
11:35:10.937 [Thread-11] INFO o.d.m.s.SequenceVectors - Starting vocabulary building...
11:35:10.938 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
11:35:10.991 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Trying source iterator: [0]
11:35:10.991 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Target vocab size before building: [0]
11:35:56.217 [Thread-11] INFO o.d.m.w.wordstore.VocabConstructor - Sequences checked: [100000]; Current vocabulary size: [385141]; Sequences/sec: 2208.58; Words/sec: 962757.81;
11:58:56.372 [Thread-11] INFO o.d.m.w.wordstore.VocabConstructor - Sequences checked: [200000]; Current vocabulary size: [588558]; Sequences/sec: 72.46; Words/sec: 26455.10;
11:59:41.377 [Thread-11] INFO o.d.m.w.wordstore.VocabConstructor - Sequences checked: [300000]; Current vocabulary size: [667109]; Sequences/sec: 2221.98; Words/sec: 902804.93;
11:59:59.203 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Wating till all processes stop...
11:59:59.204 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size before truncation: [764577], NumWords: [142093275], sequences parsed: [366918], counter: [142093265]
11:59:59.658 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Scavenger: Words before: 764577; Words after: 342429;
11:59:59.658 [Thread-11] DEBUG o.d.m.w.wordstore.VocabConstructor - Vocab size after truncation: [342429], NumWords: [140970370], sequences parsed: [366918], counter: [142093265]
12:00:07.468 [Thread-11] INFO o.d.m.w.wordstore.VocabConstructor - Sequences checked: [366918], Current vocabulary size: [342429]; Sequences/sec: [245.18];
12:00:07.486 [Thread-11] INFO o.d.m.e.loader.WordVectorSerializer - Projected memory use for model: [391.88 MB]
12:00:07.806 [Thread-11] INFO o.d.m.e.inmemory.InMemoryLookupTable - Initializing syn1...
12:00:08.038 [Thread-11] INFO o.d.m.s.SequenceVectors - Building learning algorithms:
12:00:08.038 [Thread-11] INFO o.d.m.s.SequenceVectors - building ElementsLearningAlgorithm: [SkipGram]
12:00:08.042 [Thread-11] INFO o.d.m.s.SequenceVectors - Starting learning process...
12:24:22.518 [VectorCalculationsThread 5] INFO o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [43254792]; Lines vectorized so far: [100000]; Seq/sec: [68.76]; Words/sec: [29740.19]; learningRate: [0.01732922544852691]
13:05:27.656 [VectorCalculationsThread 10] INFO o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [79385098]; Lines vectorized so far: [200000]; Seq/sec: [40.57]; Words/sec: [20253.57]; learningRate: [0.01092213763943002]
13:24:03.599 [VectorCalculationsThread 0] INFO o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [111279196]; Lines vectorized so far: [300000]; Seq/sec: [89.61]; Words/sec: [22098.93]; learningRate: [0.005265719332773217]
Then nothing happened: CPU usage is 0.3% and memory is at 52% (-Xmx10000m in the Java options).
I already used the same classes and hardware with an input file of 17984 rows, and everything worked well.
[MyPreProcessor]
public class MyPreprocessor implements TokenPreProcess {

    public MyPreprocessor() { }

    @Override
    public String preProcess(String token) {
        // Strip punctuation and lowercase
        token = StringCleaning.stripPunct(token).toLowerCase();
        // Strip accents
        token = StringUtils.stripAccents(token);
        // Remove bad chars (keep only alphanumerics and spaces)
        token = token.replaceAll("[^A-Za-z0-9 ]", "");
        // Return an empty token unless it contains at least one letter
        if (!token.matches(".*[a-zA-Z]+.*")) return "";
        // Return an empty token for one-character "words"
        if (token.length() <= 1) return "";
        return token;
    }
}
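To see which inputs make a preprocessor like the one above return an empty string, here is a minimal plain-Java sketch. It is an approximation, not the original class: `EmptyTokenDemo` and `clean` are hypothetical names, and the `StringCleaning.stripPunct` and `StringUtils.stripAccents` steps are left out so the snippet runs without DL4J or commons-lang on the classpath.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for MyPreprocessor: only the pure-Java steps are kept.
// The stripPunct and stripAccents calls from the original are omitted on purpose.
class EmptyTokenDemo {

    static String clean(String token) {
        token = token.toLowerCase();
        // Keep only alphanumerics and spaces
        token = token.replaceAll("[^A-Za-z0-9 ]", "");
        // Tokens without any letter become empty
        if (!token.matches(".*[a-zA-Z]+.*")) return "";
        // One-character tokens become empty
        if (token.length() <= 1) return "";
        return token;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("Hello,", "2017", "--", "a", "x86");
        for (String s : samples) {
            // Brackets make empty results visible in the output
            System.out.println("[" + s + "] -> [" + clean(s) + "]");
        }
    }
}
```

Purely numeric tokens, punctuation-only tokens and single letters all come back as `""`, so a large real-world corpus will very likely contain many of them.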
About this issue
- State: closed
- Created 7 years ago
- Comments: 21 (9 by maintainers)
To reproduce in dl4j-examples, open the file Word2VecRawTextExample, change iterations(1) to epochs(100), recompile and run the example. Deadlock will look something like:
Updates here?
That’s bad 😦
Okay, I’ll check that part of the code too.
As discussed on Gitter, the issue is the preprocessor returning empty strings as tokens, which makes w2v very sad.
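One way to act on that diagnosis is to make sure empty tokens never reach the model. The sketch below shows only the filtering logic in plain Java. `TokenFilter` and `preprocessAndFilter` are hypothetical names, not DL4J API, and the thread does not say how the fix was actually applied. In practice the same check could live in a custom tokenizer or in a one-off corpus-cleaning pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical helper: applies a preprocessing function to each raw token
// and keeps only the results that are non-null and non-empty.
class TokenFilter {

    static List<String> preprocessAndFilter(List<String> rawTokens,
                                            UnaryOperator<String> preProcess) {
        List<String> kept = new ArrayList<>();
        for (String raw : rawTokens) {
            String token = preProcess.apply(raw);
            if (token != null && !token.isEmpty()) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy preprocessor (an assumption): digit-only tokens become empty, others are lowercased
        UnaryOperator<String> toy = t -> t.matches("[0-9]+") ? "" : t.toLowerCase();
        List<String> raw = Arrays.asList("Deadlock", "2017", "ParagraphVector");
        System.out.println(preprocessAndFilter(raw, toy));
    }
}
```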