deeplearning4j: rnnTimeStep causes fatal error on trained model?
Not sure if this belongs here or under nd4j issues, but I’ll try to describe stepwise what’s happening:
Step 1: A model and configuration were found and stored using the Arbiter example (which created the arbiter.model and arbiter-conf.json files)
Step 2: The model is loaded and evaluated (custom code where rnnTimeStep is called with no issues)
Step 3: The stored configuration file is loaded and a new model is trained using the same data as in Step 1
Step 4: The trained model is evaluated in the same manner as in Step 2, but on the first call to rnnTimeStep the following error occurs.
I’ve traced it a little in the MultiLayerNetwork, GravesLSTM and LSTMHelpers classes.
In MultiLayerNetwork it happens on the 2nd iteration of method
public INDArray rnnTimeStep(INDArray input)
at line
input = ((RecurrentLayer)this.layers[i]).rnnTimeStep(input);
In the GravesLSTM it happens at
private FwdPassReturn activateHelper(boolean training, INDArray prevOutputActivations, INDArray prevMemCellState, boolean forBackprop)
when calling
return LSTMHelpers.activateHelper(...);
In LSTMHelpers it happens in the method
public static FwdPassReturn activateHelper(...)
when calling
INDArray recurrentWeightsIFOG = recurrentWeights.get(new INDArrayIndex[] { NDArrayIndex.all(), NDArrayIndex.interval(0, 4 * hiddenLayerSize) }).dup('f');
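To make clear what that failing line asks for, here is a minimal sketch of the same column slice using plain Java arrays instead of INDArray. It assumes the GravesLSTM layout in which the recurrent weight matrix has 4 * hiddenLayerSize + 3 columns (the last three being peephole weights). The class and method names are hypothetical, not DL4J code. It only illustrates the shape the slice expects and is not a diagnosis of the crash.

```java
import java.util.Arrays;

/** Plain-array stand-in for the slice in LSTMHelpers.activateHelper (hypothetical helper, not DL4J code). */
final class RecurrentWeightSlice {

    /**
     * Copies columns [0, 4 * hiddenLayerSize) of every row, mirroring
     * get(NDArrayIndex.all(), NDArrayIndex.interval(0, 4 * hiddenLayerSize)).
     * The upper bound is exclusive, as it is for interval(begin, end).
     */
    static double[][] sliceIfog(double[][] recurrentWeights, int hiddenLayerSize) {
        if (recurrentWeights == null) {
            throw new IllegalArgumentException("recurrentWeights is null");
        }
        int needed = 4 * hiddenLayerSize;
        double[][] out = new double[recurrentWeights.length][];
        for (int r = 0; r < recurrentWeights.length; r++) {
            double[] row = recurrentWeights[r];
            if (row == null || row.length < needed) {
                throw new IllegalArgumentException(
                        "row " + r + " has fewer than " + needed + " columns");
            }
            out[r] = Arrays.copyOfRange(row, 0, needed);
        }
        return out;
    }

    public static void main(String[] args) {
        int hiddenLayerSize = 2;
        // Assumed GravesLSTM shape: [hiddenLayerSize, 4 * hiddenLayerSize + 3]
        double[][] recurrentWeights = new double[hiddenLayerSize][4 * hiddenLayerSize + 3];
        double[][] ifog = sliceIfog(recurrentWeights, hiddenLayerSize);
        System.out.println(ifog.length + " x " + ifog[0].length); // prints "2 x 8"
    }
}
```

The explicit null and width checks are there because a missing or wrongly shaped weight array is one thing that could go wrong before such a slice. Whether that is what happens in this issue is not established here.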
What’s strange is that it happens only when loading the previously trained and stored model using the configuration file. Loading the stored model from Arbiter does not cause this issue.
I also noticed that when the training in Step 3 is finished and the model is stored, the Java process keeps running (note that each of the Steps is a separate process/Java file, run individually one after the other). The Step 3 process has to be killed manually (another issue, maybe?)
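One way to investigate a process that does not exit is to list the threads that are still alive and not marked as daemon, since the JVM only shuts down on its own once all non-daemon threads have finished. The sketch below uses only the standard library. The class name is hypothetical, and the idea that a lingering non-daemon thread is the cause here is an assumption that the issue does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper: reports threads that can keep the JVM alive. */
final class LingeringThreads {

    /** Returns the names of live, non-daemon threads other than the calling thread. */
    static List<String> liveNonDaemonThreadNames() {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && !t.isDaemon() && t != Thread.currentThread()) {
                names.add(t.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Intended to be called as the last statement of the Step 3 program,
        // after the model has been stored.
        System.out.println("Live non-daemon threads: " + liveNonDaemonThreadNames());
    }
}
```

If the list is not empty at the end of Step 3, the thread names may hint at which component (for example an executor or a UI server) is keeping the process alive.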
Unfortunately, I don’t have a runnable example code which I can share through gist at the moment. 😕
About this issue
- State: closed
- Created 8 years ago
- Comments: 32 (18 by maintainers)
Interesting. Null pointer leaked into nativeOps?