deeplearning4j: rnnTimeStep causes fatal error on trained model?

Not sure if this belongs here or under nd4j issues, but I’ll try to describe stepwise what’s happening:

Step 1: A model and configuration were found and stored using the Arbiter example (producing the files arbiter.model and arbiter-conf.json).

Step 2: The model is loaded and evaluated (custom code where rnnTimeStep is called with no issues)

Step 3: The stored configuration file is loaded and a new model is trained on the same data as in Step 1.

Step 4: The trained model is being evaluated in the same manner as Step 2, but when it reaches the first call of rnnTimeStep the following error happens.

I’ve traced it a little in the MultiLayerNetwork, GravesLSTM and LSTMHelpers classes.

In MultiLayerNetwork, it happens on the second iteration inside the method public INDArray rnnTimeStep(INDArray input), at the line input = ((RecurrentLayer)this.layers[i]).rnnTimeStep(input);

In GravesLSTM, it happens in private FwdPassReturn activateHelper(boolean training, INDArray prevOutputActivations, INDArray prevMemCellState, boolean forBackprop), at the call return LSTMHelpers.activateHelper(...);

In LSTMHelpers it happens in the method public static FwdPassReturn activateHelper(...) when calling INDArray recurrentWeightsIFOG = recurrentWeights.get(new INDArrayIndex[] { NDArrayIndex.all(), NDArrayIndex.interval(0, 4 * hiddenLayerSize) }).dup('f');
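To make that last statement concrete, here is a plain-Java sketch (no ND4J involved) of what the expression computes: it takes all rows and the first 4 * hiddenLayerSize columns of the recurrent weight matrix and copies them in column-major ('f') order. The class and method names are hypothetical, and the matrix shape of [hiddenLayerSize, 4 * hiddenLayerSize + 3] is an assumption about GravesLSTM (three extra columns for peephole weights), not something stated in this report.

```java
class RecurrentWeightSliceSketch {

    /**
     * Copies columns [0, 4 * hiddenLayerSize) of every row into a flat array
     * laid out column by column, mirroring get(all(), interval(0, 4H)).dup('f').
     */
    static double[] sliceIfogColumnMajor(double[][] recurrentWeights, int hiddenLayerSize) {
        int rows = recurrentWeights.length;
        int cols = 4 * hiddenLayerSize;
        double[] out = new double[rows * cols];
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < rows; r++) {
                // Column-major: consecutive elements walk down a column.
                out[c * rows + r] = recurrentWeights[r][c];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int h = 2;                                // hypothetical hidden layer size
        double[][] w = new double[h][4 * h + 3];  // assumed GravesLSTM shape
        for (int r = 0; r < h; r++) {
            for (int c = 0; c < w[r].length; c++) {
                w[r][c] = r * 100 + c;            // recognisable dummy values
            }
        }
        double[] ifog = sliceIfogColumnMajor(w, h);
        System.out.println("Copied " + ifog.length + " values");
    }
}
```

The statement only reads the existing weight buffer and copies part of it, so a crash at this point would suggest that the underlying buffer of recurrentWeights is invalid rather than that the indexing itself is wrong.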

What’s strange is that it happens only with the model that was trained from the stored configuration file. Loading the model stored by Arbiter does not cause this issue.

What I also noticed is that when the training in Step 3 has finished and the model has been stored, the Java process keeps running (note that each of the steps is a separate process/Java file, run individually one after the other). The Step 3 process has to be killed manually (another issue, maybe?).
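A JVM exits on its own only once every non-daemon thread has finished, so one possible (unconfirmed) explanation for the lingering Step 3 process is a background worker thread that is never shut down. The sketch below is plain Java with hypothetical names; the thread pool is only a stand-in, and nothing here is taken from the actual Arbiter or deeplearning4j code. It shows how to list the threads that would keep a process alive and how shutting a pool down releases them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class LingeringThreadSketch {

    /** Names of live non-daemon threads; any of these keeps the JVM running after main() returns. */
    static List<String> nonDaemonThreadNames() {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && !t.isDaemon()) {
                names.add(t.getName());
            }
        }
        return names;
    }

    /** Shuts the pool down and waits briefly; returns true if it terminated in time. */
    static boolean shutDownCleanly(ExecutorService pool) {
        pool.shutdown();
        try {
            return pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a background worker started during training.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.submit(() -> { });

        // Diagnostic: printing this at the end of the Step 3 program shows what is still alive.
        System.out.println("Non-daemon threads: " + nonDaemonThreadNames());

        // Without this call the idle worker thread would keep the process running.
        System.out.println("Pool terminated: " + shutDownCleanly(pool));
    }
}
```

A thread dump of the stuck process (for example with jstack) gives the same information without changing the code.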

Unfortunately, I don’t have a runnable example code which I can share through gist at the moment. 😕

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 32 (18 by maintainers)

Most upvoted comments

Interesting. Null pointer leaked into nativeOps?