deeplearning4j: ImageRecordReader crashes JVM with loaded Keras model in 1.0.0-beta7

Issue Description

I encountered a strange problem in 1.0.0-beta7 while trying to run a Keras model loaded from a .h5 file (e.g., VGG16.h5 from here) - this model previously ran fine in 1.0.0-beta6.

Calling computationGraph.feedForward(features, false) would crash the JVM (see the error log), using this code snippet:

// Create VGG16 from a Keras .h5 file
ComputationGraph tmpModel = KerasModelImport.importKerasModelAndWeights("VGG16.h5");
tmpModel.init();

ImageRecordReader reader = new ImageRecordReader(224, 224, 3);
reader.initialize(new FileSplit(new File("img_125_5.jpg"))); // Test with a single image
DataSetIterator it = new RecordReaderDataSetIterator(reader, 1);

// Keras model has wrong channel order, so flip it at the reader level
reader.setNchw_channels_first(false);

INDArray features = it.next().getFeatures();
// INDArray features = Nd4j.rand(1, 224, 224, 3); // Runs fine when initializing from random array of same size

System.out.println(Arrays.toString(features.shape())); // prints [1, 224, 224, 3]

tmpModel.feedForward(features, false);

The crash happens specifically within the ComputationGraph class at line 1976 - I figured this out by stepping through the code in IntelliJ.

Strangely though, the code snippet above runs fine if you use a random INDArray of the same shape (so the issue isn't caused by the shape of the features). Looking at the values of the features returned by the DataSetIterator, there aren't any NaNs or other weird values (all are between 0 and 1).
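For reference, here's a minimal sketch of how such a check can be done with standard ND4J calls (BooleanIndexing and Conditions live in org.nd4j.linalg.indexing / org.nd4j.linalg.indexing.conditions; this is illustrative, not the exact check I ran):

// Continuing from the snippet above ("features" is the array returned by it.next().getFeatures())
System.out.println("features min = " + features.minNumber() + ", max = " + features.maxNumber());
// BooleanIndexing.or(...) returns true if any element matches the given condition (NaN here)
System.out.println("contains NaN: " + BooleanIndexing.or(features, Conditions.isNan()));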

Also interesting to note: the .h5 model can be saved to a zip in beta6 using model.save(new File("VGG.zip")), then loaded in beta7, and the above snippet works fine (swapping the KerasModelImport call for ComputationGraph.load(new File("beta6KerasVGG.zip"), true)).
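For concreteness, the workaround looks roughly like this (a sketch; the file names just mirror the examples above):

// In a project on 1.0.0-beta6: import the Keras .h5 once and save it in DL4J's native zip format
ComputationGraph imported = KerasModelImport.importKerasModelAndWeights("VGG16.h5");
imported.save(new File("beta6KerasVGG.zip"));

// In the 1.0.0-beta7 project: load the saved graph instead of re-importing the .h5 file
ComputationGraph tmpModel = ComputationGraph.load(new File("beta6KerasVGG.zip"), true);

The resulting tmpModel then drops into the rest of the snippet unchanged.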

Another note: the above snippet works fine with a different model (e.g., ResNet50.h5), so the problem does not occur with all Keras models.

Conclusion

On the one hand, it seems like the problem is caused by changes to the KerasModelImport process - a .h5 file that loaded and ran fine in 1.0.0-beta6 no longer works in 1.0.0-beta7. Additionally, saving the beta6-imported model to a .zip and loading it as a new ComputationGraph in beta7 circumvents the problem.

However, it also seems like the ImageRecordReader or DataSetIterator could be the culprit - when those are taken out of the equation (by using a random INDArray), no errors occur.
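That control case is just the commented-out line from the snippet above, spelled out (reusing tmpModel from that snippet):

// Control case: a random NHWC array of the same shape, bypassing ImageRecordReader/DataSetIterator
INDArray randomFeatures = Nd4j.rand(1, 224, 224, 3);
tmpModel.feedForward(randomFeatures, false); // runs fine on beta7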

Attached files

img_125_5

Version Information

  • Deeplearning4j version - 1.0.0-beta7
  • Platform information (OS, etc) - Ubuntu 18.04
  • CUDA version, if used
  • NVIDIA driver version, if in use

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 15 (11 by maintainers)

Most upvoted comments

I’ve made a simple Gradle project to demonstrate this and help you reproduce it.

Instructions

  1. Download and unzip the project file from Google Drive: Link
  2. Open/Import the project in IntelliJ (or your IDE of choice). Let your IDE download the relevant dependencies
  3. Run the main() method in Main.java. The project is initially configured for beta6, so main() should complete successfully.
  4. In build.gradle, change the nd4j and dl4j versions from 1.0.0-beta6 to 1.0.0-beta7. Let your IDE import these changes.
  5. Run main() again. This should now crash the program: a JVM crash on Ubuntu 18.04 (log file attached) and a nondescript Gradle error on Windows 10.

In Main.java, I've also included some different scenarios that I tried while debugging the issue; most notable is Scenario 3, which is the duplicating fix mentioned above.
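If that fix just means copying the features into a fresh array before the forward pass (my rough reading of it here - the exact code is in Scenario 3 of Main.java), it would look something like:

// Sketch of the duplication approach (see Scenario 3 in Main.java for the actual code)
INDArray copied = features.dup(); // dup() returns a fresh copy of the underlying data
tmpModel.feedForward(copied, false);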

Hopefully this can be reproduced on your machine; let me know if there's any other info you'd like 😃

Attached Files

hs_err_pid17974.log