LightGBM: JVM crash caused by LGBM_DatasetCreateFromCSRSpark
I’m seeing JVM crashes in our Spark cluster which I believe are caused by LGBM_DatasetCreateFromCSRSpark.
https://github.com/microsoft/LightGBM/issues/2360 indicated some issues in that method, so I set up a local stress test that just invokes a tiny Java application in a loop. Doing this I was quite easily (more than 1 out of 100 runs) able to elicit a JVM crash. It’s not exactly the crash I’m seeing in our Spark cluster, but I think the two are probably related.
Under the same local stress test, calling LGBM_DatasetCreateFromMat does not lead to crashes.
I added some additional logging to LGBM_DatasetCreateFromCSRSpark and recompiled locally to help diagnose where the issue occurs: in lightgbmlib.i, I added a log line right before each JNI call.
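For reference, the instrumentation looked roughly like this. This is a sketch, not the exact upstream source: the JNI method signatures for indices() and values(), the variable names, and the loop structure are my assumptions, reconstructed from the warning messages in the logs below.

```cpp
// Sketch of the instrumented wrapper body in lightgbmlib.i (not verbatim):
// one Log::Warning immediately before each JNI call, so the last warning
// printed identifies the call that was in flight when the crash happens.
LightGBM::Log::Warning("FindClass");
jclass spVecClass = jenv->FindClass("org/apache/spark/ml/linalg/SparseVector");
LightGBM::Log::Warning("GetMethodID index");
jmethodID indicesId = jenv->GetMethodID(spVecClass, "indices", "()[I");  // assumed signature
LightGBM::Log::Warning("GetMethodID values");
jmethodID valuesId = jenv->GetMethodID(spVecClass, "values", "()[D");    // assumed signature
for (int i = 0; i < num_rows; ++i) {
  LightGBM::Log::Warning("GetObjectArrayElement");
  jobject row = jenv->GetObjectArrayElement(sparseRows, i);
  LightGBM::Log::Warning("CallObjectMethod index");
  jintArray indices = (jintArray)jenv->CallObjectMethod(row, indicesId);
  LightGBM::Log::Warning("CallObjectMethod values");
  jdoubleArray values = (jdoubleArray)jenv->CallObjectMethod(row, valuesId);
  LightGBM::Log::Warning("GetArrayLength; %d", (int)(intptr_t)indices);  // also print the handle value
  jsize nnz = jenv->GetArrayLength(indices);  // the FATAL ERROR below fires inside this call
  // ... GetIntArrayElements / GetDoubleArrayElements follow per row
}
```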
Example logs from a good run
BEGIN PROGRAM
[Dynamic-linking native method java.lang.Package.getSystemPackage0 ... JNI]
[Dynamic-linking native method sun.reflect.NativeMethodAccessorImpl.invoke0 ... JNI]
[Dynamic-linking native method java.lang.Class.isInstance ... JNI]
[Dynamic-linking native method java.security.AccessController.doPrivileged ... JNI]
[Dynamic-linking native method java.lang.System.identityHashCode ... JNI]
[Dynamic-linking native method com.microsoft.ml.lightgbm.lightgbmlibJNI.voidpp_handle ... JNI]
[Dynamic-linking native method com.microsoft.ml.lightgbm.lightgbmlibJNI.new_longp ... JNI]
[Dynamic-linking native method com.microsoft.ml.lightgbm.lightgbmlibJNI.longp_assign ... JNI]
[Dynamic-linking native method com.microsoft.ml.lightgbm.lightgbmlibJNI.long_to_int64_t_ptr ... JNI]
[Dynamic-linking native method com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_DatasetCreateFromCSRSpark ... JNI]
[LightGBM] [Warning] FindClass
[LightGBM] [Warning] GetMethodID index
[LightGBM] [Warning] GetMethodID values
[LightGBM] [Warning] reserve rows: 10
[LightGBM] [Warning] num cols: 79
[LightGBM] [Warning] GetObjectArrayElement
[LightGBM] [Warning] CallObjectMethod index
[LightGBM] [Warning] CallObjectMethod values
[LightGBM] [Warning] GetArrayLength; -1077856400
[LightGBM] [Warning] GetArrayLength: 0
[LightGBM] [Warning] GetIntArrayElements
[LightGBM] [Warning] GetDoubleArrayElements
[ repeated 10 times]
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
[LightGBM] [Warning] return 0
[Dynamic-linking native method com.microsoft.ml.lightgbm.lightgbmlibJNI.voidpp_value ... JNI]
Done
Crashing execution
Everything looks similar to the good run except we end up with a crash. Most of the time there are extra “Dynamic-linking native method” lines, but I’m not sure whether they are a cause or an effect.
[snip]
[LightGBM] [Warning] FindClass
[LightGBM] [Warning] GetMethodID index
[LightGBM] [Warning] GetMethodID values
[LightGBM] [Warning] reserve rows: 10
[LightGBM] [Warning] num cols: 79
[LightGBM] [Warning] GetObjectArrayElement
[LightGBM] [Warning] CallObjectMethod index
[Dynamic-linking native method sun.reflect.ConstantPool.getUTF8At0 ... JNI]
[Dynamic-linking native method java.lang.reflect.Proxy.defineClass0 ... JNI]
[Dynamic-linking native method java.util.TimeZone.getSystemTimeZoneID ... JNI]
[LightGBM] [Warning] CallObjectMethod values
[LightGBM] [Warning] GetArrayLength; 1045488784
FATAL ERROR in native method: Non-array passed to JNI array operations
at com.microsoft.ml.lightgbm.lightgbmlibJNI.LGBM_DatasetCreateFromCSRSpark(Native Method)
at com.microsoft.ml.lightgbm.lightgbmlib.LGBM_DatasetCreateFromCSRSpark(lightgbmlib.java:286)
at program.generateSparseDataset(program.java:104)
at program.main(program.java:136)
So the JVM is telling us that indices is not a valid array, but it is still a valid jobject, which allows us to inspect it. What I’m observing here makes no sense to me, however: the type of the object varies on each crash; sometimes it is a String, sometimes a Class.
When it’s a String, it tends to be
{"type":0,"size":79,"indices":[],"values":[]}, which is the result of calling .toJson() on the sparse vector.
When it’s a Class, it tends to be
org.apache.spark.ml.linalg.SparseVector, which is the class we were supposed to call the method on.
All this seems to indicate that we’re somehow calling the wrong method here.
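To make that symptom concrete, here is a hypothetical pure-Java analogue. MiniSparseVector, describe(), and the method choices are invented for illustration: if the method actually invoked is toJson() (or the handle refers to the class itself) rather than indices(), the object handed back is a String or a Class rather than an int[], matching what the inspected jobject shows after each crash.

```java
import java.lang.reflect.Method;

public class WrongMethodDemo {
    // Stand-in for org.apache.spark.ml.linalg.SparseVector.
    public static class MiniSparseVector {
        public int[] indices() { return new int[0]; }
        public String toJson() { return "{\"type\":0,\"size\":79,\"indices\":[],\"values\":[]}"; }
    }

    // Classify an object the way the post-crash inspection did.
    public static String describe(Object o) {
        if (o instanceof int[]) return "int[]";
        if (o instanceof String) return "String";
        if (o instanceof Class) return "Class";
        return o.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        MiniSparseVector v = new MiniSparseVector();
        Method right = MiniSparseVector.class.getMethod("indices");
        Method wrong = MiniSparseVector.class.getMethod("toJson");
        System.out.println(describe(right.invoke(v))); // the expected int[]
        System.out.println(describe(wrong.invoke(v))); // a String, like the toJson() case above
        System.out.println(describe(v.getClass()));    // a Class, like the other case above
    }
}
```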
Environment
The Spark cluster is the one described in https://github.com/microsoft/LightGBM/issues/2360.
The local test ran on OS X 10.14.6 using a local build of LightGBM and lightgbmlib.jar.
Program to reproduce
import com.microsoft.ml.lightgbm.*;
import org.apache.spark.mllib.linalg.SparseVector;
import java.util.ArrayList;
public class program {
private static void validate(int result, String component) throws Exception {
if (result == -1) {
throw new Exception(component + " call failed in LightGBM with error: " + lightgbmlib.LGBM_GetLastError());
}
}
private static SWIGTYPE_p_int64_t intToPtr(int value) {
SWIGTYPE_p_long longPtr = lightgbmlib.new_longp();
lightgbmlib.longp_assign(longPtr, value);
return lightgbmlib.long_to_int64_t_ptr(longPtr);
}
private static SWIGTYPE_p_void generateSparseDataset(SparseVector[] sparseRows) throws Exception {
// https://github.com/Azure/mmlspark/blob/360f2f7d8116a931bf373874cd558c43d7d98973/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMUtils.scala#L240
int numCols = sparseRows[0].size();
SWIGTYPE_p_p_void datasetOutPtr = lightgbmlib.voidpp_handle();
String datasetParams = "max_bin=255 is_pre_partition=True";
// Generate the dataset for features
validate(lightgbmlib.LGBM_DatasetCreateFromCSRSpark(
sparseRows,
sparseRows.length,
intToPtr(numCols),
datasetParams,
null,
datasetOutPtr),
"Dataset create");
return lightgbmlib.voidpp_value(datasetOutPtr);
}
public static void main(String[] args) throws Exception {
System.out.println("BEGIN PROGRAM");
try {
System.load("/full_path_to_lib/lib_lightgbm.dylib");
System.load("/full_path_to_lib/lib_lightgbm_swig.jnilib");
} catch (UnsatisfiedLinkError e) {
System.err.println(e.getMessage());
e.printStackTrace();
return;
}
int numRow = 10;
int numCols = 79;
SparseVector[] rows = new SparseVector[numRow];
for (int i = 0; i < numRow; i++) {
rows[i] = new SparseVector(numCols, new int[0], new double[0]);
}
SWIGTYPE_p_void dataset = generateSparseDataset(rows);
System.out.println("Done");
}
}
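Since LGBM_DatasetCreateFromMat did not crash under the same stress test, one possible workaround is to densify the sparse rows and go through the dense entry point instead. The sketch below covers only the pure-Java flattening step; densify and flattenRowMajor are helpers invented here, and the resulting row-major buffer would still need to be copied into a SWIG double array before calling LGBM_DatasetCreateFromMat.

```java
public class DensifyDemo {
    // Expand one sparse row (size, indices, values) into a dense double[size].
    // Invented helper for this sketch, not part of lightgbmlib.
    public static double[] densify(int size, int[] indices, double[] values) {
        double[] dense = new double[size]; // zero-initialized by the JVM
        for (int i = 0; i < indices.length; i++) {
            dense[indices[i]] = values[i];
        }
        return dense;
    }

    // Flatten all rows into one row-major buffer, the layout expected by
    // LGBM_DatasetCreateFromMat when is_row_major is set.
    public static double[] flattenRowMajor(int numCols, int[][] indices, double[][] values) {
        double[] flat = new double[indices.length * numCols];
        for (int r = 0; r < indices.length; r++) {
            double[] row = densify(numCols, indices[r], values[r]);
            System.arraycopy(row, 0, flat, r * numCols, numCols);
        }
        return flat;
    }
}
```

For real SparseVector rows, the indices/values pairs would come from the vector's own accessors, and the memory cost of densifying (numRows × numCols doubles) is the trade-off against the crashing sparse path.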
Gradle build file
plugins {
java
}
group = "whatever"
version = "1.0-SNAPSHOT"
repositories {
mavenCentral()
}
dependencies {
implementation(files("libs/lightgbmlib.jar"))
implementation("org.apache.spark", "spark-mllib_2.12", "2.4.3")
testCompile("junit", "junit", "4.12")
}
configure<JavaPluginConvention> {
sourceCompatibility = JavaVersion.VERSION_1_8
}
Example crash from our Spark cluster
19/08/26 21:19:22 INFO LightGBMRanker: LightGBM worker listening on: 12407
19/08/26 21:19:34 INFO LightGBMRanker: LightGBM worker generating sparse dataset with 418231 rows and 79 columns
19/08/26 21:19:34 INFO LightGBMRanker: LightGBM worker generating sparse dataset with 418052 rows and 79 columns
19/08/26 21:19:34 INFO LightGBMRanker: LightGBM worker generating sparse dataset with 420478 rows and 79 columns
19/08/26 21:19:34 INFO LightGBMRanker: LightGBM worker generating sparse dataset with 423064 rows and 79 columns
*** Error in `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java': double free or corruption (!prev): 0x0000000002c65ab0 ***
*** Error in `/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java': corrupted size vs. prev_size: 0x00007f10e2b234d0 ***
======= Backtrace: =========
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f118a6c07e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f118a6c07e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f118a6c937a]
/lib/x86_64-linux-gnu/libc.so.6(+0x7e9dc)[0x7f118a6c79dc]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f118a6cd53c]
/lib/x86_64-linux-gnu/libc.so.6(+0x81cde)[0x7f118a6cacde]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM13DCGCalculator4InitERKSt6vectorIdSaIdEE+0x568)[0x7f112343e3d8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f118a6cd184]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM17ObjectiveFunction23CreateObjectiveFunctionERKSsRKNS_6ConfigE+0x8e3)[0x7f11234a5123]
/databricks/python/lib/libstdc++.so.6(_Znwm+0x16)[0x7f118ad6f097]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM7Booster25CreateObjectiveAndMetricsEv+0x21)[0x7f1123323051]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM13DCGCalculator4InitERKSt6vectorIdSaIdEE+0x463)[0x7f112343e2d3]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(LGBM_BoosterCreate+0x163)[0x7f1123314e13]
/local_disk0/tmp/mml-natives5905437768242259451/lib_lightgbm.so(_ZN8LightGBM17ObjectiveFunction23CreateObjectiveFunctionERKSsRKNS_6ConfigE+0x8e3)[0x7f11234a5123]
/local_disk0/tmp/mml-natives8293617996259236758/lib_lightgbm_swig.so(Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1BoosterCreate+0x3f)[0x7f112309be7f]
[0x7f11739f1407]
======= Memory map: ========
About this issue
- Original URL
- State: open
- Created 5 years ago
- Comments: 18 (14 by maintainers)
I’ve been using dense vectors to avoid this method in my current work; I’ll try it again later and report back.
On Linux I’m able to reproduce the issue with -Xcheck:jni but not without it, which is interesting.
One other difference is that the failure occurs when calling GetDoubleArrayElements instead of GetIntArrayElements or GetArrayLength.