recommenders: [BUG] Spark smoke test error with Criteo
Description
After upgrading LightGBM and the Spark version, we got this error in the nightly smoke tests. We have been running the same code for a long time without seeing it, so it looks like a performance degradation.
tests/smoke/examples/test_notebooks_pyspark.py .RRRRRF
=================================== FAILURES ===================================
_____________________ test_mmlspark_lightgbm_criteo_smoke ______________________
notebooks = {'als_deep_dive': '/home/recocat/myagent/_work/10/s/examples/02_model_collaborative_filtering/als_deep_dive.ipynb', 'a..._dive': '/home/recocat/myagent/_work/10/s/examples/02_model_collaborative_filtering/cornac_bivae_deep_dive.ipynb', ...}
output_notebook = 'output.ipynb', kernel_name = 'python3'
@pytest.mark.flaky(reruns=5, reruns_delay=2)
@pytest.mark.smoke
@pytest.mark.spark
@pytest.mark.skipif(sys.platform == "win32", reason="Not implemented on Windows")
def test_mmlspark_lightgbm_criteo_smoke(notebooks, output_notebook, kernel_name):
    notebook_path = notebooks["mmlspark_lightgbm_criteo"]
    pm.execute_notebook(
        notebook_path,
        output_notebook,
        kernel_name=kernel_name,
        parameters=dict(DATA_SIZE="sample", NUM_ITERATIONS=50, EARLY_STOPPING_ROUND=10),
    )
    results = sb.read_notebook(output_notebook).scraps.dataframe.set_index("name")[
        "data"
    ]
>   assert results["auc"] == pytest.approx(0.68895, rel=TOL, abs=ABS_TOL)
E assert 0.6292474883613918 == 0.68895 ± 5.0e-02
E + where 0.68895 ± 5.0e-02 = <function approx at 0x7f46b6e30840>(0.68895, rel=0.05, abs=0.05)
E + where <function approx at 0x7f46b6e30840> = pytest.approx
In which platform does it happen?
How do we replicate the issue?
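A rough way to reproduce the failing check outside pytest, using papermill and scrapbook exactly as the smoke test does (the notebook path below is an assumption; adjust it to wherever mmlspark_lightgbm_criteo.ipynb lives in your checkout):

```python
import papermill as pm
import scrapbook as sb

# Assumed location of the notebook inside the recommenders repo; adjust as needed.
notebook_path = "examples/00_quick_start/mmlspark_lightgbm_criteo.ipynb"
output_notebook = "output.ipynb"

# Same parameters the smoke test passes to the notebook.
pm.execute_notebook(
    notebook_path,
    output_notebook,
    kernel_name="python3",
    parameters=dict(DATA_SIZE="sample", NUM_ITERATIONS=50, EARLY_STOPPING_ROUND=10),
)

# Read back the metrics recorded with scrapbook; the test expects AUC ~0.68895 +/- 0.05,
# while the nightly run reported ~0.6292.
results = sb.read_notebook(output_notebook).scraps.dataframe.set_index("name")["data"]
print(results["auc"])
```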
Expected behavior (i.e. solution)
Other Comments
This error is so weird; did LightGBM from SynapseML change somehow? FYI @anargyri @simonzhaoms
This is great @imatiach-msft, thanks for chiming in.
@anargyri I guess this is kind of getting into very specific details, but single dataset mode essentially hands the Spark dataset to native code on one Spark worker and "finishes" the other Spark workers, and the parallelization is then done with multithreaded code in the native layer. The previous behavior would create a native dataset for each worker, and there would be a lot of unnecessary network communication between them instead of internal thread parallelization. So with single dataset mode we have a single LightGBM dataset created per machine, and without it we have as many datasets as Spark workers (so with 1 core per worker and 8 cores, there are 8 LightGBM datasets created). I'm actually surprised that this still gives better accuracy in this case somehow.
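For reference, a minimal sketch of toggling this behavior on the SynapseML estimator (column names, data, and the iteration count here are placeholders, and the exact parameter defaults may differ by SynapseML version):

```python
from synapse.ml.lightgbm import LightGBMClassifier

# Sketch only: useSingleDatasetMode=True builds one native LightGBM dataset per
# machine and parallelizes with threads in native code; False builds one dataset
# per Spark worker/core, with network communication between them.
lgbm = LightGBMClassifier(
    objective="binary",
    labelCol="label",
    featuresCol="features",
    numIterations=50,
    useSingleDatasetMode=True,  # flip to False to compare against per-worker datasets
)
# model = lgbm.fit(train_df)  # train_df is a placeholder Spark DataFrame
```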
Yes, this is right.
I agree it's not a large difference in AUC. I am comparing runs on the same machine with the same code (just flipping the useSingleDatasetMode parameter). I am not sure how many workers are used; it is a single machine. It looks like LightGBM uses 6 workers, but what is strange is that the LightGBM info is not reported in the output when changing the parameter to True.
It's about 100K rows in the sample data set we use vs. 45M rows in the full data.
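For context, the sample vs. full split is controlled by the size argument of the Criteo loader in recommenders; a rough sketch (the exact keyword names may differ slightly by version):

```python
from pyspark.sql import SparkSession
from recommenders.datasets.criteo import load_spark_df

spark = SparkSession.builder.appName("criteo-size-check").getOrCreate()

# "sample" is the ~100K-row subset used in the smoke test.
sample_df = load_spark_df(spark, size="sample")
print(sample_df.count())
# load_spark_df(spark, size="full") would pull the ~45M-row dataset instead.
```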
@anargyri @miguelgfierro I made a PR here to update the notebook to remove the early stopping param since it isn’t used: https://github.com/microsoft/recommenders/pull/1620
@mhamilton723 @imatiach-msft we detected a performance drop in LightGBM; any hint about what could be happening?