recommenders: [BUG] Spark smoke test error with Criteo
Description
After upgrading LightGBM and the Spark version, we got this error in the nightly smoke tests. We have been running the same code for a long time without seeing it, so it looks like a performance degradation.
tests/smoke/examples/test_notebooks_pyspark.py .RRRRRF
=================================== FAILURES ===================================
_____________________ test_mmlspark_lightgbm_criteo_smoke ______________________
notebooks = {'als_deep_dive': '/home/recocat/myagent/_work/10/s/examples/02_model_collaborative_filtering/als_deep_dive.ipynb', 'a..._dive': '/home/recocat/myagent/_work/10/s/examples/02_model_collaborative_filtering/cornac_bivae_deep_dive.ipynb', ...}
output_notebook = 'output.ipynb', kernel_name = 'python3'
@pytest.mark.flaky(reruns=5, reruns_delay=2)
@pytest.mark.smoke
@pytest.mark.spark
@pytest.mark.skipif(sys.platform == "win32", reason="Not implemented on Windows")
def test_mmlspark_lightgbm_criteo_smoke(notebooks, output_notebook, kernel_name):
    notebook_path = notebooks["mmlspark_lightgbm_criteo"]
    pm.execute_notebook(
        notebook_path,
        output_notebook,
        kernel_name=kernel_name,
        parameters=dict(DATA_SIZE="sample", NUM_ITERATIONS=50, EARLY_STOPPING_ROUND=10),
    )
    results = sb.read_notebook(output_notebook).scraps.dataframe.set_index("name")[
        "data"
    ]
>   assert results["auc"] == pytest.approx(0.68895, rel=TOL, abs=ABS_TOL)
E assert 0.6292474883613918 == 0.68895 ± 5.0e-02
E + where 0.68895 ± 5.0e-02 = <function approx at 0x7f46b6e30840>(0.68895, rel=0.05, abs=0.05)
E + where <function approx at 0x7f46b6e30840> = pytest.approx
In which platform does it happen?
How do we replicate the issue?
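A rough way to reproduce the failing check outside pytest, using papermill and scrapbook exactly as the smoke test does (the notebook path below is an assumption; adjust it to wherever mmlspark_lightgbm_criteo.ipynb lives in your checkout):

```python
import papermill as pm
import scrapbook as sb

# Assumed location of the notebook inside the recommenders repo; adjust as needed.
notebook_path = "examples/00_quick_start/mmlspark_lightgbm_criteo.ipynb"
output_notebook = "output.ipynb"

# Same parameters the smoke test passes to the notebook.
pm.execute_notebook(
    notebook_path,
    output_notebook,
    kernel_name="python3",
    parameters=dict(DATA_SIZE="sample", NUM_ITERATIONS=50, EARLY_STOPPING_ROUND=10),
)

# Read back the metrics recorded with scrapbook; the test expects AUC ~0.68895 +/- 0.05,
# while the nightly run reported ~0.6292.
results = sb.read_notebook(output_notebook).scraps.dataframe.set_index("name")["data"]
print(results["auc"])
```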
Expected behavior (i.e. solution)
Other Comments
This error is so weird; did LightGBM from SynapseML change somehow? FYI @anargyri @simonzhaoms
This is great @imatiach-msft, thanks for chiming in.
@anargyri I guess this is kind of getting into very specific details, but single dataset mode essentially hands the Spark dataset to native code on one Spark worker and "finishes" the other Spark workers, and the parallelization is then done with multithreaded code in the native layer. The previous behavior would create a native dataset for each worker, and there would be a lot of unnecessary network communication between them instead of internal thread parallelization. So with single dataset mode we have a single LightGBM dataset created per machine, and without it we have as many datasets as Spark workers (so with 1 core per worker and 8 cores, there are 8 LightGBM datasets created). I'm actually surprised that this still gives better accuracy in this case somehow.
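For reference, a minimal sketch of toggling this behavior on the SynapseML estimator (column names, data, and the iteration count here are placeholders, and the exact parameter defaults may differ by SynapseML version):

```python
from synapse.ml.lightgbm import LightGBMClassifier

# Sketch only: useSingleDatasetMode=True builds one native LightGBM dataset per
# machine and parallelizes with threads in native code; False builds one dataset
# per Spark worker/core, with network communication between them.
lgbm = LightGBMClassifier(
    objective="binary",
    labelCol="label",
    featuresCol="features",
    numIterations=50,
    useSingleDatasetMode=True,  # flip to False to compare against per-worker datasets
)
# model = lgbm.fit(train_df)  # train_df is a placeholder Spark DataFrame
```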
Yes, this is right.
I agree it's not a large difference in AUC. I am comparing runs on the same machine with the same code (just flipping the useSingleDatasetMode parameter). I am not sure how many workers are used; it is a single machine. It looks like LightGBM uses 6 workers, but what is strange is that the LightGBM info is not reported in the output when changing the parameter to True.
It's about 100K rows in the sample data set we use vs. 45M rows in the full data.
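For context, the sample vs. full split is controlled by the size argument of the Criteo loader in recommenders; a rough sketch (the exact keyword names may differ slightly by version):

```python
from pyspark.sql import SparkSession
from recommenders.datasets.criteo import load_spark_df

spark = SparkSession.builder.appName("criteo-size-check").getOrCreate()

# "sample" is the ~100K-row subset used in the smoke test.
sample_df = load_spark_df(spark, size="sample")
print(sample_df.count())
# load_spark_df(spark, size="full") would pull the ~45M-row dataset instead.
```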
@anargyri @miguelgfierro I made a PR here to update the notebook to remove the early stopping param since it isn’t used: https://github.com/microsoft/recommenders/pull/1620
@mhamilton723 @imatiach-msft we detected a performance drop in LightGBM; any hint about what could be happening?