splink: Blocking rules are ignored by Splink model

Dear team,

I wanted to congratulate you on the great work done with Splink! I’m currently using it for a deduplication project with relatively large datasets (around 600k rows per dataset; the smallest ones are around 50k).

I’m setting a configuration file for Splink as:

{
    "link_type": "dedupe_only",
    "blocking_rules": [
        "l.postal_code = r.postal_code AND l.col_example_1 = r.col_example_1 AND l.col_example_2 = r.col_example_2 AND l.col_example_3 != r.col_example_3 AND l.province = r.province"
    ],
    "comparison_columns": [
        {
            "col_name": "col_example_4",
            "data_type": "numeric",
            "num_levels": 2,
            "case_expression": "Catch the nulls with -1, if identical 1, otherwise 0."
        },
        {
            "col_name": "col_example_5",
            "data_type": "numeric",
            "num_levels": 2,
            "case_expression": "Catch the nulls with -1, if identical 1, otherwise 0."
        },
        {
            "col_name": "col_example_6_with_inverse_correlation",
            "data_type": "numeric",
            "num_levels": 3,
            "case_expression": "Catch the nulls, set levels for: 2 if they are identical, 1 if they are in a range of numeric difference, otherwise 0."
        },
        {
            "col_name": "col_example_7_with_inverse_correlation",
            "data_type": "numeric",
            "num_levels": 3,
            "case_expression": "Catch the nulls, set levels for: 2 if they are identical, 1 if they are in a range of numeric difference, otherwise 0."
        },
        {
            "col_name": "col_example_8",
            "data_type": "string",
            "num_levels": 3,
            "case_expression": "Catch nulls and set levels with jaro_winkler_sim()"
        },
        {
            "num_levels": 2,
            "case_expression": "Custom SQL expression: if col_example_9 is a certain type then take col_example_10 into account",
            "custom_columns_used": [
                "col_example_9",
                "col_example_10"
            ],
            "data_type": "string",
            "custom_name": "custom_col_example"
        }
    ],
    "additional_columns_to_retain": [
        "id_col_example",
        "id_col_for_item",
        "date_col"
    ],
    "em_convergence": 0.01
}

Note that col_example_6_with_inverse_correlation and col_example_7_with_inverse_correlation have an inverse/negative correlation: when the first one has a value, the other is null, and vice versa.
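For illustration, this is roughly the shape of those case_expression values written out in full. It is a minimal sketch rather than my real expression, and it assumes Splink’s convention of suffixing the left and right columns with _l and _r and of using -1 as the null level:

# Hypothetical comparison column (illustrative only): -1 catches nulls,
# 1 is an exact match, 0 is everything else (num_levels = 2).
numeric_comparison = {
    "col_name": "col_example_4",
    "data_type": "numeric",
    "num_levels": 2,
    "case_expression": """
        CASE
            WHEN col_example_4_l IS NULL OR col_example_4_r IS NULL THEN -1
            WHEN col_example_4_l = col_example_4_r THEN 1
            ELSE 0
        END
    """,
}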

When checking the matching results, I see that the blocking rules are ignored.

For example, postal_code isn’t the same across all records assigned to an estimated matching group (a pair or group of duplicates). In almost all matching groups there is at least one record that doesn’t follow the blocking rules.
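For illustration, the kind of check I mean looks roughly like this (cluster_id, postal_code and the file path are placeholder names, not necessarily my real ones):

# Count estimated clusters whose members span more than one postal_code.
# This should be zero if the blocking rule on postal_code were respected.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
clusters = spark.read.parquet("clustered_output.parquet")  # placeholder path

violations = (
    clusters.groupBy("cluster_id")
    .agg(F.countDistinct("postal_code").alias("n_postal_codes"))
    .filter(F.col("n_postal_codes") > 1)
)
print(violations.count())  # greater than 0 means some clusters break the block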

If blocking_rules are meant to prevent comparisons, why are comparisons between records with different postal codes still happening?

Kind regards, Rebeca

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17 (8 by maintainers)

Most upvoted comments

Dear @RobinL @mamonu

I think there’s a possibility that F.monotonically_increasing_id() is the source of the problem. We’ve certainly had problems with the non-determinism of this function before.

One thing you could try:
Step 1: Assign an id using monotonically_increasing_id and then save the data out to a Parquet file. (If memory serves, it does produce a unique id, but if you use it with an iterative algorithm/long lineage in Spark, the id can ‘change’.)
Step 2: Load the Parquet file back in (double-check the id is actually unique - I think it will be) and run everything from that point.
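Something along these lines (the paths and the id column name are just placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("input_data.parquet")  # placeholder input path

# Step 1: assign the id and write it out straight away, so it is materialised once
df_with_id = df.withColumn("unique_id", F.monotonically_increasing_id())
df_with_id.write.mode("overwrite").parquet("data_with_id.parquet")

# Step 2: read the saved file back in and double-check the id really is unique
df_fixed = spark.read.parquet("data_with_id.parquet")
assert df_fixed.count() == df_fixed.select("unique_id").distinct().count()

# ...then run Splink / clustering from df_fixed onwards, not from df_with_id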

I tried this and found out that the data had indeed “changed”. The dataframe saved to a Parquet file right after adding F.monotonically_increasing_id() was different from the one saved just a couple of lines later in the code.

I passed the parquet file to the clusters_at_thresholds() function and it worked! 🎉 The blocking rules work and all the resulting groups have the same postal_code.

IMO @RobinL is right to suspect the F.monotonically_increasing_id() function. This is unfortunately a known issue that trips up people in the Spark ecosystem.

Someone has even written a blog post about this: spark-surprises-for-the-uninitiated.

In addition, to make things more complicated, in the graphframes package there has been a conversation about a related issue: graphframes/graphframes#159

The outcome of all this is: only by explicitly caching and/or saving at the right time (e.g. once after running the monotonically_increasing_id function, and perhaps once more before running connected components (cc)) can we be sure that the results will be correct.
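As a sketch of what I mean (the checkpoint directory and file paths are placeholders, and df_edges stands in for whatever edge dataframe gets handed to cc):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder dir

df = spark.read.parquet("input_data.parquet")  # placeholder path

# Persist once the id has been assigned, and force an action so the id is
# computed exactly once rather than re-evaluated through a long lineage.
df_with_id = df.withColumn("unique_id", F.monotonically_increasing_id()).persist()
df_with_id.count()

# Later on, cut the lineage again before the connected components step.
df_edges = spark.read.parquet("scored_comparisons.parquet")  # placeholder edges
df_edges_checkpointed = df_edges.checkpoint()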

Thank you for this information @mamonu, the cause is indeed F.monotonically_increasing_id(). And as you said, saving the data right after adding those ids works. 😃

Now I’m checking the results and seeing a few clustered groups with a large number of false duplicates (all respecting the blocking rules), and I wonder whether it’s because I’m not caching before running cc or because of the comparison columns and the m and u probabilities.

On the other hand, the groups with a smaller number of duplicates (2 to 6 records per group) are, as far as I could check, all correct. ✔️

I wanted to thank you for the help! @mamonu @RobinL

Kind regards, Rebeca
