splink: Blocking rules are ignored by Splink model
Dear team,
I wanted to congratulate you on the great work done with Splink! I'm currently using it for a deduplication project with relatively large datasets (around 600k rows per dataset; the smallest ones are around 50k).
My Splink settings file is configured as follows:
{
    "link_type": "dedupe_only",
    "blocking_rules": [
        "l.postal_code = r.postal_code AND l.col_example_1 = r.col_example_1 AND l.col_example_2 = r.col_example_2 AND l.col_example_3 != r.col_example_3 AND l.province = r.province"
    ],
    "comparison_columns": [
        {
            "col_name": "col_example_4",
            "data_type": "numeric",
            "num_levels": 2,
            "case_expression": "Catch the nulls with -1, if identical 1, otherwise 0."
        },
        {
            "col_name": "col_example_5",
            "data_type": "numeric",
            "num_levels": 2,
            "case_expression": "Catch the nulls with -1, if identical 1, otherwise 0."
        },
        {
            "col_name": "col_example_6_with_inverse_correlation",
            "data_type": "numeric",
            "num_levels": 3,
            "case_expression": "Catch the nulls, set levels for: 2 if they are identical, 1 if they are in a range of numeric difference, otherwise 0."
        },
        {
            "col_name": "col_example_7_with_inverse_correlation",
            "data_type": "numeric",
            "num_levels": 3,
            "case_expression": "Catch the nulls, set levels for: 2 if they are identical, 1 if they are in a range of numeric difference, otherwise 0."
        },
        {
            "col_name": "col_example_8",
            "data_type": "string",
            "num_levels": 3,
            "case_expression": "Catch nulls and set levels with jaro_winkler_sim()"
        },
        {
            "num_levels": 2,
            "case_expression": "Custom SQL expression: if col_example_9 is a certain type then take col_example_10 into account",
            "custom_columns_used": [
                "col_example_9",
                "col_example_10"
            ],
            "data_type": "string",
            "custom_name": "custom_col_example"
        }
    ],
    "additional_columns_to_retain": [
        "id_col_example",
        "id_col_for_item",
        "date_col"
    ],
    "em_convergence": 0.01
}
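For context, the case_expression values above are prose placeholders for the real SQL CASE statements, which are not shown in the issue. A minimal sketch of what the first one might look like, assuming the col_l / col_r suffix convention Splink uses for comparison columns (the actual expression in the project may differ):

# Hypothetical reconstruction of the first comparison column, for illustration
# only. Assumes the compared values are exposed as <col_name>_l / <col_name>_r.
comparison_column_example = {
    "col_name": "col_example_4",
    "data_type": "numeric",
    "num_levels": 2,
    "case_expression": """
        CASE
            WHEN col_example_4_l IS NULL OR col_example_4_r IS NULL THEN -1
            WHEN col_example_4_l = col_example_4_r THEN 1
            ELSE 0
        END
    """,
}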
Note that col_example_6_with_inverse_correlation and col_example_7_with_inverse_correlation have an inverse/negative correlation: when the first one has a value, the other is null, and vice versa.
When checking the matching results, I see that the blocking rules are being ignored.
For example, postal_code is not the same across all records assigned to an estimated matching group (a pair or group of duplicates). In almost every matching group there is at least one record that does not satisfy the blocking rules.
If blocking_rules are supposed to prevent these comparisons, why are comparisons between records with different postal codes still happening?
Kind regards, Rebeca
About this issue
- State: closed
- Created 3 years ago
- Comments: 17 (8 by maintainers)
Dear @RobinL @mamonu
I tried this and found out that the data indeed “changed”. The saved dataframe as a parquet file right after adding the
F.monotonically_increasing_id()
was different than the one just a couple of lines after in the code.I passed the parqeut file to the
clusters_at_thresholds()
function and it worked! 🎉 The blocking rules work and all the resulting groups have the same postal_code.Thank you for this information @mamonu , the cause comes from
F.monotonically_increasing_id()
indeed. And as you said saving it after adding those id works. 😃Now I’m checking the results and seeing a few clustered groups that have a large number of false duplicates (All respecting the blocking rules) and I wonder if it’s because I’m not caching before running cc or because of the comparison columns and the m and u probabilities.
In the other hand, the groups with fewer number of duplicates (2 to 6 records per group) are, so far I could check, all correct. ✔️
I wanted to thank you for the help! @mamonu @RobinL
Kind regards, Rebeca
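A minimal sketch of the save/reload step described in the comment above, assuming a running SparkSession and PySpark. The file paths are hypothetical, and the downstream Splink calls (scoring and clusters_at_thresholds) are only indicated in a comment because their exact signatures depend on the Splink version in use:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("path/to/source_data")  # hypothetical input path

# Assign the id, then immediately materialise the dataframe to disk so that
# monotonically_increasing_id() cannot be re-evaluated later in the lineage.
df_with_id = df.withColumn("unique_id", F.monotonically_increasing_id())
df_with_id.write.mode("overwrite").parquet("path/to/data_with_id")  # hypothetical path

# Reload from disk: from this point on the ids are fixed values, not a lazy expression.
df_stable = spark.read.parquet("path/to/data_with_id")

# Sanity check that the id really is unique before running the Splink pipeline.
assert df_stable.count() == df_stable.select("unique_id").distinct().count()

# ...then run the Splink scoring and clusters_at_thresholds() steps on df_stable.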
IMO @RobinL is right to suspect the F.monotonically_increasing_id() function. This is unfortunately a known issue that trips up people in the Spark ecosystem. Someone has even written a blog post about it: spark-surprises-for-the-uninitiated
To make things even more complicated, there has also been a conversation about a related issue in the graphframes package: https://github.com/graphframes/graphframes/issues/159
The outcome of all this is: only by explicitly caching and/or saving at the right time (e.g. once after running the monotonically_increasing_id function, and perhaps once more before running connected components) can we be sure that the results will be correct.
I think there's a possibility that F.monotonically_increasing_id() is the source of the problem. We've certainly had problems with the 'non-deterministicness' of this function before. One thing you could try is:
Step 1: Assign an id using monotonically_increasing_id and then save the data out to a parquet file. (If memory serves, it does produce a unique id, but if you're using it with an iterative algorithm/long lineage in Spark it can 'change'.)
Step 2: Load the parquet file back in (double check the ID is actually unique - I think it will be) and run everything from that point.
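A sketch of the explicit-caching alternative mentioned earlier, for cases where writing intermediate parquet files is not desirable. This is an illustration under stated assumptions: the paths are hypothetical, checkpoint() is used as a stronger form of caching because it truncates the lineage (a plain .cache() can be evicted under memory pressure), and the Splink scoring and connected-components steps themselves are left as a comment:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("path/to/checkpoint_dir")  # hypothetical directory

df = spark.read.parquet("path/to/source_data")  # hypothetical input path

# Step 1: assign the id and cut the lineage so the id values become fixed.
df_with_id = df.withColumn("unique_id", F.monotonically_increasing_id()).checkpoint()

# Step 2: double check the id is actually unique, then run scoring and the
# connected-components / clustering step from this dataframe. Caching once more
# just before connected components, as suggested above, follows the same idea.
assert df_with_id.count() == df_with_id.select("unique_id").distinct().count()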