Splink: Levenshtein distance and Jaro-Winkler returning inverted values
What happens?
In my Databricks environment I've been getting the opposite of the expected values when doing jaro_winkler comparisons.
It's probably an issue on my end, but as you can imagine it really messes up training.
To Reproduce
In Spark, doing a simple select in a notebook like this:

```python
from pyspark.sql import types
from pyspark.sql.types import DoubleType

spark.udf.registerJavaFunction(
    "jaro_winkler",
    "uk.gov.moj.dash.linkage.JaroWinklerSimilarity",
    types.DoubleType(),
)
```

```sql
%sql
select jaro_winkler("Splink program", "Splink produce")
```
```
jaro_winkler(Splink program, Splink produce)
0.11428571428571421
```

Showing 1 row; 0.18 seconds runtime.
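That number looks like it could be 1 minus the expected similarity, i.e. the UDF returning a Jaro-Winkler distance rather than a similarity. A quick back-of-the-envelope check (my own arithmetic, not output from the UDF):

```python
# Rough check: if the UDF is returning a distance, then 1 - value should be
# the similarity Splink actually expects.
reported = 0.11428571428571421   # value returned by the UDF above
print(1 - reported)              # 0.8857142857142858

# Two strings sharing the long prefix "Splink pro" should score close to 1,
# so a value of ~0.11 is consistent with the inverted behaviour.
```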
OS:
11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12)
Splink version:
3.4.4
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 17 (6 by maintainers)
Yes, fixed for me too, on Spark 3.1.3!
```
[Stage 1:>                    (0 + 1) / 1]
+----------------------------+
|jaro_winkler(MARHTA, MARTHA)|
+----------------------------+
|          0.9611111111111111|
+----------------------------+
```
@mamonu looks like it's fixed for Databricks on Azure (12.0 ML, includes Apache Spark 3.3.1, Scala 2.12):

```
jaro_winkler(cat, cats)
1    0.9416666666666667
```
thank you!
@funkysandman
I knew that this bug existed, which is why before 1.10 it didn't matter. But yes, this makes things much clearer now.
Thanks for doing this investigation into https://issues.apache.org/jira/browse/TEXT-191. After 4 years of watching it, I just didn't think it would be fixed!
I will be creating a new jar based on this information and will update this thread.
@zzandi this is very good to know. I will update the jar at some point and will definitely take all this into account. Perhaps we could offer two sets of jars: one for older Spark versions (2.4.x etc.) and one for the latest Spark (3.3.x).
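Until there are separate builds, a rough sketch of choosing a jar per Spark version could look like the snippet below (the jar file names are placeholders, not real artifact names):

```python
# Placeholder sketch only: the jar names are illustrative, not real artifacts.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
major, minor = (int(x) for x in spark.version.split(".")[:2])

# Per this thread, the inverted behaviour shows up on Spark 3.3, so that is
# where a jar rebuilt against commons-text 1.10.0 would be needed.
jar_name = (
    "scala-udf-similarity-spark33.jar"
    if (major, minor) >= (3, 3)
    else "scala-udf-similarity-legacy.jar"
)
print(f"Spark {spark.version}: load {jar_name}")
```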
Hi, we have experienced the same issue. It is due to a Spark version incompatibility, and it happens on Spark 3.3. However, since Spark 3.2 has considerably slower performance, instead of downgrading one could rebuild the jar with the "org.apache.commons" commons-text dependency at 1.10.0 instead of 1.4, and also add an org.apache.spark 3.3 dependency to pom.xml, to overcome this problem.
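Whichever route you take (downgrading Spark or rebuilding the jar against commons-text 1.10.0), a quick sanity check before running Splink training might look like the sketch below. It assumes the jar is already on the cluster classpath and simply asserts that the UDF returns a similarity (close to 1 for near-identical strings), using the MARHTA/MARTHA pair quoted earlier in the thread:

```python
# Sanity-check sketch: confirm the registered UDF returns a *similarity*,
# not a distance, before training a Splink model.
from pyspark.sql import types

spark.udf.registerJavaFunction(
    "jaro_winkler",
    "uk.gov.moj.dash.linkage.JaroWinklerSimilarity",
    types.DoubleType(),
)

value = spark.sql("select jaro_winkler('MARHTA', 'MARTHA') as jw").first()["jw"]

# A correct setup prints roughly 0.9611 (as in the output quoted above);
# a value near 0.04 means the inverted behaviour is back.
print(value)
assert value > 0.5, "jaro_winkler appears to be returning a distance"
```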