Splink: Levenshtein distance and Jaro-Winkler returning inverted values

What happens?

In my Databricks environment, jaro_winkler comparisons have been returning the inverse of the values I expect (1 minus the expected similarity).

Probably an issue on my end, but as you can imagine it really messes up training.

To Reproduce

In Spark, running a simple select in a notebook like this:

```python
from pyspark.sql import types
from pyspark.sql.types import DoubleType

spark.udf.registerJavaFunction(
    "jaro_winkler",
    "uk.gov.moj.dash.linkage.JaroWinklerSimilarity",
    types.DoubleType(),
)
```

```sql
%sql
select jaro_winkler("Splink program", "Splink produce")
```

returns:

```
jaro_winkler(Splink program, Splink produce)
0.11428571428571421
```

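As a sanity check (not part of the original report), the value above looks like 1 minus the expected Jaro-Winkler similarity. A minimal sketch in plain Python, assuming the jellyfish package is available:

```python
# Hypothetical sanity check: compare the UDF output above against a
# pure-Python Jaro-Winkler implementation. Assumes `jellyfish` is installed
# (older jellyfish releases call this function `jaro_winkler` rather than
# `jaro_winkler_similarity`).
import jellyfish

udf_result = 0.11428571428571421  # value returned by the Spark UDF above

expected = jellyfish.jaro_winkler_similarity("Splink program", "Splink produce")
print(expected)        # ~0.886
print(1 - udf_result)  # ~0.886 -> the UDF appears to return 1 - similarity
```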

OS:

Databricks Runtime 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12)

Splink version:

3.4.4

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 17 (6 by maintainers)

Most upvoted comments

Yes, fixed for me too, on Spark 3.1.3!

```
+----------------------------+
|jaro_winkler(MARHTA, MARTHA)|
+----------------------------+
|          0.9611111111111111|
+----------------------------+
```

@mamonu looks like it's fixed for Databricks on Azure (12.0 ML, includes Apache Spark 3.3.1, Scala 2.12).

```
jaro_winkler(cat, cats)
0.9416666666666667
```

thank you!
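For completeness, a sketch of how that check can be reproduced in a notebook, reusing the registration code from the original report (it assumes the Splink Scala UDF JAR is already attached to the cluster):

```python
from pyspark.sql import types

# Register the Java UDF from the Splink Scala UDF JAR (must be attached to the cluster).
spark.udf.registerJavaFunction(
    "jaro_winkler",
    "uk.gov.moj.dash.linkage.JaroWinklerSimilarity",
    types.DoubleType(),
)

# With the fixed JAR this prints a similarity (~0.9417), not 1 minus it.
spark.sql("select jaro_winkler('cat', 'cats')").show()
```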

  • Up to Apache Commons Text version 1.10, JaroWinklerDistance had a bug and returned the same value as JaroWinklerSimilarity (see https://issues.apache.org/jira/browse/TEXT-191).

@funkysandman

I knew that this bug existed, which is why it didn't matter before 1.10. But yes, this makes things much clearer now.

Thanks for doing this investigation into https://issues.apache.org/jira/browse/TEXT-191. After 4 years of watching that issue I just didn't think it would be fixed!

I will be creating a new JAR based on this information and will update this thread.

@zzandi this is very good to know. I will update the JAR at some point and will definitely take all of this into account. Perhaps we will offer two sets of JARs: one for older Spark versions (2.4.x etc.) and one for the latest Spark (3.3.x).

Hi, we have experienced the same issue. It is due to a Spark version incompatibility and happens on Spark 3.3. However, since Spark 3.2 has considerably slower performance, instead of downgrading one could rebuild the JAR with a newer version of the "org.apache.commons" commons-text dependency (1.10.0 instead of 1.4) and also add an org.apache.spark 3.3 dependency to pom.xml to overcome the problem.
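For anyone taking the same route, here is a rough sketch of what those pom.xml dependency entries could look like; the artifact IDs, versions and scope below are assumptions and should be checked against the actual Splink Scala UDF build:

```xml
<!-- Sketch only: coordinates and scope are assumptions, not taken from the real pom.xml -->
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-text</artifactId>
  <version>1.10.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.3.0</version>
  <scope>provided</scope>
</dependency>
```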