smetrics: Jaro implementation doesn't match reference implementations

Hi,

Thanks for your work on these implementations!

Recently I’ve been tinkering with several string distance algorithms, and I get inconsistent results when using xrash/smetrics compared to other Jaro implementations.

I’ve compared the result with:

And on at least the following values it differs:

  • “DIXON” vs. “DICKSONX” (smetrics.Jaro scores lower, at 0.683333; the reference implementations all score 0.766667)
  • “gmilcon” vs. “gmilno” (smetrics.Jaro scores lower, at 0.849206; the reference implementations all score 0.896825)

Will you accept a PR? I’m not sure I’ll find the time to figure out why the implementation is off, but I thought I’d first reach out and get a sign of life 😃

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (11 by maintainers)

Most upvoted comments

@adamdecaf Sorry for the delay. Here is my take on it.

Both Wikipedia and Rosetta Code describe the matching range as this:

floor(max(|s1|, |s2|) / 2) - 1

When both |s1| and |s2| are 1, the matching range is -1, and the algorithm doesn’t work as expected. My old code covered this edge case, which is why it didn’t fail. Try the new version at master; I’ve also added more test cases. Please tell me if it works for you. Thank you.
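For readers following along, here is a minimal sketch of the Jaro similarity with the matching range clamped at 0, so that two single-character strings can still match. It is not the smetrics code: `jaroSketch` is a hypothetical name, the clamp is only one possible way to handle the edge case, and returning 0 for empty input is an assumption (conventions differ).

```go
package main

import "fmt"

// jaroSketch is a hypothetical helper for illustration, not the smetrics API.
func jaroSketch(a, b string) float64 {
	s1, s2 := []rune(a), []rune(b)
	// Assumption: any empty input scores 0.
	if len(s1) == 0 || len(s2) == 0 {
		return 0
	}

	longest := len(s1)
	if len(s2) > longest {
		longest = len(s2)
	}
	// Matching range: floor(max(|s1|, |s2|) / 2) - 1.
	// For two one-character strings this is -1, so clamp it at 0.
	window := longest/2 - 1
	if window < 0 {
		window = 0
	}

	matched1 := make([]bool, len(s1))
	matched2 := make([]bool, len(s2))
	matches := 0
	for i := range s1 {
		lo := i - window
		if lo < 0 {
			lo = 0
		}
		hi := i + window
		if hi > len(s2)-1 {
			hi = len(s2) - 1
		}
		// Take the first unmatched, equal character inside the window.
		for j := lo; j <= hi; j++ {
			if !matched2[j] && s1[i] == s2[j] {
				matched1[i], matched2[j] = true, true
				matches++
				break
			}
		}
	}
	if matches == 0 {
		return 0
	}

	// Walk both strings' matched characters in order and count the
	// positions where they disagree; two disagreements make one transposition.
	outOfOrder := 0
	j := 0
	for i := range s1 {
		if !matched1[i] {
			continue
		}
		for !matched2[j] {
			j++
		}
		if s1[i] != s2[j] {
			outOfOrder++
		}
		j++
	}

	m := float64(matches)
	t := float64(outOfOrder / 2)
	return (m/float64(len(s1)) + m/float64(len(s2)) + (m-t)/m) / 3
}

func main() {
	fmt.Printf("%.6f\n", jaroSketch("DIXON", "DICKSONX")) // 0.766667
	fmt.Printf("%.6f\n", jaroSketch("gmilcon", "gmilno")) // 0.896825
	fmt.Printf("%.6f\n", jaroSketch("a", "a"))            // 1.000000
}
```

The first two calls use the pairs from the original report and land on the reference values quoted there; the third exercises the single-character case, which would find no match if the range were left at -1.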

Hey guys, if it’s alright I’ll close this issue.

@Dynom I’ve pushed a fixed version to the develop branch. You can point your go mod @develop and check it out.

I’ve also found another article by Winkler with a larger table of tests, and its values are apparently closer to our results. The article is inside the docs/ dir in the develop branch, if you wanna check it out.

Later I’ll opt-in to go mod (I think I’ll start a v2, it seems like the recommended approach).

Interesting! I’d gladly accept a PR.