smetrics: Jaro implementation doesn't match reference implementations
Hi,
Thanks for your work on these implementations!
Recently I’ve been tinkering with several string distance algorithms, and I get inconsistent results when using xrash/smetrics compared to other Jaro implementations.
I’ve compared the result with:
- https://github.com/masatana/go-textdistance (Go)
- https://github.com/Simmetrics/simmetrics (Java)
- https://pypi.org/project/jaro-winkler/ (Python)
And on at least the following values it differs:
- “DIXON” vs. “DICKSONX” (smetrics.Jaro scores lower with 0.683333, reference implementations all score 0.766667)
- “gmilcon” vs. “gmilno” (smetrics.Jaro scores lower with 0.849206, reference implementations all score 0.896825)
Will you accept a PR? I’m not sure I’ll find the time to figure out why the implementation is off, but I thought I’d reach out first and get a sign of life 😃
About this issue
- State: closed
- Created 4 years ago
- Comments: 15 (11 by maintainers)
Commits related to this issue
- search: tweak match percentages after smetrics upgrade See: https://github.com/xrash/smetrics/issues/7 — committed to adamdecaf/watchman by adamdecaf 4 years ago
@adamdecaf Sorry for the delay. Here is my take on it.
Both Wikipedia and Rosetta Code describe the matching range as this: ⌊max(|s1|, |s2|) / 2⌋ - 1
When both |s1| and |s2| are 1, the matching range is -1, and the algorithm doesn’t work as expected. My old code had this edge case covered, which is why it didn’t fail. Try the new version @ master. I’ve also added more test cases. Please tell me if it works for you. Thank you.
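To make the edge case concrete, here is a minimal sketch of Jaro similarity in Go. It is not the smetrics code: the names `matchRange` and `jaro` are hypothetical, clamping the range at zero is just one assumed way to cover the single-character case, and treating two empty strings as identical is also an assumption.

```go
package main

// matchRange returns the Jaro matching window, floor(max(n1, n2)/2) - 1.
// Assumption: a negative result is clamped to 0 so that two one-character
// strings can still be compared position by position.
func matchRange(n1, n2 int) int {
	longest := n1
	if n2 > longest {
		longest = n2
	}
	r := longest/2 - 1
	if r < 0 {
		r = 0
	}
	return r
}

// jaro returns the Jaro similarity of a and b, a value in [0, 1].
func jaro(a, b string) float64 {
	s1, s2 := []rune(a), []rune(b)
	if len(s1) == 0 && len(s2) == 0 {
		return 1 // assumption: two empty strings count as identical
	}
	if len(s1) == 0 || len(s2) == 0 {
		return 0
	}

	r := matchRange(len(s1), len(s2))
	matched1 := make([]bool, len(s1))
	matched2 := make([]bool, len(s2))

	// Count characters of s1 that have an equal, still unmatched
	// character in s2 within r positions.
	m := 0
	for i := range s1 {
		lo := i - r
		if lo < 0 {
			lo = 0
		}
		hi := i + r + 1
		if hi > len(s2) {
			hi = len(s2)
		}
		for j := lo; j < hi; j++ {
			if !matched2[j] && s1[i] == s2[j] {
				matched1[i], matched2[j] = true, true
				m++
				break
			}
		}
	}
	if m == 0 {
		return 0
	}

	// Walk the matched characters of both strings in order; every
	// position where they disagree is half a transposition.
	k, half := 0, 0
	for i := range s1 {
		if !matched1[i] {
			continue
		}
		for !matched2[k] {
			k++
		}
		if s1[i] != s2[k] {
			half++
		}
		k++
	}

	mf := float64(m)
	t := float64(half) / 2
	return (mf/float64(len(s1)) + mf/float64(len(s2)) + (mf-t)/mf) / 3
}
```

With this sketch, `jaro("DIXON", "DICKSONX")` comes out at about 0.766667 and `jaro("gmilcon", "gmilno")` at about 0.896825, the values the reference implementations report in this issue. Without the clamp, `jaro("a", "a")` would search an empty window and return 0 instead of 1.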
Hey guys, if it’s alright, I’ll close this issue.
@Dynom I’ve pushed a fixed version to the develop branch. You can point your go mod @develop and check it out.
I’ve also found another article by Winkler with a larger table of tests, and apparently closer to our results. The article is inside the docs/ dir in the develop branch, if you wanna check it out.
Later I’ll opt-in to go mod (I think I’ll start a v2, it seems like the recommended approach).
Interesting! I’d gladly accept a PR.