natalie: `Array#hash` producing too many collisions
I’m currently tinkering around with `Hash#hash` and noticed this bug in `Array#hash`:

```ruby
[2, 2].hash == [1, 1].hash
# => true
```

`[2, 2].hash` returns 100019, which is the starting point of the hash calculation, because the terms `(2.hash * 207269) ^ (2.hash * 207269)` cancel each other out. My first thought for fixing this is multiplying by a random prime number instead of a fixed one. Would this work?
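The cancellation can be reproduced with a small toy model of the combiner described above (the constants 100019 and 207269 come from this report; the helper name, and using small integers in place of real `#hash` values, are my own simplifications):

```ruby
# Toy model of the xor-based combiner: SEED ^ (h1 * PRIME) ^ (h2 * PRIME) ^ ...
# Small integers stand in for the elements' #hash values.
SEED  = 100019
PRIME = 207269

def naive_array_hash(ary)
  ary.reduce(SEED) { |acc, e| acc ^ (e * PRIME) }
end

naive_array_hash([2, 2]) # => 100019, the two identical terms cancel back to the seed
naive_array_hash([1, 1]) == naive_array_hash([2, 2]) # => true
```

Any array whose equal elements pair up xors back to the seed, which is why `[1, 1]` and `[2, 2]` collide.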
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 18 (18 by maintainers)
Commits related to this issue
- Fix bug with Array#hash [1,1].hash should != [2,2].hash [fixes #188] — committed to natalie-lang/natalie by seven1m 3 years ago
- make Array::hash more reliable by making the hashing process aware of the order of the elements. The big issue with the previous version was a failure with period collisions on arrays containing copie... — committed to TheGrizzlyDev/natalie by TheGrizzlyDev 3 years ago
I’m leaving this open for a few days until @TheGrizzlyDev decides to stop poking at it and moves on. 😆 😆
Naive question: what speaks against adding the hashes instead of xoring them? We would not have the problem of entries eliminating each other then. However, I could see overflow being a problem, maybe?
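For what it’s worth, a quick sketch (my toy model, not natalie’s code) suggests summing does avoid the self-cancellation, though the result is still order-insensitive; overflow can be emulated by masking to 64 bits, since Ruby integers are arbitrary-precision:

```ruby
MASK64 = (1 << 64) - 1

# Toy additive combiner; masking emulates 64-bit wraparound on overflow.
def add_combine(ary)
  ary.reduce(100019) { |acc, e| (acc + e * 207269) & MASK64 }
end

add_combine([2, 2]) == add_combine([1, 1]) # => false, the terms accumulate
add_combine([1, 2]) == add_combine([2, 1]) # => true, addition is commutative
```

So addition fixes the `[1, 1]` vs `[2, 2]` case, but `[1, 2]` and `[2, 1]` would still collide without some order-aware ingredient.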
Thanks for fixing this so fast! 😄 Sadly, I think this is not enough yet. Firstly, it does not work with nested arrays:

The two 2s and the two 3s cancel each other out.

Secondly, as `nat_int_t` is “only” 64 bits wide, it will not work if we shift more than 64 times (i.e. when the array is longer than 64 elements):
Edit: Copied the wrong generate_array method.
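The wrap-around problem can be illustrated with a sketch (the per-index shift scheme and the constants are my assumption about the approach under discussion, not the actual code): in a 64-bit word, indices `i` and `i + 64` end up with the same effective shift, so equal elements 64 positions apart can cancel again.

```ruby
MASK64 = (1 << 64) - 1

# Toy shift-based combiner; masking emulates a 64-bit nat_int_t, so the
# shift amount effectively repeats with period 64.
def shift_hash(ary)
  ary.each_with_index.reduce(100019) do |acc, (e, i)|
    acc ^ (((e * 207269) << (i % 64)) & MASK64)
  end
end

a = Array.new(65, 0); a[0]  = 2
b = Array.new(65, 0); b[64] = 2
shift_hash(a) == shift_hash(b) # => true, indices 0 and 64 shift identically
```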
Ahahahaha thanks! I’ve spent enough time on it to think in hashes
@TheGrizzlyDev you’re a genius. (That worked.)
There’s so much in the way of performance optimization we can do; first and foremost will be a method cache. But we’ll have plenty of time for that… 😉
About that: I’ve done some preliminary testing and found out that:
I’ll go a bit deeper tomorrow and maybe try to improve it all, but there are quite a few quick wins here and this has been quite fun :^)
OK, I took another stab at it. The main problem was that I was calling `Array::eql` and treating it like a bool (I forgot to call `is_truthy()`), so that was half the battle. I also decided to xor in the array index and added MOAR PRIMES lol. 😂

Anyway, let me know if y’all find any more weird cases on this one @richardboehme @TheGrizzlyDev. ❤️
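A sketch of what xoring in the array index buys (my toy model in Ruby, not the actual C++ change): equal elements at different positions now produce different terms, so they can no longer cancel pairwise.

```ruby
# Toy index-aware combiner: mixing the position into each term before
# multiplying breaks the pairwise cancellation of duplicate elements.
def index_aware_hash(ary)
  ary.each_with_index.reduce(100019) do |acc, (e, i)|
    acc ^ ((e ^ i) * 207269)
  end
end

index_aware_hash([2, 2]) == index_aware_hash([1, 1]) # => false
index_aware_hash([1, 2]) == index_aware_hash([2, 1]) # => false, order matters now
```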
How about subtraction, though? Ordering would be implicitly considered, since subtraction is non-commutative (unlike addition). It doesn’t cancel out the previous value but rather “adds” to it, unlike xor, and it wouldn’t overflow in the same way as addition, which would lead to a less effective distribution of 0s and 1s.
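A quick toy model of the subtraction idea (constants and structure are my assumption) for comparison: it avoids the xor-style self-cancellation, though note that a *plain* subtraction fold still ends up insensitive to element order, since `acc - x - y == acc - y - x`.

```ruby
MASK64 = (1 << 64) - 1

# Toy subtractive combiner, masked to emulate 64-bit wraparound.
def sub_hash(ary)
  ary.reduce(100019) { |acc, e| (acc - e * 207269) & MASK64 }
end

sub_hash([2, 2]) == sub_hash([1, 1]) # => false, no self-cancellation
sub_hash([1, 2]) == sub_hash([2, 1]) # => true, a plain fold is still order-blind
```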
This keeps getting better and better ahahhahah. How about we take a page from DES/AES, which deal with the same issue of having just a single key (in this case our prime number) by using multiple rounds? We could:
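One way to read the multiple-rounds idea (my sketch of the general technique, using the well-known `fmix64` finalizer constants from MurmurHash3, not anything from this thread’s actual proposal): run each term through an invertible mixing round before folding it in.

```ruby
MASK64 = (1 << 64) - 1

# MurmurHash3's fmix64 finalizer: an invertible mixing "round" built from
# xorshifts and odd-constant multiplies (both bijective mod 2**64).
def fmix64(h)
  h ^= h >> 33
  h  = (h * 0xff51afd7ed558ccd) & MASK64
  h ^= h >> 33
  h  = (h * 0xc4ceb9fe1a85ec53) & MASK64
  h ^ (h >> 33)
end

# Toy combiner: mix position into each element, round it, fold, round again.
def round_hash(ary)
  ary.each_with_index.reduce(100019) { |acc, (e, i)| fmix64(acc ^ fmix64(e ^ i)) }
end

round_hash([2, 2]) # differs from round_hash([1, 1]) in practice
```

Because each round is a bijection, no information is lost per step, and the heavy mixing makes the accidental cancellations discussed above vanishingly unlikely.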
PS: I hope you find more issues even if this fixes the above; I’m absolutely loving it 😄