natalie: `Array#hash` producing too many collisions

I’m currently tinkering around with Hash#hash and noticed this bug in Array#hash:

[2, 2].hash == [1, 1].hash
# => true

[2, 2].hash returns 100019, which is the starting point of the hash calculation, because (2.hash * 207269) ^ (2.hash * 207269) cancels out to zero. My first thought for fixing this is multiplying by a random prime number instead of a fixed one. Would that work?
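For reference, here is a minimal Ruby sketch of the combining scheme as described above (constants taken from this report; a toy model, not Natalie's actual C++ code), which makes the cancellation easy to see:

```ruby
PRIME = 207269  # multiplier from the current implementation
START = 100019  # starting point of the hash calculation

# Toy model of the combiner described above: XOR each element's
# hash, multiplied by a fixed prime, into the accumulator.
def naive_array_hash(array)
  array.reduce(START) { |acc, item| acc ^ (item.hash * PRIME) }
end

naive_array_hash([2, 2]) # => 100019, the two equal terms XOR to zero
naive_array_hash([1, 1]) # => 100019 as well
```

Any array made of pairs of equal elements collapses back to the starting value, since `x ^ x == 0` regardless of what `x` is.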

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

I’m leaving this open for a few days until @TheGrizzlyDev decides to stop poking at it and moves on. 😆 😆

Naive question: what speaks against adding the hashes instead of XORing them? Then we would not have the problem of entries cancelling each other out. However, I could see overflow being a problem, maybe?
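For what it's worth, a quick sketch of the additive variant (same toy model and constants as the report, not Natalie's code) suggests it fixes the pairwise cancellation but stays commutative, so reordered arrays would still collide:

```ruby
PRIME = 207269
START = 100019

# Toy model: fold with + instead of ^. Ruby integers don't overflow;
# in C the addition would simply wrap around nat_int_t instead.
def additive_array_hash(array)
  array.reduce(START) { |acc, item| acc + item.hash * PRIME }
end

# Equal pairs no longer collapse back to the starting value...
additive_array_hash([2, 2]) # almost certainly != START now
# ...but addition is still commutative, so reordering collides:
additive_array_hash([1, 2]) == additive_array_hash([2, 1]) # => true
```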

Thanks for fixing this so fast! 😄 Sadly, I think this is not enough yet. Firstly, it does not work with nested arrays:

[[1, 2], [3, 2]].hash == [[1, 3], [2, 3]].hash
# => true

The two 2s and the two 3s cancel each other out.

Secondly, as nat_int_t is “only” 64 bits wide, it will not work if we shift more than 64 times (i.e. when the array is longer than 64 elements):

def generate_array(i)
  [].tap do |out|
    out << i
    63.times { out << 0 }
    out << i
  end
end

generate_array(2).hash == generate_array(3).hash
# => true

Edit: Copied the wrong generate_array method.
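To illustrate why that collides, here is a toy model (my own sketch, assuming the combiner shifts each element's hash left by its index; on x86 a C shift count is effectively taken modulo 64, which the `% 64` below simulates):

```ruby
MASK = (1 << 64) - 1  # simulate a 64-bit nat_int_t in Ruby's bignums

# Hypothetical combiner: XOR in each element's hash shifted left by
# its index. Once the index reaches 64, the shift count wraps around
# (as a hardware shift does), so positions 0 and 64 receive the same
# shift and their equal contributions cancel under XOR.
def shifting_array_hash(array)
  array.each_with_index.reduce(100019) do |acc, (item, index)|
    acc ^ ((item.hash << (index % 64)) & MASK)
  end
end

def generate_array(i)
  [i] + [0] * 63 + [i]  # i sits at positions 0 and 64
end

shifting_array_hash(generate_array(2)) == shifting_array_hash(generate_array(3))
# => true: the two i terms cancel, regardless of what i is
```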

@TheGrizzlyDev you’re a genius. (That worked.)

Ahahahaha thanks! I’ve spent enough time on it to think in hashes

There’s so much in the way of performance optimization we can do; first and foremost will be a method cache. But we’ll have plenty of time for that… 😉

About that: I’ve done some preliminary testing and found that:

  • this implementation of hash collides far less than the MRI one 😄
  • we can improve a lot just by changing the number of digits we shift by (my theory about how that affects the output was right, but my initial guess was really bad ahahahaha) and this should be easy enough :^)
  • the implementation of Array::repeated_combination I just pushed is unbearably slow, especially compared with MRI’s, so we should try to improve that as well to get a better test bench (getting repeated_transpositions would probably help as well)

I’ll go a bit deeper tomorrow and maybe try to improve it all, but there’s quite a few quick wins here and this has been quite fun :^)
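The kind of collision bench described above might look roughly like this (my own sketch using MRI's `Array#repeated_combination`; the setup is illustrative, not Natalie's actual test harness):

```ruby
# Enumerate every multiset of 3 values drawn from 0..20 and count how
# many of the resulting arrays end up sharing an Array#hash value.
values = (0..20).to_a
combos = values.repeated_combination(3).to_a
hashes = combos.map(&:hash)
collisions = hashes.size - hashes.uniq.size

combos.size # => 1771 combinations (binomial(23, 3))
```

Running the same loop against both implementations and comparing the `collisions` counts gives a crude but direct way to compare distributions.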

OK, I took another stab at it. The main problem was that I was calling Array::eql and treating it like a bool (I forgot to call is_truthy()), so that was half the battle. I also decided to XOR in the array index and added MOAR PRIMES lol. 😂
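I don't know the exact shape of the commit, but XORing in the array index can be sketched like this (a hypothetical toy model, same constants as earlier in the thread), and it kills the original collision:

```ruby
PRIME = 207269
START = 100019

# Toy model: mix each element's position into its contribution, so
# equal elements at different indices no longer produce equal terms
# that would cancel under XOR.
def indexed_array_hash(array)
  array.each_with_index.reduce(START) do |acc, (item, index)|
    acc ^ ((item.hash ^ index) * PRIME)
  end
end

indexed_array_hash([2, 2]) == START # => false: the two terms now differ
```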

Anyway, let me know if y’all find any more weird cases on this one @richardboehme @TheGrizzlyDev. ❤️

How about subtraction though? Unlike addition it is non-commutative, so ordering is implicitly taken into account; unlike xor it doesn’t cancel out the previous value but rather “adds” to it; and it wouldn’t overflow in the same way as addition, which would lead to a less effective distribution of 0s and 1s.
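One caveat worth checking here (a toy sketch, using small integers as stand-in hash values so the demo is deterministic): a left fold that merely subtracts each term is still symmetric, since acc - a - b == acc - (a + b). Order only starts to matter if the accumulator itself is negated each step:

```ruby
PRIME = 207269
START = 100019

# Plain subtraction fold: still order-insensitive, because
# acc - a - b == acc - (a + b).
def plain_sub_hash(array)
  array.reduce(START) { |acc, item| acc - item * PRIME }
end

# Putting the accumulator on the right of the subtraction flips its
# sign every step, so element order actually changes the result.
def alternating_sub_hash(array)
  array.reduce(START) { |acc, item| item * PRIME - acc }
end

plain_sub_hash([1, 2]) == plain_sub_hash([2, 1])             # => true
alternating_sub_hash([1, 2]) == alternating_sub_hash([2, 1]) # => false
```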

This keeps getting better and better ahahhahah. How about we take a page from DES/AES, which face the same issue of running multiple rounds with just a single key (in our case the prime number)? We could:

  • select a big enough prime number (or any number big enough that, taken in smaller chunks, still has a somewhat reasonable distribution of 0s and 1s, and that has more bits than nat_int_t)
  • for each item we hash, XOR it with a prime number AND then XOR it with a 64-bit subkey taken from the aforementioned number
  • do a circular shift (a bit rotation: a shift where a bit ‘leaving’ on one side ‘comes back’ in on the other side) and then repeat the process for the next item?
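A circular shift is usually called a bit rotation; a rotation helper plus one round of the scheme sketched above might look like this in Ruby (the subkey constant, rotation amount, and round shape are all my guesses to make the idea concrete, not a finished proposal):

```ruby
MASK = (1 << 64) - 1  # simulate a 64-bit nat_int_t

# Rotate a 64-bit value left by n: bits leaving on the left
# re-enter on the right.
def rotl64(x, n)
  n %= 64
  ((x << n) | ((x & MASK) >> (64 - n))) & MASK
end

PRIME  = 207269
SUBKEY = 0x9E3779B97F4A7C15  # assumed 64-bit subkey constant

# One round per element: XOR the item's hash with the prime, then
# with the subkey, fold it in, and rotate the accumulator.
def round_hash(array)
  array.reduce(100019) do |acc, item|
    mixed = (item.hash ^ PRIME) ^ SUBKEY
    rotl64(acc ^ mixed, 13) # hypothetical rotation amount
  end
end
```

Because the rotation is circular, no bits are ever lost the way they are with a plain shift, so long arrays do not silently drop the contributions of their early elements.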

PS: I hope you find more issues if this fixes the above, absolutely loving it 😄