natalie: `Array#hash` producing too many collisions

I’m currently tinkering around with Hash#hash and noticed this bug in Array#hash:

[2, 2].hash == [1, 1].hash
# => true

[2, 2].hash returns 100019, which is the starting point of the hash calculation, because (2.hash * 207269) ^ (2.hash * 207269) cancels out to zero. My first thought for fixing this is multiplying by a random prime number instead of a fixed one. Would that work?
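For reference, here is a minimal Ruby sketch of the combining scheme as described above (constants taken from this report; a toy model, not Natalie's actual C++ code), which makes the cancellation easy to see:

```ruby
PRIME = 207269  # multiplier from the current implementation
START = 100019  # starting point of the hash calculation

# Toy model of the combiner described above: XOR each element's
# hash, multiplied by a fixed prime, into the accumulator.
def naive_array_hash(array)
  array.reduce(START) { |acc, item| acc ^ (item.hash * PRIME) }
end

naive_array_hash([2, 2]) # => 100019, the two equal terms XOR to zero
naive_array_hash([1, 1]) # => 100019 as well
```

Any array made of pairs of equal elements collapses back to the starting value, since `x ^ x == 0` regardless of what `x` is.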

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

I’m leaving this open for a few days until @TheGrizzlyDev decides to stop poking at it and moves on. 😆 😆

Naive question: what speaks against adding the hashes instead of XORing them? Then we would not have the problem of entries cancelling each other out. However, I could see overflow being a problem, maybe?
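For what it's worth, a quick sketch of the additive variant (same toy model and constants as the report, not Natalie's code) suggests it fixes the pairwise cancellation but stays commutative, so reordered arrays would still collide:

```ruby
PRIME = 207269
START = 100019

# Toy model: fold with + instead of ^. Ruby integers don't overflow;
# in C the addition would simply wrap around nat_int_t instead.
def additive_array_hash(array)
  array.reduce(START) { |acc, item| acc + item.hash * PRIME }
end

# Equal pairs no longer collapse back to the starting value...
additive_array_hash([2, 2]) # almost certainly != START now
# ...but addition is still commutative, so reordering collides:
additive_array_hash([1, 2]) == additive_array_hash([2, 1]) # => true
```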

Thanks for fixing this so fast! 😄 Sadly, I think this is not enough yet. Firstly, it does not work with nested arrays:

[[1, 2], [3, 2]].hash == [[1, 3], [2, 3]].hash
# => true

The two 2s and the two 3s cancel each other out.

Secondly, as nat_int_t is “only” 64 bits wide, it will not work if we shift more than 64 times (i.e. when the array is longer than 64 elements):

def generate_array(i)
  [].tap do |out|
    out << i
    63.times { out << 0 }
    out << i
  end
end

generate_array(2).hash == generate_array(3).hash
# => true

Edit: Copied the wrong generate_array method.
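To illustrate why that collides, here is a toy model (my own sketch, assuming the combiner shifts each element's hash left by its index; on x86 a C shift count is effectively taken modulo 64, which the `% 64` below simulates):

```ruby
MASK = (1 << 64) - 1  # simulate a 64-bit nat_int_t in Ruby's bignums

# Hypothetical combiner: XOR in each element's hash shifted left by
# its index. Once the index reaches 64, the shift count wraps around
# (as a hardware shift does), so positions 0 and 64 receive the same
# shift and their equal contributions cancel under XOR.
def shifting_array_hash(array)
  array.each_with_index.reduce(100019) do |acc, (item, index)|
    acc ^ ((item.hash << (index % 64)) & MASK)
  end
end

def generate_array(i)
  [i] + [0] * 63 + [i]  # i sits at positions 0 and 64
end

shifting_array_hash(generate_array(2)) == shifting_array_hash(generate_array(3))
# => true: the two i terms cancel, regardless of what i is
```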

@TheGrizzlyDev you’re a genius. (That worked.)

Ahahahaha thanks! I’ve spent enough time on it to think in hashes

There’s so much in the way of performance optimization we can do; first and foremost will be a method cache. But we’ll have plenty of time for that… 😉

About that: I’ve done some preliminary testing and found that:

  • this implementation of hash collides far less than the MRI one 😄
  • we can improve a lot just by changing the number of digits we shift by (my theory about how that affects the output was right, but my initial guess was really bad ahahahaha) and this should be easy enough :^)
  • the implementation of Array::repeated_combination I just pushed is unbearably slow, especially compared with MRI’s, so we should try to improve that as well to get a better test bench (getting repeated_transpositions would probably help as well)

I’ll go a bit deeper tomorrow and maybe try to improve it all, but there’s quite a few quick wins here and this has been quite fun :^)
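The kind of collision bench described above might look roughly like this (my own sketch using MRI's `Array#repeated_combination`; the setup is illustrative, not Natalie's actual test harness):

```ruby
# Enumerate every multiset of 3 values drawn from 0..20 and count how
# many of the resulting arrays end up sharing an Array#hash value.
values = (0..20).to_a
combos = values.repeated_combination(3).to_a
hashes = combos.map(&:hash)
collisions = hashes.size - hashes.uniq.size

combos.size # => 1771 combinations (binomial(23, 3))
```

Running the same loop against both implementations and comparing the `collisions` counts gives a crude but direct way to compare distributions.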

OK, I took another stab at it. The main problem was that I was calling Array::eql and treating it like a bool (I forgot to call is_truthy()), so that was half the battle. I also decided to XOR in the array index and added MOAR PRIMES lol. 😂
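I don't know the exact shape of the commit, but XORing in the array index can be sketched like this (a hypothetical toy model, same constants as earlier in the thread), and it kills the original collision:

```ruby
PRIME = 207269
START = 100019

# Toy model: mix each element's position into its contribution, so
# equal elements at different indices no longer produce equal terms
# that would cancel under XOR.
def indexed_array_hash(array)
  array.each_with_index.reduce(START) do |acc, (item, index)|
    acc ^ ((item.hash ^ index) * PRIME)
  end
end

indexed_array_hash([2, 2]) == START # => false: the two terms now differ
```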

Anyway, let me know if y’all find any more weird cases on this one @richardboehme @TheGrizzlyDev. ❤️

How about subtraction though? Unlike addition it is non-commutative, so ordering is implicitly taken into account; unlike xor it doesn’t cancel out the previous value but rather “adds” to it; and it wouldn’t overflow in the same way as addition, which would lead to a less effective distribution of 0s and 1s.
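One caveat worth checking here (a toy sketch, using small integers as stand-in hash values so the demo is deterministic): a left fold that merely subtracts each term is still symmetric, since acc - a - b == acc - (a + b). Order only starts to matter if the accumulator itself is negated each step:

```ruby
PRIME = 207269
START = 100019

# Plain subtraction fold: still order-insensitive, because
# acc - a - b == acc - (a + b).
def plain_sub_hash(array)
  array.reduce(START) { |acc, item| acc - item * PRIME }
end

# Putting the accumulator on the right of the subtraction flips its
# sign every step, so element order actually changes the result.
def alternating_sub_hash(array)
  array.reduce(START) { |acc, item| item * PRIME - acc }
end

plain_sub_hash([1, 2]) == plain_sub_hash([2, 1])             # => true
alternating_sub_hash([1, 2]) == alternating_sub_hash([2, 1]) # => false
```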

This keeps getting better and better ahahhahah. How about we take a page from DES/AES, which face the same issue of running multiple rounds with just a single key (in our case the prime number)? We could:

  • select a big enough prime number (or any number big enough that, taken in smaller chunks, still has a somewhat reasonable distribution of 0s and 1s, and that has more bits than nat_int_t)
  • for each item we hash, XOR it with a prime number AND then XOR it with a 64-bit subkey taken from the aforementioned number
  • do a circular shift (a bit rotation: a shift where a bit ‘leaving’ on one side ‘comes back’ in on the other side) and then repeat the process for the next item?
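A circular shift is usually called a bit rotation; a rotation helper plus one round of the scheme sketched above might look like this in Ruby (the subkey constant, rotation amount, and round shape are all my guesses to make the idea concrete, not a finished proposal):

```ruby
MASK = (1 << 64) - 1  # simulate a 64-bit nat_int_t

# Rotate a 64-bit value left by n: bits leaving on the left
# re-enter on the right.
def rotl64(x, n)
  n %= 64
  ((x << n) | ((x & MASK) >> (64 - n))) & MASK
end

PRIME  = 207269
SUBKEY = 0x9E3779B97F4A7C15  # assumed 64-bit subkey constant

# One round per element: XOR the item's hash with the prime, then
# with the subkey, fold it in, and rotate the accumulator.
def round_hash(array)
  array.reduce(100019) do |acc, item|
    mixed = (item.hash ^ PRIME) ^ SUBKEY
    rotl64(acc ^ mixed, 13) # hypothetical rotation amount
  end
end
```

Because the rotation is circular, no bits are ever lost the way they are with a plain shift, so long arrays do not silently drop the contributions of their early elements.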

PS: I hope you find more issues if this fixes the above, absolutely loving it 😄