Why Merging Hashes With Xor Can Be Dangerous?
Janne Kemppainen |If you have two distinct hash values that are created by a hashing algorithm that creates a uniformly random distribution then the result of the XOR operation between these values should also follow the same distribution. But why XOR might still not be the best choice for combining the values?
In this post I’ll comment about potentional problems as well as an example where the assumption of the distribution didn’t actually work as expected.
Common problems
The XOR (exclusive or) operation does not care about the order of the inputs. Therefore A ^ B == B ^ A
. This can be undesired, especially if there are multiple hashes that need to be combined.
Another problem is that dentical values combine to zero. If you use XOR to combine hashes with identical values the end result will always be zero. And if you combine multiple values that may contain duplicates those will cancel each other out, meaning that the following is also true:
A ^ B ^ A == B
Problematic example: Rendezvous Hashing
Of course I’m writing this article because I had a bad experience with combining hashes with this naive method. Let’s start from the beginning.
The algorithm
The rendezvous hashing algorithm, also known as highest random weight (HRW) hashing, is an algorithm that can be used to assign a key value to a node from a pool of choices consistently. It is quite simple to implement, efficient, and it has really nice load balancing features where keys are assigned evenly between the nodes, also when new nodes enter or exit the pool.
The non-weighted variant is extremely simple to implement. Here’s a working example written in Python:
import xxhash
def rendezvous_hash(key, nodes):
selected = None
highest_score = -1
for node in nodes:
score = xxhash.xxh64_intdigest(f"{key}{node}")
if score > highest_score:
selected = node
highest_score = score
return selected
The algorithm, in all its’ simplicity, calculates hash values from the combinations of the key with every node and then returns the one that got the highest value.
My optimization attempt
There’s no XOR in sight, why am I even talking about this?
Well, I tried to add some (premature) performance optimizations. Imagine that you need to perform the selection from a list of hundreds of nodes that stay the same between the selections. With the current implementation that would mean that every key would require a hash computation with all nodes.
So naturally a thought emerged, what if we would pre-calculate the hash values for the nodes and only calculate one hash per function call for the key, saving hundreds of calculations? And because the hashes would already be random couldn’t we calculate the scores by simply combining the hashes with XOR?
Turns out I was wrong. Did you already notice the error in my thinking?
What went wrong?
The problem is that while the hashes of the keys and the nodes are all random the key is still a common denominator in the score calculation. Therefore the node that has the most “unique” significant bits gets a bigger share of the cake.
Assume that we have three nodes that have the following hashes:
- A:
1001
- B:
0100
- C:
0110
When using XOR for the hash combination all keys whose hash starts with a zero would be assigned to A. The key hashes starting with a one would be divided between B and C. This is because the most significant bit is enough to win the score if all other candidates start with zero.
Therefore the node A is actually twice as likely to be selected when compared to the other two even though the algorithm is supposed to distribute the load evenly!
Lessons learnt
So when you make mistakes make sure that you try to learn from them. Here are some of my key takeaways:
- Test your algorithms so that they actually have the properties that you’re assuming
- Test the actual performance before making optimizations to see if they are even required
- (Use a battle proven implementation if possible)
I still don’t know what would be a good replacement for the XOR operation in my “optimized” code. Share your solution if you happen to find one!
Discuss on Twitter
I tried to merge hash values with XOR and it didn't go well. Do you know a good alternative for Python?https://t.co/6oeM1He7Hd
— Janne Kemppainen (@pakstech) May 15, 2020
Previous post
Reviewing Typora, My New Favorite Markdown EditorNext post
Create Your Own Python Packages