Multiplicative Hashing Functions -- Notes on Primes, Golden Ratio, and Evil

Introduction

Mapping partitions (or keys) to slices (or buckets) in a distributed or sharded system has a large impact on performance. Different hash-based solutions for this problem exist; each has drawbacks. The goal is to choose a hash function that is simple to implement and gives acceptable performance with as few collisions as possible.

The problem is defined as follows: given a logical partition number $P$, compute the corresponding slice number $S = s(P)$, where $0 ≤ P < 2^{32}$, $0 ≤ S < M$, and $M$ denotes the table size (e.g. $M = 2^{14}$). In our “binary” world, the assumption that input data (partition numbers) are uniformly distributed is not always correct. Therefore, the hash function $s: P \to S$ must be designed very carefully: in addition to providing a uniform distribution of hash values (slice numbers), it has to add some randomness. Luckily, this field is well studied: the well-known textbooks of Cormen and Knuth give a good introduction to it. The latter explains things in more detail; therefore, without hesitation, we make use of Knuth’s book (Section 6.4), usually spending two paragraphs on what the book says in just one sentence.

Basic Ideas

A first approach is the division-remainder method with $M$ a power of 2:

$s(P) = P \bmod M$,

where $M = 2^{14}$ is the table size (number of slices). A variant is to take $M$ prime instead:

$s(P) = P \bmod M$,

where $M = 16411$ is a prime $> 2^{14}$. Both forms are called “Division Remainder Method”. These notes focus on another method, the “Multiplicative Method”:

$$f(P) = A \cdot P \bmod W$$

$$s(P) = f(P) \div (W \div M)$$

where $W = 2^{32}$ is the machine word size, $M = 2^{14}$ is the table size, $\div$ denotes integer division, $W \div M = 2^{32 - 14} = 2^{18}$, and $A = 2654435761$ is some “magic” integer. It will be explained later; for now we assume that, due to this magic, the value of $f(P)$ is almost random. It is easy to see that $s(P)$ lies in the valid interval of slices, $0 \le s(P) < M = 2^{14}$.
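
As a quick sanity check of these formulas, take $P = 1$: then $f(1) = A \bmod 2^{32} = 2654435761$ and $s(1) = 2654435761 \div 2^{18} = 10125$, which is exactly the slice shown for partition 1 in the table below.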

Example of Hash Codes

partition | dec $P$ | hex $P$ | binary $f(P)$ | binary $S$ | hex $S$ | dec $S$
$0$ | 0 | 00000000 | 00000000000000000000000000000000 | 00000000000000 | 0000 | 0
$1$ | 1 | 00000001 | 10011110001101110111100110110001 | 10011110001101 | 278D | 10125
$2$ | 2 | 00000002 | 00111100011011101111001101100010 | 00111100011011 | 0F1B | 3867
$3$ | 3 | 00000003 | 11011010101001100110110100010011 | 11011010101001 | 36A9 | 13993
$2^{14}-1$ | 16383 | 00003FFF | 01000000001101001100011001001111 | 01000000001101 | 100D | 4109
$2^{14}$ | 16384 | 00004000 | 11011110011011000100000000000000 | 11011110011011 | 379B | 14235
$2^{14}+1$ | 16385 | 00004001 | 01111100101000111011100110110001 | 01111100101000 | 1F28 | 7976
$2^{14}+2$ | 16386 | 00004002 | 00011010110110110011001101100010 | 00011010110110 | 06B6 | 1718
$2^{15}-1$ | 32767 | 00007FFF | 00011110101000010000011001001111 | 00011110101000 | 07A8 | 1960
$2^{15}$ | 32768 | 00008000 | 10111100110110001000000000000000 | 10111100110110 | 2F36 | 12086
$2^{15}+1$ | 32769 | 00008001 | 01011011000011111111100110110001 | 01011011000011 | 16C3 | 5827
$2^{15}+2$ | 32770 | 00008002 | 11111001010001110111001101100010 | 11111001010001 | 3E51 | 15953
$2^{30}-1$ | 1073741823 | 3FFFFFFF | 10100001110010001000011001001111 | 10100001110010 | 2872 | 10354
$2^{30}$ | 1073741824 | 40000000 | 01000000000000000000000000000000 | 01000000000000 | 1000 | 4096
$2^{30}+1$ | 1073741825 | 40000001 | 11011110001101110111100110110001 | 11011110001101 | 378D | 14221
$2^{30}+2$ | 1073741826 | 40000002 | 01111100011011101111001101100010 | 01111100011011 | 1F1B | 7963
$2^{31}-1$ | 2147483647 | 7FFFFFFF | 11100001110010001000011001001111 | 11100001110010 | 3872 | 14450
$2^{31}$ | 2147483648 | 80000000 | 10000000000000000000000000000000 | 10000000000000 | 2000 | 8192
$2^{31}+1$ | 2147483649 | 80000001 | 00011110001101110111100110110001 | 00011110001101 | 078D | 1933
$2^{31}+2$ | 2147483650 | 80000002 | 10111100011011101111001101100010 | 10111100011011 | 2F1B | 12059
$2^{32}-1$ | 4294967295 | FFFFFFFF | 01100001110010001000011001001111 | 01100001110010 | 1872 | 6258

Implementation in C

The new hashing function looks a bit complicated, so let’s consider how it (and the others) can be implemented in C code. The type uint32_t denotes a 32-bit unsigned integer (declared in <stdint.h>). We start with examples of Division Remainder hashing. For $M$ being a power of 2:

#include <stdint.h>          /* uint32_t */

const uint32_t M = 1 << 14;  /* 2^14 slices */

uint32_t slice(uint32_t P) {
    return P & (M - 1);      /* P mod 2^14 */
}

For prime $M$:

const uint32_t M = 16411;

uint32_t slice(uint32_t P) {
    return P % M;
}

The next function implements Multiplicative hashing:

const uint32_t M = 1 << 14;        /* table size; implicit in the 18-bit shift below */
const uint32_t A = 2654435761u;

uint32_t slice(uint32_t P) {
    return (P * A) >> 18;          /* 14 most significant bits of A*P mod 2^32 */
}

It works well for arbitrary input data and allows using the same number of slices $M = 2^{14}$. As the reader can see, the multiplicative version uses only a multiplication and a logical shift; on some architectures it can be faster than computing a modulo.
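
As a small self-check, the following standalone sketch (assuming the constants above and a C99 compiler) recomputes a few rows of the table from the Example of Hash Codes section:

#include <stdio.h>
#include <stdint.h>

static const uint32_t A = 2654435761u;      /* the "magic" multiplier */

/* Multiplicative hash: the 14 most significant bits of A*P mod 2^32. */
static uint32_t slice(uint32_t P) {
    return (P * A) >> 18;
}

int main(void) {
    /* Partitions taken from the table above; expected slices: 10125, 14235, 4096, 6258. */
    uint32_t samples[] = { 1u, 16384u, 1073741824u, 4294967295u };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        printf("P = %10lu  ->  S = %5lu\n",
               (unsigned long)samples[i], (unsigned long)slice(samples[i]));
    return 0;
}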

Details for Math Fans

In this section we describe the properties of every approach. By analyzing the simple functions, we demonstrate the informal tools that we later apply to the more complex ones. We hope that even readers who are not great fans of mathematics will be able to catch the main point.

The Power of 2 is Evil

The first function we will consider is $s(P) = P \bmod 2^{14}$. Let’s recall some properties of this function.

Flipping just one bit in a given partition does not always lead to a change in the corresponding slice (hash value). More formally: for some bit $i$, “occasionally” $s(P) = s(P + 2^i)$ holds; here we assume for simplicity that bit $i$ of $P$ is cleared. In our “binary” world, partitions differing in only one bit may be strongly correlated. Therefore, we expect a “good” hash to produce different hash codes for them (i.e., to place such partitions in different slices). But for this function the expectation fails for any bit $i ≥ 14$: flipping it never changes the slice. There are also simple templates of partitions having the same hash code. For example, for any $i$ and $S$, the partitions $P = S + 2^{14} \cdot i$ are all placed by the function into the same slice $S$. From a “good” hash we expect no such very common patterns, especially binary ones. But again, this function fails this condition.
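
A minimal sketch of this collision pattern, assuming the power-of-2 slice() from the Implementation section:

#include <stdio.h>
#include <stdint.h>

static const uint32_t M = 1u << 14;                 /* 16384 slices */

static uint32_t slice(uint32_t P) { return P & (M - 1); }

int main(void) {
    /* Partitions of the form P = S + 2^14 * i all collide on slice S. */
    uint32_t S = 5;
    for (uint32_t i = 0; i < 4; i++)
        printf("slice(%u) = %u\n", (unsigned)(S + i * M), (unsigned)slice(S + i * M));
    /* prints slice 5 four times */
    return 0;
}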

The Art of Primes

The second function is $s(P) = P \bmod M$, where $M = 16411$. For this function, flipping any bit in a partition $P$ changes the hash code, because $M$ is chosen to be prime. Actually, this condition on $M$ is stronger than necessary: for this property it is sufficient that $M ≠ 2^i$. For example, $M = 15 · 12 · 97 = 17460$ (15 modules, 12 disks, 97 a prime) is also “good”. A small check of the bit-flipping property is sketched below.
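
The sketch simply flips each of the 32 bits of one sample partition and confirms that the remainder modulo 16411 changes (it does, since $2^i$ is never divisible by an odd $M > 1$):

#include <stdio.h>
#include <stdint.h>

static const uint32_t M = 16411;                    /* prime table size */

static uint32_t slice(uint32_t P) { return P % M; }

int main(void) {
    uint32_t P = 123456789u;                        /* arbitrary sample partition */
    for (int i = 0; i < 32; i++) {
        uint32_t Q = P ^ (1u << i);                 /* flip bit i */
        if (slice(Q) == slice(P))
            printf("bit %d does NOT change the slice\n", i);
    }
    printf("done\n");                               /* the line above is never printed */
    return 0;
}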

Since $M$ is prime, it seems that the pattern $P = S + M · i$ is not common. But primes that are close to a power of 2 are also not good. Knuth recommends choosing $M$ such that the condition $r^j ≡ \pm a \pmod M$ does not hold for any small integers $a$ and $j$, where $r$ denotes the radix (base) of the keys. From Knuth’s explanation, the right choice of $r$ for our case is not entirely clear: is it $r = 2$, $r = 16$, or $r = 256$? The answer seems to depend strongly on the type of input data.

By Knuth, if $r=2$, the chosen $M$ is not so good, since $M = 16411 = 2^{14} + 27$, and hence $2^{14} ≡ -27 \pmod M$. For $r=16$, we get that $16^4 ≡ -108 \pmod M$. For $r=256$, $256^2 ≡ -108 \pmod M$.
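
These congruences are easy to check directly: $2^{14} = 16384 = 16411 - 27$, and $16^4 = 256^2 = 65536 = 4 \cdot 16411 - 108$.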

Knuth explains that such an $M$ may produce a hash code that is a simple composition of the key digits (in the base-$r$ number system). Instead of trying to understand this explanation, we give some intuition. Working with numbers, a programmer usually chooses powers of 2 for the sizes of structures and buffers (e.g., $2^{10}$ bytes). Then he defines the format of such data and introduces headers (e.g., a header size of 20 bytes). Hence, the size of the data without the header becomes very close to a power of 2 (in our example, $2^{10} - 20 = 1004$). On the other hand, embedding this structure into an outer packet (assume the size of this outer header is 30 bytes) leads to a total size that is also close to a power of 2 ($2^{10} + 30 = 1054$). As a result, many numbers in our “binary” world are either powers of 2 or close to them. Therefore, such a choice of $M$ increases collisions. In other words, not only are powers of 2 evil, but primes close to them are evil too.
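
To make this concrete, here is a minimal sketch using the prime slice() from the Implementation section: because $2^{14} ≡ -27 \pmod{16411}$, partitions spaced $2^{14}$ apart land in slices that simply step down by 27, a very regular pattern.

#include <stdio.h>
#include <stdint.h>

static const uint32_t M = 16411;                    /* prime just above 2^14 */

static uint32_t slice(uint32_t P) { return P % M; }

int main(void) {
    /* "Buffer-like" partitions: a base value plus multiples of 2^14. */
    uint32_t S = 1000;
    for (uint32_t i = 0; i < 5; i++)
        printf("slice(%lu) = %lu\n",
               (unsigned long)(S + (i << 14)),
               (unsigned long)slice(S + (i << 14)));
    /* prints 1000, 973, 946, 919, 892: each step subtracts 27 */
    return 0;
}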

As an example of a “good” prime, let’s consider $M = 24571$. It is a bit smaller than the midpoint between $2^{14}$ and $2^{15}$.

Introducing a small change into this hashing function, we get a better hash even for $M = 2^{14}$: $s(P) = P \bmod N \bmod M$. In this case $N$ must be some “big” prime that is “far” from any power of 2. The discussion of this method is out of the scope of these Notes.

The Magic of Golden Ratio

We now present the multiplicative hashing in detail. First, we discuss the properties of the function $f(P) = A · P \bmod W$.

The magic number $A$ is chosen to have the following nice properties: $W / A ≈ 1.6$, the golden ratio, and $GCD(A, W) = 1$ ($A$ and $W$ are relatively prime, or coprime). The coprimality makes the function $f$ 1-1 on 32-bit integers; in other words, for any $P_1 ≠ P_2$, $f(P_1) ≠ f(P_2)$ holds. This function plays the role of a perturbation, or randomization, of the input data (partitions).

Assume we are given a partition $P$ whose bit 0 is cleared, and we set it. Then $f(P + 1) = (f(P) + A) \bmod W$, so flipping bit 0 “dirties” those bits of $f(P)$ in which $A$ has set bits (plus carry propagation). This implies that $A$ must be chosen from the range $2^{31} < A < 2^{32}$ and, in addition, should have about half of its bits set. By Knuth, the best choice is the golden-ratio value of $W$: $W / A = 1.6\ldots$. In this case, setting bit $i$ of $P$ dirties all the bits of $f(P)$ from $i$ to 31.
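
The following sketch illustrates this “dirtying” behaviour: it XORs $f(P)$ with $f(P + 2^i)$, so the printed value has zero bits below position $i$ and perturbed bits from $i$ upwards (the constant $A$ is the one defined above).

#include <stdio.h>
#include <stdint.h>

static const uint32_t A = 2654435761u;              /* golden-ratio multiplier */

static uint32_t f(uint32_t P) { return P * A; }     /* A*P mod 2^32 */

int main(void) {
    uint32_t P = 0;                                  /* any partition with bit i cleared works */
    for (int i = 0; i < 32; i += 8) {
        uint32_t diff = f(P) ^ f(P + ((uint32_t)1 << i));
        printf("i = %2d   f(P) xor f(P + 2^i) = %08lx\n", i, (unsigned long)diff);
    }
    return 0;
}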

Using the properties of $f()$, we now give the intuition for why the function $s(P) = f(P) \div 2^{18}$ is a “good” hash function.

The function $s$ keeps the 14 most significant bits of $f(P)$. Therefore, by the properties of $f()$, flipping any bit of $P$ flips at least one bit of $s(P)$. The function $s$ puts into one slice $S$ all the partitions of the form $P = (S \cdot 2^{18} + i) \cdot B \bmod 2^{32}$, where $B = 244002641$ (see the Invertibility section for details) and $0 ≤ i < 2^{18}$. Setting $P_0 = S \cdot B \cdot 2^{18} \bmod 2^{32}$, we get $P_i = (P_0 + i \cdot B) \bmod 2^{32}$. The function $g(i) = i \cdot B \bmod 2^{32}$ produces almost random 32-bit integers from 18-bit integers. Therefore, it is almost impossible to derive a “nice” subsequence from the values of the set $\{g(i) \mid 0 ≤ i < 2^{18}\}$.
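
A short sketch of this preimage structure, using the constants above and the inverse multiplier $B$ from the Invertibility section: it enumerates a few partitions $P_i = (S \cdot 2^{18} + i) \cdot B \bmod 2^{32}$ and confirms they all hash to slice $S$.

#include <stdio.h>
#include <stdint.h>

static const uint32_t A = 2654435761u;              /* multiplier */
static const uint32_t B = 244002641u;               /* inverse of A modulo 2^32 */

static uint32_t slice(uint32_t P) { return (P * A) >> 18; }

int main(void) {
    uint32_t S = 42;                                 /* an arbitrary target slice */
    for (uint32_t i = 0; i < 4; i++) {
        uint32_t P = ((S << 18) + i) * B;            /* i-th partition mapped to slice S */
        printf("i = %lu   P = %10lu   slice(P) = %lu\n",
               (unsigned long)i, (unsigned long)P, (unsigned long)slice(P));
    }
    return 0;
}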

Invertibility

Given a 14-bit slice $S$ and some 18-bit identifier $Id$, we are interested in computing the partition $P$ that the function $s$ places at slice $S$. Note that, given a partition $P$, the computation of $Id$ also depends on the chosen hashing method. For each hashing we show how the identifier and the inverse of $s()$ can be computed.

For the simplest function $s(P) = P \bmod 2^{14}$, the identifier is computed as $Id = P \div 2^{14}$, and the inverse is $p(S, Id) = S + Id · 2^{14}$. For prime $M$, the identifier is $Id = P \div M$ and the inverse function is $p(S, Id) = S + Id · M$. For multiplicative hashing, we have $Id = A · P \bmod 2^{18}$, and then $p(S, Id) = (S · 2^{18} + Id) · B \bmod 2^{32}$. This follows from the fact that $A · B ≡ 1 \pmod{2^{32}}$, so the inverse of $f()$ is $f^{-1}(x) = x · B \bmod 2^{32}$.

We show the implementation of $p()$ in C code for the multiplicative hashing only:

const uint32_t M = 1 << 14;
const uint32_t B = 244002641;

uint32_t partition(uint32_t S, uint32_t Id) {
    return ((S << 18) + Id) * B;   /* invert f(): multiply by B = A^-1 mod 2^32 */
}
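
A quick round-trip check of $p()$ against $s()$ (a sketch, assuming the multiplicative slice() from the Implementation section): for every slice $S$ and identifier $Id$, slice(partition(S, Id)) should return $S$ again, and $Id$ is recovered as $A · P \bmod 2^{18}$.

#include <stdio.h>
#include <stdint.h>

static const uint32_t A = 2654435761u;              /* multiplier */
static const uint32_t B = 244002641u;               /* A*B == 1 (mod 2^32) */

static uint32_t slice(uint32_t P)                  { return (P * A) >> 18; }
static uint32_t partition(uint32_t S, uint32_t Id) { return ((S << 18) + Id) * B; }

int main(void) {
    /* Spot-check the inverse on one (S, Id) pair. */
    uint32_t S = 10125, Id = 1;
    uint32_t P = partition(S, Id);
    printf("P = %lu, slice(P) = %lu (expected %lu)\n",
           (unsigned long)P, (unsigned long)slice(P), (unsigned long)S);
    printf("Id = %lu (expected %lu)\n",
           (unsigned long)((P * A) & ((1u << 18) - 1)), (unsigned long)Id);
    return 0;
}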

Big Numbers

In this section we discuss the behavior of multiplicative hashing on big partition numbers. Recall that flipping bit $i$ of $P$ has an impact on $(32 - i)$ bits (from $i$ to 31) of $f(P)$. By definition, $s(P)$ keeps the 14 most significant bits of $f(P)$. As a result, for $i < 18$, flipping bit $i$ dirties most of the 14 bits of $s(P)$, while for $i ≥ 18$ it changes at most $32 - i ≤ 14$ bits of $s(P)$. For example, flipping the most significant bit (bit 31) of the partition $P$ flips only one bit of the corresponding slice $S = s(P)$. This is not a real problem for our case: the partitions are still placed into different slices.

Another problem with flipping the most significant bits is the following. For a set of partitions differing only in the most significant bits, the number of collisions is greater than for partitions differing in less significant bits. But due to the nice choice of $A$ (being the golden ratio of $W$), the distribution is still sufficiently good. Intuitively, if the partitions differ in the $x$ most significant bits, then their slices will also differ in $x$ bits, so most of these partitions will be placed into different slices.

Note that the number of impacted bits in $s(P)$ decreases only for partitions with more than 18 bits. Even then, as described above, this does not mean a huge degradation of the hashing performance. But if we want to prevent any potentially dangerous case, we may extend the machine word to 64 bits, especially if the target architecture is also 64-bit. In this case, we just compute a new pair of integers $A$ and $B$ ($A$ is the golden-ratio multiplier for $W = 2^{64}$ and is coprime with it; $B$ is the multiplicative inverse of $A$ modulo $W$) and adjust the logical shifts. This is an example of an implementation in C:

const uint64_t M = 1 << 14;
const uint64_t A = 11400714819323198485ULL;   /* 0x9E3779B97F4A7C15 */

uint64_t slice(uint64_t P) {
    return (P * A) >> 50;                     /* 14 most significant bits of A*P mod 2^64 */
}
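
If the inverse mapping is also needed in the 64-bit variant, $B$ can be computed as the multiplicative inverse of $A$ modulo $2^{64}$. One standard way to do this (a sketch; it works for any odd $A$) is Newton's iteration, where each step doubles the number of correct low bits:

#include <stdio.h>
#include <stdint.h>

/* Inverse of an odd 64-bit integer modulo 2^64. */
static uint64_t inverse_mod_2_64(uint64_t a) {
    uint64_t x = a;                 /* a*a == 1 (mod 8), so x is correct to 3 bits */
    for (int i = 0; i < 5; i++)
        x *= 2 - a * x;             /* 3 -> 6 -> 12 -> 24 -> 48 -> 96 correct bits */
    return x;
}

int main(void) {
    const uint64_t A = 11400714819323198485ULL;     /* 0x9E3779B97F4A7C15 */
    uint64_t B = inverse_mod_2_64(A);
    printf("B = %llu, A*B mod 2^64 = %llu\n",
           (unsigned long long)B, (unsigned long long)(A * B));   /* the product prints as 1 */
    return 0;
}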

Example of Big Partitions

The table contains the computation of slices for partitions that differ only in the three most significant bits. As we can see, the corresponding slices also differ only in three bits. This property may be considered a drawback, but note that this case has no collisions.

binary $P$ | hex $P$ | binary $f(P)$ | binary $S$ | hex $S$ | dec $S$
00010101010111010100100101011001 | 155D4959 | 10001101010010011100011110001001 | 10001101010010 | 2352 | 9042
00110101010111010100100101011001 | 355D4959 | 10101101010010011100011110001001 | 10101101010010 | 2B52 | 11090
01010101010111010100100101011001 | 555D4959 | 11001101010010011100011110001001 | 11001101010010 | 3352 | 13138
01110101010111010100100101011001 | 755D4959 | 11101101010010011100011110001001 | 11101101010010 | 3B52 | 15186
10010101010111010100100101011001 | 955D4959 | 00001101010010011100011110001001 | 00001101010010 | 0352 | 850
10110101010111010100100101011001 | B55D4959 | 00101101010010011100011110001001 | 00101101010010 | 0B52 | 2898
11010101010111010100100101011001 | D55D4959 | 01001101010010011100011110001001 | 01001101010010 | 1352 | 4946
11110101010111010100100101011001 | F55D4959 | 01101101010010011100011110001001 | 01101101010010 | 1B52 | 6994

Conclusion

TODO

References

  • T. Cormen, C. Leiserson, R. Rivest, “Introduction to Algorithms”, 1999
  • D. Knuth, “Sorting and Searching”, volume 3 of “The Art of Computer Programming”, 1973. Section 6.4
  • Thomas Wang, “Integer Hash Function”
  • Scott Gasch, “Selecting a Hash Function”
  • Paul Hsieh, “Hash functions”
  • Wikipedia: Hash Table – cites Thomas Wang as saying that the multiplicative hash has “particularly poor clustering behavior”
  • Thomas Wang, “Prime Double Hash Table” – the proof there is wrong
  • http://web.archive.org/web/19990903133921/http://www.concentric.net/~Ttwang/tech/primehash.htm
  • Wikipedia: Talk: Hash Table – note that the proof is wrong; see the “Info about multiplicative hashing is wrong” section
Slava Chernoy
Software Engineer