Xor 1 to 1 hashing

Kragen Javier Sitaker, 2017-07-19 (updated 2017-08-03) (10 minutes)

Suppose you want to map, say, 32-bit words to 32-bit words in an apparently random but easily computable and guaranteed 1-to-1 way, such that you can generate an apparently random sequence by applying this mapping to subsequent values of a counter.

A very simple way to do this is to do a boolean matrix multiply by a nonsingular but otherwise random bit matrix. The roles of multiplication and addition here are taken by AND and XOR, i.e. multiplication and addition in GF(2) — in the 8×8 case, this is the MMIX MXOR instruction. Because of linearity, this will always map 0 to 0, but it can map any nonzero value to any nonzero value.

As a 4-bit example, we can take the following product:

0 1 0 0  b₃
0 1 1 0  b₂
1 0 1 0  b₁
0 1 1 1  b₀

If b₃b₂b₁b₀ is 0001, we get 0111; if b₃b₂b₁b₀ is 0010, we get 1010; if b₃b₂b₁b₀ is 0011, we get 0001 ⊕ 1010 = 1011; and so on. All 16 possible 4-bit bitvectors are produced in this way, but in the somewhat apparently random order 0000, 0111, 1010, 1101, 0110, 0001, 1100, 1011, 0100, 0011, 1110, 1001, 0010, 0101, 1000, 1111. It’s not extremely random — for example, the low-order bit simply repeats the sequence 0, 1, while the high-order bit simply repeats the sequence 0, 0, 1, 1 — but it isn’t obviously ordered either. And it has the advantage that the sequence can be accessed in an arbitrary order. Because it’s all linear, the mapping can also be inverted, which may be an advantage or a disadvantage depending on the context.

Empirically, about 30% of 4×4 bitmatrices are singular, and also about 30% of 8×8 bitmatrices and of 16×16 bitmatrices. This was surprising to me; I am going to conjecture that the actual number is 1/e.

In discrete domains like GF(2), there’s no such thing as an ill-conditioned matrix.

As a longer example, consider this 8×8 matrix:

1 0 1 0 1 0 1 0
1 1 0 0 1 1 0 1
0 1 0 0 1 1 1 0
1 1 0 1 0 0 1 0
1 1 1 1 0 0 1 0
0 1 0 1 1 1 0 0
0 1 0 1 1 0 1 1
0 1 1 1 0 1 0 0

(More compactly, we could say [170, 205, 78, 210, 242, 92, 91, 116] or "ªÍNÒò[t".)

If we apply this matrix to successive counter values, it generates the sequence [0, 116, 91, 47, 92, 40, 7, 115, 242, 134, 169, 221, 174, 218, 245, 129, 210, 166, 137, 253, 142, 250, 213, 161, 32, 84, 123, 15, 124, 8, 39, 83, 78, 58, 21, 97, 18, 102, 73, 61, 188, 200, 231, 147, 224, 148, 187, 207, 156, 232, 199, 179, 192, 180, 155, 239, 110, 26, 53, 65, 50, 70, 105, 29, 205, 185, 150, 226, 145, 229, 202, 190, 63, 75, 100, 16, 99, 23, 56, 76, 31, 107, 68, 48, 67, 55, 24, 108, 237, 153, 182, 194, 177, 197, 234, 158, 131, 247, 216, 172, 223, 171, 132, 240, 113, 5, 42, 94, 45, 89, 118, 2, 81, 37, 10, 126, 13, 121, 86, 34, 163, 215, 248, 140, 255, 139, 164, 208, 170, 222, 241, 133, 246, 130, 173, 217, 88, 44, 3, 119, 4, 112, 95, 43, 120, 12, 35, 87, 36, 80, 127, 11, 138, 254, 209, 165, 214, 162, 141, 249, 228, 144, 191, 203, 184, 204, 227, 151, 22, 98, 77, 57, 74, 62, 17, 101, 54, 66, 109, 25, 106, 30, 49, 69, 196, 176, 159, 235, 152, 236, 195, 183, 103, 19, 60, 72, 59, 79, 96, 20, 149, 225, 206, 186, 201, 189, 146, 230, 181, 193, 238, 154, 233, 157, 178, 198, 71, 51, 28, 104, 27, 111, 64, 52, 41, 93, 114, 6, 117, 1, 46, 90, 219, 175, 128, 244, 135, 243, 220, 168, 251, 143, 160, 212, 167, 211, 252, 136, 9, 125, 82, 38, 85, 33, 14, 122].

Shuffling

One particular use for such arbitrary, randomly-accessible permutations is shuffling — for example, generating a random but repeatable ordering of a playlist. This allows you to resume the shuffled playing at some later time while only storing your position in the playlist and, possibly, the matrix — which, in the 8×8 case, can be represented as a 64-bit number such as 12848028463340370089. (The 32-bit case instead requires a 1024-bit number.)

Only a tiny minority of all possible permutations can be generated by this method. (For example, for the 16×16 case, there are about 10²⁸⁷¹⁸⁸ possible permutations of the 2¹⁶ - 1 nonzero values, but this algorithm can generate only about 10⁷⁶ of them.) Is this subset “fair” in the sense that every nonzero value is equally likely to be second in the permuted sequence?

Special cases

The identity matrix exists, of course, as does a bit-reversal matrix, and every other mere permutation; reflected-binary Gray code (RBGC) is generated by the identity matrix with an extra bit set to the right of the diagonal, and its inverse is the full upper triangular matrix with all bits on the diagonal or to its right set.

1 0 0 0    0 0 0 1    1 1 0 0    1 1 1 1
0 1 0 0    0 0 1 0    0 1 1 0    0 1 1 1
0 0 1 0    0 1 0 0    0 0 1 1    0 0 1 1
0 0 0 1    1 0 0 0    0 0 0 1    0 0 0 1
Identity   Reverse     Gray   Inverse Gray

Homomorphism for XOR

Because this mapping is linear, it’s a homomorphism for XOR (addition in GF(2)), but of course not for other operations such as those in GF(2ⁿ). For example, consider the matrix given earlier:

1 0 1 0 1 0 1 0
1 1 0 0 1 1 0 1
0 1 0 0 1 1 1 0
1 1 0 1 0 0 1 0
1 1 1 1 0 0 1 0
0 1 0 1 1 1 0 0
0 1 0 1 1 0 1 1
0 1 1 1 0 1 0 0

We map 10 (0x0a, LSB in the last row) to 169 (0xa9, LSB being in the last column), 11 (0x0b) to 221 (0xdd), and 1 (10 ⊕ 11) to 116 (0x74). And 169 ⊕ 221 is exactly 116.

The inverse matrix is:

1 1 1 0 1 0 1 0
1 1 0 1 1 1 1 0
0 0 0 1 1 0 0 0
0 1 0 0 1 0 1 1
0 0 0 1 1 1 0 1
1 0 0 0 1 1 0 0
0 1 1 0 1 1 1 1
1 1 1 0 0 1 0 1

And, using this inverse matrix, we can easily do the reverse mapping; it maps 116 back to 1, 221 back to 11, and 169 back to 10.

An interesting question is whether the use of AND as the multiplication function is necessary to get this homomorphism.

Reverse-engineering the matrix from the sequence

For an N×N matrix, given the mappings for any N linearly independent values, you can reconstruct the matrix. What if you are just given M sequential output values, without knowing the corresponding inputs, only that they are sequential counter values? I think you can extract the last roughly lg(M) matrix rows but not necessarily all:

>>> m = bitmatrices.random_nonsingular_matrix(8)
>>> v = [bitmatrices.mxor(m, i) for i in range(20, 29)]
>>> [v[i] ^ v[i+1] for i in range(len(v)-1)]
[30, 38, 30, 64, 30, 38, 30, 141]
>>> v
[1, 31, 57, 39, 103, 121, 95, 65, 204]
>>> m
[25, 239, 222, 170, 205, 171, 56, 30]

The LSB row of the matrix, 30, appears as a difference between half of the adjacent pairs of output values: four of them. The XOR of the last two rows, 56 ⊕ 30 = 38, appears in two positions. There are two other differences, 64 and 141, which are respectively the XOR of the last four and the last three rows, but I think there is no way to distinguish those; and I think the other rows could be anything at all.

I’m not entirely sure because there seems to be a bit more information leaked: we can see that the last two bits of the index changed in the sequence [00, 01, 10, 11, 00, 01, 10, 11, 00], so we know that the selected subset of the first six rows of the matrix (which, as it happens, are just 170 and 171) XOR to 1 at the beginning of this subsequence. But we already know that those first six rows must span a subspace including 1 (supposing our matrix is nonsingular) because we can see that the subspace spanned by 56 and 30 (or 38 and 30) does not include 1.

Topics