A number of algorithms and data structures depend on the
lexicographical ordering of bytestrings; for example, tries, LevelDB,
LC_ALL=C /bin/sort
, the American Flag Sort, and suffix array
construction algorithms. These are often asymptotically higher in
performance than alternatives based entirely on item-to-item
comparisons, and often have better constant factors as well. So it
can be useful to find bytestring encodings of different abstract data
types that preserves those data types’ natural orderings.
There is existing work on this. Dean Landolt’s bytewise
is a
library for encoding arbitary JS data structures as byte strings for
just such purposes. UTF-8 is an algorithm for transforming a sequence
of Unicode codepoints into a sequence of bytes or vice versa, and it
preserves lexicographical ordering in precisely the way I’m talking
about here.
This note, however, is about a specific subproblem: the problem of
encoding a tuple of bytestrings as a bytestring while preserving
lexicographical order. That is, if the alphabet of bytes is Σ
, we
want an injective mapping Σ**
→ Σ*
that is a homomorphism when we
consider the elements of Σ**
and Σ*
as elements of a totally
ordered set whose order is defined lexicographically.
The approach taken by bytewise
for arrays is to encode them as an
array type byte 0xa0, then each item followed by a NUL byte 0x00, then
a final terminating NUL byte 0x00. This is clearly correct if the
encoded items in the array cannot contain NUL bytes, but of course
they can if they themselves are arrays (or, as it happens, numbers,
buffers, or some other types). So bytewise
bytestuffs the item
encodings as follows: an embedded 0x00 as 0x01 0x01 and an embedded
0x01 as 0x01 0x02, and symmetrically, but for other reasons, 0xff and
0xfe are bytestuffed to 0xfe 0xfe and 0xfe 0xfd. This correctly
preserves the lexicographical ordering.
The example currently given in the README is that new
Buffer('ff00fe01', 'hex')
encodes as (hex) 60ff00fe01, 'foo'
encodes as (ASCII) 'pfoo', ['foo']
encodes as (C) "\xa0pfoo\0\0"
,
and [ new Buffer('ff00fe01', 'hex') ]
encodes as (hex)
a060fefe0101fefd01020000.
While this is correct, it has the disadvantage that, for a single level of bytestuffing, the worst-case encoded size is double the decoded size, and, as it happens, the encoding of a value needing bytestuffing will