String tuple encoding

Kragen Javier Sitaker, 2017-04-28 (2 minutes)

A number of algorithms and data structures depend on the lexicographical ordering of bytestrings; for example, tries, LevelDB, LC_ALL=C /bin/sort, the American Flag Sort, and suffix array construction algorithms. These are often asymptotically higher in performance than alternatives based entirely on item-to-item comparisons, and often have better constant factors as well. So it can be useful to find bytestring encodings of different abstract data types that preserves those data types’ natural orderings.

There is existing work on this. Dean Landolt’s bytewise is a library for encoding arbitary JS data structures as byte strings for just such purposes. UTF-8 is an algorithm for transforming a sequence of Unicode codepoints into a sequence of bytes or vice versa, and it preserves lexicographical ordering in precisely the way I’m talking about here.

This note, however, is about a specific subproblem: the problem of encoding a tuple of bytestrings as a bytestring while preserving lexicographical order. That is, if the alphabet of bytes is Σ, we want an injective mapping Σ**Σ* that is a homomorphism when we consider the elements of Σ** and Σ* as elements of a totally ordered set whose order is defined lexicographically.

Bytestuffing

The approach taken by bytewise for arrays is to encode them as an array type byte 0xa0, then each item followed by a NUL byte 0x00, then a final terminating NUL byte 0x00. This is clearly correct if the encoded items in the array cannot contain NUL bytes, but of course they can if they themselves are arrays (or, as it happens, numbers, buffers, or some other types). So bytewise bytestuffs the item encodings as follows: an embedded 0x00 as 0x01 0x01 and an embedded 0x01 as 0x01 0x02, and symmetrically, but for other reasons, 0xff and 0xfe are bytestuffed to 0xfe 0xfe and 0xfe 0xfd. This correctly preserves the lexicographical ordering.

The example currently given in the README is that new Buffer('ff00fe01', 'hex') encodes as (hex) 60ff00fe01, 'foo' encodes as (ASCII) 'pfoo', ['foo'] encodes as (C) "\xa0pfoo\0\0", and [ new Buffer('ff00fe01', 'hex') ] encodes as (hex) a060fefe0101fefd01020000.

While this is correct, it has the disadvantage that, for a single level of bytestuffing, the worst-case encoded size is double the decoded size, and, as it happens, the encoding of a value needing bytestuffing will

Topics