Secure, self-describing, self-delimiting serialization for Python

Kragen Javier Sitaker, 2017-04-11 (8 minutes)

I find myself somewhat unexpectedly desiring a new serialization format for Python data structures. This is unexpected because Python’s standard library already includes several serialization formats: pickle, marshal, json, and xmlrpclib (not counting xdrlib, ConfigParser, and struct), and other formats such as bencode are also widely used in Python.

(bencode might actually be the right solution, but I can’t look at the internet to see right now.)

I’m defining a network protocol for a program I’m writing, and one of the things I want to do in this protocol is to pass Python data structures over the wire. I’m not concerned with being able to serialize arbitrary class instances — I’d be satisfied with built-in data types — but I don’t want a lot of hassle.

The serialization needs to be:

Ideally, the serialization would also be:

But it does not need to be:

Length-free design

val ::= bytes | unicode | tuple | list | dict | boolean | none | int | float
bytes ::= '"' stringcontents
unicode ::= 'u"' stringcontents
stringcontents ::= '"' | stringbyte stringcontents
stringbyte ::= [^"\] | '\"' | '\\'
tuple ::= '(' tuplecontents
tuplecontents ::= ')' | val tuplecontents
list ::= '[' listcontents
listcontents ::= ']' | val listcontents
dict ::= "{" dictcontents
dictcontents ::= "}" | val val dictcontents
boolean ::= "T" | "F"
none ::= "N"
int ::= sign digits " "
digits ::= [0-9] | [0-9] digits
float ::= sign digits "e" sign digits " "
sign ::= "-" |

The pairs of vals in a dict are key-value pairs. Dict keys must appear in sorted order by the lexicographical ordering of their serializations. Unicode strings are represented in UTF-8. The only whitespace allowed is that within strings and that following numbers. The only case where the next byte is not sufficient to dispatch to the appropriate routine is the int/float dichotomy. Yes, floats are expressed without decimal points, so "31416e-4 " is a reasonable representation for an approximation of π.

In practice this should be slightly more compact than bencode except for large binary strings (which it inflates on average by almost 1% but by 100% in the worst case), and much more readable and writable, but considerably slower and more error-prone.

As an example, {"announce-list": [["foo"], ["bar"]], "info": {"files": [{"length": 4541, "path": "baz", "safe": False}], (): (1, 1.0)}} would be represented as {"announce-list"[["foo"]["bar"]]"info"{"files"[{"length"4541 "path""baz""safe"F}]()(1 1e0 )}}.

Length-prefixed design

val ::= bytelength body
bytelength ::= digits
digits ::= [0-9] | [0-9] digits
body ::= bytes | unicode | tuple | list | dict | boolean | none | int | float
bytes ::= 'H' data
unicode ::= 'u' data
data ::= "" | [\x00-\xff] data
tuple ::= '(' vals
vals ::= "" | val vals
list ::= '[' vals
dict ::= "{" pairs
pairs ::= "" | val val pairs
boolean ::= "T" | "F"
none ::= "N"
int ::= " " sign digits
float ::= "." sign digits "e" sign digits
sign ::= "-" |

Here every val begins with a decimal representation of the number of bytes in the body of its serialization, not counting the initial type byte.

As an example, {"announce-list": [["foo"], ["bar"]], "info": {"files": [{"length": 4541, "path": "baz", "safe": False}], (): (1, 1.0)}}, the same example value from before, would be represented as 100{13Hannounce-list14[6[3Hfoo6[3Hbar4Hinfo58{5Hfiles36[33{6Hlength4 45414Hpath3Hbaz4Hsafe1F0(8(1 13.1e0. This is about 10% bigger than the length-free design, and a hell of a lot harder to type or read, especially the parts that seem to say “45414H” and “13.1e0”, but can be navigated efficiently and can support large chunks of binary data.

Stack-based design

Pickle deserializes by interpreting stack-based bytecode similar to Python bytecode (which leads one to wonder why they didn’t just use Python bytecode). The pickle-version-0 encoding of the sample datum {"announce-list": [["foo"], ["bar"]], "info": {"files": [{"length": 4541, "path": "baz", "safe": False}], (): (1, 1.0)}} is the following 188 bytes:

(dp0
S'info'
p1
(dp2
S'files'
p3
(lp4
(dp5
S'path'
p6
S'baz'
p7
sS'length'
p8
I4541
sS'safe'
p9
I00
sas(t(I1
F1.0
tp10
ssS'announce-list'
p11
(lp12
(lp13
S'foo'
p14
aa(lp15
S'bar'
p16
aas.

Here:

Now, I have no idea why pickle incrementally appends stuff to lists and dicts as it builds them. pickle.loads("(S'foo'\nS'bar'\nS'baz'\nI37\nd.") does return {'foo': 'bar', 'baz': 37} as you would expect, and changing the d to an l generates the corresponding list. So I don’t know why a and s exist.

If you wanted to take this approach to make your serialization and deserialization as little code as possible, you could use this approach:

op ::= digits intop | '(' | '}' | ']' | ')' | 'T' | 'F' | 'N' | LF
intop ::= ' ' | '-' | 'H' data | 'u' data | 'F' data
digits ::= [0-9] | [0-9] digits
data ::= "" | [\x00-\xff] data

Here the '}', ']', and ')' ops play the role of 'd', 'l', and 't' in pickle; LF plays the role of '.'; ' ' specifies that the preceding digits just represent an integer (and '-' is the same, but negates it); 'H' and 'u' specify that the preceding digits are a count of following bytes for a byte string or UTF-8-encoded Unicode string; and 'F' represents a floating-point number in some way that I’m not specifying right now.

As an example, {"announce-list": [["foo"], ["bar"]], "info": {"files": [{"length": 4541, "path": "baz", "safe": False}], (): (1, 1.0)}} would be represented as (13Hannounce-list((3Hfoo](3Hbar]]4Hinfo[5Hfiles((6Hlength4541 4Hpath3Hbaz4HsafeF}]()(1 1F1)}}\n, which is one byte longer than the length-free design, but retains most of the efficiency advantage of the length-prefixed design. I'm not sure there’s a meaningful difference, really...

Topics