Profile-guided parser optimization should enable parsing of gigabytes per second

Kragen Javier Sitaker, 2019-05-23 (8 minutes)

It should be possible to make parsing nearly always as fast as KMP string search, or even faster, without even resorting to SIMD.

parselov

As mentioned in Parallel DFA execution:

Bjoern Hoehrmann has been working on an algorithm called “parselov” which compiles a context-free grammar to a finite-state automaton that approximates the CFG using a limited-depth parse stack (one stack item, I think).

The idea, as I understand it, is that you can parse a deterministic CFG with a deterministic pushdown automaton (“DPDA”), which can be analyzed as a finite state machine augmented with an unlimited-depth stack. It follows that you can parse a limited-stack-depth approximation of a context-free grammar with a finite state machine, since there are a finite number of stack states with N items or less. (Evidence from human errors in speaking natural languages, and from the verisimilitude of natural language generated by LSTMs and similar neural networks, suggests that in fact this is what we do most of the time.)
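
To make that concrete, here is a minimal sketch in Python, with an invented encoding of the transition function, of enumerating the finitely many (state, stack) configurations of a DPDA whose stack depth is capped; each configuration becomes one state of the approximating finite state machine, and pushes past the cap are where the approximation breaks down:

    # Hypothetical DPDA encoding: delta maps (state, input, top of stack or
    # None) to (next_state, action), where action is None (leave the stack
    # alone), 'pop', or ('push', symbol).
    def bounded_dfa(inputs, delta, start, max_depth):
        # Each reachable (state, stack-tuple) pair, with the stack capped
        # at max_depth items, becomes one state of the approximating DFA.
        start_cfg = (start, ())
        table = {}                  # (configuration, input) -> configuration
        seen = {start_cfg}
        frontier = [start_cfg]
        while frontier:
            q, stack = frontier.pop()
            for a in inputs:
                move = delta.get((q, a, stack[-1] if stack else None))
                if move is None:
                    continue        # no transition: this input rejects here
                q2, action = move
                if action is None:
                    stack2 = stack
                elif action == 'pop':
                    stack2 = stack[:-1]
                else:               # ('push', symbol)
                    if len(stack) == max_depth:
                        continue    # over the cap: the approximation gives up
                    stack2 = stack + (action[1],)
                cfg = (q2, stack2)
                table[((q, stack), a)] = cfg
                if cfg not in seen:
                    seen.add(cfg)
                    frontier.append(cfg)
        return table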

The potentially exponentially large number of stack states suggests that enumerating them for any reasonable depth might be infeasible, but typically the stack states are strongly dependent on one another; in most languages, for example, you can’t be parsing a formal parameter list unless you’re just inside a function declaration, and you can’t be parsing a string unless you’re just inside an expression. So the base of the exponential might be surprisingly low.

But what happens if your pushdown automaton wants to push an (N+1)th item onto the stack? Of course, you could decide to abort, and if your stack depth was deep enough, that might be fine. But, as an alternative, you could shift the top N-1 items on the stack down by one, overwriting the bottommost item to make room for the new one, and enter an “overdepth” state. Then, when popping, you can deterministically reduce the number of stack items in your finite state while keeping the overdepth flag, until you reach some low-water mark of stack items; then you add nondeterministic transitions to unshift each of the possible next stack items, and nondeterministically clear the overdepth flag or not.

If the distance between the low-water mark and N, the high-water mark, is large enough, then only a tiny fraction of the input tokens or characters will result in a nondeterministic transition.

(The low-water mark could be as low as 0, but in that case, the state you could have to shift back in could be any arbitrary state. If it’s at least 1, you have some context that reduces the number of possible transitions.)

By itself, this gives you a nondeterministic finite state automaton which functions as a sort of amnesiac pushdown automaton, accepting a regular language that is a strict superset of your original context-free language. But, when implementing this, you can add an extra “spill stack” in memory, which precisely tracks the items shifted out of the finite stack window. Then, when doing an unshifting transition, you can pop the spill stack to see exactly which state to transition to.
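
Here is a minimal sketch of that combined scheme, in Python with invented names (HIGH and LOW for the high- and low-water marks). In the compiled automaton, the window below would be baked into the finite control, so only the spill stack touches memory, and only on the marked lines:

    HIGH, LOW = 8, 2   # hypothetical high- and low-water marks

    class WindowedStack:
        # Models the stack of the transformed DPDA: the top few items live
        # in the window (conceptually part of the finite state); anything
        # deeper spills to an ordinary in-memory stack.
        def __init__(self):
            self.window = []   # top-of-stack items, at most HIGH of them
            self.spill = []    # items shifted out of the window

        def push(self, item):
            if len(self.window) == HIGH:
                self.spill.append(self.window.pop(0))   # rare memory write
            self.window.append(item)

        def pop(self):
            item = self.window.pop()
            if len(self.window) < LOW and self.spill:
                # The one transition that must consult memory: unshift the
                # spilled item to learn exactly which state to move to.
                self.window.insert(0, self.spill.pop())
            return item

With HIGH = 8 and LOW = 2, memory is touched only after a net stack excursion of six or more items, so on typical inputs nearly all transitions stay within the DFA.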

Thus you have transformed a DPDA into another equivalent DPDA with many more states and much less frequent pushes and pops — most of the time it is just a DFA which occasionally pushes things, but occasionally it pops and consults the popped value, as well as the input, to determine its next transition. The transformation is parameterized by two numbers, the low-water mark and N, the high-water mark. If they are both 0, it’s the identity transformation on DPDAs; if they are both 1, as I understand it, it’s parselov.

(A precisely analogous transformation can transform a nondeterministic PDA into another, similarly more efficient, nondeterministic PDA.)

But we have more degrees of freedom available for optimization than that; we could vary those numbers depending on the stack situation. In particular, we would like to avoid stack activity in situations that occur frequently, but we don’t care about avoiding stack activity in situations that occur rarely; instead we would like to avoid state proliferation.

But grammars don’t carry any information about the relative frequency of different situations.

Profile-guided optimization

PGO is a technique that enables modern ahead-of-time (AOT) compilers to compete with just-in-time (JIT) compilers. You compile your program with profiling turned on, run it for a while, and then compile it a second time, using the profiling information to guide the optimization so that it optimizes the things that are important instead of the things that aren’t.

In this particular case, the only thing “profiling” needs to produce is the sequence of inputs and of state and stack changes in your DPDA as it parses some sentences. Then, given a candidate set of sets-of-top-few-stack-items to assign to finite states to avoid pushes and pops, and some kind of description of the low-water-mark policy, it is straightforward to compute the number of pushes and pops that remain, and which states they are from; a greedy optimization process should do a reasonable, perhaps even optimal, job of choosing where to merge states and where to split them.
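
As a sketch of that greedy step, under assumptions of my own (the trace is a sequence of (context, event) pairs, where the context is the tuple of top-few stack items in effect, and each absorbed context is crudely costed at one state):

    from collections import Counter

    def choose_contexts(trace, state_budget):
        # trace: iterable of (context, event) pairs, where context is the
        # tuple of top-few stack items in effect and event is 'push' or
        # 'pop'. Folding a context into the finite state eliminates its
        # pushes and pops at the cost of more states, so take the hottest.
        heat = Counter(ctx for ctx, _event in trace)
        chosen = {ctx for ctx, _n in heat.most_common(state_budget)}
        remaining = sum(n for ctx, n in heat.items() if ctx not in chosen)
        return chosen, remaining

A real cost model would have to account for how absorbed contexts multiply the state count combinatorially, but the shape of the computation is the same: spend states where the profile says the pushes and pops are.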

However, this PGO should enable us to not just equal lex’s performance, but exceed it substantially, because, on modern processors, the cost of indirecting through a jump table is greater than the cost of a comparison tree in nearly all cases — but the comparison tree’s performance can vary substantially depending on how it’s laid out. So, by taking advantage of statistics about which characters are encountered in which states in practice, we should be able to parse context-free grammars at gigabytes per second on modern CPUs.
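
As a sketch of the code-generation half of that claim, again in Python with invented profile and transition-table formats: given the character counts observed in a state during profiling, emit a comparison chain (the simplest layout of a comparison tree) that tests the hottest characters first, instead of indexing a transition table:

    def emit_dispatch(state, freq, transitions):
        # freq: {char: count} observed in this state while profiling;
        # transitions: {char: next_state}. Emits C-like dispatch code that
        # compares against the most frequently seen characters first.
        lines = [f'state_{state}:', '  c = *p++;']
        ranked = sorted(freq, key=freq.get, reverse=True)
        for i, ch in enumerate(ranked):
            kw = 'if' if i == 0 else 'else if'
            lines.append(f'  {kw} (c == {ord(ch)}) goto state_{transitions[ch]};')
        lines.append('  else goto reject;')
        return '\n'.join(lines)

On text where the first test or two almost always hits, the branches predict well, whereas the jump table costs a dependent load and an indirect branch on every character.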

Eliminating caching of parse results

If that works out, it should have substantial implications for the architecture of software, because a substantial fraction of modern software is dedicated to caching the results of parsing. Python’s .pyc bytecode files, the Emacs Lisp .elc files they aped, Java .class files, minified JS, the browser DOM, and all kinds of non-CSV storage formats exist mostly to avoid slowly reparsing things that have already been parsed. I’m typing this in Emacs, which has all kinds of stupid hacks to avoid reparsing the buffer for syntax highlighting until there’s some idle time.

If you can parse at gigabytes per second, then in many cases it isn’t useful to store those parse results. If Emacs reparses the buffer from the beginning after every keystroke for syntax highlighting, that will only begin to impede its responsiveness once the buffer is several megabytes in size; if it checkpoints the parse state every megabyte, there is no danger. Similarly, you could run your database queries directly on the CSV rather than importing it into the database. (None of these examples are pure, though: indexing a column of ten million numbers so you can join it with another table would still take an appreciable fraction of a second, even if your CSV parsing were instant, and both Java and Elisp do substantial optimization before writing their bytecode. But maybe they shouldn’t.)
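
For concreteness: at one gigabyte per second, a megabyte parses in about a millisecond, so reparsing a ten-megabyte buffer on every keystroke would cost about ten milliseconds, which is where it would start to cut into responsiveness, while resuming from a checkpoint at most a megabyte back would cost about a millisecond.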

And storing the parse results is potentially harmful, since in the absence of a general computation-result-caching system like A minimal dependency processing system, Fault-tolerant in-memory cluster computations using containers; or, SPARK, simplified and made flexible, or Kogluktualuk: an operating system based on caching coarse-grained deterministic computations, or at least the kind of reliable filesystem change detection discussed in Immutability-based filesystems: interfaces, problems, and benefits, there’s a constant risk of that preparsed data being outdated with respect to its original source.
