Additive smoothing for Markov models

Kragen Javier Sitaker, 2007 to 2009 (updated 2019-05-19) (11 minutes)

So when you estimate the probability of an event from a sample, you have to “smooth”, which is to say that you have to increase your estimate of the probability a little bit so that it is never zero, since concluding that the probability of any event is zero based only on a sample is clearly wrong. The most common way is to add 1 to both the number of observed events and the sample size (additive, or Laplace, smoothing). Strictly speaking the resulting estimator is slightly biased, but the bias shrinks as the sample grows, and it never assigns zero probability to anything.
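
In code, the add-one estimate is trivial; here’s a minimal sketch in Python (the function name is mine, just for illustration):

    def smoothed_probability(event_count, sample_size):
        """Add-one estimate: never returns zero, even for unseen events."""
        return (event_count + 1) / (sample_size + 1)

    # An event never seen in 99 trials still gets a small nonzero probability.
    print(smoothed_probability(0, 99))   # 0.01
    print(smoothed_probability(42, 99))  # 0.43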

But if you have more information, maybe you can do better. In particular, I was thinking about estimating the statistical likelihood of a particular word being the next word in a sentence. This is useful, for example, for travesty generators like Mark V. Shaney.

Zeroth-Order Markov Models

You can use a “zeroth-order [Markov model][]”, and just figure that the probability that the next word is, say, “figure” is the probability that any given word is “figure”, regardless of context. That is, it’s the number of occurrences of “figure” in your training corpus (plus one), divided by the total number of words in your training corpus (plus one).
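
Concretely, the zeroth-order estimate is just a smoothed word-frequency table; a sketch (the names here are mine, not a standard API):

    from collections import Counter

    def zeroth_order_probability(word, corpus_words):
        """P(word): occurrences of the word (plus one) over total words (plus one)."""
        counts = Counter(corpus_words)
        return (counts[word] + 1) / (len(corpus_words) + 1)

    corpus = "just figure that the probability that the next word is figure".split()
    print(zeroth_order_probability("figure", corpus))  # (2 + 1) / (11 + 1) = 0.25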

That gives you some information, for sure. But if you use that model to generate text, it doesn’t look much like real text.

First-Order Markov Models

If you use a first-order Markov model, you use the previous word, say, “just”. And you look at the number of occurrences of “just figure” in your corpus (plus one) and divide that by the number of occurrences of “just” followed by any word (plus one) in the training corpus. And you use that for your estimate.
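
A sketch of the same estimate in code, again with made-up names:

    def first_order_probability(prev_word, word, corpus_words):
        """P(word | prev_word): count of the pair (plus one) over count of
        prev_word followed by anything (plus one)."""
        pairs = list(zip(corpus_words, corpus_words[1:]))
        pair_count = pairs.count((prev_word, word))
        context_count = sum(1 for a, _ in pairs if a == prev_word)
        return (pair_count + 1) / (context_count + 1)

    corpus = "you just figure that you just figure it out".split()
    print(first_order_probability("just", "figure", corpus))  # (2 + 1) / (2 + 1) = 1.0
    print(first_order_probability("just", "think", corpus))   # (0 + 1) / (2 + 1) ≈ 0.33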

Text generated from a first-order Markov model tends to look like relatively real text. If you don’t smooth the probabilities, then you only ever get word pairs that occurred in the real text, so the text minimally conforms to a [regular-language][] approximation of the [grammar][] that generated your corpus, and if the model is [ergodic][], the frequencies of the words will additionally be about right. (See also the Jargon File entry on Dissociated Press.)

Second-Order Markov Models

You can use a second-order Markov model and look at the two words, say “and just”, and then divide the number of occurrences of “and just figure” (plus one) by the number of occurrences of “and just *” (plus one) in the corpus. This generates even more realistic text, because although it’s still using a regular-language approximation, the [DFA][] (a Markov model is a DFA augmented with transition probabilities) can now have N² states instead of N states, where N is the number of distinct words in the corpus. So it can be a better approximation.
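
In code, that’s counting “and just figure” against “and just” followed by anything; a sketch with the same naive smoothing and made-up names:

    def second_order_probability(w1, w2, word, corpus_words):
        """P(word | w1 w2): count of 'w1 w2 word' (plus one) over count of
        'w1 w2' followed by anything (plus one)."""
        trigrams = list(zip(corpus_words, corpus_words[1:], corpus_words[2:]))
        trigram_count = trigrams.count((w1, w2, word))
        context_count = sum(1 for a, b, _ in trigrams if (a, b) == (w1, w2))
        return (trigram_count + 1) / (context_count + 1)

    corpus = "and just figure it out and just go".split()
    print(second_order_probability("and", "just", "figure", corpus))  # (1 + 1) / (2 + 1) ≈ 0.67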

However, now you run into a problem. The number of states in a second-order Markov model can be greater than the number of words in your text. For example, before I wrote this sentence, this note contained 475 words, of which 162 were unique. That means that a second-order Markov model built from its vocabulary would contain 26244 states, only 474 of which, at most, could contain any sample information!
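
You can measure that sparsity directly; a small sketch (the corpus file name is a placeholder for whatever training text you have):

    def context_sparsity(corpus_words):
        """How many of the N * N possible two-word states ever occur?"""
        vocabulary = set(corpus_words)
        observed_contexts = set(zip(corpus_words, corpus_words[1:]))
        return len(observed_contexts), len(vocabulary) ** 2

    corpus = open('corpus.txt').read().split()   # placeholder training text
    observed, possible = context_sparsity(corpus)
    print(observed, "of", possible, "possible two-word states occur in the corpus")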

So if you smooth in the way I suggested above, by adding one to the counts of both “and just figure” and “and just *”, you probably wind up with a gross overestimate of the probability of “and just 26244” and a gross underestimate of the probability of “and just the”, and so in fact the output can look less realistic than the output from a “zeroth-order Markov model”. The usual way to deal with this in Dissociated Press is to not smooth at all; most of the transitions have zero probability. Unfortunately this often results in repeating long passages from the training corpus, especially once you go to third- and higher-order Markov models.

Essentially, such higher-order models are almost inevitably [overfitted][] --- they are so flexible that they end up learning the detailed structure of their training set, and if you try to compensate by smoothing, they don’t learn anything.

How to Fix the Problem

However, it seems to me that you could do the smoothing in a more effective way, by using first-order and “zeroth-order” Markov models from the same training corpus to fill the gap.

Effectively, when you’re spewing out randomly generated text, the smoothing amounts to adding one extra occurrence of, say, “and just”. The standard way of smoothing amounts to assuming that all words that haven’t been seen after “and just” are equally likely.

If instead you use your first-order Markov model to decide on that distribution --- in this case, the smoothed probability distribution of words following “just” --- you’ll do much better.

And, of course, you can smooth the first-order Markov model using the observed word frequencies (the “zeroth-order model”), instead of assuming that all words in the vocabulary are equally likely.
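
Here is one way to read that scheme as code: each estimate gets one pseudo-occurrence of its context, spread over the vocabulary according to the next-lower-order model. This is a minimal sketch of my reading, not an established algorithm; the function names and ‘corpus.txt’ are mine:

    from collections import Counter

    def build_model(corpus_words):
        """Precompute the counts the smoothed estimates below need."""
        return {
            'total': len(corpus_words),
            'unigrams': Counter(corpus_words),
            'bigrams': Counter(zip(corpus_words, corpus_words[1:])),
            'bigram_contexts': Counter(corpus_words[:-1]),
            'trigrams': Counter(zip(corpus_words, corpus_words[1:], corpus_words[2:])),
            'trigram_contexts': Counter(zip(corpus_words, corpus_words[1:-1])),
        }

    def p0(word, m):
        """Zeroth order: plain relative frequency of the word in the corpus."""
        return m['unigrams'][word] / m['total']

    def p1(word, prev, m):
        """First order, smoothed with one pseudo-occurrence of `prev`
        spread according to the zeroth-order model."""
        return ((m['bigrams'][(prev, word)] + p0(word, m))
                / (m['bigram_contexts'][prev] + 1))

    def p2(word, w1, w2, m):
        """Second order, smoothed with one pseudo-occurrence of `w1 w2`
        spread according to the (already smoothed) first-order model."""
        return ((m['trigrams'][(w1, w2, word)] + p1(word, w2, m))
                / (m['trigram_contexts'][(w1, w2)] + 1))

    model = build_model(open('corpus.txt').read().split())  # placeholder corpus
    print(p2('figure', 'and', 'just', model))

A nice property of this reading is that each level spreads exactly one pseudo-occurrence of its context according to the level below it, so each conditional distribution still sums to one over the vocabulary.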

When you’re generating random text, there’s no obvious way to smooth the zeroth-order Markov model; you can’t straightforwardly generate an arbitrary word you’ve never seen before. But in some other applications of Markov models, such as data compression, OCR, speech recognition, and entropy estimation, you only have to deal with things that are actually found in the input text.

Note that in the case that the word pair in the current state does not occur in the training set, this reduces to a first-order Markov model. This suggests that it is the Right Thing.

Data Structures

If you want to implement this algorithm, you can use a [suffix array][] or [suffix tree][] on the training set. This allows you to efficiently answer questions like “how many times does ‘model, and just’ occur in the input, and what is the distribution of what occurs next?” without storing an excessive amount of data. Suffix trees take up an awful lot of space, but Udi Manber and Gene Myers published a reasonably efficient algorithm for constructing suffix arrays around 1990, and it’s used as the basis for the [Glimpse][] search engine. The suffix array merely needs space for one pointer per index point in the original text.
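
Here is a naive word-level sketch of that interface in Python, just to make it concrete; it sorts the suffixes directly instead of using the Manber and Myers construction, and ‘corpus.txt’ is a placeholder:

    import bisect
    from collections import Counter

    def build_suffix_array(words):
        """Naive word-level suffix array: index points sorted by the suffix
        starting at each one.  (This sort wastes time and memory; the Manber
        and Myers algorithm does much better, but the result is the same.)"""
        return sorted(range(len(words)), key=lambda i: words[i:])

    def phrase_occurrences(phrase, words, sa):
        """Binary-search the suffix array for the block of suffixes that
        start with `phrase` (a list of words).  Needs Python 3.10+ for key=."""
        key = lambda i: words[i:i + len(phrase)]
        lo = bisect.bisect_left(sa, phrase, key=key)
        hi = bisect.bisect_right(sa, phrase, key=key)
        return sa[lo:hi]

    def next_word_distribution(phrase, words, sa):
        """How many times does `phrase` occur, and what follows it?"""
        hits = phrase_occurrences(phrase, words, sa)
        followers = Counter(words[i + len(phrase)] for i in hits
                            if i + len(phrase) < len(words))
        return len(hits), followers

    words = open('corpus.txt').read().split()   # placeholder training text
    sa = build_suffix_array(words)
    print(next_word_distribution('model, and just'.split(), words, sa))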

If you want to know what words occurred somewhere preceding a given word or phrase, you really want a “prefix tree”, which is a suffix tree built on a reversed version of the input corpus. ...

Other Applications

It occurred to me that if you wanted to know what words to boldface in an English text, you might have some success highlighting the lowest-probability words. This could also be useful in source code highlighting: figuring out which tokens are most informative and which are just noise.
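
A crude sketch of that idea, scoring words by surprisal under an add-one-smoothed zeroth-order model (a higher-order model like the one above would presumably pick out more interesting words); the file names are placeholders:

    import math
    from collections import Counter

    def most_surprising(document_words, corpus_words, k=10):
        """Rank the document's words by surprisal, -log2 P(word), under an
        add-one-smoothed zeroth-order model of the corpus; the highest-surprisal
        words are the candidates for boldfacing."""
        counts = Counter(corpus_words)
        total = len(corpus_words)
        def surprisal(word):
            return -math.log2((counts[word] + 1) / (total + 1))
        return sorted(set(document_words), key=surprisal, reverse=True)[:k]

    corpus = open('training.txt').read().split()   # placeholder training text
    document = open('article.txt').read().split()  # placeholder document to highlight
    print(most_surprising(document, corpus))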

References

The Jargon File entry on Dissociated Press correctly describes Markov-chain random text generation, but incorrectly claims that Emacs’s “dissociated press” is an example thereof. According to the Emacs manual:

Dissociated Press produces results fairly like those of a Markov chain based on a frequency table constructed from the sample text. It is, however, an independent, ignoriginal invention. Dissociated Press techniquitously copies several consecutive characters from the sample between random choices, whereas a Markov chain would choose randomly for each word or character. This makes for more plausible sounding results, and runs faster.

It is a mustatement that too much use of Dissociated Press can be a developediment to your real work, sometimes to the point of outragedy. And keep dissociwords out of your documentation, if you want it to be well userenced and properbose. Have fun. Your buggestions are welcome.


Penn Jillette wrote a lovely explanation of Markov-chain travesty generation in his article on Mark V. Shaney, “I Spent an Interesting Evening Recently with a Grain of Salt”, published in PC Computing volume 4, number 7, July, 1991, p.282.


The idea of an index point is explained in Gonnet, Baeza-Yates, and Snider’s 1991 paper, “Lexicographical Indices for Text: Inverted Indices vs. PAT trees”. They had been working on putting the Oxford English Dictionary on CD-ROM, and so they’d done a bunch of work on full-text indexing. Gonnet was at ETH at the time, and Baeza-Yates was already at the Universidad de Chile. The paper is mostly devoted to advocacy of suffix arrays (called PAT arrays in the article) rather than inverted indices. They explain:

An index point is the beginning of a word or a piece of text which is indexable. Usually such points are preceded by space...

I am a little bit puzzled about this paper; it says the PAT tree data structure “was originally described by Gonnet in the paper ‘Unstructured Data Bases’ [Gon83]”. But the PAT tree is “a Patricia tree constructed over all the possible sistrings [suffixes] of a text,” citing Donald R. Morrison’s 1968 JACM paper on PATRICIA, which I believe describes constructing PATRICIA trees over all the possible suffixes of a text. But I don’t have the paper handy to check.

Tim Bray, who also worked on the OED project, recently wrote a series of articles about the technical aspects of full-text indexing. He also worked for some years at a startup called Open Text whose main product used suffix arrays for its searching. His analysis of why suffix arrays are not widely used today is very interesting.


PATRICIA has been explained in a lot of places, but unfortunately I don’t know where to find the original paper online.


Overfitting is a problem with statistical models generally. Sometimes with neural networks and other machine-learning systems, it’s called “overtraining”.

Hey, maybe I should look at http://www.stat.cmu.edu/~cshalizi/754/.


Perry Lorier tells me that this is how the PPM* algorithm models text; IIRC that’s what’s used in the champion PAQ family of compressors. I should look into it.
