Toward a language for hacking around with natural-language processing algorithms

Kragen Javier Sitaker, 2016-09-08 (7 minutes)

To come up with non-words beginning with "sex" consisting of an "s" followed by a real word, in bash:

< ~/devel/wordlist cat | while read freq word
    do case "$word" in
    ex*) echo "s$word";;
    esac
done | grep -vf <(< ~/devel/wordlist head -15000
                  | while read freq word
                    do echo "^$word\$"; done) | head -45 | sort

This is probably the wrong way to do it. I thought it might be easier in Python, but it isn't:

import itertools
words = [line.split()[1] for line in open('/home/user/devel/wordlist')]
common_words = set(words[:15000])
swords = ('s' + word for word in words if word.startswith('ex'))
sorted(itertools.islice((sword for sword in swords if sword not in common_words), 0, 45))

Python’s set type, lazy generator expressions, and implicit file-line iteration are useful here, but this still ends up being kind of a lot of code, even more than the bash version, in part because genexes are pretty pointful, which is in part because Python’s methods are not very useful to pass to higher-order functions like map and filter.
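To make the method problem concrete (a made-up example, not the wordlist above): since startswith is a method, you can’t hand it to filter with its argument already supplied; you have to wrap it in a lambda or reach for something like operator.methodcaller.

import operator

words = ['example', 'extra', 'sample', 'exit']   # hypothetical data

# You can't write filter(startswith('ex'), words); the method has to be
# wrapped in a lambda or in operator.methodcaller before filter can use it.
with_lambda = list(filter(lambda w: w.startswith('ex'), words))
with_helper = list(filter(operator.methodcaller('startswith', 'ex'), words))
assert with_lambda == with_helper == ['example', 'extra', 'exit']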

Another thing to keep in mind here is that I invariably write this kind of thing incrementally, looking at the results computed by intermediate versions, in order to decide what to do next. For example, I added the filter to eliminate existing common words when “sexist” showed up in the output, and increased the cutoff from 2000 to 15000 when it continued to show up. Traditional function(arguments) syntax kind of sucks for this, because usually you write the argument expression left to right (not least because the cursor’s implicit motion is to the right as you type), and then you have to move back to its beginning to add the function part. This gets even worse when we’re wrapping something in a list comprehension or a generator expression.

The ideal environment for writing stuff like this incrementally would not be implicitly imperative, so that it could safely evaluate intermediate expressions without fear of damage and evaluate lazily for responsiveness without fear of confusion; it would allow you to add functions on the right; it would use map and filter functions rather than list comprehensions or generator expressions; it would use CLOS-like generic functions, with ML-like implicit currying, instead of methods so that you could use them as map and filter arguments; and it would allow you to write the equivalent of f(a, g(b, h(c, d))) without matching parens.

These requirements actually make it sound a lot like Forth! But I don’t think we need to descend down the rabbit hole of typelessness and syntaxlessness to get these advantages.
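You can fake some of that feel in today’s Python with a small pipeline wrapper. This is just a sketch (Pipe, each, only, and take are made-up names here, not anything standard), and the lambdas show how far it still is from the curried generic functions described above:

import itertools

class Pipe:
    # Hypothetical helper, not a standard class: wraps an iterable and
    # lets you keep appending stages on the right with |, lazily.
    def __init__(self, iterable):
        self.iterable = iterable
    def __or__(self, fn):
        return Pipe(fn(self.iterable))
    def __iter__(self):
        return iter(self.iterable)

def each(fn):   # map with its function argument supplied up front
    return lambda items: map(fn, items)

def only(fn):   # filter, likewise
    return lambda items: filter(fn, items)

def take(n):    # the first n items
    return lambda items: itertools.islice(items, n)

words = [line.split()[1] for line in open('/home/user/devel/wordlist')]
common_words = set(words[:15000])

pipeline = (Pipe(words)
            | only(lambda w: w.startswith('ex'))
            | each(lambda w: 's' + w)
            | only(lambda w: w not in common_words)
            | take(45)
            | sorted)
for sword in pipeline:
    print(sword)

Everything after Pipe(words) is appended on the right and stays lazy until the final sorted, which is the editing experience I want, but the lambda noise is exactly what better-behaved generic functions would get rid of.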

Here’s a hack at an alternative syntax that maximizes left-to-right typability with incremental feedback:

'/home/user/devel/wordlist' file, split each; 1 th each -> words
words, 'ex' startswith only;
    ('s' ++) each;
    , (words, :15000 th set) contains not each;
    :45 th sorted

I’m not sure that's right or even parseable, but here are the things I was trying:

The above suggests that it actually is necessary to have some kind of separator between a function and its arguments; in this line

'/home/user/devel/wordlist' file, split each; 1 th each

it’s totally ridiculous that each is being applied not only to split but to the return of file. We could use . to apply functions and call methods:

'/home/user/devel/wordlist'.file, split. each, 1.th. each -> words
words, ('ex'.startswith). only, ('s' ++). each,
    (words, :15000.th. set. contains. not). each.
    :45.th.sorted

Or alternatively we could use ;, but that’s kind of terrible, really, because the whole point of ; is that its left and right precedence are very different:

'/home/user/devel/wordlist'; file, split; each, 1;th; each -> words
words, ('ex'; startswith); only, ('s' ++); each;
    (words, :15000;th; set; contains; not); each;
    :45;th;sorted
