I talked a little about Lotus Agenda in Agenda hypertext. Today I want to explore a different but related idea: a flexible tool for making relational views of text file contents, whether written by hand or generated by machine.
An Agenda file was, primarily, an unordered bag of text snippets. A lot of Unix utilities treat the lines of a text file as a sort of database record: roughly speaking, `sort` sorts the records, `uniq` eliminates duplicates and possibly provides you with a count, `grep` selects records matching a pattern, `cut` projects away some of the columns, `look` does a lookup in a sorted index, `join` and `paste` do relational joins, and so on.
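To make that correspondence concrete, here is a minimal Python sketch of the same relational operations applied to a list of lines; the sample data is hypothetical:

```python
lines = ["alice 3", "bob 1", "alice 3", "carol 2"]

sorted_lines = sorted(lines)                   # sort: order the records
distinct = sorted(set(lines))                  # sort | uniq: drop duplicates
matching = [l for l in lines if "alice" in l]  # grep: select by pattern
first_col = [l.split()[0] for l in lines]      # cut -d' ' -f1: project a column

# join: combine records from two "tables" on a shared key column
groups = ["alice admins", "bob users"]
joined = [l + " " + g.split(None, 1)[1]
          for l in lines for g in groups
          if l.split()[0] == g.split()[0]]     # ['alice 3 admins', ...]
```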
These utilities are often the fastest way to get certain tasks done, but they’re kind of hard to use (especially robustly), and they don’t deal with ad-hoc-formatted text at all, which was Agenda’s strength. However, Agenda’s queries and views used a sort of tagging system that wasn’t very powerful.
What if we had a system that also treated a text file as a bag of lines, but permitted relational queries on those lines, similar to how Sniki permitted relational queries on its link graph (see Prolog table outlining)? For example, consider these lines:
-rwxr-xr-x 1 user user 19096 Dec 30 2018 xshmucalc_fb
-rw-r--r-- 1 user user 254 Dec 30 2018 erosion1d.c.~1~
-rwxr-xr-x 1 user user 8984 Dec 30 2018 erosion1d
-rw-r--r-- 1 user user 534 Dec 30 2018 erosion1d-log.c.~1~
-rw-r--r-- 1 user user 1122 Dec 30 2018 erosion1d.c
-rw-r--r-- 1 user user 1362 Dec 30 2018 erosion1d-log.c
-rw-r--r-- 1 user user 40055 Dec 30 2018 erosion1d-log.lst
Suppose you match them against this pattern, using two PCRE regexps:
-rwxr-xr-x 1 user user $size:\d+ $date:.* $executable
(`executable` gets the default pattern `\S+`; unquoted space matches any amount of space.)
This produces the following relation:
| size  | date        | executable   |
|-------|-------------|--------------|
| 19096 | Dec 30 2018 | xshmucalc_fb |
| 8984  | Dec 30 2018 | erosion1d    |
The other lines don’t match the pattern, so they don’t enter into the result.
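No such pattern language exists as far as I know, but the core is easy to prototype. Here's a rough Python sketch that compiles patterns of this shape into a single regex with named groups; it skips `${name}` braces and `$_` wildcards, and it assumes each `$name:regex` field ends at the next space in the pattern:

```python
import re

def compile_pattern(pattern):
    out = []
    for token in pattern.split():
        if token.startswith("$"):
            name, _, rx = token[1:].partition(":")
            out.append("(?P<%s>%s)" % (name, rx or r"\S+"))  # default is \S+
        else:
            out.append(re.escape(token))  # literal text
    return re.compile(r"\s+".join(out) + r"$")  # space matches any whitespace

def query(pattern, lines):
    return [m.groupdict()
            for m in map(compile_pattern(pattern).match, lines) if m]

listing = """\
-rwxr-xr-x 1 user user 19096 Dec 30 2018 xshmucalc_fb
-rw-r--r-- 1 user user 254 Dec 30 2018 erosion1d.c.~1~
-rwxr-xr-x 1 user user 8984 Dec 30 2018 erosion1d
-rw-r--r-- 1 user user 1122 Dec 30 2018 erosion1d.c
""".splitlines()
rows = query(r"-rwxr-xr-x 1 user user $size:\d+ $date:.* $executable", listing)
# [{'size': '19096', 'date': 'Dec 30 2018', 'executable': 'xshmucalc_fb'},
#  {'size': '8984', 'date': 'Dec 30 2018', 'executable': 'erosion1d'}]
```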
How about files that have an Emacs backup file, the thing with the version number between `~` marks?
$_ $_ $_ $_ $size1:\d+ $date1:.* ${fname}.~$v~
&$_ $_ $_ $_ $size2:\d+ $date1:.* $fname
This is a kind of self-join, joining the set of records with itself, but using two different patterns. They might both match the same line in different ways, but they each need to match some line for the query as a whole to succeed. Those lines may not be consecutive — in the above data they are not.
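A straightforward (if naive) way to evaluate such a query is a nested-loop join: match each pattern against every line independently, then keep only the combinations of bindings that agree on the shared variables. A sketch, using plain named-group regexes in place of the `$`-pattern syntax, matching only the filename end of each line, and reusing `listing` from the previous sketch:

```python
import re
from itertools import product

def matches(rx, lines):
    return [m.groupdict() for m in map(re.compile(rx).match, lines) if m]

def conjoin(rows_a, rows_b):
    # nested-loop join: keep binding pairs whose shared variables unify
    for a, b in product(rows_a, rows_b):
        if all(a[k] == b[k] for k in a.keys() & b.keys()):
            yield {**a, **b}

backups = matches(r".* (?P<fname>\S+)\.~(?P<v>\d+)~$", listing)  # first pattern
plains  = matches(r".* (?P<fname>\S+)$", listing)                # second pattern
pairs = list(conjoin(backups, plains))  # joins on the shared fname variable
# [{'fname': 'erosion1d.c', 'v': '1'}]
```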
A better way to formulate that query is through a view, or inference rule. Here’s an inference rule that factors out the common part of the above; it consists of a query, followed by one or more templates that transform the query results into inferred lines:
$_ $_ $_ $_ $size:\d+ $date:.* $fname
.: File $fname is $size bytes, modified $date.
This inference rule causes a panoply of inferred lines to implicitly spring into existence:
File erosion1d.c.~1~ is 254 bytes, modified Dec 30 2018.
File erosion1d is 8984 bytes, modified Dec 30 2018.
File erosion1d-log.c.~1~ is 534 bytes, modified Dec 30 2018.
File erosion1d.c is 1122 bytes, modified Dec 30 2018.
And so on. This allows us to write the self-join query from before much more conveniently, now matching these inferred lines:
File $fname $_:.*
&File ${fname}.~$v~ $_:.*
Such inference rules have the potential to produce infinite loops of inferring more and more lines:
GNU$stuff:.*
.: GNU’s Not Unix$stuff
In general I think it is undecidable whether this is happening, so probably the best solution is to ensure that if it happens, it doesn’t lead to disasters.
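One cheap guard is to run the inference as forward chaining with a hard cap on the number of lines it may create, so a runaway rule fizzles instead of exploding. A sketch, representing each rule as a hypothetical (regex, template) pair:

```python
import re

def infer(lines, rules, cap=10000):
    seen = set(lines)
    frontier = list(lines)
    while frontier and len(seen) < cap:  # the cap bounds runaway rules
        line = frontier.pop()
        for rx, template in rules:
            m = re.match(rx, line)
            if m:
                inferred = template.format(**m.groupdict())
                if inferred not in seen:
                    seen.add(inferred)
                    frontier.append(inferred)  # inferred lines feed rules too
    return seen

rules = [(r"GNU(?P<stuff>.*)$", "GNU's Not Unix{stuff}")]
print(sorted(infer(["GNU"], rules, cap=5)))  # stops after 5 lines, no disaster
```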
This is messy, like the Unix shell utilities that inspired it: concatenating raw strings and then trying to parse the results is a recipe for disaster. So maybe inferring JSON and using destructuring matches like modern JS would be better than inferring text lines; then we only need to use text matching when it’s part of the problem statement:
$_ $_ $_ $_ $size:\d+ $date:.* $fname
.: {fname, size, date}
{fname}
&{fname: "${fname}.~$v~"}
Here `{fname}` is syntactic sugar for `{"fname": "$fname"}`, and similarly for `{fname, size, date}`.
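That destructuring match is also easy to sketch: records become dicts, and each pattern value is a template whose `$name` parts capture or unify. The following Python is my own approximation of the idea, including the join-by-unification that the `&` syntax implies:

```python
import re

def template_to_regex(template):
    # '${fname}.~$v~' -> r'(?P<fname>.+)\.~(?P<v>.+)~'
    parts = re.split(r"\$\{?(\w+)\}?", template)
    rx = "".join("(?P<%s>.+)" % p if i % 2 else re.escape(p)
                 for i, p in enumerate(parts))
    return re.compile(rx + "$")

def match_record(pattern, record, bindings=None):
    # destructure a dict against {field: template}, unifying with bindings
    bindings = dict(bindings or {})
    for field, template in pattern.items():
        m = template_to_regex(template).match(str(record.get(field, "")))
        if not m:
            return None
        for name, value in m.groupdict().items():
            if bindings.setdefault(name, value) != value:
                return None  # conflicting binding: the join fails
    return bindings

b = match_record({"fname": "$fname"}, {"fname": "erosion1d.c"})
b = match_record({"fname": "${fname}.~$v~"}, {"fname": "erosion1d.c.~1~"}, b)
print(b)  # {'fname': 'erosion1d.c', 'v': '1'}
```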
Suppose you specify an aggregate operation on a field:
$perm 1 user user $size.sum:\d+ $_:.*
Aggregation implies that you need to do some kind of grouping in order to have groups to run the aggregation over. How do you do the grouping? Well, in SQL, if you use grouping, the only fields you can legally SELECT without an aggregate operation are the ones you’re going to GROUP BY. (MySQL is historically lax about this.) So implicitly we could just group by the fields that don’t have an aggregate operation attached to them, which means the query above groups by permissions!
perm count size.sum
---------- -------- --------
-rw-r--r-- 5 43327
-rwxr-xr-x 2 28080
Multiple aggregations on the same field don’t fit easily into this approach, but they are rare.
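The implicit grouping rule is short to write down: bucket the rows by the tuple of un-aggregated fields, then fold each aggregate over its bucket. A sketch of that, with an `aggs` argument of my own design mapping a field to an (opname, fold) pair; I've also thrown in the implicit count the table above shows:

```python
from collections import defaultdict

def group_aggregate(rows, aggs):
    groups = defaultdict(list)
    for row in rows:
        # implicit GROUP BY: every field with no aggregate attached
        key = tuple(sorted((f, v) for f, v in row.items() if f not in aggs))
        groups[key].append(row)
    for key, members in groups.items():
        out = dict(key)
        out["count"] = len(members)
        for field, (opname, fold) in aggs.items():
            out["%s.%s" % (field, opname)] = fold(int(r[field]) for r in members)
        yield out

rows = [{"perm": "-rw-r--r--", "size": "254"},
        {"perm": "-rw-r--r--", "size": "534"},
        {"perm": "-rwxr-xr-x", "size": "19096"}]
for result in group_aggregate(rows, {"size": ("sum", sum)}):
    print(result)
# {'perm': '-rw-r--r--', 'count': 2, 'size.sum': 788}
# {'perm': '-rwxr-xr-x', 'count': 1, 'size.sum': 19096}
```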
A useful aggregation that makes sense in this context is `.join(sep)`.
By default aggregation results should be sorted descending, but by which field? You should be able to tag a field in your query as the sort key; consider this data from `ps ux`:
user 28358 0.7 7.8 2114240 310916 ? Sl Aug22 59:20 /usr/lib/firefox/firefox -contentproc -childID
user 28905 0.0 1.2 122440 49028 pts/2 S 00:49 0:04 xpdf.real sensors-18-00768.pdf
user 29260 0.0 0.0 22460 3516 pts/4 Ss+ Aug26 0:00 bash
user 29299 0.0 0.0 8340 1892 pts/4 T Aug26 0:00 less sensors-18-00768.pdf
You could match it with:
$uid $pid $_:.* $vsz:\d+ $rss:\d+ $tty $stat $start $cpu $cmd:.*
But maybe you’d like to sort by RSS:
$uid $pid $_:.* $vsz:\d+ $rss+:\d+ $tty $stat $start $cpu $cmd:.*
Or sort ascending instead:
$uid $pid $_:.* $vsz:\d+ $rss-:\d+ $tty $stat $start $cpu $cmd:.*
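Reading the sort tag off the query is a one-regex job; here's a sketch of that convention, where `+` picks the field as the sort key with the default descending order and `-` flips it to ascending:

```python
import re

def sort_rows(rows, pattern):
    tag = re.search(r"\$(\w+)([+-])", pattern)  # e.g. $rss+ or $rss-
    if not tag:
        return rows
    field, direction = tag.groups()
    def key(row):
        v = row[field]
        return int(v) if v.isdigit() else v  # numeric when it looks numeric
    return sorted(rows, key=key, reverse=(direction == "+"))  # '+' = descending

rows = [{"rss": "49028"}, {"rss": "310916"}, {"rss": "3516"}]
print(sort_rows(rows, r"$uid $pid $_:.* $vsz:\d+ $rss+:\d+ ..."))
# [{'rss': '310916'}, {'rss': '49028'}, {'rss': '3516'}]
```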
Of course more sophisticated sorting requires multi-field keys, specifying numeric versus ASCIIbetical versus locale versus mixed-text-and-numbers, field ordering in the key, and so on. But the simple case should be easy.
The above allows you to select lines where a field is equal to a constant or to some concatenation that includes some other field value. So you can find your zero-byte files. But what if you want to find your files that are over a mebibyte? You need to be able to say something like this:
$perm $nlinks user user $size[>1048576] $_:.* $name
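A comparison like that could compile into a post-match filter; here's a hedged sketch, with both the bracket syntax and the sample rows invented for illustration:

```python
import operator, re

OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

def parse_predicate(spec):
    # '$size[>1048576]' -> ('size', test on the matched value)
    field, op, limit = re.match(r"\$(\w+)\[([><=])(\d+)\]$", spec).groups()
    return field, lambda value: OPS[op](int(value), int(limit))

field, test = parse_predicate("$size[>1048576]")
rows = [{"name": "erosion1d.c", "size": "1122"},
        {"name": "big.dat", "size": "2097152"}]
print([row for row in rows if test(row[field])])  # only big.dat survives
```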
Most text files aren’t purely unordered bags, and if you can grok a bit of their structure, you can do much more useful things. There’s a whole ladder of language power to ascend here — first strict sequences of line templates:
commit $commit
Author: $author:.*
Date: $date:.*
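Matching a strict sequence of line templates is a sliding-window affair: try the whole template list starting at each line, and emit one row per position where every template matches. A sketch with regexes standing in for the templates, and a made-up commit for data:

```python
import re

TEMPLATES = [re.compile(rx) for rx in [
    r"commit (?P<commit>\S+)",
    r"Author: (?P<author>.*)",
    r"Date:   (?P<date>.*)",
]]

def multiline_records(lines):
    for i in range(len(lines) - len(TEMPLATES) + 1):
        row = {}
        for template, line in zip(TEMPLATES, lines[i:]):
            m = template.match(line)
            if not m:
                break
            row.update(m.groupdict())
        else:  # all templates matched consecutive lines: emit a row
            yield row

log = """\
commit 4f0c2b1
Author: A. Hacker <hacker@example.org>
Date:   Sun Aug 25 2019
""".splitlines()
print(list(multiline_records(log)))
```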
Then some kind of vague contextual line-template thing for simple hierarchical structures with header lines at the top, where each occurrence of the last line in the template creates a row, but implicitly pulls in all the fields from the most recent occurrence of each of the others:
* $h1:.*
...
** $h2:.*
...
*** $h3:.*
...
${_:.*}<${url}>$_:.*
This would be especially useful for things like `ls -lR` output:
-rw-r--r-- 1 user user 1060956 Dec 2 2018 e0000ea7.au
-rw-r--r-- 1 user user 1060956 Dec 2 2018 e0000ffe.au
./whatsapp-mockup:
total 104
-rw-r--r-- 1 user user 1192 Oct 23 2016 camera.png
-rw-r--r-- 1 user user 591 Oct 23 2016 check.png
Here you want to use a pattern something like this, so that you can know what directory each file is in:
${dir}:
total $dirtotal
...
$perm $nlinks $user $group $size $date:.* $filename
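Those "most recent occurrence" semantics need nothing fancier than a dict of sticky bindings: header templates update a context, and each row template emits its own fields merged with that context. A sketch against the `ls -lR` shape above, with regexes again standing in for the pattern syntax:

```python
import re

HEADER = re.compile(r"(?P<dir>\S+):$")  # the '${dir}:' header line
ROW = re.compile(r"(?P<perm>[-drwx]+) +(?P<nlinks>\d+) +(?P<user>\S+)"
                 r" +(?P<group>\S+) +(?P<size>\d+) +(?P<date>\S+ +\d+ +\d+)"
                 r" +(?P<filename>.*)$")

def rows_with_context(lines):
    context = {}  # sticky fields from the most recent header match
    for line in lines:
        m = HEADER.match(line)
        if m:
            context.update(m.groupdict())
        m = ROW.match(line)
        if m:
            yield {**context, **m.groupdict()}

listing = """\
-rw-r--r-- 1 user user 1060956 Dec 2 2018 e0000ea7.au
./whatsapp-mockup:
total 104
-rw-r--r-- 1 user user 1192 Oct 23 2016 camera.png
""".splitlines()
for row in rows_with_context(listing):
    print(row.get("dir"), row["filename"])
# None e0000ea7.au
# ./whatsapp-mockup camera.png
```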
Alternatively, you could tag some kind of thing onto the end of the line pattern that asks for the most recent matching line, without implying a hierarchical structure.
Then some kind of regular expression system for line templates, then a full grammar system (which is getting into Tagging parsers), which could also extend down into the lines themselves, so that you can do things like parse the ingredients out of “Neapolitan ($17): Ham, tomato, garlic, and mozzarella”. The only trouble with that is that an AST isn’t very similar to an N-ary relation.
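That mismatch can be softened by flattening: a repeated element in the parse becomes one row per repetition. For the pizza line, that's one row per ingredient; here's a small sketch with an ad-hoc grammar baked into one regex:

```python
import re

def menu_rows(line):
    # 'Name ($price): a, b, and c' -> one row per ingredient
    m = re.match(r"(?P<name>.*?) \(\$(?P<price>\d+)\): (?P<rest>.*)", line)
    if not m:
        return
    for ingredient in re.split(r",\s*(?:and\s+)?|\s+and\s+", m.group("rest")):
        yield {"name": m.group("name"), "price": m.group("price"),
               "ingredient": ingredient}

for row in menu_rows("Neapolitan ($17): Ham, tomato, garlic, and mozzarella"):
    print(row)
# {'name': 'Neapolitan', 'price': '17', 'ingredient': 'Ham'} ... and so on
```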
Ad-hoc-formatted program output is probably the largest current source of structured data that needs to be parsed out of big text files, and it's the case for which regexps are currently the best choice.
If you were writing a file to be parsed by such a query system, you could imagine a text editor that would make it very easy by offering syntax highlighting and contextual autocomplete — maybe any word that was the same as in the line above it would be grayed out, and TAB at the end of a line that was shorter than the previous line would copy down whatever text was the same on those two lines. So for example at the end of this:
Barbara is Joanie’s mother.
Joanie is Ryan’s mother.
Ryan
TAB would insert a grayed-out “is”, taking you to the next “field”, at which point an autocomplete dropdown combo box thing would suggest that you might want “Joanie” or “Ryan”. (Maybe it could guess that the two data columns were the same type and suggest “Barbara” too.)
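The TAB trick is just a field-wise comparison of the two previous lines; here's a toy sketch of that guess (hypothetical editor glue, nothing more):

```python
def tab_suggestion(prev2, prev1, current):
    # suggest the next word if the two previous lines agree on it
    a, b, done = prev2.split(), prev1.split(), current.split()
    i = len(done)  # index of the field the cursor is entering
    if i < len(a) and i < len(b) and a[i] == b[i]:
        return a[i]
    return None

print(tab_suggestion("Barbara is Joanie's mother.",
                     "Joanie is Ryan's mother.",
                     "Ryan"))  # -> 'is'
```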