Text relational query

Kragen Javier Sitaker, 2019-08-28 (10 minutes)

I talked a little about Lotus Agenda in Agenda hypertext. Today I want to explore a different related idea: a flexible tool for making relational views of text file contents, whether written by hand or generated by machine.

An Agenda file was, primarily, an unordered bag of text snippets. A lot of Unix utilities treat text file lines as sort of database records — roughly speaking, sort sorts the records, uniq eliminates duplicates and possibly provides you with a count, grep selects records matching a pattern, cut projects away some of the columns, look does a lookup in a sorted index, join does relational joins on sorted files, paste glues parallel files together column by column, and so on.

These utilities are often the fastest way to get certain tasks done, but they’re kind of hard to use (especially robustly), and they don’t deal with ad-hoc-formatted text at all, which was Agenda’s strength. However, Agenda’s queries and views used a sort of tagging system that wasn’t very powerful.

The basic idea: create relations by pattern-matching text lines

What if we had a system that also treated a text file as a bag of lines, but permitted relational queries on those lines, similar to how Sniki permitted relational queries on its link graph (see Prolog table outlining)? For example, consider these lines:

-rwxr-xr-x   1 user user     19096 Dec 30  2018 xshmucalc_fb
-rw-r--r--   1 user user       254 Dec 30  2018 erosion1d.c.~1~
-rwxr-xr-x   1 user user      8984 Dec 30  2018 erosion1d
-rw-r--r--   1 user user       534 Dec 30  2018 erosion1d-log.c.~1~
-rw-r--r--   1 user user      1122 Dec 30  2018 erosion1d.c
-rw-r--r--   1 user user      1362 Dec 30  2018 erosion1d-log.c
-rw-r--r--   1 user user     40055 Dec 30  2018 erosion1d-log.lst

Suppose you match them against this pattern, using two PCRE regexps:

-rwxr-xr-x 1 user user $size:\d+ $date:.* $executable

($executable gets the default pattern \S+; an unquoted space matches any amount of whitespace.)

This produces the following relation:

size   date         executable
-----  -----------  ------------
19096  Dec 30 2018  xshmucalc_fb
8984   Dec 30 2018  erosion1d

The other lines don’t match the pattern, so they don’t enter into the result.
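The pattern language is simple enough that the obvious implementation is transliteration into regexps. Here’s a rough sketch in Python of that idea (the names and details are my own strawman, not a settled design); it compiles such a pattern and extracts the relation, leaving out the braced ${name:regex} form:

    import re

    # Sketch: $name gets the default \S+, $name:regex uses the given
    # regex (read up to the next space), $_ is a non-capturing
    # wildcard, and a literal space matches any amount of whitespace.
    TOKEN = re.compile(r'\$\{(\w+)\}|\$(\w+)(:\S+)?|( +)|(.)')

    def compile_pattern(pat):
        parts = []
        for braced, name, rx, spaces, lit in TOKEN.findall(pat):
            var = braced or name
            if var:
                body = rx[1:] if rx else r'\S+'
                parts.append('(?:%s)' % body if var == '_' else
                             '(?P<%s>%s)' % (var, body))
            elif spaces:
                parts.append(r'\s+')
            else:
                parts.append(re.escape(lit))
        return re.compile(''.join(parts))

    def relation(pattern, lines):
        rx = compile_pattern(pattern)
        return [m.groupdict() for m in map(rx.match, lines) if m]

    lines = ['-rwxr-xr-x   1 user user     19096 Dec 30  2018 xshmucalc_fb',
             '-rw-r--r--   1 user user      1122 Dec 30  2018 erosion1d.c']
    print(relation(r'-rwxr-xr-x 1 user user $size:\d+ $date:.* $executable',
                   lines))
    # [{'size': '19096', 'date': 'Dec 30  2018', 'executable': 'xshmucalc_fb'}]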

How about files that have an Emacs backup file, the thing with the version number between ~ marks?

$_ $_ $_ $_ $size1:\d+ $date1:.* ${fname}.~$v~
&$_ $_ $_ $_ $size2:\d+ $date2:.* $fname

This is a kind of self-join, joining the set of records with itself, but using two different patterns. They might both match the same line in different ways, but each must match some line for the query as a whole to succeed. Those lines need not be consecutive — in the above data they are not.
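In implementation terms this is just a natural join on the shared variables. A sketch, with the two patterns hand-compiled to Python regexps:

    import re

    backup = re.compile(r'\S+\s+\S+\s+\S+\s+\S+\s+(?P<size1>\d+)\s+'
                        r'(?P<date1>.*)\s+(?P<fname>\S+)\.~(?P<v>\w+)~$')
    plain  = re.compile(r'\S+\s+\S+\s+\S+\s+\S+\s+(?P<size2>\d+)\s+'
                        r'(?P<date2>.*)\s+(?P<fname>\S+)$')

    def rows(rx, lines):
        return [m.groupdict() for m in map(rx.match, lines) if m]

    def natural_join(xs, ys):
        # keep row pairs that agree on all shared variables (here fname)
        for x in xs:
            for y in ys:
                if all(x[k] == y[k] for k in x.keys() & y.keys()):
                    yield {**x, **y}

    lines = ['-rw-r--r--   1 user user       254 Dec 30  2018 erosion1d.c.~1~',
             '-rw-r--r--   1 user user      1122 Dec 30  2018 erosion1d.c',
             '-rwxr-xr-x   1 user user      8984 Dec 30  2018 erosion1d']
    for row in natural_join(rows(backup, lines), rows(plain, lines)):
        print(row['fname'], row['v'], row['size1'], row['size2'])
    # erosion1d.c 1 254 1122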

Inference rules or views

A better way to formulate that query is through a view, or inference rule. Here’s an inference rule that factors out the common part of the above; it consists of a query, followed by one or more templates that transform the query results into inferred lines:

$_ $_ $_ $_ $size:\d+ $date:.* $fname
.: File $fname is $size bytes, modified $date.

This inference rule causes a panoply of inferred lines to implicitly spring into existence:

File erosion1d.c.~1~ is 254 bytes, modified Dec 30  2018.
File erosion1d is 8984 bytes, modified Dec 30  2018.
File erosion1d-log.c.~1~ is 534 bytes, modified Dec 30  2018.
File erosion1d.c is 1122 bytes, modified Dec 30  2018.

And so on. This allows us to write the self-join query from before much more conveniently, now matching these inferred lines:

File $fname $_:.*
&File ${fname}.~$v~ $_:.*

Such inference rules have the potential to produce infinite loops of inferring more and more lines:

GNU$stuff:.*
.: GNU’s Not Unix$stuff

In general I think it is undecidable whether a given set of rules will do this, so probably the best solution is to ensure that when it does happen, it doesn’t lead to disasters.
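Here’s a sketch of one such non-disaster policy: run the rules to a fixpoint, but cap the number of inference rounds. The cap of 10 is arbitrary, and the rule encoding is my own strawman:

    import re

    rule = (re.compile(r'GNU(?P<stuff>.*)'), 'GNU’s Not Unix{stuff}')

    def infer(lines, rules, max_rounds=10):
        known = set(lines)
        for _ in range(max_rounds):      # the cap is the disaster guard
            new = {tmpl.format(**m.groupdict())
                   for rx, tmpl in rules
                   for m in map(rx.match, known) if m} - known
            if not new:
                break                    # reached a fixpoint
            known |= new
        return known

    print(sorted(infer(['GNU Emacs'], [rule]), key=len))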

JSON?

This is messy, like the Unix shell utilities that inspired it — concatenating raw strings and then trying to parse the results is a recipe for disaster. So maybe inferring JSON, and using destructuring matches as in modern JS, would be better than inferring text lines; then we only need to use text matching when it’s part of the problem statement:

$_ $_ $_ $_ $size:\d+ $date:.* $fname
.: {fname, size, date}

{fname}
&{fname: "${fname}.~$v~"}

Here {fname} is syntactic sugar for {"fname": "$fname"}, and similarly for {fname, size, date}.
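A sketch of how the JSON version might work: the inference rule produces dicts, and the backup-file query runs over those rather than over re-concatenated strings (this assumes filenames are unique, and all the names here are mine):

    import re

    ls_line = re.compile(r'\S+\s+\S+\s+\S+\s+\S+\s+(?P<size>\d+)\s+'
                         r'(?P<date>.*)\s+(?P<fname>\S+)$')

    def facts(lines):
        # the inference rule: each matching line yields {fname, size, date}
        return [m.groupdict() for m in map(ls_line.match, lines) if m]

    def backups(fs):
        # the query: join {fname} with {fname: "${fname}.~$v~"}
        suffix = re.compile(r'(?P<base>.*)\.~(?P<v>\w+)~$')
        by_name = {f['fname']: f for f in fs}    # assumes unique fnames
        for f in fs:
            m = suffix.match(f['fname'])
            if m and m.group('base') in by_name:
                yield by_name[m.group('base')], f

    lines = ['-rw-r--r--   1 user user       254 Dec 30  2018 erosion1d.c.~1~',
             '-rw-r--r--   1 user user      1122 Dec 30  2018 erosion1d.c']
    for current, backup in backups(facts(lines)):
        print(current['fname'], '<-', backup['fname'])
    # erosion1d.c <- erosion1d.c.~1~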

Grouping

Suppose you specify an aggregate operation on a field:

$perm   1 user user $size.sum:\d+ $_:.*

Aggregation implies that you need to do some kind of grouping in order to have groups to run the aggregation over. How do you do the grouping? Well, in SQL, if you use grouping, the only fields you can legally SELECT without an aggregate operation are the ones you’re going to GROUP BY. (MySQL is historically lax about this.) So we could implicitly group by the fields that don’t have an aggregate operation attached to them; the above will group by permissions, with a row count thrown in for free:

      perm    count size.sum
---------- -------- --------
-rw-r--r--        5    43327
-rwxr-xr-x        2    28080

Multiple aggregations on the same field don’t fit easily into this approach, but they are rare.

A useful aggregation that makes sense in this context is .join(sep).
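A sketch of the implicit grouping, hand-compiled for the pattern above: the non-aggregated field forms the group key, and each group accumulates a count alongside the requested sum (.join(sep) would accumulate a list of strings in the same slot):

    import re
    from collections import defaultdict

    line = re.compile(r'(?P<perm>\S+)\s+1\s+user\s+user\s+(?P<size>\d+)\s+.*')

    def group_sum(lines):
        groups = defaultdict(lambda: [0, 0])      # perm -> [count, size.sum]
        for m in filter(None, map(line.match, lines)):
            g = groups[m.group('perm')]
            g[0] += 1
            g[1] += int(m.group('size'))
        return groups

    lines = ['-rwxr-xr-x   1 user user     19096 Dec 30  2018 xshmucalc_fb',
             '-rw-r--r--   1 user user       254 Dec 30  2018 erosion1d.c.~1~',
             '-rwxr-xr-x   1 user user      8984 Dec 30  2018 erosion1d',
             '-rw-r--r--   1 user user       534 Dec 30  2018 erosion1d-log.c.~1~',
             '-rw-r--r--   1 user user      1122 Dec 30  2018 erosion1d.c',
             '-rw-r--r--   1 user user      1362 Dec 30  2018 erosion1d-log.c',
             '-rw-r--r--   1 user user     40055 Dec 30  2018 erosion1d-log.lst']
    for perm, (count, total) in sorted(group_sum(lines).items()):
        print('%10s %8d %8d' % (perm, count, total))
    # -rw-r--r--        5    43327
    # -rwxr-xr-x        2    28080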

Sorting

By default aggregation results should be sorted descending, but by which field? You should be able to tag a field in your query as the sort key; consider this data from ps ux:

user     28358  0.7  7.8 2114240 310916 ?      Sl   Aug22  59:20 /usr/lib/firefox/firefox -contentproc -childID 
user     28905  0.0  1.2 122440 49028 pts/2    S    00:49   0:04 xpdf.real sensors-18-00768.pdf
user     29260  0.0  0.0  22460  3516 pts/4    Ss+  Aug26   0:00 bash
user     29299  0.0  0.0   8340  1892 pts/4    T    Aug26   0:00 less sensors-18-00768.pdf

You could match it with:

$uid $pid $_:.* $vsz:\d+ $rss:\d+ $tty $stat $start $cpu $cmd:.*

But maybe you’d like to sort by RSS:

$uid $pid $_:.* $vsz:\d+ $rss+:\d+ $tty $stat $start $cpu $cmd:.*

Or sort ascending instead:

$uid $pid $_:.* $vsz:\d+ $rss-:\d+ $tty $stat $start $cpu $cmd:.*

Of course more sophisticated sorting requires multi-field keys, specifying numeric versus ASCIIbetical versus locale versus mixed-text-and-numbers, field ordering in the key, and so on. But the simple case should be easy.
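The simple case might look like this in Python: a trailing + or - strips off the field name to give the sort key and direction, with numeric interpretation hard-coded here even though, as just noted, it really needs specifying:

    def parse_field(name):
        # 'rss+' -> ('rss', 'desc'); 'rss-' -> ('rss', 'asc')
        if name.endswith('+'):
            return name[:-1], 'desc'
        if name.endswith('-'):
            return name[:-1], 'asc'
        return name, None

    def sort_rows(rows, key, direction):
        return sorted(rows, key=lambda r: int(r[key]),
                      reverse=(direction == 'desc'))

    rows = [{'cmd': 'firefox', 'rss': '310916'},    # RSS values from the
            {'cmd': 'xpdf.real', 'rss': '49028'},   # ps ux sample above
            {'cmd': 'bash', 'rss': '3516'}]
    key, direction = parse_field('rss+')
    for r in sort_rows(rows, key, direction):
        print(r['rss'], r['cmd'])
    # 310916 firefox / 49028 xpdf.real / 3516 bash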

Non-equality criteria

The above allows you to select lines where a field is equal to a constant or to some concatenation that includes some other field value. So you can find your zero-byte files. But what if you want to find your files that are over a mebibyte? You need to be able to say something like this:

$perm   $nlinks user user $size[>1048576] $_:.* $name
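A sketch of how the bracketed criterion could work: $size still matches like a normal field, and the comparison becomes a post-match filter predicate:

    import re
    import operator

    OPS = {'>': operator.gt, '<': operator.lt,
           '>=': operator.ge, '<=': operator.le}

    def criterion(spec):                          # e.g. '>1048576'
        m = re.match(r'(>=|<=|>|<)(\d+)$', spec)
        op, rhs = OPS[m.group(1)], int(m.group(2))
        return lambda value: op(int(value), rhs)

    over_a_mebibyte = criterion('>1048576')
    print(over_a_mebibyte('2114240'), over_a_mebibyte('254'))  # True False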

Line sequences

Most text files aren’t purely unordered bags, and if you can grok a bit of their structure, you can do much more useful things. There’s a whole ladder of language power to ascend here — first strict sequences of line templates:

commit $commit
Author: $author:.*
Date:   $date:.*
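Matching a strict sequence is just zipping the line templates against runs of consecutive lines; a sketch, with the templates hand-compiled to regexps:

    import re

    SEQ = [re.compile(r'commit (?P<commit>\S+)$'),
           re.compile(r'Author: (?P<author>.*)$'),
           re.compile(r'Date:   (?P<date>.*)$')]

    def sequences(lines):
        for i in range(len(lines) - len(SEQ) + 1):
            ms = [rx.match(line) for rx, line in zip(SEQ, lines[i:])]
            if all(ms):       # all three templates matched consecutively
                yield {k: v for m in ms for k, v in m.groupdict().items()}

    log = ['commit 0123abc',                      # hypothetical git-log lines
           'Author: A. Hacker <a@example.com>',
           'Date:   Wed Aug 28 2019']
    print(list(sequences(log)))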

Then some kind of vague contextual line-template thing for simple hierarchical structures with header lines at the top, where each occurrence of the last line in the template creates a row, but implicitly pulls in all the fields from the most recent occurrence of each of the others:

* $h1:.*
...
** $h2:.*
...
*** $h3:.*
...
${_:.*}<${url}>$_:.*

This would be especially useful for things like ls -lR output:

-rw-r--r-- 1 user user 1060956 Dec  2  2018 e0000ea7.au
-rw-r--r-- 1 user user 1060956 Dec  2  2018 e0000ffe.au

./whatsapp-mockup:
total 104
-rw-r--r-- 1 user user  1192 Oct 23  2016 camera.png
-rw-r--r-- 1 user user   591 Oct 23  2016 check.png

Here you want to use a pattern something like this, so that you can know what directory each file is in:

${dir}:
total $dirtotal
...
$perm   $nlinks $user $group $size $date:.* $filename

Alternatively, you could tag some kind of thing onto the end of the line pattern that asks for the most recent matching line, without implying a hierarchical structure.
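A sketch of the contextual approach for the ls -lR case: remember the most recent ${dir}: header and pull it into each file row:

    import re

    dir_line  = re.compile(r'(?P<dir>.*):$')
    file_line = re.compile(r'(?P<perm>[-dlrwx]{10})\s+\S+\s+\S+\s+\S+\s+'
                           r'(?P<size>\d+)\s+.*\s(?P<filename>\S+)$')

    def walk(lines):
        context = None                    # most recent ${dir}: header seen
        for line in lines:
            m = dir_line.match(line)
            if m:
                context = m.group('dir')
                continue
            m = file_line.match(line)
            if m:
                yield {'dir': context, **m.groupdict()}

    listing = ['-rw-r--r-- 1 user user 1060956 Dec  2  2018 e0000ea7.au',
               '',
               './whatsapp-mockup:',
               'total 104',
               '-rw-r--r-- 1 user user  1192 Oct 23  2016 camera.png']
    for row in walk(listing):
        print(row['dir'], row['filename'])
    # None e0000ea7.au
    # ./whatsapp-mockup camera.png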

Then some kind of regular-expression system over line templates, and finally a full grammar system (which is getting into Tagging parsers), which could also extend down into the lines themselves, so that you can parse the ingredients out of “Neapolitan ($17): Ham, tomato, garlic, and mozzarella”. The only trouble with that is that an AST isn’t very similar to an N-ary relation.

Web scraping

This is probably the largest current source of structured data that needs to be parsed out of big text files, and the one for which regexps are currently the best choice.

Data entry

If you were writing a file to be parsed by such a query system, you could imagine a text editor that would make data entry very easy by offering syntax highlighting and contextual autocomplete. Maybe any word that was the same as in the line above it would be grayed out, and TAB at the end of a line that was shorter than the previous line would copy down whatever text was the same on the two preceding lines. So for example at the end of this:

Barbara is Joanie’s mother.
Joanie is Ryan’s mother.
Ryan

TAB would insert a grayed-out “is”, taking you to the next “field”, at which point an autocomplete dropdown combo box thing would suggest that you might want “Joanie” or “Ryan”. (Maybe it could guess that the two data columns were the same type and suggest “Barbara” too.)
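The copy-down part of that is simple to sketch: compare the two preceding lines word by word, starting after the words already typed:

    def copy_down(prev2, prev1, current):
        # Copy (gray out) the words on which the two previous lines
        # agree, starting past what's already typed on the current line.
        a, b, typed = prev2.split(), prev1.split(), current.split()
        out = []
        for i in range(len(typed), min(len(a), len(b))):
            if a[i] != b[i]:
                break
            out.append(a[i])
        return ' '.join(out)

    print(copy_down('Barbara is Joanie’s mother.',
                    'Joanie is Ryan’s mother.',
                    'Ryan'))              # -> 'is'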
