Graph construction

Kragen Javier Sitaker, 2016-09-08 (updated 2017-07-19) (23 minutes)

(See also Circuit notation and A stack of stacks for simple modular electronics.)

I think I have a new approach to concisely and responsively describing directed graph structures that is useful in many different creative endeavors that can be undertaken with computers. I don’t yet have a working implementation; this article is only an exploration.

Here’s an overview of different possible notations for edge-labeled directed graphs (and related structures), including some new applications.

As an example application, in one variant of the notation, this is a schematic for an energy-harvesting power supply circuit taken from a paper on the subject. Here “,” represents parallel connection, concatenation represents serial connection, and <> represents reversal, among other things:

{{<d> {{ac i, c}}=q <d>, {<d> q <d>, c}} s {<d>, l {c, {<v>, r} r}}}

However, this approach is not limited to circuit design. Dataflow computations, program control flow, finite state machines, and circuit schematics, among other things, can all be conveniently represented as edge-labeled directed graphs.

These techniques should be applicable to user interaction and generative programming in all of these areas.

Netlists

You can write these things as netlists or 3-tuples; an abbreviated example of a bubble-and-arrow diagram from the dot man page:

   digraph test123 {
           a -> z [label="hi", weight=100];
           x -> z [label="multi-line\nlabel"];
           b -> x;
   }

Or, here’s a circuit netlist in Spice:

Multiple dc sources
v1 1 0 dc 24
v2 3 0 dc 15
r1 1 2 10k
r2 2 3 8.1k
r3 2 0 4.7k
.end

This requires naming every node; the nodes are named a, b, x, and z in the dot example, and 0, 1, 2, and 3 in the Spice example.

Here’s a dataflow example by zahorjan@cs.washington.edu in MIPS assembly, which is at the same level of abstraction, although it reuses the “dataflow node labels” $t0 and $t1 — each time they are the first operand of an instruction, their previous value is overwritten:

srl    $t0, $t0, 1       # i/2
addi   $t1, $gp, 28      # &A[0]
sll    $t0, $t0, 2       # turn i/2 into a byte offset (*4)
add    $t1, $t1, $t0     # &A[i/2]

If we were to rewrite that with SSA-style subscripts (what Ada Lovelace used prefixed superscripts for) we get:

srl    $t0₁, $t0₀, 1
addi   $t1₀, $gp₀, 28
sll    $t0₂, $t0₁, 2
add    $t1₁, $t1₀, $t0₂
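This renaming is mechanical enough to sketch in code. In this sketch the tuple encoding of instructions and the subscript helper are my own; reads see the current version of a register, and each write mints a fresh one:

```python
# Map ordinary digits to the subscript digits used above.
SUB = str.maketrans('0123456789', '₀₁₂₃₄₅₆₇₈₉')

def sub(n):
    return str(n).translate(SUB)

def ssa_rename(instructions):
    """Rename (op, dest, src, ...) tuples with SSA-style subscripts."""
    version = {}
    out = []
    for op, dest, *srcs in instructions:
        # Reads see the current version; a first read is version 0.
        new_srcs = [s + sub(version.setdefault(s, 0)) if s.startswith('$') else s
                    for s in srcs]
        # A write creates a fresh version of the destination.
        version[dest] = version[dest] + 1 if dest in version else 0
        out.append((op, dest + sub(version[dest]), *new_srcs))
    return out
```

Running it on the four instructions above reproduces the subscripted listing.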

There are better alternatives, though.

Infix

The traditional solution for the arithmetic dataflow case, of course, is to divide the computation into unary and binary operations and write them in infix notation; the above MIPS assembly then renders as follows:

($t0₀ >>> 1 << 2) + ($gp₀ + 28)

This is considerably easier to follow, bringing related things closer together and eliminating most of the accidental names, ending up about three times shorter.

By itself, it only covers a very restricted range of cases: tree-shaped digraphs, in which, if interpreted as dataflow graphs, each value is used only once, producing a single output. But it isn’t immediately obvious how to apply that to get more general kinds of graphs.

Kleene’s Theorem

Stephen Kleene (whose name I mispronounced for 21 years as [klin], but TIL it’s actually [kleɪni]) showed that any regular language, and thus the possible paths through a finite state automaton, can be generated by a regular expression. As I understand it, a regular expression on some alphabet Γ is recursively defined as the smallest language L such that

L = Γ ∪ ({“concat”, “alt”} × L × L) ∪ (L × {“*”}) ∪ {∅, ε}

Hmm, that definition is shitty, I’ll try again. A regular expression Γ re over Γ is defined as the sum type

Γ re = Lit of Γ
     | Concat of (Γ re × Γ re)
     | Alt of (Γ re × Γ re)
     | Closure of Γ re
     | Impossible
     | Emptystring

The language generated by a regular expression is defined as follows:

lang(Lit(x)) = {x}
lang(Concat(a, b)) = {wa || wb for wa in lang(a) for wb in lang(b)}
lang(Alt(a, b)) = lang(a) ∪ lang(b)
lang(Closure(a)) = {ε} ∪ lang(Concat(a, Closure(a)))
lang(Impossible) = ∅
lang(Emptystring) = {ε}

Here || is string concatenation. (Or word concatenation, as combinatorists apparently like to say. And who can argue? A C program as little resembles a piece of twine as it does an entry in the dictionary.)
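Since Closure makes most of these languages infinite, a direct implementation has to bound word length. This Python sketch models a regular expression as nested tuples (my own encoding, following the sum type above) and enumerates lang up to a length bound:

```python
# Regular expressions as nested tuples: ('lit', x), ('concat', a, b),
# ('alt', a, b), ('closure', a), ('impossible',), ('emptystring',).
# Closure makes most languages infinite, so we enumerate only the
# words no longer than maxlen.

def lang(re, maxlen):
    kind = re[0]
    if kind == 'lit':
        return {re[1]} if len(re[1]) <= maxlen else set()
    if kind == 'concat':
        return {wa + wb for wa in lang(re[1], maxlen)
                        for wb in lang(re[2], maxlen - len(wa))}
    if kind == 'alt':
        return lang(re[1], maxlen) | lang(re[2], maxlen)
    if kind == 'closure':
        words = {''}
        while True:   # iterate lang(Concat(a, Closure(a))) to a fixed point
            more = {w + wa for w in words
                           for wa in lang(re[1], maxlen - len(w))} - words
            if not more:
                return words
            words |= more
    if kind == 'impossible':
        return set()
    if kind == 'emptystring':
        return {''}
    raise ValueError(re)
```

For example, lang of (a|b)* with maxlen 2 yields the seven words of length at most 2 over {a, b}.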

So the nifty thing here is that this gives you a way to generate all possible finite-state-machine languages by building up a finite-state machine from simpler single-entry single-exit finite-state machines, which is to say, building up a directed graph from simpler two-terminal directed graphs. You can’t express every possible graph in this way; in particular, there is a set of “irreducible control flow structures” which, if present, require duplication of nodes. (McCabe’s 1976 paper “A Complexity Measure” explains in more detail in the context of program control flow.)

Traditionally in regular expressions we write Concat, Alt, and Closure as mere juxtaposition, infix |, and postfix *, respectively, although, in the context of Kleene algebras, alternation is traditionally written with infix + instead.

And this is somewhat more expressive; it can express any series-parallel circuit. This includes trees as a subset. If we use the traditional |* notation, we can express the above MIPS-assembly arithmetic graph as follows:

XXX try , instead of |?

((($t0₀ | 1) >>> | 2) << | ($gp₀ | 28) +) +

Here we are supposing that constants and variables obliterate whatever data was present on their branch of the dataflow digraph, that rejoining dataflows creates a node where both pieces of data are available (with the left and right branches distinguished!), and that the operators then reduce the two pieces of data available at their node to one.

But now we can also represent the SPICE circuit I gave above, for example. One possible representation is as follows:

(v⁻¹[dc 24] r[10k] (r[8.1k] v[dc 15] | r[4.7k]))*

This contains all the useful information in the original netlist, but it is both much easier to write and, I would argue, even much easier to read; it’s easy to see what is in series and what is in parallel. Interpreted as a regular expression, it yields all of the possible traversals through the circuit in a certain direction (which could be the direction current is flowing, but happens not to be).

It’s common to be able to express the majority of an electrical circuit in terms of such series-parallel circuits, with the occasional deviation from series-parallel constructions; the same is true for control flow in programs.

Postfix

The arithmetic expression above may look suspicious:

((($t0₀ | 1) >>> | 2) << | ($gp₀ | 28) +) +

It happens to be exactly equivalent to how you’d express this expression in postfix (RPN), but with stack relationships explicitly called out — every time some value produced by b is on top of some value produced by a on the stack, we have written (a|b) rather than just a b. If we adopt the more usual convention, this expression reduces to the following:

$t0₀ 1 >>> 2 << $gp₀ 28 + +
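Evaluating that reduced form is the standard RPN loop. A sketch, where the register values in the environment are arbitrary examples and >>> is treated as a plain right shift (adequate for nonnegative values):

```python
def rpn(tokens, env):
    """Evaluate a whitespace-separated RPN expression."""
    ops = {'>>>': lambda a, b: a >> b,  # logical shift; fine for nonnegative a
           '<<': lambda a, b: a << b,
           '+': lambda a, b: a + b}
    stack = []
    for t in tokens.split():
        if t in ops:
            b, a = stack.pop(), stack.pop()
            stack.append(ops[t](a, b))
        elif t in env:
            stack.append(env[t])    # a named dataflow input
        else:
            stack.append(int(t))    # a literal constant
    return stack.pop()
```

With $t0₀ = 10 and $gp₀ = 1000, this computes ((10 >> 1) << 2) + (1000 + 28) = 1048.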

Can we apply this to the circuit diagram as well? It’s simple:

{ 24 dc v⁻¹  10 k r  { 8.1 k r  15 dc v | 4.7 k r } }

Consider the expression as an RPN program to construct a circuit diagram, with a stack consisting of node identifiers and numbers. It begins with a single node identifier on the stack, and it uses the following operations:

It turns out that, in this context, this semantics for { } makes them adequate both for Kleene closure * and to enclose a binary alternation | in the context of circuit schematic capture, because a wire forward from the beginning of a part of a circuit to its end is the same as a wire backward from its end to its beginning. In other contexts, like regular expressions, these two are not equivalent.

v⁻¹ is a bit suspiciously ad-hoc; really we might prefer this operation of flipping a bit of circuitry around before connecting it to be an orthogonal operation, rather than separately defining v⁻¹, d⁻¹, electrolytic_capacitor⁻¹, and so on. You could instead define v to allocate two fresh nodes and push them both on the stack, then use a separate forward or reverse operation to short two of them together and remove them from the stack. Alternatively, you could have < and > operations that enclose a piece of circuitry to be reversed — the first allocating a fresh node and pushing two references to it onto the stack, the second performing the equivalent of rot }. But really this is the converse relationship on binary relations, about which more later.
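The stack semantics is easiest to pin down in code, so here is a Python sketch of one consistent design. The specific operation set is my own guess: numbers push themselves, k scales the top of the stack by 1000, dc tags a value, a component letter pops its value and a node and pushes a fresh node, { duplicates the top node, | swaps the top two, and } shorts them together.

```python
class Builder:
    def __init__(self):
        self.parent = {}        # union-find: shorted-together nodes
        self.edges = []         # (kind, from_node, to_node, value)
        self.next_node = 0
        self.stack = [self.new_node()]   # start with a single free node

    def new_node(self):
        n = self.next_node
        self.next_node += 1
        self.parent[n] = n
        return n

    def find(self, n):          # canonical representative of a net
        while self.parent[n] != n:
            n = self.parent[n]
        return n

    def run(self, program):
        for tok in program.split():
            if tok == '{':                  # duplicate the top node
                self.stack.append(self.stack[-1])
            elif tok == '|':                # swap the top two nodes
                self.stack[-1], self.stack[-2] = self.stack[-2], self.stack[-1]
            elif tok == '}':                # short the top two nodes together
                a, b = self.stack.pop(), self.stack.pop()
                self.parent[self.find(a)] = self.find(b)
                self.stack.append(b)
            elif tok == 'k':                # scale by 1000
                self.stack.append(self.stack.pop() * 1000)
            elif tok == 'dc':               # tag a source value
                self.stack.append(('dc', self.stack.pop()))
            elif tok in ('r', 'c', 'l', 'v', 'd', 'v⁻¹', 'd⁻¹'):
                value = None if tok[0] == 'd' else self.stack.pop()
                a = self.stack.pop()
                b = self.new_node()
                if tok.endswith('⁻¹'):      # reversed component
                    self.edges.append((tok[0], b, a, value))
                else:
                    self.edges.append((tok[0], a, b, value))
                self.stack.append(b)
            else:                           # a number pushes itself
                self.stack.append(float(tok))
        return self

    def netlist(self):
        return [(kind, self.find(a), self.find(b), value)
                for kind, a, b, value in self.edges]
```

Running it on the expression above reproduces the components and node numbering of the earlier SPICE netlist, with node 0 as the starting net.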

In general, I find postfix less readable and more error-prone than infix, so I’m not sure it’s really the right thing, but it has a few advantages which might be relevant in this context:

I’m not sure that “edge-labeled directed graphs” are quite the right abstraction for electrical circuits. Yes, a battery or a diode is a directed relationship between two wire nodes/nets, so representing it as a labeled directed edge in a graph makes some sense. But a bipolar transistor is a relationship between three wire nodes/nets, each of which has a unique relationship to it, and a microcontroller is in some sense a relationship between dozens of nodes. These are more like hypergraphs, but the existing work on directed hypergraphs is pretty slim.

This mirrors the state of relational databases and logic programming, where most of the work is with some arbitrary number of finitary (N-ary) relations, rather than binary relations.

A bipartite directed graph, where wire nodes/nets are one part and components are the other part, would be an adequate model. In dataflow, the parts would instead be values and operations.

Despite all this, and despite being quite compact, postfix, even augmented with the { | } operations above, can’t construct arbitrary edge-labeled directed graphs; for any given directed graph it can only construct a graph whose traversals are equivalent, and even that only at the cost of a potentially exponential expansion in graph size. In particular, it can’t construct any nonplanar graphs, and not even all planar graphs.

Variables/Labels

The usual solution for this in dataflow is to use (possibly immutable) named variables, and in control flow to use labels and goto — essentially escaping back to netlists, where connection is indicated by coincidence of names rather than proximity. Labels or variables are “wormholes” in the otherwise planar and recursive graph construction; they can connect anywhere. Usually you don’t need very many of them.

The above discussion of regular expressions leads us to speculate on whether we should be labeling nodes, as is almost universally done, or entire subgraphs. Consider the following SPICE model, from the same textbook page as the example I gave earlier:

fullwave bridge rectifier
v1 1 0 sin(0 15 60 0 0)
rload 1 0 10k
d1 1 2 mod1
d2 0 2 mod1
d3 3 1 mod1
d4 3 0 mod1
.model mod1 d
.tran .5m 25m
.plot tran v(1,0) v(2,3) 
.end

Like bridge circuits in general, this is not a series-parallel circuit; you need a label. You can represent most of it as a series-parallel circuit; let’s use the postfix form, and suppose we have the direction-reversing brackets < and > suggested earlier:

{ 15 60 ac v  { d < d > | < d > d } }

This expresses the entire circuit above except for the load resistance. If we add the new operation }= to define a label for a subgraph, defining the word following it to hook up the two nodes on the stack as if they were an arc, then we can write it as follows:

{ 10 k r }= rload  { 15 60 ac v  { d rload d | < d rload d > } }

This defines exactly the same circuit as the SPICE model in a third less text, although it doesn’t specify the simulation parameters. Each time rload is invoked, it connects its environment to the same component, unlike things like d, which generate a new component each time they’re invoked.

Alternatively, and more traditionally, we could define }= to simply define a name for the node on top of the stack, removing it from the stack; when invoked, the name shorts that saved node to whatever is on top of the stack at the time. Then we can define the circuit as follows:

{ 15 60 ac v  { d { 10 k r }= rload < d > | < d > rload d } }

Explicit series connection

Another alternative, which I mention here more for completeness than because I think it's necessarily better than the others mentioned, is to make the series connection explicit rather than implicit. In an infix syntax, for example, using the hyphen for the series connection, R-C is a series connection of a resistor and a capacitor, while R|C is a parallel connection. Then we can do things like L-(C|L-(C|R)). Here, as in the notation at the beginning, the values being combined are not circuit nodes or nets but rather subcircuits with a beginning and end.

We can, of course, use a stack notation for this as well; for example, instead of L-(C|L-(C|R)), we could write LCLCR|-|-. A stack-notation expression can represent any number of subcircuits rather than just one, because it can end with any number of items on the stack; similarly, it can represent a transformation of a circuit or set of circuits, rather than just a circuit. (Hmm, this is a digression; perhaps it belongs elsewhere.)
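A sketch of this stack notation as a builder of nested series/parallel terms; the tuple representation of subcircuits is my own:

```python
def build(tokens):
    """Build series/parallel terms from a postfix expression.

    Component letters push themselves; '-' combines the top two items
    in series, '|' in parallel. Returns whatever is left on the stack,
    which may be more than one subcircuit.
    """
    stack = []
    for t in tokens.split():
        if t == '-':
            b, a = stack.pop(), stack.pop()
            stack.append(('series', a, b))
        elif t == '|':
            b, a = stack.pop(), stack.pop()
            stack.append(('parallel', a, b))
        else:
            stack.append(t)
    return stack
```

Applied to the example, L C L C R | - | - leaves the single term for L-(C|L-(C|R)) on the stack.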

Without the explicit series combination operator, we can get the same amount of power in a stack-language context by having an explicit “push new disconnected node” operator. I think.

Other arities of edge creation

You could also create an edge between the top two nodes on the stack, consuming both and returning neither. In this model, you must explicitly create each node, then explicitly duplicate any node that will be connected to more than one edge; in a sense, the expected degree of a node is 1. (So to get three resistors in series, you could say NODE NODE TUCK R NODE TUCK R NODE TUCK R, using Forth’s TUCK to keep a copy of each new node underneath the two nodes that R consumes.) The other approaches I’ve been describing all have, in some sense, an expected degree of 2 for each node — the operation of adding each new edge to the graph consumes a node and produces a node, which then can be merged with some other node in the (implicitly exceptional) case where that is desired.

You could also imagine leaving all the nodes on the stack, not consuming any of them, so NODE NODE R R R would give you three resistors in parallel.

What’s the best default node degree?

Eyeballing a couple of small schematics, I have:

In some sense, this suggests that the “average” degree of a circuit net (i.e. wire) is somewhere around 3.

However, that’s including the power rails and ground as nets, which commonly have lots of connections. For example, in the last circuit above (Figure 12.67 from Horowitz & Hill 3ed., a simple laser drive circuit), ground has 7 connections and the positive power rail has 3. If we discount those, the average goes down to 2.416̄. The first circuit, figure 5.56, has three ground connections; discounting those gives us 2.375.

Consequently, I think that the most likely optimum is not 1 or ∞ or even somehow 3 but 2.

Binary relations and the relational product

A few years back I spent some time on a database query language based on binary relations called Binate. It has several things in common with the work I’m describing here.

One of the things I never resolved properly in Binate was the status of constants; I treated them as relations from anything whatsoever to the constant value, like the transformation of a register state produced by a load-immediate instruction. This introduces potentially awkward infinities if you take its converse.

Binate uses a really-infix grammar whose expressions all evaluate to binary relations, so it handles non-binary finitary relations with an N-ary relational-product operation, which produces nodes that participate in a bunch of fresh binary relations.

Labels, dynamic scope, and linking

Earlier I mentioned NAND gates as an example of a three-terminal circuit element for which it’s natural to consume two input nets from the stack and produce an output net, but in reality a two-input NAND gate has five terminals, since it also needs power and ground. Usually in a big part of the circuit you use the same power and ground for all the NAND gates, but maybe not in the entire design. Flip-flops additionally have a clock, and it’s very common to hook lots of them up to the same clock, but to have more than one clock domain in a design.

An analogous phenomenon pops up in graphics programming: it’s common to do a bunch of drawing operations with the same color, the same font, the same line joins, and so on. It’s annoying to have to specify this in every drawing operation.

Dynamically-scoped variables — accessible to lots of things, but set during particular dynamic scopes and restored afterwards — are one possible solution for this. In a stack machine, you could use a PostScript-style load operation to save the current binding of a label on the stack, then run a bunch of graph-constructing code, then restore it at the end. This requires that invoking the label connect you to its current value, like a DEFERred word in Forth or like an operator in PostScript, not like a regular word in Forth.
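A sketch of this idea in Python, using a context manager to save and restore the dynamic bindings; the nand word and the vcc/gnd names are hypothetical:

```python
import contextlib

# Current dynamic bindings for implicit terminals (names hypothetical).
dynamic = {'vcc': 'vcc0', 'gnd': 'gnd0'}
edges = []

@contextlib.contextmanager
def binding(**kwargs):
    """Rebind some dynamic variables for the duration of a block."""
    saved = {k: dynamic[k] for k in kwargs}
    dynamic.update(kwargs)
    try:
        yield
    finally:
        dynamic.update(saved)   # restore on exit, even on error

def nand(a, b, out):
    # Power and ground come from the innermost dynamic binding,
    # not from explicit arguments or the stack.
    edges.append(('nand', a, b, out, dynamic['vcc'], dynamic['gnd']))
```

A gate instantiated inside a with binding(...) block picks up the rebinding; gates outside it use the defaults again.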

Another way of looking at this is that it’s like what a linker does with an object file: it resolves the relocations for a given label to point to a given label definition from its environment at link time.

In some cases it may make sense to provide default arguments for things like pullup resistors and bypass capacitors, and those could be provided in a similar way, rather than taken from the stack.

Dropping brackets for parallel construction

In a sense the above uses series combination as the default operation, which can be sort of overridden by providing a different context, like the { } or < > contexts. In algebra, we also have a default operation, which is multiplication, but we don't have to write { 3x² + { 2x + 5 } } conventionally; instead we just write 3x² + 2x + 5. You could implement this by beginning with an empty sum (consisting of, say, 0) and an empty term (consisting of 1) “on the stack” and having the + separator sum the top two terms and push a new empty term “on the stack”; then all the other things can just implicitly multiply. This does require a final summing action on termination of input, and it can’t handle infix division (it needs brackets), but that's not so relevant to graph construction.

You could insist on a single outermost set of parentheses, with these stack effects:

These rules give the right answer for nested expressions like ( ( 3 x + 2 ) x + 4 ).
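One way to sketch those stack effects, under my assumption that ( pushes an empty sum and an empty term, factors multiply into the current term, + folds the finished term into the sum and starts a fresh term, and ) performs the final summing action and multiplies the result into the enclosing term:

```python
def evaluate(tokens, env):
    """Evaluate a bracket-dropping sum-of-products expression."""
    stack = []
    for t in tokens.split():
        if t == '(':
            stack += [0, 1]            # an empty sum and an empty term
        elif t == '+':
            stack[-2] += stack[-1]     # fold the finished term into the sum
            stack[-1] = 1              # and start a new empty term
        elif t == ')':
            total = stack.pop() + stack.pop()   # the final summing action
            if stack:
                stack[-1] *= total     # the sum is a factor of the enclosing term
            else:
                stack.append(total)    # outermost parentheses: the result
        else:                          # a constant or variable factor
            stack[-1] *= int(t) if t.isdigit() else env[t]
    return stack.pop()
```

With x = 5, ( ( 3 x + 2 ) x + 4 ) evaluates to (3·5 + 2)·5 + 4 = 89.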
