Suppose you have a programming system with a garbage-collected object-graph memory where objects are type-tagged immutable vectors of fields. You really only need two stack-bytecode operations for the object memory:
NEW ( class [ val0 val1 ... -- ref )
AT ( ref idx -- val )
Here [ is a PostScript-style stack mark for variadic function invocation. AT fetches a field value, and NEW creates an object containing the given fields. Alternatively, you could bake idx into AT’s bytecode and have it always operate on self, the receiver containing the executing method, and NEW could be an ordinary method call on class which is implemented with magical native code. Either way, you need bounds-checking, either at compile-time or at run-time.
With a reasonable generational garbage collector, this can be the most efficient way to do many things. Other algorithms, however, also need mutation, so you need a third operation to mutate fields: ATPUT.
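A minimal sketch of how NEW, AT, and ATPUT might look in such an interpreter, written in Python for concreteness; the mark object, the Obj record, and the ATPUT stack effect ( ref idx val -- ) are assumptions for illustration, not a fixed design:

MARK = object()                 # PostScript-style stack mark pushed by [

class Obj:
    def __init__(self, cls, fields):
        self.cls = cls              # type tag
        self.fields = list(fields)  # immutable in spirit; only ATPUT mutates

def op_mark(stack):
    stack.append(MARK)

def op_new(stack):
    # NEW ( class [ val0 val1 ... -- ref )
    vals = []
    while stack[-1] is not MARK:
        vals.append(stack.pop())
    stack.pop()                     # discard the mark
    cls = stack.pop()
    vals.reverse()
    stack.append(Obj(cls, vals))

def op_at(stack):
    # AT ( ref idx -- val ), with the run-time bounds check
    idx = stack.pop()
    ref = stack.pop()
    if not 0 <= idx < len(ref.fields):
        raise IndexError(idx)
    stack.append(ref.fields[idx])

def op_atput(stack):
    # ATPUT ( ref idx val -- ), the mutation operation
    val = stack.pop()
    idx = stack.pop()
    ref = stack.pop()
    if not 0 <= idx < len(ref.fields):
        raise IndexError(idx)
    ref.fields[idx] = val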
For a Pythonish language (see Thredsnek: a tiny Python-flavored programming language) you need built-in dict, list, int, float, and probably string classes, plus reflection, dynamic dispatch of variadic method calls, an iterator protocol, and exceptions; and, of course, control flow.
So a minimal OO Pythonish bytecode might need INVOKE, AT, ATPUT, VAR, VARPUT, LIT, IF, GOTO, LOOP, [, and RET: 11 operations. I’m not quite sure how to implement exceptions; catch and finally blocks are rare enough in garbage-collected languages (unlike in C++ or Rust, where every new variable sort of introduces one) that doing it the straightforward way with PUSHJMPBUF and POPJMPBUF opcodes might be adequate.
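Doing it “the straightforward way” could amount to a per-frame stack of handler records rather than literal jump buffers; the following sketch, with invented names and an invented frame layout, shows PUSHJMPBUF recording where to resume and at what stack depth, POPJMPBUF discarding that record on normal exit, and a raise unwinding to the innermost handler:

class Frame:
    def __init__(self, code):
        self.code = code
        self.pc = 0
        self.stack = []
        self.handlers = []          # (resume_pc, stack_depth) pairs

def op_pushjmpbuf(frame, handler_pc):
    frame.handlers.append((handler_pc, len(frame.stack)))

def op_popjmpbuf(frame):
    frame.handlers.pop()

def raise_exception(frame, exc):
    if not frame.handlers:
        raise RuntimeError("uncaught", exc)   # or keep unwinding into the caller's frame
    resume_pc, depth = frame.handlers.pop()
    del frame.stack[depth:]         # restore the stack to its depth at PUSHJMPBUF
    frame.stack.append(exc)         # hand the exception to the catch block
    frame.pc = resume_pc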
An alternative block-structured design might unify method scope, object scope, and global scope as simply being three arbitrary levels of an arbitrary set of nested scopes, and this does have benefits to recommend it. However, it requires recovering, at compile-time or even at run-time, further information that was available in the programmer’s mind: in particular, which blocks might become closures (i.e. when objects are created) and which variables are captured by those blocks (i.e. the instance variables of the implicitly created objects).
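A small Python example of the information that has to be recovered; here the compiler must notice that adder is the block that becomes a closure and that n, but not total, is the variable it captures (and so would become the instance variable of the implicitly created object):

def make_adder(n):
    total = n + 1        # ordinary local: no block captures it
    def adder(x):        # this block becomes a closure
        return x + n     # n is captured; it must outlive make_adder's frame
    return adder

add3 = make_adder(3)
assert add3(4) == 7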
Looking at existing bytecode may be instructive. Here’s the disassembly of the CPython bytecode for a somewhat arbitrarily chosen Python library function, the difference_update method of the deprecated sets module.
Disassembly of difference_update:
479 0 LOAD_FAST 0 (self)
3 LOAD_ATTR 0 (_data)
6 STORE_FAST 2 (data)
480 9 LOAD_GLOBAL 1 (isinstance)
12 LOAD_FAST 1 (other)
15 LOAD_GLOBAL 2 (BaseSet)
18 CALL_FUNCTION 2
21 POP_JUMP_IF_TRUE 39
481 24 LOAD_GLOBAL 3 (Set)
27 LOAD_FAST 1 (other)
30 CALL_FUNCTION 1
33 STORE_FAST 1 (other)
36 JUMP_FORWARD 0 (to 39)
482 >> 39 LOAD_FAST 0 (self)
42 LOAD_FAST 1 (other)
45 COMPARE_OP 8 (is)
48 POP_JUMP_IF_FALSE 64
483 51 LOAD_FAST 0 (self)
54 LOAD_ATTR 4 (clear)
57 CALL_FUNCTION 0
60 POP_TOP
61 JUMP_FORWARD 0 (to 64)
484 >> 64 SETUP_LOOP 33 (to 100)
67 LOAD_GLOBAL 5 (ifilter)
70 LOAD_FAST 2 (data)
73 LOAD_ATTR 6 (__contains__)
76 LOAD_FAST 1 (other)
79 CALL_FUNCTION 2
82 GET_ITER
>> 83 FOR_ITER 13 (to 99)
86 STORE_FAST 3 (elt)
485 89 LOAD_FAST 2 (data)
92 LOAD_FAST 3 (elt)
95 DELETE_SUBSCR
96 JUMP_ABSOLUTE 83
>> 99 POP_BLOCK
>> 100 LOAD_CONST 1 (None)
103 RETURN_VALUE
The original source code:
def difference_update(self, other):
    """Remove all elements of another set from this set."""
    data = self._data
    if not isinstance(other, BaseSet):
        other = Set(other)
    if self is other:
        self.clear()
    for elt in ifilter(data.__contains__, other):
        del data[elt]
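Listings like this can presumably be reproduced with the standard dis module, under Python 2, since the sets module was removed in Python 3 (and newer CPythons use a somewhat different opcode set anyway); disassembling the whole class is what prints the “Disassembly of …:” headers:

import dis
import sets                          # Python 2 only

dis.dis(sets.Set.difference_update)  # just this method
dis.dis(sets.Set)                    # every method, with "Disassembly of ...:" headers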
Here’s a longer method, from another obscure standard library module, netrc.netrc.__repr__:
Disassembly of __repr__:
130 0 LOAD_CONST 1 ('')
3 STORE_FAST 1 (rep)
131 6 SETUP_LOOP 137 (to 146)
9 LOAD_FAST 0 (self)
12 LOAD_ATTR 0 (hosts)
15 LOAD_ATTR 1 (keys)
18 CALL_FUNCTION 0
21 GET_ITER
>> 22 FOR_ITER 120 (to 145)
25 STORE_FAST 2 (host)
132 28 LOAD_FAST 0 (self)
31 LOAD_ATTR 0 (hosts)
34 LOAD_FAST 2 (host)
37 BINARY_SUBSCR
38 STORE_FAST 3 (attrs)
133 41 LOAD_FAST 1 (rep)
44 LOAD_CONST 2 ('machine ')
47 BINARY_ADD
48 LOAD_FAST 2 (host)
51 BINARY_ADD
52 LOAD_CONST 3 ('\n\tlogin ')
55 BINARY_ADD
56 LOAD_GLOBAL 2 (repr)
59 LOAD_FAST 3 (attrs)
62 LOAD_CONST 4 (0)
65 BINARY_SUBSCR
66 CALL_FUNCTION 1
69 BINARY_ADD
70 LOAD_CONST 5 ('\n')
73 BINARY_ADD
74 STORE_FAST 1 (rep)
134 77 LOAD_FAST 3 (attrs)
80 LOAD_CONST 6 (1)
83 BINARY_SUBSCR
84 POP_JUMP_IF_FALSE 114
135 87 LOAD_FAST 1 (rep)
90 LOAD_CONST 7 ('account ')
93 BINARY_ADD
94 LOAD_GLOBAL 2 (repr)
97 LOAD_FAST 3 (attrs)
100 LOAD_CONST 6 (1)
103 BINARY_SUBSCR
104 CALL_FUNCTION 1
107 BINARY_ADD
108 STORE_FAST 1 (rep)
111 JUMP_FORWARD 0 (to 114)
136 >> 114 LOAD_FAST 1 (rep)
117 LOAD_CONST 8 ('\tpassword ')
120 BINARY_ADD
121 LOAD_GLOBAL 2 (repr)
124 LOAD_FAST 3 (attrs)
127 LOAD_CONST 9 (2)
130 BINARY_SUBSCR
131 CALL_FUNCTION 1
134 BINARY_ADD
135 LOAD_CONST 5 ('\n')
138 BINARY_ADD
139 STORE_FAST 1 (rep)
142 JUMP_ABSOLUTE 22
>> 145 POP_BLOCK
137 >> 146 SETUP_LOOP 85 (to 234)
149 LOAD_FAST 0 (self)
152 LOAD_ATTR 3 (macros)
155 LOAD_ATTR 1 (keys)
158 CALL_FUNCTION 0
161 GET_ITER
>> 162 FOR_ITER 68 (to 233)
165 STORE_FAST 4 (macro)
138 168 LOAD_FAST 1 (rep)
171 LOAD_CONST 10 ('macdef ')
174 BINARY_ADD
175 LOAD_FAST 4 (macro)
178 BINARY_ADD
179 LOAD_CONST 5 ('\n')
182 BINARY_ADD
183 STORE_FAST 1 (rep)
139 186 SETUP_LOOP 31 (to 220)
189 LOAD_FAST 0 (self)
192 LOAD_ATTR 3 (macros)
195 LOAD_FAST 4 (macro)
198 BINARY_SUBSCR
199 GET_ITER
>> 200 FOR_ITER 16 (to 219)
203 STORE_FAST 5 (line)
140 206 LOAD_FAST 1 (rep)
209 LOAD_FAST 5 (line)
212 BINARY_ADD
213 STORE_FAST 1 (rep)
216 JUMP_ABSOLUTE 200
>> 219 POP_BLOCK
141 >> 220 LOAD_FAST 1 (rep)
223 LOAD_CONST 5 ('\n')
226 BINARY_ADD
227 STORE_FAST 1 (rep)
230 JUMP_ABSOLUTE 162
>> 233 POP_BLOCK
142 >> 234 LOAD_FAST 1 (rep)
237 RETURN_VALUE
Here’s the original source code:
def __repr__(self):
    """Dump the class data in the format of a .netrc file."""
    rep = ""
    for host in self.hosts.keys():
        attrs = self.hosts[host]
        rep = rep + "machine "+ host + "\n\tlogin " + repr(attrs[0]) + "\n"
        if attrs[1]:
            rep = rep + "account " + repr(attrs[1])
        rep = rep + "\tpassword " + repr(attrs[2]) + "\n"
    for macro in self.macros.keys():
        rep = rep + "macdef " + macro + "\n"
        for line in self.macros[macro]:
            rep = rep + line
        rep = rep + "\n"
    return rep
I feel like these two functions are relatively typical Python code.
From a few arbitrary files including the above, I disassembled about 2650 bytecodes. The top ops are:
LOAD_FAST (630 bytecodes),
CALL_FUNCTION (242 bytecodes),
LOAD_GLOBAL (238 bytecodes, usually in order to call a global),
LOAD_ATTR (228 bytecodes, largely split between
self attributes (preceded by a LOAD_FAST of self) and
method calls (followed by a CALL_FUNCTION)),
LOAD_CONST (179 bytecodes),
STORE_FAST (150 bytecodes),
RETURN_VALUE (132 bytecodes),
POP_JUMP_IF_FALSE (95 bytecodes),
POP_TOP (89 bytecodes), and
COMPARE_OP (80 bytecodes).
These add up to 2063 bytecodes, about 78% of the total. COMPARE_OP makes it there because it’s a catchall for a variety of comparison operators, including exception-matching, ==, and is, but also in, >, and the like. Most other binary operators have their own bytecode, like [] (BINARY_SUBSCR, 30 occurrences) and + (BINARY_ADD, 26 occurrences).
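One way to reproduce this kind of tally mechanically, using Python 3’s dis.get_instructions and compiling a module’s source directly (so the opcode names will differ somewhat from the Python 2 listings above; recent CPythons say CALL rather than CALL_FUNCTION, for instance), with netrc as an arbitrary sample:

import collections
import dis
import netrc                        # an arbitrary stdlib module to sample

def all_code(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if hasattr(const, "co_code"):
            for sub in all_code(const):
                yield sub

counts = collections.Counter()
with open(netrc.__file__) as f:     # assumes a source (.py) installation
    module_code = compile(f.read(), netrc.__file__, "exec")
for code in all_code(module_code):
    counts.update(instr.opname for instr in dis.get_instructions(code))

for opname, n in counts.most_common(10):
    print(opname, n)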
I’ve tried a number of things, but so far I haven’t been able to find a way to get tighter code than with a stack machine. On average, a binary operation (two inputs and one output) on a stack machine has about one associated stack manipulation operation, which is typically between 5 and 8 bits.
Typical two-register code has a 3- or 4-bit field for each operand, so 6–8 bits, and sometimes needs an additional MOV instruction to keep one of the inputs from being clobbered; three-register code avoids that at the cost of a third operand field, which in extreme cases (like Lua bytecode) means 24 bits of operands per operation, but is more typically 12–18 bits.
The Mill architecture’s “belt” is a third alternative; like stack machines, the output of each instruction is implicit, but unlike them, inputs are not consumed, so operands must be explicit. I did some speculative analysis of variable-length encoding of a two-address belt code (see Golomb-coding operands as belt offsets likely won’t increase code density much) and found that in the code I looked at, the spans were very nearly geometrically distributed, so the near-optimal choice was Golomb coding with a bin size of 1, which reduces to unary coding. This works out to a bit over 5 bits of operand information per two-operand computational instruction (it would be 4 if the geometric distribution were exactly right), which is essentially the same code density as the stack code, just much more expensive to decode.
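To make the 4-bit figure concrete: if an operand refers to the immediately previous instruction’s output about half the time (as the next paragraph suggests) and the offsets fall off geometrically from there, unary coding averages about 2 bits per operand, hence about 4 bits per two-operand instruction. A quick sanity check, where the falloff rate p = 0.5 is an assumption:

# Expected unary (Golomb, bin size 1) code length under a geometric model:
# belt offset k occurs with probability p * (1 - p)**k and costs k + 1 bits.
p = 0.5
expected_bits = sum((k + 1) * p * (1 - p) ** k for k in range(200))
print(expected_bits)    # ~2.0 bits per operand, so ~4 bits per two-operand op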
(An encoding variation I haven’t explored in more detail: since about half of the operands are the output of the immediately previous instruction, make that implicit for two-operand instructions, and insert an extra copy instruction in the other cases. It probably won’t work to improve density further, but it might.)
The TRIPS or EDGE (“explicit data graph execution”) architecture is sort of a mirror image of the “belt”, where instruction inputs are implicit, but outputs are explicit, and point to other instructions in the same block. Presumably this works out about the same.
So I think that probably stack code is the best approach; it typically uses about 10–12 bits per binary computational operation, of which about 5–6 bits is operand information (in the form of stack manipulation operations), which is comparable to the other alternatives; I'm not yet convinced that anything is better. It’s faster to decode and interpret than variable-length instructions, though slower than fixed-width register-based instruction sets. And it easily accommodates operations that consume or produce unusual numbers of values.
The F18A core in the GreenArrays GA4 and GA144 chips has four 5-bit instructions per 18-bit memory word (when there are no jump targets; immediate literal constants are pushed on the stack by a @p instruction anywhere in the previous instruction word). This means two bits are missing from the last slot; the instruction encoding is designed so that among the 8 expressible instructions for this last slot are ; (return), . (nop), dup, +, and unext (which loops back to the beginning of the instruction word a number of times stored in the R register).
In this way, you can always NOP out the truncated instruction slot if you can’t do anything useful with it, but it can hold many of the most common instructions, particularly the ones most likely to be useful at the end of an instruction word. If you can fill it usefully ¾ of the time, you can fit 15 instructions into four 18-bit words, an average of 4.8 bits per instruction; and, as I suggested earlier, you typically need about two instructions per computational operation, giving 9.6 bits per computational operation, not counting jump targets.
You could use a similar approach with 32-bit words of bytecode on a 32-bit or 64-bit machine. If each opcode is 6 bits, for example, you can fit 5 whole ones and one 2-bit truncated one into a 32-bit word, or 4 whole ones and two 3-bit truncated ones.
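A sketch of the first packing, five 6-bit slots plus one 2-bit truncated slot per 32-bit word; the slot order and the four opcodes the truncated slot is allowed to hold are invented here for illustration:

SHORT_OPS = {"NOP": 0, "RET": 1, "DUP": 2, "ADD": 3}  # all a 2-bit slot can express

def pack_word(ops, short_op="NOP"):
    """Pack five full 6-bit opcodes and one truncated opcode into 32 bits."""
    assert len(ops) == 5 and all(0 <= op < 64 for op in ops)
    word = 0
    for op in ops:                  # five full slots, high bits first
        word = (word << 6) | op
    return (word << 2) | SHORT_OPS[short_op]

def unpack_word(word):
    ops = [(word >> shift) & 0b111111 for shift in (26, 20, 14, 8, 2)]
    return ops, word & 0b11

word = pack_word([7, 8, 9, 10, 11], "DUP")
assert unpack_word(word) == ([7, 8, 9, 10, 11], SHORT_OPS["DUP"])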
However, jumps will use up the rest of the instruction word; the __repr__ example code above contains 99 bytecodes, of which 4 were FOR_ITER or JUMP_FORWARD, for an average of about 24 bytecodes between jumps. This is probably roughly typical, or slightly high, so a substantial fraction of words will be taken up with such things.
I’m not counting method invocation as a jump (though it is on the F18A), because I propose to carry it out in a different way, as explained below under “Facets and Dispatch”.
For a Pythonish language, we really need some kind of dynamic dispatch that allows us to use the same bytecodes to operate on native integers as on, for example, rational numbers. One way to do this (the approach CPython takes) is to make integers heap-allocated objects too. But it really helps performance, not to mention reliability and avoiding timing data leaks, if ordinary integers don’t need to be stored in memory like that.
The standard approach to support runtime polymorphism between integers (held in registers) and pointers is to use one to four bits of the address word as type tags. On byte-addressed machines with four-byte alignment for objects, the lowest two bits of a valid object pointer are always 00, so you can set them to, for example, 01 to indicate integers. Then
64-bit machines with byte-addressed memory have