Golomb-coding operands as belt offsets likely won’t increase code density much

Kragen Javier Sitaker, 2017-06-15 (updated 2017-06-20) (6 minutes)

Suppose we want a really compact format for representing computer programs. Historically speaking, stack-based representations tend to be the most compact, even though not all operands can be implicit; typically you have a ratio of about one stack manipulation to one computational operation. I thought maybe we could do better, with a construction like the Mill’s Belt, and so I did the quick rough analysis below to see what gains were likely; it turns out that you can roughly match 5-bit stack code with a variable-bit-length operand coding, but not clearly beat it.

The idea is that instead of referring to input values implicitly by their position on the stack, you refer to them by how long ago they were produced. At a given point in code with no control-flow join points (jump destinations that can be reached by more than one jump or by non-jumping control flow) each non-constant value was either provided as an input at entry, produced at a single lag into the past. Consider, for example, this piece of assembly code, which copies a byte from stdin to stdout under Linux:

_start: mov $__NR_read, %eax        # literal constant  #0
        mov $stdin, %ebx            # literal constant  #1
        mov %esp, %ecx              # use input #-1
        mov $1, %edx                # literal constant  #2
        int $0x80                   # operation: system call with four args #3
                                    # inputs are #0, #1, #-1, and #2
                                    #            (.-3, .-2, .-4, and .-1)
        test %eax, %eax             # result of operation #5 (.-1) #4
        jz exit                     # jump conditional on result of #4 (.-1)
        mov $__NR_write, %eax       # literal constant  #5
        mov $stdout, %ebx           # #6
        mov %esp, %ecx              # input #-1
        mov $1, %edx                # #7
        int $0x80                   # #8 (using #5 (.-3), #6 (.-2), #-1 (.-9), #7 (.-1))

Here we have ten references to non-constant values that are inputs to four subsequent operations (two system calls, a comparison, and a conditional jump). Some of the non-constant values are the result of a previous operation; others are references to the value of %esp at the entry to this code, some time in the past; and others are references to previously introduced constant values. The offsets into the past at which these references occur are distributed as follows:

.-1 4
.-2 2
.-3 2
.-4 1
.-9 1

Here’s another example, a complete subroutine on amd64:

  20 0000 53            pushq   %rbx           # (subroutine prologue)
  24 0001 488B1F        movq    (%rdi), %rbx   # memory fetch from first input #-1 (.-1)   #0
  25 0004 31C0          xorl    %eax, %eax     # literal constant 0                        #1
  26 0006 48C1EB20      shrq    $32, %rbx      # >> with literal constant and #0 (.-2)     #2
  27 000a 83C301        addl    $1, %ebx       # + with literal constant and #2 (.-1)      #3
  30 000d 0FB6CB        movzbl  %bl, %ecx      # zero-extend #3 (.-1)                      #4
  31 0010 4839D1        cmpq    %rdx, %rcx     # compare #4 (.-1) to third input #-3 (.-8) #5
  32 0013 760B          jbe .L7            # conditional jump on comparison #5 (.-1)
  34 0015 5B            popq    %rbx           # (subroutine epilogue)
  38 0016 C3            ret
  40 0017 660F1F84      .p2align 4,,10         # (no-op)
  40      00000000 
  40      00
  42                .L7:
  44 0020 4889F0        movq    %rsi, %rax     # (register manipulation)
  49 0023 488D14CD      leaq    0(,%rcx,8), %rdx  # × with literal constant 8 and #4 (.-2) #6
  49      00000000 
  51 002b 4889FE        movq    %rdi, %rsi     # use first input #-1 (.-8)
  53 002e 4889C7        movq    %rax, %rdi     # use second input #-2 (.-9)
  55 0031 E8000000      call    memcpy         # invoke 3-input op; result unused          #7
  55      00
  60 0036 0FB6C3        movzbl  %bl, %eax      # zero-extend #3 (.-5) again (reproduce #4) #8
  62 0039 5B            popq    %rbx           # (subroutine epilogue, returning last value)
  65 003a C3            ret

Here we see the following distribution of offsets into the past for value references:

.-1 5
.-2 2
.-5 1
.-8 1
.-9 1

If we sum both tables, we get this:

.-1 9
.-2 4
.-3 2
.-4 1
.-5 1
.-8 1
.-9 2
total 20

So, what probability distribution are these spans drawn from?

This gives a first impression that the probability of a reference going N steps back declines exponentially in N, which is to say, the span is geometrically distributed. The optimal encoding for geometrically encoded data is Golomb coding, which reduces to unary coding in the case where the exponential parameter is 2, which is pretty much what it is here.

If we encode those 20 spans in unary, they consume (+ 9 (* 2 4) (* 3 2) 4 5 8 9 9) = 58 bits, an average of 2.9 bits per operand; this is pretty close to the same density we would get on MuP21 or GreenArrays stack code with its 5-bit operations. This leads me to believe that 5-bit stack operations are already quite close to the optimal density for operand encoding.

(If we assume that the real distribution is exactly the one the encoding is optimized for, we would get exactly 2 bits per operand on average. But it’s probably not.)

However, bytecode normally uses a whole byte per operation. To get such slightly-under-3-bits-per-operand density in stack code, you definitely need your stack manipulation operations to occupy less than 8 bits, because that will get you about 4 bits per operand.

(You would have to elaborate the model a bit to handle merging control-flow, such as at the tops of loops and after conditionals, and to handle functions, and I don’t think this will be difficult, but the likely gains are not large enough to motivate me.)

A possible way to rescue this: suppose that for two-operand instructions, one of the operands is always implicitly the previous result, and you use a separate one-operand “belt manipulation” copy instruction in the rare cases where that isn’t what you want. If these copy instructions are uncommon enough, this could get your average well below the 5.8 bits of operands per computational instruction implied by the above, maybe down to 3 or below.

Topics