What’s the dumbest register allocator that might give you reasonable performance?

Kragen Javier Sitaker, 2016-10-11 (15 minutes)

Programs compiled with Ur-Scheme are about five times slower than similar programs compiled with GCC. TCC is broadly similar.

It occurred to me that a big part of Ur-Scheme's slowness is probably its complete lack of register allocation. It treats the i386 as a stack machine with the top of stack in %eax. And so it occurred to me that even the dumbest register allocator would probably provide a big speedup, maybe up to a slowdown of less than 2.

What's the dumbest register allocator that might give you reasonable performance? Well, on amd64, as I understand it, you have 16 general-purpose registers, but one of them is the stack pointer. You could statically allocate 8 of these registers to local variables (the first 8 local variables) and 7 of them to an expression evaluation stack which is empty in between statements, assuming you have statements and that control flow is a per-statement thing. Then you can allocate the slots on the expression evaluation stack at compile time and spill to memory if your evaluation gets too deep.

I feel like that would probably work well enough, and it would be very simple to implement.

On i386, you don't have as many registers. You could have 4 levels of expression evaluation and 3 local variables held in registers.

Context: how well do programs get optimized?

Here's a Karplus-Strong string synthesis program I wrote in C, something over a year ago, in golfed form:

enum{n=72};s[n]={1<<15},i;main(){void*o=popen("aplay","w");
for(i=0;i<1<<14;i++)fputc((s[i%n]=s[i%n]*.9+s[(i+1)%n]*.1)>>8,o);}

Here's a deobfuscated version of the main loop:

for (i=0; i < 1<<14; i++) {
    int j = i % n, k = (i+1) % n;
    s[j] = s[j]*.9 + s[k]*.1;
    fputc(s[j] >> 8, output);
}

n here is 72, and s starts out with just an impulse in it. (If you're not familiar with KS synthesis, the weighted averaging of two samples there serves as a gentle low-pass IIR filter, attenuating all frequencies but especially the highest frequencies, while feeding the signal back into the delay line after 72 samples limits the original harmonic content to harmonics of 8000 Hz / 72 = 111 Hz, and the result sounds exactly like the string of a six-string guitar.)

I feel like this is a relatively typical piece of C code in some ways. It has a for loop comparing against a constant expression that can be folded, some array accesses and a function call inside the loop, some floating point and integer math, some strength-reduction opportunities, and a local variable (or two, although i is technically a global).

So how badly does TCC do on this program?

First, what's our gold standard? It's actually pretty terrifying, both in terms of the optimizations achieved and the optimizations missed. On i386, here's a commented disassembly of the 48 instructions gcc 4.8.4 emits for the main loop with -O:

 804847d:   89 c6                   mov    %eax,%esi
 # This is the address of i; we are initializing it with 0.
 804847f:   c7 05 64 a1 04 08 00    movl   $0x0,0x804a164
 8048486:   00 00 00 
 # And we also have i in %edi at the top of the loop.
 8048489:   bf 00 00 00 00          mov    $0x0,%edi
 # This magic number is the approximate multiplicative inverse of
 # 72.  72 times this number is 0x10,0000,0008.  That means that if
 # you multiply some 32-bit integer X by 0x38e38e39, then the high
 # word of the 64-bit result will contain the number divided by 72,
 # shifted four bits to the left.  So, for example, 0x38e38e39L *
 # 146 >> 36 = 2L.  So we save it in %ebx so we can multiply by it
 # in order to divide.  Multiplying is usually much faster than
 # division.
 804848e:   bb 39 8e e3 38          mov    $0x38e38e39,%ebx
mainloop:
 8048493:   89 f8                   mov    %edi,%eax
 # Now we divide %eax by 72 by multiplying it by the magic number in
 # %ebx.  %eax is the implicit argument of a one-argument imul in
 # i386.  I guess this means i was in %edi and %eax.
 8048495:   f7 eb                   imul   %ebx
 # The high dword of the result is in %edx, so put it in %ecx.
 8048497:   89 d1                   mov    %edx,%ecx
 # Eliminate those 4 extra bits on its low end so %ecx contains %eax/72.
 8048499:   c1 f9 04                sar    $0x4,%ecx
 # Another copy of i goes into %eax.
 804849c:   89 f8                   mov    %edi,%eax
 # Now we shift it 31 bits to the right, which means we are saving
 # off just its sign bit, which is a little bit stupid because it's
 # statically fairly trivial that it's always positive.
 804849e:   c1 f8 1f                sar    $0x1f,%eax
 # Now we do some kind of negative-number correction to our quotient
 # in %ecx with this.  This amounts to a no-op.
 80484a1:   29 c1                   sub    %eax,%ecx
 # Now we magically multiply %ecx (i/72) by 9 and put the result in
 # %eax.
 80484a3:   8d 04 c9                lea    (%ecx,%ecx,8),%eax
 # Then we shift it three more fucking bits, so guess what, now
 # we've multiplied it by fucking 72.  GCC factored 72 into 8*(8+1)
 # in order to multiply it with two obscure instructions instead of
 # just using an imul.  Now %eax is going to contain (i/72)*72,
 # which is to say, i-(i%n).
 80484a6:   c1 e0 03                shl    $0x3,%eax
 # Now we get another copy of i into %ecx.
 80484a9:   89 f9                   mov    %edi,%ecx
 # And now at last we compute i%n in %ecx, which took 11
 # instructions.  This is the number I called "j" above.
 80484ab:   29 c1                   sub    %eax,%ecx
 # Now we load the integer at s[j] (s[i%n]) as a floating-point
 # number.
 80484ad:   db 04 8d 40 a0 04 08    fildl  0x804a040(,%ecx,4)
 # This isn't in the listing but I bet the constant 0.9 is stored
 # there.
 80484b4:   dc 0d d8 85 04 08       fmull  0x80485d8
 # Now we increment i, which makes a certain amount of sense because
 # now we have to compute s[k], and we aren't going to use i again.
 80484ba:   83 c7 01                add    $0x1,%edi
 # Oh boy, here we go again.  Compute k as (new i) % n in another 11
 # instructions, except now it's in %edi.
 80484bd:   89 f8                   mov    %edi,%eax
 80484bf:   f7 eb                   imul   %ebx
 80484c1:   c1 fa 04                sar    $0x4,%edx
 80484c4:   89 f8                   mov    %edi,%eax
 80484c6:   c1 f8 1f                sar    $0x1f,%eax
 80484c9:   29 c2                   sub    %eax,%edx
 80484cb:   8d 04 d2                lea    (%edx,%edx,8),%eax
 80484ce:   c1 e0 03                shl    $0x3,%eax
 80484d1:   29 c7                   sub    %eax,%edi
 # Compute the weighted sum, multiplying by a different constant in
 # memory, presumably 0.1.
 80484d3:   db 04 bd 40 a0 04 08    fildl  0x804a040(,%edi,4)
 80484da:   dc 0d e0 85 04 08       fmull  0x80485e0
 80484e0:   de c1                   faddp  %st,%st(1)
 # Truncate the result back to an int by way of storing it in the
 # stack frame and loading it into %eax.
 80484e2:   d9 7c 24 0e             fnstcw 0xe(%esp)
 80484e6:   0f b7 44 24 0e          movzwl 0xe(%esp),%eax
 # Okay, I have no idea what is going on here, except that it
 # results in storing the new value of s[j] at 0x8(%esp) and also in
 # %eax.  Somehow this involves overwriting the high byte of
 # some 16-bit number with 0xc!?
 80484eb:   b4 0c                   mov    $0xc,%ah
 80484ed:   66 89 44 24 0c          mov    %ax,0xc(%esp)
 80484f2:   d9 6c 24 0c             fldcw  0xc(%esp)
 80484f6:   db 5c 24 08             fistpl 0x8(%esp)
 80484fa:   d9 6c 24 0e             fldcw  0xe(%esp)
 80484fe:   8b 44 24 08             mov    0x8(%esp),%eax
 # Now we store the result into s[j]; %ecx still has j in it, and
 # 0x804a040 is the base address of s, as indicated by its use above
 # in the fildls.
 8048502:   89 04 8d 40 a0 04 08    mov    %eax,0x804a040(,%ecx,4)
 8048509:   89 74 24 04             mov    %esi,0x4(%esp)
 # Here we shift the s[j] result in %eax right by 8 bits and "push
 # it" in order to call fputc with it.
 804850d:   c1 f8 08                sar    $0x8,%eax
 8048510:   89 04 24                mov    %eax,(%esp)
 8048513:   e8 38 fe ff ff          call   8048350 <fputc@plt>
 # Now we ignore fputc's return value and load the value of i again
 # from memory and increment it, again.  fputc could have changed
 # it, after all, since it's a global variable.
 8048518:   a1 64 a1 04 08          mov    0x804a164,%eax
 804851d:   8d 78 01                lea    0x1(%eax),%edi
 # Save the incremented value to memory.  You never know, fputc
 # could be reading a global variable.
 8048520:   89 3d 64 a1 04 08       mov    %edi,0x804a164
 # Check to see if the loop is done.  It's interesting that the <
 # test got turned into a <= test.
 8048526:   81 ff ff 3f 00 00       cmp    $0x3fff,%edi
 804852c:   0f 8e 61 ff ff ff       jle    8048493 <main+0x36> (mainloop)

All right, so we can see GCC missed a lot of optimization opportunities there, although some of them were missed by the perverse choice to make i a global variable. But it did manage to compute i%72 without using any division instructions. It also manages to keep j and k in registers and avoid recomputing j --- common subexpression elimination. And the control flow is nice and clean, just a single conditional jump instruction at the end, because it can statically check that the loop will execute at least once.

How does TCC do? It's a bit messier but somewhat more straightforward, as you'd expect. tcc 0.9.25 emits 67 instructions:

# Initialize i to 0.  Note the unnecessary indirection through a
# register that isn't used afterwards.
     242:   b8 00 00 00 00          mov    $0x0,%eax
     247:   89 05 00 9f 04 08       mov    %eax,0x8049f00
looptest:
# Load i from memory and check it.  This is at the top of the loop
# so that it can run zero times if necessary.
     24d:   8b 05 00 9f 04 08       mov    0x8049f00,%eax
     253:   81 f8 00 40 00 00       cmp    $0x4000,%eax
     259:   0f 8d c8 00 00 00       jge    0x327 (exitloop)
     25f:   e9 11 00 00 00          jmp    0x275 (loopbody)
mainloop:
# This is "i++".  Neither %eax nor %ecx is used afterwards.  I have
# no idea why it makes a copy in %ecx.
     264:   8b 05 00 9f 04 08       mov    0x8049f00,%eax
     26a:   89 c1                   mov    %eax,%ecx
     26c:   40                      inc    %eax
     26d:   89 05 00 9f 04 08       mov    %eax,0x8049f00
     273:   eb d8                   jmp    0x24d (looptest)
loopbody:
# Load i from memory.
     275:   8b 05 00 9f 04 08       mov    0x8049f00,%eax
     27b:   b9 48 00 00 00          mov    $0x48,%ecx   # 0x48 == 72 == n
     280:   99                      cltd                # clears %edx?
# implicitly divides %edx:%eax by its operand; remainder is in %edx,
# quotient in %eax.
     281:   f7 f9                   idiv   %ecx         
# Shift the quotient left two bits to use it to index dwords.
     283:   c1 e2 02                shl    $0x2,%edx
# Load base address of s into %eax and index it.
     286:   b8 44 9d 04 08          mov    $0x8049d44,%eax
     28b:   01 d0                   add    %edx,%eax
# Save the pointer into a temporary in the stack frame.  This is
# where we are going to store the result of the weighted sum.
     28d:   89 45 f8                mov    %eax,-0x8(%ebp)
# Load i from memory again.  This is the common subexpression s[i%n]
# not being eliminated.
     290:   8b 05 00 9f 04 08       mov    0x8049f00,%eax
     296:   b9 48 00 00 00          mov    $0x48,%ecx
     29b:   99                      cltd   
     29c:   f7 f9                   idiv   %ecx
     29e:   c1 e2 02                shl    $0x2,%edx
     2a1:   b8 44 9d 04 08          mov    $0x8049d44,%eax
     2a6:   01 d0                   add    %edx,%eax
# Load from s[i%n] into %ecx, then into a floating-point register by
# way of the stack, since that's apparently how we do things.
     2a8:   8b 08                   mov    (%eax),%ecx
     2aa:   51                      push   %ecx
     2ab:   db 04 24                fildl  (%esp)
# Pop it back off.
     2ae:   83 c4 04                add    $0x4,%esp
# Store the floating-point version of s[j] in an on-stack temporary?
     2b1:   dd 55 f0                fstl   -0x10(%ebp)
# Not sure what's going on here.
     2b4:   dd d8                   fstp   %st(0)
     2b6:   dd 05 6c 9e 04 08       fldl   0x8049e6c
     2bc:   dc 4d f0                fmull  -0x10(%ebp)
# Okay, now we load i from memory again, increment it, and divide it
# to compute (i+1)%n.
     2bf:   8b 05 00 9f 04 08       mov    0x8049f00,%eax
     2c5:   40                      inc    %eax
     2c6:   b9 48 00 00 00          mov    $0x48,%ecx
     2cb:   99                      cltd   
     2cc:   f7 f9                   idiv   %ecx
# And then again we index off s.
     2ce:   c1 e2 02                shl    $0x2,%edx
     2d1:   b8 44 9d 04 08          mov    $0x8049d44,%eax
     2d6:   01 d0                   add    %edx,%eax
# Wait, what are we doing with the FPU?
     2d8:   dd 55 e8                fstl   -0x18(%ebp)
     2db:   dd d8                   fstp   %st(0)
# Okay, load s[k]
     2dd:   8b 08                   mov    (%eax),%ecx
     2df:   51                      push   %ecx
     2e0:   db 04 24                fildl  (%esp)
     2e3:   83 c4 04                add    $0x4,%esp
# I don't know what's going on here either.
     2e6:   dd 55 e0                fstl   -0x20(%ebp)  # ???
     2e9:   dd d8                   fstp   %st(0)       # ???
     2eb:   dd 05 74 9e 04 08       fldl   0x8049e74    # Maybe 0.1?
 # But this looks like the weighted sum.
     2f1:   dc 4d e0                fmull  -0x20(%ebp)
     2f4:   dc 45 e8                faddl  -0x18(%ebp)
     2f7:   d9 2d 7c 9e 04 08       fldcw  0x8049e7c  # ???
     2fd:   81 ec 04 00 00 00       sub    $0x4,%esp
 # Store the weighted-sum result at the stack pointer.
     303:   db 1c 24                fistpl (%esp)
     306:   d9 2d 7e 9e 04 08       fldcw  0x8049e7e  # ???
 # And now load it into %eax.
     30c:   58                      pop    %eax
# Load the pointer where we are going to save the result (stored at
# 0x28d above) into %ecx.
     30d:   8b 4d f8                mov    -0x8(%ebp),%ecx
# Save the result.
     310:   89 01                   mov    %eax,(%ecx)
# Shift the result before emitting it with fputc.  Note that we
# don't load it from memory again.
     312:   c1 f8 08                sar    $0x8,%eax
# Load o into %ecx to push it.
     315:   8b 4d fc                mov    -0x4(%ebp),%ecx
     318:   51                      push   %ecx
     319:   50                      push   %eax
     31a:   e8 f1 09 00 00          call   0xd10     # fputc
     31f:   83 c4 08                add    $0x8,%esp # pop fputc args
     322:   e9 3d ff ff ff          jmp    0x264 (mainloop)
exitloop:
     327:   c9                      leave  
     328:   c3                      ret

The only real optimization TCC managed here was to keep a value in a register, once, and constant-fold the loop bound. It didn't attempt common subexpression elimination, and its FPU code is relatively horrible.

Topics