My very first toddling steps in ARM assembly language

Kragen Javier Sitaker, 2019-12-10 (updated 2019-12-13) (46 minutes)

It’s long past time I learned to write a little bit of ARM assembly and machine code! So I spent two hours and was able to get hello-world running, and then a few more hours and learned a bunch of other things.

Basic tools

As it happens, although this laptop is an amd64, it has cross-compilation and transparent CPU emulation stuff installed, so this works:

$ cat hello.c
#include <stdio.h>

int main() { printf("hello, world\n"); return 0; }
$ arm-linux-gnueabihf-gcc-5 -static hello.c -o hello.arm
$ ./hello.arm 
hello, world
$

Specifically I have these Linux Mint packages installed:

gcc-5-arm-linux-gnueabihf
gcc-5-arm-linux-gnueabihf-base (its prerequisite)
binutils-arm-linux-gnueabihf
libc6-armhf-cross
qemu-user
qemu-user-binfmt

(Warning: much of the following is quoted from glibc, a copyrighted work licensed under the GNU Lesser General Public License.)

Although I do have libc6-armhf-cross installed, running dynamic executables does not work, so I link statically. That pulls a large part of glibc into every executable, which makes disassembly kind of a pain:

$ arm-linux-gnueabihf-objdump -d !$
arm-linux-gnueabihf-objdump -d ./a.out

./a.out:     file format elf32-littlearm


Disassembly of section .init:

00010160 <_init>:
   10160:   e92d4008    push    {r3, lr}
   10164:   eb000092    bl  103b4 <call_weak_fn>
   10168:   e8bd8008    pop {r3, pc}

Disassembly of section .iplt:

0001016c <.iplt>:
   1016c:   4778        bx  pc
   1016e:   46c0        nop         ; (mov r8, r8)
   10170:   e28fc600    add ip, pc, #0, 12
   10174:   e28cca68    add ip, ip, #104, 20    ; 0x68000
   10178:   e5bcfe94    ldr pc, [ip, #3732]!    ; 0xe94

Disassembly of section .text:

00010180 <backtrace_and_maps>:
   10180:   2801        cmp r0, #1
   10182:   f340 8084   ble.w   1028e <backtrace_and_maps+0x10e>
   10186:   2900        cmp r1, #0
(97,167 more lines follow)
$

My first steps

We can see that this is Thumb-2 machine code, with some instructions 16-bit and others 32-bit.

Buried in there is the way system calls work on ARM Linux:

00010aa0 <__libc_do_syscall>:
   10aa0:   b580        push    {r7, lr}
   10aa2:   4667        mov r7, ip
   10aa4:   df00        svc 0
   10aa6:   bd80        pop {r7, pc}

At a guess, r7 selects the system call.
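The guess about r7 can be cross-checked from C, because glibc's syscall() wrapper takes the system call number as an ordinary argument and does the register shuffling itself. This is only a minimal sketch: raw_getpid is a name made up for illustration, and getpid is chosen just because it is harmless to call. On ARM EABI Linux the wrapper should end up with the number in r7 before executing svc 0, but nothing in the C code depends on that.

```c
#include <sys/syscall.h>  /* SYS_getpid */
#include <unistd.h>       /* syscall(), getpid() */

/* Ask the kernel for our process ID by system call number instead of
   going through the getpid() wrapper.  (raw_getpid is a hypothetical
   helper name, not a libc function.) */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Calling raw_getpid() and getpid() in the same process should give the same number on any Linux architecture; only the numeric value of SYS_getpid differs between them.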

And exit(2):

00020648 <_exit>:
   20648:   b500        push    {lr}
   2064a:   4603        mov r3, r0
   2064c:   f04f 0cf8   mov.w   ip, #248    ; 0xf8
   20650:   f7f0 fa26   bl  10aa0 <__libc_do_syscall>
   20654:   f510 5f80   cmn.w   r0, #4096   ; 0x1000
   20658:   d810        bhi.n   2067c <_exit+0x34>
   2065a:   4618        mov r0, r3
   2065c:   f04f 0c01   mov.w   ip, #1
   20660:   f7f0 fa1e   bl  10aa0 <__libc_do_syscall>
   20664:   f510 5f80   cmn.w   r0, #4096   ; 0x1000
   20668:   d800        bhi.n   2066c <_exit+0x24>
   2066a:   deff        udf #255    ; 0xff
   2066c:   4b07        ldr r3, [pc, #28]   ; (2068c <_exit+0x44>)
   2066e:   ee1d 2f70   mrc 15, 0, r2, cr13, cr0, {3}
   20672:   4240        negs    r0, r0
   20674:   447b        add r3, pc
   20676:   681b        ldr r3, [r3, #0]
   20678:   50d0        str r0, [r2, r3]
   2067a:   deff        udf #255    ; 0xff
   2067c:   4a04        ldr r2, [pc, #16]   ; (20690 <_exit+0x48>)
   2067e:   ee1d 1f70   mrc 15, 0, r1, cr13, cr0, {3}
   20682:   4240        negs    r0, r0
   20684:   447a        add r2, pc
   20686:   6812        ldr r2, [r2, #0]
   20688:   5088        str r0, [r1, r2]
   2068a:   e7e6        b.n 2065a <_exit+0x12>
   2068c:   000589cc    .word   0x000589cc
   20690:   000589bc    .word   0x000589bc

I’m guessing that this is the actual exiting part:

   2065c:   f04f 0c01   mov.w   ip, #1
   20660:   f7f0 fa1e   bl  10aa0 <__libc_do_syscall>

So I tried putting this in a file and compiling it:

        ; Attempt to write an ARM assembly program that exits
        ; successfully.
main:   
        mov.w r7, #1
        svc 0
loop:   b.n loop

But it seems like that is completely the wrong syntax. So I asked GCC for an assembly listing:

$ arm-linux-gnueabihf-gcc-5 -static -Wa,-adhlns=hello.lst hello.c

It obliged:

   1                    .arch armv7-a
   2                    .eabi_attribute 28, 1
   3                    .fpu vfpv3-d16
   4                    .eabi_attribute 20, 1
   5                    .eabi_attribute 21, 1
   6                    .eabi_attribute 23, 3
   7                    .eabi_attribute 24, 1
   8                    .eabi_attribute 25, 1
   9                    .eabi_attribute 26, 2
  10                    .eabi_attribute 30, 6
  11                    .eabi_attribute 34, 1
  12                    .eabi_attribute 18, 4
  13                    .file   "hello.c"
  14                    .section    .rodata
  15                    .align  2
  16                .LC0:
  17 0000 68656C6C      .ascii  "hello, world\000"
  17      6F2C2077 
  17      6F726C64 
  17      00
  18                    .text
  19                    .align  2
  20                    .global main
  21                    .syntax unified
  22                    .thumb
  23                    .thumb_func
  25                main:
  26                    @ args = 0, pretend = 0, frame = 0
  27                    @ frame_needed = 1, uses_anonymous_args = 0
  28 0000 80B5          push    {r7, lr}
  29 0002 00AF          add r7, sp, #0
  30 0004 40F20000      movw    r0, #:lower16:.LC0
  31 0008 C0F20000      movt    r0, #:upper16:.LC0
  32 000c FFF7FEFF      bl  puts
  33 0010 0023          movs    r3, #0
  34 0012 1846          mov r0, r3
  35 0014 80BD          pop {r7, pc}
  37                    .ident  "GCC: (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609"
  38 0016 00BF          .section    .note.GNU-stack,"",%progbits
DEFINED SYMBOLS
                            *ABS*:0000000000000000 hello.c
     /tmp/cc20XqWG.s:15     .rodata:0000000000000000 $d
     /tmp/cc20XqWG.s:16     .rodata:0000000000000000 .LC0
     /tmp/cc20XqWG.s:25     .text:0000000000000000 main
     /tmp/cc20XqWG.s:28     .text:0000000000000000 $t

UNDEFINED SYMBOLS
puts

Evidently r0 contains the argument for puts and also the return value for main.

Aping the syntax therein, I tried this:

        @ Attempt to write an ARM assembly program that exits
        @ successfully.
        .arch armv7-a
        .syntax unified
        .thumb
        .globl main
main:   
        mov.w r7, #1
        svc 0
loop:   b.n loop

That does build successfully; arm-linux-gnueabihf-objdump on the resulting executable suggests that the requested instructions were emitted:

(322 lines omitted)
0001049c <main>:
   1049c:   f04f 0701   mov.w   r7, #1
   104a0:   df00        svc 0

000104a2 <loop>:
   104a2:   e7fe        b.n 104a2 <loop>
(87108 lines omitted)

However, upon execution, the program segfaults. So I guessed wrong about something, but without a debugger or a way to print things it’s hard to tell what.

Let’s try building a program that exits successfully with GCC:

$ cat return42.c
int main(int argc, char **argv)
{
  return 42;
}
$ make return42
cc -Wall -Werror -std=gnu99    return42.c   -o return42
$ ./return42 
$ echo $?
42
$ arm-linux-gnueabihf-gcc-5 -static -Wa,-adhlns=return42.lst return42.c
$ cat return42.lst 
...
  21                main:
  22                    @ args = 0, pretend = 0, frame = 8
  23                    @ frame_needed = 1, uses_anonymous_args = 0
  24                    @ link register save eliminated.
  25 0000 80B4          push    {r7}
  26 0002 83B0          sub sp, sp, #12
  27 0004 00AF          add r7, sp, #0
  28 0006 7860          str r0, [r7, #4]
  29 0008 3960          str r1, [r7]
  30 000a 2A23          movs    r3, #42
  31 000c 1846          mov r0, r3
  32 000e 0C37          adds    r7, r7, #12
  33 0010 BD46          mov sp, r7
  34                    @ sp needed
  35 0012 5DF8047B      ldr r7, [sp], #4
  36 0016 7047          bx  lr
...

Maybe bx is “branch indirect”. Also maybe I should use some optimization:

  21                main:
  25 0000 2A20          movs    r0, #42
  26 0002 7047          bx  lr

That’s more like it. Can I get that to build?

$ cat return42-arm.s
        @ Attempt to write an ARM assembly program that exits
        @ successfully.
        .arch armv7-a
        .syntax unified
        .thumb
        .globl main
main:   
        mov.w r0, #42
        bx lr
loop:   b.n loop
$ arm-linux-gnueabihf-gcc-5 -static return42-arm.s
$ file a.out
a.out: ELF 32-bit LSB executable, ARM, EABI5 version 1 (GNU/Linux), statically linked, for GNU/Linux 3.2.0, BuildID[sha1]=6ddb42d20b6cff668f5c6ded33b82eeda0e3bec3, not stripped
$ ./a.out
qemu: uncaught target signal 4 (Illegal instruction) - core dumped
Illegal instruction
$

Hmm, that’s not what I was hoping for. Maybe some of the other assembly directives are needed to build a runnable executable?

        @ An ARM assembly program that exits successfully.
        .arch armv7-a
        .syntax unified
        .thumb
        .thumb_func
        .globl main
main:   
        mov.w r0, #42
        bx lr

It turned out to be .thumb_func. The Gas manual explains, “This directive specifies that the following symbol is the name of a Thumb encoded function. This information is necessary in order to allow the assembler and linker to generate correct code for interworking [sic] between Arm and Thumb instructions and should be used even if interworking is not going to be performed. The presence of this directive also implies ‘.thumb’. This directive is not necessary when generating EABI objects. On these targets the encoding is implicit when generating Thumb code.”

(The manual is apparently wrong about it not being necessary when generating EABI objects.)

A little more perusing of the manual allows me to reduce this to the following:

        @ An ARM assembly program that exits successfully.
        .arch armv7-a
        .syntax unified
        .thumb_func
        .globl main
main:   mov.w r0, $42
        bx lr

Adding this line before the return instruction converts the program into an infinite loop, as expected:

loop:   b.n loop

This program runs, prints “hello, world” as hoped for, and exits:

        .globl main
main:   push {lr}
        movw r0, #:lower16:hi
        movt r0, #:upper16:hi
        bl puts
        mov r0, $0
        pop {pc}
hi:     .ascii "hello, world\0"

The $ immediate prefix doesn’t work for the half-symbols (#:lower16: and #:upper16:), which need #. Note that there is no .thumb_func here, and consequently the assembler generates non-Thumb (ARM) code!

0001049c <main>:
   1049c:   e52de004    push    {lr}        ; (str lr, [sp, #-4]!)
   104a0:   e30004b4    movw    r0, #1204   ; 0x4b4
   104a4:   e3400001    movt    r0, #1
   104a8:   fa00120e    blx 14ce8 <_IO_puts>
   104ac:   e3a00000    mov r0, #0
   104b0:   e49df004    pop {pc}        ; (ldr pc, [sp], #4)

Sticking .thumb_func back in there corrects this:

0001049c <main>:
   1049c:   b500        push    {lr}
   1049e:   f240 40ae   movw    r0, #1198   ; 0x4ae
   104a2:   f2c0 0001   movt    r0, #1
   104a6:   f004 fc1f   bl  14ce8 <_IO_puts>
   104aa:   2000        movs    r0, #0
   104ac:   bd00        pop {pc}

Okay, let’s try something with real computation:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  if (atoi(argv[1]) == 37) printf("whoa\n");
  return 0;
}

This compiles to more or less the following:

  21                main:
  24 0000 08B5          push    {r3, lr}
  25 0002 4868          ldr r0, [r1, #4]
  26 0004 0A22          movs    r2, #10
  27 0006 0021          movs    r1, #0
  28 0008 FFF7FEFF      bl  strtol
  29 000c 2528          cmp r0, #37
  30 000e 05D1          bne .L2
  31 0010 40F20000      movw    r0, #:lower16:.LC0
  32 0014 C0F20000      movt    r0, #:upper16:.LC0
  33 0018 FFF7FEFF      bl  puts
  34                .L2:
  35 001c 0020          movs    r0, #0
  36 001e 08BD          pop {r3, pc}
  38                    .section    .rodata.str1.4,"aMS",%progbits,1
  39                    .align  2
  40                .LC0:
  41 0000 77686F61      .ascii  "whoa\000"
  41      00

So it looks like it’s calling strtol(argv[1], 0, 10), passing the args in r0, r1, and r2, and getting the result in r0. Why it’s saving r3 I have no idea. I’m guessing the ldr r0, [r1, #4] syntax is for indexing 4 bytes off r1 and loading the result into register r0. The rest is the same.
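For comparison, here is the call the optimizer seems to have substituted for atoi, written out in C. It is a sketch under the assumption that the listing means what it appears to mean; parse_decimal is a made-up name. As for r3, my unconfirmed guess is that it is pushed only as padding, so that sp stays 8-byte aligned across the calls.

```c
#include <stdlib.h>  /* strtol */

/* What atoi(s) appears to have been compiled into: strtol(s, NULL, 10).
   On ARM the three arguments would travel in r0, r1 and r2, and the
   result would come back in r0.  (parse_decimal is a hypothetical name.) */
int parse_decimal(const char *s)
{
    return (int)strtol(s, NULL, 10);
}
```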

Does it use this same register-passing convention for varargs functions? Let’s see:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int n = atoi(argv[1]);
  if (n != 37) printf("whoa %d\n", n);
  return 0;
}

  21                main:
  22                    @ args = 0, pretend = 0, frame = 0
  23                    @ frame_needed = 0, uses_anonymous_args = 0
  24 0000 08B5          push    {r3, lr}
  25 0002 4868          ldr r0, [r1, #4]
  26 0004 0A22          movs    r2, #10
  27 0006 0021          movs    r1, #0
  28 0008 FFF7FEFF      bl  strtol
  29 000c 2528          cmp r0, #37
  30 000e 07D0          beq .L2
  31 0010 0246          mov r2, r0
  32 0012 40F20001      movw    r1, #:lower16:.LC0
  33 0016 C0F20001      movt    r1, #:upper16:.LC0
  34 001a 0120          movs    r0, #1
  35 001c FFF7FEFF      bl  __printf_chk
  36                .L2:
  37 0020 0020          movs    r0, #0
  38 0022 08BD          pop {r3, pc}
  40                    .section    .rodata.str1.4,"aMS",%progbits,1
  41                    .align  2
  42                .LC0:
  43 0000 77686F61      .ascii  "whoa %d\012\000"
  43      2025640A 
  43      00

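This looks pretty similar, but it is passing the format argument in r1, the int argument in r2, and a constant 1 in r0, to a function called __printf_chk. At first I took the 1 for an argument count, but glibc declares __printf_chk(int flag, const char *format, ...), so it is presumably a checking-level flag, with the compiler substituting __printf_chk for printf because fortification is enabled. I can ape this pretty well: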

$ cat you-arm.s
        @ Simple ARM assembly program to say "hello, Fred" when run with "Fred"
        .globl main
        .thumb_func
main:   push {r3, lr}
        ldr r2, [r1, #4]    @ argv[1]
        mov r0, $1
        movw r1, #:lower16:hi
        movt r1, #:upper16:hi
        bl __printf_chk
        mov r0, $0
        pop {r3, pc}
hi:     .ascii "hello, %s\n\0"
$ arm-linux-gnueabihf-gcc-5 -static you-arm.s
$ ./a.out Fred
hello, Fred

So based on the above, I can do the dumb Fibonacci benchmark program:

        @ Simple ARM assembly program to compute dumb Fibonacci
        .thumb_func
fib:    push {r3, lr}
        cmp r0, $0
        beq basecase
        cmp r0, $1
        beq basecase
        push {r0}
        sub r0, r0, $1
        bl fib
        mov r1, r0
        pop {r0}
        push {r1}
        sub r0, r0, $2
        bl fib
        pop {r1}
        add r0, r1, r0
        pop {r3, pc}
basecase:
        mov r0, $1
        pop {r3, pc}


        .globl main
        .thumb_func
main:   push {r3, lr}
        ldr r0, [r1, #4]    @ argv[1]
        mov r1, $0
        mov r2, $10
        bl strtol
        bl fib
        mov r2, r0
        mov r0, $1
        movw r1, #:lower16:hi
        movt r1, #:upper16:hi
        bl __printf_chk
        mov r0, $0
        pop {r3, pc}
hi:     .ascii "fib = %d\n\0"

This comes out as the following:

0001049c <fib>:
   1049c:   b508        push    {r3, lr}
   1049e:   2800        cmp r0, #0
   104a0:   d00e        beq.n   104c0 <basecase>
   104a2:   2801        cmp r0, #1
   104a4:   d00c        beq.n   104c0 <basecase>
   104a6:   b401        push    {r0}
   104a8:   3801        subs    r0, #1
   104aa:   f7ff fff7   bl  1049c <fib>
   104ae:   1c01        adds    r1, r0, #0
   104b0:   bc01        pop {r0}
   104b2:   b402        push    {r1}
   104b4:   3802        subs    r0, #2
   104b6:   f7ff fff1   bl  1049c <fib>
   104ba:   bc02        pop {r1}
   104bc:   1808        adds    r0, r1, r0
   104be:   bd08        pop {r3, pc}

000104c0 <basecase>:
   104c0:   2001        movs    r0, #1
   104c2:   bd08        pop {r3, pc}

000104c4 <main>:
   104c4:   b508        push    {r3, lr}
   104c6:   6848        ldr r0, [r1, #4]
   104c8:   2100        movs    r1, #0
   104ca:   220a        movs    r2, #10
   104cc:   f003 ff9e   bl  1440c <__strtol>
   104d0:   f7ff ffe4   bl  1049c <fib>
   104d4:   1c02        adds    r2, r0, #0
   104d6:   2001        movs    r0, #1
   104d8:   f240 41e8   movw    r1, #1256   ; 0x4e8
   104dc:   f2c0 0101   movt    r1, #1
   104e0:   f012 fa88   bl  229f4 <___printf_chk>
   104e4:   2000        movs    r0, #0
   104e6:   bd08        pop {r3, pc}

So I guess I can say that’s the first program I’ve written in ARM assembly, since the others were mostly just slight modifications of GCC output. I’m still cargo-culting the saving of r3, and I probably should use a less-than comparison rather than two equal-to comparisons, and I don’t know what order registers get pushed.

It segfaults if you feed it -1, and I think maybe this system is configured with apport to send the core dumps to Ubuntu or something.
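Here is the same dumb Fibonacci in C, with the less-than comparison I said I should probably use. It is a sketch of the intended behavior, not a translation of the generated code. My guess is that the segfault on -1 is runaway recursion: neither equality test ever matches a negative argument, so the stack fills up. With n < 2 as the base case, negative inputs return 1 immediately.

```c
/* Dumb Fibonacci with fib(0) = fib(1) = 1, matching the assembly above.
   A single less-than test replaces the two equality tests and also
   terminates the recursion for negative n. */
int fib(int n)
{
    if (n < 2)
        return 1;
    return fib(n - 1) + fib(n - 2);
}
```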

A minimal -nostdlib program in ARM assembly

And here’s a program that successfully invokes _exit via the SVC instruction instead of using the standard library, and thus can be linked with -nostdlib and doesn’t make a humongous executable:

        @ Attempt to write an ARM assembly program that exits
        @ successfully with -nostdlib.  cf. return42.c.
        .syntax unified
        .thumb_func
        .globl _start
_start: mov r7, #1   @ system call 1: _exit
        mov r0, #42  @ exit return value?
        svc 0
loop:   b.n loop

This produces a reasonable disassembly:

$ arm-linux-gnueabihf-gcc-5 -static -nostdlib goodbyearm.s
$ ./a.out
$ echo $?
42
$ arm-linux-gnueabihf-objdump -d a.out

a.out:     file format elf32-littlearm


Disassembly of section .text:

00010098 <_start>:
   10098:   f04f 0701   mov.w   r7, #1
   1009c:   f04f 002a   mov.w   r0, #42 ; 0x2a
   100a0:   df00        svc 0

000100a2 <loop>:
   100a2:   e7fe        b.n 100a2 <loop>
$

So, that took a couple of hours to figure out, but it did eventually work.

Machine instructions seen thus far

Destination register always comes first.

svc: supervisor call; in Linux we use svc 0 with the system call number in r7.
b.n: branch always
beq, bne: branch if equal or not equal
bx: "branch and exchange" (not necessarily indirect; see below)
bl: branch and link (i.e., call)
push, pop: take sets of registers; can push lr and pop pc. Not sure how order is determined yet.
mov: can load an immediate constant into a register or copy register to register
movt: sets upper 16 bits of register to immediate constant
movw: sets register to 16-bit immediate constant, zeroing the upper 16 bits
ldr, str: load or store registers to memory, supporting index-offset and I think decrement addressing modes
cmp: can compare registers to immediate constants
sub/subs, add/adds: can add or subtract registers, immediate constants, or both
nop: nop.
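The movw/movt pair is easier to see as arithmetic. The following C model is a sketch of my understanding of the two instructions (movw writes a zero-extended 16-bit constant, movt replaces only the upper half); the function names simply mirror the mnemonics, and load_const32 is a made-up helper.

```c
#include <stdint.h>

/* movw rd, #imm16: rd becomes imm16 with the upper half cleared. */
uint32_t movw(uint16_t imm16)
{
    return imm16;
}

/* movt rd, #imm16: replace the upper half of rd, keep the lower half. */
uint32_t movt(uint32_t rd, uint16_t imm16)
{
    return (rd & 0xFFFFu) | ((uint32_t)imm16 << 16);
}

/* Build any 32-bit constant the way the movw/movt pairs above do.
   (load_const32 is a hypothetical helper, not an instruction.) */
uint32_t load_const32(uint32_t value)
{
    uint32_t r = movw((uint16_t)(value & 0xFFFFu));
    return movt(r, (uint16_t)(value >> 16));
}
```

With the operands from the non-Thumb disassembly earlier, movt(movw(0x4b4), 1) gives 0x104b4, which is the address just past the end of that main, where the string ought to be.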

Still mysterious: cmn.w, bhi.n, udf, mrc, negs, the whole .w and .n and “s” suffix thing, and what is this “ip” register?

Hmm, the Gas manual actually explains the "s" suffix: that means to set the flags. So presumably "add" just does an addition, while "adds" does an addition and also sets carry flags and whatnot.
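To make "sets the flags" concrete, here is a small C model of what I understand adds to compute alongside the sum. It is a sketch, not taken from any ARM document quoted here; adds_flags is a made-up name, and packing the flags as N, Z, C, V from high bit to low is just a convenient convention for the example.

```c
#include <stdint.h>

/* Flags that "adds rd, a, b" would set, packed as a 4-bit value:
   bit 3 = N (result negative), bit 2 = Z (result zero),
   bit 1 = C (unsigned carry out), bit 0 = V (signed overflow). */
int adds_flags(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;                       /* wraps modulo 2^32 */
    int n = (int)((sum >> 31) & 1u);
    int z = (sum == 0);
    int c = (sum < a);                          /* wrapped around */
    int v = (int)(((~(a ^ b) & (a ^ sum)) >> 31) & 1u);
    return (n << 3) | (z << 2) | (c << 1) | v;
}
```

A plain add would produce the same sum and leave all four flags as they were.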

write(2), and a -nostdlib hello, world

If we want to get output without stdlib, we need to be able to invoke the SVC for write(2); looks like maybe that’s the system call with r7=4:

00021180 <__libc_write>:
   21180:   f8df c04a   ldr.w   ip, [pc, #74]   ; 211ce <__libc_write+0x4e>
   21184:   44fc        add ip, pc
   21186:   f8dc c000   ldr.w   ip, [ip]
   2118a:   f09c 0f00   teq ip, #0
   2118e:   b480        push    {r7}
   21190:   d108        bne.n   211a4 <__libc_write+0x24>
   21192:   2704        movs    r7, #4
   21194:   df00        svc 0
   21196:   bc80        pop {r7}
   21198:   f510 5f80   cmn.w   r0, #4096   ; 0x1000
   2119c:   bf38        it  cc
   2119e:   4770        bxcc    lr
   211a0:   f002 bb4e   b.w 23840 <__syscall_error>
   211a4:   b50f        push    {r0, r1, r2, r3, lr}
   211a6:   f001 f9d9   bl  2255c <__libc_enable_asynccancel>
   211aa:   4684        mov ip, r0
   211ac:   bc0f        pop {r0, r1, r2, r3}
   211ae:   2704        movs    r7, #4
   211b0:   df00        svc 0
   211b2:   4607        mov r7, r0
   211b4:   4660        mov r0, ip
   211b6:   f001 fa15   bl  225e4 <__libc_disable_asynccancel>
   211ba:   4638        mov r0, r7
   211bc:   f85d eb04   ldr.w   lr, [sp], #4
   211c0:   bc80        pop {r7}
   211c2:   f510 5f80   cmn.w   r0, #4096   ; 0x1000
   211c6:   bf38        it  cc
   211c8:   4770        bxcc    lr
   211ca:   f002 bb39   b.w 23840 <__syscall_error>
   211ce:   9d40        .short  0x9d40
   211d0:   bf000005    .word   0xbf000005

I don’t know what all the extra stuff is in there for but presumably it covers up some impedance mismatch between the Linux system call and what the standard library behavior is supposed to be.
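One piece of the extra stuff I think I can account for is the cmn.w r0, #4096 that follows each svc. Linux system calls report failure by returning a small negative number, -errno, in the range -4095 to -1. The cmn/bhi pair in _exit looks like a test for exactly that range, and the negs after it would then recover the errno value. The write wrapper does a similar check using it cc and bxcc. A C model of the bhi variant, with made-up function names:

```c
#include <stdint.h>

/* Model of "cmn.w r0, #4096" followed by "bhi": taken when r0, viewed
   as unsigned, is above 0xFFFFF000, i.e. when it is in -4095..-1. */
int is_syscall_error(uint32_t r0)
{
    return r0 > 0xFFFFF000u;
}

/* Model of the "negs r0, r0" on the error path: the positive errno
   value, or 0 if the raw return value was not an error. */
int errno_from_raw(uint32_t r0)
{
    return is_syscall_error(r0) ? (int)(0u - r0) : 0;
}
```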

On this basis I achieved a stdlibless hello, world:

$ cat hellobarearm.s
        @ Attempt to write an ARM assembly program that hellos
        @ successfully with -nostdlib.  cf. goodbyearm.s
        .syntax unified
        .thumb_func
        .globl _start
_start: mov r7, #4   @ system call 4: write
        mov r0, #0
        movw r1, #:lower16:hello
        movt r1, #:upper16:hello
        mov r2, #(helloend - hello)
        svc 0

        mov r7, #1   @ system call 1: _exit
        mov r0, #0   @ exit return value
        svc 0
hello:  .ascii "hello, world\n"
helloend:       
$ arm-linux-gnueabihf-gcc-5 -static -nostdlib hellobarearm.s
$ ./a.out
hello, world
$ ls -l a.out
-rwxr-xr-x 1 user user 964 Dec 11 02:56 a.out
$ arm-linux-gnueabihf-objdump -d a.out

a.out:     file format elf32-littlearm


Disassembly of section .text:

00010098 <_start>:
   10098:   f04f 0704   mov.w   r7, #4
   1009c:   f04f 0000   mov.w   r0, #0
   100a0:   f240 01b8   movw    r1, #184    ; 0xb8
   100a4:   f2c0 0101   movt    r1, #1
   100a8:   f04f 020d   mov.w   r2, #13
   100ac:   df00        svc 0
   100ae:   f04f 0701   mov.w   r7, #1
   100b2:   f04f 0000   mov.w   r0, #0
   100b6:   df00        svc 0

000100b8 <hello>:
   100b8:   6c6c6568    .word   0x6c6c6568
   100bc:   77202c6f    .word   0x77202c6f
   100c0:   646c726f    .word   0x646c726f
   100c4:   0a              .byte   0x0a

000100c5 <helloend>:
        ...
$

Note that strip or rather arm-linux-gnueabihf-strip seems to break objdump’s ability to disassemble the code, but it still runs. Here’s a dump of the stripped executable:

$ od -vbAn a.out
 177 105 114 106 001 001 001 000 000 000 000 000 000 000 000 000
 002 000 050 000 001 000 000 000 231 000 001 000 064 000 000 000
 034 001 000 000 000 002 000 005 064 000 040 000 002 000 050 000
 005 000 004 000 001 000 000 000 000 000 000 000 000 000 001 000
 000 000 001 000 306 000 000 000 306 000 000 000 005 000 000 000
 000 000 001 000 004 000 000 000 164 000 000 000 164 000 001 000
 164 000 001 000 044 000 000 000 044 000 000 000 004 000 000 000
 004 000 000 000 004 000 000 000 024 000 000 000 003 000 000 000
 107 116 125 000 005 140 061 140 131 337 041 016 031 220 022 215
 156 115 132 247 261 367 300 226 117 360 004 007 117 360 000 000
 100 362 270 001 300 362 001 001 117 360 015 002 000 337 117 360
 001 007 117 360 000 000 000 337 150 145 154 154 157 054 040 167
 157 162 154 144 012 000 101 036 000 000 000 141 145 141 142 151
 000 001 024 000 000 000 005 067 055 101 000 006 012 007 101 010
 001 011 002 012 004 000 056 163 150 163 164 162 164 141 142 000
 056 156 157 164 145 056 147 156 165 056 142 165 151 154 144 055
 151 144 000 056 164 145 170 164 000 056 101 122 115 056 141 164
 164 162 151 142 165 164 145 163 000 000 000 000 000 000 000 000
 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000 000
 000 000 000 000 013 000 000 000 007 000 000 000 002 000 000 000
 164 000 001 000 164 000 000 000 044 000 000 000 000 000 000 000
 000 000 000 000 004 000 000 000 000 000 000 000 036 000 000 000
 001 000 000 000 006 000 000 000 230 000 001 000 230 000 000 000
 056 000 000 000 000 000 000 000 000 000 000 000 004 000 000 000
 000 000 000 000 044 000 000 000 003 000 000 160 000 000 000 000
 000 000 000 000 306 000 000 000 037 000 000 000 000 000 000 000
 000 000 000 000 001 000 000 000 000 000 000 000 001 000 000 000
 003 000 000 000 000 000 000 000 000 000 000 000 345 000 000 000
 064 000 000 000 000 000 000 000 000 000 000 000 001 000 000 000
 000 000 000 000

Although, doh, I seem to be writing to fd 0, not fd 1. QEMU confirms:

$ qemu-arm -strace ./a.out
30784 write(0,0x100b8,13)hello, world
 = 13
30784 exit(0)

read(2) seems to be syscall 3, open(2) syscall 5, and close(2) syscall 6; maybe brk() is 45 and getpid() is 20.
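For the record, this is what the program was supposed to do, sketched in C: write the 13 bytes to file descriptor 1, standard output. In the assembly the fix should be just mov r0, #1 in place of mov r0, #0 before the first svc, though I haven't shown that rebuilt here. say_hello is a made-up name.

```c
#include <unistd.h>  /* write, STDOUT_FILENO */

/* Write the greeting to the given file descriptor and return whatever
   write(2) returned: the byte count on success, -1 on failure.
   (say_hello is a hypothetical helper name.) */
long say_hello(int fd)
{
    static const char hello[] = "hello, world\n";
    return (long)write(fd, hello, sizeof hello - 1);  /* drop the NUL */
}
```

say_hello(STDOUT_FILENO) is the intended call. Passing 0 instead only appears to work when standard input happens to be a terminal opened for both reading and writing.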

The ARM-THUMB Procedure Call Standard

ARM publishes this document, called the ATPCS for short. It explains the use of the registers: r0 and r1 are used for return values; r0 to r3 are used for arguments and are thus caller-saved; r4 to r11 are callee-saved general-purpose registers; r11 is also FP, the frame pointer; r12 is IP, the "intra-procedure-call scratch register"; r13 is SP; r14 is LR, the link register used by the bl instruction; r15 is PC. The stack grows downwards, and the stack pointer (which must be 8-byte-aligned when calling a public function) points at the last thing that was pushed, not the next thing to push. Interrupt handlers can execute on your stack, so if you have interrupts you can't depend on values you've popped staying put.

Newer versions are longer and of poorer quality, though covering more modern CPU features; fortunately I was able to find an older version, "SWS ESPC 0002 B-01" (B-01 being the version number), from "24 October, 2000", which is only 37 pages.

Original-THUMB instructions can only access r0 to r7 for most operands, so you only have 8 general-purpose registers, like the 8088; more recent ones can access the other 8 registers but I think need longer instructions.

There are also some special uses of r10 ("SL", "stack limit" in stack-checked variants), r9 ("SB", "static base" for shared libraries and other uses of position-independent data), and r7 ("WR", "Thumb-state Work Register"), but I don't think these affect their use most of the time --- from the perspective of writing assembly code, they're mostly just more callee-saved registers, except that I guess if your assembly code is in a shared library it will use r9. More recent versions of the ATPCS eliminate SL and WR and give r9 an additional role, TR, the "thread register", for TLS I guess.

I think the definition of "IP" means that the dynamic linker is free to clobber r12 when it's doing lazy dynamic linking, so the callee may see random crap in r12 if it just got loaded. This also means it's caller-saved.

So, in summary, r4-r11 and SP (r13) are callee-saved, and everything else is caller-saved. (Except that r10 may get horked by "limit-checking support code".)

Parameter passing is what you'd expect: everything gets widened to 32 bits, except 64-bit values are split into two 32-bit values (not sure about endianness). The first four parameters go into r0-r3, and subsequent parameters are passed on the stack, first argument last. (So the leading 1 passed to __printf_chk above is not an argument count demanded by the calling convention; as far as I can tell from glibc's __printf_chk(int flag, const char *format, ...), it is just an ordinary first argument, a checking flag, and varargs functions get their arguments in registers like any other.) Return values go in r0, r0-r1, r0-r2, or r0-r3, if they fit, while longer return values are returned "indirectly, in memory, via an additional address parameter." Not sure whether that parameter is passed in by the caller as a ghostly first parameter or what.
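The "additional address parameter" can be mimicked by hand in C. The sketch below shows a function returning a 20-byte struct, and beside it the lowered form I believe the compiler effectively generates, where the caller passes in the address of the result. My understanding of the current AAPCS is that this address travels in r0 ahead of the declared arguments, but I have not checked that against the old ATPCS text, so treat it as an assumption. All the names are made up.

```c
/* A 20-byte struct: too big to come back in r0-r3. */
struct five { int v[5]; };

/* The function as one would write it in C. */
struct five make_five(int base)
{
    struct five f;
    for (int i = 0; i < 5; i++)
        f.v[i] = base + i;
    return f;
}

/* Hand-lowered equivalent: the caller supplies the result's address. */
void make_five_lowered(struct five *result, int base)
{
    for (int i = 0; i < 5; i++)
        result->v[i] = base + i;
}

/* Call each version and add up the fields, to compare them. */
int sum_by_value(int base)
{
    struct five f = make_five(base);
    return f.v[0] + f.v[1] + f.v[2] + f.v[3] + f.v[4];
}

int sum_lowered(int base)
{
    struct five f;
    make_five_lowered(&f, base);
    return f.v[0] + f.v[1] + f.v[2] + f.v[3] + f.v[4];
}
```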

Floating-point parameter passing is different but uses floating-point registers.

The floating-point story sounds remarkably like the 8086-family story. The old FPA register set has 8 extended-precision registers (I think 12 bytes instead of the 8087's 10) that can also be used as single-precision. FPA has more recently been replaced by "VFP", "vector floating point", which has 16 double-precision registers that can be used as 32 single-precision registers instead. A difference is that the two are mutually exclusive: while an Atom supports both 80387 instructions and SSE instructions, ARM chips support either FPA or VFP or neither, not both.

The assembly-language examples use a syntax closely resembling Intel syntax, with ; for comments and no %, but . to mean the current position:

    MOV   LR, PC     ; VAL(C) = . + 8
    MOV   PC, r4

GCC and Gas use something like this syntax by default except for the comments. The ATPCS sometimes says things like "-4[FP]" in the body text; it's not clear to me whether this is valid assembly syntax in ARM's mind, but Gas seems to be writing that as [fp, #-4].

"BX" is "branch and exchange", not "branch indirect" as I thought; it uses the LSB of the address to determine whether to use ARM or THUMB instructions after the jump. There's a note: "In ARM architecture version 5T, a load (but not a move) to the PC also restores the instruction-set state, allowing an inter-working return to be performed using LDR, LDM, or POP," which I guess means that before arm5t that wasn't the case.

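The BX rule is simple enough to model in a few lines of C. This is a sketch of the behavior as described above, ignoring the alignment corner cases for ARM-state targets; the function names are made up. It is presumably also why .thumb_func mattered earlier: if the symbol is not marked as a Thumb function, callers get an address with bit 0 clear, and the CPU then decodes the Thumb code as ARM instructions.

```c
#include <stdint.h>

/* Instruction-set state selected by "bx rm": bit 0 of the target.
   Returns 1 for Thumb, 0 for ARM. */
int bx_selects_thumb(uint32_t target)
{
    return (int)(target & 1u);
}

/* Address at which execution continues: the target with bit 0 cleared. */
uint32_t bx_branch_address(uint32_t target)
{
    return target & ~(uint32_t)1;
}
```

For a Thumb main at 0x1049c, a caller using bx or blx would need to branch to 0x1049d.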
Shared libraries and position-independent code

To get position-independent code, you have to use PC-relative references to all your read-only data, and you have to access read-write data by indexing off SB. This in particular means that no static data can point to any other static data without dynamic-linker intervention, even inside the same segment; and read-only data, to be sharable, can only point to read-write data by indexing off SB. I don't know how non-instruction read-only data can point to read-only data at all.

There is a delightful hack for shared libraries to find their read-write data segments in their entry trampolines: every shared library has a "library index" set at load time, and every shared library's read-write data segment starts with four pointers into a "process data table" which lists the data segments of the shared libraries. So you're supposed to do this, in THUMB code:

    MOV LSB, SB            ; for some "low register" LSB
    LDR LSB, [LSB, #my_segment]  ; 0, 1, 2, or 3
    LDR LSB, [LSB, #my_index]    ; set by the dynamic linker

Surprisingly, section 5.8 is about Chez-Scheme-style segmented stacks (called "chunked stacks"). Too bad there's no explicit support for closures, although maybe the GCC-style trampoline hack is better than an explicit context pointer in the ABI, which would slow down every call. (Although the SB thing is pretty close to being exactly an explicit context pointer in the ABI...)

Stack unwinding

A surprising thing about the ATPCS is that it contains an unwinding spec to make it possible for zero-overhead exception handlers to safely unwind the stack, even through shared libraries, restoring all callee-saved registers just as if the functions unwound had returned normally. Even more surprising is that the recommended approach is to examine the binary code of the functions on the stack to identify their prologues or epilogues, and then either interpretively undo the effect of the prologue, or directly execute the epilogue. This is very clever, but for it to work, you need to not move the stack pointer during the body of the function the way my dumb Fibonacci code above does. If that requirement is satisfied, though, I think only a very minimal amount of auxiliary data is required to make the unwinding work.

The Gas manual leads me to believe that this is not the method currently used for unwinding, because it demands that you tell it what registers you're saving with a .save directive.

A bit more disassembly, exploring instruction set differences, and failing to figure out shared libraries

I wrote this C program on a different computer and compiled it with arm-none-eabi-gcc -S -mthumb -O, with no #include files:

int main(int argc, char **argv) {
    printf("%d\n", atoi(argv[1]) << 3 | atoi(argv[2]));
    return 0;
}

The assembly code generated inside main looks mostly like this:

main:
        push    {r3, r4, r5, lr}
        mov     r4, r1
        ldr     r0, [r1, #4]
        bl      atoi
        mov     r5, r0
        ldr     r0, [r4, #8]
        bl      atoi
        lsl     r1, r5, #3
        orr     r1, r0
        ldr     r0, .L2
        bl      printf
        mov     r0, #0
        @ sp needed for prologue
        pop     {r3, r4, r5}
        pop     {r1}
        bx      r1
.L3:
        .align  2
.L2:
        .word   .LC0
        .size   main, .-main
        .section        .rodata.str1.4,"aMS",%progbits,1
        .align  2
.LC0:
        .ascii  "%d\012\000"

The ldr for the atoi arguments confirms that the #4 or #8 is a byte offset. The lsl and orr mnemonics were what I was really looking for, but I'm surprised not to see the left-shift incorporated into an operand, because I thought ARM supported a left-shift in every operand or something.

The ldr r0, .L2 is presumably because the 32-bit constant address of the string at .LC0 is hard to fit into an instruction. The separate pop for the return address is presumably because if it used r1 in the same pop it would have been popped in the wrong order (because in the instruction encoding this is surely some kind of bitfield or something, not a variable-length list of 4-bit register numbers); this also clarifies that the first thing in the push or pop list is the one SP points at within the push/pop pair: the last one to be pushed and the first one to be popped. But why didn't it just pop it into pc rather than using two more instructions? I suspect the answer is what I saw earlier in the ATPCS: old ARMs needed an explicit BX to ensure a switch in instruction encoding.

Previously I was compiling for armv7-a (by the default configuration of my toolchain on the other computer), and I wonder if that resulted in using the freer-form Thumb-2 instruction format, in which you can access the high registers. Indeed, all of these instructions fit into 16 bits, except for the call instructions, whose immediate operands need a second halfword:

$ arm-none-eabi-objdump -d shl.o

shl.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <main>:
   0:   b538        push    {r3, r4, r5, lr}
   2:   1c0c        adds    r4, r1, #0
   4:   6848        ldr r0, [r1, #4]
   6:   f7ff fffe   bl  0 <atoi>
   a:   1c05        adds    r5, r0, #0
   c:   68a0        ldr r0, [r4, #8]
   e:   f7ff fffe   bl  0 <atoi>
  12:   00e9        lsls    r1, r5, #3
  14:   4301        orrs    r1, r0
  16:   4803        ldr r0, [pc, #12]   ; (24 <main+0x24>)
  18:   f7ff fffe   bl  0 <printf>
  1c:   2000        movs    r0, #0
  1e:   bc38        pop {r3, r4, r5}
  20:   bc02        pop {r1}
  22:   4708        bx  r1
  24:   00000000    .word   0x00000000

We can see that the ldr to get the constant has been compiled as a PC-relative reference, presumably to support position-independent code --- although I'm not sure how that word is supposed to get the address of the string in it in a relocatable way?

It isn't: this code is not position-independent. If I compile with arm-none-eabi-gcc -S -mthumb -fPIC -O, I get this instead:

        ldr     r0, .L2
.LPIC0:
        add     r0, pc
        bl      printf
...
.L2:
        .word   .LC0-(.LPIC0+4)
        .size   main, .-main
        .section        .rodata.str1.4,"aMS",%progbits,1
        .align  2
.LC0:
        .ascii  "%d\012\000"

That is, instead of storing the address of the string, it stores the PC-relative offset from the place where the string's absolute address gets computed. The (static) linker can freely relocate the string because the .LC0 relocation will fix up the word at .L2 when the final executable or shared library is built.

With a non-Thumb non-PIC compilation arm-none-eabi-gcc -S -O shl.c the code for main() is instead:

main:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        stmfd   sp!, {r3, r4, r5, lr}
        mov     r4, r1
        ldr     r0, [r1, #4]
        bl      atoi
        mov     r5, r0
        ldr     r0, [r4, #8]
        bl      atoi
        orr     r1, r0, r5, asl #3
        ldr     r0, .L2
        bl      printf
        mov     r0, #0
        ldmfd   sp!, {r3, r4, r5, lr}
        bx      lr

(Again, all of this is without #includes; thus the literal calls to atoi and printf.)

These are 32-bit instructions, and it seems like it's using the stmfd and ldmfd instructions (rather than push and pop) to load and store multiple values; presumably the sp! addressing mode is some kind of magic autoincrement/autodecrement addressing mode. The fact that sp is an explicit operand makes it sound like r13 is just another register and its use as the stack pointer was just a convention, but I don't think that was really true --- I think even old ARM interrupt handlers used r13 to save the registers of the thread being interrupted. (Certainly the ATPCS documents this as a thing that could happen in 2000.)

Some of the instructions have three operands instead of two, and the built-in shift I thought I remembered does seem to exist here: orr r1, r0, r5, asl #3. Also it's worth noticing that these instructions are missing the s suffix in the disassembly:

$ arm-none-eabi-objdump -d shl.o

shl.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <main>:
   0:   e92d4038    push    {r3, r4, r5, lr}
   4:   e1a04001    mov r4, r1
   8:   e5910004    ldr r0, [r1, #4]
   c:   ebfffffe    bl  0 <atoi>
  10:   e1a05000    mov r5, r0
  14:   e5940008    ldr r0, [r4, #8]
  18:   ebfffffe    bl  0 <atoi>
  1c:   e1801185    orr r1, r0, r5, lsl #3
  20:   e59f000c    ldr r0, [pc, #12]   ; 34 <main+0x34>
  24:   ebfffffe    bl  0 <printf>
  28:   e3a00000    mov r0, #0
  2c:   e8bd4038    pop {r3, r4, r5, lr}
  30:   e12fff1e    bx  lr
  34:   00000000    .word   0x00000000

The Thumb assembly generated by GCC didn't have the s suffix on the instructions either, but the disassembly did; it turns out that 16-bit Thumb arithmetic and logic instructions always update the flags, except for mov and add instructions with high registers. Also note that the disassembly spells stmfd sp!, as push, just like the Thumb version.

What about position-independent mutable data? It turns out to use the same scheme as position-independent immutable data, contrary to what I had expected from the ATPCS. I compiled this C module

static int accumulator;

int octal_digit(int digit) {
    accumulator = accumulator << 3 | digit;
    return accumulator;
}

with arm-none-eabi-gcc -mthumb -S -O -fPIC and got this remarkable result:

octal_digit:
        ldr     r3, .L2
.LPIC0:
        add     r3, pc
        ldr     r1, [r3]
        lsl     r2, r1, #3
        orr     r0, r2
        str     r0, [r3]
        @ sp needed for prologue
        bx      lr
.L3:
        .align  2
.L2:
        .word   .LANCHOR0-(.LPIC0+4)
        .size   octal_digit, .-octal_digit
        .bss
        .align  2
        .set    .LANCHOR0,. + 0
        .type   accumulator, %object
        .size   accumulator, 4
accumulator:
        .space  4

And this disassembly:

00000000 <octal_digit>:
   0:   4b03        ldr r3, [pc, #12]   ; (10 <octal_digit+0x10>)
   2:   447b        add r3, pc
   4:   6819        ldr r1, [r3, #0]
   6:   00ca        lsls    r2, r1, #3
   8:   4310        orrs    r0, r2
   a:   6018        str r0, [r3, #0]
   c:   4770        bx  lr
   e:   46c0        nop         ; (mov r8, r8)
  10:   0000000a    .word   0x0000000a

So we have .L2 in the code segment, just after the end of the function, which contains the BSS address of the read-write variable accumulator, relative to .LPIC0+4, the value the instruction at .LPIC0 reads from PC. So first the program does a PC-relative ldr to fetch that read-only datum, and then it adds PC to the fetched datum to obtain the address of .LANCHOR0, which is the part of BSS that holds this file's zero-initialized static variables. This doesn't seem like it could possibly permit sharing the code segment, since the data at .L2 would need to be modified according to where (that piece of) BSS is positioned relative to where this code segment is mapped --- it would need a fixup by the dynamic linker.

This code also shows that the str instruction has its destination field on the right.

Without the static

int accumulator;

int octal_digit(int digit) {
    accumulator = accumulator << 3 | digit;
    return accumulator;
}

we get a different piece of code that refers to a global offset table; it sure isn't the scheme described in the ATPCS:

octal_digit:
        ldr     r3, .L2
.LPIC0:
        add     r3, pc
        ldr     r2, .L2+4
        ldr     r3, [r3, r2]
        ldr     r1, [r3]
        lsl     r2, r1, #3
        orr     r0, r2
        str     r0, [r3]
        @ sp needed for prologue
        bx      lr
.L3:
        .align  2
.L2:
        .word   _GLOBAL_OFFSET_TABLE_-(.LPIC0+4)
        .word   accumulator(GOT)
        .size   octal_digit, .-octal_digit
        .comm   accumulator,4,4

I mean this still seems to demand that this code be mapped at a fixed memory location relative to the _GLOBAL_OFFSET_TABLE_ if it isn't going to be fixed up at load time. So, I don't know.

Even so, it seems like a relatively heavy price to pay for code segment sharing: instead of accessing a variable by saying

        ldr     r3, [pc, #12]

you have to say

        ldr     r3, [pc, #12]
        add     r3, pc
        ldr     r2, [pc, #something]
        ldr     r3, [r3, r2]
        ldr     r1, [r3]

and also have a per-reference offset stored somewhere the static linker can fix it up; and so I wonder how often it is really worth it.

Costs of accessing variables allocated statically

An interesting thing about this way of referring to variables (or that described in the ATPCS) is that it reverses the traditional costs of referring to statically and dynamically allocated variables. From the 1940s through the 1980s, accessing a statically allocated variable was cheap: it was at a known, constant address in memory, which could be baked into the instruction; while accessing a variable allocated dynamically, for example on the stack, required indexing off the stack pointer or some other kind of base pointer, which itself had some extra cost to create and maintain. (Worse, until around 1970, there were a significant number of computers where an indexed memory access required self-modifying code, because they didn't have index registers.) But in this case we see that accessing two dynamically-allocated variables can be as simple as lsl r2, r1, #3, while accessing a single statically-allocated variable requires a five-instruction watusi.

At first blush this sounds like a straightforward case of architectural evolution, but it isn't really. RAM is just a bunch of registers, after all. There are only a couple of minor details of the ARM architecture that contribute to this situation: it has an efficient encoding for PC-relative addressing (like amd64, unlike i386); loading from a constant pointer requires three instructions (movw, movt, ldr) instead of one; and you only have 16 registers you can address directly, while everything else is much slower, because CPU speed has zoomed way ahead of RAM speed.

Rather than a change in architecture, though, it's mostly an evolution of the execution model. It's just a different way of using the machine that prioritizes different tradeoffs. You could totally use a PDP-10 or 6502 in such a way: mostly reserve the 6502 zero page for local variables and frame pointers and whatnot rather than global variables, and index all your "statically allocated" variables off one of those registers so that separate processes sharing an address space store their mutable state in separate "segments". And although the Cortex-A7 ARM in your cellphone might have gigabytes of RAM and a deep cache hierarchy, the Cortex-M0 in a small STM32 doesn't see a whole lot of difference between its speed of accessing CPU registers and accessing the on-die SRAM, except that it may need to run several instructions to compute an address into the on-die SRAM.

Reading other stuff

ARM published an "assembler user guide" in 2001 that explains the assembly language fairly comprehensively (354 pages!). Its chapter 4 is the ARM instruction set reference, and chapter 5 is the Thumb instruction set reference. It's marked as "superseded" on ARM's unusably bad website, but without a link (that I could find) to the superseding version. On the 15th page, it explains what ARM and Thumb are; on the 16th page, it describes the register-banking scheme used to separate user and supervisor (kernel) mode. It has a wealth of information about the historical development of the instruction set, including explaining literal pools and whatnot.

However, this version of the book lacks such crucial features as 32-bit-wide Thumb-2 instructions and if-then-else blocks.

There's a decent but unfinished 15-page tutorial by Carl Burch under CC-BY-SA; it explains the -s suffix on instructions, the absence of an integer division instruction (though not the existence of extensions that have it), the built-in shift, the umull instruction, the limitations on mov immediate constants, mvn, {ldr,str}{b,}, {ld,st}m{i,d}{b,a}, all the ALU instructions, conditional execution, all the condition codes, all the addressing modes (including examples of scaled-register-offset and immediate-post-indexed addressing), etc.

Despite this admirable level of comprehensiveness, it's imperfect; it seems to be unfinished, stopping after explaining the above but before describing function call and return, and it doesn't cover Thumb at all. Also, the "hailstone sequence" example program has a bug in it in which the ands instruction overwrites the accumulator, preventing the program from ever working, and at one point it erroneously says it's jumping to the beginning of an array of doublewords. And, unfortunately, the tutorial uses ARM's assembly syntax instead of Gas's.

Azeria Labs wrote an ARM assembly cheat sheet, though it's mostly focused on breakins, and they want to charge you for the full-resolution version; it's associated with a poorly-written error-filled tutorial with t0tally k00l diagrams. The discussion of it in 2017 on the orange website links to a lot of better resources.

https://www.coranac.com/tonc/text/asm.htm? http://www.davespace.co.uk/arm/introduction-to-arm/not-trivial.html?
