Optimizing the Visitor pattern on the DOM using Quaject-style dynamic code generation

Kragen Javier Sitaker, 2013-05-17 (updated 2013-05-20) (21 minutes)

(I think this was published previously on kragen-tol.)

Suppose you want to be able to execute the Visitor pattern as quickly as possible on some tree structure. You could compile your tree structure into executable code, each node a subroutine which invokes a method of the visitor object — traditionally each node type invokes a different method — and then passes the visitor object to each child node.

Using the "quaject" approach described in Henry Massalin's "Synthesis" thesis, and passing the pointer to the visitor object itself in %ebx, the code to invoke the appropriate method on the visitor might look like this:

    mov %ebx, %edx
    add $20, %edx
    call *%edx
    jc 1f
    ret
1:

Here we are checking the carry flag to see if the visitor object is requesting that we skip child nodes; if so, we return immediately.

The visitor "method" in this case is the code at offset 20 from the visitor base pointer; if its code is very short, it might be entirely inline at that point, but in many cases it will be longer, and only a call instruction will be present there. The called code will receive a pointer to the visitor itself in %ebx.

However, we probably want to pass some additional information to the visitor method; for example, if we're a variable node in a program AST, we might want to pass the name of the variable, or if we're an element node in a DOM, we might want to pass the tag name and attributes. So the entire call might look like this:

    mov $0xfe38d080, %eax
    mov %ebx, %edx
    add $20, %edx
    call *%edx
    jc 1f
    ret
1:

Following the call to the visitor method, we pass the visitor to the children. In the standard i386 GCC calling convention, %ebx, %esi, %edi, and %ebp are callee-saved registers, so if the visitor method follows this convention, we still have %ebx pointing to the visitor after it returns. So we can simply call our children one by one, implicitly passing them the visitor, then return:

    call 0xfe381000
    call 0xfe38d844
    call 0xfe391daa
    ret

For many applications, though, the visitor needs to take some action at the end of each node as well as the beginning. Perhaps it could inspect the child count at the beginning, maintaining its own stack of remaining child counts, and invoke the appropriate code when a child count reaches zero; but it's probably simpler to just notify it explicitly, by invoking another method on it before returning. This suggests putting the node pointer in a callee-saved register too, such as %ebp. With this additional modification, our example node looks like this:

   0:   55                      push   %ebp
   1:   bd 80 d0 38 fe          mov    $0xfe38d080,%ebp
   6:   89 da                   mov    %ebx,%edx
   8:   83 c2 14                add    $0x14,%edx
   b:   ff d2                   call   *%edx
   d:   73 0f                   jae    0x1e
   f:   e8 fc 0f 38 fe          call   0xfe381010
  14:   e8 40 d8 38 fe          call   0xfe38d859
  19:   e8 a6 1d 39 fe          call   0xfe391dc4
  1e:   89 da                   mov    %ebx,%edx
  20:   83 c2 1c                add    $0x1c,%edx
  23:   ff d2                   call   *%edx
  25:   5d                      pop    %ebp
  26:   c3                      ret

(Note that this version has acquired an unfortunate limit of 127 bytes of child nodes, which is to say, 25 of them.)

That's 39 bytes for an executable representation of the data pointer 0xfe38d080 plus a variable-length list of child pointers that happens to contain three pointers; the most straightforward way to represent this

struct node { 
  enum { document_node, element_node, text_node } nodetype;
  struct nodedata *data;
  int n_children;
  struct node *child[0];
};

would have needed 24 bytes, so making the data structure executable has cost us less than a factor of 2 in space.

(In the case where the children are sufficiently small, we can inline them, eliminating another 6 bytes (16%) of call and return.)

We can guarantee the visitor code a stronger calling convention than the usual; an executable tree built in the above fashion behaves in a very stereotyped way: of the caller-saved registers, it uses only %edx, plus %eax as an argument, so the visitor code is free to use %ecx as a private global variable, unless it calls some other code, in which case it must save it as usual. We can free up %edx too for the visitor's use as follows:

   0:   55                      push   %ebp
   1:   53                      push   %ebx
   2:   bd 80 d0 38 fe          mov    $0xfe38d080,%ebp
   7:   83 c3 14                add    $0x14,%ebx
   a:   ff d3                   call   *%ebx
   c:   73 0f                   jae    0x1d
   e:   e8 fc 0f 38 fe          call   0xfe38100f
  13:   e8 40 d8 38 fe          call   0xfe38d858
  18:   e8 a6 1d 39 fe          call   0xfe391dc3
  1d:   83 c3 08                add    $0x8,%ebx
  20:   ff d3                   call   *%ebx
  22:   5b                      pop    %ebx
  23:   5d                      pop    %ebp
  24:   c3                      ret

This has the side advantage of saving us two bytes (5%), but it also means the visitor method is invoked with a pointer to the visitor method rather than the visitor object; it is likely to need to subtract the method offset and add it back later. This seems like a fair tradeoff for giving it two private registers instead of one.

Additionally, the return value of the visitor method (that is, whatever it leaves in %eax) is either passed to the next visitor method called or is the return value of the entire node traversal.

For whatever it matters in today's world, this also means that the per-node overhead for tree traversal is 14 instructions, which seems pretty small. It may blow up your icache, though, and its single conditional jump might be mispredicted more often than the several in a more traditional implementation.

Consider this C implementation, listing generated as follows:

gcc -g -c -Wall -O4 -fomit-frame-pointer -Wa,-adhlns=nonquajectdom.lst nonquajectdom.c

   1                            .file   "nonquajectdom.c"
   2                            .text
   3                    .Ltext0:
   4                            .p2align 4,,15
   6                    _visit_node:
   7                    .LFB1:
   8                            .file 1 "nonquajectdom.c"
   1:nonquajectdom.c **** struct node { 
   2:nonquajectdom.c ****   enum { document_node, element_node, text_node } nodetype;
   3:nonquajectdom.c ****   struct nodedata *data;
   4:nonquajectdom.c ****   int n_children;
   5:nonquajectdom.c ****   struct node *child[0];
   6:nonquajectdom.c **** };
   7:nonquajectdom.c **** 
   8:nonquajectdom.c **** typedef int bool;
   9:nonquajectdom.c **** 
  10:nonquajectdom.c **** struct visitor {
  11:nonquajectdom.c ****   void *visitor_data;
  12:nonquajectdom.c ****   bool (*document_callback)(struct visitor *self, struct nodedata*);
  13:nonquajectdom.c ****   void (*document_end_callback)(struct visitor *self, struct nodedata*);
  14:nonquajectdom.c ****   bool (*element_callback)(struct visitor *self, struct nodedata*);
  15:nonquajectdom.c ****   void (*element_end_callback)(struct visitor *self, struct nodedata*);
  16:nonquajectdom.c ****   void (*text_callback)(struct visitor *self, struct nodedata*);
  17:nonquajectdom.c **** };
  18:nonquajectdom.c **** 
  19:nonquajectdom.c **** void visit_node(struct node *n, struct visitor *v);
  20:nonquajectdom.c **** 
  21:nonquajectdom.c **** static inline void visit_children(struct node *n, struct visitor *v)
  22:nonquajectdom.c **** {
  23:nonquajectdom.c ****   int ii;
  24:nonquajectdom.c **** 
  25:nonquajectdom.c ****   for (ii = 0; ii < n->n_children; ii++) {
  26:nonquajectdom.c ****     visit_node(n->child[ii], v);
  27:nonquajectdom.c ****   }
  28:nonquajectdom.c **** }
  29:nonquajectdom.c **** 
  30:nonquajectdom.c **** static void _visit_node(struct node *n, struct visitor *v)
  31:nonquajectdom.c **** {
   9                            .loc 1 31 0
  10                            .cfi_startproc
  11                    .LVL0:
  12 0000 83EC1C                subl    $28, %esp
  13                    .LCFI0:
  14                            .cfi_def_cfa_offset 32
  15 0003 895C2410              movl    %ebx, 16(%esp)
  16 0007 89C3                  movl    %eax, %ebx
  17                            .cfi_offset 3, -16
  32:nonquajectdom.c ****   switch (n->nodetype) {
  18                            .loc 1 32 0
  19 0009 8B00                  movl    (%eax), %eax
  20                    .LVL1:
  31:nonquajectdom.c **** {
  21                            .loc 1 31 0
  22 000b 89742414              movl    %esi, 20(%esp)
  23 000f 89D6                  movl    %edx, %esi
  24                            .cfi_offset 6, -12
  25 0011 897C2418              movl    %edi, 24(%esp)
  26                            .loc 1 32 0
  27 0015 83F801                cmpl    $1, %eax
  28 0018 7476                  je      .L4
  29                            .cfi_offset 7, -8
  30 001a 7224                  jb      .L3
  31 001c 83F802                cmpl    $2, %eax
  32 001f 750D                  jne     .L1
  33:nonquajectdom.c ****   case element_node:
  34:nonquajectdom.c ****     if (v->element_callback(v, n->data)) visit_children(n, v);
  35:nonquajectdom.c ****     v->element_end_callback(v, n->data);
  36:nonquajectdom.c ****     break;
  37:nonquajectdom.c ****   case document_node:
  38:nonquajectdom.c ****     if (v->document_callback(v, n->data)) visit_children(n, v);
  39:nonquajectdom.c ****     v->document_end_callback(v, n->data);
  40:nonquajectdom.c ****     break;
  41:nonquajectdom.c ****   case text_node:
  42:nonquajectdom.c ****     v->text_callback(v, n->data);
  33                            .loc 1 42 0
  34 0021 8B4304                movl    4(%ebx), %eax
  35 0024 891424                movl    %edx, (%esp)
  36 0027 89442404              movl    %eax, 4(%esp)
  37 002b FF5214                call    *20(%edx)
  38                    .LVL2:
  39                    .L1:
  43:nonquajectdom.c ****     break;
  44:nonquajectdom.c ****   }
  45:nonquajectdom.c **** }
  40                            .loc 1 45 0
  41 002e 8B5C2410              movl    16(%esp), %ebx
  42                    .LVL3:
  43 0032 8B742414              movl    20(%esp), %esi
  44                    .LVL4:
  45 0036 8B7C2418              movl    24(%esp), %edi
  46 003a 83C41C                addl    $28, %esp
  47                            .cfi_remember_state
  48                    .LCFI1:
  49                            .cfi_def_cfa_offset 4
  50                            .cfi_restore 7
  51                            .cfi_restore 6
  52                            .cfi_restore 3
  53 003d C3                    ret
  54                    .LVL5:
  55 003e 6690                  .p2align 4,,7
  56                            .p2align 3
  57                    .L3:
  58                    .LCFI2:
  59                            .cfi_restore_state
  38:nonquajectdom.c ****     if (v->document_callback(v, n->data)) visit_children(n, v);
  60                            .loc 1 38 0
  61 0040 8B4304                movl    4(%ebx), %eax
  62 0043 891424                movl    %edx, (%esp)
  63 0046 89442404              movl    %eax, 4(%esp)
  64 004a FF5204                call    *4(%edx)
  65                    .LVL6:
  66 004d 85C0                  testl   %eax, %eax
  67 004f 7422                  je      .L8
  68                    .LVL7:
  69                    .LBB12:
  70                    .LBB13:
  25:nonquajectdom.c ****   for (ii = 0; ii < n->n_children; ii++) {
  71                            .loc 1 25 0
  72 0051 8B4308                movl    8(%ebx), %eax
  73 0054 85C0                  testl   %eax, %eax
  74 0056 7E1B                  jle     .L8
  75 0058 31FF                  xorl    %edi, %edi
  76                    .LVL8:
  77 005a 8DB60000              .p2align 4,,7
  77      0000
  78                            .p2align 3
  79                    .L9:
  80                    .LBB14:
  81                    .LBB15:
  46:nonquajectdom.c **** 
  47:nonquajectdom.c **** void visit_node(struct node *n, struct visitor *v)
  48:nonquajectdom.c **** {
  49:nonquajectdom.c ****   _visit_node(n, v);
  82                            .loc 1 49 0
  83 0060 8B44BB0C              movl    12(%ebx,%edi,4), %eax
  84 0064 89F2                  movl    %esi, %edx
  85                    .LBE15:
  86                    .LBE14:
  25:nonquajectdom.c ****   for (ii = 0; ii < n->n_children; ii++) {
  87                            .loc 1 25 0
  88 0066 83C701                addl    $1, %edi
  89                    .LVL9:
  90                    .LBB17:
  91                    .LBB16:
  92                            .loc 1 49 0
  93 0069 E892FFFF              call    _visit_node
  93      FF
  94                    .LVL10:
  95                    .LBE16:
  96                    .LBE17:
  25:nonquajectdom.c ****   for (ii = 0; ii < n->n_children; ii++) {
  97                            .loc 1 25 0
  98 006e 3B7B08                cmpl    8(%ebx), %edi
  99 0071 7CED                  jl      .L9
 100                    .LVL11:
 101                    .L8:
 102                    .LBE13:
 103                    .LBE12:
  39:nonquajectdom.c ****     v->document_end_callback(v, n->data);
 104                            .loc 1 39 0
 105 0073 8B4304                movl    4(%ebx), %eax
 106 0076 893424                movl    %esi, (%esp)
 107 0079 89442404              movl    %eax, 4(%esp)
 108 007d FF5608                call    *8(%esi)
  45:nonquajectdom.c **** }
 109                            .loc 1 45 0
 110 0080 8B5C2410              movl    16(%esp), %ebx
 111                    .LVL12:
 112 0084 8B742414              movl    20(%esp), %esi
 113                    .LVL13:
 114 0088 8B7C2418              movl    24(%esp), %edi
 115 008c 83C41C                addl    $28, %esp
 116                            .cfi_remember_state
 117                            .cfi_restore 3
 118                            .cfi_restore 6
 119                            .cfi_restore 7
 120                    .LCFI3:
 121                            .cfi_def_cfa_offset 4
 122 008f C3                    ret
 123                    .LVL14:
 124                            .p2align 4,,7
 125                            .p2align 3
 126                    .L4:
 127                    .LCFI4:
 128                            .cfi_restore_state
  34:nonquajectdom.c ****     if (v->element_callback(v, n->data)) visit_children(n, v);
 129                            .loc 1 34 0
 130 0090 8B4304                movl    4(%ebx), %eax
 131 0093 891424                movl    %edx, (%esp)
 132 0096 89442404              movl    %eax, 4(%esp)
 133 009a FF520C                call    *12(%edx)
 134                    .LVL15:
 135 009d 85C0                  testl   %eax, %eax
 136 009f 7422                  je      .L6
 137                    .LVL16:
 138                    .LBB18:
 139                    .LBB19:
  25:nonquajectdom.c ****   for (ii = 0; ii < n->n_children; ii++) {
 140                            .loc 1 25 0
 141 00a1 8B5308                movl    8(%ebx), %edx
 142 00a4 85D2                  testl   %edx, %edx
 143 00a6 7E1B                  jle     .L6
 144 00a8 31FF                  xorl    %edi, %edi
 145                    .LVL17:
 146 00aa 8DB60000              .p2align 4,,7
 146      0000
 147                            .p2align 3
 148                    .L7:
 149                    .LBB20:
 150                    .LBB21:
 151                            .loc 1 49 0
 152 00b0 8B44BB0C              movl    12(%ebx,%edi,4), %eax
 153 00b4 89F2                  movl    %esi, %edx
 154                    .LBE21:
 155                    .LBE20:
  25:nonquajectdom.c ****   for (ii = 0; ii < n->n_children; ii++) {
 156                            .loc 1 25 0
 157 00b6 83C701                addl    $1, %edi
 158                    .LVL18:
 159                    .LBB23:
 160                    .LBB22:
 161                            .loc 1 49 0
 162 00b9 E842FFFF              call    _visit_node
 162      FF
 163                    .LVL19:
 164                    .LBE22:
 165                    .LBE23:
  25:nonquajectdom.c ****   for (ii = 0; ii < n->n_children; ii++) {
 166                            .loc 1 25 0
 167 00be 3B7B08                cmpl    8(%ebx), %edi
 168 00c1 7CED                  jl      .L7
 169                    .LVL20:
 170                    .L6:
 171                    .LBE19:
 172                    .LBE18:
  35:nonquajectdom.c ****     v->element_end_callback(v, n->data);
 173                            .loc 1 35 0
 174 00c3 8B4304                movl    4(%ebx), %eax
 175 00c6 893424                movl    %esi, (%esp)
 176 00c9 89442404              movl    %eax, 4(%esp)
 177 00cd FF5610                call    *16(%esi)
  45:nonquajectdom.c **** }
 178                            .loc 1 45 0
 179 00d0 8B5C2410              movl    16(%esp), %ebx
 180                    .LVL21:
 181 00d4 8B742414              movl    20(%esp), %esi
 182                    .LVL22:
 183 00d8 8B7C2418              movl    24(%esp), %edi
 184 00dc 83C41C                addl    $28, %esp
 185                            .cfi_restore 3
 186                            .cfi_restore 6
 187                            .cfi_restore 7
 188                    .LCFI5:
 189                            .cfi_def_cfa_offset 4
 190 00df C3                    ret
 191                            .cfi_endproc
 192                    .LFE1:
 194                            .p2align 4,,15
 195                            .globl  visit_node
 197                    visit_node:
 198                    .LFB2:
  48:nonquajectdom.c **** {
 199                            .loc 1 48 0
 200                            .cfi_startproc
 201                    .LVL23:
 202                            .loc 1 49 0
 203 00e0 8B542408              movl    8(%esp), %edx
 204 00e4 8B442404              movl    4(%esp), %eax
 205 00e8 E913FFFF              jmp     _visit_node
 205      FF
 206                            .cfi_endproc
 207                    .LFE2:
 209                    .Letext0:
DEFINED SYMBOLS
                            *ABS*:0000000000000000 nonquajectdom.c
     /tmp/cc4F3UeL.s:6      .text:0000000000000000 _visit_node
     /tmp/cc4F3UeL.s:197    .text:00000000000000e0 visit_node

NO UNDEFINED SYMBOLS

The above is really hard for me to read, so here's the disassembly from objdump -d:

/home/default/devel/aspmisc/nonquajectdom.o:     file format elf32-i386


Disassembly of section .text:

00000000 <_visit_node>:
   0:   83 ec 1c                sub    $0x1c,%esp
   3:   89 5c 24 10             mov    %ebx,0x10(%esp)
   7:   89 c3                   mov    %eax,%ebx
   9:   8b 00                   mov    (%eax),%eax
   b:   89 74 24 14             mov    %esi,0x14(%esp)
   f:   89 d6                   mov    %edx,%esi
  11:   89 7c 24 18             mov    %edi,0x18(%esp)
  15:   83 f8 01                cmp    $0x1,%eax
  18:   74 76                   je     90 <_visit_node+0x90>
  1a:   72 24                   jb     40 <_visit_node+0x40>
  1c:   83 f8 02                cmp    $0x2,%eax
  1f:   75 0d                   jne    2e <_visit_node+0x2e>
  21:   8b 43 04                mov    0x4(%ebx),%eax
  24:   89 14 24                mov    %edx,(%esp)
  27:   89 44 24 04             mov    %eax,0x4(%esp)
  2b:   ff 52 14                call   *0x14(%edx)
  2e:   8b 5c 24 10             mov    0x10(%esp),%ebx
  32:   8b 74 24 14             mov    0x14(%esp),%esi
  36:   8b 7c 24 18             mov    0x18(%esp),%edi
  3a:   83 c4 1c                add    $0x1c,%esp
  3d:   c3                      ret    
  3e:   66 90                   xchg   %ax,%ax
  40:   8b 43 04                mov    0x4(%ebx),%eax
  43:   89 14 24                mov    %edx,(%esp)
  46:   89 44 24 04             mov    %eax,0x4(%esp)
  4a:   ff 52 04                call   *0x4(%edx)
  4d:   85 c0                   test   %eax,%eax
  4f:   74 22                   je     73 <_visit_node+0x73>
  51:   8b 43 08                mov    0x8(%ebx),%eax
  54:   85 c0                   test   %eax,%eax
  56:   7e 1b                   jle    73 <_visit_node+0x73>
  58:   31 ff                   xor    %edi,%edi
  5a:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
  60:   8b 44 bb 0c             mov    0xc(%ebx,%edi,4),%eax
  64:   89 f2                   mov    %esi,%edx
  66:   83 c7 01                add    $0x1,%edi
  69:   e8 92 ff ff ff          call   0 <_visit_node>
  6e:   3b 7b 08                cmp    0x8(%ebx),%edi
  71:   7c ed                   jl     60 <_visit_node+0x60>
  73:   8b 43 04                mov    0x4(%ebx),%eax
  76:   89 34 24                mov    %esi,(%esp)
  79:   89 44 24 04             mov    %eax,0x4(%esp)
  7d:   ff 56 08                call   *0x8(%esi)
  80:   8b 5c 24 10             mov    0x10(%esp),%ebx
  84:   8b 74 24 14             mov    0x14(%esp),%esi
  88:   8b 7c 24 18             mov    0x18(%esp),%edi
  8c:   83 c4 1c                add    $0x1c,%esp
  8f:   c3                      ret    
  90:   8b 43 04                mov    0x4(%ebx),%eax
  93:   89 14 24                mov    %edx,(%esp)
  96:   89 44 24 04             mov    %eax,0x4(%esp)
  9a:   ff 52 0c                call   *0xc(%edx)
  9d:   85 c0                   test   %eax,%eax
  9f:   74 22                   je     c3 <_visit_node+0xc3>
  a1:   8b 53 08                mov    0x8(%ebx),%edx
  a4:   85 d2                   test   %edx,%edx
  a6:   7e 1b                   jle    c3 <_visit_node+0xc3>
  a8:   31 ff                   xor    %edi,%edi
  aa:   8d b6 00 00 00 00       lea    0x0(%esi),%esi
  b0:   8b 44 bb 0c             mov    0xc(%ebx,%edi,4),%eax
  b4:   89 f2                   mov    %esi,%edx
  b6:   83 c7 01                add    $0x1,%edi
  b9:   e8 42 ff ff ff          call   0 <_visit_node>
  be:   3b 7b 08                cmp    0x8(%ebx),%edi
  c1:   7c ed                   jl     b0 <_visit_node+0xb0>
  c3:   8b 43 04                mov    0x4(%ebx),%eax
  c6:   89 34 24                mov    %esi,(%esp)
  c9:   89 44 24 04             mov    %eax,0x4(%esp)
  cd:   ff 56 10                call   *0x10(%esi)
  d0:   8b 5c 24 10             mov    0x10(%esp),%ebx
  d4:   8b 74 24 14             mov    0x14(%esp),%esi
  d8:   8b 7c 24 18             mov    0x18(%esp),%edi
  dc:   83 c4 1c                add    $0x1c,%esp
  df:   c3                      ret

000000e0 <visit_node>:
  e0:   8b 54 24 08             mov    0x8(%esp),%edx
  e4:   8b 44 24 04             mov    0x4(%esp),%eax
  e8:   e9 13 ff ff ff          jmp    0 <_visit_node>

Let's slice that down to just the element_node case, which the listing above shows us begins at 0x90.

00000000 <_visit_node>:
   0:   83 ec 1c                sub    $0x1c,%esp
   3:   89 5c 24 10             mov    %ebx,0x10(%esp)
   7:   89 c3                   mov    %eax,%ebx
   9:   8b 00                   mov    (%eax),%eax
   b:   89 74 24 14             mov    %esi,0x14(%esp)
   f:   89 d6                   mov    %edx,%esi
  11:   89 7c 24 18             mov    %edi,0x18(%esp)
  15:   83 f8 01                cmp    $0x1,%eax
  18:   74 76                   je     90 <_visit_node+0x90>
  ...
  90:   8b 43 04                mov    0x4(%ebx),%eax
  93:   89 14 24                mov    %edx,(%esp)
  96:   89 44 24 04             mov    %eax,0x4(%esp)
  9a:   ff 52 0c                call   *0xc(%edx)
  9d:   85 c0                   test   %eax,%eax
  9f:   74 22                   je     c3 <_visit_node+0xc3>

  a1:   8b 53 08                mov    0x8(%ebx),%edx
  a4:   85 d2                   test   %edx,%edx
  a6:   7e 1b                   jle    c3 <_visit_node+0xc3>
  a8:   31 ff                   xor    %edi,%edi
  (confusing multibyte nop padding removed)

  b0:   8b 44 bb 0c             mov    0xc(%ebx,%edi,4),%eax
  b4:   89 f2                   mov    %esi,%edx
  b6:   83 c7 01                add    $0x1,%edi
  b9:   e8 42 ff ff ff          call   0 <_visit_node>
  be:   3b 7b 08                cmp    0x8(%ebx),%edi
  c1:   7c ed                   jl     b0 <_visit_node+0xb0>

  c3:   8b 43 04                mov    0x4(%ebx),%eax
  c6:   89 34 24                mov    %esi,(%esp)
  c9:   89 44 24 04             mov    %eax,0x4(%esp)
  cd:   ff 56 10                call   *0x10(%esi)

  d0:   8b 5c 24 10             mov    0x10(%esp),%ebx
  d4:   8b 74 24 14             mov    0x14(%esp),%esi
  d8:   8b 7c 24 18             mov    0x18(%esp),%edi
  dc:   83 c4 1c                add    $0x1c,%esp
  df:   c3                      ret

000000e0 <visit_node>:
  e0:   8b 54 24 08             mov    0x8(%esp),%edx
  e4:   8b 44 24 04             mov    0x4(%esp),%eax
  e8:   e9 13 ff ff ff          jmp    0 <_visit_node>

I separated the recursive function from the entry point so that by making it static, GCC could pick a more reasonable calling convention for it; which worked, but only up to a point.

Here our per-node overhead is 7 instructions of preamble, 2 instructions of dispatch (by chance, GCC happened to make dispatch fastest for this case), 3 instructions to call element_callback (using the three-byte call encoding optimized for function pointer tables), 2 instructions to conditionally skip the child nodes, 4 instructions of loop setup, 6 instructions per loop iteration (which we can count toward the child node's cost), 4 instructions to call element_end_callback, and 5 instructions of cleanup, for a total of (+ 7 2 3 2 4 6 4 5) = 33 instructions, of which four are conditional jumps, providing opportunities for branch misprediction. (However, the loop is simple enough that I imagine branch misprediction will be very rare.)

The listing of the C code points out that if the visitor is not a quaject, just a regular struct with function pointers, then we can eliminate two add instructions, which probably don't matter, and 4 of the 37 bytes, which probably do, and also build visitors in plain C instead of assembly, at the cost of an extra indirection. That is:

   0:   55                      push   %ebp
   1:   bd 80 d0 38 fe          mov    $0xfe38d080,%ebp
   6:   ff 53 04                call   *0x4(%ebx)
   9:   73 0f                   jae    0x1a
   b:   e8 fc 0f 38 fe          call   0xfe38100c
  10:   e8 40 d8 38 fe          call   0xfe38d855
  15:   e8 a6 1d 39 fe          call   0xfe391dc0
  1a:   ff 53 08                call   *0x8(%ebx)
  1d:   5d                      pop    %ebp
  1e:   c3                      ret

Now our executable DOM node is 10 instructions and 31 bytes, only 7 bytes more than the 24 bytes the struct needs. Of those 7 bytes, 3 should properly be charged to the child nodes, so the overhead is more like 5 bytes and 8 instructions per node.

http://localhost:8000/wikipedia_en_all_nopic_01_2012/A/Dependency%20theory.html has 512 elements, 681 text nodes, and 17 other nodes. Traversing the whole thing should cost some (* 8 (+ 512 681 17)) = 9680 instructions, or about ten microseconds. The document text is 48962 bytes. The tree structure in this form should cost (* 21 (+ 512 681 17)) = 25410 bytes, plus the cost of the metadata (element names, string sizes and lengths, attributes). The entire document tree, then, nearly fits in the 32KiB L1 cache of my netbook, (although I think that's split, so only 16KiB is instructions), and very easily indeed in its 512KiB L2 cache.

The C code might be more efficient if its tree were binary.

Topics