I was thinking about how to implement a cascade of short-memory filters in C, how to manage all of their circular buffers, and it occurred to me that I can actually use the same circular buffer for all of them. Furthermore, especially if I’m willing to accept a few samples of extra latency, this allows me to move samples from one stage of the pipeline to the next with zero cost.
Suppose, for example, that I’m cascading the three FIR filters y(n) = x(n) - x(n-1), y(n) = x(n) - x(n-5), and y(n) = x(n) + x(n-15). Let’s suppose our circular buffer b, indexed by bi, has a size that is a power of 2 with bitmask m, so the current input sample is at b[bi & m]. It’s convenient to invert the polarity of our first filter: y(n) = x(n-1) - x(n). We can implement this as follows:
b[(bi - 1) & m] -= b[bi & m];
This converts the sample at bi-1 from being x(n-1) to being y(n).
Now our second-stage filter, y(n) = x(n) - x(n-5), is being fed an inverted signal. So if it, too, is inverted, to y(n) = x(n-5) - x(n), we can implement it as follows:
b[(bi - 6) & m] -= b[(bi - 1) & m];
This converts the sample at bi-6 from being x(n-5) to being y(n). Note that this is the value we computed in the previous step, so it doesn’t need to be loaded from RAM. The inversion of sign in this step cancels the inversion of sign in the previous step.
Then, our final stage, y(n) = x(n) + x(n-15), can be implemented similarly:
b[(bi - 21) & m] += b[(bi - 6) & m];
This converts the sample at bi-21 from being x(n-15) to being y(n).
Finally, you need to delay the output from the pipeline by 21 samples. Here I’m also dividing it by 4 to roughly compensate for the amplification by the above filters.
if (bi >= 21) buf[bi - 21] = b[(bi - 21) & m] >> 2;
Here’s a loop doing the above pipeline, although I haven’t verified that it performs the desired DSP function:
for (bi = 0; bi != n; bi++) {
    b[bi & m] = buf[bi];
    b[(bi - 1) & m] -= b[bi & m];
    b[(bi - 6) & m] -= b[(bi - 1) & m];
    b[(bi - 21) & m] += b[(bi - 6) & m];
    if (bi >= 21) buf[bi - 21] = b[(bi - 21) & m] >> 2;
}
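Wrapped up as a callable function, it might look like this. The signature and the local buffer declaration here are my reconstruction, modeled on the filter2 function below; the movsbl in the disassembly suggests the samples are signed bytes:

/* My guess at the complete function; only the loop body above is
   original.  bufsiz = 256 matches the index mask m = 255 mentioned
   below. */
void filter(signed char *buf, int n)
{
    enum { bufsiz = 256, m = bufsiz - 1 };
    int b[bufsiz] = {0};
    unsigned bi;
    for (bi = 0; bi != n; bi++) {
        b[bi & m] = buf[bi];
        b[(bi - 1) & m] -= b[bi & m];
        b[(bi - 6) & m] -= b[(bi - 1) & m];
        b[(bi - 21) & m] += b[(bi - 6) & m];
        if (bi >= 21) buf[bi - 21] = b[(bi - 21) & m] >> 2;
    }
}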
Here’s what it looks like compiled with GCC, 92 bytes:
400653: 74 5a je 4006af <filter+0x89>
400655: 4d 89 c1 mov %r8,%r9
400658: ba 00 00 00 00 mov $0x0,%edx
; LOOP START
40065d: 41 0f be 01 movsbl (%r9),%eax ; load buf[bi]
400661: 0f b6 ca movzbl %dl,%ecx ; bi & m
400664: 89 04 8c mov %eax,(%rsp,%rcx,4) ; b[bi & m] = ...
400667: 8d 4a ff lea -0x1(%rdx),%ecx ; bi - 1
40066a: 0f b6 c9 movzbl %cl,%ecx ; & m
40066d: 8b 1c 8c mov (%rsp,%rcx,4),%ebx ; load b[(bi-1) & m]
400670: 29 c3 sub %eax,%ebx ; - b[bi & m]
400672: 89 d8 mov %ebx,%eax
400674: 89 1c 8c mov %ebx,(%rsp,%rcx,4) ; store b[(bi-1) & m]
400677: 8d 4a fa lea -0x6(%rdx),%ecx ; bi - 6
40067a: 0f b6 c9 movzbl %cl,%ecx ; & m
40067d: 8b 1c 8c mov (%rsp,%rcx,4),%ebx ; load b[(bi-6) & m]
400680: 29 c3 sub %eax,%ebx ; - b[(bi-1) & m]
400682: 89 d8 mov %ebx,%eax
400684: 89 1c 8c mov %ebx,(%rsp,%rcx,4) ; store b[(bi-6) & m]
400687: 8d 4a eb lea -0x15(%rdx),%ecx ; bi - 21
40068a: 44 0f b6 d1 movzbl %cl,%r10d ; & m (into %r10d, leaving bi-21 in %ecx)
40068e: 42 03 04 94 add (%rsp,%r10,4),%eax ; b[(bi-21) & m] + ...
400692: 42 89 04 94 mov %eax,(%rsp,%r10,4) ; store it
400696: 83 fa 14 cmp $0x14,%edx ; bi >= 21?
400699: 76 09 jbe 4006a4 <filter+0x7e>
40069b: 89 c9 mov %ecx,%ecx ; effectively a NOP (zero-extends %ecx into %rcx)
40069d: c1 f8 02 sar $0x2,%eax ; b[...] >> 2
4006a0: 41 88 04 08 mov %al,(%r8,%rcx,1) ; buf[bi-21] = ...
4006a4: 83 c2 01 add $0x1,%edx ; bi++
4006a7: 49 83 c1 01 add $0x1,%r9 ; ++&buf[bi]
4006ab: 39 fa cmp %edi,%edx ; bi != n?
4006ad: 75 ae jne 40065d <filter+0x37>
It seems like %edx is being used as our loop counter bi, %r9 as the input pointer &buf[bi], and %edi as the limit n. My index mask m is 255, so GCC is implementing the masking with movzbl instructions. Each stage of the pipeline requires 6 instructions, for whatever that’s worth in 2018, and there’s an additional 4 instructions or so of loop overhead, for a total of 28 instructions inside the loop.
The Cortex-M4 equivalent is as follows, 80 bytes and 29 instructions, 25 inside the loop:
22: b334 cbz r4, 72 <filter+0x72>
24: f105 3eff add.w lr, r5, #4294967295 ; 0xffffffff
28: 2300 movs r3, #0
2a: aa01 add r2, sp, #4
; LOOP START
2c: f81e 1f01 ldrb.w r1, [lr, #1]! ; buf[bi]
30: b2d8 uxtb r0, r3 ; bi & m?
32: f842 1020 str.w r1, [r2, r0, lsl #2] ; b[bi & m] = ...
36: 1e58 subs r0, r3, #1 ; bi - 1
38: b2c0 uxtb r0, r0 ; & m?
3a: f852 6020 ldr.w r6, [r2, r0, lsl #2] ; load b[(bi-1) & m]
3e: 1a76 subs r6, r6, r1 ; subtract r1 from it
40: f842 6020 str.w r6, [r2, r0, lsl #2] ; store b[(bi-1) & m]
44: 1f98 subs r0, r3, #6 ; bi - 6
46: b2c0 uxtb r0, r0 ; & m?
48: f852 1020 ldr.w r1, [r2, r0, lsl #2] ; load b[bi-6 & m]
4c: 1b8e subs r6, r1, r6 ; subtract r6 from it
4e: f842 6020 str.w r6, [r2, r0, lsl #2] ; store b[bi-6 & m]
52: f1a3 0015 sub.w r0, r3, #21 ; bi - 21
56: b2c0 uxtb r0, r0 ; & m?
58: f852 1020 ldr.w r1, [r2, r0, lsl #2] ; load b[bi-21 & m]
5c: 4431 add r1, r6 ; add r6 to it
5e: f842 1020 str.w r1, [r2, r0, lsl #2] ; save it
62: 2b14 cmp r3, #20 ; bi > 20?
64: bf84 itt hi ; conditional, unsigned greater
66: 1089 asrhi r1, r1, #2 ; r1 >>= 2
68: f80e 1c15 strbhi.w r1, [lr, #-21]; store buf[bi-21]?
6c: 3301 adds r3, #1
6e: 42bb cmp r3, r7
70: d1dc bne.n 2c <filter+0x2c>
This uses 5 instructions for each stage of the pipeline, against AMD64’s 6, and unlike on AMD64 implementations, I think that number actually does mean 5 clock cycles per stage under normal circumstances, since the Cortex-M4 executes in order.
Actually, on further thought, I think my concern about the 21 samples of latency was unfounded. There’s a data path straight through the loop body from buf[bi] to the final filter output for sample bi, so that output ought to be written back to buf[bi]; instead it’s being written to buf[bi-21]. There’s additionally a delayed path through the FIFOs, which is as it should be.
(I fixed that bug and removed the >> 2 and verified that the impulse response was as it should be.)
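A minimal sketch of such a check, my reconstruction rather than the original test code: feed in a unit impulse and print the nonzero taps. Composing the three filters y(n) = x(n) - x(n-1), y(n) = x(n) - x(n-5), and y(n) = x(n) + x(n-15) by hand, the impulse response should be +1 at lags 0, 6, 15, and 21, and -1 at lags 1, 5, 16, and 20.

#include <stdio.h>

/* Hedged sketch of an impulse-response check; the output line writes
   buf[bi] (the bug fixed) and omits the >> 2. */
int main(void)
{
    enum { bufsiz = 256, m = bufsiz - 1, N = 64 };
    signed char buf[N] = {1};          /* unit impulse at n = 0 */
    int b[bufsiz] = {0};
    for (unsigned bi = 0; bi != N; bi++) {
        b[bi & m] = buf[bi];
        b[(bi - 1) & m] -= b[bi & m];
        b[(bi - 6) & m] -= b[(bi - 1) & m];
        b[(bi - 21) & m] += b[(bi - 6) & m];
        buf[bi] = b[(bi - 21) & m];    /* straight-through output */
    }
    for (int i = 0; i < N; i++)
        if (buf[i]) printf("h(%d) = %d\n", i, buf[i]);
    return 0;
}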
Recursive filters can also be implemented with this structure. Suppose we have a cascade of two humble integrators, each computing y(n) = x(n) + y(n-1), as we might for the beginning of a Hogenauer filter. Of course, we could implement this as follows:
y1 += x;
y2 += y1;
But if we stick to this staged-FIFO structure so that we can implement arbitrary lags, it looks like this:
void filter2(char *buf, int n)
{
    enum { bufsiz = 256, m = bufsiz-1 };
    int b[bufsiz] = {0};
    for (unsigned bi = 0; bi != n; bi++) {
        b[bi & m] = buf[bi];
        b[bi & m] += b[(bi - 1) & m];        // ∫1
        b[(bi - 1) & m] += b[(bi - 2) & m];  // ∫2
        b[(bi - 7) & m] -= b[(bi - 1) & m];
        b[(bi - 12) & m] -= b[(bi - 7) & m];
        buf[bi] = b[(bi - 12) & m] >> 3;     // compensate for 30× amplifier
    }
}
The ∫1 line computes the new y(n) value from the new x(n) value and the previous y(n) value; the ∫2 line does the same thing, but for the second integrator. Absent any further transmogrification, it leaves a trail of second-order integrator values behind in the buffer. Then, immediately below, we have two feedforward comb filter lines, which tame the wild integrators and convert them into a mild-mannered triangular-kernel FIR filter, completing the structure of a second-order Hogenauer (CIC) low-pass filter, although in this case without the usual decimation step.
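(A gain sanity check, by my arithmetic rather than anything verified in the code: an integrator followed by a comb of span k is just a moving sum of k samples, with DC gain k. The combs here span 6 and 5 samples respectively, so the DC gain of the cascade is 6 × 5 = 30, which is what the scaling at the end is compensating for.)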
In ∫1, because we don’t need to preserve x(n), we can transform it into y(n), saving a memory load. But in the ∫2 stage, we do need to preserve x(n) (the previous stage’s y(n)), so we just overwrite y(n-1) instead.
Again, in the comb filters, we’re inverting the sign here for convenience — instead of computing y(n) = x(n) - x(n-k), we’re computing y(n) = x(n-k) - x(n) — which requires no extra work because we have an even number of such inverting comb filters in the pipeline.
This filter function ends up as follows on amd64 with -Os:
9b: 48 81 ec 18 04 00 00 sub $0x418,%rsp
a2: 48 89 fa mov %rdi,%rdx
a5: b9 00 01 00 00 mov $0x100,%ecx ; buffer size to zero
aa: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
b1: 00 00
b3: 48 89 84 24 08 04 00 mov %rax,0x408(%rsp)
ba: 00
bb: 31 c0 xor %eax,%eax
bd: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
c2: f3 ab rep stos %eax,%es:(%rdi) ; zero the buffer
c4: 48 89 d7 mov %rdx,%rdi
;; LOOP START
c7: 39 f0 cmp %esi,%eax ; bi (%eax) != n
c9: 74 64 je 12f <filter2+0x94>
cb: 0f be 0f movsbl (%rdi),%ecx ; buf[bi]
ce: 8d 50 ff lea -0x1(%rax),%edx ; bi - 1
d1: 44 0f b6 c0 movzbl %al,%r8d ; bi & m
d5: 48 ff c7 inc %rdi ; ++&buf[bi]
d8: 0f b6 d2 movzbl %dl,%edx ; (bi - 1) & m
db: 42 89 4c 84 08 mov %ecx,0x8(%rsp,%r8,4) ; b[bi & m] = buf[bi]
e0: 03 4c 94 08 add 0x8(%rsp,%rdx,4),%ecx ; … + b[(bi-1) & m] (∫1)
e4: 42 89 4c 84 08 mov %ecx,0x8(%rsp,%r8,4) ; store b[bi & m] (∫1 result)
e9: 8d 48 fe lea -0x2(%rax),%ecx ; bi - 2
ec: 44 8b 44 94 08 mov 0x8(%rsp,%rdx,4),%r8d ; XXX redundant reload of b[(bi-1) & m]
f1: 0f b6 c9 movzbl %cl,%ecx ; (bi - 2) & m
f4: 44 03 44 8c 08 add 0x8(%rsp,%rcx,4),%r8d ; + b[(bi-2) & m] (∫2)
f9: 8d 48 f9 lea -0x7(%rax),%ecx ; bi - 7
fc: 0f b6 c9 movzbl %cl,%ecx ; & m
ff: 44 89 44 94 08 mov %r8d,0x8(%rsp,%rdx,4) ; store b[(bi-1) & m] (∫2)
104: 8b 54 8c 08 mov 0x8(%rsp,%rcx,4),%edx ; load b[(bi-7) & m]
108: 44 29 c2 sub %r8d,%edx ; - b[(bi-1) & m]
10b: 89 54 8c 08 mov %edx,0x8(%rsp,%rcx,4) ; store b[(bi-7) & m]
10f: 8d 48 f4 lea -0xc(%rax),%ecx ; bi - 12
112: ff c0 inc %eax ; bi++
114: 0f b6 c9 movzbl %cl,%ecx ; & m
117: 44 8b 4c 8c 08 mov 0x8(%rsp,%rcx,4),%r9d ; load b[(bi-12) & m]
11c: 41 29 d1 sub %edx,%r9d ; - b[(bi-7) & m]
11f: 44 89 ca mov %r9d,%edx ; copy for output
122: 44 89 4c 8c 08 mov %r9d,0x8(%rsp,%rcx,4) ; store b[(bi-12) & m]
127: c1 fa 03 sar $0x3,%edx ; >> 3
12a: 88 57 ff mov %dl,-0x1(%rdi) ; buf[bi] = ...
12d: eb 98 jmp c7 <filter2+0x2c>
12f: 48 8b 84 24 08 04 00 mov 0x408(%rsp),%rax
136: 00
137: 64 48 33 04 25 28 00 xor %fs:0x28,%rax ; stack canary‽
13e: 00 00
140: 74 05 je 147 <filter2+0xac>
142: e8 00 00 00 00 callq 147 <filter2+0xac>
147: 48 81 c4 18 04 00 00 add $0x418,%rsp
14e: c3 retq
Looks like it has 6 stores, 5 fetches, and 30 instructions inside the loop to realize this four-stage pipeline, but again only 5 or 6 instructions per stage of the pipeline. I was hoping to reach some kind of conclusion here about whether recursive filtering was going to be slightly costlier, but I really have no idea.
It looks to me like GCC has given up on register allocation in part of this code and is just emitting redundant loads, which is weird.
However, this is definitely less costly:
y1 += x;
y2 += y1;
And if you’re doing a Hogenauer filter, you should probably not only do that, but you should also decimate the stuff downstream from the integrators.
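A sketch of what that might look like, my illustration rather than code from this note: the integrators run at the input rate, and the combs run only once every R samples, at which point their state shrinks to a couple of scalars because the lags are short at the decimated rate.

/* Hedged sketch: second-order decimate-by-R Hogenauer (CIC) filter
   with differential delay 1; cic2_decimate is a made-up name.  Writes
   about n/R samples to out and returns the count. */
int cic2_decimate(const signed char *in, int n, int *out, int R)
{
    long long i1 = 0, i2 = 0;          /* integrator state, input rate */
    long long c1 = 0, c2 = 0;          /* comb state, output rate */
    int no = 0;
    for (int i = 0; i < n; i++) {
        i1 += in[i];                   /* first integrator */
        i2 += i1;                      /* second integrator */
        if (i % R == R - 1) {          /* decimate: combs run at 1/R rate */
            long long t1 = i2 - c1; c1 = i2;  /* first comb */
            long long t2 = t1 - c2; c2 = t1;  /* second comb */
            out[no++] = (int)t2;       /* DC gain is R*R; rescale to taste */
        }
    }
    return no;
}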
But you can use this approach for other kinds of recursive filtering, too.
On one core of my laptop, this code filters 100 mebibytes (of malloced, mostly uninitialized memory) in 785 ms. That’s about 134 million samples per second, or about 12 clock cycles per sample, or about 3 clock cycles per sample per pipeline stage. Presumably the number on a Cortex-M4 would be more like 6 clock cycles per sample per pipeline stage, due to in-order execution.
This suggests that processing a signal at 44.1 ksps would require about 132 kHz of amd64 clock per stage or 266 kHz of Cortex-M4 clock per stage. This seems like it might come in at the sub-MIPS level I was hoping for in the Bleep modem.
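Incidentally, the timing measurement above was with a harness along these lines; this is a reconstruction, since the original benchmark code isn’t in this note:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void filter2(char *buf, int n);

int main(void)
{
    int n = 100 * 1024 * 1024;          /* 100 mebibytes */
    char *buf = malloc(n);              /* mostly uninitialized */
    if (!buf) return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    filter2(buf, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3
              + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("%d bytes in %.0f ms\n", n, ms);
    return 0;
}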
So here’s a Hogenauer filter with a twist — it’s tuned to detect one particular frequency (19110 Hz) and null another one (17640 Hz). At least, that was the intention. I haven’t tested this C code, just code in Python that purports to be equivalent, except that it uses floating-point (!!).
#include <stdint.h>

unsigned filter3(signed char *buf, int n)
{
    enum { bufsiz = 256, m = bufsiz-1 };
    int b[bufsiz] = {0};
    uint64_t d1=0, d2=0, d3=0, d4=0, i1=0, i2=0, i3=0, tmp, tmp2, c2=0, c3=0;
    int bi5=0, bo=0, odd=0;
    for (unsigned bi = 0; bi != n; bi++, bi5++) {
        tmp = buf[bi];
        tmp2 = tmp - d1;  /* Differentiator 1 */
        d1 = tmp;
        tmp = tmp2 - d2;  /* Differentiator 2 */
        d2 = tmp2;
        tmp2 = tmp - d3;  /* Differentiator 3 */
        d3 = tmp;
        tmp = tmp2 - d4;  /* Differentiator 4 */
        d4 = tmp2;
        i1 += odd ? -tmp : tmp;  /* Integrator 1 */
        odd = ~odd;
        i2 += i1;  /* Integrator 2 */
        i3 += i2;  /* Integrator 3 */
        if (bi5 == 5) {
            bi5 = 0;
            /* First comb filter: y(n) = x(n-2) - x(n); note inversion */
            b[bi & m] = i3;
            b[(bi - 2) & m] -= i3;
            /* Second comb filter: y(n) = x(n-1) - x(n); canceling inversion */
            tmp = c2 - b[(bi - 2) & m];
            c2 = b[(bi - 2) & m];
            /* Third comb filter: y(n) = x(n) - x(n-1), this time not
               inverting; but now into the output buffer */
            buf[bo++] = tmp - c3;
            c3 = tmp;
        }
    }
    return bo;
}
This rather alarming C comes out to this perhaps even more alarming assembly:
000000000000014f <filter3>:
14f: 41 57 push %r15
151: 41 56 push %r14
153: b9 00 01 00 00 mov $0x100,%ecx
158: 41 55 push %r13
15a: 41 54 push %r12
15c: 49 89 fc mov %rdi,%r12
15f: 55 push %rbp
160: 53 push %rbx
161: 31 d2 xor %edx,%edx
163: 45 31 db xor %r11d,%r11d
166: 45 31 c0 xor %r8d,%r8d
169: 45 31 ff xor %r15d,%r15d
16c: 48 81 ec 38 04 00 00 sub $0x438,%rsp
173: 45 31 d2 xor %r10d,%r10d
176: 45 31 c9 xor %r9d,%r9d
179: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
180: 00 00
182: 48 89 84 24 28 04 00 mov %rax,0x428(%rsp)
189: 00
18a: 31 c0 xor %eax,%eax
18c: 48 8d 7c 24 28 lea 0x28(%rsp),%rdi
191: 89 74 24 1c mov %esi,0x1c(%rsp)
195: 45 31 f6 xor %r14d,%r14d
198: 45 31 ed xor %r13d,%r13d
19b: 31 ed xor %ebp,%ebp
19d: 31 db xor %ebx,%ebx
19f: f3 ab rep stos %eax,%es:(%rdi)
1a1: 31 ff xor %edi,%edi
; LOOP START
1a3: 3b 54 24 1c cmp 0x1c(%rsp),%edx ; bi != n?
1a7: 89 54 24 18 mov %edx,0x18(%rsp) ; save bi
1ab: 0f 84 a0 00 00 00 je 251 <filter3+0x102>
1b1: 49 0f be 34 14 movsbq (%r12,%rdx,1),%rsi ; sign-extending byte to quad?
1b6: 48 89 34 24 mov %rsi,(%rsp)
1ba: 48 29 de sub %rbx,%rsi ; d1
1bd: 48 89 74 24 08 mov %rsi,0x8(%rsp)
1c2: 48 29 ee sub %rbp,%rsi ; d2
1c5: 48 89 74 24 10 mov %rsi,0x10(%rsp)
1ca: 4c 29 ee sub %r13,%rsi ; d3
1cd: 48 89 f5 mov %rsi,%rbp ; tmp2
1d0: 49 89 f5 mov %rsi,%r13 ; tmp2 (becomes d4 below)
1d3: 4c 29 f5 sub %r14,%rbp ; d4
1d6: 48 89 eb mov %rbp,%rbx ; tmp
1d9: 48 f7 db neg %rbx ; -tmp
1dc: 45 85 db test %r11d,%r11d ; odd?
1df: 41 f7 d3 not %r11d ; odd = ~odd
1e2: 48 0f 44 dd cmove %rbp,%rbx ; odd ? -tmp : tmp
1e6: 49 01 d9 add %rbx,%r9 ; i1
1e9: 4d 01 ca add %r9,%r10 ; i2
1ec: 4c 01 d7 add %r10,%rdi ; i3
1ef: 41 83 f8 05 cmp $0x5,%r8d
1f3: 75 40 jne 235 <filter3+0xe6>
; DECIMATED CONDITIONAL START
1f5: 8b 5c 24 18 mov 0x18(%rsp),%ebx ; bi
1f9: 44 0f b6 44 24 18 movzbl 0x18(%rsp),%r8d ; bi & m
1ff: 83 eb 02 sub $0x2,%ebx ; bi - 2
202: 0f b6 db movzbl %bl,%ebx ; & m
205: 42 89 7c 84 28 mov %edi,0x28(%rsp,%r8,4) ; b[bi & m] = i3
20a: 44 8b 44 9c 28 mov 0x28(%rsp,%rbx,4),%r8d ; b[bi-2 & m]
20f: 41 29 f8 sub %edi,%r8d ; x(n-2) - x(n)
212: 44 89 44 9c 28 mov %r8d,0x28(%rsp,%rbx,4) ; ‽
217: 4d 63 c0 movslq %r8d,%r8 ; sign-extend comb 1 output
21a: 48 63 d8 movslq %eax,%rbx ; bo
21d: 4c 29 c1 sub %r8,%rcx ; tmp = c2 - b[(bi-2) & m]
220: ff c0 inc %eax ; bo++
222: 40 88 cd mov %cl,%bpl ; low byte of tmp
225: 44 29 fd sub %r15d,%ebp ; - c3 (only the low byte matters)
228: 49 89 cf mov %rcx,%r15 ; c3 = tmp
22b: 4c 89 c1 mov %r8,%rcx ; c2 = b[(bi-2) & m]
22e: 41 88 2c 1c mov %bpl,(%r12,%rbx,1) ; buf[bo] = tmp - c3
232: 45 31 c0 xor %r8d,%r8d ; bi5 = 0
; DECIMATED CONDITIONAL END
235: 4d 89 ee mov %r13,%r14 ; d4 = tmp2
238: 41 ff c0 inc %r8d ; bi5++
23b: 48 ff c2 inc %rdx ; bi++
23e: 4c 8b 6c 24 10 mov 0x10(%rsp),%r13 ; d3 = tmp
243: 48 8b 6c 24 08 mov 0x8(%rsp),%rbp ; d2 = tmp2
248: 48 8b 1c 24 mov (%rsp),%rbx ; d1 = x(n)
24c: e9 52 ff ff ff jmpq 1a3 <filter3+0x54>
; LOOP END
251: 48 8b 94 24 28 04 00 mov 0x428(%rsp),%rdx
258: 00
259: 64 48 33 14 25 28 00 xor %fs:0x28,%rdx
260: 00 00
262: 74 05 je 269 <filter3+0x11a>
264: e8 00 00 00 00 callq 269 <filter3+0x11a>
269: 48 81 c4 38 04 00 00 add $0x438,%rsp
270: 5b pop %rbx
271: 5d pop %rbp
272: 41 5c pop %r12
274: 41 5d pop %r13
276: 41 5e pop %r14
278: 41 5f pop %r15
27a: c3 retq
So, this has an 8-stage per-sample pipeline (which will be shared with the other ultrasound signals I want to detect), followed by a decimator and a three-stage per-output-sample pipeline. The loop overhead is 3 instructions per sample at the beginning and 7 at the end; the per-sample pipeline code, including the decimation test, is 20 instructions; and the decimated code is 18 instructions, which averages 3.6 instructions per loop iteration. All in all this should run in 33.6 instructions per sample, which would be about 13 amd64 clock cycles per sample if the same instructions-to-cycles ratio held as for the previous code, whose performance I actually measured at 30 instructions and about 12 cycles per sample. Also, counting the decimator, that works out to 12 pipeline stages, which means about 3 instructions or 1.1 clock cycles per stage, definitely an improvement over the earlier numbers.
It probably isn’t necessary to use 64-bit math in here, and definitely not for all the variables it’s being used for. But I haven’t done the analysis yet to be sure I understand the Hogenauer overflows.
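(A back-of-the-envelope version, assuming Hogenauer’s usual result carries over: with two’s-complement wraparound arithmetic, every register needs only enough bits for the largest possible final output, which is the input width plus the log2 of the total gain. The comb spans here are about 2, 1, and 1 output samples at a decimation ratio of 5, for a gain of roughly (5×2)×5×5 = 250, or 8 bits, and the four front-end differentiators can each at most double the signal, another 4 bits, so an 8-bit input needs something like 8 + 8 + 4 = 20 bits; 32-bit ints ought to be ample. But I haven’t checked this against the code.)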
This suggests that doing the full demodulation this way will require a bit over a Cortex-M4 MIPS.
The differentiators on the front end might not be necessary; they're there to filter out the (commonly much more powerful) low-frequency noise.