Sparse filter optimization

Kragen Javier Sitaker, 2019-11-01 (5 minutes)

I've written a few items about "sparse filtering" previously, by which I mean factoring a DSP filter into some network of filters which requires very few total operations per sample. Generally I'm thinking in terms of single-rate LTI filtering, not multi-rate, nonlinear, or non-shift-invariant systems, and generally the nodes in the processing graph are basically comb filters of one kind or another.
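
To make "very few total operations per sample" concrete, here's a minimal sketch in Python of the standard example of such a factorization: an N-sample moving sum built as a feedforward comb feeding an integrator, costing two additions per sample no matter how large N is.

    # N-sample moving sum as a feedforward comb (x[n] - x[n-N]) feeding an
    # integrator: two additions per sample, regardless of N (assumes N >= 1).
    def moving_sum(xs, n):
        acc = 0.0
        delay = [0.0] * n            # circular buffer holding the last n inputs
        out = []
        for i, x in enumerate(xs):
            acc += x - delay[i % n]  # comb + integrator
            delay[i % n] = x
            out.append(acc)
        return out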

This is a difficult optimization problem, essentially combinatorial in nature: even for a small number of operations per sample, such as 10, the signal flow graph can have any of a large number of topologies, and each processing node in the graph can be configured in a wide variety of ways.

I just checked, and it appears that I haven't yet written down what now seems to me like the most straightforward way to handle this design problem, unless I'm simply not finding it.

Optimizing a fixed processing graph

Let's consider a processing graph made out of continuous-time delay, sum, and amplification nodes, and the problem of approximating a desired frequency response, and perhaps phase response, with a fixed processing graph. The only parameters available are the delays and gains, and the frequency and phase response are continuous, differentiable functions of them, so we can almost certainly use gradient descent and related methods (see, e.g., Using the method of secants for general optimization and Robust local search in vector spaces using adaptive step sizes, and thoughts on extending quasi-Newton methods) to find a reasonable local optimum. In fact, for many problems, the dimensionality is small enough that I think we can use interval arithmetic to find the global optimum.
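
To make that concrete, here's a minimal sketch assuming a fixed cascade topology of feedforward sections H(w) = prod_k (a_k + b_k exp(-j w tau_k)), with an off-the-shelf quasi-Newton optimizer (BFGS) standing in for the local-search methods mentioned above:

    # Fit the gains and delays of a fixed cascade of feedforward sections
    #   H(w) = prod_k (a_k + b_k * exp(-1j*w*tau_k))
    # to a target frequency response on a grid of frequencies.
    import numpy as np
    from scipy.optimize import minimize

    w = 2 * np.pi * np.logspace(1, 4, 200)              # 10 Hz to 10 kHz, rad/s
    target = 1 / (1 + 1j * w / (2 * np.pi * 1000))      # e.g. a 1 kHz one-pole lowpass

    def response(params):
        a, b, tau = params.reshape(3, -1)
        return np.prod(a[:, None] + b[:, None] * np.exp(-1j * np.outer(tau, w)),
                       axis=0)

    def loss(params):
        return np.sum(np.abs(response(params) - target) ** 2)

    k = 4                                               # number of sections
    x0 = np.concatenate([np.ones(k), 0.1 * np.ones(k), 1e-4 * np.ones(k)])
    best = minimize(loss, x0, method='BFGS')            # a local optimum only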

This is a continuous-time relaxation of the discrete-time problem we want to solve. Once a good solution to the relaxation has been found, imposing a penalty for fractional-sample delays is likely to be sufficient to find a good solution of the discrete-time problem; branch-and-bound is a surer approach, though exponential-time.
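
One way to express that penalty, given a sample rate fs, is a smooth surrogate for distance from an integer number of samples, such as sin^2(pi * tau * fs), which vanishes exactly at integer-sample delays; a minimal sketch:

    import numpy as np

    def fractional_delay_penalty(taus, fs, weight=1.0):
        # Zero exactly when every delay is an integer number of samples at
        # rate fs, smooth and differentiable everywhere in between, so it can
        # be folded into the same gradient-based search as the main objective.
        return weight * np.sum(np.sin(np.pi * np.asarray(taus) * fs) ** 2)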

Deriving the processing graph

How can we derive the processing graph? A variety of approaches suggest themselves. We could start with a too-complex processing graph and use optimization penalties to drive most of the amplifier gains to 0, then remove the now-useless connections. We could start with a too-simple processing graph and alternate continuous-optimization phases with graph-mutation phases, which might, for example, replace a subgraph with two copies of that subgraph, each feeding an attenuator, with the attenuator outputs summed to stand in for the original subgraph's output, or insert a delay-0 element. Alternatively, we could just randomly generate a lot of graphs and try to optimize each of them separately.
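
For the first approach, one natural penalty is an L1 (lasso-style) term on the amplifier gains, which tends to push individual gains all the way to zero so the corresponding connections can be pruned; a minimal sketch:

    import numpy as np

    def sparsity_penalty(gains, weight=1e-3):
        # Added to the response-matching loss; L1 pushes individual gains all
        # the way to zero rather than merely shrinking them.
        return weight * np.sum(np.abs(gains))

    def surviving_connections(gains, threshold=1e-6):
        # Connections whose gains didn't collapse to (near) zero.
        return [i for i, g in enumerate(gains) if abs(g) > threshold]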

A nontraditional choice that might improve the mutation properties of the graph is to use comb-filter nodes rather than separate delay and gain nodes; each comb-filter node contains one delay, two gains, and a bit distinguishing whether it's feedforward or feedback.
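
Here's a minimal sketch of such a node, assuming the two gains sit on the undelayed and delayed paths respectively:

    from dataclasses import dataclass

    @dataclass
    class CombNode:
        delay: int        # delay in samples; assumed >= 1 here
        g_direct: float   # gain on the undelayed path
        g_delayed: float  # gain on the delayed path
        feedback: bool    # False: y[n] = g_direct*x[n] + g_delayed*x[n-delay]
                          # True:  y[n] = g_direct*x[n] + g_delayed*y[n-delay]

        def process(self, xs):
            buf = [0.0] * self.delay
            out = []
            for i, x in enumerate(xs):
                d = buf[i % self.delay]
                y = self.g_direct * x + self.g_delayed * d
                buf[i % self.delay] = y if self.feedback else x
                out.append(y)
            return out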

Databases and meet-in-the-middle

Of course, exploration of any optimization space can be facilitated by database and meet-in-the-middle approaches. In this case, we could begin any given search from a database of already-generated designs, starting from the designs that come closest to meeting the desired specification; and, by dividing our desired frequency and phase response by that of each item in the database, we can get a set of frequency responses to search the database for.

For example, suppose we want a zero-phase gain between 0.9 and 1.1 at 100 Hz and between 0.09 and 0.11 at 1000 Hz, and there's a signature in the database with a zero-phase gain of 0.5 at 100 Hz and 2.0 at 1000 Hz. Then, if we could find a second system with a zero-phase gain between 1.8 and 2.2 at 100 Hz and between 0.045 and 0.055 at 1000 Hz, we could cascade the two together to get the desired response.
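
The arithmetic there is just dividing the specification interval by the database item's gain at each probe frequency:

    desired = {100: (0.9, 1.1), 1000: (0.09, 0.11)}     # Hz -> (low, high) gain
    db_item = {100: 0.5, 1000: 2.0}                     # the stored signature
    needed = {f: (lo / db_item[f], hi / db_item[f])
              for f, (lo, hi) in desired.items()}
    # needed: {100: (1.8, 2.2), 1000: (0.045, 0.055)}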

In practice, though, what we'd want to do is to find a database system or pair of database systems that get close to our objective (as measured by a response curve covering hundreds to millions of frequencies, or perhaps an affine-arithmetic approximation), then use local search as described earlier to converge precisely on that objective.

Even on modern hardware, the database can reasonably weigh in at terabytes, containing millions to billions of known inexpensive candidate systems. On future hardware we could conceivably tabulate trillions.

Cost functions

Aside from the penalty terms mentioned above, plus the penalty function for not meeting the desired response, there is some computational cost we're trying to minimize; this cost depends on the implementation technology. For digital computation, components of the cost function might include the number of memory buffers, the required bit width of those buffers, the number of multiplications, and even the number of additions. Analog signal processing is crucially limited by component precision and noise immunity, so you'd want to take the derivative of the system frequency response with respect to the vector of component values and injected noise and try to minimize the magnitude of that derivative; also, analog delay lines are not only bulky and expensive but also lossy. All of these situations can be handled by tweaking the cost function.
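
For the digital case, the cost term might look something like the following sketch; the node attributes and weights here are hypothetical placeholders for whatever the target technology actually charges for:

    def digital_cost(nodes, w_mem=1.0, w_mul=4.0, w_add=1.0, w_bits=0.1):
        # Hypothetical per-node attributes: delay (buffer length in samples),
        # bits (word width), multiplies and adds per sample.
        total = 0.0
        for node in nodes:
            total += w_mem * node.delay               # buffer storage
            total += w_bits * node.delay * node.bits  # wider words cost more
            total += w_mul * node.multiplies
            total += w_add * node.adds
        return total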

Topics