Rediscovering successive parabolic interpolation: derivative-free optimization of scalar functions by fitting a parabola

Kragen Javier Sitaker, 2019-11-26 (updated 2019-11-27) (8 minutes)

There's a relatively simple derivative-free algorithm with superlinear convergence for finding minima or maxima of regular one-dimensional functions, analogous to the method of secants for finding their zeroes. But under almost all circumstances there are better algorithms.

Deriving successive parabolic interpolation

In $1 recognizer diagrams it is mentioned that the $1 recognizer uses golden-section search, a la Numerical Recipes, to approximate the optimal rotation. This algorithm has very slow convergence, similar to binary chop for finding function zeroes.

Is there a way to adapt the method of secants (see Using the method of secants for general optimization) to approximate the minimum faster? If you have three points (suppose, WOLOG, in order) x0, x1, x2 and the function values y0 = f(x0), y1 = f(x1), y2 = f(x2), then you can calculate some divided differences: (y1 - y0)/(x1 - x0) gives you precisely the average of the derivative on (x0, x1), and similarly for (x1, x2). If the second derivative doesn't vary much over the interval, these give us good estimates of the derivative f' at ½(x0 + x1) and ½(x1 + x2). From those two, we should be able to linearly interpolate or extrapolate to find where the derivative should have a zero crossing, and we can sample another point there, y3 = f(x3).

This vaguely sounds like Nelder-Mead simplex optimization in one dimension, but I don't understand Nelder-Mead well enough to say.

Since this is just looking for a zero of the derivative, it will equally well find maxima or minima, depending on the average second derivative in the neighborhood.

In essence, this amounts to calculating the extremum of a parabola fit to the last three points.

So let's work this out.

min_next = (f, x0, x1, x2) => {
    const y0 = f(x0)             // (redundant)
        , y1 = f(x1)             // (redundant)
        , y2 = f(x2)             // f at latest point
        , d0 = (y1-y0)/(x1-x0)   // (redundant)
        , d1 = (y2-y1)/(x2-x1)   // derivative near latest point
        , xa0 = (x0+x1)/2        // (redundant)
        , xa1 = (x1+x2)/2        // where that derivative is
        , d2 = (d1-d0)/(xa1-xa0) // estimate of second derivative,
                                 //     hope it's not zero
        , dx = d1 / d2           // distance from xa1 to
                                 //     extrapolated extremum
    return xa1 - dx              // return extrapolated extremum
}

min_next_n = (f, x0, x1, n) => {
    const xi = [x0, x1, (x0+x1)/2]

    for (let i = 0; i < n; i++) {
        const m = xi.length
        if (isNaN(xi[m - 1])) break
        xi.push(min_next(f, xi[m - 3], xi[m - 2], xi[m - 1]))
    }

    return xi
}

parabolic_extremum = (f, x0, x1, n) => {
    const xi = min_next_n(f, x0, x1, n || 30), m = xi.length
    return isNaN(xi[m - 1]) ? xi[m - 2] : xi[m - 1]
}

On simple examples like x => x * (1 - x) + 0.001 * x*x*x, 0.2, 0.1 it seems to have pretty fast convergence, appearing close to the order-φ convergence of the one-dimensional method of secants. If it happens to run into a horizontal line, though, it fails; for example, x => x * (1 - x)**2, 0, 1 crashes out at 0.5, because f(0) = f(1) makes the first fitted parabola's vertex land exactly on the existing sample at 0.5, so the next iteration divides by zero. The actual maximum is, of course, at 1/3, and a better starting point finds it. (And there's a minimum at 1.)
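
For concreteness, here's what those experiments look like as calls to the definitions above; the comments record the behavior just described, and the last call is merely one guess at a "better starting point" (exact floating-point output may vary):

// Converges quickly to the maximum just above x = 0.5:
parabolic_extremum(x => x * (1 - x) + 0.001 * x*x*x, 0.2, 0.1)

// Gets stuck at 0.5: f(0) = f(1), so the next step divides by zero:
parabolic_extremum(x => x * (1 - x)**2, 0, 1)

// With a better starting pair it converges to the maximum at 1/3:
parabolic_extremum(x => x * (1 - x)**2, 0, 0.2)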

It seems to have some difficulty with rounding, only producing about 9 places of accuracy in that last example, perhaps because the differences in the second-derivative estimate above become very small and lose precision to cancellation.

The lines marked (redundant) above are things that, after the first iteration, were already calculated in the previous iteration, so we can avoid calculating them again. What's left over is one function evaluation, five subtractions, an addition, three arbitrary divisions, and a division by two that could instead be a multiplication by a half. To me four divisions per iteration seems kind of heavy-weight.
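
As a sketch of the non-redundant iteration (my restructuring of min_next, not something the code above actually does), each step can carry forward the previously computed values and pay for only a single function evaluation:

min_state = (f, x0, x1, x2) => {         // seed the state from three points
    const y0 = f(x0), y1 = f(x1)
    return { x1, x2, y1, d0: (y1 - y0)/(x1 - x0), xa0: (x0 + x1)/2 }
}

min_step = (f, s) => {                   // one iteration, one call to f
    const y2 = f(s.x2)                       // the only function evaluation
        , d1 = (y2 - s.y1)/(s.x2 - s.x1)     // derivative near latest point
        , xa1 = (s.x1 + s.x2)/2              // where that derivative is
        , d2 = (d1 - s.d0)/(xa1 - s.xa0)     // second-derivative estimate
        , x3 = xa1 - d1/d2                   // extrapolated extremum
    return { x1: s.x2, x2: x3, y1: y2, d0: d1, xa0: xa1 }
}

Iterating s = min_step(f, s) from s = min_state(f, x0, x1, (x0+x1)/2) visits the same points as min_next_n, with s.x2 as the current estimate.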

You might think it would have poor convergence because, when we sample the new point there, we are in effect sampling the derivative halfway between that new point and the last old point, so we're only getting halfway to the destination. But this only delays convergence by about one iteration: if that derivative average and the previous one point to a place very close to the new point, because the second derivative is almost constant over that interval, then the point after that will be very close indeed. So it works out.

It actually can tell whether the extremum it's approaching is a maximum or a minimum, because the estimate that it gets of the second derivative tells it. By replacing d2 with Math.abs(d2) in the above code, we could make it seek only extrema of one kind, as sketched below.
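
For instance, a minimum-seeking variant might look like the following sketch (min_next_min is my name, not part of the code above); with Math.abs the step always moves downhill, so maxima repel the iteration instead of attracting it:

min_next_min = (f, x0, x1, x2) => {      // like min_next, but seeks only minima
    const d0 = (f(x1) - f(x0))/(x1 - x0)
        , d1 = (f(x2) - f(x1))/(x2 - x1)
        , xa0 = (x0 + x1)/2
        , xa1 = (x1 + x2)/2
        , d2 = (d1 - d0)/(xa1 - xa0)
    return xa1 - d1/Math.abs(d2)         // |d2| forces the step downhill
}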

Unlike the method-of-secants stuff in Using the method of secants for general optimization, it isn't apparent to me how to extend this to functions of vector arguments, functions of more than one argument, or vector-valued functions.

This is called "successive parabolic interpolation" and you should almost never use it

The realization that this amounts to fitting a parabola to the last three points led me to the discovery that the standard name for this method is "successive parabolic interpolation"; Wikipedia explains that it has an order of convergence of about 1.33 and has robustness problems. In particular, the fact that its order is less than √2 means that it's slower than using Newton-Raphson iteration on a derivative evaluated using automatic differentiation, rather than faster, as with the method of secants. (√2 is the effective per-evaluation order of Newton-Raphson here: each of its iterations has order 2 but costs about two evaluations, where successive parabolic interpolation costs only one.)

The divided-differences approach given above is, I think, faster than the approach I've found elsewhere of solving the Vandermonde linear system, because it's more incremental. But divided differences are of course a standard way to do polynomial interpolation, and their incrementality is one of their standard advantages.
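
To make the comparison concrete, here's a sketch of the non-incremental version, writing out the fitted parabola y = a + bx + cx² via divided differences rather than literally solving the Vandermonde system; its vertex -b/(2c) is the same point min_next returns:

parabola_vertex = (x0, y0, x1, y1, x2, y2) => {
    const c = ((y2 - y1)/(x2 - x1) - (y1 - y0)/(x1 - x0))/(x2 - x0)  // quadratic coefficient
        , b = (y1 - y0)/(x1 - x0) - c*(x0 + x1)                      // linear coefficient
    return -b/(2*c)                                                  // vertex of the parabola
}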

The only time when it is faster to use successive parabolic interpolation rather than Newton-Raphson iteration is when you can't differentiate the function you're optimizing, even though you believe it to be regular, for example because it's experimentally measured rather than computed, or because your software tools are outdated and don't support automatic differentiation, or because its second derivative fails to exist. (If its first derivative fails to exist, successive parabolic interpolation won't work either.) In such a situation, golden-section search is slower, though sometimes the difference only matters at high precision; and golden-section search is more robust and doesn't have the rounding problems mentioned above. (The optimize function in R uses a mixture of golden-section search and successive parabolic interpolation known as Brent's method.)
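
For reference, golden-section search itself is only a few lines; this is a standard textbook sketch (not the exact Numerical Recipes or R code), assuming f is unimodal on [a, b], spending one function evaluation per iteration and shrinking the bracket by a factor of about 0.618 each time:

golden_min = (f, a, b, n) => {
    const r = (Math.sqrt(5) - 1)/2           // 1/φ ≈ 0.618
    let c = b - r*(b - a), d = a + r*(b - a)
        , fc = f(c), fd = f(d)
    for (let i = 0; i < (n || 100); i++) {
        if (fc < fd) { b = d; d = c; fd = fc; c = b - r*(b - a); fc = f(c) }
        else         { a = c; c = d; fc = fd; d = a + r*(b - a); fd = f(d) }
    }
    return (a + b)/2                         // midpoint of the final bracket
}

Shrinking the initial bracket by a factor of 10^15 takes about 72 such iterations, which is the sense in which it's slower than the superlinear methods.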

So, for example, min_next_n(x => (1/x - 4)**2 , 0.001, 0.1, 30) in the above evaluator takes 7 iterations to converge to within 0.1 of the correct result (0.25), 6 more iterations to converge to within 0.01 of the correct result, and 2 more iterations to converge to within 0.001 of the correct result, at which point it starts having reasonable convergence. I think golden-section search would actually be faster up to that point, although I haven't tried it. However, for functions that are well approximated by a quadratic, successive parabolic interpolation is really fast; for example, in min_next_n(x => Math.sin(x*Math.PI) + Math.sin(x*2*Math.PI), 0.001, 0.1, 30) it eventually converges to 0.29791559955277436, but it's already at 0.2987 in 5 iterations.
