Using the method of secants for general optimization

Kragen Javier Sitaker, 2019-07-22 (updated 2019-11-26) (13 minutes)

The method of secants is an algorithm the humans have been using for some 3000 years to solve a fairly wide variety of inverse problems, or, to their modern way of thinking, find zeroes of a fairly large class of functions. Given a function f from some vector x to the underlying field of that vector, such as ℂ or ℝ, we compute a sequence of iterative approximations:

xₙ = xₙ₋₁ - f(xₙ₋₁) · (xₙ₋₁ - xₙ₋₂) / (f(xₙ₋₁) - f(xₙ₋₂))

You can think of this as a variant of Newton–Raphson iteration, using the secant approximation to the tangent line.

Unlike Newton–Raphson iteration, it doesn’t require computing the derivative of the function, and it has slightly faster convergence under the usual assumptions used to prove the convergence of Newton–Raphson iteration — under those circumstances, the error after each iteration is the previous error to the power φ ≈ 1.618, but each iteration only requires computing a single new value of the function being solved for, while Newton–Raphson iteration requires computing a new value of the function and also of its derivative, which is usually about twice as much work. So in some sense it converges about 31% faster.

However, it needs two guesses to get started instead of one, which makes its convergence conditions somewhat more complicated to describe.

Some ideas occurred to me about how to use the method of secants, so I thought I’d write them down.

I used the method of secants as an extended example in Separating implementation, optimization, and proofs.

(This is probably crushingly naïve compared to all the work out there on optimization methods I don’t understand yet, like Nelder–Mead, Broyden’s method, and the Levenberg–Marquardt algorithm, not to mention the stunning successes in recent years with variants of gradient descent; but I thought it would be worth writing down.)

(Unrelated: the Method of Wecants, a technique for declining to do something you don’t want to do while blaming someone else.)

Vector-domain functions

The method of secants is normally described as a one-dimensional root-finding method, but above, I said that you can generalize the domain of the function to be a vector, as suggested by the form of its recurrence relation. What happens in practice if you try that?

Consider f : ℝ² → ℝ = λ(a, b).a² + b² - 1, a paraboloid which is zero everywhere along the unit circle. If our initial starting guesses are x₀ = (1, 1) and x₁ = (2, 0), the values just sort of randomly oscillate:

That’s because each x value in the method of secants is an affine combination of the previous two values, so there’s no way for them to get off the line b = 2 - a; and, as it happens, that line doesn’t intersect the unit circle. If you’re looking for an intersection of that line with the circle, or more likely some hairy implicit function, that might be great — although, if there are multiple intersections, there’s no guarantee about which one you’ll get, unlike with signed-distance-function raytracing. But if you’re trying to find any solution, it’s not so great that you need to start by picking two points that are collinear with it.

Other pairs of starting points converge just fine for the same function:

There are different approaches to solving this problem. Perhaps the simplest possible one is to use a secant, not through the last two points, but through the first and last of the last m points:

xₙ = xₙ₋₁ - f(xₙ₋₁) · (xₙ₋₁ - xₙ) / (f(xₙ₋₁) - f(xₙ))

In theory this should allow the last m points to be a simplex of an m-1-dimensional space, so their affine combinations would be that m-1-dimensional space, at the expense of somewhat slower convergence. This seems too dumb to work, but it does seem to. Here’s a Python implementation:

def secnd_seq(f, x):
    x = list(x)
    y = [f(xi) for xi in x]

    while True:
        yield x[-1], y[-1]

        div = y[-1] - y[0]
        if not div:
            return

        x.append(x[-1] - y[-1] * (x[-1] - x[0]) / div)
        y.append(f(x[-1]))
        x.pop(0)
        y.pop(0)

Given random points from [-5, 5]², this converges to a point on the unit circle about ¼ of the time with 2 starting points (the orthodox method of secants), but almost always with 3 or more starting points, because three points is enough to span the whole 2-D parameter space almost surely. However, the algorithm frequently takes several thousand iterations to converge! Increasing the number of starting points to 4, 5, or 6 makes it less frequently need more than 100 iterations or more than 1000 iterations, but since it usually converges in less than 50 iterations with 3 points, it might make just as much sense to do a random restart if the algorithm is failing to converge. Still, increasing the number of points more makes the completion time more consistent.

unit_circle = lambda (x, y): x**2 + y**2 - 1

def test_nd(d=3, n=1000, maxiter=100, eps=1e-15, f=unit_circle):
    ok = not_ok = 0

    for i in range(n):
        x = [numpy.random.random(2) * 10 - 5 for j in range(d)]
        items = list(itertools.islice(secnd_seq(f, x), maxiter))
        if abs(items[-1][-1]) < eps:
            print "ok:", x, items[-1][0], len(items)
            ok += 1
        else:
            print "not ok", x, items[-d:]
            not_ok += 1

    return ok, not_ok

(After several thousand trials, it did find a three-starting-point state from which convergence failed after 10000 iterations: three points reported as [array([-2.57650664, -4.90971528]), array([-0.17240513, -3.30215595]), array([-4.9655625 , 3.50688737])]. Unfortunately, that point converges in 82 iterations; like Lorentz’s, my attempt at reproducibility has apparently been defeated by rounding.)

This experimental result suggests that this approach may be usable in high-dimensionality spaces to find zeroes, perhaps, like gradient descent is for finding minima. But I wonder how it compares to running gradient descent, or one of its modern variants like AdaGrad or Adam, on the square of a function?

Finding a zero of a vector-valued function

Suppose you want to find an intersection (x, y) of two circles: (x - x₀)² + (y - y₀)² - r₀² = 0 ∧ (x - x₁)² + (y - y₁)² - r₁² = 0. You could think of this as finding a zero of a vector-valued function f(x, y) : ℝ² → ℝ² = ((x - x₀)² + (y - y₀)² - r₀², (x - x₁)² + (y - y₁)² - r₁²). But we can’t directly apply the method of secants, because we need to divide by the difference of two function outputs, and you can’t divide vectors.

However, we can take a norm of the result vector to get a scalar which will only be 0 when the vector is 0; for example, the L₁ norm: |(x - x₀)² + (y - y₀)² - r₀²| + |(x - x₁)² + (y - y₁)² - r₁²|. Or, in Python:

def circle_intersection(x0, y0, r0, x1, y1, r1):
    def f(xv):
        x, y = xv
        return (abs((x - x0)**2 + (y - y0)**2 - r0**2) +
                abs((x - x1)**2 + (y - y1)**2 - r1**2))

    return f

This of course has a discontinuous derivative whenever we cross one of the circles, and so although the procedure above is able to find intersections successfully, it takes several hundred iterations to do so. But, surprisingly, things get even worse if we try to use the L₂ norm:

def circle_intersection_L2(x0, y0, r0, x1, y1, r1):
    def f(xv):
        x, y = xv
        return (((x - x0)**2 + (y - y0)**2 - r0**2)**2 +
                ((x - x1)**2 + (y - y1)**2 - r1**2)**2)

    return f

I think this is because the quadratic convergence condition of both Newton–Raphson iteration and the method of secants requires the function to have a nonzero derivative in the neighborhood of its root, and the L₂ norm instead has a zero derivative. Still, by using enough initial points, we can usually get to the solutions this way; this converges on 997 out of 1000 attempts, but usually takes between 1000 and 2000 iterations:

test_nd(d=20, maxiter=10000,
        f=circle_intersection_L2(0, 0, 1, 1, 0, 1),
        eps=1e-13)

The L norm is just as bad.

Optimization

Thomas Simpson pointed out that, if you can compute the first and second derivatives of a function, you can use Newton–Raphson iteration to numerically approximate its critical points — saddle points, local minima, and local maxima. Its global minimum must be one of these or, if its domain is compact, a point on the boundary of the domain. (If its domain is finite and open it may not have a minimum, just an infimum.) In low dimensionalities with sufficiently polite functions, this can enable you to quickly find the global minimum, even by manual computation. This approach has expanded into a whole field of “quasi-Newton methods”, but these involve maintaining an approximation of the Hessian matrix of the function being optimized — and in n dimensions, the Hessian has n² elements.

Similarly, you can use the method of secants to numerically approximate a critical point of a function if you can compute the function’s derivative — for example, using automatic differentiation. Earlier, I suggested that maybe you could square a function, at least a real one, and use generic optimization algorithms to find its zeroes. Now I’m suggesting almost the opposite: differentiate a function and use generic root-finding algorithms to find the zeroes of its derivative, then test them to see which one is lowest.

If we are attempting to find a minimum of a function f : ℝ → ℝ, we can start by using automatic differentiation to compute points of ∇f, which are in ℝ. Then we can, perhaps, use the extended method of secants in the way described above — taking some norm of that gradient and attempting to extrapolate to where it becomes zero, using something like 2n or 3n points — with only on the order of 3n + O(1) operations per iteration, rather than the O(n²) required by quasi-Newton methods. (However, I suspect that convergence, if it happens at all, may be slower per iteration with this approach, so overall it may be faster or slower than quasi-Newton methods.)

Genetic algorithms

“Genetic algorithms” is a catchy name for a popular metaheuristic based on Darwinian and Mendelian metaphors. You have a “population” of “chromosomes” over which you compute “fitnesses”; then, to create a new “generation”, you “crossbreed” members of the population chosen randomly (with higher probability increasing with fitness) by combining their “genes” with “crossover” and apply “mutations” to the results. As long as the crossover and mutation operations aren’t too destructive to fitness, each generation of the population will tend to have higher and higher “fitness”.

It occurred to me that the method of secants sort of fits into this mold, but with an especially powerful “crossover” mechanism — it attempts to extrapolate from the differences between the two “genomes” it’s crossbreeding to find the optimum. The offspring is necessarily within the space of affine combinations of the two parents, but its fitness may be much higher than theirs. (But sometimes it’s not.)

The above approach of just drawing secants from further back in the history of approximations, in order to handle higher dimensionality, can be seen as a sort of degenerate genetic algorithm in which we only crossbreed two individuals at a time, and they’re always the oldest and youngest. Presumably more judicious choice of parents would yield faster convergence.

As an example, I was thinking of approximating an image (see Image approximation) with a set of Lambertian spheres of the same color and some lighting, and using the method of secants as above to generate new combinations of spheres. The function being optimized would be a combination of the difference between an image rendered from a given configuration of spheres and a penalty for the complexity of the configuration; the configuration would consist of an (x, y, z) direction for the directional light, an (r, g, b) color for the ambient light, and parameters (xᵢ, yᵢ, zᵢ, Rᵢ, rᵢ, gᵢ, bᵢ) for each sphere in the configuration. Mutation operations would add noise to everything and occasionally clone a sphere. Initial renderings would be low in resolution to speed up the initial search.

Dunno, maybe something like that could work for the structure-from-shading or even structure-from-motion problem, though in the second case you additionally have to estimate the camera path relative to the object. And maybe gradient descent and its variants are better fits.

Topics