Robust local search in vector spaces using adaptive step sizes, and thoughts on extending quasi-Newton methods

Kragen Javier Sitaker, 2019-08-17 (updated 2019-09-15) (15 minutes)

(Probably nothing in here is novel; it just documents my own exploration of the space of local search algorithms in vector spaces, which is well known, but not to me.)

When writing Rubber wheel pinch drive I got confused with some basic trigonometry, trying to solve a triangle with the law of cosines† to find the intersection of two circles. So, impatient, I figured I’d solve the problem with brute force instead, and wrote the following implementation of local search with random restarts for scalar functions of general vector spaces:

def search(f, p, restarts=30, steps=30):
    best = None
    for start in range(restarts):
        current = p()  # random starting point
        current_badness = f(current)

        for size in range(20):  # step scales from 4**4 down to 4**-15
            for i in range(steps):
                new = current + p() * 4**(4-size)
                new_badness = f(new)
                if new_badness < current_badness:
                    current, current_badness = new, new_badness

        if best is None or best[1] > current_badness:
            best = current, current_badness

    return best

Here p is a function that returns a random vector from the domain of f, and f maps such vectors to a real-valued badness to be minimized.
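For example, here’s a hedged sketch of invoking it on a toy quadratic; the target point and the scale of p are arbitrary choices of mine:

import numpy

f = lambda v: ((v - numpy.array([3.0, 4.0]))**2).sum()  # minimum at (3, 4)
p = lambda: numpy.random.rand(2) - 0.5
print(search(f, p))  # prints something close to (array([3., 4.]), 0.0)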

At first this didn’t work very well because I had erroneously provided a guess function p that was always non-negative, so each restart would progressively step from the neighborhood of 0 to the neighborhood of the correct solution and then get stuck. Nevertheless, it was able to find a reasonably good solution (and then I realized how to solve the problem in closed form without trigonometry).

† The law of cosines is c² = a² + b² - 2ab cos γ, where γ is the angle opposite the side with length c.
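(For reference, one standard closed-form construction for the intersection needs only the Pythagorean theorem, and may or may not be the one alluded to above: writing d for the distance between the centers and a for the distance from the first center to the chord through the intersection points, a² + h² = r1² and (d − a)² + h² = r2², so a = (d² + r1² − r2²)/2d.)

import numpy

def circle_intersections(c1, r1, c2, r2):
    # c1 and c2 are 2-D numpy arrays; assumes the circles really intersect.
    d = ((c2 - c1)**2).sum()**0.5
    a = (d**2 + r1**2 - r2**2) / (2*d)  # distance along the center line
    h = (r1**2 - a**2)**0.5             # half the chord length
    u = (c2 - c1) / d                   # unit vector from c1 toward c2
    perp = numpy.array([-u[1], u[0]])   # perpendicular unit vector
    mid = c1 + a*u
    return mid + h*perp, mid - h*perp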

Making local search in vector spaces more robust

Random restarts in general make metaheuristic search algorithms more robust; indeed, even the simplest possible metaheuristic search algorithm, “choose a random point”, becomes workable with random restarts.
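A minimal sketch of that degenerate algorithm, in the same shape as the search above: pure random sampling, keeping the best point seen.

def random_search(f, p, tries=10000):
    # “Choose a random point”, restarted many times; keep the best sample.
    return min((p() for _ in range(tries)), key=f)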

In the search implementation above, though, the restarts aren’t very random: each restart just draws a fresh starting point from p, so the starting points are only as diverse as p’s distribution.

I tried to make the implementation somewhat robust against p functions of the wrong magnitude, as you can see; the algorithm tries a range of different exponentially-spaced sizes. Now that I’m using a non-broken p, this results in finding a solution correct to ten decimal places, which is pretty good for such a simple local search algorithm. But the particular orders of magnitude I chose are kind of arbitrary; basically I was just hoping they’d cover the right range.

It occurred to me that you can make it more robust against p guesses of the wrong order of magnitude by updating size incrementally, using a procedure like the following: when you find a step that improves the guess, try half that same step or twice that same step. If that improves the guess further, then multiply size by 2 or ½, respectively, before the next iteration. This way, if your step size is about right, size will experience a random walk around it, but if it’s much too small (you’re stepping around a locally flat linear region of the cost function) then size will grow exponentially, while if it’s much too large (you’re rocketing out of the basin with all the solutions in it) it will diminish exponentially.

By the same token, you could make it robust against a p function like the one I provided by trying a step of -p when p makes the situation worse. That way, if you’re in a relatively non-curved region, you can make progress by just shifting into reverse. Trying this actually gives you another crucial piece of information: if going both ways makes it worse, there’s a good chance you’re in a local optimum, using a step size that is too big to have a good chance of improving your position.

You could implement this as follows, though this code is a bit repetitive:

def climb(f, step):
    size, growth, here = 1, 2, step()
    cost = f(here)

    while True:
        yield here, cost  # This may be the same as last time

        where = size * step()
        new = here + where
        new_cost = f(new)
        if new_cost > cost:
            where = -where  # shift into reverse
            new = here + where
            new_cost = f(new)

        if new_cost > cost:
            size /= 2  # both directions made things worse; step is too big
            continue

        where *= growth  # try growing (or shrinking) the successful step
        grown_new = here + where
        grown_new_cost = f(grown_new)
        if grown_new_cost < new_cost:
            new, new_cost = grown_new, grown_new_cost
            size *= growth
        else:
            growth = 1/growth  # next time, adjust the other way

        here, cost = new, new_cost

This is able to find a good approximation of the intersection of two circles in “only” 128 iterations, using the following definitions, even when the answer is several orders of magnitude away from the starting step size:

import numpy

def distance(p1, p2):
    return ((p1 - p2)**2).sum()**0.5  # Euclidean distance

def circles(c1, r1, c2, r2):
    # Builds a cost function whose minimum of zero is at any point lying
    # on both circles; c1 and c2 are numpy arrays.
    def badness(guess):
        return ((distance(guess, c1) - r1)**2 +
                (distance(guess, c2) - r2)**2)
    return badness

step = lambda: numpy.random.rand(2) - 0.5
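For concreteness, here’s a hedged sketch of driving the generator on a particular pair of circles; the circles themselves and the 1e-12 stopping threshold are arbitrary choices of mine:

searcher = climb(circles(numpy.array([0.0, 0.0]), 5,
                         numpy.array([8.0, 0.0]), 5), step)
for i, (guess, badness) in enumerate(searcher):
    if badness < 1e-12:
        print(i, guess, badness)  # converges near (4, 3) or (4, -3)
        break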

Handling valleys (ridges), saddles, and higher-dimensional spaces

However, its adaptive step size will screw it in higher-dimensional spaces: when it’s in a valley or at a saddle point where most directions are bad, it will tend to diminish the step size and find a point on the valley floor very precisely. Perhaps a better approach would be to diminish step sizes by a smaller amount that depends on the dimensionality, so that if there’s at least one good direction, the step size will remain constant on average.

(“Valleys” here are “ridges” in the usual hill-climbing metaphor, because here we’re trying to minimize a cost function, while “hill climbing” is trying to maximize your altitude.)
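To put a number on “remain constant on average”: suppose a fraction q of random steps succeed (q depends on the dimensionality and on the local shape of the cost function), each success doubles the step size, and each failure multiplies it by a shrink factor s. The drift in the log of the step size is then q ln 2 + (1 − q) ln s per iteration, which is zero when s = 2^(−q/(1−q)). A tiny sketch of that arithmetic; the function name is mine:

def balanced_shrink(q, growth=2):
    # Shrink factor per failed step making the log step size a
    # zero-drift random walk, given success fraction q.
    return growth ** (-q / (1 - q))

For q = ½ this recovers the symmetric ½; for q = 0.05 it gives about 0.964, i.e., shrink very gently when successes are rare.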

To escape this trap, you could have different behavior when expanding the step size than when reducing it: keep increasing the step size without changing the step direction until the situation stops getting better. That is, walk along the line of the step by 2, 4, 8, 16, 32, etc., times the size of the initial step, rather than just 2 times, doing a crude line search once a promising direction is found. This will tend to ricochet you between the walls of a canyon, and eliminates the bias in the step-size random walk toward smallness, since any single lucky step in the right direction can increase the step size arbitrarily.

This is similar to the three-dimensional optimization strategy Dave Long tells me some bacteria use: flagella forward while things are improving, flagella backward (resulting in random tumbling) while things are getting worse.

In cases like those Adam and AdaGrad and the like are designed for, where the best step size varies enormously among different dimensions, you might want to use hill climbing — searching along one dimension at a time, so that you can use a separate step size for each dimension. This also potentially permits the use of Acar’s “self-adjusting computation” and similar generic incrementalization algorithms to speed up function evaluation. (See the section in More thoughts on powerful primitives for simplified computer systems architecture on incremental or self-adjusting computation.) Unfortunately, this means that you’ll never make any further progress once you reach the floor of a diagonal valley. In n dimensions, there are 2n dimension-aligned directions a valley can descend in, but 2ⁿ precisely diagonal directions, so in some sense that’s a much harder problem.
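Here’s a hedged sketch of what that one-dimension-at-a-time variant might look like, with an independent adaptive step size per dimension; the name and the particular doubling and halving constants are my own arbitrary choices, and I haven’t tested or tuned it:

def climb_per_dimension(f, start):
    here = numpy.array(start, dtype=float)
    sizes = numpy.ones(len(here))  # a separate step size per dimension
    cost = f(here)

    while True:
        yield here, cost
        for i in range(len(here)):
            for direction in (1, -1):
                new = here.copy()
                new[i] += direction * sizes[i]
                new_cost = f(new)
                if new_cost < cost:
                    here, cost = new, new_cost
                    sizes[i] *= 2  # success: grow this dimension's step
                    break
            else:
                sizes[i] /= 2  # both directions failed: shrink this step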

As an example of how relative scales on different dimensions matter, the above routine handles the following somewhat pathological problem reasonably well, although it takes it several hundred iterations; note the incorrect step function, which, like my earlier buggy p, only generates non-negative components:

c = (3, 1e10, 4, 17, -5)
climb(lambda g: ((g-c)**2).sum(), lambda: numpy.random.random(5))

But it handles the following version much worse, requiring hundreds of thousands or possibly millions of iterations — it converges exponentially on the right answer, but takes about 80'000 iterations per order of magnitude:

climb(lambda g: (((g-c)*(1,1,100,1,1))**2).sum(),
      lambda: numpy.random.random(5))

This induces a steep valley along the third dimension, and so the steps in the other dimensions, like the second and fourth, suffer a drastically reduced step size, slowing their convergence enormously; the step size is beaten down from the outlandish values it took while getting close to the optimum along the second (1e10) dimension.

On one run, it finally found a solution accurate to two decimals on every dimension in iteration 886'803, after about 15 or 20 minutes of CPython interpretation:

886803 [  3.00416293e+00   1.00000000e+10   3.99999237e+00   1.70599240e+01
  -4.95000939e+00] 0.0177569891399

It goes about four times as fast with a non-buggy step function, converging to that same precision in only 207'804 generations:

207804 [  3.02129932e+00   1.00000000e+10   3.99999109e+00   1.70184720e+01
  -5.04976604e+00] 0.021740755647

The flagella-style variant described above improves on this by about a factor of 7:

28607 [  2.96086659e+00   1.00000000e+10   4.00023842e+00   1.69373547e+01
  -4.95732860e+00] 0.0155987598959

That’s using the following implementation, which additionally is slightly simpler and only does one loss-function evaluation per iteration, instead of up to three:

def climb(f, step):
    size, here = 1, step()
    cost = f(here)

    while True:
        yield here, cost

        where = size * step()
        new = here + where
        new_cost = f(new)
        if new_cost > cost:
            size /= -2  # shrink, and reverse direction for the next try
            continue

        # We found a promising direction; swim!
        while True:
            yield new, new_cost
            where *= 2  # keep doubling the step along the same line
            new_new = here + where
            new_new_cost = f(new_new)
            if new_new_cost < new_cost:
                new, new_cost = new_new, new_new_cost
                size *= 2
            else:
                break

        here, cost = new, new_cost

And this harness code:

from __future__ import division, print_function
import numpy

if __name__ == '__main__':
    c = (3, 1e10, 4, 17, -5)
    for i, (x, q) in enumerate(climb(lambda g: (((g-c)*(1,1,100,1,1))**2).sum(),
                                     lambda: numpy.random.random(5) - 0.5)):
        print(i, x, q)

As before, it takes four times as long if we leave out the - 0.5:

92241 [  3.04962626e+00   1.00000000e+10   3.99985342e+00   1.69648044e+01
  -4.97438500e+00] 0.00457665876023

Applied to the circle-intersection problem from before, it typically takes more iterations, like around 30 instead of around 10. (This is probably a somewhat smaller penalty than it sounds like, since the circle-intersection code from earlier is often doing two or three loss-function evaluations per iteration.)

It solves this geometric square model to a precision of 1e-6 typically in 1000 to 3000 iterations:

def square_constraints(points):
    # points is a flat numpy array holding four 2-D corners of a
    # would-be square.
    p1, p2, p3, p4 = points[0:2], points[2:4], points[4:6], points[6:]
    side1 = p2 - p1
    side2 = p3 - p2
    side3 = p4 - p3
    side4 = p1 - p4
    return (side1.dot(side2)**2 + side2.dot(side3)**2 + side3.dot(side4)**2
            + side4.dot(side1)**2                 # perpendicular corners
            + ((side1**2).sum() - 5**2)**2        # one side of length 5
            + ((side2**2).sum() - (side3**2).sum())**2)  # two equal sides

climb(square_constraints, lambda: 10 * (numpy.random.rand(8) - 0.5))

Without the fourth perpendicularity constraint, though, it converges very slowly, and the things it converges to don’t look like squares. One sample result with the fourth perpendicularity constraint left out:

141071 [ 2.76119227 -2.36547768 -1.36787743  0.45423421 -1.35017868  0.48015388
 -1.33188753  0.50622078] 9.99823404211e-07

This has minuscule and nearly equal side2 and side3 (lengths about 0.032), whose dot product is consequently also very small even though they’re very nearly parallel, being perpendicular to the nearly-parallel side1 and side4. This allows side1 and side4 to minimize the function by becoming more and more nearly equal and opposite. It pays to be cautious when defining cost functions, since optimization algorithms will exploit whatever vulnerabilities they can find!

I haven’t tried the per-dimension step-size approach above yet, though it would surely help for these examples.

Relationship to other kinds of optimization algorithms

See also Using the method of secants for general optimization for a different generic derivative-free solver for scalar functions of general vector spaces, that one seeking zeroes instead of minima. I think both of these are only going to be reasonably fast for problems of low dimensionality, but hill-climbing suffers exponentially from high dimensionality, while the method of secants (like gradient-descent variants) will suffer only linearly from it.

Of course, you can find local minima of a computable continuous function by using automatic differentiation on it and using the method of secants to find a zero of its derivative.
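In one dimension, a hedged sketch of that composition looks like the following; dfdx is the derivative, however obtained (automatic differentiation or otherwise), and I ignore the degenerate case where two successive derivative samples are equal:

def find_critical_point(dfdx, x0, x1, tol=1e-12):
    # Method of secants hunting a zero of f', i.e. a critical point of f.
    f0, f1 = dfdx(x0), dfdx(x1)
    while abs(x1 - x0) > tol:
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
        f0, f1 = f1, dfdx(x1)
    return x1

Note that this finds critical points generally, not only minima; you’d still need to check which one you got.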

Although the above implementations are not, hill-climbing and other kinds of local search are applicable to functions on discrete spaces without any kind of comparability between elements, such as graphs or strings of symbols. (That’s the domain; the range still needs to be comparable.) The method of secants, by contrast, requires divisibility and zeroes, while hill-climbing requires only comparability from its cost function — it takes no notice of absolute magnitudes, just which values are higher and lower — and this is true of the implementations above.
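A minimal sketch of local search on such a discrete domain, using bit vectors with single-bit flips as the neighborhood; everything here is my own illustration, and notice that it only ever compares costs, never doing arithmetic on them:

import random

def climb_bits(f, n, steps=10000):
    x = [random.randrange(2) for _ in range(n)]  # random bit string
    cost = f(x)
    for _ in range(steps):
        i = random.randrange(n)
        x[i] ^= 1  # flip one random bit
        new_cost = f(x)
        if new_cost < cost:
            cost = new_cost  # keep the improving flip
        else:
            x[i] ^= 1  # revert the unhelpful flip
    return x, cost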

Thoughts on quasi-Newton methods

For regular problems like the examples above, quasi-Newton methods would probably work better, but require differentiating the cost function. With the advent of automatic differentiation, this is no longer a lot of work, but doing automatic differentiation on pure-Python functions like the above generally requires using forward-mode automatic differentiation, which is painfully slow. Reverse-mode automatic differentiation is dramatically faster for this sort of thing, but generally requires your program to be written differently.

As I said in Using the method of secants for general optimization, quasi-Newton methods require the maintenance of an approximation of the Hessian, and for functions of high-dimensional spaces, the Hessian is fucking humongous. A thing that has surely been tried, but that I haven't seen discussed, and which intuitively seems like it should work better for high-dimensional optimization problems, is using automatic differentiation to compute the gradient of the function to get a direction to move in, but then, rather than moving in that direction by an amount proportional to the magnitude of the gradient (as gradient-descent methods do), use automatic differentiation to compute the second derivative of the loss function along that line, which is a single number rather than a matrix of N² values (for a domain of dimensionality N) like the Hessian. Then you can use a Newton–Raphson step to figure out how far to move along that line: divide the first derivative (the dot product of the gradient and the direction) by the second derivative, thus extrapolating the distance to the point at which the gradient would fall to zero if the Hessian were constant over that region. (Alternatively, use affine arithmetic or reduced affine arithmetic to track down a precise zero along that line — see Fast mathematical optimization with affine arithmetic for details.)
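A hedged sketch of that idea, with central differences standing in for automatic differentiation of the second derivative along the line, and with grad assumed to be supplied from somewhere (reverse-mode automatic differentiation, ideally); the step size h is an arbitrary choice of mine, and the step assumes positive curvature along the line:

def newton_line_step(f, grad, x, h=1e-6):
    g = grad(x)
    d = -g  # direction of steepest descent
    dphi = g.dot(d)  # phi'(0), where phi(t) = f(x + t*d)
    # phi''(0) by central differences, standing in for forward-mode AD:
    d2phi = (f(x + h*d) - 2*f(x) + f(x - h*d)) / h**2
    return x + (-dphi / d2phi) * d  # Newton-Raphson step along the line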

The appealing thing here is that gradient descent has linear convergence (i.e., the logarithm of the approximation error falls linearly with the number of iterations) while under appropriate circumstances Newton’s method has quadratic convergence (i.e., the number of correct digits roughly doubles on each iteration). I’m not entirely sure whether Nesterov accelerated gradient descent, gradient descent with momentum, Adam, etc., achieve a better order of convergence, but I think they don’t.

This would be big if true (see Things in Dercuano that would be big if true and Top algorithms), which means that, given how obvious it is, it probably doesn’t work, or else I’m misunderstanding how quasi-Newton methods work and they already do this. But it will be interesting to find out why not.
