You can use a camera to measure deep subpixel movements by using moiré patterns generated by geometrical optics, permitting extremely inexpensive sensing and servomechanisms with several degrees of freedom all the way down to submicron scales.
Suppose you have two sheets of A4-size acetate transparency film printed almost entirely black on a 1200-dpi printer, with occasional randomly placed transparent pixels piercing the otherwise uninterrupted darkness. Specifically, about one of every 2048 pixels is transparent; the other 2047 are opaque black. This means that out of all 139 million pixels on either sheet, only about 68000 are transparent.
Let’s lay one of the sheets on top of the other with some integer X and Y pixel offset between them, but no relative rotation. The number of possible ways we can do this is about 9900 in X and 14000 in Y for a total of about 139 million. The odds are overwhelming that about one out of every 2048² = 4 194 304 pixels will happen to have two transparent pixels directly on top of one another, and will therefore be transparent. If we view a bright backlight through the two sheets, we can easily detect these doubly transparent pixels with a camera. Each such pixel that we measure reduces the number of candidate alignments by about a factor of 2048; any set of more than about four such pixels is adequate to identify the configuration uniquely, again with overwhelming probability.
All in all, for each configuration, there are about 33 such pixels, several times more than enough.
These discrete configurations are separated by 21 microns, an inch divided by 1200.
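As a sanity check, here is a small Python sketch (not part of any proposed implementation) that redoes the arithmetic above: the expected number of coincidences per alignment, and how many observed coincidences, at a factor of 2048 each, are needed to pin down the roughly 139 million candidate alignments.

    import math

    density = 1 / 2048                    # fraction of transparent pixels per sheet
    width_px, height_px = 9900, 14000     # A4 at 1200 dpi, roughly
    pixels = width_px * height_px         # about 139 million

    transparent_per_sheet = pixels * density       # about 68,000
    expected_coincidences = pixels * density ** 2  # about 33

    # Each observed coincidence cuts the candidate alignments by ~2048, so we
    # need enough of them that 2048**k exceeds the ~139 million possible offsets.
    needed = math.ceil(math.log(pixels) / math.log(2048))

    print(f"transparent pixels per sheet: about {transparent_per_sheet:,.0f}")
    print(f"expected coincidences per alignment: about {expected_coincidences:.0f}")
    print(f"coincidences needed to pin down the alignment: about {needed}")

This gives about 68 000 transparent pixels, about 33 coincidences, and about 3 coincidences strictly needed, so the four or five mentioned above leave a comfortable margin.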
Let’s consider the more general case, where the displacement happens in more degrees of freedom and is continuous rather than discrete. Allow the sheets to rotate in two dimensions and have a perspective difference between them. In this case, almost all pixels in one sheet will overlay parts of four pixels in the other sheet, which increases the number of pixels with some light leaking through by about a factor of 4, up to about 133. It also means that we can measure more than just whether a given pixel is transparent or not; we can distinguish degrees of transparency. How many degrees we can distinguish depends on the sources of noise, but we can probably reliably measure something like 64 gray levels for the pixel. This means we can measure the displacement of any given overlapping pixel to within something like a 64th of a pixel, which would increase our measurement precision to under 400 nanometers if geometric optics were the truth. But geometric optics breaks down before that point, so we probably can’t do better than about a micron, at least with visible light.
Note that we’re up to about 2400 bits of signal here (133 * (6 + 12)), although it’s heavily redundant. In the worst case, we’re trying to estimate nine degrees of freedom: the distance from the camera to the scene, two degrees of freedom of camera angle, and six degrees of freedom of relative position and orientation of the two transparency sheets. We’re hoping to calculate the relative position and orientation accurate to within about one part in 300 000, which is 18 bits; the other three degrees of freedom might need to be estimated to a similar relative precision even if we don’t care. The upshot is that we need to estimate 162 bits of information from 2400 bits of data, which seems eminently feasible.
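Spelled out in Python, with the per-pinhole split (six bits of brightness plus roughly twelve bits of position) taken from the paragraph above:

    pinholes = 133              # partially overlapping pinholes, continuous case
    bits_per_pinhole = 6 + 12   # ~6 bits of gray level + ~12 bits of position
    signal_bits = pinholes * bits_per_pinhole   # about 2400 bits, heavily redundant

    dof = 1 + 2 + 6             # scene distance, camera angle, relative pose
    bits_per_dof = 18           # about one part in 300,000
    unknown_bits = dof * bits_per_dof           # 162 bits to estimate

    print(signal_bits, unknown_bits)            # 2394 162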
How much camera resolution do we need for the pattern of pinholes we see to be unique to that configuration? We might need enough resolution to be able to usually separate the light from different pinholes. This might require something like 256 × 256 pixels on the sensor plane, so that most of the 133 pinholes are by themselves in a row and by themselves in a column. This is a few hundred kilobits of information.
How can we make it practical to estimate this information, even if it is in principle contained in the camera image? In the worst case, you could simply compute the prediction error of each of the 2¹⁶² or so interestingly different configurations against the observed image and pick the one with the lowest error. Hopefully this is not necessary in practice. Here are some techniques that will probably work:
First identify which pixels are bright, estimating the pixel-rounded displacements from that, then compare their brightnesses to get subpixel data. For a given camera angle, scene distance, and rotation, this cuts the number of configurations down to only 139 million.
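One concrete way to do the pixel-rounded step, sketched in Python: since the printed positions of the transparent pixels on each sheet are known, every observed coincidence, expressed in sheet A’s pixel grid, votes for the offsets that would put one of sheet B’s pinholes underneath it, and the true offset collects a vote from every coincidence. This assumes the camera image has already been rectified into sheet A’s pixel coordinates, which is itself a nontrivial step; the names here are illustrative.

    from collections import Counter

    def estimate_offset(coincidences, sheet_b_pinholes):
        """Vote for the integer (dx, dy) offset of sheet B relative to sheet A.

        coincidences: (x, y) bright-pixel positions in sheet A's pixel grid.
        sheet_b_pinholes: (x, y) printed positions of sheet B's transparent pixels.
        Each coincidence sits on top of some sheet-B pinhole, so it votes for
        every offset c - b; only the true offset is voted for by all of them.
        """
        votes = Counter()
        for cx, cy in coincidences:
            for bx, by in sheet_b_pinholes:
                votes[(cx - bx, cy - by)] += 1
        (dx, dy), count = votes.most_common(1)[0]
        return (dx, dy), count

With about 33 coincidences and 68 000 pinholes this is only a couple of million votes; comparing brightnesses then refines the winning offset to subpixel precision.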
Do the coregistration computation in Fourier-transformed spatial frequency space rather than in the spatial domain. What we’re doing in physical space here amounts to multiplying the two transparencies pointwise in the spatial domain, which is equivalent to convolving their frequency spectra. If we guess roughly the right rotation, maybe we can roughly deconvolve the frequency-domain signal of the product with the frequency-domain signal of one of the transparencies, giving us an estimate of the frequency-domain signal of the other — hopefully including its phase shift.
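A minimal Python sketch of the frequency-domain idea, using plain FFT cross-correlation rather than the deconvolution described above: correlating the observed product image against one sheet’s known pattern gives a sharp peak at the relative shift, because a few dozen pinholes line up at the true offset and essentially none line up anywhere else. The toy masks and sizes below are only for illustration.

    import numpy as np

    def correlate_shift(observed, sheet_b):
        """Estimate the integer (dy, dx) shift of sheet_b within the observed
        product image by circular cross-correlation computed with FFTs."""
        spectrum = np.fft.rfft2(observed) * np.conj(np.fft.rfft2(sheet_b))
        correlation = np.fft.irfft2(spectrum, s=observed.shape)
        dy, dx = np.unravel_index(np.argmax(correlation), correlation.shape)
        # Shifts past the halfway point are really negative shifts (wraparound).
        if dy > observed.shape[0] // 2: dy -= observed.shape[0]
        if dx > observed.shape[1] // 2: dx -= observed.shape[1]
        return int(dy), int(dx)

    rng = np.random.default_rng(0)
    sheet_a = (rng.random((512, 512)) < 1/64).astype(float)
    sheet_b = (rng.random((512, 512)) < 1/64).astype(float)
    observed = sheet_a * np.roll(sheet_b, (7, -12), axis=(0, 1))
    print(correlate_shift(observed, sheet_b))   # (7, -12)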
Instead of one single-scale pinhole field, we could divide the sheets into areas of “pinholes” of different sizes, or merely mix different sizes of pinholes together, from the 21-μm single-pixel pinholes up to 21-mm finger holes. Then we can use the much smaller number of significantly different configurations of the bigger holes to tell us what neighborhood to search in for configurations of the smaller holes, in this example through perhaps ten power-of-two hole sizes. There are only about 140 interestingly different translational configurations of 21-mm holes, about 12000 if you include 2D rotation, and half a million if you include 3D rotations. These numbers make exhaustive search feasible. (By necessity the larger holes will need to be distributed somewhat more densely, unless you try to recognize the patterns of smaller holes visible through them instead of just comparing them with each other.) Larger holes may also make the Fourier approach more feasible.
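A Python sketch of the coarse-to-fine structure, assuming a hypothetical helper alignment_error(dx, dy, scale) that scores how badly a candidate offset explains the holes of a given size; the helper is the hard part and is not shown.

    def coarse_to_fine(alignment_error, sheet_px=(9900, 14000), coarsest_px=1024):
        """Search offsets from ~21 mm holes (1024 px at 1200 dpi) down to 21 um ones.

        The coarsest scale is searched exhaustively (only ~140 translational
        configurations); each finer scale is searched only near the previous
        estimate, so the total work is tiny compared with an exhaustive
        139-million-way search at full resolution."""
        scale = coarsest_px
        candidates = [(x, y) for x in range(0, sheet_px[0], scale)
                             for y in range(0, sheet_px[1], scale)]
        best = min(candidates, key=lambda c: alignment_error(c[0], c[1], scale))
        while scale > 1:            # ten halvings of the hole size
            scale //= 2
            candidates = [(best[0] + i * scale, best[1] + j * scale)
                          for i in range(-2, 3) for j in range(-2, 3)]
            best = min(candidates, key=lambda c: alignment_error(c[0], c[1], scale))
        return best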
If you’re tracking motion, you can use position estimates from previous frames to find which neighborhoods of configurations to search in in a new frame. You can extend this to many simultaneous hypotheses using particle filters and the like.
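A minimal particle-filter skeleton in Python, with a hypothetical measurement_likelihood(pose, frame) standing in for however a candidate pose is scored against a camera frame; only the predict/weight/resample structure is meant literally.

    import numpy as np

    def track(frames, measurement_likelihood, n_particles=500, jitter=2.0):
        """Track a 2D offset across frames with a basic particle filter.

        Each particle is a candidate (dx, dy); between frames we add random-walk
        jitter (the motion model), weight particles by how well they explain the
        new frame, and resample.  Rotation and perspective terms are the same
        pattern with more state per particle."""
        rng = np.random.default_rng()
        particles = rng.uniform(0, 1000, size=(n_particles, 2))   # initial guesses
        for frame in frames:
            particles += rng.normal(0, jitter, size=particles.shape)      # predict
            weights = np.array([measurement_likelihood(p, frame) for p in particles])
            weights /= weights.sum()
            keep = rng.choice(n_particles, size=n_particles, p=weights)   # resample
            particles = particles[keep]
            yield particles.mean(axis=0)        # pose estimate for this frame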
Rotations and perspective distortion are less important in small neighborhoods; if you examine a small neighborhood, you don’t need to consider nearly as many rotations. As with the large holes, this might benefit from a higher density of pinholes. Consider a circular neighborhood of radius 1150 pixels (about 25 mm), which will contain on average about four million pixels and four of these coincidental pinholes; you can rotate it by up to about 440 microradians before the pattern of pinholes changes. If you were to increase the density of pinholes in the original from one per 2048 to one per 64, then you would have a coincidental pinhole about every 64² = 4096 pixels, so four of them in a circle of radius about 72 pixels, which wouldn’t change constellations until it had been rotated by about 7 milliradians.
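The numbers in that paragraph, recomputed in Python with the same half-pixel-at-the-rim criterion for when the constellation changes:

    import math

    def neighborhood(radius_px, density, overlap_boost=1):
        """Expected coincident pinholes inside a circle, and the rotation (radians)
        that moves its rim by half a pixel.  overlap_boost ~4 counts the partial
        overlaps of the continuous case; 1 counts exact coincidences only."""
        coincidences = math.pi * radius_px**2 * density**2 * overlap_boost
        return coincidences, 0.5 / radius_px

    print(neighborhood(1150, 1/2048, overlap_boost=4))  # ~4 pinholes, ~440 urad
    print(neighborhood(72, 1/64))                       # ~4 pinholes, ~7 mrad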
If you compute the Delaunay triangulation of the bright points, you should be able to eliminate the need to try many different rotations. Many such techniques for matching clouds of detected features are known in the computer-vision literature. You can probably even hash some aspects of the feature graph.
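One rotation-invariant way to index the constellation, sketched in Python with SciPy’s Delaunay triangulation; hashing each triangle’s sorted, quantized side lengths is an illustrative choice, not something prescribed above.

    import numpy as np
    from scipy.spatial import Delaunay

    def triangle_keys(points, quantum=2.0):
        """Hashable keys, invariant to rotation and translation, for the Delaunay
        triangles of a point cloud: each triangle's three side lengths, sorted and
        quantized to `quantum` pixels."""
        points = np.asarray(points, dtype=float)
        keys = set()
        for a, b, c in Delaunay(points).simplices:
            sides = sorted((np.linalg.norm(points[a] - points[b]),
                            np.linalg.norm(points[b] - points[c]),
                            np.linalg.norm(points[c] - points[a])))
            keys.add(tuple(int(round(s / quantum)) for s in sides))
        return keys

Matching the keys of an observed constellation against precomputed keys for candidate neighborhoods avoids searching over rotations explicitly; perspective scale changes would call for scale-invariant keys instead, such as ratios of side lengths.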
The techniques that involve increasing the pinhole density will decrease the information available per pinhole and probably require more pixels on the image sensor, but they provide more information overall (at least until the density of holes goes above, I think, 1/e).
A related approach is to use the reflections from a sparkling surface illuminated from a single direction for the feedback: a piece of sandstone in the sun, for example. If the light source and camera are fixed, the pattern of sparkles gives fairly precise information about the two angles of the surface to the axis bisecting the angle between the direction to the light and the direction to the camera; the rotation of the pattern on the focal plane gives fairly precise information about the rotation of the surface around that axis; the position of the pattern gives fairly precise information about the translation of the surface perpendicular to that axis; and the scale of the pattern gives very crude information about the translation of the surface along that axis.
Of course, you have to start by calibrating the system with a massive database of sparkle patterns from that surface from many different angles.
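A sketch, in Python, of what that lookup might amount to, assuming each sparkle image can be reduced to a fixed-length descriptor vector; building a good descriptor is the hard part and is not shown, and the class and names here are hypothetical.

    import numpy as np
    from scipy.spatial import cKDTree

    class SparkleCalibration:
        """Map sparkle-pattern descriptors to surface poses by nearest neighbor.

        poses: (N, k) array of poses (angles, translations) recorded during
        calibration; descriptors: (N, d) array of the corresponding descriptors."""
        def __init__(self, poses, descriptors):
            self.poses = np.asarray(poses, dtype=float)
            self.tree = cKDTree(np.asarray(descriptors, dtype=float))

        def estimate_pose(self, descriptor, k=3):
            # Average the poses of the k nearest calibration samples; real use
            # would interpolate more carefully and reject poor matches.
            _, nearest = self.tree.query(np.asarray(descriptor, dtype=float), k=k)
            return self.poses[nearest].mean(axis=0)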
How well might such a sparkle-based system work? Some informal observations:
Riding the Sarmiento train to Once, when it was stuck in a station for a while with the doors open, I observed that moving my head by two finger-widths (≈20 mm) caused a certain sparkle on the floor to appear; moving it two finger-widths further caused it to disappear. The sparkle on the floor was about 3 m away from my head, suggesting an angular resolution of some 6 milliradians from simply thresholding that single sparkle; presumably you could get down to 1 mrad by comparing the relative brightnesses of several.
The sharp boundary of the sun’s disc (which should be about 9.3 milliradians across, with an edge of 0.5 milliradians or less) was not evident, suggesting that the sparkle (or something, maybe some clouds) was introducing several milliradians of divergence.
Achieving a divergence of 0.5 milliradians (1.7 minutes of arc) at a wavelength of 400 nm from a blue LED requires a beam waist of at least about 0.4 mm, so if you want that much angular resolution, your facets need to be at least that big. (And your light source needs, if it doesn’t subtend that small an angle as seen from the sparkly surface, at least to have significant energy at spatial frequencies that high, as the sun’s sharp boundary does. A fuzzy Gaussian light source is about the worst case among light sources of a given size, although of course uniform ambient illumination is the worst case overall.)
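For reference, here is the diffraction relationship behind that 0.4 mm figure, computed in Python with the Gaussian-beam formula; the exact number depends on whether the 0.5 milliradians is taken as the full divergence angle or the half-angle, but either way the waist comes out at a few tenths of a millimeter.

    import math

    wavelength = 400e-9     # blue light, meters
    divergence = 0.5e-3     # radians

    # Gaussian beam: far-field half-angle theta = wavelength / (pi * waist).
    waist_if_half_angle = wavelength / (math.pi * divergence)       # ~0.25 mm
    waist_if_full_angle = wavelength / (math.pi * divergence / 2)   # ~0.51 mm
    print(waist_if_half_angle * 1e3, waist_if_full_angle * 1e3, "mm")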