After writing Interval raymarching, I just wrote my first raymarcher with signed distance functions (“SDFs”), intentionally a pretty minimal affair. It was a lot quicker to write than My Very First Raytracer, taking about an hour and a page of Lua to initially get working, rather than all night and four pages of C. Some of that difference is being able to build it on top of Yeso, so it doesn’t have to include code for image file output; some of it is that Lua is a bit terser than C; some is that the raymarcher doesn’t support color; half a page of it is the input file parsing in the C raytracer; but most of the reason is that a minimal SDF raymarcher is simpler than a minimal Whitted-style raytracer.
Disappointingly, although it does manage to do full-motion video, it’s a great deal slower than the precise raytracer, even though I haven’t yet added specular reflections, lighting, color, or texture to it. I don’t yet know whether that’s because it’s an iterative approximation method or because LuaJIT is producing somewhat suboptimal code. In this very simple scene, it’s only doing an average of around 9.5 SDF evaluations per pixel, which is comparable to the number of intersection tests the precise raytracer needed per ray; this weighs on the side of blaming LuaJIT.
Of course it’s a bit silly to be rendering real-time 3-D graphics on the CPU in 2019, even if your GPU is just a shitty little Intel Gen8 (see Notes on the Intel N3700 i915 GPU in this ASUS E403S laptop), but it’s still enough experience to provoke me to write a bunch of things.
In iq/rgba’s notes on the topic, he frequently mentions the importance of using exact SDFs, which provide you with the exact distance to the nearest scene object, rather than inexact SDFs (I think the original sphere-tracing paper calls these “distance underestimate functions”), which merely provide you with a lower bound. He explains that this is important because using correct Euclidean SDFs will speed up the raymarching process dramatically.
Now, this may be true, but unfortunately many of the attractive modeling features of implicit surfaces involve transforming functions in ways that do not preserve SDFs’ exactness. In particular, union (minimum of SDFs) is exact, but intersection and difference (maximum of SDFs and of an SDF and a negated SDF) are not exact. The SDFs generated by other very appealing features, like surface perturbations, bounding volume hierarchies (including IFS rendering), smooth union, and twisting space, are also usually inexact.
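To make this concrete, here is a minimal sketch of what these combinators look like if SDFs are represented as Lua functions from a point to a distance; the function names are mine, not anything standard.

    -- Hypothetical CSG combinators over SDFs represented as Lua functions
    -- taking a point (x, y, z) and returning a distance.  Only union
    -- preserves exactness; intersection and difference only return
    -- lower bounds on the true distance.
    local function union(f, g)
      return function(x, y, z) return math.min(f(x, y, z), g(x, y, z)) end
    end

    local function intersection(f, g)
      return function(x, y, z) return math.max(f(x, y, z), g(x, y, z)) end
    end

    local function difference(f, g)  -- f with a g-shaped bite taken out
      return function(x, y, z) return math.max(f(x, y, z), -g(x, y, z)) end
    end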
More unfortunately, other attractive modeling features of SDFs — such as the ability to round corners by subtracting a constant, or the ability to compute a constant-thickness shell by taking the absolute value and subtracting a constant — depend on the SDFs’ exactness. Consider a sphere whose SDF is λp⃗.|p⃗ - c⃗0| - r0 and another whose SDF is λp⃗.|p⃗ - c⃗1| - r1. Each of these is exact. If we take the CSG difference as λp⃗.|p⃗ - c⃗0| - r0 ∨ r1 - |p⃗ - c⃗1| (where x ∨ y is the maximum of x and y) such that the first sphere has a second-sphere-shaped bite taken out of it, we get an inexact, lower-bound SDF. It’s straightforward to see that attempting to round the sharp corner where the cut penetrates the surface by adding a constant will not work; addition distributes over maximum and minimum, so (a - k0 ∨ k1 - b) + r = a - (k0 - r) ∨ (k1 + r) - b. That is, all the level sets of this SDF can be reached by increasing one radius while decreasing the other. So the hoped-for rounded edge at the intersection line never appears.
(I say “it’s straightforward” but it surprised me when I saw it on the screen.)
A brute-force solution to this problem is to resample the inexact-SDF-generated geometry into some kind of numerical approximation that admits efficient exact SDF computation.
I mentioned autodiff briefly in Interval raymarching, but now that I’ve actually written an SDF raytracer, I feel like I have a better understanding of the situation.
With SDFs you don’t easily know which object you’ve hit. So if you want to do something that depends on the surface normal, such as Lambertian lighting, you don’t have a straightforward way to find it. Moreover, even if you do have a way to find which underlying primitive you’ve collided with, its SDF d(p⃗) doesn’t immediately tell you how to find its normal. You need the gradient of the SDF, ∇d(p⃗). Autodiff can give you that, at the cost of only doubling the computational cost, rather than quadrupling it, which is the cost of the standard approach.
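For comparison, the standard approach I have in mind is a finite-difference estimate like the following sketch, which costs roughly four SDF evaluations per normal; h is a hand-tuned step size rather than anything principled.

    -- Forward-difference normal estimate: about four SDF evaluations.
    -- d is any SDF taking (x, y, z); h is a small, hand-picked step.
    local function normal(d, x, y, z, h)
      local d0 = d(x, y, z)
      local nx = d(x + h, y, z) - d0
      local ny = d(x, y + h, z) - d0
      local nz = d(x, y, z + h) - d0
      local len = math.sqrt(nx * nx + ny * ny + nz * nz)
      return nx / len, ny / len, nz / len
    end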
However, as I said in Interval raymarching, using affine arithmetic might actually work better than the instantaneous derivative in this case, in order to avoid aliasing artifacts. iq/rgba has shown compelling demonstrations of how using a too-small box to estimate the gradient can cause aliasing.
(If your final top-level SDF is a tree of ∧ and ∨ operations, where a ∧ b gives you the lesser of a and b while a ∨ b gives you the greater (since real numbers have a total order), you could very reasonably trace the final result of the ∧/∨ back down the tree to its origin, finding the non-CSG surface it originated from. If you retained the computed values, you can do this in linear time in the length of the path you trace, which is probably faster than annotating each value with its origin on the way through this tree, effectively applying the ∧/∨ operations to (d, id) pairs rather than just d values.)
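A sketch of that walk, under the assumption that the tree nodes retained their computed values and that the negations needed for CSG difference are folded into the leaf SDFs:

    -- Walk a retained min/max (∧/∨) tree back down to the primitive whose
    -- value won; node.value is the value computed on the way up.
    local function origin_of(node)
      while node.left do                    -- interior ∧/∨ node
        node = (node.left.value == node.value) and node.left or node.right
      end
      return node                           -- a leaf: the originating surface
    end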
To me, one of the appealing aspects of successive-approximation algorithms like SDF raymarching and any kind of Monte Carlo simulation is that they can be used as “anytime algorithms” in order to guarantee responsiveness at the expense, if necessary, of quality (see Anytime realtime and Patterns for failure-free, bounded-space, and bounded-time programming for more on this). My current implementation does not do this. How can I get this in practice?
Using autodiff, you could perhaps “subsample” the SDF evaluations of the final image, for example tracing a single ray through the center of a 5×5 pixel square; at the final point of contact, rather than just extracting the gradient of the SDF with respect to the (x, y, z) scene coordinates, you could carry the autodiff computation all the way through to the (r, g, b) color value and then compute its Jacobian with respect to the (u, v) pixel coordinates on the screen. This Jacobian gives you a color gradient value at the center of that 5×5 square, and so you could perhaps get an image that looks like a badly compressed YouTube video frame, which is considerably better than you would get by just downsampling an image from 640×360 to 128×72 (same number of samples) or 221×125 (same amount of data).
However, running on the CPU, you can adaptively subsample. If you reached the surface in a small number of SDF evaluations, you didn’t pass close to any other surfaces, and the surface isn’t sharply angled to the ray where you’re hitting it, so you can probably get by with fewer samples.
Even without autodiff, computing an exact SDF d(p⃗) at some point p⃗ tells you a great deal about its values in the neighborhood of p⃗. At any displacement Δp⃗ in any direction, we know d(p⃗ + Δp⃗) ∈ [d(p⃗) - |Δp⃗|, d(p⃗) + |Δp⃗|], because the highest it can be is if Δp⃗ follows the gradient away from the object (and the gradient remains constant over that distance), and the lowest it can be is in the opposite direction. (No such pleasant bound holds for inexact SDFs, neither the upper nor the lower bound.)
(Here by |·| I mean the Euclidean L₂ norm, as before.)
Still, though, even with an inexact SDF, we know that an entire sphere of radius d(p⃗) around p⃗ is devoid of, as they say, “geometry”. This means that if you compute a non-tiny value from an exact or inexact SDF for a ray shot from the camera, you have computed a bound for a whole view frustum around that point; a single SDF evaluation is enough to advance the whole wavefront around that point up to the bound of a sphere around that point, at least the part of the wavefront that has successfully entered that sphere. It may be worthwhile to use a conservative approximation of the sphere, such as a ball made from a weighted sum of the L₁ norm (whose balls are octahedra) and the L∞ norm (whose balls are cubes). I think there is a cheap weighted-sum norm whose balls are these irregular polyhedra with 32 triangular faces — “frequency-2 geodesic spheres” in Buckminster Fuller’s terminology, except that the underlying polyhedron whose triangular faces were subdivided into four triangles is an octahedron, not an icosahedron.
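As a sketch of what I mean by a cheap conservative norm: a weighted sum of the L₁ and L∞ norms overestimates the Euclidean norm, so the ball it defines fits inside the known-empty Euclidean sphere. The weights below are one choice that I believe keeps it conservative, not anything optimized.

    -- A weighted sum of the L1 norm (octahedral balls) and the L-infinity
    -- norm (cubic balls).  With these weights the result is >= the
    -- Euclidean norm (the worst case is directions like (1, 1, 0)), so the
    -- ball it defines stays inside the empty Euclidean sphere.
    local function cheap_norm(x, y, z)
      local ax, ay, az = math.abs(x), math.abs(y), math.abs(z)
      return 0.45 * (ax + ay + az) + 0.55 * math.max(ax, ay, az)
    end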
You can build a Z-buffer and pick arbitrary points in it and evaluate the SDF at them; each SDF evaluation digs a big crater in the neighboring Z-buffer values, but only those values that are initially within the ball defined by that SDF value.
A fun way to do this might be to start with a very-low-resolution Z-buffer (3×2, say), then repeatedly double its resolution, perhaps using pairwise min to compute the newly interpolated pixels. Each doubling, you tighten the termination threshold for SDF iterations so that it’s comparable to the new pixel resolution, and iterate over all the Z-buffer pixels until you’ve hit that termination condition on each of them. This might be able to keep the total number of (primary) SDF evaluations per pixel down to 1–3, instead of the 9.5 I’m seeing or the 100–1000 commonly seen in the demoscene. If this can be combined with the adaptive subsampling mentioned above, it should be possible to get in the neighborhood of 0.1 SDF evaluations per pixel.
I understand that on the GPU it’s dumb to try to do things like that because communication between pixels is super expensive. On the CPU, though, it might be a sensible thing to do.
The above algorithm with its various resolutions of Z-buffers and so on would seem to require a lot of memory management complexity. But one of the appealing things about both raymarching by sphere tracing and Whitted-style raytracers is the simplicity of their memory management: every ray is traced totally independently, though the process of tracing a ray may result in tracing some more rays, recursively.
It occurred to me that this recursive structure is potentially a good way to handle the multiple resolutions. Initially trace a cone that's big enough to encompass, say, a 16×16-pixel region, by marching a ray down its center. When this ray encounters a low enough SDF that the corners of that region might be closer to objects than they are to the ray you're marching, split it up into four rays, one marching down the center of each 8×8 quadrant, with a narrower cone; trace one of them at a time. When one of those rays encounters a low enough SDF that the corners of its region are maybe closer to geometry than they are to the ray, break down again, into 4×4 quadrants, then 2×2, and finally single pixels. And when one of those rays reaches its destination, you invoke a pixel-found callback, or store a value in the single, full-resolution Z-buffer, and backtrack to complete the recursion.
And of course you don't really start at 16×16. You start at 512×512 or 1024×1024 or 2048×2048, whatever your screen is.
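A sketch of that recursion, using the simple d(p) − r advance discussed in the next paragraph. Here scene is the SDF, ray(u, v) gives a unit direction through pixel (u, v), put(u, v, t) records a depth, pixel_angle is the angle subtended by one pixel, and the 0.001 threshold and far distance are arbitrary; all of these names are hypothetical, not from my actual raymarcher.

    -- Recursive cone splitting: march one ray for a size-by-size pixel
    -- region; split into quadrants when the region's corners might be
    -- closer to geometry than the safe radius around the center ray.
    -- Rays that escape past `far` leave their pixels as background.
    local far = 100
    local function march(u, v, size, t)
      local half = size / 2
      while t < far do
        local dx, dy, dz = ray(u + half, v + half)   -- center of the region
        local d = scene(t * dx, t * dy, t * dz)
        -- small-angle estimate of the distance from the center ray to the
        -- region's corners at this depth
        local r = t * half * math.sqrt(2) * pixel_angle
        if d < r + 0.001 then
          if size <= 1 then return put(u, v, t) end  -- pixel-sized hit
          march(u, v, half, t)                       -- otherwise recurse into
          march(u + half, v, half, t)                -- the four quadrants,
          march(u, v + half, half, t)                -- reusing the current t
          return march(u + half, v + half, half, t)
        end
        t = t + (d - r)                              -- safe advance for the cone
      end
    end

Calling march(0, 0, 512, 1) would then cover a 512×512 screen, starting the march at distance 1 from the eye.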
The amount you advance is less than with standard SDF sphere tracing, because you want to make sure that you aren't smashing any of the corners of your current pixel region into an object. The simplest bound is to advance from point p by d(p) - r, where d is the SDF and r is the distance from p to the furthest corner of the projection of the current pixel region onto a sphere of radius |p| (assuming the eye is at 0). But this is a pessimistic bound; the precise bound is the positive solution for e of the quadratic equation (e + |p| - |p| cos θ)² + (|p| sin θ)² = d(p)², where θ is the angle from p to the corner of the region, as seen from the eye. When the angle is large, this could give a significantly better bound, but when the angle and the SDF are small, the improvement in the bound is also small in absolute terms, and the angle is almost always small. However, this small improvement is probably usually close to a factor of 2, and my guess is that it might cut the recursion depth needed by a factor of 3 or so in typical cases. So the tighter bound might be worth the extra complexity, or it might not.
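Solving that quadratic for e gives a little helper like this sketch; it clamps to zero when the empty sphere around p doesn't reach the corner ray at all, so no safe advance is known.

    -- Tighter advance bound: the positive solution for e of
    -- (e + |p| - |p| cos theta)^2 + (|p| sin theta)^2 = d^2, i.e. how far
    -- the whole spherical wavefront at eye-distance |p| can advance before
    -- the region's corner ray leaves the empty sphere of radius d around p.
    local function tight_advance(d, p_len, theta)
      local s = p_len * math.sin(theta)
      if d <= s then return 0 end   -- empty sphere doesn't reach the corner ray
      return math.max(0, math.sqrt(d * d - s * s) - p_len * (1 - math.cos(theta)))
    end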
This doesn't fit well into a fragment shader, which may be the reason I haven't seen it discussed even though it seems obvious in retrospect, but it seems like it should fit beautifully into a CPU. It's somewhat less optimal than the more complex approach described earlier with the multiple Z-buffers, but it's also enormously simpler, and it's embarrassingly parallel.
On this three-torus scene, I’m getting 5.5 frames per second on my laptop.
I’m not attempting to use SIMD operations like SSE and AVX, and the algorithm is expressed in such a way (with data-dependent iteration bailouts) that I would be very surprised if LuaJIT were finding a way to take advantage of them. I don’t think SSE 4.1 (which LuaJIT has) or even SSE 4.2 (which my CPU has) supports half-precision 16-bit floats, so probably a 4× speedup for reasonably coherent vectors is the most I can expect from SIMD. (Again, see Notes on the Intel N3700 i915 GPU in this ASUS E403S laptop.)
The CPU also has four cores, so in theory I should be able to get an additional 4× speedup from multithreading, as long as the pixels aren’t too interdependent.
So in theory I should be able to get a 16× speedup just by applying brute force, working out to 88 fps (at 320×240).
There’s another major reason I’m interested in SDFs that’s actually not real-time graphics at all: automatic fabrication. Historically manufacturing has largely been concerned with the geometry of parts, because steel, fired clay, concrete, glass, and polyethylene — the best materials for many purposes — are sufficiently uniform and isotropic that their geometry is most of what you need to know. (Heat treatment is important, but commonly applied to an entire article; similarly painting, galvanizing, etc. Surface finish is sometimes important, but that’s in some sense a question of geometry too. Work hardening, for example from cold forging, is extremely important.) Moreover, for metals in particular, merely achieving a desired geometry is often a difficult and expensive proposition.
So algorithms that make contending with geometry tractable are of great interest.
(I think one of the really interesting possibilities of automated fabrication is actually that we can make things out of nonuniform and anisotropic materials.)
The existing libraries, algorithms, and user interfaces for dealing with 3-D geometry in computers are utter shit. “Sketching”, “lofting”, “extrusion”, “press/pull”, “pocketing”, “filleting”, and so on, are terribly unexpressive; achieving even the most basic compound curves is often beyond their capabilities. The interactive sculpting user interfaces in things like Blender and ZBrush are more expressive, but incapable of handling any demands for precision. The triangular bounding surface meshes and NURBS commonly used as the internal data representation are humongous, bug-prone, and — for any given level of humongousness — terribly imprecise; furthermore, doing topological optimization is basically impossible with them. The available libraries are buggy as shit and crash all the time. Half the time when you export a mesh from one program to import into another it turns out not to be “manifold”, which is to say, it fails to represent a set of solid objects. Voxel representations are, if anything, even worse, but at least they can handle topopt and don’t have the fucking “manifoldness” problem.
Christopher Olah’s “ImplicitCAD” is an attempt to remedy this situation by using SDFs (and Haskell). I’d like to play with the approach and see what I can get working, but without Haskell.
So, what would it take to import an STL file into an SDF world? How about a thresholded voxel volumetric dataset?
Before I ever heard about 3-D raymarching using 3-D SDFs, I read a now-lost Valve white paper about rendering text “decals” in games using 2-D SDFs. The idea is that you precompute a texture containing a sampled SDF for the letterforms you want to use, and in your shader, you sample from that texture, with the built-in bilinear interpolation the GPU gives you. Then, instead of just using the sampled value as a pixel color, you threshold it to get an alpha value. This allows you to use the bilinear interpolation to interpolate sharp letterform boundaries in between texels.
Moreover, by thresholding it softly, you can get antialiasing. A similar effect comes into play with 3-D SDFs if you stop marching the ray when the SDF falls below about the scale of pixel spacing projected on the surface: the SDF sphere smooths over surface detail smaller than a pixel or so, preventing its high spatial frequencies from aliasing down into lower frequencies on the screen.
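The thresholding step itself is tiny; a sketch in Lua, assuming the sampled SDF value d is negative inside the letterform and w is roughly a texel's worth of distance:

    -- Soft threshold: map a sampled SDF value to an alpha, giving about
    -- one texel of antialiasing across the letterform boundary.
    local function alpha_from_sdf(d, w)
      return math.min(1, math.max(0, 0.5 - d / w))
    end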
A problem with this technique as described is that the bilinear interpolation inevitably kind of rounds off sharp corners where you would want them on the letterforms; to deal with this problem, you can use two or more color channels to approximate different parts of the letterform boundary with smooth curves, which cross at the desired sharp corners. There’s an open-source “mSDF” software package for generating these “multichannel SDFs”.
I wonder if there’s a way to carry the analogy through to 3-D raymarching with SDFs. Perhaps, for example, it could somehow provide a solution to my problem with rounding off edges.
I’m pretty sure now that I can implement SDF-based sphere-tracing raymarching in JS on <canvas> to get reasonable 3-D diagrams (see Dercuano drawings and Dercuano rendering). <canvas> has a typed array interface for raw pixel access I used in Aikidraw, so you don’t need a DOM call for every pixel you draw. I think I can render raster graphics into that interface pretty easily, fast enough at least for still images.
Specifically, you can call ctx.putImageData(data, x, y) on a 2d <canvas> drawing context, where data is an ImageData object made out of a width (.width), a height (.height), and a Uint8ClampedArray of pixel data in RGBA order (.data); there are also ctx.createImageData(w, h) and ctx.getImageData(x, y, w, h) methods. The ImageData constructor is more experimental but available in Web Workers.
The raycasting process from any given pixel produces, at least, a z-coordinate or distance at which the first intersection was found. (Maybe it also tells you things like how close the ray came to other objects, or how many iterations it took to converge, or which object it hit, and what the x and y coordinates were too, but at least it has a z-coordinate.) So this gives us a map from screen coordinates (u, v) -> z, which is pretty much exactly a Z-buffer.
Suppose we have nothing more than such a Z-buffer and we want to know the surface normals. A first crude approximation comes from the first backward differences: Z[6, 2] - Z[5, 2] gives us some kind of approximation of dz/du at (5.5, 2), and Z[5, 3] - Z[5, 2] gives us some kind of approximation of dz/dv at (5, 2.5).
These two sets of first differences have three problems:
1. They can't distinguish discontinuities (from running off the edge of an object, for example) from smooth differences. If we're rendering smooth objects, this is an important difference.
2. They're expressed in screen space. What we have is (roughly) dz/du and we want dz/dx. x = uz, so across a pixel dx ≈ z du (treating z as locally constant), so we need to divide by z. (Analogously for y.)
3. They're offset by half a pixel from the pixels we want.
I think we can remedy problems #1 and #3 by, from the two adjacent gradient pixels at (5, 1.5) and (5, 2.5), choosing the one with the smallest absolute value. So, for example, if Z[5, 1] = 14, Z[5, 2] = 16, and Z[5, 3] = 15, we have deltas of +2 and -1 on each side of (5, 2), so we pick -1. If Z[5, 4] = 13, that gives us a delta of -2, so from -1 and -2, we pick -1 for the gradient y-element at (5, 3).
This will jitter some of the normals by half a pixel one way and others by half a pixel the other way, and it eliminates discontinuities unless there are discontinuities on both sides of the pixel, in which case it will pick the smaller discontinuity.
(This jitter is sort of like morphological erosion of these depth-gradient values, but by half a pixel rather than an integer number of pixels, and using the absolute value rather than the signed value. Maybe there's some kind of morphological dilation of depth-gradient values that could compensate, but I don't know how to dilate by fractional pixels.)
So this gives us an estimate of the gradient of the z-coordinate with respect to x and y. But what we want is the surface normal. Let's say dz/dx = d and dz/dy = e. Now vectors tangent to the surface include (1, 0, d) and (0, 1, e); their cross-product is a normal, which if I'm not confused is (-d, -e, 1), but that one points the wrong way, so we want (d, e, -1), which points toward the camera. So you can normalize that (÷sqrt(d² + e² + 1)) and you have some kind of reasonable approximation of the surface normal.
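Putting the last few paragraphs together, the per-pixel computation looks something like this sketch, for a Z-buffer Z indexed as Z[u][v] with (u, v) away from the image border; the names are mine.

    -- Estimate a camera-facing surface normal from a Z-buffer alone.
    local function minabs(a, b)
      if math.abs(a) < math.abs(b) then return a else return b end
    end

    local function zbuffer_normal(Z, u, v)
      local z = Z[u][v]
      -- pick the smaller-magnitude one-sided difference on each axis,
      -- to avoid smearing normals across depth discontinuities
      local dzdu = minabs(Z[u][v] - Z[u - 1][v], Z[u + 1][v] - Z[u][v])
      local dzdv = minabs(Z[u][v] - Z[u][v - 1], Z[u][v + 1] - Z[u][v])
      -- convert screen-space slopes to scene-space slopes (x = u z, y = v z)
      local d, e = dzdu / z, dzdv / z
      -- normalize (d, e, -1), which points toward the camera
      local len = math.sqrt(d * d + e * e + 1)
      return d / len, e / len, -1 / len
    end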
Per pixel, this approach requires two subtractions, two binary-minimum-by-absolute-value operations, a reciprocal (of z), two multiplications by it, two squarings, two additions, a reciprocal square root, its negation, and two multiplications by it, 15 operations in all, two of which (the reciprocal and reciprocal square root) are slow. This is probably faster than automatic differentiation of the SDF, and it gains the benefit of antialiasing that the centered-differences approximation has.
A simpler and cheaper function that might still provide a useful illusion of three-dimensionality is to take the gradient of z with respect to (u, v) thus estimated and use it directly for, say, red and cyan color channels, as if the object were illuminated from the right with a red light and from above with a cyan one. Unfortunately, it's signed and may have a large dynamic range. The dynamic-range problem can maybe be handled by running it through some kind of sigmoid, like tanh or erf or something, although those only expand the dynamic range by like a factor of 2 or 3 or something; signedness can then be handled by truncating it with ReLU, i.e., s ∨ 0. To get a third light coming from the camera, a simple approach would be to take the un-truncated sigmoid values and subtract the average of their absolute values from 1, although of course the actually correct thing would be to take the square root of 1 minus the sum of their squares.
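A sketch of that cheap shading, with the gain and the choice of tanh as the sigmoid both arbitrary:

    -- Map screen-space depth slopes to red/cyan/camera light intensities.
    local function redcyan_shade(dzdu, dzdv, gain)
      local r = math.tanh(gain * dzdu)          -- signed, in (-1, 1)
      local c = math.tanh(gain * dzdv)
      local red    = math.max(r, 0)             -- ReLU: light from the right
      local cyan   = math.max(c, 0)             -- ReLU: light from above
      local camera = 1 - (math.abs(r) + math.abs(c)) / 2
      return red, cyan, camera
    end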
Suppose that instead we want to outline objects at discontinuities of depth and discontinuities of slope. The above gradient estimate attempts to smooth over single-pixel discontinuities by eroding them away, but if we skip that step, we can just use the total absolute gradient value (the L1 norm; its Euclidean norm would be more correct) to select where to draw lines. You could use quickselect to find, say, the 96th and 97th percentiles of absolute gradient values, then use a piecewise-linear "smoothstep" to highlight the pixels with those values.
But that only gives us discontinuities of depth. If we want discontinuities of slope, we need an additional level of differencing, to calculate the second-order derivatives of the z-value with respect to (u, v), which are ∂²z/∂u², ∂²z/∂u∂v, and ∂²z/∂v², each of which is a single subtraction if we're satisfied with backward differences, as we should be. These values will be very large for two pixels at depth discontinuities, and somewhat large for one pixel at slope discontinuities; this may be a useful way to approximate the "thick inked lines outline figures, thin inked lines show detail within figures" rule from cartoon drawing.
In this case, although we have three values, we probably shouldn't map them to R, G, and B; it makes some visual sense to have an object diffusely illuminated from different directions with red and cyan lights, but it would make no visual sense to draw vertical lines in red, horizontal lines in blue, and saddle points and edge intersections in green. Instead we should take some kind of norm of this four-element Hessian matrix (which has only three distinct elements because it's symmetric).
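So the second-difference pass might look like this sketch, written with the usual three-point stencils rather than reusing stored first differences, and using the Frobenius norm of the little Hessian as the line-weight signal:

    -- Second differences of the Z-buffer and the Frobenius norm of the
    -- resulting 2x2 Hessian: large for two pixels at depth discontinuities,
    -- somewhat large for one pixel at slope discontinuities.
    local function hessian_norm(Z, u, v)
      local zuu = Z[u + 1][v] - 2 * Z[u][v] + Z[u - 1][v]
      local zvv = Z[u][v + 1] - 2 * Z[u][v] + Z[u][v - 1]
      local zuv = Z[u][v] - Z[u - 1][v] - Z[u][v - 1] + Z[u - 1][v - 1]
      return math.sqrt(zuu * zuu + 2 * zuv * zuv + zvv * zvv)
    end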
Some kind of contouring may be a useful way to help make curved surfaces legible in diagrams; the simplest might be a checkerboard pattern in x, y, and z, formed by frobbing the pixel's brightness a bit with the XOR of some bit from each of the three coordinates. Which bit is chosen determines the scale of the checkerboard pattern along that dimension; if the bits are too significant, the whole image will be within a single cube, while if they're too insignificant, they're much smaller than a pixel, so it amounts to aliasing noise. So ideally you want them to be on the order of 32 pixels in size at the relevant distance.
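A sketch of that checkerboard frob, using LuaJIT's bit library; k picks which bit of the scene coordinates to use, which sets the checker scale (cells of 2^k scene units):

    local bit = require("bit")              -- LuaJIT's bit operations

    -- Returns 0 or 1: XOR of one bit from each of the three coordinates,
    -- which the renderer can use to darken or lighten the pixel a bit.
    local function checker(x, y, z, k)
      local m = 2 ^ k                       -- edge length of the checker cells
      local b = bit.bxor(math.floor(x / m), math.floor(y / m), math.floor(z / m))
      return bit.band(b, 1)
    end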
A drawback of that kind of contouring is that one of the spatial depth cues traditionally used in the visual arts is the amount of detail: foreground objects are drawn in more detail, as if the background objects were out of focus. (Also, they're often reduced in contrast and shifted toward the background color, which makes their details harder to see.) But the suggested checkerboard does just the opposite, putting the most detail in background objects. The standard XOR texture, the three-dimensional version of the Hadamard-Walsh matrix (the parity of the bitwise AND of the three coordinates), and similar fractal volumetric textures seen in a million 1024-byte demos and in My Very First Raytracer are one way to escape from that: they provide detail at a wide range of scales, or even all scales. But I'm not sure how to adjust the level of detail of such a volumetric texture to be coarser at further distances.
Another way to reduce the detail of background objects, related to the sphere-tracing algorithm rather than shading, is to roughen the sphere-tracing termination threshold with distance. If you do this merely proportional to distance, you just get cone-tracing antialiasing, but if you include a faster-growing term such as a quadratic term, you could maybe get background objects to lose surface detail faster than perspective shrinks that surface detail.
A third possible way to add more detail to foreground objects is to perturb the scene SDFs slightly with some kind of noise field to give the surface some kind of texture, but make the noise function drop off at larger z-coordinates, or possibly even middle-clip it (a - (-b ∨ a ∧ b), where a is the original noise function and b is the middle-clipping threshold) with a threshold that grows with distance.
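Middle-clipping is just a clamp and a subtraction; a sketch:

    -- Middle clip: zero out noise values with |a| <= b, and pull larger
    -- values toward zero by b, so distant surfaces lose their texture
    -- first as b grows with distance.
    local function middle_clip(a, b)
      return a - math.min(math.max(a, -b), b)
    end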
In Happy Jumping, iq/rgba demonstrated a technique for perturbing SDFs slightly with a roughly isotropic volumetric texture made from a wavefunction with a feature size that was small relative to the objects but large relative to nearby pixels; the SDF perturbation was tightly clipped, so only its zero-crossings produced a relief on the surface. Thus, it formed smoothly curved random blobby contours on the surface whose foreshortening helped greatly to visually indicate the orientation of the surface. If you were doing such a thing I think you could reduce the level of detail with distance by reducing the amplitude of the higher-frequency components of the wavefunction.