Snow and snowflakes have been on my mind lately. This is not entirely because
of the storm in Texas — rather, I have been thinking since Christmas about
the snowflake’s *shape:* those small, delicate, spiky circles we all know so
well (err, perhaps there is something of current events here after all…).
This is a blog post about snowflake shapes.

What shape *is* a snowflake? In principle, we all know what snowflakes look
like. We can all imagine the six-fold symmetry, and then we can imagine the
embellishments. And yet— not so fast! As a little child I (and I do not
think I am alone in this) wanted to see the shape of a snowflake for myself.
Of course, I never quite could — at least, to the point where I could stuff
and saturate my eyes — because the delicate snowflakes melted away as soon as
my snow-blind eyes could focus on the crystal caught. Think about it, then:
where does your own conception of snowflake-shape arise from? Does it come from
firsthand experience, or from clip-art and wintertime papercrafts?

(Of course, there is something to be said for indirect inference of snowflake-shape. Years later, I would marvel at the fact that you could infer that snowflakes are hexagonal from the existence of 22° haloes, a kind of crystallography at the heavenly scale, linking the macroscopic and microscopic. For now, however, I am concerned with seeing directly, and face to face, with a presence and visual immediacy that cannot be afforded by imagination alone.)

This is no small concern. A year or so ago I visited the California Academy of
Sciences and learned about Wilson Bentley, the first man to photograph a
snowflake. This was on January 15, 1885; he was just shy of 20 years old. To
appreciate this moment, you have to imagine a time *before* January 15, 1885,
when no such photograph existed. Until Bentley’s innovation, no man had seen a
single snowflake for more than a delicate glimpse. Perhaps spatially the
snowflake could be magnified under a microscope (as Hooke and his
contemporaries
demonstrated in the 17th century), but how could anyone magnify it
*temporally?* Bentley’s Edgertonesque miracle — akin to Muybridge’s “horse in
motion,” captured just seven years earlier — is to hold still the moment to
reveal *more* — to give us the opportunity to gaze without urgency. Dare I
say, Bentley *freezes time?* The manipulation of time becomes an act of
unconcealment, and then furthermore an act of creation.
Bentley’s snowflakes outlived him, and they will outlive me.

Wilson Bentley was not a scientist by training, and perhaps that served him well. Over the course of his life he collected over 5,000 images of what he called “ice flowers.” There is something marvelous about this devotion, which spanned years and years, crystal after crystal, wonder undiminished. Surely the 5,000th snowflake brought Wilson as much delight as the 4,999th (and surely it is the case that you know you are in love when you find yourself falling in love afresh every day).

There is also something to the uniformity of Bentley’s collection. Each snowflake is framed carefully in the center of a slide, on a black background. The images are diligently cropped to squares. See for yourself! The University of Wisconsin-Madison hosts an online archive of 1,181 of Bentley’s images, courtesy of the Schwerdtfeger Library. Bentley never sought copyright on his images; they are in the public domain, and so I downloaded them, along with their labeled categories. The categories have wonderful names, such as “crystal with broad branches” and “lumped graupel.”

When displayed in a grid, the snowflakes look like a box of delicious holiday sweets. Surely these delights are made of sugar and not ice.

What does one *do* with 1,181 snowflakes? One searches for patterns, the way
Bentley would never have been able to. Bentley’s images demand to be studied as
samples from a distribution, and they are strikingly well-suited to our
favorite statistical computer vision techniques. For example, though they are
scanned at a uniform 1000x1000 resolution, they still make sense when
dramatically downscaled, even to as low as 32x32. Notice also how the
snowflakes are highly amenable to *data augmentation* on account of their
intrinsic symmetry. Each snowflake can be reflected and rotated arbitrarily to
produce several more high-quality synthetic snowflakes “for free.” So, even
though there are nominally 1,181 unique images, there is much more juice here
than one might expect at first. You can’t always play this game: CIFAR images
can’t really be rotated, and most MNIST digits can’t even be flipped. But
snowflakes are snowflakes.
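A minimal sketch of this kind of augmentation, assuming each snowflake has been loaded as a square 2-D tensor. The helper name `dihedral_variants` and the random stand-in image are illustrative assumptions, not part of any released utilities. Only quarter-turn rotations and mirror flips are shown, because those are lossless on a pixel grid; arbitrary-angle rotations would also need interpolation.

```python
import torch

def dihedral_variants(image: torch.Tensor) -> list[torch.Tensor]:
    """Return the 8 rotated/reflected copies of a square 2-D image."""
    variants = []
    for quarter_turns in range(4):
        rotated = torch.rot90(image, k=quarter_turns, dims=(0, 1))
        variants.append(rotated)
        # Mirror left-to-right to get the reflected twin of this rotation.
        variants.append(torch.flip(rotated, dims=(1,)))
    return variants

# Stand-in for a downscaled 32x32 snowflake scan (random values, not real data).
fake_snowflake = torch.rand(32, 32)
augmented = dihedral_variants(fake_snowflake)
print(len(augmented), augmented[0].shape)
```

Each source image yields eight training images this way, so the nominal 1,181 scans stretch considerably further.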

One February evening I trained (with a lot of help from my friends Gautam and Jack) a baby variational autoencoder, which embeds the snowflakes in a small low-dimensional latent space. Here, “low-dimensional” means 10-dimensional: it seems very reasonable to encode 2^10 data points with 10 parameters. In the image below, the top row consists of five randomly-selected source images, and the bottom row is the VAE’s attempted reconstruction. The contrastive green-and-purple color scheme comes from matplotlib’s default, and reminds me of the delightful pGLO experiments we did in high-school biology.

Of course, as with all things deep-learning, this was easier said than
done. For a technique that falls broadly under “unsupervised learning,” a VAE
requires an awful lot of babysitting! —but then again, one of the joys of
online school is that *nobody can tell* if you’re training machine learning
models behind your Zoom tab. So I trained, and I trained. The current model
takes around 6 hours on my dusty old MacBook, though after the first 30 minutes
the results are already non-embarrassing.

Do you want to build a snowgan? Separately, I also spent some time trying to build a model adversarially. The snowgan (so to speak) consists of two rival sisters: the generator, Elsa, is of the same architecture as the VAE’s decoder, and the discriminator, Anna, learns to distinguish Elsa’s output from real snowflakes. Elsa and Anna compete, and as a result Elsa gets better at camouflaging the fakes among the flakes. Alas, GAN convergence is finicky enough that I lost interest after a few days… open problem!

It turns out that the VAE’s decoder can be used to build a small-scale
snowflake factory. Using ONNX.js and only a little bit of guile, I ported the
VAE’s decoder to JavaScript and attached the latent space to some
`<input type="range">` sliders. By using the sliders to choose parameters in the latent
space and feeding these parameters to the decoder, one can interpolate to
hallucinate snowflakes that are *not* part of Bentley’s training set. With some
experimentation, I found the sliders that correspond to parts of the latent
space that map to outliers; those parameters I clamped to extreme values in the
non-outlier-ey direction. This interface is what you see at the top of this
blog post.

At time of writing, there is only one Google search result for “this snowflake does not exist,” which points to a 2017 master’s thesis in elementary-school education by Elif Dedebaş. The context is an activity where children are making paper snowflakes, and investigating the relationship between the number of folds and the symmetries. One child creates a snowflake-like shape with eight-fold symmetry and is asked if she knows why such a shape is unnatural. Shape!

I imagine this slider infrastructure is on par with the sort of fancy tooling
UI that they have up in the Great Snow Factory In The Sky. I’m reminded of
David Wiesner’s wonderful Caldecott-winning picture book *Sector 7*, which I
recall reading (“reading”) in my elementary school’s library. It is about a
cloud factory, accessed from the top of the Empire State Building. Notice how
the large sousaphone-bell-shaped openings emanating from the factory roof are
exactly a portrait of an autoencoder’s decoder.

“Wiesner visited the Empire State Building on a zero-visibility day to research
this book,” reads the About-the-Author at the end, “He was the only person
there.” There is something of Wiesner in Bentley and Bentley in Wiesner, I
think. Think of them both, and yourself, too, as Friedrich’s *Wanderer*.

The results interpolated by the VAE are surprisingly convincing. Here are some snowflakes that do not exist (snowfakes?). I created these images by sliding the sliders haphazardly and recording a trace of the changing image. Often, people display this kind of interpolation in animated form, but in relation — and, indeed, deference — to Bentley’s life’s work, I would rather see them spread spatially than temporally.

I enjoy trying to work out the exact moment of transition between different kinds of snowflakes; there is a snowy saltation that occurs at certain boundaries that I cannot quite point you to. A curious fact is that a t-SNE plot in the VAE’s latent space shows no clustering with respect to the provided labels; the VAE seems to have hallucinated its own classification scheme, one which is perhaps more macroscopically morphological in nature than Magono and Lee’s 1966 classification scheme.

The BentleyBlizzardBlossoms (B3) dataset, along with some PyTorch utilities and the source code and trained model of the baby autoencoder, is available on GitHub. I think it would make a wonderful alternative to MNIST for small experiments. If you use it for anything, please let me know!

It’s Christmastime again! Shall we talk about trees? This post is about my new holiday hobby: hallucinating tree-shaped fractals using our old friend, gradient descent.

But — let me start at the beginning. For a variety of reasons, the cover
image of Douglas Hofstadter’s *I Am a Strange Loop* has been on my mind
this month. I have been thinking a lot about this wonderful image created by
pointing a camera at a projector that displays what the camera is seeing. In
modern terms: what happens if you screen-share the screen sharing window? This
happens:

Such “feedback loop” recursions often rapidly converge to breathtaking fixed-points. The NYC MoMath even has an art installation that lets you play with a camera-and-projector setup to make wonderful patterns.

As you can imagine, there is a mathematical theory here. At risk of ruining the mystery, I will tell you about it, though without writing any equations. :-) The theory is the theory of Iterated Function Systems, which is almost exactly what it sounds like. Start with a set of affine functions (affine because that’s what camera-projector-systems do) and repeatedly apply them to a set of points, taking the union at each step. Regardless of where you start, you will soon end up with a fractal structure which is the fixed point of the system. Why fractal? —because the self-similarity comes from the infinitely-nested composition of the affine functions.

There are theorems that ensure convergence towards and uniqueness of this fixed
point in the limit, but… it’s believable enough, don’t you think? Following
your intuition, you can build Sierpinski Triangles and Barnsley
Ferns and all sorts of other
beautiful structures using an IFS. I refer you to Section 8.2 of *The
Algorithmic Beauty of Plants* by Prusinkiewicz and Lindenmayer (yes, *that*
Lindenmayer) for more botanical
connections.

Now here is the question that has been on my mind: if I give you a picture of a
fractal-fern, can you give me a set of affine functions whose fixed point is
that fern? This is the *inverse
problem*
for IFSes. According to Wikipedia, this problem has real-world applications in
image compression, but it is *hard* to do in general.

Hmm. Interesting. Can our favorite tool, gradient descent, come to the rescue? At least in an approximate sense? Recall that a two-dimensional affine function is really just a 3x3 matrix (with 6 free parameters) that encodes the details of the transformation. If we could easily compute images of IFS fixed points from a set of matrices, then perhaps we could try to optimize the parameters of these matrices to make the fixed point look the way we want it to.

The challenge, of course, is efficiently *finding* the fixed point —
repeatedly rasterizing and compositing affine transformations on images sounds
expensive — and doing so differentiably sounds like a scalability nightmare!
But there is a wonderful solution to this problem, which is to play the
so-called “chaos game.” Instead of
*densely* applying all the functions to all the points on the plane, you can
*sparsely* approximate the fixed point as follows: Start with a point — any
point! — and *randomly* select one of the affine functions of the IFS and
apply it to the point. Repeat this process with the new point. It turns out
that in the limit, the “trail” left behind by this wandering point converges to
the fixed point of the IFS. (This, too, is a theorem, but again I think it is
believable enough that I will not demand a proof.) This is what the “chaos
game” looks like for the Sierpinski triangle; notice how the salt-and-pepper
spattering of points soon converges boldly into the fractal we know and love
(it’s an animation — stare for a few seconds).
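A minimal sketch of the chaos game for the Sierpinski triangle, in plain Python with no dependencies. The corner coordinates, starting point, seed, and the text-grid rendering are all illustrative choices. Each of the three affine maps simply moves the current point halfway toward one corner.

```python
import random

# The three corners of a triangle; each affine map moves a point
# halfway toward one corner (the classic Sierpinski IFS).
CORNERS = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]

def chaos_game(num_steps: int, seed: int = 0) -> list[tuple[float, float]]:
    """Play the chaos game and return the trail of visited points."""
    rng = random.Random(seed)
    x, y = 0.3, 0.3  # arbitrary starting point
    trail = []
    for _ in range(num_steps):
        cx, cy = rng.choice(CORNERS)       # pick one map at random
        x, y = (x + cx) / 2, (y + cy) / 2  # apply it to the current point
        trail.append((x, y))
    return trail

trail = chaos_game(10_000)

# Crude text rendering: mark every grid cell the trail has visited.
SIZE = 32
grid = [[" "] * SIZE for _ in range(SIZE)]
for px, py in trail:
    col = min(int(px * SIZE), SIZE - 1)
    row = min(int((1 - py) * SIZE), SIZE - 1)
    grid[row][col] = "#"
print("\n".join("".join(line) for line in grid))
```

Even at this coarse resolution, the nested triangular holes should be visible in the printed grid.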

So here is the plan: we start with a point, and play the chaos game with our
current IFS matrices to obtain a *point cloud* that stochastically approximates
the fixed-point. Then, we compare this *point cloud* to our target fern-image
to obtain a “loss.” Finally, we update our IFS matrix parameters to minimize
this loss, until at last the fixed-point converges to the target image. It’s
just “machine learning,” really: we’re learning to play the chaos game!

Here is an outline of the algorithm (with some details elided):

```
import random
import torch

# initialize IFS
# (random_3x3_affine and compare are helpers elided from this outline)
F = [random_3x3_affine() for _ in range(4)]
o = torch.optim.Adam(F, lr=0.0001)
for step in range(100_000):
    o.zero_grad()
    # start at origin
    v = torch.tensor([0., 0., 1.])
    # play the chaos game once
    trace = []
    for _ in range(200):
        # applying an affine function is just
        # a matrix multiplication!
        v = torch.matmul(random.choice(F), v)
        trace.append(v)
    # treat the trace as a point cloud
    loss = compare(target_image, trace)
    # update parameters
    loss.backward()
    o.step()
```

Ah! Actually, there is *one* detail that I *should* explain: how do you compare
a point cloud and an image? My idea is to convert the target image to a
*second* point cloud by uniformly sampling points from it, for example by
rejection-sampling. Then you can compare the two point clouds by the so-called
“chamfer distance,” which is the mean distance from each point to its nearest
neighbor in the opposite point cloud. This is quadratic-time to compute, and
the most expensive part of the whole operation, but with ~100 points in each
set it is quite doable.
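A minimal sketch of a chamfer distance in PyTorch, assuming both point clouds are tensors of 2-D points. The function name and the choice to add the two directional means (rather than average them) are assumptions made for illustration; the post’s own `compare` may differ in such details.

```python
import torch

def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric chamfer distance between clouds of shape (N, 2) and (M, 2)."""
    pairwise = torch.cdist(a, b)  # (N, M) matrix of Euclidean distances
    a_to_b = pairwise.min(dim=1).values.mean()  # each point in a -> nearest in b
    b_to_a = pairwise.min(dim=0).values.mean()  # each point in b -> nearest in a
    return a_to_b + b_to_a

# Random stand-ins for the chaos-game trace and the sampled target image.
cloud = torch.rand(100, 2, requires_grad=True)
target = torch.rand(100, 2)
loss = chamfer_distance(cloud, target)
loss.backward()  # gradients flow back to the points in `cloud`
print(loss.item(), cloud.grad.shape)
```

The pairwise distance matrix is where the quadratic cost comes from: 100 points against 100 points means 10,000 distances per step.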

A pedagogical aside: notice that the “trick” here is really a change in representation. This optimization problem is easier to solve on point clouds than on raster images! An important lesson, that applies across the ML-for-graphics domain… and more broadly to all differentiable programming enterprises.

Remarkably, this zany scheme works quite well! Let’s try it with a heart. (A heart, because it’s an easy low-entropy shape, but also because it felt symbolically appropriate in relation to Hofstadter’s broader philosophical project of souls-within-souls-within-souls.)

When we initialize the system with 4 random affine transformations, the trail left behind by the chaos game is very unimpressive. The sparsity is because I simulate only 200 steps of the chaos game for each step of gradient descent — there’s a tradeoff between noise and computation time, of course, as there is with any form of SGD.

Okay. Time to optimize! Here is what it looks like after 100K steps of optimization with an Adam optimizer (that’s about 40 minutes of wall-time computation on my old MacBook). It looks pretty good, don’t you think? The chamfer distance scheme works!

Now we can export the four affine matrices out from PyTorch and work out the
“true” fixed point with a more expensive offline computation (for example, by
simulating a few *million* steps of the chaos game). It looks like this — not
perfect, and indeed quite crayon-scribble-ey, but clearly *something*
interesting has happened, and the result is convincingly heartlike.

Where is the fractal structure hidden in this heart? This needs some visualization tooling to see clearly. After a bit of JS-canvas-hacking: here is a GIF of the heart that reveals its fixed-point structure. Hearts-in-hearts-in-hearts!

I will admit to blinking a few times when I first saw this animation play out
in full — I did *not* expect it to work, and even now I find myself marveling
at this creation? discovery? of a heart encoded in 24 numbers.

But it still doesn’t look *quite right* — the fractalness isn’t *obvious*.
How come? It’s because our affine transformations stretch and squash the heart
into almost unrecognizable blobs. A final, natural improvement is to force our
affine transformations to be *rigid*, so that there isn’t any squishing or
skewing in the fractal. This is actually not too bad to implement: the solutions
to problems of constraint are typically reparametrizations, and indeed in this
case we can simply reparametrize the matrices in terms of a rotation, an
(isotropic) scale, and a translation. Now we’re down from 6 parameters per
matrix to just 4 — even more magical! The result is a much more “obviously”
fractal-ey fixed point, simply because your eye can more easily pick out the
structural recursion when it is made of rigid motions.
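A minimal sketch of such a reparametrization, assuming the four parameters are ordered as (angle, scale, tx, ty); the function name and the parameter order are illustrative. Strictly speaking, a rotation plus a uniform scale plus a translation is a similarity transform, which is what this builds. Because the matrix is assembled from differentiable operations, gradients reach the four parameters directly.

```python
import torch

def similarity_matrix(params: torch.Tensor) -> torch.Tensor:
    """Build a 3x3 homogeneous matrix from (angle, scale, tx, ty)."""
    angle, scale, tx, ty = params
    cos = scale * torch.cos(angle)
    sin = scale * torch.sin(angle)
    zero, one = torch.zeros(()), torch.ones(())
    return torch.stack([
        torch.stack([cos, -sin, tx]),
        torch.stack([sin, cos, ty]),
        torch.stack([zero, zero, one]),
    ])

params = torch.tensor([0.5, 0.5, 0.1, 0.2], requires_grad=True)
M = similarity_matrix(params)
M.sum().backward()  # any scalar loss would do; this just checks the plumbing
print(M.detach())
print(params.grad)
```

The upper-left 2x2 block is always a scaled rotation, so no amount of optimization can introduce skew or non-uniform stretching.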

Let’s take it for a — pardon — *spin!*

Since I promised you trees, here is some fractal foliage. The target image was a picture of a maple leaf! I’d never thought about this before, but indeed a maple leaf contains within it an echo of the whole tree. (Full disclosure: it takes a couple of randomized restarts to get a really impressive result — the low dimensionality of the fractal’s parametrization leads to the same stuck-in-a-local-optimum woes that differentiable rendering folks face all the time…)

This one uses just three matrices — experimentally, three matrices tend to be
enough to get really good results, and more just lead to very busy and crowded
fractals. (I don’t know if there is a word for number-of-maps-in-an-IFS, but
let’s call it the IFS’s *arity*. In this terminology, I like *ternary* IFSes.)

And finally, since I promised you Christmas trees…

This last animation is actually absolutely *baffling* to me. In part, this is
because of how *treelike* the fractal turned out: here it is overlaid
against the silhouette I optimized against. Can you imagine this is encoded by
just 12 parameters? Odd.

But even more bafflingly: the actual geometry of the recursion is *nothing*
like the tree recursion geometry you and I are used to from CS106! Compare the
GIF above to Barnsley’s Fern: while Barnsley turns each full leaf into two
smaller leaves with “semantic” affine maps, this Christmas tree does all sorts
of *bizarre* uninterpretable cartwheels and somehow almost magically works
itself out in the end. It is mesmerizing to me; I can stare at it for minutes
at a time.

When you think about it, it is quite shocking that a Christmas tree is the
*unique* fixed point of these 3 rigid affine maps. “Beautiful” isn’t the word
for it — neither is “grotesque” (though both words have been used by various
people I have shared this with). There is something just unsettlingly
fascinating about the way something magical *pops out* from three rectangles.
Maybe I should have demanded a proof of that theorem after all…

There is so much more to say about all of this. What shapes are easiest to IFSify, and why? In higher dimensions, can we treat the IFS as a kind of SVD-esque dimensionality reduction scheme for data (i.e. literal point clouds)? What kind of data would it make sense to do this for — where do you find spatial self-similarity? Do point clouds have an inherent IFS “dimensionality” (in terms of, say, the number of affine transformations you need to get a good approximation)? With some more work, could SGD be a legitimate solution to the open problem of fractal compression?

It’s a lot to think about, but for now… well, there is a Calvin and Hobbes gag that begins with Calvin saying “I’ve been thinking” and Hobbes interrupting “On a weekend?” That’s how I’ve been feeling this winter break! Now it is time to shut down the Jupyter kernel and get some rest. Comfortably Numbered will return in 2021!

(Jupyter notebook for this blog post available on GitHub.)

Update (Dec 27): Check out some improvements made by a reader!

Update (Dec 29): I was pointed to a 1998
paper (Melnik and Pollack) that
(independently!) tells *exactly* the same story. I’m struck by how closely the
two expositions align. There is something deeply comforting about rediscovering
a place someone else has visited once-upon-a-time, like finding a flag buried
under the snow on a mountaintop — a kind of storybook enchantment that sits
across the table from loneliness.

When I took Pat and Matt’s wonderful graphics class CS348B (yay PBRT!), one of the most frustrating debugging journeys I had was related to interpreting vectors in the wrong coordinate systems. At the time I’d come up with an idea to catch such bugs statically, and now, 1.5 years later, I finally have found time to write down the idea.

Here is the problem: A vector only has “meaning” with respect to a coordinate
system. In graphics, a vector could be in all sorts of coordinate systems:
*world space*, or *object space*, or *camera space*, or *screen space*, or
*normalized device coordinate (NDC) space,* or *raster space* (some of these
are self-explanatory, but see PBR
§6.1.1-6.2
for definitions). This leads to bugs. If you interpret a “camera-space” vector
as if it were in “world space,” your math will be *all wrong!* You need to
apply the affine transformation `cameraToWorld` before doing any world-space
computation on a camera-space vector. But what if you forget to do that? Your
C++ code will happily compile, but your poor bunny might look like chicken soup
instead… (I learned this the hard way!).

This got me thinking: perhaps a safer `Vector3f` should know *at the type
level* what coordinate system it is in. Rather than a `Vector3f`, perhaps what
we really want is a `Vector3f<WorldSpace>`. You simply can’t add a
`Vector3f<CameraSpace>` to a `Vector3f<WorldSpace>`: that’s a *type error!*
The compiler *makes* you call `cameraToWorld` if you want to do anything with
these two vectors. (All this is very much in the spirit of
torchsaber, by the way.)

I think there are broadly two ways you might want to implement something like
this. One way is to treat `CameraSpace` as a purely formal symbol, just a
*name* or an *annotation*. The compiler’s job is to check these annotations.
You’d have to explicitly break this abstraction inside the implementation of
`cameraToWorld`, but from a pedagogical perspective I think that is exactly
what you want.

The other way is for the tag `CameraSpace` to somehow encode the actual
geometry of what *camera space* “means.” For example, each vector could also
carry around the `Transformation` that gets you to its host coordinate system
from `WorldSpace`. Given this information, the compiler might even be able to
*infer* the implementation of `cameraToWorld` for *free!* Of course, this comes
at a cost, which is that the compiler might not be able to statically check
equality of `Transformations`, because the matrices might not be materialized
at compile-time.

In the rest of this post I’ll quickly literate-ly program option (1) as a proof of concept. It is short and sweet (at least, as “sweet” as any bit of C++ hacking can be).

First, we define our coordinate system “annotations” as dummy classes.

```
class Space {};
class WorldSpace : public Space {};
class CameraSpace : public Space {};
```

Next, we design a `Vec3` template class that takes in such a “tag” as a
template parameter (`CoordinateSystem`), and enforces that the tags are the
same on operations. I’ve implemented the “`+`” operator but you can imagine the
rest; there is nothing sneaky going on here! In fact, this is exactly how
one implements a generic `Vector3` that can be specialized over numeric types
(e.g. `Vector3<float>`, `Vector3<int>`, and so on).

```
#include <type_traits>

template <class CoordinateSystem>
class Vec3 {
  static_assert(
    std::is_base_of<Space, CoordinateSystem>::value,
    "Vec3 annotation must derive from Space"
  );
public:
  double x, y, z;
  Vec3(double x, double y, double z)
    : x(x), y(y), z(z) {}
  // the '+' operator! Both operands must carry the same tag.
  // Taking a const reference lets temporaries (such as the
  // result of a conversion function) appear on the right.
  Vec3<CoordinateSystem> operator+(
    const Vec3<CoordinateSystem>& other
  ) const {
    return Vec3(
      this->x + other.x,
      this->y + other.y,
      this->z + other.z
    );
  }
};
```

(Note that you need to `#include <type_traits>` to get `std::is_base_of`, which
is there just to protect you from trying to make a `Vec3<string>` or something
else uncouth like that. Completely optional.)

…actually, that’s it! We can already get some mileage out of this. For example, this should be okay:

```
Vec3<WorldSpace> p(0, 0, 0);
Vec3<WorldSpace> q(1, 1, 1);
auto r = p + q; // ok!
```

On the other hand, this is perhaps *not* okay:

```
Vec3<CameraSpace> s(2, 2, 2);
auto t = r + s; // NOT OK!
```

And indeed, the compiler complains *and* gives a helpful error message!

```
test.cc:40:14: error: no match for 'operator+'
(operand types are 'Vec3<WorldSpace>'
and 'Vec3<CameraSpace>')
40 | auto t = r + s; // NOT OK!
| ~ ^ ~
| | |
| | Vec3<CameraSpace>
| Vec3<WorldSpace>
```

Of course! How *dare* we add a camera-space vector to a world-space vector? We
need to transform `s` to world-space…

```
Vec3<WorldSpace> cameraToWorld(
  const Vec3<CameraSpace>& vec
) {
  return Vec3<WorldSpace>(
    vec.x * 2, vec.y - 4, vec.z + 8 // or whatever
  );
}
```

…and now `g++` is happy. :)

```
auto u = r + cameraToWorld(s); // ok!
```

Perhaps this trick will make its way into PBRT someday!

Update (Dec 23): Rachit Nigam pointed me to this (very recent) SPLASH paper that considers the same problem. (“See? It’s not just me!”)

Note: Jekyll and Hyde, that is how this blog goes. There are weeks of rational thought and weeks of irrational ramblings. This fall has been much of the latter, for good reason, but here is a break(!) in the clouds.

Okay, okay. The truth is that I wrote this essay for my CS class this quarter (leave it to Pat to assign an essay for a CS class!). But then, I think I have reached the age where every written assignment in college is to be treated as an opportunity to say something I otherwise would not have a chance to say… and so the essay, lightly edited, finds its way to this blog.

In this essay I want to describe two kinds of tricky bugs that might creep into your PyTorch programs. I call the bugs “clogs” and “leaks.” In my mind “clogs” and “leaks” reveal an exciting possible research direction for anyone interested in designing better APIs for automatic differentiation in machine learning.

*Note: The examples presented, though in principle timeless, were tested using
Python 3.8.5 running PyTorch 1.6.0.*

If you are familiar with PyTorch internals, you can skip this section. If not,
a brief review of a wonderful topic: how does PyTorch differentiate your code
for gradient descent? The technical term for PyTorch’s approach is *tape-based
reverse-mode automatic differentiation*. As you perform arithmetic computations
on your variables, PyTorch tracks the intermediate values in a computation
graph. When you want to differentiate a value, you call `.backward()` on that
value. PyTorch then walks *backwards* along this computation graph, computing
the derivative at each step and accumulating them according to the chain rule.
Eventually, the leaf nodes of the graph contain the derivatives you asked for.

Let me give a small example. Suppose we wanted to compute $\frac{d}{dx}\left(2x^2\right)\big|_{x=3}$. We might write a program that looks like this:

```
import torch

x = torch.tensor(3., requires_grad=True)
y = 2 * x**2
y.backward()
print(x.grad)  # tensor(12.), since d(2x^2)/dx = 4x
```

This program generates the following computation graph.

When you call `.backward()`, PyTorch’s automatic differentiation engine walks
backwards along the graph, computing derivatives at *each* step. By the chain
rule, the product of these gives the overall derivative we sought.

Now, consider this simple PyTorch program to compute $\frac{d}{dx}\left(\sqrt{x} + x\right)\big|_{x=4}$. What do you expect to be printed?

```
x = torch.tensor(4., requires_grad=True)
y = sqrt(x) + x
y.backward()
print(x.grad)
```

A casual user or AP calculus student would *expect* to see 1.25 printed, of
course. But what *actually* gets printed is 1. Why?

Ah! I didn’t show you the full program: I hid the imports. It turns out that
the first line of this program is `from math import sqrt`, *not*
`from torch import sqrt`. Now, the Python standard library’s `math.sqrt` is not a
PyTorch-differentiable function, and so PyTorch is unable to track the flow of
derivatives through `sqrt(x)`.

As a result of this bug, backpropagation gets “stuck” on the way back, and only
the derivative of `x`, i.e. 1, is deposited. This is a clog: the gradients
can’t flow! In the computation graph below, the dotted arrow represents the
clog.

The reason calling `math.sqrt()` on a PyTorch tensor is not a runtime error is
that PyTorch tensors implicitly convert to “raw” floating-point numbers as
needed. Most of the time this is a useful and indispensable feature. But I
believe this situation should *at the very least* raise an error or a warning.
While the example I presented was reasonably straightforward, there are *many*
different ways to “clog” backpropagation, with varying degrees of insidiousness
(for example, what happens when you mutate a variable in place?). It can be a
nightmare to debug such situations when something goes wrong — that is, if you
notice the bug in the first place!
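To see the clog side by side with the working version, here is a small sketch that differentiates the same expression twice, once with each `sqrt`. The helper name `grad_at_4` is made up for this illustration. Depending on the PyTorch version, the `math.sqrt` call may emit a warning about converting a tensor that requires gradients to a scalar, but it still runs.

```python
import math
import torch

def grad_at_4(sqrt_fn) -> float:
    """Differentiate sqrt_fn(x) + x at x = 4 and return x.grad."""
    x = torch.tensor(4.0, requires_grad=True)
    y = sqrt_fn(x) + x
    y.backward()
    return x.grad.item()

clogged = grad_at_4(math.sqrt)   # the tensor is silently converted to a float
flowing = grad_at_4(torch.sqrt)  # stays inside the computation graph
print(clogged, flowing)          # 1.0 1.25
```

A cheap defensive habit that follows from this: compare a suspicious gradient against the value you expect by hand at one easy point, as done here with $x = 4$.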

*By the way:* the celebrated “reparametrization trick” that powers variational
autoencoders is really just a workaround for a gradient clog problem. To train
a variational autoencoder, you need to compute the derivative of a sample of a
probability distribution with respect to the distribution’s parameters (e.g.
the mean $\mu$ and variance $\sigma^2$ of a Gaussian distribution).
Unfortunately, naïvely sampling from a parametrized distribution abruptly
truncates the computation graph with respect to the parameters, because the
random number generator is not differentiable all the way through (who *knows*
what *it’s* doing!). The solution is to sample from a standard unit normal
distribution (where $\mu=0$ and $\sigma=1$), and then re-scale the sample
by multiplying by $\sigma$ and adding $\mu$. Of course, multiplication and
addition *are* easily differentiable, and so the gradients can now flow.
Problem solved!
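A minimal sketch of the trick for a single scalar sample; the particular values of `mu` and `sigma` are arbitrary. The noise `eps` is drawn without any reference to the parameters, and the parameters only enter through ordinary arithmetic.

```python
import torch

mu = torch.tensor(1.0, requires_grad=True)
sigma = torch.tensor(2.0, requires_grad=True)

# Sample from a standard normal; this step carries no gradient.
eps = torch.randn(())

# Re-scale and shift: plain arithmetic that autograd can follow.
z = mu + sigma * eps
z.backward()

print(mu.grad, sigma.grad, eps)  # dz/dmu = 1, dz/dsigma = eps
```

The gradient with respect to `sigma` equals the sampled noise itself, which is exactly what differentiating $\mu + \sigma\epsilon$ by hand gives.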

Now, consider this slightly more complicated PyTorch program. We are going to implement a silly reinforcement learning algorithm. Here is the situation: There is a truck driving on the road with constant velocity, and your goal is to catch up to it and drive right alongside the truck. At each timestep you are allowed to choose your velocity, and then you’re told how far you are from the truck.

The setup:

```
import torch

truck_velocity = torch.tensor(3.142)
truck_position = torch.tensor(2.718)

def get_measurement(car_position):
    global truck_position
    truck_position = truck_position + truck_velocity
    return torch.abs(truck_position - car_position)
```

And a simple online gradient-based learning algorithm:

```
my_velocity = torch.tensor(0.01)
my_position = torch.tensor(0.)
for i in range(500):
    my_velocity.requires_grad_()
    my_position = my_position + my_velocity
    loss = get_measurement(my_position)
    loss.backward()
    my_velocity =\
        my_velocity.detach() - my_velocity.grad * 0.01
```

Unlike last time, there’s nothing up my sleeve here — this is all reasonable PyTorch code. This code actually works just fine.

But, if you run it for long enough (say, 1000 iterations), you’ll notice something odd: each step starts taking longer and longer. The algorithm is accidentally quadratic! You can see this behavior quite clearly in this graph, which shows a linear growth in iteration time from step to step (the spikes are garbage collection pauses).

How can this be? Isn’t each loop doing the same calculation?

Here is one hypothesis: if you’ve read this paper you might look to see if
we’re `.detach()`-ing `my_velocity`. The `.detach()` function snips off all
incoming edges to a node in the computation graph, essentially creating an
artificial clog. If we forget to do that, the gradients would “leak” back in
time across multiple steps in the graph, all the way back to the first step,
and each iteration would therefore take longer and longer, just as we’re
observing.

But, alas, this is not the source of the bug: as you can see, we *are*
detaching `my_velocity` when we update it. So, what’s really going on here?

It’s tricky! The leak is in `my_position`, which subtly depends on *all*
previous values of `my_velocity` and therefore makes backpropagation compute
gradients for *all* previous timesteps. The dataflow diagram below hopefully
clarifies this point. Notice how each `velocity` has its parent nodes detached
(thanks to the call to `.detach()`!), but `loss` still has an indirect
dependence on the chain of `positions`.
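To make the growth concrete without running PyTorch, here is a toy model of the dependency structure. The `Node` class, `graph_size`, and `simulate` are hypothetical stand-ins of mine, not PyTorch internals; the sketch only mirrors which value depends on which, and counts how many nodes a backward pass from `loss` would have to visit at each step.

```python
class Node:
    """A stand-in for one tensor in a computation graph (not PyTorch's classes)."""
    def __init__(self, *parents):
        self.parents = parents

def graph_size(root):
    """Count the nodes a backward pass starting at root would have to visit."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(node.parents)
    return len(seen)

def simulate(steps):
    velocity, position = Node(), Node()
    sizes = []
    for _ in range(steps):
        position = Node(position, velocity)  # position = position + velocity
        loss = Node(position)                # loss = measurement(position)
        sizes.append(graph_size(loss))
        velocity = Node()                    # detached update: a fresh leaf
    return sizes

sizes = simulate(5)
print(sizes)  # [4, 6, 8, 10, 12]
```

Even though every new `velocity` is a fresh leaf, the reachable graph gains two nodes per step through the chain of positions. Linear work per step means quadratic work overall.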

Finding the correct place to insert the line
`my_position = my_position.detach()` is left as a not-quite-trivial exercise
to the reader. Beware! Putting it in the *wrong* place will either have no
effect *or* cause `my_velocity` to always have gradient 0.

Just like memory leaks, gradient leaks can be extremely sneaky. They pop up
whenever your inference is “stateful” — think of applications like physics
controllers, reinforcement learning, animated graphics, RNNs, and so on. I
would not be surprised if many popular implementations of such algorithms *do*
have “gradient leak” bugs. However, the bugs usually only manifest themselves
visibly when the inference passes through enough timesteps for the leak to
compound. Just like a dripping tap, you might not notice your losses until you
get the bill at the end of the month… and then, you need to figure out how to
track down the source of the leak and figure out the right way to fix it.

In the long term, how can we protect ourselves from this class of bugs? One
potential solution is to embed the API inside a language whose type system
tracks the creation of the computation graph. You might be able to use
well-understood techniques like *taint analysis* or *linear types* (pun not
intended) which traditionally track the flow of *information*, to now track the
flow of *differentiability* through the program.

Let me be slightly more concrete about this suggestion. In our “clog” example,
a good type system might detect that `sqrt` cuts off the computation graph,
and, knowing that ($y$) does not directly depend on ($x$) in the expected way
anymore, complain at compile-time when we try to request ($dy/dx$). In our
“leak” example, a good type system might notice that the “old” `my_position`
effectively goes out of scope when it is re-assigned, and therefore it might
complain that an unreachable reference to it actually does persist through the
new `my_position`. Such checks seem very reasonable to demand from a modern
type system.

Over the months I’ve started to miss the little speckles of Stanford life that
I never thought I would get attached to. One of them is the regular bell-toll
of our lovely little clock tower. There is something oddly compelling about
public time — time as a utility — time as something to be broadcast for
consumption. So I have a special fondness for clock towers, and the Network
Time Protocol, and the anchor who says “it’s 8am and this is NPR,” and the GPS
system, and people who set their clocks by the atomic clock in Boulder, and the
website time.gov, and the pilot who says “we will be
landing shortly in San Francisco, the local time is 6:15am.” At least, in this
divided world, we can agree on *something*.

Well: This evening I was tuning my guitar, and I discovered some more public time. Here is what happened. First I plucked the low E string, and watched the string blur into a lenslike pair of arcs that envelop the extent of its vibration. I ran the first few frets to try out the sound: F, F-sharp, G, G-sharp. Great! Next, I plucked the A string, and watched the same blur of the string. But when I tried plucking it at the first fret (that would be a B-flat), something bizarre happened. Instead of a vibrating blur, the string slowly wavered back and forth between its two extremes!

What? I rubbed my eyes. A low “Bb” vibrates at around 117 Hz. That is *much*
faster than my eye can resolve (because of this “persistence of vision,”
animations in video games don’t have much incentive to exceed around 60 frames
per second). In short, I should be seeing a blur. But no! I can track the
string’s leisurely wobble quite easily. I bet you can, too: the GIF below shows
an open-string A on the left and a first-fret Bb on the right.

(Sorry for the poor quality: I had to hold my phone between my knees, and it was tricky, okay? You can find the original videos, with sound, on GitHub.)

Here’s what I think is going on. The lights in my room are powered by AC current from the power grid. In the United States, AC current is standardized to transmit at 60 hertz (why 60? apparently it’s a long story, but in part the number 60 is related to turbine hardware logistics). That means that the current peaks in (absolute) voltage 120 times per second, and so the lights in my room are actually all rapidly flickering at 120 Hz. I just don’t realize it because of the persistence of vision.

Ah! As I mentioned above, a Bb vibrates at around 117 Hz — to be precise, it’s 116.54 Hz. You can work this out without looking it up by recalling that concert A is 440 Hz (as printed on every metronome) and the low Bb is two octaves less one half step below — that’s twice twelve minus one or 23 half steps below. Indeed, 440/(2^(23/12)) gives 116.54.

Why is this correspondence between 120 and 116.54 so interesting? Well, when you have nearby frequencies, they interfere to produce “beat” frequencies. I’ve written about beats in the spatial domain before when talking about moiré patterns. Here, the beats are in the temporal domain, forming a kind of “wagon wheel” illusion. The superposition of the string’s vibration and the light’s flickering creates a low-frequency “beat” that gives the illusion of the string vibrating slowly.

Eyeballing the right GIF, I count around 5 oscillations of the string over the 3-second clip. That gives a beat frequency of 1.67 Hz. The difference in interfering frequencies is twice the beat frequency (elementary trigonometry exercise, use the sum-to-product identity). Using that difference, I can compute that my Bb is tuned to 120 Hz - 3.3 Hz = 116.7 Hz. Not bad! It turns out that you can buy “stroboscopic tuners” online, but perhaps if you are good at estimating slow frequencies, you don’t need one…!
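For anyone who wants to redo the arithmetic, here is a short sketch that reproduces both numbers: the equal-tempered pitch of the low Bb, and the estimate from counting wobbles. The variable names are mine, and the factor of two is taken directly from the reasoning above rather than derived here.

```python
CONCERT_A_HZ = 440.0
HALF_STEPS_BELOW = 2 * 12 - 1  # two octaves down, then one half step back up

# Equal temperament: each half step is a factor of 2 ** (1 / 12).
b_flat_hz = CONCERT_A_HZ / 2 ** (HALF_STEPS_BELOW / 12)

# Estimate from the video: about 5 slow oscillations in a 3-second clip.
oscillations, clip_seconds = 5, 3.0
light_flicker_hz = 120.0

beat_hz = oscillations / clip_seconds         # about 1.67 Hz
difference_hz = 2 * beat_hz                   # about 3.33 Hz
estimated_hz = light_flicker_hz - difference_hz

print(f"equal-tempered Bb: {b_flat_hz:.2f} Hz")
print(f"estimated from the wobble: {estimated_hz:.2f} Hz")
print(f"discrepancy: {abs(estimated_hz - b_flat_hz):.2f} Hz")
```

The two values land within a fraction of a hertz of each other, which is about as much as counting wobbles by eye can be expected to resolve.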

Briefly, some thoughts for the future: I’m reminded by my guitar string of Edgerton’s hummingbirds and Muybridge’s horses, and many more moments frozen in time. There is a lot more I could say, and that I have said, and perhaps I will explain this more carefully in a future essay. The confluence of the aural and visual has also been on my mind lately; though I haven’t “blogged” about it I have been thinking a lot about clocking coin tosses by recording the sound a coin makes while it rings. In that story, too, stroboscopic techniques make an appearance. Maybe I will say more about that as well.

For the moment however I am struck — “struck” — by how strange it is that there is so much invisible to us even in plain sight, so much hidden in the folds of time. Public time, and the privacy of time.

I was talking to a dear friend of mine this week, and she unknowingly reminded me of a thought I had last year. I wrote down this essay rawly one November night as an imagined chapter of an imagined book, then by morning forgot it (and the book) unedited. Today’s moment (personal, political, historical) seems as good as any to revisit it.

(Note: Reformatted automatically for web by pandoc. I’m sorry if the delicate math typesetting breaks; then again, for this essay the math is at best incidental to the message.)

The tricky integrals below are known as the Borwein integrals, and, as the Mathematician diligently evaluates them one by one, she is treated to a pattern.

$$\begin{aligned} B_1 = \int_0^\infty \frac{\sin(x)}{x}dx &= \frac{\pi}{2} \\ B_3 = \int_0^\infty \frac{\sin(x)\sin(x/3)}{x(x/3)}dx &= \frac{\pi}{2} \\ B_5 = \int_0^\infty \frac{\sin(x)\sin(x/3)\sin(x/5)}{x(x/3)(x/5)}dx &= \frac{\pi}{2} \\ B_7 = \int_0^\infty \frac{\sin(x)\sin(x/3)\sin(x/5)\sin(x/7)}{x(x/3)(x/5)(x/7)}dx &= \frac{\pi}{2}\end{aligned}$$

The aesthetics of mathematics demand at this point that she formulate
a *conjecture*: that all Borwein integrals ($B_n$) are equal to
($\pi/2$).

Do you believe this conjecture? Perhaps, but our friend the Mathematician is not so easily satisfied. She presses on and continues evaluating the integrals, and she is rewarded with further evidence.

$$\begin{aligned} B_9 = \frac{\pi}{2} \qquad B_{11} = \frac{\pi}{2} \qquad B_{13} = \frac{\pi}{2}\end{aligned}$$

But when she gets to ($B_{15}$), she finds unexpectedly (or is it unexpected?)
that it is ever so slightly *less* than ($\pi/2$). The pattern falls apart, the
conjecture evaporates, and with it our belief.
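For the curious, there is a well-known bookkeeping fact behind the collapse: the value stays at ($\pi/2$) only while the sum ($1/3 + 1/5 + \dots + 1/n$) remains below 1. The sketch below does not evaluate any integral; it only tracks that running sum with exact fractions to see where it first crosses 1. The function name is my own.

```python
from fractions import Fraction

def first_breaking_index(limit=101):
    """Return the first odd n where 1/3 + 1/5 + ... + 1/n reaches or exceeds 1."""
    total = Fraction(0)
    for n in range(3, limit, 2):
        total += Fraction(1, n)
        if total >= 1:
            return n, total
    return None, total

n, total = first_breaking_index()
print(n, float(total))
```

The sum first exceeds 1 at ($n = 15$), which is exactly where the Mathematician's pattern breaks.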

The previous essay was on seeing; this one is on believing, the act of
believing, the art of believing, the ethics of believing. What does it
mean to believe? We might say that to believe is to have beheld *proof*
that expels *doubt*. I turn to Arnold Ross’ Prologue to his eponymous
mathematical summer program held annually at the Ohio State University.
He writes,

In an effective program of mathematical studies, students are propelled by eager curiosity to observe and experiment, thus creating new opportunities for observation. In hopes of unearthing deep relationships, students search for patterns among those observations, formulate adventurous conjectures, prune the many conjectures with the sharp ax of possible counterexamples, and then attempt to endow the surviving conjectures with the security of a proof.

*To endow with the security of a proof.* Proof is comfort in certainty,
it is relief, it is contentment in the establishment of truth, the ghoul
of doubt having been vanquished. In our so-called “post-truth world,” it
can at times be unfashionable to simply believe, and it is
understandable for our Mathematician to have demanded proof to assuage
her well-placed skepticism.

But here matters become complicated. The painting on the right is
Caravaggio’s *Incredulity of Saint Thomas* (1601). It depicts the
apostle Thomas, who, having missed Jesus’ resurrection, demands proof.
Says Jesus, reach and feel my wounds — be not faithless, but believing
— *because thou hast seen me, thou hast believed: blessed are they
that have not seen, and yet have believed* (John 20:29). I look at the
furrowed brow of Thomas, his all but prosecutorial examination of the
evidence, and I wonder: What is the ugly side of *proof?* Is it right
even to use the word *belief* once you have *proof?* Or does it then
simply fade into the landscape of fact, or, dare I say, *reality?* I
think of those haunting words from the American Declaration of
Independence, “We hold these truths to be *self-evident*, that all men
are created equal.” *Self-evident:* not “it can be shown that all men
are created equal” or “assume for the sake of argument that all men are
created equal,” no, this truth is *self-evident*, it denies proof
because proof is unnecessary for an axiom. If even this simple matter of
human equality is up for debate, to be quartered by the Devil’s team of
reasoned and articulate advocates, then how can we proceed at all?

Christianity is not the only religious tradition to find value in the
*unproven*. I am reminded of Lao Tzu’s *Tao Te Ching*; I quote now from
Stephen Mitchell’s translation of the final verse:

True words aren’t eloquent;

eloquent words aren’t true.

Wise men don’t need to prove their point;

men who need to prove their point aren’t wise.

Or we might turn to an ancient Indian hymn:

But, after all, who knows, and who can say Whence it all came, and how creation happened? The gods themselves are later than creation, so who knows truly whence it has arisen? Whence all creation had its origin, the creator, whether he fashioned it or whether he did not, the creator, who surveys it all from highest heaven, he knows — or maybe even he does not know.

The unproven, the unknown — the unprovable, the unknowable — that is
where faith *begins,* where there are stakes to truth.

What, then, do we make of the 17th-century mathematician Blaise Pascal,
who proposed a game-theoretic argument now known as *Pascal’s Wager:*
that a rational person should live as though God exists because it is
the prudent option, on the off chance that He indeed does? Or shall we
invoke the brilliant 20th-century logician Kurt Gödel, whose seminal
work on “incompleteness theorems” introduced the marvelous notion of the
*provably unprovable* theorem — what can we make of Gödel’s
“ontological proof” of the existence of God, which has recently been
formalized in a computer system and computer-verified, in some sense
lending proof to the proof itself? “Ah! Like Doubting Thomas, these men
rely too much upon the crutch of rationality,” we might now complain,
“searching for and clinging to proof as if the blueness of the sky, the
sound of a brook, the smell of a peach were not proof enough of a
Creator.”

But even this, I think, is too simple, because to the Mathematician “proof” is more than the expulsion of doubt. Let us play the game of conjecture again. Pick a prime number, and divide it by four. If the remainder is one, try to write the prime number as the sum of two perfect squares:

$$\begin{aligned} 5 = 4\cdot 1 + 1 &= 2^2 + 1^2 \\ 13 = 4\cdot 3 + 1 &= 3^2 + 2^2 \\ 17 = 4\cdot 4 + 1 &= 4^2 + 1^2 \\ 29 = 4\cdot 7 + 1 &= 5^2 + 2^2\end{aligned}$$

Can this be done with all primes?

You might be more cautious to believe this time, so let me just tell you: it can, the observation is known as Fermat’s theorem, and I can supply two proofs. The first is an elegant argument that relates sums of squares to points on a grid, and reasons carefully about the geometry of that grid to conclude that a point with the desired properties must exist. The proof is enlightening, it reveals among other things the importance and origin of that number 4 that smiles so mysteriously in the theorem’s statement. Knowing the deeper mathematical structure at work, we can begin to generalize Fermat’s theorem in new directions; the theorem becomes in a sense just one shadow cast by this structure.
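If you would rather gather more evidence before accepting my word, here is a small brute-force sketch that plays the game for every prime below 200. It is evidence in the Mathematician's sense, not a proof; the helper names `is_prime` and `two_squares` are mine.

```python
import math

def is_prime(n):
    """Trial division; fine for the small numbers used here."""
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True

def two_squares(p):
    """Return (a, b) with a >= b >= 1 and a*a + b*b == p, or None."""
    for b in range(1, math.isqrt(p) + 1):
        remainder = p - b * b
        a = math.isqrt(remainder)
        if a >= b and a * a == remainder:
            return a, b
    return None

for p in range(2, 200):
    if is_prime(p) and p % 4 == 1:
        a, b = two_squares(p)
        print(f"{p} = {a}^2 + {b}^2")
```

Calling `two_squares` on an odd prime with remainder three, such as 7 or 11, returns `None`, which is the other half of the pattern.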

The second proof, due to Zagier, was published in 1990 as a single
sentence, and is reproduced in its entirety on the right. It
hallucinates for us a truly bizarre object, asserts that it has a
certain symmetry that forces there to be an odd number of solutions to
the equation ($x^2 + (2\cdot y)^2 = p$), and, as zero is not odd,
concludes that there must be at least one solution. This is one of the
strangest proofs I have ever seen; it provides absolutely no insight
into *why* the theorem is true — though it certainly assures me *that*
it is true. Here it is not the theorem, but rather the proof that is the
shadow.

I show you these two proofs to suggest that not all proofs are created
equal. The great mathematician Paul Erdős spoke of The Book, wherein God
wrote the most elegant, insightful proof for every theorem of
mathematics. “You need not believe in God,” he would say, “but you
should believe in The Book.” A professor of mine once asked me if I
would like a copy of “The *Abridged* Book,” which for every conceivable
conjecture simply lists whether or not it is true. It is a shortened
form of the Book, so we can agree to treat its word as proof enough —
or perhaps it only provides succinct, convincing but unilluminating
proofs in the style of Zagier’s — but it says no more than “true” or
“false.” This is not an idle fantasy: already we have found fragments of
The Abridged Book in the form of long computer-generated proofs, such as
the proof of the Four-Color Theorem, which no human could ever check by
hand.

Here is the difference between The Book and The Abridged Book: one would
occupy the Mathematician’s mind for decades, the other would be
immediately cast aside as useless. To the Mathematician, the quest for a
proof is not to establish *that* a theorem is true, but rather *why*.
The proof is not *in* the pudding, the proof *is* the pudding.

This winter I spent some time at a planetarium near Stanford, and I was
told by the announcer that the nearest star outside our solar system,
Proxima Centauri, is over twenty trillion miles away. *Twenty trillion!*
said the announcer, *isn’t that amazing?* And I thought — well, is it?
Would I be ten times as amazed if Proxima Centauri were two hundred
trillion miles away instead? Surely not. I am not sure what to do with
that number. I suspect that you are not, either, because as humans we
are not equipped to appreciate the celestial scale.

Let us agree then that there is really nothing interesting about that
number “twenty trillion” — ah, except for one thing, which is *that we
know it at all;* that we, earthbound, can measure the distances to the
stars. On a cloudless night the stars on the horizon appear to follow
you as you drive by the streetlights; really, it is the streetlights
passing you that create the illusion of relative movement, and the rate
of this movement depends on how far away the streetlights are, as a
consequence of elementary geometry. The same effect occurs at the
celestial scale: as the Earth moves around the Sun distant stars appear
to follow us in relation to the nearest stars, a delicate quiver in the
stars, and from this effect we can measure the distance to Proxima
Centauri. The streetlights are Proxima, our vehicle is Spaceship Earth.
To contemplate this, the consistency of geometry, in the small and in
the large, is to me a spiritual experience that assures me that I am
governed by the same laws of parallax that govern the stars I came from;
that is, it is not just all men but *all entities* that are created
equal before mathematics.
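The elementary geometry can be put in a few lines. The sketch below assumes an approximate parallax for Proxima Centauri of 0.768 arcseconds and an approximate conversion of 1.917e13 miles per parsec; both figures are my inputs, not the announcer's. By definition, a star whose parallax is one arcsecond sits one parsec away, so the distance in parsecs is the reciprocal of the parallax in arcseconds.

```python
PROXIMA_PARALLAX_ARCSEC = 0.768  # approximate value, assumed for illustration
MILES_PER_PARSEC = 1.917e13      # approximate conversion factor

distance_parsecs = 1.0 / PROXIMA_PARALLAX_ARCSEC
distance_miles = distance_parsecs * MILES_PER_PARSEC

print(f"{distance_parsecs:.2f} parsecs")
print(f"{distance_miles / 1e12:.1f} trillion miles")
```

The result lands comfortably above the announcer's twenty trillion.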

There is a poem of Robert Frost’s, “Choose Something Like a Star,” where the speaker pleads for a star to tell him how it burns

And it says “I burn.”

That’s all it says, but then the speaker continues

But say with what degree of heat.

Talk Fahrenheit, talk Centigrade.

Use language we can comprehend.

Tell us what elements you blend.

Frost reminds us of the dignity in the silence of the stars, and that
would be all well and good except for one thing: *that’s not how stars
are*. Stars speak to us when they radiate across time and space, and
they speak openly to anyone with a telescope. It falls on us — that
is, on humanity — to listen, to hear not only the temperature and
elemental composition of the stars, but also their biographies, the
stories of their births and their deaths.

In the afterword to *Pilgrim at Tinker Creek,* Annie Dillard writes

In October, 1972, camping in Acadia National Park on the Maine coast, I read a nature book. I had very much admired this writer’s previous book. The new book was tired. Everything in it was the dear old familiar this and the dear old familiar that. God save us from meditations. What on earth had happened to this man? Decades had happened, that was all. Exhaustedly, he wondered how fireflies made their light. I knew—at least I happened to know—that two enzymes called luciferin and luciferase combined to make the light. It seemed that if the writer did not know, he should have learned. Perhaps, I thought that night reading in the tent, I might write about the world before I got tired of it.

*To write about the world before you get tired of it.* The
Mathematician is not tired of the world, will never be tired of the
world, because she knows the world will never stop speaking to her.

And yes, even though Nature has over the course of this affair revealed
for us some Truth — the numerical distance to a star, its numerical
temperature — it is not that Truth but rather the revelation that was
the true blessing; the number, the Truth, it is just a byproduct, a
projection. I wonder whether in the moment of looking into Christ’s
wounds Thomas thought at all about the little matter of the
resurrection’s authenticity; I wonder if I, seeing Thomas fall to his
knees in understanding, would not myself wish to doubt again so that I,
too, could be shown the light. *How* we know is the light and *what* we
know the shadow; *proof* is the light and *theorem* is the shadow —
this is the Mathematician’s defense of her art.

A 2014 paper by Rozhkovskaya and Reb titled “Is the List of Incomplete Open Cubes Complete?” complicates Sol LeWitt’s celebrated 1974 artwork, “Incomplete Open Cubes”. The artwork is a gallery of 122 three-dimensional structures; they are meant to be all distinct connected subsets of edges of the unit cube that result in a 3D figure — or, in LeWitt’s words, “all the ways of the cube not being complete”. The complication, noted by the paper, is that two of the structures are identical up to rotation; they should instead be mirror reflections of each other.

Have art historians realized yet? I would like to think there is some deeper philosophical significance to this bit of trivia. The exercise in exhaustive enumeration of incompleteness is itself incomplete — and, we cannot help but overlook the incompleteness, just as we cannot help but imagine the missing edges of any of the 122 almost-cubes.

Image source: Incomplete Open Cubes Revisited

I have meditated on the art-historical significance of chirality before, so today I want to talk about something else. I don’t know about you, but when I see LeWitt’s almost-cubes, I feel the need to submerge them in a tank of soapy water. Why? To see what shape the films take, of course! Is that irrational? Perhaps, but writes LeWitt, “Irrational thoughts should be followed absolutely and logically.” It turns out that my thought isn’t hard to follow — if not absolutely and logically, then at least computationally.

The shape of a soap film is a minimal surface, that is, the surface with minimal area that obeys the boundary conditions of the “loop” that the film forms within. The definition doesn’t depend at all on the properties of soap: any film-forming substance should — in the absence of other forces, such as gravity — contract to the same shape as a result of surface tension. These shapes can often be pleasing and unexpected.

Image source: Wikipedia

Now, because finding the surface is a question of continuous minimization, we should be able to easily apply automatic differentiation — at least, to a discretization of the problem. Here is the algorithm: we start with a mesh grid of points and triangulate it. Then, we compute the surface area by repeated application of Heron’s Theorem to the triangles. Finally, we differentiate the total area with respect to the positions of the points, and nudge the points towards less area. The full source code for this adventure is available here. (By the way, I’m using PyTorch out of habit, but perhaps TensorFlow would be more appropriate because of its oddly LeWittian logo.)
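As a rough illustration of that recipe (not the linked source code), here is a dependency-free sketch under simplifying assumptions of mine: the film is a height field over the unit square, the boundary is flat, and central finite differences stand in for automatic differentiation. All function names are hypothetical. The film starts with a bump in the middle and should relax back to a flat sheet of area 1.

```python
import math

def heron(p, q, r):
    """Area of the triangle with vertices p, q, r via Heron's formula."""
    a, b, c = math.dist(p, q), math.dist(q, r), math.dist(r, p)
    s = (a + b + c) / 2
    return math.sqrt(max(s * (s - a) * (s - b) * (s - c), 0.0))

def total_area(z, n):
    """Surface area of an (n+1) x (n+1) height grid over the unit square."""
    h = 1.0 / n
    pt = lambda i, j: (i * h, j * h, z[i][j])
    area = 0.0
    for i in range(n):
        for j in range(n):
            # Split each grid cell into two triangles.
            area += heron(pt(i, j), pt(i + 1, j), pt(i, j + 1))
            area += heron(pt(i + 1, j), pt(i + 1, j + 1), pt(i, j + 1))
    return area

def relax(z, n, steps=200, lr=0.2, eps=1e-6):
    """Nudge interior heights towards less area; the boundary stays fixed."""
    for _ in range(steps):
        grad = [[0.0] * (n + 1) for _ in range(n + 1)]
        for i in range(1, n):
            for j in range(1, n):
                old = z[i][j]
                z[i][j] = old + eps
                up = total_area(z, n)
                z[i][j] = old - eps
                down = total_area(z, n)
                z[i][j] = old
                grad[i][j] = (up - down) / (2 * eps)  # central difference
        for i in range(1, n):
            for j in range(1, n):
                z[i][j] -= lr * grad[i][j]
    return z

n = 4
z = [[0.0] * (n + 1) for _ in range(n + 1)]
z[2][2] = 0.2  # poke a bump into the middle of the film
initial_area = total_area(z, n)
relax(z, n)
final_area = total_area(z, n)
print(initial_area, final_area)
```

Finite differences cost two area evaluations per point per step, which is exactly the expense that automatic differentiation avoids on a finer mesh. For the non-planar loops, the boundary heights would be set to follow the cube's edges instead of zero.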

Before I give you the results, I want you to take a moment to try and imagine what the soap films will look like. I think it’s an interesting exercise in qualitative reasoning, and, in any case, LeWitt’s work all but pleads for us to see what isn’t there.

For starters, we can reduce the entire problem from 122 cubes into just a handful of cases. Simply ask: how many distinct-up-to-symmetry “loops” are there on a cube? Let’s do casework by face count: there is 1 loop with one face (the “square”), 1 loop with two faces (the “L”), and 2 loops with three faces (the “U” and the “corner”). The rest are accounted for by complement: a four-face loop is also a two-face loop.

This yields a total of 4 cases — much more manageable! And indeed, I’ll suggest that you start with Case 1, the “square”, since the answer in that case should be obvious.

Okay, now, the results!

Case 1 is easy; we already know the answer. The soap film is just flat, because its boundary is planar.

Case 2 is less obvious: I would not have predicted this outcome! It turns
out that the soap film tries to “flatten” itself into the hypotenuse. What are
the implications of this? Well, one implication is that if you want to minimize
the material you use to build a tent (without regard to volume), you can do
better than stretching the material taut into large rectangles. Sagging
actually *saves* material.

You might expect Case 3a to flatten similarly, but because of the up-down symmetry that can’t happen. Instead, you get a beautiful saddle shape. Actually, it turns out that all minimal surfaces are saddle-like; if they had nonzero mean curvature at some point then you should be able to “flatten the bulge” to reduce surface area.

And finally, Case 3b — sorry this isn’t quite perfectly modeled — tries to flatten itself around the center, kind of like a tortilla chip.

Summary of this post: It turns out that by carefully mixing forward-mode and reverse-mode automatic differentiation, you can greatly simplify certain hyperparameter optimization algorithms.

Here’s a fun recursive idea: just like how we optimize the parameters of
machine learning models by gradient descent, we can also optimize their
*hyperparameters* by gradient descent. This is by no means a new idea; you can
find 20-year-old papers that discuss (for example) optimizing gradient descent
step sizes by gradient descent itself.

Broadly, there are two ways to do this. The “short-term” way is to take a
*single* step of gradient descent, and then based on how that goes, adjust the
hyperparameters. That’s the premise of the 2018 paper Online Learning Rate
Adaptation with Hypergradient Descent: it
turns out that you can do this in an extremely lightweight way. You would hope
that over time the hyperparameters converge to something optimal alongside the
parameters (and indeed that is the case, though you have to be careful).

The “long-term” way is to train *several* steps and *then* backpropagate
through the *entire training* to adjust the hyperparameters. The hope is that
this provides a stronger “signal” to the hyperparameter-gradient, which is
better for convergence. But, this comes at the (tremendous) expense of having
to store a copy of the entire computation graph for several steps of training,
and then backpropagate through it. If your model has 1GB of parameters, you
need to store ~1GB worth of numbers for *each step*. You can imagine that adds
up over the course of many epochs, and so the algorithm is severely limited by
how much RAM you have.

The 2015 paper Gradient-based Hyperparameter Optimization through Reversible Learning makes this work by throwing away the intermediate steps of the computation graph and then JIT-re-computing it “in reverse” during backpropagation. It’s a clever technique, but kind of hairy and hard to implement (you have to be super careful about numerical precision).

Here’s a graph that visualizes these two techniques (from this
paper). You’re looking at several parallel
loss curves (($\log(f)$) is the loss plotted on a log scale), pointing towards
you, arranged in order of the “step size” hyperparameter
(that’s ($\log(\alpha)$)). The orange curve represents a “short-term”
hyperparameter optimization, which is allowed to move along the “step size”
axis at each step. The “long-term” hyperparameter optimization instead
optimizes directly on the thick black dashed “U” — that is, after *several*
steps of training. You can see how the latter is smoother, but also much harder
to compute.

To summarize: the short-term way is cheap but noisy. The long-term way is expensive but less noisy. Can we get the best of both worlds?

Well, let’s think more carefully about the source of the expense. The problem
is that we need to store (or be able to reconstruct) the full computation graph
in order to do backpropagation. Okay, but do we really *need* to do
backpropagation? The only reason we backpropagate is that it’s more efficient
in the case when you want derivatives with respect to *many* different
variables. If you have millions of model parameters, backpropagation is
millions of times faster than the much simpler forward-mode (“dual numbers”)
automatic
differentiation.
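To see where that cost comes from, here is a minimal dual-number sketch. The `Dual` class, `_lift`, and `forward_gradient` are illustrative names of mine, and only addition and multiplication are supported. Each forward pass yields the derivative with respect to a single seeded input, so a full gradient takes one pass per input.

```python
class Dual:
    """A number value + eps*e with e**2 == 0; eps carries a derivative."""
    def __init__(self, value, eps=0.0):
        self.value, self.eps = value, eps

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.eps + other.eps)
    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (a + a'e)(b + b'e) = ab + (a b' + a' b)e
        return Dual(self.value * other.value,
                    self.value * other.eps + self.eps * other.value)
    __rmul__ = __mul__

def _lift(x):
    """Treat plain numbers as constants (zero derivative)."""
    return x if isinstance(x, Dual) else Dual(x)

def f(x, y, z):
    return x * y + 3 * z * z + x

def forward_gradient(func, point):
    """One forward pass per input: seed that input's eps with 1."""
    grad = []
    for k in range(len(point)):
        args = [Dual(v, 1.0 if i == k else 0.0) for i, v in enumerate(point)]
        grad.append(func(*args).eps)
    return grad

gradient = forward_gradient(f, [2.0, 5.0, 1.0])
print(gradient)  # [6.0, 2.0, 6.0]
```

Three inputs mean three passes, and a million parameters would mean a million passes. Backpropagation gets the whole gradient in one backward sweep, at the price of remembering the graph.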

But we *don’t* have millions of *hyperparameters!* Step size, for example, is
just a single number. The Adam optimizer only has a total of 4 hyperparameters.
With this in mind, backpropagation isn’t even the right choice — we *should*
be using dual numbers for the hyperparameter optimization. On the other hand,
we should still be using backpropagation for the “inner loop” that optimizes
the (non-hyper-) parameters. That is, we want to do something like this:

```
initialize hyperparameters
# loop to optimize hyperparameters
while True:
    initialize parameters
    # loop to optimize parameters
    for i in range(100):
        run model on data to get loss
        # using reverse-mode!
        compute d(loss) / d(parameters)
        update parameters using hyperparameters
    # using forward-mode
    compute d(loss) / d(hyperparameters)
    update hyperparameters
```

(Update: I discovered that this was suggested in the 2017 paper Forward and Reverse Gradient-Based Hyperparameter Optimization. But keep reading — while the paper’s implementation needs a lot of math to be worked out manually, I’m going to show you how to implement this in a way that makes all the math in the paper fall out “for free”…)

This proposal raises a logistical question: how do we reconcile these two
automatic differentiation algorithms in the same program? **The “trick” is to
“thread” dual numbers through a backpropagation implementation.** In other
words, implement backpropagation as usual, but rather than `float` type
numbers, exclusively use `dual_number` type numbers (even when doing derivative
calculations). Initialize the system such that the dual numbers track
derivatives with respect to the hyperparameters you care about. Then, your
final loss value’s attached ($\epsilon$)-value *immediately* gives you
`d(loss)/d(hyperparameters)`. No backpropagation needed, and so it’s safe
to “forget” the computation graph.
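Here is a toy version of the threading idea, with heavy simplifications that are mine: the “model” is a single weight with loss ($w^2$), and a hand-written gradient function stands in for a real reverse-mode implementation. The `Dual` class and the `train` function are hypothetical names. What matters is that the gradient step itself is computed in dual numbers, so the final loss carries its derivative with respect to the step size.

```python
class Dual:
    """value + eps*e with e**2 == 0; eps tracks d(value)/d(step size)."""
    def __init__(self, value, eps=0.0):
        self.value, self.eps = value, eps

    def __sub__(self, other):
        return Dual(self.value - other.value, self.eps - other.eps)

    def __mul__(self, other):
        if not isinstance(other, Dual):
            other = Dual(other)  # plain numbers are constants
        return Dual(self.value * other.value,
                    self.value * other.eps + self.eps * other.value)
    __rmul__ = __mul__

def loss_fn(w):
    return w * w

def loss_gradient(w):
    # Hand-written stand-in for what reverse-mode would compute: d(w^2)/dw.
    return 2 * w

def train(step_size, steps=10, w0=5.0):
    alpha = Dual(step_size, 1.0)  # seed: d(alpha)/d(alpha) = 1
    w = Dual(w0, 0.0)             # the initial weight does not depend on alpha
    for _ in range(steps):
        w = w - alpha * loss_gradient(w)  # ordinary gradient descent step
    return loss_fn(w)

final = train(0.1)
print(final.value)  # loss after training
print(final.eps)    # d(loss)/d(step size), with no stored graph
```

For this toy problem the answer can be checked by hand: the final loss is ($w_0^2 (1-2\alpha)^{2T}$), so its derivative with respect to ($\alpha$) is ($-4 T w_0^2 (1-2\alpha)^{2T-1}$). The negative sign says a larger step size would have lowered the loss.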

That’s it! That’s all I wanted to share in this blog post! :)

Of course, it’s not obvious how to implement this in PyTorch, since you can’t
naïvely do `tensor(dtype=dual_number)`. Rather than hack in a custom numeric
data type that implemented dual numbers, I wrote my own tiny implementations of
forward- and reverse-mode automatic differentiation. It’s just a couple dozen
(very-recognizable-to-automatic-differentiation-enthusiasts) lines of code. I
was careful to make each implementation generic in the kind of “number” it
accepts. That allowed me to run the reverse-mode algorithms using the
forward-mode data types.

Running it on a simple quadratic optimization problem, we can see that updating the hyperparameter ($\alpha$) yields better-looking loss curves — in this GIF, we’re discovering that we should increase the step size. Yay!

I think this is a very neat trick, which surely has other cool applications (for example, differentiating through long-running physics simulations!). If you’re curious, check out this IPython notebook for the full source code. Feel free to adapt it to your ideas. Then, tell me what you’ve built!

]]>“I do not know what ‘poetical’ is: is it honest in deed and word? Is it a true thing?” — Lotman quoting Pushkin quoting Shakespeare in
*As You Like It*, 3.3.

How do I want to say this? Here is one way: Earlier this year, the world was *very surprised* at a computer doing *very unsurprising* things. The celebrated computer program, OpenAI’s GPT-3, is a *language model*, a machine learning model that (to simplify) outputs the statistically-likeliest completion of a sentence, much like the autocomplete on your iPhone keyboard. If we wrote the prefix

The weather is warm and…

GPT-3 might output “sunny” or “humid” — that would be probable and unsurprising. But it would likely not output “turtledove” — that would be improbable and surprising. Of course, you can run this program again and again to sample more and more words, and thus generate long sequences of text.

- The weather is warm and [sunny]
- The weather is warm and sunny [so]
- The weather is warm and sunny so [let’s]
- The weather is warm and sunny so let’s [go]
- The weather is warm and sunny so let’s go [to]
- The weather is warm and sunny so let’s go to [the]
- The weather is warm and sunny so let’s go to the [beach.]

In principle this is nothing new. There was a GPT-1 and a GPT-2, too, as you can imagine. But GPT-3 is bigger and better — it has “read” an *enormous* corpus of text to build up its database, and the model itself consists of *175 billion* parameters (GPT-2 was a “mere” 1.5 billion parameters). Just training a model of this size is an exciting engineering accomplishment.

By virtue of its size, GPT-3 is extremely good at hallucinating text in this way. And this is where things begin to get uncanny. If you phrase your prompt as a question, GPT-3 reliably provides answers in complete sentences. If you phrase your prompt as a few coding examples, GPT-3 can generate software to build small user interfaces based on a description of what buttons you want. It can even generate convincing hoax blog posts. Don’t worry, what you are reading right now is written by a human — but it is probably not unreasonable to be a little concerned.

I think it is natural to ask whether it can generate “literary” text as well — whether it is a kind of superintelligent monkey that *will* generate *Hamlet* if you ask it to. Now, it’s pretty clear that GPT-3 is not quite a Great Automatic Grammatizer (I have written in the past about what *that* may look like). But for short fragments, as lots of people have shown, it isn’t embarrassing either. And so, as we hurtle towards a future where statistically-sound “literary text” can be mass-produced, I have been thinking a lot about what “literature” is.

One set of thoughts comes from an unlikely (so to speak) source, which is the early-1970s literary theory of Yuri Lotman (or Jüri, or Jurij — you choose how to spell it — this becomes relevant soon). In *The Structure of the Artistic Text* (1971, but I’m referencing Ronald Vroon’s 1977 translation and Ewa Thompson’s guide), Lotman connects Shannon’s and Kolmogorov’s quantitative theories of communication and entropy with the qualitative question *what is literature?*. At least, that is my understanding — I’m not an expert in Russian literary theory or semiotics, but I’m trying my best.

Let’s think back to one of Shannon’s big ideas, which is that language is *redundant*, which means there are many ways to convey the same idea. Because of its redundancy, language is resilient to noise in the communication channel (such as loud music at a party). If I stprt intrdducinq mutnt1ons into my spellings, you can still more or less get the message. Similarly, most ideas can be paraphrased. Or, to put it differently, you can convey a concept in many different ways. That is, notions can be materialized into words in multiple fashions. If there were no redundancy in the language — that is, if every combination of letters formed a different valid word, or if there were only one way to express each idea — then this error-correction would be impossible. Imagine if each volume in Borges’ Library of Babel was a meaningful masterpiece! Imagine spilling a teardrop on a page, only to change the meaning dramatically!

This redundancy is the reason why text is so *compressible*, why a ZIP file can be so small compared to the source it is compressing. You can just strip out the redundancy to get a smaller file: your disk’s I/O channel has negligible noise compared to a loud party, and so the redundancy is not needed. (Any introductory theory-of-computation course touches on *Kolmogorov Complexity*, a kind of theoretical limit to compressibility in terms of computability: it turns out that Kolmogorov worked with language and literature as well.)

Perhaps this redundancy makes GPT-3’s behavior a little less surprising (I hope that word “surprising” is accruing meaningfulness for you). The more redundancy a language has, the more about the statistical structure of language you can “learn” from data, and therefore the easier it is to predict likely completions. Indeed, GPT models can be used to perform data compression.
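A quick way to see the link between redundancy and compressibility is to compare how well a general-purpose compressor does on English prose versus on random bytes of the same length. This is only an illustrative sketch using Python's standard `zlib` module: the sample passage and the helper name `compression_ratio` are my own choices, and zlib is a far weaker model of language than GPT-3.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more redundant)."""
    return len(zlib.compress(data, 9)) / len(data)


english = (
    "Language is redundant, which means there are many ways to convey the "
    "same idea. Because of its redundancy, language is resilient to noise in "
    "the communication channel, such as loud music at a party. Most ideas can "
    "be paraphrased. You can convey a concept in many different ways. Notions "
    "can be materialized into words in multiple fashions."
).encode("ascii")

# Seeded pseudo-random bytes of the same length: no structure to exploit.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(english)))

english_ratio = compression_ratio(english)
noise_ratio = compression_ratio(noise)
print(f"english: {english_ratio:.2f}  noise: {noise_ratio:.2f}")
```

The prose shrinks while the noise does not, because the compressor has nothing predictable to strip out of the noise.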

Now, my understanding of Lotman’s thesis is that *literature does not behave this way*. Literature is *more entropic*, which is to say, literature is *less redundant*. It is packed with meaning — in fact, every choice is intentional and significant. Lotman argues that literature lives outside or above “ordinary” natural language; a work of literature creates its own “code,” and each word takes on a heightened meaning in this code. To read literature is to learn to decipher this code. He writes, *The artistic text is an intricately constructed thought. All its elements are meaningful elements.* Once a text is composed, it is baked to an immutable, brittle crust. “The reader considers the text placed before him as the only possible text.”

For example — and this is my example, not Lotman’s — the repetition in a line from *Lear*

No, no, no, no! Come, let’s away to prison:

is exact and precise. Four times, no more, no less — and each utterance of the word “no” is significant, dripping with intention. The raw sound “no” itself becomes a kind of meaningful sign, detached from its usual English-language semantics. You cannot, without losing meaning, compress that text as simply “4 x no.”

Let me give you another example. Consider Roethke’s “The Waking.” It’s a villanelle, a poem built on repetition. Every other stanza ends with the same line.

I wake to sleep, and take my waking slow.

I feel my fate in what I cannot fear.
I learn by going where I have to go.

This repeats.

Of those so close beside me, which are you?

God bless the Ground! I shall walk softly there
And learn by going where I have to go.

Until, suddenly and jarringly (in the softest way possible), it doesn’t.

Great Nature has another thing to do

To you and me; so take the lively air,
And, *lovely*, learn by going where to go.

That word *lovely* appears so deliberately in that line that its absence in the previous lines is conspicuous. As a result, those repeated lines, which seemed like blind repetitions, become conscious choices. Roethke could have said *whatever he wanted* but he chose to repeat himself verbatim, without the word “lovely,” in what Lotman might call a kind of “minus-device,” a trembling vacuum, an absence with presence — that in itself is meaning, that in itself holds entropy.

It would not do for Roethke to paraphrase those lines, either. “I educate myself by traveling where needed” has no impact. Because literary text is stripped of redundancy, even small mutations (even semantics-preserving ones) can dramatically affect the text. It’s like compiling for embedded targets with `-O4`. Like `gcc`, Lotman does not believe you can paraphrase literature to get equivalent literature. You cannot re-tell a story, because there are no synonyms in the unique code of that particular work of literature. I am reminded of Borges’ “Pierre Menard, Author of the Quixote,” a short story wherein Menard *writes* (not re-writes) *Don Quixote*.

Pierre Menard did not want to compose *another* Quixote, which surely is easy enough—he wanted to compose *the* Quixote. Nor, surely, need one be obliged to note that his goal was never a mechanical transcription of the original; he had no intention of *copying* it. His admirable ambition was to produce a number of pages which coincided—word for word and line for line—with those of Miguel de Cervantes.

Maybe, in light of the discussion above, these bizarre lines are a little less mystifying, a little more satisfying.

That brings me back to GPT-3, and the literary status of its output. Can we ever call a GPT-3-generated poem a work of literature? Well, following Lotman, let’s instead ask this: how much information, really, does GPT-3 output? How entropic is GPT-3’s output? And of course the answer is that GPT-3 is quite orderly in the grand scheme of things. “Whenever the text does not realize one of at least two possibilities,” writes Lotman, “but automatically follows one, it loses its capacity to convey information.” That is, every time GPT-3 argmaxes its way to a prediction predestined by its statistical model, which is really a synthesis of gigabytes and gigabytes of data, it is outputting… well, nothing. No new information. The occasional choice made by the random sampling is the only “information” being produced, though it is hidden in the fluffy sheep-skin of statistically-unobjectionable prose. If GPT-3 completed the prompt-word “yesterday” with Roosevelt’s “a date which will live in infamy” speech half the time, and the Beatles’ “All my troubles seemed so far away” lyrics the other half the time, then GPT-3 would only really have communicated one bit of information; the rest is redundant. Indeed, if you were to measure the number of bytes read from `/dev/urandom` (or whatever RNG) by GPT-3 in the course of sampling, and analyze the relative probabilities predicted by the model, perhaps you would be able to compute a Shannon entropy (or even a convincing literariness score) for the output. I sense it would not be very high.
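The “one bit” arithmetic can be spelled out in a few lines. This is a sketch with made-up probabilities, not measurements from GPT-3; the function names `entropy_bits` and `surprisal_bits` are hypothetical. The first computes the Shannon entropy of a single choice, and the second adds up how much information a sampled sequence carries, given the probability the model assigned to each token it actually emitted.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of one choice among outcomes with these probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def surprisal_bits(chosen_probs):
    """Total information, in bits, of a sequence, from the probability of each chosen token."""
    return sum(-math.log2(p) for p in chosen_probs)


# The "yesterday" example: a 50/50 choice between two continuations.
coin_flip = entropy_bits([0.5, 0.5])

# A step where the model always follows the single likeliest token.
forced = entropy_bits([1.0])

# One real choice followed by three predestined tokens.
sequence = surprisal_bits([0.5, 1.0, 1.0, 1.0])

print(coin_flip, forced, sequence)
```

However long the predestined continuation runs, the total stays at the one bit contributed by the initial choice.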

I want to end this piece with two notes.

First: I see a couple of problems with Lotman’s theory — again, I’m not a literary theorist by trade so I do not know whether he (or others) have addressed them.

- How does Lotman account for the way early (oral) poetry, with its strict rhyme and metrical patterns, was designed to be memorized? Isn’t the poetical form, in a very direct sense, optimizing itself for compression and therefore less entropic, as if the redundancy were to harden the message against the noise of *time* and *amnesia*? Or do these qualities themselves encode and exfiltrate information, like the stress-patterns in Sanskrit verse, or the knots knitted by the women in *A Tale of Two Cities*? Lotman spends some time at the beginning of Chapter 6 discussing how poetry removed itself from ordinary speech, and then prose removed itself from poetry, and therefore prose is twice-removed from ordinary speech. But that does not quite answer my question.
- How does Lotman account for the fuzziness of texts (for example, the many editions of *Hamlet* that editors *to this day* Frankenstein together to try to formulate a coherent whole)? How does this theory stand in the absence of an original source text?

Second: Lotman means all this in a serious technical sense. As far as I can tell, he is not just borrowing scientific buzzwords to give legitimacy to an otherwise fuzzy theory. For example, he cites the quantitative work of Shannon and Kolmogorov directly, even though it was (even for the time) primitive. The preface to his book suggests rhetorically that the only way to test these theories would be to compute empirical metrics on an unthinkably large set of poems. Later Lotman suggests a kind of semantic analysis of poetry, but says

The operations recounted above give only a general and deliberately rough semantic skeleton, since a description of all connections arising in the text and of all extra-textual relations which could be ascertained would be an unrealistic task in terms of sheer volume.

But today, we can do that — take exactly those measurements that Lotman (and Kolmogorov and Shannon) could only have dreamed of! It is as if Lotman predicted the relativistic warping of the stars, and we had to wait decades for the eclipse to make the confirming observations. Practical technology has finally caught up to theory.

And so I think there is potential for unprecedented scholarship here. This is a question where deep learning enthusiasts and literary theorists and semioticians all have something valuable to say. I wonder what the conversation will be like? It is comforting to note that our hero Kolmogorov was making this appeal as early as 1964:

]]>But real progress in this direction demands that cyberneticians [today’s deep learning researchers?] take a greater interest in the humanities and learn more about them.

Another magic trick. Watch these two grids overlap to reveal a secret image! Can you spot the phantom horse?

If my janky CSS animation didn’t work for you, here is another way to see it: simply shrink the top grid slowly until the horse “pops” out. It’s hard to show this effect consistently across displays/browsers because of varying pixel spacings and various browsers’ aggressive anti-aliasing tactics, but here is a screen recording I captured.

Magic trick? Not quite. Graphics folks will immediately recognize this effect
as a curious instance of the dreaded *moiré pattern,* caused by the
interference between two almost-but-not-quite-the-same grids. Researchers work
really hard to escape such “glitchy” artifacts — they are often a result of
sloppily sampling a texture in a grid, leading to jagged edges or moiré streaks
(image below from Wikipedia).

In fact, a lot of
effort has
gone into finding “good” sampling patterns that avoid moirés; for example, in
an odd crossover of sampling theory and anatomy, if you look at the
distribution of photoreceptors in a monkey’s
retina they are neither in a grid
*nor* uniformly random, but rather arranged in a *low-discrepancy* pattern to
avoid exactly this kind of artifact.

But for all the hard work gone into avoiding these patterns, you kind of have to admit that they are beautiful and mysterious. In fact “moiré” gets its name from a desirable form of silk weave that exhibits this phenomenon owing to interference from the weave (photo from Wikipedia again). Similarly, this sculpture (and a lot more art) exploits moiré patterns for their hypnotic beauty.

There are lots of real-world ways you can play with this effect, and indeed soon enough you can’t stop seeing it everywhere you go (this is the joy and frustration of being interested in visual phenomena):

- Point a camera at a checkered shirt; the camera’s pixel grid interferes with the fabric pattern (I’ve been told not to wear checkered shirts when giving talks at certain venues where everything is recorded).
- Align two kitchen strainers and gently bend them to see the patterns change.
- Watch a screen door’s shadow interfere with the screen itself; poke the screen a bit to see the patterns change.
- Watch the wire fences on either side of a pedestrian walkway above a road interfere with each other.
- Look at an office chair from behind, and watch the mesh of the back interfere with the mesh of the seat. Depending on their relative curvature you can get some beautiful effects.

These patterns are seemingly random, like marbled paper. But can we control this randomness? Satisfyingly, an on-demand moiré such as the effect above is extremely simple to generate using differentiable programming in PyTorch.

What I am optimizing is a *small distortion* to the image of the grid such that
when the grid and its distortion are superimposed, the resulting image has low
L2 difference from a given target, such as the horse image. What exactly is
this “distortion”? Imagine printing the image on play-dough and deforming it by
stretching and squeezing parts. The result is a warping of the image. To
express this warping, each pixel is assigned a “motion vector” which tells it
where to go. To make sure the motion vectors are smooth and continuous between
adjacent pixels, I actually learn a heavily downsampled set of motion vectors
(even a 12-by-12 array gets good results) and then upsample it to full
resolution with bicubic interpolation.

Here is the wonderful thing: all of these operations are built in to PyTorch and differentiable, apparently to support Spatial Transformer Networks. So, a straightforward 30-line implementation of the above algorithm is all it takes to start “learning” a moiré grid perturbation. Here is a Jupyter notebook with all you need to get started.
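For readers who want the shape of the computation without opening the notebook, here is a rough sketch of the setup described above. It is not the notebook's code: the function names, the 4-pixel grid period, the learning rate, and the choice to compare the raw overlay against the target (with no blurring first) are all assumptions of mine, and I make no claim about how well these particular settings converge. The PyTorch calls it relies on are `F.interpolate` with `mode="bicubic"` for the upsampling and `F.grid_sample` for the differentiable warp.

```python
import torch
import torch.nn.functional as F


def make_grid_image(size, period=4):
    """Line grid as a (1, 1, size, size) tensor: 0 on the lines, 1 elsewhere."""
    on_line = torch.arange(size) % period == 0
    lines = on_line[:, None] | on_line[None, :]
    return (~lines).float()[None, None]


def warp(image, coarse_motion):
    """Warp `image` by a coarse motion field, upsampled bicubically to full size."""
    _, _, h, w = image.shape
    motion = F.interpolate(coarse_motion, size=(h, w), mode="bicubic",
                           align_corners=False)
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing="ij")
    identity = torch.stack([xs, ys], dim=-1)[None]  # (1, h, w, 2), x first
    grid = identity + motion.permute(0, 2, 3, 1)    # add per-pixel motion vectors
    return F.grid_sample(image, grid, mode="bilinear", padding_mode="border",
                         align_corners=True)


def fit_moire(target, steps=800, lr=0.1, coarse=12):
    """Gradient descent on the coarse motion field; returns it with the loss history."""
    base = make_grid_image(target.shape[-1])
    coarse_motion = torch.zeros(1, 2, coarse, coarse, requires_grad=True)
    optimizer = torch.optim.SGD([coarse_motion], lr=lr)
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        overlay = base * warp(base, coarse_motion)  # superimpose grid and warped grid
        loss = F.mse_loss(overlay, target)          # L2 difference from the target
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return coarse_motion.detach(), losses


# Stand-in target: a soft bright disc (the post uses a horse image instead).
size = 64
ys, xs = torch.meshgrid(torch.linspace(-1, 1, size), torch.linspace(-1, 1, size),
                        indexing="ij")
target = torch.exp(-4 * (xs**2 + ys**2))[None, None]
motion, losses = fit_moire(target, steps=20)
```

The motion vectors live in `grid_sample`'s normalized coordinates, where the image spans `[-1, 1]` along each axis, so a value of `2 / size` corresponds to a shift of roughly one pixel.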

How much computation does it take to get good results? See this animation: it converges extremely rapidly, in just 800 iterations or so of plain gradient descent (a couple of minutes on my laptop).

I’m of course not the first person to try this trick; I found references from
2001 and 2013. I think my results look significantly better than those
presented in the paper *Variational Approach to Moire Pattern Synthesis*
(Lebanon and Bruckstein 2001), even though my technique
is much simpler than their gradient-based technique. On the other hand, I think
the more mathematically-motivated results presented in *Target-Driven Moiré
Pattern Synthesis by Phase Modulation* (Tsai & Chuang
2013) are much more impressive, even though they have a
slightly different setup than mine. They also had the resources to print out
their patterns on transparent film and overlay them physically, as well as to
mess with a camera whose Bayer pattern was known! And they suggest using this
trick for cryptography. Very neat.

To understand what they do we need to understand the mechanics of moiré
(instead of leaving it all to SGD!). I’ve always been told that when two
high-frequency signals such as grids are superimposed, they can alias to form
interesting low-frequency content (famously there is this
comic). But I’ve never been given more detail than
that high-level overview, so I looked it up myself. It’s actually quite
wonderful and principled. Here is my understanding as informed by *A
Generalized Fourier-based Method for the Analysis of 2D Moiré Envelope-forms in
Screen Superpositions* (Amidror 1994), specifically by
Figure 4 reproduced below.

First, we need to come up with a useful characterization of grids. Recall that
you can take a 2D signal (i.e. image) and perform a 2D Fourier transform, which
is *also* an image. Each point in frequency-space represents a set of stripes
(i.e. a sine wave) rotated to match the point’s angle with respect to the
origin (see (a) and (b) in the figure for examples of what I mean). The
frequency of the stripes is given by the distance from the point to the origin,
and the amplitude is given by the point’s brightness (complex-valued
brightnesses let you set phase). To get “sharp” stripes instead of a “smooth”
sine wave gradient, you have to add “harmonics” at regular multiples of that
point (recall the Fourier transform of a square wave if that helps). This is
(d) and (e) in the figure from the paper.

Next, we note that superimposing two sets of stripes is like pointwise
multiplication (if you treat zeros as “occlusion”, then multiplying by zero
always sends the result to zero as desired). Because pointwise multiplication
is convolution in frequency space, the superposition looks like a *lattice*
spanned by the harmonics in frequency space. This lattice is depicted in (f) in
the figure.

The key to the illusion is that this lattice might include points much closer to the origin than either of the harmonics that span it! Because the human visual perceptual system acts as a low-pass filter, the result is that you see a low-frequency “phantom” in the superposition. This is visible in (c) in the figure.
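The lattice argument is easy to check numerically. The sketch below is my own illustration rather than anything from Amidror's paper: it takes two line gratings with the same period, one rotated slightly, enumerates the integer combinations `m*f1 + n*f2` of their fundamental frequency vectors, and finds the combination closest to the origin. The function names, the period of 4 pixels, and the 5 degree rotation are arbitrary assumptions.

```python
import math


def grating_frequency(period, angle_deg):
    """Fundamental frequency vector (cycles per pixel) of a line grating."""
    theta = math.radians(angle_deg)
    f = 1.0 / period
    return (f * math.cos(theta), f * math.sin(theta))


def lattice_points(f1, f2, max_order=2):
    """Nonzero integer combinations m*f1 + n*f2, as ((m, n), vector) pairs."""
    points = []
    for m in range(-max_order, max_order + 1):
        for n in range(-max_order, max_order + 1):
            if (m, n) != (0, 0):
                vector = (m * f1[0] + n * f2[0], m * f1[1] + n * f2[1])
                points.append(((m, n), vector))
    return points


def lowest_moire(f1, f2, max_order=2):
    """The lattice point nearest the origin: the most visible low-frequency term."""
    return min(lattice_points(f1, f2, max_order),
               key=lambda item: math.hypot(*item[1]))


f1 = grating_frequency(period=4, angle_deg=0)
f2 = grating_frequency(period=4, angle_deg=5)

order, freq = lowest_moire(f1, f2)
moire_period = 1.0 / math.hypot(*freq)
print(order, moire_period)
```

The winning combination is the difference of the two fundamentals, whose magnitude is `2 * f * sin(angle / 2)`. For these numbers that gives a phantom period of about 46 pixels, more than ten times coarser than either 4-pixel grating.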

Tsai and Chuang use this exact insight to reverse the moiré effect analytically, with excellent results.

**Open questions:** I think there’s a lot more to do here. In my mind, the holy
grail is a T-shirt you can wear which, *only when photographed*, reveals a
secret message. Another neat application would be a wire mesh rolled into a
cylinder such that as you rotated it, the moiré patterns animated into
something cool (perhaps a galloping horse, as a nod to Muybridge). There is
some precedent for this in “barrier-grid” animation (see, e.g. these vinyl
record
covers),
but the benefit of a moiré-based approach would be that the barriers need not
be as thick, allowing for finer detail.

Another direction to go in is audio moiré; that is, could we make two sounds, both above the range of human hearing (about 20 kHz), such that when played together you hear a perfectly clear, audible voice?

As you might have guessed, I think naïve gradient descent should be able to get us a long way there…

]]>