This summer I was briefly in Vancouver, and after dinner on my last day I found myself walking several blocks back to my hotel. The night was dark but the sidewalk was brightly lit and it was a lovely journey. Along the way, I took this picture:

What struck me is how the streetlight casts two shadows, blue and yellow, and moreover those shadows appear opposite the yellow and blue lamps, respectively. What could possibly be going on?

Here is an explanation: the two lamps together create a kind of grayish-white light that bathes the sidewalk. Where the yellow lamp is occluded, the blue light is dominant, so the shadow is blue. Similarly, where the blue lamp is occluded, the yellow light is dominant, so the shadow is yellow.

Looking at this scene I’m reminded of painter Wayne Thiebaud’s rich, saturated shadows. You could say the perceptual effect here demonstrates that “white light” is the sum of all wavelengths, a fact we learn in grade school (I think there is an exhibit at the SF Exploratorium with a similar concept). But to me, this also demonstrates the range of what we are willing to call “white.” If one of the lamps were to burn out, our eyes would adjust to the blue or yellow almost immediately, and we would still see the sidewalk as gray — we experience a truly remarkable “color constancy” across lighting conditions. In this way, when he paints a shadow as a saturated, non-gray color, Thiebaud sees beyond his own seeing.

The other day my toilet broke and I involuntarily learned a lot about how flushing works. My friend suggested an analogy: flushing a toilet is like a neuron’s activation, in that once a critical threshold is met, there’s an “all-at-once” response.

That got me thinking, could we implement deep neural networks in plumbing? It turns out, the answer is yes! A very simplified model of a flush toilet’s nonlinear behavior is as follows: it’s a bucket, into which water can be poured, and there is a hole at height $h \geq 0$. If you pour in volume $v$ of water into the bucket, the output that flows out of the hole is $\text{ReLU}(v - h)$.

The second component we need to build a neural network is a linear map. We can do this by attaching a branching pipe to the hole. This component will have $k$ branches with cross-sectional areas $A_1, A_2, \dots, A_k > 0$. By conservation of mass and a simple pressure argument, the fraction of the incoming water that pours out of branch $i$ is $A_i / \Sigma_j A_j$.

Together, these components allow us to compute a function from $\mathbb{R}\rightarrow \mathbb{R}^k$, which looks something like $\text{PeLU}(v, \vec{A}, h) = \text{ReLU}(v - h)\cdot \vec{A} / \Sigma_j A_j$. Here, “PeLU” stands for “Porcelain-Emulated Linear Unit.” It is clear how to vectorize this expression over $v$, which effectively creates a new kind of neural network “layer” with trainable parameters $\vec{A}$ and $h$ for each input dimension. To enforce the positivity constraint on $h$ and $A_i$, we will actually work with the following key equation: $\text{PeLU}(v, \vec{A}, h) = \boxed{\text{ReLU}(v - h^2) \cdot \text{softmax}(\vec{A})}$.

All that is left to do at this point is to implement this in PyTorch and train it.

```
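import torch

class PeLU(torch.nn.Module):
    def __init__(self, in_feat, out_feat):
        super().__init__()
        # One hole height per input bucket (squared in forward() to stay non-negative).
        self.heights = torch.nn.Parameter(
            torch.randn(in_feat))
        # One row of branch "areas" per bucket (softmaxed in forward() so each row sums to 1).
        self.weights = torch.nn.Parameter(
            torch.randn(in_feat, out_feat)
        )

    def forward(self, X):
        # Water above each hole spills out: ReLU(v - h^2).
        X = torch.nn.functional.relu(X - self.heights ** 2)
        # Each bucket's outflow is split among the output branches.
        X = X.matmul(self.weights.softmax(dim=1))
        return X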
```

Here, I built a PeLU layer that can be slipped into any PyTorch model, mapping `in_feat` inputs to `out_feat` outputs. Next, let’s stack some PeLU layers together and train the result on the classic “Iris” dataset, which has 4 features and assigns one of 3 labels. We will create a “hidden layer” of size 3, just to keep things interesting.

```
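# `iris` is not a published package; presumably a local module exposing the
# feature tensor `feat` and the label tensor `labl`.
from iris import feat, labl

m = torch.nn.Sequential(
    PeLU(4, 3),
    PeLU(3, 3)
)
o = torch.optim.Adam(m.parameters(), lr=0.01)
lf = torch.nn.CrossEntropyLoss()
for i in range(10_000):
    o.zero_grad()
    pred = m(feat * 10)
    loss = lf(pred, labl)
    loss.backward()
    o.step()

# Report the final loss and the fraction of misclassified flowers.
print('Loss:', loss)
print('Error:', 1. - ((torch.argmax(pred, dim=1) == labl) * 1.).mean())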
```

This trains very quickly, in seconds, and gives an error of 2%. Of course, we haven’t split the dataset into train and test sets, so we may be overfitting.

By the way, you may have noticed that I multiplied `feat` by 10 before passing it to the model. Because of the conservation of mass, the total amount of water in the system is constant. But each bucket “loses” some water that accumulates below the hole. To make sure there’s enough water to go around, I boosted the total amount.

But that’s all that needs to be done! Once we export the parameters and write a small visualizer…

Isn’t that a delight to watch? I may even fabricate one to have as a desk toy — it shouldn’t be hard to make a 3D-printable version of this. If we wanted to minimize the number of pipes, we could add an L1 regularization term that enforces sparsity in the $\vec{A}$ terms.

Some final thoughts: this system has a couple of interesting properties. First, there’s a kind of “quasi-superposition” that it allows for: if you pour in more water on top to incrementally refine your input, the output will automatically update. Second, the “conservation of mass” guarantees that the total water output will never exceed the total water input. Finally, it’s of course entirely passive, powered only by gravity.
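The conservation property is easy to check numerically. Here is a minimal sketch in plain Python (no PyTorch), assuming non-negative inputs and a single layer; the helper names `softmax` and `pelu_layer` and all of the parameter values are made up for illustration, not taken from the trained model.

```python
import math

def softmax(row):
    """Turn a list of real numbers into positive fractions that sum to 1."""
    m = max(row)
    exps = [math.exp(r - m) for r in row]
    total = sum(exps)
    return [e / total for e in exps]

def pelu_layer(volumes, heights, weights):
    """One PeLU layer: bucket i spills ReLU(v - h^2), split by softmax(weights[i])."""
    out = [0.0] * len(weights[0])
    for v, h, w_row in zip(volumes, heights, weights):
        spilled = max(v - h ** 2, 0.0)
        for j, frac in enumerate(softmax(w_row)):
            out[j] += spilled * frac
    return out

# Hypothetical parameters for a 4-input, 3-output layer (not trained values).
volumes = [5.1, 3.5, 1.4, 0.2]
heights = [0.5, -1.0, 1.5, 0.3]
weights = [[0.2, -0.1, 0.4], [1.0, 0.0, -1.0], [0.3, 0.3, 0.3], [-0.5, 2.0, 0.1]]

outputs = pelu_layer(volumes, heights, weights)
water_in, water_out = sum(volumes), sum(outputs)
print(water_in, water_out)
```

Because every softmax row sums to 1, the only water that goes missing is what stays trapped below each hole, so the output total can only be smaller than the input total.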

This has me wondering if we can build extremely low-power neural network inference devices by optimizing analog systems using gradient descent in this way (a labmate pointed me to this, for example).

Below is a little widget you can use to enjoy playing with PeLU networks. All inputs must be between 0 and 10. :)

[Interactive widget: input fields for sepal length, sepal width, petal length, and petal width, each in cm.]

Speculation on what (other than its own frail body) that bird-in-the-hand might signify has always been attractive to me, but especially so now thinking, as I have been, about the work I do that has brought me to this company. (full)

And just because you want to paint a bird, do actually paint a bird, it doesn’t mean you’ve accomplished anything. (full)

Waite: What do you call this?

Steichen: I use the same term the sculptor did, oiseau, a bird.

Waite: What makes you call it a bird, does it look like a bird to you?

Steichen: It does not look like a bird but I feel that it is a bird, it is characterized by the artist as a bird.

Waite: Simply because he called it a bird does that make it a bird to you?

Steichen: Yes, your honor.

Waite: If you would see it on the street you never would think of calling it a bird, would you?

[Steichen: Silence]

Young: If you saw it in the forest you would not take a shot at it?

Steichen: No, your honor. (more)

But who knows what Fabritius intended? There’s not enough of his work left to even make a guess. The bird looks out at us. It’s not idealized or humanized. It’s very much a bird.

And then there is this fourth level, which is a whole new object in the world: the prop made for the movie, the representative of the thing, becomes, in its own right, a whole other thing, a whole new object of desire. . . . There are several people who own originals, and I have been attempting to contact them and reach them, hoping that they will let me spend a few minutes in the presence of one of the real birds, maybe to take a picture, or even to pull out the hand-held laser scanner that I happen to own that fits inside a cereal box, and could maybe, without even touching their bird, I swear, get a perfect 3D scan. And I’m even willing to sign pages saying that I’ll never let anyone else have it, except for me in my office, I promise. I’ll give them one if they want it. And then, maybe, then I’ll achieve the end of this exercise. But really, if we’re all going to be honest with ourselves, I have to admit that achieving the end of the exercise was never the point of the exercise to begin with, was it? (full)

Plate 357

(Numenius Borealis) is the only instance in which the subject appears dead in the work of John James Audubon.

He waits alone, sketching angels from the shade—

a kind of heavenly bird, he reasons with himself—although their wings are broke, faces scarred, each fragile mouth feigning the same sad smile as the one before it. Offered triple his price to paint a likeness of the pastor’s daughter—buried for more than a week—he reluctantly agreed—times being what they are.

. . . . .

And yet he studies it—from behind the dunes—studies its several postures, grounded and in sudden flight—and not content to praise it from a distance, to sacrifice detail, unpacks his brushes and arranges them before raising his rifle and taking aim.

Hello and welcome to the 198th Carnival of Mathematics, a roving monthly roundup of mathy blog posts from around the Internet! Longtime Comfortably Numbered readers should be no strangers to the Carnival: I hosted #134, #148, and #159 in the past.

As is traditional, I want to start by thinking a bit about the number 198. It is actually a very dear number to me — I even own a “198” t-shirt! Why, you ask? The number 198 is the emblem of the wonderful CS198 program at Stanford, which hires a huge team of undergraduates to help teach the introductory computer science courses every year. (As far back as 1995, the faculty running the program published a retrospective on its impact on campus.)

Here is another fun fact about the number 198: suppose I start looking at all the integers ($1, 2, 3, 4, 5, \dots$), keeping only the ones that are perfect powers ($1, 4=2^2, 8=2^3, 9=3^2, 16=2^4, \dots$), and then taking their cumulative averages ($1$, $2.5=(1+4)/2$, $4.33=(1+4+8)/3$, $5.5=(1+4+8+9)/4$, $7.6=(1+4+8+9+16)/5, \dots$). You might wonder, are there any other integers in this sequence? It turns out, *yes!* And I bet you can guess the first one…
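If you would rather search than guess, here is a small sketch that rebuilds the sequence of running averages and reports the positions where the average is a whole number. The function names and the search limit of 1000 are my own choices for illustration.

```python
from itertools import count

def perfect_powers(limit):
    """Return the sorted perfect powers m**k (k >= 2) up to limit, counting 1 once."""
    found = {1}
    for base in count(2):
        if base * base > limit:
            break
        power = base * base
        while power <= limit:
            found.add(power)
            power *= base
    return sorted(found)

def integer_averages(limit):
    """Yield (n, average) whenever the mean of the first n perfect powers is an integer."""
    total = 0
    for n, p in enumerate(perfect_powers(limit), start=1):
        total += p
        if total % n == 0:
            yield n, total // n

hits = list(integer_averages(1000))
print(hits)
```

The first entry is the trivial one (the average of just $1$); the second entry is the number this Carnival is named after.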

Now, time for some links! I’ve provided some “teaser” text from each submission to get you interested.

The BRSR blog asks: can you find a Euclidean triangle on a non-Euclidean surface?

A question that came up in a math chatroom (yes, I’m the kind of nerd who spends time in math chatrooms): find a “Euclidean” triangle on a non-Euclidean surface. More exactly, find a geodesic triangle on a surface with non-constant Gaussian curvature having the same sum of interior angles as an Euclidean triangle.

On his blog bit-player, Brian Hayes asks: why do pandemics peak in distinct waves as the months go by?

I’m puzzled by all this structural embellishment. Is it mere noise—a product of random fluctuations—or is there some driving mechanism we ought to know about, some switch or dial that’s turning the infection process on and off every few months?

I have a few ideas about possible explanations, but I’m not so keen on any of them that I would try to persuade you they’re correct. However, I *do* hope to persuade you there’s something here that needs explaining.

On Twitter, Thien An asks: would you rather have a bishop and a knight, or two bishops?

To get this figure, it suffices to grab 100k games and only keep the positions with BB+pawns v.s. BN+pawns and see who eventually wins. That’s very simplistic since there are indeed many other important positional features, but I was actually expecting less convincing results…

Looking at a video of a pomegranate sorting mechanism, Nisar Khan asks: how much do the pomegranate sizes vary between sorted buckets?

After watching the below video, thought of finding the approximate size differences of pomegranates sorted in different boxes…

On his math blog, Tony asks: did studying maths help Emma Raducanu win the US Open?

I’ve discussed toy examples in public lectures on game theory, which (it seems to me) is relevant to choices players make - whether to serve to the forehand or backhand, and where to expect for your opponent to serve, for example. I very much doubt if players ever analyse in these terms, but they are intuitively doing game theory when making their tactical decisions.

But I think in the case of Raducanu there is a more general point…

On Risen Crypto, Trajesh asks: how does the quadratic sieve factoring algorithm really work?

The Quadratic Sieve is the second fastest algorithm for factoring large semiprimes. It’s the fastest for factoring ones which are lesser than 100 digits long.

On The Universe of Discourse, Mark Dominus asks: why is the “S” combinator named “S”? He also posted a nice puzzle that is easily solved if you happen to know Dilworth’s theorem; separately, Jim Fowler built a musical PICO-8 game where you race to find the chains guaranteed by Dilworth’s theorem.

I thought about this for a while but couldn’t make any progress. But OP had said “I know I have to construct a partially ordered set and possibly use Dilworth’s Theorem…” so I looked up Dilworth’s theorem.

On their blog “Gödel’s Last Letter and P=NP,” Ken Regan and Dick Lipton discuss “baby steps” in math, in the context of some exciting recent results.

Reckoned against a later paper by Cohn and Elkies, [Viazovska’s] improvement was 0.0000…000001. The recent post where we discussed the phrase “the proof is in the pudding” involves a number with six more zeroes than that. These are *not* what we mean by “baby steps.”

And finally, something more about sequences— on her math blog, Tanya Khovanova explores the top 50 largest numbers to start OEIS sequences.

My son, Alexey Radul, wrote a program that finds the largest numbers to start a sequence in the Online Encyclopedia of Integer Sequences (OEIS). To my surprise, the top ten are all numbers consisting of ones only.

That’s all for this month’s Carnival! What a joy to see — once more — the math blog world *overflowing* with curiosity. Hosting the Carnival of Mathematics truly is a delight.

Want even more math? The previous Carnival, the 197th, was hosted by Jeremy at Math $\cap$ Programming, and the next — the 199th, to which you can contribute here, starting now! — will be hosted by Vaibhav at DoubleRoot. See you there!

Surely *something* will happen, which means that if you integrate over all
possible outcomes, the probabilities must add up to 1. This elementary fact
remains true, even if the probabilities of the outcomes change for some reason.
Are you reminded of something? *Fluids* behave in the same way: no matter how
the particles are distributed, the total quantity of fluid is conserved, which
means that if you integrate over all of space, the masses must add up to 1.
Indeed, when we talk about *probability density functions* and *probability
mass functions*, we are borrowing fluidic vocabulary.

I get excited about analogies like this, where the physics of one domain
promises to unexpectedly deliver insight into a different domain. Another
example of this is how spring constants and capacitances play the same
mathematical role in the differential equations that model their respective
mechanical and electrical systems. Harry Olson’s wonderful book *Dynamical
Analogies* shows some more
beautifully-illustrated examples of mechanical, electrical, and acoustic
systems with identical dynamics.

Can we get any mileage out of the probability-fluid analogy? It turns out,
*yes!* We can apply our physical intuition about fluids to design a clever
statistical machine learning algorithm. That’s the subject of this piece. By
the way, you can undoubtedly find better (and more correct) expositions of this
math in other places; my goal here is to emphasize the interaction between
physical intuition and probabilistic applications, bridged so beautifully by
the fluidic analogy.

A little bit about the problem setup: sometimes, when working with data, you find yourself in a place where you want to sample from a probability distribution $p(x)$. This might not be so bad if you can compute $p(x)$ exactly — for example, you can use rejection sampling or something. Often, however, you can only compute some scaled quantity $q(x)$ such that $p(x) \propto q(x)$ with some unknown constant of proportionality. Commonly, this happens in Bayesian inference, where you may want to sample from $\pi(x \mid y) = \pi(x,y)/\pi(y)$ where $y$ might be data you’ve seen and $x$ might be the parameters to your model. Here, the problem is that the normalizing constant $\pi(y)=\int \pi(x,y)dx$ is hard to compute because that integral is intractable in high dimensions.

One famous way to do this kind of sampling is by the Metropolis-Hastings
algorithm, which approximates $p(x)$ with the stationary distribution of a Markov
chain. The “miracle” of the algorithm is that the transition probabilities
between two states only depend on the *ratios* of the probability densities at
those states — the unknown constant simply cancels out!
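To make the cancellation concrete, here is a minimal random-walk Metropolis sketch (the symmetric-proposal special case of Metropolis-Hastings). The target `q`, its arbitrary scale factor of 7, the step size, and the function names are all assumptions of mine for illustration.

```python
import math
import random

def metropolis(q, x0, step, n_steps, rng):
    """Random-walk Metropolis: only the ratio q(proposal) / q(x) is ever used."""
    x = x0
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)
        # The proposal is symmetric, so we accept with probability min(1, ratio).
        if rng.random() < q(proposal) / q(x):
            x = proposal
        samples.append(x)
    return samples

def q(x):
    """Hypothetical unnormalized density: a unit-variance Gaussian centered at 3, times 7."""
    return 7.0 * math.exp(-0.5 * (x - 3.0) ** 2)

rng = random.Random(0)
samples = metropolis(q, x0=0.0, step=1.0, n_steps=50_000, rng=rng)[5_000:]  # drop burn-in
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```

Changing the 7 to any other positive number leaves the chain's behavior untouched, since the factor appears in both the numerator and the denominator of the acceptance ratio.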

But now let’s think about this from a fluids perspective. Have unknown normalizing constants popped up in physics anywhere? Yes! The partition function $Z$ in the Boltzmann distribution is often unknown. Recall that a system at thermodynamic equilibrium at temperature $T$ has probability $p_i \propto e^{-U_i/k_BT}$ of being in state $i$, where $U_i$ is the potential energy of that state and $k_B$ is a physical constant that you can find, e.g. on Wikipedia. A curious application of this distribution is working out how thin the atmosphere is at altitude $h$ by taking $U_h$ to be the gravitational potential energy at altitude $h$: because $U$ is linear in $h$, we can instantly conclude that atmospheric density decreases exponentially with altitude.
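Spelled out, the atmosphere argument looks roughly like this, assuming an isothermal column of gas whose molecules have mass $m$ under a uniform gravitational acceleration $g$ (the symbols $m$, $g$, $\rho$, and $H$ are mine, introduced only for this sketch):

$$
U_h = mgh
\quad\Longrightarrow\quad
\rho(h) \propto e^{-U_h/k_BT} = e^{-mgh/k_BT},
\qquad
\frac{\rho(h)}{\rho(0)} = e^{-h/H},
\quad
H = \frac{k_BT}{mg}.
$$

Notice that the unknown partition function $Z$ never needs to be computed: it cancels in the ratio $\rho(h)/\rho(0)$.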

But returning to machine learning, here is the insight that the fluidic analogy makes possible: if we set $U(x) = -\log q(x)$ then a particle simulated at some “temperature” $T=1/k_B$ in potential $U$ would eventually have probability proportional to $e^{-U(x)}=q(x)$, which is to say probability $p(x)$, of being at $x$. So, we can use what we know about the dynamics of particles that are part of a fluid at thermal equilibrium to sample from probability distributions that may have nothing at all to do with fluids.

What exactly does it mean to “simulate a particle” of this probability fluid? You can take those words pretty literally, as it turns out. There are two sources of motion at each timestep: $U$ and $T$. Given potential $U(x)$, we can derive the force $-\nabla U$ that acts on our particle, and then apply Newton’s Second Law to infer the velocity. We’ll assume that velocity is roughly independent between timesteps because the particle is repeatedly being bumped around by other particles and having its velocity reset — so, no need to track acceleration. By the way, notice that $\nabla \log q(x)$ doesn’t depend on the unknown normalization constant, because the logarithm turns it into an additive constant, which disappears under the derivative.

As for the effect of $T$, we can model the thermal motion of the particle as Brownian motion: that is, at each time, the particle gets bumped to some random neighboring position drawn independently from a Gaussian that is somehow scaled with respect to “temperature” $T$. (How do we translate $T=1/k_B$ to this un-physical world where “joules” and “kelvins” make no sense? Stay tuned…)

This fluidic analogy leads us to what people call the “Unadjusted Langevin Algorithm.” The algorithm is quite simple to state and implement:

- Initialize some $x_0$.
- Using some small step size $\Delta t$ integrate $\Delta x = \Delta t \cdot \nabla\log q(x_t) + \sqrt{2\Delta t} \cdot \xi_t$ where you sample $\xi_t \sim N(0,1)$; the Greek letter $\xi$ is chosen for maximum squiggliness, in order to act as a visual metaphor for its function. Importantly, this step only requires you to evaluate the unnormalized density $q$!
- That’s it: under some (not *too* mild but still reasonable) assumptions, snapshots of the trajectory $\{x_t\}$ are distributed approximately according to $p(x)$, with the approximation improving as $\Delta t$ shrinks!
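The steps above can be sketched in a few lines of plain Python. This is a one-dimensional illustration under my own assumptions: the target is a hypothetical unnormalized Gaussian $q(x) = 7e^{-(x-3)^2/2}$, and the names `ula_sample` and `grad_log_q`, the step size, and the burn-in length are all choices made for the example.

```python
import math
import random

def ula_sample(grad_log_q, x0, dt, n_steps, rng):
    """Unadjusted Langevin: x += dt * grad log q(x) + sqrt(2 dt) * xi, with xi ~ N(0, 1)."""
    x = x0
    trajectory = []
    for _ in range(n_steps):
        xi = rng.gauss(0.0, 1.0)
        x = x + dt * grad_log_q(x) + math.sqrt(2.0 * dt) * xi
        trajectory.append(x)
    return trajectory

def grad_log_q(x):
    """Gradient of log q for q(x) = 7 * exp(-(x - 3)^2 / 2); the factor 7 has vanished."""
    return -(x - 3.0)

rng = random.Random(0)
trajectory = ula_sample(grad_log_q, x0=0.0, dt=0.05, n_steps=100_000, rng=rng)
samples = trajectory[5_000:]  # discard the initial transient
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```

The sample mean and variance should land near 3 and 1. The variance comes out slightly too large, because the finite step size introduces a small bias that the "unadjusted" algorithm makes no attempt to correct.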

Are you convinced? Here’s a sanity check: without the stochastic component $\sqrt{2\Delta t}\cdot\xi_t$ at each step, $x_t$ deterministically walks towards a local maximum of $q$, the most probable point nearby. In other words, without the stochastic component, we’d have invented gradient descent (on $U = -\log q$)! The stochasticity, however, makes $x_t$ a random variable instead.

The rest of this piece is dedicated to answering the question: where does that stochastic coefficient $\sqrt{2\Delta t}$ come from? You might have already worked out that it’s the magnitude of the thermal motion, which depends on $T$. If that coefficient were too small, then, as we just discussed above, we simply have gradient descent: our particle is analogous to a marble rolling down a (syrupy — overdamped!) hill. If the coefficient were too large, then thermal motion would dominate and the “signal” from the potential $U$ would stop mattering. So, clearly, the threshold $\sqrt{2\Delta t}$ is “special” in some way.

First let me convince you that we want this coefficient to indeed be
proportional to $\sqrt{\Delta t}$, even though it looks dimensionally wrong
when placed in a sum right next to the $\Delta t\cdot\nabla \log q$ term.
What’s going on? It turns out that this “lots of little Gaussian bumps” motion
doesn’t have an intuitive integral calculus. In particular, for Gaussians, it’s
the *variance* that adds, not the standard deviation. So in time $t$ the
variance of the total distance traveled would be $\Sigma\sqrt{\Delta t}^2 =
\Sigma \Delta t = t$, and therefore the standard deviation would be $\sqrt{t}$,
which is exactly what we want for the algorithm to be invariant to how small
$\Delta t$ is. (For more on this, you could read about “Wiener processes.”)
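Here is a quick numerical check of that invariance, as a sketch: many independent walkers each sum up Gaussian bumps of standard deviation $\sqrt{\Delta t}$, once with a coarse step and once with a fine one. The function name, walker count, and step sizes are arbitrary choices of mine.

```python
import math
import random

def endpoint_variance(dt, total_time, n_walkers, rng):
    """Variance of the final position after summing Gaussian bumps of std sqrt(dt)."""
    n_steps = round(total_time / dt)
    endpoints = []
    for _ in range(n_walkers):
        x = 0.0
        for _ in range(n_steps):
            x += math.sqrt(dt) * rng.gauss(0.0, 1.0)
        endpoints.append(x)
    mean = sum(endpoints) / n_walkers
    return sum((e - mean) ** 2 for e in endpoints) / n_walkers

rng = random.Random(0)
coarse = endpoint_variance(dt=0.5, total_time=2.0, n_walkers=4000, rng=rng)
fine = endpoint_variance(dt=0.02, total_time=2.0, n_walkers=4000, rng=rng)
print(coarse, fine)  # both should be close to total_time
```

Had the bumps been scaled by $\Delta t$ instead of $\sqrt{\Delta t}$, the fine-step variance would shrink towards zero as the step got smaller, and the random motion would vanish in the limit.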

Now for the “2.” To derive this value, let’s return to our fluidic analogy. Suppose we wanted to compute the change in probability density at a point $x$ over time, that is, ${\partial p(x,t)}/{\partial t}$, with the understanding that we will set this time derivative to zero later to reason about the system’s behavior at equilibrium. By following the fluidic analogy, we can hallucinate a “probability flow” $\vec{J}(x, t)$ and assert, by conservation of mass (the continuity equation, which follows from Gauss’ divergence theorem), that ${\partial p}/{\partial t} = -\nabla \cdot \vec{J}$.

To measure $\vec{J}(x, t)$, we should think about what’s flowing; that is, about “particles,” whose dynamics is given by the Langevin update rule for $\Delta x_t$ I wrote out above. The deterministic part is easy to deal with: from a tiny region of volume $\Delta V$, a mass $p\Delta V$ of particles leaves with velocity $\nabla\log q$, so the flux is $p\nabla \log q$.

For the stochastic part, notice that the dynamics of the thermal motion are spatially invariant, i.e. independent of $x$. So, whatever amount diffuses out of $\Delta V$ is proportional to $p$, and whatever amount diffuses in is proportional to the density in the neighborhood of $\Delta V$: this means the net flow is proportional to $-\nabla p$. When we talk about fluids, this observation is also known as Fick’s Law, and the constant of proportionality is called the diffusion coefficient $D$.

How can we compute $D$? Looking at only the stochastic part, we have $\partial
p/\partial t = -\nabla\cdot(-D\nabla p) = D\nabla^2 p$, which we recognize as
the heat equation. That is, for this brief aside, we’re going to think of $p$
as representing not probability over Bayesian model parameters *or* fluid
density in a potential well, but heat in a block of metal: a bonus analogy!

Thankfully, the solution to the heat equation turns out to be well-known: it is
a time-varying Gaussian. And it better be! After all, the stochastic diffusion
is just a sum of lots of independent little Gaussian-sampled steps of variance
$\Delta(\sigma^2) = {2\Delta t}$, which together form a big Gaussian of
variance $\sigma^2 = 2t$, as discussed above. But let’s say we didn’t know the
coefficient was 2. Let’s set it to $a$ instead. Now, if we know that $p(x,t) =
e^{-x^2/2(a t)}/\sqrt{2\pi(at)}$, then we can literally take the partial
derivatives on both sides of the heat equation and expand to solve for $D$. A
miracle occurs, and many things cancel, leaving us with simply $D=a/2$. It’s
not worth writing out the algebra here: I encourage you to try it yourself,
since it is that rare combination of straightforward and satisfying. But
furthermore, it gives some mechanical (if not physical) insight into the “2”
that appears in the denominator of the result: algebraically, the $-1/2$ power
of $at$ in the normalization constant $1/\sqrt{2\pi(at)}$ kicks out some
factors of two that are *not* balanced by the $-1$ power of $t$ in the
exponential.
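If you want to check your algebra without spoiling it, a finite-difference sketch can confirm numerically that the Gaussian of variance $at$ satisfies $\partial p/\partial t = (a/2)\,\partial^2 p/\partial x^2$. The function names, the test points, and the step `eps` are my own choices here.

```python
import math

def gaussian(x, t, a):
    """Density of a zero-mean Gaussian with variance a * t."""
    return math.exp(-x * x / (2.0 * a * t)) / math.sqrt(2.0 * math.pi * a * t)

def heat_residual(x, t, a, coeff, eps=1e-3):
    """Central-difference estimate of dp/dt - coeff * d2p/dx2 at the point (x, t)."""
    dp_dt = (gaussian(x, t + eps, a) - gaussian(x, t - eps, a)) / (2.0 * eps)
    d2p_dx2 = (gaussian(x + eps, t, a) - 2.0 * gaussian(x, t, a)
               + gaussian(x - eps, t, a)) / eps ** 2
    return dp_dt - coeff * d2p_dx2

points = [(x, t, a) for x in (-1.0, 0.5, 2.0) for t in (0.5, 1.0) for a in (2.0, 3.0)]
# With the coefficient a/2 the residual is zero up to discretization error.
worst_half = max(abs(heat_residual(x, t, a, coeff=a / 2.0)) for x, t, a in points)
# With a different coefficient (here a itself) the residual does not vanish.
worst_full = max(abs(heat_residual(x, t, a, coeff=a)) for x, t, a in points)
print(worst_half, worst_full)
```

The first number should be tiny and the second should not be, which is a decent hint that the factor of one half is not an accident.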

In any case, though, we conclude from the above that $\Delta p = (a/2)\Delta t
\nabla^2 p$, which means the stochastic flux is simply $-(a/2)\nabla p$. Perhaps
it’s all clear to you by now, but let’s take this over the finish line. Putting
these two components together, $U$ and $T$, we have $\partial p/\partial t =
-\nabla\cdot (p\nabla\log q-(a/2)\nabla p)$. This nice identity is known as the
Fokker-Planck equation (the devoted reader might note that this is the *second*
reference to Fokker on this blog).

Now, finally, as promised many paragraphs ago, we will now reason about this
equation at equilibrium. At equilibrium the distribution is static in time, so
we have $0 = \nabla \cdot (p\nabla \log q -(a/2) \nabla p)$. Assuming suitable
boundary conditions, we can assume the argument of the divergence is also zero,
so $\nabla \log q = (a/2)\nabla p / p$. Integrating in all directions, we have
$\log q + c = (a/2)\log p$, or rather $p \propto q^{2/a}$. This makes sense:
just as expected, for large $a$ the effect of $q$ is wiped out, and for small
$a$ all but the maximum value get squashed to zero. Crucially, however, if we
want $p \propto q$, then we must set $a=2$. Aha! So *that’s* why the thermal
motion needs to be scaled by $\sqrt{2\Delta t}$. (By the way, that constant of
integration above, $c$, oddly enough encodes that pesky normalization constant
that started off this whole discussion…)
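The relation $p \propto q^{2/a}$ can also be seen empirically. In this sketch I assume the toy target $q(x) = e^{-x^2/2}$, for which $q^{2/a} = e^{-x^2/a}$ is a Gaussian of variance $a/2$, and I run the Langevin update with the noise scaled by $\sqrt{a\Delta t}$ for a few values of $a$. The function name and all numerical settings are made up for the example.

```python
import math
import random

def langevin_variance(a, dt, n_steps, burn_in, rng):
    """Langevin dynamics on q(x) = exp(-x^2 / 2) with noise sqrt(a * dt); returns sample variance."""
    x = 0.0
    samples = []
    for step in range(n_steps):
        grad_log_q = -x  # derivative of log q(x) = -x^2 / 2
        x = x + dt * grad_log_q + math.sqrt(a * dt) * rng.gauss(0.0, 1.0)
        if step >= burn_in:
            samples.append(x)
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

rng = random.Random(0)
variances = {a: langevin_variance(a, dt=0.05, n_steps=100_000, burn_in=5_000, rng=rng)
             for a in (1.0, 2.0, 4.0)}
for a, v in variances.items():
    print(a, v, a / 2.0)  # measured variance next to the predicted a / 2
```

Only $a=2$ reproduces the variance of $q$ itself; smaller $a$ sharpens the distribution around its peak and larger $a$ flattens it, just as the equilibrium argument predicts.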

Okay. I’m glad I got that out of my system, it’s been rattling around for
weeks, now! How wonderful it is, that we can start with the simplest axiom of
probability — “*something* must happen” — and, reasoning only with physical
intuition, reach a sampling algorithm that solves a real problem.

Snow and snowflakes have been on my mind lately. This is not entirely because
of the storm in Texas — rather, I have been thinking since Christmas about
the snowflake’s *shape:* those small, delicate, spiky circles we all know so
well (err, perhaps there is something of current events here after all…).
This is a blog post about snowflake shapes.

What shape *is* a snowflake? In principle, we all know what snowflakes look
like. We can all imagine the six-fold symmetry, and then we can imagine the
embellishments. And yet— not so fast! As a little child I (and I do not
think I am alone in this) wanted to see the shape of a snowflake for myself.
Of course, I never quite could — at least, to the point where I could stuff
and saturate my eyes — because the delicate snowflakes melted away as soon as
my snow-blind eyes could focus on the crystal caught. Think about it, then:
where does your own conception of snowflake-shape arise from? Does it come from
firsthand experience, or from clip-art and wintertime papercrafts?

(Of course, there *is* something to be said for indirect inference of snowflake-shape. Years later, I would marvel at the fact that you could infer that snowflakes are hexagonal from the existence of 22° haloes, a kind of crystallography at the heavenly scale, linking the macroscopic and microscopic. For now however I am concerned with seeing *directly*, and face to face, with a presence and visual immediacy that cannot be afforded by imagination alone.)

This is no small concern. A year or so ago I visited the California Academy of
Sciences and learned about Wilson Bentley, the first man to photograph a
snowflake. This was on January 15, 1885; he was just shy of 20 years old. To
appreciate this moment, you have to imagine a time *before* January 15, 1885,
when no such photograph existed. Until Bentley’s innovation, no man had seen a
single snowflake for more than a delicate glimpse. Perhaps spatially the
snowflake could be magnified under a microscope (as Hooke and his
contemporaries
demonstrated in the 17th century), but how could anyone magnify it
*temporally?* Bentley’s Edgertonesque miracle — akin to Muybridge’s “horse in
motion,” captured just seven years earlier — is to hold still the moment to
reveal *more* — to give us the opportunity to gaze without urgency. Dare I
say, Bentley *freezes time?* The manipulation of time becomes an act of
unconcealment, and then furthermore an act of creation.
Bentley’s snowflakes outlived him, and they will outlive me.

Wilson Bentley was not a scientist by training, and perhaps that served him well. Over the course of his life he collected over 5,000 images of what he called “ice flowers.” There is something marvelous about this devotion, which spanned years and years, crystal after crystal, wonder undiminished. Surely the 5,000th snowflake brought Wilson as much delight as the 4,999th (and surely it is the case that you know you are in love when you find yourself falling in love afresh every day).

There is also something to the uniformity of Bentley’s collection. Each snowflake is framed carefully in the center of a slide, on a black background. The images are diligently cropped to squares. See for yourself! The University of Wisconsin-Madison hosts an online archive of 1,181 of Bentley’s images, courtesy the Schwerdtfeger Library. Bentley never sought copyright on his images; they are in the public domain, and so I downloaded them, along with their labeled categories. The categories have wonderful names, such as “crystal with broad branches” and “lumped graupel.”

When displayed in a grid, the snowflakes look like a box of delicious holiday sweets. Surely these delights are made of sugar and not ice.

What does one *do* with 1,181 snowflakes? One searches for patterns, the way
Bentley would never have been able to. Bentley’s images demand to be studied as
samples from a distribution, and they are strikingly well-suited to our
favorite statistical computer vision techniques. For example, though they are
scanned at a uniform 1000x1000 resolution, they still make sense when
dramatically downscaled, even to as low as 32x32. Notice also how the
snowflakes are highly amenable to *data augmentation* on account of their
intrinsic symmetry. Each snowflake can be reflected and rotated arbitrarily to
produce several more high-quality synthetic snowflakes “for free.” So, even
though there are nominally 1,181 unique images, there is much more juice here
than one might expect at first. You can’t always play this game: CIFAR images
can’t really be rotated, and most MNIST digits can’t even be flipped. But
snowflakes are snowflakes.
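As a sketch of the simplest version of this augmentation, here is plain Python that produces the eight rotated and mirrored copies of a square image stored as a list of rows. This covers only the lossless quarter-turns and flips of the pixel grid; rotating by arbitrary angles would need interpolation, which I leave out. The function names and the tiny stand-in "image" are hypothetical.

```python
def rotate90(img):
    """Rotate a square image (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip(img):
    """Mirror an image left-to-right."""
    return [row[::-1] for row in img]

def augment(img):
    """Return the 8 rotated and mirrored copies of img (the symmetries of a square grid)."""
    variants = []
    current = [list(row) for row in img]
    for _ in range(4):
        variants.append(current)
        variants.append(flip(current))
        current = rotate90(current)
    return variants

# A tiny 3x3 stand-in for a snowflake scan, chosen to have no symmetry of its own.
tiny = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
copies = augment(tiny)
print(len(copies))
```

For an image with no symmetry of its own, all eight copies are distinct, so even this restricted version multiplies the dataset by eight.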

One February evening I trained, with a lot of help from my friends Gautam and Jack, a baby variational autoencoder, which embeds the snowflakes in a small low-dimensional latent space. Here, “low-dimensional” means 10-dimensional: it seems very reasonable to encode roughly $2^{10}$ data points with 10 parameters. In the image below, the top row consists of five randomly-selected source images, and the bottom row is the VAE’s attempted reconstruction. The contrastive green-and-purple color scheme comes from matplotlib’s default, and reminds me of the delightful pGLO experiments we did in high-school biology.

Of course, as with all things deep-learning, this was easier said than
done. For a technique that falls broadly under “unsupervised learning,” a VAE
requires an awful lot of babysitting! —but then again, one of the joys of
online school is that *nobody can tell* if you’re training machine learning
models behind your Zoom tab. So I trained, and I trained. The current model
takes around 6 hours on my dusty old MacBook, though after the first 30 minutes
the results are already non-embarrassing.

Do you want to build a snowgan? Separately, I also spent some time trying to build a model adversarially. The snowgan (so to speak) consists of two rival sisters: the generator, Elsa, is of the same architecture as the VAE’s decoder, and the discriminator, Anna, learns to distinguish Elsa’s output from real snowflakes. Elsa and Anna compete, and as a result Elsa gets better at camouflaging the fakes among the flakes. Alas, GAN convergence is finicky enough that I lost interest after a few days… open problem!

It turns out that the VAE’s decoder can be used to build a small-scale
snowflake factory. Using ONNX.js and only a little bit of guile, I ported the
VAE’s decoder to JavaScript and attached the latent space to some
`<input type="range">` sliders. By using the sliders to choose parameters in the latent
space and feeding these parameters to the decoder, one can interpolate to
hallucinate snowflakes that are *not* part of Bentley’s training set. With some
experimentation, I found the sliders that correspond to parts of the latent
space that map to outliers; those parameters I clamped to extreme values in the
non-outlier-ey direction. This interface is what you see at the top of this
blog post.

At time of writing, there is only one Google search result for “this snowflake does not exist,” which points to a 2017 master’s thesis in elementary-school education by Elif Dedebaş. The context is an activity where children are making paper snowflakes, and investigating the relationship between the number of folds and the symmetries. One child creates a snowflake-like shape with eight-fold symmetry and is asked if she knows why such a shape is unnatural. Shape!

I imagine this slider infrastructure is on par with the sort of fancy tooling
UI that they have up in the Great Snow Factory In The Sky. I’m reminded of
David Wiesner’s wonderful Caldecott-winning picture book *Sector 7*, which I
recall reading (“reading”) in my elementary school’s library. It is about a
cloud factory, accessed from the top of the Empire State Building. Notice how
the large sousaphone-bell-shaped openings emanating from the factory roof are
exactly a portrait of an autoencoder’s decoder.

“Wiesner visited the Empire State Building on a zero-visibility day to research
this book,” reads the About-the-Author at the end, “He was the only person
there.” There is something of Wiesner in Bentley and Bentley in Wiesner, I
think. Think of them both, and yourself, too, as Friedrich’s *Wanderer*.

The results interpolated by the VAE are surprisingly convincing. Here are some snowflakes that do not exist (snowfakes?). I created these images by sliding the sliders haphazardly and recording a trace of the changing image. Often, people display this kind of interpolation in animated form, but in relation — and, indeed, deference — to Bentley’s life’s work, I would rather see them spread spatially than temporally.

I enjoy trying to work out the exact moment of transition between different kinds of snowflakes; there is a snowy saltation that occurs at certain boundaries that I cannot quite point you to. A curious fact is that a t-SNE plot in the VAE’s latent space shows no clustering with respect to the provided labels; the VAE seems to have hallucinated its own classification scheme, one which is perhaps more macroscopically morphological in nature than Magono and Lee’s 1966 classification scheme.

The BentleyBlizzardBlossoms (B3) dataset, along with some PyTorch utilities and the source code and trained model of the baby autoencoder, is available on GitHub. I think it would make a wonderful alternative to MNIST for small experiments. If you use it for anything, please let me know!

It’s Christmastime again! Shall we talk about trees? This post is about my new holiday hobby: hallucinating tree-shaped fractals using our old friend, gradient descent.

But — let me start at the beginning. For a variety of reasons, the cover
image of Douglas Hofstadter’s *I am a Strange
Loop* has been on my mind
this month. I have been thinking a lot about this wonderful image created by
pointing a camera at a projector that displays what the camera is seeing. In
modern terms: what happens if you screen-share the screen sharing window? This
happens:

Such “feedback loop” recursions often rapidly converge to breathtaking fixed-points. The NYC MoMath even has an art installation that lets you play with a camera-and-projector setup to make wonderful patterns.

As you can imagine, there is a mathematical theory here. At risk of ruining the mystery, I will tell you about it, though without writing any equations. :-) The theory is the theory of Iterated Function Systems, which is almost exactly what it sounds like. Start with a set of affine functions (affine because that’s what camera-projector-systems do) and repeatedly apply them to a set of points, taking the union at each step. Regardless of where you start, you will soon end up with a fractal structure which is the fixed point of the system. Why fractal? —because the self-similarity comes from the infinitely-nested composition of the affine functions.
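Here is a small sketch of that recipe in plain Python, under some assumptions of mine: the three maps are the textbook Sierpinski-triangle contractions (halve, then translate), each stored as a 3x3 matrix acting on homogeneous coordinates, and the names `apply_affine` and `ifs_step` are hypothetical. It illustrates the "apply every map to every point and take the union" step, nothing more.

```python
# Three affine maps as 3x3 matrices acting on (x, y, 1).
SIERPINSKI = [
    [[0.5, 0.0, 0.00], [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]],
    [[0.5, 0.0, 0.50], [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]],
    [[0.5, 0.0, 0.25], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],
]


def apply_affine(m, point):
    """Apply a 3x3 affine matrix to a 2D point."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


def ifs_step(points, maps):
    """One IFS iteration: the union of every map applied to every point."""
    return {apply_affine(m, p) for m in maps for p in points}


# Start from a single arbitrary point and iterate a few times.
points = {(0.3, 0.7)}
for _ in range(6):
    points = ifs_step(points, SIERPINSKI)
print(len(points))
```

The point set triples (at most) with every step, which is exactly why this dense approach gets expensive quickly.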

There are theorems that ensure convergence towards and uniqueness of this fixed
point in the limit, but… it’s believable enough, don’t you think? Following
your intuition, you can build Sierpinski Triangles and Barnsley
Ferns and all sorts of other
beautiful structures using an IFS. I refer you to Section 8.2 of *The
Algorithmic Beauty of Plants* by Prusinkiewicz and Lindenmayer (yes, *that*
Lindenmayer) for more botanical
connections.

Now here is the question that has been on my mind: if I give you a picture of a
fractal-fern, can you give me a set of affine functions whose fixed point is
that fern? This is the *inverse
problem*
for IFSes. According to Wikipedia, this problem has real-world applications in
image compression, but it is *hard* to do in general.

Hmm. Interesting. Can our favorite tool, gradient descent, come to the rescue? At least in an approximate sense? Recall that a two-dimensional affine function is really just a 3x3 matrix (with 6 free parameters) that encodes the details of the transformation. If we could easily compute images of IFS fixed points from a set of matrices, then perhaps we could try to optimize the parameters of these matrices to make the fixed point look the way we want it to.

The challenge, of course, is efficiently *finding* the fixed point —
repeatedly rasterizing and compositing affine transformations on images sounds
expensive — and doing so differentiably sounds like a scalability nightmare!
But there is a wonderful solution to this problem, which is to play the
so-called “chaos game.” Instead of
*densely* applying all the functions to all the points on the plane, you can
*sparsely* approximate the fixed point as follows: Start with a point — any
point! — and *randomly* select one of the affine functions of the IFS and
apply it to the point. Repeat this process with the new point. It turns out
that in the limit, the “trail” left behind by this wandering point converges to
the fixed point of the IFS. (This, too, is a theorem, but again I think it is
believable enough that I will not demand a proof.) This is what the “chaos
game” looks like for the Sierpinski triangle; notice how the salt-and-pepper
spattering of points soon converges boldly into the fractal we know and love
(it’s an animation — stare for a few seconds).
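For readers who would rather run the game than watch it, here is a minimal sketch in plain Python. The Sierpinski matrices and the function name `chaos_game` are my own illustrative choices, not code from the notebook; the seed is fixed only so that the run is repeatable.

```python
import random

# Same convention as before: 3x3 matrices acting on (x, y, 1).
SIERPINSKI = [
    [[0.5, 0.0, 0.00], [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]],
    [[0.5, 0.0, 0.50], [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]],
    [[0.5, 0.0, 0.25], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],
]


def chaos_game(maps, steps, start=(0.0, 0.0), seed=0):
    """Follow one wandering point; return the trail of 2D positions."""
    rng = random.Random(seed)
    v = (start[0], start[1], 1.0)
    trail = []
    for _ in range(steps):
        m = rng.choice(maps)  # pick one affine map at random
        # applying it is a 3x3 matrix-vector product
        v = tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))
        trail.append((v[0], v[1]))
    return trail


trail = chaos_game(SIERPINSKI, steps=2000)
print(len(trail))
```

Scatter-plotting `trail` with any plotting tool should show the familiar triangle emerging from the noise.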

So here is the plan: we start with a point, and play the chaos game with our
current IFS matrices to obtain a *point cloud* that stochastically approximates
the fixed-point. Then, we compare this *point cloud* to our target fern-image
to obtain a “loss.” Finally, we update our IFS matrix parameters to minimize
this loss, until at last the fixed-point converges to the target image. It’s
just “machine learning,” really: we’re learning to play the chaos game!

Here is an outline of the algorithm (with some details elided):

```
import random
import torch

# initialize IFS (random_3x3_affine and compare are elided helpers)
F = [random_3x3_affine() for _ in range(4)]
o = torch.optim.Adam(F, lr=0.0001)
for step in range(100_000):
    o.zero_grad()
    # start at origin
    v = torch.tensor([0., 0., 1.])
    # play the chaos game once
    trace = []
    for _ in range(200):
        # applying an affine function is just
        # a matrix multiplication!
        v = torch.matmul(random.choice(F), v)
        trace.append(v)
    # treat the trace as a point cloud
    loss = compare(target_image, trace)
    # update parameters
    loss.backward()
    o.step()
```

Ah! Actually, there is *one* detail that I *should* explain: how do you compare
a point cloud and an image? My idea is to convert the target image to a
*second* point cloud by uniformly sampling points from it, for example by
rejection-sampling. Then you can compare the two point clouds by the so-called
“chamfer distance,” which is the mean distance from each point to its nearest
neighbor in the opposite point cloud. This is quadratic-time to compute, and
the most expensive part of the whole operation, but with ~100 points in each
set it is quite doable.
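To pin down the definition, here is a plain-Python sketch of the chamfer distance. Two caveats: summing the two directional means is one common convention and an assumption on my part (averaging them is another), and a version used inside the optimization loop would need to be written with differentiable tensor operations rather than Python floats. The name `chamfer` is just for illustration.

```python
import math


def chamfer(cloud_a, cloud_b):
    """Sum of the two mean nearest-neighbor distances between point clouds."""
    def one_way(src, dst):
        # for each source point, distance to its nearest destination point
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)


square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(x + 0.1, y) for x, y in square]
print(chamfer(square, square), chamfer(square, shifted))
```

The nested loops make the quadratic cost visible: every point in one cloud is compared against every point in the other.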

A pedagogical aside: notice that the “trick” here is really a change in representation. This optimization problem is easier to solve on point clouds than on raster images! An important lesson, that applies across the ML-for-graphics domain… and more broadly to all differentiable programming enterprises.

Remarkably, this zany scheme works quite well! Let’s try it with a heart. (A heart, because it’s an easy low-entropy shape, but also because it felt symbolically appropriate in relation to Hofstadter’s broader philosophical project of souls-within-souls-within-souls.)

When we initialize the system with 4 random affine transformations, the trail left behind by the chaos game is very unimpressive. The sparsity is because I simulate only 200 steps of the chaos game for each step of gradient descent — there’s a tradeoff between noise and computation time, of course, as there is with any form of SGD.

Okay. Time to optimize! Here is what it looks like after 100K steps of optimization with an Adam optimizer (that’s about 40 minutes of wall-time computation on my old MacBook). It looks pretty good, don’t you think? The chamfer distance scheme works!

Now we can export the four affine matrices out from PyTorch and work out the
“true” fixed point with a more expensive offline computation (for example, by
simulating a few *million* steps of the chaos game). It looks like this — not
perfect, and indeed quite crayon-scribble-ey, but clearly *something*
interesting has happened, and the result is convincingly heartlike.

Where is the fractal structure hidden in this heart? This needs some visualization tooling to see clearly. After a bit of JS-canvas-hacking: here is a GIF of the heart that reveals its fixed-point structure. Hearts-in-hearts-in-hearts!

I will admit to blinking a few times when I first saw this animation play out
in full — I did *not* expect it to work, and even now I find myself marveling
at this creation? discovery? of a heart encoded in 24 numbers.

But it still doesn’t look *quite right* — the fractalness isn’t *obvious*.
How come? It’s because our affine transformations stretch and squash the heart
into almost unrecognizable blobs. A final, natural improvement is to force our
affine transformations to be *rigid*, so that there isn’t any squishing or
skewing in the fractal. This is actually not too bad to implement: the solution
to a problem of constraint is typically a reparametrization, and indeed in this
case we can simply reparametrize the matrices in terms of a rotation, an
(isotropic) scale, and a translation. Now we’re down from 6 parameters per
matrix to just 4 — even more magical! The result is a much more “obviously”
fractal-ey fixed point, simply because your eye can more easily pick out the
structural recursion when it is made of rigid motions.
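As a sketch of what that reparametrization might look like, here is a plain-Python function that builds the 3x3 matrix from the four numbers. The name `similarity_matrix` and the parameter order are my own choices; in the actual optimization these would be differentiable tensors rather than floats.

```python
import math


def similarity_matrix(theta, scale, tx, ty):
    """3x3 matrix for: rotate by theta, scale isotropically, then translate."""
    c = scale * math.cos(theta)
    s = scale * math.sin(theta)
    return [[c, -s, tx],
            [s, c, ty],
            [0.0, 0.0, 1.0]]


def transform(m, point):
    """Apply a 3x3 affine matrix to a 2D point."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


# Quarter-turn, doubled in size, shifted by (1, -1).
m = similarity_matrix(math.pi / 2, 2.0, 1.0, -1.0)
print(transform(m, (1.0, 0.0)))
```

Because the two columns of the linear part are always perpendicular and of equal length, no choice of the four parameters can introduce skew or anisotropic squashing.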

Let’s take it for a — pardon — *spin!*

Since I promised you trees, here is some fractal foliage. The target image was a picture of a maple leaf! I’d never thought about this before, but indeed a maple leaf contains within it an echo of the whole tree. (Full disclosure: it takes a couple of randomized restarts to get a really impressive result — the low dimensionality of the fractal’s parametrization leads to the same stuck-in-a-local-optimum woes that differentiable rendering folks face all the time…)

This one uses just three matrices — experimentally, three matrices tend to be
enough to get really good results, and more just lead to very busy and crowded
fractals. (I don’t know if there is a word for number-of-maps–in-an-IFS, but
let’s call it the IFS’s *arity*. In this terminology, I like *ternary* IFSes.)

And finally, since I promised you Christmas trees…

This last animation is actually absolutely *baffling* to me. In part, this is
because of how *treelike* the fractal turned out — here it is overlaid
against the silhouette I optimized against. Can you imagine this is encoded by
just 12 parameters? Odd.

But even more bafflingly: the actual geometry of the recursion is *nothing*
like the tree recursion geometry you and I are used to from CS106! Compare the
GIF above to Barnsley’s Fern: while Barnsley turns each full leaf into two
smaller leaves with “semantic” affine maps, this Christmas tree does all sorts
of *bizarre* uninterpretable cartwheels and somehow almost magically works
itself out in the end. It is mesmerizing to me; I can stare at it for minutes
at a time.

When you think about it, it is quite shocking that a Christmas tree is the
*unique* fixed point of these 3 rigid affine maps. “Beautiful” isn’t the word
for it — neither is “grotesque” (though both words have been used by various
people I have shared this with). There is something just unsettlingly
fascinating about the way something magical *pops out* from three rectangles.
Maybe I should have demanded a proof of that theorem after all…

There is so much more to say about all of this. What shapes are easiest to IFSify, and why? In higher dimensions, can we treat the IFS as a kind of SVD-esque dimensionality reduction scheme for data (i.e. literal point clouds)? What kind of data would it make sense to do this for — where do you find spatial self-similarity? Do point clouds have an inherent IFS “dimensionality” (in terms of, say, the number of affine transformations you need to get a good approximation)? With some more work, could SGD be a legitimate solution to the open problem of fractal compression?

It’s a lot to think about — but for now… well, there is a Calvin and Hobbes gag that begins with Calvin saying “I’ve been thinking” and Hobbes interrupting “On a weekend?” That’s how I’ve been feeling this winter break! Now it is time to shut down the Jupyter kernel and get some rest — Comfortably Numbered will return in 2021!

(Jupyter notebook for this blog post available on GitHub.)

Update (Dec 27): Check out some improvements made by a reader!

Update (Dec 29): I was pointed to a 1998
paper (Melnik and Pollack) that
(independently!) tells *exactly* the same story. I’m struck by how closely the
two expositions align. There is something deeply comforting about rediscovering
a place someone else has visited once-upon-a-time, like finding a flag buried
under the snow on a mountaintop — a kind of storybook enchantment that sits
across the table from loneliness.

When I took Pat and Matt’s wonderful graphics class CS348B (yay PBRT!), one of the most frustrating debugging journeys I had was related to interpreting vectors in the wrong coordinate systems. At the time I’d come up with an idea to catch such bugs statically, and now, 1.5 years later, I finally have found time to write down the idea.

Here is the problem: A vector only has “meaning” with respect to a coordinate
system. In graphics, a vector could be in all sorts of coordinate systems:
*world space*, or *object space*, or *camera space*, or *screen space*, or
*normalized device coordinate (NDC) space,* or *raster space* (some of these
are self-explanatory, but see PBR
§6.1.1-6.2
for definitions). This leads to bugs. If you interpret a “camera-space” vector
as if it were in “world space,” your math will be *all wrong!* You need to
apply the affine transformation `cameraToWorld` before doing any world-space
computation on a camera-space vector. But what if you forget to do that? Your
C++ code will happily compile, but your poor bunny might look like chicken soup
instead… (I learned this the hard way!).

This got me thinking: perhaps a safer `Vector3f` should know *at the type
level* what coordinate system it is in. Rather than a `Vector3f`, perhaps what
we really want is a `Vector3f<WorldSpace>`. You simply can’t add a
`Vector3f<CameraSpace>` to a `Vector3f<WorldSpace>` — that’s a *type error!*
The compiler *makes* you call `cameraToWorld` if you want to do anything with
these two vectors. (All this is very much in the spirit of
torchsaber, by the way.)

I think there are broadly two ways you might want to implement something like
this. One way is to treat `CameraSpace` as a purely formal symbol, just a
*name* or an *annotation*. The compiler’s job is to check these annotations.
You’d have to explicitly break this abstraction inside the implementation of
`cameraToWorld`, but from a pedagogical perspective I think that is exactly
what you want.

The other way is for the tag `CameraSpace` to somehow encode the actual
geometry of what *camera space* “means.” For example, each vector could also
carry around the `Transformation` that gets you to its host coordinate system
from `WorldSpace`. Given this information, the compiler might even be able to
*infer* the implementation of `cameraToWorld` for *free!* Of course, this comes
at a cost, which is that the compiler might not be able to statically check
equality of `Transformation`s, because the matrices might not be materialized
at compile-time.

In the rest of this post I’ll quickly literate-ly program the first option as a proof of concept. It is short and sweet (at least, as “sweet” as any bit of C++ hacking can be).

First, we define our coordinate system “annotations” as dummy classes.

```
class Space {};
class WorldSpace : public Space {};
class CameraSpace : public Space {};
```

Next, we design a `Vec3` template class that takes in such a “tag” as a
template parameter (`CoordinateSystem`), and enforces that the tags are the
same on operations. I’ve implemented the “`+`” operator but you can imagine the
rest — there is nothing sneaky going on here! In fact, this is exactly how
one implements a generic `Vector3` that can be specialized over numeric types
(e.g. `Vector3<float>`, `Vector3<int>`, and so on).

```
template <class CoordinateSystem>
class Vec3 {
  static_assert(
    std::is_base_of<Space, CoordinateSystem>::value,
    "Vec3 annotation must derive from Space"
  );
public:
  double x, y, z;
  Vec3(double x, double y, double z)
    : x(x), y(y), z(z) {}
  // the '+' operator! It takes a const reference
  // so that temporaries can be operands too.
  Vec3<CoordinateSystem> operator+(
    const Vec3<CoordinateSystem>& other
  ) const {
    return Vec3(
      this->x + other.x,
      this->y + other.y,
      this->z + other.z
    );
  }
};
```

(Note that you need to `#include <type_traits>` to get `std::is_base_of`, which
is there just to protect you from trying to make a `Vec3<string>` or something
else uncouth like that. Completely optional.)

…actually, that’s it! We can already get some mileage out of this. For example, this should be okay:

```
Vec3<WorldSpace> p(0, 0, 0);
Vec3<WorldSpace> q(1, 1, 1);
auto r = p + q; // ok!
```

On the other hand, this is perhaps *not* okay:

```
Vec3<CameraSpace> s(2, 2, 2);
auto t = r + s; // NOT OK!
```

And indeed, the compiler complains *and* gives a helpful error message!

```
test.cc:40:14: error: no match for 'operator+'
(operand types are 'Vec3<WorldSpace>'
and 'Vec3<CameraSpace>')
40 | auto t = r + s; // NOT OK!
| ~ ^ ~
| | |
| | Vec3<CameraSpace>
| Vec3<WorldSpace>
```

Of course! How *dare* we add a camera-space vector to a world-space vector? We
need to transform `s` to world-space…

```
Vec3<WorldSpace> cameraToWorld(
  const Vec3<CameraSpace>& vec
) {
  return Vec3<WorldSpace>(
    vec.x * 2, vec.y - 4, vec.z + 8 // or whatever
  );
}
```

…and now `g++` is happy. :)

```
auto u = r + cameraToWorld(s); // ok!
```

Perhaps this trick will make its way into PBRT someday!

Update (Dec 23): Rachit Nigam pointed me to this (very recent) SPLASH paper that considers the same problem. (“See? It’s not just me!”)

Note: Jekyll and Hyde — is how this blog goes. There are weeks of rational thought and weeks of irrational ramblings. This fall has been much of the latter, for good reason, but here is a break(!) in the clouds.

Okay, okay. The truth is that I wrote this essay for my CS class this quarter (leave it to Pat to assign an essay for a CS class!). But then, I think I have reached the age where every written assignment in college is to be treated as an opportunity to say something I otherwise would not have a chance to say… and so the essay, lightly edited, finds its way to this blog.

In this essay I want to describe two kinds of tricky bugs that might creep into your PyTorch programs. I call the bugs “clogs” and “leaks.” In my mind “clogs” and “leaks” reveal an exciting possible research direction for anyone interested in designing better APIs for automatic differentiation in machine learning.

*Note: The examples presented, though in principle timeless, were tested using
Python 3.8.5 running PyTorch 1.6.0.*

If you are familiar with PyTorch internals, you can skip this section. If not,
a brief review of a wonderful topic: how does PyTorch differentiate your code
for gradient descent? The technical term for PyTorch’s approach is *tape-based
reverse-mode automatic differentiation*. As you perform arithmetic computations
on your variables, PyTorch tracks the intermediate values in a computation
graph. When you want to differentiate a value, you call `.backward()` on that
value. PyTorch then walks *backwards* along this computation graph, computing
the derivative at each step and accumulating them according to the chain rule.
Eventually, the leaf nodes of the graph contain the derivatives you asked for.

Let me give a small example. Suppose we wanted to compute $d(2x^2)/dx|_{x=3}$. We might write a program that looks like this:

```
x = torch.tensor(3., requires_grad=True)
y = 2 * x**2
y.backward()
print(x.grad)
```

This program generates the following computation graph.

When you call `.backward()`, PyTorch’s automatic differentiation engine walks
backwards along the graph, computing derivatives at *each* step. By the chain
rule, the product of these gives the overall derivative we sought.

Now, consider this simple PyTorch program to compute $d(\sqrt{x} + x)/dx|_{x=4}$. What do you expect to be printed?

```
x = torch.tensor(4., requires_grad=True)
y = sqrt(x) + x
y.backward()
print(x.grad)
```

A casual user or AP calculus student would *expect* to see 1.25 printed, of
course. But what *actually* gets printed is 1. Why?

Ah! I didn’t show you the full program: I hid the imports. It turns out that
the first line of this program is `from math import sqrt`, *not*
`from torch import sqrt`. Now, the Python standard library’s `math.sqrt` is not a
PyTorch-differentiable function, and so PyTorch is unable to track the flow of
derivatives through `sqrt(x)`.

As a result of this bug, backpropagation gets “stuck” on the way back, and only
the derivative of `x`, i.e. 1, is deposited. This is a clog — the gradients
can’t flow! In the computation graph below, the dotted arrow represents the
clog.

The reason calling `math.sqrt()` on a PyTorch tensor is not a runtime error is
that PyTorch tensors implicitly convert to “raw” floating-point numbers as
needed. Most of the time this is a useful and indispensable feature. But I
believe this situation should *at the very least* raise an error or a warning.
While the example I presented was reasonably straightforward, there are *many*
different ways to “clog” backpropagation, with varying degrees of insidiousness
(for example, what happens when you mutate a variable in place?). It can be a
nightmare to debug such situations when something goes wrong — that is, if you
notice the bug in the first place!
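For comparison, here is the clogged program side by side with its unclogged twin, imports and all. This is only a sketch of the behavior described above; recent PyTorch releases may additionally print a warning when a tensor that requires gradients is converted to a plain float, which is a step in the direction I am asking for.

```python
import math
import torch

# Clogged: math.sqrt converts the tensor to a plain Python float,
# so the sqrt term silently drops out of the computation graph.
x = torch.tensor(4., requires_grad=True)
y = math.sqrt(x) + x
y.backward()
clogged = x.grad.item()

# Unclogged: torch.sqrt is tracked by autograd.
x = torch.tensor(4., requires_grad=True)
y = torch.sqrt(x) + x
y.backward()
unclogged = x.grad.item()

print(clogged, unclogged)
```

The two programs differ in a single import, and only one of them computes the derivative the calculus student expects.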

*By the way:* the celebrated “reparametrization trick” that powers variational
autoencoders is really just a workaround for a gradient clog problem. To train
a variational autoencoder, you need to compute the derivative of a sample of a
probability distribution with respect to the distribution’s parameters (e.g.
the mean $\mu$ and variance $\sigma^2$ of a Gaussian distribution).
Unfortunately, naïvely sampling from a parametrized distribution abruptly
truncates the computation graph with respect to the parameters, because the
random number generator is not differentiable all the way through — who *knows*
what *it’s* doing! The solution is to sample from a standard unit normal
distribution (where $\mu=0$ and $\sigma=1$), and then re-scale the sample
by multiplying by $\sigma$ and adding $\mu$. Of course, multiplication and
addition *are* easily differentiable, and so the gradients can now flow.
Problem solved!
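A minimal sketch of the trick for a single scalar Gaussian follows. The variable names are mine and the seed is arbitrary. Since the sample is $z = \mu + \sigma\epsilon$, the gradients should come out as $dz/d\mu = 1$ and $dz/d\sigma = \epsilon$.

```python
import torch

torch.manual_seed(0)

mu = torch.tensor(1.0, requires_grad=True)
sigma = torch.tensor(2.0, requires_grad=True)

# The randomness lives in eps, which does not depend on mu or sigma.
eps = torch.randn(())

# Re-scale and shift: plain multiplication and addition, both differentiable.
z = mu + sigma * eps
z.backward()

print(mu.grad.item(), sigma.grad.item(), eps.item())
```

All of the "who knows what it's doing" has been quarantined inside `eps`, so the graph from `z` back to the parameters stays intact.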

Now, consider this slightly more complicated PyTorch program. We are going to implement a silly reinforcement learning algorithm. Here is the situation: There is a truck driving on the road with constant velocity, and your goal is to catch up to it and drive right alongside the truck. At each timestep you are allowed to choose your velocity, and then you’re told how far you are from the truck.

The setup:

```
import torch

truck_velocity = torch.tensor(3.142)
truck_position = torch.tensor(2.718)

def get_measurement(car_position):
    global truck_position
    truck_position = truck_position + truck_velocity
    return torch.abs(truck_position - car_position)
```

And a simple online gradient-based learning algorithm:

```
my_velocity = torch.tensor(0.01)
my_position = torch.tensor(0.)
for i in range(500):
    my_velocity.requires_grad_()
    my_position = my_position + my_velocity
    loss = get_measurement(my_position)
    loss.backward()
    my_velocity = \
        my_velocity.detach() - my_velocity.grad * 0.01
```

Unlike last time, there’s nothing up my sleeve here — this is all reasonable PyTorch code. This code actually works just fine.

But, if you run it for long enough (say, 1000 iterations), you’ll notice something odd: each step starts taking longer and longer. The algorithm is accidentally quadratic! You can see this behavior quite clearly in this graph, which shows a linear growth in iteration time from step to step (the spikes are garbage collection pauses).

How can this be? Isn’t each loop doing the same calculation?

Here is one hypothesis: if you’ve read this
paper you might look to see if we’re
`.detach()`-ing `my_velocity`. The `.detach()` function snips off all incoming
edges to a node in the computation graph; essentially, creating an artificial
clog. If we forget to do that, the gradients would “leak” back in time across
multiple steps in the graph, all the way back to the first step, and each
iteration would therefore take longer and longer — just as we’re observing.

But, alas, this is not the source of the bug: as you can see, we *are*
detaching `my_velocity` when we update it. So, what’s really going on here?

It’s tricky! The leak is in `my_position`, which subtly depends on *all*
previous values of `my_velocity` and therefore makes backpropagation compute
gradients for *all* previous timesteps. The dataflow diagram below hopefully
clarifies this point. Notice how each `velocity` has its parent nodes detached
(thanks to the call to `.detach()`!), but `loss` still has an indirect
dependence on the chain of `positions`.
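One way to watch the leak happen, without giving away the exercise that follows, is to count the autograd nodes hanging off the position at every step. The helper `graph_size` below is a hypothetical diagnostic of my own that walks `grad_fn.next_functions`; the loop is a stripped-down stand-in for the truck example with a fixed target instead of a moving truck.

```python
import torch


def graph_size(t):
    """Count the autograd nodes reachable from tensor t."""
    seen, stack = set(), [t.grad_fn]
    while stack:
        fn = stack.pop()
        if fn is None or fn in seen:
            continue
        seen.add(fn)
        # each entry is a (parent_node, input_index) pair
        stack.extend(parent for parent, _ in fn.next_functions)
    return len(seen)


target = torch.tensor(10.)
velocity = torch.tensor(0.01)
position = torch.tensor(0.)
sizes = []
for i in range(5):
    velocity.requires_grad_()
    position = position + velocity
    sizes.append(graph_size(position))
    loss = torch.abs(target - position)
    loss.backward()
    velocity = velocity.detach() - velocity.grad * 0.01

print(sizes)
```

If the printed counts keep growing from one iteration to the next, every call to `.backward()` has more history to walk than the last one did.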

Finding the correct place to insert the line
`my_position = my_position.detach()` is left as a not-quite-trivial exercise to the reader.
Beware! Putting it in the *wrong* place will either have no effect *or* cause
`my_velocity` to always have gradient 0.

Just like memory leaks, gradient leaks can be extremely sneaky. They pop up
whenever your inference is “stateful” — think of applications like physics
controllers, reinforcement learning, animated graphics, RNNs, and so on. I
would not be surprised if many popular implementations of such algorithms *do*
have “gradient leak” bugs. However, the bugs usually only manifest themselves
visibly when the inference passes through enough timesteps for the leak to
compound. Just like a dripping tap, you might not notice your losses until you
get the bill at the end of the month… and then, you need to figure out how to
track down the source of the leak and figure out the right way to fix it.

In the long term, how can we protect ourselves from this class of bugs? One
potential solution is to embed the API inside a language whose type system
tracks the creation of the computation graph. You might be able to use
well-understood techniques like *taint analysis* or *linear types* (pun not
intended) which traditionally track the flow of *information*, to now track the
flow of *differentiability* through the program.

Let me be slightly more concrete about this suggestion. In our “clog” example,
a good type system might detect that `sqrt` cuts off the computation graph,
and, knowing that $y$ does not directly depend on $x$ in the expected way
anymore, complain at compile-time when we try to request $dy/dx$. In our
“leak” example, a good type system might notice that the “old” `my_position`
effectively goes out of scope when it is re-assigned, and therefore it might
complain that an unreachable reference to it actually does persist through the
new `my_position`. Such checks seem very reasonable to demand from a modern
type system.

Over the months I’ve started to miss the little speckles of Stanford life that
I never thought I would get attached to. One of them is the regular bell-toll
of our lovely little clock tower. There is something oddly compelling about
public time — time as a utility — time as something to be broadcast for
consumption. So I have a special fondness for clock towers, and the Network
Time Protocol, and the anchor who says “it’s 8am and this is NPR,” and the GPS
system, and people who set their clocks by the atomic clock in Boulder, and the
website time.gov, and the pilot who says “we will be
landing shortly in San Francisco, the local time is 6:15am.” At least, in this
divided world, we can agree on *something*.

Well: This evening I was tuning my guitar, and I discovered some more public time. Here is what happened. First I plucked the low E string, and watched the string blur into a lenslike pair of arcs that envelop the extent of its vibration. I ran the first few frets to try out the sound: F, F-sharp, G, G-sharp. Great! Next, I plucked the A string, and watched the same blur of the string. But when I tried plucking it at the first fret — that would be a B-flat — something bizarre happened. Instead of a vibrating blur, the string slowly wavered back and forth between its two extremes!

What? I rubbed my eyes. A low “Bb” vibrates at around 117 Hz. That is *much*
faster than my eye can resolve (because of this “persistence of vision,”
animations in video games don’t have much incentive to exceed around 60 frames
per second). In short, I should be seeing a blur. But no! I can track the
string’s leisurely wobble quite easily. I bet you can, too: the GIF below shows
an open-string A on the left and a first-fret Bb on the right.

(Sorry for the poor quality — I had to hold my phone between my knees — it was tricky, okay? You can find the original videos, with sound, on GitHub.)

Here’s what I think is going on. The lights in my room are powered by AC current from the power grid. In the United States, AC current is standardized to transmit at 60 hertz (why 60? apparently it’s a long story, but in part the number 60 is related to turbine hardware logistics). That means that the current peaks in (absolute) voltage 120 times per second, and so the lights in my room are actually all rapidly flickering at 120 Hz. I just don’t realize it because of the persistence of vision.

Ah! As I mentioned above, a Bb vibrates at around 117 Hz — to be precise, it’s 116.54 Hz. You can work this out without looking it up by recalling that concert A is 440 Hz (as printed on every metronome) and the low Bb is two octaves less one half step below — that’s twice twelve minus one or 23 half steps below. Indeed, 440/(2^(23/12)) gives 116.54.
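If you would rather let a computer do the arithmetic, here is a minimal sketch of that calculation. It assumes standard equal temperament, where each half step scales the frequency by a factor of 2^(1/12); the function name `pitch_hz` is just something I made up for illustration.

```python
# Equal temperament: each half step multiplies the frequency by 2**(1/12).
CONCERT_A_HZ = 440.0  # concert A, the reference pitch quoted above


def pitch_hz(half_steps_from_concert_a: int) -> float:
    """Frequency of the note this many half steps above (+) or below (-) concert A."""
    return CONCERT_A_HZ * 2 ** (half_steps_from_concert_a / 12)


low_b_flat = pitch_hz(-23)  # two octaves less one half step below concert A
open_a = pitch_hz(-24)      # the open A string, exactly two octaves below

print(f"first-fret Bb: {low_b_flat:.2f} Hz")  # prints 116.54 Hz
print(f"open A string: {open_a:.2f} Hz")      # prints 110.00 Hz
```

The open A string lands at 110 Hz, a full 10 Hz away from the 120 Hz flicker, which fits with it looking like an ordinary blur.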

Why is this correspondence between 120 and 116.54 so interesting? Well, when you have nearby frequencies, they interfere to produce “beat” frequencies. I’ve written about beats in the spatial domain before when talking about moiré patterns. Here, the beats are in the temporal domain, forming a kind of “wagon wheel” illusion. The superposition of the string’s vibration and the light’s flickering creates a low-frequency “beat” that gives the illusion of the string vibrating slowly.
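To make the slow component explicit, here is a sketch for two equal-amplitude sinusoids at frequencies $f_1$ and $f_2$. This is an idealization and the symbols are my own; the real interaction between a flickering light and a moving string is probably closer to sampling than to simple addition.

$$\cos(2\pi f_1 t) + \cos(2\pi f_2 t) = 2 \cos\left(2\pi \frac{f_1 + f_2}{2} t\right) \cos\left(2\pi \frac{f_1 - f_2}{2} t\right)$$

With $f_1 = 120$ Hz and $f_2 \approx 116.54$ Hz, the first factor is a fast carrier near 118 Hz, and the second is a slow envelope at $(f_1 - f_2)/2 \approx 1.73$ Hz.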

Eyeballing the right GIF, I count around 5 oscillations of the string over the 3-second clip. That gives a beat frequency of 1.67 Hz. The difference in interfering frequencies is twice the beat frequency (elementary trigonometry exercise, use the sum-to-product identity). Using that difference, I can compute that my Bb is tuned to 120 Hz - 3.3 Hz = 116.7 Hz. Not bad! It turns out that you can buy “stroboscopic tuners” online, but perhaps if you are good at estimating slow frequencies, you don’t need one…!
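The estimate above can be written out as a short script. This is only a sketch of my own back-of-the-envelope reasoning: it assumes the lights flicker at exactly 120 Hz, that the frequency difference is twice the counted wobble rate, and that the string is flat of 120 Hz rather than sharp (counting wobbles alone cannot tell you which). The function name is made up for illustration.

```python
FLICKER_HZ = 120.0  # assumed: lights flicker at twice the 60 Hz mains frequency


def estimate_string_hz(oscillations: float, seconds: float) -> float:
    """Estimate a string's pitch from its slow apparent wobble under flickering light."""
    beat_hz = oscillations / seconds  # the slow wobble counted by eye
    difference_hz = 2 * beat_hz       # the difference is twice the beat frequency
    # Assumption: the string vibrates below the flicker frequency, not above it.
    return FLICKER_HZ - difference_hz


estimate = estimate_string_hz(oscillations=5, seconds=3)
print(f"estimated pitch: {estimate:.1f} Hz")  # prints 116.7 Hz
```

Counting 4 or 6 wobbles instead of 5 moves the estimate by about 0.7 Hz in either direction, so the eyeballed count is the weakest link here.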

Briefly, some thoughts for the future: I’m reminded by my guitar string of Edgerton’s hummingbirds and Muybridge’s horses, and many more moments frozen in time. There is a lot more I could say, and that I have said, and perhaps I will explain this more carefully in a future essay. The confluence of the aural and visual has also been on my mind lately; though I haven’t “blogged” about it I have been thinking a lot about clocking coin tosses by recording the sound a coin makes while it rings. In that story, too, stroboscopic techniques make an appearance. Maybe I will say more about that as well.

For the moment however I am struck — “struck” — by how strange it is that there is so much invisible to us even in plain sight, so much hidden in the folds of time. Public time, and the privacy of time.

]]>