Suppose two Bayesians, Alice and Bob, put on a variety show where they take turns tossing a biased coin and announcing outcomes to a live studio audience. (Bayesians love this kind of thing—it keeps them entertained for hours…)

Unfortunately, just as Alice goes on stage, she realizes with dread that she forgot to bring the coin. Thinking on her feet, she mimes pulling a tiny imaginary coin out of her pocket, and says “This is a biased coin!” It works—the audience buys it and the crowd goes wild.

She mimes tossing the pretend coin and randomly announces “heads” or “tails.” Then, she hands the coin to Bob, who (catching on) also mimes a toss. This has just turned into a Bayesian improv show.

But now Bob has a problem. Should he announce “heads” or “tails”? He could
choose uniformly at random, but after many rounds the audience might get
suspicious if the coin’s bias is too close to 50%. How can he keep up the
charade of a *biased* coin?

Here’s what Bob does. In the spirit of “yes, and…,” he infers the coin’s bias based on Alice’s reported outcome (say, with a uniform prior) and samples a fresh outcome with that bias. So if Alice said “heads,” Bob would be a bit likelier to say “heads” as well.

Then Alice takes the coin back and does the same, freshly inferring the coin’s
bias from the past *two* tosses. In this way, the two actors take turns
announcing simulated outcomes to the oblivious audience, while building a
shared understanding of the coin’s bias.

What happens? How can we characterize the sequence of outcomes? Intuitively, we might expect either a “rich-get-richer” effect, where they end up repeating heads or tails, or a “regression-to-the-mean” effect, where they converge to simulating a fair coin.

The surprising answer is that this process is indistinguishable from Alice and
Bob tossing a *real* coin with fixed bias (chosen uniformly). A critic lurking
in the audience would never suspect that anything was afoot!

This result is a consequence of the correspondence between the Pólya distribution and the Beta-binomial distribution.

I have a hunch that this observation could be useful: perhaps in designing a new kind of cryptographic protocol, or perhaps in explaining something about human cognition. If you have ideas, let me know!

Proof sketch: Model the actors’ shared belief as a Beta distribution with parameters $(h, t)$, initialized to $(1, 1)$, i.e. uniform. At each toss the probability of heads is $h/(h+t)$, and the outcome increments $h$ or $t$ by 1. You can think of this as a Pólya urn with $h$ black and $t$ white balls: each time you draw a ball, you put it back and add a “bonus” ball of the same color. It is a well-known result that this urn scheme generates the same distribution over sequences as the Beta-binomial process.
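As a quick sanity check (my own, not part of the original argument), we can compute the exact probability of any announced sequence under both descriptions: the urn-style update that Alice and Bob perform, and a real coin whose bias is drawn uniformly, where the sequence probability has the closed form $\int_0^1 p^k(1-p)^{n-k}\,dp = k!\,(n-k)!/(n+1)!$ for $k$ heads in $n$ tosses.

```
from itertools import product
from math import factorial

def polya_prob(seq):
    # Probability that the actors announce exactly this sequence,
    # starting from a uniform Beta(1, 1) belief: one pseudo-head and
    # one pseudo-tail, incremented after each announcement.
    h, t, p = 1, 1, 1.0
    for outcome in seq:
        if outcome == 'H':
            p *= h / (h + t)
            h += 1
        else:
            p *= t / (h + t)
            t += 1
    return p

def real_coin_prob(seq):
    # Probability of the same sequence from a real coin whose bias is
    # drawn uniformly: integral of p^k (1-p)^(n-k) over [0, 1].
    k, n = seq.count('H'), len(seq)
    return factorial(k) * factorial(n - k) / factorial(n + 1)

# The two processes agree on every sequence (here, all of length 5).
for seq in product('HT', repeat=5):
    assert abs(polya_prob(seq) - real_coin_prob(seq)) < 1e-12
```

Exchangeability is doing the work here: under both processes, the probability of a sequence depends only on its counts of heads and tails, not on their order.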


Have you noticed the way streetlights shine through the bare branches of trees on cold winter nights? Walking home from work tonight I was struck by the almost overwhelming beauty of the scene. The light creates a perfect, glittering halo around itself, like the moon in a Van Gogh painting. The tree in turn all but reaches its fingers out to grasp the light. If you move your head from side to side, it feels as if a wormhole has opened in the tree, sucking the branches into the light’s force field.

I think this happens because of the Fresnel effect. If the light hits a branch at just the right glancing angle, the wood (like many other materials in the world) becomes unexpectedly shiny. In the brambles that hang off of a tree’s bare branches, the only twigs that catch the light’s glint are the ones that make a specific shallow angle with your eye. By radial symmetry, it’s not hard to see that this should result in a glowing circular halo. It’s similar to how divers see a circular window on the surface of the water when they look up, or the 22° halo you sometimes see around the sun — but here the medium is wood, not water.

If you live in a place that has seasons, I encourage you to go and see this effect for yourself. It’s a hard phenomenon to capture on camera — it requires high dynamic range, high-resolution video, and patience outdoors when it’s cold and dark. For now, here is the best I could do with my old iPhone. (I stabilized the light with a little Python script.)

For comparison, here are two frames from this video:

I knocked over a tall jar of honey. Clumsy, clumsy! But the jar was nearly empty; jars tend to fall when they are nearly empty.

But jars also tend to fall when they are full, which got me thinking: for both full and empty jars, we can argue by symmetry that the center of mass is halfway up. By Rolle’s Theorem, then, at some intermediate fill level the center of mass must have a local minimum or maximum. It’s easy to see that the center of mass can never get higher than halfway up, so this critical point must be a minimum. This suggests that the jar’s stability increases and then decreases as the honey is consumed.

When is the jar most stable? Working in units where the empty jar’s height and mass are 1, its radius is $r$, and honey’s density is $\alpha$, the height of the center of mass is given by: $$ c = \frac{\alpha r^2h(h/2) + 1/2}{\alpha r^2h + 1} $$

Let’s assume honey is extremely viscous (i.e. changes shape slowly), and that the jar tips over if its center of mass is over its rim. Then the maximum angle you can tip the jar before it falls over is: $$ \theta^\star = \tan^{-1}(r/c) $$ The question is then, for a given $r$ and $\alpha$, what $h$ maximizes $\theta^\star$? Intuitively, it seems like it should be somewhere halfway between empty and full.

We can compute the optimal $h$ by differentiating $\theta^\star$ with respect to $h$ and setting the derivative to zero: $$ \frac{d\theta^\star}{dh}=-\frac{r^2}{r^2+c^2}\frac{dc}{dh} = 0 $$ The positive solution to this, thanks to WolframAlpha, is given by: $$ h^\star=\frac{1}{1+\sqrt{\alpha r^2+1}} $$ Does this make sense? As $\alpha r^2$ increases (heavier honey), the optimal height gets lower, which makes sense because the honey begins to have a ballasting effect.

Now we can plug in some numbers. A standard 8-oz honey bottle is about 6 inches tall, has a base radius of about 1 inch, and weighs about 1oz. The density of honey is 0.8oz per cubic inch. In our units, this means the jar’s radius is $1/6$ and the density of honey is $0.8/(1/6)^3$. Plugging this in, we find remarkably that $h^\star \approx 0.3$. So a jar of honey is most stable when only about 1/3 full!
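As a quick numerical check (using the same dimensionless units and the same center-of-mass formula as above), we can sweep over fill heights and compare the best tipping angle against the closed form:

```
import numpy as np

def center_of_mass(h, r, a):
    # Height of the combined center of mass, in units where the empty
    # jar's height and mass are both 1 (the formula derived above).
    return (a * r**2 * h * (h / 2) + 1/2) / (a * r**2 * h + 1)

def tip_angle(h, r, a):
    # Maximum tilt before the center of mass passes over the rim.
    return np.arctan(r / center_of_mass(h, r, a))

def h_star(r, a):
    # Closed-form optimum from the derivative computation.
    return 1 / (1 + np.sqrt(a * r**2 + 1))

r = 1 / 6              # 1-inch radius on a 6-inch jar
a = 0.8 / (1 / 6)**3   # honey density, converted to jar units

hs = np.linspace(0.001, 1, 100_000)
h_numeric = hs[np.argmax(tip_angle(hs, r, a))]
print(h_star(r, a), h_numeric)  # both come out near 0.29
```

Maximizing the tipping angle is the same as minimizing the center of mass, so the brute-force sweep and the closed form should (and do) agree.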

Suppose you are cooking tofu, cut into tasty 1-inch cubes. On your stove it takes each cube-face 10 minutes to cook. You could cook each of the 6 faces of each cube in one hour (60 minutes) by rotating the cube to an uncooked face every 10 minutes, but this is clearly tedious and suboptimal.

A better idea is stirring: instead of carefully rotating each cube to an uncooked face, you can vigorously stir your pan and reorient each cube uniformly at random (like rolling dice).

The question I want to consider today is, “how often should you stir your tofu cubes”? For example, you might choose to stir them every 10 minutes for an hour, for a total of 6 stirs including right at the beginning. In that case, you might get lucky and have every face exposed to heat once. However, because each stir is an independent random roll, the more likely scenario is that some face gets cooked twice or more (burnt!) and other faces don’t get any heat (raw!). **Intuitively, it seems like we should stir more frequently to “even out” the heat.** Let’s check this intuition.

Suppose we stir $n$ times over the course of the hour, and we are willing to tolerate being off by up to $\delta=5\%$ of optimal cookedness: that is, we consider the tofu raw before 9 minutes and 30 seconds of heat, and burnt after 10 minutes and 30 seconds. Let $f(n)$ be the proportion of burnt or raw faces, in expectation. We want to see how $f(n)$ depends on $n$.

We can compute $f(n)$ many ways. Most obviously, we can simulate the tofu system with some large number of cubes and see what happens.

```
import numpy as np
from matplotlib import pyplot as plt

mean = 1/6
tol = 0.05

def experiment(n):
    N = 1_000
    rolls = np.random.multinomial(n, np.ones(6)/6, size=N)
    toasts = rolls / n
    burns = ((np.abs(toasts - mean) >= tol * mean) * 1.0).mean()
    return burns
```

It is also not hard to work this out analytically. First, by linearity of expectation we can just look at one face, which acts like a binomial random variable: for each of the $n$ stirs, there is a $1/6$ probability of getting heat. We can subtract the CDFs at the tolerance limits $\mu\cdot(1\pm\delta)$ to compute the proportion of times the face is well-cooked, and then subtract from 1 to get the proportion of bad faces.

```
import scipy.stats

def analytical(n):
    return 1 - (
        scipy.stats.binom.cdf((1 + tol) * (n / 6), n, 1/6) -
        scipy.stats.binom.cdf((1 - tol) * (n / 6), n, 1/6)
    )
```

We can also use statistics tricks to approximate the answer in useful closed forms. For example, we can take a normal approximation to the binomial as $n \rightarrow \infty$. The mean is $\mu=1/6$, of course, and the variance is given by $\sigma^2 = p(1-p)/n = \frac{5}{36n}$. We can then plug those parameters into the normal CDF, along with the tolerance limits as above.

```
def normal(n):
    vari = 1/6 * (1 - 1/6) / n
    z = (tol * mean) / np.sqrt(vari)
    return 1 - (scipy.stats.norm.cdf(z) - scipy.stats.norm.cdf(-z))
```

Finally, we can apply Chebyshev’s inequality or a Chernoff bound. These formulas only provide an upper bound on the proportion of bad faces, but they don’t depend on a normal approximation and are thus guaranteed to be true. I won’t work through the derivations here.

```
def chebyshev(n):
    vari = 1/6 * (1 - 1/6) / n
    z = (tol * mean) / np.sqrt(vari)
    return np.minimum(1., 1 / z ** 2)

def chernoff(n):
    return np.minimum(1., 2 * np.exp(-(tol ** 2) / 3 * (1/6) * n))
```

Plotting all of these metrics out to asymptotic $n \rightarrow \infty$ (see the graph below), we see several expected patterns:

- As $n$ increases, the proportion of burnt faces goes down with exponential falloff.
- The analytical solution closely tracks the simulation, at least until the variance of the simulation gets high enough to make it unreliable.
- The normal approximation is indistinguishable from the analytical solution.
- The bounds are quite loose. Chebyshev “wins” for a little while, but ultimately Chernoff’s exponentiality kicks in and dominates.

So far, so good: it seems like you can get quite well-cooked tofu with only a logarithmic amount of stirring effort!

```
ns = np.arange(1, 30000, 100)
plt.plot(ns, [experiment(int(n)) for n in ns], 'x', label='Simulation')
plt.plot(ns, [analytical(int(n)) for n in ns], '-', label='Analytical')
plt.plot(ns, [normal(int(n)) for n in ns], '--', label='Normal approximation')
plt.plot(ns, [chebyshev(int(n)) for n in ns], '-', label='Chebyshev\'s inequality')
plt.plot(ns, [chernoff(int(n)) for n in ns], '-', label='Chernoff bound')
plt.yscale('log')
plt.ylabel('P(burnt or raw face)')
plt.xlabel('n, the number of tosses per hour')
plt.legend()
```


But asymptotic $n$ is unreasonable in this setting: for example, 6000 flips per hour (6 kfph) is more than one flip per second. Let’s zoom in on small $n$ to see what is happening at the scale of the real world. In this plot, I’m hiding the two bounds (which are all maxed out at 1) and only showing up to $n=600$, which is a stir every 6 seconds.

```
ns = np.arange(1, 600)
plt.plot(ns, [experiment(int(n)) for n in ns], 'x', label='Simulation')
plt.plot(ns, [analytical(int(n)) for n in ns], '.', label='Analytical')
plt.plot(ns, [normal(int(n)) for n in ns], '--', label='Normal approximation')
plt.ylabel('P(burnt or raw face)')
plt.xlabel('n, the number of tosses per hour')
plt.legend()
```


A curious pattern emerges! Here is what I notice:

- The analytical solution (orange dots) continues to track the simulation (blue crosses).
- However, they both diverge from the normal approximation (which is perhaps as we expect at low $n$).
- This divergence is systematic. The analytical probabilities form an interesting “banded” structure.
- Most surprisingly, sometimes increasing $n$ *increases* the number of bad faces!
- To do better than $n=6$, you have to go all the way out to $n\approx 500$.

The puzzle is, **Why does more stirring cause worse cooking?**

Let’s consider the cases $n=6$ and $n=12$. For $n=6$, we already reasoned that a face is well-cooked only if it is exposed to heat for exactly one of the 6 turns. This has probability $\binom{6}{1}(1/6)^1(5/6)^5 \approx 0.402$, and indeed the graph above shows that analytically, we expect about a $0.6$ probability of a face getting burnt or staying raw at $n=6$.

Now consider $n=12$. A face is definitely well-cooked if it gets 2 turns on the heat, because $2/12 = 1/6 = \mu$. But what about 1 turn or 3 turns? In either case, the relative deviation from $\mu$ is $|1/12 - 1/6|/(1/6) = 0.5 \geq \delta$, so the face would either be raw or get burnt. Hence, the probability of a well-cooked face is $\binom{12}{2}(1/6)^2(5/6)^{10} \approx 0.296$, which yields a probability of $0.704$ that the face is raw or burnt. Higher than the $0.6$ at $n=6$!
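Both of these binomial calculations are easy to verify numerically; here is a one-off check (not part of the analysis above):

```
from math import comb

# P(exactly 1 of 6 turns on the heat) and P(exactly 2 of 12 turns):
# the only well-cooked outcomes at n = 6 and n = 12, respectively.
p_good_6 = comb(6, 1) * (1/6)**1 * (5/6)**5
p_good_12 = comb(12, 2) * (1/6)**2 * (5/6)**10
print(round(p_good_6, 3), round(p_good_12, 3))  # 0.402 0.296
```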

Another way to see what’s going on is to consider 2-faced pancakes instead of 6-faced tofu cubes. If you cook a pancake with $n=2$ random flips, you have a good chance that you will cook both sides (one on each flip), though of course there is some chance you burn one side by cooking it twice, and leave one side uncooked. But if you cook a pancake with $n=3$ flips, you will necessarily always burn one side and leave the other side uncooked, because at best you will get a 1/3-2/3 ratio of cooking time.

Returning to cubes now, let’s see what happens more generally. Say a face gets $X$ turns on the heat, where $X \sim \text{Binom}(n, 1/6)$. The probability of being well-cooked is $\Pr[|X/n-\mu| \leq \delta\mu]$ where $\mu=1/6$. Breaking this up, we can sum over possible outcomes of $X$ to get $\sum_{(1-\delta)(1/6) \leq x/n \leq (1+\delta)(1/6)} \Pr[X=x]$. In other words, we are summing over integer multiples of $1/n$ in the range $[(1-\delta)(1/6), (1+\delta)(1/6)]$. How many integer multiples are there in that range? The count is given by $\lceil (1/6)(1+\delta)/(1/n) \rceil - \lfloor (1/6)(1-\delta)/(1/n) \rfloor - 1$. We can plot this against the analytical probability to get some insight.

```
ns = np.arange(1, 100)
plt.plot(ns, [analytical(int(n)) for n in ns], '.', label='Analytical')
plt.plot(
    ns,
    (np.ceil(ns * 1/6 * (1 + tol)) -
     np.floor(ns * 1/6 * (1 - tol)) - 1),
    label='Number of integer multiples of 1/n in range'
)
plt.xlabel('n, the number of tosses per hour')
plt.legend()
```


The spikes in the number-of-integers graph correspond directly to the jumps between bands. This suggests the following explanation: there are two “forces” at play.

- The number of integer multiples of $1/n$ that fall within the range allowed by the tolerance, which jumps around in a quantized manner according to number theory.
- The probability of matching each of those integer multiples, which decreases as we add more flips, because the probability mass spreads over more possible outcomes.

When the first “force” spikes, it causes the probability of a bad face to drop suddenly because there are more ways to cook a good face. However, the second “force” causes the probability of a bad face to rise gradually because each way to make a good face becomes less likely. This explains the banding structure in the graph above.

To summarize, in the limit, continuous stirring indeed helps cook your tofu more evenly. However, if you are stirring only occasionally, then sometimes more stirring can actually harm your tofu!

This summer I was briefly in Vancouver, and after dinner on my last day I found myself walking several blocks back to my hotel. The night was dark but the sidewalk was brightly lit and it was a lovely journey. Along the way, I took this picture:

What struck me is how the streetlight casts two shadows, blue and yellow, and moreover those shadows appear opposite the yellow and blue lamps, respectively. What could possibly be going on?

Here is an explanation: the two lamps together create a kind of grayish-white light that bathes the sidewalk. Where the yellow lamp is occluded, the blue light is dominant, so the shadow is blue. Similarly, where the blue lamp is occluded, the yellow light is dominant, so the shadow is yellow.

Looking at this scene I’m reminded of painter Wayne Thiebaud’s rich, saturated shadows. You could say the perceptual effect here demonstrates that “white light” is the sum of all wavelengths, a fact we learn in grade school (I think there is an exhibit at the SF Exploratorium with a similar concept). But to me, this also demonstrates the range of what we are willing to call “white.” If one of the lamps were to burn out, our eyes would adjust to the blue or yellow almost immediately, and we would still see the sidewalk as gray — we experience a truly remarkable “color constancy” across lighting conditions. In this way, when he paints a shadow as a saturated, non-gray color, Thiebaud sees beyond his own seeing.

The other day my toilet broke and I involuntarily learned a lot about how flushing works. My friend suggested an analogy for me: flushing toilets is like a neuron’s activation: once a critical threshold is met, there’s an “all-at-once” response.

That got me thinking: could we implement deep neural networks in plumbing? It turns out, the answer is yes! A very simplified model of a flush toilet’s nonlinear behavior is as follows: it’s a bucket, into which water can be poured, and there is a hole at height $h \geq 0$. If you pour a volume $v$ of water into the bucket, the output that flows out of the hole is $\text{ReLU}(v - h)$.

The second component we need to build a neural network is a linear map. We can get one by attaching a branching pipe to the hole. This component will have $k$ branches with cross-sectional areas $A_1, A_2, \dots, A_k > 0$. By conservation of mass and a simple pressure argument, the fraction of the water that pours out of branch $i$ is $A_i / \sum_j A_j$.

Together, these components allow us to compute a function from $\mathbb{R}\rightarrow \mathbb{R}^k$, which looks something like $\text{PeLU}(v, \vec{A}, h) = \text{ReLU}(v - h)\cdot \vec{A} / \sum_j A_j$. Here, “PeLU” stands for “Porcelain-Emulated Linear Unit.” It is clear how to vectorize this expression over $v$, which effectively creates a new kind of neural network “layer” with trainable parameters $\vec{A}$ and $h$ for each input dimension. To enforce the positivity constraint on $h$ and $A_i$, we will actually work with the following key equation: $\text{PeLU}(v, \vec{A}, h) = \boxed{\text{ReLU}(v - h^2) \cdot \text{softmax}(\vec{A})}$.

All that is left to do at this point is to implement this in PyTorch and train it.

```
import torch

class PeLU(torch.nn.Module):
    def __init__(self, in_feat, out_feat):
        super().__init__()
        self.heights = torch.nn.Parameter(torch.randn(in_feat))
        self.weights = torch.nn.Parameter(torch.randn(in_feat, out_feat))

    def forward(self, X):
        X = torch.nn.functional.relu(X - self.heights ** 2)
        X = X.matmul(self.weights.softmax(dim=1))
        return X
```

Here, I built a PeLU layer that can be slipped into any PyTorch model, mapping `in_feat` inputs to `out_feat` outputs. Next, let’s stack some PeLU layers together and train the result on the classic “Iris” dataset, which has 4 features and assigns one of 3 labels. We will create a “hidden layer” of size 3, just to keep things interesting.

```
from iris import feat, labl

m = torch.nn.Sequential(
    PeLU(4, 3),
    PeLU(3, 3)
)
o = torch.optim.Adam(m.parameters(), lr=0.01)
lf = torch.nn.CrossEntropyLoss()
for i in range(10_000):
    o.zero_grad()
    pred = m(feat * 10)
    loss = lf(pred, labl)
    loss.backward()
    o.step()
print('Loss:', loss)
print('Error:', 1. - ((torch.argmax(pred, dim=1) == labl) * 1.).mean())
```

This trains very quickly, in seconds, and gives an error of 2%. Of course, we haven’t split the dataset into train and test sets, so we may be overfitting.

By the way, you may have noticed that I multiplied `feat` by 10 before passing it to the model. Because of the conservation of mass, the total amount of water in the system is constant. But each bucket “loses” some water that accumulates below the hole. To make sure there’s enough water to go around, I boosted the total amount.

But that’s all that needs to be done! Once we export the parameters and write a small visualizer…

Isn’t that a delight to watch? I may even fabricate one to have as a desk toy — it shouldn’t be hard to make a 3D-printable version of this. If we wanted to minimize the number of pipes, we could add an L1 regularization term that encourages sparsity in the $\vec{A}$ terms.

Some final thoughts: this system has a couple of interesting properties. First, there’s a kind of “quasi-superposition” that it allows for: if you pour in more water on top to incrementally refine your input, the output will automatically update. Second, the “conservation of mass” guarantees that the total water output will never exceed the total water input. Finally, it’s of course entirely passive, powered only by gravity.

This has me wondering if we can build extremely low-power neural network inference devices by optimizing analog systems using gradient descent in this way (a labmate pointed me to this, for example).

Below is a little widget you can use to enjoy playing with PeLU networks. All inputs must be between 0 and 10. :)


Speculation on what (other than its own frail body) that bird-in-the-hand might signify has always been attractive to me, but especially so now thinking, as I have been, about the work I do that has brought me to this company. (full)

And just because you want to paint a bird, do actually paint a bird, it doesn’t mean you’ve accomplished anything. (full)

Waite: What do you call this?
Steichen: I use the same term the sculptor did, oiseau, a bird.
Waite: What makes you call it a bird, does it look like a bird to you?
Steichen: It does not look like a bird but I feel that it is a bird, it is characterized by the artist as a bird.
Waite: Simply because he called it a bird does that make it a bird to you?
Steichen: Yes, your honor.
Waite: If you would see it on the street you never would think of calling it a bird, would you?

[Steichen: Silence]

Young: If you saw it in the forest you would not take a shot at it?
Steichen: No, your honor. (more)

But who knows what Fabritius intended? There’s not enough of his work left to even make a guess. The bird looks out at us. It’s not idealized or humanized. It’s very much a bird.

And then there is this fourth level, which is a whole new object in the world: the prop made for the movie, the representative of the thing, becomes, in its own right, a whole other thing, a whole new object of desire. . . . There are several people who own originals, and I have been attempting to contact them and reach them, hoping that they will let me spend a few minutes in the presence of one of the real birds, maybe to take a picture, or even to pull out the hand-held laser scanner that I happen to own that fits inside a cereal box, and could maybe, without even touching their bird, I swear, get a perfect 3D scan. And I’m even willing to sign pages saying that I’ll never let anyone else have it, except for me in my office, I promise. I’ll give them one if they want it. And then, maybe, then I’ll achieve the end of this exercise. But really, if we’re all going to be honest with ourselves, I have to admit that achieving the end of the exercise was never the point of the exercise to begin with, was it? (full)

Plate 357 (Numenius Borealis) is the only instance in which the subject appears dead in the work of John James Audubon.

He waits alone, sketching angels from the shade—

a kind of heavenly bird, he reasons with himself—although their wings are broke, faces scarred, each fragile mouth feigning the same sad smile as the one before it. Offered triple his price to paint a likeness of the pastor’s daughter—buried for more than a week—he reluctantly agreed—times being what they are.

. . . . .

And yet he studies it—from behind the dunes—studies its several postures, grounded and in sudden flight—and not content to praise it from a distance, to sacrifice detail, unpacks his brushes and arranges them before raising his rifle and taking aim.

(Added April 18, 2023)

Zizzy, uncanny, pebble-tapping, lusty, pule. Ventriloquial, tantara, feminine, crepitate. Tintinnabulation, sough, devil’s tattoo. Sparrowy.

(Added June 2023)

By Kobayashi Issa; mentioned by a professor on a walk by the Charles River.

Children imitating cormorants

are even more wonderful

than cormorants.

(Added June 2023)

From Philip K. Dick’s speech, “How to build a universe.”

In my writing I got so interested in fakes that I finally came up with the concept of fake fakes. For example, in Disneyland there are fake birds worked by electric motors which emit caws and shrieks as you pass by them. Suppose some night all of us sneaked into the park with real birds and substituted them for the artificial ones. Imagine the horror the Disneyland officials would feel when they discovered the cruel hoax. Real birds! And perhaps someday even real hippos and lions. Consternation.

(Added June 2023)

From Eric Kraft’s *Where do we stop?*

The idea that seemed so bright when it was leaping and darting and fluttering through my mind looked dull and dead when I’d caught it and pinned it to my paper … It wasn’t an idea now, but the representation of an idea. It didn’t fly, didn’t flutter by, didn’t catch the eye as I thought it would. (154)

(Added June 2023)

From CAKE’s Mr. Mastodon Farm

Now due to a construct in my mind

That makes their falling and their flight

Symbolic of my entire existence

It becomes important for me

To get up and see

Their last-second curves toward flight

It’s almost as if my life would fall

Unless I see their ascent.

See also: the poem “Because You Asked about the Line Between Prose and Poetry“ by Howard Nemerov.

Sparrows were feeding in a freezing drizzle

That while you watched turned to pieces of snow

Riding a gradient invisible

From silver aslant to random, white, and slow.

There came a moment that you couldn’t tell.

And then they clearly flew instead of fell.

From an interview with illustrator David Sibley.

An illustration provides so much more than a photograph. In an illustration, I can create a typical bird, an average bird of a species in the exact pose that I want, and create an image of a similar species in exactly the same pose so that all the differences are apparent… Your drawing becomes a record of your understanding of that bird in that moment.

(See Edgerton’s photograph of Mrs. May Rogers Webster with her hummingbirds.)

A poem by Raymond Carver

A crow flew into the tree outside my window.

It was not Ted Hughes’s crow, or Galway’s crow.

Or Frost’s, Pasternak’s, or Lorca’s crow.

Or one of Homer’s crows, stuffed with gore,

after the battle. This was just a crow.

That never fit in anywhere in its life,

or did anything worth mentioning.

It sat there on the branch for a few minutes.

Then picked up and flew beautifully

out of my life.

Hello and welcome to the 198th Carnival of Mathematics, a roving monthly roundup of mathy blog posts from around the Internet! Longtime Comfortably Numbered readers should be no stranger to the Carnival: I hosted #134, #148, and #159 in the past.

As is traditional, I want to start by thinking a bit about the number 198. It is actually a very dear number to me — I even own a “198” t-shirt! Why, you ask? The number 198 is the emblem of the wonderful CS198 program at Stanford, which hires a huge team of undergraduates to help teach the introductory computer science courses every year. (As far back as 1995, the faculty running the program published a retrospective on its impact on campus.)

Here is another fun fact about the number 198: suppose I start looking at all the integers ($1, 2, 3, 4, 5, \dots$), keeping only the ones that are perfect powers ($1, 4=2^2, 8=2^3, 9=3^2, 16=2^4, \dots$), and then taking their cumulative averages ($1$, $2.5=(1+4)/2$, $4.33=(1+4+8)/3$, $5.5=(1+4+8+9)/4$, $7.6=(1+4+8+9+16)/5, \dots$). You might wonder, are there any other integers in this sequence? It turns out, *yes!* And I bet you can guess the first one…
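If you’d rather compute than guess, a few lines of Python find it (the search bound of 1,000 is an arbitrary but sufficient cutoff):

```
# Cumulative averages of the perfect powers 1, 4, 8, 9, 16, ...,
# searching for the first integer average bigger than the trivial 1.
powers = sorted({a**b for a in range(1, 32) for b in range(2, 11) if a**b <= 1000})
total, answer = 0, None
for n, p in enumerate(powers, start=1):
    total += p
    if total % n == 0 and total // n > 1:
        answer = total // n
        break
print(answer)  # 198
```

Sure enough, the average of the first 33 perfect powers is exactly 198.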

Now, time for some links! I’ve provided some “teaser” text from each submission to get you interested.

The BRSR blog asks: can you find a Euclidean triangle on a non-Euclidean surface?

A question that came up in a math chatroom (yes, I’m the kind of nerd who spends time in math chatrooms): find a “Euclidean” triangle on a non-Euclidean surface. More exactly, find a geodesic triangle on a surface with non-constant Gaussian curvature having the same sum of interior angles as an Euclidean triangle.

On his blog bit-player, Brian Hayes asks: why do pandemics peak in distinct waves as the months go by?

I’m puzzled by all this structural embellishment. Is it mere noise—a product of random fluctuations—or is there some driving mechanism we ought to know about, some switch or dial that’s turning the infection process on and off every few months?

I have a few ideas about possible explanations, but I’m not so keen on any of them that I would try to persuade you they’re correct. However, I *do* hope to persuade you there’s something here that needs explaining.

On Twitter, Thien An asks: would you rather have a bishop and a knight, or two bishops?

To get this figure, it suffices to grab 100k games and only keep the positions with BB+pawns v.s. BN+pawns and see who eventually wins. That’s very simplistic since there are indeed many other important positional features, but I was actually expecting less convincing results…

Looking at a video of a pomegranate sorting mechanism, Nisar Khan asks: how much do the pomegranate sizes vary between sorted buckets?

After watching the below video, thought of finding the approximate size differences of pomegranates sorted in different boxes…

On his math blog, Tony asks: did studying maths help Emma Raducanu win the US Open?

I’ve discussed toy examples in public lectures on game theory, which (it seems to me) is relevant to choices players make - whether to serve to the forehand or backhand, and where to expect for your opponent to serve, for example. I very much doubt if players ever analyse in these terms, but they are intuitively doing game theory when making their tactical decisions.

But I think in the case of Raducanu there is a more general point…

On Risen Crypto, Trajesh asks: how does the quadratic sieve factoring algorithm really work?

The Quadratic Sieve is the second fastest algorithm for factoring large semiprimes. It’s the fastest for factoring ones which are lesser than 100 digits long.

On The Universe of Discourse, Mark Dominus asks: why is the “S” combinator named “S”? He also posted a nice puzzle that is easily solved if you happen to know Dilworth’s theorem; separately, Jim Fowler built a musical PICO-8 game where you race to find the chains guaranteed by Dilworth’s theorem.

I thought about this for a while but couldn’t make any progress. But OP had said “I know I have to construct a partially ordered set and possibly use Dilworth’s Theorem…” so I looked up Dilworth’s theorem.

On their blog “Gödel’s Last Letter and P=NP,” Ken Regan and Dick Lipton discuss “baby steps” in math, in the context of some exciting recent results.

Reckoned against a later paper by Cohn and Elkies, [Viazovska’s] improvement was 0.0000…000001. The recent post where we discussed the phrase “the proof is in the pudding” involves a number with six more zeroes than that. These are *not* what we mean by “baby steps.”

And finally, something more about sequences— on her math blog, Tanya Khovanova explores the top 50 largest numbers to start OEIS sequences.

My son, Alexey Radul, wrote a program that finds the largest numbers to start a sequence in the Online Encyclopedia of Integer Sequences (OEIS). To my surprise, the top ten are all numbers consisting of ones only.

That’s all for this month’s Carnival! What a joy to see — once more — the math blog world *overflowing* with curiosity. Hosting the Carnival of Mathematics truly is a delight.

Want even more math? The previous Carnival, the 197th, was hosted by Jeremy at Math $\cap$ Programming, and the next — the 199th, to which you can contribute here, starting now! — will be hosted by Vaibhav at DoubleRoot. See you there!

Surely *something* will happen — which means that if you integrate over all
possible outcomes, the probabilities must add up to 1. This elementary fact
remains true, even if the probabilities of the outcomes change for some reason.
Are you reminded of something? *Fluids* behave in the same way: no matter how
the particles are distributed, the total quantity of fluid is conserved, which
means that if you integrate over all of space, the masses must add up to 1.
Indeed, when we talk about *probability density functions* and *probability
mass functions*, we are borrowing fluidic vocabulary.

I get excited about analogies like this, where the physics of one domain
promises to unexpectedly deliver insight into a different domain. Another
example of this is how spring constants and capacitances play the same
mathematical role in the differential equations that model their respective
mechanical and electrical systems. Harry Olson’s wonderful book *Dynamical
Analogies* shows some more
beautifully-illustrated examples of mechanical, electrical, and acoustic
systems with identical dynamics.

Can we get any mileage out of the probability-fluid analogy? It turns out,
*yes!* We can apply our physical intuition about fluids to design a clever
statistical machine learning algorithm. That’s the subject of this piece. By
the way, you can undoubtedly find better (and more correct) expositions of this
math in other places; my goal here is to emphasize the interaction between
physical intuition and probabilistic applications, bridged so beautifully by
the fluidic analogy.

A little bit about the problem setup: sometimes, when working with data, you find yourself in a place where you want to sample from a probability distribution $p(x)$. This might not be so bad if you can compute $p(x)$ exactly — for example, you can use rejection sampling or something. Often, however, you can only compute some scaled quantity $q(x)$ such that $p(x) \propto q(x)$ with some unknown constant of proportionality. Commonly, this happens in Bayesian inference, where you may want to sample from $\pi(x \mid y) = \pi(x,y)/\pi(y)$ where $y$ might be data you’ve seen and $x$ might be the parameters to your model. Here, the problem is that the normalizing constant $\pi(y)=\int \pi(x,y)dx$ is hard to compute because that integral is intractable in high dimensions.
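To make the easy case concrete, here is a minimal rejection sampler for the situation where $p(x)$ *is* computable exactly and bounded on a known interval. The triangular density and bounds below are illustrative choices, not anything from this post:

```python
import random

def rejection_sample(p, lo, hi, p_max, rng=random):
    """Draw one sample from density p supported on [lo, hi], where p(x) <= p_max."""
    while True:
        x = rng.uniform(lo, hi)            # propose uniformly on the support
        if rng.uniform(0, p_max) < p(x):   # accept with probability p(x) / p_max
            return x

# Example: sample from the triangular density p(x) = 2x on [0, 1].
random.seed(0)
samples = [rejection_sample(lambda x: 2 * x, 0.0, 1.0, 2.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)         # true mean of this density is 2/3
```

Note that this needs both an exact $p(x)$ and a usable bound $p_{\max}$ — exactly the luxuries we lack in the Bayesian setting described above.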

One famous way to do this kind of sampling is by the Metropolis-Hastings
algorithm, which approximates $p(x)$ with the stationary distribution of a Markov
chain. The “miracle” of the algorithm is that the transition probabilities
between two states only depend on the *ratios* of the probability densities at
those states — the unknown constant simply cancels out!
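Here is a sketch of random-walk Metropolis illustrating that cancellation; the unnormalized Gaussian target and step size are illustrative choices:

```python
import math
import random

def metropolis(log_q, x0, n_steps, step=1.0, rng=random):
    """Random-walk Metropolis: needs only the *unnormalized* log-density log_q."""
    x, chain = x0, []
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, step)
        # Accept with probability min(1, q(x_new)/q(x)): the ratio of densities
        # means the unknown normalizing constant cancels out.
        if math.log(rng.random()) < log_q(x_new) - log_q(x):
            x = x_new
        chain.append(x)
    return chain

# Unnormalized standard Gaussian: q(x) = exp(-x^2/2), no 1/sqrt(2*pi) required.
random.seed(0)
chain = metropolis(lambda x: -x * x / 2, 0.0, 200_000)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
```

The chain's empirical mean and variance approach 0 and 1, as they should for the standard Gaussian, even though the code never sees the normalizing constant.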

But now let’s think about this from a fluids perspective. Have unknown normalizing constants popped up in physics anywhere? Yes! The partition function $Z$ in the Boltzmann distribution is often unknown. Recall that a system at thermodynamic equilibrium at temperature $T$ has probability $p_i \propto e^{-U_i/k_BT}$ of being in state $i$, where $U_i$ is the potential energy of that state and $k_B$ is a physical constant that you can find, e.g. on Wikipedia. A curious application of this distribution is working out how thin the atmosphere is at altitude $h$ by reasoning about $U_h$ being the gravitational potential energy at altitude $h$: because $U$ is linear in $h$, we can instantly conclude that atmospheric density decreases exponentially in altitude.

But returning to machine learning, here is the insight that the fluidic analogy makes possible: if we set $U(x) = -\log q(x)$ then a particle simulated at some “temperature” $T=1/k_B$ in potential $U$ would eventually have probability $e^{-U(x)}=p(x)$ of being at $x$. So, we can use what we know about the dynamics of particles that are a part of a fluid at thermal equilibrium, to sample from probability distributions that may have nothing at all to do with fluids.

What exactly does it mean to “simulate a particle” of this probability fluid? You can take those words pretty literally, as it turns out. There are two sources of motion at each timestep: $U$ and $T$. Given potential $U(x)$, we can derive the force $-\nabla U$ that acts on our particle, and take the particle’s velocity to be proportional to that force. (Why velocity rather than acceleration? The particle is repeatedly being bumped around by other particles and having its momentum reset — this is the *overdamped* regime — so there’s no need to track acceleration, and velocity is roughly independent between timesteps.) By the way, notice that $\nabla \log q(x)$ doesn’t depend on the unknown normalization constant, because the logarithm turns it into an additive constant, which disappears under the derivative.

As for the effect of $T$, we can model the thermal motion of the particle as Brownian motion: that is, at each time, the particle gets bumped to some random neighboring position drawn independently from a Gaussian that is somehow scaled with respect to “temperature” $T$. (How do we translate $T=1/k_B$ to this un-physical world where “joules” and “kelvins” make no sense? Stay tuned…)

This fluidic analogy leads us to what people call the “Unadjusted Langevin Algorithm.” The algorithm is quite simple to state and implement:

- Initialize some $x_0$.
- Using some small step size $\Delta t$, integrate $\Delta x = \Delta t \cdot \nabla\log q(x_t) + \sqrt{2\Delta t} \cdot \xi_t$, where you sample $\xi_t \sim N(0,1)$; the Greek letter $\xi$ is chosen for maximum squiggliness, in order to act as a visual metaphor for its function. Importantly, this step only requires you to evaluate the unnormalized density $q$!
- That’s it — under some (not *too* mild, but still reasonable) assumptions, snapshots of the trajectory $\{x_t\}$ are distributed according to $p(x)$!
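The steps above can be sketched in a few lines of Python; the Gaussian target, starting point, and step size below are illustrative choices:

```python
import math
import random

def ula(grad_log_q, x0, n_steps, dt=0.01, rng=random):
    """Unadjusted Langevin Algorithm: x += dt * grad(log q)(x) + sqrt(2*dt) * xi."""
    x, traj = x0, []
    for _ in range(n_steps):
        x = x + dt * grad_log_q(x) + math.sqrt(2 * dt) * rng.gauss(0.0, 1.0)
        traj.append(x)
    return traj

# Target: unnormalized q(x) = exp(-x^2/2), so grad(log q)(x) = -x.
random.seed(0)
traj = ula(lambda x: -x, x0=3.0, n_steps=200_000)
burned = traj[10_000:]                      # discard burn-in
mean = sum(burned) / len(burned)
var = sum((x - mean) ** 2 for x in burned) / len(burned)
```

For this target the snapshots should look like draws from the standard Gaussian: empirical mean near 0, variance near 1 (up to a small discretization bias of order $\Delta t$).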

Are you convinced? Here’s a sanity check: without the stochastic component $\sqrt{2\Delta t}\cdot\xi_t$ at each step, $x_t$ deterministically climbs toward a local maximum of $q$ (the MAP estimate, in the Bayesian setting) — that is, without the stochastic component, we’d have invented gradient descent on the potential $U = -\log q$! The stochasticity, however, makes $x_t$ a random variable instead.

The rest of this piece is dedicated to answering the question: where does that stochastic coefficient $\sqrt{2\Delta t}$ come from? You might have already worked out that it’s the magnitude of the thermal motion, which depends on $T$. If that coefficient were too small, then, as we just discussed above, we simply have gradient descent: our particle is analogous to a marble rolling down a (syrupy — overdamped!) hill. If the coefficient were too large, then thermal motion would dominate and the “signal” from the potential $U$ would stop mattering. So, clearly, the threshold $\sqrt{2\Delta t}$ is “special” in some way.

First let me convince you that we want this coefficient to indeed be
proportional to $\sqrt{\Delta t}$, even though it looks dimensionally wrong
when placed in a sum right next to the $\Delta t\cdot\nabla \log q$ term.
What’s going on? It turns out that this “lots of little Gaussian bumps” motion
doesn’t have an intuitive integral calculus. In particular, for Gaussians, it’s
the *variance* that adds, not the standard deviation. So in time $t$ the
variance of the total distance traveled would be $\sum (\sqrt{\Delta t})^2 =
\sum \Delta t = t$, and therefore the standard deviation would be $\sqrt{t}$,
which is exactly what we want for the algorithm to be invariant to how small
$\Delta t$ is. (For more on this, you could read about “Wiener processes.”)
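We can check this scale-invariance numerically: simulate Brownian paths with two different step sizes and confirm that the endpoint spread depends only on the total time $t$. The durations and path counts below are illustrative:

```python
import math
import random

def brownian_endpoint_std(t, dt, n_paths, rng=random):
    """Simulate n_paths Brownian paths of duration t, built from Gaussian
    increments with standard deviation sqrt(dt); return the endpoint std."""
    n_steps = round(t / dt)
    ends = []
    for _ in range(n_paths):
        x = 0.0
        for _ in range(n_steps):
            x += rng.gauss(0.0, math.sqrt(dt))
        ends.append(x)
    m = sum(ends) / len(ends)
    return math.sqrt(sum((e - m) ** 2 for e in ends) / len(ends))

random.seed(0)
# Regardless of step size, the endpoint spread is about sqrt(t) = 2.
coarse = brownian_endpoint_std(t=4.0, dt=0.1, n_paths=5_000)    # 40 steps/path
fine = brownian_endpoint_std(t=4.0, dt=0.01, n_paths=5_000)     # 400 steps/path
```

Both estimates land near $\sqrt{4} = 2$: halving $\Delta t$ doubles the number of bumps but shrinks each one, and the variances sum to the same total.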

Now for the “2.” To derive this value, let’s return to our fluidic analogy. Suppose we wanted to compute the change in probability density at a point $x$ over time, that is, ${\partial p(x,t)}/{\partial t}$, with the understanding that we will set this time derivative to zero later to reason about the system’s behavior at equilibrium. By following the fluidic analogy, we can hallucinate a “probability flow” $\vec{J}(x, t)$ and assert, by the divergence theorem (this is just the continuity equation of fluid dynamics), that ${\partial p}/{\partial t} = -\nabla \cdot \vec{J}$.

To measure $\vec{J}(x, t)$, we should think about what’s flowing; that is, about “particles,” whose dynamics is given by the Langevin update rule for $\Delta x_t$ I wrote out above. The deterministic part is easy to deal with: from a tiny region of volume $\Delta V$, a mass $p\Delta V$ of particles leaves with velocity $\nabla\log q$, so the flux is $p\nabla \log q$.

For the stochastic part, notice that the dynamics of the thermal motion are spatially invariant, i.e. independent of $x$. So, whatever amount diffuses out of $\Delta V$ is proportional to $p$, and whatever amount diffuses in is proportional to the density in the neighborhood of $\Delta V$: this means the net flow is proportional to $-\nabla p$. When we talk about fluids, this observation is also known as Fick’s Law, and the constant of proportionality is called the diffusion coefficient $D$.

How can we compute $D$? Looking at only the stochastic part, the flux is
$-D\nabla p$, so $\partial p/\partial t = D\nabla^2 p$, which we recognize as
the heat equation. That is,
for this brief aside, we’re going to think of $p$ representing — not
probability over Bayesian model parameters *or* fluid density in a potential
well, but — heat in a block of metal — a bonus analogy!

Thankfully, the solution to the heat equation turns out to be well-known: it is
a time-varying Gaussian. And it better be! After all, the stochastic diffusion
is just a sum of lots of independent little Gaussian-sampled steps of variance
$\Delta(\sigma^2) = {2\Delta t}$, which together form a big Gaussian of
variance $\sigma^2 = 2t$, as discussed above. But let’s say we didn’t know the
coefficient was 2. Let’s set it to $a$ instead. Now, if we know that $p(x,t) =
e^{-x^2/(2at)}/\sqrt{2\pi(at)}$, then we can literally take the partial
derivatives on both sides of the heat equation and expand to solve for $D$. A
miracle occurs, and many things cancel, leaving us with simply $D=a/2$. It’s
not worth writing out the algebra here: I encourage you to try it yourself,
since it is that rare combination of straightforward and satisfying. But
furthermore, it gives some mechanical (if not physical) insight into the “2”
that appears in the denominator of the result: algebraically, the $-1/2$ power
of $at$ in the normalization constant $1/\sqrt{2\pi(at)}$ kicks out some
factors of two that are *not* balanced by the $-1$ power of $t$ in the
exponential.
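If you’d rather not take the algebra on faith, a quick finite-difference check confirms the cancellation: for this Gaussian, $\partial p/\partial t$ equals $(a/2)\nabla^2 p$ at any point. The evaluation point below is arbitrary:

```python
import math

def gaussian(x, t, a):
    """p(x, t) = exp(-x^2 / (2 a t)) / sqrt(2 pi a t): diffusion from a point source."""
    return math.exp(-x * x / (2 * a * t)) / math.sqrt(2 * math.pi * a * t)

def check_heat_equation(x, t, a, h=1e-4):
    """Central finite differences for dp/dt and (a/2) * d^2p/dx^2."""
    dp_dt = (gaussian(x, t + h, a) - gaussian(x, t - h, a)) / (2 * h)
    d2p_dx2 = (gaussian(x + h, t, a) - 2 * gaussian(x, t, a)
               + gaussian(x - h, t, a)) / (h * h)
    return dp_dt, (a / 2) * d2p_dx2

# At an arbitrary point, the two sides agree to finite-difference precision.
dp_dt, rhs = check_heat_equation(x=0.7, t=1.3, a=2.0)
```

The two sides match to numerical precision, confirming that the diffusion coefficient implied by this Gaussian is $a/2$.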

In any case, though, we conclude from the above that $\Delta p = (a/2)\Delta t
\nabla^2 p$, which means the stochastic flux is simply $-(a/2)\nabla p$. Perhaps it’s
all clear to you by now, but let’s take this over the finish line. Putting
these two components together, $U$ and $T$, we have $\partial p/\partial t =
-\nabla\cdot (p\nabla\log q-(a/2)\nabla p)$. This nice identity is known as the
Fokker-Planck equation (the devoted reader might note that this is the *second*
reference to Fokker on this blog).

Now, finally, as promised many paragraphs ago, we can reason about this
equation at equilibrium. At equilibrium the distribution is static in time, so
we have $0 = \nabla \cdot (p\nabla \log q -(a/2) \nabla p)$. Assuming suitable
boundary conditions, we can assume the argument of the divergence is also zero,
so $\nabla \log q = (a/2)\nabla p / p$. Integrating in all directions, we have
$\log q + c = (a/2)\log p$, or rather $p \propto q^{2/a}$. This makes sense:
just as expected, for large $a$ the effect of $q$ is wiped out, and for small
$a$ all but the maximum value get squashed to zero. Crucially, however, if we
want $p \propto q$, then we must set $a=2$. Aha! So *that’s* why the thermal
motion needs to be scaled by $\sqrt{2\Delta t}$. (By the way, that constant of
integration above, $c$, oddly enough encodes that pesky normalization constant
that started off this whole discussion…)
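As a sanity check on the conclusion $p \propto q^{2/a}$, we can run the update with a general coefficient $\sqrt{a\,\Delta t}$ against a standard Gaussian $q$ and measure the chain’s stationary variance: the prediction is variance $a/2$, so only $a=2$ recovers $q$ itself. A small illustrative 1-D experiment:

```python
import math
import random

def langevin_variance(a, n_steps=400_000, dt=0.01, burn_in=20_000, rng=random):
    """Run x += dt * grad(log q)(x) + sqrt(a * dt) * xi for q(x) = exp(-x^2/2);
    return the empirical variance of the chain after burn-in."""
    x, acc, acc2, n = 0.0, 0.0, 0.0, 0
    for i in range(n_steps):
        x = x - dt * x + math.sqrt(a * dt) * rng.gauss(0.0, 1.0)
        if i >= burn_in:
            acc += x
            acc2 += x * x
            n += 1
    m = acc / n
    return acc2 / n - m * m

random.seed(0)
v2 = langevin_variance(a=2.0)   # prediction: p = q, variance a/2 = 1
v4 = langevin_variance(a=4.0)   # prediction: p ∝ q^{1/2}, variance a/2 = 2
```

Doubling the thermal coefficient from $a=2$ to $a=4$ roughly doubles the stationary variance, flattening the sampled distribution relative to $q$ exactly as $q^{2/a}$ predicts.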

Okay. I’m glad I got that out of my system, it’s been rattling around for
weeks, now! How wonderful it is, that we can start with the simplest axiom of
probability — “*something* must happen” — and, reasoning only with physical
intuition, reach a sampling algorithm that solves a real problem.

Snow and snowflakes have been on my mind lately. This is not entirely because
of the storm in Texas — rather, I have been thinking since Christmas about
the snowflake’s *shape:* those small, delicate, spiky circles we all know so
well (err, perhaps there is something of current events here after all…).
This is a blog post about snowflake shapes.

What shape *is* a snowflake? In principle, we all know what snowflakes look
like. We can all imagine the six-fold symmetry, and then we can imagine the
embellishments. And yet— not so fast! As a little child I (and I do not
think I am alone in this) wanted to see the shape of a snowflake for myself.
Of course, I never quite could — at least, to the point where I could stuff
and saturate my eyes — because the delicate snowflakes melted away as soon as
my snow-blind eyes could focus on the crystal caught. Think about it, then:
where does your own conception of snowflake-shape arise from? Does it come from
firsthand experience, or from clip-art and wintertime papercrafts?

(Of course, there *is* something to be said for indirect inference of snowflake-shape. Years later, I would marvel at the fact that you could infer that snowflakes are hexagonal from the existence of 22º haloes, a kind of crystallography at the heavenly scale, linking the macroscopic and microscopic. For now, however, I am concerned with seeing *directly*, and face to face, with a presence and visual immediacy that cannot be afforded by imagination alone.)

This is no small concern. A year or so ago I visited the California Academy of
Sciences and learned about Wilson Bentley, the first man to photograph a
snowflake. This was on January 15, 1885; he was just shy of 20 years old. To
appreciate this moment, you have to imagine a time *before* January 15, 1885,
when no such photograph existed. Until Bentley’s innovation, no man had seen a
single snowflake for more than a delicate glimpse. Perhaps spatially the
snowflake could be magnified under a microscope (as Hooke and his
contemporaries
demonstrated in the 17th century), but how could anyone magnify it
*temporally?* Bentley’s Edgertonesque miracle — akin to Muybridge’s “horse in
motion,” captured just seven years earlier — is to hold still the moment to
reveal *more* — to give us the opportunity to gaze without urgency. Dare I
say, Bentley *freezes time?* The manipulation of time becomes an act of
unconcealment, and then furthermore an act of creation.
Bentley’s snowflakes outlived him, and they will outlive me.

Wilson Bentley was not a scientist by training, and perhaps that served him well. Over the course of his life he collected over 5,000 images of what he called “ice flowers.” There is something marvelous about this devotion, which spanned years and years, crystal after crystal, wonder undiminished. Surely the 5,000th snowflake brought Wilson as much delight as the 4,999th (and surely it is the case that you know you are in love when you find yourself falling in love afresh every day).

There is also something to the uniformity of Bentley’s collection. Each snowflake is framed carefully in the center of a slide, on a black background. The images are diligently cropped to squares. See for yourself! The University of Wisconsin-Madison hosts an online archive of 1,181 of Bentley’s images, courtesy the Schwerdtfeger Library. Bentley never sought copyright on his images; they are in the public domain, and so I downloaded them, along with their labeled categories. The categories have wonderful names, such as “crystal with broad branches” and “lumped graupel.”

When displayed in a grid, the snowflakes look like a box of delicious holiday sweets. Surely these delights are made of sugar and not ice.

What does one *do* with 1,181 snowflakes? One searches for patterns, the way
Bentley would never have been able to. Bentley’s images demand to be studied as
samples from a distribution, and they are strikingly well-suited to our
favorite statistical computer vision techniques. For example, though they are
scanned at a uniform 1000x1000 resolution, they still make sense when
dramatically downscaled, even to as low as 32x32. Notice also how the
snowflakes are highly amenable to *data augmentation* on account of their
intrinsic symmetry. Each snowflake can be reflected and rotated arbitrarily to
produce several more high-quality synthetic snowflakes “for free.” So, even
though there are nominally 1,181 unique images, there is much more juice here
than one might expect at first. You can’t always play this game: CIFAR images
can’t really be rotated, and most MNIST digits can’t even be flipped. But
snowflakes are snowflakes.
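A minimal sketch of this augmentation with numpy: on a square pixel grid the lossless symmetries are the eight elements of the dihedral group (quarter-turn rotations and reflections; the snowflake’s *hexagonal* symmetry would need interpolation). The tiny array below stands in for a snowflake image:

```python
import numpy as np

def dihedral_augment(img):
    """Return all 8 square symmetries of an image: 4 rotations x optional flip.
    These are exactly the transforms that are lossless on a square pixel grid."""
    out = []
    for k in range(4):
        rotated = np.rot90(img, k)   # rotate by k quarter turns
        out.append(rotated)
        out.append(np.fliplr(rotated))
    return out

# A tiny asymmetric 2x2 "image" yields 8 distinct augmented copies.
img = np.array([[1, 2], [3, 4]])
augmented = dihedral_augment(img)
distinct = {a.tobytes() for a in augmented}
```

For an image that already has some of these symmetries (as a real snowflake does), some copies coincide, but for typical scans with imperfections and noise, each transform contributes a genuinely new training example.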

One February evening I trained — with a lot of help from my friends Gautam and Jack — a baby variational autoencoder, which embeds the snowflakes in a small low-dimensional latent space. Here, “low-dimensional” means 10-dimensional: it seems very reasonable to encode $2^{10}$ data points with 10 parameters. In the image below, the top row consists of five randomly-selected source images, and the bottom row is the VAE’s attempted reconstruction. The contrastive green-and-purple color scheme comes from matplotlib’s default, and reminds me of the delightful pGLO experiments we did in high-school biology.

Of course, as with all things deep-learning, this was easier said than
done. For a technique that falls broadly under “unsupervised learning,” a VAE
requires an awful lot of babysitting! —but then again, one of the joys of
online school is that *nobody can tell* if you’re training machine learning
models behind your Zoom tab. So I trained, and I trained. The current model
takes around 6 hours on my dusty old MacBook, though after the first 30 minutes
the results are already non-embarrassing.

*Do you want to build a snowgan?* Separately, I also spent some time trying to build a model adversarially. The snowgan (so to speak) consists of two rival sisters: the generator, Elsa, is of the same architecture as the VAE’s decoder, and the discriminator, Anna, learns to distinguish Elsa’s output from real snowflakes. Elsa and Anna compete, and as a result Elsa gets better at camouflaging the fakes among the flakes. Alas, GAN convergence is finicky enough that I lost interest after a few days… open problem!

It turns out that the VAE’s decoder can be used to build a small-scale
snowflake factory. Using ONNX.js and only a little bit of guile, I ported the
VAE’s decoder to JavaScript and attached the latent space to some
`<input type="range">` sliders. By using the sliders to choose parameters in the latent
space and feeding these parameters to the decoder, one can interpolate to
hallucinate snowflakes that are *not* part of Bentley’s training set. With some
experimentation, I found the sliders that correspond to parts of the latent
space that map to outliers; those parameters I clamped to extreme values in the
non-outlier-ey direction. This interface is what you see at the top of this
blog post.

At time of writing, there is only one Google search result for “this snowflake does not exist,” which points to a 2017 master’s thesis in elementary-school education by Elif Dedebaş. The context is an activity where children are making paper snowflakes, and investigating the relationship between the number of folds and the symmetries. One child creates a snowflake-like shape with eight-fold symmetry and is asked if she knows why such a shape is unnatural.

I imagine this slider infrastructure is on par with the sort of fancy tooling
UI that they have up in the Great Snow Factory In The Sky. I’m reminded of
David Wiesner’s wonderful Caldecott-winning picture book *Sector 7*, which I
recall reading (“reading”) in my elementary school’s library. It is about a
cloud factory, accessed from the top of the Empire State Building. Notice how
the large sousaphone-bell-shaped openings emanating from the factory roof are
exactly a portrait of an autoencoder’s decoder.

“Wiesner visited the Empire State Building on a zero-visibility day to research
this book,” reads the About-the-Author at the end. “He was the only person
there.” There is something of Wiesner in Bentley and Bentley in Wiesner, I
think. Think of them both, and yourself, too, as Friedrich’s *Wanderer*.

The results interpolated by the VAE are surprisingly convincing. Here are some snowflakes that do not exist (snowfakes?). I created these images by sliding the sliders haphazardly and recording a trace of the changing image. Often, people display this kind of interpolation in animated form, but in relation — and, indeed, deference — to Bentley’s life’s work, I would rather see them spread spatially than temporally.

I enjoy trying to work out the exact moment of transition between different kinds of snowflakes; there is a snowy saltation that occurs at certain boundaries that I cannot quite point you to. A curious fact is that a t-SNE plot in the VAE’s latent space shows no clustering with respect to the provided labels; the VAE seems to have hallucinated its own classification scheme, one which is perhaps more macroscopically morphological in nature than Magono and Lee’s 1966 classification scheme.

The BentleyBlizzardBlossoms (B3) dataset, along with some PyTorch utilities and the source code and trained model of the baby autoencoder, is available on GitHub. I think it would make a wonderful alternative to MNIST for small experiments. If you use it for anything, please let me know!
