Suppose two Bayesians, Alice and Bob, put on a variety show where they take turns tossing a biased coin and announcing outcomes to a live studio audience. (Bayesians love this kind of thing—it keeps them entertained for hours…)

Unfortunately, just as Alice goes on stage, she realizes with dread that she forgot to bring the coin. Thinking on her feet, she mimes pulling a tiny imaginary coin out of her pocket, and says “This is a biased coin!” It works—the audience buys it and the crowd goes wild.

She mimes tossing the pretend coin and randomly announces “heads” or “tails.” Then, she hands the coin to Bob, who (catching on) also mimes a toss. This has just turned into a Bayesian improv show.

But now Bob has a problem. Should he announce “heads” or “tails”? He could
choose uniformly at random, but after many rounds the audience might get
suspicious if the coin’s bias is too close to 50%. How can he keep up the
charade of a *biased* coin?

Here’s what Bob does. In the spirit of “yes, and…,” he infers the coin’s bias based on Alice’s reported outcome (say, with a uniform prior) and samples a fresh outcome with that bias. So if Alice said “heads,” Bob would be a bit likelier to say “heads” as well.

Then Alice takes the coin back and does the same, freshly inferring the coin’s
bias from the past *two* tosses. In this way, the two actors take turns
announcing simulated outcomes to the oblivious audience, while building a
shared understanding of the coin’s bias.

What happens? How can we characterize the sequence of outcomes? Intuitively, we might expect either a “rich-get-richer” effect, where they end up repeating heads or tails, or a “regression-to-the-mean” effect, where they converge to simulating a fair coin.

The surprising answer is that this process is indistinguishable from Alice and
Bob tossing a *real* coin with fixed bias (chosen uniformly). A critic lurking
in the audience would never suspect that anything was afoot!

This result is a consequence of the correspondence between the Pólya distribution and the Beta-binomial distribution.

I have a hunch that this observation could be useful: perhaps in designing a new kind of cryptographic protocol, or perhaps in explaining something about human cognition. If you have ideas, let me know!

Proof sketch: Model the actors’ shared belief as a Beta distribution with parameters $(h, t)$, initialized to $(1, 1)$, i.e. uniform. At each toss the probability of heads is $h/(h+t)$, and the outcome increments $h$ or $t$ by 1. You can think of this as a Pólya urn with $h$ black and $t$ white balls: each time you draw a ball, you put it back and add a “bonus” ball of the same color. It is a well-known result that this urn scheme generates the same distribution over sequences as the Beta-binomial process.
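As a quick sanity check (my own, not part of the original argument), we can compute the exact probability of any announced sequence under both descriptions: the urn-style update that Alice and Bob perform, and a real coin whose bias is drawn uniformly, where the sequence probability has the closed form $\int_0^1 p^k(1-p)^{n-k}\,dp = k!\,(n-k)!/(n+1)!$ for $k$ heads in $n$ tosses.

```
from itertools import product
from math import factorial

def polya_prob(seq):
    # Probability that the actors announce exactly this sequence,
    # starting from a uniform Beta(1, 1) belief: one pseudo-head and
    # one pseudo-tail, incremented after each announcement.
    h, t, p = 1, 1, 1.0
    for outcome in seq:
        if outcome == 'H':
            p *= h / (h + t)
            h += 1
        else:
            p *= t / (h + t)
            t += 1
    return p

def real_coin_prob(seq):
    # Probability of the same sequence from a real coin whose bias is
    # drawn uniformly: integral of p^k (1-p)^(n-k) over [0, 1].
    k, n = seq.count('H'), len(seq)
    return factorial(k) * factorial(n - k) / factorial(n + 1)

# The two processes agree on every sequence (here, all of length 5).
for seq in product('HT', repeat=5):
    assert abs(polya_prob(seq) - real_coin_prob(seq)) < 1e-12
```

Exchangeability is doing the work here: under both processes, the probability of a sequence depends only on its counts of heads and tails, not on their order.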


Have you noticed the way streetlights shine through the bare branches of trees on cold winter nights? Walking home from work tonight I was struck by the almost overwhelming beauty of the scene. The light creates a perfect, glittering halo around itself, like the moon in a Van Gogh painting. The tree in turn all but reaches its fingers out to grasp the light. If you move your head from side to side, it feels as if a wormhole has opened in the tree, sucking the branches into the light’s force field.

I think this happens because of the Fresnel effect. If the light hits a branch at just the right glancing angle, the wood (like many other materials in the world) becomes unexpectedly shiny. In the brambles that hang off of a tree’s bare branches, the only twigs that catch the light’s glint are the ones that make a specific shallow angle with your eye. By radial symmetry, it’s not hard to see that this should result in a glowing circular halo. It’s similar to how divers see a circular window on the surface of the water when they look up, or the 22° halo you sometimes see around the sun — but here the medium is wood, not water.

If you live in a place that has seasons, I encourage you to go and see this effect for yourself. It’s a hard phenomenon to capture on camera — it requires high dynamic range, high-resolution video, and patience outdoors when it’s cold and dark. For now, here is the best I could do with my old iPhone. (I stabilized the light with a little Python script.)

For comparison, here are two frames from this video:

I knocked over a tall jar of honey. Clumsy, clumsy! But the jar was nearly empty; jars tend to fall when they are nearly empty.

But jars also tend to fall when they are full, which got me thinking: for both full and empty jars, we can argue by symmetry that the center of mass is halfway up. By Rolle’s Theorem, then, at some intermediate fill level the center of mass must have a local minimum or maximum. It’s easy to see that the center of mass can never get higher than halfway up, so this critical point must be a minimum. This suggests that the jar’s stability increases and then decreases as the honey is consumed.

When is the jar most stable? Working in units where the empty jar’s height and mass are 1, its radius is $r$, and honey’s density is $\alpha$, the height of the center of mass is given by: $$ c = \frac{\alpha r^2h(h/2) + 1/2}{\alpha r^2h + 1} $$

Let’s assume honey is extremely viscous (i.e. changes shape slowly), and that the jar tips over if its center of mass is over its rim. Then the maximum angle you can tip the jar before it falls over is: $$ \theta^\star = \tan^{-1}(r/c) $$ The question is then, for a given $r$ and $\alpha$, what $h$ maximizes $\theta^\star$? Intuitively, it seems like it should be somewhere halfway between empty and full.

We can compute the optimal $h$ by differentiating $\theta^\star$ with respect to $h$ and setting the derivative to zero: $$ \frac{d\theta^\star}{dh}=-\frac{r^2}{r^2+c^2}\frac{dc}{dh} = 0 $$ The positive solution to this, thanks to WolframAlpha, is given by: $$ h^\star=\frac{1}{1+\sqrt{\alpha r^2+1}} $$ Does this make sense? As $\alpha r^2$ increases (heavier honey), the optimal height gets lower, which makes sense because the honey begins to have a ballasting effect.

Now we can plug in some numbers. A standard 8-oz honey bottle is about 6 inches tall, has a base radius of about 1 inch, and weighs about 1oz. The density of honey is 0.8oz per cubic inch. In our units, this means the jar’s radius is $1/6$ and the density of honey is $0.8/(1/6)^3$. Plugging this in, we find remarkably that $h^\star \approx 0.3$. So a jar of honey is most stable when only about 1/3 full!
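As a quick numerical check (using the same dimensionless units and the same center-of-mass formula as above), we can sweep over fill heights and compare the best tipping angle against the closed form:

```
import numpy as np

def center_of_mass(h, r, a):
    # Height of the combined center of mass, in units where the empty
    # jar's height and mass are both 1 (the formula derived above).
    return (a * r**2 * h * (h / 2) + 1/2) / (a * r**2 * h + 1)

def tip_angle(h, r, a):
    # Maximum tilt before the center of mass passes over the rim.
    return np.arctan(r / center_of_mass(h, r, a))

def h_star(r, a):
    # Closed-form optimum from the derivative computation.
    return 1 / (1 + np.sqrt(a * r**2 + 1))

r = 1 / 6              # 1-inch radius on a 6-inch jar
a = 0.8 / (1 / 6)**3   # honey density, converted to jar units

hs = np.linspace(0.001, 1, 100_000)
h_numeric = hs[np.argmax(tip_angle(hs, r, a))]
print(h_star(r, a), h_numeric)  # both come out near 0.29
```

Maximizing the tipping angle is the same as minimizing the center of mass, so the brute-force sweep and the closed form should (and do) agree.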

Suppose you are cooking tofu, cut into tasty 1-inch cubes. On your stove it takes each cube-face 10 minutes to cook. You could cook each of the 6 faces of each cube in one hour (60 minutes) by rotating the cube to an uncooked face every 10 minutes, but this is clearly tedious and suboptimal.

A better idea is stirring: instead of carefully rotating each cube to an uncooked face, you can vigorously stir your pan and reorient each cube uniformly at random (like rolling dice).

The question I want to consider today is, “how often should you stir your tofu cubes”? For example, you might choose to stir them every 10 minutes for an hour, for a total of 6 stirs including right at the beginning. In that case, you might get lucky and have every face exposed to heat once. However, because each stir is an independent random roll, the more likely scenario is that some face gets cooked twice or more (burnt!) and other faces don’t get any heat (raw!). **Intuitively, it seems like we should stir more frequently to “even out” the heat.** Let’s check this intuition.

Suppose we stir $n$ times over the course of the hour, and we are willing to tolerate being off by up to $\delta=5\%$ of optimal cookedness: that is, we consider the tofu raw before 9 minutes and 30 seconds of heat, and burnt after 10 minutes and 30 seconds. Let $f(n)$ be the proportion of burnt or raw faces, in expectation. We want to see how $f(n)$ depends on $n$.

We can compute $f(n)$ many ways. Most obviously, we can simulate the tofu system with some large number of cubes and see what happens.

```
import numpy as np
from matplotlib import pyplot as plt

mean = 1/6
tol = 0.05

def experiment(n):
    N = 1_000
    rolls = np.random.multinomial(n, np.ones(6)/6, size=N)
    toasts = rolls / n
    burns = ((np.abs(toasts - mean) >= tol * mean) * 1.0).mean()
    return burns
```

It is also not hard to work this out analytically. First, by linearity of expectation we can just look at one face, which acts like a binomial random variable: for each of the $n$ stirs, there is a $1/6$ probability of getting heat. We can subtract the CDFs at the tolerance limits $\mu\cdot(1\pm\delta)$ to compute the proportion of times the face is well-cooked, and then subtract from 1 to get the proportion of bad faces.

```
import scipy.stats

def analytical(n):
    return 1 - (
        scipy.stats.binom.cdf((1 + tol) * (n / 6), n, 1/6) -
        scipy.stats.binom.cdf((1 - tol) * (n / 6), n, 1/6)
    )
```

We can also use statistics tricks to approximate the answer in useful closed forms. For example, we can take a normal approximation to the binomial as $n \rightarrow \infty$. The mean is $\mu=1/6$, of course, and the variance is given by $\sigma^2 = p(1-p)/n = \frac{5}{36n}$. We can then plug those parameters into the normal CDF, along with the tolerance limits as above.

```
def normal(n):
    vari = 1/6 * (1 - 1/6) / n
    z = (tol * mean) / np.sqrt(vari)
    return 1 - (scipy.stats.norm.cdf(z) - scipy.stats.norm.cdf(-z))
```

Finally, we can apply Chebyshev’s inequality or a Chernoff bound. These formulas only provide an upper bound on the proportion of bad faces, but they don’t depend on a normal approximation and are thus guaranteed to be true. I won’t work through the derivations here.

```
def chebyshev(n):
    vari = 1/6 * (1 - 1/6) / n
    z = (tol * mean) / np.sqrt(vari)
    return np.minimum(1., 1 / z ** 2)

def chernoff(n):
    return np.minimum(1., 2 * np.exp(-(tol ** 2) / 3 * (1/6) * n))
```

Plotting all of these metrics out to asymptotic $n \rightarrow \infty$ (see the graph below), we see several expected patterns:

- As $n$ increases, the proportion of burnt faces goes down with exponential falloff.
- The analytical solution closely tracks the simulation, at least until the variance of the simulation gets high enough to make it unreliable.
- The normal approximation is indistinguishable from the analytical solution.
- The bounds are quite loose. Chebyshev “wins” for a little while, but ultimately Chernoff’s exponentiality kicks in and dominates.

So far, so good: it seems like you can get quite well-cooked tofu with only a logarithmic amount of stirring effort!

```
ns = np.arange(1, 30000, 100)
plt.plot(ns, [experiment(int(n)) for n in ns], 'x', label='Simulation')
plt.plot(ns, [analytical(int(n)) for n in ns], '-', label='Analytical')
plt.plot(ns, [normal(int(n)) for n in ns], '--', label='Normal approximation')
plt.plot(ns, [chebyshev(int(n)) for n in ns], '-', label='Chebyshev\'s inequality')
plt.plot(ns, [chernoff(int(n)) for n in ns], '-', label='Chernoff bound')
plt.yscale('log')
plt.ylabel('P(burnt or raw face)')
plt.xlabel('n, the number of tosses per hour')
plt.legend()
```


But asymptotic $n$ is unreasonable in this setting: for example, 6000 flips per hour (6 kfph) is more than one flip per second. Let’s zoom in on small $n$ to see what is happening at the scale of the real world. In this plot, I’m hiding the two bounds (which are all maxed out at 1) and only showing up to $n=600$, which is a stir every 6 seconds.

```
ns = np.arange(1, 600)
plt.plot(ns, [experiment(int(n)) for n in ns], 'x', label='Simulation')
plt.plot(ns, [analytical(int(n)) for n in ns], '.', label='Analytical')
plt.plot(ns, [normal(int(n)) for n in ns], '--', label='Normal approximation')
plt.ylabel('P(burnt or raw face)')
plt.xlabel('n, the number of tosses per hour')
plt.legend()
```


A curious pattern emerges! Here is what I notice:

- The analytical solution (orange dots) continues to track the simulation (blue crosses).
- However, they both diverge from the normal approximation (which is perhaps as we expect at low $n$).
- This divergence is systematic. The analytical probabilities form an interesting “banded” structure.
- Most surprisingly, sometimes increasing $n$ *increases* the number of bad faces!
- To do better than $n=6$, you have to go all the way out to $n\approx 500$.

The puzzle is, **Why does more stirring cause worse cooking?**

Let’s consider the cases $n=6$ and $n=12$. For $n=6$, we already reasoned that a face is well-cooked only if it is exposed to heat for exactly one of the 6 turns. This has probability $\binom{6}{1}(1/6)^1(5/6)^5 \approx 0.402$, and indeed the graph above shows that analytically, we expect about a $0.6$ probability of a face getting burnt or staying raw at $n=6$.

Now consider $n=12$. A face is definitely well-cooked if it gets 2 turns on the heat, because $2/12 = 1/6 = \mu$. But what about 1 turn or 3 turns? In either case, the relative deviation from $\mu$ is $|1/12 - 1/6|/(1/6) = 0.5 \geq \delta$, so the face would either be raw or get burnt. Hence, the probability of a well-cooked face is $\binom{12}{2}(1/6)^2(5/6)^{10} \approx 0.296$, which yields a probability of $0.704$ that the face is raw or burnt. Higher than the $0.6$ at $n=6$!
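Both of these binomial calculations are easy to verify numerically; here is a one-off check (not part of the analysis above):

```
from math import comb

# P(exactly 1 of 6 turns on the heat) and P(exactly 2 of 12 turns):
# the only well-cooked outcomes at n = 6 and n = 12, respectively.
p_good_6 = comb(6, 1) * (1/6)**1 * (5/6)**5
p_good_12 = comb(12, 2) * (1/6)**2 * (5/6)**10
print(round(p_good_6, 3), round(p_good_12, 3))  # 0.402 0.296
```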

Another way to see what’s going on is to consider 2-faced pancakes instead of 6-faced tofu cubes. If you cook a pancake with $n=2$ random flips, you have a good chance that you will cook both sides (one on each flip), though of course there is some chance you burn one side by cooking it twice, and leave one side uncooked. But if you cook a pancake with $n=3$ flips, you will necessarily always burn one side and leave the other side uncooked, because at best you will get a 1/3-2/3 ratio of cooking time.

Returning to cubes now, let’s see what happens more generally. Say a face gets $X$ turns on the heat, where $X \sim \text{Binom}(n, 1/6)$. The probability of being well-cooked is $\Pr[|X/n-\mu| \leq \delta\mu]$ where $\mu=1/6$. Breaking this up, we can sum over possible outcomes of $X$ to get $\sum_{(1-\delta)(1/6) \leq x/n \leq (1+\delta)(1/6)} \Pr[X=x]$. In other words, we are summing over integer multiples of $1/n$ in the range $[(1-\delta)(1/6), (1+\delta)(1/6)]$. How many integer multiples are there in that range? The count is given by $\lceil (1/6)(1+\delta)/(1/n) \rceil - \lfloor (1/6)(1-\delta)/(1/n) \rfloor - 1$. We can plot this against the analytical probability to get some insight.

```
ns = np.arange(1, 100)
plt.plot(ns, [analytical(int(n)) for n in ns], '.', label='Analytical')
plt.plot(
    ns,
    (np.ceil(ns * 1/6 * (1 + tol)) -
     np.floor(ns * 1/6 * (1 - tol)) - 1),
    label='Number of integer multiples of 1/n in range'
)
plt.xlabel('n, the number of tosses per hour')
plt.legend()
```


The spikes in the number-of-integers graph correspond directly to the jumps between bands. This suggests the following explanation: there are two “forces” at play.

- The number of integer multiples of $1/n$ that fall within the range allowed by the tolerance, which jumps around in a quantized manner according to number theory.
- The probability of matching each of those integer multiples, which decreases as we add more flips, because the probability mass spreads over more possible outcomes.

When the first “force” spikes, it causes the probability of a bad face to drop suddenly because there are more ways to cook a good face. However, the second “force” causes the probability of a bad face to rise gradually because each way to make a good face becomes less likely. This explains the banding structure in the graph above.

To summarize, in the limit, continuous stirring indeed helps cook your tofu more evenly. However, if you are stirring only occasionally, then sometimes more stirring can actually harm your tofu!

This summer I was briefly in Vancouver, and after dinner on my last day I found myself walking several blocks back to my hotel. The night was dark but the sidewalk was brightly lit and it was a lovely journey. Along the way, I took this picture:

What struck me is how the streetlight casts two shadows, blue and yellow, and moreover those shadows appear opposite the yellow and blue lamps, respectively. What could possibly be going on?

Here is an explanation: the two lamps together create a kind of grayish-white light that bathes the sidewalk. Where the yellow lamp is occluded, the blue light is dominant, so the shadow is blue. Similarly, where the blue lamp is occluded, the yellow light is dominant, so the shadow is yellow.

Looking at this scene I’m reminded of painter Wayne Thiebaud’s rich, saturated shadows. You could say the perceptual effect here demonstrates that “white light” is the sum of all wavelengths, a fact we learn in grade school (I think there is an exhibit at the SF Exploratorium with a similar concept). But to me, this also demonstrates the range of what we are willing to call “white.” If one of the lamps were to burn out, our eyes would adjust to the blue or yellow almost immediately, and we would still see the sidewalk as gray — we experience a truly remarkable “color constancy” across lighting conditions. In this way, when he paints a shadow as a saturated, non-gray color, Thiebaud sees beyond his own seeing.

The other day my toilet broke and I involuntarily learned a lot about how flushing works. My friend suggested an analogy for me: flushing toilets is like a neuron’s activation: once a critical threshold is met, there’s an “all-at-once” response.

That got me thinking: could we implement deep neural networks in plumbing? It turns out, the answer is yes! A very simplified model of a flush toilet’s nonlinear behavior is as follows: it’s a bucket, into which water can be poured, and there is a hole at height $h \geq 0$. If you pour a volume $v$ of water into the bucket, the output that flows out of the hole is $\text{ReLU}(v - h)$.

The second component we need to build a neural network is a linear map. We can get one by attaching a branching pipe to the hole. This component will have $k$ branches with cross-sectional areas $A_1, A_2, \dots, A_k > 0$. By conservation of mass and a simple pressure argument, the fraction of the water that pours out of branch $i$ is $A_i / \sum_j A_j$.

Together, these components allow us to compute a function from $\mathbb{R}\rightarrow \mathbb{R}^k$, which looks something like $\text{PeLU}(v, \vec{A}, h) = \text{ReLU}(v - h)\cdot \vec{A} / \sum_j A_j$. Here, “PeLU” stands for “Porcelain-Emulated Linear Unit.” It is clear how to vectorize this expression over $v$, which effectively creates a new kind of neural network “layer” with trainable parameters $\vec{A}$ and $h$ for each input dimension. To enforce the positivity constraint on $h$ and $A_i$, we will actually work with the following key equation: $\text{PeLU}(v, \vec{A}, h) = \boxed{\text{ReLU}(v - h^2) \cdot \text{softmax}(\vec{A})}$.

All that is left to do at this point is to implement this in PyTorch and train it.

```
import torch

class PeLU(torch.nn.Module):
    def __init__(self, in_feat, out_feat):
        super().__init__()
        self.heights = torch.nn.Parameter(torch.randn(in_feat))
        self.weights = torch.nn.Parameter(torch.randn(in_feat, out_feat))

    def forward(self, X):
        X = torch.nn.functional.relu(X - self.heights ** 2)
        X = X.matmul(self.weights.softmax(dim=1))
        return X
```

Here, I built a PeLU layer that can be slipped into any PyTorch model, mapping `in_feat` inputs to `out_feat` outputs. Next, let’s stack some PeLU layers together and train the result on the classic “Iris” dataset, which has 4 features and assigns one of 3 labels. We will create a “hidden layer” of size 3, just to keep things interesting.

```
from iris import feat, labl

m = torch.nn.Sequential(
    PeLU(4, 3),
    PeLU(3, 3)
)
o = torch.optim.Adam(m.parameters(), lr=0.01)
lf = torch.nn.CrossEntropyLoss()
for i in range(10_000):
    o.zero_grad()
    pred = m(feat * 10)
    loss = lf(pred, labl)
    loss.backward()
    o.step()
print('Loss:', loss)
print('Error:', 1. - ((torch.argmax(pred, dim=1) == labl) * 1.).mean())
```

This trains very quickly, in seconds, and gives an error of 2%. Of course, we haven’t split the dataset into train and test sets, so we may be overfitting.

By the way, you may have noticed that I multiplied `feat` by 10 before passing it to the model. Because of the conservation of mass, the total amount of water in the system is constant. But each bucket “loses” some water that accumulates below the hole. To make sure there’s enough water to go around, I boosted the total amount.

But that’s all that needs to be done! Once we export the parameters and write a small visualizer…

Isn’t that a delight to watch? I may even fabricate one to have as a desk toy — it shouldn’t be hard to make a 3D-printable version of this. If we wanted to minimize the number of pipes, we could add an L1 regularization term that encourages sparsity in the $\vec{A}$ terms.

Some final thoughts: this system has a couple of interesting properties. First, there’s a kind of “quasi-superposition” that it allows for: if you pour in more water on top to incrementally refine your input, the output will automatically update. Second, the “conservation of mass” guarantees that the total water output will never exceed the total water input. Finally, it’s of course entirely passive, powered only by gravity.

This has me wondering if we can build extremely low-power neural network inference devices by optimizing analog systems using gradient descent in this way (a labmate pointed me to this, for example).

Below is a little widget you can use to enjoy playing with PeLU networks. All inputs must be between 0 and 10. :)


Speculation on what (other than its own frail body) that bird-in-the-hand might signify has always been attractive to me, but especially so now thinking, as I have been, about the work I do that has brought me to this company. (full)

And just because you want to paint a bird, do actually paint a bird, it doesn’t mean you’ve accomplished anything. (full)

Waite: What do you call this?
Steichen: I use the same term the sculptor did, oiseau, a bird.
Waite: What makes you call it a bird, does it look like a bird to you?
Steichen: It does not look like a bird but I feel that it is a bird, it is characterized by the artist as a bird.
Waite: Simply because he called it a bird does that make it a bird to you?
Steichen: Yes, your honor.
Waite: If you would see it on the street you never would think of calling it a bird, would you?

[Steichen: Silence]

Young: If you saw it in the forest you would not take a shot at it?
Steichen: No, your honor. (more)

But who knows what Fabritius intended? There’s not enough of his work left to even make a guess. The bird looks out at us. It’s not idealized or humanized. It’s very much a bird.

And then there is this fourth level, which is a whole new object in the world: the prop made for the movie, the representative of the thing, becomes, in its own right, a whole other thing, a whole new object of desire. . . . There are several people who own originals, and I have been attempting to contact them and reach them, hoping that they will let me spend a few minutes in the presence of one of the real birds, maybe to take a picture, or even to pull out the hand-held laser scanner that I happen to own that fits inside a cereal box, and could maybe, without even touching their bird, I swear, get a perfect 3D scan. And I’m even willing to sign pages saying that I’ll never let anyone else have it, except for me in my office, I promise. I’ll give them one if they want it. And then, maybe, then I’ll achieve the end of this exercise. But really, if we’re all going to be honest with ourselves, I have to admit that achieving the end of the exercise was never the point of the exercise to begin with, was it? (full)

Plate 357 (Numenius Borealis) is the only instance in which the subject appears dead in the work of John James Audubon.

He waits alone, sketching angels from the shade—

a kind of heavenly bird, he reasons with himself—although their wings are broke, faces scarred, each fragile mouth feigning the same sad smile as the one before it. Offered triple his price to paint a likeness of the pastor’s daughter—buried for more than a week—he reluctantly agreed—times being what they are.

. . . . .

And yet he studies it—from behind the dunes—studies its several postures, grounded and in sudden flight—and not content to praise it from a distance, to sacrifice detail, unpacks his brushes and arranges them before raising his rifle and taking aim.

(Added April 18, 2023)

Zizzy, uncanny, pebble-tapping, lusty, pule. Ventriloquial, tantara, feminine, crepitate. Tintinnabulation, sough, devil’s tattoo. Sparrowy.

(Added June 2023)

By Kobayashi Issa; mentioned by a professor on a walk by the Charles River.

Children imitating cormorants

are even more wonderful

than cormorants.

(Added June 2023)

From Philip K. Dick’s speech, “How to build a universe.”

In my writing I got so interested in fakes that I finally came up with the concept of fake fakes. For example, in Disneyland there are fake birds worked by electric motors which emit caws and shrieks as you pass by them. Suppose some night all of us sneaked into the park with real birds and substituted them for the artificial ones. Imagine the horror the Disneyland officials would feel when they discovered the cruel hoax. Real birds! And perhaps someday even real hippos and lions. Consternation.

(Added June 2023)

From Eric Kraft’s *Where do we stop?*

The idea that seemed so bright when it was leaping and darting and fluttering through my mind looked dull and dead when I’d caught it and pinned it to my paper … It wasn’t an idea now, but the representation of an idea. It didn’t fly, didn’t flutter by, didn’t catch the eye as I thought it would. (154)

(Added June 2023)

From CAKE’s Mr. Mastodon Farm

Now due to a construct in my mind

That makes their falling and their flight

Symbolic of my entire existence

It becomes important for me

To get up and see

Their last-second curves toward flight

It’s almost as if my life would fall

Unless I see their ascent.

See also: the poem “Because You Asked about the Line Between Prose and Poetry“ by Howard Nemerov.

Sparrows were feeding in a freezing drizzle

That while you watched turned to pieces of snow

Riding a gradient invisible

From silver aslant to random, white, and slow.

There came a moment that you couldn’t tell.

And then they clearly flew instead of fell.

From an interview with illustrator David Sibley.

An illustration provides so much more than a photograph. In an illustration, I can create a typical bird, an average bird of a species in the exact pose that I want, and create an image of a similar species in exactly the same pose so that all the differences are apparent… Your drawing becomes a record of your understanding of that bird in that moment.

(See Edgerton’s photograph of Mrs. May Rogers Webster with her hummingbirds.)

A poem by Raymond Carver

A crow flew into the tree outside my window.

It was not Ted Hughes’s crow, or Galway’s crow.

Or Frost’s, Pasternak’s, or Lorca’s crow.

Or one of Homer’s crows, stuffed with gore,

after the battle. This was just a crow.

That never fit in anywhere in its life,

or did anything worth mentioning.

It sat there on the branch for a few minutes.

Then picked up and flew beautifully

out of my life.

Hello and welcome to the 198th Carnival of Mathematics, a roving monthly roundup of mathy blog posts from around the Internet! Longtime Comfortably Numbered readers should be no stranger to the Carnival: I hosted #134, #148, and #159 in the past.

As is traditional, I want to start by thinking a bit about the number 198. It is actually a very dear number to me — I even own a “198” t-shirt! Why, you ask? The number 198 is the emblem of the wonderful CS198 program at Stanford, which hires a huge team of undergraduates to help teach the introductory computer science courses every year. (As far back as 1995, the faculty running the program published a retrospective on its impact on campus.)

Here is another fun fact about the number 198: suppose I start looking at all the integers ($1, 2, 3, 4, 5, \dots$), keeping only the ones that are perfect powers ($1, 4=2^2, 8=2^3, 9=3^2, 16=2^4, \dots$), and then taking their cumulative averages ($1$, $2.5=(1+4)/2$, $4.33=(1+4+8)/3$, $5.5=(1+4+8+9)/4$, $7.6=(1+4+8+9+16)/5, \dots$). You might wonder, are there any other integers in this sequence? It turns out, *yes!* And I bet you can guess the first one…
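If you’d rather compute than guess, a few lines of Python find it (the search bound of 1,000 is an arbitrary but sufficient cutoff):

```
# Cumulative averages of the perfect powers 1, 4, 8, 9, 16, ...,
# searching for the first integer average bigger than the trivial 1.
powers = sorted({a**b for a in range(1, 32) for b in range(2, 11) if a**b <= 1000})
total, answer = 0, None
for n, p in enumerate(powers, start=1):
    total += p
    if total % n == 0 and total // n > 1:
        answer = total // n
        break
print(answer)  # 198
```

Sure enough, the average of the first 33 perfect powers is exactly 198.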

Now, time for some links! I’ve provided some “teaser” text from each submission to get you interested.

The BRSR blog asks: can you find a Euclidean triangle on a non-Euclidean surface?

A question that came up in a math chatroom (yes, I’m the kind of nerd who spends time in math chatrooms): find a “Euclidean” triangle on a non-Euclidean surface. More exactly, find a geodesic triangle on a surface with non-constant Gaussian curvature having the same sum of interior angles as an Euclidean triangle.

On his blog bit-player, Brian Hayes asks: why do pandemics peak in distinct waves as the months go by?

I’m puzzled by all this structural embellishment. Is it mere noise—a product of random fluctuations—or is there some driving mechanism we ought to know about, some switch or dial that’s turning the infection process on and off every few months?

I have a few ideas about possible explanations, but I’m not so keen on any of them that I would try to persuade you they’re correct. However, I *do* hope to persuade you there’s something here that needs explaining.

On Twitter, Thien An asks: would you rather have a bishop and a knight, or two bishops?

To get this figure, it suffices to grab 100k games and only keep the positions with BB+pawns v.s. BN+pawns and see who eventually wins. That’s very simplistic since there are indeed many other important positional features, but I was actually expecting less convincing results…

Looking at a video of a pomegranate sorting mechanism, Nisar Khan asks: how much do the pomegranate sizes vary between sorted buckets?

After watching the below video, thought of finding the approximate size differences of pomegranates sorted in different boxes…

On his math blog, Tony asks: did studying maths help Emma Raducanu win the US Open?

I’ve discussed toy examples in public lectures on game theory, which (it seems to me) is relevant to choices players make - whether to serve to the forehand or backhand, and where to expect for your opponent to serve, for example. I very much doubt if players ever analyse in these terms, but they are intuitively doing game theory when making their tactical decisions.

But I think in the case of Raducanu there is a more general point…

On Risen Crypto, Trajesh asks: how does the quadratic sieve factoring algorithm really work?

The Quadratic Sieve is the second fastest algorithm for factoring large semiprimes. It’s the fastest for factoring ones which are lesser than 100 digits long.

On The Universe of Discourse, Mark Dominus asks: why is the “S” combinator named “S”? He also posted a nice puzzle that is easily solved if you happen to know Dilworth’s theorem; separately, Jim Fowler built a musical PICO-8 game where you race to find the chains guaranteed by Dilworth’s theorem.

I thought about this for a while but couldn’t make any progress. But OP had said “I know I have to construct a partially ordered set and possibly use Dilworth’s Theorem…” so I looked up Dilworth’s theorem.

On their blog “Gödel’s Last Letter and P=NP,” Ken Regan and Dick Lipton discuss “baby steps” in math, in the context of some exciting recent results.

Reckoned against a later paper by Cohn and Elkies, [Viazovska’s] improvement was 0.0000…000001. The recent post where we discussed the phrase “the proof is in the pudding” involves a number with six more zeroes than that. These are *not* what we mean by “baby steps.”

And finally, something more about sequences— on her math blog, Tanya Khovanova explores the top 50 largest numbers to start OEIS sequences.

My son, Alexey Radul, wrote a program that finds the largest numbers to start a sequence in the Online Encyclopedia of Integer Sequences (OEIS). To my surprise, the top ten are all numbers consisting of ones only.

That’s all for this month’s Carnival! What a joy to see — once more — the math blog world *overflowing* with curiosity. Hosting the Carnival of Mathematics truly is a delight.

Want even more math? The previous Carnival, the 197th, was hosted by Jeremy at Math $\cap$ Programming, and the next — the 199th, to which you can contribute here, starting now! — will be hosted by Vaibhav at DoubleRoot. See you there!

Surely *something* will happen — which means that if you integrate over all
possible outcomes, the probabilities must add up to 1. This elementary fact
remains true, even if the probabilities of the outcomes change for some reason.
Are you reminded of something? *Fluids* behave in the same way: no matter how
the particles are distributed, the total quantity of fluid is conserved, which
means that if you integrate over all of space, the masses must add up to 1.
Indeed, when we talk about *probability density functions* and *probability
mass functions*, we are borrowing fluidic vocabulary.

I get excited about analogies like this, where the physics of one domain
promises to unexpectedly deliver insight into a different domain. Another
example of this is how spring constants and capacitances play the same
mathematical role in the differential equations that model their respective
mechanical and electrical systems. Harry Olson’s wonderful book *Dynamical
Analogies* shows some more
beautifully-illustrated examples of mechanical, electrical, and acoustic
systems with identical dynamics.

Can we get any mileage out of the probability-fluid analogy? It turns out,
*yes!* We can apply our physical intuition about fluids to design a clever
statistical machine learning algorithm. That’s the subject of this piece. By
the way, you can undoubtedly find better (and more correct) expositions of this
math in other places; my goal here is to emphasize the interaction between
physical intuition and probabilistic applications, bridged so beautifully by
the fluidic analogy.

A little bit about the problem setup: sometimes, when working with data, you find yourself in a place where you want to sample from a probability distribution $p(x)$. This might not be so bad if you can compute $p(x)$ exactly — for example, you can use rejection sampling or something. Often, however, you can only compute some scaled quantity $q(x)$ such that $p(x) \propto q(x)$ with some unknown constant of proportionality. Commonly, this happens in Bayesian inference, where you may want to sample from $\pi(x \mid y) = \pi(x,y)/\pi(y)$ where $y$ might be data you’ve seen and $x$ might be the parameters to your model. Here, the problem is that the normalizing constant $\pi(y)=\int \pi(x,y)dx$ is hard to compute because that integral is intractable in high dimensions.
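To make the easy case concrete, here is a minimal rejection sampler for the situation where $p(x)$ *is* computable exactly and bounded on a known interval. The triangular density and bounds below are illustrative choices, not anything from this post:

```python
import random

def rejection_sample(p, lo, hi, p_max, rng=random):
    """Draw one sample from density p supported on [lo, hi], where p(x) <= p_max."""
    while True:
        x = rng.uniform(lo, hi)            # propose uniformly on the support
        if rng.uniform(0, p_max) < p(x):   # accept with probability p(x) / p_max
            return x

# Example: sample from the triangular density p(x) = 2x on [0, 1].
random.seed(0)
samples = [rejection_sample(lambda x: 2 * x, 0.0, 1.0, 2.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)         # true mean of this density is 2/3
```

Note that this needs both an exact $p(x)$ and a usable bound $p_{\max}$ — exactly the luxuries we lack in the Bayesian setting described above.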

One famous way to do this kind of sampling is by the Metropolis-Hastings
algorithm, which approximates $p(x)$ with the stationary distribution of a Markov
chain. The “miracle” of the algorithm is that the transition probabilities
between two states only depend on the *ratios* of the probability densities at
those states — the unknown constant simply cancels out!
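Here is a sketch of random-walk Metropolis illustrating that cancellation; the unnormalized Gaussian target and step size are illustrative choices:

```python
import math
import random

def metropolis(log_q, x0, n_steps, step=1.0, rng=random):
    """Random-walk Metropolis: needs only the *unnormalized* log-density log_q."""
    x, chain = x0, []
    for _ in range(n_steps):
        x_new = x + rng.gauss(0.0, step)
        # Accept with probability min(1, q(x_new)/q(x)): the ratio of densities
        # means the unknown normalizing constant cancels out.
        if math.log(rng.random()) < log_q(x_new) - log_q(x):
            x = x_new
        chain.append(x)
    return chain

# Unnormalized standard Gaussian: q(x) = exp(-x^2/2), no 1/sqrt(2*pi) required.
random.seed(0)
chain = metropolis(lambda x: -x * x / 2, 0.0, 200_000)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
```

The chain's empirical mean and variance approach 0 and 1, as they should for the standard Gaussian, even though the code never sees the normalizing constant.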

But now let’s think about this from a fluids perspective. Have unknown normalizing constants popped up in physics anywhere? Yes! The partition function $Z$ in the Boltzmann distribution is often unknown. Recall that a system at thermodynamic equilibrium at temperature $T$ has probability $p_i \propto e^{-U_i/k_BT}$ of being in state $i$, where $U_i$ is the potential energy of that state and $k_B$ is a physical constant that you can find, e.g. on Wikipedia. A curious application of this distribution is working out how thin the atmosphere is at altitude $h$ by reasoning about $U_h$ being the gravitational potential energy at altitude $h$: because $U$ is linear in $h$, we can instantly conclude that atmospheric density decreases exponentially in altitude.

But returning to machine learning, here is the insight that the fluidic analogy makes possible: if we set $U(x) = -\log q(x)$ then a particle simulated at some “temperature” $T=1/k_B$ in potential $U$ would eventually have probability $e^{-U(x)}=p(x)$ of being at $x$. So, we can use what we know about the dynamics of particles that are a part of a fluid at thermal equilibrium, to sample from probability distributions that may have nothing at all to do with fluids.

What exactly does it mean to “simulate a particle” of this probability fluid? You can take those words pretty literally, as it turns out. There are two sources of motion at each timestep: $U$ and $T$. Given potential $U(x)$, we can derive the force $-\nabla U$ that acts on our particle, and take the particle’s velocity to be proportional to that force. (Why velocity rather than acceleration? The particle is repeatedly being bumped around by other particles and having its momentum reset — this is the *overdamped* regime — so there’s no need to track acceleration, and velocity is roughly independent between timesteps.) By the way, notice that $\nabla \log q(x)$ doesn’t depend on the unknown normalization constant, because the logarithm turns it into an additive constant, which disappears under the derivative.

As for the effect of $T$, we can model the thermal motion of the particle as Brownian motion: that is, at each time, the particle gets bumped to some random neighboring position drawn independently from a Gaussian that is somehow scaled with respect to “temperature” $T$. (How do we translate $T=1/k_B$ to this un-physical world where “joules” and “kelvins” make no sense? Stay tuned…)

This fluidic analogy leads us to what people call the “Unadjusted Langevin Algorithm.” The algorithm is quite simple to state and implement:

- Initialize some $x_0$.
- Using some small step size $\Delta t$, integrate $\Delta x = \Delta t \cdot \nabla\log q(x_t) + \sqrt{2\Delta t} \cdot \xi_t$, where you sample $\xi_t \sim N(0,1)$; the Greek letter $\xi$ is chosen for maximum squiggliness, in order to act as a visual metaphor for its function. Importantly, this step only requires you to evaluate the unnormalized density $q$!
- That’s it — under some (not *too* mild, but still reasonable) assumptions, snapshots of the trajectory $\{x_t\}$ are distributed according to $p(x)$!
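The steps above can be sketched in a few lines of Python; the Gaussian target, starting point, and step size below are illustrative choices:

```python
import math
import random

def ula(grad_log_q, x0, n_steps, dt=0.01, rng=random):
    """Unadjusted Langevin Algorithm: x += dt * grad(log q)(x) + sqrt(2*dt) * xi."""
    x, traj = x0, []
    for _ in range(n_steps):
        x = x + dt * grad_log_q(x) + math.sqrt(2 * dt) * rng.gauss(0.0, 1.0)
        traj.append(x)
    return traj

# Target: unnormalized q(x) = exp(-x^2/2), so grad(log q)(x) = -x.
random.seed(0)
traj = ula(lambda x: -x, x0=3.0, n_steps=200_000)
burned = traj[10_000:]                      # discard burn-in
mean = sum(burned) / len(burned)
var = sum((x - mean) ** 2 for x in burned) / len(burned)
```

For this target the snapshots should look like draws from the standard Gaussian: empirical mean near 0, variance near 1 (up to a small discretization bias of order $\Delta t$).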

Are you convinced? Here’s a sanity check: without the stochastic component $\sqrt{2\Delta t}\cdot\xi_t$ at each step, $x_t$ deterministically climbs toward a local maximum of $q$ (the MAP estimate, in the Bayesian setting) — that is, without the stochastic component, we’d have invented gradient descent on the potential $U = -\log q$! The stochasticity, however, makes $x_t$ a random variable instead.

The rest of this piece is dedicated to answering the question: where does that stochastic coefficient $\sqrt{2\Delta t}$ come from? You might have already worked out that it’s the magnitude of the thermal motion, which depends on $T$. If that coefficient were too small, then, as we just discussed above, we simply have gradient descent: our particle is analogous to a marble rolling down a (syrupy — overdamped!) hill. If the coefficient were too large, then thermal motion would dominate and the “signal” from the potential $U$ would stop mattering. So, clearly, the threshold $\sqrt{2\Delta t}$ is “special” in some way.

First let me convince you that we want this coefficient to indeed be
proportional to $\sqrt{\Delta t}$, even though it looks dimensionally wrong
when placed in a sum right next to the $\Delta t\cdot\nabla \log q$ term.
What’s going on? It turns out that this “lots of little Gaussian bumps” motion
doesn’t have an intuitive integral calculus. In particular, for Gaussians, it’s
the *variance* that adds, not the standard deviation. So in time $t$ the
variance of the total distance traveled would be $\sum (\sqrt{\Delta t})^2 =
\sum \Delta t = t$, and therefore the standard deviation would be $\sqrt{t}$,
which is exactly what we want for the algorithm to be invariant to how small
$\Delta t$ is. (For more on this, you could read about “Wiener processes.”)
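We can check this scale-invariance numerically: simulate Brownian paths with two different step sizes and confirm that the endpoint spread depends only on the total time $t$. The durations and path counts below are illustrative:

```python
import math
import random

def brownian_endpoint_std(t, dt, n_paths, rng=random):
    """Simulate n_paths Brownian paths of duration t, built from Gaussian
    increments with standard deviation sqrt(dt); return the endpoint std."""
    n_steps = round(t / dt)
    ends = []
    for _ in range(n_paths):
        x = 0.0
        for _ in range(n_steps):
            x += rng.gauss(0.0, math.sqrt(dt))
        ends.append(x)
    m = sum(ends) / len(ends)
    return math.sqrt(sum((e - m) ** 2 for e in ends) / len(ends))

random.seed(0)
# Regardless of step size, the endpoint spread is about sqrt(t) = 2.
coarse = brownian_endpoint_std(t=4.0, dt=0.1, n_paths=5_000)    # 40 steps/path
fine = brownian_endpoint_std(t=4.0, dt=0.01, n_paths=5_000)     # 400 steps/path
```

Both estimates land near $\sqrt{4} = 2$: halving $\Delta t$ doubles the number of bumps but shrinks each one, and the variances sum to the same total.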

Now for the “2.” To derive this value, let’s return to our fluidic analogy. Suppose we wanted to compute the change in probability density at a point $x$ over time, that is, ${\partial p(x,t)}/{\partial t}$, with the understanding that we will set this time derivative to zero later to reason about the system’s behavior at equilibrium. By following the fluidic analogy, we can hallucinate a “probability flow” $\vec{J}(x, t)$ and assert, by the divergence theorem (this is just the continuity equation of fluid dynamics), that ${\partial p}/{\partial t} = -\nabla \cdot \vec{J}$.

To measure $\vec{J}(x, t)$, we should think about what’s flowing; that is, about “particles,” whose dynamics is given by the Langevin update rule for $\Delta x_t$ I wrote out above. The deterministic part is easy to deal with: from a tiny region of volume $\Delta V$, a mass $p\Delta V$ of particles leaves with velocity $\nabla\log q$, so the flux is $p\nabla \log q$.

For the stochastic part, notice that the dynamics of the thermal motion are spatially invariant, i.e. independent of $x$. So, whatever amount diffuses out of $\Delta V$ is proportional to $p$, and whatever amount diffuses in is proportional to the density in the neighborhood of $\Delta V$: this means the net flow is proportional to $-\nabla p$. When we talk about fluids, this observation is also known as Fick’s Law, and the constant of proportionality is called the diffusion coefficient $D$.

How can we compute $D$? Looking at only the stochastic part, the flux is
$-D\nabla p$, so $\partial p/\partial t = D\nabla^2 p$, which we recognize as
the heat equation. That is,
for this brief aside, we’re going to think of $p$ representing — not
probability over Bayesian model parameters *or* fluid density in a potential
well, but — heat in a block of metal — a bonus analogy!

Thankfully, the solution to the heat equation turns out to be well-known: it is
a time-varying Gaussian. And it better be! After all, the stochastic diffusion
is just a sum of lots of independent little Gaussian-sampled steps of variance
$\Delta(\sigma^2) = {2\Delta t}$, which together form a big Gaussian of
variance $\sigma^2 = 2t$, as discussed above. But let’s say we didn’t know the
coefficient was 2. Let’s set it to $a$ instead. Now, if we know that $p(x,t) =
e^{-x^2/(2at)}/\sqrt{2\pi(at)}$, then we can literally take the partial
derivatives on both sides of the heat equation and expand to solve for $D$. A
miracle occurs, and many things cancel, leaving us with simply $D=a/2$. It’s
not worth writing out the algebra here: I encourage you to try it yourself,
since it is that rare combination of straightforward and satisfying. But
furthermore, it gives some mechanical (if not physical) insight into the “2”
that appears in the denominator of the result: algebraically, the $-1/2$ power
of $at$ in the normalization constant $1/\sqrt{2\pi(at)}$ kicks out some
factors of two that are *not* balanced by the $-1$ power of $t$ in the
exponential.
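If you’d rather not take the algebra on faith, a quick finite-difference check confirms the cancellation: for this Gaussian, $\partial p/\partial t$ equals $(a/2)\nabla^2 p$ at any point. The evaluation point below is arbitrary:

```python
import math

def gaussian(x, t, a):
    """p(x, t) = exp(-x^2 / (2 a t)) / sqrt(2 pi a t): diffusion from a point source."""
    return math.exp(-x * x / (2 * a * t)) / math.sqrt(2 * math.pi * a * t)

def check_heat_equation(x, t, a, h=1e-4):
    """Central finite differences for dp/dt and (a/2) * d^2p/dx^2."""
    dp_dt = (gaussian(x, t + h, a) - gaussian(x, t - h, a)) / (2 * h)
    d2p_dx2 = (gaussian(x + h, t, a) - 2 * gaussian(x, t, a)
               + gaussian(x - h, t, a)) / (h * h)
    return dp_dt, (a / 2) * d2p_dx2

# At an arbitrary point, the two sides agree to finite-difference precision.
dp_dt, rhs = check_heat_equation(x=0.7, t=1.3, a=2.0)
```

The two sides match to numerical precision, confirming that the diffusion coefficient implied by this Gaussian is $a/2$.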

In any case, though, we conclude from the above that $\Delta p = (a/2)\Delta t
\nabla^2 p$, which means the stochastic flux is simply $-(a/2)\nabla p$. Perhaps it’s
all clear to you by now, but let’s take this over the finish line. Putting
these two components together, $U$ and $T$, we have $\partial p/\partial t =
-\nabla\cdot (p\nabla\log q-(a/2)\nabla p)$. This nice identity is known as the
Fokker-Planck equation (the devoted reader might note that this is the *second*
reference to Fokker on this blog).

Now, finally, as promised many paragraphs ago, we can reason about this
equation at equilibrium. At equilibrium the distribution is static in time, so
we have $0 = \nabla \cdot (p\nabla \log q -(a/2) \nabla p)$. Assuming suitable
boundary conditions, we can assume the argument of the divergence is also zero,
so $\nabla \log q = (a/2)\nabla p / p$. Integrating in all directions, we have
$\log q + c = (a/2)\log p$, or rather $p \propto q^{2/a}$. This makes sense:
just as expected, for large $a$ the effect of $q$ is wiped out, and for small
$a$ all but the maximum value get squashed to zero. Crucially, however, if we
want $p \propto q$, then we must set $a=2$. Aha! So *that’s* why the thermal
motion needs to be scaled by $\sqrt{2\Delta t}$. (By the way, that constant of
integration above, $c$, oddly enough encodes that pesky normalization constant
that started off this whole discussion…)
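As a sanity check on the conclusion $p \propto q^{2/a}$, we can run the update with a general coefficient $\sqrt{a\,\Delta t}$ against a standard Gaussian $q$ and measure the chain’s stationary variance: the prediction is variance $a/2$, so only $a=2$ recovers $q$ itself. A small illustrative 1-D experiment:

```python
import math
import random

def langevin_variance(a, n_steps=400_000, dt=0.01, burn_in=20_000, rng=random):
    """Run x += dt * grad(log q)(x) + sqrt(a * dt) * xi for q(x) = exp(-x^2/2);
    return the empirical variance of the chain after burn-in."""
    x, acc, acc2, n = 0.0, 0.0, 0.0, 0
    for i in range(n_steps):
        x = x - dt * x + math.sqrt(a * dt) * rng.gauss(0.0, 1.0)
        if i >= burn_in:
            acc += x
            acc2 += x * x
            n += 1
    m = acc / n
    return acc2 / n - m * m

random.seed(0)
v2 = langevin_variance(a=2.0)   # prediction: p = q, variance a/2 = 1
v4 = langevin_variance(a=4.0)   # prediction: p ∝ q^{1/2}, variance a/2 = 2
```

Doubling the thermal coefficient from $a=2$ to $a=4$ roughly doubles the stationary variance, flattening the sampled distribution relative to $q$ exactly as $q^{2/a}$ predicts.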

Okay. I’m glad I got that out of my system, it’s been rattling around for
weeks, now! How wonderful it is, that we can start with the simplest axiom of
probability — “*something* must happen” — and, reasoning only with physical
intuition, reach a sampling algorithm that solves a real problem.

Snow and snowflakes have been on my mind lately. This is not entirely because
of the storm in Texas — rather, I have been thinking since Christmas about
the snowflake’s *shape:* those small, delicate, spiky circles we all know so
well (err, perhaps there is something of current events here after all…).
This is a blog post about snowflake shapes.

What shape *is* a snowflake? In principle, we all know what snowflakes look
like. We can all imagine the six-fold symmetry, and then we can imagine the
embellishments. And yet— not so fast! As a little child I (and I do not
think I am alone in this) wanted to see the shape of a snowflake for myself.
Of course, I never quite could — at least, to the point where I could stuff
and saturate my eyes — because the delicate snowflakes melted away as soon as
my snow-blind eyes could focus on the crystal caught. Think about it, then:
where does your own conception of snowflake-shape arise from? Does it come from
firsthand experience, or from clip-art and wintertime papercrafts?

(Of course, there *is* something to be said for indirect inference of snowflake-shape. Years later, I would marvel at the fact that you could infer that snowflakes are hexagonal from the existence of 22º haloes, a kind of crystallography at the heavenly scale, linking the macroscopic and microscopic. For now, however, I am concerned with seeing *directly*, and face to face, with a presence and visual immediacy that cannot be afforded by imagination alone.)

This is no small concern. A year or so ago I visited the California Academy of
Sciences and learned about Wilson Bentley, the first man to photograph a
snowflake. This was on January 15, 1885; he was just shy of 20 years old. To
appreciate this moment, you have to imagine a time *before* January 15, 1885,
when no such photograph existed. Until Bentley’s innovation, no man had seen a
single snowflake for more than a delicate glimpse. Perhaps spatially the
snowflake could be magnified under a microscope (as Hooke and his
contemporaries
demonstrated in the 17th century), but how could anyone magnify it
*temporally?* Bentley’s Edgertonesque miracle — akin to Muybridge’s “horse in
motion,” captured just seven years earlier — is to hold still the moment to
reveal *more* — to give us the opportunity to gaze without urgency. Dare I
say, Bentley *freezes time?* The manipulation of time becomes an act of
unconcealment, and then furthermore an act of creation.
Bentley’s snowflakes outlived him, and they will outlive me.

Wilson Bentley was not a scientist by training, and perhaps that served him well. Over the course of his life he collected over 5,000 images of what he called “ice flowers.” There is something marvelous about this devotion, which spanned years and years, crystal after crystal, wonder undiminished. Surely the 5,000th snowflake brought Wilson as much delight as the 4,999th (and surely it is the case that you know you are in love when you find yourself falling in love afresh every day).

There is also something to the uniformity of Bentley’s collection. Each snowflake is framed carefully in the center of a slide, on a black background. The images are diligently cropped to squares. See for yourself! The University of Wisconsin-Madison hosts an online archive of 1,181 of Bentley’s images, courtesy the Schwerdtfeger Library. Bentley never sought copyright on his images; they are in the public domain, and so I downloaded them, along with their labeled categories. The categories have wonderful names, such as “crystal with broad branches” and “lumped graupel.”

When displayed in a grid, the snowflakes look like a box of delicious holiday sweets. Surely these delights are made of sugar and not ice.

What does one *do* with 1,181 snowflakes? One searches for patterns, the way
Bentley would never have been able to. Bentley’s images demand to be studied as
samples from a distribution, and they are strikingly well-suited to our
favorite statistical computer vision techniques. For example, though they are
scanned at a uniform 1000x1000 resolution, they still make sense when
dramatically downscaled, even to as low as 32x32. Notice also how the
snowflakes are highly amenable to *data augmentation* on account of their
intrinsic symmetry. Each snowflake can be reflected and rotated arbitrarily to
produce several more high-quality synthetic snowflakes “for free.” So, even
though there are nominally 1,181 unique images, there is much more juice here
than one might expect at first. You can’t always play this game: CIFAR images
can’t really be rotated, and most MNIST digits can’t even be flipped. But
snowflakes are snowflakes.
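A minimal sketch of this augmentation with numpy: on a square pixel grid the lossless symmetries are the eight elements of the dihedral group (quarter-turn rotations and reflections; the snowflake’s *hexagonal* symmetry would need interpolation). The tiny array below stands in for a snowflake image:

```python
import numpy as np

def dihedral_augment(img):
    """Return all 8 square symmetries of an image: 4 rotations x optional flip.
    These are exactly the transforms that are lossless on a square pixel grid."""
    out = []
    for k in range(4):
        rotated = np.rot90(img, k)   # rotate by k quarter turns
        out.append(rotated)
        out.append(np.fliplr(rotated))
    return out

# A tiny asymmetric 2x2 "image" yields 8 distinct augmented copies.
img = np.array([[1, 2], [3, 4]])
augmented = dihedral_augment(img)
distinct = {a.tobytes() for a in augmented}
```

For an image that already has some of these symmetries (as a real snowflake does), some copies coincide, but for typical scans with imperfections and noise, each transform contributes a genuinely new training example.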

One February evening I trained — with a lot of help from my friends Gautam and Jack — a baby variational autoencoder, which embeds the snowflakes in a small low-dimensional latent space. Here, “low-dimensional” means 10-dimensional: it seems very reasonable to encode $2^{10}$ data points with 10 parameters. In the image below, the top row consists of five randomly-selected source images, and the bottom row is the VAE’s attempted reconstruction. The contrastive green-and-purple color scheme comes from matplotlib’s default, and reminds me of the delightful pGLO experiments we did in high-school biology.

Of course, as with all things deep-learning, this was easier said than
done. For a technique that falls broadly under “unsupervised learning,” a VAE
requires an awful lot of babysitting! —but then again, one of the joys of
online school is that *nobody can tell* if you’re training machine learning
models behind your Zoom tab. So I trained, and I trained. The current model
takes around 6 hours on my dusty old MacBook, though after the first 30 minutes
the results are already non-embarrassing.

*Do you want to build a snowgan?* Separately, I also spent some time trying to build a model adversarially. The snowgan (so to speak) consists of two rival sisters: the generator, Elsa, is of the same architecture as the VAE’s decoder, and the discriminator, Anna, learns to distinguish Elsa’s output from real snowflakes. Elsa and Anna compete, and as a result Elsa gets better at camouflaging the fakes among the flakes. Alas, GAN convergence is finicky enough that I lost interest after a few days… open problem!

It turns out that the VAE’s decoder can be used to build a small-scale
snowflake factory. Using ONNX.js and only a little bit of guile, I ported the
VAE’s decoder to JavaScript and attached the latent space to some
`<input type="range">` sliders. By using the sliders to choose parameters in the latent
space and feeding these parameters to the decoder, one can interpolate to
hallucinate snowflakes that are *not* part of Bentley’s training set. With some
experimentation, I found the sliders that correspond to parts of the latent
space that map to outliers; those parameters I clamped to extreme values in the
non-outlier-ey direction. This interface is what you see at the top of this
blog post.

At time of writing, there is only one Google search result for “this snowflake does not exist,” which points to a 2017 master’s thesis in elementary-school education by Elif Dedebaş. The context is an activity where children are making paper snowflakes, and investigating the relationship between the number of folds and the symmetries. One child creates a snowflake-like shape with eight-fold symmetry and is asked if she knows why such a shape is unnatural.

I imagine this slider infrastructure is on par with the sort of fancy tooling
UI that they have up in the Great Snow Factory In The Sky. I’m reminded of
David Wiesner’s wonderful Caldecott-winning picture book *Sector 7*, which I
recall reading (“reading”) in my elementary school’s library. It is about a
cloud factory, accessed from the top of the Empire State Building. Notice how
the large sousaphone-bell-shaped openings emanating from the factory roof are
exactly a portrait of an autoencoder’s decoder.

“Wiesner visited the Empire State Building on a zero-visibility day to research
this book,” reads the About-the-Author at the end. “He was the only person
there.” There is something of Wiesner in Bentley and Bentley in Wiesner, I
think. Think of them both, and yourself, too, as Friedrich’s *Wanderer*.

The results interpolated by the VAE are surprisingly convincing. Here are some snowflakes that do not exist (snowfakes?). I created these images by sliding the sliders haphazardly and recording a trace of the changing image. Often, people display this kind of interpolation in animated form, but in relation — and, indeed, deference — to Bentley’s life’s work, I would rather see them spread spatially than temporally.

I enjoy trying to work out the exact moment of transition between different kinds of snowflakes; there is a snowy saltation that occurs at certain boundaries that I cannot quite point you to. A curious fact is that a t-SNE plot in the VAE’s latent space shows no clustering with respect to the provided labels; the VAE seems to have hallucinated its own classification scheme, one which is perhaps more macroscopically morphological in nature than Magono and Lee’s 1966 classification scheme.

The BentleyBlizzardBlossoms (B3) dataset, along with some PyTorch utilities and the source code and trained model of the baby autoencoder, is available on GitHub. I think it would make a wonderful alternative to MNIST for small experiments. If you use it for anything, please let me know!
