A 2014 paper by Rozhkovskaya and Reb titled “Is the List of Incomplete Open Cubes Complete?” complicates Sol LeWitt’s celebrated 1974 artwork, “Incomplete Open Cubes”. The artwork is a gallery of 122 three-dimensional structures; they are meant to be all distinct connected subsets of edges of the unit cube that result in a 3D figure — or, in LeWitt’s words, “all the ways of the cube not being complete”. The complication, noted by the paper, is that two of the structures are identical up to rotation; they should instead be mirror reflections of each other.

Have art historians noticed yet? I would like to think there is some deeper philosophical significance to this bit of trivia. The exercise in exhaustive enumeration of incompleteness is itself incomplete — and we cannot help but overlook the incompleteness, just as we cannot help but imagine the missing edges of any of the 122 almost-cubes.

Image source: Incomplete Open Cubes Revisited

I have meditated on the art-historical significance of chirality before, so today I want to talk about something else. I don’t know about you, but when I see LeWitt’s almost-cubes, I feel the need to submerge them in a tank of soapy water. Why? To see what shape the films take, of course! Is that irrational? Perhaps, but writes LeWitt, “Irrational thoughts should be followed absolutely and logically.” It turns out that my thought isn’t hard to follow — if not absolutely and logically, then at least computationally.

The shape of a soap film is a minimal surface, that is, the surface with minimal area that obeys the boundary conditions of the “loop” that the film forms within. The definition doesn’t depend at all on the properties of soap: any film-forming substance should — in the absence of other forces, such as gravity — contract to the same shape as a result of surface tension. These shapes can often be pleasing and unexpected.

Image source: Wikipedia

Now, because finding the surface is a question of continuous minimization, we should be able to easily apply automatic differentiation — at least, to a discretization of the problem. Here is the algorithm: we start with a mesh grid of points and triangulate it. Then, we compute the surface area by repeated application of Heron’s formula to the triangles. Finally, we differentiate the total area with respect to the positions of the points, and nudge the points towards less area. The full source code for this adventure is available here. (By the way, I’m using PyTorch out of habit, but perhaps TensorFlow would be more appropriate because of its oddly LeWittian logo.)
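As a minimal sketch of that pipeline (my own toy, not the linked source code): pin the four corners of a unit square as the boundary, let one interior vertex move vertically, fan-triangulate, measure total area with Heron’s formula, and descend the gradient. A central finite difference stands in for autograd here, just to keep the sketch dependency-free.

```python
import math

# Boundary: four corners of a unit square, pinned at z = 0 (Case 1, the "square").
corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

def heron(p, q, r):
    """Area of triangle pqr (3D points) via Heron's formula."""
    a, b, c = math.dist(p, q), math.dist(q, r), math.dist(r, p)
    s = (a + b + c) / 2
    return math.sqrt(max(s * (s - a) * (s - b) * (s - c), 0.0))

def total_area(z):
    """Fan-triangulation: one free center vertex at height z, four fixed corners."""
    center = (0.5, 0.5, z)
    pts = [(x, y, 0.0) for x, y in corners]
    return sum(heron(pts[i], pts[(i + 1) % 4], center) for i in range(4))

# Nudge the free vertex towards less area; a central finite difference
# stands in for automatic differentiation in this dependency-free toy.
z, lr, eps = 0.5, 0.05, 1e-6
for _ in range(300):
    grad = (total_area(z + eps) - total_area(z - eps)) / (2 * eps)
    z -= lr * grad
```

The lifted center vertex drops back into the plane and the total area approaches 1, the area of the flat square: a planar boundary yields a flat film.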

Before I give you the results, I want you to take a moment to try and imagine what the soap films will look like. I think it’s an interesting exercise in qualitative reasoning, and, in any case, LeWitt’s work all but pleads for us to see what isn’t there.

For starters, we can reduce the entire problem from 122 cubes into just a handful of cases. Simply ask: how many distinct-up-to-symmetry “loops” are there on a cube? Let’s do casework by face count: there is 1 loop with one face (the “square”), 1 loop with two faces (the “L”), and 2 loops with three faces (the “U” and the “corner”). The rest are accounted for by complement: a four-face loop is also a two-face loop.

This yields a total of 4 cases — much more manageable! And indeed, I’ll suggest that you start with Case 1, the “square”, since the answer in that case should be obvious.

Okay, now, the results!

Case 1 is easy; we already know the answer. The soap film is just flat, because its boundary is planar.

Case 2 is less obvious — I would not have predicted this outcome! It turns
out that the soap film tries to “flatten” itself into the hypotenuse. What are
the implications of this? Well, one implication is that if you want to minimize
the material you use to build a tent (without regard to volume), you can do
better than stretching the material taut into large rectangles. Sagging
actually *saves* material.

You might expect Case 3a to flatten similarly, but because of the up-down symmetry that can’t happen. Instead, you get a beautiful saddle shape. Actually, it turns out that all minimal surfaces are saddle-like; if a surface had nonzero mean curvature at some point, you could “flatten the bulge” there to reduce surface area.

And finally, Case 3b — sorry this isn’t quite perfectly modeled — tries to flatten itself around the center, kind of like a tortilla chip.

Summary of this post: It turns out that by carefully mixing forward-mode and reverse-mode automatic differentiation, you can greatly simplify certain hyperparameter optimization algorithms.

Here’s a fun recursive idea: just like how we optimize the parameters of
machine learning models by gradient descent, we can also optimize their
*hyperparameters* by gradient descent. This is by no means a new idea; you can
find 20-year-old papers that discuss (for example) optimizing gradient descent
step sizes by gradient descent itself.

Broadly, there are two ways to do this. The “short-term” way is to take a
*single* step of gradient descent, and then based on how that goes, adjust the
hyperparameters. That’s the premise of the 2018 paper Online Learning Rate
Adaptation with Hypergradient Descent: it
turns out that you can do this in an extremely lightweight way. You would hope
that over time the hyperparameters converge to something optimal alongside the
parameters (and indeed that is the case, though you have to be careful).

The “long-term” way is to train *several* steps and *then* backpropagate
through the *entire training* to adjust the hyperparameters. The hope is that
this provides a stronger “signal” to the hyperparameter-gradient, which is
better for convergence. But, this comes at the (tremendous) expense of having
to store a copy of the entire computation graph for several steps of training,
and then backpropagate through it. If your model has 1GB of parameters, you
need to store ~1GB worth of numbers for *each step*. You can imagine that adds
up over the course of many epochs, and so the algorithm is severely limited by
how much RAM you have.

The 2015 paper Gradient-based Hyperparameter Optimization through Reversible Learning makes this work by throwing away the intermediate steps of the computation graph and then JIT-re-computing it “in reverse” during backpropagation. It’s a clever technique, but kind of hairy and hard to implement (you have to be super careful about numerical precision).

Here’s a graph that visualizes these two techniques (from this
paper). You’re looking at several parallel
loss curves (($\log(f)$) is the loss plotted on a log scale), pointing towards
you, arranged in order of the “step size” hyperparameter
(that’s ($\log(\alpha)$)). The orange curve represents a “short-term”
hyperparameter optimization, which is allowed to move along the “step size”
axis at each step. The “long-term” hyperparameter optimization instead
optimizes directly on the thick black dashed “U” — that is, after *several*
steps of training. You can see how the latter is smoother, but also much harder
to compute.

To summarize: the short-term way is cheap but noisy. The long-term way is expensive but less noisy. Can we get the best of both worlds?

Well, let’s think more carefully about the source of the expense. The problem
is that we need to store (or be able to reconstruct) the full computation graph
in order to do backpropagation. Okay, but do we really *need* to do
backpropagation? The only reason we backpropagate is that it’s more efficient
in the case when you want derivatives with respect to *many* different
variables. If you have millions of model parameters, backpropagation is
millions of times faster than the much simpler forward-mode (“dual numbers”)
automatic
differentiation.

But we *don’t* have millions of *hyperparameters!* Step size, for example, is
just a single number. The Adam optimizer only has a total of 4 hyperparameters.
With this in mind, backpropagation isn’t even the right choice — we *should*
be using dual numbers for the hyperparameter optimization. On the other hand,
we should still be using backpropagation for the “inner loop” that optimizes
the (non-hyper-) parameters. That is, we want to do something like this:

```
initialize hyperparameters
# loop to optimize hyperparameters
while True:
    initialize parameters
    # loop to optimize parameters
    for i in range(100):
        run model on data to get loss
        # using reverse-mode!
        compute d(loss) / d(parameters)
        update parameters using hyperparameters
    # using forward-mode
    compute d(loss) / d(hyperparameters)
    update hyperparameters
```

(Update: I discovered that this was suggested in the 2017 paper Forward and Reverse Gradient-Based Hyperparameter Optimization. But keep reading — while the paper’s implementation needs a lot of math to be worked out manually, I’m going to show you how to implement this in a way that makes all the math in the paper fall out “for free”…)

This proposal raises a logistical question: how do we reconcile these two
automatic differentiation algorithms in the same program? **The “trick” is to
“thread” dual numbers through a backpropagation implementation.** In other
words, implement backpropagation as usual, but rather than `float`-type
numbers, exclusively use `dual_number`-type numbers (even when doing derivative
calculations). Initialize the system such that the dual numbers track
derivatives with respect to the hyperparameters you care about. Then, your
final loss value’s attached ($\epsilon$)-value *immediately* gives you
`d(loss)/d(hyperparameters)`. No backpropagation needed — and so, it’s safe
to “forget” the computation graph.

That’s it! That’s all I wanted to share in this blog post! :)

Of course, it’s not obvious how to implement this in PyTorch, since you can’t
naïvely do `tensor(dtype=dual_number)`. Rather than hack in a custom numeric
data type that implemented dual numbers, I wrote my own tiny implementations of
forward- and reverse-mode automatic differentiation. It’s just a couple dozen
(very-recognizable-to-automatic-differentiation-enthusiasts) lines of code. I
was careful to make each implementation generic in the kind of “number” it
accepts. That allowed me to run the reverse-mode algorithms using the
forward-mode data types.
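To make the trick concrete, here is a from-scratch toy (my own sketch with made-up names; it is not the post’s actual code). `Dual` is a forward-mode number whose ε-part tracks d/dα; `Var` is a reverse-mode node written generically over its number type, so it is perfectly happy computing with `Dual`s. The inner loop trains a parameter w to minimize (w − 3)² by SGD with step size α:

```python
# Forward-mode number: a value plus an epsilon-coefficient, here d(value)/d(alpha).
class Dual:
    def __init__(self, v, d=0.0):
        self.v, self.d = v, d
    def _coerce(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._coerce(o)
        return Dual(self.v + o.v, self.d + o.d)
    __radd__ = __add__
    def __sub__(self, o):
        o = self._coerce(o)
        return Dual(self.v - o.v, self.d - o.d)
    def __mul__(self, o):
        o = self._coerce(o)
        return Dual(self.v * o.v, self.v * o.d + self.d * o.v)
    __rmul__ = __mul__

# Reverse-mode node, generic in its number type: it never assumes float,
# so the numbers flowing through it can just as well be Duals.
class Var:
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0
    def __sub__(self, o):
        return Var(self.value - o.value, [(self, 1.0), (o, -1.0)])
    def __mul__(self, o):
        return Var(self.value * o.value, [(self, o.value), (o, self.value)])
    def backward(self, seed=1.0):
        # Topological order, then accumulate adjoints node by node.
        order, seen = [], set()
        def topo(n):
            if id(n) not in seen:
                seen.add(id(n))
                for p, _ in n.parents:
                    topo(p)
                order.append(n)
        topo(self)
        self.grad = seed
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad = parent.grad + local * node.grad

# Toy "training": minimize f(w) = (w - 3)^2 by SGD with step size alpha.
alpha = Dual(0.05, 1.0)   # the hyperparameter; its epsilon tracks d/d(alpha)
w = Dual(0.0)             # the parameter (a Dual, so derivatives thread through)
for _ in range(50):
    wv, c = Var(w), Var(Dual(3.0))
    loss = (wv - c) * (wv - c)     # reverse-mode graph built over Duals
    loss.backward(Dual(1.0))
    w = w - alpha * wv.grad        # SGD update; note that alpha is a Dual!

final = (w - 3.0) * (w - 3.0)
```

After training, `final.v` is the converged loss and `final.d` is d(loss)/dα, read off with no stored training history; its sign tells you which way to nudge the step size.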

Running it on a simple quadratic optimization problem, we can see that updating the hyperparameter ($\alpha$) yields better-looking loss curves — in this GIF, we’re discovering that we should increase the step size. Yay!

I think this is a very neat trick, which surely has other cool applications (for example, differentiating through long-running physics simulations!). If you’re curious, check out this IPython notebook for the full source code. Feel free to adapt it to your ideas. Then, tell me what you’ve built!

“I do not know what ‘poetical’ is: is it honest in deed and word? Is it a true thing?” — Lotman quoting Pushkin quoting Shakespeare in *As You Like It*, 3.3.

How do I want to say this? Here is one way: Earlier this year, the world was *very surprised* at a computer doing *very unsurprising* things. The celebrated computer program, OpenAI’s GPT-3, is a *language model*, a machine learning model that (to simplify) outputs the statistically-likeliest completion of a sentence, much like the autocomplete on your iPhone keyboard. If we wrote the prefix

The weather is warm and…

GPT-3 might output “sunny” or “humid” — that would be probable and unsurprising. But it would likely not output “turtledove” — that would be improbable and surprising. Of course, you can run this program again and again to sample more and more words, and thus generate long sequences of text.

- The weather is warm and [sunny]
- The weather is warm and sunny [so]
- The weather is warm and sunny so [let’s]
- The weather is warm and sunny so let’s [go]
- The weather is warm and sunny so let’s go [to]
- The weather is warm and sunny so let’s go to [the]
- The weather is warm and sunny so let’s go to the [beach.]

In principle this is nothing new. There was a GPT-1 and a GPT-2, too, as you can imagine. But GPT-3 is bigger and better — it has “read” an *enormous* corpus of text to build up its database, and the model itself consists of *175 billion* parameters (GPT-2 was a “mere” 1.5 billion parameters). Just training a model of this size is an exciting engineering accomplishment.

By virtue of its size, GPT-3 is extremely good at hallucinating text in this way. And this is where things begin to get uncanny. If you phrase your prompt as a question, GPT-3 reliably provides answers in complete sentences. If you phrase your prompt as a few coding examples, GPT-3 can generate software to build small user interfaces based on a description of what buttons you want. It can even generate convincing hoax blog posts. Don’t worry, what you are reading right now is written by a human — but it is probably not unreasonable to be a little concerned.

I think it is natural to ask whether it can generate “literary” text as well — whether it is a kind of superintelligent monkey that *will* generate *Hamlet* if you ask it to. Now, it’s pretty clear that GPT-3 is not quite a Great Automatic Grammatizator (I have written in the past about what *that* may look like). But for short fragments, as lots of people have shown, it isn’t embarrassing either. And so, as we hurtle towards a future where statistically-sound “literary text” can be mass-produced, I have been thinking a lot about what “literature” is.

One set of thoughts comes from an unlikely (so to speak) source, which is the early-1970s literary theory of Yuri Lotman (or Jüri, or Jurij — you choose how to spell it — this becomes relevant soon). In *The Structure of the Artistic Text* (1971, but I’m referencing Ronald Vroon’s 1977 translation and Ewa Thompson’s guide), Lotman connects Shannon’s and Kolmogorov’s quantitative theories of communication and entropy with the qualitative question *what is literature?*. At least, that is my understanding — I’m not an expert in Russian literary theory or semiotics, but I’m trying my best.

Let’s think back to one of Shannon’s big ideas, which is that language is *redundant*, which means there are many ways to convey the same idea. Because of its redundancy, language is resilient to noise in the communication channel (such as loud music at a party). If I stprt intrdducinq mutnt1ons into my spellings, you can still more or less get the message. Similarly, most ideas can be paraphrased. Or, to put it differently, you can convey a concept in many different ways. That is, notions can be materialized into words in multiple fashions. If there were no redundancy in the language — that is, if every combination of letters formed a different valid word, or if there were only one way to express each idea — then this error-correction would be impossible. Imagine if each volume in Borges’ Library of Babel was a meaningful masterpiece! Imagine spilling a teardrop on a page, only to change the meaning dramatically!

This redundancy is the reason why text is so *compressible*, why a ZIP file can be so small compared to the source it is compressing. You can just strip out the redundancy to get a smaller file: your disk’s I/O channel has negligible noise compared to a loud party, and so the redundancy is not needed. (Any introductory theory-of-computation course touches on *Kolmogorov Complexity*, a kind of theoretical limit to compressibility in terms of computability: it turns out that Kolmogorov worked with language and literature as well.)

Perhaps this redundancy makes GPT-3’s behavior a little less surprising (I hope that word “surprising” is accruing meaningfulness for you). The more redundancy a language has, the more about the statistical structure of language you can “learn” from data, and therefore the easier it is to predict likely completions. Indeed, GPT models can be used to perform data compression.

Now, my understanding of Lotman’s thesis is that *literature does not behave this way*. Literature is *more entropic*, which is to say, literature is *less redundant*. It is packed with meaning — in fact, every choice is intentional and significant. Lotman argues that literature lives outside or above “ordinary” natural language; a work of literature creates its own “code,” and each word takes on a heightened meaning in this code. To read literature is to learn to decipher this code. He writes, *The artistic text is an intricately constructed thought. All its elements are meaningful elements.* Once a text is composed, it is baked to an immutable, brittle crust. “The reader considers the text placed before him as the only possible text.”

For example — and this is my example, not Lotman’s — the repetition in a line from *Lear*

No, no, no, no! Come, let’s away to prison:

is exact and precise. Four times, no more, no less — and each utterance of the word “no” is significant, dripping with intention. The raw sound “no” itself becomes a kind of meaningful sign, detached from its usual English-language semantics. You cannot, without losing meaning, compress that text as simply “4 x no.”

Let me give you another example. Consider Roethke’s “The Waking.” It’s a villanelle, a poem built on repetition. Every other stanza ends with the same line.

I wake to sleep, and take my waking slow.

I feel my fate in what I cannot fear.
I learn by going where I have to go.

This repeats.

Of those so close beside me, which are you?

God bless the Ground! I shall walk softly there
And learn by going where I have to go.

Until, suddenly and jarringly (in the softest way possible), it doesn’t.

Great Nature has another thing to do

To you and me; so take the lively air,
And, lovely, learn by going where to go.

That word *lovely* appears so deliberately in that line that its absence in the previous lines is conspicuous. As a result, those repeated lines, which seemed like blind repetitions, become conscious choices. Roethke could have said *whatever he wanted* but he chose to repeat himself verbatim, without the word “lovely,” in what Lotman might call a kind of “minus-device,” a trembling vacuum, an absence with presence — that in itself is meaning, that in itself holds entropy.

It would not do for Roethke to paraphrase those lines, either. “I educate myself by traveling where needed” has no impact. Because literary text is stripped of redundancy, even small mutations — even semantics-preserving ones — can dramatically affect the text. It’s like compiling for embedded targets with `-O4`. Like `gcc`, Lotman does not believe you can paraphrase literature to get equivalent literature. You cannot re-tell a story, because there are no synonyms in the unique code of that particular work of literature. I am reminded of Borges’ “Pierre Menard, Author of the Quixote,” a short story wherein Menard *writes* (not re-writes) *Don Quixote*.

Pierre Menard did not want to compose another *Quixote*, which surely is easy enough — he wanted to compose *the* Quixote. Nor, surely, need one be obliged to note that his goal was never a mechanical transcription of the original; he had no intention of *copying* it. His admirable ambition was to produce a number of pages which coincided — word for word and line for line — with those of Miguel de Cervantes.

Maybe, in light of the discussion above, these bizarre lines are a little less mystifying, a little more satisfying.

That brings me back to GPT-3, and the literary status of its output. Can we ever call a GPT-3-generated poem a work of literature? Well, following Lotman, let’s instead ask this: how much information, really, does GPT-3 output? How entropic is GPT-3’s output? And of course the answer is that GPT-3 is quite orderly in the grand scheme of things. “Whenever the text does not realize one of at least two possibilities,” writes Lotman, “but automatically follows one, it loses its capacity to convey information.” That is, every time GPT-3 argmaxes its way to a prediction predestined by its statistical model, which is really a synthesis of gigabytes and gigabytes of data, it is outputting — well, nothing. No new information. The occasional choice made by the random sampling is the only “information” being produced, though it is hidden in the fluffy sheep-skin of statistically-unobjectionable prose. If GPT-3 completed the prompt-word “yesterday” with Roosevelt’s “a date which will live in infamy” speech half the time, and Beatles’ “All my troubles seemed so far away” lyrics the other half the time, then GPT-3 has only really communicated one bit of information; the rest is redundant. Indeed, if you were to measure the number of bytes read from `/dev/urandom` (or whatever RNG) by GPT-3 in the course of sampling, and analyze the relative probabilities predicted by the model, perhaps you would be able to compute a Shannon entropy — or even a convincing literariness score — for the output. I sense it would not be very high.
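The “one bit” arithmetic is easy to make concrete. Here is a back-of-the-envelope version (my own toy, using made-up probabilities rather than anything GPT-3 actually outputs):

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a distribution over possible completions."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The "yesterday" thought experiment: two completions, each chosen half the
# time, convey exactly one bit; everything else in the text is redundant.
one_bit = entropy_bits([0.5, 0.5])

# A near-deterministic model (argmax almost always wins) conveys almost nothing.
nearly_nothing = entropy_bits([0.99] + [0.01 / 9] * 9)
```

A perfectly predictable completion, `entropy_bits([1.0])`, conveys zero bits — in Lotman’s terms, it has lost its capacity to convey information.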

I want to end this piece with two notes.

First: I see a couple of problems with Lotman’s theory — again, I’m not a literary theorist by trade so I do not know whether he (or others) have addressed them.

- How does Lotman account for the way early (oral) poetry, with its strict rhyme and metrical patterns, was designed to be memorized? Isn’t the poetical form, in a very direct sense, optimizing itself for compression and therefore less entropic — as if the redundancy were to harden the message against the noise of *time* and *amnesia*? Or do these qualities themselves encode and exfiltrate information, like the stress-patterns in Sanskrit verse, or the knots knitted by the women in *A Tale of Two Cities*? Lotman spends some time at the beginning of Chapter 6 discussing how poetry removed itself from ordinary speech, and then prose removed itself from poetry, and therefore prose is twice-removed from ordinary speech. But that does not quite answer my question.
- How does Lotman account for the fuzziness of texts — for example, the many editions of *Hamlet* that editors *to this day* Frankenstein together to try to formulate a coherent whole? How does this theory stand in the absence of an original source text?

Second: Lotman means all this in a serious technical sense. As far as I can tell, he is not just borrowing scientific buzzwords to give legitimacy to an otherwise fuzzy theory. For example, he cites the quantitative work of Shannon and Kolmogorov directly, even though it was (even for the time) primitive. The preface to his book suggests rhetorically that the only way to test these theories would be to compute empirical metrics on an unthinkably large set of poems. Later Lotman suggests a kind of semantic analysis of poetry, but says

The operations recounted above give only a general and deliberately rough semantic skeleton, since a description of all connections arising in the text and of all extra-textual relations which could be ascertained would be an unrealistic task in terms of sheer volume.

But today, we can do that — take exactly those measurements that Lotman (and Kolmogorov and Shannon) could only have dreamed of! It is as if Lotman predicted the relativistic warping of the stars, and we had to wait decades for the eclipse to make the confirming observations. Practical technology has finally caught up to theory.

And so I think there is potential for unprecedented scholarship here. This is a question where deep learning enthusiasts and literary theorists and semioticians all have something valuable to say. I wonder what the conversation will be like? It is comforting to note that our hero Kolmogorov made this appeal as early as 1964:

But real progress in this direction demands that cyberneticians [today’s deep learning researchers?] take a greater interest in the humanities and learn more about them.

Another magic trick. Watch these two grids overlap to reveal a secret image! Can you spot the phantom horse?

If my janky CSS animation didn’t work for you, here is another way to see it: simply shrink the top grid slowly until the horse “pops” out. It’s hard to show this effect consistently across displays/browsers because of varying pixel spacings and various browsers’ aggressive anti-aliasing tactics, but here is a screen recording I captured.

Magic trick? Not quite. Graphics folks will immediately recognize this effect
as a curious instance of the dreaded *moiré pattern,* caused by the
interference between two almost-but-not-quite-the-same grids. Researchers work
really hard to escape such “glitchy” artifacts — they are often a result of
sloppily sampling a texture in a grid, leading to jagged edges or moiré streaks
(image below from Wikipedia).

In fact, a lot of
effort has
gone into finding “good” sampling patterns that avoid moirés; for example, in
an odd crossover of sampling theory and anatomy, if you look at the
distribution of photoreceptors in a monkey’s
retina they are neither in a grid
*nor* uniformly random, but rather arranged in a *low-discrepancy* pattern to
avoid exactly this kind of artifact.

But for all the hard work gone into avoiding these patterns, you kind of have to admit that they are beautiful and mysterious. In fact “moiré” gets its name from a desirable form of silk weave that exhibits this phenomenon owing to interference from the weave (photo from Wikipedia again). Similarly, this sculpture (and a lot more art) exploits moiré patterns for their hypnotic beauty.

There are lots of real-world ways you can play with this effect, and indeed soon enough you can’t stop seeing it everywhere you go (this is the joy and frustration of being interested in visual phenomena):

- Point a camera at a checkered shirt; the camera’s pixel grid interferes with the fabric pattern (I’ve been told not to wear checkered shirts when giving talks at certain venues where everything is recorded).
- Align two kitchen strainers and gently bend them to see the patterns change.
- Watch a screen door’s shadow interfere with the screen itself, poke the screen a bit to see the patterns change.
- Watch the wire fences on either side of a pedestrian walkway above a road interfere with each other.
- Look at an office chair from behind, watch the mesh of the back interfere with the mesh of the seat. Depending on their relative curvature you can get some beautiful effects.

These patterns are seemingly random, like marbled paper. But can we control this randomness? Satisfyingly, an on-demand moiré such as the effect above is extremely simple to generate using differentiable programming in PyTorch.

What I am optimizing is a *small distortion* to the image of the grid such that
when the grid and its distortion are superimposed, the resulting image has low
L2 difference from a given target, such as the horse image. What exactly is
this “distortion”? Imagine the image printed on play-dough and deforming it by
stretching and squeezing parts. The result is a warping of the image. To
express this warping, each pixel is assigned a “motion vector” which tells it
where to go. To make sure the motion vectors are smooth and continuous between
adjacent pixels, I actually learn a heavily downsampled set of motion vectors
(even a 12-by-12 array gets good results) and then upsample it to full
resolution with bicubic interpolation.

Here is the wonderful thing: all of these operations are built in to PyTorch and differentiable, apparently to support Spatial Transformer Networks. So, a straightforward 30-line implementation of the above algorithm is all it takes to start “learning” a moiré grid perturbation. Here is a Jupyter notebook with all you need to get started.
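Here is a condensed sketch of that recipe (my own toy, not the notebook’s code; the Gaussian blob stands in for the horse target, and all names and sizes are mine). It learns a coarse 12×12 motion-vector field, upsamples it bicubically with `F.interpolate`, warps the grid with `F.grid_sample`, and penalizes the L2 distance between the superposition and the target:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
H = W = 64

# A fine vertical-bar "grid" image with 0/1 values.
xs = torch.arange(W).float()
grid_img = ((xs // 2) % 2).repeat(H, 1)[None, None]            # (1, 1, H, W)

# Stand-in target: a soft Gaussian blob (imagine the horse here).
yy, xx = torch.meshgrid(torch.arange(H).float(), xs, indexing="ij")
target = torch.exp(-((xx - W / 2) ** 2 + (yy - H / 2) ** 2) / 200)[None, None]

# Learnable, heavily downsampled motion vectors (12-by-12, two channels).
flow = torch.zeros(1, 2, 12, 12, requires_grad=True)
opt = torch.optim.Adam([flow], lr=1e-2)

# Identity sampling grid in [-1, 1], shape (1, H, W, 2), in (x, y) order.
base = torch.stack(torch.meshgrid(torch.linspace(-1, 1, H),
                                  torch.linspace(-1, 1, W), indexing="ij"), -1)
base = base[None, ..., [1, 0]]

losses = []
for _ in range(100):
    # Upsample the coarse flow to full resolution with bicubic interpolation.
    up = F.interpolate(flow, size=(H, W), mode="bicubic", align_corners=False)
    grid = base + up.permute(0, 2, 3, 1)       # perturb the sampling positions
    warped = F.grid_sample(grid_img, grid, align_corners=True)
    moire = grid_img * warped                  # superimpose grid and warped grid
    loss = ((moire - target) ** 2).mean()      # L2 difference from the target
    losses.append(loss.item())
    opt.zero_grad(); loss.backward(); opt.step()
```

Every step from `flow` to `loss` is differentiable, so plain Adam does the rest; the real notebook presumably differs in details like resolution, learning rate, and target preprocessing.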

How much computation does it take to get good results? See this animation: it converges extremely rapidly, in just 800 iterations or so of plain gradient descent (a couple of minutes on my laptop).

I’m of course not the first person to try this trick; I found references in
2001 and 2013. I think my results look significantly better than those
presented in the paper *Variational Approach to Moire Pattern Synthesis*
(Lebanon and Bruckstein 2001), even though my technique
is much simpler than their gradient-based technique. On the other hand, I think
the more mathematically-motivated results presented in *Target-Driven Moiré
Pattern Synthesis by Phase Modulation* (Tsai & Chuang
2013) are much more impressive, even though they have a
slightly different setup than mine. They also had the resources to print out
their patterns on transparent film and overlay them physically, as well as to
mess with a camera whose Bayer pattern was known! And they suggest using this
trick for cryptography. Very neat.

To understand what they do we need to understand the mechanics of moiré
(instead of leaving it all to SGD!). I’ve always been told that when two
high-frequency signals such as grids are superimposed, they can alias to form
interesting low-frequency content (famously there is this
comic). But I’ve never been given more detail than
that high-level overview, so I looked it up myself. It’s actually quite
wonderful and principled. Here is my understanding as informed by *A
Generalized Fourier-based Method for the Analysis of 2D Moiré Envelope-forms in
Screen Superpositions* (Amidror 1994), specifically by
Figure 4 reproduced below.

First, we need to come up with a useful characterization of grids. Recall that
you can take a 2D signal (i.e. image) and perform a 2D Fourier transform, which
is *also* an image. Each point in frequency-space represents a set of stripes
(i.e. a sine wave) rotated to match the point’s angle with respect to the
origin (see (a) and (b) in the figure for examples of what I mean). The
frequency of the stripes is given by the distance from the point to the origin,
and the amplitude is given by the point’s brightness (complex-valued
brightnesses let you set phase). To get “sharp” stripes instead of a “smooth”
sine wave gradient, you have to add “harmonics” at regular multiples of that
point (recall the Fourier transform of a square wave if that helps). This is
(d) and (e) in the figure from the paper.

Next, we note that superimposing two sets of stripes is like pointwise
multiplication (if you treat zeros as “occlusion”, then multiplying by zero
always sends the result to zero as desired). Because pointwise multiplication
is convolution in frequency space, the superposition looks like a *lattice*
spanned by the harmonics in frequency space. This lattice is depicted in (f) in
the figure.

The key to the illusion is that this lattice might include points much closer to the origin than either of the harmonics that span it! Because the human visual perceptual system acts as a low-pass filter, the result is that you see a low-frequency “phantom” in the superposition. This is visible in (c) in the figure.
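A tiny 1D numpy check of this lattice story (my own toy, not from the paper): multiply two 0/1 bar gratings at 60 and 64 cycles, and the dominant low-frequency component of the superposition lands at their difference, 4 cycles, far below either grating:

```python
import numpy as np

N = 2048
x = np.arange(N) / N
f1, f2 = 60, 64                      # two fine bar gratings, close in frequency

# 0/1 square-wave gratings; pointwise multiplication superimposes them
# (a zero "bar" occludes whatever is behind it).
g1 = (np.sin(2 * np.pi * f1 * x) > 0).astype(float)
g2 = (np.sin(2 * np.pi * f2 * x) > 0).astype(float)
both = g1 * g2

# The spectrum of the superposition contains lattice points j*f1 - k*f2; the
# strongest one below 50 cycles is the "phantom" at f2 - f1 = 4.
spectrum = np.abs(np.fft.rfft(both - both.mean()))
beat = int(np.argmax(spectrum[1:50])) + 1
```

Neither grating has any content below 60 cycles on its own; the 4-cycle beat exists only in the product, which is exactly the convolution-in-frequency-space argument above.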

Tsai and Chuang use this exact insight to reverse the moiré effect analytically, with excellent results.

**Open questions:** I think there’s a lot more to do here. In my mind, the holy
grail is a T-shirt you can wear which, *only when photographed* reveals a
secret message. Another neat application would be a wire mesh rolled into a
cylinder such that as you rotated it, the moiré patterns animated into
something cool (perhaps a galloping horse, as a nod to Muybridge). There is
some precedent for this in “barrier-grid” animation (see, e.g. these vinyl
record
covers),
but the benefit of a moiré-based approach would be that the barriers need not
be as thick, allowing for finer detail.

Another direction to go in is audio moiré; that is, could we make two sounds, both above human aural range (20 kHz) such that when played together you hear a perfectly clear, audible voice?

As you might have guessed, I think naïve gradient descent should be able to get us a long way there…

And now, a magic trick. Before you is a 239-by-200 Conway’s Game of Life board:

What happens if we start from this configuration and take a single step of the game? Here is a snapshot of the next generation:

Amazing! It’s a portrait of John Conway! (Squint!)

How does this trick work? (It’s not a hoax — you can try it yourself at copy.sh/life using this RLE file.)

Well, let’s start with how it *doesn’t* work. Reversing Life configurations
exactly — the art of “atavising” — is a hard search
problem,
and doing it at this scale would be computationally infeasible. Imagine
searching through ($ 2^{239\times200} $) possible configurations! Surely that
is hopeless… indeed, the talk I linked shares some clever algorithms that
nonetheless take a full 10 minutes to find a predecessor for a tiny 10-by-10
board.

But it turns out that *approximately* reversing a Life configuration is much
easier — instead of a tricky discrete search problem, we have an easy
*continuous optimization* problem for which we can use our favorite algorithm,
gradient descent.

Here is the idea: Start with a random board configuration ($ b $) and compute ($ b^\prime $), the result after one step of Life. Now, for some target image ($ t $) compute the derivative ($ \frac{\partial}{\partial b} \sum|b^\prime-t| $), which tells us how to change ($ b $) to make ($ b^\prime $) closer to ($ t $). Then take a step of gradient descent, and rinse and repeat!

Okay, okay, I know what you’re thinking: *Life isn’t differentiable!* You’re
right. Life is played on a grid of bools, and there is no way the map ($ b
\mapsto b^\prime $) is continuous, let alone differentiable.

But suppose we could make a “best effort” differentiable analogue? Let us play Life on a grid of real numbers that are 0 for “dead” cells and 1 for “live” cells. Can we “implement” a step of Life using only differentiable operations? Let’s try.

We will look at each cell ($ c $) individually. The first step is to count our live neighbors. Well, if “live” cells are 1 and “dead” cells are 0, then we can simply add up the values of all 8 of our neighbors to get our live neighbor count ($ n $). Indeed, if we wanted to be clever, we could implement this efficiently as a convolution with this kernel:

\[ \begin{bmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix} \]
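In PyTorch, that neighbor count might look like the following sketch (the function name and the toy board are mine, not from the post’s code):

```python
import torch
import torch.nn.functional as F

# the 3x3 neighbor-counting kernel from above
KERNEL = torch.tensor([[1., 1., 1.],
                       [1., 0., 1.],
                       [1., 1., 1.]]).view(1, 1, 3, 3)

def neighbor_counts(board):
    # board: (H, W) tensor of 0.0/1.0 cells; zero padding = dead boundary
    b = board.view(1, 1, *board.shape)
    return F.conv2d(b, KERNEL, padding=1).view(*board.shape)

board = torch.zeros(5, 5)
board[2, 1:4] = 1.0          # a horizontal "blinker"
n = neighbor_counts(board)   # n[1, 2] == 3, n[2, 2] == 2
```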

Good, so we know ($ n $) and of course our own 0/1 value ($ c $). The next step is to figure out whether this cell will be alive in the next generation. Here are the rules:

1. If the cell is alive, ($ c = 1 $), then it stays alive if and only if ($ n = 2 $) or ($ n = 3 $).
2. If the cell is dead, ($ c = 0 $), then it comes to life if and only if ($ n = 3 $).

Let us approach (2) first. What function is 1 when ($ n = 3 $) and 0 otherwise?
The “spike-at-3” function, of course. That’s not differentiable, but a narrow
Gaussian centered at 3 is *almost* the same thing! If scaled appropriately, the
Gaussian is one at input 3, and *almost* zero everywhere else. See this graph:

Similarly, for (1) an appropriately-scaled Gaussian centered at 2.5 gets the
job done. Finally, we can mux these two cases under gate ($ c $) by simple
linear interpolation. Because ($ c \approx 0 $) or ($ c \approx 1 $), we know
that ($ ca + (1-c)b $) is just like writing `if c then a else b`.

That’s it! We now have a differentiable function, which looks like this:

Note that it’s worth “clamping” the output of that function to 0 or 1 using, say, the tanh function. That way cell values always end up close to 0 or 1.
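Putting the pieces together, here is a sketch of one whole differentiable step. The Gaussian widths are my guesses at “narrow,” the survival bump is amplitude-scaled so it equals exactly 1 at ($ n = 2 $) and ($ n = 3 $), and the final squash toward 0/1 is left out:

```python
import math
import torch
import torch.nn.functional as F

KERNEL = torch.tensor([[1., 1., 1.],
                       [1., 0., 1.],
                       [1., 1., 1.]]).view(1, 1, 3, 3)

def soft_life_step(board):
    # board: (H, W) floats near 0 ("dead") or 1 ("live")
    n = F.conv2d(board.view(1, 1, *board.shape), KERNEL, padding=1)
    n = n.view(*board.shape)
    # survival: a bump equal to 1 at n = 2 and n = 3, near 0 elsewhere
    alive = torch.exp(-(n - 2.5) ** 2 / (2 * 0.5 ** 2)) * math.exp(0.5)
    # birth: a narrow spike at n = 3
    born = torch.exp(-(n - 3.0) ** 2 / (2 * 0.35 ** 2))
    # differentiable mux: "if c then alive else born"
    return board * alive + (1 - board) * born

# sanity check: a horizontal blinker should flip to vertical
board = torch.zeros(5, 5)
board[2, 1:4] = 1.0
nxt = (soft_life_step(board) > 0.5).float()
```

Thresholding the soft output recovers the vertical blinker that ordinary Life produces, so on crisp 0/1 boards the soft step agrees with the hard rules.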

Okay, great! Now all that’s left to do is to write this in PyTorch and call .backward()…
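Here is a minimal end-to-end sketch of that loop, with the soft step repeated so the snippet stands alone. The target pattern, learning rate, and step count are all made up, and a real run would also squash ($ b $) toward 0/1:

```python
import math
import torch
import torch.nn.functional as F

K = torch.tensor([[1., 1., 1.], [1., 0., 1.], [1., 1., 1.]]).view(1, 1, 3, 3)

def soft_step(b):
    n = F.conv2d(b.view(1, 1, *b.shape), K, padding=1).view(*b.shape)
    alive = torch.exp(-(n - 2.5) ** 2 / 0.5) * math.exp(0.5)  # survive at n = 2, 3
    born = torch.exp(-(n - 3.0) ** 2 / 0.245)                 # born at n = 3
    return b * alive + (1 - b) * born

t = torch.zeros(32, 32)  # hypothetical target: a live square on a dead field
t[12:20, 12:20] = 1.0
b = torch.rand(32, 32, requires_grad=True)
opt = torch.optim.Adam([b], lr=0.05)

for step in range(200):
    opt.zero_grad()
    loss = (soft_step(b) - t).abs().sum()  # sum |b' - t|
    loss.backward()
    opt.step()
```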

…and it begins to learn, within seconds! Here is a GIF of ($ b $) at every 100 steps of gradient descent.

(Note: in this GIF the target image ($ t $) is an all-white rectangle; my janky PyTorch gradient descent finds sparse configurations that give birth to an “overpopulated” field where nearly 90% of cells are alive.)

Looking at that GIF, the beautiful labyrinthine patterns reminded me of something seemingly unrelated: the skin of a giant pufferfish. Here’s a picture from Wikipedia:

And here’s a zoomed-in 100-by-100 section of the finished product from above:

Patterns like the one on the pufferfish come about as a result of “symmetry-breaking,” when small perturbations disturb an unstable homogeneous state. The ideas were first described by Alan Turing (and thus the patterns are called Turing Patterns). Here’s his 1952 paper.

I can’t help but wonder if there’s a reaction-diffusion-model-esque effect at work here as well, the symmetry being broken by the random initialization of ($ b $). If that’s the case, it would create quite a wonderful connection between cells, cellular automata, Turing-completeness, and Turing patterns…

(Curious? Here is the Python program I used to play these games. Enjoy!)

Suppose you took a walk in a high-dimensional vector space, leaving behind breadcrumbs. If you wanted to visualize your route, you might consider projecting the breadcrumbs down to a lower-dimensional space.

One way to do this is to project the breadcrumbs onto three randomly-chosen vectors. Then you might get something like this.

But perhaps a better way to choose the projection is to do PCA on the breadcrumbs as if they were any other point cloud. If you do this, you will find that your walk was extremely loopy and spirally. Here’s an example of the same trajectory, projected down to 3D with PCA (this data is taken from my research work, but even a random walk will show the phenomena I discuss in this post).

Looking at this, you might be concerned that you are walking in circles.

This, for example, was the situation with a 2016 ICML paper, “Visualizing Deep Network Training Trajectories,” that tried to visualize the trajectory of neural network parameters over the course of training by doing exactly what I described: PCA on “breadcrumb” snapshots of the parameters. The author found some strange, hard-to-explain oscillatory behavior; even Lissajous curves.

A similar thing, by the way, happened with a 1994 text on genetics, “The History and Geography of Human Genes” by Cavalli-Sforza et al., which applied PCA to genetic data and found periodic effects.

It turns out however that there’s something deeper going on here, something independent of the individual datasets: these loopy patterns seem to be “just what happens” when you do PCA on such data. A 2018 NeurIPS paper, “PCA of high dimensional random walks with comparison to neural network training” by Antognini and Sohl-Dickstein, explains what’s going on with the neural networks. A 2008 paper, “Interpreting principal component analyses of spatial population genetic variation” by Novembre and Stephens, explains what’s going on with genetics.

Here’s my understanding of the story, reconstructed in part from this Stack Exchange answer.

Recall how PCA works: it tries to come up with orthogonal axes that eat up as much of the variance of the data as possible. One way to think about this is that it tries to come up with a rotation (“orthogonal transformation”) of the data that de-correlates the variance over each axis. In other words, it finds a way to diagonalize the covariance matrix of the dataset.

Okay, in symbols: if ($ X $) is the matrix whose rows are data points and columns are fields, and if your data is “centered” at the origin, then ($ X^TX $) is the covariance matrix whose entries give the covariance between pairs of fields. If ($ E $) is the transformation, then the goal is to select ($ E $) such that ($ (XE)^T(XE) = E^T(X^TX)E $) is a diagonal matrix; that is, ($ E $) diagonalizes ($ X^TX $). Having done that, our projected path will be the (first few columns of) the matrix ($ XE $).

It turns out that the columns of ($ XE $) are eigenvectors of ($ X X^T $). You can see that by just writing it out: ($ (XE)^T XX^T (XE) = E^T (X^TX)^2 E $). Of course an eigenvector of ($ A $) is also an eigenvector of ($ A^2 $), and therefore ($ E $) diagonalizes ($ (X^TX)^2 $) as desired. So, it suffices to study the eigenvectors of ($ XX^T $) to learn what the projected path will look like.
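Nothing in this algebra is specific to walks, so it can be sanity-checked numerically on a small random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))  # arbitrary small "dataset"
X -= X.mean(axis=0)               # center it

lam, E = np.linalg.eigh(X.T @ X)  # E diagonalizes X^T X
Y = X @ E                         # the projected path

# the columns of XE are eigenvectors of X X^T: (X X^T)(XE) = (XE) diag(lam)
assert np.allclose((X @ X.T) @ Y, Y * lam)

# equivalently, (XE)^T (X X^T) (XE) = E^T (X^T X)^2 E is diagonal
D = Y.T @ (X @ X.T) @ Y
assert np.allclose(D, np.diag(np.diag(D)))
```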

Now let’s look carefully at ($ X X^T$), which is the “covariance” for each step
rather than for each field. The claim is that this matrix is more-or-less
*Toeplitz*, which means each row is the previous row but shifted by one spot to
the right. The reason for this is that adjacent data points are spatially
nearby and should have similar correlations. One way to think about this is
that the covariances are just dot-products of pairs of points, and the dot
product of points 1 and 2 should be about the same as the dot product of points
2 and 3 if your steps are approximately evenly-spaced.

Indeed this seems to be the case for the loopy path I showed above:

If the matrix is truly approximately *Toeplitz*, then multiplying by it is a
convolution by the first row (just look at its structure!). But as we were all
taught in signal-processing class, convolution is the same as pointwise
multiplication in the Fourier domain. Okay, so what are the “eigenvectors” of
pointwise multiplication? One-hot vectors, of course! (Unless your convolution
happens to have two frequencies with exactly the same amplitude, in which case
you can get some mixing.) Finally, taking the inverse Fourier transform of
these one-hot vectors, we recover sinusoidal eigenvectors. Here’s a plot of the
top 3 eigenvectors for my running example:

To recap: beware of periodic structures in PCA projections! PCA on high-dimensional trajectories gives loopy projections because their covariance matrices tend to be Toeplitz, yielding sinusoidal eigenvectors. Or, high-dimensional spaghetti casts spirali-shaped shadows!
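As a quick sanity check of the “Toeplitz means sinusoidal eigenvectors” claim, here is the circulant (wraparound) idealization of a Toeplitz matrix, where the Fourier argument is exact; the exponential-decay profile is an assumed stand-in for a real step-covariance:

```python
import numpy as np

n, rho = 200, 0.95  # rho is an assumed decay rate, not fit to anything
d = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
C = rho ** np.minimum(d, n - d)  # entry (i, j) depends only on circular distance

lam, vecs = np.linalg.eigh(C)  # eigenvalues in ascending order
v = vecs[:, -2]                # a leading non-constant eigenvector

# a sinusoid concentrates its energy at a single Fourier frequency
power = np.abs(np.fft.rfft(v)) ** 2
print(power.argmax(), power.max() / power.sum())  # bin 1 gets essentially all the energy
```

A genuine (non-wraparound) Toeplitz matrix gives only approximately sinusoidal eigenvectors, which is all the argument needs.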

Here’s a bit of Python code to explore this phenomenon further on a random walk of 200 steps in 1024-dimensional space.

```
import numpy as np
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the '3d' projection)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# build a path and center it
position = np.zeros(1024)
X = []
for i in range(200):
    position = position + np.random.randn(1024)
    X.append(position)
X = np.array(X)
scaler = StandardScaler(copy=False)
scaler.fit(X)
X = scaler.transform(X)

# plot the PCA projection on some components
p = PCA(n_components=5)
p.fit(X)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(
    p.transform(X)[:, 0],
    p.transform(X)[:, 3],
    p.transform(X)[:, 4],
)

# plot the (approximately Toeplitz) step-covariance matrix
plt.figure()
plt.imshow(np.dot(X, X.transpose()))

# plot the top eigenvectors of the step-covariance matrix
# (eigh returns eigenvalues in ascending order, so the top ones come last)
plt.figure()
plt.plot(np.linalg.eigh(np.cov(X))[1][:, -4:])
plt.show()
```

It’s Christmastime, shall we talk about trees?

I was reading about Knuth’s Fantasia
Apocalypta, and his
description of the “style” he wrote it in contains the wonderfully Knuthian
line “And of course a musical work on the Apocalypse should also contain
calypso.” Naturally I had to look up whether or not this was a linguistic
coincidence, and — in an odd turn of events, since these things usually tend
to work out — it turns out that *apocalypse* and *calypso* are *not*
etymologically related. *Apocalypse* comes from the Greek root *kaluptein,*
“to cover,” and so the *apocalypse* is the “uncovering” or “revelation” (the
root *kel* is related to *hell*). The word *calypso* is of unknown origin, but
from the West Indies (this is distinct from *Calypso,* the Greek mythological
nymph, whose name is indeed related to “covered,” or “concealing”).

I shared this with a good friend of mine and she immediately pointed out that
we should check out the word *eucalyptus*. She was right: *eucalyptus* means
“well-covered,” referring (not ironically to the bark as I originally thought
but rather) to the calyx, which forms a lid over flowers when in
bud.

That got me thinking about eucalyptuses…

Lately I’ve spent a lot of time around the eucalyptuses in Toyon and Arboretum
Groves at Stanford and I’ve
noticed that the peeling bark of what I’m 80% sure is the *Eucalyptus
globulus* (Tasmanian Blue Gum)
seems to climb up the tree in a whirling, helical pattern. I didn’t have the
presence of mind to take a picture at the time but here’s a picture from
Wikimedia
Commons
taken in Maui.

Really it’s just beautiful to me, the whorl reaching aspirationally towards the
sky. I’m reminded of Correggio’s *Ganymede Abducted by the Eagle*, the way the
dog and the tree stump look up together, moving, turning even while frozen in
the painting.

This much would be believable, just another quirk of nature. But here’s the
truly odd thing: most of the trees seem to grow according to a thumbs-up
right-hand rule; if you were to try and screw them into the ground you would
turn them righty-tighty. *So eucalyptuses have some kind of chirality?* I
rubbed my eyes, I thought I was dreaming.

But indeed, some hours of study later, I learned that the “spiral grain” is a real thing in the study of wood formation.

In the Real World of economics spiral grain is considered a defect because it
reduces the structural quality of wood. But let us set that aside for a moment
and marvel at the phenomenon itself. I turn to Dr. John Maddern Harris’
wonderfully comprehensive 1989 text *Spiral Grain and Wave Phenomena in Wood
Formation* (thanks Chandler for finding me a copy!) for details. Harris, citing
confusion about the convention in prior work, carefully defines LH (left-handed
“sinistral”) and RH (right-handed “dextral”) spirals (I wonder if he was aware
of the right-hand-rule convention from E&M, which would simplify matters here).
Perhaps more entertainingly Harris introduces the German words as well:

Numerous other expressions have been used to describe spiral direction, reflecting no doubt the notorious difficulty of so doing without waving of arms and twisting of wrists. In the German literature the terms “sonnig” and “widersonnig” are frequently encountered. Their meaning is “with the sun” and “against the sun”, and hence (in the Northern Hemisphere!) they relate to LH and RH spiral grain respectively.

Here’s a picture of a eucalyptus from Harris’ book.

Certain species do indeed show predilections for certain handednesses.
Furthermore, certain species *change* handedness over the course of their
lives. In Harald Säll’s 2002 doctoral thesis *Spiral Grain in Norway Spruce* I
learned the remarkable fact that the Norway Spruce begins life with a left-handed
spiral grain, but over the decades transitions to a right-handed spiral grain.
The drawing below is from his thesis:

We don’t fully understand why trees spiral, much less how they pick a direction. The most fanciful explanation I’ve come across — the one I would like to believe — is that there is some kind of Coriolis effect at work. But I will admit that both handednesses have been observed in both hemispheres, which weakens that theory.

At the moment I’m on a quest to find a (translated, ideally) copy of Meyer’s
1949 article *Sprachliche und literarische Bemerkungen zum Problem “Drehwuchs”*
published in *Mitt Schweiz Centralanst Forst Versuchswes* (26:331-347) to learn
more about the linguistic history of chirality in relation to spiral grains.
Until then…

**Open question:** What’s going on here, why are eucalyptuses right-handed?

**Further reading:** Noah Rosenberg’s
Xtrees site about Stanford
trees — I found this while reading about eucalyptuses!

Edited to add (2/19/2020):

“DNA normally forms a right-handed spiral (although a rare left-handed variant can occur). In other words, it twists like a conventional screw. That, though, has not stopped it being reproduced wrongly in hundreds of places. Dr. Tom Schneider has a website where he has collected hundreds of examples of incorrectly drawn ‘left-handed DNA’, most being found in scientific journals. Many are in advertisements, so we may perhaps charitably suggest the final copy was never seen by a scientist, but that doesn’t quite explain them all. Certainly not the editorial comment in Nature, the place where DNA’s structure was first described, which in 2000 mentioned the clues that ‘led Watson and Crick to deduce the left-handed double helical structure of DNA.’ Watson, in fact, has been particularly badly served, his 1978 textbook, Molecular Biology of the Gene, having six different illustrations with left-handed DNA, and in 1990 the American journal Science quoting Watson as saying, ‘I have to read SCIENCE every week,’ this being illustrated with left-handed DNA. Perhaps worst of all, a 1998 reprint of Watson’s The Double Helix was illustrated on the front and back with left-handed DNA. Perhaps it is not a coincidence that Watson is left-handed.”

— Chris McManus, in Right Hand, Left Hand: The Origins of Asymmetry in Brains, Bodies, Atoms and Cultures. 2002.

The last time I spent a long time on I-5, I started thinking about envelopes. This time, I started thinking about orchards. As you drive along the Central Valley highway, you can see thousands and thousands of perfect rows of trees. You can actually see them from the top on Google’s satellite imagery; try zooming in slowly on that point. It’s really quite beautiful.

When you drive by them, you see the side view rather than the top view. In
fact, you see the side view *in motion* – the trees look like they’re dancing
an elaborate dance, continuously lining up into different formations to form
different rows in different angles. Really, of course, it is you who is moving;
the trees are still but by viewing them from different angles you can see
different “alignments” of the grid.

The effect is most pronounced with freshly-planted trees (or perhaps those are markers?), where there is no foliage blocking the horizon.

I didn’t think to make a video at the time, but you can imagine that as the car moves forwards a little bit, the pattern re-aligns in an incredibly satisfying way.

I’m intrigued by the multiple “vanishing points” you can see in that picture. The trees are regularly spaced in a square grid, but somehow there are multiple (!) irregularly-spaced (!!) points along the horizon to which rows of trees appear to converge. What’s going on here — is there a pattern to the patterns? You could either stop reading here to have food for thought on your next road trip, or you could keep reading for my ideas.

The I-5 trees are a variation on a well-known theme: a theme called *Euclid’s
orchard* (Wikipedia,
MathWorld). You stand at
the origin of the plane, and at every integer point there is a line segment
(“tree”) extending one unit upwards perpendicular to the plane. What you see
from your vantage point is Euclid’s orchard. Try to imagine what it looks like;
then, play with a JavaScript simulation of it
here.

The main idea of Euclid’s orchard is that trees at integer points (x, y) occlude trees at any integer multiples of that point (nx, ny), so you don’t see anything behind them. Also, nearby trees appear taller than farther-away trees because that’s how perspective works. So there is a “horizon” where in each direction you either see (1) a tree if the direction’s slope is rational, and the tree is shorter if its denominator is larger; (2) nothing if the direction’s slope is irrational. All sorts of fun can be had with this: see for example a nice connection to the Fibonacci numbers.
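The occlusion rule is easy to state in code: the tree at ($ (x, y) $) is visible exactly when ($ \gcd(x, y) = 1 $), since otherwise it hides behind the tree at ($ (x/g, y/g) $). A quick count recovers the classical density of visible trees, ($ 6/\pi^2 $):

```python
from math import gcd, pi

N = 1000
visible = sum(gcd(x, y) == 1
              for x in range(1, N + 1)
              for y in range(1, N + 1))
print(visible / N**2, 6 / pi**2)  # the two agree to about three decimal places
```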

The I-5 orchard in the picture is similar, but now we look upon it from some height. Imagine mapping each point on the horizon to the direction in which I would have to point my camera to look directly at that point. The “vanishing points” have lots of trees along those directions, giving the illusion of “rows.” What does this mean? It means that if you walked along that ray in Euclid’s orchard, you would run into trees more frequently than other rays. In other words, those rays represent directions whose slopes are rational with small denominators.

Hmm… rational with small denominators… where have I blogged about that before…? Right — around four years ago, when we showed that 2^12864326 begins with the digits “10000000…” (and other number-theoretic shenanigans). I wasn’t kidding, rational approximation really is one of my favorite topics!

At the time I showed you this picture with little explanation.

Here is a little more explanation of its significance: as you start filling in circles to some level, you generate a sequence of fractions called the “Farey sequence” by listing the points of tangency of the circles and the X axis. The Farey sequence is the list of completely reduced fractions whose denominators are smaller than some bound. Putting those two facts together: the points of tangency of the circles represent points that are “rational with small denominators” — exactly what we thought our I-5 orchard’s “vanishing points” were!
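If you want to play with these small-denominator directions yourself, the Farey sequence has a neat next-term recurrence (this is the textbook algorithm, nothing specific to orchards):

```python
from fractions import Fraction

def farey(n):
    """Ascending list of reduced fractions in [0, 1] with denominator <= n."""
    # next-term recurrence: from adjacent terms a/b < c/d, the term after
    # c/d is (k*c - a)/(k*d - b) where k = (n + b) // d
    a, b, c, d = 0, 1, 1, n
    seq = [Fraction(a, b)]
    while c <= n:
        k = (n + b) // d
        a, b, c, d = c, d, k * c - a, k * d - b
        seq.append(Fraction(a, b))
    return seq

print([str(f) for f in farey(5)])
# ['0', '1/5', '1/4', '1/3', '2/5', '1/2', '3/5', '2/3', '3/4', '4/5', '1']
```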

Looking back at the image, this is somewhat believable. We see a couple of
“strong” vanishing points, and then “weaker” vanishing points in between the
strong ones, and then *even* “weaker” vanishing points between the strong and
the weak vanishing points, and so on.

There are a number of cool implications of this. For example, the “least patterned” regions of the horizon would correspond to regions that are “hard to rationally approximate”; so, you’d expect them at plus-or-minus the golden ratio’s slope from your line of sight.

Say what you will about *The Big Bang Theory*, it has its moments, and this is
one of my favorites (watch it to the end of the clip!):

Putting aside the wonderful meta-dramatic irony of this moment, the word “un-unravelable” is the kind of thing you can sit and think about for hours. I know because I have.

See, you might think that the “un-“ prefixes cancel, such that “un-unravelable” means the same thing as “ravelable.” This is kind of like how “un-unforgettable” probably means “ordinary.” And indeed, Wiktionary agrees with this assessment, suggesting instead “ravelable” as the “simpler and more immediately logical choice.”

But clearly that’s not what Dr. Cooper meant in this case! “Ravelable” means
*can be raveled* whereas Dr. Cooper meant *cannot be unraveled*, which are at
least intuitively separate concepts. So what’s going on?

First of all, we need to figure out what “ravel” means: it
means to entangle. Something that
is “raveled” is entangled and knotted like the HDMI cords behind your cable
box. The sleep that Macbeth murders is the “sleep that knits up the raveled
sleave of care” (2.2.37). Actually, that’s not the full story. “Ravel” *also*
means to *disentangle*, making the word a so-called “Janus word” (read more at
Merriam-Webster).
But let’s stick with the former definition for the moment.

Okay, so now we can start tacking on prefixes and suffixes. But we have to be careful! The dynamics of affixes are tricky — the order in which we tack them on matters.

For example, “ravel-able” would mean *can be entangled,* so spaghetti is
ravelable. Then “un-ravelable” could mean *cannot be entangled,* so ravioli is
unravelable.

On the other hand, “un-ravel” would mean *to disentangle,* so Sherlock unravels
mysteries. Then “unravel-able” could mean *disentanglable,* so the mysteries
Sherlock solves are unravelable (by him, at least).

To summarize:

- un+(ravel+able): there is *no* way to ravel it
- (un+ravel)+able: there *is* a way to unravel it

Alas! The associative property has broken down! This means we need to examine each of the five possible parenthesisifications of “un+un+ravel+able” separately (why five?).
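As for “why five?”: four morphemes can be combined two at a time in Catalan(3) = 5 ways, which a short enumeration confirms:

```python
def parses(tokens):
    # all binary "parse trees" over a sequence of morphemes
    if len(tokens) == 1:
        return [tokens[0]]
    out = []
    for i in range(1, len(tokens)):  # choose a split point
        for left in parses(tokens[:i]):
            for right in parses(tokens[i:]):
                out.append((left, right))
    return out

print(len(parses(["un", "un", "ravel", "able"])))  # 5
```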

Here they are:

1. **(un+un)+(ravel+able):** doesn’t typecheck, because “un+un” is meaningless.
2. **((un+un)+ravel)+able:** likewise meaningless, for the same reason.
3. **un+(un+(ravel+able)):** “ravelable” means *can be entangled,* so “un-ravelable” means *cannot be entangled,* and thus “un-unravelable” means *is not such that it cannot be entangled,* i.e. *can be entangled.* For example, your new favorite organizational scheme might seem perfect at first, but after a few weeks you will find that no scheme can fend off the entropic tendencies of the universe, and that your files are by nature un-un-ravelable.
4. **(un+(un+ravel))+able:** “unravel” means *disentangle,* so “un-unravel” means *re-entangle* (by analogy to “un-undo” meaning “redo”) and thus “ununravel-able” means *re-entanglable.* For example, a jigsaw puzzle that you mix up again after solving is ununravel-able.
5. **un+((un+ravel)+able):** “unravel” means *disentangle,* so “unravel-able” means *can be disentangled,* and thus “un-unravelable” means *cannot be disentangled.* For example, Dr. Cooper’s web of lies is un-unravelable.

So what’s really going on here is that the prefix *un-* has many faces (a Janus
prefix?); it reverses verbs and negates adjectives. Perhaps this is clearer in
the language of first-order logic, where *-able* introduces an existential
quantifier over *a way to do the thing* (and thus do(w, x) does w to x).

1. N/A
2. N/A
3. **(un+(un+(ravel+able)))(x)** means ¬¬∃w. raveled(do(w, x))
4. **((un+(un+ravel))+able)(x)** means ∃w. ¬¬raveled(do(w, x))
5. **(un+((un+ravel)+able))(x)** means ¬∃w. ¬raveled(do(w, x))

Only (5) puts an odd number of negations outside the existential quantifier,
flipping it to a universal quantifier, and expressing that you *cannot* unravel
whatever needs unraveling.

Why doesn’t this effect happen with “un-unforgettable”? It’s because “unforget” isn’t a common word, and so we reach for interpretation (3) rather than (5). Though now that I think about it, “unforget” is an excellent replacement for “remember.”

The other day I ordered a book on art history from Amazon… which reminded me
of the movie *Angels and Demons*, which reminded me of ambigrams. Ambigrams are
those stylized words that look the same when you turn them upside down. Here is
the classic, by John
Langdon.

I’ve been thinking a lot about the absurd power of gradient descent these days, and so almost immediately I felt the need to try and synthesize ambigrams on-demand by gradient descent.

It turns out that this isn’t too hard to implement. Let’s start with just
single-character ambigrams. Here’s the plan: first, we train some sort of
neural net to recognize characters. This training is by gradient descent on the
weights / convolution filter / whatnot; specifically, we imagine a function of
the weights that outputs some loss and take its derivative *with respect to the
weights* to do gradient descent. Once the neural net is trained, however, we
can just treat it as a single function from tensors (input image) to tensors (a
one-hot encoding of the classification… or some softmaxed approximation).
Taking that one-hot output and converting it into a cross-entropy loss against
some desired output, we can now do gradient descent to minimize this loss *with
respect to the input image* (keeping the weights constant). This process lets
us “dream” of images representing whichever characters we want.

Similarly, we can do the same thing but with the tensor turned upside-down, to dream images of upside-down characters. Okay, so now we can just combine these two processes — literally, by adding the loss functions — to dream an image that matches one character right-side-up and another character upside-down. That’s it! Now we let it run all day to generate an image for each alphanumeric character pair (36^2 = 1296) and we start to get a nice little “ambigram font”:

Combining letters from this “font” lets us make arbitrary ambigrams.

This was a low-effort operation, hacked together over an evening and a morning, plus a full day of “training.” I’m sure it’s possible to do better. Here are some technical notes for anyone who wants to have a go at it. My code is all online, here.

- It’s all written in PyTorch. The neural net itself is literally PyTorch’s built-in MNIST example.
- I used the EMNIST dataset, which is MNIST that also adds in the alphabets from the NIST database (and comes with PyTorch). I used the “balanced” split.
- I added an L1 regularizer to force the output images to be sparse, i.e. mostly “blank paper” with some “ink.” Without this, you get very dark images.
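To make the core loop concrete, here is a minimal sketch of the “dreaming” step. Everything specific in it is an assumption: a frozen, randomly-initialized linear classifier stands in for the trained EMNIST net, and the class indices and hyperparameters are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# hypothetical stand-in for the trained classifier (28x28 image -> 36 classes)
net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 36))
for p in net.parameters():
    p.requires_grad_(False)  # freeze the weights; we optimize the image instead

img = torch.zeros(1, 1, 28, 28, requires_grad=True)
target_up = torch.tensor([10])    # made-up class index for the upright letter
target_down = torch.tensor([11])  # made-up class index for the flipped letter
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam([img], lr=0.1)

for step in range(100):
    opt.zero_grad()
    flipped = torch.flip(img, dims=[2, 3])  # rotate the image 180 degrees
    loss = (loss_fn(net(img), target_up)
            + loss_fn(net(flipped), target_down)
            + 0.01 * img.abs().sum())       # L1 sparsity, as in the notes above
    loss.backward()
    opt.step()
```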

Okay, now time for some pictures. Some turned out better than others, and there is plenty of scope for improvement…

P.S. Depending on your browser, images in this post may or may not flip upside-down automatically if you hover over them.

Update (3/26/20): Alex Pietrow created a website that lets you use this font to build your own ambigrams (along with more ambigram fun).
