A Fishy Function

The Hitchhiker’s Guide to Patience, or, How To Use Earthquakes To Calculate Compound Interest

Sunday, July 16, 2017 · 17 min read

I started thinking about these ideas in late May, but didn’t get a chance to write about them until now…

If you want to take a boat from Puget Sound to Lake Washington, you need to pass through the Ballard Locks, which separate the Pacific Ocean’s saltwater from the freshwater lakes. The locks are an artificial barrier, built in the early 1900s to facilitate shipping.

Today, the locks have a secondary function: they are a picnic spot. A while back, I visited the locks on a sunny and warm day. A band was playing in the park by the water, and there were booths with lemonade and carnival games. Every few minutes, a boat would enter the locks, be raised or lowered, and continue on its way.

If you walk across the locks, you can check out the fish ladder, a series of raised steps designed to help fish — in this case, salmon — migrate, since the locks cut off their natural path between the water bodies. There is usually a crowd around the fish ladder. Around once a minute, a salmon leaps out of the water and goes up a step; the children gasp and cheer as they watch over the railing.

This is the idyllic scene that we will soon destroy with the heavy hammer of mathematical statistics. You see, it turns out that a little bit of thought about these salmon can give us a way to use historical earthquake data to approximate $ e $.

But I’m getting ahead of myself. Let’s start at the beginning.


What is the probability that a fish jumps out of the water right now? This is a tricky question to answer. Suppose there’s a 10% chance that a fish jumps out of the water right now. That means the probability that a fish doesn’t jump is 90%. In the next instant of time, there’s again a 10% chance that the fish jumps. So, the laws of probability tell us that over the course of $ n $ instants, there’s a $ 0.90^n $ probability that no fish-jumps occur.

But there’s an infinite number of instants in every second! Time is continuous: you can subdivide it as much as you want. So the probability that no fish-jumps occur in a one-second period is $ 0.90^\infty $, which is… zero! Following this reasoning, a fish must jump at least every second. And this is clearly a lie: empirically, the average time between fish-jumps is closer to a minute.
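(If you want to watch that collapse happen numerically, here’s a tiny Python sketch. There’s nothing fish-specific about it; it’s just exponentiation.)

```python
# The paradox, numerically: keep a fixed 10% chance per "instant" while
# slicing time into more and more instants, and the chance of seeing no
# jumps at all collapses to zero.
for n in (1, 10, 100, 1000):
    print(n, "instants -> P(no jumps) =", 0.9 ** n)
```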

Okay, so “probability that a fish jumps right now” is a slippery thing to define. What can we do instead? Since the problem seems to be the “right now” part of the definition, let’s try to specify a time interval instead of an instant. For example, what is the probability that we will observe $ n $ fish-jumps in the next $ t $ seconds?

Well, we’re going to need some assumptions. For simplicity, I’m going to assume from now on that fish jump independently, that is, if one fish jumps, then it does not affect the behavior of any other fish. I don’t know enough about piscine psychology to know whether or not this is a valid assumption, but it doesn’t sound too far-fetched.


While we’re on the subject of far-fetchedness: the math that follows is going to involve a lot of handwaving and flying-by-the-seat-of-your-pants. We’re going to guess at functions, introduce constants whenever we feel like it, evaluate things that may or may not converge, and, throwing caution and continuity to the wind, take derivatives of things that might be better left underived.

I think it’s more fun this way.

Yes, we could take the time to formalize the ideas with lots of definitions and theorems and whatnot. There’s a lot to be said about mathematical rigor, and it’s really important for you, the reader, to be extremely skeptical of anything I say. In fact, I encourage you to look for mistakes: the reasoning I’m about to show you is entirely my own, and probably has some bugs here and there. (The conclusions, for the record, match what various textbooks say; they just derive them in a slightly different way.)

A couple of lemmas here and there might make the arguments here much more convincing. But they will also make this post tedious and uninspiring, and I don’t want to go down that road. If you’re curious, you can look up the gnarly details in a book. Until then, well, we’ve got bigger fish to fry!


Okay, back to math. We can model the probability we’re talking about with a function that takes $ n $ and $ t $ as inputs and tells you the probability, $ P(n, t) $, that you see $ n $ fish-jumps in the time period $ t $. What are some things we know about $ P $?

Well, for starters, $ P(n, 0) = 0 $ for any $ n \geq 1 $, since in no time, there’s no way anything can happen. (The only thing that can happen in zero seconds is zero fish-jumps.)

What about $ P(n, a + b) $? That’s the probability that there are $ n $ fish-jumps in $ a + b $ seconds. We can decompose this based on how many of the fish-jumps occurred in the “$ a $” and “$ b $” periods:

\begin{align} P(n, a+b) & = P(0, a)P(n, b) \\ & + P(1, a)P(n-1, b) \\ & + \ldots \\ & + P(n, a)P(0, b) \end{align}

Hmm. This looks familiar… perhaps…

Yes! Isn’t this what you do to the coefficients of polynomials when you multiply them? The coefficient of $ x^n $ in $ a(x)b(x) $ is a similar product, in terms of the coefficients of $ x^i $ and $ x^{n-i} $ in $ a(x) $ and $ b(x) $, respectively.
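If you want to see that correspondence concretely, here’s a small numpy sketch with made-up probability vectors; the specific numbers are arbitrary, only the pattern matters.

```python
import numpy as np

# Two made-up "probability polynomials": coefficient i is the probability of
# i fish-jumps in some interval (the numbers are arbitrary; they just sum to 1).
p_a = np.array([0.5, 0.3, 0.2])   # P(0, a), P(1, a), P(2, a)
p_b = np.array([0.6, 0.3, 0.1])   # P(0, b), P(1, b), P(2, b)

# Multiplying the polynomials convolves their coefficients...
product = np.convolve(p_a, p_b)

# ...which is exactly the decomposition P(n, a+b) = sum_i P(i, a) * P(n-i, b).
n = 2
by_hand = sum(p_a[i] * p_b[n - i] for i in range(n + 1))
print(product[n], by_hand)   # both 0.26
```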

This can’t be a coincidence. In fact, it feels appropriate to break out this gif again:

[GIF: Benedict Cumberbatch being awesome]

Let’s try to force things into polynomial form and see what happens. Let $ p_t(x) $ be a polynomial where the coefficient of $ x^n $ is the probability that $ n $ fish-jumps occur in time $ t $:

\begin{align} p_t(x) &= P(0, t)x^0 + P(1, t)x^1 + \ldots \\ &= \sum_{n=0}^\infty P(n, t)x^n \end{align}

(Yes, fine, since $ n $ can be arbitrarily large, it’s technically a “power series”, which is just an infinitely long polynomial. Even more technically, it’s a generating function.)

We know that $ p_0(x) = 1 $, because nothing happens in no time, i.e. the probability of zero fish-jumps is “1” and the probability of any other number of fish-jumps is “0”. So $ p_0(x) = 1x^0 + 0x^1 + \ldots $, which is equal to just “1”.

What else do we know? It should make sense that $ p_t(1) = 1 $, since if you plug in “1”, you just add up the coefficients of each term of the polynomial. Since the coefficients are the probabilities, they have to add up to “1” as well.

Now, taking a leap of faith, let’s say that $ p_{a+b}(x) = p_a(x)p_b(x) $, because when the coefficients multiply, they work the same way as when we decomposed the probabilities above.

Why is this property interesting? We’re turning a property about addition into a property about multiplication. That sounds awfully like something else we’re used to: exponentials (or logarithms, run in reverse)! Forgetting for a moment that $ p $ is a power series, maybe we can “solve” for the function $ p_t(x) $ by messing around with something like this:

\[ p_t(x) = e^{tx} \]

Okay, $ e^{tx} $ doesn’t quite work because we want $ p_t(1) = 1 $. Maybe $ e^{t(x-1)} $ will work? It seems to have all the properties we want…


Let’s take a moment to stop and think. At this point, it’s not even clear what we’re doing. The whole point of defining $ p_t(x) $ was to look at the coefficients, but when we “simplify” it into $ e^{t(x-1)} $ we no longer have a power series.

Or do we?

Recall from calculus class that you can expand out some functions using their Taylor Series approximation, which is a power series. In particular, you can show using some Fancy Math that

\begin{align} e^x &= \frac{x^0}{0!} + \frac{x^1}{1!} + \frac{x^2}{2!} + \ldots \\ &= \sum_{n=0}^\infty \frac{x^n}{n!} \end{align}

If you haven’t taken calculus class yet, I promise this isn’t black magic. It’s not even plain magic. It’s just a result of a clever observation about what happens to $ e^x $ when you increase $ x $ by a little bit.

If you have taken calculus, bet you didn’t think this “series approximation” stuff would ever be useful! But it is, because a quick transformation gives us the series representation for $ p_t(x) $:

\[ e^{t(x-1)} = e^{tx}/e^t = \sum_{n=0}^\infty \frac{(tx)^n}{n!e^t} \]

and so the coefficient of $ x^n $ gives us $ P(n, t) = t^n/(e^t n!) $.
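Here’s a quick sanity check, in Python, that these coefficients at least behave like probabilities (the infinite sum is truncated at 50 terms, which is plenty for small $ t $):

```python
from math import exp, factorial

# Sanity check: the coefficients P(n, t) = t**n / (e**t * n!) should sum to 1,
# since they are supposed to be probabilities (this is just p_t(1) = 1).
t = 3.0
print(sum(t ** n / (exp(t) * factorial(n)) for n in range(50)))   # ~1.0
```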


Now we have a new problem: this formula doesn’t depend at all on the type of events we’re observing. In particular, the formula doesn’t “know” that the salmon at Lake Washington jump around once a minute. We never told it! Fish at other lakes might jump more or less frequently, but the formula gives the same results. So the formula must be wrong. Sad.

But it might be salvageable! Let’s go back and see if we can add a new constant to represent the lake we’re in. Perhaps we can call it $ \lambda $, the Greek letter “L” for lake. Where could we slip this constant in?

Our solution for $ p_t(x) $ was:

\[ p_t(x) = e^{t(x-1)} \]

but in retrospect, the base $ e $ was pretty arbitrarily chosen. We could make the base $ \lambda $ instead of $ e $, but that would mess up the Taylor Series, which is only that tidy in base $ e $. That would be inconvenient.

However, we know that we can “turn” $ e $ into any number by raising it to a power, since $ e^{\log b} = b $. If we want base $ b $, we can replace $ e $ with $ e^{\log b} $. This suggests that $ \lambda = \log b $ could work, making our equation:

\[ p_t(x) = \left(e^\lambda\right)^{t(x-1)} = e^{(\lambda t) (x-1)} \]

This seems to fit the properties we wanted above (you can check them if you want). Going back to our Taylor Series expansion, we can just replace $ t $ with $ \lambda t $ to get:

\[ P(n, t) = \frac{\left(\lambda t\right)^n}{e^{\lambda t} n!} \]


Let’s step back and think about what we’re claiming. Knowing only that fish jump randomly, and roughly independently, we claim to have an expression for the probability that $ n $ fish-jumps occur in a time interval $ t $.

“Okay, hold up,” you say, “something smells fishy about this. This is pretty bold: we know nothing about how fish think, or fluid dynamics, or whatever other factors could influence a fish’s decision to jump. And yet we have this scary-looking expression with $ e $ and a factorial in there!”

That’s a fair point. I’m just as skeptical as you are. It would be good to back up these claims with some data. Sadly, I didn’t spend my time in Seattle recording fish-jumping times. But, in a few more sections, I promise there will be some empirical evidence to assuage your worries. Until then, let’s press on, and see what else we can say about fish.
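In the meantime, here’s a quick simulation you could run instead of staking out the fish ladder. It assumes only what we assumed above, plus two things we’ll make precise later in the post: that $ \lambda $ turns out to be the jump rate (about one per minute here), and that in a tiny slice of time $ dt $ the chance of a jump is $ \lambda \, dt $. All of the numbers below are made up for illustration.

```python
import random
from math import exp, factorial

# A simulated lake, standing in for the data I didn't collect. Chop time into
# tiny steps of dt seconds, and in each step let a jump happen independently
# with probability lam * dt. Count the jumps in each 30-second window and
# compare against the claimed formula. (lam, dt, and the trial count are all
# made up for illustration.)
lam, t, dt = 1 / 60, 30.0, 0.05
trials = 10_000
steps = int(t / dt)

counts = [sum(random.random() < lam * dt for _ in range(steps))
          for _ in range(trials)]

for n in range(4):
    empirical = counts.count(n) / trials
    theory = (lam * t) ** n / (exp(lam * t) * factorial(n))
    print(n, round(empirical, 4), round(theory, 4))
```

The two columns should agree to within sampling noise, which is about as much reassurance as we can get without actual fish.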


We have a way to get the probability of some number of fish-jumps in some amount of time. What’s next?

One thing we can do is compute the average number of fish-jumps in that time interval, using expected value. Recall that to find an expected value, you multiply each value by its probability and add up the results. In this case, we want to find:

\[ E_t[n] = \sum_{n=0}^\infty P(n, t)n \]

This looks hard… but also oddly familiar. Remember that

\[ p_t(x) = \sum_{n=0}^\infty P(n, t)x^n \]

because, y’know, that’s how we defined it. Using some more Fancy Math (“taking the derivative”), this means that

\[ \frac{dp_t(x)}{dx} = \sum_{n=0}^\infty P(n, t)nx^{n-1} \]

and so $ E_t[n] = p^\prime_t(1) $.

That… still looks hard. Derivatives of infinite sums are no fun. But remember from the last section that we also have a finite way to represent $ p_t(x) $: what happens if we take its derivative?

\begin{align} p_t(x) &= e^{(\lambda t) (x-1)} \\ p^\prime_t(x) &= (\lambda t)e^{(\lambda t) (x-1)} \\ p^\prime_t(1) &= E_t[n] = \lambda t \end{align}

Aha! The average number of fish-jumps in time $ t $ is $ \lambda t $. If $ t $ has units of time and $ \lambda t $ has units of fish-jumps, this means that $ \lambda $ has units of fish-jumps-per-time. In other words, $ \lambda $ is just the rate of fish-jumps in that particular lake! For Lake Washington, $ \lambda_w = 1/60 \approx 0.0167 $ fish-jumps-per-second, which means that the probability of seeing two fish-jumps in the next thirty seconds is:

\[ P(2, 30) = \frac{(0.0167\times30)^2}{e^{0.0167\times30}2!} \approx 0.076 \]

I think that’s pretty neat.
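If you’d like to check that with a computer, the same calculation in Python (using the exact $ \lambda_w = 1/60 $ rather than the rounded 0.0167) looks like this:

```python
from math import exp, factorial

# The worked example above, using the exact rate lam = 1/60 instead of 0.0167.
lam, t, n = 1 / 60, 30, 2
print((lam * t) ** n / (exp(lam * t) * factorial(n)))   # ~0.0758
```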


What about the standard deviation of the number of fish-jumps? That sounds ambitious. But things have been working out pretty well so far, so let’s go for it.

Standard deviation, or $ \sigma $, the Greek letter “sigma”, is a measure of “how far, on average, are we from the mean?” and as such seems easy to define:

\[ \sigma = E[n-\lambda t] \]

Well, this isn’t hard to evaluate. Knowing that expected values add up, we can do some quick math:

\begin{align} \sigma &= E[n] - E[\lambda t] \\ &= \lambda t - \lambda t = 0 \end{align}

Oops. We’re definitely off by a little bit on average, so there’s no way that the standard deviation is 0. What went wrong?

Well, $ n - \lambda t $ is negative if $ n $ is lower than expected! When you add the negative values to the positive ones, they cancel out.

This is annoying. But there’s an easy way to turn negative numbers positive: we can square them. Let’s try that.

\begin{align} \sigma^2 &= E[(n-\lambda t)^2] \\ &= E[n^2 - 2n\lambda t + (\lambda t)^2] \end{align}

Now what? We don’t know anything about how $ E[n^2] $ behaves.

Let’s go back to how we figured out $ E[n] $ for inspiration. The big idea was that

\[ \frac{dp_t(x)}{dx} = \sum_{n=0}^\infty P(n, t)nx^{n-1} \]

Hmm. What if we take another derivative?

\[ \frac{d^2p_t(x)}{dx^2} = \sum_{n=0}^\infty P(n, t)n(n-1)x^{n-2} \]

We get an $ n(n-1) $ term, which isn’t quite $ n^2 $, but it’s degree-two. Let’s roll with it. Following what we did last time,

\begin{align} p_t(x) &= e^{(\lambda t)(x - 1)} \\ p^\prime_t(x) &= (\lambda t)e^{(\lambda t)(x - 1)} \\ p^{\prime\prime}_t(x) &= (\lambda t)(\lambda t)e^{(\lambda t)(x - 1)} \\ E[n(n-1)] &= p^{\prime\prime}_t(1) \\ &= (\lambda t)^2 \end{align}

And now we have to do some sketchy algebra to make things work out:

\begin{align} \sigma^2 &= E[(n-\lambda t)^2] \\ &= E[n^2 - 2n\lambda t + (\lambda t)^2] \\ &= E[n^2 - n - 2n\lambda t + n + (\lambda t)^2] \\ &= E[(n^2 - n) - 2n\lambda t + n + (\lambda t)^2] \\ &= E[n^2 - n] - E[2n\lambda t] + E[n] + E[(\lambda t)^2] \\ &= (\lambda t)^2 - 2(\lambda t)(\lambda t) + \lambda t + (\lambda t)^2 \\ &= \lambda t \end{align}

…which means $ \sigma = \sqrt{\lambda t} $.

Seems like magic.
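If magic makes you suspicious, here’s a numerical check of both claims, done by truncating the series; the rate and time window below are made up for illustration.

```python
from math import exp, factorial, sqrt

# Numerically check E[n] = lam*t and sigma = sqrt(lam*t) by summing a
# (truncated) series of P(n, t) = (lam*t)**n / (e**(lam*t) * n!).
lam, t = 1 / 60, 300.0          # made-up rate and window
P = [(lam * t) ** n / (exp(lam * t) * factorial(n)) for n in range(100)]

mean = sum(n * p for n, p in enumerate(P))
var = sum((n - mean) ** 2 * p for n, p in enumerate(P))
print(mean, lam * t)             # both ~5.0
print(sqrt(var), sqrt(lam * t))  # both ~2.236
```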


Okay, fine, we have this fancy function to model these very specific probabilities about fish-jump-counts over time intervals. But the kids watching the fish ladder don’t care! They want to know what’s important: “how long do I need to wait until the next fish jumps?”

Little do they know, this question opens up a whole new can of worms…

Until now, we’ve been playing with $ n $ as our random variable, with $ t $ fixed. Now, we need to start exploring what happens if $ t $ is the random variable. This needs some new ideas.

Let’s start with an easier question to answer. What is the probability that you need to wait longer than five minutes (300 seconds) to see a fish-jump? (Five minutes is way longer than my attention span when looking at fish. But whatever.)

It turns out that we already know how to answer that question. We know the probability that no fish jump in five minutes: that’s equal to $ p_{300}(0) $. Why? Well, when we plug in $ x = 0 $, all the $ x $ terms go away in the series representation, and we’re only left with (the coefficient of) the $ x^0 $ term, which is what we want.

Long story short, the probability that you need to wait longer than five minutes is $ e^{(1/60)\times300\times(0-1)} = e^{-5} \approx 0.0067 $. This means that the probability that you will see a fish-jump in the next five minutes is $ 1 - e^{-5} $, which is around 0.9933. This is the probability that you have to wait less than five minutes to see a fish-jump. For an arbitrary time interval $ T $, we have $ P(t<T) = 1 - e^{-\lambda T} $, where $ t $ is the actual time you have to wait.

Sanity check time! This gets close to 1 as $ T $ gets higher, which sounds about right: the longer you’re willing to wait, the likelier it is that you’ll see a fish jump. Similarly, if fish jump at a higher rate, $ \lambda $ goes up, and the probability gets closer to 1, which makes sense. Indeed, encouragingly enough, this equation looks very close to the equation we use for half-lives and exponential radioactive decay…
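And here are the five-minute numbers from above, double-checked in Python:

```python
from math import exp

# The five-minute wait probabilities, with lam = 1/60 exactly.
lam, T = 1 / 60, 300
print(exp(-lam * T))      # P(wait > 5 min), about 0.0067
print(1 - exp(-lam * T))  # P(wait < 5 min), about 0.9933
```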

Now things are going to get a bit hairy. What is the probability that you have to wait exactly $ T $, that is, $ P(t = T) $? This should be zero: nothing happens in no time. But let’s be reasonable: when we say “exactly” $ T $, we really mean a tiny window between, say, $ T $ and $ T + dT $, where $ dT $ is a small amount of time, say, a millisecond.

The question then is: what is $ P(T < t < T + dT) $? That isn’t too hard to answer: it’s just $ P(t < T + dT) - P(t < T) $, that is, you need to wait more than $ T $ but less than $ T + dT $. In other words,

\[ P(t \approx T, dT) = P(t < T+dT) - P(t < T) \]

where $ dT $ is an “acceptable margin of error”.

This looks awfully like a derivative! We’re expressing the change in probability as a function of change in time: if I wait $ dT $ longer, how much likelier am I to see a fish-jump?

Let’s rewrite our above equation to take advantage of the derivativey-ness of this situation.

\begin{align} P(t \approx T, dT) &= \left(\frac{P(t < T+dT) - P(t < T)}{dT}\right)dT\\ &= \left(\frac{d P(t < T)}{dT}\right)dT \\ &= \left(\frac{d (1-e^{-\lambda T})}{dT}\right)dT \\ &= \lambda e^{-\lambda T} dT \end{align}

By the way, this might give a simpler-but-slightly-less-satisfying answer to our initial question, “what is the probability that a fish jumps out right now?“ If we set $ T $ to 0, then we get $ P(t \approx 0, dT) = \lambda dT $. In other words, if fish jump out of the water at a rate $ \lambda $, then for a tiny period of time $ dT $, the probability of seeing a fish jump in that time is $ \lambda dT $. This is one of those facts that seems really straightforward one day, and completely mindblowing the next day.
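You can watch that approximation improve as the window shrinks; here’s a small sketch with the Lake Washington rate:

```python
from math import exp

# For a small window dT, the exact probability of at least one jump,
# 1 - exp(-lam * dT), is essentially lam * dT.
lam = 1 / 60
for dT in (10.0, 1.0, 0.01):
    print(dT, 1 - exp(-lam * dT), lam * dT)
```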

Anyway. Now that we have an approximation for the probability that you need to wait a specific time $ T $, we can find an expected value for $ t $ by taking the sum over discrete increments of $ dT $:

\[ E[t] = \sum^\infty_{k=0} P(t \approx T, dT) \times T \]

where $ T = k\times dT $. Since we’re talking about the limit as $ dT $ gets smaller and smaller, it seems reasonable to assume that this thing turns into

\begin{align} E[t] &= \int^\infty_0 T \times P(t \approx T, dT) \\ &= \int^\infty_0 T \, \lambda e^{-\lambda T} \, dT \end{align}

You can integrate that by parts, or just use WolframAlpha, which tells you that $ E[t] = \lambda^{-1} $.
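If you’d rather let a computer do the integration by parts, here’s a sympy check that agrees:

```python
import sympy

# The "integrate by parts, or just use WolframAlpha" step, done symbolically.
T = sympy.symbols('T', nonnegative=True)
lam = sympy.symbols('lam', positive=True)
print(sympy.integrate(T * lam * sympy.exp(-lam * T), (T, 0, sympy.oo)))  # 1/lam
```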


…which is kind of obvious, isn’t it? Remember that $ \lambda $ was the rate at which our fish jumped. If fish jump once a minute, shouldn’t we expect to have to wait a minute to see a fish jump? Isn’t this similar to the way wavelength and frequency are related?

The answer is, “yes and no”. “Yes”, the value $ \lambda^{-1} $ is indeed pretty sensible in retrospect. A simpler way to derive it might have been to note that for any time period $ T $, the expected number of fish-jumps is $ \lambda T $ (as we found out above), and so the average time interval between fish-jumps would be $ T / (\lambda T) = \lambda^{-1} $. The fact that the average interval between fish-jumps corresponds to the expected interval is captured by the surprisingly well-known acronym “PASTA”: Poisson Arrivals See Time Averages (I’m not making this up!).

But “no”, it’s not “obvious” that you should have to wait the average inter-fish time!

Suppose that, like Rip Van Winkle, you woke up after a very long sleep, and you wanted to know “how much longer until Monday morning?”

Well, Monday mornings happen every 7 days, and so if you set $ \lambda = 1/7 $, you should expect to have to wait 7 days until Monday.

But that’s silly! You definitely need to wait fewer than 7 days on average! In fact, most people would intuitively say that you need to wait 7/2 = 3.5 days on average, and they would be right. (The intuition is that on average, you’d wake up halfway between two Monday mornings.)

This is the so-called “Hitchhiker’s Paradox”: if cars on a highway through the desert appear roughly once an hour, how long does a hitchhiker who just woke up need to wait until he sees a car? It seems reasonable to say “half an hour”, since on average, you’d wake up halfway between two cars. On the other hand, with $ \lambda = 1 $, you’d expect to wait an hour until you see a car.

So which one is right? And why are the answers different?

Well, the “Rip Van Winkle” interpretation assumes that cars on a desert highway — like Mondays — come at regular intervals. In reality, cars on a desert highway — like the salmon of Seattle — are usually independent. They might come in a cluster a few minutes after you wake up, or a lone car might come the next day. Crucially, the next car doesn’t “know” anything about previous cars, and so it doesn’t matter when you wake up: we call this property “memorylessness”.

It turns out that since there’s a nonzero probability of having to wait a very long time for a car, the average gets pulled up from half an hour. With that in mind, it’s really quite surprising that the true mean turns out to be exactly $ 1/\lambda $.
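Here’s a sketch that stages the showdown directly: perfectly regular “Monday morning” arrivals versus memoryless “desert highway” arrivals, both at the same rate. The specific numbers are made up; only the comparison matters.

```python
import bisect
import random

# Cars pass at rate lam = 1 per hour. In the "Monday morning" world they
# arrive exactly every 1/lam hours; in the "desert highway" world the gaps
# are random and memoryless (exponential with rate lam). In both worlds we
# wake up at a random moment and measure the wait until the next car.
lam, horizon, trials = 1.0, 10_000.0, 5_000

# One long run of memoryless arrivals.
arrivals, now = [], 0.0
while now < horizon:
    now += random.expovariate(lam)
    arrivals.append(now)

regular_waits, memoryless_waits = [], []
for _ in range(trials):
    wake = random.uniform(0, horizon * 0.9)
    regular_waits.append(1 / lam - (wake % (1 / lam)))
    nxt = arrivals[bisect.bisect_right(arrivals, wake)]
    memoryless_waits.append(nxt - wake)

print(sum(regular_waits) / trials)      # ~0.5 hours
print(sum(memoryless_waits) / trials)   # ~1.0 hour
```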


And now, the aftermath.

Very little of the above discussion was fish-specific. The only properties of salmon that mattered here were that salmon jump randomly and independently of each other, at some rate $ \lambda $. But our calculations work for any such process (let’s call such processes Poisson processes).

Poisson processes were studied as early as 1711 by de Moivre, who came up with the cool theorem about complex numbers. However, they’re named after Siméon Denis Poisson, who in 1837 studied (not fish, but) the number of wrongful convictions in court cases.

Today, Poisson processes model all sorts of things. Managers use them to model customers arriving at a grocery checkout. Programmers use them to model packets coming into a network. Both of these are examples of queueing theory, wherein Little’s Law relates $ \lambda $ to how long things have to wait in queues. You could probably use a Poisson process to model how frequently bad things happen to good people, and use that to create a statistical model of how unfair the world is.

The upshot is this: even though I didn’t record any fish-jumping data back in Seattle, we can definitely try out these ideas on other “sporadic” processes. Wikipedia, it turns out, maintains a list of earthquakes that happened in the 21st century. Earthquakes are pretty sporadic, so let’s play with that dataset.

I scraped the date of each earthquake, and wrote a small script to count the number of earthquakes in each month-long interval. That is, $ t $ is 2,592,000 seconds (30 days). By “binning” my data by month, I got lots of samples of $ n $. This gives an easy way to compute $ P(n, t) $ “empirically”.

On the other hand, taking the total number of earthquakes and dividing by the total time range (around 17 years, since we’re in 2017) gives us the rate $ \lambda $, which in this case works out to about $ 1.06\times10^{-6} $ earthquakes per second. This gives a way to compute $ P(n, t) $ “theoretically” by using our fancy formula with the factorial and whatnot.

\[ P(n, t) = \frac{\left(\lambda t\right)^n}{e^{\lambda t} n!} \]
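For the curious, the binning logic only takes a few lines. What follows is a sketch rather than the exact script I ran, and it assumes the scraped earthquake dates are already sitting in a list of Unix timestamps called quake_times (a hypothetical name):

```python
from collections import Counter
from math import exp, factorial

# A sketch of the binning script (not the exact one I ran). It assumes the
# scraped earthquake dates are already a list of Unix timestamps named
# quake_times; that name, and the scraping step itself, are hypothetical.
def empirical_vs_theoretical(quake_times, bin_seconds=2_592_000):
    start, end = min(quake_times), max(quake_times)

    # Empirical: count earthquakes in each month-long bin.
    bins = Counter(int((q - start) // bin_seconds) for q in quake_times)
    num_bins = int((end - start) // bin_seconds) + 1
    counts = [bins.get(i, 0) for i in range(num_bins)]

    # Theoretical: lam is just total earthquakes divided by total time.
    lam = len(quake_times) / (end - start)
    lam_t = lam * bin_seconds

    for n in range(max(counts) + 1):
        empirical = counts.count(n) / num_bins
        theory = lam_t ** n / (exp(lam_t) * factorial(n))
        print(n, round(empirical, 3), round(theory, 3))
```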

Comparing the results gives us this pretty plot!

[Plot: earthquake Poisson process, empirical vs. theoretical $ P(n, t) $]

They match up surprisingly well.

What else can we say? Well, the average inter-earthquake time works out to $ 1/\lambda $, or around 940,000 seconds. That’s about eleven days. On average, a reader of this blog post can expect to wait eleven days until the next earthquake of magnitude 7 or above hits.

And for those of you who have been wondering, “can we do these calculations in reverse to approximate $ e $?” the answer is, yes! We just solve the above equation for $ e $.

\[ e\approx\left(\frac{P(n, t)n!}{(\lambda t)^n}\right)^{-(\lambda t)^{-1}} \]

In my case, using earthquake data for $ n = 1 $, I got $ e \approx 2.75 $. I’d say that’s pretty good for an algorithm that relies on geology for accuracy (in reality, $ e $ is around 2.718).
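Here’s a sketch of that reverse calculation. The value p1 below is a placeholder for the empirical $ P(1, t) $, that is, the fraction of month-bins that contained exactly one earthquake:

```python
from math import factorial

def estimate_e(p_emp, n, lam, t):
    # Rearranged formula: e = (P(n,t) * n! / (lam*t)**n) ** (-1 / (lam*t))
    return (p_emp * factorial(n) / (lam * t) ** n) ** (-1 / (lam * t))

lam, t = 1.06e-6, 2_592_000  # earthquakes per second, seconds per month
p1 = 0.18                    # placeholder for the empirical P(1, t)
print(estimate_e(p1, 1, lam, t))
```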


In many ways, it is quite incredible that the Poisson process conditions — randomness, independence, constant rate — are all you need to derive conclusions for any Poisson process. Knowing roughly that customers at a burger place are random, act independently, and arrive around once a minute at lunchtime — and knowing nothing else — we can predict the probability that four customers arrive in the next three minutes. And, magically, this probability will have $ e $ and a factorial in it.

Humans don’t evaluate expressions involving $ e $ and factorials when they decide when to get a burger. They are subject to the immense complexity of human life, much like how salmon are subject to the immense complexity of the fluid mechanics that govern Lake Washington, much like how earthquakes are subject to the immense complexity of plate tectonics.

And yet, somehow statistics unites these vastly different complexities, finding order and meaning in what is otherwise little more than chaos.

Isn’t that exciting?

~ Fin. ~




Postscript, two weeks later. This morning at the coffee shop I realized that the Poisson distribution is a lot like the binomial distribution with a lot of trials: the idea is that you have lots of little increments of time, and a fish either jumps or doesn’t jump in each increment — this is called a Bernoulli process. Presumably, over a long period of time, this should even out to a Poisson process…

Recall that the probability of a fish-jump happening in some small time period $ dt $ turned out to be $ \lambda dt $ for our definition of $ \lambda $ as the rate of fish-jumps. Can we go the other way, and show that if the probability of something happening is $ \lambda dt $ for a small period of time $ dt $, then it happens at a rate of $ \lambda $?

Turns out, yes!

The binomial distribution is a way to figure out, say, what the probability is that if I flip 100 coins, then exactly 29 of them land “heads” (a coin toss is another example of a Bernoulli process). More abstractly, the binomial distribution gives you the probability $ B(N, k) $ that if something has probability $ p $ of happening, then it happens $ k $ times out of $ N $ trials.

The formula for $ B(N, k) $ can be derived pretty easily, and you can find very good explanations in a lot of high-school textbooks. So, if you don’t mind, I’m just going to give it to you for the sake of brevity:

\[ B(N, k) = \binom{N}{k} p^k (1-p)^{N-k} \]

Now, can we apply this to a Poisson process? Well, let’s say $ k = n $, the number of times our event happens in time $ t $. Then we have

\[ \binom{N}{n} p^n (1-p)^{N-n} \]

What next? We know that $ p = \lambda dt $. Also, for time period $ t $, there are $ t / dt $ intervals of $ dt $, so $ N = t / dt $. That means we can substitute $ dt = t / N $, and thus $ p = \lambda (t / N) $. This gives us

\[ \binom{N}{n} (\lambda t / N)^n (1-\lambda t / N)^{N-n} \]

Oh, and of course to approximate a Poisson process, this is the limit as $ N $ approaches infinity:

\[ \lim_{N\to\infty} \binom{N}{n} (\lambda t / N)^n (1-\lambda t / N)^{N-n} \]

This isn’t a hard limit to take if we break apart the product.

\[ \lim_{N\to\infty} \frac{N! (\lambda t)^n}{n!(N-n)! N^n} \lim_{N\to\infty}(1-\lambda (t / N))^{N-n} \]

The right half is, surprisingly enough, the definition of $ e^{-\lambda t} $, since the $ - n $ in the exponent doesn’t really matter. The left half is trickier: it turns out that $ N! / (N-n)! $ is the product $ N(N-1)\ldots(N-n+1) $. As a polynomial in $ N $, it has degree $ n $, and its leading term is $ N^n $. But look! In the denominator, we have an $ N^n $ term as well, so in the limit their ratio goes to 1, and the left half becomes $ (\lambda t)^n / n! $.

We’re left with what simplifies to our expression for the Poisson distribution.

\begin{align} \lim_{dt\to 0} B(N=t/dt, p=\lambda dt) &= \frac{(\lambda t)^n}{n!}e^{-\lambda t} \\ &= \frac{(\lambda t)^n}{e^{\lambda t}n!} \\ &= P(n, t) \end{align}

which I think is literally magic.
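If you’d like to watch that convergence happen numerically, here’s a small sketch; the rate, window, and $ n $ are made up for illustration.

```python
from math import comb, exp, factorial

# Watch the binomial pmf turn into the Poisson pmf as the number of tiny
# time-slices N grows (with p = lam*t/N shrinking to keep the rate fixed).
lam, t, n = 1 / 60, 60.0, 1
poisson = (lam * t) ** n / (exp(lam * t) * factorial(n))

for N in (10, 100, 1000, 100_000):
    p = lam * t / N
    print(N, comb(N, n) * p ** n * (1 - p) ** (N - n))

print("limit:", poisson)
```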

◊ ◊ ◊