How Newbies Codebreak
A story of codebreaking using Python, Google, and common sense.
Saturday, February 8, 2014 · 6 min read
So I'm overdue for a post. Not my fault: school's keeping me busy. In fact, it's keeping me very busy. On Friday night I had to crack a cipher for a competition. It's not a very exciting feat, but it was the most fun I have had in ages, and it offers a nice glance into how people think about problems.
If you liked this, you'll love Simon Singh's The Code Book, which is a rather nice introduction to cryptography that has a whole bunch of awesome true stories about historical top-secret codebreaking feats. This sentence was totally not sponsored by Simon.
Our ciphertext looked like this:
"25112311'15 525422 24 142112'22 222524127 52123715: 412 11121122113241 15315221113 2225422 5415 621212141114 2412 222511 152417221111122225 11112222233 41214 4122122251123 2225422 5415 621212141114 2412 222511 12241211221111122225 11112222233. 24'13 2224231114 216 22252415 142415121515242112 216 14162422426241513 41214 1521124426241513; 511 2624811 2412 222511 2251112223-624231522 11112222233; 511 12111114 412 11121122113241 15315221113 2225422 25415 14111321123413 415 242215 2121411231624121224121815 41214 412 112225241426 1211411." -132412541126 1321212311
Ugh. We were told that it's a substitution cipher, with A-Z represented by
1-26 in some order. The standard
way to solve these is frequency
analysis: we look at the percentage of each coded letter. For example, if 12
shows up 8% of the time, it's probably a more common letter like e or
a as opposed to x. Cryptographers have tables of letter
frequencies in various languages, so this is easy stuff.
Before you read on, take a moment to see if you can get anywhere with it. Make some intelligent guesses and see if you can make any progress.
(Back so quick? Go back give it a real try!)
Did you see the problem? Our cipher is much tougher than frequency analysis,
because we don't have any delimiters between encrypted letters. So
112
could be any of 1-1-2, 11-2, 1-12
. That's a
big problem: it makes words ambiguous. Even the intended recipient of
the encrypted message doesn't know the way the word is broken up for sure, he
needs some trial and error (but with the key and a reasonable English
vocabulary, it's relatively easy).
(Mathematical aside: If you have a string of length $l$, how many ways can you break it up into a series 2 or 1 digit numbers? The answer is, believe it or not, the lth Fibonacci number. The proof is simple: given a string, we can chomp off the first digit and break up the rest in $f(l-1)$ ways, or chomp off the first 2 digits and break up the rest in $f(l-2)$ ways. So the total is $f(l) = f(l-1) + f(l-2)$, which is Fibonacci.)
Anyway, Tim ended up writing a nice little Python program that breaks up
words into a list of numbers (it turns out we can eliminate quite a few,
because the two-digit numbers have to be 26 or lower). That was exciting,
though the huge outputs were slightly disturbing. So it was possible to
brute-force it: we generate all possible breaking-ups, then try all possible
keys, and then use a large list of words to see if things make sense. (For
those of you that don't know, on UNIX systems
/usr/share/dict/words
contains a very handy list of English words
you can grep in).
It turns out, the number of possibilities to try is big. None of us owned a
supercomputer, and we didn't have a couple billion years to spare to spare. But
we did have some hints. We had a bunch of words that ended with apostrophes:
25112311'15
, 142112'22
, and 24'13
. We
started with some initial guessing. The first one looked too long to be a
contraction, so it could be a possessive. So perhaps 15 is s. 22 could
be t, because of contractions like don't
or can't
. The
last one has a bunch of choices. However, we saw 24 exist on its own
at the beginning of the first sentence. This is helpful because it could be a
word like I
which is both a letter and a word. (We knew none of this for
sure: 24 could very well have meant it
or is
). So we conjectured
that 13 is m, to form I'm
. We had other clues, too, but nothing
too definitive. The guesses above seemed mutually consistent, but we didn't
have any solid proof. For example, looking at words like 412
and
413
, we guessed 4 was a, because a lot of short words
begin with a
.
Then we had a realization. This was obviously a quote from someone (look at
the structure of the punctuation), and that someone was probably famous. So the
name gives a lot of hints. In particular, both the first and last name start
with 13
. Hmm. My first instinct was Marilyn Monroe
, so Tim wrote
another program to deduce the possible meanings of letters if I gave a guess
for the word. Marilyn fit beautifully, but Monroe didn't (the n
in
Marilyn and the n
in Monroe corresponded to different numbers). Boo.
We tried Mickey Mantle, though I swear I only knew about him from
Seinfeld. That didn't work either. Boo again. So I gave up all hope
and Googled celebreties whose first and last names start with m
. And
that led me to a
wonderful Yahoo answer that actually listed out a dozen famous people with
initials M. M. This is so impressive, that I reproduce the list below:
- Michael (Mike) Myers, comedian, actor
- Maureen McCormick, actress
- Matthew Maconahay, actor
- Max McGee, former Green Bay Packer (1960's) later, Packer's radio announcer
- Marlee Matlin, actress
- Mark Mcguire, baseball star
- Mark Martin, Nascar driver
- Marylin McCoo, singer
- Marilyn Manson, singer
- Mickey Mantle, HOF baseball player
- Matthew Modine, actor ("Full Metal Jacket")
- Melissa Manchester, singer
- Michael Moore, film maker
- Martina McBride, country singer
- Marsha Mason, actress, "The Good-Bye Girl)
- Marilyn Monroe, actress
- Mary McGregor, singer, "Torn Between Tow Lovers"
We manually tried the first names; the clear winner was Michael
. Very
exciting. The last name is now rather obvious: Moore
(13 21 21 23 11:
the double 21 makes it strikingly clear). This couldn't be an accident.
So we got to substituting in the newfound letters into the rest of the
message. Not too easy, because of all the ambiguity, but with some educated
guesses we made enough progress to be able to read bits and pieces, most
notably Here's
at the very beginning.
The way forward was pretty clear now, we searched some quote databases for Michael Moore quotes.
The first couple were hopeful but clearly wrong:
Here's a way to stop suicide bombings — give the Palestinians a bunch of missile-firing Apache helicopters and let them and the Israelis go at each other head to head. Four billion dollars a year to Israel — four billion dollars a year to the Palestinians — they can just blow each other up and leave the rest of us the hell alone.
Here's what I do support: I support them coming home.
It turns out Moore's a rather prolific political commentator and filmmaker,
so that couldn't possibly be helping narrow the search space. No
worries. At that point I gave in and made some wild assumptions (like in a
Sudoku, when you give up on logic and take some leaps of faith). That gave me
Here's what I can't think…
(At this point I feel it is worth mentioning that Tim somehow generated the following. I think it's fair to say Humans: 1, Python: 0.)
ehaaeyaa'ah hehaee i aaeaae'ee eeeheaaek heaeykah: ac aaaeaaeeaayeaa ahyaheeaaay eeehaee hal teaeaeaaaaaa eaae eeehaa aheaakeeaaaaaeeeeh aaaaeeeeeyy acs aaeeaeeehaaey eeehaee hal teaeaeaaaaaa eaae eeehaa aeeaaeaaeeaaaaaeeeeh aaaaeeeeeyy. i'm sines it soil aaeaahaeahaheaeaae it aaateaeeaeteaahay acs aheaaeaaeteaahay; he line eaae eeehaa eehaaaeeey-teaeyahee aaaaeeeeeyy; he aeaaaaaa ac aaaeaaeeaayeaa ahyaheeaaay eeehaee ehaah aaaaayeaaeyaay al eaeeah eaeaaaaeyateaaeaeeaaeanah acs ac aaeeeheaaaet aeaaaaa." -ayeaaehaaaet ayeaeaeyaa
Anyway, with a few more guesses, we found it:
Here's what I don't think works: An economic system that was founded in the 16th century and another that was founded in the 19th century. I'm tired of this discussion of capitalism and socialism; we live in the 21st century; we need an economic system that has democracy as its underpinnings and an ethical code.
So that was that. Two hours, a root beer lollipop, and four nerds was all it took.
(In retrospect, we could have gotten some clues from frequency analysis. There are no 9's, so 19 and 29 are probably x and z. And there is a scary number of places where there are a bunch of 1's in a row. The only letter to repeat itself so much is e, and indeed 11 corresponds to e.)
One last thing: Tim's final answer was:
here's hhat i .oc't thick horks: ac ecete.ic s.stem that has .oooae. ic the si.teecth eettr. ac. acother that has .oooae. ic the ciceteecth eettr.. i'm tire. o. this .iscssioc o. ..italism ac. socaalism; he li.e ic the thect.-.irst eettr.; he cee. ac ecete.ic s.stem that has .emoc.am as its ooaer.iccic.s ac. ac ethical ceae." -michael moore
which is rather impressive, coming from a computer program. If anyone wants eternal fame, go ahead and take up the challenge to write a robust, generic cipher-like-this solver. We can use it next year!