How Newbies Codebreak

A story of codebreaking using Python, Google, and common sense.

Saturday, February 8, 2014 · 6 min read

Special thanks to Tim, Nathan, and the Water Bottle Stealer (you know who you are). I probably wouldn't have gotten close to a solution without you guys.

So I'm overdue for a post. Not my fault: school's keeping me busy. In fact, it's keeping me very busy. On Friday night I had to crack a cipher for a competition. It's not a very exciting feat, but it was the most fun I have had in ages, and it offers a nice glance into how people think about problems.

If you liked this, you'll love Simon Singh's The Code Book, which is a rather nice introduction to cryptography that has a whole bunch of awesome true stories about historical top-secret codebreaking feats. This sentence was totally not sponsored by Simon.

Our ciphertext looked like this:

"25112311'15 525422 24 142112'22 222524127
52123715: 412 11121122113241 15315221113 2225422 5415 621212141114 2412 222511
152417221111122225 11112222233 41214 4122122251123 2225422 5415 621212141114
2412 222511 12241211221111122225 11112222233. 24'13 2224231114 216 22252415
142415121515242112 216 14162422426241513 41214 1521124426241513; 511 2624811
2412 222511 2251112223-624231522 11112222233; 511 12111114 412 11121122113241
15315221113 2225422 25415 14111321123413 415 242215 2121411231624121224121815
41214 412 112225241426 1211411." -132412541126 1321212311

Ugh. We were told that it's a substitution cipher, with A-Z represented by 1-26 in some order. The standard way to solve these is frequency analysis: we look at the percentage of each coded letter. For example, if 12 shows up 8% of the time, it's probably a more common letter like e or a as opposed to x. Cryptographers have tables of letter frequencies in various languages, so this is easy stuff.

Before you read on, take a moment to see if you can get anywhere with it. Make some intelligent guesses and see if you can make any progress.

(Back so quick? Go back give it a real try!)

Did you see the problem? Our cipher is much tougher than frequency analysis, because we don't have any delimiters between encrypted letters. So 112 could be any of 1-1-2, 11-2, 1-12. That's a big problem: it makes words ambiguous. Even the intended recipient of the encrypted message doesn't know the way the word is broken up for sure, he needs some trial and error (but with the key and a reasonable English vocabulary, it's relatively easy).

(Mathematical aside: If you have a string of length $l$, how many ways can you break it up into a series 2 or 1 digit numbers? The answer is, believe it or not, the lth Fibonacci number. The proof is simple: given a string, we can chomp off the first digit and break up the rest in $f(l-1)$ ways, or chomp off the first 2 digits and break up the rest in $f(l-2)$ ways. So the total is $f(l) = f(l-1) + f(l-2)$, which is Fibonacci.)

Anyway, Tim ended up writing a nice little Python program that breaks up words into a list of numbers (it turns out we can eliminate quite a few, because the two-digit numbers have to be 26 or lower). That was exciting, though the huge outputs were slightly disturbing. So it was possible to brute-force it: we generate all possible breaking-ups, then try all possible keys, and then use a large list of words to see if things make sense. (For those of you that don't know, on UNIX systems /usr/share/dict/words contains a very handy list of English words you can grep in).

It turns out, the number of possibilities to try is big. None of us owned a supercomputer, and we didn't have a couple billion years to spare to spare. But we did have some hints. We had a bunch of words that ended with apostrophes: 25112311'15, 142112'22, and 24'13. We started with some initial guessing. The first one looked too long to be a contraction, so it could be a possessive. So perhaps 15 is s. 22 could be t, because of contractions like don't or can't. The last one has a bunch of choices. However, we saw 24 exist on its own at the beginning of the first sentence. This is helpful because it could be a word like I which is both a letter and a word. (We knew none of this for sure: 24 could very well have meant it or is). So we conjectured that 13 is m, to form I'm. We had other clues, too, but nothing too definitive. The guesses above seemed mutually consistent, but we didn't have any solid proof. For example, looking at words like 412 and 413, we guessed 4 was a, because a lot of short words begin with a.

Then we had a realization. This was obviously a quote from someone (look at the structure of the punctuation), and that someone was probably famous. So the name gives a lot of hints. In particular, both the first and last name start with 13. Hmm. My first instinct was Marilyn Monroe, so Tim wrote another program to deduce the possible meanings of letters if I gave a guess for the word. Marilyn fit beautifully, but Monroe didn't (the n in Marilyn and the n in Monroe corresponded to different numbers). Boo.

We tried Mickey Mantle, though I swear I only knew about him from Seinfeld. That didn't work either. Boo again. So I gave up all hope and Googled celebreties whose first and last names start with m. And that led me to a wonderful Yahoo answer that actually listed out a dozen famous people with initials M. M. This is so impressive, that I reproduce the list below:

  1. Michael (Mike) Myers, comedian, actor
  2. Maureen McCormick, actress
  3. Matthew Maconahay, actor
  4. Max McGee, former Green Bay Packer (1960's) later, Packer's radio announcer
  5. Marlee Matlin, actress
  6. Mark Mcguire, baseball star
  7. Mark Martin, Nascar driver
  8. Marylin McCoo, singer
  9. Marilyn Manson, singer
  10. Mickey Mantle, HOF baseball player
  11. Matthew Modine, actor ("Full Metal Jacket")
  12. Melissa Manchester, singer
  13. Michael Moore, film maker
  14. Martina McBride, country singer
  15. Marsha Mason, actress, "The Good-Bye Girl)
  16. Marilyn Monroe, actress
  17. Mary McGregor, singer, "Torn Between Tow Lovers"

We manually tried the first names; the clear winner was Michael. Very exciting. The last name is now rather obvious: Moore (13 21 21 23 11: the double 21 makes it strikingly clear). This couldn't be an accident.

So we got to substituting in the newfound letters into the rest of the message. Not too easy, because of all the ambiguity, but with some educated guesses we made enough progress to be able to read bits and pieces, most notably Here's at the very beginning.

The way forward was pretty clear now, we searched some quote databases for Michael Moore quotes.

The first couple were hopeful but clearly wrong:

Here's a way to stop suicide bombings — give the Palestinians a bunch of missile-firing Apache helicopters and let them and the Israelis go at each other head to head. Four billion dollars a year to Israel — four billion dollars a year to the Palestinians — they can just blow each other up and leave the rest of us the hell alone.
Here's what I do support: I support them coming home.

It turns out Moore's a rather prolific political commentator and filmmaker, so that couldn't possibly be helping narrow the search space. No worries. At that point I gave in and made some wild assumptions (like in a Sudoku, when you give up on logic and take some leaps of faith). That gave me Here's what I can't think…

(At this point I feel it is worth mentioning that Tim somehow generated the following. I think it's fair to say Humans: 1, Python: 0.)

ehaaeyaa'ah hehaee i aaeaae'ee eeeheaaek heaeykah: ac aaaeaaeeaayeaa ahyaheeaaay eeehaee hal teaeaeaaaaaa eaae eeehaa aheaakeeaaaaaeeeeh aaaaeeeeeyy acs aaeeaeeehaaey eeehaee hal teaeaeaaaaaa eaae eeehaa aeeaaeaaeeaaaaaeeeeh aaaaeeeeeyy. i'm sines it soil aaeaahaeahaheaeaae it aaateaeeaeteaahay acs aheaaeaaeteaahay; he line eaae eeehaa eehaaaeeey-teaeyahee aaaaeeeeeyy; he aeaaaaaa ac aaaeaaeeaayeaa ahyaheeaaay eeehaee ehaah aaaaayeaaeyaay al eaeeah eaeaaaaeyateaaeaeeaaeanah acs ac aaeeeheaaaet aeaaaaa." -ayeaaehaaaet ayeaeaeyaa

Anyway, with a few more guesses, we found it:

Here's what I don't think works: An economic system that was founded in the 16th century and another that was founded in the 19th century. I'm tired of this discussion of capitalism and socialism; we live in the 21st century; we need an economic system that has democracy as its underpinnings and an ethical code.

So that was that. Two hours, a root beer lollipop, and four nerds was all it took.

(In retrospect, we could have gotten some clues from frequency analysis. There are no 9's, so 19 and 29 are probably x and z. And there is a scary number of places where there are a bunch of 1's in a row. The only letter to repeat itself so much is e, and indeed 11 corresponds to e.)

One last thing: Tim's final answer was:

here's hhat i .oc't thick horks: ac ecete.ic s.stem that has .oooae. ic the si.teecth eettr. ac. acother that has .oooae. ic the ciceteecth eettr.. i'm tire. o. this .iscssioc o. ..italism ac. socaalism; he li.e ic the thect.-.irst eettr.; he cee. ac ecete.ic s.stem that has .emoc.am as its ooaer.iccic.s ac. ac ethical ceae." -michael moore

which is rather impressive, coming from a computer program. If anyone wants eternal fame, go ahead and take up the challenge to write a robust, generic cipher-like-this solver. We can use it next year!

◊ ◊ ◊