3. Codebreaking

  1. 3.1. Overview of Codebreaking
  2. 3.2. Frequency Analysis
  3. 3.3. Interlude: Probability
  4. 3.4. Interlude: Conditional Probability
  5. 3.5. Index of Coincidence
  6. 3.6. Breaking the Vigenère Cipher
  7. 3.7. Interlude: \(G\)-Test
  8. 3.8. Breaking Rectangular Transposition
  9. 3.9. Known Plaintext Attack
  10. 3.10. Perfect Secrecy

3.1. Overview of Codebreaking

In the last section, we mostly took the perspective of Alice and Bob: we focused on how to encrypt and decrypt texts using a variety of ciphers. In this section, we take the perspective of Eve: we’ll study how to break some of the ciphers we encountered above. This will include some interludes on probability theory, which will help us devise more clever codebreaking techniques.

It’s worth remarking at the outset that codebreaking is not an “exact science,” in the sense that there isn’t any codebreaking algorithm that guarantees producing the correct plaintext from the ciphertext in one shot without access to the key. Instead, the techniques help constrain the search for the correct plaintext: we’ll see how we can make educated guesses about what the key might be, but they will be guesses nonetheless, so there could be points where some guess you made was incorrect and you have to go back and make an alternative guess. Still, the guesswork is “educated enough” that the techniques work pretty well (especially in an age when we can get computers to do some of the work for us), so most of the ciphers discussed above are no longer considered secure!

The only exception is the one-time pad, which, as we mentioned above, does have a very strong security property called “perfect secrecy.” We’ll return to this again at the end of this section.

3.2. Frequency Analysis

Frequency analysis is a powerful technique for breaking simple (and sometimes also polygraphic) substitution ciphers. The idea dates back to the Arab polymath Al-Kindi (801–873 CE) and is relatively simple:

Heuristic The relative frequencies of letters remain roughly stable across different samples of English texts, and ETAOINSHRDLU is the approximate order of the 12 most common letters.1

For an interesting tangent about the letters ETAOINSHRDLU, see here!

SageCell Here is a SageCell that lets you look at frequencies in various English texts. Hit “Run.” You’ll see several input boxes which default to passages from Emily Brontë’s Wuthering Heights, the Declaration of Independence, F. Scott Fitzgerald’s The Great Gatsby, and the Preamble to the Universal Declaration of Human Rights. The code outputs a table indicating the frequency of each letter (as a percentage) in each text, as well as a ranked list of the most frequent through the least frequent letter in each text.

Do you see the patterns described in the heuristic above? Namely, are the frequency percentages roughly equal for all of the texts? Is it true that ETAOINSHRDLU is the approximate order of the 12 most common letters?

Then find some other passage(s) of English text on your own and plug them in to see what happens!
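
Code Sketch If you’re curious what kind of computation produces such a table, here is a minimal sketch in Sage/Python of counting letter frequencies. The function name and the sample string are placeholders for illustration, not the actual code in the cell above.

def letter_frequencies(text):
    # keep only alphabetic characters, ignoring case, spaces, and punctuation
    letters = [c for c in text.upper() if c.isalpha()]
    counts = {letter: letters.count(letter) for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}
    total = len(letters)
    return {letter: 100.0 * counts[letter] / total for letter in counts}

freqs = letter_frequencies("When in the course of human events it becomes necessary ...")
for letter, pct in sorted(freqs.items(), key=lambda pair: -pair[1]):
    print(letter, round(pct, 1))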

This heuristic can be exploited to break simple substitution ciphers — even though, as we observed earlier, the number of possible keys in a general simple substitution cipher is roughly 403 septillion! The technique works best with longer ciphertexts, and the idea is to guess the decryption key one letter at a time, doing one of the following at each step:

  1. Assign the most frequent unassigned letter of ciphertext to be the most frequent unassigned letter in some sample English text (or perhaps some other letter with a similar frequency).

  2. Look through the ciphertext and see if you can make any guesses about words that seem to appear there. If you see something, fill in the blanks in that word by making appropriate guesses for the key.

If at any point it seems like your guesses are leading to nonsense or implausible sequences of letters, backtrack and make an alternative guess!

For example, it is usually fairly safe to start with two applications of option 1, ie, guessing that the most common letter in the ciphertext decrypts to E and that the second most common letter decrypts to T. Luckily, the word THE occurs very frequently in English (and the same sequence occurs in other common words like THEY or THEIR or THEN or THERE), so if after you make the E and T substitutions you see the pattern T*E frequently (with * representing some fixed ciphertext letter), you can feel pretty confident replacing * with H. This is an example of using option 2. At this point, applying option 2 again might require some clever guesswork. For example, if you see the pattern TH*T frequently, there’s a good chance this might be the word THAT. Or maybe you can find frequent occurrences of sequences like *T for some fixed letter *, which could be IT or AT (these can be hard to tell apart based purely on letter frequencies since A and I often have similar frequencies), so you could try making one of those guesses. If you can’t spot any possible words, just try using option 1 instead and match the most frequent letters!

Usually the first few guesses after E and T are the hardest. Once you’ve made just a few correct guesses, it starts becoming very easy to see words!
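
Code Sketch To make the guess-and-backtrack workflow concrete, here is a small sketch (not the actual code of the SageCell below) that applies a partial guess for the decryption key to a ciphertext. Guessed plaintext letters are shown in lowercase and unguessed ciphertext letters as asterisks, so patterns like T*E are easy to spot; the ciphertext and the guesses are made up for illustration.

def apply_partial_key(ciphertext, guesses):
    # guesses maps ciphertext letters to guessed plaintext letters, e.g. {"X": "E"}
    return "".join(guesses.get(c, "*").lower() if c.isalpha() else c for c in ciphertext)

# Suppose we've guessed that ciphertext X decrypts to E and ciphertext Q decrypts to T.
print(apply_partial_key("QPX RWQ XQWY", {"X": "E", "Q": "T"}))   # prints: t*e **t et**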

SageCell / Exercise Here is a SageCell to assist with the frequency analysis described above. Hit “Run” and you’ll see the following:

Use the strategy described above to decrypt this ciphertext!

Exercise The following ciphertext was encrypted using a simple substitution. Find the corresponding plaintext!

Exercise Why will the frequency analysis strategy we’ve just described fail for ciphertexts encrypted using rectangular transposition?

Exercise The strategy as we’ve described it above doesn’t work verbatim for Polybius squares. How would you adapt this strategy to get it to work for a Polybius square? Explain in your own words why breaking a Polybius square will be “just as easy” as what we’ve described above.

Exercise The strategy as we’ve described it above doesn’t work verbatim for digraphic substitution schemes like the Hill and Playfair ciphers. How would you adapt this strategy to get it to work for these digraphic substitution ciphers? Do you have any concerns about applying this strategy? In other words, what reasons can you give to support the idea that these digraphic substitution ciphers will be harder to break than simple substitution ciphers using the strategy we’ve outlined above?

Harder Sage Exercise Adapt the code for the SageCell above so that it can help with decrypting digraphic substitution schemes.

In the sections that follow, we’ll elaborate on the ideas we’ve seen in this section to see how variants of the same fundamental observation — that English text exhibits certain patterns — can be used to mount attacks on other ciphers. But first, we must develop some theory.

3.3. Interlude: Probability

Experiments and Events

In probability theory, the word “experiment” is used to talk abstractly and heuristically about processes which generate “outcomes” and which might be rather intricate. These experiments are formally modeled by probability spaces. The general definition requires a background in measure theory, which is not a prerequisite for this course, so we won’t make the general definition. Instead, we’ll use the following:

Definition A (discrete) probability space is a nonempty countable set \(\Omega\), called the sample space, whose elements are called outcomes. Each outcome \(x \in \Omega\) is assigned a real number \(P[x]\) between 0 and 1 called its probability, and the probabilities of all of the outcomes must sum to 1: \[ \sum_{x \in \Omega} P[x] = 1. \]

“Countable” means that the outcomes can be put into a list so that the summation \(\sum_{x \in \Omega} P[x]\) makes sense. Any finite set is countable. Some infinite sets are also countable, but for now we’ll focus on the finite case.

The probability associated to each outcome should be thought of as some measure of our “confidence” that our experiment will produce that outcome. For example, it might be the percentage of times we expect the experiment to produce that outcome if the experiment were to be repeated many many times.

Rolling a die is an example of an experiment. The possible outcomes of this experiment are the numbers 1 through 6. In other words, the sample space is \(\Omega = \{1, 2, 3, 4, 5, 6\}\). Assigning each outcome probability 1/6 means that the die is “fair” and each outcome is equally likely. We have thus constructed a probability space: we have a finite set \(\Omega\) that enumerates the possible outcomes, and we’ve assigned a probability to each outcome.

A single experiment can also have “multiple parts.” For example, flipping a fair coin twice can be thought of as a single experiment. Possible outcomes of this experiment might be something like “heads and then heads again,” or “heads and then tails,” and so forth. All of these outcomes taken together as a set form the sample space. If we abbreviate heads as \(H\) and tails as \(T\), we might write \(\Omega = \{HH, HT, TH, TT\}\). If we assign each of these four outcomes probability 1/4, we’re modeling the situation where the coin is fair and the result of each coin flip is unrelated to the other.

In the two examples we’ve just seen, all of the outcomes have the same probability. This is a very common situation and so it has a special name:

Definition A probability space is uniform if all of its outcomes have equal probability.

It is sometimes useful to group various outcomes together. This is accomplished by the following definition:

Definition Given a probability space, an event \(E\) is a subset of the sample space. We define \[ P[E] = \sum_{x \in E} P[x]. \]

Be careful: the words “event” and “outcome” feel somewhat similar in day-to-day English, but they have distinct definitions in probability theory!

Consider again the example of rolling a die. An “event” might be something like “the roll is odd.” Formally, if we’re thinking of the sample space \(\Omega = \{1, 2, 3, 4, 5, 6\}\), the event “the roll is odd” corresponds to the subset \(E = \{1, 3, 5\}\). This event is also assigned a probability, just by summing together the probabilities of all of the outcomes that comprise that event: \[ P[E] = P[1] + P[3] + P[5] = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{3}{6} = \frac{1}{2}. \]

Exercise Suppose you have 4 boxes (labeled 1, 2, 3, and 4), and you have 8 colors available (red, blue, green, yellow, pink, purple, teal, brown). Consider an experiment where each of the 4 boxes is assigned a color. For example, one possible outcome of this experiment might be the one where box 1 is colored red, box 2 is colored blue, box 3 is colored green, and box 4 is colored blue.

  1. How many possible outcomes are there?
  2. How many outcomes are in the event “no two boxes have the same color”?

Exercise Suppose you have \(k\) boxes and you have \(n\) colors available. Consider again the same experiment where each of the \(k\) boxes is assigned one of the \(n\) colors “at random” (ie, construct a uniform probability space). What is the probability of the event that no two boxes have the same color? What is the probability that there are at least two boxes of the same color? Find expressions in terms of \(n\) and \(k\).

Sage Exercise Write some code that takes as input a positive integer \(n\) and returns the smallest positive integer \(k\) such that, when \(k\) boxes are assigned \(n\) colors at random, the probability that there are at least two boxes of the same color is at least 50%. If you run your code with \(n = 365\), you should get \(k = 23\) as output. Explain what this has to do with the birthday paradox.

Proof Exercise Prove formally that, if a discrete probability space is uniform, it must be finite.

Random Variables

A very common way that events show up is by considering random variables, which are mathematical gadgets that represent making an observation (or taking a measurement) on the outcome of an experiment. Each random variable has a set of possible values that it can take. Letters like \(X, Y, \dotsc\) are used to denote random variables. If you like formal definitions, here is one:

Definition Fix a probability space \(\Omega\). A random variable is a function with domain \(\Omega\), and its set of (possible) values is the range of this function.

For example, consider again the “multi-part” experiment we discussed above where we flip a coin twice. Making an observation of the first coin flip can be thought of as a random variable, which we will call \(X\). This observation can be either “heads” or “tails.” If we abbreviate “heads” and “tails” as \(H\) and \(T\) again, then we can write things like \(X = H\) to refer to the event that the first coin flip landed heads. In other words, inside the sample space \(\Omega = \{HH, HT, TH, TT\}\), the notation \(X = H\) describes the event \(\{HH, HT\}\) and we have \[ P[X = H] = P[HH] + P[HT] = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}. \] There are many other observations or measurements we could make with the same experiment. For example, we might instead be interested in the number of heads. This is another random variable; let’s call it \(Y\). This can take values 0, 1, and 2. The notation \(Y = 1\) describes the event that we observe 1 heads out of the two coin flips, ie, the event \(\{HT, TH\}\), and we have \[ P[Y = 1] = P[HT] + P[TH] = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}. \] On the other hand, \(Y = 0\) corresponds to the singleton event \(\{TT\}\), so \[ P[Y = 0] = P[TT] = \frac{1}{4}. \] This leads us to the following definition:

Definition A random variable is uniform if all of its values have equal probability.

In our example above, the random variable \(X\) is uniform but \(Y\) is not. With \(X\), the two possible values are \(H\) and \(T\), and both of those have probability 1/2. With \(Y\), however, there are three possible values, with one of those values being twice as likely as either of the other two.

Proof Exercise Prove formally that, if \(\Omega\) is a discrete probability space and \(X\) is a uniform random variable on \(\Omega\), then the set of values of \(X\) must be finite. (This should be very similar to the last “Proof Exercise.”)

Here is another useful notion that comes up often when dealing with random variables:

Definition Suppose \(X\) is a random variable whose values are real numbers. The expected value (or expectation) of \(X\), denoted \(E[X]\), is defined by \[ E[X] = \sum_{\text{values } a} a \cdot P[X = a]. \]

For example, in the experiment involving two coin flips, the random variable \(Y\) which counts the number of heads has real number values. Its expectation is \[ E[Y] = 0 \cdot \frac{1}{4} + 1 \cdot \frac{1}{2} + 2 \cdot \frac{1}{4} = 1. \]
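
Code Sketch If you ever want to double-check such a calculation by brute force, here is a tiny sketch that enumerates the sample space \(\Omega = \{HH, HT, TH, TT\}\) and computes \(P[Y = 1]\) and \(E[Y]\) directly.

omega = ["HH", "HT", "TH", "TT"]
P = {outcome: 1/4 for outcome in omega}                  # uniform probability space
Y = {outcome: outcome.count("H") for outcome in omega}   # number of heads

prob_Y_equals_1 = sum(P[o] for o in omega if Y[o] == 1)
expectation = sum(Y[o] * P[o] for o in omega)
print(prob_Y_equals_1, expectation)    # 1/2 and 1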

Exercise Consider the experiment where you roll a pair of fair dice. Let the random variable \(X\) denote the sum of the dice rolls.

  1. What are the possible values of \(X\)?
  2. What is \(P[X = 7]\)?
  3. What is \(P[X = 7 \text{ or } 11]\)?
  4. What is \(E[X]\)?

Exercise Below are four versions of the same game. Your archnemesis gets to pick the version of the game, and then you get to choose how many times to flip a coin: 10 times or 100 times. Identify how many coin flips you should choose for each version of the game. It costs $1 to play each game. Explain your reasoning.

  1. If the proportion of heads is larger than 0.60, you win $1.
  2. If the proportion of heads is larger than 0.40, you win $1.
  3. If the proportion of heads is between 0.40 and 0.60, you win $1.
  4. If the proportion of heads is smaller than 0.30, you win $1.

Exercise Let’s say we play a game where we keep flipping a fair coin until we’ve seen either one heads or three tails overall.

  1. What is the probability that you see strictly more tails than heads?
  2. What is the probability that you see strictly more heads than tails?
  3. How many flips are expected on average?
  4. How many tails are expected on average?

3.4. Interlude: Conditional Probability

We now introduce a definition that looks very simple at first, but actually leads to many counter-intuitive results. So before getting to the definition, consider the following situation. Suppose Kambili and Amaka both secretly flip two fair coins. Kambili announces that her second flip was heads. Amaka announces that she had at least one heads. Who’s more likely to have flipped two heads? In other words, if you had to make a bet about who flipped more heads, who would you bet on? We’ll discuss the answer to this question in more detail below. For now, just think about what your intuition says.

Definition Fix a probability space. Given two events \(A\) and \(B\), we define the conditional probability \(P[A \mid B]\) by \[ P[A \mid B] = \frac{P[A \cap B]}{P[B]}, \] where \(A \cap B\) is the set of all outcomes that are in both \(A\) and \(B\). The conditional probability \(P[A \mid B]\) is also called the probability of \(A\) given \(B\).

The intuition here is that \(P[A \mid B]\) represents how confident we are that \(A\) happens given that we already know that \(B\) happens.

Let’s now see this definition in action. For many people (but not everyone), the intuitive answer to the question posed at the beginning of this section is, “both are equally likely.” If that’s what you thought, you’re in good company. Unfortunately, it’s not correct!

To see why, let’s formalize what’s going on. The experiment we’re considering involves two coin flips, so we’re again dealing with the uniform probability space whose sample space is \(\Omega = \{HH, HT, TH, TT\}\). The event of interest \(A\) is the singleton event that both flips were heads, ie, \(A = \{HH\}\).

In Kambili’s situation, we know that her second flip was heads. In other words, we’re restricting ourselves to the event \(B_1 = \{HH, TH\}\), and we have \[ P(A \mid B_1) = \frac{P(A \cap B_1)}{P(B_1)} = \frac{P(A)}{P(B_1)} = \frac{1/4}{1/2} = \frac{1}{2}. \] This tells us that the probability that Kambili has two heads is 1/2. In Amaka’s situation, we only know that at least one of her flips was heads. In other words, we’re now restricting ourselves to the event \(B_2 = \{HH, TH, HT\}\), and we have \[ P(A \mid B_2) = \frac{P(A \cap B_2)}{P(B_2)} = \frac{P(A)}{P(B_2)} = \frac{1/4}{3/4} = \frac{1}{3}. \] This tells us that the probability that Amaka has two heads is 1/3.

In other words, it’s more likely that Kambili has two heads! If we had to bet on one of the two having more heads based on the information we’ve been given, we should bet on Kambili.
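
Code Sketch As a sanity check, these conditional probabilities can also be computed by brute-force counting over the sample space. Here is a minimal sketch; since the space is uniform, conditional probabilities reduce to ratios of counts.

omega = ["HH", "HT", "TH", "TT"]            # uniform: each outcome has probability 1/4
A  = {"HH"}                                  # both flips are heads
B1 = {o for o in omega if o[1] == "H"}       # Kambili: the second flip is heads
B2 = {o for o in omega if "H" in o}          # Amaka: at least one flip is heads

def cond_prob(A, B):
    return len(A & B) / len(B)               # valid because the space is uniform

print(cond_prob(A, B1), cond_prob(A, B2))    # 1/2 and 1/3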

Here are some exercises to practice with:

Exercise There are three coins in a bag. One is a normal quarter: one side is heads, the other side is tails. The second coin is almost identical except that both sides are heads; similarly, both sides of the third coin are tails. You shake the bag around to shuffle the coins. You then close your eyes, pull one coin out at random, put it down on a table, and then open your eyes. You see heads. What is the probability that the other side of the coin is also heads?

Exercise Explain what this xkcd comic has to do with conditional probability.

Exercise Suppose 80% of people like peanut butter, 89% like jelly, and 78% like both. Given that a randomly sampled person likes peanut butter, what’s the probability that they also like jelly?

Exercise Lupus is a medical phenomenon where antibodies that are supposed to attack foreign cells to prevent infections instead see plasma proteins as foreign bodies, leading to a high risk of blood clotting. It is believed that 2% of the population suffers from this disease. A test for lupus is 98% accurate if a person actually has the disease, and 74% accurate if a person does not have the disease. There is a line from the Fox television show House that is often used after a patient tests positive for lupus: “It’s never lupus.” Do you think there is truth to this statement? Use appropriate probabilities to support your answer.

Harder Exercise Suppose you play the following game. You roll two fair dice. Let \(U\) be the sum of the numbers on the two dice.

What is the probability that you win?

The idea of conditional probability leads us to the following important definition that occurs frequently in probability theory:

Definition Fix a probability space. We say that two events \(A\) and \(B\) are independent if \[ P(A \cap B) = P(A) P(B). \]

It is often most convenient to reformulate this definition slightly. In the case that \(P(B) > 0\), observe that \(P(A \cap B) = P(A) P(B)\) is equivalent to \(P(A \cap B)/P(B) = P(A)\). But the left-hand side of this equation is \(P(A \mid B)\) by definition, so we see that independence of \(A\) and \(B\) is equivalent to asserting that \[ P(A \mid B) = P(A). \] This statement should be interpreted as follows: if \(A\) and \(B\) are independent, our confidence in \(A\) happening does not change at all even if we’re told that \(B\) happened (or that it did not happen). More loosely, knowing whether or not \(B\) happens tells us “nothing” about whether or not \(A\) happens!

Exercise The American Community Survey is an ongoing survey that provides data every year to give communities the current information they need to plan investments and services. The 2010 American Community Survey estimates that 14.6% of Americans live below the poverty line, 20.7% speak a language other than English at home, and 4.2% fall into both categories. Is the event that a randomly chosen American lives below the poverty line independent of the event that the person speaks a language other than English at home?

Exercise A bag contains 5 red marbles and 3 blue marbles. Two marbles are drawn randomly from the bag: Alejandra takes the first one and Beatrice takes the second. Is the event that Alejandra’s marble is blue independent of the event that Beatrice’s marble is blue?

Exercise Fix a probability space \(\Omega\) and two events \(A\) and \(B\). Let \(\neg B\) denote the event that \(B\) does not happen, ie, \(\neg B = \Omega \setminus B\). Show that \(A\) and \(B\) are independent if and only if \(A\) and \(\neg B\) are independent.

There is also a version of this definition for random variables.

Definition Fix a probability space. Two random variables \(X\) and \(Y\) are independent if the events \(X = a\) and \(Y = b\) are independent for all pairs \((a, b)\) where \(a\) is a value of \(X\) and \(b\) is a value of \(Y\).

3.5. Index of Coincidence

Imagine starting with some text consisting of only uppercase alphabet characters. Separate all of the letters from each other. Throw them all into a bag to shuffle them up. Then draw two letters from it without replacement. What is the probability that the two letters you drew are the same?

Let \(N\) be the total number of letters in the bag. For each letter \(\alpha\), let \(N_\alpha\) be the number of times that \(\alpha\) appears in the bag. Then \(N_\alpha/N\) is the probability that \(\alpha\) is the first letter you draw from the bag. After you draw that letter, there are \(N-1\) letters left in the bag and \(\alpha\) occurs only \(N_\alpha-1\) times, so the probability of drawing \(\alpha\) again is \((N_\alpha-1)/(N-1)\). Thus the overall probability of drawing \(\alpha\) twice is \[ \frac{N_\alpha(N_\alpha - 1)}{N(N-1)}. \] The question we posed in the first paragraph, however, was not about the probability that we draw \(\alpha\) from the bag twice. It was about drawing two of the same letter, which can be any letter. This means we have to sum over all of the possible choices of \(\alpha\). In other words, the probability we’re looking for is actually \[ \sum_{\alpha \in \{\text{A}, \dotsc, \text{Z}\}} \frac{N_\alpha(N_\alpha - 1)}{N(N-1)}. \] There is one step we did above that you should justify to yourself before proceeding:

Exercise Above we stated that the probability of drawing \(\alpha\) for our first letter is \(N_\alpha/N\) and the probability of drawing \(\alpha\) for our second letter is \((N_\alpha-1)/(N-1)\), so the probability of drawing two \(\alpha\)’s in a row is the product of these. Justify this “multiplication” of probabilities using conditional probabilities.

Hint. Consider the experiment of drawing two letters from the bag. Let \(A_\alpha\) be the event that the first letter is \(\alpha\) and \(B_\alpha\) the event that the second letter is \(\alpha\). What is \(P(A_\alpha)\)? What is \(P(B_\alpha \mid A_\alpha)\)?

Since there are 26 letters in the English alphabet, it turns out to be convenient to normalize this probability we computed above by multiplying by 26. The resulting number gets a special name:

Definition Let \(N\) be the length of some text and, for each letter \(\alpha\), let \(N_\alpha\) be the number of times \(\alpha\) occurs in this text. The index of coincidence of the text, denoted \(\mathrm{IC}\), is the number \[ \mathrm{IC} = 26 \sum_{\alpha \in \{\text{A}, \dotsc, \text{Z}\}} \frac{N_\alpha(N_\alpha - 1)}{N(N-1)}. \]
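
Code Sketch Here is a short sketch of how one might compute the index of coincidence directly from this definition. The function name and the sample string are placeholders for illustration.

def index_of_coincidence(text):
    letters = [c for c in text.upper() if c.isalpha()]
    N = len(letters)
    total = sum(letters.count(alpha) * (letters.count(alpha) - 1)
                for alpha in "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    return 26 * total / (N * (N - 1))

print(index_of_coincidence("We hold these truths to be self-evident ..."))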

Here are some exercises to help you play with this definition.

Exercise Suppose that you start with a text where every letter appears equally often. What would the index of coincidence of such a text be?

Exercise What is the largest possible index of coincidence? What kind of text would result in an index of coincidence that’s as large as possible?

Important Exercise Let \(N\) be the length of some text and, for each letter \(\alpha\), let \(N_\alpha\) be the number of times \(\alpha\) occurs in this text. Let \(p_\alpha = N_\alpha/N\). Show that, if \(N\) is very large, then
\[ \mathrm{IC} \approx 26 \sum_{\alpha \in \{\text{A}, \dotsc, \text{Z}\}} p_\alpha^2. \]

As we’ve seen, relative frequencies of letters in English are roughly stable from text to text, at least for sufficiently long texts. Correspondingly, the index of coincidence is also roughly stable from text to text! Here is the relevant heuristic:

Heuristic The index of coincidence of long English plaintext is (a) typically around 1.75, and (b) almost always between 1.5 and 2.

SageCell Here is a SageCell to help convince you of the above heuristic. You’ll see several input boxes which default to passages from Emily Brontë’s Wuthering Heights, the Declaration of Independence, F. Scott Fitzgerald’s The Great Gatsby, and the Preamble to the Universal Declaration of Human Rights. The code outputs a table indicating the index of coincidence of each of these texts.

Based on the Important Exercise above, we know that \[ \mathrm{IC} \approx 26 \sum_{\alpha \in \{\text{A}, \dotsc, \text{Z}\}} p_\alpha^2. \] Notice that the right-hand side stays exactly the same if we perform a simple substitution on our text! For example, suppose the letter E has a relative frequency of 13% in some plaintext. Then \(p_{\text{E}}^2 \approx 0.13^2\) will appear as a summand in the above expression for the \(\mathrm{IC}\) of this plaintext. If we perform a simple substitution that converts the letter E to something else, say Q, then \(p_{\text{Q}}^2 \approx 0.13^2\) will appear as a summand of the \(\mathrm{IC}\) for the ciphertext. The subscript is different, but the value is exactly the same! Since we’re summing over all of the letters, the value of \(\mathrm{IC}\) does not change when we perform a simple substitution.

Exercise Suppose some English plaintext is encrypted using a transposition cipher (eg, rectangular transposition). How will the index of coincidence of the plaintext compare to that of the ciphertext?

This suggests that the index of coincidence can be used as a test for whether the cipher used to produce a given ciphertext involved a non-simple substitution. If no non-simple substitution was involved, we should see an index of coincidence that matches what’s expected of English plaintext based on the above heuristic. Otherwise, there should be some discrepancy!

Exercise Here are several ciphertexts that were encrypted using unknown ciphers. For which of them do you think that the encryption process may have involved a non-simple substitution? Calculate indices of coincidence and explain your reasoning.

Ciphertext 1:
Ciphertext 2:
Ciphertext 3:
Ciphertext 4:

Harder Exercise Try decrypting the ciphertexts from the previous exercise for which you decided that the encryption process did not involve a non-simple substitution! (At least one of these won’t be very easy.)

(Harder?) Sage Exercise Do you speak a language other than English? If so, edit the code above and replace the sample texts in order to find out what the typical index of coincidence is for that language!

Depending on what language you decide to work with, the code might involve varying amounts of editing. Here are some example remarks:

3.6. Breaking the Vigenère Cipher

With the index of coincidence in hand, we’re now ready to break the Vigenère cipher. Recall that this cipher was considered unbreakable for about 300 years…! The method we will describe is not the first method that was discovered, but it makes good use of the mathematical machinery we’ve developed.

The idea is as follows. Suppose we start with the following ciphertext, which we know to be encrypted using the Vigenère cipher.

We make a guess for the “period,” ie, the length of the key word. Suppose we guess a period of 5. We then break the ciphertext into rows of that length.

IKGWE RTZON XEULT EZWHW XTYHS ZKEG…

If the period we’ve guessed is correct, each column of this rectangle was encrypted using a Caesar cipher with the same shift. This means that we should be able to detect that our guess for the period is correct by examining indices of coincidence! If every column has an index of coincidence in the range that’s expected of English text, our guess for the period is probably correct. Otherwise, we should try a different period.

In our example, if we guess a period of 5, it turns out that the indices of coincidence of the five columns are 1.13, 1.11, 1.18, 1.15, and 1.15, respectively. This is well outside the range that’s expected of English text, so our guess 5 is probably wrong. So then we try a period of 6, and we find indices of coincidence of 1.75, 1.77, 1.73, 1.72, 1.79, and 1.74 — all in the right range! This means there’s an excellent chance that the period is 6.

Once we have a guess for the period, we’re in the clear! We then just have to figure out what the shift is for each column, and we can do this using basic frequency analysis: the most frequent letter in each column should correspond to E, so we should guess that the shift for that column is the shift that moves E to the most frequent letter. (If that doesn’t work we might try shifting so that the second most frequent letter in that column moves to E instead.)
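
Code Sketch A rough sketch of this whole procedure might look as follows; the function names are made up for illustration, the ciphertext is assumed to consist of uppercase letters only, and the actual SageCell below is more elaborate.

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def ic(text):
    N = len(text)
    return 26 * sum(text.count(a) * (text.count(a) - 1) for a in ALPHABET) / (N * (N - 1))

def columns(ciphertext, period):
    # column i consists of positions i, i + period, i + 2*period, ...
    return [ciphertext[i::period] for i in range(period)]

def guess_period(ciphertext, max_period=20):
    for period in range(1, max_period + 1):
        if all(1.5 <= ic(col) <= 2 for col in columns(ciphertext, period)):
            return period        # every column looks like English, so this is likely the period
    return None

def guess_shifts(ciphertext, period):
    # For each column, guess that its most frequent letter is the encryption of E.
    shifts = []
    for col in columns(ciphertext, period):
        most_common = max(ALPHABET, key=col.count)
        shifts.append((ALPHABET.index(most_common) - ALPHABET.index("E")) % 26)
    return shifts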

SageCell Here is a tool to help you break a Vigenère cipher using these ideas. Hit “Run.” You’ll see the following:

Below these input fields, the code outputs several things.

Decrypt the ciphertext! Hint: It’s one of the sample passages we’ve seen above.

Exercise The following ciphertext was encrypted using a Vigenère cipher. Find the corresponding plaintext. Also state what the key word used for encryption was.

3.7. Interlude: \(G\)-Test

Example Setup

To motivate this section, consider the following situation. Suppose that every registered voter in an imaginary county in the United States is classified into the mutually exclusive and exhaustive racial groups “White,” “Black,” “Hispanic,” and “Other.” Before proceeding any further, let’s stop to think through this:

Exercise Brainstorm as many reasons as you can about why it might be problematic (and/or overly simplistic) to categorize all of the registered voters of our imaginary county into one and only one of the racial categories “White,” “Black,” “Hispanic,” and “Other.”

I urge you to think carefully about issues like the ones you may have identified in the above exercise whenever someone proposes some clear-cut categorization scheme to you. In any case, we’ll continue this section with our simplistic modeling assumption that these four racial categories are mutually exclusive and exhaustive. We’ll see later that there are situations where assuming a mutually exclusive and exhaustive categorization is not so problematic, including in problems related to cryptography.

Suppose that, by inspecting the voter rolls, we find that the racial distribution of this county is as follows:

White Black Hispanic Other Total
Distribution 72% 7% 12% 9% 100%

Since jurors are supposed to be drawn from the list of registered voters, we might hope that a random sample of jurors would follow this same racial distribution. Suppose that we actually sample 275 jurors and observe the racial distribution displayed in the second row of the table below.

White Black Hispanic Other Total
Distribution 72% 7% 12% 9% 100%
Observed 210 10 20 35 275
Expected 198 19.25 33 24.75 275

If our random sample of jurors followed the overall racial distribution of registered voters, we would expect that 72% of them would be White, which would be \(0.72 \cdot 275 = 198\) people. We can calculate the expected numbers of jurors in the other groups similarly to fill in the third row above.

Note that, even if the sample of jurors was representative of the county’s electorate, there would be some deviation from the expected counts. For example, remember that we’re assuming that the categories are mutually exclusive, so we could not possibly observe a sample with 19.25 Black jurors! But if we had observed something like 198 White jurors, 19 Black jurors, 33 Hispanic jurors, and 25 Other jurors — or something very close to that — we probably would not be very surprised with our results. Stated differently, the data we collected would feel consistent with the hypothesis that the racial distribution of jurors matches the racial distribution of the electorate.

But it feels like what we observed was actually pretty far from the expected counts…! How can we quantify and make sense of this observation?

\(G\)-Test

The idea is to introduce a number that measures the difference between the observed and expected rows. There are a variety of numbers that can be used, but let us consider one that is often denoted \(G\).2 It is defined as follows:

Definition Suppose \(X\) is a random variable with finitely many values \(a_1, \dotsc, a_n\) and let \(p_i = P[X = a_i]\). Suppose we make \(N\) observations of the values \(a_1, \dotsc, a_n\) and that \(O_i\) is the number of observations of \(a_i\) that we made. Let \(E_i = N p_i\) and then define \[ G = 2 \sum_i O_i \ln \left( \frac{O_i}{E_i} \right). \] If \(O_i = 0\) for some \(i\), we set the corresponding summand \(O_i \ln(O_i/E_i) = 0\). If there exists an \(i\) such that \(E_i = 0\) but \(O_i \neq 0\), set \(G = \infty\).

In our example, the random variable \(X\) represents observing the race of a randomly drawn voter from our county. It has 4 possible values (White, Black, Hispanic, Other), so \(n = 4\). The values \(p_i\) are the percentages of the electorate in each racial group, the values \(O_i\) are the observed counts, and the values \(E_i\) are the expected counts. We have \[ G = 2 \left( 210 \ln \left( \frac{210}{198} \right) + 10 \ln \left( \frac{10}{19.25} \right) + 20 \ln \left( \frac{20}{33} \right) + 35 \ln \left( \frac{35}{24.75} \right) \right) \approx 15.84 \]
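
Code Sketch For the record, here is a quick snippet that reproduces the calculation of \(G \approx 15.84\) above.

from math import log    # natural logarithm

observed = [210, 10, 20, 35]          # White, Black, Hispanic, Other
expected = [198, 19.25, 33, 24.75]
G = 2 * sum(O * log(O / E) for O, E in zip(observed, expected))
print(G)    # approximately 15.84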

Here is a fundamental observation about this quantity:

Gibbs’ Inequality We always have \(G \geq 0\). Moreover, \(G = 0\) if and only if \(O_i = E_i\) for all \(i = 1, \dotsc, n\).

Observe that if \(O_i = E_i\) for all \(i\), then \(\ln(O_i/E_i) = \ln(1) = 0\) for all \(i\), so \(G = 0\). This proves one small part of the above theorem; we will omit the rest of the proof, but if you’re a proof-inclined person, you can find some accessible proofs on Wikipedia. The key idea to take away is that \(G\) is 0 when the observed counts exactly match up with the expected counts, and that \(G\) gets big when the observed counts differ significantly from the expected counts.

How big is “big”? In particular, in our example, is 15.84 a “big” value of \(G\)? The answer to this question is provided by the following difficult theorem from probability theory, which we state slightly imprecisely and explain in a bit more detail after the statement.

Wilks’ Theorem Suppose the \(N\) observations of the values \(a_1, \dotsc, a_n\) that we make are in fact independent observations of the random variable \(X\). For large values of \(N\), the values of \(G\) are well approximated by a chi-square distribution with \(n-1\) degrees of freedom.

Several points of explanation are now in order. First, a “chi-square distribution with \(k\) degrees of freedom” is a certain function \(f_k\) defined on \([0, \infty)\) and taking non-negative values everywhere with total integral equal to 1.3 In other words, we have \(f_k(x) \geq 0\) for all \(x \geq 0\) and
\[ \int_0^\infty f_k(x)\, dx = 1. \] The formula for \(f_k(x)\) is complicated and also unimportant for our purposes, because as we will note below, most well-developed programming languages have built-in functions that know how to calculate relevant integrals of \(f_k\). Still, it’s useful to at least have some geometric intuition about what the functions \(f_k\) look like for varying values of \(k\). This is what the following SageCell does.

SageCell The following code plots a graph of the function \(f_k\) described above. Hit “Run.” It defaults to \(k = 3\), but then change the first line of the code to see what happens for other positive integers \(k\)! Make sure to pay attention to the scale on the axes as you explore.

Second, to say that “the values of \(G\) are well approximated by a chi-square distribution with \(n-1\) degrees of freedom” is to say that, for any (not necessarily finite) interval \((a, b)\), the probability that \(G\) lands inside the interval \((a, b)\) is approximately \[ \int_a^b f_k(x)\, dx. \] Sage can compute these types of integrals for us when \(a = 0\) and \(b\) is finite, but we can always reduce the general case to this special case.

SageCell The following Sage code calculates for us that \[ \int_0^{15.84} f_3(x)\, dx \approx 0.999. \] Notice where the degrees of freedom 3 goes, and also where the upper endpoint of integration 15.84 goes. The name cum_distribution_function stands for “cumulative distribution function,” which means that the output is the total area under the graph from 0 up to the input value.
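
Code Sketch In case you want to reproduce that computation yourself, the relevant Sage call looks something like the following (this is a sketch, not necessarily the exact code in the cell).

T = RealDistribution('chisquared', 3)        # chi-square with 3 degrees of freedom
print(T.cum_distribution_function(15.84))    # approximately 0.999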

It follows that \[ \int_{15.84}^\infty f_3(x)\, dx = 1 - \int_0^{15.84} f_3(x)\, dx \approx 1 - 0.999 = 0.001. \]

The number 0.001 is our p-value. It is telling us that the probability of observing a value of \(G\) that is bigger than 15.84 is only about 0.1%! That is quite a small probability, so our calculation suggests that the value of \(G\) that we saw is in fact quite large.

Stated differently, our p-value of 0.001 tells us that, if jurors in this county were truly representative of the county’s electorate, there would only be roughly a 0.1% chance of seeing a sample that deviated at least as much from the expected counts as the data that we saw. Because that’s such a small probability, this suggests that it’s very unlikely that our sample of jurors is actually representative of the county’s electorate. We have quantified the observation we made informally above.

If you go back and look at the statement of Wilks’ Theorem, you’ll see one detail that we’ve ignored so far, which brings us to the third point of explanation. The theorem only works for “large values of \(N\).” Was our value of \(N = 275\) large enough to justify what we did above?

The answer to this question depends on how well you want the values of \(G\) to be approximated by a chi-square distribution. The better an approximation you want, the higher a value of \(N\) you need. That being said, statisticians have found through practice that the following heuristic generally works well:

Heuristic Addendum to Wilks’ Theorem The approximation of \(G\) by a chi-square distribution with \(n-1\) degrees of freedom is “good enough” as long as the vast majority of the expected counts \(E_1, \dotsc, E_n\) are all at least 5.

In our situation, all of the expected counts are well above 5, so we do not need to worry.

This process (of computing expected counts, finding an observed value of \(G\), and using a chi-square approximation to find a p-value, ie, the probability of observing a larger value of \(G\) than what we observed if the observations do in fact come from the theoretical distribution) is called a \(G\)-test. It is a very useful technique for a lot of problems in statistics, and below, we will apply it to codebreaking! Meanwhile, here are some problems to practice with the ideas you’ve encountered here.

Exercise Use the cum_distribution_function in Sage to calculate the following integrals.

  1. \(\displaystyle \int_0^{20} f_{10}(x)\, dx\)
  2. \(\displaystyle \int_{10}^{\infty} f_{3}(x)\, dx\)
  3. \(\displaystyle \int_{10}^{20} f_{15}(x)\, dx\)

Exercise A professor using an open source introductory statistics book predicts that 60% of the students will purchase a hard copy of the book, 25% will print it out from the web, and 15% will read it online. At the end of the semester she asks her students to complete a survey where they indicate what format of the book they used. Of the 126 students, 71 said they bought a hard copy of the book, 30 said they printed it out from the web, and 25 said they read it online. How well does this data fit the professor’s predictions? Run a \(G\)-test to find out!

We conclude this section with a technical remark regarding the definition of \(G\). Feel free to skip over this at first pass, but make sure to come back to it at some point soon!

Technical Remark You’ll notice that \(G\) is defined to be \(\infty\) if there is a nonzero observed count corresponding to an expected count of 0. This makes theoretical sense: if there’s a value we theoretically expect to be impossible to observe, but then we observe it, the distance from the observed distribution to the expected distribution should be massive!

In practice, if you see an expected count of 0, it’s worth thinking carefully to decide if this really makes sense. We’ll see in examples below that sometimes, the way we compute expected counts will lead to certain expected counts being 0 when they maybe “shouldn’t” be, in which case it may be useful to “fudge” the expected count a little bit.

3.8. Breaking Rectangular Transposition

Suppose we have a long passage which we know to be encrypted using rectangular transposition:

Here is a strategy for breaking the code.

We start by making an arbitrary guess for the “period,” ie, the length of the key word. We know that the period has to be a divisor of the length of the ciphertext. The above ciphertext is 2808 characters long, and 2808 has 32 divisors:

\[ 1, 2, 3, 4, 6, 8, 9, 12, 13, 18, 24, 26, 27, 36, 39, 52, 54, 72, \\ 78, 104, 108, 117, 156, 216, 234, 312, 351, 468, 702, 936, 1404, 2808 \]

Since there are only 26 characters in the English alphabet, the period can be at most 26.4 This means that the period must be one of the following:

\[ 1, 2, 3, 4, 6, 8, 9, 12, 13, 18, 24, 26 \]

Let’s guess that the period is 6. We then arrange our ciphertext into a rectangle of width 6. In this case, our rectangle will have height \(N = 2808/6 = 468\). That is a lot of lines to display, so for ease of reading, I will truncate displaying this rectangle, but remember that the rectangle is actually very long.

OIPWMJ
ALWSLE
LJLYEA
MENUAB
IHSDAC
ESRTIE
EMKHAO
AMNPAI
IELNAP
RCAAH…

Now for every pair of numbers \(i \neq j\) between 1 and 6, consider the tall rectangle of width 2 we would get by placing the \(i\)th and \(j\)th columns of the above rectangle next to each other. For example, if \(i = 4\) and \(j = 2\), we would get the following \(468 \times 2\) rectangle:

WI
SL
YJ
UE
DH
TS
HM
PM
NE
AC

We think of this as 468 observations of a pair of English letters if the columns \(i\) and \(j\) were consecutive in the plaintext. In particular, for every pair of letters \(\alpha\) and \(\beta\), we count the number of times that we see the sequence \(\alpha\beta\) appearing in this column. Let \(O^{(i,j)}_{\alpha\beta}\) be this number. We can see from our truncated example above that \(O^{(4,2)}_{\text{WI}}, O^{(4,2)}_{\text{SL}}\), etc, are all at least 1. (In fact, it turns out that if you do this counting for the whole ciphertext, we have \(O^{(4,2)}_{\text{WI}} = 1\) and \(O^{(4,2)}_{\text{SL}} = 2\).)

On the other hand, we can use a large sample of English to calculate the probability \(p_{\alpha\beta}\) of the pair \(\alpha\beta\) occurring in English text. We can use these to calculate expected counts \(E_{\alpha\beta} = N p_{\alpha\beta} = 468 p_{\alpha\beta}\) and then calculate a corresponding value of \(G\) using the observed counts \(O^{(i,j)}_{\alpha\beta}\). Let us call this \(G^{(i,j)}\). In other words, we have \[ G^{(i,j)} = 2 \sum_{\alpha\beta} O^{(i,j)}_{\alpha\beta} \ln \left( \frac{O^{(i,j)}_{\alpha\beta}}{E_{\alpha\beta}} \right). \] We can then assemble all of these values of \(G^{(i,j)}\) as \(i \neq j\) varies into a box of numbers: \[ \begin{bmatrix} \infty & G^{(1,2)} & G^{(1,3)} & G^{(1,4)} & G^{(1,5)} & G^{(1,6)} \\ G^{(2,1)} & \infty & G^{(2,3)} & G^{(2,4)} & G^{(2,5)} & G^{(2,6)} \\ G^{(3,1)} & G^{(3,2)} & \infty & G^{(3,4)} & G^{(3,5)} & G^{(3,6)} \\ G^{(4,1)} & G^{(4,2)} & G^{(4,3)} & \infty & G^{(4,5)} & G^{(4,6)} \\ G^{(5,1)} & G^{(5,2)} & G^{(5,3)} & G^{(5,4)} & \infty & G^{(5,6)} \\ G^{(6,1)} & G^{(6,2)} & G^{(6,3)} & G^{(6,4)} & G^{(6,5)} & \infty \end{bmatrix}\] Note that all of the diagonal entries of this box are set to \(\infty\) because we only compute \(G^{(i,j)}\) when \(i \neq j\). This is an arbitrary convention and the diagonal entries should just be ignored.

If we guessed the period correctly, then we should find that every row except one of them has one number that’s much smaller than all of the others — and this tells us something about how to permute the columns to find the plaintext! For example, if we find in the first row that \(G^{(1,4)}\) is much smaller than the other numbers, that tells us that columns 1 and 4 are likely to be consecutive in the plaintext — because the frequency distribution of the pairs that occur in the long \(468 \times 2\) rectangle displayed earlier is close to the frequency distribution of pairs that occur in English plaintext.

This would be an enormous amount of calculation to do by hand. If the period we’ve guessed is \(p\), there are \(p(p-1)\) choices for pairs \((i,j)\) where \(i \neq j\). For example, with \(p = 6\), there are 30 choices for such pairs, so we have 30 values of \(G^{(i,j)}\) to compute. Moreover, there are \(26^2 = 676\) possible choices for a sequence \(\alpha\beta\), so we have that many observed and expected counts to compute every time, and then a large sum to compute to find \(G^{(i,j)}\)…! So we’ll get a computer to do these calculations for us.

Before seeing what the result looks like, we pause for a technical remark. Feel free to ignore this at first pass, but you should come back to it at some point and try to internalize what it’s saying. It’s closely related to the technical remark at the end of the last section.

Technical Remark There is a technical problem related to how “large” our sample text is. Even if it is quite large, it could miss certain pairs of letters that are infrequent in English — but which are not strictly impossible in English. In the SageCell below, we’ll use a sample text that combines a passage from Fitzgerald’s The Great Gatsby, the entire Declaration of Independence, a passage from Brontë’s Wuthering Heights, the entire preamble to the Universal Declaration of Human Rights, and the entirety of Poe’s short stories “William Wilson” and “The Tell-Tale Heart” – but even in this fairly large sample, there are no instances of the pair OQ, which is not strictly impossible since it can occur in encodings of legitimate phrases like Kangaroo Queen! For all we know, some of these legitimate but infrequent sequences which don’t appear in our sample could very well occur in the encrypted message. But if one of them does happen to occur, our “expected” count based on the sample will be 0, which will lead to a \(G\) value of \(\infty\).

This is problematic. One solution is expanding the sample even more, but then the computations also get slower. Here we’ll adopt an alternative solution: whenever we see an expected count of 0, we’ll artificially go in and set it to be something nonzero but small. (The code below will use 0.1, but you’re welcome to change this if you like. This number is set towards the beginning of the code block, at the line that says epsilon = 0.1.) This way, we’ll always get finite \(G\) values, but we’ll still be capturing the fact that pairs that fail to occur in our sample should occur at most very infrequently in the decrypted ciphertext.
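
Code Sketch Here is a rough sketch of the kind of computation the SageCell below performs, including the epsilon fudge just described. The function names and the way the digraph probabilities are estimated are placeholders for illustration, not the actual code.

from math import log

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
epsilon = 0.1   # fudge value for pairs that never occur in the sample text

def pair_probabilities(sample):
    # estimate p_{alpha beta} from a large sample of English (uppercase letters only)
    pairs = [sample[k:k+2] for k in range(len(sample) - 1)]
    return {a + b: pairs.count(a + b) / len(pairs) for a in ALPHABET for b in ALPHABET}

def G_box(ciphertext, period, pair_probs):
    N = len(ciphertext) // period                          # number of rows in the rectangle
    cols = [ciphertext[i::period] for i in range(period)]
    box = [[float('inf')] * period for _ in range(period)]
    for i in range(period):
        for j in range(period):
            if i == j:
                continue
            observed = {}
            for r in range(N):                             # count pairs taken from columns i and j
                pair = cols[i][r] + cols[j][r]
                observed[pair] = observed.get(pair, 0) + 1
            G = 0
            for pair, O in observed.items():
                E = N * pair_probs.get(pair, 0)
                if E == 0:
                    E = epsilon                            # avoid an infinite G value
                G += 2 * O * log(O / E)
            box[i][j] = G
    return box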

Returning to our example, we find the following “\(G\)-box.”

\[ \begin{bmatrix} \infty & 1151.3 & 1090.2 & \color{green}485.5 & 1069.3 & 1005.0 \\ 1234.4 & \infty & 1228.3 & 1049.6 & \color{green}440.2 & 1148.6 \\ \color{green}437.5 & 1044.1 & \infty & 1004.1 & 1164.5 & 933.4 \\ 1154.7 & 1088.6 & 977.3 & \infty & 1115.7 & 1023.6 \\ 1137.2 & 1221.9 & \color{green}425.9 & 1100.0 & \infty & 1070.0 \\ 1003.7 & \color{green}442.3 & 944.9 & 1021.6 & 1086.1 & \infty \end{bmatrix} \]

The specific numbers we see in this box are not terribly important: what’s important is that every row except one has a number that’s significantly smaller than the other numbers on that row. The numbers that are smaller than the others on the same row are highlighted in green, and you’ll notice that every row except the fourth has a green entry. The fact that in row 1, the 4th column is much smaller than the other entries in that row suggests that columns 1 and 4 in our \(468 \times 6\) rectangle are consecutive. Similarly, in row 2, the fact that the 5th column is much smaller than others suggests that columns 2 and 5 are consecutive. If you look at all the relations you get this way and put them together, you should find that the above \(G\)-box leads us to think that the ordering of the columns is 6, 2, 5, 3, 1, 4. The fact that the 4th row doesn’t have an entry that’s much smaller than the others corresponds to the fact that the 4th column gets reordered to the end.
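
Code Sketch Assembling the ordering of the columns from the row minima can itself be automated. Here is a small sketch that follows the chain of “much smaller” entries; the threshold used to decide what counts as “much smaller” is a made-up heuristic, and columns are labeled starting from 1 to match the discussion above.

def ordering_from_G_box(box, threshold=0.5):
    p = len(box)
    successor = {}
    for i in range(p):
        row = sorted((box[i][j], j) for j in range(p) if j != i)
        if row[0][0] < threshold * row[1][0]:       # dramatically smaller than the rest of the row
            successor[i + 1] = row[0][1] + 1        # column i+1 is followed by this column
    last = [c for c in range(1, p + 1) if c not in successor][0]   # the row with no small entry
    order = [last]
    while len(order) < p:                           # follow the chain backwards from the last column
        prev = [a for a, b in successor.items() if b == order[0]][0]
        order.insert(0, prev)
    return order

# Applied to the G-box displayed above, this should produce [6, 2, 5, 3, 1, 4].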

Here’s an exercise to practice producing this ordering:

Exercise Suppose that, when trying to break rectangular transposition, you find “\(G\)-boxes” of the following forms, where the exclamation mark indicates an entry that is much smaller than every other entry on its row. Write down the corresponding decrypting permutation (ie, the ordering of the columns in the plaintext) that this configuration of values suggests.

  1. \(\begin{bmatrix} . & . & . & ! & . \\ . & . & . & . & ! \\ . & . & . & . & . \\ . & ! & . & . & . \\ . & . & ! & . & . \end{bmatrix}\)

  2. \(\begin{bmatrix} . & ! & . & . & . \\ . & . & . & . & . \\ ! & . & . & . & . \\ . & . & ! & . & . \\ . & . & . & ! & . \end{bmatrix}\)

  3. \(\begin{bmatrix} . & . & . & ! & . & . & . \\ . & . & . & . & . & . & ! \\ . & ! & . & . & . & . & . \\ . & . & ! & . & . & . & . \\ . & . & . & . & . & . & . \\ . & . & . & . & ! & . & . \\ . & . & . & . & . & ! & . \\ \end{bmatrix}\)

This pattern (in every row except one, there is a number much smaller than all the others in that row) will most likely only happen when we guess the period correctly. If our period was incorrect, we’ll see something different. In our example, if we guessed that the period is 4, we would find the following box:

\[ \begin{bmatrix} \infty & 1272.1 & 1355.2 & \color{orange} 861.3 \\ 1453.5 & \infty & 1419.3 & 1312.9 \\ \color{orange} 732.1 & 1353.3 & \infty & 1405.0 \\ 1302.4 & 1279.2 & 1297.5 & \infty \end{bmatrix} \]

Notice that all the entries in the second and fourth rows are similarly sized. Rows 1 and 3 do have one entry that’s a little smaller than the others, but the difference is not nearly as dramatic as we saw above. Of course, sometimes we might find a “\(G\)-box” that has the right pattern even though our guess is wrong. But in that case, we should see nonsense when we try decrypting using the corresponding permutation, and then we just make another guess for the period!

We’re now ready to play with a SageCell.

SageCell Here is a SageCell that does the above computations and helps you break rectangular transposition! Once you hit “Run,” here’s what you’ll see:

Based on these inputs, you’ll see the following outputs:

Note that you’ll only see the “Output” and “Remarks” if Sigma is a permutation of the numbers from 1 up to the period you guessed; otherwise you’ll see an error message here.

We saw above that the period should be 6 and that the ordering of the columns should be [6,2,5,3,1,4]. Adjust the period and sigma to these values and see if this works!

Exercise The following ciphertext was encrypted using rectangular transposition. Find the corresponding plaintext.

Exercise The following ciphertext was encrypted using rectangular transposition. Find the corresponding plaintext.

Sage Exercise with Minimal Programming One of the first lines of code above in the SageCell above says epsilon = 0.1. What is this line of code doing? What happens if you make this number smaller? What happens if you make it bigger? Explain why this makes sense.

3.9. Known Plaintext Attack

So far in this section, we’ve studied a few techniques for conducting ciphertext-only attacks on various ciphers, ie, attacks where the only information Eve has to begin with is which cipher was used to encrypt a message, with no information about the message itself. In this section, we’ll describe a method for conducting a known-plaintext attack on simple substitution, ie, an attack where Eve has some partial information about the plaintext itself. Specifically, we’ll consider the situation where Eve already knows that a certain word appears at least once in the plaintext. As it turns out, we can use the \(G\) test statistic to help us make good guesses here.

Suppose, for example, that Eve intercepts the following ciphertext, known to be encrypted using a simple substitution.

It’s the same ciphertext we studied in the “Frequency Analysis” section above, but this time, suppose further that Eve knows from the outset that the word LONDON occurs at least once in the plaintext. Let’s take a look at how just knowing this can significantly speed up the codebreaking process.

Notice that, in the word LONDON, the first 4 letters are all distinct, the 5th letter is the same as the 2nd, and the 6th letter is the same as the 3rd. This pattern will be preserved by simple substitution, so somewhere in the ciphertext, we should find such a sequence. If we can find such a pattern, it will suggest how the letters L, O, N, and D should be matched! As it turns out, there are 18 sequences in this ciphertext that fit this pattern:

SCUPCU
RACZAC
OFGJFG
ZUDRUD
ZUDRUD
SFZOFZ
ARZURZ
UDCZDC
RAFJAF
GKROKR
MUFJUF
SXQAXQ
SFZOFZ
UQZOQZ
SFZOFZ
RZUCZU
GCZWCZ
QZWCZW

There are some repeats in this list (ZUDRUD occurs twice and SFZOFZ occurs thrice), but even after eliminating repeats, there are still 15 distinct possibilities for sequences that might correspond to LONDON.

That’s already pretty good: if we were just brute forcing through possible substitutions, the number of possibilities for matching the letters L, O, N, and D would be \[ 26 \cdot 25 \cdot 24 \cdot 23 = 358,800, \] so we’ve already drastically decreased our search space (by a factor of \(358,800/15 = 23,920\)). But it turns out that we can do even better!
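
Code Sketch Finding all the substrings of the ciphertext that fit the repetition pattern of a known word is easy to automate. Here is a small sketch; the function names are made up, and the ciphertext is assumed to consist of uppercase letters only.

def pattern(s):
    # encode the repetition pattern of a string: LONDON becomes (0, 1, 2, 3, 1, 2)
    first_seen = {}
    return tuple(first_seen.setdefault(c, len(first_seen)) for c in s)

def pattern_matches(ciphertext, word):
    target = pattern(word)
    n = len(word)
    return {ciphertext[k:k+n] for k in range(len(ciphertext) - n + 1)
            if pattern(ciphertext[k:k+n]) == target}

# pattern_matches(ciphertext, "LONDON") should return the 15 distinct sequences listed above.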

The key observation is that the relative frequencies of the letters L, O, N and D in some sample of plaintext English should match the relative frequencies of the corresponding ciphertext letters. We can measure deviations between frequencies using \(G\), so we can compute \(G\) for each of the 15 possible matchings and use that to help guide our choices.

Let’s describe how to do one of these calculations of \(G\) in detail. Start by looking at some sample English text (eg, the Declaration of Independence) and throwing out all of the letters except L, O, N, and D. If we do so, we find the following distribution:

L O N D Total
Count in Sample 228 513 483 252 1476
Percent in Sample 15.4% 34.8% 32.7% 17.1% 100%

Suppose we think that the first possibility we listed above, SCUPCU, corresponds to LONDON. This would mean that S is matched to L, and C to O, and U to N, and P to D. We can then go through our ciphertext and count instances of S, C, U and P (throwing out all other letters), which lets us fill in the first row in the table below.

                             S       C       U       P    Total
Observed in Ciphertext     146     293     429      83      951
Expected in Ciphertext   146.9   330.5   311.2   162.4      951

To fill in the second row, we use the percentages we calculated in our table above. For example, we saw that 15.4% of the letters L, O, N, and D should be L, so if L corresponds to S in the ciphertext, we would expect \(0.154 \cdot 951 \approx 146.9\) letters of the ciphertext to be S. Now we can compute \(G\). \[ \begin{aligned} G &= 2 \sum O \ln \left(\frac{O}{E}\right) \\ &\approx 2 \left( 146 \cdot \ln \left( \frac{146}{146.9} \right) + 293 \cdot \ln \left( \frac{293}{330.5} \right) + 429 \cdot \ln \left( \frac{429}{311.2} \right) + 83 \cdot \ln \left( \frac{83}{162.4} \right) \right) \\ &\approx 91.6. \end{aligned} \]
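
The following sketch reproduces this computation, with the observed counts and the sample proportions hard-coded from the two tables above.

```python
from math import log

# Proportions of L, O, N, D in the sample text (from the first table above)
sample_prop = {"L": 228/1476, "O": 513/1476, "N": 483/1476, "D": 252/1476}

# Observed ciphertext counts if SCUPCU is matched to LONDON:
# S plays the role of L, C of O, U of N, and P of D.
observed = {"L": 146, "O": 293, "N": 429, "D": 83}
total = sum(observed.values())   # 951

# G = 2 * sum of O * ln(O / E), where E = (sample proportion) * total
G = 2 * sum(obs * log(obs / (sample_prop[ch] * total))
            for ch, obs in observed.items())
print(G)   # approximately 91.6, as computed above
```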

Exercise The values of \(G\) are approximated by a chi-square distribution with \(k\) degrees of freedom. What is \(k\)? What is the value of the integral \[ \int_{91.6}^\infty f_k(x)\, dx, \] ie, the p-value? Based on this, does it seem likely that SCUPCU is a correct match for LONDON?

We can then do analogous calculations of \(G\) for the other 14 possibilities. Here are the results:

Ciphertext Match \(G\)
SCUPCU 91.6
RACZAC 602.9
OFGJFG 34.5
ZUDRUD 442.5
SFZOFZ 7.8
ARZURZ 160.5
UDCZDC 354.4
RAFJAF 597.0
GKROKR 490.8
MUFJUF 39.3
SXQAXQ 284.8
UQZOQZ 223.0
RZUCZU 444.8
GCZWCZ 162.4
QZWCZW 533.3

Recall now that smaller values of \(G\) mean that the distributions are closer together, so the most likely match for LONDON appears to be SFZOFZ. Of course, it’s possible that SFZOFZ is not the correct match, but if we make that guess and the decryption then seems to be turning into nonsense, we can just go back and try the option with the next-lowest value of \(G\).

As we’ve seen, it’s usually easy to guess which letters correspond to E and T. We now have a systematic way of identifying the letters L, O, N, and D as well. We’ve made significant progress!

SageCell Here’s a SageCell to play with. It’s very similar to the Frequency Analysis SageCell we had earlier, but now there is one additional input: a textbox labeled “Known” which should contain plaintext words you know to be present in the encrypted message. Correspondingly, there is one additional output: a table labeled “Match” which shows all the patterns which match the known word, together with their corresponding \(G\) values and the pairs that would need to be added to the guessed keys in order to make that pattern fit.

Note that, as you make guesses for the key, the number of pattern matches can decrease. For example, if you guess that ciphertext letter S corresponds to plaintext letter L, only 3 of the 15 matches described above will remain, namely the 3 in which the first letter is S! Similarly, if you guess that the ciphertext letter R corresponds to plaintext letter E, then every pattern match containing the letter R will be eliminated, since E does not occur in the word LONDON.
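
If you’re curious how this filtering might be implemented, here is a rough sketch (not the SageCell’s actual code): given the current guesses for the decryption key, it discards any candidate match whose forced letter assignments contradict those guesses.

```python
def consistent_matches(matches, known_word, guesses):
    """Keep only candidate matches that don't contradict the current key guesses.
    `guesses` maps ciphertext letters to guessed plaintext letters."""
    keep = []
    for m in matches:
        pairs = dict(zip(m, known_word))   # ciphertext -> plaintext forced by this match
        # contradiction 1: a ciphertext letter in the match is already guessed differently
        ok = all(guesses.get(c, p) == p for c, p in pairs.items())
        # contradiction 2: a guessed ciphertext letter outside the match already claims
        # one of the plaintext letters this match needs
        others = {p for c, p in guesses.items() if c not in pairs}
        if ok and not others & set(pairs.values()):
            keep.append(m)
    return keep

# Example: guessing that ciphertext R decrypts to E eliminates every match containing R.
print(consistent_matches(["SCUPCU", "SFZOFZ", "RACZAC"], "LONDON", {"R": "E"}))
# -> ['SCUPCU', 'SFZOFZ']
```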

Try decrypting. It’s the same message as last time, but you should find that knowing which ciphertext letters correspond to LONDON speeds up the process significantly.

Exercise The following ciphertext was encrypted using a simple substitution. The plaintext is known to contain the word CARPENTER. Find the corresponding plaintext.
Exercise The following ciphertext was encrypted using a simple substitution. The plaintext is known to contain the word MOSCOW. Find the corresponding plaintext.

Harder Sage Exercise with No Programming Involved When you did the above codebreaking exercises, you may have noticed that the \(G\) value associated to a given pattern match changes as you make guesses for the decryption key. Look through the code and explain why this happens. Does this behavior make sense to you?

3.10. Perfect Secrecy

At this point, we’re ready to think about cryptosystems in some generality. We’ll introduce some abstractions that encompass all of the types of ciphers we’ve seen so far, but the point is not the generality; the goal in this section is to see what it means to say that “the one-time pad achieves perfect secrecy,” and to prove this.

This section may be a little challenging, especially if you haven’t done much with proofs before; focus on understanding the definitions and theorem statements at first, and you can return to the proofs later if you like.

Let’s first introduce the general framework. From Eve’s perspective, it makes sense to consider Alice’s choice of a plaintext message as a random variable \(M\) whose set of values \(\mathscr{M}\) represents all possible plaintext messages. Note that, by saying that \(M\) is a random variable, we’re introducing probabilities associated to every possible plaintext message, and we assume that Eve knows this probability distribution. For example, \(\mathscr{M}\) might be the set of all strings in the letters A–Z, and we might deem the plaintext QUIZ to be less probable than the plaintext THEN on the basis that the former uses much less frequent letters than the latter. Alternatively, we could assign probabilities based on bigram frequencies instead of single-letter frequencies. It doesn’t matter how we assign probabilities; the point is just that any plaintext message has a probability assigned to it.

We can model known-plaintext attacks in at least two ways. Suppose, for example, that Eve knows that the plaintext must contain the word LONDON. The first way to model a known-plaintext attack is to assign a probability of 0 to every plaintext message that doesn’t contain LONDON as a substring. The second way is to simply eliminate all strings that don’t contain LONDON from our set of values \(\mathscr{M}\). It is convenient to use the latter method of modeling known-plaintext attacks, since it allows us to assume that \[ P[M = m] \neq 0 \] for all \(m \in \mathscr{M}\), and so we will assume this from here on out.

This brings us to the following definition of a cryptosystem.

Definition A cryptosystem (on \(M\)) consists of the following data:

  1. a set \(\mathscr{C}\) of possible ciphertexts;
  2. a set \(\mathscr{K}\) of possible keys, together with a random variable \(K\) taking values in \(\mathscr{K}\); and
  3. an encryption function \(E : \mathscr{K} \times \mathscr{M} \to \mathscr{C}\) and a decryption function \(D : \mathscr{K} \times \mathscr{C} \to \mathscr{M}\).

We require that \[ D(k, E(k, m)) = m \] for any key \(k \in \mathscr{K}\) and any plaintext message \(m \in \mathscr{M}\), ie, that a plaintext can be recovered from its encryption. Once we’ve specified the above data, we also define the random variable \(C = E(K, M)\) to model an observation of the ciphertext.

For example, consider the Caesar cipher. The sets \(\mathscr{M}\) and \(\mathscr{C}\) are both equal to the set of all possible strings in the capital letters A through Z, and the key space \(\mathscr{K}\) is the set of all possible shifts, ie, \(\mathscr{K} = \{0, 1, \dotsc, 25\}\). Given \(k \in \mathscr{K}\) and \(m \in \mathscr{M}\), the string \(E(k, m)\) is the result of applying the Caesar cipher with shift \(k\) to \(m\), ie, by adding \(k\) mod 26 to the numbers 0–25 corresponding to each letter A–Z in \(m\). To fully specify the cryptosystem, we should also assign probabilities to each of the keys; for example, we might assume that \(K\) is uniform.
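
To make this concrete, here is a minimal Python sketch of the Caesar cryptosystem, with a uniformly random key; the function names E and D simply mirror the notation above.

```python
import random
import string

ALPHABET = string.ascii_uppercase
KEYS = range(26)          # the key space {0, 1, ..., 25}

def E(k, m):
    """Caesar encryption: shift each letter of m forward by k places (mod 26)."""
    return "".join(ALPHABET[(ALPHABET.index(ch) + k) % 26] for ch in m)

def D(k, c):
    """Caesar decryption: shift each letter of c backward by k places (mod 26)."""
    return E((-k) % 26, c)

# Model K as a uniformly chosen key and check that D(k, E(k, m)) = m.
k = random.choice(KEYS)
m = "ATTACKATDAWN"
assert D(k, E(k, m)) == m
print(k, E(k, m))
```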

This is not necessarily the only way to fit the Caesar cipher into the above framework. For example, if you like, you can fix an upper bound on the length of the strings involved so that \(\mathscr{M}\) and \(\mathscr{C}\) are finite. You could also decide that the key space is just \(\{1, \dotsc, 25\}\), if you don’t want to consider a shift of 0 to be a valid key. You could also choose a non-uniform distribution for the random variable \(K\).

Exercise Use this definition to model a Polybius square cipher where the key box has our usual row and column labels ADFGVX. What is \(\mathscr{M}\) in this setting? What is \(\mathscr{C}\)? What is \(\mathscr{K}\), and how many elements does it have?

Exercise How would you use this definition to model rectangular transposition? More specifically, what would you take \(\mathscr{K}\) to be?

Note. There isn’t a unique “right” answer here! If you’ve seen group theory before, your answer may be different from the one you’d write down if you haven’t.

Proof Exercise For any cryptosystem, show that the number of ciphertexts must be at least as large as the number of plaintext messages.

Hint. The question is about comparing the cardinalities of the sets \(\mathscr{M}\) and \(\mathscr{C}\). To do this, fix some \(k_0 \in \mathscr{K}\) and consider the function \(E(k_0, -) : \mathscr{M} \to \mathscr{C}\).

Exercise Construct an example to show that it is possible to have strictly more ciphertexts than plaintext messages, even when every ciphertext is in fact in the range of the encryption function.

Here is the central definition of this section:

Definition A cryptosystem achieves perfect secrecy if \(M\) and \(C\) are independent random variables.

In other words, for any plaintext message \(m\) and any encrypted message \(c\), we should have \[ P[M = m \mid C = c] = P[M = m]. \] Heuristically, this means that observing \(C = c\) provides Eve with no additional information whatsoever about \(M\) — the best guesses that she might have made about the plaintext message before she intercepted the ciphertext do not change even after she intercepts the ciphertext!

The attacks on cryptosystems we’ve seen so far exploit the fact that those cryptosystems fail to achieve perfect secrecy. When we conduct frequency analysis on simple substitution, for example, we are using the fact that certain plaintexts become more likely after observing a given ciphertext: eg, the plaintexts in which the letter E appears in the same positions in which the most frequent letter of the ciphertext appears.

How do we achieve perfect secrecy? Well, the first observation to make is that achieving perfect secrecy requires having a lot of keys. More precisely, the statement is the following:

Lemma Suppose a cryptosystem achieves perfect secrecy. Then the number of keys must be at least as large as the number of “possible” ciphertexts (ie, ciphertexts \(c \in \mathscr{C}\) such that \(P[C = c] \neq 0\)).

Proof. Fix \(m_0 \in \mathscr{M}\) and recall our assumption that \(P[M = m_0] \neq 0\). Suppose \(c \in \mathscr{C}\) is a “possible” ciphertext. We then have \[ P[M = m_0 \mid C = c] = P[M = m_0] \neq 0, \] so there must exist a key \(k(c) \in \mathscr{K}\) such that \(E(k(c), m_0) = c\). If \(c' \in \mathscr{C}\) is another “possible” ciphertext and \(k(c) = k(c')\), then
\[ c = E(k(c), m_0) = E(k(c'), m_0) = c' \] so \(c \mapsto k(c)\) defines an injective function from possible ciphertexts into \(\mathscr{K}\). Thus the number of keys must be at least as large as the number of possible ciphertexts. \(\Box\)

Having observed this, here is a result that gives us a way to achieve perfect secrecy:

Perfect Secrecy Theorem A cryptosystem achieves perfect secrecy if it satisfies all of the following conditions:

  1. The random variables \(K\) and \(M\) are independent.
  2. The key space \(\mathscr{K}\) is finite and \(K\) is uniformly distributed on it.
  3. For every plaintext message \(m \in \mathscr{M}\) and every ciphertext \(c \in \mathscr{C}\), there is exactly one key \(k \in \mathscr{K}\) such that \(E(k, m) = c\).

Proof. Fix a ciphertext \(c_0 \in \mathscr{C}\) and observe that \[ \begin{aligned} P[C = c_0] &= P[E(K, M) = c_0] \\ &= \sum_{\substack{k \in \mathscr{K}, m \in \mathscr{M}, \\ E(k, m) = c_0}} P[K = k \text{ and } M = m] \\ &= \sum_{\substack{k \in \mathscr{K}, m \in \mathscr{M}, \\ E(k, m) = c_0}} P[K = k] P[M = m], \end{aligned}\] where we use independence of \(K\) and \(M\) for the last equality. For any plaintext message \(m \in \mathscr{M}\), we know by assumption that there exists a unique key \(k_0(m)\) such that \(E(k_0(m), m) = c_0\). Thus the above sum can be re-written \[ P[C = c_0] = \sum_{m \in \mathscr{M}} P[M = m] P[K = k_0(m)]. \] Since \(K\) is uniform, it has finitely many values, and if \(N\) is the number of values, we have \(P[K = k_0(m)] = 1/N\) for all \(m \in \mathscr{M}\), so \[ P[C = c_0] = \frac{1}{N} \sum_{m \in \mathscr{M}} P[M = m] = \frac{1}{N} \] since the events \(M = m\) as \(m \in \mathscr{M}\) varies form a partition of the probability space.

Now fix a plaintext \(m_0 \in \mathscr{M}\). Observe that \[ \begin{aligned} P[M = m_0 \text{ and } C = c_0] &= \sum_{k \in \mathscr{K}} P[K = k \text{ and } M = m_0 \text{ and } C = c_0] \\ &= \sum_{k \in \mathscr{K}} P[K = k \text{ and } M = m_0 \text{ and } E(k, m_0) = c_0]. \end{aligned} \] There exists a unique key \(k_0 \in \mathscr{K}\) such that \(E(k_0, m_0) = c_0\), so we have \[ P[K = k \text{ and } M = m_0 \text{ and } E(k, m_0) = c_0] = 0 \] unless \(k = k_0\). In other words, the above summation collapses to just a single term and we have \[ \begin{aligned} P[M = m_0 \text{ and } C = c_0] &= P[K = k_0 \text{ and } M = m_0 \text{ and } E(k_0, m_0) = c_0] \\ &= P[K = k_0 \text{ and } M = m_0] \\ &= P[K = k_0] P[M = m_0] \\ &= \frac{1}{N} \cdot P[M = m_0] \\ &= P[M = m_0]P[C = c_0], \end{aligned} \] where we have used independence of \(K\) and \(M\) for the third step, and the fact that \(P[C = c_0] = 1/N\) for the final step. \(\Box\)
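
If you’d like a sanity check of this theorem, here is a small sketch that verifies the conclusion numerically for a toy cryptosystem: single-letter messages, a uniformly random shift key chosen independently of the message, and an arbitrary (made-up) message distribution.

```python
from fractions import Fraction
from itertools import product

letters = range(26)                          # messages and ciphertexts are single letters 0-25
weights = [Fraction(i + 1) for i in letters] # an arbitrary, made-up message distribution
P_M = [w / sum(weights) for w in weights]
P_K = Fraction(1, 26)                        # the key (shift) is uniform and independent of M

# Joint distribution of (M, C), where C = M + K mod 26.
P_MC = {}
for m, k in product(letters, letters):
    c = (m + k) % 26
    P_MC[(m, c)] = P_MC.get((m, c), Fraction(0)) + P_M[m] * P_K

P_C = {c: sum(P_MC[(m, c)] for m in letters) for c in letters}

# Perfect secrecy: P[M = m and C = c] = P[M = m] * P[C = c] for every m and c.
assert all(P_MC[(m, c)] == P_M[m] * P_C[c] for m, c in product(letters, letters))
print("M and C are independent")
```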

Our next order of business is to show that “the one-time pad achieves perfect secrecy.” Remember that the one-time pad is a special case of the Vignère cipher, so let’s first think about how we can formally model the Vignère cipher using the framework we’ve introduced here. Fix a period \(p\) and let \(\mathscr{K}\) be the set of all sequences \((a_1, \dotsc, a_p)\) where \(a_1, \dotsc, a_p\) are all numbers 0–25. Also, let’s fix a message length \(r\) and let \(\mathscr{M}\) and \(\mathscr{C}\) be the set of all strings in the letters A–Z of length \(r\). Note that, if Eve intercepts a ciphertext of length \(r\) that she knows to be encrypted using a Vignère cipher, she also knows the plaintext must have had length \(r\), so she can assign probability 0 to all messages of other lengths; in other words, it is reasonable to eliminate all messages of other lengths from our message spaces.

We then have the Vignère encryption and decryption functions \(E : \mathscr{K} \times \mathscr{M} \to \mathscr{C}\) and \(D : \mathscr{K} \times \mathscr{C} \to \mathscr{M}\). Applying Vignère encryption to a message of length \(r\) will always just use the first \(\min\{r, p\}\) of the \(p\) numbers in our key sequence, so we might as well replace \(p\) with \(\min\{r, p\}\) and thereby assume that \(p \leq r\).

Lemma Using the notation we’ve just introduced, we have \(p = r\) if and only if, for every plaintext message \(m \in \mathscr{M}\) and every ciphertext \(c \in \mathscr{C}\), there exists a unique \(k \in \mathscr{K}\) such that \(E(k, m) = c\).

Proof. Let’s formally prove one direction: assume that for any \(m \in \mathscr{M}\) and \(c \in \mathscr{C}\), there is a unique \(k \in \mathscr{K}\) such that \(E(k, m) = c\). If we fix some message \(m_0 \in \mathscr{M}\), our assertion is that the function \(E(-, m_0) : \mathscr{K} \to \mathscr{C}\) is bijective, so \(\mathscr{K}\) and \(\mathscr{C}\) must have the same number of elements. But \(\mathscr{K}\) has \(26^p\) elements and \(\mathscr{C}\) has \(26^r\) elements, and \(26^p = 26^r\) implies \(p = r\).

For the converse, let’s sketch the idea of the proof in the context of an example. I will leave it to you to generalize this sketch if you are interested. Let’s suppose that \(p = r = 3\). We then want to show that, for any \(m \in \mathscr{M}\) and \(c \in \mathscr{C}\), there is a unique \(k \in \mathscr{K}\) such that \(E(k, m) = c\). Suppose for example that \(m\) is ZEE and that \(c\) is TOE. If we convert these messages to their corresponding number sequences, we have ZEE corresponding to \((25, 4, 4)\) and TOE corresponding to \((19, 14, 4)\). Then consider \[ k = (19, 14, 4) - (25, 4, 4) \bmod{26} = (-6, 10, 0) \bmod{26} = (20, 10, 0). \] This is the only key sequence with the property that encrypting ZEE with this key results in TOE! \(\Box\)
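
Here is a short sketch of the computation in this example: the unique key is obtained by subtracting the plaintext from the ciphertext letter by letter mod 26. The helper functions below are illustrative, and encrypt is just Vignère encryption with a key as long as the message.

```python
def to_nums(s):
    return [ord(ch) - ord("A") for ch in s]

def from_nums(nums):
    return "".join(chr(n % 26 + ord("A")) for n in nums)

def recover_key(plaintext, ciphertext):
    """The unique Vignère key (of the same length as the message) taking
    the plaintext to the ciphertext: subtract letter by letter mod 26."""
    return [(c - m) % 26 for m, c in zip(to_nums(plaintext), to_nums(ciphertext))]

def encrypt(key, plaintext):
    return from_nums(m + k for m, k in zip(to_nums(plaintext), key))

key = recover_key("ZEE", "TOE")
print(key)                         # [20, 10, 0]
assert encrypt(key, "ZEE") == "TOE"
```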

Proof Exercise Formalize the sketchy part of the argument for the above lemma.

Let’s recall some of the assumptions we made about the one-time pad when we first introduced it:

  1. The key is chosen uniformly at random from among all possible keys.
  2. The key is chosen independently of the message.
  3. The key is as long as the message, ie, \(p = r\), which by the lemma above is exactly the condition that every plaintext and ciphertext correspond to a unique key.

In other words, these three assumptions we made coincide exactly with the conditions that appear in the Perfect Secrecy Theorem! We have now proved what we want to prove:

Corollary The one-time pad achieves perfect secrecy.

Harder Exercise If you look back to when we described the one-time pad, you’ll notice that, besides the three assumptions we mentioned above, we also had a fourth assumption: that the key never be reused. Suppose that Eve intercepts two ciphertexts \(c_0\) and \(c_1\) of length \(r\) and that she knows that both were encrypted using a Vignère cipher with the same key of length \(r\). Describe a strategy that Eve might use to extract information about the corresponding plaintexts.

Exercise Our definition of “cryptosystem” does not include the assumption that \(K\) and \(M\) are independent. This is slightly nonstandard, but it was also intentional. Read about the autokey cipher, and then explain how to formally model the autokey cipher as a “cryptosystem” in two different ways: one in which \(K\) and \(M\) aren’t independent, and one in which \(K\) and \(M\) are independent.

Harder Exercise Suppose a cryptosystem achieves perfect secrecy.

  1. Is it true that \(K\) and \(M\) must be independent?
  2. Is it true that \(K\) and \(C\) must be independent?