Matt Mahoney
Florida Institute of Technology
mmahoney@cs.fit.edu
Shannon estimated the entropy of written English in 1950 by having human subjects guess successive characters in a string of text selected at random from various sources. He proved that if $p_r$ is the probability that the correct letter is found on the $r$-th guess, then the entropy $H$, in bits per character (bpc), satisfies:

$$\sum_r r\,(p_r - p_{r+1}) \log_2 r \;\le\; H \;\le\; \sum_r p_r \log_2 \frac{1}{p_r}$$
In one experiment, random passages were selected from Jefferson the Virginian by Dumas Malone. The subject was shown the previous 100 characters of text and asked to guess the next character until successful. The text was reduced to 27 characters (A-Z and space). Subjects were allowed to use a dictionary and character frequency tables (up to trigram) as aids. The following results (first two columns) were obtained from 100 trials for one subject and are typical. Since the counts only estimate $p_r$, I smoothed the data (somewhat crudely) by averaging the counts between r/2 and 2r; the smoothed values are in the third column.
r | $p_r$ | $p_r$ smoothed |
---|---|---|
1 | .80 | .8000 |
2 | .07 | .0700 |
3 | .00 | .0333 |
4 | .03 | .0225 |
5 | .04 | .0167 |
6 | .02 | .0157 |
7 | .01 | .0133 |
8 | .00 | .0100 |
9 | .01 | .0083 |
10 | .00 | .0046 |
11 | .00 | .0040 |
12 | .01 | .0025 |
13 | .00 | .0022 |
14 | .01 | .0016 |
15-27 | .00 | .0014-.0003 |
Hmin | .648 | .678 |
Hmax | 1.242 | 1.431 |
Table 1. Entropy (H) bounds from a Shannon game experiment, before and after smoothing, from Shannon (1950).
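As a check on Shannon's bounds, the following minimal sketch recomputes Hmin and Hmax from the unsmoothed column of Table 1. The function name `shannon_bounds` and the list `P_UNSMOOTHED` are illustrative names, not part of the original experiment, and the sketch assumes that ranks 15 to 27 have zero count and that terms with $p_r = 0$ contribute nothing to the upper bound.

```python
import math

# Unsmoothed guess-rank frequencies p_r from Table 1 (r = 1..14).
# Ranks 15-27 are assumed to be 0 and are omitted.
P_UNSMOOTHED = [0.80, 0.07, 0.00, 0.03, 0.04, 0.02, 0.01,
                0.00, 0.01, 0.00, 0.00, 0.01, 0.00, 0.01]


def shannon_bounds(p):
    """Return (h_min, h_max) in bits per character.

    p[0] is p_1, p[1] is p_2, and so on.
    """
    padded = list(p) + [0.0]  # p_{r+1} is taken as 0 past the last rank
    # Lower bound: sum over r of r * (p_r - p_{r+1}) * log2(r)
    h_min = sum(r * (padded[r - 1] - padded[r]) * math.log2(r)
                for r in range(1, len(p) + 1))
    # Upper bound: sum over r of p_r * log2(1 / p_r), skipping p_r = 0
    h_max = sum(q * math.log2(1.0 / q) for q in p if q > 0)
    return h_min, h_max


h_min, h_max = shannon_bounds(P_UNSMOOTHED)
print(f"Hmin = {h_min:.3f}, Hmax = {h_max:.3f}")
```

Run on the unsmoothed counts, this gives values that round to the 0.648 and 1.242 bpc shown in the second column of Table 1. The smoothed column is not recomputed here because the table gives only a range for ranks 15 to 27.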
The lower bound, Hmin, would be the entropy if each letter had a uniform distribution over its possible values: for instance, some letters could be predicted with certainty, some could only be one of two equally likely letters, some could only be one of three, and so on. This is obviously not the case. The upper bound, Hmax, would be the entropy if all of the letters had the same distribution. This is not the case either, since some letters, like those at the beginning of a word, are harder to guess than others. The true entropy must be somewhere in between, but its value is unknown.
The fundamental reason for the uncertainty is that the mapping from a probability distribution to a ranking is many-to-one. To overcome this difficulty, Cover and King (1978) had subjects assign probabilities directly, using a gambling game. However, this approach is tedious in practice, and it succumbs to the human tendency to assign artificially high probabilities to unlikely events (see Schwartz and Reisberg, 1991, pp. 552 ff.). This is the same human trait that explains the popularity of both insurance and lotteries. The unfortunate result is that entropy is overestimated. Cover and King obtained measurements of 1.3 to 1.7 bpc for individual subjects and 1.3 bpc when the results were combined.
I used the "large" (P6) neural network data compression model described in (Mahoney 1999). The model predicts successive characters one bit at a time, using only the last n = 5 characters as context. The program was modified to assign probabilities to whole characters and rank them in order to collect statistics for a Shannon game simulation. The true entropy (relative to the model) is determined by the compression ratio. The compression algorithm was not modified.
The Shannon game simulation results use unsmoothed counts. The large number of characters eliminates low counts, making smoothing unnecessary.
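The simulation procedure can be sketched as follows. This is not the P6 neural network: as a stand-in, it assumes a simple adaptive order-n character count model with add-one smoothing, and the names `shannon_game`, `ALPHABET`, and `sample` are hypothetical. What it illustrates is the bookkeeping: at each position the model's probabilities are ranked, the rank of the correct character is recorded (the Shannon game statistic), and the model's own code length, -log2 of the probability it assigned, is accumulated to give the entropy relative to the model.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 27 symbols, as in the reduced text


def shannon_game(text, n=2):
    """Simulate the Shannon game with an adaptive order-n count model.

    Returns (rank_counts, h_model): how often the correct character had
    each rank (1 = first guess), and the model's average code length in
    bits per character. The text must contain only ALPHABET symbols.
    """
    counts = defaultdict(Counter)  # context -> counts of following characters
    rank_counts = Counter()
    total_bits = 0.0
    for i, ch in enumerate(text):
        ctx = text[max(0, i - n):i]
        seen = counts[ctx]
        denom = sum(seen.values()) + len(ALPHABET)  # add-one smoothing
        prob = {c: (seen[c] + 1) / denom for c in ALPHABET}
        # Rank from most to least probable; ties broken by character code.
        ranking = sorted(ALPHABET, key=lambda c: (-prob[c], c))
        rank_counts[ranking.index(ch) + 1] += 1
        total_bits += -math.log2(prob[ch])
        seen[ch] += 1  # update the model only after it has predicted
    return rank_counts, total_bits / len(text)


sample = "the cat sat on the mat and the rat sat on the cat " * 20
ranks, h_model = shannon_game(sample, n=2)
p = [ranks[r] / len(sample) for r in range(1, len(ALPHABET) + 1)]
print(f"H (model) = {h_model:.3f} bpc, p_1 = {p[0]:.3f}")
```

The list `p` plays the role of the $p_r$ column in Table 1, so the Shannon bounds can be computed from it and compared with `h_model`, which is the comparison reported in the tables that follow.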
Input size | Hmin | H | Hmax | Interp. |
---|---|---|---|---|
10K | 2.013 | 3.028 | 3.131 | 0.907 |
100K | 1.600 | 2.375 | 2.670 | 0.724 |
731K | 1.416 | 2.061 | 2.448 | 0.625 |
Table 2. Entropy vs. Shannon game bounds as input size is increased.
The interpolation value is (H-Hmin)/(Hmax-Hmin), the fraction of the distance from Hmin to Hmax for the true value of H.
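As a small illustration, the interpolation column of Table 2 can be recomputed from the other three columns. The function name `interpolation` and the dictionary `TABLE_2` are illustrative only.

```python
def interpolation(h_min, h, h_max):
    """Fraction of the distance from h_min to h_max at which h lies."""
    return (h - h_min) / (h_max - h_min)


# Input size -> (Hmin, H, Hmax), copied from Table 2
TABLE_2 = {
    "10K": (2.013, 3.028, 3.131),
    "100K": (1.600, 2.375, 2.670),
    "731K": (1.416, 2.061, 2.448),
}

for size, (h_min, h, h_max) in TABLE_2.items():
    print(f"{size:>5}: {interpolation(h_min, h, h_max):.3f}")
```

The recomputed values agree with the tabulated ones to within about 0.001. The small differences presumably arise because the tabulated bounds are themselves rounded to three decimal places.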
In the second test, Lewis Carroll's Alice in Wonderland from Project Gutenberg (alice30.txt with the header removed) was reduced as in book1 to 135,059 characters. The full text was compressed using context lengths of n = 1, 2, 3, 4, and 5. The smaller contexts were obtained by disabling the input neurons for the larger contexts. No other parameters were changed in the model. Results are shown below.
Context size | Hmin | H | Hmax | Interp. |
---|---|---|---|---|
1 | 2.414 | 3.254 | 3.388 | 0.862 |
2 | 1.803 | 2.578 | 2.836 | 0.750 |
3 | 1.471 | 2.164 | 2.498 | 0.674 |
4 | 1.357 | 2.028 | 2.378 | 0.657 |
5 | 1.303 | 1.974 | 2.319 | 0.660 |
Table 3. Entropy vs. Shannon game bounds as context is increased.
The language model used here is rather crude, in that it models letters but not words, ignoring the syntactic and semantic constraints found in natural language. It is plausible that adding these constraints would allow whole words to be predicted, increasing the variation in letter predictability and lowering the interpolation value.
If we assume that the interpolation value is less than 0.6 for natural language, and apply this to Shannon's estimate of 0.6 to 1.3 bpc, we would conclude that the entropy of written English is less than about 1.0 bpc. If we instead take the range 0.678 to 1.431 bpc from the smoothed data, we would conclude that it is less than about 1.1 bpc, a value yet to be reached by the best language models.
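The arithmetic behind these two figures can be made explicit. Inverting the definition of the interpolation value and writing it as $a$ (a symbol introduced here for convenience), the implied entropy is $H = H_{min} + a\,(H_{max} - H_{min})$. Taking $a = 0.6$ as the assumed upper limit:

$$
\begin{aligned}
H &< 0.6 + 0.6\,(1.3 - 0.6) = 1.02 \text{ bpc} && \text{(Shannon's range)} \\
H &< 0.678 + 0.6\,(1.431 - 0.678) \approx 1.13 \text{ bpc} && \text{(smoothed Table 1 range)}
\end{aligned}
$$

These correspond to the rounded figures of roughly 1.0 and 1.1 bpc quoted in the text.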
Mahoney, Matthew V. (1999), Fast Text Compression with Neural Networks, submitted for publication.
Rosenfeld, Ronald (1996), "A Maximum Entropy Approach to Adaptive Statistical Language Modeling", Computer Speech and Language, 10.
Schwartz, Barry, and Daniel Reisberg (1991), Learning and Memory, New York: W. W. Norton and Company.
Shannon, Claude E. (1950), "Prediction and Entropy of Printed English", Bell Sys. Tech. J (3) p. 50-64.