Refining the Estimated Entropy of English by Shannon Game Simulation

Matt Mahoney
Florida Institute of Technology
mmahoney@cs.fit.edu

Abstract

Shannon (1950) estimated the entropy of written English to be between 0.6 and 1.3 bits per character (bpc), based on the ability of human subjects to guess successive characters in text. Simulations to determine the empirical relationship between the provable bounds and the known entropies of various models suggest that the actual value is 1.1 bpc or less.

Background

Determining the entropy of natural language text is a fundamentally important problem in natural language processing. The ability to predict characters or words in text as well as a human is equivalent to solving the artificial intelligence problem. Some statistical language models, trained on hundreds of megabytes of text, now perform so well (for example, 1.22 bpc by Rosenfeld (1996)) that it is unclear whether the problem has been solved. The uncertainty lies not in the models, but in the human benchmark.

Shannon estimated the entropy of written English in 1950 by having human subjects guess successive characters in a string of text selected at random from various sources. He proved that if pr is the probability that the correct letter is found on the r-th guess, then the entropy H (in bpc) satisfies:

Σr r(pr - pr+1) log2 r  ≤  H  ≤  Σr pr log2 1/pr

In one experiment, random passages were selected from Jefferson the Virginian by Dumas Malone. The subject was shown the previous 100 characters of text and asked to guess the next character until successful. The text was reduced to 27 characters (A-Z and space). Subjects were allowed to use a dictionary and character frequency tables (up to trigrams) as aids. The following results (first 2 columns) were obtained from 100 trials for one subject, and are typical. Since the counts only estimate pr, I smoothed the data (somewhat crudely) by averaging the counts between r/2 and 2r; the result is shown in the third column.

r       pr      pr (smoothed)
1       .80     .8000
2       .07     .0700
3       .00     .0333
4       .03     .0225
5       .04     .0167
6       .02     .0157
7       .01     .0133
8       .00     .0100
9       .01     .0083
10      .00     .0046
11      .00     .0040
12      .01     .0025
13      .00     .0022
14      .01     .0016
15-27   .00     .0014-.0003
Hmin    .648    .678
Hmax    1.242   1.431

Table 1. Entropy (H) bounds from a Shannon game experiment, before and after smoothing, from Shannon (1950).
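
As a check on the bounds formula, the following Python sketch (mine, not part of Shannon's experiment or of this paper's procedure) recomputes Hmin and Hmax from the unsmoothed column of Table 1; it reproduces the .648 and 1.242 figures.

```python
import math

# Unsmoothed guess-rank probabilities p_r from Table 1 (ranks not listed are 0).
p = {1: .80, 2: .07, 4: .03, 5: .04, 6: .02, 7: .01, 9: .01, 12: .01, 14: .01}
pr = [p.get(r, 0.0) for r in range(1, 28)] + [0.0]   # p_1..p_27, plus p_28 = 0

# Upper bound: entropy of the rank distribution, sum of p_r log2(1/p_r).
h_max = sum(q * math.log2(1 / q) for q in pr if q > 0)

# Lower bound: sum over r of r (p_r - p_{r+1}) log2 r.
h_min = sum((r + 1) * (pr[r] - pr[r + 1]) * math.log2(r + 1) for r in range(27))

print(f"Hmin = {h_min:.3f}, Hmax = {h_max:.3f}")     # Hmin = 0.648, Hmax = 1.242
```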

The lower bound, Hmin, would be the entropy if each letter had a uniform distribution over its possible candidates: for instance, some could be predicted with certainty, some could only be one of two equally likely letters, some could only be one of three, and so on. This is obviously not the case. The upper bound, Hmax, would be the entropy if all of the letters had the same distribution. This is not the case either, since some letters, such as those at the beginning of a word, are harder to guess than others. The true entropy must lie somewhere in between, but its value is unknown.

The fundamental reason for the uncertainty is that the mapping from the probability distribution to a ranking is many to one. To overcome this difficulty, Cover and King (1978) had subjects assign probabilities directly, using a gambling game. However, this approach is tedious in practice and succumbs to the human tendency to assign artificially high probabilities to unlikely events (see Schwartz and Reisberg, 1991, pp. 552 ff.), the same trait that explains the popularity of both insurance and lotteries. The unfortunate result is an overestimate of the entropy. Cover and King obtained measurements of 1.3 to 1.7 bpc for individual subjects and 1.3 bpc when the results were combined.

Procedure

The purpose of this experiment is to investigate empirically the relationship between the actual entropy of text and the upper and lower bounds using various language models. A language model assigns a probability P(wi|w1...i-1) to each character wi given all of the characters before it. The entropy of a text sample relative to a model is the value of log2 1/P(wi|w1...i-1) averaged over all characters. To simulate a Shannon game, the characters of the alphabet are sorted from most likely to least likely according to the model, and the rank of the correct character wi is taken as the number of guesses.
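
To make the simulation concrete, here is a minimal Python sketch of the same procedure using a simple adaptive order-2 letter model with add-one smoothing as a stand-in for the neural network model described below; the model, function names, and smoothing are illustrative choices of mine, not taken from the original program.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "   # the 27-character alphabet

def shannon_game(text, order=2):
    """Return (H, rank_counts): the entropy of text in bpc relative to an
    adaptive add-one-smoothed order-`order` model, and a count of how often
    the correct character was the model's 1st, 2nd, ... guess.  The text is
    assumed to be reduced to ALPHABET."""
    counts = defaultdict(lambda: defaultdict(int))   # context -> next-char counts
    rank_counts = defaultdict(int)
    bits = 0.0
    for i, c in enumerate(text):
        ctx = text[max(0, i - order):i]
        dist = counts[ctx]
        total = sum(dist.values()) + len(ALPHABET)
        p = {a: (dist.get(a, 0) + 1) / total for a in ALPHABET}
        bits += math.log2(1 / p[c])                  # code length of c under the model
        ranked = sorted(ALPHABET, key=lambda a: p[a], reverse=True)
        rank_counts[ranked.index(c) + 1] += 1        # rank = simulated number of guesses
        dist[c] += 1                                 # adaptive update, as in compression
    return bits / len(text), rank_counts
```

Normalizing rank_counts gives the pr needed for the bounds formula, while the returned value of H is the true entropy of the text relative to this model.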

I used the "large" (P6) neural network data compression model described in (Mahoney 1999). The model predicts successive characters one bit at a time, using only the last n = 5 characters as context. The program was modified to assign probabilities to whole characters and rank them in order to collect statistics for a Shannon game simulation. The true entropy (relative to the model) is determined by the compression ratio. The compression algorithm was not modified.

The Shannon game simulation results use unsmoothed counts. The large number of characters eliminates low counts, making smoothing unnecessary.

Results

Two tests were performed. In the first test, book1 of the Calgary corpus (Far from the Madding Crowd by Hardy) was reduced to a 27-character alphabet by converting upper case letters to lower case and each sequence of other characters to a single space. Compression results and Shannon game simulations were obtained for the first 10K, 100K, and the full text (731,361 characters). Results are shown below.

Input size   Hmin    H       Hmax    Interp.
10K          2.013   3.028   3.131   0.907
100K         1.600   2.375   2.670   0.724
731K         1.416   2.061   2.448   0.625

Table 2. Entropy vs. Shannon game bounds as input size is increased.

The interpolation value is (H - Hmin)/(Hmax - Hmin), the fraction of the distance from Hmin to Hmax at which the true value of H lies. For example, for the 10K input, (3.028 - 2.013)/(3.131 - 2.013) ≈ 0.91.

In the second test, Lewis Carroll's Alice in Wonderland from Project Gutenberg (alice30.txt with the header removed) was reduced as for book1, leaving 135,059 characters. The full text was compressed using context lengths of n = 1, 2, 3, 4, and 5. The smaller contexts were obtained by disabling the input neurons for the larger contexts. No other parameters of the model were changed. Results are shown below.

Context size   Hmin    H       Hmax    Interp.
1              2.414   3.254   3.388   0.862
2              1.803   2.578   2.836   0.750
3              1.471   2.164   2.498   0.674
4              1.357   2.028   2.378   0.657
5              1.303   1.974   2.319   0.660

Table 3. Entropy vs. Shannon game bounds as context is increased.

Conclusions

We observe that as the model is improved, whether by increasing the amount of training data or by changing the model itself (varying the context length), the true entropy moves toward the lower bound. The reason may be that as the model gets better, there is more variation in the predictability of letters, e.g. between those at the beginning and end of a word. Recall that the upper bound corresponds to the case of all letter positions having the same distribution.

The language model used here is rather crude, in that it models letters but not words, ignoring the syntactic and semantic constraints found in natural language. It is plausible that adding these constraints would allow whole words to be predicted, increasing the variation in letter predictability and lowering the interpolation value.

If we assume that the interpolation value is less than 0.6 for natural language, and apply this to Shannon's estimate of 0.6 to 1.3 bpc, we would conclude that the entropy of written English is less than 1.0 bpc. If we instead take the range 0.678 to 1.431 bpc from the smoothed data, we would conclude that it is less than 1.1 bpc, a value yet to be reached by the best language models.
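
The arithmetic behind these two figures is not spelled out above; assuming the interpolation is applied linearly, i.e. Hmin + 0.6 (Hmax - Hmin), a quick check gives:

```python
def interpolate(h_min, h_max, frac=0.6):
    # Entropy estimate at the given fraction of the distance from Hmin to Hmax.
    return h_min + frac * (h_max - h_min)

print(interpolate(0.600, 1.300))   # ~1.02, i.e. roughly 1.0 bpc (Shannon's bounds)
print(interpolate(0.678, 1.431))   # ~1.13, i.e. roughly 1.1 bpc (smoothed bounds)
```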

References

Cover, T. M., and R. C. King (1978), "A Convergent Gambling Estimate of the Entropy of English", IEEE Transactions on Information Theory 24(4) (July), pp. 413-421.

Mahoney, Matthew V. (1999), Fast Text Compression with Neural Networks, submitted for publication.

Rosenfeld, Ronald (1996), "A Maximum Entropy Approach to Adaptive Statistical Language Modeling", Computer Speech and Language, 10.

Schwartz, Barry, and Daniel Reisberg (1991), Learning and Memory, New York: W. W. Norton and Company.

Shannon, Claude E. (1950), "Prediction and Entropy of Printed English", Bell System Technical Journal, 30, pp. 50-64.