Rationale for a Large Text Compression Benchmark

Matt Mahoney
Last update: Aug. 20, 2006

The Large Text Compression Benchmark and the Hutter Prize are designed to encourage research in natural language processing (NLP). I argue that compressing, or equivalently, modeling natural language text is "AI-hard". Solving the compression problem is equivalent to solving hard NLP problems such as speech recognition, optical character recognition (OCR), and language translation. I argue that ideal text compression, if it were possible, would be equivalent to passing the Turing test for artificial intelligence (AI), proposed in 1950 [1]. Currently, no machine can pass this test [2]. Also in 1950, Claude Shannon estimated the entropy (compression limit) of written English to be about 1 bit per character [3]. To date, no compression program has achieved this level. In this paper I will also describe the rationale for picking this particular data set and contest rules.

Compression is Equivalent to General Intelligence

In 2000, Hutter [21,22] proved that finding the optimal behavior of a rational agent is equivalent to compressing its observations. Essentially, he proved Occam's Razor [23]: the simplest answer is usually the correct one. The proof applies to any goal-seeking agent in any unknown environment which can be simulated by a computer. For example, an agent could be a person or computer whose goal is to accumulate money or some other reward signal in an unknown world.

Formally, the agent and environment are a pair of interacting Turing machines that pass messages back and forth at each time step. In addition, the environment outputs a reward signal (a number) to the agent, and the agent's goal is to maximize the accumulated reward. What Hutter proved is that the optimal behavior of an agent is to guess that the environment is controlled by the shortest program that is consistent with all of the interaction observed so far.

The problem of finding this program is known as AIXI. AIXI implies a solution to optimal algorithmic compression: coding a string as the shortest program that outputs it. The problem is worth studying because it is hard. The general problem is not computable [11], although Hutter proved that if we assume a time bound t and a space bound l on the environment, then this restricted problem, known as AIXItl, can be solved in O(t·2^l) time.

Text Compression is AI-Hard

"AI-hard" is an informal term meaning that solving one problem in the class (of text-based AI problems) is equivalent to solving all such problems. I argue that this class includes text compression. By text-based AI, I am excluding vision, robotics, etc., and mean systems that read and/or write natural language text.

Given a probability distribution P over strings s, Shannon's source coding theorem states that the optimal (shortest average) code for s has length log2 1/P(s) bits. If the distribution P is known, then such codes are in fact easy to find and compute, for example, arithmetic codes [4]. The following arithmetic code for s is within one bit of optimal: choose the shortest binary number x between 0 and 1 such that P(before s) <= x < P(before or equal to s), where P(before s) is the total probability of all strings before s in some lexicographical (or alphabetical) order. Therefore text compression can be reduced to modeling, or computing P(s) for any string s.
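The optimal code length can be sketched with a toy distribution (the four strings and their probabilities below are assumptions for illustration only):

```python
import math

# A toy distribution P over four strings (hypothetical values).
P = {"the": 0.5, "cat": 0.25, "sat": 0.125, "mat": 0.125}

# Optimal code length for each string: log2(1/P(s)) bits.
lengths = {s: math.log2(1 / p) for s, p in P.items()}
assert lengths["the"] == 1.0 and lengths["mat"] == 3.0

# The expected code length equals the entropy of the distribution.
entropy = sum(p * math.log2(1 / p) for p in P.values())
expected = sum(P[s] * lengths[s] for s in P)
assert abs(expected - entropy) < 1e-12  # 1.75 bits here
```

Likely strings get short codes, unlikely strings get long ones, and on average no code can do better than the entropy.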

The problem is that for natural language, we do not know how to explicitly compute the model P(s) for a given string s. However humans must implicitly know P(s) because our speech and writing, by definition, follows this distribution.

The modeling problem occurs in any system that converts some human-recognizable input signal x to a natural language string s, such as OCR, speech recognition, handwriting recognition, and language translation. The general problem, as outlined by Knight [5], is to find s such that P(s|x) is maximized. This can be decomposed by Bayes law: P(s|x) = P(x|s)P(s)/P(x), so the problem is to find s that maximizes P(x|s)P(s), since P(x) is fixed. P(x|s) is an application-specific model, such as an acoustic model for speech recognition. For example, P(x|"recognize speech") is higher for x that sounds like the corresponding text. Note that P(x|"recognize speech") is about equal to P(x|"reckon eyes peach"). To distinguish these possibilities given utterance x we need the language model to tell us that P("recognize speech") > P("reckon eyes peach").
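The decision rule can be sketched as follows; all of the probabilities below are hypothetical numbers chosen only to illustrate the argument:

```python
# Hypothetical scores for one utterance x and two candidate transcriptions.
# P(x|s) is the acoustic model (both phrases sound nearly alike);
# P(s) is the language model (one phrase is far more plausible).
acoustic = {"recognize speech": 0.010, "reckon eyes peach": 0.011}
language = {"recognize speech": 1e-9, "reckon eyes peach": 1e-16}

# Bayes law: argmax_s P(s|x) = argmax_s P(x|s) * P(s), since P(x) is fixed.
best = max(acoustic, key=lambda s: acoustic[s] * language[s])
assert best == "recognize speech"
```

Even though the acoustic model slightly prefers the wrong phrase, the language model dominates the product and picks the right one.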

In fact, language modeling is an active area of research in speech recognition. Often, it is studied independently of the acoustic model. The most widely used measure of the quality of a language model is its perplexity, which is a measure of how well it will compress a corpus of natural language text [6]. Models are also used to correct errors in other NLP systems such as OCR [7] and language translation [5].
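Perplexity can be sketched directly from its definition: it is 2 raised to the cross-entropy in bits per word, so a lower perplexity means the model would compress the corpus better (the toy vocabulary and probabilities below are assumptions):

```python
import math

def perplexity(model, words):
    """Perplexity = 2 ** (cross-entropy in bits per word).
    Lower perplexity means the model compresses the text better."""
    bits = sum(math.log2(1 / model[w]) for w in words)
    return 2 ** (bits / len(words))

# A uniform model over a 4-word vocabulary has perplexity 4 on any such text.
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
assert abs(perplexity(uniform, ["a", "b", "c", "d"]) - 4.0) < 1e-9

# A model that matches the text's skew scores lower (better).
skewed = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}
assert perplexity(skewed, ["a", "a", "a", "b"]) < 4.0
```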

Text Compression is Equivalent to AI

In 1950, Turing [1] proposed the "imitation game" as a test of artificial intelligence. An interrogator or judge communicates with a machine and a human via teletype (or in modern times, by instant messages) and tries to distinguish them. If the judge cannot, then the machine passes the test for AI. A modern version of the test, the Loebner contest [2], has yet to produce this result.

Let us assume that the judge and human contestant use language model P, and further, that the judge is able to evaluate messages using P. In other words, both the judge and contestant are more likely to say or write "recognize speech" than "reckon eyes peach", and further, they are more likely to judge the second phrase as unlikely or incorrect, compared to the first. Assume that both people do this consistently given any pair of strings that might occur in human speech or writing. A string might be a dialog, e.g. a sequence of questions and answers. Our two assumptions state that both humans would answer any given question the same way, and that both humans would recognize when an answer is different from the one they would give.

Now suppose that the machine can compute P(s) for all s. Then for any question q (or dialog ending in a question) and any answer a, the machine can compute P(a|q) = P(qa)/P(q), and use this distribution to answer any question q posed by the judge. By definition, this distribution would be identical to the distribution of answers given by the human, and the two would be indistinguishable.
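The conditioning step can be sketched with hypothetical probabilities for two dialogs that share the same question (the numbers are assumptions for illustration):

```python
# Toy joint probabilities P(qa) for two dialogs with the same question q
# (hypothetical values; a real model would cover all possible answers).
P = {
    "Q: what color is the sky? A: blue.": 0.0008,
    "Q: what color is the sky? A: green.": 0.0000008,
}
P_q = sum(P.values())  # P(q): total probability of dialogs beginning with q

# P(a|q) = P(qa) / P(q); the machine samples its answer from this.
cond = {s: p / P_q for s, p in P.items()}
assert cond["Q: what color is the sky? A: blue."] > 0.99
```

A machine answering from this conditional distribution would, by construction, answer the way a human using the same model P would.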

If we relax our two original assumptions, then we make it easier for the machine to pass the test. First, if we relax the assumption that both humans use the same language model, then it is no longer necessary for the machine to exactly duplicate the model of its human opponent. The judge will expect some variation among individuals, and so might attribute any differences between the machine and the human to normal variation.

If we relax the second assumption that the judge uses the same model for generation and recognition, then this is a defect in the judge's ability that works to the advantage of the machine. If the judge would be more likely to give answer a than answer b, but does not recognize this fact when the machine gives answer b, then the machine benefits.

Compression Still Does Not Seem Like AI

The objection to the argument that compression implies AI is that while humans are better than computers at speech recognition, language translation, reading, answering questions, etc., humans cannot compress text efficiently. This is true. However, I argue that while compression is much harder than language modeling for humans, it is only slightly harder for computers.

Humans are good at quickly applying language rules and real world knowledge to distinguish between high probability strings like "roses are red" and low probability strings such as "roses are rod" or "roses red are". This is precisely the hard problem that a text compressor must solve so that it can assign shorter codes to the more likely strings.

It is this assignment of codes that is easy for computers, but hard for humans. Given any string s with estimated probability p(s), the compressor must choose a code of length about log2 1/p(s) bits [24]. Furthermore, the decompressor must agree on exactly the same mapping of strings to codes so that s can be properly decoded. This mapping obviously depends on the estimated distribution, p(), but no two humans are going to agree exactly on p(s) for all s. Furthermore, p() for a single person changes over time as the person learns and forgets. Even at any given instant in time, the computation of p(s) is uncertain because it is performed with neurons, which are noisy, analog devices.

Computers are deterministic, so any algorithm used to compute p(s) by the compressor can be duplicated exactly by the decompressor. Furthermore, there are efficient algorithms for mapping strings to codes given any distribution p(), such as arithmetic coding [4]. In an arithmetic code, s is mapped to a code x, expressed as a binary fraction in [0,1), such that p(<s) <= x < p(<=s), where p(<s) means the cumulative probability of all strings lexicographically less than s. An arithmetic code can be computed efficiently by expressing p(s) as a product of conditional probabilities or predictions: p(s) = p(s1s2...sn) = p(s1)p(s2|s1)...p(sn|s1...sn-1), then narrowing the bounds on x after each predicted symbol is read and outputting the leading bits of x as soon as they are known. Arithmetic coding is used in all of the best PPM and context mixing algorithms.
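The interval narrowing described above can be sketched with exact rational arithmetic. Real coders use finite-precision integers and adaptive models; the static three-symbol model here is an assumption for illustration:

```python
from fractions import Fraction

# Static order-0 model (an assumption for illustration; real coders
# condition p() on context and update it adaptively).
probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
symbols = list(probs)

def cum_range(sym):
    """Interval [p(<sym), p(<=sym)) of sym within [0, 1)."""
    lo = Fraction(0)
    for s in symbols:
        if s == sym:
            return lo, lo + probs[s]
        lo += probs[s]
    raise KeyError(sym)

def encode(text):
    """Narrow [low, high) by each symbol's interval, then emit the
    bits of a short binary fraction x with low <= x < high."""
    low, high = Fraction(0), Fraction(1)
    for ch in text:
        width = high - low
        lo, hi = cum_range(ch)
        low, high = low + width * lo, low + width * hi
    bits, x, step = "", Fraction(0), Fraction(1, 2)
    while x < low:                 # x < high holds throughout
        if x + step < high:
            bits += "1"
            x += step
        else:
            bits += "0"
        step /= 2
    return bits, x

def decode(x, n):
    """Reverse the narrowing: read n symbols back out of fraction x."""
    out = ""
    for _ in range(n):
        for s in symbols:
            lo, hi = cum_range(s)
            if lo <= x < hi:
                out += s
                x = (x - lo) / (hi - lo)
                break
    return out

bits, x = encode("abacab")
assert decode(x, len("abacab")) == "abacab"
# Ideal length: 3*log2(2) + 3*log2(4) = 9 bits; the code comes close.
assert len(bits) <= 11
```

Because both sides run the same deterministic arithmetic, the decompressor reconstructs the identical intervals and recovers s exactly, which is precisely what noisy human estimates of p() cannot guarantee.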

To summarize: estimating the distribution p() is the hard AI problem, one that humans solve implicitly but computers do not; assigning codes given p() is easy for computers (arithmetic coding) but effectively impossible for humans, who cannot agree on or even stably reproduce p(s). Thus compression is only slightly harder than language modeling for computers, but much harder for humans.

Why Not Lossy Compression?

Although humans cannot compress losslessly, they are very good at lossy compression: remembering that which is most important and discarding the rest. Lossy compression algorithms like JPEG and MP3 mimic the lossy behavior of the human perceptual system by discarding the same information that we do. For example, JPEG codes the color signal of an image at a lower resolution than brightness because the eye is less sensitive to high spatial frequencies in color. But we clearly have a long way to go. We can now compress speech to about 8000 bits per second with reasonably good quality. In theory, we should be able to compress speech to about 20 bits per second by transcribing it to text and using standard text compression programs like zip.

Humans do poorly at reading text and recalling it verbatim, but do very well at recalling the important ideas and conveying them in different words. It would be a powerful demonstration of AI if a lossy text compressor could do the same thing. But there are two problems with this approach. First, just like JPEG and MP3, it would require human judges to subjectively evaluate the quality of the restored data. Second, there is much less noise in text than in images and sound, so the savings would be much smaller. If there are 1000 different ways to write a sentence expressing the same idea, then lossy compression would only save log2 1000 = about 10 bits. Even if the effect was large, requiring compressors to code the explicit representation of ideas would still be fair to all competitors.

Why Not Other Evaluation Measures?

Because compression is quick, quantitative, objective, and repeatable. A direct measure of AI such as the Turing test is expensive to implement in practice (e.g. the Loebner prize [2]). The outcome depends on differences between people, so the results must be averaged over a large sample of human subjects.

A language model could also be evaluated in an application, e.g. word error rate in speech recognition or character error rate in OCR. However, the results also depend on the acoustic or optical model, so comparisons between models will only be meaningful in the context of a fixed input model. It is common practice in speech recognition research to evaluate a language model isolated from the acoustic model by measuring text compression (expressed as word perplexity), and this method has been found experimentally to be equivalent or superior to other proposed language model evaluation methods [6].

How Much Compression can be Achieved?

Shannon [3] estimated the entropy of written English using experiments in which human subjects tried to guess successive characters in text. From these experiments, he could infer a probability distribution, and thus compute the expected code length, or entropy. Unfortunately, there are many probability distributions that could lead to the same observed sequence of guesses, so Shannon could only estimate the entropy bounds to be between 0.6 and 1.3 bits per character (bpc) using a 27 character alphabet of monocase letters and space. However, these results held for a wide range of subjects over different material, such as novels and technical manuals.

Cover and King [8] refined Shannon's experiments by having subjects place bets on the outcome of the next character in a marketplace gambling game, and concluded that the entropy of Shannon's material was at most 1.3 bpc. I believe this is an overestimate because humans have a tendency to overestimate the likelihood of rare events, which would have the effect of increasing the observed entropy. This effect was noted by McDonald [9], and might explain the popularity of lotteries and insurance.
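The distortion effect can be sketched with cross-entropy: when the data follow a distribution p but the bets follow a model q that overweights rare events, the measured bits per symbol exceed the true entropy (the two-outcome distributions below are assumptions):

```python
import math

def cross_entropy(p, q):
    """Expected bits per symbol when data follow p but bets follow q."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q))

p = [0.99, 0.01]         # true distribution: the rare event has P = 0.01
q = [0.90, 0.10]         # a bettor who overestimates the rare event
H = cross_entropy(p, p)  # true entropy, about 0.08 bits per symbol

# Betting with the distorted model measures a higher apparent entropy.
assert cross_entropy(p, q) > H
```

Since cross-entropy is minimized only when q = p, any systematic bias in the subjects' bets can only raise the estimate, consistent with treating 1.3 bpc as an upper bound.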

I attempted to resolve this issue by simulating the Shannon game with known models, and concluded that the actual value is near the middle of the Shannon bounds, about 1 bpc [10]. However, including punctuation and capitalization in text adds about 0.25 bpc or 15-20% for most high end compressors. Much work still needs to be done.

The entropy of this and all compression benchmarks is unknown. Unfortunately there is no direct way to compute it. In the absence of a known probability distribution, we may define the information content, or Kolmogorov complexity K(s), of a string s as the length of the shortest program that outputs s [11]. K(s) is independent of the language used to write the program, up to a constant independent of s, because any program written in language L1 can be rewritten in L2 by appending a compiler for L1 written in L2.

Kolmogorov also proved that K(s) is not computable. The proof is simple. Suppose that K(s) were computable. Then you could write a program to find the first string s whose complexity is at least n bits, for any n, as follows:

  s := ""
  while K(s) < n do
    s := next(s)  // in some lexicographical ordering
  output s
Now let n = 10000. The above program is shorter than n bits, but it outputs an s with K(s) >= 10000, which is a contradiction. This proof can be applied to any language by making n sufficiently large.

As a second example of the difficulty of measuring entropy, consider the string of 10^9 zero bytes encrypted with AES-256 in CBC mode with key "foobar". This short description could be translated to a short program that generates an apparently random (incompressible) string to anyone who does not know the key. The ability to compress this string would be equivalent to either guessing the key or breaking AES (a "distinguishing attack").
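A rough illustration of this, substituting SHA-256 in counter mode for AES since Python's standard library has no AES; the point, that a short keyed program produces an apparently incompressible string, is the same:

```python
import hashlib
import zlib

def keyed_stream(key, nbytes):
    """Keyed pseudorandom bytes: SHA-256 in counter mode (a stand-in
    for AES-CBC, which is not in the standard library)."""
    out = bytearray()
    counter = 0
    while len(out) < nbytes:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:nbytes])

plain = bytes(10_000)                     # 10,000 zero bytes
cipher = keyed_stream(b"foobar", 10_000)  # looks random without the key

# The zeros compress to almost nothing; the keyed stream barely at all.
assert len(zlib.compress(plain, 9)) < 100
assert len(zlib.compress(cipher, 9)) > 9_000
```

Both strings have a short description, so both have low Kolmogorov complexity, but no practical compressor can exploit the second one without the key.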

Fortunately, compressing text is much easier. It is only as hard as AI. The fact that the human brain can do it suggests that it is possible.

Why Wikipedia?

The text in Wikipedia is high quality. Most of it has been edited and reviewed by many people to correct errors. It represents a vast body of knowledge.

The text is not ideal. It contains XML tags, which compress easily, and user IDs, which are difficult to compress. It contains hypertext links that are not visible and not easily modeled by humans. It contains formatting tags for tables, headers, lists, links to images, etc. The order of the articles is arbitrary, which increases entropy, but begins alphabetically, which decreases entropy. However, most real-world sources of text have similar warts, and language models will have to deal with them.

In any case, modeling text mixed with other data still requires the ability to model text. The other data is probably much easier to model because it follows simpler rules.

Wikipedia was suggested as a benchmark by Jim Bowery on comp.compression for the proposed C-Prize [19]. The prize was then funded by Marcus Hutter and became known as the Hutter Prize.

Why English?

Wikipedia has text available in about 100 languages in varying amounts. A multilingual corpus would encourage people to write very general algorithms for learning grammar and lexical models, rather than code specifically for English (e.g. dictionaries, stemming rules, etc). Also, because many articles contain the same information in different languages, it would encourage developers to add language translation capabilities to improve compression.

However, multiple language capabilities are not required to pass the Turing test. The benchmark already takes precautions against the excessive use of dictionaries by including the size of the decompressor. It is not possible to prevent developers from tuning a compressor to a particular benchmark except by keeping the data secret.

I picked English because it is the most widely used on the Internet and in Wikipedia, and I admit, I am biased.

Why One Gigabyte?

Turing predicted in 1950 that a machine with 10^9 bits of memory would pass the test for AI in 50 years. He did not explain where this number came from, but his prediction about the availability of memory was remarkably accurate, considering that it predated Moore's law. 10^9 bits would be the size of a 1 GB file after compression to 1 bpc.

Landauer [12] estimated in 1986 that humans have about 10^9 bits of long term memory, based on rates of learning and forgetting.

Assuming you spend several hours a day reading, writing, talking, or listening, you process about a gigabyte of language in your lifetime.

There are two reasons not to use a larger benchmark. First, high end compressors build a model of the text compressed so far (to make predictions about upcoming input), and this model cannot occupy less memory than the size of the compressed output without discarding information. Usually the model is many times larger (a speed-memory tradeoff). On a computer with 1 GB memory, this means that statistics would have to be discarded, which would be equivalent to splitting the file into small parts and compressing them separately. Thus, a larger benchmark would not lead to better compression (except by using temporary files, which would be slow).

The second, and more important reason is that it would lead to inferior models for NLP applications. Humans are able to judge the correctness of a sentence never seen before by applying rules of grammar and semantics. With a larger data set like Google, it would be possible to simply count the number of occurrences to estimate its probability. Such a system in an NLP application would be inefficient because it would require an unnecessarily large training set.

Why Not a Variety of Data Types?

Most data compression benchmarks focus on finding the best overall compressor by including many types of data: text in various languages and formats, images, audio, executable code, source code, spreadsheets, databases, etc. That is not the purpose of this benchmark. Its purpose is to evaluate language models for AI. The more general problem, as I mentioned, is much harder. At the very least, it forces high end compressors to include lots of specialized models for different data types, resulting in overly complex systems.

In fact, ideal lossless compression of images, audio and executables is arguably harder than compressing text, regardless of the data format, because of the complexity of the underlying sources that generate such data.

The second problem is that there is not enough text in existing benchmarks. The table below summarizes some important benchmarks that have been updated in the past year, showing the approximate amount of natural language text. Some data sets are not public to prevent tuning to benchmarks.

Benchmark                Author               Public?  Size    Contents
---------                ------               ------   ----    --------
Calgary corpus           [15]                   Yes    3 MB    14 files: text (2 MB), data, bitmap image, source, exe  
Archive comparison        Michael Molhanic       No     8 MB    1 executable file
Emilcont                 Berto Destasio         No     13 MB   12 files: data, images, audio, 1MB text
Practical Compression    Wim Heirman            Yes    78 MB   1 set: source code (tested on fast compressors only)
MaximumCompression SFC   W. Bergmans            Yes    53 MB   10 files: text (4 MB), images
                   MFC                          No     316 MB  1 set of 510 files, many types (about 80 MB text)
UCLC                     Johan de Bock          Yes    400 MB  12 file sets: text (25 MB), audio, images, data, exe
Squeeze Chart            Stephan Busch          No     2.6 GB  15 file sets, wav, bmp, exe, 150 MB multilingual text
GUI Compression          DOGG                   No     7 GB    10 sets: data, images, audio, text (10 MB)
Links to other benchmarks are maintained by Werner Bergmans.

Why Include the Decompressor Size?

If the benchmark did not include the decompressor, it would be possible to write a program to compress the data to an empty file by writing a (very large) decompressor that stores a copy of the file. The rules are similar to the Calgary Challenge [13]. The rules of both benchmarks are designed to accept proofs of tighter upper bounds on the algorithmic complexity K(s) of the data for a fixed set of languages. K(s) represents the size of both the code (the decompressor) and its input (the compressed file). The allowed languages are zipped C++, zipped x86 machine code, and so on.

The Calgary challenge allows decompressors to be stored in any of several archive formats which compress better than zip. However, for this benchmark, the code for most decompressors represents only about 0.01% of the total size. Zip is good enough, and is widely used.

An alternative to including the decompressor size would be to use an unpublished data set, as some benchmarks do. However I want the results to be publicly verifiable.

Why Bother with Compression?

There is no requirement that a compressor work on any files other than the two in this benchmark. It would be within the rules to write a compressor that simply stores one byte to indicate which file was input, and gives an error message in all other cases. I believe this will not happen. The knowledge in the decompressor's language model has to either be hand coded or learned. The hand coded approach has been tried for 50 years and has always failed because of the enormous effort required (e.g. 22 years for Cyc [14]). Therefore, I believe it will be more expedient to write a compressor that can learn a language model from arbitrary input data.

It is an open question whether a language model can be learned solely from unlabeled training text, as a compressor would have to do. Humans acquire some verbally expressible knowledge such as "the sky is blue" nonverbally, for example, by looking at the sky. However, it seems that all knowledge that can be tested verbally (as in the Turing test) can also be acquired verbally. Thus, a blind person could correctly answer a question about the color of the sky by having been told the answer. (So could a computer. Google returns more hits for "the sky is blue" than "the sky is green"). Some knowledge can only be acquired nonverbally, such as the ability to swim, to recognize some person's face, or to recognize a novel odor, but it does not seem possible to devise any verbal tests for such knowledge.

Will This Benchmark Lead to AI?

That is not known. I believe that if a program compresses this data set as well as a human can predict it, as measured by the Shannon game or a similar procedure, then we will have solved the problem of building a language model. This is different from actually building a model that is useful for a particular purpose. Collecting the knowledge that an AI system must learn for a particular application is a different problem.

Also, the entropy of this data set with respect to a human model has not yet been measured.

Why Not Rank by Speed and Memory?

For practical reasons, compressors have to be benchmarked on a machine with limited memory, disk, and processing power. With current technology (in 2006), this imposes a severe constraint on the top programs. There is a three way tradeoff between speed, memory, and compression. The best compressors are near the speed and memory limits. In theory, a language model could be stored in no more memory than the size of the compressed output. In practice, high end compressors may use less efficient representations, often 10 to 50 times larger, in order to achieve reasonable speed. These programs are penalized because the best they can do is to compress the text in small, independent pieces, none of which have enough data to build a full language model according to the arguments I gave earlier.

I believe the best way to achieve the goal of NLP is to relax the CPU and memory constraints as faster hardware and cheaper memory becomes available. Only the training/test data should be limited.

Nevertheless, speed and memory are important so I report them.

What has been Achieved?

The most common general purpose lossless compression algorithms, in order from worst to best compression, and from fastest to slowest, are: LZ77 (used in zip and gzip), Burrows-Wheeler block sorting [16], PPM, and context mixing [17].

None of these compressors model text beyond the lexical (word) level. However there are offline semantic models such as LSA [18] that exploit the transitive property of semantics as a fuzzy identity relation. Let A, B and C be words and define "means" as "is likely to occur near (in running text)". It can be observed:
  A means A.                                     (reflexive property)
  If A means B then B means A.                   (symmetric property)
  If A means B and B means C then A means C.     (transitive property)
Thus, if we observe "star" near "moon" and "moon" near "night", we can predict that "night" might follow "star", even if these two words have not been previously observed near each other. This model can be constructed in LSA by factoring a word-word proximity matrix using singular value decomposition into 3 matrices with the middle matrix diagonal, and discarding all but the few hundred largest elements of this matrix.
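The transitive step can be sketched with a hypothetical 4-word co-occurrence matrix. Full LSA smooths this with singular value decomposition, but even squaring the matrix already links words through shared neighbors:

```python
# Words and a tiny co-occurrence matrix M (hypothetical observations):
# M[i][j] = 1 if word i was observed near word j in running text.
words = ["star", "moon", "night", "lunch"]
M = [
    [0, 1, 0, 0],  # "star" seen near "moon"
    [1, 0, 1, 0],  # "moon" seen near "star" and "night"
    [0, 1, 0, 0],  # "night" seen near "moon"
    [0, 0, 0, 0],  # "lunch" seen near none of these
]

# Second-order association: M2[i][j] counts shared neighbors, the
# transitive step that LSA's low-rank factorization captures.
n = len(words)
M2 = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]

i, j = words.index("star"), words.index("night")
assert M[i][j] == 0   # never observed together directly...
assert M2[i][j] > 0   # ...but linked through "moon"
```

SVD goes further than this sketch: by keeping only the largest singular values, it also generalizes to word pairs linked through longer or noisier chains.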

Acquiring syntactic knowledge still remains a difficult challenge.


1. Turing, A. M., “Computing Machinery and Intelligence", Mind, 59:433-460, 1950.

2. Loebner Prize, http://www.loebner.net/Prizef/loebner-prize.html

3. Shannon, Claude E., “Prediction and Entropy of Printed English”, Bell Sys. Tech. J (3) p. 50-64, 1950.

4. Howard, Paul G., and Jeffrey Scott Vitter, "Analysis of Arithmetic Coding for Data Compression", Information Processing and Management, 28(6), 749-763, Nov. 1992.

5. Knight, Kevin, “Automatic Knowledge Acquisition for Machine Translation”, AI Magazine, 18(4) 81-96, 1997.

6. Chen, Stanley F., Douglas Beeferman, and Ronald Rosenfeld, "Evaluation Metrics for Language Models". In DARPA Broadcast News Transcription and Understanding Workshop, 1998.

7. Teahan, W. J., S. Inglis, J. G. Cleary, G. Holmes, “Correcting English text using PPM models”, IEEE Proc. Data Compression Conference, 289-298, 1998.

8. Cover, T. M., and R. C. King, “A Convergent Gambling Estimate of the Entropy of English”, IEEE Transactions on Information Theory (24)4 (July) pp. 413-421, 1978.

9. McDonald, John, "200% Probability and Beyond: The Compelling Nature of Extraordinary Claims in the Absence of Alternative Explanations". Skeptical Inquirer (22)1: 45-49, 64, 1998.

10. Mahoney, M., "Refining the Estimated Entropy of English by Shannon Game Simulation" (unpublished), http://cs.fit.edu/~mmahoney/dissertation/entropy1.html, 2000.

11. Chaitin, G. J., "Algorithmic Information Theory", IBM Journal of Research and Development 21:350-359, 496, 1977.

12. Landauer, Tom, “How much do people remember? Some estimates of the quantity of learned information in long term memory”, Cognitive Science (10) pp. 477-493, 1986.

13. Broukhis, Leonid A., Calgary Challenge, http://mailcom.com/challenge/

14. Cyc. Wikipedia. http://en.wikipedia.org/wiki/Cyc.

15. Bell, Timothy, Ian H. Witten, John G. Cleary, “Modeling for Text Compression”, ACM Computing Surveys (21)4, pp. 557-591, Dec. 1989.

16. Burrows, M. and D. J. Wheeler, "A Block-sorting Lossless Data Compression Algorithm", Digital Systems Research Center, 1994.

17. Mahoney, M., "Adaptive Weighing of Context Models for Lossless Data Compression", Florida Tech. Technical Report TR-CS-2005-16, 2005.

18. Bellegarda, Jerome R., “Speech recognition experiments using multi-span statistical language models”, IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 717-720, 1999.

19. Bowery, Jim, "The C-Prize", http://www.geocities.com/jim_bowery/cprize.html, 2005.

20. Zipf, George Kingsley, The Psycho-Biology of Language, an Introduction to Dynamic Philology, M.I.T. Press, 1935.

21. Hutter, M., Towards a Universal Theory of Artificial Intelligence based on Algorithmic Probability and Sequential Decisions, Proceedings of the 12th European Conference on Machine Learning (ECML-2001) 226-238, 2001.

22. Hutter, M., Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability, EATCS, Springer, Berlin, 2004.

23. Wikipedia, Occam's Razor, http://en.wikipedia.org/wiki/Occam's_Razor, 2006.

24. Shannon, C., and W. Weaver, The Mathematical Theory of Communication, Urbana: University of Illinois Press, 1949.

This page is maintained by Matt Mahoney, mmahoney (at) cs.fit.edu