
# The Cost of Natural Language Modeling

By Matt Mahoney
Florida Institute of Technology

The purpose of this dissertation is to estimate the cost of natural language modeling, an unsolved problem which is central to artificial intelligence (AI). Knowing the difficulty of the problem, without actually solving it, would allow one to make an intelligent decision as to whether an AI project is worth pursuing, and if so, how much effort should be invested.

By cost, I mean the information content (in bits) of the language model for a natural language such as English. The complexity of the model is the dominant factor in determining its cost in dollars. A model is an algorithm that estimates the probability, P(x), of observing any given text string x in human communication. Having a good model is critical to AI applications such as speech recognition, optical character or handwriting recognition, language translation, or natural language interfaces to databases or search engines. The problem is unsolved, in that no model performs as well as humans, either directly (by estimating P(x)), or in any AI application. This is in spite of enormous effort since Alan Turing first proposed the idea of artificial intelligence in 1950.

Turing estimated that it would take 10^9 bits of memory to solve the AI problem, but offered no explanation for this figure. My approach is to find an empirical relationship between the size of a model (the amount of memory it requires) and its performance (how well it estimates P(x)), and to compare both with human performance. I did this for over 30 published statistical language models, with the result shown in Fig. 1.

Fig. 1. Comparison of model size vs. performance.

The performance of the model is measured as the entropy in bits per character (bpc), where lower numbers are better. If we take a random string x from a large corpus of text, then a good language model is likely to assign a higher value to P(x) than a poor one. For instance, a good model would assign P(roses are red) > P(roses red are), and since the former would be more likely to appear as a test string, a higher probability usually indicates a better model. If x is very long (megabytes), then this test is very accurate because it effectively combines many smaller tests.

The entropy measure, defined as H = (1/|x|) log₂ (1/P(x)), is independent of the length of the test string, |x|, and decreases as P(x) goes up. It is identical to the best compression ratio that can be obtained with an optimal encoding of x. Indeed, many of the test points in Fig. 1 are data compression programs.
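As a toy illustration of the definition (not one of the models in Fig. 1), the entropy of a string can be computed directly once a model supplies P(x). The order-0 character model below is an assumption made purely to keep the sketch self-contained; real models are far more sophisticated.

```python
import math
from collections import Counter

def bits_per_character(text, model_text):
    """H = (1/|x|) * log2(1/P(x)) under a simple order-0 character
    model estimated by counting characters in model_text."""
    counts = Counter(model_text)
    total = sum(counts.values())
    # P(x) is a product of per-character probabilities; sum logs instead.
    log2_p = sum(math.log2(counts[c] / total) for c in text)
    return -log2_p / len(text)

sample = "the cat ate the rat"
print(round(bits_per_character(sample, sample), 3))
```

A longer test string makes the measure more stable, which is the point made above about megabyte-sized x.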

The size of a statistical language model depends on the amount of text used to train it. The minimum amount of memory needed is simply the memory needed to store the training data. This can be reduced somewhat by compressing the data, using the model itself to do the compression. Thus, a model trained on 10^8 characters that compresses to 2 bpc needs at least 2 × 10^8 bits. The actual amount would be more, depending on the implementation, but we are more interested in finding the lower bound, which depends only on easily measurable quantities.
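The lower-bound arithmetic can be checked directly. The figures are the ones from the text; "MB" here means 2^20 bytes.

```python
# Minimum model size = training characters x compression ratio in bpc.
training_chars = 10**8   # characters of training text
bpc = 2.0                # compression achieved by the model
min_bits = training_chars * bpc
print(f"{min_bits:.0e} bits = {min_bits / 8 / 2**20:.0f} MB")  # → 2e+08 bits = 24 MB
```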

In 1951, Claude Shannon estimated that the entropy of written English is between 0.6 and 1.3 bpc, based on how well human subjects could predict the next character in a text string selected at random from a book. Thus, Fig. 1 suggests that the AI problem is close to being solved, and that a solution would require about 10^8 to 10^10 characters of training data (10^8 to 10^10 bits after compression), in agreement with Turing.

My poster Text Compression as a Test for Artificial Intelligence (compressed PostScript, 1 page), published in the July 1999 AAAI conference proceedings, supports my statistical modeling approach to this problem.

The dissertation proposal (compressed PostScript, 44 pages) was accepted Dec. 1, 1999.

Bibliography (HTML) of related work.

## Improving our Estimate

There are two main sources of uncertainty in our estimate. First, it is hard to discern a trend from many different language models developed by many different researchers. We need data more like the line labeled Hector, consisting of a set of points for a single model and training set as the size of the set is varied. Unfortunately, Hector is a rather poor model. (It is a Burrows-Wheeler block sorting data compressor developed in 1994, and tested on a 103 MB sample of English text from a variety of sources. The model is character based and cannot learn word-level constraints such as semantics and syntax.)

The second source of error is in Shannon's estimated entropy of English, which has never been improved upon. Cover and King (1978) removed the uncertainty in interpreting the results of Shannon's character guessing game by having subjects assign odds and bet on the next letter, but the test is tedious in practice and succumbs to the human tendency to overestimate unlikely events.

My dissertation addresses both sources of uncertainty. First, I plan to develop a language model, incrementally adding character, lexical, semantic, and syntactic constraints, to develop a smooth set of data points like the Hector data, but hopefully better. Second, I plan to refine the estimate of the entropy of the test data by using a form of the Shannon game that reduces the uncertainty in the interpretation of the results.

### Language Model

The language model will be developed incrementally as follows:
• Character (letters)
• Lexical (words)
• Semantic (word associations)
• Syntactic (sentence structure)

#### Character

Character level or n-gram models assign a probability to the next character in a text stream based on the context of the previous n − 1 (n ≈ 5) characters. For example, the model learns that *the* is more likely than *thq* based on the number of times *e* and *q* occur after *th* in the training data.
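The counting scheme just described can be sketched in a few lines. This is a minimal illustration, not the neural network model developed below; the add-one smoothing and the 27-symbol alphabet are assumptions made to keep the sketch complete.

```python
from collections import defaultdict, Counter

class NGramCharModel:
    """Order-n character model: P(next char | previous n-1 chars),
    estimated by counting, with add-one smoothing."""
    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)

    def train(self, text):
        for i in range(self.n - 1, len(text)):
            context = text[i - self.n + 1:i]
            self.counts[context][text[i]] += 1

    def prob(self, context, char, alphabet_size=27):
        c = self.counts[context[-(self.n - 1):]]
        return (c[char] + 1) / (sum(c.values()) + alphabet_size)

m = NGramCharModel(n=3)
m.train("the then there the")
print(m.prob("th", "e") > m.prob("th", "q"))  # → True
```

Published compressors replace add-one smoothing with much better estimators, but the underlying counts are the same.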

The character model was implemented as a neural network, in order to develop techniques that could be extended to higher level models. In Oct. 1999, I submitted Fast Text Compression with Neural Networks (HTML) to the FLAIRS special track on neural networks. The paper was accepted, and I presented it on May 23, 2000.

#### Lexical

Lexical models apply n-gram constraints to words or terms rather than letters. Words are common character sequences that cannot be broken apart. For instance, *New York* is one word (because *York* doesn't appear alone), but *apples* is two words, because both *apple* and *s* (denoting plural) can each appear without the other.
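The "can its parts appear alone?" criterion suggests a simple merging heuristic. The 0.9 threshold and the counts below are assumptions for illustration only, not figures from the dissertation.

```python
def should_merge(pair_count, left_count, right_count):
    """Treat a bigram as one lexical unit when each part almost
    always occurs inside the pair (threshold is an assumption)."""
    return (pair_count / left_count > 0.9) and (pair_count / right_count > 0.9)

# "New York": York almost never appears without New.
print(should_merge(pair_count=95, left_count=100, right_count=96))   # → True
# "apple" + "s": both parts appear freely on their own.
print(should_merge(pair_count=30, left_count=500, right_count=800))  # → False
```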

I have not yet implemented a lexical model, but I showed the feasibility of learning one in A note on lexical acquisition in text without spaces.

#### Semantic

Semantic models learn word associations, such as *fire* and *smoke*, by their close proximity in running text. Semantic associations form a fuzzy equivalence relation, and models can be grouped into three sublevels according to which equivalence property they exploit:
• Cache models exploit the reflexive property: a word tends to appear near itself.
• Trigger models exploit the symmetric property: if two words appear near each other once, then either word will trigger the other in the future.
• Latent Semantic Analysis (LSA) models exploit the transitive property: if fire is associated with smoke and smoke with heat, then fire triggers heat even if they never occurred together before.
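The symmetric (trigger) sublevel reduces to counting co-occurrences within a window of running text. A minimal sketch, with the window size chosen arbitrarily for illustration:

```python
from collections import defaultdict

def trigger_pairs(words, window=10):
    """Count symmetric co-occurrences within a window of running
    text, the raw statistic behind a trigger model."""
    pairs = defaultdict(int)
    for i, w in enumerate(words):
        for v in words[max(0, i - window):i]:
            if v != w:
                pairs[frozenset((v, w))] += 1
    return pairs

text = "where there is smoke there is fire and the fire makes smoke".split()
pairs = trigger_pairs(text)
print(pairs[frozenset(("fire", "smoke"))] > 0)  # → True
```

An LSA model would go further, inferring *fire*–*heat* from shared neighbors even when that pair count is zero.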

#### Syntactic

Syntactic models apply constraints across categories of words (nouns, verbs, prepositions), word groupings (phrases, sentences, paragraphs), or sublexical components (vowels, consonants, punctuation). Units are categorized together when they appear in the same context; for instance, *the cat ate* and *the dog ate* imply that *cat* and *dog* form a syntactic category. There are three model levels:
• No-repeat models disallow repetition, as in *of of* or *preposition preposition*.
• Hidden Markov Models (HMMs) apply n-gram constraints to sequences of categories, such as *article adjective noun*.
• Context Free Grammars (CFGs) combine common sequences of categories (such as the above noun phrase) into new nonterminal symbols, in the same way that the lexical model combines letters into words. For instance, *sentence = noun-phrase verb-phrase noun-phrase*, and *paragraph = indentation sentence sentence sentence*.
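The shared-context criterion for forming categories can be sketched directly. The signature representation below is an illustrative assumption; a real model would cluster on smoothed context statistics rather than exact matches.

```python
from collections import defaultdict

def context_signatures(sentences):
    """Map each interior word to the set of (previous, next) word
    contexts it appears in; shared contexts suggest a shared category."""
    sig = defaultdict(set)
    for s in sentences:
        words = s.split()
        for i in range(1, len(words) - 1):
            sig[words[i]].add((words[i - 1], words[i + 1]))
    return sig

sig = context_signatures(["the cat ate", "the dog ate", "a cat slept"])
print(sig["cat"] & sig["dog"])  # → {('the', 'ate')}
```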

### Entropy of English

The wide range of uncertainty in Shannon's estimate of the entropy of written English is mostly due to the loss of information about the probability distribution of the next letter of text, when all we have is the number of tries that it took to guess the letter. Shannon proved that if the probability that it takes r tries to guess the next letter is p_r, then the entropy H is bounded by:

Σ_r r(p_r − p_{r+1}) log₂ r ≤ H ≤ Σ_r p_r log₂ (1/p_r)

Shannon obtained p_1 = 0.8, p_2 = 0.07, ..., and bounds of 0.6 to 1.3 bpc. The range of uncertainty could be reduced by making the p_r more nearly equal, as follows.
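The bounds can be evaluated mechanically from a guess distribution. In the sketch below only p_1 = 0.8 and p_2 = 0.07 are Shannon's figures; the tail of the distribution is made up for illustration, so the resulting numbers are not his.

```python
import math

def shannon_bounds(p):
    """Shannon's lower and upper bounds on H, given p[r-1] = p_r,
    the probability that the letter is guessed on try r.
    p must sum to 1 and be non-increasing."""
    n = len(p)
    lower = sum(r * (p[r - 1] - (p[r] if r < n else 0.0)) * math.log2(r)
                for r in range(1, n + 1))
    upper = sum(pr * math.log2(1 / pr) for pr in p if pr > 0)
    return lower, upper

# Hypothetical distribution; only the first two values are Shannon's.
p = [0.8, 0.07, 0.05, 0.04, 0.04]
lo, hi = shannon_bounds(p)
print(round(lo, 3), round(hi, 3))
```

Note that the more sharply peaked the p_r, the wider the gap between the bounds, which is why flattening the distribution tightens the estimate.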

• Use a language model to give the subject choices of approximately the same probability, e.g. A to EF or EG to Z.
• Divide the tests into "hard" and "easy" parts; for instance, the first letter of each word is hard and the rest are easy. Then combine the results.

Another approach is to find the empirical relationship between the bounds and the actual entropy in various machine models. See Refining the Estimated Entropy of English by Shannon Game Simulation.

### Putting it Together

The four language models and the human entropy measurements need to be made on the same test set. This corpus will probably have to be in the 100 MB to 1 GB range. This set should be easy to obtain from USENET. See USENET as a Text Corpus.