A Note on Lexical Acquisition in Text without Spaces

By Matt Mahoney
Florida Institute of Technology

Abstract

It is shown that word boundaries can be found in text without spaces and without language knowledge beyond character n-gram statistics.

Introduction

Many word-based natural language models assume some simple rule for parsing the input text, for instance, that a word is a sequence of letters separated by spaces. A better model would use rules for stemming prefixes and suffixes, for instance, undeniably = un + deny + able + ly, components that recur in many other words. A lexicon or dictionary is generally needed to handle exceptions, for example, bus is not a plural, and New York is one word. The problem is particularly severe in languages such as Chinese, which are written without spaces between words, and in continuous speech in any language.

How can a lexical model be acquired from a corpus of text? It seems that in order to build a vocabulary, it is necessary to parse the text into words, and in order to parse text, it is necessary to know the vocabulary. Thus, most machine lexical models are built manually. How do humans solve the problem?

Jusczyk (1996) showed that parsing is learned first. This seems impossibly difficult, like splitting canyoureadthis into words without knowing English, yet Jusczyk showed that infants do the equivalent of just that with continuous speech by age 10.5 months, which is 1.5 months before they normally learn their first word.

Hutchens and Alder (1998) showed that a simple trigram character model could deduce that spaces are used to segment text into words. They note that, given the previous two characters, the first character of a word has a higher than average entropy, or unpredictability. They attempted to use this result to find word boundaries in text without spaces, but the results were inconclusive. In this paper, I improve on their results by using a 5-gram model and by combining the conditional entropies computed reading both forwards and backwards across each candidate boundary.

Procedure

The text of Lewis Carroll's Alice in Wonderland (152,141 characters, from Project Gutenberg) was reduced to a 26-character alphabet by converting upper case letters to lower case and removing all other characters, including spaces. Then the following test was applied to each character boundary, between w_{i-1} and w_i, for n = 2, 3, 4, and 5:

H(w_i | w_{i-n+1} ... w_{i-1}) + H(w_{i-1} | w_i ... w_{i+n-2}) > T_n

where T_n is a threshold, and H(y|x) is the entropy of the probability distribution over y given x, i.e. the uncertainty of the character y given the n - 1 characters in x on the other side of the boundary.

H(y|x) = sum_y P(y|x) log2 (1/P(y|x))

and probabilities are estimated by counting n-grams in the text, with no adjustment for zero counts, i.e. P(y|x) = count(x,y)/count(x), and 0 log2 (1/0) is defined to be 0.
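
As a minimal sketch of this estimation step (in Python, which the note itself does not provide; the function name and dictionary layout are my own choices), the conditional entropies for every context can be computed in one pass over the n-gram counts, with unseen characters contributing no term to the sum:

from collections import Counter
from math import log2

def conditional_entropies(text, n):
    """Map each (n-1)-character context x to H(y|x), where P(y|x) is
    estimated as count(x followed by y) / count(x followed by anything),
    and 0 log2 (1/0) is taken as 0 (unseen characters contribute nothing)."""
    ctx_counts = {}                                  # context -> Counter of next chars
    for i in range(len(text) - n + 1):
        ctx, y = text[i:i + n - 1], text[i + n - 1]
        ctx_counts.setdefault(ctx, Counter())[y] += 1
    entropies = {}
    for ctx, nexts in ctx_counts.items():
        total = sum(nexts.values())
        entropies[ctx] = -sum((c / total) * log2(c / total) for c in nexts.values())
    return entropies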

A word boundary is predicted when the test passes. A true boundary is assumed wherever the original text had removed characters, such as spaces or punctuation, between two letters. The threshold was adjusted until recall = precision, where recall is the fraction of true boundaries that were predicted, and precision is the fraction of predicted boundaries that were true boundaries.
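
The rest of the procedure can be sketched as follows, reusing conditional_entropies() from above. The backward term is obtained by running the same estimator over the reversed text; the normalization helper, the coarse threshold scan, and all function names here are my own assumptions about how to realize the description, not the author's code:

def normalize(raw):
    """Reduce text to lower-case a-z, recording which surviving letter
    positions had removed characters (spaces, punctuation, etc.) before
    them; these positions are the true word boundaries."""
    letters, boundaries, pending = [], set(), False
    for ch in raw:
        if ch.isascii() and ch.isalpha():
            if pending and letters:
                boundaries.add(len(letters))      # boundary before this letter
            letters.append(ch.lower())
            pending = False
        else:
            pending = True
    return "".join(letters), boundaries

def boundary_scores(text, n):
    """Sum of forward and backward conditional entropies at each candidate
    boundary i (the point between text[i-1] and text[i])."""
    fwd = conditional_entropies(text, n)          # model reading left to right
    bwd = conditional_entropies(text[::-1], n)    # same model on the reversed text
    scores = {}
    for i in range(1, len(text)):
        left = text[max(0, i - n + 1):i]          # n-1 characters before the boundary
        right = text[i:i + n - 1][::-1]           # n-1 characters after, reversed
        scores[i] = fwd.get(left, 0.0) + bwd.get(right, 0.0)
    return scores

def tune_threshold(scores, boundaries, lo=0.0, hi=10.0, steps=100):
    """Scan thresholds and keep the one where recall is closest to precision."""
    best = (float("inf"), None, 0.0)
    for k in range(steps + 1):
        t = lo + (hi - lo) * k / steps
        predicted = {i for i, s in scores.items() if s > t}
        if not predicted:
            continue
        hits = len(predicted & boundaries)
        recall, precision = hits / len(boundaries), hits / len(predicted)
        best = min(best, (abs(recall - precision), t, (recall + precision) / 2))
    return best[1], best[2]          # threshold, accuracy where recall ~= precision

# Hypothetical usage (the file name is an assumption):
# text, boundaries = normalize(open("alice.txt").read())
# for n in (2, 3, 4, 5):
#     print(n, tune_threshold(boundary_scores(text, n), boundaries))

Scoring the reversed text with the same forward estimator is simply a convenient way to obtain the backward entropy H(w_{i-1} | w_i ... w_{i+n-2}) without writing a second model.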

Results

The table below shows that accuracy improves as the order n of the character model increases. A baseline of 0.22 would be expected from random guessing, since at an average of about 4.5 letters per word, roughly 1/4.5 = 0.22 of the character positions are true word boundaries.

n    Recall = Precision    Threshold (T_n)
2    0.41                  7.5
3    0.63                  6.9
4    0.75                  5.7
5    0.77                  4.2

Table 1. Word boundary detection accuracy in text without spaces for various n-gram models.

Conclusions

These results show that text contains sufficient information to derive a lexicon even in the absence of parsing clues such as spaces. They do not tell us how this knowledge should be applied, however. Indeed, many models have been proposed, such as Nevill-Manning and Witten's (1997) grammar builder and a number of methods investigated by Klakow (1998). All of these methods are based on the statistical tendency of characters to occur in common sequences that cannot be divided. We call these sequences words.

References

Hutchens, Jason L., and Michael D. Alder (1998), "Finding Structure via Compression", Proceedings of the International Conference on Computational Natural Language Learning, pp. 79-82.

Jusczyk, Peter W. (1996), "Investigations of the word segmentation abilities of infants", 4th Intl. Conf. on Speech and Language Processing, Vol. 3, pp. 1561-1564.

Klakow, Dietrich (1998), "Language-model optimization by mapping of corpora", Proc. IEEE ICASSP, Vol. 2, pp. 701-704.

Nevill-Manning, Craig G., and Ian H. Witten (1997), "Inferring lexical and grammatical structure from sequences", Proc. IEEE Conf. on Compression and Complexity of Sequences, pp. 265-274.