By Matt Mahoney

Florida Institute of Technology

How can a lexical model be acquired from a corpus of text? It seems that in order to build a vocabulary, it is necessary to parse the text into words, and in order to parse text, it is necessary to know the vocabulary. Thus, most machine lexical models are built manually. How do humans solve the problem?

Jusczyk (1996) showed that parsing is learned first. This seems impossibly difficult, like splitting *canyoureadthis* into words without knowing English, yet Jusczyk showed that infants do the equivalent of just that with continuous speech by age 10.5 months, which is 1.5 months before they normally learn their first word.

Hutchens and Alder (1998) showed that a simple trigram character model could deduce that spaces are used to segment text into words: given the previous two characters, the first character of a word has a higher than average entropy, or unpredictability. They attempted to use this result to find word boundaries in text without spaces, but the results were inconclusive. In this paper, I improve on their results by using a 5-gram model and by combining the conditional entropies read both forwards and backwards across a candidate boundary, using the test

H(w_{i}|w_{i-n+1...i-1}) + H(w_{i-1}|w_{i...i+n-2}) > T_{n}

where w_{i} is the i-th character, T_{n} is a threshold, and H(y|x) is the entropy of the probability distribution over y given x, i.e. the uncertainty of the character y given the n - 1 characters x on the other side of the boundary:

H(y|x) = Σ_{y} P(y|x) log_{2} 1/P(y|x)

and probabilities are estimated by counting n-grams in the text, with no adjustments for zero counts, i.e. P(y|x) = count(x,y)/count(x). The term 0 log_{2} 1/0 is defined to be 0.
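As a concrete illustration (a minimal sketch, not the original implementation; the function name is mine), the conditional entropies can be estimated from raw n-gram counts like this:

```python
import math
from collections import Counter

def conditional_entropies(text, n):
    """Estimate H(y|x) for each (n-1)-character context x in text,
    using raw n-gram counts with no smoothing.  Because only observed
    n-grams contribute terms, the 0 log2 1/0 = 0 convention is
    satisfied automatically."""
    ngrams = Counter(text[i:i+n] for i in range(len(text) - n + 1))
    ctxs = Counter(text[i:i+n-1] for i in range(len(text) - n + 1))
    H = {}
    for xy, c in ngrams.items():
        x = xy[:-1]
        p = c / ctxs[x]                      # P(y|x) = count(x,y)/count(x)
        H[x] = H.get(x, 0.0) + p * math.log2(1 / p)
    return H
```

For example, in the string "ababac" the context "a" is followed by "b" twice and "c" once, giving H(y|"a") = (2/3) log_{2}(3/2) + (1/3) log_{2} 3 ≈ 0.92 bits, while "b" is always followed by "a", giving 0 bits.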

A word boundary is predicted wherever the test passes. The threshold T_{n} was adjusted until recall = precision, where recall is the fraction of word boundaries from the original text that were predicted, and precision is the fraction of predicted boundaries that really were boundaries in the original text. A true boundary is assumed wherever the original text contained characters, such as spaces or punctuation, that were removed.
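The complete test might be implemented as follows (a brute-force sketch with raw counts and no smoothing; the function names are mine, not from the experiments):

```python
import math
from collections import Counter

def entropies(text, n):
    """Map each (n-1)-character context to H(next char | context),
    estimated from raw n-gram counts with no smoothing."""
    ngrams = Counter(text[i:i+n] for i in range(len(text) - n + 1))
    ctxs = Counter(text[i:i+n-1] for i in range(len(text) - n + 1))
    H = {}
    for xy, c in ngrams.items():
        p = c / ctxs[xy[:-1]]
        H[xy[:-1]] = H.get(xy[:-1], 0.0) + p * math.log2(1 / p)
    return H

def predict_boundaries(text, n, T):
    """Predict a boundary before position i when the forward entropy
    H(w_i | w_{i-n+1..i-1}) plus the backward entropy
    H(w_{i-1} | w_i..w_{i+n-2}) exceeds the threshold T."""
    rev = text[::-1]
    Hf = entropies(text, n)      # model reading left to right
    Hb = entropies(rev, n)       # model reading right to left
    L = len(text)
    return {i for i in range(n - 1, L - n + 2)
            if Hf.get(text[i-n+1:i], 0.0)
             + Hb.get(rev[L-i-n+1:L-i], 0.0) > T}
```

The backward entropy is obtained by training a second model on the reversed text, so the same counting code serves both reading directions.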

| n | Recall = Precision | Threshold T_{n} |
|---|---|---|
| 2 | 0.41 | 7.5 |
| 3 | 0.63 | 6.9 |
| 4 | 0.75 | 5.7 |
| 5 | 0.77 | 4.2 |

Table 1. Word boundary detection accuracy in text without spaces for various n-gram models.
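The evaluation itself is straightforward to compute. In the sketch below (my conventions, not from the experiments), boundary position i means a boundary before the i-th remaining character, and any non-alphanumeric character in the original text is treated as a removed separator:

```python
def true_boundaries(spaced_text):
    """Boundary positions implied by removed separators: position i
    marks a boundary before the i-th remaining character."""
    bounds, i = set(), 0
    for ch in spaced_text:
        if ch.isalnum():
            i += 1
        else:
            bounds.add(i)    # a removed space/punctuation char marks a boundary
    return bounds

def recall_precision(predicted, true):
    """Recall: fraction of true boundaries that were predicted.
    Precision: fraction of predictions that are true boundaries."""
    hits = len(predicted & true)
    recall = hits / len(true) if true else 0.0
    precision = hits / len(predicted) if predicted else 0.0
    return recall, precision
```

Sweeping the threshold T_{n} and stopping where the two values cross reproduces the equal-error operating point reported in Table 1.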

Hutchens, Jason L., Michael D. Alder (1998), "Finding structure via compression", Proc. Joint Conf. on New Methods in Language Processing and Computational Natural Language Learning, 79-82.

Jusczyk, Peter W. (1996), "Investigations of the word segmentation abilities of infants", 4th Intl. Conf. on Speech and Language Processing, Vol. 3, 1561-1564.

Klakow, Dietrich (1998), "Language-model optimization by mapping of corpora", Proc. IEEE ICASSP, Vol. 2, 701-704.

Nevill-Manning, Craig G., Ian H. Witten (1997), "Inferring lexical and grammatical structure from sequences", IEEE Proc. Conf. on Compression and Complexity of Sequences, 265-274.