Matt Mahoney
mmahoney@cs.fit.edu
Florida Tech, CS Dept.
May 11, 2000
An obvious source of text is the web. It is up to date, but does not solve the representation problem. In order to sample web pages randomly, it is necessary to obtain a complete list of them. The only way to do this is with a spider or web crawler, the technique used by search engines to build an index, and this requires enormous computational resources and network bandwidth. Furthermore, even the best search engines index only about a third of the web (Peterson, 1998). It might be possible to use search engines to sample the web by entering every word in the dictionary and using whatever pages are returned, but this fails to weight the corpus in favor of the most common words, and we don't know which words are the most common without a representative sample to begin with.
With USENET, it is easy to obtain a complete list of available posts from a news server, thus obtaining text covering an extensive range of topics. It is true that such text is of lower quality than professionally written material; it is full of opinions, personal attacks, obscenities, spam (inappropriate advertising), misspellings, typographical errors, etc. However, this is the way ordinary people write. It is a cross section of the language as used by the general population (at least those with access to computers, which would be the audience that an AI application developer would want to target).
In section 2, I describe some of the properties of the USENET database. In 3, I describe a method of separating English text from foreign languages and non-text data such as images. In 4, I describe the results of this filtering on USENET.
A preliminary sample of articles showed that about half of the total text consisted of binary files in UUENCODE format. In this format, 4 ASCII characters from a 64-character set represent 3 bytes of binary data. Typically these encode images (.gif or .jpeg), videos (.mpeg), music (.mp3), or archives (.zip, .gz, .tar.Z). They are usually large (5000 to 10000 lines) and often in multiple parts. In some cases, it may require hundreds of articles to post a single .mpeg file. A typical case looks like this:
begin 644 m13360b.jpg
M_]C_X``02D9)1@`!````2@!*``#__@`"_]L`A``(!08'!@4(!P8'"0@("0P4
M#0P+"PP8$1(.%!T9'AX<&1P;("0N)R`B*R(;'"@V*"LO,3,T,Q\F.#PX,CPN
M,C,Q`0@)"0P*#!<-#1<Q(1PA,3$Q,3$Q,3$Q,3$Q,3$Q,3$Q,3$Q,3$Q,3$Q
M,3$Q,3$Q,3$Q,3$Q,3$Q,3$Q,3$Q,3'_Q`&B```!!0$!`0$!`0``````````
M`0(#!`4&!P@)"@L!``,!`0$!`0$!`0$````````!`@,$!08'"`D*"Q```@$#
M`P($`P4%!`0```%]`0(#``01!1(A,4$&$U%A!R)Q%#*!D:$((T*QP152T?`D
(several thousand more lines...)
end

To reduce the amount of unwanted data that has to be downloaded and discarded, the 742 newsgroups containing the string ".binar" were removed from the list. This eliminated all of the "binaries" groups, as well as some foreign language hierarchies such as "chile.binarios" and "it.binari".
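The 4-characters-per-3-bytes encoding seen in the uuencoded sample can be illustrated with Python's standard binascii module. This is only a sketch for illustration, not a tool used in the study, and the input bytes are made up. A full uuencoded line carries 45 bytes, and its leading length character is chr(32 + 45), which is why the data lines in the sample all begin with "M".

```python
import binascii

def uuencode_line(chunk: bytes) -> bytes:
    """Encode up to 45 bytes as one uuencoded line (length char + data + newline)."""
    return binascii.b2a_uu(chunk)

# Hypothetical sample data: 45 arbitrary bytes fill exactly one full line.
chunk = bytes(range(45))
line = uuencode_line(chunk)

# 45 bytes become 60 data characters (4 per 3 bytes),
# plus one length character and a trailing newline.
print(len(line))

# The length character for a full line is chr(32 + 45) == 'M'.
print(line[:1])

# Decoding the line recovers the original bytes.
print(binascii.a2b_uu(line) == chunk)
```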
For the remaining 25,187 newsgroups, the listing contained a total of 5,805,351 articles. Every 100th article was sampled. Because of gaps between the low and high article numbers in each group, only 53,247 articles (92%) were available. The total size was 163.9 MB, which took 7 hours to download over a 36K dialup connection. This consisted of 53.6 MB of NNTP headers and 110.2 MB of article bodies, of which 21.0 MB was still uuencoded files. This left 89.2 MB of text, or 54.4% of the total download.
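One way the remaining uuencoded material could be separated from the text of an article body is sketched below. It is an assumption about how such a step might look, not the procedure actually used: the function name split_uuencoded and the regular expression are hypothetical, and the sketch only recognizes blocks that have both a "begin" and an "end" line, so later parts of a multi-part post (which lack the "begin" line) would not be caught.

```python
import re
from typing import Tuple

# A uuencoded block starts with "begin <octal mode> <filename>" and ends with "end".
BEGIN_RE = re.compile(r"^begin [0-7]{3,4} \S+")

def split_uuencoded(body: str) -> Tuple[str, int]:
    """Return (body with uuencoded blocks removed, number of characters removed)."""
    kept = []
    removed = 0
    inside = False
    for line in body.splitlines(keepends=True):
        if not inside and BEGIN_RE.match(line):
            inside = True
        if inside:
            removed += len(line)
            if line.strip() == "end":
                inside = False
        else:
            kept.append(line)
    return "".join(kept), removed

# Hypothetical article body with one short uuencoded block.
body = "Here is the picture.\nbegin 644 a.jpg\nM_]C_X\nend\nEnjoy!\n"
text, removed = split_uuencoded(body)
print(repr(text), removed)
```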
The 53,247 articles were posted by 39,766 different authors. Of these, 32,985 posted a single article. The most posts by a single author was 566, which turned out to be a robot posting to alt.nocem, a newsgroup used by the CancelMoose program (www.cm.org) for filtering spam. The posts consist of lists of article numbers to be filtered.
Because of crossposting, it is possible to see duplicate articles, although sampling mostly eliminates this problem. On average, each article was posted to 1.67 newsgroups. However, this does not account for crossposting to newsgroups not carried by the server, or duplicate postings to newsgroups in order to circumvent rules about excessive crossposting by some newsgroups (thus, the need for CancelMoose).
Another source of duplicate text is quoting a message in a response. Usually, lines of quoted text begin with ">". Of 89.2 MB of text, 23.1 MB (26%) is quoted. Again, it is not necessary to remove this text if we use sampling.
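The quoted share of a body of text can be measured by counting the characters on lines that begin with ">". The following is a minimal sketch under that assumption; the function name quoted_fraction is hypothetical, and the exact counting rule behind the 26% figure is not stated in the text.

```python
def quoted_fraction(body: str) -> float:
    """Fraction of characters that lie on lines beginning with '>'."""
    total = 0
    quoted = 0
    for line in body.splitlines(keepends=True):
        total += len(line)
        if line.startswith(">"):
            quoted += len(line)
    return quoted / total if total else 0.0

# Hypothetical two-line reply: one quoted line, one original line.
print(quoted_fraction("> hi\nok\n"))
```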
A simple, general purpose filter was written to distinguish English text from other types of data. The idea is to compare the character frequency distribution of the target data with the distribution in English. Articles are rated on a scale of 0 to 1, with 1 being a perfect match. The score is calculated as:
Score = H/Ht
where H is the entropy of the idealized source, and Ht is the cross entropy of the idealized source using the test sample as a model. A simple unigram character distribution is used.
$$H = \sum_i P(i) \log \frac{1}{P(i)}$$
$$H_t = \sum_i P(i) \log \frac{1}{P_t(i)}$$
where i ranges over the 256 character alphabet of 8-bit bytes. The probabilities P(i) (in the model) and Pt(i) (in the test sample) are estimated by counting characters with an initial value of 1/256 (rather than 0) to avoid probabilities of 0.
P(i) = (N(i) + 1/256)/(N + 1)
where N(i) is the count of the number of occurrences of character i, and N is the total count of all characters.
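The score defined above can be sketched in a few lines of Python. This is an illustrative reimplementation from the formulas in the text, not the author's sim.cpp; the function names char_probs and score and the sample strings are assumptions. The base of the logarithm cancels in the ratio, so base 2 is used here arbitrarily.

```python
import math
from collections import Counter

def char_probs(data: bytes) -> list:
    """Smoothed unigram byte probabilities: P(i) = (N(i) + 1/256) / (N + 1)."""
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    n = len(data)
    return [(counts.get(i, 0) + 1 / 256) / (n + 1) for i in range(256)]

def score(model: bytes, test: bytes) -> float:
    """Score = H / Ht: model entropy divided by cross entropy under the test sample."""
    p = char_probs(model)
    pt = char_probs(test)
    h = sum(pi * math.log2(1 / pi) for pi in p)
    ht = sum(pi * math.log2(1 / pti) for pi, pti in zip(p, pt))
    return h / ht

# Hypothetical inputs: a small English-like model, similar text, and uniform bytes.
model = b"the quick brown fox jumps over the lazy dog " * 50
similar = b"a lazy dog jumps over the quick brown fox " * 50
binary = bytes(range(256)) * 10

print(score(model, model))
print(score(model, similar))
print(score(model, binary))
```

Because the cross entropy can never be smaller than the entropy, the score never exceeds 1, and it equals 1 only when the test distribution matches the model exactly.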
To illustrate how this works, scores were calculated for the 14 files in the Calgary corpus (Bell, Witten, and Cleary, 1989), using the text of Alice in Wonderland from the Gutenberg press (alice30.txt minus the legal header) as a model.
Score     File      Size  Description
0.895915  PAPER2   82199  Technical paper, UNIX troff format (text with formatting codes)
0.874933  PAPER1   53161  Technical paper, UNIX troff format
0.872272  BOOK1   768771  Fiction book, unformatted ASCII with SGML page numbers
0.864516  NEWS    377109  USENET articles (headers and bodies)
0.852758  BOOK2   610856  Non fiction book, UNIX troff format
0.851486  TRANS    93695  Transcript of UNIX terminal session (text and control codes)
0.829446  PROGL    71646  Lisp source code
0.827883  PROGC    39611  C source code
0.826229  PROGP    49379  Pascal source code
0.825960  BIB     111261  Bibliography, UNIX refer format (text)
0.594224  OBJ1     21504  VAX executable program (binary)
0.556023  OBJ2    246814  Apple Macintosh executable program (binary)
0.507828  GEO     102400  Geophysical data (binary)
0.224342  PIC     513216  Bitmapped facsimile image (binary)
The source code for a file filtering program is available as sim.cpp.
When testing was done across newsgroups as a whole (all message bodies concatenated together), a threshold of about 0.94 was necessary. This left 43.5 MB of text.
German and Italian are the most likely to be mistaken for English, probably because the character sets are similar. Of 317 newsgroups in the de.* hierarchy (German), 6 newsgroups had scores above 0.94. Of 356 in it.*, 3 were above 0.94. Of 9359 total newsgroups in the sample, 2670 were above 0.94.
At the 0.94 newsgroup threshold, 27.4 MB (63%) of the text was unquoted (no leading > character). Dividing by the crossposting factor (1.67 newsgroups per post) leaves 16.4 MB of original text in our sample. Because only every 100th article was sampled, this corresponds to about 1.64 GB of original English text available over a 2 week period, or 117 MB per day.
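The arithmetic behind this estimate can be checked as follows. All inputs are figures quoted in the text (including the 1-in-100 sampling rate); the variable names are only illustrative.

```python
unquoted_mb = 27.4       # unquoted English text in the sample, in MB
groups_per_post = 1.67   # average number of newsgroups per article
sampling_rate = 100      # every 100th article was sampled
days = 14                # 2 week period

# Remove the duplication caused by crossposting.
original_mb = unquoted_mb / groups_per_post

# Scale the sample up to the full set of articles.
total_gb = original_mb * sampling_rate / 1000
per_day_mb = original_mb * sampling_rate / days

print(round(original_mb, 1), round(total_gb, 2), round(per_day_mb))
```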
Carroll, Lewis (1865), Alice in Wonderland, Gutenberg Press, ftp://sunsite.unc.edu/pub/docs/books/gutenberg/etext97/alice30h.zip
Harman, D. (ed.), (1995), "Overview of the Third Text Retrieval Conference (TREC-3)", National Institute of Standards and Technology Special Publication 500-225, Gaithersburg MD 20879.
Peterson, I., (1998), "Web searches fall short", Science News (153) p. 286, May 2