USENET as a Text Corpus

Matt Mahoney
mmahoney@cs.fit.edu
Florida Tech, CS Dept.
May 11, 2000

Abstract

Of the over one gigabyte of data posted daily to USENET, about 10% is original, readily extractable English text suitable for linguistic research or language modeling. A method of extraction involving character frequency analysis is described.

1. Introduction

Research in statistical language modeling usually requires a corpus, a large quantity of text. Usually this consists of published works written by professional writers. A widely cited example is the TREC corpus (Harman, 1995), which includes about one gigabyte of text from the Wall Street Journal from 1987-92. One problem with this source is that it is not representative of the English language as a whole. For instance, newspaper articles rarely include the common English word "you". Another problem is that it is dated and does not represent the current state of the English language. Thus, words like "website" are not likely to be found.

An obvious source of text is the web. It is up to date, but does not solve the representation problem. In order to sample web pages randomly, it is necessary to obtain a complete list of them. The only way to do this is with a spider or web crawler, the technique used by search engines to build an index, and this requires enormous computational resources and network bandwidth. Furthermore, even the best search engines only index about a third of the web (Peterson, 1998). It might be possible to use search engines to sample the web by entering every word in the dictionary and using whatever pages are returned, but this fails to weight the corpus in favor of the most common words, and we don't know which words are the most common without a representative sample to begin with.

With USENET, it is easy to obtain a complete list of available posts from a news server, thus obtaining text covering an extensive range of topics. It is true that such text is of lower quality than professionally written material; it is full of opinions, personal attacks, obscenities, spam (inappropriate advertising), misspellings, typographical errors, etc. However, this is the way ordinary people write. It is a cross section of the language as used by the general population (at least those with access to computers, which would be the audience that an AI application developer would want to target).

In section 2, I describe some of the properties of the USENET database. In section 3, I describe a method of separating English text from foreign languages and non-text data such as images. In section 4, I describe the results of this filtering on USENET.

2. Properties of USENET

To analyze the properties of USENET, on May 8, 2000, I sampled 1% of the articles available on the news server at nntp.ix.netcom.com owned by the Internet service provider Netcom, now part of Mindspring/Earthlink. At that time the server carried 25,929 newsgroups with articles going back 14 days. The sample included articles from 9359 newsgroups, skipping many of the smaller ones.

A preliminary sample of articles showed that about half of the total text consisted of binary files in UUENCODE format. In this format, 4 ASCII characters from a 64-character set represent 3 bytes of binary data. Typically these encode images (.gif or .jpeg), videos (.mpeg), music (.mp3), or archives (.zip, .gz, .tar.Z). They are usually large (5000 to 10000 lines) and often in multiple parts. In some cases, it may require hundreds of articles to post a single .mpeg file. A typical case looks like this:

  begin 644 m13360b.jpg
  M_]C_X``02D9)1@`!````2@!*``#__@`"_]L`A``(!08'!@4(!P8'"0@("0P4
  M#0P+"PP8$1(.%!T9'AX<&1P;("0N)R`B*R(;'"@V*"LO,3,T,Q\F.#PX,CPN
  M,C,Q`0@)"0P*#!<-#1<Q(1PA,3$Q,3$Q,3$Q,3$Q,3$Q,3$Q,3$Q,3$Q,3$Q
  M,3$Q,3$Q,3$Q,3$Q,3$Q,3$Q,3$Q,3'_Q`&B```!!0$!`0$!`0``````````
  M`0(#!`4&!P@)"@L!``,!`0$!`0$!`0$````````!`@,$!08'"`D*"Q```@$#
  M`P($`P4%!`0```%]`0(#``01!1(A,4$&$U%A!R)Q%#*!D:$((T*QP152T?`D
  (several thousand more lines...)
  end
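As an illustration only, UUencoded blocks of this form could be dropped from an article body with a sketch like the following. The function name strip_uuencoded and the regular expression are assumptions made for this example, not the method used in this study. The sketch only recognizes blocks that start with a "begin" line, so the later parts of a multi-part post (which have no "begin" line) would pass through untouched.

```python
import re

# Hypothetical pattern for the header line of a UUencoded block,
# e.g. "begin 644 m13360b.jpg" (octal mode, then a file name).
BEGIN_RE = re.compile(r"^\s*begin [0-7]{3,4} \S")

def strip_uuencoded(body: str) -> str:
    """Return body with every "begin ... end" block removed."""
    kept = []
    inside = False
    for line in body.splitlines():
        if not inside and BEGIN_RE.match(line):
            inside = True          # header line: start skipping
            continue
        if inside:
            if line.strip() == "end":
                inside = False     # trailer line: stop skipping
            continue
        kept.append(line)
    return "\n".join(kept)
```

Section 4 notes that such pre-filtering turns out not to be required for the language filter, but it does reduce the amount of data to be scored.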

To reduce the amount of unwanted data that has to be downloaded and discarded, the 742 newsgroups containing the string ".binar" were removed from the list. This eliminated all of the "binaries" groups, as well as some foreign language hierarchies such as "chile.binarios" and "it.binari".
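The substring test described above is simple enough to sketch in a few lines. The function name and the sample group names other than "chile.binarios" and "it.binari" are hypothetical and chosen only for illustration.

```python
def drop_binary_groups(groups):
    """Keep only newsgroup names that do not contain the string ".binar"."""
    return [name for name in groups if ".binar" not in name]

# Hypothetical listing; only two of the five names survive the filter.
listing = [
    "alt.binaries.pictures",   # hypothetical "binaries" group
    "chile.binarios",
    "it.binari",
    "comp.lang.c",             # hypothetical text group
    "rec.arts.books",          # hypothetical text group
]
remaining = drop_binary_groups(listing)
```

On the full listing, this test removes 742 of the 25,929 newsgroups, leaving 25,187.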

The listing for the remaining 25,187 newsgroups contained a total of 5,805,351 articles. Every 100th article was sampled. Because of gaps between the low and high article numbers in each group, only 53,247 articles (92% of those requested) were available. The total size was 163.9 MB, which took 7 hours to download over a 36K dialup connection. This consisted of 53.6 MB of NNTP headers and 110.2 MB of article bodies, of which 21.0 MB was still UUencoded files. This left 89.2 MB of text, or 54.4% of the total download.
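The effect of gaps in the article numbering can be sketched as follows. This is a sketch under assumptions: it steps through each group's range from its low article number, and the fetch callback is a hypothetical stand-in for a request to the news server that returns None when an article number is missing. The actual sampling procedure may have differed in detail.

```python
from typing import Callable, List, Optional, Tuple

def sample_group(low: int, high: int,
                 fetch: Callable[[int], Optional[str]],
                 step: int = 100) -> Tuple[int, List[str]]:
    """Request every step-th article number in [low, high].

    Returns the number of requests made and the articles actually found.
    """
    requested = 0
    found = []
    for number in range(low, high + 1, step):
        requested += 1
        article = fetch(number)    # hypothetical: None means a gap
        if article is not None:
            found.append(article)
    return requested, found

# Toy example: a group whose article 101 has expired or was cancelled.
store = {1: "first", 201: "third"}
requested, found = sample_group(1, 300, store.get)
```

In the toy example three article numbers are requested (1, 101, 201) and two are found, which is the same kind of shortfall as the 92% availability reported above.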

The 53,247 articles were posted by 39,766 different authors. Of these, 32,985 posted a single article. The largest number of posts by a single author was 566, which turned out to be a robot posting to alt.nocem, a newsgroup used by the CancelMoose program (www.cm.org) for filtering spam. The posts consist of lists of article numbers to be filtered.

Because of crossposting, it is possible to see duplicate articles, although sampling mostly eliminates this problem. On average, each article was posted to 1.67 newsgroups. However, this does not account for crossposting to newsgroups not carried by the server, or duplicate postings to newsgroups in order to circumvent rules about excessive crossposting by some newsgroups (thus, the need for CancelMoose).

Another source of duplicate text is quoting a message in a response. Usually, lines of quoted text begin with ">". Of 89.2 MB of text, 23.1 MB (26%) is quoted. Again, it is not necessary to remove this text if we use sampling.
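The quoted fraction can be measured with a sketch like the one below, which treats a line as quoted only if its first character is ">". The function name is hypothetical, and other quoting conventions are not handled.

```python
def split_quoted(body: str):
    """Return (quoted, unquoted) character counts for an article body.

    A line counts as quoted when it begins with ">".
    Line terminators are included in the counts.
    """
    quoted = 0
    unquoted = 0
    for line in body.splitlines(keepends=True):
        if line.startswith(">"):
            quoted += len(line)
        else:
            unquoted += len(line)
    return quoted, unquoted

# Toy article: one quoted line (5 characters) and one reply line (4).
quoted, unquoted = split_quoted("> hi\nyes\n")
fraction_quoted = quoted / (quoted + unquoted)
```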

3. Language Filtering

In addition to English text, USENET contains text in many other languages, as well as non-text data, such as automated and machine readable postings, ROT13 encoded text, encrypted text, PGP signatures, ASCII art, and so on. Therefore, simply filtering out UUencoded files is not enough.

A simple, general purpose filter was written to distinguish English text from other types of data. The idea is to compare the character frequency distribution of the target data with the distribution in English. Articles are rated on a scale of 0 to 1, with 1 being a perfect match. The score is calculated as:

Score = H/H_t

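where H is the entropy of the idealized source, and H_t is the cross entropy of the idealized source using the test sample as a model. Because the cross entropy is never less than the entropy, the score cannot exceed 1, and it equals 1 only when the two distributions are identical. A simple unigram character distribution is used.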

H = Σ_i P(i) log 1/P(i)
H_t = Σ_i P(i) log 1/P_t(i)

where i ranges over the 256-character alphabet of 8-bit bytes. The probabilities P(i) (in the model) and P_t(i) (in the test sample) are estimated by counting characters, with each count starting at 1/256 (rather than 0) to avoid probabilities of 0.

P(i) = (N(i) + 1/256)/(N + 1)

where N(i) is the count of the number of occurrences of character i, and N is the total count of all characters.
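The formulas above can be sketched in a few lines. This is an illustrative sketch only, not the sim.cpp program: the function names are hypothetical, and the short byte strings stand in for a real model text such as Alice in Wonderland. The base of the logarithm does not matter because it cancels in the ratio H/H_t.

```python
import math
from collections import Counter

def distribution(data: bytes):
    """Estimate P(i) = (N(i) + 1/256) / (N + 1) for all 256 byte values."""
    counts = Counter(data)     # iterating over bytes yields integers 0..255
    n = len(data)
    return [(counts[i] + 1 / 256) / (n + 1) for i in range(256)]

def score(model: bytes, test: bytes) -> float:
    """Return H / H_t, where P comes from model and P_t from test."""
    p = distribution(model)
    p_t = distribution(test)
    h = sum(pi * math.log2(1 / pi) for pi in p)
    h_t = sum(pi * math.log2(1 / qi) for pi, qi in zip(p, p_t))
    return h / h_t

# Toy data (assumptions for illustration, not the files used in the paper).
model = b"the quick brown fox jumps over the lazy dog " * 20
english = b"a lazy dog jumps over the quick brown fox " * 20
binary = bytes(range(256)) * 4

s_english = score(model, english)   # similar letter frequencies: near 1
s_binary = score(model, binary)     # uniform byte frequencies: much lower
```

With these toy inputs the English-like sample scores close to 1 and the uniform byte sample scores well below it, which is the same ordering that separates text from binary data in the table below.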

To illustrate how this works, scores were calculated for the 14 files in the Calgary corpus (Bell, Witten, and Cleary, 1989), using the text of Alice in Wonderland from Project Gutenberg (alice30.txt minus the legal header) as a model.

    Score  File     Bytes   Description
  0.895915 PAPER2   82199   Technical paper, UNIX troff format (text with formatting codes)
  0.874933 PAPER1   53161   Technical paper, UNIX troff format
  0.872272 BOOK1   768771   Fiction book, unformatted ASCII with SGML page numbers
  0.864516 NEWS    377109   USENET articles (headers and bodies)
  0.852758 BOOK2   610856   Non fiction book, UNIX troff format
  0.851486 TRANS    93695   Transcript of UNIX terminal session (text and control codes)
  0.829446 PROGL    71646   Lisp source code
  0.827883 PROGC    39611   C source code
  0.826229 PROGP    49379   Pascal source code
  0.825960 BIB     111261   Bibliography, UNIX refer format (text)
  0.594224 OBJ1     21504   VAX executable program (binary)
  0.556023 OBJ2    246814   Apple Macintosh executable program (binary)
  0.507828 GEO     102400   Geophysical data (binary)
  0.224342 PIC     513216   Bitmapped facsimile image (binary)

The source code for a file filtering program is available as sim.cpp.

4. Filtering USENET

When the filtering algorithm was applied to the bodies of individual USENET articles (again using Alice in Wonderland as a model), it was found that a threshold of about 0.91 to 0.92 was necessary to remove most foreign language text. This leaves 44.5 MB to 39.5 MB of text from the original 110.2 MB. The thresholds were chosen by manually sampling the resulting text. It is not necessary to pre-filter UUencoded files, as these score very low, around 0.2.

When testing was done across newsgroups as a whole (all message bodies concatenated together), a threshold of about 0.94 was necessary. This left 43.5 MB of text.
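The newsgroup-level test can be sketched as follows. The function name, the dictionary layout, and the score_fn parameter are assumptions for this example. score_fn stands for any function that rates a byte string against the English model, and the toy scorer below is a placeholder, not the character frequency score.

```python
from typing import Callable, Dict, List

def english_groups(bodies_by_group: Dict[str, List[bytes]],
                   score_fn: Callable[[bytes], float],
                   threshold: float = 0.94) -> List[str]:
    """Return the newsgroups whose concatenated bodies score above threshold."""
    kept = []
    for group, bodies in bodies_by_group.items():
        combined = b"\n".join(bodies)      # all message bodies together
        if score_fn(combined) > threshold:
            kept.append(group)
    return kept

# Placeholder scorer for demonstration only (hypothetical).
def toy_score(data: bytes) -> float:
    return 0.95 if b" the " in data else 0.50

sample = {
    "rec.arts.books": [b"I read the book.", b"So did the rest of us."],
    "de.test": [b"Das ist ein Test.", b"Noch ein Test."],
}
kept = english_groups(sample, toy_score)
```

Scoring whole newsgroups rather than single articles gives each decision far more characters to count, at the cost of keeping or discarding every article in a group together.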

German and Italian are the most likely to be mistaken for English, probably because the character sets are similar. Of 317 newsgroups in the de.* hierarchy (German), 6 newsgroups had scores above 0.94. Of 356 in it.*, 3 were above 0.94. Of 9359 total newsgroups in the sample, 2670 were above 0.94.

At the 0.94 newsgroup threshold, it was found that 27.4 MB (63%) of text was unquoted (no leading > character). If we account for crossposting (1.67 newsgroups per post), then 16.4 MB are original in our sample. Because the sample covers 1% of the articles, scaling by 100 gives 1.64 GB of original English text available over a 2 week period, or 117 MB per day.
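The arithmetic behind this estimate can be checked directly. The variable names are chosen for this example, and all input values are the figures reported in this paper.

```python
unquoted_mb = 27.4        # unquoted text at the 0.94 newsgroup threshold
groups_per_post = 1.67    # average newsgroups per article (crossposting)
sampling_rate = 0.01      # every 100th article was sampled
retention_days = 14       # the server kept articles for 14 days

# Divide out crossposted duplicates within the sample.
original_sample_mb = unquoted_mb / groups_per_post

# Scale the 1% sample up to the full server contents.
original_total_mb = original_sample_mb / sampling_rate

# Spread the total over the retention period.
per_day_mb = original_total_mb / retention_days

print(f"{original_sample_mb:.1f} MB in sample, "
      f"{original_total_mb / 1000:.2f} GB total, "
      f"{per_day_mb:.0f} MB per day")
```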

5. Conclusion

USENET provides a good source of up-to-date text. Over 100 MB of English per day is available by use of simple filtering techniques based on character frequency distribution, and removal of headers, duplicate posts and quoted material. By using sampling, the duplication and quoting problems are eliminated, and a ready source of over 40 MB is available at a 1% sampling rate. These numbers could probably be improved with better filtering.

References

Bell, T., I. H. Witten, and J. G. Cleary (1989), "Modeling for Text Compression", ACM Computing Surveys 21(4), pp. 557-591.

Carroll, L. (1865), Alice in Wonderland, Project Gutenberg, ftp://sunsite.unc.edu/pub/docs/books/gutenberg/etext97/alice30h.zip

Harman, D. (ed.) (1995), "Overview of the Third Text Retrieval Conference (TREC-3)", National Institute of Standards and Technology Special Publication 500-225, Gaithersburg MD 20879.

Peterson, I. (1998), "Web searches fall short", Science News 153, p. 286, May 2.