Improving Neural Network Text Compression with Word-Oriented Contexts

Matthew Mahoney
Florida Institute of Technology, CS Dept.
May 13, 2000


(Mahoney, 2000) describes a fast text compression program using neural network prediction based on a character model and arithmetic encoding. An improvement of 3% to 8% in compression ratio on text files was obtained by incorporating a bigram word model.

1. Introduction

Several researchers have observed improvements in text compression when the input is divided into words, rather than characters. For instance, Teahan and Cleary (1997) improved text compression by about 0.05 bpc (bits per character) in a predictive arithmetic encoder by using words rather than letters as the alphabet in an n-gram prediction model. Similarly, Jiang and Jones (1992) improved Ziv-Limpel compression by storing words rather than arbitrary character sequences in the code dictionary. In both cases, words were split on spaces and punctuation characters using simple parsing rules.

In (Mahoney, 2000), I described a predictive arithmetic encoder called P6 based on a 6-gram character model. Prediction is done one bit at a time by a two layer neural network with 222 (about 4 million) inputs, and one output. The output is the probability that the next bit will be a 1. The inputs are "context detectors", where the context is the last 1 to 5 complete bytes, plus the 0 to 7 bits of the partially read current byte. An input unit is active (output of 1) when the context matches a particular value, otherwise it is 0. Because there are more than 222 possible context values, the context is hashed to a 22-bit number to select the active input. There are 5 contexts considered, with lengths of 1 to 5 complete bytes. Thus there are 5 out of 222 inputs active at any time.

Given inputs xi and output y, the neural network computes the probability y = P(next bit is a 1) as

  1. y = g(Si wixi), (weighted summation of inputs) where
  2. g(x) = 1/(1 + e-x) (squashing function with bounds (0, 1))

The weights wi are initialized to 0 and trained as follows.

  1. wi ¬ wi + xi(hS + hL/s2i)E (adjust the weight to reduce the error, E = actual bit - y), where
  2. s2i = (Ni+1) / (Ni(0)+1/2)(Ni(1)+1/2) (variance of the data in context i)
In (4), Ni(0) and Ni(1) are the counts of 0 and 1 (respectively) occurring when xi is active, Ni = Ni(0) + Ni(1), and hS and hL are parameters called the short and long term learning rates. These were tuned by hand to optimize compression on Alice in Wonderland (the last 152,141 bytes of alice30.txt from the Gutenberg Press). Compression of 2.129 bpc was obtained with hL = 0.35 and hS = (0.08, 0.15, 0.2, 0.2, 0.2) for context lengths of 1 through 5 respectively. By comparison, compress, pkzip, and gzip obtained 3.270, 2.884, and 2.848 bpc. The best known archivers at the time, boa (Sutton, 1998) and rkive (Taylor, 1998), obtained 2.061 and 2.055 bpc.

2. Word-Based Modeling

In the P12 implementation, the 5 character contexts (lengths 1 through 5) were replaced with 4 character contexts (1 through 4) and two word contexts (unigram and bigram). As a first step, a context of length 7 characters was added, and the learning rates were retuned on Alice in Wonderland to hL = 0.34 and hS = 0.06, obtaining a compression of 2.096 bpc. In this case, it did not help to set hS individually for each context length.

Next, the length 7 context was replaced with a word context, consisting of a hash of all of the alphabetic characters (A-Z, a-z) read so far, back to but not including the last non-alphabetic character. All contexts where the last character was not a letter were mapped to the same input unit. This improved compression to 2.088 bpc. hL and hS were retuned, but no further compression could be obtained, so the values were left unchanged.

Then the length-5 context was replaced with a 2 word context, a hash of the current and previous alphabetic sequences, and ignoring any characters in between. This improved compression to 2.074 bpc. Tuning hL to 0.38 improved compression only slightly to 2.073 bpc.

Increasing the number of contexts, or equivalently, the number of active input neurons, from 5 to 6 slows execution by 20%. However, P12 also uses a faster hash function which improves execution time by 10%, for an overall 10% slowdown. The faster hash function does not affect compression. Compression and decompression time on Alice in Wonderland on a 475 MHz P6-II under Windows 98 is about 4.5 to 5 seconds. The program uses about 16 MB of memory, the same as P6.

3. Results

P6 and P12 were tested on the Calgary corpus of 14 files, on book1 (a large text file from it), and the three large files of the Canterbury corpus. To make the comparisons fair, the parameters of P6 and P12 were not adjusted after tuning on Alice in Wonderland. Results are listed below.

  File        Size     PKZIP  BOA    RKIVE  P6     P12    P6®P12
  Alice       152,141  2.884  2.061  2.055  2.129  2.073   2.7%
  Calgary   3,141,622  2.621  1.904* 2.192  2.143* 2.116*  1.7%
  book1       768,771  3.260  2.204  2.120  2.283  2.204   3.5%
  bible     4,047,392  2.353  1.463  1.490  1.545  1.489   3.3%
  ecoli     4,638,690  2.313  1.955  1.941  2.145  2.056   4.2%
  world192  2,473,400  2.341  1.369  1.411  1.521  1.388   8.8%


4. Discussion

P12 obtains compression on text files (Alice, book1, bible, and world192) that is very close (within 2%) to the best available. Compression times are comparable as well, though slower than popular programs such as PKZIP and compress.

P12 does not do as well on the Calgary corpus, a mix of text and binary files, because it was specifically optimized for text. Nor does it do well on ecoli, which contains only the letters a, c, g, and t (the DNA sequence of the bacteria E. Coli). The fact that there is any improvement over P6 is surprising, because the added word context would not provide any useful information and should have the effect of adding random noise due to hash collisions. As far as P12 is concerned, the file consists of a single large word.

Although P12 obtains 3-8% better compression on English text, it is unfortunate that the method used probably would not work on other languages without changing the rules for determining word boundaries. (P12 would interpret extended characters like é as spaces). Hutchens and Alder (1998) described how the start of a word can be detected by a sudden increase in entropy, and I described an improvement to this method by measuring entropy reading both forwards and backwards (Mahoney, 1999). However, it is not known whether automatic lexical acquisition could be applied to text compression or language modeling and achieve better results than ad-hoc parsing.

P12 (C++ source code and Windows executable) is available at


Hutchens, Jason L., and Michael D. Alder (1998), "Finding Structure via Compression", Proceedings of the International Conference on Computational Natural Language Learning, pp. 79-82,

Jiang, J., S. Jones (1992), "Word-based dynamic algorithms for data compression", IEE Proc. Communication, Speech, and Vision, 139(6): 582-586.

Mahoney, Matthew (1999), "A note on lexical acquisition in text without spaces" (unpublished),

Mahoney, Matthew (2000), "Fast text compression with neural networks", Proc. FLAIRS, Orlando (to appear),

Sutton, Ian (1998), boa 0.58 beta,

Taylor, Malcolm, RKIVE v1.91 beta 1 (1998),

Teahan, W. J., John G. Cleary (1997), "Models of English text", IEEE Proc. Data Compression Conference, 12-21