Matthew Mahoney
mmahoney@cs.fit.edu
Florida Institute of Technology, CS Dept.
May 13, 2000
In (Mahoney, 2000), I described a predictive arithmetic encoder called P6 based on a 6-gram character model. Prediction is done one bit at a time by a two layer neural network with 222 (about 4 million) inputs, and one output. The output is the probability that the next bit will be a 1. The inputs are "context detectors", where the context is the last 1 to 5 complete bytes, plus the 0 to 7 bits of the partially read current byte. An input unit is active (output of 1) when the context matches a particular value, otherwise it is 0. Because there are more than 222 possible context values, the context is hashed to a 22-bit number to select the active input. There are 5 contexts considered, with lengths of 1 to 5 complete bytes. Thus there are 5 out of 222 inputs active at any time.
Given inputs xi and output y, the neural network computes the probability y = P(next bit is a 1) as
The weights wi are initialized to 0 and trained as follows.
Next, the length 7 context was replaced with a word context, consisting of a hash of all of the alphabetic characters (A-Z, a-z) read so far, back to but not including the last non-alphabetic character. All contexts where the last character was not a letter were mapped to the same input unit. This improved compression to 2.088 bpc. hL and hS were retuned, but no further compression could be obtained, so the values were left unchanged.
Then the length-5 context was replaced with a 2 word context, a hash of the current and previous alphabetic sequences, and ignoring any characters in between. This improved compression to 2.074 bpc. Tuning hL to 0.38 improved compression only slightly to 2.073 bpc.
Increasing the number of contexts, or equivalently, the number of active input neurons, from 5 to 6 slows execution by 20%. However, P12 also uses a faster hash function which improves execution time by 10%, for an overall 10% slowdown. The faster hash function does not affect compression. Compression and decompression time on Alice in Wonderland on a 475 MHz P6-II under Windows 98 is about 4.5 to 5 seconds. The program uses about 16 MB of memory, the same as P6.
File Size PKZIP BOA RKIVE P6 P12 P6®P12 Alice 152,141 2.884 2.061 2.055 2.129 2.073 2.7% Calgary 3,141,622 2.621 1.904* 2.192 2.143* 2.116* 1.7% book1 768,771 3.260 2.204 2.120 2.283 2.204 3.5% bible 4,047,392 2.353 1.463 1.490 1.545 1.489 3.3% ecoli 4,638,690 2.313 1.955 1.941 2.145 2.056 4.2% world192 2,473,400 2.341 1.369 1.411 1.521 1.388 8.8%
P12 does not do as well on the Calgary corpus, a mix of text and binary files, because it was specifically optimized for text. Nor does it do well on ecoli, which contains only the letters a, c, g, and t (the DNA sequence of the bacteria E. Coli). The fact that there is any improvement over P6 is surprising, because the added word context would not provide any useful information and should have the effect of adding random noise due to hash collisions. As far as P12 is concerned, the file consists of a single large word.
Although P12 obtains 3-8% better compression on English text, it is unfortunate that the method used probably would not work on other languages without changing the rules for determining word boundaries. (P12 would interpret extended characters like é as spaces). Hutchens and Alder (1998) described how the start of a word can be detected by a sudden increase in entropy, and I described an improvement to this method by measuring entropy reading both forwards and backwards (Mahoney, 1999). However, it is not known whether automatic lexical acquisition could be applied to text compression or language modeling and achieve better results than ad-hoc parsing.
P12 (C++ source code and Windows executable) is available at cs.fit.edu/~mmahoney/compression
Jiang, J., S. Jones (1992), "Word-based dynamic algorithms for data compression", IEE Proc. Communication, Speech, and Vision, 139(6): 582-586.
Mahoney, Matthew (1999), "A note on lexical acquisition in text without spaces" (unpublished), cs.fit.edu/~mmahoney/dissertation/lex1.html
Mahoney, Matthew (2000), "Fast text compression with neural networks", Proc. FLAIRS, Orlando (to appear), cs.fit.edu/~mmahoney/compression/nn_paper.html
Sutton, Ian (1998), boa 0.58 beta, ftp://ftp.cdrom.com/pub/sac/pack/boa058.zip
Taylor, Malcolm, RKIVE v1.91 beta 1 (1998), http://www.geocities.com/SiliconValley/Peaks/9463/rkive.html
Teahan, W. J., John G. Cleary (1997), "Models of English text", IEEE Proc. Data Compression Conference, 12-21