/* PAQ1 - File archiver and compressor. (C) 2002, Matt Mahoney, mmahoney@cs.fit.edu This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA or see http://www.gnu.org/licenses/gpl.txt USAGE To compress: PAQ1 archive file file... (1 or more file names), or or (MSDOS): dir/b | PAQ1 archive (read file names from input) or (UNIX): ls | PAQ1 archive To decompress: PAQ1 archive To list contents: more < archive Compression: The files listed are compressed and stored in the archive, which is created. The archive must not already exist. File names may specify a path, which is stored. If there are no file names on the command line, then PAQ1 prompts for them, reading until the first blank line or end of file. Decompression: No file names are specified. The archive must exist. If a path is stored, the file is extracted to the appropriate directory, which must exist. PAQ1 does not create directories. If the file to be extracted already exists, it is not replaced; rather it is compared with the archived file, and the offset of the first difference is reported. It is not possible to add, remove, or update files in an existing archive. If you want to do this, extract the files, delete the archive, and create a new archive with just the files you want. FILE FORMAT The archived file names are stored in a readable header as follows: PAQ1\r\n size name\r\n size name\r\n \032\f\0 where "size" is the original number of bytes, as a 10 digit decimal number, right justified with leading spaces, "name" is the file name (as input, including path if any), \r = carriage return, \n = linefeed, \f = formfeed, \032 is the DOS end of file character, and \0 is a NUL. The header is followed by the compressed data in binary format. The input files are concatenated and treated as a single data stream. Some improvement in compression is possible by grouping related files together, for instance, all text files followed by all binary files. Data is compressed using a bit stream predictive arithmetic encoder. An input string s is encoded as a binary fraction x such that for any randomly chosen string r of the same length: p(r < s) <= x < p(r <= s) (1) Such an x with length at most log2 p(s) + 1 bits can always be found. It is convenient to estimate p(s) = p(s1 s2 s3 ... sn) = p(s1) p(s2 | s1) p(s3 | s1 s2) ... p(sn | s1 s2...sn-1) and to narrow the range of possible x (initially 0 to 1) as each probability term is computed. As the leading bits of x become known (because they are the same at the lower and upper bounds), they can be output. The archive is a binary representation of x in base 256 with the most significant digit (256^-1) first. The ordering of s is as a binary number of n bits (n/8 bytes) with the most significant bit as the most significant bit of the first byte. The archive is decompressed by finding s that satifies (1) given x. We make an identical series of probability estimates p(si | s1...si-1) where each si is one bit (0 or 1), then use our knowledge of x to decide which it is. After 8 predictions, one byte of uncompressed data is output. Compression performance depends on the quality of prediction. A predictor consists of a set of models that output a weighted probability that the next bit (si) will be a 0 or 1, given all of the previous input (s1 s2 ... si-1). The weighted probability is expressed as counts of 0s and 1s (n0 and n1), where p(1) = n1/(n0+n1) with weight (confidence) n0+n1. Models are combined by adding their respective counts, which has the effect of a weighted averaging. PAQ1 combines the outputs of 5 models: m0 - A bland model in which p(1) = 1/2 with confidence 2, i.e. n0 = 1, n1 = 1. By itself, this model would result in no compression. m1 - NonstatioaryPPM. This is a combination of N = 8 models of order n = 1 through N. In an order n model, the next bit is predicted in the context of the last n - 1 bytes, plus the last 0 to 7 bits of the current byte. Each time a 0 or 1 occurs in some context, the counts n0 and n1 associated with that context is incremented. If the opposite count is more than 2, then the excess is halved. This is a compromise between a stationary model, in which all outcomes in the same context are equally significant, and an inverse temporal model, in which the probability of an event is inversely proportional to the time since it last occurred, regardless of anything that happened previous to that. For example, suppose that a context occurs 10 times, and the sequence of bits that follow the context is 1111111000. If the outcomes are independent, then n0 = 7, n1 = 3, and p(1) = 7/10 with confidence 10. In an inverse temporal model, the last 1 was 4 events ago and the last 0 was 1 event ago, so 0 is 4 times more likely than 1, for p(1) = 1/5. PAQ1 uses a compromise between these two positions: 1111111 n0 = 0 n1 = 7 p(1) = 1 with confidence 7 11111110 n0 = 1 n1 = 4 p(1) = 4/5 with confidence 5 111111100 n0 = 2 n1 = 3 p(1) = 3/5 with confidence 5 1111111000 n0 = 3 n1 = 2 p(1) = 2/5 with confidence 5 The counts are weighted by n^2 (order squared). The rationale is that if the last n bytes of context match, then the next byte will match with probability (n - 1)/n (in an inverse temporal model), so we weight the counts by n. Also, the expected number of matches of length n is proportional to 1/n (given the above probability), so we weight the counts by n again. The counts are stored in a hash table with 16M entries. Each entry takes 2 bytes for a total of 32 MB of memory: an 8 bit hash checksum to detect collisions, and an 8 bit state representing the two counts n0 and n1. The context is hashed to 32 bits, with 24 bits used to index the hash table and 8 bits used to detect collisions. If a collision occurs, then the next two table elements are tried. If all 3 are occupied, then the one with the lowest n0 + n1 is removed. The 8-bit state represents counts of 0 to 10 exactly, and counts up to 512 as an approximation with probabilistic increments. Since n0 and n1 can't both be large at the same time, no states are needed for such pairs. m2 - MatchModel. This is an approximation of the NonstationaryPPM model for orders longer than n = 8. It searches the previous 4M bytes of input for a matching context of 8 or more bytes (plus the current 0-7 bits), and if found, predicts whatever bit followed next with weight 3n^2 (instead of n^2 since there might be more matches, but only the last one is used). It uses a hash table of 1M 3-byte pointers (always overwriting the previous entry) into the 4MB buffer to find matching contexts, for a total 7 MB of memory. m3 - WordModel. This is an optimization for English text. There are two models, and order n = 1 word model and an order n = 2 word model. In the first, the context is the most recent word or partial word, plus the previous and current bytes (8 to 15 bits). A word is defined as a sequence of letters (a-z), case insensitive. The second model includes the previous word in the context. The weight is (n + w)^2 where w is the length of the word or words (i.e. the square of the length of the context in bytes as in m1). The counts are stored in a 4M hash table (8 MB memory) as in model m1. m4 - CyclicModel. This is an optimization for data with fixed length records or rows, such as databases, tables, and binary numeric data. If the record length can be determined, then two models are used. The first is an order 1 bit model (no context) with weight 4 in which the history is all bits in the same column. The second is an order 9-bit model with weight 16 in which the context is the previous 8 bits (spanning a byte boundary), i.e. the history is all those bits which have the same 8 bit context as the current bit being predicted. The bit counts are stored in a 32K (64 KB) hash table. The record length is determined by finding a pattern of 4 consecutive 8 bit sequences at equal intervals of at least 9 bits. (It does not have to be an integral number of bytes). If a pattern of n equally spaced sequences is found at interval r, then n more records are assumed before the model is expired. A new cycle length of r may replace an existing one if the new product rn exceeds the old one (with n rows remaining until the model expires). PERFORMANCE The following results were obtained by PAQ1 and various popular and top ranked compression programs on the Calgary corpus. For each program, two tests were performed. In the first, the 14 files were compressed and the total size is reported (either as an archive or 14 compressed files, depending on the program). In the second test, the 14 files were concatenated (alphabetically by file name) into a single file of size 3,141,622 bytes and compressed. The resulting sizes are shown. Run time is shown for the first test but are about the same for both. The test was performed on a 750 MHz Duron with 256 MB memory, 128K L2 cache, running Windows Me. PAQ1 was compiled using DJGPP (g++) 2.95.2 with optimization (gxx -O). For all other programs, options are selected for maximum compression within 64 MB of memory. Program Options 14 files Seconds Concatenated ------- ------- -------- ------- ------------ compress 1,272,772 1.5 1,318,269 pkzip 2.04e 1,032,290 1.5 1,033,217 gzip 1.2.4 -9 1,017,624 2 1,021,863 P5 992,902 31.8 same P6 841,717 38.4 same P12 831,341 38.8 same bzip2 1.0.0 -9 828,347 5 859,448 acb u 766,322 110 769,363 boa 0.58b -m15 751,413 44 769,196 ppmd H e -m64 -o16 744,057 5 759,674 sbc 0.910 c -m3 740,161 4.1 819,016 ppmonstr H e -m64 -o1 719,922 13 736,899 rk 1.04 -mx3 -M64 -ts 712,188 36 755,872 rk 1.02 b5 -mx3 707,144 64 750,740 PAQ1 m1, 1 MB 843,819 46.3 same for all m1, 8 MB 768,463 45.9 m1, 16 MB 758,456 45.6 m1, 32 MB 751,734 45.3 m1-m2, 32+7 MB 731,637 48.5 m1-m3, 32+7+8 MB 723,092 62.1 m1-m4, 16+7+8 MB 720,310 71.2 m1-m4, 32+7+4 MB 717,766 68.0 m1-m4, 32+7+8 MB (as coded) 716,704 68.1 compress, pkzip, and gzip are LZ encoders, encoding strings as pointers to previous occurrences or to dictionary entries. bzip2 and sbc are Burrows-Wheeler (context sorting) encoders. P5, P6, and P12 use neural network prediction and arithmetic encoding. acb, boa, ppmd, ppmonstr, and rk are stationary PPM (longest context matching) encoders. rk is the best of 157 compressors rated by ACT as of 6/24/2001 (Gilchrist, see http://compression.ca/act-calgary.html ), followed by sbc, boa, and acb. On the Canterbury corpus, ppmonstr, ppmd, and rk are the top three (see http://www.geocities.com/SiliconValley/Bay/1995/texts22.html ). PAQ1 beats all other comprssors when the Calgary corpus is concatenated into a single file. Compression is the same with or without concatenation because PAQ1 (and P5, P6, P12) always concatenate the files to take advantage of cross file statistics. Concatenation tends to hurt stationary models such as PPM and Burrows-Wheeler because the Calgary corpus contains a mix of text and binary files with different statistics and the models do not adapt. Results for RK -mx3 and PAQ1. RK is a PPMZ-based program which rearranges the file order to optimize compression across files. For consistency, the PAQ1 test was run with the files in the same order as RK rather than alphabetically as previously. File Size RK -mx3 Ratio PAQ1 Ratio ------ ------ ------- ----- ------ ----- PAPER2 82199 22111 26.8 22323 27.16 PAPER1 53161 13085 24.6 12783 24.05 BOOK1 768771 201950 26.2 206539 26.87 BOOK2 610856 139591 22.8 132151 21.63 OBJ1 21504 9397 43.6 10027 46.63 GEO 102400 47960 46.8 52244 51.02 OBJ2 246814 63754 25.8 66416 26.91 PIC 513216 30595 5.9 46975 9.15 BIB 111261 24146 21.7 23011 20.68 PROGL 71646 13692 19.1 12661 17.67 PROGC 39611 11614 29.3 10571 26.69 PROGP 49379 10207 20.6 9048 18.32 TRANS 93695 14273 15.2 13465 14.37 NEWS 377109 104435 27.6 97917 25.97 ------ ------ ------- ----- ------ ----- 14 3141622 706810 22.4 716386 22.80 rk outperforms PAQ1 on binary files such as GEO (seismic data, 32 bit numbers), PIC (bitmapped 200 dpi FAX image, page from a textbook), OBJ1 (VAX executable), and OBJ2 (Macintosh executable). PAQ1 outperforms rk on most text files, especially toward the end as more statistics are collected. rk also outperforms PAQ1 on random data, with practically no expansion vs. 6.6% (8.5 bpc) for PAQ1. Slowing the aging of statistics (making the model more stationary) decreases this expansion but hurts compression for almost all other files. It is surprisingly difficult to detect randomness embedded in compressible data without sacrificing the ability to adapt to changing statistics. ORIGINAL CODE: 1/6/01. Compiled using DJGPP 2.95.2. REVISED 1/21/2001 to compile in Borland C++ 5.5.1. - Removed unused typedef U64 (Borland does not support long long). - In the CyclicModel constructor, changed cpos(255) to cpos(256) to fix a vector out of bounds bug. - In main(), added a test for filesize.size()==0 before calling max_element(). The last two bugs caused PAQ1 to crash when compiled with Borland but not g++. Archive compatibility is not affected. */ #include #include #include #include #include #include #include #include #include #include using namespace std; // 8-32 bit unsigned types, adjust as appropriate typedef unsigned char U8; typedef unsigned short U16; typedef unsigned long U32; class U24 { // 24-bit unsigned int U8 b0, b1, b2; // Low, mid, high byte public: explicit U24(int x=0): b0(x), b1(x>>8), b2(x>>16) {} operator int() const {return (((b2<<8)|b1)<<8)|b0;} }; // 32-bit random number generator based on r(i) = r(i-24) ^ r(i-55) class Random { U32 table[55]; // Last 55 random values int i; // Index of current random value in table public: Random(); U32 operator()() { // Return 32-bit random number if (++i==55) i=0; if (i>=24) return table[i]^=table[i-24]; else return table[i]^=table[i+31]; } } rnd; Random::Random(): i(0) { for (int j=0; j<55; ++j) table[j]=314159265*j; for (int j=0; j<10000; ++j) operator()(); } /* Model interface. A Predictor is made up of a collection of various models, whose outputs are summed to yield a prediction. Methods: Model.predict(int& n0, int& n1) - Adds to counts n0 and n1 such that it predicts the next bit will be a 1 with probability n1/(n0+n1) and confidence n0+n1. Model.update(int y) - Appends bit y (0 or 1) to the model. */ class Model { public: virtual void predict(int& n0, int& n1) const = 0; virtual void update(int y) = 0; virtual ~Model() {} }; /* Hash table element base class. It contains an 8-bit checksum to detect collisions, and a priority() method which is used to control replacement when full by replacing the element with the lowest priority (0 = unused). The derived class should supply the data to be stored and override priority(). */ class HashElement { U8 ch; // Checksum public: HashElement(int c=0): ch(c) {} // Initialize the checksum, 0 = unused int checksum() const {return ch;} // Collision if not matched int priority() const {return ch!=0;} // Override: lowest replaced first }; /* 3 byte counter, shown for reference only. It implements a nonstationary pair of counters of 0s and 1s such that preference is given to recent history by discarding old bits. */ class Counter3: public HashElement { U8 n[2]; // n[y] is the counts of ys (0s or 1s) public: Counter3(int c=0): HashElement(c) {n[0]=n[1]=0;} int get0() const {return n[0];} // Return count of 0s int get1() const {return n[1];} // Return count of 1s int priority() const {return get0()+get1();} // For hash replacement void add(int y) { // Add y (0 or 1) to n[y] and age the opposite count if (n[y]<255) ++n[y]; if (n[1-y]>2) (n[1-y]/=2)+=1; } }; /* Approximately equivalent 2 byte counter implementing the above. The representable counts (n0, n1) are 0-10, 12, 14, 16, 20, 24, 28, 32, 48, 64, 128, 256, 512. Both counts are represented by a single 8-bit state. Counts larger than 10 are incremented probabilistically. Although it uses 1/3 less memory, it is 8% slower and gives 0.05% worse compression than the 3 byte counter. */ class Counter: public HashElement { U8 state; struct E { // State table entry U16 n0, n1; // Counts represented by state U8 s00, s01; // Next state on input 0 without/with probabilistic incr. U8 s10, s11; // Next state on input 1 U32 p0, p1; // Probability of increment x 2^32-1 on inputs 0, 1 }; static E table[244]; // State table public: Counter(int c=0): HashElement(c), state(0) {} int get0() const {return table[state].n0;} int get1() const {return table[state].n1;} int priority() const {return state;} void add(int y) { if (y) { if (state<94 || rnd() is a hash table of 2^N elements of type T (derived from HashElement) with linear search of 3 elements in case of collision. If all elements collide, then the one with the lowest .priority() is replaced. Hashtable[i] returns a T& indexed by the lower bits of i whose checksum matches the upper bits of i, creating or replacing if needed. */ template class Hashtable { private: T* table; // Array of 2^N+M elements Hashtable(const Hashtable&); // No copy Hashtable& operator=(const Hashtable&); // No assignment public: Hashtable(): table(new Counter[(1< T& Hashtable::operator[](U32 i) { const U32 lb=i&((1<>24; int bj=lb; int bget=1000000000; for (U32 j=lb; j counter0; // Counters for context lengths 0 and 1 vector counter1; Hashtable counter2; // for lengths 2 to N-1 Counter *cp[N]; // Pointers to current counters U32 hash[N]; // Hashes of last 0 to N-1 bytes public: inline void predict(int& c0, int& c1) const; // Add to counts of 0s and 1s inline void update(int y); // Append bit y (0 or 1) to model NonstationaryPPM(); }; NonstationaryPPM::NonstationaryPPM(): c0(1), c1(0), cn(1), counter0(256), counter1(65536) { for (int i=0; iget0()*wt; n1+=cp[i]->get1()*wt; } } // Add bit y (0 or 1) to model void NonstationaryPPM::update(int y) { // Count y by context for (int i=0; iadd(y); // Store bit y cn+=cn+y; if (cn>=53) cn-=53; c0+=c0+y; if (c0>=256) { // Start new byte for (int i=N-1; i>0; --i) hash[i]=(hash[i-1]+c0)*987660757; c1=c0-256; c0=1; cn=1; } // Set up pointers to next counters cp[0]=&counter0[c0]; cp[1]=&counter1[c0+(c1<<8)]; for (int i=2; i buf; // Input buffer, wraps at end vector ptr; // Hash table of pointers U32 hash; // Hash of current context up to pos-1 int pos; // Element of buf where next bit will be stored int bpos; // Number of bits (0-7) stored at buf[pos] int begin; // Points to first matching byte (does not wrap) int end; // Points to last matching byte + 1, 0 if no match public: MatchModel(): buf(0x400000), ptr(0x100000), hash(0), pos(0), bpos(0), begin(0), end(0) {} void predict(int& n0, int& n1) const { if (end) { int wt=end-begin; if (wt>1000) wt=3000000; else wt*=wt*3; if ((buf[end]>>(7-bpos))&1) n1+=wt; else n0+=wt; } } // Append bit y to buf and check that it matches at the end. // After a byte is completed, compute a new hash and store it. // If there is nn current match, search for one. void update(int y) { (buf[pos]<<=1)+=y; // Store bit ++bpos; if (end && (buf[end]>>(8-bpos))!=buf[pos]) // Does it match? begin=end=0; // no if (bpos==8) { // New byte bpos=0; hash=hash*(16*123456791)+buf[pos]+1; if (++pos==int(buf.size())) pos=0; if (end) ++end; else { // If no match, search for one U32 h=(hash^(hash>>16))&(ptr.size()-1); end=ptr[h]; if (end>0) { begin=end; int p=pos; while (begin>0 && p>0 && begin!=p+1 && buf[begin-1]==buf[p-1]) { --begin; --p; } } if (end==begin) // No match found begin=end=0; ptr[h]=U24(pos); } } } }; /* A WordModel predicts in the context of whole words (a-z, case insensitive) using a trigram model. */ class WordModel { private: U32 word1, word0; // Hash of previous and current word int ww1, ww0; // Word weights (lengths) Hashtable t1; // Model int c1; // Previous char, lower case int c0; // 0-7 bits of current char with leading 1 bit int c0h; // Small hash of c0 Counter *cp0, *cp1; // Points into t1 current context public: WordModel(): word1(0), word0(0), ww1(0), ww0(0), c1(0), c0(1), c0h(1), cp0(&t1[0]), cp1(&t1[0]) {} void predict(int& n0, int& n1) const { const int wt0=(ww0+1)*(ww0+1); n0+=cp0->get0()*wt0; n1+=cp0->get1()*wt0; const int wt1=(ww0+ww1+2)*(ww0+ww1+2); n0+=cp1->get0()*wt1; n1+=cp1->get1()*wt1; } void update(int y) { cp0->add(y); cp1->add(y); // Update contexts with next bit c0+=c0+y; c0h+=c0h+y; if (c0h>=59) c0h-=59; if (c0>=256) { c0-=256; if (isalpha(c0)) { word0=(word0+tolower(c0)+1)*234577751*16; if (++ww0>8) ww0=8; } else if (word0) { ww1=ww0; ww0=0; word1=word0; word0=0; } c1=c0; c0=1; c0h=1; } // Set up pointers to new contexts U32 h=word0*123456791+c1*345689647+c0h+(c0<<24); cp0=&t1[h]; cp1=&t1[word1+h]; } }; /* CyclicModel models data with fixed length records such as tables, databases, images, audio, binary numeric data, etc. It models runs of 0s or 1s in bit columns. A table is detected when an 8-bit pattern occurs 4 or more times in a row spaced by the same interval. The table ends after no repetition detection for the same number of bits as were in the table. */ class CyclicModel: public Model { struct E { int p, n, r; // Position of last match, number of matches, interval E(): p(0), n(0), r(0) {} }; vector cpos; // Table of repeat patterns by char int pos; // Current bit position in input int c0; // Last 8 bits int cycle; // Most likely number of columns int column; // Column number, 0 to cycle-1 int size; // Number of bits before the table expires, 0 to 3*cycle Hashtable t; // Context is last 8 bits in column vector t1; // Context is the column number only Counter *cp, *cp1; // Points to t, t1 public: CyclicModel(): cpos(256), pos(0), c0(1), cycle(0), column(0), size(0), t1(2048), cp(&t[0]), cp1(&t1[0]) {} void predict(int& n0, int& n1) const { if (cycle>0) { int wt=16; n0+=cp->get0()*wt; n1+=cp->get1()*wt; wt=4; n0+=cp1->get0()*wt; n1+=cp1->get1()*wt; } } void update(int y) { if (++column>=cycle) column=0; if (size>0 && --size==0) cycle=0; cp->add(y); cp1->add(y); c0=(c0+c0+y)&0xff; ++pos; E& e=cpos[c0]; if (e.p+e.r==pos) { ++e.n; if (e.n>3 && e.r>8 && e.r*e.n>size) { size=e.r*e.n; if (cycle!=e.r) { cycle=e.r; column=pos%cycle; } } } else { e.n=1; e.r=pos-e.p; } e.p=pos; int h=column*3+c0*876546821; cp=&t[h]; cp1=&t1[column&2047]; } }; /* A Predictor predicts the next bit given the bits so far using a collection of models. Methods: p() returns probability of a 1 being the next bit, P(y = 1) as a 16 bit number (0 to 64K-1). update(y) updates the models with bit y (0 or 1) */ class Predictor { NonstationaryPPM m1; MatchModel m2; WordModel m3; CyclicModel m4; public: U16 p() const { int n0=1, n1=n0; m1.predict(n0, n1); m2.predict(n0, n1); m3.predict(n0, n1); m4.predict(n0, n1); return U16(65535.0*n1/(n0+n1)); } void update(int y) { m1.update(y); m2.update(y); m3.update(y); m4.update(y); } }; /* An Encoder does arithmetic encoding. Methods: Encoder(COMPRESS, f) creates encoder for compression to archive f, which must be open for writing in binary mode Encoder(DECOMPRESS, f) creates encoder for decompression from archive f, which must be open for reading in binary mode encode(bit) in COMPRESS mode compresses bit to file f. encode() in DECOMPRESS mode returns the next decompressed bit from file f. print() prints compression statistics */ typedef enum {COMPRESS, DECOMPRESS} Mode; class Encoder { private: Predictor predictor; const Mode mode; // Compress or decompress? FILE* archive; // Compressed data file U32 x1, x2; // Range, initially [0, 1), scaled by 2^32 U32 x; // Last 4 input bytes of archive. int eofs; // Number of EOF's read from data long xchars; // Number of bytes of compressed data long encodes; // Number of bits of uncompressed data int start_time; // Clocks at start of compression long total_encodes; // Sum of encodes int total_time; // Sum of compression times public: Encoder(Mode m, FILE* f); int encode(int bit=0); void print(); ~Encoder(); }; // Constructor Encoder::Encoder(Mode m, FILE* f): predictor(), mode(m), archive(f), x1(0), x2(0xffffffff), x(0), eofs(0), xchars(0), encodes(0), start_time(0), total_encodes(0), total_time(0) { start_time=clock(); // In DECOMPRESS mode, initialize x to the first 4 bytes of the archive if (mode==DECOMPRESS) { for (int i=0; i<4; ++i) { int c=getc(archive); if (c==EOF) { c=0; ++eofs; } else ++xchars; x=(x<<8)+c; } } } /* encode(bit) -- Split the range [x1, x2] at x in proportion to predictor P(y = 1). In COMPRESS mode, make the lower or upper subrange the new range according to y. In DECOMPRESS mode, return 0 or 1 according to which subrange x is in, and make this the new range. Maintain x1 <= x <= x2 as the last 4 bytes of compressed data: In COMPRESS mode, write the leading bytes of x2 that match x1. In DECOMPRESS mode, shift out these bytes and shift in an equal number of bytes into x from the archive. */ int Encoder::encode(int y) { ++encodes; // Split the range U32 p=65535-predictor.p(); // Probability P(0) * 64K rounded down const U32 xdiff=x2-x1; U32 xmid=x1; // = x1+p*(x2-x1) multiply without overflow, round down if (xdiff>=0x10000000) xmid+=(xdiff>>16)*p; else if (xdiff>=0x1000000) xmid+=((xdiff>>12)*p)>>4; else if (xdiff>=0x100000) xmid+=((xdiff>>8)*p)>>8; else if (xdiff>=0x10000) xmid+=((xdiff>>4)*p)>>12; else xmid+=(xdiff*p)>>16; // Update the range if (mode==COMPRESS) { if (y) x1=xmid+1; else x2=xmid; } else { if (x<=xmid) { y=0; x2=xmid; } else { y=1; x1=xmid+1; } } predictor.update(y); // Shift equal MSB's out while (((x1^x2)&0xff000000)==0) { if (mode==COMPRESS) { putc(x2>>24, archive); ++xchars; } x1<<=8; x2=(x2<<8)+255; if (mode==DECOMPRESS) { int c=getc(archive); if (c==EOF) { c=0; if (++eofs>5) { printf("Premature end of archive\n"); print(); exit(1); } } else ++xchars; x=(x<<8)+c; } } return y; } // Destructor Encoder::~Encoder() { // In COMPRESS mode, write out the remaining bytes of x, x1 < x < x2 if (mode==COMPRESS) { while (((x1^x2)&0xff000000)==0) { putc(x2>>24, archive); x1<<=8; x2=(x2<<8)+255; } putc(x2>>24, archive); // First unequal byte } if (total_encodes>0) { long total_xchars=ftell(archive); printf("%ld/%ld = %6.4f bpc (%4.2f%%) in %1.2f sec\n", total_xchars, total_encodes/8, total_xchars*64.0/total_encodes, total_xchars*800.0/total_encodes, double(total_time)/CLOCKS_PER_SEC); } } // Print Encoder stats void Encoder::print() { int now=clock(); if (encodes>0) printf("%ld/%ld = %6.4f bpc (%4.2f%%) in %1.2f sec\n", xchars, encodes/8, xchars*64.0/encodes, xchars*800.0/encodes, double(now-start_time)/CLOCKS_PER_SEC); else printf("0 bytes\n"); total_time+=now-start_time; start_time=now; total_encodes+=encodes; encodes=0; xchars=0; } // Read one byte from encoder e int decompress(Encoder& e) { // Decompress 8 bits, MSB first int c=0; for (int i=0; i<8; ++i) c=(c<<1)+e.encode(); return c; } // Write one byte c to encoder e void compress(Encoder& e, int c) { for (int i=0; i<8; ++i) { // Compress 8 bits, MSB first e.encode((c&128)?1:0); c<<=1; } } // Fail if out of memory void handler() { printf("Out of memory\n"); exit(1); } // Read and return a line of input from FILE f (default stdin) up to // first control character. Skips CR in CR LF. string getline(FILE* f=stdin) { int c; string result=""; while ((c=getc(f))!=EOF && c>=32) result+=char(c); if (c=='\r') (void) getc(f); return result; } // User interface int main(int argc, char** argv) { clock(); set_new_handler(handler); // Check arguments if (argc<2) { printf( "PAQ1 file compressor/archiver, (C) 2002, Matt Mahoney, mmahoney@cs.fit.edu\n" "This program is free software distributed without warranty under the terms\n" "of the GNU General Public License, see http://www.gnu.org/licenses/gpl.txt\n" "\n" "To compress: PAQ1 archive filenames... (archive will be created)\n" " or (MSDOS): dir/b | PAQ1 archive (reads file names from input)\n" "To extract/compare: PAQ1 archive (does not clobber existing files)\n" "To view contents: more < archive\n"); return 1; } // File names and sizes from input or archive vector filename; // List of names vector filesize; // Size or -1 if error // Extract files FILE* archive=fopen(argv[1], "rb"); if (archive) { if (argc>2) { printf("File %s already exists\n", argv[1]); return 1; } printf("Extracting archive %s ...\n", argv[1]); // Read "PAQ1\r\n" at start of archive if (getline(archive) != "PAQ1") { printf("Archive file %s not in PAQ1 format\n", argv[1]); return 1; } // Read "size filename" in "%10d %s\r\n" format while (true) { string s=getline(archive); if (s.size()>10) { filesize.push_back(atol(s.c_str())); filename.push_back(s.substr(11)); } else break; } // Test end of header for "\f\0" { int c1=0, c2=0; if ((c1=getc(archive))!='\f' || (c2=getc(archive))!=0) { printf("%s: Bad PAQ1 header format %d %d\n", argv[1], c1, c2); return 1; } } // Extract files from archive data Encoder e(DECOMPRESS, archive); for (int i=0; i2) for (int i=2; i=0) fprintf(archive, "%10ld %s\r\n", filesize[i], filename[i].c_str()); } putc(032, archive); putc('\f', archive); putc(0, archive); // Write data Encoder e(COMPRESS, archive); for (int i=0; i=0) { printf("%s: ", filename[i].c_str()); FILE* f=fopen(filename[i].c_str(), "rb"); int c; for (long j=0; j