Rules for the Large Text Compression Benchmark

Matt Mahoney

Rules may change at any time to meet the goals of fairness, accuracy, maximizing public participation, and recognizing existing practice. Note that the rules for listing in the benchmark and for the Hutter Prize are different.

Rules for Benchmark Listing

Last update: Jan. 31, 2008.

All results must be subject to public verification. Eligible compression programs must be available on the Internet for free download and testing. Commercial programs with a free trial period of 7 days or more are allowed. Programs that require personal information such as name or email address before they can be downloaded or used are not considered free. Extensions to existing programs such as GUI wrappers that do not change the compressed format are not eligible. Programs or versions withdrawn by the author are not eligible. Programs violating licenses of other programs are not eligible. Patented algorithms are allowed. At my discretion I may list ineligible results anyway with appropriate caveats.

Compression programs will be ranked by the compressed size of enwik9 plus the size of a zip archive (readable by unzip) containing the decompressor and any other files needed by the decompressor at run time (dictionaries, configuration files, .dll files not normally part of Windows, etc). The archive may contain either an executable program or source code in any general purpose programming language, whichever is smaller.
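As an illustration of the size used for ranking, here is a minimal Python sketch. It is not the procedure actually used for the benchmark: the function names and the in-memory zip construction are assumptions made for the example, and a zip built with another tool or compression level will differ in size by some bytes.

```python
import io
import zipfile
from typing import Mapping


def zip_size(files: Mapping[str, bytes]) -> int:
    """Size in bytes of a deflate-compressed zip archive holding the given files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


def ranking_size(compressed_enwik9_size: int,
                 executable_files: Mapping[str, bytes],
                 source_files: Mapping[str, bytes]) -> int:
    """Compressed size of enwik9 plus the smaller of the two decompressor archives.

    Each mapping should include every file the decompressor needs at run time
    (dictionaries, configuration files, and so on).
    """
    decompressor_size = min(zip_size(executable_files), zip_size(source_files))
    return compressed_enwik9_size + decompressor_size
```

Either archive may be submitted, so the sketch simply takes the smaller of the two.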

Only the version and combination of options achieving the best known compression for each program will appear in the ranked results. Other results may appear in the individual program descriptions. Two differently-named programs are considered different versions of the same program if they are by the same author and use the same underlying algorithm (LZ77, BWT, PPM, CM, etc).
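The selection of one entry per program can be sketched as follows. This is only an illustration: the tuple layout and the function name are assumptions, and deciding whether two differently-named programs count as the same program is left to the caller, who must give them the same program key.

```python
def ranked_results(results):
    """Keep the smallest result for each program, then sort by total size.

    results: iterable of (program, options, total_size) tuples, where
    total_size already includes the decompressor archive.
    """
    best = {}
    for program, options, size in results:
        if program not in best or size < best[program][1]:
            best[program] = (options, size)
    rows = [(program, options, size) for program, (options, size) in best.items()]
    return sorted(rows, key=lambda row: row[2])
```

Results dropped here would still be candidates for the individual program descriptions.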

The decompressor must be able to run without a network connection. The decompressor must run without selecting options that affect the contents of the uncompressed file, whether these options are passed on the command line, selected using a GUI, or from environment variables, configuration files, the Windows registry, or any other source that must be configured by the user or is set during compression. Changing the name or attributes of the compressed file (other than its contents) must not affect the contents after decompression. Most programs meet these requirements. If not, the length of a string containing any required settings will be added to the compressed size (e.g. epm).
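The adjustment for required decompressor settings amounts to a simple addition, sketched below. The function name is hypothetical, and counting the settings string in UTF-8 bytes is an assumption; the rule only says that the length of the string is added.

```python
def adjusted_size(compressed_size: int, required_settings: str = "") -> int:
    """Add the length of any settings the decompressor needs to the compressed size.

    The length is counted in UTF-8 bytes here, which is an assumption.
    """
    return compressed_size + len(required_settings.encode("utf-8"))
```

A program that decompresses with no extra settings passes an empty string and its size is unchanged.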

Compressors and decompressors do not have to be general purpose. They may be tuned specifically to this benchmark and are allowed to reject or fail on any input other than enwik9. However, the test hardware, operating system, compiler, and programming language implementing the decompressor must be general purpose, available to the public, and not specifically designed to improve the ranking on this benchmark. (A Win32 or Linux executable or C/C++ program meets this requirement.)

Anyone may submit results to this benchmark by emailing me at mmahoney(at)cs.fit.edu or matmahoney(at)yahoo.com. I will acknowledge your contribution. If possible, please send me:

I appreciate any information you send me, even if not complete. There is no restriction on compression/decompression time, memory, or disk space. However, your results will be comparable to mine if you select options limiting memory usage to 1800 MB.

History

May 10 2006 - Benchmark started.
Aug 06 2006 - Added rules for Hutter Prize.
Aug 20 2006 - Rules for Hutter Prize moved to separate website.
Jul 30 2007 - Memory limit upgraded from 800 MB to 1700 MB.
Oct 28 2007 - Added rule that software must be published for at least 30 days.
Jan 31 2008 - Repealed 30 day wait.