Better Archiver with Recursive Functionality (BARF)

The BARF compressor will compress any nonempty file by at least one byte. Thus, by compressing already compressed files over and over again, it is possible to eventually reduce any file to 0 bytes.

BARF has been tested on the Calgary corpus, a well known benchmark. In just one pass, it achieves the best known result of any compressor, compressing all 14 files to 1 byte each. Run time is under 1 second on a 750 MHz PC. Of course these files can be compressed again down to 0 bytes each, just like any other file. (Yes, they can be decompressed correctly).

BARF is free, open source software, released under the GNU GPL.

Download BARF! (Windows executable, about 1 MB).

Source code:
barf.cpp (includes compilation insructions)
mkstring.cpp

Using BARF

BARF will compress or decompress any number of files at one time. It should be run from an MSDOS box in Windows, or from a UNIX/Linux shell.

  barf c file1 file2 ...      (Compresses, adds a .x or .x?? extension)
  barf d file1.x file2.x ...  (Decompresses, removes the extension)

If there are no filenames on the command line, then BARF will read them standard input. For example, to compress every file in the current directory (in Windows):

  dir/b | barf c              (Compress all files)
  dir/b | barf d              (Decompress all files)

During compression, BARF adds an extension to the file name and deletes the original file, similar to the way COMPRESS or GZIP adds a .Z or .gz extension. Decompression works in reverse, removing the extension and restoring the file to its original name, size, and contents.

The Random Data Compression Challenge

The challenge is to compress one million random digits (415,241 bytes) and to write a decompressor, which together with the compressed file, is smaller than the original data. The decompressor (either source or executable) may itself be compressed.

BARFER (BARF for Extracting Random data) meets the challenge. The decompressor is in source form. The data file and two source code files below are compressed with barfer.exe (Windows executable).

random.bin.x1b.x5z.x2l (415,238 bytes)
barf.cpp.x (1 byte) (in IE, right click to download)
mkstring.cpp.x (1 byte)

To decompress, restoring and verifying random.bin (AMillionRandomDigits.bin)

  barfer d random.bin.x1b.x5z.x2l
  barfer d random.bin.x1b.x5z
  barfer d random.bin.x1b
  fc random.bin AMillionRandomDigits.bin

The two source code files are exactly the same as barf.cpp and mkstring.cpp above, but the compilation steps are different. I created barfer.exe using Borland C++ 5.5.1 and UPX 1.2.4w as follows.

  bcc32 mkstring.cpp                       
  dir/b *.cpp | mkstring >barf.h
  bcc32 barf.cpp
  upx barf.exe
  rename barf.exe barfer.exe

To verify that barf.cpp.x and mkstring.cpp.x are indeed the compressed source files for barfer.exe, you may create barfer.exe from the original sources using any C++ compiler, then decompress the .x files and compare with the original sources. For example, in UNIX: (I haven't tested this):

  g++ mkstring.cpp -o mkstring
  ls barf.cpp mkstring.cpp | ./mkstring >barf.h
  g++ barf.cpp -o barfer
  ./barfer d random.bin.x1b.x5z.x2l
  ./barfer d random.bin.x1b.x5z
  ./barfer d random.bin.x1b
  diff random.bin AMillionRandomDigits.bin      (should compare the same)
  mv barf.cpp.x temp_barf.cpp.x
  mv mkstring.cpp.x temp_mkstring.cpp.x
  ./barfer d *.x
  diff barf.cpp temp_barf.cpp                   (should compare the same)
  diff mkstring.cpp temp_mkstring.cpp           (should compare the same)

Even if you don't buy the argument that you can compress the decompressor, then you could still use barfer.exe (39,936 bytes) to recursively compress random.bin 39,937 times so that the total size is still one byte less than the original.

If you just want to compress the file and not worry about the size of the compressor, then just use the Better Archiver with Recursive Functionality with Extreme Squeezing Technology (BARFEST). It compresses AMillionRandomDigits.bin to 1 byte in 1 pass or 0 bytes in 2 passes. Windows executable: barfest.exe.

The BARF Principle

Note that we can combine different compression algorithms to achieve results better than either one alone. For example, a program could try both COMPRESS and GZIP and choose the smaller file (either file.Z or file.gz) and never do worse than either program alone. BARF extends this idea by trying 257 different algorithms and choosing the best one, with remarkable results.

Recursive compression has long been thought to be impractical. The pigeonhole principle states that it is impossible for a single algorithm to compress all messages of n bits or more, no matter what n is. This is because there are 2ⁿ possible messages of n bits, but only 2ⁿ-1 possible encodings of n-1 or fewer bits. This means that at least two messages must code to the same encoding. Such an encoding could not be decompressed unambiguously. To avoid this, some messages must inevitably "compress" to an equal or larger size than the original. This limitation has stymied the development of recursive file compression technology because we eventually reach a point at which the "compressed" file is not any smaller.

However, the pigeonhole principle does not apply to a set of algorithms. BARF solves the recursive compression problem by using multiple algorithms arranged in such a way that every nonempty file can be compressed by at least one of them. By choosing the best algorithm, it is guaranteed that every nonempty file can compressed by at least one byte. By repeated compression, any file can be eventually compressed to 0 bytes.

BARF Compressed File Format

BARF tries 257 compression algorithms, numbered 1 through 257, then picks the best one. Files compressed by method 1 have a .x extension. This algorithm is:

  if the input is the i'th (1-14) file of the Calgary corpus
    then encode it as a 1 byte file with byte value i-1
  else
    compress using a byte oriented LZ77 code with a 1 byte header of 255:
      Byte x = 0-31 means that the next x bytes represent themselves
      Byte x = 32-255 represents a copy of the two bytes from x-32 places back

BARF tries this method first. If the result is not smaller than the original file, then it applies one of the remaining methods 2-257. Method n is as follows:

  if the first byte is n-2 then remove it

Method n (2-257) is indicated by an extension of the form .x[0-9][a-z] as a base 26 encoding of n-2. For example, method 2 (.x0a) removes a leading 0 byte, method 3 (.x0b) removes a leading 1 byte, up through method 257 (.x9v) which removes a leading 255 byte.

Although BARF does not compress every file to one byte on the first pass, it can be easily modified to do this on any benchmark of up to 255 files. Instructions are found in the source code.

Disclaimer

Of course, a properly designed test, such as the Calgary corpus compression challenge is not susceptable to such tricks. The decompression program (about 1 MB for barf.exe) is added to the size of the compressed files, and the file names (which can get very long) also have to be stored by packing everything into an archive.

Last updated Sept. 27, 2003 by Matt Mahoney