35 Language Models

Character models        Corpus  Size    H (bpc) Reference
LZ (compress)           book1   731Kb   3.151   compress 4.3d (1990)
LZ (zip)                book1   731Kb   2.956   PKZIP 2.04e (1993)
LZ (gzip -9)            book1   731Kb   2.921   gzip 1.2.4 (1993)
PPMC5 (ha -a2)          book1   731Kb   2.141   ha 0.98, Hirvola 1993
BW (szip)               book1   731Kb   2.102   szip 1.05x, Schindler 1997, 1998
Neural net, 2 layer     book1   731Kb   2.062   Mahoney, 1999
PPMZ (boa -m15)         book1   731Kb   1.962   boa 0.58b, Sutton 1998
PPMZ (rkive)            book1   731Kb   1.943   rkive 1.91b1, Taylor 1998
BW (szip)               book1   768Kb   2.345   szip 1.05x
BW                      book1   768Kb   2.49    Burrows, Wheeler, 1994
                        Hector  103Mb   2.01 
PPM*                    book1   768Kb   2.40    Cleary, Teahan, Witten, 1995
Neural net, 3 layer     Münchner 600Kb  2.89    Schmidhuber, Heil, 1996
PPM5                    Malone  46Kb    2.402   Teahan, Cleary, 1996
                        Malone  6.6Mb   1.598 
PPM5+bigrams            Malone  6.6Mb   1.488   Teahan, Cleary, 1997
PPM5                    WSJ     15.4Mb  1.602   Teahan, Cleary, 1997
Symbol ranking          Calgary 3.1Mb   3.1     Fenwick, 1997
							
Lexical Models          Corpus  Size    H       Reference
WDLZW                   text?   62Kb    2.88    Jiang, Jones, 1992
Bigram                  LOB     6Mb     2.104   Ney, Essen, Kneser, 1995
Trigram                 WSJ     250Mb   1.325   Kneser, Ney, 1995
Trigram                 WSJ     250Mb   1.341   Seymore, Rosenfeld, 1996
5-gram scaled           NAB     1.32Gb  1.301   Kneser, 1996
n-gram+phrases          SWB     11Mb    1.226*  Ries, Buo, Waibel, 1996
4-gram scaled           WSJ     250Mb   1.284   Ristad, Thomas, 1997
Trigram                 BNC     550Mb   1.398   Clarkson, Robinson, 1997
Trigram+distant bigrams WSJ     25Mb    1.437   Martin, Ney, Zaplo, 1999
							
Semantic models         Corpus  Size    H       Reference
Bigram+topic+cache      LOB     6Mb     2.028   Kneser, Steinbiss, 1993
Phrase bigrams+5 topics RR      330Kb   0.823*  Giachin, 1995
Func. & content trigram Verbmobil 281Kb 1.927*  Geutner, 1996
Trigram+triggers        WSJ     210Mb   1.221   Rosenfeld, 1996
Trigram+triggers        WSJ     25Mb    1.446   Simons, Ney, Martin, 1997
Trigram+topic+cache     BNC     550Mb   1.303   Clarkson, Robinson, 1997
Bigram+LSA              WSJ     250Mb   1.325   Bellegarda, 1998
Trigram+cache+topic (IR) WSJ    210Mb   1.283   Mahajan, Beeferman, Huang, 1999
Trigram+topic           SWB     11.5Mb  1.211*  Khudanpur, Wu, 1999
							
Syntactic models        Corpus  Size    H       Reference
Trigram+POS             Office  137Mb   1.702   Jelinek, Mercer, Roukos, 1990
Tagged                  Malone  6.6Mb   1.433   Teahan, Cleary, 1998
                        WSJ     5.63Mb  1.490 
							
Human models            Corpus  Size    H       Reference
Character ranking       Malone  ?       0.6-1.3 Shannon, 1950
Gambling                Malone  ?       < 1.3   Cover, King, 1978
Gambling                Malay   ?       < 1.3   Tan, 1981

*Speech (excluded)
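
For the compressor rows, an entropy figure in bits per character (bpc) is the compressed size in bits divided by the number of input characters. The sketch below shows that arithmetic. It is an illustration only: the function names are hypothetical, Python's zlib stands in for a DEFLATE-style compressor (it is not the gzip 1.2.4 program in the table), and the sample text is made up. The perplexity conversion is one common convention and an assumption here, since the slide does not state how the word-model figures were derived.

```python
import math
import zlib


def bpc_from_sizes(original_bytes: int, compressed_bytes: int) -> float:
    """Bits per character: compressed bits divided by input characters.

    Assumes one byte per character, as for plain ASCII text.
    """
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return 8 * compressed_bytes / original_bytes


def bpc_with_zlib(text: bytes, level: int = 9) -> float:
    """Estimate bpc by compressing text with zlib (a stand-in compressor)."""
    compressed = zlib.compress(text, level)
    return bpc_from_sizes(len(text), len(compressed))


def bpc_from_perplexity(perplexity: float, chars_per_word: float) -> float:
    """Convert word-level perplexity to bpc.

    Assumption: bits per word = log2(perplexity), spread evenly over the
    average number of characters per word (including the space).
    """
    return math.log2(perplexity) / chars_per_word


if __name__ == "__main__":
    # Made-up, highly repetitive sample; real text compresses far less well.
    sample = b"the quick brown fox jumps over the lazy dog " * 200
    print(f"zlib estimate: {bpc_with_zlib(sample):.3f} bpc")
```

A lower bpc means the model predicts the text better; uncompressed 8-bit text sits at 8 bpc by this measure.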