Bibliography

Aberg, J., Yu. M. Shtarkov, B. J. M. Smeets (1997), "Estimation of escape probabilities for PPM based on universal source coding theory", Proc. Intl. Symposium on Information Theory, 65. PPME improves on PPMD by adaptively estimating parameters for escape modeling (a la PPMZ).

Abramson, Norman (1963), Information Theory and Coding, New York: McGraw-Hill. Defines information, sources, Markov processes, uniquely decodable and instantaneous codes, channels, entropy. The Kraft inequality describes the minimum length for a uniquely decodable code (proved by McMillan). Shannon's first theorem (the noiseless coding theorem), H(S) ≤ Ln/n < H(S) + 1/n, bounds the average code length Ln per symbol for blocks of n symbols from S below by the entropy H(S). Shannon's second theorem says that a code can be selected to send up to H(C) bits/second over a noisy channel with capacity C with arbitrarily small error.
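
As a quick numeric illustration of the first theorem (my own sketch, not from Abramson), assign each block x of n symbols a Shannon code length of ceil(log2 1/P(x)); the resulting average length per symbol lands between H(S) and H(S) + 1/n for a toy memoryless source:

# Sketch (not from the reference): Shannon code lengths on blocks of an
# i.i.d. source satisfy H(S) <= Ln/n < H(S) + 1/n.
from itertools import product
from math import log2, ceil, prod

P = {'a': 0.9, 'b': 0.1}                    # toy memoryless source S
H = -sum(p * log2(p) for p in P.values())   # entropy H(S) in bits/symbol

for n in (1, 2, 4, 8):
    Ln = sum(prod(P[c] for c in block) * ceil(log2(1 / prod(P[c] for c in block)))
             for block in product(P, repeat=n))   # average bits per block
    print(f"n={n}  H={H:.3f}  Ln/n={Ln/n:.3f}  H+1/n={H + 1/n:.3f}")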

Ackley, David H., Geoffrey E. Hinton, and Terrence J. Sejnowski (1985), "A learning algorithm for Boltzmann machines", Cognitive Science (9) pp. 147-169. Symmetric neural network associative memory converges to global minimum energy by simulated annealing - gradually reducing random behavior of binary neurons. Can learn internal representations.

Alaoui Mounir, S., Goharian, N., Mahoney, M., Salem, A., Frieder, O. (1998), "Fusion of Information Retrieval Engines (FIRE)", Proceedings of the Conference on Parallel and Distributed Techniques and Applications, Las Vegas NV.

Alta-Vista (1998), http://www.altavista.digital.com (May 28, 1998). Language translation between English and 5 other languages. "The spirit is willing, but the flesh is weak" translated to Spanish and back, becomes "The alcohol is arranged, but the meat is weak."

Anderson, James A., (1983) "Cognitive and Psychological Computation with Neural Models", IEEE Transactions on Systems, Man, and Cybernetics 13(5) Sept./Oct., pp. 799-815. Overview of neural networks.

Armer, Paul (1960), "Attitudes toward Intelligent Machines", Symposium on Bionics, WADD Technical Report 60-600, pp. 13-19, reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman eds, New York: McGraw Hill, 1963. Controversy over whether machines can think. Soviet interest in AI.

Ashby, W. Ross (1960), Design for a Brain, 2nd Ed., London: Wiley. Describes the homeostat, an electromechanical artificial neural network with 4 neurons. A system is ultrastable if it adapts by remaining stable in a changing environment. A homeostat adapts by making random changes to its synapses in response to a neuron being driven to its limits, a model of negative reinforcement.

Bahl, L. R., et al. (1989), "Large vocabulary natural language continuous speech recognition", IEEE ICASSP, Vol. 1, 465-567. 5000 word vocabulary, 11% WER, perplexity 93 on read office correspondence.

Barto, Andrew G., Richard S. Sutton, and Charles W. Anderson (1983), "Neuronlike adaptive elements that can solve difficult learning control problems", IEEE Transactions on Systems, Man, and Cybernetics SMC-13 pp. 834-846. 2-neuron network models balancing pole on cart using reinforcement learning.

Beardsley, Tim (1999), "Getting Wired -- New observations may show how neurons form connections", Scientific American 280(6) June, 24-25. Dendrites in rat hippocampus grow filopodia 25 minutes after electrical stimulation.

Bell, Timothy, Ian H. Witten, John G. Cleary (1989), "Modeling for Text Compression", ACM Computing Surveys (21)4, pp. 557-591, Dec. 1989. The most common data compression algorithms used are Ziv-Lempel (LZ), in which duplicate substrings are replaced with pointers to the previous occurrences, and predictive, in which a character with probability p is assigned a code of length log 1/p. Arithmetic encoding allows fractional bit lengths. Dictionary methods such as LZ were proven equivalent to predictive methods. The best algorithm on the Calgary corpus is PPMC, 2.48 bits/character. PPMC is prediction by partial match, predicting the next character using the longest matching context (preceding characters), method C, where the code space for all unpredicted characters at a given context size is equal to a single predicted character.
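
As a toy illustration of the PPM idea (my own sketch, much simpler than PPMC: no exclusions, no arithmetic coder, byte-uniform fallback), predict each character from the longest matching context and escape to shorter contexts for novel characters, charging log2 1/p bits; the escape estimator is method C as described above:

# Toy PPM sketch.  Method C: in a context with t total counts and d distinct
# characters, a seen character c gets p = count(c)/(t+d), escape gets d/(t+d).
from collections import defaultdict
from math import log2

N = 3                                             # maximum context order
counts = defaultdict(lambda: defaultdict(int))    # context string -> char -> count

def code_length(context, c):
    # Bits to code c, escaping from the longest context down to order 0,
    # then falling back to a uniform distribution over 256 byte values.
    bits = 0.0
    for k in range(min(N, len(context)), -1, -1):
        seen = counts[context[len(context) - k:]]
        t, d = sum(seen.values()), len(seen)
        if t and c in seen:
            return bits + log2((t + d) / seen[c])
        if t:
            bits += log2((t + d) / d)             # escape cost
    return bits + log2(256)

def update(context, c):
    for k in range(min(N, len(context)) + 1):
        counts[context[len(context) - k:]][c] += 1

text = "the cat sat on the mat"
total = 0.0
for i, c in enumerate(text):
    total += code_length(text[:i], c)
    update(text[:i], c)
print(f"{total / len(text):.2f} bits/char")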

Bell, Timothy (1998), Canterbury Corpus, http://corpus.canterbury.ac.nz/ On large text files, szip is best:

File          Category                                  Size (Bytes)
---------------------------------------------------------------------
E.Coli        Complete genome of the E. Coli bacterium  4638690
bible.txt     The King James version of the bible       4047392
world192.txt  The CIA world fact book                   2473400

Test file (bits/char)   E.Coli  bible.txt  world192.txt  Weighted  Average
szip-b                  2.06    1.53       1.40          1.72      1.66

Bellegarda, Jerome R., John W. Butzberger, Yen-Lu Chow, Noah B. Coccaro, Devang Naik (1996), "A novel word clustering algorithm based on latent semantic analysis", Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, 172-175. 20,000 words (-300 stop terms) x 17,500 documents of the NAB corpus used to project word-word matrix by LSA into 125 dimensions (best=100-200), then into 2000 clusters. 2 examples: (1): abstract, art, artist, artist's, canvas, curator, decorative, drawings, exhibit, exhibition, exhibitions, gallery, galleries, Gogh, Henri, museum, museum's, museums, painted, painter, painters, painting, paintings, photographs, Picasso, poems, Pollock, Pons, portraits, retrospective, Revere, sketches. (2): appeal, appeals, appellate, argued, arguments, attorney's, circuit, confessed, count, courts, criminal, decide, decision, indict, indictments, judge, judge's, judges, leniency, misdemeanor, office's, prosecuted, prosecution, prosecutions, overturned, prosecutor, prosecutorial, prosecutors, ruled, ruling, rulings, witness.

Bellegarda, Jerome R. (1998), "Exploiting both local and global constraints for multi-span statistical language modeling", IEEE Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, vol. 2, 677-680. WSJ0 (1987-89, 87K documents, 42M words, vocabulary-23K): bigram perplexity=215, trigram=142, bigram+LSA (125 dimensions)=147, word cluster size K=100: 106, document cluster size L=1: 116, K=100, L=1: 102.

Bellegarda, Jerome R. (1999), "Speech recognition experiments using multi-span statistical language models", IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 717-720. Bigram+LSA improves speech recognition by 17% over bigram model.

Bimbot, F., M. El-Beze, M. Jardino (1997), "An alternative scheme for perplexity estimation", Proc. IEEE ICASSP, Vol. 2., 1483-1486. Automated Shannon game using random sampling outperforms continuous sampling.

Bloom, Charles, ppmz v9.1 (1997), http://www.cco.caltech.edu/~bloom/src/ppmz.html (Sept 28, 1998). Data compression program (did not decompress files properly on my 486DX/Windows 95 machine). Weighted average on Calgary corpus: 1.9355 bpc.

Bloom, Charles (1998), "Solving the Problems of Context Modeling", http://www.cco.caltech.edu/~bloom/papers/ppmz.zip (Sept 28 1998). Describes the PPMZ data compression program, claims to be the best with 2.119 bits/character (unweighted) on the Calgary corpus.

Bordel, G., I. Torres, E. Vidal (1995), "QWI: A method for improved smoothing in language modelling", IEEE ICASSP, Vol. 1, 185-188.

Borko, Harold (1967), Automated Language Processing, The State of the Art, New York: Wiley. D. G. Bobrow cites 72 references prior to 1965 on attempts to define a grammar for English. Also, techniques for automated abstraction, information retrieval, and language translation, notably the English-Russian translation project of 1958-61 which was ultimately abandoned.

Bower, B. (1998a), "Speech Insights Sound Off in the Brain", Science News, 155(5), 68. People use variations in word duration, tone, inflection, and pauses to help parse speech into phrases. Brain waves recorded through an array of 56 electrodes show a distinct pattern indicating word confusion; the pattern also appears in difficult-to-parse sentences when inflectional cues are removed. Another pattern indicates the end of a phrase.

Bower, B. (1998b), "Learning to make, keep adult neurons", Science News 155(11), 170, from E. Gould at Princeton, Mar. 1999 Nature Neuroscience. Learning tasks known to require an intact hippocampus in mature rats stimulate growth of new neurons in the hippocampus. Such tasks include classical conditioning (CS = noise, US = shock to eyelid, R = blink), and finding and remembering location of underwater platform in water filled maze. Tasks not requiring hippocampus: simultaneous CS and US, swimming to visible platform.

Brause, Rdiger W., (1992), The minimum entropy network, Proc. Intl. Conf. on Tools with AI, 85-92.

Brill, E., and Mooney, R. J. (1997), "An overview of Empirical Natural Language Processing", AI Magazine, 18(4), 13-24. There is a trend toward empirical techniques using a tagged or untagged corpus and machine learning (instead of hand-coded rules) in NLP tasks such as speech recognition, syntactic parsing, semantic processing, information extraction, and language translation.

Bobrow, Danny G., and J. Bruce Fraser (1969), "An augmented state transition network analysis procedure", IJCAI-69, pp. 557-567.

Brooks Jr., Frederick P. (1975), The Mythical Man-Month, Reading MA: Addison-Wesley. Software development cost about 10 lines/day in the 1960's and 70's.

Brooks, R. A. (1991), "Intelligence Without Representation", Artificial Intelligence Journal (47), pp. 139-159. http://www.ai.mit.edu/people/brooks/paperlist.html (Oct 12, 1998). Systems (in particular, robots) can be developed incrementally when subdivided into activities that have full access to inputs and outputs. In such systems, there is no model of knowledge representation; instead knowledge is divided among activities. Brooks argues that traditional AI subdivisions into input, knowledge (or symbolic processing), and output modules result in unrealistic expectations about the interfaces between modules, leading to systems that fail when integrated.

Burrows, M., and D. J. Wheeler (1994), A Block-sorting Lossless Data Compression Algorithm, Digital Systems Research Center, http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-124.html (Oct 30 1998). A Burrows-Wheeler transformation on a block of text sorts all rotations of the text and outputs the last character of the sorted rotations plus the position of the original string. From this, the original string can be recovered (a sketch follows the block-size table below). The output string is highly compressible because it is correlated with the sorted suffixes that follow. Calgary corpus = 2.1804 bits/char weighted, 2.43 mean, compared to compress = 3.53 mean, gzip = 2.71 mean. Hector corpus (English text) compression improves up to 2.01 bpc for 103 MB. By block size:

Size	book1	Hector
1k	4.34	4.35
4k	3.86	3.83
16k	3.43	3.39
64k	3.00	2.98
256k	2.68	2.65
750k	2.49
1M		2.43
4M		2.26
16M		2.13
64M		2.04
103M		2.01
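
A minimal sketch of the transform and its naive inverse (mine, for illustration only; production implementations use suffix sorting and pair the transform with a move-to-front or similar coder):

def bwt(s):
    # Sort all rotations; output the last column and the index of the original.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations), rotations.index(s)

def ibwt(last, idx):
    # Repeatedly prepend the last column and re-sort to rebuild the sorted rotations.
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return table[idx]

encoded, idx = bwt("banana")
print(encoded, idx)                  # 'nnbaaa' 3
assert ibwt(encoded, idx) == "banana"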

Calgary Corpus (1993), http://www.kiarchive.ru/pub/msdos/compress/calgarycorpus.zip (Oct. 9 1998).

Cardie, Claire (1997), "Empirical Methods of Information Extraction", AI Magazine 18(4) 65-79. All systems (AUTOSLOG, PALKA, CRYSTAL) use tokenization and tagging, sentence analysis (partial parsing), extraction, merging (anaphora resolution/discourse analysis using learned rules such as C4.5 decision trees), and template generation. RESOLVE (using learning) had 41-44% recall, 51-59% precision at MUC-6, but the 5 best systems, all using hand-coded rules, had 51-63% recall and 62-72% precision. Humans: 80-82% precision.

Cardie, Claire, and Mooney, Raymond J. (1999), "Guest Editors' Introduction: Machine Learning and Natural Language", Machine Learning (34)1, www.cs.utexas.edu/users/ml/mlj-nll/

Carroll, Glenn, and Eugene Charniak (1992), "Two Experiments on Learning Probabilistic Dependency Grammars from Corpora", in Statistically-Based Natural Language Programming Techniques, Papers from the 1992 AAAI Workshop, Technical Report W-92-01, Menlo Park CA: AAAI Press, pp. 1-7. A probabilistic context-free grammar is inferred from a tagged corpus.

Carroll, Lewis (1865), Alice in Wonderland, Gutenberg Press, ftp://sunsite.unc.edu/pub/docs/books/gutenberg/etext97/alice30h.zip (Oct. 5, 1998)

Chakrabarti, S., B. Dom, S. Ravi Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, J. M. Kleinberg, D. Gibson (1999), Hypersearching the Web, Scientific American, 280(6) June, 54-60. The search engine Clever finds authorities and hubs by counting links to/from web pages. Web is 3×10^8 pages, growing at 10^6 pages/day.

Charniak, Eugene (1997a), "Statistical Techniques for Natural Language Parsing", AI Magazine 18(4) 33-43. Part of speech tagging uses 30-150 tags (noun, verb, etc.). A 300,000-word tagged corpus achieves 90% accuracy by guessing the most common tag, plus proper noun on new words. The best taggers achieve 96-97% using hidden Markov models. Humans do 98%. Statistical parsing using probabilistic context free grammars achieves 75% accuracy (per parse tree node) given only tags. Lexicalized parsing is 87-88% given words on the Penn treebank (tagged Wall St. Journal corpus).

Charniak, Eugene (1997b), "Statistical Parsing with a Context-free Grammar and Word Statistics", Proceedings of the AAAI (http://www.aaai.org), 598-603. Compares 3 parsing algorithms using the Penn treebank.

Charniak, Eugene (1993), Statistical Language Learning, Cambridge MA: MIT Press.

Chen, Stanley F., Ronald Rosenfeld (1999), "Efficient sampling and feature selection in whole sentence maximum entropy language models", IEEE Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, 549-552. A more efficient maximum entropy model (no normalization constant) can be used to generate most likely sentences when only relative probabilities are needed. Uses independence sampling and importance sampling to swap words in whole sentences. Improves on (Rosenfeld 1997).

Clarkson, P. R., and A. J. Robinson (1997), "Language model adaptation using mixtures and an exponentially decaying cache", IEEE ICASSP, Vol. 2, 799-802. Reduced trigram perplexity on the BNC (10^8 words) by 24% using a mix of 50 models, 14% by cache (0.005 decay rate), and 30% combined, to 115.86.

Cleary, John G., W. J. Teahan, Ian H. Witten (1995), "Unbounded length contexts for PPM", Proc. Data Compression Conference, 52-61. PPM* improves on PPMC-n by using unbounded context length n but starting with the shortest context length with exactly 1 prior occurrence when there is a choice. Equivalent to Burrows-Wheeler transform.

Cleary, John G., W. J. Teahan (1995), "Experiments on the zero frequency problem", Proc. Data Compression Conference, 480. Tests Laplace's law (P(i) = (C(i)+1)/(C(0)+C(1)+2)) for novel 1-bit events (C(i)=C(0) or C(1)) by context in theCalgary corpus. For book1 (text), P(1) 1/20C(0), P(0) 1/30C(1). Other files diverged also, but Laplace's law was otherwise a good fit when C(0)>0, C(1)>0.

Comprende (1998), http://comprende.globalink.com/main.html (May 28, 1998). Produces "draft quality" language translation between several European languages.

compress 4.3d for MSDOS (1990), ftp://ctan.tug.org/tex-archive/tools/compress/msdos.zip (Nov. 3, 1998).

Coon, Dennis, (1997), Essentials of Psychology, 7'th ed., Pacific Grove: Brooks/Cole Publishing Co.

Cover, T. M., and R. C. King (1978), "A Convergent Gambling Estimate of the Entropy of English", IEEE Transactions on Information Theory (24)4 (July) pp. 413-421. A more accurate method than Shannon's of measuring the entropy of text is to have people estimate probabilities of successive characters by placing bets. The upper bound is 1.3 to 1.6 bits/character for individuals and 1.3 bits/character using committee betting.

Crick, F. H. C. and C. Asanuma (1986), Certain aspects of the anatomy and physiology of the cerebral cortex, in Rumelhart, David E., James L. McClelland, and the PDP Research Group (1986), Parallel Distributed Processing, vol. 2, Cambridge MA: MIT Press, pp. 333-371. Many types of neurons (100's), cortex 1000 cm^2 x 80,000 neurons/mm^2 (200,000 in striate cortex, visual focus) x 1.5-5 mm thick. Cortex organized by function, connected to thalamus/body by stimulating neurons both ways. 2 neuron types: stimulating (type I): most common (80%) pyramidal, usually spiny dendrites, type I inputs to spines, dendrites, type II to soma, axon hillock, occasionally one I and one II to spines; long axons (cross brain). Inhibiting (type II), no spines, both type inputs to dendrites, soma; short axons (few mm). Firing rate: few to 100's per second, usually low.

Cycorp Inc. (1997), http://www.cyc.com (Oct. 15, 1998) A knowledge-representation system designed to "break the software brittleness bottleneck." "Common sense" rules are represented in an extended first-order logic language called Cycl. Currently being applied to the High Performance Knowledge Base for DARPA.

Data Compression FAQ (July 25, 1998), http://www.cs.ruu.nl/wais/html/na-dir/compression-faq/.html (Oct. 5, 1998). Benchmarks: ha 0.98 a2 is best on Calgary corpus (part 1). Intro to arithmetic coding. (part 2).

Della Pietra, S., V. Della Pietra, R. Mercer, S. Roukos (1992), "Adaptive Language Modeling Using Minimum Discriminant Estimation". Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 633-636, San Francisco, Mar. 1992. Also Proceedings of the DARPA Workshop on Speech and Natural Language, Morgan Kaufmann, 103-106, Feb. 1992. First applied maximum entropy approach to language modeling.

Denning, Peter J., (1990), Is Thinking Computable?, RIACS Technical Report 90.2 (NASA-CR-188859). Controversy over whether machines can think.

Dietterich, Thomas G. (1997), "Machine Learning Research, Four Current Directions", AI Magazine, 18(4), 97-136. Ensembles of classifiers: subsampling examples (bagging, cross-validated committees, ADABOOST); manipulating input, output; randomness in algorithm (Monte Carlo Markov Chain); combining (voting, weighted, stacking); decision trees and neural networks are NP-hard, so suboptimal, incomplete exploration of hypothesis space. Scaling: large training set (RIPPER = O(n log^2 n)); many features: preprocessing, sampling, racing; WINNOW uses exponential feature weights for context-sensitive spell checking. Reinforcement learning: exponential decay of credit assignment; exploration vs. performance. Learning stochastic models: probabilistic networks, Naive-Bayes often works even when inputs are not independent; Hidden Markov (clustering); learning structure.

DuPont, Pierre, and Ronald Rosenfeld (1997), "Lattice based language models", Carnegie Mellon Univ. Tech. Report CMU-CS-97-173, September 1997. Extends n-gram backoff to 2 dimensions with word groups.

Einstein, Albert (1933), Essays in Science (translated from the German Mein Weltbild by Alan Harris), New York: Philosophical Library Inc. Special theory of relativity: physics is invariant under velocity of frame of reference: Lorentz transform (constant speed of light) ds^2 = dx^2 + dy^2 + dz^2 - c^2 dt^2. General theory of relativity: invariance under acceleration/gravity: ds^2 = Σ_uv g_uv dx^u dx^v in continuous, non-Euclidean 4 dimensional space.

Epstein, W. (1962), "A Further Study of the Influence of Syntactical Structure on Learning", Amer. J. Psychology (75) 122-126, cited in (Hörmann 1979, p. 205).

Fan, L., Q. Jacobson, P. Cao & W. Lin (1999) Web Prefetching Between Low-Bandwidth Clients and Proxies: Potential and Performance, SIGMETRICS'99 http://www.cs.wisc.edu/~cao/papers/prepush.html (Apr. 29, 1999). Web document latency over a modem bank can be improved by proxy caching using PPM prediction. Authors recommend predicting next 4 documents from previous 2.

Feigenbaum, Edward A., (1961), "The Simulation of Verbal Learning Behavior", Proceedings of the Western Joint Computer Conference, 19:121-132, reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman eds, New York: McGraw Hill, 1963. Models human learning of nonsense syllable pairs using a decision tree.

Feldman, Julian (1961), "Simulation of Behavior in the Binary Choice Experiment", Proceedings of the Western Joint Computer Conference 19:133-144, reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman eds, New York: McGraw Hill, 1963. Program that simulates the human pattern-recognition and decision process in predicting the next bit in a random sequence.

Feldman, J. A. and D. H. Ballard (1982), "Connectionist Models and their Properties", Cognitive Science (6)., pp. 205-254. Describes the connectionist model, where neurons represent words and synapses represent associations.

Feldman, J., G. Lakoff, D. Bailey, S. Narayanan, T. Regier, A. Stolcke (1994), L0 - "The First Five Years of an Automated Language Acquisition Project", http://www.icsi.berkeley.edu/NTL/first_five_years.ps.Z (Oct 5, 1998). L0 is a project to determine the truth value of natural language statements (in several languages) about pictures of geometric objects. The system uses stochastic context-free grammatical inference from positive examples for syntactic analysis, and a connectionist model at the semantic level. The project is unfinished.

Fischer, Charles N., LeBlanc, Richard J. Jr (1991), Crafting a Compiler with C, Redwood City, Calif: Benjamin/Cummings Publishing Co., Inc.

Fischler, Martin A., and Oscar Firschein (1987), Intelligence - The Eye, the Brain, and the Computer, Reading MA: Addison Wesley. Describes neural network models of the brain. The brain has 10^10 to 10^11 neurons with hundreds or thousands of synapses per neuron. Response time is on the order of milliseconds.

Flanagan, David (1997), Java in a Nutshell, Cambridge: O'Reilly, p. 202-207. Describes the Java programming language, and the Unicode 16-bit character set for all of the world's languages.

Flinders University (1998) The Flinders University of South Australia presents the 1999 Loebner Prize Competition, http://www.cs.flinders.edu.au/Research/AI/LoebnerPrize/ (Oct 15, 1998). 5 programs and 4 human confederates are ranked by 10 judges in 1998. The winning machine ($2000) ranked most human was judged more human than a confederate 15% of the time. Turing expected a machine to reach 30% ($25,000 prize) at the end of the century.

Floyd, Robert W., and Richard Beigel (1994), The Language of Machines, New York: Computer Science Press. Book on theoretical computer science: A language is a set of strings. A grammar is a set of symbol-substitution rules that defines a language. The Chomsky hierarchy of grammars by increasing power: finite, context free, context sensitive, unrestricted. A recursively enumerable (r.e.) language, defined by an unrestricted grammar, is recognizable by a Turing machine, believed by Church's thesis to be the most powerful model of computation with denumerably infinite memory. A recursive language is decidable by a Turing machine that always halts. Some languages are r.e. but not recursive; the halting problem is undecidable. A problem is NP-complete if it is among the hardest problems for which a solution can be found in exponential time and verified in polynomial time. It is believed but not proven that a large class of NP-complete problems have no polynomial solution: traveling salesman, subset-sum, SAT, etc.

Freeman, James A., and David M. Skapura (1991), Neural Networks - Algorithms, Applications, and Programming Techniques, Reading MA: Addison-Wesley. Describes neural networks.

Frieder, Ophir (1997), personal communication. The average web-based query is 1.2 words.

Fukushima, Kunihiko, Sei Miyake, Takayuki Ito (1983), "Neocognitron: a neural network model for a mechanism of visual pattern recognition", IEEE Transactions on Systems, Man, and Cybernetics SMC-13 pp. 826-834. Neural network for handwritten character recognition.

Gazdar, Gerald, Chris Mellish (1996), Natural Language Processing in Prolog/Pop11/Lisp, http://www.cogs.susx.ac.uk/lab/nlp/gazdar/nlp-in-prolog/ (Apr 10, 1999). Chapter 3. Recursive and augmented transition networks.

Gelernter, H. (1959), Realization of a Geometry-Theorem Proving Machine, Proceedings of an International Conference on Information Processing, Paris: UNESCO House, pp. 273-282., reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman eds, New York: McGraw Hill, 1963. Proved high-school level plane geometry theorems, for example, "a point on a bisector of an angle is equidistant from the sides of the angle". The proof space is trimmed heuristically by discarding propositions inconsistent with a diagram of the problem.

Geutner, Petra (1996), "Introducing linguistic constraints into statistical language modeling", Proc. 4'th Intl. Conf. on Spoken Language, Vol. 1, 402-405. Compares a function/content word history with a trigram word history; perplexity on a 281K spoken dialog corpus: F/C=95.3, word=62.7, interpolated 80% word=60.3.

Gevarter, William B., (1983) "An Overview of Computer-Based Natural Language Processing", NASA Tech Memo 85653.

Giachin, Egidio P. (1995), "Phrase bigrams for continuous speech recognition", ICASSP Vol. 1, 225-228. Combining bigrams by impact on perplexity gives fastest convergence, but using MI = P(c1,c2)/(P(c1)P(c2)) eventually gives the same results.
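
A toy illustration (mine, with a made-up corpus) of the MI association score quoted above, estimated from raw unigram and bigram counts:

from collections import Counter

words = "the cat sat on the mat and the dog sat on the log".split()
uni = Counter(words)
bi = Counter(zip(words, words[1:]))
n = len(words)

def mi(w1, w2):
    # MI = P(c1,c2) / (P(c1) P(c2))
    return (bi[(w1, w2)] / (n - 1)) / ((uni[w1] / n) * (uni[w2] / n))

print(mi("sat", "on"), mi("the", "cat"))   # the recurring pair scores higher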

Gibbs, W. Wayt (1998), "Hello, Is This the Web?", Cyber View, Scientific American (Dec.) p. 48. Commentary on the (lack of) usefulness of speech recognition technology. The journalists that have promoted it are not actually using it because the most time consuming part of writing is not the typing but the thinking and editing. It also makes mistakes (correctly spelled) and is not speaker independent.

Gilchrist, Jeff (Sept. 1998), Archive Comparison Test, http://www.geocities.com/SiliconValley/Park/4264/act-mcal.html (Oct. 5, 1998). Best compression is BOA 0.58b, options -m15 -s: 76.1% on a 21 file collection that includes the Calgary benchmark, 72.4% on book1. Followed by ACB, RKIVE. HA 0.999b e12 is 72.9%; PKZIP 2.04e is 67.3%.

Gilly, Daniel (1992), Unix in a Nutshell, Sebastopol CA: O'Reilly. UNIX reference.

Gleick, James (1987), Chaos, Penguin Books. A chaotic system is one in which a small change in the initial state results in a large change later on.

Good, I. J. (1953). "The Population Frequencies of Species and the Estimation of Population Parameters", Biometrica 40(3,4) 237-264.

Good, I. J., (1963), "Maximum Entropy for Hypothesis Formulation, Especially for Multidimensional Contingency Tables", Annals of Mathematical Statistics 34, 911-934.

Gorin, Allen L., Giuseppe Riccardi (1999), "Spoken language variation over time and state in a natural spoken dialog system", Proc. IEEE ICASSP, Vol. 2, 721-724. The "How may I help you" speech system uses 6 language models depending on dialog state to improve recognition.

Green, Bert F. Jr., Alice K. Wolf, Carol Chomsky, and Kenneth Laughery (1961), Baseball: An Automatic Question Answerer, Proceedings of the Western Joint Computer Conference, 19:219-224, reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman eds, New York: McGraw Hill, 1963. Answers natural language questions such as "How many games did the Yankees play in July". Did dictionary lookup, syntactic and content analysis.

Grossberg, Stephen (1980), "How does the brain build a cognitive code?", Psychological Review (87) pp. 1-51.

Grossberg, Stephen, and Michael A. Cohen (1983), "Absolute Stability of Global Pattern Formation and Parallel Memory Storage by Competitive Neural Networks", IEEE Transactions on Systems, Man, and Cybernetics 13(5) Sept./Oct., pp. 815-826. A neural network is stable if two conditions hold: weights are symmetric: wij = wji, and there is no time delay in computing the neuron's state.

Grossberg, Stephen, and Daniel S. Levine (1987), "Neural Dynamics of Attentionally Modulated Pavlovian Conditioning: Blocking, Inter-Stimulus Interval, and Secondary Reinforcement", Applied Optics (26) pp. 5015-5030. A model of classical conditioning in neural networks.

Grossman, David A., and Ophir Frieder (1998), Information Retrieval: Algorithms and Heuristics, Boston: Kluwer Academic Publishers. Overview of information retrieval. The most widely used technique for satisfying natural language queries is to match words between query and document. Words that occur rarely are weighed more heavily. Refinements include matching stems (suffix stripping), matching phrases, and matching synonyms using a thesaurus or relevance feedback, noting that related words often appear together. Parsing is not effective.

Guha, R. V., and D. B. Lenat (1994), "Comparing CYC to Other AI Systems", http://www.cyc.com/tech-reports/act-cyc-406-91/act-cyc-406-91.html (May 29, 1998)

Gngor, Tunga, (1995), COMPUTER PROCESSING OF TURKISH: MORPHOLOGICAL AND LEXICAL INVESTIGATION, Ph.D. Dissertation, http://www.cmpe.boun.edu.tr/theses/phd/gungor.html (Apr 10, 1999). A Turkish spell checker. Turkish is agglutinative. Uses ATN for morphological analysis.

Guthrie, Louise, James Pustejovsky, Yorick Wilks, Brian M. Slator (1996), The Role of Lexicons in Natural Language Processing, Communications of the ACM (39)1 pp. 63-72, Jan. 1996. Most natural language processing systems use a lexicon, a machine-readable dictionary.

gzip 1.2.4 (1993), Jean-loup Gailly, http://www.kiarchive.ru/pub/msdos/compress/gzip124.exe (Oct. 9, 1998). Data compression program. GZIP386 -9 (best compression) on Calgary corpus: 2.5913 bpc weighted average.

Halpern, J. Y., and Grove, A. J. (1997), "Probability update: conditioning vs. cross-entropy", Proceedings of the Thirteenth Conference on Uncertainty in AI, pp. 208-214. Cross-entropy gives an unintuitive result in the Judy Benjamin problem: Judy believes P(B) = 1/2, P(R1) = P(R2) = 1/4 and receives a message P(R1 | R1 or R2) = q ≠ 1/2. Cross-entropy updating gives P(B) > 1/2, but careful analysis leaves P(B) = 1/2 unchanged.
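
A numeric check of the example (my own sketch, assuming the reported conditional is q = 3/4): minimizing the Kullback-Leibler distance to the prior over the distributions satisfying the constraint does push P(B) above 1/2:

from math import log

q = 3/4                                   # assumed value of the message
prior = (0.5, 0.25, 0.25)                 # (B, R1, R2)

def kl(p, pr):
    return sum(pi * log(pi / qi) for pi, qi in zip(p, pr) if pi > 0)

# Constrained posteriors: P(R1)+P(R2) = r, split q : 1-q between R1 and R2.
best = min(((1 - r, q * r, (1 - q) * r) for r in (i / 10000 for i in range(1, 10000))),
           key=lambda p: kl(p, prior))
print(best[0])                            # about 0.53, i.e. P(B) > 1/2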

Hammerton, James (1994), "Connectionist Natural Language Processing", http://www.cs.bham.ac.uk/~jah/CNLP/report1/report1.html (May 21 1998).

Harman, D. (1995) (ed), "Overview of the Third Text Retrieval Conference (TREC-3)", National Institute of Standards and Technology Special Publication 500-225, Gaithersburg MD 20879. TREC is an annual competition among text retrieval systems using a 4 GB corpus of newspaper articles and government documents and 50 new manually-evaluated queries. The best engines achieve recall + precision of 80-100% (of a possible 200%).

Hayes-Roth, Frederick (1994), "The State of Knowledge-Based Systems", Communications of the ACM 37(3) 27-39. The technology is mature.

Hebb, D. O. (1949), The Organization of Behavior, New York: Wiley. Proposed the first model of learning in neurons: when two neurons fire simultaneously, the synapse between them becomes stimulating.

Heeman, Peter A., Géraldine Dammani (1997), "Deriving phrase-based language models", IEEE Workshop on Automatic Speech Recognition and Understanding, 41-48. WER in speech recognition improved by grouping word sequences into phrases treated as single words. Argues for acoustic model considerations: "gonna" as 1 word.

Hirvola, H., 1993. ha 0.98, http://www.webwaves.com/arcers/msdos/ha098.zip (Jan 12, 1999). A data compression archiver. HA -a2 on Calgary corpus: 2.1549 bpc weighted average.

Hopfield, J. J. (1982), "Neural networks and physical systems with emergent collective computational abilities", Proceedings of the National Academy of Sciences (79) 2554-2558. Defines energy of symmetric network associative memory. Capacity = 0.15n vectors.

Hopfield, J. J. (1984), "Neurons with Graded Response have Collective Properties like those of Two-State Neurons", Proceedings of the National Academy of Sciences, USA (81) pp. 3088-3092. In the neural network equation x := g(Wx), where x is the neuron vector and W is the synapse matrix, g may be any increasing and bounded function of vector components without affecting the collective properties of the network.

Hopfield, J. J., and D. W. Tank (1986), "Computing with Neural Circuits: A Model", Science, (232), Aug. 8, 1986, pp. 625-628. An overview of neural networks.

Hörmann, Hans (1979), Psycholinguistics, 2nd Ed., New York: Springer-Verlag. Language psychology. Humans perform better at both comprehension in a noisy environment and at short-term recall when text/speech is low entropy, by restricting input possibilities and using lexically, grammatically, and semantically correct sentences. Short-term memory is about 100 bits.

Howard, Paul G., and Jeffrey Scott Vitter (1992), "Practical Implementations of Arithmetic Coding", in Image and Text Compression, ed. James A. Storer, Norwell MA: Kluwer Academic Press, pp. 85-112. http://www.cs.duke.edu/~jsv/Papers/HoV92.actech.pdf Locality in text compression sometimes improves compression. PPMC, PPMD, Quasi-arithmetic encoding, compressed trees (1 byte per binary decision), hashed Markov models without collision resolution.

Huang, Shell Ying, and Ghim Hwee Ong (1993), "Entropies of Chinese texts based on three models of Hanyu Pinyin phonetic system", Proc. Intl. Conf. on Information Engineering, Communications and Networks for the year 2000, vol 1, 305-309. Chinese GB2312-80 character set has 6763 ideograms (syllables). Word = 1-4 ideograms, each 402 phonetic spellings (1-55 meanings each) in Hanyu Pinyin (consonant, vowel, 5 tones), 26 letters.

Huang, Xiaohong, Zhensheng Luo, Jian Tang (1997), "A quick method for Chinese word segmentation", IEEE Intl. Conf. on Intelligent Processing Systems, Vol. 2, 1773-1776. Uses a 40,000 word lexicon to find maximum length matches with backtracking.
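
A sketch of the greedy forward maximum-match step (mine; the paper adds backtracking and works from a 40,000-word Chinese lexicon, whereas this toy uses English strings):

LEXICON = {"in", "for", "formation", "information", "retrieval", "engine"}
MAXLEN = max(len(w) for w in LEXICON)

def segment(text):
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAXLEN), i, -1):   # longest match first
            if text[i:j] in LEXICON:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])       # unknown character passes through
            i += 1
    return words

print(segment("informationretrievalengine"))   # ['information', 'retrieval', 'engine']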

Hutchens, Jason (1995), Natural Language Grammatical Inference, Honours Dissertation, University of Western Australia, http://ciips.ee.uwa.edu/au/~hutch/research/papers (Oct. 5, 1998). A stochastic grammar can be extracted from a corpus of text using statistical techniques.

Hutchens, Jason L., and Michael D. Alder (1997), "Language Acquisition and Data Compression", 1997 Australasian Natural Language Processing Summer Workshop Notes (Feb.), pp. 39-49, http://ciips.ee.uwa.edu/au/~hutch/research/papers (Oct. 5, 1998). English text can be partitioned into words at points of high entropy using simple trigram prediction.

Hutchens, Jason L., and Michael D. Alder (1998), "Introducing MegaHAL", Proceedings of the Human-Computer Communication Workshop, pp. 271-274, http://ciips.ee.uwa.edu/au/~hutch/research/papers (Oct. 5, 1998). The annual Loebner contest began in 1991 in New York, offers US $100,000 to a machine that passes the AI test for artificial intelligence. The contest consists of 6 machines, 4 humans, and 10 judges. The judges rank the machines and humans from least to most human-like. If any machine has a median rank above the lowest human, it wins. No machine ever has. MegaHAL was entered in 1998.

Hutchens, Jason L., and Michael D. Alder (1998), "Finding Structure via Compression", Proceedings of the International Conference on Computational Natural Language Learning, pp. 79-82, http://ciips.ee.uwa.edu/au/~hutch/research/papers (Oct. 5, 1998). A data compressor is used to find high-entropy characters which identify word boundaries in text to reduce perplexity and for grammatical inference.

Hwang, Kyuwoong (1997), "Vocabulary optimization based on perplexity". Argues for vocabulary optimization on Korean text split into syllables or phonemes as word units, but instead shows a perplexity increase.

Ingargiola, Giorgio P. (1995), PAC Learning, http://yoda.cis.temple.edu:8080/UGAIWWW/lectures95/learn/pac/pac.html (Apr. 10, 1999). DEFINITION: A class of functions F is Probably Approximately Correct (PAC) Learnable if there is a learning algorithm L that for all f in F, all distributions D on X, all epsilon (0 < epsilon < 1) and delta (0 < delta < 1), will produce an hypothesis h such that the probability is at most delta that error(h) > epsilon. L has access to the values of epsilon and delta, and to ORACLE(f,D). N = size of hypothesis space; m = number of training examples. F is Efficiently PAC Learnable if L is polynomial in epsilon, delta, and ln(N). It is Polynomial PAC Learnable if m is polynomial in epsilon, delta, and the size of (minimal) descriptions of individuals and of the concept. Lower bound on m: m > (1/epsilon)*(ln(1/delta) + ln N).
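
The sample-size lower bound quoted above, evaluated for some illustrative values (my own numbers):

from math import log, ceil

def pac_examples(epsilon, delta, N):
    # m > (1/epsilon) * (ln(1/delta) + ln N)
    return ceil((1 / epsilon) * (log(1 / delta) + log(N)))

print(pac_examples(0.1, 0.05, 2 ** 20))   # 169 examples for a 2^20-hypothesis space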

Iyer, R., M. Ostendorf, M. Meteer (1997), Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, 254-261. Proposes alternatives to perplexity for predicting speech recognition performance.

Jackson, Philip C. (1985), Introduction to Artificial Intelligence, 2nd. ed., New York: Dover Publications.

James, William (1890), Psychology (Briefer Course), New York: Holt, ch. XVI "Association" pp. 253-279. Describes an associative model of thought.

Jelinek, F., R. Mercer, Salim Roukos (1990), "Classifying words for improved statistical language models", IEEE ICASSP, Vol. 1, 621-624. Classifies an OOV word s by its trigram context c. Ranking by MMI = log P(s|c)/P(s) produces good synonyms, but ML = log P(s|c) gives better perplexity.

Jiang, J., S. Jones (1992), "Word-based dynamic algorithms for data compression", IEE Proc. Communication, Speech, and Vision, 139(6): 582-586. Word-based LZ (including space) outperforms standard LZW, DLZW, LZFG on text.

Jones, Mark A., and Jason M. Eisner (1992), "A Probabilistic Parser and Its Application", in Statistically-Based Natural Language Programming Techniques, Papers from the 1992 AAAI Workshop, Technical Report W-92-01, Menlo Park CA: AAAI Press, pp. 14-21. A bracketed corpus is used to train a probabilistic context-free grammar. Adding semantic constraints improves success rate of parsed sentences from 79% to 96%. Low-level statistical knowledge of English improves OCR error rates by 70-80% and 90% in telephony sublanguage domains.

Jusczyk, Peter W. (1996), "Investigations of the word segmentation abilities of infants", 4'th Intl. Conf. on Speech and Language Processing, Vol. 3, 1561-1564. Infants learn to segment words with the leading syllable stressed at 7.5 months, other words at 10.5 months.

Kalai, Adam, Stanley Chen, Avrim Blum, Ronald Rosenfeld (1999), "On-line algorithms for combining language models", IEEE Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, 745-748. Perplexity of AP+WSJ+SWB+BN avg: selector=174.7, switcher=159.6, mixer=157.4; concatenation: mixer=184.7, switcher=160.8.

Kasparov vs. Deep Blue (1997), http://www.chess.ibm.com (Nov. 17, 1998). IBM's Deep Blue defeated world chess champion Garry Kasparov 3.5 to 2.5 in a 6 game match on May 11, 1997. It uses a 32-node RS/6000 with 256 custom processors to evaluate 200,000,000 chess positions per second, twice as fast as the previous year (when Kasparov won). Kasparov evaluates 3 positions per second. IBM does not consider Deep Blue to have AI.

Kaplan, Ronald M. (1972), "Augmented transition networks as psychological models of sentence comprehension", Artificial Intelligence, pp. 77-100.

Kauffman, Stuart A. (1996), "Antichaos and Adaptation", Scientific American (web site), http://www.sciam.com/explorations/062496kaufman.html (Mar. 21 1998). Complex systems tend to evolve to the critical boundary between stability and chaos. A collection of random logic gates is stable with 2 or fewer inputs per gate, chaotic with 3 or more inputs. A critically balanced system of n variables has about n^1/2 attractors. Human DNA has about 100,000 genes, many of which turn other genes on or off, resulting in about 256 (≈ 100,000^1/2) cell types.
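
A small simulation in the spirit of the article (my own sketch): wire up n random Boolean gates with K inputs each and measure the length of the attractor cycle the system falls into; K=2 typically yields short cycles, K=3 or more much longer ones:

import random

def cycle_length(n, K, steps=2000, seed=0):
    rnd = random.Random(seed)
    inputs = [[rnd.randrange(n) for _ in range(K)] for _ in range(n)]
    tables = [[rnd.randrange(2) for _ in range(2 ** K)] for _ in range(n)]
    state = tuple(rnd.randrange(2) for _ in range(n))
    seen = {}
    for t in range(steps):
        if state in seen:
            return t - seen[state]       # attractor cycle length
        seen[state] = t
        state = tuple(tables[i][sum(state[j] << b for b, j in enumerate(inputs[i]))]
                      for i in range(n))
    return None                          # no cycle found within 'steps' updates

print(cycle_length(16, 2), cycle_length(16, 3))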

Kernighan, Brian W., and Dennis M. Ritchie, (1987), The C Programming Language, 2nd ed., Englewood Cliffs NJ: Prentice Hall.

Khudanpur, Sanjeev, Jun Wu (1999), "A maximum entropy language model integrating n-grams and topic dependencies for conversational speech recognition", Proc. IEEE Conf. on Acoustics, Speech, and Signal Processing, 553-556. Adding topic information manually or automatically to speech decreases word error rate, down to 42.6% on Switchboard.

Klakow, Dietrich (1998), "Language-model optimization by mapping of corpora", Proc. IEEE ICASSP, Vol. 2, 701-704. Improves n-gram word perplexity by finding phrases by 3 methods: pair counts, mututal information, and change in entropy. Entropy method works best by a small margin, but not consistently.

Kneser, Reinhard, Volker Steinbiss (1993), "On the dynamic adaptation of stochastic language models", IEEE ICASSP, Vol. 2, 586-589. Uses weighted average of 15 models by topic plus full model, interpolation parameters updated over 500 word window to improve bigram perplexity 10% on 1.1M word English corpus, from 532.1 to 480.7, or 410.9 to 384.0 with cache.

Kneser, Reinhard, and Hermann Ney (1995), "Improved backing-off for m-gram language modeling", IEEE ICASSP, Vol. 1, 181-184.

Kneser, Reinhard (1996), "Statistical language modeling using a variable context length", IEEE ICSLP, Vol. 1, 494-497. Improves perplexity and word error rate for speech recognition over trigram model by pruning 4-gram tree by least contribution to perplexity.

Knight, Kevin (1997), "Automatic Knowledge Acquisition for Machine Translation", AI Magazine, 18(4) 81-96. Best systems inferior to human translators in U.S. military evaluations. Rule based are best. Probabilistic model: max P(E|S) = max P(S|E)P(E) from bilingual (S to E) and monolingual (E) corpus. Lexical, syntactic, semantic, and interlingua translators. No evaluation benchmarks.

Knuth, Donald E., (1981) The Art of Computer Programming, 2nd ed., Reading Mass: Addison Wesley.

Korth, Henry F., and Abraham Silberschatz (1991), Database System Concepts, 2nd ed., New York: McGraw Hill. A relation is a set of tuples of attributes. A database is a set of relations. SQL is the standard language of relational algebra. Operations map relations to relations: select (subset of tuples), project (subset of attributes), and cross product. A database is in third normal form (3NF) if all attributes are atomic (1NF) and all relations have a unique primary key (set of attributes that determines the tuple).

Kullback, S. (1959), Information Theory in Statistics. New York: Wiley.

Kuffler, Stephen W., Nicholls, John G., Martin, Robert A., (1984), From Neuron to Brain, 2nd ed., Sunderland MA: Sinauer Associates Inc. 1. p. 3. Analysis of signals in the central nervous system. 2. p. 19. The visual world: cellular organization and its analysis. 3. p. 75. Columnal organization and layering of the cortex. 4. p. 99. Electrical signaling. 5. p. 111. Ionic basis of resting and action potentials. 15. p. 379. How sensory signals arise and their centrifugal control.

Kupiec, Julian, and John Maxwell (1992), "Training Stochastic Grammars from Unlabeled Text Corpora", in Statistically-Based Natural Language Programming Techniques, Papers from the 1992 AAAI Workshop, Technical Report W-92-01, Menlo Park CA: AAAI Press. pp. 8-13. Regular and context-free probabilistic grammars can be trained on unlabeled text to identify parts of speech for unknown words.

Landauer, Tom (1986), "How much do people remember? Some estimates of the quantity of learned information in long term memory", Cognitive Science (10) pp. 477-493. About 10^9 bits, based on rates of learning and forgetting.

Lashley, K. S. (1950), "In search of the engram", Society of Experimental Biology Symposium No. 4: Psychological Mechanisms in Animal Behavior, Cambridge: Cambridge University Press, pp. 454-455, 468-473, 477-480. Memories are distributed throughout the brain.

Lau, Raymond, Ronald Rosenfeld, Salim Roukos (1993), "Trigger-based language models: a maximum entropy approach", IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, vol. 2, 45-48. Adding 60,000 trigger constraints improves perplexity 12% on WSJ corpus (5M words, 40K unigrams, 200K bigrams, 200K trigrams) over trigram model. Discusses self triggers.

Lawrence, Steve, C. Lee Giles, Sandiway Fong (1998), "Natural Language Grammatical Inference with Recurrent Neural Networks", accepted by IEEE Transactions on Knowledge and Data Engineering, http://www.neci.nj.com/homepages/lawrence/papers/nl-tkde98.pdf (and .ps.Z). Grammar can be extracted from a tagged corpus by a recurrent neural network. The most effective of four models studied is the Elman network, which has time-delayed recurrent connections from the outputs to the inputs of the hidden (second of three) layer.

Lippmann, R. P. (1987), "An Introduction to Computing with Neural Networks", IEEE ASSP Magazine, Apr., pp. 4-22.

Little, W. A., and Gordon L. Shaw (1975), "A statistical theory of short and long term memory", Behavioral Biology (14) pp. 115-133. Short term memory is stable states in a neural network.

Lindsay, Robert K. (1963), "Inferential Memory as the Basis of Machines which Understand Natural Language", in Computers and Thought, E. A. Feigenbaum and J. Feldman eds, New York: McGraw Hill, pp. 217-233. A program that answers natural language queries about family relationships. Describes many of the problems in NLP: ambiguity of meaning and syntactic structure that depends on high-level knowledge.

Loebner, Hugh, (1998) Home Page of The Loebner Prize--"The First Turing Test". http://www.loebner.net/Prizef/loebner-prize.html (Nov. 4, 1998).

Loewenstern, David, and Peter N. Yianilos (1997), Significantly Lower Entropy Estimates for Natural DNA Sequences, http://www.neci.nj.nec.com/homepages/pny/papers/cdna/main.html (Feb 26, 1998). Best compression of DNA: 1.66 bits/nucleotide in noncoding human DNA, 1.84-1.87 in genes.

Luo, Xiaoqiang, and Frederick Jelinek (1999), "Probabilistic classification of HMM states for large vocabulary continuous speech recognition", IEEE ICASSP, Vol. 1, 353-356.

Mahajan, M., D. Beeferman, X. D. Huang (1999), "Improved topic-dependent language modeling using information retrieval techniques", IEEE ICASSP, Vol. 1, 541-544. Improves n-gram perplexity by using history as IR query to select training data from corpus (tf-idf model). Best results with blending of high/low recall trained models, with stemming, but no stopwords or query expansion.

Mahoney, Matthew V. (1999), "Text Compression as a Test for Artificial Intelligence", http://www.he.net/~mmahoney/paper4.ps.Z

Mandelbrot (1953), reference in [Borko 1967] p. 96 given as: Mandelbrot, B, "An informational theory of the structure of language based upon the theory of the statistical matching of messages and coding", Proceedings of a Symposium on Applications of Communication Theory, W. Jackson (ed.), London: Butterworths, 1953, pp. 486-502. Generalizes Zipf's distribution to p_r = c(r + a)^-b, 0 ≤ a ≤ 1, b > 1, c > 0, Σ_r p_r = 1, where p_r is the frequency of the r'th most common word in natural language text.
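
A short sketch (mine, with arbitrary parameter values) that builds the distribution over a finite vocabulary by choosing c so the probabilities sum to 1:

def mandelbrot(vocab_size, a=0.5, b=1.1):
    # p_r = c * (r + a)^(-b), r = 1..vocab_size, with c normalizing the sum to 1
    weights = [(r + a) ** -b for r in range(1, vocab_size + 1)]
    c = 1 / sum(weights)
    return [c * w for w in weights]

p = mandelbrot(50000)
print(p[0], p[999], sum(p))   # most common word, 1000th word, total = 1.0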

Manning, Christopher D., and Hinrich Schütze (1999), Foundations of Statistical Natural Language Processing, Cambridge MA: MIT Press.

Marklein, Mary Beth (1998), "Software Makes the Grade on Essays", USA Today, Apr. 16, 1998, p. D1. Describes a program that grades English essays, using a large sample of correct essays as training examples.

Martin, S. C., H. Ney, J. Zaplo (1999), "Smoothing methods in maximum entropy language modeling", Proc. IEEE ICASSP, Vol. 1, 545-548. Perplexity on 4.5M word WSJ0 corpus, 20K vocabulary: 148.2 for trigram model interpolating 2 smoothing methods.

Mauldin, Michael L. (1994), Chatterbots, Tinymuds, And The Turing Test: Entering The Loebner Prize Competition. AAAI-94. http://www.fuzine.com/mlm/aaai94.html (Nov. 4 1998).

McClelland, James L., and David E. Rumelhart (1981), "An interactive activation model of context effects in letter perception: part 1. An account of basic findings", Psychological Review (88) pp. 375-407. (Connectionist) model of visual word perception.

McCulloch, Warren S., and Walter Pitts (1943), "A logical calculus of the ideas immanent in nervous activity", Bulletin of Mathematical Biophysics (5) pp. 115-133. 2-state neurons can implement any Boolean function.

McDonald, John, (1998), 200% Probability and Beyond: The Compelling Nature of Extraordinary Claims in the Absence of Alternative Explanations. Skeptical Inquirer (22)1: 45-49, 64.

McKee, Doug, and John Maloney (1992), "Using Statistics Gained from Corpora in a Knowledge-Based NLP System", in Statistically-Based Natural Language Programming Techniques, Papers from the 1992 AAAI Workshop, Technical Report W-92-01, Menlo Park CA: AAAI Press, pp. 73-80. SOLOMON is a knowledge system that is trained using a tagged corpus. Statistics are used to learn noun phrases, verb-preposition relations, verb-clause relations, count/mass nouns, partitives.

Michaud, Lisa (1998), The Intelligent Word Prediction Project, http://www.asel.udel.edu/nli/nlp/wpredict.html (Apr 10, 1999). Uses ATNs to predict the next word in text to help the disabled communicate.

Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K. (1993), Introduction to WordNet: An On-line Lexical Database, ftp://ftp.cogsci.princeton.edu/pub/wordnet/5papers.ps (Feb 5, 1999). At http://www.cogsci.princeton.edu/~wn/ since 1990. 95,600 words.

Minsky, Marvin (1961), "Steps toward Artificial Intelligence", Proceedings of the Institute of Radio Engineers, 49:8-30. Summary of early AI work: search (hill climbing, heuristics), pattern recognition, learning, problem solving and planning, induction and models.

Minsky, Marvin, and Seymour Papert (1969), Perceptrons, Cambridge MA: MIT Press, Introduction pp. 1-20. Perceptrons cannot learn arbitrary rules.

Mitchell, Tom M. (1997), Machine Learning, New York: McGraw-Hill. Learning involves generalizing a set of examples to a hypothesis consistent with the examples. Techniques include decision trees, neural networks, and genetic algorithms.

Moffat, Alistair (1990), "Implementing the PPM Data Compression Scheme", IEEE Transactions on Communications (38)11 (Nov.), pp. 1917-1921. PPM achieves 2.2 bits/character on English text in theory. A practical implementation achieves 2.4 bits/character.

Mori, H., H. Aso, S. Makino, (1995), "Japanese document recognition based on interpolated n-gram model of character", Proc. Third Intl. Conf. on Document Analysis and Recognition, Vol. 1, 274-277. Deleted interpolation outperforms flooring in Japanese OCR.

Munro, P. W. (1986), State-dependent factors influencing neural plasticity: a partial account of the critical period. In Rumelhart, David E., James L. McClelland, and the PDP Research Group (1986), Parallel Distributed Processing, vol. 2, Cambridge MA: MIT Press, p. 471-502. Model of plasticity of the visual cortex in cats: growth of synaptic strength with use.

Nelson, Mark, (1996) Data Compression with the Burrows-Wheeler Transform, Dr. Dobbs Journal, Sept., pp. 46-50, http://www.dogma.net/markn/articles/bwt/bwt.htm (Oct 30 1998). A Burrows-Wheeler compressor using run length encoding pre/post and arithmetic encoding. Calgary=2.41 bpc, PKZIP=2.64.

Nevill-Manning, Craig G., Ian H. Witten (1996), "Compressing semi-structured text using hierarchical phrase identifications", IEEE Proc. Data Compression Conference, 63-72. Infers grammar on geneological database.

Nevill-Manning, Craig G., Ian H. Witten (1997), "Inferring lexical and grammatical structure from sequences", IEEE Proc. Conf. on Compression and Complexity of Sequences, 265-274. SEQUITUR compresses semistructured data by deterministic grammatical inference followed by nondeterministic FSA inference at top level rule (typically very large). Each rule must be used twice, and no pair of symbols may be repeated. On text, rules often find words.

Newell, Allen, J. C. Shaw, H. A. Simon (1957), "Empirical Explorations with the Logic Theory Machine: A Case Study in Heuristics", Proceedings of the Western Joint Computer Conference, 15:218-239, reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman eds, New York: McGraw Hill, 1963. The program discovers proofs of propositional logic theorems. A proof is a sequence of statements ending with the theorem in which each statement is an axiom or is derived from previous statements by substitution, detachment, or replacement. Theorems are proved by heuristically trimming the space of all possible proofs. It solves the following in 10 seconds:

To prove: (p implies not-p) implies not-p
1. (A or A) implies A (axiom)
2. (not-A or not-A) implies not-A (substitution)
3. (A implies not-A) implies not-A (replacement)
4. (p implies not-p) implies not-p (substitution).
It fails to prove [p or (q or r)] implies [(p or q) or r] after 23 minutes on the RAND JOHNNIAC.

Newell, Allen, J. C. Shaw, and H. A. Simon (1958), "Chess-Playing Programs and the Problem of Complexity", IBM Journal of Research and Development, 2:320-335, reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman eds, New York: McGraw Hill, 1963. Describes a chess-playing program using the minimax algorithm and heuristic trimming. It runs on the RAND JOHNNIAC at 20,000 ops/sec in interpreted IPL-IV (6000 words, est. 16,000 at completion), computes a move in 1-10 hours, and plays at an amateur level. The first program, described by Kister, Stein, Ulam, Walden, Wells at Los Alamos in 1956, played on a 6x6 board on an 11,000 op/sec MANIAC-I using a 600-word machine coded program at 12 min/move, 2-deep search, equivalent to a human with 20 games of experience. The chess algorithm was first described by Shannon in 1949. Search space is 10^120 board positions.

Newell, Allen, H. A. Simon (1961), "GPS: A Program that Simulates Human Thought", Lernende Automaten, Munich: R. Oldenbourg KG, reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman eds, New York: McGraw Hill, 1963. An attempt to simulate the human problem-solving process in proving logic theorems.

Newsbytes (1988), "Automatic Language Translation System for Telecommunications", Mar. 22, 1988, http://www.nbnn.com/pubNews/88/48342.html (Mar. 29, 1998). Toshiba developed an English-Japanese translator using a 130,000 word vocabulary and 100,000 language rules.

Ney, H., U. Essen, R. Kneser (1995), "On the estimation of `small' probabilities by leaving-one-out", IEEE Trans. Pattern Analysis and Machine Intelligence, 17(12) 1202-1212. Derives Turing-Good formula and shows absolute discounting better than relative for text prediction.

Ng, Hwee Tou, and Zelle, John (1997), "Corpus-Based Approaches to Semantic Interpretation in Natural Language Processing", AI Magazine, 18(4), 45-64. Word-sense disambiguation: In the Longman Dictionary of Contemporary English, the 121 most frequent nouns have 7.8 avg. meanings; 70 verbs 12.0 meanings. Best results with naive-Bayes or nearest-neighbor classifier or learning on tagged corpus, combining local collocations (most heavily), surrounding words, syntactic relations, and parts of speech. Unsupervised methods use a dictionary such as Wordnet. Best performance 58-75% depending on corpus. Semantic parsing: conversion to frame representation: best 70% accuracy on 225 examples. No standard benchmarks.

Nickerson, R. S. (1968), "A note on long term recognition for picture material", Psychonomic Science (11) p. 58.

Osherson, Daniel N., Michael Stob, and Scott Weinstein (1986), Systems That Learn, Cambridge MA: MIT Press. Defines learning as finding a Turing machine that recognizes a language given an infinite sequence of all strings in the language. Proves that some languages are unlearnable.

Oxford English Dictionary. 1989 edition has 400,000 words. 10,000 updates every 3 months. www.oed.com (Feb 5, 1999). First edition took from 1879 to 1928 to publish.

Papadimitriou, Christos H. (1994), Computational Complexity, Reading MA: Addison-Wesley. Computer science (see Floyd). Also defines Boolean and first-order logic.

Papalia, Diane E., and Sally Wendkos-Olds (1990), A Child's World, Infancy through Adolescence, New York: McGraw-Hill.

Pavlov (1927), Conditioned Reflexes, New York: Oxford University Press, (English translation, 1960). Showed that animals could be conditioned to respond to a novel stimulus (CS) by pairing it with an unconditioned stimulus (US) that elicits the same response (R).

Pearl, Judea (1988), Probabilistic Reasoning in Intelligent Systems, San Mateo CA: Morgan Kaufmann Publishers Inc. About AI systems that deal with uncertainty in reasoning and inference.

Pedersen, Ted, and Rebecca Bruce (1997), "A New Supervised Learning Algorithm for Word Sense Disambiguation", Proceedings of the AAAI (http://www.aaai.org), 604-609. Disambiguation of 12 words on hand-tagged ACL/DCI Wall Street Journal corpus. Best to worst: C4.5 decision tree=.859, FSS AIC Naive mix (builds a dependency graph starting from Naive Bayes)=.848, Naive Bayes=.847, CN2 rule induction=.843, BSS AIC Naive Mix (prunes a fully connected dependency graph)=.842, BSS AIC (no mix of models)=.841, FSS AIC=.838, PEBLS (nearest neighbor)=.835, majority classifier (baseline)=.738.

Penrose, Roger (1990), The Emperor's New Mind, New York: Oxford University Press.

Peterson, I., (1998), "Web searches fall short", Science News (153) p. 286, May 2. Web search engines index 4% to 34% of the 320 million Web pages as of Dec. 1997.

Pierce, John R. (1980), An Introduction to Information Theory, New York: Dover Publications.

PKZIP (1993), version 2.04e, PKWARE Inc. Data compression and archiving program. Calgary corpus, weighted average: 2.6287 bpc.

Porter, M. F. (1980), "An algorithm for suffix stripping", Program (14), pp. 130-137. A moderately complex but not foolproof algorithm for removing English suffixes.
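
A toy suffix stripper in the spirit of (but far cruder than) Porter's algorithm; the suffix list and minimum stem length are illustrative assumptions:

    def strip_suffix(word, suffixes=("ingly", "ing", "edly", "ed", "ly", "es", "s")):
        # Remove the first matching suffix, keeping a stem of at least 3 letters.
        for suf in suffixes:
            if word.endswith(suf) and len(word) - len(suf) >= 3:
                return word[:-len(suf)]
        return word

    print([strip_suffix(w) for w in ["stripping", "stripped", "caresses", "happily"]])
    # ['stripp', 'stripp', 'caress', 'happi'] -- crude stems, illustrating why the
    # real algorithm needs its measure and recoding rules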

Pressman, Roger S. (1992), Software Engineering, A Practitioner's Approach, New York: McGraw-Hill. A more recent estimate of software cost, still 10 lines/day.

Raymond, Eric (1997), Jargon Dictionary, http://www.netmeg.net/jargon/terms/a/ai-complete.html (Oct 15, 1998). Definition of AI-Complete.

Resnik, Philip (1992), "WordNet and Distributional Analysis: A Class-based Approach to Lexical Discovery", in Statistically-Based Natural Language Programming Techniques, Papers from the 1992 AAAI Workshop, Technical Report W-92-01, Menlo Park CA: AAAI Press, pp. 48-56, http://www.jair.org/abstracts/resnik99a.html (Aug. 5, 1999). Semantic knowledge in the form of verb-object associations can be extracted from text.

Resnik, Philip (1998), "Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language", Journal of Artificial Intelligence Research 11:95-130.

Rich, Elaine, and Kevin Knight (1991), Artificial Intelligence, 2nd Ed., New York: McGraw-Hill. Describes the Turing test for AI: a machine is intelligent if a human cannot distinguish it from another human via teletype.

Ries, Klaus, Finn Dag Buo, Alex Waibel (1996), "Class phrase models for language modeling", ICSLP, Vol. 1, 398-401. Combines word pairs by directly measuring impact on perplexity, best method of several reviewed.

Ristad, Eric Sven, and Robert G. Thomas (1997), "Nonuniform Markov models", IEEE ICASSP, Vol. 2, 791-794.

Rochester, N., J. H. Holland, L. H. Haibt, and W. L. Duda (1956), "Tests on a cell assembly theory of the action of the brain, using a large digital computer", IRE Transactions on Information Theory IT-2: pp. 80-93. Simulation showed the need to limit synapse weights when using Hebb's rule.

Rosenblatt, F. (1958), "The perceptron: a probabilistic model for information storage and organization in the brain", Psychological Review (65) pp. 386-408. Describes a 3-layer network of on-off neurons with feedback from the output to the hidden layer and lateral inhibition.

Rosenfeld, Ronald (1996), "A Maximum Entropy Approach to Adaptive Statistical Language Modeling", Computer, Speech and Language, 10, http://www.cs.cmu.edu/afs/cs/user/roni/WWW/me-csl-revised.ps (May 21, 1999). Achieved perplexity of 68/word on the 38MW WSJ corpus (20K vocabulary, 2.2% unknown) by weighted blending of a maximum entropy model (trigger words plus 1, 2, 3-grams) and adaptive 1, 2, 3-grams. ME: given constraints E[f_i] = sum_{h,w} P(h,w) f_i(h,w) = K_i, the solution has the form P(h,w) = prod_i mu_i^f_i(h,w), with the mu_i found by GIS (Darroch and Ratcliff 1972):

1. Select arbitrary initial mu_i.
2. Compute E[f_i] = sum_{h,w} P(h,w) f_i(h,w), with P as above.
3. Update mu_i <- mu_i K_i / E[f_i].
4. Repeat steps 2-3 until the mu_i converge.
In practice, E[f_i] = sum_h P'(h) sum_w P(w|h) f_i(h,w) = K_i, where h = history, w = current word, P(h,w) = probability to be determined, P' = empirical probability from the training data, f_i(h,w) = selector function (0 or 1), E = expected value, and K_i = a constant derived from training statistics. D(P(x)||Q(x)) = Kullback-Leibler distance = sum_x Q(x) log(Q(x)/P(x)) (Kullback 1959).
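
A minimal GIS sketch under toy assumptions (a 4-event space, 2 binary features, invented targets K_i); it adds the standard slack feature so every event has the same total feature count C, and uses the 1/C-scaled form of the update, which reduces to step 3 above when C = 1:

    import numpy as np

    F = np.array([[1., 0.],               # f_i(x): rows = events, columns = features
                  [1., 1.],
                  [0., 1.],
                  [0., 0.]])
    K = np.array([0.5, 0.4])              # target expectations K_i (toy "training" constraints)

    C = F.sum(axis=1).max()               # make feature totals constant via a slack feature
    F = np.hstack([F, (C - F.sum(axis=1))[:, None]])
    K = np.append(K, C - K.sum())

    mu = np.ones(F.shape[1])              # step 1: arbitrary initial mu_i
    for _ in range(500):
        p = np.prod(mu ** F, axis=1)      # P(x) proportional to prod_i mu_i^f_i(x)
        p /= p.sum()
        E = F.T @ p                       # step 2: model expectations E[f_i]
        mu *= (K / E) ** (1.0 / C)        # step 3: mu_i <- mu_i * (K_i / E[f_i])^(1/C)

    p = np.prod(mu ** F, axis=1)
    p /= p.sum()
    print(np.round(F.T @ p, 3))           # step 4: expectations have converged to K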

Rosenfeld, R. (1997), "A whole sentence maximum entropy language model", IEEE Workshop on Automatic Speech Recognition and Understanding, 230-237. Generates sentences by randomly replacing words according to their relative probability.

Rudnicky, A. I., Hauptmann, A. G., Lee, K. (1994), "Survey of Current Speech Technology", Communications of the ACM, 37(3), 52-57. Speech recognition has been making slow but steady progress since the 1970's in vocabulary size, continuous speech, speaker independence, and noisy input.

Ruelle, David (1991), Chance and Chaos, Princeton NJ: Princeton University Press.

Rumelhart, David E., James L. McClelland, and the PDP Research Group (1986), Parallel Distributed Processing, Cambridge MA: MIT Press. Vol 1. Ch 1 p 3. The appeal of parallel distributed processing, McClelland, Rumelhart, G. E. Hinton. Ch 2 p 45. A general framework for parallel distributed processing, Rumelhart, Hinton, McClelland. The activation level of a unit is a bounded monotone function (binary, random, or continuous) of a weighted sum of activation levels of other units. The system learns by adjusting weights as a function of the product of input and output activation levels (variations of Hebb's rule). Ch 3 p 77. Distributed representations, Hinton, McClelland, Rumelhart. Ch 4 p 110. PDP models and general issues in cognitive science, Rumelhart, McClelland. Ch 5 p 151. Feature discovery by competitive learning, Rumelhart, D. Zipser. Networks use lateral inhibition to allow only one unit on in a layer or group. Ch 6 p 194. Information processing in dynamical systems: foundations of harmony theory, P. Smolensky. Bipartite network with activation levels of {0,1} and {-1,1}. Ch 7 p 282. Learning and relearning in Boltzmann machines, Hinton, T. J. Sejnowski. Binary activation levels are randomly selected by temperature. Simulated annealing guarantees global minimum energy. Ch 8 p 318. Learning internal representations by error propagation, Rumelhart, Hinton, R. J. Williams. (Back propagation). Learning in feedforward multilayer networks with graded activation functions, by adjusting weights in proportion to their contribution to output error, usually avoids local minima. Ch 9 p 365. An introduction to linear algebra in parallel distributed processing, M. I. Jordan. Ch 10 p 423. The logic of activation functions, R. J. Williams. Ch 11 p 444. An analysis of the delta rule and the learning of statistical associations, G. O. Stone. Ch 12 p 460. Resource requirements of standard and programmable nets, McClelland. Ch 13 p 488. P3: a parallel network simulating system, Zipser, D. E. Rabin. Vol 2. Ch 14 p 7. Schemata and sequential thought processes in PDP models, Rumelhart, P. Smolensky, McClelland, Hinton. Ch 15 p 58. Interactive processes in speech perception: the TRACE model, McClelland, J. L. Elman. Ch 16 p 122. The programmable blackboard model of reading, McClelland. Ch 17 p 170. A distributed model of human learning and memory, McClelland, Rumelhart. Ch 18 p 216. On learning the past tenses of English verbs, Rumelhart, McClelland. Ch 19 p 272. Mechanisms of sentence processing: assigning roles to constituents, McClelland, A. H. Kawamoto. Ch 20 p 333. Certain aspects of the anatomy and physiology of the cerebral cortex, F. H. C. Crick, C. Asanuma. 2 neuron types: 80% type I: stimulates, long axons, dendrite spines; type II inhibits, short axons. Firing rate usually 50-100/s or less. Neurons specialize in stimuli (local representation). 10^10 neurons, 10^13 synapses (?) in cerebral cortex. Ch 21 p 372. Open questions about computation in cerebral cortex, T. J. Sejnowski. Ch 22 p 390. Neural and conceptual interpretation of PDP models, P. Smolensky. Ch 23 p 432. Biologically plausible models of place recognition and goal location, Zipser. Ch 24 p 471. State-dependent factors influencing neural plasticity: a partial account of the critical period, P. W. Munro. Studied left-right eye plasticity in visual cortex of cats. Model: connection weights start small (below threshold) and grow. Ch 25 p 503. Amnesia and distributed memory, McClelland and Rumelhart. Proposed neural model of bilateral temporal injury and Korsakoff's amnesia. Ch 26 p 531. Reflections on cognition and parallel distributed processing, D. A. Norman.
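
A generic sketch of the Hebbian weight rule summarized above (weight change proportional to the product of input and output activations, with a decay term to keep weights bounded); it does not correspond to any particular chapter's model, and all sizes and rates are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_out = 8, 3
    W = rng.normal(scale=0.1, size=(n_out, n_in))    # synaptic weights

    def hebb_step(W, x, lr=0.05, decay=0.01):
        # Activation is a bounded monotone function of a weighted sum of inputs;
        # the weight change is proportional to output * input, minus a decay term.
        y = np.tanh(W @ x)
        return W + lr * np.outer(y, x) - decay * W

    for _ in range(100):
        W = hebb_step(W, rng.uniform(-1.0, 1.0, n_in))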

Samuel, A. L. (1959), "Some Studies in Machine Learning using the Game of Checkers", IBM Journal of Research and Development, 3:211-229, reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman eds, New York: McGraw Hill, 1963. The checkers program learned by adjusting the weights of its heuristics through play against itself and humans, and from book openings. The program, running on an IBM-7090, defeated R. W. Nealey, a "former Connecticut checkers champion, and one of the nation's foremost players", on July 12, 1962.

Schauble, Peter (1997), Multimedia Information Retrieval, Boston: Kluwer Academic Publishers. Speech and scanned text are noisy, which complicates information retrieval.

Schewe, Phillip W. (1994), Physics News (202), http://newton.ex.ac.uk/aip/physnews.202.html (Feb. 26, 1999). Junk DNA (97% in humans) is Zipf distributed at lengths 3, 6, 7, 8.

Schindler, Michael (1998), szip homepage, http://www.compressconsult.com/szip/ (Oct 30, 1998). szip data compression uses a Burrows-Wheeler block-predictive encoder. Version 1.05x gives slightly better compression than 1.1 beta (as of 9/21/1998). No source code available. Best compression on the large text files of the Canterbury corpus at http://corpus.canterbury.ac.nz/results/large.html. szip-b results (bits/char): E.coli 2.06, bible 1.53, world192 1.40; weighted average 1.72, average 1.66.

Schindler, Michael (1997), "A Fast Block-sorting Algorithm for Lossless Data Compression", 1997 Data Compression Conference, http://www.compressconsult.com/szip/ (Oct 30, 1998). Burrows-Wheeler compression can be sped up, with a slight loss of compression, by sorting on a fixed-size context (2-4 characters). Modifying the move-to-front postprocessing (MTF) to move to the second position from the front (M1) also helps for big blocks. Calgary (unweighted mean bpc): ST2,M1=2.97, ST2,M1=2.54, BWT,MTF=2.46.

Schmidhuber, Jurgen, Stefan Heil (1996), "Sequential neural text compression", IEEE Trans. on Neural Networks 7(1): 142-146. A 3-layer neural network with 80 characters x 5 time-step window inputs, 430 hidden units, and 80 outputs was trained to predict text in 10K-20K German newspaper text files, feeding an arithmetic encoder. Compression ratio (input/output) = 2.72, vs. pack = 1.74, compress = 1.99, gzip -9 = 2.29, but about 1000 times slower.

Schwartz, Barry, and Daniel Reisberg (1991), Learning and Memory, New York: W. W. Norton and Company. p. 5 associationism. p. 43 Pavlovian conditioning, eye blink, conditioned fear, autoshaping, taste aversion, p. 51 adaptive function, CRs that oppose URs, habituation, p. 61 evidence for S-S and S-R learning, p. 66 CS before US, p. 72 CS like US learns faster, p. 76 extinction, p. 82 discrimination, p. 91 contingency (CS must provide information to be salient), p. 101 Rescorla-Wagner theory: delta-V(n) = K(lambda - V(n-1)), exponential approach to the asymptotic conditioned strength lambda, p. 119 operant conditioning, p. 127 contingency learning, p. 129 learned helplessness causes depression, p. 158 tokens can reinforce, p. 175 matching law: reinforcement = amount/delay (pigeons), p. 182 maximizing: behavior maximizes reward, p. 214 stimulus discrimination: generalization along dimensions unless S+, S- along dimension, peak shift past S+, p. 235 pigeons discriminate pictures, p. 253 modal model of memory: sensory (raw, .25s visual, several s audio), short term (7 +- 2 words or chunks, maintained by rehearsal, inner voice), long term (vast, conceptual), p. 271 amnesia, p. 288 recall is state-dependent, p. 305 implicit/explicit memory, p. 317 episodic/semantic memory, p. 324 association with prior knowledge improves memory/reconstruction, p. 365 knowledge, p. 399 propositional models, p. 406 network models, p. 439 connectionism, p. 464 speech disrupts short term memory, p. 482 subjects recalled 98% of 612 pictures, 90% after 1 week, several thousand (Nickerson, R. S. 1968, A note on long-term recognition memory for picture material, Psychonomic Science, 11, 58; Standing, L., 1973, Learning 10,000 pictures, Quarterly Journal of Experimental Psychology, 25, 207-222), p. 495 forgetting as recall interference, p. 525 amnesia before age 3, p. 528 memory lasts decades, p. 552 covariance detection is biased, p. 569 probabilities are estimated by unweighted summation of evidence.
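
A minimal sketch of the Rescorla-Wagner update noted above (delta-V(n) = K(lambda - V(n-1)), exponential approach to the asymptote lambda); the trial count and learning rate are arbitrary:

    def rescorla_wagner(trials=20, K=0.3, lam=1.0):
        # Associative strength V approaches the asymptote lambda exponentially:
        # V(n) = V(n-1) + K * (lambda - V(n-1)).
        V = 0.0
        history = []
        for _ in range(trials):
            V += K * (lam - V)
            history.append(V)
        return history

    print([round(v, 3) for v in rescorla_wagner()])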

Sejnowski, T. J., (1986), "Open questions about computation in cerebral cortex", in Rumelhart, David E., James L. McClelland, and the PDP Research Group (1986), Parallel Distributed Processing, vol. 2, Cambridge MA: MIT Press, pp. 372-389. Neurons in visual cortex recognize features: edges, lines, movement in certain orientations, illusory edges, faces. Plasticity.

Sejnowski, Terrence J., and Charles R. Rosenberg (1986), "NETtalk: a parallel network that learns to read aloud", The Johns Hopkins University Electrical Engineering and Computer Science Technical Report JHU/EECS-86/01, 32 pp.

Selberg, Erik, and Oren Etzioni (1995), "Multi-Service Search and Comparison Using the Metacrawler", http://www.w3.org/Conferences/WWW4/Papers/169/ (May 4, 1998). A Web search engine that redirects queries to other engines and collects the responses. Typical query distributions are given. Queries appear Zipf-distributed with c = 0.01. Top query is "sex".

Selfridge, Oliver G., Ulric Neisser (1960), "Pattern Recognition by Machine", Scientific American, Aug., 203:60-68, reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman eds, New York: McGraw Hill, 1963. Pattern recognition of Morse Code and printed characters.

Seppa, N., (1999) "Deaf people seem to hear signing", Science News 155(8): 122.

Seymore, Kristie, Ronald Rosenfeld (1996), "Scalable backoff language models", Proc. Intl. Conf. on Spoken Language, 232-235. Improves on the practice of eliminating low-frequency bigrams and trigrams to reduce the size of a trigram word model. The criterion for exclusion is the difference in log probabilities between the bigram and trigram models, rather than raw counts.
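
A minimal sketch of that criterion, dropping a trigram whose log probability differs little from the backed-off bigram estimate; the probability tables and threshold are invented, and this omits the paper's handling of backoff weights:

    import math

    def prune_trigrams(trigram_p, bigram_p, threshold=0.1):
        # Keep (w1, w2, w3) only if the trigram estimate differs from the
        # bigram estimate P(w3 | w2) by more than the threshold in log space.
        kept = {}
        for (w1, w2, w3), p3 in trigram_p.items():
            p2 = bigram_p.get((w2, w3), 1e-9)
            if abs(math.log(p3) - math.log(p2)) > threshold:
                kept[(w1, w2, w3)] = p3
        return kept

    trigram_p = {("the", "cat", "sat"): 0.30, ("on", "the", "mat"): 0.11}
    bigram_p = {("cat", "sat"): 0.25, ("the", "mat"): 0.10}
    print(prune_trigrams(trigram_p, bigram_p))   # keeps only the first trigram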

Shannon, Claude, and Warren Weaver (1949), The Mathematical Theory of Communication, Urbana: University of Illinois Press. Defines and justifies entropy H(P) = -sum_i P(i) log P(i) as a measure of information. Proves the discrete capacity theorem: channel capacity = entropy. Continuous capacity theorem: capacity = bandwidth x log(1 + signal/noise).
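
A minimal sketch computing H(P) for a discrete distribution (base-2 logs, so the result is in bits); the example distribution is illustrative:

    import math

    def entropy(p):
        # H(P) = -sum_i P(i) log2 P(i), in bits.
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    print(entropy([0.5, 0.25, 0.25]))   # 1.5 bits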

Shannon, Claude E. (1950), "Prediction and Entropy of Printed English", Bell Sys. Tech. J (3) p. 50-64. The entropy of English is 0.6-1.3 bits/character, determined by human character-guessing and ranking tests. From 100 samples of "Jefferson the Virginian" by Dumas Malone, using a 27-character alphabet and a context of 100 characters, the distribution of guesses needed for the next character was 80, 7, 0, 3, 4, 2, 1, 0, 1, 0, 0, 1, 0, 1 (80% on the 1st guess, 7% on the 2nd, ...).

Shieber, Stuart M., (1994), "Lessons from a Restricted Turing Test", http://xxx.lanl.gov/abs/cmp-lg/9404002 (Oct 15, 1998). Criticizes the first (1990) Loebner competition because it restricted the topic of discussion, and because the goal of AI is unattainable using current technology.

Simons, M., H. Ney, S. C. Martin (1997), "Distant bigram language modelling using maximum entropy", IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, vol. 2, 787-790. Perplexity of WSJ (4,472,827 words, 20,000 word vocabulary), 325,000 word test set: unigram+bigram=215.2 (5 iterations of GIS), 211.8 (10). Adding distant bigrams=169.5 (5). One iteration takes 190 CPU hours on an SGI/R4600. ME improves over discounting with SU only with distant bigrams. Linear interpolation of ME with trigram discounting: perplexity 144.0.

Slagle, James R. (1963), "A Heuristic Program that Solves Symbolic Integration Problems in Freshman Calculus", in Computers and Thought, E. A. Feigenbaum and J. Feldman eds, New York: McGraw Hill, pp. 191-203. Solves problems such as the integral of x^4/(1-x^2)^(5/2) dx = arcsin x + (1/3)tan^3(arcsin x) - tan(arcsin x). Ran in LISP on an IBM-7090 with 32K words of memory. A 1961 dissertation at MIT.

Standing, L. (1973), "Learning 10,000 Pictures", Quarterly Journal of Experimental Psychology (25) pp. 207-222. Subjects memorized S=20-10,000 pictures, 20-1000 vivid pictures, or 20-1000 written words (random from 25,000 most common) at 1 per 5.6 s, 2000/day max. Recall after 2 days: pictures .93 log10 S + .08, vivid .97 log S + .04, words .92 log S - .01. In another test, subjects described pictures using 6 words avg during recall (13% chance of matching wrong 1 of 200).

Sparck-Jones, Karen, and Peter Willett (1997), Readings in Information Retrieval, San Francisco: Morgan Kaufmann.

Stolcke, Andreas (1997), "Linguistic Knowledge and Empirical Methods in Speech Recognition", AI Magazine 18(4), 25-31. The best methods use word digram and trigram frequencies. Syntactic parsing does not help.

Task             Vocabulary  Style        Channel  Acoustics  Word error rate %
ATIS             2,000       spontaneous  high BW  clean       2.1
NA bus. news     60,000      read         high BW  clean       6.6
Broadcast news   60,000      various      various  various    27.1
Switchboard      23,000      spontaneous  phone    clean      35.1

Stone, Harold S. (1993), High Performance Computer Architecture, 3rd Ed., New York: McGraw Hill. Memory cache design is based on a Zipf distribution of memory accesses.

Storer, James A. (1988), Data Compression: Methods and Theory, Rockville MD: Computer Science Press. Coding, on-line and off-line methods, English statistics.

Sumita, Eiichiro, and Hitoshi Iida (1992), "Example-Based NLP Techniques - A Case Study of Machine Translation", in Statistically-Based Natural Language Programming Techniques, Papers from the 1992 AAAI Workshop, Technical Report W-92-01, Menlo Park CA: AAAI Press, pp. 81-88. English-Japanese translation using examples.

Sutton, Ian (1998a), boa 0.58 beta, http://webhome.idirect.com/~isutton/ (Oct 5, 1998). Data compression and archiving program. Calgary corpus (weighted): 1.9250 bpc using default options (7M RAM), 1.9041 bpc using -m15 -s (15M RAM, compress across files), 1.9134 bpc using -m15.

Sutton, Ian (1998b) Personal communication. Boa is a variation of PPMZ.

Tan, C. P., (1981), "On the Entropy of the Malay Language", IEEE Transactions on Information Theory (27)3 (May) pp. 383-384. Entropy of Malay text (26 letters plus space) is 1.3 bits/character using Cover and King's committee gambling technique.

Taylor, Malcolm (1998), RKIVE v1.91 beta 1, http://www.geocities.com/SiliconValley/Peaks/9463/rkive.html (Nov. 10, 1998).

Teahan, W. J., John G. Cleary (1996), "The entropy of English using PPM-based models", IEEE Proc. Data Compression Conference, 53-62. Compression of Jefferson the Virginian by Dumas Malone in 1948, 27 char, PPM-5, bigram encoding, trained on 6 books by same author (5,063,237 chars) = 1.488 bpc. Bigram encoding (104 most frequent bigrams without spaces) improves compression 6.84%.

Teahan, W. J., John G. Cleary (1997), "Models of English text", IEEE Proc. Data Compression Conference, 12-21. Demonstrates PPM-5 model applications to cryptography (simple substitution cipher), spelling correction and speech recognition. Drop in compressed output (bpc) nearly linear in log input size: 10K=2.7bpc, 100K=2.3bpc, 1M=2.1bpc, 10M=1.9bpc. Tagged text compression: tags compressed for free. Word based compression better than character: LOB corpus (5636660, 27 char alphabet): PPM5=1.860bpc, WW=1.783, WTW+TTWT=1.782. WSJ (15398849, 27)=1.602, 1.539, 1.547.
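
A minimal sketch of the core idea behind the PPM-style character models in these papers: count the characters that follow each context of length up to k and predict from the longest matching context. Real PPM adds escape probabilities and arithmetic coding, both omitted here; the training text is a placeholder:

    from collections import defaultdict, Counter

    def train(text, k=5):
        # contexts[c] counts the characters observed after context c (len(c) <= k).
        contexts = defaultdict(Counter)
        for i, ch in enumerate(text):
            for order in range(min(k, i) + 1):
                contexts[text[i - order:i]][ch] += 1
        return contexts

    def predict(contexts, history, k=5):
        # Return the character distribution from the longest context seen in training.
        for order in range(min(k, len(history)), -1, -1):
            ctx = history[len(history) - order:]
            if ctx in contexts:
                return contexts[ctx]
        return Counter()

    model = train("the cat sat on the mat. the cat ran.")
    print(predict(model, "the c").most_common(3))   # 'a' is the most likely next char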

Teahan, W. J., John G. Cleary (1998), "Tag based models of English text", Proc. Data Compression Conference, 289-298.

Teahan, W. J., S. Inglis, J. G. Cleary, G. Holmes (1998), "Correcting English text using PPM models", IEEE Proc. Data Compression Conference, 289-298. A PPM model improves post-OCR accuracy from 96.3% to 96.9%.

Thompson, G. Brian, William E. Tunmer, Tom Nicholson (1993), Reading Acquisition Processes, Clevedon UK: Multilingual Matters Ltd.

Thorne, James, P. Bratley, and Hamish Dewar (1968), "The syntactic analysis of English by machine", in Donald Michie (ed.), Machine Intelligence 3, New York: Elsevier.

Turing, A. M., (1950) "Computing Machinery and Intelligence", Mind, 59:433-460, reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman eds, New York: McGraw Hill, 1963. (See also http://www.loebner.net/Prizef/TuringArticle.html (Dec. 8, 1998)). Turing proposed the "imitation game", which defines artificial intelligence as a machine that can win the game by being indistinguishable from an average human to an average interrogator through text-only communication. He predicted that in 50 years a machine with 10^9 bits of memory would be incorrectly judged human 30% of the time after 5 minutes of conversation. He gives the following example:

Q: Please write me a sonnet on the subject of the Forth Bridge.
A: Count me out on this one. I never could write poetry.
Q: Add 34957 to 70764.
A: (Pause about 30 seconds and then give as answer) 105621.
Q: Do you play chess?
A: Yes.
Q: I have K at my K1, and no other pieces.  You have only K at K6 and R at R1.  It is your move.  What do you play?
A: (After a pause of 15 seconds) R-R8 mate.

Uhr, Leonard, Charles Vossler (1963), "A Pattern-Recognition Program that Generates, Evaluates, and Adjusts its own Operators", Computers and Thought, E. A. Feigenbaum and J. Feldman eds, New York: McGraw Hill, pp. 251-268. An adaptive pattern recognition program for single characters on a 20x20 grid. 5x5 patterns of 1, 0, and blank are selected as characteristics of training images and are weighted according to their usefulness in predicting the training images. The program was 96% successful in recognizing hand-printed capital letters by different people (humans do 97%). It recognized 5 spoken words (zero-four) by different speakers, and also segmented and recognized words and phonemes in a simple sentence "Did Dad say before" by different people with 100% success. The authors compare their methods to neural nets based on physiological data about the retina. Letters can be recognized down to 20x20 rods but not at all below 10x10. Human measurements of lines, angles, etc. have 2-3 bits of accuracy. Evidence of excitatory and inhibitory synapses dates to Hartline, H. K., 1938, "The response of single optic nerve fibers of the vertebrate eye to illumination of the retina", American Journal of Physiology 121:400-415.

Valiant, L. G. (1984), "A Theory of the Learnable", Communications of the ACM, 27(11):1134-1142. Learnability of Boolean functions. A function F is learnable if it can be deduced, in time polynomial in h, the input, and the representation, from positive examples drawn with distribution D or from calls to an oracle that gives F(x) for any x in {0,1,*}*, with P(false negative) < 1/h. Encrypting functions are not learnable (unproven). k-CNF, DNF, and mu-expressions (each Boolean variable appears once) are learnable. k-CNF: learnable from L(h, (2t)^(k+1)) examples, where L(h, S) <= 2h(S + ln h). DNF: L(h, d) examples and dt oracle calls, d = degree (number of monomials), t = number of variables. mu-expressions: O(t^3) oracle calls.
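
A small numeric illustration of the example bound quoted above, L(h, S) <= 2h(S + ln h), applied to k-CNF with S = (2t)^(k+1); the values of t, k, and h are arbitrary:

    import math

    def L_bound(h, S):
        # Upper bound on the number of examples needed: L(h, S) <= 2h(S + ln h).
        return 2 * h * (S + math.log(h))

    t, k, h = 10, 2, 20                      # arbitrary: t variables, clause size k, parameter h
    print(L_bound(h, (2 * t) ** (k + 1)))    # example budget for k-CNF under this bound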

Waelbroeck, H., and F. Zertuche (1995), "Discrete Chaos", J. Nonlinear Science, Sept. 1995, and http://luthien.nuclecu.unam.mx/~nncp/chaos.html (May 20, 1997). Defines chaos for discrete systems.

Wardhaugh, Ronald (1972), Introduction to Linguistics, New York: McGraw-Hill. Language is a system of arbitrary vocal symbols used for human communication.

Warmuth, M., Y. Freund, R. E. Schapire, and Y. Singer (1995), "Using and Combining Predictors that Specialize", Twenty-ninth Annual ACM Symposium on Theory of Computing, http://www.cse.ucsc.edu/~manfred/pubs/sleeping.ps (Mar. 19, 1999).

Waterman, Michael S. (1995), Introduction to Computational Biology, London: Chapman & Hall. Human DNA has 3-billion base-pairs of 4 possible bases. About 5% encodes protein, using a code of 3 base pairs to one of 20 amino acids.

Whitten, David (1994) "The Unofficial, Unauthorized CYC Frequently Asked Questions Information Sheet.", http://www.mcs.net/~jorn/html/ai/cycfaq.html (May 29, 1998). CYC was developed from 1984 to 1994. Its "sea of assertions" passed one million rules in 1990. Its developers hoped in 1994 that CYC would be running on every computer in 5 years.

Williams, Martyn (1996), "Review - Tsunami, Typhoon for Windows", http://www.nb-pacifica.com/headline/reviewtsunamityphoonf_608.shtml (May 29, 1998). English-Japanese translator that produces draft-quality translations.

Winston, Patrick Henry (1984), Artificial Intelligence, 2'nd Ed., Reading MA: Addison Wesley.

Witten, Ian H., Timothy C. Bell (1991), "The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression", IEEE Trans. on Information Theory, 37(4): 1085-1094. In text compression, if a context occurs n times with outcome distribution (assumed binomial for each outcome) having t_i outcomes occurring exactly i times, then the probability of a novel outcome is p = t_1/n - t_2/n^2 + t_3/n^3 - ... (method P), or approximately t_1/n (method X). These are shown to be better than method C, used in PPMC: p = r/(r + n), where r = sum_i t_i is the number of distinct outcomes, for the text files (but not the binary files) in the Calgary corpus. For the cases t_1 = 0 (giving p = 0) or t_1 = n (giving p = 1), the best strategy is to fall back to method C; the combination is called method XC.
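
A minimal sketch of the escape-probability estimates described above (methods X, P, C, and the XC fallback) using a toy count table; the variable names and example counts are mine:

    from collections import Counter

    def escape_probability(counts, method="XC"):
        # counts: mapping symbol -> number of times seen in this context.
        n = sum(counts.values())                  # total observations
        r = len(counts)                           # distinct symbols seen
        t = Counter(counts.values())              # t[i] = symbols seen exactly i times
        if method == "C":                         # PPMC: p = r / (r + n)
            return r / (r + n)
        if method == "X":                         # p = t1 / n
            return t[1] / n
        if method == "P":                         # p = t1/n - t2/n^2 + t3/n^3 - ...
            return sum((-1) ** (i + 1) * t[i] / n ** i for i in range(1, n + 1))
        if method == "XC":                        # fall back to C when X degenerates
            return t[1] / n if 0 < t[1] < n else r / (r + n)

    ctx = {"a": 3, "b": 1, "c": 1}
    print({m: round(escape_probability(ctx, m), 3) for m in ("X", "P", "C", "XC")})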

Woods, William A. (1970), "Transition network grammars for natural language analysis", Communications of the ACM, 13(10):591-606. Augmented transition networks.

Word (1994), version 6.0a, Microsoft Corp. Includes a spell-checker and grammar-checker.

Wordnet http://www.cogsci.princeton.edu/~wn/ (Feb 5, 1999).

Wright, J. H., G. I. F. Jones, H. Lloyd-Thomas (1994), "A robust language model incorporating a substring parser and extended n-grams", IEEE ICASSP, Vol 1, 361-364.

Yang, Kae-Cheng, Tai-Hsuan Ho, Lee-Feng Chien, Lin-Shan Lee (1998), "Statistics-based segment pattern lexicon -- a new direction for Chinese language modeling", Proc. IEEE ICASSP, Vol. 1, 169-172. Uses the forward-backward algorithm to find words in Chinese text.

Yeap, Wai Kiang (1997), "Emperor AI, Where is Your New Mind", AI Magazine, 18(4), 137-144. Two approaches to AI: Ai = CS approach; aI = reverse engineering (psychology) approach.

Zipf, George Kingsley (1935), The Psycho-Biology of Language, an Introduction to Dynamic Philology, M.I.T. Press. In all natural languages, the r'th most common word has frequency p_r approximately c/r, where c = 0.1 in English.
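
A minimal sketch checking the Zipf relation p_r = c/r on a word list: if the law holds, the product r * p_r stays roughly constant. The corpus file name is only a placeholder; any large plain-text file can be substituted:

    from collections import Counter

    def zipf_table(words, top=10):
        # Print rank r, relative frequency p_r, and r * p_r, which Zipf's law
        # predicts is roughly constant (c, about 0.1 for English).
        counts = Counter(words)
        total = len(words)
        for r, (word, c) in enumerate(counts.most_common(top), start=1):
            p_r = c / total
            print(r, word, round(p_r, 4), round(r * p_r, 4))

    # Placeholder path; replace with a real corpus file.
    zipf_table(open("corpus.txt").read().lower().split())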