Character and lexical models
- Most common characters: space, e, t, a, o, n, r, ...
- Bigrams: th, er, in, ...
- Trigrams: the, ing, her, ...
- Words: the, of, and, to, ... (Zipf distribution)
- Word bigrams: of the, in the, ...
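
Frequency lists like the ones above can be gathered by simple counting. The following is a minimal sketch, not the method used to produce these lists: the helper names `char_ngrams` and `word_ngrams`, the lowercase-letters-and-apostrophes tokenizer, and the sample string are all assumptions made for illustration.

```python
import re
from collections import Counter

def char_ngrams(text, n):
    """Count every overlapping run of n characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def word_ngrams(text, n):
    """Count every run of n consecutive words in text.

    Assumption: a word is a run of letters or apostrophes, case-folded.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(" ".join(words[i:i + n])
                   for i in range(len(words) - n + 1))

if __name__ == "__main__":
    # Hypothetical sample; substitute the full text of a book.
    sample = "the cat sat in the hat and the dog sat in the sun"
    print(char_ngrams(sample, 1).most_common(5))   # single characters
    print(char_ngrams(sample, 2).most_common(5))   # character bigrams
    print(word_ngrams(sample, 1).most_common(5))   # words
    print(word_ngrams(sample, 2).most_common(5))   # word bigrams
```

On a short sample the rankings are noisy; the orderings listed above only emerge on a large text.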
Probability that two n-grams (n = 1 to 6) are equal when separated
by t characters in Alice in Wonderland.
- Character models (but not word models) are stationary.
- Cache effect. Words are most likely to repeat after 50-100 characters.
- N-grams tend not to repeat immediately ("a" but not "aa",
"the" but not "the the").
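
The match probability described above can be estimated directly from a text. This is a rough sketch under stated assumptions: `match_probability` is a hypothetical helper, and t is taken to be the offset between the starting positions of the two n-grams, so for t < n the two n-grams overlap. The source does not say how t was measured.

```python
def match_probability(text, n, t):
    """Fraction of positions i where text[i:i+n] == text[i+t:i+t+n].

    Assumption: t is the distance between the starts of the two n-grams.
    Returns 0.0 if the text is too short to hold a single pair.
    """
    last = len(text) - n - t          # last valid starting index
    if last < 0:
        return 0.0
    total = last + 1
    matches = sum(text[i:i + n] == text[i + t:i + t + n]
                  for i in range(total))
    return matches / total

if __name__ == "__main__":
    # Hypothetical sample; substitute the full text of a book.
    sample = "the cat sat in the hat and the dog sat in the sun " * 20
    for n in (1, 2, 3):
        row = [round(match_probability(sample, n, t), 3)
               for t in (1, 2, 5, 10, 50)]
        print(n, row)
```

Plotting the result against t for each n on a real text is one way to check the cache effect and the reluctance of n-grams to repeat at very small t.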