Evaluating the Human Language Model
What is the next letter in Roses are r_?
Humans apply a language model:
- Character: re is more common than rz
- Lexical: "r" alone is not a word (so the next character can't be a space)
- Semantic: red, fragrant, flower, ...
- Syntactic: probably a noun or adjective
Shannon (1950)
- Count number of guesses until correct (r = 1 to 27)
- Obtain probability distribution P(r) (P(1) = 0.8, P(2) = 0.07,
... P(27) = 0.001)
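- Bounds on the entropy H, summing over guess ranks r:
  $$\sum_r r\,\bigl(P(r) - P(r+1)\bigr) \log_2 r \;\le\; H \;\le\; \sum_r P(r) \log_2 \frac{1}{P(r)}$$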
- Lower bound = 0.6 bpc (bits per character), assumes a series of uniform distributions
- Upper bound = 1.3 bpc, assumes a single distribution P for all
characters
- Problem: Broad range due to a large skew in P
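A minimal sketch of how the two bounds could be computed from a guess-rank distribution. The function name shannon_bounds and the example distribution are illustrative assumptions, not Shannon's data. The lower bound is written with the factor r, as in Shannon's paper, and P(r+1) is taken to be 0 past the last rank.

```python
import math

def shannon_bounds(p):
    """Return (lower, upper) entropy bounds in bits per character.

    p[r-1] is P(r), the probability that the r-th guess is the correct one.
    Assumes p is sorted in non-increasing order and sums to 1.
    """
    n = len(p)
    # Upper bound: entropy of the guess-rank distribution itself.
    upper = sum(q * math.log2(1.0 / q) for q in p if q > 0)
    # Lower bound: sum over r of r * (P(r) - P(r+1)) * log2(r).
    lower = 0.0
    for r in range(1, n + 1):
        p_next = p[r] if r < n else 0.0  # P(n+1) is treated as 0
        lower += r * (p[r - 1] - p_next) * math.log2(r)
    return lower, upper

# Hypothetical skewed distribution over three guess ranks (not measured data).
lower, upper = shannon_bounds([0.8, 0.1, 0.1])
print(f"{lower:.3f} <= H <= {upper:.3f}")
```

For a uniform distribution the two bounds coincide; the gap opens up as P becomes more skewed, which is the problem noted above.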
Gambling - Cover and King (1978)
- Subjects assigned P(a), P(b), ..., P(z) directly
in a gambling game
- Individual results: 1.3 - 1.7 bpc
- Combined: 1.3 bpc
- Problem: People do not bet rationally; they
tend to overestimate P for unlikely events (lotteries,
insurance), which increases the estimate of H
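A rough sketch of why overestimating unlikely events raises the estimate of H. It assumes the estimate is the average of log2 1/P over the probabilities a subject assigned to the characters that actually occurred; this is a simplification, not necessarily the exact procedure of Cover and King. The function names and the two-outcome distributions are hypothetical.

```python
import math

def entropy_estimate(assigned_probs):
    """Average bits per character, given the probability the subject
    assigned to each character that actually occurred."""
    return sum(math.log2(1.0 / p) for p in assigned_probs) / len(assigned_probs)

def expected_bits(true_p, assigned_q):
    """Expected bits per character when outcomes follow true_p
    but the subject bets according to assigned_q."""
    return sum(p * math.log2(1.0 / q)
               for p, q in zip(true_p, assigned_q) if p > 0)

# Hypothetical example: one likely and one unlikely next character.
true_p = [0.99, 0.01]
rational = [0.99, 0.01]  # bets match the true probabilities
skewed = [0.90, 0.10]    # unlikely event overestimated tenfold

print(expected_bits(true_p, rational))  # equals the true entropy
print(expected_bits(true_p, skewed))    # larger, so H is overestimated
```

Any mismatch between assigned and true probabilities can only raise the expected cost, so betting errors bias the estimate of H upward, never downward.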