Model Performance vs. Size

If Q is an estimate of a model, P, then the cross entropy, H = Σ_x P(x) log 1/Q(x), is minimized when Q = P (Shannon, 1949).
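
As a numerical check of the statement above, the sketch below evaluates H for one small distribution P and a few estimates Q. The distributions and the function name cross_entropy are hypothetical, chosen only for illustration, and the logarithm is taken base 2 so that H is in bits.

```python
import math

def cross_entropy(p, q):
    """Return H = sum_x P(x) * log2(1 / Q(x)), in bits.

    p and q map each symbol to its probability; symbols with
    P(x) = 0 contribute nothing to the sum and are skipped.
    """
    return sum(px * math.log2(1.0 / q[x]) for x, px in p.items() if px > 0)

# Hypothetical three-symbol source (illustrative values, not from the text).
P = {"a": 0.5, "b": 0.25, "c": 0.25}

# Candidate estimates of P; each one is a valid probability distribution.
estimates = {
    "Q = P": P,
    "uniform": {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3},
    "skewed": {"a": 0.8, "b": 0.1, "c": 0.1},
}

for name, Q in estimates.items():
    print(f"{name:8s} H = {cross_entropy(P, Q):.4f} bits")
```

For this P, the estimate Q = P gives H = 1.5 bits, and both of the other estimates give a larger value.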

H is the expected compression ratio, in bits per symbol for a base-2 logarithm, when Q is used to compress a random sample drawn from distribution P (i.e., text).
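
The link between H and compressed size can be sketched by drawing a random sample from P and summing the ideal code length log2 1/Q(x) of each symbol. This assumes an ideal coder that spends exactly log2 1/Q(x) bits on symbol x, a cost that an arithmetic coder approaches closely. The distributions, the sample size, and the name ideal_code_length are hypothetical.

```python
import math
import random

def ideal_code_length(sample, q):
    """Total bits an ideal coder driven by model q would spend on sample."""
    return sum(math.log2(1.0 / q[x]) for x in sample)

# Hypothetical source P and a deliberately imperfect model Q.
P = {"a": 0.5, "b": 0.25, "c": 0.25}
Q = {"a": 0.8, "b": 0.1, "c": 0.1}

# Draw a reproducible random sample with distribution P.
rng = random.Random(0)
symbols = list(P)
sample = rng.choices(symbols, weights=[P[s] for s in symbols], k=100_000)

# Expected cost per symbol (H) versus the cost measured on the sample.
H = sum(P[x] * math.log2(1.0 / Q[x]) for x in P)
bits_per_symbol = ideal_code_length(sample, Q) / len(sample)

print(f"H (expected)       = {H:.4f} bits/symbol")
print(f"measured on sample = {bits_per_symbol:.4f} bits/symbol")
```

On a sample of this size the measured cost should lie close to H, and the gap shrinks as the sample grows.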

Memory required to store the training set.