Model selection
Kerby Shedden
Department of Statistics, University of Michigan

November 8, 2021

1 / 24

Background

Suppose we observe data y and are considering a family of models fθ
that may approximately describe how y was generated.
If we are mainly interested in the individual model parameters, we will
focus on how close θ̂j is to θj (e.g. in terms of its bias, variance, MSE).
Alternatively, our focus may be on the probability distribution f that is
the data-generating model for y. In this case, we are more interested in
whether fθ̂ approximates f than in whether θ̂j approximates θj.

2 / 24

Background

The term model selection is used to describe statistical estimation in a
context where the focus is more on the structure of the fitted model
than on the individual parameters.
Model selection is a form of statistical inference, but in model selection,
we are often contrasting two families of models {fθ } and {gη }. Often,
the dimensions of η and θ will be different.
In contrast, when we do parameter estimation we are selecting a model
from within one family {fθ }, where θ has a fixed dimension and usually
lies in a simple domain like Rp .

3 / 24

Model complexity and parsimony
The discrepancy between fθ and fθ̂ is strongly influenced by how complex
a model we decide to fit.
Suppose we have p = 30 covariates and n = 50 observations. We could
consider the following two alternatives:
1. We could fit a model using all of the covariates. In this case, θ̂ is
unbiased for θ (in a linear model fit using OLS). But θ̂ has very high
variance.
2. We could fit a model using only the five strongest effects. In this
case, θ̂ will be biased for θ, but it will have lower variance
(compared to the estimate including all covariates).
If our goal is for fθ̂ and fθ to be close, either approach 1 or approach 2
could perform better, depending on the circumstances.
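A minimal simulation sketch of this tradeoff; the design, the coefficient values, and the rule of keeping the five covariates with the largest estimated coefficients are all illustrative assumptions, not part of the slides.

import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 30, 1.0
# Only a few covariates have sizable effects; the rest are zero.
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.8, 0.5]

def fit_mean(X, y, cols):
    """OLS fit of E[y] using the given columns; returns fitted values."""
    Xs = X[:, cols]
    bhat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return Xs @ bhat

err_full, err_small = [], []
for _ in range(500):
    X = rng.normal(size=(n, p))
    Ey = X @ beta
    y = Ey + sigma * rng.normal(size=n)
    # 1. use all covariates (unbiased but high variance)
    err_full.append(np.sum((fit_mean(X, y, np.arange(p)) - Ey) ** 2))
    # 2. keep only the five covariates with the largest estimated effects
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    keep = np.argsort(-np.abs(bhat))[:5]
    err_small.append(np.sum((fit_mean(X, y, keep) - Ey) ** 2))

print("mean squared error of fitted mean, full model :", np.mean(err_full))
print("mean squared error of fitted mean, 5 covariates:", np.mean(err_small))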

4 / 24

Assessing model fit

A more complex model will usually fit the observed data better than a more
parsimonious (simpler) model, even when the extra complexity does not
reflect real structure in the population. This phenomenon is called
overfitting.
Due to overfitting, we cannot simply compare models in terms of how
well they fit the data (e.g. in terms of the residual sum of squares, or the
height of the likelihood). The more complex model will always appear to
be better if we do this.
To overcome this, most model selection procedures balance a measure of
a model’s fit with a measure of its complexity. To justify selecting the
more complex model, it must fit the data substantially better than the
simpler model.

5 / 24

Fit and parsimony
Example: The purple, green, and blue curves below are estimates of
E[y|x]. The green curve fits the data better but is more complex. Which
estimate is closest to the truth?

[Figure: scatterplot of Y against X with three candidate estimates of E[y|x] shown as purple, green, and blue curves.]
6 / 24

F-tests

Suppose we are comparing two nested families of models F1 ⊂ F2 , both
of which are linear subspaces. An F-test can be used to select between
F1 and F2 .
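A small sketch of such an F-test, computed by hand on simulated data; the data, the particular nested designs, and the 0.05 cutoff are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x[:, 0]])   # smaller family F1
X2 = np.column_stack([np.ones(n), x])         # larger family F2 (F1 is a subspace of F2)

def rss(X, y):
    """Residual sum of squares from an OLS fit."""
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ bhat) ** 2)

rss1, rss2 = rss(X1, y), rss(X2, y)
df1 = X2.shape[1] - X1.shape[1]      # number of added columns
df2 = n - X2.shape[1]                # residual degrees of freedom under F2
F = ((rss1 - rss2) / df1) / (rss2 / df2)
pval = stats.f.sf(F, df1, df2)
print(F, pval)   # a large p-value suggests the extra columns are not needed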

7 / 24

Mallows’ Cp
Suppose we postulate the model

y = Xβ + ε,

but in fact E[y] ∉ col(X). We'll continue to assume that the
homoscedastic variance structure cov(ε|X) = σ²I holds. Denote this
model as M.

Denote the error in estimating E[y] under model M as

DM = ŷM − E[y],

where ŷM is the projection of y onto col(X).

8 / 24

Mallows’ Cp
Write

E[y] = θX + θX⊥,

where θX ∈ col(X) and θX⊥ ∈ col(X)⊥. Since y = θX + θX⊥ + ε, it follows
that ŷ = θX + εX, where εX is the projection of ε onto col(X).

Therefore

E[DM DM′] = E[(ŷM − Ey)(ŷM − Ey)′]
          = E[(εX − θX⊥)(εX − θX⊥)′]
          = θX⊥ θX⊥′ + σ² PX,

where PX is the projection matrix onto col(X).

9 / 24

Mallows’ Cp
Taking the trace of both sides yields

E‖DM‖² = ‖θX⊥‖² + (p + 1)σ²,

where p + 1 is the rank of PX.

Mallows' Cp aims to estimate

Cp∗ = E‖DM‖²/σ² = ‖θX⊥‖²/σ² + p + 1.

The model that minimizes Cp∗ is the closest to the true model in this
particular sense.

10 / 24

Mallows’ Cp
We need an estimate of Cp∗ .
To begin, we can derive the expected value of

σ̂² = ‖y − ŷM‖²/(n − p − 1)

in the case where E[y] is not necessarily in col(X):

E σ̂² = E[y′(I − P)y/(n − p − 1)]
     = E[(θX + θX⊥ + ε)′(I − P)(θX + θX⊥ + ε)/(n − p − 1)]
     = E[tr((I − P)(θX⊥ + ε)(θX⊥ + ε)′)/(n − p − 1)]
     = ‖θX⊥‖²/(n − p − 1) + σ².

11 / 24

Mallows’ Cp

Now suppose we have an unbiased estimate of σ 2 . This could come from
a regression against a much larger design matrix that is thought to
contain E [y ]. Call this estimate σ ∗2 . Then
(n − p − 1)E(σ̂² − σ∗²) = ‖θX⊥‖².

Therefore we can estimate Cp∗ using

Cp = (n − p − 1)(σ̂² − σ∗²)/σ∗² + p + 1.
The model M with the smallest value of Cp is selected.
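A rough Python sketch of this procedure on simulated data; the data are illustrative, and σ∗² is taken from the largest candidate design, which is assumed to contain E[y].

import numpy as np

rng = np.random.default_rng(2)
n, p_full = 100, 10
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, p_full))])
beta = np.zeros(p_full + 1)
beta[:4] = [1.0, 2.0, -1.0, 0.5]       # only the first few columns matter
y = X_full @ beta + rng.normal(size=n)

def rss_and_rank(X, y):
    bhat, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ bhat) ** 2), rank

# sigma*^2 from the largest model, assumed to contain E[y]
rss_full, rank_full = rss_and_rank(X_full, y)
sigma_star2 = rss_full / (n - rank_full)

# candidate models: intercept plus the first k covariates
for k in range(1, p_full + 1):
    X = X_full[:, : k + 1]
    rss, rank = rss_and_rank(X, y)
    sigma_hat2 = rss / (n - rank)
    cp = (n - rank) * (sigma_hat2 - sigma_star2) / sigma_star2 + rank
    print(k, round(cp, 2))             # select the k with the smallest Cp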

12 / 24

AIC

Suppose we are selecting from a family of linear models with design
matrices XM , for M ∈ M.
For each XM , the model parameters (slopes and error variance) can be
estimated using least squares (and method of moments for the error
variance) as a vector θ̂M . This allows us to construct a predictive density:
p(y ; XM , θ̂M ).

13 / 24

AIC

The Kullback-Leibler divergence (“KL-divergence”) between the
predictive density and the actual density p(y ) is

Ey log[p(y)/p(y; XM, θ̂M)] = ∫ log[p(y)/p(y; XM, θ̂M)] p(y) dy ≥ 0.

Here we are considering θ̂M to be fixed.
Small values of the KL divergence indicate that the predictive density is
close to the actual density.

14 / 24

AIC
Akaike’s Information Criterion (AIC) aims to estimate the KL divergence
between a candidate model and the data-generating model p(y )
unbiasedly. We can then select the candidate model that has the smallest
estimated KL-divergence relative to p(y ).
The KL-divergence can be written
Ey log(p(y)) − Ey log(p(y; XM, θ̂M)).

We can ignore the first term since it doesn't depend on M. Thus it will
be equivalent to select the model that maximizes the predictive log
likelihood:

Ey log(p(y; XM, θ̂M)) = ∫ log(p(y; XM, θ̂M)) p(y) dy.
15 / 24

AIC

The predictive log likelihood is the expected value of
log p(y ∗ ; XM , θ̂M (y ))
taken over the joint distribution of y and y ∗ , which are independent
copies of the data.
The parameter estimates θ̂M = θ̂M (y ) are determined from y , which you
can think of as a “training set”, and the log-likelihood is evaluated at y ∗ ,
using θ̂M (y ) to set the parameters.

16 / 24

AIC

Since we don’t have both y and y ∗ , it is natural to use the plug-in
estimator of the predictive log likelihood:
log p(y ; XM , θ̂M (y ))
But this is biased upward, due to overfitting.
Surprisingly, this upward bias can be shown to be approximately equal to
the dimension of M, which is p + 1 for regression (p + 2 if you count σ 2 ).
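A small Monte Carlo sketch of this upward bias on simulated Gaussian data; the design, coefficients, and sample size are illustrative assumptions. Since σ² is estimated (by maximum likelihood) along with the p + 1 regression coefficients, the average gap between the plug-in and predictive log likelihoods should come out close to p + 2.

import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, nrep = 200, 5, 1.0, 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = rng.normal(size=p + 1)

def gauss_loglik(z, mean, s2):
    """Gaussian log-likelihood of z with the given mean vector and variance."""
    return -0.5 * len(z) * np.log(2 * np.pi * s2) - np.sum((z - mean) ** 2) / (2 * s2)

gaps = []
for _ in range(nrep):
    y = X @ beta + sigma * rng.normal(size=n)       # "training" data
    ystar = X @ beta + sigma * rng.normal(size=n)   # independent copy
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2_mle = np.sum((y - X @ bhat) ** 2) / n        # MLE of the error variance
    gaps.append(gauss_loglik(y, X @ bhat, s2_mle) -
                gauss_loglik(ystar, X @ bhat, s2_mle))

print(np.mean(gaps))   # close to p + 2 = 7 here (coefficients plus sigma^2)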

17 / 24

AIC

Thus we may take

log p(y; XM, θ̂M(y)) − (p + 1)

as a model selection statistic to be maximized (commonly this is
multiplied by −2, in which case it is to be minimized). This quantity is
the AIC.

18 / 24

AIC in linear models
To apply the AIC to linear models, we assume the error values ε are
multivariate normal, so the log-likelihood becomes

−(n/2) log σ² − ‖y − Xβ‖²/(2σ²).

If we profile the likelihood over β and σ², we get −n log(σ̂²)/2 (plus a
constant), where σ̂² = ‖y − Xβ̂‖²/n is the maximum likelihood estimate of
the error variance. Therefore maximizing the AIC is equivalent to
maximizing

−n log(σ̂²) − 2(p + 1),

where the first term measures fit and the second penalizes complexity.

The conventional form of the AIC is a scaled version of the expression
above, which is to be minimized:

AIC = n log(σ̂²) + 2(p + 1).
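A short Python sketch of AIC-based selection among nested linear models, using the conventional form above with σ̂² taken to be the maximum likelihood estimate ‖y − Xβ̂‖²/n; the simulated data are illustrative.

import numpy as np

rng = np.random.default_rng(4)
n, p_max = 100, 8
X_all = np.column_stack([np.ones(n), rng.normal(size=(n, p_max))])
y = X_all[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

def aic_linear(X, y):
    """AIC = n log(sigma_hat^2) + 2(p + 1), with sigma_hat^2 the MLE."""
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2_hat = np.sum((y - X @ bhat) ** 2) / len(y)
    return len(y) * np.log(sigma2_hat) + 2 * X.shape[1]

best_k, best_aic = None, np.inf
for k in range(1, p_max + 1):          # intercept plus the first k covariates
    a = aic_linear(X_all[:, : k + 1], y)
    print(k, round(a, 1))
    if a < best_aic:
        best_k, best_aic = k, a
print("selected number of covariates:", best_k)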
19 / 24

AIC and likelihood ratios
The AIC does not require the models being compared to be nested, but
let’s consider this special case.
Let L1 and L0 be the maximized log likelihoods for two nested models (so
L1 ≥ L0 ).
We know that, when the smaller model is adequate, 2(L1 − L0) approximately
follows a χ² distribution with q degrees of freedom, where q is the
difference in the number of parameters between the two models.
If q = 1, we select the larger model if L1 − L0 ≥ 1.92 (the usual
likelihood ratio test at the 0.05 type I error rate).
If the additional parameters are not needed, then E [L1 − L0 ] = 0.5 (so 0.5
is the lowest possible threshold for L1 − L0 that could ever be considered).
Under AIC (still with q = 1), we select the larger model if L1 − L0 > 1,
which is less strict than the likelihood ratio test.

20 / 24

Bayesian Information Criterion (BIC)
A different criterion that we will not derive here is the “Bayesian
information criterion” (BIC). The BIC defines complexity differently from
the AIC:
−n log(σ̂²) − (p + 1) log(n),

where, as before, the first term measures fit and the second penalizes
complexity.

The conventional definition of the BIC is n log(σ̂ 2 ) + (p + 1) log(n). The
best-fitting model under the BIC minimizes this quantity.
The BIC complexity penalty, (p + 1) log(n), exceeds the corresponding AIC
penalty, 2(p + 1), whenever log(n) > 2 (i.e. for n ≥ 8). Thus, except in
very small samples, the BIC favors simpler models than the AIC.
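A brief sketch comparing the two criteria on simulated (illustrative) data; with n = 200, log(n) ≈ 5.3, so each added column is penalized noticeably more heavily by the BIC.

import numpy as np

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 6))])
y = X[:, :2] @ np.array([1.0, 1.5]) + rng.normal(size=n)

for k in range(1, 7):                  # intercept plus the first k covariates
    Xk = X[:, : k + 1]
    bhat, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    sigma2_hat = np.sum((y - Xk @ bhat) ** 2) / n
    aic = n * np.log(sigma2_hat) + 2 * (k + 1)
    bic = n * np.log(sigma2_hat) + (k + 1) * np.log(n)
    print(k, round(aic, 1), round(bic, 1))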

21 / 24

Model selection based on prediction
Many approaches to model selection attempt to identify the model that
predicts best on independent data.
If independent “training” and “test” sets are available, for each model M
the parameters of M can be fit using the training data, yielding θ̂M .
Predictions can then be made on the test set,

ŷM,test = XM,test θ̂M,

and the quality of prediction can be assessed, for example, using the
mean squared prediction error (MSPE):

‖ytest − ŷM,test‖²/ntest,

where ntest is the number of test observations.
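A minimal sketch of train/test evaluation by MSPE, comparing nested candidate models on simulated data; the sample sizes, design, and coefficients are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(6)
n_train, n_test, p = 80, 40, 5
X_train = np.column_stack([np.ones(n_train), rng.normal(size=(n_train, p))])
X_test = np.column_stack([np.ones(n_test), rng.normal(size=(n_test, p))])
beta = np.array([1.0, 2.0, 0.0, 0.0, 0.0, 0.0])   # only one covariate matters
y_train = X_train @ beta + rng.normal(size=n_train)
y_test = X_test @ beta + rng.normal(size=n_test)

for k in range(1, p + 1):              # candidate models: first k covariates
    cols = slice(0, k + 1)
    bhat, *_ = np.linalg.lstsq(X_train[:, cols], y_train, rcond=None)
    yhat_test = X_test[:, cols] @ bhat
    mspe = np.mean((y_test - yhat_test) ** 2)
    print(k, round(mspe, 3))           # select the k with the smallest MSPE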

22 / 24

Cross-validation
Separate training and test sets are usually not available. Cross-validation
is a direct method for obtaining approximately unbiased estimates of the
prediction mean squared error when only training data are available.
In k-fold cross validation, the data are partitioned into k disjoint subsets
(“folds”), denoted S1 ∪ · · · ∪ Sk = {1, . . . , n}.
Let β̂j be the fitted coefficients omitting the j-th of these subsets, and let

CVk = n⁻¹ Σ_{j=1}^{k} Σ_{i∈Sj} (yi − Xi′ β̂j)².

This is an approximately unbiased (but potentially very imprecise)
estimate of the MSPE on a test set.
The special case of leave one out cross validation (LOOCV) is when
k = n.
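A simple Python sketch of k-fold cross-validation for OLS; the simulated data and the random fold assignment are illustrative, and setting k = n gives LOOCV.

import numpy as np

def cv_k(X, y, k, rng):
    """k-fold cross-validation estimate of the prediction MSE for OLS."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)
    sse = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        bhat, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        sse += np.sum((y[test_idx] - X[test_idx] @ bhat) ** 2)
    return sse / n

rng = np.random.default_rng(7)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = X[:, :2] @ np.array([1.0, 2.0]) + rng.normal(size=n)
print(cv_k(X, y, k=5, rng=rng), cv_k(X, y, k=n, rng=rng))   # k = n gives LOOCV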
23 / 24

Cross-validation

For OLS regression, CVn (which is n⁻¹ times the “prediction residual error
sum of squares”, or PRESS) can be computed rapidly as

CVn = n⁻¹ Σ_i Ri²/(1 − Pii)²,

where Ri = yi − ŷi is the i-th residual and Pii is the i-th diagonal
element of the projection matrix P.

The generalized cross-validation (GCV) criterion replaces Pii with the
average diagonal element of P, which is tr(P)/n:

GCVn = n⁻¹ Σ_i Ri²/(1 − tr(P)/n)² = n⁻¹ ‖y − ŷ‖²/(1 − tr(P)/n)².
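A short sketch checking these formulas numerically on simulated (illustrative) data: CVn computed from the leverages matches brute-force LOOCV, while GCVn uses the average leverage instead.

import numpy as np

rng = np.random.default_rng(8)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X[:, :2] @ np.array([1.0, 2.0]) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)       # projection ("hat") matrix
R = y - P @ y                                # residuals
cv_n = np.mean((R / (1 - np.diag(P))) ** 2)            # CV_n (PRESS / n), no refitting
gcv_n = np.mean(R ** 2) / (1 - np.trace(P) / n) ** 2   # GCV_n

# brute-force LOOCV to confirm the shortcut formula
loo = []
for i in range(n):
    idx = np.delete(np.arange(n), i)
    bhat, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    loo.append((y[i] - X[i] @ bhat) ** 2)

print(cv_n, np.mean(loo), gcv_n)   # first two agree up to rounding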

24 / 24