Model selection
Kerby Shedden
Department of Statistics, University of Michigan
November 8, 2021

Background

Suppose we observe data y and are considering a family of models f_θ that may approximately describe how y was generated.

If we are mainly interested in the individual model parameters, we will focus on how close θ̂_j is to θ_j (e.g. in terms of its bias, variance, or MSE).

Alternatively, our focus may be on the probability distribution f that is the data-generating model for y. In this case, we are more interested in whether f_θ̂ approximates f than in whether θ̂_j approximates θ_j.

The term "model selection" is used to describe statistical estimation in a context where the focus is more on the structure of the fitted model than on the individual parameters.

Model selection is a form of statistical inference, but in model selection we are often contrasting two families of models {f_θ} and {g_η}. Often, the dimensions of θ and η will be different. In contrast, when we do parameter estimation we are selecting a model from within one family {f_θ}, where θ has a fixed dimension and usually lies in a simple domain like R^p.

Model complexity and parsimony

The discrepancy between f_θ and f_θ̂ is strongly influenced by how complex a model we decide to fit. Suppose we have p = 30 covariates and n = 50 observations. We could consider the following two alternatives:

1. We could fit a model using all of the covariates. In this case, θ̂ is unbiased for θ (in a linear model fit using OLS), but θ̂ has very high variance.

2. We could fit a model using only the five strongest effects. In this case, θ̂ will be biased for θ, but it will have lower variance (compared to the estimate including all covariates).

If our goal is for f_θ̂ and f_θ to be close, either approach 1 or approach 2 could perform better, depending on the circumstances.

Assessing model fit

A more complex model will usually fit the data better than a more parsimonious (simpler) model. This is called overfitting.

Due to overfitting, we cannot simply compare models in terms of how well they fit the data (e.g. in terms of the residual sum of squares, or the height of the likelihood). The more complex model will always appear to be better if we do this.

To overcome this, most model selection procedures balance a measure of a model's fit against a measure of its complexity. To justify selecting the more complex model, it must fit the data substantially better than the simpler model.

Fit and parsimony

Example: The purple, green, and blue curves below are estimates of E[y|x]. The green curve fits the data better but is more complex. Which estimate is closest to the truth?

[Figure: scatterplot of Y against X with three fitted curves of differing complexity.]

F-tests

Suppose we are comparing two nested families of models F_1 ⊂ F_2, both of which are linear subspaces. An F-test can be used to select between F_1 and F_2.
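As a concrete illustration (not from the original slides), here is a minimal sketch of the nested-model F-test using numpy and scipy. The function name f_test_nested and the simulated example are hypothetical, and the design matrices X1 and X2 are assumed to satisfy col(X1) ⊂ col(X2), with X2 including an intercept column.

```python
# Minimal sketch (illustrative, not from the slides): F-test comparing two
# nested OLS fits.  Assumes col(X1) is a subspace of col(X2).
import numpy as np
from scipy import stats

def f_test_nested(y, X1, X2):
    """Return the F statistic and p-value for testing the smaller model X1
    against the larger model X2."""
    n = len(y)
    p1, p2 = X1.shape[1], X2.shape[1]
    rss1 = np.sum((y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]) ** 2)
    rss2 = np.sum((y - X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]) ** 2)
    fstat = ((rss1 - rss2) / (p2 - p1)) / (rss2 / (n - p2))
    return fstat, stats.f.sf(fstat, p2 - p1, n - p2)

# Example: does adding a quadratic term improve on a linear fit?
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
X1 = np.column_stack([np.ones(100), x])
X2 = np.column_stack([X1, x ** 2])
print(f_test_nested(y, X1, X2))
```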
Mallows' Cp

Suppose we postulate the model y = Xβ + ε, but in fact E[y] ∉ col(X). We'll continue to assume that the homoscedastic variance structure cov(ε|X) = σ²I holds. Denote this model as M.

Denote the error in estimating E[y] under model M as D_M = ŷ_M − E[y], where ŷ_M is the projection of y onto col(X).

Write E[y] = θ_X + θ_X⊥, where θ_X ∈ col(X) and θ_X⊥ ∈ col(X)⊥. Since y = θ_X + θ_X⊥ + ε, it follows that ŷ_M = θ_X + ε_X, where ε_X is the projection of ε onto col(X). Therefore

    E[D_M D_M′] = E[(ŷ_M − E[y])(ŷ_M − E[y])′]
                = E[(ε_X − θ_X⊥)(ε_X − θ_X⊥)′]
                = θ_X⊥ θ_X⊥′ + σ² P_X,

where P_X is the projection matrix onto col(X).

Taking the trace of both sides yields

    E‖D_M‖² = ‖θ_X⊥‖² + (p + 1)σ²,

where p + 1 is the rank of P_X. Mallows' Cp aims to estimate

    C_p* = E‖D_M‖²/σ² = ‖θ_X⊥‖²/σ² + p + 1.

The model that minimizes C_p* is the closest to the true model in this particular sense.

We need an estimate of C_p*. To begin, we can derive the expected value of σ̂² = ‖y − ŷ_M‖²/(n − p − 1) in the case where E[y] is not necessarily in col(X):

    E[σ̂²] = E[y′(I − P_X)y]/(n − p − 1)
          = E[(θ_X + θ_X⊥ + ε)′(I − P_X)(θ_X + θ_X⊥ + ε)]/(n − p − 1)
          = E[tr((I − P_X)(θ_X⊥ + ε)(θ_X⊥ + ε)′)]/(n − p − 1)
          = ‖θ_X⊥‖²/(n − p − 1) + σ².

Now suppose we have an unbiased estimate of σ². This could come from a regression against a much larger design matrix whose column space is thought to contain E[y]. Call this estimate σ*². Then

    (n − p − 1)·E[σ̂² − σ*²] = ‖θ_X⊥‖².

Therefore we can estimate C_p* using

    C_p = (n − p − 1)(σ̂² − σ*²)/σ*² + p + 1.

The model M with the smallest value of C_p is selected.
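The Cp formula above is easy to compute directly. Below is a minimal sketch (not from the original slides): X_M is a candidate design matrix assumed to include an intercept column, and X_full is a hypothetical large reference design whose column space is believed to contain E[y], used to form σ*².

```python
# Minimal sketch (illustrative): Mallows' Cp for a candidate design X_M,
# with sigma*^2 estimated from a large reference design X_full.
import numpy as np

def rss(y, X):
    """Residual sum of squares from an OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta) ** 2)

def mallows_cp(y, X_M, X_full):
    n = len(y)
    p = X_M.shape[1] - 1                                   # p covariates plus an intercept
    sigma2_hat = rss(y, X_M) / (n - p - 1)                 # sigma-hat^2 under model M
    sigma2_star = rss(y, X_full) / (n - X_full.shape[1])   # reference estimate sigma*^2
    return (n - p - 1) * (sigma2_hat - sigma2_star) / sigma2_star + p + 1
```

Algebraically, this is the same as the more familiar form RSS_M/σ*² − n + 2(p + 1).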
AIC

Suppose we are selecting from a family of linear models with design matrices X_M, for M ranging over a collection of candidate models. For each X_M, the model parameters (slopes and error variance) can be estimated using least squares (and the method of moments for the error variance) as a vector θ̂_M. This allows us to construct a predictive density p(y; X_M, θ̂_M).

The Kullback-Leibler divergence ("KL-divergence") between the predictive density and the actual density p(y) is

    E_y log( p(y) / p(y; X_M, θ̂_M) ) = ∫ log( p(y) / p(y; X_M, θ̂_M) ) p(y) dy ≥ 0.

Here we are considering θ̂_M to be fixed. Small values of the KL-divergence indicate that the predictive density is close to the actual density.

Akaike's Information Criterion (AIC) aims to estimate the KL-divergence between a candidate model and the data-generating model p(y) unbiasedly. We can then select the candidate model that has the smallest estimated KL-divergence relative to p(y). The KL-divergence can be written

    E_y log p(y) − E_y log p(y; X_M, θ̂_M).

We can ignore the first term since it doesn't depend on M. Thus it is equivalent to select the model that maximizes the predictive log-likelihood

    E_y log p(y; X_M, θ̂_M) = ∫ log p(y; X_M, θ̂_M) p(y) dy.

The predictive log-likelihood is the expected value of log p(y*; X_M, θ̂_M(y)), taken over the joint distribution of y and y*, which are independent copies of the data. The parameter estimates θ̂_M = θ̂_M(y) are determined from y, which you can think of as a "training set", and the log-likelihood is evaluated at y*, using θ̂_M(y) to set the parameters.

Since we don't have both y and y*, it is natural to use the plug-in estimator of the predictive log-likelihood:

    log p(y; X_M, θ̂_M(y)).

But this is biased upward, due to overfitting. Surprisingly, this upward bias can be shown to be approximately equal to the dimension of M, which is p + 1 for regression (p + 2 if you count σ²).

Thus we may take

    log p(y_train; X_M, θ̂_M) − p − 1

as a model selection statistic to be maximized (commonly this is multiplied by −2, in which case it is to be minimized). This quantity is the AIC.

AIC in linear models

To apply the AIC to linear models, we assume the errors are multivariate normal, so the log-likelihood becomes

    −(n/2)·log σ² − ‖y − Xβ‖²/(2σ²).

If we work with the profile likelihood over β, we get −n·log(σ̂²)/2 (plus a constant), where σ̂² = ‖y − Xβ̂‖²/n. Therefore maximizing the AIC is equivalent to maximizing

    −n·log(σ̂²) − 2(p + 1),

where the first term measures fit and the second term measures complexity. The conventional form of the AIC is a scaled version of the expression above, which is to be minimized:

    AIC = n·log(σ̂²) + 2(p + 1).

AIC and likelihood ratios

The AIC does not require the models being compared to be nested, but let's consider this special case. Let L_1 and L_0 be the maximized log-likelihoods for two nested models (so L_1 ≥ L_0). We know that 2(L_1 − L_0) approximately follows a χ²_q distribution, where q is the difference between the numbers of parameters of the two models.

If q = 1, the usual likelihood ratio test at the 0.05 type I error rate selects the larger model if L_1 − L_0 ≥ 1.92. If the additional parameters are not needed, then E[L_1 − L_0] = 0.5 (so 0.5 is the lowest possible threshold for L_1 − L_0 that could ever be considered). Under AIC, we select the larger model if L_1 − L_0 > 1, which is less strict than the likelihood ratio test.

Bayesian Information Criterion (BIC)

A different criterion, which we will not derive here, is the "Bayesian information criterion" (BIC). The BIC defines complexity differently from the AIC:

    −n·log(σ̂²) − (p + 1)·log(n),

where again the first term measures fit and the second term measures complexity. The conventional definition of the BIC is

    BIC = n·log(σ̂²) + (p + 1)·log(n),

and the best-fitting model under the BIC minimizes this quantity.

The complexity penalty in the BIC, (p + 1)·log(n), is larger than the corresponding AIC penalty, 2(p + 1), whenever n ≥ 8 (so that log(n) > 2). Thus in practice the BIC favors simpler models than the AIC.

Model selection based on prediction

Many approaches to model selection attempt to identify the model that predicts best on independent data. If independent "training" and "test" sets are available, for each model M the parameters of M can be fit using the training data, yielding θ̂_M. Predictions can then be made on the test set,

    ŷ_M,test = X_M,test θ̂_M,

and the quality of prediction can be assessed, for example, using the mean squared prediction error (MSPE): ‖y_test − ŷ_M,test‖²/n.

Cross-validation

Separate training and test sets are usually not available. Cross-validation is a direct method for obtaining approximately unbiased estimates of the prediction mean squared error when only training data are available.

In k-fold cross-validation, the data are partitioned into k disjoint subsets ("folds"), denoted S_1 ∪ ··· ∪ S_k = {1, ..., n}. Let β̂_j be the fitted coefficients obtained by omitting the j-th of these subsets, and let

    CV_k = n⁻¹ Σ_{j=1}^{k} Σ_{i∈S_j} (y_i − X_i′β̂_j)².

This is an approximately unbiased (but potentially very imprecise) estimate of the MSPE on a test set. The special case with k = n is leave-one-out cross-validation (LOOCV).

For OLS regression, CV_n (also known as the "prediction residual error sum of squares", or PRESS) can be computed rapidly:

    CV_n = n⁻¹ Σ_i R_i²/(1 − P_ii)²,

where R_i are the residuals and P_ii are the diagonal elements of the projection matrix P.

The generalized cross-validation (GCV) criterion replaces P_ii with the average diagonal element of P, which is tr(P)/n:

    GCV_n = n⁻¹ Σ_i R_i²/(1 − tr(P)/n)² = n⁻¹ ‖y − ŷ‖²/(1 − tr(P)/n)².
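The linear-model forms of the AIC and BIC above are easy to compute directly. Here is a minimal sketch (not from the original slides); it assumes the design matrix X includes an intercept column, so there are p + 1 mean parameters, and it uses the maximum likelihood estimate σ̂² = ‖y − Xβ̂‖²/n as in the profile log-likelihood above.

```python
# Minimal sketch (illustrative): AIC and BIC for an OLS fit, per the formulas above.
import numpy as np

def aic_bic(y, X):
    """Return (AIC, BIC) = (n log(sigma2) + 2(p+1), n log(sigma2) + (p+1) log(n)),
    where sigma2 is the MLE of the error variance and X has p + 1 columns."""
    n, p_plus_1 = len(y), X.shape[1]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2_mle = np.sum((y - X @ beta) ** 2) / n
    aic = n * np.log(sigma2_mle) + 2 * p_plus_1
    bic = n * np.log(sigma2_mle) + p_plus_1 * np.log(n)
    return aic, bic
```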
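The cross-validation quantities can also be computed directly. Below is a minimal sketch (not from the original slides) of an explicit k-fold estimate of the MSPE together with the PRESS and GCV shortcuts for OLS, which avoid refitting the model; the function names are arbitrary and X is assumed to have full column rank.

```python
# Minimal sketch (illustrative): k-fold CV, and the PRESS / GCV shortcuts for OLS.
import numpy as np

def press_and_gcv(y, X):
    """Return (CV_n, GCV_n) for an OLS fit using the hat-matrix identities."""
    n = len(y)
    P = X @ np.linalg.solve(X.T @ X, X.T)        # projection ("hat") matrix
    resid = y - P @ y                            # OLS residuals R_i
    cv_n = np.mean((resid / (1 - np.diag(P))) ** 2)
    gcv_n = np.mean(resid ** 2) / (1 - np.trace(P) / n) ** 2
    return cv_n, gcv_n

def kfold_cv(y, X, k, seed=0):
    """Explicit k-fold estimate of the mean squared prediction error."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    sse = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        beta = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)[0]
        sse += np.sum((y[test_idx] - X[test_idx] @ beta) ** 2)
    return sse / n
```

For OLS, the PRESS identity means press_and_gcv and kfold_cv(y, X, k=n) agree up to floating-point error; the explicit loop is only needed when k < n or when the fitting procedure is not OLS.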