Regression diagnostics
Kerby Shedden
Department of Statistics, University of Michigan

October 31, 2021

1 / 46

Motivation
When working with a linear model with design matrix X, the
conventional linear model is based on the following conditions:

E[y|X] ∈ col(X)   and   var[y|X] = σ²I_{n×n}.

Unbiasedness and consistency of least squares point estimates depend on
the first condition approximately holding. Least squares inferences
depend on both of the above conditions approximately holding.
Inferences for small sample sizes may also depend on the distribution of
y − E [y |X ] being approximately multivariate Gaussian, but for moderate
or large sample sizes this condition is not critical.
Regression diagnostics for linear models are approaches for assessing how
well a particular data set satisfies one or both of these conditions.

2 / 46

Residuals
Linear models can be expressed in two equivalent ways:

- Focus only on moments: E[y|X] ∈ col(X) and var[y|X] = σ²I.

- Use a "generative model" such as an additive error model of the
  form y = Xβ + ε, where ε is random with E[ε|X] = 0 and
  cov[ε|X] ∝ I_{n×n}.

Since the residuals can be viewed as predictions of the errors, it turns out
that regression model diagnostics can often be developed using the
residuals.
Recall that the residuals can be expressed
r ≡ (I − P)y
where P is the projection onto col(X ) and y ∈ Rn is the vector of
observed responses.
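As a concrete illustration, here is a minimal numpy sketch of forming the projection matrix P and the residuals r = (I − P)y; the simulated design matrix and response are assumptions for illustration, not data from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (an assumption for illustration): intercept plus three covariates.
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# Projection onto col(X) and the residual vector r = (I - P) y.
P = X @ np.linalg.solve(X.T @ X, X.T)
r = (np.eye(n) - P) @ y
yhat = P @ y

# The two properties discussed next: residuals sum to zero and are
# orthogonal to the fitted values.
print(r.sum())    # approximately 0
print(r @ yhat)   # approximately 0
```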
3 / 46

Residuals
The residuals have two key mathematical properties regardless of the
correctness of the model specification:

- The residuals sum to zero, since (I − P)1 = 0 and hence
  1′r = 1′(I − P)y = 0.

- The residuals and fitted values are orthogonal (they have zero
  sample covariance):

  ĉov(r, ŷ|X) ∝ (r − r̄)′ŷ = r′ŷ = y′(I − P)Py = 0.

These properties hold as long as an intercept is included in the model (so
P·1 = 1, where 1 ∈ Rⁿ is a vector of 1's).
4 / 46

Residuals
If the basic linear model conditions hold, these two properties have
population counterparts:

- The expected value of each residual is zero:

  E[r|X] = (I − P)E[y|X] = 0 ∈ Rⁿ.

- The population covariance between any residual and any fitted
  value is zero:

  cov(r, ŷ|X) = E[rŷ′] = (I − P)cov(y|X)P = σ²(I − P)P = 0 ∈ R^{n×n}.
5 / 46

Residuals
If the model is correctly specified, there is a simple formula for the
variances and covariances of the residuals:

cov(r|X) = (I − P)E[yy′](I − P)
         = (I − P)(Xββ′X′ + σ²I)(I − P)
         = σ²(I − P).

If the model is correctly specified, the standardized residuals

(yi − ŷi)/σ̂

and the Studentized residuals

(yi − ŷi)/(σ̂(1 − Pii)^{1/2})

approximately have zero mean and unit variance.
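A minimal numpy sketch of both kinds of residuals; the helper name is hypothetical and X is assumed to contain an intercept column.

```python
import numpy as np

def standardized_and_studentized(X, y):
    """Standardized and (internally) Studentized residuals for a least squares fit.
    A sketch: X is assumed to have p + 1 columns including the intercept."""
    n, q = X.shape
    P = X @ np.linalg.solve(X.T @ X, X.T)
    r = y - P @ y
    sigma2_hat = r @ r / (n - q)          # usual unbiased estimate of sigma^2
    h = np.diag(P)                        # leverages P_ii
    standardized = r / np.sqrt(sigma2_hat)
    studentized = r / np.sqrt(sigma2_hat * (1 - h))
    return standardized, studentized
```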
6 / 46

Residuals
The standardized residuals are crudely standardized using a plug-in logic.
The standardized "errors" are

(yi − E[y|x = xi])/σ,

which have zero mean and unit variance (exactly).

If we plug in ŷi for E[y|x = xi] and σ̂ for σ, then we get the standardized
residual.

However, we know that the actual variance of yi − ŷi is σ²(I − P)ii, so it
is more precise to scale by σ((I − P)ii)^{1/2} rather than by σ.

Since we plug in the estimate σ̂ for the population parameter σ, even the
Studentized residual does not have variance exactly equal to 1.

7 / 46

External standardization of residuals
Let σ̂−i² be the estimate of σ² obtained by fitting a regression model
omitting the i th case. It turns out that we can calculate this value
without actually refitting the model:

σ̂−i² = ((n − p − 1)σ̂² − ri²/(1 − Pii)) / (n − p − 2),

where ri is the residual for the model fit to all data.

The "externally standardized" residuals are

(yi − ŷi)/σ̂−i.

The "externally Studentized" residuals are

(yi − ŷi)/(σ̂−i (1 − Pii)^{1/2}).
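A numpy sketch of the externally Studentized residuals using the leave-one-out variance identity above, so no model is refit; the helper name is hypothetical.

```python
import numpy as np

def external_studentized(X, y):
    """Externally Studentized residuals (a sketch; X includes the intercept column)."""
    n, q = X.shape                        # q = p + 1
    P = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(P)
    r = y - P @ y
    sigma2_hat = r @ r / (n - q)
    # sigma^2 estimated with case i deleted, via the identity above (no refitting):
    sigma2_loo = ((n - q) * sigma2_hat - r**2 / (1 - h)) / (n - q - 1)
    return r / np.sqrt(sigma2_loo * (1 - h))
```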
8 / 46

Outliers and masking

In some settings, residuals can be used to identify “outliers”. However, in
a small data set, a large outlier will increase the value of σ̂, and hence
may mask itself.
Externally Studentized residuals solve the problem of a single large outlier
masking itself. But masking may still occur if multiple large outliers are
present.

9 / 46

Outliers and masking
If multiple large outliers may be present, we may use alternate estimates
of the scale parameter σ:

- Interquartile range (IQR): this is the difference between the 75th
  percentile and the 25th percentile of the distribution or data. The
  IQR of the standard normal distribution is about 1.35, so IQR/1.35 can
  be used to estimate σ.

- Median Absolute Deviation (MAD): this is the median value of the
  absolute deviations from the median of the distribution or data, i.e.
  median(|Z − median(Z)|). The MAD of the standard normal
  distribution is about 0.675, so MAD/0.675 can be used to estimate σ.

These alternative estimates of σ can be used in place of the usual σ̂ for
standardizing or Studentizing residuals.
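A numpy sketch of the two robust scale estimates; the constants are the normal-reference values quoted above and the function name is hypothetical.

```python
import numpy as np

def robust_sigma(r):
    """IQR- and MAD-based estimates of sigma from residuals r (a sketch;
    the constants assume approximately Gaussian data)."""
    q75, q25 = np.percentile(r, [75, 25])
    sigma_iqr = (q75 - q25) / 1.35                              # IQR of N(0, 1) is about 1.35
    sigma_mad = np.median(np.abs(r - np.median(r))) / 0.675     # MAD of N(0, 1) is about 0.675
    return sigma_iqr, sigma_mad
```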

10 / 46

Leverage
Leverage is a measure of how strongly the data for case i determines the
fitted value ŷi .
Since ŷ = Py, where P is the projection matrix onto col(X), we have

ŷi = Σ_j Pij yj.

It is natural to define the leverage for case i as Pii.
This is related to the fact that the variance of the i th residual is
σ 2 (1 − Pii ). Since the residuals have mean zero, when Pii is close to 1,
the residual will likely be close to zero. Thus the least squares fit will
tend to pass closer to high leverage points than low leverage points.
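A numpy sketch of computing the leverages Pii as the diagonal of the projection ("hat") matrix without forming the full n × n matrix; the helper name is hypothetical.

```python
import numpy as np

def leverage(X):
    """Leverage values P_ii for each row of the design matrix X (a sketch)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    # P_ii = x_i (X'X)^{-1} x_i' for each row x_i, computed without building P.
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    return h
```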

11 / 46

Leverage
What is a big leverage? The average leverage is trace(P)/n = (p + 1)/n.
If the leverage for a particular case is two or more times greater than the
average leverage, it may be considered to have high leverage.
In simple linear regression, we showed earlier that

var(yi − α̂ − β̂xi) = (n − 1)σ²/n − σ²(xi − x̄)²/Σ_j (xj − x̄)².

Since var(ri) = σ²(1 − Pii), this implies that when p = 1,

Pii = 1/n + (xi − x̄)²/Σ_j (xj − x̄)².

12 / 46

Leverage

Leverage values in a simple linear regression:
[Figure: two panels, a scatterplot of Y against X and the corresponding leverage values plotted against X.]
13 / 46

Leverage
Leverage values in a linear regression with two independent variables:
[Figure: leverage values plotted against the covariates X1 and X2.]

14 / 46

Projection matrices and Mahalanobis distances
Suppose that the design matrix X has a leading column of 1's. The
projection onto col(X) is unchanged if we transform X to XB, where
B ∈ R^{(p+1)×(p+1)} is invertible. Thus, we can take X̃ ∈ R^{n×(p+1)} to be a
matrix with leading column identically equal to 1, and for j > 1 the j th
column of X̃ is the centered version of the j th column of X.

Since

P ≡ proj(col(X)) = proj(col(X̃)) = X̃(X̃′X̃)⁻¹X̃′,

it follows that

Pij = x̃i (X̃′X̃)⁻¹ x̃j′,

where x̃i is row i of X̃.

15 / 46

Projection matrices and Mahalanobis distances
Let ΣX ∈ R^{p×p} denote the covariance matrix of columns 2 through p + 1
of X (or of X̃). We can block decompose as follows:

X̃′X̃/n = [ 1_{1×1}   0_{1×p} ]
         [ 0_{p×1}   ΣX      ].

We can write x̃i = (1, xi* − µX), where µX ∈ R^p is the mean of columns
2 through p + 1 of X and xi* contains elements 2 through p + 1 of xi.
This gives us

Pij = n⁻¹(1 + (xi* − µX) ΣX⁻¹ (xj* − µX)′).

Since ΣX⁻¹ is positive definite, this implies that Pii ≥ 1/n.

16 / 46

Leverage and Mahalanobis distances
The expression

(xi* − µX) ΣX⁻¹ (xi* − µX)′

is the Mahalanobis distance between xi* and µX. It is equivalent to
decorrelating the xi* to zi = R^{−T} xi* (e.g. where ΣX = R′R) and then
taking the squared Euclidean distance ‖zi − R^{−T}µX‖².
Thus there is a direct relationship between the Mahalanobis distance of a
point relative to the center of the covariate set (µX ), and its leverage.
Observations with covariate values xi∗ in the tails of the distribution of all
{xi∗ }, as assessed with Mahalanobis distance, have the highest leverage
(but no larger than 1).
Observations with covariate values close to the centroid µX of all {xi∗ }
have the least leverage (but no smaller than 1/n).
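A numpy sketch that checks the identity Pii = (1 + Mahalanobis distance)/n numerically; the helper is hypothetical and X is assumed to have a leading column of 1's.

```python
import numpy as np

def leverage_via_mahalanobis(X):
    """Leverages computed from Mahalanobis distances to the covariate centroid (a sketch)."""
    n = X.shape[0]
    Xs = X[:, 1:]                                     # drop the intercept column
    mu = Xs.mean(axis=0)
    # Covariance with divisor n, matching the X'X/n block decomposition above.
    Sigma = np.atleast_2d(np.cov(Xs, rowvar=False, bias=True))
    d = Xs - mu
    md2 = np.einsum("ij,jk,ik->i", d, np.linalg.inv(Sigma), d)
    return (1 + md2) / n                              # should match the diagonal of the hat matrix
```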

17 / 46

Influence
Influence measures the degree to which deletion of a case changes the
fitted model.
We will see that this is different from leverage – a high leverage point has
the potential to be influential, but is not always influential.
The deleted slope for case i is the fitted slope vector that is obtained
upon deleting case i. The following identity allows the deleted slopes to
be calculated efficiently:

β̂(i) = β̂ − ri/(1 − Pii) · (X′X)⁻¹xi′,

where ri is the i th residual, and xi is row i of the design matrix.
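A numpy sketch of this identity, returning all n leave-one-out coefficient vectors without refitting; the helper name is hypothetical.

```python
import numpy as np

def deleted_coefs(X, y):
    """One row per case i: beta_(i) = beta - r_i / (1 - P_ii) * (X'X)^{-1} x_i' (a sketch)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)       # leverages P_ii
    r = y - X @ beta
    adjust = (r / (1 - h))[:, None] * (X @ XtX_inv)   # row i is r_i/(1 - P_ii) * x_i (X'X)^{-1}
    return beta[None, :] - adjust
```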

18 / 46

Influence
The vector of deleted fitted values for case i is

ŷ(i) = Xβ̂(i) = ŷ − ri/(1 − Pii) · X(X′X)⁻¹xi′.

Influence can be measured by Cook's distance:

Di ≡ (ŷ − ŷ(i))′(ŷ − ŷ(i)) / ((p + 1)σ̂²)
   = ri²/((1 − Pii)²(p + 1)σ̂²) · xi(X′X)⁻¹xi′
   = Pii ris²/((1 − Pii)(p + 1)),

where ri is the residual and ris is the studentized residual.
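A numpy sketch of Cook's distances using the final expression above; the helper name is hypothetical and X is assumed to include the intercept, so it has p + 1 columns.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distances D_i = P_ii * r_is^2 / ((1 - P_ii) * (p + 1)) (a sketch)."""
    n, q = X.shape                        # q = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    r = y - X @ (XtX_inv @ X.T @ y)
    sigma2_hat = r @ r / (n - q)
    rs2 = r**2 / (sigma2_hat * (1 - h))   # squared (internally) studentized residuals
    return h * rs2 / ((1 - h) * q)
```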
19 / 46

Influence

Cook’s distance approximately captures the average squared change in
fitted values due to deleting case i, in error variance units.
Cook’s distance is large only if both the leverage Pii is high, and the
studentized residual for the i th case is large.
As a general rule, Di values from 1/2 to 1 are high, and values greater
than 1 are considered to be “very high”.

20 / 46

Influence
Cook’s distances in a simple linear regression:
[Figure: Cook's distances plotted against X for a simple linear regression.]

21 / 46

Influence
Cook’s distances in a linear regression with two variables:
[Figure: Cook's distances plotted against the covariates X1 and X2.]

22 / 46

Regression graphics

Quite a few graphical techniques have been proposed to aid in visualizing
regression relationships. We will discuss the following plots:
1. Scatterplots of y against individual covariates xj
2. Scatterplots of two covariates xj , xk against each other
3. Residuals versus fitted values plot
4. Added variable plots
5. Partial residual plots
6. Residual quantile plots

23 / 46

Scatterplots of y against individual covariates

E[y|x] = x1 − x2 + x3, var[y|x] = 1, var(xj) = 1, cor(xj, xk) = 0.3

[Figure: scatterplots of Y against X1, X2, X3, and X1 − X2 + X3.]

24 / 46

Scatterplots of covariates against each other

E[y|x] = x1 − x2 + x3, var[y|x] = 1, var(xj) = 1, cor(xj, xk) = 0.3

[Figure: pairwise scatterplots of the covariates X1, X2, and X3.]
25 / 46

Residuals against fitted values plot
E[y|x] = x1 − x2 + x3, var[y|x] = 1, var(xj) = 1, cor(xj, xk) = 0.3

[Figure: residuals plotted against fitted values.]

26 / 46

Residuals against fitted values plots
Heteroscedastic errors:
E[y|x] = x1 + x3, var[y|x] = 4 + x1 + x3, var(xj) = 1, cor(xj, xk) = 0.3

[Figure: residuals plotted against fitted values under heteroscedastic errors.]
27 / 46

Residuals against fitted values plots
Nonlinear mean structure:

E[y|x] = x1², var[y|x] = 1, var(xj) = 1, cor(xj, xk) = 0.3

[Figure: residuals plotted against fitted values when the mean structure is nonlinear.]
28 / 46

Added variable plots

Suppose P−j is the projection onto the span of all covariates except xj,
and define ŷ−j = P−j y and xj* = P−j xj. The added variable plot is a
scatterplot of y − ŷ−j against xj − xj*.

The squared correlation coefficient of the points in the added variable
plot is the partial R² for variable j.
Added variable plots are also called partial regression plots.
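A sketch of how an added variable plot could be constructed with numpy and matplotlib; the plotting library and the helper name are assumptions, not part of these notes.

```python
import numpy as np
import matplotlib.pyplot as plt

def added_variable_plot(X, y, j):
    """Added variable (partial regression) plot for column j of X (a sketch;
    X is assumed to include an intercept column)."""
    X0 = np.delete(X, j, axis=1)                      # all columns except x_j
    P0 = X0 @ np.linalg.solve(X0.T @ X0, X0.T)        # projection onto the other covariates
    y_resid = y - P0 @ y                              # y - yhat_{-j}
    x_resid = X[:, j] - P0 @ X[:, j]                  # x_j - x_j*
    plt.scatter(x_resid, y_resid)
    plt.xlabel(f"x{j} residualized on the other covariates")
    plt.ylabel("y residualized on the other covariates")
    plt.show()
```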

29 / 46

Added variable plots

E[y|x] = x1 − x2 + x3, var[y|x] = 1, var(xj) = 1, cor(xj, xk) = 0.3

[Figure: added variable plots for X1, X2, and X3.]

30 / 46

Partial residual plot

Suppose we fit the model

ŷi = β̂′xi = β̂0 + β̂1 xi1 + · · · + β̂p xip.

The partial residual plot for covariate j is a plot of β̂j xij + ri against xij,
where ri is the residual.
The partial residual plot attempts to show how covariate j is related to y ,
if we control for the effects of all other covariates.
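A matching sketch of a partial residual plot; again the helper name and the use of matplotlib are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def partial_residual_plot(X, y, j):
    """Partial residual (component-plus-residual) plot for column j of X (a sketch;
    X is assumed to include an intercept column)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    plt.scatter(X[:, j], beta[j] * X[:, j] + r)       # beta_j * x_ij + r_i against x_ij
    plt.xlabel(f"x{j}")
    plt.ylabel(f"partial residual for x{j}")
    plt.show()
```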

31 / 46

Partial residual plot

E[y|x] = x1 − x2 + x3, var[y|x] = 1, var(xj) = 1, cor(xj, xk) = 0.3

[Figure: partial residual plots for X1, X2, and X3.]

32 / 46

Residual quantile plots

E[y|x] = x1 − x2 + x3, var[y|x] = 1, var(xj) = 1, cor(xj, xk) = 0.3
t4 distributed errors

[Figure: quantile-quantile plot of standardized residual quantiles against standard normal quantiles.]
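A sketch of producing such a residual quantile plot with numpy, scipy, and matplotlib; the tooling and helper name are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def residual_qq_plot(X, y):
    """QQ plot of standardized residuals against standard normal quantiles (a sketch)."""
    n, q = X.shape
    P = X @ np.linalg.solve(X.T @ X, X.T)
    r = y - P @ y
    z = r / np.sqrt(r @ r / (n - q))                  # crudely standardized residuals
    probs = (np.arange(1, n + 1) - 0.5) / n
    plt.scatter(stats.norm.ppf(probs), np.sort(z))
    plt.xlabel("Standard normal quantiles")
    plt.ylabel("Residual quantiles (standardized)")
    plt.show()
```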

33 / 46

Transformations

As noted above, the linear model imposes two main constraints on the
population that is under study. Specifically, the conditional mean
function should be linear, and the conditional variance function should be
constant.

If it appears that E[y|X = x] is not linear in x, or that var[y|X = x] is not
constant in x, it may be possible to continuously transform either y or x
so that the linear model becomes more consistent with the data.

34 / 46

Variance stabilizing transformations
Many populations encountered in practice exhibit a mean/variance
relationship, where E [yi ] and var[yi ] are related.
Suppose that

var[yi] = g(E[yi])σ²,

and let f(·) be a transform to be applied to the yi. The goal is to find a
transform such that the variances of the transformed responses are
constant. Using a Taylor expansion,

f(yi) ≈ f(E[yi]) + f′(E[yi])(yi − E[yi]).

35 / 46

Variance stabilizing transformations
Therefore

var[f(yi)] ≈ f′(E[yi])² · var[yi] = f′(E[yi])² · g(E[yi])σ².

The goal is to find f such that f′ = 1/√g.

Example: Suppose g(z) = z^λ. This includes the "Poisson regression"
case λ = 1, where the variance is proportional to the mean, and the case
λ = 2, where the standard deviation is proportional to the mean.

When λ = 1, f solves f′(z) = 1/√z, so f is the square root function.
When λ = 2, f solves f′(z) = 1/z, so f is the logarithm function.
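A small simulation sketch of the λ = 1 case, using simulated Poisson counts (an assumption for illustration): the raw variance tracks the mean, while the variance on the square-root scale is roughly constant near 1/4.

```python
import numpy as np

rng = np.random.default_rng(0)

for m in [5.0, 20.0, 80.0]:
    y = rng.poisson(m, size=100000)
    # Raw variance is close to the mean m; the square root roughly stabilizes it near 1/4.
    print(m, y.var(), np.sqrt(y).var())
```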

36 / 46

Log/log regression

Suppose we fit a simple linear regression of the form

E[log(y) | log(x)] = α + β log(x).

Then increasing log(x) by one unit changes the expected value of log(y) by β:

E[log(y) | log(x) = t + 1] − E[log(y) | log(x) = t] = β.

Using the crude approximation log E[y|x] ≈ E[log(y)|x], we conclude that
E[y|x] is approximately scaled by a factor of e^β when x is scaled by a
factor of e.

Thus in a log/log model, we may say that an f% change in x is
approximately associated with an fβ% change in the expected response.

37 / 46

Maximum likelihood estimation of a data transformation

The Box-Cox family of transforms is

y ↦ (y^λ − 1)/λ,

which makes sense only when all values of y are positive.

The Box-Cox family includes the identity (λ = 1), all power
transformations such as the square root (λ = 1/2) and reciprocal
(λ = −1), and the logarithm in the limiting case λ → 0.
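A numpy sketch of the Box-Cox transform, treating λ near zero as the logarithm limit; the function name is hypothetical.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform (y^lam - 1) / lam, with log(y) as the lam -> 0 limit (a sketch).
    All elements of y must be positive."""
    y = np.asarray(y, dtype=float)
    if np.isclose(lam, 0.0):
        return np.log(y)
    return (y**lam - 1.0) / lam
```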

38 / 46

Maximum likelihood estimation of a data transformation
Suppose we assume that for some value of λ, the transformed data follow
a linear model with Gaussian errors. We can then set out to estimate λ.

The joint log-likelihood of the transformed data is

−(n/2)·log(2π) − (n/2)·log σ² − (1/(2σ²)) Σ_i (yi^(λ) − xi′β)².

Next we transform this back to a likelihood in terms of yi = g_λ^{−1}(yi^(λ)).
This joint log-likelihood is

−(n/2)·log(2π) − (n/2)·log σ² − (1/(2σ²)) Σ_i (g_λ(yi) − xi′β)² + Σ_i log Ji,

where the Jacobian is

log Ji = log g_λ′(yi) = (λ − 1) log yi.
39 / 46

Maximum likelihood estimation of a data transformation

The joint log-likelihood for the yi is

−(n/2)·log(2π) − (n/2)·log σ² − (1/(2σ²)) Σ_i (g_λ(yi) − xi′β)² + (λ − 1) Σ_i log yi.

This likelihood is maximized with respect to λ, β, and σ² to identify the
MLE.

40 / 46

Maximum likelihood estimation of a data transformation

To do the maximization, let y^(λ) ≡ g_λ(y) denote the transformed
observed responses, and let ŷ^(λ) denote the fitted values from regressing
y^(λ) on X. Since σ² does not appear in the Jacobian,

σ̂²_λ ≡ n⁻¹‖y^(λ) − ŷ^(λ)‖²

will be the maximizing value of σ². Therefore the MLE of β and λ will
maximize

−(n/2)·log σ̂²_λ + (λ − 1) Σ_i log yi.
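A numpy sketch of profiling this criterion over a grid of λ values; the helper name and the grid are assumptions.

```python
import numpy as np

def box_cox_profile(X, y, lambdas):
    """Profile log-likelihood (up to an additive constant) over candidate lambdas (a sketch;
    y must be positive and X should include an intercept column)."""
    n = len(y)
    P = X @ np.linalg.solve(X.T @ X, X.T)
    log_y_sum = np.log(y).sum()
    out = []
    for lam in lambdas:
        z = np.log(y) if np.isclose(lam, 0.0) else (y**lam - 1.0) / lam
        sigma2 = np.mean((z - P @ z) ** 2)            # n^{-1} ||y^(lam) - yhat^(lam)||^2
        out.append(-0.5 * n * np.log(sigma2) + (lam - 1.0) * log_y_sum)
    return np.array(out)

# Usage sketch: lambdas = np.linspace(-2, 2, 81)
#               lam_hat = lambdas[np.argmax(box_cox_profile(X, y, lambdas))]
```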

41 / 46

Collinearity Diagnostics
Collinearity inflates the sampling variances of covariate effect estimates.
To understand the effect of collinearity on var[β̂j|X], reorder the columns
and partition the design matrix X as

X = [ xj   X0 ] = [ (xj − xj⊥) + xj⊥   X0 ],

where xj is column j of X, X0 is the n × p matrix consisting of all columns
in X except xj, and xj⊥ is the projection of xj onto col(X0)⊥. Therefore

H ≡ X′X = [ xj′xj            (xj − xj⊥)′X0 ]
          [ X0′(xj − xj⊥)    X0′X0         ].

Since var[β̂j|X] = σ²(H⁻¹)₁₁, we want a simple expression for (H⁻¹)₁₁.

42 / 46

Collinearity Diagnostics
A symmetric block matrix can be inverted using:

[ A   B ]⁻¹     [ S⁻¹            −S⁻¹BC⁻¹              ]
[ B′  C ]    =  [ −C⁻¹B′S⁻¹      C⁻¹ + C⁻¹B′S⁻¹BC⁻¹    ],

where S = A − BC⁻¹B′. Therefore

(H⁻¹)₁₁ = 1 / (‖xj‖² − (xj − xj⊥)′P0(xj − xj⊥)),

where P0 = X0(X0′X0)⁻¹X0′ is the projection matrix onto col(X0).
43 / 46

Collinearity Diagnostics
Since xj − xj⊥ ∈ col(X0), we can write

(H⁻¹)₁₁ = 1 / (‖xj‖² − ‖xj − xj⊥‖²),

and since xj⊥′(xj − xj⊥) = 0, it follows that

‖xj‖² = ‖xj − xj⊥ + xj⊥‖² = ‖xj − xj⊥‖² + ‖xj⊥‖²,

so

(H⁻¹)₁₁ = 1/‖xj⊥‖².

This makes sense, since smaller values of ‖xj⊥‖² correspond to greater
collinearity.
44 / 46

Collinearity Diagnostics

Let R²jx be the coefficient of determination (multiple R²) for the
regression of xj on the other covariates:

R²jx = 1 − ‖xj − (xj − xj⊥)‖²/‖xj − x̄j‖² = 1 − ‖xj⊥‖²/‖xj − x̄j‖².

Combining the two equations yields

(H⁻¹)₁₁ = 1/‖xj − x̄j‖² · 1/(1 − R²jx).
45 / 46

Collinearity Diagnostics
The two factors in the expression

(H⁻¹)₁₁ = 1/‖xj − x̄j‖² · 1/(1 − R²jx)

reflect two different sources of variance of β̂j:

- 1/‖xj − x̄j‖² = 1/((n − 1) v̂ar(xj)), where v̂ar(xj) is the sample variance
  of xj, reflects the scaling of xj and the sample size n.

- The variance inflation factor (VIF) 1/(1 − R²jx) is scale-free and
  sample-size-free. It is always greater than or equal to 1, and is equal
  to 1 only if xj is orthogonal to the other covariates. Large values of
  the VIF indicate that parameter estimation is strongly affected by
  collinearity (a direct computation of the VIF is sketched below).
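A numpy sketch of computing the VIF for each covariate directly from its regression on the others; the helper name is hypothetical and column 0 of X is assumed to be the intercept.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R^2_jx) for each non-intercept column of X (a sketch)."""
    n, q = X.shape
    vifs = []
    for j in range(1, q):
        X0 = np.delete(X, j, axis=1)                  # all columns except x_j
        xj = X[:, j]
        fitted = X0 @ np.linalg.lstsq(X0, xj, rcond=None)[0]
        r2 = 1.0 - np.sum((xj - fitted) ** 2) / np.sum((xj - xj.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)
```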

46 / 46