Fitting and inference for linear models using least
squares
Kerby Shedden
Department of Statistics, University of Michigan

October 12, 2021

1 / 171

Goals and terminology
The goal is to learn about an unknown function f that relates variables
y ∈ R and x ∈ Rp through a relationship y ≈ f (x). The variables y and
x have distinct roles:
• Independent variables (also known as predictors, regressors, covariates, exogenous variables): x ∈ R^p, a vector.

• Dependent variable (response, outcome, endogenous variable): y ∈ R, a scalar.

Note that the terms “independent” and “dependent” do not imply
statistical independence or linear algebraic independence of these
variables. Instead, these terms suggest that the value of x can be
manipulated, or exhibits “natural variation”, and we observe the
consequent changes in y .

2 / 171

Goals and terminology

The analysis is empirical (based on a sample of data):

y_i ∈ R,  x_i = (x_{i1}, . . . , x_{ip})′ ∈ R^p,  i = 1, . . . , n,

where n is the sample size.

The data point (y_i, x_i) reflects the i-th analysis unit (also called a case or observation).

3 / 171

Why do we want to do this?

“Two cultures” of statistics (Leo Breiman):

• Prediction: Estimate f to predict the typical value of y at a given x point using f̂(x) (x is not necessarily one of the x_i points in the data).

• Learning about structure: Inductive learning about the relationship between x and y, such as understanding which predictors or combinations of predictors are associated with particular changes in y, possibly attributing mechanisms or causal roles to specific factors.

We will see that inferences about mechanisms (causal inferences) based
on regression analysis can be made only in specific, limited settings.

4 / 171

Examples:

• An empirical model for the weather conditions 48 hours from now could be based on current and historical weather conditions. Such a model could have a lot of practical value, but it would not necessarily provide a lot of insight into the atmospheric processes that underlie changes in the weather.

• A study of the relationship between exposure to air pollution during childhood and life outcomes (like adult earnings) would primarily be of interest for inference, rather than for prediction. We may be able to quantify a risk to individuals from this exposure, and then estimate the cumulative effects of the exposure in a large population. But the effect of the exposure on an individual child is probably too small in relation to other risk factors for it to be of predictive value at the individual level.

5 / 171

Statistical interpretations of the regression function
The most common way of putting “curve fitting” into a statistical
framework is to define f as the conditional expectation
f (x) ≡ E [y |x].
Here, the notation E [y |x] implies that y and x are random variables, and
occasionally we may emphasize this by writing E [y |x = x].
Less commonly, the regression function is defined as a conditional
quantile, such as the median
f (x) ≡ median(y |x)
or even some other quantile f (x) ≡ Qp (y |x = x). This is called quantile
regression.

6 / 171

The conditional expectation function

For our purposes, the conditional expectation E [y |x] is viewed as a
deterministic function of x, essentially what we would get if we sampled a
large number of x, y pairs from their joint distribution, and took the
average of the y values that occur when the specific value x is observed.
If there are densities we can write

E[y|x] = ∫ y · f(y|x) dy.

7 / 171

Least Squares Fitting

In a linear model, the independent variable x is postulated to be related
to the dependent variable y via a linear relationship

y_i ≈ Σ_{j=1}^p β_j x_{ij} = β′x_i = ⟨β, x_i⟩.

This expression holds for each observed value, i = 1, . . . , n.
This is a “linear model” in two senses: it is linear in β for fixed x, and it
is linear in x for fixed β (technically, it is “bilinear”).

8 / 171

Least Squares Fitting

To estimate f , we need to estimate the βj . One approach to doing this is
to minimize the following function of β:

L(β) = Σ_{i=1}^n (y_i − Σ_j β_j x_{ij})² = Σ_{i=1}^n (y_i − β′x_i)²

This is called least squares estimation – it results from solving an
optimization problem.

9 / 171

Simple linear regression
A special case of the linear model is simple linear regression, when there
is p = 1 covariate and an intercept (a covariate whose value is always 1).
L(α, β) = Σ_i (y_i − α − βx_i)²

We can differentiate with respect to α and β:

∂L/∂α = −2 Σ_i (y_i − α − βx_i) = −2 Σ_i r_i
∂L/∂β = −2 Σ_i (y_i − α − βx_i)x_i = −2 Σ_i r_i x_i

where r_i = y_i − α − βx_i is the “working residual” (it requires working values for α and β).

10 / 171

Simple linear regression
Setting
∂L/∂α = ∂L/∂β = 0
and solving for α and β yields
α̂ = ȳ − β̂x̄

β̂ = (Σ_i y_i x_i/n − ȳx̄) / (Σ_i x_i²/n − x̄²)

where ȳ = Σ_i y_i/n and x̄ = Σ_i x_i/n are the sample mean values (averages).
The “hat” notation (ˆ) distinguishes the least-squares optimal values of α
and β from an arbitrary pair of parameter values.

11 / 171
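As a quick illustration (added here, not part of the original slides), the following numpy sketch evaluates these closed-form expressions on simulated data; the data and variable names (alpha_hat, beta_hat) are my own assumptions.

import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)   # simulated data with alpha = 2, beta = 1.5

# Closed-form least squares estimates from the slide
beta_hat = (np.mean(y * x) - y.mean() * x.mean()) / (np.mean(x ** 2) - x.mean() ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
print(alpha_hat, beta_hat)   # should be close to 2 and 1.5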

Simple linear regression
We will call α̂ and β̂ the “least squares estimates” of the model
parameters α and β.
At the least squares solution,

∂L/∂α = −2 Σ_i r_i = 0
∂L/∂β = −2 Σ_i r_i x_i = 0,

so we have the following basic properties of the least squares estimates:

• The residuals r_i = y_i − α̂ − β̂x_i sum to zero.

• The residuals are orthogonal to the independent variable x.

Recall that two vectors v, w ∈ R^d are orthogonal if Σ_{j=1}^d v_j w_j = 0.

12 / 171

Two important identities

Centered and uncentered sums of squares can be related as follows:

Σ_i x_i²/n − x̄² = Σ_i (x_i − x̄)²/n.

Centered and uncentered cross-products can be related as follows:

Σ_i y_i x_i/n − ȳx̄ = Σ_i y_i(x_i − x̄)/n = Σ_i (y_i − ȳ)(x_i − x̄)/n.

13 / 171

Simple linear regression
Note that

Σ_i (x_i − x̄)²/n   and   Σ_i (y_i − ȳ)(x_i − x̄)/n

are essentially the sample variance of x_1, . . . , x_n, and the sample covariance of the (x_i, y_i) pairs. Since β̂ is their ratio, we can replace n in the denominator with n − 1, so that

β̂ = ĉov(y, x)/v̂ar(x),

where ĉov and v̂ar are the usual unbiased estimates of covariance and variance.

14 / 171

Convex functions
A map Q : R^d → R is convex if for any v, w ∈ R^d,

Q(λv + (1 − λ)w) ≤ λQ(v) + (1 − λ)Q(w),

for 0 ≤ λ ≤ 1. If the inequality is strict for 0 < λ < 1 and all v ≠ w, then Q is strictly convex.

[Figure: a convex curve Q; for points v (λ = 1) and w (λ = 0), the chord value λQ(v) + (1 − λ)Q(w) lies above the curve value Q(λv + (1 − λ)w).]

15 / 171

Convex functions
A key property of strictly convex functions is that they have at most one
global minimizer. That is, there exists at most one v ∈ Rd such that
Q(v ) ≤ Q(w ) for all w ∈ Rd .
The proof is simple. Suppose there exists v ≠ w such that

Q(v) = Q(w) = inf_{u∈R^d} Q(u).

If Q is strictly convex and λ = 1/2, then

Q(v/2 + w/2) < (Q(v) + Q(w))/2 = inf_{u∈R^d} Q(u).

Thus z = (v + w)/2 has the property that Q(z) < inf_{u∈R^d} Q(u), a
contradiction.

16 / 171

Convexity of quadratic functions

A general quadratic function in d dimensions can be written
Q(v) = v′Av + b′v + c,

where A is a d × d matrix, b and v are vectors in R^d, and c is a scalar.

Note that

v′Av = Σ_{i,j} v_i v_j A_{ij}    and    b′v = Σ_j b_j v_j.

17 / 171

Convexity of quadratic functions

If b ∈ col(A), we can complete the square to eliminate the linear term,
giving us
Q(v) = (v − f)′A(v − f) + s,

where f is any vector satisfying Af = −b/2, and s = c − f′Af.

If A is invertible, we can take f = −A⁻¹b/2.
Since the property of being convex is invariant to translations in both the
domain and range, without loss of generality we can assume f = 0 and
s = 0 for purposes of analyzing the convexity of Q.

18 / 171

Convexity of quadratic functions

Two key definitions:

• A square matrix A is positive definite if v′Av > 0 for all vectors v ≠ 0.

• A square matrix A is positive semidefinite if v′Av ≥ 0 for all v.

We will now show that the quadratic function Q(v) = v′Av is strictly convex if and only if A is positive definite.

Note that without loss of generality A is symmetric, since otherwise Ã ≡ (A + A′)/2 gives the same quadratic form as A.

19 / 171

Convexity of quadratic functions

Q(λv + (1 − λ)w) = (λv + (1 − λ)w)′A(λv + (1 − λ)w)
                 = λ²v′Av + (1 − λ)²w′Aw + 2λ(1 − λ)v′Aw.

Therefore

λQ(v) + (1 − λ)Q(w) − Q(λv + (1 − λ)w) = λ(1 − λ)(v′Av + w′Aw − 2w′Av)
                                       = λ(1 − λ)(v − w)′A(v − w) ≥ 0.

If 0 < λ < 1, this is a strict inequality for all v ≠ w if and only if A is positive definite.

20 / 171

Convexity of quadratic functions

The gradient, or Jacobian of a scalar-valued function y = f (x) (y ∈ R,
x ∈ Rd ) is
(∂f /∂x1 , . . . , ∂f /∂xd ),
which is viewed as a row vector by convention.

21 / 171

Convexity of quadratic functions
If A is symmetric, the Jacobian of Q(v) = v′Av is 2v′A′. To see this, write

v′Av = Σ_i v_i² A_{ii} + Σ_{i≠j} v_i v_j A_{ij}

and differentiate with respect to v_ℓ to get the ℓ-th component of the Jacobian:

2v_ℓ A_{ℓℓ} + Σ_{j≠ℓ} v_j A_{ℓj} + Σ_{i≠ℓ} v_i A_{iℓ} = 2(Av)_ℓ.

22 / 171

Convexity of quadratic functions
The Hessian of a scalar-valued function y = f(x) (y ∈ R, x ∈ R^d) is the matrix

H_{ij} = ∂²f/∂x_j∂x_i.

If A is symmetric, the Hessian of the quadratic form Q is 2A.

To see this, note that the ℓ-th component of 2v′A′ is 2 times the inner product of v with the ℓ-th row (or column) of A. The derivative of this quantity with respect to a second index ℓ′ is 2A_{ℓℓ′}.

It follows that a quadratic function is strictly convex iff its Hessian is positive definite (more generally, any continuous, twice differentiable function is convex on R^d if and only if its Hessian matrix is everywhere positive semidefinite).

23 / 171

Uniqueness of the simple linear regression least squares fit

The least squares solution for simple linear regression, (α̂, β̂), is unique as long as v̂ar[x] (the sample variance of the covariate) is positive.

To see this, note that the Hessian (second derivative matrix) of L(α, β) is

H = [ ∂²L/∂α²    ∂²L/∂α∂β ]  =  [ 2n    2x·        ]
    [ ∂²L/∂α∂β   ∂²L/∂β²  ]     [ 2x·   2 Σ_i x_i² ]

where x· = Σ_i x_i.

24 / 171

Uniqueness of the simple linear regression least squares fit

If v̂ar(x_1, . . . , x_n) > 0, then this is a positive definite matrix, since all of the leading principal submatrices have positive determinants:

|2n| > 0

det [ 2n    2x·        ]  =  4n Σ_i x_i² − 4(x·)²  =  4n(n − 1) v̂ar(x_1, . . . , x_n)  >  0.
    [ 2x·   2 Σ_i x_i² ]

25 / 171

The fitted line and the data center

The fitted line
y = α̂ + β̂x
passes through the center of mass (centroid) of the data (x̄, ȳ ).
This can be seen by substituting x = x̄ into the equation of the fitted
line, which yields ȳ .

26 / 171

Regression slopes and the Pearson correlation
The sample Pearson correlation coefficient between x and y is

ρ̂ = ĉov(y, x) / (ŜD(y)·ŜD(x)).

The relationship between β̂ and ρ̂ is

β̂ = ρ̂ · ŜD(y)/ŜD(x).

Thus the fitted slope has the same sign as the Pearson correlation coefficient between y and x.

27 / 171

Reversing x and y
Suppose that x and y are reversed in a simple linear regression. The slope of the regression of x on y is

β̂∗ = ĉov(y, x)/v̂ar(y) = ĉor(y, x) · ŜD(x)/ŜD(y).

If the data fall exactly on a line, then ĉor(y, x) = ±1, so β̂∗ = 1/β̂, which is consistent with algebraically rearranging

y = α + βx    to    x = −α/β + y/β.

But if the data do not fit a line exactly, this property does not hold.

28 / 171

Norms
The Euclidean norm on vectors (the most commonly used norm) is

‖v‖ = (Σ_i v_i²)^{1/2} = √(v′v).

Here are some useful identities, for vectors v, w ∈ R^p:

‖v + w‖² = ‖v‖² + ‖w‖² + 2v′w

‖v − w‖² = ‖v‖² + ‖w‖² − 2v′w

29 / 171

Fitting multiple linear regression models
Multiple regression is a regression analysis in which there may be more than one explanatory variable (i.e. p ≥ 1). In this case, the covariate data define the design matrix

X = [ 1   x_11   x_12   · · ·   x_1p ]
    [ 1   x_21   x_22   · · ·   x_2p ]
    [ ·    ·      ·               ·  ]
    [ 1   x_n1   x_n2   · · ·   x_np ]

Note that in some situations the first column of 1’s (the intercept) will not be included.

30 / 171

Fitting multiple linear regression models
In multiple regression, the linear model coefficients are written as a vector

β = (β_0, β_1, . . . , β_p)′,

where β_0 is the intercept and β_k is the slope corresponding to the k-th covariate. For a given working coefficient vector β, the vector of fitted values is given by the matrix-vector product

ŷ = Xβ,

which is an n-dimensional vector.

The vector of working residuals y − Xβ is also an n-dimensional vector.

31 / 171

Fitting multiple linear regression models

The goal of least-squares estimation is to minimize the sum of squared
differences between the fitted and observed values.
L(β) = Σ_i (y_i − ŷ_i)² = ‖y − Xβ‖².

Estimating β by minimizing L(β) is called ordinary least squares (OLS).

32 / 171

The multivariate chain rule
Recall that if f and g are differentiable functions from R to R, and we
set h = f ◦ g , i.e. h(x) = f (g (x)), then the chain rule states that
h0 (x) = f 0 (g (x)) · g 0 (x).
Now suppose g(·) is a map from R^m to R^n and f(·) is a scalar-valued function on R^n, and let h = f ◦ g, i.e. h(z) = f(g(z)). Let f′_j(x) = ∂f(x)/∂x_j, let

∇f(x) = (f′_1(x), · · · , f′_n(x))′

denote the gradient of f, and let J denote the Jacobian of g,

J_{ij}(z) = ∂g_i(z)/∂z_j.

33 / 171

The multivariate chain rule
Then

∂h(z)/∂z_j = Σ_i ∂g_i(z)/∂z_j · f′_i(g(z)) = [J(z)′∇f(g(z))]_j.

Thus we can write the gradient of h as a matrix-vector product between the (transposed) Jacobian of g and the gradient of f:

∇h = J′∇f,

where J is evaluated at z and ∇f is evaluated at g(z).

34 / 171

The least squares gradient function
For the (multiple) least squares problem, the gradient of L(β) with
respect to β is
∂L/∂β = −2X′(y − Xβ).

This can be seen by differentiating

L(β) = Σ_i (y_i − β_0 − Σ_{j=1}^p x_{ij}β_j)²

element-wise, or by differentiating

‖y − Xβ‖²

using the multivariate chain rule, letting g(β) = y − Xβ and f(x) = Σ_j x_j².
35 / 171

Normal equations

Setting ∂L/∂β = 0 yields the “normal equations”:

X′Xβ = X′y.

Thus calculating the least squares estimate of β reduces to solving a system of p + 1 linear equations. Algebraically we can write

β̂ = (X′X)⁻¹X′y,

which is often useful for deriving analytical results. However, this expression should not be used to numerically calculate the coefficients.

36 / 171

Solving the normal equations
The most standard numerical approach is to calculate the QR decomposition of X,

X = QR,

where Q is an n × (p + 1) orthogonal matrix (i.e. Q′Q = I) and R is a (p + 1) × (p + 1) upper triangular matrix.

The QR decomposition can be calculated rapidly and with high precision. Once it is obtained, the normal equations become

Rβ = Q′y,

which is an easily solved (p + 1) × (p + 1) triangular system.

37 / 171
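A small numpy sketch (my own illustration, not from the slides) of the QR route: form Q and R with numpy.linalg.qr and solve the triangular system Rβ = Q′y rather than inverting X′X. The simulated design and coefficients are assumptions for the demo.

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design with intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

Q, R = np.linalg.qr(X)                    # reduced QR: Q is n x (p+1), R is (p+1) x (p+1) upper triangular
beta_hat = solve_triangular(R, Q.T @ y)   # solve R beta = Q'y using the triangular structure
print(beta_hat)                           # close to beta_true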

Matrix products
In multiple regression we encounter the matrix product X 0 X . Let’s review
some ways to think about matrix products.
If A ∈ R^{n×m} and B ∈ R^{n×p} are matrices, we can form the product A′B ∈ R^{m×p}.

Suppose we partition A and B by rows:

A = [ — a_1 — ]        B = [ — b_1 — ]
    [ — a_2 — ]            [ — b_2 — ]
    [   ···   ]            [   ···   ]
    [ — a_n — ]            [ — b_n — ]
38 / 171

Matrix products

Then

A′B = a_1′b_1 + a_2′b_2 + · · · + a_n′b_n,

where each a_j′b_j term is an outer product

a_j′b_j = [ a_{j1}b_{j1}   a_{j1}b_{j2}   · · · ]
          [ a_{j2}b_{j1}   a_{j2}b_{j2}   · · · ]  ∈ R^{m×p}.
          [     ···            ···              ]


39 / 171

Matrix products
Now if we partition A and B by columns,

A = [ a_1  a_2  a_3 ]        B = [ b_1  b_2  b_3 ]     (each a_j, b_j a column),

then A′B is a matrix of inner products:

A′B = [ a_1′b_1   a_1′b_2   · · · ]
      [ a_2′b_1   a_2′b_2   · · · ]
      [   ···       ···           ]

40 / 171

Matrix products
Thus we can view the product X′X involving the design matrix in two different ways. If we partition X by rows (the cases),

X = [ — x_1 — ]
    [ — x_2 — ]
    [   ···   ]
    [ — x_n — ]

then

X′X = Σ_{i=1}^n x_i′x_i

is the sum of the outer product matrices of the cases.

41 / 171

Matrix products

If we partition X by the columns (the variables),

X = [ x_1  x_2  x_3 ]     (each x_j a column),

then X′X is a matrix whose entries are the pairwise inner products of the variables ([X′X]_{ij} = x_i′x_j).

42 / 171

Mathematical properties of the multiple regression fit

The multiple least squares solution is unique as long as the columns of X are linearly independent. Here is the proof:

1. The Hessian of L(β) is 2X′X.
2. For v ≠ 0, v′(X′X)v = (Xv)′(Xv) = ‖Xv‖² > 0, since the columns of X are linearly independent. Therefore the Hessian of L is positive definite.
3. Since L(β) is quadratic with a positive definite Hessian matrix, it is strictly convex and hence has a unique global minimizer.

43 / 171

Projections

Suppose S is a subspace of R^d, and v is a vector in R^d. The projection operator P_S maps v to the vector in S that is closest to v:

P_S(v) = argmin_{η∈S} ‖v − η‖².

44 / 171

Projections

Property 1: (v − P_S(v))′s = 0 for all s ∈ S. To see this, let s ∈ S. Without loss of generality ‖s‖ = 1 and (v − P_S(v))′s ≤ 0. Let λ ≥ 0, and write

‖v − P_S(v) + λs‖² = ‖v − P_S(v)‖² + λ² + 2λ(v − P_S(v))′s.

If (v − P_S(v))′s ≠ 0, then for sufficiently small λ > 0, λ² + 2λ(v − P_S(v))′s < 0. This means that P_S(v) − λs is closer to v than P_S(v), contradicting the definition of P_S(v).

45 / 171

Projections
Property 2: Given a subspace S of R^d, any vector v ∈ R^d can be written uniquely in the form v = v_S + v_{S⊥}, where v_S ∈ S and s′v_{S⊥} = 0 for all s ∈ S. To prove uniqueness, suppose

v = v_S + v_{S⊥} = ṽ_S + ṽ_{S⊥}.

Then

0 = ‖v_S − ṽ_S + v_{S⊥} − ṽ_{S⊥}‖²
  = ‖v_S − ṽ_S‖² + ‖v_{S⊥} − ṽ_{S⊥}‖² + 2(v_S − ṽ_S)′(v_{S⊥} − ṽ_{S⊥})
  = ‖v_S − ṽ_S‖² + ‖v_{S⊥} − ṽ_{S⊥}‖²,

which is only possible if v_S = ṽ_S and v_{S⊥} = ṽ_{S⊥}. Existence follows from Property 1, with v_S = P_S(v) and v_{S⊥} = v − P_S(v).
46 / 171

Projections

Property 3: The projection P_S(v) is unique.

This follows from Property 2 – if there exists a v for which v_1 ≠ v_2 ∈ S both minimize the distance from v to S, then v = v_1 + u_1 and v = v_2 + u_2 (with u_j = v − v_j) would be distinct decompositions of v as a sum of a vector in S and a vector in S⊥, contradicting Property 2.

47 / 171

Projections

Property 4: PS (PS (v )) = PS (v ). The proof of this is simple, since
PS (v ) ∈ S, and any element of S has zero distance to itself.
A matrix or linear map with this property is called idempotent.

48 / 171

Projections
Property 5: PS is a linear operator.
Let A, B be vectors with θA = PS (A) and θB = PS (B). Then we can
write
A + B = θA + θB + (A + B − θA − θB ),
where θA + θB ∈ S, and
s 0 (A + B − θA − θB ) = s 0 (A − θA ) + s 0 (B − θB ) = 0
for all s ∈ S. By Property 2 above, this representation is unique, so
θA + θB = PS (A + B).

49 / 171

Projections
Property 6: Suppose PS is the projection operator onto a subspace S.
Then I − PS , where I is the identity matrix, is the projection operator
onto the subspace
S ⊥ ≡ {u ∈ Rd |u 0 s = 0 for all s ∈ S}.
To prove this, write
v = (I − PS )v + PS v ,
and note that ((I − PS )v )0 s = 0 for all s ∈ S, so (I − PS )v ∈ S ⊥ , and
u 0 PS v = 0 for all u ∈ S ⊥ . By property 2 this decomposition is unique,
and therefore I − PS is the projection operator onto S ⊥ .

50 / 171

Projections
Property 7: Since P_S(v) is linear, it can be represented in the form P_S(v) = P_S · v for a suitable square matrix P_S. Suppose S is spanned by the columns of a matrix X with linearly independent columns. Then

P_S = X(X′X)⁻¹X′.

To prove this, let Q = X(X′X)⁻¹X′, so that for an arbitrary vector v,

v = Qv + (v − Qv) = X(X′X)⁻¹X′v + (I − X(X′X)⁻¹X′)v,

and note that the first summand is in S while the second summand (by direct calculation) is perpendicular to the first summand, hence is in S⊥.

51 / 171
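To make Property 7 concrete, here is a small numpy check (an added illustration, with a simulated X of my own choosing) that P = X(X′X)⁻¹X′ is symmetric, idempotent, and fixes any vector already in col(X).

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection onto col(X); fine for a demo, avoid explicit inverses in production

print(np.allclose(P, P.T))               # symmetric
print(np.allclose(P @ P, P))             # idempotent
v = X @ rng.normal(size=4)               # a vector already in col(X)
print(np.allclose(P @ v, v))             # the projection leaves it unchanged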

Projections

Property 7 (continued):
To show that the second summand is in S⊥, take s ∈ S and write s = Xb. Then

s′(I − X(X′X)⁻¹X′)v = b′X′(I − X(X′X)⁻¹X′)v = 0.

Therefore this is the unique decomposition from Property 2 above, so P_S must be the projection.

52 / 171

Projections
Property 7 (continued)

An alternate approach to Property 7 is constructive. Let θ = Xγ, and suppose we wish to minimize the distance between θ and v. Using calculus, we differentiate with respect to γ and solve for the stationary point:

∂‖v − Xγ‖²/∂γ = ∂(v′v − 2v′Xγ + γ′X′Xγ)/∂γ = −2X′v + 2X′Xγ = 0.

The solution is

γ = (X′X)⁻¹X′v,

so

θ = X(X′X)⁻¹X′v.

53 / 171

Projections

Property 8: The composition of projections onto S and S ⊥ (in either
order) is identically zero.
PS ◦ PS ⊥ = PS ⊥ ◦ PS ≡ 0. This can be shown by direct calculation using
the representations of PS and PS ⊥ given above.

54 / 171

Least squares and projections

The least squares problem of minimizing

‖y − Xβ‖²

is equivalent to minimizing ‖y − η‖² over η ∈ col(X).

Therefore the minimizing value η̂ is the projection of y onto col(X).

If the columns of X are linearly independent, there is a unique vector β̂ such that Xβ̂ = η̂. These are the least squares coefficient estimates.

55 / 171

Row-wise and column-wise geometry of least squares

The rows of the design matrix X are vectors in R^{p+1}; the columns of the design matrix are vectors in R^n.

The n-dimensional space containing y and col(X) is called the variable space.

The (p + 2)-dimensional space containing the points (x_i, y_i) is called the case space.

56 / 171

Variable space geometry of least squares

Thinking column-wise, we are working in Rn . The vector y containing all
values of the dependent variable is a vector in Rn , and col(X ) is a p + 1
dimensional subspace of Rn .
The least squares problem can be seen to have the goal of producing a
vector of values that are in Rn , and that are as close as possible to y
among all such vectors. We will usually write this vector as ŷ . It is
obtained by projecting y onto col(X ).

57 / 171

Case space geometry of least squares

Thinking row-wise, we are working in R p+2 . The data for one case can be
written (xi , yi ), where xi is the i th row of X and yi is the i th element of y .
(xi , yi ) ∈ Rp+2 is the “cloud of data points”, where each point includes
both the independent and dependent variables in a single vector.
Alternatively, we can think of xi ∈ Rp+1 as being the domain of the
regression function E [y |x], which forms a surface above this domain.

58 / 171

Properties of the least squares fit
• The fitted regression surface passes through the mean point (X̄, ȳ).

To see this, note that the fitted surface at X̄ (the vector of column-wise means) is

X̄′β̂ = (1′X/n)β̂ = 1′X(X′X)⁻¹X′y/n,

where 1 is a column vector of 1’s. Since 1 ∈ col(X), it follows that

X(X′X)⁻¹X′1 = 1,

which gives the result, since ȳ = 1′y/n.

59 / 171

Properties of the least squares fit
• The multiple regression residuals sum to zero. The residuals are

r ≡ y − ŷ = y − P_S y = (I − P_S)y = P_{S⊥} y,

where S = col(X). The sum of the residuals can be written

1′(I − X(X′X)⁻¹X′)y,

where 1 is a vector of 1’s. If X includes an intercept, P_S 1 = 1, so

1′X(X′X)⁻¹X′ = 1′,

and hence

1′(I − X(X′X)⁻¹X′) = 0.
60 / 171

Orthogonal matrices
An orthogonal matrix X satisfies X′X = I. This is equivalent to stating that the columns of X are pairwise orthogonal and all columns of X have unit length.

If X is square and orthogonal, then X′ = X⁻¹ and also XX′ = I.

If X is orthogonal then the projection onto col(X) simplifies to

X(X′X)⁻¹X′ = XX′.

If X is orthogonal and the first column of X is constant, it follows that the remaining columns of X are centered and have sample variance 1/(n − 1).

61 / 171

Orthogonal matrices
• If X is orthogonal, the slopes obtained by multiple regression of y on x = (x_1, . . . , x_p) are the same as the slopes obtained by carrying out p simple linear regressions of y on each covariate separately.

To see this, note that the multiple regression slope estimate for the j-th covariate is

β̂_{m,j} = X_{:j}′y,

where X_{:j} is column j of X. Since X′X = I, each covariate has zero sample mean and sample variance equal to 1/(n − 1). Thus the simple linear regression slope for covariate j is

β̂_j = ĉov(X_{:j}, y)/v̂ar(X_{:j}) = X_{:j}′y = β̂_{m,j}.

62 / 171
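A quick numerical check of this claim (added by me; the construction of the orthonormal design via QR is an assumption for the demo): build an X with orthonormal columns whose first column is constant, and compare the multiple regression slopes with the p separate simple regression slopes.

import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
X, _ = np.linalg.qr(Z)                      # orthonormal columns; the first column is constant
y = rng.normal(size=n)

beta_multiple = X.T @ y                     # since X'X = I, (X'X)^{-1} X'y = X'y
for j in range(1, p + 1):
    xj = X[:, j]
    slope_simple = np.cov(xj, y, ddof=1)[0, 1] / np.var(xj, ddof=1)
    print(slope_simple, beta_multiple[j])   # the two slope estimates agree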

Comparing multiple regression and simple regression slopes
• The signs of the multiple regression slopes need not agree with the signs of the corresponding simple regression slopes.

For example, suppose there are two covariates, both with mean zero and variance 1, and for simplicity assume that y has mean zero and variance 1. Let r_12 be the correlation between the two covariates, and let r_1y and r_2y be the correlations between each covariate and the response. It follows that

X′X/(n − 1) = [ n/(n−1)   0      0    ]
              [ 0         1      r_12 ]
              [ 0         r_12   1    ]

X′y/(n − 1) = ( 0,  r_1y,  r_2y )′.

63 / 171

Comparing multiple regression and simple regression slopes
So we can write

β̂ = (X′X/(n − 1))⁻¹(X′y/(n − 1)) = (1/(1 − r_12²)) · ( 0,  r_1y − r_12 r_2y,  r_2y − r_12 r_1y )′.

Thus, for example, if r_1y, r_2y, r_12 ≥ 0 and r_12 > r_1y/r_2y, then β̂_1 has opposite signs in single and multiple regression. Note that if r_1y > r_2y it is impossible for r_12 > r_1y/r_2y. Thus, the effect direction for the covariate that is more strongly marginally correlated with y cannot be reversed. This is an example of “Simpson’s paradox”.

64 / 171

Comparing multiple regression and simple regression slopes

Numerical example: if r1y = 0.1, r2y = 0.6, r12 = 0.6, then x1 is
(marginally) positively associated with y , but for fixed values of x2 , the
association between y and x1 is negative.

65 / 171
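Plugging these numbers into the two-covariate formula from the previous slide (a small check I added, not part of the original slides):

r12, r1y, r2y = 0.6, 0.1, 0.6

# For standardized variables, the simple (marginal) regression slope is just the correlation
print("simple slope for x1:", r1y)

# Multiple regression slopes from the formula on the previous slide
b1 = (r1y - r12 * r2y) / (1 - r12 ** 2)
b2 = (r2y - r12 * r1y) / (1 - r12 ** 2)
print("multiple regression slopes:", b1, b2)   # b1 is negative: the sign is reversed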

Formulation of regression models in terms of probability
distributions
Up to this point, we have primarily expressed regression models in terms of the mean structure, e.g.

E[y|x] = β′x.

Regression models are commonly discussed in terms of conditional moment structures, e.g. the conditional mean E[y|x] and the conditional variance var[y|x].

Alternatively, some regression models are specified in terms of conditional quantiles, Q_p[y|x].
In either case, we are not modeling an explicit, fully-specified probability
distribution.

66 / 171

Formulation of regression models in terms of probability
distributions
If we want to specify the model more completely, we can think in terms of a random “error” term ε that describes how the observed value y deviates from the ideal value f(x).

A very general regression model is

y = f(x, ε),

where ε is a random variable with expected value zero.

If we specify the distribution of ε, then we have fully specified the distribution of y|x.

67 / 171

Formulation of regression models in terms of probability
distributions
A more restrictive “additive error” model is

y = f(x) + ε.

Under this model,

E[y|x] = E[f(x) + ε|x]
       = E[f(x)|x] + E[ε|x]
       = f(x) + E[ε|x].

Without loss of generality, E[ε|x] = 0, so E[y|x] = f(x).

68 / 171

Regression model formulations and parameterizations
A parametric regression model is

y = f(x; θ) + ε,

where θ is a finite dimensional parameter vector.

Examples:

1. The linear response surface model f(x; θ) = θ′x
2. The quadratic response surface model f(x; θ) = θ_1 + θ_2 x + θ_3 x²
3. The Gompertz curve f(x; θ) = θ_1 exp(θ_2 exp(θ_3 x)), with θ_2, θ_3 ≤ 0.

Models 1 and 2 are both “linear models” because they are linear in θ. The Gompertz curve is a non-linear model because it is not linear in θ.

69 / 171

Basic inference for simple linear regression
This section deals with statistical properties of least squares fits that can be derived under minimal conditions.

Specifically, we will derive properties of β̂ that hold for the generating model y = x′β + ε, where:

(i) E[ε|x] = 0;
(ii) var[ε|x] = σ² < ∞ and is constant across cases;
(iii) the ε random variables are uncorrelated across cases (given x).

We will not assume here that ε follows a particular distribution, e.g. a Gaussian distribution.

70 / 171

The relationship between E[ε|x] and cov(ε, x)
If we treat x as a random variable, the condition that

E[ε|x] = 0

for all x implies that cor(x, ε) = 0. This follows from the double expectation theorem:

cov(x, ε) = E[xε] − E[x]·E[ε]
          = E[xε]
          = E_x[E[xε|x]]
          = E_x[xE[ε|x]]
          = E_x[x · 0]
          = 0.

71 / 171

The relationship between E[ε|x] and cov(ε, x)

The converse is not true. If cor(x, ε) = 0 and E[ε] = 0, it may not be the case that E[ε|x] = 0.

For example, suppose x ∈ {−1, 0, 1} and ε ∈ {−1, 1}, with joint distribution

            ε = −1    ε = 1
  x = −1     1/12      3/12
  x =  0     4/12      0
  x =  1     1/12      3/12

Then E[ε] = 0 and cor(x, ε) = 0, but E[ε|x] is not identically zero. When ε and x are jointly Gaussian, cor(ε, x) = 0 implies that ε and x are independent, which in turn implies that E[ε|x] = E[ε] = 0.

72 / 171

Data sampling
To be able to interpret the results of a regression analysis, we need to know how the data were sampled. In particular, it is important to consider whether x and/or y and/or the pair (x, y) should be considered as random draws from a population. Here are two important situations:

• Designed experiment: We are studying the effect of temperature on reaction yield in a chemical synthesis. The temperature x is controlled and set by the experimenter. In this case, y is randomly sampled conditionally on x, but x is not randomly sampled, so it doesn’t make sense to consider x to be a random variable.

• Observational study: We are interested in the relationship between cholesterol level x and blood pressure y. We sample people at random from a well defined population (e.g. residents of Michigan) and measure their blood pressure and cholesterol levels. In this case, x and y are sampled from their joint distribution, and both can be viewed as random variables.

73 / 171

Data sampling
There is another design that we may encounter:

• Case/control study: Again suppose we are interested in the relationship between cholesterol level x and blood pressure y. But now suppose we have an exhaustive list of blood pressure measurements for all residents of Michigan. We wish to select a subset of 500 individuals to contact for acquiring cholesterol measures, and it is decided that studying the 250 people with greatest blood pressure together with the 250 people with lowest blood pressure will be most informative. In this case x is randomly sampled conditionally on y, but y is not randomly sampled.

Summary: Regression models are formulated in terms of the conditional
distribution of y given x. The statistical properties of β̂ are easiest to
calculate and interpret as being conditional on x. The way that the data
are sampled also affects our interpretation of the results.

74 / 171

Basic inference for simple linear regression via OLS

Above we showed that the slope and intercept estimates are

β̂ = ĉov(y, x)/v̂ar[x] = Σ_i y_i(x_i − x̄) / Σ_i (x_i − x̄)²,

α̂ = ȳ − β̂x̄.

Note that we are using the useful fact that

Σ_i (y_i − ȳ)(x_i − x̄) = Σ_i y_i(x_i − x̄) = Σ_i (y_i − ȳ)x_i.

75 / 171

Sampling means of parameter estimates
First we will calculate the sampling means of α̂ and β̂. A useful identity is that

β̂ = Σ_i (α + βx_i + ε_i)(x_i − x̄) / Σ_i (x_i − x̄)²
   = β + Σ_i ε_i(x_i − x̄) / Σ_i (x_i − x̄)².

From this identity it is clear that E[β̂|x] = β. Thus β̂ is unbiased (a parameter estimate is unbiased if its sampling mean is the same as the population value of the parameter).

The intercept is also unbiased:

E[α̂|x] = E[ȳ − β̂x̄|x]
        = α + βx̄ + E[ε̄|x] − E[β̂x̄|x]
        = α.
76 / 171

Sampling variances of parameter estimates

Next we will calculate the sampling variances of α̂ and β̂. These values capture the variability of the parameter estimates over replicated samples from the same population.

First we will need the following result:

cov(β̂, ε̄|x) = Σ_i cov(ε_i, ε̄|x)(x_i − x̄) / Σ_i (x_i − x̄)² = 0,

since cov(ε_i, ε̄|x) = σ²/n does not depend on i.

77 / 171

Sampling variances of parameter estimates
To derive the sampling variances, start with the identity

β̂ = β + Σ_i ε_i(x_i − x̄) / Σ_i (x_i − x̄)².

The sampling variances are

var[β̂|x] = σ² / Σ_i (x_i − x̄)²
          = σ² / ((n − 1)v̂ar[x]),

var[α̂|x] = var[ȳ − β̂x̄|x]
          = var[α + βx̄ + ε̄ − β̂x̄|x]
          = var[ε̄|x] + x̄²var[β̂|x] − 2x̄ cov[ε̄, β̂|x]
          = σ²/n + x̄²σ²/((n − 1)v̂ar[x]).
78 / 171

Sampling covariance of parameter estimates
The sampling covariance between the slope and intercept is

cov[α̂, β̂|x] = cov[ȳ − β̂x̄, β̂|x]
             = cov[ȳ, β̂|x] − x̄ var[β̂|x]
             = cov[ε̄, β̂|x] − x̄ var[β̂|x]
             = −σ²x̄/((n − 1)v̂ar[x]).

When x̄ > 0, it’s easy to see what the expression for cov(α̂, β̂|x) is telling us:

[Figure: a scatterplot of (x, y) data with two candidate fitted lines, labeled “Underestimate slope, overestimate intercept” and “Overestimate slope, underestimate intercept”.]
79 / 171

Basic inference for simple linear regression via OLS

Some observations:

• All variances scale with sample size like 1/n.

• β̂ does not depend on x̄.

• var[α̂] is minimized if x̄ = 0.

• α̂ and β̂ are uncorrelated if x̄ = 0.

80 / 171

Some properties of residuals
Start with the following useful expression:

r_i ≡ y_i − α̂ − β̂x_i = y_i − ȳ − β̂(x_i − x̄).

Since y_i = α + βx_i + ε_i and therefore ȳ = α + βx̄ + ε̄, we can subtract to get

y_i − ȳ = β(x_i − x̄) + ε_i − ε̄,

and it follows that

r_i = (β − β̂)(x_i − x̄) + ε_i − ε̄.

Since E[β̂] = β and E[ε_i − ε̄] = 0, it follows that E[r_i] = 0. Note that this is a distinct fact from the identity Σ_i r_i = 0.

81 / 171

Some properties of residuals

It is important to distinguish the residual r_i from the “observation error” ε_i. The identity r_i = (β − β̂)(x_i − x̄) + ε_i − ε̄ shows us that the centered observation error ε_i − ε̄ is one part of the residual. The other term,

(β − β̂)(x_i − x̄),

reflects the fact that the residuals are also influenced by how well we recover the true slope β through our estimate β̂.

82 / 171

Some properties of residuals

To illustrate, consider two possibilities:

• We overestimate the slope (β − β̂ < 0). The residuals to the right of the mean (i.e. where x_i > x̄) are shifted down (toward −∞), and the residuals to the left of the mean are shifted up.

• We underestimate the slope (β − β̂ > 0). The residuals to the right of the mean are shifted up, and the residuals to the left of the mean are shifted down.

83 / 171

Some properties of residuals
We can use the identity r_i = (β − β̂)(x_i − x̄) + ε_i − ε̄ to derive the variance of each residual. We have:

var[(β − β̂)(x_i − x̄)|x] = σ²(x_i − x̄)² / Σ_j (x_j − x̄)²

var[ε_i − ε̄|x] = var[ε_i|x] + var[ε̄|x] − 2cov(ε_i, ε̄|x)
               = σ² + σ²/n − 2σ²/n
               = σ²(n − 1)/n

cov((β − β̂)(x_i − x̄), ε_i − ε̄|x) = −(x_i − x̄)cov(β̂, ε_i|x)
                                   = −σ²(x_i − x̄)² / Σ_j (x_j − x̄)².
84 / 171

Some properties of residuals
Thus the variance of the i-th residual is

var[r_i|x] = var[(β − β̂)(x_i − x̄) + ε_i − ε̄ | x]
           = var[(β − β̂)(x_i − x̄)|x] + var[ε_i − ε̄|x] + 2cov((β − β̂)(x_i − x̄), ε_i − ε̄|x)
           = (n − 1)σ²/n − σ²(x_i − x̄)² / Σ_j (x_j − x̄)².

Note the following fact, which ensures that this expression is non-negative:

(x_i − x̄)² / Σ_j (x_j − x̄)² ≤ (n − 1)/n.

85 / 171

Some properties of residuals

• The residuals are not iid – they are correlated with each other, and they have different variances.

• var[r_i] < var[ε_i] – the residuals are less variable than the errors.

86 / 171

Some properties of residuals

Since E[r_i] = 0, it follows that var[r_i] = E[r_i²]. Therefore the expected value of Σ_i r_i² is

(n − 1)σ² − σ² = (n − 2)σ²,

since the terms (x_i − x̄)² / Σ_j (x_j − x̄)² sum to one.

87 / 171

Estimating the error variance σ 2
Since

E[ Σ_i r_i² ] = (n − 2)σ²,

it follows that

Σ_i r_i² / (n − 2)

is an unbiased estimate of σ². That is,

E[ Σ_i r_i² / (n − 2) ] = σ².

88 / 171
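A simulation sketch (my own addition) that checks the unbiasedness claim E[Σ r_i²/(n − 2)] = σ² for simple linear regression; the simulated design, true coefficients, and σ² are assumptions for the demo.

import numpy as np

rng = np.random.default_rng(4)
n, sigma2 = 30, 4.0
x = rng.uniform(0, 10, size=n)

est = []
for _ in range(5000):
    y = 1.0 + 0.5 * x + rng.normal(scale=np.sqrt(sigma2), size=n)
    beta_hat = np.sum(y * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
    alpha_hat = y.mean() - beta_hat * x.mean()
    r = y - alpha_hat - beta_hat * x
    est.append(np.sum(r ** 2) / (n - 2))   # the unbiased variance estimate from the slide

print(np.mean(est))                        # close to sigma2 = 4.0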

Review of moment calculations for random vectors

Suppose that z ∈ R^d is a random vector. The expected value of z, denoted E[z], is a fixed vector in R^d, and the covariance matrix of z, denoted cov[z], is a fixed d × d matrix.

(cov[z])_{ij} = cov(z_i, z_j) is the covariance between the i-th and j-th elements of z.

The diagonal element (cov[z])_{ii} = var[z_i] is the variance of the i-th element of z.

89 / 171

Basic inference for multiple linear regression via OLS
We have the following useful identity for the multiple linear regression least squares fit:

β̂ = (X′X)⁻¹X′y = (X′X)⁻¹X′(Xβ + ε) = β + (X′X)⁻¹X′ε.

Letting

η = (X′X)⁻¹X′ε,

we see that E[η|X] = 0 and

var[η|X] = var[β̂|X] = (X′X)⁻¹X′var[ε|X]X(X′X)⁻¹ = σ²(X′X)⁻¹.

• var[ε|X] = σ²I since (i) the ε_i are uncorrelated and (ii) the ε_i have constant variance given x.
90 / 171
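The following numpy sketch (an added illustration) computes standard errors of the coefficients from this formula; it plugs in the unbiased estimate σ̂² = RSS/(n − p − 1) that is derived a few slides later, and the simulated data are my own.

import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)    # unbiased estimate of sigma^2 (derived later in the deck)

cov_beta = sigma2_hat * XtX_inv             # estimated cov(beta_hat | X) = sigma^2 (X'X)^{-1}
print(np.sqrt(np.diag(cov_beta)))           # standard errors of the coefficients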

Variance of β̂ in multiple regression OLS
Let u_i be the i-th row of the design matrix X. Then

X′X = Σ_i u_i′u_i,

where u_i′u_i is an outer product (a (p + 1) × (p + 1) matrix).

If we have the limiting behavior

n⁻¹ Σ_i u_i′u_i → Q

for a fixed (p + 1) × (p + 1) matrix Q, then X′X ≈ nQ, so

cov(β̂|X) ≈ σ²Q⁻¹/n.

Thus we see the usual influence of sample size on the standard errors of the regression coefficients.
91 / 171

Level sets of quadratic forms
Suppose

C = [ 1     0.5 ]          so that          C⁻¹ = (2/3) [  2   −1 ]
    [ 0.5   1   ]                                       [ −1    2 ]

Left side: level curves of g(x) = x′Cx
Right side: level curves of h(x) = x′C⁻¹x

[Figure: two contour plots over [−3, 3] × [−3, 3]; the elliptical level curves of g and of h are elongated along perpendicular diagonal directions.]
92 / 171

Eigen-decompositions and quadratic forms
The dominant eigenvector of C maximizes the “Rayleigh quotient”

g(x) = x′Cx/x′x,

thus it points in the direction of greatest change of g.

If

C = [ 1     0.5 ]
    [ 0.5   1   ],

the dominant eigenvector points in the (1, 1) direction. The dominant eigenvector of C⁻¹ points in the (−1, 1) direction.

If the eigendecomposition of C is Σ_j λ_j v_j v_j′, then the eigendecomposition of C⁻¹ is Σ_j λ_j⁻¹ v_j v_j′ – thus directions in which the level curves of C are most spread out are the directions in which the level curves of C⁻¹ are most compressed.
93 / 171

Variance of β̂ in multiple regression OLS
If p = 2 and

X′X/n = [ 1   0   0 ]
        [ 0   1   r ]
        [ 0   r   1 ],

then

cov(β̂) = σ²n⁻¹ [ 1        0              0          ]
                [ 0    1/(1 − r²)    −r/(1 − r²)     ]
                [ 0   −r/(1 − r²)     1/(1 − r²)     ].

So if p = 2 and x_1 and x_2 are positively collinear (meaning that (x_1 − x̄_1)′(x_2 − x̄_2) > 0), the corresponding slope estimates β̂_1 and β̂_2 are negatively correlated.

94 / 171

The residual sum of squares
The residual sum of squares (RSS) is the squared norm of the residual vector:

RSS = ‖y − ŷ‖² = ‖y − Py‖² = ‖(I − P)y‖² = y′(I − P)(I − P)y = y′(I − P)y,

where P is the projection matrix onto col(X). The last equality follows from the fact that I − P is a projection and hence is idempotent.

95 / 171

The expected value of the RSS

The expression RSS = y′(I − P)y is a quadratic form in y, and we can write

y′(I − P)y = tr(y′(I − P)y) = tr((I − P)yy′),

where the second equality uses the circulant property of the trace. For three factors, the circulant property states that

tr(ABC) = tr(CAB) = tr(BCA).

96 / 171

The expected value of the RSS
By linearity we have

E tr[(I − P)yy′] = tr[(I − P) · E yy′],

and

E yy′ = E[Xββ′X′] + E[Xβε′] + E[εβ′X′] + E[εε′]
      = Xββ′X′ + E[εε′]
      = Xββ′X′ + σ²I.

Since PX = X and hence (I − P)X = 0,

(I − P)E[yy′] = σ²(I − P).

Therefore the expected value of the RSS is E[RSS] = σ²tr(I − P).

97 / 171

Four more properties of projection matrices

Property 9: A projection matrix P_S is symmetric. One way to show this is to let v_1, . . . , v_q be an orthonormal basis for S, where P_S is the projection onto S. Then complete the v_j with v_{q+1}, . . . , v_d to get an orthonormal basis of R^d. By direct calculation,

(P_S − Σ_{j=1}^q v_j v_j′) v_k = 0

for all k, hence P_S = Σ_{j=1}^q v_j v_j′, which is symmetric.

98 / 171

Four more properties of projection matrices

Property 10: A projection matrix is positive semidefinite. Let v be an arbitrary vector and write v = v_1 + v_2, where v_1 ∈ S and v_2 ∈ S⊥. Then

(v_1 + v_2)′P_S(v_1 + v_2) = v_1′v_1 ≥ 0.

99 / 171

Four more properties of projection matrices

Property 11: The eigenvalues of a projection matrix PS must be zero or
one.
Suppose λ, v is an eigenvalue/eigenvector pair:
PS v = λv .
Since PS is the projection onto the subspace S, this implies that λv is
the closest element of S to v . But if λv ∈ S then v ∈ S, and is strictly
closer to v than λv , unless λ = 1 or v = 0. Therefore only 0 and 1 can
be eigenvalues of PS .

100 / 171

Four more properties of projection matrices

Property 12: The trace of a projection matrix is its rank.
The rank of a matrix is the number of nonzero eigenvalues. The trace of
a matrix is the sum of all eigenvalues. Since the nonzero eigenvalues of a
projection matrix are all 1, the rank and the trace must be identical.

101 / 171

The expected value of the RSS
We know that E [RSS] = σ 2 tr(I − P), where P is the projection onto
col(X ).
Since I − P is the projection onto col(X )⊥ , I − P has rank
n − rank(X ) = n − p − 1. Thus
E [RSS] = σ 2 (n − p − 1),
so
RSS/(n − p − 1)
is an unbiased estimate of σ 2 .

102 / 171

Covariance matrix of residuals

Since E[β̂] = β, it follows that E[ŷ] = Xβ = E[y]. Therefore we can derive the following simple expression for the covariance matrix of the residuals:

cov(y − ŷ) = E[(y − ŷ)(y − ŷ)′]
           = (I − P)E[yy′](I − P)
           = (I − P)(Xββ′X′ + σ²I)(I − P)
           = σ²(I − P).

103 / 171

Distribution of the RSS

The RSS can be written

RSS = tr[(I − P)yy′] = tr[(I − P)εε′].

Therefore, the distribution of the RSS does not depend on β. It also depends on X only through col(X).

104 / 171

Distribution of the RSS
If the distribution of ε is invariant under orthogonal transforms, i.e.

ε =_d Qε

when Q is a square orthogonal matrix, then we can make the stronger statement that the distribution of the RSS only depends on X through its rank.

To see this, construct a square orthogonal matrix Q so that Q′(I − P)Q is the projection onto a fixed subspace S of dimension n − p − 1 (so Q′ maps col(I − P) to S). Then

tr[(I − P)εε′] =_d tr[(I − P)Qε(Qε)′] = tr[Q′(I − P)Qεε′].

Note that since Q is square we have QQ′ = Q′Q = I.
105 / 171

Optimality
For a given design matrix X, there are many linear estimators that are unbiased for β. That is, there are many matrices M ∈ R^{(p+1)×n} such that

E[My|X] = β

for all β. The Gauss-Markov theorem states that among these, the least squares estimate is “best,” in the sense that its covariance matrix is “smallest.”

Here we are using the definition that a matrix A is “smaller” than a matrix B if B − A is positive semidefinite.

106 / 171

Optimality (BLUE)

Letting β∗ = My be any linear unbiased estimator of β, the Gauss-Markov theorem states that

cov(β̂|X) ≤ cov(β∗|X),

which implies that for any fixed vector θ,

var[θ′β̂|X] ≤ var[θ′β∗|X].

The Gauss-Markov theorem thus implies that the least squares estimate β̂ is the “BLUE” (best linear unbiased estimator) for the least squares model.

107 / 171

Optimality (proof of GMT)
The idea of the proof is to show that for any linear unbiased estimate β∗ of β, β∗ − β̂ and β̂ are uncorrelated. It follows that

cov(β∗|X) = cov(β∗ − β̂ + β̂|X)
          = cov(β∗ − β̂|X) + cov(β̂|X)
          ≥ cov(β̂|X).

To prove the theorem, note that

E[β∗|X] = M·E[y|X] = MXβ = β

for all β, and let B = (X′X)⁻¹X′, so that

E[β̂|X] = BXβ = β

for all β, so (M − B)X ≡ 0.
108 / 171

Optimality (proof of GMT)
Therefore

cov(β∗ − β̂, β̂|X) = E[(M − B)y(By − β)′|X]
                  = E[(M − B)yy′B′|X] − E[(M − B)yβ′|X]
                  = (M − B)(Xββ′X′ + σ²I)B′ − (M − B)Xββ′
                  = σ²(M − B)B′
                  = 0.

Note that we have an explicit expression for the gap between cov(β̂) and cov(β∗):

cov(β∗ − β̂|X) = σ²(M − B)(M − B)′.

109 / 171

Multivariate density families
If C is a covariance matrix, many families of multivariate densities have the form c · φ((x − µ)′C⁻¹(x − µ)), where c > 0 and φ : R⁺ → R⁺ is a function, typically decreasing with a mode at the origin (e.g. for the multivariate normal density, φ(u) = exp(−u/2), and for the multivariate t-distribution with d degrees of freedom, φ(u) = (1 + u/d)^{−(d+1)/2}).

Left side: level sets of g(x) = x′Cx
Right side: level sets of a density c · φ((x − µ)′C⁻¹(x − µ))

[Figure: two contour plots over [−3, 3] × [−3, 3]; the elliptical level sets of the quadratic form and of the corresponding density are elongated along perpendicular diagonal directions.]

110 / 171

Regression inference with Gaussian errors

The random vector x = (x_1, . . . , x_p)′ has a p-dimensional standard multivariate normal distribution if its components are independent and follow standard normal marginal distributions.

The density of x is the product of p standard normal densities:

p(x) = (2π)^{−p/2} exp(−(1/2) Σ_j x_j²) = (2π)^{−p/2} exp(−x′x/2).

111 / 171

Regression inference with Gaussian errors

If we transform

Z = µ + Ax,

we get a random variable satisfying

E[Z] = µ,        cov(Z) = A cov(x) A′ = AA′ ≡ Σ.

112 / 171

Regression inference with Gaussian errors
The density of Z can be obtained using the change of variables formula:

p(Z) = (2π)^{−p/2} |A⁻¹| exp(−(1/2)(Z − µ)′A^{−T}A⁻¹(Z − µ))
     = (2π)^{−p/2} |Σ|^{−1/2} exp(−(1/2)(Z − µ)′Σ⁻¹(Z − µ)).

This distribution is denoted N(µ, Σ). The log-density is

−(1/2) log |Σ| − (1/2)(Z − µ)′Σ⁻¹(Z − µ),

with the constant term dropped.

113 / 171

Regression inference with Gaussian errors

The joint log-density for an iid sample of size n is

−(n/2) log |Σ| − (1/2) Σ_i (x_i − µ)′Σ⁻¹(x_i − µ) = −(n/2) log |Σ| − (n/2) tr(S_{xx}Σ⁻¹),

where

S_{xx} = Σ_i (x_i − µ)(x_i − µ)′/n.

114 / 171

The Cholesky decomposition

If Σ is a non-singular covariance matrix, there is a lower triangular matrix A with positive diagonal elements such that

AA′ = Σ.

This matrix can be denoted Σ^{1/2}, and is called the Cholesky square root.

115 / 171
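A small numpy sketch (added) of the Cholesky factor and its use for drawing from N(µ, Σ), mirroring the transformation Z = µ + Ax used on the next slides; the particular Σ and µ are assumed for the demo.

import numpy as np

Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
A = np.linalg.cholesky(Sigma)        # lower triangular, A @ A.T == Sigma
print(np.allclose(A @ A.T, Sigma))

rng = np.random.default_rng(6)
mu = np.array([1.0, -1.0])
x = rng.normal(size=(5000, 2))       # iid standard normal rows
Z = mu + x @ A.T                     # rows of Z are draws from approximately N(mu, Sigma)
print(np.cov(Z, rowvar=False))       # close to Sigma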

Properties of the multivariate normal distribution:
A linear function of a multivariate normal random vector is also multivariate normal. Specifically, if

x ∼ N(µ, Σ)

is p-variate normal, and θ is a q × p matrix with q ≤ p, then y = θx has a

N(θµ, θΣθ′)

distribution.

To prove this fact, let Z ∼ N(0, I_p), and write

x = µ + AZ,

where AA′ = Σ is the Cholesky decomposition.
116 / 171

Properties of the multivariate normal distribution:
Next, extend θ to a square invertible matrix

θ̃ = [ θ  ]
     [ θ∗ ],

where θ∗ ∈ R^{(p−q)×p}. The matrix θ∗ can be chosen such that

θΣθ∗′ = 0,

by the Gram-Schmidt procedure. Let

ỹ = θ̃x = [ y  ]  =  [ θµ + θAZ   ]
          [ y∗ ]     [ θ∗µ + θ∗AZ ].
117 / 171

Properties of the multivariate normal distribution:
Therefore

cov(ỹ) = [ θΣθ′     0       ]
         [ 0        θ∗Σθ∗′  ],

and

cov(ỹ)⁻¹ = [ (θΣθ′)⁻¹     0            ]
           [ 0            (θ∗Σθ∗′)⁻¹   ].

Using the change of variables formula and the structure of the multivariate normal density, it follows that

p(ỹ) = p(y)p(y∗).

This implies that y and y∗ are independent, and by inspecting the form of their densities, both are seen to be multivariate normal.
118 / 171

Properties of the multivariate normal distribution:

• A consequence of the above argument is that uncorrelated components of a multivariate normal vector are independent.

• If x is a standard multivariate normal vector and Q is a square orthogonal matrix, then Qx is also standard multivariate normal. This follows directly from the fact that QQ′ = I.

119 / 171

The χ2 distribution
If z is a standard normal random variable, the density of z² can be calculated directly as

p(z) = z^{−1/2} exp(−z/2)/√(2π).

This is the χ²₁ distribution. The χ²_p distribution is defined to be the distribution of

Σ_{j=1}^p z_j²,

where z_1, . . . , z_p are iid standard normal random variables.

120 / 171

The moments of the χ2 distribution

By direct calculation, if F ∼ χ²₁, then

E[F] = 1,        var[F] = 2.

Therefore the mean of the χ²_p distribution is p and the variance is 2p.

121 / 171

The density of the χ2 distribution
The χ²_p density is

p(x) = (1/(Γ(p/2)2^{p/2})) x^{p/2−1} exp(−x/2).

To prove this by induction, assume that the χ²_{p−1} density is

(1/(Γ((p−1)/2)2^{(p−1)/2})) x^{(p−1)/2−1} exp(−x/2).

The χ²_p density is the density of the sum of independent χ²_{p−1} and χ²₁ random variables, which can be written
122 / 171

The density of the χ2 distribution
(1/(Γ((p−1)/2)Γ(1/2)2^{p/2})) ∫₀^x s^{(p−1)/2−1} exp(−s/2) (x − s)^{−1/2} exp(−(x − s)/2) ds

  = (1/(Γ((p−1)/2)Γ(1/2)2^{p/2})) exp(−x/2) ∫₀^x s^{(p−1)/2−1} (x − s)^{−1/2} ds

  = (1/(Γ((p−1)/2)Γ(1/2)2^{p/2})) exp(−x/2) x^{p/2−1} ∫₀¹ u^{p/2−3/2} (1 − u)^{−1/2} du,

where u = s/x.

The density for χ²_p can now be obtained by applying the identities

Γ(1/2) = √π,        ∫₀¹ u^{α−1}(1 − u)^{β−1} du = Γ(α)Γ(β)/Γ(α + β).

123 / 171

The χ2 distribution and the RSS
Let P be a projection matrix and z be a vector of iid standard normal values. For any square orthogonal matrix Q,

z′Pz = (Qz)′QPQ′(Qz).

Since Qz is equal in distribution to z, z′Pz is equal in distribution to z′QPQ′z.

If the rank of P is k, we can choose Q so that QPQ′ is the projection onto the span of the first k canonical basis vectors, i.e.

QPQ′ = [ I_k   0 ]
       [ 0     0 ],

where I_k is the k × k identity matrix.
124 / 171

The χ2 distribution and the RSS
This gives us

z′Pz =_d z′QPQ′z = Σ_{j=1}^k z_j²,

which follows a χ²_k distribution.

It follows that

((n − p − 1)/σ²) σ̂² = y′(I − P)y/σ² = (ε/σ)′(I − P)(ε/σ) ∼ χ²_{n−p−1}.

125 / 171

The χ2 distribution and the RSS
Thus when the errors are Gaussian, we have

var[σ̂²] = 2σ⁴/(n − p − 1)

and

SD[σ̂²] = σ² (2/(n − p − 1))^{1/2}.

Note that the SD of σ̂² depends on σ², whereas the SD of β̂ does not depend on β.

126 / 171

The t distribution
Suppose Z is standard normal, V ∼ χ²_p, and V is independent of Z. Then

T = √p · Z/√V

has a “t distribution with p degrees of freedom.”

Note that by the law of large numbers, V/p converges almost surely to 1 as p grows. Therefore T converges in distribution to a standard normal distribution.

To derive the t density, apply the change of variables formula. Let

(U, W) ≡ (Z/√V, V).

127 / 171

The t distribution
The Jacobian is

J = det [ 1/√V    −Z/2V^{3/2} ]  =  V^{−1/2} = W^{−1/2}.
        [ 0        1          ]

Since the joint density of Z and V is

p(Z, V) ∝ exp(−Z²/2) V^{p/2−1} exp(−V/2),

it follows that the joint density of U and W is

p(U, W) ∝ exp(−U²W/2) W^{p/2−1/2} exp(−W/2)
        = exp(−W(U² + 1)/2) W^{p/2−1/2}.
128 / 171

The t distribution
Therefore

p(U) = ∫ p(U, W) dW
     ∝ ∫ exp(−W(U² + 1)/2) W^{p/2−1/2} dW
     = (U² + 1)^{−p/2−1/2} ∫ exp(−y/2) y^{p/2−1/2} dy
     ∝ (U² + 1)^{−p/2−1/2},

where y = W(U² + 1).

Finally, write T = √p·U to get

p(T) ∝ (T²/p + 1)^{−(p+1)/2}.

129 / 171

Independence of estimated mean and variance parameters

To review, the linear model residuals are
R ≡ y − ŷ = (I − P)y
and the estimated coefficients are
β̂ = (X 0 X )−1 X 0 y .

130 / 171

Independence of estimated mean and variance parameters
Recalling that (I − P)X = 0 and E[y − ŷ|X] = 0, it follows that

cov(y − ŷ, β̂) = E[(y − ŷ)β̂′]
              = E[(I − P)yy′X(X′X)⁻¹]
              = (I − P)(Xββ′X′ + σ²I)X(X′X)⁻¹
              = 0.

Therefore every estimated coefficient is uncorrelated with every residual. If ε is Gaussian, they are also independent.

Since σ̂² is a function of the residuals, it follows that if ε is Gaussian, then β̂ and σ̂² are independent.

131 / 171

Confidence interval for a regression coefficient
Let

V_k = [(X′X)⁻¹]_{kk},

so that

var[β̂_k|X] = σ²V_k.

If the ε are multivariate Gaussian N(0, σ²I), then

(β̂_k − β_k)/(σ√V_k) ∼ N(0, 1).

132 / 171

Confidence intervals for a regression coefficient

Furthermore,

(n − p − 1)σ̂²/σ² ∼ χ²_{n−p−1},

and using the fact that β̂ and σ̂² are independent, it follows that the pivotal quantity

(β̂_k − β_k)/√(σ̂²V_k)

has a t-distribution with n − p − 1 degrees of freedom.

133 / 171

Confidence intervals for a regression coefficient

Therefore if Q_T(q, k) is the q-th quantile of the t-distribution with k degrees of freedom, then for 0 ≤ α ≤ 1 and

q_α = Q_T(1 − (1 − α)/2, n − p − 1),

P( −q_α ≤ (β̂_k − β_k)/√(σ̂²V_k) ≤ q_α ) = α.

Rearranging terms, we get the confidence interval

P( β̂_k − σ̂q_α√V_k ≤ β_k ≤ β̂_k + σ̂q_α√V_k ) = α,

which has coverage probability α.

134 / 171
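A sketch of this interval (added by me) using scipy.stats.t for the quantile; α is the coverage probability as in the slides (e.g. 0.95), and the simulated data are assumptions for the demo.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma_hat = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - p - 1))

alpha = 0.95
k = 1                                                  # coefficient of interest
q = stats.t.ppf(1 - (1 - alpha) / 2, df=n - p - 1)     # q_alpha from the slide
half = sigma_hat * q * np.sqrt(XtX_inv[k, k])
print(beta_hat[k] - half, beta_hat[k] + half)          # CI with coverage probability alpha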

Confidence intervals for the expected response
Let x∗ be any point in R^{p+1}. The expected response at x = x∗ is

E[y|x = x∗] = β′x∗.

A point estimate for this value is

β̂′x∗,

which is unbiased since β̂ is unbiased, and has variance

var[β̂′x∗] = σ²x∗′(X′X)⁻¹x∗ ≡ σ²V_{x∗}.

135 / 171

Confidence intervals for the expected response

As above, we have that

(β̂′x∗ − β′x∗)/√(σ̂²V_{x∗})

has a t-distribution with n − p − 1 degrees of freedom.

Therefore

P( β̂′x∗ − σ̂q_α√V_{x∗} ≤ β′x∗ ≤ β̂′x∗ + σ̂q_α√V_{x∗} ) = α

defines a CI for E[y|x = x∗] with coverage probability α.

136 / 171

Prediction intervals

Suppose y∗ is a new observation at x = x∗, independent of the data used to estimate β̂ and σ̂². If the errors are Gaussian, then y∗ − β̂′x∗ is Gaussian, with the following mean and variance:

E[y∗ − β̂′x∗|X] = β′x∗ − β′x∗ = 0

and

var[y∗ − β̂′x∗|X] = σ²(1 + V_{x∗}).

137 / 171

Prediction intervals

It follows that

(y∗ − β̂′x∗)/√(σ̂²(1 + V_{x∗}))

has a t-distribution with n − p − 1 degrees of freedom. Therefore a prediction interval at x∗ with coverage probability α is defined by

P( β̂′x∗ − σ̂q_α√(1 + V_{x∗}) ≤ y∗ ≤ β̂′x∗ + σ̂q_α√(1 + V_{x∗}) ) = α.

138 / 171
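Continuing the same kind of sketch (added), a prediction interval at a new covariate vector x∗; the specific x_star and simulated data are my own assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma_hat = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - p - 1))

x_star = np.array([1.0, 0.5, -0.5])                    # new point, including the intercept entry
V_star = x_star @ XtX_inv @ x_star
alpha = 0.95
q = stats.t.ppf(1 - (1 - alpha) / 2, df=n - p - 1)
half = sigma_hat * q * np.sqrt(1 + V_star)             # note the extra "1 +" because y* is a new draw
print(x_star @ beta_hat - half, x_star @ beta_hat + half)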

Wald tests
Suppose we want to carry out a formal hypothesis test for the null hypothesis β_k = 0, for some specified index k.

Since var(β̂_k|X) = σ²V_k, the “Z-score”

β̂_k/√(σ²V_k)

follows a standard normal distribution under the null hypothesis. The “Z-test” or “asymptotic Wald test” rejects the null hypothesis if |β̂_k|/√(σ²V_k) > F⁻¹(1 − α/2), where F is the standard normal cumulative distribution function (CDF).

More generally, we can also test hypotheses of the form β_k = c using the Z-score

(β̂_k − c)/√(σ²V_k).

139 / 171

Wald tests

If the errors are normally distributed, then

β̂_k/√(σ̂²V_k)

follows a t-distribution with n − p − 1 degrees of freedom under the null hypothesis, and the Wald test rejects the null hypothesis if |β̂_k|/√(σ̂²V_k) > F_t⁻¹(1 − α/2, n − p − 1), where F_t(·, d) is the Student t CDF with d degrees of freedom, and α is the type-I error probability (e.g. α = 0.05).

If the sample size is large, the Wald statistic β̂_k/√(σ̂²V_k) approximately follows a standard normal distribution (under the null hypothesis) even if the residual variation (ε) is not Gaussian.

140 / 171
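A Wald t-test sketch (added, with simulated data of my own) for H0: β_k = 0, reporting the statistic and its two-sided p-value.

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, p = 120, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.4, 0.0]) + rng.normal(size=n)   # coefficient 2 is zero, so H0 holds for k = 2

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)

k = 2
t_stat = beta_hat[k] / np.sqrt(sigma2_hat * XtX_inv[k, k])
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - p - 1)   # two-sided p-value
print(t_stat, p_value)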

Wald tests for contrasts

More generally, we can consider the contrast θ′β defined by a fixed vector θ ∈ R^{p+1}. The population value of the contrast can be estimated by the plug-in estimate θ′β̂.

We can test the null hypothesis θ′β = 0 with the Z-score

θ′β̂/√(σ̂²θ′Vθ),

where V = (X′X)⁻¹. This statistic approximately follows a standard normal distribution, and more “exactly” (under specified assumptions) it follows a Student t-distribution with n − p − 1 degrees of freedom (all under the null hypothesis).

141 / 171

F-tests
Suppose we have two nested design matrices X_1 ∈ R^{n×p_1} and X_2 ∈ R^{n×p_2}, such that

col(X_1) ⊂ col(X_2).

We may wish to compare the model defined by X_1 to the model defined by X_2. To do this, we need a test statistic that discriminates between the two models.

Let P_1 and P_2 be the corresponding projections, and let

ŷ⁽¹⁾ = P_1 y,        ŷ⁽²⁾ = P_2 y

be the fitted values.

142 / 171

F-tests
Due to the nesting, P_2P_1 = P_1 and P_1P_2 = P_1. Therefore

(P_2 − P_1)² = P_2 − P_1,

so P_2 − P_1 is a projection; it projects onto the orthogonal complement of col(X_1) within col(X_2).

Since (I − P_2)(P_2 − P_1) = 0, it follows that if E[y] ∈ col(X_2) then

cov(y − ŷ⁽²⁾, ŷ⁽²⁾ − ŷ⁽¹⁾) = 0.

If the linear model errors are Gaussian, y − ŷ⁽²⁾ and ŷ⁽²⁾ − ŷ⁽¹⁾ are independent.

If E[y] ∈ col(X_1) then (y − ŷ⁽²⁾)/σ and (ŷ⁽²⁾ − ŷ⁽¹⁾)/σ each have zero mean and unit standard deviation.

143 / 171

F-tests
Since P_2X_1 = P_1X_1 = X_1, we have

(I − P_2)X_1 = (P_2 − P_1)X_1 = 0.

Now suppose we take as the null hypothesis that E[y] ∈ col(X_1), so we can write y = θ + ε, where θ ∈ col(X_1). Therefore under the null hypothesis

‖y − ŷ⁽²⁾‖² = tr[(I − P_2)yy′] = tr[(I − P_2)εε′] = ε′(I − P_2)ε

and

‖ŷ⁽²⁾ − ŷ⁽¹⁾‖² = tr[(P_2 − P_1)yy′] = tr[(P_2 − P_1)εε′] = ε′(P_2 − P_1)ε.

144 / 171

F-tests
Since I − P_2 and P_2 − P_1 are projections onto subspaces of dimension n − p_2 and p_2 − p_1, respectively, it follows that

‖y − ŷ⁽²⁾‖²/σ² = ‖(I − P_2)y‖²/σ² ∼ χ²_{n−p_2},

and under the null hypothesis

‖ŷ⁽²⁾ − ŷ⁽¹⁾‖²/σ² = ‖(P_2 − P_1)y‖²/σ² ∼ χ²_{p_2−p_1}.

Therefore consider the ratio

(‖ŷ⁽²⁾ − ŷ⁽¹⁾‖²/(p_2 − p_1)) / (‖y − ŷ⁽²⁾‖²/(n − p_2)).

Since ‖ŷ⁽²⁾ − ŷ⁽¹⁾‖² will tend to be large when E[y|X] ∉ col(X_1), i.e. when the null hypothesis is false, this quantity can be used as a test statistic. It is called the F-test statistic.
145 / 171

The F distribution
The F-test statistic is the ratio between two rescaled independent χ² draws.

If U ∼ χ²_p and V ∼ χ²_q are independent, then

(U/p)/(V/q)

has an “F-distribution with p, q degrees of freedom,” denoted F_{p,q}.

To derive the kernel of the density, let

(x, y) ≡ (U/V, V).

146 / 171

The F distribution
The Jacobian of the map is

| 1/V   −U/V² |
|  0       1  |  =  1/V.

The joint density of U and V is

p(U, V) ∝ U^{p/2−1} exp(−U/2) V^{q/2−1} exp(−V/2).

The joint density of x and y is

p(x, y) ∝ x^{p/2−1} y^{(p+q)/2−1} exp(−y(x + 1)/2).

147 / 171

The F distribution
Now let Z = y(x + 1), so

p(x, Z) ∝ x^{p/2−1} Z^{(p+q)/2−1} (x + 1)^{−(p+q)/2} exp(−Z/2)

and hence, integrating over Z,

p(x) ∝ x^{p/2−1} / (x + 1)^{(p+q)/2}.

148 / 171

The F distribution

Now if we let

F = (U/p) / (V/q) = (q/p) · x,

then

p(F) ∝ F^{p/2−1} / (pF/q + 1)^{(p+q)/2}.
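
A quick Monte Carlo check of this derivation (assuming scipy's standard
parameterization of the F distribution): simulated quantiles of (U/p)/(V/q)
should agree with the F_{p,q} reference quantiles.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p, q = 3, 20                            # hypothetical degrees of freedom
U = rng.chisquare(p, size=100_000)
V = rng.chisquare(q, size=100_000)
F = (U / p) / (V / q)

probs = [0.5, 0.9, 0.95, 0.99]
print(np.quantile(F, probs))            # simulated quantiles
print(stats.f.ppf(probs, p, q))         # reference F(p, q) quantiles; should be close
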

149 / 171

The F distribution

Therefore, under the null hypothesis, the F-test statistic follows an
F_{p2−p1, n−p2} distribution:

[‖ŷ⁽²⁾ − ŷ⁽¹⁾‖²/(p2 − p1)] / [‖y − ŷ⁽²⁾‖²/(n − p2)] ∼ F_{p2−p1, n−p2}.
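
In practice the reference distribution is used to obtain a p-value or a
critical value; a minimal sketch with purely hypothetical numbers:

from scipy import stats

fstat, df1, df2 = 4.1, 2, 96                      # hypothetical statistic and degrees of freedom
p_value = stats.f.sf(fstat, df1, df2)             # P(F_{df1, df2} > fstat)
reject = fstat > stats.f.ppf(0.95, df1, df2)      # level-0.05 test
print(p_value, reject)
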

150 / 171

Simultaneous confidence intervals
If θ is a fixed vector, we can cover the value θ′β with probability α by
pivoting on the t_{n−p−1}-distributed pivotal quantity

(θ′β̂ − θ′β) / (σ̂ √Vθ),

where Vθ = θ′(X′X)⁻¹θ.

Now suppose we have a set T ⊂ Rp+1 of vectors θ, and we want to
construct a set of confidence intervals such that

P(all θ′β covered, θ ∈ T) = α.

We call this a set of simultaneous confidence intervals for {θ′β; θ ∈ T}.
151 / 171

The Bonferroni approach to simultaneous confidence
intervals
The Bonferroni approach can be applied when T is a finite set, |T| = k.
Let

Ij = I(CI j covers θj′β)

and

Ij′ = I(CI j does not cover θj′β).

Then the union bound implies that

P(I1 and I2 · · · and Ik) = 1 − P(I1′ or I2′ · · · or Ik′)
                          ≥ 1 − Σj P(Ij′).
152 / 171

The Bonferroni approach to simultaneous confidence
intervals
As long as

1 − α ≥ Σj P(Ij′),

the intervals cover simultaneously. One way to achieve this is if each
interval individually has probability

α′ ≡ 1 − (1 − α)/k

of covering its corresponding true value. To do this, use the same
approach as used to construct single confidence intervals, but with a
much larger value of the multiplier qα.
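
A minimal sketch of Bonferroni-adjusted simultaneous intervals for all
regression coefficients, with hypothetical simulated data; the multiplier is
the t quantile corresponding to the per-interval coverage α′ = 1 − (1 − α)/k.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

alpha = 0.95                                     # target simultaneous coverage (document's convention)
k = p + 1                                        # one interval per coefficient
alpha_prime = 1 - (1 - alpha) / k                # per-interval coverage
q = stats.t.ppf((1 + alpha_prime) / 2, df=n - p - 1)   # Bonferroni-adjusted multiplier
print(np.column_stack([beta_hat - q * se, beta_hat + q * se]))
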

153 / 171

The Scheffé approach to simultaneous confidence intervals
The Scheffé approach can be applied if T is a linear subspace of Rp+1.
Begin with the pivotal quantity

(θ′β̂ − θ′β) / √(σ̂² θ′(X′X)⁻¹θ),

and postulate that a symmetric interval can be found so that

P( −Qα ≤ (θ′β̂ − θ′β) / √(σ̂² θ′(X′X)⁻¹θ) ≤ Qα for all θ ∈ T ) = α.

Equivalently, we can write

P( sup_{θ∈T} (θ′β̂ − θ′β)² / (σ̂² θ′(X′X)⁻¹θ) ≤ Qα² ) = α.

154 / 171

The Scheffé approach to simultaneous confidence intervals
Since

β̂ − β = (X′X)⁻¹X′ε,

we have

(θ′β̂ − θ′β)² / (σ̂² θ′(X′X)⁻¹θ) = θ′(X′X)⁻¹X′ε ε′X(X′X)⁻¹θ / (σ̂² θ′(X′X)⁻¹θ)
                                = Mθ′ε ε′Mθ / (σ̂² Mθ′Mθ),

where

Mθ = X(X′X)⁻¹θ.
155 / 171

The Scheffé approach to simultaneous confidence intervals
Note that

Mθ′ε ε′Mθ / (σ̂² Mθ′Mθ) = ⟨ε, Mθ/‖Mθ‖⟩² / σ̂²,

i.e. it is the squared length of the projection of ε onto the line spanned by
Mθ (divided by σ̂²).

The quantity ⟨ε, Mθ/‖Mθ‖⟩² is maximized over θ ∈ T at the value ‖Pε‖², where
P is the projection onto the linear space

{X(X′X)⁻¹θ | θ ∈ T} = {Mθ}.

Therefore

sup_{θ∈T} ⟨ε, Mθ/‖Mθ‖⟩² / σ̂² = ‖Pε‖² / σ̂²,

and since {Mθ} ⊂ col(X), it follows that Pε and σ̂² are independent.
156 / 171

The Scheffé approach to simultaneous confidence intervals
Moreover,

‖Pε‖²/σ² ∼ χ²_q

where q = dim(T), and as we know,

(n − p − 1)σ̂²/σ² ∼ χ²_{n−p−1}.

Thus

(‖Pε‖²/q) / σ̂² ∼ F_{q,n−p−1}.

157 / 171

The Scheffé approach to simultaneous confidence intervals
Let QF be the α quantile of the Fq,n−p−1 distribution. Then

P

p
|θ0 β̂ − θ0 β|
p
≤ σ̂ qQF for all θ
θ0 (X 0 X )−1 θ

!
= α,

so


p
p
P θ0 β̂ − σ̂ qQF Vθ ≤ θ0 β ≤ θ0 β̂ + σ̂ qQF Vθ for all θ = α
defines a level α simultaneous confidence set for {θ0 β | θ ∈ T }, where
Vθ = θ0 (X 0 X )−1 θ.
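
A minimal sketch of the Scheffé intervals with T = Rp+1 (so q = p + 1), using
hypothetical simulated data; any contrast vector θ can then be covered
simultaneously.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)

alpha = 0.95                            # simultaneous coverage (document's convention)
q = p + 1                               # here T = R^{p+1}, so q = dim(T) = p + 1
QF = stats.f.ppf(alpha, q, n - p - 1)   # the alpha quantile of F_{q, n-p-1}

def scheffe_interval(theta):
    """Scheffé interval for theta' beta, valid simultaneously over all theta."""
    v_theta = theta @ XtX_inv @ theta
    half = np.sqrt(sigma2_hat) * np.sqrt(q * QF * v_theta)
    est = theta @ beta_hat
    return est - half, est + half

print(scheffe_interval(np.array([0.0, 1.0, -1.0, 0.0])))   # hypothetical contrast
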

158 / 171

The Scheffé approach to simultaneous confidence intervals
The “multiplier” for the Scheffé simultaneous confidence interval is

σ̂ √(q·QF·Vθ),

where the F distribution has q, n − p − 1 degrees of freedom. For large n,
we can approximate this with

σ̂ √(Q_{χ²}·Vθ),

where Q_{χ²} is the α quantile of the χ² distribution with q degrees of
freedom.

Instead of the usual factor of 2, we have √Q_{χ²}. Note that this is
approximately 2 when q = 1, and grows fairly slowly with q: it is about 3.3
when q = 5 and about 4.3 when q = 10 (at α = 0.95).
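
These values are easy to reproduce (assuming the usual 0.95 coverage level):

import numpy as np
from scipy import stats

# Square-root chi^2 quantiles behind the large-n Scheffé multiplier, at coverage 0.95.
for q in [1, 5, 10]:
    print(q, np.sqrt(stats.chi2.ppf(0.95, q)))   # approximately 1.96, 3.33, 4.28
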

159 / 171

Polynomial regression
The conventional linear model has the mean specification

E[y|x = x] = β0 + β1 x1 + · · · + βp xp.

It is possible to accommodate nonlinear relationships while still working
with linear models.

Polynomial regression is a traditional approach to doing this. If there is
only one predictor variable, polynomial regression uses the mean structure

E[y|x = x] = β0 + β1 x + β2 x² + · · · + βp x^p.

Note that this is still a linear model, as it is linear in the coefficients β.
Multiple regression techniques (e.g. OLS) can be used for estimation and
inference.
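
A minimal sketch of a cubic polynomial fit by OLS, using a hypothetical
nonlinear mean function; np.vander builds the columns 1, x, x², x³.

import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(-1, 1, size=n)                  # hypothetical single predictor
y = np.sin(2 * x) + 0.2 * rng.normal(size=n)    # hypothetical nonlinear mean

degree = 3
X = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, x^3
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta_hat
print(beta_hat)
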

160 / 171

Functional linear regression
A drawback of polynomial regression is that the polynomials can be
highly collinear (e.g. cor(U, U 2 ) ≈ 0.97 if U is uniform on 0, 1). Also,
polynomials have unbounded support, and it is often desirable to model
E [y |x = x] as a linear combination of functions with local (bounded)
suport.
If there is a single covariate x, we can use a model of the form
E [y |x = x] = β1 φ1 (x) + · · · + βp φp (x)
where the φj (·) are basis functions that are chosen based on what we
think E [y |x = x] might look like. Splines, sinusoids, wavelets, and radial
basis functions are possible choices.
Since the mean function is linear in the unknown parameters βj , this is a
linear model and can be estimated using multiple linear regression (OLS)
techniques.
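
A minimal sketch using Gaussian radial basis functions as the φj; the
centers, the width, and the simulated mean function are all arbitrary choices
for illustration.

import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 1, size=n)
y = np.sin(6 * x) + 0.2 * rng.normal(size=n)    # hypothetical nonlinear mean

# Gaussian radial basis functions centered on a grid (one possible choice of phi_j).
centers = np.linspace(0, 1, 8)
width = 0.15
Phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)   # n x 8 design

beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
fitted = Phi @ beta_hat
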

161 / 171

Confidence bands
Suppose we have a functional linear model of the form
E [y |x = x, t] = β 0 x + f (t),
and we model f (t) as
f (t) =

q
X

γj φj (t)

j=1

where the φj (·) are basis functions.
A confidence band for f with coverage probability α is an expression of
the form
fˆ(t) ± M(t)
such that


P fˆ(t) − M(t) ≤ f (t) ≤ fˆ(t) + M(t) ∀t = α.
162 / 171

Confidence bands

We can use as a point estimate

f̂(t) ≡ Σ_{j=1}^{q} γ̂j φj(t).

For each fixed t, f(t) = Σ_{j=1}^{q} γj φj(t) is a linear combination of the
γj, so if we simultaneously cover all such linear combinations of the γj, we
will have our confidence band. Thus the Scheffé procedure can be applied with
T = Rq.
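
A sketch of the resulting band, under the simplifying assumption that the
model contains only the basis expansion of t (no additional β′x part), so the
residual degrees of freedom are n − q rather than n − p − 1; the basis and
data are hypothetical.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, q = 200, 8
t = rng.uniform(0, 1, size=n)
y = np.sin(6 * t) + 0.2 * rng.normal(size=n)           # hypothetical data

centers = np.linspace(0, 1, q)
phi = lambda s: np.exp(-0.5 * ((s[:, None] - centers[None, :]) / 0.15) ** 2)

Phi = phi(t)                                           # n x q basis design
G = np.linalg.inv(Phi.T @ Phi)
gamma_hat = G @ Phi.T @ y
resid = y - Phi @ gamma_hat
sigma2_hat = resid @ resid / (n - q)                   # residual df = n - q in this design

alpha = 0.95
QF = stats.f.ppf(alpha, q, n - q)

grid = np.linspace(0, 1, 100)
Theta = phi(grid)                                      # each row is theta(t) = (phi_1(t),...,phi_q(t))
f_hat = Theta @ gamma_hat
V = np.einsum("ij,jk,ik->i", Theta, G, Theta)          # theta(t)' (Phi'Phi)^{-1} theta(t) at each t
M = np.sqrt(sigma2_hat) * np.sqrt(q * QF * V)
band = np.column_stack([f_hat - M, f_hat + M])
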

163 / 171

Interactions
A basic linear model for p observed covariates is
ŷi = β̂0 + β̂1 x1i + · · · + β̂p xpi .
In this setting, the terms β̂j xj , etc. are called main effects. This basic
linear model has the following two properties:
1. The mean is linear in each covariate, when holding the values of the
other covariates fixed.
2. The slope of y on each covariate is the same, regardless of the
values of the other covariates.
The first property can be overcome using basis functions.

164 / 171

Interactions

The second property implies that if we plot E [Y |x1 = x1 , x2 = x2 ] against
x1 , for different fixed values of x2 , we get a series of parallel lines. This is
a very strong form of linearity that will not hold for most populations.
Interactions allow more realistic models to be specified, in which this
strong form of parallelism does not hold.
The most common way of specifying an interaction is by including a
product of two covariates in the mean structure, e.g.
E [y |x1 = x1 , x2 = x2 ] = β0 + β1 x1 + β2 x2 + β3 x1 x2 .
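
A minimal sketch: the interaction enters the design matrix simply as an extra
column containing the elementwise product x1·x2 (the data and true
coefficients are hypothetical).

import numpy as np

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + 0.8 * x1 * x2 + rng.normal(size=n)   # hypothetical truth

# Design matrix with main effects and the product (interaction) term.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # estimates of beta_0, beta_1, beta_2, beta_3

# The fitted slope of y on x1 at a given x2 is beta_1_hat + beta_3_hat * x2.
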

165 / 171

Interactions
By rearranging terms
E [y |x1 = x1 , x2 = x2 ] = β0 + (β1 + β3 x2 )x1 + β2 x2 ,
we can interpret this as saying that E [Y |x1 = x1 , x2 = x2 ] is linear in x1
for each fixed value of x2 , but the slopes of these lines depend on x2 (and
specifically, are equal to β1 + β3 x2 ).
Note that we can also arrange terms as
E [y |x1 = x1 , x2 = x2 ] = β0 + (β2 + β3 x1 )x2 + β1 x1 ,
so that the mean function is linear in x2 for each fixed value of x1 , with
slopes depending on the value of x1 .

166 / 171

Interactions
Suppose we have two quantitative covariates x1 and x2, and wish to
model E[y|x1 = x1, x2 = x2] in the form f1(x1) + f2(x2), where f1 and f2
are smooth functions of one variable. This is called an additive model.

We can model f1 and f2 using basis functions:

f1(x) = Σ_j βj φj(x),     f2(x) = Σ_j γj φj(x).

It is not necessary that f1 and f2 use the same basis functions, as above,
but we do so here for simplicity.

This additive model exhibits parallelism in that E[y|x1 = x1, x2 = x2],
viewed as a function of x1 with x2 held constant, produces a series of
parallel curves as we vary x2.

167 / 171

Interactions

Interactions can be used to break this parallelism. When we interact two
covariates that are modeled using basis functions, we take all pairwise
products between the basis functions for x1 and the basis functions for
x2, i.e.

ŷi = Σ_{j1,j2} θ_{j1,j2} φ_{j1}(x1i) φ_{j2}(x2i),

where the coefficients (θ_{j1,j2}; j1 = 1, . . . , q1, j2 = 1, . . . , q2) are
estimated jointly using OLS.
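
A minimal sketch of the tensor-product construction with hypothetical
Gaussian bases of sizes q1 and q2; the pairwise products are formed by
broadcasting and then flattened into an n × (q1·q2) design.

import numpy as np

rng = np.random.default_rng(9)
n, q1, q2 = 300, 4, 5
x1 = rng.uniform(0, 1, size=n)
x2 = rng.uniform(0, 1, size=n)
y = np.sin(4 * x1) * np.cos(3 * x2) + 0.2 * rng.normal(size=n)   # hypothetical non-additive mean

# Basis evaluations for each covariate (Gaussian bumps as one possible choice).
c1, c2 = np.linspace(0, 1, q1), np.linspace(0, 1, q2)
Phi1 = np.exp(-0.5 * ((x1[:, None] - c1) / 0.2) ** 2)            # n x q1
Phi2 = np.exp(-0.5 * ((x2[:, None] - c2) / 0.2) ** 2)            # n x q2

# All pairwise products phi_{j1}(x1i) * phi_{j2}(x2i): an n x (q1*q2) design.
X = (Phi1[:, :, None] * Phi2[:, None, :]).reshape(n, q1 * q2)
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat.shape)   # (q1*q2,)
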

168 / 171

Categorical covariates
A covariate is categorical if its values are unordered labels, rather than
quantitative values. Categorical variables are also called “nominal”,
“qualitative”, or “factor” variables.
For example, suppose that subjects fall into five groups, e.g. based on
where they live. These five groups might be labeled 1, 2, 3, 4, 5, but these
numbers are only labels, and have no quantitative meaning. We cannot
simply include a term β · x in the model for this covariate.
Instead, a categorical variable is represented through a set of “dummy
variables” or “indicator variables”. There are many ways to do this, but
the most common approach is to create an indicator variable for each
level, say, zji = I(xi = j), where x is the categorical variable. If x has q
levels, then it is represented through q indicator variables, z1 , . . . , zq .

169 / 171

Categorical covariates

Note that Σ_j zji = 1 for each i. Therefore the collection of indicator
variables is perfectly collinear with the intercept. To work around this, it
is common that one level of the categorical variable is selected as the
reference level, and its indicator is omitted from the model.
There are many other conventions for resolving this collinearity, but we
will not discuss them here.
Note that choice of the reference level does not change the model in a
fundamental way. It does not change col(X ), it only leads to a
representation of col(X ) in a different basis. Therefore, ŷ is not affected
by the choice of a reference level.
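
A minimal numpy sketch of reference-level (dummy) coding for a hypothetical
categorical covariate with five levels.

import numpy as np

rng = np.random.default_rng(10)
n = 12
x = rng.integers(1, 6, size=n)           # hypothetical categorical covariate with levels 1..5

levels = np.unique(x)
ref = levels[0]                          # arbitrary choice of reference level
keep = levels[levels != ref]

# Indicator (dummy) columns for every level except the reference, plus an intercept.
Z = np.column_stack([np.ones(n)] + [(x == j).astype(float) for j in keep])
print(Z)

# Choosing a different reference level changes the columns, but not col(Z),
# so the fitted values from a regression on Z are unchanged.
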

170 / 171

Categorical covariates

If we choose a particular reference level, say j*, then the interpretation of
the coefficient θk of the indicator zk corresponding to the kth level of
the covariate, where k ≠ j*, depends on the choice of j*. Specifically, θk
represents the expected difference between a case with x = k and a case
with x = j ∗ , while holding all other covariates fixed.
The standard regression analysis output in statistical software is based on
a reference level that is (usually) chosen arbitrarily. This seems to put
extra emphasis on the contrasts relative to the reference level. But in
fact, for a categorical variable with q levels, there are q(q − 1)/2 possible
contrasts, all of which are equally meaningful.

171 / 171