Decomposing Variance

Kerby Shedden
Department of Statistics, University of Michigan
October 10, 2021


Law of total variation

For any regression model involving a response y ∈ R and a covariate vector x ∈ R^p, we can decompose the marginal variance of y as follows:

    var(y) = var_x E[y|x] + E_x var[y|x].

- If the population is homoscedastic, var[y|x] does not depend on x, so we can simply write var[y|x] = σ², and we get var(y) = var_x E[y|x] + σ².
- If the population is heteroscedastic, var[y|x] is a function σ²(x) with expected value σ² = E_x σ²(x), and again we get var(y) = var_x E[y|x] + σ².

If we write y = f(x) + ε with E[ε|x] = 0, then E[y|x] = f(x), and var_x E[y|x] summarizes the variation of f(x) over the marginal distribution of x.

[Figure: law of total variation. Orange curves: conditional distributions of y given x. Purple curve: marginal distribution of y. Black dots: conditional means of y given x.]


Pearson correlation

The population Pearson correlation coefficient of two jointly distributed random variables x ∈ R and y ∈ R is

    ρ_xy ≡ cov(x, y) / (σ_x σ_y).

Given data y = (y₁, …, yₙ)′ and x = (x₁, …, xₙ)′, the Pearson correlation coefficient is estimated by

    ρ̂_xy = ĉov(x, y) / (σ̂_x σ̂_y) = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √(Σᵢ(xᵢ − x̄)² · Σᵢ(yᵢ − ȳ)²) = (x − x̄)′(y − ȳ) / (‖x − x̄‖ · ‖y − ȳ‖).

When we write y − ȳ here, this means y − ȳ·1, where 1 is a vector of 1's and ȳ is a scalar.

By the Cauchy-Schwarz inequality, −1 ≤ ρ_xy ≤ 1 and −1 ≤ ρ̂_xy ≤ 1. The sample correlation coefficient is slightly biased, but the bias is so small that it is usually ignored.


Pearson correlation and simple linear regression slopes

For the simple linear regression model y = α + βx + ε, if we view x as a random variable that is uncorrelated with ε, then cov(x, y) = βσ_x² and the correlation is

    ρ_xy ≡ cor(x, y) = β / √(β² + σ²/σ_x²).

The sample correlation coefficient for data x = (x₁, …, xₙ) and y = (y₁, …, yₙ) is related to the least squares slope estimate:

    β̂ = ĉov(x, y) / σ̂_x² = ρ̂_xy · σ̂_y / σ̂_x.


Orthogonality between fitted values and residuals

Recall that the fitted values are ŷ = Xβ̂ = Py, where y ∈ R^n is the vector of observed responses and P ∈ R^{n×n} is the projection matrix onto col(X). The residuals are r = y − ŷ = (I − P)y ∈ R^n.

Since P(I − P) = 0_{n×n}, it follows that ŷ′r = 0. Since r̄ = 0, it is equivalent to state that the sample correlation coefficient between r and ŷ is zero, i.e. ĉor(r, ŷ) = 0.


Coefficient of determination

A descriptive summary of the explanatory power of x for y is given by the coefficient of determination, also known as the proportion of explained variance, or multiple R². This is the quantity

    R² ≡ 1 − ‖y − ŷ‖²/‖y − ȳ‖² = ‖ŷ − ȳ‖²/‖y − ȳ‖² = v̂ar(ŷ)/v̂ar(y).

The equivalence between these expressions follows from the identity

    ‖y − ȳ‖² = ‖y − ŷ + ŷ − ȳ‖² = ‖y − ŷ‖² + ‖ŷ − ȳ‖² + 2(y − ŷ)′(ŷ − ȳ) = ‖y − ŷ‖² + ‖ŷ − ȳ‖²,

where the cross term vanishes because the residual y − ŷ is orthogonal to ŷ − ȳ ∈ col(X). It should be clear that R² = 0 iff ŷ = ȳ and R² = 1 iff ŷ = y.


Coefficient of determination

The coefficient of determination is equal to ĉor(ŷ, y)². To see this, note that the sample mean of ŷ equals ȳ (the residuals sum to zero), so

    ĉor(ŷ, y) = (ŷ − ȳ)′(y − ȳ) / (‖ŷ − ȳ‖ · ‖y − ȳ‖)
              = (ŷ − ȳ)′(y − ŷ + ŷ − ȳ) / (‖ŷ − ȳ‖ · ‖y − ȳ‖)
              = [(ŷ − ȳ)′(y − ŷ) + (ŷ − ȳ)′(ŷ − ȳ)] / (‖ŷ − ȳ‖ · ‖y − ȳ‖)
              = ‖ŷ − ȳ‖ / ‖y − ȳ‖.
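These identities are easy to check numerically. Below is a minimal sketch in Python/NumPy (the simulated design, coefficients, and noise level are arbitrary illustrative choices, not from the slides) that fits a least squares regression, confirms that the residuals are orthogonal to the fitted values, and confirms that R² computed as a variance ratio equals the squared sample correlation between ŷ and y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an arbitrary regression problem (hypothetical values).
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=2.0, size=n)

# Least squares fit, fitted values, and residuals.
bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ bhat
resid = y - yhat

# Fitted values and residuals are orthogonal (up to rounding error).
print(np.dot(yhat, resid))

# R^2 as a variance ratio ...
R2_vr = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)

# ... equals the squared sample correlation between yhat and y.
R2_cor = np.corrcoef(yhat, y)[0, 1]**2
print(R2_vr, R2_cor)
```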
Coefficient of determination in simple linear regression

In general,

    R² = ĉor(y, ŷ)² = ĉov(y, ŷ)² / (v̂ar(y) · v̂ar(ŷ)).

In the case of simple linear regression,

    ĉov(y, ŷ) = ĉov(y, α̂ + β̂x) = β̂ ĉov(y, x), and v̂ar(ŷ) = v̂ar(α̂ + β̂x) = β̂² v̂ar(x).

Substituting these into the expression above and cancelling β̂², we see that for simple linear regression

    R² = ĉor(y, x)² = ĉor(y, ŷ)².


Relationship to the F statistic

The F-statistic for the null hypothesis β₁ = ⋯ = β_p = 0 is

    (‖ŷ − ȳ‖² / ‖y − ŷ‖²) · (n − p − 1)/p = (R² / (1 − R²)) · (n − p − 1)/p,

which is an increasing function of R².


Adjusted R²

The sample R² is an estimate of the population R²:

    1 − E_x var[y|x] / var(y).

Since it is a ratio, the plug-in estimate R² is biased, although the bias is not large unless the sample size is small or the number of covariates is large. The adjusted R² is an approximately unbiased estimate of the population R²:

    1 − (1 − R²) · (n − 1)/(n − p − 1).

The adjusted R² is always less than the unadjusted R². The adjusted R² is always less than or equal to one, but can be negative.


The unique variation in one covariate

How much "information" about y is present in a covariate x_k? This question is not straightforward when the covariates are non-orthogonal, since several covariates may contain overlapping information about y.

Let x_k⊥ ∈ R^n be the residual of the k-th covariate, x_k ∈ R^n, after regressing it against all other covariates (including the intercept). If P_{-k} is the projection onto span({x_j : j ≠ k}), then x_k⊥ = (I − P_{-k})x_k.

We could use v̂ar(x_k⊥)/v̂ar(x_k) to assess how much of the variation in x_k is "unique", in that it is not also captured by other predictors. But this measure doesn't involve y, so it can't tell us whether the unique variation in x_k is useful in the regression analysis.


The unique regression information in one covariate

To learn how x_k contributes "uniquely" to the regression, we can consider how introducing x_k into a working regression model affects the R².

Let ŷ_{-k} = P_{-k}y be the fitted values in the model omitting covariate k. Let R² denote the multiple R² for the full model, and let R²_{-k} be the multiple R² for the regression omitting covariate x_k. The value of R² − R²_{-k} is a way to quantify how much unique information about y in x_k is not captured by the other covariates. This is called the semi-partial R².


Identity involving norms of fitted values and residuals

Before we continue, we will need a simple identity that is often useful. In general, if a and b are orthogonal, then ‖a + b‖² = ‖a‖² + ‖b‖². If a and b − a are orthogonal, then

    ‖b‖² = ‖b − a + a‖² = ‖b − a‖² + ‖a‖²,

so in this setting we have ‖b‖² − ‖a‖² = ‖b − a‖².

Applying this fact to regression, we know that the fitted values and residuals are orthogonal. Thus for the regression omitting variable k, ŷ_{-k} and y − ŷ_{-k} are orthogonal, so ‖y − ŷ_{-k}‖² = ‖y‖² − ‖ŷ_{-k}‖². By the same argument, ‖y − ŷ‖² = ‖y‖² − ‖ŷ‖².


Improvement in R² due to one covariate

Now we can obtain a simple, direct expression for the semi-partial R². Since x_k⊥ is orthogonal to the other covariates,

    ŷ = ŷ_{-k} + (⟨y, x_k⊥⟩ / ⟨x_k⊥, x_k⊥⟩) · x_k⊥,

and

    ‖ŷ‖² = ‖ŷ_{-k}‖² + ⟨y, x_k⊥⟩² / ‖x_k⊥‖².

Thus we have

    R² = 1 − ‖y − ŷ‖² / ‖y − ȳ‖²
       = 1 − (‖y‖² − ‖ŷ‖²) / ‖y − ȳ‖²
       = 1 − (‖y‖² − ‖ŷ_{-k}‖² − ⟨y, x_k⊥⟩²/‖x_k⊥‖²) / ‖y − ȳ‖²
       = 1 − ‖y − ŷ_{-k}‖² / ‖y − ȳ‖² + (⟨y, x_k⊥⟩²/‖x_k⊥‖²) / ‖y − ȳ‖²
       = R²_{-k} + (⟨y, x_k⊥⟩²/‖x_k⊥‖²) / ‖y − ȳ‖².
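This decomposition can be checked directly. The sketch below (arbitrary simulated data; the helper `r2` is an ad hoc name, not from the slides) verifies that the gain in R² from adding covariate k equals ⟨y, x_k⊥⟩²/(‖x_k⊥‖² ‖y − ȳ‖²), where x_k⊥ is x_k residualized on the other columns.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Correlated covariates plus an intercept (hypothetical simulated data).
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def r2(X, y):
    """Multiple R^2 from an ordinary least squares fit."""
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)

k = 2                                   # column index of x2 in X
X_minus_k = np.delete(X, k, axis=1)

# Semi-partial R^2 as a difference of two multiple R^2 values.
gain = r2(X, y) - r2(X_minus_k, y)

# The same quantity via the residualized covariate x_k-perp.
xk_perp = X[:, k] - X_minus_k @ np.linalg.lstsq(X_minus_k, X[:, k], rcond=None)[0]
direct = np.dot(y, xk_perp)**2 / (np.dot(xk_perp, xk_perp) * np.sum((y - y.mean())**2))

print(gain, direct)                     # agree up to rounding error
```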
Semi-partial R²

Thus the semi-partial R² is

    R² − R²_{-k} = (⟨y, x_k⊥⟩² / ‖x_k⊥‖²) / ‖y − ȳ‖² = ⟨y, x_k⊥/‖x_k⊥‖⟩² / ‖y − ȳ‖².

Since x_k⊥/‖x_k⊥‖ is centered and has length 1, it follows that

    R² − R²_{-k} = ĉor(y, x_k⊥)².

Thus the semi-partial R² for covariate k has two interpretations:

- It is the improvement in R² resulting from including covariate k in a working regression model that already contains the other covariates.
- It is the R² for a simple linear regression of y on x_k⊥ = (I − P_{-k})x_k.


Partial R²

The partial R² is

    (R² − R²_{-k}) / (1 − R²_{-k}) = (⟨y, x_k⊥⟩² / ‖x_k⊥‖²) / ‖y − ŷ_{-k}‖².

The partial R² for covariate k is the fraction of the maximum possible improvement in R² that is contributed by covariate k.

Let ŷ_{-k} be the fitted values for regressing y on all covariates except x_k. Since ŷ_{-k}′x_k⊥ = 0,

    ⟨y, x_k⊥⟩² / (‖y − ŷ_{-k}‖² · ‖x_k⊥‖²) = ⟨y − ŷ_{-k}, x_k⊥⟩² / (‖y − ŷ_{-k}‖² · ‖x_k⊥‖²).

The right-hand side is the usual R² that would be obtained when regressing y − ŷ_{-k} = (I − P_{-k})y on x_k⊥. Thus the partial R² is the same as the usual R² for (I − P_{-k})y regressed on (I − P_{-k})x_k.


The partial R² and variable importance

The partial R² is one way to measure the importance of a variable in a regression model. However, "importance" has many facets, and no single measure captures all of them. Other possible indicators of variable importance are:

- The estimated regression slope β̂_k. This is not a good measure because its scale depends on the units of the corresponding covariate.
- The standardized regression slope β̂_k · SD(x_k). Since β̂_k x_k = β̂_k SD(x_k) · x_k/SD(x_k), this measures the expected change in y corresponding to a one standard deviation change in x_k, and it does not depend on the units in which x_k is measured.
- The p-value for the null hypothesis that β_k = 0 (e.g. from a Wald test). This is not a good measure of importance because in many cases it tells you more about the sample size than about the importance of x_k: as long as β_k ≠ 0, this p-value will tend to zero as n grows.
- The semi-partial R². This measure does not "correct" for the strength of the base model, which is a drawback in some settings but an advantage in others.

No measure of variable importance is perfect. For example:

1. The most important variable may not have any causal relationship with y. It may be important only because it is a proxy or surrogate for the causes of y, or it may even be caused by y.
2. The most important variable may not be modifiable. If we want to manipulate the factors that predict y in order to move y in a favorable direction, the most important factor may not be one we can change (e.g. age may be the most important risk factor for a health outcome, but we cannot stop the passage of time).


Decomposition of projection matrices

Suppose P ∈ R^{n×n} is a rank-d projection matrix, and U is an n × d matrix with orthonormal columns that span col(P). If we partition U by columns as U = [u₁ u₂ ⋯ u_d], then P = UU′, so we can write

    P = Σ_{j=1}^d u_j u_j′.

Note that this representation is not unique, since there are different orthonormal bases for col(P). Each summand u_j u_j′ ∈ R^{n×n} is a rank-1 projection matrix onto span(u_j).
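The decomposition P = Σ_j u_j u_j′ is easy to see numerically. The sketch below (illustrative only; the design matrix is an arbitrary simulated example) builds the hat matrix for a small design matrix and compares it with the sum of rank-1 projections formed from an orthonormal basis obtained via a thin QR decomposition.

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary n x (p+1) design matrix with an intercept column.
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])

# Hat matrix: projection onto col(X).
P = X @ np.linalg.solve(X.T @ X, X.T)

# Orthonormal basis for col(X) from a thin QR decomposition.
U, _ = np.linalg.qr(X)

# P equals the sum of the rank-1 projections u_j u_j'.
P_sum = sum(np.outer(U[:, j], U[:, j]) for j in range(U.shape[1]))
print(np.allclose(P, P_sum))   # True
```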
Decomposition of R²

Question: In a multiple regression model, how much of the variance in y is explained by a particular covariate?

Orthogonal case: If the design matrix X is orthogonal (X′X = I, so its columns are orthonormal), the projection P onto col(X) can be decomposed as

    P = Σ_{j=0}^p P_j = 11′/n + Σ_{j=1}^p x_j x_j′,

where x_j is the j-th column of the design matrix (assuming here that the first column of X corresponds to the intercept; in normalized form this column is 1/√n, which contributes the term 11′/n).

The n × n rank-1 matrix P_j = x_j x_j′ is the projection onto span(x_j) (and P₀ = 11′/n is the projection onto the span of the vector of 1's). Furthermore, by orthogonality, P_j P_k = 0 unless j = k. Since

    ŷ − ȳ = Σ_{j=1}^p P_j y,

by orthogonality

    ‖ŷ − ȳ‖² = Σ_{j=1}^p ‖P_j y‖².

Here we are using the fact that if u₁, …, u_m are orthogonal, then ‖u₁ + ⋯ + u_m‖² = ‖u₁‖² + ⋯ + ‖u_m‖².

The R² for the simple linear regression of y on x_j is

    R_j² ≡ ‖P_j y‖² / ‖y − ȳ‖²,

since the centered fitted values in that simple regression are P_j y. So we see that for orthogonal design matrices,

    R² = Σ_{j=1}^p R_j².

That is, the overall coefficient of determination is the sum of the univariate coefficients of determination for all of the explanatory variables.

Non-orthogonal case: If X is not orthogonal, the overall R² will not in general be the sum of the single-covariate R²'s. If we let R_j² be as above (the R² values for regressing y on each x_j separately), then two different situations can arise: Σ_j R_j² > R², and Σ_j R_j² < R².


Decomposition of R², case 1: Σ_j R_j² > R²

It is not surprising that Σ_j R_j² can be bigger than R². For example, suppose that the population data generating model is y = x₁ + ε, and x₂ is highly correlated with x₁ but is not part of the data generating model.

[Causal diagram: x₁ → y, with x₂ correlated with x₁ but having no arrow into y.]

For the regression of y on both x₁ and x₂, the multiple R² will be 1 − σ²/var(y) (since E[y|x₁, x₂] = E[y|x₁] = x₁). The R² values for y regressed on either x₁ or x₂ separately will also be approximately 1 − σ²/var(y). Thus R₁² + R₂² ≈ 2R².


Decomposition of R², case 2: Σ_j R_j² < R²

This is more surprising, and is sometimes called enhancement. As an example, suppose the data generating model is y = z + ε, but we don't observe z (for simplicity assume E[z] = 0). Instead, we observe a value x₁ that satisfies x₁ = z + x₂, where x₂ has mean 0 and is independent of z and ε.

Since x₂ is independent of z and ε, it is also independent of y, thus R₂² ≈ 0 for large n.

[Causal diagram: z → y, with z and x₂ both feeding into the observed covariate x₁.]

The multiple R² of y on x₁ and x₂ is approximately σ_z²/(σ_z² + σ²) for large n, since the fitted values will converge to ŷ = x₁ − x₂ = z.

To calculate R₁², first note that for the regression of y on x₁, where y, x₁ ∈ R^n are data vectors,

    β̂ = ĉov(y, x₁) / v̂ar(x₁) → σ_z² / (σ_z² + σ_{x₂}²), and α̂ → 0.

Therefore for large n (with ŷ now denoting the fitted values from the simple regression of y on x₁),

    n⁻¹‖y − ŷ‖² ≈ n⁻¹‖z + ε − σ_z² x₁/(σ_z² + σ_{x₂}²)‖²
                = n⁻¹‖σ_{x₂}² z/(σ_z² + σ_{x₂}²) + ε − σ_z² x₂/(σ_z² + σ_{x₂}²)‖²
                = σ_{x₂}⁴ σ_z²/(σ_z² + σ_{x₂}²)² + σ² + σ_z⁴ σ_{x₂}²/(σ_z² + σ_{x₂}²)²
                = σ_{x₂}² σ_z²/(σ_z² + σ_{x₂}²) + σ².

Therefore

    R₁² = 1 − n⁻¹‖y − ŷ‖² / (n⁻¹‖y − ȳ‖²)
        ≈ 1 − (σ_{x₂}² σ_z²/(σ_z² + σ_{x₂}²) + σ²) / (σ_z² + σ²)
        = σ_z² / ((σ_z² + σ²)(1 + σ_{x₂}²/σ_z²)).

Thus R₁²/R² ≈ 1/(1 + σ_{x₂}²/σ_z²), which is strictly less than one if σ_{x₂}² > 0. Since R₂² ≈ 0, it follows that R² > R₁² + R₂².

The reason for this is that while x₂ contains no directly useful information about y (hence R₂² ≈ 0), it can remove the "measurement error" in x₁, making x₁ a better predictor of z.
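A quick simulation makes the enhancement effect concrete. This is a sketch only; the variance choices σ_z² = σ_{x₂}² = σ² = 1 are arbitrary, and the helper `r2` is an ad hoc name, not from the slides. With these values the limits above give R² ≈ 0.5, R₁² ≈ 0.25, and R₂² ≈ 0, so the joint R² exceeds the sum of the single-covariate R² values.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Enhancement example: y = z + eps, but we observe x1 = z + x2 rather than z.
# Hypothetical variance choices: var(z) = var(x2) = var(eps) = 1.
z = rng.normal(size=n)
x2 = rng.normal(size=n)
eps = rng.normal(size=n)
x1 = z + x2
y = z + eps

def r2(y, *cols):
    """Multiple R^2 for regressing y on the given columns plus an intercept."""
    X = np.column_stack([np.ones(len(y)), *cols])
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)

R2_full = r2(y, x1, x2)        # ~ 0.5
R2_1 = r2(y, x1)               # ~ 0.25
R2_2 = r2(y, x2)               # ~ 0
print(R2_full, R2_1 + R2_2)    # the full R^2 exceeds the sum
```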
Decomposition of R² (enhancement example)

We can now calculate the limiting partial R² for adding x₂ to a model that already contains x₁:

    σ_{x₂}² / (σ_{x₂}² + σ²(1 + σ_{x₂}²/σ_z²)).


Partial R², example 2

Suppose the design matrix satisfies

    X′X/n = [ 1  0  0
              0  1  r
              0  r  1 ],

and the data generating model is y = x₁ + x₂ + ε with var(ε) = σ².

We will calculate the partial R² for x₁, using the fact that the partial R² is the regular R² for regressing (I − P_{-1})y on (I − P_{-1})x₁, where y, x₁, x₂ ∈ R^n are the data vectors and P_{-1} is the projection onto span({1, x₂}). Since this is a simple linear regression, the partial R² can be expressed as

    ĉor((I − P_{-1})y, (I − P_{-1})x₁)².

We will calculate the partial R² in a setting where all conditional means are linear. This would hold if the data are jointly Gaussian (but joint Gaussianity is not a necessary condition for conditional means to be linear).

The numerator of the partial R² is the square of

    ĉov((I − P_{-1})y, (I − P_{-1})x₁) = y′(I − P_{-1})x₁/n = (x₁ + x₂ + ε)′(x₁ − r x₂)/n → 1 − r².

The denominator contains two factors. The first is

    ‖(I − P_{-1})x₁‖²/n = x₁′(I − P_{-1})x₁/n = x₁′(x₁ − r x₂)/n → 1 − r².

The other factor in the denominator is y′(I − P_{-1})y/n:

    y′(I − P_{-1})y/n = (x₁ + x₂)′(I − P_{-1})(x₁ + x₂)/n + ε′(I − P_{-1})ε/n + 2ε′(I − P_{-1})(x₁ + x₂)/n
                      ≈ (x₁ + x₂)′(x₁ − r x₂)/n + σ²
                      → 1 − r² + σ².

Thus the partial R² is approximately equal to

    (1 − r²)² / ((1 − r²)(1 − r² + σ²)) = (1 − r²) / (1 − r² + σ²).

If r = 1 the result is zero (x₁ has no unique explanatory power), and if r = 0 the result is 1/(1 + σ²), indicating that after controlling for x₂, around a fraction 1/(1 + σ²) of the remaining variance is explained by x₁ (the rest is due to ε).


Summary

Each of the three R² values can be expressed either in terms of a variance ratio or as a squared correlation coefficient:

                      Variance ratio                     Correlation
    Multiple R²       ‖ŷ − ȳ‖²/‖y − ȳ‖²                  ĉor(ŷ, y)²
    Semi-partial R²   R² − R²_{-k}                       ĉor(y, x_k⊥)²
    Partial R²        (R² − R²_{-k})/(1 − R²_{-k})       ĉor((I − P_{-k})y, x_k⊥)²
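To tie the three measures together, here is a small helper (an illustrative sketch, not from the slides; the function name `r2_measures` and the simulated data are my own) that computes the multiple, semi-partial, and partial R² for one covariate directly from the variance-ratio column of the table above.

```python
import numpy as np

def r2_measures(X, y, k):
    """Return (multiple R^2, semi-partial R^2 for column k, partial R^2 for column k).

    X is an n x (p+1) design matrix whose first column is the intercept.
    """
    def multiple_r2(Xmat):
        yhat = Xmat @ np.linalg.lstsq(Xmat, y, rcond=None)[0]
        return 1 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)

    R2 = multiple_r2(X)
    R2_minus_k = multiple_r2(np.delete(X, k, axis=1))
    semi_partial = R2 - R2_minus_k
    partial = semi_partial / (1 - R2_minus_k)
    return R2, semi_partial, partial

# Example with arbitrary simulated data (hypothetical values).
rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)
y = 1 + x1 + x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
print(r2_measures(X, y, k=1))
```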