Decomposing Variance

Kerby Shedden
Department of Statistics, University of Michigan
October 10, 2021


Law of total variation

For any regression model involving a response y ∈ R and a covariate vector x ∈ R^p, we can decompose the marginal variance of y as follows:

    var(y) = var_x E[y|x] + E_x var[y|x].

- If the population is homoscedastic, var[y|x] does not depend on x, so we can simply write var[y|x] = σ², and we get var(y) = var_x E[y|x] + σ².
- If the population is heteroscedastic, var[y|x] is a function σ²(x) with expected value σ² = E_x σ²(x), and again we get var(y) = var_x E[y|x] + σ².

If we write y = f(x) + ε with E[ε|x] = 0, then E[y|x] = f(x), and var_x E[y|x] summarizes the variation of f(x) over the marginal distribution of x.

[Figure: law of total variation. Orange curves: conditional distributions of y given x. Purple curve: marginal distribution of y. Black dots: conditional means of y given x.]


Pearson correlation

The population Pearson correlation coefficient of two jointly distributed random variables x ∈ R and y ∈ R is

    ρ_xy ≡ cov(x, y) / (σ_x σ_y).

Given data y = (y₁, …, yₙ)′ and x = (x₁, …, xₙ)′, the Pearson correlation coefficient is estimated by

    ρ̂_xy = ĉov(x, y) / (σ̂_x σ̂_y) = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √(Σᵢ(xᵢ − x̄)² · Σᵢ(yᵢ − ȳ)²) = (x − x̄)′(y − ȳ) / (‖x − x̄‖ · ‖y − ȳ‖).

When we write y − ȳ here, this means y − ȳ·1, where 1 is a vector of 1's and ȳ is a scalar.

By the Cauchy-Schwarz inequality, −1 ≤ ρ_xy ≤ 1 and −1 ≤ ρ̂_xy ≤ 1. The sample correlation coefficient is slightly biased, but the bias is so small that it is usually ignored.


Pearson correlation and simple linear regression slopes

For the simple linear regression model y = α + βx + ε, if we view x as a random variable that is uncorrelated with ε, then cov(x, y) = βσ_x² and the correlation is

    ρ_xy ≡ cor(x, y) = β / √(β² + σ²/σ_x²).

The sample correlation coefficient for data x = (x₁, …, xₙ) and y = (y₁, …, yₙ) is related to the least squares slope estimate:

    β̂ = ĉov(x, y) / σ̂_x² = ρ̂_xy · σ̂_y / σ̂_x.


Orthogonality between fitted values and residuals

Recall that the fitted values are ŷ = Xβ̂ = Py, where y ∈ R^n is the vector of observed responses and P ∈ R^{n×n} is the projection matrix onto col(X). The residuals are r = y − ŷ = (I − P)y ∈ R^n.

Since P(I − P) = 0_{n×n}, it follows that ŷ′r = 0. Since r̄ = 0, it is equivalent to state that the sample correlation coefficient between r and ŷ is zero, i.e. ĉor(r, ŷ) = 0.


Coefficient of determination

A descriptive summary of the explanatory power of x for y is given by the coefficient of determination, also known as the proportion of explained variance, or multiple R². This is the quantity

    R² ≡ 1 − ‖y − ŷ‖²/‖y − ȳ‖² = ‖ŷ − ȳ‖²/‖y − ȳ‖² = v̂ar(ŷ)/v̂ar(y).

The equivalence between these expressions follows from the identity

    ‖y − ȳ‖² = ‖y − ŷ + ŷ − ȳ‖² = ‖y − ŷ‖² + ‖ŷ − ȳ‖² + 2(y − ŷ)′(ŷ − ȳ) = ‖y − ŷ‖² + ‖ŷ − ȳ‖²,

where the cross term vanishes because the residual y − ŷ is orthogonal to ŷ − ȳ ∈ col(X). It should be clear that R² = 0 iff ŷ = ȳ and R² = 1 iff ŷ = y.


Coefficient of determination

The coefficient of determination is equal to ĉor(ŷ, y)². To see this, note that the sample mean of ŷ equals ȳ (the residuals sum to zero), so

    ĉor(ŷ, y) = (ŷ − ȳ)′(y − ȳ) / (‖ŷ − ȳ‖ · ‖y − ȳ‖)
              = (ŷ − ȳ)′(y − ŷ + ŷ − ȳ) / (‖ŷ − ȳ‖ · ‖y − ȳ‖)
              = [(ŷ − ȳ)′(y − ŷ) + (ŷ − ȳ)′(ŷ − ȳ)] / (‖ŷ − ȳ‖ · ‖y − ȳ‖)
              = ‖ŷ − ȳ‖ / ‖y − ȳ‖.
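These identities are easy to check numerically. Below is a minimal sketch in Python/NumPy (the simulated design, coefficients, and noise level are arbitrary illustrative choices, not from the slides) that fits a least squares regression, confirms that the residuals are orthogonal to the fitted values, and confirms that R² computed as a variance ratio equals the squared sample correlation between ŷ and y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an arbitrary regression problem (hypothetical values).
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=2.0, size=n)

# Least squares fit, fitted values, and residuals.
bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ bhat
resid = y - yhat

# Fitted values and residuals are orthogonal (up to rounding error).
print(np.dot(yhat, resid))

# R^2 as a variance ratio ...
R2_vr = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)

# ... equals the squared sample correlation between yhat and y.
R2_cor = np.corrcoef(yhat, y)[0, 1]**2
print(R2_vr, R2_cor)
```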
Coefficient of determination in simple linear regression

In general,

    R² = ĉor(y, ŷ)² = ĉov(y, ŷ)² / (v̂ar(y) · v̂ar(ŷ)).

In the case of simple linear regression,

    ĉov(y, ŷ) = ĉov(y, α̂ + β̂x) = β̂ ĉov(y, x), and v̂ar(ŷ) = v̂ar(α̂ + β̂x) = β̂² v̂ar(x).

Substituting these into the expression above and cancelling β̂², we see that for simple linear regression

    R² = ĉor(y, x)² = ĉor(y, ŷ)².


Relationship to the F statistic

The F-statistic for the null hypothesis β₁ = ⋯ = β_p = 0 is

    (‖ŷ − ȳ‖² / ‖y − ŷ‖²) · (n − p − 1)/p = (R² / (1 − R²)) · (n − p − 1)/p,

which is an increasing function of R².


Adjusted R²

The sample R² is an estimate of the population R²:

    1 − E_x var[y|x] / var(y).

Since it is a ratio, the plug-in estimate R² is biased, although the bias is not large unless the sample size is small or the number of covariates is large. The adjusted R² is an approximately unbiased estimate of the population R²:

    1 − (1 − R²) · (n − 1)/(n − p − 1).

The adjusted R² is always less than the unadjusted R². The adjusted R² is always less than or equal to one, but can be negative.


The unique variation in one covariate

How much "information" about y is present in a covariate x_k? This question is not straightforward when the covariates are non-orthogonal, since several covariates may contain overlapping information about y.

Let x_k⊥ ∈ R^n be the residual of the k-th covariate, x_k ∈ R^n, after regressing it against all other covariates (including the intercept). If P_{-k} is the projection onto span({x_j : j ≠ k}), then x_k⊥ = (I − P_{-k})x_k.

We could use v̂ar(x_k⊥)/v̂ar(x_k) to assess how much of the variation in x_k is "unique", in that it is not also captured by other predictors. But this measure doesn't involve y, so it can't tell us whether the unique variation in x_k is useful in the regression analysis.


The unique regression information in one covariate

To learn how x_k contributes "uniquely" to the regression, we can consider how introducing x_k into a working regression model affects the R².

Let ŷ_{-k} = P_{-k}y be the fitted values in the model omitting covariate k. Let R² denote the multiple R² for the full model, and let R²_{-k} be the multiple R² for the regression omitting covariate x_k. The value of R² − R²_{-k} is a way to quantify how much unique information about y in x_k is not captured by the other covariates. This is called the semi-partial R².


Identity involving norms of fitted values and residuals

Before we continue, we will need a simple identity that is often useful. In general, if a and b are orthogonal, then ‖a + b‖² = ‖a‖² + ‖b‖². If a and b − a are orthogonal, then

    ‖b‖² = ‖b − a + a‖² = ‖b − a‖² + ‖a‖²,

so in this setting we have ‖b‖² − ‖a‖² = ‖b − a‖².

Applying this fact to regression, we know that the fitted values and residuals are orthogonal. Thus for the regression omitting variable k, ŷ_{-k} and y − ŷ_{-k} are orthogonal, so ‖y − ŷ_{-k}‖² = ‖y‖² − ‖ŷ_{-k}‖². By the same argument, ‖y − ŷ‖² = ‖y‖² − ‖ŷ‖².


Improvement in R² due to one covariate

Now we can obtain a simple, direct expression for the semi-partial R². Since x_k⊥ is orthogonal to the other covariates,

    ŷ = ŷ_{-k} + (⟨y, x_k⊥⟩ / ⟨x_k⊥, x_k⊥⟩) · x_k⊥,

and

    ‖ŷ‖² = ‖ŷ_{-k}‖² + ⟨y, x_k⊥⟩² / ‖x_k⊥‖².

Thus we have

    R² = 1 − ‖y − ŷ‖² / ‖y − ȳ‖²
       = 1 − (‖y‖² − ‖ŷ‖²) / ‖y − ȳ‖²
       = 1 − (‖y‖² − ‖ŷ_{-k}‖² − ⟨y, x_k⊥⟩²/‖x_k⊥‖²) / ‖y − ȳ‖²
       = 1 − ‖y − ŷ_{-k}‖² / ‖y − ȳ‖² + (⟨y, x_k⊥⟩²/‖x_k⊥‖²) / ‖y − ȳ‖²
       = R²_{-k} + (⟨y, x_k⊥⟩²/‖x_k⊥‖²) / ‖y − ȳ‖².
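This decomposition can be checked directly. The sketch below (arbitrary simulated data; the helper `r2` is an ad hoc name, not from the slides) verifies that the gain in R² from adding covariate k equals ⟨y, x_k⊥⟩²/(‖x_k⊥‖² ‖y − ȳ‖²), where x_k⊥ is x_k residualized on the other columns.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Correlated covariates plus an intercept (hypothetical simulated data).
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def r2(X, y):
    """Multiple R^2 from an ordinary least squares fit."""
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)

k = 2                                   # column index of x2 in X
X_minus_k = np.delete(X, k, axis=1)

# Semi-partial R^2 as a difference of two multiple R^2 values.
gain = r2(X, y) - r2(X_minus_k, y)

# The same quantity via the residualized covariate x_k-perp.
xk_perp = X[:, k] - X_minus_k @ np.linalg.lstsq(X_minus_k, X[:, k], rcond=None)[0]
direct = np.dot(y, xk_perp)**2 / (np.dot(xk_perp, xk_perp) * np.sum((y - y.mean())**2))

print(gain, direct)                     # agree up to rounding error
```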
Semi-partial R²

Thus the semi-partial R² is

    R² − R²_{-k} = (⟨y, x_k⊥⟩² / ‖x_k⊥‖²) / ‖y − ȳ‖² = ⟨y, x_k⊥/‖x_k⊥‖⟩² / ‖y − ȳ‖².

Since x_k⊥/‖x_k⊥‖ is centered and has length 1, it follows that

    R² − R²_{-k} = ĉor(y, x_k⊥)².

Thus the semi-partial R² for covariate k has two interpretations:

- It is the improvement in R² resulting from including covariate k in a working regression model that already contains the other covariates.
- It is the R² for a simple linear regression of y on x_k⊥ = (I − P_{-k})x_k.


Partial R²

The partial R² is

    (R² − R²_{-k}) / (1 − R²_{-k}) = (⟨y, x_k⊥⟩² / ‖x_k⊥‖²) / ‖y − ŷ_{-k}‖².

The partial R² for covariate k is the fraction of the maximum possible improvement in R² that is contributed by covariate k.

Let ŷ_{-k} be the fitted values for regressing y on all covariates except x_k. Since ŷ_{-k}′x_k⊥ = 0,

    ⟨y, x_k⊥⟩² / (‖y − ŷ_{-k}‖² · ‖x_k⊥‖²) = ⟨y − ŷ_{-k}, x_k⊥⟩² / (‖y − ŷ_{-k}‖² · ‖x_k⊥‖²).

The right-hand side is the usual R² that would be obtained when regressing y − ŷ_{-k} = (I − P_{-k})y on x_k⊥. Thus the partial R² is the same as the usual R² for (I − P_{-k})y regressed on (I − P_{-k})x_k.


The partial R² and variable importance

The partial R² is one way to measure the importance of a variable in a regression model. However, "importance" has many facets, and no single measure captures all of them. Other possible indicators of variable importance are:

- The estimated regression slope β̂_k. This is not a good measure because its scale depends on the units of the corresponding covariate.
- The standardized regression slope β̂_k · SD(x_k). Since β̂_k x_k = β̂_k SD(x_k) · x_k/SD(x_k), this measures the expected change in y corresponding to a one standard deviation change in x_k, and it does not depend on the units in which x_k is measured.
- The p-value for the null hypothesis that β_k = 0 (e.g. from a Wald test). This is not a good measure of importance because in many cases it tells you more about the sample size than about the importance of x_k: as long as β_k ≠ 0, this p-value will tend to zero as n grows.
- The semi-partial R². This measure does not "correct" for the strength of the base model, which is a drawback in some settings but an advantage in others.

No measure of variable importance is perfect. For example:

1. The most important variable may not have any causal relationship with y. It may be important only because it is a proxy or surrogate for the causes of y, or it may even be caused by y.
2. The most important variable may not be modifiable. If we want to manipulate the factors that predict y in order to move y in a favorable direction, the most important factor may not be one we can change (e.g. age may be the most important risk factor for a health outcome, but we cannot stop the passage of time).


Decomposition of projection matrices

Suppose P ∈ R^{n×n} is a rank-d projection matrix, and U is an n × d matrix with orthonormal columns that span col(P). If we partition U by columns as U = [u₁ u₂ ⋯ u_d], then P = UU′, so we can write

    P = Σ_{j=1}^d u_j u_j′.

Note that this representation is not unique, since there are different orthonormal bases for col(P). Each summand u_j u_j′ ∈ R^{n×n} is a rank-1 projection matrix onto span(u_j).
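The decomposition P = Σ_j u_j u_j′ is easy to see numerically. The sketch below (illustrative only; the design matrix is an arbitrary simulated example) builds the hat matrix for a small design matrix and compares it with the sum of rank-1 projections formed from an orthonormal basis obtained via a thin QR decomposition.

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary n x (p+1) design matrix with an intercept column.
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])

# Hat matrix: projection onto col(X).
P = X @ np.linalg.solve(X.T @ X, X.T)

# Orthonormal basis for col(X) from a thin QR decomposition.
U, _ = np.linalg.qr(X)

# P equals the sum of the rank-1 projections u_j u_j'.
P_sum = sum(np.outer(U[:, j], U[:, j]) for j in range(U.shape[1]))
print(np.allclose(P, P_sum))   # True
```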
Decomposition of R²

Question: In a multiple regression model, how much of the variance in y is explained by a particular covariate?

Orthogonal case: If the design matrix X is orthogonal (X′X = I, so its columns are orthonormal), the projection P onto col(X) can be decomposed as

    P = Σ_{j=0}^p P_j = 11′/n + Σ_{j=1}^p x_j x_j′,

where x_j is the j-th column of the design matrix (assuming here that the first column of X corresponds to the intercept; in normalized form this column is 1/√n, which contributes the term 11′/n).

The n × n rank-1 matrix P_j = x_j x_j′ is the projection onto span(x_j) (and P₀ = 11′/n is the projection onto the span of the vector of 1's). Furthermore, by orthogonality, P_j P_k = 0 unless j = k. Since

    ŷ − ȳ = Σ_{j=1}^p P_j y,

by orthogonality

    ‖ŷ − ȳ‖² = Σ_{j=1}^p ‖P_j y‖².

Here we are using the fact that if u₁, …, u_m are orthogonal, then ‖u₁ + ⋯ + u_m‖² = ‖u₁‖² + ⋯ + ‖u_m‖².

The R² for the simple linear regression of y on x_j is

    R_j² ≡ ‖P_j y‖² / ‖y − ȳ‖²,

since the centered fitted values in that simple regression are P_j y. So we see that for orthogonal design matrices,

    R² = Σ_{j=1}^p R_j².

That is, the overall coefficient of determination is the sum of the univariate coefficients of determination for all of the explanatory variables.

Non-orthogonal case: If X is not orthogonal, the overall R² will not in general be the sum of the single-covariate R²'s. If we let R_j² be as above (the R² values for regressing y on each x_j separately), then two different situations can arise: Σ_j R_j² > R², and Σ_j R_j² < R².


Decomposition of R², case 1: Σ_j R_j² > R²

It is not surprising that Σ_j R_j² can be bigger than R². For example, suppose that the population data generating model is y = x₁ + ε, and x₂ is highly correlated with x₁ but is not part of the data generating model.

[Causal diagram: x₁ → y, with x₂ correlated with x₁ but having no arrow into y.]

For the regression of y on both x₁ and x₂, the multiple R² will be 1 − σ²/var(y) (since E[y|x₁, x₂] = E[y|x₁] = x₁). The R² values for y regressed on either x₁ or x₂ separately will also be approximately 1 − σ²/var(y). Thus R₁² + R₂² ≈ 2R².


Decomposition of R², case 2: Σ_j R_j² < R²

This is more surprising, and is sometimes called enhancement. As an example, suppose the data generating model is y = z + ε, but we don't observe z (for simplicity assume E[z] = 0). Instead, we observe a value x₁ that satisfies x₁ = z + x₂, where x₂ has mean 0 and is independent of z and ε.

Since x₂ is independent of z and ε, it is also independent of y, thus R₂² ≈ 0 for large n.

[Causal diagram: z → y, with z and x₂ both feeding into the observed covariate x₁.]

The multiple R² of y on x₁ and x₂ is approximately σ_z²/(σ_z² + σ²) for large n, since the fitted values will converge to ŷ = x₁ − x₂ = z.

To calculate R₁², first note that for the regression of y on x₁, where y, x₁ ∈ R^n are data vectors,

    β̂ = ĉov(y, x₁) / v̂ar(x₁) → σ_z² / (σ_z² + σ_{x₂}²), and α̂ → 0.

Therefore for large n (with ŷ now denoting the fitted values from the simple regression of y on x₁),

    n⁻¹‖y − ŷ‖² ≈ n⁻¹‖z + ε − σ_z² x₁/(σ_z² + σ_{x₂}²)‖²
                = n⁻¹‖σ_{x₂}² z/(σ_z² + σ_{x₂}²) + ε − σ_z² x₂/(σ_z² + σ_{x₂}²)‖²
                = σ_{x₂}⁴ σ_z²/(σ_z² + σ_{x₂}²)² + σ² + σ_z⁴ σ_{x₂}²/(σ_z² + σ_{x₂}²)²
                = σ_{x₂}² σ_z²/(σ_z² + σ_{x₂}²) + σ².

Therefore

    R₁² = 1 − n⁻¹‖y − ŷ‖² / (n⁻¹‖y − ȳ‖²)
        ≈ 1 − (σ_{x₂}² σ_z²/(σ_z² + σ_{x₂}²) + σ²) / (σ_z² + σ²)
        = σ_z² / ((σ_z² + σ²)(1 + σ_{x₂}²/σ_z²)).

Thus R₁²/R² ≈ 1/(1 + σ_{x₂}²/σ_z²), which is strictly less than one if σ_{x₂}² > 0. Since R₂² ≈ 0, it follows that R² > R₁² + R₂².

The reason for this is that while x₂ contains no directly useful information about y (hence R₂² ≈ 0), it can remove the "measurement error" in x₁, making x₁ a better predictor of z.
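A quick simulation makes the enhancement effect concrete. This is a sketch only; the variance choices σ_z² = σ_{x₂}² = σ² = 1 are arbitrary, and the helper `r2` is an ad hoc name, not from the slides. With these values the limits above give R² ≈ 0.5, R₁² ≈ 0.25, and R₂² ≈ 0, so the joint R² exceeds the sum of the single-covariate R² values.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Enhancement example: y = z + eps, but we observe x1 = z + x2 rather than z.
# Hypothetical variance choices: var(z) = var(x2) = var(eps) = 1.
z = rng.normal(size=n)
x2 = rng.normal(size=n)
eps = rng.normal(size=n)
x1 = z + x2
y = z + eps

def r2(y, *cols):
    """Multiple R^2 for regressing y on the given columns plus an intercept."""
    X = np.column_stack([np.ones(len(y)), *cols])
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)

R2_full = r2(y, x1, x2)        # ~ 0.5
R2_1 = r2(y, x1)               # ~ 0.25
R2_2 = r2(y, x2)               # ~ 0
print(R2_full, R2_1 + R2_2)    # the full R^2 exceeds the sum
```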
Decomposition of R² (enhancement example)

We can now calculate the limiting partial R² for adding x₂ to a model that already contains x₁:

    σ_{x₂}² / (σ_{x₂}² + σ²(1 + σ_{x₂}²/σ_z²)).


Partial R², example 2

Suppose the design matrix satisfies

    X′X/n = [ 1  0  0
              0  1  r
              0  r  1 ],

and the data generating model is y = x₁ + x₂ + ε with var(ε) = σ².

We will calculate the partial R² for x₁, using the fact that the partial R² is the regular R² for regressing (I − P_{-1})y on (I − P_{-1})x₁, where y, x₁, x₂ ∈ R^n are the data vectors and P_{-1} is the projection onto span({1, x₂}). Since this is a simple linear regression, the partial R² can be expressed as

    ĉor((I − P_{-1})y, (I − P_{-1})x₁)².

We will calculate the partial R² in a setting where all conditional means are linear. This would hold if the data are jointly Gaussian (but joint Gaussianity is not a necessary condition for conditional means to be linear).

The numerator of the partial R² is the square of

    ĉov((I − P_{-1})y, (I − P_{-1})x₁) = y′(I − P_{-1})x₁/n = (x₁ + x₂ + ε)′(x₁ − r x₂)/n → 1 − r².

The denominator contains two factors. The first is

    ‖(I − P_{-1})x₁‖²/n = x₁′(I − P_{-1})x₁/n = x₁′(x₁ − r x₂)/n → 1 − r².

The other factor in the denominator is y′(I − P_{-1})y/n:

    y′(I − P_{-1})y/n = (x₁ + x₂)′(I − P_{-1})(x₁ + x₂)/n + ε′(I − P_{-1})ε/n + 2ε′(I − P_{-1})(x₁ + x₂)/n
                      ≈ (x₁ + x₂)′(x₁ − r x₂)/n + σ²
                      → 1 − r² + σ².

Thus the partial R² is approximately equal to

    (1 − r²)² / ((1 − r²)(1 − r² + σ²)) = (1 − r²) / (1 − r² + σ²).

If r = 1 the result is zero (x₁ has no unique explanatory power), and if r = 0 the result is 1/(1 + σ²), indicating that after controlling for x₂, around a fraction 1/(1 + σ²) of the remaining variance is explained by x₁ (the rest is due to ε).


Summary

Each of the three R² values can be expressed either in terms of a variance ratio or as a squared correlation coefficient:

                      Variance ratio                     Correlation
    Multiple R²       ‖ŷ − ȳ‖²/‖y − ȳ‖²                  ĉor(ŷ, y)²
    Semi-partial R²   R² − R²_{-k}                       ĉor(y, x_k⊥)²
    Partial R²        (R² − R²_{-k})/(1 − R²_{-k})       ĉor((I − P_{-k})y, x_k⊥)²
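To tie the three measures together, here is a small helper (an illustrative sketch, not from the slides; the function name `r2_measures` and the simulated data are my own) that computes the multiple, semi-partial, and partial R² for one covariate directly from the variance-ratio column of the table above.

```python
import numpy as np

def r2_measures(X, y, k):
    """Return (multiple R^2, semi-partial R^2 for column k, partial R^2 for column k).

    X is an n x (p+1) design matrix whose first column is the intercept.
    """
    def multiple_r2(Xmat):
        yhat = Xmat @ np.linalg.lstsq(Xmat, y, rcond=None)[0]
        return 1 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)

    R2 = multiple_r2(X)
    R2_minus_k = multiple_r2(np.delete(X, k, axis=1))
    semi_partial = R2 - R2_minus_k
    partial = semi_partial / (1 - R2_minus_k)
    return R2, semi_partial, partial

# Example with arbitrary simulated data (hypothetical values).
rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)
y = 1 + x1 + x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
print(r2_measures(X, y, k=1))
```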