Fitting and inference for linear models using least squares
Kerby Shedden, Department of Statistics, University of Michigan
October 12, 2021
1 / 171

Goals and terminology

The goal is to learn about an unknown function f that relates variables y ∈ R and x ∈ R^p through a relationship y ≈ f(x). The variables y and x have distinct roles:

I Independent variables (also known as predictors, regressors, covariates, exogenous variables): x ∈ R^p, a vector.
I Dependent variable (also known as the response, outcome, or endogenous variable): y ∈ R, a scalar.

Note that the terms "independent" and "dependent" do not imply statistical independence or linear-algebraic independence of these variables. Instead, these terms suggest that the value of x can be manipulated, or exhibits "natural variation", and that we observe the consequent changes in y.
2 / 171

Goals and terminology

The analysis is empirical (based on a sample of data):

y_i ∈ R,  x_i = (x_{i1}, ..., x_{ip})′ ∈ R^p,  i = 1, ..., n,

where n is the sample size. The data point (y_i, x_i) reflects the i-th analysis unit (also called a case or observation).
3 / 171

Why do we want to do this?

"Two cultures" of statistics (Leo Breiman):

I Prediction: Estimate f to predict the typical value of y at a given x point using f̂(x) (x is not necessarily one of the x_i points in the data).
I Learning about structure: Inductive learning about the relationship between x and y, such as understanding which predictors or combinations of predictors are associated with particular changes in y, possibly attributing mechanisms or causal roles to specific factors.

We will see that inferences about mechanisms (causal inferences) based on regression analysis can be made only in specific, limited settings.
4 / 171

Examples:

I An empirical model for the weather conditions 48 hours from now could be based on current and historical weather conditions. Such a model could have a lot of practical value, but it would not necessarily provide much insight into the atmospheric processes that underlie changes in the weather.
I A study of the relationship between exposure to air pollution during childhood and life outcomes (like adult earnings) would primarily be of interest for inference, rather than for prediction. We may be able to quantify a risk to individuals from this exposure, and then estimate the cumulative effects of the exposure in a large population. But the effect of the exposure on an individual child is probably too small in relation to other risk factors for it to be of predictive value at the individual level.
5 / 171

Statistical interpretations of the regression function

The most common way of putting "curve fitting" into a statistical framework is to define f as the conditional expectation

f(x) ≡ E[y | x].

Here, the notation E[y | x] implies that y and x are random variables, and occasionally we may emphasize this by writing E[y | x = x]. Less commonly, the regression function is defined as a conditional quantile, such as the median, f(x) ≡ median(y | x), or some other quantile, f(x) ≡ Q_p(y | x = x). This is called quantile regression.
6 / 171

The conditional expectation function

For our purposes, the conditional expectation E[y | x] is viewed as a deterministic function of x: essentially what we would get if we sampled a large number of (x, y) pairs from their joint distribution and took the average of the y values that occur when the specific value x is observed. If there are densities, we can write

E[y | x] = ∫ y · f(y | x) dy.
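To make the "average the y values observed near a given x" description concrete, here is a minimal NumPy sketch (the joint distribution and all parameter values below are made up for illustration) that approximates E[y | x] by averaging simulated y values whose x falls in a narrow window around a chosen point, and compares the result with the true regression function.

import numpy as np

rng = np.random.default_rng(0)

# Made-up joint distribution: x ~ Uniform(0, 2), y = sin(x) + noise,
# so the true regression function is E[y|x] = sin(x).
n = 100_000
x = rng.uniform(0, 2, size=n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)

# Approximate E[y | x = x0] by averaging y over pairs whose x lies
# in a narrow window around x0.
def cond_mean(x0, width=0.02):
    mask = np.abs(x - x0) < width
    return y[mask].mean()

for x0 in (0.5, 1.0, 1.5):
    print(x0, cond_mean(x0), np.sin(x0))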
7 / 171 Least Squares Fitting In a linear model, the independent variable x is postulated to be related to the dependent variable y via a linear relationship yi ≈ p X βj xij = β 0 xi = hβ, xi i. j=1 This expression holds for each observed value, i = 1, . . . , n. This is a “linear model” in two senses: it is linear in β for fixed x, and it is linear in x for fixed β (technically, it is “bilinear”). 8 / 171 Least Squares Fitting To estimate f , we need to estimate the βj . One approach to doing this is to minimize the following function of β: L(β) = n n X X X (yi − βj xij )2 = (yi − β 0 xi )2 i=1 j i=1 This is called least squares estimation – it results from solving an optimization problem. 9 / 171 Simple linear regression A special case of the linear model is simple linear regression, when there is p = 1 covariate and an intercept (a covariate whose value is always 1). L(α, β) = X (yi − α − βxi )2 i We can differentiate with respect to α and β: ∂L/∂α ∂L/∂β P = −2 Pi (yi − α − βxi ) = −2 i (yi − α − βxi )xi P = −2 P ri = −2 i ri xi ri = yi − α − βxi is the “working residual” (requires working values for α and β). 10 / 171 Simple linear regression Setting ∂L/∂α = ∂L/∂β = 0 and solving for α and β yields α̂ = ȳ − β̂x̄ P yi xi /n − ȳ x̄ β̂ = P 2 xi /n − x̄ 2 P P where ȳ = yi /n and x̄ = xi /n are the sample mean values (averages). The “hat” notation (ˆ) distinguishes the least-squares optimal values of α and β from an arbitrary pair of parameter values. 11 / 171 Simple linear regression We will call α̂ and β̂ the “least squares estimates” of the model parameters α and β. At the least squares solution, ∂L/∂α ∂L/∂β P = 2 P ri = −2 i ri xi = = 0 0 we have the following basic properties for the least squares estimates: I The residuals ri = yi − α − βxi sum to zero I The residuals are orthogonal to the independent variable x. Recall that two vectors v , w ∈ Rd are orthogonal if Pd j=1 vj wj = 0. 12 / 171 Two important identities Centered and uncentered sums of squares can be related as follows: X xi2 /n − x̄ 2 = X (xi − x̄)2 /n. i Centered and uncentered cross-products can be related as follows: X yi xi /n − ȳ x̄ = X yi (xi − x̄)/n = X (yi − ȳ )(xi − x̄)/n. i 13 / 171 Simple linear regression Note that X (xi − x̄)2 /n and i X (yi − ȳ )(xi − x̄)/n. i are essentially the sample variance of x1 , . . . , xn , and the sample covariance of the (xi , yi ) pairs. Since β̂ is their ratio, we can replace n in the denominator with n − 1 so that β̂ = cd ov(y , x) var(x) c where cd ov and var c are the usual unbiased estimates of variance and covariance. 14 / 171 Convex functions A map Q : Rd → R is convex if for any v , w ∈ Rd , Q(λv + (1 − λ)w ) ≤ λQ(v ) + (1 − λ)Q(w ), for 0 ≤ λ ≤ 1. If the inequality is strict for 0 < λ < 1 and all v 6= w , then Q is strictly convex. λ=1 Q(v) λQ(v)+(1−λ)Q(w) λ=0 Q(w) Q(λv+(1−λ)w) v w 15 / 171 Convex functions A key property of strictly convex functions is that they have at most one global minimizer. That is, there exists at most one v ∈ Rd such that Q(v ) ≤ Q(w ) for all w ∈ Rd . The proof is simple. Suppose there exists v 6= w such that Q(v ) = Q(w ) = inf u∈Rd Q(u). If Q is strictly convex and λ = 1/2, then Q(v /2 + w /2) < (Q(v ) + Q(w ))/2 = inf u∈Rd Q(u), Thus z = (v + w )/2 has the property that Q(z) < inf u∈Rd Q(u), a contradiction. 16 / 171 Convexity of quadratic functions A general quadratic function in d dimensions can be written Q(v ) = v 0 Av + b 0 v + c where A is a d × d matrix, b and v are vectors in Rd , and c is a scalar. 
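As a quick numerical illustration of the convexity inequality above applied to a quadratic of this form, the sketch below uses made-up values of A, b, and c, with A built as B′B + I so that the quadratic is strictly convex (the connection between convexity and positive definiteness is developed in the slides that follow).

import numpy as np

rng = np.random.default_rng(1)

# Made-up quadratic: A = B'B + I is symmetric positive definite.
B = rng.normal(size=(3, 3))
A = B.T @ B + np.eye(3)
b = rng.normal(size=3)
c = 2.0

def Q(v):
    return v @ A @ v + b @ v + c

# Check Q(lam*v + (1-lam)*w) <= lam*Q(v) + (1-lam)*Q(w) at random points.
for _ in range(5):
    v, w = rng.normal(size=3), rng.normal(size=3)
    lam = rng.uniform()
    lhs = Q(lam * v + (1 - lam) * w)
    rhs = lam * Q(v) + (1 - lam) * Q(w)
    print(lhs <= rhs + 1e-12)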
Note that v 0 Av = X i,j vi vj Aij b0 v = X bj vj . j 17 / 171 Convexity of quadratic functions If b ∈ col(A), we can complete the square to eliminate the linear term, giving us Q(v ) = (v − f )0 A(v − f ) + s. where f is any vector satisfying Af = −b/2, and s = c − f 0 Af . If A is invertible, we can take f = −A−1 b/2. Since the property of being convex is invariant to translations in both the domain and range, without loss of generality we can assume f = 0 and s = 0 for purposes of analyzing the convexity of Q. 18 / 171 Convexity of quadratic functions Two key definitions: I A square matrix A is positive definite if v 0 Av > 0 for all vectors v 6= 0. I A square matrix is positive semidefinite if v 0 Av ≥ 0 for all v . We will now show that the quadratic function Q(v ) = v 0 Av is strictly convex if and only if A is positive definite. Note that without loss of generality, A is symmetric, since otherwise à ≡ (A + A0 )/2 gives the same quadratic form as A. 19 / 171 Convexity of quadratic functions Q(λv + (1 − λ)w ) = (λv + (1 − λ)w )0 A(λv + (1 − λ)w ) = λ2 v 0 Av + (1 − λ)2 w 0 Aw + 2λ(1 − λ)v 0 Aw . λQ(v ) + (1 − λ)Q(w ) − Q(λv + (1 − λ)w ) = λ(1 − λ)(v 0 Av + w 0 Aw − 2w 0 Av ) = λ(1 − λ)(v − w )0 A(v − w ) ≥ 0. If 0 < λ < 1, this is a strict inequality for all v 6= w if and only if A is positive definite. 20 / 171 Convexity of quadratic functions The gradient, or Jacobian of a scalar-valued function y = f (x) (y ∈ R, x ∈ Rd ) is (∂f /∂x1 , . . . , ∂f /∂xd ), which is viewed as a row vector by convention. 21 / 171 Convexity of quadratic functions If A is symmetric, the Jacobian of Q(v ) = v 0 Av is 2v 0 A0 . To see this, write v 0 Av = X vi2 Aii + i X vi vj Aij i6=j and differentiate with respect to v` to get the `th component of the Jacobian: 2v` A`` + X j6=` vj A`j + X vi Ai` = 2(Av )` . i6=` 22 / 171 Convexity of quadratic functions The Hessian of a scalar-valued function y = f (x) (y ∈ R, x ∈ Rd ) is the matrix Hij = ∂ 2 f /∂xj ∂xi . If A is symmetric, the Hessian of the quadratic form Q is 2A. To see this, note that the `th component of 2v 0 A0 is the inner product of v with the `th row (or column) of A. The derivative of this inner product with respect to a second index `0 is 2A``0 . It follows that a quadratic function is strictly convex iff its Hessian is positive definite (more generally, any continuous, twice differentiable function is convex on Rd if and only if its Hessian matrix is everywhere positive definite). 23 / 171 Uniqueness of the simple linear regression least squares fit The least squares solution for simple linear regression, α̂, β̂, is unique as long as var[x] c (the sample variance of the covariate) is positive. To see this, note that the Hessian (second derivative matrix) of L(α, β) is H= where x· = P i ∂ 2 L/∂α2 ∂ 2 L/∂α∂β ∂ 2 L/∂α∂β ∂ 2 L/∂β 2 = 2n 2x· 2x· P 2 xi2 xi . 24 / 171 Uniqueness of the simple linear regression least squares fit If var(x c 1 , . . . , xn ) > 0 then this is a positive definite matrix since all the principal submatrices have positive determinants: |2n| > 0 2n 2x· 2x· P 2 xi2 X xi2 − 4(x· )2 = 4n = 4n(n − 1)var(x c 1 , . . . , xn ) > 0. 25 / 171 The fitted line and the data center The fitted line y = α̂ + β̂x passes through the center of mass (centroid) of the data (x̄, ȳ ). This can be seen by substituting x = x̄ into the equation of the fitted line, which yields ȳ . 
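Pulling the simple linear regression slides together, here is a short NumPy sketch on simulated data (the parameter values are made up) that computes α̂ and β̂ from the closed-form expressions above and checks three of the stated properties: the residuals sum to zero, the residuals are orthogonal to x, and the fitted line passes through (x̄, ȳ).

import numpy as np

rng = np.random.default_rng(2)

# Simulated data from y = alpha + beta*x + error (made-up values).
n, alpha, beta = 100, 1.0, 2.5
x = rng.normal(size=n)
y = alpha + beta * x + rng.normal(scale=0.5, size=n)

xbar, ybar = x.mean(), y.mean()
beta_hat = np.sum(y * (x - xbar)) / np.sum((x - xbar) ** 2)  # = cov(y, x) / var(x)
alpha_hat = ybar - beta_hat * xbar
r = y - alpha_hat - beta_hat * x  # residuals

print(np.isclose(r.sum(), 0))                          # residuals sum to zero
print(np.isclose(np.dot(r, x), 0))                     # residuals orthogonal to x
print(np.isclose(alpha_hat + beta_hat * xbar, ybar))   # line passes through the centroid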
26 / 171 Regression slopes and the Pearson correlation The sample Pearson correlation coefficient between x and y is ρ̂ = cd ov(y , x) c )SD(x) c SD(y The relationship between β̂ and ρ̂ is β̂ = ρ̂ · c ) SD(y . c SD(x) Thus the fitted slope has the same sign as the Pearson correlation coefficient between y and x. 27 / 171 Reversing x and y Suppose that x and y are reversed in a simple linear regression. The slope of the regression of x on y is β̂∗ = c SD(x) cd ov(y , x) = cor(y c , x) . c ) var(y c ) SD(y If the data fall exactly on a line, then cor(y , x) = 1, so β̂∗ = 1/β̂, which is consistent with algebraically rearranging y = α + βx to x = −α/β + y /β. But if the data do not fit a line exactly, this property does not hold. 28 / 171 Norms The Euclidean norm on vectors (the most commonly used norm) is: kv k = sX vi2 = √ v 0v . i Here are some useful identities, for vectors v , w ∈ Rp : kv + w k2 = kv k2 + kw k2 + 2v 0 w kv − w k2 = kv k2 + kw k2 − 2v 0 w 29 / 171 Fitting multiple linear regression models Multiple regression is a regression analysis in which there may be more than one explanatory variable (i.e. p ≥ 1). In this case, the covariate data define the design matrix: 1 1 X = 1 x11 x21 ··· ··· x12 x22 ··· ··· ··· ··· ··· xn1 xn2 ··· x1p x2p xnp Note that in some situations the first column of 10 s (the intercept) will not be included. 30 / 171 Fitting multiple linear regression models In multiple regression, the linear model coefficients are written as a vector β = (β0 , β1 , . . . , βp )0 where β0 is the intercept and βk is the slope corresponding to the k th covariate. For a given working covariate vector β, the vector of fitted values is given by the matrix-vector product ŷ = X β, which is an n-dimensional vector. The vector of working residuals y − X β is also an n-dimensional vector. 31 / 171 Fitting multiple linear regression models The goal of least-squares estimation is to minimize the sum of squared differences between the fitted and observed values. L(β) = X (yi − ŷi )2 = ky − X βk2 . i Estimating β by minimizing L(β) is called ordinary least squares (OLS). 32 / 171 The multivariate chain rule Recall that if f and g are differentiable functions from R to R, and we set h = f ◦ g , i.e. h(x) = f (g (x)), then the chain rule states that h0 (x) = f 0 (g (x)) · g 0 (x). Now suppose g (·) is a map from Rm to Rn and f (·) is a scalar-valued function on Rn . If h = f ◦ g , i.e. h(z) = f (g (z)). Let fj0 (x) = ∂f (x)/∂xj , let ∇f (x) = (f10 (x), · · · , fn0 (x))0 denote the gradient of f , and let J denote the Jacobian of g Jij (z) = ∂gi (z)/∂zj . 33 / 171 The multivariate chain rule Then ∂h(z)/∂zj = X ∂gi (z)/∂zj · fi 0 (g (z)) i = [J(z)0 ∇f (g (z))]j Thus we can write the gradient of h as a matrix-vector product between the (transposed) Jacobian of g and the gradient of f : ∇h = J 0 ∇f where J is evaluated at z and ∇f is evaluated at g (z). 34 / 171 The least squares gradient function For the (multiple) least squares problem, the gradient of L(β) with respect to β is ∂L/∂β = −2X 0 (y − X β). This can be seen by differentiating L(β) = X (yi − β0 − i p X xij βj )2 j=1 element-wise, or by differentiating ky − X βk2 using the Pmultivariate chain rule, letting g (β) = y − xβ and f (x) = j xj2 . 35 / 171 Normal equations Setting ∂L/∂β = 0 yields the “normal equations:” X 0X β = X 0y Thus calculating the least squares estimate of β reduces to solving a system of p + 1 linear equations. 
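For concreteness, here is a minimal sketch (simulated design matrix, made-up coefficients) that forms X′X and X′y and solves the normal equations as a (p + 1) × (p + 1) linear system; the result is checked against numpy.linalg.lstsq, which solves the same least squares problem via a numerically stable factorization, anticipating the discussion that follows.

import numpy as np

rng = np.random.default_rng(3)

n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept plus p covariates
beta_true = np.array([1.0, 2.0, -1.0, 0.5])                 # made-up coefficients
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations X'X beta = X'y as a linear system.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from a stable least squares solver.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_ne, beta_ls))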
Algebraically we can write β = (X 0 X )−1 X 0 y , which is often useful for deriving analytical results. However this expression should not be used to numerically calculate the coefficients. 36 / 171 Solving the normal equations The most standard numerical approach is to calculate the QR decomposition of X X = QR where Q is a n × p + 1 orthogonal matrix (i.e. Q 0 Q = I ) and R is a p + 1 × p + 1 upper triangular matrix. The QR decomposition can be calculated rapidly, and highly precisely. Once it is obtained, the normal equations become Rβ = Q 0 y , which is an easily solved p + 1 × p + 1 triangular system. 37 / 171 Matrix products In multiple regression we encounter the matrix product X 0 X . Let’s review some ways to think about matrix products. If A ∈ Rn×m and B ∈ Rn×p are matrices, we can form the product A0 B ∈ Rm×p . Suppose we partition A and B by rows: − a1 − a2 ··· A= ··· − an − − − − b1 − b2 ··· B= ··· − bn − − − 38 / 171 Matrix products Then A0 B = a10 b1 + a20 b2 + · · · + an0 bn , where each aj0 bj term is an outer product aj1 bj1 aj2 bj1 0 aj bj = ··· aj1 bj2 aj2 bj2 ··· ··· ··· ∈ Rm×p . 39 / 171 Matrix products Now if we partition A and B by columns | A = a1 | | a2 | | a3 | | B = b1 | | b2 | | b3 , | then A0 B is a matrix of inner products a10 b1 a20 b1 A0 B = ··· a10 b2 a20 b2 ··· ··· ··· . ··· 40 / 171 Matrix products Thus we can view the product X 0 X involving the design matrix in two different ways. If we partition X by rows (the cases) − x1 − x2 ··· X = ··· − xn − − − then X 0X = n X xi0 xi i=1 is the sum of the outer product matrices of the cases. 41 / 171 Matrix products If we partition X by the columns (the variables) | X = x1 | | x2 | | x3 , | then X 0 X is a matrix whose entries are the pairwise inner products of the variables ([X 0 X ]ij = xi0 xj ). 42 / 171 Mathematical properties of the multiple regression fit The multiple least square solution is unique as long as the columns of X are linearly independent. Here is the proof: 1. The Hessian of L(β) is 2X 0 X . 2. For v 6= 0, v 0 (X 0 X )v = (Xv )0 Xv = kXv k2 > 0, since the columns of X are linearly independent. Therefore the Hessian of L is positive definite. 3. Since L(β) is quadratic with a positive definite Hessian matrix, it is convex and hence has a unique global minimizer. 43 / 171 Projections Suppose S is a subspace of Rd , and v is a vector in Rd . The projection operator PS maps v to the vector in S that is closest to v : PS (v ) = argminη∈S kv − ηk2 . 44 / 171 Projections Property 1: (v − PS (v ))0 s = 0 for all s ∈ S. To see this, let s ∈ S. Without loss of generality ksk = 1 and (v − PS (v ))0 s ≤ 0. Let λ ≥ 0, and write kv − PS (v ) + λsk2 = kv − PS (v )k2 + λ2 + 2λ(v − PS (v ))0 s. If (v − PS (v ))0 s 6= 0, then for sufficiently small λ > 0, λ2 + 2λ(v − PS (v ))0 s < 0. This means that PS (v ) − λs is closer to v than PS (v ), contradicting the definition of PS (v ). 45 / 171 Projections Property 2: Given a subspace S of Rd , any vector v ∈ Rd can be written uniquely in the form v = vS + vS ⊥ , where vS ∈ S and s 0 vS ⊥ = 0 for all s ∈ S. To prove uniqueness, suppose v = vS + vS ⊥ = ṽS + ṽS ⊥ . Then 0 = kvS − ṽS + vS ⊥ − ṽS ⊥ k2 = kvS − ṽS k2 + kvS ⊥ − ṽS ⊥ k2 + 2(vS − ṽS )0 (vS ⊥ − ṽS ⊥ ) = kvS − ṽS k2 + kvS ⊥ − ṽS ⊥ k2 . which is only possible if vS = ṽS and vS ⊥ − ṽS ⊥ . Existence follows from Property 1, with vS = PS (v ) and vS ⊥ = v − PS (v ). 46 / 171 Projections Property 3: The projection PS (v ) is unique. 
This follows from property 2 – if there exists a v for which v1 6= v2 ∈ S both minimize the distance from v to S, then v = v1 + u1 and v = v2 + u2 (uj = v − vj ) would be distinct decompositions of v as a sum of a vector in S and a vector in S ⊥ , contradicting property 2. 47 / 171 Projections Property 4: PS (PS (v )) = PS (v ). The proof of this is simple, since PS (v ) ∈ S, and any element of S has zero distance to itself. A matrix or linear map with this property is called idempotent. 48 / 171 Projections Property 5: PS is a linear operator. Let A, B be vectors with θA = PS (A) and θB = PS (B). Then we can write A + B = θA + θB + (A + B − θA − θB ), where θA + θB ∈ S, and s 0 (A + B − θA − θB ) = s 0 (A − θA ) + s 0 (B − θB ) = 0 for all s ∈ S. By Property 2 above, this representation is unique, so θA + θB = PS (A + B). 49 / 171 Projections Property 6: Suppose PS is the projection operator onto a subspace S. Then I − PS , where I is the identity matrix, is the projection operator onto the subspace S ⊥ ≡ {u ∈ Rd |u 0 s = 0 for all s ∈ S}. To prove this, write v = (I − PS )v + PS v , and note that ((I − PS )v )0 s = 0 for all s ∈ S, so (I − PS )v ∈ S ⊥ , and u 0 PS v = 0 for all u ∈ S ⊥ . By property 2 this decomposition is unique, and therefore I − PS is the projection operator onto S ⊥ . 50 / 171 Projections Property 7: Since PS (v ) is linear, it can be represented in the form PS (v ) = PS · v for a suitable square matrix PS . Suppose S is spanned by the columns of a non-singular matrix x. Then PS = X (X 0 X )−1 X 0 . To prove this, let Q = X (X 0 X )−1 x0 , so for an arbitrary vector V , v = Qv + (v − Qv ) = X (X 0 X )−1 X 0 v + (I − X (X 0 X )−1 X 0 )v and note that the first summand is in S while the second summand (by direct calculation) is perpendicular to the first summand, hence is in S ⊥ . 51 / 171 Projections Property 7 (continued): To show that the second summand is in S ⊥ , take s ∈ S and write s = Xb. Then s 0 (I − X (X 0 X )−1 X 0 )v = b 0 X 0 (I − X (X 0 X )−1 X 0 )v = 0. Therefore this is the unique decomposition from Property 2, above, so PS must be the projection. 52 / 171 Projections Property 7 (continued) An alternate approach to property 7 is constructive. Let θ = X γ, and suppose we wish to minimize the distance between θ and V . Using calculus, we differentiate with respect to γ and solve for the stationary point: ∂kv − X γk2 /∂γ = ∂(v 0 v − 2v 0 X γ + γ 0 X 0 X γ)/∂γ = −2x 0 v + 2X 0 X γ = 0. The solution is γ = (X 0 X )−1 X 0 v so θ = X (X 0 X )−1 X 0 v . 53 / 171 Projections Property 8: The composition of projections onto S and S ⊥ (in either order) is identically zero. PS ◦ PS ⊥ = PS ⊥ ◦ PS ≡ 0. This can be shown by direct calculation using the representations of PS and PS ⊥ given above. 54 / 171 Least squares and projections The least squares problem of minimizing ky − X βk2 is equivalent to minimizing ky − ηk2 over η ∈ col(X ). Therefore the minimizing value η̂ is the projection of y onto col(X ). If the columns of X are linearly independent, there is a unique vector β̂ such that X β̂ = η̂. These are the least squares coefficient estimates. 55 / 171 Row-wise and column-wise geometry of least squares The rows of the design matrix X are vectors in Rp+1 , the columns of the design matrix are vectors in Rn . The n-dimensional space containing y and col(X ) is called the variable space. The p + 2-dimensional space containing (xi , yi ) is called the case space. 56 / 171 Variable space geometry of least squares Thinking column-wise, we are working in Rn . 
The vector y containing all values of the dependent variable is a vector in Rn , and col(X ) is a p + 1 dimensional subspace of Rn . The least squares problem can be seen to have the goal of producing a vector of values that are in Rn , and that are as close as possible to y among all such vectors. We will usually write this vector as ŷ . It is obtained by projecting y onto col(X ). 57 / 171 Case space geometry of least squares Thinking row-wise, we are working in R p+2 . The data for one case can be written (xi , yi ), where xi is the i th row of X and yi is the i th element of y . (xi , yi ) ∈ Rp+2 is the “cloud of data points”, where each point includes both the independent and dependent variables in a single vector. Alternatively, we can think of xi ∈ Rp+1 as being the domain of the regression function E [y |x], which forms a surface above this domain. 58 / 171 Properties of the least squares fit • The fitted regression surface passes through the mean point (X̄ , ȳ ). To see this, note that the fitted surface at X̄ (the vector of column-wise means) is X̄ 0 β̂ = (10 X /n)β̂ = 10 X (X 0 X )−1 X 0 y /n, where 1 is a column vector of 1’s. Since 1 ∈ col(X ), it follows that X (X 0 X )−1 X 0 1 = 1, which gives the result, since ȳ = 10 y /n. 59 / 171 Properties of the least squares fit • The multiple regression residuals sum to zero. The residuals are R ≡ y − ŷ = y − PS y = (I − PS )y = PS ⊥ y , where S = col(x). The sum of residuals can be written 10 (I − X (X 0 X )−1 X 0 )y , where 1 is a vector of 1’s. If X includes an intercept, PS 1 = 1, so 10 I = 10 X (X 0 X )−1 X 0 = 10 , so 10 (I − X (X 0 X )−1 X 0 ) = 0. 60 / 171 Orthogonal matrices An orthogonal matrix X satisfies X 0 X = I . This is equivalent to stating that the columns of X are pairwise orthogonal and all columns of X have unit length. If X is square and orthogonal, then X 0 = X −1 and also XX 0 = I . If X is orthogonal then the projection onto col(X ) simplifies to X (X 0 X )−1 X 0 = XX 0 If X is orthogonal and the first column of X is constant, it follows that the remaining columns of X are centered and have sample variance 1/(n − 1). 61 / 171 Orthogonal matrices • If X is orthogonal, the slopes obtained by using multiple regression of y on x = (x1 , . . . , xp ) are the same as the slopes obtained by carrying out p simple linear regressions of y on each covariate separately. To see this, note that the multiple regression slope estimate for the j th covariate is β̂m,j = X:j0 y where X:j is column j of X . Since X 0 X = I it follows that each covariate has zero sample mean, and sample variance equal to 1/(n − 1). Thus the simple linear regression slope for covariate i is c :j , y )/var(X c :j ) = X:j0 y = β̂m,j . β̂j = cov(X 62 / 171 Comparing multiple regression and simple regression slopes • The signs of the multiple regression slopes need not agree with the signs of the corresponding simple regression slopes. For example, suppose there are two covariates, both with mean zero and variance 1, and for simplicity assume that y has mean zero and variance 1. Let r12 be the correlation between the two covariates, and let r1y and r2y be the correlations between each covariate and the response. It follows that n/(n − 1) 0 0 1 X 0 X /(n − 1) = 0 r12 0 r12 1 0 X 0 y /(n − 1) = r1y . r2y 63 / 171 Comparing multiple regression and simple regression slopes So we can write β̂ = (X 0 X /(n − 1))−1 (X 0 y /(n − 1)) = 1 r1y 2 1 − r12 r2y 0 − r12 r2y . 
− r12 r1y Thus, for example, if r1y , r2y , r12 ≥ 0, then if r12 > r1y /r2y , then β̂1 has opposite signs in single and multiple regression. Note that if r1y > r2y it is impossible for r12 > r1y /r2y . Thus, the effect direction for the covariate that is more strongly marginally correlated with y cannot be reversed. This is an example of “Simpson’s paradox”. 64 / 171 Comparing multiple regression and simple regression slopes Numerical example: if r1y = 0.1, r2y = 0.6, r12 = 0.6, then x1 is (marginally) positively associated with y , but for fixed values of x2 , the association between y and x1 is negative. 65 / 171 Formulation of regression models in terms of probability distributions Up to this point, we have primarily expressed regression models in terms of the mean structure, e.g. E [y |x] = β 0 x. Regression models are commonly discussed in terms of conditional moment structures, e.g. the conditional mean E [y |x], and the conditional variance var[y |x = x]. Alternatively, some regression models are specified in terms of conditional quantiles (Qp [y |x = x]). In either case, we are not modeling an explicit, fully-specified probability distribution. 66 / 171 Formulation of regression models in terms of probability distributions If we want to specify the model more completely, we can think in terms of a random “error” term that describes how the observed value y deviates from the ideal value f (x). A very general regression model is y = f (x, ). where is a random variable with expected value zero. If we specify the distribution of , then we have fully specified the distribution of y |x. 67 / 171 Formulation of regression models in terms of probability distributions A more restrictive “additive error” model is: y = f (x) + . Under this model, E [y |x] = E [f (x) + |x] = E [f (x)|x] + E [|x] = f (x) + E [|x]. Without loss of generality, E [|x] = 0, so E [y |x] = f (x). 68 / 171 Regression model formulations and parameterizations A parametric regression model is: y = f (x; θ) + , where θ is a finite dimensional parameter vector. Examples: 1. The linear response surface model f (x; θ) = θ0 x 2. The quadratic response surface model f (x; θ) = θ1 + θ2 x + θ3 x 2 3. The Gompertz curve f (x; θ) = θ1 exp(θ2 exp(θ3 x)) θ2 , θ3 ≤ 0. Models 1 and 2 are both “linear models” because they are linear in θ. The Gompertz curve is a non-linear model because it is not linear in θ. 69 / 171 Basic inference for simple linear regression This section deals with statistical properties of least square fits that can be derived under minimal conditions. Specifically, we will derive properties of β̂ that hold for the generating model y = x 0 β + , where: i E [|x] = 0 ii var[|x] = σ 2 < ∞ and is constant across cases iii the random variables are uncorrelated across cases (given x). We will not assume here that follows a particular distribution, e.g. a Gaussian distribution. 70 / 171 The relationship between E [|x] and cov(, x) If we treat x as a random variable, the condition that E [|x] = 0 for all x implies that cor(x, ) = 0. This follows from the double expectation theorem: cov(x, ) = E [x] − E [x] · E [] = E [x] = Ex [E [x|x]] = Ex [xE [|x]] = Ex [x · 0] = 0. 71 / 171 The relationship between E [|x] and cov(, x) The converse is not true. If cor(x, ) = 0 and E [] = 0 it may not be the case that E [|x] = 0. For example, if x ∈ {−1, 0, 1} and ∈ {−1, 1}, with joint distribution -1 0 1 -1 1/12 4/12 1/12 1 3/12 0 3/12 then E = 0 and cor(x, ) = 0, but E [|x] is not identically zero. 
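The discrete counterexample above can be checked by direct computation; the sketch below encodes the stated joint distribution and confirms that E[ε] = 0 and cov(x, ε) = 0, while E[ε | x] is not identically zero.

import numpy as np

# Joint distribution from the slide: rows are eps in {-1, +1},
# columns are x in {-1, 0, +1}.
eps_vals = np.array([-1.0, 1.0])
x_vals = np.array([-1.0, 0.0, 1.0])
p = np.array([[1, 4, 1],
              [3, 0, 3]]) / 12.0

E_eps = np.sum(eps_vals[:, None] * p)
E_x = np.sum(x_vals[None, :] * p)
E_xeps = np.sum(eps_vals[:, None] * x_vals[None, :] * p)
print("E[eps] =", E_eps)                      # 0
print("cov(x, eps) =", E_xeps - E_x * E_eps)  # 0

# Conditional means E[eps | x] are not identically zero.
for j, xv in enumerate(x_vals):
    px = p[:, j].sum()
    print("E[eps | x =", xv, "] =", np.sum(eps_vals * p[:, j]) / px)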
When and x are jointly Gaussian, cor(, x) = 0 implies that and x are independent, which in turn implies that E [|x] = E = 0. 72 / 171 Data sampling To be able to interpret the results of a regression analysis, we need to know how the data were sampled. In particular, it important to consider whether x and/or y and/or the pair x, y should be considered as random draws from a population. Here are two important situations: I Designed experiment: We are studying the effect of temperature on reaction yield in a chemical synthesis. The temperature x is controlled and set by the experimenter. In this case, y is randomly sampled conditionally on x, but x is not randomly sampled, so it doesn’t make sense to consider x to be a random variable. I Observational study: We are interested in the relationship between cholesterol level x and blood pressure y . We sample people at random from a well defined population (e.g. residents of Michigan) and measure their blood pressure and cholesterol levels. In this case, x and y are sampled from their joint distribution, and both can be viewed a random variables. 73 / 171 Data sampling There is another design that we may encounter: I Case/control study: Again suppose we are interested in the relationship between cholesterol level x, and blood pressure y . But now suppose we have an exhaustive list of blood pressure measurements for all residents of Michigan. We wish to select a subset of 500 individuals to contact for acquiring cholesterol measures, and it is decided that studying the 250 people with greatest blood pressure together with the 250 people with lowest blood pressure will be most informative. In this case x is randomly sampled conditionally on y , but y is not randomly sampled. Summary: Regression models are formulated in terms of the conditional distribution of y given x. The statistical properties of β̂ are easiest to calculate and interpret as being conditional on x. The way that the data are sampled also affects our interpretation of the results. 74 / 171 Basic inference for simple linear regression via OLS Above we showed that the slope and intercept estimates are P yi (xi − x̄) cd ov(y , x) = Pi , β̂ = 2 var[x] c i (xi − x̄) α̂ = ȳ − β̂x̄. Note that we are using the useful fact that X i (yi − ȳ )(xi − x̄) = X i yi (xi − x̄) = X (yi − ȳ )xi . i 75 / 171 Sampling means of parameter estimates First we will calculate the sampling means of α̂ and β̂. A useful identity is that β̂ = X X (α + βxi + i )(xi − x̄)/ (xi − x̄)2 = β+ i i X i (xi − x̄)/ i X (xi − x̄)2 . i From this identity it is clear that E [β̂|x] = β. Thus β̂ is unbiased (a parameter estimate is unbiased if its sampling mean is the same as the population value of the parameter). The intercept is also unbiased: E [α̂|x] = E (ȳ − β̂x̄|x) = α + βx̄ + E [¯ |x] − E [β̂x̄|x] = α 76 / 171 Sampling variances of parameter estimates Next we will calculate the sampling variances of α̂ and β̂. These values capture the variability of the parameter estimates over replicated samples from the same population. First we will need the following result: cov(β̂, ¯|x) = X = 0, cov(i , ¯|x)(xi − x̄)/ i X (xi − x̄)2 i since cov(i , ¯|x) = σ 2 /n does not depend on i. 77 / 171 Sampling variances of parameter estimates To derive the sampling variances, start with the identity: β̂ = β + X i (xi − x̄)/ i X (xi − x̄)2 . i The sampling variances are var[β̂|x] = σ2 / X (xi − x̄)2 i = σ 2 / ((n − 1)var[x]) c . 
var[α̂|x] = var[ȳ − β̂x̄|x] = var[α + βx̄ + ¯ − β̂x̄|x] = var[¯ |x] + x̄ 2 var[β̂|x] − 2x̄cov[¯ , β̂|x] = σ 2 /n + x̄ 2 σ 2 /((n − 1)var[x]). c 78 / 171 Sampling covariance of parameter estimates The sampling covariance between the slope and intercept is = cov[ȳ − β̂x̄, β̂|x] cov[α̂, β̂|x] cov[ȳ , β̂|x] − x̄var[β̂|x] = = cov[¯ , β̂|x] − x̄var[β̂|x] = −σ 2 x̄/((n − 1)var[x]). c When x̄ > 0, it’s easy to see what the expression for cov(α̂, β̂|x) is telling us: Underestimate slope, overestimate intercept Y − Overestimate slope, underestimate intercept 0.5 0.0 0.5 1.0 X 1.5 2.0 2.5 79 / 171 Basic inference for simple linear regression via OLS Some observations: I All variances scale with sample size like 1/n. I β̂ does not depend on x̄. I var[α̂] is minimized if x̄ = 0. I α̂ and β̂ are uncorrelated if x̄ = 0. 80 / 171 Some properties of residuals Start with the following useful expression: ri ≡ yi − α̂ − β̂xi = yi − ȳ − β̂(xi − x̄). Since yi = α + βxi + i and therefore ȳ = α + βx̄ + ¯, we can subtract to get yi − ȳ = β(xi − x̄) + i − ¯ it follows that ri = (β − β̂)(xi − x̄) + i − ¯. Since E β̂ = β and E [i − ¯] = 0,P it follows that Eri = 0. Note that this is a distinct fact from the identity i ri = 0. 81 / 171 Some properties of residuals It is important to distinguish the residual ri from the “observation errors” i . The identity ri = (β − β̂)(xi − x̄) + i − ¯ shows us that the centered observation error i − ¯ is one part of the residual. The other term (β − β̂)(xi − x̄) reflects the fact that the residuals are also influenced by how well we recover the true slope β through our estimate β̂. 82 / 171 Some properties of residuals To illustrate, consider two possibilities: I We overestimate the slope (β − β̂ < 0). The residuals to the right of the right of the mean (i.e. when x > x̄) are shifted down (toward −∞), and the residuals to the left of the mean are shifted up. I We underestimate the slope (β − β̂ > 0). The residuals to the right of the right of the mean are shifted up, and the residuals to the left of the mean are shifted down. 83 / 171 Some properties of residuals We can use the identity ri = (β − β̂)(xi − x̄) + i − ¯ to derive the variance of each residual. We have: var[(β − β̂)(xi − x̄)] = σ 2 (xi − x̄)2 / X (xj − x̄)2 j var[i − ¯|x] = var[i |x] + var[¯ |x) − 2cov(i , ¯|x) = σ 2 + σ 2 /n − 2σ 2 /n = σ 2 (n − 1)/n. cov (β − β̂)(xi − x̄), i − ¯|x = −(xi − x̄)cov(β̂, i |x) X = −σ 2 (xi − x̄)2 / (xj − x̄)2 , j 84 / 171 Some properties of residuals Thus the variance of the i th residual is var[ri |x] = var (β − β̂)(xi − x̄) + i − ¯|x = var (β − β̂)(xi − x̄)|x + var[i − ¯|x] + 2cov (β − β̂)(xi − x̄), i − ¯|x X = (n − 1)σ 2 /n − σ 2 (xi − x̄)2 / (xj − x̄)2 . j Note the following fact, which ensures that this expression is non-negative: (xi − x̄)2 / X (xj − x̄)2 ≤ (n − 1)/n. j 85 / 171 Some properties of residuals I The residuals are not iid – they are correlated with each other, and they have different variances. I var[ri ] < var[i ] – the residuals are less variable than the errors. 86 / 171 Some properties of residuals 2 Since E [rP i ] = 0 it follows that var[ri ] = E [ri ]. Therefore the expected 2 value of i ri is (n − 1)σ 2 − σ 2 = (n − 2)σ 2 since the (xi − x̄)2 / P j (xj − x̄)2 sum to one. 87 / 171 Estimating the error variance σ 2 Since E X ri2 = (n − 2)σ 2 i it follows that X ri2 /(n − 2) i is an unbiased estimate of σ 2 . That is ! E X ri2 /(n − 2) = σ2 . i 88 / 171 Review of moment calculations for random vectors Suppose that z ∈ Rd is a random vector. 
The expected value of z, denoted E [z], is a fixed vector in Rd , and the covariance matrix of z, denote cov[z], is a fixed d × d matrix. (cov[z])ij = cov(zi , zj ), the covariance between the i th and j th elements of z. The diagonal element (cov[z])ii = var[zi ] is the variance of the i th element of z. 89 / 171 Basic inference for multiple linear regression via OLS We have the following useful identity for the multiple linear regression least squares fit: β̂ = (X 0 X )−1 X 0 y = (X 0 X )−1 X 0 (X β + ) = β + (X 0 X )−1 X 0 . Letting η = (X 0 X )−1 X 0 we see that E [η|X ] = 0 and var[η|X ] = var[β̂|X ] = (X 0 X )−1 X 0 var[|X ]X (X 0 X )−1 = σ 2 (X 0 X )−1 • var[|X ] = σ 2 I since (i) the i are uncorrelated and (ii) the i have constant variance given x. 90 / 171 Variance of β̂ in multiple regression OLS Let ui be the i th row of the design matrix X . Then X 0X = X ui0 ui i where ui0 ui is an outer-product (a p + 1 × p + 1 matrix). If we have a limiting behavior n−1 X ui0 ui → Q, i for a fixed p + 1 × p + 1 matrix Q, then X 0 X ≈ nQ, so cov(β̂|X ) ≈ σ 2 Q −1 /n. Thus we see the usual influence of sample size on the standard errors of the regression coefficients. 91 / 171 Level sets of quadratic forms Suppose 1 0.5 2 3 C= 0.5 1 so that C −1 = 2 −1 −1 2 Left side: level curves of g (x) = x 0 Cx Right side: level curves of h(x) = x 0 C −1 x 3 3 2 1 1.0 0 0 0.5 1 1 4.0 2 33 0 4.0 3. 1.0 0.5 3 2.0.0 1 5.0 5.0 2.0 2 2 2 1 0 1 2 3 33 2 1 0 1 2 3 92 / 171 Eigen-decompositions and quadratic forms The dominant eigenvector of C maximizes the “Rayleigh quotient” g (x) = x 0 Cx/x 0 x, thus it points in the direction of greatest change of g . If C= 1 0.5 0.5 1 , the dominant eigenvector points in the (1, 1) direction. The dominant eigenvector of C −1 points in the (−1, 1) direction. P If the eigendecomposition of C is j λj vj vj0 then the eigendecomposition P 0 of C −1 is j λ−1 j vj vj – thus directions in which the level curves of C are most spread out are the directions in which the level curves of C −1 are most compressed. 93 / 171 Variance of β̂ in multiple regression OLS If p = 2 and 1 X 0 X /n = 0 0 0 1 r 0 r , 1 then 1 cov(β̂) = σ 2 n−1 0 0 0 0 1/(1 − r 2 ) −r /(1 − r 2 ) . −r /(1 − r 2 ) 1/(1 − r 2 ) So if p = 2 and x1 and x2 are positively colinear (meaning that (x1 − x̄1 )0 (x2 − x̄2 ) > 0), the corresponding slope estimates β̂1 and β̂2 are negatively correlated. 94 / 171 The residual sum of squares The residual sum of squares (RSS) is the squared norm of the residual vector: RSS = ky − ŷ k2 = ky − Py k2 = k(I − P)y k2 = y 0 (I − P)(I − P)y = y 0 (I − P)y , where P is the projection matrix onto col(X ). The last equivalence follows from the fact that I − P is a projection and hence is idempotent. 95 / 171 The expected value of the RSS The expression RSS = y 0 (I − P)y is a quadratic form in y , and we can write y 0 (I − P)y = tr (y 0 (I − P)y ) = tr ((I − P)yy 0 ) , where the second equality uses the circulant property of the trace. For three factors, the circulent property states that tr(ABC ) = tr(CAB) = tr(BCA). 96 / 171 The expected value of the RSS By linearity we have E tr[(I − P)yy 0 ] = tr[(I − P) · Eyy 0 ], and Eyy 0 = EX ββ 0 X 0 + Exβ0 + E β 0 X 0 + E 0 = X ββ 0 X 0 + E 0 = X ββ 0 X 0 + σ 2 I . Since PX = X and hence (I − P)X = 0, (I − P)E [YY 0 ] = σ 2 (I − P). Therefore the expected value of the RSS is E [RSS] = σ 2 tr(I − P). 97 / 171 Four more properties of projection matrices Property 9: A projection matrix PS is symmetric. One way to show this is to let v1 , . . . 
, vq be an orthonormal basis for S, where PS is the projection onto S. Then complete the vj with vq+1 , . . . , vd to get an orthonormal basis. By direct calculation, (PS − q X vj vj0 )Vk = 0 j=1 for all k, hence PS = Pq 0 j=1 vj vj which is symmetric. 98 / 171 Four more properties of projection matrices Property 10: A projection matrix is positive semidefinite. Let v be an arbitrary vector and write v = v1 + v2 , where v1 ∈ S and v2 ∈ S ⊥ . Then (v1 + v2 )0 PS (v1 + v2 ) = v10 v1 ≥ 0. 99 / 171 Four more properties of projection matrices Property 11: The eigenvalues of a projection matrix PS must be zero or one. Suppose λ, v is an eigenvalue/eigenvector pair: PS v = λv . Since PS is the projection onto the subspace S, this implies that λv is the closest element of S to v . But if λv ∈ S then v ∈ S, and is strictly closer to v than λv , unless λ = 1 or v = 0. Therefore only 0 and 1 can be eigenvalues of PS . 100 / 171 Four more properties of projection matrices Property 12: The trace of a projection matrix is its rank. The rank of a matrix is the number of nonzero eigenvalues. The trace of a matrix is the sum of all eigenvalues. Since the nonzero eigenvalues of a projection matrix are all 1, the rank and the trace must be identical. 101 / 171 The expected value of the RSS We know that E [RSS] = σ 2 tr(I − P), where P is the projection onto col(X ). Since I − P is the projection onto col(X )⊥ , I − P has rank n − rank(X ) = n − p − 1. Thus E [RSS] = σ 2 (n − p − 1), so RSS/(n − p − 1) is an unbiased estimate of σ 2 . 102 / 171 Covariance matrix of residuals Since E β̂ = β, it follows that E ŷ = X β = Ey . Therefore we can derive the following simple expression for the covariance matrix of the residuals. cov(y − ŷ ) = E [(y − ŷ )(y − ŷ )0 ] = (I − P)Eyy 0 (I − P) = (I − P)(X ββ 0 X 0 + σ 2 I )(I − P) = σ 2 (I − P) 103 / 171 Distribution of the RSS The RSS can be written RSS = tr[(I − P)yy 0 ] = tr[(I − P)0 ] Therefore, the distribution of the RSS does not depend on β. It also depends on X only through col(X ). 104 / 171 Distribution of the RSS If the distribution of is invariant under orthogonal transforms, i.e. d = Q when Q is a square orthogonal matrix, then we can make the stronger statement that the distribution of the RSS only depends on X through its rank. To see this, construct a square orthogonal matrix Q so that Q 0 (I − P)Q is the projection onto a fixed subspace S of dimension n − p − 1 (so Q 0 maps col(I − P) to S). Then tr[(I − P)0 ] = d tr[(I − P)Q(Q)0 ] = tr[Q 0 (I − P)Q0 ] Note that since Q is square we have QQ 0 = Q 0 Q = I . 105 / 171 Optimality For a given design matrix X , there are many linear estimators that are unbiased for β. That is, there are many matrices M ∈ Rp+1×n such that E [My |X ] = β for all β. The Gauss-Markov theorem states that among these, the least squares estimate is “best,” in the sense that its covariance matrix is “smallest.” Here we are using the definition that a matrix A is “smaller” than a matrix B if B −A is positive definite. 106 / 171 Optimality (BLUE) Letting β ∗ = My be any linear unbiased estimator of β, when cov(β̂|X ) ≤ cov(β ∗ |X ), this implies that for any fixed vector θ, var[θ0 β̂|X ] ≤ var[θ0 β ∗ |X ]. The Gauss-Markov theorem implies that the least squares estimate β̂ is the “BLUE” (best linear unbiased estimator) for the least squares model. 107 / 171 Optimality (proof of GMT) The idea of the proof is to show that for any linear unbiased estimate β ∗ of β, β ∗ − β̂ and β̂ are uncorrelated. 
It follows that cov(β ∗ |X ) = cov(β ∗ − β̂ + β̂|X ) = cov(β ∗ − β̂|X ) + cov(β̂|X ) ≥ cov(β̂|X ). To prove the theorem note that E [β ∗ |X ] = M · E [y |X ] = MX β = β for all β, and let B = (X 0 X )−1 X 0 , so that E [β̂|X ] = BX β = β for all β, so (M − B)X ≡ 0. 108 / 171 Optimality (proof of GMT) Therefore cov(β ∗ − β̂, β̂|X ) = E [(M − B)y (By − β)0 |X ] = E [(M − B)yy 0 B 0 |X ] − E [(M − B)y β 0 |X ] = (M − B)(X ββ 0 X 0 + σ 2 I )B 0 − (M − B)X ββ 0 = σ 2 (M − B)B 0 = 0. Note that we have an explicit expression for the gap between cov(β̂) and cov(β ∗ ): cov(β ∗ − β̂|X ) = σ 2 (M − B)(M − B)0 . 109 / 171 Multivariate density families If C is a covariance matrix, many families of multivariate densities have the form c · φ((x − µ)0 C −1 (x − µ)), where c > 0 and φ : R+ → R+ is a function, typically decreasing with a mode at the origin (e.g. for the multivariate normal density, φ(u) = exp(−u/2) and for the multivariate t-distribution with d degrees of freedom, φ(u) = (1 + u/d)−(d+1)/2 ) . Left side: level sets of g (x) = x 0 Cx Right side: level sets of a density c · φ((x − µ)0 C −1 (x − µ)) 3 3 2 0 0.5 1 2 2 1 0 1 2 3 3 1 1 0 2 3 2 3 2 1 0.0 5 0.5 0.0 01 1 2 1 0 0.0 5.0 2.0 4.0 3.0 1.0 1 33 2 0.10.15 0 33 3 2 1 0.1 0.05 1 4.0 2 33 01 1 1.0 0 0.1 5 1 0.0 1 5 3 .0 2.0.0 0.0 2 2 2 1 0 1 2 3 33 2 1 0 1 110 / 171 Regression inference with Gaussian errors The random vector x = (x1 , . . . , xp )0 has a p-dimensional standard multivariate normal distribution if its components are independent and follow standard normal marginal distributions. The density of x is the product of p standard normal densities: p(x) = (2π)−p/2 exp(− 1 1X 2 xj ) = (2π)−p/2 exp(− x 0 x). 2 2 j 111 / 171 Regression inference with Gaussian errors If we transform Z = µ + Ax, we get a random variable satisfying EZ cov(Z ) = µ = Acov(x)A0 = AA0 ≡ Σ. 112 / 171 Regression inference with Gaussian errors The density of Z can be obtained using the change of variables formula: 1 p(Z ) = (2π)−p/2 |A−1 | exp − (Z − µ)0 A−T A−1 (Z − µ) 2 1 = (2π)−p/2 |Σ|−1/2 exp − (Z − µ)0 Σ−1 (Z − µ) 2 This distribution is denoted N(µ, Σ). The log-density is 1 1 − log |Σ| − (Z − µ)0 Σ−1 (Z − µ), 2 2 with the constant term dropped. 113 / 171 Regression inference with Gaussian errors The joint log-density for an iid sample of size n is 1X n n n (xi − µ)0 Σ−1 (xi − µ) = − log |Σ| − tr Sxx Σ−1 − log |Σ| − 2 2 2 2 i where Sxx = X (xi − µ)(xi − µ)0 /n. 114 / 171 The Cholesky decomposition If Σ is a non-singular covariance matrix, there is a lower triangular matrix A with positive diagonal elements such that AA0 = Σ This matrix can be denoted Σ1/2 , and is called the Cholesky square root. 115 / 171 Properties of the multivariate normal distribution: A linear function of a multivariate normal random vector is also multivariate normal. Specifically, if x ∼ N(µ, Σ) is p-variate normal, and θ is a q × p matrix with q ≤ p, then y ∼ θx has a N(θµ, θΣθ0 ) distribution. To prove this fact, let Z ∼ N(0, Ip ), and write x = µ + AZ where AA0 = Σ is the Cholesky decomposition. 116 / 171 Properties of the multivariate normal distribution: Next, extend θ to a square invertible matrix θ̃ = θ θ∗ . where θ∗ ∈ Rp−q×p . The matrix θ∗ can be chosen such that θΣθ∗0 = 0, by the Gram-Schmidt procedure. Let ỹ = θ̃x = y y∗ = θµ + θAZ θ∗ µ + θ∗ AZ . 
117 / 171 Properties of the multivariate normal distribution: Therefore cov(ỹ ) = θΣθ0 0 0 θ∗ Σθ∗0 , and cov(ỹ )−1 = (θΣθ0 )−1 0 0 (θ∗ Σθ∗0 )−1 , Using the change of variables formula, and the structure of the multivariate normal density, it follows that p(ỹ ) = p(y )p(y ∗ ). This implies that y and y ∗ are independent, and by inspecting the form of their densities, both are seen to be multivariate normal. 118 / 171 Properties of the multivariate normal distribution: • A consequence of the above argument is that, uncorrelated components of a multivariate normal vector are independent. • If x is a standard multivariate normal vector, and Q is a square orthogonal matrix, then Qx is also standard multivariate normal. This follows directly from the fact that QQ 0 = I . 119 / 171 The χ2 distribution If z is a standard normal random variable, the density of z 2 can be calculated directly as √ p(z) = z −1/2 exp(−z/2)/ 2π. This is the χ21 distribution. The χ2p distribution is defined to be the distribution of p X zj2 j=1 where z1 , . . . , zp are iid standard normal random variables. 120 / 171 The moments of the χ2 distribution By direct calculation, if F ∼ χ21 , EF = 1 var[F ] = 2. Therefore the mean of the χ2p distribution is p and the variance is 2p. 121 / 171 The density of the χ2 distribution The χ2p density is p(x) = 1 x p/2−1 exp(−x/2). Γ(p/2)2p/2 To prove this by induction, assume that the χ2p−1 density is 1 x (p−1)/2−1 exp(−x/2). Γ((p − 1)/2)2(p−1)/2 The χ2p density is the density of the sum of a χ2p−1 random variable and a χ21 random variable, which can be written 122 / 171 The density of the χ2 distribution 1 Γ((p − 1)/2)Γ(1/2)2p/2 x Z s (p−1)/2−1 exp(−s/2)(x − s)−1/2 exp(−(x − s)/2)ds Z x 1 exp(−x/2) s (p−1)/2−1 (x − s)−1/2 ds Γ((p − 1)/2)Γ(1/2)2p/2 0 Z 1 1 p/2−1 exp(−x/2)x u p/2−3/2 (1 − u)−1/2 du. Γ((p − 1)/2)Γ(1/2)2p/2 0 = 0 = where u = s/x. The density for χ2p can now be obtained by applying the identities Z Γ(1/2) = u α−1 (1 − u)β−1 du = √ π 1 Γ(α)Γ(β)/Γ(α + β). 0 123 / 171 The χ2 distribution and the RSS Let P be a projection matrix and z be a iid vector of standard normal values. For any square orthogonal matrix Q, z 0 Pz = (Qz)0 QPQ 0 (Qz). Since Qz is equal in distribution to z, z 0 Pz is equal in distribution to z 0 QPQ 0 z. If the rank of P is k, we can choose Q so that QPQ 0 is the projection onto the first k canonical basis vectors, i.e. 0 QPQ = Ik 0 0 0 , where Ik is the k × k identity matrix. 124 / 171 The χ2 distribution and the RSS This gives us d z 0 Pz = z 0 QPQ 0 z = k X zj2 j=1 which follows a χ2k distribution. It follows that n−p−1 2 σ̂ σ2 = y 0 (I − P)y /σ 2 = (/σ)0 (I − P)(/σ) ∼ χ2n−p−1 . 125 / 171 The χ2 distribution and the RSS Thus when the errors are Gaussian, we have var[σ̂ 2 ] = 2σ 4 . n−p−1 and SD[σ̂ 2 ] = σ 2 2 n−p−1 1/2 . Note that the SD of σ̂ 2 depends on σ 2 , whereas the SD of β̂ does not depend on β. 126 / 171 The t distribution Suppose Z is standard normal, V ∼ χ2p , and V is independent of Z . Then T = √ √ pZ / V has a “t distribution with p degrees of freedom.” Note that by the law of large numbers, V /p converges almost surely to 1 as p grows. Therefore T converges in distribution to a standard normal distribution. To derive the t density apply the change of variables formula. Let U W ≡ √ Z/ V V 127 / 171 The t distribution The Jacobian is J= √ 1/ V 0 −Z /2V 3/2 1 = V −1/2 = W −1/2 . 
Since the joint density of Z and V is p(Z , V ) ∝ exp(−Z 2 /2)V p/2−1 exp(−V /2), it follows that the joint density of U and W is p(U, W ) ∝ exp(−U 2 W /2)W p/2−1/2 exp(−W /2) = exp(−W (U 2 + 1)/2)W p/2−1/2 . 128 / 171 The t distribution Therefore Z p(U) = p(U, W )dW Z exp(−W (U 2 + 1)/2)W p/2−1/2 dW Z = (U 2 + 1)−p/2−1/2 exp(−y /2)y p/2−1/2 dy ∝ ∝ (U 2 + 1)−p/2−1/2 . where y = W (U 2 + 1). √ Finally, write T = pU to get that p(T ) ∝ (T 2 /p + 1)−(p+1)/2 . 129 / 171 Independence of estimated mean and variance parameters To review, the linear model residuals are R ≡ y − ŷ = (I − P)y and the estimated coefficients are β̂ = (X 0 X )−1 X 0 y . 130 / 171 Independence of estimated mean and variance parameters Recalling that (I − P)X = 0, and E [y − ŷ |X ] = 0, it follows that cov(y − ŷ , β̂) = E [(y − ŷ )β̂ 0 ] = E [(I − P)yy 0 X (X 0 X )−1 ] = (I − P)(X ββ 0 X 0 + σ 2 I )X (X 0 X )−1 = 0. Therefore every estimated coefficient is uncorrelated with every residual. If is Gaussian, they are also independent. Since σ̂ 2 is a function of the residuals, it follows that if is Gaussian, then β̂ and σ̂ 2 are independent. 131 / 171 Confidence interval for a regression coefficient Let Vk = [(X 0 X )−1 ]kk so that var[β̂k |X ] = σ 2 Vk . If the are multivariate Gaussian N(0, σ 2 I ), then β̂k − βk √ ∼ N(0, 1). σ Vk 132 / 171 Confidence intervals for a regression coefficient Therefore (n − p − 1)σ̂ 2 /σ 2 ∼ χ2n−p−1 and using the fact that β̂ and σ̂ 2 are independent, it follows that the pivotal quantity β̂k − βk √ σ̂ 2 Vk has a t-distribution with n − p − 1 degrees of freedom. 133 / 171 Confidence intervals for a regression coefficient Therefore if QT (q, k) is the q th quantile of the t-distribution with k degrees of freedom, then for 0 ≤ α ≤ 1, and qα = QT (1 − (1 − α)/2, n − p − 1), p P −qα ≤ (β̂k − βk )/ σ̂ 2 Vk ≤ qα = α. Rearranging terms we get the confidence interval p p P β̂k − σ̂qα Vk ≤ βk ≤ β̂k + σ̂qα Vk = α which has coverage probability α. 134 / 171 Confidence intervals for the expected response Let x ∗ be any point in Rp+1 . The expected response at x = x ∗ is E [y |x = x ∗ ] = β 0 x ∗ . A point estimate for this value is β̂ 0 x ∗ , which is unbiased since β̂ is unbiased, and has variance var[β̂ 0 x ∗ ] = σ 2 x ∗0 (X 0 X )−1 x ∗ ≡ σ 2 Vx ∗ . 135 / 171 Confidence intervals for the expected response As above we have that β̂ 0 x ∗ − β 0 x ∗ √ σ̂ 2 Vx ∗ has a t-distribution with n − p − 1 degrees of freedom. Therefore P(β̂ 0 x ∗ − σ̂qα p Vx ∗ ≤ β 0 x ∗ ≤ β̂ 0 x ∗ + σ̂qα p Vx ∗ ) = α defines a CI for E [y |x = x ∗ ] with coverage probability α. 136 / 171 Prediction intervals Suppose y ∗ is a new observation at x = x ∗ , independent of the data used to estimate β̂ and σ̂ 2 . If the errors are Gaussian, then y ∗ − β̂ 0 x ∗ is Gaussian, with the following mean and variance: E [y ∗ − β̂ 0 x ∗ |X ] = β 0 x ∗ − β 0 x ∗ = 0 and var[y ∗ − β̂ 0 x ∗ |X ] = σ 2 (1 + Vx ∗ ), 137 / 171 Prediction intervals It follows that y ∗ − β̂ 0 x ∗ p σ̂ 2 (1 + Vx ∗ ) has a t-distribution with n − p − 1 degrees of freedom. Therefore a prediction interval at x ∗ with coverage probability α is defined by p p P β̂ 0 x ∗ − σ̂qα (1 + Vx ∗ ) ≤ y ∗ ≤ β̂ 0 x ∗ + σ̂qα (1 + Vx ∗ ) = α. 138 / 171 Wald tests Suppose we want to carry out a formal hypothesis test for the null hypothesis βk = 0, for some specified index k. Since var(β̂k |X ) = σ 2 Vk , the “Z-score” β̂k / p σ 2 Vk follows a standard normal distribution under the null hypothesis. 
The “Z-test” √ or “asymptotic Wald test” rejects the null hypothesis if |β̂k |/ σ 2 Vk > F −1 (1 − α/2), where F is the standard normal cumulative distribution function (CDF). More generally, we can also test hypotheses of the form βk = c using the Z-score (β̂k − c)/ p σ 2 Vk 139 / 171 Wald tests If the errors are normally distributed, then β̂k / p σ̂ 2 Vk follows a t-distribution with n − p − 1 √ degrees of freedom, and the Wald test rejects the null hypothesis if |β̂k |/ σ 2 Vk > Ft−1 (1 − α/2, n − p − 1), where Ft (·, d) is the student t CDF with d degrees of freedom, and α is the type-I error probability (e.g. α = 0.05). √ If the sample size is large, the Wald statistic β̂k / σ̂ 2 Vk approximately follows a standard normal distribution (undert the null hypothesis) even if the residual variation () is not Gaussian. 140 / 171 Wald tests for contrasts More generally, we can consider the contrast θ0 β defined by a fixed vector θ ∈ Rp+1 . The population value of the contrast can be estimated by the plug-in estimate θ0 β̂. We can test the null hypothesis θ0 β = 0 with the Z-score √ θ0 β̂/ σ̂ 2 θ0 V θ. This statistic approximately follows a standard normal distribution, and more “exactly” (under specified assumptions) it follows a student t-distribution with n − p − 1 degrees of freedom (all under the null hypothesis). 141 / 171 F-tests Suppose we have two nested design matrices X1 ∈ Rn×p1 and X2 ∈ Rn×p2 , such that col(X1 ) ⊂ col(X2 ). We may wish to compare the model defined by X1 to the model defined by X2 . To do this, we need a test statistic that discriminates between the two models. Let P1 and P2 be the corresponding projections, and let ŷ (1) = P1 y (2) = P2 y ŷ be the fitted values. 142 / 171 F-tests Due to the nesting, P2 P1 = P1 and P1 P2 = P1 . Therefore (P2 − P1 )2 = P2 − P1 , so P2 − P1 is a projection that projects onto col(X2 ) − col(X1 ), the complement of col(X1 ) in col(X2 ). Since (I − P2 )(P2 − P1 ) = 0, it follows that if E [y ] ∈ col(X2 ) then Cov(y − ŷ (2) , ŷ (2) − ŷ (1) ) = 0. If the linear model errors are Gaussian, y − ŷ (2) and ŷ (2) − ŷ (1) are independent. If E [y ] ∈ col(X1 ) then (y − ŷ (2) )/σ and (ŷ (2) − ŷ (1) )/σ each have zero mean and unit standard deviation. 143 / 171 F-tests Since P2 X1 = P1 X1 = X1 , we have (I − P2 )X1 = (P2 − P1 )X1 = 0. Now suppose we take as the null hypothesis that E [Y ] ∈ col(x1 ), so we can write y = θ + , where θ ∈ col(X1 ). Therefore under the null hypothesis ky − ŷ (2) k2 = tr[(I − P2 )yy 0 ] = tr[(I − P2 )0 ] = 0 (I − P2 ). and kŷ (2) − ŷ (1) k2 = tr[(P2 − P1 )yy 0 ] = tr[(P2 − P1 )0 ] = 0 (P2 − P1 ). 144 / 171 F-tests Since I − P2 and P2 − P1 are projections onto subspaces of dimension n − p2 and p2 − p1 , respectively, it follows that ky − ŷ (2) k2 /σ 2 = k(I − P2 )y k2 /σ 2 ∼ χ2n−p2 and under the null hypothesis kŷ (2) − ŷ (1) k2 /σ 2 = k(P2 − P1 )y k2 /σ 2 ∼ χ2p2 −p1 . Therefore kŷ (2) − ŷ (1) k2 /(p2 − p1 ) . ky − ŷ (2) k2 /(n − p2 ) Since kŷ (2) − ŷ (1) k2 will tend to be large when E [y |X ] 6∈ col(X1 ), i.e. when the null hypothesis is false, this quantity can be used as a test-statistic. It is called the F-test statistic. 145 / 171 The F distribution The F-test statistic is the ratio between two rescaled independent χ2 draws. If U ∼ χ2p and V ∼ χ2q , then U/p V /q has an “F-distribution with p, q degrees of freedom,” denoted Fp,q . To derive the kernel of the density, let x y ≡ U/V V . 146 / 171 The F distribution The Jacobian of the map is 1/V 0 −U/V 2 1 = 1/V . 
The joint density of U and V is p(U, V ) ∝ U p/2−1 exp(−U/2)V q/2−1 exp(−V /2). The joint density of x and y is p(x, y ) ∝ x p/2−1 y (p+q)/2−1 exp(−y (x + 1)/2). 147 / 171 The F distribution Now let Z = y (x + 1), so p(x, Z ) ∝ x p/2−1 Z (p+q)/2−1 (x + 1)−(p+q)/2 exp(−Z /2) and hence p(x) ∝ x p/2−1 /(x + 1)(p+q)/2 . 148 / 171 The F distribution Now if we let F = U/p q = x V /q p then p(F ) ∝ F p/2−1 /(pF /q + 1)(p+q)/2 . 149 / 171 The F distribution Therefore, the F-test statistic follows an Fp2 −p1 ,n−p2 distribution kŷ (2) − ŷ (1) k2 /(p2 − p1 ) ∼ Fp2 −p1 ,n−p2 . ky − ŷ (2) k2 /(n − p2 ) 150 / 171 Simultaneous confidence intervals If θ is a fixed vector, we can cover the value θ0 β with probability α by pivoting on the tn−p−1 -distributed pivotal quantity θ0 β̂ − θ0 β √ , σ̂ Vθ where Vθ = θ0 (X 0 X )−1 θ. Now suppose we have a set T ⊂ Rp+1 of vectors θ, and we want to construct a set of confidence intervals such that P(all θ0 β covered, θ ∈ T ) = α. We call this a set of simultaneous confidence intervals for {θ0 β; θ ∈ T }. 151 / 171 The Bonferroni approach to simultaneous confidence intervals The Bonferroni approach can be applied when T is a finite set, |T | = k. Let Ij = I(CI j covers θj0 β) and Ij0 = I(CI j does not cover θj0 β). Then the union bound implies that P(I1 and I2 · · · and Ik ) 1 − P(I10 or I20 · · · or Ik0 ) X ≥ 1− P(Ij0 ). = j 152 / 171 The Bonferroni approach to simultaneous confidence intervals As long as 1−α≥ X P(Ij0 ), j the intervals cover simultaneously. One way to achieve this is if each interval individually has probability α0 ≡ 1 − (1 − α)/k of covering its corresponding true value. To do this, use the same approach as used to construct single confidence intervals, but with a much larger value of qα . 153 / 171 The Scheffé approach to simultaneous confidence intervals The Scheffé approach can be applied if T is a linear subspace of Rp+1 . Begin with the pivotal quantity θ0 β̂ − θ0 β p , σ̂ 2 θ0 (X 0 X )−1 θ and postulate that a symmetric interval can be found so that P ! θ0 β̂ − θ0 β −Qα ≤ p ≤ Qα for all θ ∈ T σ̂ 2 θ0 (X 0 X )−1 θ = α. Equivalently, we can write P supθ∈T (θ0 β̂ − θ0 β)2 ≤ Qα2 σ̂ 2 θ0 (X 0 X )−1 θ ! = α. 154 / 171 The Scheffé approach approach to simultaneous confidence intervals Since β̂ − β = (X 0 X )−1 X 0 , we have (θ0 β̂ − θ0 β)2 σ̂ 2 θ0 (X 0 X )−1 θ = = θ0 (X 0 X )−1 X 0 0 X (X 0 X )−1 θ σ̂ 2 θ0 (X 0 X )−1 θ Mθ0 0 Mθ , σ̂ 2 Mθ0 Mθ where Mθ = X (X 0 X )−1 θ. 155 / 171 The Scheffé approach to simultaneous confidence intervals Note that Mθ0 0 Mθ = h, Mθ /kMθ ki2 /σ̂ 2 , σ̂ 2 Mθ0 Mθ i.e. it is the squared length of the projection of onto the line spanned by Mθ (divided by σ̂ 2 ). The quantity h, Mθ /kMθ ki2 is maximized at kPk2 , where P is the projection onto the linear space {X (X 0 X )−1 θ | θ ∈ T } = {Mθ }. Therefore supθ∈T h, Mθ /kMθ ki2 /σ̂ 2 = kPk2 /σ̂ 2 , and since {Mθ } ⊂ col(X ), it follows that P and σ̂ 2 are independent. 156 / 171 The Scheffé approach to simultaneous confidence intervals Moreover, kPk2 /σ 2 ∼ χ2q where q = dim(T ), and as we know, n−p−1 2 σ̂ ∼ χ2n−p−1 . σ2 Thus kPk2 /q ∼ Fq,n−p−1 . σ̂ 2 157 / 171 The Scheffé approach to simultaneous confidence intervals Let QF be the α quantile of the Fq,n−p−1 distribution. Then P p |θ0 β̂ − θ0 β| p ≤ σ̂ qQF for all θ θ0 (X 0 X )−1 θ ! = α, so p p P θ0 β̂ − σ̂ qQF Vθ ≤ θ0 β ≤ θ0 β̂ + σ̂ qQF Vθ for all θ = α defines a level α simultaneous confidence set for {θ0 β | θ ∈ T }, where Vθ = θ0 (X 0 X )−1 θ. 
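Here is a minimal sketch of this construction on simulated data (coefficients and sample size are made up): it computes Scheffé simultaneous intervals for contrasts θ′β with T = R^{p+1}, so q = p + 1, using scipy.stats.f for the F quantile (the 0.95 quantile for coverage probability α = 0.95, following the slides' convention). The explicit inverse of X′X is used only because this is a small illustration.

import numpy as np
from scipy.stats import f

rng = np.random.default_rng(4)

n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 0.5, -0.3])                 # made-up coefficients
y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)                  # for Vtheta; fine for a small example
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)

q = p + 1                                         # dim(T) when T = R^{p+1}
QF = f.ppf(0.95, q, n - p - 1)                    # F quantile for coverage 0.95

def scheffe_ci(theta):
    v_theta = theta @ XtX_inv @ theta             # Vtheta = theta'(X'X)^{-1}theta
    half = np.sqrt(q * QF * sigma2_hat * v_theta)
    est = theta @ beta_hat
    return est - half, est + half

print(scheffe_ci(np.array([0.0, 1.0, 0.0])))      # interval for beta_1
print(scheffe_ci(np.array([0.0, 1.0, -1.0])))     # interval for beta_1 - beta_2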
The Scheffé approach to simultaneous confidence intervals

The “multiplier” for the Scheffé simultaneous confidence interval is

σ̂ √(q QF Vθ),

where the F distribution has q, n − p − 1 degrees of freedom. For large n, we can approximate this with

σ̂ √(Qχ² Vθ),

where Qχ² is the α quantile of the χ² distribution with q degrees of freedom.

Instead of the usual factor of 2, we have √Qχ². Note that this approximately equals 2 when q = 1, and grows fairly slowly with q, i.e. it is 3.3 when q = 5 and 4.3 when q = 10.

159 / 171

Polynomial regression

The conventional linear model has the mean specification

E[y | x = x] = β0 + β1 x1 + · · · + βp xp.

It is possible to accommodate nonlinear relationships while still working with linear models. Polynomial regression is a traditional approach to doing this. If there is only one predictor variable, polynomial regression uses the mean structure

E[y | x = x] = β0 + β1 x + β2 x² + · · · + βp x^p.

Note that this is still a linear model, as it is linear in the coefficients β. Multiple regression techniques (e.g. OLS) can be used for estimation and inference.

160 / 171

Functional linear regression

A drawback of polynomial regression is that the polynomials can be highly collinear (e.g. cor(U, U²) ≈ 0.97 if U is uniform on [0, 1]). Also, polynomials have unbounded support, and it is often desirable to model E[y | x = x] as a linear combination of functions with local (bounded) support.

If there is a single covariate x, we can use a model of the form

E[y | x = x] = β1 φ1(x) + · · · + βp φp(x),

where the φj(·) are basis functions that are chosen based on what we think E[y | x = x] might look like. Splines, sinusoids, wavelets, and radial basis functions are possible choices. Since the mean function is linear in the unknown parameters βj, this is a linear model and can be estimated using multiple linear regression (OLS) techniques.

161 / 171

Confidence bands

Suppose we have a functional linear model of the form

E[y | x = x, t] = β′x + f(t),

and we model f(t) as

f(t) = Σ_{j=1}^{q} γj φj(t),

where the φj(·) are basis functions. A confidence band for f with coverage probability α is an expression of the form f̂(t) ± M(t) such that

P( f̂(t) − M(t) ≤ f(t) ≤ f̂(t) + M(t) for all t ) = α.

162 / 171

Confidence bands

We can use as a point estimate

f̂(t) ≡ Σ_{j=1}^{q} γ̂j φj(t).

For each fixed t, f(t) = Σ_j γj φj(t) is a linear combination of the γj, so if we simultaneously cover all linear combinations of the γj, we will have our confidence band. Thus the Scheffé procedure can be applied with T = Rq.

163 / 171

Interactions

A basic linear model for p observed covariates is

ŷi = β̂0 + β̂1 x1i + · · · + β̂p xpi.

In this setting, the terms β̂j xj, etc., are called main effects. This basic linear model has the following two properties:

1. The mean is linear in each covariate, when holding the values of the other covariates fixed.
2. The slope of y on each covariate is the same, regardless of the values of the other covariates.

The first property can be overcome using basis functions, as sketched below.

164 / 171
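To make the basis-function idea concrete, here is a minimal sketch using simulated data; the cubic polynomial and the “hinge” (truncated-line) basis with three knots are arbitrary illustrative choices. Both expansions are fit by ordinary least squares on an expanded design matrix, and the collinearity of raw polynomial terms noted above is checked numerically.

```python
import numpy as np

rng = np.random.default_rng(1)

# Single covariate with a nonlinear mean function (simulated data).
n = 300
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

# Polynomial basis: columns 1, x, x^2, x^3.  Still a linear model in the
# coefficients, so ordinary least squares applies directly.
X_poly = np.column_stack([x ** j for j in range(4)])
beta_poly, *_ = np.linalg.lstsq(X_poly, y, rcond=None)

# The raw polynomial columns are highly collinear, e.g. cor(x, x^2) is about
# 0.97 for a uniform covariate, as noted on the slides.
print(np.corrcoef(x, x ** 2)[0, 1])

# A local basis (simple hinge functions max(x - knot, 0) at a few knots) keeps
# the same OLS machinery but with better-conditioned, locally supported columns.
knots = np.array([0.25, 0.5, 0.75])
X_hinge = np.column_stack([np.ones(n), x] + [np.maximum(x - k, 0) for k in knots])
beta_hinge, *_ = np.linalg.lstsq(X_hinge, y, rcond=None)
```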
Interactions

The second property implies that if we plot E[y | x1 = x1, x2 = x2] against x1, for different fixed values of x2, we get a series of parallel lines. This is a very strong form of linearity that will not hold for most populations. Interactions allow more realistic models to be specified, in which this strong form of parallelism does not hold.

The most common way of specifying an interaction is by including a product of two covariates in the mean structure, e.g.

E[y | x1 = x1, x2 = x2] = β0 + β1 x1 + β2 x2 + β3 x1 x2.

165 / 171

Interactions

By rearranging terms,

E[y | x1 = x1, x2 = x2] = β0 + (β1 + β3 x2) x1 + β2 x2,

we can interpret this as saying that E[y | x1 = x1, x2 = x2] is linear in x1 for each fixed value of x2, but the slopes of these lines depend on x2 (and specifically, are equal to β1 + β3 x2). Note that we can also arrange terms as

E[y | x1 = x1, x2 = x2] = β0 + (β2 + β3 x1) x2 + β1 x1,

so that the mean function is linear in x2 for each fixed value of x1, with slopes depending on the value of x1.

166 / 171

Interactions

Suppose we have two quantitative covariates x1 and x2, and wish to model E[y | x1 = x1, x2 = x2] in the form f1(x1) + f2(x2), where f1 and f2 are smooth functions of one variable. This is called an additive model. We can model f1 and f2 using basis functions:

f1(x) = Σ_j βj φj(x),  f2(x) = Σ_j γj φj(x).

It is not necessary that f1 and f2 use the same basis functions, as above, but we do so here for simplicity.

This additive model exhibits parallelism, in that E[y | x1 = x1, x2 = x2], viewed as a function of x1 with x2 held constant, produces a series of parallel curves as we vary x2.

167 / 171

Interactions

Interactions can be used to break this parallelism. When we interact two covariates that are modeled using basis functions, we take all pairwise products between the basis functions for x1 and the basis functions for x2, i.e.

ŷi = Σ_{j1, j2} θ̂_{j1 j2} φ_{j1}(x1i) φ_{j2}(x2i),

where the coefficients (θ_{j1 j2}; j1 = 1, . . . , q1, j2 = 1, . . . , q2) are estimated jointly using OLS.

168 / 171

Categorical covariates

A covariate is categorical if its values are unordered labels, rather than quantitative values. Categorical variables are also called “nominal”, “qualitative”, or “factor” variables.

For example, suppose that subjects fall into five groups, e.g. based on where they live. These five groups might be labeled 1, 2, 3, 4, 5, but these numbers are only labels, and have no quantitative meaning. We cannot simply include a term β · x in the model for this covariate.

Instead, a categorical variable is represented through a set of “dummy variables” or “indicator variables”. There are many ways to do this, but the most common approach is to create an indicator variable for each level, say zji = I(xi = j), where x is the categorical variable. If x has q levels, then it is represented through q indicator variables, z1, . . . , zq.

169 / 171

Categorical covariates

Note that Σ_j zji = 1 for each i. Therefore the collection of indicator variables is perfectly collinear with the intercept. To work around this, it is common that one level of the categorical variable is selected as the reference level, and its indicator is omitted from the model. There are many other conventions for resolving this collinearity, but we will not discuss them here.

Note that the choice of the reference level does not change the model in a fundamental way. It does not change col(X); it only leads to a representation of col(X) in a different basis. Therefore, ŷ is not affected by the choice of a reference level.

170 / 171
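The indicator coding described above can be written out directly. The following sketch uses simulated data with a hypothetical five-level grouping variable: it builds the full set of indicators to show the collinearity with the intercept, then fits the reference-level parameterization by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical categorical covariate with q = 5 unordered levels, labeled 1..5.
n, q = 500, 5
group = rng.integers(1, q + 1, size=n)

# Full set of indicators z_j = I(x = j): the columns sum to 1 in every row,
# so together with an intercept the design would be perfectly collinear.
Z = np.column_stack([(group == j).astype(float) for j in range(1, q + 1)])
print(Z.sum(axis=1))   # all ones

# Reference-level coding: drop the indicator for level 1 and keep an intercept.
X = np.column_stack([np.ones(n)] + [Z[:, j] for j in range(1, q)])

# Simulated response with level-specific means; the fitted values do not
# depend on which level is chosen as the reference.
means = np.array([0.0, 1.0, 1.5, -0.5, 2.0])
y = means[group - 1] + rng.normal(size=n)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta_hat[0] estimates the reference-level mean; for j = 1,...,4, beta_hat[j]
# estimates the expected difference between level j+1 and the reference level.
```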
Categorical covariates

If we choose a particular reference level, say j∗, then the interpretation of the coefficient θk of the indicator zk corresponding to the kth level of the covariate, where k ≠ j∗, depends on the choice of j∗. Specifically, θk represents the expected difference between a case with x = k and a case with x = j∗, while holding all other covariates fixed.

The standard regression analysis output in statistical software is based on a reference level that is (usually) chosen arbitrarily. This seems to put extra emphasis on the contrasts relative to the reference level. But in fact, for a categorical variable with q levels, there are q(q − 1)/2 possible pairwise contrasts between levels, all of which are equally meaningful.

171 / 171
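Any of these q(q − 1)/2 pairwise contrasts can be tested with the Wald machinery from earlier in the notes. The sketch below continues the hypothetical five-level example (level 1 as reference) and forms the contrast of level 3 versus level 2, together with its standard error and t-statistic, directly from the least squares quantities.

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(3)

# Hypothetical example: q = 5 levels, reference-level coding with level 1
# omitted, so the design is X = [intercept, I(x=2), ..., I(x=5)].
n, q = 500, 5
group = rng.integers(1, q + 1, size=n)
X = np.column_stack([np.ones(n)] + [(group == j).astype(float) for j in range(2, q + 1)])
y = np.array([0.0, 1.0, 1.5, -0.5, 2.0])[group - 1] + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
dfe = n - X.shape[1]                    # equals n - p - 1 here
sigma2_hat = resid @ resid / dfe

# Contrast between levels 3 and 2 (neither is the reference level):
# theta' beta = beta_{level 3} - beta_{level 2}.
theta = np.zeros(X.shape[1])
theta[2], theta[1] = 1.0, -1.0
est = theta @ beta_hat
se = np.sqrt(sigma2_hat * (theta @ XtX_inv @ theta))
tstat = est / se
pvalue = 2 * t_dist.sf(abs(tstat), dfe)
print(est, se, tstat, pvalue)
```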