Regression analysis with dependent data
Kerby Shedden
Department of Statistics, University of Michigan
November 14, 2021

Dependent data

Ordinary least squares (OLS) is most appropriate for uncorrelated and homoscedastic data with a linear mean structure. This means that if we write the response values as y ∈ Rⁿ and place the regressors in a design matrix X ∈ Rⁿˣᵖ, then we have

E[y|X] = Xβ,  Cov[y|X] = σ²Iₙₓₙ.

Recall that we can achieve unbiased estimation (E[β̂] = β, E[ŷ] = E[y], E[σ̂²] = σ²) even when the variance model does not hold. However, inference (intervals and tests) requires the variance model to hold. Also, since the Gauss-Markov theorem requires the variance model to hold, β̂ may not be an efficient estimate of β if the variance model does not hold.

In this section we develop methods that allow us to efficiently estimate the model parameters, and conduct rigorous inference, when the variance model has a more general form. Throughout this section, we will continue to require the mean structure model E[y|X] = Xβ to hold.

Clustered data

Clustered data are sampled from a population that can be viewed as the union of a number of related subpopulations. Write the data as

yij ∈ R, xij ∈ Rᵖ,  i = 1, …, m,  j = 1, …, ni,

where i indexes the subpopulation and j indexes the individuals in the sample belonging to the ith subpopulation. There are m clusters (subpopulations), and there are ni observations in cluster i. If the ni are all the same, we have balanced clustering. Otherwise the clusters are unbalanced.

It may happen that units from the same subpopulation are more alike than units from different subpopulations, i.e. for i ≠ i′ and j ≠ j′,

cor(yij, yij′) > cor(yij, yi′j′).

Part of the within-cluster similarity may be explained by the covariates: units from the same subpopulation may have similar xij values, which leads to similar y values. In this case,

cor(yij, yij′) > cor(yij, yij′ | xij, xij′).

Even after accounting for measured covariates, units in a cluster may still resemble each other more than units in different clusters:

cor(yij, yij′ | xij, xij′) > cor(yij, yi′j′ | xij, xi′j′).

A simple working model to account for this dependence is

ŷij = θ̂i + β̂′xij.

The idea here is that β′xij explains the variation in y that is related to the measured covariates, and the θi explain variation in y that is related to the clustering.

This working model would be correct if there were q omitted variables zijℓ, ℓ = 1, …, q, that were constant for all units in the same cluster (i.e. zijℓ depends on ℓ and i, but not on j). In that case, θ̂i would stand in for the value of Σℓ ψ̂ℓ zijℓ that we would have obtained if the zijℓ were observed.

As an alternate notation, we can vectorize the data to express y ∈ Rⁿ and X ∈ Rⁿˣᵖ, where n = Σi ni, and then write

ŷ = Σi θ̂i Ii + Xβ̂,

where Ii ∈ {0, 1}ⁿ is the indicator of which subjects belong to cluster i. If the observed covariates in X are related to the clustering (i.e. if the columns of X and the Ii are not orthogonal), then OLS apportions the overlapping variance between β̂ and θ̂.
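As a concrete illustration, here is a minimal sketch that simulates clustered data and fits the cluster-indicator working model with statsmodels, alongside a fit that ignores the clusters. The simulation settings and variable names are illustrative choices, not taken from the notes.

```python
# Sketch: the working model y-hat_ij = theta-hat_i + beta-hat * x_ij on
# simulated clustered data (all settings here are hypothetical).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
m, ni = 30, 10                        # m clusters, ni observations each (balanced)
cluster = np.repeat(np.arange(m), ni)
theta = rng.normal(0, 1, size=m)      # cluster effects theta_i
x = rng.normal(0, 1, size=m * ni) + 0.5 * theta[cluster]  # x related to the clustering
y = theta[cluster] + 1.0 * x + rng.normal(0, 1, size=m * ni)

df = pd.DataFrame({"y": y, "x": x, "cluster": cluster})

# OLS with cluster indicators (fixed effects) vs. OLS ignoring the clusters.
fe = smf.ols("y ~ x + C(cluster)", data=df).fit()
pooled = smf.ols("y ~ x", data=df).fit()
print(fe.params["x"], pooled.params["x"])
```

Because x is related to the clustering in this simulation, the two slope estimates differ; if the columns of X were orthogonal to the cluster indicators, they would agree.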
Test score example

Suppose yij is a reading test score for the jth student in the ith classroom, out of a large number of classrooms that are considered, and suppose xij is the income of the student's family. We might postulate as a population model

yij = θi + βxij + εij,

which can be fit as ŷij = θ̂i + β̂xij.

Ideally we would want the parameter estimates to reflect sources of variation as follows:

- "Direct effects" of parent income, such as access to books, life experiences, and good health care, should go entirely to β̂.
- Attributes of classrooms that are not related to parent income, for example the effect of an exceptionally good or bad teacher, should go entirely to the θ̂i.
- Attributes of classrooms that are correlated with parent income, such as teacher salary, training, and resources, will be apportioned by OLS between the θ̂i and β̂.
- Unique and unpredictable events affecting particular individuals, such as the severe illness of the student or a family member, should go entirely to ε.

Other examples of clustered data

- Treatment outcomes for patients treated in various hospitals.
- Crime rates in police precincts distributed over a number of large cities (the precincts are the units and the cities are the clusters).
- Prices of stocks belonging to various business sectors.
- Surveys in which the data are collected following a cluster sampling approach.

What if we ignore the θi?

The θi are usually not of primary interest, but we should be concerned that by failing to take account of the clustering, we may incorrectly assess the relationship between y and X. If the θi are nonzero, but we fail to account for them in the model, the working model is misspecified.

Let X be the design matrix without intercept, and let Q = [I1 ⋯ Im] be the matrix of cluster indicators (which implicitly includes the intercept):

\[
y = \begin{pmatrix} y_{11}\\ y_{12}\\ y_{21}\\ y_{22}\\ y_{23}\\ \vdots \end{pmatrix},
\qquad
Q = \begin{pmatrix} 1 & 0 & \cdots\\ 1 & 0 & \cdots\\ 0 & 1 & \cdots\\ 0 & 1 & \cdots\\ 0 & 1 & \cdots\\ \vdots & \vdots & \end{pmatrix},
\qquad
X = \begin{pmatrix} x_{111} & x_{112} & \cdots\\ x_{121} & x_{122} & \cdots\\ x_{211} & x_{212} & \cdots\\ x_{221} & x_{222} & \cdots\\ x_{231} & x_{232} & \cdots\\ \vdots & \vdots & \end{pmatrix}.
\]

Let X̃ = [1n | X]. The estimate that results from regressing y on X̃ is β̂* = (β̂0, β̂′)′, with

E[β̂*] = (X̃′X̃)⁻¹X̃′E[y] = (X̃′X̃)⁻¹X̃′(Xβ + Qθ).

For the bias of β̂ for β to be zero, we need

E[β̂*] − β̃ = (X̃′X̃)⁻¹X̃′(Xβ + Qθ − X̃β̃) = 0,

where β0 = E[β̂0] and β̃ = (β0, β′)′. Thus we need

0 = X̃′(Xβ + Qθ − X̃β̃) = X̃′(Qθ − β0·1n).

Let S = Qθ be the vector of cluster effects. Since the first column of X̃ consists of 1's, we have that S̄ = β0. For any other covariate Xj, we have that Xj′S = β0·Xj′1n, which implies that Xj and S have zero sample covariance. This condition is unlikely to hold in many studies, where people tend to cluster (in schools, hospitals, etc.) with other people having similar covariate levels.

Fixed effects analysis

In a fixed effects approach, we model the θi as regression parameters, by including additional columns in the design matrix whose covariate levels are the cluster indicators, i.e. we regress y on [X Q].

As the sample size grows, in most applications the cluster sizes ni will remain bounded (e.g. a primary school classroom might have up to 30 students). Thus the number of clusters must grow, so the dimension of the parameter vector θ grows. This puts us in a setting where the model dimension (p) and the sample size (n) grow together. This is not typical in standard statistical analysis, and leads to the Neyman-Scott problem. Contemporary methods for "high dimensional" regression provide one way to work around this challenge.
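The bias expression above can be evaluated numerically. The sketch below (illustrative names and settings, not from the notes) constructs a design in which the cluster means of a covariate are correlated with the θi, and computes E[β̂*] directly from the formula:

```python
# Sketch: omitted cluster effects bias E[beta-hat*] = (X~'X~)^{-1} X~'(X beta + Q theta),
# evaluated for one fixed design and one fixed set of cluster effects.
import numpy as np

rng = np.random.default_rng(1)
m, ni = 20, 5
n = m * ni
cluster = np.repeat(np.arange(m), ni)
theta = rng.normal(0, 1, size=m)

# Covariate whose within-cluster means track theta (a stable-confounder pattern).
x = rng.normal(0, 1, size=n) + theta[cluster]
Q = (cluster[:, None] == np.arange(m)[None, :]).astype(float)  # cluster indicators
Xt = np.column_stack([np.ones(n), x])                          # X~ = [1 | x]
beta = np.array([0.0, 1.0])                                    # true intercept, slope

Ey = Xt @ beta + Q @ theta                    # E[y] when the theta_i are nonzero
Ebeta_star = np.linalg.solve(Xt.T @ Xt, Xt.T @ Ey)
print(Ebeta_star)   # the slope component differs from 1.0: beta-hat is biased
```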
Cluster effect examples

What does the inclusion of fixed effects do to the parameter estimates β̂, which are often of primary interest?

The following examples show scatterplots of an outcome y against a scalar covariate x, in a setting where there are three clusters (indicated by color). Above each plot are the coefficient estimates, Z-scores, and R² values for fits in which the cluster effects are either included (top line) or excluded (second line). These estimates are obtained from the following two working models:

ŷ = α̂ + β̂x + Σi θ̂i Ii,
ŷ = α̂s + β̂s x.

The Z-scores are β̂/SD(β̂) and the R² values are cor(ŷ, y)².

Left: y is independent of both the clusters and x; x and the clusters are independent; β̂ ≈ 0 with and without cluster terms in the model. Right: y relates to x, but not to the clusters; x and the clusters are independent; β̂, Z, and R² are similar with and without cluster effects in the model.

[Figure: two scatterplots. Left panel: β̂1 = −0.07, Z = −0.49, r² = 0.02; β̂1s = −0.06, Zs = −0.44, rs² = 0.00. Right panel: β̂1 = 1.04, Z = 7.75, r² = 0.57; β̂1s = 1.01, Zs = 7.81, rs² = 0.56.]

Left: y relates to the clusters, but not to x; x and the clusters are independent; β̂ ≈ 0 with and without cluster effects in the model. Right: y relates to the clusters but not to x; x and the clusters are dependent; when the cluster effects are not modeled, their effect is picked up by x.

[Figure: two scatterplots. Left panel: β̂1 = −0.03, Z = −0.85, r² = 0.94; β̂1s = 0.00, Zs = 0.02, rs² = 0.00. Right panel: β̂1 = 0.03, Z = 0.19, r² = 0.94; β̂1s = 0.92, Zs = 20.72, rs² = 0.90.]

Left: y relates to both the clusters and x; x and the clusters are dependent. Right: y relates to the clusters but not to x; x and the clusters are dependent, but the net cluster effect is not linearly related to x.

[Figure: two scatterplots. Left panel: β̂1 = 0.90, Z = 5.97, r² = 0.95; β̂1s = 0.96, Zs = 29.26, rs² = 0.95. Right panel: β̂1 = −0.04, Z = −0.31, r² = 0.84; β̂1s = 0.00, Zs = 0.02, rs² = 0.00.]

Left: y relates to the clusters and to x; x and the clusters are dependent; the x effect has the opposite sign as the net cluster effect. Right: y relates to the clusters and to x; x and the clusters are dependent; the signs and magnitudes of the x effects are cluster-specific.

[Figure: two scatterplots. Left panel: β̂1 = −1.93, Z = −12.68, r² = 0.99; β̂1s = 2.72, Zs = 16.81, rs² = 0.85. Right panel: β̂1 = −0.52, Z = −1.70, r² = 0.98; β̂1s = 2.92, Zs = 8.20, rs² = 0.58.]

Implications of fixed effects analysis for observational data

A stable confounder is a confounding factor that is approximately constant within clusters. A stable confounder will become part of the net cluster effect (θi). If a stable confounder is correlated with an observed covariate X, then this will create non-orthogonality between the cluster effects and the effects of the observed covariates.

Implications of fixed effects analysis for experimental data

Experiments are often carried out on "batches" of objects (specimens, parts, etc.) in which uncontrolled factors cause elements of the same batch to be more similar than elements of different batches. If treatments are assigned randomly within each batch, there are no stable confounders (in general there are no confounders in experiments). Therefore the overall OLS estimate of β is unbiased as long as the standard linear model assumptions hold.
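A small simulation illustrates this point (all settings here are hypothetical): treatment is randomized within each batch, so batch effects cancel out of the treatment contrast, and the pooled OLS slope is unbiased even though the batches are ignored.

```python
# Sketch: randomized treatment within batches; ignoring the batch effects
# still gives an unbiased estimate of the treatment effect.
import numpy as np

rng = np.random.default_rng(2)
m, ni = 200, 6
batch = np.repeat(np.arange(m), ni)
theta = rng.normal(0, 2, size=m)                      # batch effects
trt = np.tile([0, 0, 0, 1, 1, 1], m).astype(float)    # half of each batch treated
for i in range(m):                                    # randomize within each batch
    rng.shuffle(trt[i * ni:(i + 1) * ni])
y = theta[batch] + 1.0 * trt + rng.normal(0, 1, size=m * ni)

X = np.column_stack([np.ones(m * ni), trt])           # no batch indicators
bhat = np.linalg.lstsq(X, y, rcond=None)[0]
print(bhat[1])   # close to the true treatment effect of 1.0
```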
Suppose the design is balanced (e.g. exactly half of each batch is treated and half is untreated). This is an orthogonal design, so the estimate based on the working model ŷij = β̂xij and the estimate based on the working model ŷij = θ̂i + β̂*xij are identical (β̂ = β̂*). Thus they have the same variance. But the estimated variance of β̂ will be greater than the estimated variance of β̂* (since the corresponding estimate of σ² is greater), so the analysis that ignores the batches will have lower power and wider confidence intervals.

Random cluster effects

As we have seen, the cluster effects θi can be treated as unobserved constants, and estimated as we would estimate any other regression coefficient. An alternative way to handle cluster effects is to view the θi as unobserved (latent) random variables. In doing this, we now have two random variables in the model, θi and εij, which are taken to be independent of each other. If the θi are independent and identically distributed, we can combine them with the error terms to get a single random error term per observation:

εᶜij = θi + εij.

Let yi ≡ (yi1, …, yi,ni)′ denote the vector of responses in the ith cluster, let xi ∈ Rⁿⁱˣᵖ denote the matrix of predictor variables for the ith cluster, and let εᶜi ≡ (εᶜi1, …, εᶜi,ni)′ ∈ Rⁿⁱ denote the vector of random "errors" for the ith cluster. Thus we have the model

yi = xiβ + εᶜi,

for clusters i = 1, …, m. The εᶜi are taken to be uncorrelated between clusters, i.e. cov(εᶜi, εᶜi′) = 0 (an ni × ni′ zero matrix) for i ≠ i′.

The structure of the covariance matrix Si ≡ cov(εᶜi | xi) is

\[
S_i = \begin{pmatrix}
\sigma^2 + \sigma_\theta^2 & \sigma_\theta^2 & \cdots & \sigma_\theta^2 \\
\sigma_\theta^2 & \sigma^2 + \sigma_\theta^2 & \cdots & \sigma_\theta^2 \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_\theta^2 & \sigma_\theta^2 & \cdots & \sigma^2 + \sigma_\theta^2
\end{pmatrix}
= \sigma^2 I + \sigma_\theta^2 \, 11',
\]

where var(εij) = σ² and var(θi) = σθ².

Generalized Least Squares

Suppose we have a linear model with mean structure E[y|X] = Xβ for y ∈ Rⁿ, X ∈ Rⁿˣ⁽ᵖ⁺¹⁾, and β ∈ Rᵖ⁺¹, and variance structure Cov[y|X] ∝ Σ, where Σ is a given n × n matrix. We can write the model in generative form as y = Xβ + ε, where ε ∈ Rⁿ with E[ε|X] = 0 and Cov[ε|X] = Σ.

Factor the covariance matrix as Σ = GG′, and consider the transformed model

G⁻¹y = G⁻¹Xβ + G⁻¹ε.

Letting η ≡ G⁻¹ε, it follows that Cov(η) = Iₙₓₙ, and note that the population slope vector β of the transformed model is identical to the population slope vector of the original model.

The GLS estimator of β is defined to be the OLS estimator of β for the decorrelated response G⁻¹y and the decorrelated predictors G⁻¹X. The GLS estimate of the regression slope can be expressed in terms of the original design matrix X and response vector y:

β̂GLS = ((G⁻¹X)′(G⁻¹X))⁻¹(G⁻¹X)′G⁻¹y = (X′(GG′)⁻¹X)⁻¹X′(GG′)⁻¹y = (X′Σ⁻¹X)⁻¹X′Σ⁻¹y.

We can use GLS to analyze clustered data with random cluster effects. Let yi* ≡ Si^{−1/2} yi, εi* ≡ Si^{−1/2} εᶜi, and xi* ≡ Si^{−1/2} xi. Let y*, X*, and ε* denote the result of stacking yi*, xi*, and εi* over i, respectively.

Since cov(εi*) ∝ I, the OLS estimate of β for the model y* = X*β + ε* is the best estimate of β that is linear in y* (by the Gauss-Markov theorem). Since the set of linear estimates based on y* is the same as the set of linear estimates based on y, it follows that the GLS estimate of β based on y is the best unbiased estimate of β that is linear in y.
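A minimal sketch of GLS by explicit decorrelation, assuming Σ is known; the Cholesky factor plays the role of G, and the exchangeable covariance in the demonstration mirrors the random cluster effects model above. Settings and names are illustrative.

```python
# Sketch: GLS via decorrelation. Regress G^{-1} y on G^{-1} X where Sigma = G G',
# which equals (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y.
import numpy as np
from scipy import linalg

def gls(X, y, Sigma):
    """GLS estimate of beta for y = X beta + eps with Cov(eps) proportional to Sigma."""
    G = np.linalg.cholesky(Sigma)                     # lower-triangular factor, Sigma = G G'
    Xs = linalg.solve_triangular(G, X, lower=True)    # G^{-1} X
    ys = linalg.solve_triangular(G, y, lower=True)    # G^{-1} y
    return np.linalg.lstsq(Xs, ys, rcond=None)[0]

rng = np.random.default_rng(3)
m, ni, s2, s2_theta = 50, 4, 1.0, 0.5
Si = s2 * np.eye(ni) + s2_theta * np.ones((ni, ni))   # exchangeable within-cluster covariance
Sigma = linalg.block_diag(*([Si] * m))                # uncorrelated across clusters
n = m * ni
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.0]) + rng.multivariate_normal(np.zeros(n), Sigma)
print(gls(X, y, Sigma))                               # approximately (0.5, 1.0)
```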
When Σ = Cov[y|X], the covariance matrix of β̂ in GLS is

Cov[β̂|X] = (X*′X*)⁻¹ = (X′Σ⁻¹X)⁻¹.

Note that this is consistent with what we get in OLS, where Σ = σ²I and Cov[β̂|X] = σ²(X′X)⁻¹.

To apply GLS, it is only necessary to know Σ up to a multiplicative constant. The same estimated slopes β̂ are obtained if we decorrelate with Σ or with kΣ for k > 0. However, if Σ = k·Cov[y|X], then Cov[β̂|X] = k⁻¹(X′Σ⁻¹X)⁻¹, and at the moment we have no way to estimate k.

When Cov[y|X] ∝ Σ, the sampling covariance matrix for GLS is proportional to (X′Σ⁻¹X)⁻¹, and the parameter estimates are uncorrelated with each other if and only if X′Σ⁻¹X is diagonal, that is, if the columns of X are mutually orthogonal with respect to the inner product defined by Σ⁻¹:

⟨u, v⟩ ≡ u′Σ⁻¹v.

This is related to the fact that ŷ in GLS is the projection of y onto col(X) in the Mahalanobis metric defined by Σ⁻¹,

d(u, v) = (u − v)′Σ⁻¹(u − v).

This generalizes the fact that ŷ in OLS is the projection of y onto col(X) in the Euclidean metric. Neither of these facts depends on the proportionality constant relating Σ to Cov[y|X].

Generalized least squares with a "working" covariance

What if we perform GLS using a possibly misspecified "working covariance" S? For example, what happens if we use OLS when the actual covariance matrix of the errors is Σ ≠ I?

Since E[ε*|X*] = E[ε*|X] = 0, the estimate β̂ remains unbiased. However, it has two problems: it may not be the BLUE (i.e. it may not have the least variance among unbiased estimates), and the usual linear model inference procedures will be wrong.

The sampling covariance when the error structure is misspecified is given by the "sandwich expression"

Cov[β̂] = (X*′X*)⁻¹X*′Σ*X*(X*′X*)⁻¹,

where Σ* = Cov[ε*|X*]. This result covers two special situations: (i) we use OLS, so X* = X and Σ* = Σ is the covariance matrix of the errors, and (ii) we use a "working covariance" to define the decorrelating matrix G, and this working covariance differs from the actual error covariance, so that Σ* ≠ I.

Another way to write the covariance of β̂ is as follows:

Cov[β̂] = (X′Σw⁻¹X)⁻¹ X′Σw⁻¹ Σ Σw⁻¹ X (X′Σw⁻¹X)⁻¹,

where Σw is the "working" (possibly incorrectly specified) covariance matrix for ε. From this expression, it is clear that if Σw = Σ (the working model is correct), then

Cov[β̂] = (X′Σw⁻¹X)⁻¹ = (X′Σ⁻¹X)⁻¹.

Iterated GLS

In practice, we will generally not know Σ, so it must be estimated from the data. Σ is the covariance matrix of the errors, so we can estimate Σ from the residuals. Typically a low-dimensional parametric model Σ = Σ(α) is used. Given a model for the unexplained variation (i.e. for Σ), the fitting algorithm alternates between estimating α and estimating the regression slopes β.

For example, suppose our working covariance has the form

Si(j, j) = ν,  Si(j, k) = rν (j ≠ k).

This is an exchangeable model with "intraclass correlation coefficient" (ICC) r between two observations in the same cluster. There are several closely related ICC estimates. One approach is based on the standardized residuals Rˢij:

\[
\hat r = \sum_i \sum_{j < j'} R^s_{ij} R^s_{ij'} \Big/ \Big( \sum_i n_i (n_i - 1)/2 - p - 1 \Big),
\]

where ni is the size of the ith cluster.
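The sandwich expression is straightforward to compute directly. The sketch below (illustrative settings, not from the notes) uses an identity working covariance (i.e. OLS) when the true covariance is exchangeable within clusters, showing how the naive OLS variance understates the truth for a cluster-constant covariate.

```python
# Sketch: sandwich covariance (X'Sw^-1 X)^-1 X'Sw^-1 Sigma Sw^-1 X (X'Sw^-1 X)^-1
# under a possibly misspecified working covariance Sw.
import numpy as np
from scipy.linalg import block_diag

def sandwich_cov(X, Sigma_w, Sigma):
    SwX = np.linalg.solve(Sigma_w, X)      # Sw^{-1} X (Sw symmetric)
    bread = np.linalg.inv(X.T @ SwX)       # (X' Sw^{-1} X)^{-1}
    meat = SwX.T @ Sigma @ SwX             # X' Sw^{-1} Sigma Sw^{-1} X
    return bread @ meat @ bread

rng = np.random.default_rng(4)
m, ni = 40, 5
n = m * ni
Si = np.eye(ni) + 0.5 * np.ones((ni, ni))  # true exchangeable cluster covariance
Sigma = block_diag(*([Si] * m))
x = np.repeat(rng.normal(size=m), ni)      # a cluster-constant covariate
X = np.column_stack([np.ones(n), x])

naive = 1.5 * np.linalg.inv(X.T @ X)       # OLS formula with Var(y_ij) = 1.5, no clustering
robust = sandwich_cov(X, np.eye(n), Sigma) # working covariance = I (OLS)
print(naive[1, 1], robust[1, 1])           # the naive variance is too small here
```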
Another common model for the error covariance is the first-order autoregressive (AR-1) model:

Σij = α^|i−j|,

where |α| < 1 is a parameter. There are several possible estimates of α, borrowing ideas from time series analysis. We will not provide more details here.

Generalized Least Squares and stable confounders

For GLS to give unbiased estimates of β, we must have E[εᶜij|X] = E[θi + εij|X] = 0. Since E[εij|X] = 0, this is equivalent to requiring that E[θi|X] = 0. Thus if the covariates have distinct within-cluster mean values, and the within-cluster mean values of the covariates are correlated with the θi, then the GLS estimate of β will be biased.

Likelihood inference for the random intercepts model

The random intercepts model

yij = θi + β′xij + εij

can be analyzed using a likelihood approach, using:

- a random intercepts density φ(θ|X), and
- a density for the data given the random intercept, f(y|x, θ).

Multilevel models and conditional independence

These models are hierarchical (multilevel), and encode conditional independence relationships. At the base level of the hierarchy, the random effects θi are independent and identically distributed, and in particular are unrelated to X:

p(θ1, …, θm | X) = p(θ1, …, θm) = ∏ᵢ₌₁ᵐ φ(θi).

Conditionally on the random effects, the observed data {yij} are independent, and follow distributions that depend on the (unobserved) random effects and the observed covariates:

p({yij} | {θi}, X) = ∏ij f(yij | θi, X).

Since the random effects are not observed, we must estimate the parameters in terms of the marginal model for the observed data. This is a model that depends on the structural parameters, which are (β, σθ², σ²) in this case. The marginal density is obtained by integrating out the random effects:

p(yi1, …, yi,ni | X) = ∫ f(yi1, …, yi,ni | X, θi) φ(θi) dθi.

For linear multilevel models, the marginal distribution p(y|X) can be calculated explicitly. It is Gaussian, and therefore is characterized by its moments:

E[yij | xij] = β′xij,
Var[yij | xij] = σθ² + σ²,
Cov[yi1j1, yi2j2 | xi1j1, xi2j2] = 0 (if i1 ≠ i2),
Cov[yij1, yij2 | xij1, xij2] = σθ² (if j1 ≠ j2).

Marginal form of the generative model

Suppose

ε|X ~ N(0, σ²),  θ|X ~ N(0, σθ²).

In this case y|X is Gaussian, with mean and variance as given above. Thus the random intercept model can be equivalently written in marginal form as

yij = β′xij + εᶜij,

where εᶜij = θi + εij. It follows that E[εᶜij|X] = 0, Var[εᶜij|X] = σθ² + σ², and the εᶜij values have correlation coefficient σθ²/(σ² + σθ²) within clusters.

Likelihood computation

Maximum likelihood estimates for the model can be calculated using a gradient-based optimization procedure applied to the marginal log-density, or using the EM algorithm. Asymptotic standard errors can be obtained from the inverse of the Fisher information matrix. Likelihood ratio tests, AIC values, and other likelihood-based inference tools can be used.

For linear multilevel models, the parameter estimates from iterated GLS and the MLE for the random intercepts model will be similar but not identical. Both are consistent, and in general they will be asymptotically equivalent.
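In practice, linear multilevel models of this kind can be fit by (restricted) maximum likelihood, for example with statsmodels' MixedLM. The sketch below simulates from the random intercepts model and recovers the structural parameters; the variable names and settings are illustrative.

```python
# Sketch: fitting a random intercepts model by REML with statsmodels.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
m, ni = 100, 8
cluster = np.repeat(np.arange(m), ni)
theta = rng.normal(0, np.sqrt(2.0), size=m)    # sigma_theta^2 = 2
x = rng.normal(size=m * ni)
y = 1.0 + 0.5 * x + theta[cluster] + rng.normal(size=m * ni)  # sigma^2 = 1

df = pd.DataFrame({"y": y, "x": x, "cluster": cluster})
result = smf.mixedlm("y ~ x", data=df, groups=df["cluster"]).fit()
print(result.params)    # fixed effects beta-hat
print(result.cov_re)    # estimated sigma_theta^2 (random intercept variance)
print(result.scale)     # estimated sigma^2
```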
Predicting the random intercepts

Since the model is fit by optimizing the marginal log-likelihood, we obtain an estimate of σθ² ≡ Var[θi], but we don't automatically learn anything about the individual θi. If there is an interest in the individual θi values, we can predict them using the best linear unbiased predictor (BLUP). The population version of the BLUP is

E_{β,σ²,σθ²}[θi | yi, X] = Cov[θi, yi | X] · Cov[yi | X]⁻¹ · (yi − E[yi | X]).

Since this depends on things we don't know, in practice we use the sample version of the BLUP (sometimes called the eBLUP):

E_{β̂,σ̂²,σ̂θ²}[θi | yi, X] = Ĉov[θi, yi | X] · Ĉov[yi | X]⁻¹ · (yi − Ê[yi | X]).

The BLUP is truly a linear function of y, is unbiased, and is "best" in the sense of minimizing the expected squared prediction error. However, the eBLUP is none of these things (it is not linear or unbiased, and it is unclear if it is "best").

Note also that due to the hierarchical structure of the model, to predict θi we only need to condition on yi (the other yi′, for i′ ≠ i, contain no information). Thus the BLUP for θi only depends on the data through yi. But in the eBLUP, all the data are used indirectly, through the estimates of the structural parameters.

The estimated second moments needed to calculate the BLUP are

Ĉov[θi, yi | X] = σ̂θ² · 1′  and  Ĉov[yi | X] = σ̂²I + σ̂θ²·11′,

where 1 is the vector of ni 1's and ni is the size of the ith group.

For a given set of parameter values, the BLUP for the random intercepts model is a linear function of the data, with the following form:

E_{β̂,σ̂²,σ̂θ²}[θi | y, X] = (ni σ̂θ² / (σ̂² + ni σ̂θ²)) · 1′(yi − Ê[yi | X]) / ni.

In the BLUP for θi, the term

1′(yi − Ê[yi | X]) / ni

is the mean of the residuals in cluster i, and it is shrunk by the factor

ni σ̂θ² / (σ̂² + ni σ̂θ²).

This shrinkage allows us to interpret the random intercepts model as a "partially pooled" model that is intermediate between the fixed effects model and the model that completely ignores the clusters.

In the fixed effects model, the parameter estimates θ̂i are unbiased, but they are "overdispersed", meaning that the sample variance of the θ̂i will generally be greater than σθ². In the random intercepts model, the variance parameter σθ² is estimated with low bias, and the BLUPs of the θi are shrunk toward zero.

Random slopes

Suppose we are interested in a measured covariate z whose effect may vary by cluster. We might start with the model

yi = xiβ + γzi + εi.

For example, suppose that zi ∈ {0, 1}ⁿⁱ is a vector of treatment indicators (1 for treated subjects, 0 for untreated subjects). Then γ is the average change associated with being treated (the population treatment effect). In many cases, it is reasonable to consider the possibility that different clusters may have different treatment effects, that is, different clusters have different γ values. In this case we can let γi be the treatment response for cluster i.

Suppose we model the random slopes γi as being Gaussian (given X) with variance σγ². The marginal model is

E[yi | xi, zi] = xiβ  and  Cov[yi | xi, zi] = σγ² zizi′ + σ²I.

Again we can form a BLUP, E_{β̂,σ̂²,σ̂γ²}[γi | y, X, z], and this BLUP turns out to be a shrunken version of what would be obtained in a fixed effects model, where we regress y on x and z within every cluster.
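To make the shrinkage concrete, here is a minimal sketch of the random-intercept eBLUP computed directly from the formula above, given marginal residuals and estimates of σ² and σθ². The function name and interface are my own, not from the notes.

```python
# Sketch: eBLUP of the random intercepts as shrunken cluster mean residuals,
# theta_i ~ n_i * s2_theta / (s2 + n_i * s2_theta) * (mean residual in cluster i).
import numpy as np

def eblup_intercepts(resid, cluster, s2, s2_theta):
    """Shrinkage predictor of each cluster intercept from marginal residuals."""
    out = {}
    for i in np.unique(cluster):
        r = resid[cluster == i]
        ni = len(r)
        shrink = ni * s2_theta / (s2 + ni * s2_theta)  # shrinkage factor in [0, 1)
        out[i] = shrink * r.mean()
    return out
```

Fitted mixed models expose comparable predictions directly, e.g. the random_effects attribute of a statsmodels MixedLM result. Note how the shrinkage factor approaches 1 as ni grows, recovering the fixed effects estimate, and approaches 0 for small, noisy clusters.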
Linear mixed effects models

The random intercept and random slope models are special cases of the linear mixed effects model:

yi = Xiβ + Ziγi + εi.

Here, Xi is an ni × p design matrix for cluster i, Zi is an ni × q random effects design matrix for cluster i, γi ∈ Rᑫ is a random vector with mean 0 and covariance matrix Ψ, and εi ∈ Rⁿⁱ is a random vector with mean 0 and covariance matrix σ²I.
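A random intercept plus random slope model, with Zi = [1 | zi], can be fit with MixedLM via its re_formula argument. The sketch below (illustrative names and settings, not from the notes) recovers both the population treatment effect and the covariance Ψ of the cluster-specific effects.

```python
# Sketch: a linear mixed effects model y_i = X_i beta + Z_i gamma_i + eps_i
# with a random intercept and a random treatment slope per cluster.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
m, ni = 60, 10
cluster = np.repeat(np.arange(m), ni)
z = np.tile(np.r_[np.zeros(ni // 2), np.ones(ni // 2)], m)  # within-cluster treatment
gamma0 = rng.normal(0, 1.0, size=m)     # random intercepts
gamma1 = rng.normal(0, 0.7, size=m)     # random (cluster-specific) treatment effects
y = 0.5 + (1.0 + gamma1[cluster]) * z + gamma0[cluster] + rng.normal(size=m * ni)

df = pd.DataFrame({"y": y, "z": z, "cluster": cluster})
fit = smf.mixedlm("y ~ z", data=df, groups=df["cluster"], re_formula="~z").fit()
print(fit.params)   # fixed effects: intercept and average treatment effect
print(fit.cov_re)   # estimated Psi: covariance of (gamma_0i, gamma_1i)
```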