Generalized Estimating Equations
Kerby Shedden
Department of Statistics, University of Michigan
December 6, 2021

Score equations

Suppose we have multivariate Gaussian data with mean structure $E[y|X] = X\beta$ and covariance structure $\Sigma \in \mathbb{R}^{n \times n}$. Using generalized least squares, we estimate $\beta$ by minimizing

$(y - X\beta)^T \Sigma^{-1} (y - X\beta).$

The minimizer solves the following system of equations for $\beta$:

$X^T \Sigma^{-1} (y - X\beta) = 0.$

These are the score equations $\ell'(\beta; y, X) = 0$ for the Gaussian log likelihood. They also imply orthogonality between the residuals and $\mathrm{col}(X)$ in the metric of $\Sigma^{-1}$.

Score equations for clustered data

Now suppose the sample can be partitioned into $m$ clusters of sizes $n_1, \ldots, n_m$ such that there is no dependence between clusters. Let $y_i \in \mathbb{R}^{n_i}$ and $X_i \in \mathbb{R}^{n_i \times (p+1)}$ (the columns including an intercept) denote the components of $y$ and $X$ containing the data for the $i^{th}$ cluster, and let $\Sigma_i = \mathrm{Cov}[y_i|X_i]$. Then

$(y - X\beta)^T \Sigma^{-1} (y - X\beta) = \sum_i (y_i - X_i\beta)^T \Sigma_i^{-1} (y_i - X_i\beta),$

and the score equations can be written

$\sum_i X_i^T \Sigma_i^{-1} (y_i - X_i\beta) = 0_{p+1}.$

Can we obtain score equations of this form for dependent data in a GLM framework?

Score equations for non-Gaussian data

Recall that the GLM score function in terms of the canonical parameter $\theta$ is $y - \gamma'(\theta) \in \mathbb{R}^n$. Write $\eta_i = X_i\beta \in \mathbb{R}^{n_i}$ for the linear predictor in cluster $i$, and $\theta_i \in \mathbb{R}^{n_i}$ for the canonical parameter in cluster $i$. Then use the chain rule to differentiate with respect to $\beta$:

$\sum_i \frac{\partial \theta_i}{\partial \beta} (y_i - \gamma'(\theta_i)) = 0_{p+1},$

which in turn implies that

$\sum_i \frac{\partial \eta_i}{\partial \beta} \frac{\partial \theta_i}{\partial \eta_i} (y_i - \gamma'(\theta_i)) = \sum_i X_i^T \frac{\partial \theta_i}{\partial \eta_i} (y_i - \gamma'(\theta_i)) = 0_{p+1}.$

Using GLM notation, $\theta_i = g(\eta_i)$, so $\partial\theta_i/\partial\eta_i = g'(\eta_i)$. Since the mean satisfies $\mu_i = \gamma'(g(\eta_i))$, and since $\mathrm{Var}[y_i|X_i] = \phi\,\gamma''(\theta_i)$, it follows that

$\frac{\partial \mu_i}{\partial \beta} = \frac{\partial \eta_i}{\partial \beta} \cdot \gamma''(g(\eta_i))\, g'(\eta_i) = X_i^T V_i(\eta_i)\, g'(\eta_i)/\phi,$

where $V_i = \mathrm{Var}[y_i|X_i] \in \mathbb{R}^{n_i \times n_i}$ is the variance function. Putting everything together (and dropping the factor $1/\phi$, which does not affect the root) yields

$\sum_i \frac{\partial \mu_i}{\partial \beta} V_i^{-1} (y_i - \mu_i) = 0_{p+1}.$

These score equations are a system of $p+1$ equations in $p+1$ variables, so if they are not degenerate, they should uniquely determine $\beta$. In the Gaussian case, $\partial\mu_i/\partial\beta = X_i^T$ and $V_i \propto I$, so the score equations are equivalent to requiring that the residuals be orthogonal to every column of $X$. These score equations are a generalization of this condition to non-linear models.

Score equations for dependent non-Gaussian data

Suppose the data are dependent within clusters, and we block the data so that $y_i \in \mathbb{R}^{n_i}$ is the response data for cluster $i$. By analogy with GLMs, we can propose the following score equations:

$\sum_i D_i^T \Sigma_i^{-1} (y_i - \mu_i) = 0,$

where $\mu_i = E[y_i|X_i]$ and $D_i = \partial E[y_i|X_i]/\partial\beta^T$, so that $D_i^T = X_i^T\, \partial\mu_i/\partial\eta_i$, with $\partial\mu_i/\partial\eta_i$ a diagonal matrix. The variance matrix can be modeled as

$\Sigma_i = V_i^{1/2} R_i(\alpha) V_i^{1/2},$

where $V_i$ is the diagonal matrix determined by a GLM variance function, and $R_i(\alpha)$ is a correlation model determined by a separate parameter $\alpha$.

We can easily calculate $D_i$ for any of the familiar generalized linear model link functions. For example, in the logistic model $\mu_i = 1/(1 + \exp(-X_i\beta))$,

$\frac{\partial \mu_i(j)}{\partial \beta^T} = \frac{\exp(X_i(j,:)\beta)}{(1 + \exp(X_i(j,:)\beta))^2}\, X_i(j,:) = \mu_i(j)(1 - \mu_i(j))\, X_i(j,:).$
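As a quick illustration, the following sketch (Python/NumPy; the function name, data, and working covariance are illustrative, not from the slides) computes $\mu_i$ and $D_i$ for one cluster under the logistic link and evaluates that cluster's contribution $D_i^T \Sigma_i^{-1} (y_i - \mu_i)$ to the estimating equations.

```python
import numpy as np

def logistic_cluster_score(X_i, y_i, beta, Sigma_i):
    """One cluster's contribution D_i' Sigma_i^{-1} (y_i - mu_i) to the
    GEE estimating equations, under a logistic mean structure."""
    mu_i = 1.0 / (1.0 + np.exp(-(X_i @ beta)))      # logistic mean
    D_i = (mu_i * (1.0 - mu_i))[:, None] * X_i      # row j is mu_ij (1 - mu_ij) X_i[j, :]
    return D_i.T @ np.linalg.solve(Sigma_i, y_i - mu_i)

# Illustrative data: one cluster of size 4, intercept plus one covariate.
rng = np.random.default_rng(0)
X_i = np.column_stack([np.ones(4), rng.normal(size=4)])
beta = np.array([0.5, -1.0])
y_i = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X_i @ beta)))).astype(float)
# An exchangeable-type working covariance, just for demonstration.
Sigma_i = 0.25 * (0.7 * np.eye(4) + 0.3 * np.ones((4, 4)))
print(logistic_cluster_score(X_i, y_i, beta, Sigma_i))
```

Summing these contributions over clusters and setting the total to zero gives the full system that is solved for $\beta$.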
Mahalanobis distance minimization

Now we will essentially start over and show that we can obtain the same estimating equations (for dependent data and nonlinear models) using a different approach. Suppose we model the mean structure as $E[y_i|X_i] = \mu_i = \mu_i(\beta)$. The Mahalanobis distance between the data and the fitted means is

$\sum_i (y_i - \mu_i)^T \Sigma_i^{-1} (y_i - \mu_i).$

The gradient of this function is

$-2 \sum_i \frac{\partial \mu_i}{\partial \beta} \Sigma_i^{-1} (y_i - \mu_i),$

and setting this gradient to zero recovers the estimating equations above.

Solving the score equations

Note that in general $D_i$ depends on $\beta$, and $\Sigma_i$ may also depend on $\beta$. To solve the score equations, linearize the mean structure:

$\mu_i \approx \mu_i^{(0)} + D_i^{(0)} (\beta - \beta^{(0)}).$

Using Gauss-Seidel iterations, we substitute the linearized mean structure into the score equations, updating $\hat\beta$ for the current iteration using $D_i$, $\Sigma_i$, and $\mu_i$ calculated from the value of $\hat\beta$ at the previous iteration:

$\hat\beta^{(k+1)} = \hat\beta^{(k)} + \Big(\sum_i D_i^T \Sigma_i^{-1} D_i\Big)^{-1} \sum_i D_i^T \Sigma_i^{-1} (y_i - \mu_i).$

Working covariance structures

If the $\Sigma_i$ are misspecified, $\hat\beta$ is still $\sqrt{n}$-consistent (under regularity conditions similar to those needed when $\Sigma_i$ is known). We can therefore choose a working correlation $R_i(\alpha)$ when fitting the model. We then set $\Sigma_i = V_i^{1/2} R_i(\alpha) V_i^{1/2}$, where $V_i$ is the diagonal matrix containing the variance values.

Using a "sandwich covariance estimate" gives correct inferences even when the working correlation structure is wrong. First, define

$A \equiv \sum_i D_i^T \Sigma_i^{-1} D_i, \qquad B \equiv \sum_i D_i^T \Sigma_i^{-1} r_i r_i^T \Sigma_i^{-1} D_i,$

where $r_i = y_i - \hat\mu_i$ are the residuals for cluster $i$. Note that when the working covariance structure is correct, $E[r_i r_i^T] = \Sigma_i$, and $B$ becomes

$B = \sum_i D_i^T \Sigma_i^{-1} D_i.$

The following covariance approximation holds even when the working correlation matrices $R_i$ are misspecified:

$\mathrm{Cov}[\hat\beta] \approx A^{-1} B A^{-1}.$

Working covariance structures (why does it work?)

We'll focus on the linear case, where $D_i = X_i$, so that $A = \sum_i D_i^T \Sigma_i^{-1} D_i = \sum_i X_i^T \Sigma_i^{-1} X_i$ and

$\hat\beta = \Big(\sum_i X_i^T \Sigma_i^{-1} X_i\Big)^{-1} \sum_i X_i^T \Sigma_i^{-1} y_i = A^{-1} \sum_i X_i^T \Sigma_i^{-1} y_i.$

Then

$\mathrm{Cov}[\hat\beta] = A^{-1} \Big(\sum_i X_i^T \Sigma_i^{-1}\, \mathrm{Cov}[y_i|X_i]\, \Sigma_i^{-1} X_i\Big) A^{-1}.$

Since $E[r_i r_i^T] \approx \mathrm{Cov}[y_i|X_i]$, the "sandwich expression" $A^{-1} B A^{-1}$ is a plug-in estimate of the covariance matrix given above.

Working correlation structures

The working covariance is usually defined in terms of a working correlation structure. Examples include:

- Independence: $R_i = I$
- Exchangeable: $R_i(j, j) = 1$, $R_i(j, j') = r$ if $j \neq j'$
- Autoregressive: $R_i(j, j') = r^{|j - j'|}$
- Many more...

We then have as the working covariance $\Sigma_i = V_i^{1/2} R_i(\alpha) V_i^{1/2}$, where $V_i = \mathrm{diag}(\mathrm{Var}[y_i|X_i])$.

These variances are often specified in terms of the mean via a mean/variance relationship, e.g. for a Poisson model we have $\mathrm{Var}[y_i|X_i] = E[y_i|X_i]$. To implement this, we need to devise a way to update $R_i$ based on the residuals $r_i = y_i - \hat\mu_i$. Usually these updates are based on the method of moments. These updates alternate with the Gauss-Seidel iterations until convergence; a code sketch of the full loop appears after the applications below.

Some applications

- Clustered data: Cases are families, classrooms, coworkers, etc., possibly multiply nested; may use an exchangeable working correlation.
- Longitudinal data: Mean structure + serial dependence structure (e.g. autoregressive, stationary autocovariance).
- Spatial data: May use any spatial covariance function; may have only one cluster.
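To make the alternating updates concrete, here is a minimal from-scratch sketch, assuming a logistic marginal mean, an exchangeable working correlation, and a simple moment update for the correlation parameter (the scale parameter and the usual degrees-of-freedom corrections are omitted). All names and the convergence rule are illustrative choices, not a production implementation.

```python
import numpy as np

def expit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def fit_gee_logistic(clusters, n_iter=50, tol=1e-8):
    """Sketch of GEE for a binary response with a logistic marginal mean
    and an exchangeable working correlation.
    `clusters` is a list of (X_i, y_i) pairs, one per cluster."""
    p = clusters[0][0].shape[1]
    beta = np.zeros(p)
    r = 0.0  # exchangeable working correlation parameter (alpha)

    for _ in range(n_iter):
        # Gauss-Seidel step for beta with the working correlation held fixed:
        # beta <- beta + (sum_i D_i' S_i^{-1} D_i)^{-1} sum_i D_i' S_i^{-1} (y_i - mu_i)
        lhs = np.zeros((p, p))
        rhs = np.zeros(p)
        for X_i, y_i in clusters:
            n_i = len(y_i)
            mu_i = expit(X_i @ beta)
            v_i = mu_i * (1.0 - mu_i)              # binomial variance function
            D_i = v_i[:, None] * X_i               # d mu_i / d beta'
            R_i = (1.0 - r) * np.eye(n_i) + r * np.ones((n_i, n_i))
            sd = np.sqrt(v_i)
            Sigma_i = sd[:, None] * R_i * sd[None, :]   # V^{1/2} R(alpha) V^{1/2}
            W = np.linalg.solve(Sigma_i, np.column_stack([D_i, y_i - mu_i]))
            lhs += D_i.T @ W[:, :p]
            rhs += D_i.T @ W[:, p]
        step = np.linalg.solve(lhs, rhs)
        beta = beta + step

        # Method-of-moments update of r from standardized residuals.
        num = den = 0.0
        for X_i, y_i in clusters:
            mu_i = expit(X_i @ beta)
            z = (y_i - mu_i) / np.sqrt(mu_i * (1.0 - mu_i))
            num += np.sum(z) ** 2 - np.sum(z ** 2)   # sum of z_j z_k over pairs j != k
            den += len(y_i) * (len(y_i) - 1)
        r = num / den

        if np.linalg.norm(step) < tol:
            break
    return beta, r

def sandwich_cov(clusters, beta, r):
    """Robust covariance A^{-1} B A^{-1} evaluated at the fitted values."""
    p = len(beta)
    A = np.zeros((p, p))
    B = np.zeros((p, p))
    for X_i, y_i in clusters:
        n_i = len(y_i)
        mu_i = expit(X_i @ beta)
        v_i = mu_i * (1.0 - mu_i)
        D_i = v_i[:, None] * X_i
        sd = np.sqrt(v_i)
        R_i = (1.0 - r) * np.eye(n_i) + r * np.ones((n_i, n_i))
        Sigma_i = sd[:, None] * R_i * sd[None, :]
        G = np.linalg.solve(Sigma_i, D_i)       # Sigma_i^{-1} D_i
        u = G.T @ (y_i - mu_i)                  # D_i' Sigma_i^{-1} r_i
        A += D_i.T @ G
        B += np.outer(u, u)
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv
```

The square roots of the diagonal of $A^{-1} B A^{-1}$ give the robust standard errors for $\hat\beta$.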
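In practice such loops are rarely hand-coded. For example, statsmodels provides a GEE implementation; the sketch below shows one way to call it on simulated clustered binary data (the data-generating values are made up, and the shared cluster effect `u` is there only to induce within-cluster dependence).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated clustered binary data: 100 clusters of size 5.
rng = np.random.default_rng(1)
m, n_i = 100, 5
df = pd.DataFrame({
    "cluster": np.repeat(np.arange(m), n_i),
    "x": rng.normal(size=m * n_i),
})
u = np.repeat(rng.normal(scale=1.0, size=m), n_i)  # shared cluster effect
df["y"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 - df["x"] + u))))

model = sm.GEE.from_formula(
    "y ~ x",
    groups="cluster",
    data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())  # reports robust (sandwich) standard errors
```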
Limitations/criticisms of GEE

- Estimates the marginal mean structure parameters
- Does not follow the likelihood principle
- Details are not very intuitive
- May not be fully efficient
- No formal inference for variance parameters (the seldom-used GEE2 allows this)
- May be inconsistent in some longitudinal settings with time-dependent covariates, or in certain settings with missing data
- Does not propagate uncertainty in the dependence parameters to uncertainty in the regression coefficient estimates
- Some familiar likelihood-based tools like the LRT and AIC are not available (but modified versions like QIC exist)
- Iterations sometimes may not converge, or multiple solutions to the score equations may exist; it is not clear whether convergence is necessary, or how many iterations should be performed
- It is difficult to simulate data that exactly match the model

Positive aspects of GEE analysis

- Focuses on the mean structure, where the main interest usually lies
- "Robust" to misspecification of things other than the mean structure
- Builds on familiar GLM analysis; results are consistent with studies that have an independent-subjects design
- Can be quite fast to fit (depending on the details of the covariance structure update)

Comparison to multilevel (random effects) models

The generalized linear mixed effects (GLMM) model for single-level clustered data is

$\gamma_1, \ldots, \gamma_m \stackrel{iid}{\sim} N(0, \Phi),$

where $\gamma_i \in \mathbb{R}^q$ and $\Phi \in \mathbb{R}^{q \times q}$. The observed responses $y_{ij}|\gamma_i$ are independent over $i$ and $j$, and follow a GLM with linear predictor $x_{ij}\beta + z_{ij}\gamma_i$, where $x_{ij} \in \mathbb{R}^p$ and $z_{ij} \in \mathbb{R}^q$.

In the case of a linear model, we can average over $\gamma$ to obtain the marginal model: $E[y_{ij}|\gamma_i] = x_{ij}\beta + z_{ij}\gamma_i$, and $E[y_{ij}] = x_{ij}\beta$. Thus the coefficients for $x$ in the multilevel model are the same as the coefficients for $x$ in the marginal model. In the linear case, the multilevel and marginal models are both linear and Gaussian, and $\beta$ has the same meaning in both representations.

For any nonlinear GLM, the marginal model is not a GLM. For example, the marginal form for a binomial mixed GLM with logit link is

$\int \prod_j \frac{\exp(y_{ij}\,(x_{ij}\beta + z_{ij}\gamma_i))}{1 + \exp(x_{ij}\beta + z_{ij}\gamma_i)}\, \phi(\gamma_i)\, d\gamma_i,$

where $\phi$ here denotes the $N(0, \Phi)$ density. This integral cannot be solved analytically, and it does not equal the likelihood of a binomial GLM.

In a GEE analysis, the marginal distributions are GLMs and the joint distribution is not specified. In a multilevel model, the conditional distributions factor as independent GLMs; the marginalization is not tractable, and the marginal distributions are not GLMs.

There is an important difference between the interpretation of the regression coefficients in a GEE (marginal GLM) and in a multilevel model (conditional GLM). Let's focus on the interpretation of a coefficient $\beta_1$ in a binomial model, where the corresponding covariate $x_{ij1}$ is either 0 or 1.

In the multilevel (conditional) model, $\beta_1$ is the log odds ratio between $y_{ij}$ and $x_{ij1}$, conditioned on all other covariates, and also conditioned on the random effect $\gamma_i$. An alternative way to get approximately the same log odds ratio $\beta_1$ would be to calculate the sample log odds ratio within each cluster, then average the corresponding odds ratios over the clusters. This procedure is known as the Mantel-Haenszel estimate of the common odds ratio.

In the GEE (marginal) model, $\beta_1$ is the log odds ratio between $y_{ij}$ and $x_{ij1}$, pooling over all clusters (i.e. ignoring the clusters). In general, the marginal odds ratio (given by GEE) is closer to 1 than the conditional odds ratio given by multilevel modeling. The marginal parameters estimated by GEE are sometimes referred to as population averaged effects, while the conditional parameters estimated in multilevel modeling are referred to as individual effects.
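The attenuation can be seen numerically. The following sketch (illustrative parameter values; Gauss-Hermite quadrature over the random effect) marginalizes a conditional logistic model with a scalar random intercept $\gamma_i \sim N(0, \sigma^2)$ and compares the implied marginal log odds ratio to the conditional value $\beta_1$.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def expit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def logit(p):
    return np.log(p / (1.0 - p))

def marginal_prob(eta_fixed, sigma, n_nodes=50):
    """E[expit(eta_fixed + gamma)] for gamma ~ N(0, sigma^2),
    computed by Gauss-Hermite quadrature."""
    nodes, weights = hermgauss(n_nodes)
    vals = expit(eta_fixed + np.sqrt(2.0) * sigma * nodes)
    return np.sum(weights * vals) / np.sqrt(np.pi)

# Illustrative conditional (cluster-level) parameters.
beta0, beta1, sigma = -1.0, 1.0, 2.0
p0 = marginal_prob(beta0, sigma)          # marginal P(y = 1 | x = 0)
p1 = marginal_prob(beta0 + beta1, sigma)  # marginal P(y = 1 | x = 1)
print("conditional log OR:", beta1)
print("marginal log OR:   ", logit(p1) - logit(p0))
```

With these values the marginal log odds ratio comes out to roughly 0.6, noticeably closer to zero than the conditional value of 1.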
Comparison to generalized linear mixed effects models

Some comments about the relationship between these two frameworks:

- The marginal (unconditional) mean structure $E[y_{ij}|x_{ij}]$ usually cannot be expressed in closed form, and is not a GLM or exponential family mean structure in general.
- Variance parameters ($\Phi$) also affect the mean structure.
- Fitting requires numerical integration, sometimes over a high-dimensional domain.
- It is difficult to assess the fit of the random effects distribution, which is almost always taken to be Gaussian in practice.
- Estimates and inference for mean structure parameters are in general not robust to misspecification of the variance structure parameters.
- Variance parameters can be tested just like any other parameters.
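In the linear case discussed earlier, the two frameworks estimate the same $\beta$, and this is easy to check empirically. The sketch below (illustrative; simulated random-intercept data, and parameter values are made up) fits both a GEE with an exchangeable working correlation and a linear mixed model using statsmodels and compares the coefficient estimates.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated linear random-intercept data: 200 clusters of size 4.
rng = np.random.default_rng(2)
m, n_i = 200, 4
df = pd.DataFrame({
    "cluster": np.repeat(np.arange(m), n_i),
    "x": rng.normal(size=m * n_i),
})
u = np.repeat(rng.normal(scale=1.0, size=m), n_i)  # cluster random intercepts
df["y"] = 1.0 + 2.0 * df["x"] + u + rng.normal(size=m * n_i)

# Marginal fit: GEE with an exchangeable working correlation.
gee = sm.GEE.from_formula(
    "y ~ x", groups="cluster", data=df,
    family=sm.families.Gaussian(),
    cov_struct=sm.cov_struct.Exchangeable(),
).fit()

# Conditional fit: linear model with a Gaussian random intercept.
lmm = sm.MixedLM.from_formula("y ~ x", groups="cluster", data=df).fit()

print(gee.params)     # marginal coefficients
print(lmm.fe_params)  # fixed effects; agree closely in the linear case
```

With a logit or log link the two sets of coefficients would not agree in this way, for the reasons given above.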