Generalized Estimating Equations
Kerby Shedden
Department of Statistics, University of Michigan

December 6, 2021


Score equations
Suppose we have multivariate Gaussian data with mean structure
$E[y|X] = X\beta$ and covariance structure $\Sigma \in \mathbb{R}^{n \times n}$.

Using generalized least squares, we estimate $\beta$ by minimizing

$$(y - X\beta)'\Sigma^{-1}(y - X\beta).$$

The minimizer solves the following system of equations for $\beta$:

$$X'\Sigma^{-1}(y - X\beta) = 0.$$

These are the score equations $\ell'(\beta; y, X) = 0$ for the Gaussian log
likelihood. They also imply orthogonality between the residuals and
$\mathrm{col}(X)$ in the metric of $\Sigma^{-1}$.
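As a concrete illustration, here is a minimal NumPy sketch (with simulated data, so all values are hypothetical) in which $\hat\beta$ is obtained by solving the score equations $X'\Sigma^{-1}(y - X\beta) = 0$ directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))

# An arbitrary positive definite covariance matrix Sigma.
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)

beta_true = np.array([1.0, -0.5, 2.0])
y = rng.multivariate_normal(X @ beta_true, Sigma)

# The score equations X' Sigma^{-1} (y - X beta) = 0 are linear in beta,
# so beta_hat solves (X' Sigma^{-1} X) beta = X' Sigma^{-1} y.
Si = np.linalg.inv(Sigma)
beta_hat = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
print(beta_hat)
```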


Score equations for clustered data
Now suppose the sample can be partitioned into $m$ clusters of sizes
$n_1, \ldots, n_m$ such that there is no dependence between clusters.

Let $y_i \in \mathbb{R}^{n_i}$ and $X_i \in \mathbb{R}^{n_i \times p}$ denote the components of $y$ and $X$
containing the data for the $i^{\rm th}$ cluster, and let $\Sigma_i = \mathrm{Cov}[y_i|X_i]$.
Then

$$(y - X\beta)'\Sigma^{-1}(y - X\beta) = \sum_i (y_i - X_i\beta)'\Sigma_i^{-1}(y_i - X_i\beta),$$

and the score equations can be written

$$\sum_i X_i'\Sigma_i^{-1}(y_i - X_i\beta) = 0_{p+1}.$$

Can we obtain score equations of this form for dependent data in a
GLM framework?
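Before moving on, here is a small sketch (again with hypothetical simulated data) of the clustered form in action: the per-cluster terms are accumulated and the summed score equations are solved for $\hat\beta$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, ni, p = 10, 4, 3  # m clusters, each of size ni
beta_true = rng.normal(size=p)

lhs = np.zeros((p, p))  # accumulates sum_i X_i' Sigma_i^{-1} X_i
rhs = np.zeros(p)       # accumulates sum_i X_i' Sigma_i^{-1} y_i
for i in range(m):
    Xi = rng.normal(size=(ni, p))
    A = rng.normal(size=(ni, ni))
    Sigma_i = A @ A.T + ni * np.eye(ni)  # within-cluster covariance
    yi = rng.multivariate_normal(Xi @ beta_true, Sigma_i)
    Si = np.linalg.inv(Sigma_i)
    lhs += Xi.T @ Si @ Xi
    rhs += Xi.T @ Si @ yi

beta_hat = np.linalg.solve(lhs, rhs)  # solves the summed score equations
print(beta_hat)
```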

Score equations for non-Gaussian data
Recall that the GLM score function in terms of the canonical
parameter $\theta$ is $y - \gamma'(\theta) \in \mathbb{R}^n$.

Write $\eta_i = X_i\beta \in \mathbb{R}^{n_i}$ for the linear predictor for cluster $i$, and
$\theta_i \in \mathbb{R}^{n_i}$ for the canonical parameter in cluster $i$. Then use the
chain rule to differentiate with respect to $\beta$:

$$\sum_i \frac{\partial\theta_i}{\partial\beta}\,(y_i - \gamma'(\theta_i)) = 0_{p+1},$$

which in turn implies that

$$\sum_i \frac{\partial\eta_i}{\partial\beta}\frac{\partial\theta_i}{\partial\eta_i}\,(y_i - \gamma'(\theta_i)) = \sum_i X_i^T\,\frac{\partial\theta_i}{\partial\eta_i}\,(y_i - \gamma'(\theta_i)) = 0_{p+1}.$$


Score equations for non-Gaussian data
Using GLM notation, $\theta_i = g(\eta_i)$, so $\partial\theta_i/\partial\eta_i = g'(\eta_i)$.

Since the mean satisfies $\mu_i = \gamma'(g(\eta_i))$, it follows that

$$\frac{\partial\mu_i}{\partial\beta} = \frac{\partial\eta_i}{\partial\beta}\cdot\gamma''(g(\eta_i))\,g'(\eta_i) = X_i^T V_i(\eta_i)\,g'(\eta_i)/\phi,$$

where $V_i(\theta) = \mathrm{Var}[y_i|X_i] \in \mathbb{R}^{n_i \times n_i}$ is the variance function.

Putting everything together yields

$$\sum_i \frac{\partial\mu_i}{\partial\beta}\,V_i^{-1}(y_i - \mu_i) = 0_{p+1}.$$


Score equations for non-Gaussian data

These score equations are a system of $p+1$ equations in $p+1$
unknowns, so if they are not degenerate, they should uniquely
determine $\beta$.

In the Gaussian case, $\partial\mu_i/\partial\beta = X_i^T$ and $V_i \propto 1$, so the score
equations are equivalent to requiring that the residuals be
orthogonal to every column of $X$.

These score estimating equations are a generalization of this
condition to non-linear models.


Score equations for dependent non-Gaussian data
Suppose the data are dependent within clusters, and we block the
data so that $y_i \in \mathbb{R}^{n_i}$ is the response data for cluster $i$. By analogy
with GLMs, we can propose the following score equations:

$$\sum_i D_i'\Sigma_i^{-1}(y_i - \mu_i) = 0,$$

where $D_i = \partial E[y_i|X_i]/\partial\beta' = (\partial\mu_i/\partial\eta_i)\,X_i$, with $\mu_i = E[y_i|X_i]$ and
$\partial\mu_i/\partial\eta_i$ a diagonal matrix.

The variance matrix can be modeled as

$$\Sigma_i = V_i^{1/2} R_i(\alpha) V_i^{1/2},$$

where $V_i$ is the diagonal matrix determined by a GLM variance
function, and $R_i(\alpha)$ is a correlation model determined by a
separate parameter $\alpha$.

Score equations for non-Gaussian data

We can easily calculate $D_i$ for any of the familiar generalized linear
model link functions. For example, in the logistic model
$\mu_i = 1/(1 + \exp(-X_i\beta))$,

$$\partial\mu_i(j)/\partial\beta' = \frac{\exp(X_i(j,:)\beta)}{(1 + \exp(X_i(j,:)\beta))^2}\,X_i(j,:) = \mu_i(j)(1 - \mu_i(j))\,X_i(j,:).$$
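A quick numerical sanity check of this derivative (hypothetical values for $x$ and $\beta$; central finite differences against the closed form $\mu(1-\mu)x$):

```python
import numpy as np

x = np.array([1.0, 0.5, -2.0])     # one row of X_i, i.e. X_i(j, :)
beta = np.array([0.3, -1.0, 0.7])

def mu(b):
    # Logistic mean for a single observation.
    return 1.0 / (1.0 + np.exp(-x @ b))

grad_exact = mu(beta) * (1.0 - mu(beta)) * x  # mu (1 - mu) x

# Central finite differences, one coordinate of beta at a time.
eps = 1e-6
grad_fd = np.array([(mu(beta + eps * e) - mu(beta - eps * e)) / (2 * eps)
                    for e in np.eye(len(beta))])

print(np.allclose(grad_exact, grad_fd))  # True
```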


Mahalanobis distance minimization
Now we will essentially start over and show that we can obtain the
same estimating equations (for dependent data and nonlinear
models), using a different approach.
Suppose we model the mean structure as $E[y_i|X_i] = \mu_i = \mu_i(\beta)$.

The Mahalanobis distance between the data and the fitted means is

$$\sum_i (y_i - \mu_i)^T\Sigma_i^{-1}(y_i - \mu_i).$$

The gradient of this function with respect to $\beta$ (holding $\Sigma_i$ fixed) is

$$-2\sum_i \frac{\partial\mu_i}{\partial\beta}\,\Sigma_i^{-1}(y_i - \mu_i).$$

Setting this gradient to zero recovers the score equations proposed
above.


Solving the score equations
Note that in general $D_i$ depends on $\beta$, and $\Sigma_i$ may also depend on
$\beta$.

To solve the score equations, linearize the mean structure

$$\mu_i \approx \mu_i^{(0)} + D_i^{(0)}(\beta - \beta^{(0)}).$$

Using Gauss-Seidel iterations, we substitute the linearized mean
structure into the score equations, updating $\hat\beta$ for the current
iteration using $D_i$, $\Sigma_i$, and $\mu_i$ calculated from the value of $\hat\beta$ at the
previous iteration:

$$\hat\beta^{(k+1)} = \hat\beta^{(k)} + \Big(\sum_i D_i'\Sigma_i^{-1}D_i\Big)^{-1}\sum_i D_i'\Sigma_i^{-1}(y_i - \mu_i).$$
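A minimal sketch of these iterations for a Poisson model with log link and an independence working covariance (all data simulated and hypothetical; under this working structure $\Sigma_i = V_i = \mathrm{diag}(\mu_i)$ and, for the log link, $D_i = \mathrm{diag}(\mu_i)X_i$):

```python
import numpy as np

rng = np.random.default_rng(2)
m, ni, p = 200, 5, 2
beta_true = np.array([0.5, -0.3])
X = [rng.normal(size=(ni, p)) for _ in range(m)]
Y = [rng.poisson(np.exp(Xi @ beta_true)) for Xi in X]

beta = np.zeros(p)
for _ in range(25):
    lhs, rhs = np.zeros((p, p)), np.zeros(p)
    for Xi, yi in zip(X, Y):
        mu = np.exp(Xi @ beta)        # cluster mean at the current beta
        D = mu[:, None] * Xi          # D_i = diag(dmu/deta) X_i (log link)
        Sinv = np.diag(1.0 / mu)      # Sigma_i^{-1} under independence
        lhs += D.T @ Sinv @ D
        rhs += D.T @ Sinv @ (yi - mu)
    step = np.linalg.solve(lhs, rhs)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:  # stop once the update stalls
        break

print(beta)  # should be near beta_true
```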


Working covariance structures

If the $\Sigma_i$ are misspecified, $\hat\beta$ is still $\sqrt{n}$-consistent (under
regularity conditions similar to those needed when $\Sigma_i$ is known).

We can therefore choose a working correlation $R_i(\alpha)$ when fitting
the model.

We then set $\Sigma_i = V_i^{1/2}R_i(\alpha)V_i^{1/2}$, where $V_i$ is the diagonal matrix
containing the variance values.


Working covariance structures

Using a “sandwich covariance estimate” gives correct inferences
even when the working correlation structure is wrong.
First, define
$$A \equiv \sum_i D_i'\Sigma_i^{-1}D_i,$$

$$B \equiv \sum_i D_i'\Sigma_i^{-1}r_i r_i'\Sigma_i^{-1}D_i,$$

where $r_i = y_i - \hat\mu_i$ are the residuals for cluster $i$.


Working covariance structures

Note that when the working covariance structure is correct,
$E[r_i r_i'] = \Sigma_i$, and $B$ becomes

$$B = \sum_i D_i'\Sigma_i^{-1}D_i.$$

The following covariance approximation holds even when the
working correlation matrices $R_i$ are misspecified:

$$\mathrm{Cov}[\hat\beta] \approx A^{-1}BA^{-1}.$$
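A sketch of the sandwich computation, under the same hypothetical Poisson/independence setup as the fitting sketch above; $A$ and $B$ are accumulated over clusters at the fitted $\hat\beta$ and then combined as $A^{-1}BA^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(3)
m, ni, p = 300, 4, 2
beta_true = np.array([0.4, -0.2])
X = [rng.normal(size=(ni, p)) for _ in range(m)]
Y = [rng.poisson(np.exp(Xi @ beta_true)) for Xi in X]

# Fit as in the earlier sketch (Poisson, log link, independence working cov).
beta = np.zeros(p)
for _ in range(25):
    lhs, rhs = np.zeros((p, p)), np.zeros(p)
    for Xi, yi in zip(X, Y):
        mu = np.exp(Xi @ beta)
        D = mu[:, None] * Xi
        Sinv = np.diag(1.0 / mu)
        lhs += D.T @ Sinv @ D
        rhs += D.T @ Sinv @ (yi - mu)
    beta += np.linalg.solve(lhs, rhs)

# Sandwich pieces: A = sum_i D' S^-1 D, B = sum_i D' S^-1 r r' S^-1 D.
A, B = np.zeros((p, p)), np.zeros((p, p))
for Xi, yi in zip(X, Y):
    mu = np.exp(Xi @ beta)
    D = mu[:, None] * Xi
    Sinv = np.diag(1.0 / mu)
    u = D.T @ Sinv @ (yi - mu)   # D_i' Sigma_i^{-1} r_i
    A += D.T @ Sinv @ D
    B += np.outer(u, u)

Ainv = np.linalg.inv(A)
cov_robust = Ainv @ B @ Ainv          # Cov[beta_hat] ~ A^{-1} B A^{-1}
print(np.sqrt(np.diag(cov_robust)))   # robust standard errors
```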


Working covariance structures
Why does it work?
We’ll focus on the linear case, where

$$\hat\beta = \Big(\sum_i X_i'\Sigma_i^{-1}X_i\Big)^{-1}\sum_i X_i'\Sigma_i^{-1}y_i = A^{-1}\sum_i X_i'\Sigma_i^{-1}y_i,$$

since in the case of a linear model

$$A = \sum_i D_i'\Sigma_i^{-1}D_i = \sum_i X_i'\Sigma_i^{-1}X_i.$$


Working covariance structures (why does it work?)

$$\mathrm{Cov}[\hat\beta] = A^{-1}\Big(\sum_i X_i'\Sigma_i^{-1}\mathrm{Cov}[y_i|X_i]\,\Sigma_i^{-1}X_i\Big)A^{-1}.$$

Since $E[r_i r_i'] \approx \mathrm{Cov}[y_i|X_i]$, the “sandwich expression” $A^{-1}BA^{-1}$ is
a plug-in estimate of the covariance matrix given above.


Working correlation structures

The working covariance is usually defined in terms of a working
correlation structure. Examples include:

- Independence: $R_i = I$
- Exchangeable: $R_i(j, j) = 1$, $R_i(j, j') = r$ if $j \neq j'$
- Autoregressive: $R_i(j, j') = r^{|j - j'|}$
- Many more...

We then have as the working covariance $\Sigma_i = V_i^{1/2}R_i(\alpha)V_i^{1/2}$,
where $V_i = \mathrm{diag}(\mathrm{Var}[y_i|X_i])$.
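Sketches of the exchangeable and autoregressive structures as explicit matrices, and the resulting working covariance for a hypothetical Poisson cluster (where $V_i = \mathrm{diag}(\mu_i)$):

```python
import numpy as np

def exchangeable(ni, r):
    # R(j, j) = 1 on the diagonal, r everywhere off the diagonal.
    return (1.0 - r) * np.eye(ni) + r * np.ones((ni, ni))

def ar1(ni, r):
    # R(j, j') = r ** |j - j'|.
    j = np.arange(ni)
    return r ** np.abs(j[:, None] - j[None, :])

# Working covariance Sigma_i = V_i^{1/2} R_i(alpha) V_i^{1/2} for a
# hypothetical Poisson cluster, where V_i = diag(mu_i).
mu_i = np.array([1.5, 2.0, 0.7, 3.1])
Vh = np.diag(np.sqrt(mu_i))
Sigma_i = Vh @ exchangeable(len(mu_i), 0.3) @ Vh
```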


Working correlation structures

These variances are often specified in terms of the mean via a
mean/variance relationship, e.g. for a Poisson model we have
$\mathrm{Var}[y_i|X_i] = E[y_i|X_i]$.

To implement this, we need to devise a way to update $R_i$ based on
the residuals $r_i = y_i - \hat\mu_i$. Usually these updates are based on the
method of moments. These updates alternate with the
Gauss-Seidel iterations until convergence.
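A sketch of one such moment-based update for the exchangeable structure: standardize the residuals using the GLM variances, then average the products over distinct within-cluster pairs (real implementations typically also divide by a scale estimate and apply small-sample adjustments):

```python
import numpy as np

def update_exchangeable_r(resids, variances):
    """Moment estimate of the common within-cluster correlation.

    resids: list of residual vectors r_i = y_i - mu_i, one per cluster.
    variances: list of GLM variance vectors Var[y_i | X_i].
    """
    num, den = 0.0, 0
    for r, v in zip(resids, variances):
        z = r / np.sqrt(v)  # Pearson (standardized) residuals
        ni = len(z)
        # Sum of z_j z_j' over distinct pairs j < j' within the cluster.
        num += (z.sum() ** 2 - (z ** 2).sum()) / 2.0
        den += ni * (ni - 1) // 2
    return num / den
```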


Some applications

- Clustered data: Cases are families, classrooms, coworkers,
  etc., possibly multiply nested; may use an exchangeable working
  correlation.

- Longitudinal data: Mean structure + serial dependence
  structure (e.g. autoregressive, stationary autocovariance).

- Spatial data: May use any spatial covariance function; may
  have only one cluster.
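In practice these pieces rarely need to be coded by hand. For example, the statsmodels package in Python provides a GEE implementation; a minimal usage sketch with hypothetical simulated clustered binary data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical clustered binary data.
rng = np.random.default_rng(4)
m, ni = 100, 6
df = pd.DataFrame({
    "cluster": np.repeat(np.arange(m), ni),
    "x1": rng.normal(size=m * ni),
    "x2": rng.normal(size=m * ni),
})
eta = 0.5 * df["x1"] - 0.25 * df["x2"]
df["y"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

model = sm.GEE.from_formula(
    "y ~ x1 + x2",
    groups="cluster",
    data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())  # reports robust (sandwich) standard errors
```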


Limitations/criticisms of GEE
- Estimates the marginal mean structure parameters

- Does not follow the likelihood principle

- Details are not very intuitive

- May not be fully efficient

- No formal inference for variance parameters (the seldom-used
  GEE2 allows this)

- May be inconsistent in some longitudinal settings with
  time-dependent covariates, or in certain settings with missing
  data

- Does not propagate uncertainty in the dependence parameters to
  uncertainty in the regression coefficient estimates


Limitations/criticisms of GEE (continued)

- Some familiar likelihood-based tools like the LRT and AIC are
  not available (but modified versions like QIC exist)

- Iterations sometimes may not converge, or multiple solutions
  to the score equations may exist; it is not clear whether
  convergence is necessary, or how many iterations should be
  performed

- Difficult to simulate data that exactly match the model


Positive aspects of GEE analysis

- Focuses on the mean structure, where the main interest
  usually lies

- “Robust” to misspecification of things other than the mean
  structure

- Builds on familiar GLM analysis; results are consistent with
  studies that have an independent-subjects design

- Can be quite fast to fit (depending on the details of the
  covariance structure update)


Comparison to multilevel (random effects) models
The generalized linear mixed effects (GLIMIX) model for
single-level clustered data is
$$\gamma_1, \ldots, \gamma_m \stackrel{\rm iid}{\sim} N(0, \Phi),$$

where $\gamma_i \in \mathbb{R}^q$ and $\Phi \in \mathbb{R}^{q \times q}$.

The observed responses $y_{ij}|\gamma_i$ are independent over $i$ and $j$, and
follow a GLM with linear predictor

$$x_{ij}\beta + z_{ij}\gamma_i,$$

where $x_{ij} \in \mathbb{R}^p$ and $z_{ij} \in \mathbb{R}^q$.


Comparison to multilevel (random effects) models
In the case of a linear model, we can average over $\gamma$ to obtain the
marginal model:

$$E[y_{ij}|\gamma_i] = x_{ij}\beta + z_{ij}\gamma_i,$$

and

$$E[y_{ij}] = x_{ij}\beta.$$

Thus, the coefficients for $x$ in the multilevel model are the same as
the coefficients for $x$ in the marginal model.

In the linear case, the multilevel and marginal models are both
linear and Gaussian, and $\beta$ has the same meaning in both
representations.

Comparison to multilevel (random effects) models

For any nonlinear GLM, the marginal model is not a GLM.

For example, the marginal form for a binomial mixed GLM with
logit link is:

$$\int \prod_j \frac{\exp(y_{ij}\cdot(x_{ij}\beta + z_{ij}\gamma_i))}{1 + \exp(x_{ij}\beta + z_{ij}\gamma_i)}\,\phi(\gamma_i)\,d\gamma_i.$$

This integral cannot be solved analytically, and it does not
correspond to a binomial GLM.
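The integral can be approximated numerically, though. A Monte Carlo sketch of the marginal mean $E[y_{ij}]$ for a random-intercept logit model, with hypothetical values for $\beta$ and the random-effect standard deviation:

```python
import numpy as np

rng = np.random.default_rng(5)
x, beta, sigma = 1.0, 1.0, 2.0  # hypothetical covariate, coefficient, and SD

# Marginal mean: integrate expit(x*beta + gamma) over gamma ~ N(0, sigma^2).
gamma = rng.normal(0.0, sigma, size=1_000_000)
marginal = np.mean(1.0 / (1.0 + np.exp(-(x * beta + gamma))))

conditional = 1.0 / (1.0 + np.exp(-x * beta))  # at gamma = 0

print(conditional)  # about 0.73
print(marginal)     # smaller, pulled toward 0.5: the marginal curve is flatter
```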


Comparison to multilevel (random effects) models

In a GEE analysis, the marginal distributions are GLMs and the
joint distribution is not specified.

In a multilevel model, the conditional distributions factor as
independent GLMs. The marginalization is not tractable, and the
marginal distributions are not GLMs.


Comparison to multilevel (random effects) models
There is an important difference between the interpretation of the
regression coefficients in a GEE (marginal GLM) and in a
multilevel model (conditional GLM).
Let’s focus on the interpretation of a coefficient $\beta_1$ in a binomial
model, where the corresponding covariate $x_{ij1}$ is either 0 or 1.

In the multilevel (conditional) model, $\beta_1$ is the log odds ratio
between $y_{ij}$ and $x_{ij1}$, conditioned on all other covariates, and also
conditioned on the random effect $\gamma_i$.

An alternative way to get approximately the same log odds ratio $\beta_1$
would be to calculate the sample log odds ratio within each
cluster, then average these log odds ratios over the clusters. This
procedure is closely related to the Mantel-Haenszel estimate of the
common odds ratio.

Comparison to multilevel (random effects) models

In the GEE (marginal) model, $\beta_1$ is the log odds ratio between $y_{ij}$
and $x_{ij1}$, pooling over all clusters (i.e. ignoring the clusters).

In general, the marginal odds ratio (given by GEE) is closer to 1
than the conditional odds ratio given by multilevel modeling.

The marginal parameters estimated by GEE are sometimes referred
to as population averaged effects, while the conditional parameters
estimated in multilevel modeling are referred to as individual
effects.
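A simulation sketch of this attenuation (all settings hypothetical): data are generated from a random-intercept logistic model with conditional log odds ratio $\beta_1 = 1$, and the pooled log odds ratio, computed ignoring clusters, comes out noticeably smaller:

```python
import numpy as np

rng = np.random.default_rng(6)
m, ni, beta1, sigma = 2000, 10, 1.0, 2.0

gamma = rng.normal(0.0, sigma, size=m)   # random intercept per cluster
x = rng.integers(0, 2, size=(m, ni))     # binary covariate
p = 1.0 / (1.0 + np.exp(-(beta1 * x + gamma[:, None])))
y = rng.binomial(1, p)

# Pooled (marginal) log odds ratio from the 2x2 table, ignoring clusters.
a = np.sum((y == 1) & (x == 1))
b = np.sum((y == 0) & (x == 1))
c = np.sum((y == 1) & (x == 0))
d = np.sum((y == 0) & (x == 0))
print(np.log(a * d / (b * c)))  # noticeably less than the conditional beta1 = 1
```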


Comparison to linear mixed effects models
Some comments about the relationship between these two
frameworks:
- The marginal (unconditional) mean structure $E[y_{ij}|x_{ij}]$ usually
  cannot be expressed in closed form, and is not a GLM or
  exponential family in general.

- Variance parameters ($\Phi$) also affect the mean structure.

- Fitting requires numerical integration, sometimes over a high
  dimensional domain.

- Difficult to assess the fit of the random effects distribution,
  which is almost always taken to be Gaussian in practice.

- Estimates and inference for mean structure parameters are in
  general not robust to misspecification of the variance
  structure parameters.

- Can test variance parameters just like any other parameters.