Regression analysis with dependent data
Kerby Shedden
Department of Statistics, University of Michigan

November 14, 2021


Dependent data

Ordinary least squares (OLS) is most appropriate for uncorrelated
and homoscedastic data with a linear mean structure.
This means that if we write the response values as y ∈ R^n and collect the regressors in a design matrix X ∈ R^{n×p}, then we have

E[y|X] = Xβ,    Cov[y|X] = σ²In×n.


Dependent data
Recall that we can achieve unbiased estimation (E[β̂] = β, E[ŷ] = E[y], E[σ̂²] = σ²) even when the variance model does not hold.
However, inference (intervals and tests) requires the variance model to hold.
Also, since the Gauss-Markov theorem requires the variance model
to hold, β̂ may not be an efficient estimate of β if the variance
model does not hold.
In this section we develop methods that allow us to efficiently
estimate the model parameters, and conduct rigorous inference,
when the variance model has a more general form.
Throughout this section, we will continue to require the mean
structure model E [y |X ] = X β to hold.


Clustered data
Clustered data are sampled from a population that can be viewed
as the union of a number of related subpopulations.
Write the data as

yij ∈ R, xij ∈ R^p,    i = 1, …, m,  j = 1, …, ni,

where i indexes the subpopulation and j indexes the individuals in the sample belonging to the i-th subpopulation.
There are m clusters (subpopulations), and there are ni
observations in cluster i. If the ni are all the same, we have
balanced clustering. Otherwise the clusters are unbalanced.

Clustered data
It may happen that units from the same subpopulation are more alike than units from different subpopulations, i.e. for i ≠ i′, j ≠ j′,

cor(yij, yij′) > cor(yij, yi′j′).

Part of the within-cluster similarity may be explained by the covariates, i.e. units from the same subpopulation may have similar xij values, which leads to similar y values. In this case,

cor(yij, yij′) > cor(yij, yij′ | xij, xij′).

Even after accounting for measured covariates, units in a cluster may still resemble each other more than units in different clusters:

cor(yij, yij′ | xij, xij′) > cor(yij, yi′j′ | xij, xi′j′).

Clustered data
A simple working model to account for this dependence is
ŷij = θ̂i + β̂′xij.
The idea here is that β′xij explains the variation in y that is related
to the measured covariates, and the θi explain variation in y that is
related to the clustering.
This working model would be correct if there were q omitted variables zijℓ, ℓ = 1, …, q, that were constant for all units in the same cluster (i.e. zijℓ depends on ℓ and i, but not on j).
In that case, θ̂i would stand in for the value of Σℓ ψ̂ℓ zijℓ that we would have obtained if the zijℓ were observed.


Clustered data
As an alternate notation, we can vectorize the data to express y ∈ R^n and X ∈ R^{n×p}, where n = Σi ni, and then write

ŷ = Σi θ̂i Ii + X β̂,

where Ii ∈ {0, 1}^n is the indicator of which subjects belong to cluster i.
If the observed covariates in X are related to the clustering (i.e. if
the columns of X and the Ii are not orthogonal), then OLS
apportions the overlapping variance between β̂ and θ̂.


Test score example
Suppose yij is a reading test score for the j-th student in the i-th classroom, from a large number of classrooms under consideration. Suppose xij is the income of the student's family.
We might postulate as a population model

yij = θi + βxij + εij,
which can be fit as
ŷij = θ̂i + β̂xij .


Test score example
Ideally we would want the parameter estimates to reflect sources of
variation as follows:
▶ "Direct effects" of parent income such as access to books, life experiences, good health care, etc. should go entirely to β̂.

▶ Attributes of classrooms that are not related to parent income, for example, the effect of an exceptionally good or bad teacher, should go entirely to the θ̂i.

▶ Attributes of classrooms that are correlated with parent income, such as teacher salary, training, and resources, will be apportioned by OLS between θ̂i and β̂.

▶ Unique and unpredictable events affecting particular individuals, such as the severe illness of the student or a family member, should go entirely to ε̂.

Other examples of clustered data

▶ Treatment outcomes for patients treated in various hospitals.

▶ Crime rates in police precincts distributed over a number of large cities (the precincts are the units and the cities are the clusters).

▶ Prices of stocks belonging to various business sectors.

▶ Surveys in which the data are collected following a cluster sampling approach.


What if we ignore the θi ?
The θi are usually not of primary interest, but we should be
concerned that by failing to take account of the clustering, we may
incorrectly assess the relationship between y and X .
If the θi are nonzero, but we fail to account for them in the model,
the working model is misspecified.
Let X be the design matrix without intercept, and let Q = [I1 ⋯ Im] be the matrix of cluster indicators (which implicitly includes the intercept):

    ⎡ y11 ⎤        ⎡ 1 0 ⋯ ⎤        ⎡ X111 X112 ⋯ ⎤
    ⎢ y12 ⎥        ⎢ 1 0 ⋯ ⎥        ⎢ X121 X122 ⋯ ⎥
y = ⎢ y21 ⎥,   Q = ⎢ 0 1 ⋯ ⎥,   X = ⎢ X211 X212 ⋯ ⎥
    ⎢ y22 ⎥        ⎢ 0 1 ⋯ ⎥        ⎢ X221 X222 ⋯ ⎥
    ⎢ y23 ⎥        ⎢ 0 1 ⋯ ⎥        ⎢ X231 X232 ⋯ ⎥
    ⎣  ⋮  ⎦        ⎣ ⋮ ⋮   ⎦        ⎣  ⋮    ⋮     ⎦

What if we ignore the θi ?
Let X̃ = [1n | X]. The estimate from regressing y on X̃ is β̂* = (β̂0, β̂′)′, with expectation

E[β̂*] = (X̃′X̃)⁻¹X̃′E[y]
       = (X̃′X̃)⁻¹X̃′(Xβ + Qθ).

For the bias of β̂ for β to be zero, we need

E[β̂*] − β̃ = (X̃′X̃)⁻¹X̃′(Xβ + Qθ − X̃β̃) = 0,

where β0 = E[β̂0] and β̃ = (β0, β′)′. Thus we need

0 = X̃′(Xβ + Qθ − X̃β̃) = X̃′(Qθ − β0·1n).

What if we ignore the θi ?
Let S = Qθ be the vector of cluster effects. Since the first column of X̃ consists of 1's, we must have

S̄ = β0.

For any other covariate Xj, we must have

Xj′S = β0 Xj′1n,

which implies that Xj and S have zero sample covariance.
Zero covariance between the covariates and the cluster effects is unlikely in many studies, where people tend to cluster (in schools, hospitals, etc.) with other people having similar covariate levels.

Fixed effects analysis
In a fixed effects approach, we model the θi as regression
parameters, by including additional columns in the design matrix
whose covariate levels are the cluster indicators, i.e. we regress y
on [X Q].
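
As a concrete sketch of this approach (illustrative code, not from the original slides; the function name and data layout are assumptions), we can append one 0/1 indicator column per cluster to the design matrix and run OLS on [X Q]:

    import numpy as np

    def fixed_effects_ols(y, X, clusters):
        """Regress y on [X Q], where Q has one 0/1 indicator column per
        cluster; X is assumed not to contain an intercept, since the
        columns of Q span it."""
        labels, idx = np.unique(clusters, return_inverse=True)
        Q = np.zeros((len(y), len(labels)))
        Q[np.arange(len(y)), idx] = 1.0            # cluster indicators I_i
        coef, *_ = np.linalg.lstsq(np.hstack([X, Q]), y, rcond=None)
        return coef[:X.shape[1]], coef[X.shape[1]:]    # (beta_hat, theta_hat)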
As the sample size grows, in most applications the cluster sizes ni
will remain bounded (e.g. a primary school classroom might have
up to 30 students). Thus the number of clusters must grow, so the
dimension of the parameter vector θ grows.
This puts us in a setting where the model dimension (p) and
sample size (n) are growing together. This is not typical in
standard statistical analysis, and leads to the Neyman-Scott problem. Contemporary methods for "high-dimensional" regression
provide one way to work around this challenge.


Cluster effect examples
What does the inclusion of fixed effects do to the parameter
estimates β̂, which are often of primary interest?
The following slides show scatterplots of an outcome y against a
scalar covariate x, in a setting where there are three clusters
(indicated by color).
Above each plot are the coefficient estimates, Z-scores, and R² values for fits in which the cluster effects are either included (top line) or excluded (second line). These estimates are obtained from the following two working models:

ŷ = α̂ + β̂x + Σi θ̂i Ii,      ŷ = α̂s + β̂s x.

The Z-scores are β̂/SD(β̂) and the R² values are cor(ŷ, y)².

Cluster effect examples
Left: y is independent of both clusters and x; x and clusters are independent; β̂ ≈ 0 with and without cluster terms in the model.
Right: y relates to x, but not to the clusters; x and clusters are independent; β̂, Z, and R² are similar with and without cluster effects in the model.

[Scatterplots omitted. Left panel: β̂1 = −0.07, Z = −0.49, r² = 0.02; β̂1s = −0.06, Zs = −0.44, rs² = 0.00. Right panel: β̂1 = 1.04, Z = 7.75, r² = 0.57; β̂1s = 1.01, Zs = 7.81, rs² = 0.56.]

Cluster effect examples
Left: y relates to clusters, but not to x; x and clusters are independent; β̂ ≈ 0 with and without cluster effects in the model.
Right: y relates to clusters but not to x; x and clusters are dependent; when cluster effects are not modeled their effect is picked up by x.

[Scatterplots omitted. Left panel: β̂1 = −0.03, Z = −0.85, r² = 0.94; β̂1s = 0.00, Zs = 0.02, rs² = 0.00. Right panel: β̂1 = 0.03, Z = 0.19, r² = 0.94; β̂1s = 0.92, Zs = 20.72, rs² = 0.90.]

Cluster effect examples
Left: y relates to both clusters and x; x and clusters are dependent.
Right: y relates to clusters but not to x; x and clusters are dependent but the net cluster effect is not linearly related to x.

[Scatterplots omitted. Left panel: β̂1 = 0.90, Z = 5.97, r² = 0.95; β̂1s = 0.96, Zs = 29.26, rs² = 0.95. Right panel: β̂1 = −0.04, Z = −0.31, r² = 0.84; β̂1s = 0.00, Zs = 0.02, rs² = 0.00.]

Cluster effect examples
Left: y relates to clusters and to x; x and clusters are dependent; the x effect has the opposite sign from the net cluster effect.
Right: y relates to clusters and to x; x and clusters are dependent; the signs and magnitudes of the x effects are cluster-specific.

[Scatterplots omitted. Left panel: β̂1 = −1.93, Z = −12.68, r² = 0.99; β̂1s = 2.72, Zs = 16.81, rs² = 0.85. Right panel: β̂1 = −0.52, Z = −1.70, r² = 0.98; β̂1s = 2.92, Zs = 8.20, rs² = 0.58.]

Implications of fixed effects analysis for observational data

A stable confounder is a confounding factor that is approximately
constant within clusters. A stable confounder will become part of
the net cluster effect (θi ).
If a stable confounder is correlated with an observed covariate X ,
then this will create non-orthogonality between the cluster effects
and the effects of the observed covariates.


Implications of fixed effects analysis for experimental data

Experiments are often carried out on “batches” of objects
(specimens, parts, etc.) in which uncontrolled factors cause
elements of the same batch to be more similar than elements of
different batches.
If treatments are assigned randomly within each batch, there are
no stable confounders (in general there are no confounders in
experiments). Therefore the overall OLS estimate of β is unbiased
as long as the standard linear model assumptions hold.


Implications of fixed effects analysis for experimental data
Suppose the design is balanced (e.g. exactly half of each batch is
treated and half is untreated). This is an orthogonal design, so the
estimate based on the working model
ŷij = β̂xij
and the estimate based on the working model
ŷij = θ̂i + β̂*xij

are identical (β̂ = β̂*), and thus have the same actual variance. But the estimated variance of β̂ will be greater than the estimated variance of β̂* (since the corresponding estimate of σ² is greater), so the analysis based on the first working model will have lower power and wider confidence intervals.

Random cluster effects
As we have seen, the cluster effects θi can be treated as
unobserved constants, and estimated as we would estimate any
other regression coefficient.
An alternative way to handle cluster effects is to view the θi as
unobserved (latent) random variables.
In doing this, we now have two random variables in the model: θi and εij, which are taken to be independent of each other.
If the θi are independent and identically distributed, we can combine them with the error terms to get a single random error term per observation:

εᶜij = θi + εij.


Random cluster effects
Let yi ≡ (yi1, …, yini)′ denote the vector of responses in the i-th cluster, let xi ∈ R^{ni×p} denote the matrix of predictor variables for the i-th cluster, and let εᶜi ≡ (εᶜi1, …, εᶜini)′ ∈ R^{ni} denote the vector of random "errors" for the i-th cluster.
Thus we have the model

yi = xiβ + εᶜi,

for clusters i = 1, …, m. The εᶜi are taken to be uncorrelated between clusters, i.e.

cov(εᶜi, εᶜi′) = 0_{ni×ni′}

for i ≠ i′.

Random cluster effects
The structure of the covariance matrix

Si ≡ cov(εᶜi|xi)

is

Si = ⎡ σ² + σθ²   σθ²        ⋯   σθ²       ⎤
     ⎢ σθ²        σ² + σθ²   ⋯   σθ²       ⎥
     ⎢ ⋮          ⋮          ⋱   ⋮         ⎥
     ⎣ σθ²        σθ²        ⋯   σ² + σθ²  ⎦
   = σ²I + σθ²11ᵀ,

where var(εij) = σ² and var(θi) = σθ².
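
In code, Si follows directly from the closed form σ²I + σθ²11ᵀ; a minimal sketch (the function name and parameter values are arbitrary):

    import numpy as np

    def exchangeable_cov(ni, sigma2, sigma2_theta):
        """Within-cluster covariance under the random intercept model:
        sigma^2 * I + sigma_theta^2 * 11'."""
        return sigma2 * np.eye(ni) + sigma2_theta * np.ones((ni, ni))

    print(exchangeable_cov(3, 1.0, 0.5))   # 1.5 on the diagonal, 0.5 off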

Generalized Least Squares
Suppose we have a linear model with mean structure E[y|X] = Xβ for y ∈ R^n, X ∈ R^{n×(p+1)}, and β ∈ R^{p+1}, and variance structure Cov[y|X] ∝ Σ, where Σ is a given n × n matrix.
We can write the model in generative form as y = Xβ + ε, where ε ∈ R^n with E[ε|X] = 0, Cov[ε|X] = Σ.
Factor the covariance matrix as Σ = GG′, and consider the transformed model

G⁻¹y = G⁻¹Xβ + G⁻¹ε.

Then letting η ≡ G⁻¹ε, it follows that Cov(η) = In×n, and note that the population slope vector β of the transformed model is identical to the population slope vector of the original model.


Generalized Least Squares
The GLS estimator of β is defined to be the OLS estimator of β for the decorrelated response G⁻¹y and the decorrelated predictors G⁻¹X.
The GLS estimate of the regression slope can be expressed in terms of the original design matrix X and response vector y:

β̂GLS = ((G⁻¹X)′G⁻¹X)⁻¹(G⁻¹X)′G⁻¹y
      = (X′G⁻ᵀG⁻¹X)⁻¹X′G⁻ᵀG⁻¹y
      = (X′Σ⁻¹X)⁻¹X′Σ⁻¹y.
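
A quick numerical check of this identity (an illustrative sketch on simulated data, not from the source): whiten with the Cholesky factor G of Σ, run OLS on the transformed data, and compare with the closed form.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 3
    X = rng.normal(size=(n, p))
    A = rng.normal(size=(n, n))
    Sigma = A @ A.T + n * np.eye(n)          # an arbitrary SPD covariance
    G = np.linalg.cholesky(Sigma)            # Sigma = G G'
    y = X @ np.array([1.0, -2.0, 0.5]) + G @ rng.normal(size=n)

    # OLS on the decorrelated data G^{-1} X, G^{-1} y ...
    b1, *_ = np.linalg.lstsq(np.linalg.solve(G, X),
                             np.linalg.solve(G, y), rcond=None)
    # ... agrees with the closed form (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
    Si = np.linalg.inv(Sigma)
    b2 = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
    assert np.allclose(b1, b2)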


Generalized Least Squares

We can use GLS to analyze clustered data with random cluster
effects.
Let yi* ≡ Si^{−1/2} yi, εi* ≡ Si^{−1/2} εᶜi, and xi* ≡ Si^{−1/2} xi.

Let y*, X*, and ε* denote the result of stacking yi*, xi*, and εi* over i, respectively.

Generalized Least Squares

Since cov(εi*) ∝ I, the OLS estimate of β for the model

y* = X*β + ε*

is the best unbiased estimate of β that is linear in y* (by the Gauss-Markov theorem).
Since the set of linear estimates based on y* is the same as the set of linear estimates based on y, it follows that the GLS estimate of β based on y is the best unbiased estimate of β that is linear in y.


Generalized Least Squares
When Σ = Cov[y|X], the covariance matrix of β̂ in GLS is

Cov[β̂|X] = (X*ᵀX*)⁻¹ = (XᵀΣ⁻¹X)⁻¹.

Note that this is consistent with what we get in OLS, where Σ = σ²I and Cov[β̂|X] = σ²(X′X)⁻¹.
To apply GLS, it is only necessary to know Σ up to a multiplicative constant. The same estimated slopes β̂ are obtained if we decorrelate with Σ or with kΣ for k > 0.
However if Σ = k·Cov[y|X], then

Cov[β̂|X] = k⁻¹(X′Σ⁻¹X)⁻¹,

and at the moment we have no way to estimate k.

Generalized Least Squares
When Cov[y|X] ∝ Σ, the sampling covariance matrix for GLS is proportional to (XᵀΣ⁻¹X)⁻¹, and the parameter estimates are uncorrelated with each other if and only if XᵀΣ⁻¹X is diagonal – that is, if the columns of X are mutually orthogonal with respect to the inner product defined by Σ⁻¹: ⟨u, v⟩ ≡ u′Σ⁻¹v.
This is related to the fact that ŷ in GLS is the projection of y onto col(X) in the Mahalanobis metric defined by Σ⁻¹, d(u, v) = (u − v)′Σ⁻¹(u − v). This generalizes the fact that ŷ in OLS is the projection of y onto col(X) in the Euclidean metric.
Neither of these facts depends on the proportionality constant relating Σ to Cov[y|X].


Generalized least squares with a “working” covariance
What if we perform GLS using a possibly misspecified "working covariance" S?
For example, what happens if we use OLS when the actual covariance matrix of the errors is Σ ≠ I?
Since

E[ε*|X*] = E[ε*|X] = 0,

the estimate β̂ remains unbiased. However it has two problems: it may not be the BLUE (i.e. it may not have the least variance among unbiased estimates), and the usual linear model inference procedures will be wrong.


Generalized least squares with “working” covariance
The sampling covariance when the error structure is mis-specified is given by the "sandwich expression":

Cov[β̂] = (X*′X*)⁻¹X*′Σ*X*(X*′X*)⁻¹,

where

Σ* = Cov[ε*|X*].

This result covers two special situations: (i) we use OLS, so X* = X and Σ* = Σ is the covariance matrix of the errors, and (ii) we use a "working covariance" to define the decorrelating matrix G, and this working covariance is not equal to Σ*.
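
A minimal sketch of special case (i), OLS with the true error covariance Σ supplied (the function name is hypothetical):

    import numpy as np

    def ols_sandwich_cov(X, Sigma):
        """Sampling covariance of the OLS slopes when Cov(eps) = Sigma:
        (X'X)^{-1} X' Sigma X (X'X)^{-1}."""
        bread = np.linalg.inv(X.T @ X)
        return bread @ X.T @ Sigma @ X @ bread
        # with Sigma = sigma^2 * I this reduces to sigma^2 (X'X)^{-1}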


Generalized least squares with “working” covariance
Another way to write the covariance of β̂ is as follows:

Cov[β̂] = (X′Σw⁻¹X)⁻¹ X′Σw⁻¹ Σ Σw⁻¹ X (X′Σw⁻¹X)⁻¹,

where Σw is the "working" (possibly incorrectly specified) covariance matrix for ε.
From this expression, it is clear that if Σw = Σ (the working model is correct), then

Cov[β̂] = (X′Σw⁻¹X)⁻¹ = (X′Σ⁻¹X)⁻¹.


Iterated GLS

In practice, we will generally not know Σ, so it must be estimated from the data. Σ is the covariance matrix of the errors, so we can estimate Σ from the residuals.
Typically a low-dimensional parametric model Σ = Σ(α) is used. Given a model for the unexplained variation (i.e. for Σ), the fitting algorithm alternates between estimating α and estimating the regression slopes β.


Iterated GLS
For example, suppose our working covariance has the form

Si(j, j) = ν,    Si(j, k) = rν (j ≠ k).

This is an exchangeable model with "intraclass correlation coefficient" (ICC) r between two observations in the same cluster.
There are several closely-related ICC estimates. One approach is based on the standardized residuals R^s_ij:

Σi Σ_{j<j′} R^s_ij R^s_ij′ / (Σi ni(ni − 1)/2 − p − 1),

where ni is the size of the i-th cluster.
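
A sketch of this moment estimate (function and argument names are hypothetical); it uses the identity Σ_{j<j′} rj rj′ = ((Σj rj)² − Σj rj²)/2 within each cluster:

    import numpy as np

    def icc_estimate(resid_std, clusters, p):
        """ICC moment estimate: within-cluster cross-products of the
        standardized residuals (numpy arrays) over the adjusted pair count."""
        num, pairs = 0.0, 0
        for g in np.unique(clusters):
            r = resid_std[clusters == g]
            num += (r.sum() ** 2 - (r ** 2).sum()) / 2.0   # sum over j < j'
            pairs += len(r) * (len(r) - 1) // 2
        return num / (pairs - p - 1)

An estimate of r from this function, together with a residual-based estimate of ν, would be plugged into Si at each step of the iterated GLS alternation.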

Iterated GLS

Another common model for the error covariance is the first order autoregressive (AR-1) model:

Σij = α^{|i−j|},

where |α| < 1 is a parameter.
There are several possible estimates of α borrowing ideas from time series analysis. We will not provide more details here.
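
The working covariance matrix for one cluster under this model is easy to build; a minimal sketch (the function name is hypothetical):

    import numpy as np

    def ar1_cov(ni, alpha):
        """AR-1 working covariance: entry (j, k) is alpha^{|j-k|}."""
        idx = np.arange(ni)
        return alpha ** np.abs(idx[:, None] - idx[None, :])

    print(ar1_cov(4, 0.6))   # 1 on the diagonal; 0.6, 0.36, 0.216 off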


Generalized Least Squares and stable confounders

For GLS to give unbiased estimates of β, we must have

E[ε*ij|X] = E[θi + εij|X] = 0.

Since E[εij|X] = 0, this is equivalent to requiring that E[θi|X] = 0.
Thus if the covariates have distinct within-cluster mean values, and the within-cluster mean values of the covariates are correlated with the θi, then the GLS estimate of β will be biased.


Likelihood inference for the random intercepts model
The random intercepts model

yij = θi + β′xij + εij

can be analyzed using a likelihood approach, using:

▶ A random intercepts density: φ(θ|X)

▶ A density for the data given the random intercept: f(y|x, θ)

Multilevel models and conditional independence
The models are hierarchical (multilevel), and encode conditional
independence relationships.
At the base level of the hierarchy, the random effects θi are independent and identically distributed, and in particular are unrelated to X:

p(θ1, …, θm|X) = p(θ1, …, θm) = ∏_{i=1}^m φ(θi).

Conditionally on the random effects, the observed data {yij} are independent, and follow distributions that depend on the (unobserved) random effects and the observed covariates:

p({yij}|{θi}, X) = ∏_{ij} f(yij|θi, X).

Likelihood inference for the random intercepts model

Since the random effects are not observed, we must estimate the
parameters in terms of the marginal model for the observed data.
This is a model that depends on the structural parameters, which are (β, σθ², σ²) in this case.
The marginal density is obtained by integrating out the random effects:

p(yi1, …, yini|X) = ∫ f(yi1, …, yini|X, θi) φ(θi) dθi.

Likelihood inference for the random intercepts model
For linear multilevel models, the marginal distribution p(y |X ) can
be calculated explicitly. It is Gaussian, and therefore is
characterized by its moments:
E[yij|xij] = β′xij

Var[yij|xij] = σθ² + σ²

Cov[yi1j1, yi2j2|xi1j1, xi2j2] = 0 (if i1 ≠ i2)

Cov[yij1, yij2|xij1, xij2] = σθ² (if j1 ≠ j2)


Marginal form of the generative model
Suppose

ε|X ∼ N(0, σ²),    θ|X ∼ N(0, σθ²).

In this case y|X is Gaussian, with mean and variance as given above. Thus the random intercept model can be equivalently written in marginal form as

yij = β′xij + ε*ij,

where ε*ij = θi + εij.
It follows that E[ε*ij|X] = 0, Var[ε*ij|X] = σθ² + σ², and the ε*ij values have correlation coefficient σθ²/(σ² + σθ²) within clusters.
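
A simulation sketch of the marginal form (parameter values arbitrary, not from the source): generate ε*ij = θi + εij and check the implied marginal variance and within-cluster correlation.

    import numpy as np

    rng = np.random.default_rng(1)
    m, ni = 5000, 4
    sigma2, sigma2_theta = 1.0, 0.5
    theta = rng.normal(scale=np.sqrt(sigma2_theta), size=(m, 1))
    eps = rng.normal(scale=np.sqrt(sigma2), size=(m, ni))
    estar = theta + eps                       # eps*_ij = theta_i + eps_ij

    print(estar.var())                        # ~ sigma_theta^2 + sigma^2 = 1.5
    print(np.corrcoef(estar[:, 0], estar[:, 1])[0, 1])   # ~ 0.5/1.5 = 1/3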

Likelihood computation

Maximum likelihood estimates for the model can be calculated
using a gradient-based optimization procedure applied to the
marginal log-density, or using the EM algorithm.
Asymptotic standard errors can be obtained from the inverse of the
Fisher information matrix. Likelihood ratio tests, AIC values, and
other likelihood-based inference tools can be used.
For linear multilevel models, the parameter estimates from iterated
GLS and the MLE for the random intercepts model will be similar
but not identical. Both are consistent, and in general will be
asymptotically equivalent.
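
For illustration, the random intercepts model can be fit by maximum likelihood with statsmodels' MixedLM; a sketch on simulated data (column names and parameter values are arbitrary choices):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    m, ni = 100, 5
    cluster = np.repeat(np.arange(m), ni)
    x = rng.normal(size=m * ni)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.7, size=m)[cluster] \
        + rng.normal(size=m * ni)
    df = pd.DataFrame({"y": y, "x": x, "cluster": cluster})

    model = sm.MixedLM.from_formula("y ~ x", groups="cluster", data=df)
    result = model.fit(reml=False)    # reml=False gives the MLE
    print(result.summary())           # slope, Group Var (sigma_theta^2), sigma^2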


Predicting the random intercepts
Since the model is fit by optimizing the marginal log-likelihood, we obtain an estimate of σθ² ≡ Var[θi], but we don't automatically learn anything about the individual θi.
If there is an interest in the individual θi values, we can predict them using the best linear unbiased predictor (BLUP).
The population version of the BLUP is:

E_{β,σθ²,σ²}[θi|yi, X] = Cov[θi, yi|X] · Cov[yi|X]⁻¹ · (yi − E[yi|X]).

Since this depends on things we don't know, in practice we use the sample version of the BLUP (sometimes called the eBLUP):

E_{β̂,σ̂θ²,σ̂²}[θi|yi, X] = Ĉov[θi, yi|X] · Ĉov[yi|X]⁻¹ · (yi − Ê[yi|X]).


Predicting the random intercepts

The BLUP is truly a linear function of y , is unbiased, and is “best”
in the sense of minimizing the expected squared prediction error.
However the eBLUP is none of these things (it is not linear or unbiased, and it is unclear if it is "best").
Note also that due to the hierarchical structure of the model, to predict θi we only need to condition on yi (the other yi′, for i′ ≠ i, contain no information). Thus the BLUP for θi only depends on the data through yi. But in the eBLUP, all the data are used indirectly, through the estimates of the structural parameters.


Predicting the random intercepts

The estimated second moments needed to calculate the BLUP are:

Ĉov[θi, yi|X] = σ̂θ² · 1′ni

and

Ĉov[yi|X] = σ̂²I + σ̂θ²11ᵀ,

where ni is the size of the i-th group.


Predicting the random intercepts

For a given set of parameter values, the BLUP for the random intercepts model is a linear function of the data, with the following form:

E_{β̂,σ̂θ²,σ̂²}[θi|y, X] = (ni σ̂θ² / (σ̂² + ni σ̂θ²)) · 1ᵀ(yi − Ê[yi|X])/ni.
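
A sketch of this computation (hypothetical names; yhat_marg stands for the fitted marginal mean Ê[y|X]):

    import numpy as np

    def eblup_intercepts(y, yhat_marg, clusters, sigma2_hat, sigma2_theta_hat):
        """Predicted random intercepts: each cluster's mean marginal
        residual, shrunk by n_i sigma_theta^2 / (sigma^2 + n_i sigma_theta^2)."""
        resid = y - yhat_marg
        blups = {}
        for g in np.unique(clusters):
            r = resid[clusters == g]
            ni = len(r)
            shrink = ni * sigma2_theta_hat / (sigma2_hat + ni * sigma2_theta_hat)
            blups[g] = shrink * r.mean()
        return blups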


Predicting the random intercepts
In the BLUP for θi, the term

1ᵀ(yi − Ê[yi|X])/ni

is the mean of the residuals, and it is shrunk by the factor

ni σ̂θ²/(σ̂² + ni σ̂θ²).
This shrinkage allows us to interpret the random intercepts model
as a “partially pooled” model that is intermediate between the
fixed effects model and the model that completely ignores the
clusters.


Predicting the random intercepts

In the fixed effects model, the parameter estimates θ̂i are unbiased, but they are "overdispersed", meaning that the sample variance of the θ̂i will generally be greater than σθ².
In the random intercepts model, the variance parameter σθ² is estimated with low bias, and the BLUPs of the θi are shrunk toward zero.


Random slopes
Suppose we are interested in a measured covariate z whose effect
may vary by cluster. We might start with the model
yi = xiβ + γzi + εi.

For example, suppose that zi ∈ {0, 1}^{ni} is a vector of treatment indicators (1 for treated subjects, 0 for untreated subjects). Then γ is the average change associated with being treated (the population treatment effect).
In many cases, it is reasonable to consider the possibility that
different clusters may have different treatment effects – that is,
different clusters have different γ values. In this case we can let γi
be the treatment response for cluster i.


Random slopes
Suppose we model the random slopes γi as being Gaussian (given X) with variance σγ². The marginal model is

E[yi|xi, zi] = xiβ

and

Cov[yi|xi, zi] = σγ² zi zi′ + σ²I.

Again we can form a BLUP E_{β̂,σ̂γ²,σ̂²}[γi|y, X, z], and the BLUP turns out to be a shrunken version of what would be obtained in a fixed effects model, where we regress y on x and z within every cluster.


Linear mixed effects models

The random intercept and random slope models are special cases of the linear mixed effects model:

yi = Xiβ + Ziγi + εi.

Here, Xi is an ni × p design matrix for cluster i, Zi is an ni × q random effects design matrix for cluster i, γi ∈ R^q is a random vector with mean 0 and covariance matrix Ψ, and εi ∈ R^{ni} is a random vector with mean 0 and covariance matrix σ²I.
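
The implied marginal covariance for cluster i is Cov[yi|Xi, Zi] = ZiΨZi′ + σ²I. A minimal sketch (illustrative values) showing that the random intercept model, with Zi = 1ni and Ψ = [σθ²], recovers the exchangeable covariance from earlier:

    import numpy as np

    def lmm_marginal_cov(Z, Psi, sigma2):
        """Marginal covariance of y_i in the linear mixed effects model:
        Z_i Psi Z_i' + sigma^2 I."""
        return Z @ Psi @ Z.T + sigma2 * np.eye(Z.shape[0])

    Z = np.ones((4, 1))                        # random intercept: Z_i = 1
    print(lmm_marginal_cov(Z, np.array([[0.5]]), 1.0))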

53 / 53