Specification Errors, Measurement Errors, Confounding
Kerby Shedden
Department of Statistics, University of Michigan

October 20, 2021


An unobserved covariate
Suppose we have a data generating model of the form
y = α + βx + γz + ε.
The usual conditions E[ε|x = x, z = z] = 0 and var[ε|x = x, z = z] = σ² hold.
The covariate x is observed, but z is not observable.
If we regress y on x, the model we are fitting differs from the data
generating model. What are the implications of this?
Does the fitted regression model ŷ = α̂ + β̂x estimate E[y|x = x], and does the MSE σ̂² estimate var[y|x = x]?

An unobserved independent covariate
The simplest case is where x and z are independent (and for simplicity E[z] = 0). The slope estimate β̂ has the form

β̂ = Σᵢ yᵢ(xᵢ − x̄) / Σᵢ (xᵢ − x̄)²
  = Σᵢ (α + βxᵢ + γzᵢ + εᵢ)(xᵢ − x̄) / Σᵢ (xᵢ − x̄)²
  = β + γ Σᵢ zᵢ(xᵢ − x̄) / Σᵢ (xᵢ − x̄)² + Σᵢ εᵢ(xᵢ − x̄) / Σᵢ (xᵢ − x̄)²

By the double expectation theorem,

E[ε|x = x] = E_{z|x} E[ε|x, z] = 0,

and since z and x are independent,

E[Σᵢ zᵢ(xᵢ − x̄) | x] = Σᵢ (xᵢ − x̄) E[zᵢ|x] = E[z] × Σᵢ (xᵢ − x̄) = 0.

An unobserved independent covariate

Therefore β̂ remains unbiased if there is an unmeasured covariate z that is independent of x. Specifically, E[β̂|X] = β.
What about σ̂²? What does it estimate in this case?


An unobserved independent covariate
The residuals are

(I − P)y = (I − P)(γz + ε),

so the residual sum of squares is

y′(I − P)y = γ² z′(I − P)z + ε′(I − P)ε + 2γ z′(I − P)ε.

The expected value is therefore

E[y′(I − P)y | x] = γ² var[z] rank(I − P) + σ² rank(I − P)
                  = (γ² var[z] + σ²)(n − 2).

Hence σ̂² has expected value γ² var[z] + σ².
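
As a sanity check, the two claims above can be verified numerically. Below is a minimal Python simulation sketch (not from the slides; all parameter values are illustrative assumptions): with z independent of x, the average of β̂ over replications is close to β, while the average MSE is close to γ²·var[z] + σ² rather than σ².

    import numpy as np

    rng = np.random.default_rng(0)
    n, alpha, beta, gamma, sigma = 200, 1.0, 2.0, 1.5, 1.0
    bhat, s2 = [], []
    for _ in range(2000):
        x = rng.normal(size=n)
        z = rng.normal(size=n)                     # unobserved, independent of x
        y = alpha + beta*x + gamma*z + sigma*rng.normal(size=n)
        X = np.column_stack([np.ones(n), x])       # regress y on x only
        coef, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
        bhat.append(coef[1])
        s2.append(rss[0] / (n - 2))
    print(np.mean(bhat))                           # close to beta = 2.0
    print(np.mean(s2), gamma**2 + sigma**2)        # both close to 3.25 (var[z] = 1)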

An unobserved independent covariate

Are our inferences correct?
We can take ε̃ = γz + ε to be the error term of the model. Since

E[ε̃|X = x] = 0,      cov[ε̃|X = x] = (γ² var[z] + σ²)I ∝ I,

all the results about estimation of β in a correctly-specified model hold in this setting.
In general, we may wish to view any unobserved covariate as simply being another source of error, like ε. But we will see next that this cannot be done if z and x are dependent.


Confounding
As above, continue to take the data generating model to be

y = α + βx + γz + ε,

but now suppose that x and z are correlated.
As before, z is not observed, so our analysis will be based on y and x.
A variable such as z that is associated with both the dependent and independent variables in a regression model is called a confounder.
In this setting, we often call y the outcome or response, and x the exposure or treatment.


Confounding
Suppose x and z are standardized, and cor[x, z] = r. Further suppose that E[z|x] = rx.
Due to the linearity of E[y|x, z]:

- If x increases by one unit and z remains fixed, the expected response increases by β units.
- If z increases by one unit and x remains fixed, the expected response increases by γ units.

However, if we select a pair of cases with x values differing by one unit at random (without controlling z), their z values will differ on average by r units. Therefore the expected responses E[y] for these two cases differ by β + rγ units.


Known confounders

There is a popular informal “typology” of confounders:

- Known and measured confounders (“known knowns”)
- Known and unmeasured confounders (“known unknowns”)
- Unknown and unmeasured confounders (“unknown unknowns”)


Known and measured confounders

Suppose we are mainly interested in the relationship between a particular variable x and an outcome y. A measured confounder is a variable z that can be measured and included in a regression model along with x. A measured confounder generally does not pose a problem for estimating the “effect” of x, unless it is highly collinear with x.
Example: Suppose we are studying the health effects of second-hand smoke exposure (x). We measure the health outcome (y) directly. Subjects who smoke (z) are at risk for many of the same bad outcomes that may be associated with second-hand smoke exposure. Thus, it would be very important to determine which subjects smoke, and include that information as a covariate (a measured confounder) in a regression model used to assess the effects of second-hand smoke exposure.


Known and measured confounders

Caution: Just because a confounder is known and measured does not mean that simply including it as a main effect in the regression is sufficient to account for its role. That is, the working model E[y|x, z] = α + βx + γz is a simple additive model that only serves as a starting point for “controlling” for the role of z. Perhaps the actual mean structure is E[y|x, z] = x + z² or E[y|x, z] = xz + z²/(1 + x²).


Known but unmeasured confounders
A known but unmeasured confounder is a variable that we know about,
and for which we may have some knowledge of its distribution, but it is
not measured in our particular data set.
For example, we may know that certain occupations (like working in
certain types of factories) may produce risks similar to the risks of
exposure to second-hand smoke. If occupation data is not collected in a
particular study, this is an unmeasured confounder.
Since we do not have data for unmeasured confounders, their omission
may produce bias in the estimated effects for variables of interest. If we
have some understanding of how a certain unmeasured confounder
operates, we may be able to use a sensitivity analysis to get a rough idea
of how much bias is present.


Unknown confounders

An unknown confounder is a variable that affects the outcome of interest, but is unknown to us. An unknown confounder is necessarily unmeasured.
For example, there may be unknown genetic or environmental factors that are associated with both second-hand smoke exposure (x) and the outcome (y).


Randomization and confounding

Unknown confounders and unmeasured confounders place major limits on
our ability to interpret regression models causally or mechanistically.
Randomization: One way to substantially reduce the risk of confounding
is to randomly assign the values of x. In this case, there can be no
systematic association between x and z (for any z), and in large enough
samples the actual (sample-level) association between x and z will be
very low, so very little confounding is possible.
Randomization is a form of intervention or manipulation, and can only be
done in specific situations where it is possible to assign the values of x,
rather than observe them.


Randomization and confounding
In small samples, randomization can only guarantee approximate orthogonality against unmeasured confounders.
The average result of many randomized studies is unbiased, but individual randomized studies may be biased if by chance there are imbalances, or chance associations between x and a confounder z.
For example, if we have studies with n = 8 people, always consisting of four females and four males, and we randomly select four people to have x = 1 and four people to have x = 0, then a particular study could easily be severely unbalanced, e.g. all of the subjects with x = 1 might be female; a small calculation of how often this happens is sketched below.
If we have a known confounder such as sex, then we can do stratified randomization, i.e. randomly assign two females to treatment and two females to control, and similarly for males.
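
The n = 8 example can be made concrete. The following Python sketch (illustrative; sex coded 0/1) computes the exact probability that complete randomization assigns all four treated subjects to one sex, and checks it by Monte Carlo; stratified randomization makes this event impossible by design.

    import numpy as np
    from math import comb

    # exact probability: treated group is all-female or all-male
    print(2 / comb(8, 4))                   # 2/70, about 0.029

    rng = np.random.default_rng(0)
    sex = np.array([0]*4 + [1]*4)           # four females (0) and four males (1)
    hits = 0
    for _ in range(100_000):
        treated = rng.choice(8, size=4, replace=False)
        hits += sex[treated].sum() in (0, 4)
    print(hits / 100_000)                   # about 0.029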


Confounding in linear models
For simplicity, suppose that z has mean 0 and variance 1, and we use least squares to fit the working model

ŷ = α̂ + β̂x.

We can work out the limiting value of the slope estimate as follows:

β̂ = Σᵢ yᵢ(xᵢ − x̄) / Σᵢ (xᵢ − x̄)²
  = [Σᵢ (α + βxᵢ + γzᵢ + εᵢ)(xᵢ − x̄)/n] / [Σᵢ (xᵢ − x̄)²/n]
  → β + γr.

Note that if either γ = 0 (z is independent of y given x) or r = 0 (z is uncorrelated with x), then β is estimated correctly.
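
A quick numerical check of this limit (a simulation sketch with assumed coefficients; not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    n, beta, gamma, r, sigma = 500_000, 2.0, 1.5, 0.6, 1.0
    x = rng.normal(size=n)
    z = r*x + np.sqrt(1 - r**2)*rng.normal(size=n)   # standardized, cor(x, z) = r
    y = 1.0 + beta*x + gamma*z + sigma*rng.normal(size=n)
    print(np.polyfit(x, y, 1)[0])                    # close to beta + gamma*r
    print(beta + gamma*r)                            # 2.9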

Marginalization
For any population model defined by the first and second moments E[y|x, z] and var[y|x, z], we can marginalize the model as follows:

E[y|x] = E_{z|x} E[y|x, z]

var[y|x] = E_{z|x} var[y|x, z] + var_{z|x} E[y|x, z].

This marginalization can be done to any model, but what do we get if we marginalize the additive model for which E[y|x, z] = α + βx + γz and var[y|x, z] = σ²?
Further, if we use least squares to model data {(yᵢ, xᵢ)}, are the results consistent for the marginalizations of the population model?


Marginalization
For the basic additive model,

E[y|x] = E[E[y|x, z]|x]
       = E[α + βx + γz|x]
       = α + βx + γE[z|x].

var[y|x] = E_{z|x} var[y|x, z] + var_{z|x} E[y|x, z]
         = σ² + var_{z|x}[α + βx + γz]
         = σ² + γ² var[z|x].

Note that the marginalized linear model may be nonlinear in x. Also, while y is homoscedastic given x and z, it may be heteroscedastic when we only condition on x.

Confounding and mean structures
Suppose we regress y on x, ignoring z. Since

β̂ → β + γr,

and it is easy to show that α̂ → α, the fitted model is approximately

ŷ ≈ α + βx + γrx = α + (β + γr)x.

How does the fitted model relate to the marginal model E[y|x]? Since

E[y|x] = α + βx + γE[z|x],

the fitted regression model agrees with E[y|x] as long as E[z|x] = rx.

Confounding and variance structures

Turning now to the variance structure of the fitted model, the limiting value of σ̂² is

σ̂² = Σᵢ (yᵢ − α̂ − β̂xᵢ)² / (n − 2)
   ≈ Σᵢ (γzᵢ + εᵢ − γrxᵢ)² / n
   → σ² + γ²(1 − r²).

Ideally this should estimate the marginal variance var[y|x].


Confounding and variance structures

By the law of total variation,

var[y|x] = σ² + γ² var[z|x].

Thus for σ̂² (obtained from regressing y on x while ignoring z) to estimate var[y|x], we need

var[z|x] = 1 − r².


The Gaussian case
Suppose

y = (a′, b′)′

is a Gaussian random vector, where y ∈ R^n, a ∈ R^q, and b ∈ R^(n−q).
Let μ = E[y] and Σ = cov[y]. We can partition μ and Σ as

μ = (μ₁′, μ₂′)′,      Σ = [ Σ₁₁  Σ₁₂ ; Σ₂₁  Σ₂₂ ],

where μ₁ ∈ R^q, μ₂ ∈ R^(n−q), Σ₁₁ ∈ R^(q×q), Σ₁₂ ∈ R^(q×(n−q)), Σ₂₂ ∈ R^((n−q)×(n−q)), and Σ₂₁ = Σ₁₂′.

The Gaussian case

It is a fact that a|b is Gaussian with mean

E[a|b] = μ₁ + Σ₁₂ Σ₂₂⁻¹ (b − μ₂)

and covariance matrix

cov[a|b] = Σ₁₁ − Σ₁₂ Σ₂₂⁻¹ Σ₂₁.
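
In code, these formulas are two lines of linear algebra. The Python sketch below uses arbitrary illustrative values for μ, Σ, the partition point q, and the observed value of b:

    import numpy as np

    mu = np.array([0.0, 0.0, 1.0])
    Sigma = np.array([[2.0, 0.6, 0.3],
                      [0.6, 1.0, 0.4],
                      [0.3, 0.4, 1.5]])
    q = 1                                    # a = first q coordinates, b = the rest
    S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
    S21, S22 = Sigma[q:, :q], Sigma[q:, q:]
    b = np.array([0.5, 1.2])                 # hypothetical observed value of b
    cond_mean = mu[:q] + S12 @ np.linalg.solve(S22, b - mu[q:])
    cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)
    print(cond_mean, cond_cov)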


The Gaussian case
Now we apply these results to our model, taking x and z to be jointly Gaussian.
The mean vector and covariance matrix are

E[(z, x)′] = 0,      cov[(z, x)′] = [ 1  r ; r  1 ],

so we get

E[z|x] = rx,      cov[z|x] = 1 − r².

These are exactly the conditions stated earlier that guarantee that the fitted mean model converges to the marginal regression function E[y|x], and the fitted variance model converges to the marginal variance var[y|x].


Consequences of confounding

How does the presence of unmeasured confounders affect our ability to
interpret regression models?


Population average covariate effect
Suppose we specify a value x∗ in the covariate space and randomly select two subjects i and j having x values xᵢ = x∗ + 1 and xⱼ = x∗. The inter-individual difference is

yᵢ − yⱼ = β + γ(zᵢ − zⱼ) + εᵢ − εⱼ,

which has mean value (marginal effect)

E[yᵢ − yⱼ | xᵢ = x∗ + 1, xⱼ = x∗] = β + γ(E[z|x = x∗ + 1] − E[z|x = x∗]),

which agrees with what would be obtained by least squares analysis as long as E[z|x] = rx.


Population average covariate effect

The variance of yᵢ − yⱼ is

2σ² + 2γ² var[z|x],

which also agrees with the results of least squares analysis as long as var[z|x] = 1 − r².


Individual treatment effect

Now suppose we match two subjects i and j having x values differing by one unit, and who also have the same value of z.
This is what one would expect to see as the pre-treatment and post-treatment measurements following a treatment that changes an individual's x value by one unit, if the treatment does not affect z (the within-subject treatment effect).


Individual treatment effect
The mean difference (individual treatment effect) is

E[yᵢ − yⱼ | xᵢ = x∗ + 1, xⱼ = x∗, zᵢ = zⱼ] = β

and the variance is

var[yᵢ − yⱼ | xᵢ = x∗ + 1, xⱼ = x∗, zᵢ = zⱼ] = 2σ².

These do not in general agree with the estimates obtained by using least squares to analyze the observable data for x and y. Depending on the sign of γr, we may either overstate or understate the individual treatment effect β, and the population variance of the treatment effect will always be overstated.


Types of covariates

Expressed as a causal diagram, a confounder z relates to an exposure x and an outcome y as follows (z influences both x and y, and x influences y):

    x ← z → y,  x → y

In addition to confounders, there are many other ways that a variable z can impact our ability to understand the relationship between variables of primary interest x and y.


Types of covariates
If we reverse the directionality between z and x, y, then z is no longer a confounder and instead becomes a collider:

    x → z ← y

If z is a confounder, then you must somehow control for z to obtain an undistorted understanding of the relationship between x and y. If z is a collider, the opposite is true: controlling for z induces distortion in the relationship between x and y, as the simulation below illustrates.
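
A short simulation (assumed coefficients; not from the slides) makes this concrete: x and y are generated independently, yet adjusting for the collider z manufactures a clearly nonzero x coefficient.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    x = rng.normal(size=n)
    y = rng.normal(size=n)                   # independent of x
    z = x + y + rng.normal(size=n)           # collider: caused by both x and y
    print(np.polyfit(x, y, 1)[0])            # close to 0, correctly
    X = np.column_stack([np.ones(n), x, z])
    print(np.linalg.lstsq(X, y, rcond=None)[0][1])   # close to -0.5, spurious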


Types of covariates

A precision variable is a variable that explains some of the variation in y that is unrelated to x:

    x → y ← z

Including or excluding a precision variable in an analysis does not impact bias, but it can impact precision. In most cases, including a precision variable increases the precision with which the relationship between x and y is estimated.


Types of covariates

A mediator is a variable that lies on the causal pathway between an exposure x and an outcome y. In the diagram below, z is a mediator:

    x → z → y

Controlling for a mediator will usually reduce or eliminate the apparent relationship between x and y, and doing so gives insight into the underlying mechanism behind the relationship between x and y.
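
A simulation sketch of full mediation (assumed coefficients; not from the slides): the unadjusted regression recovers the total effect of x, while adjusting for the mediator z drives the x coefficient to zero.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    x = rng.normal(size=n)
    z = 1.5*x + rng.normal(size=n)          # mediator: x -> z
    y = 2.0*z + rng.normal(size=n)          # x acts on y only through z
    print(np.polyfit(x, y, 1)[0])           # close to 3.0, the total effect
    X = np.column_stack([np.ones(n), x, z])
    print(np.linalg.lstsq(X, y, rcond=None)[0][1])   # close to 0 given z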


Types of covariates

More generally, an exposure can have direct effects on an outcome, and indirect or mediated effects carried through a mediator z:

    x → z → y,  x → y


Types of covariates

A moderator or effect modifier is a variable that explains heterogeneity in the relationship between x and y.
An interaction can be seen as an effect modifier. If E[y|x, z] = x + zx = (1 + z)x, we can interpret the slope of y on x as differing based on the value of z.
Finally, note that in many cases we cannot be sure whether a variable is a confounder, collider, mediator, or moderator, and many variables can occupy several of these roles at the same time.


Measurement error for linear models
Suppose the data generating model is

y = zβ + ε,

with the usual linear model assumptions, but we do not observe z. Rather, we observe

x = z + τ,

where τ is a random vector of covariate measurement errors with E[τ] = 0. Assuming x₁ = 1 is the intercept, it is natural to set the first column of τ equal to zero.
This is called an errors in variables model, or a measurement error model.


Measurement error for linear models

When covariates are measured with error, least squares point estimates
may be biased and inferences may be incorrect.
Intuitively it seems that slope estimates should be “attenuated” (biased
toward zero). The reasoning is that as the measurement error grows very
large, the observed covariate x becomes equivalent to noise, so the slope
estimate should go to zero.


Measurement error for linear models
Let X and Z now represent the n × (p + 1) observed and ideal design matrices, and let T denote the n × (p + 1) matrix of measurement errors. The least squares estimate of the model coefficients is

β̂ = (X′X)⁻¹X′y
  = (Z′Z + Z′T + T′Z + T′T)⁻¹(Z′y + T′y)
  = (Z′Z/n + Z′T/n + T′Z/n + T′T/n)⁻¹(Z′y/n + T′Zβ/n + T′ε/n).

We will make the simplifying assumption that the covariate measurement error is uncorrelated with the covariate levels, so

Z′T/n → 0,

and that the covariate measurement error τ and observation error ε are uncorrelated, so

T′ε/n → 0.

Measurement error for linear models
Under these circumstances,

β̂ ≈ (Z′Z/n + T′T/n)⁻¹ Z′y/n.

Let Mz be the limiting value of Z′Z/n, and let Mτ = E[ττ′] be the limiting value of T′T/n. Thus the limit of β̂ is

(Mz + Mτ)⁻¹ Z′y/n = (I + Mz⁻¹Mτ)⁻¹ Mz⁻¹ Z′y/n
                  → (I + Mz⁻¹Mτ)⁻¹ β
                  ≡ β₀,

and hence the limiting bias is

β₀ − β = ((I + Mz⁻¹Mτ)⁻¹ − I)β.
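
With a single centered covariate, the limit reduces to the classical attenuation factor: β₀ = β·var(z)/(var(z) + var(τ)). A simulation sketch of this special case (assumed variances; not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    n, beta, var_z, var_tau = 500_000, 2.0, 1.0, 0.5
    z = rng.normal(scale=np.sqrt(var_z), size=n)         # true covariate
    x = z + rng.normal(scale=np.sqrt(var_tau), size=n)   # observed with error
    y = beta*z + rng.normal(size=n)
    print(np.polyfit(x, y, 1)[0])                # close to 4/3, attenuated
    print(beta * var_z / (var_z + var_tau))      # the predicted limit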

Measurement error for linear models
What can we say about the bias?
Note that the matrix Mz⁻¹Mτ has non-negative eigenvalues, since it shares its eigenvalues with the positive semi-definite matrix Mz^(−1/2) Mτ Mz^(−T/2).
It follows that all eigenvalues of I + Mz⁻¹Mτ are greater than or equal to 1, so all eigenvalues of (I + Mz⁻¹Mτ)⁻¹ are less than or equal to 1.
This means that (I + Mz⁻¹Mτ)⁻¹ is a contraction, so ‖β₀‖ ≤ ‖β‖. Therefore the sum of squares of the fitted slopes is smaller on average than the sum of squares of the actual slopes, due to measurement error.


Types of measurement error
The “classical” measurement error model
x = z + τ,
where z is the true value and x is the observed value, is the one most
commonly considered.
Alternatively, in the case of an experiment it may make more sense to use
the Berkson error model:
z = x + τ.
For example, suppose we aim to study a chemical reaction when a given
concentration x of substrate is present. However, due to our inability to
completely control the process, the actual concentration of substrate z
differs randomly from x, by an unknown amount τ .
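
A side-by-side simulation (assumed parameters; not from the slides) shows why the distinction matters for estimation: classical error attenuates the least squares slope, while Berkson error does not bias it, since y = βz + ε = βx + (βτ + ε) with τ independent of x.

    import numpy as np

    rng = np.random.default_rng(0)
    n, beta, sd_tau = 500_000, 2.0, 0.7

    # Berkson: x is assigned; the realized z deviates from it by tau
    x_b = rng.normal(size=n)
    y_b = beta*(x_b + sd_tau*rng.normal(size=n)) + rng.normal(size=n)
    print(np.polyfit(x_b, y_b, 1)[0])   # close to 2.0, no attenuation

    # Classical: z is the truth; we observe a noisy x
    z_c = rng.normal(size=n)
    x_c = z_c + sd_tau*rng.normal(size=n)
    y_c = beta*z_c + rng.normal(size=n)
    print(np.polyfit(x_c, y_c, 1)[0])   # close to 2/1.49, attenuated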


Types of measurement error

You cannot simply rearrange z = x + τ to x = z − τ and claim that the two situations are equivalent.
In the classical model, τ is independent of z but dependent on x. In the Berkson model, τ is independent of x but dependent on z.
