Specification Errors, Measurement Errors, Confounding
Kerby Shedden
Department of Statistics, University of Michigan
October 20, 2021

An unobserved covariate

Suppose we have a data generating model of the form

y = α + βx + γz + ε.

The usual conditions E[ε | x = x, z = z] = 0 and var[ε | x = x, z = z] = σ² hold. The covariate x is observed, but z is not observable.

If we regress y on x, the model we are fitting differs from the data generating model. What are the implications of this? Does the fitted regression model ŷ = α̂ + β̂x estimate E[y | x = x], and does the MSE σ̂² estimate var[y | x = x]?

An unobserved independent covariate

The simplest case is where x and z are independent (and for simplicity E[z] = 0). The slope estimate β̂ has the form

β̂ = Σᵢ yᵢ(xᵢ − x̄) / Σᵢ (xᵢ − x̄)²
  = Σᵢ (α + βxᵢ + γzᵢ + εᵢ)(xᵢ − x̄) / Σᵢ (xᵢ − x̄)²
  = β + γ·Σᵢ zᵢ(xᵢ − x̄) / Σᵢ (xᵢ − x̄)² + Σᵢ εᵢ(xᵢ − x̄) / Σᵢ (xᵢ − x̄)².

By the double expectation theorem, E[ε | x] = E_{z|x} E[ε | x, z] = 0, and since z and x are independent,

E[Σᵢ zᵢ(xᵢ − x̄) | x] = Σᵢ (xᵢ − x̄)·E[zᵢ | x] = E[z] · Σᵢ (xᵢ − x̄) = 0.

Therefore β̂ remains unbiased if there is an unmeasured covariate z that is independent of x. Specifically, E[β̂ | X] = β.

What about σ̂²? What does it estimate in this case? The residuals are

(I − P)y = (I − P)(γz + ε),

so the residual sum of squares is

y′(I − P)y = γ²·z′(I − P)z + ε′(I − P)ε + 2γ·z′(I − P)ε.

The expected value is therefore

E[y′(I − P)y | x] = γ²·var[z]·rank(I − P) + σ²·rank(I − P) = (γ²·var[z] + σ²)(n − 2).

Hence σ̂² has expected value γ²·var[z] + σ².

Are our inferences correct? We can take ε̃ = γz + ε to be the error term of the model. Since

E[ε̃ | X = x] = 0,   cov[ε̃ | X = x] = (γ²·var[z] + σ²)·I ∝ I,

all the results about estimation of β in a correctly-specified model hold in this setting. In general, we may wish to view any unobserved covariate as simply being another source of error, like ε. But we will see next that this cannot be done if z is dependent with x.

Confounding

As above, continue to take the data generating model to be y = α + βx + γz + ε, but now suppose that x and z are correlated. As before, z is not observed, so our analysis will be based on y and x alone.

A variable such as z that is associated with both the dependent and independent variables in a regression model is called a confounder. In this setting, we often call y the outcome or response, and x the exposure or treatment.

Suppose x and z are standardized, cor[x, z] = r, and further suppose that E[z | x] = rx. Due to the linearity of E[y | x, z]:

- If x increases by one unit and z remains fixed, the expected response increases by β units.
- If z increases by one unit and x remains fixed, the expected response increases by γ units.

However, if we select a pair of cases with x values differing by one unit at random (without controlling z), their z values will differ on average by r units. Therefore the expected responses E[y] for these two cases differ by β + rγ units, as the simulation below illustrates.
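The following minimal simulation is not part of the original notes; it is a sketch checking the two claims derived above, using arbitrary values α = 1, β = 2, γ = 1.5, σ = 1. With z independent of x, the fitted slope stays near β while σ̂² inflates to γ²·var[z] + σ²; with cor[x, z] = r, the slope drifts to β + γr.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta, gamma, sigma = 100_000, 1.0, 2.0, 1.5, 1.0

def fit_y_on_x(x, y):
    """Least squares of y on an intercept and x; returns (slope, MSE)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef[1], resid @ resid / (len(y) - 2)

# Case 1: z independent of x.
x = rng.normal(size=n)
z = rng.normal(size=n)
y = alpha + beta * x + gamma * z + sigma * rng.normal(size=n)
print(fit_y_on_x(x, y))  # slope near beta = 2, MSE near gamma^2 + sigma^2 = 3.25

# Case 2: cor(x, z) = r, with E[z|x] = r*x and var[z|x] = 1 - r^2.
r = 0.6
z = r * x + np.sqrt(1 - r**2) * rng.normal(size=n)
y = alpha + beta * x + gamma * z + sigma * rng.normal(size=n)
print(fit_y_on_x(x, y))  # slope near beta + gamma*r = 2.9
```

In the second case the reported MSE is also informative: it settles near σ² + γ²(1 − r²) = 2.44, a fact derived later in these notes.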
Known confounders

There is a popular informal "typology" of confounders:

- Known and measured confounders ("known knowns")
- Known and unmeasured confounders ("known unknowns")
- Unknown and unmeasured confounders ("unknown unknowns")

Known and measured confounders

Suppose we are mainly interested in the relationship between a particular variable x and an outcome y. A measured confounder is a variable z that can be measured and included in a regression model along with x. A measured confounder generally does not pose a problem for estimating the "effect" of x, unless it is highly collinear with x.

Example: Suppose we are studying the health effects of second-hand smoke exposure (x). We measure the health outcome (y) directly. Subjects who smoke (z) are at risk for many of the same bad outcomes that may be associated with second-hand smoke exposure. Thus, it would be very important to determine which subjects smoke, and include that information as a covariate (a measured confounder) in a regression model used to assess the effects of second-hand smoke exposure.

Caution: Just because a confounder is known and measured does not mean that simply including it as a main effect in the regression is sufficient to account for its role. That is, the working model

E[y | x, z] = α + βx + γz

is a simple additive model that only serves as a starting point for "controlling" for the role of z. Perhaps the actual mean structure is E[y | x, z] = x + z², or E[y | x, z] = xz + z²/(1 + x²).

Known but unmeasured confounders

A known but unmeasured confounder is a variable that we know about, and for which we may have some knowledge of its distribution, but which is not measured in our particular data set. For example, we may know that certain occupations (like working in certain types of factories) may produce risks similar to the risks of exposure to second-hand smoke. If occupation data are not collected in a particular study, this is an unmeasured confounder.

Since we do not have data for unmeasured confounders, their omission may produce bias in the estimated effects of the variables of interest. If we have some understanding of how a certain unmeasured confounder operates, we may be able to use a sensitivity analysis to get a rough idea of how much bias is present.

Unknown confounders

An unknown confounder is a variable that affects the outcome of interest, but is unknown to us. An unknown confounder is necessarily unmeasured. For example, there may be unknown genetic or environmental factors that are associated with both second-hand smoke exposure (x) and the outcome (y).

Randomization and confounding

Unknown confounders and unmeasured confounders place major limits on our ability to interpret regression models causally or mechanistically.

Randomization: One way to substantially reduce the risk of confounding is to randomly assign the values of x. In this case, there can be no systematic association between x and z (for any z), and in large enough samples the actual (sample-level) association between x and z will be very low, so very little confounding is possible.

Randomization is a form of intervention or manipulation, and can only be done in specific situations where it is possible to assign the values of x, rather than observe them. The sketch below shows randomization eliminating the bias from an unmeasured confounder.
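As an illustration (my own, not from the notes), the following simulation compares an observational design, in which x depends on an unmeasured z, with a randomized design, in which x is assigned by a coin flip; the effect sizes β = 2 and γ = 1.5 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta, gamma, r = 50_000, 2.0, 1.5, 0.6

z = rng.normal(size=n)                                   # unmeasured confounder
x_obs = r * z + np.sqrt(1 - r**2) * rng.normal(size=n)   # observational: x depends on z
x_rnd = rng.integers(0, 2, size=n).astype(float)         # randomized: x independent of z

for x in (x_obs, x_rnd):
    y = beta * x + gamma * z + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    slope = np.linalg.lstsq(X, y, rcond=None)[0][1]
    print(round(slope, 2))  # ~2.9 for the observational x, ~2.0 for the randomized x
```

The confounder z still affects y in both designs; randomization works not by removing z but by breaking its association with x.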
In small samples, randomization can only guarantee approximate orthogonality against unmeasured confounders. The average result of many randomized studies is unbiased, but an individual randomized study may be biased if by chance there are imbalances, or chance associations, between x and a confounder z.

For example, if we have studies with n = 8 people, always consisting of four females and four males, and we randomly select four people to have x = 1 and four people to have x = 0, then a particular study could easily be severely unbalanced, e.g. all of the subjects with x = 1 might be female.

If we have a known confounder such as sex, then we can do stratified randomization, i.e. randomly assign two females to treatment and two females to control, and similarly for males.

Confounding in linear models

For simplicity, suppose that z has mean 0 and variance 1, and we use least squares to fit the working model ŷ = α̂ + β̂x. We can work out the limiting value of the slope estimate as follows:

β̂ = Σᵢ yᵢ(xᵢ − x̄) / Σᵢ (xᵢ − x̄)²
  = [Σᵢ (α + βxᵢ + γzᵢ + εᵢ)(xᵢ − x̄)/n] / [Σᵢ (xᵢ − x̄)²/n]
  → β + γr,

since Σᵢ zᵢ(xᵢ − x̄)/n → cov[z, x] = r and Σᵢ (xᵢ − x̄)²/n → var[x] = 1. Note that if either γ = 0 (z is independent of y given x) or r = 0 (z is uncorrelated with x), then β is estimated correctly.

Marginalization

For any population model defined by the first and second moments E[y | x, z] and var[y | x, z], we can marginalize the model as follows:

E[y | x] = E_{z|x} E[y | x, z]
var[y | x] = E_{z|x} var[y | x, z] + var_{z|x} E[y | x, z].

This marginalization can be applied to any model, but what do we get if we marginalize the additive model for which E[y | x, z] = α + βx + γz and var[y | x, z] = σ²? Further, if we use least squares to model data {(yᵢ, xᵢ)}, are the results consistent for the marginalizations of the population model?

For the basic additive model,

E[y | x] = E[E[y | x, z] | x] = E[α + βx + γz | x] = α + βx + γ·E[z | x]

var[y | x] = E_{z|x} var[y | x, z] + var_{z|x} E[y | x, z] = σ² + var_{z|x}[α + βx + γz] = σ² + γ²·var[z | x].

Note that the marginalized linear model may be nonlinear in x. Also, while y is homoscedastic given x and z, it may be heteroscedastic when we condition only on x.

Confounding and mean structures

Suppose we regress y on x, ignoring z. Since β̂ → β + γr, and it is easy to show that α̂ → α, the fitted model is approximately

ŷ ≈ α + βx + γrx = α + (β + γr)x.

How does the fitted model relate to the marginal model E[y | x]? Since E[y | x] = α + βx + γ·E[z | x], the fitted regression model agrees with E[y | x] as long as E[z | x] = rx.

Confounding and variance structures

Turning now to the variance structure of the fitted model, the limiting value of σ̂² is

σ̂² = Σᵢ (yᵢ − α̂ − β̂xᵢ)²/(n − 2) ≈ Σᵢ (γzᵢ + εᵢ − γrxᵢ)²/n → σ² + γ²(1 − r²).

Ideally this should estimate the marginal variance var[y | x]. By the law of total variation, var[y | x] = σ² + γ²·var[z | x]. Thus for σ̂² (obtained by regressing y on x while ignoring z) to estimate var[y | x], we need var[z | x] = 1 − r².

The Gaussian case

Suppose y = (a′, b′)′ is a Gaussian random vector, where y ∈ Rⁿ, a ∈ R^q, and b ∈ R^{n−q}. Let μ = E[y] and Σ = cov[y]. We can partition μ and Σ as

μ = (μ₁′, μ₂′)′,   Σ = [ Σ₁₁  Σ₁₂ ; Σ₂₁  Σ₂₂ ],

where μ₁ ∈ R^q, μ₂ ∈ R^{n−q}, Σ₁₁ ∈ R^{q×q}, Σ₁₂ ∈ R^{q×(n−q)}, Σ₂₂ ∈ R^{(n−q)×(n−q)}, and Σ₂₁ = Σ₁₂′.

It is a fact that a | b is Gaussian with mean

E[a | b] = μ₁ + Σ₁₂Σ₂₂⁻¹(b − μ₂)

and covariance matrix

cov[a | b] = Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁.

A quick numerical check of these formulas follows.
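As a sanity check (not part of the notes; the values of μ and Σ below are arbitrary), we can compare the conditional mean and covariance formulas against Monte Carlo estimates formed from draws whose b coordinate lands near a fixed value b₀.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, 2.0, 0.0])          # a = first two coordinates, b = the last
Sigma = np.array([[2.0, 0.5, 0.8],
                  [0.5, 1.0, 0.3],
                  [0.8, 0.3, 1.5]])
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

b0 = 1.0
cond_mean = mu[:2] + (S12 @ np.linalg.solve(S22, [b0 - mu[2]])).ravel()
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)

# Monte Carlo: keep draws whose b falls in a narrow window around b0.
draws = rng.multivariate_normal(mu, Sigma, size=1_000_000)
keep = np.abs(draws[:, 2] - b0) < 0.02
print(cond_mean, draws[keep, :2].mean(axis=0))        # formula vs. empirical mean
print(cond_cov, np.cov(draws[keep, :2].T), sep="\n")  # formula vs. empirical cov
```

The agreement is only approximate, since conditioning on a narrow interval rather than a point adds a small amount of smoothing.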
Now we apply these results to our model, taking x and z to be jointly Gaussian. The mean vector and covariance matrix are

E[(z, x)′] = 0,   cov[(z, x)′] = [ 1  r ; r  1 ],

so we get

E[z | x] = rx,   var[z | x] = 1 − r².

These are exactly the conditions stated earlier that guarantee the fitted mean model converges to the marginal regression function E[y | x], and the fitted variance model converges to the marginal variance var[y | x].

Consequences of confounding

How does the presence of unmeasured confounders affect our ability to interpret regression models?

Population average covariate effect

Suppose we specify a value x∗ in the covariate space and randomly select two subjects i and j having x values xᵢ = x∗ + 1 and xⱼ = x∗. The inter-individual difference is

yᵢ − yⱼ = β + γ(zᵢ − zⱼ) + εᵢ − εⱼ,

which has mean value (marginal effect)

E[yᵢ − yⱼ | xᵢ = x∗ + 1, xⱼ = x∗] = β + γ(E[z | x = x∗ + 1] − E[z | x = x∗]).

This agrees with what would be obtained by least squares analysis as long as E[z | x] = rx. The variance of yᵢ − yⱼ is 2σ² + 2γ²·var[z | x], which also agrees with the results of least squares analysis as long as var[z | x] = 1 − r².

Individual treatment effect

Now suppose we match two subjects i and j having x values differing by one unit, and who also have the same value of z. This is what one would expect to see as the pre-treatment and post-treatment measurements following a treatment that changes an individual's x value by one unit, if the treatment does not affect z (the within-subject treatment effect).

The mean difference (individual treatment effect) is

E[yᵢ − yⱼ | xᵢ = x∗ + 1, xⱼ = x∗, zᵢ = zⱼ] = β

and the variance is

var[yᵢ − yⱼ | xᵢ = x∗ + 1, xⱼ = x∗, zᵢ = zⱼ] = 2σ².

These do not in general agree with the estimates obtained by using least squares to analyze the observable data for x and y. Depending on the sign of γr, we may either overstate or understate the individual treatment effect β, and the population variance of the treatment effect will always be overstated.

Types of covariates

Expressed as a causal diagram, a confounder z relates to an exposure x and an outcome y as follows:

x ← z → y

In addition to confounders, there are many other ways that a variable z can impact our ability to understand the relationship between variables of primary interest x and y.

If we reverse the directionality between z and x, y, then z is no longer a confounder and instead becomes a collider:

x → z ← y

If z is a confounder, then you must somehow control for z to obtain an undistorted understanding of the relationship between x and y. If z is a collider, the opposite is true: controlling for z induces distortion in the relationship between x and y.

A precision variable is a variable that explains some of the variation in y that is unrelated to x:

x → y ← z

Including or excluding a precision variable in an analysis does not impact bias, but can impact precision. In most cases, including a precision variable increases the precision with which the relationship between x and y is estimated.

A mediator is a variable that lies on the causal pathway between an exposure x and an outcome y. In the diagram below, z is a mediator:

x → z → y

Controlling for a mediator will usually reduce or eliminate the apparent relationship between x and y, and doing so gives insight into the underlying mechanism behind the relationship between x and y. The sketch below contrasts adjustment for a mediator with adjustment for a collider.
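The following simulation (my own illustration; all coefficients are arbitrary) makes the mediator and collider cases concrete: adjusting for a mediator removes a real x → y effect, while adjusting for a collider manufactures an association between causally unrelated variables.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

def coef_on_x(y, *covariates):
    """Coefficient on the first covariate, regressing y on an intercept + covariates."""
    X = np.column_stack([np.ones(len(y)), *covariates])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Mediator: x -> z -> y, with no direct x -> y edge.
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)
y = z + rng.normal(size=n)
print(coef_on_x(y, x))     # ~0.8: the total (mediated) effect of x
print(coef_on_x(y, x, z))  # ~0.0: the effect disappears once z is controlled

# Collider: x -> z <- y, with x and y causally unrelated.
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(size=n)
print(coef_on_x(y, x))     # ~0.0: no marginal association
print(coef_on_x(y, x, z))  # ~-0.5: adjusting for z induces a spurious association
```

In the collider case the coefficient on x converges to −0.5 for these particular variances, even though x and y are independent.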
More generally, an exposure can have direct effects on an outcome, as well as indirect or mediated effects carried through a mediator z:

x → z → y, together with the direct edge x → y

A moderator or effect modifier is a variable that explains heterogeneity in the relationship between x and y. An interaction can be seen as an effect modifier: if E[y | x, z] = x + zx = (1 + z)x, we can interpret the slope of y on x as differing based on the value of z.

Finally, note that in many cases we cannot be sure whether a variable is a confounder, collider, mediator, or moderator, and many variables can occupy several of these roles at the same time.

Measurement error for linear models

Suppose the data generating model is y = zβ + ε, with the usual linear model assumptions, but we do not observe z. Rather, we observe

x = z + τ,

where τ is a random vector of covariate measurement errors with E[τ] = 0. Assuming x₁ = 1 is the intercept, it is natural to set the first column of τ equal to zero. This is called an errors in variables model, or a measurement error model.

When covariates are measured with error, least squares point estimates may be biased and inferences may be incorrect. Intuitively it seems that slope estimates should be "attenuated" (biased toward zero). The reasoning is that as the measurement error grows very large, the observed covariate x becomes equivalent to noise, so the slope estimate should go to zero.

Let X and Z now represent the n × (p + 1) observed and ideal design matrices, and let T denote the n × (p + 1) matrix of measurement errors. The least squares estimate of the model coefficients is

β̂ = (X′X)⁻¹X′y
  = (Z′Z + Z′T + T′Z + T′T)⁻¹(Z′y + T′y)
  = (Z′Z/n + Z′T/n + T′Z/n + T′T/n)⁻¹(Z′y/n + T′Zβ/n + T′ε/n).

We will make the simplifying assumption that the covariate measurement error is uncorrelated with the covariate levels, so Z′T/n → 0, and that the covariate measurement error τ and the observation error ε are uncorrelated, so T′ε/n → 0.

Under these circumstances,

β̂ ≈ (Z′Z/n + T′T/n)⁻¹ Z′y/n.

Let M_z be the limiting value of Z′Z/n, and let M_τ = E[ττ′] be the limiting value of T′T/n. Then the limit of β̂ is

(M_z + M_τ)⁻¹ Z′y/n = (I + M_z⁻¹M_τ)⁻¹ M_z⁻¹ Z′y/n → (I + M_z⁻¹M_τ)⁻¹ β ≡ β₀,

and hence the limiting bias is

β₀ − β = ((I + M_z⁻¹M_τ)⁻¹ − I)β.

What can we say about the bias? Note that the matrix M_z⁻¹M_τ has non-negative eigenvalues, since it shares its eigenvalues with the positive semi-definite matrix M_z^{−1/2} M_τ M_z^{−1/2}. It follows that all eigenvalues of I + M_z⁻¹M_τ are greater than or equal to 1, so all eigenvalues of (I + M_z⁻¹M_τ)⁻¹ are less than or equal to 1. This means that (I + M_z⁻¹M_τ)⁻¹ is a contraction, so ‖β₀‖ ≤ ‖β‖. Therefore the sum of squares of the fitted slopes is smaller on average than the sum of squares of the actual slopes, due to measurement error. The simulation below illustrates this attenuation in one dimension.
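A one-dimensional sketch (not from the notes; the variances are arbitrary): with a single centered covariate, the limiting value (I + M_z⁻¹M_τ)⁻¹β reduces to β·var(z)/(var(z) + var(τ)), and we can check the fitted slope against it directly.

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta = 200_000, 2.0
var_z, var_tau = 1.0, 0.5

z = rng.normal(scale=np.sqrt(var_z), size=n)      # true covariate (unobserved)
tau = rng.normal(scale=np.sqrt(var_tau), size=n)  # classical measurement error
x = z + tau                                       # what we actually observe
y = beta * z + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
slope = np.linalg.lstsq(X, y, rcond=None)[0][1]
print(slope)                             # empirical slope, ~1.33
print(beta * var_z / (var_z + var_tau))  # limiting value: 2 * (1/1.5) = 4/3
```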
Types of measurement error

The "classical" measurement error model,

x = z + τ,

where z is the true value and x is the observed value, is the one most commonly considered. Alternatively, in the case of an experiment it may make more sense to use the Berkson error model:

z = x + τ.

For example, suppose we aim to study a chemical reaction when a given concentration x of substrate is present. However, due to our inability to completely control the process, the actual concentration of substrate z differs randomly from x by an unknown amount τ.

You cannot simply rearrange z = x + τ to x = z − τ and claim that the two situations are equivalent. In the classical case, τ is independent of z but dependent with x. In the Berkson case, τ is independent of x but dependent with z. The contrast matters in practice, as the final sketch below shows.
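As a final illustration (my own, with arbitrary values), the two error models have different consequences for least squares: classical error attenuates the slope, while Berkson error does not, since under z = x + τ we have y = βx + (βτ + ε), and the composite error is independent of x.

```python
import numpy as np

rng = np.random.default_rng(5)
n, beta = 200_000, 2.0

def slope(x, y):
    """Least squares slope of y on an intercept and x."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Classical error: the truth is z; we observe x = z + tau.
z = rng.normal(size=n)
x = z + rng.normal(scale=0.7, size=n)
print(slope(x, beta * z + rng.normal(size=n)))  # attenuated: ~2/1.49 = 1.34

# Berkson error: we target x; the realized value is z = x + tau.
x = rng.normal(size=n)
z = x + rng.normal(scale=0.7, size=n)
print(slope(x, beta * z + rng.normal(size=n)))  # consistent: ~2.0
```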