Model selection
Kerby Shedden
Department of Statistics, University of Michigan
November 8, 2021

Background

Suppose we observe data y and are considering a family of models f_θ that may approximately describe how y was generated.

If we are mainly interested in the individual model parameters, we will focus on how close θ̂_j is to θ_j (e.g. in terms of its bias, variance, or MSE).

Alternatively, our focus may be on the probability distribution f that is the data-generating model for y. In this case, we are more interested in whether f_θ̂ approximates f than in whether θ̂_j approximates θ_j.

The term "model selection" is used to describe statistical estimation in a context where the focus is more on the structure of the fitted model than on the individual parameters.

Model selection is a form of statistical inference, but in model selection we are often contrasting two families of models {f_θ} and {g_η}. Often, the dimensions of θ and η will be different. In contrast, when we do parameter estimation we are selecting a model from within one family {f_θ}, where θ has a fixed dimension and usually lies in a simple domain like R^p.

Model complexity and parsimony

The discrepancy between f_θ and f_θ̂ is strongly influenced by how complex a model we decide to fit. Suppose we have p = 30 covariates and n = 50 observations. We could consider the following two alternatives:

1. We could fit a model using all of the covariates. In this case, θ̂ is unbiased for θ (in a linear model fit using OLS), but θ̂ has very high variance.

2. We could fit a model using only the five strongest effects. In this case, θ̂ will be biased for θ, but it will have lower variance (compared to the estimate including all covariates).

If our goal is for f_θ̂ and f_θ to be close, either approach 1 or approach 2 could perform better, depending on the circumstances.

Assessing model fit

A more complex model will usually fit the data better than a more parsimonious (simpler) model. This is called overfitting.

Due to overfitting, we cannot simply compare models in terms of how well they fit the data (e.g. in terms of the residual sum of squares, or the height of the likelihood). The more complex model will always appear to be better if we do this.

To overcome this, most model selection procedures balance a measure of a model's fit against a measure of its complexity. To justify selecting the more complex model, it must fit the data substantially better than the simpler model.

Fit and parsimony

Example: The purple, green, and blue curves below are estimates of E[y|x]. The green curve fits the data better but is more complex. Which estimate is closest to the truth?

[Figure: scatterplot of Y against X with three fitted curves of differing complexity.]

F-tests

Suppose we are comparing two nested families of models F_1 ⊂ F_2, both of which are linear subspaces. An F-test can be used to select between F_1 and F_2.
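As a concrete illustration (not from the original slides), here is a minimal sketch of the nested-model F-test using numpy and scipy. The function name f_test_nested and the simulated example are hypothetical, and the design matrices X1 and X2 are assumed to satisfy col(X1) ⊂ col(X2), with X2 including an intercept column.

```python
# Minimal sketch (illustrative, not from the slides): F-test comparing two
# nested OLS fits.  Assumes col(X1) is a subspace of col(X2).
import numpy as np
from scipy import stats

def f_test_nested(y, X1, X2):
    """Return the F statistic and p-value for testing the smaller model X1
    against the larger model X2."""
    n = len(y)
    p1, p2 = X1.shape[1], X2.shape[1]
    rss1 = np.sum((y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]) ** 2)
    rss2 = np.sum((y - X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]) ** 2)
    fstat = ((rss1 - rss2) / (p2 - p1)) / (rss2 / (n - p2))
    return fstat, stats.f.sf(fstat, p2 - p1, n - p2)

# Example: does adding a quadratic term improve on a linear fit?
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
X1 = np.column_stack([np.ones(100), x])
X2 = np.column_stack([X1, x ** 2])
print(f_test_nested(y, X1, X2))
```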
Mallows' Cp

Suppose we postulate the model y = Xβ + ε, but in fact E[y] ∉ col(X). We'll continue to assume that the homoscedastic variance structure cov(ε|X) = σ²I holds. Denote this model as M.

Denote the error in estimating E[y] under model M as D_M = ŷ_M − E[y], where ŷ_M is the projection of y onto col(X).

Write E[y] = θ_X + θ_X⊥, where θ_X ∈ col(X) and θ_X⊥ ∈ col(X)⊥. Since y = θ_X + θ_X⊥ + ε, it follows that ŷ_M = θ_X + ε_X, where ε_X is the projection of ε onto col(X). Therefore

    E[D_M D_M′] = E[(ŷ_M − E[y])(ŷ_M − E[y])′]
                = E[(ε_X − θ_X⊥)(ε_X − θ_X⊥)′]
                = θ_X⊥ θ_X⊥′ + σ² P_X,

where P_X is the projection matrix onto col(X).

Taking the trace of both sides yields

    E‖D_M‖² = ‖θ_X⊥‖² + (p + 1)σ²,

where p + 1 is the rank of P_X. Mallows' Cp aims to estimate

    C_p* = E‖D_M‖²/σ² = ‖θ_X⊥‖²/σ² + p + 1.

The model that minimizes C_p* is the closest to the true model in this particular sense.

We need an estimate of C_p*. To begin, we can derive the expected value of σ̂² = ‖y − ŷ_M‖²/(n − p − 1) in the case where E[y] is not necessarily in col(X):

    E[σ̂²] = E[y′(I − P_X)y]/(n − p − 1)
          = E[(θ_X + θ_X⊥ + ε)′(I − P_X)(θ_X + θ_X⊥ + ε)]/(n − p − 1)
          = E[tr((I − P_X)(θ_X⊥ + ε)(θ_X⊥ + ε)′)]/(n − p − 1)
          = ‖θ_X⊥‖²/(n − p − 1) + σ².

Now suppose we have an unbiased estimate of σ². This could come from a regression against a much larger design matrix whose column space is thought to contain E[y]. Call this estimate σ*². Then

    (n − p − 1)·E[σ̂² − σ*²] = ‖θ_X⊥‖².

Therefore we can estimate C_p* using

    C_p = (n − p − 1)(σ̂² − σ*²)/σ*² + p + 1.

The model M with the smallest value of C_p is selected.
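The Cp formula above is easy to compute directly. Below is a minimal sketch (not from the original slides): X_M is a candidate design matrix assumed to include an intercept column, and X_full is a hypothetical large reference design whose column space is believed to contain E[y], used to form σ*².

```python
# Minimal sketch (illustrative): Mallows' Cp for a candidate design X_M,
# with sigma*^2 estimated from a large reference design X_full.
import numpy as np

def rss(y, X):
    """Residual sum of squares from an OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta) ** 2)

def mallows_cp(y, X_M, X_full):
    n = len(y)
    p = X_M.shape[1] - 1                                   # p covariates plus an intercept
    sigma2_hat = rss(y, X_M) / (n - p - 1)                 # sigma-hat^2 under model M
    sigma2_star = rss(y, X_full) / (n - X_full.shape[1])   # reference estimate sigma*^2
    return (n - p - 1) * (sigma2_hat - sigma2_star) / sigma2_star + p + 1
```

Algebraically, this is the same as the more familiar form RSS_M/σ*² − n + 2(p + 1).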
AIC

Suppose we are selecting from a family of linear models with design matrices X_M, for M ranging over a collection of candidate models. For each X_M, the model parameters (slopes and error variance) can be estimated using least squares (and the method of moments for the error variance) as a vector θ̂_M. This allows us to construct a predictive density p(y; X_M, θ̂_M).

The Kullback-Leibler divergence ("KL-divergence") between the predictive density and the actual density p(y) is

    E_y log( p(y) / p(y; X_M, θ̂_M) ) = ∫ log( p(y) / p(y; X_M, θ̂_M) ) p(y) dy ≥ 0.

Here we are considering θ̂_M to be fixed. Small values of the KL-divergence indicate that the predictive density is close to the actual density.

Akaike's Information Criterion (AIC) aims to estimate the KL-divergence between a candidate model and the data-generating model p(y) unbiasedly. We can then select the candidate model that has the smallest estimated KL-divergence relative to p(y). The KL-divergence can be written

    E_y log p(y) − E_y log p(y; X_M, θ̂_M).

We can ignore the first term since it doesn't depend on M. Thus it is equivalent to select the model that maximizes the predictive log-likelihood

    E_y log p(y; X_M, θ̂_M) = ∫ log p(y; X_M, θ̂_M) p(y) dy.

The predictive log-likelihood is the expected value of log p(y*; X_M, θ̂_M(y)), taken over the joint distribution of y and y*, which are independent copies of the data. The parameter estimates θ̂_M = θ̂_M(y) are determined from y, which you can think of as a "training set", and the log-likelihood is evaluated at y*, using θ̂_M(y) to set the parameters.

Since we don't have both y and y*, it is natural to use the plug-in estimator of the predictive log-likelihood:

    log p(y; X_M, θ̂_M(y)).

But this is biased upward, due to overfitting. Surprisingly, this upward bias can be shown to be approximately equal to the dimension of M, which is p + 1 for regression (p + 2 if you count σ²).

Thus we may take

    log p(y_train; X_M, θ̂_M) − p − 1

as a model selection statistic to be maximized (commonly this is multiplied by −2, in which case it is to be minimized). This quantity is the AIC.

AIC in linear models

To apply the AIC to linear models, we assume the errors are multivariate normal, so the log-likelihood becomes

    −(n/2)·log σ² − ‖y − Xβ‖²/(2σ²).

If we work with the profile likelihood over β, we get −n·log(σ̂²)/2 (plus a constant), where σ̂² = ‖y − Xβ̂‖²/n. Therefore maximizing the AIC is equivalent to maximizing

    −n·log(σ̂²) − 2(p + 1),

where the first term measures fit and the second term measures complexity. The conventional form of the AIC is a scaled version of the expression above, which is to be minimized:

    AIC = n·log(σ̂²) + 2(p + 1).

AIC and likelihood ratios

The AIC does not require the models being compared to be nested, but let's consider this special case. Let L_1 and L_0 be the maximized log-likelihoods for two nested models (so L_1 ≥ L_0). We know that 2(L_1 − L_0) approximately follows a χ²_q distribution, where q is the difference between the numbers of parameters of the two models.

If q = 1, the usual likelihood ratio test at the 0.05 type I error rate selects the larger model if L_1 − L_0 ≥ 1.92. If the additional parameters are not needed, then E[L_1 − L_0] = 0.5 (so 0.5 is the lowest possible threshold for L_1 − L_0 that could ever be considered). Under AIC, we select the larger model if L_1 − L_0 > 1, which is less strict than the likelihood ratio test.

Bayesian Information Criterion (BIC)

A different criterion, which we will not derive here, is the "Bayesian information criterion" (BIC). The BIC defines complexity differently from the AIC:

    −n·log(σ̂²) − (p + 1)·log(n),

where again the first term measures fit and the second term measures complexity. The conventional definition of the BIC is

    BIC = n·log(σ̂²) + (p + 1)·log(n),

and the best-fitting model under the BIC minimizes this quantity.

The complexity penalty in the BIC, (p + 1)·log(n), is larger than the corresponding AIC penalty, 2(p + 1), whenever n ≥ 8 (so that log(n) > 2). Thus in practice the BIC favors simpler models than the AIC.

Model selection based on prediction

Many approaches to model selection attempt to identify the model that predicts best on independent data. If independent "training" and "test" sets are available, for each model M the parameters of M can be fit using the training data, yielding θ̂_M. Predictions can then be made on the test set,

    ŷ_M,test = X_M,test θ̂_M,

and the quality of prediction can be assessed, for example, using the mean squared prediction error (MSPE): ‖y_test − ŷ_M,test‖²/n.

Cross-validation

Separate training and test sets are usually not available. Cross-validation is a direct method for obtaining approximately unbiased estimates of the prediction mean squared error when only training data are available.

In k-fold cross-validation, the data are partitioned into k disjoint subsets ("folds"), denoted S_1 ∪ ··· ∪ S_k = {1, ..., n}. Let β̂_j be the fitted coefficients obtained by omitting the j-th of these subsets, and let

    CV_k = n⁻¹ Σ_{j=1}^{k} Σ_{i∈S_j} (y_i − X_i′β̂_j)².

This is an approximately unbiased (but potentially very imprecise) estimate of the MSPE on a test set. The special case with k = n is leave-one-out cross-validation (LOOCV).

For OLS regression, CV_n (also known as the "prediction residual error sum of squares", or PRESS) can be computed rapidly:

    CV_n = n⁻¹ Σ_i R_i²/(1 − P_ii)²,

where R_i are the residuals and P_ii are the diagonal elements of the projection matrix P.

The generalized cross-validation (GCV) criterion replaces P_ii with the average diagonal element of P, which is tr(P)/n:

    GCV_n = n⁻¹ Σ_i R_i²/(1 − tr(P)/n)² = n⁻¹ ‖y − ŷ‖²/(1 − tr(P)/n)².
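The linear-model forms of the AIC and BIC above are easy to compute directly. Here is a minimal sketch (not from the original slides); it assumes the design matrix X includes an intercept column, so there are p + 1 mean parameters, and it uses the maximum likelihood estimate σ̂² = ‖y − Xβ̂‖²/n as in the profile log-likelihood above.

```python
# Minimal sketch (illustrative): AIC and BIC for an OLS fit, per the formulas above.
import numpy as np

def aic_bic(y, X):
    """Return (AIC, BIC) = (n log(sigma2) + 2(p+1), n log(sigma2) + (p+1) log(n)),
    where sigma2 is the MLE of the error variance and X has p + 1 columns."""
    n, p_plus_1 = len(y), X.shape[1]
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2_mle = np.sum((y - X @ beta) ** 2) / n
    aic = n * np.log(sigma2_mle) + 2 * p_plus_1
    bic = n * np.log(sigma2_mle) + p_plus_1 * np.log(n)
    return aic, bic
```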
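The cross-validation quantities can also be computed directly. Below is a minimal sketch (not from the original slides) of an explicit k-fold estimate of the MSPE together with the PRESS and GCV shortcuts for OLS, which avoid refitting the model; the function names are arbitrary and X is assumed to have full column rank.

```python
# Minimal sketch (illustrative): k-fold CV, and the PRESS / GCV shortcuts for OLS.
import numpy as np

def press_and_gcv(y, X):
    """Return (CV_n, GCV_n) for an OLS fit using the hat-matrix identities."""
    n = len(y)
    P = X @ np.linalg.solve(X.T @ X, X.T)        # projection ("hat") matrix
    resid = y - P @ y                            # OLS residuals R_i
    cv_n = np.mean((resid / (1 - np.diag(P))) ** 2)
    gcv_n = np.mean(resid ** 2) / (1 - np.trace(P) / n) ** 2
    return cv_n, gcv_n

def kfold_cv(y, X, k, seed=0):
    """Explicit k-fold estimate of the mean squared prediction error."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    sse = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        beta = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)[0]
        sse += np.sum((y[test_idx] - X[test_idx] @ beta) ** 2)
    return sse / n
```

For OLS, the PRESS identity means press_and_gcv and kfold_cv(y, X, k=n) agree up to floating-point error; the explicit loop is only needed when k < n or when the fitting procedure is not OLS.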