Prediction
Kerby Shedden
Department of Statistics, University of Michigan
October 27, 2021

Prediction analysis

In a prediction-oriented analysis, we fit a model to capture the conditional mean relationship between independent variables x ∈ R^p and a dependent variable y ∈ R. We then use the fitted model to make predictions on an independent data set.

The model has the form f_θ, where for each θ we have a function from R^p to R. Thus {f_θ} is a family of functions indexed by a parameter θ. We use the data to obtain an estimate θ̂ of θ, which in turn gives an estimate f_θ̂ of the regression function.

It is helpful to think in terms of training data {(y_i, x_i)}, which are used to fit the model, so θ̂ = θ̂({(y_i, x_i)}), and testing data {(y_i*, x_i*)}, on which predictions are made.

Quantifying prediction error

Prediction analysis focuses on prediction errors, for example through the mean squared prediction error (MSPE)

    E|y* − f_θ̂(x*)|²,

and its sample analogue

    Σ_{i=1}^{n*} (y_i* − f_θ̂(x_i*))² / n*,

where n* is the size of the testing set.

Prediction analysis does not usually focus on properties of the parameter estimates themselves, e.g. the bias E[θ̂] − θ, or the parameter MSE E(θ̂ − θ)².

MSPE for OLS analysis

The mean squared prediction error for OLS regression is easy to derive. The testing data follow y_i* = x_i*β + ε_i*. Let ŷ* = X*β̂ ∈ R^{n*} denote the predicted values in the testing set. Then

    E‖y* − ŷ*‖² = E‖X*β + ε* − X*β̂‖²
                 = E‖X*(β − β̂)‖² + E‖ε*‖²
                 = E[(β̂ − β)′(X*′X*)(β̂ − β)] + n*σ²
                 = tr(X*′X* · E[(β̂ − β)(β̂ − β)′]) + n*σ²
                 = tr(X*′X* · Σ_β̂) + n*σ²,

where Σ_β̂ is the covariance matrix of β̂ from the training process. Note that this requires ŷ* and y* to be independent (given X and X* if they are random).

MSPE for OLS analysis

Dividing by n*, the MSPE for OLS is

    tr((X*′X*/n*) · Σ_β̂) + σ².
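This formula can be checked by simulation. The sketch below (using numpy, with arbitrary illustrative choices of β, σ, and the two design matrices) repeatedly draws training data, fits OLS, and averages the squared test-set error; the result should closely match tr((X*′X*/n*)·Σ_β̂) + σ².

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_star, p, sigma = 50, 40, 3, 1.0
beta = np.array([1.0, 2.0, -1.0, 0.5])  # illustrative coefficients, incl. intercept

# Fixed training and testing design matrices (each with an intercept column).
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
Xs = np.column_stack([np.ones(n_star), rng.standard_normal((n_star, p))])

# Theoretical MSPE: tr((Xs'Xs/n*) . Sigma_betahat) + sigma^2,
# with Sigma_betahat = sigma^2 (X'X)^{-1} for OLS.
Sigma = sigma**2 * np.linalg.inv(X.T @ X)
mspe_theory = np.trace((Xs.T @ Xs / n_star) @ Sigma) + sigma**2

# Monte Carlo estimate over repeated training/testing draws.
nrep = 10000
err = 0.0
for _ in range(nrep):
    y = X @ beta + sigma * rng.standard_normal(n)
    bhat = np.linalg.lstsq(X, y, rcond=None)[0]
    ys = Xs @ beta + sigma * rng.standard_normal(n_star)
    err += np.mean((ys - Xs @ bhat) ** 2)

print(err / nrep, mspe_theory)  # the two values should nearly agree
```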
If X is the training set design matrix, then Σ_β̂ = σ²(X′X)^{-1}. So if X = X* (and hence n = n*), then

    E‖y* − ŷ*‖² = σ²(p + 1 + n*),

and the MSPE in this case is

    σ²(p + 1)/n* + σ² = σ²(p + 1)/n + σ².

MSPE for OLS analysis

More generally, suppose X′X/n = X*′X*/n*. Then

    Σ_β̂ = σ²(X′X)^{-1} = σ²n*(X*′X*)^{-1}/n.

Thus the MSPE is

    tr((X*′X*/n*) · Σ_β̂) + σ² = σ²(p + 1)/n + σ².

MSPE in practice

The MSPE discussed here is the primary population quantity of interest for prediction, but it is not always straightforward to estimate.

The task of model selection, discussed later, can be viewed as aiming to identify the model with the lowest MSPE among a set of candidate models under consideration.

Note that the candidate models we fit to data may not be correctly specified, so the usual estimate σ̂² may be biased.

PRESS residuals

One way to estimate the MSPE under few theoretical conditions is cross validation. We briefly introduce this idea here, then return to it in more detail when we discuss model selection.

If case i is deleted and a prediction of y_i is made from the remaining data, we can compare the observed and predicted values to get the prediction residual

    r_(i) ≡ y_i − ŷ_(i)i,

where ŷ_(i)i is the prediction of y_i based on the data set with case i removed.

PRESS residuals

A simple formula for the prediction residual in OLS is

    r_(i) = y_i − x_i β̂_(i)
          = y_i − x_i (β̂ − (X′X)^{-1} x_i′ r_i / (1 − P_ii))
          = r_i / (1 − P_ii),

where X is the design matrix, x_i is row i of the design matrix, r_i is the ordinary residual for case i, and P is the projection matrix for the full sample.

The sum of squares of the prediction residuals is called PRESS (prediction residual error sum of squares). It is equivalent to using leave-one-out cross validation to estimate the generalization error.
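The shortcut formula r_(i) = r_i/(1 − P_ii) can be verified against explicit leave-one-out refits. A minimal sketch on simulated data (illustrative sizes and coefficients):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

# Shortcut: PRESS residuals from a single full-sample fit.
P = X @ np.linalg.inv(X.T @ X) @ X.T      # projection ("hat") matrix
resid = y - P @ y                         # ordinary residuals r_i
press = resid / (1.0 - np.diag(P))        # r_(i) = r_i / (1 - P_ii)

# Brute force: delete each case in turn and refit.
press_loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    bi = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    press_loo[i] = y[i] - X[i] @ bi

print(np.allclose(press, press_loo))  # True
```

The shortcut requires only one fit, while the brute-force version refits n times, which is why PRESS makes leave-one-out cross validation cheap for OLS.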
Bias and variance in prediction

If we use a function f_θ̂ to predict y from x, we can view the prediction error as arising from contributions of bias and variance.

The bias is

    b(x) ≡ E[f_θ̂(x)|x] − E[y|x].

The variance is

    v(x) ≡ var[f_θ̂(x)|x].

The MSPE decomposes as

    E[(y − f_θ̂(x))²|x] = var[y|x] + b(x)² + v(x),

so apart from the irreducible error var[y|x], the MSPE is the sum of the squared bias and the variance.

Bias and variance in prediction

While having zero bias is an important consideration in some statistical analyses, arguably the overall accuracy, as measured by MSPE, should be the dominant consideration. Since the MSPE results from a combination of squared bias and variance, if we want to minimize the MSPE we should be willing to use a biased estimator, provided that doing so attains a better MSPE (due to a much smaller variance).

The relationship between bias and variance discussed here is often referred to as the bias/variance tradeoff.

Ridge regression

Ridge regression uses the minimizer of a penalized squared error loss function to estimate the regression coefficients:

    β̂ ≡ argmin_β ‖y − Xβ‖² + λβ′Dβ.

Typically D is a diagonal matrix with 0 in the (1,1) position and ones on the rest of the diagonal. In this case

    β′Dβ = Σ_{j≥1} β_j².

This makes most sense when the covariates have been standardized, so that it is reasonable to penalize the β_j equally.

Ridge regression

Ridge regression is a compromise between fitting the data as well as possible (making ‖y − Xβ‖² small) and not allowing any one fitted coefficient to become very large (which makes β′Dβ large).

Ridge regression and collinearity

Suppose x1 and x2 are standardized vectors with a substantial positive correlation (i.e. x1′x2/n is large), and the population slopes are β1 and β2, i.e.

    E[y|x1, x2] = β1 x1 + β2 x2.

Fits of the form

    (β1 + γ)x1 + (β2 − γ)x2 = E[y|x1, x2] + γ(x1 − x2)

have similar MSE values as γ varies, since x1 − x2 is small when x1 and x2 are strongly positively associated. In other words, OLS cannot easily distinguish among these fits.
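A quick numeric illustration, on hypothetical simulated data: along the family of fits indexed by γ, the residual sum of squares barely changes, while the ridge penalty clearly prefers γ = 0.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.standard_normal(n)
x2 = x1 + 0.05 * rng.standard_normal(n)   # strongly correlated with x1
y = 3 * x1 + 3 * x2 + rng.standard_normal(n)

X = np.column_stack([x1, x2])
for gamma in [0.0, 1.0, 2.0]:
    b = np.array([3.0 + gamma, 3.0 - gamma])
    sse = np.sum((y - X @ b) ** 2)        # nearly constant in gamma
    penalty = np.sum(b**2)                # minimized at gamma = 0
    print(gamma, round(sse, 1), penalty)
```

The SSE column is nearly flat, so least squares alone cannot choose among these coefficient vectors; the penalty column is what lets ridge regression break the tie.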
For example, if x1 ≈ x2, then 3x1 + 3x2, 4x1 + 2x2, 5x1 + x2, etc. all produce similar fitted values.

Ridge regression and collinearity

For large λ, ridge regression favors the fits that minimize

    (β1 + γ)² + (β2 − γ)².

This expression is minimized at γ = (β2 − β1)/2, giving the fit

    (β1 + β2)x1/2 + (β1 + β2)x2/2.

⇒ Ridge regression favors coefficient estimates in which strongly positively correlated covariates have similar estimated effects.

Calculation of ridge regression estimates

For a given value λ > 0, ridge regression is computationally no more difficult than ordinary least squares, since

    ∂/∂β (‖y − Xβ‖² + λβ′Dβ) = −2X′y + 2X′Xβ + 2λDβ,

so the ridge estimate β̂ solves the system of linear equations

    (X′X + λD)β = X′y.

This equation can have a unique solution even when X′X is singular. Thus one application of ridging is to produce regression estimates for singular design matrices.

Ridge regression bias and variance

Ridge regression estimates are biased, but may be less variable than OLS estimates. If X′X is non-singular, the ridge estimator can be written

    β̂_λ = (X′X + λD)^{-1}X′y
         = (I + λ(X′X)^{-1}D)^{-1}(X′X)^{-1}X′y
         = (I + λ(X′X)^{-1}D)^{-1}(X′X)^{-1}X′(Xβ + ε)
         = (I + λ(X′X)^{-1}D)^{-1}β + (I + λ(X′X)^{-1}D)^{-1}(X′X)^{-1}X′ε.

Thus the bias is

    E[β̂_λ|X] − β = ((I + λ(X′X)^{-1}D)^{-1} − I)β.

Ridge regression bias and variance

The variance of the ridge regression estimates is

    var[β̂_λ] = σ²(I + λ(X′X)^{-1}D)^{-1}(X′X)^{-1}(I + λ(X′X)^{-1}D)^{-T}.

Ridge regression bias and variance

Next we will show that var[β̂] ≥ var[β̂_λ], in the sense that var[β̂] − var[β̂_λ] is a non-negative definite matrix.
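Before the algebraic argument, the claim can be checked numerically on a simulated design (illustrative sizes; σ² = 1 and λ = 2.5 are arbitrary), by plugging the two variance formulas in and inspecting the eigenvalues of their difference:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
D = np.diag([0.0] + [1.0] * p)   # intercept unpenalized
lam, sigma2 = 2.5, 1.0

G = X.T @ X
M = lam * np.linalg.inv(G) @ D
A = np.linalg.inv(np.eye(p + 1) + M)   # (I + M)^{-1}

var_ols = sigma2 * np.linalg.inv(G)
var_ridge = sigma2 * A @ np.linalg.inv(G) @ A.T

# All eigenvalues of the difference should be >= 0 (up to rounding),
# i.e. var_ols - var_ridge is non-negative definite.
eigs = np.linalg.eigvalsh(var_ols - var_ridge)
print(eigs.min())
```

The smallest eigenvalue is essentially zero here because D leaves the intercept direction unpenalized, so the variance reduction in that direction can vanish.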
First let M = λ(X′X)^{-1}D, so that var[β̂_λ] = σ²(I + M)^{-1}(X′X)^{-1}(I + M)^{-T}.

Ridge regression bias and variance

For any vector v,

    v′(var[β̂] − var[β̂_λ])v ∝ v′((X′X)^{-1} − (I + M)^{-1}(X′X)^{-1}(I + M)^{-T})v
                             = u′((I + M)(X′X)^{-1}(I + M)′ − (X′X)^{-1})u
                             = u′(M(X′X)^{-1} + (X′X)^{-1}M′ + M(X′X)^{-1}M′)u
                             = u′(2λ(X′X)^{-1}D(X′X)^{-1} + λ²(X′X)^{-1}D(X′X)^{-1}D(X′X)^{-1})u
                             ≥ 0,

where u = (I + M)^{-T}v; the final inequality holds because D is non-negative definite, so both terms in the last expression are non-negative definite. We can conclude that for any fixed vector θ, var(θ′β̂_λ) ≤ var(θ′β̂).

Ridge regression effective degrees of freedom

As with OLS, the fitted values under ridge regression are linear functions of the observed values:

    ŷ_λ = X(X′X + λD)^{-1}X′y.

In OLS regression, the degrees of freedom is the number of free parameters in the model, which equals the trace of the projection matrix P satisfying ŷ = Py. Fitted values in ridge regression are not a projection of y, but the matrix

    X(X′X + λD)^{-1}X′

plays a role analogous to P.

Ridge regression effective degrees of freedom

The effective degrees of freedom for ridge regression is defined as

    EDF_λ = tr(X(X′X + λD)^{-1}X′).

The trace can be computed easily using the identity

    tr(X(X′X + λD)^{-1}X′) = tr((X′X + λD)^{-1}X′X).

Ridge regression effective degrees of freedom

EDF_λ is monotonically decreasing in λ. To see this, we use the following fact about matrix derivatives:

    ∂tr(A^{-1}B)/∂A = −A^{-T}B′A^{-T}.

By the chain rule, letting A = X′X + λD, we have

    ∂tr(A^{-1}X′X)/∂λ = Σ_ij [∂tr(A^{-1}X′X)/∂A]_ij · ∂A_ij/∂λ
                      = −Σ_ij [A^{-T}(X′X)A^{-T}]_ij D_ij
                      = −Σ_i [A^{-T}(X′X)A^{-T}]_ii D_ii
                      ≤ 0,

since D is diagonal with non-negative entries and the diagonal entries of A^{-T}(X′X)A^{-T} are non-negative.

Ridge regression effective degrees of freedom

EDF_λ equals rank(X) when λ = 0. To see what happens as λ → ∞, we can apply the Sherman-Morrison-Woodbury (SMW) identity

    (A + UCV)^{-1} = A^{-1} − A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}.

Let G = X′X, and write D = FF′, where F has independent columns (usually F will be (p + 1) × p, as we do not penalize the intercept).
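Before the limiting argument, the behavior of EDF_λ can be checked numerically (simulated design, illustrative λ grid): it starts at rank(X) = p + 1 when λ = 0, decreases monotonically, and approaches 1.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
D = np.diag([0.0] + [1.0] * p)   # intercept unpenalized; D = FF' with rank(F) = p
G = X.T @ X

def edf(lam):
    # EDF_lambda = tr((X'X + lam*D)^{-1} X'X)
    return np.trace(np.linalg.inv(G + lam * D) @ G)

vals = [edf(lam) for lam in [0.0, 1.0, 10.0, 1e3, 1e8]]
print([round(v, 4) for v in vals])
# edf(0) = p + 1; values decrease with lambda and approach 1.
```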
Ridge regression effective degrees of freedom

Applying the SMW identity with A = G, U = F, C = λI, and V = F′, and letting λ → ∞, we get

    tr((G + λD)^{-1}G) = tr((G^{-1} − G^{-1}F(I/λ + F′G^{-1}F)^{-1}F′G^{-1})G)
                       = tr(I_{p+1}) − tr(G^{-1}F(I/λ + F′G^{-1}F)^{-1}F′)
                       → tr(I_{p+1}) − tr(G^{-1}F(F′G^{-1}F)^{-1}F′)
                       = tr(I_{p+1}) − tr((F′G^{-1}F)^{-1}F′G^{-1}F)
                       = p + 1 − rank(F).

Therefore in the usual case where F has rank p, EDF_λ converges to 1 as λ grows large, reflecting the fact that all coefficients other than the intercept are forced to zero.

Ridge regression and the SVD

Suppose we are fitting a ridge regression with D = I, and we factor X = USV′ using the singular value decomposition (SVD), so that U and V are orthogonal matrices and S is a diagonal matrix with non-negative diagonal elements. The fitted coefficients are

    β̂_λ = (X′X + λI)^{-1}X′y
         = (VS²V′ + λVV′)^{-1}VSU′y
         = V(S² + λI)^{-1}SU′y.

Note that for OLS (λ = 0), we get β̂ = VS^{-1}U′y. The effect of ridging is to replace S^{-1} in this expression with (S² + λI)^{-1}S, whose diagonal entries are uniformly smaller when λ > 0.

Ridge regression tuning parameter

There are various ways to set the ridge parameter λ. Cross-validation can be used to estimate the MSPE for any particular value of λ; the estimated MSPE can then be minimized by checking its value over a finite set of λ values.

Generalized cross validation (GCV), which minimizes the following criterion over λ, is a simpler and more commonly used approach:

    GCV(λ) = ‖y − ŷ_λ‖² / (n − EDF_λ)².
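A minimal sketch of GCV-based tuning, on simulated data (the design, coefficients, and λ grid are illustrative): for each λ on a grid, form the smoother matrix, compute EDF_λ and the residuals, and keep the λ minimizing the GCV criterion above.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 80, 5
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
beta = np.array([1.0, 2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ beta + rng.standard_normal(n)
D = np.diag([0.0] + [1.0] * p)   # intercept unpenalized
G = X.T @ X

def gcv(lam):
    # Smoother matrix H_lambda = X (X'X + lam*D)^{-1} X', so yhat = H y
    H = X @ np.linalg.inv(G + lam * D) @ X.T
    edf = np.trace(H)                 # effective degrees of freedom
    resid = y - H @ y
    return np.sum(resid**2) / (n - edf) ** 2

grid = np.logspace(-3, 3, 25)
lam_best = grid[np.argmin([gcv(l) for l in grid])]
print(lam_best)
```

Minimizing over a log-spaced grid is the usual practical choice, since the GCV criterion is cheap to evaluate once X is fixed.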