Generalized Linear Models
Kerby Shedden, Department of Statistics, University of Michigan
December 6, 2021

Motivation for nonlinear models

The key properties of a linear model are that E[y|X] = Xβ and var[y|X] ∝ I.

In some cases where these conditions are not met, we can transform y so that the properties of a linear model are well satisfied. However, it is often difficult to find a transformation that simultaneously linearizes the mean and gives constant variance. Also, if y lies in a restricted domain (e.g. y_i ∈ {0, 1}), parameterizing E[y|X] as a linear function of X violates the domain restriction.

Generalized linear models (GLM's) are a class of nonlinear regression models that can be used in certain cases where linear models do not fit well.

Logistic regression

Logistic regression is a specific type of GLM. We will develop logistic regression from first principles before discussing GLM's in general.

Logistic regression is used for binary outcome data, where y_i = 0 or y_i = 1. It is defined by the probability mass function

    P(y_i = 1 | x_i = x) = exp(β'x) / (1 + exp(β'x)) = 1 / (1 + exp(−β'x)),

which implies that

    P(y_i = 0 | x_i = x) = 1 − P(y_i = 1 | x_i = x) = 1 / (1 + exp(β'x)),

where in most cases x_{i0} = 1, so β_0 is the intercept.

This plot shows P(y = 1|x) and P(y = 0|x), plotted as functions of β'x:

[Figure: P(Y = 1 | β'x) and P(Y = 0 | β'x) plotted against the linear predictor β'x over the range −6 to 6.]

The logit function logit(p) = log(p/(1 − p)) maps the unit interval onto the real line. The inverse logit function, or expit function,

    expit(x) = logit^{-1}(x) = exp(x) / (1 + exp(x)),

maps the real line onto the unit interval. In logistic regression, the expit function maps the linear predictor β'x to a probability; equivalently, the logit function maps the success probability back to the linear predictor.

The linear predictor in logistic regression is the conditional log odds:

    log[ P(y = 1|x) / P(y = 0|x) ] = β'x.

Thus one way to interpret a logistic regression model is that a one unit increase in x_j (the j'th covariate) results in a change of β_j in the conditional log odds. Equivalently, a one unit increase in x_j results in a multiplicative change of exp(β_j) in the conditional odds.

Latent variable model for logistic regression

It may make sense to view the binary outcome y as a dichotomization of a latent continuous outcome y_c, with y = I(y_c ≥ 0). Suppose y_c | x follows a logistic distribution, with CDF

    F(y_c | x) = exp(y_c − β'x) / (1 + exp(y_c − β'x)).

In this case, y | x follows the logistic regression model:

    P(y = 1|x) = P(y_c ≥ 0 | x) = 1 − exp(−β'x) / (1 + exp(−β'x)) = exp(β'x) / (1 + exp(β'x)).

Mean/variance relationship for logistic regression

Since the mean and variance of a Bernoulli trial are linked, the mean structure

    E[y|x] = P(y = 1|x) = exp(β'x) / (1 + exp(β'x))

also determines the variance

    var[y|x] = P(y = 1|x) · P(y = 0|x) = 1 / (2 + exp(β'x) + exp(−β'x)).

Since the variance depends on x, logistic regression models are always heteroscedastic.
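To make the interpretation above concrete, here is a minimal numpy sketch (not from the original slides); the coefficient values and covariate vectors are hypothetical. It checks that a one unit change in x_j multiplies the conditional odds by exp(β_j), and that the variance p(1 − p) changes with x.

```python
import numpy as np

def expit(eta):
    # Inverse logit: maps the linear predictor to a probability.
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical coefficients: intercept and one slope.
beta = np.array([-1.0, 0.5])

# Two covariate vectors differing by one unit in the second covariate.
x1 = np.array([1.0, 2.0])
x2 = np.array([1.0, 3.0])

p1, p2 = expit(beta @ x1), expit(beta @ x2)
odds1, odds2 = p1 / (1 - p1), p2 / (1 - p2)

# The odds ratio equals exp(beta_j) for a one-unit change in x_j.
print(odds2 / odds1, np.exp(beta[1]))

# The conditional variance p(1 - p) changes with x: heteroscedasticity.
print(p1 * (1 - p1), p2 * (1 - p2))
```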
Logistic regression and case-control studies

Suppose we sample people based on their disease status D (D = 1 is a case, D = 0 is a control). We are interested in a binary marker M ∈ {0, 1} that may predict a person's disease status. The prospective log odds

    log[ P(D = 1|M = m) / P(D = 0|M = m) ]

measures how informative the marker is for the disease.

Suppose we model P(M = m | D = d) using logistic regression, so

    P(M = 1 | D = d) = exp(α + βd) / (1 + exp(α + βd)),
    P(M = 0 | D = d) = 1 / (1 + exp(α + βd)).

More generally,

    P(M = m | D = d) = exp(m(α + βd)) / (1 + exp(α + βd)).

Since

    log[ P(M = 1|D = d) / P(M = 0|D = d) ] = α + βd,

we see that β is the coefficient of d in the retrospective log odds.

The prospective log odds can be written

    log[ P(D = 1|M = m) / P(D = 0|M = m) ]
        = log[ P(M = m|D = 1)P(D = 1)/P(M = m) ] − log[ P(M = m|D = 0)P(D = 0)/P(M = m) ]
        = log[ P(M = m|D = 1)P(D = 1) / (P(M = m|D = 0)P(D = 0)) ].

Continuing from the previous display, we have

    log[ P(M = m|D = 1)P(D = 1) / (P(M = m|D = 0)P(D = 0)) ]
        = log[ exp(m(α + β))/(1 + exp(α + β)) · P(D = 1) / ( exp(mα)/(1 + exp(α)) · P(D = 0) ) ],

which equals

    βm + log[ (1 + exp(α)) / (1 + exp(α + β)) · P(D = 1)/P(D = 0) ].

Thus β is both the coefficient of d in the retrospective log odds, and it is the coefficient of m in the prospective log odds. This is sometimes called case/control convertibility.

Estimation and inference for logistic regression

Assuming independent cases, the log-likelihood for logistic regression is

    L(β | y, X) = log ∏_i exp(y_i β'x_i) / (1 + exp(β'x_i))
                = Σ_{i: y_i = 1} β'x_i − Σ_i log(1 + exp(β'x_i)).

This likelihood is for the conditional distribution of y given X. As in linear regression, we do not model the marginal distribution of x (a row of X).

Logistic regression models are usually fit using maximum likelihood estimation. This means that the parametric likelihood above is maximized as a function of β. The gradient of the log-likelihood function (the score function) is

    G(β | y, X) = ∂L(β | y, X)/∂β
                = Σ_{i: y_i = 1} x_i − Σ_i exp(β'x_i)/(1 + exp(β'x_i)) · x_i
                = Σ_i ( y_i − exp(β'x_i)/(1 + exp(β'x_i)) ) x_i.

The Hessian of the log-likelihood is

    H(β | y, X) = ∂²L(β | y, X)/∂β∂β' = −Σ_i exp(β'x_i)/(1 + exp(β'x_i))² · x_i x_i'.

The Hessian is strictly negative definite as long as the design matrix has linearly independent columns. Therefore L(β | y, X) is a concave function of β, so it has at most one maximizer, and hence the MLE is unique when it exists.

From the general theory of the MLE, the inverse Fisher information I(β)^{-1}, where I(β) = −E[H(β | y, X) | X], is the asymptotic sampling covariance matrix of the MLE β̂. Since H(β | y, X) does not depend on y, I(β) = −H(β | y, X). Since β̂ is an MLE for a regular problem, it is consistent, asymptotically unbiased, and asymptotically normal if the model is correctly specified.
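The score and Hessian above lead directly to a Newton-Raphson fitting algorithm. The following is a minimal sketch (not part of the original slides), run on simulated data; a production implementation would add a convergence check and safeguards against separation.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25):
    """Maximize the logistic log-likelihood by Newton-Raphson,
    using the score G and Hessian H given above."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))          # P(y = 1 | x)
        G = X.T @ (y - mu)                       # score function
        H = -(X.T * (mu * (1 - mu))) @ X         # Hessian (negative definite)
        beta = beta - np.linalg.solve(H, G)      # Newton update
    # Asymptotic covariance of beta-hat is the inverse Fisher information.
    cov = np.linalg.inv(-H)
    return beta, cov

# Example with simulated data (hypothetical parameter values).
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta_hat, cov = fit_logistic_newton(X, y)
print(beta_hat, np.sqrt(np.diag(cov)))   # estimates and standard errors
```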
Poisson regression

The Poisson distribution is a single-parameter family of distributions on the sample space {0, 1, 2, ...}. A key property of the Poisson distribution is that the mean is equal to the variance. The Poisson distribution is usually parameterized in terms of a parameter λ that is equal to the common mean and variance.

In regression, we don't want just a single distribution. Instead we want a family of distributions indexed by the covariate vector x. To create a regression methodology based on the Poisson distribution, we can formulate a regression model in which y | x is Poisson, with mean and variance equal to λ(x) = exp(β'x).

Since the mean function in Poisson regression has an exponential form, the covariates are related multiplicatively to the mean. If we contrast the mean value for two different covariate vectors x^(1) and x^(2), such that x_j^(1) − x_j^(2) = 1 and x_k^(1) = x_k^(2) for k ≠ j, then the means at these two points are related through

    λ(x^(1)) = exp(β_j) λ(x^(2)).

Setting the mean to be λ_i = exp(β'x_i), the PMF for one observation in a Poisson regression model is

    exp(−λ_i) λ_i^{y_i} / y_i!.

The corresponding contribution to the log-likelihood is

    −λ_i + y_i log(λ_i) − log(y_i!) = −exp(β'x_i) + y_i β'x_i − log(y_i!),

and the contribution to the score function is

    −x_i exp(β'x_i) + y_i x_i = (y_i − exp(β'x_i)) x_i.

Score equations

The MLE is a zero of the score function (a stationary point of the log-likelihood). Thus, for logistic regression, the following equation is satisfied at the MLE:

    Σ_i ( y_i − exp(β'x_i)/(1 + exp(β'x_i)) ) x_i = 0.

For Poisson regression, this equation is satisfied at the MLE:

    Σ_i (y_i − exp(β'x_i)) x_i = 0.

We also know that for OLS (viewed here as a Gaussian regression model), this equation is satisfied at the MLE:

    Σ_i (y_i − β'x_i) x_i = 0.

Writing µ_i = E[y_i | x_i], we see that for all three types of regression models, the following equation is satisfied:

    Σ_i (y_i − µ_i) x_i = 0.

This shows that the residuals are orthogonal to each covariate in all of these models, and that achieving this orthogonality characterizes the MLE. This turns out to be a useful generic framework for regression, as many different mean functions µ(β) can be substituted into this equation, and the solution of the equation can be used to estimate β.

Relationship between the mean and variance

We have seen three parametric regression models, each of which expresses the mean in terms of the linear predictor. The "family" is the distributional family used to form the log-likelihood and score functions. For each of these models, the variance can also be related to the mean.

    Family      Mean (µ)              Variance (v(µ))
    Gaussian    β'x                   1
    Binomial    1/(1 + exp(−β'x))     µ(1 − µ)
    Poisson     exp(β'x)              µ

The variance functions here are only specified up to a constant of proportionality. That is, var(y_i | x_i) = φ v(µ_i), where φ is the scale parameter.

In any single index model, µ_i = µ(β'x_i), so ∂µ_i/∂β = µ'(β'x_i) · x_i ∝ x_i. Note that in each case above, ∂µ_i/∂β is proportional to v_i · x_i, where v_i ∈ R is the variance (but this is not always the case).

Estimating equations

For the three examples we are focusing on here, the MLE can be defined as the solution to the estimating equations

    Σ_i ∂µ_i/∂β · (y_i − µ_i(β)) / v_i(β) = 0,

which can also be expressed

    Σ_i µ'(η_i) · (y_i − µ_i(β)) · x_i / v_i(β) = 0,

where η_i = β'x_i is the linear predictor.

The score equations constitute a system of p = dim(x) equations in p unknowns. They should be solvable unless there is some degeneracy in the equations. In the "canonical setting", (∂µ_i/∂β)/v_i(β) ∝ x_i, so these equations are equivalent to the orthogonality between residuals and covariates. But we will see below that the form given here extends to some "non-canonical" settings and hence is somewhat more general. A sketch of a generic solver for these equations follows.
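Here is a hedged sketch (not from the slides) of a generic Fisher-scoring solver for the estimating equations above. The mean function, its derivative, and the variance function are passed in, so the same routine covers the Gaussian, binomial, and Poisson cases in the table; the function and variable names are mine.

```python
import numpy as np

def solve_estimating_equations(X, y, mean, dmean, var, n_iter=50):
    """Fisher scoring for the estimating equations
        sum_i mu'(eta_i) * (y_i - mu_i) * x_i / v_i = 0,
    where eta_i = x_i' beta, mu_i = mean(eta_i), v_i = var(mu_i)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta
        mu = mean(eta)
        d = dmean(eta)
        v = var(mu)
        U = X.T @ (d * (y - mu) / v)       # estimating function
        W = (X.T * (d ** 2 / v)) @ X       # expected information
        beta = beta + np.linalg.solve(W, U)
    return beta

def expit(e):
    return 1.0 / (1.0 + np.exp(-e))

# (mean, mean derivative, variance function) for the three families above.
families = {
    "gaussian": (lambda e: e, lambda e: np.ones_like(e), lambda m: np.ones_like(m)),
    "binomial": (expit, lambda e: expit(e) * (1 - expit(e)), lambda m: m * (1 - m)),
    "poisson":  (np.exp, np.exp, lambda m: m),
}
# Usage: beta_hat = solve_estimating_equations(X, y, *families["poisson"])
```

In each of these three canonical cases the weight µ'(η_i)/v_i equals one, so the solution is driven by the residual/covariate orthogonality condition Σ_i (y_i − µ_i) x_i = 0.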
Development of GLM's using likelihoods

A GLM is based on the following conditions:

- The y_i are conditionally independent given X.
- The probability mass function or density can be written

      log p(y_i | θ_i, φ, x_i) = w_i (y_i θ_i − γ(θ_i))/φ + τ(y_i, φ/w_i),

  where w_i is a known weight, θ_i = g(β'x_i) for an unknown vector of regression slopes β, g(·) and γ(·) are smooth functions, φ is the "scale parameter" (which may be either known or unknown), and τ(·) is a known function.

The log-likelihood function is

    L(β, φ | y, X) = Σ_i w_i (y_i θ_i − γ(θ_i))/φ + τ(y_i, φ/w_i).

The score function with respect to θ_i is w_i (y_i − γ'(θ_i))/φ.

Next we need a fundamental fact about score functions. Let f_θ(y) be a density in y with parameter θ. The score function is

    ∂ log f_θ(y)/∂θ = f_θ(y)^{-1} ∂f_θ(y)/∂θ.

The expected value of the score function is

    E[∂ log f_θ(y)/∂θ] = ∫ f_θ(y)^{-1} ∂f_θ(y)/∂θ · f_θ(y) dy = ∫ ∂f_θ(y)/∂θ dy = ∂/∂θ ∫ f_θ(y) dy = 0.

Thus the score function has expected value 0 when θ is at its true value.

Since the expected value of the score function is zero, we can conclude that E[w_i (y_i − γ'(θ_i))/φ | X] = 0, so

    E[y_i | X] = γ'(θ_i) = γ'(g(β'x_i)).

Note that this relationship does not depend on φ or τ.

Using a similar approach, we can relate the variance to w_i, φ, and γ''. By direct calculation,

    ∂²L(θ_i | y_i, x_i, φ)/∂θ_i² = −w_i γ''(θ_i)/φ.

Returning to the general density f_θ(y), we can write the Hessian of the log-density as

    ∂² log f_θ(y)/∂θ∂θ' = f_θ(y)^{-1} ∂²f_θ(y)/∂θ∂θ' − f_θ(y)^{-2} ∂f_θ(y)/∂θ · ∂f_θ(y)/∂θ'.

The expected value of the Hessian is

    E[∂² log f_θ(y)/∂θ∂θ'] = ∫ ∂²f_θ(y)/∂θ∂θ' dy − ∫ (∂f_θ(y)/∂θ)(∂f_θ(y)/∂θ') / f_θ(y) dy
                           = −cov[∂ log f_θ(y)/∂θ | X],

since the first integral is zero. Therefore

    w_i γ''(θ_i)/φ = var[ w_i (y_i − γ'(θ_i))/φ | X ],

so var[y_i | X] = φ γ''(θ_i)/w_i.

Examples of GLM's

Gaussian linear model: The density of y | X can be written

    log p(y_i | x_i) = −log(2πσ²)/2 − (y_i − β'x_i)²/(2σ²)
                     = −log(2πσ²)/2 − y_i²/(2σ²) + (y_i β'x_i − (β'x_i)²/2)/σ².

This can be put into GLM form by setting g(x) = x, γ(x) = x²/2, w_i = 1, φ = σ², and τ(y_i, φ) = −log(2πφ)/2 − y_i²/(2φ).

Logistic regression: The mass function of y | x can be written

    log p(y_i | x_i) = y_i log(p_i) + (1 − y_i) log(1 − p_i) = y_i log(p_i/(1 − p_i)) + log(1 − p_i),

where p_i = logit^{-1}(β'x_i) = exp(β'x_i)/(1 + exp(β'x_i)). Since log(p_i/(1 − p_i)) = β'x_i, this can be put into GLM form by setting g(x) = x, γ(x) = −log(1 − logit^{-1}(x)) = log(1 + exp(x)), τ(y_i, φ) ≡ 0, w = 1, and φ = 1.

Poisson regression: In Poisson regression, the distribution of y | x follows a Poisson distribution, with the mean response related to the covariates via log E[y|x] = β'x. It follows that log var[y|x] = β'x as well. The mass function can be written

    log p(y_i | x_i) = y_i β'x_i − exp(β'x_i) − log(y_i!),

so in GLM form, g(x) = x, γ(x) = exp(x), w = 1, τ(y_i) = −log(y_i!), and φ = 1.

Negative binomial regression: In negative binomial regression, the probability mass function for the dependent variable y is

    P(y_i = y | x) = Γ(y + 1/α) / (Γ(y + 1) Γ(1/α)) · (1/(1 + αµ_i))^{1/α} · (αµ_i/(1 + αµ_i))^y.

The mean of this distribution is µ_i and the variance is µ_i + αµ_i². In the limit α → 0 we recover the same mean/variance relationship as the Poisson model. As α increases, we get increasingly more overdispersion.
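Before continuing with the negative binomial family, here is a small numerical sketch (not from the slides) checking the two moment identities derived above, E[y_i|X] = γ'(θ_i) and var[y_i|X] = φ γ''(θ_i)/w_i, for the Poisson and binomial cases with w = φ = 1; the finite-difference helper is mine.

```python
import numpy as np

def num_deriv(f, x, h=1e-4):
    # Central finite differences for the first and second derivatives.
    d1 = (f(x + h) - f(x - h)) / (2 * h)
    d2 = (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2
    return d1, d2

theta = 0.7

# Poisson: gamma(theta) = exp(theta); mean and variance are both exp(theta).
g1, g2 = num_deriv(np.exp, theta)
print(g1, g2, np.exp(theta))

# Binomial: gamma(theta) = log(1 + exp(theta)); mean is expit(theta),
# variance is p(1 - p).
gamma_b = lambda t: np.log(1 + np.exp(t))
g1, g2 = num_deriv(gamma_b, theta)
p = 1 / (1 + np.exp(-theta))
print(g1, p)            # mean
print(g2, p * (1 - p))  # variance
```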
Negative binomial regression (continued): The log-likelihood (dropping terms that do not involve µ) is

    log P(y_i = y | x_i) = y log( αµ_i / (1 + αµ_i) ) − α^{-1} log(1 + αµ_i).

Suppose we model the mean as µ_i = exp(β'x_i). Then in the standard GLM notation, we have

    θ_i = log( α exp(β'x_i) / (1 + α exp(β'x_i)) ),

so g(x) = log(α) + x − log(1 + α exp(x)), and γ(x) = −α^{-1} log(1 − exp(x)).

Link functions

In a GLM, the link function maps the mean to the linear predictor η_i = x_i'β. Since E[y_i | x_i] = γ'(g(η_i)), it follows that the link function is the inverse of γ' ∘ g. For example, in the case of logistic regression, γ'(g(η)) = exp(η)/(1 + exp(η)), which is the expit function. The inverse of this function is the logit function log(p/(1 − p)), so the logit function is the link in this case.

When g(x) = x, the resulting link function is called the canonical link function. In the examples above, linear regression, logistic regression, and Poisson regression all used the canonical link function, but negative binomial regression did not. The canonical link function for negative binomial regression maps µ to log(αµ/(1 + αµ)), but this does not respect the domain of the linear predictor (it is always negative) and is harder to interpret than the usual log link.

Another setting where non-canonical links arise is the use of the log link function with binary data in place of the logit link. In this case, the coefficients β are related to the log relative risk rather than to the log odds.

Estimating equations and quasi-likelihood

As noted above, the regression parameters in a GLM can be estimated by solving the estimating equations

    Σ_i ∂µ_i/∂β · (y_i − µ_i(β)) / v_i(β) = 0.

Note that we only need to correctly specify v_i(β) up to a constant. For example, in the Gaussian case, we can set v_i(β) = 1.

This opens up the possibility of specifying a large class of regression models through their first two moments, represented by the functions µ(β) and v(β). For example, we can get quasi-Poisson regression by specifying µ_i(β) = exp(x_i'β) and v_i(β) = µ_i(β). This formulation of quasi-Poisson regression never refers to the Poisson distribution directly; it only depends on moments. It can be shown that solving the quasi-likelihood equations generally gives consistent estimates of β, as long as the data are sampled from a population in which the specified mean and variance functions are correct.

Wedderburn introduced a "quasi-likelihood" function that can be used when working with estimating equations. It has the form

    Q(y; µ, v) = ∫_0^µ (y − u)/v(u) du.

Since ∂Q/∂β = ∂µ/∂β · ∂Q/∂µ, and ∂Q/∂µ = (y − µ)/v(µ) by the fundamental theorem of calculus, we see that ∂Q/∂β gives the estimating equations discussed above.

In some cases the quasi-likelihood is an actual likelihood, but even if it is not, we can use it in place of a likelihood for many purposes. For example, we can define a "quasi-information criterion" QIC, analogous to AIC for model selection, as

    Σ_i Q(y_i, µ̂_i, v) − p,

where p = dim(β). This quantity is to be maximized, or alternatively we can minimize

    −2 Σ_i Q(y_i, µ̂_i, v) + 2p.
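As an illustration (not from the slides), for the quasi-Poisson variance function v(u) = u the quasi-likelihood works out to y log µ − µ, up to additive terms that do not depend on µ (the lower limit of the integral contributes only such terms), which is all that matters when comparing models. A minimal sketch of Q and the QIC described above, with helper names of my choosing:

```python
import numpy as np

def quasi_poisson_Q(y, mu):
    """Quasi-likelihood for the quasi-Poisson variance function v(u) = u.
    Integrating (y - u)/u with respect to u gives y*log(mu) - mu,
    up to terms that do not depend on mu."""
    return y * np.log(mu) - mu

def qic(y, mu_hat, p):
    # Quasi-information criterion as described above: sum_i Q_i - p
    # (larger is better); equivalently, minimize -2*sum(Q) + 2p.
    return np.sum(quasi_poisson_Q(y, mu_hat)) - p
```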
Scale parameters and quasi-likelihood

Since the quasi-likelihood estimating equations are homogeneous, we can estimate the mean structure µ = exp(β'x) in a setting where the specified variance is off by a multiplicative constant. For example, these estimating equations can be used to consistently estimate β in a quasi-Poisson model where var[y_i | x_i] = φ E[y_i | x_i].

This is a quasi-likelihood estimator, because there is no single "quasi-Poisson distribution". There are many distributions that have this variance structure, but the solution to these estimating equations is not the MLE for a specific distribution. This can be viewed as a way to construct a consistent estimator for β that can be used for any distribution where the conditional variance has this structure.

In a quasi-likelihood analysis, the scale parameter is usually estimated in a separate step, after the regression parameters β are estimated by solving the estimating equations. There are several related ways to estimate the scale parameter. A common approach is to use

    φ̂ = Σ_i (y_i − µ̂_i)² / v̂_i / (n − p).

Overdispersion

Under the Poisson model, var[y|x] = E[y|x]. A Poisson model results from using the Poisson GLM with the scale parameter φ fixed at 1. The quasi-Poisson model is the Poisson model with a scale parameter that may be any non-negative value. Under the quasi-Poisson model, var[y|x] ∝ E[y|x]. The negative binomial GLM allows the variance to be non-proportional to the mean.

Any situation in which var[y|x] > E[y|x] is called overdispersion. Overdispersion is often seen in practice. One mechanism that may give rise to overdispersion is heterogeneity. Suppose we have a hierarchical model in which λ follows a gamma distribution, and y | λ is Poisson with mean parameter λ. Then marginally, y is negative binomial.

Shape and other auxiliary parameters

We have seen that the scale parameter can be estimated independently of the regression parameters β. Some GLM's (or quasi-GLM's) contain additional parameters that cannot be estimated independently of β. One example of this is the shape parameter α in the negative binomial GLM. The shape parameter can be estimated by maximum likelihood, together with β (using a profile likelihood technique), or it can be selected using QIC. Gamma and beta GLM's also have auxiliary parameters that are estimated in this way.

Model comparison for GLM's

If φ is held fixed across models, then twice the log-likelihood ratio between two nested models θ̂^(1) and θ̂^(2) is

    L ≡ 2 Σ_i (y_i θ̂_i^(1) − γ(θ̂_i^(1)))/φ − 2 Σ_i (y_i θ̂_i^(2) − γ(θ̂_i^(2)))/φ,

where θ̂^(2) is nested within θ̂^(1), so L ≥ 0. This is called the scaled deviance. The statistic D = φL, which does not depend explicitly on φ, is called the deviance.

Suppose that θ̂^(1) is the saturated model, in which θ_i = y_i. If the GLM is Gaussian and g(x) ≡ x, as discussed above, the deviance is

    D = 2 Σ_i (y_i² − y_i²/2) − 2 Σ_i (y_i θ̂_i^(2) − (θ̂_i^(2))²/2)
      = Σ_i ( y_i² − 2 y_i θ̂_i^(2) + (θ̂_i^(2))² )
      = Σ_i (y_i − θ̂_i^(2))².

Thus in the Gaussian case, the deviance is the residual sum of squares for the smaller model θ̂^(2). In the Gaussian case, D/φ = L ∼ χ²_{n−p−1}. When φ is unknown, we can turn this around to produce an estimate of the scale parameter

    φ̂ = D / (n − p − 1).

This is an unbiased estimate in the Gaussian case, but it is useful for any GLM.
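The following simulation sketch (not from the slides) illustrates the scale-estimation ideas above: counts are generated from a gamma-Poisson mixture, so they are overdispersed relative to the Poisson; the quasi-Poisson mean structure is then fit by Fisher scoring with the log link, and the scale parameter is estimated with the Pearson-type estimator φ̂ = Σ_i (y_i − µ̂_i)²/µ̂_i / (n − p) given earlier. All parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])
mu_true = np.exp(X @ beta_true)

# Gamma-Poisson mixture: lambda ~ Gamma(1/alpha, scale=alpha*mu) has mean mu
# and variance alpha*mu^2, so marginally var[y|x] = mu + alpha*mu^2 > mu.
alpha = 0.5
lam = rng.gamma(shape=1 / alpha, scale=alpha * mu_true)
y = rng.poisson(lam)

# Fit the quasi-Poisson mean structure by Fisher scoring (log link, v = mu).
beta = np.zeros(2)
for _ in range(50):
    mu = np.exp(X @ beta)
    U = X.T @ (y - mu)              # canonical-link estimating function
    W = (X.T * mu) @ X              # expected information
    beta = beta + np.linalg.solve(W, U)
mu_hat = np.exp(X @ beta)

# Pearson-type estimate of the scale parameter phi.
p = X.shape[1]
phi_hat = np.sum((y - mu_hat) ** 2 / mu_hat) / (n - p)
print(beta, phi_hat)   # phi_hat well above 1 indicates overdispersion
```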
Now suppose we want to compare two nested generalized linear models with deviances D_1 < D_2. Let p_1 > p_2 be the number of covariates in each model. The likelihood ratio test statistic is

    L_2 − L_1 = (D_2 − D_1) / φ,

which asymptotically has a χ²_{p_1 − p_2} distribution.

If φ is unknown, we can estimate it as described above (using the larger of the two models). The "plug-in" likelihood ratio statistic (D_2 − D_1)/φ̂ is still asymptotically χ²_{p_1 − p_2}, as long as φ̂ is consistent. The finite sample distribution may be better approximated using

    (D_2 − D_1) / ( φ̂ (p_1 − p_2) ) ≈ F_{p_1 − p_2, n − p_1}.

We can compare any two fitted GLM's using model selection statistics like AIC or BIC. AIC favors models having large values of L_opt − df, where L_opt is the maximized log-likelihood and df is the degrees of freedom. Equivalently, up to a constant, the AIC criterion can be expressed as −D/(2φ̂) − p − 1. The same φ̂ value should be used for all models being compared (i.e. by using the one from the largest model).
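To close, here is a hedged sketch (not from the slides) of the nested-model comparison and the AIC-style criterion described above, for Poisson-family fits. The deviance is computed with the standard Poisson formula D = 2 Σ_i [y_i log(y_i/µ̂_i) − (y_i − µ̂_i)] (with y log(y/µ̂) taken as 0 when y = 0), which is not derived in these slides; the helper names are mine, and scipy is used only for the χ² and F tail probabilities.

```python
import numpy as np
from scipy import stats

def poisson_deviance(y, mu):
    # D = 2 * sum( y*log(y/mu) - (y - mu) ), with y*log(y/mu) = 0 when y = 0.
    safe_y = np.where(y > 0, y, 1.0)
    term = np.where(y > 0, y * np.log(safe_y / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

def compare_nested(y, mu_big, mu_small, p_big, p_small, phi_hat=1.0):
    """Compare nested fits: model 1 (p_big covariates, fitted means mu_big,
    deviance D1) against model 2 (p_small covariates, deviance D2), D1 < D2."""
    n = len(y)
    D1 = poisson_deviance(y, mu_big)
    D2 = poisson_deviance(y, mu_small)
    lr = (D2 - D1) / phi_hat                     # approx chi^2 with p_big - p_small df
    chi2_pvalue = stats.chi2.sf(lr, df=p_big - p_small)
    f_stat = (D2 - D1) / (phi_hat * (p_big - p_small))
    f_pvalue = stats.f.sf(f_stat, p_big - p_small, n - p_big)
    aic_crit = -D1 / (2 * phi_hat) - p_big - 1   # larger is better, as above
    return lr, chi2_pvalue, f_stat, f_pvalue, aic_crit
```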