15 Generalized linear mixed models

15.1 Introduction

THIS CHAPTER IS UNDER CONSTRUCTION!!!

In chapter 13 on linear mixed effect models we have introduced how to analyze metric outcome variables for which a normal error distribution can be assumed (potentially after transformation), when the data have a hierarchical structure and, as a consequence, observations are not independent. In chapter 14 on generalized linear models we have introduced how to analyze outcome variables for which a normal error distribution can not be assumed, as for example binary outcomes or count data. More precisely, we have extended modelling outcomes with normal error to modelling outcomes with error distributions from the exponential family (e.g., binomial or Poisson). Generalized linear mixed models (GLMM) combine the two complexities and are used to analyze outcomes with a non-normal error distribution when the data have a hierarchical structure. In this chapter, we will show how to analyze such data. Remember, a hierarchical structure of the data means that the data are collected at different levels, for example smaller and larger spatial units, or include repeated measurements in time on a specific subject. Typically, the outcome variable is measured/observed at the lowest level but other variables may be measured at different levels. A first example is introduced in the next section.

15.1.1 Binomial Mixed Model

15.1.1.1 Background

To illustrate the binomial mixed model we use a subset of a data set used by Grüebler, Korner-Nievergelt, and Von Hirschheydt (2010) on barn swallow Hirundo rustica nestling survival (we selected a nonrandom sample to be able to fit a simple model; hence, the results do not add unbiased knowledge about the swallow biology!). For 63 swallow broods, we know the clutch size and the number of the nestlings that fledged. The broods came from 51 farms (larger unit), thus some of the farms had more than one brood. Note that each farm can harbor one or several broods, and the broods are nested within farms (as opposed to crossed, see chapter 13), i.e., each brood belongs to only one farm. There are three predictors measured at the level of the farm: colony size (the number of swallow broods on that farm), cow (whether there are cows on the farm or not), and dung heap (the number of dung heaps, piles of cow dung, within 500 m of the farm). The aim was to assess how swallows profit from insects that are attracted by livestock on the farm and by dung heaps. Broods from the same farm are not independent of each other because they belong to the same larger unit (farm), and thus share the characteristics of the farm (measured or unmeasured). Predictor variables were measured at the level of the farm, and are thus the same for all broods from a farm. In the model described and fitted below, we account for the non-independence of these clutches when building the model by including a random intercept per farm to model random variation between farms. The outcome variable is a proportion (proportion fledged from clutch) and thus consists of two values for each observation, as seen with the binomial model without random factors (Section 14.2.2): the number of chicks that fledged (successes) and the number of chicks that died (failures), i.e., the clutch size minus number that fledged.

The random factor “farm” adds a farm-specific deviation \(b_g\) to the intercept in the linear predictor. These deviations are modeled as normally distributed with mean \(0\) and standard deviation \(\sigma_g\).

\[ y_i \sim binomial\left(p_i, n_i\right)\\ logit\left(p_i\right) = \beta_0 + b_{g[i]} + \beta_1\;colonysize_i + \beta_2\;I\left(cow_i = 1\right) + \beta_3\;dungheap_i\\ b_g \sim normal\left(0, \sigma_g\right) \]

# Data on Barn Swallow (Hirundo rustica) nestling survival on farms
# (a part of the data published in Grüebler et al. 2010, J Appl Ecol 47:1340-1347)


library(blmeco)
data(swallowfarms)
#?swallowfarms # to see the documentation of the data set
dat <- swallowfarms
str(dat)
## 'data.frame':    63 obs. of  6 variables:
##  $ farm   : int  1001 1002 1002 1002 1004 1008 1008 1008 1010 1016 ...
##  $ colsize: int  1 4 4 4 1 11 11 11 3 3 ...
##  $ cow    : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ dung   : int  0 0 0 0 1 1 1 1 2 2 ...
##  $ clutch : int  8 9 8 7 13 7 9 16 10 8 ...
##  $ fledge : int  8 0 6 5 9 3 7 4 9 8 ...
# check number of farms in the data set
length(unique(dat$farm))
## [1] 51

15.1.1.2 Fitting a Binomial Mixed Model in R

15.1.1.2.1 Using the glmer function
dat$colsize.z <- scale(dat$colsize) # z-transform values for better model convergence
dat$dung.z    <- scale(dat$dung)
dat$die <- dat$clutch - dat$fledge
dat$farm.f <- factor(dat$farm)     # for clarity we define farm as a factor

The glmer function uses the standard way to formulate a statistical model in R, with the outcome on the left, followed by the ~ symbol, meaning “explained by”, followed by the predictors, which are separated by +. The notation for the random factor with only a random intercept was introduced in chapter 13 and is (1|farm.f) here.

Remember that for fitting a binomial model we have to provide the number of successful events (number of fledglings that survived) and the number of failures (those that died) within a two-column matrix that we create using the function cbind.

# fit GLMM using glmer function from lme4 package
library(lme4)
mod.glmer <- glmer(cbind(fledge,die) ~ colsize.z + cow + dung.z + (1|farm.f) , 
                   data=dat, 
                   family=binomial)
15.1.1.2.2 Assessing Model Assumptions for the glmer fit

The residuals of the model look fairly normal (top left panel of Figure 15.1 with slightly wider tails. The random intercepts for the farms look perfectly normal as they should. The plot of the residuals vs. fitted values (bottom left panel) shows a slight increase in the residuals with increasing fitted values. Positive correlations between the residuals and the fitted values are common in mixed models due to the shrinkage effect (chapter 13). Due to the same reason the fitted proportions slightly overestimate the observed proportions when these are large, but underestimate them when small (bottom right panel). What is looking like a lack of fit here can be seen as preventing an overestimation of the among farm variance based on the assumption that the farms in the data are a random sample of farms belonging to the same population.

The mean of the random effects is close to zero as it should. We check that because sometimes the glmer function fails to correctly separate the farm-specific intercepts from the overall intercept. A non-zero mean of random effects does not mean a lack of fit, but a failure of the model fitting algorithm. In such a case, we recommend using a different fitting algorithm, e.g. brm (see below) or stan_glmer from the rstanarm package.

A slight overdispersion (approximated dispersion parameter >1) seems to be present, but nothing to worry about.

par(mfrow=c(2,2), mar=c(3,5,1,1))    
# check normal distribution of residuals
qqnorm(resid(mod.glmer), main="qq-plot residuals")
qqline(resid(mod.glmer))

# check normal distribution of random intercepts
qqnorm(ranef(mod.glmer)$farm.f[,1], main="qq-plot, farm")
qqline(ranef(mod.glmer)$farm.f[,1])

# residuals vs fitted values to check homoscedasticity
plot(fitted(mod.glmer), resid(mod.glmer)) 
abline(h=0)

# plot data vs. predicted values
dat$fitted <- fitted(mod.glmer)
plot(dat$fitted,dat$fledge/dat$clutch)
abline(0,1)
Diagnostic plots to assess model assumptions for mod.glmer. Uppper left: quantile-quantile plot of the residuals vs. theoretical quantiles of the normal distribution. Upper rihgt: quantile-quantile plot of the random effects "farm". Lower left: residuals vs. fitted values. Lower right: observed vs. fitted values.

Figure 15.1: Diagnostic plots to assess model assumptions for mod.glmer. Uppper left: quantile-quantile plot of the residuals vs. theoretical quantiles of the normal distribution. Upper rihgt: quantile-quantile plot of the random effects “farm”. Lower left: residuals vs. fitted values. Lower right: observed vs. fitted values.

# check distribution of random effects
mean(ranef(mod.glmer)$farm.f[,1])
## [1] -0.001690303
# check for overdispersion
dispersion_glmer(mod.glmer)
## [1] 1.192931
detach(package:lme4)
15.1.1.2.3 Using the brm function

Now we fit the same model using the function brm from the R package brms. This function allows fitting Bayesian generalized (non-)linear multivariate multilevel models using Stan (Betancourt 2013) for full Bayesian inference. We shortly introduce the fitting algorithm used by Stan, Hamiltonian Monte Carlo, in chapter 18. When using the function brm there is no need to install rstan or write the model in Stan-language. A wide range of distributions and link functions are supported, and the function offers many things more. Here we use it to fit the model as specified by the formula object above. Note that brm requires that a binomial outcome is specified in the format successes|trials(), which is the number of fledged nestlings out of the total clutch size in our case. In contrast, the glmer function required to specify the number of nestlings that fledged and died (which together sum up to clutch size), in the format cbind(successes, failures). The family is also called binomial in brm, but would be bernoulli for a binary outcome, whereas glmer would use binomial in both situations (Bernoulli distribution is a special case of the binomial). However, it is slightly confusing that (at the time of writing this chapter) the documentation for brmsfamily did not mention the binomial family under Usage, where it probably went missing, but it is mentioned under Arguments for the argument family. Prior distributions are an integral part of a Bayesian model, therefore we need to specify prior distributions. We can see what default prior distributions brm is using by applying the get_prior function to the model formula. The default prior for the effect sizes is a flat prior which gives a density of 1 for any value between minus and plus infinity. Because this is not a proper probability distribution it is also called an improper distribution. The intercept gets a t-distribution with mean of 0, standard deviation of 2.5 and 3 degrees of freedoms. Transforming this t-distribution to the proportion scale (using the inverse-logit function) becomes something similar to a uniform distribution between 0 and 1 that can be seen as non-informative for a probability. For the among-farm standard deviation, it uses the same t-distribution as for the intercept. However, because variance parameters such as standard deviations only can take on positive numbers, it will use only the positive half of the t-distribution (this is not seen in the output of get_prior). When we have no prior information on any parameter, or if we would like to base the results solely on the information in the data, we specify weakly informative prior distributions that do not noticeably affect the results but they will facilitate the fitting algorithm. This is true for the priors of the intercept and among-farm standard deviation. However, for the effect sizes, we prefer specifying more narrow distributions (see chapter 10). To do so, we use the function prior.

To apply MCMC sampling we need some more arguments: warmup specifies the number of iterations during which we allow the algorithm to be adapted to our specific model and to converge to the posterior distribution. These iterations should be discarded (similar to the burn-in period when using, e.g., Gibbs sampling); iter specifies the total number of iterations (including those discarded); chains specifies the number of chains; init specifies the starting values of the iterations. By default (init=NULL) or by setting init="random" the initial values are randomly chosen which is recommended because then different initial values are chosen for each chain which helps to identify non-convergence. However, sometimes random initial values cause the Markov chains to behave badly. Then you can either use the maximum likelihood estimates of the parameters as starting values, or simply ask the algorithm to start with zeros. thin specifies the thinning of the chain, i.e., whether all iterations should be kept (thin=1) or for example every 4th only (thin=4); cores specifies the number of cores used for the algorithm; seed specifies the random seed, allowing for replication of results.

library(brms)

# check which parameters need a prior
get_prior(fledge|trials(clutch) ~ colsize.z + cow + dung.z + (1|farm.f), 
          data=dat, 
          family=binomial(link="logit"))
##                 prior     class      coef  group resp dpar nlpar lb ub
##                (flat)         b                                       
##                (flat)         b colsize.z                             
##                (flat)         b       cow                             
##                (flat)         b    dung.z                             
##  student_t(3, 0, 2.5) Intercept                                       
##  student_t(3, 0, 2.5)        sd                                   0   
##  student_t(3, 0, 2.5)        sd           farm.f                  0   
##  student_t(3, 0, 2.5)        sd Intercept farm.f                  0   
##        source
##       default
##  (vectorized)
##  (vectorized)
##  (vectorized)
##       default
##       default
##  (vectorized)
##  (vectorized)
# specify own priors
myprior <- prior(normal(0,5), class="b")
             

mod.brm <- brm(fledge|trials(clutch) ~ colsize.z + cow + dung.z + (1|farm.f) , 
               data=dat, family=binomial(link="logit"),
               prior=myprior,
               warmup = 500, 
               iter = 2000, 
               chains = 2,
               init = "random", 
               cores = 2,
               seed = 123)
# note: thin=1 is default and we did not change this here.
15.1.1.2.4 Checking model convergence for the brm fit

We first check whether we find warnings in the R console about problems of the fitting algorithm. Warnings should be taken seriously. Often, we find help in the Stan online documentation (or when typing launch_shinystan(mod.brm) into the R-console) what to change when calling the brm function to get a fit that is running smoothly. Once, we get rid of all warnings, we need to check how well the Markov chains mixed. We can either do that by scanning through the many diagnostic plots given by launch_shinystan(mod) or create the most important plots ourselves such as the traceplot (Figure 15.2).

par(mar=c(2,2,2,2))
mcmc_plot(mod.brm, type = "trace")
Traceplot of the Markov chains. After convergence, both Markov chains should sample from the same stationary distribution. Indications of non-convergence would be, if the two chains diverge or vary around different means.

Figure 15.2: Traceplot of the Markov chains. After convergence, both Markov chains should sample from the same stationary distribution. Indications of non-convergence would be, if the two chains diverge or vary around different means.

15.1.1.2.5 Checking model fit by posterior predictive model checking

To assess how well the model fits to the data we do posterior predictive model checking (Chapter 16). For binomial as well as for Poisson models comparing the standard deviation of the data with those of replicated data from the model is particularly important. If the standard deviation of the real data would be much higher compared to the ones of the replicated data from the model, overdispersion would be an issue. However, here, the model is able to capture the variance in the data correctly (Figure 15.3). The fitted vs observed plot also shows a good fit.

yrep <- posterior_predict(mod.brm)
sdyrep <- apply(yrep, 1, sd)

par(mfrow=c(1,3), mar=c(3,4,1,1))
hist(yrep, freq=FALSE, main=NA, xlab="Number of fledglings")
hist(dat$fledge, add=TRUE, col=rgb(1,0,0,0.3), freq=FALSE)
legend(10, 0.15, fill=c("grey",rgb(1,0,0,0.3)), legend=c("yrep", "y"))

hist(sdyrep)
abline(v=sd(dat$fledge), col="red", lwd=2)

plot(fitted(mod.brm)[,1], dat$fledge, pch=16, cex=0.6)
abline(0,1)
Posterior predictive model checking: Histogram of the number of fledglings simulated from the model together with a histogram of the real data, and a histogram of the standard deviations of replicated data from the model together with the standard deviation of the data (vertical line in red). The third plot gives the fitted vs. observed values.

Figure 15.3: Posterior predictive model checking: Histogram of the number of fledglings simulated from the model together with a histogram of the real data, and a histogram of the standard deviations of replicated data from the model together with the standard deviation of the data (vertical line in red). The third plot gives the fitted vs. observed values.

After checking the diagnostic plots, the posterior predictive model checking and the general model fit, we assume that the model describes the data generating process reasonably well, so that we can proceed to drawing conclusions.

15.1.1.3 Drawing Conclusions

The generic summary function gives us the results for the model object containing the fitted model, and works for both the model fitted with glmer and brm. Let’s start having a look at the summary from mod.glmer. The summary provides the fitting method, the model formula, statistics for the model fit including the Akaike information criterion (AIC), the Bayesian information criterion (BIC), the scaled residuals, the random effects variance and information about observations and groups, a table with coefficient estimates for the fixed effects (with standard errors and a z-test for the coefficient) and correlations between fixed effects. We recommend to always check if the number of observations and groups, i.e., 63 barn swallow nests from 51 farms here, is correct. This information shows if the glmer function has correctly recognized the hierarchical structure in the data. Here, this is correct. To assess the associations between the predictor variables and the outcome analyzed, we need to look at the column “Estimate” in the table of fixed effects. This column contains the estimated model coefficients, and the standard error for these estimates is given in the column “Std. Error”, along with a z-test for the null hypothesis of a coefficient of zero. In the random effects table, the among farm variance and standard deviation (square root of the variance) are given. The function confint shows the 95% confidence intervals for the random effects (.sig01) and fixed effects estimates.

In the summary output from mod.brm we see the model formula and some information on the Markov chains after the warm-up. In the group-level effects (between group standard deviations) and population-level effects (effect sizes, model coefficients) tables some summary statistics of the posterior distribution of each parameter are given. The “Estimate” is the mean of the posterior distribution, the “Est.Error” is the standard deviation of the posterior distribution (which is the standard error of the parameter estimate). Then we see the lower and upper limit of the 95% credible interval. Also, some statistics for measuring how well the Markov chains converged are given: the “Rhat” and the effective sample size (ESS). The bulk ESS tells us how many independent samples we have to describe the posterior distribution, and the tail ESS tells us on how many samples the limits of the 95% credible interval is based on.

Because we used the logit link function, the coefficients are actually on the logit scale and are a bit difficult to interpret. What we can say is that positive coefficients indicate an increase and negative coefficients indicate a decrease in the proportion of nestlings fledged. For continuous predictors, as colsize.z and dung.z, this coefficient refers to the change in the logit of the outcome with a change of one in the predictor (e.g., for colsize.z an increase of one corresponds to an increase of a standard deviation of colsize). For categorical predictors, the coefficients represent a difference between one category and another (reference category is the one not shown in the table). To visualize the coefficients we could draw effect plots.

# glmer
summary(mod.glmer)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(fledge, die) ~ colsize.z + cow + dung.z + (1 | farm.f)
##    Data: dat
## 
##      AIC      BIC   logLik deviance df.resid 
##    282.5    293.2   -136.3    272.5       58 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2071 -0.4868  0.0812  0.6210  1.8905 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  farm.f (Intercept) 0.2058   0.4536  
## Number of obs: 63, groups:  farm.f, 51
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.09533    0.19068  -0.500   0.6171  
## colsize.z    0.05087    0.11735   0.434   0.6646  
## cow          0.39370    0.22692   1.735   0.0827 .
## dung.z      -0.14236    0.10862  -1.311   0.1900  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr) clsz.z cow   
## colsize.z  0.129              
## cow       -0.828 -0.075       
## dung.z     0.033  0.139 -0.091
confint.95 <- confint(mod.glmer); confint.95
##                   2.5 %    97.5 %
## .sig01       0.16809483 0.7385238
## (Intercept) -0.48398346 0.2863200
## colsize.z   -0.18428769 0.2950063
## cow         -0.05360035 0.8588134
## dung.z      -0.36296714 0.0733620
# brm
summary(mod.brm)
##  Family: binomial 
##   Links: mu = logit 
## Formula: fledge | trials(clutch) ~ colsize.z + cow + dung.z + (1 | farm.f) 
##    Data: dat (Number of observations: 63) 
##   Draws: 2 chains, each with iter = 2000; warmup = 500; thin = 1;
##          total post-warmup draws = 3000
## 
## Group-Level Effects: 
## ~farm.f (Number of levels: 51) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.55      0.16     0.26     0.86 1.00      910     1284
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -0.10      0.21    -0.52     0.32 1.00     2863     2165
## colsize.z     0.05      0.14    -0.21     0.34 1.00     2266     1794
## cow           0.41      0.25    -0.06     0.90 1.00     3069     2117
## dung.z       -0.15      0.12    -0.38     0.09 1.00     3254     2241
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

From the results we conclude that in farms without cows (when cow=0) and for average colony sizes (when colsize.z=0) and average number of dung heaps (when dung.z=0) the average nestling survival of Barn swallows is the inverse-logit function of the Intercept, thus, plogis(-0.1) = 0.47 with a 95% uncertainty interval of 0.37 - 0.58. We further see that colony size and number of dung heaps are less important than whether cows are present or not. Their estimated partial effect is small and their uncertainty interval includes only values close to zero. However, whether cows are present or not may be important for the survival of nestlings. The average nestling survival in farms with cows is plogis(-0.1+0.41) = 0.58. For getting the uncertainty interval of this survival estimate, we need to do the calculation for every simulation from the posterior distribution of both parameters.

bsim <- posterior_samples(mod.brm)
# survival of nestlings on farms with cows:
survivalest <- plogis(bsim$b_Intercept + bsim$b_cow)
quantile(survivalest, probs=c(0.025, 0.975)) # 95% uncertainty interval
##      2.5%     97.5% 
## 0.5126716 0.6412675

In medical research, it is standard to report the fixed-effects coefficients from GLMM with binomial or Bernoulli error as odds ratios by taking the exponent (R function exp for \(e^{()}\)) of the coefficient on the logit-scale. For example, the coefficient for cow from mod.glmer, 0.39 (95% CI from -0.05 to -0.05), represents an odds ratio of exp( 0.39)=1.48 (95% CI from 0.95 to 0.95). This means that the odds for fledging (vs. not fledging) from a clutch from a farm with livestock present is about 1.5 times larger than the odds for fledging if no livestock is present (relative effect).

15.2 Summary