12 Assessing Model Assumptions

12.1 Model Assumptions

Every statistical model makes assumptions. We try to build models that reflect the data-generating process as realistically as possible, but a model is never the truth. Yet all inferences drawn from a model, such as estimates of effect size or derived quantities with credible intervals, rest on the assumption that the model is true. If a model captures the data-generating process poorly, for example because it misses important structures (predictors, interactions, polynomials), inferences drawn from it are probably biased and the results become unreliable. In a (hypothetical) model that captures all important structures of the data-generating process, the stochastic part, that is, the differences between the observations and the fitted values (the residuals), should show only random variation. Analyzing residuals is therefore a very important part of the data analysis process.
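To make this concrete, the following R sketch uses simulated data (not data from this book) in which the true relationship is quadratic. All object names (mod_linear, mod_quadratic, and so on) are chosen for illustration only. The residuals of a model that omits the quadratic term should remain clearly correlated with x^2, whereas those of a model that includes it should not.

```r
# Simulated (hypothetical) data: the true relationship is quadratic
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x + 1.5 * x^2 + rnorm(n, sd = 1)

# One model that misses the quadratic term, one that includes it
mod_linear    <- lm(y ~ x)
mod_quadratic <- lm(y ~ x + I(x^2))

# Residuals are observation minus fitted value
res_linear    <- y - fitted(mod_linear)
res_quadratic <- resid(mod_quadratic)

# Residuals of the misspecified model still carry the missing structure
cor(res_linear, x^2)
cor(res_quadratic, x^2)
```

Plotting res_linear against x would show the same thing graphically: a curved pattern instead of a structureless band around zero.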

Residual analysis can be very exciting, because the residuals show what remains unexplained by the present model. Residuals can sometimes show surprising patterns and thereby provide deeper insight into the system. However, at this step of the analysis it is important not to forget the original research questions that motivated the study. Because these questions were asked without knowledge of the data, they protect against data dredging. Of course, residual analysis may raise interesting new questions. Nonetheless, these new questions emerge from patterns in the data, and such patterns might be random rather than systematic. The search for a model with good fit should be guided by thinking about the process that generated the data, not by trial and error (i.e., do not try all possible variable combinations until the residuals look good; that is data dredging). All changes made to the model should be scientifically justified. Usually, model complexity increases, rather than decreases, during the analysis.

12.2 Independent and Identically Distributed

Usually, we model an outcome variable as independent and identically distributed (iid) given the model parameters. This means that all observations with the same predictor values behave like independent random numbers from the same distribution. As a consequence, the residuals should look iid. Independent means that:

  • The residuals do not correlate with other variables (those that are included in the model as well as any other variable not included in the model).

  • The residuals are not grouped (i.e., the means of any set of residuals should all be equal).

  • The residuals are not autocorrelated (i.e., no temporal or spatial autocorrelation exists; Sections 12.4 and 12.5).
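The three points above can be checked roughly as in the following R sketch. It assumes simulated data with a grouping structure that the model deliberately ignores; the group effects, the covariate z, and all object names are hypothetical. The autocorrelation check is only meaningful when the row order reflects time or another natural sequence; here the rows are sorted by group.

```r
# Simulated (hypothetical) data with 8 groups of 25 observations each
set.seed(2)
group_effect <- c(-2, 1.5, -1, 2, 0.5, -1.5, 1, -0.5)  # ignored by the model
group <- factor(rep(seq_along(group_effect), each = 25))
n <- length(group)
x <- rnorm(n)
y <- 0.5 * x + group_effect[as.integer(group)] + rnorm(n, sd = 1)

mod <- lm(y ~ x)   # the grouping is left out on purpose
res <- resid(mod)

# 1. Correlation with a variable that is not in the model
z <- group_effect[as.integer(group)] + rnorm(n)
cor(res, z)

# 2. Grouped residuals: all group means should be close to zero
group_means <- tapply(res, group, mean)
round(group_means, 2)

# 3. Autocorrelation of the residuals in row order (lag 1)
acf_res <- acf(res, plot = FALSE)
acf_res$acf[2]
```

In this sketch all three checks should flag the same omitted structure. In a real analysis, adding the grouping to the model would be the scientifically motivated remedy.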

Identically distributed means that:

  • All residuals come from the same distribution.

In the case of a linear model with normal error distribution (Chapter 11), the residuals are assumed to come from the same normal distribution. In particular:

  • The residual variance is homogeneous (homoscedasticity), that is, it does not depend on any predictor variable, and it does not change with the fitted value.

  • The mean of the residuals is zero over the whole range of predictor values. When numeric predictors (covariates) are present, this implies that the relationship between a predictor x and the outcome y can be adequately described by a straight line.
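A rough numeric check of both points is to bin the residuals by fitted value and compare the mean and standard deviation per bin. The sketch below assumes simulated data in which the error standard deviation grows with x; the data and all names are hypothetical. Under homoscedasticity the standard deviations would be similar across bins, and under a correct mean structure the bin means would all be close to zero.

```r
# Simulated (hypothetical) data: the error sd increases with x
set.seed(3)
n <- 400
x <- runif(n, 0, 10)
y <- 2 + 0.8 * x + rnorm(n, sd = 0.3 + 0.3 * x)

mod <- lm(y ~ x)
res <- resid(mod)

# Split the fitted values into four equally sized bins
bins <- cut(fitted(mod),
            breaks = quantile(fitted(mod), probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

# Mean and standard deviation of the residuals per bin
res_mean <- tapply(res, bins, mean)
res_sd   <- tapply(res, bins, sd)
round(cbind(mean = res_mean, sd = res_sd), 2)
```

The number of bins is an arbitrary choice made for this illustration; a plot of the residuals against the fitted values conveys the same information without binning.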

Residual analysis is mainly done graphically. R makes it very easy to plot residuals to look at the different aspects just listed. As a first example, we use the coal tit example from Chapter 11:

[A part from the book is still missing here.]
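Independently of the coal tit data, the graphical checks can be sketched with simulated data. The example below is a minimal illustration under that assumption, with hypothetical object names; it relies on the default plot method for lm objects in base R, which draws the residuals against the fitted values, a normal QQ-plot, a scale-location plot, and the residuals against leverage.

```r
# Simulated (hypothetical) data that meet the model assumptions
set.seed(4)
n <- 100
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 1)

mod <- lm(y ~ x)

# Four standard diagnostic plots in one graphics window
op <- par(mfrow = c(2, 2))
plot(mod)
par(op)   # restore the previous graphics settings
```

Because these data were simulated from the fitted model structure, the panels should show no systematic pattern; with real data, any curvature, funnel shape, or grouping in these plots deserves a closer look.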

12.3 The QQ-Plot


12.4 Temporal Autocorrelation

12.5 Spatial Autocorrelation

12.6 Heteroscedasticity