3 Data analysis step by step
In this chapter we provide a checklist with some guidance for data analysis. However, do not expect the list to be complete and for different studies, a different order of the steps may make more sense. We usually repeat steps 3.2 to 3.8 until we find a model that fit the data well and that is realistic enough to be useful for the intended purpose. Data analysis is always a lot of work and, often, the following steps have to be repeated many times until we find a useful model.
There is a chance and danger at the same time: we may find interesting results that answer different questions than we asked originally. They may be very exciting and important, however they may be biased. We can report such findings, but we should state that they appeared (more or less by chance) during the data exploration and model fitting phase, and we have to be aware that the estimates may be biased because the study was not optimally designed with respect to these findings. It is important to always keep the original aim of the study in mind. Do not adjust the study question according to the data. We also recommend reporting what the model started with at the first iteration and describing the strategy and reasoning behind the model development process.
3.1 Plausibility of Data
Prepare the data and check graphically, or via summary statistics, whether all the data are plausible. Prepare the data so that errors (typos, etc.) are minimal, for example, by double-checking the entries. See chapter ?? for useful R-code that can be used for data preparation and to make plausibility controls.
Think about the direct and indirect relationships among the variables of the study. We normally start a data analysis by drawing a sketch of the model including all explanatory variables and interactions that may be biologically meaningful. We will most likely repeat this step after having looked at the model fit. To make the data analysis transparent we should report every model that was considered. A short note about why a specific model was considered and why it was discarded helps make the modeling process reproducible.
3.3 Data Distribution
What is the nature of the variable of interest (outcome, dependent variable)? At this stage, there is no use of formally comparing the distribution of the outcome variable to a statistical distribution, because the rawdata is not required to follow a specific distribution. The models assume that conditional on the explanatory variables and the model structure, the outcome variable follows a specific distribution. Therefore, checking how well the chosen distribution fits to the data is done after the model fit 3.8. This first choice is solely done based on the nature of the data. Normally, our first choice is one of the classical distributions for which robust software for model fitting is available.
Here is a rough guideline for this first choice:
continuous measurements \(\Longrightarrow\) normal distribution
> exceptions: time-to-event data \(\Longrightarrow\) see survival analysis
count \(\Longrightarrow\) Poisson or negative-binomial distribution
count with upper bound (proportion) \(\Longrightarrow\) binomial distribution
binary \(\Longrightarrow\) Bernoully distribution
rate (count by a reference) \(\Longrightarrow\) Poisson including an offset
nominal \(\Longrightarrow\) multinomial distribution
Chapter 4 gives an overview of the distributions that are most relevant for ecologists.
3.4 Preparation of Explanatory Variables
- Look at the distribution (histogram) of every explanatory variable: Linear models do not assume that the explanatory variables have any specific distribution. Thus there is no need to check for a normal distribution! However, very skewed distributions result in unequal weighting of the observations in the model. In extreme cases, the slope of a regression line is defined by one or a few observations only. We also need to check whether the variance is large enough, and to think about the shape of the expected effect. The following four questions may help with this step:
Is the variance (of the explanatory variable) big enough so that an effect of the variable can be measured?
Is the distribution skewed? If an explanatory variable is highly skewed, it may make sense to transform the variable (e.g., log, square-root).
Does it show a bimodal distribution? Consider making the variable binary.
Is it expected that a change of 1 at lower values for x has the same biological effect as a change of 1 at higher values of x? If not, a trans- formation (e.g., log) could linearize the relationship between x and y.
Centering: Centering (\(x.c = x-mean(x)\)) is a transformation that produces a variable with a mean of 0. Centering is optional. We have two reasons to center a predictor variable. First, it helps the model fitting algorithm to better converge because it reduces correlations among model parameters. Second, with centered predictors, the intercept and main effects in the linear model are better interpretable (they are measured at the center of the data instead of at the covariate value of 0 which may be far off).
Scaling: Scaling (\(x.s = x/c\), where \(c\) is a constant) is a transformation that changes the unit of the variable. Also scaling is optional. We have three reasons to scale an predictor variable. First, to make the effect sizes better understandable. For example, a population change from one year to the next may be very small and hard to interpret. When we give the change for a 10-year period, its ecological meaning is better understandable. Second, to make the estimate of the effect sizes comparable between variables, we may use \(x.s = x/sd(x)\). The resulting variable has a unit of one standard deviation. A standard deviation may be comparable between variables that oritinally are measured in different units (meters, seconds etc). A. Gelman and Hill (2007) (p. 55 f) propose to scale the variables by two times the standard deviation (\(x.s = x/(2*sd(x))\)) to make effect sizes comparable between numeric and binary variables. Third, scaling can be important for model convergence, especially when polynomials are included. Also, consider the use of orthogonal polynomials, see Chapter 4.2.9 in Korner-Nievergelt et al. (2015).
Collinearity: Look at the correlation among the explanatory variables (pairs plot or correlation matrix). If the explanatory variables are correlated, go back to step 2. Also, Chapter 4.2.7 in Korner-Nievergelt et al. (2015) discusses collinearity.
Are interactions and polynomial terms needed in the model? If not already done in step 2, think about the relationship between each explanatory variable and the dependent variable. Is it linear or do polynomial terms have to be included in the model? If the relationship cannot be described appropriately by polynomial terms, think of a nonlinear model or a generalized additive model (GAM). May the effect of one explanatory variable depend on the value of another explanatory variable (interaction)?
3.5 Data Structure
After having taken into account all of the (fixed effect) terms from step 4: are the observations independent or grouped/structured? What random factors are needed in the model? Are the data obviously temporally or spatially correlated? Or, are other correlation structures present, such as phylogenetic relationships? Our strategy is to start with a rather simple model that may not account for all correlation structures that in fact are present in the data. We first, only include those that are known to be important a priory. Only when residual analyses reveals important additional correlation structures, we include them in the model.
3.6 Define Prior Distributions
Decide whether we would like to use informative prior distributions or whether we would like use priors that only have a negligible effect on the results. When the results are later used for informing authorities or for making a decision (as usual in applied sciences), then we would like to base the results on all information available. Information from the literature is then used to construct informative prior distributions. In contrast to applied sciences, in basic research we often would like to show only the information in the data that should not be influenced by earlier results. Therefore, in basic research we look for priors that do not influence the results.
3.7 Fit the Model
Fit the model.
3.8 Check Model
We assess model fit by graphical analyses of the residuals (Chapter 6 in Korner-Nievergelt et al. (2015)), by predictive model checking (Section 10.1 in Korner-Nievergelt et al. (2015)), or by sensitivity analysis (Chapter 15 in Korner-Nievergelt et al. (2015)).
For non-Gaussian models it is often easier to assess model fit using pos- terior predictive checks (Chapter 10 in Korner-Nievergelt et al. (2015)) rather than residual analyses. Posterior predictive checks usually show clearly in which aspect the model failed so we can go back to step 2 of the analysis. Recognizing in what aspect a model does not fit the data based on residual plots improves with experience. Therefore, we list in Chapter 16 of Korner-Nievergelt et al. (2015) some patterns that can appear in residual plots together with what these patterns possibly indicate. We also indicate what could be done in the specific cases.
3.9 Model Uncertainty
If, while working through steps 1 to 8, possibly repeatedly, we came up with one or more models that fit the data reasonably well, we then turn to the methods presented in Chapter 11 (Korner-Nievergelt et al. (2015)) to draw inference from more than one model. If we have only one model, we proceed to 3.10.
3.10 Draw Conclusions
Simulate values from the joint posterior distribution of the model parameters (sim or Stan). Use these samples to present parameter uncertainty, to obtain posterior distributions for predictions, probabilities of specific hypotheses, and derived quantities.
- R for Data Science by Garrett Grolemund and Hadley Wickham: Introduces the tidyverse framwork. It explains how to get data into R, get it into the most useful structure, transform it, visualise it and model it.