11. Correlation, regression, survival
It is quite common to measure several outcome variables on each subject. For example, haematology and clinical biochemistry each involve a large number of different characters. In extreme cases, such as experiments involving gene microarrays, there may be several thousand measurements on each subject. These may be correlated, so statistical tests done separately for each character will not be independent. There is a range of “multivariate” techniques which can sometimes be used in such situations. These include principal components analysis, used to reduce the dimensionality of the data; discriminant function analysis, used to classify individuals into groups; and various sorts of cluster analysis. A description of these methods is beyond the scope of this website. Below is just an outline of simple linear correlation.
Correlation is a measure of the strength of linear association between two random variables measured on the same subject. The most widely used measure is the “product moment” correlation, also known as the Pearson correlation coefficient after the statistician Karl Pearson.
The correlation coefficient is a “pure” number without units, usually designated by the letter “r”. It ranges from r = -1 to r = +1. A correlation of r = 0 implies that the two variables have no linear association. Examples of correlations are shown on the right. The most common statistical test is of whether r differs from 0. Note that r is unchanged by linear changes in the scale of measurement, and that it only measures linear association.
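As a sketch of how r is computed from first principles (the data here are made up purely for illustration):

```python
import math

def pearson_r(x, y):
    # product-moment (Pearson) correlation between two equal-length sequences
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # sum of cross-products
    sxx = sum((a - mx) ** 2 for a in x)                   # sum of squares for x
    syy = sum((b - my) ** 2 for b in y)                   # sum of squares for y
    return sxy / math.sqrt(sxx * syy)

# hypothetical measurements on five subjects
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(x, y)  # close to +1: strong positive linear association
```

Because r is standardised by the sums of squares, rescaling a variable (say, converting units with `[2 * a + 3 for a in x]`) leaves r unchanged.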
There are other types of correlation such as “biserial” correlation, used when one of the variables is dichotomous.
Correlation is a measure of association, but it does not imply causation. If two variables are correlated it may be because variation in one causes variation in the other, or it may be because some other factor causes variation in both. There are many interesting correlations which fall into the latter category, such as:
Larger towns have more babies and also more storks, so babies and storks are correlated across towns, although storks do not deliver babies. Similarly, the number of names people have is correlated with their social class, but you won’t change your social class by adding a few names. However, smoking probably is a cause of lung cancer. The association is strong and biologically plausible, although for obvious reasons it is not possible to confirm this by a randomised, controlled experiment.
An example of some statistical calculations involving correlation is given in the section on statistical analysis.
Regression measures the relationship between a random variable and one or more fixed effect variable(s). It implies “causation” of the former (the dependent variable) by the latter (the independent variable(s)). It is often used to predict a Y-value given an X-value.
Linear regression is used to estimate the straight line which gives the “best” fit to the points, where fitting is based on minimising the sum of the squared vertical deviations of the points from the line (a “least-squares” estimate). This is shown diagrammatically on the right.
Simple linear regression (considered here) has a single independent variable whereas multiple regression has two or more independent variables.
The aim of the statistical analysis is to estimate the slope and intercept of the line and test whether these are significantly different from zero. The page on statistical analysis gives an example.
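A minimal least-squares fit can be sketched as follows (the data are hypothetical, chosen only to show the calculation):

```python
def least_squares(x, y):
    # slope and intercept minimising the sum of squared vertical
    # deviations of the points from the fitted straight line
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx  # line passes through the mean point
    return slope, intercept

# hypothetical X (e.g. dose) and Y (response) values
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = least_squares(x, y)
predicted = intercept + slope * 6  # predict the Y-value for X = 6
```

A full analysis would also compute the standard errors of the slope and intercept so each can be tested against zero, as in the example on the statistical analysis page.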
Note that the X values do not need to be statistically independent of one another. Sometimes the X variable is time, and two adjacent time points are clearly not independent. In some cases the X variable is dose, and the aim is to estimate a dose-response relationship. This is often done using a completely randomised or randomised block design. If there are several doses and these are equally spaced, it is possible to use “orthogonal contrasts” to assess whether there is a linear or even a non-linear dose trend, although a full description of these methods is beyond the scope of this website.
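As an illustration of the idea behind orthogonal contrasts (not a full analysis), the standard tabulated coefficients for four equally spaced doses can be applied to hypothetical group means; a large linear contrast relative to its standard error would suggest a linear dose trend:

```python
# standard orthogonal polynomial contrast coefficients for four
# equally spaced doses (tabulated values)
linear = [-3, -1, 1, 3]
quadratic = [1, -1, -1, 1]

# hypothetical mean responses for the four dose groups
means = [10.0, 12.1, 13.9, 16.0]

# the contrasts are orthogonal: their coefficient products sum to zero
orthogonality = sum(l * q for l, q in zip(linear, quadratic))

linear_contrast = sum(c * m for c, m in zip(linear, means))
quadratic_contrast = sum(c * m for c, m in zip(quadratic, means))
# here the linear contrast is large and the quadratic contrast is near
# zero, consistent with a straight-line dose trend and no curvature
```

In a real analysis each contrast would be tested against its standard error within the analysis of variance.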
An example of the statistical analysis of a regression experiment to test a dose-response relationship is given in the statistics section.
The analysis of survival (and time-to-failure) data requires special statistical methods because observations on some animals may be “right censored”. For example, it may not be possible to extend a study until all the animals have died. All that is known about such an individual is that it survived a certain time; it is not known how long it would have survived had the experiment continued. It would be wrong simply to ignore such animals.
The Kaplan-Meier estimator is used to draw survival curves and estimate median survival times in the presence of this right censoring, and the log-rank test can be used to test whether two or more survival curves are identical. Note that for ethical reasons animals may need to be euthanased rather than being allowed to die, which may lead to some slight biases.
Data need to be presented as a survival time for each individual, with a code indicating whether or not the observation was right censored, and the treatment group to which the individual belongs, assuming two or more treatments are being compared. Further details are given by Altman (1991), and the use of the “survival” package in the R statistical package is described in detail by Dalgaard (2002) and Ekstrom (2012) (see Literature).
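A minimal sketch of the Kaplan-Meier calculation on made-up data may make the handling of censoring clearer (in practice one would use R’s “survival” package or similar):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate with right censoring.

    times  -- survival or censoring time for each animal
    events -- 1 if the animal died, 0 if the observation was censored
    """
    data = sorted(zip(times, events))  # equal times end up adjacent
    n_at_risk = len(data)
    s = 1.0
    curve = []  # (time, estimated survival probability) at each death time
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(e for tt, e in data if tt == t)
        at_this_time = sum(1 for tt, _ in data if tt == t)
        if deaths:
            # animals censored at a death time count as still at risk for it
            s *= (n_at_risk - deaths) / n_at_risk
            curve.append((t, s))
        n_at_risk -= at_this_time  # remove deaths and censorings from risk set
        i += at_this_time
    return curve

# hypothetical data: five animals, two right censored (event code 0)
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
# survival drops only at death times; the censored animals still
# contribute to the number at risk until their censoring times
```

The censored animal at time 3 is not counted as a death, but it does appear in the denominator for the death at time 3, which is why ignoring censored animals would bias the estimate.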