6. Power and Sample Size
The power of an experiment is the probability that it will detect a treatment effect, if one is present.
Six factors are intimately linked, so that if we know five of them we can estimate the sixth.
A “power analysis” is often used to determine sample size. The use of too many animals (or other experimental units) wastes animals, money, time and effort, and it is unethical. But if too few animals are used the experiment may lack power and miss a scientifically important response to the treatment. This also wastes resources and could have serious consequences, particularly in safety assessment.
The null hypothesis
In a controlled experiment the aim is usually to compare two or more means (or sometimes medians or proportions). We normally set up a “null hypothesis” that there is no difference between the means, and the aim of our experiment is to disprove that null hypothesis.
However, as a result of inter-individual variability we may make a mistake. If we fail to find a true difference, then we have a false negative result, also known as a Type II or beta error. Conversely, if we think that there is a difference when in fact it is just due to chance sampling variation, then we have a false positive, Type I, or alpha error. These are shown in the table below.
Type I errors are controlled by choosing the significance level. A 5% level means that, on average, 1 in 20 comparisons will be "significant" when the difference is just due to sampling variation.
Control of Type II errors is more difficult as it depends on the relationship among several variables, the most important of which are the “signal” (difference between means of the groups), the “noise” (inter-individual variability) and the sample size. We can often use a power analysis to estimate the required sample size as discussed below.
The figure shows the six variables involved in a power analysis. They are interrelated such that if any five of them are specified, the sixth one can be estimated.
Normally, the power analysis is used to estimate sample size. But if that is fixed (e.g. only 20 subjects are available) then it can be used to estimate the signal or the power of a proposed experiment.
The signal
This is the magnitude of the difference between the means of the two groups (M1 - M2) that is likely to be of clinical or scientific importance. It has to be specified by the investigator.
A small difference may not be of much interest. A large one will be. What is the cutoff below which the difference is of little interest?
In applied research it should be possible to specify an effect size, but in fundamental research you may just want to know whether there are any differences between the two groups.
In this case you will have to use another method of determining sample size such as the Resource Equation (see later). But if you have an estimate of the standard deviation it is still worth doing a power analysis to estimate the effect size you are likely to be able to detect for the sample size you decide to use. If you then fail to detect a statistically significant effect you will be able to say something like “if the effect had been as large as XX standard deviations I would have had (say) a 90% chance of detecting it”. Remember, if you specify five of the above variables you can estimate the sixth one. So in practice you can estimate sample size or effect size or power (you are less likely to want to estimate the other two variables).
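When the sample size is fixed in advance, the same relationship can be turned around to estimate the smallest standardised effect size you are likely to detect. The sketch below uses the standard normal approximation (the exact t-based answer from a program such as G*Power will be slightly larger); the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def detectable_d(n_per_group, power=0.9, alpha=0.05):
    """Smallest signal/noise ratio (standardised effect size) detectable
    with the given per-group sample size, for a two-sided comparison of
    two group means, using the normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile corresponding to the power
    return (z_alpha + z_beta) * sqrt(2 / n_per_group)

# With 10 subjects per group there is a 90% chance of detecting an
# effect of about 1.45 standard deviations.
print(round(detectable_d(10), 2))  # 1.45
```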
The noise
This is the variation among the experimental subjects, expressed as the standard deviation (in the case of measurement characters). It needs to come from previous studies or a pilot study. If no good estimate is available, it may still be worth doing a power analysis with a low and a high estimate to see what difference this makes to the estimated sample size.
Noise does not need to be estimated when comparing two proportions. It is sufficient just to specify the other variables.
The signal/noise ratio
This is also known as the "standardised effect size" or "Cohen's d". It is sometimes used as a general indication of the magnitude of an effect. For example, Cohen, in his book Statistical Power Analysis for the Behavioral Sciences (Hillsdale, NJ: Lawrence Erlbaum Associates, 1988), suggested that values of d of 0.2, 0.5 and 0.8 should be considered "small", "medium" and "large" effect sizes respectively in psychological research. However, much larger effects are usually seen in work with laboratory animals, because the noise is usually so well controlled. In this case small, medium and large effects might more realistically be set at d = 0.5, 1.0 and 1.5, respectively.
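As a minimal sketch, d is simply the signal divided by the noise; the figures below are taken from the dog blood-pressure example later in this section.

```python
m1, m2 = 141.0, 121.0  # group means in mm Hg (a 20 mm difference, the signal)
sd = 36.0              # standard deviation (the noise)

d = abs(m1 - m2) / sd  # Cohen's d: the signal/noise ratio
print(round(d, 2))     # 0.56
```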
The other variables
The remaining variables are the significance level (usually set at 5%), the power (often set at 80-90%) and the sample size, which is usually what we want to estimate.
Determining sample size by power analysis
Assume that you plan an experiment with just two groups (Treated and Control) and that you will measure a metric character.
Your null hypothesis is that there is no difference between the means of the two groups. The steps that you need to take are as follows:
1. Specify the difference between the means (the signal) that would be of clinical or scientific importance.
2. Obtain an estimate of the standard deviation (the noise) from previous or pilot studies.
3. Choose a significance level (usually 5%).
4. Choose the required power (often 80-90%).
5. Decide whether the test will be one- or two-sided (usually two-sided).
6. Use a table, web site or program such as G*Power to estimate the required sample size.
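The sample-size calculation itself can be sketched in Python using only the standard library. This is the normal approximation to the t-test, which slightly underestimates the exact answer for small samples; programs such as G*Power or R's power.t.test do the exact t-based calculation.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.9, alpha=0.05):
    """Approximate per-group sample size for a two-sided comparison of two
    means. d is the signal/noise ratio (difference in means / SD)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile corresponding to the power
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

print(n_per_group(0.6))   # 59 per group, as in the dog example below
print(n_per_group(0.56))  # 68 per group
```

Note how sensitive the answer is to the signal/noise ratio: a change from 0.6 to 0.56 adds nine animals per group.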
What if there are more than two groups?
It is technically possible to do a power analysis for an analysis of variance with several treatment groups. The problem is to specify an effect size of clinical or scientific importance when there are three or more groups. One alternative is to power the experiment assuming a t-test on the two groups likely to be most extreme such as the control and top dose (assuming there are such groups). This would mean that if the response is stronger than expected, then differences between the control and an intermediate group would become statistically significant.
Another alternative would be to specify a “small”, “medium” or “large” effect size (possibly d=0.5, 1.0 or 1.5 in the case of laboratory animals) and the number of treatment groups and use the G*Power program (below) to estimate sample sizes. A screen shot of such a calculation for an experiment with five treatment groups with an effect size of 1.0, a power of 0.9 and a significance level of 0.05 is shown below. This would require 25 animals.
G*Power will also accept the estimated means of the four groups that would be of scientific interest were they to be found together with a pooled estimate of the standard deviation, and do the power analysis on that.
Power analysis for comparing two percentages (or proportions)
A power analysis for comparing two proportions requires the expected proportion in the control group (p1), the proportion of responders in the treated group that would represent a difference of clinical or scientific importance (p2), the specified power and the significance level. The table below shows the numbers needed in each group for 80% power and a 5% significance level. Note that large numbers are needed in some cases.
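The standard normal-approximation formula for two independent proportions can be sketched as follows. The proportions in the example call (a rise from 10% to 25% of responders) are purely illustrative, since p2 depends on what is of scientific importance in a given study.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_two_proportions(p1, p2, power=0.8, alpha=0.05):
    """Approximate per-group sample size for a two-sided comparison of two
    proportions (normal approximation, no continuity correction)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    p_bar = (p1 + p2) / 2  # average proportion under the null hypothesis
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Detecting a rise from 10% to 25% responders needs roughly 100 per group.
print(n_two_proportions(0.10, 0.25))  # 100
```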
A web site that will do the calculations
Although there is probably sufficient information in the table above and the example below for you to estimate your required sample size, the web site at http://www.biomath.info will do the calculations for you.
A free program for power calculations
A free program, G*Power, includes calculations for the t-test, the F-test (one-way analysis of variance) and others. It can be downloaded from its web site.
An example comparing two means
A vet wants to compare the effect on blood pressure of two anesthetics for dogs under clinical conditions, and some preliminary data have been published. The dogs were unsexed healthy animals weighing 3.8 to 42.6 kg. Mean systolic blood pressure was 141 mm Hg with a standard deviation of 36 mm (the noise).
A difference in blood pressure of 20 mm Hg (the signal) or more would be of clinical importance (a clinical, not a statistical, decision).
The signal/noise ratio would then be 20/36 = 0.56.
From the table above, the required sample size for an S/N ratio of 0.6 is about 59 dogs/group.
Great accuracy is not needed, as there are uncertainties in the estimates of the standard deviation and of the effect size of clinical importance. However, many statistical software packages will do the calculations. The output below was produced by the R statistical package for this set of data. Here "delta" is the signal/noise ratio and the SD is set to one, but the signal and noise could have been entered separately. Note that the sample size needs to be rounded up to a whole number. (A small change in the S/N ratio, from 0.6 to 0.56, makes quite a difference to the estimate: from 59 to 68 dogs per group.)
Sixty-eight dogs per group (136 in total) is a lot of dogs, and using such animals would be time-consuming.
In the same journal, an investigator was working with male Beagles weighing 17-23 kg. These had a mean BP of 108 mm Hg with an SD of 9 mm.
Assume that, as before, a 20 mm difference between groups would be of clinical importance. With the same assumptions as above, the signal/noise ratio is 20/9 = 2.22. This requires only 6 dogs/group with a 90% power (see table above).
So, by using uniform animals, the number needed is reduced to about 1/11th of that needed with the random-bred dogs. The table below summarises the situation. It also shows that if the vet had gone ahead and used the random-bred dogs with eight dogs per group, there would only have been an 18% chance of detecting a 20 mm difference in means between the two groups.
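The power of a proposed design can be checked the same way. The sketch below uses the normal approximation, which gives about 20% for eight random-bred dogs per group, close to the 18% from the exact t-based calculation; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def power_two_means(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two group means
    (normal approximation; the exact t-test power is slightly lower)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    # Probability that the observed difference exceeds the critical value
    return z.cdf(d * sqrt(n_per_group / 2) - z_alpha)

# Random-bred dogs: d = 20/36, only 8 per group available.
print(round(power_two_means(20 / 36, 8), 2))  # about 0.2
```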
This poses a problem. Can Beagles be regarded as representing “dogs”?
And is there ever any case for using genetically heterogeneous animals if all it does is increase noise and reduce the power of the experiment, leading to false negative results?
1. If each dog could be given both anesthetics (say, in random order on different days), then it would be possible to use small numbers of even quite heterogeneous dogs, assuming that there are no important breed differences in response. Technically, this would be a randomised block design (discussed later).
2. If it is thought that there may be breed differences in response, then the vet could restrict the study to small numbers of animals of several (say 3-4) breeds in a "factorial" experimental design, discussed later. As far as possible there should be equal numbers in each group. This would indicate whether the two anesthetics differ overall and whether breed differences need to be taken into account when choosing one of them.
The Resource Equation: another method of determining sample size
A power analysis is not always possible.
An alternative is the "Resource Equation" method. This depends on the law of diminishing returns. It needs an estimate of E:
E= (Total number of experimental units)-(number of treatment groups)
E should be between 10 and 20.
This is not an absolute cutoff. There may be a case for E being higher if it leads to a more balanced design, if the likely cost of a Type II error is high, if the procedures are very mild, or if it is an in-vitro experiment with no ethical implications.
E is the number of error (residual) degrees of freedom that would be available in an analysis of variance (ANOVA) of the experiment. The rule is based on the need to obtain an adequate estimate of the standard deviation.
The plot above right shows the amount of information in a sample of data as a function of E. The curve rises steeply, then tails off; it has almost flattened by the time E = 10, and there is little extra benefit from going much beyond 20. However, if experimental units are inexpensive (such as tissue culture dishes), then a larger E may be justified.
Suppose you decide to do an experiment with four treatment groups (a control and three dose levels) and eight animals per group. Then:
E= 32 – 4 = 28. So this is unnecessarily large.
With six animals per group, E = 24 - 4 = 20, which is acceptable.
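The arithmetic is trivial, but a small helper makes the rule easy to apply when comparing candidate designs; the function name is illustrative.

```python
def resource_E(total_units, n_groups):
    """E from the Resource Equation: the error degrees of freedom left for
    estimating the standard deviation. Aim for roughly 10 to 20."""
    return total_units - n_groups

print(resource_E(32, 4))  # 28: unnecessarily large
print(resource_E(24, 4))  # 20: acceptable
```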
This method is easy to apply and can be used when there are many outcomes; it requires neither an estimate of the effect size of clinical or scientific importance nor an estimate of the standard deviation. But it is crude compared with a power analysis.
Conclusion: Use a power analysis where possible. Use the resource equation when a power analysis is not possible.