         6. Power and Sample Size

The power of an experiment is the probability that it will detect a treatment effect, if one is present.

The six factors listed here are intimately linked so that if we know five of them we can estimate the sixth one.

• Power
• Sample size
• Inter-individual variability
• The magnitude of the response to a treatment
• The significance level
• The alternative hypothesis

A power analysis is often used to determine sample size. The use of too many animals (or other experimental units) wastes animals, money, time and effort, and it is unethical. But if too few animals are used the experiment may lack power and miss a scientifically important response to the treatment. This also wastes resources and could have serious consequences, particularly in safety assessment. The null hypothesis

In a controlled experiment the aim is usually to compare two or more means (or sometimes medians or proportions). We normally set up a null hypothesis that there is no difference between the means, and the aim of our experiment is to disprove that null hypothesis.

However, as a result of inter-individual variability we may make a mistake. If we fail to find a true difference, then we have a false negative result, also known as a Type II or beta error. Conversely, if we think that there is a difference when in fact it is just due to chance sampling variation, then we have a false positive, Type I, or alpha error. These are shown in the table below. Type I errors are controlled by choosing the significance level: a 5% level means that, on average, 1 in 20 comparisons will be significant when the differences are due only to sampling variation.

Control of Type II errors is more difficult, as it depends on the relationship among several variables, the most important of which are the signal (difference between the means of the groups), the noise (inter-individual variability) and the sample size. We can often use a power analysis to estimate the required sample size, as discussed below.

Power analysis

The figure shows the six variables involved in a power analysis. They are interrelated such that if any five of them are specified, the sixth one can be estimated.

Normally, the power analysis is used to estimate sample size. But if that is fixed (e.g. only 20 subjects are available), then it can be used to estimate the signal or the power of a proposed experiment.

The signal

This is the magnitude of the difference between the means of the two groups (M1 - M2) that is likely to be of clinical or scientific importance. It has to be specified by the investigator.

A small difference may not be of much interest. A large one will be. What is the cutoff below which the difference is of little interest?

In applied research it should be possible to specify an effect size, but in fundamental research you may just want to know if there are any differences between the two groups.

In this case you will have to use another method of determining sample size, such as the Resource Equation (see later). But if you have an estimate of the standard deviation, it is still worth doing a power analysis to estimate the effect size you are likely to be able to detect with the sample size you decide to use. If you then fail to detect a statistically significant effect, you will be able to say something like: "if the effect had been as large as XX standard deviations, I would have had (say) a 90% chance of detecting it". Remember, if you specify five of the above variables you can estimate the sixth one. So in practice you can estimate sample size, effect size or power (you are less likely to want to estimate the other two variables).

The noise

This is the variation among the experimental subjects, expressed as the standard deviation (in the case of measurement characters). It needs to come from previous studies or a pilot study. If no good estimate is available, it may still be worth doing a power analysis with a low and a high estimate to see what difference this makes to the estimated sample size.

Noise does not need to be estimated when comparing two proportions. It is sufficient just to specify the other variables.

The signal/noise ratio

This is also known as the standardised effect size or Cohen's d. It is sometimes used as a general indication of the magnitude of an effect. For example, Cohen, in his book "Statistical Power Analysis for the Behavioral Sciences" (Hillsdale, NJ: Lawrence Erlbaum Associates, 1988), suggested that values of d of 0.2, 0.5 and 0.8 should be considered small, medium and large effect sizes respectively in psychological research. However, in work with laboratory animals much larger effects are usually seen, because the noise is usually so well controlled. In this case small, medium and large effects might more realistically be set at d = 0.5, 1.0 and 1.5, respectively.
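As a minimal sketch of the arithmetic, the standardised effect size can be computed from two samples by dividing the difference in means by the pooled standard deviation (the function name is illustrative, not from any particular package):

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group1, group2):
    """Standardised effect size d = (mean1 - mean2) / pooled SD.

    The pooled variance weights each sample variance by its
    degrees of freedom (n - 1), as in a two-sample t-test.
    """
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) +
                  (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)
```

For example, two small groups with means 4 and 2 and a pooled standard deviation of 1 give d = 2.0, a "large" effect even by laboratory-animal standards.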

The other variables

• The alternative hypothesis
The null hypothesis is that the means of the two groups do not differ.
The alternative hypothesis may be that they do differ (two sided), or that they differ in a particular direction, e.g. that the mean of the treated group is greater than the mean of the controls (one sided).
• The significance level
As previously explained, this is usually set at 0.05, but this is quite arbitrary. It is the probability of a false positive result.
• The power
This is the probability that you will be able to detect the effect you specify (the signal). You will probably want a high power, so it is often set at 0.8 or 0.9 (80% or 90%). But a higher power will require a larger sample size.
• The sample size
This is the number in each group. It is usually what we want to estimate. However, we sometimes have only a fixed number of subjects, in which case the power analysis can be used to estimate power or effect size.

Determining sample size by power analysis

Assume that you plan an experiment with just two groups (Treated and Control) and that you will measure a metric character.

Your null hypothesis is that there is no difference between the means of the two groups. The steps that you need to take are as follows:
Group size as a function of S/N ratio (5% significance, 2-sided)

S/N ratio   90% power   80% power
  0.2          526         393
  0.4          132          99
  0.6           59          45
  0.8           34          26
  1.0           22          17
  1.2           16          12
  1.4           12           9
  1.6            9           7
  1.8            8           6
  2.0            6           5
  2.2            6           4
  2.4            5           4
  2.6            4           4
  2.8            4           3
  3.0            4           3

• Decide on your alternative hypothesis. This will be either that the means differ (two sided) or they differ in a particular direction (one sided). The default is two sided.
• Decide the significance level you plan to use. We will assume 5%.
• Decide what power you want (i.e. the chance of detecting a real effect if it is present).
• If the consequences of failing to detect the effect (a Type II error) could be serious, such as in toxicity testing, you might want a relatively high power such as 90%.
• In fundamental studies where we may only be interested in large effects a Type II error may not have such serious consequences. An 80% power may be sufficient to catch large effects and fewer subjects will be needed.
• Obtain an estimate of the noise, i.e. the standard deviation of the character of interest. This has to come from a previous study, the literature or a pilot study. If using the literature it may be best to look at several papers and take some sort of (possibly informal) average or a guesstimate. It is often helpful to do a best and worst case power analysis.
• Estimate the signal (effect size) that might interest you. How large a difference between the two means would be of scientific or clinical interest? If the difference is only small, you are probably not particularly interested in it. If it is large, then you certainly want to be able to detect it. The signal is the cutoff between these two alternatives. If the response is larger, then there will be an even greater chance of detecting it.
• Calculate the standardised effect size (signal/noise ratio): d = (Mean1 - Mean2)/SD.
• The table above shows the S/N ratio over the range 0.2 to 3.0 and the required sample size per group for 80% and 90% power, assuming a 5% significance level and a two-sided test.

What if there are more than two groups?

It is technically possible to do a power analysis for an analysis of variance with several treatment groups. The problem is to specify an effect size of clinical or scientific importance when there are three or more groups. One alternative is to power the experiment assuming a t-test on the two groups likely to be most extreme such as the control and top dose (assuming there are such groups). This would mean that if the response is stronger than expected, then differences between the control and an intermediate group would become statistically significant.

Another alternative would be to specify a small, medium or large effect size (possibly d = 0.5, 1.0 or 1.5 in the case of laboratory animals) and the number of treatment groups, and use the G*Power program (below) to estimate sample sizes. A screen shot of such a calculation for an experiment with five treatment groups, an effect size of 1.0, a power of 0.9 and a significance level of 0.05 is shown below. This would require 25 animals (five per group).

G*Power will also accept the estimated means of the four groups that would be of scientific interest were they to be found, together with a pooled estimate of the standard deviation, and do the power analysis on that.

Power analysis for comparing two percentages (or proportions)

A power analysis for comparing two proportions requires the expected control proportion (p1), the proportion of responders in the treated group that would give a difference of clinical or scientific importance (p2), the specified power and the significance level. The table below shows the numbers needed in each group for 80% power and a 5% significance level. Note that large numbers are needed in some cases.

A web site that will do the calculations
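As a sketch of the arithmetic behind such tables, a standard normal-approximation formula (without continuity correction, so it can slightly underestimate the numbers given by exact methods) for the sample size per group is n = (z_(1-alpha/2) + z_power)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2:

```python
from math import ceil
from statistics import NormalDist

def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group for detecting a difference
    between two independent proportions (two-sided test, normal
    approximation, no continuity correction)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)           # e.g. 0.84 for 80% power
    numerator = (z_alpha + z_power) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return ceil(numerator / (p1 - p2) ** 2)
```

For example, to detect a change in response rate from 50% to 80% with 80% power at the 5% level, `n_per_group_proportions(0.5, 0.8)` gives 36 per group; smaller differences between p1 and p2 quickly drive the numbers much higher.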


Although there is probably sufficient information in the table above and the example below for you to estimate your required sample size, a web site is available that will do the calculations for you.

A free program for power calculations

A free program, G*Power, includes calculations for the t-test, F-test (one-way analysis of variance) and others. It can be downloaded from the G*Power web site.

An example comparing two means

A vet wants to compare the effect on blood pressure of two anesthetics for dogs under clinical conditions. He has published some preliminary data. The dogs were unsexed healthy animals weighing 3.8 to 42.6 kg. Mean systolic blood pressure was 141 mm Hg with a standard deviation of 36 mm Hg (the noise).

Assume:

1. A difference in blood pressure of 20 mm Hg (the signal) or more would be of clinical importance (a clinical, not a statistical, decision);
2. A significance level of 0.05;
3. A power of 90%;
4. A 2-sided t-test.

Then the signal/noise ratio would be 20/36 = 0.56.

From the table above, the required sample size for a S/N ratio of 0.6 is about 59 dogs/group.

(Note that great accuracy is not needed, as there are uncertainties in the estimates of the standard deviation and of the effect size of clinical importance.) However, many statistical software packages will do the calculations. The output below was produced by the R statistical package for this set of data. In this case delta is the signal/noise ratio and the SD is set to one, but the signal and noise could have been entered separately. Note that the sample size needs to be rounded up to a whole number. A small change in the S/N ratio, from 0.6 to 0.56, makes quite a difference to the estimate: from 59 to 68 dogs per group. Sixty-eight dogs per group (136 in total) is a lot of dogs, and using so many animals would be time-consuming.

An alternative

In the same journal an investigator was working with male Beagles weighing 17-23 kg. These had a mean BP of 108 mm Hg with an SD of 9 mm Hg. Assume that a 20 mm difference between groups would be of clinical importance (as before). With the same assumptions as above, the signal/noise ratio is 20/9 = 2.22. This requires only 6/group with a 90% power (see table above).

So, by using uniform animals, the number needed is reduced to about one eleventh of that needed with the random-bred dogs. The table below summarises the situation. It also shows that if the vet had gone ahead with the random-bred dogs, using eight dogs per group, there would only have been an 18% chance of detecting a 20 mm difference in means between the two groups. This poses a problem: can Beagles be regarded as representing dogs?

And is there ever any case for using genetically heterogeneous animals if all it does is increase noise and reduce the power of the experiment, leading to false negative results?

Alternative approaches
It would make no sense to go ahead and do the experiment simply using the heterogeneous dogs. But there are some obvious alternatives.

1. If each dog could be given both anaesthetics (say in random order on different days), then it would be possible to use small numbers of even quite heterogeneous dogs, assuming that there are no important breed differences in response. Technically, this would be a randomised block design (discussed later).

2. If it is thought that there may be breed differences in response, then the vet could restrict the study to small numbers of animals of several (say 3-4) breeds in a factorial experimental design, discussed later. As far as possible there should be equal numbers in each group. This would indicate whether the two anesthetics differ over-all and whether breed differences need to be taken into account when choosing one of these anesthetics.

The Resource Equation: another method of determining sample size

A power analysis is not always possible.

• If lots of characters are being measured, it may not be clear which one is the most important.
• There may be no estimate of the standard deviation if the character has not previously been measured.
• In fundamental research it may be impossible to specify an effect size likely to be of scientific importance.
• A power analysis is difficult with complex experiments involving many treatment groups and possible interactions.

An alternative is the Resource Equation method. This depends on the law of diminishing returns. It needs an estimate of E:

E = (total number of experimental units) - (number of treatment groups)

And E should be between 10 and 20.

This is not an absolute cutoff. There may be a case for E being higher if it leads to a more balanced design, if the likely cost of a Type II error is high, if the procedures are very mild, or if it is an in-vitro experiment with no ethical implications.

E is the number of error degrees of freedom in an analysis of variance (ANOVA). It is based on the need to obtain an adequate estimate of the standard deviation.

The plot above right shows the amount of information in a sample of data as a function of E. The curve rises steeply, then tails off; it has almost flattened off by the time E = 10, and there is little extra benefit from going much beyond 20. However, if experimental units are inexpensive (such as tissue culture dishes), a larger E may be acceptable.

Suppose you decide to do an experiment with four treatment groups (a control and three dose levels) and eight animals per group. Then:

E = 32 - 4 = 28. So this is unnecessarily large.

With six animals per group, E = 24 - 4 = 20, which is acceptable.
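The calculation above is simple enough to sketch directly (the function name is just for illustration):

```python
def resource_equation_E(total_units, n_groups):
    """E from the Resource Equation: total experimental units minus
    number of treatment groups (the error degrees of freedom in a
    one-way ANOVA). Aim for E between roughly 10 and 20."""
    return total_units - n_groups

# Four groups of eight animals: E = 28, unnecessarily large.
# Four groups of six animals:   E = 20, acceptable.
```

For instance, `resource_equation_E(4 * 8, 4)` returns 28 and `resource_equation_E(4 * 6, 4)` returns 20, reproducing the worked example above.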

This method is easy to use, can be applied when there are many outcomes, does not require an estimate of the effect size of clinical or scientific importance, and does not require an estimate of the standard deviation. But it is crude compared with a power analysis.

Conclusion: Use a power analysis where possible. Use the resource equation when a power analysis is not possible.   