# Power and Sample Size

What is "statistical power analysis," and how and under what circumstances is it applied?

The size of your study sample is critical to producing meaningful results.

Power is defined as "the probability (chance) that a statistical significance test will correctly reject the null hypothesis." Another way to define it is "the ability of a test to detect an effect, given that the effect actually exists."
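To make the definition concrete, power can be estimated by simulation: generate many experiments in which the effect really exists and count how often the test rejects the null hypothesis. The sketch below is only an illustration under stated assumptions: a two-group design, normally distributed responses, a known common standard deviation (so a z-test applies), and hypothetical values for the effect (5 units) and the standard deviation (10 units). The function name `simulated_power` is made up for this example.

```python
import random
from statistics import NormalDist, fmean


def simulated_power(effect, sd, n_per_group, alpha=0.05, n_sims=2000, seed=1):
    """Estimate the power of a two-sided, two-sample z-test by simulation.

    Assumes normal responses with a known, common standard deviation `sd`.
    All argument values used below are hypothetical.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sd * (2 / n_per_group) ** 0.5  # standard error of the mean difference
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        if abs(z) > z_crit:
            rejections += 1  # the test rejected the null hypothesis
    return rejections / n_sims


# Proportion of simulated studies that detect a true 5-unit effect
print(simulated_power(effect=5.0, sd=10.0, n_per_group=64))
```

With no true effect (`effect=0.0`), the same function estimates the Type I error rate instead, which should come out close to alpha.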

Example: Suppose you want to explore the effect that an exercise program has on a person's quality of life. When designing such an experiment, what questions should you consider?

Initial Considerations. First, you will need an equal number of subjects in each of two groups, with each subject randomly assigned to either the treatment group or the control group. Here are a few important considerations:

- Has any previous research been done on this topic?
- How will the response variable, "quality of life," be measured?
- What experimental design strategy would work best?
- What factors would you need to control or hold constant?
- What personal characteristics should be measured yet are not considered part of the design?
- Sample size: how many subjects are needed?

This is only a partial list of the many important questions that must be carefully considered when designing any data collection activity. Yet one consideration that is critical to the usefulness of the results is often overlooked: sample size.

When performing a statistical power analysis, you'll need to consider the following important information:

1. Significance level α, or the probability of a Type I error. A common, yet arbitrary, choice is α = .05.

2. Power to detect an effect. This is expressed as power = 1 - β, where β is the probability of a Type II error. Power = 0.80 is also a common, yet arbitrary, choice. You use this value as a minimum to be met or exceeded.

3. Effect size (in actual units of the response) the researcher wants to detect. Effect size and the ability to detect it are directly related: the smaller the effect, the more difficult it will be to find. As a standardized benchmark, Cohen (1988) suggested these conventional values for the effect size index f used with analysis of variance: small = .10, medium = .25, and large = .40.

4. Variation in the response variable. The standard deviation, which usually comes from previous research or pilot studies, is often used for the response variable of interest.

5. Sample size. A larger sample size generally leads to parameter estimates with smaller variances, giving you a greater ability to detect a significant difference.

These five components of a power analysis are not independent; in fact, any four of them automatically determine the fifth. The usual objective of a power analysis is to calculate the sample size (5) for given values of items (1)-(4). In studies with limited resources, the maximum sample size will be known. Power analysis then becomes a useful tool to determine whether sufficient power (2) exists for specified values of (1), (3), (4), and (5), so the researcher can evaluate whether the study is worth pursuing.
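The relationship among the five components can be sketched with the usual normal-approximation formula for comparing two group means, n per group = 2((z(1-α/2) + z(1-β))σ/δ)², where δ is the effect size in response units and σ is the standard deviation. This is a simplified illustration, not a replacement for dedicated power software: it assumes a two-sided test, equal group sizes, and a known common standard deviation. The function names and the numbers (a 5-unit effect, a standard deviation of 10) are hypothetical.

```python
from math import ceil
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def n_per_group(effect, sd, alpha=0.05, power=0.80):
    """Subjects needed per group (normal approximation, two-sided test)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / effect) ** 2)


def achieved_power(effect, sd, n, alpha=0.05):
    """Power obtained with n subjects per group (same assumptions)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(effect) / (sd * (2 / n) ** 0.5)  # effect in standard-error units
    return _STD_NORMAL.cdf(shift - z_alpha) + _STD_NORMAL.cdf(-shift - z_alpha)


# Objective 1: solve for sample size (5) given items (1)-(4).
print(n_per_group(effect=5.0, sd=10.0))  # 63

# Objective 2: sample size is fixed by resources; check the power (2).
print(round(achieved_power(effect=5.0, sd=10.0, n=20), 2))
```

Exact calculations based on the t distribution give slightly larger sample sizes than this approximation.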

What effect size is meaningful? The size of a practical difference in the response you would like to detect among the groups is crucial. It essentially measures the "distance" between the null (H0) and a specified value of the alternative (HA) hypotheses. It also relates to the underlying population, not to data from a sample. A desirable effect size is the degree of deviation from the null hypothesis (in actual units of the response) that is considered large enough to attract your attention. Jacob Cohen, an important contributor to power analysis documentation, defined effect sizes as small, medium, and large, and he has stated that "all null hypotheses, at least in their two-tailed forms, are false." A difference is always going to be there; however, it might exist in such a small quantity that you should not be concerned about finding it. The concept of small, medium, and large effect sizes can be a reasonable starting point if you do not have more precise information. (Note that an effect size should be stated in terms of a number in the actual units of the response, not a percent change such as 5% or 10%.)

Returning to the example, if a difference in quality of life due to an exercise program exists, is the magnitude of the difference worth detecting? Suppose the levels of exercise you apply to subjects cause an observed change in quality of life of one unit on the chosen measurement scale. Is a one-unit change--or even 5 or 10 units--meaningful when facing the reality that many factors external to the study will also affect a person's quality of life?

Estimates of variation

You'll also need an estimate of the variability in the response of interest before you can determine the sample size needed to estimate an effect. This value is often found from pilot studies or from previous research, although it is all too often not readily available in published documents. Some parameters of interest are dimensionless quantities, such as a correlation or coefficient of variation, so in these cases a standard deviation would not be required.

Power calculations. Computing power for any specific study may very well be a difficult task. However, if you do not evaluate the joint influence of the size of the effect that is important and the inherent variability of the response during the planning stage, one of two inefficient outcomes will most likely result:

1. "Low power" (too little data; meaningful effect sizes are difficult to detect). If too few subjects are used, a hypothesis test will result in such low power that there is little chance to detect a significant effect. Consider someone attempting to start a car on a cold winter morning with a weak battery--it just doesn't provide the cranking power to get the engine going. This is analogous to designing an experiment in which resources were not put to optimal use (i.e., data from fewer subjects than the necessary number were collected to detect a meaningful effect).

2. "High power" (too much data; trivially small effect sizes can be detected). At the other extreme, consider an experiment where data collection is so large that a trivially small difference in the effect is detectable. Again, the researcher has not put all of his or her time and resources to good use--in statistical terms, too many subjects have been studied.

A study with low power will have indecisive results, even if the phenomenon you're investigating is real. Stated differently, the effect may well be there, but without adequate power, you won't find it.

The situation with high power is the reverse: you will likely see very significant results, even if the size of the effect you're investigating is not practical. Stated differently, the effect is there, but its magnitude is of little value.

In conclusion, the number of subjects you use is critical to the success of research. Without a sufficient number, you won't be able to achieve adequate power to detect the effect you're looking for. With too many subjects, you may be using valuable resources inefficiently.

Imagine two studies investigating the same phenomenon.

One has a sample size of 20.

The other, a sample size of 20,000.

You'll probably find a statistically significant treatment effect in the second study, but that significance is likely to be due to the size of your sample, and not solely (if at all) to the size of the treatment effect. For example, the first study might end up with a 25% difference between groups and the second with a 2% difference, yet both could be significant.

This is because with a large sample, a smaller difference becomes statistically significant.
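A quick calculation shows the pattern. The sketch below holds a small observed difference fixed and changes only the sample size. It is an illustration under assumptions that are not in the original example: the sample sizes are treated as per-group counts, the response has a known standard deviation of 10 units, the observed difference is 0.5 units, and a two-sided z-test is used. The function name `two_sided_p_value` is hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for an observed difference in group means.

    Uses a z-test, which assumes a known, common standard deviation `sd`.
    """
    standard_error = sd * (2 / n_per_group) ** 0.5
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same observed difference (0.5 units on a scale with sd = 10), two sample sizes.
for n in (20, 20_000):
    p = two_sided_p_value(mean_diff=0.5, sd=10.0, n_per_group=n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:>6}: p = {p:.3g} ({verdict})")
```

The difference itself is identical in both runs; only the standard error shrinks as the sample grows, which is what moves the p-value below .05.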

This fact has led yet another group of statisticians to come up with more scary, esoteric terms to inform you of the obvious. In this case, we will discuss "Treatment magnitude measures."

Treatment magnitude measures are meta-analytical techniques that look into two factors that affect experimental significance: treatment effects (good) and sample size (undesired).

A more accurate way of estimating treatment variance is to remove the effects that sample size has on creating the appearance of difference between groups.

Omega squared (ω²)

Omega squared is an estimate of the variance in the dependent variable accounted for by the independent variable in the population for a fixed effects model. The fixed effects model of analysis of variance applies to situations in which the experimenter has subjected the experimental material to several treatments, each of which affects only the mean of the underlying normal distribution of the response variable.

The between-subjects, fixed-effects form of the ω² formula is:

$$
\omega^2 = \frac{SS_{\text{effect}} - (df_{\text{effect}})(MS_{\text{error}})}{SS_{\text{total}} + MS_{\text{error}}}
$$

(Note: Do not use this formula for repeated measures designs)

What follows is an example calculation of ω².
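As an illustration with hypothetical one-way ANOVA values (not taken from any study), suppose three groups of 10 subjects give SS_effect = 80 with df_effect = 2 and SS_error = 270 with df_error = 27, so that MS_error = 270/27 = 10 and SS_total = 80 + 270 = 350. Then:

$$
\omega^2 = \frac{SS_{\text{effect}} - (df_{\text{effect}})(MS_{\text{error}})}{SS_{\text{total}} + MS_{\text{error}}} = \frac{80 - (2)(10)}{350 + 10} = \frac{60}{360} \approx 0.167
$$

For comparison, the uncorrected ratio SS_effect / SS_total for the same hypothetical values is 80/350 ≈ 0.229.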

(Because ω² is a population estimate, it's always going to give you a smaller value than either eta squared (η²) or partial eta squared (partial η²).)

Some properties of ω²

The index omega squared provides a relative measure of the strength of an independent variable, ranging from 0.0 to 1.0. However, it is unlikely that high ω² values will be seen, because of the large contribution of error variance in most behavioral research. Therefore, a value of .15 or greater is considered large, a value of .06 to .15 is medium, and a value around .01 is small.

ω² is not a test statistic; it cannot uncover significance on its own. However, it can provide us with information when an F test is not significant, because it is largely unaffected by sample size, whereas F ratios are affected by small sample sizes.

Other measures of relative treatment magnitude

Epsilon Squared is yet another measure of relative treatment magnitude:

$$
\varepsilon^2 = \frac{SS_{\text{effect}} - (df_{\text{effect}})(MS_{\text{error}})}{SS_{\text{total}}}
$$

You can probably see that this is similar to the formula for ω², except that the denominator is smaller because MS_error is not added to SS_total. So ε² values will tend to be slightly larger than ω² values.

Eta squared (η²)

Eta squared is the proportion of the total variance that is attributed to an effect. It is calculated as the ratio of the effect variance (SS_effect) to the total variance (SS_total):

$$
\eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{total}}}
$$

The values used in the calculations for each η², along with the partial η² from the ANOVA output, are shown in Table 2.

Since eta squared is an index of strength that is also used with multiple regression, it is often seen symbolized as R². R², or eta squared, will always give higher values than omega squared or epsilon squared.
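The ordering of the three measures can be checked numerically. The sketch below computes eta squared, epsilon squared, and omega squared from the sums of squares of a one-way, between-subjects ANOVA. The input values are hypothetical, the function name `effect_size_measures` is made up for this example, and epsilon squared is computed with the same numerator as omega squared, so that only the denominators differ.

```python
def effect_size_measures(ss_effect, df_effect, ss_error, df_error):
    """Return eta, epsilon, and omega squared for a one-way, fixed-effects ANOVA.

    Not suitable for repeated measures designs.
    """
    ms_error = ss_error / df_error
    ss_total = ss_effect + ss_error  # holds for a one-way between-subjects design
    corrected = ss_effect - df_effect * ms_error  # numerator shared by epsilon and omega
    return {
        "eta_sq": ss_effect / ss_total,
        "epsilon_sq": corrected / ss_total,
        "omega_sq": corrected / (ss_total + ms_error),
    }


# Hypothetical ANOVA summary: 3 groups of 10 subjects.
measures = effect_size_measures(ss_effect=80.0, df_effect=2, ss_error=270.0, df_error=27)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

For these inputs eta squared is the largest of the three and omega squared the smallest, which matches the ordering described in the text.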

Conclusion

The presence of significance in an F test provides us with a relative assurance of a statistical association (predictability) between the treatment group and scores on the dependent variable, but in an ambiguous way. The index omega squared helps to clarify by parsing out the effects of sample size. Additionally, omega squared can still provide useful information where the F test is nonsignificant, although great caution should be taken in using an omega squared value alone, particularly in causal research, where the relative-strength measure is under partial control of the experimenter.