Null hypothesis testing and effect sizes

Most of the statistics you have covered have been concerned with null hypothesis testing: assessing the likelihood that any effect you have seen in your data, such as a correlation or a difference in means between groups, may have occurred by chance. As we have seen, we do this by calculating a p value -- the probability of seeing an effect at least as large as the one in your data if the null hypothesis were true; that is, p gives the probability of seeing what you have seen in your data by chance alone. Note that p is not the probability that the null hypothesis itself is correct. For any real effect, p goes down as the size of the effect goes up and as the size of the sample goes up.

However, there are problems with this process. As we have discussed, there is the problem that we spend all our time worrying about the completely arbitrary .05 alpha value, such that p=.04999 is a publishable finding but p=.05001 is not. But there is also another problem: even the most trivial effect (a tiny difference between two groups' means, or a minuscule correlation) will become statistically significant if you test enough people. If a small difference between two groups' means is not significant when I test 100 people, should I suddenly get excited about exactly the same difference if, after testing 1000 people, I find it is now significant? The answer is probably no -- if it was a trivial effect with 100 people it's still trivial with 1000: we don't really care if something makes just a 1% difference to performance, even if it is statistically significant. So what is needed is not just a system of null hypothesis testing but also a system for telling us precisely how large the effects we see in our data really are. This is where effect-size measures come in.
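To see the sample-size problem in numbers, here is a minimal Python sketch. It is an illustration rather than a full t-test: it assumes two equal-sized groups and uses a normal (z) approximation, and the function name `two_group_p_value`, the effect of 0.1 standard deviations and the group sizes are all made up for the example.

```python
import math

def two_group_p_value(d, n_per_group):
    """Two-tailed p for a difference of d standard deviations between two
    equal-sized groups, using a normal (z) approximation to the t-test."""
    # The standard error of the difference is SD * sqrt(2 / n),
    # so the test statistic is d * sqrt(n / 2).
    z = d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical effect: the groups differ by a tenth of a standard deviation.
small_effect = 0.1
p_100 = two_group_p_value(small_effect, 100)    # about .48, not significant
p_1000 = two_group_p_value(small_effect, 1000)  # about .025, significant

print(f"n = 100 per group:  p = {p_100:.3f}")
print(f"n = 1000 per group: p = {p_1000:.3f}")
```

The effect is exactly the same size in both cases; only the number of people tested has changed.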

Effect-size measures quantify either the size of an association or the size of a difference. You already know the most common effect-size measure, as the correlation/regression coefficients r and R are actually measures of effect size. Because r covers the whole range of relationship strengths, from no relationship whatsoever (zero) to a perfect relationship (1, or -1), it is telling us exactly how large the relationship really is between the variables we've studied -- and is independent of how many people were tested. Cohen provided rules of thumb for interpreting these effect sizes, suggesting that an r of .1 (in either direction) represents a 'small' effect size, .3 represents a 'medium' effect size and .5 represents a 'large' effect size.

Another common measure of effect size is d, sometimes known as Cohen's d (as you might have guessed by now, Cohen was quite influential in the field of effect sizes). This can be used when comparing two means, as you would when doing a t-test, and is simply the difference in the two groups' means divided by the average of their standard deviations*. This means that if we see a d of 1, we know that the two groups' means differ by one standard deviation; a d of .5 tells us that the two groups' means differ by half a standard deviation; and so on. Cohen suggested that d=0.2 be considered a 'small' effect size, 0.5 a 'medium' effect size and 0.8 a 'large' effect size. In other words, if two groups' means differ by less than 0.2 standard deviations, the difference is trivial, even if it is statistically significant.

(* This average is the pooled standard deviation; see 'Calculating effect sizes' below.)
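Cohen's rules of thumb for r and d can be written down as a small lookup. This is only a sketch: the function names are invented for the example, treating each threshold as a lower bound is my assumption, and calling anything below the 'small' threshold 'trivial' for r extends what is said above about d.

```python
def cohen_label(value, small, medium, large):
    """Return Cohen's verbal label for an effect size, given his thresholds."""
    size = abs(value)  # the sign only gives the direction of the effect
    if size >= large:
        return "large"
    if size >= medium:
        return "medium"
    if size >= small:
        return "small"
    return "trivial"

def label_r(r):
    # Thresholds for correlations: .1, .3, .5
    return cohen_label(r, 0.1, 0.3, 0.5)

def label_d(d):
    # Thresholds for Cohen's d: 0.2, 0.5, 0.8
    return cohen_label(d, 0.2, 0.5, 0.8)

print(label_r(-0.35))  # medium
print(label_d(0.15))   # trivial
```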

There are various other measures of effect size, but the only other one you need to know for now is partial eta-squared ($\eta_p^2$), which can be used in ANOVA. Partial eta-squared is a measure of variance, like r-squared. It tells us what proportion of the variance in the dependent variable is attributable to the factor in question, once the variance explained by any other factors has been set aside. You can get these measures by choosing the 'Estimates of effect size' option when setting up an ANOVA. Partial eta-squared isn't a perfect measure of effect size, as you'll see if you probe further into the subject, but it's okay for most purposes and is publishable.
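If you ever need to check the figure by hand, partial eta-squared is the sum of squares for the effect divided by the sum of squares for the effect plus the error sum of squares. A minimal sketch, with made-up sums of squares standing in for the ones you would read off an ANOVA table:

```python
def partial_eta_squared(ss_effect, ss_error):
    """Proportion of variance attributable to one factor, ignoring
    variance explained by any other factors in the design."""
    return ss_effect / (ss_effect + ss_error)

# Hypothetical sums of squares, as read off an ANOVA table.
print(partial_eta_squared(ss_effect=30.0, ss_error=170.0))  # 0.15
```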

What is meant by 'small', 'medium' and 'large'?

Good question! In Cohen's terminology, a 'small' effect size is one in which there is a real effect -- i.e., something is really happening in the world -- but which you can only see through careful study. A 'medium' effect size is an effect which is big enough, and/or consistent enough, that you may be able to see it 'with the naked eye'. For example, just by looking at a room full of people, you'd probably be able to tell that on average, the men were taller than the women -- this is what is meant by an effect which can be seen with the naked eye (actually, the d for the gender difference in height is about 1.4, which is really large, but it serves to illustrate the point). A 'large' effect size is one which is very substantial.

Calculating effect sizes

As mentioned above, partial eta-squared is obtained as an option when doing an ANOVA, and r or R come naturally out of correlations and regressions. The only effect size you're likely to need to calculate yourself is Cohen's d. To help you out, here are the equations. d is one mean subtracted from the other, divided by the pooled, or average, standard deviation of the two groups. So the formula for d is:

$$d = \frac{M_1 - M_2}{SD_{\text{pooled}}}$$

and the formula for the pooled standard deviation is simply:

$$SD_{\text{pooled}} = \sqrt{\frac{SD_1^2 + SD_2^2}{2}}$$

So, for example, if group 1 has a mean score of 24 with an SD of 5 and group 2 has a mean score of 20 with an SD of 4,

$$SD_{\text{pooled}} = \sqrt{\frac{5^2 + 4^2}{2}} = \sqrt{20.5} \approx 4.53$$

and therefore

$$d = \frac{24 - 20}{4.53} \approx 0.88$$

revealing a 'large' value of d, which tells us that the difference between these two groups is large enough and consistent enough to be really important.
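The same calculation can be sketched in a few lines of Python. The function names are my own, and the sketch assumes the pooled standard deviation is taken as the root mean square of the two SDs, which is appropriate when the two groups are of similar size.

```python
import math

def pooled_sd(sd1, sd2):
    """Pooled SD of two groups, assuming roughly equal group sizes."""
    return math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)

def cohens_d(mean1, sd1, mean2, sd2):
    """Difference between two means in units of the pooled SD."""
    return (mean1 - mean2) / pooled_sd(sd1, sd2)

# The worked example: group 1 has M = 24, SD = 5; group 2 has M = 20, SD = 4.
d = cohens_d(24, 5, 20, 4)
print(round(pooled_sd(5, 4), 2))  # 4.53
print(round(d, 2))                # 0.88
```

Swapping the order of the groups changes only the sign of d, not its size.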

Standardized versus unstandardized effect sizes

What I have talked about here are standardized effect sizes. They are standardized because, no matter what is being measured, the effects are all put onto the same scale (d, r or whatever). So whether I were correlating height and weight, or education level and income, the resulting r would be on the same standard scale.

However, it's also easy to give unstandardized effect sizes. Let's say we compare two groups of students to see how many close friends they have. The chemistry students have an average of 6 (SD=3) and the physics students have an average of 8 (SD=3). This gives us an effect size of d=.67, which is a useful measure. However, if I were reporting this I might choose to give this d value and also a measure in the original units, which in this case is the number of friends: "Physics students, on average, have two more close friends than chemistry students (d=.67)". Giving the actual difference in the number of friends, as well as a standardized effect size, puts the findings into context and makes your work readable by laypeople.

When looking at differences, try to provide standardized effect sizes such as d and also unstandardized measures of effect size in original units. When looking at relationships, you can report unstandardized regression coefficients (i.e., b values) alongside r. For example: "The relationship between revision time and exam score was fairly strong, r = .86, and showed that each extra hour of revision was associated with a 3-point increase in exam score on average". See how this is easier for a layperson to understand?
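Both numbers come out of the same sums, which a short Python sketch makes clear. The data here are invented purely for illustration (they are not the data behind the example sentence above), and the function name `slope_and_r` is my own.

```python
import math

def slope_and_r(xs, ys):
    """Return the unstandardized slope b and the correlation r
    for a simple regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                   # change in y per one-unit change in x
    r = sxy / math.sqrt(sxx * syy)  # the same relationship on a -1 to 1 scale
    return b, r

# Hypothetical data: hours of revision and exam scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 58, 57, 63, 68, 67]

b, r = slope_and_r(hours, scores)
print(f"Each extra hour of revision: {b:.1f} more points (r = {r:.2f})")
```

The slope b is in exam points per hour, so it changes if you change the units; r does not.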

Other easily understood measures of effect size you should consider include the number of people you'd need to treat with a therapy before one, on average, would be cured, and the time that it would take, on average, before an outcome occurred.
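The first of these is usually called the number needed to treat. As a rough sketch, it is one divided by the difference in cure rates between treated and untreated people, so it counts the people treated per additional cure. The cure rates below are invented for the example, and the function name is my own.

```python
def number_needed_to_treat(cure_rate_treated, cure_rate_untreated):
    """People who must be treated, on average, for one additional cure."""
    risk_difference = cure_rate_treated - cure_rate_untreated
    if risk_difference <= 0:
        raise ValueError("the therapy shows no benefit, so NNT is undefined")
    return 1 / risk_difference

# Hypothetical: 50% are cured with the therapy, 30% without it.
nnt = number_needed_to_treat(0.50, 0.30)
print(round(nnt, 1))  # 5.0
```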

So the bottom line is: to make your results useful and readable, give both standardized and unstandardized effect sizes whenever possible.