The title, of course, is a reference to the landmark paper by Cohen (1994) in American Psychologist called The Earth is Round (p < .05). Technically speaking, the earth is an oblate spheroid: the shape of an ellipse rotated around its minor axis, which leaves it flattened at the poles. The arguments laid out in that review were not particularly new; they had existed for many decades. Still, they had, and continue to have, great intellectual merit. The paper outlines the major flaws of traditional null hypothesis significance testing (NHST) using p-values.
Despite this, p-values continue to be used by scientists, although the use of effect sizes and confidence intervals is on the increase.
Cohen’s review should be read by any aspiring researcher (it is listed in the reference section). It highlights the following errors and problems that commonly arise when p-values are used:
- The p-value is the probability of obtaining the results, or more extreme results, given that the null hypothesis is true. It does not provide us with the probability of the null hypothesis being true, given the obtained evidence. Confusing these two conditional probabilities is known as the fallacy of transposed conditionals or the inverse probability error.
- NHST, by contorting deductive modus tollens into a probabilistic argument, is formally invalid.
- The p-value is not the probability of replication. In fact, p-values vary surprisingly widely across successive replications of the same experiment.
- The rejection of the null hypothesis does not prove the alternative hypothesis. The classical example, although not discussed by Cohen in this particular sense, is that a correct guess of, say, 20 playing cards in a row, may be highly unlikely, but the alternative hypothesis of clairvoyance is even more unlikely.
- The p-value is a function of sample size: given a large enough sample, almost any nonzero difference, however small, will appear statistically significant.
- NHST leads to publication bias. Results that are deemed to be statistically significant are much more likely to be published than results that are not statistically significant.
- Reaching the level of statistical significance does not mean that the result is of any practical (e.g. biological or psychological) significance. The data may be improbable given that the null hypothesis is true, but the difference between the two groups tested may be negligible for all practical intents and purposes. In other words, NHST undervalues effect sizes.
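Two of the points in the list, the dependence on sample size and the instability across replications, can be illustrated with a minimal Python sketch. It is not taken from Cohen's paper. It assumes a simple two-sided z-test with a known standard deviation of 1, and the effect sizes (0.02 and 0.5 standard deviations), the sample sizes and the random seed are all made-up values chosen for illustration.

```python
import math
import random


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


# Part 1: a fixed, negligible observed difference (0.02 standard deviations,
# a hypothetical value) tested against a null of zero at growing sample sizes.
observed_difference = 0.02
p_by_sample_size = {
    n: two_sided_p_from_z(observed_difference * math.sqrt(n))
    for n in (100, 10_000, 1_000_000)
}
for n, p in p_by_sample_size.items():
    print(f"n = {n:>9,}  p = {p:.3g}")

# Part 2: repeat the same experiment 20 times (true effect 0.5 SD, n = 20,
# both hypothetical) and look at how much the p-value moves around.
rng = random.Random(1994)  # fixed seed so the run is repeatable
true_effect, n = 0.5, 20
replication_p_values = []
for _ in range(20):
    sample_mean = sum(rng.gauss(true_effect, 1.0) for _ in range(n)) / n
    z = sample_mean * math.sqrt(n)  # known SD of 1, null mean of 0
    replication_p_values.append(two_sided_p_from_z(z))

print(f"smallest p = {min(replication_p_values):.3g}")
print(f"largest p  = {max(replication_p_values):.3g}")
```

In the first part the observed difference never changes, yet the p-value goes from clearly non-significant at n = 100 to below 0.05 at n = 10,000 and to vanishingly small at n = 1,000,000. In the second part the underlying effect is identical in every replication, so any spread between the smallest and largest p-value comes from sampling variation alone.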
In Live by statistics, die by statistics, the associate professor of biology PZ Myers discusses a new and highly fascinating study in experimental psychology. The general gist of the paper (Masicampo and Lalande, 2012) is that the actual distribution of published p-values deviates quite a bit from the theoretical distribution in the region just below 0.05, where there are more p-values than expected. This appears to suggest that some experimental psychologists fudge their data or analyses a bit so that results that would otherwise fall just above 0.05 end up just below it. This is of course intellectually dishonest and statistically inappropriate. One explanation is that too much focus is placed on achieving statistical significance, even though significance by itself tells you very little. In other words, these experimental psychologists have misunderstood NHST and p-values on a fundamental level. Another explanation is that some journals may require statistically significant results for publication, or that some reviewers object to non-significant results. With that said, some journals now require confidence intervals and effect sizes in order to secure publication.
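To get a feel for how a surplus of p-values just below 0.05 could arise, here is a small simulation sketch in Python. It is not the method used by Masicampo and Lalande, and it does not claim to show what any researcher actually did. It assumes one hypothetical mechanism, optional stopping: the researcher keeps adding observations and re-testing until p drops below 0.05 or a maximum sample size is reached. The null hypothesis is true in every simulated study, and the function names, sample sizes, bin width and seed are illustrative assumptions.

```python
import math
import random


def two_sided_p(total, n):
    """Two-sided z-test p-value for n draws with known SD 1 and null mean 0."""
    z = total / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def run_study(rng, start_n=20, max_n=40, peek=False):
    """Simulate one study in which the null hypothesis is true.

    With peek=True the researcher adds one observation at a time and
    re-tests, stopping as soon as p < 0.05 or max_n is reached.
    """
    total = sum(rng.gauss(0.0, 1.0) for _ in range(start_n))
    n = start_n
    p = two_sided_p(total, n)
    while peek and p >= 0.05 and n < max_n:
        total += rng.gauss(0.0, 1.0)
        n += 1
        p = two_sided_p(total, n)
    return p


def count_near_threshold(p_values, width=0.01):
    """Count p-values in the bins just below and just above 0.05."""
    below = sum(0.05 - width <= p < 0.05 for p in p_values)
    above = sum(0.05 <= p < 0.05 + width for p in p_values)
    return below, above


rng = random.Random(2012)  # fixed seed so the run is repeatable
studies = 5_000
honest = [run_study(rng) for _ in range(studies)]
peeking = [run_study(rng, peek=True) for _ in range(studies)]

honest_below, honest_above = count_near_threshold(honest)
peeking_below, peeking_above = count_near_threshold(peeking)
print("honest:  just below 0.05 =", honest_below, " just above =", honest_above)
print("peeking: just below 0.05 =", peeking_below, " just above =", peeking_above)
```

With a fixed sample size and a true null, p-values are spread evenly, so the two bins around 0.05 should hold similar counts. With optional stopping, studies pile up in the bin just below 0.05, which is the kind of bump the paper describes.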
Myers has written a good, short discussion about how p-values are misunderstood. It is quite ironic that Myers himself, in the process of pointing out the flaws in using and interpreting p-values, misstates the definition of the p-value in his blog post. Myers writes that:
There is a magic and arbitrary line in ordinary statistical testing: the p level of 0.05. What that basically means is that if the p level of a comparison between two distributions is less than 0.05, there is a less than 5% chance that your results can be accounted for by accident. We’ll often say that having p<0.05 means your result is statistically significant. Note that there’s nothing really special about 0.05; it’s just a commonly chosen dividing line.
The p-value is the probability of obtaining the results, or more extreme results, given that the null hypothesis is true. It is not the probability that the results can be accounted for by chance or accident. The reason is that the calculation of a p-value already assumes that the null hypothesis is true, that is, that chance alone is at work. Remember, we want to find out how likely it is to obtain these results (or more extreme results), given that the null hypothesis is true. A number computed under that assumption cannot also tell us how likely it is that the results are due to chance, since the calculation treats the chance explanation as if its probability were 100%.
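The difference between the two conditional probabilities can be made concrete with a small worked example in Python. This is only a sketch under stated assumptions: the data (61 heads in 100 coin flips), the single alternative hypothesis (a coin that lands heads 60% of the time) and the prior probabilities are all hypothetical, and the function names are made up for this illustration. The p-value is computed from the null hypothesis alone, whereas the probability of the null given the data requires Bayes' theorem, a prior and an explicit alternative.

```python
import math


def binomial_pmf(k, n, prob):
    """Probability of exactly k heads in n flips when P(heads) = prob."""
    return math.comb(n, k) * prob**k * (1 - prob) ** (n - k)


def two_sided_p_value(k, n):
    """P(result at least as far from n/2 as k | the coin is fair)."""
    distance = abs(k - n / 2)
    return sum(
        binomial_pmf(i, n, 0.5)
        for i in range(n + 1)
        if abs(i - n / 2) >= distance
    )


def posterior_null(k, n, prior_null, alt_prob):
    """P(coin is fair | data) via Bayes' theorem, against one alternative."""
    weight_null = prior_null * binomial_pmf(k, n, 0.5)
    weight_alt = (1 - prior_null) * binomial_pmf(k, n, alt_prob)
    return weight_null / (weight_null + weight_alt)


heads, flips = 61, 100  # hypothetical data
p_value = two_sided_p_value(heads, flips)
print(f"p-value: {p_value:.3f}")

# The posterior changes with the prior; the p-value does not.
for prior in (0.5, 0.9, 0.999):
    posterior = posterior_null(heads, flips, prior_null=prior, alt_prob=0.6)
    print(f"prior P(H0) = {prior}: P(H0 | data) = {posterior:.3f}")
```

The p-value comes out below 0.05 whatever we believed beforehand. The probability that the coin is fair, by contrast, depends heavily on the prior. With a very sceptical prior, as in the clairvoyance example, the null hypothesis remains by far the more probable explanation even though the result is "statistically significant".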
Now, this is perhaps nitpicking, and I doubt that developmental biologists even use p-values that much in their research to begin with, so the error is understandable. Still, the fact that Myers wrote a misleading explanation of p-values shows that NHST and p-values are routinely misunderstood by many researchers, including knowledgeable scientists with prominent academic positions like Myers (who already knows about many of the problems with p-values).
References and further reading
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Sterne, J. A. C., Cox, D. R., & Smith, G. D. (2001). Sifting the evidence—what’s wrong with significance tests? BMJ, 322(7280), 226–231.
Schervish, M. J. (1996). P values: What they are and what they are not. The American Statistician, 50(3), 203–206.
Masicampo, E. J., & Lalande, D. R. (2012). A peculiar prevalence of p values just below .05. Quarterly Journal of Experimental Psychology, 1–9.