The second half of a typical presentation of a mediocre study in biology goes something like this: a graph is presented showing the effect sizes, a lot of asterisks are added to denote statistical significance, and the presenter talks about how some differences are “significant” whereas the others have a “tendency [towards significance]”. Then, in the results section, “significant” has morphed into clinical significance: the significant differences are taken to be of real clinical value, and all the differences that did not pass the significance test are either claimed to be approaching clinical relevance or simply claimed to be equivalent.
As pointed out in an earlier post on annoying statistical fallacies, statistical significance means that the probability of obtaining the data you got, or more extreme data, given that the null hypothesis is true, is low. This is not the same as practical (clinical, medical, biological, sociological etc.) significance, which says that the differences are so large as to be of clinical value. Clearly, a certain difference can be very unlikely to occur under the null hypothesis and yet be so small as to be without clinical relevance.
Using asterisks may look clean, but they really obfuscate the result of the significance test. It is possible to have two differences that are very close to the cutoff level, say one with p = 0.051 and the other with p = 0.048. In a graph with asterisks, the latter would be shown as statistically significant but the former would not, despite the fact that there is only a 0.003 difference in p-value. In practice, there is not that much more evidence against the null in the latter than in the former.
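To make the point about the two p-values concrete, here is a minimal sketch that converts each of them back into the absolute z-score it would correspond to. It assumes a two-sided test with a normally distributed test statistic, which the text does not specify; the helper name `z_from_two_sided_p` is made up for this illustration.

```python
from statistics import NormalDist


def z_from_two_sided_p(p):
    """Absolute z-score whose two-sided tail probability equals p.

    Assumption: the test statistic is standard normal under the null.
    """
    return NormalDist().inv_cdf(1 - p / 2)


# The two p-values from the example in the text.
z_not_sig = z_from_two_sided_p(0.051)  # would get no asterisk
z_sig = z_from_two_sided_p(0.048)      # would get an asterisk

print(f"p = 0.051 corresponds to |z| = {z_not_sig:.3f}")
print(f"p = 0.048 corresponds to |z| = {z_sig:.3f}")
print(f"difference in |z|: {z_sig - z_not_sig:.3f}")
```

Under these assumptions both test statistics land within a few hundredths of each other on the z scale, so the asterisk separates two results that carry nearly the same amount of evidence.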
Saying that “there is a tendency [towards statistical significance]” about differences with a p-value not too far above 0.05 is in some sense intellectually dishonest, and is really a tactic to inflate the relevance of the results. This is easy to see because these same people would never describe a result with a p-value just below 0.05 as “approaching non-significance”, but would tout it as “significant”. In other words, there is a bias towards wanting the results to appear statistically significant, as people only seem to care about nudging results that fall just on the non-significant side of the 0.05 cutoff.
In a similar fashion, a statistically non-significant difference does not imply equivalence. There are two main reasons for this: (1) p-values depend on sample size, so a large sample will give statistical significance for even negligible differences, and a small sample can make large differences fail to reach statistical significance, and (2) while p-values do tell you something about the degree of overlap between 95% confidence intervals, they do not tell you anything about the absolute size of the error bars. This means that a treatment with an odds ratio (OR) of, say, 1.01 (1 is deemed equivalence) could have a 95% CI of [0.5, 1.5], meaning the plausible values for the population parameter go from clinically beneficial to clinically harmful. Clearly, you do not want to call two treatments equivalent if one of them has a real possibility of being clinically harmful.
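The dependence on sample size can be sketched with a simple two-group z-test. Everything here is an illustrative assumption rather than data from a real study: the outcome is taken to have a known standard deviation of 1 in both groups, the groups are of equal size, and the function name `z_test_two_groups` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_test_two_groups(diff, sd, n_per_group, level=0.95):
    """Two-sided p-value and confidence interval for a difference in means.

    Assumptions: known common standard deviation `sd`, equal group sizes,
    normally distributed test statistic.
    """
    se = sd * sqrt(2 / n_per_group)          # standard error of the difference
    z = diff / se
    p = 2 * (1 - STD_NORMAL.cdf(abs(z)))     # two-sided p-value
    crit = STD_NORMAL.inv_cdf(1 - (1 - level) / 2)
    return p, (diff - crit * se, diff + crit * se)


# A negligible difference (0.02 SD) measured in a huge sample.
p_tiny, ci_tiny = z_test_two_groups(diff=0.02, sd=1.0, n_per_group=100_000)

# A large difference (0.8 SD) measured in a very small sample.
p_large, ci_large = z_test_two_groups(diff=0.8, sd=1.0, n_per_group=5)

print(f"tiny effect, n = 100000 per group: p = {p_tiny:.2g}, "
      f"95% CI = [{ci_tiny[0]:.3f}, {ci_tiny[1]:.3f}]")
print(f"large effect, n = 5 per group:     p = {p_large:.2g}, "
      f"95% CI = [{ci_large[0]:.2f}, {ci_large[1]:.2f}]")
```

In this sketch the negligible difference comes out statistically significant while the large one does not, and the interval for the small study is wide enough to include both no effect and a very large effect. A non-significant result of that kind says little about equivalence.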
What can we do about these problems? Here are a couple of recommendations.
1. If you have to use significance testing, clearly distinguish between statistical and practical significance, and under no circumstances use statistical significance alone as a basis for concluding practical significance.
2. When reporting p-values, report exact p-values instead of asterisks (*) or "less than" an arbitrary cutoff (p < 0.05).
3. Use 95% confidence intervals over significance testing as much as you can. Also, do not use the confidence intervals as just another way of doing a significance test. Think of them as the range of plausible values for the population parameter.
4. Use the actual differences in effect size, together with the 95% confidence intervals, to interpret the difference in the biological context. Focus on quantitative interpretations! Is this difference minor and clinically negligible, moderate and clinically promising, or large and clinically useful? Sometimes even minor differences are clinically useful, so prefer biological context over arbitrary cutoffs if you can.
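As a rough sketch of what reporting along these lines could look like, the following computes an odds ratio from a 2x2 table together with a 95% confidence interval and an exact p-value. The counts are invented for illustration, the function name `odds_ratio_summary` is hypothetical, and the interval is the common Wald approximation on the log-odds scale, which is only one of several ways to build such an interval.

```python
from math import exp, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def odds_ratio_summary(a, b, c, d, level=0.95):
    """Odds ratio, Wald confidence interval and two-sided p-value.

    Table layout (hypothetical counts):
        treated: a events, b non-events
        control: c events, d non-events
    """
    odds_ratio = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of log(odds ratio)
    crit = STD_NORMAL.inv_cdf(1 - (1 - level) / 2)
    lower = exp(log(odds_ratio) - crit * se_log)
    upper = exp(log(odds_ratio) + crit * se_log)
    z = log(odds_ratio) / se_log
    p = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    return odds_ratio, lower, upper, p


# Made-up example: 20/100 events among treated, 10/100 among controls.
or_hat, ci_low, ci_high, p_value = odds_ratio_summary(20, 80, 10, 90)

# Report the estimate, the interval and the exact p-value, not asterisks.
print(f"OR = {or_hat:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}], p = {p_value:.3f}")
```

With these invented counts the p-value falls just above 0.05, yet the interval runs from roughly no effect to about a fivefold increase in odds. Reported this way, the reader can judge the result in its biological context instead of reading it as "not significant".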
References and further reading
Ellis, Paul D. (2010). The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results.
Cohen, Jacob. (1994). The Earth is Round (p < 0.05). American Psychologist, 49(12), 997-1003. doi:10.1037/0003-066X.49.12.997
Cumming, Geoff, Fidler, Fiona, & Vaux, David L. (2007). Error bars in experimental biology. The Journal of Cell Biology, 177(1), 7-11. doi:10.1083/jcb.200611141
Cumming, Geoff. (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-analysis. New York: Routledge.