Why should we test improbable and irrelevant null hypotheses with a chronically misunderstood and abused method with little or no scientific value that has several large detrimental effects even if used correctly (which it rarely is)?
During the past 60+ years, scientific research results have been analyzed with a method called null hypothesis significance testing (NHST), which produces p-values by which the results are then judged. However, it turns out that this is a seriously flawed method. It does not tell us anything about how large the difference was, how precisely it was estimated or what it all means in the scientific context. It tests false and irrelevant null hypotheses. P-values are only indirectly related to posterior probability via Bayes' theorem, what p-value you get for a specific experiment is often determined by chance, the alternative hypotheses might be even more unlikely, and NHST increases the false positive rate in published papers, contributes to publication bias and causes published effect sizes to be overestimated and have low accuracy. It is also a method that most researchers do not understand, neither the basic definitions nor what a specific p-value means.
This article surveys some of these flaws, misunderstandings and abuses and looks at what the alternatives are. It also anticipates some of the objections made by NHST supporters. Finally, it examines a case study consisting of an extremely unproductive discussion with an NHST statistician. Unsurprisingly, this NHST statistician was unable to provide a rationally convincing defense of NHST.
Why NHST is seriously flawed
There are several reasons why NHST is a flawed and irrational technique for analyzing scientific results.
Statistical significance does not tell us what we want to know: A p-value tells us the probability of obtaining at least as extreme results, given the truth of the null hypothesis. However, it tells us nothing about how large the observed difference was, how precisely we have estimated it, or what the difference means in the scientific context.
The vast majority of null hypotheses are false and scientifically irrelevant: It is extremely unlikely that two population parameters would have the exact same value. There are almost always some differences. Therefore, it is not meaningful to test hypotheses we know are almost certainly false. In addition, rejection of the null hypothesis is almost guaranteed if the sample size is large enough. In science, we are rarely interested merely in finding out whether e.g. a medication is better than placebo. We want to know how much better. Therefore, non-nil null hypotheses might be of more interest. Instead of testing if a medication is equal to placebo, it can be more important to test if a medication is better than placebo in a clinically meaningful way.
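To make the sample size point concrete, here is a minimal sketch in Python. It assumes a two-sample z-test with a known, common standard deviation, which is my simplification. The function names and the effect of 0.01 standard deviations are hypothetical and chosen only for illustration.

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def p_value_for_difference(diff_in_sd: float, n_per_group: int) -> float:
    """p-value of a two-sample z-test (known, equal SD) when the observed
    difference between the group means equals diff_in_sd standard deviations."""
    z = diff_in_sd * math.sqrt(n_per_group / 2)
    return two_sided_p_from_z(z)

# The same negligible difference (0.01 SD, a hypothetical value) at growing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9}: p = {p_value_for_difference(0.01, n):.3g}")
```

The observed difference is held fixed while only the sample size grows, so the shrinking p-value reflects nothing but the number of observations.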
P-values are only indirectly related to posterior probability: The lower the p-value (all other factors being constant), the stronger the evidence against the null hypothesis. However, the relationship between the p-value and the posterior probability (i.e. the probability of the null hypothesis given the evidence) is very indirect and is weighted by the prior probability (i.e. the probability of the null hypothesis given the background information). According to Bayes' theorem:
$$P(H_0 \mid D) = \frac{P(D \mid H_0)\, P(H_0)}{P(D)}$$
where $H_0$ is the null hypothesis, $D$ is the data, $P(H_0)$ is the prior probability and $P(H_0 \mid D)$ is the posterior probability.
That is, p-values have very little evidential weight if you test improbable hypotheses, or even moderately probable hypotheses. As an extreme example, even low p-values in favor of homeopathy do not improve its scientific credibility.
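A rough sketch of how the prior weighs in, written in Python. It conditions on the event "the result was statistically significant" rather than on an exact p-value, which is a simplification on my part. The priors, alpha and power below are hypothetical numbers, not estimates from any study.

```python
def posterior_null_given_significant(prior_null: float, alpha: float, power: float) -> float:
    """P(null is true | significant result) via Bayes' theorem.

    alpha is P(significant | null true); power is P(significant | null false).
    """
    p_significant = alpha * prior_null + power * (1 - prior_null)
    return alpha * prior_null / p_significant

# Hypothetical scenarios: an even-odds hypothesis versus a very implausible one.
for prior_null in (0.5, 0.9, 0.99):
    post = posterior_null_given_significant(prior_null, alpha=0.05, power=0.8)
    print(f"prior P(null) = {prior_null:.2f} -> posterior P(null | significant) = {post:.3f}")
```

Under these assumed numbers, a significant result leaves the null hypothesis very probable when the alternative was implausible to begin with, which is the homeopathy situation in miniature.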
The p-value distribution is very wide: Let us say that you have two populations, take a sample from each and calculate a p-value. Now, imagine doing this 1500 times and calculating 1500 p-values. How wide is the distribution of p-values? What is the range of values taken by p? Cumming (2012) simulated 1500 such experiments with N = 32 per group. The populations were normally distributed with a population standard deviation of 0.2 each, the difference between the means of the two populations was 0.5 standard deviations, a two-tailed test was used, the alpha cutoff was 0.05 and the statistical power was 0.52 (common in many scientific experiments from psychology to ecology). Here are the results:
35.1% p > 0.1
12.1% 0.1 > p > 0.05
23.9% 0.05 > p > 0.01
18.9% 0.01 > p > 0.001
10.0% p < 0.001
In other words, what p-value you get under plausible experimental designs is largely a result of chance. The range of values taken by p is very, very large. This is known as the p-value casino.
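A simulation in this spirit can be sketched with the Python standard library. This is not Cumming's own code. I assume a two-sample z-test with known standard deviation (set to 1, so the effect is expressed in SD units), which gives a power of about 0.52 for N = 32 and an effect of 0.5 SD. The function names and the seed are hypothetical.

```python
import math
import random

def simulate_p_values(n_experiments=1500, n_per_group=32, effect_in_sd=0.5, seed=1):
    """Simulate repeated two-group experiments and return their two-sided p-values."""
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)  # standard error of the mean difference when SD = 1
    p_values = []
    for _ in range(n_experiments):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_in_sd, 1.0) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        z = diff / se
        p_values.append(math.erfc(abs(z) / math.sqrt(2)))
    return p_values

def bin_shares(p_values):
    """Share of p-values falling in each of the five bins used in the list above."""
    n = len(p_values)
    counts = {
        "p > 0.1": sum(p > 0.1 for p in p_values),
        "0.05 < p <= 0.1": sum(0.05 < p <= 0.1 for p in p_values),
        "0.01 < p <= 0.05": sum(0.01 < p <= 0.05 for p in p_values),
        "0.001 < p <= 0.01": sum(0.001 < p <= 0.01 for p in p_values),
        "p <= 0.001": sum(p <= 0.001 for p in p_values),
    }
    return {label: count / n for label, count in counts.items()}

p_values = simulate_p_values()
for label, share in bin_shares(p_values).items():
    print(f"{label:>18}: {share:.1%}")
print(f"smallest p = {min(p_values):.2g}, largest p = {max(p_values):.2g}")
```

Every simulated experiment samples from the same two populations, so all the variation in p comes from sampling alone. The exact percentages depend on the seed and on the z-test assumption.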
The alternative hypotheses might be even more unlikely: a statistical significance test is used by many as a method for rejecting the null hypothesis as improbable. However, the alternative hypothesis (and the proposed mechanism) might be even more improbable. Thus, it is not enough to say that the null hypothesis is unlikely. One has to think about the probability of the alternative hypothesis as well. Thus, NHST is formally logically invalid.
It uses arbitrary cutoffs and contributes to black-and-white thinking: there is nothing special about the cutoff of 0.05. A p-value of 0.04 is not much better evidence against the null hypothesis than 0.06. The use of this arbitrary cutoff also promotes black-and-white thinking, but a competent evaluation of research results has to take into account a lot more.
It can, at best, only test statistical (not substantive) null hypotheses: NHST can only test statistical hypotheses e.g. that the population average of the experimental treatment is equal to the population average of the control. It cannot test hypotheses about substance, such as what any observed difference is caused by. For instance, a rejection of the null hypothesis that the suicide frequency among men and women is identical does not show that women are psychologically inferior to men.
Increases type-I error rate in published papers: On average, 5 out of 100 studies testing a medication that has no effect will reject the null hypothesis. Since null hypothesis rejections are more likely to be published, the type-I error rate in the published literature is considerably higher than the canonical 0.05 that proponents of NHST use.
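The arithmetic behind this can be sketched as follows. The publication probabilities (0.9 for significant results, 0.1 for non-significant ones) are hypothetical values picked for illustration, and the function name is mine.

```python
def published_false_positive_share(n_null_studies, alpha, publish_if_significant, publish_if_not):
    """Among published studies of a truly ineffective treatment, the share
    that (wrongly) rejected the null hypothesis."""
    significant = n_null_studies * alpha
    non_significant = n_null_studies - significant
    published_significant = significant * publish_if_significant
    published_non_significant = non_significant * publish_if_not
    return published_significant / (published_significant + published_non_significant)

# No selective publication: the published type-I error rate stays at alpha.
print(published_false_positive_share(100, 0.05, 1.0, 1.0))
# Hypothetical selective publication favoring significant results.
print(published_false_positive_share(100, 0.05, 0.9, 0.1))
```

With these assumed probabilities, roughly a third of the published studies of an ineffective medication would report a rejection of the null hypothesis, compared with 5% of the studies actually carried out.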
Contributes substantially to publication bias: because a lot of researchers and journals are so obsessed with statistical significance, papers with null hypothesis rejections are more likely to be published than those not finding any statistically significant differences. This is known as the file-drawer effect or publication bias. This means that the accuracy of published effect sizes is low.
Overestimates effect sizes in published papers: if you have a small sample size, it is hard to get rejections of null hypotheses for small or moderate effect sizes. Thus, rejections typically happen for samples where the observed effect size is higher than the corresponding population parameter, thus overestimating it.
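This overestimation can be illustrated with a small simulation. It is a sketch under my own assumptions: the observed difference is drawn directly from its sampling distribution (normal, known standard deviation), only results with |z| > 1.96 count as "published", and the true effect of 0.3 SD with 20 subjects per group is a hypothetical scenario.

```python
import math
import random
import statistics

def mean_observed_effects(true_effect_in_sd=0.3, n_per_group=20, n_studies=5000, seed=2):
    """Return (mean effect over all studies, mean effect over significant studies)."""
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)  # standard error of the difference, in SD units
    all_effects = []
    significant_effects = []
    for _ in range(n_studies):
        observed = rng.gauss(true_effect_in_sd, se)  # one study's observed difference
        all_effects.append(observed)
        if abs(observed / se) > 1.96:  # two-sided test at alpha = 0.05
            significant_effects.append(observed)
    return statistics.fmean(all_effects), statistics.fmean(significant_effects)

mean_all, mean_significant = mean_observed_effects()
print(f"true effect: 0.30, mean of all studies: {mean_all:.2f}, "
      f"mean of significant studies only: {mean_significant:.2f}")
```

Averaging over all studies recovers the true effect, while averaging only over the significant ones gives a clearly inflated estimate.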
Not only flawed in many respects, NHST is also chronically misunderstood and abused.
Misunderstandings and abuses of NHST
There are around 20 common misunderstandings and abuses of p-values and NHST. Most of them are related to the definition of the p-value. As discussed above, a p-value is the conditional probability of at least as extreme data, given the truth of the null hypothesis. Other misunderstandings are about the implications of statistical significance.
Statistical significance does not mean substantive significance: just because an observation (or a more extreme observation) was unlikely had there been no differences in the population does not mean that the observed difference is large enough to be of practical relevance. At high enough sample sizes, any nonzero difference will be statistically significant regardless of effect size.
Statistical non-significance does not entail equivalence: a failure to reject the null hypothesis is just that. It does not mean that the two groups are equivalent, since statistical non-significance can be due to low sample size.
A low p-value does not imply a large effect size: p-values depend on several other things besides effect size, such as sample size and spread.
It is not the probability of the null hypothesis: as we saw, it is the conditional probability of the data, or more extreme data, given the null hypothesis.
It is not the probability of the null hypothesis given the results: this is the fallacy of transposed conditionals, as the p-value is the other way around: the probability of at least as extreme data, given the null.
It is not the probability of falsely rejecting the null hypothesis: that would be alpha, not p.
It is not the probability that the results are a statistical fluke: the test statistic is calculated under the assumption that all deviations from the null are due to chance. Thus, it cannot be used to estimate the probability of a statistical fluke, since that probability is already assumed to be 100%.
Rejecting the null hypothesis is not confirmation of a causal mechanism: you can imagine a great number of potential explanations for deviations from the null. Rejecting the null does not prove a specific one. See the above example with suicide rates.
NHST promotes arbitrary data dredging (“p-value fishing”): if you test your entire dataset and do not attain statistical significance, it is tempting to test a number of subgroups. Maybe the real effect occurs in men, women, old, young, whites, blacks, Hispanics, Asians, thin, obese etc.? More likely, you will get a number of spurious results that appear statistically significant but are really false positives. In the quest for statistical significance, this unethical behavior is common.
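How quickly subgroup testing produces spurious hits can be estimated with one line of arithmetic. The sketch below assumes the tests are independent, which real subgroups (which overlap) are not, so treat the numbers as a rough guide only. The function name is hypothetical.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n_tests independent
    tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests

# The ten subgroups listed above, plus a few other counts for comparison.
for k in (1, 5, 10, 20):
    print(f"{k:>2} tests: P(at least one false positive) = {familywise_error_rate(k):.2f}")
```

Under this independence assumption, testing the ten subgroups listed above gives about a 40% chance of at least one "significant" finding even if there is no effect anywhere.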
So what is the alternative to mindless and mechanical statistical significance testing? It is to make science-based judgements based on a number of other factors.
The alternative to NHST
These factors can include effect sizes, confidence intervals, the scientific context, replication and meta-analysis.
Effect size: An effect is something that is interesting to measure e.g. the average effectiveness of a new medication or growth rate in a specific yeast strain under particular circumstances. An effect size is simply the size of what you are measuring e.g. how big is the effectiveness or how fast was the growth rate. How big is the difference? Is the observed effect size negligible, small, moderate, large, gigantic or somewhere in between? There is a world of difference between these possibilities.
Confidence intervals: Confidence intervals are a special kind of error bar: 95% of the 95% confidence intervals you could hypothetically generate by repeatedly taking samples from a specific population will include the population parameter. A single confidence interval will either include or not include the population parameter, but you will be right on average 95% of the time during your career if you wager on it being included. A confidence interval works like a margin of error and gives a range of plausible values for the population parameter, where the relative plausibility is highest near the effect size estimate. Sometimes, confidence intervals cannot be calculated or cannot be calculated exactly. But that is no problem: even non-exact confidence intervals are better than NHST and other error bars can fulfill the general goal of confidence intervals.
Substantive significance / Scientific context: Scientific context matters: large changes that are statistically significant may be of no substantive significance (e.g. might have no practical relevance), and small changes that do not achieve statistical significance might be of very high importance. For instance, the effect of aspirin for preventing heart disease is very small, but since there are so many people who risk heart disease and the side effects are small, even a small benefit is going to be of substantive significance. Conversely, a new medication that performs worse than the current one should not be used, even if the difference is not statistically significant.
Replication and meta-analysis: Individual studies can always be flawed and biased in several ways. Therefore, replicating studies in order to see if the results hold up is very important. This helps to weed out false results. Meta-analysis is a method for combining the results of several studies on the same topic. This allows researchers to synthesize research results across many studies and dilute the bias by excluding studies that are flawed and combining studies that are of high quality.
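The "95% of the intervals" interpretation can be checked by simulation. The sketch below assumes a normal population with known standard deviation and uses the simple z-interval for a mean. The population values, sample size and seed are hypothetical.

```python
import math
import random

def coverage_of_95_percent_ci(true_mean=10.0, sd=2.0, n=30, n_samples=4000, seed=3):
    """Share of simulated 95% confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sd / math.sqrt(n)  # z-interval with known SD
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        sample_mean = sum(sample) / n
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / n_samples

print(f"share of intervals containing the true mean: {coverage_of_95_percent_ci():.3f}")
```

Each individual interval either contains the true mean or it does not. The 95% refers to the long-run share of intervals that do.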
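As a minimal illustration of how a meta-analysis combines studies, here is a sketch of inverse-variance (fixed-effect) pooling. This is only one of several pooling models, and it is my choice for the example. The three effect estimates and standard errors are made-up numbers.

```python
import math

def fixed_effect_pooled(estimates, standard_errors):
    """Inverse-variance weighted pooled estimate with an approximate 95% CI."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

# Hypothetical effect sizes and standard errors from three studies.
estimates = [0.30, 0.10, 0.25]
standard_errors = [0.20, 0.10, 0.15]
pooled, pooled_se, (low, high) = fixed_effect_pooled(estimates, standard_errors)
print(f"pooled effect = {pooled:.3f}, SE = {pooled_se:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

More precise studies get more weight, and the pooled standard error is smaller than that of any single study, which is why combining studies improves precision.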
This section will cover a couple of the most common objections made to NHST criticism.
“Misunderstandings and abuses are not the fault of the method”
A method like NHST that has such a strong potential for misunderstandings and abuse even among a large proportion of the most highly intelligent and highly educated has to accept a large proportion of the blame.
“It is possible to use p-values correctly”
So? Even if used correctly, p-values would still not tell us what we are interested in, such as effect size, precision, scientific context etc. There is a ton of inertia in the system and similar suggestions have been made for 50 years and not much has changed.
“Some research projects / questions require a black-or-white answer”
Should those decisions not be based on the best available methods for interpretation? P-values are not a good basis for making correct black-and-white decisions.
“Not possible to calculate exact confidence intervals in some research designs or sample sizes”
Confidence intervals do not need to be exact, as they should not be used to do NHST but to give a range of plausible values for the population parameters. The solution to a bad research design or low sample sizes is a better research design and higher sample sizes, not NHST.
“NHST is objective”
Objectively wrong, perhaps. Making decisions based on flawed metrics and ignoring others might be “objective”, but it is not a good idea.
Case study: how to handle obstinate NHST statisticians
Unfortunately, several kinds of people are standing in the way of transforming the way scientists should interpret research results. Some scientists prefer to take shortcuts and use mechanistic statistical significance testing as a substitute for real scientific thinking and are therefore not well-equipped to consider difficult scientific questions in the light of complex data analysis. Lecturers in intro stats continue to teach outdated and flawed methods of data interpretation. Many journals continue to tolerate both overemphasis and abuse of statistical significance, presumably because of incompetent editors and reviewers. Different approaches are needed to tackle each one of these groups successfully.
However, even if these issues were resolved, there is still one obstacle that remains: the NHST statisticians. These people have built both their field and careers on the widespread use of statistical significance testing. As a result, they defend their precious p-values with tooth and claw, even to the point of intellectual dishonesty and excessive engagement in personalities. Another important explanation is that experts are extremely effective at rationalizing ideas that they have reached based on flawed premises and absurd arguments.
This section will examine one such unproductive interaction with an NHST statistician, Olle Häggström at Chalmers University of Technology (Sweden). For the full conversation, see here (it is in Swedish, so use Google translate).
The background discussion was a study on epigenetic inheritance that focused on hazard ratios and confidence intervals. Most hazard ratios were neither particularly large nor statistically significant. However, the authors decided to bold the one hazard ratio that did reach statistical significance and made this the main story of their paper. Now, the particular epigenetic inheritance pattern in question (from the grandmother on the father’s side to the granddaughter) is not particularly likely to be especially important based on the biological context. As far as I can tell, this is a false positive. I also thought the confidence interval was unacceptably wide for the authors to be that categorical about their finding. Thus, I used the scientific context and confidence intervals to conclude that their finding was not that impressive. However, the effect size was large enough to merit further investigation regardless of statistical significance or lack thereof.
An NHST statistician tackled this study by visually estimating the p-value from the confidence interval (thereby turning fine-grained data into coarse-grained data). Then, he argued that if a correction for multiple testing was used, the p-value would not be low enough to obtain statistical significance and therefore, the results were not credible. I pointed out that the paper was not overly focused on p-values (none occurred in the paper), that the conclusion I made was valid regardless of whether a correction was made or not, and that an impressive effect size should not be dismissed merely due to being statistically non-significant. Furthermore, the correction method used by this NHST statistician was Bonferroni correction, which was statistically unsuitable for this study since the number of hypotheses tested was more than a few and this kind of correction is very strict and sacrifices too much statistical power.
He responded by calling me pompous, ignorant, precocious, arrogant, claimed that I was using ugly rhetorical techniques and emphasized his own statistical expertise. However, the only thing more dangerous than ignorance is the illusion of knowledge, something that Häggström should consider.
The only real “argument” that he used was a clear straw man. He linked to a previous blog post he wrote in defense of statistical significance testing. In it, he proposed the following two research results:
(1) 75% of people prefer Percy-Cola over Crazy-Cola (n = 4)
(2) 75% of people prefer Percy-Cola over Crazy-Cola (n = 1000)
and that statistical significance testing was required to be able to distinguish the scientific credibility of the two, since the effect size is the same. However, this is a clear straw man. The argument is not to replace tunnel vision on p-values with tunnel vision on effect size. Rather, it is to take into account many different metrics such as effect size, precision and what it all means in the scientific context.
In this toy example, it is easy to tell them apart by pointing out that (1) had a very low sample size and is therefore not a good estimate of the population parameter. The point of taking a sample from a population is that the sample should reflect the population. Low sample sizes generally do not and therefore, (1) has essentially no credibility, whereas (2) has considerably more. At no point during this argument did I invoke p-values, talk about the probability of observing at least as extreme results given the null hypothesis or accepting or rejecting null hypotheses. So much for that argument.
At this point, Häggström got so frustrated and upset that he refused to publish any more of my comments unless I carried out a detailed alternative correction for multiple testing. I provided several elements of such a treatment, but he refused to publish it anyway.
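The difference in precision between (1) and (2) can be put in numbers with a confidence interval for a proportion. The sketch below uses the Wilson score interval, which is my choice of method for this illustration and was not part of the original exchange. It assumes that 75% corresponds to 3 of 4 and 750 of 1000 respondents.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

for successes, n in ((3, 4), (750, 1000)):
    low, high = wilson_interval(successes, n)
    print(f"{successes}/{n} prefer Percy-Cola: 95% CI = ({low:.2f}, {high:.2f})")
```

The interval for n = 4 covers most of the range from 0 to 1, while the interval for n = 1000 is narrow, so the two results are easy to tell apart on effect size and precision alone.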
In the end, I think that Häggström did not appreciate being debunked by a person on the Internet and the conversation became an issue of prestige for him. His expertise made him believe that he was surely right and everyone else was wrong. The cognitive dissonance that he experienced by reading my arguments made him lash out in rage, which explains the excessive engagement in personalities from his side. As a side note, Häggström identifies himself as a scientific skeptic and thus becomes just another victim of selective skepticism (like Jerry Coyne on medical psychiatry and psychiatric medication). He is extremely rational in many areas (such as climate change), but then becomes completely irrational in other areas, such as NHST, supercomputers taking over the world in a matrix / terminator scenario and mathematical Platonism.
References and further reading
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge.
Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge: Cambridge University Press.