# Debunking Statistically Naive Criticisms of Banning P Values

Olle Häggström is a mathematical statistician from Chalmers University of Technology and a prominent scientific skeptic. His projects and papers relevant to skepticism include several hard-hitting defenses of good science, such as opposing pseudoscience about climate change, criticizing the encroachment of postmodernism into higher education and exposing the intelligent design creationist abuse of the No Free Lunch (NFL) theorems. However, he also promotes unsupported beliefs about null hypothesis significance testing (NHST), mathematical platonism and artificial general intelligence, thus making him another example of an inverse stopped clock.

Recently, Häggström wrote a credulous blog post in which he exclaimed that the decision by the journal *Basic and Applied Social Psychology* (BASP) to ban NHST constitutes intellectual suicide. In it, he repeats a number of errors that he has made before and adds a few others.

> The only things about NHSTP and confidence intervals that are “invalid” are certain naive and inflated ideas about their interpretation, held by many statistically illiterate scientists.

In this sentence, Häggström deploys the classic rhetorical technique whereby he says that the NHST procedure itself is not flawed, only that many scientists misuse it. This was refuted in a previous post on Debunking Denialism that strongly criticized NHST: “[a] method like NHST that has such a strong potential for misunderstandings and abuse even among a large proportion of the most highly intelligent and highly educated has to accept a large proportion of the blame.” But even if we ignore that, NHST *is* flawed for a great number of reasons.

First, the p value is only indirectly related to the posterior probability. This means that a low p value is not a good argument against the null hypothesis because the alternative hypotheses might be even more unlikely. If you test homeopathy for cancer or the alleged psychic ability of someone, it is not really that impressive to find a p value that is lower than 0.05 (or lower than 0.0001 or whatever). Even testing moderately unlikely hypotheses (with an empirical prior of anywhere between, say, 10% and 30%) means that the p value is not a good measurement of posterior probability.
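To make the gap between statistical significance and posterior probability concrete, here is a minimal sketch in Python. It assumes a simplified setup with a point null and a single alternative, conditions on "the result was significant at 0.05" rather than on an exact p value, and uses a power of 0.52. The function name `posterior_given_significance` and the example priors (including the 0.1% prior standing in for a homeopathy-like claim) are illustrative assumptions, not figures from any study.

```python
def posterior_given_significance(prior, power=0.52, alpha=0.05):
    """Probability that the alternative is true given a significant result.

    Simplified two-hypothesis Bayes calculation (an assumption of this sketch):
    - prior: prior probability that the alternative hypothesis is true
    - power: P(significant | alternative true)
    - alpha: P(significant | null true)
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical priors: a wildly implausible claim, then 10% and 30%.
    for prior in (0.001, 0.10, 0.30):
        posterior = posterior_given_significance(prior)
        print(f"prior = {prior:.3f} -> posterior after p < 0.05: {posterior:.3f}")
```

Under these assumptions, a significant result moves a 10% prior to only about 54%, and a 0.1% prior to about 1%, which is a long way from what "p < 0.05" is commonly taken to mean.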

Second, null hypotheses are almost always both false and irrelevant. It is extremely unlikely that two population parameters are exactly identical under most realistic circumstances. The vast majority of null hypotheses are nil hypotheses, where the population parameter is 0 or the difference between two population parameters is 0. However, in most scientific areas, we are not interested in demonstrating that an effect size is larger than 0 or that two population parameters are not exactly identical, but in accurately estimating effect sizes and finding the *size* of the difference. Thus, for most purposes, the null hypotheses being used are scientifically irrelevant. There is no point in testing a hypothesis you can almost certainly be sure is false. It is an exercise in futility.

Third, NHST contributes to black-and-white thinking about scientific results. It arbitrarily divides results into “statistically significant” and “statistically non-significant” as if this were somehow an important result for addressing any relevant scientific question.

Fourth, a p value is strongly influenced by sample size. This means that small samples (say, fewer than n = 50 per group) will rarely detect differences. On the other hand, large samples (say, more than n = 500 per group) will almost always detect any small difference. Because the p value is confounded by sample size, it is not a useful measure of the evidence against the null hypothesis.
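The dependence on sample size is easy to see in a small sketch. The code below holds the observed standardized difference fixed at a hypothetical d = 0.2 and only changes the number of subjects per group. It assumes a two-sided z test with a known standard deviation (a simplification; a t test would give slightly different numbers), and the function name `two_sided_p` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d_observed, n_per_group):
    """Two-sided p value for a two-group z test with known, equal SDs.

    d_observed is the observed difference between group means in SD units.
    The standard error of that difference (in SD units) is sqrt(2 / n).
    """
    z = d_observed * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Same observed effect every time; only the sample size differs.
    for n in (20, 50, 500, 5000):
        print(f"n per group = {n:>4}: p = {two_sided_p(0.2, n):.4g}")
```

The same small observed effect is "non-significant" at n = 50 per group and "highly significant" at n = 500 per group, even though the estimated effect size has not changed at all.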

Fifth, NHST is unstable with regard to replication, leading to something that is called the “dance of the p values”. Under realistic population parameters, experimental designs and sampling, the p value you get can be likened to playing a roulette wheel. Following Geoff Cumming, suppose that you have an experimental population and a control population with the following properties:

- Both are normally distributed.
- Their standard deviations are both 20.
- The difference between the means is 10, which corresponds to a standardized effect size of d = 0.5 (often considered a medium-sized effect).
- The sample size per group is 32 (a lot of social, psychological and medical research has a smaller sample size).
- The statistical power is 0.52 (which is also fairly typical).

Let us suppose that you take a sample and calculate a p value under the null hypothesis of no difference. Now repeat that, say, fifty times. How large is the spread among p values calculated? It turns out that this is very large. Essentially, you typically get p values anywhere from below 0.001 to 0.5. Whichever you happen to get depends on happenstance. The same experiment run again can literally change the p value from “highly statistically significant” to “non-significant”. What does the frequency distribution of p values look like?

- 36.1% had p >= 0.1
- 12.3% had 0.1 > p >= 0.05
- 23.4% had 0.05 > p >= 0.01
- 18.4% had 0.01 > p >= 0.001
- 9.8% had 0.001 > p
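This distribution can be approximated with a short calculation, followed by a small simulation of the "dance" itself. The sketch below assumes a two-sided z test with the standard deviation treated as known (Cumming's own demonstrations may rely on a t test, so small discrepancies are possible). The control mean of 50, the seed and the function names are arbitrary placeholders chosen for this illustration.

```python
from math import sqrt
from random import Random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal distribution


def prob_p_below(alpha, d=0.5, n_per_group=32):
    """P(two-sided p < alpha) when the true standardized effect is d."""
    shift = d * sqrt(n_per_group / 2)  # expected value of the z statistic
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


def p_value_bins(d=0.5, n_per_group=32):
    """Probability of landing in each conventional p value bin."""
    b10, b05, b01, b001 = (prob_p_below(a, d, n_per_group)
                           for a in (0.1, 0.05, 0.01, 0.001))
    return {
        "p >= 0.1": 1 - b10,
        "0.1 > p >= 0.05": b10 - b05,
        "0.05 > p >= 0.01": b05 - b01,
        "0.01 > p >= 0.001": b01 - b001,
        "0.001 > p": b001,
    }


def simulate_p_values(reps=25, d=0.5, sd=20.0, n_per_group=32, seed=1):
    """Repeat the same experiment and collect the p value from each run."""
    rng = Random(seed)
    control_mean = 50.0  # arbitrary placeholder, not taken from the text
    se = sd * sqrt(2 / n_per_group)  # standard error of the mean difference
    p_values = []
    for _ in range(reps):
        control = [rng.gauss(control_mean, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(control_mean + d * sd, sd) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        p_values.append(2 * (1 - STD_NORMAL.cdf(abs(z))))
    return p_values


if __name__ == "__main__":
    for label, prob in p_value_bins().items():
        print(f"{label:>18}: {100 * prob:.1f}%")
    print(" ".join(f"{p:.3f}" for p in simulate_p_values()))
```

Under these assumptions, the bin probabilities should land close to the percentages listed above, and the probability of p < 0.05 is the stated power of roughly 0.52. The simulated p values from identical experiments scatter widely from run to run.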

The analogy with playing roulette is apt, since the result of a study can strongly influence where it gets published, which in turn can strongly influence whether or not a researcher gets more funding or what academic positions he or she gets in the long run.

In other words, the problem is not simply “statistically illiterate scientists”. The NHST procedure itself is seriously flawed, both on its own and under realistic research designs.

> These misconceptions about NHSTP and confidence intervals are what should be fought, not NHSTP and confidence intervals themselves, which have been indispensable tools for the scientific analysis of empirical data during most of the 20th century, and remain so today.

Although Häggström is correct that confidence intervals (outside NHST) should not be fought, judging research based on the chance results of a spin of the p value roulette wheel is not one of the “indispensable tools for the scientific analysis of empirical data”. Quite the contrary, NHST has done serious damage to scientific research.

> If statistical significance is obtained, then we are in a position to conclude that either the null hypothesis is false, or an event of low probability (namely the event of obtaining statistical significance despite the null hypothesis being true) happened (with the traditional significance level, the probability is at most 0.05). Now, low-probability events do happen sometimes, but we expect them typically not to happen, and if such a thing didn’t happen this time, then the only remaining option is that the null hypothesis is false. The lower the threshold for significance is, the stronger is a statistically significant result considered to count against the null hypothesis. This is the logical justification of NHSTP (the justification of confidence intervals is similar), and the way we statisticians have been teaching it for longer than I have lived.

This is merely a regurgitation of a falsehood promulgated by Ronald Fisher, one of the inventors of NHST. In 1956, Fisher wrote in the book *Statistical Methods and Scientific Inference* that a very low p value was “amply low enough to exclude at a high level of significance any theory involving a random distribution […] The force with which such a conclusion is supported is logically that of the simple disjunction: Either an exceptionally rare chance has occurred, or the theory of random distribution is not true.”

The reality is that there are other options besides these two. For instance, the sample size can be large. That will produce small p values and statistical significance for almost any negligibly small difference between means. In other words, nothing exceptionally rare has occurred and the two groups are essentially equivalent.

Häggström is also wrong about the justification for confidence intervals. It might be tempting for an NHST statistician to think of confidence intervals exclusively in terms of the canonical frequentist definition (“if you do a very large number of trials, about 95% of confidence intervals generated will include the true population parameter”) or as “merely another way to do a statistical significance test”. However, confidence intervals are so much more. Cumming (2012) lists four other ways to use confidence intervals:

- A 95% confidence interval is a range of plausible values for the population parameter.
- A 95% confidence interval is a margin of error that gives a likely maximum error, even though it is of course possible to get larger errors.
- The values closer to the mean are relatively more plausible than the values nearer the ends of the confidence interval.
- A 95% confidence interval is, on average, an 83% prediction interval.

When Cumming says “range of plausible values”, he does not mean “probable values”, as a single confidence interval either includes or does not include the population parameter.
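The 83% figure in the last item can be checked with a back-of-the-envelope calculation. The sketch below assumes normally distributed sample means, a known standard deviation and a replication with the same sample size as the original study; the function name `replication_capture_probability` is an illustrative assumption. The key step is that the difference between two independent sample means has a standard deviation of √2 standard errors, so the original margin of error of about 1.96 standard errors is worth only about 1.39 standard deviations of that difference.

```python
from math import sqrt
from statistics import NormalDist


def replication_capture_probability(confidence=0.95):
    """Chance that a replication mean falls inside the original interval.

    Assumes known SD and equal sample sizes, so that the difference between
    the original and replication means is normal with SD sqrt(2) * SE.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return 2 * NormalDist().cdf(z / sqrt(2)) - 1


if __name__ == "__main__":
    print(f"{replication_capture_probability():.3f}")
```

Under these simplifying assumptions the result is about 0.83, in line with the figure Cumming gives.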

Astonishingly, Häggström makes an unabashed appeal to tradition. However, the fact that an error has been perpetuated for a very long time is not an argument for it being true.

His final claim is a spectacular failure to break loose from the suffocating chains of NHST:

> The irony here is that in order to quantify what is a suitable sample size in order to make descriptive statistics sufficiently “stable”, and to make sampling error “less of a problem”, the NHSTP conceptual apparatus is needed.

We need NHST to decide what sample sizes are good enough? This is perhaps even more bizarre than his previous claim that we need to calculate p values in order to tell the difference between 75% approval for a certain coke brand (n = 4), and 75% approval (n = 1000). This notion comes from an excessive focus on getting a statistical power that is good enough to detect statistically significant differences of certain sizes. However, if we let go of NHST, Häggström's argument falters.
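The coke example is a good illustration that estimation alone does the job. As a rough sketch, the code below computes a 95% confidence interval for a proportion using the Wilson score method. The choice of the Wilson interval and the function name `wilson_interval` are assumptions made for this illustration, not something taken from Häggström or Cumming.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # 75% approval in both surveys, with very different sample sizes.
    for successes, n in ((3, 4), (750, 1000)):
        low, high = wilson_interval(successes, n)
        print(f"{successes}/{n}: 95% CI from {low:.3f} to {high:.3f}")
```

With n = 4 the interval stretches from roughly 0.30 to 0.95, while with n = 1000 it is only a few percentage points wide. The difference in precision is obvious without a null hypothesis or a p value anywhere in sight.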

First of all, bigger sample sizes are always better (all other things equal) for accurately estimating the population parameters under study (unless we are using NHST in which case big sample sizes cause studies to be overpowered). So an initial rule of thumb for sample size decisions might be “as many as you can afford and have the time for”. A more sophisticated method can involve considerations of what kind of experimental subjects you have and the reliability of the method used. If the biological variability is very low, then the function of replicates is primarily to get an estimate of the technical variability. A method that is not that reliable may require more replicates to provide the same trustworthiness in the results. Finally, there are formal statistical methods (described in Cumming (2012)) for calculating suitable sample sizes in order to get an expected length of confidence intervals (precision) at least a certain percent of the time (assurance). Notice that no part of precision or assurance arguments about sample size relates to doing a statistical test or talking about null hypotheses, so this is not “just secretly doing NHST”.
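A stripped-down version of the precision argument can be sketched as follows. The code assumes two equally sized groups, a known common standard deviation and a normal approximation, and it asks how many subjects per group are needed for the 95% confidence interval on the difference between means to have at most a given margin of error. It does not model assurance, which requires accounting for the variability of the sample standard deviation as described in Cumming (2012). The function name and the example target margins are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_for_margin(sd, target_margin, confidence=0.95):
    """Smallest n per group so that the CI half-width is <= target_margin.

    For two independent groups with known common SD, the margin of error
    of the difference between means is z * sd * sqrt(2 / n).
    Solving for n gives n = 2 * (z * sd / target_margin) ** 2.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(2 * (z * sd / target_margin) ** 2)


if __name__ == "__main__":
    # SD of 20 as in the earlier example; the target margins are made up.
    for margin in (10.0, 5.0, 2.5):
        n = n_per_group_for_margin(sd=20.0, target_margin=margin)
        print(f"margin of error {margin:>4}: n per group = {n}")
```

Note that the inputs are a standard deviation and a desired precision. No null hypothesis, significance level or power enters the calculation, and halving the margin of error roughly quadruples the required sample size.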

> The incompetence of professors Trafimow and Marks, and their catastrophic misstep of banning NHSTP from their journal, is a splendid illustration of the main thesis of my paper Why the empirical sciences need statistics so desperately.

Sciences are indeed in desperate need of better statistical analysis of research results. This should be based on effect size estimation, precision estimation, replication/reproducibility, meta-analysis and the interpretation of these outcomes in the relevant scientific context. However, NHST should play no part in it. It is a flawed method that has had disastrous consequences and has plagued scientific research for more than half a century (despite vigorous attempts at countering it). Banning p values may seem like a drastic step, but once you appreciate the historical context, it seems the most productive way to get a handle on the problem. It will help break the vicious cycle where prevalent use of NHST forces method lecturers to focus on NHST so that students can understand published research, which leads to students using NHST. Finally, you cannot abuse an invalid method if you cannot use it in the first place.

**References and further reading**

Cumming, G. (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.

Ellis, P. D. (2010). The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. Cambridge: Cambridge University Press.

Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.

Ziliak, S. T., & McCloskey, D. N. (2008). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor: University of Michigan Press.

I think your early point about the null hypothesis testing is the cornerstone to the abuse of p-values. Essentially the testing isn’t really designing a study properly to be able to test the null, or the null isn’t really defined meaningfully. So then slapping any statistical test over the top is really just another example of “rubbish in, rubbish out”.

My biometrician doesn’t like p-values much, but everyone expects them, so he tries to get them to speak to him before doing experiments. Tries. They usually just dump a bunch of data on his desk after the experiments are completed and expect him to make sense of it for them.

That is indeed a familiar sight.

“[nil] null hypotheses are almost always both false and irrelevant”

I’ve realized recently that they can also be described as redundant. Take the 1919 eclipse experiments. They could have tested if the apparent position of the stars was “due to chance”, but instead they compared the data to the predictions of two theories (Einstein v. Newton). ANY data capable of distinguishing between the two must also be able to rule out chance. This will always be true.

The response will be “what if there is no theory that can make predictions?” Well, then there are still infinity – 1 explanations for the non-chance result to investigate, the set has not been meaningfully narrowed down. Also, it seems that in all cases where that excuse is brought up you can find that 1) The data is very superficial (e.g. an average of something measured at one time point) and/or 2) The people involved are unfamiliar with the tools that have been developed to study complex, dynamic phenomena (e.g. calculus).

Rather than testing the null hypothesis, the correct thing to do is collect enough data and show it to someone with the skills to develop some theories that can make predictions. The alternative seems to be jumping to the conclusion that your substantive hypothesis is the correct explanation. That is clearly a strawman fallacy.
