In the wake of the recent discussions about null hypothesis statistical significance testing and p values on this website, Häggström has decided not to respond beyond calling the latest installment in the series nothing more than a “self-parody”. No substantial statistical or scientific arguments were presented. Despite his unilateral surrender, it can be informative to examine a methods paper entitled “The fickle P value generates irreproducible results”, written by Halsey, Curran-Everett, Vowler and Drummond (2015), that was just published in the renowned journal Nature Methods and that slammed the usage of p values. The authors even call for a wholesale rejection of p values, writing that “the P value’s preeminence is unjustified” and encouraging researchers to “discard the P value and use alternative statistical measures for data interpretation”.
As expected, it corroborates a large number of the arguments I presented in this exchange and directly contradicts many of the flawed assertions made by Häggström. In fact, the paper goes even further than I have done, arguing (1) that p values are unstable under direct replication even at levels of statistical power that are generally considered to be acceptable (e.g. 0.8), (2) that p values are probably superfluous for analyses with adequate statistical power and (3) that previous research that relied on p values needs to be reexamined and replicated with proper methods for statistical analysis.
I challenge Häggström to submit a response to that paper, in which he calls the authors “disinformants” and “silly fools” while mischaracterizing them, quoting them out of context and providing little to no substantive arguments whatsoever. Let us see if his assertions and personal attacks pass scientific peer review. I will not hold my breath.
The instability of p values under exact replications is unacceptably large, even with large power
As was argued at length in the previous posts on the subject, a p value varies widely under exact replications with realistic research designs. Since decisions are taken about the suspiciousness of the null hypothesis based on the obtained p value, the fact that p values are like Russian roulette means that the corresponding decisions are like Russian roulette as well. The authors go through similar simulation results, which show that p values are unstable under exact replications for sample sizes commonly used in biology. They go even further than that, arguing that this instability is unacceptably large even at levels of statistical power that many researchers and statisticians deem acceptable:
Unfortunately, even when statistical power is close to 90%, a P value cannot be considered to be stable; the P value would vary markedly each time if a study were replicated. In this sense, P is unreliable. As an example, if a study obtains P = 0.03, there is a 90% chance that a replicate study would return a P value somewhere between the wide range of 0–0.6 (90% prediction intervals), whereas the chances of P < 0.05 is just 56%. In other words, the spread of possible P values from replicate experiments may be considerable and will usually range widely across the typical threshold for significance of 0.05. This may surprise many who believe that a test with 80% power is robust; however, this view comes from the accepted risk of a false negative
This is independent support for the notion that p values are fundamentally unstable under exact replications of realistic research designs and sample sizes. Not only that, the paper argues that this is still true even for relatively high power, such as 0.8. This completely collapses the objection that NHST works given sufficient power and that the only problem is the misuse.
The authors even go on to provide the following coup de grâce:
For example, regardless of the statistical power of an experiment, if a single replicate returns a P value of 0.05, there is an 80% chance that a repeat experiment would return a P value between 0 and 0.44 (and a 20% chance that P would be even larger). Thus, and as the simulation in Figure 4 clearly shows, even with a highly powered study, we are wrong to claim that the P value reliably shows the degree of evidence against the null hypothesis. Only when the statistical power is at least 90% is a repeat experiment likely to return a similar P value, such that interpretation of P for a single experiment is reliable. In such cases, the effect is so clear that statistical inference is probably not necessary
In sum, p values are only sufficiently stable under exact replications if the power is exceedingly high (at least 90%), but at that point the inferential worth of a p value is low to non-existent, since the difference would be obvious even without inferential statistics. In one fell swoop, the researchers disproved two common defenses of NHST. The problem is not just misuse or low statistical power, and p values are not needed when statistical power is high.
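To make the instability concrete, here is a minimal simulation sketch of my own. It is not the authors' code and it does not reproduce their figures. It assumes a two-sided, two-sample z test with a known standard deviation of 1 and 30 observations per group (both arbitrary choices), picks the true effect so that the design has the stated power, and then draws the observed mean difference directly from its sampling distribution for many exact replications. The names effect_for_power, replicate_p_values and summarize are hypothetical helpers.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def effect_for_power(power, n_per_group, alpha=0.05):
    """True mean difference (in SD units) that gives the requested power.

    Assumes a two-sided, two-sample z test with known SD = 1 and ignores
    the negligible chance of rejecting in the wrong tail.
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return (z_alpha + z_power) * (2 / n_per_group) ** 0.5


def replicate_p_values(power, n_per_group=30, replicates=10_000, seed=1):
    """Sorted p values from exact replications of the same design."""
    rng = random.Random(seed)
    delta = effect_for_power(power, n_per_group)
    se = (2 / n_per_group) ** 0.5  # SE of the difference between two means
    p_values = []
    for _ in range(replicates):
        # The observed mean difference is normal with mean delta and SD se.
        observed = rng.gauss(delta, se)
        z = observed / se
        p_values.append(2 * (1 - STD_NORMAL.cdf(abs(z))))
    return sorted(p_values)


def summarize(power):
    """Share of significant replicates and the 10th/90th percentile of p."""
    p = replicate_p_values(power)
    n = len(p)
    return {
        "share_significant": sum(v < 0.05 for v in p) / n,
        "p10": p[int(0.10 * n)],
        "p90": p[int(0.90 * n)],
    }


for power in (0.5, 0.8, 0.9):
    print(power, summarize(power))
```

The share of significant replicates should land close to the nominal power, while the gap between the 10th and 90th percentiles shows how far apart p values from identical experiments can fall.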
Alternatives to p values include effect size and confidence intervals
What alternatives are suggested to replace p values? I have argued that it should be effect sizes, confidence intervals, replication, meta-analysis and interpretation of effect sizes and confidence intervals in the scientific context. Again, the researchers agree:
This approach to statistical interpretation emphasizes the importance and precision of the estimated effect size, which answers the most frequent question that scientists ask: how big is the difference, or how strong is the relationship or association? In other words, although researchers may be conditioned to test null hypotheses (which are usually false), they really want to find not only the direction of an effect but also its size and the precision of that estimate, so that the importance and relevance of the effect can be judged
To aid interpretation of the effect size, researchers may be well advised to consider what effect size they would deem important in the context of their study before data analysis.
In addition, the effect size and 95% CIs allow findings from several experiments to be combined with meta-analysis to obtain more accurate effect-size estimates, which is often the goal of empirical studies.
As an added bonus, they corroborated my previous statement that null hypotheses are typically false to begin with.
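The point about combining effect sizes and 95% CIs across experiments can be illustrated with a small sketch. It assumes a fixed-effect, inverse-variance meta-analysis, symmetric 95% intervals based on the normal approximation, and three invented studies. The paper does not prescribe this particular model, and the function name pool_fixed_effect and the numbers are purely illustrative.

```python
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96


def pool_fixed_effect(studies):
    """Pool (estimate, ci_lower, ci_upper) tuples with inverse-variance weights.

    Each standard error is recovered from the width of the 95% CI, which
    assumes a symmetric interval built from a normal approximation.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for estimate, lower, upper in studies:
        se = (upper - lower) / (2 * Z95)
        weight = 1 / se**2  # more precise studies count for more
        total_weight += weight
        weighted_sum += weight * estimate
    pooled = weighted_sum / total_weight
    pooled_se = (1 / total_weight) ** 0.5
    return pooled, pooled - Z95 * pooled_se, pooled + Z95 * pooled_se


# Hypothetical studies: (effect size, lower 95% limit, upper 95% limit)
studies = [(0.40, -0.10, 0.90), (0.55, 0.15, 0.95), (0.30, -0.30, 0.90)]
estimate, lower, upper = pool_fixed_effect(studies)
print(f"pooled effect {estimate:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

The pooled interval is narrower than any single study's interval, which is the sense in which effect sizes and CIs accumulate across studies in a way that a pile of p values does not.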
Assurance and precision considerations can replace power analysis
Häggström defended NHST by claiming that NHST was required to reach well-supported conclusions about what sample sizes are appropriate. I retorted that it was possible to find appropriate sample sizes by considering a wide range of other factors, including precision and assurance. This approach is based on the expected width of the confidence interval and on how often an interval at least that narrow should be obtained. Yet again, the researchers agree with me:
In turn, power analysis can be replaced with ‘planning for precision’, which calculates the sample size required for estimating the effect size to reach a defined degree of precision.
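As a rough sketch of what planning for precision can look like, the following computes the sample size per group needed for the 95% CI of a difference between two means to have a chosen half-width. It assumes a known, common standard deviation and a normal-based interval, and it leaves out the assurance step, which would require the sampling distribution of the estimated standard deviation. The function name n_for_half_width is hypothetical.

```python
import math
from statistics import NormalDist


def n_for_half_width(sd, half_width, confidence=0.95):
    """Per-group n so that the CI for a mean difference is narrow enough.

    Assumes two equally sized groups with a known common SD, so that
    half_width = z * sd * sqrt(2 / n), solved here for n.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(2 * (z * sd / half_width) ** 2)


# Hypothetical planning values: SD of 1 unit, desired half-width of 0.5 units
print(n_for_half_width(sd=1.0, half_width=0.5))
```

No null hypothesis, alpha level or power enters this calculation. The only inputs are the expected variability and the precision the researcher wants.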
Let us return to some of the classic defenses that Häggström and other NHST statisticians have been deploying. Although I provided my own refutations of them, I discovered that they had been independently debunked in the third chapter, “Eight common but false objections to the discontinuation of significance testing in the analysis of research data” (Schmidt and Hunter), of a 1997 book called “What if there were no significance tests”, which, unsurprisingly, reached the same conclusions that I had.
“The problem is merely the misuse of NHST generally and p values specifically. If used correctly, there would be little to no problems”
This is labeled as “objection 7” in the book mentioned above. The problem with this objection (according to Schmidt and Hunter) is that p values, even if they were interpreted correctly, are not a suitable measure for “advancing the development of cumulative scientific knowledge” since they say nothing about effect size or precision. Furthermore, the problem of low statistical power would still remain. In contrast, the retort that I have previously used was to point out that p values are only indirectly related to the posterior probability and that it is possible for the alternative hypothesis to be even more unlikely. Thus, even if NHST is used without any misconceptions, it is not terribly useful. Although I did point out the problem with p values not telling researchers what they want to know, I did not use it in this context.
“Doing confidence intervals is just secretly doing NHST, but indirectly.”
I previously responded that this is not relevant, as confidence intervals can break the chains of NHST, since they can be used and interpreted outside the NHST framework. Schmidt and Hunter label this “objection 5” and argue that there would still be benefits of using confidence intervals even if researchers merely used them as another way of performing statistical significance testing, since a confidence interval will (1) give information about the precision and uncertainty of the study, (2) reduce wild overinterpretation of research results and (3) highlight the effect size as a point estimate. These benefits would not be seen if the researchers merely were to do a statistical significance test in Excel or SPSS. However, they continue by pointing out that researchers need not and should not use confidence intervals as statistical significance tests. This is because (i) confidence intervals existed and were used a long time before statistical significance tests and (ii) few NHST statisticians would argue that other confidence intervals, such as the 68% confidence interval (one standard error above and below the mean) or the 50% probable error confidence interval, should be or must be taken as statistical significance tests.
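The remark about 68% and 50% intervals can be made concrete with a short sketch. It assumes a normally distributed estimate with a known standard error; the estimate and standard error used here are invented, and the helper names multiplier and interval are hypothetical.

```python
from statistics import NormalDist


def multiplier(confidence):
    """Number of standard errors on each side of the estimate."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def interval(estimate, se, confidence):
    """Normal-based confidence interval at the given confidence level."""
    half_width = multiplier(confidence) * se
    return estimate - half_width, estimate + half_width


# Hypothetical estimate and standard error
estimate, se = 0.40, 0.25
for level in (0.50, 0.68, 0.95):
    low, high = interval(estimate, se, level)
    print(f"{level:.0%} CI: [{low:.2f}, {high:.2f}] "
          f"({multiplier(level):.2f} SE each side)")
```

The same estimate and standard error produce all three intervals. Only the 95% interval happens to line up with the conventional 0.05 threshold, which is why the interval as such is a statement about precision rather than a significance test in disguise.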
“This paper is not perfect, since it confuses false positive with type-I error and confuses p with alpha in one of the glossaries”
Sure, hardly any paper is absolutely perfect. Yes, the researchers wrongly use false positive and type-I error as synonyms. In reality, the type-I error rate assumes the global statistical null and is set by the researcher, whereas the false positive rate is predicated on the real-life mix of true and false statistical null hypotheses, which is not controlled by the researcher. This might be defended, though, because the authors used the term “false positive” in quotation marks and may therefore not have been referring to the term in the precise statistical sense. The researchers do define the p value correctly in the first glossary, but in the latter they define “P” as “achieved significance level”. However, the significance level is defined by alpha and not P, and these two are not the same. These issues, however, do not detract from the bigger issues.
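The distinction between the type-I error rate and the false positive rate can be shown with a little arithmetic. This is a sketch under simplifying assumptions: every test uses the same alpha and the same power, and the share of true null hypotheses among those tested is a made-up input that no researcher actually knows. The function name false_positive_share is hypothetical.

```python
def false_positive_share(alpha, power, share_true_nulls):
    """Share of significant results that come from true null hypotheses.

    alpha is fixed by the researcher; share_true_nulls is the (unknown)
    real-life mix of true and false nulls, which the researcher does not set.
    """
    false_positives = alpha * share_true_nulls
    true_positives = power * (1 - share_true_nulls)
    return false_positives / (false_positives + true_positives)


# Same alpha and power, different hypothetical mixes of true nulls
for share in (0.1, 0.5, 0.9):
    print(share, round(false_positive_share(0.05, 0.8, share), 3))
```

Alpha stays at 0.05 in every row, yet the share of significant results that are false positives moves with the mix of hypotheses being tested, so the two quantities cannot be synonyms.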