The edifice of null hypothesis significance testing (NHST) has been shaken to its core once more. On March 6th, the American Statistical Association (ASA) revealed to the world that it had had enough. For the first time since its founding in 1839, it published a position statement and issued recommendations on a statistical issue. This issue was, of course, p values and statistical significance. The position statement came in the form of a paper in one of its journals, The American Statistician, together with a press release on the ASA website. The executive director of the ASA, Ron Wasserstein, also gave an interview to Alison McCook at the website Retraction Watch, and the Nature website has a news item about it.

**What was the central point of the position statement?**

The press release (p. 1) summed it up quite nicely:

“The p-value was never intended to be a substitute for scientific reasoning,” said Ron Wasserstein, the ASA’s executive director. “Well-reasoned statistical arguments contain much more than the value of a single number and whether that number exceeds an arbitrary threshold. The ASA statement is intended to steer research into a ‘post p <0.05 era.'"

In other words, the ASA acknowledges that p values were never supposed to be the central way to evaluate research results, that basing conclusions on p values, and especially on whether the results are statistically significant or not, cannot be considered well-reasoned, and finally that the scientific community should move in a direction that severely de-emphasizes p values and statistical significance. Coming from a world-renowned statistical association, this is a stunning indictment of the mindless NHST ritual.

The final paragraph of the preamble to the position statement (p. 6) also points out that this criticism of NHST is not new:

Let’s be clear. Nothing in the ASA statement is new. Statisticians and others have been sounding the alarm about these matters for decades, to little avail. We hoped that a statement from the world’s largest professional association of statisticians would open a fresh discussion and draw renewed and vigorous attention to changing the practice of science with regards to the use of statistical inference.

The ASA seems to share the sentiment of many critics of NHST, namely that there are several valid objections to NHST and that these have been raised as very serious problems for many decades with very little progress.

**What six principles did the ASA position statement include?**

The position statement included six core principles, together with a paragraph describing each (p. 9-12):

1. P-values can indicate how incompatible the data are with a specified statistical model.

2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

4. Proper inference requires full reporting and transparency.

5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

The ASA reiterates many of the known problems with NHST and its common abuses: the conflation of the p value with the probability of the null hypothesis or the chance that the results are “due to chance”, the overemphasis on statistical significance and its flawed black-and-white decision-making, the conflation of statistical and practical significance, and the notion that the p value is a reasonable measure of evidence.
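To see why a statistically significant result is not the same thing as a low probability that the null hypothesis is true, a rough sketch using Bayes' rule may help. Everything in it is an illustrative assumption rather than something taken from the ASA statement: the function name is hypothetical, and the prior probabilities, the power of 0.8 and the 0.05 threshold are made-up inputs. It also works with "significant at the threshold" rather than with an exact p value, which is a simplification.

```python
def prob_null_given_significant(prior_effect, power, alpha=0.05):
    """P(null is true | result is statistically significant), via Bayes' rule.

    prior_effect: assumed prior probability that a real effect exists
    power: assumed probability of a significant result when the effect is real
    alpha: threshold, i.e. probability of a significant result under the null
    """
    true_positives = power * prior_effect
    false_positives = alpha * (1.0 - prior_effect)
    return false_positives / (true_positives + false_positives)


# Illustrative scenarios (assumed numbers, for demonstration only)
for prior in (0.5, 0.1, 0.01):
    share = prob_null_given_significant(prior_effect=prior, power=0.8)
    print(f"prior P(effect) = {prior:<4}  ->  P(null | significant) = {share:.2f}")
```

Under these assumed inputs, the same "significant" outcome leaves roughly a 6%, 36% or 86% probability that the null hypothesis is true, depending only on how plausible the effect was to begin with. None of these numbers equals 0.05.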

**What did the ASA recommend?**

The ASA position paper made two general recommendations. First, they encourage alternatives to p values and the NHST ritual because these alternatives can assess effect sizes, ranges of plausible values and the strength of the evidence for the research hypothesis (p. 12):

In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches. These include methods that emphasize estimation over testing, such as confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence, such as likelihood ratios or Bayes Factors; and other approaches such as decision-theoretic modeling and false discovery rates. All these measures and approaches rely on further assumptions, but they may more directly address the size of an effect (and its associated uncertainty) or whether the hypothesis is correct.

Second, they emphasized the strong value of broad statistical competence throughout the research process from experimental design to reporting of methods and results in the scientific literature (p. 12-13):

Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.

Let that sink in: no single index should substitute for scientific reasoning. Yet so much of the NHST ritual is precisely about replacing an appreciation for effect sizes, ranges of plausible values and replication with the single p value metric, typically together with one of the dozens of commonly believed myths that proponents of NHST promote.
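As a minimal sketch of what reporting an effect size and a range of plausible values can look like next to a p value, consider the following simulation. The data are simulated, not real: the true difference of 0.5 points on a scale with a standard deviation of 15, the sample size of 100,000 per group and the helper name `summarize_difference` are all assumptions chosen for illustration. The interval and the p value use a large-sample normal approximation; for small samples a t-based interval would be the usual choice.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def summarize_difference(a, b, confidence=0.95):
    """Mean difference, large-sample CI, Cohen's d and two-sided p value.

    Relies on a normal (z) approximation, which is only reasonable for
    large samples.
    """
    n_a, n_b = len(a), len(b)
    sd_a, sd_b = stdev(a), stdev(b)
    diff = mean(a) - mean(b)
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    # Pooled standard deviation, used to standardize the difference
    pooled_sd = math.sqrt(
        ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    )
    cohens_d = diff / pooled_sd
    p = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    return {"diff": diff, "ci": ci, "cohens_d": cohens_d, "p": p}


# Simulated data (assumption): tiny true difference, very large sample
rng = random.Random(1)
treated = [rng.gauss(100.5, 15) for _ in range(100_000)]
control = [rng.gauss(100.0, 15) for _ in range(100_000)]

result = summarize_difference(treated, control)
print(f"difference = {result['diff']:.2f}, "
      f"95% CI = ({result['ci'][0]:.2f}, {result['ci'][1]:.2f}), "
      f"Cohen's d = {result['cohens_d']:.3f}, p = {result['p']:.2g}")
```

With a sample this large, a run like this will typically produce a very small p value even though the standardized effect is around 0.03. The p value alone would suggest an impressive finding; the effect size and interval show how small the difference actually is.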

**What room did the ASA position statement leave for statistical significance?**

The ASA position statement completely rejected the idea of statistical significance i.e. the idea that you can reject a null hypothesis as false or questionable if the p value is below an arbitrary cut-off (p. 10):

Practices that reduce data analysis or scientific inference to mechanical “bright-line” rules (such as “p < 0.05”) for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision-making.

[…]

The widespread use of “statistical significance” (generally interpreted as “p ≤ 0.05”) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.

This is truly the death knell of statistical significance as a concept. But why does the ASA reject this obsession with statistical significance and what is their alternative? The position statement explains (p. 10):

Researchers should bring many contextual factors into play to derive scientific inferences, including the design of a study, the quality of the measurements, the external evidence for the phenomenon under study, and the validity of assumptions that underlie the data analysis

This would mean doing *actual science* as opposed to the cult-like ritual of NHST.

In the end, this point successfully undermines the Neyman-Pearson framework (one of the two precursors to NHST), which necessarily assumes that a dichotomous decision is not only possible but also reasonable.

**What about p values?**

The ASA clearly rejects statistical significance. But what about the underlying metric that allows people to make dichotomous decisions of “statistically significant” or “statistically non-significant”? What is, according to the ASA, the fate of the p value? The position statement is a bit ambivalent about this issue, which is perhaps to be expected for a process that sought to reach a consensus among many expert statisticians.

On the one hand, the first principle in their list seems to indicate that the p value can still have a place in the post p < 0.05 era (p. 9):

A p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data.

[…]

The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions.

This seems to leave a small opening for the relevance and usefulness of p values. However, it should be noted that this directly contradicts the claims made by NHST defender Olle Häggström, who asserts (by uncritically quoting tobacco apologist R. A. Fisher) that statistical significance indicates that either the null hypothesis is false or that something unlikely has occurred. I previously discussed the counterexample of large sample size, but the ASA adds another one, namely faulty underlying assumptions.

However, further down in the position statement, another view emerges. The sixth and final principle states that (p. 12):

6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

Researchers should recognize that a p-value without context or other evidence provides limited information. For example, a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis. Likewise, a relatively large p-value does not imply evidence in favor of the null hypothesis; many other hypotheses may be equally or more consistent with the observed data. For these reasons, data analysis should not end with the calculation of a p-value when other approaches are appropriate and feasible.

The basic take-home message is this: p values in isolation are not a valid measure of the evidence against a hypothesis. If you can do something other than compute p values, the ASA position statement unequivocally declares that you should. But this invites the question of why you should even use p values if “other approaches are appropriate and feasible”. What, in addition, do they bring to the table? Not much, apparently.

In the same way that the previous point undermined the Neyman-Pearson framework, this issue undermines the Fisher framework (the other precursor to NHST), which explicitly considered a p value to be a reliable and credible measure of the evidence against the null hypothesis.

It is also interesting to read an excerpt from the Retraction Watch interview (linked earlier) with Wasserstein. Here is the question and his reply on what is meant by the “post p < 0.05” era (my bold):

Retraction Watch: You note in a press release accompanying the ASA statement that you’re hoping research moves into a “post p<0.05” era – what do you mean by that? And if we don’t use p values, what do we use instead?

Ron Wasserstein:

In the post p<0.05 era, scientific argumentation is not based on whether a p-value is small enough or not. Attention is paid to effect sizes and confidence intervals. Evidence is thought of as being continuous rather than some sort of dichotomy. (As a start to that thinking, if p-values are reported, we would see their numeric value rather than an inequality (p=.0168 rather than p<0.05)). All of the assumptions made that contribute information to inference should be examined, including the choices made regarding which data is analyzed and how. In the post p<0.05 era, sound statistical analysis will still be important, but no single numerical value, and certainly not the p-value, will substitute for thoughtful statistical and scientific reasoning.

Here it is, in black and white, a complete vindication of the strongly anti-NHST position taken by Debunking Denialism.

**Conclusion**

So how should we see the position statement released by the American Statistical Association? At the very least, it is further evidence that something is enormously wrong with NHST. Although they do primarily focus on wrong ways to use NHST, it is hard to see what exactly p values can contribute “when other approaches are appropriate and feasible”.

What do I think we should do? Stop teaching statistics by teaching NHST. Start teaching statistics that focuses on larger sample sizes, effect sizes, confidence intervals (or other interval estimation techniques if CIs cannot be calculated), replication, meta-analysis and the evaluation of research results in the scientific context (preferably by detailed case studies!). Journals should adopt publication guidelines for statistical analysis like those of the APA publication manual or the journal Psychological Science. They should start refusing to publish papers that practice NHST tunnel vision and always have at least one expert statistical reviewer who specifically examines the statistical analysis. They should be especially open to publishing replication attempts, or even give a prize to the most methodologically rigorous replication attempt during a given year. Fix the incentive structures in a way that benefits openness, honesty, replication and accuracy.


Thank you for a good post. I saw a recommendation from DCScience to use the p-values in this way:

P > 0.05 very weak evidence

P = 0.05 weak evidence: worth another look

P = 0.01 moderate evidence for a real effect

P = 0.001 strong evidence for real effect

What do you think?

Your suggestion will handle two problems:

(1) remove, or at the very least lessen, the dichotomous nature of NHST.

(2) remove the emphasis on 0.05 as a standard of evidence.

However, your suggestion is too little, too late. It will not handle any of the following major problems:

(1) P value interpretations ignore the scientific context: A p value of 0.001 is not strong evidence for a real effect if you are testing homeopathy for cancer or copper bracelets for type 1 diabetes. Basically, you can only talk about P(effect) if you use a prior probability.

(2) P value interpretations ignore effect size. A p value of 0.001 can be completely irrelevant if the observed effect size is so small as to be negligible.

(3) P value interpretations tell us nothing about interval estimates. They still make our focus far too narrow and ignore the benefits of confidence intervals.

(4) For the typical study in medicine, psychology, biology etc. the statistical power is so low that p values are not stable across replications. Thus, any conclusions that are based on this narrow focus on p values will not be robust. Some methods papers even suggest that a statistical power of 0.8 still has this problematic feature and that if you have statistical power at 0.9, inferential statistics is almost useless since almost any change, no matter how small, will appear as statistically significant.
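The instability described in point (4) can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: normally distributed outcomes with a known standard deviation of 1, a z-test instead of a t-test, and a made-up scenario (a standardized effect of 0.5 with 32 participants per group, which gives a power of roughly 0.5 at the 0.05 threshold). The function name `simulate_p_values` is hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def simulate_p_values(true_effect, n_per_group, replications, seed=0):
    """Two-sided z-test p values from repeated simulated two-group studies.

    Assumes normal outcomes with a known standard deviation of 1 in both
    groups (a simplification for illustration).
    """
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n_per_group)  # standard error of the mean difference
    p_values = []
    for _ in range(replications):
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        z = (mean(treated) - mean(control)) / se
        p_values.append(2 * (1 - NormalDist().cdf(abs(z))))
    return p_values


# Assumed scenario: the same true effect, studied 20 times with identical design
p_values = simulate_p_values(true_effect=0.5, n_per_group=32, replications=20)
print(" ".join(f"{p:.3f}" for p in p_values))
share = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"smallest p = {min(p_values):.4f}, largest p = {max(p_values):.3f}, "
      f"share below 0.05 = {share:.2f}")
```

Every simulated study here examines exactly the same true effect with exactly the same design, so any spread in the printed p values comes from sampling variation alone.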

From my perspective, we need to drop p values by the wayside and move on to do actual, meaningful statistical analysis: effect sizes, confidence intervals where possible, interpretation of the results in the scientific context, replication and meta-analysis. The time is long since over for the practice of staring blindly at a single p value and thinking this tells us something important. It does not, and never has.