New Nature Methods Paper Argues that P Values Should be Discarded

Fickle P Values

In the wake of the recent discussions about null hypothesis statistical significance testing and p values on this website, Häggström has decided not to respond beyond calling the latest installment in the series nothing more than a “self-parody”. No substantial statistical or scientific arguments were presented. Despite his unilateral surrender, it can be informative to examine a methods paper entitled “The fickle P value generates irreproducible results”, written by Halsey, Curran-Everett, Vowler and Drummond (2015) and just published in the renowned journal Nature Methods, which slams the usage of p values. The authors even call for a wholesale rejection of p values, writing that “the P value’s preeminence is unjustified” and encouraging researchers to “discard the P value and use alternative statistical measures for data interpretation”.

As expected, it corroborates and confirms a large number of arguments I presented in this exchange and directly contradicts many of the flawed assertions made by Häggström. In fact, the paper goes even further than I have done, arguing (1) that p values are unstable under direct replication even at levels of statistical power that are generally considered acceptable (e.g. 0.8), (2) that p values are probably superfluous for analyses with adequate statistical power and (3) that previous research that relied on p values needs to be reexamined and replicated with proper methods for statistical analysis.

I challenge Häggström to submit a response to that paper, in which he calls the authors “disinformants” and “silly fools” while mischaracterizing them, quoting them out of context and providing little to no substantive arguments whatsoever. Let us see if his assertions and engagement in personalities pass scientific peer-review. I will not hold my breath.

The instability of p values under exact replications is unacceptably large, even with large power

As was argued at length in the previous posts on the subject, a p value varies widely under exact replications with realistic research designs. Since decisions about the suspiciousness of the null hypothesis are taken based on the obtained p value, the fact that p values are like Russian roulette means that the corresponding decisions are like Russian roulette as well. The authors go through similar simulation results, which show that p values are unstable under exact replications for sample sizes commonly used in biology. They go even further than that, arguing that this instability is unacceptably large even at levels of statistical power that many researchers and statisticians deem acceptable:

Unfortunately, even when statistical power is close to 90%, a P value cannot be considered to be stable; the P value would vary markedly each time if a study were replicated. In this sense, P is unreliable. As an example, if a study obtains P = 0.03, there is a 90% chance that a replicate study would return a P value somewhere between the wide range of 0–0.6 (90% prediction intervals), whereas the chances of P < 0.05 is just 56%. In other words, the spread of possible P values from replicate experiments may be considerable and will usually range widely across the typical threshold for significance of 0.05. This may surprise many who believe that a test with 80% power is robust; however, this view comes from the accepted risk of a false negative

This is independent support for the notion that p values are fundamentally unstable under exact replications of realistic research designs and sample sizes. Not only that, the paper argues that this is still true even for relatively high power, such as 0.8. This completely collapses the objection that NHST works given sufficient power and that the only problem is the misuse.
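The kind of simulation behind this claim can be sketched in a few lines. This is a minimal illustration and not the authors' own code: it assumes a two-group comparison analysed with a z-test with known standard deviation, and the sample size (16 per group), true difference (one standard deviation) and number of replications are hypothetical values chosen to give roughly 80% power.

```python
import math
import random
import statistics


def two_sided_p(z):
    """Two-sided p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_replications(n_per_group=16, true_diff=1.0, sd=1.0,
                          n_reps=10_000, seed=1):
    """Return the p values from exact replications of one two-group study.

    Assumption: the standard deviation is treated as known (z-test),
    which keeps the sketch free of external libraries.
    """
    rng = random.Random(seed)
    se = sd * math.sqrt(2 / n_per_group)  # standard error of the difference
    p_values = []
    for _ in range(n_reps):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(true_diff, sd) for _ in range(n_per_group)]
        z = (statistics.fmean(treated) - statistics.fmean(control)) / se
        p_values.append(two_sided_p(z))
    return p_values


p_values = sorted(simulate_replications())
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
p10 = p_values[int(0.10 * len(p_values))]  # 10th percentile of p
p90 = p_values[int(0.90 * len(p_values))]  # 90th percentile of p

print(f"share of replications with p < 0.05: {share_significant:.2f}")
print(f"10th to 90th percentile of p: {p10:.5f} to {p90:.3f}")
```

Every replication here samples from exactly the same populations, so any spread in the printed percentiles is due to sampling variation alone.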

The authors even go on to provide the following coup de grâce:

For example, regardless of the statistical power of an experiment, if a single replicate returns a P value of 0.05, there is an 80% chance that a repeat experiment would return a P value between 0 and 0.44 (and a 20% chance that P would be even larger). Thus, and as the simulation in Figure 4 clearly shows, even with a highly powered study, we are wrong to claim that the P value reliably shows the degree of evidence against the null hypothesis. Only when the statistical power is at least 90% is a repeat experiment likely to return a similar P value, such that interpretation of P for a single experiment is reliable. In such cases, the effect is so clear that statistical inference is probably not necessary

In sum, p values are only sufficiently stable under exact replications if the power is exceedingly high (>90%), but at that point the inferential worth of a p value is low to non-existent since the difference would be so obvious even without inferential statistics. In one fell swoop, the researchers disproved two common defenses of NHST. It is not just a problem of misuse or low statistical power, and p values are not required at large values for statistical power.

Alternatives to p values include effect size and confidence intervals

What alternatives are suggested to replace p values? I have argued that it should be effect sizes, confidence intervals, replication, meta-analysis and interpretation of effect sizes and confidence intervals in the scientific context. Again, the researchers agree:

This approach to statistical interpretation emphasizes the importance and precision of the estimated effect size, which answers the most frequent question that scientists ask: how big is the difference, or how strong is the relationship or association? In other words, although researchers may be conditioned to test null hypotheses (which are usually false), they really want to find not only the direction of an effect but also its size and the precision of that estimate, so that the importance and relevance of the effect can be judged

[…]

To aid interpretation of the effect size, researchers may be well advised to consider what effect size they would deem important in the context of their study before data analysis.

[…]

In addition, the effect size and 95% CIs allow findings from several experiments to be combined with meta-analysis to obtain more accurate effect-size estimates, which is often the goal of empirical studies.

As an added bonus, they corroborated my previous statement that null hypotheses are typically false to begin with.
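How effect sizes and 95% confidence intervals from several experiments can be combined is easy to sketch. The example below uses fixed-effect, inverse-variance pooling, which is one common choice and not necessarily the one the authors have in mind. The three studies are made-up numbers for illustration, and the standard errors are recovered from the interval widths under the assumption that the intervals are symmetric and based on the normal distribution.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96


def pool_fixed_effect(studies):
    """Inverse-variance pooling of (estimate, ci_low, ci_high) tuples.

    Assumption: each 95% CI is symmetric and normal-based, so its
    standard error is the interval width divided by 2 * Z95.
    """
    weights = []
    weighted_estimates = []
    for estimate, ci_low, ci_high in studies:
        se = (ci_high - ci_low) / (2 * Z95)
        weight = 1 / se ** 2  # more precise studies get more weight
        weights.append(weight)
        weighted_estimates.append(weight * estimate)
    pooled = sum(weighted_estimates) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled - Z95 * pooled_se, pooled + Z95 * pooled_se


# Hypothetical effect sizes with 95% CIs from three experiments
studies = [(0.50, 0.10, 0.90), (0.30, -0.20, 0.80), (0.45, 0.05, 0.85)]
pooled, low, high = pool_fixed_effect(studies)
print(f"pooled effect: {pooled:.2f} (95% CI {low:.2f} to {high:.2f})")
```

The pooled interval is narrower than any of the individual intervals, which is the sense in which combining experiments gives a more accurate effect-size estimate.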

Assurance and precision considerations can replace power analysis

Häggström defended NHST by claiming that NHST was required to reach well-supported conclusions about what sample sizes are appropriate. I retorted that it was possible to find appropriate sample sizes by considering a wide range of other factors, including precision and assurance. Precision refers to the expected width of the confidence interval, and assurance to how often an interval at least that narrow should be obtained. Yet again, the researchers agree with me:

In turn, power analysis can be replaced with ‘planning for precision’, which calculates the sample size required for estimating the effect size to reach a defined degree of precision.
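A rough sketch of what planning for precision can look like is given below. It is not taken from the paper. It assumes a comparison of two group means, a planning value for the standard deviation, and a normal approximation to the 95% confidence interval; the target half-width of 0.5 standard deviations is a hypothetical choice. The first function finds the per-group sample size at which the expected half-width meets the target, and the second estimates the assurance by simulation, that is, how often the half-width actually meets the target once the standard deviation has to be estimated from the data.

```python
import math
import random
import statistics
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96


def n_for_half_width(sd, half_width):
    """Per-group n so that the 95% CI for a difference in two means has
    the target half-width, treating the planning value sd as known."""
    return math.ceil(2 * (Z95 * sd / half_width) ** 2)


def assurance(n, sd, half_width, n_sims=2000, seed=1):
    """Share of simulated studies whose CI half-width meets the target
    when sd is estimated from the data (normal approximation)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(0.0, sd) for _ in range(n)]
        b = [rng.gauss(0.0, sd) for _ in range(n)]
        pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
        if Z95 * math.sqrt(2 * pooled_var / n) <= half_width:
            hits += 1
    return hits / n_sims


n = n_for_half_width(sd=1.0, half_width=0.5)
print(f"per-group n for the target half-width: {n}")
print(f"assurance at that n: {assurance(n, 1.0, 0.5):.2f}")
print(f"assurance at n = 40: {assurance(40, 1.0, 0.5):.2f}")
```

Planning only for the expected width can leave the assurance well short of what a researcher would want, which is why the sample size is then increased until the simulated assurance reaches the desired level. No p value or null hypothesis enters the calculation at any point.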

Objections anticipated:

Let us return to some of the classic defenses that Häggström and other NHST statisticians have been deploying. Although I provided my own refutations of them, I discovered that they had been independently debunked in the third chapter, called “Eight common but false objections to the discontinuation of significance testing in the analysis of research data” (Schmidt and Hunter), of a 1997 book called “What if there were no significance tests”, which, unsurprisingly, reached the same conclusions that I had.

“The problem is merely the misuse of NHST generally and p values specifically. If used correctly, there would be little to no problems”

This is labeled as “objection 7” in the book mentioned above. The problem with this objection (according to Schmidt and Hunter) is that p values, even if they were interpreted correctly, are not a suitable measure for “advancing the development of cumulative scientific knowledge”, since they say nothing about effect size or precision. Furthermore, the problem of low statistical power would still remain. In contrast, the retort that I have previously used was to point out that p values are only indirectly related to the posterior probability and that it is possible for the alternative hypothesis to be even more unlikely. Thus, even if NHST is used without any misconceptions, it is not terribly useful. Although I did point out the problem with p values not telling researchers what they want to know, I did not use it in this context.

“Doing confidence intervals is just secretly doing NHST, but indirectly.”

I previously responded that this is not relevant, as confidence intervals can break the chains of NHST, since they can be used and interpreted outside the NHST framework. Schmidt and Hunter label this “objection 5” and argue that there would still be benefits of using confidence intervals even if researchers merely used them as another way of performing statistical significance testing, since a confidence interval will (1) give information about the precision and uncertainty of the study, (2) reduce wild overinterpretation of research results and (3) highlight the effect size as a point estimate. These benefits would not be seen if the researchers merely were to do a statistical significance test in Excel or SPSS. However, they continue by pointing out that researchers need not and should not use confidence intervals as statistical significance tests. This is because (i) confidence intervals existed and were used a long time before statistical significance tests and (ii) few NHST statisticians would argue that other confidence intervals, such as the 68% confidence interval (one standard error above and below the mean) or the 50% probable error confidence interval, should be or must be taken as statistical significance tests.

“This paper is not perfect, since it confuses false positive with type-I error and confuses p with alpha in one of the glossaries”

Sure, hardly any paper is absolutely perfect. Yes, the researchers wrongly use false positive and type-I error as synonyms. In reality, the type-I error rate assumes the global statistical null and is set by the researcher, whereas the false positive rate is predicated on the real-life mix of true and false statistical null hypotheses, which is not controlled by the researcher. This might be defended on the grounds that the authors used the term “false positive” in quotation marks and may therefore not have been referring to the term in the precise statistical sense. The researchers do define the p value correctly in the first glossary, but in the latter they define “P” as “achieved significance level”. However, the significance level is defined by alpha and not P, and these two are not the same. These issues, however, do not detract from the bigger issues.
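The difference between the two quantities can be made concrete with a small calculation. The sketch below takes one common reading of “false positive rate”, namely the share of statistically significant results that come from true null hypotheses. The function name and the input values (alpha of 0.05, power of 0.8 and three hypothetical shares of true nulls) are illustrative assumptions, not figures from the paper.

```python
def false_positive_share(alpha, power, share_true_nulls):
    """Share of significant results that come from true null hypotheses.

    alpha is set by the researcher; share_true_nulls is a property of
    the hypotheses being tested and is not under the researcher's control.
    """
    false_positives = alpha * share_true_nulls
    true_positives = power * (1 - share_true_nulls)
    return false_positives / (false_positives + true_positives)


# The type-I error rate stays fixed at 0.05 in every row; only the mix
# of true and false null hypotheses changes.
for share in (0.1, 0.5, 0.9):
    result = false_positive_share(alpha=0.05, power=0.8, share_true_nulls=share)
    print(f"share of true nulls = {share:.1f}: false positive share = {result:.3f}")
```

With alpha held constant, the share of significant results that are false positives still moves with the mix of hypotheses, which is why the two terms should not be treated as synonyms.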

Emil Karlsson

Debunker of pseudoscience.

13 thoughts on “New Nature Methods Paper Argues that P Values Should be Discarded”

  • April 1, 2015 at 23:53

    Quoting:

    If we knew with high probability that a significant P-value would be followed by an even more significant one, we would believe with high probability that, on average, future experiments would leave us more certain about the falsity of the null (than with this one P-value): This would be a disastrous property of any system of inference since it would undermine the value of future evidence by confusing actual with anticipated evidence. Thus this particular property of P-values is highly desirable.

    It seems to me that a necessary quality of any evidential system is that uncertain evidence must imply that future evidence might tend in the contrary direction. After all we know that when we have tomorrow’s evidence our overall amount of evidence will have increased. Bayesians must believe therefore that in future it is probable that they will believe something more certainly than they do now but they can’t know exactly what that something will be. Otherwise we would have paradoxes of this sort. ‘Today is Monday and I believe that it is quite probable that it will rain on Wednesday. However, if you come and see me tomorrow, which is Tuesday, I will be able to tell you with absolute certainty that it will rain on Wednesday’.

    Thus it seems to me obvious and not at all problematic that if a trial were just significant at the 5% level (P=0.047, for example) there would be quite a large chance that another trial of exactly the same sort would have a modest probability of being significant at the same level. Imagine 100 such trials, for example. If 50% were significant at the 5% level this would be overwhelming evidence against the null.

    Fuller discussion: http://errorstatistics.com/2012/05/10/excerpts-from-s-senns-letter-on-replication-p-values-and-evidence/

    I rest my case.

    • April 2, 2015 at 20:17

      Thus it seems to me obvious and not at all problematic that if a trial were just significant at the 5% level (P=0.047, for example) there would be quite a large chance that another trial of exactly the same sort would have a modest probability of being significant at the same level. Imagine 100 such trials, for example. If 50% were significant at the 5% level this would be overwhelming evidence against the null.

      If you do a single study, you get a single p value. However, the researcher does not know if this p value stems from the distribution where half of them are less than 0.05 or a distribution where they are spread all over the place with similar probabilities. This is the problem, especially if you add on the issue that decisions about the suspiciousness of the null are taken based on obtained p values.

      No researcher is going to do those 100 trials to find out which it belongs to. Even if he or she did, then inferential statistics would probably be superfluous since the difference if it existed would be obvious at such sample sizes.

      Finally, researchers are interested in the effect size, the precision and what it all means in the scientific context, not whether a trivially false and scientifically irrelevant nil null hypothesis is false (“sizeless science”).

  • April 2, 2015 at 12:34

    “P value. Two reasonable definitions are (i) the strength of evidence in the data against the null hypothesis and (ii) the long-run frequency of getting the same result or one more extreme if the null hypothesis is true.” http://www.ncbi.nlm.nih.gov/pubmed/25719825

    The above is just more spreading of confusion: “P-values quantify experimental evidence not by their numerical value, but through the likelihood functions that they index”
    http://arxiv.org/abs/1311.0081

    • April 2, 2015 at 20:11

      The first definition comes directly from R. A. Fisher and it is trivial to prove it with Bayes’ theorem. The smaller the P(>= D|H), the smaller the posterior probability. I have, however, criticized this definition for being far too vague, since testing very likely or very unlikely hypotheses will make the prior probability have a large influence. The trivial case is a study of the efficacy of magnetic wristbands for insulin-dependent diabetes. Even a small p value is not terribly impressive since the prior probability for the null hypothesis is close to 1.

      The second definition is just a version of the frequentist definition of P(>= D|H0).

    • April 2, 2015 at 21:04

      Emil, those interpretations may be valid but they still lead to confusion. Scientists do not actually care about the null hypothesis or long run frequencies unless they have been confused by NHST. They care about the probability their substantive research hypothesis is true, something the NHST p-value says zero about according to those definitions.

      This is confusing since the p-value does seem related somehow to effect size which is something the scientist cares about. If you read that arxiv paper, you will see it turns out that p-values are just short-hand for effect size estimates. There is code there so anyone can prove it to themselves. P-values can be useful summary statistics, but only if you do not try to interpret them in either of the ways suggested by Halsey et al (2015). Unfortunately those have been the definitions used by pretty much everyone since Fisher.

    • April 3, 2015 at 11:39

      If you think the standard frequentist definition of p value leads to confusion, then that is an argument against p values themselves, not the definition.

      The p value says nothing about the truth of the substantive research hypothesis, period. This is not related to which definition of p value you prefer, since definitions of a concept are strictly restricted by the concept itself.

      P values say nothing about effect size. A given p value can be obtained from a small or large effect size if the variance and/or sample size differs. An effect size gives you information about a p value (since effect size goes into the p value calculation), but once it has been combined with e. g. sample size and variance, that information is lost. You might be able to construct elaborate algorithms that take p values, variances and sample sizes to try to torture p values to say something about effect size, but then you might as well use effect size to begin with.

    • April 3, 2015 at 14:11

      Emil wrote: “P values say nothing about effect size.” This is true for the p-value *alone*, but not when sample size is also known. I am not sure if you are disagreeing with the paper or just did not read it.

      Do you agree or disagree with the following claim? “The P-value and sample size together correspond to a unique likelihood function.” http://arxiv.org/abs/1311.0081

    • April 3, 2015 at 20:42

      This is true for the p-value *alone*, but not when sample size is also known.

      If you want to get information about effect size, you might as well actually look at the effect size instead of the hassle of using the p value and the sample size. Furthermore, it is not actually true that p value + sample size alone tells you something about the effect size because you still have the variance in there confounding the relationship even if you substitute n/N with the actual sample size. A small p value could be because of a large effect size or a small variance even if sample size is known. Thus, it does not directly tell you anything about the effect size.

      Do you agree or disagree with the following claim? “The P-value and sample size together correspond to a unique likelihood function.”

      What you got left is effect size and variance. If you were interested in those quantities, you might as well look at them directly. I’m sorry, but I fail to see why such an approach would be more useful than looking at the actual quantities themselves. Obviously a huge effect size with a moderate variance and a tiny effect size with an even tinier variance are clearly very, very different things.

    • April 4, 2015 at 01:29

      “What you got left is effect size and variance.”

      You’ll have to actually read the paper, because it is clear you have not based on that comment. I found it very enlightening to read it because it was the first time I understood exactly why p-values are related to evidence, thus giving a veneer of usefulness to NHST.

      “I’m sorry, but I fail to see why such an approach would be more useful than looking at the actual quantities themselves.”

      Yes, this is true in most cases. I agree 100% that the usefulness of p-values is extremely limited (unless the null hypothesis is deduced from the substantive theory). However, this is just like how looking at the distribution is more informative than only looking at the mean +/- sd or confidence intervals. There are valid reasons for data reduction.

      If I have many comparisons (e.g. MRI voxels) with different sample sizes for each (missing data in some areas) it is useful to summarize the data for each voxel using two maps of 1) p-values and 2) sample size. The maps can then be converted to likelihood functions in the brain of the viewer if they understand what a p-value means.

      We could include the max amount of information by plotting panels of histograms of the data for each voxel… but that would be difficult to comprehend. We could plot mean difference at each voxel, but we want to weight voxel differences with larger N and lower variance moreso and show this in a single plot.

      Situations like that are when p-values can be useful. This is all discussed in the paper:

      “Full likelihood functions give a more complete picture of the evidential meaning of experimental results than do P-values, so they are a superior tool for viewing and interpreting those results. However, it is sensible to make a distinction between the processes of drawing conclusions from experiments and displaying the results. For the latter, it is probably unnecessary and undesirable for a likelihood function be included every time a P-value might otherwise be specified in research papers. To do so would often lead to clutter and would waste space because, given knowledge of sample size and test type, a single P-value corresponds to a single likelihood function and thus stands as an unambiguous index.”
      http://arxiv.org/abs/1311.0081

      However, at this point the confusion about how to interpret them (due to the other two interpretations that imply using them for “tests”) is so widespread that I would recommend avoiding them altogether.

    • April 4, 2015 at 14:18

      You’ll have to actually read the paper, because it is clear you have not based on that comment. I found it very enlightening to read it because it was the first time I understood exactly why p-values are related to evidence, thus giving a veneer of usefulness to NHST.

      You did not answer the argument. If you use p value plus sample size, this will merely be a function of effect size plus variance, not merely effect size. Thus, p value plus sample size tells you nothing about the effect size.

      If I have many comparisons (e.g. MRI voxels) with different sample sizes for each (missing data in some areas) it is useful to summarize the data for each voxel using two maps of 1) p-values and 2) sample size. The maps can then be converted to likelihood functions in the brain of the viewer if they understand what a p-value means.

      This is often very problematic since the sample sizes in neuroimaging studies are abysmally low and technical replicates are often treated as biological replicates. Even though such an approach can adjust for sample size and thus not face the problem of p values being confounded by sample size, severe problems will still remain, such as the inherent variability of p values under exact replication. Most of those likelihood functions will not be informative and those that appear informative may very well be due to effect size overestimation.

      This seems to be a very convoluted method to obtain very little.

      We could include the max amount of information by plotting panels of histograms of the data for each voxel… but that would be difficult to comprehend. We could plot mean difference at each voxel, but we want to weight voxel differences with larger N and lower variance moreso and show this in a single plot.

      Situations like that are when p-values can be useful.

      They would certainly be “useful” in the sense of being “convenient”. In large screen-like experiments we do not have the time, money or effort to do otherwise. But let us not confuse “convenient” with “useful for making accurate / replicable statistical inferences”.

      I’d be interested to know how well the method you describe above (or any p value focused method) would fare against something like “don’t consider comparisons / don’t do experiments with N < x, shave off the comparisons with absurdly high variances and then rank by effect size” in terms of effect size replicability.

    • April 4, 2015 at 20:39

      Just to make sure you understand my perspective. If I were tasked with developing an algorithm to summarize data in a way containing the absolute minimum possible amount of information >0 that simultaneously had the greatest possible potential for misleading people… it would be the nil-null NHST p-value.

      Still, I think it can be useful. Perhaps we need a concrete example. Are you familiar with R? If so I will write a script to demonstrate what I am thinking, then you can compare that method of displaying the data to the alternative you think would be superior.

  • April 2, 2015 at 14:35

    Although Häggström insisted that he has stopped responding to me, he does not seem to be able to let go. In a recent post he wrote on his blog about a paper on Bayesian estimates of climate sensitivity, he attempts to deliver the following “insult”: note the continued lack of substantive argument (my translation):

    It may surprise some readers that have taken note of my defense of frequentist concepts like p value and statistical significance that I work with Bayesian methods. I was recently called an “NHST-statistician” (where NHST stands for null hypothesis significance testing) by a dude who understands very little in the way of the theoretical basis or application of statistics, but has louder opinions, but that epithet is a misunderstanding: I have a pragmatic attitude to the different methods of statistics, and do not profess any particular ideology considering which methods can be used (other than that they should be correct), but realize that statistical problems are of many different kinds and therefore demand partially different statistical toolkits.

    In a stunning reversal, Häggström now admits, point-blank, that NHST is an ideology! This is in stark contrast to his earlier insinuation that NHST was a science.

    Defending the flawed method of NHST does make one into an NHST statistician even if one also thinks other methods could sometimes be useful. The term “NHST statistician” is a derogatory epithet for someone who defends NHST, regardless of whether that person is a pure frequentist.

    Having a pragmatic attitude about statistical methods would suggest that one would stop using bad methods that do not work, neither logically nor practically. Since Häggström continues to defend NHST, he cannot be considered a pragmatist.

    Finally, let us use Häggström’s flawed arguments against him. He thinks that confidence intervals used within the framework of New Statistics are still “NHST in disguise” (even though we do not calculate p values or do statistical significance testing), so we can reply that since Bayesian statistics uses P(>=D|H) or similar, doing Bayesian statistics must be a form of NHST in disguise! Even worse, posterior probabilities are just “transformed” p value-like quantities, so working with posterior probabilities is again a form of NHST! These are clearly either absurd, misleading or vacuous/irrelevant conclusions, but they are sufficiently analogous to the rhetoric deployed by Häggström to make this a reasonable reductio.

    I repeat my challenge to Häggström: I encourage him to submit a response to that paper calling the authors “disinformants” and “silly fools” while mischaracterizing them, quoting them out of context and providing little to no substantive arguments whatsoever. Let us see if his verbal outpourings are taken seriously. I doubt it.
