In a previous post, the many insurmountable flaws and problems of null hypothesis statistical significance testing (NHST) were discussed: p values are only indirectly related to the posterior probability, almost all null hypotheses are false and irrelevant, NHST contributes to black-and-white thinking about research results, p values depend strongly on sample size, and p values are unstable with regard to replication. For most realistic research designs, it is essentially a form of Russian roulette. After a mediocre effort, mathematical statistician Olle Häggström failed to defend p values and NHST from this onslaught. Now, he has decided to rejoin the fray with yet another defense of NHST, this time targeting the dance of the p values argument made by Geoff Cumming. Does his rebuttal hold water?
Arguing from rare exceptions does not invalidate a general conclusion
Häggström seems to be under the impression that if he can find rare and complicated counterexamples, he can undermine the entire case for confidence intervals [being generally superior to p values, see clarification here]. (all translations are my own):
To calculate a confidence interval is akin to calculating p values for all possible parameter values simultaneously, and in more complex contexts (especially when more than one unknown parameter exists) this is often mathematically impossible and/or leads to considerably more complicated and difficult-to-interpret confidence regions than the nice intervals that are obtained in the video.
This is perhaps due to his background in mathematics, where a single counterexample really does disprove a general claim. For instance, the function f(x) = |x| is continuous everywhere but not differentiable at x = 0, thus disproving the claim that continuity implies differentiability. In the case of confidence intervals, on the other hand, the fact that they work in cases with a single parameter is enough to justify their usage in those cases. Keeping in mind that the vast majority of experiments done in e.g. medicine are probably not complicated estimations of multiple population parameters, but more akin to measuring the effects of a medication compared with placebo, the superiority of confidence intervals over p values for a large portion of experiments stands. Yes, obviously we need more sophisticated statistical tools in more complicated experiments, but that is not a valid argument against confidence intervals in the settings where they can be calculated and where they do work.
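The quoted description of a confidence interval as "p values for all possible parameter values" can be made concrete in the simple single-parameter case. The following is only a minimal sketch under assumptions of my own: a normally distributed outcome with known standard deviation, a z-test, and made-up numbers (`sample_mean=0.5`, `sigma=1.0`, `n=25`). It scans candidate values for the population mean, keeps those that the test does not reject at the 5% level, and compares the result with the usual closed-form interval.

```python
import math
from statistics import NormalDist


def two_sided_p(sample_mean, mu0, sigma, n):
    """Two-sided z-test p value for H0: population mean == mu0 (sigma assumed known)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def ci_by_test_inversion(sample_mean, sigma, n, alpha=0.05, step=0.001):
    """Scan a grid of candidate means and keep those NOT rejected at level alpha."""
    half_range = 10 * sigma / math.sqrt(n)  # grid spans +/- 10 standard errors
    k_max = int(half_range / step)
    kept = []
    for k in range(-k_max, k_max + 1):
        candidate = sample_mean + k * step
        if two_sided_p(sample_mean, candidate, sigma, n) > alpha:
            kept.append(candidate)
    return min(kept), max(kept)


def ci_closed_form(sample_mean, sigma, n, alpha=0.05):
    """Textbook interval: sample mean +/- z_crit * sigma / sqrt(n)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_crit * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Hypothetical numbers, chosen only for illustration.
grid_ci = ci_by_test_inversion(sample_mean=0.5, sigma=1.0, n=25)
exact_ci = ci_closed_form(sample_mean=0.5, sigma=1.0, n=25)
print("grid-based interval: ", grid_ci)
print("closed-form interval:", exact_ci)
```

In this setting the two intervals agree up to the grid resolution, which is the sense in which the single-parameter case is unproblematic.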
Finally, Häggström continues to refuse to accept that confidence intervals can be dislodged from the framework of NHST. As was explained in the previous post, and many times earlier on this website, confidence intervals are more than the canonical frequentist definition of “if you take a very large number of samples, 95% of confidence intervals calculated will include the population parameter” or “just another way of making a statistical significance test”:
However, confidence intervals are so much more. Cumming (2012) lists four other ways to use confidence intervals:
– A 95% confidence interval is a range of plausible values for the population parameter.
– A 95% confidence interval is a margin-of-error that gives a likely maximum error, even though it is of course possible to get larger errors.
– The values closer to the mean are relatively more plausible than the values nearer the ends of the confidence interval.
– A 95% confidence interval is, on average, an 83% prediction interval.
When Cumming says “range of plausible values”, he does not mean “probable values”, as a single confidence interval either includes or does not include the population parameter.
Häggström has never addressed these arguments. Perhaps he should read Cumming (2012).
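The 83% figure in the list above can be checked with a small simulation. This is a rough sketch and not Cumming's own procedure: I assume a normal population with known standard deviation, hypothetical values `n=32`, `mu=0`, `sigma=1`, and count how often the mean of an exact replication falls inside the 95% confidence interval from the original study. Under these assumptions the theoretical capture rate is 2Φ(1.96/√2) − 1, which is about 0.83.

```python
import math
import random
from statistics import NormalDist, fmean


def capture_rate(n=32, mu=0.0, sigma=1.0, trials=5000, seed=1):
    """Share of replication means that land inside the original study's 95% CI."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)
    half_width = z_crit * sigma / math.sqrt(n)  # sigma assumed known
    captured = 0
    for _ in range(trials):
        original_mean = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        replication_mean = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if abs(replication_mean - original_mean) <= half_width:
            captured += 1
    return captured / trials


# The difference of two independent sample means has standard deviation
# sqrt(2) * sigma / sqrt(n), hence the division by sqrt(2).
theoretical = 2 * NormalDist().cdf(NormalDist().inv_cdf(0.975) / math.sqrt(2)) - 1

simulated = capture_rate()
print(f"simulated capture rate:   {simulated:.3f}")
print(f"theoretical capture rate: {theoretical:.3f}")
```

The capture rate is below 95% because the replication mean varies around the population mean, not around the original sample mean.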
It does not matter if NHST “works in theory”
Häggström thinks that the dance of the p values is not a problem with NHST, just a problem with low sample sizes:
The dance of the p values that Cumming demonstrates in the video is the consequence of the combination of a modest effect size and a small sample size. If it is true, as Cumming says, that the effect sizes and sample sizes of the simulation are typical for empirical studies in psychology, then I think the most important lesson of his example is not about there being anything wrong with the p value concept as such. Rather, the lesson is this: psychologists need to be more stringent in their so-called power analyses, which means that they need to ensure that their sample sizes are large enough to detect reasonable effect sizes with reasonable reliability.
Häggström is completely right that empirical research in psychology needs larger sample sizes, but not for the reason he thinks. He thinks that larger sample sizes are needed to reduce the Russian roulette-like behavior of NHST. This is misguided, because NHST is a nonsensical ritual. It starts by assuming that a probably false and scientifically irrelevant null hypothesis is true. This null hypothesis is typically taken to be that a population parameter is equal to zero or that there is no difference between two population parameters. Few researchers actually believe that this null hypothesis is true, and most of them are probably more interested in testing their experimental hypothesis, which is almost never the complement of the null hypothesis. The second step is to calculate the probability of data you have never observed given this hypothesis. The last step is to make a decision about whether to reject or not reject the null hypothesis, even though you are not justified in doing so because NHST is not a test of the credibility of the null hypothesis. Even more bizarrely, some people (including Häggström) think that a failure to reject the null hypothesis means that you can accept it, apparently selectively oblivious (going so far as to call it semantical hairsplitting) to the fact that the failure to reject the null hypothesis could be due to low statistical power and not because it is true. Finally, any difference, however small, will be detected as statistically significant given a high enough sample size, so merely increasing the sample size does not alleviate the problems with NHST.
The real reason we need higher sample sizes is that it provides a better estimate of the population parameter and increases precision. No need to talk about p values.
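Both points about sample size can be illustrated with a few lines of arithmetic. The sketch below rests on assumptions of my own: two independent groups, known standard deviation `sigma=1.0`, a z-test, and a practically negligible observed difference of `0.01` that stays the same as the sample grows. It reports the p value and the half-width of the 95% confidence interval for three hypothetical sample sizes.

```python
import math
from statistics import NormalDist


def p_and_half_width(effect, sigma, n_per_group):
    """p value and 95% CI half-width when the observed difference equals `effect`."""
    se = sigma * math.sqrt(2.0 / n_per_group)  # standard error of the difference
    z = effect / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(0.975) * se
    return p, half_width


results = {}
for n in (100, 10_000, 1_000_000):
    p, half_width = p_and_half_width(effect=0.01, sigma=1.0, n_per_group=n)
    results[n] = (p, half_width)
    print(f"n per group = {n:>9}: p = {p:.3g}, 95% CI = 0.01 +/- {half_width:.4f}")
```

The same tiny difference moves from clearly non-significant to highly significant purely because the sample grows, while the interval simply narrows in proportion to one over the square root of the sample size and keeps showing how small the effect is.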
Häggström does not understand simulations
Häggström thinks that it is a sweeping generalization for Cumming to state that “[f]or a typical experiment, p tells you virtually nothing”. Not so. His simulations show that, for typical experiments, the p value obtained can be seen as randomly picked from a very wide distribution. Exact replications can, with a high probability, produce a p value that most consider to be unacceptably large (such as p = 0.54) or a p value that most consider fantastic (p < 0.001). So Cumming is right: for such experiments, the p value does not tell you anything useful.
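A dance of the p values is easy to reproduce. The following is a minimal sketch and not Cumming's actual simulation code: I assume two independent groups of `n_per_group=32`, a true standardized effect of `0.5`, normally distributed outcomes with known standard deviation, and a two-sided z-test. These values are my own choices, picked to give statistical power of roughly 50%. Every replication draws fresh data from exactly the same populations.

```python
import math
import random
from statistics import fmean


def simulate_p_values(n_per_group=32, effect=0.5, sigma=1.0,
                      replications=2000, seed=2016):
    """Repeat the same two-group experiment and collect the two-sided p values."""
    rng = random.Random(seed)
    se = sigma * math.sqrt(2.0 / n_per_group)  # sigma assumed known (z-test)
    p_values = []
    for _ in range(replications):
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, sigma) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        # erfc(|z| / sqrt(2)) equals the two-sided normal tail probability.
        p_values.append(math.erfc(abs(z) / math.sqrt(2.0)))
    return p_values


def share(p_values, predicate):
    """Fraction of p values satisfying the predicate."""
    return sum(1 for p in p_values if predicate(p)) / len(p_values)


p_values = simulate_p_values()
print(f"p < 0.001: {share(p_values, lambda p: p < 0.001):.1%}")
print(f"p < 0.05:  {share(p_values, lambda p: p < 0.05):.1%}")
print(f"p > 0.1:   {share(p_values, lambda p: p > 0.1):.1%}")
print(f"smallest p = {min(p_values):.2g}, largest p = {max(p_values):.2g}")
```

Nothing about the experiment changes between replications, yet the verdict of the significance test does.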
Häggström does not understand replication in science
Faced with the undeniable fact that NHST is kind of like Russian roulette for most realistic research designs, how does Häggström respond? By claiming that this implies that the data itself is not reproducible!
To speak of replicability of p values is so inept that I facepalm. A p value is not a parameter of the unknown distribution that the researcher is trying to estimate, but a measurement of to what extent the obtained data can be said to speak against the so-called null hypothesis. To criticize the p value concept for lacking in replicability is like refusing to realize that data are uncertain. Face it: a new experiment entails new data, and a new p value. The person who accepts the logic in rejecting p values on this basis might as well reject the data collection itself: data is different each time, and as such not replicable!
Häggström goes on to state that this is an unreasonable conclusion, and considers this a reductio ad absurdum argument. Not so fast! The dance of the p values shows that the range of possible p values that can be gotten from repeating the exact same experiment spans roughly three orders of magnitude. Remember, about 10% of obtained p values were less than 0.001 and almost 40% of p values were above 0.1 (several were close to 1). How many effect size estimations differ by three orders of magnitude from replication to replication? Hardly any, at least not when it comes to methods that are not wildly unreliable. Furthermore, confidence intervals give a much better idea about replication than p values. Thus, Häggström makes an obviously false analogy.
So why do we care about p values being reproducible? That is because decisions about whether or not the null hypothesis is suspicious (and by extension, whether or not researchers should continue to entertain alternative hypotheses or move on) are made based on p values. If p values vary enormously simply by chance, then so do those decisions. Clearly, science should not be Russian roulette. That is the reason we care.
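The contrast between how much p values and effect size estimates move around can be put into numbers. This is a rough sketch under the same kind of assumptions as before, all of them mine: 32 participants per group, a true effect of 0.5, known standard deviation, and a z-test. As a shortcut, each replication draws the observed mean difference directly from its sampling distribution instead of simulating raw data. The spread is measured as the ratio between the 90th and 10th percentile, expressed in orders of magnitude.

```python
import math
import random
from statistics import quantiles


def replicate(n_per_group=32, effect=0.5, sigma=1.0, replications=5000, seed=7):
    """Return observed effect estimates and their two-sided z-test p values."""
    rng = random.Random(seed)
    se = sigma * math.sqrt(2.0 / n_per_group)
    # Shortcut: the difference of two sample means is normal(effect, se).
    estimates = [rng.gauss(effect, se) for _ in range(replications)]
    p_values = [math.erfc(abs(d) / se / math.sqrt(2.0)) for d in estimates]
    return estimates, p_values


def orders_of_magnitude(values):
    """log10 of the ratio between the 90th and the 10th percentile.

    Only meaningful when the 10th percentile is positive, which holds for
    the parameters assumed here.
    """
    deciles = quantiles(values, n=10)
    return math.log10(deciles[-1] / deciles[0])


estimates, p_values = replicate()
effect_spread = orders_of_magnitude(estimates)
p_spread = orders_of_magnitude(p_values)
print(f"effect estimates span {effect_spread:.2f} orders of magnitude")
print(f"p values span         {p_spread:.2f} orders of magnitude")
```

Both quantities are computed from exactly the same simulated replications, so any difference in spread comes from the p value transformation and not from the data being "more uncertain" in one case.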
Häggström does not, and cannot, produce “the baby”
There are a lot of critical things to say about how p values and statistical significance are used and interpreted in practice in many areas. The most important thing, however, is to not throw out the baby with the bath water. It is the misunderstandings and faulty usage of p values and statistical significance that should be fought, not the concepts themselves, which often offer very important statistical tools.
As was explained in an even earlier article, “[a] method like NHST that has such a strong potential for misunderstandings and abuse even among a large proportion of the most highly intelligent and highly educated has to accept a large proportion of the blame.” Häggström has never been able to refute this point.
Finally, where is this alleged “baby”? Show us where p values are “very important statistical tools”. Show us what additional insights a p value can give us if we can use confidence intervals. Show us the insights that can only be gotten from a p value and where all other methods invariably fail. Show us that using p values has more practical benefits than drawbacks. Show us that it is worth the risk.