Debunking Misuse of Statistics

Häggström on NHST: Once More Unto the Breach

Häggström, round four

It appears that Häggström still refuses to address the major criticisms laid out against NHST. In the addition he wrote to his previous post, he continues to engage in personalities and develops his tendency to mischaracterize my position into a genuine art form. Contrary to Häggström, I actually do think that people can often be statistically or scientifically non-naive yet promote naive beliefs and positions. That is the very definition of selective skepticism, which we all know is widespread. By claiming that “accept” is a legitimate NHST synonym for “not reject”, Häggström inadvertently shows that NHST has to carry some of the responsibility for common misconceptions, such as confusing statistical non-significance with equivalence. I go into greater detail about why the popular R. A. Fisher quote, that statistical significance means either that the null hypothesis is false or that something unlikely has occurred, is itself false, using large sample sizes as a counterexample. Finally, I reiterate the many criticisms that Häggström has either failed to respond to, or “responded to” by making faulty straw man assertions about what my position was.

Häggström, despite correction, fails to distinguish between person and argument

His recent response comes with another spate of attempted insults and engagement in personalities. This time, he alleges that I am a “disinformant” and a “silly fool”. Not only that, he has now started complaining that my tone is “precocious” and “patronizing”. He even goes so far as to arbitrarily attribute emotions to me when he claims that I “angrily attack” NHST. Yet none of this constitutes a substantive argument. None of it implies that any of my arguments are mistaken and none of it implies that Häggström is correct.

I can think of nothing else to do but to quote what I wrote in the latest post: “The more my opponents dwell on my alleged personal traits or failings and make liberal use of invectives, the more they demonstrate that they (1) are unable to distinguish between an argument and the person making that argument, (2) have reduced capacity for emotional regulation and (3) tacitly admit that they do not have much in the way of substantive arguments against my position. Their behavior does not harm me in any way. In fact, I find it endlessly entertaining. All they are doing is harming their own capacity to accurately perceive reality.” Not that I think the message will get across. So by all means, I hope that Häggström continues to engage in personalities, since it is just a repeated demonstration of (1)-(3). He is doing all the work for me. Fantastic.

The difference between person and argument

Häggström seems astonished that I make a distinction between person and argument (my translation):

Emil Karlsson has now authored a long-winded response devoid of substance, where he devotes a lot of space to a variety of diversions on the same level as his claim that I am wrong when I claim that he has accused me of statistical naivety, since he has only maintained that my arguments are statistically naive (a distinction that, of course, can only be important if you think that it is completely normal for statistically non-naive thinkers to promote statistically naive arguments).

Exactly! Believe it or not, that is precisely my position. Actually, I would make it even more general than that: I think it is completely common for scientific thinkers (and thinkers in science-associated subjects) who are often non-naive to believe in and promote naive and absurd claims about science and science-associated subjects. In fact, I think this happens all the time.

Although best known to occur among Nobel Prize winners (called the Nobel disease), it has been observed among professors and other academics as well. It happens not just in topics completely different from their expertise, such as Linus Pauling’s promotion of megadose vitamin C against cancer or Pierre and Marie Curie’s support for an alleged psychic, but also in areas that are more closely aligned with their research fields. The examples are many and varied. Luc Montagnier co-discovered HIV, but promotes homeopathy and anti-vaccine activism. Kary Mullis created PCR, but promotes HIV/AIDS denialism, astrology and climate change denialism. Peter Duesberg co-discovered the first cancer-causing virus, is a professor of cell and molecular biology and was elected to the National Academy of Sciences, but is the father of the HIV/AIDS denialist movement and of quack models for the origin of cancer. Fred Singer is an atmospheric physicist and was a professor of environmental science, but is one of the major climate change denialists. Jerry Coyne is a professor of biology at the University of Chicago, but has uncritically promoted anti-psychiatry and race realism. Marcia Angell is a doctor and the first female editor of NEJM, but has also promoted anti-psychiatry. Many of the founders and early researchers in quantum physics, such as Erwin Schrödinger and Werner Heisenberg, promoted ideas about quantum mysticism, as did many later physicists such as Fritjof Capra. The list goes on and on and on and on.

There are, of course, even examples from method teachers who teach NHST to psychology students. In a study by Haller and Krauss (2002), 80% of so-called “methodology instructors” who are tasked with teaching NHST to students failed to correctly define the p value. Sure, the sample size is only N = 30 and it was carried out at six different German universities, so the results may not generalize. However, I am sure that these methodology instructors generally know what they are talking about in, say, research design (assuming universities do not hire people who are completely incompetent for the job), so they serve as another example of “often non-naive, sometimes naive” individuals.

“Accepting” the null again

I wrote:

Häggström responds by insisting that he really meant “accept” in a non-statistical sense and not at all the way the term is used in NHST and that he certainly does not believe that statistical non-significance implies the falsehood of the null hypothesis.

Häggström turns mischaracterization into an art form when he writes that (my translation):

When I wrote that Emil Karlsson “confuses the formal usage of the term ‘accept’ within NHST theory with the everyday usage of the word”, I meant that it was I who used the word in the formal NHST sense (namely that the data is stated to be fully normal given what one could expect [Häggström misspells the word “vänta” as “vänstra”, meaning infidelity – E.K] under the null hypothesis), and that it was he who misinterpreted it as indicating one of the word’s more common meanings (namely that the null hypothesis is established as true). The quoted sentence shows that Emil Karlsson has understood the situation the other way around, and that he imagines that the established formal NHST terminology is the meaning he himself ascribes to the word “accept” (namely, as already stated, that the null hypothesis is established as true). This delusion from Emil Karlsson shows that he does not even understand the most basic terminology within the NHST theory he so angrily attacks.

Not even remotely close to my position. My position is that the correct NHST terms are, or rather should be, “reject” and “not reject” and that the term “accept” should not be used. When Häggström wrote that (my translation):

Yes, hairsplitting is precisely what it is. Semantic hairsplitting. Trust me: we statisticians use the word ‘accept’ as a synonym for ‘not reject’ without embarrassment.

I interpreted this as indicating that he had no problem conflating the terms “not reject” and “accept” and was therefore dangerously close to the false statement that statistical non-significance implies the truth of the null hypothesis. He does correctly reject this fallacy elsewhere, but slips into bad practice here, which is why I previously wrote “selectively oblivious”.

After Häggström had explained that he really did not mean to promote that fallacy in this quote, I reached the conclusion that he must have been using the term “accept” in a sloppy and informal manner (not really meaning “accepting” in the everyday sense of the word), in much the same way that biologists sometimes use the phrase “a gene for phenotype X”.

No serious biologist thinks that there is a one-to-one relationship between a single gene and a single phenotypic trait and that no environmental processes can affect that relationship. All genes probably have multiple effects, all phenotypic traits are probably affected by multiple genes and environment plays a role on all levels. What they mean is something like “with all other factors held constant, a gene that increases the probability of developing phenotype X, but does not fully determine it”. The reason this is sloppy and really a bad way of talking about biology is that it seemingly implies an extremely naive view of genetics and development based on genetic determinism and the genome as a blueprint. I much prefer phrases such as “a genetic risk factor for X” or “a genetic influence on X” as they do not have these faulty connotations. I did not fully realize that, apparently, “accept” as a synonym of “not reject” is established NHST terminology! This is deeply problematic, as it is fertile ground for misunderstanding non-significance as equivalence, which is precisely one of the most common misunderstandings of NHST. Thus, this realization is further evidence that NHST has to accept a lot of the blame for the misunderstandings that exist.
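To make the gap between “not reject” and “equivalent” concrete, here is a minimal Python sketch. It assumes a two-group comparison with equal group sizes, a known common standard deviation of 1 and a normal approximation. The function name `z_test_summary` and all the numbers are made up for illustration and are not taken from any study discussed in this exchange.

```python
from math import sqrt
from statistics import NormalDist


def z_test_summary(mean_diff, n_per_group, sd=1.0, level=0.95):
    """Two-sided z test and confidence interval for a difference in means.

    Illustrative assumptions: two independent groups of equal size and a
    known common standard deviation `sd`.
    """
    std_normal = NormalDist()
    se = sd * sqrt(2.0 / n_per_group)          # standard error of the difference
    z = mean_diff / se
    p = 2.0 * (1.0 - std_normal.cdf(abs(z)))   # two-sided p value
    crit = std_normal.inv_cdf(0.5 + level / 2.0)
    return p, (mean_diff - crit * se, mean_diff + crit * se)


# Same observed difference (half a standard deviation), two sample sizes.
p_small, ci_small = z_test_summary(0.5, n_per_group=10)
p_large, ci_large = z_test_summary(0.5, n_per_group=100)

print(f"n = 10 per group:  p = {p_small:.4f}, "
      f"95% CI = ({ci_small[0]:.2f}, {ci_small[1]:.2f})")
print(f"n = 100 per group: p = {p_large:.4f}, "
      f"95% CI = ({ci_large[0]:.2f}, {ci_large[1]:.2f})")
```

With n = 10 per group the result is not statistically significant, yet the interval runs from a small negative difference to well over one standard deviation. The data are compatible with a large effect, so “accepting” the null here would be a mistake.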

Large sample sizes, again

I wrote:

Häggström implicitly quoted R. A. Fisher when claiming that statistical significance either means that the null hypothesis is false or that something unlikely has happened. This is false, as large sample sizes can yield statistical significance for even minute differences that could be observed with high probability even if the null hypothesis was true.

Häggström replied (my translation):

Wrong, wrong, wrong. Emil Karlsson has apparently not understood what statistical significance means.

Hardly an impressive rebuttal. To be honest, I am amazed that Häggström does not seem to be aware that my argument was that statistical significance can be obtained from overpowered studies with very large sample sizes even though the null hypothesis is essentially true and nothing unlikely has occurred.

To drive the point home, let us look at an example provided by Ellis (2010, p. 53):

Box 3.3 Overpowered statistical tests

Researchers sometimes compare groups to see whether there are meaningful differences between them and, if so, to assess the statistical significance of these differences. The statistical significance of any observed difference will be affected by the power of the statistical test. As statistical power increases, the cut-offs for statistical significance fall. Taken to an extreme this can lead to the bizarre situation where two essentially identical groups are found to be statistically different. Field and Wright (2006) provide the following SPSS-generated results showing how this situation might arise:

t        df       Sig. (2-tailed)   Mean difference
−2.296   999998   .022              .00

The number in the last column tells us that the difference between two groups on a particular outcome is zero, yet this “difference” is statistically significant at the p < .05 level. How is it possible that two identical groups can be statistically different? In this case, the actual difference between the two groups was not zero but −.0046, which SPSS rounded up to .00. Most would agree that −.0046 is not a meaningful difference; the groups are essentially the same. Yet this microscopic difference was judged to be statistically significant because the test was based on a massive sample of a million data-points. This demonstrates one of the dangers of running overpowered tests. A researcher who is more sensitive to the p value than the effect size might wrongly conclude that the statistically significant result indicates a meaningful difference.

Häggström might retort that the statistical null hypothesis is not true in this scenario since the observed effect size was −.0046. But this need not be the population effect size, as the measurement method may introduce a small but systematic error. If we assume that a very large sample size often produces an observed effect size very close to the population effect size in the absence of small systematic errors, Häggström is denied the possibility of this result being a low probability event. As a last-ditch effort, Häggström might argue that the sample size here is enormous and thus an unlikely occurrence. While the sample size is indeed very large, the argument is valid for large-but-not-enormous sample sizes as well, albeit with a bigger (but still quite small) difference than |−.0046|. Depending on the specific circumstance, even an observed effect size that is larger by three orders of magnitude may still be sufficiently small to consider the statistical null hypothesis as true. That would mean that this argument applies more broadly.
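The numbers in Box 3.3 can be checked with a short Python sketch. The t value, the degrees of freedom and the unrounded mean difference come from the quoted box. Everything else is my assumption for illustration: that the two groups are of equal size (500,000 each, so that df = 2n − 2 = 999998) and that the normal distribution is an adequate stand-in for the t distribution at this many degrees of freedom.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(t_stat):
    """Two-sided p value using the normal approximation to the t distribution
    (reasonable here because the degrees of freedom are close to a million)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


t_stat = -2.296          # from the SPSS output quoted in Box 3.3
mean_diff = -0.0046      # unrounded mean difference reported in Box 3.3
n_per_group = 500_000    # ASSUMPTION: equal groups, since df = 999998 = 2n - 2

p = two_sided_p(t_stat)

# Standard error implied by t = mean_diff / se, and the pooled SD implied by
# se = sd * sqrt(2 / n) under the equal-group assumption.
se = mean_diff / t_stat
implied_sd = se / sqrt(2.0 / n_per_group)
standardized_diff = mean_diff / implied_sd

print(f"two-sided p value:       {p:.3f}")
print(f"implied pooled SD:       {implied_sd:.3f}")
print(f"standardized difference: {standardized_diff:.4f}")
```

Under these assumptions the p value reproduces the .022 in the table, while the standardized difference is less than half a percent of a standard deviation. The test is significant only because the standard error has been driven down by the sample size.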

Finally, let us remind ourselves about the criticisms that Häggström has failed to respond to, or attempted to respond to by butchering and falsely characterizing my position.

Bonus round: criticism that Häggström has failed to provide a satisfactory response to

The modus operandi of Häggström appears to be to ignore all the cases where he was proven wrong and increasingly focus on smaller and smaller issues where he thinks he can see a chink in the armor. In an effort to expose how Häggström uses this method, I have compiled a (partial) list of arguments that Häggström either never responded to, or provided a deeply unsatisfactory response to.

—> The instability of p values under exact replication: for most realistic research designs, the range over which the p value varies is enormous, so the p value is unstable under exact replications. Since decisions are made based on obtained p values, those decisions are also unstable. This means that p values are not useful for most realistic research designs. Häggström made a straw man argument claiming that this argument treated p values as population parameters to be estimated (and was therefore ridiculous) and that anyone who rejected p values because of this should also reject the data itself. I reiterated that the instability of p values under exact replications is a problem because decisions are based on p values and if p values were like Russian roulette, then so were those decisions. I dispatched the claim that it would imply the rejection of data itself, since the p value varies across three orders of magnitude, whereas hardly any effect size does that under exact replications unless the instruments are terribly imprecise.

—> The logical invalidity of NHST: NHST proponents frequently claim that statistical significance means that the null hypothesis is suspicious. However, this is a logical error since it is possible for the alternative hypotheses to be even more unlikely. Trivial cases involve alleged psychic powers or homeopathy, but the same point readily applies in ordinary circumstances as well.

—> The indirect relationship between p value and posterior probability: A p value is not a good measure of the strength of the evidence against the null hypothesis. This is because it has to be weighed against the prior probability of the null hypothesis, which can be radically different. A low p value for an a priori likely null hypothesis does not entail a low posterior probability, and a relatively higher p value for an a priori unlikely null hypothesis does not entail a high one. Thus, even the name of NHST is misleading: NHST is not really a test of null hypotheses. Häggström retorted that this amounted to the claim that p values are bunk because they are not posterior probabilities, which missed the point. The point was that p values are not a good guide to whether or not the null hypothesis is suspicious.

—> NHST does not tell us what we want to know: a p value tells us nothing about the effect size, the precision with which the effect size is estimated, whether it can be replicated or what it means in the scientific context. Correctly interpreted, p values tell us very little for most realistic research designs. Yet this is not how NHST is used or advocated for. There it is said to play a considerably larger role. In fact, there are common textbooks in statistics that focus almost exclusively on NHST and statistical significance.

—> The Cola example: see earlier discussion.

—> Häggström’s appeal to tradition: in a previous post, Häggström deployed the following appeal to tradition: “This is the logical justification of NHSTP (the justification of confidence intervals is similar), and the way we statisticians have been teaching it for longer than I have lived.” Yet the fact that a method has been used for a long time does not mean that it is valid or useful. It baffles me how Häggström could deploy this kind of “argument” with a straight face.

—> Häggström’s appeal to R. A. Fisher: Häggström implicitly quoted R. A. Fisher when claiming that statistical significance either means that the null hypothesis is false or that something unlikely has happened. This is false, as large sample sizes can yield statistical significance for even minute differences that could be observed with high probability even if the null hypothesis was true.

—> The trans-generational epigenetics study: Häggström deploys another straw man, this time claiming that I was incapable of understanding that the results could not be interpreted as evidence of a specific kind of epigenetic inheritance. In reality, I was explicit when I wrote in the comment section (i.) that the study should have made a correction for multiple testing, (ii.) that the confidence interval is so broad that they should have been extremely careful in their conclusions and not made it a headline or main finding and (iii.) that I thought the study had a limited scientific value. Nowhere did I claim that the p value implied a credible H0 rejection within the NHST paradigm. What I did claim was that (a.) their NHST-based criticisms (which were based on translating a confidence interval to a p value and then concluding that it was above 0.05 and that the results were therefore not credible) were severely flawed and that (b.) the effect size was large enough to be of medical interest and that future research testing that particular inheritance route is warranted. My main arguments were that reducing confidence intervals to a statistical significance test ignores many of the other relevant ways of using confidence intervals to interpret research results, and that statistical non-significance is not evidence for equivalence because p values depend on sample size. Häggström wasted much ink arguing against his own false characterizations of my position, despite the fact that it was explicitly laid out in the very first comment.
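Two of the points in the list above, the instability of p values under exact replication and the indirect link between p values and posterior probability, can be illustrated with a few lines of Python. This is a rough sketch under assumptions of my own choosing, not a reanalysis of any study: a two-group z test with equal group sizes and a known standard deviation of 1, a true standardized effect of 0.5 with 32 participants per group, and, for the Bayesian part, a simplified calculation that conditions on “p < alpha” rather than on the exact p value. The function names and the priors are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(z):
    """Two-sided p value for a z statistic."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def p_value_interval(effect_size, n_per_group, coverage=0.80):
    """Range containing `coverage` of the p values from exact replications.

    Illustrative assumptions: equal group sizes, known SD of 1, and a true
    standardized effect equal to `effect_size`.
    """
    # Under these assumptions the z statistic is Normal(expected_z, 1).
    expected_z = effect_size / sqrt(2.0 / n_per_group)
    tail = (1.0 - coverage) / 2.0
    z_low = expected_z + STD_NORMAL.inv_cdf(tail)
    z_high = expected_z + STD_NORMAL.inv_cdf(1.0 - tail)
    if z_low < 0:
        # p is not monotone in z across zero, so this shortcut would be wrong.
        raise ValueError("sketch only handles designs where z_low >= 0")
    return two_sided_p(z_high), two_sided_p(z_low)


def posterior_null_given_significance(prior_null, alpha=0.05, power=0.80):
    """P(H0 | significant result) from Bayes' theorem.

    Simplification: conditions on 'p < alpha', not on the exact p value.
    """
    false_positive = alpha * prior_null
    true_positive = power * (1.0 - prior_null)
    return false_positive / (false_positive + true_positive)


p_low, p_high = p_value_interval(effect_size=0.5, n_per_group=32)
print(f"80% of replications give p between {p_low:.4f} and {p_high:.2f}")

for prior in (0.5, 0.99):
    posterior = posterior_null_given_significance(prior)
    print(f"prior P(H0) = {prior:.2f} -> P(H0 | significant) = {posterior:.2f}")
```

In this made-up design the middle 80% of replication p values spans more than two orders of magnitude, and the same “significant” result leaves a null hypothesis with a prior of 0.99 still far more likely true than false.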


Ellis, P.D. (2010). The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. New York: Cambridge University Press.

Haller, H. and Krauss, S. (2002). Misinterpretations of Significance: A Problem Students Share with Their Teachers? Methods of Psychological Research Online. 7(1), 1-20.

