In previous posts, I criticized the doomsday arguments made by some NHST statisticians about the recent banning of null hypothesis significance testing (NHST) and debunked the objections leveled against Geoff Cumming’s dance of the p value argument. This has now drawn the attention of mathematical statistician Olle Häggström and prompted him to write a response post to yours truly. He spends most of the post engaging in personalities and raving about perceived injustices he thinks I subjected him to, but he eventually discusses two examples where he thinks I have gone astray. Unfortunately, his first example is a trivial misreading of what I wrote as well as a quotation out of context. The second example, where he provides a situation where he thinks NHST is essential, is only slightly better. In the end, he fails to successfully rebut any of my substantive arguments.
Häggström excessively engages in personalities
Because I have argued against pseudoscience for many years, I have developed a thick skin and a laser-like mentality trained at cutting through the nonsense. The more my opponents dwell on my alleged personal traits or failings and make liberal use of invectives, the more they demonstrate that they (1) are unable to distinguish between an argument and the person making that argument, (2) have reduced capacity for emotional regulation and (3) tacitly admit that they do not have much in the way of substantive arguments against my position. Their behavior does not harm me in any way. In fact, I find it endlessly entertaining. All they are doing is harming their own capacity to accurately perceive reality.
Häggström makes liberal use of various invectives and attempted insults and most of his post contains engagements in personalities, such as calling me an “overaged biology student without any kind of scientific merits”. This shows that Häggström cannot reliably distinguish between person and argument. My scientific qualifications do not determine the credibility of the arguments I make: those stand or fall on their own merits. This skeptical maxim is one of the primary reasons why I do not give that much detail with regard to my academic qualifications or current activities. Another reason is the many historical and contemporary examples of excessive emphasis on academic qualifications leading to equally excessive hubris, such as some Nobel Prize winners promoting HIV/AIDS denialism, homeopathy, autism quackery and so on. Finally, it also relates to online privacy. My name is pretty common, but not that common if you also factor in a few facts about my academic history. As a skeptical activist who has made a lot of people agitated, I prefer my entropy to be well above zero and enough quasi-identifiers to be hidden from view. However, if he publicly begs me for them (thereby embarrassing himself further), I’ll go ahead and email him copies of my degree certificates.
In addition to arbitrary invectives, Häggström tries to play the victim card on multiple occasions by claiming that I misrepresented his position and that I have personally called him “statistically naive” and “laughably desperate”. Is there any truth to these allegations?
Did I call Häggström “statistically naive”?
No. In reality, “statistically naive” was a description I made of the doomsday-like arguments made by Häggström with regard to the fact that a psychology journal banned p values (he called the ban “intellectual suicide”). Making a statement about an argument is not the same as a personal assault. Those who cannot distinguish an argument against an idea from an argument against a person have no business trying to take part in a rational discussion about serious topics.
Did I personally call Häggström “laughably desperate”?
As for “laughably desperate”, that was a sweeping generalization I made about NHST proponents as a group, not Häggström as an individual. This is no different from making the generalization that I hate telemarketers. It certainly does not mean that, for every telemarketer in existence, I personally hate that person. I just hate getting random calls from telemarketers. It is therefore deliciously ironic that Häggström himself appeals to the difference between extreme linguistic stringency and everyday usage of a word in an effort to explain an embarrassing statement he made several years ago. Apparently it is alright for Häggström to use words in their everyday sense, but not for anyone else. This is a clear indication of selective skepticism.
Did I misrepresent Häggström on confidence intervals?
In one of the previous posts on Häggström and NHST, I wrote:
Häggström seems to be under the impression that if he can find rare and complicated counterexamples, he can undermine the entire case for confidence intervals.
Häggström replied that he did not at all reject confidence intervals and pointed to a sentence in his post where he agreed that there are many cases where confidence intervals are more informative than just reporting a p value. At first I did not understand why he thought I had claimed that he rejected the usefulness of confidence intervals, but then I noticed that I had not been sufficiently clear. The sentence should have read:
Häggström seems to be under the impression that if he can find rare and complicated counterexamples, he can undermine the entire case for confidence intervals [being generally superior to p values].
In other words, I argued for the near-total domination of confidence intervals over p values in most typical research designs. Häggström rebutted that there are circumstances where confidence intervals cannot be calculated or are very difficult to calculate. I responded by saying that it is not really that relevant that he can dig up rare and complicated counterexamples, because it does not refute the general idea that, for most typical research designs (which are likely to involve single parameter estimations), p values say very little, whereas confidence intervals are much better. This should be clear from the context of the paragraph after (I have added another clarification):
In the case of confidence intervals, on the other hand, the fact that they work in cases with a single parameter is enough to justify their usage [instead of p values]. Keeping in mind that the vast number of experiments done in e. g. medicine are probably not complicated estimations of multiple population parameters, but more akin to measuring the effects of a medication compared with placebo, the superiority of confidence intervals over p values for a large portion of experiments stands.
Here it is clear that I was not merely arguing for the usefulness of confidence intervals in addition to p values, but for a much stronger position that confidence intervals are generally superior to p values and should replace them. Häggström does not reject confidence intervals, merely the more extreme position that I and others hold. I completely acknowledge that the sentence was phrased in a way that appeared misleading and unfair to Häggström when the context was not taken into account.
Did I misrepresent Häggström on “accepting the null”?
I wrote that:
Even more bizarrely, some people (including Häggström) think that a failure to reject the null hypothesis means that you can accept it, apparently selectively oblivious (going so far as to call it semantical hairsplitting) to the fact that the failure to reject the null hypothesis could be due to low statistical power and not because it is true.
Häggström responds by insisting that he really meant “accept” in a non-statistical sense and not at all the way the term is used in NHST and that he certainly does not believe that statistical non-significance implies the truth of the null hypothesis. Yet it is not clear what the relevant difference is between the two. Furthermore, his claim that it is “semantical hairsplitting” was never addressed and neither was his insistence that he alone can use words outside their most stringent linguistic definition. Yes, of course, Häggström sometimes writes that there is a difference between the failure to reject the null hypothesis and the null hypothesis being true (particularly when pressed on the details later on in a forum thread). That is why I purposely wrote “selectively oblivious” and not “completely oblivious”! Or did Häggström think it was just two arbitrary words that had no relevance to the rest of the sentence? This is one of the many times in his post that Häggström has completely failed to carefully read the text he comments on.
In summary, Häggström attempts to play the victim card in multiple ways and simultaneously tries to deploy the exact same tactics he scolds me for allegedly using. If he spent nearly as much time constructing actual arguments as he did being testy, his criticism might add up to something worth taking seriously. Now that we have dealt with his distractions, let us move on to the actual substance of his post. Despite alleging that I have a “glaring ignorance” on the topics I discuss, he limits himself to discussing two points. Unfortunately, it was not the discussion that I hoped for. He quotes me out of context and trivially misunderstands my position, even though it is obvious what I mean in the section he quotes. Finally, he does make a small attempt at fulfilling the challenge I gave him.
The sample size issue
Häggström makes the stark claim that NHST is required to be able to know what sample sizes are suitable for an experiment. This is based on his previous claim that you need NHST to distinguish between the two Cola results described below. This was dispatched in an earlier post:
The only real “argument” that he used was a clear straw man. He linked to a previous blog post he wrote in defense of statistical significance testing. In it, he proposed the following two research results:
(1) 75% of people prefer Percy-Cola over Crazy-Cola (n = 4)
(2) 75% of people prefer Percy-Cola over Crazy-Cola (n = 1000)
and that statistical significance testing was required to be able to distinguish the scientific credibility of the two, since the effect size is the same. However, this is a clear straw man. The argument is not to replace tunnel vision on p-values with tunnel vision on effect size. Rather, it is to take into account many different metrics such as effect size, precision and what it all means in the scientific context.
In this toy example, it is easy to tell them apart by pointing out that (1) was a very low sample size and is therefore not a good estimate of the population parameter. The point of taking a sample from a population is that the sample should reflect the population. Low sample sizes generally do not and therefore, (1) has essentially no credibility, whereas (2) has considerably more. At no point during this argument did I invoke p-values, talk about the probability of observing at least as extreme results given the null hypothesis or accepting or rejecting null hypotheses. So much for that argument.
To make matters worse, a study like that with 1000 is likely to be overpowered, yielding statistical significance for even small differences. So it is not a suitable method for evaluating to what degree people can be said to prefer Percy-Cola.
So far, Häggström has refused to directly address this argument or acknowledge that he committed a straw man fallacy. He did, however, write something that only with extreme charity can be called “response” (my translation):
Here we need to be a little bit careful with what is meant by the “conceptual framework of NHST statistics”. Of course I do not mean that the usage of the words “p value” and “statistical significance” are themselves necessary; these quantities can of course be re-named, or we can use transformed quantities and thus experiment with p values in a more masqueraded form. What I mean is that one, somewhere in the argument, needs to calculate or at least estimate quantities of the following kind: assuming these-and-these parameter values, what is the probability of getting at least such-and-such extreme data? In other words, we need to calculate or at least estimate p values.
This is yet another straw man from Häggström. Of course I did not claim that the reason confidence intervals can be uncoupled from NHST rests in the fact that you can use confidence intervals while literally avoiding the words “p value” and “statistical significance” themselves. What I mean is that they can be used without any reference to a p value or the probability of at least as extreme data given the null, without doing a statistical significance test and without accepting or rejecting null hypotheses. In other words, it is an argument from function, not an argument from semantics. This is precisely what I mean when I say that confidence intervals can break free from the chains of NHST: they can be used for many other exciting things besides the canonical frequentist definition or doing a visual statistical significance test. You are not in any way forced to use them for statistical significance testing or interpret them as part of NHST. At this point, I am half expecting Häggström to claim that arithmetic means cannot break free from the “NHST conceptual framework” because NHST uses them.
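To make the Cola comparison concrete, here is a minimal sketch of how the two results can be told apart by interval width alone. It assumes the Wilson score interval as one reasonable choice of interval for a proportion (the earlier post did not name a specific method), and the function name `wilson_interval` is purely illustrative. No null hypothesis is stated and nothing is accepted or rejected; the output is simply a range of plausible values for the population proportion.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion (illustrative helper)."""
    # Standard normal quantile for the requested two-sided coverage.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# The two Cola results: 75% preference with n = 4 and with n = 1000.
for successes, n in [(3, 4), (750, 1000)]:
    low, high = wilson_interval(successes, n)
    print(f"n = {n}: observed 75%, 95% interval {low:.2f} to {high:.2f}")
```

Under these assumptions the interval for n = 4 covers most of the range from roughly 0.30 to 0.95, while the interval for n = 1000 is only a few percentage points wide on either side of 0.75. The same observed effect size, two very different levels of precision.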
My response to the claim that NHST is required to make reasonable conclusions about sample sizes was this: all other factors being equal, bigger sample sizes are always better for the accurate estimation of population parameters. The simplest rule of thumb I provided was “as many as you can afford / have time for”, but I also argued that it depends on other factors as well, such as the kind of experimental subjects used, their biological variability and the reliability of the method. Finally, I pointed out that there exist concepts such as assurance and precision that are the confidence interval equivalents of statistical power. Again, we can calculate these quantities without referencing p values, performing statistical significance testing or rejecting null hypotheses, thus avoiding the charge of “secretly doing NHST”. How did Häggström respond? He started by selectively quoting the first part of the paragraph in question (my translation):
In the first of his blog posts, he seeks to rebuke my claim that well-founded decisions about sample size requires the conceptual framework of NHST, and claims the following:
First of all, bigger sample sizes are always better (all other things equal) for accurately estimating the population parameters under study (unless we are using NHST in which case big sample sizes cause studies to be overpowered). So an initial rule of thumb for sample size decisions might be “as many as you can afford and have the time for”.
This suggests a near-total ignorance on Emil Karlsson’s part about how research is done in reality. Largest possible sample size, no really, but that is not any help when you have a limited research budget and have to weigh sample size against other costs, or when a research project involves more than one sample collection that, from a budget perspective, needs to be weighed against each other. Even without research budget issues, there are situations where important aspects argue for a restriction in sample size e. g. when new medications are tested for side effects, where the usage of unnecessarily large sample sizes is directly unethical.
It seems that Häggström did not properly read what I wrote. I wrote that, all other factors being equal, larger sample sizes are always better for the purpose of accurately estimating a population parameter. Nowhere did I claim that this was a descriptively accurate view of how research is currently being done or that the accurate estimation of a population parameter is the sole consideration that should be made. In fact, I specifically included considerations of cost (“as many as you can afford”, which can include budget considerations) and the experimental subjects (“what kind of experimental subjects you have”). By quoting me out of context and not even properly reading the text he does cite, he does not even seem to be making an effort. However, the minimum requirement to be taken seriously is that you at least represent the position of your opponents correctly. In this case, Häggström failed spectacularly.
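As an illustration of planning a sample size around precision rather than power, here is a minimal sketch. It assumes the simple normal-approximation (Wald) margin of error for a proportion and an expected proportion of 0.75 borrowed from the Cola example; the function name `n_for_margin` is hypothetical. It covers only the precision part: a full assurance calculation would additionally ask for a stated probability that the obtained interval is no wider than the target.

```python
from math import ceil
from statistics import NormalDist


def n_for_margin(expected_proportion, margin, confidence=0.95):
    """Smallest n whose expected interval half-width is at most `margin`.

    Uses the Wald approximation n = z^2 * p * (1 - p) / margin^2.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    variance = expected_proportion * (1 - expected_proportion)
    return ceil(z**2 * variance / margin**2)


# How many participants for a given target half-width of the 95% interval?
for margin in (0.10, 0.05, 0.02):
    print(f"margin of +/- {margin:.2f}: n = {n_for_margin(0.75, margin)}")
```

The only inputs are the desired precision, the coverage level and a rough guess of the proportion. Budget or ethical limits then enter as a cap on n, which tells you what precision you can actually afford.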
The challenge to the claim that “NHST is essential”
It seems that Häggström has decided to take on the challenge I gave him in a previous post:
Finally, where is this alleged “baby”? Show us where p values are “very important statistical tools”. Show us what additional insights a p value can give us if we can use confidence intervals. Show us the insights that can only be gotten from a p value and where all other methods invariably fail. Show us that using p values has more practical benefits than drawbacks. Show us that it is worth the risk.
Unfortunately, it was a very weak attempt. He did not show that the practical benefits outweighed the drawbacks. He did not show us any insight that can only be gotten from a p value and where all other methods fail. Here is his example (my translation):
However, let me mention another example, namely studies using multiple testing, that is, where more than one statistical inference is done. To reach a conclusion on whether or not the obtained results deviate from what would be typical if all effect sizes were zero would in most cases be extremely clumsy and difficult if one were banned from experimenting with p values.
I am glad that Häggström chose this example, because the reality is the exact opposite. NHST is not a productive technique to use for most large-scale experiments that involve massive multiple testing. This is because when the sample size looks large, it is often merely technical replicates and not biological replicates. Under such circumstances, NHST should not be used as only biological replicates should be considered to increase the sample size. When biological replicates are used, the sample size is usually very small. This has very interesting effects on the outcome of applying NHST. First of all, the same problem with p value instability with regards to replication occurs for multiple testing scenarios as it does for single exact replications. However, there is also a much, much worse problem that relates to the size of the observed variance and the parameter estimation accuracy. I cannot go into additional details in this post because this is original research from other people, but I assume Häggström is smart enough to figure out what kind of simulation (and results) I have in mind.
There is also a very problematic assumption hiding in the formulation Häggström uses, namely “[…] if all effect sizes were zero”. This suggests that he is thinking about what is known as the global null hypothesis, which is the hypothesis that all measured effect sizes are 0. But this hypothesis is incredibly unlikely and becomes increasingly unlikely as the number of statistical tests grows. So why try to test a hypothesis we almost certainly know is false? This exposes one of the major problems with NHST: it is not a procedure that is concerned with testing research hypotheses, just incredibly unlikely hypotheses that are scientifically irrelevant and that almost no one believes in the first place. Häggström might retort that NHST is a way to test the most absurdly ridiculous hypotheses to see if the data can call them into question as a minimal requirement. But this is not the way NHST is used in practice, it is not the way NHST was intended and it certainly would not justify calling a p value ban “intellectual suicide”.
Regardless, there are ways to adjust confidence intervals for multiple testing as well. Furthermore, it is also possible to just rank effect sizes by size and continue working with, say, top 20 regardless of p value. No NHST is required to rank effect sizes.
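Both ideas can be sketched in a few lines. The example below assumes a Bonferroni-style widening of normal-approximation intervals, which is only one of several possible adjustments, and uses entirely made-up effect estimates and standard errors; the names `simultaneous_intervals` and `rank_by_magnitude` are illustrative.

```python
from statistics import NormalDist

# Hypothetical effect estimates with standard errors (made-up numbers).
estimates = {
    "gene_a": (1.8, 0.40),
    "gene_b": (-0.2, 0.35),
    "gene_c": (0.9, 0.50),
    "gene_d": (-1.4, 0.30),
}


def simultaneous_intervals(estimates, confidence=0.95):
    """Bonferroni-adjusted intervals: each one uses coverage 1 - alpha / m."""
    m = len(estimates)
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))
    return {
        name: (est - z * se, est + z * se)
        for name, (est, se) in estimates.items()
    }


def rank_by_magnitude(estimates, top=2):
    """Names of the `top` effects with the largest absolute estimate."""
    ordered = sorted(estimates, key=lambda name: abs(estimates[name][0]), reverse=True)
    return ordered[:top]


for name, (low, high) in simultaneous_intervals(estimates).items():
    print(f"{name}: {low:.2f} to {high:.2f}")
print("largest effects:", rank_by_magnitude(estimates))
```

The adjusted intervals are wider than unadjusted ones, which is the price of covering all parameters at once, and the ranking step uses nothing but the estimates themselves.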
In the end, Häggström failed to demonstrate that NHST is essential.
Bonus round: criticism that Häggström has failed to provide a satisfactory response to
The modus operandi of Häggström appears to be to ignore all the cases where he was proven wrong and increasingly focus on smaller and smaller issues where he thinks he can see a chink in the armor. In an effort to expose how Häggström uses this method, I have compiled a (partial) list of arguments that Häggström either never responded to, or provided a deeply unsatisfactory response to.
—> The instability of p values under exact replication: for most realistic research designs, the range over which the p value varies is enormous and the p value is thus unstable under exact replications. Since decisions are made based on obtained p values, those decisions are also unstable. This means that p values are not useful for most realistic research designs. Häggström made a straw man argument claiming that this argument treated p values as population parameters to be estimated (and was therefore ridiculous) and that anyone who rejected p values because of this should also reject the data itself. I reiterated that the instability of p values under exact replications is a problem because decisions are based on p values and if p values are like Russian roulette, then so are those decisions. I dispatched the claim that it would imply the rejection of data itself since the p value varies across three orders of magnitude, whereas hardly any effect size does that under exact replications unless the instruments are terribly imprecise.
—> The logical invalidity of NHST: NHST proponents frequently claim that statistical significance means that the null hypothesis is suspicious. However, this is a logical error since it is possible for the alternative hypotheses to be even more unlikely. Trivial cases involve alleged psychic powers or homeopathy, but can readily be applied in ordinary circumstances as well.
—> The indirect relationship between p value and posterior probability: A p value is not a good measure of the strength of the evidence against the null hypothesis. This is because it has to be weighed against the prior probability of the null hypothesis, which can be radically different. Low p values for a priori likely null hypotheses do not entail a low posterior probability, just as relatively higher p values for a priori unlikely null hypotheses do not entail a high one. Thus, even the name of NHST is misleading: NHST is not really a test of null hypotheses. Häggström retorted that this amounted to the claim that p values are bunk because they are not posterior probabilities, which missed the point. The point was that p values are not a good guide to whether or not the null hypothesis is suspicious.
—> NHST does not tell us what we want to know: a p value tells us nothing about the effect size, the precision with which the effect size is estimated, whether it can be replicated or what it means in the scientific context. Correctly interpreted, p values tell us very little for most realistic research designs. Yet this is not how NHST is used or advocated for. There it is said to play a considerably larger role. In fact, there are common textbooks in statistics that almost exclusively focus on NHST and statistical significance.
—> The Cola example: see earlier discussion.
—> Häggström’s appeal to tradition: in a previous post, Häggström deployed the following appeal to tradition: “This is the logical justification of NHSTP (the justification of confidence intervals is similar), and the way we statisticians have been teaching it for longer than I have lived.” Yet the fact that a method has been used for a long time does not mean that it is valid or useful. It baffles me how Häggström could deploy this kind of “argument” with a straight face.
—> Häggström’s appeal to R. A. Fisher: Häggström implicitly quoted R. A. Fisher when claiming that statistical significance either means that the null hypothesis is false or that something unlikely has happened. This is false, as large sample sizes can yield statistical significance for even minute differences that could be observed with high probability even if the null hypothesis was true.
—> The trans-generational epigenetics study: Häggström deploys another straw man, this time claiming that I was incapable of understanding that the results could not be interpreted as evidence of a specific kind of epigenetic inheritance. In reality, I was explicit when I wrote in the comment section (i.) that the study should have made a correction for multiple testing, (ii.) that the confidence interval is so broad that they should have been extremely careful in their conclusions and not made it a headline or main finding and (iii.) that I thought the study had a limited scientific value. Nowhere did I claim that the p value implied a credible H0 rejection within the NHST paradigm. What I did claim was that (a.) their NHST-based criticism (which was based on translating a confidence interval to a p value and then concluding that it was above 0.05 and that the results were therefore not credible) was severely flawed and that (b.) the effect size was large enough to be of medical interest and that future research testing that particular inheritance route is warranted. My main arguments were that reducing confidence intervals to a statistical significance test ignores many of the other relevant ways of using confidence intervals for interpreting research results and that statistical non-significance is not evidence for equivalence because p values depend on sample size. Häggström wasted much ink arguing against his own false characterizations of my position, despite the fact that it was explicitly laid out in the very first comment.
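The first item on the list, the instability of p values under exact replication, is easy to explore by simulation. The following is a minimal sketch under assumptions of my choosing rather than a reproduction of Cumming's own setup: two groups of 32, a true standardized difference of 0.5, a known standard deviation (so a two-sided z test suffices), 1000 exact replications and a fixed random seed. The function name `replicate_p_value` is illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def replicate_p_value(rng, n=32, true_diff=0.5, sd=1.0):
    """Run one exact replication and return its two-sided p value."""
    control = [rng.gauss(0.0, sd) for _ in range(n)]
    treated = [rng.gauss(true_diff, sd) for _ in range(n)]
    diff = mean(treated) - mean(control)
    # z statistic for a difference of two means with known sd.
    z = diff / (sd * sqrt(2 / n))
    # Lower-tail form avoids rounding to zero for large |z|.
    return 2 * NormalDist().cdf(-abs(z))


rng = random.Random(1)  # fixed seed so the run is repeatable
p_values = sorted(replicate_p_value(rng) for _ in range(1000))

print(f"smallest p: {p_values[0]:.2g}")
print(f"median p:   {p_values[500]:.2g}")
print(f"largest p:  {p_values[-1]:.2g}")
share = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"share below 0.05: {share:.2f}")
```

Every replication here samples from the same population with the same design, so any spread in the printed p values comes from sampling variation alone. With these settings, roughly half of the replications should fall on each side of 0.05.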
NHST as science?
Häggström finishes by claiming that my tagline should be “Attacking science using the methods of irrationality”. I do not really care about his feeble efforts to appear witty, but what made me pause was the part about “attacking science”. In other words, Häggström thinks that criticism of NHST is criticism of science. That a logically flawed ritual based on assuming a false and irrelevant hypothesis and then calculating the probability of unobserved results in an effort to make an extremely weak test of said hypothesis constitutes science. Let that realization detonate in your brain. What do we call something that is not science, but tries to imitate it?
There is a lot more to be said about these issues, but this post is long enough.