Why P-Values and Statistical Significance Are Worthless in Science

November 10, 2014 emilskeptic

P-values are scientifically irrelevant

Why should we test improbable and irrelevant null hypotheses with a chronically misunderstood and abused method that has little or no scientific value and several large detrimental effects even when used correctly (which it rarely is)?

For more than 60 years, scientific research results have been analyzed with a method called null hypothesis significance testing (NHST), which produces p-values that the results are then judged by. However, it turns out that this is a seriously flawed method. It does not tell us anything about how large the difference was, how precisely it was estimated or what it all means in the scientific context. It tests false and irrelevant null hypotheses. P-values are only indirectly related to posterior probability via Bayes' theorem, what p-value you get for a specific experiment is often determined by chance, and the alternative hypothesis might be even more unlikely than the null. NHST also increases the false positive rate in published papers, contributes to publication bias and causes published effect sizes to be overestimated and inaccurate. It is also a method that most researchers do not understand: neither the basic definitions nor what a specific p-value means.

This article surveys some of these flaws, misunderstandings and abuses and looks at what the alternatives are. It also anticipates some of the objections made by NHST supporters. Finally, it examines a case study consisting of an extremely unproductive discussion with an NHST statistician. Unsurprisingly, this NHST statistician was unable to provide a rationally convincing defense of NHST.

Why NHST is seriously flawed

There are several reasons why NHST is a flawed and irrational technique for analyzing scientific results.

Statistical significance does not tell us what we want to know: A p-value tells us the probability of obtaining at least as extreme results, given the truth of the null hypothesis. However, it tells us nothing about how large the observed difference was, how precisely we have estimated it, or what the difference means in the scientific context.

The vast majority of null hypotheses are false and scientifically irrelevant: It is extremely unlikely that two population parameters would have exactly the same value. There are almost always some differences. Therefore, it is not meaningful to test hypotheses we know are almost certainly false. In addition, rejection of the null hypothesis is almost guaranteed if the sample size is large enough. In science, are we really interested in finding out whether, e.g., a medication is better than placebo? No, we want to know how much better. Therefore, non-nil null hypotheses might be of more interest. Instead of testing whether a medication is equal to placebo, it can be more important to test whether a medication is better than placebo by a clinically meaningful margin.
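To make the sample size point concrete, here is a minimal sketch. It assumes a two-sample z-test with a known population standard deviation of 1 and an observed difference that happens to equal the true difference exactly. The function name and the tiny difference of 0.01 standard deviations are hypothetical choices for illustration, not taken from any study.

```python
from math import sqrt
from statistics import NormalDist


def expected_p_value(d, n_per_group):
    """Two-sided p-value of a two-sample z-test when the observed
    standardized difference equals d (population SD assumed known, = 1)."""
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


tiny_difference = 0.01  # hypothetical: 1% of a standard deviation
for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9,}: p = {expected_p_value(tiny_difference, n):.3g}")
```

The same negligible difference moves from clearly "non-significant" to overwhelmingly "significant" purely because the sample grows, which is why the p-value alone says nothing about whether the difference matters.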

P-values are only indirectly related to posterior probability: The lower the p-value (with all other factors held constant), the stronger the evidence against the null hypothesis. However, the relationship between the p-value and the posterior probability (i.e. the probability of the null hypothesis given the evidence) is very indirect and is weighted by the prior probability (i.e. the probability of the null hypothesis given the background information). According to Bayes' theorem:

$P(H|E) = \frac{P(H)P(E|H)}{P(E)}$

That is, p-values have very little evidential weight if you test improbable hypotheses, or even moderately probable hypotheses. As an extreme example, even low p-values in favor of homeopathy do not improve its scientific credibility.
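A rough numerical sketch of this point follows. It applies Bayes' theorem to the event "the test came out significant" rather than to an exact p-value, which is a simplification. The prior probabilities, the power of 0.8 and the function name are all hypothetical values chosen for illustration.

```python
def posterior_given_significant(prior, power, alpha=0.05):
    """P(effect is real | significant result), by Bayes' theorem.

    prior: assumed prior probability that the effect is real
    power: P(significant | effect is real)
    alpha: P(significant | no effect)
    """
    true_positive = prior * power
    false_positive = (1 - prior) * alpha
    return true_positive / (true_positive + false_positive)


for prior in (0.5, 0.1, 0.01):
    posterior = posterior_given_significant(prior, power=0.8)
    print(f"prior = {prior:.2f} -> posterior = {posterior:.3f}")
```

With an implausible hypothesis (a low prior), most "significant" results are still false positives, even though alpha and power are unchanged.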

The p-value distribution is very large: Let us say that you have two populations, take a sample from each and calculate a p-value. Now, imagine doing this 1500 times and calculating 1500 p-values. How wide is the distribution of p-values? What is the range of values taken by p? Cumming (2012) simulated 1500 such experiments with N = 32 per group. The populations were normally distributed with a population standard deviation of 0.2 each, the difference between the means of the two populations was 0.5 standard deviations, a two-tailed test was used, the alpha cutoff was 0.05 and the statistical power was 0.52 (common in many scientific fields from psychology to ecology). Here are the results:

35.1% p > 0.1
12.1% 0.1 > p > 0.05
23.9% 0.05 > p > 0.01
18.9% 0.01 > p > 0.001
10.0% p < 0.001

In other words, what p-value you get under plausible experimental designs is largely a result of chance. The range of values taken by p is very, very large. This is known as the p-value casino.
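A simulation in the same spirit can be sketched in a few lines. This is not Cumming's code. It assumes a two-sample z-test with the population standard deviation fixed at 1 (so the mean difference of 0.5 is in standard deviation units), and the seed and function name are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate_p_values(n_experiments=1500, n_per_group=32, effect=0.5, seed=1):
    """Repeat the same two-group experiment and collect two-sided p-values."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    standard_error = sqrt(2 / n_per_group)  # population SD assumed known, = 1
    p_values = []
    for _ in range(n_experiments):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        z = (mean(treated) - mean(control)) / standard_error
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


p_values = simulate_p_values()
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"share with p < 0.05: {share_significant:.3f}")
print(f"smallest p: {min(p_values):.2g}, largest p: {max(p_values):.2g}")
```

Every simulated experiment samples from exactly the same two populations, so the spread between the smallest and largest p-value is produced by sampling variation alone.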

The alternative hypotheses might be even more unlikely: a statistical significance test is used by many as a method for rejecting the null hypothesis as improbable. However, the alternative hypothesis (and the proposed mechanism) might be even more improbable. Thus, it is not enough to say that the null hypothesis is unlikely. One has to think about the probability of the alternative hypothesis as well. Thus, NHST is formally logically invalid.

It uses arbitrary cutoffs and contributes to black-and-white thinking: there is nothing special about the cutoff of 0.05. A p-value of 0.04 is not much better evidence against the null hypothesis than a p-value of 0.06. The use of this arbitrary cutoff also promotes black-and-white thinking, but a competent evaluation of research results has to take a lot more into account.

It can, at best, only test statistical (not substantive) null hypotheses: NHST can only test statistical hypotheses, e.g. that the population average of the experimental treatment is equal to the population average of the control. It cannot test hypotheses about substance, such as what any observed difference is caused by. For instance, a rejection of the null hypothesis that the suicide frequencies among men and women are identical does not show that women are psychologically inferior to men.

Increases type-I error rate in published papers: On average, 5 out of 100 studies testing a medication that has no effect will reject the null hypothesis. Since null hypothesis rejections are more likely to be published, the type-I error rate in the published literature is considerably higher than the canonical 0.05 that proponents of NHST use.

Contributes substantially to publication bias: because a lot of researchers and journals are so obsessed with statistical significance, papers with null hypotheses rejections are more likely to be published than those not finding any statistically significant differences. This is known as the file-drawer effect or publication bias. This means that the accuracy of published effect sizes is low.

Overestimates effect sizes in published papers: if you have a small sample size, it is hard to reject the null hypothesis for small or moderate effect sizes. Thus, rejections typically happen for samples where the observed effect size is higher than the corresponding population parameter, which means that the published effect size overestimates it.
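This significance filter can be illustrated with a small simulation. It is a sketch under stated assumptions: a true standardized effect of 0.2, 20 participants per group, a known population standard deviation of 1, and observed mean differences drawn directly from their sampling distribution. All of these numbers, and the function name, are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def significance_filter(true_effect=0.2, n_per_group=20, n_studies=5000, seed=2):
    """Compare the average estimate across all studies with the average
    across only those studies that reached p < 0.05 (two-sided z-test)."""
    rng = random.Random(seed)
    standard_error = sqrt(2 / n_per_group)  # population SD assumed known, = 1
    critical_z = NormalDist().inv_cdf(0.975)
    all_estimates, significant_estimates = [], []
    for _ in range(n_studies):
        # Observed difference in means, drawn from its sampling distribution.
        estimate = rng.gauss(true_effect, standard_error)
        all_estimates.append(estimate)
        if abs(estimate) / standard_error > critical_z:
            significant_estimates.append(estimate)
    return mean(all_estimates), mean(significant_estimates)


mean_all, mean_significant = significance_filter()
print(f"mean estimate, all studies:       {mean_all:.2f}")
print(f"mean estimate, significant only:  {mean_significant:.2f}")
```

If only the "significant" studies get published, the literature reports the second number, not the first.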

Not only flawed in many respects, NHST is also chronically misunderstood and abused.

Misunderstandings and abuses of NHST

There are around 20 common misunderstandings and abuses of p-values and NHST. Most of them are related to the definition of the p-value. As discussed above, a p-value is the conditional probability of at least as extreme data, given the truth of the null hypothesis. Other misunderstandings are about the implications of statistical significance.

Statistical significance does not mean substantive significance: just because an observation (or a more extreme observation) was unlikely had there been no difference in the population does not mean that the observed difference is large enough to be of practical relevance. At high enough sample sizes, any nonzero difference will be statistically significant regardless of effect size.

Statistical non-significance does not entail equivalence: a failure to reject the null hypothesis is just that. It does not mean that the two groups are equivalent, since statistical non-significance can be due to low sample size.

A low p-value does not imply a large effect size: p-values depend on several other things besides effect size, such as sample size and spread.

It is not the probability of the null hypothesis: as we saw, it is the conditional probability of the data, or more extreme data, given the null hypothesis.

It is not the probability of the null hypothesis given the results: this is the fallacy of transposed conditionals as p-value is the other way around, the probability of at least as extreme data, given the null.

It is not the probability of falsely rejecting the null hypothesis: that would be alpha, not p.

It is not the probability that the results are a statistical fluke: the test statistic is calculated under the assumption that all deviations from the null are due to chance. Thus, it cannot be used to estimate the probability of a statistical fluke, since that probability is already assumed to be 100%.

Rejecting the null hypothesis is not confirmation of a causal mechanism: you can imagine a great number of potential explanations for deviations from the null. Rejecting the null does not prove a specific one. See the above example with suicide rates.

NHST promotes arbitrary data dredging (“p-value fishing”): if you test your entire dataset and do not attain statistical significance, it is tempting to test a number of subgroups. Maybe the real effect occurs in men, women, old, young, whites, blacks, Hispanics, Asians, thin, obese etc.? More likely, you will get a number of spurious results that appear statistically significant but are really false positives. In the quest for statistical significance, this unethical behavior is common.
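How quickly subgroup testing produces spurious findings can be sketched with one line of arithmetic. The calculation assumes that every null hypothesis is true and that the subgroup tests are independent, which real subgroups seldom are, so treat the numbers as a rough guide. The function name is a hypothetical one.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability of at least one p < alpha among n_tests independent
    tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 10, 20):
    print(f"{n_tests:>2} tests: {chance_of_false_positive(n_tests):.2f}")
```

With the ten subgroups listed above, the chance of at least one "significant" result is already about 40% even if nothing at all is going on.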

So what is the alternative to mindless and mechanical statistical significance testing? It is to make science-based judgements based on a number of other factors.

The alternative to NHST

These factors can include effect sizes, confidence intervals, the scientific context, replication and meta-analysis.

Effect size: An effect is something that is interesting to measure, e.g. the average effectiveness of a new medication or the growth rate of a specific yeast strain under particular circumstances. An effect size is simply the size of what you are measuring, e.g. how effective the medication was or how fast the growth rate was. How big is the difference? Is the observed effect size negligible, small, moderate, large, gigantic or somewhere in between? There is a world of difference between these possibilities.

Confidence intervals: Confidence intervals are a special kind of error bar. Of all the 95% confidence intervals you could hypothetically generate by repeatedly sampling from a specific population, 95% will include the population parameter. A single confidence interval will either include or not include the population parameter, but you will be right on average 95% of the time during your career if you wager on it being included. A confidence interval works like a margin of error and gives a range of plausible values for the population parameter, where the relative plausibility is highest near the effect size estimate. Sometimes, confidence intervals cannot be calculated, or cannot be calculated exactly. But that is no problem: even non-exact confidence intervals are better than NHST, and other error bars can fulfill the general goal of confidence intervals.
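The long-run interpretation can be checked with a short simulation. This sketch assumes a normally distributed population with a known standard deviation, so a simple z-interval can be used. The population mean, standard deviation, sample size and function name are hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def coverage(true_mean=10.0, sd=2.0, n=25, n_samples=4000, level=0.95, seed=3):
    """Share of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sd / sqrt(n)  # population SD treated as known
    hits = 0
    for _ in range(n_samples):
        sample_mean = mean(rng.gauss(true_mean, sd) for _ in range(n))
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / n_samples


print(f"share of 95% intervals covering the true mean: {coverage():.3f}")
```

Any single interval either covers the true mean or does not. The 95% describes the procedure over many repetitions.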

Substantive significance / Scientific context: Scientific context matters: large changes that are statistically significant may be of no substantive significance (e.g. they might have no practical relevance), and small changes that do not achieve statistical significance might be of very high importance. For instance, the effect of aspirin for preventing heart disease is very small, but since there are so many people who risk heart disease and the side effects are small, even a small benefit is going to be of substantive significance. Conversely, a new medication that performs worse than the current one should not be used merely because the difference is not statistically significant.

Replication and meta-analysis: Individual studies can always be flawed and biased in several ways. Therefore, replicating studies in order to see if the results hold up is very important. This helps to weed out false results. Meta-analysis is a method for combining the results of several studies on the same topic. This allows researchers to synthesize research results across many studies and dilute the bias by excluding studies that are flawed and combining studies that are of high quality.
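As a minimal sketch of how effect sizes from several studies can be combined, here is a fixed-effect, inverse-variance pooling. Real meta-analyses often use random-effects models and formal quality assessment, neither of which is shown here. The four effect sizes and standard errors are made-up numbers for illustration only.

```python
from math import sqrt


def fixed_effect_pool(estimates, standard_errors):
    """Inverse-variance weighted average of study estimates and its SE."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se


# Hypothetical effect sizes and standard errors from four studies.
estimates = [0.30, 0.45, 0.10, 0.38]
standard_errors = [0.20, 0.25, 0.15, 0.30]

pooled, pooled_se = fixed_effect_pool(estimates, standard_errors)
low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled effect: {pooled:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
```

The pooled standard error is smaller than that of any single study, which is the sense in which the combined estimate is more precise.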

Objections anticipated

This section will cover a couple of the most common objections made to NHST criticism.

“Misunderstandings and abuses are not the fault of the method”

A method like NHST that has such a strong potential for misunderstandings and abuse even among a large proportion of the most highly intelligent and highly educated has to accept a large proportion of the blame.

“It is possible to use p-values correctly”

So? Even if used correctly, p-values would still not tell us what we are interested in, such as effect size, precision, scientific context etc. There is also a ton of inertia in the system: similar suggestions have been made for 50 years and not much has changed.

“Some research projects / questions require black-or-white answer”

Should those decisions not be based on the best available methods for interpretation? P-values are not a good method for reaching correct black-and-white decisions.

“Not possible to calculate exact confidence intervals in some research designs or sample sizes”

Confidence intervals do not need to be exact, as they should not be used to do NHST but to give a range of plausible values for the population parameter. The solution to a bad research design or low sample sizes is better research designs and higher sample sizes, not NHST.

“NHST is objective”

Objectively wrong, perhaps. Making decisions based on flawed metrics and ignoring others might be “objective”, but it is not a good idea.

Case study: how to handle obstinate NHST statisticians

Unfortunately, several kinds of people are standing in the way of transforming the way scientists interpret research results. Some scientists prefer to take shortcuts and use mechanistic statistical significance testing as a substitute for real scientific thinking and are therefore not well-equipped to consider difficult scientific questions in the light of complex data analysis. Lecturers in intro stats continue to teach outdated and flawed methods of data interpretation. Many journals continue to tolerate both overemphasis and abuse of statistical significance, presumably because of incompetent editors and reviewers. Different approaches are needed to tackle each one of these groups successfully.

However, even if these issues were resolved, there is still one obstacle that remains: the NHST statisticians. These people have built both their field and careers on the widespread use of statistical significance testing. As a result, they defend their precious p-values with tooth and claw, even to the point of intellectual dishonesty and excessive engagement in personalities. Another important explanation is that experts are extremely effective at rationalizing ideas that they have reached based on flawed premises and absurd arguments.

This section will examine one such unproductive interaction with an NHST statistician named Olle Häggström at Chalmers University of Technology (Sweden). For the full conversation, see here (it is in Swedish, so use Google translate).

The background to the discussion was a study on epigenetic inheritance that focused on hazard ratios and confidence intervals. Most hazard ratios were neither large nor statistically significant. However, the authors decided to bold the one hazard ratio that was and made this the main story of their paper. Now, the particular epigenetic inheritance pattern in question (from the grandmother on the father’s side to the granddaughter) is not particularly likely to be especially important based on the biological context. As far as I can tell, this is a false positive. I also thought the confidence interval was unacceptably wide for the authors to be that categorical about their finding. Thus, I used the scientific context and confidence intervals to conclude that their finding was not that impressive. However, the effect size was large enough to merit further investigation regardless of statistical significance or lack thereof.

An NHST statistician tackled this study by visually estimating the p-value from the confidence interval (thereby turning fine-grained data into coarse-grained data). Then, he argued that if a correction for multiple testing was used, the p-value would not be low enough to obtain statistical significance and that the results were therefore not credible. I pointed out that the paper was not overly focused on p-values (none occurred in the paper), that the conclusion I made was valid regardless of whether a correction was made or not, and that an impressive effect size should not be dismissed merely due to being statistically non-significant. Furthermore, the correction method used by this NHST statistician was Bonferroni correction, which was statistically unsuitable for this study since the number of hypotheses tested was more than a few and this kind of correction is very stringent and sacrifices too much statistical power.

He responded by calling me pompous, ignorant, precocious, arrogant, claimed that I was using ugly rhetorical techniques and emphasized his own statistical expertise. However, the only thing more dangerous than ignorance is the illusion of knowledge, something that Häggström should consider.

The only real “argument” that he used was a clear straw man. He linked to a previous blog post he wrote in defense of statistical significance testing. In it, he proposed the following two research results:

(1) 75% of people prefer Percy-Cola over Crazy-Cola (n = 4)
(2) 75% of people prefer Percy-Cola over Crazy-Cola (n = 1000)

and that statistical significance testing was required to be able to distinguish the scientific credibility of the two, since the effect size is the same. However, this is a clear straw man. The argument is not to replace tunnel vision on p-values with tunnel vision on effect size. Rather, it is to take into account many different metrics such as effect size, precision and what it all means in the scientific context.

In this toy example, it is easy to tell them apart by pointing out that (1) had a very low sample size and is therefore not a good estimate of the population parameter. The point of taking a sample from a population is that the sample should reflect the population. Low sample sizes generally do not and therefore, (1) has essentially no credibility, whereas (2) has considerably more. At no point during this argument did I invoke p-values, talk about the probability of observing at least as extreme results given the null hypothesis or talk about accepting or rejecting null hypotheses. So much for that argument.
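The difference in precision between (1) and (2) can also be put into numbers with confidence intervals. The sketch below uses the Wilson score interval for a proportion. That choice is mine, made because it behaves sensibly at very small sample sizes, and it is not something that came up in the original discussion.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, level=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = successes / n
    denominator = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denominator
    half_width = (
        z * sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denominator
    )
    return center - half_width, center + half_width


for successes, n in ((3, 4), (750, 1000)):
    low, high = wilson_interval(successes, n)
    print(f"{successes}/{n}: 95% CI = [{low:.2f}, {high:.2f}]")
```

The interval for n = 4 spans most of the possible range, while the interval for n = 1000 is only a few percentage points wide. The same 75% point estimate thus comes with very different precision, and no p-value is needed to see it.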

At this point, Häggström got so frustrated and upset that he refused to publish any more of my comments unless I carried out a detailed alternative correction for multiple testing. I provided several elements of such a treatment, but he refused to publish it anyway.

In the end, I think that Häggström did not appreciate being debunked by a person on the Internet and the conversation became an issue of prestige for him. His expertise made him believe that he was surely right and everyone else was wrong. The cognitive dissonance that he experienced by reading my arguments made him lash out in rage, which explains the excessive engagement in personalities from his side. As a side note, Häggström identifies himself as a scientific skeptic and thus becomes just another victim of selective skepticism (like Jerry Coyne on medical psychiatry and psychiatric medication). He is extremely rational in many areas (such as climate change), but then becomes completely irrational in other areas, such as NHST, supercomputers taking over the world in a matrix / terminator scenario and mathematical Platonism.


References and further reading

Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.

Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge.

Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge: Cambridge University Press.

1. Sure it’s a common problem that p-values are misused, but so is the scientific method, our own mind and virtually all knowledge in the world. Does it mean that we should give up our best working solution? No. Does it mean we shouldn’t seek out better approaches? Of course not… What I meant to say is that black and white thinking, the urge to use shortcuts, to “spice up” data, to cherry-pick, etc. does not stem from the usage of p-values and using “m-values” instead would not help one-bit. Black and white thinking is not inherent to NHST, it’s inherent to our brains.

Just because something else is misused does not mean that the abuse of p-values is justified. The argument for giving up on p-values is not merely that they are abused, but that even the correct application of p-values has little value and there are much better approaches that actually address meaningful scientific questions. P-values certainly contribute to black-and-white thinking and it is inherent in NHST: a difference is either statistically significant or not. If it is statistically significant, it is emphasized as important. If not, it is usually dismissed. The fact that there are other factors that also contribute to black-and-white thinking does not change this. Would you say that running with scissors does not promote accidents just because there are other ways in which people are stabbed? Of course not. Would you say that war does not contribute to human suffering because the ultimate root of human suffering rests elsewhere? Of course not. The idea is not to replace p-values with some other black-and-white statistic. Rather, it is about replacing them with better methods that give a fuller view of what is going on: effect sizes, confidence intervals, meta-analysis and evaluation of results in the scientific context.

We need to have tresholds in order to think.

Speak for yourself: there is plenty of very good science that relies on interpreting effect sizes and confidence intervals in the scientific context without using arbitrary thresholds.

So, depending on the case, you’ll set up your threshold at 0.05, 0.01, 0.005 or 0.000005 even. It’s not an arbitrary, it’s based on the tradeoff you are willing to make, the error your are willing to allow based on the cost of the experiment and the practical significance (negative or positive) of each point of effect size that you can achieve, the reversability of negative outcomes, the cost for reversing, etc. etc

Extremely few studies use an alpha other than 0.05. They do not control any empirical error rate, merely the error rate given the truth of the global null hypothesis (which is usually false and uninteresting). The fact remains, landing just above or just below the alpha has enormous implications for NHST believers: it is the boundary between what many of them interpret as “significant” and “meaningless”. For those of us who are not indoctrinated into it, a p-value of 0.06 versus 0.04 is not interesting. Either way, it is marginal evidence against an implausible null.

If you set it at 0.005 and the reported statistic is 0.0055, do you scrap it or are you encouraged to do further testing (adjusting accordingly your stats) because obviously you are very close to getting value from what you are doing… That would depend on the stringency with which you’ve set the treshold.

What NHST believers typically do is to arbitrarily change their alpha to, say, 0.01 and report the result as statistically and practically significant. A p-value will end up below any given significance level as long as the sample size is large enough and the null is not exactly true (which it almost never is). So NHST researchers typically repeat the study with more people, and then they get a statistically significant result (all without there having to be any meaningful difference between the populations sampled). These kinds of maneuvers are not productive for acquiring scientific knowledge.

2. Effect size, meaning practical implications of all sorts given different effect should have been taken into account when setting the p-value requirement.

This is not what I mean. I do not mean that we should “merely” use an effect size for sample size calculations and decisions about alpha. I mean that we should use observed effect sizes in their own right to draw conclusions about the nature of reality. Some observed effect sizes are negligible, some are small, some are moderate, some are huge and this matters a lot. A classic example is drug development.

3. I agree with you on confidence intervals, but they are based off p-values (or the NHST error statistical framework anyways) so as I said I don’t see these as fundamentally different.

They are fundamentally different, because confidence intervals give you a lot more than a p-value. As I wrote in a previous post:

“Geoff Cumming (2012), a strong defender of the use of confidence intervals, points to the following six ways of interpreting a 95% confidence interval. I have simplified them so readers not so familiar with statistics can understand the gist:

1. A 95% confidence interval is an interval that will capture the true population mean 95% of the times.
2. A 95% confidence interval is a range of plausible (using the term probable here would be wrong as we saw above) values for the population parameter.
3. A 95% confidence interval is a margin-of-error that gives a likely maximum error for the estimation, even though larger errors are possible.
4. The values closer to the mean are relatively more plausible than the values nearer the ends of the confidence interval.
5. The relationship between confidence intervals and statistical significance (outlined earlier).
6. On average, a 95% confidence interval is an 83% prediction interval.”

It gives us a range of plausible values for the population parameter, it is a margin-of-error, it can work as a prediction interval and so on. You only see (5), but there are five other benefits with confidence intervals that you do not get from p-values.
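The 83% figure in point 6 can be reproduced with a short calculation. This is a sketch that assumes a normally distributed estimate with a known standard error and an exact replication with the same sample size. The function name is hypothetical. The original mean and the replication mean are then independent with the same variance, so their difference has a standard deviation of √2 standard errors.

```python
from math import sqrt
from statistics import NormalDist


def capture_probability(level=0.95):
    """Probability that the mean of an identical replication falls inside
    the confidence interval of the original study."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    # The replication mean lands inside the original interval when the two
    # means differ by less than z standard errors.
    return 2 * std_normal.cdf(z / sqrt(2)) - 1


print(f"{capture_probability():.3f}")
```

So a 95% confidence interval tells you something about where a replication is likely to land, which a p-value does not.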

5. Replication: sure, p-values tell us a lot about replication, otherwise we wouldn’t use them at all. I meant the practice of replicating studies (the way you meant it in your article) and in this sense they are complimentary, not alternatives.

No, p-values tell us nothing about replication. That is a classic NHST fallacy. A p-value of 0.32 does not mean that the findings are any more or less likely to replicate than a p-value of 0.04. A p-value is the probability of at least as extreme data, given the null. It is not the probability of replication.

What replication offers that p-values do not is that we can eliminate the biases of single studies. Several studies that reach the same result (in terms of effect size) are much more credible than a single study with a p-value under a specific alpha.

6. Meta-analysis: ok, still, they are no substitute for p-values, but can be used with p-values (or other approaches) in a complimentary fashion.

Meta-analysis is better than p-values, because it allows us to synthesize observed effect sizes from many different studies and thus get a more reliable result. This is much better than being obsessed with p-values (which we know get lower and lower as the sample size grows).

“Freiman and colleagues (1978)” actually looked at “negative” RCTs and concluded that people don’t understand Type II error. It has nothing to do with p-values which is a statistic related to the Type I error. Power is what you need to look at for Type II errors, e.g. falsely accepting the null hypothesis and I don’t see how this paper supports what you say, since we are talking about false rejection of the null hypothesis here.

No, we are talking about dismissing big effect sizes that can be very worthy of future investigation simply because they are “non-significant”. This shows the large negative effects of NHST.

I claimed that:

Freiman and colleagues (1978) looked at over 70 papers with statistically non-significant results (p > 0.1). Half of them showed a result that was consistent with a 50% therapeutic improvement, which is clearly of clinical importance. This shows the detrimental effect of the “so it is not statistically significant no matter how big the observed effect seems to you intuitively” kind of thinking.

You responded by:

I don’t see how this paper supports what you say

You clearly did not even read the abstract (that explicitly mentions the fact I brought up), but here is the relevant section:

Estimates of 90 per cent confidence intervals for the true improvement in each trial showed that in 57 of these “negative” trials, a potential 25 per cent improvement was possible, and 34 of the trials showed a potential 50 per cent improvement. Many of the therapies labeled as “no different from control” in trials using inadequate samples have not received a fair test. Concern for the probability of missing an important therapeutic improvement because of small sample sizes deserves more attention in the planning of clinical trials.

In other words, NHST has a large detrimental effect on science: it has led a lot of scientists to dismiss very important findings as “non-significant” (in the practical sense) even though the studies were actually consistent with a 50% improvement. No matter how much you try to squirm your way out of it, the fact remains: NHST is essentially useless and causes considerable harm.

12 thoughts on “Why P-Values and Statistical Significance Are Worthless in Science”

Geo

November 24, 2014 at 19:05

I like most of what you like on your blog, but here we would disagree to some extent, as it seems.

I don’t think you understand how p-values work if you think they call for a “black and white” thinking in science. The reporting of research results as “statistically significant”/”not statistically significant” is indeed black and white thinking, but it is a practice that is being discouraged, as far as I know. Authors are encouraged to publish the exact p-values and other information in order to avoid this. So lazy reporting is what promotes b/w, reporting p-values can help avoid b/w.

In terms of “The alternative to NHST” you propose:

– effect size is something that can (and should) complement p-values, not an alternative per se
– confidence intervals are not that different than p-values, in my understanding they are best used in conjunction as well
– scientific context is something that a scientist should definately consider in his everyday work, but since this is subjective, it is outside the scope of statistics
– replication and meta-analysis are also complementary to p-values and rely on p-values to work properly. Don’t see how they are a substitute/alternative.

Maybe the title of the section is misleading or something, cause I see all of these as valid and necessary in conjunction with using NHST, not as alternatives.

“However, the effect size was large enough to merit further investigation regardless of statistical significance or lack-thereof.” – I begin to seriously doubt your understanding of p-values. The observed effect size is taken into account in the p-value calculation, so it is not statistically significant, no matter how big the observed effect seems to you intuitively.

The Bonferroni correction is indeed very stringent and I would generally prefer a more powerful method e.g. Sidak Step Down. Not sure why Mr. Häggström insisted on Bonferroni and I don’t have the time to go over your whole discussion, sorry.

My understanding of NHST and p-values improved greatly after reading the works of professor Mayo. Her article “Error Statistics” (Mayo & Spanos, 2011) was especially useful to me. She maintains a blog and her papers are available for free online, if you want to improve your understanding of NHST.
- Emil Karlsson
  
  November 25, 2014 at 19:13
  
  I don’t think you understand how p-values work if you think they call for “black and white” thinking in science. The reporting of research results as “statistically significant”/“not statistically significant” is indeed black and white thinking, but it is a practice that is being discouraged, as far as I know.
  
  Do you know why it is being discouraged? Because it is a common problem. Researchers publish and emphasize differences that are statistically significant, and neglect the ones that are not. For instance, some surveys show over 80% of all papers published in a range of economics journals make this kind of error. The fact that it is being discouraged is almost completely irrelevant, as improvements are slow or non-existent in published research.
  
  Authors are encouraged to publish the exact p-values and other information in order to avoid this. So lazy reporting is what promotes b/w, reporting p-values can help avoid b/w.
  
  Reporting exact p-values instead of ** spam or NS is very beneficial, but it does nothing by itself to ameliorate the black-and-white thinking inherent in NHST.
  
  – effect size is something that can (and should) complement p-values, not an alternative per se
  – confidence intervals are not that different than p-values, in my understanding they are best used in conjunction as well
  – scientific context is something that a scientist should definitely consider in his everyday work, but since this is subjective, it is outside the scope of statistics
  – replication and meta-analysis are also complementary to p-values and rely on p-values to work properly. Don’t see how they are a substitute/alternative.
  
  Effect size: this can work as an alternative to NHST, because it allows researchers to focus on substantive / practical / scientific importance, not merely statistical significance.
  
  Confidence intervals: the general idea with confidence intervals is not merely to do a visual NHST. They have other, more beneficial interpretations: they give a margin of error, a range of plausible values for the population parameter and so on.
  
  Scientific context: this is not at all subjective, but the most objective you can get in science. An effect size of r = 0.04 might be completely irrelevant for some research questions, but highly relevant in others. For instance, if this is the benefit of aspirin to prevent heart disease, then it is a cheap, safe therapy that is highly beneficial on the population level (as so many people have heart disease, it might prevent thousands of deaths per year). This is certainly not a “subjective” conclusion. Rather, it is an example where p-value obsession would be subjective, as it does not take into account the facts of the situation. It is just a dichotomous decision based on limited information.
  
  Replication: this is not a complement to p-values since it is about different things. A p value tells us nothing about the probability of replication.
  
  Meta-analysis: there are some meta-analytical methods that employ p-values, such as Fisher’s method, but other meta-analytical techniques do not rely on p-values at all, but rather effect size and spread. Obviously, the alternative focus that I am arguing for does not use Fisher’s method at all.
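
  To make the effect-size-based approach concrete, here is a minimal sketch of a fixed-effect (inverse-variance) meta-analysis in Python. The four effect sizes and standard errors are made-up numbers for illustration only, and `fixed_effect_meta` is a hypothetical helper name, not a function from any particular library. No p-value enters the calculation at any point.

```python
from math import sqrt
from statistics import NormalDist


def fixed_effect_meta(effects, standard_errors, confidence=0.95):
    """Pool study effect sizes with inverse-variance weights (fixed-effect model)."""
    # More precise studies (smaller standard error) get larger weights.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    # Normal-approximation confidence interval around the pooled estimate.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return pooled, pooled_se, (pooled - z * pooled_se, pooled + z * pooled_se)


# Hypothetical study results (assumed values, not real data).
effects = [0.30, 0.10, 0.25, 0.18]
standard_errors = [0.20, 0.15, 0.25, 0.10]

pooled, pooled_se, interval = fixed_effect_meta(effects, standard_errors)
print(f"pooled effect = {pooled:.3f}, 95% CI = ({interval[0]:.3f}, {interval[1]:.3f})")
```

  The pooled standard error is smaller than that of any single study, which is the sense in which the synthesis is more reliable than any one result. A random-effects model would be the usual choice if the studies were expected to estimate different underlying effects.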
  
  So in the end, what use are p values if you have effect size, confidence intervals, scientific context, replication and meta-analysis? Be specific! What difference would it make if you tacked on a p value?
  
  “However, the effect size was large enough to merit further investigation regardless of statistical significance or lack-thereof.” – I begin to seriously doubt your understanding of p-values. The observed effect size is taken into account in the p-value calculation, so it is not statistically significant, no matter how big the observed effect seems to you intuitively.
  
  You are right that a p value is calculated with the help of effect size. However, many other things go into this calculation as well, such as sample size and spread. This means that while we can say that a large effect size produces a small p-value if all other factors are constant, a small p-value does not imply a large effect and a large p-value does not imply a weak effect. Therefore, arguing against effect size considerations with the statement that the p-value is non-significant is misguided.
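
  A small numerical sketch may help here. It assumes a two-sample z-test with equal group sizes and a known common standard deviation (a simplification chosen so that only the Python standard library is needed), and all the numbers are invented for illustration. The function name `two_sample_p_value` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a two-sample z-test (equal n, known common sd)."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# A large effect (half a standard deviation) in a small study.
p_large_effect_small_n = two_sample_p_value(mean_diff=5.0, sd=10.0, n_per_group=10)

# The same large effect in a bigger study.
p_large_effect_big_n = two_sample_p_value(mean_diff=5.0, sd=10.0, n_per_group=100)

# A tiny effect (one fiftieth of a standard deviation) in a huge study.
p_tiny_effect_huge_n = two_sample_p_value(mean_diff=0.2, sd=10.0, n_per_group=50_000)

print(p_large_effect_small_n, p_large_effect_big_n, p_tiny_effect_huge_n)
```

  Under these assumptions the large effect in the small study lands above 0.05, while the tiny effect in the huge study lands below it. The p-value alone therefore cannot tell the two situations apart. The observed difference and its precision have to be inspected directly.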
  
  Although you acknowledge the problem with a tunnel vision on p-values and agree that other factors should be taken into account, you commit an important mistake in the above quote: you dismiss a potentially important result simply because it is not statistically significant. This is fallacious.
  
  Freiman and colleagues (1978) looked at over 70 papers with statistically non-significant results (p > 0.1). Half of them showed a result that was consistent with a 50% therapeutic improvement, which is clearly of clinical importance. This shows the detrimental effect of the “so it is not statistically significant no matter how big the observed effect seems to you intuitively” kind of thinking.
Emil Karlsson

November 24, 2014 at 20:22

After self-described indulgence in “narcissistic Googling” Häggström recently posted an additional comment on the blog post in question responding to this post. Did he:

– respond to my criticisms of the fact that his criticisms of the article in question inappropriately relied on NHST?
– respond to the positive arguments I made against NHST in the comment section?
– acknowledge that his usage of Bonferroni correction was statistically inappropriate?
– respond to the additional arguments against NHST I made in this post?

No, none of that. Instead he played the martyr card by falsely characterizing my discussion of his behavior as “personal attacks” and had the audacity to claim that I never submitted the comment that fulfilled some of the aspects of his demands. Entertainingly, he continues his childish behavior by stating that I cannot “count on ever being allowed to return to the blog”. I re-post a shortened version of my comment for completeness (I have a copy of the full-text comment), although I will not hold my breath.

I highly doubt that any kind of intellectually productive interaction is going to be possible with him. If there was any doubt about the intellectual dishonesty of Olle Häggström, that has now completely vanished. I cannot possibly take him seriously any longer.
Emil Karlsson

November 25, 2014 at 11:20

As I suspected, Häggström refused to publish my second comment and explicitly told his readers that it “went into the trash without being read” because he has to keep “some level of consistency”. This is further evidence that

(1) Häggström refuses to publish my comments, even though they fulfill some of his demands. This is in stark contrast to his earlier claim that my statement that he does not publish my comments was “pure invention” on my part.

(2) Häggström is intellectually dishonest and cannot defend his claims or admit error.

He finishes off by encouraging me to explain why his usage of Bonferroni correction was statistically inappropriate. But I have done this several times (!), both in the comment section of his blog (twice published, twice unpublished) and in the blog post above. Here is the main argument, for the 6th time (!):

Bonferroni correction is a statistically inappropriate method when the number of hypotheses is more than a few (in this case, it was at least 8), because the sacrifice of statistical power is relatively large and so the correction becomes relatively harsh, which often defeats the purpose of a statistical significance test in the first place. If you have low or negligible power (which is very common in research), you are very unlikely to detect real differences.

I suggested that a method that conserves power is more suitable, and the example I gave was Benjamini-Hochberg (BH/FDR). The point is not to argue that this would make the differences achieve statistical significance (my entire argument has been that statistical significance is not a valid or useful metric!). Rather, it was to point out that his criticism of the study was flawed. In reality, the effect sizes alone imply that the researchers might be onto something interesting, whether or not it passes the superstitious p < 0.05 criterion. These results do not stand and fall with the results of the significance test.
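
For readers who want to see the difference in stringency, here is a minimal sketch of both corrections applied to eight p-values. The p-values are invented for illustration (the paper under discussion did not report exact p-values), and the function names are hypothetical. Keep in mind that the two procedures control different things: Bonferroni controls the family-wise error rate, while Benjamini-Hochberg controls the false discovery rate, under independence or certain forms of positive dependence between tests.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for false discovery rate control."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    largest_passing_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            largest_passing_rank = rank
    # Reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            reject[index] = True
    return reject


# Hypothetical p-values for eight tests (assumed, not taken from any study).
hypothetical_p = [0.004, 0.011, 0.015, 0.02, 0.03, 0.2, 0.5, 0.8]

bonferroni = bonferroni_reject(hypothetical_p)
bh = benjamini_hochberg_reject(hypothetical_p)
print("Bonferroni rejections:", sum(bonferroni))
print("Benjamini-Hochberg rejections:", sum(bh))
```

With these made-up inputs Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects five, which illustrates the loss of power described above. It says nothing about what either method would do on the actual data.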

Certainly, the authors should have done a multiple testing correction and should never have inflated that single result into the key message of the paper. There, we are in full agreement.

Häggström, instead of admitting that his usage of Bonferroni correction was statistically inappropriate, tried to twist the discussion into whether the minute assumptions of BH were fulfilled by this dataset or not.

However, the above argument (in bold) does not depend on whether or not all of the assumptions of BH are fulfilled. I did provide a partial motivation for why some of the assumptions were fulfilled in those two comments (exact p-values were missing from the paper, so it was not possible to provide a complete justification), but I now understand that it is futile to argue with Häggström: he refuses to engage my arguments, he refuses to publish my comments and he tries to twist the discussion in a very deceptive manner to avoid admitting error.

The argument in bold above returns to the core of the issue, and it is a pretty simple argument that I doubt Häggström could ever defeat. At any rate, I am done with him. Of course I will continue to debunk his pseudoscientific beliefs about NHST, AGI and mathematical platonism as they arise, but I will no longer attempt to reason with him, only against him.
Emil Karlsson

November 25, 2014 at 15:10

Astonishingly, despite having posted the explanation for why Häggström’s application of Bonferroni was inappropriate for the 6th time (this time in all bold), he still does not get it. Worse, he continues to claim that no such explanation even exists! If someone does not get it after the 6th time, even when the section is marked clearly in bold, then that person is clearly beyond the reach of reason. It is therefore deliciously ironic that Häggström considers my arguments crazier than those provided by creationists and climate change denialists, when he himself is clearly unable to grasp a fairly basic statistical argument.

He spends the rest of his comment trying to obscure the fact that he refuses to publish my comments despite the fact that they attempt to fulfill part of his demands. But his words betray him. He clearly states that he put my second comment in the trash without reading, because he says he must “maintain some degree of consistency”. But consistency in behavior requires more than one instance of such behavior. So, perhaps without realizing it, he has confessed to the very thing that he so vehemently denied in the first place. Another delicious irony.
Geo

January 13, 2015 at 16:18

Hi Emil,

For some reason I am not getting notifications for your comments here, so I almost accidentally got back to this thread and I see you’ve responded to me. Thanks!

Let’s see:
1. Sure it’s a common problem that p-values are misused, but so is the scientific method, our own mind and virtually all knowledge in the world. Does it mean that we should give up our best working solution? No. Does it mean we shouldn’t seek out better approaches? Of course not… What I meant to say is that black and white thinking, the urge to use shortcuts, to “spice up” data, to cherry-pick, etc. does not stem from the usage of p-values and using “m-values” instead would not help one bit.

Black and white thinking is not inherent to NHST, it’s inherent to our brains. We need to have thresholds in order to think. We need to use discrete values to describe a non-discrete world and to make decisions and come to conclusions. So, depending on the case, you’ll set up your threshold at 0.05, 0.01, 0.005 or 0.000005 even. It’s not arbitrary, it’s based on the tradeoff you are willing to make, the error you are willing to allow based on the cost of the experiment and the practical significance (negative or positive) of each point of effect size that you can achieve, the reversibility of negative outcomes, the cost for reversing, etc. etc. If you set it at 0.005 and the reported statistic is 0.0055, do you scrap it or are you encouraged to do further testing (adjusting accordingly your stats) because obviously you are very close to getting value from what you are doing… That would depend on the stringency with which you’ve set the threshold.

2. Effect size, meaning practical implications of all sorts given different effect should have been taken into account when setting the p-value requirement.

3. I agree with you on confidence intervals, but they are based off p-values (or the NHST error statistical framework anyways) so as I said I don’t see these as fundamentally different.

4. I didn’t see your argument for the objectivity of the scientific context, honestly. However, objective or subjective, it’s taken into account when setting the p-value requirements, so, as I said, it’s complementary to p-values, not an alternative per se.

5. Replication: sure, p-values tell us a lot about replication, otherwise we wouldn’t use them at all. I meant the practice of replicating studies (the way you meant it in your article) and in this sense they are complementary, not alternatives.

6. Meta-analysis: ok, still, they are no substitute for p-values, but can be used with p-values (or other approaches) in a complementary fashion.

7. “Therefore, arguing against effect size considerations with statement that the p-value is non-significant is misguided.” and “Although you acknowledge the problem with a tunnel vision on p-values and agree that other factors should be taken into account, you commit an important mistake in the above quote: you dismiss a potentially important result simply because it is not statistically significant. This is fallacious.” – never said those things in my comment.

“Freiman and colleagues (1978)” actually looked at “negative” RCTs and concluded that people don’t understand Type II error. It has nothing to do with p-values which is a statistic related to the Type I error. Power is what you need to look at for Type II errors, e.g. falsely accepting the null hypothesis and I don’t see how this paper supports what you say, since we are talking about false rejection of the null hypothesis here.
- Emil Karlsson
  
  January 14, 2015 at 20:32
  
  1. Sure it’s a common problem that p-values are misused, but so is the scientific method, our own mind and virtually all knowledge in the world. Does it mean that we should give up our best working solution? No. Does it mean we shouldn’t seek out better approaches? Of course not… What I meant to say is that black and white thinking, the urge to use shortcuts, to “spice up” data, to cherry-pick, etc. does not stem from the usage of p-values and using “m-values” instead would not help one bit. Black and white thinking is not inherent to NHST, it’s inherent to our brains.
  
  Just because something else is misused does not mean that the abuse of p-values is justified. The argument for giving up on p-values is not merely that they are abused, but that correct application of p-values has little value and there are much better approaches that actually address meaningful scientific questions. P-values certainly contribute to black-and-white thinking and it is inherent in NHST: a difference is either statistically significant or not. If it is statistically significant, it is emphasized as important. If not, it is usually dismissed. The fact that there are other factors that also contribute to black-and-white thinking does not change this. Would you say that running with scissors does not promote accidents just because there are other ways in which people are stabbed? Of course not. Would you say that war does not contribute to human suffering because the ultimate root of human suffering rests elsewhere? Of course not. The idea is not to replace p-values with some other black-and-white statistic. Rather, it is about replacing them with better methods that give a fuller view of what is going on: effect sizes, confidence intervals, meta-analysis and evaluation of results in the scientific context.
  
  We need to have thresholds in order to think.
  
  Speak for yourself: there is plenty of very good science that relies on interpreting effect sizes and confidence intervals in the scientific context without using arbitrary thresholds.
  
  So, depending on the case, you’ll set up your threshold at 0.05, 0.01, 0.005 or 0.000005 even. It’s not arbitrary, it’s based on the tradeoff you are willing to make, the error you are willing to allow based on the cost of the experiment and the practical significance (negative or positive) of each point of effect size that you can achieve, the reversibility of negative outcomes, the cost for reversing, etc. etc.
  
  Extremely few studies use alpha other than 0.05. They do not control any empirical error rate, merely the error rate given the truth of the global null hypotheses (which is usually false and uninteresting). The fact remains, landing just above or just below the alpha has enormous implications for NHST believers: it is the boundary between what many of them interpret as “significant” and “meaningless”. For those of us who are not indoctrinated into it, a p value of 0.06 versus 0.04 is not interesting. Either way, it is marginal evidence against an implausible null.
  
  If you set it at 0.005 and the reported statistic is 0.0055, do you scrap it or are you encouraged to do further testing (adjusting accordingly your stats) because obviously you are very close to getting value from what you are doing… That would depend on the stringency with which you’ve set the threshold.
  
  What NHST believers typically do is to arbitrarily change their alpha to, say, 0.01 and report the result as statistically and practically significant. A p value will always end up below a given significance level as long as you have a large enough sample size and the true difference is not exactly zero. So NHST researchers typically repeat the study with more people, and then they get a statistically significant result (all without there having to be any meaningful difference between the populations sampled). These kinds of arguments that NHST researchers make are not conducive to acquiring scientific knowledge.
  
  2. Effect size, meaning practical implications of all sorts given different effect should have been taken into account when setting the p-value requirement.
  
  This is not what I mean. I do not mean that we should “merely” use an effect size for sample size calculations and decisions about alpha. I mean that we should use observed effect sizes in their own right to draw conclusions about the nature of reality. Some observed effect sizes are negligible, some are small, some are moderate, some are huge and this matters a lot. A classic example is drug development.
  
  3. I agree with you on confidence intervals, but they are based off p-values (or the NHST error statistical framework anyways) so as I said I don’t see these as fundamentally different.
  
  They are fundamentally different, because confidence intervals give you a lot more than a p-value. As I wrote in a previous post:
  
  “Geoff Cumming (2012), a strong defender of the use of confidence intervals, points to the following six ways of interpreting a 95% confidence interval. I have simplified them so readers not so familiar with statistics can understand the gist:
  
  1. A 95% confidence interval is an interval that will capture the true population mean 95% of the time.
  2. A 95% confidence interval is a range of plausible (using the term probable here would be wrong as we saw above) values for the population parameter.
  3. A 95% confidence interval is a margin-of-error that gives a likely maximum error for the estimation, even though larger errors are possible.
  4. Values closer to the mean are relatively more plausible than values nearer the ends of the confidence interval.
  5. The relationship between confidence intervals and statistical significance (outlined earlier).
  6. On average, a 95% confidence interval is an 83% prediction interval.”
  
  It gives us a range of plausible values for the population parameter, it is a margin-of-error, it can work as a prediction interval and so on. You only see (5), but there are five other benefits with confidence intervals that you do not get from p-values.
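
  Interpretations (1) and (6) from the list above can be checked with a short simulation. This is only a sketch under simplifying assumptions: normally distributed data, a known standard deviation (so a z-interval is used rather than a t-interval) and arbitrary, made-up values for the population mean, standard deviation and sample size. The function name `simulate_ci` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_ci(trials=10_000, n=25, mu=100.0, sigma=15.0, seed=1):
    """Estimate how often a 95% CI covers mu and how often it captures
    the mean of an independent replication of the same study."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)
    half_width = z * sigma / sqrt(n)  # known-sigma interval: mean +/- half_width
    covers_mu = 0
    captures_replication = 0
    for _ in range(trials):
        original_mean = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        replication_mean = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if abs(original_mean - mu) <= half_width:
            covers_mu += 1
        if abs(replication_mean - original_mean) <= half_width:
            captures_replication += 1
    return covers_mu / trials, captures_replication / trials


coverage, capture = simulate_ci()

# Analytic value for the known-sigma case: the difference between two
# independent sample means has standard deviation sqrt(2) * sigma / sqrt(n).
z_95 = NormalDist().inv_cdf(0.975)
analytic_capture = 2.0 * NormalDist().cdf(z_95 / sqrt(2.0)) - 1.0

print(f"coverage of mu:       {coverage:.3f}")
print(f"replication captured: {capture:.3f} (analytic: {analytic_capture:.3f})")
```

  Under these assumptions the interval covers the population mean in roughly 95% of simulated studies and captures the replication mean in roughly 83% of them. A p-value gives neither of these readings.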
  
  5. Replication: sure, p-values tell us a lot about replication, otherwise we wouldn’t use them at all. I meant the practice of replicating studies (the way you meant it in your article) and in this sense they are complementary, not alternatives.
  
  No, p-values tell us nothing about replication. That is a classic NHST fallacy. A p-value of 0.32 does not mean that the findings are any more or less likely to replicate than a p-value of 0.04. A p-value is the probability of at least as extreme data, given the null. It is not the probability of replication.
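
  One way to see why a single p-value is a shaky guide is to simulate many exact replications of the same hypothetical experiment and look at how much the p-value varies from run to run. The sketch below assumes a true difference of half a standard deviation, 32 participants per group and a z-test with known standard deviation. All of these are illustrative choices, and `replicate_p_values` is a hypothetical helper name.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def replicate_p_values(replications=2000, n_per_group=32, true_diff=0.5,
                       sd=1.0, seed=7):
    """Run the same two-group experiment many times and collect the p-values."""
    rng = random.Random(seed)
    standard_error = sd * sqrt(2.0 / n_per_group)
    standard_normal = NormalDist()
    p_values = []
    for _ in range(replications):
        treatment = [rng.gauss(true_diff, sd) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        z = (fmean(treatment) - fmean(control)) / standard_error
        p_values.append(2.0 * (1.0 - standard_normal.cdf(abs(z))))
    return p_values


p_values = replicate_p_values()
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)

print(f"smallest p: {min(p_values):.5f}")
print(f"largest p:  {max(p_values):.3f}")
print(f"share with p < 0.05: {share_significant:.2f}")
```

  In this setup the true effect never changes, yet roughly half of the simulated replications fall below 0.05 and half above, with individual p-values ranging from far below 0.01 to well above 0.3.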
  
  What replication offers that p-values do not is that we can eliminate the biases of single studies. Several studies that reach the same result (in terms of effect size) are much more credible than a single study with a p value under a specific alpha.
  
  6. Meta-analysis: ok, still, they are no substitute for p-values, but can be used with p-values (or other approaches) in a complementary fashion.
  
  Meta-analysis is better than p-values, because it allows us to synthesize observed effect sizes from many different studies and thus get a more reliable result. This is much better than being obsessed with p-values (that we know are lower and lower the larger the sample size).
  
  “Freiman and colleagues (1978)” actually looked at “negative” RCTs and concluded that people don’t understand Type II error. It has nothing to do with p-values which is a statistic related to the Type I error. Power is what you need to look at for Type II errors, e.g. falsely accepting the null hypothesis and I don’t see how this paper supports what you say, since we are talking about false rejection of the null hypothesis here.
  
  No, we are talking about dismissing big effect sizes that can be very worthy of future investigation simply because they are “non-significant”. This shows the large negative effects of NHST.
  
  I claimed that:
  
  Freiman and colleagues (1978) looked at over 70 papers with statistically non-significant results (p > 0.1). Half of them showed a result that was consistent with a 50% therapeutic improvement, which is clearly of clinical importance. This shows the detrimental effect of the “so it is not statistically significant no matter how big the observed effect seems to you intuitively” kind of thinking.
  
  You responded by:
  
  I don’t see how this paper supports what you say
  
  You clearly did not even read the abstract (that explicitly mentions the fact I brought up), but here is the relevant section:
  
  Estimates of 90 per cent confidence intervals for the true improvement in each trial showed that in 57 of these “negative” trials, a potential 25 per cent improvement was possible, and 34 of the trials showed a potential 50 per cent improvement. Many of the therapies labeled as “no different from control” in trials using inadequate samples have not received a fair test. Concern for the probability of missing an important therapeutic improvement because of small sample sizes deserves more attention in the planning of clinical trials.
  
  In other words, NHST has a large detrimental effect on science: it has led a lot of scientists to dismiss very important findings as “non-significant” (in the practical sense) even though the studies actually showed a 50% improvement. No matter how much you try to squirm your way out of it, the fact remains: NHST is essentially useless and causes considerable harm.
Geo

January 19, 2015 at 14:51

Waiting to see how confidence intervals get abused the same way as p-values, if they get adopted, as you prescribe 🙂 What you describe above is mostly malpractice, bordering or crossing into fraud. And this will not be stopped by a mere switch to confidence intervals 🙂 Btw, I’m just reading “Faking Science” by the infamous Diederik Stapel and what I see described in there is horrendous and, believe me, no statistical methodology is able to rectify this by itself… I recommend it, btw, I think it was distributed for free (downloaded it some time ago, so I don’t vividly remember).

About the Freiman and co paper: yes, I understand it very well and I stand by my assertion that it has nothing to do with your case above, cause it has to do with power (beta), not p-values (alpha). In fact NHST, when applied correctly and competently, is the way we know the data of these studies doesn’t support their “negative” conclusions… Should everyone be able to learn Assembler? No. Is it an irreplaceable programming language for specific use cases? Absolutely.
- Emil Karlsson
  
  January 20, 2015 at 23:40
  
  You are not responding to my arguments.
  
  I have described more than malpractice: (1) p values do not tell us what we want to know, (2) most null hypotheses are false, (3) most null hypotheses are scientifically irrelevant, (4) promotion of black-and-white thinking, (5) promotion of publication bias, (6) large p value distribution and so on.
  
  Even if we assume it all was merely malpractice rather than intrinsic flaws in NHST, it would still be a very powerful argument against NHST since a method that is so profoundly misunderstood and abused has to take some of the blame.
  
  About the Freiman and co paper: yes, I understand it very well and I stand by my assertion that it has nothing to do with your case above, cause it has to do with power (beta), not p-values (alpha).
  
  No, you do not understand the Freiman paper at all.
  
  This is evident from the fact that I quoted the precise sentence that supports the claim I made: NHST leads to the ignorance of differences that have a potent possibility of being very clinically significant because statistically non-significant differences are dismissed as if they were showing equivalence and this is a major problem with NHST. Your reaction to this, despite having the evidence in front of your eyes, was to merely repeat your original assertion despite the fact that even the abstract explicitly supports my description of the study.
  
  In fact NHST, when applied correctly and competently, is the way we know the data of these studies doesn’t support their “negative” conclusions…
  
  No, the way we know these studies do not support equivalence is because the results do not show an effect size near equivalence with sufficient precision. The fact remains: NHST lifts up differences that are statistically significant and downplays those that are not (if it did not, statistical significance tests would be meaningless). The fact that the conflation of statistical non-significance and equivalence is so widespread is a searing indictment of NHST.
  
  If NHST is “applied correctly and competently” it tells us next to nothing of scientific value. If researchers let their wishful thinking fool them into thinking that it does tell them anything of scientific value, they are abusing it.
  
  Now, I have provided arguments and evidence for my position. Now it is time for you to do the same. Show me why you believe that p values are important in science, that they can answer relevant scientific questions and that the benefits outweigh the drawbacks.
Pingback: Debunking Statistically Naive Criticisms of Banning P Values | Debunking Denialism
Glen Sizemore

March 11, 2015 at 11:52

Unfortunately you do not mention one particularly powerful way to avoid using NHST – that is so-called single-subject designs (SSD). Not every question is amenable to SSD but it is reasonable to inquire as to how many such questions are actually worth asking. SSDs represent experimental science at its best – direct experimental control of the subject matter rather than statistical inference.
Pingback: Häggström Disrobed on NHST | Debunking Denialism
