Untangling Steven Novella on Effect Sizes and NHST

NHST and Effect Sizes

Steven Novella is a neurologist, assistant professor, the founder and executive editor of the Science-Based Medicine blog, host of the podcast The Skeptics’ Guide to the Universe, president of the New England Skeptical Society, and involved in skeptical organizations such as the James Randi Educational Foundation and the Committee for Skeptical Inquiry. In addition, he is one of the scientific skeptics who have influenced me the most, and I have benefited greatly from his writings on everything from criticisms of acupuncture to the debunking of anti-psychiatry.

In a previous post, I discussed the very impressive paper where Colquhoun and Novella convincingly showed that acupuncture probably was not better than placebo, and even if it was, the effect was probably clinically negligible. However, in the Science-Based Medicine blog post about this paper, Novella expanded on a statistical argument about the extent to which scientists can provide evidence that a given treatment has an effect size of zero. Although no single claim Novella made was wrong in isolation, the overall context in which some of them were stated made the line of reasoning a little bit confusing.

The reason for this was that the distinction between two different approaches to statistical analysis of treatment efficacy was not clear enough. These two approaches are null hypothesis (statistical) significance testing (henceforth NHST) and effect size (ES) estimation with confidence intervals (CIs).

Using NHST, scientists calculate the probability of obtaining results at least as extreme as those observed, given the truth of the null hypothesis (usually this is the nil hypothesis of no difference between the experimental and control group). If this probability is low enough, the results are declared to be statistically significant. This does not mean that the experimental treatment is better than placebo. It just means that results at least as extreme would have been unlikely on the hypothesis that the experimental treatment was as effective as placebo. In addition, a statistically non-significant result does not indicate that the treatment is not better than the placebo. This is because the statistical power of the study may not have been large enough, so the probability of detecting an existing difference might have been very low.
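To make these two NHST quantities concrete, here is a minimal Python sketch of a p-value and of statistical power. It assumes a two-group comparison with equal group sizes and a known common standard deviation (a simple z-test); the function names and the numbers are hypothetical, chosen only for illustration and not taken from any study discussed here.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def standard_error(sd, n_per_group):
    """Standard error of a difference in means, equal group sizes (assumption)."""
    return sd * sqrt(2.0 / n_per_group)


def two_sided_p_value(mean_diff, sd, n_per_group):
    """Probability of a difference at least this extreme if the nil hypothesis is true."""
    z = mean_diff / standard_error(sd, n_per_group)
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def power(true_diff, sd, n_per_group, alpha=0.05):
    """Probability of reaching statistical significance when the true difference is true_diff."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = true_diff / standard_error(sd, n_per_group)
    upper_tail = 1.0 - STD_NORMAL.cdf(z_crit - shift)
    lower_tail = STD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Hypothetical scenario: a true difference of half a standard deviation.
print("p-value for an observed difference of 0.2:", two_sided_p_value(0.2, 1.0, 20))
print("power with 20 per group:", power(0.5, 1.0, 20))
print("power with 100 per group:", power(0.5, 1.0, 100))
```

Under these assumptions, a study with 20 participants per group detects a real half-standard-deviation difference only about a third of the time, while one with 100 per group detects it more than 90% of the time. That is why a non-significant result from a small study says little about whether the treatment works.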

Using effect sizes and confidence intervals, scientists can state the obtained difference between the experimental group and the placebo group (an indication of the efficacy of the treatment) with error bars. Just as scientists can obtain a non-zero effect size, they can also obtain an effect size that is approximately zero (indicating that the active treatment is no better than placebo).
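As a rough illustration of the estimation approach, the sketch below computes a difference in means together with a confidence interval. It uses a normal approximation for simplicity (for samples this small, a t-distribution would give a somewhat wider interval), and the data and the function name are invented for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference_ci(treatment, control, confidence=0.95):
    """Return (difference in means, lower bound, upper bound), normal approximation."""
    diff = mean(treatment) - mean(control)
    # Standard error of the difference, using each group's sample variance.
    se = sqrt(variance(treatment) / len(treatment) + variance(control) / len(control))
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return diff, diff - z * se, diff + z * se


# Hypothetical outcome scores, not real trial data.
treatment_scores = [5.1, 4.9, 5.3, 5.0, 5.2]
placebo_scores = [5.0, 5.1, 4.8, 5.2, 4.9]

diff, low, high = mean_difference_ci(treatment_scores, placebo_scores)
print(f"difference = {diff:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Unlike a bare p-value, the interval shows both the estimated size of the difference and how precisely it has been measured.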

With this in mind, let us look at what Novella wrote:

What I think David and I convincingly demonstrated is that, according to the usual standards of medicine, acupuncture does not work.

Let me explain what I mean by that. Clinical research can never prove that an intervention has an effect size of zero. Rather, clinical research assumes the null hypothesis, that the treatment does not work, and the burden of proof lies with demonstrating adequate evidence to reject the null hypothesis. So, when being technical, researchers will conclude that a negative study “fails to reject the null hypothesis.”

Further, negative studies do not demonstrate an effect size of zero, but rather that any possible effect is likely to be smaller than the power of existing research to detect. The greater the number and power of such studies, however, the closer this remaining possible effect size gets to zero. At some point the remaining possible effect becomes clinically insignificant.

In other words, clinical research may not be able to detect the difference between zero effect and a tiny effect, but at some point it becomes irrelevant.

Novella is certainly correct on many things here:

  • Colquhoun and Novella (2013) clearly showed that acupuncture does not work.
  • Clinicians using NHST assume the null hypothesis of no difference.
  • The burden of evidence lies with those claiming that a treatment is better than placebo.
  • Researchers should state “could not reject the null hypothesis” if the study fails to obtain statistical significance.
  • Statistical non-significance does not imply an ineffective treatment as the statistical power might be too small to detect an existing difference.
  • The more studies with adequate power failing to detect statistical significance, the smaller the alleged non-zero effect size has to be.
  • The difference between a very small non-zero effect size and a zero effect size is likely to be clinically irrelevant.
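
The point about power and the remaining possible effect size can be sketched numerically. The snippet below estimates the smallest true difference (in standard deviation units) that a two-group study would detect with 80% power, using the same z-test simplification as a working assumption. The function name and the sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimal_detectable_effect(n_per_group, sd=1.0, alpha=0.05, target_power=0.80):
    """Approximate smallest true difference detectable with the target power.

    Uses the standard approximation (z_{1 - alpha/2} + z_{power}) * SE, which
    ignores the negligible chance of significance in the wrong direction.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = std_normal.inv_cdf(target_power)
    se = sd * sqrt(2.0 / n_per_group)
    return (z_alpha + z_power) * se


for n in (50, 500, 5000):
    print(n, "per group ->", round(minimal_detectable_effect(n), 3))
```

With these assumptions the detectable difference drops from roughly 0.56 standard deviations at 50 per group to about 0.18 at 500 and about 0.06 at 5000, so any effect that well-powered negative studies could still have missed gets squeezed toward zero.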

The statement that I found to be confusing was this: “Clinical research can never prove that an intervention has an effect size of zero”. It is technically accurate as science cannot prove anything in the same sense that mathematics or logic can. However, if we mentally replace “prove” with “convincingly demonstrate”, then the statement is trivially incorrect. If several independent studies are performed that have large sample sizes as well as credible methodologies and the obtained effect sizes are zero or extremely close to zero (with small error bars), then we can conclude that such clinical research has shown that the effect size of a certain treatment compared with placebo is zero.
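
One way to picture this argument is to pool several study estimates and check whether the combined confidence interval sits entirely inside a range regarded as clinically negligible. The sketch below uses fixed-effect inverse-variance pooling, which is my choice for the illustration and not a method taken from Novella's post. The study estimates, standard errors, and the negligibility margin of 0.1 are all made up.

```python
from math import sqrt
from statistics import NormalDist


def pooled_effect(estimates, standard_errors, confidence=0.95):
    """Fixed-effect inverse-variance pooled estimate with a confidence interval."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return pooled, pooled - z * pooled_se, pooled + z * pooled_se


def within_margin(low, high, margin):
    """True if the whole interval lies inside (-margin, +margin)."""
    return -margin < low and high < margin


# Hypothetical effect sizes (treatment minus placebo) from three large studies.
estimates = [0.02, -0.01, 0.00]
standard_errors = [0.05, 0.04, 0.05]
margin = 0.1  # assumed threshold below which an effect is clinically negligible

pooled, low, high = pooled_effect(estimates, standard_errors)
print(f"pooled = {pooled:.4f}, 95% CI = ({low:.4f}, {high:.4f})")
print("entire interval clinically negligible:", within_margin(low, high, margin))
```

When the pooled interval is this narrow and centered near zero, the data do more than fail to reject the null hypothesis: they positively rule out any effect larger than the chosen margin.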

It seems to me that Novella would agree with the above argument, so he must have meant something different when he wrote that sentence. Presumably, the intention was to point out that obtaining non-significance does not necessarily entail that the experimental treatment is not better than placebo. However, bringing the concept of effect size into an argument that was almost exclusively about NHST made it a little bit confusing. Rather, he should have written that NHST alone cannot establish that a treatment has an effect size of zero, or that a failure to obtain results that would have been unlikely on the null hypothesis does not imply an effect size of zero.

Of course, I do realize that he was trying to make a difficult statistical subject understandable for readers who may not be versed in the details of statistical analysis of clinical trial data. Then again, Novella did use the statistical concept of effect size, so it might have been possible to formulate a more accurate description without losing clarity.

Emil Karlsson

Debunker of pseudoscience.
