Butchering Scientific Studies


Sometimes, people who promote pseudoscience online try to reference the scientific literature. In one sense, this is progress: they are going from making arbitrary assertions to trying to justify them. In another sense, it is a turn for the worse. The papers they reference are often of very low scientific quality or do not support what is being claimed, yet the references create an illusion of evidential support for some readers. Often, they damage their own position by spamming long lists of links to videos and blog articles, but some promoters of pseudoscience are more sophisticated.

Previously, I wrote a short introduction on how to counter cranks that reference the scientific literature. Consider this to be the intermediate to advanced version. It will attempt to provide scientific skeptics with additional tools to counter pseudoscience online. The focus will be on research articles, specifically clinical trials. However, the general arguments can often be extended to other forms of research articles. Some of the tools are evidential or methodological in nature and directly related to the meat of the article, such as whether there was a control group or any control for confounders, whether the statistical analysis was appropriate and whether the conclusion accurately reflected the results. Others are more sociological in nature, looking at the journal itself, the presence or absence of peer-review, impact factor, who the authors are and so on. These do not necessarily count against the research in the article directly and should not be used alone, but they provide useful external arguments when combined with criticisms of the study itself. There is of course some overlap between and within these broad categories.

First, a word of warning. Knowledge can be used for good or evil, and this is no exception. It is very dangerous to end up in a situation where studies that run counter to one’s position are subjected to merciless criticism while research that supports it is accepted with little or no critical thought. This is known as pseudoskepticism, and it is something to avoid at all costs. It can undermine the rationality of even some of the giants in science, and it is very unfortunate when leading scientists succumb to pseudoscience.

Does the article actually exist?

Believe it or not, there may be situations where the article being referenced does not even exist. Hardcore pseudoscientists can go to extraordinary lengths to spread misinformation and confusion. This may be especially common when a person talks about how a study shows this or that, but cannot provide a full reference or even partial information such as journal, year or author(s). Until you have independent corroboration that such a study actually exists, treat its existence with skepticism. However, keep in mind that human memory is not a camera and that even ordinary people sometimes forget to write things down.

Sometimes, the reference might actually be a book, not an article. Books are published by many different kinds of publishers. Popular presses almost never have any sort of peer-review and publish almost solely based on profit considerations. University presses are better, but they often have no absolute requirement for peer-review.

Has the article actually been published?

Sometimes, references may appear to point to the scientific literature, but actually point to an article written by an arbitrary person on an external website. This offers very little, if any, evidence for the claims being made. On the Internet, anyone can claim almost anything (and often get away with it). Some articles have been submitted but not yet accepted, or accepted but not yet published. The former may be posted on so-called preprint servers. Make sure you know which situation applies.

Was it published in a scientific journal?

So you know the article has been published somewhere. But do you know that it was a scientific journal? Or was it the New York Times, a local newspaper or a crank blog network? There is a world of difference. Universities sometimes publish articles or reports as well.

Has it been retracted?

Poorly conducted research is sometimes published in the scientific literature. When this is discovered, the journal may decide to retract the paper altogether. This means that it can no longer be considered a published scientific research paper. A retraction is not always obvious when browsing online or reading older copies of the article. An example of a paper that was retracted for being junk science, and because of severe conflicts of interest and multiple counts of scientific fraud, is the anti-vaccine paper by Wakefield from 1998.

Is the article really a research paper?

Journals often publish commentaries and letters to the editor, so just because a text was published in a scientific journal does not mean that it is a research paper.

Is the article generally relevant to the claims being made?

Does the referenced article deal with anything remotely relevant? Is the claim about vaccines while the article is about meteorology? The easiest and fastest way to determine this is to read the abstract, but keep in mind that the abstract may not accurately reflect the paper (see below).

Is the article specifically relevant to the claims being made?

If we take clinical trials of treatments or toxicity testing as examples, are the concentrations used comparable to those in the claim being made and/or the real-world situation? Sometimes, there may be as much as a million-fold difference between the concentration used in the study and reality. A classic example is thimerosal in vaccines.

Is the topic of the article suitable for the field that the journal covers?

Generally, scientists publish articles about a topic in journals that cover that topic. This makes the research available to other workers in the field, who pay attention to journals covering the topics they themselves work on. Be suspicious if an article deals with, say, the hard science of vaccines or global warming, yet was published in an obscure law or electrical engineering journal.

What is the reputation of the journal globally?

Is it a famous journal in general, or one no one has heard of?

What is the reputation of the journal in the specific field?

Sometimes, the journal may not be globally recognized because the field is very narrow and specialized, yet it may be among the top journals in that particular field. Other times it may be both globally unrecognized and in the lower tier of its field. When the journal is very specialized, its tier within the field matters.

What is the impact factor of the journal?

The impact factor of a journal is a number that tells you something about the importance of the research published in that journal. In its common two-year form, it is the average number of citations received in a given year by articles the journal published during the two preceding years. The higher the impact factor, the more influential the journal, and therefore, usually, the more reliable its papers.

A very low impact factor, such as 0.5, is a warning sign, and papers in such journals deserve extra scrutiny. Keep in mind, though, that typical impact factors differ considerably between fields.
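To make the definition concrete, here is a minimal Python sketch of the common two-year calculation. The function name and the journal numbers are hypothetical, chosen only for illustration; real impact factors are computed by the citation database that publishes them, with its own rules for what counts as a citable item.

```python
def impact_factor(citations_this_year: int, citable_items_prev_two_years: int) -> float:
    """Two-year impact factor: citations received this year by items published
    in the two preceding years, divided by the number of citable items
    published in those two years."""
    if citable_items_prev_two_years <= 0:
        raise ValueError("need at least one citable item")
    return citations_this_year / citable_items_prev_two_years

# Hypothetical journal: 150 citable items over two years, cited 75 times this year.
print(impact_factor(75, 150))  # 0.5
```

In other words, a journal with an impact factor of 0.5 sees its recent papers cited, on average, about once every two years.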

Does the journal apply peer-review?

Peer-review involves sending the article to experts in the field for critical evaluation. If these find serious errors or flaws in the manuscript sent in, it will usually be rejected. Peer-review works as an attempt at weeding out most of the bad research. Journals that do not apply peer-review may publish whatever junk is sent their way.

Was this particular study peer-reviewed?

Even if a journal applies peer-review, a particular paper may have been unjustifiably excluded from the peer-review process so that it would be published despite containing serious flaws. This was the case for an article called “The origin of biological information and the higher taxonomic categories”, written by intelligent design creationist Stephen C. Meyer and published in Proceedings of the Biological Society of Washington. While this is not a clinical trial, it serves as a case where the journal applies peer-review in general, yet a particular published study was not peer-reviewed. The paper was excluded from the peer-review process by managing editor Richard Sternberg and published despite severe flaws. The paper was later retracted by the journal. For a more detailed look at the Sternberg situation, read the discussion provided by the National Center for Science Education at the ExpelledExposed website.

Who are the authors of the paper?

This does not really affect the accuracy of the study, but sometimes the authors are hardcore cranks, so searching for their names online is a fast and easy gateway to analyses of the paper.

How was the study funded?

This is also often irrelevant to the merits of the study, but can give you some information about where the study comes from.

Does the abstract accurately reflect the study?

The abstract is the text at the top of an article that researchers look at first to get a bird’s-eye view. It is basically a summary of the different parts of the article: introduction, method, results and conclusion. An abstract that does not match the study can be a sign of intellectual dishonesty: the authors may be trying to make the study seem more groundbreaking and important than it really is. Sometimes, the abstract can even state the exact opposite of what the study actually found. Read the study instead of just posting the abstract.

Does the introduction cover the existing literature well enough?

A good introduction covers the existing literature broadly and discusses only relevant facts and prior studies. An introduction that contains irrelevant material, mischaracterizes the studies it references or does not cover the field well enough is usually a sign of negligence or incompetence.

Does the introduction state a precise aim of the study?

While this is not a particularly strong argument against the conclusions of a study, it tells you something about the competence of the writer. A clear aim (i.e., what we are going to do and why) is often placed at the end of the introduction for several reasons, for instance as a bridge between the introduction (what has been done) and the method (what we did). A poorly written or absent aim reflects badly on the paper, and an aim that does not match the actual study may in some cases be a sign of tampering with the research.

Is the research design suitable for the aim?

If the research design is unsuitable for the aim, the study will probably not answer the questions asked by the scientists. A good paper has a research design that is appropriate for the aim described.

Is the sample size sufficient?

With a small sample, random variation has a larger effect on the results, which can obscure the actual efficacy of a treatment. A small sample is also usually not enough to tell you whether the observed efficacy is representative of the overall population.
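The effect of sample size on randomness can be illustrated with a small simulation. This is a sketch under assumed conditions (a normally distributed outcome with mean 0 and standard deviation 1, and an arbitrary random seed); it measures how much the average of a sample jumps around from one sample to the next.

```python
import random
import statistics

def spread_of_sample_means(n: int, trials: int = 1000, seed: int = 1) -> float:
    """Draw `trials` samples of size n from a population with mean 0 and
    standard deviation 1, and return the standard deviation of the sample means."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.gauss(0, 1) for _ in range(n)) for _ in range(trials)]
    return statistics.stdev(means)

for n in (10, 100, 1000):
    print(n, round(spread_of_sample_means(n), 3))
```

Theory predicts a spread of roughly 1/√n, so about 0.32 for 10 participants and about 0.03 for 1,000. The same true effect is therefore much harder to see through the noise in a small study.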

Is the method described clearly enough?

Can you tell what the researchers did and why? Does it make sense? Can you tell where the subjects came from and how they were recruited? Based on what criteria were they included in the study? What were the criteria for exclusion (if some participants were excluded)?

Does the paper use an appropriate control group?

Without a control group serving as a baseline, it is difficult to say anything about a given treatment. What we want to know is how different the treated group is from the control group. Only that difference tells us anything about how effective the treatment is. Appropriate control groups match the experimental group in many respects and receive a suitable placebo treatment. If the active treatment is given as injections, the placebo treatment should also involve injections. If the active treatment involves two green pills, the placebo treatment should also involve two green pills, and so on.

Is there a placebo treatment given to the control group?

Generally speaking, the change in the variable or variables being measured in the experimental group will depend primarily on two things: (1) the active treatment and (2) expectancy effects. Expectancy effects here simply mean every influence besides the pharmacological substance itself. Without a placebo, the difference between the experimental and control groups includes expectancy effects as well, so the effect of the treatment may be overestimated. In a typical research trial, scientists want to know the effect size of the pharmacological substance, not how much expectancy effects help.

Was the research design double-blinded?

Did the patients know whether they got the pharmacologically active treatment or placebo? If so, this is a major confounder that damages the credibility of the study. Did the people administering the drug know which patient got which treatment? If so, their behavior can implicitly influence the patients, for example by providing more care to those on the placebo treatment, or by showing more interest in the outcomes of the patients on the pharmacologically active treatment.

Does the paper attempt to control for important confounders?

Correlation does not imply causation, so reliable studies need to control for important confounders. What if a third factor is causing the correlation between the treatment and improvement? Controlling for a confounder can involve making sure both groups are equal in that regard, or adjusting for it in the statistical analysis.

Is group assignment randomized, or does it use a technique to reach a similar goal?

At the start of an experiment, the groups may differ in characteristics that act as confounders. The results may then reflect these baseline differences, and not just the variable or variables being tested. A good group assignment is randomized, or uses other techniques to ensure that the groups are roughly equal in other respects, which eliminates many confounders.
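Simple randomization can be sketched in a few lines of Python. The participant labels, the seed and the function name are hypothetical; real trials often use more elaborate schemes, such as block or stratified randomization, and keep the allocation concealed from the people enrolling patients.

```python
import random

def randomize(participants, seed=None):
    """Shuffle the participants and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

patients = [f"patient{i}" for i in range(1, 21)]
treatment, control = randomize(patients, seed=42)
print(len(treatment), len(control))  # 10 10
```

Because chance alone decides who ends up in which group, characteristics such as age or disease severity, including ones nobody thought to measure, tend to even out between the groups as the sample grows.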

Is the statistical analysis appropriate for the research design?

Certain forms of statistical analysis are suitable for certain types of research designs, whereas others are not. Some research designs can make it really easy to get a statistically significant result, which is problematic.

Does the statistical analysis contain any obvious mathematical errors?

This should not happen, but it can be checked easily. There have been times when such errors have been published, although they are probably few by comparison with other statistical errors.

Does the study use p-values or effect size and confidence intervals?

Although still used by researchers, p-values and null hypothesis significance testing (NHST) are considered outdated by many methodologists. This is because they suffer from numerous flaws and misunderstandings: (1) p-values are often misinterpreted as the probability that the results were due to chance, the probability of replication and so on, when in fact a p-value is the probability of obtaining the observed results, or more extreme results, given that the null hypothesis is true, (2) rejecting the null hypothesis does not prove the alternative hypothesis, (3) p-values depend on sample size, and given a large enough sample, almost any difference will appear statistically significant, (4) running significance tests on many subgroups raises the probability of flagging a relationship as statistically significant purely by chance above acceptable levels, (5) reliance on p-values contributes to publication bias, and (6) statistical significance says nothing about practical significance.
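Point (3) can be demonstrated with a short calculation. The sketch below assumes a two-sample z-test with a known, common standard deviation and equal group sizes, and holds a tiny observed difference (0.02 standard deviations, an arbitrary choice) fixed while the number of participants per group grows.

```python
import math

def two_sided_p(observed_difference: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test for a difference in means,
    assuming a known, common standard deviation and equal group sizes."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = abs(observed_difference) / standard_error
    # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)), the two-sided tail probability.
    return math.erfc(z / math.sqrt(2))

# The same tiny observed difference at growing sample sizes.
for n in (100, 10_000, 100_000):
    print(n, two_sided_p(0.02, 1.0, n))
```

With 100 participants per group the p-value is roughly 0.89, with 10,000 it is about 0.16, and with 100,000 it drops below 0.001, even though the difference itself never changed.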

More appropriate statistical techniques are effect sizes and confidence intervals. An effect size is the estimated magnitude of what is being measured, such as the difference in means between the treated and control groups. A 95% confidence interval means that if you repeatedly took samples and calculated an interval from each, about 95% of those intervals would contain the true population parameter. This replaces the dichotomous black-and-white thinking of NHST with estimation.
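The repeated-sampling meaning of a 95% confidence interval can be checked by simulation. This sketch assumes a normally distributed population with a known standard deviation (the mean of 5, standard deviation of 2, sample size of 30 and seed are arbitrary choices), so the interval is simply the sample mean plus or minus 1.96 standard errors.

```python
import math
import random
import statistics

def ci_coverage(true_mean=5.0, sd=2.0, n=30, trials=2000, seed=7):
    """Fraction of simulated samples whose 95% confidence interval for the mean
    contains the true population mean (standard deviation assumed known)."""
    rng = random.Random(seed)
    half_width = 1.96 * sd / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        sample_mean = statistics.fmean(rng.gauss(true_mean, sd) for _ in range(n))
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials

print(ci_coverage())
```

The printed fraction should come out close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the procedure, not one particular interval.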

Does the study separate statistical significance from practical significance?

Statistical significance is about how likely the obtained results, or more extreme results, are given that the null hypothesis is true. It is not the probability that the results were due to chance, nor the probability that the null hypothesis is true given the evidence.

Practical significance, on the other hand, is about asking “how big is this effect size in the biological context”.

Thus, statistical significance does not imply practical significance. A statistically significant result is just something that was unlikely given the null hypothesis. It does not mean that the results are large, important or publishable.

A common statistical error is to treat the data as statistically significant in the results section and then as practically significant in the conclusion section. This is a flawed approach: practical significance has to be justified separately by taking into account the biological context.

Are the results practically significant in the scientific context?

Alright, so the study does not commit the fallacy of inferring practical significance from statistical significance. Maybe the study does not even discuss the practical significance of the findings. While this is best left to the actual experts, a lay person can attempt a rough evaluation of practical significance. For instance, if the difference is very small, just a few percent, it is often safe to assume that the practical significance is low. It is important to understand that statistical significance is an either/or situation, whereas practical significance comes in degrees.

Is it likely that the study used data dredging in any subgroup analysis?

Subgroup analysis is the practice of analyzing a particular section of the study group and seeing whether the results for that subgroup are interesting in some way compared with other subgroups or with the study group at large. There are good ways to carry out a subgroup analysis and there are bad ways. The bad way is called data dredging. It occurs when scientists who did not get a statistically significant result for the entire group start arbitrarily analyzing subgroups in order to find something interesting. The problem is that with so many different variables and subgroups, you are bound to find some statistically significant result purely by chance if you look at enough of them. Such a result will rarely be reproducible. Data dredging may most often stem from an over-reliance on statistical significance, but it can conceivably arise when focusing on practical significance as well. Data dredging is generally considered to be scientific misconduct and should be avoided.

Does that mean that subgroup analysis should never be done? Not at all. A good subgroup analysis is decided on before the study is carried out and has an independent and rational justification. It also involves presenting all the subgroup analyses done, not just the ones that happen to produce an interesting result, and all of this should be explained in the paper itself. For example, individuals with a given co-morbidity may react differently to a treatment. So subgroup analysis is not by default data dredging.

If there is no rational justification for subgroup analysis and it appears to come out of nowhere, then data dredging may have occurred.
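The arithmetic behind data dredging is straightforward if we assume that there is no real effect anywhere and that each subgroup is tested independently at the usual 0.05 level (both simplifying assumptions). The chance that at least one subgroup comes out statistically significant is then 1 - (1 - 0.05)^k for k subgroups:

```python
def chance_of_false_positive(num_subgroups: int, alpha: float = 0.05) -> float:
    """Probability of at least one 'statistically significant' subgroup when
    there is no real effect anywhere, assuming independent tests at level alpha."""
    return 1 - (1 - alpha) ** num_subgroups

for k in (1, 5, 20, 50):
    print(k, round(chance_of_false_positive(k), 2))
```

With 20 subgroups, the chance of at least one spurious finding is about 64%, and with 50 it is about 92%.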

Does the conclusion of the study accurately reflect the results?

Some studies draw wildly inappropriate conclusions from mundane results. Classic examples involve going from statistical significance in the results to practical significance in the conclusion, or applying arbitrary or deprecated standards to hide the fact that the results are practically significant.

Does the paper investigate alternative explanations for the data obtained?

A good researcher will not just blindly follow his or her own favorite hypothesis, but will critically consider it in light of competing ideas. This is usually found in the discussion section. If there is no discussion of possible alternative explanations, alarm bells should go off in your head.

Does the article discuss limitations of the study design?

This relates to the point above: a good researcher aims to get things right, not to be right. A discussion of limitations helps you judge how applicable the results are to the general population and how valid the conclusions of the paper are. If there is no discussion of limitations, the researcher is probably not being objective.

Can the results be generalized to other populations?

Is the group being tested representative of the general population, or wildly different? Generally speaking, the more representative the group, the more reasonable it is to generalize.

Have the results been independently replicated?

A single study rarely settles anything in science, because any single study can be flawed. The true mark of science is independent replication. If multiple research groups independently arrive at the same general conclusion, this provides strong evidence for that conclusion.

Is the study supported or contradicted by earlier studies?

A single study is not conclusive evidence for anything. The results may have been due to chance, important confounders may have been ignored, or the study may have used a questionable research design or inappropriate statistical analysis. That is why it matters whether earlier studies point in the same direction.

Does the article have any commentary from other individual scientists?

There is often debate in the scientific literature. Sometimes, other scientists send in comments on previously published research. These can reveal limitations and flaws in the study that were not noticed or discussed by the original authors.

What is the general appraisal of the impact of the study by scientists in the field?

Is it a groundbreaking study that has been celebrated widely and incorporated into textbooks or is it a marginal study that has not had any particular impact on the scientific community? Is it somewhere in between?

Does the study conform to or contradict a scientific consensus position (should one exist)?

The consensus position will most likely be based on a mountain of studies. Is this particular study just another in the pile, or does it provide data that contradicts the consensus position? The results of a single study should not be rejected simply because it contradicts the prevailing view, but it should be evaluated in the context of the previous research that created the consensus.


There are many ways to critically examine a research paper. I have presented some of the main approaches, but there are more. With these tools, you can find something to criticize in almost any study, because there is a general trade-off between research funding and the ability to eliminate or control for confounders. The more confounders you attempt to get rid of, the more money will be spent, and research funds are always finite.

This leads to a more general conclusion: apply skepticism evenly to studies that conform to your position and to those that contradict it. Do not use these tools to blow apart studies that disagree with you while uncritically accepting studies that agree with you.

Emil Karlsson

Debunker of pseudoscience.
