It is very important to correct the misuse of statistics regardless of the identity of the perpetrator. Sometimes, it may be even more important to correct well-known individuals because their erroneous statistical argument will have a much more substantial influence than if it had been committed by an average blogger.

One such case is that of Rebecca Watson (who has arguably done more than anyone to highlight important issues related to feminism in the skeptical community) and her analysis of the ages of specific female movie stars and of the men playing their male love interests in a selection of their movies. The background to her blog post entitled Leading Women Age, Too is that an article posted on Vulture presented data suggesting that while male movie stars increase in age, the age of the women playing their female love interests stays roughly within the same range regardless of the age of the male actor (this is, as we shall see below, erroneous). Someone suggested doing a similar analysis for female movie stars who have been making movies for a long time. Since Watson is apparently “a party animal”, she “got totally crazy and spent like an hour on IMDB just to satisfy your curiosity”.

The basic idea was to compare the age of the female movie star (Watson picked Meg Ryan, Julia Roberts and Meryl Streep) in different movies with the age of the actor she believed to be playing the female character’s love interest, and to see if there is a difference before and after the actress turns 40. Here is the logic of her statistical analysis:

If you’re interested, Meg’s mean age in this chart is 35.6 and her costar’s mean age is 39.9 (a difference of 4.3 years). Prior to the age of 40, her mean age is 31.8 and her love interest’s mean age is 37.8 (a difference of 6 years).

See the problem? Watson apparently thinks she can just average the age of the female movie star across all movies, then compare it with the average age of the male love interest across all movies. However, this approach is only suitable for unpaired data, and it is highly statistically inappropriate to attempt it for paired data.

Simplified, unpaired data is obtained when there is no connection between a specific data point in one group and a specific data point in the other, e.g. one treatment group and one placebo group. Paired data, on the other hand, is something you get when such a connection exists, e.g. measuring blood pressure in a patient before and after treatment. Clearly, in the latter case, you do not average the blood pressure for every patient before treatment and compare it with the average for every patient after treatment. Rather, you first calculate the difference between before and after treatment for each individual, then average those differences.

Therefore, it is easy to understand that the more appropriate way to analyze the data would be to take the difference between the female star and the male love interest for each movie before and after the age of 40, then average them separately and compare.
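To make the paired calculation concrete, here is a minimal sketch in Python. The ages are invented (hypothetical) numbers purely to illustrate the arithmetic; they are not Watson’s actual IMDB data.

```python
# Each position i is one movie: the actress's age and her love
# interest's age in that same film (invented numbers).
actress_ages = [25, 29, 33, 41, 47]
costar_ages  = [33, 36, 37, 44, 46]

# The paired approach: difference per movie first, then the average
# of those per-movie differences.
per_movie_diffs = [c - a for a, c in zip(actress_ages, costar_ages)]
mean_diff = sum(per_movie_diffs) / len(per_movie_diffs)
print(per_movie_diffs, mean_diff)
```

The pairing matters because each difference is computed within a single movie; the spread of those per-movie differences, not the spread of the two raw columns, is what any subsequent test or confidence interval should be based on.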

I pointed this out in a comment on her blog post, where I also noted that the original claim in the Vulture post was false: there was often a moderate correlation between the age of the male movie star and the age of his love interest as the male movie star got older (here is a webcite in case those comments happen to go missing).

Now, a rational person would understand and accept these rebuttals and correct their errors. This is what Phil Plait did after he made the statistical errors I described in a previous post. A commenter by the name of Jack99 then made the following irrelevant objection:

Jack99 clearly misses the point: my objections were that Rebecca Watson treated the data as if it was unpaired when it was, in fact, paired, and that the original Vulture claim that the love interests of male movie stars do not age as the stars themselves do is wrong. Here is my response:

Here is where Rebecca Watson joins the conversation. Does she acknowledge her error and fix it? Not even close:

I then go on to ask why she does not fix it and here is her stunning reply:

Note how Watson attempts to deflect and sweep her errors under the rug by suggesting that my arguments are ragy and incomprehensible. It is also fascinating to see that Watson apparently thinks that it is better to leave statistical errors uncorrected because she would have to “spend several more hours redoing chart” for something that was only “a joke post” to begin with.

Not only does Rebecca Watson not understand basic statistics, she also does not have sufficient intellectual integrity to correct her errors. Again, compare this with Phil Plait, whom I took to task in a previous post. He corrected every single statistical error he made (and his were greater in both number and complexity) within 24 hours of having them pointed out to him.

Phil Plait has a high amount of intellectual integrity and honesty and cares about statistical accuracy. When it came to that issue, he was more or less at the summit of the skeptical ideal. In stark contrast, Rebecca Watson seems to prefer to chill out at the base camp.

Categories: Debunking Misuse of Statistics, Skepticism

welcome to Skepticism+

Chill out at Base Camp? More like at the bottom of the Marianas Trench.

Unfortunately, it doesn’t appear that you understand statistics much better than Watson does. There’s no “paired data” here and comparing means with no regard to variability is fairly meaningless whether dealing with difference scores or raw scores.

There are several appropriate tests which will answer the question, but neither of you is even close to identifying one of them.

Actually, the data is paired: age of female movie star is paired with the age of male love interest.

You also did not reply to any of my criticisms of Watson, such as the low sample size making a statistical significance test inappropriate.

Just to clarify: the paired context is between the age of the female movie star and the age of her male love interest, not for data points before and after 40. That is a separate issue.

Barbara is right that your suggested method would still not be the best way to find an answer to this question. I think the important thing to note here is that this bit of analysis by Watson was a “joke” – which is to say that it was not meant to convey any actual information but instead to just be a funny little pseudoscientific caricature. There’s nothing wrong with that; maybe you’d be better off saving your outrage until somebody references it as if it were serious evidence for anything.

I couldn’t agree less. Skepchick is not a parody site. It’s not a joke. If she says it is, she’s back-pedaling. She was looking for evidence to support her beliefs – the opposite of good science and skepticism. Calling it “humor” doesn’t make it okay.

Anti-intellectualism and skeptical activism do not make for good bedfellows.

I’m assuming you don’t read Skepchick regularly; there is often a joke “bad chart” post, and this one was marked as such, as always. The OP was roundly mocked for taking it seriously, since the whole point is to laugh at the eponymous bad chart. It is very funny that he wasn’t even right about the stats, but that’s just icing on the cake. If you think humour has no place on sceptical websites then tough luck; it’s everywhere.

I’m wondering what beliefs of hers she is pushing with her “bad chart”, since it’s hardly groundbreaking to point out that in our culture older men with younger women is much more accepted than the reverse. That’s out-and-out sexist and not even slightly controversial, I would have thought?

My understanding (not being a Skepchick reader and so having only a vague awareness of it) is that Bad Chart Thursday posts are intentionally “bad”. Personally, in her shoes I would put in the effort to actually make interesting and valid points*, but if she chooses not to then I can’t fault her for it.

* In fact, a nerdy commitment to technical accuracy even on occasions when people are ostensibly joking seems to be one of the marks of the scientific mindset – but then, she isn’t a scientist, so it would be weird to expect that from her.

I don’t know where the idea came from that I don’t think humor has a place on skeptical websites. I certainly didn’t imply that.

When one is claiming to promote science and skepticism, to do sloppy work, then laugh it off as humor, is nothing less than pure hypocrisy and anti-intellectualism. If most people can’t tell the difference between your jokes and your serious posts, then you’re doing it wrong.

Furthermore, Rebecca’s response to the criticism was not, “It was a joke”. So SHE isn’t the one trying to use that as an excuse.

The more plausible explanation for this is that she doesn’t know anything about how to conduct quality research and wrongly thinks that doesn’t matter; it isn’t worth her time.

I do not disagree with you in principle, but Rebecca did refer to it as a “joke post”. I agree with your assessment of her skills, however I thought having a “Bad Chart Thursday” was her way of embracing that and poking fun at herself.

I refuted the “joke” defense in the post, but I will repeat it for you: calling something a joke does not make the statistical error go away. Fixing it, on the other hand, does.

What if the statistical error is a (possibly unintentional) part of the joke? The little studies are supposed to be funny because they are “bad” (“lol”). If they were actually done properly it would just be “Chart Thursdays”.

Does it, in your opinion, make the statistical error not an error?

I thought this would never happen, but I finally found a roughly acceptable context to quote Howard Wolowitz: “I’m a horny engineer, Leonard. I never joke about math or sex”.

To call it an “error” implies that she was attempting to get it “right”. The study is not persuasive. It is a bad study. That makes it appropriate for what Watson was attempting to create. Apparently bad charts etc. are supposed to be funny – they don’t do it for me (except for the Batman comment) – but I’m not going to mistake this for something done in seriousness.

The truth or falsehood of something does not depend on the motives of the person putting forward the claim.

It does not matter if Watson tried to be funny, did not try to be serious, or did not attempt to get it right.

Well, in that case your case would be best made by doing it “right”. If you wanted to test the significance of the difference in age on an actress-by-actress basis you could do a simple sign test. It is a very conservative (possibly the most conservative?) statistical test, so if you find significance with it then you can be quite confident that any other statistical test (with more power) would also find an effect.
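For readers who want to try the sign test themselves, the exact p-value follows directly from the binomial distribution. The sketch below uses only the Python standard library and invented counts (9 of 10 movies with an older love interest), not the actual IMDB data:

```python
from math import comb

def sign_test_p(k, n):
    """Exact two-sided sign-test p-value: the probability, under
    Binomial(n, 0.5), of a split at least as lopsided as k out of n.
    Ties (equal ages) would be dropped from n before testing."""
    tail = min(k, n - k)
    p_one_sided = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * p_one_sided)

# Hypothetical: the love interest was older in 9 of 10 movies.
print(round(sign_test_p(9, 10), 3))  # -> 0.021
```

Since the null distribution is symmetric at p = 0.5, doubling the one-sided tail is exact here; `scipy.stats.binomtest(9, 10, 0.5)` should give the same value.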

Meg Ryan is highly significant under a sign test.

By which I of course mean “significant” (p = 0.021), oops!

Julia Roberts is highly significant (p < 0.001) and Meryl Streep is not significant (p = 0.824).

I beg to differ. Julia Roberts is not significant.

(THAT, if you couldn’t tell, was a joke.)

I am not confident calling that a joke until it has been subjected to a more rigorous analysis. =P

Again, you do not understand two key things here: (1) a statistical significance test is inappropriate at low sample sizes, and (2) statistical significance only means that the observed difference, or a more extreme difference, is unlikely to be obtained if the null hypothesis is true. This, however, says nothing about the practical significance of the result, i.e. whether the effect size should be considered large enough to be of practical importance.

“Significant” doesn’t mean what you think it means. I have written about this in many other posts on the blog, such as Paranormal Believers and Pareidolia? Not So Fast…, Quantity and the Biological Context, PZ Myers is not an Oblate Spheroid (p < 0.05) and others.

How many samples would you be happy with? For these actresses we are looking at the entire population that can be sampled from. Why do we care about effect size? Isn’t the question whether there is a consistent pairing of older with younger (or vice-versa)? How would YOU do the analysis to answer this question?

To clarify, all that the sign test did is show that the obvious effect present in those graphs (that actresses are paired with older actors) is statistically significant. If I wanted to actually address the question of whether there were two stages in their careers where in one there was a relationship between their age and the age of their on-screen partner and in the other the two were independent (which I believe is the point that Watson was trying to make) then I would probably design a nested model hypothesis test to check the significance of that.

You are confusing sample size with the number of samples. These are not the same thing. Typically, a sample size able to detect a moderate effect size, say d = 0.5, with a statistical power of 0.8 should suffice. At any rate, significance testing is deeply flawed the way you used it (as an algorithm for a dichotomous decision of practical significance), so not having a sufficient sample size is only one of the problems with your approach.
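As a rough sketch of that sample-size calculation, using the usual normal approximation for a one-sample (paired) design; exact t-based formulas give a slightly larger n:

```python
from math import ceil
from statistics import NormalDist

def approx_n(d=0.5, alpha=0.05, power=0.8):
    """Normal-approximation sample size for a one-sample (paired)
    test: n ~ ((z_(1-alpha/2) + z_(power)) / d)^2."""
    z = NormalDist()
    return ceil(((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / d) ** 2)

print(approx_n())  # -> 32 pairs for d = 0.5 at 80% power
```

Compare this with the handful of movies per actress in Watson’s charts to see how far short of adequate power the data falls.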

Because we want to know, roughly, how big the difference is in practice (negligible? small? moderate? large? huge? etc.). It is possible to obtain statistical significance for a negligible difference, and a large practical difference may appear statistically non-significant. So overly focusing on statistical significance leads to many statistical, scientific and social problems (which I and others have reviewed elsewhere).
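The first half of that point is easy to demonstrate by simulation. This sketch uses synthetic data in which the true group difference is a negligible 0.02 standard deviations, yet the huge sample drives the p-value below any conventional threshold:

```python
from math import sqrt
from random import gauss, seed
from statistics import NormalDist, mean, stdev

seed(0)
# Two huge synthetic samples whose true means differ by only 0.02 SD.
n = 200_000
a = [gauss(0.00, 1) for _ in range(n)]
b = [gauss(0.02, 1) for _ in range(n)]

diff = mean(b) - mean(a)  # tiny in practical terms
se = sqrt(stdev(a) ** 2 / n + stdev(b) ** 2 / n)
p = 2 * (1 - NormalDist().cdf(abs(diff) / se))
print(diff, p)  # a practically negligible difference, yet p < 0.05
```

The reverse failure (a large practical difference that is statistically non-significant) is just the same arithmetic run with a small n.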

I would calculate the difference between the star and the love interest for each movie, then average it across all movies for a given movie star. Then I would average the figure obtained from all movie stars of a given gender (preferably a substantial sample size), calculate 95% confidence intervals (parametric or non-parametric), then interpret the effect size difference between male and female movie stars and the corresponding 95% CI in the sociological context.
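A sketch of that procedure, with invented gap numbers and a normal-approximation interval standing in for the parametric or non-parametric 95% CIs mentioned above:

```python
from statistics import NormalDist, mean, stdev

# Hypothetical per-movie age gaps (love interest minus star), one list
# per movie star -- invented numbers, purely to illustrate the steps.
per_star_gaps = {
    "star_A": [8, 7, 4, 3, -1],
    "star_B": [10, 6, 9, 2, 5, 7],
    "star_C": [1, -2, 3, 0, 4],
}

# Step 1: average the paired differences within each star.
star_means = [mean(g) for g in per_star_gaps.values()]

# Step 2: average across stars and attach a normal-approximation 95%
# CI; with a real, substantial sample a t- or bootstrap-based CI
# would be preferable.
m = mean(star_means)
se = stdev(star_means) / len(star_means) ** 0.5
z = NormalDist().inv_cdf(0.975)
ci = (m - z * se, m + z * se)
print(m, ci)
```

Running the same procedure separately for male and female stars, then comparing the two effect sizes and intervals, completes the analysis described above.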

Which is almost completely irrelevant. The probability we are interested in is not P(at least as extreme data | null hypothesis), but rather P(null hypothesis | at least as extreme data), or better yet, effect sizes + 95% CIs + interpretation in the sociological context.

—

At any rate, this discussion is just a diversion from the point of the blog post I wrote, which was:

(1) Rebecca Watson made a statistically inappropriate analysis

and

(2) when I pointed this out, she not only failed to correct it, she deflected it with poor excuses (it was just a “joke post” and she did not want to “spend several more hours redoing charts”) and irrelevant personalizations (calling my criticisms “ragy” and implying that they were incomprehensible).

How can we talk about “practical” significance in the context of deciding whether there is a tendency to pair actresses with older or younger actors? I notice your proposed method would also tackle that question (which I approached using a sign test) rather than the actual main claim that Rebecca seems to be making (about the relationship between the two ages over the actress’s career).

I am not familiar with the distinction that you’re referring to between “sample size” and “number of samples”. The Wikipedia article on “sample size” seems to think it means “number of samples” also. I am also puzzled at how you are simultaneously concerned that the sample size is too small (lacking statistical power) and that the statistically significant effect that is being detected might be “practically” insignificant. I don’t think you get to worry about both! I can think of other problems with having a small sample size, but using a sign test gets around them.

Personally I only ever use significance tests as an adjunct to other ways in which I present data (usually graphically). I also don’t think that most other scientists place as much emphasis on p values as you are insinuating.

The practical significance in this context would be how big the age difference is, on average. Even if a difference exists, it could be negligible, small, moderate, large, huge, etc.

Only because that was the one you discussed. Instead of comparing female and male stars, you compare female stars before 40 and after 40. The method I outlined can easily be adapted to answer that question.

In this case, the distinction would be between the number of movies per movie star and the number of movie stars in the analysis. Depending on what question you ask, either the former is the sample size and the latter the number of samples, or the former is the number of dependent replicates and the latter the sample size.

Both are different manifestations of the same basic problem: p values are confounded by sample size.

Check out, e.g.:

Ioannidis, John P. A. (2005). Why Most Published Research Findings Are False. PLoS Med, 2(8), e124. doi: 10.1371/journal.pmed.0020124

“Practical” does imply you are going to DO something based on the effect size; that was my point. There is no practical consequence to the effect size in this case. I am familiar with the Ioannidis paper. I guess it might be true in some areas of science that the consensus conforms to the caricature that he puts forward in that paper:

“the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05”

The analysis that he presents in the paper is very interesting. However I think the reason why it did not instigate an overnight revolution is that many scientists were already aware (if only in a qualitative sense) of these problems. Personally, I don’t know anybody who maintains that stance.

No, practical significance (called clinical significance in medicine, biological significance in biology, sociological significance in sociology, etc.) is just a term for “the difference is large enough to be important in the relevant scientific context”. For instance, a difference in doubling time between two strains of a certain bacterium might be practically significant if it is 50% shorter for one of them, irrespective of whether you actually do something with those results in practice. It is just a matter of the effect size being large enough to be of relevance in the context.

Actually, the problem of low sample sizes and the flawed usage of statistical significance testing has been known for over half a century, yet people keep making the same mistakes. Here is one recent paper about it:

Button, Katherine S., Ioannidis, John P. A., Mokrysz, Claire, Nosek, Brian A., Flint, Jonathan, Robinson, Emma S. J., & Munafo, Marcus R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci, 14(5), 365-376. doi: 10.1038/nrn3475

You can find an overemphasis on statistical significance over effect sizes in many current statistics textbooks used in the life sciences, such as Dawn Hawkins’s “Biomeasurement: A student’s guide to biological statistics”.

So “within the context”, what kind of effect size are we looking for? =P

You may also enjoy:

Friston, K. (2012). Ten ironic rules for non-statistical reviewers. NeuroImage, 61, 1300–1310.

That is an excellent question! Since the relevant context is sociology, and I am not a sociologist, I cannot answer how large the effect size has to be to be deemed of practical significance in a sociological context.

Also, the “ironic rule” regarding sample size in the paper misses the point entirely. Here is the relevant section:

This is a severe mischaracterization for a number of reasons. The fictitious ignorant reviewer is operating within the context of null hypothesis significance testing (NHST) instead of effect sizes and confidence intervals, and confuses the problem of low statistical power in the context of statistical significance with the problem of low statistical power in the context of statistical non-significance (see below).

Here is the valid form of sample size criticism:

1. A low sample size means that the sample effect size is unlikely to approximate the population effect size. Thus, the results of a study using few subjects probably do not generalize to the population. If a study makes categorical inferences from the sample to the population without considering the problem of low sample size, then it is flawed.

2. Having a low sample size means that the statistical power is low. This leads to two problems, depending on if the results are statistically significant or statistically non-significant:

– If statistical non-significance obtains, it may be due to low statistical power rather than the two groups being equivalent. If the paper claims equivalence, then it is flawed.

– If statistical significance obtains, it may be a capitalization on chance (since the detection of statistical significance is so unlikely) and may thus overestimate the population effect size (since only those studies that happen by chance to find a sample effect size large enough to compensate for the low sample size obtain statistical significance and thus get published). If a paper does not report effect sizes with error bars and point out that the obtained effect size is most likely an overestimation, it can be considered flawed.
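This overestimation effect is easy to verify by simulation. The sketch below assumes a known true effect of d = 0.3 and a deliberately underpowered one-sample z-test (n = 10, sigma = 1 assumed known); among the runs that reach p < 0.05, the average estimated effect ends up far larger than the true 0.3:

```python
from random import gauss, seed
from statistics import NormalDist, mean

seed(1)
true_d, n = 0.3, 10
z_crit = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold

sig_effects = []
for _ in range(20_000):
    d_hat = mean(gauss(true_d, 1) for _ in range(n))  # estimated effect
    if abs(d_hat) * n ** 0.5 > z_crit:                # "significant" run
        sig_effects.append(abs(d_hat))

print(mean(sig_effects))  # noticeably larger than the true 0.3
```

Only the runs that got lucky with a large sample effect cross the significance threshold, so conditioning on significance selects for overestimates.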

In addition, there are a number of flaws with NHST in general: it says nothing about effect size or error bars, it is routinely misunderstood and abused and so on.

It’s not a joke if she is indicating that “No, it is not Bad Chart Thursday”.

FROM THE SKEPCHICK ARTICLE UNDER DISCUSSION:

“I included every movie in which the actress had a love interest that I could easily deduce if I hadn’t seen the film. Maybe I got some wrong. I don’t know, what does this look like? Good Chart Thursday? No. It’s Bad Chart Thursday. Deal with it.”

Jesus H. Christ on a painted pony. It was a jokey chart. Jokey!

I refuted the “it’s a joke” defense both in the post and in the comment section. Please read it before continuing to discuss this issue.

Furthermore, the “bad chart” was referring to her possibly not getting every movie star / love interest pair right, not to her believing that the statistical analysis was a joke.

I did read it.

“Furthermore, the “bad chart” was referring…”

“Bad chart” means “bad chart.” If she got a few love interests wrong, the numbers will also be wrong. This is an ongoing feature. IT ISN’T MEANT TO BE TAKEN SERIOUSLY.

Here is an earlier example of Bad Chart Thursday:

http://skepchick.org/2013/02/bad-chart-thursday-sucky-valentines/

Go nuts with that one.

If you can’t acknowledge you took a joke too seriously, maybe the one who needs to examine his intellectual integrity is you.

You claim to have read it, yet you continue to spout the same argument despite the fact that it was refuted in the original post and in the comment section? Why?

My criticisms were completely unrelated to this issue of whether she got some love interests wrong. That was a complete non-issue for me. This really goes to show that you either did not read my posts and comments or failed to understand them.

You still do not seem to understand: it does not matter if an inappropriate statistical analysis was made in the context of a joke, it is still an inappropriate statistical analysis.

Imagine if she had made an offensive comment about a sexual minority instead. Would you have defended her beyond all reason with the “it’s a joke!” argument? I doubt it.

“[Rebecca Watson] .. has arguably done more than anyone to highlight important issues related to feminism in the skeptical community.”

You shot your wad of credibility right at the get-go, good buddy.

When I write blog posts criticizing a specific individual, I often try to be as charitable as possible when portraying them in the initial paragraphs of the post. This is because it defuses some of the polarization innate in this kind of debunking approach. It is a way to soften the blow, so to speak.

However, just because I try to be charitable does not mean that I agree with the individual on every issue. In fact, that should be obvious given that the blog post is a critique of that individual.

If you claim to advance the cause of feminism, resorting to “jokey” posts containing incorrect methodology to get gender related conclusions is a shitty move.

That you also claim to be a proponent of rationality is the icing on the cake.