Regarding our recent post on the syllogism that ate science, someone points us to this article, “The CSI Effect: Popular Fiction About Forensic Science Affects Public Expectations About Real Forensic Science,” by N. J. Schweitzer and Michael J. Saks.

We’ll get to the CSI Effect in a bit, but first I want to share the passage from the article that my correspondent pointed out. It’s this bit from footnote 16:

Preliminary analyses (power analyses) suggested a sample size of 48 would be sufficient to detect the CSI effect, if it existed. In addition, the subsequent significance tests adjust for sample size by holding smaller samples to a higher standard when determining statistical significance.

In other words, finding that a difference is statistically significant is the same as saying that the sample size was of sufficient size to test for the effect.

Emphasis added. This is a great quote because it expresses this error so clearly. What to call it? The “retroactive precision fallacy”?

For a skeptical take on the CSI effect, see this article by Jason Chin and Larysa Workewych, which begins:

The CSI effect posits that exposure to television programs that portray forensic science (e.g., CSI: Crime Scene Investigation) can change the way jurors evaluate forensic evidence. We review (1) the theory behind the CSI effect; (2) the perception of the effect among legal actors; (3) the academic treatment of the effect; and, (4) how courts have dealt with the effect. We demonstrate that while legal actors do see the CSI effect as a serious issue, there is virtually no empirical evidence suggesting it is a real phenomenon. Moreover, many of the remedies employed by courts may do no more than introduce bias into juror decision-making or even trigger the CSI effect when it would not normally occur.

My correspondent writes:

Some people were worried that the sophisticated version of CSI that is portrayed on TV sets up an unrealistic image and so jurors (who watch the show) will be more critical of the actual evidence, which is much lower tech. There have been a handful of studies trying to demonstrate this and two did (including the one at issue).

I was pretty shocked at the poor level of rigour across the board – I think that’s what happens when legal scholars (the other study to show the effect was done by a judge) try to do empirical/behavioural work.

The truly sad thing is that many courts give “anti-CSI Effect” instructions based on these two studies that seem to show the effect. Those instructions do seem to be damaging to me – the judge tells the jury that the prosecution need not bring forensic evidence at all. The number of appeals and court time spent on this shoddy line of research is also a bit problematic.

So, two issues here. First, is the CSI effect “real” (that is, is this a large and persistent effect)? Second, the article on the CSI effect demonstrates a statistical fallacy, which is the view that, once a statistically significant result has been found, this retroactively removes all concerns about inferential uncertainty due to variation in the data.

This reminds me of a great quote from “The Insignificance of Statistical Significance Testing”, Johnson 1999, Journal of Wildlife Management:

“Certainly we knew before any data were collected that the null hypotheses being tested were false … The only question was whether or not the sample size was sufficient to detect the difference.”

Tables 1 and 2 are also relevant.

Great paper!

Andrew,

Would you not agree that there is some truth to the claim that if you find a stat. sig. comparison in a small sample this is evidence that sample size was sufficient to detect the effect? Indeed, this is true by definition, no? I understand that you think this is harmful thinking because of the way research such as this is typically conducted (e.g. garden of forking paths). But if we assume that the research was done exactly as somebody would like if they were setting out to test a specific hypothesis (i.e. preregistered analysis) then I don’t really see the issue with this claim. As long as there is not selective reporting or some form of p-hacking, then I don’t see why it’s incorrect to say that statistical significance *means more* in small samples. Of course, all else equal, I’d prefer to have an estimate from a larger sample than a smaller one, as it conveys more information.

Matt:

No, it’s not true, by definition or otherwise, that if you find a statistically significant comparison in a small sample that this is evidence that the sample size was sufficient to detect the effect. Not at all.

Here’s an example, one that I’ve used before: Suppose someone looks at data from a survey of 3000 people and estimates the difference in proportion of girl births, comparing children of beautiful to non-beautiful parents, and he finds a difference of 8 percentage points with a standard error of 3 percentage points, which is statistically significant at the conventional level. In real life, though, any population difference in these proportions cannot realistically be larger than 0.1 percentage points or so. N=3000 is simply not enough data to learn anything useful here. It’s the kangaroo problem. In this example, the sample size was not sufficient to detect the underlying comparison of interest, and that’s the case whether or not the particular sample happens to be statistically significant.
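One way to make this concrete is a design calculation using the numbers from the example. The sketch below is an illustration under stated assumptions, not a calculation taken from the post: it treats the estimate as normally distributed around a true difference of 0.1 percentage points with a standard error of 3 percentage points, and the function name `design_analysis` is hypothetical.

```python
from statistics import NormalDist

def design_analysis(true_effect, se, alpha=0.05):
    """Return (power, wrong-sign rate, exaggeration ratio) for a positive true effect.

    Assumes the estimate is Normal(true_effect, se) and that "significant"
    means |estimate| > z_crit * se. All names here are illustrative.
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2) * se   # significance threshold for the estimate
    a = (crit - true_effect) / se          # standardized upper cutoff
    b = (-crit - true_effect) / se         # standardized lower cutoff
    p_hi = 1 - z.cdf(a)                    # P(significant and positive)
    p_lo = z.cdf(b)                        # P(significant and negative)
    power = p_hi + p_lo
    wrong_sign = p_lo / power              # share of significant results with the wrong sign
    # Expected |estimate| among significant results, from truncated-normal means.
    mean_abs = (true_effect * (p_hi - p_lo) + se * (z.pdf(a) + z.pdf(b))) / power
    return power, wrong_sign, mean_abs / true_effect

if __name__ == "__main__":
    power, wrong_sign, exaggeration = design_analysis(true_effect=0.1, se=3.0)
    print(f"power: {power:.3f}")
    print(f"wrong-sign rate given significance: {wrong_sign:.2f}")
    print(f"exaggeration ratio given significance: {exaggeration:.0f}")
```

Under these assumptions the power is barely above the 5% false positive rate, a significant estimate has the wrong sign nearly half the time, and its magnitude overstates the assumed true difference by a large factor. That holds whether or not a particular sample happens to cross the significance threshold.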

I think of it like this: assuming that you have drawn a perfectly random sample from the population, you have not engaged in any “forking paths”, and the effect size is truly exactly equal to zero, you will still find “significant” effects some proportion of the time (equal to your alpha level). Therefore, when you find a significant effect, it is impossible to know that you have even estimated a true effect, let alone that the sample size was large enough to detect it.
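A small simulation can illustrate the point. This is a sketch under assumptions that are not in the original comment: two groups of 24 observations each (an arbitrary choice), standard normal data with a true difference of exactly zero, and a two-sided z-test with known unit variance. The function name `false_positive_rate` is hypothetical.

```python
import random
from statistics import NormalDist, fmean

def false_positive_rate(n_per_group=24, n_sims=10_000, alpha=0.05, seed=1):
    """Fraction of simulated null studies that come out 'significant'."""
    rng = random.Random(seed)                    # fixed seed for reproducibility
    crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical z value
    se = (2 / n_per_group) ** 0.5                # SE of a difference in means, unit variance
    hits = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution: the true effect is zero.
        a = fmean(rng.gauss(0, 1) for _ in range(n_per_group))
        b = fmean(rng.gauss(0, 1) for _ in range(n_per_group))
        if abs(a - b) / se > crit:
            hits += 1
    return hits / n_sims

if __name__ == "__main__":
    print(f"share of null studies that are 'significant': {false_positive_rate():.3f}")
```

The share should land close to alpha, up to simulation noise, even though there is nothing to detect in any of the simulated studies.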

True, but generally just another instance of the alchemy of getting certainty.

We all should realise that empirical studies, no matter how well done and analysed, will mislead us a certain percentage of the time. We will never know when (at least given a single study regarding something really unknown, such as the fairness of an ordinary coin flip) nor how often. As Mosteller and Tukey once put it, with a single study you simply cannot assess the real uncertainty. It is beyond observation. With multiple studies there is some real access to it (the how often wrong _should_ be less) but it’s only a single set of studies.

Being certain you are not wrong about the sample size being adequate is not possible. All you can do is form your best judgement and make a bet. As Oliver Wendell Holmes put it, we can never be more than a bettabilitarian.

I meant to argue that the bet is simply about whether another study will be informative enough to be worthwhile doing.

I think Andrew’s example is OK but seems to me to be a bit confusing, what with the distractions of ‘beautiful’ parents and so on.

How about this. I flip an ordinary coin 1000 times and I get heads 540 times, vs only 460 tails, obviously. The difference is easily ‘statistically significant by standard measures’: the p-value is about 0.006 for a one-tailed test, or a little over 0.01 for a two-tailed test. Taken on its own, this is very strong evidence that the coin is biased towards heads.

But I’ve already told you it’s an ordinary coin. As Andrew has discussed somewhere, an ordinary coin can be a little teeny bit biased, but there’s no way to make one that will give you 54/46. I don’t even think you could do 51/49 without doing something very non-ordinary like an extremely beveled edge that will favor heads when the coin bounces, and maybe not even then.

So haven’t I just contradicted myself? I’ve got an experimental result that the coin is strongly biased and the result is statistically significant at roughly the 1% level, but I’ve just told you that it is pretty much impossible for this result to be real. Is this a completely artificial example? The answer is no. A result this extreme will happen about 1% of the time, and 1% is not zero. If you flip an ordinary-seeming coin 1000 times and get 540 heads, you have _not_ learned that your coin is strongly biased towards heads: 1000 flips isn’t nearly enough to quantify the bias of an ordinary-seeming coin.
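The arithmetic in the coin example can be checked with an exact binomial tail. This is a minimal sketch, assuming a perfectly fair coin as the null; the function name `binom_upper_tail` is hypothetical.

```python
from math import comb, sqrt

def binom_upper_tail(k, n):
    """Exact P(X >= k) for X ~ Binomial(n, 1/2), using integer arithmetic."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

n, heads = 1000, 540
one_sided = binom_upper_tail(heads, n)   # chance of 540 or more heads from a fair coin
two_sided = 2 * one_sided                # the fair-coin distribution is symmetric
se = sqrt(0.25 / n)                      # standard error of the observed proportion

if __name__ == "__main__":
    print(f"one-sided p-value: {one_sided:.4f}")
    print(f"two-sided p-value: {two_sided:.4f}")
    print(f"standard error of the proportion: {se:.4f}")
```

The standard error of the proportion at 1000 flips is about 1.6 percentage points, which is far larger than any bias a plausibly ordinary coin could have. That is why a significant result here says little about the size of the true bias.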

I now have to link to this terrific Persi Diaconis et al. paper https://statweb.stanford.edu/~susan/papers/headswithJ.pdf

“Vanishingly small”. Who knew you could calculate the probabilities of conditionals from sample size? https://cosmosmagazine.com/biology/huge-numbers-of-deformities-found-in-ancient-human-remains

“retroactive precision fallacy” fits perfectly for this howler! Yet another unsuccessful attempt to violate David Freedman’s Law of the Conservation of Rabbits.

Freedman’s Rabbit Axioms

1. For the number of rabbits in a closed system to increase the system must contain at least 2 rabbits.

2. You cannot pull a rabbit from a hat unless at least one rabbit has previously been placed in the hat.

3. Corollary: You cannot “borrow” a rabbit from an empty hat, even with a binding promise to return the rabbit later. NO NEGATIVE RABBITS