Javier Benitez points us to this horrifying story from Liliana Segura: “Junk Arson Science Sent Claude Garrett to Prison for Murder 25 Years Ago. Will Tennessee Release him?”

Well, since nobody else is going to comment: “A statistically significant price adjustment following a corrective disclosure is evidence the original misrepresentation did, in fact, affect the stock price.” So dear Gelman followers, what, exactly, is evidence?

The next sentence reads as follows: “The converse, however, is not true – the absence of a statistically significant price adjustment does not show the stock price was unaffected by the misrepresentation.” https://scholar.google.com/scholar_case?case=17598698939675929167

So statistical significance IS evidence but the absence of statistical significance is NOT evidence. Seems fair.

Please help my profession. Alms for the poorly M’Lord, alms for the poor!

I’ll give it a shot (disclaimer – I’m not a professional statistician but I’ve hung around enough that I think I can answer this). First I’ll say what evidence is, then I’ll address the statistical significance thing, which is a bit weird.

Statistical evidence is about how much better the data fits one hypothesis than the other. Note that this requires two distinct hypotheses – talking about evidence for or against one position without defining other positions that could be true generally isn’t possible (exception: if you can falsify a hypothesis with deterministic logic “If A then DEFINITELY not B, B, therefore definitely not A,” you don’t need statistics or alternative hypotheses). When I say how well the data fits a hypothesis, I mean the probability of observing that data if the hypothesis was true.

Also note an interesting thing about how statistical evidence works – it doesn’t directly set the probability of the hypothesis being true to a fixed amount. Rather, it changes it by a fixed amount. You have to have initial probabilities for each of the hypotheses you’re considering. Sometimes this is easy (e.g., one out of every fifty patients has disease x, therefore the prior probability of this random patient having the disease is 1/50), other times it involves some uncomfortable guessing. You can evaluate how much evidence there is without these “prior probabilities”, but you won’t be able to evaluate the final probability of a hypothesis being true. You’ll just know by how much it changed.

Actually, it’s pretty convenient to express both the evidence and the probability of the hypotheses in terms of ratios. Let’s say you have two hypotheses, and prior to collecting evidence, hypothesis A has a 2/3 probability of being true and hypothesis B has a 1/3 probability of being true. Since 2/3 is twice 1/3, the “odds” are 2:1 in favor of hypothesis A. Now let’s say you collect data, and the data you see has a 3/100 chance of being observed if hypothesis A is true and a 2/100 chance of being observed if hypothesis B is true, a ratio of 3:2. This ratio summarizes all the evidence and is called the “likelihood ratio”. Just multiply each side of the odds by the corresponding side of this ratio – 2 * 3 : 1 * 2 – to get the final set of odds, which will now be 6:2, or 3:1 in favor of hypothesis A. You can then convert the odds back to probabilities. If A and B are the only possibilities, their probabilities have to add up to one, so what we get is a 3/4 probability that hypothesis A is true and a 1/4 probability that hypothesis B is true (since the odds are 3:1 and 3/4 is 3 times 1/4).
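
For anyone who wants to check the arithmetic, here is a rough Python sketch of that odds update. The function name update_odds is just something I made up for illustration, and it assumes A and B are the only two possibilities.

```python
from fractions import Fraction

def update_odds(prior_a, prior_b, lik_a, lik_b):
    """Multiply the prior odds (A:B) by the likelihood ratio, then
    convert the resulting odds back to probabilities for A and B."""
    post_a = prior_a * lik_a
    post_b = prior_b * lik_b
    total = post_a + post_b  # normalizing only works if A and B are exhaustive
    return post_a / total, post_b / total

# Priors of 2/3 and 1/3; the data has probability 3/100 under A and 2/100 under B.
p_a, p_b = update_odds(Fraction(2, 3), Fraction(1, 3),
                       Fraction(3, 100), Fraction(2, 100))
print(p_a, p_b)  # 3/4 1/4
```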

As for significance testing, you are indeed correct that the above rule is far from evenhanded. Or logically sound. But I’m pretty sure I get why people think that. Firstly, statistical significance says, “Observations these extreme would be pretty rare if Hypothesis B was true.” It’s probably popular in part because it mimics the form of the deterministic falsification mentioned above: “If A then DEFINITELY not B, B, therefore definitely not A.” But this time it’s not really deterministic. “If A then PROBABLY not B, B, therefore probably not A.”

In deterministic logic, there’s a big difference between falsification and confirmation. Once you see an observation that couldn’t happen if a hypothesis were true, that hypothesis is finished; no matter how many additional observations you make that don’t themselves disconfirm the hypothesis, it’s still been disproven. On the other hand, seeing data that could (or even definitely would) be observed if a hypothesis were true doesn’t prove the hypothesis. After all, no one said that kind of evidence could ONLY be observed if the hypothesis in question were true, and you could still see evidence later that disproves your original hypothesis.

Once probability enters the picture, things change. Whether the evidence probably would happen if your hypothesis was true, or probably wouldn’t happen if your hypothesis was true, things can always turn around. Later observations can always overpower the evidence provided by your initial observations.

People who say “statistical significance IS evidence but the absence of statistical significance is NOT evidence,” are probably trying to apply the rules of deterministic logic to a situation that is governed by probability. They’re wrong – or at least they can be wrong. Statistical significance is not directly connected to evidence – where evidence is defined as something that makes it more or less probable that a given hypothesis is true. The evidence could go either way; you need more information than whether it was statistically significant or not. That said, statistical significance is probably correlated with evidence – but I didn’t set this explanation up in a way that easily lets me talk about how.

One more thing – while it doesn’t tap evidence directly, significance testing promises you’ll only falsely reject a hypothesis a certain percentage of the time (conditional on that hypothesis being true). But it doesn’t say anything about how often you’ll fail to reject if the hypothesis is indeed false and should be rejected. The probability of correctly rejecting a given hypothesis using significance testing depends on what the truth really is (if your hypothesis is that exactly 20% of people become tuba players, and the actual percentage is 21%, you’ll have a lower chance of correctly rejecting your initial hypothesis than if the real percentage had been 99%). This unknown error rate is the other thing people might be referring to when they say “statistical significance IS evidence but the absence of statistical significance is NOT evidence.” Again, I’d like to stress that evidence is about relative probabilities, not these conditional error rates, but it’s reasonably fair to be concerned about this issue.
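
To make the tuba example concrete, here is a small sketch of how the chance of correctly rejecting depends on the truth. The sample size of 100, the 5% level, and the choice of a one-sided test (reject only when too many tuba players show up) are all assumptions of mine, not anything from the case.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail(c, n, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(c, n + 1))

n, p0, alpha = 100, 0.20, 0.05  # assumed sample size, null value, and level

# Smallest cutoff whose false-rejection rate under the null is at most alpha.
cutoff = next(c for c in range(n + 2) if upper_tail(c, n, p0) <= alpha)

size = upper_tail(cutoff, n, p0)        # chance of falsely rejecting 20%
power_21 = upper_tail(cutoff, n, 0.21)  # chance of rejecting if the truth is 21%
power_99 = upper_tail(cutoff, n, 0.99)  # chance of rejecting if the truth is 99%
print(cutoff, size, power_21, power_99)
```

The false-rejection rate is pinned down by the test itself, while the two power figures differ enormously, which is the asymmetry described above.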

Many thanks. This will require a bit of digesting!

By the way, there are some rather annoying things about processing likelihoods that might come up from time to time. The three that come to mind are filtered evidence, nonindependent evidence, and hierarchical hypotheses.

In terms of filtered evidence, when the evidence is coming from a source that only tells you about observations when the observations turn out a certain way, you’re not dealing with the evidence contained in those observations anymore. You’re dealing with the evidence contained in hearing about those observations. As in the example I gave below when I thought my first comment had disappeared, imagine you’re trying to see if a coin is fair, or if it always comes up heads (I’m here ripping off an example from Eliezer Yudkowsky, pretty much). You can’t flip the coin yourself, but someone else does (an unknown number of times) and reports getting a few heads. Because observing heads is more likely if the coin always comes up heads, this would be evidence for it always coming up heads – if you had observed it yourself. But what if the rule of the other person was “never report tails, just flip the coin until you get heads a few times and report the heads?” Then it wouldn’t be evidence for anything, since it was always guaranteed they’d get at least some heads and you’d be hearing about them. What if the rule was “Flip the coin a hundred times, and if it’s always heads report specifically that it always turned up heads – otherwise, report getting a few heads?” Then you’d actually know for sure that it was a fair coin. This seems particularly pertinent in the legal system since each side clearly has a motive to only report evidence favorable to their side. It’s argued that since there is a defense and prosecution, every piece of evidence will be reported since someone has a motive to report it – but I’m not so sure that’s the case (I know the defense is supposed to get all relevant information the police turn up, but I’d bet that that doesn’t happen at least sometimes).
There may even be good reason for this, persuasion-wise; for what it’s worth, the old psychological research apparently suggested that you shouldn’t water a strong argument down with a bunch of weaker arguments that also support your point (this finding was before the replication movement, before preregistration, and also I’m going off vaguely remembered class material here – so take these findings with several grains of salt).
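
Here is a rough sketch of the three coin-reporting rules above, showing how the same report (“a few heads”) carries different evidence depending on how it was filtered. Taking “a few” to mean three is my own assumption, as are the labels for the rules.

```python
from fractions import Fraction

FEW = 3        # assumed meaning of "a few heads"
N_FLIPS = 100  # number of flips in the third reporting rule
half = Fraction(1, 2)

# Each entry: (P(what you hear | always-heads coin), P(what you hear | fair coin))
rules = {
    "watched it yourself": (Fraction(1), half ** FEW),
    "flip until a few heads, report only heads": (Fraction(1), Fraction(1)),
    "100 flips, heard 'a few heads'": (Fraction(0), 1 - half ** N_FLIPS),
}

# Likelihood ratio in favor of the always-heads coin under each rule.
ratios = {name: p_biased / p_fair for name, (p_biased, p_fair) in rules.items()}
for name, ratio in ratios.items():
    print(f"{name}: likelihood ratio (always-heads : fair) = {float(ratio)}")
```

A ratio of 1 means the report tells you nothing, and a ratio of 0 means the always-heads coin is ruled out entirely.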

Nonindependent evidence is most easily demonstrated via extreme cases. Imagine a murder took place in a bar, and a security camera showed the suspect entering the bar earlier that day. This is more likely if the suspect committed the murder than if the suspect did not commit the murder, so it’s evidence they committed the murder. Now imagine you learned that there was also another security camera in the area, and it too showed the murder suspect entering the bar at the time of the murder (remarkable). This might contribute a small amount of additional evidence if, say, the footage from the first camera was blurry and the second camera helped confirm that the guy on the film was really the suspect. But the fact of the suspect being there only provides evidence once, so the second camera isn’t going to contribute as much marginal evidence as the first. In some cases, additional evidence could be worthless because it’s basically just a copy of another piece of evidence. I think this is sort of about the two pieces of evidence being correlated, irrespective of whether the murder happens or not, but I’m not sure – but hopefully the previous example should be helpful even if you don’t understand what’s going on there better than I do.

As for hierarchical hypotheses, consider the following example. You have 53 decks of cards. You have one fair deck, and 52 decks which are each only one kind of card (e.g., one deck made of 52 copies of the King of Hearts, one with 52 copies of the Queen of Spades). You draw a random deck, and having done that, you want to test if it’s a normal deck or an unfair deck. To test this, from that deck, you draw a single random card (drawing two cards would just give it away entirely and ruin the game, because I have again chosen a scenario with 100% probabilities for the sake of simplicity – a bad habit perhaps). As it turns out, that card is the Queen of Hearts. Is this evidence that you’ve drawn one of the unfair decks? Some people think it is, because this is 52 times more likely if you’ve drawn the Queen-of-Hearts-only deck. They’re wrong though. To be clear, it is evidence in favor of the hypothesis that you’ve drawn the Queen-of-Hearts-only deck, which is one of the unfair decks – but it’s also evidence AGAINST the idea that you’ve drawn the King-of-Hearts-only deck, the Ace-of-Spades-only deck, etc. The hypothesis that you have an unfair deck consists of all these hypotheses as well. By the law of total probability, if you want the probability of drawing the Queen of Hearts given that you have an unfair deck, you have to take the probability of getting the Queen of Hearts given that you have the Queen-of-Hearts-only deck (100%), times the probability of having the Queen-of-Hearts-only deck given that you have one of the unfair decks (1/52), plus the probability of drawing the Queen of Hearts given that you have the King-of-Hearts-only deck (0%), times the probability that you have the King-of-Hearts-only deck given that you have one of the unfair decks (1/52), and so on. When you add this all up, the probability of getting the Queen of Hearts conditional on having an unfair deck is 1/52 – the same as if you had the fair deck.
So this outcome cannot provide evidence for either super-hypothesis (though the sub-hypothesis that you have the Queen-of-Hearts-only deck does indeed rise in probability, by taking all the probability away from the other formerly-possible unfair decks). (Incidentally, this example is slightly modified from one given by someone named Royall, who promoted likelihoods as an account of evidence, but he came to the answer that I’m saying is incorrect; basically, I think he’s unknowingly swapping what test he’s doing in the middle of his argument).

Note that the sub-hypotheses don’t have to have equal probabilities – if the unfair decks had consisted of two King-of-Hearts-only decks and one of each other kind, the overall probability of drawing the Queen of Hearts given that you had an unfair deck would be 1/53. Since the probability of drawing it in the fair deck would still be 1/52, drawing the Queen of Hearts would actually be evidence for the fair deck.
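
The deck calculation is easy to get lost in, so here is a sketch of it in Python covering both the equal-decks case and the two-Kings case. The function name and the way I represent the cards are my own choices for illustration.

```python
from fractions import Fraction

def p_card_given_unfair(card, deck_counts):
    """Law of total probability over the single-card decks.

    deck_counts maps each card to the number of unfair decks made only of that card.
    """
    total_decks = sum(deck_counts.values())
    # A single-card deck yields `card` with probability 1 if it is that card's deck, else 0.
    return sum(
        Fraction(count, total_decks) * (1 if deck_card == card else 0)
        for deck_card, count in deck_counts.items()
    )

ranks = "A 2 3 4 5 6 7 8 9 10 J Q K".split()
suits = "hearts spades diamonds clubs".split()
cards = [(rank, suit) for rank in ranks for suit in suits]
queen_of_hearts = ("Q", "hearts")

p_fair = Fraction(1, len(cards))  # 1/52 from the ordinary deck

# One unfair deck per card.
equal_decks = {card: 1 for card in cards}
p_equal = p_card_given_unfair(queen_of_hearts, equal_decks)

# Two King-of-Hearts-only decks and one of each other kind.
extra_king = dict(equal_decks)
extra_king[("K", "hearts")] = 2
p_extra = p_card_given_unfair(queen_of_hearts, extra_king)

print(p_equal / p_fair)  # likelihood ratio (unfair : fair) with equal decks
print(p_extra / p_fair)  # likelihood ratio (unfair : fair) with the extra King deck
```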

Also note that since drawing the Queen of Hearts changes the probabilities of the sub-hypotheses contained in the super-hypothesis “the deck is unfair,” you have to recompute the probability of other observations before you can process new evidence. In this case, given that you have drawn one Queen of Hearts, the probability of the next card being a Queen of Hearts given that the deck is unfair is 100%. I think this might in part be a consequence of sampling without replacement, but I suspect you’re going to find a similar phenomenon elsewhere too when there are super-hypotheses containing sub-hypotheses.

Hi Thanatos,

You should consider writing a book on the uses of Statistics in the law. I recall that Phil Dawid has written some on this. I may have implored you to write once before. I’ve tried to convince Sander Greenland to write on Law and Statistics. That to me would be fascinating.

You also linked to an article on the subject here, but I don’t recall when.

Sameera,

Mary W. Gray (who is both a statistician and a lawyer) regularly writes on statistics and the law for a column called The Odds of Justice in Chance Magazine. You might be able to find some of her articles on the web — or you might be interested in subscribing to Chance.

Martha:

Mary Gray was very nice to me, decades ago when I took a class at American University.

Mary Gray was also very nice to me, probably more decades ago, when the percentage of women in mathematics was epsilon and the small group of us met at math meetings.

Martha,

Thanks for the resources. I will look into them.

Actually in the past I’ve been on a conf. call with Dawid re: these issues. The problem is that the courts want to resolve disputes, and p-values (seem to) promise resolutions.

My current (fills up a yellow pad) thinking is that evidence cannot be divorced from a witness’ beliefs. In other words, evidence is an observation that unfolded within the context of what the observer believed about the world before the incident was witnessed. What would be the value of a witness’ testimony about the color of a traffic light as the cars entered the intersection if he didn’t know green from red and didn’t know how traffic regulation was supposed to work? It seems to me then that each observation is in fact its own little experiment in which the present is judged against the predictions made by our understanding of the past. If so, then rummaging about in data after the fact and summoning statistical ghosts that were never observed by those involved is not, and cannot be, evidence of anything.

I read some of Dawid’s work. I also read Nathan Schachtman’s blog, which contains a lot of interesting observations. Doing a retrospective of recent cases using statistics would be beneficial.

Thanatos:

Your yellow pad is interesting in that it’s tapping into some recent work in Stat Theory such as “The prior can generally only be understood in the context of the likelihood” https://arxiv.org/abs/1708.07487 , the work of Evans I referred to below, and likely Dawid.

To paraphrase John Tukey https://en.wikiquote.org/wiki/John_Tukey – The combination of advanced legal knowledge and an aching desire for a credible statistical theory of evidence that can be prescribed to ensure reasonable application widely in courts may not be extract-able from a given body of statisticians pro bono or possibly even with fees – presently.

OK – I always used too many words.

In this case, using historical daily returns constitutes a reasonable sample of a noisy process that has frequency properties that are verifiable in terms of large sample size… so rejecting a hypothesis test that the post-disclosure data comes from the pre-disclosure distribution is evidence that the post-disclosure distribution is different (not causal evidence that the disclosure caused the change though).

The absence of a statistically significant test is evidence that *either the distribution of returns did not change, or our test was not sensitive enough to detect the change*

Using logic here is what’s needed rather than any fancy stats. If you walk into your bedroom and find the walls are a different color than they were yesterday, then this is evidence that someone painted your bedroom in the last 24 hours. If you walk into the bedroom in the night with the lights out and you can’t see the color of the bedroom walls very well, but you don’t think they look dramatically different, this is not evidence that no-one painted your bedroom in the last 24 hours, it’s just evidence that you can’t tell whether someone painted your bedroom.

A critical thing to note here is that if you walk into the room with the lights on, and don’t see any difference, this is evidence that your room wasn’t painted because you should have seen a difference if there was one.
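
One way to put numbers on the lights-on versus lights-off point is to look at the likelihood ratio carried by a non-significant result. This is only a sketch: it treats “the room was painted” as a single specific alternative with a known power, and the power values below are invented for illustration.

```python
def lr_not_significant(power, alpha=0.05):
    """P(not significant | real change) / P(not significant | no change).

    Values near 1 mean the result barely moves you either way;
    values near 0 are strong evidence that nothing changed.
    """
    return (1 - power) / (1 - alpha)

lights_off = lr_not_significant(power=0.06)  # test can hardly detect a change
lights_on = lr_not_significant(power=0.99)   # test would almost surely detect it
print(lights_off, lights_on)
```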

A potentially important thing to think about, though, is sub-hypotheses. “My room wasn’t repainted” refers to one specific state of the world, but “My room was repainted” contains “My room was repainted in much more vibrant colors than before,” “My room was repainted using the left over paint from when we painted it the first time,” “My room was repainted the same color but with glossier paint,” and “My room was repainted with the unspeakable colors of the Silent King.” Walking into the room and failing to see a difference is almost certainly strong evidence against the general idea “My room was repainted,” but may give different levels of evidence for or against the different sub-hypotheses. It’s strong evidence against the idea that your room was repainted more vibrantly, weak evidence against the idea that it was repainted a glossier version of the same color – and the hypothesis that it was repainted using the same paint as the first time just might rise in probability (not because it’s stealing probability from the idea that your room wasn’t repainted, but because it’s stealing probability from the idea that it was repainted more vibrantly).

A notable example of this comes from parameter estimation. Let’s say conventional wisdom is that 50% of people will go to a theme park in a given year. You think this might not be true – and if it’s not, the figure is probably more like 10%. You collect data on 10 people (horrible sample size) and six of them have gone to theme parks. I’ve never performed continuous Bayesian estimation like this, so I don’t know for sure, but I think this makes it more likely than before that 50% is the real number – and also makes it more likely than before that 60% is the real number. You may legitimately count this as evidence that conventional wisdom is right, but if you keep collecting samples of 10 people, and time after time 6 of them have been to a theme park, eventually you’ll be looking at evidence that conventional wisdom is wrong and the real number is 60% (actually, you’ll be looking at evidence that your sampling procedure is somehow turning out nonrandom results, but disregarding that).
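
A crude way to check the theme park intuition is a grid approximation. Note the assumptions: I use a flat prior over rates of 0%, 10%, ..., 100% (not the 50%-or-maybe-10% prior described above), and the function name is made up for illustration.

```python
def grid_posterior(prior, grid, successes, failures):
    """Bayes' rule on a discrete grid of candidate rates."""
    weights = [pr * p**successes * (1 - p)**failures
               for pr, p in zip(prior, grid)]
    total = sum(weights)
    return [w / total for w in weights]

grid = [i / 10 for i in range(11)]   # candidate rates 0.0, 0.1, ..., 1.0
prior = [1 / len(grid)] * len(grid)  # assumed flat prior

one_sample = grid_posterior(prior, grid, 6, 4)         # 6 of 10 went
twenty_samples = grid_posterior(prior, grid, 120, 80)  # 6 of 10, twenty times over

for p, before, after_1, after_20 in zip(grid, prior, one_sample, twenty_samples):
    print(f"rate {p:.1f}: prior {before:.3f}, after 1 sample {after_1:.3f}, "
          f"after 20 samples {after_20:.3f}")
```

Under these assumptions, one sample raises both the 50% and the 60% entries above their prior values, while twenty identical samples push the 50% entry below where it started and concentrate nearly everything on 60%.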

I wrote up this whole elaborate reply last night, but it has yet to appear, so I don’t know if it has been glitched out of existence or if it’s still undergoing review to make sure it’s civil and whatnot.

In the meantime, a reply as to what evidence is: I’m assuming that by evidence you mean something that should logically make you more confident that a certain hypothesis is true (or false). Evidence in that sense would be about how likely it is you’d observe the things you’re observing if the hypothesis in question was true, relative to how probable it is that you’d observe that data if something else were true.

So for example, let’s say that you’re wondering whether a coin has a 50% probability of turning up heads, or a 100% probability of turning up heads. Let’s say these are the only two possibilities. You decide to flip the coin twice (which is horrible if you care about getting the right answer, but for the sake of the example it works). Both coin flips come up heads. The probability of this happening if the coin was fair is 1/4. The probability of this happening if the coin is unfair is 4/4. Since 4/4 is 4 times 1/4, this is 4:1 evidence in favor of the coin being unfair. This ratio is called the likelihood ratio and is generally considered to contain all the evidence in the sample (I’ve heard this disputed I think, but I’m not sure if the disputers are right – in any case this definition should work in most cases).

What do I mean by 4:1 evidence? Well, let’s say that fair coins and unfair coins were equally common. Thus, before collecting any evidence, the coin had a 50% probability of being fair and a 50% probability of being unfair. When using likelihood ratios it’s convenient to convert probabilities to odds: the ratio of one probability to another probability. In this case, that’s 1:1. The first one signifies the probability of the unfair coin, the second one signifies the probability of the fair coin. Anyhow, you multiply each side of the odds by the corresponding side of the likelihood ratio, and you get the new odds which take the evidence into account. That would be 4 * 1 : 1 * 1, or 4:1 in favor of the unfair coin. That translates to a 4/5 probability that the coin is unfair, and a 1/5 probability that the coin is fair.

Let’s say you flip the coin twice more, and get the same results. The likelihood ratio is the same, 4:1, so if you multiply this by the set of odds we calculated after the first round of evidence collection, now we get 16:1 odds in favor of the coin being unfair.

Note, by the way, that in this example there is a zero percent probability of observing tails if the coin is unfair, so if one tails turned up it would be completely proven that the coin is fair. This is an unusual situation. More realistic would be if the unfair coin had a 3/4 probability of turning up heads, as opposed to a 2/4 probability in the case of the fair coin. This means that a single head (not a pair of heads) would provide 3:2 evidence in favor of the coin being unfair. Conversely, a single tails would provide 1:2 evidence (that is, in favor of the coin being fair), since this has a 1/4 probability of happening if the coin is unfair and a 2/4 probability of happening if the coin is fair.
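
In case it helps, the coin numbers above can be reproduced with a short Python sketch. The function name is mine, and the last lines assume the 1:1 prior odds used earlier in this comment.

```python
from fractions import Fraction

def coin_lr(p_unfair, p_fair, heads, tails):
    """Likelihood ratio (unfair : fair) for a sequence with these counts."""
    lik_unfair = p_unfair**heads * (1 - p_unfair)**tails
    lik_fair = p_fair**heads * (1 - p_fair)**tails
    return lik_unfair / lik_fair

fair = Fraction(1, 2)

# Always-heads coin versus fair coin.
two_heads = coin_lr(Fraction(1), fair, heads=2, tails=0)  # 4, i.e. 4:1
four_heads = two_heads * two_heads                        # 16, i.e. 16:1
one_tails_always = coin_lr(Fraction(1), fair, heads=0, tails=1)  # 0: unfair coin ruled out

# More realistic unfair coin with a 3/4 chance of heads.
biased = Fraction(3, 4)
one_head = coin_lr(biased, fair, heads=1, tails=0)   # 3/2, i.e. 3:2
one_tails = coin_lr(biased, fair, heads=0, tails=1)  # 1/2, i.e. 1:2

# With 1:1 prior odds, posterior odds equal the likelihood ratio.
p_unfair_after_two = two_heads / (two_heads + 1)
print(two_heads, four_heads, one_head, one_tails, p_unfair_after_two)  # 4 16 3/2 1/2 4/5
```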

As for the statistical significance thing, you’re right: that rule is unfair, and contains some incorrect ideas. I’ll probably follow up on that in another comment.

Ah, and there goes the page refresh, showing me my original comment was posted all along.

“Death by Fire,” a PBS Frontline documentary available online (pbs.org/wgbh/frontline/film/death-by-fire/), covers the case of Cameron Todd Willingham, who was executed in Texas largely on the basis of similar junk science.

A few details that should be clarified. There is some confusion about how a legal proceeding works.

Burden of proof. In court, the plaintiff has to prove its case. Failure to prove your case means you lose (which is what the court has to decide). Failure to prove your case doesn’t mean the null is true, it just means you haven’t proved your case. So courts are not making the mistake of inferring from a lack of statistical significance that the null is true. The court is just saying you haven’t proved your case so you lose.

What is evidence? Technically, everything (including statistically insignificant results) that is put forward in court is evidence. “Evidence” just means stuff the court considers. So statistically insignificant results introduced in court actually are evidence in the legal sense. Yes, they aren’t evidence in favor of the null in the statistical sense, but that is not what evidence means in a court. “Evidence” just means stuff the court thinks is credible enough that it should be considered.

This is not quite true. Courts do not consider everything as evidence – at least when expert testimony is involved. There is a threshold concerning what is accepted as valid scientific evidence – see Judging Science: Scientific Knowledge and the Federal Courts, by Kenneth Foster and Peter Huber (1997) or the myriad articles written about the Daubert case. Expert testimony may or may not be admissible – and it depends on assessments of its conforming to “accepted” scientific practice. So, the issue of statistical significance can come into consideration of whether or not to consider evidence.

I am not suggesting that insignificant results are not considered evidence. The courts have not (to my knowledge) developed clear principles regarding what statistical evidence will or won’t be considered. I only want to point out that the issue is far from clear, and it is possible that insignificant results may not be admissible in particular cases or circumstances.

I agree. I was being simplistic. I was conflating “put forward in court” with “admitted into evidence by the court.”

Not everything is evidence:

Federal Rules of Evidence

Rule 401. Test for Relevant Evidence

“Evidence is relevant if:

(a) it has any tendency to make a fact more or less probable than it would be without the evidence; and

(b) the fact is of consequence in determining the action.”

Note the word “probable”. It’s why I keep coming back here.

Looks very much like Mike Evans’s definition of evidence: evidence for u if posterior(u)/prior(u) > 1, evidence against u if it is < 1, and no evidence either way if it equals 1.

A recent paper which does contain a review of the above and some of its history – https://arxiv.org/abs/1903.01696