How to think about the statistical evidence when the statistical evidence can’t be conclusive?

There’s a paradigm in applied statistics that goes something like this:

1. There is a scientific or policy question of some theoretical or practical importance.

2. Researchers gather data on relevant outcomes and perform a statistical analysis, ideally leading to a clear conclusion (p less than 0.05, or a strong posterior distribution, or good predictive performance, or high reliability and validity, whatever).

3. This conclusion informs policy.

This paradigm has room for positive findings (for example, that a new program is statistically significantly better, or statistically significantly worse than what came before) or negative findings (data are inconclusive, further study is needed), even if negative findings seem less likely to make their way into the textbooks.

But what happens when step 2 simply isn’t possible? This came up a few years ago (nearly 10 years ago, now!) with the excellent paper by Donohue and Wolfers, which explained why it’s just about impossible to use aggregate crime statistics to estimate the deterrent effect of the death penalty. But punishment policies still need to be set; we as a society just need to set these policies without the kind of clear evidence that one might like.

Another example, where the aggregate statistical evidence is even weaker (and, again, with no real prospect of improvement) was pointed out to me by sociologist Philip Cohen, who wrote:

In a (paywalled) article in the journal Family Relations, Alan Hawkins, Paul Amato, and Andrea Kinghorn, attempt to show that \$600 million in marriage promotion money (taken from the welfare program!) has had beneficial effects at the population level. . . .

Cohen noticed a bunch of statistical problems with the published paper (see this recent entry on the sister blog for links and further discussion), but really the problem is much deeper than the flaws of one particular paper. It’s just going to be nearly impossible to learn much about the effects of such a program from aggregate state-level statistics (Cohen says that the paper looks at: percentage of the population that is married, divorced, children living with two parents, one parent, nonmarital births, poverty and near-poverty). There’s just no way.

That’s fine—I’m not saying that a new program shouldn’t be implemented or expanded, just because of lack of evidence. My point is that in cases such as this I think we need to discard the paradigm of steps 1, 2, 3 above. It could be possible to study effects via a more targeted analysis but I don’t think the aggregate thing tells us much of anything at all. But I think it can be difficult to talk about because of the pressure to demonstrate that a program “works.”
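To give a rough sense of why a comparison across a few dozen states struggles to detect a small program effect, here is a minimal power sketch. Every number in it is made up for illustration (the effect size, the between-state standard deviation, and the even 25/25 split are assumptions, not figures from the paper), and it uses a simple normal approximation rather than anything resembling the published analysis.

```python
from statistics import NormalDist


def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided comparison of two group means."""
    z = NormalDist()
    # Standard error of the difference between two group means.
    se = sd * (2.0 / n_per_group) ** 0.5
    crit = z.inv_cdf(1 - alpha / 2)
    shift = effect / se
    # Probability of landing beyond either critical value.
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)


# Hypothetical inputs: outcomes in percentage points, 25 states per group.
SD_BETWEEN_STATES = 3.0   # assumed spread of the outcome across states
SMALL_EFFECT = 0.2        # assumed true effect of a modest program
LARGE_EFFECT = 3.0        # an implausibly large effect, for contrast

power_small = approx_power(SMALL_EFFECT, SD_BETWEEN_STATES, n_per_group=25)
power_large = approx_power(LARGE_EFFECT, SD_BETWEEN_STATES, n_per_group=25)

print(f"power for small effect: {power_small:.3f}")
print(f"power for large effect: {power_large:.3f}")
```

With these made-up inputs, the power for the small effect is barely above the 5% false-positive rate, while only an effect as large as the whole between-state spread would be reliably detected. Real state-level data add confounding and trends on top of this, which makes matters worse, not better.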

1. Chip Lynch says:

This is pretty much why politics exist, IMO; to resolve these sorts of things at a community level. Even if we have strong scientific evidence of a cause/effect relationship (particularly if the correlation vs. causation issue is totally laid to rest), there are qualitative considerations to be made. Take the crime example… even if the death penalty did not reduce the number of crimes, or even if it was shown to increase it, there would still be people that think that misses the point… that an eye-for-an-eye is the correct punitive path even if it doesn’t deter others. People against welfare will be against it even if you could completely show that a \$600 million welfare program returned twice that in benefits; it’s the principle of taking money from one person and giving it to someone else that chafes them, not how efficiently or effectively the money is used.

Strong evidence is helpful to inform decisions, but there will always be politics and points of view with differing goals and perspectives that the statistics will never fully account for.

• True enough, but I think we should make one thing plain: it pretty much ALWAYS and for EVERYONE *OUGHT* to be true that a policy which is either ineffective at achieving its goals, or even counterproductive for those goals, should be eliminated, regardless of what our personal feelings about the policy are. Unfortunately, this is manifestly not the case. In particular, people often support policies intended to help people of some type because they think “helping” is the right thing to do. They will support these policies even if there is evidence that on average it actually hurts those people. In many cases this is because it doesn’t hurt people uniformly; there are some cases where some subset of people are substantially helped. The fallacy of the one-tailed result… or something like that.

• Chip Lynch says:

The problem lies in “regardless of what our personal feelings about the policy are”, because that idea conflicts with the idea of “achieving its goals”; most policy goals are set by people’s personal feelings. I just don’t think there’s a real example of a real policy that behaves the way that’s described here. How many people agree with a policy because of unstated or secondary goals? The support for a policy does not always follow the stated goals of a policy’s originators or leaders, so it’s often difficult (possibly always impossible) to compare quantifiable goals to a single policy that could, by itself, be implemented or reversed.

As I said, I certainly prefer a more informed policy making style, I’m just saying that “ALWAYS” and “EVERYONE” are too black and white for statistics, let alone culturally influenced policy. I gave counterexamples to the crime and welfare examples that were posited… I’m open to other examples; I’d really like to see one, actually, where there’s agreement that the policy and the goal were inextricably linked in the first place.

Really, the population itself is a statistical problem. It’s sort of Bayesian in a way: a policy can be well informed by a statistical correlation with a goal only insofar as the goal itself is represented by a prior distribution of beliefs over those affected by the policy. (OK, maybe that’s a stretch). :-)

2. K? O'Rourke says:

But that’s usually the case in epidemiological studies and also in the meta-analysis of published RCTs (just using selected and recast information that is made public). Step 2 is unusual except for predictive performance or where replication efforts are economical and fast.

So excellent question.

3. Anonymous says:

Isn’t the question another way of asking: How to interpret statistical evidence when the quantity of interest is not even identified?

4. Thomas Ball says:

Great question re statistics and policy. Here’s a link to a report from the British House of Lords that is quite comprehensive in its evaluation of the various tools government can employ in effecting behavioral change at the population level. They consider many different programs, beginning with the ban on smoking in public places, with deep consideration given to policies intended to reduce obesity and regulate the use of cars. The key point is that all policy decisions should be evidence-based, using the best available research as guidance, and they emphasize that conflicting research results are not an excuse for doing nothing…

http://www.publications.parliament.uk/pa/ld201012/ldselect/ldsctech/179/179.pdf

And here’s another study that looks at conflicts over public policy and its relation to numeracy and political party…

http://www.cogsci.bme.hu/~ktkuser/KURZUSOK/BMETE47MC15/2013_2014_1/kahanEtAl2013.pdf

5. Eli Rabett says:

Which is why statistics without mechanism is wool gathering.

If you will, statistics gathers observations into summary form, but most often cannot be inverted to mechanism, which means that you have to start from basic principles from which you can generate models of the statistical summaries that can be compared with the observational data.

The situation strongly resembles that in chemistry when Eli, as a young bunny, entered the field, forty years ago. At the time the hope was that observations could be inverted to yield a quantum level description of a molecule and/or reaction.

This proved impossible.

What has been very successful is to build first-principles descriptions which can then be run forward to model observations and then compare the model with the summarized data. Inevitably, global quantum models are lacking, or computational power is lacking, although both have improved over the years, so we have some quantum chemistry models that do well on small things (molecules), other things like molecular modelling that are less quantum in nature that do well on bigger things (biomolecules), and, of course, some fiddling in the middle and on the edges.

Which is how science lurches forward.

6. I think you may be too generous to the program.

From the posturing of the marriage promotion movement – from the very first sentence of the 1996 welfare reform law (“(1) Marriage is the foundation of a successful society,” Public Law 104-193, 1996) – it’s clear their intent is to achieve population-level results. The program might increase marriage a tiny bit here or there, for some people, in hard-to-discern ways, but that’s not its purpose. If they don’t succeed in turning around the decades-long downward trend in marriage on a societal level, they have failed.

Now, whether that means they need more money or should be cut off is a different question, and I agree this research doesn’t answer that.

My question is how you know when you’re looking at a case where demonstrating effectiveness is impossible versus a case where the policy is just failing.

7. To point out the obvious – the statistician should say that the data are not conclusive, and suggest ways in which more data could be found, including changes to policy which could increase the data available, such as experiments. To copy a chunk from http://www.pc.gov.au/__data/assets/pdf_file/0018/96210/05-chapter4.pdf:

Our discussion emphasises that policymakers largely determine the quality of the evidence that they have available for making policy decisions. They exercise this control in a variety of ways, both direct and, more importantly, indirect. Good evidence depends on much more than just a demanding client, a topnotch evaluator and an adequate budget when commissioning an evaluation. It also depends on broader decisions about program design and implementation prior to evaluation, in the design and funding of general social science data sets, on the quality of administrative data systems, on peer review and on institutions that encourage the development of informed evaluation consumers within government.
(end quote)

There is much enthusiasm about “Big Data”. I note that real-life use of this often involves experiments such as A/B testing. If randomization were politically unacceptable, perhaps policies could be designed to make analyses such as regression discontinuity or other quasi-experimental studies practical and powerful.
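For readers who have not met regression discontinuity, here is a minimal sketch of the idea on simulated data, assuming a policy that assigns treatment strictly by whether a score crosses a cutoff. The data-generating numbers, the bandwidth, and the function names `fit_line` and `rd_estimate` are all hypothetical choices for illustration; a real analysis would need bandwidth selection, standard errors, and checks for manipulation of the score.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rd_estimate(xs, ys, cutoff=0.0, bandwidth=0.5):
    """Sharp RD: difference between the two local linear fits at the cutoff."""
    # Center the score at the cutoff so each intercept is the fit at the cutoff.
    left = [(x - cutoff, y) for x, y in zip(xs, ys)
            if cutoff - bandwidth <= x < cutoff]
    right = [(x - cutoff, y) for x, y in zip(xs, ys)
             if cutoff <= x <= cutoff + bandwidth]
    intercept_left, _ = fit_line(*zip(*left))
    intercept_right, _ = fit_line(*zip(*right))
    return intercept_right - intercept_left


# Simulated (not real) data: units with score >= 0 get the program.
rng = random.Random(0)
TRUE_JUMP = 2.0  # assumed treatment effect built into the simulation
scores = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
outcomes = [1.0 + 0.5 * s + (TRUE_JUMP if s >= 0 else 0.0) + rng.gauss(0.0, 1.0)
            for s in scores]

estimate = rd_estimate(scores, outcomes)
print(f"estimated jump at the cutoff: {estimate:.2f}")
```

The design point is that the eligibility rule itself creates the comparison: units just below and just above the cutoff should be similar apart from treatment, so a policy written with a clear threshold can be evaluated later even without randomization.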

8. […] policy. But what happens when the statistical evidence is inconclusive? Blogger Andrew Gelman addressed the subject informally last week. Good gut check for people who tend to think that every complex system can be reduced to […]