## “Precise Answers to the Wrong Questions”

Our friend K? (not to be confused with X) seeks pre-feedback on this talk:

Can we get a mathematical framework for applying statistics that better facilitates communication with non-statisticians as well as helps statisticians avoid getting “precise answers to the wrong questions*”?

Applying statistics involves communicating with non-statisticians so that we grasp their applied problems and they understand how the methods we propose address our (incomplete) grasp of their problems. Statistical theory, on the other hand, involves communicating with oneself and other qualified statisticians about statistical models that embody theoretical abstractions, and one would be foolish to limit mathematical approaches in this task. However, as put in Kass, R. (2011), Statistical Inference: The Big Picture – “Statistical procedures are abstractly defined in terms of mathematics but are used, in conjunction with scientific models and methods, to explain observable phenomena. … When we use a statistical model to make a statistical inference [address applied problems] we implicitly assert … the theoretical world corresponds reasonably well to the real world.” Drawing on clever constructions by Francis Galton and insights into science and mathematical reasoning by C.S. Peirce, this talk will discuss an arguably mathematical framework (in Peirce’s sense of diagrammatic reasoning) that might be better.

*“An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question.” – John Tukey.

P.S. from Andrew: Here’s my article from 2011, Bayesian Statistical Pragmatism, a discussion of Rob Kass’s article on statistical pragmatism.

Key quote from my article:

In the Neyman–Pearson theory of inference, confidence and statistical significance are two sides of the same coin, with a confidence interval being the set of parameter values not rejected by a significance test. Unfortunately, this approach falls apart (or, at the very least, is extremely difficult) in problems with high-dimensional parameter spaces that are characteristic of my own applied work in social science and environmental health.

In a modern Bayesian approach, confidence intervals and hypothesis testing are both important but are not isomorphic [emphasis added]; they represent two different steps of inference. Confidence statements, or posterior intervals, are summaries of inference about parameters conditional on an assumed model. Hypothesis testing—or, more generally, model checking—is the process of comparing observed data to replications under the model if it were true.
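To make the Neyman–Pearson duality in the first quoted paragraph concrete, here is a minimal sketch that is not taken from the article. It assumes the simplest possible setting (a normal mean with known standard deviation and a two-sided z-test) and uses made-up data; all function names are hypothetical. It builds the interval by brute force, keeping every candidate value that the test does not reject, and compares the result with the textbook closed-form interval.

```python
import math
from statistics import NormalDist, mean


def z_test_p_value(data, mu0, sigma):
    """Two-sided p-value for H0: mean == mu0, assuming known sigma."""
    z = (mean(data) - mu0) / (sigma / math.sqrt(len(data)))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def interval_by_test_inversion(data, sigma, alpha=0.05,
                               lo=-10.0, hi=10.0, steps=20001):
    """Keep every grid value mu0 that the z-test does NOT reject at level alpha."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    kept = [mu0 for mu0 in grid if z_test_p_value(data, mu0, sigma) >= alpha]
    return min(kept), max(kept)


def closed_form_interval(data, sigma, alpha=0.05):
    """Textbook interval: sample mean +/- z_crit * sigma / sqrt(n)."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z_crit * sigma / math.sqrt(len(data))
    return mean(data) - half_width, mean(data) + half_width


data = [1.2, 0.4, 2.1, 0.9, 1.7, 0.3, 1.1, 1.5]  # made-up numbers
sigma = 1.0                                       # assumed known

inverted = interval_by_test_inversion(data, sigma)
exact = closed_form_interval(data, sigma)
print("by test inversion:", inverted)
print("closed form:      ", exact)
```

The two intervals agree up to the grid spacing. In one dimension the grid search is trivial; the difficulty the quote points to is that the same construction becomes very hard when the parameter space is high-dimensional.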

1. Dan Hicks says:

I’m a philosopher of science who works on the relationship between science and society and, especially, public scientific controversies (climate change, GM crops, vaccines, that kind of thing). In my work, I frequently encounter gaps between statistical practice (and scientific practice more generally) and the full scope of what the abstract calls “applied problems.” So, it’s great to see statisticians recognizing and taking up that problem. And even better, the talk doesn’t place all the blame on public ignorance or a lack of deference to scientists. (Folks in Science and Technology Studies call that “the deficit model,” and it’s something we’ve been criticizing for like 30 years.) If I were a bit closer to Ottawa, I would try to make it to the talk. As it is, I’d be interested in taking a look at a paper.

Since the abstract is more about the problem than the proposed solution, I don’t have much to offer in the way of concrete suggestions. However, I would recommend looking around for philosophers of science or science communications scholars as potential collaborators.

2. derek says:

All I can see is the abstract.

3. Dan Simpson says:

I would quite like to see the slides.

• Dan Simpson says:

I’m also mildly curious who K is. Without going the full “Human Stain”, is he a spy?

4. Rahul says:

To me a p-value exemplifies the notion of a “precise answer to the wrong question”.

Which applied practitioner ever naturally would ask that particular question which a p-value answers?

• Dale Lehman says:

Actually, economists do it almost all the time. I do not exaggerate. Testing whether a coefficient is significantly different than zero permits economists to make sweeping statements regarding policy (if you do x,…, then the costs of health care will do y… etc). Of course, they often follow it with a statement about the size of the effect. But the p value answers the question they are usually asking: e.g., will raising the minimum wage lead to more unemployment? Will raising copays lead to decreased health care usage? Will raising the tax on gasoline lead to decreased consumption? How much? etc etc etc

• The precise null is always a stupid question. I mean, even causes that go backwards in time have some kind of possibly extremely small theoretical plausibility under some non-crank modern physics (tachyons).

The fact is, if you whisper love poems into a telephone the total cost of health care in the US will change by more than 0.000000000000000000000000000001287%

• I believe this should be stated much more often! When teaching NHST (I was forced) I struggled to come up with a single good example where you would want to test a precise null. What I finally settled on was a telepathy example (can you predict what card the person in the other room is looking at), but that is really an extreme example… I can see that null testing could be an approximation to testing whether there is a “small effect”, but then it still seems like it would be easier to do estimation.

I really have a hard time imagining situations where one would want to do null testing rather than estimation, but I would be grateful for any good example (would make teaching easier)!

• Daniel Gotthardt says:

+1

• Christian Hennig says:

Although we hardly believe that any point-H0 is exactly true, with small sample sizes quite often effects are small enough that the H0 cannot be significantly rejected, meaning that the data are not good enough to tell apart what happened from the point-H0. This happens all the time, and the message is that although we should not believe that the H0 is literally true, effect estimation is rather pointless because you may estimate a positive effect which may in fact be negative etc.

What is interesting is not the point-H0 itself, but rather that the data are too weak to nail down anything that deviates from it. There are truckloads of examples of this.
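A minimal simulation sketch of this point, not from the commenter: it assumes normally distributed data with a small positive true effect (the values 0.1 and 1.0 are made up) and a two-sided z-test with known standard deviation. For a small and a larger sample size it counts how often the estimate has the wrong sign and how often the point null is rejected. The function name is hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def simulate_sign_errors(true_effect, sigma, n, n_sims=4000, alpha=0.05, seed=1):
    """Return (share of estimates with the wrong sign, share of rejections of H0: effect == 0)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = sigma / math.sqrt(n)
    wrong_sign = 0
    rejected = 0
    for _ in range(n_sims):
        sample = [rng.gauss(true_effect, sigma) for _ in range(n)]
        estimate = mean(sample)
        if estimate * true_effect < 0:   # estimate points the wrong way
            wrong_sign += 1
        if abs(estimate) / se > z_crit:  # two-sided z-test rejects the point null
            rejected += 1
    return wrong_sign / n_sims, rejected / n_sims


# Assumed illustration values: a small true effect relative to the noise.
small = simulate_sign_errors(true_effect=0.1, sigma=1.0, n=10)
large = simulate_sign_errors(true_effect=0.1, sigma=1.0, n=400)
print("n=10:  wrong-sign share, rejection share =", small)
print("n=400: wrong-sign share, rejection share =", large)
```

Under these assumed numbers, theory puts the wrong-sign share at a bit under 40% for n = 10 and at roughly 2% for n = 400. Failing to reject the point null in the small sample is a signal that the data cannot yet pin down even the direction of the effect.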

• James Annan says:

Rejection of the null is frequently the threshold at which a confidence interval has a single sign, i.e., it’s the point at which you can say something confident (in the frequentist world view) about the sign of the effect. So although testing the null is usually not in itself interesting, rejecting the null can be somewhat useful.

Not that I’m a fan of this sort of thing.

• Andrew says:

Rasmus:

You ask for a no-b.s. real-life example where hypothesis testing can make sense. We have an example in chapter 2 of ARM. A quick summary of the example is on page 70 of this paper.

In that example (looking at possible election fraud), a rejection of the null hypothesis would not imply fraud, not at all. But we do learn from the non-rejection of the null hyp; it tells us there’s no evidence for fraud in the particular data pattern that was shown to me.

Which fits perfectly into today’s post!

• Keith O'Rourke says:

Rasmus:

As Andrew, Christian and others pick up here and in http://statmodeling.stat.columbia.edu/2015/03/02/what-hypothesis-testing-is-all-about-hint-its-not-what-you-think/, if you are being purposeful, representations that are not literally true can serve you well.

• Jonathan (another one) says:

Let me second Dale’s opinion, and in so doing partly address K’s question. Often the question arises in litigation economics (my field) as to whether X affects Y at all. This question can be decomposed into two parts: first, where the effect of X on Y could be, on plausible theoretical grounds, of no effect at all, or of the wrong sign, do the data you have allow you to opine one way or the other? Second, is a measurably large effect large enough to exclude the underlying variability of the sample as the cause? For both of these, a p value is valuable evidence, though the 0.05 standard ought to be meaningless, though it isn’t. To address the underlying question, though, would it be valuable to couch this priestly calculation in more lay clothing? Unequivocally yes.

5. Keith O'Rourke says:

> valuable to couch this priestly calculation in more lay clothing?
Yup.

What I found humorous, and so I emailed Andrew, was that the announcement did not mention I was giving the talk (and it almost appears as if John Tukey is).

The slides will just be prompts to me about what to discuss.

I am trying to make sense of Rob Kass’s paper, Stigler’s paper http://onlinelibrary.wiley.com/doi/10.1111/j.1467-985X.2010.00643.x/abstract and Peirce’s “argument” http://en.wikisource.org/wiki/A_Neglected_Argument_for_the_Reality_of_God

Also, I noticed this that might be related – https://stat.duke.edu/events/15741

If we don’t wonder (better in public), we delay getting less wrong.

Thanks, Andrew.

• Christian Hennig says:

For the moment I’ll just say that I’m curious.
And a little bit skeptical. It may be that central problems of relating statistics to the “real world” are problems of relating mathematics to the real world, which cannot be solved by mathematics.
We can probably go further with mathematics than we currently do, but I think that we can’t go all the way.

• Keith O'Rourke says:

Christian:

> go further with mathematics than we currently do, but I think that we can’t go all the way.

I fully agree. What I am trying to do in the talk is get further than Rob Kass’ paper (which I really like) to enable statisticians and domain experts (those who have a better sense of the reality they want to represent and for what purpose) to jointly and purposely work together doing science. But (I think) they need a sense of what scientific inquiry is (not everyone will agree on my take on what that is), they will need some math (but most won’t get enough insight about analysis/algebra for that to work) and they will need to experience statistical reasoning comfortably and often (Bayesian, Frequentist, Robust, whatever).

We can’t step outside our representations of the world to see it as it really is, but we want to bend over backwards to see how representations are importantly wrong and that requires everyone involved to understand the representations and their role.

Andrew: From your commentary “Finally, I am surprised to see Kass write that scientists believe that the theoretical and real worlds are aligned.” I was taking this in the sense that they would be better aligned if inquiry continued productively, which is why we “(a) feel free to make assumptions without being paralyzed by fear of making mistakes, and (b) feel [compelled!] free to check the fit of our models”.

Others: I do appreciate the comments from folks here, once I have given the talk, I’ll hopefully have better sense of how to proceed.

• jrc says:

That Kass paper (which I just now read) gets at a feeling I’ve been trying to enunciate to myself. I think it is a step forward, but somehow there is this part of the thinking where I just keep coming up against a brick wall (in my thinking to myself). From Kass:

“we act *as if* the data *were* generated by random variables”

And I agree that that is what we do, but I don’t think there is a solid theoretical foundation there yet, either in terms of mathematics or metaphysics.

Mathematically, the closest I’ve seen is a moment in Emily Oster’s old version of her working paper on correcting selection bias where she just drops the error term in favor of omitted variables, but that is just as a motivation to think through a different, more classic problem (page 6 of the NBER working paper version, where Y = BX + W1 + W2 but there is no “error” term from a statistical distribution).

Metaphysically, I am reminded of X being reminded of Nietzsche: “It is perhaps just dawning on five or six minds that physics, too, is only an interpretation and exegesis of the world (to suit us, if I may say so!) and not a world-explanation;” Obviously, interpretations are really useful (like interpreting data as random variables), but they aren’t world-explanations, and I’m not sure that “as good as random” is much of a world-explanation either – “as good as random” being the name I give to the metaphysics of Kass’ statistical pragmatism in the context of causal inference from non-experimental econometric analyses.

• If you see the likelihood as coming from drawing random numbers from a random number generator, then yes this gets problematic. If, instead, you interpret the probability distribution you assign to the error term as a plausibility measurement (a la Cox’s Theorem foundations of Bayesian stats), then you don’t interpret the data as coming from that particular *frequency* distribution; instead you interpret your state of knowledge about where the data will fall as being measured by the probability distribution assigned.

This is a far more plausible interpretation in my opinion, and I’ve been banging on this stuff about Likelihoods since even before my post about “where do likelihoods come from”

ACK F***K someone defaced my blog.

• jrc says:

Daniel Lakeland – applied statistician, blog commenter, cyber gladiator.

6. Chris G says:

An excellent topic. (And I agree with Rahul.) Unfortunately, I have nothing to add. The Tukey quote is one of my favorites, as is the corollary –

“The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.” – John Tukey

• Rahul says:

I hadn’t heard this one before. Thanks. But it is something I can probably use every other day in my line of work. :)

• Andrew says:

Chris:

I agree 100%. People don’t want to accept uncertainty, but sometimes uncertainty is the way of the world.

That said, I could counter the Tukey quote with a quote of my own:

“Not making a decision is itself a decision.”

That is: even if a reasonable answer cannot be extracted, some decision often must be made. Not necessarily a decision of whether to believe a scientific theory, but maybe a decision of whether to pursue a line of research, or even some business or policy decision.

• Rahul says:

Andrew:

Is it useful / practical to tease apart some measure of uncertainty inherent to the data versus uncertainty because we are bad / lazy / incompetent modelers?

• jrc says:

Rahul,

By “inherent to the data” do you mean something like “inherent to the data generating process” (the difference being that the data is part of the “real” world and the DGP is part of the “theoretical” world, to use Kass’ vocabulary)? As in, you are thinking about a distinction between uncertainty that comes from “luck” (such as in a classical linear regression equation with an error term) and uncertainty that comes from imperfect modeling related to our imperfect understanding of the underlying data generating process (that is, models that don’t capture the real world)?

I’m of the general opinion that these are essentially the same thing. But I wasn’t sure I was understanding your question.

• Rahul says:

jrc:

Say I model tomorrow’s temperature. I could produce a model that had awfully wide intervals, e.g., tomorrow’s low temp. will be between -30 C and +45 C. Very high chance that I’m correct.

Now this uncertainty in my prediction, it is due to a crappy model (in my naive view) & not the stochastic nature of weather.

Is there a way to quantify this? I come across people that sell model uncertainty (due to a poor model) as reflecting the realistic uncertainty in the underlying generating process.

i.e. We don’t have a bad model. Just a more realistic one.
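One hedged way to put numbers on this distinction, not from the commenter: compare forecasters on coverage, average interval width, and the interval (Winkler) score, which rewards narrow intervals and penalizes misses. The sketch below uses an entirely made-up temperature process (a seasonal cycle plus normal noise) and two hypothetical forecasters: one that always says -30 C to +45 C, and one that knows the seasonal cycle and the noise level.

```python
import math
import random


def interval_score(lo, hi, y, alpha=0.05):
    """Interval (Winkler) score for a central (1 - alpha) interval; lower is better."""
    score = hi - lo
    if y < lo:
        score += (2.0 / alpha) * (lo - y)
    elif y > hi:
        score += (2.0 / alpha) * (y - hi)
    return score


def summarize(intervals, outcomes, alpha=0.05):
    """Return (coverage, mean width, mean interval score)."""
    n = len(outcomes)
    pairs = list(zip(intervals, outcomes))
    coverage = sum(lo <= y <= hi for (lo, hi), y in pairs) / n
    mean_width = sum(hi - lo for (lo, hi), _ in pairs) / n
    mean_score = sum(interval_score(lo, hi, y, alpha) for (lo, hi), y in pairs) / n
    return coverage, mean_width, mean_score


# Made-up temperature process: seasonal cycle plus normal noise (assumption).
rng = random.Random(0)
days = range(3650)
noise_sd = 3.0
seasonal = [10.0 + 12.0 * math.sin(2.0 * math.pi * d / 365.0) for d in days]
temps = [s + rng.gauss(0.0, noise_sd) for s in seasonal]

# Forecaster 1: the same very wide interval every day.
vague = [(-30.0, 45.0) for _ in days]
# Forecaster 2: knows the seasonal cycle and the noise level.
sharp = [(s - 1.96 * noise_sd, s + 1.96 * noise_sd) for s in seasonal]

vague_summary = summarize(vague, temps)
sharp_summary = summarize(sharp, temps)
print("vague (coverage, width, score):", vague_summary)
print("sharp (coverage, width, score):", sharp_summary)
```

Both forecasters are "usually correct", so coverage alone cannot separate them. The gap in width and score shows how much of the vague forecaster's uncertainty a better model removes. It does not show how much of what remains is irreducible, since a still better model might narrow the intervals further.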

• jrc says:

Rahul,

Yeah – I see what you are saying. Since I don’t really believe in an “underlying generating process” for most social behaviors (at least not one where, say, age, experience, gender, race, education and other easily observable characteristics of people are deep structural variables) I tend to think of error terms as almost always modeling error, since the structural model isn’t there in the real world to compare a statistical model to (structural models exist in the theoretical world). I don’t even know what an error term in the statistical sense can mean when applied to individual economic outcomes (“luck” I guess), and the idea that measurement error outweighs modeling error seems unlikely to me (sure, we might get your schooling off by a year, but that is nothing compared to us knowing almost nothing about you as *you*, that is, as an individual distinct from their socioeconomic observables).

That said, I believe deeply in trying to quantify uncertainty. I just don’t think uncertainty comes from underlying stochastic processes. Now – thinking about the world *as if* it had stochastic properties of that sort… well, I think that can be useful.

Another way of saying this: if someone says they are realistically reflecting uncertainty, I’d ask them what kind of uncertainty – uncertainty related to the real world, or uncertainty related to the analysis/comparison they are making. The former I don’t get in any statistical sense, the latter I think is important.

• Andrew says:

Rahul:

Yes, it is good to have a sense of what will be gained by more statistical modeling effort, as compared to simply getting better data.

This is an interesting issue (worth its own post, really), because a key property of the best statistical models (including hierarchical models, Bayesian inference, lasso, Anova, and various others) is that they allow users to fold more data into an analysis.

One thing we often say is that it’s better to have more data (and better data) than to have a better analysis. This is somewhat similar to the principle in chess programs that we’d generally rather have more tree depth than a better evaluation function (maybe that’s not really true with chess programs, but that’s my understanding).

Or, to put it another way, what makes a method work is that it allows the use of more and better information. This is an obvious selling point for Bayes but one could make the argument that it is the ultimate appeal behind any statistical method. Even something as seemingly limited and robotic as a t-test could be viewed, in a larger sense, as a way to incorporate more information in that researchers can do lots of little t-tests and thus fill the scientific literature with information that can later be integrated.

Or, maybe another way to look at it is that there’s a tradeoff in statistical methods between rigor (or, at least, perceived rigor) and the ability to include more information. Rigor has limited appeal to me because rigor is conditional on a model (for example, the “probability sampling” model so beloved of the buggy-whip crowd), but methods that include more information typically require more work, or more assumptions, or more tuning, or whatever.

So the tradeoff is there, especially given that, at the end of the day, we never have complete rigor nor are we ever truly using all relevant information. We’re always stopping short a bit in both dimensions because of diminishing returns.

As I said, worth its own post (or even an article, if I want to put in a lot more work to write something that a lot fewer people will read).

• Rahul says:

Andrew:

Well, I’d love to read it either way. So consider this a friendly nudge to write your blog post on it. :)

As an aside, what motivated me to think about this is your past post on predicting goal differentials of the Football World Cup.

http://statmodeling.stat.columbia.edu/2014/07/15/stan-world-cup-update/

I really thought the uncertainty bars on your predicted goal differentials were too wide for it to be a good model but I’ve no way of knowing how much of that uncertainty reflects the true uncertainty in a game’s outcome versus just a poor model. There was some good discussion in the comments there but I never got a really good answer.

That’s where I’m coming from.

• The game is always going to come down to some official score, with no uncertainty in that score. In what sense is there a “true uncertainty in a game’s outcome” when after the fact the score will be a fixed thing?

The only uncertainty is in our knowledge of what that fixed thing will be!

• Keith O'Rourke says:

All clouds are clocks – Laplace

All clocks are clouds – Peirce

I am not sure who was most wrong.

• Chris G says:

> “Not making a decision is itself a decision.”

Absolutely. I say that myself from time to time. Your lead sentence is spot on: People don’t want to accept uncertainty, but sometimes uncertainty is the way of the world. If you have insufficient information to make an informed decision and the decision will get made for you if you do nothing that can be an uncomfortable spot – non-technical major life decisions are the ones that come to mind for me. Make your best guess given the information available to you and hope for the best – and hope that if you make the wrong call that your error is recoverable. (Ask me about my career choices… Actually, no, don’t;-)

7. Christian Hennig says:

For me robustness theory is still central to what seems to be the key issue here, namely modelling what happens if the model isn’t completely true. Tukey started this off, followed by Huber, Hampel, and more recently Laurie Davies.

To me, this is a great advance, although it illustrates the limits of modelling as well. The neighbourhood models that are used in robustness theory are models themselves and come with assumptions (e.g., in many cases, i.i.d.), and one can always go one step further and ask, “what if the assumptions of robust models are violated, too?” In the robustness community people tend to talk about outliers as if there are “true outliers” in reality as reflected by the contamination neighbourhoods used in robustness theory distinguishing between “good observations” and “outliers”, but that’s a model, too. It may be more realistic than assuming having only “good observations”, but it doesn’t quite get “really real”.

Another thing:

> “the difference being that the data is part of the “real” world and the DGP is part of the “theoretical” world,”

We shouldn’t forget that in order to obtain measurements in many cases quite something is done to the unmeasured “real world”. When looking at data, we already look at a world that is shaped to quite some extent by theory-guided scientific activity. (I believe that the theory of measurement is something most statisticians should care much more about than they do.)