It goes like this: there’s something you want to estimate and you have some data. Maybe, to take my favorite recent example, you want to break down support for school vouchers by religion, ethnicity, income, and state (or maybe you’d like to break it down even further, but you have to start somewhere).
Or maybe you want to estimate the difference between how rich and poor people vote, by state, over several decades—but you’re lazy and all you want to work with are the National Election Studies, which only have a couple thousand respondents, at most, in any year, and don’t even cover all the states.
Or maybe you want to estimate the concentration of cat allergen in a bunch of dust samples, while simultaneously estimating the calibration curve needed to get numerical estimates, all in the presence of contamination that screws up your calibration.
Or maybe you want to identify the places in the United States where it’s cost-effective to test your house for radon gas—and the data you have across the country are 80,000 noisy measurements, 5,000 accurate measurements, and some survey data and geological information.
Or maybe you want to understand how perchloroethylene is absorbed in the body—a process that is active at the time scale of minutes and also weeks—given only a couple dozen measurements on each of a few people.
Or maybe you want to get a picture of brain activity given indirect measurements from a big clanking physical device encircling a person’s head.
Or maybe you want to estimate what might have happened in past elections had the Democrats or Republicans received 1%, or 2%, or 3% more of the vote.
Or maybe . . . or maybe . . .
What all these examples have in common is some data—not enough, never enough!—and a vague sense arising in my mind of what the answer should look like. Not exactly what it would look like—for example, I did not in any way anticipate the now-notorious pattern of vouchers being more popular among rich white Catholics and evangelicals and among poor blacks and Hispanics (maybe I should’ve anticipated it; I’m not proud of the level of ignorance that allowed this finding to surprise me, I’m just stating the facts)—but what it could look like. Or, maybe it would be more accurate to say, various things that wouldn’t look right, if I were to see them.
And the challenge is to get from point A to point B. So, you throw model after model at the problem, method after method, alternating between quick-and-dirty methods that get you nowhere, and elaborate models that give uninterpretable, nonsensical results. Until finally you get close. Actually, what happens is that you suddenly solve the problem! Unexpectedly, you’re done! And boy is the result exciting. And you do some checking, fit to a different dataset maybe, or make some graphs showing raw data and model estimates together, or look carefully at some of the numbers, and you realize you have a problem. And you stare at your code for a long, long time and finally bite the bullet, suck it up, and do some active debugging, fake-data simulation, and all the rest. You code your quick graphs as diagnostic plots and build them into your procedure. And you go back and do some more modeling, and you get closer, and you never quite return to the triumphant feeling you had earlier—because you know that, at some point, the revolution will come again and with new data or new insights you’ll have to start over on this problem, but, for now, yes, yes, you can stop, you can step back and put in the time—hours, days!—to make pretty graphs, you can bask in the successful solution of a problem. You can send your graphs out there and let people take their best shot. You’ve done it.
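The fake-data simulation mentioned above has a simple core idea: simulate data from known parameters, refit the model, and check that the fit recovers the truth. A minimal sketch in Python (a toy linear regression, not any particular model from these examples; all numbers here are made up for illustration):

```python
import numpy as np

# Fake-data check: simulate from known parameters, refit, compare.
rng = np.random.default_rng(42)

true_intercept, true_slope, sigma = 1.5, 2.0, 0.5
n = 1000
x = rng.uniform(0, 10, size=n)
y = true_intercept + true_slope * x + rng.normal(0, sigma, size=n)

# Fit ordinary least squares.
X = np.column_stack([np.ones(n), x])
est_intercept, est_slope = np.linalg.lstsq(X, y, rcond=None)[0]

# If the fitting code is right, the estimates should land near the truth.
assert abs(est_intercept - true_intercept) < 0.1
assert abs(est_slope - true_slope) < 0.05
```

If the estimates come back far from the parameters you put in, the bug is in your fitting code, not your data—which is exactly why this check is worth building into your procedure alongside the diagnostic plots.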
But, not so deep inside you, that not-so-still and not-so-small voice reminds you of the compromises you’ve made, the data you’ve ignored, the things you just don’t know if you believe. You want to do more, but that will require more computing, more modeling, more theory. Yes, more theory. More understanding of what these things called models do. Because, just like storybook characters take on a life of their own, just like Gollum wouldn’t die and Frank Bascombe comes up with wisecracks all on his own, and Ramona Quimby won’t stay down even if you try to make her, and so on and so on and so on, just like these characters, each with his or her internal logic, so any statistical model worth fitting also has its internal logic, mathematical properties latent in its form but, Turing-machine-like, impossible to anticipate before applying it to data—not just “real data” (how I hate that phrase), but data from live problems. And then comes Statistical Theory—the good kind, the kind that tells us what our models can and cannot do, when they can bend with the data and when they snap. (Did you know that doubly-integrated white noise can’t really turn corners? I didn’t, until I tried to fit such a model to data that went up, then down.) And you do your best with your Theory, and your simulations, and even your computing (yuck!). But you move on. And you hope that when it’s time to come back to this problem, you’ll have some better models at hand, things like splines and time-series cross-sectional models, and you’ll have a programming and modeling environment where you can just write down latent factors and have them interact, and you’ll be able to include three-way interactions, and four-way interactions, and . . . and . . . you hope that in ten years you’ll be fitting the models that, ten years ago, you thought you’d be fitting in five years. And you take a rest. You write up what you found and you write up exactly what you did (not always so easy to do).
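The parenthetical claim about doubly-integrated white noise can be seen in a short simulation (a sketch of the stochastic process itself, not of the model fit described above). Integrating white noise twice means the second differences of the series are the white noise: the path’s local curvature at each step is a single small, independent shock, so the trajectory drifts smoothly and cannot turn a sharp corner without an implausibly huge innovation.

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.normal(0, 1, size=500)

# Doubly-integrated white noise: cumulatively sum the noise twice.
x = np.cumsum(np.cumsum(e))

# The second differences recover the driving noise exactly, so the
# "bend" added at each step is just one unit-scale shock—far too
# little, step by step, to produce a corner in a series whose level
# has wandered to values orders of magnitude larger.
assert np.allclose(np.diff(x, 2), e[2:])
```

That mismatch is precisely what theory of this good kind tells you in advance: fit this model to data that go up and then turn down, and the fitted curve will round off the peak no matter how much data you have.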
And a new question comes along. You want a quick answer. You try putting together available data in a simple way. You try some weighting. But you don’t believe your answer. You need more data. You need more model. You get to work.
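The “simple weighting” step can be as minimal as a weighted mean that adjusts the sample’s composition toward known population shares. A hypothetical two-group sketch in Python (group labels, responses, and population shares are all invented for illustration):

```python
import numpy as np

# Hypothetical survey: group "a" oversampled relative to group "b".
responses = np.array([1, 1, 1, 0, 1, 1, 0, 0, 1, 0])  # 1 = support
group     = np.array(["a", "a", "a", "a", "a", "a", "b", "b", "b", "b"])

# Assumed known population shares (made up for this example).
pop_share = {"a": 0.5, "b": 0.5}

# Weight each respondent by population share / sample share,
# so overrepresented groups are downweighted and vice versa.
weights = np.array([pop_share[g] / np.mean(group == g) for g in group])

raw_mean = responses.mean()                          # ignores composition
weighted_mean = np.average(responses, weights=weights)
```

Here the raw mean (0.6) overstates support because the more favorable group is oversampled; the weighted mean (about 0.54) matches the average of the two group means at their population shares. And, of course, with small groups and many cells this is exactly where simple weighting breaks down and you start wanting a model instead.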
That’s how it feels, from the inside.