This post is by Phil.
Psychologists perform experiments on Canadian undergraduate psychology students and draw conclusions that (they believe) apply to humans in general; they publish in Science. A drug company decides to embark on additional trials that will cost tens of millions of dollars based on the results of a careful double-blind study…whose patients are all volunteers from two hospitals. A movie studio holds 9 screenings of a new movie for volunteer viewers and, based on their survey responses, decides to spend another $8 million to re-shoot the ending. A researcher interested in the effect of ventilation on worker performance conducts a months-long study in which ventilation levels are varied and worker performance is monitored…in a single building.
In almost all fields of research, most studies are based on convenience samples, or on random samples from a larger population that is itself a convenience sample. The paragraph above gives just a few examples. The benefits of carefully conducted randomized trials are well known, but so are the costs and impediments. Lucky people studying some natural phenomena like solar output or earthquakes can deal with complete datasets, but for most data analysts and applied statisticians the fact that your data are not a random sample from your population of interest is so commonplace that it usually goes without saying. This does not mean that it goes without thinking, of course: most researchers, and all good ones, think about the extent to which their results might or might not be applicable to a wider population and try to frame their conclusions accordingly. But most or all researchers are willing to extrapolate their results to wider populations to some degree. The movie studio reshoots their ending not because they want to please their 9 test audiences, but because they think that the response of those 9 test audiences tells them something about millions of other viewers, even though those 9 audiences were not selected according to a careful, randomized sampling scheme.
If you think everything I’ve said so far is so obvious as to be boring, so did I, but I was proven wrong. Read on.
I’m currently working with time series data on building electricity consumption. People want to be able to answer questions like “I changed something about my large commercial building on March 1; how much energy am I saving?” One way to do that is to fit a statistical model using data from before March 1, use it to predict the energy use after March 1, and compare the predictions to what actually happened. There are other uses for these models too.
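To make the idea concrete, here is a minimal sketch of that fit-then-predict approach. It is not any particular company's model: it assumes a plain least-squares baseline with hypothetical predictors (say, outdoor temperature), where a real tool would use something far more elaborate.

```python
import numpy as np

def estimate_savings(pre_X, pre_y, post_X, post_y):
    """Fit a linear baseline on pre-change data, predict the post-change
    period, and report predicted-minus-actual as estimated savings.
    pre_X/post_X are (n, k) predictor arrays (e.g. outdoor temperature);
    pre_y/post_y are metered consumption."""
    # Ordinary least squares with an intercept column.
    A = np.column_stack([np.ones(len(pre_X)), pre_X])
    coef, *_ = np.linalg.lstsq(A, pre_y, rcond=None)

    # Counterfactual: what the building would have used with no change.
    A_post = np.column_stack([np.ones(len(post_X)), post_X])
    predicted = A_post @ coef
    return float(np.sum(predicted - post_y))  # positive => energy saved
```

A usage example: if post-change consumption runs 1 kWh below the baseline prediction in each of 5 intervals, the estimate is 5 kWh saved.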
There are quite a few companies that offer energy modeling programs. Typically, a company develops its own proprietary tool. Everyone thinks their package is better than anyone else’s, or at least they say so, but there’s usually no evidence. A few months ago a company approached us to ask if we would compare the accuracy of their tool to some standard methods and allow them to publicize the results if they wanted to. (I work at a government research lab and they correctly see us as a disinterested party.) We said sure. Our chosen approach is cross-validation. We compiled electricity data from a few dozen large commercial buildings — the population of interest — blanked out big chunks of data, and gave the resulting dataset to the company. Their task is to make predictions for the missing time periods and give them to us, and we will compare their predictions to reality. We’re doing the same with some standard prediction models. The data we’ve given them are a convenience sample. There’s no practical way to get data from a random sample of buildings in the country or even from a single electric utility, in part because of privacy issues (the utilities can’t give out the data without permission of the building owners). Our data are a grab bag from different sources, and certainly not representative of the broad population of commercial buildings in many ways. Still, what else can one do? We want to find out if these guys have a model that outperforms standard models, and this is a way to find the answer.
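The scoring step of that comparison can be sketched briefly. This is a hypothetical illustration, not our actual evaluation code: it assumes we score each method with CV(RMSE) — a standard accuracy metric for building-energy models — on the blanked-out periods of each building, then average across buildings.

```python
import numpy as np

def cv_rmse(actual, predicted):
    """Coefficient of variation of the RMSE (lower is better)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    return rmse / np.mean(actual)

def compare_methods(held_out, predictions_by_method):
    """Score each method on the held-out (blanked-out) data for every
    building; return the mean CV(RMSE) per method.
    held_out: {building_id: actual values for the blanked periods}
    predictions_by_method: {method_name: {building_id: predictions}}"""
    scores = {}
    for method, preds in predictions_by_method.items():
        per_building = [cv_rmse(held_out[b], preds[b]) for b in held_out]
        scores[method] = float(np.mean(per_building))
    return scores
```

Averaging a normalized metric across buildings, rather than pooling raw errors, keeps one very large building from dominating the comparison.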
To my surprise, our postdoc strongly disagrees. He asserts that since we don’t have a sampling plan, just a haphazard collection of building data, we can’t say anything at all. He says “sure, of course you can say which method performs better for these 52 buildings, but you cannot say a single thing about a 53rd building. Nothing.” At first I thought he meant that we should be cautious about making firm claims, and I certainly agree with that, but that’s not it: he really thinks that it is wrong to draw any conclusions whatsoever from our results. He thinks it’s wrong (incorrect, and borderline immoral) to say that our results are even suggestive. I offered a wager: if we find that one method performs substantially better than the others on average, I will bet you dinner that it will perform better for a 53rd building. He said sure, fine for wagering over dinner, but it is scientifically indefensible to make any statement at all about which of the methods performs better in general on the basis of anything we might find using our dataset.
To me this has some parallels to a recent post about a theoretical statistician whose work is useless in practice. I am as aware of the problems of biased datasets as anyone — I once looked into the issue of bias in indoor radon measurement datasets, and the bias turned out to be absolutely enormous — but in most fields of research if you think you can’t learn anything about the wider world unless that wider world is inside your sampling frame, you may as well quit now because you are never going to get the world into your sampling frame.
This is an interesting problem because it is sort of outside the realm of statistics, and into some sort of meta-statistical area. How can you judge whether your results can be extrapolated to the “real world,” if you can’t get a real-world sample to compare to? (And if you could get a sample to compare to, you would, and then this problem wouldn’t come up.)
I would welcome thoughtful commentary on this subject.