Stephen Senn quips: “A theoretical statistician knows all about measure theory but has never seen a measurement whereas the actual use of measure theory by the applied statistician is a set of measure zero.”

Which reminds me of Lucien Le Cam’s reply when I asked him once whether he could think of any examples where the distinction between the strong law of large numbers (convergence with probability 1) and the weak law (convergence in probability) made any difference. Le Cam replied, No, he did not know of any examples. Le Cam was the theoretical statistician’s theoretical statistician, so there’s your answer.
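For readers who want the distinction spelled out, here are the two laws in their standard textbook form, for i.i.d. random variables with mean mu and sample mean X-bar_n (nothing here is specific to Le Cam's remark):

```latex
% Weak law: convergence in probability
\bar{X}_n \xrightarrow{\;P\;} \mu : \qquad
\forall \varepsilon > 0, \quad
\lim_{n \to \infty} \Pr\bigl( |\bar{X}_n - \mu| > \varepsilon \bigr) = 0

% Strong law: convergence with probability 1 (almost surely)
\bar{X}_n \xrightarrow{\;\text{a.s.}\;} \mu : \qquad
\Pr\Bigl( \lim_{n \to \infty} \bar{X}_n = \mu \Bigr) = 1
```

The strong law constrains the entire sample path at once, while the weak law only constrains each n separately; the point of the anecdote is that this difference is hard to detect in any finite data set.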

The other comment of Le Cam’s that I remember came when I showed him my draft of Bayesian Data Analysis. I told him I thought that chapter 5 (on hierarchical models) might especially interest him. A few days later I asked him if he’d taken a look, and he said yes, this stuff wasn’t new: he’d done hierarchical models himself, back when he’d been an applied Bayesian in the 1940s.

A related incident occurred when I gave a talk at Berkeley in the early 90s in which I described our hierarchical modeling of votes. One of my senior colleagues–a very nice guy–remarked that what I was doing was not particularly new; he and his colleagues had done similar things for one of the TV networks at the time of the 1960 election.

At the time, these comments irritated me. But, from the perspective of time, I now think that they were probably right. Our work in chapter 5 of Bayesian Data Analysis is–to put it in its best light–a formalization or normalization of methods that people had done in various particular examples and mathematical frameworks. (Here I’m using “normalization” not in the mathematical sense of multiplying a function by a constant so that it sums to 1, but in the sociological sense of making something more normal.) Or, to put it another way, we “chunked” hierarchical models, so that future researchers (including ourselves) could apply them at will, allowing us to focus on the applied aspects of our problems rather than on the mathematics.

To put it another way: why did Le Cam’s hierarchical Bayesian work in the 1940s and my other colleague’s work in 1960s not lead to more widespread use of these methods? Because these methods were not yet normalized–there was not a clear separation between the math, the philosophy, and the applications.

To focus on a more specific example, consider the method of multilevel regression and poststratification (“Mister P”), which Tom Little and I wrote about in 1997, which David Park, Joe Bafumi, and I picked back up in 2004, and which finally took off with the series of articles by Jeff Lax and Justin Phillips (see here and here). That’s a lag of over 10 years, but really it’s more than that: when Tom and I sent our article to the journal Survey Methodology back in 1996, the reviews said basically that our article was a good exposition of a well-known method. Well known, but it took many, many steps before it became normalized.
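As a rough illustration of the poststratification half of Mister P: a multilevel regression supplies an estimate for each demographic-by-geography cell, and the cells are then averaged with weights given by census population counts. The sketch below shows only that second step, with invented cell estimates and counts (the regression step that would produce them is omitted):

```python
# Poststratification step of "Mister P" (toy numbers).
# Each entry is (estimated Pr(support) in the cell, census count for the cell);
# the estimates would come from a multilevel regression fit to survey data.
cells = [
    (0.62, 1200),  # e.g., women 18-29 in some state
    (0.55,  900),  # e.g., men 18-29
    (0.48, 1500),  # e.g., women 30+
    (0.41, 1400),  # e.g., men 30+
]

total = sum(n for _, n in cells)
estimate = sum(theta * n for theta, n in cells) / total  # population-weighted average
print(round(estimate, 4))  # prints 0.5066
```

The point of the weighting is that the survey sample can be wildly unrepresentative across cells; as long as the cell-level estimates are reasonable, the census counts restore the population composition.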

Also, good timing is very important. Given the hardware and software available at the time, I would be very impressed by anyone who managed to make hierarchical modeling popular in the 1940s (or even up to the 1980s)!

(The link doesn't seem to be working, btw…)

Although you're discreet about names, the team involved in the election work you refer to — or at a minimum one working on a very similar project — was discussed publicly in

David R. Brillinger. 2002. John W. Tukey: His Life and Professional Contributions. Annals of Statistics 30: 1535-1575.

It's said there that Tukey regarded the work as proprietary to the networks.

I don't have any information beyond that, but it's a fair guess that it would have been difficult to publish satisfactorily in any conventional form. There was possibly a lot of brilliant adhockery throughout that reviewers might object to. Also, the result would have long since been very old news by the time it hit print.

This is often a long and frustrating process! In epidemiology, there is a long "gray period" when a method is too well known to be published in the statistical literature but too novel to be accepted in the medical literature. There really isn't a good intermediate venue in which to build up a bank of substantive examples and generate understanding of how the methods perform on well-understood problems.

It's a tough issue!

Well known to whom???

I recall being involved in a two-day course with some senior colleagues, and when I suggested we provide some worked examples for the participants, one “recoiled sternly” with “you should not have to show professional statisticians how to do such things.”

But from my own experience in clinical research, if one could not get a toy example to work in an afternoon or two, one could not afford the risk of trying to use that new method (unless, of course, it was your chosen area of research itself). Today the numerous downloadable working examples for R and WinBUGS have done much to remove this “risk” barrier to trying out new methods.

So I would put more stress on Andrew’s “chunked” comment – that methods are made more “doable” by more people, rather than normalized. And things are certainly much more doable by many more people when the method is separated from the math. [Some believe computing is becoming more important than math for applying statistics, and maybe math has had more of an attitude of “if they can’t work their way through the theorems, they should not be trying to use my methods.”]

But I also wanted to comment on Stephen’s quip

Depends on what you mean by “use” and “measure theory” …

In the paper Andrew recently posted, he points out the improvement from formalizing missing data as p(y|theta) -> p(y,I|theta,psi), the “I” being the observation process and psi the parameter in the model for that process. Thinking of I(y), i.e. what is actually observed as a function of y, it might be helpful to know that all reasonable functions are measurable, and hence the probabilities for I(y) are immediately available from the (sum of probabilities of the) pre-image (which always exists because I(y) is measurable). Or maybe not.
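Spelled out, the formalization Keith refers to is the standard missing-data setup (this is the usual decomposition, not anything specific to the paper in question):

```latex
% Joint model for the data y and the observation indicators I:
p(y, I \mid \theta, \psi) \;=\; p(y \mid \theta)\, p(I \mid y, \psi)

% The likelihood of what is actually seen integrates over the
% missing part of y:
p(y_{\mathrm{obs}}, I \mid \theta, \psi)
  \;=\; \int p(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid \theta)\,
             p(I \mid y_{\mathrm{obs}}, y_{\mathrm{mis}}, \psi)\, dy_{\mathrm{mis}}
```

The gain from writing the model this way is that assumptions about the observation process (e.g., that I depends only on the observed data) become explicit conditions on p(I | y, psi) rather than unstated hopes.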

Or for “Mister P” – would knowing about the Radon–Nikodym derivative be helpful?

Now in formal measure theory courses, 95% of the emphasis may be on the “distractions” and the development of math “skill,” but now the web can give the basic ideas about many obscure math concepts, providing perhaps better cost/benefit ratios …

Perhaps I am just wondering what sort of math knowledge and skill is needed today to perceptively apply statistics?

Keith

This reminds me of the year I spent at Stanford writing up my (Edinburgh) thesis. It was the mid-1980s, and there was a whole new crop of computational linguists (including me) running around re-inventing the wheel.

The standard seminar would start like: "we have a new idea to do X and evaluated it on Y," and then either Bonnie Webber or Ron Kaplan would stand up and say they'd tried that in the early 1970s. I actually followed up a few of the references, and sure enough, they really had discovered most of these ideas in the early 1970s, and even implemented them.

A good example is when logistic regression entered the classification fray (in the late 1980s, early 1990s) under the guise of "maximum entropy modeling" [sic]. Talk about ignorance run amok.

I now take it as a good sign when I get into a new field and re-invent the wheel. It makes me feel like I'm on the right track. On the other hand, if you're diligent enough in your research and honest enough in your citations, you get dinged for novelty in papers.

I wrote a longish blog entry on the scientific zeitgeist.

Ricardo: Definitely. If I were working 50 yrs ago, I probably wouldn't be doing multilevel modeling.

Nick: If they had wanted to, I'm sure they could've published a "methods" paper on what they did. I don't know the full story: perhaps they didn't realize the generality of what they were doing. It can be hard to move from the particular to the general. For example, lots of interesting and complicated hierarchical modeling was being done at the Census Bureau in the 1980s and 1990s, but they didn't generally put it in a convenient "Mister P"-style package for the rest of us to use. So we're still rediscovering some of their ideas. In addition, many of the key steps involve confidence building and model checking, areas where statistical theory is traditionally weak.

Keith: I have no idea what the Radon–Nikodym derivative is, but perhaps I use it under a different name.

Bob: Recall that whatever you do, somebody in psychometrics already did it long before.

Joseph: Nice points – also (unless it is your chosen area of research) you likely want to be an early adopter rather than a pioneer (i.e., the cost/benefit of using it).

Andrew: Radon–Nikodym derivative –> importance sampling, Horvitz–Thompson estimation, inverse probability weighting, etc. – but online materials about it now seem less friendly than I recall from when I checked a couple years ago. So obviously not necessary knowledge, but perhaps helpful to some.
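To make the importance-sampling connection concrete: the importance weight is exactly the density ratio, i.e. the Radon–Nikodym derivative of the target distribution with respect to the proposal. A minimal sketch (the choice of target, proposal, and test function here is arbitrary, just for illustration):

```python
import math
import random

random.seed(1)

def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Estimate E_p[X^2] where p = Normal(0, 1), but sample from the
# proposal q = Normal(0, 2).  The weight w = p(x)/q(x) is the
# Radon-Nikodym derivative dP/dQ evaluated at x.
n = 200_000
total = 0.0
for _ in range(n):
    x = random.gauss(0.0, 2.0)                       # draw from q
    w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)  # dP/dQ at x
    total += w * x * x
estimate = total / n
print(estimate)  # close to 1.0, the variance of Normal(0, 1)
```

Horvitz–Thompson estimation and inverse probability weighting are the same idea with the sampling-design probabilities playing the role of the proposal density.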

Keith

Andrew: I guess you're right, but I guess also that even Tukey couldn't manage to publish the 200 (2000?) (20000?) papers per year that 200 (…) lesser minds would have made of the ideas he had.

So, he found reasons not to publish some of the work he did, and the stuff being proprietary no doubt was convincing at enough levels.

What you say about what you did reminds me of what programmers mean when they say that a category of software has been "commoditized". Often a particular category of software will be esoteric and ill-understood. Commonly, in these cases, just one or a few vendors sell extremely expensive products that are widely viewed as nearly miraculous. Eventually, the insights into the state of the art that enabled the creation of such products, along with the particular tricks and techniques used, will be disseminated or independently rediscovered or reinvented. This may not entail any fundamental breakthroughs but rather a process of "chunking" well-known but previously separate ideas into a coherent approach to solving the problem. As programmers learn this approach, many people or teams will implement their own clones or minor variants of the idea. The price of such products will fall dramatically, which will siphon off customers from the low end of the market for the most expensive, earliest examples of such software. Products like these, once rare gems, will become a cheap commodity, even though the newer versions are apt to be more powerful and elegant than the much more expensive first versions.

I have to say that this post is refreshing. As a grad student in stats, I often feel that I'm not a real statistician unless I'm hard-core in measure theory. It's great to hear a statistician of your prominence say that you have no idea what a Radon-Nikodym derivative is!