Sociologist Fabio Rojas reports on “a conversation I [Rojas] have had a few times with statisticians”:

Rojas: “What does your research tell us about a sample of, say, a few hundred cases?”

Statistician: “That’s not important. My result works as n → ∞.”

Rojas: “Sure, that’s a fine mathematical result, but I have to estimate the model with, like, totally finite data. I need inference, not limits. Maybe the estimate doesn’t work out so well for small n.”

Statistician: “Sure, but if you have a few million cases, it’ll work in the limit.”

Rojas: “Whoa. Have you ever collected, like, real world network data? A million cases is hard to get.”

The conversation continues in this frustrating vein. Rojas writes:

This illustrates a fundamental issue in statistics (and other sciences). Once you formalize a model and work mathematically, you are tempted to focus on what is mathematically interesting instead of the underlying problem motivating the science. . . .

We have the same issue in statistics. “Statistics” can mean “the mathematics of distributions and other functions arising in statistical models.” Or it can mean the traditional problems of statistics like inference, measurement, model estimation, sampling, data collection/management, forecasting, and description. The problem for a guy like me (a social scientist with real data) is that the label “statistician” often denotes someone who is actually a mathematician who happens to be interested in distributions. . . . What I really want is a nuts and bolts person to help me solve problems.

My first reaction—actually, my main reaction—is that Rojas hangs out with the wrong sort of statistician. Following the links, I see that Rojas works at Indiana University, which features a large statistics department. I suspect he had the misfortune to encounter “a mathematician who happens to be interested in distributions” and he didn’t realize he could shop around among the many statisticians in that department who work on applied social research.

On the other hand, it’s a bad sign that Rojas reports having this conversation multiple times. I thought that statisticians nowadays know they’re supposed to be helpful on real problems. That “n -> infinity” thing seems so old-fashioned! I’d like to believe that Rojas was just having some bad luck, but maybe there’s more of this bad stuff going on than I realized. Or maybe it was just a communication problem?

It’s hard for me to imagine a statistician in 2012 telling a sociologist, “if you have a few million cases, it’ll work in the limit,” except as a joke, as an ironic comment on the limitations of some of our theory. But perhaps that just reflects the poverty of my imagination.

Is “n->infinity” really that old-fashioned? I agree that claims of validity and efficiency that rely on huge n are not practically useful. But we can also view asymptotics as a tool for approximating how methods behave at realistic n, and then it seems more helpful. Even better, asymptotic arguments can help pinpoint which features of the data make methods work badly (e.g., heavy tails, high leverage, etc.).

Of course, there are other techniques out there for both jobs, but used the right way – i.e. not as in Rojas’ conversation – surely “n->infinity” still has a place?
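To make the point above concrete, here is a toy Monte Carlo (my own illustration, not from the thread): the usual 95% t-interval for a mean is justified asymptotically, and simulation lets you check how it actually behaves at a realistic n. With well-behaved data it delivers its nominal coverage even at n = 20; with skewed, heavy-tailed data, the same recipe at the same n falls noticeably short, which is exactly the kind of diagnosis the commenter has in mind.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 20, 50_000
tcrit = 2.093  # 0.975 quantile of the t distribution with n - 1 = 19 df

def coverage(draws, true_mean):
    """Fraction of nominal-95% t-intervals for the mean that cover true_mean."""
    m = draws.mean(axis=1)
    half = tcrit * draws.std(axis=1, ddof=1) / np.sqrt(n)
    return np.mean((m - half <= true_mean) & (true_mean <= m + half))

# Normal data: the interval delivers essentially its nominal 95%.
print(coverage(rng.normal(size=(reps, n)), 0.0))
# Skewed, heavy-tailed (lognormal) data: same recipe, same n, under 95%.
print(coverage(rng.lognormal(size=(reps, n)), np.exp(0.5)))
```

The lognormal's true mean is exp(0.5); the shortfall in the second line is the finite-n failure that asymptotic arguments (via skewness corrections and the like) can help anticipate.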

Fred:

Sure, “n -> infinity” has a place, but I don’t think there’s much place for a statistician to say that the n=100 case is “not important” (as reported in the above story). Again, this may have been a misunderstanding on Rojas’s part, but in that case there was still a problem of communication.

Sure. I should have said “used and communicated in the right way”.

Maybe the statistician who thinks he needs a million samples could do well with far less data but doesn’t have enough experience to know it. That is, the fault may lie in the statistician’s knowledge of his tools, not his tools per se.

I’m completely with you. I think he hangs out with the wrong kind of statisticians. In my experience, statisticians housed in mathematics departments at universities tend to think along those lines. Statisticians in other departments, and certainly statisticians in business, generally do not.

Quoting myself, “What would be the fate of a crime analyst who told the police chief, ‘We only have two cases in which the victim was dismembered; this is too small an N to infer a pattern’?” Fred Mosteller’s discussion of the scurvy trial, in which each of six treatments was given to two sailors, is another (albeit N=12) example.

The conversation is extreme, but in my experience it is not at all unusual to find statisticians (in my case econometricians) who prefer estimators with good large-sample properties no matter what the sample size. Whenever I point out that the small-sample properties of these estimators are, at best, unknown, and often known from Monte Carlo studies to be quite poor in small samples, they shrug. Seriously, that’s the modal reaction.
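For concreteness, here is a minimal version of the kind of Monte Carlo check the commenter describes (a textbook toy, not one of the econometric estimators in question): the Gaussian maximum-likelihood estimator of a variance is consistent and asymptotically efficient, yet at n = 10 it is biased downward by about 10 percent, and a few lines of simulation expose this immediately.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, sigma2 = 10, 100_000, 1.0

# The MLE of the variance divides by n (ddof=0). It is consistent and
# asymptotically efficient, but E[sigma2_hat] = (n-1)/n * sigma2,
# i.e. about 10% too small at n = 10.
draws = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
mle_mean = draws.var(axis=1, ddof=0).mean()

print(mle_mean)  # close to 0.9, not 1.0
```

Nothing here contradicts the asymptotic theory; it just shows what the theory is silent about at the sample sizes people actually have.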

This could just be a small-sample problem. As the number of statisticians he talks to approaches infinity, the problem will eventually disappear.

Well said. But I feel that the statistician was joking.

Those statisticians should learn about some modern asymptotic theory, from books like “Applied Asymptotics: Case Studies in Small-Sample Statistics” by Alessandra Brazzale, Anthony Davison, and Nancy Reid, which discusses methods like saddlepoint approximations that in some cases work well even for n=1, which is rather good for asymptotics. He could also google interesting things such as the “Barndorff-Nielsen p*-formula”.

One almost elementary introduction to such things is “Intermediate Probability. A Computational Approach” by Marc Paolella (Wiley).

I could be wrong, but I think it speaks to the limits of the two people involved, as well as the communication in the interaction. It may also speak to the disingenuous nature of all three, but that is nearly impossible to know with a comfortable degree of certainty. In the end I find the whole interchange one sad heap on the side of the higher-education freeway.

Eliason is the last name…sorry…

As a non-statistician by degree, but largely a user of statistics, I don’t find Rojas’s conversation rare at all.

A symptom is that a good share of self-described statisticians never actually program many calculations. Or if they do, the models are so stripped of all inconveniences (e.g., precorrected data, or simulated data “that would exist but does not yet”) that they are pointless.

But I think this varies enormously across places and countries.

I wonder if part of the issue is the way that we educate statisticians. It seems most graduate programs in statistics have a pretty heavy emphasis on training students to perform asymptotic analyses and other mathematical computations, with substantially less emphasis on applied data analysis. This is probably at least in part due to the fact that teaching data analysis is frequently a trickier proposition (it requires lots of hands-on experience) than mathematical statistics.

The result of this focus is that most graduates of these programs have spent the majority of their time thinking about mathematical statistics, so that is of course how they think when interacting with collaborators. On the other hand, as AnnMaria points out, a person with a primarily applied focus in statistics (in business, biology, or politics) might not have this same tendency to focus on the math.

One way I think we could resolve this issue is by injecting a heavier dose of practical data analysis into statistics curricula. One idea is a case-studies course, where the data sets are “real” in that they might have small sample size, or data missing not at random, or be in a gnarly format. Students trained in this way may be a little less likely to immediately jump to asymptotic results.

Overall, I think the statistics discipline/profession has been duped or co-opted by a mathematics agenda to help spread quantitative literacy and especially Mathematistry (stolen from Rod Little’s abstract for this year’s Joint Statistical Meetings Fisher Lecture).

Some of this comes out in the comments here.

To me at least, statistics’ primary agenda should be to sort out uncertainties and haphazard variability, so math is just another tool, like computation.

Now, how exactly horrified would we be if someone discovered that this could best be accomplished (uncertainties best sorted out) by computing in some lost dead language? (OK, it’s hard to think of how math would be entirely left out, but what if the unthinkable happened?)

Now, perhaps in the not-too-distant future, the uncertainties in the majority of applications might be best sorted out by a super Bayes/frequency computational engine. Here just the mechanics would need the math to keep things running.

Most academic statisticians are located in math departments (I managed to always avoid those) and have two options: either learn to act like a mathematician, or at least appear to act like a mathematician.

social scientist: i have two groups to compare, and my outcome is binary.

statistician: you can compare proportions between the 2 groups then. give the difference in proportions with its confidence interval, and you can also use a test of proportions to get a p-value if that’s what you want.

social scientist goes away to his group and comes back later.

social scientist: i was told that we don’t know what a test of proportions is, and that you should run a chi-squared test.

—–

another social scientist: we have two groups to compare, and my outcome is continuous. we calculated the correlation between the two variables.

another statistician: you may want to look at the mean differences in your outcome between groups also.

social scientist: whatever. i want to talk about causal interpretations of the analysis.

statistician: your study is observational. i would be cautious about that.

social scientist: that does not matter. look how small the p-value is.
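For what it’s worth, the first exchange above dissolves on a standard identity: for a 2×2 table, the two-sided two-proportion z-test (with pooled standard error) and Pearson’s chi-squared test without continuity correction are the same test, with z² = χ² and identical p-values. A quick check with made-up counts:

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

# Hypothetical counts: successes out of trials in each of two groups.
x1, n1 = 30, 100
x2, n2 = 45, 100

# Two-proportion z-test with pooled standard error.
p1, p2, p = x1 / n1, x2 / n2, (x1 + x2) / (n1 + n2)
z = (p1 - p2) / np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
p_z = 2 * norm.sf(abs(z))

# Pearson chi-squared on the same 2x2 table, no continuity correction.
table = [[x1, n1 - x1], [x2, n2 - x2]]
chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)

print(np.isclose(z**2, chi2), np.isclose(p_z, p_chi2))  # True True
```

So “we don’t know what a test of proportions is, run a chi-squared test” is a request for the same answer under a different name.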

Jimmy:

Cool. This is like really bad sketch comedy.

Two weeks ago I heard twice in one day, “I am a mathematician who happens to work in the Statistics Department.” This is often used to avoid being asked about details of the context, because “they are too theoretical” to try to understand what they are supposedly modeling. On the other hand, they do not present their “theoretical work” at theoretical conferences, because most of the time there is no real theoretical contribution.

[…] Andrew Gelman wrote a simple response, which is that I am hanging around with the wrong people. There is some truth to that. The last time I had the “n → ∞” argument was with a visitor. Indiana has hired some exceptional applied statisticians, like Stanley Wasserman. The program has also hired people with non-statistics PhDs, like sociology and economics. I have consulted with these folks and it is easier to get concrete guidance on statistical practice. […]

[…] me this has some parallels to a recent post about a theoretical statistician whose work is useless in practice. I am as aware of the problems […]

The fact is that this is a very common stereotype, though statisticians may be less aware of it than everyone else because politeness prevents people from saying it to their face. The origin is easy to explain: at many institutions, the primary thing taught to stats undergrads is a bunch of recipes along with the caveat that they only apply to sufficiently large data sets. They also get taught (either deliberately or as an unintended side-effect of the curriculum) that the most important criterion for an estimator is its asymptotic behaviour. Concepts like unbiasedness are over-emphasised, while small-sample behaviour is not investigated even empirically. They study proofs of asymptotic properties even in contexts where the data sets are never large enough for asymptotics to apply. They grow up thinking that this is the essence of their discipline. Applied researchers quickly learn to avoid collaborating with statisticians who grow up in this mold, so they are never exposed to alternative approaches.

I wonder if it was a miscommunication where some statistician was talking about the limit as you repeat the trial of 100 observations lots of times (e.g., talking about confidence intervals) rather than the sample size 100 → infinity.