15 thoughts on “Beef with data”

  1. The “elite” hate data because it’s a threat to their realm of knowing everything. If problems can be solved with a computer and the right data set, what will people like Brooks have left to be experts at? But you’re right: “big data” is trendy right now, so he does have to come out against it…

    • Asa:

      I have never met David Brooks but I doubt he thinks of himself as the elite. I’d guess he thinks of himself as a tribune of the common man, protecting them from the follies of elite number-crunchers. I might be wrong in this impression, though.

      • The guy is a multimillionaire with a massive audience and connections to important and influential people. How could he think of himself as a common man? Obama named him his favorite political pundit.

      • I think he’s just adding a cautionary note to the overly enthusiastic big data meme.

        Sounds reasonable to me. We needed a moderating voice.

  2. Brooks writes “As we acquire more data, we have the ability to find many, many more statistically significant correlations”. This sentence, of course, should end “that are actually spurious and not genuine correlations”. That is because data hunting, mining, multiple testing and such make it probable that “impressive” associations will be found, even though they are generated by chance. So the actual significance level may be very high, not very low. Ironically, this leads some to blame the use of p-values, when in fact significance testing methodology was developed (e.g., by Fisher) precisely in order to ring alarm bells as a result of invalid hunting, cherry-picking, etc. I think I shall call these “unaudited p-values” in a book I am writing. Error statistical checks force us to “temper our enthusiasm” to use a phrase from Mayo-Cox (2010, 270). http://www.phil.vt.edu/dmayo/personal_website/Ch 7 mayo & cox.pdf

    • In my opinion, the focus on Type I errors due to multiple testing is a bit misplaced. I don’t think those reflect the main interpretational issue in most of these settings.

      Large-scale multiple testing usually coincides with the use of observational data or uncontrolled experiments: case-control genome-wide association studies, convenience samples off the internet, MRI scans, gene expression arrays, etc. In these cases errors due to confounding bias dwarf sampling error, and it’s common that the entire distribution of test statistics is centered on a non-null effect. Trying to address multiple testing isn’t going to help the quality of the interpretation much. The correlations are “true”; the problem is that the causal explanation for the observed correlation is underdetermined.

      • revo11:
        I agree, but I think I might have been underestimating the relative damage of multiplicity to the problem in the past.

      • revo11: I agree with the first sentence, and half of the second. “Trying to address multiple testing isn’t going to help the quality of the interpretation much,” but that’s because the “adjustments” are also based on assumptions that may not hold. Thus I would question claiming that “the correlations are ‘true’”; they may not even be genuine statistical correlations (because of violated assumptions). Granted, when confronted with genuine correlations (not rendered spurious), we’d separately need to probe substantive causal claims. With people like Brooks, I am mainly irked by their writing as if they have developed some deep, new-fangled, insightful, anti-high-techy reflections, rather than dressing up issues that are either as old as the hills or quite trite, merely by alluding to some contemporary verbiage (even if one grants that ‘big data’ exacerbates the statistical issues). He is obviously just repeating recent laments from other sources, and (as Gelman notes in his current post) doesn’t appear to really care to grasp the issues at any more sophisticated a level.

  3. I don’t think it’s one of Brooks’ best columns, but not bad, either. I run into people all the time who think that the more data you have, the more insights you get, regardless of data quality, characteristics of the sampling frame — or lack of a definable sampling frame, and other limitations of the data.

  4. Brooks is a piece of work. For anyone who hasn’t read it, here’s a link to Sasha Issenberg’s article (investigative report?) on Brooks, Boo-Boos in Paradise:
    http://www.phillymag.com/articles/booboos-in-paradise/

    An excerpt:
    “I called Brooks to see if I was misreading his work. I told him about my trip to Franklin County, and the ease with which I was able to spend $20 on a meal [which Brooks claimed he wasn’t able to do]. He laughed. “I didn’t see it when I was there, but it’s true, you can get a nice meal at the Mercersburg Inn,” he said. I said it was just as easy at Red Lobster. “That was partially to make a point that if Red Lobster is your upper end … ” he replied, his voice trailing away. “That was partially tongue-in-cheek, but I did have several mini-dinners there, and I never topped $20.”

    I went through some of the other instances where he made declarations that appeared insupportable. He accused me of being “too pedantic,” of “taking all of this too literally,” of “taking a joke and distorting it.” “That’s totally unethical,” he said.

    Satire has its purpose, but assuming it’s on the mark, Brooks should be able to adduce real-world examples that are true. I asked him how I was supposed to tell what was comedy and what was sociology. “Generally, I rely on intelligent readers to know — and I think that at the Atlantic Monthly, every intelligent reader can tell what the difference is,” he replied. “I tried to describe the mainstream of Montgomery County and the mainstream of Franklin County. They’re both diverse places, and any generalization is going to have exceptions. But I was trying to capture the difference between the two places,” he said. “You’ve obviously come at this from a perspective. I don’t think if you went to the two places you wouldn’t detect a cultural difference.” ”

    [end quote of Issenberg article]

    Yeah, a perspective where one presumes that an apparent statement of fact is in fact a statement of fact.

    [resuming quote of Issenberg article]

    “I asked him about Blue America as a bastion of illegal immigrants. “This is dishonest research. You’re not approaching the piece in the spirit of an honest reporter,” he said. “Is this how you’re going to start your career? I mean, really, doing this sort of piece? I used to do ’em, I know ’em, how one starts, but it’s just something you’ll mature beyond.”

    I shared with him some more of my research, and asked how he made his observations. On NASCAR name recognition: “My experience going around to people that I know in urban metro areas is a lot of them can’t name five NASCAR … but that’s a joke.” On Spa Lady locations: “I think that’s the type of place where people would get the joke and get the reference.” On whether Blue Americans read more books: “That would be interesting, but one goes by one’s life experiences.”

    “What I try to do is describe the character of places, and hopefully things will ring true to people,” Brooks explained. “In most cases, I think the way I describe it does ring true, and in some places it doesn’t ring true. If you were describing a person, you would try to grasp the essential character and in some way capture them in a few words. And if you do it as a joke, there’s a pang of recognition.”

    By holding himself to a rings-true standard, Brooks acknowledges that all he does is present his readers with the familiar and ask them to recognize it…”

    [end quote of Issenberg article]

    Whatever conclusions Brooks draws from his analysis of data analysis, it’s a pretty safe bet they’ll a) have a ring of truthiness to them and b) turn out to be a load of BS when you examine the details.

    • Thanks for the link. I really don’t like Krugman and think he’s an over-rated economist. But his point mid-way through this response nicely sums up what’s wrong with Brooks’ piece:

      The fault lies not in our data, but in ourselves

      That is, Brooks is taking “Big Data” to task for what are really human faults and foibles. Statistical data analysis is like any other tool: it has a designed purpose, a set of assumptions about its proper use. If one uses statistical analysis outside that designed purpose, one is like a man trying to open a jar of pickles with a can opener. Brooks here sees that failure and is criticizing the can opener. He’s also ignoring the many “cans” that statistical tools *have* been able to open.

      (Besides, did anyone else NOT have a problem with Brooks’ question: “Quick, what’s the square root of 437?” I immediately thought, “Uh… it’s a little over 20”).

  5. He basically just lists standard textbook challenges associated with a big data approach. I really can’t see why there is a beef with the (non) beef.

  6. Pingback: My beef with Brooks: the alternative to “good statistics” is not “no statistics,” it’s “bad statistics” « Statistical Modeling, Causal Inference, and Social Science

  7. Pingback: David Brooks and Data « More Practical Solutions
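
Editor’s illustration of the multiple-testing point raised in comment 2: the sketch below correlates many pure-noise variables with a pure-noise outcome and counts how many clear a nominal 5% threshold, with and without a Bonferroni adjustment. Everything in it is an assumption made for illustration and comes from none of the commenters: the sample sizes, the seed, the function names, and the use of the Fisher z-transform as an approximate p-value. It also says nothing about the confounding issue raised in the replies, which no multiplicity adjustment addresses.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for r = 0 via the Fisher z-transform (an approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_hits(n_obs=100, n_vars=1000, alpha=0.05, seed=0):
    """Count 'significant' correlations when every variable is pure noise."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    p_values = []
    for _ in range(n_vars):
        noise = [rng.gauss(0, 1) for _ in range(n_obs)]
        p_values.append(approx_p_value(pearson_r(noise, outcome), n_obs))
    naive = sum(p < alpha for p in p_values)                # unadjusted threshold
    bonferroni = sum(p < alpha / n_vars for p in p_values)  # Bonferroni threshold
    return naive, bonferroni


naive, bonferroni = count_hits()
print(f"unadjusted hits: {naive} of 1000; Bonferroni hits: {bonferroni}")
```

With no real effects anywhere, roughly 5% of the unadjusted tests should still look “significant,” which is the sense in which hunting through more data produces more impressive-looking but spurious correlations.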

Comments are closed.