
The challenges of statistical measurement . . . in an environment where bad measurement and junk science get hyped

I liked this article by Hannah Fry about the challenges of statistical measurement. This is a topic that many statisticians have ignored, so it’s especially satisfying to see it in the popular press. Fry discusses several examples, described in recent books by Deborah Stone and Tim Harford, of noisy, biased, or game-able measurements.

I agree with Fry in her conclusion that statistical measurement is both difficult and important:

Numbers are a poor substitute for the richness and color of the real world. . . . But to recognize the limitations of a data-driven view of reality is not to downplay its might. It’s possible for two things to be true: for numbers to come up short before the nuances of reality, while also being the most powerful instrument we have when it comes to understanding that reality.

And she quotes Stone as saying, “To count well, we need humility to know what can’t or shouldn’t be counted.”

The role of the news media (and now social media as well)

I just want to add one thing to Fry’s discussion. Bad statistics can pop up from many directions, including the well-intentioned efforts of reformers, the muddled thinking of everyday scientists trying their best, the desperate striving of scientific glory hounds, and let’s never forget political hacks and out-and-out frauds.

OK, that’s all fine. But how do we hear about these misleading numbers? Through the news and social media. Also selection bias, as we’ve discussed before:

Lots of science reporters want to do the right thing, and, yes, they want clicks and they want to report positive stories—I too would be much more interested to read or write about a cure for cancer than about some bogus bit of noise mining—and these reporters will steer away from junk science. But here’s where the selection bias comes in: other, less savvy or selective or scrupulous reporters will jump in and hype the junk. So, with rare exceptions (some studies are so bad and so juicy that they just beg to be publicly debunked), the bad studies get promoted by the clueless journalists, and the negative reports don’t get written.

My point here is that selection bias can give us a sort of Gresham effect, even without any journalists knowingly hyping anything of low quality.
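To make the mechanism concrete, here is a minimal simulation sketch in Python. All the coverage probabilities and the 20% "solid study" base rate are invented purely for illustration; nothing is calibrated to real data. The point is only that the aggregate pattern emerges even though every savvy reporter individually filters correctly.

```python
import random

random.seed(1)

# Toy model of the selection effect: savvy reporters skip junk entirely,
# less-selective outlets chase whatever is flashy, and debunking pieces
# never get written. All numbers below are made up for illustration.
N = 100_000
covered = []  # quality flags (True = solid) of studies that get positive coverage

for _ in range(N):
    solid = random.random() < 0.2                      # assume 20% of studies are solid
    juicy = random.random() < (0.3 if solid else 0.6)  # noise-mined junk is often flashier
    savvy_covers = solid and random.random() < 0.5     # savvy reporters cover only solid work
    clueless_covers = juicy and random.random() < 0.5  # less-selective outlets cover the flashy
    if savvy_covers or clueless_covers:                # every story that runs is positive
        covered.append(solid)

solid_share = sum(covered) / len(covered)
print(f"solid studies among those covered: {100 * solid_share:.0f}%")
```

Under these made-up numbers, the majority of the coverage readers actually see is enthusiastic write-ups of junk, and none of it is a correction, even though no individual journalist knowingly hyped anything: the clueless outlets fill the gap the savvy ones leave, and the negative reports don't exist.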

Fry published her article in the New Yorker, and even that august publication will occasionally jump on the junk-science bandwagon. For example, Fry cites “the great psychologist Daniel Kahneman,” which is fine—Kahneman has indeed done great work—but the problem with any “the great scientist X” formulation is that it can lead to credulity on the occasions that the great one gets it wrong. And of course one of their star writers is Malcolm Gladwell.

I’m not saying that Fry in her New Yorker article is supposed to dredge up all the mistakes made by its feature writers—after all, I’ll go months on this blog without mentioning that I share an employer with Dr. Oz! Rather, I’m trying to make a more general point that these mistakes in measurement come from many sources, but we should also be aware of how they’re promulgated—and how this fits into our narratives of science. We have to be careful to not just replace the old-fashioned Freakonomics or Gladwell-style scientist-as-hero narrative with a new version involving heroic debunkers; see also here, where I argue that scientific heroism, to the extent it exists, lives in the actions that the hero inspires.


  1. sentinel chicken says:

    I’m curious about what you mean by invoking the idea of ‘statistical’ measurement. Is this somehow different from quantitative measurement? Is it not sufficient to just call it measurement?

  2. Peter Dorman says:

    I’m paywalled out of the Fry piece (and not willing to commit to the New Yorker yet), but I have some thoughts about this topic that she might have addressed. When I taught stats (never beyond intro but often at the grad level) I devoted time to the question of how to integrate quantitative and qualitative (or N = very small) research. It wouldn’t have to be the same person/team doing both, but at some level a deeper understanding of whatever’s on the table requires this.

    It’s not just that there are limits to quantification, but that there are ways to mix methods so the shortcomings of each are offset by the advantages of the other. The stuff that gets left out when you pick a few characteristics to measure can be brought back in, and you can ask whether what you’ve learned from detailed observation of a few cases might alter your interpretation of the statistical work. Of course, stats texts are very enthusiastic about the power of large-N estimates to assess the representativeness of closely observed small-N studies. I guess what I’m saying is that a parallel contribution can be made in the other direction. And of course small-N observation is crucial for the choice of what to quantify and look for, etc.

    For example, in areas of labor economics I’ve worked in (occupational safety and health, child labor) statistical work with large samples is essential, but lots of implausible interpretation prospers in the research world because close observation has not been brought to bear. There really is something to be learned from case studies of particular work hazards at particular firms (and how people respond to them) and the work activity of particular children in particular locations. Of course, all that small-N stuff is being done regularly, but the problem is that the findings are not sufficiently entering the world of large-N number crunching. (In the case of OSH the wall is virtually impermeable; the value-of-statistical-life people never, ever look at case studies documenting behavioral responses.)

    I hope Fry made these points; if not, here they are.

  3. Re: “Bad statistics can pop up from many directions, including the well-intentioned efforts of reformers, the muddled thinking of everyday scientists trying their best, the desperate striving of scientific glory hounds, and let’s never forget political hacks and out-and-out frauds.”

    Sad but true.

  4. psych defector says:

    Have you looked at Uli Schimmack’s work on the z-curve? I just read this post over my morning coffee & haven’t thought about it deeply, but it seems to me like there are a lot of similar ideas echoing in your & his research; there might be an opportunity for collaboration.

  5. Patrick Turner says:

    Long time Aussie reader of the blog here. This is a very interesting topic, Andrew – what do you think of the importance of measurement error versus more commonly-discussed statistical issues e.g. p-hacking/data dredging and confounded methodology?

    My impression is that the latter two are much more commonly-discussed, because they’re often used unintentionally or otherwise to produce impressive-looking results whereas issues with statistical measurement don’t usually get used to help spurious hypotheses demonstrate statistical significance. I was actually just reading a local pseph who wrote a piece on how easy it is to create an “election model” with economic data (I can post it here if you’d like), and it’s interesting because people rarely bring up measurement error with regards to things like unemployment or inflation despite discussing them to death everywhere.

  6. jim says:

    One of the things about statistics, science, and the media is that there are “statistics,” sensu stricto, like election forecasting; and then there are *statistics*, sensu lato, which are typically a few simple number comparisons which are implied to have some specific meaning. Both are frequently wrong, for completely different reasons.

    A good example of the latter in Fry’s piece is: “For example, Black and white Americans use marijuana at around the same levels, but the former are almost four times as likely to be arrested for possession.” The supplied context implies a meaning for the “statistic”, but that meaning – that the difference indicates discrimination and there is no other reasonable interpretation – is little more than a wild guess. It could be correct. Or it could be that the cops use dope possession to bust gang members or other known troublemakers to keep them off the streets – a form of discrimination, for sure, but one having nothing to do with skin color. So in that context, the algorithm (statistics sensu stricto) that forecasts recidivism may be more accurate than the popular statistic (statistics sensu lato), even though the meaning of the latter seems immediately apparent. I personally strongly oppose the use of such algorithms – the idea of using probability to incarcerate people is against everything I believe about justice. However, the fact that we find some countervailing statistic sensu lato that seems sensible doesn’t mean it’s right.

    • Jim,

      Interesting commentary. I also oppose the use of such algorithms because biases are implicated in their construction, even if skin color is not a criterion. Could it not be assumed, for example, that street gang members possessing or selling dope are non-white? In fact, drug dealing and violence are often portrayed as being conducted by brown and black individuals. So while there may be no explicit skin-color criterion, our media highlight that violence and drugs are concentrated largely in minority urban communities.

      We have been discussing these issues here in DC as part of our education task force. And we were able to explore the incarceration of minorities through the COPS program, which brought psychologists, lawyers, and police from all over the US to DC. Much has to change within the criminal justice system in order to reduce and end discriminatory and unjust treatment of minorities.

      Bottom line, anyone who has a commitment to fairness, justice, and equality should be concerned with the use of algorithms.

    • Kevin I says:

      I’ll push back gently on that. The alternative to these models isn’t no model, it’s another model that’s potentially even worse. For parole and bail decisions, the probabilities of reoffending or failing to appear are absolutely relevant, and if there’s no formal model estimating these probabilities, then they’ll be subjectively estimated by the decision maker. This will be wildly inconsistent from one decision maker to the next, or even from one case to another. If we had an acceptable formal model, I would strongly prefer its use as one component of the decision.

      That being said, existing models have some serious issues with the data used to train them. I think it’s worth the effort to try to correct this and come up with something usable.

Leave a Reply
