Glenn Shafer tells us about the origins of “statistical significance”

Shafer writes:

It turns out that Francis Edgeworth, who introduced “significant” in statistics, and Karl Pearson, who popularized it in statistics, used it differently than we do.

For Edgeworth and Pearson, “being significant” meant “signifying”. An observed difference was significant if it signified a real difference, and you needed a very small p-value to be sure of this. A p-value of 5% meant that the observed difference might be significant, not that it definitely was. Details are in my working paper, On the nineteenth-century origins of significance testing and p-hacking.

Perhaps knowing this history could help us talk about what to do with the word now.

P.S. Sander Greenland sends along this article by Michael Cowles and Caroline Davis from 1982, “On the origins of the .05 level of statistical significance.”

10 thoughts on “Glenn Shafer tells us about the origins of “statistical significance””

  1. Nice! The abstract does contain a little glitch, though, doesn’t it? (“The probability that a normally distributed random variable is more than three probable errors from its mean is approximately 5% [sic]”)

  2. Thanks. Looking forward to reading the linked paper. “Signify” was a fairly common word and generally meant a signal that needed to be decoded. It was typically invoked socially to mean what you present as a signifier and what is read by those around you, because the social world was very oriented around types that defined social classes and thus the obligations attached to them, in eras when those obligations meant something. A very good example of this is the end of the movie Remains of the Day, where the butler’s borrowed Rolls breaks down and the locals identify him by the car, but the gentleman (a doctor, I believe) they contact to provide gentlemanly companionship while the car is repaired immediately understands this is a servant. That kind of social obligation was a holdover, but in the not-so-distant past the signifiers carried much more social weight, and failing to pay proper heed could bring down legal consequences. When I watch rap videos with men strutting with an entourage, I think of how in Elizabethan times a man of horse, meaning one of that particular class, adopted a social walk that, per descriptions, was meant to take up as much space as possible, legs wide, arms often elbows out, so everyone of the lower orders would get out of the way (or face consequences).

    One of my favorite signifiers was when I was ‘shopping’ with a girlfriend in the mid-70’s and she found a bag (at Bergdorf’s, I think) by a new Italian luxury brand (that’s since become huge). The bag was very expensive and not even finished – the closure simply pushed through 4 small cuts – but it signified a whole ton of things for an NYC fashion person. She bought it. Think about that in terms of significance: she was entirely correct in her assessment that this would be a major brand, so you might assign a ‘significance’ to the choice. But that’s an anecdote. And lots of people buy things they think will become hip that don’t, or which fade quickly and don’t become major brands. That’s more how significant was used in the past: it signified something, but whatever it signified was meant to be read within a specific context. A famous example is in Hamlet: the use of ‘protest’ has changed so much that people think Gertrude is confessing when she says the actress doth protest too much, when Gertrude meant at the time that this was over-acting.

    But it’s entirely normal to read into things what you believe must have been there. I do it all the time. It’s a significant part of my work. It’s inherent in any inferential system as that inferential system extends above slime mold capabilities of organizing thought and behavior. That gets to signaling again: in those days, you couldn’t use a telephone or radio to contact a unit of your army or navy, so you’d raise signal flags or fire cannons or send up flares (and rarely strike up a band). Those all had to be read for what they signified given the actual conditions and whatever prior understandings existed.

    Taken up too much of your time.

  3. Shafer says: “Perhaps knowing this history could help us talk about what to do with the word [significance] now.”

    I have a lot of respect for Glenn Shafer, but here he is kind of suggesting that we need a committee to sit down and decide how we use the word “significance” in a statistical context.

    I do not believe that is the way languages usually develop in a democracy. I hope that people will continue to give this word any meaning that it can justifiably have.

    Personally I do not like phrases such as “statistically significant at the 5% level” because it implies that something could be “statistically significant at the 20% level” which sounds quite strange. Also, saying something is simply “statistically significant” feels way too vague. However, these are just my preferences.

    • People can continue to use the word as they please, while at the same time a committee of experienced statisticians who care about the future direction of the field tries to come up with an updated definition of what the word means in today’s scientific landscape. These aren’t mutually exclusive things. There isn’t even an inkling in Shafer’s post that indicates he wants to prevent (not sure how he would go about doing this) people from using the word “significant” as they personally see fit. E’rybody got to make e’rything political these days.

      • The naked statistician is right about general language use, though it has nothing to do with democracy and everything to do with how meaning emerges by metaphor and analogy and then evolves. Legislation pretty much never works.

        For example, Breck and I did a project on classifying chief complaints at emergency rooms, which are as minimal as texts from a teenager. The prescribed standard forbids marking left or right, so “brkn arm” is OK but “brkn lft arm” is forbidden. Almost every chief complaint that involved a limb has an indicator of which one, as that’s how medical professionals talk to each other, because they have to be sure not to treat the wrong limb.

        On the other hand, in technical settings, we often legislate meanings of common words, like “line” or “planet”, so that we can communicate more efficiently. Now if we can only get “random effect” nailed down we’ll be golden.

  4. Very readable and informative!
    (Only through to page 9 right now.)

    When I read Airy many years ago I was a bit lost on why he was so dismissive of Laplace. Now it’s startling that CS Peirce was put in with Laplace’s followers who thought they were getting [could get] practical certainty. Peirce did take the view that in practical science (e.g. engineering), as opposed to science of discovery (e.g. inference), one should often act as if certain, or at least that the standards were much lower. In his view of science of discovery, no one should ever be certain.

    Also, when Peirce was so disparaging of Laplace’s views on statistics in his later writing, I had no sense this was a common view.
