Jerrod Anderson points us to Table 1 of this paper:

It seems that the null hypothesis that this particular group of men and this particular group of women are random samples from the same population is false.

Good to know. For a moment there I was worried.

On the plus side, as Anderson notes, the paper includes distributional comparisons:

This is fine as a visualization, but I don’t think there’s much here beyond the means and variances. Seems a lot of space to devote to demonstrating that men, on average, are bigger than women. There’s other stuff in the paper as well, but my favorite is the p-value of 4.76×10^−264. I love that they have all these decimal places. Because 4×10^-264 wouldn’t be precise enuf. That’s even worse—actually, a lot worse—than this example.

Am I reading this correctly? The first p-value in the mean difference test column is ~0.00 and the second is 1.75 x 10^(-287)? So the latter isn’t small enough to warrant “~0.00”?

The paper is employing a wonderful threshold for the “approximately” operator ~. Truly special.

The threshold is very likely 2.2 x 10^(-308), the value below which you get double-precision floating-point underflow … if the authors had been really committed they could have computed the log of the p-value directly and given a precise number smaller than that …
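For the curious, both the underflow point and the log-scale workaround are easy to check. Here is a minimal Python sketch using only the standard library; the asymptotic tail formula is my own shortcut for illustration, not anything from the paper:

```python
import math
import sys

# Smallest positive *normal* double: ~2.2e-308. Below this you get
# gradual underflow through subnormals, bottoming out near 4.9e-324.
print(sys.float_info.min)  # 2.2250738585072014e-308

def log10_normal_sf(z):
    """Asymptotic log10 of the standard-normal upper tail P(Z > z),
    accurate for large z and immune to underflow."""
    ln_p = -z * z / 2 - math.log(z * math.sqrt(2 * math.pi))
    return ln_p / math.log(10)

z = 45.0
# Computing the tail directly underflows to exactly 0.0 ...
p_direct = math.erfc(z / math.sqrt(2)) / 2
# ... but staying on the log scale gives a finite answer (~10^-442).
print(p_direct, log10_normal_sf(z))
```

In R the same trick is built in: `pnorm(z, lower.tail = FALSE, log.p = TRUE)` works directly on the log scale.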

Yep, that sounds right. Reporting p-values to machine precision. I was going to say “p-value fetishism at its finest”, but you are right, they could have taken it to another level. “~0.00” is such a cop-out.

Good catch. I thought the authors just had a wacky sense of humor.

“… my favorite is the p-value of 4.76×10^−264.[..] Because 4×10^-264 wouldn’t be precise enuf. ”

You mean 5×10^-264. Proper rounding is important!

Well, Andrew did say that it “wouldn’t be precise enuf”, which is true :)

Shouldn’t that be 5 x 10^-264, rounding up?

If they are calculating this in double precision, where are the other digits? I would expect them to print them all. Can’t be imprecise, can we?

Hi Andrew,

I think you are a bit unfair with this paper:

– They do talk about effect sizes as well, not just these ridiculously small p-values. Check the 7th column of Table 1.

– In the supplementary materials of the published version ( https://academic.oup.com/cercor/article/28/8/2959/4996558#118682756 ), there are lots of comparisons of the distributions (via the shift function ~comparison of the deciles), where they do go beyond means and variances.

Best,

Szabolcs

Szabolcs:

Sure, it could be that this paper has good things in it too. I make lots of mistakes, but I also do some useful things, or so I hope. Could be the case with these authors too. One thing I tell students is not to write anything that you don’t want people to read. That p-value of 4.76×10^−264 is wrong on so many levels . . . but, sure, there could be other stuff in the paper that’s useful.

The probability of randomly selecting a particular atom in the observable universe is about 10^−80.

Sure, but the chance of picking a random permutation of all the atoms of the universe (for example, ours) is way smaller than 5×10^-264.
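That claim is easy to sanity-check with the log-gamma function (Stirling-scale, without ever forming the factorial itself). A quick sketch, taking the conventional rough figure of ~10^80 atoms as an assumption, not a measurement:

```python
import math

N = 1e80  # rough conventional count of atoms in the observable universe

# log10(N!) via the log-gamma function -- no overflow, since we never
# form N! itself: lgamma(N + 1) = ln(N!).
log10_factorial = math.lgamma(N + 1) / math.log(10)

# The chance of one particular ordering of all those atoms is 1/N!,
# i.e. roughly 10^-(8 x 10^81) -- incomparably smaller than 10^-264.
print(log10_factorial)
```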

Isn’t the mathematics of that interesting? You have an extrapolated point which is ridiculously far away. The perspective from which the point was generated is equally far away from that point. Look at ‘us’ from its perspective: it exists in a universe which can be counted in rational percentages, so it is anywhere from 1 to 0 in a variety of orientations that map its relations. The same then applies to the us point of observation. It’s mathematically pure relativity that extends to big as well as to small. One mathematical way to look at this is to take each as a plane and treat one as inverting into the other, where inversion is scaling by magnitudes up and down and may also rotate one planar end versus the other. One neat thing is this kind of statistic is some multiple derivative level away from whatever processes occur. I’d have to read more to figure out how many. But then a weakness of null rejection is that its derivative nature is generally hidden, that rejection of a hypothesis as a derivative measure has a specific slope. Since many other lines share that slope in that vicinity, not only can the sign flip, but that slope may itself be near instantaneous, meaning there is always another derivative level, and the universe of those lines can be presented as an overlay or expansion, for example, of potential and then likelihood within the potential – generating and compressing complexity – so the result ‘captured’ is a flash in the pan which may be real or fool’s gold.

My daughter teaches pilates and tells me her most difficult clients are typically women who grew up before women were allowed to play sports; they have never developed the large motor functions used in athletics. A good pre-school teacher will talk about ‘emergent curriculum’, which means how essential skills start to manifest in each kid and how you tailor instruction to that child’s emerging strengths and weaknesses. How can you ascribe biological reasons to differences in reaction times when male and female roles are different in society? If you do, then do small biological differences reflect that history? I don’t know, but I think that’s at least an interesting question; how is it tested? Athletes are trained. Similarly, educated people are trained. If you’re then left with some ‘bit’ of difference – like testing the educated versus the uneducated of the other group – then what are you actually identifying other than this measured difference? That reminds me of the low IQ reports associated with immigration in the early 20th century. These included groups that now test higher than average, so the tests would appear to measure some amount of unlocked potential, not actual potential.

Good points.

“before women were allowed to play sports”

When was this, approximately, in the US? Till you wrote it I never imagined anyone from that era was still alive and learning pilates!

to echo Martha’s comment, I grew up in Gainesville, Florida (a town of about 100,000 whose economics and social structure was dominated by the University of Florida, which is a “big college sports” school) in the 70s – leaving to college in the early 80s. School sports did not start until 9th grade, so until then, the options were “Boys Club” and “city league” and the available sports were football, baseball, football, basketball, and football. I played “Boys club” which was very white, very suburban, and all boys, except they did have all-girls cheerleading for football. I had a couple of friends who were girls who played city league baseball and basketball – but these were effectively boys teams that allowed the very rare girl to play. I think city league softball for girls was the only team sport. About 1980 I think soccer opened up for boys and probably girls. There were swim and tennis teams that included girls. Surprisingly very few girls played golf (there were no HS girls golf teams at the time).

Hi Jonathan,

Which years constituted the period when women were not allowed to play sports? Before the 40’s?

RE: IQ

Here is an interesting article. As an aside, I am not a proponent of the IQ really. I cringe when someone says, ‘so and so has a high IQ’. It sounds so amateurish. Most professionals are in the 119-140 range anyway, I understand, which has been considered sufficient for most intellectual endeavors.

https://www.businessinsider.com/iq-tests-dark-history-finally-being-used-for-good-2017-10

Rahul and Sameera,

Jonathan’s phrase “before women were allowed to play sports” might be a little too strong. Perhaps “when few women had the opportunity or encouragement to play sports, and many were discouraged from participating in sports” would be more accurate. Like many social changes, the transition from sports being mostly prohibited for women to the current situation was slow and varied from place to place and between social classes. The history of “progressive” schools (mostly elite) that encouraged women’s participation in sports goes back a long time, but for a long time, women were mostly excluded from participating in sports in a variety of ways – in some cases because of beliefs that sports were bad for women, in some cases because they were considered “unladylike”, in some cases because sports were considered a “male domain” and women were not welcome, and in some cases simply because sports were considered unimportant for women, so funding for women’s sports was much less than for men’s sports. (I’m saying this just from personal observation of how things have slowly changed, from stories about my grandparents’ generation to what I see or hear about children today.)

FWIW: In the US, Title IX (http://www.ncaa.org/about/resources/inclusion/title-ix-frequently-asked-questions) was passed in 1972. I seem to recall sensing that its implementation wasn’t instantaneous, but rather the equivalence of support for women’s and men’s sports took time to develop (and, to a degree, may still be ongoing).

Martha

Thanks for that clarification. I gathered that something like what you described was the case. Pilates is actually more beneficial than most exercises. That plus building anaerobic capacity.

Just for the record:

https://twitter.com/StuartJRitchie/status/1069352225691058185

[BTW I am amazed/thrilled I can post again on this blog!]

Ritchie says on twitter he used Bayes factors. Can anyone find any evidence of that in the paper? I only grepped for it and quickly scanned the paper but didn’t see anything. What priors did he use? The use of so-called default priors is insanity, so I wanted to know if he used those canned BF calculators. I have also seen a calculator on the web that converts a t-value to the BF; it caused a sensation when it came out because it all became so easy. I also saw an excel sheet by a psychologist that converts a p-value to BFs. Man oh man.

They mention it multiple times in the published paper, writing

“In an additional Bayesian analysis of the mean difference, we used the BayesFactor package for R (Morey and Rouder 2015) to compute BF10 values from a Bayesian t-test (using the ttestBF function; see Supplemental Materials).”

“A set of Bayesian t-tests (see Supplemental Materials and Table 1)”

“The Bayesian analyses, also shown in Table S1, again confirmed these results”

“Once more, all of the sub- regional analyses were confirmed using the alternative Bayes Factor analyses.”

The phrase “Bayes Factor” also shows up in the caption for Table 1.

Here is the entirety of the text on the Bayesian analysis from the supplemental materials (a 50 MB Word doc packaged in a 49 MB zip file for some reason):

“Bayes Factor analysis. We calculated Bayes Factors (BF10 values) for each comparison. These values indicate the probability of the alternative hypothesis (in this case that there is a sex difference) compared to that of the null hypothesis (in this case that there is no sex difference). For example, a BF10 value of 2 would indicate that the hypothesis of a sex difference is twice as likely (2/1 = 2) than the hypothesis of no sex difference. Conversely, BF10 values under 1 indicate a higher likelihood of the null hypothesis than the alternative hypothesis. For example, a BF10 value of 0.5 indicates that the null hypothesis of no sex difference is twice as likely (1/0.5 = 2) as the hypothesis that there is a difference.”
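One nitpick on that gloss: a Bayes factor is an evidence ratio that multiplies whatever prior odds you start with, so “twice as likely” only follows if your prior odds were even. A minimal sketch of the arithmetic (my own illustration, not the authors’ code):

```python
# A Bayes factor updates prior odds to posterior odds:
#   posterior odds = BF10 x prior odds.
def posterior_odds(bf10, prior_odds=1.0):
    return bf10 * prior_odds

# With even (1:1) prior odds, BF10 = 2 does give 2:1 for a difference,
# and BF10 = 0.5 gives 2:1 for the null ...
print(posterior_odds(2.0), posterior_odds(0.5))
# ... but with skeptical 1:10 prior odds, BF10 = 2 still leaves the
# null favored at 5:1.
print(posterior_odds(2.0, prior_odds=0.1))
```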

If only Andrew were on Twitter, then he and Stuart Ritchie could argue fruitlessly about this in a format extremely poorly designed for substantive discussion.

More seriously, I think the distribution plots show more than just that men are larger than women. It’s pretty clear that the distributions almost completely overlap for some regions (e.g., the two rightmost panels in the top row), while for others they do not (e.g., a bunch of the others).

Noah:

Yes, I hate twitter. Lots of name-calling, not much substance.

Regarding the distribution plots, I think the thing is to follow Tukey’s general principles and pull out averages and then focus on what’s left. So, first you have the general pattern that men are larger than women on average, then you can have relative averages for each region, then you can do a similar comparison of standard deviations (or some other measure of dispersion), and then get to the shape. The plots in the figures shown above are fine for what they are, but it’s really hard to see the subtleties with the main effects being so large.

The advantage of a series of plots that first shows center and scale, and only then shows relative shapes, is that you can make a lot more direct comparisons of centers and scales if you reduce each distribution to two numbers. You can compare things directly which will be more informative than having to stare at all these shapes. Then once you have the comparisons of the centers and scales, you can go back to the distributions and see what more can be learned.
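The center-then-scale-then-shape workflow is easy to sketch with the standard library. The numbers below are made up for illustration, not taken from the paper:

```python
import statistics as st

# Hypothetical volumes (arbitrary units) for one region -- illustrative
# numbers only, not the paper's data.
men   = [102, 98, 110, 95, 105, 99, 108, 101]
women = [90, 88, 97, 85, 93, 87, 95, 89]

# Step 1: pull out and compare the centers ...
m_mean, w_mean = st.mean(men), st.mean(women)
print("mean difference:", m_mean - w_mean)

# Step 2: ... then the scales ...
m_sd, w_sd = st.stdev(men), st.stdev(women)
print("sd ratio:", m_sd / w_sd)

# Step 3: ... and only then look at shape, on residuals with center
# and scale removed, where subtler differences become visible.
m_shape = [(x - m_mean) / m_sd for x in men]
w_shape = [(x - w_mean) / w_sd for x in women]
```

With each distribution reduced to two numbers at steps 1 and 2, dozens of regions can be compared on a single dot plot before anyone has to stare at overlaid density curves.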

I think people misinterpreted my “Seems a lot of space” comment above. My point was that a decomposition can allow for more effective comparisons. The display is very effective at showing that men’s brains are bigger than women’s in many different subcortical structures, and that’s fine to know, but with care you can show a lot more information in the same space, first by subtracting off (or dividing by, or otherwise adjusting for) averages, and second by presenting first centers, then scales, then shapes. I recognize that the following graph in the paper does adjust for total volume, and that’s great, but really you can do that adjustment right away, and I think that would be more informative.

Shravan:

That’s a funny comment in that he refers to “the precise p-values,” which leaves me wondering why he wasn’t really precise. Why not report 4.763401920432098324×10^−264?

More seriously, I don’t think my post above is “stupid,” nor do I think it is “trivial” as claimed by another commenter in that thread. Communication is important; it’s not trivial. When we report meaningless precision, we make our tables more difficult to read, thus making it more difficult to communicate to our audience, and indeed to ourselves. In addition there are conceptual problems with these p-values, first most obviously that the purported precision in these numbers is essentially meaningless as it depends crucially on long-tail properties of the normal distribution, and second because, as noted above, the null hypothesis of zero difference is uninteresting. It would be clearer and more communicative to just say that the difference is 37 standard errors from zero (that has some problems too as there’s an assumption of independence and zero sampling bias, but, unlike the p-value, it’s on the right scale). Or, if you want to follow some convention to report all differences as p-values, and to report all p-values, no matter how low, to three significant figures, then go for it, but realize that you’re filling your paper with meaningless digits. I agree that this point is not deep, but, again, I don’t see it as “stupid” or “trivial.” Communication matters. Even spelling mistakes (to take something much less important) make a difference in making a paper more difficult to read. But this is worse than a spelling mistake because it represents either some conceptual errors (as noted just above) or else a willingness to flood the zone with meaningless digits based on a desire to follow convention. Conventions are fine but ultimately the goal should be communication. If you have good stuff in your research paper, that’s great! No need to distract readers with silly things like “4.76×10^−264.”
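To see concretely why those trailing digits are meaningless: on this scale, a tiny nudge in the z-score moves the p-value by more than an order of magnitude. A quick sketch, using an asymptotic normal-tail formula of my own choosing rather than the paper’s software:

```python
import math

def log10_tail(z):
    """Asymptotic log10 of the standard-normal upper tail; a rough
    approximation that is fine for order-of-magnitude comparisons."""
    return (-z * z / 2 - math.log(z * math.sqrt(2 * math.pi))) / math.log(10)

# Around z = 35 the p-value sits near 10^-268; nudging z by just 0.1
# shifts it by about 1.5 orders of magnitude. The "4.76" in a figure
# like 4.76e-264 therefore carries no usable information.
print(log10_tail(35.0), log10_tail(35.1))
```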

I agree of course. I advised him (on twitter, which I agree is overall a stupid medium for discussion, though it’s OK for dissemination of papers or links to github repos of interest) to take the substantive point implicit in your comment seriously and not get focused on the ridiculing aspect. I think people are too thin-skinned and defensive in general. In his place I would just concede the point and try to do better next time round.

Shravan:

I too think he should concede the point! But I wouldn’t really mind if he wanted to disagree, to argue, perhaps, that it took less effort for him to follow convention than it would’ve been for him to think carefully about each number. Or he could argue that the relative values of these extreme p-values convey some useful information if they’re interpreted appropriately. (I’d disagree with such a claim, but I guess he could make a case for it.) But, as we’ve discussed, one could say that scientists are trained to deflect criticism, not to take it seriously.

My guess is that he’d be more inclined to concede if this didn’t start as something that seemed too personal. His tweet suggests to me that he felt targeted by the post (like it was simply, “Hey, look at this dummy!”). The post didn’t do all that much ridiculing, but did make a small joke at the end at his expense. It also didn’t attempt to generalize about the problem (i.e., depersonalize it). If it read more like, “Here’s a number that’s way too small to be worth reporting; people may think this is good to do, but it’s not, because yadda yadda,” it might have had a positive impact. As it is, the passive-aggressive thing that’s happening on both sides, where neither is speaking directly to the other, is uncomfortable to even read from a passer-by’s point of view. (I read it anyway and learned something I didn’t know, though, so thanks for that part.)

Joshisanonymous:

It’s interesting that you write about the post seeming personal, because I purposely did not include the names of the authors, in part in an attempt to make it not be personal. In any case, I don’t think that either the author of that article or I are being “passive-aggressive” by not speaking directly to the other. That article was in the published record, and my responses to it are responses to the published article, not to its authors (and certainly not to a single one of a long list of authors, as I have no idea who prepared that table). And, again, my post is published too. If someone is annoyed at my post and thinks it’s stupid, rude, unpleasant, etc. . . well, I might disagree—in fact, I do disagree!—but they’re free to express their opinion, and I don’t see any passive-aggressiveness to it. This dude is sharing his views on a public blog post. Nothing wrong with that.

I like that he responded to you by sarcastically thanking you for your unsolicited advice, given that Twitter consists almost entirely of unsolicited advice.

I actually ran into the same thing in my DPhil thesis, where I had purposely reported many digits as a way of being reproducible – given the thesis was all about the computational methods. I actually found it annoying that the examiners wanted me to treat the example analyses seriously.

Anyway, I learned if I do this again, I need to add a note about the precision not to be taken seriously in any interpretation.

I’d like to push back on the idea that reporting lots of decimal places indicates some sort of innumeracy. I think a lot of the time when people publish p-values or probabilities to many decimal places they just default to reporting the estimates in the form they were spit out by their program. There’s no implicit assertion that the decimal places actually matter. It’s just reporting results by copy and paste. Sure, some readers will wrongly perceive an implicit assertion, so it’s best practice to omit small decimal places, but it’s hardly something that deserves mockery.

Z:

We could call it innumeracy or we could call it a poor choice, and it’s not necessarily the conscious choice of the author of the article, but it’s a problem somewhere. Kinda like if a sportswriter reported Steph Curry’s height as 6 feet 3.240193 inches: maybe it’s because the writer didn’t know better or maybe the writer was copying-and-pasting on the computer or maybe there was a flaw in the newspaper’s style guide, but somewhere along the way something went wrong.

For that matter, if I publish an article full of misspellings, that doesn’t make me illiterate. I might be literate in other ways and just not know how to spell, or maybe I don’t care about the misspellings (not realizing that these can distract the reader, making my article harder to follow), or maybe there’s just a quality control problem in my proofreading. But, whatever it is, misspellings can make my article harder to read.

Z said: “I’d like to push back on the idea that reporting lots of decimal places indicates some sort of innumeracy. I think a lot of the time when people publish p-values or probabilities to many decimal places they just default to reporting the estimates in the form they were spit out by their program. ”

I thought that considering significant figures was *part* of numeracy at the level of scientific publication.

Putting aside problems with p-values generally for a moment, how would you actually report that ridiculously small p-value (assuming you calculated it, and wanted to tell someone)? Just p < X, where X is some arbitrary alpha level you care about? Report it as .00 (if you only wanted 2 decimal places)? Or do you just have an * with a note where you say, "Look, these p-values are all super small. Reporting them would be ridiculous, so just trust me that they're tiny."

I mean, if the problem here is just that the null hypothesis was dumb, and thus all the p-values are worthless, that’s fair. Quibbling over the number of decimal places to report seems inconsequential to me though. That said … uh, 4×10^-264 IS kinda more precise than anyone could possibly care about, and IS pretty funny juxtaposed with the ~0.00 right above it.

Sean:

I guess you could simply report it as 10^−264. Getting rid of that ridiculous “4.76” would be a start. Or just reporting the estimate and standard error. The point is that, even beyond all technical issues of why the number is so meaningless, the p-value is a comparison to a null hypothesis that is of no interest here. If there’s a difference between a z-score of 10 and a z-score of 20 here, it’s because of the size of the difference, not because of the p-value. At some point, sampling error becomes irrelevant and you should be more concerned about systematic error, which for example you could study by looking at how much the averages vary when comparing data from different countries, or different labs, or whatever. (I’m speaking generically here without reference to the specifics of this particular dataset.)

Ah, ok, that does make sense. In a context where p-values are testing a null hypothesis of no interest, exacting precision is a waste of space. When you’ve talked about this issue in the past, I came away with the impression that the point was “reporting too many decimal places = innumeracy,” which seemed like hyperbole. But if I’m understanding correctly, the criticism is really more about the mindless application of statistics that aren’t very useful (and reporting THOSE to a needlessly high degree of precision).

I report to two decimal places routinely, because the formatting style of my discipline (psychology) and thus the journals request it. It’s not innumeracy (hopefully!), but rather just less work than arguing about it with some reviewer / proofreader when it ultimately has little bearing on the results.

I would imagine in another context where measurement was very good and precision was needed (not anything in social science! Maybe like a really accurate scale that measures weight), then more decimal places would be merited.

Sean:

It depends on variation as much as precision. For example, you might be able to measure an animal’s weight to the nearest gram, but if the weight varies by 100g from day to day, then the precise measurement at a particular time might not matter for anything, unless it’s tied to other measurements taken at the same time.

Kaminsky et al. (Nature Genetics 2009)* reported P values of 1.2•10^−294 from 40 twin sets and P < 9.9•10^−324 from only 20 sets. I've often seen such fantastic P-values used to rank "strength" of results.

If I've computed correctly, I think the P < 9.9•10^−324 is not only below their machine's underflow level, but also its inverse is greater than the number of space-time points in our visible universe, counting one Planck volume per Planck time since the big bang. All from only 20 data points. Impressive result in so many ways!
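For anyone who wants to redo that back-of-envelope count, here is a sketch. Every constant is a rough, round-number assumption (observable-universe volume, Planck scales), so the point is the comparison, not the digits:

```python
import math

universe_vol_m3 = 3.5e80    # rough comoving volume of the observable universe
planck_vol_m3   = 4.2e-105  # ~(1.6e-35 m)^3
age_s           = 4.35e17   # ~13.8 billion years in seconds
planck_time_s   = 5.4e-44

# log10 of (Planck volumes in the universe) x (Planck times elapsed):
log10_points = (math.log10(universe_vol_m3 / planck_vol_m3)
                + math.log10(age_s / planck_time_s))

# ... versus log10 of 1 / 9.9e-324 (a subnormal double, but the log
# still evaluates fine):
log10_inverse_p = -math.log10(9.9e-324)
print(log10_points, log10_inverse_p)  # the inverse p wins by ~77 orders
```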

I presented a collection of such howlers at U Bristol 2010 entitled "Subatomic and Subphysical P-values: Adventures in Pseudo-Precision", then at their behest submitted to the Int J Epidemiol in 2011. It was not accepted – seems the genetics reviewers' feelings were hurt by my dismissive attitude toward their statistics…

"There's not much science in science."

*Kaminsky ZA, Tang T, Wang S-C, Ptak C, Oh GHT, Wong AHC, Feldcamp LA et al. DNA methylation profiles in monozygotic and dizygotic twins. Nature Genetics 2009;41:240-245. See p. 242.

“I presented a collection of such howlers at U Bristol 2010 entitled “Subatomic and Subphysical P-values: Adventures in Pseudo-Precision”, then at their behest submitted to the Int J Epidemiol in 2011. It was not accepted – seems the genetics reviewers’ feelings were hurt by my dismissive attitude toward their statistics…”

This sounds worth trying again to publish — even if only on your own website, with a link to it here.

“Seems a lot of space to devote to demonstrating that men, on average, are bigger than women”

But isn’t a big part of the paper arguing that there is greater variance in men than women, after accounting for the different means?

Given that these p-values appear in a table of many p-values, which span orders of magnitude, I am not sure why one shouldn’t just report them.

By the way, I also reported p-values as small as p < 10^-217 in this comment (https://psyarxiv.com/bn24c/), though there part of the point was that in a randomized experiment the null indeed would have been true.

Dean:

As I said in a comment above, I don’t think those distribution plots are a good way of comparing variance. I think it would be better to compare variances directly; by not plotting the distributions, you can show the variance comparisons much more directly. There’s certainly nothing wrong with graphing all those distributions—it’s a great exploratory step—but once you have that visual sense of the distributions, it makes sense to pull out what you want to compare. My point is not that those graphs are ridiculous; rather I’m just discussing how one could get more out of the data so as to more informatively visualize the comparisons of particular interest.