“Not within spitting distance”: Challenges in measuring ovulation, as an example of the general issue of the importance of measurement in statistics

Ruben Arslan writes:

I don’t know how interested you still are in what’s going on in ovulation research, but I hoped you might find the attached piece interesting. Basically, after the brouhaha following the observation by Harris et al. (2013) that the same research groups used very heterogeneous definitions of the fertile window, the field moved towards salivary steroid immunoassays as a hormonal index of cycle phase.

Turns out that this may not have improved matters, as these assays do not actually index cycle phase very well at all, as we show. In fact, these “measures” are probably outperformed by imputation from cycle phase measures based on LH surges or counting from the next menstrual onset.

I think the preregistration revolution helped, because without researcher flexibility to save the day a poor measure is a bigger liability. But it still took too long to realize given how poor the measures seem to be. You wouldn’t be able to “predict” menstruation with these assays with much accuracy, let alone ovulation.

The models were estimated in Stan via brms. I’d be interested to hear what you or your commenters have to say about some of the more involved steps I took to guesstimate the unmeasured correlation between salivary and serum steroids.

I think the field is changing for the better — almost everyone I approached shared their data for this project (much of it was public already) and though the conclusions are hard to accept, most did.

The preprint is here.

This is a good example of the challenges and importance of measurement in a statistical study. Earlier we’ve discussed the general issue and various special cases such as the claim that North Korea is more democratic than North Carolina.

My hypothesis on all this is that when students are taught research methods, they’re taught about statistical analysis and a bit about random sampling. Then when they do research, they’re super-aware of statistical issues such as how to calculate a standard error and also aware of issues regarding sampling, experimentation, and random assignment—but they don’t usually think of measurement as a statistical/methods/research challenge. They just take some measurement and run with it, without reflecting on whether it makes sense, let alone studying its properties of reliability and validity. Then if they get statistical significance, they think they’ve made a discovery, and that’s a win, so game over.

I don’t really know where to start to help fix this problem. But gathering some examples is a start. Maybe we need to write a paper with a title such as Measurement Matters, including a bunch of these examples. Publish it as an editorial in Science or Nature and maybe it will have some effect?

Arslan adds:

The paper is now published after three rounds of reviews. One interesting bit that peer review added: a reviewer didn’t quite trust my LOO-R estimates of the imputation accuracy and wanted me to say they were unvalidated and would probably be lower in independent data. So, I added a sanity check with independent data. The correlations were within ±0.01 of the LOO-R estimates. Pretty impressive job, LOO and team.

Leave-one-out cross validation FTW!
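
For readers who have not seen this kind of check, here is a minimal sketch of the idea: compare a leave-one-out estimate of predictive correlation with the correlation obtained on independent data. It is not Arslan's analysis. His models were fit in Stan via brms, whereas this sketch assumes ordinary least squares, brute-force refitting, and simulated data. The function names, coefficients, sample sizes, and seed are all made up for illustration.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_ols(coef, X):
    """Predictions from coefficients returned by fit_ols."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


def loo_predictions(X, y):
    """Predict each observation from a model refit without that observation."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit_ols(X[keep], y[keep])
        preds[i] = predict_ols(coef, X[i:i + 1])[0]
    return preds


def simulate(rng, n, beta, noise_sd):
    """Simulated predictors and outcome (purely illustrative data)."""
    X = rng.normal(size=(n, len(beta)))
    y = X @ beta + rng.normal(scale=noise_sd, size=n)
    return X, y


rng = np.random.default_rng(2021)
beta = np.array([0.5, 0.3, 0.0, 0.0, 0.0])  # assumed values, not from the paper
X, y = simulate(rng, 150, beta, 1.0)
X_new, y_new = simulate(rng, 150, beta, 1.0)  # stands in for independent data

coef = fit_ols(X, y)
in_sample_r = np.corrcoef(predict_ols(coef, X), y)[0, 1]
loo_r = np.corrcoef(loo_predictions(X, y), y)[0, 1]
independent_r = np.corrcoef(predict_ols(coef, X_new), y_new)[0, 1]

print(f"in-sample r:   {in_sample_r:.3f}")
print(f"LOO r:         {loo_r:.3f}")
print(f"independent r: {independent_r:.3f}")
```

The in-sample correlation is the optimistic number. The leave-one-out correlation is meant to approximate what you would see in new data, which is what the comparison with the independent sample checks.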

9 thoughts on ““Not within spitting distance”: Challenges in measuring ovulation, as an example of the general issue of the importance of measurement in statistics”

  1. “They just take some measurement and run with it, without reflecting on whether it makes sense, let alone studying its properties of reliability and validity.”

    Yes, I strongly agree! Most of the work in the Unbelievable Results category fails before the data analysis even starts because the measurement method has untenable assumptions or simply doesn’t measure what it purports to measure. It’s a problem with people doing “data analysis” instead of science. For science you need to establish the method and test it multiple times before you deploy it.

  2. When I think about measure reliability, I picture SEM models: little e’s with arrows pointing at boxes (measures), which have arrows pointing at circles (constructs) with arrows pointing at each other; weights on each arrow representing reliabilities and sample correlations and population correlations. Multiply across the weights to see the devastating impact of using too few measures with too low reliabilities. Unfortunately, most people who use statistics don’t take a course on SEM, so they don’t understand those kinds of models.

    The thing is, though, they don’t need to. Most studies hypothesize quite simple models–one or two boxes, one or two circles, arrows all around. Plus, I’m not even talking about a model–I’m talking about a diagram. A way of visualizing how low reliability flows downstream to interact with true variance and sampling error to blow up your standard error. And to construct and interpret that kind of diagram, you don’t need all the theory and mathematics of SEM. You can just lift the tool right out of that context and use it as a quasi-mathematical visual aid.

    Because, the reality–the one we want to convince people of–is that measurements are so sensitive to all these sources of variance that a diagram like this doesn’t need to be anything like realistic, much less identified, to illustrate the problem. Imagine an app with a few pre-made options for diagrams–maybe three different versions of structural models and three different versions of measurement models. People select one of each and draw arrows with their fingers. They plug in plausible values for weights–minimum population correlations, maximum reliabilities–or select typical values from a drop-down menu. The app spits out fuzzy ranges for standard errors, maybe plots them as a function of reliabilities. Most people will be shocked.

    I’m not talking about power analysis or Optimal Design. It’s not a statistical method. It’s just a teaching tool for students and a visual aid for researchers. The trick is to make the whole thing so simple that people are comfortable just playing around with it. All it has to do is prompt people to honestly ask themselves questions like, “Do I really think my measure’s reliability is > 95%? Do I really think the true effect is > .5? Do I really think I can control for 50% of variance due to individual differences?” Go from there.

  3. That said, the problem here is not that researchers are confused or unconvinced. The problem is that the applied research community does not believe in statisticians as statistical experts, nor in deferring to our expertise, nor even simply not contradicting our expertise. I honestly think that’s on us.

  4. High schools usually cover the problems of measuring things, but it does not seem to sink in, and many people have trouble generalizing from classroom lessons. They learn “this is what we do in physics class” not “this is how we measure distances” or “all measured data is subject to measurement error.” We can see this in how most people (including many professionals!) talk about government statistical publications.

    Some smart people are trying to improve the public’s understanding of how this issue affects crime statistics or COVID statistics, but it’s a tough slog and easily hijacked by factional thinking.

  5. I have not read it yet, but Kahneman, Sibony, and Sunstein have a book out, Noise: A Flaw in Human Judgment (2021, Little, Brown Spark), that seems to address this, in part.

    I was rather surprised that they seemed to be surprised at measurement error. In some areas of psychology where we depend on human scoring or rating, we will spend a large amount of time defining behaviors and training raters so that, hopefully, the raters actually agree on whether the behaviour occurred or not.

    I was conducting an informal survey of human-powered transportation using a local bridge. I was interested in “bicycle” use. It only took me an hour or two to write up an acceptable definition of “bicycle” and “use.” And this was a really simple case.

  6. More generally, I do wonder whether biological measures get an easier pass in psychology, because they feel sciencier. Not that there aren’t poor self-report measures or lab tasks in psychology too, but some of the more egregious examples seem to be measures that probably wouldn’t have survived as long in their “home discipline”.
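
The "multiply across the weights" point in comment 2 can be made concrete with the classical attenuation formula: under the usual assumption of independent measurement errors, the correlation between two noisy measures is the correlation between the underlying constructs times the square root of the product of the two reliabilities. The sketch below computes that and checks it against a simulation. The true correlation of 0.5 and the reliability values are illustrative assumptions that echo the numbers in the comment, not estimates from any study, and the function names are hypothetical.

```python
import numpy as np


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Expected observed correlation given the construct-level correlation
    and the reliability of each measure (classical test theory)."""
    return true_r * np.sqrt(reliability_x * reliability_y)


def simulate_observed_correlation(true_r, reliability_x, reliability_y, n, rng):
    """Simulate unit-variance latent constructs plus independent measurement
    error and return the correlation between the two noisy measures."""
    cov = [[1.0, true_r], [true_r, 1.0]]
    latent = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    # reliability = var(latent) / var(measure), so error variance = 1/reliability - 1
    err_sd_x = np.sqrt(1.0 / reliability_x - 1.0)
    err_sd_y = np.sqrt(1.0 / reliability_y - 1.0)
    x = latent[:, 0] + rng.normal(scale=err_sd_x, size=n)
    y = latent[:, 1] + rng.normal(scale=err_sd_y, size=n)
    return np.corrcoef(x, y)[0, 1]


rng = np.random.default_rng(7)
true_r = 0.5  # assumed construct-level correlation
for reliability in (0.95, 0.8, 0.6, 0.4):
    expected = attenuated_correlation(true_r, reliability, reliability)
    simulated = simulate_observed_correlation(
        true_r, reliability, reliability, n=200_000, rng=rng
    )
    print(f"reliability {reliability:.2f}: "
          f"expected r = {expected:.3f}, simulated r = {simulated:.3f}")
```

Plugging in less flattering reliabilities shows how quickly a respectable construct-level correlation shrinks by the time it reaches the data.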
