Battle for the headline: Hype and the effect of statistical significance on the ability of journalists to engage in critical thinking

A few people pointed me to this article, “Battle for the thermostat: Gender and the effect of temperature on cognitive performance,” which received some uncritical press coverage here and here. And, of course, on NPR.

“543 students in Berlin, Germany” . . . good enuf to make general statements about men and women, I guess! I wonder if people dress differently in different places . . . .

47 thoughts on “Battle for the headline: Hype and the effect of statistical significance on the ability of journalists to engage in critical thinking”

  1. No, I did not read it. No, I will not read it. I did look far enough to see that the number of observations is on the order of 500 with around 10 independent variables and the R-square values for the regression models top out at around 0.05 (most are lower). But, there are p-values less than .05, so I guess this passes muster.

    Seriously, aside from the continuing issues related to NHST, I would not be willing to put this paper on my vitae. Things will only change when it becomes more of an embarrassment than an asset to have this on your vitae. So, this won’t get you a job at a top university (unless you have hundreds of these publications), but it will help you get a job at many.
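
    To make the R-squared-versus-p-value point concrete, here is a minimal simulation sketch (mine, not from the paper): with n around 500, a predictor that explains only a few percent of the variance will typically clear p < .05, so a “significant” coefficient says very little about how much of the outcome the model actually accounts for.

    set.seed(123)
    n = 500
    x = rnorm(n)
    y = 0.2*x + rnorm(n)      # true R^2 of roughly 0.04
    fit = summary(lm(y ~ x))
    fit$r.squared             # sample R^2, around 0.04
    fit$coefficients[2, 4]    # p value, typically well below .05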

    • I did look far enough to see that the number of observations is on the order of 500 with around 10 independent variables and the R-square values for the regression models top out at around 0.05 (most are lower). But, there are p-values less than .05, so I guess this passes muster.

      I don’t follow this critique. Why exactly do you consider this an insufficient sample size? If the sample size was 5k would you trust their conclusions? 50k?

      No sample size or effect size would make me trust the output of this analysis.

      • > an insufficient sample size

        OP didn’t say this. My interpretation (which Dale should feel free to correct!) is that the amount of complexity in the data combined with the number of potential predictors *should* allow one to construct a model that is capable of accounting for more variance than they report.

        In other words, given the structure available in the data, the models aren’t capturing it and so we shouldn’t bother using those models to support any further inferences (which is what the paper proceeds to do).

        • Yes, gec. I realize Anoneuoid is pretty extreme and is likely to feel that no sample size justifies any use of NHST and p values whatsoever. But although I did sign the anti-p value statement, I am not an absolutist about this. When the sample size is large, and the number of predictors is large enough to capture much of the variability, I’m willing to look at very low p values as indicating something – particularly if they reappear despite multiple variations of the analysis. But, in the present case, with R-square values of .05 or lower and so much unmeasured variability across people’s anatomy, biology, intelligence, etc., I don’t find the conclusions at all convincing.

          I’ll let Anoneuoid do their usual now.

        • When the sample size is large, and the number of predictors is large enough to capture much of the variability, I’m willing to look at very low p values as indicating something
          […]
          I’ll let Anoneuoid do their usual now.

          Shouldn’t you be expected to provide a justification for why this makes sense? Obviously you know why I will say it does not make sense…

          Also, my critique here doesn’t have anything to do with the NHST part. This design is doomed before they get to the NHST step. This is doing NHST on meaningless numbers.

        • Often there is some interpretation where a p value is close to a probability from a Bayesian posterior distribution with a nonstandard/flat prior. If the sample size is large enough, *and the model makes sense*, under these circumstances you’re at least getting into the ballpark of a reasonable analysis.

          The most important question though is about whether the model makes sense.

          For example, in the regressions they run here, each person has a different set of Xij “corrections,” which means the prediction should vary from person to person, so there shouldn’t be a single meaningful line like the ones in the plots.

          Now, if they first “correct” all the data points so that what’s plotted in graphs is a corrected prediction from their model rather than raw data… then that could make the graph make some sense, but the details are too sketchy to know what they did.

        • a p value is close to a probability from a Bayesian posterior distribution

          Do you mean?

          P("D or more extreme"|H) ~ P(H|"D or more extreme")?

          Starting from Bayes rule:

          p(H[0]|D) = P(H[0])p(D|H[0])/sum(P(H[0:n])p(D|H[0:n]))

          To get the approximation, then we would need to see the prior probability of H[0] cancel with the denominator:


          P(H[0]) ~ sum(P(H[0:n])p(D|H[0:n]))

          Or:


          1 ~ sum(P(H[0:n])p(D|H[0:n]))/P(H[0])

          What does this have to do with the sample size? Does the “or more extreme” change that somehow?

          Anyway, checking the fit of a strawman hypothesis using a Bayesian posterior is not an improvement. I guess the common interpretation of the p-value as “the probability H[0] is true” will be more accurate.

        • Imagine someone fits a regression using maximum likelihood. The software will spit out p values for the coefficients and some nice tables with asterisks… These p values will correspond to the posterior probability that 0 is in the central high probability region of some marginal Bayesian posterior with a flat prior. If the sample size is large and the model is something you believe is meaningful, then this interpretation justifies Dale’s intuition. It’s not so much an NHST exercise as a test that the model you actually have some confidence in shows an actual certain-sized relationship represented by that parameter.

        • I can’t emphasize enough though, that for this interpretation to be meaningful, you should have a prior that is relatively flat in the region around where the maximum likelihood estimate lies, and you should believe the model is a reasonable approximation to some real consistent relationship.

          Just throwing a bunch of regression predictors at a problem and fitting ordinary least squares does not qualify as a good application of this idea, and that’s more or less what they did here.

        • These p values will correspond to the posterior probability that 0 is in the central high probability region of some marginal Bayesian posterior with a flat prior.

          Ok, with flat prior they all cancel out. So then saying:


          P("D or more extreme"|H) ~ P(H|"D or more extreme")

          Let’s call “D or more extreme” D* for short, we need to have a situation where

          sum(p(D*|H[0:n])) ~ 1

          Otherwise I don’t see how you can equate p(D*|H[0]) ~ p(H[0]|D*) in general.

          But I guess you do not say “the posterior probability of zero given the data”… You say “the posterior probability that 0 is in the central high probability region”. How is the central high probability region defined? The p-value does not depend on any cutoffs.

        • The p value is some kind of test of how often you’d get a regression coefficient as different from zero as you saw on repeated sampling if the true regression coefficient generating the data was 0… In many of these situations the sampling distribution of the regression coefficient is normal… Also, due to the large sample size, the posterior distribution of the regression coefficient under the flat prior is normal… it will turn out that

          integrate(normal(q,0,s),qest,inf) = integrate(normal(q,qest,s),-inf,0)

          by symmetry…

          you can do a similar thing for a 2 tailed test as well.

          This is kind of just a symmetry property of integrals of symmetric distributions.

        • Ok, I got it to work this way:

          # null mean (m) and observed estimate (est); x here is unused
          x = -5:5
          m = 0
          est = 2

          # sampling density under the null vs. posterior density under a flat prior
          f = function(x){ dnorm(x, m, 1) }
          f2 = function(x){ dnorm(x, est, 1) }

          # the two tail areas agree by symmetry
          integrate(f, est, Inf)
          integrate(f2, -Inf, 0)

          # both densities integrate to 1
          integrate(f, -Inf, Inf)
          integrate(f2, -Inf, Inf)

          With the results:

          > integrate(f, est, Inf)
          0.02275013 with absolute error < …
          > integrate(f2, -Inf, 0)
          0.02275013 with absolute error < …
          > integrate(f, -Inf, Inf)
          1 with absolute error < …
          > integrate(f2, -Inf, Inf)
          1 with absolute error < 1.6e-06

          So this is true the way you chose H[0:n]:

          sum(p(D*|H[0:n])) ~ 1

          Makes sense, at least if you limit your hypotheses to being samples from a normal distribution with unknown mean.

        • It’s not just samples from a normal distribution, because a lot of sampling distributions will have a normal form asymptotically, so with large sample sizes this kind of thing will hold for lots of regression-type models, at least approximately… but there are all kinds of caveats; it takes some experience with full Bayesian fitting to know when hierarchical partially pooled models with informative priors would help, and in those situations the result won’t hold very well.

        • It’s not just samples from a normal distribution

          The value depends on the distribution you assume right?

          But this already equals one when only considering the normal model:

          sum(p(D*|H[0:n])) ~ 1

          If you add in a second class of models (e.g., a t-distribution), the sum of all possible likelihoods would equal 2, etc. (remember, all the priors cancelled out). In other words:


          P(H|"D or more extreme") ~ P("D or more extreme"|H)/2

        • I’m not entirely clear on what you’re asking. Basically what I’m saying is that the “p value” calculated by typical maximum likelihood or least squares fitting can be close to the posterior probability that a given parameter is in a certain region bounded on one side by 0… under a flat prior.

          What makes it work is the asymptotic assumption about the sampling distribution “under the null” being symmetric with the asymptotic behavior of the posterior distribution of the parameter “under a flat prior”… for large sample sizes and many different kinds of simple regression models, these asymptotic results will match.

          You can try it out by doing something like this:

          library(brms)
          library(ggplot2)

          set.seed(1)

          # simulate a quadratic relationship with normal noise
          x = 1:10
          y = x^2 + rnorm(10,0,30)

          qplot(x,y)

          # least squares fit
          slm = summary(lm(y ~ I(x^2)))
          slm

          # extract the p value for the coefficient:
          slm$coefficients[2,4]

          # same model in brms (flat default priors on the coefficients)
          brfit = brm(y~I(x^2),data=data.frame(x=x,y=y))

          s = posterior_samples(brfit)
          meanval = mean(s$b_IxE2)

          ## posterior probability of falling outside [0, 2*meanval]
          ## (compare to the two-sided p value from lm)

          1-sum(s$b_IxE2 > 0 & s$b_IxE2 < 2*meanval)/NROW(s)

          the lm result's p value for the coefficient of the x^2 term is 0.0018 and the brms posterior calculation is 0.0012

          This simple example is … simple, but there are lots of cases where this kind of approximate result holds, and the bigger question is really just "is this model meaningful". Notice because of our nice perfectly normal noise etc, this works here even though we have only 10 data points.

          For essentially all of the cases like this particular headline-grabbing battle-of-the-thermostat stuff… the problem isn't "p value was small but a Bayesian posterior would have put a lot of probability on the temperature coefficient being 0"; instead it's "the entire study design has problems, it isn't convincing at all that this would generalize to other parts of the world, other cultures, other age groups, etc., and the regression model isn't built on any principled model of how people respond to temperatures, so we shouldn't really believe its results are meaningful."

        • I’m saying that it seems to me that the correspondence can only occur if the entire universe of considered possibilities consists of one likelihood function with an unknown parameter. I.e., a necessary condition is that the sum of all considered likelihoods = 1.

        • Yes, the correspondence won’t work for a discrete set of models, but Frequentist inference doesn’t admit probability over models anyway. The correspondence could work if the data rules out all but one of your Bayesian alternatives.

        • Frequentist inference doesn’t admit probability over models anyway.

          It wouldn’t be difficult to add afaict. Just divide the p-value by the number of models considered to usually get close to the right answer.

    • Dale said,
      ” I would not be willing to put this paper on my vitae. Things will only change when it becomes more of an embarrassment than an asset to have this on your vitae.”

      +1

      Once a professor in another field offered to list me as a coauthor on a paper that was part of the thesis of a student of his, since I had put in a lot of time with her explaining some statistics concepts. Although I think my efforts improved the paper from what it might have been, it still didn’t come up to my standards, and I would have been embarrassed to have it on my vita. (I think this partly reflects higher standards in math than in many fields — for example, it is not typical in math for a thesis advisor to be listed as co-author of a paper that is part of a student’s thesis.)

    • Xij is a vector of the observable characteristics of the individual and session that might influence performance.

      Do they explain what Xij consisted of anywhere?

      • Another neglected two-step process! First they figure out what they want Xij to be, then they figure out how to say what it is. But maybe they need to take off their jackets first, so maybe it’s a three-step process. ;~)

      • I didn’t see it explained either. I downloaded their Stata data file, which I opened in Emacs, because I don’t have Stata and am lazy about trying to figure out if I can read it into R… Here are a few data fields that seem to be defined there which might have been involved in the Xij vector:

        male, age, majorecon, nativegerman, enjoymath, enjoywords, strongme, cool, ps, normal, warm, hot, month
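
        For what it’s worth, a Stata .dta file can usually be read straight into R; here is a minimal sketch, where “thermostat.dta” is just a placeholder for whatever the downloaded file is called:

        library(haven)                  # read_dta() reads Stata .dta files

        d = read_dta("thermostat.dta")  # placeholder filename
        names(d)                        # should list fields like male, age, majorecon, ...
        head(d)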

  2. Why is this paper so bad? Sorry, I am really not an expert, but it doesn’t stand out to me. Is it because the topic is silly (is it?)? Is it because the study is underpowered (is it?)?

    I think if we want a community of critical readers that holds science to a higher standard, we should also hold the critique to a high standard. So a bit more substance than just pointing to the paper and saying it’s bad would be welcome. (I know this is just a blog post, but the tone of the post and the uncritical and “pile-on” nature of the comments just rubbed me the wrong way…)

    • From what I looked at, it is no worse than 99.9% of what gets published… other than that I couldn’t even figure out what variables besides temperature were used in the model. I’ll give them the benefit of the doubt that I missed it though.

    • Mathijs:

      To start, it’s ridiculous to generalize from an experiment conducted for two months in one city to make a claim such as, “gender mixed workplaces may be able to increase productivity by setting the thermostat higher than current standards.”

      To put it another way: I have no problem if journals want to publish such papers (minus the large unsupported claims). But I do have a problem with news outlets treating such a study as telling us something useful. I think the burden should be on the researcher or the news outlet to demonstrate why we should care. The fact that 3 referees for some journal decided that the paper was ok to publish, that’s not enough. The default should not be: some paper gets published somewhere, so let’s believe everything in it.

      So yeah, write a paper without strong unsupported claims and you’ll avoid some of these negative reactions at this blog. Write a paper with strong unsupported claims and you might get the cold shoulder here; on the other hand you can get uncritical press from the Atlantic, NPR, etc. This is a tradeoff that many researchers seem happy with.

      • My point wasn’t really about the authors of that paper, but about this blog and its community. Maybe those authors violated some basic scientific principles and maybe it is obvious (to the trained eye) that they did so. Perhaps then they deserve to be mocked a bit (I’m not opposed to that necessarily). I still don’t think it is good practice to do the mocking without pointing out the errors. And at this stage I am still a bit unclear on the errors of this paper.

        I am very interested in the replication crisis and (for that reason) in this blog. I have zero interest in this paper other than as a potential example of (research methods that might underlie) the replication crisis. I really don’t have an interest in defending this paper, especially since I don’t care about their claim, even if it is true. I have not read more than the abstract and what is on this blog. So, even though I feel I have to push back a bit, this is going to be an awkward defence.

        The authors seem to have done a straightforward experiment and a simple regression. They write: “gender mixed workplaces may be able to increase productivity by setting the thermostat higher than current standards,” which seems to be a summary of the regression outcome taken at face value. There’s a “may” in there that seems to indicate that care is required in interpretation and extrapolation. Is this really such an outrageous way of writing about their results? Should they have emphasized that Berlin women are no Brooklyn women? How would you have summarised the regression in one sentence? Or is the point that the study is underpowered at 543 subjects? Is it?

        Again, I don’t mind that you give them the cold shoulder. I just don’t think the cold shoulder is productive if substantive criticism is not part of it. I think the standard of the blog is usually higher than that.

        • I agree with this. There’s a tendency to be sloppy because there is just so much junk out there, but we should offer some specific criticisms.

          I don’t think this paper is “so bad”, I just think it’s lazy and sloppy and overreaching… do a bunch of experiments on a narrow population, run a regression where you don’t even state what the predictors (the Xij values) are, show that there’s a statistically significant slope with respect to the combination of sex and temperature… and then immediately jump to some overblown conclusions about how this will generalize globally to “office productivity”

          Sure, they may, but then they may not. Saying that a certain bias is consistent across a largish sample of a small population is not the same as showing that it’s reliable and consistent across a broad population like even “men and women in offices in Germany” much less an unqualified population of men and women in general (globally?).

          The problem is the way things work when it comes to humans is always fairly context dependent… unless you look across a wide variety of contexts, you may just be finding for example that women in dorms at university synchronize their menstrual cycles and it synchronized with your experiment as well… or that young men in germany currently have a fashion fad to wear puffy down coats… or some other such thing.

        • Mathijs:

          As you say, the authors seem to have done a straightforward experiment and a simple regression. The problem is with the Atlantic, NPR, etc., for implying that more can be learned from this little study. Also with the authors, who bury the limitations of the study deep within the paper: neither the title nor the abstract makes it at all clear that their data are limited to one time and place. The study is what it is. If it’s a good study in terms of measurement etc. (based on other comments in this thread, I have some doubts, but, like you, I don’t really care so much about the details on this), then others could replicate it in other places, then someone could perform a meta-analysis, and then, maybe it’s newsworthy. Right now, I don’t see it as newsworthy at all.

          And I want to push back against your implication that to criticize how this study is reported, I need to find particular problems with this experiment. Even if the experiment were perfect, it would not imply what’s in the title, abstract, conclusions, or media reports. To focus on the details here would miss the point, I think.

        • I suspect Mathijs was looking for elaboration along the lines of what you said here. Not necessarily to look into the details of the study measurement and model and etc.. but just to say things like “even if the experiment were perfect, it would not imply what’s in the title, abstract, conclusions or media reports”…

          you sort of did this in a coded way when you said: ““543 students in Berlin, Germany” . . . good enuf to make general statements about men and women, I guess! I wonder if people dress differently in different places . . . .”

          sarcasm unfortunately doesn’t translate well to the internet, and I suspect you have some “new” readers each week, not just the rest of us old guys who’ve been here over a decade.

    • Mathijs,

      I can see your point — there is so much that we take for granted (as common background), and we do joke around (partly to maintain our sanity) and sometimes forget the smiley-faces — but I can see how it can come across to a newcomer as at best unhelpful and at worst as mocking. So, to give some background on where I at least am coming from, please look over my website (especially the class notes) at https://web.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html .

  3. I’m willing to entertain the idea that women and men have different optimal indoor temperatures… but there’s no way you’ll convince me that this study explores the causality of that. I mean, I’m pretty sure on average women and men dress differently, have different levels of body fat distribution, eat differently, have different basal metabolic rates, different activity levels, etc … but looked at individually, the variation in all of these things is rather large. Sex/gender here is likely a proxy for all these statistically related factors

  4. 1. Why, for the love of god, in a world full of established math assessments of every length and difficulty, would you use an assessment with no established reliability or validity, which was apparently created for a different study (see their cited paper) in which the point was not to assess math performance but to have a task that required some effort but was relatively easy? I didn’t even bother to look at the other two measures.

    2. If you’re going to show a scatterplot, show the raw data, not just means produced (adjusted?) by your analysis. Same if you’re going to report CI’s.

    3. The data are nested, but they didn’t report (or check?) the ICC. I haven’t read it closely, but a word search found no mention of clustering, multilevel models, hierarchical models, or any associated stats (other than R). I see they call the regression equation an “econometric model,” so maybe econometric standards for analysis and reporting are different from basic statistics?
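
       On the nesting point, here is a minimal sketch of the kind of check I have in mind (my own; session, temp, female, and score are made-up names, not the paper’s variables): fit a random intercept for session and get the ICC from the variance components.

       library(lme4)

       # toy data standing in for the real thing: subjects nested in sessions
       set.seed(1)
       d = data.frame(session = rep(1:30, each = 18))   # ~540 subjects in 30 sessions
       d$temp   = rep(runif(30, 16, 32), each = 18)     # session-level temperature
       d$female = rbinom(nrow(d), 1, 0.5)
       d$score  = 0.1*d$temp*d$female + rnorm(30)[d$session] + rnorm(nrow(d))

       # random intercept for session
       m = lmer(score ~ temp*female + (1 | session), data = d)

       # ICC = between-session variance / total variance
       vc = as.data.frame(VarCorr(m))
       vc$vcov[1] / sum(vc$vcov)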
