Frequentist inference doesn’t admit probability over models anyway.

It wouldn’t be difficult to add, afaict. Just divide the p-value by the number of models considered and you’ll usually get close to the right answer.

]]>What makes it work is the asymptotic assumption that the sampling distribution “under the null” mirrors, by symmetry, the asymptotic behavior of the posterior distribution of the parameter “under a flat prior”… for large sample sizes and many different kinds of simple regression models, these asymptotic results will match.

You can try it out by doing something like this:

library(brms)
library(ggplot2)

set.seed(1)
x = 1:10
y = x^2 + rnorm(10, 0, 30)
qplot(x, y)

# fit with lm and extract the p value for the coefficient:
slm = summary(lm(y ~ I(x^2)))
slm
slm$coefficients[2,4]

# fit the same model with brms:
brfit = brm(y ~ I(x^2), data = data.frame(x = x, y = y))
s = posterior_samples(brfit)
meanval = mean(s$b_IxE2)

## calculate the posterior probability of falling outside [0, 2*meanval]:
1 - sum(s$b_IxE2 > 0 & s$b_IxE2 < 2*meanval)/NROW(s)

The lm result's p value for the coefficient of the x^2 term is 0.0018, and the brms posterior calculation gives 0.0012.

This simple example is … simple, but there are lots of cases where this kind of approximate result holds, and the bigger question is really just "is this model meaningful". Notice that, because of our nice perfectly normal noise etc., this works here even though we have only 10 data points.

For essentially all of the cases like this particular headline-grabbing battle-of-the-thermostat stuff… the problem isn't "the p value was small but a Bayesian posterior would have put a lot of probability on the temperature coefficient being 0"; instead it's "the entire study design has problems: it isn't convincing at all that this would generalize to other parts of the world, other cultures, other age groups, etc., and the regression model isn't built on any principled model of how people respond to temperature, so we shouldn't really believe its results are meaningful"

]]>It’s not just samples from a normal distribution

The value depends on the distribution you assume, right?

But this already equals one when only considering the normal model:

sum(p(D*|H[0:n])) ~ 1

If you add in a second class of models (e.g., the t-distribution), the sum of all possible likelihoods would equal 2, etc. (remember, all the priors cancelled out). In other words:

P(H|"D or more extreme") ~ P("D or more extreme"|H)/2
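A quick numeric sanity check of that claim (a Python sketch with made-up settings, since the identity doesn't depend on the language): treating each location mu as a hypothesis, the likelihoods for a single family sum to ~1 over the hypotheses (the density integrates to 1 in its location parameter, by symmetry), and adding a second family brings the total to ~2:

```python
# Sketch with a hypothetical grid of location hypotheses and a made-up data point.
# For one family, sum over hypotheses of p(D|H) ~ 1; adding a second family
# (here a t-distribution with df=5) brings the total to ~2, so a flat-prior
# posterior for any one hypothesis is roughly its likelihood divided by 2.
from math import exp, pi, sqrt, gamma

def norm_pdf(x, mu, s=1.0):
    return exp(-0.5 * ((x - mu) / s) ** 2) / (s * sqrt(2 * pi))

def t_pdf(x, mu, df=5.0):
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + (x - mu) ** 2 / df) ** (-(df + 1) / 2)

d = 1.5                                   # observed data point (made up)
dx = 0.01
grid = [-50 + i * dx for i in range(int(100 / dx) + 1)]
sum_normal = sum(norm_pdf(d, mu) for mu in grid) * dx               # ~1
sum_both = sum(norm_pdf(d, mu) + t_pdf(d, mu) for mu in grid) * dx  # ~2
print(round(sum_normal, 3), round(sum_both, 3))
```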

x = -5:5
m = 0
est = 2

f = function(x){ dnorm(x, m, 1) }
f2 = function(x){ dnorm(x, est, 1) }

integrate(f, est, Inf)
integrate(f2, -Inf, 0)
integrate(f, -Inf, Inf)
integrate(f2, -Inf, Inf)

With the results:

> integrate(f, est, Inf)
0.02275013 with absolute error < ...
> integrate(f2, -Inf, 0)
0.02275013 with absolute error < ...
> integrate(f, -Inf, Inf)
1 with absolute error < ...
> integrate(f2, -Inf, Inf)
1 with absolute error < 1.6e-06

So this is true the way you chose H[0:n]:

sum(p(D*|H[0:n])) ~ 1

Makes sense, at least if you limit your hypotheses to the data being samples from a normal distribution with unknown mean.

]]>integrate(normal(q,0,s),qest,inf) = integrate(normal(q,qest,s),-inf,0)

by symmetry…

you can do a similar thing for a 2 tailed test as well.

This is kind of just a symmetry property of integrals of symmetric distributions.
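That symmetry is easy to check numerically; here's a stdlib-Python sketch (the thread's code is R, but the identity is the same), using qest = 2 and s = 1 to match the example above:

```python
# Check the symmetry integrate(normal(q,0,s), qest, Inf) == integrate(normal(q,qest,s), -Inf, 0):
# the upper tail beyond the estimate under a null centered at 0 equals the
# lower tail below 0 under a normal centered at the estimate.
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    """CDF of Normal(mu, sigma) via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

qest, s = 2.0, 1.0
upper_tail_null = 1.0 - norm_cdf(qest, 0.0, s)  # P(q > qest | mean 0)
lower_tail_est = norm_cdf(0.0, qest, s)         # P(q < 0    | mean qest)
print(upper_tail_null, lower_tail_est)          # both ~0.02275013

# the two-tailed version just doubles both tails:
print(2 * upper_tail_null, 2 * lower_tail_est)
```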

]]>you sort of did this in a coded way when you said: ““543 students in Berlin, Germany” . . . good enuf to make general statements about men and women, I guess! I wonder if people dress differently in different places . . . .”

Sarcasm unfortunately doesn’t translate well to the internet, and I suspect you have some “new” readers each week, not just the rest of us old guys who’ve been here over a decade.

]]>These p values will correspond to the posterior probability that 0 is in the central high probability region of some marginal Bayesian posterior with a flat prior.

Ok, with flat prior they all cancel out. So then saying:

P("D or more extreme"|H) ~ P(H|"D or more extreme")

Let’s call “D or more extreme” D* for short. Then we need to have a situation where

sum(p(D*|H[0:n])) ~ 1

Otherwise I don’t see how you can equate p(D*|H[0]) ~ p(H[0]|D*) in general.

But I guess you do not say “the posterior probability of zero given the data”… You say “the posterior probability that 0 is in the central high probability region”. How is the central high probability region defined? The p-value does not depend on any cutoffs.

]]>As you say, the authors seem to have done a straightforward experiment and a simple regression. The problem is with the Atlantic, NPR, etc., for implying that more can be learned from this little study. Also with the authors, who bury the limitations of the study deep within the paper: neither the title nor the abstract makes it at all clear that their data are limited to one time and place. The study is what it is. If it’s a good study in terms of measurement etc. (based on other comments in this thread, I have some doubts, but, like you, I don’t really care so much about the details on this), then others could replicate it in other places, then someone could perform a meta-analysis, and then, maybe, it’s newsworthy. Right now, I don’t see it as newsworthy at all.

And I want to push back against your implication that to criticize how this study is reported, I need to find particular problems with this experiment. Even if the experiment were perfect, it would not imply what’s in the title, abstract, conclusions, or media reports. To focus on the details here would miss the point, I think.

]]>Just throwing a bunch of regression predictors at a problem and fitting ordinary least squares does not qualify as a good application of this idea, and that’s more or less what they did here.

]]>1 ~ p(D|H[0]) + sum(P(H[1:n])p(D|H[1:n]))

]]>a p value is close to a Bayesian posterior distribution

Do you mean?

P("*D or more extreme*"|H) ~ P(H|"*D or more extreme*")?

Starting from Bayes rule:

p(H[0]|D) = P(H[0])p(D|H[0])/sum(P(H[0:n])p(D|H[0:n]))

To get the approximation, then we would need to see the prior probability of H[0] cancel with the denominator:

P(H[0]) ~ sum(P(H[0:n])p(D|H[0:n]))

Or:

1 ~ sum(p(D|H[0:n]))

What does this have to do with the sample size? Does the “or more extreme” change that somehow?

Anyway, checking the fit of a strawman hypothesis using a Bayesian posterior is not an improvement. I guess the common interpretation of the p-value as “the probability H[0] is true” will be more accurate.
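For what it's worth, the cancellation is easy to see with a toy discrete example (hypothetical numbers, sketched in Python):

```python
# With a flat prior P(H[i]) = 1/(n+1), Bayes' rule reduces to
#   p(H[0]|D) = p(D|H[0]) / sum(p(D|H[0:n])),
# so p(H[0]|D) ~ p(D|H[0]) exactly when the likelihoods sum to ~1.
likelihoods = [0.02, 0.50, 0.30, 0.18]               # p(D|H[i]); these sum to 1.00
prior = [1.0 / len(likelihoods)] * len(likelihoods)  # flat prior
evidence = sum(p * l for p, l in zip(prior, likelihoods))
posterior = [p * l / evidence for p, l in zip(prior, likelihoods)]
print(posterior[0], likelihoods[0])  # ~0.02 for both: the prior cancelled out
```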

]]>The most important question though is about whether the model makes sense.

For example, in the regressions they run here, each person has a different set of Xij “corrections”, which means the prediction should vary from person to person, so there shouldn’t be a single meaningful line like in the plots.

Now, if they first “correct” all the data points so that what’s plotted in graphs is a corrected prediction from their model rather than raw data… then that could make the graph make some sense, but the details are too sketchy to know what they did.

]]>2. If you’re going to show a scatterplot, show the raw data, not just means produced (adjusted?) by your analysis. Same if you’re going to report CI’s.

3. The data are nested, but they didn’t report (or check?) the ICC. I haven’t read it closely, but a word search found no mention of clustering, multilevel models, hierarchical models, or any associated stats (other than R). I see they call the regression equation an “econometric model,” so maybe econometric standards for analysis and reporting are different from basic statistics?

]]>When the sample size is large, and the number of predictors are large enough to capture much of the variability, I’m willing to look at very low p values as indicating something

[…]

I’ll let Anoneuoid do their usual now.

Shouldn’t you be expected to provide a justification for why this makes sense? Obviously you know why I will say it does not make sense…

Also, my critique here doesn’t have anything to do with the NHST part. This design is doomed before they get to the NHST step. This is doing NHST on meaningless numbers.

]]>I don’t think this paper is “so bad”, I just think it’s lazy and sloppy and overreaching… do a bunch of experiments on a narrow population, run a regression where you don’t even state what the predictors (the Xij values) are, show that there’s a statistically significant slope with respect to the combination of sex and temperature… and then immediately jump to some overblown conclusions about how this will generalize globally to “office productivity”

Sure, they may, but then they may not. Saying that a certain bias is consistent across a largish sample of a small population is not the same as showing that it’s reliable and consistent across a broad population like even “men and women in offices in Germany” much less an unqualified population of men and women in general (globally?).

The problem is that the way things work when it comes to humans is always fairly context dependent… unless you look across a wide variety of contexts, you may just be finding, for example, that women in dorms at university synchronize their menstrual cycles and it synchronized with your experiment as well… or that young men in Germany currently have a fashion fad to wear puffy down coats… or some other such thing.

]]>I’m not sure smiley-faces would have mellowed statements about embarrassment on CV’s. I think such statements are fair enough, if the paper is bad, but they do come with some responsibility to explain.

]]>I am very interested in the replication crisis and (for that reason) in this blog. I have zero interest in this paper other than as a potential example of (research methods that might underlie) the replication crisis. I really don’t have an interest in defending this paper, especially since I don’t care about their claim, even if it is true. I have not read more than the abstract and what is on this blog. So, even though I feel I have to push back a bit, this is going to be an awkward defence.

The authors seem to have done a straightforward experiment and a simple regression. They write: “gender mixed workplaces may be able to increase productivity by setting the thermostat higher than current standards.”, which seems to be a summary of the regression outcome taken at face value. There’s a “may” in there that seems to indicate that care is required in interpretation and extrapolation. Is this really such an outrageous way of writing about their results? Should they have emphasized that Berlin women are no Brooklyn women? How would you have summarised the regression in one sentence? Or is the point that the study is underpowered at 543 subjects? Is it?

Again, I don’t mind that you give them the cold shoulder. I just don’t think the cold shoulder is productive if substantive criticism is not part of it. I think the standard of the blog is usually higher than that.

]]>male, age, majorecon, nativegerman, enjoymath, enjoywords, strongme, cool, ps, normal, warm, hot, month

]]>I’ll let Anoneuoid do their usual now.

]]>I can see your point — there is so much that we take for granted (as common background), and we do joke around (partly to maintain our sanity) and sometimes forget the smiley-faces — but I can see how it can come across to a newcomer as at best unhelpful and at worst as mocking. So, to give some background on where I at least am coming from, please look over my website (especially the class notes) at https://web.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html .

]]>” I would not be willing to put this paper on my vitae. Things will only change when it becomes more of an embarrassment than an asset to have this on your vitae.”

+1

Once a professor in another field offered to list me as a coauthor in a paper which was part of the thesis of a student of his, since I had put in a lot of time with her explaining some statistics concepts. Although I think my efforts improved the paper from what it might have been, it still didn’t come up to my standards, and I would have been embarrassed to have it on my vita. (I think this partly reflects higher standards in math than in many fields — for example, it is not typical in math for a thesis advisor to be listed as co-author of a paper that is part of a student’s thesis.)

]]>OP didn’t say this. My interpretation (which Dale should feel free to correct!) is that the amount of complexity in the data combined with the number of potential predictors *should* allow one to construct a model that is capable of accounting for more variance than they report.

In other words, given the structure available in the data, the models aren’t capturing it and so we shouldn’t bother using those models to support any further inferences (which is what the paper proceeds to do).

]]>To start, it’s ridiculous to generalize from an experiment conducted for two months in one city to make a claim such as, “gender mixed workplaces may be able to increase productivity by setting the thermostat higher than current standards.”

To put it another way: I have no problem if journals want to publish such papers (minus the large unsupported claims). But I do have a problem with news outlets treating such a study as telling us something useful. I think the burden should be on the researcher or the news outlet to demonstrate why we should care. The fact that 3 referees for some journal decided that the paper was ok to publish, that’s not enough. The default should not be: some paper gets published somewhere, so let’s believe everything in it.

So yeah, write a paper without strong unsupported claims and you’ll avoid some of these negative reactions at this blog. Write a paper *with* strong unsupported claims and you might get the cold shoulder here; on the other hand you can get uncritical press from the Atlantic, NPR, etc. This is a tradeoff that many researchers seem happy with.

I think if we want a community of critical readers that holds science to a higher standard, we should also hold the critique to a high standard. So, a bit more substance than just pointing to the paper and saying it’s bad. (I know this is just a blog post, but the tone of the post and the uncritical, “pile-on” nature of the comments just rubbed me the wrong way…)

]]>I did look far enough to see that the number of observations are on the order of 500 with around 10 independent variables and the R-square values for the regression models top out at around 0.05 (most are lower). But, there are p-values less than .05, so I guess this passes muster.

I don’t follow this critique. Why exactly do you consider this an insufficient sample size? If the sample size was 5k would you trust their conclusions? 50k?

No sample size or effect size would make me trust the output of this analysis.

]]>Xij is a vector of the observable characteristics of the individual and session that might influence performance.

Do they explain what Xij consisted of anywhere?

]]>Committing this error is probably even worse than doing NHST.

]]>Maybe it could get you a job at a top university if you have the right connections.

]]>Seriously, aside from the continuing issues related to NHST, I would not be willing to put this paper on my vitae. Things will only change when it becomes more of an embarrassment than an asset to have this on your vitae. So, this won’t get you a job at a top university (unless you have hundreds of these publications), but it will help you get a job at many.

]]>