I happened to come across a post of mine that’s not scheduled until next April, and I noticed the above line, which I really liked, so I’m sharing it with you right now here.

The comment relates to a common procedure in statistics, where researchers decide to exclude potentially important interactions from their models just because these interactions are not statistically significant.

As I wrote, whether something is statistically significant is itself a very random feature of data, so in this case you’re essentially outsourcing your modeling decision to a random number.

At some level, sure, we know that our decisions won’t be perfect, and any data-based decision can be wrong. But using statistical significance (or any other binary procedure, whether it be a p-value or a Bayes factor or whatever) in this way . . . That’s just an unnecessary addition of noise into your procedure, and it can have real and malign consequences.

If we reject binary procedures, how do we reconcile that with the fact that ultimately many real-world decisions are of a binary nature? To operate on a tumor or not. To allow a drug or ban it. Etc.

Unless you provide a way to map inputs to an ultimate yes/no decision, aren’t you just pushing the problem elsewhere? A Bayesian probability is not a decision.

The point that I’m sure has been made here before (and is made in other venues by people like Frank Harrell) is that yes, there are binary decisions that have to be made, but you should delay this dichotomization until you actually *have* to do it. When deciding on a treatment plan, I’d much rather my doctor knew that I had a risk score of 72/100 than that I was in the “high risk” category that was above a cutoff of 70/100: that way they can use lots of other information that they have about me.
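A toy sketch of the information the cutoff throws away (the 70/100 cutoff and patient labels are just this hypothetical example, nothing more):

```python
# Hypothetical illustration: dichotomizing a risk score destroys information.
# Scores of 72 and 99 land in the same "high" bucket, so a decision rule
# that only sees the label cannot tell those patients apart.

def risk_category(score, cutoff=70):
    """Collapse a 0-100 risk score into a binary label."""
    return "high" if score >= cutoff else "low"

patients = {"A": 72, "B": 99, "C": 68}
labels = {name: risk_category(s) for name, s in patients.items()}
# A and B get identical labels despite very different scores, while
# C (68) is treated as categorically different from A (72).
```

The doctor who sees only `labels` has strictly less to work with than the doctor who sees `patients`.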

Andrew beat me to it.

Ben

I think your response is better than Andrew’s. But I’m biased.

Ben:

Makes sense. But here’s the practical problem: Delaying the dichotomization decision is fine, but in practice I find very few papers actually walk the last mile.

i.e. Everyone seems to be working on finding these risk scores (or Bayesian posteriors etc.), but very little work is actually being done on decision theory.

In other words, show me work that talks about how to establish the cutoff. The threshold.

I have struggled with this for years, and I now have an answer to this question that works in practice for me. What I do in my papers is talk about whether the observed posterior distribution of the effect of interest is consistent with the quantitative prediction or not. If I don’t have a quantitative prediction, I try to work out a meta-analysis posterior; this is my best guess given what I know. If I don’t know anything about my problem (very rare), I wait till I do.

Then, I use the Kruschke/Freedman/Spiegelhalter method of region of practical equivalence to decide whether I am going to *act as if* my theoretical prediction and the posterior from the data are consistent with each other. More often than not, I write that the data are equivocal, or that we can’t conclusively say whether the prediction was confirmed.
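For readers who haven’t seen the region-of-practical-equivalence idea, here is a minimal sketch of the decision rule, assuming you already have posterior draws for the effect; the ROPE bounds and function name are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def rope_decision(posterior_draws, rope=(-0.1, 0.1), level=0.95):
    """Kruschke-style region-of-practical-equivalence check.

    Returns 'practically null' if the central posterior interval lies
    entirely inside the ROPE, 'non-null' if it lies entirely outside,
    and 'equivocal' otherwise."""
    lo, hi = np.quantile(posterior_draws, [(1 - level) / 2, (1 + level) / 2])
    if rope[0] <= lo and hi <= rope[1]:
        return "practically null"
    if hi < rope[0] or lo > rope[1]:
        return "non-null"
    return "equivocal"

# Three stylized posteriors: precise near zero, clearly away from zero,
# and too noisy to say anything.
precise_null = rng.normal(0.0, 0.02, 10_000)
clear_effect = rng.normal(2.0, 0.10, 10_000)
noisy = rng.normal(0.2, 0.50, 10_000)
```

More often than not, real data land in the third, “equivocal” bucket, which is exactly the conclusion that then goes in the paper.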

Most of the time, the data are so noisy I can’t really say much. We had a modeling paper rejected by a top journal (the Journal of Memory and Language) because the published data (collected by people other than my lab) that we were evaluating the model on were basically too noisy to answer the question. The paper was eventually accepted (after a total of five years in the peer-review mill) in Cognitive Science: https://osf.io/b56qv/. The rejection from JML was valid, because there may really be nothing much to talk about, even after 100+ experiments on a single topic. But you should read the bold claims in the original papers.

It turns out I can still publish my own inconclusive studies in top journals like JML. People tell me it’s all very well for me to do this because I am some kind of authority on data analysis; but (a) I don’t believe that (I’m rejected all the time for various random-sounding reasons), and (b) people are just not trying to express their uncertainty in their papers and feel they have to make overblown claims just to get their paper accepted. Re (b) they are just scared to try it; I see this again and again in the papers I review. The evidence is just not there, but people desperately try to make strong claims, when they could easily and more realistically say: we’re not sure, but this and this might be true.

“Re (b) they are just scared to try it…”

Dude! Absolutely! But this kind of behavior starts early. Look at the students in your classes. They’re all trying to anticipate the “right” answer and spit it out, rather than *follow the reasoning path* to the solution it leads to. Look at middle management in any business. They’re all trying to do what they think other people expect them to do – they’re scared to do anything that hasn’t been done exactly the same way before – rather than just solve the problems they are facing. That’s why companies are always cleaning out middle management: it’s where the people who don’t or can’t solve problems pile up.

We need a word or a name for this behavior.

Shravan:

Thanks! Good points!

What I object to is casting “binary decisions” as some sort of enemy. That’s throwing out the baby with the bathwater.

Let’s talk about better (formalized) ways to take binary decisions.

Most of the work I see agonizes and splits hairs over the models and the evaluation of probabilities, but when it comes to mapping that onto a binary, real-world decision, it just leaves the reader adrift with a cursory “do the best you can”…

Rahul, I think the reason we don’t see good decision analyses is that people are desperate to avoid making their utilities known. No one wants to say “the loss of a patient due to a fatal embolism is not worth the $85000 to perform this surgery,” but people are happy to hear “well, as an expert in surgery I think…” As long as everything is hidden inside some expert’s mind, people don’t have to confront hard tradeoffs…

but I agree, we should do more formal decision analyses and publish them so people can get used to the idea
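As a sketch of what writing the utilities down might look like (every number here is invented for illustration, not a real surgical analysis):

```python
# Hypothetical expected-loss decision with the utilities out in the open.
# All probabilities and dollar figures below are made up.

def expected_loss(action, p_fatal_if_untreated, p_fatal_if_operated,
                  cost_of_surgery, value_of_life):
    """Expected monetary loss of each action, with every tradeoff explicit."""
    if action == "operate":
        return cost_of_surgery + p_fatal_if_operated * value_of_life
    return p_fatal_if_untreated * value_of_life

def decide(p_untreated, p_operated, cost, value):
    """Pick the action with the smaller expected loss."""
    losses = {a: expected_loss(a, p_untreated, p_operated, cost, value)
              for a in ("operate", "wait")}
    return min(losses, key=losses.get)

# With an invented 10% fatality risk untreated vs. 1% operated,
# an $85000 surgery, and a $2M valuation, the arithmetic is visible:
choice = decide(p_untreated=0.10, p_operated=0.01, cost=85_000, value=2_000_000)
```

The point is not these numbers; it’s that once they are written down, anyone can argue with them, which is exactly what the hidden-expert version prevents.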

You may be right.

I’ve an alternative explanation: If you don’t make a decision, in some sense, your model is non-falsifiable.

By avoiding decision analysis, people just reduce the “attack surface” of their claims or model. In a sense, the sharp decisions that NHST makes (perhaps unwarranted) are its own downfall.

Probably a bit of both.

But when you make decisions, if people attack them after the fact you can always respond “it would have been even worse without the intervention”, and this is how the “experts” usually do respond I think.

Rahul:

What I sometimes say is, Don’t collapse the wavefunction prematurely.

If you have to make a binary decision, make a binary decision. But don’t take non-decision problems and turn them into binary decisions.

And, when you are making decisions, do your best rather than making the decision based on a random number.

Andrew:

The “do your best” part is the problem. Let’s try to formalize that bit of decision making.

Agreed, let’s not collapse the wavefunction prematurely, but let’s not live under the delusion that there’s a free lunch of not having to take a binary decision at the end of it.

Very often there’s a very real decision to be made, and a binary one at that. The wavefunction must collapse at some point.

PS. What are some examples of what you refer to as “non-decision problems”?

Rahul:

Look at what I wrote above. I criticized the practice of “essentially outsourcing your modeling decision to a random number.” I didn’t criticize the practice of making modeling decisions: I agree that these are necessary.

I’ve written a few books showing how I make modeling decisions. The problem is not with decision making, it’s with making decisions based on what are essentially random numbers. For a recent example of problems with that approach, and better alternatives, see here.

Finally, an example of a non-decision problem is the question, Is effect X real? This is a question which, even where it can be defined, does not need to be answered in a yes/no way. We can include our uncertainty and move on. A decision problem might be, Should you take drug X or drug Y? In this case I’d recommend making decisions based on the usual probabilistic cost/benefit rules, not based on statistical significance.

The main point here is that the p-value is extremely variable, much more than people realize.
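A quick simulation makes the point: hold the true effect and the design fixed, replicate, and watch the p-value bounce around (the effect size, sample size, and number of replications here are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Same true effect, same design, replicated many times.
true_effect, n, reps = 0.3, 50, 1000
pvals = []
for _ in range(reps):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_effect, 1.0, n)
    pvals.append(stats.ttest_ind(a, b).pvalue)
pvals = np.array(pvals)

# Under one fixed reality the p-value ranges over orders of magnitude,
# so "significant vs. not" flips from replication to replication.
spread = pvals.max() / pvals.min()
share_significant = (pvals < 0.05).mean()
```

With modest power, roughly a third of identical replications are “significant” and the rest are not; a model-selection rule keyed to that threshold is keyed to noise.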

Yes, we are in agreement then.

“The comment relates to a common procedure in statistics, where researchers decide to exclude potentially important interactions from their models just because these interactions are not statistically significant.”

That’s not inherently bad IMO, for they have an objective procedure, and they should justify their alpha. They are not outright rejecting the interaction, as in universally declaring that no interaction is important; they are just rejecting it at a certain level. For example, there has to be some number of heads observed out of 100 flips at which you start to conclude a coin is not fair. Is it 50? 75? 99?
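The coin-flip question has a concrete answer under the standard exact binomial test; here is a quick check (alpha = 0.05 chosen purely for illustration):

```python
from scipy import stats

def rejects_fair(heads, n=100, alpha=0.05):
    """Two-sided exact binomial test of the fair-coin hypothesis."""
    return stats.binomtest(heads, n, 0.5).pvalue < alpha

# Smallest number of heads out of 100 at which "fair" is rejected:
threshold = next(h for h in range(50, 101) if rejects_fair(h))
```

At alpha = 0.05 the cutoff lands at 61 heads; at a stricter or looser alpha it moves, which is exactly where “justify your alpha” bites.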

Also, everything you state can be flipped: “potentially important” also means “potentially unimportant,” which may get into the Type I vs. Type II discussion.

“…whether something is statistically significant is itself a very random feature of data, so in this case you’re essentially outsourcing your modeling decision to a random number.”

I wouldn’t agree with this characterization, as it is something of a red herring. If there is something there, we expect the p-value to be small (observed test statistic far away from what we expect under the model), regardless of whether it is random or not.

“But using statistical significance (or any other binary procedure, whether it be a p-value or a Bayes factor or whatever) in this way . . . That’s just an unnecessary addition of noise into your procedure, and it can have real and malign consequences.”

I’ll bite and flip things around again to say that not using it can also have real and malign consequences. It could breed too much subjectivity, wild goose chases using priors that are strongly believed but wrong or brittle, and reliance on procedures (alternatives to p-values and significance testing) that have not been proven in use all over the world, in many sciences and other areas, for the last 80 years.

For the dichotomania discussion, there are probably some cases where it is justified. For example, Boole saw the world as “law of thought”, X(1-X)=0, where X is a class and the operations are class operations, stratification in survey sampling can categorize continuous data, and we operate in the world using dichotomania such as age cutoff to get senior discount, shoe sizes, and so on.

And it doesn’t have to be binary decision, for example with different regions of acceptance, non-acceptance, continue testing, or equivalence testing.

Justin

Let’s say you get a (1-alpha)% confidence interval for some model coefficient that looks like (-10, 10), and suppose any value of that coefficient in the range (-10, -1) or (1, 10) is big enough to be important.

Under this scenario, if the estimate for the coefficient is -9, the coefficient won’t reach statistical significance at that alpha, and so this term will be dropped from the model, even though the term is almost certainly making a difference and should be included.

Exactly this type of scenario occurs in almost every real-world example of using significance tests to pick model parameters I’ve ever seen, and the only Frequentist kludge that avoids it would be to change the alpha = 5% level to something absurd like alpha = 80%.
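A simulation of exactly this scenario (true coefficient 9, standard error 5, so the interval has half-width about 10 as above):

```python
import numpy as np

rng = np.random.default_rng(42)

# True coefficient is large (9), but with standard error 5 the usual
# |estimate| > 1.96*SE significance rule keeps the term less than half
# the time: the keep-or-drop decision is driven by sampling noise.
true_beta, se, reps = 9.0, 5.0, 10_000
estimates = rng.normal(true_beta, se, reps)
kept = np.abs(estimates) > 1.96 * se   # "statistically significant"
keep_rate = kept.mean()
```

An important term gets dropped from the model on roughly a coin flip, which is the “outsourcing your modeling decision to a random number” complaint made concrete.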

Contrast this with the kind of thing done before Frequentists destroyed the scientific method. A modern example can be found here (start reading on the 6th page (p. 623) at “4. What is our rationale?” and continue on for a couple of pages):

https://bayes.wustl.edu/etj/articles/what.question.pdf

Note: this issue doesn’t even touch on the fact that Frequentist guarantees of “not being in error” more than some percentage are a laughable joke in the real world, nor does it consider that Confidence Intervals in real examples can have pathological behavior such as: “the CI cannot possibly contain the true value, and this is provable from the same information used to construct the CI.”

I’ve heard this “justify your alpha” thing a lot, but I’ve never heard of a good way to do it. My instinct is that it’s not really something that makes sense to do since the decision theoretic principles that would justify something (expected values, etc.) are based on Bayesian probabilities. So it seems to me that if you wanted to justify your alpha you’d basically have to perform calculations with a Bayesian model, then for some reason throw away that Bayesian model and go back to significance testing. Maybe this would make sense if you had a theory you could model with Bayes, but wanted to use some nonparametric frequentist procedure because you didn’t really trust that theory?

But by any chance, did you have a procedure in mind for rationally justifying an alpha?

“But by any chance, did you have a procedure in mind for rationally justifying an alpha?”

Justifying an alpha depends on the details of the specific question being studied and the context(s) in which it will be applied. There can’t possibly be a “one size fits all” procedure for doing this — it requires thinking about the particular problem, how the results will be used, thinking about the consequences of different types of “errors” in the context in which results will be used, etc., etc. It involves making judgment calls.

Martha:

It’s hard for me to imagine a situation where the relevant discussion would be in terms of “alpha” or a tail-area probability of a null hypothesis, as this doesn’t map on to any decision problem I could imagine being relevant in real life.

You can usually shoehorn quality control into the alpha cutoff model… but it’s basically by doing a Bayesian type analysis in the background, and then using that to set alpha, and then use that fixed alpha repeatedly during production.

Daniel,

A lot of these Lagrange multiplier-type problems have this duality, that you can set an alpha cutoff or you can set a multiplier for the prior, call it sigma. For any given problem, there’s a direct one-to-one tradeoff between alpha and sigma, so it seems reasonable to say that setting an alpha threshold or setting sigma are equivalent. But that’s not correct, because the particular tradeoff depends on the data. Setting sigma has a direct interpretation in terms of the prior or population distribution. Setting alpha has no direct interpretation except as a derived quantity.

If “justify your alpha” is taken to mean, “justify your sigma and then deduce the corresponding alpha,” then I’d be ok with it. But then there’s no reason to bring up alpha in the first place except as a way of communicating with users of legacy methods. Which has some virtue but then we should be clear that this is what we’re doing.

I guess I should write this up more formally; maybe it would clear up some confusion, at least among people who aren’t too invested in the null hypothesis testing framework.

Email me, I’d work on that paper with you. I think setting an alpha makes sense to a lot of people, but it’s mainly in a context where there is some kind of implied trade-off in the background and by pointing out the underlying reasoning it would give people a stepping stone to a Bayesian decision theory understanding.

I’ve often thought it’d be good to do a bunch of comparisons of NHST based decisions and Bayesian Utility based decisions in realistic everyday examples and see how they differ. Like, how to choose a sample size, or whether to do a diagnostic test, or whether to drop a variable from a regression… Stuff where people are typically outsourcing their decisions to the p value

I should have added to my list, “and whether or not it even makes sense in the context of the problem to set an alpha level.”

(I was indeed thinking in particular of something like a quality control situation.)

The Bayesian rationale for dropping terms from a model (illustrated by Jaynes here https://bayes.wustl.edu/etj/articles/what.question.pdf) is as follows. Given some parameter and uncertainty distribution P(lambda | Data), this can be replaced with a delta function about some constant, delta(lambda – lambda_0), whenever two things happen:

(1) P(lambda|Data) is very narrowly peaked about lambda_0, and

(2) plugging delta(lambda-lambda_0) in everywhere for P(lambda|Data) doesn’t make a practical numerical difference for the specific calculation you’re doing.

The Frequentist criterion for dropping a model term completely ignores both (1) and (2). As a consequence, the Frequentist procedure gets this seriously wrong. And I don’t just mean wrong in a “theoretical, in principle” sort of way. It’s insane in practice and is effectively equivalent to almost randomly deciding which terms to include.
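A toy numerical check of conditions (1) and (2), with made-up numbers: compute a predictive quantity E[exp(lambda*x)] under the full posterior versus under the delta-function replacement:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy check: does replacing P(lambda|Data) by delta(lambda - lambda_0)
# change the answer to the calculation we actually care about?
x = 0.5
draws_narrow = rng.normal(2.0, 0.01, 100_000)   # narrowly peaked posterior
draws_wide = rng.normal(2.0, 1.0, 100_000)      # wide posterior, same mean

full_narrow = np.exp(draws_narrow * x).mean()   # full posterior average
full_wide = np.exp(draws_wide * x).mean()
plugin = np.exp(2.0 * x)                        # delta-function replacement

# Narrow posterior: plug-in is numerically fine. Wide posterior: it is
# off by ~13%, since E[exp(l*x)] = exp(mu*x + (sigma*x)^2 / 2) > exp(mu*x).
err_narrow = abs(full_narrow - plugin) / plugin
err_wide = abs(full_wide - plugin) / plugin
```

Whether the replacement is legitimate depends on the width of the posterior and on the specific downstream calculation, which is exactly what conditions (1) and (2) say, and exactly what a p-value threshold never checks.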

Frequentist philosophy caused people to adopt an absurd procedure for including terms in a model, one that is still being used in the real world hundreds of thousands of times per day.

‘The comment relates to a common procedure in statistics, where researchers decide to exclude potentially important interactions from their models just because these interactions are not statistically significant.’

Data, please! Show that, out of 100 papers, X decide to drop interactions with the _only_ reasoning given being lack of statistical significance. And convince me X is high enough to justify the label ‘common’.

I think the exclusion is sensible when the beta estimate is close to 0 with p > 0.6. There are cases with very wide CIs, where p > 0.6 and the coefficient is substantially different from 0, which can make sense locally, in that effects of interest should be close to or below p = 0.05 despite the wide CI; the argument is then that locally the estimated interaction effect is negligible. If this seems objectionable in principle, I suggest avoiding the social sciences, as the effects of factors (and interactions) not included, and factors unknown, will always remain a problem that in almost all cases will be much worse than whether a particular interaction is dropped or not.

In practice, yes, I have seen people drop interactions for lack of significance, but (a) those people were usually at the lower end to bottom of statistical sophistication, not representative of the mean, and (b) it almost always involved a theory-guided disbelief in the interaction. Observationally, the very same people sing a different tune if the interaction they expected is a bit above 0.05. I sure wish they wouldn’t do that, but on the other hand the behaviour is not too different from having strong priors, and on the gripping hand it is empirically indistinguishable from all the many cases where interactions are not even considered. The latter would seem to be considered superior under the rationale above. I think they’re not. Since I don’t have a strong argument for faulting everyone who fails to consider all potential interactions (and nonlinearities) in their modelling, it follows that it would be unfair to single out authors for dropping interactions they consider droppable just because they _may_ do so for statistically unsound reasons.

Markus:

I see it all the time: people come up with interpretations of coefficients and comparisons entirely based on p-values.

For one example, see the article discussed in section 2.2 of this paper. I agree that, based on the evidence of that quoted passage, the authors of the cited article appear to be at the lower end to bottom of statistical sophistication; however they appear to be influential within their fields of research.

For another example, see the article discussed here. This was an influential article where, again, researchers just looked at p-values to mistakenly make scientific conclusions.

For anyone not following the links: In neither case is an interaction term involved or dropped. As evidence for the quoted statement they’re less than worthless. Following the links is a waste of time, not following them may leave one with the false impression that evidence for the quoted claim was provided, when in fact it was not.

Regardless, two examples would not be enough to establish a ‘common procedure’, unless they’re randomly selected from the literature.

Markus:

It is common in my experience that researchers decide to exclude potentially important interactions from their models just because these interactions are not statistically significant. I’ve seen it a lot.

But I have not done a statistical study to see how common it is. When I say “common,” I mean that I’ve seen it many times; I’m not making a claim about general prevalence, or what percentage of the time this is done in published papers, etc. Such studies of general practice can be valuable, and surveys of practitioners can also be valuable (for example the paper by McShane and Gal discussed here); I have not done that myself.

The above links are examples of my statement in the above comment, “I see it all that time, that people come up with interpretations of coefficients and comparisons, entirely based on p-values.” Both these examples involved interactions in the form of varying treatment effects, in one case an interaction between treatment and time lag, in another case an interaction between treatment and frequency of an electromagnetic field. In both cases, the interaction could be considered a varying treatment effect.

I am sorry to hear that you consider these links “less than worthless”; fortunately these blog comments have many readers, not just you!

P.S. I appreciate your comments here. I recognize that not everyone will agree with everything I write—each of us has had different experiences—and it’s valuable to see different perspectives in these discussions.

This is off-topic, but I figured this was the best post to bring it up at. I was reading a discussion of a recent study in which detractors claimed an N of 14 was too small, but someone else brought up the article Why you shouldn’t say “this study is underpowered”. Since you often emphasize study design and better measurement over statistical significance, I thought you might be interested.

Tggp:

I have a few things to say about this discussion.

First, the study under discussion seems consistent with the general belief that these teaching evaluations are biased. I think it’s a mistake to expect or demand that a study necessarily provide conclusive evidence for some claim; sometimes it is fine to just say that your data are consistent with existing understanding.

Second, I do think the authors of the article are pushing too hard. For example, they write, “Our findings that students are biased against females and non-natives in their evaluations of teaching…”, but I don’t see anything in their findings relating to non-natives. So I don’t know where that is coming from.

Third, the authors of the article and also the twitter commenters are making the mistake of labeling non-statistically-significant differences as zero (a problem alluded to in the above post!!).

The authors write, “nonwhite faculty members were not evaluated more harshly,” and then they go on to offer an explanation. But no explanation is needed for a non-statistically-significant difference from N=14! They also make the common and notorious error of labeling a result with an intermediate p-value as representing “weak initial evidence.”

The twitter commenter makes a similar mistake, writing, “There is NO effect of race.” Setting aside causal issues (race is not being manipulated; rather, they’re comparing particular instructors of different races), it’s an error to label a non-statistically-significant difference as zero.

In any case, it’s good to see these sorts of discussions online. I’m kinda surprised that the paper got published as is in a political science journal. I think that political scientists are usually pretty aware of these statistical issues.