## I’m still struggling to understand hypothesis testing . . . leading to a more general discussion of the role of assumptions in statistics

I’m sitting at this talk where Thomas Richardson is talking about testing a hypothesis regarding a joint distribution of three variables, X1, X2, X3. The hypothesis being tested is that X1 and X2 are conditionally independent given X3. I don’t have a copy of Richardson’s slides, but here’s a paper that I think is related, just to give you a general sense of his theoretical framework.

The thing that’s bugging me is that I can’t see why anyone would want to do this, test the hypothesis that X1 and X2 are conditionally independent given X3. My problem is that in any situation where these two variables could be conditionally dependent, I think they will be conditionally dependent. It’s the no-true-zeroes thing; see the discussion starting on page 960 here. I’m not really interested in testing a hypothesis that I know is false, that I know would be rejected if I could just gather enough data.
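A rough simulation sketch of the no-true-zeroes point, not taken from Richardson’s talk or paper. It assumes a made-up linear Gaussian model in which X2 depends on X3 and also, very weakly, on X1 (the `effect=0.05` is an arbitrary illustrative value). It then tests conditional independence with the standard Fisher z-test on the partial correlation. All function names here are hypothetical.

```python
import math
import random


def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def partial_corr(x1, x2, x3):
    """Partial correlation of x1 and x2 given x3."""
    r12, r13, r23 = corr(x1, x2), corr(x1, x3), corr(x2, x3)
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


def rejects(n, effect, rng, z_crit=1.96):
    """Simulate one dataset and test conditional independence at the 5% level."""
    x3 = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [v + rng.gauss(0, 1) for v in x3]
    # x2 depends on x3 and, weakly, directly on x1 (the "no true zero").
    x2 = [v + effect * u + rng.gauss(0, 1) for v, u in zip(x3, x1)]
    # Fisher z-test for a partial correlation with one conditioning variable.
    z = math.atanh(partial_corr(x1, x2, x3)) * math.sqrt(n - 4)
    return abs(z) > z_crit


def rejection_rate(n, effect=0.05, reps=50, seed=0):
    """Share of simulated datasets of size n in which the test rejects."""
    rng = random.Random(seed)
    return sum(rejects(n, effect, rng) for _ in range(reps)) / reps


for n in (100, 1000, 5000):
    print(n, rejection_rate(n))
```

Under these assumptions the rejection rate should climb toward 1 as n grows, even though the dependence is tiny. In that sense the test outcome says more about the sample size than about whether the dependence is “really” zero.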

That said, Thomas Richardson is a very reasonable person, so even though his talk is full of things that I think make no sense—he even brought up type 1 and type 2 errors!—I expect there’s something reasonable in all this research, I just have to figure out what.

I can think of a couple of possibilities.

First, maybe we’re not really trying to see if X1 and X2 are conditionally independent; rather, we’re trying to see whether we have enough data to reject the hypothesis of conditional independence. That is, the goal of the hypothesis test is not to accept or reject a hypothesis, but rather to make a statement about the strength of the data.

But I don’t think this is what Richardson was getting at, as he introduced the problem by saying that the goal would be to choose among models. I don’t like this. I think it would be a mistake to use the non-rejection of a hypothesis test to choose the model of conditional independence.

Second, maybe the hypothesis test is really being used as a sort of estimation. For example: I said that if two variables can be dependent, then they will. But what about a problem such as genetics, where two genes could be on the same or different chromosomes? If they’re on different chromosomes, you’ll have conditional independence. Well, not completely—there are always real-world complications—but close enough. I guess that’s the point, that “close enough” could be what you’re testing.

I think this might be what Richardson is getting at, because later in his presentation, he talked about strong and weak regimes of dependence. So I think the purpose of the hypothesis test is to choose the conditional independence model when the conditional dependence is weak enough.

OK, fine. But, given all that, I think it makes sense to estimate all these dependences directly rather than to test hypotheses. When I see all the contortions that are being done to estimate type 1 and type 2 errors . . . I just don’t see why bother. And I’m concerned that the application of these results can lead to bad science, in the same way that reasoning based on statistical significance can lead to bad science more generally.

That said, I can’t say that my above arguments are airtight. After all, my colleagues and I go around making inferences based on normal distributions, logistic regressions, and all sorts of other assumptions that we know are false.

Assuming something false and then working from there to draw inferences: it’s a ridiculous way to proceed but it’s what we all do. Except in some very rare cases (for example, working out the distribution of profits from a casino), there’s really no alternative.

True, I get a bit annoyed when statisticians, computer scientists, and others talk about “assumption-free methods” and “theoretical guarantees,” but that’s all just rhetoric. Once we accept that all methods and theorems are based on assumptions, we can all proceed on an equal basis.

At this point it would be tempting to say that assumps are fine, we just need to evaluate our assumps. But that won’t quite work either, as what does it mean to “evaluate” an assump? The evaluation has to be along the lines of, How wrong is the assump? But how to compare, for example, the wrongness of the assumption of a normal distribution for state-level error terms, the wrongness of the assumption of a logistic link mapping these to probability of Republican vote choice, and the wrongness of a conditional independence assumption?

I guess one problem I have with work such as Richardson’s on conditional independence is that I fear that the ultimate purpose of these methods is often to give researchers an excuse to exclude potentially important interactions from their models, just because these interactions are not statistically significant. The trouble here is that (a) whether something is statistically significant is itself a very random feature of data, so in this case you’re essentially outsourcing your modeling decision to a random number, and (b) if lack of statistical significance is a concern, which it can be, then I think the ultimate concern is not whether the interaction in question is zero, but rather that the uncertainty in that interaction is large. In which case I think the right approach is to recognize that uncertainty, both through partial pooling of the estimate and through propagation of that uncertainty in subsequent inferences.
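Points (a) and (b) can be illustrated with a small sketch. The numbers are invented: a true interaction of 1.0 estimated with standard error 0.5, and a normal prior centered at zero with a hypothetical scale `prior_sd`. The first function replays the same study many times and counts how often it comes out “significant.” The second applies the usual normal-normal shrinkage formula as a stand-in for partial pooling. It is one simple version of the idea, not a full hierarchical model.

```python
import math
import random


def share_significant(true_effect=1.0, std_error=0.5, reps=2000, seed=1):
    """Fraction of replications in which the estimate has |z| > 1.96."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        estimate = rng.gauss(true_effect, std_error)
        hits += abs(estimate / std_error) > 1.96
    return hits / reps


def partially_pooled(estimate, std_error, prior_sd):
    """Posterior mean and sd under a normal(0, prior_sd) prior on the effect."""
    precision = 1 / prior_sd ** 2 + 1 / std_error ** 2
    post_mean = (estimate / std_error ** 2) / precision
    return post_mean, math.sqrt(1 / precision)


print(share_significant())
print(partially_pooled(estimate=1.0, std_error=0.5, prior_sd=0.5))
```

With the true effect sitting two standard errors from zero, roughly half of the replications clear the threshold, so including or excluding the interaction on that basis is close to a coin flip. The shrinkage estimate never drops the term. It pulls the estimate toward zero and reports the remaining uncertainty, which can then be carried into later inferences.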

But then again you could say something similar about the statistical methods that my colleagues and I use, in that we’re riding on strong assumptions—just a different set of assumptions, that’s all.

So I’m not sure what to think. Different methods can work well on different applied problems, and all the methods discussed above are general frameworks, not specific algorithms or models, which means that effectiveness can come in the details—recall the principle that the most important aspect of a statistical method is not what it does with the data but rather what data it uses—so I can well imagine that, in the right hands, modeling the world in terms of conditional independence and estimating this structure through hypothesis testing could solve real problems. Still, that model seems awkward to me. It bothers me, and I’d need to be convinced that it really does anything useful.

1. Anon says:

I haven’t read the paper or dug into the details on this one. However, if Thomas Richardson thinks it’s a good idea, it probably is…just know that while you wait to figure out how.

2. Anoneuoid says:

Consider the following simplified medical trial to examine the effect of diet and exercise on diabetes, adapted from [5]. At baseline, patients are randomly assigned to perform t hours of exercise in a week, but actually perform x hours. At the end of the week their blood pressure (bp) is measured, this is assumed to depend upon x, but also to be confounded with it by lifestyle factors. In the second phase of the trial, patients are assigned to lose ∆bmi kilograms in weight; the value of ∆bmi is random, but for ethical reasons depends linearly on x and bp. Finally, at the end of the trial, triglyceride levels (y) are measured, which is used to diagnose diabetes; these are assumed to be correlated with blood pressure, and dependent on exercise and weight loss. This causal structure naturally yields the ADMG shown in Fig. 4(a).

http://auai.org/uai2018/proceedings/papers/255.pdf

I really don’t see what anyone is supposed to do with this.

• Anoneuoid says:

From their ref 5:

We motivate the normal linear models analyzed here with the following example, which is adapted from a more complex longitudinal study considered in Robins (2008).

Consider a two-phase sequential intervention study examining the effect of exercise and diet on diabetes. In the first phase patients are randomly assigned to a number of hours of exercise per week (Ex) drawn from a log-normal distribution. At the end of this phase blood pressure (BP) levels are measured. In the second phase patients are randomly assigned to a strict calorie controlled diet that produces a change in body-mass index (∆BMI). The assigned change in BMI, though still randomized, is drawn, by design, from a normal distribution with mean depending linearly on X = log(Ex) and BP. The dependence here is due to practical and ethical considerations. Finally at the end of the second phase, triglyceride levels (Y) indicating diabetic status are measured.

A question of interest is whether or not there is an effect of X on the outcome Y that is not mediated through the dependence of ∆BMI on X and BP. In other words, if there had been no ethical or practical restrictions, and the assignment (∆BMI) in the second phase was completely randomized and thus independent of BP and X, would there still be any dependence between X and Y? Note that due to underlying confounding factors such as life history and genetic background, we would expect to observe dependence between BP and Y even if the null hypothesis of no effect of X on Y was true and the second treatment (∆BMI) was completely randomized.

http://jmlr.csail.mit.edu/papers/v10/drton09a.html

And here is that Robins 2008, which is interested in coming up with equations to tell us what would happen to mortality rates if you put nonsmokers on a calorie restricted diet from when they were 18-70 to maintain the same weight: https://www.ncbi.nlm.nih.gov/pubmed/18695650

Is there any prediction made in all those equations we can check now, 12 years later?

3. I would like to see Miguel Hernan participate in this subject. Sander Greenland, Stephen Senn, Steven Goodman, John Ioannidis, and Judea Pearl too.

4. Let me turn this question around and ask you how you decide whether to include a predictor or an interaction, or non-linearity or anything else for that matter in your models? Some include interactions, but none of the non-trivial ones include all 2^N interactions of N predictors. How was the model selection done in practice?

• Anoneuoid says:

Throw everything you have in there then prune it back based on cross validation predictive skill. That skill will be overstated due to leakage from the data into the features/model, so then you judge the performance on the predictive skill on a so far unseen holdout dataset.
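The workflow described above (start with everything, prune by cross-validated prediction error, then score once on unseen holdout data) might look roughly like the sketch below. Everything in it is an illustrative assumption: six simulated predictors of which only the first two matter, ordinary least squares as the model, and greedy backward elimination as the pruning rule. The helper names are hypothetical.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def design(rows, features):
    """Design matrix: an intercept plus the chosen feature columns."""
    return [[1.0] + [r[j] for j in features] for r in rows]


def fit(X, y):
    """Ordinary least squares via the normal equations."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


def mse(beta, X, y):
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(X, y)) / len(y)


def cv_mse(rows, y, features, folds=5):
    """K-fold cross-validated mean squared prediction error."""
    n, total = len(y), 0.0
    for f in range(folds):
        test = set(range(f, n, folds))
        train = [i for i in range(n) if i not in test]
        beta = fit(design([rows[i] for i in train], features), [y[i] for i in train])
        held = sorted(test)
        total += len(held) * mse(beta, design([rows[i] for i in held], features),
                                 [y[i] for i in held])
    return total / n


def prune(rows, y, features):
    """Backward elimination: drop a feature if CV error does not get worse."""
    features = list(features)
    best = cv_mse(rows, y, features)
    improved = True
    while improved and features:
        improved = False
        for f in list(features):
            trial = [g for g in features if g != f]
            score = cv_mse(rows, y, trial)
            if score <= best:
                features, best, improved = trial, score, True
                break
    return features, best


def simulate(n, rng, n_features=6):
    """Made-up data: only features 0 and 1 affect y."""
    rows = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    y = [2.0 * r[0] - 1.0 * r[1] + rng.gauss(0, 1) for r in rows]
    return rows, y


rng = random.Random(0)
train_rows, train_y = simulate(300, rng)
hold_rows, hold_y = simulate(200, rng)       # never touched during pruning
selected, cv_score = prune(train_rows, train_y, range(6))
beta = fit(design(train_rows, selected), train_y)
print(selected, cv_score, mse(beta, design(hold_rows, selected), hold_y))
```

The cross-validated score was used to choose the feature set, so it is somewhat optimistic. The holdout error is the cleaner estimate. Note that nothing in this procedure assigns a meaning to the fitted coefficients; it only ranks feature sets by predictive error.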

It really isn’t complicated unless you want to do something impossible like derive some meaning from the coefficients of an arbitrary model. Which is of course what all these people waste all our tax dollars trying to do.

• matt says:

God, the second part of this comment is stupid. At least it seems like people have just stopped replying to these, anyways.

Meaningful insight can be gleaned from the ‘coefficients’ of a statistical model (let’s stick with regression). I’m still just absolutely stunned that you think you’ve proven otherwise by noting that if you ‘add a regressor’ to a regression model the coefficient on the other variables changes – thereby showing that the coefficient has no meaning. The interpretation of the coefficient also changes when you add another variable, so I’m not sure what it is you’ve proven. Some models are more reasonable than others (based off theory – and no, it doesn’t need to be theory derived from the lowest-level mathematical principles, as in your cell growth examples). Further, even if we are using the correct model (i.e. the DGP) we would expect the coefficients to change after adding an irrelevant regressor due to sampling noise. Regression coefficients always have a well-defined interpretation – often this is a purely statistical one – but, under a specific set of assumptions we can move from statistical to causal interpretations. And no, these assumptions aren’t that strong in a lot of settings. Even if they are strong, the interpretation is still well-defined, it’s just contingent on the assumptions being true (as with all statistical models).
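Both sides of this exchange agree on the underlying arithmetic, which a small simulation can show. The sketch assumes an invented data-generating process, y = 1.0·x + 2.0·z + noise with x and z correlated, and compares the coefficient on x with and without z in the model. The two-regressor coefficient is computed by residualizing x on z (the Frisch–Waugh–Lovell result), which avoids any matrix code. Names and numbers are illustrative only.

```python
import random


def mean(v):
    return sum(v) / len(v)


def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residualize(x, z):
    """Residuals from regressing x on z (with an intercept)."""
    b, mx, mz = slope(z, x), mean(x), mean(z)
    return [a - mx - b * (c - mz) for a, c in zip(x, z)]


def simulate(n=5000, seed=2):
    """Made-up process: x is correlated with z, and both affect y."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [0.8 * c + rng.gauss(0, 0.6) for c in z]
    y = [1.0 * a + 2.0 * c + rng.gauss(0, 1) for a, c in zip(x, z)]
    return x, z, y


x, z, y = simulate()
coef_short = slope(x, y)                  # coefficient on x in y ~ x
coef_long = slope(residualize(x, z), y)   # coefficient on x in y ~ x + z
print(coef_short, coef_long)
```

By construction the population values are 2.6 for the short regression and 1.0 for the long one. Each answers a different question: the first is the predictive difference in y per unit of x, the second is that difference holding z fixed. Whether either deserves a causal reading depends on assumptions outside the regression, which is the point under dispute here.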

• Anoneuoid says:

The interpretation of the coefficient also changes when you add another variable, so I’m not sure what it is you’ve proven.

Yep.

Some models are more reasonable than others (based off theory – and no, it doesn’t need to be theory derived from the lowest-level mathematical principles, as in your cell growth examples).

If you just throw in whatever data you have around the meaning depends on whatever data you had around. It is totally arbitrary.

This is really basic stuff, and easily proven to yourself by fiddling with the model by adding/removing variables/interactions like everyone does. Then imagine, what if we are missing data on a key variable so it can’t be included in the model?

Further, even if we are using the correct model (i.e. the DGP) we would expect the coefficients to change after adding an irrelevant regressor due to sampling noise.

Yes, you have changed the meaning of the coefficient, so the value will change too. In a theoretically derived model you are never going to add irrelevant variables into the model.

Regression coefficients always have a well-defined interpretation – often this is a purely statistical one – but, under a specific set of assumptions we can move from statistical to causal interpretations.

Yes, and this well-defined interpretation changes if you change the model. You can always assume anything you want, that doesn’t make it correct.

• matt says:

What is your point here, though? Some set of assumptions are more reasonable than others, no? Ergo some interpretations of coefficients are more reasonable than others. There is a back and forth between theory and data – they inform one another – ideally, yes you would have your theory beforehand and derive your model and then fit it, but that is not how it works in practice. You might still add in some other variables that aren’t necessarily in the working version of your theory to see how things change – this could conceivably inform your theory (but I agree this is problematic in all sorts of ways).

Honestly, what I dislike about your comments is that you take a very extreme position that I assume you don’t actually hold; namely, you claim that ‘it is impossible to derive meaning from the coefficients of an arbitrary model’, and as far as I can tell nearly all models that aren’t derived from mathematical first principles are ‘arbitrary’ according to you. So this means you are spending all this time on a blog about social science and statistics, and yet you don’t think that we can ever glean insight about the world, in a structural sense, not just a my-black-box-model-can-predict-things sense, from statistical modelling? Do you not like any of the work Gelman does? Because he has work where he is interpreting parameter estimates from a statistical model. The position you take is just so extreme I can’t believe you actually hold it, and yet you tout it in nearly every comment you make on this blog.

• Anoneuoid says:

So this means you are spending all this time on a blog about social science and statistics, and yet you don’t think that we can ever glean insight about the world, in a structural sense, not just a my-black-box-model-can-predict-things sense, from statistical modelling?

Yes, I do quite a bit of machine learning which is basically just statistical modelling taken to the logical conclusion. So the arbitrariness of the coefficients is very obvious to me.

People only think those in simpler models are more meaningful, but they aren’t. If you have millions of different plausible models, then the coefficient has millions of different values and you have no way to choose between them. And I would say millions is an underestimate by orders of magnitude: https://statmodeling.stat.columbia.edu/2019/08/01/the-garden-of-forking-paths/

There is nothing complicated about it and it should be self evident to anyone who ever fiddled with a model and saw the coefficient of interest change.

Do you not like any of the work Gelman does?

I learned that from this blog. It is the essence of the “garden of forking paths” concept.

• matt says:

Heh ya I’m not interested in discussing the quality of social psychology experiments where N=30.

A good context for you to look at is sports analytics. Tons of models in that space that I would consider ‘structural’, and they provide you with meaningful interpretations of model quantities (e.g. player skill estimates). These are easily tested by analyzing new data; these approaches will absolutely dominate any ML approach that doesn’t make use of domain-knowledge to impose a lot of structure on the data.

• Anoneuoid says:

Tons of models in that space that I would consider ‘structural’, and they provide you with meaningful interpretations of model quantities (e.g. player skill estimates). These are easily tested by analyzing new data; these approaches will absolutely dominate any ML approach that doesn’t make use of domain-knowledge to impose a lot of structure on the data.

If the model is derived from some assumptions you are willing to accept then the coefficients will have meaning. Of course if you plug more relevant information into the model it will have better predictive skill. I never said it wouldn’t, that doesn’t mean the coefficients are not arbitrary though.

• Andrew says:

Bob:

Yeah, I’m not sure. Setting aside computational constraints (we’d like to fit more interactions but then our models would take too long to fit), I guess my decisions usually have something to do with prediction error. If adding new predictors or interactions doesn’t improve our estimated out-of-sample prediction error, then . . . it’s not that we think our coefficients are zero, we just don’t really have the information in the data to learn anything about them. Maybe it would also make sense to study this by comparing posteriors to priors, to get another angle on what we’re learning from the data.

The thing that Thomas Richardson was doing bothered me because I couldn’t really see the goal. Include enough data, and all these interactions can be estimated. So if you test the hypothesis that X1 and X2 are conditionally independent given X3, then what you’re really testing is whether you have enough data to estimate this particular interaction. And that doesn’t really seem like a goal in itself.

It’s worth thinking about all this further, though. That’s why I wrote the post, six months ago!

• Ben says:

I dunno what this is, but conditional independence is what that Kleppe reparameterization paper is built on, so maybe it’d be possible to use this in an MCMC reparameterization somewhere. There nothing is actually totally conditionally independent, but that’s okay, you’re just reparameterizing to make stuff run faster.

5. Terry says:

So I think the purpose of the hypothesis test is to choose the conditional independence model when the conditional dependence is weak enough.

So is this the old argument about “do we simplify the model by eliminating (probably) weak variables, or do we include everything”?

Yes, using statistical significance is a seriously flawed way to eliminate variables. But does the “include everything” approach have the opposite weakness of including a zillion variables in each analysis? How does the “include everything” crowd justify not including every variable in every (even remotely) relevant dataset?

Perhaps it is because Bayes allows us to include so much more in our models that the world is tipping towards the “include more” side of things. But isn’t there still a limit, albeit higher?

• Anoneuoid says:

Yes include everything you can, but some of the data will be misleading for some reason or another so you filter it out based on predictive skill of the model. The other thing I forgot to mention is eventually you run out of time or some other resource so focus on the more correlated stuff first. These DAGS have nothing to do with it.

6. Terry says:

Literally thousands of variables?

Literally billions of interactions?

Should we set up a master dataset with every dataset ever used and run all analyses on this master dataset?

• Anoneuoid says:

Sure, the limit is in RAM and computational power more than anything. Just do a kaggle competition where you don’t even know what the columns represent but can still get a useful model out.

I mean many times you do end up pruning out a lot of the features because some will be:

– highly correlated with each other (so keeping both doesn’t add much info),
– have extremely low cardinality (eg all the same value)
– be a linear combination of other features you have (column C = column A + column B)
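The three pruning rules in this list can be sketched in a few lines. This is one possible implementation under stated assumptions, not the commenter’s actual procedure: “low cardinality” is reduced to the all-one-value case, “highly correlated” means an absolute Pearson correlation above a hypothetical threshold of 0.95, and linear combinations are detected with a Gram-Schmidt projection onto the columns kept so far.

```python
import math


def center(col):
    m = sum(col) / len(col)
    return [v - m for v in col]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prune_features(columns, corr_threshold=0.95, tol=1e-10):
    """Return (kept names, {dropped name: reason}) for a dict of numeric columns."""
    kept, dropped = {}, {}   # kept maps name -> centered column
    basis = []               # orthogonalized versions of the kept columns
    for name, col in columns.items():
        if len(set(col)) <= 1:
            dropped[name] = "constant"
            continue
        c = center(col)
        norm_c = dot(c, c)
        twin = next((k for k, v in kept.items()
                     if abs(dot(c, v)) / math.sqrt(norm_c * dot(v, v)) > corr_threshold),
                    None)
        if twin is not None:
            dropped[name] = "highly correlated with " + twin
            continue
        # Gram-Schmidt step: remove the part of c explained by the kept columns.
        r = c
        for b in basis:
            coef = dot(r, b) / dot(b, b)
            r = [ri - coef * bi for ri, bi in zip(r, b)]
        if dot(r, r) / norm_c < tol:
            dropped[name] = "linear combination of kept columns"
            continue
        kept[name] = c
        basis.append(r)
    return list(kept), dropped


a = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
b = [3.0, -1.0, 2.0, 0.0, 5.0, -2.0]
table = {
    "a": a,
    "b": b,
    "a_copy": [2 * v + 1 for v in a],      # perfectly correlated with a
    "const": [3.0] * len(a),               # a single repeated value
    "c": [u + v for u, v in zip(a, b)],    # c = a + b
}
print(prune_features(table))
```

The result depends on column order, since whichever member of a redundant group appears first is the one kept. That choice is arbitrary in this sketch.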

But it’s better to start as big as practical so you don’t miss something. That is kind of how the human brain grows, children have more connections than adults: https://en.wikipedia.org/wiki/Synaptic_pruning

The hardest part is getting a good holdout dataset so the model’s skill will generalize to when you want to actually apply it.

• matt says:

What if the rules of the game change? Then the data you’ve fit and tested your model on is irrelevant, and your model won’t generalize.

i.e. the Lucas critique. This is why you need causal models of the world.

Related question, do you think that you could win a Kaggle competition not knowing what the predictor columns represent? Or do you think that by applying domain knowledge and knowing what the variables are you could improve prediction substantially?

• Anoneuoid says:

Related question, do you think that you could win a Kaggle competition not knowing what the predictor columns represent?

No one doing the competition would know, I mean you can figure out one column is ages or timestamps or whatever. I haven’t done one for awhile but that was pretty much standard practice.

Or do you think that by applying domain knowledge and knowing what the variables are you could improve prediction substantially?

Sure, you could get an order of magnitude estimate of the volume of a box by plugging in all sorts of random attributes that will correlate for some reason like color, shape, material, location, etc but obviously it is better to know volume = length*width*height. But if you don’t know that you are going to have more success throwing everything in.

• Anoneuoid says:

And where is the causality in that equation btw? Eg, v = l*w*h is the same as h = v/(l*w).

7. Dale Lehman says:

I have pondered this very question: include everything (just how many is that?) or ignore interactions if they are not the focus of your analysis (and what would be the basis for them being or not being the focus)? I am starting to believe that is only in part a statistical matter – and a small part at that. Somewhere, the subject matter knowledge must be a determining factor. When there is a theoretical reason for an interaction to be potentially important, then the burden must shift to a researcher that chooses not to include it. In the absence of such theory, I think the burden belongs mostly on those who would critique a study for failing to include such interactions.

I think the way courts examine these things is a reasonable model. Critiquing a study because it does not include “everything” doesn’t get you very far. You would have to also defend why it should have been included. I do think this puts a great deal of emphasis on established theory – perhaps too much. In a number of discussions on this blog about particular psychological theories, I have myself been guilty of questioning established theory (in areas that are not my specialty). There is always a tension between paying deference to a theory on the basis that it was accepted in the past, and asking for all potential theories to be evaluated on an equal basis. My guess is that neither extreme is workable. But it is possible that the pendulum is shifting (away from accepting the establishment and towards skepticism), and perhaps over-shifting.

• Jonathan (another one) says:

Well put, Dale. I would add two things. Not only is the critique of unspecified omitted variables unpersuasive to a court, I have never seen a court criticize an included variable which failed to achieve conventional statistical significance. The vast majority of statistical claims I have seen adjudicated in court separate over theory, not the specifics of the modelling. A lot of this comes from the sharing of data before entering the courtroom. Give both sides the same data and what they choose to do with it, and how they defend their use of it, is usually far more dependent on theory than anything else. One side’s model fits the data better, but the other side dismisses that fit as theoretically incoherent.

• Hi Dale,

Can you give us a specific case exemplifying a reasonable model? Thanks.

• Dale Lehman says:

It is the complex AT&T/Time Warner case and much of the decision depended on an economic model the government used. The model itself had a number of assumptions and the judge evaluated the model both on its assumptions and the degree to which they comported both with established theory and empirical evidence.

• Dale Lehman says:

I posted a response with a very ugly URL so I don’t know if that will ever appear. But perhaps a relevant example is the recent Harvard admissions discrimination suit. Much of that case revolved around the statistical models examining whether or not race played a role in admissions, and the models differed in whether or not various factors were included in the model (in particular, whether more subjective factors, such as ability to interact socially). While these were not interaction effects, they are similar in that any number of variables could or could not be included in the models. And the court case involved both statistical consideration of the competing models as well as theoretical considerations of what variables should or should not be included.

• Andrew says:

Dale:

Yes, we discussed that Harvard case here. The title of our essay: “What Statistics Can’t Tell Us in the Fight over Affirmative Action at Harvard.”

• Martha (Smith) says:

“include everything (just how many is that?) or ignore interactions if they are not the focus of your analysis (and what would be the basis for them being or not being the focus)? I am starting to believe that is only in part a statistical matter – and a small part at that. Somewhere, the subject matter knowledge must be a determining factor. When there is a theoretical reason for an interaction to be potentially important, then the burden must shift to a researcher that chooses not to include it. In the absence of such theory, I think the burden belongs mostly on those who would critique a study for failing to include such interactions.”

Subject matter knowledge is very important. But what if people disagree on what “subject matter knowledge” is? Aye, there’s the rub. Life (and especially good science) ain’t easy!

8. Clay says:

Richardson is a Causal Discovery guy. He got his PhD with Peter Spirtes out of CMU. Similar to Judea Pearl. I haven’t followed these people closely lately, and I’m not a statistician, but I did study with them at some point. They are most interested in estimating causal relationships from observational data. People are interested in causal relationships because they want to predict the outcomes of taking new exogenous actions that break the existing correlations in the observed system.

“My problem is that in any situation where these two variables could be conditionally dependent, I think they will be conditionally dependent. It’s the no-true-zeroes thing”
I think practically speaking, we need to act as if most potential causal relationships are negligible; otherwise we would be paralyzed.

These people represent the non-negligible relationships in a causal graph (usually acyclic — much simpler then). Their main approach employs what they call the Faithfulness Assumption, which basically says that if there is no dependence relationship, then we assume that it’s due to causal graph structure, not due to cancelation of parameters. They admit that this kind of reasoning is not appropriate to all systems (paradigmatically, Faithfulness is inappropriate in homeostatic systems). You could take a fully Bayesian approach to Causal Graph discovery, but then you would have a bunch of painful decisions about priors, which these particular folks would rather avoid I think you would anticipate correctly. But a Bayesian approach could still be compatible with their interests broadly, I think.

some publications:
A recent review out of CMU: https://www.frontiersin.org/articles/10.3389/fgene.2019.00524/full
A Bayesian Approach to Causal Discovery, Heckerman Meek Cooper, Paywalled, https://link.springer.com/chapter/10.1007/3-540-33486-6_1
A Bayesian Approach to Constraint-Based Causal Discovery: https://arxiv.org/abs/1210.4866

9. Justin says:

“The thing that’s bugging me is that I can’t see why anyone would want to do this..”

IMO this is a variation of “The Statistician’s Fallacy” by Lakens (see http://daniellakens.blogspot.com/2017/11/the-statisticians-fallacy.html)

“…that I know would be rejected if I could just gather enough data.”

See Lakens (and Hagen in the article) again: http://daniellakens.blogspot.com/2014/06/the-null-is-always-false-except-when-it.html

There are also procedures that adjust alpha based on sample size, say alpha goes to 0 as sample size goes to infinity, but these are not widely applied I find.

“I guess one problem I have with work such as Richardson’s on conditional independence is that I fear that the ultimate purpose of these methods is often to give researchers an excuse to exclude potentially important interactions from their models, just because these interactions are not statistically significant.”

The alternative would be to include potentially unimportant (or all) interactions?

“The trouble here is that (a) whether something is statistically significant is itself a very random feature of data, so in this case you’re essentially outsourcing your modeling decision to a random number,”

BFs too. ;)

But I (still) wouldn’t characterize however one is defining “statistical significance” as “a very random feature of the data”. For if there is something there, we’d expect the test statistic to be far away from what is expected under the model.

Justin

• jim says:

Justin,

You defend NHST regularly but I don’t recall seeing you comment on many of the studies that have failed replication. I’m curious how you see studies like the famous “power pose” study, or the “strong republicans” study. Care to comment?

• Martha (Smith) says:

Justin said,

“But I (still) wouldn’t characterize however one is defining “statistical significance” as “a very random feature of the data”.”

The data are a random selection of possible outcomes, and “statistical significance” is a function of the data — hence, I see statistical significance as a random variable, whose value is a function of the data being considered.

Justin also said,
“For if there is something there, we’d expect the test statistic to be far away from what is expected under the model.”

It’s unclear what you mean by “there”. Do you mean “in the data”, or “in the situation from which the data arise “? If you mean “in the data”, then you are just talking about the data, not about the underlying random variables. If you mean, “in the situation from which the data arise, ” then we’d only expect the test statistics calculated from the distribution of possible data to be *on average* far away from what is expected under the model. (And, of course, the model could be wrong.)

• Shane says:

+1

• Carlos Ungil says:

“I see statistical significance as a random variable, whose value is a function of the data being considered”

You are not wrong, but anything that depends on the data is a random variable and arguably the “in this case you’re essentially outsourcing your modeling decision to a random number” would apply to any data-based decision making procedure.

10. Shravan says:

Andrew, I think you find hypothesis testing puzzling because you don’t have to routinely do planned experiments. Even if one doesn’t do a test, one often has to decide to act as if an effect is present or absent. How one comes to that decision can be a Bayes factor, a p-value calculation, eyeballing the posterior, etc. What I try to do is eyeball the posterior relative to a predicted range of effects (in my research I have model predictions), and at best say, yes, it’s more or less consistent or inconsistent with the predicted effect. To make the deterministic reader happy, we have started sticking in “BF curves”: BFs under increasingly strong priors on the target parameter. I see your point that even in planned experiments, hypothesis testing is a largely fictional exercise. For me, in my field at least, replicability is more convincing than a BF or whatever. If I can roughly replicate a pattern I am happy enough to conclude the effect is present. E.g., in one of my students’ work, none of the key effects was “significant”. But it fits the pattern in the literature. I think I learnt this from you, “the secret weapon” (Gelman and Hill, a footnote near page 92 iirc). The student’s paper:

https://www.sciencedirect.com/science/article/pii/S0749596X20300012?via%3Dihub

What is cool here is that we did not do a single hypothesis test. Just eyeballing the posterior. And it was published in a top journal! That is some serious progress.

If a strict frequentist were writing this paper, they would argue they found evidence that there are no agreement attraction effects in Armenian. Lol. It is a beautiful contrast between mindless application of data analysis procedures vs actually thinking about the posterior given existing data.
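For readers wondering what a "BF curve" might look like, here is a rough sketch under strong simplifying assumptions that are mine, not taken from the paper linked above: the effect estimate is treated as normal with a known standard error, the prior on the effect under H1 is Normal(0, prior_sd^2), and H0 is a point null at zero. Under those assumptions the Bayes factor is a ratio of two normal densities evaluated at the estimate. The numbers (`estimate`, `se`, `prior_sds`) are hypothetical.

```python
import math


def normal_pdf(x, sd):
    """Density of Normal(0, sd^2) at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def bf10(estimate, se, prior_sd):
    """Bayes factor for H1: effect ~ Normal(0, prior_sd^2) against H0: effect = 0,
    assuming estimate ~ Normal(effect, se^2)."""
    marginal_h1 = normal_pdf(estimate, math.sqrt(se**2 + prior_sd**2))
    marginal_h0 = normal_pdf(estimate, se)
    return marginal_h1 / marginal_h0


# Hypothetical estimate and standard error (say, in milliseconds).
estimate, se = 20.0, 12.0
prior_sds = [5, 10, 20, 40, 80, 160]

curve = [(sd, bf10(estimate, se, sd)) for sd in prior_sds]
for sd, bf in curve:
    print(f"prior sd = {sd:>4}: BF10 = {bf:.2f}")
```

The point of printing the whole curve rather than one number is that the Bayes factor depends heavily on the prior width: a very tight prior pushes it toward 1, and a very diffuse prior pushes it toward 0 regardless of the data.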

• Chris Wilson says:

Shravan, thanks for sharing your experience! Appeasing reviewers (and coauthors) who have been trained to look for deterministic conclusions from stat procedures is tricky business, especially in analysis of planned experiments as you say. The usual strategy is to weave your story out of the pattern of significant/not-significant results. Once you’ve seen through how muddled this is – all the way down – it is impossible to un-see, yet communication becomes so tortuous at times :)

• Martha (Smith) says:

+1 So many researchers are stuck in the pattern of That’s the Way We’ve Always Done It (TTWWADI), and take those traditions as something like gospel.

11. Jag Bhalla says:

Suggest a higher-level heuristic might be useful in practical cases. Always handy to ponder how to adjust the model/stats results for the known exclusions/simplifications. Economics provides abundant examples of what goes wrong when you fail to do this. As the saying goes, all models/maps leave out details… but for example, the “rational utility maximizing” assumption is like not putting a known roadblock on your map.

Wrote a short piece on this (inspired by an old post of Andrew’s).
https://bigthink.com/errors-we-live-by/savvy-consumers-of-economic-ideas-know-how-to-spot-what-wolf-and-rigor-distortis-errors
Here are the key lines:
A rigor-loving psychology, paradoxically, predisposes many economists to prefer being narrowly right yet broadly wrong. Their precision-seeking methods rigorously misrepresent reality (“rigor distortis”)

Always ask how economists adjust for known exclusions. And why given models presume causal stability. Unless they offer practical answers and adjustments for unmodeled effects, you can ignore them, just like real economies do.

12. Thomas Richardson says:

Thanks for mentioning my talk.
The paper I was presenting, which is co-authored with F. Richard Guo ( https://unbiased.co.in ) is:

On Testing Marginal versus Conditional Independence
https://arxiv.org/abs/1906.01850
It will appear in Biometrika.

The problem considered is very simple, but there are not so many frequentist procedures that address non-nested hypotheses.

Under the assumption that one of the two (Gaussian) models is true our asymptotic guarantees are “rate-free”, in other words they do not require a particular scaling of the non-zero parameters with the sample size. (We do not claim that the method is “assumption free”.)

The proposed test abandons the usual NHST asymmetry between null and alternative {Reject H0} / {Fail to Reject H0}, instead opting to give one of three answers:
{H0 and not H1}, {H1 and not H0}, {H0 or H1}.

We re-define Type I error as reporting {H1 and not H0} when H0 is true, and
Type II error as reporting {H0 and not H1} when H1 is true.

In this way, the procedure can control both Type I and Type II errors since it has the option to report {H0 or H1}.
At the same time, we show that the performance of the procedure is near optimal.
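To illustrate only the shape of a three-answer output, here is a deliberately naive sketch. It is not the procedure from the paper and carries none of its guarantees. As assumptions made here: H0 is taken to be marginal independence of X1 and X2, H1 is conditional independence given X3, each is checked separately with a Fisher z-test on the (partial) correlation, and the hypothetical helper `three_way_decision` stays undecided unless exactly one of the two is rejected.

```python
import math
import random


def corr(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_p(r, n, n_cond):
    """Two-sided p-value for a zero (partial) correlation via Fisher's z,
    with n_cond conditioning variables."""
    z = math.atanh(r) * math.sqrt(n - n_cond - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def three_way_decision(p_h0, p_h1, alpha=0.05):
    """Naive rule: commit to one model only if the other, and only the other, is rejected."""
    reject_h0, reject_h1 = p_h0 < alpha, p_h1 < alpha
    if reject_h0 and not reject_h1:
        return "H1 and not H0"
    if reject_h1 and not reject_h0:
        return "H0 and not H1"
    return "H0 or H1"  # neither rejected, or both rejected: stay undecided


# Simulated chain X1 -> X3 -> X2, so X1 and X2 are independent given X3.
rng = random.Random(0)
n = 500
x1 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [a + rng.gauss(0, 1) for a in x1]
x2 = [c + rng.gauss(0, 1) for c in x3]

r12, r13, r23 = corr(x1, x2), corr(x1, x3), corr(x2, x3)
partial_r12_3 = (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))

p_h0 = fisher_p(r12, n, 0)            # marginal independence
p_h1 = fisher_p(partial_r12_3, n, 1)  # conditional independence given X3
decision = three_way_decision(p_h0, p_h1)
print(decision)
```

The design choice worth noticing is the third branch: having the option to say {H0 or H1} is what lets a procedure avoid being forced into an error whenever the data cannot separate the two models.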

We apply the method to shed light on an issue relating to Blau and Duncan’s (1967) analysis of the American Occupational Structure.

—–

Though this testing problem is motivated by (graphical) causal discovery, I think it may be of more general interest.

Solving this problem can be seen as a warm-up for choosing amongst directed graphical models; in other words, developing an analog to the graphical lasso for DAG models without a pre-specified ordering.

Our results can also be viewed as a critique of choosing models based on BIC.
Since both models have the same number of parameters, choosing the model with the higher BIC corresponds to choosing the model with the higher maximized likelihood. In some of the settings we consider, BIC performs not much better than flipping a coin.
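The step from BIC to maximized likelihood can be written out in one line. As an assumption about notation, take the convention in which a larger BIC is better, with $\hat{L}_m$ the maximized likelihood of model $m$, $k_m$ its number of free parameters, and $n$ the sample size:

$$
\mathrm{BIC}_m = \log \hat{L}_m - \frac{k_m}{2}\log n,
\qquad
\mathrm{BIC}_1 - \mathrm{BIC}_0 = \bigl(\log \hat{L}_1 - \log \hat{L}_0\bigr) - \frac{k_1 - k_0}{2}\log n .
$$

When $k_1 = k_0$ the penalty term cancels and the comparison reduces to $\log \hat{L}_1 - \log \hat{L}_0$. Under the other common convention, $k_m \log n - 2\log \hat{L}_m$ with smaller being better, the same cancellation happens and the preferred model is again the one with the higher maximized likelihood.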

• Andrew says:

Thomas:

Thanks for the background and the links.

And, yeah, BIC is the worst. It’s completely incoherent, and it’s never been clear to me what useful question it could be answering. See this paper from 1995 from Sociological Methodology.