Fabio Martinenghi writes:

I am a PhD candidate in Economics and I would love to have guidance from you on this issue of scientific communication. I did an empirical study on the effect of a policy. I had a hypothesis, which turned out to be wrong, in the sense that the expected signs of the effects were the opposite of what I thought (and robust to several different specifications and estimators). I looked at the impact of the policy from several angles (looking at different dependent variables), which implies that there is not an infinite number of hypotheses consistent with the results. Of course, once I stood contradicted by the data, I noticed another aspect of the issue, which in turn led me to a new, convincing explanation that, once tested, proved consistent with the results.

Because I care deeply about good science and academic integrity, I wonder how I can write my paper without breaking the conventions of academic writing, while also avoiding the pretense that I held my final hypothesis from the very beginning.

I will learn Bayesian methods as soon as I can (now I need to graduate).

My reply:

Can I post your question and my response on the blog? It should appear in October. You can be anonymous.

Martinenghi’s response:

I will have handed in my dissertation by then. I understand, publication time is what it is these days! Unless you mean you are backdating it to last year’s October. I am happy not to be anonymous and, in a sense, to make public that I care about these issues.

In all seriousness, I was just hoping to solve this issue. Covering up the process through which one arrives at one’s final hypothesis is like shooting a film about a scientific discovery in which you make up the story behind it. It really bothers me, and most (Econ) academics will just suggest that I do exactly that. I am indifferent as to the form the answer takes, if any.

My perhaps not-so-satisfying reply:

My quick answer is in two parts. First, you do not need to go through, in your paper, all the steps of everything you tried that did not work. It’s just not possible, and it’s not so interesting to the readers. Second, you should present all relevant analyses and hypotheses. In this case, it seems that you have just one analysis or set of analyses, but multiple hypotheses. I recommend that in your paper you present both hypotheses, then state that the data are more consistent with hypothesis 2, but that if the experiment were replicated under other conditions, perhaps the data would be more consistent with hypothesis 1.

**P.S.** The original email came in May 2019 and I guess I must have postponed it a couple times, given that it’s only appearing now.

Possibly a naive question, but is it really not possible to say “I expected X and found Y, which is really exciting and makes us realise something important is up”. Like the Michelson-Morley experiment or something. Is there really an obligation to pretend you know everything in advance?

Peter:

Researchers do often write, and publish, things like, “I expected X and found Y, which is really exciting and makes us realise something important is up.” The problem is, I’m suspicious of many of these claims. I’m not suspicious of the sincerity of these claims—I expect that this is really how the researchers perceive what happened—but I think that often what is happening is that the researchers are chasing noise. Yes, their findings are a surprise to them, but they’re making a mistake by generalizing from some patterns in a particular dataset and thinking that gives them more general knowledge of the world.

One of my favorite economics papers is Friedman and Ostroy, “Competitivity in Auction Markets: An Experimental and Theoretical Investigation,” Economic Journal, 1995, which starts (in the second paragraph, after describing the result that two-sided markets seem more competitive than we think they ought to be): “This paper began as a sharp disagreement between the two authors as to the proper explanation for the puzzle. We investigate three approaches to reconciling theory and experiment.

(1) The traditionalist approach, favored by one author….[description of approach excised]

(2) The institutionalist approach initially favored by the other author… [description of approach excised]

(3) A third approach, which we will call the as-if complete information Nash equilibrium approach (or complete information for short) occurred to us only after looking at the results of some experiments.”

The paper then goes on to provide a number of interesting tests of all three theories. But it is the transparent honesty at the start of the paper that I find so appealing. And the tests they give to the third theory suggest they aren’t just chasing noise, but that would take me too far afield.

I don’t think there’s anything wrong with “chasing” noise–it’s only a problem when you embrace noise. I think that’s what you really mean, based on your comments elsewhere–that it’s foolish to claim that failing to reject represents evidence for a post hoc hypothesis. But as long as it’s done in the context of long-term model building/theory building, that’s just science. I would say the best answer to Peter’s question is that journals like papers that make positive, declarative statements. They want discoveries, not refinements. Which is why they prioritize p-values over parameter estimates in the first place.

Of course, the exception to the rule is when the post hoc hypothesis is logically stronger than the original hypothesis, and would have been a better a priori idea to test, but it didn’t even occur to the researcher until after seeing the data (like #3 in Jonathan’s comment above).

Michael:

Yes, it’s fine to chase noise, but then you should chase noise in all directions, i.e., do a multiverse analysis. The mistake is when a researcher picks out one particular piece of noise and grabs on to it, while ignoring or downplaying all the other correlations in the data.

It’s like the old joke about the dog chasing the car–it’s fine to chase the noise, you just don’t want to catch it. :)

It’s a new one for me! Thanks for the laugh.
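The multiverse idea mentioned above, running the same analysis under every reasonable combination of specification choices and reporting the whole distribution of results, can be sketched in a few lines. Everything here is invented for illustration: the synthetic data, the outlier cutoffs, and the outcome transforms are all hypothetical analytic choices, not anyone’s actual study.

```python
import itertools
import random
import statistics

random.seed(0)

# Hypothetical dataset: outcome y loosely related to predictor x.
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.2 * xi + random.gauss(0, 1) for xi in x]

def slope(xs, ys):
    """OLS slope of y on x (covariance over variance)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

# Every combination of (made-up) analytic choices defines one "universe".
outlier_cutoffs = [None, 2.0, 3.0]       # drop |x| above cutoff, or keep all
transforms = [lambda v: v, abs]          # raw outcome vs. an arbitrary transform

estimates = []
for cutoff, tf in itertools.product(outlier_cutoffs, transforms):
    pairs = [(a, tf(b)) for a, b in zip(x, y)
             if cutoff is None or abs(a) <= cutoff]
    xs, ys = zip(*pairs)
    estimates.append(slope(xs, ys))

# Report the whole distribution of estimates, not one cherry-picked result.
print(sorted(round(e, 3) for e in estimates))
```

The point of the sketch is the final line: the honest summary is the full set of estimates across universes, which makes it obvious when a headline result depends on one particular specification.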

First, null hypothesis testing is dumb. It is best to report a parameter estimate, then discuss motivations, interpretations, etc. in the intro/conclusions sections. If this is unpublishable, put a version of it on a pre-print site anyway. You will have fulfilled your obligation to the scientific method.

Second, you have two choices. You can write the paper up as a failure to reject, and (correctly) say in your intro and discussion sections that the implicit support for the new hypothesis is a valuable finding that should be replicated with new data. This is textbook confirmatory vs exploratory analysis. It is almost certainly publishable, but you may pay a professional price in the tier of journal that will accept it. OR: You can write it up in the traditional way, and direct readers to the preprint (or call it a technical report) for a detailed description of the analytical process. Or call it supplemental materials. You will have sown some confusion in the literature as to the correct interpretation of your findings, but nonetheless, you’ve done your scientific duty.

Third: a null hypothesis isn’t “a” hypothesis, it’s all the hypotheses that are mutually exclusive with your anticipated experimental results. For any line of research you are committed to (most people aren’t committed to their dissertation), try to specify these a priori–and more importantly, specify the differences in empirical consequences that would distinguish them. Structure this list as a tree diagram, with branches splitting off where empirical predictions are mutually exclusive. This can be a valuable design tool: sometimes you can design your study to exclude a portion of these even when you fail to reject (FTR) your experimental hypothesis.

You now have an a priori research agenda. FTR and you move to the next plausible hypothesis that’s consistent with your results (understanding that this is probabilistic so new results may have you return to the “root” of your tree). Maybe the next best hypothesis doesn’t occur to you until you can look at data, but it will fit into your plan schematically–it will naturally form a new “leaf” of the branch representing hypotheses consistent with your actual results. I would argue that this isn’t just a helpful tool for planning research: your argument for secondary hypotheses is stronger, and maybe more publishable, if you can present it in this context.
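The tree-of-hypotheses design tool described above can be sketched as a small data structure. All of the hypothesis names and predicted signs below are invented for illustration; the only point is that an observed result prunes whole branches and leaves a short list of surviving hypotheses to distinguish next.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    """One node in an a-priori hypothesis tree.

    Children branch where their empirical predictions are mutually
    exclusive, so a study outcome can exclude whole branches at once.
    """
    name: str
    prediction: str                      # the distinguishing empirical consequence
    children: list = field(default_factory=list)

    def prune(self, observed_sign):
        """Keep only sub-hypotheses whose prediction matches the data."""
        return [c for c in self.children if c.prediction == observed_sign]

# Hypothetical agenda for a policy study; names and predictions are made up.
root = Hypothesis("policy has an effect", "any",
                  [Hypothesis("incentive channel", "positive"),
                   Hypothesis("signalling channel", "negative"),
                   Hypothesis("crowding-out channel", "negative")])

# Suppose the estimated sign turned out negative: the incentive branch is
# excluded, and the survivors are the next hypotheses to separate by design.
survivors = root.prune("negative")
print([h.name for h in survivors])  # signalling and crowding-out remain
```

A post hoc hypothesis then slots in as a new leaf on the surviving branch, which is exactly the sense in which it still fits the a priori plan schematically.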

‘A null hypothesis isn’t “a” hypothesis, it’s all the hypotheses that are mutually exclusive with your anticipated experimental results.’

That’s not true. If it were, you would calculate p-values very differently. Indeed, the problem with the null is that it is far too sharp…. so sharp that it is almost impossible that it is true.

No model is ever true, and the problem addressed by a test is whether the data are compatible with the H0, not whether it’s true.

Fair enough, but (a) the null is still not everything incompatible with your expectation; (b) some nulls are true, it’s just that there are few interesting ones that are true. The null that radio waves from the Crab Nebula which will strike earth 10,000 years from now have no influence on my decision as to which vacuum cleaner I will buy is true. (More interestingly, there may be nulls in quantum physics, for example, that are literally true; to some (like me), carefully constructed nulls reflecting an absence of ESP may well be true as well.) And (c) while reasonable nulls are rarely true, they hold, for better or worse, a provisionally superior ontological status.

Re true nulls: What exactly do you mean by this? A null hypothesis is a frequentist probability model, and there is no way to verify it’s true without infinite repetition, which doesn’t exist. Further, everything in the world depends in some way on everything else, so no i.i.d. model will ever be true. (That’s not even all that can be said, but I’ll leave it at that.)

What I mean is that while I agree that there is no way to *verify* it is true, it might nonetheless be true. And, as my example of the future Crab Nebula radio waves demonstrates, not everything *necessarily* affects everything else; e.g., the future does not affect the past. Similarly, if there is no ESP, then nulls which require the absence of ESP are true, whether or not we can prove it.

Here’s another example: I hypothesize that one-tenth of the digits following ‘3’ in the decimal expansion of pi are ‘4’. As you point out, there is no *statistical* test to verify this null. There might be, eventually, a *mathematical* proof of this null, however, which could prove the frequentist probability model is true without verifying it statistically.

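As an aside, while no finite check can prove this null, the digit-frequency claim can at least be examined empirically on a finite prefix of pi. Here is a self-contained sketch using Machin’s formula with fixed-point integer arithmetic (no external libraries); the choice of 10,000 digits is arbitrary.

```python
def pi_digits(n):
    """First n decimal digits of pi ("314159...") via Machin's formula,
    computed in fixed-point integer arithmetic with a few guard digits."""
    prec = n + 10
    def arctan_inv(x):
        # arctan(1/x) = sum_k (-1)^k / ((2k+1) * x^(2k+1)), in fixed point
        total, term, k = 0, 10 ** prec // x, 0
        while term:
            total += term // (2 * k + 1) if k % 2 == 0 else -(term // (2 * k + 1))
            term //= x * x
            k += 1
        return total
    # pi = 16*arctan(1/5) - 4*arctan(1/239)
    return str(4 * (4 * arctan_inv(5) - arctan_inv(239)))[:n]

digits = pi_digits(10_000)

# How often is a '3' followed by a '4' in this (finite) prefix?
follows = [digits[i + 1] for i in range(len(digits) - 1) if digits[i] == "3"]
ratio = follows.count("4") / len(follows)
print(round(ratio, 3))  # close to 0.1, but no finite prefix settles the null
```

The comment in the last line is the whole philosophical point: the empirical ratio hovers near one-tenth, yet that is evidence about a prefix, not a verification of the hypothesis about the infinite expansion.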

“If there is no ESP, then nulls which require the absence of ESP are true” – no, because a null, as a fully specified probability model, will always require more than just ESP not existing, for example independence of observations. (Of course you can specify models that have requirements other than independence, but “ESP doesn’t exist” alone is not a model, and when you make it a model, you will have to add requirements that will not be literally true in reality, as reality is not a data generating process ruled by formal probability models.)

“I hypothesize that one-tenth of the digits following ‘3’ in the decimal expansion of pi are ‘4’”: this is a deterministic process; no probability model will be true except the one stating that, with probability one, pi is equal to pi.

I don’t know much about future Crab Nebula radio waves.

This is a fun discussion, albeit off the original topic, but let’s just say we disagree. On the ESP example, I agree there would be auxiliary things like exchangeability necessary to make a full statistical model, but my point is simply that if ESP does not in fact exist, then even if it can’t be proven not to exist in a statistical model, the fact that it doesn’t exist can be true. Truth and statistical evidence are not coterminous.

The decimal expansion of pi may be deterministic, but who’s to say that *everything* isn’t deterministic? We just don’t know the processes that determine them. Just because it’s deterministic doesn’t mean it doesn’t lend itself to a statistical demonstration (if not a statistical proof, since the expansion is deterministic but infinite). Who wrote the Federalist Papers is deterministic, but since we don’t know who the authors were, we carry out statistical tests with error bounds. We can still make a probability model, and that model can have a null (#8 by Hamilton, #23 by Madison, etc., etc.), and that null might be true. The process was deterministic and yet we still have a statistical null because we don’t know the process.

Finally, you don’t have to know much about the Crab Nebula to make the general point that, in our current understanding of causality, the unknowable future cannot affect today’s events. And even if there’s some bizarre quantum spooky action at a distance that means it could actually have an effect, the effect on my vacuum cleaner purchase will be nil to as many decimal places as you care to write down.

“My point is simply that if ESP does not in fact exist, then even if it can’t be proven not to exist in a statistical model, the fact that it doesn’t exist can be true.” Fair enough, but that’s not a statistical null hypothesis, which was what the discussion was about. A statistical null hypothesis is a formally specified probability model.

“The decimal expansion of pi may be deterministic, but who’s to say that *everything* isn’t deterministic?” Fair enough again; it was actually my point that probability models are never true in reality.

“We can still make a probability model, and that model can have a null (#8 by Hamilton, #23 by Madison, etc., etc.), and that null might be true. The process was deterministic and yet we still have a statistical null because we don’t know the process.” I think this is a fundamental source of misunderstanding. There are different concepts of probability around, roughly divided into “aleatory” (referring to data generating processes) and “epistemic” (referring to the uncertainty of a person, or of humankind as a whole). You seem to be mixing them up.

Normally statistical hypothesis testing is done in a frequentist (aleatory) setup. This means that the models model the data generating process, *not* subjective uncertainty. In this case, whether or not you know the exact generating process has no bearing on whether the model is true. However, if you take probability models as modelling subjective (or objective, based on a set of information) uncertainty, the model models your uncertainty, and you *know* it’s true if it matches your state of uncertainty. But then testing it is pointless, because the underlying process that sends you the data is not what the model is about.

Sorry, I won’t play the away game on the crab nebula.

Sad as I am to leave the Crab Nebula behind, I think you’ve cleared up the difference between us. “But then testing it is pointless, because the underlying process that sends you the data is not what the model is about.” But the model is only defined up to a family of probability processes. You still have to estimate the parameters. And you might well be interested in whether the data, even in a subjective probability model, are or are not consistent with the notion that a particular parameter is zero, which we might term the null hypothesis. You might similarly be interested in the question “who wrote Federalist #20” and write a multinomial probability model that encodes your epistemic uncertainty. Underlying that multinomial model is a word frequency model that maps to texts of known authorship. In principle, this model is aleatory, if only you could find infinite samples of each author’s work outside of the Papers. You can’t, but few models are really infinitely extendible.
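A toy version of such a word-frequency attribution model might look like this. The function-word rates and the disputed-paper counts below are invented for illustration, not the estimates from the real Mosteller and Wallace study; the scoring rule is a simple Poisson log-likelihood, one common way to formalize the idea.

```python
import math

# Hypothetical per-1000-word rates of function words for each candidate
# author, as if estimated from texts of known authorship (numbers made up).
rates = {
    "Hamilton": {"upon": 3.0, "whilst": 0.1, "enough": 0.6},
    "Madison":  {"upon": 0.2, "whilst": 0.5, "enough": 0.2},
}

# Made-up word counts for a disputed paper of about 1000 words.
disputed = {"upon": 0, "whilst": 2, "enough": 1}

def log_likelihood(counts, rate):
    """Poisson log-likelihood of the observed counts under an author's rates
    (dropping the terms that do not depend on the author)."""
    return sum(c * math.log(rate[w]) - rate[w] for w, c in counts.items())

scores = {a: log_likelihood(disputed, r) for a, r in rates.items()}
best = max(scores, key=scores.get)
print(best)
```

With equal priors over the candidates, the score difference between authors is exactly the epistemic posterior log-odds, even though the "process" that produced the text was entirely deterministic.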

“And you might well be interested in whether the data, even in a subjective probability model, are or are not consistent with the notion that a particular parameter is zero, which we might term the null hypothesis.”

Sure, I agree, actually with that whole posting. But being consistent with a model is essentially different from the model being true. My “mission” here is to object to the apparent urge of many people to assign “truth” to models. Granted, these models are surely useful in helping us to think about reality and make decisions, but they are still thought constructs and cannot be identified with how reality works. I just think that this obsession with truth tempts us to over-interpret and to claim more than we can actually achieve.

It isn’t about whether a model is true at all. You want to compare different models (explanations) against all the data and see which explains it best, as is done by Bayes’ rule.

Then you can hopefully deduce other useful predictions from the model. This process does assume your premise is true, but that doesn’t mean you must believe it is true.
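Comparing competing explanations by Bayes’ rule, as suggested above, can be sketched with two fully specified models and equal prior weight on each. The data and both candidate models here are hypothetical, chosen only to make the arithmetic concrete.

```python
import math

# Hypothetical data: k "successes" out of n trials.
k, n = 30, 100

def binom_lik(p):
    """Binomial likelihood of the observed data under success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Two competing explanations, each a fully specified model (both made up):
# H1 fixes p = 0.5, H2 fixes p = 0.3.
models = {"H1: p=0.5": binom_lik(0.5), "H2: p=0.3": binom_lik(0.3)}

# Bayes' rule with equal prior weight on each model: posterior probability
# of a model is its likelihood divided by the sum of likelihoods.
total = sum(models.values())
posterior = {m: lik / total for m, lik in models.items()}
for m, prob in sorted(posterior.items()):
    print(m, round(prob, 4))
```

Note that neither model has to be believed true for the comparison to be useful; the posterior just says which premise explains the data better, which is the spirit of the comment above.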

Also, I don’t get all the ESP hate. It made more sense when we weren’t surrounded by wifi, bluetooth, etc., transmitting information invisibly over a distance without requiring much energy. I’m interested in how or why we see so little evidence of “organic” ESP; it must be a pretty serious vulnerability.

You did this: https://en.wikipedia.org/wiki/Abductive_reasoning

Then to test your theory you need to work out the consequences of it being correct and deduce predictions from it. Then compare those predictions (along with ones deduced from competing theories) to new data.

I agree with Andrew. In this case, I have an uneasy suspicion that the predictors in the models tested were collinear (as in many economic models) and that the researcher’s hypotheses were about the sign of a particular coefficient. Such studies are invalid to begin with. It is usually impossible to sensibly hypothesize about the direction (or value) of a coefficient in a multiple regression model where the predictors are collinear. P-values and narrow confidence intervals don’t protect us from this.
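The sign-instability problem under collinearity is easy to demonstrate by simulation. In this sketch (all numbers invented), the true effect running through x1 is positive, yet because x2 is nearly a copy of x1, the estimated coefficient on x1 flips sign across resamples.

```python
import random

def two_var_ols(x1, x2, y):
    """OLS coefficients for y ~ b1*x1 + b2*x2 on centered data,
    via the 2x2 normal equations solved with Cramer's rule."""
    def center(v):
        m = sum(v) / len(v)
        return [a - m for a in v]
    x1, x2, y = center(x1), center(x2), center(y)
    s11 = sum(a * a for a in x1)
    s22 = sum(a * a for a in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * b for a, b in zip(x1, y))
    s2y = sum(a * b for a, b in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return ((s1y * s22 - s12 * s2y) / det,
            (s11 * s2y - s12 * s1y) / det)

random.seed(1)
signs = []
for _ in range(50):
    # x2 is x1 plus tiny noise: the predictors are nearly collinear.
    x1 = [random.gauss(0, 1) for _ in range(100)]
    x2 = [a + random.gauss(0, 0.01) for a in x1]
    y = [a + random.gauss(0, 1) for a in x1]   # true effect runs through x1
    b1, b2 = two_var_ols(x1, x2, y)
    signs.append(b1 > 0)

# Across resamples the sign of b1 is roughly a coin flip, even though the
# true coefficient on x1 is +1: the individual signs are not identified
# in any practical sense.
print(sum(signs), "of 50 samples give a positive coefficient on x1")
```

The sum b1 + b2 remains well estimated throughout; it is only the split between the two collinear predictors, the thing a sign hypothesis is about, that the data cannot pin down.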

You’re right–poor choice of words. There is only one statistical null hypothesis. There are (potentially) many substantive null hypotheses–models for what happened in the study instead of what you expected to happen. We select a particular parameter for the statistical test because observing a certain value (or set of values) is logically incompatible with the substantive experimental hypothesis. If our logic is correct, and the FTR is not a Type II error, then we are left with the possibilities that are consistent with the observed value. NHST says we can’t accept any of these hypotheses, but we can always devise a new study that will produce data that is not equally compatible with all of the remaining possibilities–otherwise, they must be empirically indistinguishable.

This is why I call it a research “agenda”: the objective is to find out how events are related, not whether they are related in a particular way. It’s fundamentally bottom-up model-building, because you’re gradually narrowing down the values of parameters until they are consistent with only a narrow set of models. The approach is still compatible with journals’ bias toward NHST because any one study produces a p-value that may support the statistical experimental hypothesis, and if p > .05, you have something to say about the implications without it being post hoc. (All this assumes you are doing basic research and not testing a one-off intervention that’s too specific to generalize from its failure.)

Oops, meant this to be a reply to Jonathan!

Because the question was brought up by someone studying economics, it seems safe to assume that the paper falls under an econometric rubric. Under that consideration…

“Econometrics can be defined as the study in which the tools of economic theory, statistical inference and mathematics are systematically applied, using observed data, to the analysis of economic laws. It is therefore concerned with the ‘empirical determination of economic laws’… If the observed data are found to be incompatible with the predictions of the theory, it is rejected” (Brown, 1991).

Hendry has said: “Theory consistent – our model should make sense.” I just don’t know that forming a new theory after the fact, and not being totally honest about the timeline, is good practice in this discipline.

Does that reframe the discussion somewhat?