## “I feel like the really solid information therein comes from non or negative correlations”

Steve Roth writes:

I’d love to hear your thoughts on this approach (heavily inspired by Arindrajit Dube’s work, linked therein):

This relates to our discussion from 2014:

My biggest takeaway from this latest: I feel like the really solid information therein comes from non or negative correlations:

• It comes before
• But it doesn’t correlate with ensuing (or it correlates negatively)

It’s pretty darned certain it isn’t caused by.

If smoking didn’t correlate with ensuing lung cancer (or correlated negatively), we’d say with pretty strong certainty that smoking doesn’t cause cancer, right?

By contrast, positive correlation only tells us that something (out of an infinity of explanations) might be causing the apparent effect of A on B. Non or negative correlation strongly disproves a hypothesis.

I’m less confident saying: if we don’t look at multiple positive and negative time lags for time series correlations, we don’t really learn anything from them?

More generally, this is basic Popper/science/falsification. The depressing takeaway: all we can really do with correlation analysis is disprove an infinite set of hypotheses, one at a time? Hoping that eventually we’ll gain confidence in the non-disproved causal hypotheses? Slow work!

It also suggests that file-drawer bias is far more pernicious than is generally allowed. The institutional incentives actually suppress the most useful, convincing findings? Disproofs?

(This all toward my somewhat obsessive economic interests: does wealth concentration/inequality cause slower economic growth one year, five years, twenty years later? The data’s still sparse…)

Roth summarizes:

“Dispositive” findings are literally non-positive. They dispose of hypotheses.

1. The general point reminds me of my dictum that statistical hypothesis testing works the opposite way that people think it does. The usual thinking is that if a hyp test rejects, you’ve learned something, but if the test does not reject, you can’t say anything. I’d say it’s the opposite: if the test rejects, you haven’t learned anything—after all, we know ahead of time that just about all null hypotheses of interest are false—but if the test doesn’t reject, you’ve learned the useful fact that you don’t have enough data in your analysis to distinguish from pure noise.
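
As a quick simulation sketch of this point (the effect size, group sizes, and the simple z-style test below are all invented for illustration): a real but small effect is indistinguishable from noise at small n and obvious at large n. Same hypothesis, more data.

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_rate(n, true_effect, n_sims=2000, alpha_z=1.96):
    """Fraction of simulated two-group comparisons that reject 'no difference'."""
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_effect, 1.0, n)
        # Welch-style z statistic; close enough to a t test at these sizes
        z = (b.mean() - a.mean()) / np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
        if abs(z) > alpha_z:
            rejections += 1
    return rejections / n_sims

# A real but tiny effect (0.1 sd) is invisible at n=20 and routine at n=5000
print(rejection_rate(20, 0.1))    # near the nominal alpha: can't beat the noise
print(rejection_rate(5000, 0.1))  # near 1: same null hypothesis, more data
```

Non-rejection at n=20 here tells you about your sample size, not about the world.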

2. That said, what you write can’t be literally true. Zero or nonzero correlations don’t stay zero or nonzero after you control for other variables. For example, if smoking didn’t correlate with lung cancer in observational data, sure, that would be a surprise, but in any case you’d have to look at other differences between the exposed and unexposed groups.

3. As a side remark, just reacting to something at the end of your email, I continue to think that file drawer is overrated, given the huge number of researcher degrees of freedom, even in many preregistered studies (for example here). Researchers have no need to bury non-findings in the file drawer; instead they can extract findings of interest from just about any dataset.

1. Benoit Essiambre says:

“If the test rejects, you haven’t learned anything — after all, we know ahead of time that just about all null hypotheses of interest are false — but if the test doesn’t reject, you’ve learned the useful fact that you don’t have enough data in your analysis to distinguish from pure noise.”

That is the most correct interpretation of null hypothesis testing I have ever seen. Add it to all statistics textbooks.

• Anoneuoid says:

It is close, but “pure noise”? If I want to model a sequence of binary outcomes as “pure noise”, which of these should I use:

Another example would be including the “error” in your null model. You can use a normal distribution or a t-distribution, which is “pure noise”?

Also:

If smoking didn’t correlate with ensuing lung cancer (or correlated negatively), we’d say with pretty strong certainty that smoking doesn’t cause cancer, right?

No… the direction of the correlation depends on what else you put in your model. Like this study where they looked at 600 million different plausible arbitrary statistical models and saw the correlation range from positive to negative: https://statmodeling.stat.columbia.edu/2019/08/01/the-garden-of-forking-paths/

I would say just looking at the simple x vs y correlation could be a useful heuristic though. If there is little or negative correlation then smoking probably isn’t the main factor you should be focusing on for now.

• Peter F Chapman says:

I disagree, particularly so for experimental situations. Suppose you want to know whether a treatment gives a 10%-or-greater effect relative to the control, but you are not interested if the effect is less than 10% but greater than zero. Then, using as much existing information as you can get your hands on, you do some power calculations to determine the amount of replication that you need to stand a good chance of detecting the effect. You then discuss with the study director or project sponsor. If there are insufficient funds to do the experiment with the required level of replication, you abandon the experiment. If the funds are available and you go ahead and the test doesn’t reject, then you can be fairly confident that the effect is not greater than 10%, although it may be greater than zero.
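
For concreteness, here is a minimal version of the power calculation being described, using the standard normal-approximation sample-size formula for a two-sample comparison of means; the control mean, standard deviation, and effect of interest below are all hypothetical numbers.

```python
import math

def n_per_group(delta, sd, alpha_z=1.96, power_z=0.84):
    """Normal-approximation sample size per arm for a two-sample comparison:
    alpha_z for a two-sided 5% test, power_z for 80% power."""
    return math.ceil(2 * ((alpha_z + power_z) * sd / delta) ** 2)

# Hypothetical: control mean 100, sd 15, and we only care about a 10-unit (10%) effect
print(n_per_group(delta=10, sd=15))  # replicates needed per arm for ~80% power
```

If the budget can't cover that many replicates per arm, the logic above says to abandon the experiment rather than run it underpowered.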

“just about all null hypotheses of interest are false”. This is certainly not in line with my experience.

• Anoneuoid says:

Can you give a real life example? The “interesting” effect size for studies like I think you are referring to depends on the risks and costs of the intervention. I’m not coming up with any situations when it would be a static value.

• Richard Kennaway says:

“but if the test doesn’t reject, you’ve learned the useful fact that you don’t have enough data in your analysis to distinguish from pure noise.”

There is a rather large exception to that. In dynamical systems with circular causation (ubiquitous in biology, psychology, and the social sciences), it is quite possible for two variables with a direct causal connection to have a correlation indistinguishable from zero, however large the sample size. In such a case, no amount of data is enough. The reverse also happens: high correlation with only indirect causal links, via causal links that all have near-zero correlation.
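
Kennaway's scenario can be simulated directly. In the sketch below (the gain, noise scale, and run length are arbitrary choices), a feedback controller's output o nearly cancels a drifting disturbance d acting on a controlled quantity q, so d directly causes q yet their correlation stays near zero no matter how long the run, while d and o, linked only through the loop, correlate at nearly -1.

```python
import numpy as np

rng = np.random.default_rng(1)

T, k = 20000, 0.5
d = np.cumsum(rng.normal(0, 0.1, T))  # slowly drifting disturbance
o = np.zeros(T)                       # controller output
q = np.zeros(T)                       # controlled quantity, target value 0
for t in range(1, T):
    q[t] = o[t - 1] + d[t]       # the disturbance acts directly on q
    o[t] = o[t - 1] - k * q[t]   # feedback adjusts o to cancel the error

print(np.corrcoef(d, q)[0, 1])  # near zero despite the direct causal link
print(np.corrcoef(d, o)[0, 1])  # near -1 despite only an indirect link
```

More data doesn't help here: the near-zero correlation is a structural property of the loop, not a small-sample artifact.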

2. gec says:

“Non or negative correlation strongly disproves a hypothesis.”

Just seconding Andrew’s point that this ain’t necessarily so: The classic example of Simpson’s paradox (https://en.wikipedia.org/wiki/Simpson%27s_paradox; one of my favorites when teaching correlation) illustrates how strong causation within groups can look like zero correlation when those groups are combined.
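
A toy version of that classic, with two invented groups whose within-group relationships are strongly positive but whose group means are offset:

```python
import numpy as np

rng = np.random.default_rng(2)

# Within each group, y tracks x closely; group 2 is shifted right and down.
x1 = rng.normal(0, 1, 1000); y1 = x1 + rng.normal(0, 0.3, 1000)
x2 = rng.normal(4, 1, 1000); y2 = x2 - 8 + rng.normal(0, 0.3, 1000)

x = np.concatenate([x1, x2]); y = np.concatenate([y1, y2])
print(np.corrcoef(x1, y1)[0, 1])  # strongly positive within group 1
print(np.corrcoef(x2, y2)[0, 1])  # strongly positive within group 2
print(np.corrcoef(x, y)[0, 1])    # pooled: the correlation flips negative
```

Strong within-group causation, and the pooled correlation points the other way entirely.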

And selection biases can induce correlations where there is no direct causal connection, as in the classic example of “attractiveness and intelligence are negatively correlated”. Really, there is no causal connection there; it is just that most people ignore individuals who are neither attractive nor intelligent, meaning there are no samples in the lower-left quadrant, resulting in a scatterplot that looks like a negative correlation.
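
That selection story (sometimes called Berkson's paradox) is easy to reproduce with simulated data in which the two traits are independent by construction:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two traits that are independent by construction
attr = rng.normal(0, 1, 100_000)
iq = rng.normal(0, 1, 100_000)
print(np.corrcoef(attr, iq)[0, 1])  # ~0: no association in the full population

# Condition on being "noticed" (above average on at least one trait):
# a clear negative correlation appears in the selected sample.
noticed = (attr > 0) | (iq > 0)
print(np.corrcoef(attr[noticed], iq[noticed])[0, 1])
```

Dropping the low-low quadrant is all it takes to manufacture the negative correlation.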

Indeed, although we often remember the dictum that correlation doesn’t imply causation (or non-causation), the underlying reason for that caution is that the true causal pathway is often much more complicated/indirect relative to the data available. E.g., it is not the indicator variable “smoking” that “causes” the outcome variable “cancer”; the causal relationship is between what those variables *represent* in the world, which might be very distant from how they are represented in our model.

• Anoneuoid says:

the classic example of “attractiveness and intelligence are negatively correlated”. Really, there is no causal connection there

It would be very surprising to me if there were really no causal connection.

the true causal pathway is often much more complicated/indirect relative to the data available

The true causal pathway is that everything in the past timecone of an event collectively caused it.

• gec says:

> It would be very surprising to me if there were really no causal connection

So if an independent rater gave a picture of you a higher attractiveness rating, you would suddenly become dumber? Which in light of your second comment (past timecone) would surely be some spooky action at a distance!

• Anoneuoid says:

So if an independent rater gave a picture of you a higher attractiveness rating, you would suddenly become dumber?

No, this is some kind of strawman. This strawman thing is really getting tiring.

If people kept telling me how attractive I was my whole life and giving me stuff and entertaining me to try to have sex with me I probably wouldn’t spend as much time on intellectual pursuits and would test lower on IQ tests.

• Martha (Smith) says:

“If people kept telling me how attractive I was my whole life and giving me stuff and entertaining me to try to have sex with me I probably wouldn’t spend as much time on intellectual pursuits and would test lower on IQ tests.”

There may be some very small element of truth there, but it’s vastly overstated. For example, people giving you stuff and entertaining you to try to have sex with you can be a real turn-off — intellectual pursuits can be much more appealing than wasting time with such petty, boring people. (Although if someone doesn’t have any interest in intellectual pursuits to begin with, I suppose flattery and entertainment could be attractive.)

• Anoneuoid says:

The claim was “there is no causal connection there”. I described one of many possible “causal paths”.

• gec says:

> strawman

I see now I shouldn’t have used the attractiveness example because it is causing people to focus on the particulars of the (I thought rather silly) example and not my general point: that neither the presence *nor* the absence of a correlation can be taken as evidence either for or against a causal connection in the absence of a good model of that connection and how it manifests in observables.

But another component of what I was saying (albeit with tongue in cheek) is that people *really are adopting* the strawman in many cases, that is, they really are jumping to the conclusion of causation from correlation between measures that are often only tangentially related to the constructs that are supposed to be causally related.

And moreover, it is just as bad to jump to the conclusion of non-causation from the absence of a correlation without a good model of the potential causal paths between the measured outcomes.

• Anoneuoid says:

Everything is correlated with everything else and every event is caused by everything that happened earlier.

Start with that as a fundamental principle and you will see asking whether x causes y or a is correlated with b is a waste of time. All the methods “testing” such relationships are simply measuring whether people are willing to put enough effort into detecting it according to the customary rules of the field (eg, p < 0.05).

My point is, these aren’t interesting questions. Like in the OP:

This all toward my somewhat obsessive economic interests: does wealth concentration/inequality cause slower economic growth one year, five years, twenty years later? The data’s still sparse…

Obviously the answer to this question is: Under some circumstances it does, but under others it has the opposite effect.

I’m sure the OP author can think up more interesting questions about the topic but their mode of thinking has been clouded by training in NHST logic. NHST is based on the principle that correlations and causal relationships between any two different phenomena are rare. Due to this, it cannot answer interesting questions and drives research down uninteresting, unproductive paths.

• Garnett says:

“Everything is correlated with everything else and every event is caused by everything that happened earlier.”

I suspect that people raised in the NHST paradigm will have a very, very hard time believing that.

• Anoneuoid says:

I suspect that people raised in the NHST paradigm will have a very, very hard time believing that.

It is based on empirical observation (and also just makes sense…). Meehl was talking about social science studies where everything was significant in the 1960s[1]. Today we see it in GWAS and particle accelerator studies where they have such a huge sample size they need to use alpha of 10^-7 to 10^-8 or else get so many “false positives” that no one could take them seriously.[2]

But yes, it is the opposite of the principle that you are forced to accept when you are trained to “use” NHST.

[1] https://www.journals.uchicago.edu/doi/10.1086/288135
[2] see here and surrounding discussion.

• That’s why NHST has been so bad for science.

We know that to a very good approximation everything that happens happens because of quantum electrodynamics. Sure, there’s some radioactivity and stuff, but basically everything that happens is at some level a massive wavefunction going around determining where electrons and photons and so forth should go, which causes chemical reactions and physical motion and so forth. These are the underlying facts about, for example, “the effect of brushing your teeth nightly on the price of tea in china”…

It’s easy to calculate that the motion of a gram mass a lightyear away provides enough perturbation to a mole of ideal gas that you can’t calculate the paths accurately after a few seconds. (A calculation showing this was done over a hundred years ago by Emile Borel. I wish Raghuveer had an actual citation, but I take his mention of it as authoritative, since it’s just another manifestation of “the butterfly effect” or “sensitive dependence on initial conditions”, a well established phenomenon.)

What is true, and remarkable, is that in any given situation there are usually a few “macro” variables that dominate the interaction between anything and anything else. So the right question is something like “which variables explain the bulk of the effect” and “how big is the effect of X on Y”. So for example the fraction of people in the US that brush their teeth nightly is much more related to the price of tea in china than whether any individual actually brushes their teeth.

It’s only a lack of logical training, or rather even an active un-training in proper logic caused by poor statistics teaching, that leads people to think like “statistically significant findings are real, and non-significant findings are zero”

• For Borel’s point, http://www.informationphilosopher.com/solutions/scientists/borel/

That source attributes the interpretation to Leon Brillouin citing Borel’s “Introduction geometrique a quelques theories physiques (p 94)” for the calculation and Brillouin for the explication of the interpretation.

• Martha (Smith) says:

Daniel said,
“So the right question is something like “which variables explain the bulk of the effect” and “how big is the effect of X on Y””

+1

• Garnett says:

Anon and Daniel:

Here is my dilemma. An investigator comes into my office interested in “testing if there is an effect” of, say, hearing loss on depression severity. This is a seemingly straightforward idea that most anyone can at least superficially understand. No special training required.

The investigator measures tons of variables, builds a DAG to eliminate some of those variables, then regresses a depression severity measure on hearing loss + confounders. The statistical properties of the fitted model lead them to conclude that evidence has (or has not) been found supporting a “causal link” between hearing loss and depression.

My experience tells me that this paradigm covers the vast majority of research outside of physics/chemistry/some biology.
Arguments about the quantum world or even Newton’s law of gravitation won’t resonate with that epistemology. I struggle to find another tactic.

• Anoneuoid says:

“how big is the effect of X on Y”.

I’d even say you need to ask “how big is the effect of X on Y under various conditions”. But once you start doing that you stop thinking about “the effect” and instead want to know “Y is some function of X, Z, etc, what is that function?” Then there are two approaches:

1) Approximate whatever function it is with machine learning/etc to make useful predictions.
2) Attempt to rationally derive the functions from a set of assumptions.

#1 can be useful but doesn’t really lead to cumulative understanding, #2 is much better but also more difficult.

• Anoneuoid says:

Here is my dilemma. An investigator comes into my office interested in “testing if there is an effect” of, say, hearing loss on depression severity. This is a seemingly straightforward idea that most anyone can at least superficially understand. No special training required.

The investigator measures tons of variables, builds a DAG to eliminate some of those variables, then regresses a depression severity measure on hearing loss + confounders. The statistical properties of the fitted model lead them to conclude that evidence has (or has not) been found supporting a “causal link” between hearing loss and depression.

You can’t reason someone out of something they didn’t reason themselves into. They are simply operating on argument from consensus/authority. It is pointless to direct any effort into convincing them there is something wrong with their approach until they start doubting those heuristics.

So, change will require political machinations (to affect the consensus/authority) or enough money to become an alternative funding source uncontaminated by NHST-type thinking.

• Garnett:

I’m less pessimistic about what you individually can do than Anoneuoid is.

You can point them to my brand new blog post for a discussion of the basic idea above with the citations, because I’ve referred to this a few times, I figured I should document it somewhere: http://models.street-artists.org/2019/08/19/emile-borel-leon-bruillouin-the-butterfly-effect-and-the-illogic-of-null-hypothesis-significance-testing/

And then what? They’re not studying ideal gases or whatnot, so what can you say to help them understand the issue?

The basic idea is this, in every problem we have “stuff that matters a bunch” and “stuff that cumulatively doesn’t matter much”. Conventionally we put “stuff that matters a bunch” into a function that describes what happens… and “stuff that cumulatively doesn’t matter much” into an “Error term”

y = f(a,b,c,d) + epsilon(every-other-variable-in-the-universe)

And we call it a reasonable model of the process if epsilon is almost always “smallish” in some sense that depends on our purposes. We often describe this “smallness” in terms of normal(0,sigma) telling us that on average f predicts correctly, and has an error that’s a small multiple of a number sigma that we accept as a description of an “ok size”.

So the questions we’re really asking are:

“whether f is the right description” and if so

“how big are a,b,c,d” and

“what is the basic size limit on the order of magnitude of the epsilon we expect, or how big should sigma be?”

So a person doing an experiment on hearing loss and depression should be asking themselves things like “what variables do you think affect depression in addition to hearing loss”. Let’s call hearing loss “a”; then we need to identify b, c, d, etc., the other variables going into our function that predicts depression.

Can we calculate how big “a” is in the absence of such a description of depression? No, because “the effect of a” is more or less ∂y/∂a, which is itself a function of b, c, d, etc. So at best what you are going to measure is the partial derivative of depression with respect to hearing loss *under certain conditions*, and if you don’t even know which variables describe those conditions, then you have no way of knowing when your model would be useful.
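
A toy numerical version of that point (the outcome model below is made up purely to show the mechanics): as soon as there is any interaction between a and another variable, “the effect of a” is itself a function of that variable, not a single number.

```python
def y(a, b):
    # Hypothetical outcome model with an a-by-b interaction term
    return 2 * a + 3 * b + 5 * a * b

def effect_of_a(a, b, h=1e-6):
    # Central finite-difference estimate of the partial derivative of y wrt a
    return (y(a + h, b) - y(a - h, b)) / (2 * h)

# For this model the partial derivative is 2 + 5*b: it depends on the condition b
print(effect_of_a(1.0, 0.0))   # effect of a when b = 0
print(effect_of_a(1.0, 10.0))  # a very different effect when b = 10
```

Reporting one “effect of a” without saying which b it was measured under is exactly the problem being described.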

For example, do we believe that hearing loss has the same effect on depression in people who are cancer patients undergoing chemotherapy and losing their hearing due to the drugs as it would for people who are musicians losing their hearing due to loud concert environments, as it would for people who are machinists losing their hearing due to loud industrial environments, as it would for people who are children losing their hearing due to a viral infection, as it would for computer programmers who started out with bad hearing at birth…?

Now, what is often done is to try to average over all these things by sampling from some population… Sure we’ll get lots of different values of b,c,d etc among the sample, but as long as the sample represents some population of interest, and we can get an average partial derivative across these people, maybe we’re happy with that.

The first thing we should remember here is that when we average across a population, we don’t get any kind of “universal” truth. The other variables b, c, d will change from place to place, time to time, etc. So the average across a population we sample today could be meaningless next week, or next year.

The second thing we should remember here is that the average need not represent anyone at all. And the higher the dimensionality of b, c, d…, the less likely our average is to be close to any one person’s reality.

Suppose we have a room full of people: 35 of them are unemployed, here to listen to the 36th, the CEO of a large company who earns $1,000,000/yr, describe job opportunities. The average income is (35 * 0 + 1 * 1,000,000) / 36 ≈ $28,000. Does $28,000/yr income describe anyone in this room?

These are just *some* of the issues we need to address with our clients to get a realistic view of what we’re doing in science.

The DAG analysis you describe *could* be useful science, but it may also be highly misleading. Thinking through some of these issues may be a useful way to describe the problems to your clients.

On the other hand, it will derail their search for quick sound-bite papers that make strong claims about the simplicity and controllability of people’s lives. Claims that lead to grants and promotion.

• jd says:

Replying to Garnett’s post that starts “Anon and Daniel”.
Along with explaining what has been said by Daniel, you can also point out specific citations such as the Nature commentary, blog posts on this blog, NEJM’s new requirements to post effect sizes and CI’s in some cases, etc., as evidence that there are better ways than NHST and that there is at least some traction for change.

I have found it easier to convince people of this route when referencing articles rather than just reasoning with them.

As for alternatives, tools like ‘brms’ and ‘rstanarm’ are pretty accessible. It’s really not that much harder to build up your Bayesian multilevel model with priors than your traditional fixed effect model, these days.

• Garnett says:

Daniel and Anon:

• jim says:

Too much loose talk about correlation and causation.

How you interpret or recognize the relationship between correlation and causation depends on many things. Some correlations are clearly causal because the relationship is simple and clear: higher gradient, faster stream. No question. Another way of saying that is that the magnitude of one interaction (gradient) is so much larger than all the others (friction, temperature, pressure, tidal forces) that the causation is clear, even though it might not be exclusive (e.g., there is still some tidal pull on the stream, but it’s so small it’s not relevant).

Correlations like smoking and lung cancer are less clear because other factors may exert an influence of approximately equal magnitude to the effects of smoking on the age of mortality. Do we even know what these factors are? Genetic, probably, but likely other environmental factors, possibly the person’s activity level – there are many other possible factors. So, in the case of smoking, while causation is clear, it’s not nearly as direct.

So there definitely are circumstances where correlation absolutely indicates causation; others where the relationship is clear but not 100%, and more situations where the relationship is quite muddy because there are many factors of approximately equal influence. So just like everything else in science, you can’t use simple rules to make the call, you actually have to know what’s going on.

3. Carlos Ungil says:

> if the test rejects, you haven’t learned anything—after all, we know ahead of time that just about all null hypotheses of interest are false—but if the test doesn’t reject, you’ve learned the useful fact that you don’t have enough data in your analysis to distinguish from pure noise.

There may be a bit of truth to that in some cases, but when a researcher gives some drug candidate to twenty rats and most of them die while the twenty rats in the control group remain alive and kicking hopefully he will learn something: that the compound should be ditched. It would be tragic if it moved into clinical trials because, looking at the pile of rat corpses, the researcher thought “too bad we have not learned anything, we knew already that mortality wouldn’t be exactly the same in both groups… I wish no rats were dead, then I would have learned that more testing was required until finally I would be able to not learn anything!”

• Anoneuoid says:

I’m pretty sure you know the difference between parameter estimation and running a statistical test. Also your conclusion is wrong:

when a researcher gives some drug candidate to twenty rats and most of them die while the twenty rats in the control group remain alive and kicking hopefully he will learn something: that the compound should be ditched.

There is no such thing as a toxic compound, it depends on the dose.

• Carlos Ungil says:

Everything depends on the dose and when the dose required to have a therapeutic effect is higher than the toxic dose the commercial opportunity for the drug is severely compromised.

• Anoneuoid says:

So what you do is take the estimate of the rates of various side effects (including death), along with the proposed benefit (at the known therapeutic dose), along with any other monetary, etc costs. Then you do a cost benefit analysis of some kind to decide if the treatment is still worth pursuing.

The correct process does include getting an estimate of the risks, as in your example, but that has nothing to do with “testing”. The “testing” part is pure cargo cult. What you described is an imitation of what scientists would do while failing to understand the purpose of collecting that data.