Niall Bolger sends along this article he wrote with Katherine Zee, Maya Rossignac-Milon, and Ran Hassin, which begins:

All experimenters know that human and animal subjects do not respond uniformly to experimental treatments. Yet theories and findings in experimental psychology either ignore this causal effect heterogeneity or treat it as uninteresting error. This is the case even when data are available to examine effect heterogeneity directly, in within-subjects designs where experimental effects can be examined subject by subject. Using data from four repeated-measures experiments, we show that effect heterogeneity can be modeled readily, that its discovery presents exciting opportunities for theory and methods, and that allowing for it in study designs is good research practice. This evidence suggests that experimenters should work from the assumption that causal effects are heterogeneous. Such a working assumption will be of particular benefit, given the increasing diversity of subject populations in psychology.

We’ve been saying this for a long time—indeed, apparently the above-linked article is based on a workshop from 2009, and some of the ideas in this paper date back to papers by Bolger in 1990 and 1991—so it’s good to see this message continue to appear more and more in psychology, political science, economics, and policy.

As I wrote in 2012:

Once you think about it, it’s hard to imagine any nonzero treatment effects that don’t vary. I’m glad to see this area of research becoming more prominent.

A key difficulty here is that, even though interactions are clearly all over the place, they’re hard to estimate. Remember, you need 16 times the sample size to estimate an interaction as you do to estimate a main effect. So, along with accepting the importance of interactions, we also have to accept inevitable uncertainty in their estimation. We have to move away from the idea that a statistical analysis will give us effective certainty about the things we care about.
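The arithmetic behind the 16: in a balanced 2x2 design, the interaction estimate is a difference of differences and so has twice the standard error of a main effect; if the interaction you care about is also half the size of the main effect, you need (2 × 2)² = 16 times the sample size for comparable power. A minimal simulation sketch of the standard-error part (my own toy setup, not from the post; all cell means are zero):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, reps = 1000, 1.0, 4000  # n per cell; noise sd; simulation replications

main_ests, int_ests = [], []
for _ in range(reps):
    # cell means for treatment/control crossed with groups A and B
    ta, tb, ca, cb = rng.normal(0, sigma, size=(4, n)).mean(axis=1)
    main_ests.append((ta + tb) / 2 - (ca + cb) / 2)  # main effect of treatment
    int_ests.append((ta - ca) - (tb - cb))           # difference of differences

ratio = np.std(int_ests) / np.std(main_ests)
print(ratio)  # close to 2: the interaction's se is twice the main effect's
```

Doubling the standard error means 4 times the sample size for the same precision; halving the effect size to be detected multiplies that by another 4.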

I’m also quite pleased that the paper puts itself in the historical context of people like Estes, who were sounding the very same alarms starting in the ’50s. This is not a new problem, and people have been thinking about it and solving it for a long time. The frustration, which I could glean even from the relatively sterile academic prose of the paper, is that these messages have not spread very far beyond segments of cognitive psychology and psychophysics (and recently some parts of psycholinguistics). So hopefully the paper will advance the borders of theoretically sound psychology to new areas, and I thank the authors for their efforts.

My only real concern is summed up in this paragraph from the intro: “Because our primary concern is with theory formulation and testing, we advocate models that are adequate for this task, that are readily available, and that are easy to use. Some causal processes in experimental work will no doubt require the more sophisticated tools that are now becoming available, but in our view, the bulk of the benefits can be obtained using simpler approaches.”

I certainly appreciate why they focus on accessibility in order to communicate their message more broadly, but I think a focus on “readily available” and “easy to use” models is exactly why some parts of psychology are so theoretically barren (this is the story of ANOVA after all, which they even recount in the paper). As unpleasant as it may be, I think the best way forward is to accept that more “sophisticated tools” (by which I mean more complex models) are what you really need to formulate and advance theories that have any substantial value.

Yes, to understand what is happening you need to start from modeling the individual instances, not what happens on average.

The average person, rat, or cell does not exist, so studying it could be very misleading: https://statmodeling.stat.columbia.edu/2015/10/05/cognitive-skills-rising-and-falling/#comment-245800

An agent based approach can be a very useful construct… in many ways averages are the only stable observable thing, we know that there are always many short term reasons why people might do things other than the reasons we can incorporate in our model… Like for example, people might suddenly burst into tears in the middle of answering a questionnaire because a particular question reminds them of their recently deceased relative… are we going to include all of those sorts of things? Whether someone has heartburn from having eaten too quickly a few minutes before arriving at the Psych lab? Whether the fact that their kids couldn’t find any clean socks this morning made them late for work? Each model will have several billion explanatory variables none of which are really observable.

Instead, we should look for relatively consistent explanations in terms of things that reliably produce some effects, and allow some model error, but require in aggregate that some statistics of the predictions be rather consistent with statistics of the observed.

It’s still a model of the individual instances, but our decision about whether the model is working is based on comparing the aggregated predicted effects across multiple prediction instances to the aggregated observed outcomes across many observed experiments.

This is, after all, the structure of, for example, the lattice Boltzmann model of fluid mechanics, and I don’t think anyone is going to claim that it’s not a good theory.

The important thing is to compare average to average and individual to individual. In the learning curve example they looked at the average curve, but were coming up with models to explain the shape of individual curves (which do not look like the average curve).

This was another issue I found in physics btw. They model an average universe, and compare it to what we observe (one instance of a universe). Thus they have the wrong intuition about what should happen: https://physics.stackexchange.com/questions/505662/why-is-matter-antimatter-asymmetry-surprising-if-asymmetry-can-be-generated-by

I suspect this is a common confusion because it won’t get caught by dimensional analysis. As I say in one of the comments there:

The model there is I think just a random walk. At each time either the total number of (signed) particles increases by 1 or decreases by 1… In the continuous limit you wind up with the stochastic process called a brownian motion.

The long term distribution is a gaussian whose variance increases linearly with time. So the probability to be within epsilon of 0 decreases to zero with time. I think Chris in your example was pointing out that this process *by itself* doesn’t produce enough particles fast enough, but obviously this process is only a simple process designed as basically an asymptotic “inner” description (when particles are dense). So what this really requires is sufficient numbers of black holes in the early universe.

Anyway, the point you’re making about comparing process paths to process paths and averages to averages is correct. Averages are often very different looking from individuals. For example a brownian motion path is nowhere differentiable, but the average is a constant.
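That contrast between individual paths and their average is easy to see numerically. A quick sketch (my own, with arbitrary sizes): average many ±1 random walks and the average sits near zero, while any individual walk wanders away on the order of sqrt(t).

```python
import numpy as np

rng = np.random.default_rng(1)
steps, paths = 10_000, 500

# each row is one signed-particle-count path: cumulative sum of +/-1 steps
walks = rng.choice([-1, 1], size=(paths, steps)).cumsum(axis=1)

avg_final = walks.mean(axis=0)[-1]           # average across paths stays near 0
typical_final = np.abs(walks[:, -1]).mean()  # a typical path ends ~sqrt(2t/pi) away

print(avg_final, typical_final)
```

The average path is also smooth, while each individual path is jagged at every scale, which is the nowhere-differentiable point in the continuum limit.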

This would explain it. Is there a term for this? From my conversation with Chris (who seemed to be somewhat of an expert on random walks), he seemed unaware of such a law.

No, he kept talking about sqrt(t) and I think his point was that because the variance grows like t, then the standard deviation (which is in units of particles) grows like sqrt(t) and so you couldn’t get a universe full of particles from a simple random walk like this… Of course this is equivalent to some kind of assumption on the rate of generation of particles, basically an assumption that the constant is a certain size… Also obviously this is a model only for early dense universes… that the behavior would change later is to be expected.

There might also be some kind of feedback mechanism involving changing properties of the black holes… in any case I don’t want to get into it too deep here, since it’s not really on topic, but you can read up on the mathematical concept of brownian motion starting here: https://en.wikipedia.org/wiki/Brownian_motion

Actually for the pure mathematical concept try:

https://en.wikipedia.org/wiki/Wiener_process

The other link is for the physical process of motion of particles.

If you really want to see what happens, try downloading NetLogo

https://ccl.northwestern.edu/netlogo/

and modeling a process involving say 4 particles: protons and antiprotons, electrons and positrons… Create some “black holes” randomly located in space, at each time tick generate a large number of particle pairs at random locations on the grid… give them a very high kinetic energy and conserve momentum (send them in opposite directions) use simple newtonian gravitation and electrodynamics… it’s not going to be cheap computation but you could get a sense of how things work… might be fun.

Also, an interesting thing about the brownian motion is that once you get to some point in time T where the quantity of particles is N, then the conditional expectation for the future is N… there’s nothing that’s going to “pull the count towards zero”… even though when you start at t=0 and there are 0 particles… at this point in time your expectation for the future is 0.

This property of the current value being the expectation for the future is called being a Martingale, the property of having the future determined only by the current value and not additional info from the entire past history is called being Markovian (or a Markov process).
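The martingale property can be checked numerically (a sketch of mine, with arbitrary parameters): take walks that happen to be high at an intermediate time and confirm that their average at a later time stays put rather than regressing toward zero.

```python
import numpy as np

rng = np.random.default_rng(3)
paths, t1, t2 = 10_000, 500, 1000

walks = rng.choice([-1, 1], size=(paths, t2)).cumsum(axis=1)

# select paths that are high at time t1; their mean at t2 is (in expectation) unchanged
high = walks[:, t1 - 1] > 20
now = walks[high, t1 - 1].mean()
later = walks[high, t2 - 1].mean()
print(now, later)  # roughly equal: nothing pulls the count back toward zero
```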

Not sure if you were responding to my later post but I don’t see how “martingale” or “markovian” describes:

Well, consider that at time t=0 everything you can expect about the future is contained in the information that in the future the sum of the signed particle count is a random variable distributed normal(0,sigma*sqrt(t)), where the second argument is the *standard deviation*, not the variance…

the 0 comes about because at time t=0 there are no particles yet by assumption, and from the Martingale property (that the conditional expected value for the future is the current value). The sigma*sqrt(t) comes about from the linearity of the variance of the sum of random variables.

So at t=0 what we know for large t is that at that future time our knowledge of that future universe is that it’s a random normal(0,sigma*sqrt(t)) random variable.

Fix a small region around 0, like 0 ± 1… since this small region is fixed but the standard deviation grows like sqrt(t), the normal distribution is spreading out… and the density at 0 is decreasing towards 0 because of the normalization constant in the normal curve, which is 1/sqrt(2*pi)/sigma/sqrt(t)… but the probability to be in this region is approximately p(0)*2, where p(0) is the density of the normal(0,sigma*sqrt(t)) distribution at 0… so it’s decreasing to 0.
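That shrinking probability can be computed exactly (a sketch; sigma and the region width are arbitrary choices of mine). For eps small relative to sigma*sqrt(t), P(|N| <= eps) ≈ 2*eps/(sigma*sqrt(2*pi*t)), so multiplying t by 100 cuts the probability by about 10:

```python
from math import erf, sqrt

def prob_near_zero(t, sigma=1.0, eps=1.0):
    """P(|N(0, sigma*sqrt(t))| <= eps), exact via the error function."""
    return erf(eps / (sigma * sqrt(t) * sqrt(2)))

for t in (100, 10_000, 1_000_000):
    print(t, prob_near_zero(t))  # each 100x in t shrinks the probability ~10x
```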

A potentially realistic assumption is that net particle generation rate declines with time, a simple version of this assumption is that say sigma = sigma0*exp(-t/Tc) where Tc is just some characteristic time for exponential decay…

so you can simplify your model as

N(t + dt) = N(t) + normal(0,sigma0*exp(-t/Tc)*sqrt(dt))

for simplicity let dt=1, let sigma0 = say 100, and Tc=1000, run the model for 10000 steps… my guess is you’ll see a lot of early expansion, and then it basically becomes constant… like a “big bang” ;-)

This code implements the model with exponential decay of particle generation rate:

library(ggplot2)

# step sd decaying with characteristic time Tc = 1000, sigma0 = 100
sigma = 100*exp(-(1:10000)/1000)

rnd = rnorm(10000)

# particle count: cumulative sum of normal increments
N = cumsum(rnd*sigma)

qplot(1:10000, N, geom = "line")

For more complex models look at edit 4, where annihilation rates are a function of the number of particles (so more than one particle can be removed each step). Chris suggested this model thinking it would show a tendency to near-equal matter-antimatter numbers, but it didn’t. Then the argument switched to “not enough black holes”, etc. That is why I say it doesn’t seem he understood that “the probability to be within epsilon of 0 decreases to zero with time.”

And via the Hawking radiation mechanism proposed in that link, I don’t see why the generation rate must decrease with time, but of course it could… In fact, in the simulation, “time” is determined by each particle-pair that gets generated (and possibly immediately annihilated).

This makes sense to me.

Local to black holes, who knows what happens, but in the “normal space” far from black holes… you might imagine that the net flux into this space of particles is small at large times… Since we’re modeling the whole universe as a single particle count… if we want to know what happens in “normal regions of space” we should model those regions. This is basically just asymptotic matching… asymptotically at long times we expect total particle counts to be kind of constant in normal regions of space, and asymptotically near the big bang, we expect lots of particles generated and annihilating each other with high variance… just interpolate between the two with an exponential decay… you get the model I gave, and sure enough it has wild fluctuations that always settle down to either matter or antimatter, but never 0 (because it’s just too unlikely).

To desperately pull all this back to the original topic:

Causal processes in Psychology are Heterogenous… meaning we should have models in which individuals undergo various fluctuations in their behavior… an agent based model for example…

But if we want to determine whether that model “makes sense” we *can* do things like compare the average of many selected agent predictions to the average of many observed quantities… If our model for example predicts that rates of violence should increase with time, but we show that rates of violence decrease with time… we’ve made a bad assumption about the rules of the agents.

we should not, however, compare one single agent to, say, the average of 100 people’s surveys… That one single agent shows, say, a lot of violent behavior doesn’t invalidate the model. There are plenty of people incarcerated for violence, for example.

I don’t really see how you derived this from the simulations but ok. I think you are making the same argument I did that if the universe became less dense over time a certain favored species of particle would get “frozen in” as the majority. In another comment I speculated:

However, you do not require that assumption to get the same outcome. Even with constant rate of baryogenesis, the simulations show you still get a dominant species.

Pretty much any way I look at it I do not expect equal amounts of matter/anti-matter, despite them being symmetrically generated. It just doesn’t seem to be any great mystery when you look at individual outcomes instead of the average.

To bring it more on topic, I think this confusion between average and individual instances is going to be looked back on as a very common error in our day. Spanning pretty much all fields of research.

I remember looking at a dynamite plot (mean +/- sd) from a colleague of something like western blot results (some kind of bio assay) and asking why the sd was so much larger in one group vs the other. Was there a high outlier, a low outlier, or what? They had no idea what the underlying data even looked like.

All their “models” (in the loose sense used by biomed researchers) were about what happened at the level of an individual cell, but the observations were averages. They never even look at the individual points… unless of course they want to “drop outliers” to “get significance”. While this is not *always* a problem, it should not be assumed.

Actually, averages of averages. And I’ve even seen an additional layer of averages.

Say you measure a number of cells from the same animal at one timepoint. Take the average. Do the same for multiple timepoints in the same animal, then average all those averages. Do the same thing for multiple animals in the same “group”, and get the average of all those averaged averages.
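With unbalanced counts, those layers of averaging do not even agree with the pooled mean, which is one more reason to look at the underlying data. A toy illustration (all numbers made up):

```python
import numpy as np

# hypothetical cell measurements: 2 cells at timepoint 1, 10 cells at timepoint 2
cells = {"t1": [1.0, 1.0], "t2": [3.0] * 10}

pooled = np.mean([v for vals in cells.values() for v in vals])
avg_of_avgs = np.mean([np.mean(vals) for vals in cells.values()])

print(pooled, avg_of_avgs)  # ~2.67 vs 2.0: the average of averages ignores the counts
```

Each extra layer of averaging silently reweights the data and throws away the within-layer spread.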

“I don’t really see how you derived this from the simulations but ok. “

It looked to me like at least in the simplest simulations, either you generated a particle of the dominant type, in which case the count of that type increased, or you generated a particle of the non-dominant type, in which case it annihilated another particle and the dominant type count decreased.

If you think of + as matter and – as antimatter… basically you either added +1 or -1 to the “total count”.

The sum of a large number of IID random variables is approximately a normal random variable, so basically if you skip a few hundred of your timesteps you can already guess that the result of that larger timestep will be a normal random increment.

If we aggregate a few hundred timesteps of your particle by particle simulation, we can simulate the net effect by just doing normal random numbers… so you can think of my simulation as automatically simulating several hundred times as many simulations as you did per each of my timestep…
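That aggregation claim is easy to check (a sketch with arbitrary block size): sums of a few hundred ±1 steps behave like single normal draws with standard deviation sqrt(k).

```python
import numpy as np

rng = np.random.default_rng(2)
k, reps = 400, 5000

# k-step blocks of the +/-1 walk vs single normal draws with matching sd
block_sums = rng.choice([-1, 1], size=(reps, k)).sum(axis=1)
normal_draws = rng.normal(0, np.sqrt(k), size=reps)

print(block_sums.std(), normal_draws.std())  # both near sqrt(400) = 20
```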

Yes, but this is not a consequence of that:

One species is dominant most of the time, but it will switch back and forth between them if you wait long enough. So that is an additional assumption you are making then.

Yes, the decline with time was an additional assumption I made based on the idea that when things are dense there are a lot more interactions, and as they expand these interactions drop off in frequency, sorry if that was confusing.

In 2012, Andrew wrote:

“Once you think about it, it’s hard to imagine any nonzero treatment effects that don’t vary.”

Why is the qualifier “nonzero” necessary? Isn’t it just as true to say:

“it’s hard to imagine any treatment effects that don’t vary.”

I suppose, but there are some truly zero treatment effects. Like retroactive intercessory prayer: https://www.ncbi.nlm.nih.gov/pubmed/11751349

I don’t think that “sometimes it helps” and “sometimes it doesn’t”

“Remember, you need 16 times the sample size to estimate an interaction than to estimate a main effect”

As far as I can tell, this is not generally true:

https://vasishth-statistics.blogspot.com/2018/04/the-relationship-between-paired-t-tests.html

In this case, an interaction is just a main effect seen differently, so it is nothing special; unless I made a mistake somewhere. I think Jake Westfall might have mentioned this at some point too.

Shravan:

Indeed, as discussed here, it all depends on the relative size of the interactions and the main effects. To say “an interaction is just a main effect seen differently” isn’t generally correct, because the choice of parameterization itself contains information. Main effects are “main effects” for a reason; this is how the models get set up. Again, it depends on the example.
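To make the parameterization point concrete, here is a tiny numerical sketch (the 2x2 cell means are invented): under dummy coding the "A" coefficient is the effect of A at the reference level of B, while under sum coding it is half the A effect averaged over B; different numbers answering different questions.

```python
import numpy as np

# hypothetical cell means, ordered (A1,B1), (A1,B2), (A2,B1), (A2,B2)
y = np.array([1.0, 2.0, 3.0, 7.0])

# dummy (treatment) coding: columns are intercept, A, B, A:B
X_dummy = np.array([[1, 0, 0, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 0],
                    [1, 1, 1, 1]], dtype=float)

# sum coding: factor levels coded -1/+1
a = np.array([-1.0, -1.0, 1.0, 1.0])
b = np.array([-1.0, 1.0, -1.0, 1.0])
X_sum = np.column_stack([np.ones(4), a, b, a * b])

beta_dummy = np.linalg.solve(X_dummy, y)
beta_sum = np.linalg.solve(X_sum, y)
print(beta_dummy)  # A coef = 2.0: the A effect when B is at its reference level
print(beta_sum)    # A coef = 1.75: half the A effect averaged over both levels of B
```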

Andrew, Jeff:

Agreed that what I wrote may not be *generally* true either (I need to think about that). But maybe you want to qualify what you wrote in your blog post, Andrew, because otherwise people will cite your claim as fact (“Gelman said so”).

As for choice of parameterization, I always use the sum contrast parameterization, as it centers the predictor (Gelman and Hill 2007). Treatment contrasts are almost always irrelevant for my research questions anyway. See: https://www.sciencedirect.com/science/article/pii/S0749596X19300695?via%3Dihub

Thanks for the link Shravan. I played around with visualizing an interaction effect using different contrasts (dummy vs. effects vs. “Gelman”) here: https://rdoodles.rbind.io/2019/07/is-the-power-to-test-an-interaction-effect-less-than-that-for-a-main-effect/. I find the dummy coding way to model interactions very intuitive, but maybe that’s because I think about an interaction as what is left over after you add treatment A and treatment B to a control.

This is a fascinating rdoodle Jeff, thanks for that. I need to study it and will get back to you. How can I contact you?

In a sense, the premise of the paper boils down to two widely-known but widely-misapplied (or just ignored) principles: all models are wrong, and the term “error” is a misnomer. The consequence of taking this pair of principles seriously in our designs, analyses and interpretations is important but not particularly novel: we must understand that variability in individual-level outcomes is, in general, the net effect of potentially countless individual-level variables representing traits and states of the participants and the context; most of these variables will not be measured and may not even be measurable with meaningful precision, so our models include randomness as a false-but-useful stand-in for that net effect. Researchers who ignore this simple but vital point will incorrectly infer that all variables not accounted for by their theories are ignorable, and that variability in their outcomes signifies that the construct is intrinsically “noisy” when it may just be intrinsically complex (“heterogeneous”). This is not emphasized enough in our training or in the article review process.

Zee et al. present a method for checking the plausibility that the standard deviation of an effect could be better modeled by including one or more variables that were measured but not initially considered theoretically relevant. This seems a lot like the practice of checking the empirical ICC between clusters before deciding whether to include it in the model (e.g., we may model an effect with a between-classrooms ICC but leave out a between-schools ICC), which is consistent with best practices. It also seems a lot like stepwise regression, which I think has gotten a bad rap due to our pursuit of a parsimonious, “true” model and our fear of overfitting–as well as abuses like failing to report the steps and reliance on p-values to decide what stays in and what gets booted. In other words, their method should be standard practice for descriptive regression analysis and reporting, particularly when the theory is vague and the predictors are many.

My one criticism of their work lies in their call to “distinguish true causal effect heterogeneity from spurious sources operating at the subject level such as sampling error or measurement error.” The term “spurious” here implies that these forms of error actually are random or ignorable, unlike heterogeneity due to individual differences among study participants. This is only true of sampling error when participant inclusion in the sample has been determined by a truly random selection process, and it may never be true of measurement error. Much of what we call measurement error can be explained by differential item functioning (DIF), the unintended interaction of participant characteristics (other than the target construct) with measure characteristics. Even what the authors call “treatment error” (which I take to mean treatment infidelity) generally covaries with participant characteristics like engagement and experience. Our use of a random error term to represent these sources of variability is just as much of a useful fiction, and potentially just as much of a missed opportunity, as conceptualizing individual heterogeneity as truly random. As a practical matter, it’s fine (and ultimately necessary) to draw the line at modeling individual differences but not DIF, or DIF but not infidelity, so long as we recognize the arbitrariness of our choice. But wherever we draw that line, it is still a useful fiction dictated by our design and measures. To coin a phrase, it’s model assumptions all the way down!