. . . it makes a really important point that I hadn’t even noticed.
The article in question is Theory-testing in psychology and physics: A methodological paradox and it’s from 1967.
It’s entirely my fault that I missed the point, as it’s in the very first paragraph of the paper, and Meehl even helpfully puts it in italics:
In the physical sciences, the usual result of an improvement in experimental design, instrumentation, or numerical mass of data, is to increase the difficulty of the “observational hurdle” which the physical theory of interest must successfully surmount; whereas, in psychology and some of the allied behavior sciences, the usual effect of such improvement in experimental precision is to provide an easier hurdle for the theory to surmount.
He continues:
Hence what we would normally think of as improvements in our experimental method tend (when predictions materialize) to yield stronger corroboration of the theory in physics, since to remain unrefuted the theory must have survived a more difficult test; by contrast, such experimental improvement in psychology typically results in a weaker corroboration of the theory, since it has now been required to survive a more lenient test.
This post will proceed as follows. First, I’ll explain Meehl’s point and say why I think it’s important. Second, I’ll discuss how it is that I’ve read this famous paper so many times and never noticed its key message. Third, I’ll consider what we should do in social science, given this new understanding.
1. The message of Meehl’s paper
Here’s the key idea. In physics, the model you’re interested in is the null hypothesis. Get enough data and you can reject it. For a simple example, start with Copernicus’s model of the planets going in circles around the sun. Gather enough data and you can reject that model, replacing it with Kepler’s model of elliptical orbits. Gather more data and you can reject that model–but, hey! you can fix the model by hypothesizing another planet–Uranus. Then Neptune. Gather more data and you can reject the elliptical paths too. Now we have general relativity, and I don’t think it’s been rejected yet. Similarly with models in particle physics, solid state physics, etc. Gather enough data and you’ll reject your model, and that’s where you learn something.
In psychology, it’s the opposite. The model you’re interested in is that two variables are related to each other: do more of X and you get more of Y. The null hypothesis–that X is independent of Y–that’s not very interesting. I mean, sure, it would be interesting if it’s true, but typically it’s not, and you know it’s not. Gather enough data and you can reject the null.
We know that.
But here’s the interesting point made by Meehl in his 1967 paper: the process of hypothesis testing in psychology is the opposite of that in physics. In physics, as you gather more and more data, you put your model more and more to the test, and eventually it breaks down. This is also how I’ve always framed things in Bayesian data analysis: The point of any statistical model is to be there and do its job long enough for it to be replaced. A model is like a car that you drive until it runs out of gas, then you give it some more gas, you fix it when it breaks down, and eventually it’s more effort to fix than to just build or buy a new car. This is the Lakatos philosophy: it’s Lakatos’s version of Popper.
But in psychology, when you gather more data, it’s easier and easier to reject the null, which makes it easier and easier to confirm your favored hypothesis. All you learn by rejecting the null hypothesis is that you now have enough data to estimate a bigger model. That’s not nothing, but it’s not falsification in the Popperian sense, or a motivation for improvement in the Lakatosian sense. A psychology researcher proceeding by successfully rejecting the straw-man null is not working within the falsificationist paradigm.
It’s funny because the math for hypothesis testing in physics is the same as in psychology, but the interpretation is the opposite.
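To make the "same math, opposite interpretation" point concrete, here is a minimal Python sketch. It assumes a two-sided z-test with a known standard deviation; the function name `rejection_probability` and the numbers (a gap of 0.05 standard deviations, the sample sizes) are made up for illustration and do not come from Meehl's paper.

```python
from statistics import NormalDist


def rejection_probability(true_gap, sigma, n, alpha=0.05):
    """Power of a two-sided z-test when the truth differs from the
    tested value by `true_gap` (hypothetical helper, known sigma)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(true_gap) * n ** 0.5 / sigma
    # Probability that the test statistic lands in either rejection tail.
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)


# The same calculation serves both fields; only the reading differs.
for n in (25, 250, 2500, 25000):
    p = rejection_probability(true_gap=0.05, sigma=1.0, n=n)
    print(f"n={n:>6}  P(reject tested value) = {p:.3f}")
```

The rejection probability climbs toward 1 as n grows. If the tested value is the physicist's own prediction, that is an ever harder hurdle for the theory; if it is a zero-effect null that nobody believes, it is an ever easier route to declaring the favored hypothesis supported.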
This is a big deal. Meehl wrote it in 1967, and his paper got lots of attention (according to Google, it’s only been cited 1500 times, but that’s a lot for a paper published so long ago; also, papers on philosophy don’t get cited that much, compared to papers on methods), but I don’t think most people got the point.
2. How did I miss the message?
Indeed, until I reread the paper the other day (in preparation for including it in the readings for Week 13 of my Rationalizing the World course), I didn’t get it either.
Maybe one reason is that I’m trained in physics and I read Jaynes, so I’d already internalized the use-the-model-you-love-and-put-it-to-the-test-and-when-it-finally-fails-you-can-rejoice-and-think-about-how-to-do-better philosophy (I call it “model checking” rather than “hypothesis testing”). Another reason is that I already knew that Meehl, like me, opposed null hypothesis significance testing (unlike me, he was Saul Bellow’s therapist, but that’s another story), and so every time I’d read that paper, I’d just kinda skimmed it and seen that he was attacking the use of hypothesis tests in psychology research.
I’d understood that Meehl was criticizing null hypothesis significance testing as being confirmationist rather than falsificationist, but what I hadn’t caught was his comparison with physics.
Meehl’s point is not just the commonplace that social sciences have physics envy, that they’re unrealistically looking for law-like relationships that you can’t hope to find when studying human behavior or society; rather, he’s saying that the entire endeavor of data-based science changes if you move from precise, physics-type models to more vague, directional, social-science hypotheses.
3. What to do?
What’s the answer here? The answer is not to tell psychologists (or political scientists, or economists, or sociologists) to form strong, physics-like models and test them. Whatever strong, physics-like models we create, from the median voter theorem to prospect theory to social network models, are gross oversimplifications: they’re guides to thinking that we don’t expect to fit reality.
Rather, I think we need to act like statisticians (ok, maybe you could see that coming!) and consider model building to be an ongoing process. We don’t need to waste our time rejecting straw-man null hypotheses, and we certainly shouldn’t be comparing models based on their p-values or other measures of distance of data from uninteresting nulls; rather, we should build the best models we can, knowing that they’re imperfect, and gathering more data until we can do better.
This is not a falsificationist paradigm, nor is it confirmationist. It’s more like “normal science” in the Kuhnian sense. Or, to put it another way, we will be doing small falsifications every step of the way, occasionally slipping into a new paradigm when the old models get too damn clunky. It’s the fractal nature of scientific revolutions.
And that’s the final reason I didn’t catch Meehl’s point on my first (or second, or third) reading of his article. He identifies the problem without offering much of a solution. That’s fine–identifying the problem is the first step. I’m glad I read it again, this time actually reading that first paragraph carefully, as it deserved to be read.
I’m confused. You say “But in psychology, when you gather more data, it’s easier and easier to reject the null, which makes it easier and easier to confirm your favored hypothesis.” But I thought we all agree that you never “confirm” a hypothesis. Further, the Meehl paper is totally couched in terms of binary NHST conclusions. Does it remain true if we focus on more nuanced measures such as the width of a confidence interval – does getting more data narrow these intervals in physics and widen them in psychology? And, should we distinguish between getting more data in terms of more observations vs getting more data in terms of more factors? Does that difference matter?
First sentence of the abstract:
_Because physical theories typically predict numerical values, an improvement in experimental precision reduces the tolerance range and hence increases corroborability. In most psychological research, improved power of a statistical design leads to a prior probability approaching ½ of finding a significant difference in the theoretically predicted direction._
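The limiting value of ½ in that abstract can be checked numerically. The sketch below rests on assumptions that are not spelled out in the quote itself: the true standardized effect is never exactly zero, a theory with no validity guesses its sign correctly half the time, and the test is a two-sided z-test at level alpha followed by a check of direction. The function name is hypothetical.

```python
from statistics import NormalDist


def prob_significant_in_predicted_direction(effect_size, n, alpha=0.05):
    """Chance of a significant result in the predicted direction when the
    true effect has magnitude `effect_size` and a coin-flip sign."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * n ** 0.5
    right_sign = 1 - z.cdf(crit - shift)  # truth agrees with the prediction
    wrong_sign = 1 - z.cdf(crit + shift)  # truth is opposite to the prediction
    return 0.5 * (right_sign + wrong_sign)


for n in (10, 100, 1000, 100000):
    p = prob_significant_in_predicted_direction(effect_size=0.1, n=n)
    print(f"n={n:>6}  P(significant, predicted direction) = {p:.3f}")
```

With little data the probability sits near alpha/2; with a lot of data it approaches ½, so a theory with no content at all passes the directional test about half the time.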
How do I italicize things?? I tried asterisks and underscores and neither seems to work.
To italicize use HTML tags. I will try to type them here
Italics on is <em> read as emphasis on
Italics off is </em> read as emphasis off
Thank you very much.
Dale,
I think the main problem here is when H0 is a fixed number but the researcher’s hypothesis H1 is any other number. In that case confidence intervals will also not provide any “stronger test” against H1 as you increase the number of observations, because the confidence interval is always going to include some values in H1.
Now, I don’t fully agree with Meehl here, because psychological researchers often don’t set up H0 as a fixed number (e.g., r = 0) because any other number agrees with their theory, but because it works as a boundary point for their actual prediction (e.g., that r > 0). In that case, it is not necessarily true that this setup does not provide a stronger test with more data, because with more data, estimation intervals are more likely to “move outside” of the researcher’s predictions (r > 0) if the other side is actually true (r 0).
However, for this to be a strong test, it of course means that the “other side of the coin” is not an unlikely strawman even outside of the theory. I think this is similar to Andrew’s point about not rejecting strawman nulls. I made similar points in a previous thread, but a commenter started making up a bunch of things about me and derailing the conversation, so that didn’t really go anywhere.
I also think this makes more sense for physics, because it doesn’t seem to be true that their theories really specify H0 as a fixed number that is the only thing that agrees with their theory. Not if their predictions depend on some free parameters (e.g., the speed of light) that need to be estimated. Because then their predictions can only be as precise as what is allowed by the precision in the best estimates of those parameters. Consequently, the actual predictions from the theory become some type of intervals. Now, it may be true that this interval is so small, and the study so controlled, that specifying it with a fixed number doesn’t make much of a difference for interpretations.
However, conceptually, I think this shows that the issue is not whether the researcher’s hypothesis or the alternative is specified as a fixed number (both approaches typically have ranges for both their own and the alternative hypothesis). The issue seems much more to be: how good are the predictions at excluding things? The larger the range of correct exclusions, the better, but even if a fairly large range is not excluded, it can still be useful if the exclusions are meaningful in some way. Say, because they are otherwise perceived to be likely, or because they can be used to exclude some meaningful qualitative difference (e.g., if the theory correctly predicts continuing with a treatment after a certain point will decrease its effectiveness, there is no point in continuing it).
As for more factors (more variables?), I don’t think it’s the same issue, because it could help control for otherwise spurious relationships. As others point out below, Meehl also argues for the need for improved methods for testing the theories, which I think includes being able to make such controls.
Hm… the “(r 0)” should be “r larger than or equal to 0”. Don’t know why it didn’t come through.
I should also say: The issue, I think, is when researchers don’t care about how good their predictions are at excluding – meaningful – other things, and instead just treat a smaller p-value (or smaller/larger Bayes factor) as stronger evidence of their theory, rather than treating the ability of the predictions at excluding untrue things, and the ability of the test at detecting deviations from these predictions, as the strength of the evidence.
Argh, I mean “r smaller than or equal to 0”. Sorry, I didn’t think through what I wrote exactly in the sentence before responding.
Sorry for spamming (feel free to remove posts). I might have made a mistake when first posting, which may explain my previous two confused responses. The final sentence of the second paragraph should be as follows:
In that case, it is not necessarily true that this setup does not provide a stronger test with more data, because with more data, estimation intervals are more likely to “move outside” of the researcher’s predictions (r higher than 0) if the other side is actually true (r lower than 0).
After Meehl (1967), the next influential work in psychology is Cohen (1988), who tried to make psychologists think about effect sizes and, in 1994, made fun of nil-hypothesis testing (very different from null-hypothesis testing). This started the slow transition to effect size estimation with confidence intervals that gained prominence during the replication crisis (Cumming).
I think of this same concept in terms of incentives.
The best experiment in psychology is the one that “frees” your hypothesized effect from the surrounding noise, allowing it to stand tall. That doesn’t sound so bad, right? This view of how things work creates space for post-collection data cleaning. Your effect shows up just fine with the students in your class, but disappears when you pad your “n” with online responses? Must be something fishy about those online responses, just drop them.
This slowly dawned on me as I was puzzling over the Gino scandal. She would matter-of-factly discuss how her datasets floated around the lab, subject to any cleaning steps anyone could think of. For her, that was just what you did. She was perceived as a rock star precisely because she seemed to have the magic touch in designing experiments that allowed her effect to jump off the page.
In physics, if you don’t subject your hypothesis to the most severe test you have available in the lab, you can be sure that your rivals will do precisely that. And then they will publish a rebuttal that will make you look like a fool.
This strikes me as unrelated to the amount of data – it sounds like a different standard for research practice. If you are saying that research standards are higher in physics than psychology, I don’t disagree. But that difference would exist whether available data was small or large. I don’t see how it accounts for different conclusions as more data is collected. The differences would be there if we focused on availability of AI or globalization of research teams, or power of computing services just as much as availability of data.
I think Meehl’s argument is about exactly these kinds of differences, not just the amount of data. Meehl talks about observational hurdles, experimental precision, improvements in our experimental method, and difficult tests, all of which may involve the amount of data, and all of which encompass much more than just this.
Dale,
I think Andrew’s post (and your comments) is mostly about increasing “n,” my comment was about tweaking the method, and Meehl’s point was about both.
Mentally I’ve always connected that point in Meehl’s paper (which felt important when I first read it) with arguments to test against more than one point hypothesis if you are going to do hypothesis testing (e.g. compatibility curves, which Sander Greenland has argued for), because it makes clearer that with more data you’re ruling out more possible effect sizes. The implication being that if people did this more, it might put pressure on theories to make more quantitative commitments and help undo the inversion Meehl is talking about.
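A rough sketch of the compatibility-curve idea, under simplifying assumptions of my own (a normal approximation with known standard deviation, and made-up numbers): compute a p-value for every hypothesized effect size on a grid rather than only for zero, and see which values remain compatible with the data. The helper names `p_value` and `compatible_values` are hypothetical.

```python
from statistics import NormalDist


def p_value(estimate, std_error, hypothesized):
    """Two-sided p-value for one hypothesized effect size."""
    z = abs(estimate - hypothesized) / std_error
    return 2 * (1 - NormalDist().cdf(z))


def compatible_values(estimate, sigma, n, grid, level=0.05):
    """Grid points whose p-value exceeds `level` given n observations."""
    se = sigma / n ** 0.5
    return [h for h in grid if p_value(estimate, se, h) > level]


grid = [i / 100 for i in range(-50, 101)]  # candidate effects from -0.50 to 1.00
for n in (25, 250, 2500):
    kept = compatible_values(estimate=0.2, sigma=1.0, n=n, grid=grid)
    print(f"n={n:>5}  compatible effect sizes: {min(kept):.2f} to {max(kept):.2f}")
```

With more data the surviving range shrinks around the estimate, which makes it visible that the data are ruling out a growing set of effect sizes and not just the single value zero.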
I read the first four chapters of Jaynes and I kick myself daily for not taking a sabbatical and reading the whole thing ASAP.
Not knowing who Paul Meehl was, I looked him up, and the Wikipedia entry on him has a section under “Philosophy of Science” entitled “Meehl’s Paradox”, which seems to be about the point Andrew is making in this post.
Gregory:
Indeed, I claim no originality in this post, the point of which was made by Meehl in italics in the first paragraph of his article! It’s just funny that I’ve known about this article for a long time but had never caught that particular point.
Well, Andrew, as regards Meehl, everyone should additionally read Gerd Gigerenzer’s excellent paper:
Can psychology learn from the natural sciences
free download here:
https://journals.sagepub.com/doi/epub/10.1177/09593543231209342
He makes similar points to what you hint at, and more.
Yup, agree with Jessica. Initially I had a hard time seeing what Andrew was talking about in the physics scenario. But once you replace “null hypothesis” with Sander’s “tested hypothesis” then it makes sense that one can start with Copernicus’s model as the tested hypothesis, then test Kepler’s model etc. What one chooses as the tested hypothesis is critical. We learn far less, if anything, if it is a straw-man hypothesis.
Even anoneuoid got the point of this the first time. If you aren’t testing your own hypothesis you are doing bizarro world science. For many psychology researchers, the hypothesis is just “under conditions A then B happens more often than under control” or something similar.
It’s not much of a hypothesis, and it doesn’t have much in the way of implications, but the stats still don’t usually examine it directly. You could do a Bayesian analysis in which the frequency of B is a parameter under the two conditions, compare the posterior distributions of that parameter, and say “there’s a 90% posterior chance that the rate is higher in condition A than control,” but this isn’t the kind of stats psych researchers are traditionally taught, so they flounder around.
This is the kind of thing my current student couldn’t wrap his head around when I started advising him, like estimating a parameter and seeing if its posterior distribution is concentrated on the positive side or whatever. I think after 1.5 years he’s fully internalized that stuff, but it bothers me that a guy with a master’s in stats who is a PhD candidate in social science needed like 6-12 months to “get it”.
I mean, I don’t have a master’s in stats; heck, I haven’t taken anything more than a math stats class in the 90s introducing a bestiary of distributions. So something is deeply wrong.
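The Bayesian comparison described in this comment can be sketched in a few lines of Python. The assumptions here are mine, not the commenter's: independent uniform Beta(1, 1) priors on the two rates, binomial counts, and Monte Carlo draws from the resulting Beta posteriors. The counts (30 of 100 versus 20 of 100) and the function name `prob_rate_higher` are invented for illustration.

```python
import random


def prob_rate_higher(events_a, n_a, events_c, n_c, draws=100_000, seed=1):
    """Posterior probability that the rate of B is higher under condition A
    than under control, assuming Beta(1, 1) priors on both rates."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Beta posterior for each rate: prior pseudo-counts plus observed counts.
        rate_a = rng.betavariate(1 + events_a, 1 + n_a - events_a)
        rate_c = rng.betavariate(1 + events_c, 1 + n_c - events_c)
        wins += rate_a > rate_c
    return wins / draws


p = prob_rate_higher(events_a=30, n_a=100, events_c=20, n_c=100)
print(f"Posterior P(rate_A > rate_control) = {p:.3f}")
```

The output is a direct statement about the hypothesis of interest (the rate is higher under A), with no straw-man null anywhere in the calculation.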
Hmm, I’ve probably quoted those exact lines dozens of times since ~2014, and even put them in comments on this blog. Meehl should have gotten a Nobel Prize for that paper.
The lesson seems to be wait 10 years for people to start catching up with “crazy” ideas. Then again Meehl waited 45 with little to no effect on general practice, so that may be optimistic.
Possibly merely because I am ancient, I have met both Bellow (before his Nobel Prize) and Meehl (several times before he became famous in this blog). I suggested to my university that it ought to invite Bellow to deliver a talk, but was turned down (see first parenthesis).
> What’s the answer here? The answer is not to tell psychologists … to form strong, physics-like models and test them. Whatever strong, physics-like models we create … [are] guides to thinking that we don’t expect to fit reality. Rather, I think we need to act like statisticians … and consider model building to be an ongoing process.
Why is the answer not to tell psychologists to test strong models? Wouldn’t psychologists benefit from thinking more holistically about the data-generating process, in addition to embracing the practice of learning from model failures and iterating vs. shooting down point nulls? It seems like in any field models serve as “guides to thinking that we don’t expect to fit reality.” It’s just that in physics the model features are more precise, and the systems they represent are able to be more precisely probed. Is the claim that psychology theories are currently so far from fitting reality that testing them against precise predictions would be uninformative? If not, could you clarify what middle ground you’re advocating between strawman nulls and “strong, physics-like” models? I think I’m missing something.
+1. If psychologists are to become scientists, they should learn from science.
Zach,
I think psychology is important. I just don’t think that strong, physics-like models work in psychology. There are just too many things going on in a person’s head. For some examples of psychology that I like, see here, for example. Also there’s psychometrics, which uses the same sorts of models we use in political science. So, for that matter, you could look at my books such as Bayesian Data Analysis and Regression and Other Stories to see the sorts of models that I’m talking about. As I wrote in my post above, from my perspective the point of any statistical model is to be there and do its job long enough for it to be replaced.
I’ve wondered what psychology research would look like if we really cared about knowing the answers — if we ran a few studies on 25000 participants rather than a lot of studies each with 25 college undergrads, for example. Perhaps we could then “form strong, physics-like models and test them.”
I think, by the way, that social media and marketing companies do exactly this. (Not for the purpose of advancing basic science, though.)
Ragha/anyone else;
What do you think about the types of behavioral research/sociology stuff that uses similar methods? I am thinking along the lines of the Obama-era studies related to online hate word usage and such. I am not smart enough to know if these studies have strength or real problems of inference, but it does seem to address the low sample issue.
I found a recent paper using this approach confirming the earlier studies. https://journals.sagepub.com/doi/10.1177/20563051231205592
My guess is the response data may have some funkiness to it, and really, the premise of Obama speech–>increase in n-word isn’t exactly surprising. Nor is earlier stuff, but I am curious if this is along the lines of what you are thinking.
No doubt internet companies are mining intel to their own ends via all kinds of models, but I am more curious about how better behavioral psych might get over some of its very basic problems.
“Raghu”
I have no idea where that “a” came from. I say it right in my head, I swear!
I forgive you! :)
This approach can have value, but only if you are interested in studying the properties of groups of people, not if you are interested in understanding mechanisms that govern individual behavior. In other words, it is a statistical mechanics approach, sort of like Hari Seldon’s “psychohistory” from the Foundation books. But as those books illustrated, models of population dynamics can often be unstable and don’t necessarily do a good job of accounting for individual behavior. Of course, those books are fictional! But my general point, which others have pointed out here as well, is that a general challenge in any science is to clearly define the scope of inquiry, to develop models that are well-suited to understanding phenomena within that scope, and which lead to experimental/observational tests that are well-suited to refine those models.
The problem with psychology is that it is too broad. It is like treating physics, engineering, and industrial design as if they were all a single discipline. As such, I think that the areas of psychology that have led to the most productive insights have been those which most clearly delineated the scope of inquiry in order to support the kind of theory development I described above and which Andrew advocated for in his post. These areas include psychophysics (the relations between physical stimulus properties and the ability to detect/identify items), certain aspects of memory research including associative learning (mathematical models of which were absorbed by computer scientists to form the basis of modern neural nets), and certain studies of the processes by which people make rapid decisions (which, perhaps ironically, rely on models based on physical diffusion processes). In each of these domains, theories are specified as mathematical or computational models which are tested and rejected/refined based on their ability to make precise quantitative predictions.
The experimental work in these domains does not, however, rely on large samples of individuals–rather, it relies on collecting large amounts of data from a comparatively small number of individuals. This is because their scope of inquiry is at that level. At the same time, the resulting theories have proven to be quite robust not because they rely on large samples to wash out variability but because the experimental setups are so carefully controlled that differences between individuals are muted. To take a historical example, Hermann Ebbinghaus in 1885 used himself as the only participant in his research on memory and learning, but all of his results (e.g., the form of forgetting curves, the relative efficacy of spaced over massed repetition) continue to be replicated. This worked because the materials he used in his experiments were nonsense syllables, such that each experiment was as close as possible to being a “blank slate”. A physical analogy might be tabletop experiments, which involve similar levels of rigor and control.
But to get back to the scope of inquiry issue, while results like Ebbinghaus’ are easily replicable in well-controlled experimental settings, it is not trivial to generalize them to “real world” situations. For example, we have models that can do a pretty good job of predicting whether or not an individual participant will successfully recognize having seen a particular patch of color earlier in an experiment (and how long it will take them to do so). But if you ask me to predict whether someone will be reminded of a vacation they took when they were 12 by seeing someone on the bus this morning whose face resembles someone they met during that vacation, I wouldn’t bet any money on my ability to do that. A physical analogy would be asking a physicist to predict where a baseball will land at a particular crack of the bat. In both cases, we may have pretty good models of the various mechanisms that would be involved, but generalizing them to a specific real-world instance would require so much additional information about the specific scenario that it is essentially impossible. Carefully delineating a scope of inquiry is necessary to make theoretical progress, but still has its limits.
> These areas include psychophysics (the relations between physical stimulus properties and the ability to detect/identify items), certain aspects of memory research including associative learning (mathematical models of which were absorbed by computer scientists to form the basis of modern neural nets), and certain studies of the processes by which people make rapid decisions (which, perhaps ironically, rely on models based on physical diffusion processes). In each of these domains, theories are specified as mathematical or computational models which are tested and rejected/refined based on their ability to make precise quantitative predictions.
Can you provide some specific references (including comparisons to data) for these models?
Good comment. I think along similar lines. We can a) build physics-inspired theories for individual level behavior if we restrict our scope to very simple, experimental situations in which stimuli and possible actions are carefully restricted. But as you say, we won’t be able to generalize from them and those theories break down easily when increasing complexity.
We can also b) build physics-inspired theories about how aggregates behave, without being able to specify individual level behavior in those theories. Much like physicists model the trajectory of a ball without modeling individual level atoms.
However if we try c) to theorize about individual-level behavior and how this leads to certain aggregate outcomes, we probably won’t be able to develop a physics-inspired theory for this. It is just too complicated.
This leaves us with a) and b). I guess what you prefer is partly preference but a case can be made for b). Often modeling aggregate phenomena is simply more important. For example, internal migration matters because many people move; what single individuals do is not important. And to model large-scale phenomena well, specific knowledge about individual-level mechanisms may not be necessary. Also many studies that belong to a) are –at least to me– completely uninteresting. If you can’t ever generalize to the real world, then why bother?
In response to Anonymous, I would suggest that a good starting point is the Oxford Handbook of Computational and Mathematical Psychology (https://academic.oup.com/edited-volume/41261?searchresult=1), since it has several chapters that provide overviews of the fields I mentioned (and more), summarizing major models and experimental paradigms along with citations. Even though the book itself is paywalled, I’ve found that searching on Google or Google Scholar will often get you PDFs of the individual chapters you want. Plus there’s always the library!
In response to huan, I agree that I should not have been dismissive of the value in understanding group behavior–as you say, there are phenomena that are best understood at that level and it’s worth it to conduct the kind of large-scale studies Raghu describes in order to develop better theories in those domains. I was just pushing against the idea that it was the only way to develop good theory in psychology, since “psychology” encompasses too many things for a single approach to suffice.
> I would suggest that a good starting point is the Oxford Handbook of Computational and Mathematical Psychology
I’m reading through it. In Chapter 1, they mention the “General Recognition Theory”, which fits large matrices to data. Do they have a null model? It seems that they just compare the “restricted” model to an “unrestricted” model in their statistical testing to test some assumption. That’s fine by itself, but the usual null model is to assume “all the data is noise”, which implies they should be using randomly generated matrices in addition (like Horn’s parallel analysis in the context of Big 5). Otherwise they don’t know what “our data is maximally uninformative” looks like.
Also, they don’t make any substantive predictions with that model. What is the use of saying that something is perceptually independent vs non-independent, then? This is one of the same traps that Big 5 falls into; see, e.g., Shalizi’s post on it (he talks about g but the same applies to Big 5).
https://www.bactra.org/weblog/523.html
Essentially the same problem occurs in chapter 2: they estimate “drift rate,” which, in the context of a diffusion model, seems to be the rate at which the subject “drifts” toward a decision for various populations. They can fit the data but that’s not enough, again–what substantive mathematical predictions about future experiments does that parameter allow you to make? They don’t say. It seems like all their (surprisingly) complicated math is wasted.
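For readers who have not met the term, here is a sketch of what a drift rate does in the simplest textbook diffusion model: a Wiener process with constant drift that starts midway between two decision boundaries at 0 and `boundary`. This is an assumption about the kind of model meant, not a reproduction of the handbook chapter, and the function names are hypothetical. Under these assumptions the probability of ending at the upper boundary is 1 / (1 + exp(-drift * boundary / noise^2)) and the mean decision time is (boundary / (2 * drift)) * tanh(drift * boundary / (2 * noise^2)).

```python
import math


def prob_upper_boundary(drift, boundary, noise=1.0):
    """Probability of hitting the upper boundary first (unbiased start)."""
    return 1.0 / (1.0 + math.exp(-drift * boundary / noise ** 2))


def mean_decision_time(drift, boundary, noise=1.0):
    """Expected time to reach either boundary (unbiased start)."""
    if drift == 0:
        # Limit of the general formula as drift goes to zero.
        return boundary ** 2 / (4 * noise ** 2)
    return (boundary / (2 * drift)) * math.tanh(drift * boundary / (2 * noise ** 2))


for drift in (0.0, 0.5, 1.0, 2.0):
    acc = prob_upper_boundary(drift, boundary=1.0)
    rt = mean_decision_time(drift, boundary=1.0)
    print(f"drift={drift:.1f}  P(upper)={acc:.3f}  mean decision time={rt:.3f}")
```

In this idealized version a single drift value jointly fixes a choice probability and a mean decision time, so an estimate from one condition implies numbers that a new condition could contradict. Whether the models in the chapter are used that way is a separate question.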
In ch.3, on page 92:
> the possibility exists that one or the other model will simply fail to fit the data, even as well as another model, due only to the specific quantitative formulation of the psychological precepts, rather than the fundamental characteristics of the latter. For instance, consider an investigator who has correctly pinpointed the proper architecture, stopping rule, and so on for a task but failed to employ the valid associated stochastic process…Meanwhile, a theoretical competitor might produce an incorrect specification of the architecture (e.g., serial), but employed a set of stochastic processes that adventitiously provide a superior fit.
This means they need to simplify their models, or test each mentioned part separately. Otherwise, they’re not really falsifiable.
A simple example of what I mean about predictions of future experiments: Suppose you have a fixed amount of gunpowder. You can measure how much energy it gives off upon explosion. Suppose now you want to use a heavier projectile in your gun than you used in the past, with the same amount of gunpowder, of course. Using already-known formulas for kinetic energy and projectile motion, you can predict the changed trajectory before doing the experiment. What is the analog of this for these models?
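To make the gunpowder example concrete, here is a minimal sketch of the kind of prediction meant. It assumes that all of the powder's energy ends up as kinetic energy of the projectile, that there is no air drag, and that the shot is fired over level ground. The energy, masses, and launch angle are made-up illustrative numbers, and the function names are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def muzzle_velocity(energy_j, mass_kg):
    """Speed from E = (1/2) m v^2, assuming all energy goes to the projectile."""
    return math.sqrt(2.0 * energy_j / mass_kg)

def flat_range(speed, angle_deg):
    """Drag-free range on level ground: R = v^2 * sin(2*theta) / g."""
    return speed ** 2 * math.sin(2.0 * math.radians(angle_deg)) / G

energy = 2000.0  # hypothetical energy released by the fixed powder charge, in joules
for mass in (0.010, 0.020):  # hypothetical old and new projectile masses, in kg
    v = muzzle_velocity(energy, mass)
    print(f"mass={mass} kg  speed={v:.1f} m/s  range={flat_range(v, 45.0):.0f} m")
```

Under these idealized assumptions, doubling the mass divides the speed by the square root of 2 and halves the range, and that is a number you can write down before firing the gun.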
In response to Anon: I suggest that you take more time to work through and understand what the work you are reading about is doing. In particular, your description of both the GRT model and the diffusion model is incorrect, which suggests that you are applying assumptions to your reading that are inappropriate to the new domain you are learning about.
GRT is not a model that fits large matrices to data. It is an extension of signal detection theory that is meant to describe how noisy sensory channels induce a distribution of sensory signals in a multidimensional space, such that identification can be understood in terms of which region of the space a particular combination of signals occupies. The model is a framework that allows you to define what the terms “perceptual independence” and “perceptual dependence” mean–i.e., does an increase in the intensity of one sensory dimension entail an increase in the other, or not. As the chapter describes, this model is a framework that allows you to test specific assumptions within the context of a particular kind of controlled experiment. I do not understand your analogy to Big 5, which is the outcome of a factor analysis. Similarly, your mention of a “null model” is difficult to relate to the content of the work you are reading, so I would encourage you to think about that concept more deeply to know whether it is or is not applicable in any particular domain.
Likewise, your description of diffusion models is incorrect–there is no sense in which subjects “drift” and there is no “population”. In the context of that work, the process of deliberating between two options is modeled as a diffusion process over time, such that the momentary state of a decision maker can be said to fluctuate between two absorbing boundaries as a function of the evidence they have accumulated in favor of one option or another.
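For readers who have not seen this kind of model, the process described above can be sketched as a simple simulation. This is only an illustration, assuming a constant drift, Gaussian noise, symmetric boundaries, and an unbiased starting point; the parameter names and values are hypothetical and not taken from the book.

```python
import random

def simulate_trial(drift, boundary=1.0, noise_sd=1.0, dt=0.005, rng=random):
    """Simulate one decision: evidence starts at 0 and accumulates
    until it reaches +boundary or -boundary (the absorbing boundaries)."""
    x, t = 0.0, 0.0
    step_sd = noise_sd * dt ** 0.5  # noise scales with the square root of the time step
    while abs(x) < boundary:
        x += drift * dt + rng.gauss(0.0, step_sd)
        t += dt
    return (x > 0), t  # (hit the upper boundary?, decision time)

def summarize(drift, n_trials=1000, seed=0):
    """Proportion of upper-boundary choices and mean decision time."""
    rng = random.Random(seed)
    results = [simulate_trial(drift, rng=rng) for _ in range(n_trials)]
    p_upper = sum(hit for hit, _ in results) / n_trials
    mean_rt = sum(t for _, t in results) / n_trials
    return p_upper, mean_rt

for drift in (0.0, 2.0):
    print(drift, summarize(drift))
```

In this sketch the drift is the average rate at which evidence accumulates toward one boundary, so a larger drift produces both more choices of that option and faster decisions. The model therefore constrains choice proportions and response times jointly.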
In terms of your question about generalization to additional experiments, this is not trivial in any domain–manufacturers test their products, after all. However, the specific situation you described would be analogous to estimating parameters of a diffusion model describing how a participant (i.e., firearm) detects a stimulus in noise using a particular range of intensities (i.e., projectile weight) and then making a prediction about how that participant (gun) would respond to a new level of intensity (this is what you mean by keeping everything else the same). Just as in the physical case, this generalization would require a theory that relates stimulus intensity to drift rate. But similar to the physical example, you cannot just use an “off-the-shelf” formula–the material properties of the projectile, barrel, and gunpowder matter and can often result in complex relationships between mass, energy, and how much is transferred to the projectile. Engineers test many different combinations of these factors before they feel comfortable making a prediction about an untested product. The psychological analogue would be estimating a psychometric function that is specific to that participant. In either the physical or psychological case, the resulting prediction may or may not be borne out, suggesting that any of the chain of assumptions involved in making the prediction was faulty. Thus, either model (i.e., chain of assumptions) is eminently falsifiable and subject to rejection or revision as appropriate.
Ultimately, science is hard, so I wouldn’t expect to be able to learn an entirely new field in a day, but I appreciate your willingness to engage with a domain that is new to you!
Huan,
I think we *can* build individual based / agent based models for human behavior, and we can learn about aggregated behavior and at least some aspects of individual behavior simultaneously. It’s just that social science simply has not even tried in most cases.
Let me give an example. In the Pasadena Unified School District, students go into elementary schools, take standardized tests, and progress through time, generally increasing their test scores, with some occasional decreases as well.
The state department of education reports aggregate average test scores, so we know something about the behavior of the averages across schools through time.
However, the students are not a fixed cohort. Students move between schools, move into or away from the school district, and some students suffer through different hardships than others (we have a nontrivial number of homeless students in the district).
With individual level data, we could hypothesize some mechanisms that cause learning, or impede learning, and we could try to discover individual or *small group* level causal behavioral issues.
For example, some schools may be “known for” some kinds of learning (say, English as a second language for Spanish-speaking kids), so we see an influx of poorly performing students at young ages who then improve. Perhaps the whole school sees depressed aggregate test scores relative to others, but that’s because the school attracts people whose kids are struggling with English as a second language: not because the school doesn’t teach well, but rather because it DOES teach well.
It’s just an example mechanism. We can think about these mechanisms in a “physics” way without serious problems. For example, perhaps the English-as-a-second-language issue affects mainly 1st through 3rd grades, and parents then move away from the school for 4th and 5th grades because the excess travel time to the school is not worth it once the child is performing well in English… this is a perfectly mechanical behavior we could impose on our model at the individual level (and it affects only certain individuals). There’s nothing “impossible” about imposing such mechanistic effects on models. Perhaps another mechanistic effect is that because the students are learning English, their math scores lag behind, because until they’ve learned enough English to understand the math instruction, they will simply not be able to follow it.
These are just simple ideas we could test; many other mechanisms like this could be tested.
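As a toy illustration of how such a mechanism could be written down, here is a minimal sketch, not a serious model. Every number in it (intake scores, yearly gains, the grade at which families leave) is a made-up assumption, and the function name is hypothetical.

```python
import random

def school_average(intake_mean, yearly_gain, last_grade, n_per_grade=200, seed=0):
    """Average score of currently enrolled students in a school serving
    grades 1..last_grade, where each student gains yearly_gain points per year."""
    rng = random.Random(seed)
    scores = []
    for grade in range(1, last_grade + 1):
        for _ in range(n_per_grade):
            start = rng.gauss(intake_mean, 10.0)      # score on entering grade 1
            scores.append(start + yearly_gain * (grade - 1))
    return sum(scores) / len(scores)

# Hypothetical school that teaches English learners well: low intake scores,
# large yearly gains, but families move away after grade 3.
esl_school = school_average(intake_mean=40.0, yearly_gain=8.0, last_grade=3)
# Hypothetical comparison school: higher intake, smaller gains, students stay through grade 5.
other_school = school_average(intake_mean=60.0, yearly_gain=4.0, last_grade=5, seed=1)
print(esl_school, other_school)
```

In this toy setup the school with twice the yearly gain still shows the lower aggregate average, purely because of who enters and who leaves. That is the kind of mechanism that only individual-level or group-level tracking could separate from poor teaching.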
But, we need to keep track of the kids for whom the mechanism is involved, we need to track their behavior separately, and we need to have data on what happens to the individuals. Often this isn’t the case.
Based on two decades of reading this blog, I’d say that such attempts are few and far between. Because of this, many social sciences are in a pre-scientific state: fairly obvious mechanistic explanations for social behavior, which don’t require any magical, simplistic, physics-like rules for particles or whatever, are just ignored because the “complexity” of the model exceeds what the actual participants are capable of.
Many social sciences give up before they even consider how they would implement individual/agent-based, rule-based, mechanistic modeling.
gec: I may have stated some things unclearly.
About GRT and null models: A null model in the sense that I am using it is applicable whenever you fit a model to data. It answers the question of “what would I see if there is nothing truly in the data/it is all noise?” This is applied in physics all the time, for example. There are signal patterns that you see when your telescope lens is covered, and you need to understand and account for those to get to the real patterns you want. Andrew would call these “non-strawman nulls,” I guess. There doesn’t seem to be any use of null models here, which is a warning sign. I brought up big 5 because despite it being a different model, null models for both GRT and big 5 will be similar (requiring random matrices), and they don’t seem to make substantive predictions.
About diffusion models, you said:
“In the context of that work, the process of deliberating between two options is modeled as a diffusion process over time, such that the momentary state of a decision maker can be said to fluctuate between two absorbing boundaries as a function of the evidence they have accumulated in favor of one option or another.”
This is consistent with what I meant. I think of the drift rate as the “speed” of diffusion. The state “drifts” or “diffuses” from one boundary to the next. The book claims that ADHD and dyslexic people have different drift rates than the general public; these groups are what I meant by “population.” Different populations have different drift rates.
“But similar to the physical example, you cannot just use an “off-the-shelf” formula–the material properties of the projectile, barrel, and gunpowder matter and can often result in complex relationships between mass, energy, and how much is transferred to the projectile.”
In the 1800s, they did use these “off-the-shelf” models in engineering–keeping in mind that they are an estimate and sometimes an upper bound on what you see in reality. We have better models now of the complex relationships that you mention. Are there simple, approximate models in psychology analogous to these, that make substantive predictions?
“The psychological analogue would be estimating a psychometric function that is specific to that participant. In either the physical or psychological case, the resulting prediction may or may not be borne out”
This rather reminds me of utility functions, which may be different for each participant and then are fit with different parameters each time…where is the generalizability here?
In ch.15 of the book, “neurocognitive models of perceptual decision making”, they claim that the models supposedly make true model predictions, “not just model fits.” Nowhere else in the book is this even claimed to occur.
Essentially, I am asking this:
In general, are there simple, approximate, and general models in psychology analogous to the simple firearm model that I mentioned, that make substantive approximate predictions? (This is how physics started out.) Any specific references in the book or elsewhere that do this?
This is far better psychology than I’m used to seeing, though!
I will also look at the citations for the claim in ch.15 that I mentioned.
gec: I worded it badly again, it doesn’t have to be approximate but it helps.
Daniel, I agree that in some instances it is possible to postulate individual-level mechanisms and derive aggregate predictions from them. After all, this is what many economists and sociologists have been trying to do for decades (think Coleman’s boat). But your example is a case in point of why I think it is in general too difficult. Your scope of inquiry is rather small: schools in the district of Pasadena in 2026-30, say. You already need a lot of high-quality individual-level data: not only past and present test scores, but also socio-demographic variables about the families involved. You need to be able to track their moves and kids changing schools, and you probably need to track kids’ learning behavior; you also need aggregate test scores for schools, and probably some other variables.
But assume you obtain grants and collect all that data for some years, do your analysis, and maybe obtain some valid causal estimates for various relations of interest, and maybe you can even predict individual-level variations in test scores (like math improving after 2 years of study of English). Now that would be an awesome study! But it costs a lot of money and time, and it is unclear whether your findings apply in other settings. Do your inferences inform other districts as well? What about other groups of children, maybe in other countries? Do they apply to different modes of teaching?
Furthermore, there will still be many mechanisms unexplored, like certain hardships, parents’ divorce, maybe general differences in intelligence, social media usage, and whatnot.
I am not sure what exactly my methodological point is here. I guess I am echoing gec’s point (you probably agree): it is not always necessary, and often too hard, to incorporate individual-level mechanisms if your goal is to infer some aggregate quantity. Imagine, for example, that based on macroeconomic conditions you can accurately predict economic downturns. That would be very useful, and you don’t need individual-level data to figure out who exactly will struggle, lose their job, and so on. My other point is that it is often hard to generalize from small-scale studies, which means you gather a lot of detailed but unconnected results.
This is basically how I view different types of explanations.
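data access, theorizing becomes easier
-------------------------------------------------------------------------->
+-------------+     +-----------+     +---------------+     +-------------+
| physicalism | --> | biologism | --> | individualism | --> | macro level |
+-------------+     +-----------+     +---------------+     +-------------+
<--------------------------------------------------------------------------
explanations become less informative, generalization becomes harder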
Here physicalism explains something in terms of interactions between particles, biologism in terms of hormones and stimuli, individualism in terms of decisions, and the macro level in terms of other aggregated phenomena. Social science currently seems to favor individualistic theories, but this is in some sense arbitrary, and depending on the question we may use a different type of explanation.
huan, I agree with you that many of the studies might be quite difficult from a data collection and study design and support perspective.
But also, I think the model building and Bayesian perspective is often problematic for the researchers so they don’t even have an idea of how to proceed if they have only some partial data.
For example, we can hypothesize different rates of preference related to English as a second language for, say, Hispanic students vs. Black students vs. white students vs. Asian students. Then if we have data only on the rates of each group entering or leaving the schools at each grade, plus the English and math score average and standard deviation within each group, we can fit a hidden variable model that attempts to determine group-level mechanistic effects from certain kinds of aggregated statistics, without needing literal individual-level data.
Even just getting aggregated data on how many students leave each school each year was impossible for me. The SQL database with the data for PUSD was right there, but literally no one would respond to my inquiries for data. So stuff is hard, and I get that.
Biology seems to be an interesting midpoint between Physics and Psychology, in terms of whether experimental precision constrains theory or allows sloppy theory to flourish. There’s a lot of rigorous modeling and quantitative prediction in modern biology, but there’s also a lot of awful p-hacking and noise-mining, especially in areas like microbiome research, nutrition, and other important but messy topics.
And when you see a good study in one of those areas, it looks really, really different from the typical study. We discussed the “ultraprocessed diet” experiments several times on the blog: once when the study first came out, and once when some of the data was called into question. As I remember, this question was answered satisfactorily (there was some kind of roundoff discrepancy in a certain weighing apparatus, which called some numbers into question, but it was satisfactorily addressed).
But the baseline experiment… it involved cooking people controlled meals, serving them precisely measured portions, having them spend time inside a room-sized calorimeter, making them live in the experiment for weeks so no outside errors could be introduced, and weighing them to the nearest gram or whatever. It was no joke.
This reminds me of a book I am currently reading
“Food Intelligence” by Julia Belluz and Kevin Hall
The book is too wordy but interesting. I think it describes the experiment you refer to. Anyways, apparently a lot about mechanisms and chemical reactions in our bodies is known by now and Hall conducts careful experiments and builds mathematical models for our metabolism.
A feeling like hunger can be explained by certain hormones reacting to stimuli. Also, if you genetically modify mice to have a dopamine deficiency, then they are literally too lazy to move to food 20 cm in front of them and just die. Fascinating stuff. We humans are just physical matter arranged in incredibly complex ways.
Sorry for OT.
I wrote a blog post on this a few years back:
https://blog.edhagen.net/posts/2018-01-16-our-statistics-dont-suck-our-theories-do/
Ed:
Yes, you put it well there.
So if the positive hypothesis being tested (not the null/nil hypothesis) in social science research is often that the correlation has a particular stated sign, then no matter the statistical test, the p-value should never be less than 0.5, because fully random data has a 0.5 chance of agreeing with the hypothesis.
Am I missing something?
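The premise of this question, that pure-noise data agree with a directional prediction about half the time, is easy to check by simulation. Below is a minimal sketch, assuming independent standard normal x and y and a predicted positive correlation; the sample size and number of simulations are arbitrary choices. It checks only the premise and says nothing about what p-value any particular test would report.

```python
import random

def noise_agrees_with_positive_sign(n, rng):
    """Draw n independent (x, y) pairs of pure noise and report whether
    the sample covariance (hence the sample correlation) is positive."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov > 0

def fraction_agreeing(n=30, n_sims=4000, seed=1):
    """Share of pure-noise datasets whose correlation has the predicted sign."""
    rng = random.Random(seed)
    hits = sum(noise_agrees_with_positive_sign(n, rng) for _ in range(n_sims))
    return hits / n_sims

print(fraction_agreeing())
```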
Why the focus on psychology in the comments and the paper? (Well, Meehl was a psychologist, so that explains the paper.) Medicine is the same. One paradox for me is that medicine just works, very often, and saves lives. So even the incorrect approach leads in this case to a useful outcome?
Shravan:
I have no good reason to believe that medical research and practice would be any better, had null hypothesis significance testing never been invented.
This seems a good place and moment to mention two papers by Gigerenzer for some more information.
1) From Gigerenzer’s 2004 paper titled “Mindless Statistics”:
“More recently, the ritual has been labeled null hypothesis significance testing, for short, NHST or sometimes NHSTP (with P for “procedure”). It became institutionalized in curricula, editorials, and professional associations in psychology in the mid-1950s (Gigerenzer, 1987, 1993).” (p. 589)
and:
“Psychology seems to be one of the first disciplines where the null ritual became institutionalized as statistics per se, during the 1950s (Rucci and Tweney, 1980; Gigerenzer and Murray, 1987, chapter 1). Subsequently, it spread to many social, medical, and biological sciences, including economics (McCloskey and Ziliak, 1996), sociology (Morrison and Henkel, 1970), and ecology (Anderson et al., 2000).” (p. 590)
2) From Gigerenzer’s 2018 paper titled “Statistical rituals: The replication delusion and how we got there”:
“Second, and most relevant for this article, psychologists reinterpreted this marriage in their own way. Early textbook writers struggled to create a supposedly objective method of statistical inference that would distinguish a cause from a chance in a mechanical way, eliminating judgment. The result was a shotgun wedding between some of Fisher’s ideas and those of his intellectual opponents, the Polish statistician Jerzy Neyman (1894–1981) and the British statistician Egon S. Pearson (1895–1980). The essence of this hybrid theory (Gigerenzer, 1993) is the null ritual. The null ritual does not exist in statistics proper, although researchers have been made to believe so.” (p. 200)
Andrew-
You often focus on factors that hold back scientific progress: misaligned incentives, publication biases, overly flexible analyses, over-reliance on NHST, etc. Yet in medicine, we’ve seen rather remarkable, population-level progress (e.g., steep drops in CVD mortality, effective treatments for many cancers, etc.). That progress largely aligns with the widespread adoption of NHST-embedded RCTs. Before that, progress was slower. Given those downward pressures and medicine’s strong track record under an NHST-dominated regime, it seems surprising to me, despite all of the problems you discuss with NHST, to suggest there’s NO good reason to believe NHST conferred any material positive benefit.
How could this become an empirical question to tease out causality here?
Joshua:
I’m not sure how to answer that one, but I agree that it’s a good question, and it’s one that I’d not thought of before.
Joshua:
Andrew:
Please do look into how this was determined. I’d check lead time bias and other testing effects, the choice of controls, and regression to the mean by selecting people with flareups for trials as the first confounds to consider.
Anoneuoid –
Sure, there can be confounders for survival‑time metrics, but they don’t really account for the population‑level trends I’m referring to. CVD mortality is binary, so lead‑time bias isn’t relevant there. And the improvements show up across indicators that aren’t especially sensitive to diagnostic artifacts: all‑cause mortality, age‑adjusted mortality, infant mortality, hospitalization rates, international comparisons, natural experiments, and the mechanistic understanding of disease pathways.
Andrew says that “statistics are hard,” but that cuts both ways. If someone wants to argue that these broad, convergent trends are mostly artifacts of bad statistics, they need to present a statistical story that actually explains those data patterns. Please present one if you have it. Contrarian hand‑waving is sub‑optimal.
I guess that should be “statistics is hard.”
One paradox for me is that medicine just works, very often, and saves lives.
You realize this was determined by NHST, right?
For medicine, it’s more that progress has essentially stalled since ~1970 (cancer, stroke/TBI, Alzheimer’s, cardiovascular disease). There have also been engineering advances that improve the implementation of existing strategies, which counteract this a bit.
What it _because_ of NHST though, or in spite of it?
Was* it
Shravan, I think that’s true, but likewise for psychological treatments we have a lot of successes as well (for example, CBT for a whole range of problems). I think, though, that it’s reasonable to ask whether, even if you have some successes, things could have been much better under different conditions or methodological approaches.
To me, as a social scientist, it has always seemed that part of the reason for the dominance of the N in NHST is that it makes the math easier. All of a sudden terms drop out. For people who were raised doing–and teaching– hand calculations the magic of terms dropping out is compelling.
Another dimension of this is why do some disciplines use Chi Square Test of Independence instead of Chi Square Goodness of fit? Partly, it is that we are more interested in whether groups differ than what their literal population values are.
But either way, Chi Square with a small number of cells is easy to do with pencil, paper and the back of a statistics book.
That said I feel like Meehl is a little slippery in eliding “improvement in experimental precision” (which I take to mean improved instrumentation) and “improved power” (which I take to mean bigger sample sizes). In psychology (not my field, but I assume) instrumentation can also get better. In physics–I guess– you could send more particles into the accelerator. Are they really the same thing?
As the simplest example, check the t-test equation, which has both s (controlled by the noisiness of the measurements) and n (sample size): https://en.wikipedia.org/wiki/Student's_t-test
More money gets you both. Since the default nil-null hypothesis is always false, NHST measures how much funding you got, which is determined via a collective wealth/power-weighted prior. Lack of significance only means you did not spend enough money.
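A minimal sketch of that point, using the one-sample t statistic with a fixed, tiny true difference. The numbers are hypothetical and the function name is illustrative; the sample mean difference is simply set equal to the true difference to keep the arithmetic transparent.

```python
import math

def one_sample_t(mean_diff, s, n):
    """t statistic for a mean difference from zero: mean_diff / (s / sqrt(n))."""
    return mean_diff / (s / math.sqrt(n))

tiny_effect = 0.01  # hypothetical small but nonzero difference
print(one_sample_t(tiny_effect, s=1.0, n=100))        # noisy instrument, small sample
print(one_sample_t(tiny_effect, s=0.1, n=100))        # better instrument (smaller s)
print(one_sample_t(tiny_effect, s=1.0, n=1_000_000))  # same instrument, far more data
```

Either route, shrinking s or growing n, inflates t for the same underlying difference, which is why the two kinds of "improvement" behave alike when the hypothesis under test is a zero difference.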
The nil-null hypothesis isn’t always false, or at least it may not always be false. The electron and positron may have exactly the same rest mass… under current physics models this is true of any particle and its antiparticle. For the proton and anti-proton this has been measured to about 15 parts per trillion. They are also thought to have exactly opposite charges. The electron and positron charge-to-mass ratios have been measured to be equal to about 1 part in 10^20. These tests are performed at great expense, and to such high precision, because any difference whatsoever would invalidate CPT symmetry (charge / parity / time-reversal), and that symmetry is part of the ‘standard model’ of particle physics.
The theory may be wrong — there might be a difference in these masses, or in the charge-to-mass ratios — but that’s exactly why there’s good reason to test it.
I agree 100%, the difference is that example is predicted by the theory being tested. This is not the case for 99.99…% of significance tests being performed. In those cases, the “theory” predicts some non-zero difference.
This is the point Meehl is making. Testing something besides *your theory* inverts the entire logic of science, which then inverts all the incentives. I call it bizarro science.
Anon:
Indeed, and the Popper/Lakatos perspective has always come so naturally to me that it took me a long time to understand how weird NHST is. I think that’s one reason I’d seen Meehl’s article so many times but not caught its main point: it took me a while to catch on to how backward the standard statistical reasoning was.
Back when I started thinking about posterior predictive checks in the late 1980s, I was already heavily influenced by Jaynes, and I took it for granted that the reason for doing things like chi-squared tests or whatever was to find out where my models were so flawed that there was clear room for improvement. This was what I had in mind when writing about posterior predictive checks in the 1990s. When people criticized this work by talking about the distribution of the p-value under the null hypothesis, I was like, Who cares about that? The goal is to improve the model.
One lesson for me in all of this is that if you want to communicate with people, sometimes you have to put in a lot of effort to understand their current way of thinking.
I understand how t-tests work, thanks.
This is great! But that sounds more like behaving as an engineer than as a statistician: most statisticians I meet insist on changing empirical constructs to better justify their model assumptions, not the other way around.
Vineet:
I’ve always said that statistics is a branch of engineering (it could be called “mathematical engineering” or “probability engineering”).