Some thoughts inspired by Lee Cronbach (1975), “Beyond the two disciplines of scientific psychology”

I happened to come across this article today. It’s hardly obscure—it has over 3000 citations, according to Google Scholar—but it was new to me.

It’s a wonderful article. You should read it right away.

OK, click on the above link and read the article.

Done? OK, then read on.

You know that saying, that every good idea in statistics was published fifty years earlier in psychometrics? That’s what’s happening here. Cronbach talks about the importance of interactions, the difficulty of estimating them from data, and the way in which researchers manage to find what they’re looking for, even in settings where the data are too weak to really show such patterns; he even talks about the piranha problem in the context of “Aptitude x Treatment interactions.”

In a world where researchers are babbling on about so-called moderators and mediators as if they know what they’re doing, Cronbach is a voice of sanity.

And this was all fifty years ago! All this sounds a lot like Meehl, and Meehl is great, but Cronbach adds value by giving lots of specific applied examples.

In the article, Cronbach makes a clear connection between interactions and the replication crisis arising from researcher degrees of freedom, a point that I rediscovered—40 years later—in my paper on the connection between varying treatment effects and the crisis of unreplicable research. Too bad I hadn’t been aware of this work earlier.

Hmmm . . . let me check the Simmons, Nelson, and Simonsohn (2011) article that introduced the world to the useful term “researcher degrees of freedom”: Do they cite Cronbach? No. Interesting that even psychology researchers were unaware of that important work in psychometrics. I’m not slamming Simmons et al.—I hadn’t known about Cronbach either!—I’m just noting that, even within psychology, his work was not so well known.

Going through the papers that referred to Cronbach (1975), I came across this book chapter from Denny Borsboom, Rogier A. Kievit, Daniel Cervone and S. Brian Hood, which begins:

Anybody who has some familiarity with the research literature in scientific psychology has probably thought, at one time or another, ‘Well, all these means and correlations are very interesting, but what do they have to do with me, as an individual person?’. The question, innocuous as it may seem, is a deep and complicated one. In contrast to the natural sciences, where researchers can safely assume that, say, all electrons are exchangeable save properties such as location and momentum, people differ from each other. . . .

The problem permeates virtually every subdiscipline of psychology, and in fact may be one of the reasons that progress in psychology has been limited.

They continue:

Given the magnitude of the problems involved in constructing person-specific theories and models, let alone in testing them, it is not surprising that scholars have sought to integrate inter-individual differences and intra-individual dynamics in a systematic way. . . .

The call for integration of research traditions dates back at least to Cronbach’s (1957) . . .:

Correlational psychology studies only variance among organisms; experimental psychology studies only variance among treatments. A united discipline will study both of these, but it will also be concerned with the otherwise neglected interactions between organismic and treatment variables . . .

Not much has changed in the basic divisions in scientific psychology since Cronbach (1957) wrote his presidential address. True, today we have mediation and moderation analyses, which attempt to integrate inter-individual differences and intra-individual process, and in addition are able to formulate random effects models that to some extent incorporate inter-individual differences in an experimental context; but by and large research designs are characterized by a primary focus on the effects of experimental manipulations or on the structure of associations of inter-individual differences, just as was the case in 1957. . . .

They continue:

In experimental research, the researcher typically hopes to demonstrate the existence of causal effects of experimental manipulations (which typically form the levels of the ‘independent variable’) on a set of properties which are treated as dependent on the manipulations (their levels form the ‘dependent variable’). . . .

One interesting and very general fact about experimental research is that such claims are never literally true. The literal reading of conclusions like Bargh et al., very prevalent among untrained readers of scientific work, is that all participants in the experimental condition were slower than all those in the control condition. But that, of course, is incorrect – otherwise there would be no need for the statistics. . . .

From a statistical perspective, it is commonplace to speak of an average treatment effect. But, when considered from the perspective of understanding human behavior, it’s a big deal that effects typically appear only in the aggregate and not on individuals.

The usual story we tell is that the average treatment effect (which we often simply call “the treatment effect”) is real—indeed, we often model it as constant across people and over time—and then we label deviations from this average as “noise.”

But I’ve increasingly come to the conclusion that we need to think of treatment effects as varying: thus, the difficulty in estimating treatment effects is not merely a problem of “finding a signal in noise” which can be solved by increasing our sample size; rather, it is a fundamental challenge.

To use rural analogies, when we’re doing social and behavioral science, we’re not looking for a needle in a haystack; rather, we’re trying to catch a slippery fish that keeps moving.
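To make the varying-effects point concrete, here is a minimal simulation sketch (in Python; the effect sizes are invented for illustration): even with a huge sample, the estimated average effect looks clean and stable, while the effect is negative for roughly a third of the individuals.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000                       # large sample: sampling "noise" is not the issue

# Hypothetical person-specific treatment effects: mean 0.2, sd 0.5 (made-up numbers)
tau_i = rng.normal(loc=0.2, scale=0.5, size=n)

# Potential outcomes under control and treatment, plus unit-level variation
y0 = rng.normal(0, 1, size=n)
y1 = y0 + tau_i

treated = rng.random(n) < 0.5                  # random assignment
y_obs = np.where(treated, y1, y0)

ate_hat = y_obs[treated].mean() - y_obs[~treated].mean()
print(f"estimated average effect: {ate_hat:.2f}")                           # close to 0.2
print(f"share of people with a negative effect: {(tau_i < 0).mean():.2f}")  # about 0.34
```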

All this is even harder in political science, economics, or sociology. An essential aspect of social science is that it understands people not in isolation but within groups. Thus, if psychology ultimately requires a different model for each person (or a model that accounts for differences between people), the social sciences require a different model for each configuration of people (or a model that accounts for dependence of outcomes on the configuration).

To put it another way, if any theory of psychology implies 7,700,000,000 theories (corresponding to the population of the world today, and for now ignoring models of people who are no longer alive), then political science, economics, etc. imply 2^7,700,000,000 – 1 theories (corresponding to all possible subsets of the population, excluding the empty set, for which no social science is necessary). That’s an extreme statement—obviously we work with much simpler theories that merely have implications for each individual or each subset of the population—but the point is that such theories are either explicit or implied in any model of social science that is intended to have general application.
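For the record, the count is just the number of nonempty subsets of the population: with $N$ people,

$$\sum_{k=1}^{N} \binom{N}{k} = 2^{N} - 1,$$

which for $N \approx 7{,}700{,}000{,}000$ gives the figure above.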

41 thoughts on “Some thoughts inspired by Lee Cronbach (1975), ‘Beyond the two disciplines of scientific psychology’”

  1. @andrew: “From a statistical perspective, it is commonplace to speak of an average treatment effect. But, when considered from the perspective of understanding human behavior, it’s a big deal that effects typically appear only in the aggregate and not on individuals.”

    I read about a study that displayed this in a way that is simple to grasp (sorry, no reference). The researchers took a number of very overweight men and got them to train up to the point of being able to run a marathon. They looked at weight loss. This would presumably be relevant to the question of whether exercise helps one to lose weight.

    They found that the average weight did decrease by a few pounds (only a few), and that some of the participants had gained weight.

    So did that intense amount of exercise result in weight loss? Hmm…

    I imagine that the outcomes were a matter of losing fat and gaining muscle – which one would prevail? Of course, that’s only me speculating. But if correct, then the question of losing weight was badly posed from the beginning. Better would have been to ask how intense aerobic exercise affects the amounts of fat and muscle over time.

    • Thanks! When I hit the paywall, I tried a search, and got another paywalled site that did have a “Check access” button — but that gave me a response that there was a problem in checking the access (which I wondered might have something to do with COVID shutdowns). But the link you gave works for me.

  2. Andrew –

    > In a world where researchers are babbling on about so-called moderators and mediators as if they know what they’re doing, Cronbach is a voice of sanity.

    Why the “so-called”….? I pick up a connotation of disdain. From my non-technical perspective, the consideration of moderators and mediators related to oft-speculated causation is much needed.

    • Joshua,

      I certainly can’t speak for Andrew, but, as a psychologist, I believe the disdain is justifiable due to the poor quality of most analyses that make use of mediators and moderators. Some problematic points:

      (1) Those analyses are usually used to disentangle hypothetical causal relations from observational (and usually cross-sectional) data. Given the complexity of such models, they usually rest on many untested and untestable assumptions, making their conclusions moot.

      (2) Mediation analysis of cross-sectional data rests on very strong assumptions (e.g., ergodicity) that are guaranteed to not hold for most complex psychological phenomena. Yet most authors offer no caveats when concluding they ‘found’ some mediation mechanism from this kind of data.

      (3) In many studies, those analyses are a big source of forking paths. Which variables are moderators, which are mediators, and which are ‘controls’ is usually very open to discussion (see the sketch after this list). Add selection based on statistical significance, small samples, and interactions galore, and you have a recipe for irreproducible research.

      (4) For many problems, the data can’t be used to decide between similar models because they have close fit, but very different theoretical consequences.

      (5) As is the case with Factor Analysis, moderation and mediation models have very interesting theory. But the usefulness of their theoretical properties is conditional on the model being true, or at least a believable approximation of the phenomenon at hand, and they can hardly be considered adequate for most problems they are applied to, mainly due to oversimplification.
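      To make point (3) concrete, here is a toy simulation, a sketch only (the sample size and the number of candidate moderators are made up): on pure-noise data, trying several candidate moderators and keeping the smallest interaction p-value yields “significant” moderation far more often than the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, k, n_sims = 200, 10, 1000       # hypothetical sample size, candidate moderators, simulations
false_pos = 0

for _ in range(n_sims):
    z = rng.integers(0, 2, n)      # treatment indicator
    y = rng.normal(size=n)         # outcome: pure noise, no real effects anywhere
    mods = rng.normal(size=(n, k)) # candidate moderators, also pure noise
    p_min = 1.0
    for j in range(k):
        x = mods[:, j]
        X = np.column_stack([np.ones(n), z, x, z * x])
        beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        df = n - X.shape[1]
        sigma2 = resid @ resid / df
        cov = sigma2 * np.linalg.inv(X.T @ X)
        t = beta[3] / np.sqrt(cov[3, 3])          # t-statistic for the interaction term
        p_min = min(p_min, 2 * stats.t.sf(abs(t), df))
    false_pos += p_min < 0.05

print(f"'significant' moderation found in {false_pos / n_sims:.0%} of null datasets")
# with 10 candidate moderators, roughly 40% of datasets, not the nominal 5%
```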

  3. Psychology students are trained in undergraduate classes that if your p-value is low enough, it means that you have adequately controlled for all those interactions, even if it is intractably difficult to figure out what they are. Lacking the imprimatur of being a statistics professor, I got schooled on this by a suddenly very angry psychology researcher.

    • Matt:

      I have the imprimatur of being a statistics professor, but I too got schooled on this by a suddenly very angry psychology researcher; see here. Dude was very angry! The email quoted in that linked post was the least of it.

  4. Andrew said,
    “But I’ve increasingly come to the conclusion that we need to think of treatment effects as varying: thus, the difficulty in estimating treatment effects is not merely a problem of “finding a signal in noise” which can be solved by increasing our sample size; rather, it is a fundamental challenge.”

    Yes, yes, yes! I recall that when I was first getting involved with statistics, a colleague introduced me to the word “fungible”, defined as (from a quick web search) “being something (such as money or a commodity) of such a nature that one part or quantity may be replaced by another equal part or quantity in paying a debt or settling an account”, with the example usages “Oil, wheat, and lumber are fungible commodities” and “fungible goods”. People are not fungible!

    • I think people are increasingly being funged.

      On the nonfungibility of Lumber: Forgive the sexism of the site title and the length of the article; you can confine your attention to the section on “structural lumber.” https://www.artofmanliness.com/articles/primer-on-lumber/

      On the nonfungibility of Wheat: we have hard red winter, hard red spring, soft red winter, soft white, hard white, and durum. See Shirley Corriher for a good discussion of their fungibility.

      Don’t know much about oil. Must all be fungible.

      Fungibility is ignorance, but I suppose our pal Funes the Memorious shows some ignorance is good, conditioned on our imperfect minds.

  5. Exactly.

    Everyone in social sciences and psychology should read Molenaar’s 2004 paper and think long and hard about it…

    A Manifesto on Psychology as Idiographic Science: Bringing the Person Back Into Scientific Psychology, This Time Forever

  6. Andrew – Tx for pointing to this 1975 article. It is stating a finding about generalisation. Somehow very few have dealt with “how to generalise”. As you mentioned implicitly in your recent talk, Pearl did that in toy-model examples. He uses the term “transportability” interchangeably. The clinical researchers I work with add a section on generalisability of findings using a table with verbal alternatives; see for example https://pubmed.ncbi.nlm.nih.gov/30270168/. The paper on this approach that I uploaded on SSRN got 250 downloads: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070. Somehow it is not getting traction in the stat community…

    • Ron:

      I’ve discussed the generalizability or transportability issue with Pearl on this blog in the past. I think that it makes sense to do generalizability or transportability using multilevel models and partial pooling, as for example here.
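      For readers who want to see the mechanics, here is a minimal sketch of partial pooling with made-up site-level estimates and a between-site standard deviation that is simply assumed known (a full multilevel model would estimate that variance component from the data):

```python
import numpy as np

# Hypothetical per-site effect estimates and their standard errors
effects = np.array([0.80, 0.10, -0.30, 0.50, 0.20])
se      = np.array([0.40, 0.15,  0.50, 0.30, 0.10])

tau = 0.25                                                # assumed between-site sd of true effects
mu = np.average(effects, weights=1 / (se**2 + tau**2))    # precision-weighted overall mean

# Partial pooling: each site estimate is pulled toward mu,
# more strongly when its own standard error is large relative to tau
w = tau**2 / (tau**2 + se**2)
pooled = w * effects + (1 - w) * mu

for raw, shrunk in zip(effects, pooled):
    print(f"raw {raw:+.2f} -> partially pooled {shrunk:+.2f}")
```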

      • Andrew – I corresponded extensively on this with Pearl and came to the conclusion that he is, at his core, a computer engineer with a keen interest in robust design. Statisticians have a different mission, for example helping clinical researchers design studies and analyse the resulting data. Your excellent paper is focused on meta-analysis using multilevel models and partial pooling. Generalisability, to me, is more abstract: it is about how a doctor can understand the outcomes of a paper as they apply to his or her patients. This is where the verbal approach with alternative representations kicks in. The nice features of an S-type error also apply here, but I guess you know that already…

  7. A lot of recent psychology research eminently deserves the bashing it generally receives here, but it’s nice to know you statistics people are making an effort to read up on old-school people like Lee Cronbach (of Cronbach’s alpha fame and author of some splendidly level-headed yet hard-nosed IQ controversy pieces that are still worth reading.)

  8. Lee Cronbach, Stanford University, and useful information about studying humans, huh? So you’re saying that Stanford isn’t all Zimbardos and Ioannidises?

  9. Psychology once had some very impressive thinkers. Cronbach and Meehl’s generation also included Loevinger, Rozeboom, Guttman, Donald Fiske, Donald Campbell, Messick, McGuire, and Sechrest, among others. Preceding them were Eysenck and Raymond Cattell; they had very problematic views about race and some of their empirical results are questionable, but their theoretical pieces are often very insightful. There are still some good thinkers in psychology (and adjacent fields like educational measurement & psychometrics) (e.g., van der Maas, Borsboom, Markus, Joel Michell, Peter Molenaar, Michael Kane), but they are relatively few and far between. Many people currently working in the field are probably capable of great insight, but their rewards for producing it are trivial compared to churning out the latest headline-grabbing finding.

  10. “I’m just noting that, even within psychology, his work was not so well known.” He was actually pretty famous back in the day. Cronbach’s Alpha gets 5M+ hits on Google. It was (is?) the standard (introductory) way of assessing the reliability of a multi-item scale (i.e., a set of questions measuring an abstract concept like intelligence, where the idea is to ask lots of similar questions and take the average response, in the hope that the noise cancels out).
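    Since Cronbach’s alpha comes up here, a minimal sketch of how it is computed may help; the formula is the standard one, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), and the simulated items below are made up.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: array of shape (n_respondents, k_items)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Simulated example: 6 items that all tap one underlying trait plus noise
rng = np.random.default_rng(0)
trait = rng.normal(size=(500, 1))
items = trait + rng.normal(scale=0.8, size=(500, 6))
print(f"alpha = {cronbach_alpha(items):.2f}")   # high, since the items share the trait
```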

  11. >> we’re trying to catch a slippery fish that keeps moving.

    Have you tried an HMM?

    So if I have a fish out of water, and I’m trying to model its flapping tail until it dies, I might start with an oscillating covariance function, and then add a slowly increasing linear trend on the period and a slowly decreasing linear trend on the amplitude of the signal. Wait, this is meant to be a joke. Can’t joke about this with anyone. No, not inebriated, only strange.

  12. Hmmm…Well, at the risk of displaying my ignorance:

    Seems to me like almost everything I read about “interactions” is hand-waving. Hand-waving them away when p < 0.05, and hand-waving them in to explain away p > 0.05 or replication failures.

    Cronbach (1975) isn’t beating that game. When he talks about various “aptitudes”, the things he’s referring to are so fuzzy it would be hard to find the dot in the middle. So is it the interaction that’s causing the variation, or just the fact that psychology can’t find fundamental factors to measure? I mean, looking at his table 1, just the idea that an instructor could, in each of four different courses, give a consistent “press” for independence or conformity over the course of a semester? Really? Um, having plenty of experience on both sides of the lectern, I’d say there is a good chance the instructor’s personality, whatever that is, will drown out just about everything else, try as s/he might to do otherwise. And the idea that we can have a sharp distinction between “independent” and “conformist” learners is, well, bunk.

    So overall I’m guessing that most of the variation comes not so much from a hall of mirrors about interactions related to the brand of sausage people eat for breakfast, but from the inability to find a parameter that can be precisely measured.

    • The biggest fallacy in most psycho-social measurement in my experience is theorizing about fine distinctions between VERY closely related constructs and assuming that by clever wording of your items you’ll be able to induce a research subject to respond to one or the other of those hair-splitting differences when responding to a certain item.

      I can’t tell you how many times I’ve seen a battery of perhaps ten items that to a lay person (i.e. a research subject or a statistician who is not a psychologist) seem to all be asking the same questions with slight variations of wording. Yet to the psychologists who designed the instrument, four of the items are about one construct while the other six are about what he or she (the psychologist) imagines to be a distinctly different construct.

      Then of course you do a little measurement study and show there is practically no discriminating power between the two “measures” and they’re all just getting at a single general construct. That doesn’t stop people from adopting the instrument and forging forward for the next decade blithely asserting that they’ve measured two different things. Their models never quite seem to fit, though.

      • Yeah,

        I was going to write earlier that psychology hasn’t yet distinguished rocks from minerals. Minerals are the appropriate physical unit of earth materials. They have predictable chemistry and physical properties. Rocks are just aggregations of minerals. The chemical and physical properties of minerals vary over a tiny range. For all practical purposes, minerals are discrete. OTOH, the chemical and physical properties of rocks vary continuously in every dimension. The boundaries between different rock types are gradational.

        Sometimes it feels like psychologists are matching a granite boulder to a basalt boulder because they’re both round. Other times it feels like they’re trying to tease out some critical relationship based on the fact that two different cobbles of granite have slightly different colors. Then they construct a “hall of mirrors” of interactions because the roundness and color and size of boulders all interact, when the reality is that these features have little or nothing to do with the formation of the rock, which is why their “interaction” is apparently meaningless.

      • Brent said,
        “The biggest fallacy in most psycho-social measurement in my experience is theorizing about fine distinctions between VERY closely related constructs and assuming that by clever wording of your items you’ll be able to induce a research subject to respond to one or the other of those hair-splitting differences when responding to a certain item. …”

        Definitely a good candidate for biggest fallacy … (But social scientists never cease to amaze me with their “creativity”, so …)

      • I had a conversation recently with someone I knew long ago and I was astounded at the things that person believes. Then I had to reflect on my own willingness to dive into any topic and mouth off about it, believing I’m solving some problem, rather than just jousting windmills. Now I’m all paranoid about going Dunning-Kruger.

    • >>> from the inability to find a parameter that can be precisely measured.

      Bingo. The paper mentions something about what motivates lower-income and higher-income students. I think it’s naive to assume that there’s one or a couple of ways that motivate students. People are dynamic. Sources of motivation could change daily for a single person: could have read a good article, could be fear, poverty, anxiety, societal pressure. That’s what I was implying by the HMM comment. But if you fit a simple model, be it hierarchical or whatever, it’s going to be a drastic oversimplification of what’s actually happening, leading to naive conclusions. For example, in the article, one of the conclusions for low-income students is, and I paraphrase as accurately as possible without re-reading the article, ‘…low income students are motivated by being told what to do.’ Could be correlated with low income but confounded with the fact that the student pathologically lacks the social intuition or experience of the upper class… anyway.

      Point being, it’s not something that you can fit a simple model with and expect it to generalize.

  13. Bob Abelson (my dissertation adviser) had us read Cronbach and Meehl and other configural/interaction psychologists mentioned above and I studied with Louis Guttman for several years. So my bias is toward interactions in psychological modeling and I am skeptical of main-effects models in many cases.

    But Abelson taught us one exception, probably due to his adviser (John Tukey). Namely, fanfold interactions that disappear after log transformation are suspect. (This concern arose, I’m guessing, in the context of conjoint measurement.) So to be cautious, I still look for crossover interactions before making a strong case for them. Is that too restrictive?

    • Not sure what you mean by “fanfold interactions”. But one relevant point is that lognormal distributions are common in nature, so we need to be alert to them and treat them appropriately rather than assuming that they are something else (such as normal). See the three handouts “Logarithms and Means”, “Lognormal Distributions 1”, and “Lognormal 2”
      at https://web.ma.utexas.edu/users/mks/ProbStatGradTeach/ProbStatGradTeachHome.html for more discussion.

      • See the second figure here: https://en.wikipedia.org/wiki/Interaction_(statistics)

        The response surface for this type of interaction is nonlinear, but it is not a saddle. This has nothing to do with lognormal distributions. Other nonlinear transformations can make “significant” fanfold interactions become “nonsignificant.” Crossover interactions (a saddle response function) cannot be attenuated by the same class of transformations.

        • Thanks for the example/explanation of fanfold interaction.

          My comment about lognormal distributions was not addressed to fanfold interactions (I now realize I should have made this explicit) — but was prompted by your reference to log transformations; my point (which I now see I should have explained explicitly) was that log transformations are often warranted, depending on context, but not always appropriate. In particular, I was wondering what heuristic reasoning (in the context of a particular study) might prompt use of a log transformation in the type of situation you were discussing.

  14. Thanks for the clarification, Martha. Your point about lognormal distributions being common in nature makes sense to me. The Tukey/BoxCox idea to consider transforming before modeling was prevalent before generalized linear modeling made the error distribution part of the model for a wide class of distributions. In either case, it seems that some researchers do not think enough about nonlinearity when analyzing their data. Robyn Dawes, Peter Bentler and other psychology methodologists claimed in the 1970’s that human decision making could be represented by simple linear models because the R^2 values for these models were high. They used this argument against Cronbach and Meehl, saying that adding interactions or nonlinear terms to a linear model was overfitting. Using R^2 to make such arguments is a short circuit, however. You can fit almost anything fairly well with a linear model using R^2 as loss, but that doesn’t necessarily lead to understanding. Your point that we need to “treat them appropriately rather than assuming that they are something else (such as normal)” is well taken.
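    Returning to the fanfold vs. crossover distinction above, a minimal numeric sketch (the 2x2 cell means are invented): a multiplicative “fanfold” pattern has a nonzero interaction contrast on the raw scale that vanishes after taking logs, while a crossover pattern remains an interaction under any monotone transformation.

```python
import numpy as np

def interaction_contrast(cells):
    """2x2 cell means: interaction = (a2b2 - a2b1) - (a1b2 - a1b1)."""
    (a1b1, a1b2), (a2b1, a2b2) = cells
    return (a2b2 - a2b1) - (a1b2 - a1b1)

# Fanfold: effects multiply (factor B doubles the response at every level of A)
fanfold = np.array([[1.0, 2.0],
                    [4.0, 8.0]])

# Crossover: the sign of B's effect flips across levels of A
crossover = np.array([[1.0, 2.0],
                      [2.0, 1.0]])

for name, cells in [("fanfold", fanfold), ("crossover", crossover)]:
    raw = interaction_contrast(cells)
    logged = interaction_contrast(np.log(cells))
    print(f"{name:9s} raw interaction = {raw:+.2f}, log-scale interaction = {logged:+.2f}")
# fanfold:   raw +3.00, log  0.00  -> the interaction can be transformed away
# crossover: raw -2.00, log -1.39  -> still an interaction; a sign flip cannot be removed
```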
