Taking theory more seriously in psychological science

This is Jessica. Last week I virtually attended a workshop on What makes a good theory?, organized by Iris van Rooij, Berna Devezer, Josh Skewes, Sashank Varma, and Todd Wareham. Broadly, the premise is that what it means to do good theorizing in fields like psych has been neglected, including in reform conversations, compared to various ways of improving inference. The workshop brought together researchers interested in the role of theory from different fields (cogsci, philosophy, linguistics, CS, biology, etc.). Some background reading on the premise and how use of theory can be improved in psych and cogsci can be found here, here, and here. You can watch all the workshop keynotes here, and there will be a special issue of Computational Brain and Behavior including topics from some of these discussions in 2023.

The workshop format provided a lot of time for participants to discuss various aspects of theory construction and evaluation, plus daily plenary sessions where results of the ongoing discussions were shared with everyone. I was struck by how little friction there seemed to be in these discussions, despite the fact that many participants were coming with strong views from their own theoretical perspectives. It made me realize how I’ve learned to associate serious discussion of theory and cognitive modeling with combative arguments, and, as someone who works in a very applied field, how rarely I get to have meta-level conversations about what it means to take theory seriously. 

I’ll summarize a few bits here, which should be interpreted as a very partial view of what was discussed.

One theme was that it’s hard to define exactly what theory is. What counts as theory varies considerably across, and even within, fields. The slipperiness of trying to define theory in some universal way doesn’t preclude reasoning formally about it, but it complicates things in that one must define, in a generalizable way, both the language in which a theory is described and the nature of the thing being described, e.g., functions and algorithms expressed in languages that map to string descriptions. See here for an example. Marr’s book came up repeatedly in connection with motivating different levels of analysis in theories of cognition: the computational or functional claim about some aspect of cognition, the algorithmic or representational description, and the implementation.

Another takeaway for me was that theory is closely related to explanation, and is pretty certainly underdetermined by evidence. Again, see here for an example. I had to miss it but Martin Modrák led a session on identifiability as a theoretical virtue, a topic Manjari Narayan proposed in her keynote.

A slightly more specific view, proposed by Frank Jäkel, was that theory is talking about classes of models and how they relate. Similarly Artem Kaznatcheev proposed in his keynote that theory interconnects models and distributes modeling over a community. A view from math psych is that theorizing involves critically evaluating how to use formal models and experimental work to answer specific questions about natural phenomena. Another description of theory was as “candidate factive description,” as compared to models, which are tools with stricter representational and idealized elements. But the view that there is no bright line between theory and models was also proposed. Someone else mentioned that a prerequisite of any model is that it’s somehow embedded in theory.

All of this makes me think that the sort of colloquial theories that get described in motivating papers or in interpreting evidence always exist somewhere along the boundaries of models, aiming to capture what isn’t fully understood, often in terms of causal explanations that are functional but not really probabilistic. I also found myself thinking a lot about the tension between probabilistic evidence and possibilistic explanations, where the latter often describe why something might be the way it is, but have nothing to say about how likely it is. For example, in early stages of exploratory data analysis, one might start to generate hypotheses for why certain patterns exist, but these conjectures typically won’t encode information about prevalence, just possible mechanism. This can make colloquially-stated theory misleading, perhaps, where it describes some seemingly “intuitive” account that might be consistent with or inspired by some evidence but not representative of more common processes.

Related to this, one dimension that came up several times was the relationship between theory and evidence. Sarahanne Field proposed a session exploring the extent to which bad data or other inputs to science can still be used to derive theory, and Marieke Woensdregt proposed a session on how theory is affected by sparse evidence, which is sometimes all that is available. Berna Devezer suggested Hasok Chang’s Inventing Temperature for a good example of bad evidence producing useful theory.

There’s a question about what it means to use a theory that’s not true, versus a model that’s not true. I found myself thinking about how theory-forward work can put one in a weird situation where you have to take the theory seriously even if you don’t think it’s right. My own experiences with applying Bayesian models of cognition fall in this category. Taking a theory seriously helps guarantee your attention is going to be there to see how it fails, i.e., trying to resolve ambiguities through data helps you understand what exactly is wrong in the assumptions you’re making. It can also be a generative tool that gives you a strange lens, as discussed in Sashank Varma’s talk, that can lead you to knowledge you might not have otherwise discovered. But taking a theory very seriously can also be bad for learning, if it biases evidence collection too much for example. Theories structure the way we see the world. Embracing an idea while simultaneously being aware of the various ways in which it seems dismissable is a very familiar experience to me in my more theoretically-motivated modeling work, and can be hard to explain to those who have never tried to work in these spaces.

Somewhat related, I often think about theory as having many different functions. Manjari Narayan commented on the various functions of statistical theories, including sometimes being used to make true statements, albeit of limited scope, versus sometimes being applied to make normative statements about how to do things without a truth commitment. 

One challenge in advocating for better theory in empirical fields, related to questions proposed by Devezer with help from Field, is that often what it looks like to do theory is left implicit. It’s not necessarily the kind of thing that’s taught the way modeling is. This raises the question of what it really means to approach a problem theoretically. The idea that what good theory is can be dependent on the domain or specific problem space came up a fair amount as well, with Dan Levenstein proposing a few sessions devoted to this. 

Tools for CS theory, especially complexity theory, came up multiple times, including in a discussion proposed by Jon Rawski and led with Artem Kaznatcheev. I like the idea of analyzing the tractability of searching for theory under different assumptions like truth commitments, since it helps illustrate why we have to be wary of placing too much trust in theories that we’ve never attempted to formalize. 

Olivia Guest gave a keynote that got me thinking about theory as an interface, which should be user-friendly in various ways. Patricia Rich’s keynote also got at this a bit, including the value of a theory that can be expressed visually. This reminded me of things I’ve read on the idea of cognitive and social values, in addition to epistemic values, in science, e.g., in work by Heather Douglas. 

Guest’s keynote also suggested that theory should be inclusive, in the sense of not only being accessible to or subscribed to by select parts of a community, and Rich’s touched on the importance of community in making theory and on how well a theory diffuses power across a field.

What it means for something to be a black box came up a few times, including that the meaning of a black box in psych has changed, from referring to something like behaviorism to becoming a bracketing device. There was a session about ML and/as/in science proposed by Mel Andrews, which I had to miss due to timing, but it seemed related to things I’ve been thinking about lately. For instance there’s a question of how the kinds of claims generated in ML research should be taken as scientific statements subject to the usual criteria versus engineering descriptions, and whether it’s fair to think about using classes of modern ML e.g., deep learning as a type of atheoretical refuge, where the techniques don’t need to be fully understood or explained. Relatedly, I wonder about the value of critiquing the absence of theory in ML, and what it means to strive toward more rigorous theory in fields focused on building tools, where performance in target applications can be measured fairly directly, as opposed to fields geared toward description and explanation. Does having a powerful statistical learning method that lets you take any new evaluative criterion and train directly to optimize for it make more explicit theoretical frameworks, for instance describing data generation, more of a nice-to-have than a necessity?

There’s also the question of how theory evolves. I was reading Gelman and Shalizi at the same time as the workshop, and thinking about whether there’s an analogous trap to trying to quantify support for competing models in stats when it comes to how we think about theory progressing, where model expansion and checking might be better aims than direct comparison between competing theories. Somewhat relatedly, Artem Kaznatcheev’s keynote talked about how the mark of a field where theory can flourish is deep engagement with prior work, including extending it, synthesizing and unifying, and constructively challenging it.

After all this discussion, I feel less comfortable labeling certain types of work as atheoretical or blindly empirical, which were terms I used to use casually. Theory is everywhere, sometimes only implicitly, which is why we should be talking about it more.

16 thoughts on “Taking theory more seriously in psychological science”

  1. The first step is to teach the students calculus and programming (with a focus on computational modelling). These are the basic skills required to quantitatively describe dynamic systems. They don’t need to know every trick and optimization, just enough to write prototype models.

    Without that, you will get only a small minority doing this kind of modeling, whose work the majority cannot understand or critique.

  2. Jessica:

    Along with all this, there’s a social or political angle, which is that sometimes we’re supposed to show respect to a theory that we might think is pretty bad. This has to do with an asymmetry in science where it can be ok to push a strong theory but it is considered rude to dismiss it. This asymmetry may be a good thing (let a thousand flowers bloom and all that); in any case, I think it can complicate the discussions we have about particular theories.

    For instance, to take an example I’ve mentioned many times, I’ve criticized the work of that sociologist who claims that beautiful parents have more daughters, engineers have more sons, etc. He has essentially zero data to support these claims (yes, he has lots of published papers full of statistically significant p-values, but it’s noise mining all the way through), but I can’t say that his theories (gender essentialism of the schoolyard-evolutionary-biology variety) are wrong, just that he has no evidence for them. On the other side, he’s (mistakenly) claiming he has evidence for these theories, and he’s also saying they’re correct.

    That’s the asymmetry. On one side, you have someone loudly making the (unsupported) claim that beautiful parents have more daughters; on the other side you have people saying there’s no evidence. It’s natural for outsiders to split the difference and say that the theory’s kind of ok. To really fight this sort of thing, you’d need someone out there, balancing the scale, saying that beautiful parents are more likely to have sons. But that would be nuts, indeed as nuts as the claim that beautiful parents have more daughters.

    So here’s the question: how do you get to this midpoint representing true uncertainty? Do you need fanatics on both sides to balance each other out, so that reasonable people like us can just say that this is a silly question for which there’s no useful data or theory?

    At the theoretical level, if we want to come to the conclusion that a particular theory is worthless except as a thought experiment, do we need some true believer to present the opposite theory?

    I don’t know the answer to these questions. My point here is that these consensuses occur in a social environment, and one difficulty is that this environment includes people who are overconfident and don’t know what they’re doing, along with some flat-out bad actors and lots of people who are basically trend-followers.

    • Andrew said, “So here’s the question: how do you get to this midpoint representing true uncertainty? Do you need fanatics on both sides to balance each other out, so that reasonable people like us can just say that this is a silly question for which there’s no useful data or theory?”

      This doesn’t make sense to me — in particular, “how do you get to this midpoint representing true uncertainty?” To me, there are degrees of uncertainty; “true uncertainty” seems to be off in a fantasy world.

      • But if theories are typically underermined by evidence, then from our non-omniscient perspective there is unresolvable uncertainty.

        A theory is pretty much by definition underdetermined by evidence (I think your “underermined” is meant to be “underdetermined”). That’s sort of the point of a theory. It provides an explanation of evidence as it stands, but the most important thing is that it provides a context for further exploration.

        There may well be unresolvable uncertainty at the time a theory or a set of competing theories are initially considered but the theories (if they’re useful) should provide a context for resolving uncertainty through further analysis.

        That’s why I find the “beautiful parents have more daughters”/“engineers have more sons” example an odd one. I don’t see anything wrong with that theory (if it is a theory) since it’s testable. So rather than the odd notion of “fanatics on both sides” or some sort of “balance” associated with someone presenting “the opposite theory” why not just test it? It must be possible to design a study that is ultimately based around notions of beauty (which is obviously the tough part of the experimental design!). But once you resolve how to apportion your parental sets (“these parents are beautiful, these are less so” etc.), then count the numbers/proportions of daughters from parents in the different sets (a toy version of this comparison is sketched at the end of this comment). You might get your colleagues abroad to do similar analyses to look at societal contexts associated with assessment of “beauty”, and perhaps get them to judge your sets of parents to see whether they consider your “beautiful” parents beautiful from their perspective, and so on.

        It comes down to how important you consider the matter to be and whether you wish to resolve uncertainty or whether you actually like it (since it gives you something to argue about!).

        I’m coming at this from a more biological science perspective, but in these fields a set of data with an interpretation (a theory – e.g. beautiful parents have more daughters) is never the end of the story. Interpretations and theories provide contexts for testing the implications of the theories (that’s pretty much how replication gets done in the hard(er) sciences). Of course it may be that the interpretations/theories aren’t particularly interesting or aren’t in subjects considered to be important, and so no-one bothers to test them or explore their implications.

        Ultimately, good theories are great and we can be comfortable embracing uncertainty.
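
        P.S. For what it’s worth, the counting step mentioned above is trivial once the hard “who counts as beautiful” part is settled. Here is a toy sketch with made-up counts (the numbers and the hand-rolled two-proportion z-test are purely illustrative, not from any real study):

          # Toy sketch: compare the proportion of daughters between parent sets.
          # The counts below are invented purely to illustrate the calculation.
          import numpy as np
          from scipy import stats

          beautiful = np.array([520, 480])        # (daughters, sons) among "beautiful" parents
          less_beautiful = np.array([500, 500])   # (daughters, sons) among the rest

          p1 = beautiful[0] / beautiful.sum()             # 0.52 daughters
          p2 = less_beautiful[0] / less_beautiful.sum()   # 0.50 daughters
          n1, n2 = beautiful.sum(), less_beautiful.sum()

          # Two-proportion z-test (normal approximation).
          p_pool = (beautiful[0] + less_beautiful[0]) / (n1 + n2)
          se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
          z = (p1 - p2) / se
          p_value = 2 * stats.norm.sf(abs(z))
          print(p1, p2, p_value)  # roughly 0.52, 0.50, 0.37: far from distinguishable from chance

        Which also hints at why the question is hard to settle empirically: any plausible effect on sex ratios is tiny, so a convincing test would need very large and carefully measured samples.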

        • “So rather than the odd notion of “fanatics on both sides” or some sort of “balance” associated with someone presenting “the opposite theory” why not just test it?”

          You can provide some “beautiful” people to test this “theory”? :)

          My immediate reaction to this “theory” is that it’s flat-out bunk: there is no fixed definition of “beauty”. If we have 1000 photos of random Americans and 20 judges choosing 20 “beautiful” people each, how many people are in the overlapping set of “beautiful” people? Five?

          Not to mention:

          a) even on one person’s scale, beauty isn’t a 1/0 assignment, it’s an infinitely divisible scale.

          b) the same person can look attractive in one photo and less attractive in another

          c) clothes, make-up and hairstyle can change people’s opinion of whether someone is “beautiful” or not.

          This is a great example of the most common research flaw in social sciences – skipping over the decades of work required to develop a method to accurately measure the feature of interest, and instead just pulling some BS out of their a*** as a substitute so they can contrive an NHST test and then pontificate without constraint on the outcome.

        • Thanks for the thoughts, yes, it was supposed to say underdetermined. 

          I agree, in the beautiful parents case the underdetermination by evidence can be resolved if the theory is made testable, like by defining ‘beautiful’ and ‘more’ in ways we can apply to data.

          In other cases though (e.g., where we’re trying to theorize about a not-directly-observable process that produces some observed patterns) I think being underdetermined by evidence can mean not just that the theory is underdetermined by evidence at that particular time, like the resolvable uncertainty concerning a claim about a correlation like beautiful parents have more daughters, but instead that finding the right explanation for the observed phenomena is an intractable search problem independent of the typical sources of uncertainty in drawing inferences from data. One of the papers linked above (How hard is cognitive science?) suggests something like this. But this kind of theory about theory is something I’m still becoming familiar with.

        I agree with you, chipmunk, it may well be bunk. I didn’t know about this “beautiful parents/more daughters” stuff but have just had a look, and it seems like loads of people including Andrew Gelman have published critiques pointing out fundamental flaws in the author’s analyses.

          In a follow-up paper the author used a British cohort to come to the same conclusion (“beautiful parents have more daughters”!). The parents’ “attractiveness” was based on the subjective impression of teachers when the “parents” were 7 years old (in one week in 1958), which was then used to assess the parents’ child sex ratios 40 years or so later. Not surprisingly, most of them (~84%) were considered attractive (it can’t be easy, even in 1958, to describe a 7-year-old child as unattractive!).

          One can think of so many potential pitfalls in this (British cohort) study that it’s difficult not to conclude that the data set is woefully underpowered to support a meaningful conclusion.

        Here’s a classic visual of the under-determination of theory by data used in philosophy of science classes that I think might capture the point Jessica is making when she says, “there is unresolvable uncertainty.”

          https://images.app.goo.gl/JhKzEhavD8sMpQuZ9

          Jessica again: “I think being underdetermined by evidence can mean not just that the theory is underdetermined by evidence at that particular time, like the resolvable uncertainty concerning a claim about a correlation like beautiful parents have more daughters, but instead that finding the right explanation for the observed phenomena is an intractable search problem independent of the typical sources of uncertainty in drawing inferences from data.”

    • Good example, captures what I was pondering about some theories being attractive to believe despite a lack of evidence. In terms of trying to balance things, I like the idea you’ve brought up about how with some experiment results we can think about what we would theorize if they had been the opposite of what they were, and how the explanation we come up with for that can seem equally reasonable, like the idea that priming college students with songs about being old makes them walk slower because they imagine being older versus makes them walk faster because it reminds them that they’re young.

        The clinical researcher David Sackett used to present the results of a study as the reverse of what they actually were. Once the discussion had settled on explanations for why the results came out that way, he would apologize for getting the effect reversed and then time how long it took for explanations of the true result to be offered. Very quickly.

        But all in all, good theory needs to lead to good economy of research, hastening getting less wrong about how the world is (Peirce, C. S. (1879). Note on the Theory of the Economy of Research).

        Some of the keynotes hit on that nicely – e.g. Artem Kaznatcheev.

  3. Psych programs focus a lot on what to do with a theory (i.e., design a study to “test” it) and what to do with the resulting evidence (i.e., analysis), but less on theory construction. I guess the hope, like Andrew said, is that if a thousand flowers are planted some will bloom, but given the issues highlighted on this blog (and elsewhere) in evaluating evidence (and poor designs), are the better theories going to be the ones that flower? Perhaps more focus should be on constructing better (but fewer) theories in the first place (e.g., here is a textbook designed for this purpose: Jaccard & Jacoby [2020], Theory Construction and Model-Building Skills: A Practical Guide for Social Scientists, ISBN-13: 978-1462542437).

  4. An interesting recent book on theory in social sciences is “Theory and Credibility: Integrating Theoretical and Empirical Social Science” by Scott Ashworth, Christopher Berry, and Ethan Bueno de Mesquita. We read much of it for a recent seminar.

    The focus is on formal theory and “credibility revolution” style empirical work, mainly with polisci examples. It would be interesting to read comments by you or by Andrew on this.

  5. “For instance there’s a question of how the kinds of claims generated in ML research should be taken as scientific statements subject to the usual criteria versus engineering descriptions, and whether it’s fair to think about using classes of modern ML e.g., deep learning as a type of atheoretical refuge, where the techniques don’t need to be fully understood or explained.”

    Yes, this is fascinating, particularly so coming from a field that is so different from mine. My feeling is that assessing results and claims from machine learning (ML) is not about considering whether they are right or not but whether they are useful. The results and claims should also be considered very much provisional until validated in some way.

    Two examples from molecular biology. An ML method (AlphaFold) claims to be able to make accurate predictions of protein structures and has, for example, provided a predicted structure for every protein encoded in the human genome. It’s perfectly reasonable IMO to greet that with a “so what”, since the evidence that would normally allow one to assess the claims isn’t accessible (much in the same way as in your nicely thought-out text above). However, AlphaFold has been shown to be useful for structural biologists struggling to determine their particular protein structure experimentally. The structures predicted by AlphaFold have repeatedly been useful in providing “targets” that allow technical difficulties involved with experimental structure determination to be overcome, and in those cases the predictions have been shown to be broadly “correct”. So AlphaFold is useful, but I would still say that it can’t be considered to provide a “correct” structure in any particular case; it may well provide a useful one.

    A second example comes from the massive database of human gene sequences associated with the Human Genome Project, where a vast number of variants in protein-coding genes have been identified (e.g., many hundreds of variants in a single protein within the human population, where the protein sequence has, for example, a different amino acid at a particular position); clinicians would like to be able to advise patients who may harbour a particular variant whether it is likely to be pathogenic or benign. An ML method claims to predict pathogenic variants with 90% reliability. So far this doesn’t seem particularly useful IMO, never mind the fact that 90% reliability is not great when considering clinical decisions (a rough base-rate sketch at the end of this comment illustrates why). So I think the jury is out on this one, usefulness-wise.

    Anyway, I thought it might be useful to illustrate that the considerations you raise around ML are very widespread!
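
    P.S. To make the clinical-decision point concrete, here is a minimal back-of-the-envelope sketch. The numbers are hypothetical: it assumes “90% reliability” means 90% sensitivity and 90% specificity, and that only a small fraction of the variants a clinician actually queries are pathogenic. Under those assumptions, Bayes’ rule gives the chance that a variant flagged as pathogenic really is pathogenic:

      # Hypothetical illustration: positive predictive value of a "90% reliable"
      # variant classifier when pathogenic variants are rare among those queried.
      def ppv(sensitivity, specificity, prevalence):
          """P(truly pathogenic | flagged pathogenic), by Bayes' rule."""
          true_pos = sensitivity * prevalence
          false_pos = (1 - specificity) * (1 - prevalence)
          return true_pos / (true_pos + false_pos)

      # Assumed: 90% sensitivity, 90% specificity, 5% of queried variants pathogenic.
      print(round(ppv(0.90, 0.90, 0.05), 2))  # ~0.32: most "pathogenic" calls are false alarms

    So even a genuinely “90% reliable” classifier could leave a clinician with mostly false alarms; how useful it is depends heavily on the base rate it is deployed against.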

  6. I don’t think we want a situation where essentially atheoretical noise mining work gets only slightly revised to say “this theory suggests there should be some effect here” and then the noise mining continues unabated.

    A theory should be useful enough to pre-register certain predictions, and the decision to undertake the work should be based on whether those predictions make sense under that theory and on whether the proposed data collection is appropriately powered to detect them (including under weaker versions of the theory, and recognizing the piranha problem); a simulation sketch of that power check follows below. Only then should the data collection begin, after which some of the predictions will bear out and others not, and we can accept the resulting paper either way as a check on that theory.
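
    As a minimal illustration of the “appropriately powered” step (the effect sizes, sample size, and test here are hypothetical, not tied to any particular theory or study), one can simulate the planned two-group comparison under the effect the theory predicts and under a weaker version of it:

      # Simulation-based power check for a pre-registered two-group prediction.
      # Effect sizes and sample size below are made up, for illustration only.
      import numpy as np
      from scipy import stats

      def power(effect_size, n_per_group, alpha=0.05, n_sims=5000, seed=1):
          """Fraction of simulated experiments in which a two-sample t-test rejects."""
          rng = np.random.default_rng(seed)
          hits = 0
          for _ in range(n_sims):
              control = rng.normal(0.0, 1.0, n_per_group)
              treatment = rng.normal(effect_size, 1.0, n_per_group)
              if stats.ttest_ind(control, treatment).pvalue < alpha:
                  hits += 1
          return hits / n_sims

      print(power(0.5, 64))  # the theory's full predicted effect: roughly 0.80
      print(power(0.2, 64))  # a weaker version of the theory: roughly 0.20

    If a weaker-but-plausible version of the theory would only be detected around 20% of the time at the planned sample size, that seems worth knowing (and stating in the pre-registration) before any data are collected.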
