“The Generalizability Crisis” in the human sciences

In an article called The Generalizability Crisis, Tal Yarkoni writes:

Most theories and hypotheses in psychology are verbal in nature, yet their evaluation overwhelmingly relies on inferential statistical procedures. The validity of the move from qualitative to quantitative analysis depends on the verbal and statistical expressions of a hypothesis being closely aligned—that is, that the two must refer to roughly the same set of hypothetical observations. Here I argue that most inferential statistical tests in psychology fail to meet this basic condition. I demonstrate how foundational assumptions of the “random effects” model used pervasively in psychology impose far stronger constraints on the generalizability of results than most researchers appreciate. Ignoring these constraints dramatically inflates false positive rates and routinely leads researchers to draw sweeping verbal generalizations that lack any meaningful connection to the statistical quantities they are putatively based on. I argue that failure to consider generalizability from a statistical perspective lies at the root of many of psychology’s ongoing problems (e.g., the replication crisis), and conclude with a discussion of several potential avenues for improvement.

I pretty much agree 100% with everything he writes in this article. These are issues we’ve been talking about for a while, and Yarkoni offers a clear and coherent perspective. I only have two comments, and these are more a matter of emphasis than anything else.

1. Near the beginning of the article, Yarkoni writes of two ways of drawing scientific conclusions from statistical evidence:

The “fast” approach is liberal and incautious; it makes the default assumption that every observation can be safely generalized to other similar-seeming situations until such time as those generalizations are contradicted by new evidence. . . .

The “slow” approach is conservative, and adheres to the opposite default: an observed relationship is assumed to hold only in situations identical, or very similar to, the one in which it has already been observed. . . .

Yarkoni goes on to say that in modern psychology, it is standard to use the fast approach, that the fast approach gets attention and rewards, but that in general the fast approach is wrong, that instead we should be using the fast approach to generate conjectures but use the slow approach when trying to understand what we know.

I agree, and I also agree with Yarkoni’s technical argument that the slow approach corresponds to a multilevel model in which there are varying intercepts and slopes corresponding to experimental conditions, populations, etc. That is, if we are fitting the model y = a + b*x + error to data (x_i, y_i), i=1,…,n, we should think of this entire experiment as study j, with the model y = a_j + b_j*x + error, and different a_j, b_j for each potential study. To put it another way, a_j and b_j can be considered as functions of the experimental conditions and the mix of people in the experiment.

Or, to put it another way, we have an implicit multilevel model with predictors x at the individual level and other predictors at the group level that are implicit in the model for a, b. And we should be thinking about this multilevel model even when we only have data from a single experiment.
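
To spell out that implicit model in a small simulation (a sketch with made-up numbers; the study-level predictor u and all coefficient values below are invented for illustration, not taken from Yarkoni’s paper): each study j gets its own intercept a_j and slope b_j, and those in turn depend on study-level conditions.

    import numpy as np

    rng = np.random.default_rng(0)
    n_studies, n_per_study = 8, 50

    # hypothetical study-level predictor (a feature of the experimental
    # conditions or of the mix of people in each study)
    u = rng.normal(size=n_studies)

    # study-specific intercepts and slopes: functions of u plus unexplained variation
    a = 1.0 + 0.5 * u + rng.normal(0, 0.3, size=n_studies)
    b = 0.2 + 0.4 * u + rng.normal(0, 0.3, size=n_studies)

    for j in range(n_studies):
        # individual-level data within study j: y = a_j + b_j * x + error
        x = rng.normal(size=n_per_study)
        y = a[j] + b[j] * x + rng.normal(0, 1.0, size=n_per_study)
        # a regression fit to study j alone estimates (a_j, b_j) for that study,
        # not the population-average intercept and slope across studies
        slope_hat = np.polyfit(x, y, 1)[0]
        print(f"study {j}: true slope {b[j]:.2f}, estimated slope {slope_hat:.2f}")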

This is all related to the argument I’ve been making for a while about “transportability” in inference, which in turn is related to an argument that Rubin and others have been making for decades about thinking of meta-analysis in terms of response surfaces.

To put it another way, all replications are conceptual replications.

So, yeah, these ideas have been around for a while. On the other hand, as Yarkoni notes, standard practice is to not think about these issues at all and to just make absurdly general claims from absurdly specific experiments. Sometimes it seems that the only thing that makes researchers aware of the “slow” approach is when someone fails to replicate one of their studies, at which point the authors suddenly remember all the conditions on generality that they somehow forgot to mention in their originally published work. (See here for an extreme case that really irritated me.) So Yarkoni’s paper could be serving a useful role even if all it did was remind us of the challenges of generalization. But the paper does more than that, in that it links this statistical idea with many different aspects of practice in psychology research.

That all said, there’s one way in which I disagree with Yarkoni’s characterization of scientific inferences as “fast” or “slow.” I agree with him that the “fast” approach is mistaken. But I think that even his “slow” approach can be too strong!

Here’s my concern. Yarkoni writes, “The ‘slow’ approach is conservative, and adheres to the opposite default: an observed relationship is assumed to hold only in situations identical, or very similar to, the one in which it has already been observed.”

But my problem is that, in many cases, I don’t even think the observed relationship holds in the situations in which it has been observed.

To put it more statistically: Patterns in the sample do not necessarily generalize to the population. Or, to put it another way, correlation does not even imply correlation.

Here’s a simple example: I go to the store, buy a die, I roll it 10 times and get 3 sixes, and I conclude that the probability of getting a six from this die is 0.3. That’s a bad inference! The result from 10 die rolls gives me just about no useful information about the probability of rolling a six.
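
To put some numbers on that (a quick sketch using scipy; the exact interval is just one reasonable choice of uncertainty summary): the 95% interval for the probability of a six, given 3 sixes in 10 rolls, is wide enough to comfortably include the fair-die value of 1/6, and 3 or more sixes is not even a surprising result from a fair die.

    from scipy.stats import beta, binom

    k, n = 3, 10  # 3 sixes in 10 rolls
    # exact (Clopper-Pearson) 95% interval for the probability of a six
    lo = beta.ppf(0.025, k, n - k + 1)
    hi = beta.ppf(0.975, k + 1, n - k)
    print(f"95% interval for Pr(six): ({lo:.2f}, {hi:.2f})")  # wide, and includes 1/6
    # probability of seeing 3 or more sixes in 10 rolls of a fair die
    print(f"Pr(3+ sixes | fair die) = {1 - binom.cdf(2, n, 1/6):.2f}")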

Here’s another example, just as bad but not so obviously bad: I find a survey of 3000 parents, and among those people, the rate of girl births was 8% higher among the most attractive parents than among the other parents. That’s a bad inference! The result from 3000 births gives me just about no useful information about how the probability of a girl birth varies with parental attractiveness in the population.
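
Again, some rough numbers (a sketch; the split of 300 “most attractive” parents versus 2,700 others is an assumption made up for illustration): the standard error of the difference between the two groups’ proportions of girl births is around three percentage points, while real differences in the human sex ratio across groups of parents are known to be a small fraction of a percentage point, so an observed difference as large as the one reported is essentially all noise.

    import numpy as np

    # assumed split, for illustration only: 300 "most attractive" parents, 2,700 others
    n1, n2 = 300, 2700
    p = 0.485  # approximate baseline probability of a girl birth

    # standard error of the difference between the two groups' proportions of girls
    se_diff = np.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    print(f"SE of the difference in Pr(girl): {se_diff:.3f}")  # about 0.03, i.e. ~3 points

    # real variation in sex ratios across groups of parents is on the order of
    # a few tenths of a percentage point, far below this standard error,
    # so any large observed difference here is almost entirely sampling noise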

So, in those examples, even a “slow” inference (e.g., “This particular die is biased,” or “More attractive parents from the United States in this particular year are more likely to have girls”) is incorrect.

This point doesn’t invalidate any of Yarkoni’s article; I’m just bringing it up because I’ve sometimes seen a tendency in open-science discourse for people to give too much of the benefit of the doubt to bad science. I remember this with that ESP paper from 2011: people would say that this paper wasn’t so bad; it just demonstrated general problems in science. Or they’d accept that the experiments in the paper offered strong evidence for ESP; it was just that their prior against ESP overwhelmed the evidence. But no, the ESP paper was bad science, and it didn’t offer strong evidence. (Yes, that’s just my opinion. You can have your own opinion, and I think it’s fine if people want to argue (mistakenly, in my view) that the ESP studies are high-quality science. My point is that if you want to argue that, argue it, but don’t take that position by default.)

That was my point when I argued against over-politeness in scientific discourse. The point is not to be rude to people. We can be as polite as we want to individual people. The point is that there are costs, serious costs, to being overly polite to scientific claims. Every time you “bend over backward” to give the benefit of the doubt to scientific claim A, you’re rigging things against the claim not-A. And, in doing so, you could be doing your part to lead science astray (if the claims A and not-A are of scientific importance) or to hurt people (if the claims A and not-A have applied impact). And by “hurt people,” I’m not talking about authors of published papers, or even about hardworking researchers who didn’t get papers published because they couldn’t compete with the fluff that gets published by PNAS etc.; I’m talking about the potential consumers of this research.

Here I’m echoing the points made by Alexey Guzey in his recent post on sleep research. I do not believe in giving a claim the benefit of the doubt, just cos it’s published in a big-name journal or by a big-name professor.

In retrospect, instead of saying “Against politeness,” I should’ve said “Against deference.”

Anyway, I don’t think Yarkoni’s article is too deferential to dodgy published claims. I just wanted to emphasize that even his proposed “slow” approach to inference can let a bunch of iffy claims sneak in.

Later on, Yarkoni writes:

Researchers must be willing to look critically at previous studies and flatly reject—on logical and statistical, rather than empirical, grounds—assertions that were never supported by the data in the first place, even under the most charitable methodological assumptions.

I agree. Or, to put it slightly more carefully, we don’t have to reject the scientific claim; rather, we have to reject the claim that the experimental data at hand provide strong evidence for the attached scientific claim (rather than merely evidence consistent with the claim). Recall the distinction between truth and evidence.

Yarkoni also writes:

The mere fact that a previous study has had a large influence on the literature is not a sufficient reason to expend additional resources on replication. On the contrary, the recent movement to replicate influential studies using more robust methods risks making the situation worse, because in cases where such efforts superficially “succeed” (in the sense that they obtain a statistical result congruent with the original), researchers then often draw the incorrect conclusion that the new data corroborate the original claim . . . when in fact the original claim was never supported by the data in the first place.

I agree. This is the sort of impoliteness, or lack of deference, that I think is valuable going forward.

Or, conversely, if we want to be polite and deferential to embodied cognition and himmicanes and air rage and ESP and ages ending in 9 and the critical positivity ratio and all the rest . . . then let’s be just as polite and deferential to all the zillions of unpublished preprints, all the papers that didn’t get into JPSP and Psychological Science and PNAS, etc. Vaccine denial, N rays, spoon bending, whatever. The whole deal. But that way lies madness.

Let me again yield the floor to Yarkoni:

There is an unfortunate cultural norm within psychology (and, to be fair, many other fields) to demand that every research contribution end on a wholly positive or “constructive” note. This is an indefensible expectation that I won’t bother to indulge.

Thank you. I thank Yarkoni for his directness, as earlier I’ve thanked Alexey Guzey, Carol Nickerson, and others for expressing negative attitudes that are sometimes socially shunned.

2. I recommend that Yarkoni avoid the use of the terms fixed and random effects as this could confuse people. He uses “fixed” to imply non-varying, which makes a lot of sense, but in economics they use “fixed” to imply unmodeled. In the notation of this 2005 post, he’s using definition 1, and economists are using definition 5. The funny thing is that everyone who uses these terms thinks they’re being clear. But the terms have different meanings for different people. Later on page 7 Yarkoni alludes to definitions 2 and 3. The whole fixed and random thing is a mess.
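
To illustrate the ambiguity with a small sketch (simulated data; the variable names and numbers are made up, and statsmodels is just one convenient tool): the same word “fixed” can point at two different models, namely unpooled indicator variables for each group (the econometric usage) versus group effects that are modeled as draws from a common distribution and partially pooled (the multilevel usage).

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    g = np.repeat(np.arange(10), 30)                  # 10 groups, 30 observations each
    x = rng.normal(size=g.size)
    group_effect = rng.normal(0, 0.5, size=10)[g]     # true group-level intercept shifts
    y = 1.0 + 0.3 * x + group_effect + rng.normal(0, 1.0, size=g.size)
    df = pd.DataFrame({"y": y, "x": x, "g": g})

    # "fixed effects" in the econometric sense: one unpooled dummy per group
    fe_fit = smf.ols("y ~ x + C(g)", data=df).fit()

    # "random effects" in the multilevel sense: group intercepts drawn from a
    # common distribution and partially pooled toward it
    re_fit = smf.mixedlm("y ~ x", data=df, groups=df["g"]).fit()

    print(fe_fit.params["x"], re_fit.params["x"])  # similar slopes, different models of the groups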

Conclusion

Let me conclude with the list of recommendations with which Yarkoni concludes:

Draw more conservative inferences

Take descriptive research more seriously

Fit more expansive statistical models

Design with variation in mind

Emphasize variance estimates

Make riskier predictions

Focus on practical predictive utility

I agree. These issues come up not just in psychology but also in political science, pharmacology, and I’m sure lots of other fields as well.

43 thoughts on "“The Generalizability Crisis” in the human sciences"

  1. Andrew writes, “instead of saying ‘Against politeness,’ I should’ve said ‘Against deference.’”

    There is a strong connection between politeness and deference. Often politeness is just about people in authority preventing true dissent. Of course, bullying is also used by those in authority to put down dissent. I think we should value kindness and openness to the weak and under-represented, and outspokenness and frankness to those in power, and retire notions of politeness forever. In other words, pointing out the errors of some fringe researchers and how they can improve their work, while telling the Cass Sunsteins of the world that they are full of it, is not morally inconsistent.

    • Steve:

      One difficulty with discussions of tone is that they can devolve into infinite regress.

      For example, in December, Keith O’Rourke posted a blog criticizing a public talk given by a statistician in which concepts of statistical significance were garbled.

      A commenter, Beatrice Pascal, then described O’Rourke as “this cretin” and characterized his post as “scientifically illiterate, sexist, arrogant.”

      Ben Bolker commented that Pascal could be more polite (“drivel” and “cretin” seem unnecessary).

      Martin Modrak commented that he agreed with Pascal “that the tone/wording of the post [by O’Rourke] felt more personally-attackish than is usual for this blog.” And then I replied that Pascal’s comment was much more personally-attackish than anything in O’Rourke’s post.

      This sort of thing comes up a lot in discussions of tone: there are so many things to get mad at, or meta-mad at, or meta-meta-mad at, etc. Person B can criticize person A for their tone; then person C can criticize person B for their tone in criticizing person A; etc. Similarly with discussions of power and authority. It’s never clear who is punching up and who is punching down, or how relevant this all is.

  2. The most damning part of the Yarkoni paper is similar to the piranha critique.

    The literature seeks to find small stimuli that can have large and consistent effects. But there are a zillion small stimuli, so statistical logic requires that all zillion be modeled, and when you do that, the estimated variances explode. You can’t have it both ways: you can’t say that your small stimulus is important but that all the other small stimuli can be ignored.

    The rather disturbing implication of all this is that, in any research area where one expects the aggregate contribution of the missing σ²_u terms to be large—i.e., anywhere that “contextual sensitivity” … is high—the inferential statistics generated from models like (2) will often underestimate the true uncertainty surrounding the parameter estimates to such a degree as to make an outright mockery of the effort to learn something from the data using conventional inferential tests.
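
    Here’s a small simulation sketch of that point (all numbers invented for illustration): when the treatment varies at the stimulus level and the stimulus-level variance (the σ²_u term) is left out of the model, the conventional standard error can badly understate how much the estimate actually varies from one sample of stimuli to the next.

        import numpy as np

        rng = np.random.default_rng(1)
        n_sims, n_stim, n_per = 1000, 20, 25
        sigma_u, sigma_e, true_b = 0.5, 1.0, 0.2

        estimates, nominal_ses = [], []
        for _ in range(n_sims):
            # each simulated study samples its own stimuli; half get the treatment
            x_stim = np.tile([0.0, 1.0], n_stim // 2)
            stim_effect = rng.normal(0, sigma_u, size=n_stim)   # the omitted sigma^2_u term
            x = np.repeat(x_stim, n_per)
            y = true_b * x + np.repeat(stim_effect, n_per) + rng.normal(0, sigma_e, size=x.size)

            # naive OLS slope and its conventional standard error (ignores the clustering)
            X = np.column_stack([np.ones_like(x), x])
            bhat = np.linalg.solve(X.T @ X, X.T @ y)
            resid = y - X @ bhat
            s2 = resid @ resid / (y.size - 2)
            se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
            estimates.append(bhat[1])
            nominal_ses.append(se)

        print("average nominal SE:    ", round(float(np.mean(nominal_ses)), 3))
        print("actual sd of estimates:", round(float(np.std(estimates)), 3))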

  3. Andrew:

    Doesn’t the example you gave just demonstrate my point? We should dispense with calls for “politeness” and try to get back to the issue being studied. Ignore the insults and assume (where possible) that the interlocutor who is screaming is doing so out of a good faith belief that important information is being ignored. Of course, in practice it is hard for many to ignore insults, which is why we should avoid them. On the other hand, political satire and insults have been an effective tool for those fighting against establishments that want to suppress the truth. So, we become impolite when we think it is needed and polite when we can. Just assume everyone is operating in good faith and worry about the truth.

  4. Andrew – thank you for posting this. The first sentence speaks very loudly but is somehow ignored. Let me quote: “Most theories and hypotheses in psychology are verbal in nature”.

    If this is the case, and claims are verbal, we need an approach to present and assess verbal claims. This was my message at the 2017 SSI conference, and it is what is presented in https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070. Unfortunately, like the first sentence in your post (quoted above), this is something statisticians ignore.

    On the positive side, clinicians have adopted the boundary of meaning (BOM) approach presented in my paper and are applying it repeatedly (see references in the SSRN paper). The irony is that your Sign type error proposal with John Carlin applies here very well. This, too, seems to be ignored…

  5. Agreed. The problem, I think you (and the author) are saying, is that papers tend to have the following structure: 1. Broad verbal question, 2. Narrow (and poorly-aligned) statistical hypothesis/design, 3. Narrow statistical result, 4. Broad verbal interpretation of result to answer broad verbal question.

    What we need is more papers with this structure: 1. Broad verbal question, 2. Narrow verbal hypothesis related to the question, 3. Narrow (and closely-aligned) statistical hypothesis/design, 4. Narrow statistical result, 5. Narrow verbal interpretation of result, 6. Much narrower verbal interpretation of the result’s implications for answering the broad question.

    We retain #1 because that is the thing we care about, the significance section of the paper. #5 and #6 remain somewhat subjective, but reviewers and editors should enforce reasonable limits on unwarranted speculation (unless it’s labeled as such). We should probably also have #7: Statistical information provided by the study’s results. Frequentists seem to leave that sort of thing to meta-analysts, but maybe we shouldn’t, particularly if we’re going to cite effect sizes in our introduction. Bayesians seemingly could and should say something like “My prior distribution was X, but were I to replicate the study, I’d now modify the prior in light of this data to be X + Y.”

  6. Michael Nelson writes: “What we need is more papers with this structure: 1. Broad verbal question, 2. Narrow verbal hypothesis related to the question, 3. Narrow (and closely-aligned) statistical hypothesis/design, 4. Narrow statistical result, 5. Narrow verbal interpretation of result, 6. Much narrower verbal interpretation of the result’s implications for answering the broad question.”

    Isn’t the solution instead of broad verbal questions, mathematically precise questions, having terms defined with precision. Why do we care about ambiguous and vague claims? Those will never yield insight or knowledge.

    • Steve – beyond the need of domain experts to state research claims verbally, let me offer another “surprise”. Much of science, and of application areas such as marketing, operations, R&D, etc., involves generalizing findings without posing mathematically precise questions. Indeed, much generalisation is done without data, for example using first principles such as Newtonian mechanics or hydrodynamics. In many cases people take experience gained in one setup (say in a pilot plant) and generalise it to another setup (full scale operation). Physicians treat patient A and, using that experience, make decisions on patient B. Statisticians need to contribute to this. Stating that without a mathematically stated questions we cannot do anything is “problematic”. See for example https://blogisbis.wordpress.com/2019/11/12/a-pragmatic-view-on-the-role-of-statistics-and-statisticians-in-modern-data-analytics/

    • If all of our questions about human behavior are strictly and solely mathematical, then only mathematicians will understand human behavior.

      There are several good reasons to continue to frame our research in verbal terms: 1) The public funds our work, in part, because we are able to provide them with (incomplete, conditional) answers to “Why?” and “How?” questions. 2) In the absence of scientific work that attempts to answer these kinds of questions, even reasonable people will flock to pseudo-scientists, philosophers, con men, and sophists. (Fox News!) 3) Policymakers need (and pay for) answers they can understand. 4) Most scientists begin by asking verbal, conceptual questions, and we are motivated by finding verbal, conceptual answers; to keep those questions and answers secret deprives our audience of insight. 5) It is easier to evaluate the a priori plausibility of a verbal claim than a mathematical one. 6) It is easier to think you understand something that you don’t if you never have to explain it to anyone.

      • I like the idea that we start broad and narrow and try to stay narrow. I think also it would be a bit of a lie to not include the broad part at the beginning.

        Like, these things don’t start off mathematically, in any case, and we’re always compromising on measurement and modeling to what can be measured and what can be modeled. So not describing how your interests drove your stats/math/measurement selection seems misleading.

  7. I think Andrew gave a great example about making a poor inference from experimental data even with 3000 data points.

    However, can someone say more on how to correctly infer in the slow inference regime?

    I think it would be illuminating to me and other readers alike.

  8. Ron Kenett writes: “Stating that without a mathematically stated questions we cannot do anything is “problematic”.” in response to my post.

    Of course, I didn’t say that. I said, “Isn’t the solution instead of broad verbal questions, mathematically precise questions, having terms defined with precision.”

    Okay, what I said was not mathematically precise either, but I’ll try again. Theories in social science (or any science) ought to tell us what the relations are between the posited entities, causes or forces in a mathematically precise way. A theoretician ought to be able to explain the relationship between the variables she is studying. Theories that state that “x has an effect on y” or “x tends to increase y” are vague and (I would say) meaningless. Theories don’t have to be stated in mathematical notation, but if the mathematical relationship between variables is unclear, there can’t be any progress.

    • A slight tangent, but I love this clip with Richard Feynman explaining why he can’t answer a journalist’s question as to why magnets attract and repel each other: https://www.youtube.com/watch?v=Q1lL-hXO27Q

      His point is that scientific statements are inherently hierarchical, so that each level requires more specificity and greater prior knowledge than the last. In the present context, I take it to mean that, while hypotheses can be mathematically precise, scientific statements about why the hypothesis is of interest, or why the results are what they are, must necessarily be in some sense incomplete or, as you say, vague. And that’s okay, as long as the author then goes on to elaborate more specific and testable claims, and then avoids presenting test results as full-on answers to the verbal questions.

    • Steve – this is a good discussion, tx.

      To make the case, consider for example a recent paper from Sir Peter Ratcliffe, one of the three 2019 Nobel laureates in medicine. In https://www.ncbi.nlm.nih.gov/pubmed/29917232
      he writes (this is a quote):
      1) The carotid body is a peripheral arterial chemoreceptor that regulates ventilation in response to both acute and sustained hypoxia.
      2) Type I cells in this organ respond to low oxygen both acutely by depolarization and dense core vesicle secretion and, over the longer term, via cellular proliferation and enhanced ventilatory responses.
      3) Using lineage analysis, the present study shows that the Type I cell lineage itself proliferates and expands in response to sustained hypoxia.
      This is exactly the vague format you refer to. These statements are based on pretty standard statistical analysis. Specifically (another quote):
      “The statistical analysis section states: “Data are shown as the mean ± SEM. Statistical analyses were performed using unpaired Student’s t tests. For repeated measures, data were analysed by ANOVA followed by Tukey’s multiple comparison test or t test with Holm–Sidak correction for multiple comparisons as appropriate and as described in Hodson et al. (2016). P < 0.05 was considered statistically significant.”

      My suggestion is to assess the statement: "Using lineage analysis, the present study shows that the Type I cell lineage itself proliferates and expands in response to sustained hypoxia" using the Sign type error of Gelman and Carlin and state it. Such errors are dependent on study design and of course experimental outcomes. See https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3035070 for more details.

      https://onlinelibrary.wiley.com/doi/full/10.1111/pai.12915
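
      For readers who haven’t seen it, here is a minimal sketch of the Gelman and Carlin type S (sign) and type M (exaggeration) calculation under a normal approximation; the effect size and standard error below are placeholders, not values taken from the paper above.

          import numpy as np
          from scipy.stats import norm

          def retrodesign(true_effect, se, alpha=0.05, n_draws=100_000, seed=0):
              """Power, type S error, and exaggeration ratio, assuming a normal estimate."""
              z = norm.ppf(1 - alpha / 2)
              p_hi = 1 - norm.cdf(z - true_effect / se)   # significant, correct sign
              p_lo = norm.cdf(-z - true_effect / se)      # significant, wrong sign
              power = p_hi + p_lo
              type_s = p_lo / power
              draws = norm.rvs(true_effect, se, size=n_draws, random_state=seed)
              significant = np.abs(draws) > z * se
              exaggeration = np.mean(np.abs(draws[significant])) / true_effect
              return power, type_s, exaggeration

          # placeholder numbers: a small true effect measured with a noisy design
          print(retrodesign(true_effect=0.1, se=0.3))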

      • Ron:

        I am not sure why the statements in your example are vague and ambiguous. Are you saying that the words “regulate” and “respond” are vague? I am saying that a statement of a theory is vague or ambiguous when we cannot know what the relationship is that we are supposedly testing for. Let me give an example from social psychology, cognitive dissonance theory. It states that inconsistent cognitions produce unpleasant states that motivate individuals to change one or more cognitions to restore the consistency.

        I would say this is a vague theory that can’t be made meaningful even though it is very popular. For example, what is inconsistency? That means something very clear in logic, but what does it mean here? Do “cognitions” appear in the mind in some formal language? If not, then how can I get an inconsistency from vague and informal statements? Such informal statements are subject to multiple interpretations. Some may produce an inconsistency, others will not. Since my beliefs have all sorts of implications that may lead to inconsistencies, does my failure to derive those inconsistencies count as a counterexample to cognitive dissonance theory, or does one have to be conscious of the inconsistency? Who gets to judge whether I am conscious of the inconsistency, given that the beliefs in question could be formulated in multiple ways to avoid the inconsistency? We can do all sorts of experiments “confirming” this theory, but the theory remains hopelessly vague.

    • Steve said,

      “Theories in social science (or any science) ought to tell us what the relations are between the posited entities, causes or forces in a mathematically precise way. A theoretician ought to be able to explain the relationship between the variables she is studying. Theories that state that “x has an effect on y” or “x tends to increase y” are vague and (I would say) meaningless. Theories don’t have to be stated in mathematical notation, but if the mathematical relationship between variables is unclear, there can’t be any progress.”

      Sorry to be blunt, but this sounds like it’s off in a fairy-tale world. Statements such as, “x tends to increase y” are indeed vague, but they are not the final word — they are steps on the way to something more precise. And, reality being what it is, we can’t expect to get theories that are airtight precise. Uncertainty is part of reality; we can’t expect to get rid of it, but, with continued work, we can sometimes get more precise than what we started with.

      • I agree that we start with vague statements and move to more formal presentations. But, a theory or model ought to tell me the relationship between the variables that I expect to see. I am not sure we are in disagreement. A “precise” theory is still going to be an approximation, but I need a theory that could be wrong. In some social sciences, there are theories that are vague enough that it is unclear what can confirm or refute them. That, I think, is the problem with the generalizability crisis.

  9. W. I. B. Beveridge, in “The Art of Scientific Investigation”, pointed out that experimental results on sheep in New Zealand do not necessarily hold for the results of the same experiment on sheep in Scotland. (Maybe I misremembered the places.) And that’s biology, which I suppose is more generalizable than psychology.

    Not to mention cultural differences among humans. I used to be interested in folk tales. While I can grok many tales from Africa and the Far East, and I usually like Nasruddin, there are tales from European countries that leave me puzzled; they don’t make sense to me. Obviously, they made sense to those who told them and grew up with them. Joseph Campbell proposed a grand theory of folk tales, and he was not the first. But Campbell cheated. When I read “The Hero with a Thousand Faces” I noticed that he took snippets of tales to make his points. When you look at each tale as a whole, you see that his interpretations of the snippets are not compatible with each other. It’s as though Elvis drove Miss Daisy while singing bluegrass in double time and plotting to kill Hitler.

    You don’t have to be a statistician to look at the tall grass and see which way the wind is blowing. Nor to know that the wind is blowing in a different direction somewhere else.

    • “experimental results on sheep in New Zealand do not necessarily hold for the results of the same experiment on sheep in Scotland.”

      Bill, you provide a great metaphor. My comment isn’t a reply to you in particular, but I’ll leverage your story about the sheep and make a general comment:

      I’m confused about why everyone is so confused about how to make progress in science. There’s only one way to find out if sheep in Scotland and sheep in New Zealand respond the same way under the same conditions: do both experiments. That’s the only way.

      We can predict and theorize
      About the sheep and their size;
      We can say in words or say in equations;
      How they behave at different elevations;
      We can describe one quickly;
      Or consider it with care;
      We can sing about it if we dare;
      We can sketch it out from head to toe;
      but there’s only one way that I know;
      Only one way that can claim;
      To tell if or not they are the same;
      It’s the way that works the best;
      This only way is to do a test.

      All this theorizing about how we find out what’s true or not is way off the rails. People are looking for shortcuts that don’t exist.

      The tried and true way to prove a phenomenon – the ONLY way – is to test the phenomenon again and again and again and again. My goodness who cares if a study replicates twice by chance! That’s like making a theory about finding two similar grains of sand on the freakin’ beach! We’ve been testing Newtonian mechanics for over three centuries! We’re still learning how evolution works after 150 years!

  10. On deference

    As I recall, Carl Sagan once pointed out that in the 1950s, many scientists who read Velikovsky noted his mistakes in their fields, but gave him the benefit of the doubt in areas they knew nothing about.

    Velikovsky was right about the temperature of Venus, however.

    • And the historians noted the historical problems but gave him the benefit of the doubt on his science. IIRC.

      A bit like Bjørn Lomborg’s book The Skeptical Environmentalist where various scientists seemed to nod at some of his claims until they hit their own area of expertise and suddenly there were cries of “What?”, “Where did that come from?” and “%$>@#%#$ ?”.

  11. Michael Nelson writes: “If all of our questions about human behavior are strictly and solely mathematical, then only mathematicians will understand human behavior.”

    False, I am not a mathematician, but I understand what “more than” or “additive” or “multiply” means. The distinction is not between verbal and mathematical; mathematical concepts can often be rendered clearly in English without special notation. The distinction is between relations that are expressed rigorously (making the relationship between variables clear) and relations that are poorly defined or undefined.

  12. Andrew said: “On the other hand, as Yarkoni notes, standard practice is to not think about these issues at all and to just make absurdly general claims from absurdly specific experiments”

    Regrettably, it all too often seems that standard practice is to not think at all.

  13. “in general the fast approach is wrong, that instead we should be using the fast approach to generate conjectures but use the slow approach when trying to understand what we know.”

    +1

  14. RE: ‘Or, to put it slightly more carefully, we don’t have to reject the scientific claim; rather, we have to reject the claim that the experimental data at hand provide strong evidence for the attached scientific claim (rather than merely evidence consistent with the claim). Recall the distinction between truth and evidence.’
    ———-
    Isn’t there a name for this bias?

  15. Yarkoni’s article is great and I think his points are correct. However, I wonder if there could be a more pragmatically minded recommendation for psychologists: start using fixed effects to model nesting in your data, in order to encourage the use of multiple stimuli etc., and constrain your inferences to the stimuli at hand. I agree we should move towards random effect models but in many situations estimating random effect parameters requires very large numbers of stimuli, and many psychologists might see this as infeasible and might ignore the problem altogether.

    • Many people like to have “the steps” clearly laid out, but doing so gives them the excuse to neglect the hard thinking that needs to be done on a case-by-case basis.

  16. Andrew, I think your gripe about the misleading nature of the words fixed vs random in multilevel models is well taken, but I think that one should be careful about when to bring it up.

    It’s like with the way the flow of electrical currents is (or was?) taught in school; the direction is wrong, but that’s fine for now. As long as people understand what the terms are intended to mean in a particular context, it’s OK. It’s like the term “unbiased”. In my beginner classes I don’t even let my students encounter that term.

    It’s just confusing for beginners to encounter a terminology fight too early in their education. I remember when I started reading the Gelman and Hill book in 2006 or so, I was confused and disturbed to read that Bates’ terminology was so wrong wrong wrong. It confuses newcomers; such debates should be reserved for advanced audiences that are used to stats terminology being a confusing mess.

    • Shravan:

      I disagree!

      You say, “As long as people understand what the terms are intended to mean in a particular context . . .”

      But, no, I don’t think people understand! People really do say things like, “The number of groups is not part of a larger population so you can’t do random effects,” or “We care about these particular groups in themselves, not the larger population, so we have to use fixed effects,” or “It’s not a random variable, so . . .”

      This is often the problem with bad notation or imprecise language: it’s not just that people become confused; it’s also that they can become very certain in their mistakes. Indeed, I think problems are created by the very fact that people think that everyone has a common understanding.

      I agree, though, that there’s not time to bring up all issues at all times. You have to pick your battles. Statistics is complicated, and no way we can explain it all in one book, let alone one blog post!

      • Sorry, I wasn’t precise enough. When I say, “As long as people understand what the terms are intended to mean in a particular context” I mean that there are levels of understanding at different stages in one’s life as a student of statistics. I like the Mahajan philosophy (Street-fighting math) of just lying your way through a story to eventually get to the truth.

        E.g., I would love to say in my intro class on stats that guys, you are never going to get any answers by doing an experiment or doing your stats on the data. We are going to learn the methods, but just be aware that you will be about as ignorant about your problem after the analysis is done as you were before, maybe even more. You’ll just have something to think about.

        People in linguistics and psych are not ready to hear that statistics is usually not going to be the answer to a problem, it’s just going to raise even more questions. More data will leave you even more uncertain about what the facts are. You have to slowly bring them to the point when they realize it themselves (or not). That’s an example of revealing some truths too early.

        I think what I am trying to say is that the beginner doesn’t need to understand everything fully from the outset, and this fixed/random effects stuff is too advanced/deep to bring up in intro texts like Gelman and Hill (although I now think that’s not an intro text).

        I still remember the time when I was starting out, and I remember the huge worries these things generated in my already uncertain mind. Wow, could Doug Bates, a great computational statistician, have gotten the terminology of lmer so wrong? I just think there are more important pedagogical goals to achieve at different stages of one’s education.

        PS I’ve had a student storm out of my class in tears because I used \hat\sigma for an estimate of sd in one slide and, foolishly, the letter s a bit later. What that taught me was that in the early stages of learning, even small things can lead to a magnified sense of instability in the unfolding story.

          • Shravan: In my intro class at Duke (almost 15 years ago), I repeatedly told the students that if they were ever to encounter real data or a real study, they should not think they are prepared to deal with it adequately. I suggested they get help from a statistician or someone with real research experience.

          Not sure what effect that had, but they did not like hearing it…

  17. My 2 cents: problem not solvable because the implied multi-level model is not understood, and thus words and math are both incomplete renderings. There are probably 10k ways to say this, but one is that words and math in these situations have limited equivalence between their domains. That not only means they don’t transform well – and clearly aren’t commutative – but that they are part of a larger domain whose dimensionality is not understood. Thus, we see vogues in approaches: e.g., remember when it felt like everything pointed at evolutionary ‘biology’ as ‘the’ explanatory mechanism? That also leads to over-valuing of the current approach, which then ‘down regulates’ as the wave recedes. I used the ‘regulation’ term because I’ve been reading a lot of medical papers over the last few months, and they use the term to refer to the outcome of processes that may or may not be understood, but which can be at least generally identified.

    While I truly enjoyed the post and the article, I see the issues recurring no matter what is done (e.g., the efforts to eradicate some of the most misused and misunderstood statistical conceptions) for a couple of reasons. First, better understanding of a larger model would allow, I would hope, more n and thus less pure noise. I may be wrong but I think small n often occurs because larger n would confound the results when the various domains aren’t understood; your best chance of getting a result is to not do the larger work. Second, since concept and math need to map to each other, I’m not sure that greater understanding of the appearance of precision versus the reality of non-precision will lead to better conceptualization. You address this a lot: the eradication of p-hacking requires a different valuation approach for career success. I don’t see how the conceptual side can accept what appears to them as less rigor that reveals they don’t really know.

    As a personal note, back when Covid-19 appeared, I started to look into my blood pressure medication because the work on SARS showed that it attached to ACE2 receptors. Here’s the kind of messy form they use, in run-on form: there’s ACE1 and ACE2 and these are nothing alike but are part of a chain that involves angiotensin I and angiotensin II (note the second form of 2 in one chain!), plus angiotensin 1-7, plus a whole bunch more. So I take Losartan, which is an ACE2 drug, meaning it blocks the ACE2 receptors. It’s an ARB, for angiotensin receptor blocker. After wading through the various almost identically named elements that sort of fit into a general chain, it seems the ARB outcompetes the virus for access to ACE2 receptors. Which is a good thing. Which I mention because very intelligent people were so confused by the labels and the chain description that the simple math result was obscured. There was tremendous fear about upregulation versus downregulation, which again indicates how the verbiage used to describe the complex dimensional chain obscures the end meanings. Think how much more we could know if we could straighten out these concepts so they could be learned without twisting the mind into a pretzel.

    My wife had a significant case. Opacity in lungs, cough, shortness of breath, etc. This was in March when testing was hard to obtain. We are both around a number of people who travel a lot, many to and from China. We resisted the hospital because, bluntly, when you go on a ventilator, you have to be weaned, and that is not easy when they don’t understand the disease progression. It was close for a few days and nights; we were about 10 minutes from calling 911, knowing she had a significant chance of never coming back. She pulled through. I never had more than a sore throat. Which I think may have been partly attributable to the ARB outcompeting for ACE2 receptors. (There are some clinical trials underway testing ARBs, but the cases I see described typically involve pretty much every drug and treatment they can try, so I don’t expect great results – especially given the comorbidities being seen.)

    My point isn’t about ARBs but about the linkage between the conceptualization and the presentation and then the math behind all that. Much is obscured when we don’t know. It’s a human habit to cover up what we don’t know. And it’s a human habit to phrase things in ways that intentionally create jargon walls. The miscommunication this generates within fields, among doctors, among scientists, is much larger than people want to admit.

    • “The miscommunication this generates within fields, among doctors, among scientists, is much larger than people want to admit.”

      Agreed. Thanks for your examples. One example that comes to mind from my experience was working with a biology Ph.D. student who was trying to use some software that a biologist had developed and made available. It had an option called “hyperparameter”, which allowed a choice from a small list of values for the parameter in question, or the option of putting a prior of a particular type on the parameter. This terminology really interfered with her understanding.

  18. There are general “laws” in Experimental Psychology [e.g., the “inverted U” (motivation); a few in the Psychology of Perception], in Behaviorism [e.g., schedules of reinforcement; “matching”], and in Human Factors, many of them expressed mathematically…
