Another one of those “Psychological Science” papers (this time on biceps size and political attitudes among college students)

Paul Alper writes:

Unless I missed it, you haven’t commented on the recent article of Michael Bang Petersen [with Daniel Sznycer, Aaron Sell, Leda Cosmides, and John Tooby]. It seems to have been reviewed extensively in the lay press. A typical example is here. This review begins with “If you are physically strong, social science scholars believe they can predict whether or not you are more conservative than other men…Men’s upper-body strength predicts their political opinions on economic redistribution, they write, and they believe that the link may reflect psychological traits that evolved in response to our early ancestral environments and continue to influence behavior today. . . . they surveyed hundreds of people in America, Denmark and Argentina about bicep size, socioeconomic status, and support for economic redistribution.”

Further, “Despite the fact that the United States, Denmark and Argentina have very different welfare systems, we still see that — at the psychological level — individuals reason about welfare redistribution in the same way,” says Petersen. “In all three countries, physically strong males consistently pursue the self-interested position on redistribution.

“Our results demonstrate that physically weak males are more reluctant than physically strong males to assert their self-interest — just as if disputes over national policies were a matter of direct physical confrontation among small numbers of individuals, rather than abstract electoral dynamics among millions.”

However, the actual journal article and its supplemental material show how shaky the all-encompassing conclusions are. For example, R-sq for each of the three countries is very low. The regression lines and 90% confidence intervals drawn for each of the three countries are devoid of the individual data points and thus give no visual sense of the variability. SES assessment was highly subjective and different in each country. In the U.S. and Argentina, the study relied on college students measuring college students’ biceps while in Denmark “a protocol was devised and presented to the subjects over the internet instructing them on how to measure their biceps correctly.” As to SES, the questions used were different in each country, supposedly justified by taking “into account country-specific factors regarding the political discussions on redistribution.”

So, while the study’s conclusion may in fact be valid, is this yet another example of overreach by social scientists and embrace by the lay press which seeks eye-catching studies?

My reply: This article is worth discussing, partly because it appears to be of higher quality than the other Psychological Science article we discussed recently. That other paper was full of holes, and its claimed effect size was so large as to be utterly implausible. In particular, that other article claimed to find huge within-person effects (different attitudes for women in different parts of their menstrual cycles), but estimated this entirely with a between-person study. In contrast, the Petersen et al. article linked to above is much more sensible, claiming only between-person effects.

To be more specific, they claim that physically strong low-SES 21-year-old men are more likely to favor income redistribution, compared to physically weak low-SES 21-year-old men. They don’t suggest that going to the gym will make any of these individual young men more in favor of redistribution; they’re only making a claim about differences in the population.

Here are my reactions to the Petersen et al. paper:

1. As noted above, the correlations in the paper do not seem completely unreasonable; that is, it seems possible that they apply to the general population of 21-year-old men, not just to the small samples analyzed in the study.

2. The statistical evidence is not as clear as the authors seem to think (given the declarative, non-caveated style of the article’s abstract and conclusions). Most obviously, the authors report a statistically significant interaction with no statistically significant main effect. But, had they seen the main effect (in either direction), I’m sure they could’ve come up with a good story for that too. That’s fine—they should report what they saw in the data—but the p-values don’t quite have the pure implications implied in the presentation.

3. There also appear to be some degrees of freedom involved in the measurement. From the supplementary material:

The interaction effect is not significant when the scale from the Danish study are used to gauge the US subjects’ support for redistribution. This arises because two of the items are somewhat unreliable in a US context. Hence, for items 5 and 6, the inter-item correlations range from as low as .11 to .30. These two items are also those that express the idea of European-style market intervention most clearly and, hence, could sound odd and unfamiliar to the US subjects. When these two unreliable items are removed (alpha after removal = .72), the interaction effect becomes significant.

The scale measuring support for redistribution in the Argentina sample has a low α-level and, hence, is affected by a high level of random noise. Hence, the consistency of the results across the samples is achieved in spite of this noise. A subscale with an acceptable α=.65 can be formed from items 1 and 4.

Lots of options in this analysis. Again, these decisions may make perfect sense but they indicate the difficulty of taking these p-values at anything like face value. As always in such settings, the concern is not a simple “file-drawer effect” that a particular p-value was chosen out of some fixed number of options (so that, for example, a nominal p=0.003 should really be p=0.03) but that the data analysis can be altered at so many different points under the knowledge that low p-values are the goal. This can all be done in the context of completely reasonable scientific goals.
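To make the flexibility concrete, here is a small sketch (in Python, with made-up data; the function names are mine, not the authors’) of the kind of item-dropping reliability calculation quoted above: compute Cronbach’s alpha for a scale, then recompute it with each item left out and see how much the answer moves.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, k_items) array of responses."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()     # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)      # variance of total scores
    return k / (k - 1) * (1 - item_var / total_var)

def alpha_if_dropped(items):
    """Alpha for each leave-one-item-out subscale."""
    k = np.asarray(items).shape[1]
    return [cronbach_alpha(np.delete(items, j, axis=1)) for j in range(k)]
```

With a six-item scale one can report the full-scale alpha, any of the six leave-one-out alphas, or a hand-picked subscale, and this is exactly the sort of flexibility that makes a single reported p-value hard to interpret.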

4. My first reaction when seeing an analysis of young men’s biceps size is that this could be a proxy for age. And, indeed, for the analyses from the two countries where the samples were college students, when age is thrown into the model, the coefficient for biceps size (or, as the authors put it, “upper-body strength”) goes away.

But then comes the big problem. The key coefficient is the interaction between biceps size and socioeconomic status. But the analyses don’t adjust for the interaction between age and socioeconomic status. Now, it’s well known that political attitudes and political commitments change around that time: people start voting, and their attitudes become more partisan. I suppose Petersen et al. might argue that all this is simply a product of changing upper-body strength, but to me such a claim would be more than a bit of a stretch.
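Here is a small simulation (entirely made-up numbers, not the paper’s data) illustrating the concern: if attitudes actually move with an age-by-SES interaction, and biceps size tracks age, then a regression that includes strength × SES but omits age × SES will attribute the age effect to strength.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(18.0, 25.0, n)
ses = rng.normal(size=n)
strength = 0.5 * age + rng.normal(size=n)        # biceps tracks age in this cohort
attitude = 0.3 * age * ses + rng.normal(size=n)  # true effect: age x SES only

def ols(y, *cols):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    X = np.column_stack((np.ones(len(y)),) + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Model omitting age: the strength x SES term soaks up the age x SES effect.
b_omit = ols(attitude, strength, ses, strength * ses)
# Adding age and age x SES pushes the strength x SES coefficient back toward zero.
b_full = ols(attitude, strength, ses, strength * ses, age, age * ses)
```

In this setup the strength × SES coefficient in the first model looks solidly nonzero even though strength has no effect at all; the second model recovers the truth.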

This is a general problem with the language of regression modeling, which leads researchers to think that including a variable in a regression “controls for it” so that they can interpret the remaining coefficients causally.

5. I agree with Alper that the authors should’ve presented raw data. For example, Figure 1 could easily have included points showing average support for income redistribution for respondents broken into bins characterized by SES and biceps size. The dots could be connected into lines; thus, for each of their graphs you’d see three lines showing avg attitude vs. SES for respondents in the lower, middle, and upper terciles of biceps size. Such a graph would still have the problem of being contaminated by correlation between age and biceps size, but at least it would show the basic patterns in the data.
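A sketch of the binning just described (Python; the variable names are mine, and the study’s actual data would go in). Each row of the result is one of the three lines such a figure would show:

```python
import numpy as np

def binned_means(ses, strength, attitude, n_ses_bins=5):
    """Average attitude by SES bin, within terciles of biceps size.

    Returns a (3, n_ses_bins) array: rows are lower/middle/upper
    strength terciles, columns are SES bins from low to high.
    """
    ses, strength, attitude = map(np.asarray, (ses, strength, attitude))
    terc = np.searchsorted(np.quantile(strength, [1/3, 2/3]), strength)
    ses_edges = np.quantile(ses, np.linspace(0, 1, n_ses_bins + 1))
    ses_bin = np.clip(np.searchsorted(ses_edges, ses, side="right") - 1,
                      0, n_ses_bins - 1)
    out = np.full((3, n_ses_bins), np.nan)
    for t in range(3):
        for b in range(n_ses_bins):
            mask = (terc == t) & (ses_bin == b)
            if mask.any():
                out[t, b] = attitude[mask].mean()
    return out
```

Feeding each row to any plotting library as a line over the SES bin midpoints would give the three-line plot, with the raw binned averages visible instead of only a fitted regression line.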

6. Finally, the authors engage in the usual tabloid practice of dramatically overselling their findings. What they actually found were some correlations among three samples, two of which were of college students. But their abstract says nothing about college students, instead presenting their claims entirely generally, referring only to “men,” never to “young men” or “students.” And then there is the causal language. The abstract is clean here (they use the term “predicted” rather than “caused”) but later on they unambiguously write, “Does upper-body strength influence support for economic redistribution in men? Yes.” Such a statement is simply wrong. Or, to be more precise, it could be correct but it’s not good scientific practice to make such a causal claim based on a correlation. Later they write, “Does upper-body strength influence support for economic redistribution in women? No.” This statement is even more wrong, in that, even if you accept their causal interpretation, you still have to remember that lack of statistical significance is not the same as a zero effect.

Then they go deep into story time: “the results indicate that physically stronger males (rich and poor) are more prone to bargain in their own self-interest . . .” Also recall that quote above, where Petersen claimed that their results say something about how “individuals reason about welfare redistribution.”

And, from the conclusion of the paper, here are all the overstatements at once:

We showed that upper-body strength in modern adult men influences their willingness to bargain in their own self-interest over income and wealth redistribution. These effects were replicated across cultures and, as expected, found only among males. The effects in the Danish sample were especially informative because it was a large and representative national sample.

Actually, I didn’t see anything in the data about bargaining, nor are the causal claims supported by the analysis, nor do college students represent “modern adult men.” A careful reader may have noted that the U.S. and Argentina samples were students, but the authors managed to get through the abstract, intro, and conclusion without mentioning this restriction.

Again, I don’t think any malign intent among the authors is required here. They believe their story (of course they do, that’s why they’re putting in the effort to study it), and so it’s natural for them, when reflecting on problems of measurement, causal identification, and representativeness of the sample, to see these as minor nuisances rather than as fundamental challenges to their interpretation of their data.

I have mixed feelings about criticizing this sort of study

On one hand, it’s a seriously flawed exercise in headline bait that is presented as scientifically definitive. On the other hand, you have to start somewhere. In the modern academic environment with the option of immediate publication, it’s too much to expect that a group of researchers would sit quietly for years re-designing and replicating their experiments, looking at all their claims with a critical eye, and plugging all the holes in their arguments, before submitting their paper for publication. Indeed, arguably it’s even better to publish these sorts of preliminary results right away so as to engage the larger scientific research community (and to get the sort of free post-publication peer review you’re seeing right here). Ideally, I think, they’d publish this sort of thing in Plos-One, with space in a top journal such as Psychological Science reserved for more careful work. Or, if the top journal Psychological Science really wants to publish this material (cutting-edge research and all that), it could have a section of each issue clearly labeled Speculations, so that media and other outsiders wouldn’t be misled into taking the article’s claims too seriously.

Just to say it again: it’s easy for me to stand on the sidelines taking potshots at other people’s work. It’s pretty clear that the work of people like me who stand around criticizing statistical analyses is relevant only in the context of the work of the people who do the actual research studies. The question is: how can statistical understanding better work its way into the applied research community? Traditionally we rely on referees to point out issues of measurement, causal identification, and representativeness—but that approach clearly isn’t working here. This blog provides some post-publication peer review, but it’s not attached to the article itself. I could try writing this up as a letter to the editor for the journal, but my impression is that editors don’t like to run critiques of papers they have published. And, again, there’s an important role in science for speculative studies. It would be a mistake for the system to be run so strictly that only airtight findings get published.

Public data

The other big, big thing is that the data should already be public. As discussed above, one can come up with all sorts of explanations of their findings in this paper, and it would be good if other researchers (or even bloggers!) could try out their own analyses. The survey data could be anonymized (if they are not already) so that confidentiality wouldn’t be an issue.

36 thoughts on “Another one of those “Psychological Science” papers (this time on biceps size and political attitudes among college students)”

  1. It’s pretty clear that the people who do the research are contributing more to science than people like me who stand around criticizing statistical analyses.

    Not sure if you are being disingenuous here, but i disagree entirely (and I bet Basebøll does too). I learn a lot more from a post like this than an article like that!

    • David:

Here’s what I was trying to say: A given post of mine might contribute more than a particular published article. But overall I am providing a supporting role. My posts would not help much in the absence of the original research. I’ll try to reword.

      • My take on Popper is: all we really know is what isn’t so. In that case, those who poke holes in the claims coming off the conveyor belt at the association-manufacturing facilities do more to advance the cause than the “researchers” who run the place.

Psychological Science publishes critical commentaries on their papers on a regular basis, including convincing failed replications, reanalyses, and conceptual concerns. Despite the flak they get, they appear more self-critical than other journals with Science in their titles.

  2. > referees to point out issues of measurement, causal identification, and representativeness—but that approach clearly isn’t working here

    It likely would severely reduce the number of publishable articles?
The last time a colleague did a blinded review of a proposed epidemiological study, they pointed out clearly problematic issues of measurement, causal identification, and representativeness. It was later unblinded and discussed with them and the epidemiologists’ managers, where the managers seemed to agree with the problems but suggested the reviewer was expecting far too much to be done to lessen them. They later found out that the lead epidemiologist on the study was their _star_ epidemiologist.

If we had a good blinded survey assessment of the percent of practicing epidemiologists who could adequately address these 3 issues, and then the number of practicing statisticians who work with epidemiologists who could provide adequate advice/support, we would have some sense of how much to expect versus how much we have to live with to hear about the studies at all.

    • Keith:

      Yes, but the funny thing is that journals such as Psychological Science are really hard to publish in, and there’s a big competition to get papers in. So I don’t think they have a shortage of articles that they can publish.

      • But does anyone have a sense of what percentage of articles submitted have handled issues of measurement, causal identification, and representativeness adequately?

Or better still – the percentage they could attract if they offered to make this the primary criterion?

        • Snarky response (with big time Econo-arrogance): a noisy estimate of the fraction of well-considered empirical papers published in Epi journals might be the fraction of papers published each year in top-end applied economics journals over top-end empirical epidemiology journals, normalized by the number of practitioners in each. Or said another way… 1 per researcher per year.

          Less snarky (with only some Econo-hubris): As Andrew has pointed out, economists publish maybe 1/10th as many papers as researchers in other stats-based disciplines, and part of that is because the bar is so high – it is also why our papers are about 5-10 times as long as Epi papers, because you have to convince everyone you have dealt with measurement and identification issues as best you can. But I guess my point is, every time I read, say, an Epidemiology paper using 8 rounds of Demographic and Health Survey data from 3 different countries, and they show 3 tables, some summary stats, and use some poorly described but strangely complex analytic technique, I think to myself: there is no possible way that these researchers have really thought about this problem sufficiently from a stats perspective.

So I guess I just don’t think that the field of Epi (or Psych, or Nutrition or Public Health or whatever) has a publishing standard designed to deal with complex statistical problems where the authors aren’t just comparing Treatment and Control from a Randomized Evaluation, but actually need to make a statistical argument…you know, with caveats and nuance and reflection and whatnot. Then again, we got plenty (I mean it, PLENTY) of problems with Econ papers (even just the empirical ones with good methods and non-coding-errored data), so maybe we aren’t the model. My point is just that the sheer volume of papers in other fields using advanced techniques and non-experimental data gives me immediate pause as to how useful many of them are.

          I’ll take any hits I deserve for being an Imperialist Econo-jerk, but I think we are on to something with the fewer/longer papers model.

        • Jrc:

          Yes, but I’ve always thought of psychology as being a field like economics, where researchers spend a great deal of time and effort on a single paper. Even the (perhaps)-debunked Dennis-the-Dentist paper included something like 10 different studies.

          This new breed of psychology paper (as illustrated by this one and the earlier paper on menstrual cycles) seems like a bad move, in that the psychologists are becoming more willing to do little bits of research and call it a paper. Of course it doesn’t help that the top journal in the field is publishing these things; that’ll just encourage more people to submit such papers.

          That said, I’m not one to talk, given that I publish little papers all the time.

        • Yeah. I’ve not read a ton in Psych, so I probably should’ve stuck with the health field papers I know better. That said, there is something very, very nice about a GOOD “little paper”, one that shows something interesting, puts it in context in the literature, and interprets the results humbly but suggestively (and is preferably followed up with a “big paper” if the results stand up to further scrutiny).

          This paper maybe could have been that, but looking at the abstract, it was probably designed specifically not to be:

          “Because personal upper-body strength is irrelevant to payoffs from economic policies in modern mass democracies, the continuing role of strength suggests that modern political decision making is shaped by an evolved psychology designed for small-scale groups.”

          Now there is some Science for you.

        • I don’t think econ is much better. Just because you use IV doesn’t mean you’re using a valid instrument. Witness the recent b.s. with distance from Africa as an “instrument” for genetic variation. There seems to be a misplaced sense of arrogance that because one is using methods _intended_ for causal inference (and other fields haven’t caught on yet as to how IV/RD/etc. can go wrong), that one is actually succeeding in estimating causal effects. It may be a little better than psych and a lot of the epi literature, but that’s a pretty low bar.

          I wonder if there’s something akin to the grade inflation thing going on where the incentives are all pointing in a direction which is antithetical to the stated mission of the institution. There’s a lot of incentive to publish a lot of garbage on observational data that generates very little new knowledge. There’s very little incentive to put in the orders of magnitude more effort required to unambiguously isolate a causal relationship.

        • Anonymous,

I agree with you about bad IV. Really bad econ research just uses some “causal method” or another, throws it at the data, and does a few more robustness checks than a bad Epi paper. But the really good papers don’t. Some are experimental (Ted Miguel on worms, Hilary Hoynes on welfare reform), some are quasi-experimental difference-in-difference (David Card’s earlier works), some are really good IV (see Esther Duflo’s paper on Indonesian school expansion…which actually uses a DnD and an IV in a plausibly quasi-experimental setting). I think that the best econ papers using observational methods are better than the best papers in other fields (papers, not findings or research designs).

          Maybe this isn’t a fair comparison. Maybe I should compare median-quality Econ with median-quality Epi. In that case, I still think Econ comes out ahead in that at least we are trained how to regurgitate the basic assumptions of the powerful-but-dangerous methods we use – even if we don’t always do our best to think about if those assumptions hold. But this is a good segue to your second point about whether this is an incentive problem (your hypothesis) or an institutional one (the hypothesis I provided):

          When I read, say, the Lancet (high quality), I’m almost never convinced that a paper has gotten something right, unless it is a pure, simple experiment, and even then the sample sizes are so small and covariate sets so odd that I don’t get much from it. And if it is an observational paper, I’m almost always highly skeptical, just because they can’t show enough (check out Hoddinott’s work on Nutrition and Econ in Guatemala…the Lancet version is near useless in judging the quality of the work, but I’ve read Econ versions and working papers that were very convincing). So partly I think it is the publishing platform that is the problem. And partly it’s the incentives, as you say, because “Why bother, and I need tenure/degree/promotion/more-publications-than-that-other-guy.” You know…scientists are people, so they’re competitive jerks, and they cut corners, and they try to get away with things.

Very much agree with your second paragraph about the (perceived) incentives being far from ideal. That’s long been the case with RCTs in clinical research but observational studies are much harder. I once put it as the two guys being chased by a bear. The first says “we can’t run faster than a bear!”, the second replies “I only have to run faster than you”. I don’t believe observational studies can be adequately analysed (in the same way RCTs can), so the analyses can only be noticeably less inadequate. More importantly, as Jamie Robins and others have emphasized, they simply can’t stand as islands on their own but are published (mostly) as single (definitive) studies with the data and analysis (usually) totally unavailable.

What are the rewards for less inadequate analyses and efforts to share data – apparently very little. As David Dunson once told me, some epidemiologists he has talked to have explicitly done these calculations and decided to stick with the “fast to publication” highly inadequate ones – it’s better for their career.

        • Keith:

Also consider the problem with classical power calculations. These are typically framed as a matter of cost: you don’t want to waste money (and, possibly, lives) on an underpowered study that won’t yield anything useful. But the implication is that, if you run an underpowered study and get statistical significance, you’ve gotten lucky and learned something big. More likely, you’ve gotten lucky and science has gotten unlucky, and a bit of noise will be published as if it represents truth.
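A quick simulation of that point (the numbers are mine, just for illustration): take a small true effect and an underpowered design, keep only the “lucky” statistically significant replications, and look at how much the surviving estimates exaggerate the truth.

```python
import numpy as np

rng = np.random.default_rng(2)
true_effect, sd, n, sims = 0.1, 1.0, 50, 20000
se = sd / np.sqrt(n)                      # standard error of the sample mean
est = rng.normal(true_effect, se, sims)   # estimate from each simulated study
signif = np.abs(est) > 1.96 * se          # studies reaching p < .05
power = signif.mean()                     # roughly 10% with these numbers
exaggeration = np.abs(est[signif]).mean() / true_effect
```

With these numbers, every significant estimate is at least 1.96 × se ≈ 0.28 in magnitude, almost three times the true effect of 0.1 — so the lucky study that reaches significance necessarily overstates what it found.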

        • Economists seem often proud of the length of their papers, the crazy long times the field takes to review and finally publish, the long parallel limbo of “working papers” and their somewhat lower productivity in terms of papers / year.

As an outsider, these always seemed serious bugs to me. But somehow the insiders like to think of them as features.

3. Yes, I came across this case a little while ago and chalked it up to another overreach of evolutionary psych. That is, quite aside from the statistics, it’s the presumption that some such spin is valid.

4. What’s striking to me is how lame the main idea is. Anyone could come up with hundreds of hypotheses on the level of “biceps vs political attitudes” in about an hour and then spend decades churning out these kinds of papers. But why would anyone want to do that? It’s such a mind-numbingly tedious and wasteful way to spend a life.

    No one cares about biceps vs political attitudes. They want to know the big questions like “exactly which postal workers are going to go crazy and shoot everyone?” or “when does consciousness start?” or some such. It’s like economists who use supply-n-demand models to predict when a prom date buys a corsage for their gal. No one cares. We just want an accurate prediction for GDP/Unemployment years from now given major policy changes.

Researchers should go big or go home. Even if most researchers aren’t good enough to put a dent in the big questions, I’d rather they tried and failed than churn out these papers.

    • Entsophy:

      I’m not a fan of this particular paper but I disagree with your general point. Big ideas and big problems are important, but (a) smaller problems can be somewhat important, and (b) maybe we can gain insight into big problems via specific things we’ve learned from small studies. Consider, for example, the “tabletop science” work of Kahneman and Tversky: small findings with big implications.

      My problem with the Petersen et al. paper is how they overstated everything. Mostly they did some surveys on college students. They talk about physical strength and human reasoning but they did not measure physical strength or human reasoning. I think it would’ve been better for them to either design a study to get more definitive results or to publish what they had in a direct, non-exaggerated way.

Kahneman and Tversky is a great example. Do you think their work would have led to big implications if they had spent their energies publishing papers like “Men who wear wife-beater shirts vote for women candidates less”?

Some research requires real brilliance. For example, Galois’s work on the solvability of higher-order polynomial equations (all done before the age of 21). Kahneman and Tversky’s work wasn’t as brilliant as that, either in conception or execution, which means there are lots of smart people running around who are quite capable of doing something similar.

        Obviously some big problems need to be broken into a series of smaller chunks, but Researchers, even the mediocre ones, should follow Kahneman and Tversky’s example: go big or go home.

        • Agree 100% (+/- 4%, 0.95 LC).

The Ripley’s-Believe-It-Or-Not (aka “WTF?”) conception of psychology that treats studies like these as enlarging knowledge is much more disturbing than the (admittedly disturbing and common-place) methodological problems Andrew and others have identified.

    • “No one cares about biceps vs political attitudes.”

      I care. I’ve been fascinated by this question for decades. Here’s something I wrote on these kind of studies last year:

      This correlation between male muscularity and politics seems plausible to me, especially with the researchers’ clever distinction between proclaimed ideology and political self-interest. (I would expect that strength also correlates with solidarity, that team spirit is stronger on the football field than on, say, the tennis team.)

      For example, the rare out-of-the-closet Republican in Hollywood is typically an action movie star.

      Likewise, the strong right arm of the Democratic Party was long a beefy union guy in a windbreaker. Or, in the case of my late father-in-law, the classical tuba-playing head of the Chicago Federation of Musicians, a beefy union guy in a tuxedo. To weedier musicians, he looked like what he was: a big man who wouldn’t back down in negotiations with the bosses.

      In contrast, liberal college professors are frequently ectomorphic runners.

      This study raises the follow-on question of whether political predilections are in-born, or if changes in exercise routines can influence opinions.

      I often read liberals lamenting how much, holding demographics equal, the country has shifted to the right since the good old days of the mid-1970s. (Note: the mid-1970s may not have been that good for you.)

      It occurs to me now that 1974, when the Democrats swept Congress, might have been the skinniest year in recent American history. The jogging craze had been kicked off by Frank Shorter’s gold medal in the 1972 Olympic marathon. And weightlifting was completely out of fashion, endorsed mostly by weirdoes like that freakish Austrian bodybuilder with all those consonants in his name.

      I don’t know how to explain to younger people just how absurd the idea that muscle man Arnold Schwarzenegger would someday be elected governor of California would have struck people in 1974. By 1984, however, a profile of Schwarzenegger in Rolling Stone wisely devoted a paragraph to explaining that Arnold was Constitutionally ineligible to become President. …

      My point, though, is that the proposition that different types of exercise could drive political views could be ethically tested on college students by offering free personal trainers. Randomly assign some volunteers to the weightlifting trainer, others to the running trainer, and measure if their attitudes change along with their shapes.


      • Steve:

        I agree that the topic is potentially worth studying; I just didn’t think the paper in question was a very serious piece of research, certainly not given all its grandiose claims that go so far beyond the data.

  5. I don’t want to spend $35 to read the whole paper. Did they actually measure upper body strength? From the discussions that I’ve seen they (or their surrogates) measured upper arm circumference. The paper would sound very different if it claimed that fat guys have different politics than skinny guys.
In the interest of full disclosure, I am a stereotypical skinny, weak, nerdy liberal, but I have achieved a 1% income as well as the affection of some hot ladies.

    • Yup, it’s arm circumference.

      And, don’t worry about it, if you’re not a college student you’re not in the population described by the study. One of the many annoying features of this article and its press is that they buried this particular restriction of their data.

      • My, oh my!
I know that there are difficult epistemological problems in social science research. However, it strikes me as very strange to say that A correlates with B if one has not actually measured A.
        Dr. Gelman, can you tell if the authors are deceived or deceiving?

        • I think they’re just imprecise. Statisticians such as myself are obsessed with precision; evolutionary psychologists, maybe not so much so.

  6. Pingback: A week of links - Evolving Economics

7. To follow up on Slugger’s criticism above, I am skeptical of their “upper-body strength” proxy… Did they simply use arm circumference? Couldn’t you conceivably have weak upper-body strength but a large arm circumference if you have a high body fat percentage? And vice versa?

    • Afriedman:

      They controlled for BMI in their regression models, but of course “controlling” for a variable is no substitute for careful and direct measurement.

  8. You guys obviously aren’t familiar with the literature in this field.

It has been proven (i.e., published in a peer-reviewed journal) that getting drunk promotes conservative thought: Eidelman, S., Crandall, C.S., Goodman, J.A. & Blanchar, J.C. Low-Effort Thought Promotes Political Conservatism. Pers Soc Psychol B (2012).

Getting drunk, particularly if you are a college student, involves repeatedly lifting a full (and then less full but whatever) pint glass to your lips.

    Getting drunk repeatedly would thus result in one being a very strong conservative.

    Why all this skepticism?

  9. Pingback: I’ve got your missing links right here (01 June 2013) – Phenomena: Not Exactly Rocket Science

  10. Pingback: Links 6/4/13 | Mike the Mad Biologist

  11. Pingback: The old paradigm of a single definitive study in the social sciences should be abandoned | Impact of Social Sciences

Comments are closed.