How to fix the tabloids? Toward replicable social science research

Posted on May 31, 2013 9:40 AM by Andrew

This seems to be the topic of the week. Yesterday I posted on the sister blog some further thoughts on those “Psychological Science” papers on menstrual cycles, biceps size, and political attitudes, tied to a horrible press release from the journal Psychological Science hyping the biceps and politics study.

Then I was pointed to these suggestions from Richard Lucas and M. Brent Donnellan on improving the replicability and reproducibility of research published in the Journal of Research in Personality:

It goes without saying that editors of scientific journals strive to publish research that is not only theoretically interesting but also methodologically rigorous. The goal is to select papers that advance the field. Accordingly, editors want to publish findings that can be reproduced and replicated by other scientists. Unfortunately, there has been a recent “crisis in confidence” among psychologists about the quality of psychological research (Pashler & Wagenmakers, 2012). High-profile cases of repeated failures to replicate widely accepted findings, documented examples of questionable research practices, and a few cases of outright fraud have led some to question whether there are systemic problems in the way that research is conducted and evaluated by the scholarly community. . . .

In an ideal world—one with limitless resources—the path forward would be clear. . . . In reality, time and money are limited . . . Once the reality of limited resources is acknowledged, then agreement about the precise steps that should be taken is harder to attain. Our view is that at least in the initial stages of methodological reform, we should target those changes that bring the most bang for the buck. . . .

These are good points with which I largely agree but maybe it’s not so simple. Even in a world of unlimited resources, I don’t think there’d be complete agreement on what to do about the replicability crisis. Consider all the cases where journals have flat-out refused to run correction letters, non-replications, and the like. A commenter recently pointed to an example from Richard Sproat. Stan Liebowitz has a story. And there are many others, along with various bystanders who reflexively defend fraudulent research and analogize retractions of flawed papers to “medieval torture.” Between defensiveness, publicity seeking, and happy talk, there’s a lot of individual and institutional opposition to reform.

Lucas and Donnellan continue:

First, a major problem in the field has been small sample sizes and a general lack of power and precision (Cohen, 1962). This not only leads to problems detecting effects that actually exist, it also results in lower precision in parameter estimates and systematically inflated effect size estimates. . . . Furthermore, running large numbers of weakly powered studies increases the chance of obtaining artifactual results

This is all fine, but in addition, low-powered studies have high Type S error rates; that is, any statistically significant claims have a high probability of being in the wrong direction. Thus, the problem of low-powered studies is not just that they have problems detecting effects that actually exist, but also that they apparently “detect” results in the wrong direction. And, contrary to what might be implied by the last sentence above, it is not necessary to run large numbers of weakly powered studies to get artifactual results (i.e., Type S errors). Running just one study is enough, because with enough effort you can get statistical significance out of just about any dataset!
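To make the Type S point concrete, here is a minimal simulation sketch. It assumes a hypothetical study whose estimate is normally distributed around a small true effect (0.1) with a standard error ten times larger (1.0); those numbers, and the function name `simulate_type_s_m`, are illustrative assumptions, not taken from any of the papers discussed. It keeps only the estimates that reach two-sided 5% significance and asks how often they have the wrong sign and how much they overstate the true effect (the “systematically inflated effect size estimates” in the quoted passage).

```python
import random
import statistics


def simulate_type_s_m(true_effect=0.1, se=1.0, n_sims=100_000, seed=1):
    """Draw estimates ~ Normal(true_effect, se) and summarize the significant ones.

    true_effect and se are illustrative assumptions for a low-powered study.
    """
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% threshold under a normal approximation
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)
        if abs(estimate) > z_crit * se:
            significant.append(estimate)

    # Power: share of simulated studies reaching statistical significance
    power = len(significant) / n_sims
    # Type S rate: share of significant estimates with the wrong sign
    type_s = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    # Exaggeration ratio: mean magnitude of significant estimates relative to the truth
    exaggeration = statistics.fmean([abs(e) for e in significant]) / abs(true_effect)
    return power, type_s, exaggeration


power, type_s, exaggeration = simulate_type_s_m()
print(f"power={power:.3f}  type_s={type_s:.3f}  exaggeration={exaggeration:.1f}")
```

Under these assumed numbers the power is only slightly above 5%, so a single significant result says very little about even the direction of the effect. Changing `true_effect` or `se` shows how quickly the problem fades as the study becomes better powered.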

I’m not making these comments out of a desire to be picky, just trying to clarify a couple of issues that have arisen lately, as psychometrics as a field has moved beyond a narrow view of file-drawer effects into an awareness of the larger problems of p-hacking. I think it’s important to realize that the problem isn’t just that researchers are “cheating” with their p-values (which might imply that all could be solved via an appropriate multiple comparisons correction) but rather that the old paradigm of a single definitive study (the paradigm which, I think, remains dominant in psychology and in statistics, even in my own articles and books!) should be abandoned.

A theory of the tabloids

By the way, how did Science and Nature become so strongly associated with weak, overly-hyped social science research? Has this always been the case? I don’t know, but here’s a (completely speculative) theory about how this could have happened.

The story goes like this. Papers in Science and Nature are short. The paradigmatic paper might be: We constructed a compound that cures cancer in mice. The underlying experiment is a randomized controlled study of a bunch of mice, there’s also a picture of slides showing the live and dead cancer cells, and the entire experiment was replicated in another lab (hence the 50 coauthors on the paper). It’s a short crisp paper, but underlying it are three years of research and a definitive experiment. Or, if it’s a physics paper, there might be a log-log plot of some sort. More recently we’ve been seeing papers on imaging. These are often on shakier ground (Vul and all that), but if done carefully they can result in valid population inference given the people in the study.

In social science, though, we usually can’t do definitive experiments. The relevant data are typically observational, and it’s difficult to design an experiment that plausibly generalizes to the real world. Effects typically vary a lot across people, which means that you can’t necessarily trust inferences from a convenience sample, and you also have to worry about generalizing from results obtained under particular conditions on a particular date.

But . . . people can still write short crisp papers that look like Science and Nature papers. And I think this might be the problem with Science, Nature, Psychological Science, and any other “tabloid” journals that might be out there. People submit social science papers that have the look of legitimate scientific papers. But, instead of the crisp tabloid paper being a concise summary of years of careful research, it’s a quickie job, a false front.

A place for little studies

I’d also like to repeat the point that there’s nothing wrong with writing a paper with an inconclusive observational study coupled with speculative conclusions. This sort of thing can go on to arXiv or PLOS ONE or any number of specialized journals. Researchers A and B publish a speculative paper based on data from a convenience sample, researchers C and D publish their attempted replications, E and F publish criticisms, and so forth. The problem is that Science, Nature, Psychological Science, etc. publish quickie papers, and so there’s a motivation to send stuff there, and this in turn devalues the papers that don’t make it into the top journals.

Currently, journals hold criticisms and replications to such a high standard of publication that papers with errors often just stand by themselves in the scientific literature. Publishing a criticism can require a ridiculous amount of effort. Perhaps blogs are a problem here in that they provide an alternative outlet for the pressure of criticism. If I were not able to reach many thousands of people each day with my blog, I’d probably be putting more effort into getting correction notices published in scientific papers, maybe my colleagues and I would already have created a Journal of Scientific Criticism, and so forth.

I hope that the occasional high-profile criticisms of flawed papers (for example, here) will serve as some incentive for researchers to get things right the first time, and to avoid labeling speculation as certainty.

35 thoughts on “How to fix the tabloids? Toward replicable social science research”

Jacob H on May 31, 2013 11:07 AM at 11:07 am said:

It seems to me that the Gelmanian approach of “statistics for understanding, not for proof” would be helpful to Evo. Psych. and similar fields. A descriptive study that showed lots and lots of levels, correlations and patterns among, say, voting behavior, economic characteristics, and physical characteristics would be interesting and informative without making claims about the direction or existence of any kind of causality or the particular etiology of the behaviors and beliefs described. It would also, insofar as it was suggestive of particular hypotheses that merited more conclusive research, set the bar higher for them, one would hope. If someone says, “A is correlated with B therefore [story about the Pleistocene Savanna]” the correct response will be, “Well, A is also correlated with Q R S T and V, as shown in last month’s study, so you’ll have to do better than that if you want to convincingly show something.”
- Rahul on June 1, 2013 5:09 AM at 5:09 am said:
  
  The part about “short crisp papers” was a bit confusing. Would those bad papers be more easily caught had they been long and verbose? Is it so easy to fool reviewers with “a look of a legitimate paper”?
  
  You can call Psychological Science what you want but clubbing it with Science and Nature seems unfair. Science and Nature are hardly tabloid journals. What’s your definition of Tabloid Journals?
  
  And do others agree with Andrew’s claim that “Science and Nature are strongly associated with weak, overly-hyped social science research”? I sure didn’t think so. If Science and Nature are bad which are the better journals?
  
  Perhaps, the weakness and over-hype is a more general and widespread quality but when it piggybacks on the fame and wide-reach of Nature-Science it just gets noticed a bit more than if those results had appeared in some Obscure Annals of Psych. Academics
  - Andrew on June 1, 2013 9:04 AM at 9:04 am said:
    
    Rahul:
    
    1. A short paper that summarizes a careful years-long research project is one thing. A short paper that actually represents a small project with inconclusive results is another thing entirely. Of course I’m not suggesting that people pad their papers with empty words. What I’m saying is that a short, highly speculative paper can look at first glance a lot like a short paper that is a summary of careful work.
    
    2. And yes, it is so easy to fool reviewers; that’s the point. I’m a reviewer myself. We do it for free. And the criteria for publication are not so clear. Some journal editors seem to want papers that will get attention. There are lots of excuses for publishing crap: if a paper is seen to be exciting, reviewers will give it the benefit of the doubt.
    
    3. When it comes to social science, Science and Nature are tabloid journals. We started using this term about a year ago. Yes, they publish good stuff too. So do newspaper tabloids. What makes these journals “tabloids” is not that they only publish sensationalist material but that they do it regularly.
    
    4. You ask, If Science and Nature are bad which are the better journals? I don’t know enough about psychology, but I’m guessing that mainstream journals such as Child Development publish strong material. In political science, we have APSR, AJPS, BJPS, POQ, QJPS, etc. I’m not saying all the papers in these journals are wonderful, but it’s my impression that they restrict themselves to serious work and that they don’t overhype their articles.
Stochastic Sam on May 31, 2013 11:17 AM at 11:17 am said:

Speaking as someone who works with mouse models of cancer, there is considerable evidence that the prototypical “definitive” sort of studies you cite (“we cured cancer in a mouse”) are frequently not reproducible. See for example Begley & Ellis Nature 2012, a highly publicized claim that Amgen was unable to substantiate the vast majority of 53 so-called “landmark” cancer papers that were interesting and highly cited. This general problem is particularly true in the area of so-called “translational science”, meaning results coming from basic science that are meant to suggest definite directions for clinical studies. Within sub-fields of cancer biology there are vigorous debates over most of the interesting results. I think it’s quite rare that a life sciences experiment would be considered definitive, particularly in areas related to disease biology.
- Sanjay Srivastava on May 31, 2013 1:32 PM at 1:32 pm said:
  
  Thank you for saying this. Paul Rozin wrote a really thoughtful paper a few years back arguing that the model of “hard” science that many psychologists pursue isn’t just inappropriate for psychology, it’s actually a caricature of what scientists in those other fields actually do:
  
  https://sites.sas.upenn.edu/rozin/files/socpsysci195pspr2001pap_0.pdf
Jacob H on May 31, 2013 11:28 AM at 11:28 am said:

I guess the (conveniently Evo-Psychy) counterargument to my suggestions above is that we hairless apes are “wired for stories” and find a pattern of interconnected numbers with no driving narrative as inscrutable as a landscape of distant trees, albeit a lot less pretty. I think this is ultimately why the much bemoaned info-viz is sometimes useful, by making us more content with staring at the numbers without immediately putting forth a single irreducible story, and being nudged towards observing multiple patterns, multiple alternative possibilities and simultaneously-true things.
K? O'Rourke on May 31, 2013 11:49 AM at 11:49 am said:

The whole publication process is ill designed and disingenuous for adequately cumulating evidence.

When folks can compel evidence (e.g. FDA for drugs) they try to disregard publications at all reasonable cost and instead thoroughly consider and audit what the evidence generators had planned to generate, actually did generate and how they processed it. Or if they have lots of money, they try to reproduce in house, taking the original published source as hearsay (maybe true). Some co-operative (international) clinical research groups try to emulate some combination of the preceding as best they can.

Also agree with Stochastic Sam that the animal studies (though obviously easier to do in an ideal manner) actually are often done very poorly indeed.
Mayo on May 31, 2013 11:54 AM at 11:54 am said:

As was documented in the Tilburg report and elsewhere, the Journals encourage sexy narratives and cutting out distracting caveats. I’m for a Journal of Scientific Criticism, or at least a newsletter.
Amaya Zombee on May 31, 2013 12:49 PM at 12:49 pm said:

Lots of good points there!

But I don’t think that a majority of the social/behavioral science in Science and Nature _is_ observational, as you suggest. A lot is experimental.

Prior to about 1990, most psychology in Science and Nature was from perception and cognition fields, and used within-subjects designs with a lot of statistical power. There were a few unreplicable results, due to emphasis on surprisingness and brevity, but most results were real and much of it was good and important.

A random but fairly typical example:

http://psycnet.apa.org/index.cfm?fa=search.displayRecord&UID=1988-34802-001

Then the editors got bored with that kind of stuff, and instead became entranced with Social Cognition and the romance of the Mysterious Unconscious, and in this new era a more typical study is:

http://www.sciencemag.org/content/314/5802/1154.short

One may speculate that the editors felt that general mass media attention for their magazine was desirable, and picked articles likely to attract that.
R on May 31, 2013 1:23 PM at 1:23 pm said:

What I wonder when reading all of this is what the requirements are for journals to be called “scientific”? I assume that criteria for publications in scientific journals are things like using a certain general lay-out, use of references, statistics, and some logical reasoning.

But here’s the thing I subsequently wonder: if these published findings are so hard to reproduce (and so in turn the references used may not point to true findings, or valid conclusions), and if published articles contain (sometimes blatantly obvious) statistical errors (e.g. see Tilburg report), and illogical reasoning, what then exactly makes it a “scientific” article or journal, in the sense that the findings/conclusions in it optimally contribute to knowledge about how the world/things/people work?

It seems to me that journals should set the bar way, way higher for their publications, and all journals should have an option for online comments to quickly point to possible inaccuracies. I also never understand the “In reality, time and money are limited” argument. I mean, I assume that a lot of time and money gets wasted by relying on/building on unreliable or incorrect findings and conclusions, so wouldn’t it actually save time and money if the bar were set higher?
Entsophy on May 31, 2013 4:04 PM at 4:04 pm said:

I predict once every paper is “methodologically rigorous” the reproducibility problems will still be there.

The “methodologically rigorous” way to make inferences about a population of ~10^8 is to take a random sample of ~10^3. But in most real problems the sample space isn’t the ~10^8 population it’s really something like a product space: “Population” times “current environment” or whatever. Because of the combinatorial way these sample spaces explode they usually have a ridiculous number of elements. Say 10^100 for example. So you can take all the “random samples” of size ~10^3 you want, or even ~10^5, you’re still liable to find out that you didn’t learn as much about that true 10^100 sample space as you thought.

Or take the gold standard: Random Trials. No matter what the (entirely deterministic) random number generator spits out to determine who’s in the treatment and control groups, there will be an enormous number of differences between the two groups. The real question is whether any of those differences affect the thing we’re trying to study. If they do, then when you repeat the experiment and the random assignment introduces different systematic differences you’re going to get a different result. Whether the study was “methodologically rigorous” doesn’t actually address that problem much.
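The claim that any one random assignment leaves many differences between the groups can be sketched with a small simulation. Everything here is an illustrative assumption (200 people, 1,000 covariates that are pure noise, a normal-approximation threshold of 1.96, and the hypothetical function name `count_imbalanced_covariates`): it randomizes once and counts how many covariates differ “significantly” between treatment and control.

```python
import math
import random
import statistics


def count_imbalanced_covariates(n_people=200, n_covariates=1000, seed=2):
    """Randomize once, then count noise covariates that differ 'significantly'.

    All sizes are illustrative assumptions; every covariate is generated
    independently of the assignment, so any imbalance is due to chance alone.
    """
    rng = random.Random(seed)
    ids = list(range(n_people))
    rng.shuffle(ids)
    treated = set(ids[: n_people // 2])  # one fixed random assignment

    imbalanced = 0
    for _ in range(n_covariates):
        x = [rng.gauss(0, 1) for _ in range(n_people)]
        t = [x[i] for i in range(n_people) if i in treated]
        c = [x[i] for i in range(n_people) if i not in treated]
        diff = statistics.fmean(t) - statistics.fmean(c)
        se = math.sqrt(statistics.variance(t) / len(t) + statistics.variance(c) / len(c))
        if abs(diff / se) > 1.96:
            imbalanced += 1
    return imbalanced


imbalanced = count_imbalanced_covariates()
print(f"{imbalanced} of 1000 noise covariates differ at the 5% level")
```

Roughly one covariate in twenty should come out imbalanced in a run like this. What the sketch cannot show is the question the comment actually raises: whether any of those chance imbalances matter for the outcome being studied.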

Even if all of that gets solved, then there’s the problem that frequency type data contains very little information in truth. Most of the information about what caused the data gets “washed out” and even if you extract all the information available you usually learn very little about the trillions of details that caused the data. Physicists spent centuries creating far more predictive theories using a minuscule amount of data compared to modern psychologists. The physicists’ “data” wasn’t frequency data though. The equations of rigid body motion for example weren’t discovered by examining the percentage of time a coin comes up heads.

The fact is hypothesis like “biceps vs politics” are extraordinarily easily to create. Since anyone can generate hundreds of these things, then all they have to do is learn the nearly meaningless statistical incantations from an introductory Frequentist text book and viola, they have a promising academic career. The whole thing is a scam which, from the viewpoint of multiple decades, doesn’t seem to lead to many real advances and mostly just crowds out those trying to do real work.
- pk on May 31, 2013 4:43 PM at 4:43 pm said:
  
  “The fact is hypothesis like “biceps vs politics” are extraordinarily easily to create.”
  
  How about the following one: I am assuming that the paper is a joke. Nevertheless, the literature on economic growth is full of such junk.
  - pk on May 31, 2013 4:43 PM at 4:43 pm said:
    
    The link did not show up: https://helda.helsinki.fi/bitstream/handle/10138/27239/maleorga.pdf
    - Anonymous on July 26, 2013 1:58 PM at 1:58 pm said:
      
      Not a joke. Found a Tim Harford column on this paper:
      http://www.ft.com/intl/cms/s/2/2b11d758-bd80-11e0-89fb-00144feabdc0.html
      
      Well, well. What are we to make of this? I asked Westling [author of the paper] how he would characterise his research paper, and he suggested the term “sardonic economics” – and, he added, “Scientifically, this paper is probably as worthless as much of contemporary economics.”
- K? O'Rourke on May 31, 2013 5:09 PM at 5:09 pm said:
  
  I think the most common concept of reproducibility (between independent studies, AKA replication) is that any observed result will be too far from what it should be only rarely. Almost exactly the same observed results are not expected (is this what you meant?).
  
  > learn the nearly meaningless statistical incantations from an introductory Frequentist text book and viola, they have a promising academic career
  
  My MBA policy prof told me about a year ago – this describes his current field, apparently most new faculty don’t even try to advise businesses and government anymore – that’s not so easy to get away with being so repeatedly wrong.
  - Nick Cox on May 31, 2013 6:04 PM at 6:04 pm said:
    
    viola? voilà!
proof is in the pudding on May 31, 2013 5:04 PM at 5:04 pm said:

“I predict once every paper is “methodologically rigorous” the reproducibility problems will still be there.”

That could, heck “should”, be a nice hypothesis to test! I would find it very interesting to see whether methodologically rigorous findings (e.g. based on increased sample sizes/“power”, and only “confirmatory” statistical analyses, etc.) would result in fewer reproducibility problems.
- Entsophy on May 31, 2013 5:20 PM at 5:20 pm said:
  
  Yes, in the spirit of Gelman’s “make a strong model and try to break it”, it was stated as a strong hypothesis with a definite implication. Finance would be a great place to test it out currently.
- Entsophy on May 31, 2013 6:14 PM at 6:14 pm said:
  
  Just to add clarification. The implicit notion that “methodologically rigorous -> reproducibility” is a theory. Frequentist intuition, which most Bayesians share in reality, strongly induces people into thinking it’s a fact. But it’s not a fact, it’s just a theory and it could be wrong.
  
  So by all means, let’s see just how true it is.
  
  But first let me shed some doubt on “methodologically rigorous -> reproducibility”. Everybody says that if Economists could just conduct more random trials, they’d be in business. So let’s say you randomly place people into a treatment group and placebo group. All the treatment group sit on one side of the room and placebo group sit on the other (or maybe they’re both done in the same room at different times). Well then that means these groups differed systematically on how close they were to the moon. If you try to avoid that problem, then they differ systematically on how close they are to the sun. There’s an essentially endless supply of these differences and these groups always differ on a very large number of measures.
  
  Is this difference enough to affect what you’re trying to study? Probably not, but who knows, the moon does affect some things on earth significantly after all. My point was that whether this is affecting the result has absolutely NOTHING to do with whether the study was “methodologically rigorous” as commonly understood in statistics classes. So why exactly would you expect “methodologically rigorous” to imply reproducibility?
  - jrc on May 31, 2013 6:57 PM at 6:57 pm said:
    
    The idea that the world is material is also an assumption – we assume we are not figments of G-d’s or someone else’s imagination.
    
    Closer to the matter at hand – we assume that the physical universe will have the same laws tomorrow as it did yesterday. There is no proof of that, other than it has happened like that every other day.
    
    So, with experiments, we assume that random assignment will balance observable and unobservable characteristics across treatment in such a manner that the difference in the two groups is caused by treatment. You found an observable that is different across treatment in the experimental space you invented, but one could argue that, given any indication that the moon might matter here, this was not “rigorous methods” but poor randomization.
    
    I don’t think that was your point – I think your point was that there are infinite things that could affect treatment. Fine and well. So you do another experiment and see if the result holds. And another. And at some point, we all agree that: Tomorrow that treatment will almost surely cause that result.
    
    My point is that “methodological rigor” will often lead to reproducibility if we believe that tomorrow will look like yesterday. Fair question whether economic forces are changing so much over time that this isn’t true, but I think you are looking for too solid a foundation. The foundation of all science is just that we accept Hume has a point, and move on, because every time I let the ball out of my hand, it falls to the floor. If someday it floats in mid-air, we’ll have to readjust our priors. Then again, I don’t believe in “Truth”, I just believe in stuff that works or doesn’t work. And I think that randomization works – not every time (it’s not supposed to), but a lot of the time.
    - K? O'Rourke on June 1, 2013 10:52 AM at 10:52 am said:
      
      CS Peirce argued that even though the world does change, if you keep taking random samples, you will not necessarily get increasingly wronger over time. (As an aside, Google’s (former?) head statistician (D Pregibon) used to try and sell clients on this).
      
      But randomisation is a mathematical rather than empirical finding – it guarantees that random samples from a population are “usually” like the population and therefore similar to each other. This can be demonstrated empirically – sort of – but that’s just ornamental or pedagogical.
      
      Now people trying to be more methodologically rigorous could well make things less reproducible (e.g. they don’t succeed)
    - Entsophy on June 1, 2013 1:37 PM at 1:37 pm said:
      
      K? O’Rourke,
      
      My entire point was that it might not be true in the real world. The empirical demonstrations you’re talking about are simulations especially designed to match the mathematics. Real world science data is a whole different ball game.
      
      I believe Bayesian statistics is basically forming best guesses about things. Whether something is a “best guess” by some explicit mathematical criterion can be determined deductively by someone sitting in their living room. But there’s absolutely no guarantee that the guess is right.
      
      Frequentists can’t live without that missing guarantee, so they imagine they’re mini-physicists modeling some real physical phenomenon called randomness. They really believe 95% CI have 95% coverage in the real world even though they seemingly never do. They really believe measuring devices throw off IID N(0, sigma) errors even though every big time frequentist that checked this was shocked to discover it’s rarely true and then scratched their heads to explain why NIID work so well in practice.
      
      And in this instance, everyone believes that if you randomize the treatment groups, you’ll get repeatable results. But this is just a theory, not a fact. And since frequentists have consistently shown that “mini-physics” is really more like “poor physics”, there’s every reason to question the theory. The bottom line is that it needs to be checked. Look at “methodologically rigorous” studies where the effects aren’t so big they could have been easily discovered without statistics and see how reproducible the results are.
    - jrc on June 1, 2013 3:59 PM at 3:59 pm said:
      
      On empirically testing reproducibility – two things 1) I agree that more “replication” studies of important findings are good; and 2) no good* practicing economist takes any one study and believes that it definitively shows something, and most big field studies are not reproducible/replicable in the sense that years have gone by, institutions have changed, schools have new teachers, prices have moved, etc. All the good* researchers I know see a new study, evaluate the method, look at the various tests, and weakly (or less weakly) update their beliefs about what is going on in the world. (* “good” meaning most engaged, high level practitioners in the field at research Universities).
      
      Now, as for simulations, inference and “real world data” – there are other kinds of simulations. For instance, I was recently concerned that methods for computing standard errors recommended by the providers of a large survey were too restrictive in their assumptions (iid conditional on sampling procedure), so I did a simulation where I randomly assigned “Treatment” to half the people in one of their data sets, and “Control” to the other half. Then I ran the regression of Y on the placebo-T using the real data, used a few different inference strategies (theirs, the one I thought should be right, and one that was common in other, similar empirical set-ups), recorded the p-values, and did the whole thing a bunch of times.
      
      In the end, I get rejection rates (rejecting the true, known null hypothesis of no effect) of right on about .05 when I used the a priori theoretically pleasing strategy (mine) and much higher rejection rates for the other two. Here’s the point – this is a simulation based on real data, and it tells me that, if treatment is randomly assigned in this manner (a kind of cluster random assignment), and I use the right inference strategy, I will reject a true null about 5% of the time. And we could take it to another dataset from these people (a dataset is a country here) and I could predict that my inference strategy would give reasonable p-values, and theirs would be way too large (in fact, I’ve done some of that simulation on 1 or 2 other countries, and in fact that seems to be about right).
      
      As a semi-Frequentist, all I really want is for my standard errors to perform about right in a placebo environment, and to have good theoretical reasons to believe that everything I don’t know about people is probably unrelated to treatment or treatment intensity. I make no claims about things in the “real” world – deep underlying metaphysical parameters – I just use the machinery to give me some gauge of how likely some effect was to appear that large just due to chance. And it seems most plausible that if I randomly assign treatment, there is no difference in the outcome before the experiment across groups, there is a difference later, that treatment is probably responsible for this difference.
      
      I guess this doesn’t really get at your reproducibility argument – but my point is just that if we can see that our estimators work on real data when a known treatment effect is added, and that no one really believes that one study is a cause for believing some effect in the world, and that people have some things in common and respond in similar ways in aggregate, I don’t see where your argument has much traction.
      
      But maybe your argument is just that every person in the world is so different and every moment so unique that things will never be the same (not true in Physics, likely more true in Econ, but still not really true)? Or is it just that you think everything is a file-drawer selection problem? Do you doubt our ability to weed out a 0 effect, or that the world is regular enough for us to learn from it experimentally? And why are physicists OK to experiment, but not economists? (my answer would involve “physicists don’t experimentally fuck with poor people,” but I’m guessing that’s not where your argument is).
    - Anonymous on June 1, 2013 9:00 PM at 9:00 pm said:
      
      There seem to be a couple issues here.
      
      Hypothesis testing is a terrible way to go about modeling observations, which I think is what you’re getting at with the “modeling real physical phenomenon” thing.
      
      I think there are two separate issues in terms of predictive ability from randomized experiments – 1) residual confounding 2) generalizability. You seem to be focused on cases where the first one is an issue. While it’s possible, I think problems with #2 are far more common. There’s almost never a prediction made on the basis of a prior randomized study for which the prediction population can be said to be indistinguishable from the study population. This is a far bigger elephant in the room than screw-ups that could happen related to being closer or further from the sun, etc.
    - Entsophy on June 1, 2013 1:39 PM at 1:39 pm said:
      
      “My point is that “methodological rigor” will often lead to reproducibility if we believe that tomorrow will look like yesterday”
      
      Any method will lead to reproducible results if tomorrow looks like yesterday.
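
The "placebo environment" check discussed in this thread can be made concrete with a small simulation. This is a minimal sketch under stated assumptions, not anyone's actual analysis: outcomes are simulated as independent standard normal noise, half of the units are randomly assigned to treatment, and the function names (`diff_in_means`, `placebo_rejection_rate`) and the 1.96 cutoff are illustrative choices. With no true effect, roughly 5% of experiments should cross the cutoff if the standard errors are behaving; adding a known effect shows whether the same estimator picks it up.

```python
import math
import random
import statistics


def diff_in_means(outcomes, treated):
    """Return the treated-minus-control difference in means and its standard error."""
    t = [y for y, d in zip(outcomes, treated) if d]
    c = [y for y, d in zip(outcomes, treated) if not d]
    est = statistics.fmean(t) - statistics.fmean(c)
    # Unequal-variance standard error of a difference in means.
    se = math.sqrt(statistics.variance(t) / len(t) + statistics.variance(c) / len(c))
    return est, se


def placebo_rejection_rate(n=200, reps=2000, true_effect=0.0, seed=0):
    """Share of simulated experiments in which |estimate| exceeds 1.96 * SE."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Randomly assign half of the units to treatment.
        treated = [True] * (n // 2) + [False] * (n - n // 2)
        rng.shuffle(treated)
        # Outcome is pure noise plus the (possibly zero) effect for treated units.
        outcomes = [rng.gauss(0.0, 1.0) + (true_effect if d else 0.0) for d in treated]
        est, se = diff_in_means(outcomes, treated)
        if abs(est) > 1.96 * se:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    print("placebo (no effect):", placebo_rejection_rate(true_effect=0.0))
    print("known effect of 0.5:", placebo_rejection_rate(true_effect=0.5))
```

Note that this only addresses the first of the two issues raised above (whether the machinery works within the study population); it says nothing about generalizing to a different population.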
Steve Sailer on June 1, 2013 5:02 PM at 5:02 pm said:

Allow me to make a public prediction: the much derided correlation between male muscularity and political views (defined in the sophisticated sense of more muscular men tending to favor joint political action in their self-interest while less muscular men are more concerned with universal fairness) will turn out to be true.
- Entsophy on June 1, 2013 5:50 PM at 5:50 pm said:
  
  I’m sure you’re right. Almost all the discussion here is about the statistics not the claim, so I’m not sure who’s doing all the deriding. These kinds of topics aren’t illegitimate in any way, shape, or form. It’s just that I don’t believe these kinds of papers add to our storehouse of knowledge and they’re unlikely to do so even if all the methodological problems were fixed.
  
  There are quite a few fields, which publish ~10^4 papers per year, each of which gives a convincing portrayal of having advanced the subject by an epsilon amount, and yet these fields are no better at predicting things than they were 30-60 years ago. This is a minority view though. Gelman and quite a few others with illustrious careers would fairly strongly disagree. So you seem to be in good company.
  
  And you definitely won’t get much political correctness here. I’m a Marine and have never been offended by anything in my entire life. I’m not even sure what emotion people are referring to when they say they were offended by something.
- Andrew on June 1, 2013 7:59 PM at 7:59 pm said:
  
  Steve:
  
  The difficulty is one of multiple comparisons. The hypotheses you state can express themselves in the data in so many different ways. For example, a low-income conservative might support repealing the estate tax because he feels it is in his self-interest, or because he believes the tax is unfair. And so forth. I agree with you that these things are worth studying, but I doubt the correlations will be so clear, and I think that’s a key weakness of the article under discussion, that they take one particular set of correlations and interpret them to death.
  - Steve Sailer on June 1, 2013 11:42 PM at 11:42 pm said:
    
    Actually, I think Tooby, Cosmides et al. have done a good job of cutting through much of the spin and myth associated with left and right and getting down to something solid: political attitudes as self-interest when it comes to redistribution.
    
    This was something that puzzled me when I was young, naive, and skinny: my father-in-law was a classical musician, but he was built like a Teamster, at about 6′-1″ and 220. He played the tuba in the Chicago Lyric Opera orchestra, and all the weedy violinists kept electing him their union boss to negotiate contracts with management for them. Why? Because he looked like a hard man to buffalo. And, indeed, he was an immovable object. He spent long hours at the bargaining table, and led a few strikes in his time.
    
    As a young intellectual of run-of-the-mill views, I spent some time trying to figure out if my father-in-law’s union boss job put him on the Left or the Right. Theoretically, labor = left, and indeed he worked hard to redistribute income from management and their rich backers to labor. On the other hand, classical musicians with full time jobs aren’t the poor (e.g., the Chicago Symphony Orchestra recently went on strike because they were only being paid $144,000 per year).
    
    But that labor=left linkage was fading in the public mind as things like gay marriage came to define the left. The other classical musicians valued his negotiating because he couldn’t be swayed by management’s appeals to reason or fairness. He was not out for universal justice, he was focused on his team winning. Thus, in Tooby and Cosmides’s framework, it’s really easy to classify him, while it’s not in conventional approaches.
    - Andrew on June 2, 2013 8:41 AM at 8:41 am said:
      
      Steve:
      
      I think your theory is reasonable and I agree that personal experiences are a valuable way to learn and understand about the world. I don’t think the Petersen et al. study tells much, and I think you may be giving the authors some benefit-by-association because their theory is similar to yours.
      
      The trouble is that, as Popper said so famously about psychoanalysis, just about any data pattern could be taken as resounding evidence for the theory. Suppose, for example, there were a straight correlation (no interaction) between biceps size and conservative political attitudes. This could be taken as evidence that stronger larger male college students (who are, of course, on average in the upper part of the SES distribution) are more willing to fight for their self-interest, compared to their smaller weaker counterparts. Now suppose the opposite, that the correlation went the other way. This would still fit the theory: after all, college students tend to have liberal political views, and it’s the stronger ones who accept such views without question (on campus, it is the liberals who are dominant) while the weaker ones are more likely to consider the other side. Similarly for various interactions: with any possible outcome, you can tell a story—a completely reasonable story—that is consistent with the theory. That does not mean the theory is wrong—and, for that matter, I’m sure there are many important insights in Freud’s theories as well.
      
      From a statistical point of view, there is a challenge in estimating subtle interactions in the presence of huge main effects (of age, sex, party identification, etc.), a problem that is made more difficult because there are so many different patterns that one can find in the data. So many places to look, so much variation, it’s all a mess. I wouldn’t discourage people from doing such studies, but I would discourage them from such overinterpretation. The correlations in the population could well be the opposite of the patterns they found in these particular samples, but it wouldn’t invalidate your larger theory, it just indicates how that theory is consistent with almost any pattern of correlations, if interpreted in a certain way.
    - Rahul on June 2, 2013 10:41 AM at 10:41 am said:
      
      Nicely put!
    - Steve Sailer on June 2, 2013 10:15 PM at 10:15 pm said:
      
      This paper strikes me as moving in the opposite direction, toward greater clarity of thought, which facilitates potential Popperian falsification. Tooby, Cosmides et al. have made conceptual progress by avoiding the murkier aspects of politics and ideology and focusing upon the question of redistribution for self-interest.
      
      This states elegantly an answer to a question that has puzzled perceptive political observers going back, in my recollection, to the late 1960s: union bosses are a crucial component of the Democratic coalition, yet they tend not to _look_ like other Democrats. For example, current AFL-CIO supremo Richard Trumka is almost a dead ringer for former Chicago Bears coach Iron Mike Ditka:
      
      http://en.wikipedia.org/wiki/Richard_Trumka
      
      Trumka looks like a stereotypical tribal leader: a good man to have leading the fight for your side’s interests.
      
      In contrast, many centuries of Western imagery suggest that we expect saints and scholars, as disinterested figures, to be thin, perhaps emaciated. The least likely actors to have ever been cast as Jesus include Danny DeVito, Bob Hoskins, and John Goodman.
    - Andrew on June 3, 2013 7:38 AM at 7:38 am said:
      
      Steve,
      
      You might be right on the theory, but for the reasons stated in my above comment I don’t see Petersen et al.’s data and data analysis as saying much about this, and I certainly don’t think they have done anything close to justifying their dramatic claims. I don’t think their picking out of correlations among surveys of college students is anything like a Popperian test of their theory. For reasons discussed in my previous comment, I think such a test is much more difficult than Petersen et al. appear to believe.
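
The multiple-comparisons concern raised in this thread can be illustrated with a short simulation. This is a rough sketch under assumptions that are not taken from the Petersen et al. study: the outcome and every candidate predictor are independent standard normal noise, the sample size of 100 is arbitrary, and the cutoff `1.96 / sqrt(n)` is only an approximate two-sided 5% threshold for a correlation under the null. The function names are hypothetical. The point is only that when there are many places to look, finding at least one "significant" correlation in pure noise becomes the expected outcome.

```python
import math
import random


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def share_with_spurious_hit(n_comparisons, n=100, reps=500, seed=1):
    """Share of pure-noise datasets with at least one 'significant' correlation."""
    rng = random.Random(seed)
    cutoff = 1.96 / math.sqrt(n)  # approximate 5% two-sided cutoff for |r| under the null
    hits = 0
    for _ in range(reps):
        outcome = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(n_comparisons):
            # Each predictor is unrelated to the outcome by construction.
            predictor = [rng.gauss(0.0, 1.0) for _ in range(n)]
            if abs(pearson_r(predictor, outcome)) > cutoff:
                hits += 1
                break  # one spurious finding is enough to count this dataset
    return hits / reps


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, "comparisons:", share_with_spurious_hit(k))
```

With independent comparisons the chance of at least one false positive is about 1 - 0.95^k, so around 64% for 20 comparisons. Real analyses involve correlated measures and choices made after seeing the data, which this sketch does not model.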