How do data and experiments fit into a scientific research program?

I was talking with someone today about various “dead on arrival” research programs we’ve been discussing here for the past few years: I’m talking about topics such as beauty and sex ratios of children, or ovulation and voting, or ESP, all of which possibly represent real phenomena and could possibly be studied in a productive way, just not using the data collection and measurement strategies now in use. This is just my opinion, but it’s an opinion based on a mathematical analysis (see full story here) that compares standard errors with plausible population differences.

Anyway, my point here is not to get into another argument with Satoshi Kanazawa or Daryl Bem or whoever. They’re doing their research, I’m doing mine, and at this point I don’t think they’re planning to change their methods.

Instead, accept for a moment my premise that these research programs, as implemented, are dead ends. Accept my premise that these researchers are chasing noise, that they’re in the position of the “50 shades of gray” guys but without the self-awareness. They think they’re doing research and making discoveries but they’re just moving in circles.

OK, fine, but then the question arises: what is the role of data and experimental results in these research programs?

Here’s what I think. First, when it comes to the individual research articles, I think the data add nothing, indeed the data can even be a minus if they lead other researchers to conclude that a certain pattern holds in the general population.

From this perspective, if these publications have value, it’s in spite of, not because of, their data. If the theory is valuable (and it could be), then it could (and, I think, should) stand alone. It would be good if the theory also came with quantitative predictions that were consistent with the rest of available scientific understanding, which would in turn motivate a clearer understanding of what can be learned from noisy data in such situations—but let’s set that aside, let’s accept that these people are working within their own research paradigm.

So what is that paradigm? By which I mean, not What is their paradigm of evolutionary psychology or paranormal perception or whatever, but What is their paradigm of how research proceeds? How will their careers end up, and how will these strands of research go forward?

I think (but certainly am not sure) that these scientists think of themselves as operating in Popperian fashion, coming up with scientific theories that imply testable predictions, then designing measurements and experiments to test their hypotheses, rejecting when “p less than .05” and moving forward. Or, to put it slightly more loosely, they believe they are establishing stylized facts, little islands of truth in our sea of ignorance, and jumping from island to island, building a pontoon bridge of knowledge . . . ummmm, you get the picture. The point is, that from their point of view, they’re doing classic science. I don’t think this is what’s happening, though, for reasons I discussed here a few months ago.

But, if these researchers are not following the Karl Popper playbook, what are they doing?

A harsh view, given all I’ve written above, is that they’re just playing in a sandbox with no connection to science or the real world.

But I don’t take this harsh view. I accept that theorizing is an important part of science, and I accept that the theorizing of Daryl Bem, or Sigmund Freud, or the himmicanes and hurricanes people, or the embodied cognition researchers, etc etc etc., is science, even if these researchers do not have a realistic sense of the sort of measurement accuracy it would take to test and evaluate these theories.

Now we’re getting somewhere. What I think is that anecdotes, or case studies, or even data that are so noisy as to essentially be random numbers, can be a helpful stimulus, in that they can motivate some theorizing.

Take, for example, that himmicanes and hurricanes study. The data analysis was a joke (no more so than a lot of other published data analyses, of course), and the authors of the paper made a big mistake to double down on their claims rather than accepting the helpful criticism from outside—but maybe there’s something to their idea that the name of a weather event affects how people react to it. It’s quite possible that, if there is such an effect, it goes in the opposite direction from what was claimed in that notorious article—but the point is that their statistical analyses may have jogged them into an interesting theory.

It’s the same way, I suppose, that Freud came up with and refined his theories of human nature, based on his contacts with individual patients. In this case, researchers are looking at individual datasets, but it’s the same general idea.

Anyway, here’s my point. To the extent that research of Bem, or Kanazawa, or the ovulation-and-voting people, or the himmicanes-and-hurricanes people, or whatever, has value, I think the value comes from the theories, not from the data and certainly not from whatever happens to show up as statistically significant in some power=.06 study. And, once we recognize that the value comes in the theories, it suggests that the role of the data is to throw up random numbers that will tickle the imagination of theorists. Even if they don’t realize that’s what they’re doing.

Sociologist Jeremy Freese came up with the term Columbian Inquiry to describe scientists’ search for confirmation of a vague research hypothesis: “Like brave sailors, researchers simply just point their ships at the horizon with a vague hypothesis that there’s eventually land, and perhaps they’ll have the rations and luck to get there, or perhaps not. Of course, after a long time at sea with no land in sight, sailors start to get desperate, but there’s nothing they can do. Researchers, on the other hand, have a lot of more longitude—I mean, latitude—to terraform new land—I mean, publishable results—out of data . . .”

What I’ve attempted to do in the above post is, accepting that a lot of scientists do proceed via Columbian Inquiry, try to understand where this leads. What happens if you spend a 40-year scientific career using low-power studies to find support for, and modify, vague research hypotheses? What will happen is that you’ll move in a sort of directed random walk, finding one thing after another, one interaction after another (recall that we’ve looked at studies that find interactions with respect to relationship status, or weather, or parents’ socioeconomic status—but never in the same paper), but continuing to stay in the main current of your subfield. There will be a sense of progress, and maybe real progress (to the extent that the theories lead to useful insights that extend outside your subfield), even if the data aren’t quite playing the role that you think they are.

For example, Satoshi Kanazawa, despite what he might think, is not discovering anything about variation in the proportion of girl births. But, by spending years thinking of explanations for the patterns in his noisy data, he’s coming up with theory after theory, and this all fits into his big-picture understanding of human nature. Sure, he could do all this without ever seeing data at all (indeed, the data are, in reality, so noisy as to have no bearing on his theorizing), but the theories could still be valuable.

P.S. I’m making no grand claims for my own research. Much of my political science work falls in a slightly different tradition in which we attempt to identify and resolve “puzzles” or stylized facts that do not fit the current understanding. We do have some theories, I guess—Gary and I talked about “enlightened preferences” in our 1993 paper—but we’re a bit closer to the ground. Also we tend to study large effects with large datasets so I’m not so worried that we’re chasing noise.

70 thoughts on “How do data and experiments fit into a scientific research program?”

  1. I really enjoy your posts on this matter. I agree that theorizing is important—critical—to science. Every chemist knows the story of how the concept of aromaticity, absolutely critical to chemistry, arose from a dream.

    The real problem is the publication and dissemination in the popular press of lousy research. Too often interesting possibilities become bad research that is promulgated to the public as verified scientific fact.

  2. Interesting. But how do you separate insight from bullshit except by data? I have no problem with publishing both bullshit and insight in journals, and even in the same journals. But there ought to be some subset of journals which publish only things in which the data *demonstrates* (I mean that to be slightly weaker than *proves* which is too much to hope for) the speculation, not just points in the same direction, or weakly supports. Haven’t we progressed to the point (and any time I write that phrase I hear myself muttering “Sadly, no”) where weak quantification is worse than none? [And forking paths is the reason why.] The argument I’ve always heard for NHST 0.05 hoops is not that it advances understanding, or even serves as an index of truth, but that with all the forking path tools at your disposal, if you can’t get there, either your theory is false or you’re just looking in the wrong place.

    • “The argument I’ve always heard for NHST 0.05 hoops is not that it advances understanding, or even serves as an index of truth, but that with all the forking path tools at your disposal, if you can’t get there, either your theory is false or you’re just looking in the wrong place.”

      I’d like to see this logic explored further by one of the proponents, because I do not understand it. Say T is a substantive theory making prediction P, we make auxiliary assumptions A (eg equipment working, no data entry error, etc), and we have observed O. Either O is consistent with P or not.

      1) (T AND A) entail P
      2a) O=P therefore (T AND A) (Affirming the consequent)
      2b) O/=P therefore (~T or ~A) (Modus Tollens)

      That is pretty straightforward, however when we consider the “null* hypothesis”:

      1) (??? and A) entail H0
      2a) O=H0 therefore (??? AND A) (Affirming the consequent)
      2b) O/=H0 therefore (~H0 or ~A) (Modus Tollens)

      Where is the connection to T? I can’t see it; if anyone can, then I ask that you publish this formally. You will be the first one in history to explain the mechanism at work here. I have my doubts; instead it reminds me of this helpful Venn diagram:

      *Plausible null hypotheses (there is zero ESP effect) or those derived from a theory are a different story.

      • The notation was sloppy. Cleaned up a bit:
        1) (T AND A) entail P
        2a) (O : P) therefore (T AND A) (Affirming the consequent)
        2b) (O /: P) therefore (~T OR ~A) (Modus Tollens)

        That is pretty straightforward, however when we consider the “null* hypothesis”:

        1) (??? AND A) entail H0
        2a) (O : H0) therefore (??? AND A) (Affirming the consequent)
        2b) (O /: H0) therefore (~??? OR ~A) (Modus Tollens)

        Above I am using a non-standard symbol:
        “:” = “is consistent with”
        “/:” = “is not consistent with”

        I am not sure what the usual symbols are (if any) for those statements but the takeaway is that an observation can be consistent with H0 and P even if they are associated with mutually exclusive values. For example, if we fail to reject H0 (H0=mu1-mu2=0) the CI may still include values consistent with P (eg P=mu1-mu2>0).

        I would really be interested in someone tackling the problem of formally explaining the supposed mechanism of strawman NHST.

        Also, I do accept that the p-value may be a useful summary statistic*, but only if we DO NOT interpret it the usual way of “the probability of getting a result as or more extreme if H0 is true”. In that case it is simply a tool (possibly not optimal) to choose between (O : P) or (O /: P). We are not concerned at all with deducing anything from the truth/falsity of H0**.

        The root of the confusion appears to be that p-values are related to evidence, but the definition of p-value provided (while it may be correct) is unrelated to that function.

        *“P-values quantify experimental evidence not by their numerical value, but through the likelihood functions that they index”
        **“the null hypothesis serves as little more than an anchor for the calculation-a landmark in parameter space”

        • I think you’re overthinking this. The point is that your choice of model to contrast with the null is really wide. (That’s the garden of forking paths problem.) If none of them can achieve p<0.05 then there’s almost surely nothing there. If at least one of them does, and you can rhetorically explain that choice ex post without reference to the fact that you were executing this step to jump through the hoop, then you have passed a minimal statement of scientific surprise. (Note you are not allowed to say: I tried these five things and only thing 3 worked… You are required to explain why thing 3 is a better statement of the null than the other four.) Not only that, but by trying five things you can tell people how hard you worked at the statistical art of the paper (i.e., the part you didn’t care about.)
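
To put a rough number on the "tried these five things" scenario in the comment above, here is a minimal sketch. It assumes there is truly no effect and, as a simplification, that the five analyses are independent, so each p-value is an independent uniform draw; in real forking-paths situations the analyses share data and are correlated, so treat the result as illustrative only. The function names are hypothetical.

```python
import random


def chance_of_some_significant(k, alpha=0.05):
    """Closed form: P(at least one of k independent null p-values < alpha)."""
    return 1 - (1 - alpha) ** k


def simulate(k, n_studies=20000, alpha=0.05, seed=1):
    """Monte Carlo check of the same quantity.

    Under a true null, a p-value is uniform on (0, 1), so each analysis
    is modeled here as one uniform draw (assumption: independence).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        best_p = min(rng.random() for _ in range(k))
        if best_p < alpha:
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(chance_of_some_significant(k), 3), round(simulate(k), 3))
```

Under these assumptions the chance of clearing the 0.05 hoop at least once grows quickly with the number of analyses tried, which is why failing to clear it after many tries is informative while clearing it is not.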

        • “If none of them can achieve p<0.05 then there’s almost surely nothing there”

          And if I do… there is almost surely nothing there either. Any time I mess up the experiment I get p<0.05!

          Also some R:
          for(i in 1:nrow(p)){
          [1] 0.05045

          It sounds like the theory of interest predicted a substantial effect size, nothing to do with rejecting the null hypothesis.

  3. Put differently, the data from experiments are a way to start a conversation with a computational/mathematical model of the phenomenon you are studying. The trouble begins when people start to build theories by wildly waving their hands in the air. Then, as the ad says, impossible is nothing.

    • I think people need to stop & think a lot *before* they gather any data about what they intend to do with it. And how they will analyse it. And what their priors on expected effect size are. And how noisy their measurements will be. And whether their experimental design makes sense.

      Many of these studies sound like weighing your baby weekly with a 30-ton, industrial truck scale & then theorizing about reasons for variations in its growth rate.
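
The truck-scale analogy can be made concrete with a small calculation. The numbers here are invented for illustration (a baby gaining 0.2 kg per week, a scale whose readings have a 10 kg standard deviation), and the sketch assumes independent, normally distributed reading errors.

```python
from math import sqrt
from statistics import NormalDist


def weekly_change_diagnostics(true_gain_kg, scale_sd_kg):
    """Compare a true weekly gain with the noise in a measured weekly change.

    The measured change is the difference of two independent readings,
    so its standard deviation is sqrt(2) times the scale's own sd.
    """
    diff_sd = sqrt(2) * scale_sd_kg
    signal_to_noise = true_gain_kg / diff_sd
    # Probability that the measured change has the wrong sign (shows a loss).
    wrong_sign = NormalDist(mu=true_gain_kg, sigma=diff_sd).cdf(0.0)
    return diff_sd, signal_to_noise, wrong_sign


if __name__ == "__main__":
    # Hypothetical values: 0.2 kg/week true gain, 10 kg scale noise.
    diff_sd, snr, wrong_sign = weekly_change_diagnostics(0.2, 10.0)
    print(f"sd of measured weekly change: {diff_sd:.1f} kg")
    print(f"signal-to-noise ratio: {snr:.3f}")
    print(f"chance the measured change has the wrong sign: {wrong_sign:.3f}")
```

With noise this large relative to the signal, the sign of any single measured change is close to a coin flip, so week-to-week "variations in growth rate" would be almost entirely measurement error.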

  4. Two responses:

    1. If you limit science to what has survived from the past, you get a distorted view of what actually happened in the past. What has survived in the science world is that which actually calculated, which actually worked out. Yes, some wrong ideas survive as part of a story, like “the aether” through which light moves but outside of a history of science class or a popular history/autobiography those are used to illuminate a relatively difficult correct idea. In the sense that science is actually like it has been only dressed up with math that computers enable, then the issue of reliability, of good versus bad work, etc. is one which will recur over and over. As tools change, it will find some other way to express.

    2. Why? Because people want to be right. Because they want to contribute. Because they have egos. Because it’s in human nature to guess, to gamble. Because we see patterns and leap to conclusions even though those conclusions require inconsistency with other conclusions. Because those inconsistencies tend to turn people into harsher and more strident advocates of their particular views rather than admit error.

  5. Andrew: “What happens if you spend a 40-year scientific career using low-power studies to find support for, and modify, vague research hypotheses?”

    You do very well career wise. And you create a parallel universe.

  6. Then we should bring back Staple who had gobs of good theories, and all who work on data-free theorizing should stop claiming to have evidence for their myths. Whether they then become fringe science, pseudoscience, pre-science, or non science can be decided. I’ve long urged that at least most of these “chump effects” are there for (legitimate) human interest and should be labelled as “for entertainment only”. (I would not place Freud or similar theories in that category.) It is not innocuous to claim to have evidence falsely, it’s quite dangerous. Yet I doubt they’d accept the label “for thought-provoking entertainment only”.

    • Mayo:

      The problem with Stapel was not the theories, it was the (intentionally) false claims of evidence. Similarly, the problem with Bem and the ovulation researchers was not the theories, it was the (presumably unintentionally) false claims that their evidence supported their theories.

      I agree that it is certainly not innocuous to claim to have evidence falsely. Whether the falsity is intentional or not, such claims are misleading and also pollute scientific discourse.

      I think one reason for all these false claims is the expectation among scientists that their off-the-wall theories will be routinely supported by data. This is why I’ve been writing so much recently about “power = .06” studies, to try to get away from the usual way that statistics is presented, as a way of confirming scientific hypotheses.

      Trained researchers think that they can come up with a story and then do a 2-item survey on 100 Mechanical Turk participants and learn enduring truths about human nature. Other trained researchers think that their theories are true so it’s ok to misrepresent their data. Others think that their theories are true so it’s ok to hunt through hypothesis tests till they find statistical significance. I think all these groups have the problem that they think that quantitative research proceeds in a straight line; they don’t fully understand uncertainty and variation.

    • That’s not how I understand Andrew’s point. As I understand it, you need to view your data from a highly constrained perspective, which delimits theoretically what is a plausible or implausible outcome before any data come in. This does not mean that one ignores data or makes it up, the way Stapel did (I assume that’s what you mean by “Staple”).

    • “all who work on data-free theorizing should stop claiming to have evidence for their myths”


      Say Mayo, in your four decades of taxpayer funded theorizing did you ever publish something that wasn’t data-free?

      • Hey Anonymous: since you don’t like data-free theorizing, here’s an inductive, theory-free inference from a large sample size: Based on your comments on this and many other posts here, I infer that you have nothing substantive to say about this or any other topic on this blog.

        I look forward to you falsifying my hypothesis repeatedly. Seriously, I’d be thrilled if you would do nothing but prove me wrong on this into the indefinite future.

      • Anon:

        Please do not be rude in this way. And to respond to the substance: it is perfectly reasonable for a philosopher to theorize about statistics without actually performing statistical analyses, just as Bill James could make useful contributions to our understanding of baseball even if he was no better than Michael Jordan at hitting the curveball.

        • Andrew,

          I am “Anon” in this thread, who is different from “Anonymous”. I wouldn’t bother mentioning it but since you are in a position to check IPs, others who read you conflating the two may come to the wrong conclusion.



        • That wasn’t the substance. Mayo has strong theories about what statistical methods statisticians should be using, including new stat tools which, it is claimed, (A) will solve the major problems with classical statistics and (B) should be included in pretty much every stat analysis.

          If Mayo’s theories make a difference then they’re testable.

          In four decades of work Mayo has no data to back any of this up. In fact, there’s not a shred of evidence in her published record that she has anything like the mathematical or practical data analysis skills needed to even attempt such a thing.

          I was pointing out the supreme irony of someone like that uttering the words “all who work on data-free theorizing should stop claiming to have evidence for their myths”.

          It’s not rude, it’s funny in an “academia is full of worthless time-servers with no intellectual integrity” sort of way.

        • Anon:

          I agree with Mayo on some things and disagree with her on others, but I don’t think a phrase such as “worthless time-servers with no intellectual integrity” is in any way appropriate to this discussion. Mayo holds strong opinions with which you disagree. It’s enough to say that. This blog is a special space. Wit is appreciated but this sort of rudeness is not. Thank you for understanding.

        • Jeremy, I get it. You’re upset because you think it’s ok when:

          (A) a “researcher” with what appears to be a weak freshman-level knowledge of stat/math passes themselves off as a stat expert,

          (B) a “researcher” tells people what stat methods are legitimate for data analysis when it appears they have never done any real data analysis in their entire life, and

          (C) a “researcher” tries to convince everyone they’ve solved all the major problems with classical statistics and Bayesian stats shouldn’t be used because they invented a new statistical tool which they only tested on the simplest first-week-of-stat-101 examples where it happens to give answers identical to Bayes. None (I repeat, NONE) of the usual theoretical, simulated, and real data work that goes into vetting new stat tools appears to have been done, yet claims are made that if it is adopted and added to every stat analysis or used in place of Bayes all the major problems with statistics will disappear.

          So it’s duly noted that you find all this acceptable. I’ll assume your “research” reflects similar standards of “excellence” from here on out. But I think it’s a sham and a disgrace. Just like Gelman going on about David Brooks or what pretty much everyone on this blog did to Vincent Granville, who is guilty of far less shoddy research than Mayo. See the comments here:

          Neither you nor Gelman had any problem with those comments.

        • Please don’t put words into my mouth. You and I disagree, but not because I approve of ignorant, disgraceful shams while you don’t. You think that Deborah Mayo and other philosophers are ignorant, disgraceful shams; I disagree with that characterization. I’m hardly alone in my disagreement with you on that, of course, which I guess must frustrate you to no end.

          You do realize that, by dismissing my own research without having read it, on the grounds that I don’t think Deborah Mayo is an ignorant, disgraceful sham, you’re merely confirming the impression that you don’t have anything to say? But in a funny way, I’m actually happy for you to take the view that I too am a sham and disgrace. Your comments would be less tediously repetitive if you started mixing in some insults of me.

          I won’t bother to reply further. For months now I’ve harbored the hope that you’d stop repeating the same insults over and over if the other commenters either ignored you, or politely told you to stop. That didn’t seem to be working so I thought I’d try a different tack. But unfortunately, and probably not surprisingly, that doesn’t look like it’s going to dissuade you either. Indeed, at this point it’s obviously making things worse. For which I apologize to the other readers, who I’m sure are now at least a little annoyed with me for engaging with you.

          So I guess I’ll just go back to ignoring you. Or maybe better, just stick to reading the posts. I’d feel bad missing out on the discussions here, I often find them thought-provoking. But there are a lot of good discussions in the world that I could be reading that aren’t contaminated by such persistent trolls. And I don’t like having to spend even a bit of mental energy keeping one eye out for comments that I know I’ll need to skip. It makes reading into unpleasant work.

        • Regarding your (A): Is the audience so naive to be fooled by weak freshman knowledge? It looks like Mayo has a respectable engagement with professional statisticians.

  7. Is it just a coincidence that most of the articles mentioned here are on somewhat trivial topics? e.g. red clothes & fertility, ESP, ovulation and voting, himmicanes and hurricanes, meat eaters are more selfish than vegetarians, et cetera.

    Is there a correlation here? Not saying at all that “serious” research is immune to this, but I get the feeling that the spike in “cool” ( but mostly useless ) research is somehow a relevant co-morbidity.

    • Actually, it’s a mystery to me why Andrew goes after trivial research and not after badly done research that costs people their lives. I guess it’s because he just wants to illustrate the point without getting into a technically complex area where lots of domain knowledge is needed. That makes sense. But it would be useful to occasionally go after the guys who cause real harm. Like that “MIT Professor” (I put it in scare quotes because it sounds impressive) who found a statistically significant correlation between autism and vaccines. You can hear her yourself explaining her “research”:

      • +1

        I’d love some analysis of non-psych studies.

        ESP, himmicanes, red = fertility etc. are somewhat silly studies & easy targets.

        I’d love something of the nature “Hey FDA you made a big methodological blunder!”

        • The FDA already admitted they couldn’t keep up years ago. On the other hand, maybe that was just some plea for funding.

          “The Subcommittee concluded that science at the FDA is in a precarious position: the Agency suffers from serious scientific deficiencies and is not positioned to meet current or emerging regulatory responsibilities…The Subcommittee further noted that the impact of the deficiency is profound precisely because science is at the heart of everything FDA does. The Agency will flounder and ultimately fail without a strong scientific foundation.”

          FDA Science and Mission at Risk Report of the Subcommittee on Science and Technology. 2007

          Also, your best bet at finding a “huge methodological blunder” is investigating the population estimates everyone uses in their denominators. No one has been taking that necessary scientific care when it comes to those. You can tell because they do not have any associated estimate of uncertainty.

        • Fine. Then try NASA, US-DoE, CERN, NIH, USDA. Some flavor of work with impact. Himmicanes & ovulation-voting seems too trivial as a theme.

          I think because the stuff these psych guys attempt is so irrelevant either way, no one really cares when they do make very silly methodological blunders.

          e.g. It’d be bracing to attempt to poke holes in the LHC analysis for a change since those guys use (abuse?) NHST too, don’t they?

        • “Fine. Then try NASA, US-DoE, CERN, NIH, USDA. Some flavor of work with impact.”

          I’d like to see a statistician’s discussion of this:

          Support for the Thermal Origin of the Pioneer Anomaly
          Slava G. Turyshev, Viktor T. Toth, Gary Kinsella, Siu-Chun Lee, Shing M. Lok, and Jordan Ellis
          Phys. Rev. Lett. 108, 241101

          From figure 3 it looks like after they do their accounting and widening of error bars, the remaining difference is conveniently consistent with the excess acceleration predicted by MOND (~1.2 × 10^-10 m/s^2).

        • > Himmicanes & ovulation-voting seems too trivial as a theme. I think because the stuff these psych guys attempt is so irrelevant either ways no one really cares when they do make very silly methodological blunders.

          Heilmeier’s Question #4…

          Suppose one got the methodology down and actually found a himmicane or ovulation-voting effect, to what constructive purpose could that knowledge be applied? I’m open to arguments that they’re somehow relevant to something worthwhile but my initial reaction is that they’re trivial – scientific navel-gazing. There’s an opportunity cost to engaging in trivial work. Problems of consequence get shorted.

          Long before I heard Heilmeier’s Catechism I remember hearing three rules for evaluating a work of art:
          1. What was the artist attempting to accomplish?
          2. Were they successful?
          3. Was it worth doing?
          That’s also a perfectly reasonable way to assess research projects. Q#3 can be awkward because it involves subjective judgment but there’s a not-so-fine line between being non-judgmental and having no judgment. (Having no judgment is nothing to be proud of.)

        • +1

          That opportunity cost of frivolous work is getting increasingly ignored. I see a spurt of work on trivial questions. Or maybe it’s a spurt in popular coverage.

          You could argue that the total funding wasted on such frivolous studies is small as compared to the money for AIDS, malaria etc. But it still bothers me.

      • Shravan:

        > technically complex area where lots of domain knowledge is needed
        Most of my academic career was in clinical research and I usually avoided discussing any implications of a study without a clinical colleague being involved. Context can be very important, as the implications can be very different for different patients. And almost no studies are truly islands on their own.

        Here I tried to be neutral about PSA screening (not sure I was).

        The points I wanted to make (which likely was only clear in the last comment) were _simply_

        The problem I suspect is statisticians following what other experienced statisticians _know well_ how to do – here mortality outcome – proportional hazards modeling – use simple exponential assumption based power formulas.

        An important step in applying statistics is to be very critical of how the representation (statistical model) _captures_ the important features of what is needed to be represented (the screening process).

        Formulas _force_ simplification while simulation encourages or at least allows complexification and detailing, while simulating the process may _trick_ one into spending more adequate effort on this.

  8. What always amazes me about political science research is that, although it is clear to the most casual observer that race has been a subtextual or textual theme of every presidential election since 1964, it is most often completely ignored in issue space analysis, rational choice models, voting models, etc. I would include your Red State, Blue State here. I saw one parenthetical comment about race in there, something about an analysis showing only as much as half of the explained variation can be explained by race, but looking at the maps I immediately saw Racist State, not-so Racist State. The red/blue maps of elections, which really should be gray/blue, have a degree of obviousness that would seem to be impossible to ignore, but they are ignored, much as generations of school children looking at a world map could move the continents into one large Pangea only to be told that was nonsense. I realize basing political thoughts or actions on race is basically irrational and doesn’t lend itself to the wonderfully complex models preferred by political scientists, but treating it the way Mayberry RFD treated southern racism (i.e., ignoring it) is not the way to go.

    • Race is a massively important factor in a huge variety of social science topics.

      Unfortunately, an Occam’s Razor approach to thinking about race can cost you your job even if you are, say, the co-discoverer of the structure of DNA. So, it’s prudent not to think about race. But, the downside is that our social science is crippled from the start.

      • You know, academic political science actually used to use race to model vote choice–the “symbolic politics” school. But it was tough to put into a probit–so it didn’t exist for rational-choice expected utility maximizers. Thus we had the spectacle of elites acting as if race mattered (Nixon’s Southern strategy, Carter’s “ethnic purity”, Reagan’s Neshoba County Fair states’ rights speech, Bush’s Willie Horton ads, Clinton’s Sister Souljah put-down, and of course the whole Obama era), but it having no effect on the voting public. This is in keeping with the academic attitude that campaigns don’t matter and billions are spent for recreational purposes, I suppose.

  9. Andrew– assume the ovulation-voting-impact (“OVI”) study was based on valid methods– that is, the sampling was appropriate, the observations were properly made & measured, and the effects modeled (estimated) appropriately — but the authors got the surprising result that they reported. What do you say we should do if we do in fact believe in using “data collection and measurement strategies” that “compare standard errors with plausible population differences”?

    I *think* Gelman & Carlin (2014) implies we should throw the information from the study away (i.e., not publish it, at least not in a “top journal”) b/c the probabilities of “Type M” and “Type S” errors are too high.

    I’m not happy with that answer, if it is indeed the one you are proposing. The problem with it is that it treats one’s priors as a filter on consideration of admittedly valid evidence.

    The only way to calculate the Type M & Type S errors is to treat our priors as correct. You & Carlin do w/r/t the OVI study: your estimated statistical power of 0.06 & resulting Type M “exaggeration factor” of 9.7 and Type S error of 24% for the 0.17, SE = 0.08 OVI finding all depend on our accepting that you are right that the “true” OVI effect size is 0.02.
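
    The dependence on the assumed true effect can be made concrete. The sketch below is not the Gelman & Carlin code; it is a closed-form normal approximation written for this illustration, and the function name `design_errors` and its parameters are hypothetical. With an assumed true effect of 0.02 and SE = 0.08 it should land in the neighborhood of the figures quoted above, though not necessarily match them digit for digit.

```python
from statistics import NormalDist

def design_errors(true_effect, se, alpha=0.05):
    """Power, Type S rate, and Type M exaggeration ratio for a normally
    distributed estimate. Assumes true_effect > 0 (an illustrative choice)."""
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)      # two-sided critical value
    shift = true_effect / se            # true effect in SE units
    p_right = 1 - std.cdf(z - shift)    # significant, correct sign
    p_wrong = std.cdf(-z - shift)       # significant, wrong sign
    power = p_right + p_wrong
    type_s = p_wrong / power
    # Expected |estimate| over the significant region (truncated-normal means)
    abs_mass = (true_effect * (p_right - p_wrong)
                + se * (std.pdf(z - shift) + std.pdf(-z - shift)))
    exaggeration = abs_mass / power / true_effect
    return power, type_s, exaggeration

power, type_s, exaggeration = design_errors(true_effect=0.02, se=0.08)
print(f"power={power:.2f}  type_s={type_s:.2f}  exaggeration={exaggeration:.1f}")
```

    Rerunning with a different `true_effect` (say 0.05) changes all three numbers, which is the point at issue: the error rates are conditional on the effect size one is willing to assume.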

    But obviously we never actually know the “true” effect size. We do empirical testing precisely to get access to information that we can use to adjust the probability we assign to rival hypothesized “true” effect sizes.

    The straightforward Bayesian way to handle information that is out-of-line w/ our priors is to *combine* it w/ our priors–not throw it away.

    The sky doesn’t fall if one does that w/ the freakish OVI finding.

    Imagine we are super confident that the true OVI is 0.02 as opposed to 0.05, the next most probable estimate: let’s say our priors are 10^4:1 in favor of the 0.02 hypothesis over the 0.05 one. Assuming SE = 0.08, observing an OVI effect of 0.17 is about twice as likely if the “true” OVI is 0.05 as if it is 0.02. Because we have no reason to believe the 0.17 finding– as improbable as it was — is invalid, we give it a likelihood ratio of 2, and revise our assessment of the relative probability of the 0.02 hypothesis accordingly: we now believe the odds are 5×10^3:1 in favor of the hypothesis that the true OVI is 0.02 rather than 0.05. And we simply go about our business until the next time someone does an OVI study.

    If instead we simply throw the OVI = 0.17, SE = 0.08 data away — b/c we deem its Type M and Type S errors to be excessive– we are committing ourselves to a “scientific research program” that uses our priors to assess the weight to be afforded evidence, rather than one that updates priors based on the weight of the evidence. This is confirmation bias, pure & simple. If our “scientific research program” uses *that* method of “data collection & measurement strategies,” we will be stuck much longer than we should be w/ our mistaken understandings of how the world works. See Rabin, M. & Schrag, J.L. First Impressions Matter: A Model of Confirmatory Bias. The Quarterly Journal of Economics 114, 37-82 (1999).
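
    The arithmetic in this example can be checked with a few lines. This is a minimal sketch that assumes, as the comment does, a normal sampling distribution with SE = 0.08 around each hypothesized effect; the function name `likelihood_ratio` is illustrative only.

```python
from statistics import NormalDist

def likelihood_ratio(observed, se, h_a, h_b):
    """How much more probable `observed` is under effect h_a than under h_b,
    assuming a normal sampling distribution with the given standard error."""
    return NormalDist(h_a, se).pdf(observed) / NormalDist(h_b, se).pdf(observed)

# Observed OVI = 0.17, SE = 0.08; hypotheses 0.05 versus 0.02.
lr = likelihood_ratio(observed=0.17, se=0.08, h_a=0.05, h_b=0.02)

prior_odds = 1e4                  # 10^4:1 in favor of 0.02 over 0.05
posterior_odds = prior_odds / lr  # evidence favors 0.05, so the odds shrink

print(f"likelihood ratio = {lr:.2f}")
print(f"posterior odds for 0.02 over 0.05 = {posterior_odds:.0f}:1")
```

    The ratio comes out a little under 2, so the posterior odds stay slightly above the rounded 5×10^3:1 given in the text.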

    I think you are driven to the extreme conclusion that we should simply ignore valid data by your appropriate frustration with the admittedly ignorant convention among unreflective NHT-trained researchers (note: not *all* researchers who use NHT methods are unreflective!) to view any “p < 0.05" finding as "proving" the "result" is "true," at least once published in a peer-reviewed journal.

    But the response to that mistake is not to make another– viz., to toss out or ignore data that are too far out of line w/ our priors. It is to correct the mistaken view of what valid empirical proof actually does: merely supply us w/ some additional increment of evidence for believing or not believing one or another hypothesis than we otherwise would have had.

    Tell me what I'm missing.

    • Dan;

      How should I, or some political scientist, react to evidence such as presented in that ovulation-and-voting study?

      We agree that it makes sense to evaluate such claims in the context of our substantive understanding of the world. This evaluation does not have to be Bayesian (indeed, my paper with Carlin is non-Bayesian) but it will use prior information. In this particular example, there are many ways for the study to go wrong, and I don’t want to follow the Kahneman principle that “you have no choice but to accept” these claims.

      To put it another way, you say, Why should I believe my prior so strongly? But my reply is, Why should I believe your data model (your likelihood) so strongly, in practice? In that sense, the right thing to do is to add various fudge factors to the data model to allow for realistic problems with the study. If the underlying comparisons of interest are large, these fudge factors don’t matter so much and we’re ok in ignoring them, but if the underlying comparisons are small, the fudge factors matter.

      It’s the usual story of measurement error in science: the smaller the effect being studied, the more careful you have to be with measurement error.
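
      One way to see the effect of a fudge factor is to widen the data model and recompute how strongly the result discriminates between two small effects. The sketch below is only one possible reading: it assumes the unmodeled problems act like extra independent normal noise with standard deviation `extra_sd`, and that name, its value, and the 0.02 versus 0.05 comparison (borrowed from the comment above) are all illustrative assumptions, not anything estimated from the study.

```python
from math import sqrt
from statistics import NormalDist

def likelihood_ratio(observed, se, h_a, h_b, extra_sd=0.0):
    """Likelihood ratio of h_a over h_b under a normal data model whose
    variance is inflated by a hypothetical extra error term."""
    total_sd = sqrt(se**2 + extra_sd**2)
    return (NormalDist(h_a, total_sd).pdf(observed)
            / NormalDist(h_b, total_sd).pdf(observed))

nominal = likelihood_ratio(0.17, 0.08, h_a=0.05, h_b=0.02)
fudged = likelihood_ratio(0.17, 0.08, h_a=0.05, h_b=0.02, extra_sd=0.08)

print(f"nominal SE only:        LR = {nominal:.2f}")
print(f"with extra error (0.08): LR = {fudged:.2f}")
```

      Under these assumptions the ratio moves toward 1 as `extra_sd` grows, so a noisy study says even less about which small effect is correct than its reported standard error suggests.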

      But, to return to your point: Yes, I agree that Bayesian analysis is generally the way to go, and it’s better to combine prior with data than to just throw the data away. Here’s the story, though: if we’re going to do this, and if we’re being deluged with these sorts of studies, we have to move away from the so-called conservative model in which “theta” could be equally likely to take on any value from -infinity to infinity.

      Cos if you take a frequentist view—what happens if you apply this procedure over and over and over and over again—what you’ll find is something close to what we have now, which is researchers getting excited by noise, and researchers being sloppy about measurement and the formulation of their problems. And, indeed, why should a researcher not be sloppy, given that sloppiness entails all sorts of benefits? Not only is a sloppy study easier and cheaper to run, it also gives lots of degrees of freedom (as Uri Simonsohn would say) to get statistical significance, and lots of room for interpretation so you can get the kind of headline-grabbing stories beloved of PPNAS, etc.

      So my question is, if many researchers are doing sloppy studies, and if some researchers are making entire careers out of dubious data interpretations, then what are they getting out of it—from a scientific perspective, not just a career perspective. And my above post is an attempt to answer that question.

      Your question—how to do a good Bayesian analysis of real data in the context of small effects—is worth asking too. But, again, if the effects being studied are small enough, the payoffs for the Bayesian analysis (or for any statistical analysis) might not be so high.

      • Fair enough.

        But I think this does boil down to point at end of my comment: what’s the right way to overcome unreflective– or just plain wrong– ways to think about how empirical inquiry advances understanding within science. Among the problems are two: (a) people are doing studies that can’t advance understanding by amassing evidence unconnected to meaningful, theory-informed hypotheses about how the world works (“WTF!” studies); and (b) people are presenting evidence relating to meaningful, theory-informed hypotheses as if they “proved” how the world works simply b/c they are “p < 0.05” and are published in peer-reviewed journals.

        I think you are trying to attack the problem w/ better filters & thresholds & procedures: like Type-M & -S error, pre-registration, etc.; replication protocols, etc.

        I'm skeptical that this can work — thoughtless people will adapt their thoughtlessness to new templates & reinvent the problem. Moreover, the filters, thresholds, & procedures are inevitably over-inclusive– they constrain people to toss out information that would in fact, if given the effect it should be given, help to augment knowledge, even if in small amounts (sometimes not small).

        Better, I think, would be to just *teach* those who do research to *think*. Make them get how empirical proof works: not by "proving" propositions; but by adding (when derived by valid methods) weight to balance of considerations in favor of hypotheses acceptance of which is permanently open to revision. Make them *reflect* on how one or another statistic method appropriately contributes to doing that.

        If we had higher *inferential literacy* in decision science disciplines, we wouldn't have "WTF!" studies & wouldn't face a crisis every time a journal publishes a finding plucked from the tail end of the distribution of outcomes we can in fact see even when studies are being done right.

        (I don't think most of the studies you criticize are being done right anyway; most are not valid– the hurricane & daughter ones definitely weren't; probably Ovulation too–I don't know enough to be sure — but here we are "assuming" valid methods, as you did in Gelman & Carlin (2014) for the OVI study).

        • Dan:

          I don’t think of Type M and Type S errors as “filters”; I think of them as mathematical tools to help us understand the properties of statistical procedures.

        • But am I right, Andrew, that Type M & S are elements of an apparatus that generates the answer “ah, just ignore it” to the question you asked– viz., how to “react to evidence such as presented in that ovulation-and-voting study?”

          I think you propose not publishing or not giving the freakish OVI finding any meaningful attention in our quest to figure out “what is the ‘true’ OVI?” (this exercise keeps requiring us to renew suspension of disbelief that anyone could take this whole idea seriously, but let’s be sure to keep doing that so we can focus on general issue) b/c the Type M and Type S were “too high.” Right? That’s a form of “filtering” as I am using this notion.

          The “nonfiltering” alternative is to “take the result on board as is”–if the study is valid — and simply recognize *how little probative weight*, in Bayesian terms, it actually carries in relation to plausible rival hypotheses, particularly given its high SE.

          I prefer that over “counting” evidence only when we feel it is “close enough” to our priors– a posture that I think builds confirmation bias into how we assess empirical proof.

          I prefer it generally, moreover, to any regime that is designed to simply “plug a loophole” in unreflective NHT & related views of how empirical proof works.

          I see your motivation in introducing Type M and Type S as to teach people that they shouldn’t get all excited by a freakish study result and describe it as “proving” some “WTF!” proposition about how the world works.

          But we wouldn’t have that excited, mistaken reaction breaking out all around us if the decision science disciplines weren’t plagued by the low level of inferential literacy that supports ridiculous ideas like “p < 0.05 & published in peer review proves” findings like those in OVI study. If instead, researchers had a Bayesian-inference *conception* of how empirical proof works, they’d give the freakish result the marginal weight it is due in relation to competing plausible hypotheses & get on w/ it; there’d be no need for them to probe the finding w/ “ignore it” tools Type M and Type S as a means to stifle the unreflective exuberance we now see over OVI & other WTF studies.

          No surprise if I’m wrong; just trying to figure this all out.

        • Dan:

          I think that introducing Type S and Type M errors into the conversation is already a help, in that maybe it can push researchers away from the “p is less than .05, so it’s true” mindset. The Type S and Type M error framework can also be seen as a way of introducing prior information in a non-Bayesian way, or alternatively as a way of introducing Bayesianism in a soft way.

          Regarding “how to think about such studies”: Yes, I agree that ultimately a Bayesian approach is the way to go, and I guess my attitude of disbelief comes from my background understanding of the scale of measurement errors. Kinda like how you might evaluate the evidence if some dude in his garage claims to have detected gravity waves based on a measurement from his home oscilloscope.

        • dmk38:

          > inferential literacy
          Agree – if those who funded, carried out and published such studies did have inferential literacy – they would not be doing what they are. (At most they should be documenting what they did, why and what they think the observations should be taken as data [evidence].)

          It is extremely hard to discern in publications what was done, why and what should be taken as data – but without this you do not have evidence or at least any notion of how it should be weighted. My last kick at that can is here (which was not updated because some researchers refused to share their data set with David Dunson and me until they had enough publications from using it themselves, and I guess no one ever has enough publications.)

          Additionally, getting to a good level of inferential literacy is very difficult. My experience – assuming I have achieved a good level of inferential literacy ;-) is that people need to work through (actually conduct) a number of studies with a mentor. Most statisticians who achieve it do so a couple years after graduating. (Maybe part of the reason many find the Forking Paths paper troubling/challenging.)

      • Andrew: Fortunately, the choices aren’t either to throw data away, combine data with prior, or do stupidly bad frequentist tests. You yourself have been highly skeptical of using priors. Subjective priors can enable the researchers who you criticize here to justify their inferences.

        I might note that it’s incorrect to claim that taking a frequentist view is to care about applying a procedure over and over again leading to today’s problems. The account that has direct resources for dealing with biases, cherry-picking, a host of selection effects is an account that can make use of error probabilities. Why? Because those moves demonstrably alter error probabilities, showing the p-values to be spurious and not actual. That is their central value. It’s accounts that tell us selection effects can’t matter that lack direct resources to combat illicit inferences. Even those who eschew priors but use Bayes’ ratios of various types can conveniently bias results. It’s not hard to pull in alternatives that make a Bayes ratio do whatever you want it to do.

        • Mayo– what are good illustrations of how “even those who eschew priors but use Bayes’ ratios of various types can conveniently bias results” & how it’s “not hard to pull in alternatives that make a Bayes ratio do whatever you want it to do”? Likely you have discussed in one or more of your publications, so for sure fine to just refer me to relevant source & pp.

          Am less skeptical of the claim than curious.


        • Rahul:

          It is possible to include prior information in non-Bayesian analyses, as for example discussed in my recent paper with John Carlin. But it is awkward. Bayesian analysis allows prior information to be included more flexibly and directly.

        • Andrew:

          I meant to ask the converse question: If someone doesn’t want to use priors, then why go Bayesian at all?

          Isn’t incorporating prior information the core strength of Bayesian analysis?

        • Rahul:

          You get the machinery which if you calibrate it is fully or partially frequentist.

          There is no frequentist ban on procedures unless they have poor frequentist properties (i.e. relevant subsets).

          Also, and perhaps as important, as Don Rubin once told me, you can always think through the Bayesian methods even if you have to use frequentist procedures, as it likely will give you better insight into the scientific problem.

        • Rahul– Why not? The likelihood ratio is (or should be!) analytically independent of prior odds. Or in something closer to English, the weight to be assigned new evidence doesn’t depend on what probability anyone assigned to the hypothesis before getting the evidence.

          In the case of the ovulation-voting impact (“OVI”) study, e.g.: Dr. A. Gelman believes the true OVI is 0.02; Dr. Alladin Shane believes it is 0.05. Durante, Arsena & Griskevicius perform a valid experiment and find an OVI of 0.17 with SE = 0.08. If we treat Dr. Gelman’s and Dr. Shane’s competing hypotheses as normal probability density distributions with means of 0.02 and 0.05, respectively, and SE’s of 0.08 (whether they should be modeled that way is open to debate; that’s okay, so long as everyone gets that & can think about this issue for him- or herself), the DAG evidence is about 2x more consistent with Dr. Shane’s “0.05” hypothesis than Dr. Gelman’s “0.02” hypothesis.

          That’s the Bayesian likelihood ratio. You multiply your prior odds in favor of Dr. Shane’s hypothesis rather than Dr. Gelman’s hypothesis by 2 (or your prior odds in favor of Dr. Gelman’s hypothesis by 0.5); I’ll multiply my prior odds by the same. Then we both go about our business pending someone else doing an OVI experiment, at which point we repeat this process.

          We didn’t have to have the same priors, obviously, in order for us both to use the DAG evidence. By the same token, DAG didn’t have to worry about readers’/consumers’ priors in doing & then reporting their study results.

          The only thing that we should care about when we evaluate a research finding is whether the methods were valid & whether the likelihood ratio is in fact different from 1 w/r/t competing hypotheses of interest.

        • Dan:

          1. No, I don’t believe the true (population) parameter value is 0.02. I just think there’s no way it could be anything more than 0.02. I think the true value is much less.

          2. No, Durante, Arsena, and Griskevicius did not perform a “valid experiment,” at least not in the sense that you can or should plug in their estimate and standard error into a Bayesian analysis and start computing posterior probabilities. The point of my “power = .06” demonstration was to show the massive problems with Durante et al.’s interpretation of their experiment, even if they’d not had huge problems with measurement, data classification, and multiple comparisons. But their study does have all these problems. Don’t confuse my best-case analysis of their data with what they actually have done.

        • Dan: It is true that posterior/prior has the same shape as the likelihood function regardless of the particular prior in the full parameter space – as long as the prior does not put probability zero on some parameter points.

          However, this is not true about the marginal posterior/marginal prior (nuisance parameters get in the way.)
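
          The first claim, for the full parameter space, can be checked numerically. The sketch below covers only the single-parameter case and does not illustrate the nuisance-parameter caveat. The grid, the two priors, and the helper names are assumptions chosen for illustration; the observed value 0.17 and SE = 0.08 are taken from the OVI example above.

```python
from statistics import NormalDist

def normalize(values):
    """Rescale a list of non-negative weights so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def posterior_over_prior(prior, likelihood):
    """Pointwise ratio posterior/prior on a grid (prior must be nonzero)."""
    posterior = normalize([p * l for p, l in zip(prior, likelihood)])
    return [post / p for post, p in zip(posterior, prior)]

# Grid of candidate effect sizes and the likelihood of observing 0.17.
grid = [i / 100 for i in range(-30, 51)]
likelihood = [NormalDist(theta, 0.08).pdf(0.17) for theta in grid]

# Two very different priors, neither of which is zero anywhere on the grid.
flat_prior = normalize([1.0] * len(grid))
tight_prior = normalize([NormalDist(0.0, 0.05).pdf(theta) for theta in grid])

shape_flat = normalize(posterior_over_prior(flat_prior, likelihood))
shape_tight = normalize(posterior_over_prior(tight_prior, likelihood))
shape_lik = normalize(likelihood)

max_gap = max(abs(a - b) for a, b in zip(shape_tight, shape_lik))
print(f"largest difference from the likelihood shape: {max_gap:.2e}")
```

          Both ratios agree with the normalized likelihood up to floating-point error, whichever prior is used.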

        • @dmk38

          Maybe I’m confused about semantics. But when you use Andrew’s OVI, or Shane’s or your own pet belief of what the OVI really is, you *are* using prior information, correct? e.g. OVI has a normal distribution with mean of 0.02 and SE of 0.08. Then everything makes sense.

          My point was that if you really eschewed priors, & wanted to entirely avoid “polluting” your analysis with prior info, then how is Bayesian analysis adding value? I guess Keith had some points above but I don’t understand those entirely.

  10. “Heilmeier’s Catechism” for evaluating a research project:

    1. What are you trying to do? (Articulate your objectives quantitatively, using absolutely no jargon)
    2. How is it done today and what are the limits of current practice?
    3. What is new in your approach and why do you think it will be successful?
    4. Who cares? If it is successful, what difference will it make?
    5. How will you commercialize or transition the technology to the users? What resources or strategic partners will you need?
    6. What are the risks in implementing your approach and how will you address them in your project?
    7. How much will it cost to reach your ultimate objective? How long will it take?
    8. What are the intermediate and final milestones that will demonstrate success?

    Question 5 isn’t particularly relevant for academia but is for industry. My feeling is that, overall, more attention should be paid to Question 4.

    Ref –

  11. Excellent list. I’d almost want to make this a mandatory format at paper submission.

    I’m trying to imagine how the red-clothes-and-fertility study authors would answer this.

  12. “But I don’t take this harsh view. I accept that theorizing is an important part of science, and I accept that the theorizing of Daryl Bem, or Sigmund Freud, or the himmicanes and hurricanes people, or the embodied cognition researchers, etc etc etc., is science,”

    That’s a view which I would say makes a mockery of scientific theorizing. I don’t know about the others but by positing his absurd retrocausal psi hypotheses Daryl Bem left his field (psychology) far behind and blundered his way into the heart of physics. And in that field he is a deluded crackpot, not a theorist, and a practitioner of futile cargo cult science, not an experimentalist.

  13. @Andrew: This was the *one* contribution in this thread where I didn’t expressly put in the proviso “assuming DAG is valid”– as you do in Gelman & Carlin (2014) — so we can focus on the statistics/methods issue. “Valid” I have in mind relates to valid sampling, valid observations & measures, and valid modeling of the same. If those are okay, then we face the question whether Type-M/-S strategy rather than a “likelihood ratio” one is best for containing misunderstandings of studies & the phenomenon of WTF. “Power = 0.06, Type M-error exaggeration factor =, Type S-error =” might be right way to deal w/ a freakish result — but those tools don’t show *invalidity* unless one assumes answer to question: that those tools rather than others should be used to figure out what to do w/ a research finding that seems out-of-line w/ our priors.

    @Rahul: Supply whatever hypotheses & priors you like. If the DAG data (or any other relating to any other phenomenon on which people disagree about plausible hypotheses & priors for same) are valid (in sense described), you can compute a likelihood ratio. You are confusing motivation of those who find data useful w/ value of doing a study that has an LR ≠ 1.

    @Keith O’Rourke: I need to think more, obviously
