What is needed to do good research (hint: it’s not just the avoidance of “too much weight given to small samples, a tendency to publish positive results and not negative results, and perhaps an unconscious bias from the researchers themselves”)

[cat picture]

In a news article entitled, “No, Wearing Red Doesn’t Make You Hotter,” Dalmeet Singh Chawla recounts the story of yet another Psychological Science / PPNAS-style study (this one actually appeared back in 2008 in Journal of Personality and Social Psychology, the same prestigious journal which published Daryl Bem’s ESP study a couple years later).

Chawla’s article is just fine, and I think these non-replications should continue to get press, as much press as the original flawed studies.

I have just two problems. The first is when Chawla writes:

The issues at hand seem to be the same ones surfacing again and again in the replication crisis—too much weight given to small samples, a tendency to publish positive results and not negative results, and perhaps an unconscious bias from the researchers themselves.

I mean, sure, yeah, I agree with the above paragraph. But there are deeper problems going on. First, any effects being studied are small and highly variable: there are some settings where red will do the trick, and other settings where red will make you less appealing. Color and attractiveness are context-dependent, and it’s just inherently more difficult to study phenomena that are highly variable. Second, the experiment in question used a between-person design, thus making things even noisier (see here for more on this topic). Third, the treatment itself was minimal, of the “priming” variety: the color of a background of a photo that was seen for five seconds. It’s hard enough to appear attractive to someone in real life: we can put huge amounts of effort into the task, and so it’s a bit of a stretch to think that this sort of five-second intervention could do much of anything.

Put it together, and you’re studying a highly variable phenomenon using a minimal treatment, using a statistically inefficient design. The study is dead on arrival. Sure, small samples, the garden of forking paths, and the publication process make it all worse, but there’s no getting around the kangaroo problem. Increase your sample and publish everything, and you still won’t be doing useful science; you’ll just be publishing piles of noise. Better than what was done before—I’d much prefer JPSP to publish piles of noisy raw data than to fake people out with ridiculous claims—but still not good science.
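To get a sense of just how dead on arrival, here is a little simulation, a minimal sketch with invented numbers (ratings on a 1–9 scale, a context-dependent true effect of a fraction of a point, 15 people per group); nothing here comes from the actual study:

```python
# A toy between-person "red vs. gray" experiment with a small,
# context-dependent true effect. All numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_per_group = 15     # small samples typical of this literature (assumption)
rating_sd = 2.0      # person-to-person spread in 1-9 attractiveness ratings (assumption)
n_sims = 10_000

sig_estimates = []
for _ in range(n_sims):
    true_effect = rng.uniform(-0.2, 0.2)   # effect of red varies by context (assumption)
    red = rng.normal(5 + true_effect, rating_sd, n_per_group)
    gray = rng.normal(5, rating_sd, n_per_group)
    diff = red.mean() - gray.mean()
    se = np.sqrt(red.var(ddof=1) / n_per_group + gray.var(ddof=1) / n_per_group)
    if abs(diff / se) > 2:                 # roughly "statistically significant"
        sig_estimates.append(diff)

print("typical standard error:", round(rating_sd * np.sqrt(2 / n_per_group), 2))
print("share of runs reaching significance:", len(sig_estimates) / n_sims)
print("mean |estimate| among significant runs:", round(np.mean(np.abs(sig_estimates)), 2))
```

The standard error is several times larger than any plausible effect, so the rare “significant” comparisons are necessarily huge overestimates: the kangaroo problem in miniature.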

My second problem is with this final quote:

In an interview with Slate, Elliot admitted that sample sizes in his earlier works were “too small relative to contemporary standards.” He added, “I have an inclination to think that red does influence attraction, but it is important for me to be open to the possibility that it does not.”

My reply: of course red influences attraction! So does blue! So does the cut of your clothes and whether you chewed on breath mints recently. All these things have effects. But . . . trying to study the effect of red in isolation, using the background of an image . . . that’s just hopeless. That’s the outmoded, button-pushing model of social and behavioral science, which is tied to an outmoded, significance-testing model of statistics.

40 thoughts on “What is needed to do good research (hint: it’s not just the avoidance of “too much weight given to small samples, a tendency to publish positive results and not negative results, and perhaps an unconscious bias from the researchers themselves”)”

  1. Interesting. I think many posts in this blog are moving away from stats and more into psychology — or at least “psychology of research methods.”

    Not that this is something necessarily bad.

  2. Given Tinder etc., I would have thought a priori that you could consistently increase your happiness a little bit by wearing red. How is your thinking different? Why’s this study dead on arrival if I believe that? Am I crazy to think that?

    • Em:

      1. I think it will depend on context. After all, people do wear other colors than red, even when trying to be appealing. In some sense we’re already at the equilibrium, and then we can choose to wear red, black, or whatever, depending on local conditions.

      2. Huge variation from person to person, hence bad idea to do a between-person study (see the sketch after this list).

      3. Changing the background color of an image viewed for 5 seconds: that’s a very attenuated, one might say homeopathic, version of any real treatment.
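      Here is a quick sketch of points 2 and 3: compare the standard error from a between-person design to that from a within-person (paired) design, with invented variance components just to show the arithmetic:

```python
# Between-person vs. within-person comparison; the variance components
# (person_sd, noise_sd) and the effect size are invented assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 30
person_sd = 2.0   # stable differences in how attractive each photo is rated (assumption)
noise_sd = 0.5    # trial-to-trial measurement noise (assumption)
effect = 0.2      # hypothetical average effect of a red background

# Between-person: different photos/raters in each condition
total_sd = np.sqrt(person_sd**2 + noise_sd**2)
red_b = rng.normal(5 + effect, total_sd, n)
gray_b = rng.normal(5, total_sd, n)
se_between = np.sqrt(red_b.var(ddof=1) / n + gray_b.var(ddof=1) / n)

# Within-person: each photo rated under both backgrounds, so the stable
# person/photo differences cancel out of the paired comparison
person = rng.normal(5, person_sd, n)
red_w = person + effect + rng.normal(0, noise_sd, n)
gray_w = person + rng.normal(0, noise_sd, n)
se_within = (red_w - gray_w).std(ddof=1) / np.sqrt(n)

print(f"between-person SE: {se_between:.2f}")   # around 0.5: swamps a 0.2 effect
print(f"within-person SE:  {se_within:.2f}")    # around 0.13: actually informative
```

      Same number of ratings, very different amounts of information.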

      • “I think it will depend on context. After all, people do wear other colors than red, even when trying to be appealing. In some sense we’re already at the equilibrium, and then we can choose to wear red, black, or whatever, depending on local conditions.”

        Isn’t this just omitted variable bias?

        • Anon:

          My point was that (a) I’d expect any treatment effect to vary, and (b) given that we’re already at some equilibrium, I’d expect that in the real world, the variation of the treatment effects would be larger than the average. Hence I’m suspicious of the claim, “you could consistently increase your happiness a little bit by wearing red.”

        • I’d expect any treatment effect to vary

          I’m just realizing this is like saying the model is misspecified because it does not contain all relevant IVs. So wearing red might increase happiness by 2 points for people living in 2016 North America, but decrease it by 2 points in 2017 Europe. However, if it is during a government-declared drought, then this pattern would be reversed and halved, or whatever. Here the year, continent, and drought status are the omitted variables.

          It really makes me think that estimating regression coefficients is a meaningless activity unless you really believe the statistical model maps well enough to reality. But no one who really thought that would be using a default statistical model, they would derive predictions from their real model…
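          A toy version of what I mean, with the year/continent context omitted from the regression (all numbers hypothetical):

```python
# If the effect of red is +2 in one omitted context and -2 in another,
# the pooled regression coefficient is just an artifact of how many
# observations happen to come from each context. Numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(2)

def pooled_slope(n_a, n_b):
    red_a = rng.integers(0, 2, n_a)   # context A: true effect of red is +2
    red_b = rng.integers(0, 2, n_b)   # context B: true effect of red is -2
    y = np.concatenate([5 + 2 * red_a + rng.normal(0, 1, n_a),
                        5 - 2 * red_b + rng.normal(0, 1, n_b)])
    red = np.concatenate([red_a, red_b])
    return np.polyfit(red, y, 1)[0]   # OLS slope of y on red, context omitted

print(pooled_slope(900, 100))   # roughly +1.6: "red helps"
print(pooled_slope(100, 900))   # roughly -1.6: "red hurts"
print(pooled_slope(500, 500))   # roughly  0.0: "no effect"
```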

        • I dunno – sometimes the average effect is interesting. I mean, suppose some other state wanted to challenge California for the most attractive people in America so as to drum up tourism or keep young people in their state… they might like to know how much they could improve the average hotness of their population by subsidizing red v-necks and summer dresses. With important public policy considerations like that, the mean effect can be interesting.

        • But if the treatment effect varies, the coefficient estimate cannot tell you “how much they could improve the average hotness of their population by subsidizing red v-necks and summer dresses” unless the model is known to be (at least approximately) correctly specified.

        • Anoneuoid,

          Suppose I have a perfect randomized control trial. I regress my outcome on a constant and an indicator variable for treatment group. This model is clearly (and wildly) “mis-specified” in the sense of relating my outcome to some actual data-generating-process in the world. But it also provides an unbiased estimate of the average treatment effect. No? Or by “correctly specified” do you just mean “unbiased”?
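          For what it’s worth, here is a minimal simulation of that claim; the potential outcomes below are invented, and deliberately nothing like the fitted model:

```python
# Randomization makes the difference in means (equivalently, the OLS
# coefficient on a treatment dummy) an unbiased estimate of the sample
# average treatment effect, however wrong the outcome model is.
# The potential outcomes are invented for illustration.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

x = rng.normal(size=n)
y0 = np.sin(3 * x) + rng.normal(0, 1, n)   # outcome if untreated
y1 = y0 + 0.5 + 1.5 * (x > 0)              # heterogeneous treatment effects
true_ate = (y1 - y0).mean()                # about 1.25

treat = rng.integers(0, 2, n)              # randomized assignment
y = np.where(treat == 1, y1, y0)
ate_hat = y[treat == 1].mean() - y[treat == 0].mean()

print(round(true_ate, 3), round(ate_hat, 3))   # close, with no model of y0 or y1
```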

        • Jrc:

          If the interactions are large, then the “average treatment effect” you’re estimating is only the average over the people and situations in your study; this might not be relevant for new people and for new situations, as with those ovulation-and-clothing researchers who decided that their original hypothesis held only during certain times of the year, an interaction not at all considered in their original paper.

          One problem, I think, is that our statistical discussions are framed in terms of hypothesis tests and average treatment effects, but researchers typically consider it a win if they are able to find “p less than .05” for anything vaguely related to their hypothesis, as this counts as a discovery. If you really want to take your treatment effect anywhere beyond your original study, interactions can be important. This is a point that came up in my discussion of the “freshman fallacy.”

        • Andrew – agreed. In cases where the treatment effect varies a whole lot across a whole lot of different kinds of people, all you ever learn from a simple RCT like that is something about the average effects across the particular people in your sample. And I also agree that hotness-and-red-wearing is probably a situation like that (full disclosure/self-preservation: my partner looks great in green! and black! and red!). And of course we agree that the whole “it is significant and so real and meaningful” thing is nonsense.

          But I also think that the idea of needing a “correctly specified model” misses something important. One of the most interesting aspects (to me) of contemporary methods for causal inference in quantitative social science research is that they do NOT need a “correctly specified model” of the outcome. They need a model that identifies off of the “correct” variation in the world – whether that be something directly manipulated by the researcher themselves (an RCT) or manipulated by an accident of nature (RD or IV or other quasi-experimental designs).

          I am of the general opinion that, for many interesting and important human behaviors, we will never have a “correctly specified model”. But I think we can still make real progress by realizing that modeling the outcome is sometimes less important than modeling/harnessing useful variation in our explanatory variable of interest. That was the point I wanted to make to Anoneuoid.

        • Suppose I have a perfect randomized control trial. I regress my outcome on a constant and an indicator variable for treatment group. This model is clearly (and wildly) “mis-specified” in the sense of relating my outcome to some actual data-generating-process in the world. But it also provides an unbiased estimate of the average treatment effect. No? Or by “correctly specified” do you just mean “unbiased”?

          When Andrew says “the treatment effect varies”, this is the same problem as extrapolating without/beyond a well-defined population, which is the same problem as omitted variable bias. I’m basically working with the wikipedia definition:

          In statistics, omitted-variable bias (OVB) occurs when a model is created which incorrectly leaves out one or more important factors. The “bias” is created when the model compensates for the missing factor by over- or underestimating the effect of one of the other factors.

          https://en.wikipedia.org/wiki/Omitted-variable_bias

          I think the estimate from your study is only valid for “enumerative” problems, but if you want to extrapolate at all beyond the population (or if one is never even specified), the values of the coefficient estimates are dubious. That is, unless you actually believe your model is close to well-specified (e.g., includes all the important variables, etc.):

          There is no statistical method by which to extrapolate to longer usage of a drug beyond the period of test, nor to other patients, soils, climates, higher voltages, nor to other limits of severity outside the range studied. Side effects may develop later on. Problems of maintenance of machinery that show up well in a test that covers three weeks may cause grief and regret after a few months. A competitor may step in with a new product, or put on a blast of advertising. Economic conditions change, and upset predictions and plans. These are some of the reasons why information on an analytic problem can never be complete, and why computations by use of a loss-function can only be conditional. The gap beyond statistical inference can be filled in only by knowledge of the subject-matter (economics, medicine, chemistry, engineering, psychology, agricultural science, etc.), which may take the formality of a model [12], [14], [15].

          On Probability As a Basis For Action. W. Edwards Deming. Reprinted from The American Statistician, Vol. 29, No. 4, 1975, pp. 146-152. https://deming.org/files/145.pdf

        • jrc: ‘I am of the general opinion that, for many interesting and important human behaviors, we will never have a “correctly specified model”.’

          I think that’s pretty obviously true, but at the same time, I think that it’s possible to structurally specify models that are much more descriptive of real world processes and have much more domain specific information than is normally used in social science models.

          For example, suppose someone is studying, I don’t know, the growth of children in different economic environments ;-)

          Growth of children is obviously a time-series issue, and there are biological inputs to growth, and hormone feedback, and choices that parents make. These things all force the growth to increase or decrease through time. We can structure our analysis along the lines of a process through time, an ODE for example with time-varying forcing functions, and get farther than if we do some kind of standard linear regression, and we could do this even if we only get a single time point for each child, so that standard methods would find it very questionable as to how a time-series model might be useful. It might even be helpful if the individual children are measured at variously jittered time points (provided especially that we have some idea about when it was that they were measured).

          But, I think that kind of model is going to go a LOT farther in a Bayesian paradigm with very direct input of subject matter ideas into the structure of the time-series than you would with some kind of standard out of the box Frequentist motivated ARIMA type whatchacallit.
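          Something along these lines, just to show the structure (the growth equation, the forcing function, and every number here are invented; a real version would be fit in Stan with informative priors):

```python
# A structural sketch: height driven by an ODE with a time-varying
# "nutrition" forcing, rather than a generic regression on covariates.
# Functional form and parameters are invented for illustration.
import numpy as np
from scipy.integrate import odeint

def growth_rate(h, t, h_max, k, nutrition):
    """dh/dt: growth toward an asymptote, scaled by nutrition at time t."""
    return k * nutrition(t) * (h_max - h)

# Hypothetical forcing: adequate nutrition until age 3, then a dip
nutrition = lambda t: 1.0 if t < 3 else 0.6

ages = np.linspace(0, 10, 50)                        # years
heights = odeint(growth_rate, y0=50.0, t=ages,       # 50 cm at birth (assumption)
                 args=(170.0, 0.12, nutrition)).ravel()

# Even one measurement per child at a known (jittered) age constrains
# (h_max, k, nutrition) jointly once the structure is specified.
for age, h in zip(ages[::10], heights[::10]):
    print(f"age {age:4.1f}: predicted height {h:6.1f} cm")
```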

        • Daniel,

          I don’t know what child growth has to do with Ozone Depletion Events, Orbital Data Explorers, or Old-Dog Encephalitis (real thing that), but I think I agree with the thrust of what you are saying. I think the idea that we leave reality behind and just blindly run comparisons any time we want to do “statistics” is crazy. Obviously we need to use what we know about the world to structure our empirical investigations.

          The question is how to do that well, and what kinds of prior information we bring in for what purpose in what ways. My overarching point here – other than to make fun of this general line of inquiry which strikes me as kinda silly (sorry Bob) – is that trying to build a model that matches the “true” data generating process in the world is, in many cases, a mug’s game. And one we don’t have to play to get certain kinds of answers to certain kinds of interesting questions about human behavior, including (in some cases) meaningfully interpretable “causal” estimates.

          You and I can chat about how to best do that – and I agree that there is a lot of room for Bayesian regularization here* – but more and more I see people say things like “estimating regression coefficients is a meaningless activity unless you really believe the statistical model maps well enough to reality” and I think….sorta, but maybe not in the way you think**. No amount of modeling of human behavior (or development) will ever turn the regression model “error term” into an actual, statistical “error term” or “mean 0 random variable”. The “error term” in a regression will always be a measure of our ignorance (I don’t mean residual here, I mean the thing we write in our regression equations). We want to think instead about what kinds of model misspecification, omitted variables, and selection (into sample or treatment) are at work conditional on what we have modeled, and how that might influence our estimates. Maybe I could say that the model needs to relate to the world in a meaningful way, but does not need to represent the world.

          *For real – I have this problem right now where I just know that “borrowing information across parameters” is something I need to do to stabilize some estimates, because I’m running out of variation to “borrow information across observations”. To use your (totally bizarre and out of left field) example, I know that, say, the effect of an input at age 1 on height at age 4 is very similar to the effect of that input at age 5…but that is hard to model in a purely frequentist way.
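          A crude sketch of the kind of borrowing I mean, with one-step normal-normal shrinkage standing in for a real hierarchical fit (the per-age estimates and standard errors are invented):

```python
# Partial pooling: shrink noisy age-specific effect estimates toward a
# common mean instead of treating each age as a separate regression.
# All inputs are invented; a real analysis would fit the hierarchy jointly.
import numpy as np

est = np.array([1.8, 0.4, 3.2, 0.9, 1.7])   # per-age effect estimates (hypothetical)
se = np.array([0.6, 0.6, 0.6, 0.6, 0.6])    # their standard errors (hypothetical)

mu = np.average(est, weights=1 / se**2)               # pooled mean
tau2 = max(np.var(est) - np.mean(se**2), 0.01)        # crude between-age variance

# Precision-weighted blend of each age's own estimate and the pooled mean
shrunk = (est / se**2 + mu / tau2) / (1 / se**2 + 1 / tau2)
print(np.round(shrunk, 2))   # estimates pulled toward one another
```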

          **Note – Anoneuoid is probably a bad example here, since I think based on the totality of their comments that we are in a lot of agreement. I was just taking that line as a kind of straw-man. Maybe that wasn’t fair.

        • You and I can chat about how to best do that – and I agree that there is a lot of room for Bayesian regularization here* – but more and more I see people say things like “estimating regression coefficients is a meaningless activity unless you really believe the statistical model maps well enough to reality” and I think….sorta, but maybe not in the way you think**. No amount of modeling of human behavior (or development) will ever turn the regression model “error term” into an actual, statistical “error term” or “mean 0 random variable”. The “error term” in a regression will always be a measure of our ignorance (I don’t mean residual here, I mean the thing we write in our regression equations). We want to think instead about what kinds of model misspecification, omitted variables, and selection (into sample or treatment) are at work conditional on what we have modeled, and how that might influence our estimates. Maybe I could say that the model needs to relate to the world in a meaningful way, but does not need to represent the world.

          Such models can be useful even if the model coefficients/parameters are totally misleading. It is just that you can’t give any meaningful interpretation to your estimates (like if we do A it will increase B by x amount). You can only evaluate the predictive skill, which is a perfectly fine goal.

          There are para-statistical ways of checking these estimates, though. E.g., independent replication: if everyone keeps getting the same value year after year in different locations, it is probably accurate within the range of environmental conditions that have been checked.

        • Anon:

          It depends on context. I think there are some settings where variation is low and the average treatment effect is of interest, and other settings where the treatment effect varies so much that any average is close to meaningless.

        • Reincarnate Deming, Meehl, and Tukey, and maybe the rest of us could just retire!

          Yep, but still today you are considered a crackpot for holding opinions like this (from the same Deming paper):

          Little advancement in the teaching of statistics is possible, and little hope for statistical methods to be useful in the frightful problems that face man today, until the literature and classroom be rid of terms so deadening to scientific enquiry as null hypothesis, population (in place of frame), true value, level of significance for comparison of treatments, representative sample.

  3. That’s just like, your opinion, man. Seriously though, how do you know, a priori, when effects are small and highly variable? Isn’t Elliot et al.’s claim that the effects of red are relatively large and not all that variable? Why isn’t a well-designed, high powered study the right way to resolve this?

    • Some:

      Of course, a well-designed, high powered study is the right way to resolve this! The point of my post is that the study was not well designed, nor was it high powered.

      To have a well-designed, high-powered study, you want: (a) measurements that are closely related to the underlying construct of interest, and (b) control of variation. This study had neither. The treatment was the color of a background of an image that was shown for five seconds, which is not at all close to the real-world setting of interest. And the between-person design did not control variation. The paper had all sorts of other researcher-degrees-of-freedom errors (for example, reporting p=.06 as “marginally significant,” and treating the difference between significant and non-significant results as meaningful), but these all arise naturally from (i) a set of noisy, dead-on-arrival experiments, and (ii) the pressure to publish statistically significant results.

      • Ah, I initially thought you were saying that even if this had been a well-designed high powered study (which it wasn’t), it would still be dead on arrival. What I gather you’re saying now is that these types of lab studies can’t be well-designed because of the problems you mentioned. Thanks for clarifying!

      • “To have a well-designed, high-powered study, you want: (a) measurements that are closely related to the underlying construct of interest, and (b) control of variation.”

        Speaking more generally (apart from this particular study), my supposition is that in psychology, it is very difficult to determine a priori whether a set of measurements is closely related to the underlying construct of interest, because inherently human behavior is (presumably) the result of a complex interaction of genetic predisposition and environmental influence.

        At best, would you concede that in psychology (in fact, much of social science), the best we could hope for is to deal with latent variables, which as defined by Radford Neal are entities we invent to explain patterns we see in observable variables?

        http://www.cs.toronto.edu/~radford/res-latent.html

  4. Andrew:

    I’ve read the power 0.06 paper and your writings on this issue regularly. I have statistical knowledge, but I don’t know how I can convincingly do what you’re doing.

    When I read a study, what calculations should I do to determine if it’s not believable? Could you lay this out formally, beginning with a linear model?

    Maybe I need a prior on some treatment interaction magnitudes and on covariates? What are the necessary quantities, measured and unmeasured? How can I do what you’re doing?
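    For instance, is the sort of design analysis in that power 0.06 paper what you mean? Something like this, with made-up inputs (a plausible true effect and the study’s standard error)?

```python
# Design analysis in the spirit of Gelman & Carlin (2014): posit a
# plausible true effect and the study's standard error, then ask what
# "statistically significant" estimates would look like.
# Both inputs below are hypothetical.
import numpy as np
from scipy import stats

true_effect = 0.1   # effect size that prior knowledge makes plausible (assumption)
se = 0.5            # standard error implied by the study's design (assumption)

sampling = stats.norm(loc=true_effect, scale=se)
power = sampling.sf(1.96 * se) + sampling.cdf(-1.96 * se)   # P(significant)
type_s = sampling.cdf(-1.96 * se) / power                   # P(wrong sign | significant)

draws = sampling.rvs(size=200_000, random_state=1)
sig = draws[np.abs(draws) > 1.96 * se]
exaggeration = np.mean(np.abs(sig)) / true_effect           # type M (exaggeration) factor

print(f"power: {power:.2f}, type S: {type_s:.2f}, exaggeration: {exaggeration:.1f}x")
```

    Is that the right kind of calculation, and where do the priors on interactions and covariates come in?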

  5. I conducted the replication studies which found little to no impact of red on attraction. I agree with Andrew’s point that the original studies were problematic in ways that go beyond their small sample size. But from reading the comments, I’m not sure if I agree as to why.

    To my mind, the big problem is that prior to the studies on red/attraction we already knew a good deal about the factors that influence attraction (facial symmetry, match in ethnicity, degree to which facial features match gender, etc.). So the key problem is designing an experiment which acts as though all this prior knowledge doesn’t exist. Some of these important factors were left completely uncontrolled (e.g. match between participant and rated ethnicity). Others were controlled by using the same photo in both conditions, but never varied to determine how color might interact with them. It would have been more productive to develop a design that could estimate how much clothing color influences attraction above and beyond (and/or in interaction with) the other well-established factors. To put it more plainly, it doesn’t seem sensible to begin each between-subjects study anew, as though the dependent variable was never studied before.

    I *think* this connects to the points being made in the comments that the noise/context effects are likely much larger than color. The important point, though, is that many of the additional factors influencing attraction were already known and could have been systematically explored in conjunction with color. Or so it seems to me.
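    For example, something along these lines, where color is varied jointly with one already-established factor (the data and coefficients are simulated, purely to illustrate the design, not results from any real study):

```python
# A factorial-style design: randomize background color but also vary (or
# measure) a known predictor, so the model can estimate the color effect
# above and beyond, and in interaction with, it. Simulated data only.
import numpy as np

rng = np.random.default_rng(4)
n = 400
red = rng.integers(0, 2, n)            # randomized background color
symmetry = rng.normal(0, 1, n)         # standardized, well-established predictor
attract = (5 + 0.8 * symmetry          # large known effect (assumption)
             + 0.1 * red               # small main effect of red (assumption)
             + 0.3 * red * symmetry    # red "works" mainly for symmetric faces (assumption)
             + rng.normal(0, 1, n))

X = np.column_stack([np.ones(n), red, symmetry, red * symmetry])
coef, *_ = np.linalg.lstsq(X, attract, rcond=None)
print(dict(zip(["intercept", "red", "symmetry", "red_x_symmetry"], np.round(coef, 2))))
```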

    • “…as though the dependent variable was never studied before…”

      GS: And which dependent-variable would that be? Yes…I know you will say that it is “attraction” – but it could be argued that what it is that is measured is the actual dependent-variable, mainstream psychology’s laughable interpretation of operationism notwithstanding. I wonder if that could be one of the reasons that mainstream psychology sucks. Maybe we should ask Andrew about that…since he’s a statistician, and sees behavior every day, he MUST be in a position to direct the future of psychology. OTOH, the stuff Gelman criticizes is the worst of mainstream psychology (which is God-awful as a whole anyway), stuff that generally borrows heavily from ordinary-language as if our ordinary observations and discussions of behavior provide a sound basis for a science of behavior, eh? So…the things that Gelman knows about a science of behavior are on a par with the knowledge possessed by the hacks he criticizes.

  6. Hi Andrew,

    Thanks for the stimulating post. Always enjoy your writing. I’m still pretty new to the blog, and one of your last sentences caught my eye:

    “trying to study the effect of red in isolation, using the background of an image . . . that’s just hopeless.”

    Any other blog posts of yours in the past that might elaborate further on this argument? A lot of fields (including my own) advocate strongly for this exact type of experimental approach, where one specific feature of the stimulus is manipulated while holding all else constant. I’d love to hear the counter-argument from your end.

    Thanks,
    Frank

    • Frank:

      I think your question is answered by Bob Calin-Jageman in his comment immediately above yours. The quick answer is that it doesn’t make much sense to study the effect of red in isolation, given that we’d expect any effects to vary strongly by (a) the implementation of “red” (no reason to believe the effect of red in an image background would be anything like the effect of red clothing on a real person), and (b) the setting (no reason to believe the effect in the lab would be anything like the effect in real life). I refer you to my above-linked post on the freshman fallacy.

  7. Samples that are not blatantly biased are useful for good research.

    Mawson AR (2017) “Pilot comparative study on the health of vaccinated and unvaccinated 6- to 12-year-old U.S. children”

    The letter to parents solicited for the study began:

    “Dear Parent, This study concerns a major current health question:
    namely, whether vaccination is linked in any way to children’s long-term
    health. Vaccination is one of the greatest discoveries in medicine, yet
    little is known about its long-term impact. The objective of this study
    is to evaluate the effects of vaccination by comparing vaccinated and
    unvaccinated children in terms of a number of major health outcomes …”

    http://www.cmsri.org/wp-content/uploads/2017/05/MawsonStudyHealthOutcomes5.8.2017.pdf

  8. That’s the outmoded, button-pushing model of social and behavioral science, which is tied to an outmoded, significance-testing model of statistics.

    There is a lot of informative, rigorous behavioral research that relies primarily, often exclusively, on pushed buttons, much of which involves sophisticated multilevel Bayesian modeling, with no NHST to be seen. I’d like to think that my own work fits into this category, and I think work by people like EJ Wagenmakers (who sometimes comments here) definitely does.

    • “…There is a lot of informative, rigorous behavioral research that relies primarily, often exclusively, on pushed buttons…”

      And Stephen Hawking can elucidate physics and give lectures by, if not pushing buttons, interacting with some manipulandum. The importance of a response is its function. This doesn’t mean that button-pressing, as the measured response, automatically makes for good research…mainstream psychology’s biggest problem is that it lacks sophisticated analysis of its concepts and is, thus, a conceptual cesspool. Mainstream psychologists concentrated their efforts on building a fact-gathering system…and it failed at that! So…you can’t trust its facts, and its conceptual foundation is laughable, whether composed of the reification of ordinary mentalistic terms or computer metaphors.
