If a research team starts with a speculative idea, whether it be silly (the idea that people will play better if they’re told they have a lucky golf ball) or speculative (the idea that people will react differently to a male- or female-named hurricane) or borderline ridiculous (the idea that beautiful parents will be more likely to have girl babies), there are a few ways they can go forward:
One way to go, which I like, is to advance from a directional hypothesis to a quantitative hypothesis. This takes some work, and I argue it’s work worth doing, as it leads to closer connection to existing science and motivates a critical reading of the literature. It also then leads to the next step of constructing a generative model and simulating fake data that can be used to design a possible experiment.
Another way to go, which unfortunately seems to still be the standard approach in many areas of psychology research, is to just jump right in and conduct an experiment with no realistic sense of possible effect size and variation, and then find something statistically significant and go publish.
That second approach will often take everyday mediocre science and turn it into bad science.
Just for example, here’s an abstract representing mediocre science:
We speculate that people could react consistently differently to hurricanes with male and female names. This could be studied by comparing death rates in historical hurricanes and further understood using laboratory experiments studying people’s gender-based expectations about severity and preparedness to take protective action.
And here’s an abstract representing junk science:
Do people judge hurricane risks in the context of gender-based expectations? We use more than six decades of death rates from US hurricanes to show that feminine-named hurricanes cause significantly more deaths than do masculine-named hurricanes. Laboratory experiments indicate that this is because hurricane names lead to gender-based expectations about severity and this, in turn, guides respondents’ preparedness to take protective action. This finding indicates an unfortunate and unintended consequence of the gendered naming of hurricanes, with important implications for policymakers, media practitioners, and the general public concerning hurricane communication and preparedness.
The latter is the abstract from the published himmicanes paper; the former is my adaptation of what could’ve been written as speculation. The published abstract is, in my opinion, bad science in combining a lack of strong theory with an absence of evidence. In contrast, my “mediocre” abstract has no strong theory but it does not pretend to have evidence. The addition of the strong and unsupported claims made the project worse.
Also, the gap between the mediocre and bad abstracts is instructive, in that it suggests what is missing: some sense of effect sizes and variation, which would help in any study design.
Of course, the mediocre abstract would never get published in PPNAS or featured on NPR!
This all came up in a recent blog discussion, following this comment from Dale:
When it comes to research, it is all too easy to label a research paper as good or bad, or conclude it should never have been done, or label the data as too noisy to yield meaningful results. But I think all of these are a continuum – the world is all gray. When is the data “too noisy?” The real question is “too noisy for what?” I think the problem with these studies is not that the data is too noisy for the study to be done, but that it is too noisy to reach any conclusions. . . .
What accounts for the prevalence of bad social science? I would propose that this is the wrong focus – it is the nature of social science that we will differ about what studies are worth undertaking, which are worth reporting, and what conclusions can be reached (or even suggested). I think a better question is what accounts for the failures of our research institutions (including academia, think tanks, granting agencies, etc.) to provide meaningful evaluation of social science research? Until the evaluation improves, I would not expect to see the quality of the work improve.
There are multiple dimensions here. There’s the quality of the science (which is some combination of theory, design of data collection and measurement, and analysis), the interestingness or importance of the question being asked (whether the study is worth doing at all), and the general direction of the research program. The point of the present post is to separate some of these issues by considering how to science better, even if you happen to be studying a silly topic. The same principles should apply to more serious work as well.
Summary
Again, my recommendation is to advance from a directional hypothesis to a quantitative hypothesis. This takes some work, and I argue it’s work worth doing, as it leads to closer connection to existing science and motivates a critical reading of the literature. It also then leads to the next step of constructing a generative model and simulating fake data that can be used to design a possible experiment.
I think this is better than to just jump right in and conduct an experiment with no realistic sense of possible effect size and variation, and then find something statistically significant and go publish (or highlight a non-significant difference and claim to have demonstrated no effect).
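As a minimal sketch of what "simulating fake data to design an experiment" could look like, here is a generic two-group simulation in Python. Everything in it is an assumption made for illustration: the normal outcome model, the effect sizes, the standard deviation, the sample size, and the function name simulate_design are hypothetical and are not taken from the hurricanes paper or any other study. For a hypothesized effect size, it estimates how often the design would reach statistical significance and, among the significant results, how much they would overstate the assumed true effect and how often they would have the wrong sign.

```python
import random
import statistics


def simulate_design(true_effect, sd, n_per_group, n_sims=2000, seed=1):
    """Simulate fake two-group experiments under an ASSUMED effect size and noise level."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        # Generative model (an assumption): normal outcomes, shifted by true_effect.
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, sd) for _ in range(n_per_group)]
        estimate = statistics.fmean(treated) - statistics.fmean(control)
        std_error = (
            (statistics.variance(control) + statistics.variance(treated)) / n_per_group
        ) ** 0.5
        # Rough two-sided test at the conventional 5% level.
        if abs(estimate) > 1.96 * std_error:
            significant.append(estimate)

    power = len(significant) / n_sims
    if significant and true_effect != 0:
        # Average size of significant estimates relative to the assumed true effect.
        exaggeration = statistics.fmean([abs(e) for e in significant]) / abs(true_effect)
        # Share of significant estimates whose sign is opposite to the true effect.
        wrong_sign = sum(e * true_effect < 0 for e in significant) / len(significant)
    else:
        exaggeration = wrong_sign = float("nan")
    return {"power": power, "exaggeration": exaggeration, "wrong_sign": wrong_sign}


# Two hypothetical quantitative hypotheses, in standard-deviation units.
small = simulate_design(true_effect=0.05, sd=1.0, n_per_group=50)
large = simulate_design(true_effect=0.5, sd=1.0, n_per_group=50)
print("small effect:", small)
print("large effect:", large)
```

The point of running something like this before collecting data is that it forces a commitment to plausible numbers. If the realistic effect is small relative to the variation, the simulation shows in advance that a significant result from this design would be mostly noise.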
Andrew, could you illustrate the process of formulating the quantitative hypothesis in the hurricanes case? Even a back-of-the-envelope calculation would be useful I think. But I’d love to see the generative model and the fake data too. Maybe a new blog post?
Here’s my effort to form a quantitative hypothesis about the impacts of male vs. female hurricane names:
“What approximately are the chances that how humans name a hurricane – whether with a male or a female name – will affect the damage caused by the hurricane? Zero.”
The “Zero” is the quantitative part. Next project.
There’s no sense in pretending to generate a quantitative method that rejects quack speculations before they become research projects. There are thousands if not millions of sound research questions where we have some reason to believe we may actually discover something useful. So why would we waste time and money doing idiot speculations about hurricane names and other such quackery?
I share your skepticism on a personal level. At the same time, the world has some real surprises, and I can see the argument that a good researcher should occasionally consider absurd-seeming hypotheses (e.g. that bacteria cause ulcers or that the complex square root of a probability can act like a wave).
So yeah, it would be surprising if implicit bias caused female-named hurricanes to do, say, 10% more damage, on average, than male-named hurricanes. But it wouldn’t be the weirdest thing humanity has ever discovered.
I thought Andrew was proposing an actual method, a concrete thought process … in which case I personally think it would be instructive to see some more details.
Dmitri,
I get your point but if we look back through the pantheon of “absurd-seeming hypotheses” that have turned out to be true, I think we’ll see that nearly all such hypotheses have amassed substantial evidence long before they are proposed. Typically researchers generate the necessary observations to support these claims in the process of doing other research, then they assemble the evidence to generate a testable hypothesis. Often they know that their hypothesis would be received negatively so they wait to go public until their evidence is strong enough to withstand serious scrutiny. The most famous case is Darwin.
In the case of stomach ulcers and bacteria, there was a body of research that had observed bacteria in the stomach over a long period (decades), and the researchers who discovered the relationship between H pylori and stomach ulcers did not publish their findings until they succeeded in culturing the bacteria (which, as often is the case, happened by accident).
https://en.wikipedia.org/wiki/Helicobacter_pylori (section title “History”)
https://en.wikipedia.org/wiki/Timeline_of_peptic_ulcer_disease_and_Helicobacter_pylori
In contrast, the idea that female-named storms cause people to be less concerned about hurricanes is nothing but a wild supposition, with no existing credible observations and no body of supporting literature, which was furthermore advanced with only the most scant “hypothesis” as to its cause, and with shaky stats to boot.
The difference here is on three points:
1) there was a body of evidence supporting the concept of stomach bacteria going back centuries
2) the researchers observed the stomach bacteria associated with ulcers on multiple occasions, so they had a sound basis to develop their hypothesis
3) the researchers cultured the bacteria and confirmed it before publishing their hypothesis
They weren’t just taking a wild shot in the dark.
“What accounts for the prevalence of bad social science?” I like Dale’s question. I’m not a social scientist, so I don’t want to speculate, but I’m old enough to have seen major advances in biological and physical sciences, but don’t know of anything comparable in social science.
I’ve been thinking for some time that the concept of benchmarking needs to be applied much more broadly. There is a need for a sense of what the relevant comparator is for claims of something being large, meaningful, sufficient etc. This of course is explicitly quantitative in the sense that Andrew is using.
Benchmarking is ubiquitous in management planning and evaluation, and for good reason. It anchors judgment and works against the tendency to think whatever you produce is OK because that’s what you produced. (Or not OK if you have an incentive to disparage it.)
I’ve been trying to promote the practice of benchmarking in climate policy: having a quantitative standard against which to evaluate real world actions and results, but there’s been a lot of resistance, I suspect because there’s no history or culture on which to draw for something like this.
So yes, put forward a quantitative benchmark for the results of proposed studies. Given all we know about the matter, what standard should we use to identify a plausible and meaningful finding? (More precisely, two benchmarks, one for minimum meaningfulness, another for maximum plausibility.)
True, new research can change old benchmarks (Bayes), but the change needs to be explicit and justified by going back to the process that generated the old benchmark and identifying what needs to be altered.
“One way to go, which I like, is to advance from a directional hypothesis to a quantitative hypothesis.”
When you look at the himmicane study, I have difficulty seeing how this applies. The authors are not so much making a case that giving hurricanes female names leads to loss of life. They are making a nonquantitative case that our thinking about everything tends to be gendered if the name is gendered.
A relative of mine who teaches college psychology asks his students at the beginning of each semester to come up with a social science question that interests them. One student asked “does viewing porn make men more likely to commit rape.” I have retained this because it makes a great example of a question that can be quantified, but who cares? If porn makes a single man more likely to commit rape, it is already a big problem.
My point is that going to a fully quantitative approach would not be an incremental improvement in the social sciences, it would be a full paradigm shift.
> If porn makes a single man more likely to commit rape, it is already a big problem.
This feels like a lack of appreciation of how distributions work. What if porn makes exactly 1 man more likely to commit rape and 1000 men less likely, and the overall rate goes down substantially?
“What if porn makes exactly 1 man more likely to commit rape and 1000 men less likely, and the overall rate goes down substantially?”
That would make men less likely to commit rape, a negative finding. What has that got to do with the point I made about psychologists in many cases not being interested in effect size? I was pretty clearly talking about a very small positive effect being sufficient to make a value judgement.
The responses seem intentionally obtuse.
Matt:
If the treatment effect varies, with it being positive for some people and negative for others, and positive in some situations and negative in others, then the aggregate or net treatment effect depends on future conditions, and so it may not make sense to talk of it having a small positive effect.
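The arithmetic behind this point can be sketched in a few lines of Python. The numbers are purely hypothetical (the risk changes of plus or minus 0.01 and the group sizes are made up for illustration, as is the function name net_effect); the sketch only shows that the average effect depends on the mix of people being affected.

```python
def net_effect(groups):
    """Average change in individual risk across a population.

    groups: list of (number_of_people, change_in_individual_risk) pairs.
    """
    total_people = sum(n for n, _ in groups)
    expected_change = sum(n * delta for n, delta in groups)
    return expected_change / total_people


# Hypothetical mix from the exchange above: 1 person at higher risk, 1000 at lower risk.
mostly_lower = [(1, +0.01), (1000, -0.01)]
# Same individual-level effects, but a different (equally hypothetical) mix of people.
mostly_higher = [(1000, +0.01), (1, -0.01)]

print(net_effect(mostly_lower))   # negative: average risk goes down
print(net_effect(mostly_higher))  # positive: average risk goes up
```

With identical individual-level effects, the sign of the aggregate flips when the composition of the population changes, which is why a single "small positive effect" is not a well-defined summary here.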
What you said was
” If porn makes a single man more likely to commit rape, it is already a big problem.”
That statement is logically equivalent to “unless porn makes every single man less likely to commit rape, it is already a big problem” (technically less likely or only equally as likely)
Whether you meant it that way or not I’ve certainly seen people have these types of opinions… If something isn’t good for every single person then it is bad, etc.
Online discourse is notoriously difficult to read between lines so I took your statement at face value.
Matt:
What Daniel said. Interventions have different effects on different people and in different situations. The framing that you gave is an example of what we call the fallacy of the one-sided bet.
Andrew: Do you think that in most of these cases there’s enough information with which to create a “quantitative hypothesis?” And, related, that there aren’t so many degrees of freedom possible for such a hypothesis it would be as meaningless as the current situation?
To me it seems like the better path, aside from the researchers not wasting everyone’s time with these projects, would be to stop pretending that there are meaningful quantifications to be made and instead carefully and qualitatively make observations about behavior.
Raghu:
For the hurricanes example, I think a reasonable starting point would be what I labeled as “an abstract representing mediocre science.”
There’s a lot worse than mediocre! The hurricanes researchers had a mediocre idea and pumped it up to PPNAS-worthy junk science.
To do something useful I think they’d need to go quantitative. I’m not quite sure how to do this in the hurricanes case—then again, that’s not a problem that really interests me. If someone is interested in the topic, then they should be able to study it more carefully, maybe they could start by interviewing people who live in hurricane zones, get a sense of what options people have, etc.
Raghu: I agree completely
The problem is that there is an idea about in the social sciences that the world is full of massive unobservable-to-human-eyes effects that can be magically revealed with simple statistical measurements. Just for example the “walk slower when you say old” kind of experiment or belief: what are the chances that such a phenomenon is actually real and important, but not noticeable to human eyes – e.g., can only be revealed through statistical analysis? Pretty much zero, as with all these experiments. But it’s much more difficult to do actual research.
chipmunk –
You really should go back and address your false claim about what Andrew argued.
Demonstrate you have at least one shred of class. People might be tempted to generalize negatively about someone with your politics if you don’t.
“People might be tempted to generalize negatively about someone with your politics if you don’t.”
Yes, Joshua, I’m aware that among some groups anyone with what they perceive as center or right politics is considered suspicious until they are proven safe by abandoning their supposed wrong political views.
I frequently don’t bother addressing others’ comments for a reason. There seems to be, in my opinion, a foolish and naive belief among some folks that a few comments on a blog will work everything out. On the contrary, I recognize that these issues are complex and that there are many many potential arguments that aren’t addressed in such a conversation and often these problems will take years if not decades to work themselves out. If you think Andrew shot down my argument, so be it, but I doubt Andrew’s opinion alone will end anything.
Maybe you should have some class and let Andrew’s argument stand on its own.
Chipmunk:
I didn’t “shoot down your argument.” You didn’t make an argument! You wrote a false statement (that I “argued that, because by polls 51% of Oklahoma supported abortion, that the Oklahoma legislature should be compelled to allow legal abortion” and that I “implied that Oklahoma politicians were acting dishonestly or in dereliction of duty in blocking abortion rights”), and this annoyed me enough that I responded with a comment pointing out that I did not say or imply these things.
When someone makes an error and someone else points it out, that’s not “shooting down an argument”; it’s correcting an error.
I agree with your general point that blog comments won’t generally work things out. Nonetheless, when I’m sitting at the computer and I don’t feel like doing real work, I will sometimes correct flat-out falsehoods.
The frustrating thing is that then the objectionable comments can get more attention than the thoughtful comments.
There may be a reaction to previous failures. If failures are exposed by attempted replications, there will be a tendency to require independent replication for confirmation. If statistical methods are criticised, or if contentious issues accumulate a group of experts on either side opposing their measured and expert judgements (which unfortunately cannot be explained to the uninformed laity) one way out would be to refer back to precedent for support; if study A achieved consensus, perhaps if its methods are applied to contentious issue B we can agree on a result. It appears that if we can resolve issues only when there is a unanimity of opinion in the peer-reviewed literature, few issues will be resolved.
I don’t think so, instead people will try to avoid “dangerous” replications. Eg, they may do “safe” replications where some detail (age, sex) is different from the previous studies, but attempting to repeat the exact same method will be deemed “lacking novelty”.
That is hardly the only method. Another is claiming there are many “informal” replications that are successful all the time, already. So there is no need for someone to publish about direct replications.
Of course these types of replications are cherrypicked, corners are cut on controls, and so on.
In general, I would say there is an appalling lack of clear information content to most tested hypotheses in the social sciences. So, any move towards getting people to more clearly define what they are actually aiming to test in the context of a clearly defined (assumed) dgp would be a major improvement. Not just with respect to defining a plausible range of effect sizes, but also defining what kind of effect exactly is predicted (e.g. causal or non-causal? For whom, what, when or where – i.e. averaged over what? Consistent with which functional forms?), and what observed variation (at what level of analysis/aggregation) would meaningfully be required to empirically falsify that hypothesis.
However, I’m pessimistic about the effectiveness of any specific methodological solution to what I strongly believe is fundamentally a problem of incentives and culture. Careers are made on the basis of being a good storyteller, signaling just enough methods-skills (and of course, the kind that are the latest fad in your field right now) to seem competent while retaining all the flexibility and plausible deniability to turn your lead into gold.
I like the idea, but to me it seems destined to be yet another principally methodologically sound suggestion that will achieve little in really combating junk science. We have seen this play out before with e.g. alternatives to p-values and the so-called causal credibility ‘revolution’. As long as researchers get rewarded for not thinking too critically about their own research, and scientific communities remain unwilling and/or unable to provide appropriate correctives, they will continue to performatively go through the motions. The problem isn’t the methods, the problem is us.
I absolutely agree – the problem is us. And we can solve most of these problems. We should evaluate the questions our colleagues ask in their research. We should evaluate the questions they ask their students in their teaching. We should evaluate the content of their courses and class meetings. We rarely do any of these things (caveat: I know such things happen, and I suspect they happen more frequently at “top” schools, but the vast majority of higher education practices don’t take such things seriously). In fact, at many schools, having peer reviewed publications swamps any evaluation of the content of that research. We readily substitute peer review for our own evaluation.
Why is this the case? I’d cite a few factors: lack of confidence and courage to engage with our colleagues regarding content, fear of alienating colleagues who hold (perceived or real) power over ourselves, educational training that over-emphasizes narrow focus on niche questions, and inadequate appreciation of past research (I once had a well-published colleague who advised me that the formula for publication is “don’t read too much.”)
I’m sure I have overstated the case – most of my career has not been at elite institutions, so practices may well differ there. However, we’ve had too many cases where poor research transcends such rankings.
This is revealing my ignorance, but what exactly is a quantitative hypothesis, as opposed to a directional one? Is it hypothesizing a certain effect size (e.g., “male-named hurricanes will lead to .5 of an SD increase in fatalities compared to female” vs “male-named hurricanes will lead to a greater increase in fatalities compared to female”)? Then power appropriately and so on?
That’s an example yes. A directional hypothesis is just “a is bigger/smaller than b” whereas a quantitative hypothesis could be something like “a/b is at least 1.2” or could be something more complex like “increasing a by 1 typical unit would result in the standard deviation of b outcome increasing to at least 1.2x its original size” or “across all b at least 30% of b outcomes will increase by at least 10%” or something else.
To whatever anyone else would say, I would add that you should make your measure dimensionless and typically use a scale in which 1 unit of change is something “of practical interest” and substantially less than 1 is “of no practical interest” and much bigger than 1 (like 8 or 10 or 23) is “of extreme interest”
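To make the distinction concrete, here is a small Python sketch that checks a directional hypothesis and the kinds of quantitative hypotheses described above against the same data. The paired measurements, the 1.2 ratio threshold, the 30%/10% thresholds, and the "unit of practical interest" of 2.0 are all made up for illustration.

```python
from statistics import fmean

# Hypothetical paired outcomes for ten units (not real data).
before = [10.0, 12.0, 8.0, 15.0, 11.0, 9.0, 14.0, 10.0, 13.0, 12.0]
after = [12.5, 12.1, 8.2, 18.0, 11.0, 9.1, 17.5, 10.2, 13.1, 14.0]

mean_before = fmean(before)
mean_after = fmean(after)

# Directional hypothesis: "after is bigger than before".
directional_holds = mean_after > mean_before

# Quantitative hypothesis 1: "the ratio of means is at least 1.2".
ratio = mean_after / mean_before
ratio_holds = ratio >= 1.2

# Quantitative hypothesis 2: "at least 30% of outcomes increase by at least 10%".
share_up_10pct = sum(a >= 1.1 * b for a, b in zip(after, before)) / len(before)
share_holds = share_up_10pct >= 0.30

# Rescaled effect: 1.0 means "of practical interest" on an assumed scale
# where a change of 2.0 raw units is the smallest difference anyone would care about.
unit_of_practical_interest = 2.0
scaled_effect = (mean_after - mean_before) / unit_of_practical_interest

print(directional_holds, ratio_holds, share_holds)
print(round(ratio, 3), share_up_10pct, round(scaled_effect, 3))
```

In this made-up example the directional hypothesis holds while the ratio hypothesis does not, which is the sense in which a quantitative hypothesis says more and is easier to contradict.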
How is anyone supposed to come up with a quantitative prediction for himmicanes based on data like that though?
The underlying issue is that they are comparing snapshots of different groups. Instead start with something like data on the dynamics of damage during hurricanes.
For example, start with a single hurricane and bin the amount of damage by hours since landfall.
I imagine we would see lots of low-hanging fruit damage at the beginning. Then maybe it increases with windspeed as the storm progresses. Next, there may be some kind of catastrophic failure that really causes a lot of damage (eg, levee is breached -> flooding). Afterwards, there may be looting, etc.
At what point is the “himmicane” effect supposed to come into play? Perhaps the theory says it is the early “low-hanging fruit” damage. Eg, people not boarding up windows and so on. Well, now we can have an upper bound on the percent of damage due to hurricane gender.
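The binning idea can be sketched as follows, in Python. The damage records are invented for illustration, as are the 6-hour bin width, the choice of which bins count as "early", and the function names; the sketch also takes as given the assumption stated above that any name effect would act only through early, preparedness-related damage.

```python
from collections import defaultdict

# Hypothetical records for a single hurricane: (hours since landfall, damage in arbitrary units).
damage_events = [(1, 5.0), (3, 4.0), (7, 6.0), (14, 10.0), (20, 60.0), (30, 10.0), (40, 5.0)]


def bin_damage(events, bin_hours=6):
    """Total damage per time bin, keyed by bin index (0 = first bin_hours after landfall)."""
    totals = defaultdict(float)
    for hours, damage in events:
        totals[int(hours // bin_hours)] += damage
    return dict(sorted(totals.items()))


def early_damage_share(binned, n_early_bins=1):
    """Share of total damage in the first n_early_bins bins.

    Under the assumption that a name effect can only act on early damage,
    this share is an upper bound on the fraction of damage it could explain.
    """
    total = sum(binned.values())
    early = sum(d for index, d in binned.items() if index < n_early_bins)
    return early / total


binned = bin_damage(damage_events)
upper_bound = early_damage_share(binned, n_early_bins=1)
print(binned)
print(upper_bound)
```

In this invented example most of the damage comes from a single later event (the analogue of a levee breach), so the early share, and hence the bound on any preparedness-driven effect, is small.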
But if you start with comparing group A to group B, you will essentially never figure out anything of value. Looking at differences is the exact opposite of what we want to do. We first need some “laws” that model the dynamics of hurricane damage.
Indeed, years ago when this himmicanes thing first came out I gave an example model, written in terms of a dimensional analysis of damages as a function of wind speed, estimates of density of population, etc., implemented in Stan, in which I estimated hurricane damage in dollars as a function of hurricane category (for which wind speed was available) and some other stuff. Unfortunately my blog is problematic these days and I don’t have the motivation to unbreak PHP etc to get it back up and running.
I don’t remember it showing any “himmicanes” effect.