The problem with (some) science: How much is it a problem with process and how much is it a problem with substance?

I just read an article in the London Review of Books about the replication crisis in psychology. It can be interesting to read things written for general audiences on topics that I do research on. In this case I’d say that the author, John Whitfield, did a good job. In his review, he’s summarizing a book by psychology researcher Stuart Ritchie, so I guess Ritchie did a good job too. There’s a lot of discussion in the review of the processes of science, journal reviewing, impact factors, non-publication of negative findings, questionable research practices, incentives, preprints, and the role of the news media.

Whitfield mangles the definition of the p-value (“When an experimental result is described as statistically significant, that usually means a statistical test has shown there is less than a 5 per cent chance that the difference between that result and, for example, the corresponding result in a control experiment is attributable to random variation”), but the definition of the p-value is notoriously easy to mangle—a wrong definition of p-values was once inserted by an editor into one of my own articles, and unfortunately I did not notice it before it appeared in print—so I won’t be too hard on Whitfield, or the LRB, for that.

With the exception of that p-value thing, which is really no big deal, I disagree with nothing in Whitfield’s article, and I think he makes many good points.

There’s just one thing that concerns me, which is something that has concerned me for awhile in discussions of the replication crisis and science reform, and that is the focus on the process of science rather than the substance of science.

At some level, sure, science is the “process of science”; it’s “the scientific method.” But on another level, no. Science isn’t just how we make discoveries; it’s also the discoveries we make.

Scientific studies on ESP and ovulation and voting and beauty and sex ratio and himmicanes and all the rest are doomed from the start by the kangaroo problem (see also here): some combination of poor theory, small effects, noisy data, and sloppy measurement.

The procedures of science do matter, and it is a fair indictment of the scholarly publishing system and the news media that each of the examples mentioned in the above paragraph was published and widely publicized—but ultimately I think these are problems with theory and measurement, and I remain concerned that a focus on procedures, important as they are, can lead to a mistaken view of the problem. To put it another way: you could remove all the fraudsters, trimmers, and just plain careerists; get rid of the incentives to cheat; lock up all the p-values and throw away the key; and preregister till the cows come home, and it wouldn’t turn bad science into good science. At best, all these steps would just make it more apparent that people are screwing around with noise, and then maybe they’d reassess their theories and measurement. But they’d have to improve their science, not just their procedures of inference, publication, etc.

This has come up before:

(People are missing the point on Wansink, so) what’s the lesson we should be drawing from this story?

Honesty and transparency are not enough

The real problem of that nudge meta-analysis is not that it includes 12 papers by noted fraudsters; it’s the GIGO of it all

Etc etc.

43 thoughts on “The problem with (some) science: How much is it a problem with process and how much is it a problem with substance?”

  1. I’m not sure what to think about your distinction between the process and substance of science; it seems actually to be a distinction between the *tools* and substance of science. If I understand it, your argument seems to be that if we got rid of bad methods, people who are simply looking at spurious patterns in noise would still be looking at spurious patterns in noise, and so things would still be a mess. That’s true, but understanding what is knowable or not, having a sense of how noise and measurement work, the ability to design meaningful experiments, etc., *are,* independent of tools, the process of science — the tools are (supposedly) a way to make the process more robust. To me “substance” suggests the things to which we apply this process, and what emerges from this.

    I think a better framing would be “How much is it a problem with tools and how much is it a problem with process?”

    • Raghu:

      No, I’m not saying, “if we got rid of bad methods, people who are simply looking at spurious patterns in noise would still be looking at spurious patterns in noise.”

      What I’m saying is, “if people stopped looking at spurious patterns in noise, but they’re still collecting data that are noise, they wouldn’t be making scientific discoveries.”

      • Thanks; I see your point now. I don’t agree, but perhaps that’s just because I’m currently reviewing a paper that *doesn’t* use inept statistics — it doesn’t even bother with basic things like error bars — but nonetheless makes unwarranted and wrong claims of scientific discovery…
        Actually, I’m procrastinating by writing this blog comment, since it’s no fun to now spend half an hour nicely describing how the authors are misguided.

  2. >At best, all these steps would just make it more apparent that people are screwing around with noise, and then maybe they’d reassess their theories and measurement. But they’d have to improve their science, not just their procedures of inference, publication, etc.

    You don’t think this would lead to improvements in research design? I guess we need another step where publicly exposing shoddy research has consequences for promotion and tenure decisions, which then creates incentives for better science.

    • It looks to me as if the problem is not so much shoddy science as inanely stupid theories.

      That latter bit is easy to see and say after the fact (ESP, power pose, himmicanes), but the more interesting (and serious) problem is why these people aren’t coming up with better theories. And why isn’t this stuff shot down in informal discussions amongst researchers? Really, this stuff is painfully embarrassing long before the shoddy science raises its (admittedly ugly) head.

      (I’m a bad one to complain, though: I dropped out of AI after passing the quals because, were I being honest, I wasn’t coming up with anything to say for myself. But the culture at the time was very much that dumb theories got shot down pretty brutally.)

      • David:

        It’s complicated. Bad theories are an important part of bad science. Another problem is bad data or a weak link between data and theories. For example, the article claiming that beautiful parents were more likely to have girl babies was ridiculous—and it’s a continuing embarrassment for the Freakonomics team that they promoted this claim and then never backed down when the problems were revealed—but the underlying theory, based on an argument in evolutionary biology, was speculative rather than insanely stupid. The problem was that their measurement was hopelessly noisy in comparison to any realistic effect they might see. The point of my above post is that doing the study more transparently would not result in any scientific revelation; at best it would just produce a “non-statistically significant” result, which would make the researcher reassess his model.

        • If you were to allow me to rewrite your argument for you, I think it would go something like this:

          “Bad theories don’t matter as long as the methodology is done correctly, since then the researchers would just reassess their models.”

          But given the plethora of bad models out there, this just ends up being (a very expensive version of) the British Museum Algorithm.

          (In a few cases, this is necessary. We got a couple of good Covid vaccines because everyone tried their theory. But even there, all the “bad” theories (killed vaccines, attenuated live vaccines, etc.) were actually really good theories that just happened to not work.)

          But what we are seeing in the various corners of psychology is that people are so invested in their bad theories that they can’t do the reassess part. (I had the option to punt when I wasn’t coming up with a good theory, but that option to punt depended on me not being hell-bent on becoming an academic researcher. You have to at least appear to be driven and committed to play in the academic fast lane, and this seems to result in people being overly committed to their bad theories.)

          (I suspect that the evolutionary biology basis of the disaster you mentioned was probably actually just really bad evolutionary biology. Do you know Holly Dunsworth (prof. at Brown)? She could sort this out if we could persuade her to look at it. She’s really good. (But I’m not an evolutionary-explanation fan, since that only tells you how animals built wings that work, but doesn’t give you Bernoulli. And it doesn’t tell you _how_ people think, only that thinking must have some advantage over not thinking.))

          Given that I’m a failed AI type and not a failed statistician, I’m sticking to my guns that it’s the failure to come up with good theories that’s the core of the problem.

  3. I suppose I agree that you could make all those reforms and “it wouldn’t turn bad science into good science,” but I completely disagree that anyone other than science teachers should have a goal of turning bad science into good science. Everyone else – reformers, journal editors, peers, and other readers – can call it a day when bad science becomes ignored science. We’re pretty deep into the internet age at this point, and time-wasting garbage is a constant feature of the landscape. Cleaning it up just seems much less feasible than managing where everyone puts their focus and attention.

    Don’t lose faith in the usual reform ideas, like getting rid of the incentives to cheat and preregistering! Those could be highly effective at keeping the silly stuff (and the attention-thirsty charlatans that promote it) off of everyone’s radar. Those quacks would still be free to pursue that stuff, but they’d have to do it *as quacks* on Reddit threads or whatever.

    The prize is thousands of click-baited readers no longer having to independently waste time sorting through the latest “Fancy University Study finds Obviously Stupid Nonsense may be Real” claim (and also, the broader public and policy makers hopefully spending a little less time in the thrall of magical science). (Oh, and it might also help with the galling incidence of quacks landing prestigious positions at Fancy U.)

  4. Andrew, your use of awhile popped out for me:

    “There’s just one thing that concerns me, which is something that has concerned me for awhile ”

    It looks disturbingly wrong. I would have written “a while” here.

  5. Someone called Liam Shaw from Oxford posted a comment on the London Review website, correcting the definition of the p-value:

    “It means that under the assumption the results are entirely due to random variation, one would obtain a difference at least as extreme as the observed difference less than 5 per cent of the time.”

    I would say that this definition is still wrong! :)

    “The results” is too vague, and “random variation” is just the wrong thing to focus on here. A p-value is the probability of observing a statistic (like the t-value) as extreme as the one we observed (or something more extreme), under the assumption that the null hypothesis is true. I don’t know why people keep bringing up chance and random variation and phrases like that in the definition of the p-value. I keep seeing phrases like “the probability that the results are due to chance”. This is really misleading and leads to all kinds of chaotic thinking. It’s a linguistic illusion: if one says that something is unlikely to be due to chance, then it must not be due to chance, i.e., it must be real. This kind of language leads to confusion.

    I bet someone will correct my definition next :)

    PS In my own field, I don’t think I have read a single paper in the last 20+ years that I have been active that uses the p-value to interpret the conclusions correctly. But as my son keeps pointing out to me, my field is completely irrelevant for society so it doesn’t matter anyway. A famous psycholinguist also pointed this out to me once when I was complaining about incorrect inferences: why should one care if the conclusions are wrong if the conclusions don’t harm anyone?
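
    To make the definition above concrete, here is a minimal simulation sketch, assuming a simple two-group comparison of means with a t-like statistic (all numbers here are made up for illustration):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    def t_stat(a, b):
        # Welch-style t statistic for the difference in means
        return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

    # "Observed" data (purely illustrative): a treatment and a control group
    treat = rng.normal(0.3, 1.0, size=30)
    control = rng.normal(0.0, 1.0, size=30)
    observed = t_stat(treat, control)

    # Distribution of the statistic under the null hypothesis
    # (here: both groups drawn from the same normal distribution)
    sims = np.array([t_stat(rng.normal(0, 1, 30), rng.normal(0, 1, 30))
                     for _ in range(10_000)])

    # Two-sided p-value: the probability, under the null, of a statistic
    # at least as extreme as the one actually observed
    p_value = np.mean(np.abs(sims) >= abs(observed))
    print(p_value)
    ```

    The p-value is just the tail probability under the assumed null model; nothing in the calculation refers to “the probability that the results are due to chance.”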

    • It is the probability of seeing a deviation from the prediction at least as extreme as was observed.

      Better to phrase it in terms of “prediction” and “deviation”, because there is no reason the p-value needs to be used to test default models of zero difference.

      Also, it brings up the question of “why are you testing a model that predicts zero difference when your research hypothesis is that the difference is non-zero?”

      • As I understand it, the null hypothesis does not necessarily imply zero difference; it’s not called the null hypothesis because the difference has to be 0. The null in null hypothesis refers to the baseline, starting assumption. One could take the null to be $\mu = \mu_0$, where $\mu$ is the difference (or whatever the parameter represents) and $\mu_0$ is any value.

        LOL regarding your second point. Why indeed. It’s completely bass-ackwards. This whole thing answers *a* question, but it answers the wrong question. It is mind-blowing that most of cognitive psych and (psycho)linguistics are built on asking absolutely the wrong question.
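
        To illustrate the point about non-zero nulls, here is a minimal sketch using scipy’s one-sample t-test against an arbitrary baseline $\mu_0 = 5$ (the data are simulated just for illustration):

        ```python
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(2)

        # Illustrative measurements whose mean we compare against a
        # non-zero baseline mu_0 = 5 rather than against 0
        x = rng.normal(5.4, 1.0, size=40)

        # H0: mu = 5 (the null need not be "no difference from zero")
        t, p = stats.ttest_1samp(x, popmean=5.0)
        print(t, p)
        ```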

        • Don’t you think it’s an overstatement to say that it’s “absolutely the wrong question” in a general manner, without taking into account the particularities of any case? For sure it won’t answer the question “is my research hypothesis true?” if that’s anything specific and different from the H0, but if a lot of people think that “in fact nothing is going on” and that’s appropriately formalised in the null hypothesis, it is of legitimate interest whether the data provide evidence against that, isn’t it?

        • @ Shravan

          Null originally referred to “hypothesis to be nullified”, then EF Lindquist changed it to mean “zero”* in his 1940s stats 101 book that merged Fisher (significance testing) and Neyman-Pearson (hypothesis testing) into the NHST hybrid.

          * To be fair to Lindquist though, he did put a footnote saying sometimes you can test a non-zero hypothesis.

          @ Christian

          There is always “something going on” because everything is correlated to everything else. So we have learned nothing from that test.

        • @Anoneuoid: I have seen many data compatible with all kinds of null hypotheses, so to know that certain data are not is informative. It’s one thing to say “all models are wrong” but quite another to have the data indicate strongly in which way a model is wrong.

        • RE: I have seen many data compatible with all kinds of null hypotheses, so to know that certain data are not is informative. It’s one thing to say “all models are wrong” but quite another to have the data indicate strongly in which way a model is wrong.

          Is the purpose of the analysis model building? Or to examine the extent to which the data support a substantive research question?

          If there is a legitimate prior reason to believe a certain diet will produce 5kg or more of weight loss, surely our first priority would be to ask if the data from a trial support the claim of weight loss being >=5kg.

          Why would we instead prefer an analysis of whether the data support a claim that weight loss equaled zero?

        • @name withheld by request: I was arguing against a general statement against significance tests by Shravan. My aim was not to say that tests are fine in whatever situation in which somebody would use them. In fact I agree they are overused and often misused.

        • > @Anoneuoid: I have seen many data compatible with all kinds of null hypotheses, so to know that certain data are not is informative. It’s one thing to say “all models are wrong” but quite another to have the data indicate strongly in which way a model is wrong.

          Just use descriptive statistics, and then the theory/explanation/hypothesis you are interested in needs to be consistent with that. I still don’t see how testing these default null hypotheses is supposed to add anything.

        • Compare your prediction to the data in some way. It is not ideal, but use a significance test if you want. Or just eyeball it on a chart.

          The key is to test your hypothesis, not a default strawman hypothesis. This one simple trick turns bizarro science into science. So many bad incentives start acting in reverse and become good incentives.

          Once we get back to that point the next step can be taken, which is comparing multiple hypotheses at once using Bayes rule as described in my other posts below.

        • In my experience people rarely view p-values as literal hypothesis testing.

          Instead, the p-value is commonly used as a ‘discovery certifier,’ which, in conjunction with the sign of the estimate, supports or contradicts the substantive hypothesis.

          This perspective is very natural for people working in any field involving the concept of “signal detection”, such as neuroscience.

        • If you suddenly 10-100x the sample sizes of those neuroscience studies, they would start getting too many “discoveries” and end up decreasing the significance threshold.

          Eg, in particle physics they have tons of data so they use 3e-7. Likewise for GWAS, where it’s 5e-8.

        • > Compare your prediction to the data in some way. It is not ideal, but use a significance test if you want.

          Can you give an example of how that could be done?

      • > “why are you testing a model that predicts zero difference when your research hypothesis is that the difference is non-zero?”

        Wouldn’t you agree that if it’s true that the difference is non-zero then it’s not true that the difference is zero?

        • The difference is non-zero.* This vague fact is compatible with many explanations so is not informative. You need to test a prediction that is *otherwise surprising*, ie the other terms in the denominator of Bayes’ rule are small. Where we consider hypotheses 0 to n:

          p(H_0|D) = p(H_0) * p(D|H_0) / [ p(H_0) * p(D|H_0) + … + p(H_n) * p(D|H_n) ]

          You can see one way to accomplish this is to make a very precise prediction, ie the likelihood p(D|H_i) is very dense over a small range of values that end up corresponding to the data.

          Meehl has a good paper on this:
          https://meehl.umn.edu/sites/meehl.umn.edu/files/files/147appraisingamending.pdf

          * Outside of examples where exactly zero is predicted by theory, eg mass of electron in different cities

        • That should say “the difference in the mass of the electron in different cities”. I mean it is predicted to be the same everywhere so there should be exactly zero difference.

          Also, let’s look at what happens when we only compare vague predictions like “there should be a non-zero difference”. Start again here:

          p(H_0|D) = p(H_0) * p(D|H_0) / [ p(H_0) * p(D|H_0) + … + p(H_n) * p(D|H_n) ]

          It is not difficult to come up with reasons for a non-zero difference. So you end up with multiple hypotheses with equivalent likelihoods. So those all cancel out:

          p(H_0|D) = p(H_0) / [ p(H_0) + … + p(H_n) ]

          Now all we are doing is comparing priors. The data is irrelevant. And if the priors are all the same you end up with:

          p(H_0|D) = 1 / n

          The point is you need to derive a prediction from your theory that distinguishes it from the others in order to draw an informative conclusion from data. The case of predicting a positive/negative effect is only barely better. There will still be too many explanations available that are equally consistent with the observations.
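
          To make that cancellation concrete, here is a small numerical sketch with made-up numbers: three hypotheses that only predict “some non-zero difference” are given wide likelihoods, one hypothesis makes a sharp prediction near the observed value, and Bayes’ rule is applied exactly as in the formula above.

          ```python
          import numpy as np
          from scipy import stats

          observed = 2.1  # illustrative observed difference

          # Three hypotheses that only say "the difference is non-zero":
          # modeled as wide normal likelihoods around vague guesses
          vague = [
              stats.norm(loc=1.0, scale=5.0).pdf(observed),
              stats.norm(loc=-1.0, scale=5.0).pdf(observed),
              stats.norm(loc=2.0, scale=5.0).pdf(observed),
          ]

          # One hypothesis making a precise prediction near the observed value
          precise = stats.norm(loc=2.0, scale=0.2).pdf(observed)

          priors = np.full(4, 0.25)
          likes = np.array(vague + [precise])

          posterior = priors * likes / np.sum(priors * likes)
          print(posterior)
          ```

          With equal priors, the vague hypotheses end up with roughly equal (and small) posterior shares, while the precise prediction takes most of the posterior mass.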

  6. What if it were rephrased to say “… under the assumption that the data was generated by a specific random number generator”? Didn’t someone here suggest that?

    My own feeling is that the canonical definition (“assuming the null hypothesis is true”) is a bit too underspecified for a general audience (as it doesn’t specify the assumptions under which the data was generated). But if being “generated by a specific random number generator” is a correct characterization, I think it rather naturally leads to, “ok, so it wasn’t generated by a specific random number generator. So what?”

  7. I’m trying to imagine the outcome paper abstract that says, “After four years and USD$1.3 million spent on a clinical trial involving 700 participants, our analysis has concluded that the study data were very unlikely to have been produced by a random number generator”.

  8. Responding to the above post here to renew the nesting:

    > Compare your prediction to the data in some way. It is not ideal, but use a significance test if you want.

    > Can you give an example of how that could be done?

    This seems really basic to me, so I may be confused. Are you asking for an example of deriving a prediction from your theory and comparing it to the observations?

    • This is the first that comes to mind: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2007940/

      In that case they included an additional unnecessary assumption (probably for computational efficiency back then) though:

      > This result will be valid for large values of t (of the order of a human lifetime) provided that p_1*t, p_2*t, …, p_r*t are all sufficiently small (as could be assumed in an application of this theory to human cancer).

      You can see even in that paper that the theory fails for many cancers at older ages, because the incidence starts decreasing. This is actually as expected according to the theory though (minus the low probability of error assumption).

      Their theory is essentially the same as flipping n coins with probability p of turning up heads, and when a coin turns up heads you stop flipping that coin. What is the distribution you get for number of flips until all coins are heads? Ie, number of flips on the x-axis and frequency on the y-axis.
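
      Here is a quick simulation sketch of that coin-flip analogy (the values of n and p are arbitrary):

      ```python
      import numpy as np
      import matplotlib.pyplot as plt

      rng = np.random.default_rng(0)

      def rounds_until_all_heads(n=6, p=0.01):
          # Each coin's waiting time until its first head is geometric;
          # the process ends when the slowest coin comes up heads.
          return rng.geometric(p, size=n).max()

      draws = [rounds_until_all_heads() for _ in range(50_000)]

      plt.hist(draws, bins=100)
      plt.xlabel("number of flips")
      plt.ylabel("frequency")
      plt.show()
      ```

      For small p the histogram rises steeply at first and then falls off at large flip counts.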

      • Interesting – but I don’t see any significance test there.

        The antecedent for the pronoun “that” in the question was “use a significance test”.

        We agree that statistical analysis is much simpler – and sounder – when you can do without probabilities.

        • Ok, I am still confused as to why you are asking such a question, but I agree that Armitage and Doll did not do any kind of statistical test (just eyeballed, which is just as good in most cases).

          To do a significance test you simply assume there is some error/uncertainty in the observations that follows a given distribution, ie the value reported is not exactly correct. Then you see how far in the tail your theoretical prediction lies.

          Here is another example where they actually do so (see table 2 and figure 3):
          https://arxiv.org/abs/1204.2507
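
          As a minimal sketch of that kind of test (with made-up numbers, not the ones from the linked paper): assume the reported value has roughly normal error with a known standard error, and ask how far out in that tail the theoretical point prediction sits.

          ```python
          from scipy import stats

          # Illustrative numbers: a reported measurement with its standard error,
          # and a point prediction derived from the theory being tested
          observed = 10.4
          standard_error = 0.3
          predicted = 10.9

          # How far in the tail of the assumed error distribution does the
          # theoretical prediction lie? (two-sided)
          z = (predicted - observed) / standard_error
          p = 2 * stats.norm.sf(abs(z))
          print(z, p)
          ```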

        • Thanks, I just wanted to understand better in what kind of setting you would find a significance test acceptable.

          Anyway, there are contexts where something like “we find no statistically significant difference between the observed data and the predictions of our model with a seventh quark” may not be quite as informative of the merits of the theory as “we find no statistically significant difference between the observed data and the predictions of the standard model”.

        • > Anyway, there are contexts where something like “we find no statistically significant difference between the observed data and the predictions of our model with a seventh quark” may not be quite as informative of the merits of the theory as “we find no statistically significant difference between the observed data and the predictions of the standard model”.

          Of course, which is why you use Bayes’ rule. But we live in a world where many “scientists” have never tested their own theory a single time in their careers, or even read a paper where someone actually tests a real theory. Getting people back to doing that is the first step.

          Or perhaps second step after going back to ensuring the observations are repeatable and reliable. It is much more difficult to come up with a quantitative theory when ~80% of the observations cannot be replicated. This is a totally unnecessary obstacle.

        • I agree with some of the things you say regarding some forms of “science”. But in an applied setting not having enough data or a precise prediction can be an unavoidable obstacle and sometimes testing the no-difference hypothesis can be a useful tool if used and interpreted properly – which of course is not always the case.

  9. > sometimes testing the no-difference hypothesis can be a useful tool if used and interpreted properly

    Yes, if that is what was predicted by the theory; otherwise no. Unlike most fallacies, the strawman argument is not a heuristic.

    Interpreting it properly is impossible, or at least would require adopting a different logical system where strawman arguments are valid.

    • The “theory” that a vaccine works against some infection doesn’t necessarily provide a precise prediction of how much. Testing the no-difference “theory” is one way to go. Arguably it’s not completely useless – if that means adopting a different logical system where strawman arguments are valid so be it. Your alternative seems to be to not try to tell whether the vaccine works – or the virus kills – because the theory is not good enough.

  10. > The “theory” that a vaccine works against some infection doesn’t necessarily provide a precise prediction of how much

    This is not a theory; in a theory/explanation/model/hypothesis you derive a prediction from a set of assumptions/premises.

    Saying elephants weigh more than humans is not a theory; that is an observation. A theory would say: assume elephants have dimensions of approximately 10 x 20 x 5 ft while humans are about 6 x 2 x 1 ft, and that both are about equally dense, so elephants should weigh about 83 times as much as humans.

    I am sure if you think about it you can come up with a reason *why* you think the vaccine works against some infection.
