What’s published in the journal isn’t what the researchers actually did.

David Allison points us to these two letters:

Alternating Assignment was Incorrectly Labeled as Randomization, by Bridget Hannon, J. Michael Oakes, and David Allison, in the Journal of Alzheimer’s Disease.

Change in study randomization allocation needs to be included in statistical analysis: comment on ‘Randomized controlled trial of weight loss versus usual care on telomere length in women with breast cancer: the lifestyle, exercise, and nutrition (LEAN) study,’ by Stephanie Dickinson, Lilian Golzarri-Arroyo, Andrew Brown, Bryan McComb, Chanaka Kahathuduwa, and David Allison, in Breast Cancer Research and Treatment.

It can be surprisingly difficult for researchers to simply say exactly what they did. Part of this might be a desire to get credit for design features such as random assignment that were too difficult to actually implement; part of it could be sloppiness/laziness; but part of it could just be that, when you write, it’s so easy to drift into conventional patterns. Designs are supposed to be random assignment, so you label them as random assignment, even if they’re not. The above examples are nothing like pizzagate, but it’s part of the larger problem that the scientific literature can’t be trusted. It’s not just that you can’t trust the conclusions; it’s also that papers make claims that can’t possibly be supported by the data in them, and that papers don’t state what the researchers actually did.

As always, I’m not saying these researchers are bad people. Honesty and transparency are not enuf. If you’re a scientist, and you write up your study, and you don’t describe it accurately, we—the scientific community, the public, the consumers of your work—are screwed, even if you’re a wonderful, honorable person. You’ve introduced buggy software into the world, and the published corrections, if any, are likely never to catch up.

P.S. Hannon, Oakes, and Allison explain why it matters that the design described as a “randomized controlled trial” wasn’t actually that:

By sequentially enrolling participants using alternating assignment, the researchers and enrolling physicians in this study were able to know to which group the next participant would be assigned, and there is no allocation concealment. . . .

The allocation method employed by Ito et al. allows the research team to determine in which group a participant would be assigned, and thus could (unintentionally) manipulate the enrollment. . . .

Alternating assignment, or similarly using patient chart numbers, days of the week, date of birth, etc., are nonrandom methods of group allocation, and should not be used in place of randomly assigning participants . . .

There are a number of disciplines (i.e., public health, community interventions, etc.) which commonly employ nonrandomized intervention evaluation studies, and these can be conducted with rigor. It is crucial for researchers conducting these nonrandomized trials to report procedures accurately.
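
To make the allocation-concealment point concrete, here is a minimal illustrative sketch (not from the letters; the function names are made up): with alternating assignment, whoever enrolls participants can always work out the next arm in advance, whereas with concealed random assignment the next allocation is unknown until it is generated.

```python
# Illustrative only: alternation is fully predictable by the enroller;
# concealed random assignment is not known until the moment of allocation.
import random

def alternating_next(n_enrolled_so_far):
    # anyone who knows how many people have been enrolled knows the next arm
    return "treatment" if n_enrolled_so_far % 2 == 0 else "control"

def concealed_random_next(rng):
    # generated only at enrollment time (and, ideally, held by a third party)
    return rng.choice(["treatment", "control"])

rng = random.Random(0)
for i in range(6):
    print(i, alternating_next(i), concealed_random_next(rng))
```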

42 thoughts on “What’s published in the journal isn’t what the researchers actually did.”

  1. 1) Derive a null model while assuming x = “random assignment”
    2) Don’t collect data that was generated via x = “random assignment”
    3) Spend enough money to detect statistical significance
    4) Publish your “discovery”:
    –A) Conclude your favorite explanation for a deviation from the null model is true
    –B) Come up with a post hoc excuse for why the significance was in the wrong direction this time but your favorite explanation is still true

    This has been the standard procedure for various x, usually with multiple such x (like that one) operating at the same time. It automatically selects for people who are good at convincing/tricking others to give them money, and for vague “theories” that can explain anything with no predictive skill.

  2. I understand that alternating assignment is not the same as random assignment. What I’m having trouble imagining is how this would ever make a difference, causing some kind of misleading effect.

    • several people come into a clinic… the person there is signing them up, they know the next person will be a control… how do they choose the next person to call?

      basically whenever the researchers are involved in the choice of who goes in which group, you have no idea how much the researcher choice influenced the outcome

      • ==> several people come into a clinic.

        Suppose they enter in sequence, and are alternately assigned to a control vs. intervention group in that sequence? Is there a potential for unintentional manipulation there?

        • That’s not a very realistic depiction of how real world trials actually work.

          If you have a situation where foreknowledge of the allocation could not possibly influence recruitment, go ahead. Quasi-randomisation is fine for something like, say, testing different methods of inviting people to population-based screening. You can’t get informed consent, you’re going to randomise (and continue to include) everyone on the list, the potential for introducing bias is very, very low.

          But (with individual randomisation) such clear-cut situations, with no need to ask for or obtain informed consent, are fairly rare. And you can’t just squint and decide it’s close enough if it really isn’t. Randomisation is not that hard; if you can’t do it properly, you probably shouldn’t be doing it at all.

          There’s a horror show of an example here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC28220/ They wanted to test a site-based intervention to raise awareness and reduce stigma around depression in a residential care facility. They knew they couldn’t individually randomise because it was a training-and-posters job, but they only had one cluster so couldn’t do a cluster trial. They decided to randomise their single cluster into two groups: one to be observed before the intervention, the other after. Yes, you read that right. They did a before/after study with half the available sample size and different groups measured before and after. Because they knew they should randomise. They just didn’t understand why they should randomise.

        • Ssuag:

          I followed the link, and . . . it’s not clear to me exactly what was done, but it appears that they measured half the people at time T, then those same people at time T + 9 months, then the other half at T + 12 months and then at T + 21.5 months.

          But then why didn’t they just measure everyone at each time? I’m also confused because they say they were able to “randomly allocate the entire non-nursing home population (1466 people) into two groups,” but then in their data table they have n = 111 and n = 109 in the two groups. Maybe I read the article too quickly and missed the part where they said what happened to the other 1246 people?

          The article was published in 1999, long before we were thinking about replication etc.

        • I told you what they did (and why)? There are myriad problems beyond randomising for no reason (like measuring the two groups in different seasons). There’s a fair bit of correspondence (much of it complaining about methodological critiques for being unfair on the well-meaning clinicians, of course).

          Clinical research (and the regulatory agencies) introduced replication in the 1970s, as a requirement for drug licensing. One of many reforms after the thalidomide scandal. We were quite good at doing clinical trials by 1999 (at least in theory). We had trials units headed by statisticians and funders who required named statisticians on grant applications.

          And a lot of doctors who’d been taught a bit of stats in medical school and assumed they knew it all. (Still true, of course.)

        • “Suppose they enter in sequence, and are alternately assigned to a control vs. intervention group in that sequence? Is there a potential for unintentional manipulation there?”

          Yes because the “assigner” knows which group each person is being assigned to. Of course *if* the assigner follows the rules, there won’t be an impact. But even if that’s true, it’s still not random assignment, so the methods would be incorrectly stated.

          If it’s perfectly OK to misstate the methods because in some person’s opinion it doesn’t matter if it’s done the stated way or some other way, then why state the methods at all?

    • If it is a prespecified analysis, with absolutely no subjectivity, it has little effect.
      ‘Is the patient dead’.
      If there is any subjectivity at any point – from extra encouragement in a treatment arm the experimenter likes, to ruling out patients on post hoc criteria, to post hoc analysis before the official unblinding – this can have major effects.
      This is especially bad for patient subjective reports, where minor changes in tester attitude or behaviour can skew results notably.
      There are reasons why double blind trials are a thing.

      • I’ve seen this messed up by non-blinding as well. For example, the patient/subject may die due to euthanasia.

        In animal trials this is much more common, since euthanasia is usually required by the IACUC when certain (usually open to some interpretation) criteria are met, in order to reduce suffering. However, this can happen in the case of humans as well; typically “unofficial euthanasia” is achieved by allowing the patient to OD on morphine.

  3. I’m somewhat confused I guess. We used to talk about blinded and double-blinded.
    It seems to me that the critics are saying that any randomization needs to be blinded; that it can’t otherwise be described as randomized. I like random and blinded… but… it seems unfair to say that random (with regard to experimental variables) isn’t random.

    • I agree that people are mixing a number of issues together. Randomization would happen for each individual; alternating assignment means there is no individual randomization.

      There are all kinds of reasons why the order of assignment may introduce bias. If two people get to the door at the same time, and social customs about who holds the door for whom come into play so that members of one group are more likely to enter first, then the assignment is not equiprobable, even if the people at the desk are not introducing their own biases.

      Also, say there are two people in the same family who come together. They are never going to end up in the same group, hence, among other things, the standard errors that are calculated on the assumption of independent randomization are wrong.
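
      A minimal simulation sketch of that family example (all numbers made up): if correlated pairs are always split across the two arms, the usual standard error that assumes independent randomization no longer matches the actual variability of the difference in means.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      n_pairs, sd_family, sd_noise, n_sims = 50, 1.0, 1.0, 20_000

      diffs = []
      for _ in range(n_sims):
          family = rng.normal(0, sd_family, n_pairs)        # shared family effect
          y_a = family + rng.normal(0, sd_noise, n_pairs)   # member forced into arm A
          y_b = family + rng.normal(0, sd_noise, n_pairs)   # member forced into arm B
          diffs.append(y_a.mean() - y_b.mean())             # no treatment effect at all

      print("simulated SD of the difference in means:", round(float(np.std(diffs)), 3))
      print("naive SE assuming independent randomization:",
            round(float(np.sqrt(2 * (sd_family**2 + sd_noise**2) / n_pairs)), 3))
      ```

      Here the forced split makes the naive standard error too large; the point is simply that it is wrong, by an amount and in a direction that depend on the correlation structure.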

  4. The important distinction is that “arbitrary” selection is not “random” selection. The researcher may be (unintentionally) substituting a selection methodology that does impact the results. This is possible with random selection too, of course, but there it will not be systematic and, if the design allows, we might run the experiment again with a different random selection and expect the same result.

  5. Looking at a couple of discussions, some people have taken to describing systematic allocation as ‘inadequate randomization,’ but they apparently have no evidence of an impact. The citation given for this in Wikipedia is https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4295227/

    So rather than call it not random – because it is random with respect to the variables – or call it ‘inadequate randomization’ (potentially more accurate), it is even more accurately described as ‘a randomization method not currently in favor with some putative experts for reasons that haven’t been effectively assessed.’ This places it in the large category of ‘convenient experimental procedures which are potentially practical and effective but not favored by various persons for theoretical reasons without any compelling evidence.’ After all, systematic assignment has a range of benefits, for example in terms of assuring roughly equal numbers of people enrolled in each arm, and if the assignment is truly systematic, it should be effective. An assessment that researchers are incapable of executing a process that says: next person through the door gets this label … is rather demeaning and reflects poorly on the accuser.

    • This places it in the large category of ‘convenient experimental procedures which are potentially practical and effective but not favored by various persons for theoretical reasons without any compelling evidence.’

      There is nothing wrong with alternating assignment per se. The problem is testing a null model that assumes random assignment when there was no random assignment! They need to derive a null model when assuming alternating assignment, that is it.

      An assessment that researchers are incapable of executing a process that says: next person through the door gets this label … is rather demeaning and reflects poorly on the accuser.

      I don’t see how anyone who has ever collected data could believe something like this. It is pure fantasy to think you would not skew the results.

      • Anoneuoid said, “There is nothing wrong with alternating assignment per se. The problem is testing a null model that assumes random assignment when there was no random assignment! They need to derive a null model when assuming alternating assignment, that is it.”

        Is there any “testing [of] a null model” that you would approve of? I thought you were on record as flatly rejecting all analyses that test against a null model, regardless of its assumptions.

        • The null model should be a model of the process you think generated the data. Null in the sense of “model to be nullified”, not “model with parameter = 0”.
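
          As a concrete (if oversimplified) illustration of that idea, here is a generic simulation sketch with made-up numbers: write down a model of the process you think generated the data, simulate it many times under the null, and compare the observed statistic with that reference distribution.

          ```python
          import numpy as np

          rng = np.random.default_rng(1)

          def statistic(y_a, y_b):
              return y_a.mean() - y_b.mean()

          # "observed" data, made up for illustration
          obs_a = rng.normal(0.3, 1.0, 40)
          obs_b = rng.normal(0.0, 1.0, 40)
          obs = statistic(obs_a, obs_b)

          # null model: one common process for both arms; in a real study it should
          # also reflect the assignment mechanism actually used (random, alternating, ...)
          ref = [statistic(rng.normal(0, 1, 40), rng.normal(0, 1, 40))
                 for _ in range(10_000)]

          print("p-value under this null model:", float(np.mean(np.abs(ref) >= abs(obs))))
          ```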

      • Anoneuoid said:
        “I don’t see how anyone who has ever collected data could believe something like this. It is pure fantasy to think you would not skew the results.”

        This seems to be missing what I think are some very important points:

        1. Many people running experiments do not understand what “random” means in the context of random assignment — they use “random” in the colloquial meaning of “haphazard”. This is why, when teaching statistics, I’ve placed much more emphasis than most textbooks do on pointing out things like how technical use of terms is not the same as everyday use. (See https://web.ma.utexas.edu/users/mks/statmistakes/randomsample.html for some of the types of things I emphasize)

        2. Many people running experiments do not understand that the validity of the statistical analysis falls apart if the model assumptions are not met — and that some version of random assignment is typically one of the model assumptions. That is why in teaching I have placed much more emphasis than most textbooks do on pointing out how the validity of a statistical analysis depends on the validity of the model assumptions.

        • That quote has nothing to do with stats.

          It should become obvious to anyone who collects any type of data and pays attention that they can bias the results in any number of ways. I can see how someone without any experience could expect otherwise. However, if you have done it and tried to be “objective”, it should shortly become obvious that is impossible.

        • Anoneuoid said, “It should become obvious to anyone who collects any type of data and pays attention that they can bias the results in any number of ways. I can see how someone without any experience could expect otherwise. However, if you have done it and tried to be “objective”, it should shortly become obvious that is impossible.”

          I think a lot of people who have collected data do not pay attention, nor do they try to be “objective”. Also “obvious” is a very subjective concept — one person’s obvious can be another person’s mystery, or revelation.

        • I must be special then I guess… I don’t really see why someone who wasn’t like that would want to do experimental science though.

        • I agree with Elin — and would add that probably more generally, because “That’s The Way We’ve Always Done It” (which I often abbreviate to TTWWADI)

        • Sorry, but I’ve lost track what you all are responding to. People consider it the gold standard to not pay attention while they collect data?

        • I guess Elin and I were being a bit cryptic, so here’s an attempt to clarify at least a bit: I took Elin’s comment to refer to the practice (that is regrettably all too common in some areas of social science) of using a set routine for performing experiments. Students are taught something along the lines of “You follow these steps to do a Randomized Controlled Trial (RCT for short). This way of doing things is considered the Gold Standard.” But the students are not encouraged to question the procedure, nor are they taught any details about what the concepts involved really are — it’s a kind of rote learning that falls very short of teaching students to question, understand, and think. And a lot of students buy into it, believe it, and pass it on to the next generation. It’s more like a religious ritual than real science.

        • Martha:

          Yeah, it’s terrible. And it’s not just students. Esteemed professors can make the same mistake, of thinking that causal identification + statistical significance = discovery. In psychology, that attitude leads to noisy randomized experiments; in economics, noisy observational studies. Either way, there’s a big audience of credulous experts who have been indoctrinated into the belief that causal identification + statistical significance = discovery (along with the paradigm of scientist as hero).

          To put it another way: The problem isn’t that we’re not educating students, journalists, etc., about statistics. The problem is that we’re educating them all too well, in a bad way.

          To put it another way: The problem isn’t just Gladwell, or Freakonomics, or Psychological Science. It goes much deeper than that.

        • Just try to weigh yourself at random times of the day for a month without introducing bias. You will tend to do it just after meals/etc if you want to gain weight and just after bathroom/sleep/etc if you want to lose weight.

        • Probably true. But you would not design this experiment by specifying random times. You are much more likely to see the trend (if there is one) if you weigh yourself at the same (nonrandom) time each day.

        • Then you will adjust your eating/bathroom/sleeping schedule to work around that time. Same difference.

          That is hardly the only way to bias the results either. There are also games to play with weighing yourself once or twice and seeing if you get the same result, etc. You may also discover that drinking more water leads to a higher weight…

        • You’ll be lighter after your morning piss because you are dehydrated. Sorry, that video took too long to get to the point so I didn’t finish it.

    • “After all, systematic assignment has a range of benefits in terms of assuring roughly equal numbers of people enrolled in each arm, for example,…”

      It’s not clear that there is any material advantage to systematic assignment in this regard. If the total N is large, then with very high probability, simple random assignment with 50% probability of allocation to each arm will produce very nearly equal numbers in each arm of a study. For example, if you are enrolling a total of even just 100, and use random assignment with probability .5 to each arm, then the probability that the arms will be imbalanced at 60:40 or worse is only 5.7%.
      In a small study, where the law of large numbers cannot be relied upon, if it is important that the two arms be of nearly equal sizes, block randomization can be used.

      And in any case, what is the big deal about nearly equal numbers? Yes, a null hypothesis test of the difference in expected outcomes in each arm will have its maximum power, for a given total sample size, with equal allocation. But the loss in power that comes from the deviations from equality generated by random sampling in large studies is usually negligible. So, again, this issue is only germane to very small studies–where overall low power is usually a greater problem than unbalanced numbers in the arms.
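
      A quick numerical check of the 60:40 figure above, assuming independent 50/50 assignment of 100 participants (a sketch that assumes scipy is available; the calculation is just a binomial tail):

      ```python
      from scipy.stats import binom

      n, p = 100, 0.5
      # probability that one prespecified arm gets 60 or more, doubled by symmetry
      p_imbalance = 2 * binom.sf(59, n, p)
      print(round(float(p_imbalance), 3))  # ~0.057, i.e. about 5.7%
      ```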

      • Also, the role that balance plays in Bayesian model fitting is much less important than the role it plays when you’re doing linear algebra to calculate the least-squares fit, where the number of errors on each side of the analysis matters because you’re minimizing a sum with different numbers of terms.

  6. A quote to put this thread into perspective (and to set expectations right):

    “an article about a computational result is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.” John Claerbout, as paraphrased in Buckheit and Donoho (1995): https://statweb.stanford.edu/~wavelab/Wavelab_850/wavelab.pdf

    The paper should focus on the findings and their generalisation. As “advertisement” consider adding a section called Generalisation with a table representing alternative representations and a boundary of meaning (BOM). Details provided for repeatability purposes should be in a supplementary material section.

  7. In some areas, randomization is never explicitly asserted, but often assumed by convention. In particular, in much social psychology (this probably includes some of the papers critiqued here before), papers simply say it was (e.g.) “a 2 by 2 between-subjects design” or maybe “an experiment”.

  8. Is there a difference between (a) shuffling all patients and drawing a line down the middle to select balanced groups, and (b) for each patient flipping a coin to decide which group they belong to? Since (b) could result in groups of wildly different sizes, should we always do (a)?

    • > Since (b) could result in groups of wildly different sizes, should we always do (a)?

      Wildly different? No. The group sizes are binomially distributed, so the size of the deviations is easily calculable.

      You should shuffle and split when you recruit everyone up front. If you recruit as you go along, randomize each person and rely on the central limit theorem to keep things evenly distributed to within reasonable tolerance.
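
      A small sketch of the two schemes (illustrative code, not from any trial): shuffling up front gives exactly equal arms, while per-person coin flips give arm sizes that are binomially distributed around n/2 with standard deviation sqrt(n)/2, i.e. 5 people when n = 100.

      ```python
      import numpy as np

      rng = np.random.default_rng(2)
      n = 100

      # (a) recruit everyone, shuffle, and split down the middle: exactly n/2 per arm
      order = rng.permutation(n)
      arm_a, arm_b = order[: n // 2], order[n // 2:]

      # (b) flip a fair coin for each participant as they arrive: sizes vary binomially
      coin = rng.integers(0, 2, size=n)
      print(len(arm_a), len(arm_b), int(coin.sum()), int(n - coin.sum()))
      ```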

  9. *writes methods section for an RCT* *sassy response from researcher* “You’ve just summarized the analysis I did” *…* isn’t that the point of a methods section? To write out explicitly how we’ve analyzed the data…?
