In a randomized experiment (i.e. RCT, A/B test, etc.) units are randomly assigned to treatments (i.e. conditions, variants, etc.). Let’s focus on Bernoulli randomized experiments for now, where each unit is independently assigned to treatment with probability *q* and to control otherwise.

Thomas Aquinas argued that God’s knowledge of the world upon creation of it is a kind of practical knowledge: knowing something is the case because you made it so. One might think that in randomized experiments we have a kind of practical knowledge: we know that treatment was randomized because we randomized it. But unlike Aquinas’s God, we are not infallible, we often delegate, and often we are in the position of consuming reports on other people’s experiments.

So it is common to perform and report some tests of the null hypothesis that this process did indeed generate the data. For example, one can test that the sample sizes in treatment and control aren’t inconsistent with this. This is common at least in the Internet industry (see, e.g., Kohavi, Tang & Xu on “sample ratio mismatch”), where it is often particularly easy to automate. Perhaps more widespread is testing whether the means of pre-treatment covariates in treatment and control are distinguishable; these are often called balance tests. One can do per-covariate tests, but if there are a lot of covariates then this can generate confusing false positives. So one often uses a single joint test of all the covariates instead.
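As a rough illustration of the sample-size check, here is a minimal sketch of a sample ratio mismatch test for a Bernoulli(*q*) design. It uses a normal approximation to the binomial; the function name `srm_test` and the example counts are made up for illustration and are not taken from any particular experimentation system.

```python
import math

def srm_test(n_treat, n_control, q=0.5):
    """Two-sided z-test that the treated count is consistent with Bernoulli(q) assignment."""
    n = n_treat + n_control
    expected = n * q
    se = math.sqrt(n * q * (1 - q))
    z = (n_treat - expected) / se
    # Two-sided normal tail probability: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 100,000 units with a planned 50/50 split.
z, p = srm_test(50_600, 49_400, q=0.5)
print(f"z = {z:.2f}, p = {p:.2g}")
```

With very small samples an exact binomial test would be the safer choice; the approximation above is only meant for the large samples typical of online experiments.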

Some experimentation systems in industry automate various of these tests and, if they reject at, say, *p* < 0.001, show prominent errors or even watermark results so that they are difficult to share with others without being warned. If we’re good Bayesians, we probably shouldn’t give up on our prior belief that treatment was indeed randomized just because some p-value is less than 0.05. But if we’ve got *p* < 1e-6, then — for all but the most dogmatic prior beliefs that randomization occurred as planned — we’re going to be doubtful that everything is alright and move to investigate.
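One way to make this intuition concrete, offered as a sketch and not as the calculation any system actually performs, is the Sellke, Bayarri and Berger calibration, which bounds the Bayes factor in favor of the null from below by −*e*·*p*·ln(*p*) for *p* < 1/*e*. The prior of 0.99 used here is an arbitrary assumption chosen for illustration.

```python
import math

def min_bayes_factor(p):
    """Lower bound -e * p * ln(p) on the Bayes factor for the null (valid for 0 < p < 1/e)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)

def posterior_null_lower_bound(p, prior_null):
    """Smallest posterior probability of the null that is compatible with the bound."""
    prior_odds = prior_null / (1 - prior_null)
    post_odds = prior_odds * min_bayes_factor(p)
    return post_odds / (1 + post_odds)

# Assumed prior: 99% belief that randomization occurred as planned.
for p in (0.05, 0.001, 1e-6):
    print(p, round(posterior_null_lower_bound(p, prior_null=0.99), 4))
```

Under these assumptions, *p* = 0.05 cannot push the posterior for "randomized as planned" below roughly 0.976, whereas at *p* = 1e-6 the bound allows it to fall to about 0.004. The bound says how far the evidence could move us, not how far it must.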

In my own digital field and survey experiments, we indeed run these tests. Some of my papers report the results, but I know there’s at least one that doesn’t (though we did the tests) and another where we just state they were all not significant (and this can be verified with the replication materials). My sense is that reporting balance tests of covariate means is becoming even more of a norm in some areas, such as applied microeconomics and related areas. And I think that’s a good thing.

Interestingly, it seems that not everyone feels this way.

In particular, methodologists working in epidemiology, medicine, and public health sometimes refer to a “Table 1 fallacy” and advocate against performing and/or reporting these statistical tests. Sometimes the argument is specifically about clinical trials, but often it is more generally randomized experiments.

Stephen Senn argues in this influential 1994 paper:

> Indeed the practice [of statistical testing for baseline balance] can accord neither with the logic of significance tests nor with that of hypothesis tests for the following are two incontrovertible facts about a randomized clinical trial:
>
> 1. over all randomizations the groups are balanced;
>
> 2. for a particular randomization they are unbalanced.
>
> Now, no ‘significant imbalance’ can cause 1 to be untrue and no lack of a significant balance can make 2 untrue. Therefore the only reason to employ such a test must be to examine the process of randomization itself. Thus a significant result should lead to the decision that the treatment groups have not been randomized, and hence either that the trialist has practised deception and has dishonestly manipulated the allocation or that some incompetence, such as not accounting for all patients, has occurred.
>
> In my opinion this is not the usual reason why such tests are carried out (I believe the reason is to make a statement about the observed allocation itself) and I suspect that the practice has originated through confused and false analogies with significance and hypothesis tests in general.

This highlights precisely where my view diverges: indeed the reason I think such tests should be performed is because I think that they could lead to the conclusion that “the treatment groups have not been randomized”. I wouldn’t say this *always* rises to the level of “incompetence” or “deception”, at least in the applications I’m familiar with. (Maybe I’ll write about some of these reasons at another time — some involve interference, some are analogous to differential attrition.)

It seems that experimenters and methodologists in social science and the Internet industry think that broken randomization is more likely, while methodologists mainly working on clinical trials put a very, very small prior probability on such events. Maybe this largely reflects the real probabilities in these areas, for various reasons. If so, part of the disagreement simply comes from cross-disciplinary diffusion of advice and overgeneralization. However, even some of the same researchers are sometimes involved in randomized experiments that aren’t subject to all the same processes as clinical trials.

Even if there is a small prior probability of broken randomization, if it is very easy to test for it, we still should. One nice feature of balance tests compared with other ways of auditing a randomization and data collection process is that they are pretty easy to take in as a reader.

But maybe there are other costs of conducting and reporting balance tests?

Indeed this gets at other reasons some methodologists oppose balance testing. For example, they argue that it fits into an, often vague, process of choosing estimators in a data-dependent way: researchers run the balance tests and make decisions about how to estimate treatment effects as a result.

This is articulated in a paper in *The American Statistician* by Mutz, Pemantle & Pham, which highlights how discretion here creates a garden of forking paths. In my interpretation, the most considered and formalized arguments say that conducting balance tests and then using the results to determine which covariates to include in the subsequent analysis of treatment effects in randomized experiments has bad properties and shouldn’t be done. Here the idea is that when these tests provide some evidence against the null of randomization for some covariate, researchers sometimes then adjust for that covariate (when they wouldn’t have otherwise); and when everything looks balanced, researchers use this as a justification for using simple unadjusted estimators of treatment effects. I agree with this, and typically one should already specify adjusting for relevant pre-treatment covariates in the pre-analysis plan. Including them will increase precision.

I’ve also heard the idea that these balance tests in Table 1 confuse readers, who see a single *p* < 0.05 — often uncorrected for multiple tests — and get worried that the trial isn’t valid. More generally, we might think that Table 1 of a paper in a widely read medical journal isn’t the right place for such information. This seems right to me. There are important ingredients to good research that don’t need to be presented prominently in a paper, though it is important to provide information about them somewhere readily inspectable in the package for both pre- and post-publication peer review.

In light of all this, here is a proposal:

- Papers on randomized experiments should **report tests of the null hypothesis that treatment was randomized as specified.** These will often include balance tests, but of course there are others.
- These tests should follow the maxim “**analyze as you randomize**”, both accounting for any clustering or blocking/stratification in the randomization and any particularly important subsetting of the data (e.g., removing units without outcome data).
- Given a typically high prior belief that randomization occurred as planned, authors, reviewers, and readers should **certainly not use *p* < 0.05 as a decision criterion here**.
- If there is evidence against randomization, **authors should investigate**, and may often be able to fully or partially **fix the problem** long prior to peer review (e.g., by including improperly discarded data) or in the paper (e.g., by identifying the problem only affected some units’ assignments, bounding the possible bias).
- While it makes sense to mention them in the main text, there is typically little reason (if they don’t reject with a tiny p-value) for them to appear in Table 1 or some other prominent position in the main text, particularly of a short article. Rather, they should typically **appear in a supplement or appendix**, perhaps as Table S1 or Table A1.
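To illustrate what “analyze as you randomize” can look like for a joint balance test, here is a minimal sketch that compares the observed imbalance with the imbalance from re-drawn assignments that respect the blocking. It assumes a design that fixes the number of treated units within each block, so labels are shuffled within blocks. The statistic (a sum of scaled squared mean differences), the function names, and the synthetic data are all illustrative choices, not a prescribed method.

```python
import random

def imbalance_stat(covariates, assign):
    """Sum over covariates of the squared treated-minus-control mean difference,
    scaled by that covariate's overall sample variance.
    Assumes both arms are non-empty and every covariate varies."""
    total = 0.0
    for col in covariates:
        treated = [x for x, a in zip(col, assign) if a == 1]
        control = [x for x, a in zip(col, assign) if a == 0]
        mean_all = sum(col) / len(col)
        var = sum((x - mean_all) ** 2 for x in col) / (len(col) - 1)
        diff = sum(treated) / len(treated) - sum(control) / len(control)
        total += diff * diff / var
    return total

def redraw_within_blocks(assign, blocks, rng):
    """Shuffle treatment labels inside each block, keeping each block's treated count fixed."""
    members = {}
    for i, b in enumerate(blocks):
        members.setdefault(b, []).append(i)
    new_assign = list(assign)
    for idx in members.values():
        labels = [assign[i] for i in idx]
        rng.shuffle(labels)
        for i, label in zip(idx, labels):
            new_assign[i] = label
    return new_assign

def joint_balance_pvalue(covariates, assign, blocks, draws=1000, seed=0):
    """Randomization p-value: share of redraws at least as imbalanced as the observed assignment."""
    rng = random.Random(seed)
    observed = imbalance_stat(covariates, assign)
    hits = 0
    for _ in range(draws):
        redrawn = redraw_within_blocks(assign, blocks, rng)
        if imbalance_stat(covariates, redrawn) >= observed:
            hits += 1
    return (hits + 1) / (draws + 1)

# Synthetic demo data (hypothetical covariates, four blocks of 30 units).
data_rng = random.Random(1)
n = 120
blocks = [i % 4 for i in range(n)]
age = [data_rng.gauss(40, 10) for _ in range(n)]
visits = [data_rng.expovariate(0.2) for _ in range(n)]

def assign_half_per_block(pick_treated):
    assign = [0] * n
    for b in set(blocks):
        idx = [i for i in range(n) if blocks[i] == b]
        for i in pick_treated(idx):
            assign[i] = 1
    return assign

as_planned = assign_half_per_block(lambda idx: data_rng.sample(idx, len(idx) // 2))
# A broken process: the oldest half of each block ends up treated.
broken = assign_half_per_block(lambda idx: sorted(idx, key=lambda i: age[i])[len(idx) // 2:])

p_planned = joint_balance_pvalue([age, visits], as_planned, blocks)
p_broken = joint_balance_pvalue([age, visits], broken, blocks)
print(p_planned, p_broken)
```

For a clustered design the same idea applies with whole clusters re-drawn together, and for a Bernoulli design the redraw would flip independent coins instead of shuffling labels.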

This recognizes both the value of checking implications of one of the most important assumptions in randomized experiments and that most of the time this test shouldn’t cause us to update our beliefs about randomization much. I wonder if any of this remains controversial and why.

*[This post is by Dean Eckles. This is my first post here. Because this post discusses practices in the Internet industry, I note that my disclosures include related financial interests and that I’ve been involved in designing and building some of those experimentation systems.]*

Not directly relevant, but I remember a couple years ago when Kaiser Fung spoke in my class. Students were asking him about A/B tests and he talked about the value of A/A tests, and everyone in the room, including me, was like, Ahhhh! Suddenly we all had a clearer perspective on the world.

This interesting blog is all about generalization of findings. You compare two groups, one treated the other untreated. What did you learn – not about the participating units, but beyond.

In many cases the end of the trial is a decision point, followed by what is called in clinical research Phase IV or biosurveillance.

A/B testing will usually have no “Phase IV”.

You run a test and reach a decision that A is better than B. Do you continue using some “B” or only “A”? Multi-armed bandit methods are supposed to deal with this.

A/A tests (which surprised Andrew) allow you to pretest the A/B test. They can also be used for monitoring or “Phase IV”.

More on A/A testing: https://www.dynamicyield.com/lesson/aa-testing-in-experimentation/

For something on Bayesian A/B testing see: https://www.dynamicyield.com/lesson/running-effective-bayesian-ab-tests/

This can still be important even in a finite population setting where all units are in the experiment. For example, you might run a cluster-randomized experiment where the clusters are US states or network clusters globally. You can still want to test whether the data are consistent with the planned randomization.

Suppose you generated the randomization list using, e.g., rbinom(N, 1, 0.5). You perform a balance check of some sort, and it fails. What you should do, in this case, is try to reproduce the randomization list, by re-running the rbinom call (with the appropriate seed set). If you succeed in reproducing the list actually used, then the balance check is just making a type I error, and it isn’t telling you anything useful. If you fail to reproduce it, then you shouldn’t publish the paper—at least not without fixing the problem (if possible at that point).
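The reproduction check described here can be sketched as follows. This is a Python analogue of the `rbinom` call, written under the assumption that the seed and the generator were recorded at randomization time. The function names and the seed value are hypothetical.

```python
import random

def draw_assignments(n_units, q, seed):
    """Bernoulli(q) assignment list from a seeded generator (analogous to rbinom(N, 1, q))."""
    rng = random.Random(seed)
    return [1 if rng.random() < q else 0 for _ in range(n_units)]

def first_mismatch(recorded, regenerated):
    """Index of the first disagreement between the two lists, or None if they are identical."""
    for i, (a, b) in enumerate(zip(recorded, regenerated)):
        if a != b:
            return i
    if len(recorded) != len(regenerated):
        return min(len(recorded), len(regenerated))
    return None

SEED = 20210823  # hypothetical seed stored with the protocol
recorded = draw_assignments(1000, 0.5, SEED)

# The list used in the analysis reproduces exactly: no mismatch.
print(first_mismatch(recorded, draw_assignments(1000, 0.5, SEED)))

# A list altered somewhere downstream (one flipped unit) is caught at that position.
altered = list(recorded)
altered[17] = 1 - altered[17]
print(first_mismatch(altered, draw_assignments(1000, 0.5, SEED)))
```

Note that this only checks the list that is compared. It says nothing about units that were dropped or mislabeled after the list was generated unless the comparison is made against the assignment column actually used in the analysis.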

Now suppose the balance check succeeds. It’s still possible that the randomization went wrong somehow (a type II error). So, being cautious, you should still try to reproduce the randomization. And depending on how it turns out, you should respond the same way as you would in the case where the balance check failed. So ultimately it seems that if we’re concerned the randomization may have failed, such that a balance check might make sense, we ought to try to reproduce the randomization list, an exercise which renders the balance check pointless. So it doesn’t seem to have any value to you as the randomizer.

So maybe the claim is it has value for the reader, who can’t reproduce the randomization to check it for themselves. But why isn’t it enough for the randomizer to verify that they reproduced the randomization? Is the idea that, we have to worry that randomizers may be trying to commit research fraud, and the balance check is how we make sure they’re not? But if they’re lying about reproducibility, why should we assume they’re not also lying about the balance check? Ultimately the only way to verify the issue without trusting the commitment of the authors is to publish the data and code, so that readers can verify it for themselves. A balance check doesn’t seem to add anything to this.

(To be clear, I think table 1 is valuable, as imbalances may explain part of what’s driving a study’s results. But statistical testing of these imbalances seems to me to add nothing of value. It suggests there’s an actual question about whether the null or alternative hypothesis is true, when there isn’t as long as randomization went as intended. And if the question is if it went as intended, you should be reproducing it, not relying on noisy statistical inferences.)

I’m sympathetic to some of this. This is consistent with why at least one of my papers just says that we did the tests and doesn’t have a table of p-values (but does have replication materials). But since then I’ve encountered cases where people don’t report balance tests with p-values (or similar measures of statistical evidence) but instead claim the imbalances are, say, “not practically significant”. Then I really want to know the p-values, which may well very strongly indicate something is potentially quite wrong.

But what if the authors report that they verified the reproducibility of the randomization per their protocol? It’s possible they’re lying, but in that case they could be lying about all of it. As long as they’re not lying, that should answer the question to your satisfaction, no? What more does a balance check provide in that case?

The point of performing a hypothesis test is to use data to decide in favor of either a null or its alternative hypothesis, hypotheses which make statements about the data generating process. We do this because we have access only to the data. If we had access to the data generator, we could just check which hypothesis was true. Any data-based decision is going to risk making errors due to random characteristics of the generator, whereas the decision based on inspecting the generator will be error-free.

In this scenario, we have access to the actual data generating process, namely the code that generated the randomization list. Why would we settle for a data-based inference about this generator, rather than just inspecting the generator itself? And if the issue is third parties viewing the person running the code with suspicion of dishonesty, I’m not sure how *anything* such a person produces, short of the data and code itself, would overcome that concern. So I just fail to see the point of this exercise.

Again, summarizing the data can be very valuable, so I have no objections to this. But I think any sort of statistical inference about randomization is going to be strictly dominated by actually verifying that randomization was carried out as intended.

“Any data-based decision is going to risk making errors due to random characteristics of the generator, whereas the decision based on inspecting the generator will be error-free.”

If inspecting the generator is error free, how would anyone ever make a mistake in the first place?

Balance tests are not unlike unit tests in software engineering.

If I check the code and don’t find anything wrong, but I also get p < 1e-10, I'm typically going to conclude something is wrong, not just that I got very unlucky.

OK, perhaps I should have said “sampling variability-induced error-free”. Whatever other sources of error you have in mind, these infect the running and reporting of balance checks just as much as they infect the running and reporting of a randomization reproducibility check. Meanwhile, sampling error only infects one procedure but not the other. So reproducibility checks strictly dominate balance checks.

Covariate balance and sample size tests can be much simpler to implement than the entire randomization and data processing pipeline.

The only thing a balance check tells us is whether we should be concerned that the group assignment column in the spreadsheet used to generate the table was generated per protocol. Regenerating that column, and comparing it with the column you would use to make the table, should be a trivial exercise. And in any scenario where you think a balance check is worth doing (because you have doubts about whether randomization went as intended), you might as well carry out this trivial exercise which yields a far more definitive conclusion about the matter.

15 years ago at a large internet company, we would analyze the “pre-period” when the treatment and control groups had been defined but before the treatment had begun.

When treatment and control were distinguishable in the pre-period, this was never thought of as a failure of the randomization _procedure_, as treatment and control were selected by a well-tested framework that performed the selection in simple and verifiable ways (globally unique identifier mod X).

But when you run many hundreds of experiments per year, sometimes a simple, seemingly obviously correct randomization procedure fails to produce a random assignment for subtle reasons (how were those identifiers assigned anyway? and how sure are you that no one ran an experiment last week with the identifier mod 2X? etc.). And sometimes a random assignment generates a nonrepresentative sample in ways you aren’t adjusting for.

I mean, sure, you should adjust. But one reason not to might be that your sample is nonrepresentative in a way that you can recognize at a high level but that you don’t know enough about to adjust for (“I know my revenue per pageview varies per country but I also know the per-country revenue proportions vary month to month”). Hierarchical models with partial pooling are great but in practice sometimes they’re unavailable to you.

In an internet environment, you can respond to these nonrepresentative samples by simply throwing them away and creating a new treatment/control assignment. You can even do that before beginning to gather data, avoiding any forking paths. The cost of this is low. A clinical trial is a very different situation.

“and how sure are you that no one ran an experiment last week with the identifier mod 2X?”

Indeed this is not a valid way of doing random assignment without strong assumptions. Typically userids are not randomly assigned and, as you say, this creates correlated assignments if you run many experiments.

I’ve seen engineers implement userid modulo as randomization and as a result make a giant, strategically important experiment be nearly useless.
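A small sketch of the problem being described: if last week’s experiment bucketed on identifier mod 2X and this week’s buckets on identifier mod X, prior exposure is perfectly confounded with the new assignment. Hashing the identifier together with an experiment-specific salt is one common remedy. The experiment names, bucket choices, and helper functions below are hypothetical.

```python
import hashlib

def bucket_mod(user_id, buckets):
    """Bucket by raw identifier modulo the number of buckets."""
    return user_id % buckets

def bucket_hash(user_id, experiment, buckets):
    """Bucket by a hash of the identifier salted with the experiment name."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def prior_exposure_by_arm(in_treatment_now, treated_before, user_ids):
    """Share of the current treatment and control arms that was treated in the earlier experiment."""
    treated = [0, 0]  # [units in arm, units previously treated]
    control = [0, 0]
    for u in user_ids:
        arm = treated if in_treatment_now(u) else control
        arm[0] += 1
        arm[1] += treated_before(u)
    return treated[1] / treated[0], control[1] / control[0]

users = range(100_000)

# Modulo scheme: last week treated id % 4 == 0, this week treats id % 2 == 0.
mod_shares = prior_exposure_by_arm(
    lambda u: bucket_mod(u, 2) == 0,
    lambda u: bucket_mod(u, 4) == 0,
    users,
)

# Salted-hash scheme with the same bucket counts.
hash_shares = prior_exposure_by_arm(
    lambda u: bucket_hash(u, "exp_this_week", 2) == 0,
    lambda u: bucket_hash(u, "exp_last_week", 4) == 0,
    users,
)

print("mod: ", mod_shares)   # half of the treated arm was previously treated, none of control
print("hash:", hash_shares)  # both arms close to one quarter
```

The modulo scheme also inherits whatever structure the identifiers themselves have (for example, sequential assignment by signup date), which the salted hash largely removes.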

There are two issues here. The first is the assumption that an unbalanced sample is a sign that fraud or a mistake has occurred. If it has, the long-run ES estimate may not converge to the parameter’s value, in which case do test and report pre-treatment balance, and frame results accordingly. But that’s not why most people conduct randomization checks, in my own experience.

The second issue is the assumption that a sample that’s unbalanced on a relevant pre-treatment covariate, even when that imbalance is a natural result of a true randomization, will lead to an inaccurate ES estimate. Yes, the ES will still converge to the true value in the long run, but in the particular, isolated case of this one study, our estimate will lie in the tail of the ES sampling distribution. I buy this assumption, because I can’t think of a reason it would not hold. (Others may.) If so, adjusting for the imbalance or discarding results should improve both short-run accuracy and long-run precision by, basically, Winsorizing the sampling distribution. But there are a few caveats.

First, the result will be biased if imbalance is addressed inconsistently–if people tend to test for or report or adjust for imbalance when it is anticipated to favor outcomes for one group or another. Second, we often want to know the properties of the parameter distribution, not just the point estimate, in which case adjusting for accuracy will distort, for example, the CI. Third, we should consider whether an unbalanced sample reflects an unbalanced population, in which case rebalancing may change the nature of our research question. Fourth, consider whether the imbalance is a result of biased measures rather than chance.

Why should we perform tests of significance if we are interested in whether samples are similar with regard to certain covariates? We could just have a look at the groups and assess the differences. We are not interested in the population(s) here, as far as I can see, just in the samples.

+1

Precisely. Doing statistical inference tests for balance answers an irrelevant question.

I don’t agree with Clyde here. Balance is kind of irrelevant but a gross failure of balance may show that the data were not generated as advertised. That needs follow-up.

We are not interested in that. We want to know whether the data are consistent with the null that randomization occurred as planned. Means of observed covariates are simply one way to test implications of that, just like testing that the sample sizes are the right size.

> have a look at the groups and assess the differences

> > We are not interested in that

I don’t see how we are not interested in that. Expand on this. This doesn’t make sense to me unless there is some special reason to be paying attention to this null hypothesis.

The null we care about is that randomization occurred as specified for our analytical sample. This null has testable implications about covariates, sample sizes, etc.

Thinking about covariate balance might be confusing and unnecessarily controversial (because it gets people worried that people are going to choose their estimator of treatment effects based on this table). Instead, say we test that our assignment mechanism is indeed Bernoulli(q) by testing that the proportion of units in treatment is not too improbable under that null. Make sense?

> Make sense?

I see what you’re doing, but no, your comment doesn’t seem right to me. I think Ney’s summary is better, cuz there’s a variety of things we might want to do here.

I think the message on these things is be creative and check your assumptions. For instance, the A/A test and the ask-the-engineer-how-the-randomization-works are like this.

For instance, what is the p-value if you used a pid % X randomization? What is the p-value if you lost X% of your pids when you went to join them on your database of covariates? The differences between what you were hoping for and what you got were the point and there’s lots of ways to diagnose that (which depends a lot on what is practical with the application).

> going to choose their estimator of treatment effects based on this table

Well, even if we aren’t choosing the estimator, we are choosing our treatment effects based on the table. If the table says there’s a problem, we keep trying stuff until the table says there isn’t a problem.

There is the “prior” belief of the probability of an error as you say, but there is also your prior belief that someone reading the paper is going to misinterpret the p-values. This latter probability is very high in medicine indeed, as learning how to remove spleens, for example, doesn’t leave a lot of time left over for learning statistics. I couldn’t speak to the other fields that use randomized designs.

(On a minor quibble, I’ve never really noticed people in epi or medical stats using the term “table 1 fallacy”. Reference to the “table 2 fallacy” coined by Westreich and Greenland is much more common https://pubmed.ncbi.nlm.nih.gov/23371353/).

+1

Yes, my understanding is that “Table 2 fallacy” is much more common and that “Table 1 fallacy” is kind of a reference to that. But seems it is more rare than I realized, this being the only scholarly reference that pops up https://www.tandfonline.com/doi/full/10.1080/17453674.2021.1903727

In my rather long career I have analyzed many many datasets, very often produced by other groups. When new data arrives, the first question is “Does it make any sense at all?” Quite often the answer is no. Often the tests here are rather informal (plot the data in various ways), or perhaps the data were generated in two tranches which “ought” to be statistically indistinguishable. It’s too much to report all these tests, most of which “couldn’t” show non-randomness, though sometimes of course they do. Example: A cancer biopsy dataset in which the day of week the data was collected causes differences.

I would take a more operational point of view to this: We don’t have to do a NHST of balance because we know it is true (barring RNG failing, which can be a concern). The estimator in an A/B test would be unbiased, but it will be noisy. We can adjust for covariates, of course, and that helps to bring down the variance while keeping the unbiasedness. In an experimentation system where the covariate balance is automatically checked and the experimenter is automatically warned, we are basically looking at re-randomization, which is just a different way of increasing the precision. Whether this is needed on top of regression adjustment, I am not sure, but I agree with the general sentiment in Sec 4.1.2 in https://arxiv.org/abs/1906.11291

To be clear, when you say “we know it is true”, it sounds like this isn’t a strong enough belief that you wouldn’t be swayed by strong evidence against it. Sometimes tests of the null of randomization yield very strong evidence against it, and something has very likely gone wrong, whether in initial randomization code or downstream (eg in exposure or “triggering” logs).

Of course, it is a separate and correct point that stratified/blocked and rerandomized designs can certainly make sense, though the variance decreases for large sample sizes over regression adjustment or post-stratification are of order N^-2. https://doi.org/10.1111/j.1467-9868.2012.01048.x

> So it is common to perform and report some tests of the null hypothesis that this process did indeed generate the data.

I don’t think that the point of testing that null hypothesis in clinical trials whether the data was generated from that process or a different one.

The issue is that even if the randomization is perfectly done the result may be unbalanced, and all the ex-ante frequentist guarantees go out the window when your actual samples are unbalanced.

At that point forking paths may be better than going through a path that leads surely to a bad result and trying to find comfort in the fact that if the samples had been good the result would be better.

I meant: “I don’t think that the point of testing that null hypothesis in clinical trials [is to determine] whether the data was generated from that process or a different one.”

We assume it does, but we need it to look like it does to trust the subsequent analysis.

Looking at the p-values for covariate imbalances on table 1 of an RCT strikes me as a variant of the prosecutor’s fallacy: we are not interested in the sampling probability of the given event (i.e., the imbalances in covariate distribution between the groups in the RCT of interest) but rather in the probability that randomization was not proper given these imbalances. The second question can be answered by looking at the relative likelihood between the two explanations: randomization was proper vs not proper. Not by looking at sampling probabilities under the null hypothesis.

Agree with this completely. Small p is only meaningful for these sorts of inferences if both the likelihood of the imbalance was greater under broken randomisation and the probability of an error was sufficiently high. I would argue that in any sufficiently well designed study this would not be the case. As one example, if fraud was a concern then there’s every chance it would be very well balanced as that’s what any competent trials fraudster would want you to see!

OK so you want to do a likelihood ratio test or compute a Bayes factor with some specified alternative hypothesis. Hard to object to that. You could in fact try to put the whole thing into a Bayesian decision-theoretic framework, and maybe the threshold for investigating or for saying in the paper “something likely went wrong” (if your investigation can’t figure out what went wrong) ends up being a bit different.

I conjecture that sensible choices of alternatives and priors and some kind of prosocial utility function (of course some authors may want to hide problems) that recognizes harm from passing off something as randomized that wasn’t really would result in taking some action by the time you get to p < 1e-6 for a joint test of balance. But I would be interested if someone wants to work through that. Cf https://twitter.com/analisereal/status/1429220747386388480

Let me use another example: every time that we randomly shuffle a deck of cards, the probability of the observed permutation is astronomically small. That has nothing to do with whether the shuffling procedure was random. Most of the information needed to evaluate the shuffling procedure (e.g., contextual knowledge on whether the cards are marked) lies outside of the observed permutation. If anything, a particularly “balanced” pattern may look suspicious.

Testing various properties of the sequence of cards is one of the few cases where a p value based assessment is actually meaningful. If the number of runs, of groups of same suit, of reverse runs, of cards in a given suit in the first half of the deck, of etc etc taken together give a p value very small compared to the null of a perfect random number generator, then you can conclude the cards were not randomized. This is basically the definition of algorithmic randomness given by Per Martin-Löf

https://dilbert.com/strip/2001-10-25

Note that the null of a perfect random number generator is different from the null of balance tests. Covariate imbalance is expected in an RCT. The statistical models used to analyze such RCTs are in fact designed to account exactly for this imbalance.

Our goal is to determine the probability that randomization was improper given the data at hand. The sampling probability of our observed card permutation (or covariate distribution in the RCT example) can provide indirect evidence to be incorporated in our estimation of the relative likelihood between the two scenarios: randomization was proper vs improper. But the balance tests in RCTs use the wrong test hypothesis because a randomized procedure *will* generate imbalances in the covariates between groups. The p-values generated by balance tests can be used to test a *stratification* procedure. But those same p-values are misleading if we are testing randomization. If one wishes to generate a p-value that can refute the randomization procedure, then the hypothesis to be tested is different than the one tested by the p-values shown in Table 1 of RCTs.

“But the balance tests in RCTs use the wrong test hypothesis because a randomized procedure *will* generate imbalances in the covariates between groups. The p-values generated by balance tests can be used to test a *stratification* procedure. ”

Not sure what you mean by this, as I’m having a hard time coming up with a reading that isn’t false. Of course there is random covariate imbalance. We are concerned when the imbalance exceeds that which is likely under the null of randomization.

Let’s consider only a single covariate for simplicity. If it is normally distributed, a t-test will be exact (have size / Type I error rate at or below alpha) for the null of randomization. If it isn’t, then at least asymptotically it will have the correct size. Alternatively, one can use a permutation test (with eg a t-statistic or some rank-based statistic as the test statistic), which will be exact in finite samples without any restriction on the distribution of the covariate.

Now I do agree with your point that if we wanted to approach this whole process using Bayesian decision theory, we’d need to formalize what the alternatives are.

I think the idea is this: suppose you’re testing a binomial p=0.5 RNG and you find an exact (50000, 50000) split for 100000 people. By the p-value metric, this is the highest possible p-value under the, in this case, desirable null hypothesis. In fact, to take a Bayesian view, if I saw this, I’d probably conclude that the RNG is broken: while this is the maximum likelihood case, my prior is that people sometimes try to engineer things to look random in this way.
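The arithmetic behind this example can be sketched as follows. This is a hedged illustration rather than anyone's production check: `binom_pmf` and `srm_p_value` are hypothetical names, and the two-sided p-value is defined here as the total probability of all outcomes no more likely than the observed one, which is one common convention among several.

```python
from math import exp, fsum, lgamma, log

def binom_pmf(k, n, q):
    """Binomial probability of k successes in n trials, via log-gamma."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(q) + (n - k) * log(1 - q))
    return exp(log_pmf)

def srm_p_value(n_treated, n_total, q=0.5):
    """Exact two-sided binomial p-value for a sample ratio check.

    Sums the probability of every outcome that is no more likely than
    the observed count (with a small tolerance for rounding). Requires
    0 < q < 1.
    """
    observed = binom_pmf(n_treated, n_total, q)
    cutoff = observed * (1 + 1e-7)
    total = fsum(p for p in (binom_pmf(k, n_total, q)
                             for k in range(n_total + 1)) if p <= cutoff)
    return min(1.0, total)

if __name__ == "__main__":
    n = 100_000
    print("p-value, exact split:   ", srm_p_value(50_000, n))
    print("P(exact split) itself:  ", binom_pmf(50_000, n, 0.5))
    print("p-value, 50800 / 49200: ", srm_p_value(50_800, n))
```

The exact split gets a p-value of essentially 1, yet the probability of landing on exactly (50000, 50000) is only about 0.0025. That gap is the commenter's point: a two-sided test of the ratio cannot flag a split that is suspiciously perfect.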

All this said, I think the purist Bayesian viewpoint here is the wrong way to look at it, both because the prior over the space “ways of badly generating random numbers” is so hard to define and very spikey-slabby and also because the null hypothesis is actually very plausible here. The way I’d like to do this, if they ever let me near an AB testing platform, is to run the bucketing over a suite of frequentist tests as described in The Art of Computer Programming, Section 3.3. One of these would be a p-value, one of these would be the longest run test, one of them would be the spectral tests, etc.
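As one hedged example of what a member of such a suite might look like, here is a sketch of a Wald-Wolfowitz runs test applied to the assignment sequence in enrollment order. To be clear about the assumptions: this is a stand-in I chose for simplicity, not one of the specific tests from Knuth, it uses a normal approximation that needs a reasonably long sequence, and `runs_test_p_value` is a hypothetical name.

```python
from math import erfc, sqrt

def runs_test_p_value(assignments):
    """Two-sided Wald-Wolfowitz runs test (normal approximation).

    assignments: sequence of 0/1 (or False/True) in enrollment order.
    Tests whether the number of runs is consistent with a random
    ordering, conditional on the number of units in each arm.
    """
    seq = [1 if a else 0 for a in assignments]
    n = len(seq)
    n1 = sum(seq)
    n0 = n - n1
    if n1 == 0 or n0 == 0:
        raise ValueError("need at least one unit in each arm")
    # A new run starts wherever adjacent assignments differ.
    runs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    expected = 2 * n1 * n0 / n + 1
    variance = 2 * n1 * n0 * (2 * n1 * n0 - n) / (n ** 2 * (n - 1))
    if variance <= 0:
        raise ValueError("sequence too short for the normal approximation")
    z = (runs - expected) / sqrt(variance)
    return erfc(abs(z) / sqrt(2))  # two-sided normal tail probability

if __name__ == "__main__":
    alternating = [i % 2 for i in range(1000)]  # perfect 50/50, not random
    print(runs_test_p_value(alternating))
```

A strictly alternating assignment passes any sample ratio check with a perfect split, but it has far too many runs, so this test rejects it decisively. That is the kind of engineered-looking pattern a single p-value on the ratio would miss.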

And, of course, look at the data with visual inspection to check for anything that looks suspicious. Sometimes, signs of bad human engineering just jump out at you in ways that can be hard to imagine pre-hoc.

I think we may be getting somewhere and I apologize in advance for any confusion. These are difficult concepts for all of us, and RCT methodology has a lot of nuances. For example, even the statement in the original post that the inclusion of pre-treatment covariates will increase precision is strictly speaking not always correct (see here for an example keeping in mind that logistic and Cox regression models are often used for RCT outcomes: https://www.jstor.org/stable/1403444). Interestingly, it does always lead to an increase in power under the null and this is a great example where frequentist thinking provides additional insight. Also, while I am certainly interested in using Bayesian decision theory in trial designs (see for example here: https://onlinelibrary.wiley.com/doi/full/10.1002/sim.9120) it is better for the purposes of this discussion to keep things simple.

To our point now: if our goal is to audit the randomization of a trial using hypothesis testing (any test hypothesis, not just null) then what we need to do is to condition on the randomization scheme that was supposedly used in the protocol. That randomization scheme may include stratification, e.g., by gender. In that case even a small imbalance in gender could alert us that something is wrong. This imbalance may be indirectly detected using a null hypothesis test that assumes gender to be balanced between groups.

Now, part of the confusion likely stems from the fact that even though randomization will generate group imbalances on measured and unmeasured covariates, it is true that these imbalances are not expected to be large. Such large imbalances (e.g., if one group has 100 women and the other zero) do not generally require null hypothesis testing to be detected. At the same time, because at least some imbalance is expected we would need to see very small p-values (much smaller than 0.05) to be alerted that something may be off. From a Bayesian perspective, this is one of these scenarios where one should put a lump of probability on the null being true. To emphasize again: this is the kind of scenario where looking at p-values for balance in measured covariates in Table 1 of an RCT is unnecessary and, overall, reporting these p-values will most likely confuse rather than enlighten whether the reader is a clinician, statistician, epidemiologist, economist etc.

“To emphasize again: this is the kind of scenario where looking at p-values for balance in measured covariates in Table 1 of an RCT is unnecessary and, overall, reporting these p-values will most likely confuse rather than enlighten whether the reader is a clinician, statistician, epidemiologist, economist etc.”

I agree that having them in Table 1 is typically not the best use of space in a short paper in NEJM, Science, etc. — hence the title of the post. But what if the p-value is 1e-6?

Perhaps a source of our divergence is that much of my experience is working with comparatively large experiments. So often problems with randomization and data processing result in only small relative differences in covariates, but nonetheless something is very, very wrong and detectably so with a tiny, tiny p-value.

On the issue of stratification, etc.: I agree one should analyze as you (intended to) randomize, incorporating that into any tests, including balance tests (#2 in the post).

Yup. I think we are in alignment. I agree that such a tiny p-value (extremely rare in clinical RCTs) is worth looking into.

The age-old debate

DOI: 10.1214/12-AOS1008

I’ve worked as a data scientist at two large tech companies. From a practical perspective we always need to do checks and it would be crazy to not do so. Mistakes just happen, especially when internal experimentation platforms and codebases can be complicated. Experiments have been messed up in extremely subtle ways due to things like engineering errors, data logging errors (not really a randomization issue), a small proportion of users being dropped from one treatment arm, etc.

I think there are some good suggestions for doing balance tests correctly but my concern, similar to other comments, is that a test is a poor way of getting at the issue of whether the randomization was effective. Why not just recommend a reasonable description of the author’s genuine interpretation of Table 1? Then we know their interpretation, which a bunch of t-tests just hints at. And we can be free to agree or not.

And the recommendation really has some issues, such as focusing on the p-value magnitude without a cutoff. That’s not a hypothesis test under any notion of it I know of. But it is a step toward what I’m suggesting, except that it would probably be better to just focus on the magnitude and variability of the covariates and how they may impact the primary manipulation.

The idea that we shouldn’t present important/relevant information because someone might misunderstand it has to be the worst reason for not learning and/or presenting information that I’ve ever heard.

Jim:

I think the open question here is whether the information is actually important or relevant. After all, there are tons of data summaries that could be reported, and readers have finite attention.

It depends on whether you’re writing for posterity or for a particular audience, I think. Leaving out relevant information to an audience of six-year-olds may well be necessary to avoid sidetracking understanding altogether. If you grant that premise, you then have to decide how close journal (or expert witness report) readers are to six-year-olds. Sometimes, it’s kinda close.

Thanks for the post Dean.

I think it’s quite reasonable. If I am reading correctly you sound as if you are defending standard practice but in practice your proposal agrees with Senn.

I wonder though if your point would go over more easily if you didn’t call these balance tests, but rather “procedure tests” or similar.

After all, you are making this argument not to test for balance but to check on procedure. (Indeed, high levels of balance could tip you off to an implementation problem: if you prescribe simple random assignment to kids in a school and find exactly half the students are in the treatment group in every class, this might tip you off that a different procedure was implemented.)
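A back-of-the-envelope version of the school example, as a sketch: under simple (Bernoulli) random assignment with probability 0.5, the chance that every class splits exactly in half is the product of the per-class binomial probabilities. The class sizes below are made up for illustration, and `prob_exact_half_split` is a hypothetical helper.

```python
from math import comb

def prob_exact_half_split(class_sizes, q=0.5):
    """Probability that every class has exactly half its students treated,
    if each student is assigned independently with probability q."""
    prob = 1.0
    for n in class_sizes:
        if n % 2:  # an odd-sized class cannot split exactly in half
            return 0.0
        k = n // 2
        prob *= comb(n, k) * q ** k * (1 - q) ** (n - k)
    return prob

if __name__ == "__main__":
    # Ten hypothetical classes of 30 students each.
    print(prob_exact_half_split([30] * 10))
```

For ten classes of 30 this comes out on the order of 1e-9, so seeing it would be strong evidence that something like blocked assignment within class was implemented instead of the prescribed procedure.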

Regarding costs: in practice (though not in your proposal), rather than providing a single number, researchers provide a table of tests on each covariate. This has twin problems: (1) these tests are taking up space that could be used for showing the actual imbalance, or at least are presented in a way that focuses on the test statistics rather than substantive differences, and (2) “passing” the tests is taken as evidence of balance, but of course you can pass even with substantively large imbalance if your study is small.

So an addendum to your proposal is to change the name and request some summaries of actual imbalance.

[Aside, you might enjoy this piece by Hennessy et al if you haven’t seen it: https://www.degruyter.com/document/doi/10.1515/jci-2015-0018/html#j_jci-2015-0018_ref_002_w2aab3b7e1011b1b6b1ab2b2b2Aa ]

Indeed. Covariate balance isn’t special here. In the absence of traditional covariates, you can still test things like fraction assigned to treatment and time of enrollment/exposure.

Yes, I agree it makes sense to emphasize a joint test — at least if the set of covariates is preregistered or not so large. Otherwise, one might worry that a joint test might reflect diluting a discovered imbalance. (For example, adding 99 noise covariates can take you from p = 1e-6 to p > .1.)
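The dilution arithmetic can be sketched under some simplifying assumptions of mine: the covariates are independent, the joint test statistic is a sum of per-covariate chi-square(1) statistics, and each noise covariate contributes its expected value of 1. The helper names are hypothetical.

```python
from math import erfc, exp, sqrt

def chi2_sf_1df(x):
    """Upper tail probability of a chi-square with 1 degree of freedom."""
    return erfc(sqrt(x / 2))

def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-square with even df (closed form)."""
    if df % 2 or df <= 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k  # builds (x/2)^k / k! iteratively
        total += term
    return exp(-half) * total

if __name__ == "__main__":
    x_single = 23.93                 # chi-square(1) statistic with p near 1e-6
    x_joint = x_single + 99 * 1.0    # plus 99 noise covariates at their mean
    print("single covariate p:", chi2_sf_1df(x_single))
    print("joint test p:      ", chi2_sf_even_df(x_joint, 100))
```

With the noise covariates at their expected contribution, the joint p-value lands between 0.05 and 0.10. The summed noise has a standard deviation of about 14, so draws where the joint p-value exceeds 0.1 should be quite common.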