Breaking the dataset into little pieces and putting it back together again

Alex Konkel writes:

I was a little surprised that your blog post on the question of three smaller studies versus one larger study received so many comments, and also that so many people seemed to come down on the side of three smaller studies. I understand that Stephen’s framing led to some confusion as well as practical concerns, but I thought the intent of the question was pretty straightforward.

At the risk of beating a dead horse, I wanted to try asking the question a different way: if you conducted a study (or your readers, if you want to put this on the blog), would you ever divide up the data into smaller chunks to see if a particular result appeared in each subset? Ignoring cases where you might want to examine qualitatively different groups, of course; would you ever try to make fundamentally homogeneous/equivalent subsets? Would you ever advise that someone else do so?

For those caught up in the details, assume an extremely simple design. A simple comparison of two groups ending in a (Bayesian) t-test with no covariates, nothing fancy. In a very short time period you collected 450 people in each group using exactly the same procedure for each one; there is zero reason to believe that the data were affected by anything other than your group assignment. Would you forego analyzing the entire sample and instead break them into three random chunks?

My personal experience is that empirically speaking, no one does this. Except for cases where people are interested in avoiding model overfitting and so use some kind of cross validation or training set vs testing set paradigm, I have never seen someone break their data into small groups to increase the amount of information or strengthen their conclusions. The blog comments, however, seem to come down on the side of this being a good practice. Are you (or your readers) going to start doing this?

My reply:

From a Bayesian standpoint, the result is the same, whether you consider all the data at once, or stir in the data one-third at a time. The problem would come if you make intermediate decisions that involve throwing away information, for example if you take parts of the data and just describe them as statistically significant or not.
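
As a minimal sketch of this point (assuming a conjugate normal model with known sampling variance and made-up numbers; nothing here comes from the original question), updating on all the data at once and updating on three chunks in sequence give the same posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data standing in for the 450-per-group comparison: one "difference"
# observation per pair. The effect size and noise scale are assumptions.
y = rng.normal(loc=0.2, scale=1.0, size=450)

sigma2 = 1.0                        # known sampling variance (model assumption)
prior_mu, prior_var = 0.0, 100.0    # weak normal prior on the effect

def normal_update(mu0, var0, data, sigma2):
    """Conjugate normal-normal update; returns posterior mean and variance."""
    post_var = 1.0 / (1.0 / var0 + len(data) / sigma2)
    post_mu = post_var * (mu0 / var0 + data.sum() / sigma2)
    return post_mu, post_var

# All the data at once
mu_all, var_all = normal_update(prior_mu, prior_var, y, sigma2)

# One third at a time, feeding each posterior back in as the next prior
mu_seq, var_seq = prior_mu, prior_var
for chunk in np.array_split(y, 3):
    mu_seq, var_seq = normal_update(mu_seq, var_seq, chunk, sigma2)

print(mu_all, var_all)   # identical (up to floating point) ...
print(mu_seq, var_seq)   # ... to the result of stirring the data in by thirds
```

The equality holds for any partition of the data. What breaks it is exactly what the reply warns about: collapsing each chunk to a significant / not-significant verdict before combining, which throws away information.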

15 thoughts on “Breaking the dataset into little pieces and putting it back together again”

  1. If your different data points were not actually from the same distribution, though, then you might legitimately get different results. Just a day or so ago, there was an account of research on the effects of bread consumption in which it appears that different subjects experience different results depending on the makeup of their gut bacteria. When lumped together, however, the average effects were the same.
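
A minimal sketch of the lumping point in the comment above, with invented subgroup sizes and effects (nothing here comes from the actual bread study): two subgroups with opposite responses can average out to essentially no effect overall.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two latent subgroups (say, different gut-bacteria profiles) with opposite
# responses to the same treatment; sizes and effects are invented for illustration.
responders      = rng.normal(loc=+0.5, scale=0.3, size=225)
anti_responders = rng.normal(loc=-0.5, scale=0.3, size=225)

pooled = np.concatenate([responders, anti_responders])
print(responders.mean(), anti_responders.mean())  # roughly +0.5 and -0.5
print(pooled.mean())  # roughly 0: the lumped average hides both effects
```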

  2. In the previous commentary the question was which one should you believe, the study where things were broken into groups and *each group* got a statistically significant result, or the study where there was one big group and the result was statistically significant. Note that it’s totally possible to break things up into groups and then *not* get statistically significant results in each one.

    From a Bayesian perspective, as Andrew says, if you feed things in one data point at a time, or in three groups, or whatever, you’ll get the same final result. So as long as you’re doing a Bayesian analysis, you shouldn’t break things up into three groups unless you need to for computing / efficiency reasons.

      • Sure, if you have a well-defined subgroup. But I think the point of this question is to eliminate any obvious reasons why you’d want to break it down other than pure “statistical power”. For example, you randomly select addresses in two cities, and then you knock on the door and ask how many people live in the house and what is the square footage, and *nothing else*. Later you have two datasets, one from city A and one from city B, and you want to see if there are detectable differences in housing density. In the absence of any information about how to break things down (such as, say, income, or race, or immigrant status, or employment, or whatever) you can’t gain information by breaking the sample up artificially in a Bayesian analysis. That’s actually one of the good things about a Bayesian analysis: the final result is independent of the order in which you add subsets of the data to the analysis.

        As soon as you have additional observed data, or even a model based on some other data set that allows you to impute covariates, or anything else, then you can potentially do more.

      • I see, though, that part of what you’re discussing in your paper is robustness: whether adding in one “outlier” might be having a big effect on the inference, for example. In some sense that’s like model checking, which is slightly different from the question posed here. And I agree with you that it’s a useful thing to do, but it’s sort of “meta” to the data analysis; now you’re doing “analysis analysis” ;-)
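
One simple version of that kind of “analysis analysis” is a leave-one-out influence check: refit with each observation dropped and see how far the estimate moves. A minimal sketch with invented data and a deliberately planted outlier:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=0.2, scale=1.0, size=50)
y[0] = 8.0                      # plant one gross outlier for illustration

full_mean = y.mean()
# Leave-one-out estimates: how much does each single point move the answer?
loo_means = np.array([np.delete(y, i).mean() for i in range(len(y))])
influence = full_mean - loo_means

worst = int(np.argmax(np.abs(influence)))
print(full_mean)
print(worst, influence[worst])  # the planted outlier dominates the influence
```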

  3. “I have never seen someone break their data into small groups to increase the amount of information or strengthen their conclusions.”

    I guess that sub-sample bootstrap and permutation techniques actually do this (and sometimes for good cause).

    That said – I think the gist here is more along the lines of “is there ever a reason to do an internal secret-weapon analysis?” And in that sense, I can see why people would find that appealing – I mean, the secret weapon is great! But given that the argument here is that the data all come from the exact same DGP, it doesn’t really make any sense. The beauty of the secret weapon is to show that different conditions (different locations, kinds of people, periods of time) produce the same/similar/different results, not to learn about the sampling distribution of the estimate from one single dataset (under the same or different N). That is just what the bootstrap does (sub-sampled without replacement, or the usual N-sampling with replacement, either one).

    So as Daniel noted above, there is a big difference between breaking up data in a priori meaningful ways (secret weapon – good!) and just breaking it up to break it up (something like looking at the sampling distribution under a smaller sample – why bother?).
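
A minimal sketch of the contrast drawn in the comment above, with invented outcomes for the two groups: the bootstrap resamples the full sample to learn about the sampling distribution of the estimate, while cutting the same data into three arbitrary chunks just produces three noisier copies of that estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
treat   = rng.normal(loc=0.3, scale=1.0, size=450)   # invented treatment outcomes
control = rng.normal(loc=0.0, scale=1.0, size=450)   # invented control outcomes

diff_full = treat.mean() - control.mean()

# Bootstrap: resample the full groups with replacement to see how the
# estimated difference would vary across samples of the same size.
boot = np.array([
    rng.choice(treat, size=450, replace=True).mean()
    - rng.choice(control, size=450, replace=True).mean()
    for _ in range(2000)
])

# Splitting the same data into three arbitrary chunks just gives three
# noisier estimates of the same quantity.
chunk_diffs = [t.mean() - c.mean()
               for t, c in zip(np.array_split(treat, 3), np.array_split(control, 3))]

print(diff_full, boot.std())   # full-sample estimate and its bootstrap standard error
print(chunk_diffs)
```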

  4. I like to see an independent replication. A local institution has been studying a particular drug (already FDA approved, but they are interested in broader applications) through a succession of funded clinical studies over the past 20-30 years, and they like to lump the data from each subsequent trial together with existing data when they analyze the results. I am skeptical of this approach, because if there were an early bias in the data, subsequent additions would have the effect of gradually watering it down, rather than outright disagreeing with it. Further, I KNOW that evolving standard of care and patient demographics have an effect, and simple analyses which ignore these may be subject to bias. Most recently, I recommended that they take the most recent batch of data and attempt an independent replication — I would find this much more persuasive than simply shrinking their standard error a hair by adding the data to the pool.

    • I agree with Martha that what you describe sounds perfectly reasonable, but it is also not what I describe in the question. You’re assuming at least the possibility of changes in the data over time on top of whatever the trial comparison is (I assume drug versus placebo, or a different drug). Breaking the data up into different time periods or performing a time series analysis makes sense. But my question, applied to your situation, would be: assume you are 100% confident there were no time series effects (including different levels of bias in different trials). Would you still see value in separating the studies?

      • Alex:

        I’m not convinced that Clark is “assuming at least the possibility of changes in the data over time on top of whatever the trial comparison is” (although this seems to be the substance of his sentence starting, “Further …”, and is one thing that needs to be considered in most real-life situations).

        You said in your original post, “In a very short time period you collected 450 people in each group using exactly the same procedure for each one; there is zero reason to believe that the data were affected by anything other than your group assignment,” and now you say, “But my question applied to your situation would be, assume you are 100% confident there were no time series effects (including different levels of bias in different trials).” These to me are quite different things, because group assignment (even if it is random) can produce different calculated (e.g., average or ratio) effect sizes in different groups, simply because random assignment does not assure “no difference” between the resulting groups.

        (By the way, I am not arguing for breaking up a large data set to analyze subsets, except possibly as a way to investigate how different random assignments might affect calculated results.)

    • Remove any temporal issues. Let’s say that the institution has the capacity to recruit 450 people on one day, randomize 1:1, do a short study, and get the results collected for the main endpoint. There is one principal investigator. If we change the design so that the 450 are assigned to 3 studies with the same principal investigator, what changes is not the results for the endpoint but the clinical disposition. For example, if 20 people cannot be measured for the endpoint in a study of 450 people, it is very different from those 20 people being split over three studies from a clinical point of view, even if the split adheres to an expected multinomial distribution. People go to war over disposition tables. Also, declaring 3 studies of 150 allows institutions without the resources to recruit 450 patients to replicate the study at n=150.
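
A minimal sketch of the disposition point above, taking only the 20 unmeasured patients and the 3 × 150 split from the comment and assuming each dropout is equally likely to land in any of the three studies:

```python
import numpy as np

rng = np.random.default_rng(4)

# Allocate 20 patients without an endpoint across three studies of 150 each,
# assuming dropouts are equally likely to fall in any study.
splits = rng.multinomial(20, [1/3, 1/3, 1/3], size=5)
print(splits)
# Each row is one possible allocation; even though the expected count is about
# 6.7 per study, individual disposition tables can look quite uneven.
```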

  5. “For those caught up in the details, assume an extremely simple design. A simple comparison of two groups ending in a (Bayesian) t-test with no covariates, nothing fancy. In a very short time period you collected 450 people in each group using exactly the same procedure for each one; there is zero reason to believe that the data were affected by anything other than your group assignment. Would you forego analyzing the entire sample and instead break them into three random chunks?

    “My personal experience is that empirically speaking, no one does this. Except for cases where people are interested in avoiding model overfitting and so use some kind of cross validation or training set vs testing set paradigm, I have never seen someone break their data into small groups to increase the amount of information or strengthen their conclusions.”

    This type of splitting is common; they call it getting rid of “outliers”. You can read almost any biomed paper and see the sample sizes fluctuate from figure to figure even though the various measurements are all subsets from the same group of animals, etc. There could be valid reasons for this, in which case the signal-to-noise ratio will be increased.

    Also, I’ll include my standard statement that people should stop doing these types of NHST studies that are designed to “check for a difference”. It is much more rational to assume that everything is correlated with everything else in some way. Measuring the size of a difference is an improvement, but it is often not too interesting since it conveys so little information. How often do you see a paper that comes up with a model to explain why the effect size should have the value it does? This is usually pretty much impossible; you need a time series, dose response, etc., to start thinking about what types of processes could really have generated that data.

  6. I’m glad you posted on this again, because I’ve been mulling over why I am one of those people who puzzle Alex–I see results from three small studies as more convincing than one big one, and want to talk more about why I see the question posed as being rather unhelpful.

    The continued refinements of the question have resulted in it being quite artificial: if you have NO reason to split a sample into multiple parts, would you find it more convincing if the results pan out in all parts? Stated this way, I find the question fairly easy to answer (No). But it’s also a question that I have never had occasion to answer in all my years writing, reviewing and editing experimental and other empirical studies. Moreover, this refined question is definitely not the same to me as “If you had three small preregistered and identical experiments that all yielded the predicted results, would that be more convincing than just one big one?”

    We ALWAYS have good reason to treat separately-run experiments separately, because they are never truly identical. No matter how hard we try, every experiment is a little different, as you can see from the fascinating anthropological study of infant behavior research, The Baby Factory. No one can write down every experimental choice, with subject pools, context and execution always varying in ways that are concealed from everyone but the authors, who might not even be aware of all of the tiny variations. Maybe summer is different from fall, Mondays are different from Tuesdays, a tall administrator is different from a short one, a first-year doctoral student codes observations differently from a second-year (and maybe the authors learned how to do things a little better in the second and third runs, or at least they thought they did, so their execution is different if not better).

    So while the question as it was ultimately refined might be a nice intuition-builder and a good argument for Bayesian thinking, the answer to the original question still seems to be “yes, I would trust a result more if it showed up in 3 small very-similar-but-surely-not-identical preregistered studies than one big one–because it shows me that all the minor variations that might not be captured in the writeup didn’t drive the results.”
