
N=1 experiments and multilevel models

N=1 experiments are the hot new thing. Here are some things to read:

Design and Implementation of N-of-1 Trials: A User’s Guide, edited by Richard Kravitz and Naihua Duan for the Agency for Healthcare Research and Quality, U.S. Department of Health and Human Services (2014).

Single-patient (n-of-1) trials: a pragmatic clinical decision methodology for patient-centered comparative effectiveness research, by Naihua Duan, Richard Kravitz, and Chris Schmid, for the Journal of Clinical Epidemiology (2013).

And a particular example:

The PREEMPT study – evaluating smartphone-assisted n-of-1 trials in patients with chronic pain: study protocol for a randomized controlled trial by Colin Barr et al., for Trials (2015), which begins:

Chronic pain is prevalent, costly, and clinically vexatious. Clinicians typically use a trial-and-error approach to treatment selection. Repeated crossover trials in a single patient (n-of-1 trials) may provide greater therapeutic precision. N-of-1 trials are the most direct way to estimate individual treatment effects and are useful in comparing the effectiveness and toxicity of different analgesic regimens.

This can also be framed as a problem of hierarchical modeling when the number of groups is 1 or 2. The issue that comes up is that once you go beyond N=1, you’re suddenly allowing more variation. One way to handle this is to include the between-person variance component even for an N=1 study. It’s just necessary to specify the between-person variance a priori, but that’s better than just setting it to 0. Similarly, once we have N=2 we can fit a hierarchical model, but we’ll need strong prior info on the between-person variance parameter.
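To make the cost of setting the between-person variance to 0 concrete, here is a minimal sketch under assumptions that are not in the post: a normal-normal model in which each person's true effect is drawn from N(mu, tau^2), the single N=1 trial yields an estimate with known standard error, and mu has a flat prior. The function names and the example numbers are hypothetical. Under those assumptions, the uncertainty about the population mean has variance se^2 + tau^2, and the uncertainty about a new person's effect has variance se^2 + 2*tau^2.

```python
import math


def posterior_sd_population_mean(se_within: float, tau_between: float) -> float:
    """SD of the population mean effect mu given one person's estimate.

    Assumed model: theta_j ~ N(mu, tau^2), y ~ N(theta_1, se^2), flat prior on mu.
    """
    return math.sqrt(se_within**2 + tau_between**2)


def predictive_sd_new_person(se_within: float, tau_between: float) -> float:
    """SD of the predicted effect for a new person under the same assumed model.

    The new person's effect adds one more draw of between-person variation.
    """
    return math.sqrt(se_within**2 + 2.0 * tau_between**2)


if __name__ == "__main__":
    se = 1.0  # hypothetical standard error from the N=1 trial
    for tau in (0.0, 0.5, 1.0, 2.0):  # hypothetical a priori values of tau
        print(
            f"tau={tau:.1f}  "
            f"sd(mu)={posterior_sd_population_mean(se, tau):.3f}  "
            f"sd(new person)={predictive_sd_new_person(se, tau):.3f}"
        )
```

With tau fixed at 0, the prediction for a new person is exactly as precise as the estimate for the person who was studied, which is the overconfidence that specifying tau a priori is meant to avoid.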

This relates to some recent work of ours in pharmacology. In this case, the problem is not N=1 patient but N=1 study. It also connects to a couple of discussions we’ve had on this blog regarding the use of multilevel models to extrapolate to new scenarios; see here and here and here from 2012. We used to think of multilevel models as requiring 3 or more groups, but that’s not so at all; it’s just that when you have fewer groups, there’s more to be gained by including prior information on group-level variance parameters.


  1. Clyde Schechter says:

    What am I missing here?

    The purpose of an N-of-1 trial is to identify the most effective treatment for the 1 patient in the trial. There is no attempt to generalize the results to any other patient. If you omit the person-level variance component, you are simply absorbing its value for this particular patient into the constant term of the model. If you include a person-level variance component with a prior distribution you are giving yourself the apparent ability to infer what might happen if other patients were in the trial–but that is beside the point of the trial.

    What am I missing here?

    • Nate says:

      If you are interested in inferring whether N-of-1 trials are more efficacious than the current standard of care, then it would make sense to include variance components to model how well the treatment is working across individuals.

  2. Is this satire? Who knew that whenever I do something and see what its effect is, I’m running a sophisticated “N of 1 trial”!

    • Andrew says:


      I never said “sophisticated”; that came from you. Beyond this, yes, if you do something on one person and estimate its effect, you’re doing an N=1 trial. If you do something on two people and estimate its effect, you’re doing an N=2 trial. That doesn’t mean it’s a good N=1 or N=2 trial, just that considering the study as an N=1 or N=2 trial can be a helpful framing of the problem, especially if you’re interested in generalizing to others in the general population.

  3. Dale Lehman says:

    I share Clyde’s confusion and also appreciate the link Roger provides. N of 1 trials are a strange sort of thing – from my brief reading of the guide linked, the potential uses are quite limited – though it is a promising area where circumstances are appropriate. The discussion of how it compares with and differs from randomized controlled studies and clinical practices is quite interesting. I would characterize a poorly designed N of 1 study as akin to clinical practice, where an individual is treated but the study is not designed for meaningful statistical analysis and the measurements are inadequate for such analysis. On the other hand, N of 1 studies are amenable to analysis, but are fraught with the limitations imposed by conducting repeated comparison treatments on a single individual (where myriad things, often unmeasured, are changing during the treatment). I’d love to see a systematic and concise analysis that shows what type of issues can be effectively studied by N of 1 trials and which cannot. My initial feeling is that the scope for such studies is very limited and subject to being overhyped (the “personalized medicine” angle reinforces my suspicions).

    What I learned from my brief reading of the guide is that there is indeed a place for N of 1 trials and that they are amenable to statistical analysis that makes them different than RCT or clinical practice. What I don’t have a feel for is the extent to which the scope for such studies is large or small. I do worry that we may see a swarm of power pose N of 1 studies – eminently publishable since it has this new headline grabbing jargon. It would be nice to have a sense of just how applicable the technique is to real world situations.

    • Roy T says:


      In my opinion, the ‘N of 1’ name is a misnomer. In rare diseases research, ‘N of 1 trials’ usually consist of more than 1 subject and inference is pooled over the small number of subjects. A better name might be repeated crossover, where each subject receives the treatments multiple times in some random order. The limitations are similar to crossover trials – you need to have a stable, chronic disease which returns to baseline during the washout periods. A limitation of N of 1 trials for even these diseases is that patients often drop out when they notice something is ‘working’, especially if the test drug is on the market.

      • >you need to have a stable, chronic disease which returns to baseline during the washout periods

        This is more a limitation of the data analysis method than of the general idea. Of course it’s possible to have models where you account for the effect of earlier treatments, but it requires more from the model.

  4. Ethan Bolker says:

    Sadly, my old dog is dying of leukemia/lymphoma.

    Some chemotherapy is giving him a few more weeks (maybe months) of a good life.

    The oncologist finds his response to the protocol very interesting and not quite as expected. Better in some clinical ways, puzzling when she looks at the blood chemistry. She modifies the protocol from week to week depending on what she sees (physical exam and lab results).

    I’ve joked with her that she may get a paper out of this, as well as helping Pippin in his last days. Now I can tell her she’s doing an N-of-1 study.

  5. Garnett says:

    The first link appears to be DOA. Bummer!

  6. Robert Grant says:

    I worked on NICE guidelines – or as your Sarah Palin put it, the Death Panel – from 2000 to 2006 and N-of-1 was pretty much done and dusted by then. I’m not aware of any new advances in conduct and tend to agree with Dale Lehman about the difficulties they face. Yet, they could be useful. The notion of RCT as “gold standard” is widespread, and not without reason, but it is a blinkered view of a world with immutable and universal true effects of treatments. It would be amusing to describe it as a meta-gold-standard if it didn’t have such serious impacts on people; it’s a very medical (doctors) view – when I taught nurses and therapists they all wanted to do Masters projects like this, very pragmatic and patient-centered. I recall discreet conversations with pharma company researchers who said they would never be ordered to do an N-of-1 because finding a group of people or circumstances where treatment X is particularly effective could only reduce the marketing authorisation and hence the revenue stream. I think the real methodological issue is mashing the quant data together with diary data (tricky) and qualitative debriefing (ouch).

    The hostile reception that this topic got from some commenters surprised me. I can only guess their confidence to comment publicly is unconstrained by their limited knowledge of drug trial methodology. It would be interesting to hear what Stephen “secret Bayesian” Senn thinks.

    • Martha (Smith) says:

      “I recall discreet conversations with pharma company researchers who said they would never be ordered to do an N-of-1 because finding a group of people or circumstances where treatment X is particularly effective could only reduce the marketing authorisation and hence the revenue stream.”


    • Alex Gamma says:

      can you help me with grant money?

    • Keith O'Rourke says:


      I know Stephen agrees with this point, “once we have N[studies]=2 we can fit a hierarchical model but we’ll need strong prior info on the between-study variance parameter”, as I ran that by him when I was putting this post together, in particular this point: “it’s the banning of informative priors altogether – forcing there to be a discrete decision to either completely ignore or fully incorporate very noisy between study variation”.

      As for patients instead of studies, I also think he would agree but I should re-read his work on this e.g. Understanding Variation in Sets of N-of-1 Trials

      I do think there has been a blind spot to the opportunities of learning from multiple N of 1 trials that might be overcome with secret Bayesian thinking. To actually get anywhere in particular application areas (in places I have been waiting 20 years), anti-Bayesian sentiments have to die out.

  7. Thanatos Savehn says:

    Apologies for this ham-handed attempt to hijack the thread for my own purposes but there’s something I’d very much like to understand. Earlier today Harry Crane on Twitter posted Andrew’s recent discussion at Rutgers:

    At about 36:40 Andrew starts to explain why the estimate of the interaction has twice the standard error of the main effect. Yes, I was one of the virtual n00bs who didn’t understand how/why sigma/sqrt(N) explains this. Many thanks in advance for any links.

    • Andrew says:


      Consider a simple study with N/2 people in the treatment group and N/2 in the control, for simplicity a binary outcome with probabilities near 0.5. Then the estimated treatment effect is y_bar_1 - y_bar_2, and its standard error is approximately sqrt(0.5^2/(N/2) + 0.5^2/(N/2)) = sqrt(0.5^2 * 4/N) = 1/sqrt(N).

      Now suppose you’re studying an interaction, comparing two groups which are evenly split in the sample. The estimated interaction is (y_bar_1a - y_bar_1b) - (y_bar_2a - y_bar_2b), and its standard error is approximately sqrt(0.5^2/(N/4) + 0.5^2/(N/4) + 0.5^2/(N/4) + 0.5^2/(N/4)) = sqrt(0.5^2 * 16/N) = 2/sqrt(N).

      The estimate of the interaction has twice the standard error of the estimate of the main effect.
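The arithmetic in the reply above can be checked numerically. The following is a small sketch, not part of the original comment: the function names and the sample size of 400 are made up for illustration, and the per-observation standard deviation of 0.5 is the one assumed in the reply (binary outcome with probability near 0.5).

```python
import math


def se_main_effect(n: int, sd: float = 0.5) -> float:
    """SE of a difference between two group means, each group of size n/2."""
    per_group = n / 2
    return math.sqrt(2 * sd**2 / per_group)


def se_interaction(n: int, sd: float = 0.5) -> float:
    """SE of a difference of differences across four cells, each of size n/4."""
    per_cell = n / 4
    return math.sqrt(4 * sd**2 / per_cell)


if __name__ == "__main__":
    n = 400  # hypothetical total sample size
    main = se_main_effect(n)
    inter = se_interaction(n)
    print(f"main effect SE:  {main:.4f}  (1/sqrt(N) = {1 / math.sqrt(n):.4f})")
    print(f"interaction SE:  {inter:.4f}  (2/sqrt(N) = {2 / math.sqrt(n):.4f})")
    print(f"ratio:           {inter / main:.2f}")
```

Because the standard error scales as 1/sqrt(N), matching the main effect's precision for the interaction would take four times the sample size under these assumptions.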

  8. Thanatos Savehn says:

    Busily digesting. Many thanks!

  9. Eric J. Daza says:

    Thanks for bringing this to your readers’ attention! It’s becoming increasingly relevant with the rise in analyses of wearable and mobile device data for self-tracking health outcomes and their possible causes. To help encourage the causal inference discussion around both n-of-1 trials and observational studies, I recently wrote a paper called “Causal Analysis of Self-tracked Time Series Data Using a Counterfactual Framework for N-of-1 Trials”, found here: I’m currently working on a follow-up paper that draws an analogy between longitudinal and n-of-1 studies called “Person as Population: A Longitudinal View of Single-Subject Causal Inference for Analyzing Self-Tracked Health Data”, found here:

    Eric Jay Daza
