Using hierarchical modeling to get more stable rankings of gene expression

Posted on July 30, 2025 9:41 AM by Andrew

Will Macnair writes:

I am a computational biologist (with a maths/stats background) working with genomic data. One task that comes up often is estimating changes in expression level between two conditions, for many thousands of genes, then ranking the genes in order of some measure of “confidence.”

My question is: In a Bayesian regression setting, what is a good way to rank the genes? If I calculate Bayes factors for “full model” vs “covariates-only model” and use these to rank the genes, is this ok? Are there better options?

To give a bit more detail, this task is typically referred to as “differential expression”, and we usually do this for all genes that pass some filter on very low / very boring expression. In the frequentist setting, the ranking is done with an adjusted p-value, and there are various popular methods for doing this (e.g. edgeR and DESeq2). For some recent analysis I have been using rstanarm, and I’ve been wondering what would be the best way to achieve a confidence-type ranking in this setting. The reason we do ranking is that we might do follow-up experiments for some genes, however the experiments are a lot of work. So we would like to prioritise genes where we have high confidence that something interesting is going on.

Some other thoughts:

– I searched previous blog entries and found this one. There is some nice discussion in the comments, in particular the thread started by Bob Carpenter where he talks about framing it as ranking. If I’ve missed something more recent, please point me to that!

– There are also some comments from Daniel Lakeland on how there has been a lot of work on the FDR approach, despite it being a bad framing of the problem (i.e. “effect yes/no” rather than “ranking under cost constraints”). The blog was posted in 2016, and the approach taken in the field has not really changed much since then, so your input would be valuable.

– In the blog post, and also in BDA3, you say that you don’t really like Bayes factors, as having a lump of probability at 0 doesn’t really make any sense. I agree with that, however in this case I don’t really want to make statements about the value of the parameter, just how confident we are that something is going on. Could Bayes factors here be ok?

– I wonder if an alternative approach would be to decide on a minimum interesting effect size, then ask which genes have the highest posterior probability of an effect size larger than this.

– I think elsewhere you have suggested addressing multiple testing-type problems by putting everything into one large hierarchical model. I have over 10k genes, and a few hundred samples, so this doesn’t seem practical (at least not for me!).

– In my analysis, I have calculated Bayes factors (for “full model” vs “only covariates”), but also tried ranking by mean(case_vs_control) / sd(case_vs_control). The Bayes factors seem to have some advantages, e.g. they are especially low for genes that are known to have strong sex effects, and sex is one of the covariates.

My reply: I know very little about genomics–ok, actually I know nothing at all about genomics, although I’ve had the occasional conversation of this sort where it seems that I’ve been able to offer helpful advice. So, rather than attempt any sort of solution, I will just share some scattered thoughts:

1. Why do you want to rank the genes? What are you going to do with the ranking?

My general thought is that, even if you’re doing ranking, there are lots of ways to do this, and it’s hard for me to think of a realistic example in which tail-area probabilities are the right way to do the ranking–see discussion here–except in some very simple symmetric problems where all analyses lead to the same result.

2. My problem with the Bayes factor is not so much with the lump of probability at zero, but rather with the dependence on the prior distribution for the unconstrained parameters in the model. Take K well-identified parameters on unit scale and change their weak prior from independent normal(0,100) to independent normal(0,1000), and you’ve decreased the model’s Bayes factor by a factor of 10^K without changing the inferences conditional on the model. We discuss this sort of thing further in Chapter 7 of BDA3 (chapter 6 of the earlier editions). When comparing models, you can establish conventions with weak priors so as to obtain stable Bayes factors, but that’s what these are–conventions–and there’s reason to expect the resulting Bayes factors to make much sense (that’s the point that I made in my 1995 article with Rubin on the topic), so you’ll just have to evaluate these as one more possible decision rule with no clearly-defined theoretical basis.

3. You’re right that I have suggested addressing multiple testing-type problems by putting everything into one large hierarchical model! Here’s my paper with Jennifer and Masanao explaining this, published in the Journal of Research on Educational Effectiveness in 2012.

4. You write that you over 10k genes, and a few hundred samples, so this doesn’t seem practical. But it is practical! We routinely fit multilevel regressions in Stan with hundreds of thousands of data points and thousands of predictors. Try it and see how it goes! If it’s slow, then my recommendation is to set the group-level variance parameters to reasonable values, then it’s just a simple regression model with a prior, and you can fit it fast using an optimizer, just a simple mode-finder or you can use ADVI or Pathfinder if you’d like.

5. I can see that ranking by mean(case_vs_control) / sd(case_vs_control) might not work so well because the sd can be noisy. You can probably do better by fitting a hierarchical model to the sd’s.

6. Again speaking generally, you should be able to use simulated-data experimentation to get a sense of how any of these methods is working. Simulated-data experimentation has two advantages here: first, by construction you know the true values of all the parameters so you can directly compare the performance of different methods; second, the very act of setting up the simulation forces you to figure out exactly what you are trying to learn, which brings us back to my item #1 above.

33 thoughts on “Using hierarchical modeling to get more stable rankings of gene expression”

Daniel Lakeland on July 30, 2025 10:15 AM at 10:15 am said:

Agree with Andrew that putting it into one giant model and fitting with an optimizer is very doable. Ive done it a few times for my wife. Though she honestly doesn’t use these differential expression assays for that much. Usually its find a few (4-5) upregulated genes and then use background knowledge to decide which to spend time on. The level of upregulation isn’t necessarily a good measure of importance.

Reply ↓
- Will Macnair on July 31, 2025 6:06 AM at 6:06 am said:
  
  “putting it into one giant model and fitting with an optimizer is very doable” – that’s really good to hear! If you can point to any example code. I could also generate some fake data if that’s easier.
  
  With transcriptomic data, a common domain-specific tweak to the standard negative-binomial model is to include a trended mean-dispersion relationship (I think technically log_mean-log_dispersion). I have had a brief play with this in the past, but didn’t get very far. How tricky is that likely to be in stan?
  
  Yes, I agree with “level of upregulation isn’t necessarily a good measure of importance”. So there might be multiple valid options for specifying a ranking. For the moment I’m focusing on “genes where we are confident that they change”. And as you say, that isn’t enough on its own – you need biological knowledge to select which ones could be worth pursuing further.
  
  One related topic that I expecting to become more and more interesting over the coming years is linking single cell RNAseq data (i.e. measurements of the RNAs in a few thousand individual cells from one sample) with genotype. Genotype is more or less the only way to get causality at the gene level in human tissue.
  
  Reply ↓
  - Daniel Lakeland on July 31, 2025 10:04 AM at 10:04 am said:
    
    My blog had some posts about this but it crashed and I haven’t revived it. No longer a fan of WordPress.
    
    I think for RNAseq data ive done two versions, basically a linear regression where log count is predicted by an overall level, gene name dummy, and treatment dummy… And a version using a Dirichlet / Multinomial distribution where count is predicted from Dirichlet and total reads, and then the posterior Dirichlet for both treatment and control conditions is compared to determine up or down regulation.
    
    The Dirichlet version can be more reasonable when the read depth is smaller I think. That would include when there are a subset of genes youre interested in and youre taking that subset out of the data, so dont control total reads for the subset. For example “all secreted genes” or “all genes in the X pathway” (often filtered by gene ontology info)
    
    Been a while. Can you email me and we can talk about it? My website is under repair but has my work email
    
    https://lakelandappliedsciences.com
    
    Reply ↓
Aki Vehtari on July 30, 2025 11:25 AM at 11:25 am said:

Did I understand correctly that you have, case/control status, some extra predictors like sex, and as a target you have expression measurements for 10k genes?

Reply ↓
- Joe Nadeau on July 30, 2025 12:11 PM at 12:11 pm said:
  
  exactly, that’s the general design for these surveys
  
  Reply ↓
- Will Macnair on July 30, 2025 1:18 PM at 1:18 pm said:
  
  Yes that’s correct, as the other comment wrote. I had a look for an overview, and this could be helpful for the problem setting:
  https://www.huber.embl.de/msmb/08-chap.html
  
  Reply ↓
bjs12 on July 30, 2025 12:18 PM at 12:18 pm said:

Re. 4: I don’t believe his concern is that the model is “large” per se. I believe his concern is that K is much much larger than N.

Reply ↓
- Andrew on July 30, 2025 1:21 PM at 1:21 pm said:
  
  Bjs:
  
  With a hierarchical model, that’s no problem at all.
  
  Reply ↓
- Will Macnair on July 31, 2025 6:09 AM at 6:09 am said:
  
  It was actually around size. My expectation (as someone who likes reading about stan etc but doesn’t have that much practical experience with it) was that 10k+ genes * 500 samples, in a negative binomial model, with a random effect, would be impractical in terms of timings.
  
  It seems like that could be the case for running stan itself, but maybe not for the optimizer option?
  
  Reply ↓
  - Andrew on July 31, 2025 6:47 AM at 6:47 am said:
    
    Will:
    
    I suggest you post on the Stan Forums. It’s possible that your code could be made more efficient, or you could run for fewer iterations. It also often helps to use good starting points rather than just going with the defaults.
    
    Stan on optimize or ADVI settings will be very fast, but full NUTS for a problem of this size should not take too long either. Of course it depends on how fast you want it!
    
    Reply ↓
    - Daniel Lakeland on July 31, 2025 10:16 AM at 10:16 am said:
      
      My experience maybe 8 years ago was that optimization was doable but sampling was hard because the mass matrix estimation and step size tuning took crazy amounts of time. The variation in posterior standard deviation of the coefficients could be multiple orders of magnitude, some coefficients have near zero variation and so the slightest step in their direction produces divergences while others have large variation and need many steps to orbit. Without mass matrix estimates that are good it requires tiny step sizes and then the loose coefficients never get sampled . probably pathfinder would be a good tool to help.
  - bjs12 on July 31, 2025 12:40 PM at 12:40 pm said:
    
    I’m aware that K larger than N is not a problem with a hierarchical model, but I was making a guess that might have been seen as the problem. But, clearly, I was mistaken. My bad…
    
    Reply ↓
Leon Di Stefano on July 30, 2025 6:48 PM at 6:48 pm said:

Check out Matthew Stephens’ False Discovery Rates: A New Deal (I only noticed the pun about 2 years after first reading the paper …). It applies something like Gelman and Carlin’s “sign vs. magnitude” error framework to genomics and considers the “false sign rate” as an analogue of the false discovery rate (among other things).

This stuff is implemented DESeq2 — see e.g. here and here — and I believe analogues also available in limma (DESeq2 shrinks/models both means and variances whereas I believe limma only shrinks/models variances by default).

If you have multidimensional effects of interest then MASHR>/a> (by Urbut from the Stephens group) may be what you’re after.

Reply ↓
- OliP on July 31, 2025 3:51 AM at 3:51 am said:
  
  That is subtle wordplay!
  
  Reply ↓
  - Mattias Forsgren on July 31, 2025 3:58 AM at 3:58 am said:
    
    Could you explain the pun for the benefit of my friend who doesn’t get it?
    
    Reply ↓
    - OliP on July 31, 2025 5:02 AM at 5:02 am said:
      
      False Discovery Rate is FDR, FDR also the initials of an American president famous for his economic new deal :)
- Will Macnair on July 31, 2025 6:28 AM at 6:28 am said:
  
  Thanks for these links, they look great! I’m more familiar with edgeR than DESeq2, so the various shrinkage options are new to me.
  
  I’m wondering how to implement some of these in stan… One easy tweak seems to be to give logfc a t-distributed prior. I’ll have to think a bit more about how to bring the FSR ideas across to the Bayesian setting. Naively the local FSR would be min(P(logfc0)), I think.
  
  Reply ↓
Thomas Lumley on July 30, 2025 6:51 PM at 6:51 pm said:

Just a comment that while it’s completely feasible now to fit a single hierarchical model it was borderline at best 25 years ago when approaches like FDR and SAM for microarray data were developed. Some of the early discussions in what became Bioconductor were about whether it would be feasible to keep all the data in memory as experiments got bigger or whether some sophisticated database technique was needed. There’s also been the change from roughly multiplicative, roughly lognormal noise with microarrays to overdispersed counts with RNA sequencing– I think the case for more sophisticated modelling was always stronger for RNAseq data.

Reply ↓
- Daniel Lakeland on July 30, 2025 8:12 PM at 8:12 pm said:
  
  It’s true that 25 years ago you couldn’t have fit the models we want to fit today. todays whippersnappers don’t know what it was like… My first computer was a TRS 80 Color Computer with I think 128kB of RAM, and an expensive 300 baud modem. That was like 1985, which was around 11 years ago.
  
  Reply ↓
  - Daniel Lakeland on July 30, 2025 8:15 PM at 8:15 pm said:
    
    The expansion pack allowed me to plug in the modem AND the cassette tape player to recover the stored programs after I’d typed them in by hand from the monthly magazine.
    
    Reply ↓
Anoneuoid on July 30, 2025 7:07 PM at 7:07 pm said:

Start by thinking of the ideal situation: there are a few well-defined computational models of how the condition/intervention affects gene expression.

The genes of interest are then those that help best distinguish between the various models. So we want to look for, eg:

1) High-magnitude upregulation according to Model A but very low for Model B (an otherwise surprising observation yields a high posterior for Model A).

2) Every model (so far considered) predicts upregulation but downregulation is observed (low likelihood under all models).

The goal is to identify experimentum crucis -type gene expression patterns (those that yield high posterior for one of the models when observed), and patterns that no existing model explains well. Ie, anomalies that indicate entirely novel models should be sought.

In the real-life case, there will likely not be a quantitative model. But there will still be some kind of subject matter expertise applied based on some loose idea of how things work. Those expectations must be elicited *before* the ranking algorithm is applied, since it is required info for ranking the genes in a rational manner.

The reason something feels wrong is that the problem has been NHST-ified (ie, logically inverted), so the model expectations are getting applied *after* the ranking. This makes no sense, of course, so we need to uninvert first to move forward.

Reply ↓
- Daniel Lakeland on July 30, 2025 7:46 PM at 7:46 pm said:
  
  I think for a lot of these experiments, the goal is just hypothesis generation… Like, when we give this drug, or knock out this gene… what other genes are affected and let’s chase down what they do.
  
  So it’s really just a measurement problem: measure what’s “changed quite a bit” and then decide which of those genes to do more experiments on.
  
  Reply ↓
  - Anoneuoid on July 30, 2025 8:04 PM at 8:04 pm said:
    
    The biologist/expert looks at the list of top ~10 up/down-regulated genes (or maybe sees one or two related ones), then decides “what it makes sense to focus on” given the condition/intervention.
    
    The order is inverted, this knowledge of background and conditions should be applied *in order to rank the genes of interest*.
    
    How are you supposed to rank things by interestingness without knowing what is of interest? It is inefficient nonsense, but that’s fine for hypothesis generation.
    
    In fact, my experience with GWAS data was like reading tea leaves, which is also perfectly fine for hypothesis generation. So are shroom trips, praying, dream interpretation, and so on. Its a form of inspiration.
    
    Reply ↓
  - Anoneuoid on July 31, 2025 10:00 AM at 10:00 am said:
    
    Thinking on this more, what is the evidence ranking genes according to apparent upregulation is superior for hypothesis generation, eg vs applying 10-100 arbitrary maths operations? Personally, I doubt it is much better since the connection between that and what is actually of interest and viable to investigate (antibodies exist, etc) is so dubious.
    
    Before spending resources trying to optimize this step we should first identify it as a bottleneck.
    
    Reply ↓
    - Daniel Lakeland on July 31, 2025 1:19 PM at 1:19 pm said:
      
      I certainly think that differential expression is gonna be important for different phenotypes. But the magnitude isn’t so important. Small changes in key gene expression can produce large phenotypic effects. Things that unleash a cascade of other genes or that regulate hormones that something’s very sensitive to etc can be more important than some gene that happens to get upregulated a lot but the downstream receptor isn’t expressed or etc etc. So the biologists typically grab a list of genes “known to be differentially expressed” (thought to be based on whatever model) and then they look at their biological knowledge of what those genes do, and try to pick a few that they should knock out or dysregulate or something and see the effect.
      
      Compared to “just try one of 20,000 genes at random” it’s definitely way more valuable. But the second step is the important one (ie. the one where you consider “knowing that this is differentially expressed, how much do I think this gene matters?”)
    - Anoneuoid on July 31, 2025 8:47 PM at 8:47 pm said:
      
      Compared to “just try one of 20,000 genes at random” it’s definitely way more valuable.
      
      I don’t think a random order would be just as good, rather a random set of mathematical operations applied to the data vs elaborate schemes to do corrected mean/sd. That way it is still correlated to the magnitude.
      
      Either way, think about the algorithm actually being applied:
      
      1) High up/down-regulation but already studied -> ignore.
      2) Low up/down-regulation and unexpected to be high -> ignore.
      3) High up/down-regulation but unexpected to be high -> ignore.
      
      Basically just look at stuff you expect to upregulated already but with few/no papers about it.
      
      Is there data on the percent of these screening selections that lead to anything? When I was involved no one trusted these GWAS correlations, it was considered more likely than not they do not pan out.
      
      Seems like a lot of effort optimizing a step of a procedure that isn’t going to address the problem anyway (identifying genes of interest).
Steven on July 30, 2025 8:41 PM at 8:41 pm said:

I would love to read more genomic posts. I know there are other blogs but there is something refreshing about outsiders to a field taking their best shot at the fundamental assumptions.

Reply ↓
J. Monod on July 31, 2025 8:22 AM at 8:22 am said:

Gene expression is more complicated than you think. Perhaps more complicated than you can think:

Decoding complexity: tackling the challenge of how many transcription factors regulate a plant gene
https://www.tandfonline.com/doi/full/10.1080/21541264.2025.2521767

Reply ↓
Bob Carpenter on July 31, 2025 10:49 AM at 10:49 am said:

Hi, Will: If you want to send me email ([email protected]), I can share some code to do some of this. I’d also be happy to collaborate on a Bayesian genomics project. I can help scale up Stan code or suggest roll-your own alternatives.

There are already hierarchical models in wide use in genomics. For example, rMATS is a system for inferring splice variants. The statistical computation in that package is a mess, which led them to the ridiculous conclusion that fewer variants from a population with deeper sequencing is better for estimating the population values than more variants with less deep sequencing (basic stats will show you the latter is better if your goal is estimating the population). I rewrote it all as a Bayesian model. It gets the right answer statistically, namely that it’s better to have more samples at lower sequence depth, unlike rMATS. The problem with rMATs is their Laplace approximation for max marginal likelihood and then their likelihood ratio tests, which together give them the wrong results statistically, even for their model.

Negative binomials induce really bad geometry, as does just about anything that gives you multiple ways to explain the same data (i.e., either higher mean or higher dispersion can explain a large value). So I reparameterized in two ways. First, I coded the differential expression with a gamma-Poisson rather than negative binomial, which with 20 sequencing runs over 20K splice variants led to 400K latent parameters. Second, I use an overall mean plus difference parameterization on Gelman’s advice. Stan’s HMC scales well in high dimension, but not so well in bad geometry, so fitting is OK.

For political reasons, I haven’t been able to publish. As an example, my original collaborator’s Ph.D. supervisor literally won’t let her work on anything other than his project and the person whose lab she rotated through is only interested if it produces “breakthrough biology” we can publish in Nature, which is not something I can deliver. I haven’t been able to find anyone who can help me evaluate this and write it up in a way that the biologists would be OK with. I’d be happy to share the code. I even hired an intern specifically to work on this, but the intern showed up and refused to work on the problem!

More recently, I worked on super-scalable penalized maximum likelihood inference for biome problems with some collaborators in optimization. With 150K biomes averaging 3 megabases each and arbitrary amounts of data, it fits in about 20m on a couple decent servers (doesn’t need GPU). It’s like Rob Patro et al.’s salmon, but I figured out how to do all the bias adjustments for things like hexamer and positional bias (just push the uncertainty through Bayes style—not hard), which is why Rob told me that he abandoned the approach. I’ve had the same problem getting a biologist to help evaluate and write up. Even my collaborators on the autism project where I have a ton of gut biome data from Simons Foundation’s Autism Research Initiative (SFARI) and we did publish a paper, but they say they have zero time to do this. Everyone says they know someone, but when I contact them, they’re too busy. If I could come up with half a year funding for a postdoc, they’d loan me one (they really do assign their postdocs work—it’s not at all like CS and stats). It’s amazing to me how constrained biologists are in working with other people. The code we produced was really great, because Robert Gower (an optimization specialist) helped add a mirror descent optimization algorithm, which is about an order of magnitude faster than the L-BFGS default in Stan. And Robert Blackwell (it’s the 3 Roberts project), a high-performace compute specialist, added some high performance sparse matrix computation in C++ and figured out how to scale it out on a (CPU) cluster. What’s cool is that time doesn’t depend on the data size other than the really fast conversion to k-mer counts.

The autism gut biome paper for which I helped with the Stan modeling is Multi-level analysis of the gut–brain axis shows autism spectrum disorder-associated molecular and microbial profiles. This project was led by Jamie Morton, who was an amazing postdoc at Flatiron. Jamie wrote a really nice overview of how to think about compositional analyses like differential expression in Establishing microbial composition measurement standards with reference frames. But he has zero time to help build a state-of-the-art differential expression tool.

I would suggest looking at DESeq2 to see what a mess current differential expression inference is. It’s pretty easy to see why none of their p-values are well calibrated. It’s trivial to rewrite a statistically sound, fully Bayesian version of DESeq2, but biologists are too conservative to use new tools, so I haven’t even tried to fix this.

In the end, I basically gave up working on biology because of the politics.

Reply ↓
- Daniel Lakeland on July 31, 2025 1:14 PM at 1:14 pm said:
  
  Bob, I think your experience mirrors mine. Biologists that I know personally will work with me knowing that the approaches will be sound and produce something that they care about. Without that personal connection, most Biologists really don’t have any use for statistics other than as a club to wield against reviewers telling them they can’t publish. So using some new technique is just putting a target on their back saying “don’t let me publish this”…
  
  I once went to a training session on how to use Partek Flow to analyze bioinformatics datasets. This site-licensed thing is expensive enough they hired like 3 PhDs to consult on how to use it. Let’s call that $300k/yr that USC is paying to provide essentially marketing for Partek. Anyway the entire hour long seminar was essentially about adjusting menu options to get the p values you prefer to maximize publication probability.
  
  I think it took a week for my eye muscles to recover from rolling so hard, but I kept quiet as instructed by the boss. (my wife)
  
  Reply ↓
- Will Macnair on July 31, 2025 1:44 PM at 1:44 pm said:
  
  Hi Bob, thank you for the offer! And for your overview of the work you’ve done in this area. I’m sorry to hear that you didn’t have a great experience in biology. I’ll send you an email.
  
  Reply ↓
- Frederik Ziebell on July 31, 2025 3:46 PM at 3:46 pm said:
  
  Can you give a reference or example on the point that DESeq2 does not have well calibrated p-values?
  
  Reply ↓
- Leon Di Stefano on July 31, 2025 9:50 PM at 9:50 pm said:
  
  Stefano Mangiola is a researcher who has successfully done genomics with Stan, e.g. this paper.
  
  Reply ↓

Statistical Modeling, Causal Inference, and Social Science

Using hierarchical modeling to get more stable rankings of gene expression

33 thoughts on “Using hierarchical modeling to get more stable rankings of gene expression”

Leave a Reply Cancel reply