Bayesian approaches to differential gene expression

TPM, FPKM, WTF?

I (Bob) am dipping my toes back into differential expression modeling for RNA-seq data. Doesn’t everyone love high-dimensional count data? I just can’t get over the field’s standard units: TPM (transcripts per million) is almost OK, but FPKM (fragments per kilobase of transcript per million fragments mapped)? Seriously? We have scientific notation for a reason.
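
For anyone who hasn’t run into these units, here’s roughly what they mean (the standard definitions, nothing specific to any one pipeline). For a transcript with q mapped fragments and effective length len bases, in a sample with Q total mapped fragments,

    FPKM = q / ((len / 1e3) * (Q / 1e6))
    TPM  = 1e6 * (q / len) / sum over all transcripts of (q / len),

so TPMs at least sum to a million within each sample, whereas FPKMs don’t sum to anything in particular.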

If you’re thinking about bioinformatics…

If you’re going to be looking at RNA-seq data, I would strongly recommend reading the following clear, insightful, and comprehensive Bayesian take on the subject.

Lewin, Bottolo, and Richardson. 2019. Bayesian methods for gene expression analysis. In Balding, Moltke, and Marioni (eds.), Handbook of Statistical Genomics, 4th edition. Wiley.

Yes, that’s Sylvia Richardson, so you know this one’s going to be good. So good that I really wish I’d found it in 2018, when they apparently put the draft out online. But it doesn’t come up in obvious searches for RNA-seq and Bayes on Google, and it’s a sad indictment of scholarship that it has zero citations in Google Scholar. This should be among the first things anyone reads as an introduction to microarrays or RNA-seq in bioinformatics! If you have other suggestions in this category, please let me know in the comments.

Since 2018, Shuonan Chen (Columbia systems biology Ph.D. student), Chaolin Zhang (Columbia systems biology professor), and I have been developing a multilevel Bayesian alternative to rMATS (differential isoform expression with replicates). We’re nearly done with the draft, and I’ll announce it here when it’s up on arXiv. We hadn’t read Lewin et al. when we developed our model, but you can view what we did as a specific instance of the kinds of models they describe. They even use the trick I learned from Andrew of parameterizing the locations of two groups as a mean and a difference from that mean rather than as two separate locations. Even better, the paper goes over how to handle non-uniquely mapped reads, which systems like rMATS can’t handle; my only concern is that they don’t directly address identifiability issues in all of these models, which can be challenging.
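
To illustrate that trick concretely, here’s a toy Stan sketch (illustrative only, with made-up variable names and a normal outcome to keep it short; the real RNA-seq models are for counts). Instead of giving each condition its own location parameter, the overall mean and the between-condition difference each get a parameter and a prior, so the quantity of direct scientific interest shows up as its own parameter:

    data {
      int<lower=0> N1;
      int<lower=0> N2;
      vector[N1] y1;              // e.g., log expression in condition 1
      vector[N2] y2;              // e.g., log expression in condition 2
    }
    parameters {
      real mu;                    // overall mean across conditions
      real delta;                 // difference between conditions
      real<lower=0> sigma;        // shared residual scale
    }
    model {
      mu ~ normal(0, 5);          // made-up weakly informative priors
      delta ~ normal(0, 1);       // prior sits directly on the contrast of interest
      sigma ~ normal(0, 1);
      y1 ~ normal(mu - delta / 2, sigma);
      y2 ~ normal(mu + delta / 2, sigma);
    }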

P.S. My plan’s to write a case study along the same lines as Lewin et al., with more background on the RNA-seq process, simulated data sets and Stan code illustrating the main points, and an extended discussion of Bayesian approaches to non-identifiability. If anyone’s interested in being involved in this, please let me know via email ([email protected]).

6 thoughts on “Bayesian approaches to differential gene expression”

  1. Thanks for posting the Lewin et al. chapter here. The last time I dealt with gene expression data was before I turned towards Bayesian methods (I mainly used edgeR, DESeq2, etc.). A colleague recently presented some data that cries out for a hierarchical model, so I did a quick Google search and had a hard time finding good resources on Bayesian analyses of RNA-seq data; this will sure help a lot. Also looking forward to the case study with Stan!

  2. Super interested in the results. Some time ago, Stefano Mangiola and I tried to work on a better model for RNA-seq data, but we were unable to beat a very direct Bayesian translation of the DESeq2 model (with respect to leave-one-sample-out cross validation on some human datasets), and the project ended up on the back burner. So I'm looking forward to seeing the opportunities we missed :-)

  3. Martin: I'm curious how you translated DESeq2 to a Bayesian framework, as it's such a mess as written. The direct Bayesian translation of rMATS that we did outperformed the original's combination of Laplace approximations, max marginal likelihood, and likelihood-ratio NHSTs.

    And how did you do the comparison? This seems very tricky without any kind of gold standard. All the papers dance around this in various ways, none of which is super compelling (mainly measuring agreement with other systems or with simulated data).

    • The model (intercept-only, a bit messy and unpolished) is here: https://github.com/stemangiola/RNAseq-noise-model/blob/master/stan/negBinomial_deseq2.stan . I admit I didn't follow the DESeq2 code, but rather the paper, and basically took three main components: 1) I rely completely on their normalization procedure (DESeq2::estimateSizeFactors) and use it as input for the model, 2) I use exactly the DESeq2 formula for the relationship between mean expression and dispersion (later versions of DESeq2 also allow a non-parametric fit, which I ignored), and 3) a negative binomial observational model.

      Agree that validation is super tricky. What we did with Stefano in the repository linked above was to take a large-ish set of healthy human expression data from the same tissue, select something like 50 replicates (Stefano would know more, as he provided the dataset), and fit an intercept-only model. We then used k-fold cross validation to estimate the out-of-sample prediction error of the models (we got elpd and its differences between models via the loo package).

  4. The (a bit hacky, unpolished, intercept-only) Bayesian version of the DESeq2 model is at https://github.com/stemangiola/RNAseq-noise-model/blob/master/stan/negBinomial_deseq2.stan . It takes a few of the main ingredients from the DESeq2 paper (I didn't read their code to verify that it matches the paper): 1) I use DESeq2::estimateSizeFactors to get normalization factors across samples, 2) I use their parametric formula for the relationship between mean expression and dispersion, and 3) a negative binomial noise model.

    In the comparison, we didn't try to compare Bayesian with frequentist approaches. Instead we tried (and mostly failed) to find a better noise model, comparing entirely within the Bayesian paradigm. To that effect, Stefano got a large-ish dataset of expression from some healthy human tissues and we selected some 50 samples as replicates (he would know more about the details, I just got a read count table :-). We then fitted intercept-only versions of various models and used K-fold cross validation to estimate their ability to predict the other samples (summarised by elpd differences via the loo package). It was in that setting that, even when we focused on genes that appeared “weird”, we couldn't after some effort find a model that worked better than the translated DESeq2. I should however admit that we weren't able to get working implementations of several models that could have been of interest, and the evaluation was relatively shallow, so I don't have very strong confidence that this is a general result.
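
    To make that workflow a bit more concrete, here is a generic sketch of the pattern (not the actual code from the repository, and with normalization omitted for brevity): each candidate model is fit to the training samples of a fold and computes pointwise log-likelihoods for the held-out samples in generated quantities; stacking those across folds gives the log-likelihood matrix that the loo package summarises into elpd estimates and differences.

      data {
        int<lower=0> G;                           // genes
        int<lower=0> N;                           // training samples in this fold
        int<lower=0> N_holdout;                   // held-out samples in this fold
        array[N, G] int<lower=0> y;               // training counts
        array[N_holdout, G] int<lower=0> y_holdout;
      }
      parameters {
        vector[G] log_mu;                         // intercept-only: one mean per gene
        vector<lower=0>[G] phi;                   // per-gene precision
      }
      model {
        log_mu ~ normal(0, 10);                   // made-up weak priors
        phi ~ gamma(2, 0.1);
        for (n in 1:N)
          y[n] ~ neg_binomial_2_log(log_mu, phi);
      }
      generated quantities {
        // pointwise log-likelihoods for the held-out counts;
        // stack these across folds and summarise with the loo package in R
        matrix[N_holdout, G] log_lik;
        for (n in 1:N_holdout)
          for (g in 1:G)
            log_lik[n, g] = neg_binomial_2_log_lpmf(y_holdout[n, g] | log_mu[g], phi[g]);
      }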

  5. Thanks, that's just what I was looking for. The key thing DESeq2 is doing is taking the overdispersion to be a regression on the mean and squared mean. They use a negative binomial, which I would write as

    neg_binom(y[n] | lambda[n], alpha / lambda[n] + beta),

    with parameters lambda[1:N], alpha, and beta, where the second argument is the dispersion (so that variance = mean + dispersion * mean^2). That gives

    variance[y[n]] = lambda[n] + alpha * lambda[n] + beta * lambda[n]^2.

    But rather than using a formula like that and just fitting alpha and beta jointly along with everything else, they use the data multiple times in an empirical-Bayes-like procedure to set alpha and beta to constants.
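
    Here's a rough Stan sketch of what I mean by fitting them jointly (purely illustrative, with made-up weak priors, and taking the size factors as given data rather than modeling the normalization):

      data {
        int<lower=0> N;                   // samples
        int<lower=0> G;                   // genes
        array[N, G] int<lower=0> y;       // counts
        vector<lower=0>[N] size_factor;   // normalization, taken as given
      }
      parameters {
        vector[G] log_lambda;             // per-gene log mean expression
        real<lower=0> alpha;              // dispersion trend: alpha / lambda + beta
        real<lower=0> beta;
      }
      model {
        log_lambda ~ normal(0, 10);       // made-up weak priors
        alpha ~ normal(0, 5);
        beta ~ normal(0, 1);
        for (n in 1:N) {
          vector[G] lambda_n = size_factor[n] * exp(log_lambda);
          // dispersion = alpha / lambda + beta gives
          //   variance = lambda + alpha * lambda + beta * lambda^2;
          // Stan's neg_binomial_2 is parameterized by the mean and phi,
          // with variance = mean + mean^2 / phi, so phi = 1 / dispersion
          vector[G] phi = inv(alpha ./ lambda_n + beta);
          y[n] ~ neg_binomial_2(lambda_n, phi);
        }
      }

    That way the posterior uncertainty in alpha and beta propagates into everything downstream instead of being frozen at plug-in values.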
