jrc: The thread that goes on and on (yup, usually haphazard bias being modeled as random).

When I was at Duke I came up with an example of Simpson’s paradox arising from 3.

1 gave a dramatic positive effect, 2 gave negative effects, and 3 gave a slight (pooled) positive effect.

(Simpson’s paradox usually arises from inappropriate complete pooling but it can happen with partial pooling. But without an actual example graduate students had difficulty understanding this.)
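A toy version of the complete-pooling reversal (all numbers invented; the partial-pooling case mentioned above is subtler, but this shows the basic mechanism): both within-group slopes are negative, yet inappropriately pooling the two groups yields a positive slope.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups whose within-group slopes are both negative, but whose
# group means line up so that complete pooling gives a positive slope.
x1 = rng.uniform(0, 1, 200)
y1 = 0.0 - 1.0 * x1 + rng.normal(0, 0.1, 200)
x2 = rng.uniform(2, 3, 200)
y2 = 4.0 - 1.0 * x2 + rng.normal(0, 0.1, 200)

slope_within_1 = np.polyfit(x1, y1, 1)[0]   # about -1
slope_within_2 = np.polyfit(x2, y2, 1)[0]   # about -1
slope_pooled = np.polyfit(np.concatenate([x1, x2]),
                          np.concatenate([y1, y2]), 1)[0]  # clearly positive
```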

]]>numeric – gene expression data requires estimating ~30K fold changes and 30K dispersions based on 3 or 4 replicates. In what sense is that not multiple parameter estimation?

The worst sin of omics statistics is the approach of treating large-scale inference as single-parameter inference scaled up with a multiple comparison correction tacked on at the end.

]]>Keith/numeric:

Speaking of vocabulary, in applied micro:

2: a fixed effects model, or first-difference model, is how we would think about that, and we would think of the parameters as being identified by comparisons within an individual `i’, with the point estimate being those individual differences aggregated across the `i’s.

3: a random effects model, which would be estimated using MLE or with OLS using “Moulton-type” standard errors. OLS is unbiased here, but doesn’t consistently estimate standard errors (if I am understanding your notation right).

If the assumptions in 3 hold, it is more efficient to use a random-effects framework and the kind of hierarchical models I think you are suggesting. But we often think that Mu_i is correlated with some sort of unobservables (some component of the error term) and then the assumptions behind estimating 3 as a random effects model are violated. But if that correlation is persistent over time, 2 will be consistently estimated in first-differences, fixed effects, or with individual dummy variables.
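A simulation sketch of that last point (invented numbers; `b_pooled` and `b_fd` are just illustrative names): when mu_i is correlated with the regressor, pooled OLS is biased, but first-differencing removes mu_i and consistently recovers the slope.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 2000, 1.0
mu = rng.normal(0, 1, n)                     # individual effects mu_i
# Regressor correlated with mu_i -> the random-effects assumption fails
x = mu[:, None] + rng.normal(0, 1, (n, 2))   # two time periods per i
y = mu[:, None] + beta * x + rng.normal(0, 1, (n, 2))

# Pooled OLS ignores mu_i and is biased (about 1.5 in this setup)
b_pooled = np.polyfit(x.ravel(), y.ravel(), 1)[0]
# First-differencing removes mu_i; the slope is consistent for beta
b_fd = np.polyfit(x[:, 1] - x[:, 0], y[:, 1] - y[:, 0], 1)[0]
```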

]]>numeric:

In most applications it is obvious that 1 is too wrong so the choice really is between 2 & 3.

3 is outside usual likelihood theory http://www.jstor.org/discover/10.2307/2291674

For the Neyman-Scott problems, with 2 the mle is highly inconsistent, whereas with 3 the _gmle_ is consistent.
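The pairs version of the Neyman-Scott problem makes the inconsistency under 2 easy to simulate (a sketch with invented numbers; the consistent estimator shown is the simple within-pair-difference one, not the gmle itself):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 50_000, 1.0
mu = rng.normal(0, 3, n)                         # a nuisance mean per pair
x = mu[:, None] + rng.normal(0, np.sqrt(sigma2), (n, 2))

# MLE of sigma^2 treating every mu_i as a free parameter (model 2):
# average squared deviation of each observation from its pair's own mean.
xbar = x.mean(axis=1, keepdims=True)
sigma2_mle = ((x - xbar) ** 2).sum() / (2 * n)   # converges to sigma2 / 2

# The within-pair difference eliminates mu_i and gives a consistent estimate
sigma2_consistent = ((x[:, 1] - x[:, 0]) ** 2).mean() / 2
```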

Somewhat annoyingly, when I was working on this http://biostatistics.oxfordjournals.org/content/2/4/463.full.pdf the editor asked us to justify the advantages of 3, and my co-author sent back an angry response that this was so well known that having to justify it in 2000 was ridiculous. The editor agreed, and now there is no trace of that being the case in our paper :-( One or more references from Greenland might be helpful here (e.g. the simulation study).

Probably some of Andrew’s papers and/or blog posts will cover this as well.

]]>Stein’s paradox deals with the admissibility of an estimator when multiple parameters (3 or more, I believe) are estimated. Here there is one parameter and mle achieves the Cramer-Rao lower bound (if the underlying model is true). I’m interested in why estimating a (3) in Kevin’s example is better than mle estimation. As the wikipedia article on Stein’s example states, “On the other hand, if one is instead interested in estimating an individual parameter, then using a combined estimator does not help and is in fact worse.”

]]>However, there is a subtlety here which is that unless a study is highly powered for all parameters, there’s often not a clear answer as to how much shrinkage is too little and how much is too much.

Granted this isn’t any more subjective than “multiple comparison adjusted p-values < some arbitrary number" (which is why I don't have any qualms about advocating shrinkage over corrections) but it still is a big degree of freedom that leaves the primary conclusions of an analysis undetermined.

]]>@rahul see above comment re: stein’s paradox. sometimes an answer sounds good because it’s true.

]]>numeric – i’d suggest educating yourself regarding stein’s paradox. you do not need an underlying relationship between variables for a hierarchical model to get you better estimates. you do not need an underlying causal relationship or “correct” modeling assumptions in the sense of an underlying mechanism in nature.

“standard mle theory” gives you worse answers (by very practical definitions of “worse”) if you’re estimating 3 or more parameters simultaneously, which is certainly the case here. Those estimates become a _lot_ worse when power is weak, which is also the case here.
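A quick simulation of that claim (toy numbers): for 3 or more fixed, unrelated parameters, the James-Stein shrinkage estimator beats the mle in total squared error, with no underlying relationship between the parameters at all.

```python
import numpy as np

rng = np.random.default_rng(3)
p, reps = 20, 2000
theta = rng.normal(0, 1, p)               # arbitrary fixed truths, drawn once
x = theta + rng.normal(0, 1, (reps, p))   # one noisy observation per parameter

# MLE: just use x. Total squared error averages to about p.
mse_mle = ((x - theta) ** 2).sum(axis=1).mean()

# James-Stein: shrink every component toward zero by a common factor.
shrink = 1 - (p - 2) / (x ** 2).sum(axis=1, keepdims=True)
js = shrink * x
mse_js = ((js - theta) ** 2).sum(axis=1).mean()   # strictly smaller risk
```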

]]>Running out of room but thank you for the comments. Models are typically more powerful the more assumptions are made, but are also more likely to be wrong. Or maybe restrictions is a better word. Tests of the assumptions are nice (model checking) but don’t seem to have achieved any sort of consensus. I do wonder about Kevin’s claim that, given

1. n(mu, 1) distribution has one unknown parameter mu, it is fixed and common for all observations.

2. n(mu.i, 1) distribution has unknown parameters mu.i; they are fixed and arbitrary for each observation.

3. n(mu.i*, 1) with mu.i* drawn from n(mu, 1) has one unknown parameter mu, fixed and common for all observations, and a random unobserved parameter mu.i* for each observation. One might be interested in mu, or a mu.i*, or all.

(3) is better. Under standard mle theory, if (1) is the correct model, it is best. Thus (3) is a claim of some sort of robustness under misspecification (and all models are misspecified). Is there an article going through these three types of models in this simple form (or an equivalent form) and demonstrating this?
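The comparison is at least easy to run as a simulation; a sketch (invented numbers) of 1, 2 and 3 as estimators of the mu.i, with the shrinkage factor for 3 estimated from the data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 100, 2000
mse = {"1 complete": 0.0, "2 none": 0.0, "3 partial": 0.0}
for _ in range(reps):
    mu_i = rng.normal(0.0, 1.0, n)        # truth per model 3: mu.i* ~ n(mu=0, 1)
    y = mu_i + rng.normal(0.0, 1.0, n)    # one n(mu.i*, 1) observation each
    s = ((y - y.mean()) ** 2).sum()
    shrink = max(0.0, 1 - (n - 3) / s)    # estimated tau^2 / (tau^2 + 1)
    est = {
        "1 complete": np.full(n, y.mean()),              # one common mu
        "2 none": y,                                     # arbitrary mu.i
        "3 partial": y.mean() + shrink * (y - y.mean()), # partial pooling
    }
    for k, e in est.items():
        mse[k] += ((e - mu_i) ** 2).mean() / reps
# With these variances 1 and 2 have about the same risk (~1);
# 3 roughly halves it.
```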

]]>I sort of got the same feeling. In quite a few cases where I’ve seen it recommended before I struggle to spot the hierarchy itself.

]]>numeric:

Vocabulary here can be challenging.

To me

1. n(mu, 1) distribution has one unknown parameter mu, it is fixed and common for all observations.

2. n(mu.i, 1) distribution has unknown parameters mu.i; they are fixed and arbitrary for each observation.

3. n(mu.i*, 1) with mu.i* drawn from n(mu, 1) has one unknown parameter mu, fixed and common for all observations, and a random unobserved parameter mu.i* for each observation. One might be interested in mu, or a mu.i*, or all.

Andrew refers to 1 as complete pooling, 2 as no pooling and 3 as partial pooling.

Some suggest special notation and names for these parameter characterizations depending which are of interest or just nuisance.

Yes, it is a mess. The purpose is to get better inference in some sense and 3 often does that.

]]>“[H]ierarchical models…are used when information is available on several different levels of observational units” isn’t a definition; it’s advice on when to reach for a hierarchical model. My view is that what makes a model for a set of parameters ‘hierarchical’ is that it neither assumes they are all independent in the prior (or for non-Bayesians, that they are ‘fixed effects’; the key point is no pooling of information) nor that they are all identical (complete pooling). In such models, some… thing, some rule or principle or method, dictates how parameter estimates co-vary.

]]>Look, I agree that exchangeability is essential to statistical analysis. I don’t agree that this makes a model “hierarchical”, since by this criterion all statistical models are hierarchical. To quote Gelman et al., “[H]ierarchical models…are used when information is available on several different levels of observational units” (BDA3 page 5). From Love et al.,

Var K_ij = mu_ij + alpha_i mu_ij^2

which tells me there is an individual error component (as well as a systemic component) for each gene (K is a gene count). This then tells me that there will be a multiple comparison problem. The alpha_i is an _assumed_ multilevel parameter (bad terminology – a parameter at a higher level of aggregation), not “information which is available on several different levels of observational units”. There is no information here, only an assumption of the model. To see how this definition can’t really work in practice, assume there is no biological variation, only “shot” noise (bioinformatician speak for Poisson noise – this would occur with two technical replicates in different lanes of a flow cell, for example). Then alpha_i would be zero and the model would no longer be hierarchical. Or, completely differently, assume there is data x1, …, xn drawn from a presumed n(mu, 1) distribution. mu is an aggregate parameter at the level of the entire dataset, but is this model hierarchical? If it is, then every statistical model since Pearson has been.

]]>Oops (typo in above fixed below)

(Because Fisher, Yates, Cochran were clever enough to discern the variances _can be_ taken as common (fixed) and still provide a not too wrong representation – Neyman-Scott discovered applications where it was way too wrong of a representation. A less wrong representation for some of these is variances _can be_ taken as common in distribution)

]]>numeric: I addressed what Corey put in de Finetti’s exchangeable language in a different language in my thesis (same concepts but less formalism).

“Certainly, the concept of something being common in observations even when they vary is central in statistics – especially if the observations are taken under apparently identical conditions. The appropriate extension of this concept to something being common in slightly or even largely different situations or contexts is admittedly more challenging. Parameters being common in distribution is an even further extension to treating admittedly different things as though they are exchangeable or common in some probabilistic sense. Here the interest could be primarily on what is common in the probability distribution that generated the lower level probability distributions, or something about a particular lower level distribution.” http://statmodeling.stat.columbia.edu/wp-content/uploads/2010/06/ThesisReprint.pdf

I went through it in extensive gory details, but the essence is that in coming up with a statistical analysis you do much more than specify a data generating model (distribution) by (implicitly or better still pragmatically) discerning which components or functions of the parameters _can be_ taken as common (fixed), arbitrary (fixed but differing) or common in distribution (parameter’s values are _like_ random draws from distribution – that may have common, arbitrary or common in distribution parameters) and still provide a not too wrong representation of how the data _were_ generated.

The “_can be_ taken as common in distribution and still provide a not too wrong representation” should make it clear that the representation does not have to be literal – so whether something was fixed or random physically is mostly a distraction.

Why was it/is it in Anova that variances were/are assumed common?

(Because Fisher, Yates, Cochran were clever enough to discern the variances _can be_ taken as common in distribution and still provide a not too wrong representation – Neyman-Scott discovered applications where it was way too wrong of a representation)

The quote you give is the (an?) explicitly hierarchical part of the model. It amounts to an assumption that dispersions are conditionally exchangeable given the mean normalized gene read count (MNGC). By de Finetti’s exchangeability theorem, that makes dispersions at a specific MNGC independent and identically distributed given a common parameter. In equation (5) of the Materials and Methods section they make this explicit with the assumption (more restrictive than mere conditional exchangeability) that the mixing distribution is log-normal and has the common parameter \sigma_d^2 for all MNGC values.

To see how this is hierarchical, imagine simulating from the model. Given a MNGC and a \sigma_d^2, one would first simulate an \alpha_i value and then simulate a set of gene counts K_{ij}. At the higher level of the hierarchy are the random variates indexed by i; at the lower level are the random variates indexed by ij.
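That two-level simulation can be sketched directly (all numbers invented; this uses the Var = mu + alpha*mu^2 negative-binomial parameterization described above, not DESeq2’s actual code):

```python
import numpy as np

rng = np.random.default_rng(5)
n_genes, n_reps = 5000, 4
mu = 100.0          # mean normalized count (same for all genes, for simplicity)
sigma_d = 0.5       # spread of log dispersions around the trend
alpha_trend = 0.05  # trend value of the dispersion at this mean count

# Higher level: one dispersion per gene, log-normal around the trend
alpha = np.exp(rng.normal(np.log(alpha_trend), sigma_d, n_genes))

# Lower level: negative-binomial counts with Var K_ij = mu + alpha_i * mu^2
# (numpy's (n, p) parameterization: n = 1/alpha, p = 1/(1 + alpha*mu))
n_param = 1.0 / alpha
p_param = 1.0 / (1.0 + alpha * mu)
K = rng.negative_binomial(n_param[:, None], p_param[:, None],
                          (n_genes, n_reps))

mean_count = K.mean()               # close to mu
var_across = K.var(axis=1).mean()   # well above mu: overdispersion
```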

]]>“we assume that genes of similar average expression strength have similar dispersion” – from

http://dx.doi.org/10.1186/s13059-014-0550-8.

I agree that this approach makes sense, but how does it make DESeq2 a hierarchical model? And even a better estimate of variation does not preclude the fact that one is making multiple comparisons. DESeq2 provides FDR and posteriors, but there are still a lot of comparisons – note the authors call for “testing of differential expression with respect to user-defined thresholds of biological significance.” The key is user-defined, and using standard multivariate methods I can set a threshold and make a good comparison. It is a comparison that reflects the uncertainty properties of the data that is difficult, and I don’t see that in any multiple testing of genes.

]]>A related question for Andrew, for clarification on your article:

I love better modeling and shrinkage, but they don’t seem to tackle the same problem as multiple comparisons, unless I’m really missing something.

The NCES data (figs 5 vs 7) is a good example where pretty much *every* pairwise comparison has *someone* who cares about it, so it does make sense to report all possible comparisons at once. You write:

“For the purpose of comparing to the classical approach, we set a .05 cutoff: for each pair of states j, k, we check whether 95% or more of the simulations show aj > ak. If so (that is, if aj > ak for at least 950 of the 1,000 simulations) we can claim with 95% confidence that state j outperforms state k.”

But those sound like standard (uncorrected) separate comparisons, just under a new model.

To really be analogous to classical multiple comparisons, shouldn’t you build a simultaneous 95% posterior credible set that covers *all* the differences |aj – ak|, for each pair of the 50 states at once, and see which of those differences are never 0 in your credible set?
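Something like the following sketch captures the contrast (invented numbers standing in for posterior draws of state effects; the simultaneous band is built from the maximum standardized deviation across all pairs):

```python
import numpy as np

rng = np.random.default_rng(6)
n_sims, J = 1000, 20
true_a = rng.normal(0, 1.0, J)
# stand-in posterior draws: true effects plus posterior uncertainty
a = true_a + rng.normal(0, 0.5, (n_sims, J))

jj, kk = np.triu_indices(J, k=1)
d = a[:, jj] - a[:, kk]              # draws of all pairwise differences

# Per-pair rule: >= 95% of draws agree on the sign of the difference
per_pair = ((d > 0).mean(axis=0) >= 0.95) | ((d < 0).mean(axis=0) >= 0.95)

# Simultaneous rule: one band wide enough that 95% of draws fall inside
# it for EVERY pair at once, then keep pairs whose band excludes zero
center, scale = d.mean(axis=0), d.std(axis=0)
z_max = np.abs((d - center) / scale).max(axis=1)
c = np.quantile(z_max, 0.95)         # simultaneous critical value
simultaneous = np.abs(center) > c * scale

n_per_pair = per_pair.sum()          # more "significant" pairs
n_simultaneous = simultaneous.sum()  # fewer, as the comment anticipates
```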

I imagine this would give you more “not significant” differences than what the article reports.

Have you seen any researchers actually do this?

(I get that you might not *want* to do this because you don’t *care* about multiple comparisons. But that’s different from saying that a good hierarchical model *fixes* the multiple comparisons problem.)

Thanks!

]]>This is already done, and the standard packages for analyses are usually based on some form of hierarchical model. Shrinkage estimation of gene variances is the norm; DESeq2 switched to shrinking the fold change. Frankly, I don’t know why they didn’t shrink fold changes from the beginning, given that the typical study design consists of n=2 or n=3 replicates per condition.

Although the method developers understand this, it’s unfortunate that results are still presented as lists of p-values and p < .05 is all that most users understand.

If they shrunk the FC estimates across the board and based "significance" on the 95% CI excluding negligible effect sizes, you wouldn't need the adhocery of using p-value cutoffs in conjunction with fold change cutoffs as the latest 'innovation' in the field.

]]>welcome to the field of ‘omics, where they don’t even know what they’re asking.

]]>Ben: Yes, but with 20,000 shrunk estimates there’s still an overwhelming need to pick a much smaller (and still possibly noisy) set of “winners”, thus inducing regression-to-the-mean artifacts that the exchangeability assumption does not protect against. Some sort of statistical control of the degree of noise in that set is also desirable – to many – and this is also not given by the exchangeability assumption.

Assuming exchangeability and/or fitting a hierarchical model is a great place to start, but it alone doesn’t give all the machinery one needs in practice.

]]>tl;dr you don’t really need an explicit hierarchical structure to do what Andrew is suggesting.

Maybe I’m misunderstanding, but I would assume that what you would do in this scenario is treat the genes as exchangeable (‘random effects grouping variables’ if you want to put it in those terms). Let’s say for simplicity that all you want to look at are fold changes between groups (not actual expression levels in each group). You make some assumption about the distribution of fold changes (e.g. that they are log-Normally distributed) and estimate the distribution, and the posterior/conditional mean for each gene. These are shrinkage estimators for the individual fold changes.
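A minimal empirical-Bayes version of that recipe (invented numbers, and a common known standard error for every gene, which real methods would not assume; `lfc` stands for observed log fold changes):

```python
import numpy as np

rng = np.random.default_rng(7)
n_genes, se = 2000, 0.8        # se: common standard error from few replicates
theta = rng.normal(0.0, 0.5, n_genes)        # true log fold changes
lfc = theta + rng.normal(0.0, se, n_genes)   # observed per-gene estimates

# Method-of-moments fit of the normal mixing distribution on the log scale
m = lfc.mean()
tau2 = max(lfc.var() - se ** 2, 1e-8)        # excess variance over the noise

# Posterior (conditional) mean per gene: a shrinkage estimator
shrunk = (tau2 * lfc + se ** 2 * m) / (tau2 + se ** 2)

mse_raw = ((lfc - theta) ** 2).mean()
mse_shrunk = ((shrunk - theta) ** 2).mean()  # smaller than mse_raw
```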

]]>I doubt there is such a thing as a false positive in this case. Instead there are small, medium, and large differences. These may or may not be primarily due to your treatment; it is even possible that much of the difference is due to totally uninteresting reasons (e.g. batch effects). There are many hours of hard work ahead in figuring out how to extract something likely to be useful from the information.

You are concerned with calculating the probability that two groups of people/animals/cells are from exactly the same hypothetical infinite population, then using this to distinguish “true” vs “false” positive results according to whether that value is less than a customary (arbitrary?) threshold that has been further “corrected” to punish you for collecting too much information. I just don’t get it.

]]>Dean:

For reasons we’ve discussed often on this blog, I don’t think that coverage and size of tests are so interesting. I’d be more interested in type S and type M errors, as Francis Tuerlinckx and I discuss, in the context of hierarchical models, in our 2000 paper. These are frequency calculations too, they’re just calculations that are more interesting to me.

]]>Anon:

“Multilevel” can be anything with variation. See my book with Jennifer for many examples and much discussion. For example, suppose you have several students, each of whom is given several test items. You can consider the students and items and responses as a multilevel structure. If you’d like, you can think of responses as being “within students” or “within items,” but I don’t know that this is so helpful.

]]>Alex:

If you only report one comparison then, yes, your inference needs to account for that selection process (or, as we call it in BDA, the inclusion process). In selection-type settings, the inclusion process itself contains information relevant to the parameters of interest. For example, if you report the interaction with the highest level of statistical significance, you should account for this in your statistical analysis; otherwise you’re “doing a Daryl Bem” or an “ovulation-and-clothing analysis” and you’ll get the wrong answer (wrong in the Bayesian sense of not reflecting the posterior distribution given all the data, and wrong in the frequentist sense of routinely reporting confidence in large effects that aren’t there, hence the difficulty of replication).

But if you condition on all the data, you should be ok. Of course, your model won’t be correct and you’ll have to worry about that, as always, but no selection problems. The key is to perform your analysis conditional on all the data. Or, in your terms, run all the tests and put them in a hierarchical model; don’t select.
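The selection effect is easy to see in a simulation (invented numbers; “conditioning on all the data” is crudely represented here by averaging all the estimates rather than by a full hierarchical model):

```python
import numpy as np

rng = np.random.default_rng(8)
reps, J = 2000, 50
theta = 0.1                                  # one small true effect, everywhere
est = theta + rng.normal(0, 1, (reps, J))    # noisy estimates of J comparisons

# Reporting only the most "significant" comparison exaggerates the effect:
selected = np.abs(est).max(axis=1)           # magnitude of the winning estimate
exaggeration = selected.mean() / theta       # far above 1

# Using all the data (here: averaging every estimate) stays near the truth:
all_data = est.mean(axis=1).mean()
```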

]]>