Circling back to an old Bayesian “counterexample”

Hi everyone! It’s Dan again. It’s been a moment. I’ve been having a lovely six-month-long holiday as I transition from academia to industry (translation = I don’t have a job yet, but I’ve started to look). It’s been very peaceful. But sometimes I get bored, and when I get bored and the weather is rubbish I write a blog post. I’ve got my own blog now where it’s easier to type maths, so most of the things I write about aren’t immediately appropriate for this place.

But this one might be.

It’s on an old example that long-time readers may have come across before. The setup is pretty simple:

We have a categorical covariate x with a large number of levels J. We draw a sample of N data points by first sampling a value of x from a discrete uniform distribution on [1,…,J]. Once we have that, we draw a corresponding y from a normal distribution with a mean that depends on which category of x we drew.

Because the number of categories is very large, for a reasonably sized sample of data we will still have a lot of categories where there are no observations. This makes it impossible to estimate the conditional means for each category. But we can still estimate the overall mean of y.

Robins and Ritov (and Wasserman) queer the pitch by adding to each sample a random coin flip with a known probability (that differs for each level of x) and only reporting the value of y if that coin shows a head. This is a type of randomization that is pretty familiar in survey sampling. And the standard solution is also pretty familiar–the Horvitz-Thompson estimator is an unbiased estimator of the population mean.
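To make the setup concrete, here’s a minimal R sketch of the whole thing (the values of J, N, and the reporting probabilities are invented for illustration): draw the categories, the outcomes, and the coins, then compute the Horvitz-Thompson estimate from only the reported values.

```r
set.seed(1)
J <- 10000                               # number of categories (big relative to the sample)
N <- 2000                                # sample size
theta <- rnorm(J)                        # unknown category means
pi_x  <- runif(J, 0.1, 0.9)              # known reporting probabilities, one per category
x <- sample.int(J, N, replace = TRUE)    # covariate: uniform over the J categories
y <- rnorm(N, mean = theta[x])           # outcome whose mean depends on the category
R <- rbinom(N, 1, pi_x[x])               # the coin: y is only reported when R == 1
mean(theta)                              # the population mean we're after
sum(R * y / pi_x[x]) / N                 # Horvitz-Thompson estimate, using only the reported y's
```

Even though most categories are never observed, the weighted estimate is unbiased for the overall mean.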

All well and good so far. The thing that Robins, Ritov and Wasserman point out is that the Bayesian estimator will, in finite samples, often be massively biased unless the sampling probabilities are used when setting the priors. Here is Wasserman talking about it. And here is Andrew saying some smart things in response (back in 2012!).

I read this whole discussion back in the day and it never felt very satisfying to me. I was torn between my instinctive dislike of appeals to purity and my feeling that none of the Bayesian resolutions were very satisfying.

So ten years later I got bored (read: I had covid) and I decided to sketch out my solution using, essentially, MRP. And I think it came out a little bit interesting. Not in a “this is surprising” sense. Or even as a refutation of anything anyone else has written on this topic. But more in that it is an example that crystallizes the importance of taking the posterior seriously when you’re doing Bayesian modelling.

The resolution essentially finds the posterior for all of the mean parameters and then uses that as our new information about how the sample was generated. From this we can take our new joint distribution for the covariate, the data, and the ancillary coin and use it to estimate the average of an infinite sample. And, shock and horror, when we do that we get something that looks an awful lot like a Horvitz-Thompson estimator. But really, it’s just MRP.

If you’re interested in the resolution, the full post isn’t too long and is here. (Warning: contains some fruity language). I hope you enjoy.

God is present in the sweeping gestures, but the devil is in the details

Can’t remember anything at all — Nick Cave

Girls, it’s been a while. But it’s midnight and I am AMPED after a Springtime concert (when Lauren asked me what genre of music it was, I said “loud”. They played in a very weird room that usually hosts orchestras and clearly had a rusted-on group of silver-haired subscribers who didn’t want the intersection of swampy rock, free jazz, and noise. And I now need to see every single band possible in a room that’s half full of people who are SIMPLY NOT INTO IT. Glorious. It’s the vibe of the thing.)

Of course I’ve been gone, but fish gotta swim, birds gotta swim if sufficient external motivation is applied, and similarly I’ve just gotta blog. So I’ve been doing it in secret (not actually secret, I’ve been telling people). Not because I don’t love you but because sometimes a man needs to write eight thousand words on technical definitions of Gaussian processes, describe conjugate priors as “the crystal deodorant of Bayesian statistics“, and just generally make people say “look, I held on for as long as I could but that was too much“. I’ve also been re-posting a couple of my old posts from here that I like, because WordPress ate my equations (they’re all lightly edited because I’m anal, and this one probably got much better). Also, using distill to make blogs means I can go footnote mad and you know how much I love footnotes.

Anyway. What am I gonna talk about? I feel there’s some pressure here because one of my favourite posts (and, also, one of my most “is he ok? no” posts) was written in a similar post-concert coming-down state.

But I’m not gonna be that fancy or that long winded tonight.

Instead, I just want to share some stolen thoughts on a single phenomenon that is interesting.

When you add a “random effect” to your regression-type model, the regression coefficients will change. Sometimes a lot.

Now I know that I risk the full wrath of the cancel culture (to use the youth’s term) by using the term random effect on this blog. Andrew has written at length about why he doesn’t like it (you can google). But it’s a useful umbrella term in this context: I mean iid effects like random slopes and intercepts, smoothing splines in (Bayesian) GAM models, Gaussian processes, and the whole undead army of other “extra randomness” bits that you can put in a regression equation. (The criticism that a term doesn’t mean anything specific or is ambiguous can be effectively blunted by using it to mean everything.)

As with many “change the cheerleader, change the world” statistical issues, this really always ends up being an issue of confounding. But oh what subtle confounding it can be! Jim Hodges has a legitimately wonderful book called Richly Parameterized Linear Models, which I heartily recommend, that goes into all of these things. Even just by looking at the sample chapters, you will find

  • models where the multilevel structure is confounded with a covariate
  • models where a more complex dependence structure (like an ICAR for spatial data or a spline for … non-spatial data) is confounded with a covariate.

(The book also covers some truly wild things that happen to the hyperparameters [aka the parameters that control the correlation structure] in these situations!)

The long and the short of it, though, is that the interpretation of regression coefficients will change when you add more complex things (like multilevel or spatial or temporal or non-linear) structure to your model.
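Here’s a tiny simulated illustration (all numbers invented): the covariate is partly driven by a group-level effect, so the vanilla regression coefficient soaks up part of that group effect and shifts once the group structure goes into the model.

```r
set.seed(1)
J <- 20; n_per <- 25
g <- rep(1:J, each = n_per)
u <- rnorm(J)                              # group-level effect
x <- u[g] + rnorm(J * n_per)               # covariate is confounded with the group effect
y <- 1 * x + 2 * u[g] + rnorm(J * n_per)   # true coefficient on x is 1
coef(lm(y ~ x))["x"]                       # vanilla regression: close to 2, not 1
coef(lm(y ~ x + factor(g)))["x"]           # with the group structure in the model: back near 1
# the multilevel ("random intercept") version behaves similarly, e.g. with lme4 installed:
# lme4::fixef(lme4::lmer(y ~ x + (1 | g)))["x"]
```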

As with all things, this becomes violently true when you start thinking of interpreting your regression coefficients causally.

It’s worth saying that this is potentially a feature and not a bug! Especially when you’ve added the extra structure to your regression model in order to try and separate structural effects (like repeated measurements in individuals or temporal or spatial autocorrelation) from the effect of your covariates!

So how do you deal with it? No earthly idea.

  • Hodges and Reich suggest (in some contexts) organizing your extra randomness so it’s orthogonal to your variables of interest (this is easy to do for Gaussians; see the sketch after this list). This assumes that there are no unmeasured covariates moving in the same direction as your covariate of interest, and it will keep the regression coefficient point estimates the same between the vanilla regression and the regression with the extra stuff. (Hanks et al. suggest this can lead to unreasonably narrow posterior uncertainty intervals.)
  • Sigrunn, Janine, David, Håvard, and I think you should use priors to explicitly limit the complexity of your extra structure, which puts bounds on how much of the change can be coming from unmeasured things correlated with your object of interest. (Jon Wakefield also does this when choosing priors for smoothing splines in his excellent book.)
  • A whole cornucopia of literature that I don’t have the time or space to link to covers an infinity of aspects of formal causal identification and estimation in a whole variety of situations where we would add these types of structured uncertainty (longitudinal models, time series, spatial models, and variants thereof).
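For the first bullet, here’s a rough sketch of the orthogonalization trick (my own construction, not Hodges and Reich’s code): project the columns of the random-effect design matrix onto the orthogonal complement of the fixed-effect design before fitting, so the extra term can no longer move the point estimates of the coefficients you care about.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
X <- cbind(1, x)                                         # fixed-effect design: intercept + covariate of interest
Z <- model.matrix(~ 0 + factor(sample(1:10, n, TRUE)))   # design matrix for the "extra randomness"
H <- X %*% solve(crossprod(X), t(X))                     # hat matrix of X
Z_orth <- (diag(n) - H) %*% Z                            # random-effect columns, orthogonalized against X
round(crossprod(X, Z_orth), 10)                          # all zeros: the extra structure is orthogonal to X
```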

All of which is to say: don’t take your regression coefficients for granted when there are fancy things in your model. But also don’t avoid the fancy things because if you do that your regression coefficients won’t make any sense.

Did you think I’d leave you on a happy note?

Is he … you know…?

Today I learnt, via Sam Power from Bristol, that the legendary IJ Good and the possibly legendary (I really don’t know) RA Gaskins suggested, in their 1971 paper on density estimation, referring to the roughness penalty from density estimation (or non-linear regression) as the flamboyance1 functional. And it would be a crime if we, as a field, did not take up this terminology.

Don’t say: Splines balance the bias and the variance.

Do say: Splines support flamboyance, but only if it’s repeatable.

Footnotes:
1 They specify “flamboyant in the sense of having a wavy edge”. It was the 70s after all. Being wavy was a big thing.

Varimax: Sure, it’s always worked but now there’s maths!

Some day you will be loved — Death Cab for Cutie

Here is a paper that I just read that I really liked: Karl Rohe and Muzhe Zeng’s Vintage Factor Analysis with Varimax Performs Statistical Inference.

(Does anyone else get a Simon Smith and His Amazing Dancing Bear vibe off that title? No? Just me? Fine.)

This paper is in the fine tradition of papers that examine a method that’s been around for ages and everyone (maybe in a particular discipline) knows to work well, but no one has ever cracked why. Almost universally, it turns out that these methods do actually work1. But it usually takes someone to hold them up to the light in just the right way to see why.

And that’s what this paper does.

The basic idea is that you can often rotate principal components in order to make the loading vectors sparse. This typically leads to more interpretable principal components and, therefore, better exploratory data analysis.

The black fly in this particular glass of Chardonnay2 is that in the standard working model for PCA (namely that everything is independent and Gaussian) we end up with an awkward rotational invariance that doesn’t exactly scream “Hey! Rotate me! I’ll get better! I promise!”.

But nonetheless the Varimax rotations do actually seem to help most of the time and so they’ve been comfortably used for the last 50 or so years. It’s even in the base stats package in R!

So what’s the deal, daddy-o? It turns out that the whole method works as long as the entries in the principal component matrix are sufficiently non-Gaussian3. Figure 1 from their paper (shown below) shows this clearly. It turns out that when the principal components have heavy tails you see “radial streaks” and then there is a good rotation—the one that lines the axes up with these streaks!

Checking for these streaks was part of the workflow for using varimax rotations proposed by Thurstone back in 1947, and this paper does a good job of using it as a jumping-off point for building a theory of why these rotations work. (Side note: I really love how rooted in history this paper is while being extremely modern. It’s always nice to see a paper big-up the contributions of people who came before them while also doing amazing work.)

Anyway. You should all read the paper. It’s super sweet. Basically it shows that if your PCA is done on an appropriately centred and scaled matrix, then under some assumptions PCA + Varimax (which they call Vintage Factor Analysis) fits the model

$latex \mathbb{E}(X\mid Y,Z)=ZBY^T$,

where $latex X$ is the $latex n\times p$ matrix of covariates and the PCA+Varimax returns the matrix $latex Z$. (The B and Y are just in there for fun. Don’t stress about them. I promise the Z is the one we want.)

This basically means that while PCA will only give you some $latex n\times k$ matrix $latex U$ that is pinned down up to rotation (anything with the same $latex UU^T$ fits equally well), Vintage Factor Analysis can actually recover (under conditions!) the original matrices Z and Y used to generate the data.

The key condition is that the entries in the true $latex Z$ come from a heavy-tailed distribution (each column is iid from the same distribution, different columns can come from different distributions as long as they’re independent). This causes the radial streaks seen in Figure 1 and gives a preferred orientation that the Varimax rotation can find.
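Here’s a small base-R sketch of that story (my own toy simulation, not the paper’s code, and taking B to be the identity for simplicity): generate heavy-tailed factors Z, build X, and check that PCA followed by stats::varimax on the scores roughly recovers the columns of Z.

```r
set.seed(1)
n <- 2000; p <- 60; k <- 3
Z <- matrix(rt(n * k, df = 3), n, k)             # heavy-tailed factors: the key condition
Y <- matrix(rnorm(p * k), p, k)                  # loadings
X <- Z %*% t(Y) + matrix(rnorm(n * p, sd = 0.5), n, p)
pc <- prcomp(X, center = TRUE, scale. = TRUE)    # centred and scaled, as the paper asks
scores <- pc$x[, 1:k]
vm <- varimax(scores, normalize = FALSE)         # the varimax that ships with base R
Z_hat <- scores %*% vm$rotmat                    # rotated scores
round(cor(Z, Z_hat), 2)                          # roughly a signed permutation: Z is recovered up to sign and order
```

Swap the rt(…, df = 3) for rnorm(…) and the radial streaks disappear, leaving the recovered columns as an arbitrary rotation of Z.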

Now just because this method corresponds to a model doesn’t mean we believe that model. Actually we probably don’t. Unlike some other great homes of working models (like linear regression), this type of factor analysis is typically exploratory. So we don’t actually need to believe the assumptions for it to be useful.

This paper justifies the diagnostic checks of Thurstone (1947) and gives a plausible reason why this PCA+Varimax method can actually produce good, interpretable results.

On a personal note, my very first stats paper involved fitting a Bayesian factor model (Chris did all of the hard stats stuff because I was terrified and knew nothing. I provided an algorithm.). And I’ve got to say I’d much rather just do a PCA.

===========
1 Yes. I know. Just because a method is persistent doesn’t make it valid. This blog is full of examples. But this is a slightly different thing. This is not a method that is mathematically well understood (like using p-values to do inference) where either the mathematical framework doesn’t match with reality (like using p-values to do inference) or they aren’t being used correctly (like using p-values to do inference). This is instead a method that works but does not have a well-developed mathematical theory. Those types of things typically do end up working provably, even though it might take a while for theory to catch up to the method. Fun fact: one of my very early papers did this for a much more obscure algorithm.

2 OMG OMG OMG THERE’S A NEW ALANIS ALBUM AND IT’S AMAZING!!!!!

3 This is also the thinking behind Independent Component Analysis. The paper discusses the links.

Male bisexuality gets Big PNAS Energy

Do flowers exist at night?—John Mulaney and the Sack Lunch Bunch

I have very little to say here, except to let you all know that the venerable PNAS has today published a paper (edited by Steven Pinker) letting us know that male bisexuality exists.

Here it is: Robust evidence for bisexual orientation among men (The paper has authors, which you can click through for, but I prefer to imagine it sprouting fully-formed from the mind of Zeus)

Big shout out to all my bi bros who have been making this happen for years and years without the privilege of existing.

A few choice quotes:

Some competing interests…

Competing interest statement: J.S. is president of the American Institute of Bisexuality (AIB), which has funded some of the research contributing to our data. Obtaining funding from the AIB has sometimes allowed it to have input into the design prior to funding approval. However, J.S. has had no role in data analysis or manuscript writing until the present article, when he has contributed to writing.

Size matters when studying if male bisexuality exists:

All existing studies have been of small to modest size; the largest had 114 participants.

Something something fallacy of decontextualized measurement:

Notably, across these studies, bisexual-identified men self-reported subjective arousal to both male and female stimuli, even in samples where their genital arousal did not reflect such a pattern.

I’ll take structural heterosexism and misogyny in a patriarchal society manifesting itself in different ways across gender for 500, Alex:

The question of whether bisexual arousal patterns exist has been less controversial about women than men

Honestly there’s some stats in the paper but who really cares. The data is combined across 6 studies that cover two different types of stimulus (but it’s not clear how this was taken into account). The data also has about 270 straight(-ish) people, 200 gay(-ish) people, and 100 bi people, so, you know, balance.

The analysis itself is some sort of linear regression with a breakpoint such that the regression was a U shape (the arousal was higher/lower depending on where you are on the Kinsey scale). They used two different break points (but didn’t seem to adjust for the multiple testing) rather than using an automatically detected break point.

The data is shared if anyone wants to play with it and check some things out. It’s on OSF. Nothing was pre-registered so go hogwild!

But I’m personally not that interested in spending any actual effort to “discover bisexuality”. (Who expected Colonize Bisexuality would be July 2020’s clarion call!)

Before you go, please join me one more time in singing the male bisexual national anthem Do Flowers Exist At Night.

PS. I think this needs a quote from a paper that Andrew, Greggor and I wrote:

We can identify all these problems with what might be called the fallacy of decontextualized measurement, the idea that science proceeds by crisp distinctions modeled after asocial phenomena, such as unambiguous medical diagnoses (the presence or absence of streptococcus or the color change of a litmus paper). Seeking an on–off decision, normalizing a base rate to 50 percent, and, most problematically, stripping a phenomenon of its social context all give the feel of scientific objectivity while creating serious problems for generalizing findings to the world outside the lab or algorithm.

or, to put it more succinctly:

Imagine saying “the indium/gallium strain gauge I hooked up to this man’s penis didn’t show consistent arousal as I Clockwork Orange’d him with gender-spanning porn, so male bisexuality does not exist”.

This is a much bigger problem than a data analysis that is making a bunch of arbitrary decisions and not accounting for the heterogeneity in the data. It is a question of whether these experiments are appropriate methods to demonstrate the existence of bisexual men.

There’s obviously a different question of why, in the year of our lord two thousand and twenty, are researchers using the language of discovery in this context. And why they’d go straight for the old penis measuring method rather than just asking.

P.P.S. from Andrew: There seems to be some confusion in the comments about Dan’s point, so I elaborated here:

I agree that it can make sense to perform physiological tests of sexual responses. If you publish a study on that, you might draw some laughs (back in the 1970s there were the “Golden Fleece Awards”), but that comes with the territory. Just because a form of measurement is funny to some people, it doesn’t mean it shouldn’t be done.

I can’t speak for Dan, but I think his problem is not that this paper does some sort of dick-o-meter thing, but rather the points you make in the fourth paragraph of your comment. Also I have some problems with the statistics in the paper, which are related to Dan’s concerns about “language of discovery.”

Consider this statement from the paper:

Evidence for bisexual orientation requires that Minimum Arousal have an inverted U-shaped distribution and that Absolute Arousal Difference be U-shaped.

That makes no sense to me. Evidence for bisexual orientation requires only that there are some people who are sexually attracted to both men and women. No inverted-U-shaped distribution is required.

The above-quoted sentence is an example of scientism or the fallacy of measurement: taking a real and much-discussed phenomenon (bisexuality) and equating it with some particular measurement or statistical test.

P.P.P.S. (Dan again – sorry, I had other things to do today): Wow. The comments are certainly a journey aren’t they! A few things

  • Here is a paper (referenced in the above paper) that basically says the defensible version of what this paper shows: that patterns of arousal vary between monosexual and bisexual men, but there is a lot of variance among bisexual men. It does not claim to validate or provide evidence for the existence of bisexuality. If these particular patterns are of interest to you then that’s fine. It’s what’s being measured.
  • Andrew used the word “scientism”, which is an excellent way to describe the leap from the measurement scenario to claims about the validity of bisexuality (two of the authors have gone both ways on this). The paper is about patterns of arousal. The idea that a single common pattern of arousal can be used to categorize a sexual orientation is reductive to the point of absurdity.
  • Because apparently some people will not rest until I’ve been very clear about whether or not self-identification is reliable, let me just say the following: If you’d asked me my orientation when I was 13 I would’ve said straight. If you’d inferred it by measuring arousal, my orientation was “busses going along bumpy roads”. There are indications that as time goes on, this happens less often (the straight thing. The bus thing is eternal). So while there is noise in self-identification, it’s no more than there is in most other demographic variables that we commonly use.
  • (To put that point somewhat differently, referring to these penis measurements as objective is not the same thing as them being objective. They are also noisy, contextual, and complicated by social and cultural forces.)

The seventy two percent solution (to police violence)

And now it is your turn,
We are tired of praying, and marching, and thinking, and learning —  Gil Scott-Heron

So. It turns out that Gil Scott-Heron was right and he was wrong. We once again, during a time of serious social inequality and political upheaval, sent whiteys to the moon (ish). On the other hand, the revolution is definitely being televised. Live.

And as the protests and discussions continue there are a lot of reform ideas being mooted. Some ideas are transformative, like defunding and demilitarizing the police (this does not mean “get rid of police”! see here and here for what that might mean) and redistributing parts of the budget to housing, community projects, health, employment, education and other vital social programs. (This is predicated on the idea that the police are not the right solution to every problem they’ve been hurled at.)

At the other end of the spectrum we have 8 Simple Rules for Dating My Teenage Daughter to reduce police killings by 72%.

These rules are:

  • Require de-escalation
  • Have a use-of-force continuum
  • Ban chokeholds and strangleholds
  • Warn before shooting
  • Don’t shoot at moving vehicles
  • Exhaust all other means before shooting
  • Other officers have a duty to intervene to stop another officer from using excessive force
  • Require comprehensive reporting of uses and threats of force.

And really, all of those are perfectly reasonable things. The question is do they do anything? (NB: Eric Garner died from a chokehold in 2014. They were banned in NY in 1993.)

Well, the reason why I’m talking about this at all is that these rules were being heavily promoted as “data driven”. (I am not even going to pretend that I’m well versed in police violence or structural racism in the US. I have a very well-defined lane to stay in.)

In fact, the interventions were sold with the tag-line “Data proves that together these eight policies can decrease police violence by 72 percent”. (See, e.g. here). Very recently, the project has walked this back, somewhat, to “Research shows more restrictive use of force policies can reduce killings by police and save lives”.

If you look into their docs, you get this

Moreover, our research found that having all eight of these use of force restrictions in place was associated with 72% fewer police-involved killings compared to departments with none of these policies in place and a 54% reduction for the average police department.

This research was part of the Police Use of Force Project, which looked at police killings linked to 100-ish large police forces and compared which of these 8 policies had been implemented. (Why these 8? No idea.) The data is not open, but on Twitter, Emma Glennon, who is an epi researcher and PhD student at Cambridge, has compiled some similar data and released it on github. She wrote a really great twitter thread that dives into the data and the analysis and that I’m copping a lot from here. (Eve Ewing also has a really great discussion about this.)

So where did that 72% come from? Well, if you read the report (which doesn’t appear to be peer reviewed, but who even knows) you see the following:

  • 100 departments were approached for their use of force policies, leading to 91 departments that could be included.
  • Deaths were taken from The Guardian’s The Counted database, which covers 2015-2016. (A more up-to-date database is available from the Fatal Encounters project.)
  • The policies were included as a total number of policies enacted
  • The model was fit using a negative binomial regression that also included (with regression coefficients and standard errors)
    • Percent minority, 1.7 (0.7)
    • Arrests, 0.9 (0.2)
    • Income, 0.8 (0.5)
    • Inequality (Gini coefficient from the census), 0.1 (1.7)
    • Assault on officers (~0 effect) and number of officers (~0 effect)
  • The regression coefficient for Number of Policies was -0.16 (0.07).

Now, there is no way that this analysis supports causal conclusions. It’s not built for it and it’s not analyzed that way. And that’s ok. Exploratory data analysis is great. We all love it.

(Let’s not be glib about this though. That policy data was hard-won and has value beyond this particular analysis. As always, the hardest part of doing a data-driven analysis is getting the damn data.)

But the problem comes when making the 72% claim. That is a claim about the effect of a hypothetical intervention. Which is to say that it’s a causal claim.

And it’s not a good one.

The argument is that if $latex k$ policies are implemented, then the change in police killings compared to a similar city with no policies enacted is

$latex \left[\exp(-0.16k) -\exp(-0.16*0)\right]\times 100\%$.

With $latex k=8$, this gives the 72% reduction. (Yes Andrew, that extra precision annoys me too.)

An average city has 3 of these 8 policies enacted, so the change from moving to all 8 would be

$latex \left[\exp(-0.16\times 8) -\exp(-0.16\times 3)\right]\times 100\%$,

which is a 34% change.

Note: The 54% quoted above is wrong. You get the 54% quoted above if a department that previously had zero policies enacted (in the data Irving, Kansas City, Reno, and Stockton) enacts 3 policies.
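For the record, here’s the arithmetic in R for the two changes discussed above, using the reported coefficient of -0.16:

```r
b <- -0.16                          # reported coefficient on "number of policies"
(exp(b * 8) - exp(b * 0)) * 100     # zero policies -> all eight: about -72%
(exp(b * 8) - exp(b * 3)) * 100     # the average department (three policies) -> all eight: about -34%
```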

So. Are these numbers in any way real? Well, cracking open my copy of Regression and Other Stories, I know that this will be ok if the model holds (there is no model checking in the report) and under some unlikely assumptions that basically mean that observations with the same covariates are randomized into each of the 9 treatment levels (from 0 to 8 policies).

The numbers are also assuming that if you enact 4 policies, it doesn’t matter which 4 policies you enact. In the data there are 49 different policy combinations. And it is not unreasonable to expect that some combinations will be more effective than others.

All of this is to say that it is difficult to use this data and this modelling to predict the effect of enacting these policies. It’s definitely wrong to say they will lead to a 72% or 34% (or 54%!) reduction.

The correct interpretation is that (modulo fit issues), of the departments studied, those that enacted more of these policies recorded fewer killings (even after accounting for income, inequality, racial makeup of the population, and the number of arrests).

Does this mean the policy proposal should be scrapped? Well. That’s a hard question. It’s similar to the question of whether COPSS should rename the Fisher Lecture after someone who wasn’t a notorious racist. (There is a petition.) Sure. Do it. But beware of opportunity cost, because there is no evidence that this intervention will even touch the sides of the actual problem. So if enacting these policies is the end of the story, we likely won’t get very far.

And, as always, we should probably be aware that simple interventions rarely solve complex problems. And we should always be fairly skeptical when a relatively small intervention is advertised as leading to a large change. Even (or especially) when that intervention is advertised as “data driven”.

Fugitive and cloistered virtues

There’s a new revolution, a loud evolution that I saw. Born of confusion and quiet collusion of which mostly I’ve known. — Lana Del Rey

While an unexamined life seems like an excellent idea, an unexamined prior distribution probably isn’t. And, I mean, I write and talk and think and talk and write and talk and talk and talk and talk about this a lot. I am pretty much obsessed with the various ways in which we need to examine how our assumptions map to our priors (and vice versa).

But also I’m teaching a fairly low-level course on Bayesian stats (think right after a linear regression course) right now, which means I am constantly desperate for straightforward but illuminating questions that give some insight into what can and can’t go wrong with Bayesian models.

So I asked myself (and subsequently my students) to work out exactly what a prior that leads to a pre-determined posterior distribution looks like. Such a prior, which is necessarily data-dependent, is at the heart of some serious technical criticisms of Bayesian methods.

So this was my question: Consider the normal-normal model $latex y_i\mid\mu\sim N(\mu,1)$, $latex i=1,\ldots,n$. What prior $latex \mu\sim N(m,s^2)$ will ensure that the posterior expectation is equal to $latex \mu^*$?

The answer will probably not shock you. In the end, it’s any prior that satisfies $latex m=ns^2(\mu^*-\bar{y})+\mu^*$. It’s pretty easy to see what this looks like asymptotically: if we fix s, then $latex m\rightarrow\mathcal{O}\left(\text{sign}(\mu^*-\mu^\text{true})n\right)$, which is to say the mean gets HUGE.

Alternatively, we can set the prior as $latex \mu\sim N(2\mu^*-\bar{y},n^{-1})$, which is probably cleaner.

One thing that you notice if you simulate from this is that when the number of observations is small, these priors look kinda ordinary. It’s not until you get a large sample that either the mean buggers off to infinity or the variance squishes down to zero.
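Here’s what that simulation looks like in base R, using the second form of the prior (so $latex s^2=1/n$ and $latex m=2\mu^*-\bar{y}$), with a true mean of 0 and a target posterior mean of $latex \mu^*=2$:

```r
set.seed(1)
mu_star <- 2                                        # the answer we want the posterior to give
for (n in c(5, 50, 5000)) {
  y  <- rnorm(n, mean = 0, sd = 1)                  # data generated with true mean 0
  s2 <- 1 / n                                       # prior variance
  m  <- 2 * mu_star - mean(y)                       # prior mean; equivalently mu_star + n * s2 * (mu_star - mean(y))
  post_mean <- (m / s2 + n * mean(y)) / (1 / s2 + n)
  cat("n =", n, "  prior:", round(m, 2), "+/-", round(sqrt(s2), 3),
      "  posterior mean:", round(post_mean, 3), "\n")
}
```

At n = 5 the prior looks entirely unremarkable; by n = 5000 its standard deviation has squished down to about 0.014, but the posterior mean is pinned at 2 the whole way.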

And that should be a bit of a concern. Because if you’ve got a model with a lot of parameters (or – heaven forfend! – a non-parametric component like a Gaussian process or BART prior), there is likely to be very little data “per parameter”.

So it’s probably not enough to just look at the prior and say “well the mean and variance aren’t weird, so it’s probably safe”.

(BTW, we know this! This is just another version of the “concentration of measure” phenomenon that makes perfectly ok marginal priors turn into hideous monster joint priors. In that sense, priors are a lot like rat kings–rats are perfectly cute by themselves, but disgusting and terrifying when joined together.)

So what can you do to prevent people tricking you with priors? Follow my simple three step plan.

  1. Be very wary of magic numbers. Ask why the prior standard deviation is 2.43! There may be good substantive reasons. But it also might just give the answer people wanted.
  2. Simulate. Parameters. From. Your. Prior. Then. Use. Them. To. Simulate. Data.
  3. Investigate the stability of your posterior by re-computing it on bootstrapped or subsetted data.

The third thing is the one that is actually really hard to fool. But bootstraps can be complex when the data is complicated, so don’t just do iid splits. Make your subsampling scheme respect the structure of your data (and, ideally, the structure of your model). There is, as always, a fascinating literature on this.
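As a toy version of the third step, here’s the rigged normal-normal prior from above put through a crude half-sample check (an illustration of the idea rather than a general recipe):

```r
set.seed(1)
n <- 200
y <- rnorm(n, mean = 0, sd = 1)                       # true mean is 0
mu_star <- 2
s2 <- 1 / n; m <- 2 * mu_star - mean(y)               # the rigged, data-dependent prior
post_mean <- function(ys) (m / s2 + length(ys) * mean(ys)) / (1 / s2 + length(ys))
post_mean(y)                                          # full data: exactly mu_star = 2
summary(replicate(50, post_mean(sample(y, n / 2))))   # half-samples: centred near 2.7, not 2
```

The subsample posteriors don’t just wobble around the full-data answer; they sit somewhere else entirely, because the prior was tuned to this particular n and this particular sample mean. That’s the tell.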

Part of why we’re all obsessed with developing workflows—rather than just isolated, bespoke results—is that there’s no single procedure that can guard against bad luck or malicious intent. But if we expose more of the working parts of the statistical pipeline, there become fewer and fewer places to hide.

But also sometimes we just need a fast assignment for week 2 of the course.

Unquestionable Research Practices

Hi! (This is Dan.) The glorious Josh Loftus from NYU just asked the following question.

Obviously he’s not heard of preregistration.

Seriously though, it’s always good to remember that a lot of ink being spilled over hypothesis testing and its statistical brethren doesn’t mean that if we fix that we’ll fix anything. It all comes to naught if

  1. the underlying model for reality (be it your Bayesian model or your null hypothesis model and test statistic) is rubbish OR
  2. the process of interest is poorly measured or the measurement error isn’t appropriately modelled OR
  3. the data under consideration can’t be generalised to a population of interest.

Control of things like Type 1, Type 2, Type S, and Type M is a bit like combing your hair. It’s great if you’ve got hair to comb, but otherwise it leaves you looking a bit silly.

What if it’s never decorative gourd season?

If it rains, now we’ll change
We’ll hold and save all of what came
We won’t let it run away
If it rains — Robert Forster

I’ve been working recently as part of a team of statisticians based in Toronto on a big, complicated applied problem. One of the things about working on this project is that, in a first for me, we know that we need to release all code and data once the project is done. And, I mean, I’ve bolted on open practices to the end of an analysis, or just released a git repo at the end (sometimes the wrong one!). But this has been my first real opportunity to be part of a team that is weaving open practices all the way through an analysis. And it is certainly a challenge.

It’s worth saying that, notoriously “science adjacent” as I am, the project is not really a science project. It is a descriptive, explorative, and predictive study, rather than one that is focussed on discovery or confirmation. So I get to work my way through open and reproducible science practices without, say, trying desperately to make Neyman-Pearson theory work.

A slight opening

Elisabeth Kübler-Ross taught us that there are five stages in the transition to more open and reproducible scientific practices: 

  • Denial (I don’t need to do that!)
  • Anger (How dare they not do it!)
  • Bargaining (A double whammy of “Please let this be good enough” and “Please let other people do this as well”)
  • Depression (Open and reproducible practices are so hard and no one wants to do them properly)
  • Acceptance (Open and reproducible science is not a single destination, but a journey and an exercise in reflective practice)

And, really, we’re often on many parts of the journey simultaneously. (Although, like, we could probably stop spending so long on Anger, because it’s not that much fun for anyone.)  

And a part of this journey is to carefully and critically consider the shibboleths and touchstones of open and reproducible practice. Not because everyone else is wrong, but more because these things are complex and subtle and we are still working out how to weave them into our idiosyncratic research practices.

So I’ve found myself asking the following question.

Should we release code with our papers?

Now to friends and family who are also working their way through the Kübler-Ross stages of Open Science, I’m very sorry but you’re probably not going to love where I land on this. Because I think most code that is released is next to useless. And that it would be better to release nothing than release something that is useless. Less digital pollution.

It’s decorative gourd season!

A fairly well known (and operatically sweary) piece in McSweeney’s Internet Tendency celebrates the Autumn every year by declaring It’s decorative gourd season, m**********rs! And that’s the piece. A catalogue of profane excitement at the chance to display decorative gourds. Why? Because displaying them is enough!

But is that really true for code? While I do have some sympathy for the sort of “it’s been a looonnng day and if you just take one bite of the broccoli we can go watch Frozen again”-school of getting reluctant people into open science, it’s a desperation move and at best a stop-gap measure. It’s the type of thing that just invites malicious compliance or, perhaps worse, indifferent compliance.

Moreover, making a policy (even informally) that “any code release is better than no code release” is in opposition to our usual insistence that manuscripts reach a certain level of (academic) clarity and that our analyses are reported clearly and conscientiously. It’s not enough that a manuscript or a results section or a graph simply exist. We have much higher standards than that.

So what should the standard for code be?

The gourd’s got potential! Even if it’s only decorative, it can still be useful.

One potential use of purely decorative code is the idea that the code can be read to help us understand what the paper is actually doing.

This is potentially true, but it definitely isn’t automatically true. Most code is too hard to read to be useful for this purpose. Just like most gourds aren’t the type of decorative gourd you’d write a rapturous essay about.

So unless code meets a certain standard, it’s going to need to do something more than just sit there and look pretty, which means we will need our code to be at least slightly functional.

A minimally functional gourd?

This is actually really hard to work out. Why? Well there are just so many things we can look at. So let’s look at some possibilities. 

Good code “runs”. Why the scare quotes? Well because there are always some caveats here. Code can be good even if it takes some setup or a particular operating system to run. Or you might need a Matlab license. To some extent, the idea of whether “the code runs” is an ill-defined target that may vary from person to person. But in most fields there are common computing set ups and if your code runs on one of those systems it’s probably fine.

Good code takes meaningful input and produces meaningful output: It should be possible to, for example, run good code on similar-but-different data.  This means it shouldn’t require too much wrangling to get data into the code. There are some obvious questions here about what is “similar” data. 

Good code should be somewhat generalizable. A simple example of this: good code for a regression-type problem should not assume you have exactly 7 covariates, making it impossible to use when the data has 8 covariates. This is vital for dealing with, for instance, the reviewer who asks for an extra covariate to be added, or for a graph to change.

How limited can code be while still being good? Well that depends on the justification. Good code should have justifiable limitations.

Code with these 4 properties is no longer decorative! It might not be good, but it at least does something.  Can we come up with some similar targets for the written code to make it more useful? It turns out that this is much harder because judging the quality of code is much more subjective.

Good gourd! What is that smell?

The chances that a stranger can pick up your code and, without running it, understand what the method is doing are greatly increased with good coding practice. Basically, if it’s code you can come back to a year later and modify as if you’d never put it down, then your code is possibly readable. 

This is not an easy skill to master. And there’s no agreed upon way to write this type of code. Like clearly written prose, there are any number of ways that code can be understandable. But like writing clear prose, there are a pile of methods, techniques, and procedures to help you write better code.

Simple things like consistent spacings and doing whatever RStudio’s auto-format does like adding spaces each side of “+” can make your code much easier to read. But it’s basically impossible to list a set of rules that would guarantee good code. Kinda like it’s impossible to list a set of rules that would make good prose. 

So instead, let’s work out what is bad about code. Again, this is a subjective thing, but we are looking for code that smells.

If you want to really learn what this means (with a focus on R), you should listen to Jenny Bryan’s excellent keynote presentation on code smell (slides etc here). But let’s summarize.

How can you tell if code smells? Well, if you open a file and are immediately moved to not just light a votive candle but realize in your soul that without intercessory prayer you will never be able to modify even a corner of the code, then the code smells. If you can look at it and at a glance see basically what the code is supposed to do, then your code smells nice and clean.

If this sounds subjective, it’s because it is. Jenny’s talk gives some really good advice about how to make less whiffy code, but her most important piece of advice is not about a specific piece of bad code. It’s the following:

Your taste develops faster than your ability. 

To say it differently, as you code more you learn what works and what doesn’t. But a true frustration is that (just like with writing) you tend to know what you want to do before you necessarily have the skills to pull it off. 

The good thing is that code for academic work is iterative. We do all of our stuff, send it off for review, and then have to change things. So we have a strong incentive to make our code better and we have multiple opportunities to make it so.

Because what do you do when you have to add a multilevel component to a model? Can you do that by just changing your code in a single place? Or do you have to change the code in a pile of different places? Because good smelling code is often code that is modular and modifiable.

But because we build our code over the full lifecycle of a project (rather than just once after which it is never touched again), we can learn the types of structures we need to build into our code and we can share these insights with our friends, colleagues, and students.

A gourd supportive lab environment is vital to success

The frustration we feel when we want to be able to code better than our skills allow is awful. I think everyone has experienced a version of it. And this is where peers and colleagues and supervisors have their chance to shine. Because just as people need to learn how to write scientific reports and people need to learn how to build posters and people need to learn how to deliver talks, people need to learn how to write good code.

Really, the only teacher is experience. But you can help experience along. Work through good code with your group. Ask for draft code. Review it. Just like the way you’ll say “the intro needs more “Piff! Pop! Woo!” because right now I’m getting “*Sad trombone*” and you’ve done amazing work so this should reflect that”, you need to say the same thing about the code. Fix one smell at a time. Be kind. Be present. Be curious. And because you most likely were also not trained in programming, be open and humble.

Get members of your lab to swap code and explain it back to the author. This takes time. But this time is won back when reviews come or when follow up work happens and modifications need to be made. Clean, nice code is easy to modify, easy to change, and easy to use.

But trainees who are new at programming are nervous about programming.

They’re usually nervous about giving talks too. Or writing. Same type of strategy.

But none of us are professional programmers

Sadly, in the year of our lord two thousand and nineteen if you work in a vaguely quantitative field in science, social science, or the vast mire that surrounds them, you are probably being paid to program. That makes you a professional programmer.  You might just be less good at that aspect of your job than others.

I am a deeply mediocre part-time professional programmer. I’ve been doing it long enough to learn how code smells, to have decent practices, and to have a bank of people I can learn from. But I’m not good at it. And it does not bring me joy. But neither does spending a day doing forensic accounting on the university’s bizarre finance system. But it’s a thing that needs to be done as part of my job and for the most part I’m a professional who tries to do my best even if I’m not naturally gifted at the task.

Lust for gourds that are more than just decorative

In Norwegian, the construct “to want” renders “I want a gourd” as “Jeg har lyst på en kalebas” and it’s really hard, as an English speaker, not to translate that to “I have lust for a gourd”. And like that’s the panicking Norwegian 101 answer (where we can’t talk about the past because it’s linguistically complex or the future because it’s hard, so our only verbs can be instantaneous. One of the first things I was taught was “Finn er sjalu.” (Finn is jealous.) I assume because jealousy has no past or future).

But it also really covers the aspect of desiring a better future. Learning to program is learning how to fail to program perfectly. Just like learning to write is learning to be clunky and inelegant. To some extent you just have to be ok with that. But you shouldn’t be ok with the place you are being the end of your journey.

So did I answer my question? Should we release code with our papers?

I think I have an answer that I’m happy with. No in general. Yes under circumstances.

We should absolutely release code that someone has tried to make good code. Even though they will have failed. We should carry each other forward even in our imperfection. Because the reality is that science doesn’t get more open by making arbitrary barriers. Arbitrary barriers just encourage malicious compliance.

When I lived in Norway as a newly minted gay (so shiny) I remember once taking a side trip to Gay’s The Word, the LGBTQIA+ bookshop in London and buying (among many many others) a book called Queering Anarchism. And I can’t refer to it because it definitely got lost somewhere in the nine times I’ve moved house since then.

The thing I remember most about this book (other than being introduced to the basics of intersectional trans-feminism) was its idea of anarchism as a creative force. Because after tearing down existing structures, anarchists need to have a vision of a new reality that isn’t simply an inversion of the existing hierarchy (you know. Reducing the significance threshold. Using Bayes Factors instead of p-values. Pre-registering without substantive theory.) A true anarchist, the book suggested, needs to queer rather than invert the existing structures and build a more equitable version of the world.

So let’s build open and reproducible science as a queer reimagining of science and not a small perturbation of the world that is. Such a system will never be perfect. Just lusting to be better.


A heart full of hatred: 8 schools edition

No; I was all horns and thorns
Sprung out fully formed, knock-kneed and upright
Joanna Newsom

Far be it from me to be accused of liking things. Let me, instead, present a corner of my hateful heart. (That is to say that I’m supposed to be doing a really complicated thing right now and I don’t want to, so I’m going to scream into a void for a little while.)

The object of my ire: The 8-Schools problem.

Now, for those of you who aren’t familiar with the 8-schools problem, I suggest reading any paper by anyone who’s worked with Andrew (or has read BDA). It’s a classic.

So why hate on a classic?

Well, let me tell you. As you can well imagine, it’s because of a walrus.

I do not hate walruses (I only hate geese and alpacas: they both know what they did), but I do love metaphors. And while sometimes a walrus is just a walrus, in this case it definitely isn’t.

The walrus in question is the Horniman Walrus (please click the link to see my smooth boy!). The Horniman walrus is a mistake that you can see, for a modest fee, at a museum in South London.

The story goes like this: Back in the late 19th century someone killed a walrus, skinned it, hopefully did some other things to it, and sent it back to England to be stuffed and mounted. Now, it was the late 19th century and it turned out that the English taxidermist maybe didn’t know what a walrus looked like. (The museum’s website claims that “only a few people had ever seen a live walrus” at this point in history which, even for a museum, is really [expletive removed] white. Update: They have changed the text on the website! It now says “Over 100 years ago, not many people (outside of Artic regions) had ever seen a live walrus”!!!!!!!!!!)

But hey. He had sawdust. He had glue. He had other things that are needed to stuff and mount a dead animal. So he took his dead animal and his tools, introduced them to each other and proudly displayed the results.

(Are you seeing the metaphor?)

Now, of course, this didn’t go well. Walruses, if you’ve never seen one, are huge creatures with loose skin folds. The taxidermist did not know this and so he stuffed the walrus full, leading to a photoshop disaster of a walrus. Smooth like a beachball. A glorious mistake. And a genuine tourist attraction.

So this is my first problem. Using a problem like 8 schools as a default test for algorithms has a tendency to lead to over-stuffed algorithms that are tailored to specific models. This is not a new problem. You could easily call it the NeurIPS Problem (aka how many more ways do you want to over-fit MNIST?). (Yes, I know NeurIPS has other problems as well. I’m focussing on this one.)

A different version of this problem is a complaint I remember from back in my former life when I cared about supercomputers. This was before the whole “maybe you can use big computers on data” revolution. In these dark times, the benchmarks that mattered were the speed at which you could multiply two massive dense matrices, and the speed at which you could do a dense LU decomposition of a massive matrix. Arguably neither of these things were even then the key use of high-performance computers, but as the metrics became goals, supercomputer architectures emerged that could only be used to their full capacity on very specialized problems that had enough arithmetic intensity to make use of the entire machine. (NB: This is quite possibly still true, although HPC has diversified from just talking about Cray-style architectures)

So my problem, I guess, is with benchmark problems in general.

A few other specific things:

Why so small? 8 Schools has 8 observations, which is not very many observations. We have moved beyond the point where we need to include the data in a table in the paper.

Why so meta? The weirdest thing about the 8 Schools problem is that it has the form

$latex y_j\mid\mu_j\sim N(\mu_j,\sigma_j)$
$latex \mu_j\mid\mu,\tau\sim N(\mu,\tau)$

with appropriate priors on $latex \mu$ and $latex \tau$. The thing here is that the observation standard deviations $latex \sigma_j$ are known. Why? Because this is basically a meta-analysis. So 8-schools is a very specialized version of a Gaussian multilevel model. By fixing the observation standard deviation, the model has a much nicer posterior than the equivalent model with an unknown observation standard deviation. Hence, 8-schools doesn’t even test an algorithm on an ordinary linear mixed model.

But it has a funnel! So does Radford Neal’s funnel distribution (in more than 17 dimensions). Sample from that instead.
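For reference, the funnel is two lines to simulate (this is the parameterization in the Stan manual; nine local parameters is the conventional choice, and you can scale that up as you please):

```r
set.seed(1)
n_draws <- 1e4
v <- rnorm(n_draws, mean = 0, sd = 3)                                   # "global" log-scale parameter
x <- matrix(rnorm(9 * n_draws, mean = 0, sd = exp(v / 2)), n_draws, 9)  # nine locals, sd depends on v
# plot(x[, 1], v, pch = ".") shows the funnel: wide at the top, needle-thin at the bottom
```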

But it’s real data! Firstly, no it isn’t. You grabbed it out of a book. Secondly, the idea that testing inference algorithms on real data is somehow better than systematically testing on simulated data is just wrong. We’re supposed to be statisticians so let me ask you this: How does an algorithm’s success on real data set A generalize to the set of all possible data sets? (Hint: It doesn’t.)

So, in conclusion, I am really really really sick of seeing the 8-schools data set.

Postscript: There’s something I want to clarify here: I am not saying that empirical results are not useful for evaluating inference algorithms. I’m saying that they’re only useful if the computational experiments are clear. Experiments using well-designed simulated data are unbelievably important. Real data sets are not.

Why? Because real data sets are not indicative of data that you come across in practice. This is because of selection bias! Real data sets that are used to demonstrate algorithms come in two types:

  1. Data that is smooth and lovely (like 8-Schools or an over-stuffed walrus)
  2. Data that is pointy and unexpected (like StackLoss, which famously has an almost singular design matrix, or this excellent photo a friend of mine once took)

Basically this means that if you have any experience with a problem at all, you can find a data set that makes your method look good or that demonstrates a flaw in a competing method, or makes your method look bad. But this choice is opaque to people who are not experts in the problem at hand. Well-designed computational experiments, on the other hand, are clear in their aims (eg. this data has almost co-linear covariates, or this data has outliers, or this data should be totally pooled).

Simulated data is clearer, more realistic, and more honest when evaluating or even demonstrating algorithms.

Dan’s Paper Corner: Yes! It does work!

Only share my research
With sick lab rats like me
Trapped behind the beakers
And the Erlenmeyer flasks
Cut off from the world, I may not ever get free
But I may
One day
Trying to find
An antidote for strychnine — The Mountain Goats

Hi everyone! Hope you’re enjoying Peak Libra Season! I’m bringing my Air Sign goodness to another edition of Dan’s Paper Corner, which is a corner that I have covered in papers I really like.

And honestly, this one is mostly cheating. Two reasons really. First, it says nice things about the work Yuling, Aki, Andrew, and I did and then proceeds to do something much better. And second because one of the authors is Tamara Broderick, who I really admire and who’s been on an absolute tear recently.

Tamara—often working with the fabulous Trevor Campbell (who has the good grace to be Canadian), the stunning Jonathan Huggins (who also might be Canadian? What am I? The national register of people who are Canadian?), and the unimpeachable Ryan Giordano (again. Canadian? Who could know?)—has written a pile of my absolute favourite recent papers on Bayesian modelling and Bayesian computation.

Here are some of my favourite topics:

As I say, Tamara and her team of grad students, postdocs, and co-authors have been on one hell of a run!

Which brings me to today’s paper: Practical Posterior Error Bounds from Variational Objectives by Jonathan Huggins, Mikołaj Kasprzak, Trevor Campbell, and Tamara Broderick.

In the grand tradition of Dan’s Paper Corner, I’m not going to say much about this paper except that it’s really nice and well worth reading if you care about asking “Yes, but did it work?” for variational inference.

I will say that this paper is amazing and covers a tonne of ground. It’s fully possible that someone reading this paper for the first time won’t recognize how unbelievably practical it is. It is not trying to convince you that its new melon baller will ball melons faster and smoother than your old melon baller. Instead it stakes out much bolder ground: this paper provides a rigorous and justified and practical workflow for using variational inference to solve a real statistical problem.

I have some approximately sequential comments below, but I cannot stress this enough: this is the best type of paper. I really like it. And while it may be of less general interest than last time’s general theory of scientific discovery, it is of enormous practical value. Hold this paper close to your hearts!

  • On a personal note, they demonstrate that the idea in the paper Yuling, Aki, Andrew, and I wrote is good for telling when variational posteriors are bad, but the k-hat diagnostic being small does not necessarily mean that the variational posterior will be good. (And, tbh, that’s why we recommended polishing it with importance sampling)
  • But that puts us in good company, because they show that neither the KL divergence that’s used in deriving the ELBO nor the Renyi divergence is a particularly good measure of the quality of the solution.
  • The first of these is not all that surprising. I think it’s been long acknowledged that the KL divergence used to derive variational posteriors is the wrong way around!
  • I do love the Wasserstein distance (or as an extremely pissy footnote in my copy of Bogachev’s glorious two volume treatise on measure theory insists: the Kantorovich-Rubinstein metric). It’s so strong. I think it does CrossFit. (Side note: I saw a fabulous version of A Streetcar Named Desire in Toronto [Runs til Oct 27] last week and really it must be so much easier to find decent Stanleys since CrossFit became a thing.)
  • The Hellinger distance is strong too and will also control the moments (under some conditions. See Lemma 6.3.7 of Andrew Stuart’s encyclopedia)
  • Reading the paper sequentially, I get to Lemma 4.2 and think “ooh. that could be very loose”. And then I get excited about minimizing over $latex \eta$ in Theorem 4.3 because I contain multitudes.
  • Maybe my one point of slight disagreement with this paper is where they agree with our paper. Because, as I said, I contain multitudes. They point out that it’s useful to polish VI estimates with importance sampling, but argue that they can compute their estimate of VI error instead of k-hat. I’d argue that you need to compute both because just like we didn’t show that small k-hat guarantees a good variational posterior, they don’t show that a good approximate upper bound on the Wasserstein distance guarantees that importance sampling will work. So ha! (In particular, Chatterjee and Diaconis argue very strongly, as does Mackay in his book, that the variance of an importance sampler being finite is somewhere near meaningless as a practical guarantee that an importance sampler actually works in moderate to high dimensions.)
  • But that is nought but a minor quibble, because I completely and absolutely agree with the workflow for Variational Inference that they propose in Section 4.3.
  • Let’s not kid ourselves here. The technical tools in this paper are really nice.
  • There is not a single example I hate more than the 8 schools problem. It is the MNIST of hierarchical modelling. Here’s hoping it doesn’t have any special features that make it a bad generic example of how things work!
  • That said, it definitely shows that k-hat isn’t enough to guarantee good posterior behaviour.

Anyway. Here’s to more papers like this and to fewer examples of what the late, great David Berman referred to as “ceaseless feasts of schadenfreude”.

Dan’s Paper Corner: Can we model scientific discovery and what can we learn from the process?

Jesus taken serious by the many
Jesus taken joyous by a few
Jazz police are paid by J. Paul Getty
Jazzers paid by J. Paul Getty II

Leonard Cohen

So I’m trying a new thing because like no one is really desperate for another five thousand word essay about whatever happens to be on my mind on a Thursday night in a hotel room in Glasgow. Also, because there’s a pile of really interesting papers that I think it would be good and fun for people to read and think about.

And because if you’re going to do something, you should jump right into an important topic, may I present for your careful consideration Berna Devezer, Luis G. Nardin, Bert Baumgaertner,  and Erkan Ozge Buzbas’ fabulous paper Scientific discovery in a model-centric framework: Reproducibility, innovation, and epistemic diversity. (If we’re going to talk about scientific discovery and reproducibility, you better believe I’m going to crack out the funny Leonard Cohen.)

I am kinda lazy so I’m just going to pull out the last paragraph of the paper as a teaser. But you should read the whole thing. You can also watch Berna give an excellent seminar on the topic. Regardless, here is that final paragraph.

Our research also raises questions with regard to reproducibility of scientific results. If reproducibility can be uncorrelated with other possibly desirable properties of scientific discovery, optimizing the scientific process for reproducibility might present trade-offs against other desirable properties. How should scientists resolve such trade-offs? What outcomes should scientists aim for to facilitate an efficient and proficient scientific process? We leave such considerations for future work.

I like this paper for a pile of reasons. A big one is that a lot of discussion that I have seen around scientific progress is based around personal opinions (some I agree with, some I don’t) and proposed specific interventions. Both of these things are good, but they are not the only tools we have. This paper proposes a mechanistic model of discovery encoding some specific assumptions and investigates the consequences. Broadly speaking, that is a good thing to do.

Some random observations:

  • The paper points out that the background information available for a replicated experiment is explicitly different from the background information from the original experiment in that we usually know the outcome of the original. That the set of replications is not a random sample of all experiments is very relevant when making statements like “x% of experiments in social psychology don’t replicate”.
  • One of the key points of the paper is that reproducibility is not the only scientifically relevant property of an experiment. Work that doesn’t reproduce may well lead to a “truth” discovery (or at least a phenomenological model that is correct within the precision of reasonable experiments) faster than work that does reproduce. An extremely nerdy analogy would be that reproducibility will be like a random walk towards the truth, while work that doesn’t reproduce can help shoot closer to the truth.
  • Critically, proposals that focus on reproducibility of single experiments (rather than stability of experimental arcs) will most likely be inefficient. (Yes, that includes preregistration, the current Jesus taken serious by the many)
  • This is a mathematical model so everything is “too simple”, but that doesn’t mean it’s not massively informative. Some possible extensions would be to try to model more explicitly the negative effect of persistent-but-wrong flashy theories. Also the effect of incentives. Also the effect of QRPs, HARKing, Hacking, Forking, and other deviations from The Way The Truth and The Life.

I’ll close out with a structurally but not actually related post from much-missed website The Toast: Permission To Play Devil’s Advocate Denied by the exceptional Daniel Mallory Ortberg (read his books. They’re excellent!)

Our records indicate that you have requested to play devil’s advocate for either “just a second here” or “just a minute here” over fourteen times in the last financial quarter. While we appreciate your enthusiasm, priority must be given to those who have not yet played the position. We would like to commend you for the excellent work you have done in the past year arguing for positions you have no real interest or stake in promoting, including:

  • Affirmative Action: Who’s the Real Minority Here?
  • Maybe Men Score Better In Math For A Reason
  • Well, They Don’t Have To Live Here
  • I Think You’re Taking This Too Personally
  • Would It Be So Bad If They Did Die?
  • If You Could Just Try To See It Objectively, Like Me

 

Multilevel structured (regression) and post-stratification

My enemies are all too familiar. They’re the ones who used to call me friend – Jawbreaker

Well I am back from Australia where I gave a whole pile of talks and drank more coffee than is probably a good idea. So I’m pretty jetlagged and I’m supposed to be writing my tenure packet, so obviously I’m going to write a long-ish blog post about a paper that we’ve done on survey estimation that just appeared on arXiv. We, in this particular context, means my stellar grad student Alex Gao, the always stunning Lauren Kennedy, the eternally fabulous Andrew Gelman, and me.

What is our situation?

When data is a representative sample from the population of interest, life is peachy. Tragically, this never happens. 

Maybe a less exciting way to say that would be that your sample is representative of a population, but it might not be an interesting population. An example of this would be a psychology experiment where the population is mostly psychology undergraduates at the PI’s university. The data can make reasonable conclusions about this population (assuming sufficient sample size and decent design etc), but this may not be a particularly interesting population for people outside of the PI’s lab. Lauren and Andrew have a really great paper about this!

It’s also possible that the population that is being represented by the data is difficult to quantify.  For instance, what is the population that an opt-in online survey generalizes to?

Moreover, it’s very possible that the strata of the population have been unevenly sampled on purpose. Why would someone visit such violence upon their statistical inference? There are many many reasons, but one of the big ones is ensuring that you get enough samples from a rare population that’s of particular interest to the study. Even though there are good reasons to do this, it can still bork your statistical analysis.

All in all, dealing with non-representative data is a difficult thing and it will surprise exactly no one to hear that there are a whole pile of approaches that have been proposed from the middle of last century onwards.

Maybe we can weight it

Maybe the simplest method for dealing with non-representative data is to use sample weights. The purest form of this idea occurs when the population is stratified into $latex J$ subgroups of interest and data is drawn independently at random from the $latex j$th population with probability $latex \pi_j$.  From this data it is easy to compute the sample average for each subgroup, which we will call $latex \bar{y}_j$. But how do we get an estimate of the population average from this?

Well just taking the average of the averages probably won’t work–if one of the subgroups has a different average from the others it’s going to give you the wrong answer.  The correct answer, aka the one that gives an unbiased estimate of the mean, was derived by Horvitz and Thompson in the early 1950s. To get an unbiased estimate of the mean you need to use the subgroup means and the sampling probabilities.  The Horvitz-Thompson estimator has the form 

$latex \frac{1}{J}\sum_{j=1}^J\frac{\bar{y}_j}{\pi_j}$.

Now, it is a truth universally acknowledged, if perhaps not universally understood, that unbiasedness is really only a meaningful thing if a lot of other things are going very well in your inference. In this case, it really only holds if the data was sampled from the population with the given probabilities.  Most of the time that doesn’t really happen. One of the problems is non-response bias, which (as you can maybe infer from the name) is the bias induced by non-response. 

(There are ways through this, like raking, but I’m not going to talk about those today).
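To make the weighting idea concrete, here is a minimal base-R sketch of inverse-probability weighting in the Horvitz-Thompson spirit. It uses the per-unit form (weight each observed response by one over its subgroup’s sampling probability, then divide by the known population size) rather than the subgroup-mean formula above, and every number in it is invented for illustration.

```r
set.seed(1)

# Hypothetical population: J subgroups with different means
J    <- 5
N_j  <- rep(2000, J)                        # subgroup sizes (assumed known)
mu_j <- c(0, 1, 2, 3, 4)                    # subgroup means
pi_j <- c(0.50, 0.30, 0.15, 0.05, 0.02)     # known sampling/response probabilities

pop <- data.frame(
  group = rep(1:J, times = N_j),
  y     = rnorm(sum(N_j), mean = rep(mu_j, times = N_j), sd = 1)
)

# Each unit responds with the probability attached to its subgroup
responded <- rbinom(nrow(pop), 1, pi_j[pop$group]) == 1
samp <- pop[responded, ]

# Naive sample mean: pulled towards the heavily sampled subgroups
mean(samp$y)

# Inverse-probability (Horvitz-Thompson style) estimate of the population mean
sum(samp$y / pi_j[samp$group]) / sum(N_j)

# Target: the true population mean
mean(pop$y)
```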

Poststratification: flipping the problem on its head

One way to think about poststratification is that instead of making assumptions about how the observed sample was produced from the population, we make assumptions about how the observed sample can be used to reconstruct the rest of the population.  We then use this reconstructed population to estimate the population quantities of interest (like the population mean).

The advantage of this viewpoint is that we are very good at prediction. It is one of the fundamental problems in statistics (and machine learning because why not). This viewpoint also suggests that our target may not necessarily be unbiasedness but rather good prediction of the population. It also suggests that, if we can stomach a little bias, we can get much tighter estimates of the population quantity than survey weights can give. That is, we can trade off bias against variance!
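Here is a minimal sketch of that viewpoint, with the “prediction” step reduced to raw subgroup means just to keep it short; in MRP the multilevel regression would do that job (and would handle the empty or nearly empty cells that this toy version can’t). All counts and effects are made up.

```r
set.seed(2)

# Known population composition (giant assumption 1, discussed below)
N_j <- c(3000, 2500, 2000, 1500, 1000)   # population count in each subgroup
J   <- length(N_j)

# A non-representative sample: some subgroups barely observed
n_j  <- c(200, 100, 30, 10, 5)
samp <- data.frame(
  group = rep(1:J, times = n_j),
  y     = rnorm(sum(n_j), mean = rep(c(0, 1, 2, 3, 4), times = n_j))
)

# "Prediction" step: here just subgroup sample means; in MRP this is
# where the multilevel regression would go
yhat_j <- tapply(samp$y, samp$group, mean)

# Poststratification: weight subgroup predictions by known population shares
sum(N_j / sum(N_j) * yhat_j)

# Compare with the raw (unweighted) sample mean
mean(samp$y)
```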

Of course, anyone who tells you they’re doing assumption free inference is a dirty liar, and the fewer assumptions we have the more desperately we cling to them. (Beware the almost assumption-free inference. There be monsters!) So let’s talk about the two giant assumptions that we are going to make in order for this to work.

Giant assumption 1: We know the composition of our population. In order to reconstruct the population from the sample, we need to know how many people or things should be in each subgroup. This means that we are restricted in how we can stratify the population. For surveys of people, we typically build out our population information from census data, as well as from smaller official surveys like the American Community Survey (for estimating things about the US! The ACS is less useful in Belgium.). (This assumption can be relaxed somewhat by clever people like Lauren and Andrew, but poststratifying to a variable that isn’t known in the population is definitely an advanced skill.)

Giant assumption 2: The people who didn’t answer the survey are like the people who did answer the survey. There are a few ways to formalize this, but one that is clear to me is that we need two things. First, that the people who were asked to participate in the survey in subgroup j are a random sample of subgroup j. The second thing we need is that the people who actually answered the survey in subgroup j are a random sample of the people who were asked. These sorts of missing at random or missing completely at random or ignorability assumptions are pretty much impossible to verify in practice. There are various clever things you can do to relax some of them (e.g. throw a handful of salt over your left shoulder and whisper “causality” into a mouldy tube sock found under a teenage boy’s bed), but for the most part this is the assumption that we are making.

A thing that I hadn’t really appreciated until recently is that this also gives us some way to do model assessment and checking.  There are two ways we can do this. Firstly we can treat the observed data as the full population and fit our model to a random subsample and use that to assess the fit by estimating the population quantity of interest (like the mean). The second method is to assess how well the prediction works on left out data in each subgroup. This is useful because poststratification explicitly estimates the response in the unobserved population, so how good the predictions are (in each subgroup!) is a good thing to know!

This means that tools like LOO-CV are still useful, although rather than looking at a global LOO-elpd score, it would be more useful to look at it for each unique combination of stratifying variables. That said, we have a lot more work to do on model choice for survey data.
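Here is a rough sketch of the first of those two checks: treat the observed data as if it were the whole population, fit to a random subsample, poststratify back, and compare to the “population” mean. The “model” is again just cell means, purely to keep the sketch short.

```r
set.seed(3)

# Pretend an observed survey is the entire population
full <- data.frame(group = sample(1:4, 1000, replace = TRUE,
                                  prob = c(0.4, 0.3, 0.2, 0.1)))
full$y <- rnorm(nrow(full), mean = full$group)

target <- mean(full$y)   # the "population" quantity we want to recover

# Fit to a random subsample (cell means stand in for a real model;
# a real multilevel model would also pool information into sparse cells)
sub    <- full[sample(nrow(full), 200), ]
yhat_j <- tapply(sub$y, factor(sub$group, levels = 1:4), mean)
N_j    <- table(full$group)

estimate <- sum(N_j / sum(N_j) * yhat_j)
c(target = target, estimate = estimate)
```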

So if we have a way to predict the responses for the unobserved members of the population, we can make estimates based on non-representative samples. So how do we do this prediction?

Enter Mister P

Mister P (or MRP) is a grand old dame. Since Andrew and Thomas Little  introduced it in the mid-90s, a whole lot of hay has been made from the technique. It stands for Multilevel Regression and Poststratification and it kinda does what it says on the box. 

It uses multilevel regression to predict what unobserved data in each subgroup would look like, and then uses poststratification to fill in the rest of the population values and make predictions about the quantities of interest.

(This is a touch misleading. What it does is estimate the distribution of each subgroup mean and then uses poststratification to turn these into an estimate of the distribution of the mean for the whole population. Mathematically it’s the same thing, but it’s much more convenient than filling in each response in the population.)

And there are scads of literature suggesting that this approach works very well. Especially if the multilevel structure and the group-level predictors are chosen well.

But no method is perfect, and in our paper we take aim at one corner of the framework that can be improved. In particular, we look at the effect that using structured priors within the multilevel regression will have on the poststratified estimates. These changes turn out not to massively change whole-population quantities, but can greatly improve the estimates within subpopulations.

What are the challenges with using multilevel regression in this context?

The standard formulation of Mister P treats each stratifying variable the same (allowing for a varying intercept and maybe some group-specific effects). But maybe not all stratifying variables are created equal.  (But all stratifying variables will be discrete because it is not the season for suffering. November is the season for suffering.)

Demographic variables like gender or race/ethnicity have a number of levels that are more or less exchangeable. Exchangeability has a technical definition, but one way to think about it is that a priori we think that the size of the effect of a particular gender on the response has the same distribution as the size of the effect of another gender on the response (perhaps after conditioning on some things).

From a modelling perspective, we can codify this as making the effect of each level of the demographic variable a different independent draw from the same normal distribution. 

In this setup, information is shared between different levels of the demographic variable because we don’t know what the mean and standard deviation of the normal distribution will be. These parameters are (roughly) estimated using information from the overall effect of that variable (total pooling) and from the variability of the effects estimated independently for each group (no pooling). 

But this doesn’t necessarily make sense for every type of demographic variable. One example that we used in the paper is age, where it may make more sense to pool information more strongly from nearby age groups than from distant age groups. A different example would be something like state, where it may make sense to pool information from nearby states rather than from the whole country.

We can incorporate this type of structured pooling using what we call structured priors in the multilevel model. Structured priors are everywhere: Gaussian processes, time series models (like AR(1) models), conditional autoregressive (CAR) models, random walk priors, and smoothing splines are all commonly used examples.

But just because you can do something doesn’t mean you should. This leads to the question that inspired this work:

When do structured priors help MRP?

Structured priors typically lead to more complex models than the iid varying intercept model that a standard application of the MRP methodology uses. This extra complexity means that we have more space to achieve our goal of predicting the unobserved survey responses.

But as the great sages say: with low power comes great responsibility. 

If the sample size is small or if the priors are set wrong, this extra flexibility can lead to high-variance predictions and will lead to worse estimation of the quantities of interest. So we need to be careful.

As much as I want it to, this isn’t going to turn into a(nother) blog post about priors. But it’s worth thinking about. I’ve written about it at length before and will write about it at length again. (Also there’s the wiki!)

But to get back to the question, the answer depends on how we want to pool information. In a standard multilevel model, we augment the information within each subgroup with whole-population information. For instance, if we are estimating a mean and we have one varying intercept, it’s a tedious algebra exercise to show that

$latex \mathbb{E}(\mu_j \mid y)\approx\frac{\frac{n_j}{\sigma^2} \bar{y}_j+\frac{1}{\tau^2}\bar{y}}{\frac{n_j}{\sigma^2}+\frac{1}{\tau^2}}$,

so we’ve borrowed some extra information from the raw mean of the data $latex \bar{y}$ to augment the local means $latex \bar{y}_j$ when they don’t have enough information.
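As a quick numerical illustration of that formula (treating $latex \sigma$ and $latex \tau$ as known, which of course they aren’t in a real fit), the amount of shrinkage towards the overall mean depends on how much data the subgroup has:

```r
# Partial pooling weight from the formula above (sigma, tau assumed known here)
pooled_mean <- function(ybar_j, n_j, ybar, sigma = 1, tau = 0.5) {
  w <- (n_j / sigma^2) / (n_j / sigma^2 + 1 / tau^2)
  w * ybar_j + (1 - w) * ybar
}

# A subgroup mean of 3 with the overall mean at 0: more data, less shrinkage
pooled_mean(ybar_j = 3, n_j = 2,   ybar = 0)   # heavily shrunk towards 0
pooled_mean(ybar_j = 3, n_j = 200, ybar = 0)   # essentially the local mean
```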

But if our population is severely unbalanced and the different groups have vastly different responses, this type of pooling may not be appropriate.

A canny reader might say “well what if we put weights in so we can shrink to a better estimate of the population mean?”. Well that turns out to be very difficult.

Everybody needs good neighbours (especially when millennials don’t answer the phone)

The solution we went with was to use a random walk prior on the age. This type of prior prioritizes pooling to nearby age categories.  We found that this makes a massive difference to the subpopulation estimates, especially when some age groups are less likely to answer the phone than others.
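To see what this prior structure means, here is a tiny simulation comparing draws from an iid varying-intercept prior with draws from a first-order random-walk prior over age categories; the scales are arbitrary, and a real model would also need something like a sum-to-zero constraint.

```r
set.seed(4)
J <- 60                                  # number of age categories

# iid prior: each age effect is an independent draw, no notion of "nearby"
iid_effects <- rnorm(J, 0, 1)

# RW(1) prior: each effect is the previous one plus noise, so neighbouring
# ages get similar effects and information is pooled locally
rw_effects <- cumsum(rnorm(J, 0, 0.3))

matplot(cbind(iid_effects, rw_effects), type = "l", lty = 1,
        xlab = "age category", ylab = "effect",
        main = "iid vs random-walk prior draws")
```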

We put this all together into a detailed simulation study that showed that you can get some real advantages to doing this!

We also used this technique to analyze some phone survey data from The Annenberg Public Policy Center of the University of Pennsylvania about popular support for marriage equality in 2008. This example was chosen because, even in 2008, young people had a tendency not to answer their phones. Moreover, we expect the support for marriage equality to be different among different age groups.  Things went well. 

How to bin ordinal variables (don’t!)

One of the advantages of our strategy is that we can treat variables like age at their natural resolution (eg year) while modelling, and then predict the distribution of the responses in an aggregated category where we have enough demographic information to do poststratification. 

This breaks an awkward dependence between modelling choices and the assumptions needed to do poststratification. 

Things that are still to be done!

No paper is complete, so there are a few things we think are worth looking at now that we know that this type of strategy works.

  • Model selection: How can you tell which structure is best?
  • Prior choice: Always an issue!
  • Interactions: Some work has been done on using BART with MRP (they call it … BARP). This should cover interaction modelling, but doesn’t really allow for the types of structured modelling we’re using in this paper.
  • Different structures: In this paper, we used an AR(1) model and a second order random walk  model (basically a spline!). Other options include spatial models and Gaussian process models. We expect them to work the same way.

What’s in a name? (AKA the tl;dr)

I (and really no one else) really want to call this Ms P, which would stand for Multilevel Structured regression with Poststratification.

But regardless of the name, the big lessons of this paper are:

  1. Using structured priors allows us to pool information in a more problem-appropriate way than standard multilevel models do, especially when stratifying our population according to an ordinal or spatial variable. 
  2. Structured priors are especially useful when one of the stratifying variables is ordinal (like age) and the response is expected to depend (possibly non-linearly) on this variable.
  3. The gain from using structured priors increases when certain levels of the ordinal stratifying variable are over- or under-sampled. (Eg if young people stop answering phone surveys.)

So go forth and introduce yourself to Ms P. You’ll like her.

I don’t have a clever title but this is an interesting paper

Why do we, as a discipline, have so little understanding of the methods we have created and promote? Our primary tool for gaining understanding is mathematics, which has obvious appeal: most of us trained in math and there is no better form of information than a theorem that establishes a useful fact about a method. But the preceding sentence imposes a heavy burden: it must be possible to prove a theorem and facts established by the theorem must be useful. We find finite-sample facts indispensible because real datasets have finite samples and asymptotic theorems never tell us how to apply their conclusions to finite samples. But finite-sample theorems about contemporary methods are rare; it seems inescapable that they are at least extremely difficult, given their popularity in earlier eras.

This paper considers a complementary tool for opening our black-box methods, modeled explicitly on the approach molecular biologists use to open Nature’s black boxes.

I (Dan) came across a fun and fascinating article by Jim Hodges about how we explain and understand (or, really, how we don’t) statistical models. It falls nicely within the space of things that I’ve been thinking about recently and it is well worth a read.

The focus here is very much on quite simple linear mixed effects models (and maximum likelihood-type fitting of those models) but it’s really about building up a framework for systematically understanding statistical models. The language he uses is that of scientific experiments and it reminds me a lot of how Aki both talks about his computational experiments, and how he writes them up before they’re turned into papers. (See, for example, his R notebooks)

Anyway. I have nothing specific to add except this is a thing that is worth reading and thinking about and building off.

All I need is time, a moment that is mine, while I’m in between

You’re an ordinary boy and that’s the way I like it – Magic Dirt

Look. I’ll say something now, so it’s off my chest. I hate order statistics. I loathe them. I detest them. I wish them nothing but ill and strife. They are just awful. And I’ve spent the last god only knows how long buried up to my neck in them, like Jennifer Connelly forced into the fetid pool at the end of Phenomena.

It would be reasonable to ask why I suddenly have opinions about order statistics. And the answer is weird. It’s because of Pareto Smoothing Importance Sampling (aka PSIS aka the technical layer that makes the loo package work).

The original PSIS paper was written by Aki, Andrew, and Jonah. However there is a brand new sparkly version by Aki, Me, Andrew, Yuling, and Jonah that has added a pile of theory and restructured everything (edit by Aki: link changed to the updated arXiv version). Feel free to read it. The rest of the blog post will walk you through some of the details.

What is importance sampling?

Just a quick reminder for those who don’t spend their life thinking about algorithms. The problem at hand is estimating the expectation $latex I_h=\mathbb{E}(h(\theta))$ for some function h when $latex \theta\sim p(\theta)$.  If we could sample directly from $latex p(\theta)$ then the Monte Carlo estimate of the expectation would be

$latex \frac{1}{S}\sum_{s=1}^Sh(\theta_s)$,  where $latex \theta_s\stackrel{\text{iid}}{\sim}p(\theta)$.

But in a lot of real life situations we have two problems with doing this directly: firstly it is usually very hard to sample from $latex p(\theta)$. If there is a different distribution that we can sample from, say g, then we can use the following modification of the Monte Carlo estimator

$latex I_h^S= \frac{1}{S}\sum_{s=1}^S\frac{p(\theta_s)}{g(\theta_s)}h(\theta_s)$,

 where $latex \theta_s$ are iid draws from $latex g(\theta)$. This is called an importance sampling estimator. The good news is that it always converges in probability to the true expectation. The bad news is that it is a random variable and it can have infinite variance.

The second problem is that often enough we only know the density $latex p(\theta)$ up to a normalizing constant, so if $latex f(\theta)\propto p(\theta)$, then the following self-normalized importance sampler is useful

$latex I_h^S= \frac{\sum_{s=1}^Sr(\theta_s)h(\theta_s)}{\sum_{s=1}^Sr(\theta_s)}$,

where the importance ratios are defined as

$latex r_s=r(\theta_s) = \frac{f(\theta_s)}{g(\theta_s)}$,

where again $latex \theta_s\sim g$. This will converge to the correct answer as long as $latex \mathbb{E}(r_s)<\infty$.  For the rest of this post I am going to completely ignore self-normalized importance samplers, but everything I’m talking about still holds for them.
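Here is a minimal sketch of both estimators for a toy problem where the answer is known: the target is $latex p = N(0,1)$, the proposal is the heavier-tailed $latex g = N(0,2)$, and $latex h(\theta) = \theta^2$, so the true expectation is 1.

```r
set.seed(5)
S <- 10000

# Target p = N(0, 1), proposal g = N(0, 2)
theta <- rnorm(S, 0, 2)
r <- dnorm(theta, 0, 1) / dnorm(theta, 0, 2)   # importance ratios
h <- theta^2

# Ordinary importance sampling estimate of E_p[theta^2] (truth is 1)
mean(r * h)

# Self-normalized version: only needs the target up to a constant
sum(r * h) / sum(r)
```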

So does importance sampling actually work?

Well god I do hope so because it is used a lot. But there’s a lot of stuff to unpack before you can declare something “works”. (That is a lie, of course, all kinds of people are willing to pick a single criterion and, based on that occurring, declaring that it works. And eventually that is what we will do.)

First things first, an importance sampling estimator is a sum of independent random variables. We may well be tempted to say that, by the central limit theorem, it will be asymptotically normal. And sometimes that is true, but only if the importance weights have finite variance. This will happen, for example, if the proposal distribution g has heavier tails than the target distribution p.

And there is a temptation to stop there. To declare that if the importance ratios have finite variance then importance sampling works. That. Is. A. Mistake.

Firstly, this is demonstrably untrue in moderate-to-high dimensions. It is pretty easy to construct examples where the importance ratios are bounded (and hence have finite variance) but there is no feasible number of samples that would give small variance. This is a problem as old as time: just because the central limit theorem says the error will be around $latex \sigma/\sqrt{S}$, that doesn’t mean that $latex \sigma$ won’t be an enormous number.

And here’s the thing: we do not know $latex \sigma$ and our only way to estimate it is to use the importance sampler. So when the importance sampler doesn’t work well, we may not be able to get a decent estimate of the error. So even if we can guarantee that the importance ratios have finite variance (which is really hard to do in most situations), we may end up being far too optimistic about the error.

Chatterjee and Diaconis recently took a quite different route to asking whether an importance sampler converges. They asked what minimum sample size is required to ensure, with high probability, that $latex |I_h^S - I_h|$ is small. They showed that you need approximately $latex \exp(\mathbb{E}[r_s \log(r_s)])$ samples and this number can be large.  This quantity is also quite hard to compute (and they proposed another heuristic, but that’s not relevant here), but it is going to be important later.

Modifying importance ratios

So how do we make importance sampling more robust? A good solution is to somehow modify the importance ratios to ensure they have finite variance. Ionides proposed a method called Truncated Importance Sampling (TIS) where the importance ratios are replaced with truncated weights $latex w_s=\min\{r_s,\tau_S\}$, for some sequence of thresholds $latex \tau_S\rightarrow\infty$ as $latex S\rightarrow\infty$.  The resulting TIS estimator is

$latex I_h^S= \frac{1}{S}\sum_{s=1}^Sw_s h(\theta_s)$.

 

A lot of real estate in Ionides’ paper is devoted to choosing a good sequence of truncations. There’s theory to suggest that it depends on the tail of the importance ratio distribution. But the suggested choice of truncation sequence is $latex \tau_S=C\sqrt{S}$, where C is the normalizing constant of $latex f$, which is one when using ordinary rather than self-normalized importance sampling. (For the self-normalized version, Appendix B suggests taking C as the sample mean of the importance ratios, but the theory only works for deterministic truncations.)

This simple truncation guarantees that TIS is asymptotically unbiased, has finite variance that asymptotically goes to zero, and (with some caveats) is asymptotically normal.
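A sketch of the truncation step on the same sort of toy problem, using the $latex \tau_S = C\sqrt{S}$ rule with $latex C = 1$ for ordinary importance sampling:

```r
set.seed(5)
S <- 10000
theta <- rnorm(S, 0, 2)                        # proposal g = N(0, 2)
r <- dnorm(theta, 0, 1) / dnorm(theta, 0, 2)   # ratios against target p = N(0, 1)
h <- theta^2

# Truncate at tau_S = C * sqrt(S), with C = 1 for ordinary importance sampling
w <- pmin(r, sqrt(S))

mean(w * h)     # TIS estimate of E_p[theta^2] (truth is 1)
```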

But, as we discussed above, none of this actually guarantees that TIS will work for a certain problem. (It does work asymptotically for a vast array of problems and does a lot better than the ordinary importance sampler, but no simple truncation scheme can overcome a poorly chosen proposal distribution. And most proposal distributions in high dimensions are poorly chosen.)

Enter Pareto-Smoothed Importance Sampling

So a few years ago Aki and Andrew worked on an alternative to TIS that would make things even better. (They originally called it the “Very Good Importance Sampling”, but then Jonah joined the project and ruined the acronym.) The algorithm they came up with was called Pareto-Smoothed Importance Sampling (henceforth PSIS, the link is to the three author version of the paper).

They noticed that TIS basically replaces all of the large importance ratios with a single value $latex \tau_S$. Consistent with both Aki and Andrew’s penchant for statistical modelling, they thought they could do better than that (Yes. It’s the Anna Kendrick version. Deal.)

PSIS is based on the idea that, while using the same value for each extreme importance ratio works, it would be even better to model the distribution of extreme importance ratios! The study of distributions of extremes of independent random variables has been an extremely important (and mostly complete) part of statistical theory. This means that we know things.

One of the key facts of extreme value theory is that the distribution of ratios larger than some sufficiently large threshold u  approximately has a generalized Pareto distribution (gPd). Aki, Andrew, and Jonah’s idea was to fit a generalized Pareto distribution to the M largest importance ratios and replace the upper weights with appropriately chosen quantiles of the fitted distribution. (Some time later, I was very annoyed they didn’t just pick a deterministic threshold, but this works better even if it makes proving things much harder.)

They learnt a few things after extensive simulations. Firstly, this almost always does better than TIS (the one example where it doesn’t is example 1 in the revised paper). Secondly, the gPd has two parameters that need to be estimated (the third parameter is an order statistic of the sample, ewwwww). And one of those parameters is extremely useful!

The shape parameter (or tail parameter) of the gPd, which we call k, controls how many moments the distribution has. In particular, a distribution whose upper tail limits to a gPd with shape parameter k has at most $latex k^{-1}$ finite moments. This means that if $latex k<1/2$ then an importance sampler will have finite variance.

But we do not have access to the true shape parameter. We can only estimate it from a finite sample, which gives us $latex \hat{k}$, or, as we constantly write, “k-hat”. The k-hat value has proven to be an extremely useful diagnostic in a wide range of situations. (I mean, sometimes it feels that every other paper I write is about k-hat. I love k-hat. If I was willing to deal with voluntary pain, I would have a k-hat tattoo. I once met a guy with a nabla tattooed on his lower back, but that’s not relevant to this story.)

Aki, Andrew, and Jonah’s extensive simulations showed something that may well have been unexpected: the value of k-hat is a good proxy for the quality of PSIS. (Also TIS, but that’s not the topic). In particular, if k-hat was bigger than around 0.7 it became massively expensive to get an accurate estimate. So we can use k-hat to work out if we can trust our PSIS estimate.

PSIS ended up as the engine driving the loo package in R, which last time I checked had around 350k downloads from the RStudio CRAN mirror. It works for high-dimensional problems and can automatically assess the quality of an importance sampler proposal for a given realization of the importance weights.
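In practice you rarely code this up yourself; something like the following is roughly how it looks with the loo package. I’m assuming the psis(), pareto_k_values(), and weights() interface here, so check the package documentation for the exact current signatures.

```r
library(loo)

set.seed(6)
S <- 10000
theta <- rnorm(S, 0, 2)                       # proposal g = N(0, 2)
log_r <- dnorm(theta, 0, 1, log = TRUE) -     # target p = N(0, 1)
         dnorm(theta, 0, 2, log = TRUE)

fit <- psis(log_r, r_eff = NA)      # Pareto-smooth the importance ratios

pareto_k_values(fit)                # k-hat: worry if it creeps above about 0.7

# Normalized smoothed weights, used in a self-normalized estimate of E[theta^2]
w <- weights(fit, log = FALSE, normalize = TRUE)
sum(w * theta^2)
```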

So PSIS is robust, reliable, useful, has R and Python packages, and the paper was full of detailed computational experiments that showed that it was robust, reliable, and useful even for high dimensional problems. What could possibly go wrong?

What possibly went wrong

Reviewers.

It works, but where is the theory?

I wasn’t an author so it would be a bit weird for me to do a postmortem on the reviews of someone else’s paper. But one of the big complaints was that Aki, Andrew, and Jonah had not shown that PSIS was asymptotically unbiased, had finite vanishing variance, or that it was asymptotically normal.

(Various other changes of emphasis or focus in the revised version are possibly also related to reviewer comments from the previous round, but also to just having more time.)

These things turn out to be tricky to show. So Aki, Andrew, and Jonah invited me and Yuling along for the ride.

The aim was to restructure the paper, add theory, and generally take a paper that was very good and complete and add some sparkly bullshit. So sparkly bullshit was added. Very slowly (because theory is hard and I am not good at it).

Justifying k-hat < 0.7

Probably my favourite addition to the paper is due to Yuling, who read the Chatterjee and Diaconis  paper and noticed that we could use their lower bound on sample size to justify k-hat. The idea is that it is the tail of $latex r_s$ that breaks the importance sampler. So if we make the assumption that the entire distribution of $latex r_s$ is generalized Pareto with shape parameter k, we can actually compute the minimum sample size for a particular accuracy from ordinary importance sampling. This is not an accurate sample size calculation, but should be ok for an order-of-magnitude calculation.

The first thing we noticed is that, consistent with the already existing experiments, the error in importance sampling (and TIS and PSIS) increases smoothly as k passes 0.5 (in particular the finite-sample behaviour does not fall off a cliff the moment the variance isn’t finite). But the minimum sample size starts to increase very rapidly as soon as k gets bigger than about 0.7. This is consistent with the experiments that originally motivated the 0.7 threshold and suggests (at least to me) that there may be something fundamental going on here.

We can also use this to justify the threshold on k-hat as follows. The method Aki came up with for estimating k-hat is (approximately) Bayesian, so we can interpret k-hat as a value selected so that the data is consistent with M independent samples from a gPd with shape parameter k-hat. So a k-hat value that is bigger than 0.7 can be interpreted loosely as saying that the extreme importance ratios could have come from a distribution that has a tail that is too heavy for PSIS to work reliably.

This is what actually happens in high dimensions (for an example we have with bounded ratios and hence finite variance). With a reasonable sample size, the estimator for k-hat simply cannot tell that the distribution of extreme ratios has a large but finite variance rather than an infinite variance. And this is exactly what we want to happen! I have no idea how to formalize this intuition, but nevertheless it works.

So order statistics

It turned out that–even though it is quite possible that other people would not have found proving unbiasedness and finite variance hard–I found it very hard. Which is quite annoying because the proof for TIS was literally 5 lines.

What was the trouble? Aki, Andrew, and Jonah’s decision to choose the threshold as the Mth largest importance ratio. This means that the threshold is an order statistic and hence is not independent of the rest of the sample. So I had to deal with that.

This meant I had to read an absolute tonne of papers about order statistics. These papers are dry and technical and were all written between about 1959 and 1995 and at some later point poorly scanned and uploaded to JSTOR. And they rarely answered the question I wanted them to. So basically I am quite annoyed with order statistics.

But the end point is that, under some conditions, PSIS is asymptotically unbiased and has finite, vanishing variance.

The conditions are a bit weird, but are usually going to be satisfied.  Why are they weird? Well…

PSIS is TIS with an adaptive threshold and bias correction

In order to prove asymptotic properties of PSIS, I used the following representation of the PSIS estimator

$latex \frac{1}{S}\sum_{s=1}^{S}\min\left(r(\theta_s),r(\theta_{(S-M+1):S})\right)h(\theta_s)+\frac{1}{S}\sum_{m=1}^M\tilde{w}_mh(\theta_{S-M+m}),$

where the samples $latex \theta_s$ have been ordered so that $latex r(\theta_1)\leq r(\theta_2)\leq\ldots\leq r(\theta_S)$ and the weights $latex \tilde{w}_m$ are deterministic (and given in the paper). They are related to the quantile function for the gPd.

The first term is just TIS with random threshold $latex \tau_S=r(\theta_{(S-M+1):S})$, while the second term is an approximation to the bias. So PSIS has higher variance than TIS (because of the random truncation), but lower bias (because of the second term) and this empirically usually leads to lower mean-square error than TIS.

But that random truncation is automatically adapted to the tail behaviour of the importance ratios, which is an extremely useful feature!

This representation also gives hints as to where the ugly conditions come from. Firstly, anything that is adaptive is much harder to prove things about than a non-adaptive method, and the technical conditions that we need to be able to adapt our non-adaptive proof techniques are often quite esoteric. The idea of the proof is to show that, conditional on $latex r(\theta_{(S-M+1):S})=U$, all of the relevant quantities go to zero (or are finite) with some explicit dependence on U. The proof of this is very similar to the TIS proof (and would be exactly the same if the second term wasn’t there).

Then we need to let $latex U$ vary and hope it doesn’t break anything. The technical conditions can be split into the ones needed to ensure $latex r(\theta_{(S-M+1):S})=U$ behaves itself as $latex S$ gets big; the ones needed to ensure that $latex h(\theta)$ doesn’t get too big when the importance ratios are large; and the ones that control the last term.

Going in reverse order, to ensure the last term is well behaved we need that h is square-integrable with respect to the proposal g in addition to the standard assumption that it is square-integrable with respect to the target p.

We need to put growth conditions on $latex h$ because we are only modifying the ratios, which does not help if h is also enormous out in the tails. These conditions are actually very easy to satisfy for most problems I can think of, but almost certainly there’s someone out there with an h that grows super-exponentially just waiting to break PSIS.

The final conditions are just annoying. They are impossible to verify in practice, but there is a 70 year long literature that coos reassuring phrases like “this almost always holds” into our ears. These conditions are strongly related to the conditions needed to estimate k correctly (using something like the Hill estimator). My guess is that these conditions are not vacuous, but are relatively unimportant for finite samples, where the value of k-hat should weed out the catastrophic cases.

What’s the headline

With some caveats, PSIS is asymptotically unbiased; has finite, vanishing variance; and a variant of it is asymptotically normal as long as the importance ratios have more than $latex (1+\delta)$-finite moments. But it probably won’t be useful unless it has at least 1/0.7 = 1.43 moments.

And now we send it back off into the world and see what happens.

Against Arianism 3: Consider the cognitive models of the field

“You took my sadness out of context at the Mariners Apartment Complex” – Lana Del Rey

It’s sunny, I’m in England, and I’m having a very tasty beer, and Lauren, Andrew, and I just finished a paper called The experiment is just as important as the likelihood in understanding the prior: A cautionary note on robust cognitive modelling.

So I guess it’s time to resurrect a blog series. On the off chance that any of you have forgotten, the Against Arianism series focusses on the idea that, in the same way that Arianism1 was heretical, so too is the idea that priors and likelihoods can be considered separately. Rather, they are consubstantial–built of the same probability substance.

There is no new thing under the sun, so obviously this has been written about a lot. But because it’s my damn blog post, I’m going to focus on a paper Andrew, Michael, and I wrote in 2017 called The Prior Can Often Only Be Understood in the Context of the Likelihood. This paper was dashed off in a hurry and under deadline pressure, but I quite like it. But it’s also maybe not the best place to stop the story.

An opportunity to comment

A few months back, the fabulous Lauren Kennedy was visiting me in Toronto on a different project. Lauren is a postdoc at Columbia working partly on complex survey data, but her background is quantitative methods in psychology. Among other things, we saw a fairly regrettable (but excellent) Claire Denis movie about vampires2.

But that’s not relevant to the story. What is relevant was that Lauren had seen an open invitation to write a comment on a paper in Computational Brain & Behaviour about Robust3 Modelling in Cognitive Science written by a team of cognitive scientists and researchers in scientific theory, philosophy, and practice (Michael Lee, Amy Criss, Berna Devezer, Christopher Donkin, Alexander Etz, Fábio Leite, Dora Matzke, Jeffrey Rouder, Jennifer Trueblood, Corey White, and Joachim Vandekerckhove).

Their bold aim to sketch out the boundaries of good practice for cognitive modelling (and particularly for the times where modelling meets data) is laudable, not least because such an endeavor will always be doomed to fail in some way. But the act of stating some ideas for what constitutes best practice gives the community a concrete pole to hang this important discussion on. And Computational Brain & Behaviour recognized this and decided to hang an issue off the paper and its discussions.

The paper itself is really thoughtful and well done. And obviously I do not agree with everything in it, but that doesn’t stop me from the feeling that wide-spread adoption of  their suggestions would definitely make quantitative research better.

But Lauren noticed one tool that we have found extremely useful that wasn’t mentioned in the paper: prior predictive checks. She asked if I’d be interested in joining her on a paper, and I quickly said yes!

It turns out there is another BART

The best thing about working with Lauren on this was that she is a legit psychology researcher so she isn’t just playing in someone’s back yard, she owns a patch of sand. It was immediately clear that it would be super-quick to write a comment that just said “you should use prior predictive checks”. But that would miss a real opportunity. Because cognitive modelling isn’t quite the same as standard statistical modelling (although in the case where multilevel models are appropriate Daniel  Schad, Michael Betancourt, and  Shravan Vasishth just wrote an excellent paper on importing general ideas of good statistical workflows into Cognitive applications).

Rather than using our standard data analysis models, a lot of the time cognitive models are generative models for the cognitive process coupled (sometimes awkwardly) with models for the data that is generated from a certain experiment. So we wanted an example model that is more in line with this practice than our standard multilevel regression examples.

Lauren found the Balloon Analogue Risk Task (BART) in Lee and Wagenmakers’ book Bayesian Cognitive Modeling: A Practical Course, which conveniently has Stan code online4. We decided to focus on this example because it’s fairly easy to understand and has all the features we needed. But hopefully we will eventually write a longer paper that covers more common types of models.

BART is an experiment that makes participants simulate pumping balloons with some fixed probability of popping after every pump. Every pump gets them more money, but they get nothing if the balloon pops. The model contains a parameter ($latex \gamma^+$) for risk taking behaviour and the experiment is designed to see if the risk taking behaviour changes as a person gets more drunk.  The model is described in the following DAG:

Exploring the prior predictive distribution 

Those of you who have been paying attention will notice the Uniform(0,10) priors on the logit scale and think that these priors are a little bit terrible. And they are! Direct simulation from the model leads to absolutely silly predictive distributions for the number of pumps in a single trial. Worse still, the pumps are extremely uniform across trials. Which means that the model thinks, a priori, that it is quite likely for a tipsy undergraduate to pump a balloon 90 times in each of the 20 trials. The mean number of pumps is a much more reasonable 10.
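This isn’t the BART model itself (which has several more moving parts), but the core of the prior predictive exercise is easy to sketch: push the Uniform(0,10) logit-scale prior through the inverse logit and look at what it implies on the probability scale.

```r
set.seed(7)

# Not the BART model: just what a Uniform(0, 10) prior on a logit-scale
# parameter implies once you push it through the inverse logit
eta <- runif(10000, 0, 10)
p   <- plogis(eta)

# Most of the prior mass ends up crammed up against 1
quantile(p, c(0.1, 0.5, 0.9))
hist(p, breaks = 50, main = "Implied prior on the probability scale")
```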

Choosing tighter upper bounds on the uniform priors leads to more sensible prior predictive distributions, but then Lauren went to test out what changes this made to inference (in particular looking at how it affects the Bayes factor against the null that the $latex \gamma^+$ parameters were the same across different levels of drunkenness). It made very little difference.  This seemed odd so she started looking closer.

Where is the p? Or, the Likelihood Principle gets in the way

So what is going on here? Well the model described in Lee and Wagenmakers’ book is not a generative model for the experimental data. Why not? Because the balloon sometimes pops! But because in this modelling setup the probability of explosion is independent of the number of pumps, this explosive possibility only appears as a constant in the likelihood.

The much lauded Likelihood Principle tells us that we do not need to worry about these constants when we are doing inference. But when we are trying to generate data from the prior predictive distribution, we really need to care about these aspects of the model.

Once the context on the experiment is taken into account, the prior predictive distributions change a lot.

Context is important when taking statistical methods into new domains

Prior predictive checks are really powerful tools. They give us a way to set priors, they give us a way to understand what our model does, they give us a way to generate data that we can use to assess the behaviour of different model comparison tools under the experimental design at hand. (Neyman-Pearson acolytes would talk about power here, but the general question lives on beyond that framework).

Modifications of prior predictive checks should also be used to assess how predictions, inference, and model comparison methods behave under different but realistic deviations from the assumed generative model. (One of the points where I disagree with Lee et al.'s paper is the claim that it’s enough to just pre-register model comparison methods. We also need some sort of simulation study to know how they work for the problem at hand!)

But prior predictive checks require understanding of the substantive field as well as understanding of how the experiment was performed. And it is not always as simple as just predict y!

Balloons pop. Substantive knowledge may only be about contrasts or combinations of predictions. We need to always be aware that it’s a lot of work to translate a tool to a new scientific context. Even when that tool  appears to be as straightforward to use and as easy to explain as prior predictive checks.

And maybe we should’ve called that paper The Prior Can Often Only Be Understood in the Context of the Experiment.

Endnotes:

1 The fourth century Christian heresy that posited that Jesus was created by God and hence was not of the same substance. The council of Nicaea ended up writing a creed to stamp that one out.

2 Really never let me choose the movie. Never.

3 I hate the word “robust” here. Robust against what?! The answer appears to be “robust against un-earned certainty”, but I’m not sure. Maybe they want to Winsorize cognitive science?

4 Lauren had to twiddle it a bit, particularly using a non-centered parameterization to eliminate divergences.

Maybe it’s time to let the old ways die; or We broke R-hat so now we have to fix it.

“Otto eye-balled the diva lying comatose amongst the reeds, and he suddenly felt the fire of inspiration flood his soul. He ran back to his workshop where he futzed and futzed and futzed.” –Bette Midler

Andrew was annoyed. Well, annoyed is probably too strong a word. Maybe a better way to start is with The List. When Andrew, Aki, and I work together we have The List of projects that need to be done, and not every item on this list is weighted the same by all of us.

The List has longer term ideas that we’ve been slouching towards, projects that have stalled, small ideas that have room to blossom, and then there’s my least favourite part. My least favourite part of The List is things that are finished but haven’t been written up as papers yet. This is my least favourite category because I never think coercing something into a paper is going to be any fun.

Regular readers of my blogs probably realize that I am frequently and persistently wrong.

But anyway. It was day one of one of Aki and my visits to Columbia and we were going through The List. Andrew was pointing out a project that had been sitting on The List for a very very long time. (Possibly since before I was party to The List.) And he wanted it off The List.

(Side note: this is the way all of our projects happen. Someone suddenly wants it done enough that it happens. Otherwise it stays marinating on The List.)

So let’s start again.

Andrew wanted a half-finished paper off The List and he had for a while. Specifically, the half-finished paper documenting the actual way that the Stan project computes $latex \widehat{R}$ (aka the Potential Scale Reduction Factor or, against our preference of not naming things after people, the Gelman-Rubin(-Brooks) statistic). So we agreed to finish it and then moved on to some more exciting stuff.

But then something bad happened: Aki broke R-hat and we had to work out how to fix it.

The preprint is here. There is an extensive online appendix here. The paper is basically our up to date “best practice” guide to monitoring convergence for general MCMC algorithms. When combined with the work in our recent visualization paper, and two of Michael Betancourt’s case studies (one and two), you get our best practice recommendations for Stan. All the methods in this paper are available in a github repo and will be available in future versions of the various Stan and Stan-adjacent libraries.

What is R-hat?

R-hat, or the potential scale reduction factor, is a diagnostic that attempts to measure whether or not an MCMC algorithm1 has converged (or, more precisely, to flag situations where the MCMC algorithm has failed to converge). The basic idea is that you want to check a couple of things:

  1. Is the distribution of the first part of a chain (after warm up) the same as the distribution of the second half of the chain?
  2. If I start the algorithm at two different places and let the chain warm up, do both chains have the same distribution?

If one of these two checks fails to hold, then your MCMC algorithm has probably not converged. If both checks pass, it is still possible that the chain has problems.

Historically, there was a whole lot of chat about whether or not you need to run multiple chains to compute R-hat. To summarize that extremely long conversation: you do. Why? To paraphrase the great statistician Vanessa Williams:

Sometimes the snow comes down in June
Sometimes the sun goes ’round the moon
Sometimes the posterior is multimodal
Sometimes the adaptation you do during warm up is unstable

Also it’s 2019 and all of our processors are multithreaded so just do it.

The procedure, which is summarized in the paper (and was introduced in BDA3), computes a single number summary and we typically say our Markov Chains have not converged if $latex \widehat{R} > 1.01$.
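For concreteness, here is a base-R sketch of that split-chain R-hat computation (the BDA3-style version being described here, not the new one that shows up later in this post); draws is an iterations-by-chains matrix.

```r
# Basic split-R-hat: split each chain in half, then compare the
# between-"chain" variance to the within-"chain" variance (BDA3-style)
rhat_basic <- function(draws) {           # draws: iterations x chains matrix
  n <- floor(nrow(draws) / 2)
  split <- cbind(draws[1:n, , drop = FALSE],
                 draws[(n + 1):(2 * n), , drop = FALSE])
  W <- mean(apply(split, 2, var))         # within-chain variance
  B <- n * var(colMeans(split))           # between-chain variance
  var_plus <- (n - 1) / n * W + B / n     # marginal variance estimate
  sqrt(var_plus / W)
}

# Four well-behaved chains: R-hat should be close to 1
set.seed(8)
chains <- matrix(rnorm(4000), ncol = 4)
rhat_basic(chains)

# One chain stuck in the wrong place: R-hat should be well above 1.01
chains[, 1] <- chains[, 1] + 3
rhat_basic(chains)
```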

A few things before we tear the whole thing down.

  1. Converging to the stationary distribution is the minimum condition required for an MCMC algorithm to be useful. R-hat being small doesn’t mean the chain is mixing well, so you need to check the effective sample size!
  2. R-hat is a diagnostic and not a proof of convergence. You still need to look at all of the other things (like divergences and BFMI in Stan) as well as diagnostic plots (more of which are in the paper).
  3. The formula for R-hat in BDA3 assumes that the stationary distribution has finite variance. This is a very hard property to check from a finite sample.

The third point is how Aki broke R-hat.

R-hat is a diagnostic not a hypothesis test

A quick (added after posting) digression here about how you should treat R-hat. Fundamentally, it is not a formal check to see if an MCMC algorithm has reached its stationary distribution. And it is definitely not a hypothesis test!

A better way of thinking about R-hat is like the fuel light in a car: if the fuel light is on, get some petrol. If the fuel light isn’t on, look at the fuel gauge, how far you’re gone since your last refill, etc.

Similarly, if R-hat is bigger than 1.01, your algorithm probably isn’t working. If R-hat is less than this fairly arbitrary threshold, you should look at the effective sample size (the fuel gauge), divergences, appropriate plots, and everything else to see if there are any other indicators that the chain is misbehaving. If there aren’t, then you’ve done your best to check for convergence problems, so you might as well hold your breath and hope for the best.

But what about the 1.01 threshold? Canny readers will look at the author list and correctly assume that it’s not a high quantile of the distribution of R-hat under some null hypothesis. It’s actually just a number that’s bigger than one but not much bigger than one. Some earlier work suggested bigger thresholds like 1.1, but Vats and Knudson give an excellent argument about why that number is definitely too big. They suggest making a problem-dependent threshold for R-hat to take into account its link with effective sample size, but we prefer just to look at the two numbers separately, treating R-hat as a diagnostic (like the fuel light) and the effective sample size estimate like the fuel gauge.

So to paraphrase Flight of the Conchords: a small R-hat value is not a contract, but it’s very nice.

When does R-hat break?

The thing about turning something into a paper is that you need to have a better results section than you typically need for other purposes. So Aki went off and did some quick simulations and something bad happened. When he simulated from four chains where one had the wrong variance, the R-hat value was still near one. So R-hat was not noticing that the variance was wrong. (This is the top row of the above figure. The left column has one bad chain; the right column has four correct chains.)

On the other hand, R-hat was totally ok at noticing when one chain had the wrong location parameter. Except for the bottom row of the figure, where the target distribution is Cauchy.

So we noticed two things:

  1. The existing R-hat diagnostic was only sensitive to errors in the first moment.
  2. The existing R-hat diagnostic failed catastrophically when the variance was infinite.

(Why “catastrophically”? Because it always says the chain is good!)

So clearly we could no longer just add two nice examples to the text that was written and then send the paper off. So we ran back to our workshop where we futzed and futzed and futzed.

He tried some string and paper clips…

Well, we2 came up with some truly terrible ideas but eventually circled around to two observations:

  1. Folding the draws by computing $latex \zeta^{(nm)}=\left|\theta^{(nm)}-\mathrm{median}(\theta)\right|$ and computing R-hat on the folded chain gives a statistic that is sensitive to changes in scale.
  2. Rank-based methods are robust against fat tails. So perhaps we could rank-normalize the chain (i.e. compute the rank for each draw within the total pool of draws and replace the raw value with the quantile of a standard normal that corresponds to that rank).

Putting these two together, we get our new R-hat value: after rank-normalizing the chains, compute the standard R-hat and the folded R-hat and report the maximum of the two values.  These are the blue histograms in the picture.
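
Here is a rough sketch of that recipe in R, reusing the `split_rhat` function and the example `draws` matrix from the sketch above. The rank-to-normal step uses a Blom-type offset inside `qnorm`; the exact constants and the details of the fold may differ slightly from what ends up in the paper and the Stan libraries, so treat this as illustrative rather than definitive.

    # Rank-normalize: pool all draws, rank them, and push the ranks through the
    # normal inverse CDF (the 3/8 and 1/4 offsets keep the z-scores finite).
    rank_normalize <- function(draws) {
      r <- rank(as.vector(draws), ties.method = "average")
      z <- qnorm((r - 3/8) / (length(r) + 1/4))
      matrix(z, nrow = nrow(draws), ncol = ncol(draws))
    }

    # Fold around the median so that differences in scale show up as
    # differences in location, which split-R-hat can detect.
    fold <- function(draws) abs(draws - median(draws))

    # New R-hat: the larger of the rank-normalized ("bulk") R-hat and the
    # rank-normalized folded R-hat.
    new_rhat <- function(draws) {
      max(split_rhat(rank_normalize(draws)),
          split_rhat(rank_normalize(fold(draws))))
    }

    new_rhat(draws)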

There are two advantages of doing this:

  1. The new value of R-hat is robust against heavy tails and is sensitive to changes in scale between the chains.
  2. The new value of R-hat is parameterization invariant, which is to say that the R-hat value for $latex \theta$ and $latex \log(\theta)$ will be the same. This was not a property of the original formulation.

What does the rank-normalization actually do?

Great question, imaginary bold-face font person! The intuitive answer is that it is computing the R-hat value for the nicest transformation of the parameter (where nicest is problematically defined to be “most normal”). So what does this R-hat tell you? It tells you whether the MCMC algorithm has failed to converge even after we strip away all of the problems with heavy tails and skewness and all that jazz. Similarly, we can compute an effective sample size (ESS) for this nice scenario. This is the best-case-scenario R-hat and ESS. If it isn’t good, we have no hope.

Assuming the rank-normalized and folded R-hat is good and the rank-normalized ESS is good, it is worth investigating the chain further.

R-hat does not give us all the information we need to assess if the chain is useful

The old version of R-hat basically told us if the mean was ok. The new version tells us if the median and the MAD are ok. But that’s not the only thing we usually report. Typically a posterior is summarized by a measure of centrality (like the median) and a quantile-based uncertainty interval (such as the 0.025 and 0.975 quantiles of the posterior). We need to check that these quantiles are computed correctly!

This is not trivial: MCMC algorithms do not explore the tails as well as the bulk. This means that the Monte Carlo error in the quantiles can be much higher than the Monte Carlo error in the mean. To deal with this, we introduced a localized measure of quantile efficiency, which is basically an effective sample size for computing a quantile. Here’s an example from the online appendix, where the problem is sampling from a Cauchy distribution using the “nominal” parameterization. You can see that it’s possible for the central quantiles to be resolved well while the extreme quantile estimates are very noisy.
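
To make the idea concrete, here is a deliberately crude sketch: estimate a quantile-specific ESS as the ESS of the indicator “this draw is below the quantile of interest”. The ESS estimator below just truncates the autocorrelation sum at the first negative value, which is much cruder than the multi-chain estimator in the paper, and the random-walk Metropolis sampler on a Cauchy target is only there to generate a chain with badly behaved tails.

    # Very crude single-chain ESS: truncate the autocorrelation sum at the
    # first negative autocorrelation. Not the estimator from the paper.
    crude_ess <- function(x) {
      rho <- acf(x, lag.max = min(length(x) - 1, 1000), plot = FALSE)$acf[-1]
      first_neg <- which(rho < 0)[1]
      if (is.na(first_neg)) first_neg <- length(rho) + 1
      length(x) / (1 + 2 * sum(rho[seq_len(first_neg - 1)]))
    }

    # Quantile-specific ESS: the ESS of the indicator "draw <= q-th quantile".
    quantile_ess <- function(x, q) crude_ess(as.numeric(x <= quantile(x, q)))

    # A random-walk Metropolis chain targeting a standard Cauchy: the bulk is
    # fine, but tail excursions are rare and long.
    set.seed(1)
    n <- 20000
    x <- numeric(n)
    for (i in 2:n) {
      prop <- x[i - 1] + rnorm(1)
      x[i] <- if (runif(1) < dcauchy(prop) / dcauchy(x[i - 1])) prop else x[i - 1]
    }

    # In most runs the ESS for the extreme quantiles is far smaller than for the median.
    sapply(c(0.025, 0.5, 0.975), function(q) quantile_ess(x, q))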

Maybe it’s time to let traceplots die 

As we are on somewhat of a visualization kick around these parts, let’s talk about traceplots of MCMC algorithms. They’re terrible. If the chain is long, all the interesting information is compressed, and if you try to include information from multiple chains it just becomes a mess. So let us propose an alternative: rank plots.

The idea is that if the chains are all mixing well and exploring the same distribution, the ranks should be uniformly distributed. In the following figure, 4 chains of 1000 draws are plotted and you can easily see that the histograms are not uniform. Moreover, the histogram for the first chain clearly never visits the left tail of the distribution, which is indicative of a funnel. This would be harder to see with 4 traceplots plotted over each other.
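
Rank plots are also easy to roll yourself while you wait for them to show up in bayesplot; this sketch again assumes an iterations-by-chains matrix like the `draws` example above, and just uses base graphics.

    # Rank plots: rank all draws across the pooled chains, then draw one
    # histogram of ranks per chain. If the chains agree, each histogram
    # should look roughly uniform.
    plot_ranks <- function(draws, bins = 20) {
      r <- matrix(rank(as.vector(draws), ties.method = "average"),
                  nrow = nrow(draws), ncol = ncol(draws))
      old_par <- par(mfrow = c(1, ncol(draws)))
      for (m in seq_len(ncol(draws))) {
        hist(r[, m], breaks = bins, main = paste("chain", m), xlab = "pooled rank")
      }
      par(old_par)
    }

    plot_ranks(draws)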

How to use these new tools in practice

To close off, here are our recommendations (taken directly from the paper) for using R-hat. All of the methods in this paper will make their way into future versions of RStan, rstanarm, and bayesplot (as well as all the other places we put things).

In this section [of the paper] we lay out practical recommendations for using the tools developed in this paper. In the interest of specificity, we have provided numerical targets for both R-hat and effective sample size (ESS). However, these values should be adapted as necessary for the given application.

In Section 4, we propose modifications to R-hat based on rank-normalizing and folding the posterior draws. We recommend running at least four chains by default and only using the sample if R-hat < 1.01. This is a much tighter threshold than the one recommended by Gelman and Rubin (1992), reflecting lessons learnt over more than 25 years of use.

Roughly speaking, the ESS of a quantity of interest captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm. Clearly, the higher the ESS the better. When there might be difficulties with mixing, it is important to use between-chain information in computing the ESS. For instance, in the sorts of funnel-shaped distributions that arise with hierarchical models, differences in step size adaptation can lead chains to have different behavior in the neighborhood of the narrow part of the funnel. For multimodal distributions with well-separated modes, the split-R-hat adjustment leads to an ESS estimate that is close to the number of distinct modes that are found. In this situation, ESS can be drastically overestimated if computed from a single chain.

As Vats and Knudson (2018) note, a small value of R-hat is not enough to ensure that an MCMC procedure is useful in practice. It also needs to have a sufficiently large effective sample size. As with R-hat, we recommend computing the ESS on the rank-normalized sample. This does not directly compute the ESS relevant for computing the mean of the parameter, but instead computes a quantity that is well defined even if the chains do not have finite mean or variance. Specifically, it computes the ESS of a sample from a normalized version of the quantity of interest, using the rank transformation followed by the normal inverse-cdf. This is still indicative of the effective sample size for computing an average, and if it is low the computed expectations are unlikely to be good approximations to the actual target expectations. We recommend requiring that the rank-normalized ESS is greater than 400. When running four chains, this corresponds to having a rank-normalized effective sample size of at least 50 per split chain.

Only when the rank-normalized and folded R-hat values are less than the prescribed threshold and the rank-normalized ESS is greater than 400 do we recommend computing the actual (not rank-normalized) effective sample size for the quantity of interest. This can then be used to assess the Monte Carlo standard error (MCSE) for the quantity of interest.

Finally, if you plan to report quantile estimates or posterior intervals, we strongly suggest assessing the behaviour of the chains for these quantiles. In Section 4.3 we show that convergence of Markov chains is not uniform across the parameter space and propose diagnostics and effective sample sizes specifically for extreme quantiles. This is different from the standard ESS estimate (which we refer to as the “bulk-ESS”), which mainly assesses how well the centre of the distribution is resolved. Instead, these “tail-ESS” measures allow the user to estimate the MCSE for interval estimates.

 

Footnotes:

1 R-hat can be used more generally for any iterative simulation algorithm that targets a stationary distribution, but its main use case is MCMC.
2 I. (Other people’s ideas were good. Mine were not.)

Limitations of “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection”

“If you will believe in your heart and confess with your lips, surely you will be saved one day” – The Mountain Goats paraphrasing Romans 10:9

One of the weird things about working with people a lot is that it doesn’t always translate into multiple opportunities to see them talk. I’m pretty sure the only time I’ve seen Andrew talk was at a fancy lecture he gave at Columbia. He talked about many things that day, but the one that stuck with me (because I’d not heard it phrased that well before; as a side-note, this is my memory of the gist of what he was saying, so do not hold him to this opinion!) was that the problem with p-values and null-hypothesis significance testing wasn’t so much that the procedure was bad. The problem is that people are taught to believe that there exists a procedure that can, given any set of data, produce a “yes/no” answer to a fairly difficult question. So the problem isn’t the specific decision rule that NHST produces, so much as the idea that a universally applicable decision rule exists at all. (And yes, I know the maths. But the problem with p-values was never the maths.)

This popped into my head again this week as Aki, Andrew, Yuling, and I were working on a discussion to Gronau and Wagenmakers’ (GW) paper “Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection”.

Our discussion is titled “Limitations of ‘Limitations of Bayesian Leave-One-Out Cross-Validation for Model Selection'” (published version) and it extends various points that Aki and I have made at various points on this blog.

To summarize our key points:

  1. It is a bad thing for GW to introduce LOO model selection in a way that doesn’t account for its randomness. In their very specialized examples this turns out not to matter because they choose such odd data that the LOO estimates have zero variance. But it is nevertheless bad practice.
  2. Stacking is a way to get model weights that is more in line with the LOO-predictive concept than GW’s ad hoc pseudo-BMA weights (there’s a quick sketch of how to compute both after this list). Although stacking is also not consistent for nested models, in the cases considered in GW’s paper it consistently picks the correct model. In fact, the model weight for the true model in each of their cases is $latex w_0=1$ independent of the number of data points.
  3. By not recognizing this, GW missed an opportunity to discuss the limitations of the assumptions underlying LOO (namely that the observed data is representative of the future data, and each individual data point is conditionally exchangeable).  We spent some time laying these out and proposed some modifications to their experiments that would make these limitations clearer.
  4. Because LOO is formulated under much weaker assumptions than those used in this paper (in particular, LOO does not assume that the data are generated by one of the models under consideration, the so-called “M-Closed” assumption), it is a little odd that GW only assess its performance under this assumption. This assumption almost never holds. If you’ve ever used the famous George Box quote, you’ve explicitly stated that the M-Closed assumption does not hold!
  5. GW’s assertion that when two models can represent identical distributions (such as in the case of nested models), the simpler model should be preferred is not a universal truth, but rather a specific choice that is being made. This can be enforced for LOO methods, but like all choices in statistical modelling, it shouldn’t be made automatically or by authority; it should instead be critically assessed in the context of the task being performed.

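Here is the promised sketch of how the two weighting schemes get computed in practice. It assumes the interface of the loo R package (loo() on a draws-by-observations log-likelihood matrix, and loo_model_weights() with method = "stacking" or "pseudobma"); the toy conjugate-normal setup is just for illustration and is not the exact simulation from either paper.

    # Toy nested-normal setup: y ~ N(mu, 1). Model 1 gives mu a tight N(0, 0.1^2)
    # prior (a near-point null); model 2 gives mu a N(0, 1) prior.
    library(loo)
    set.seed(1)
    N <- 100
    y <- rnorm(N, mean = 0.5, sd = 1)
    S <- 4000

    post_draws <- function(prior_sd) {
      prec <- N + 1 / prior_sd^2                # conjugate posterior precision for mu
      rnorm(S, mean = sum(y) / prec, sd = sqrt(1 / prec))
    }
    log_lik_matrix <- function(mu_draws) {      # S x N pointwise log-likelihoods
      sapply(y, function(yi) dnorm(yi, mean = mu_draws, sd = 1, log = TRUE))
    }

    loo1 <- loo(log_lik_matrix(post_draws(0.1)))
    loo2 <- loo(log_lik_matrix(post_draws(1)))

    loo_model_weights(list(loo1, loo2), method = "stacking")   # LOO-stacking weights
    loo_model_weights(list(loo1, loo2), method = "pseudobma")  # pseudo-BMA, for contrast
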
All of this has made me think about the idea of doing model selection. Or, more specifically, it’s made me question whether or not we should try to find universal tools for solving this problem. Is model selection even possible? (Danielle Navarro from UNSW has a particularly excellent blog post outlining her experiences with various existing model selection methods that you all should read.)

So I guess my very nebulous view is that we can’t do model selection, but we can’t not do model selection, but we also can’t not not do model selection.

In the end we need to work out how to do model selection for specific circumstances and to think critically about our assumptions. LOO helps us do some of that work.

To close off, I’m going to reproduce the final section of our paper because what’s the point of having a blog post (or writing a discussion) if you can’t have a bit of fun.

Can you do open science with M-Closed tools?

One of the great joys of writing a discussion is that we can pose a very difficult question that we have no real intention of answering. The question that is well worth pondering is the extent to which our chosen statistical tools influence how scientific decisions are made. And it’s relevant in this context because a key difference between model selection tools based on LOO and tools based on marginal likelihoods is what happens when none of the models could reasonably have generated the data.

In this context, marginal likelihood-based model selection tools will, as the amount of data increases, choose the model that best represents the data, even if it doesn’t represent the data particularly well. LOO-based methods, on the other hand, are quite comfortable expressing that they cannot determine a single model that should be selected. To put it more bluntly, marginal likelihood will always confidently select the wrong model, while LOO is able to express that no one model is correct.

We leave it for each individual statistician to work out how the shortcomings of marginal likelihood-based model selection balance with the shortcomings of cross-validation methods. There is no simple answer.

Against Arianism 2: Arianism Grande

“There’s the part you’ve braced yourself against, and then there’s the other part” – The Mountain Goats

My favourite genre of movie is Nicole Kidman in a questionable wig. (Part of the sub-genre founded by Sarah Paulson, who is the patron saint of obvious wigs.) And last night I was in the same room* as Nicole Kidman to watch a movie where I swear she’s wearing the same wig Dolly did in Steel Magnolias. All of which is to say that Boy Erased is absolutely brilliant and you should all rush out to see it when it’s released. (There are many good things about it that aren’t Nicole Kidman’s wig.) You will probably cry.

When I eventually got home, I couldn’t stop thinking of one small moment towards the end of the film where one character makes a peace overture to another character that is rejected as being insufficient after all that has happened. And it very quietly breaks the first character’s heart. (Tragically this is followed by the only bit of the movie that felt over-written and over-acted, but you can’t have it all.)

So in the spirit of blowing straight past nuance and into the point that I want to make, here is another post about priors. The point that I want to make here is that saying that there is no prior information is often incorrect. This is especially the case when you’re dealing with things on the log-scale.

Collapsing Stars

Last week, Jonah and I successfully defended our visualization paper against a horde of angry** men*** in Cardiff. As we were preparing our slides, I realized that we’d missed a great opportunity to properly communicate how bad the default priors that we used were.

This figure (from our paper) plots one realization of fake data generated from our prior against the true data. As you can see, the fake data simulated from the prior model is two orders of magnitude too large.

(While plotting against actual data is a convenient way of visualizing just how bad these priors are, we could see the same information using a histogram.)

To understand just how bad these priors are, let’s pull in some physical comparisons. PM2.5 (airborne particulate matter that is less than 2.5 microns in diameter) is measured in micrograms per cubic metre.  So let’s work out the density of some real-life things and see where they’d fall on the y-axis of that plot.

(And a quick note for people who are worried about taking the logarithm of a dimensioned quantity: imagine I’ve divided through by a baseline of 1 microgram per cubic metre. To do this properly, I probably should divide through by a counterfactual, which would more reasonably be around 5 micrograms per cubic metre, but let’s keep this simple.)

Some quick googling gives the following factlets (the arithmetic is sketched just below):

  • The log of the density of concrete would be around 28.
  • The log of the density of a neutron star would be around 60.
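
For the sceptics, the back-of-the-envelope arithmetic behind those two numbers looks like this (the densities are rough figures: roughly 2,400 kg per cubic metre for concrete and roughly $latex 5\times10^{17}$ kg per cubic metre for a neutron star):

    # Rough densities converted to micrograms per cubic metre, then natural-logged
    # (relative to the 1 microgram per cubic metre baseline mentioned above).
    concrete_kg_m3     <- 2.4e3   # rough density of concrete
    neutron_star_kg_m3 <- 5e17    # rough density of a neutron star
    kg_to_ug <- 1e9               # 1 kg = 1e9 micrograms

    log(concrete_kg_m3 * kg_to_ug)      # about 28
    log(neutron_star_kg_m3 * kg_to_ug)  # about 61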

Both of these numbers are too small to appear on the y-axis of the above graph. That means that every point in that prior simulation is orders of magnitude denser than the densest thing in the universe.

It makes me wonder why I would want to have any prior mass at all on these types of values. Sometimes, such as in the example in the paper, it doesn’t really matter. The data informs the parameters sufficiently well that the silly priors don’t have much of an effect on the inference.

But this is not always the case. In the real model for air pollution, of which our model was the simplest possible version, it’s possible that some of the parameters will be hard to learn from data. In this case, these unphysical priors will lead to unphysical inference. (In particular, the upper end of the posterior predictive interval will be unreasonable.)

The nice thing here is that we don’t really need to know anything about air pollution to make weakly informative priors that are tailored for this data. All we need to know is that if you try to breathe in concrete, you will die. This means that we just need to set priors that don’t generate values that are bigger than 28. I’ve talked about this idea of containment priors in these pages before, but a concrete example hopefully makes it more understandable. (That pun is unforgivable and I love it!)
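
That sort of containment is easy to check with a prior predictive simulation. This sketch is not the model from the paper; it just compares a deliberately vague prior on log-PM2.5 with a tighter, made-up weakly informative one, and asks how often each of them generates something denser than concrete.

    # Prior predictive check on the log-PM2.5 scale.
    set.seed(1)
    n_sims <- 1e5

    vague  <- rnorm(n_sims, mean = 0, sd = 100)  # a deliberately silly prior
    weakly <- rnorm(n_sims, mean = 1, sd = 1)    # roughly "a few micrograms per cubic metre"

    mean(vague > 28)    # a sizeable fraction of prior draws is denser than concrete
    mean(weakly > 28)   # essentially zero

    # The implied PM2.5 values under the tighter prior are at least physically possible.
    quantile(exp(weakly), c(0.025, 0.5, 0.975))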

Update: The fabulous Aki made some very nice plots that show a histogram of the log-PM2.5 values for 1000 datasets simulated from the model with these bad, vague priors. Being Finnish, he also included Pallastunturi Fells, which is a place in Finland with extremely clean air. The graph speaks for itself.

One of the things that we should notice here is that the ambient PM2.5 concentration will never just be zero (it’ll always be greater than 1 microgram per cubic metre, so log-PM2.5 stays positive), which means the left tail is also unrealistic.

A similar plot using the weakly informative priors suggested in the paper looks much better.


The Recognition Scene

The nice thing about this example is that it is readily generalizable to a lot of cases where logarithms are used.

In the above case, we modelled log-PM2.5 for two reasons: because it was more normal, but more importantly because we care much more about the difference between $latex 5\mu gm^{-3}$ and $latex 10\mu gm^{-3}$ than we do about the difference between $latex 5\mu gm^{-3}$ and $latex 6\mu gm^{-3}$. (Or, to put it differently, we care more about multiplicative change than additive change.)  This is quite common when you’re doing public health modelling.

Another place you see logarithms is when you’re modelling count data. Once again, there are natural limits to how big counts can be.

Firstly, it is always good practice to model relative-risk rather than risk. That is to say that if we observe $latex y_i$ disease cases in a population where we would ordinarily expect to see $latex E_i$ disease cases, then we should model the counts as something like

$latex y_i\sim \text{Poisson}(E_i\lambda)$

rather than

$latex y_i\sim \text{Poisson}(\tilde{\lambda})$.

The reason for doing this is that if the expected number of counts differs strongly between observations, the relative risk $latex \lambda$ can be estimated more stably than the actual risk $latex \tilde{\lambda}$.  This is a manifestation of the good statistical practice of making sure that unknown parameters are on the unit scale.

This is relevant for setting priors on model parameters.  If we follow usual practice and model $latex \log\lambda\sim N(\mu,\sigma)$, then being on unit scale puts some constraints on how big $latex \sigma$ can be.

But that assumes that our transformation of our counts to unit scale was reasonable. We can do more than that!

Because we do inference using computers, we are limited to numbers that computers can actually store. In particular, we know that if $latex \log\lambda>45$ then the expected number of counts is bigger than any integer that can be represented on an ordinary (64-bit) computer.

This strongly suggests that if the covariates used in $latex \mu$ are centred and the intercept is zero (which should be approximately true if the expected counts are correctly calibrated!), then we should avoid priors on $latex \sigma$ with much mass on values larger than 15.

Once again, these are the most extreme priors for a generic Poisson model that don’t put too much weight on completely unreasonable values. Some problem-specific thought can usually make these priors even tighter. For instance, we know there are fewer than seven and a half billion people on earth, so anything that is counting people can safely use much tighter priors (in particular, we don’t want too much mass on values of $latex \sigma$ bigger than 7).
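
All of these thresholds come from the same kind of back-of-the-envelope tail arithmetic, which is easy to redo for your own problem:

    # Where the 45, 15, and 7 come from.
    exp(45) > 2^63            # TRUE: exp(45) exceeds the largest 64-bit integer
    log(7.5e9)                # about 22.7: log of the world population

    # With log(lambda) ~ N(0, sigma), sigma = 15 makes log(lambda) > 45 a ~3-sigma
    # event, and sigma = 7 does roughly the same for log(lambda) > log(7.5e9).
    pnorm(45, mean = 0, sd = 15, lower.tail = FALSE)
    pnorm(log(7.5e9), mean = 0, sd = 7, lower.tail = FALSE)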

All of these numbers are assuming that $latex E_i\approx 1$ and need to be adjusted when the expected counts are very large or very small.

The exact same arguments can be used to put priors on the hyper-parameters in models of count data (such as models that use a negative-binomial likelihood). Similarly, priors for models of positive, continuous data can be set using the sorts of arguments used in the previous section.

Song For an Old Friend

This really is just an extension of the type of thing that Andrew has been saying for years about sensible priors for the logit-mean in multilevel logistic regression models.  In that case, you want the priors for the logit-mean to be contained within [-5,5].
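
That [-5, 5] range is easy to sanity-check: on the probability scale those endpoints are already pressed up against 0 and 1.

    # Inverse-logit of the endpoints: anything much outside [-5, 5] corresponds
    # to probabilities that are essentially indistinguishable from 0 or 1.
    plogis(c(-5, 0, 5))   # approximately 0.007, 0.5, 0.993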

The difference between that case and the types of models considered in this post is that here the upper bound requires some more information than just the structure of the likelihood. In the first case, we needed information about the real world (concrete, neutron stars, etc.), while for count data we needed to know about the limits of computers.

But all of this information is easily accessible. All you have to do is remember that you’re probably not as ignorant as you think you are.

Just to conclude, I want to remind everyone once again that there is one true thing that we must always keep in mind when specifying priors. When you specify a prior, just like when you cut a wig, you need to be careful. That hair isn’t going to grow back.

Deserters

* It was a very large room. I had to zoom all the way in to get a photo of her. It did not come out well.

** They weren’t.

*** They were. For these sort of sessions, the RSS couldn’t really control who presented the papers (all men) but they could control the chair (a man) and the invited discussants (both men). But I’m sure they asked women and they all said no. </sarcasm> Later, two women contributed very good discussions from the floor, so it wasn’t a complete sausage-fest.

Against Arianism

“I need some love like I’ve never needed love before” – Geri, Mel C, Mel B, Victoria, Emma (noted Arianists) 

I spent most of today on a sequence of busses shuttling between cities in Ontario, so I’ve been thinking a lot about fourth century heresies. 

That’s an obvious lie. But I think we all know by now that I love to torture a metaphor. (Never forget Diamanda Galas. This is not a metaphor. Just solid advice.)

But why Arianism specifically? Well it’s an early Christian heresy that posited that Jesus was created by God and thus not the same as God. This view was emphatically rejected by the church leadership and in 325 the First Council of Nicaea wrote this rejection explicitly into the Nicene Creed, which these days reads 

I believe in one Lord Jesus Christ / … / Begotten, not made, consubstantial with the Father

(Yes to all the other Catholics who have lapsed or fallen: a few years back they decided “of one being with the Father” was too understandable. If you don’t know the difference between a lapsed and a fallen Catholic either it doesn’t matter to you, or you’re the former.)

This isn’t even my favourite early Christian heresy: if I ever find a solid reason to use Docetism (short version: Jesus was a hologram) as a metaphor, you better believe I will. (They didn’t need to construct a whole creed to rid themselves of that one.)

Whyyyyyyyyyyy? (extreme Annie Lennox voice)

The first sentence of this post was a lie. I actually spent those five hours thinking about Satan.

I get horrifically motion sick if I try to read anything on a bus, so I used today as an opportunity to catch up on the only podcast I subscribe to. And that podcast is a combination of an Audiobook and an academic discussion of the meaning and context (past and current) of Milton’s Paradise Lost.  It’s done by Anthony Oliveira, and as well as being an excellent performance of the epic poem, the discussion is so much fun you’ll be struggling to stop thinking about Satan. (Anthony has a PhD in this and unlike people with stats PhDs, he has good communication skills. So you get expert knowledge in an accessible package!) It is well worth the $3 per month (at this point $3 gets 21 episodes, so it’s an absolute bargain).

The Perils of Pauline (Theology) 

If I have a point in all of this, I probably should’ve expressed it by now. But I do. I’m just bad at writing. And unsurprisingly it will be a point that I’ve sort of made before. But one I made without such a grouse metaphor.

Like Gaul, Bayesian data analysis is usually divided into three parts: the likelihood, the prior, and … well, depending on the point that I’m going to make I’d either say the data or the computation. But as I don’t care about the third part at the moment, feel free to pick your favourite. (And let’s take it as a sign of personal growth [or Miltonic inspiration] that I’ve swerved into a pre-Christian metaphor even though there’s a perfectly obvious other option.) (Added later: John Ormerod suggested the third part should be the posterior, proving he has a better grip on my metaphors than I do.)

I’ve written before (with Andrew and Michael) about how the prior can (often) only be understood in the context of the likelihood [qualifying adjective added in review], but this is a more realistic metaphor. Because there is no compulsion to speak only of Jesus in the context of God. Instead, they are consubstantial; different but not interchangeable beings made of the same stuff.

But we’ve maybe hit the point of the post where my metaphor falls apart. Because the trinitarian view of god is often stated hierarchically as the father, son, and holy spirit, even if they are co-eternal and of the same substance. And our previous paper was also hierarchical: the prior was only to be understood in the light of the (superior) likelihood.

But reality is more nuanced. If I had to write that title again, I’d say this: The prior is consubstantial with the likelihood. (This is why post-publication revisions shouldn’t be encouraged)

That is not to say that they’re the same. The likelihood typically contains our hypothesized generative mechanism as well as information about how that mechanism was measured. On the other hand, the prior will encode our hypotheses about the constituent parts of both the generative and measurement processes. My point is that while it’s true that the prior should be considered in light of the likelihood, it is equally true that the likelihood  should be considered in light of the prior. 

Bayes from a homoousian viewpoint. (Rather than a homoiousian one)

So Data Science isn’t so much the ability to set a reasonable prior for a given problem (as suggested by a twitter soul with the excellent name “daniel”). Instead, it is the ability to use the same scientific knowledge (substance) to simultaneously build both the prior and the likelihood.

Work that only considers one of these problems (and I’ve definitely written a bunch of those sorts of papers, including my favourite of my papers) is definitely useful but is incomplete. We must resist Arianism at all times!

The life of the world to come

I want to round out this post with some cultural things. It’s summer, so I’ve been actually enjoying myself, which is to say I’ve been consuming media like something that consumes a lot of media. So here’s some stuff (also – it’s nice to show that I know how to make a fold, just to point out that all these overbearingly long posts have been on purpose).
