## “There is no way to prove that [an extreme weather event] either was, or was not, affected by global warming.”

This post is by Phil, not Andrew.

It’s hurricane season, which means it’s time to see the routine disclaimer that no single weather event can be attributed to global warming. There’s a sense in which that is true, and a sense in which it is very wrong.

I’ll start by going way back to 2005. Remember Hurricane Katrina? A month afterwards some prominent climatologists (Rahmstorf, Mann, Benestad, Schmidt, and Connolley) wrote “Could New Orleans be the first major U.S. city ravaged by human-caused climate change? The correct answer–the one we have indeed provided in previous posts (Storms & Global Warming II, Some recent updates and Storms and Climate Change) –is that there is no way to prove that Katrina either was, or was not, affected by global warming. For a single event, regardless of how extreme, such attribution is fundamentally impossible.”

Well, that’s just nonsense. How on earth could Katrina not have been affected by global warming? There’s no way. You can argue that a major hurricane might have struck New Orleans on August 29, 2005 with or without global warming — sure, could be. Or maybe it would have happened a day earlier or a week earlier or a year earlier or a decade earlier. But sure, OK, maybe it would have happened on August 29, 2005. It’s extremely unlikely but not impossible. But there’s no way, literally no way, that it could have been the same storm. Katrina was definitely affected by global warming.

Does it matter? “We all know what they meant”? Well, I don’t know what they meant! And I’ve seen similar statements hundreds of times.

The weather is different than it would have been without global warming, every day and in every location. In some places and at some times the differences are large and in some places they are small. On some days there are fewer tropical cyclones in the Atlantic than there would have been, and on some days there are more; on other days there are exactly the same number of tropical cyclones but they are not in exactly the same places with exactly the same winds.

To say we don’t know whether a given city would have been destroyed by a hurricane on such-and-such a date in the absence of global warming, OK, fine, coincidences happen. But to say that we can’t say whether the storm was affected by global warming, that’s just wrong. That goes for Hurricane Dorian, too.

I’ve been waiting 14 years to get this off my chest. I feel better.

This post is by Phil Price

## The Wife

I was on the plane a few months ago and watched on that tiny screen some movies, the best of which was The Wife, starring Glenn Close. It wasn’t a great movie, but it was OK, the acting was good, and my main thought was: This seems like a much better story for a book than for a movie—it really wasn’t a “cinematic” story at all—and this led me to want to read the original book. Then a while later I was at the bookstore and I picked it up: The Wife, by Meg Wolitzer. And I loved it, absolutely loved it from the first page on. Now I want to read everything that Wolitzer’s ever written.

Just as a taste, here’s one amusing passage:

“That she’d only spent a few months in that igloo . . .”: I love that. And it’s a deft touch (as the critics would say) how Wolitzer is making fun, not of Qaanaaq but of the people who are jealous of Qaanaaq.

Anyway, that was just one bit I happen to have recalled. The book is hilarious all the way through, with the humor built into the structure, not just painted on.

The humor seemed very American. I say that only because I’ve read a lot of funny British novels, and there’s just something different about the two styles. A funny British novel will have the main character dumped on and humiliated over and over—think Jonathan Coe, or David Szalay—whereas the American approach is a bit more exuberant. It’s hard for me to describe this exactly, but there’s a difference.

Anyway, the movie wasn’t as funny as the book. Actually, it wasn’t funny at all. Instead, the dramatic aspects of the story were emphasized. This may have been a smart choice. The book was funny not because funny things happened in it, but because it was narrated by a writer with a fine literary sense of humor. To carry this off in a movie you’d need to either open up the story and add a lot more conversation—so that the funny lines would appear as dialogue—or else do lots of voiceover, Annie Hall style. And I guess the screenwriters didn’t want to do either of these.

Also, the acting in the movie was excellent but it changed the story. Again, maybe by necessity. The acting is key in this sort of movie that focuses on interpersonal relationships, and arguably it’s a better choice to change the story to suit the acting, rather than the other way around.

The acting changed the story in two major ways. First, the main character’s husband, the famous writer, has much more of a presence in the movie than in the book. In the book, you hear him talk on occasion but the entire story is narrated by Joan, the wife, in her authorial voice. In the movie, Joan and Joe are nearly equal characters, both played by strong actors. This changes the balance of the story—it’s now the story of a marriage, rather than the story of a person and her marriage. Second, the movie version makes Joe a more sympathetic character. If you read the book carefully, you can see Joe’s positive aspects—Joan is in some respects an unreliable narrator—but it’s much clearer in the movie when you see him as Jonathan Pryce.

Then there were minor changes. An amusing running gag in the book was that Joe was receiving the Helsinki prize, which was some sort of Finnish knock-off of the Nobel. So even while getting this award, Joe remained aware that his status was still insecure. I guess the screenwriters removed this bit as a step in de-emphasizing the comedy in the story. On the plus side, I think the movie made a good choice in reducing the number of children and bringing them into the story rather than keeping them offstage. Also I preferred the movie’s treatment of the rebellious son: in the book he has serious mental illness and represents the uncontrollability of life, whereas in the movie it’s a more conventional story of a son wanting the affection and respect of his father. I think that worked better in the context of the story.

Finally, to return to the loss of comedy: both Close and Pryce played it straight: no comedy, and not much wit. Both performances were excellent for what they were, but again I think something was lost in the change of focus, compared to the original presentation.

In the book, Joan is a writer, she’s always turning phrases around in her head, playing around with the interaction between the Idea and the Word, making every story her own by deciding how to tell it. That is, the structure of Wolitzer’s novel mirrors its theme.

In contrast, in the movie, yes we are told that Joan is a writer, but I didn’t see Close playing her as a writer. Her being a writer was something Joan did, she was a writer in the same way that the hero of Strangers on a Train was a tennis player—it fit in with the narrative logic of the story, but it wasn’t an essential part of her character, not something that came out in her every sentence, as it did in the book. If anything, I felt that Joan was being portrayed in the movie as a sort of actress.

So, again, as well acted as the movie was, I think it lost something, as it became the story of a marriage, not the story of a writer. Fine on its own terms, but not the same, and not as special.

P.S. I discussed this with a friend who argued, convincingly to me, that Joan’s decision to stick with the fraud for so many years was not well motivated. So the book’s not perfect. That’s fine. A book can be not perfect but still great. Just, then we can want to adjust the story to make the narrative logic work out. As with The Martian.

## He says it again, but more vividly.

We’ve discussed Clarke’s third law (“Any sufficiently crappy research is indistinguishable from fraud”) and the related point that, to do good science, honesty and transparency are not enough.

James Heathers says it again, vividly.

I don’t know if Heathers has ever written anything about the notorious study in which participants were invited to stick 51 pins into a voodoo doll representing their spouses—no, I’m not kidding about this one, it was just done by some bigshot professor who calls himself a “myth buster,” has over 199 peer-reviewed articles (he said “over 200” but it turned out that at least one was copied), and has testified before Congress, so no big deal. That voodoo study is hilarious, worth a good rant or two if Heathers has it in him. . . .

## Bayesian post-selection inference

Richard Artner, Francis Tuerlinckx, and Wolf Vanpaemel write:

We are currently researching along the lines of model selection/averaging/misspecification and post-selection inference. As far as we understand your approach to Bayesian statistical analysis looks (drastically simplified) like this:

1. A series of models is sequentially fitted (with an increase in model complexity) whereby the types of model misfits motivate the way the model is extended in each step. This process stops if additional complexity could not be handled by the amount of data at hand (i.e., when parameter uncertainty due to estimation surpasses a certain point) or potentially earlier in the (lucky!) case that a model has been found where no discrepancies between the observed data pattern and the model assumptions can be found.

2. The final model is then, once again, put to the acid test. That means residual plots, posterior predictive checks and the likes.

3. Inference for the model parameters of interest as well as functions of them (i.e., expected mean, quantiles of the response variable, etc.) is then conducted in the chosen model.

An example of this process is, for instance, given in BDA (Chapter 22.2 “Using regression predictions: incentives for telephone surveys”). [That example is in section 9.2 of the third edition of BDA. — ed.]

We are wondering to what extent the inferences achieved by such a process can be problematic and potentially misleading since the data were used twice (first to end up with the final model and second to fit the likelihood to conduct the inferences). You do not mention any broadening of credible intervals, nor data splitting where the third step is conducted on an unused test sample. Maybe you do not mention it because it does not matter so much theoretically and in practice. Or perhaps because it is too difficult to deal with the issue in a Bayesian sense.

As far as we understand it, in such a process the dataset influences the form of the likelihood, the prior distributions as well as the parameter fits (e.g., via ML), thereby violating the internal consistency of Bayesian inference (i.e., given an a priori specified likelihood and the “correct” prior distribution, the posterior distribution is correct, where in the M-open case, correctness is defined by the best approximating model).

– Yes, that’s a reasonable summary of our model-building approach. A more elaborate version is in this paper with Jonah Gabry, Dan Simpson, Aki Vehtari, and Mike Betancourt.

– I don’t think it will ever make sense to put all of Bayesian inference in a coherent framework, even for a single application. For one thing, as Dan, Mike, and I wrote a couple of years ago, the prior can often only be understood in the context of the likelihood. And that’s just a special case of the general principle that there’s always something more we could throw into our models. Whatever we have is at best a temporary solution.

– That said, much depends on how the model is set up. We might add features to a model in a haphazard way but then go back and restructure it. For example, the survey-incentives model in section 9.2 of BDA is pretty ugly, and recently Lauren Kennedy and I have gone back to this problem and set up a model that makes more sense. So I wouldn’t consider the BDA version of this model (which in turn comes from our 2003 paper) an ideal example.

– To put it another way, we shouldn’t think of the model-building process as a blind data-fitting exercise. It’s more like we’re working toward building a larger model that makes sense, and each step in the process is a way of incorporating more information.
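As a concrete illustration of step 2 in the workflow above (posterior predictive checks), here is a minimal, stdlib-only Python sketch for a normal model. The function name is mine, and for simplicity it plugs in point estimates of the mean and standard deviation; a real check would redraw (mu, sigma) from the posterior for each replicated dataset.

```python
import random
import statistics

def posterior_predictive_pvalue(y, n_rep=2000, seed=0):
    # Posterior predictive check for a normal model, using the sample
    # maximum as the test statistic. This plugs in point estimates of
    # (mu, sigma); a full check would redraw them from the posterior
    # for each replicated dataset.
    rng = random.Random(seed)
    mu, sigma = statistics.mean(y), statistics.stdev(y)
    t_obs = max(y)
    exceed = sum(
        max(rng.gauss(mu, sigma) for _ in y) >= t_obs
        for _ in range(n_rep)
    )
    return exceed / n_rep
```

For well-behaved data the resulting p-value is moderate; a gross outlier pushes it toward zero, flagging a discrepancy between the observed data pattern and the model assumptions.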

## Is the effect they found too large to believe? (the effect of breakfast macronutrients on social decisions)

Someone who wishes to remain anonymous writes:

Have you seen this paper?

I [my correspondent] don’t see any obvious problems, but the results fall into the typical social psychology case “unbelievably large effects of small manipulations”. They even say so themselves:

We provided converging evidence from two studies showing that a relatively small variation in breakfast’s macronutrient composition has a striking impact on social decisions.

The article in question is “Impact of nutrition on social decision making,” by Sabrina Strang, Christina Hoeber, Olaf Uhl, Berthold Koletzko, Thomas F. Münte, Hendrik Lehnert, Raymond Dolan, Sebastian Schmid, and Soyoung Park. From the abstract:

Breakfasts with a high-carbohydrate/protein ratio increased social punishment behavior in response to norm violations compared with that in response to a low-carbohydrate/protein meal. We show that these macronutrient-induced behavioral changes in social decision making are causally related to a lowering of plasma tyrosine levels.

And here’s their evidence:

I’m concerned about implausible effect size estimates, which is what you can get from the combination of noisy data, small samples, and forking paths in analysis.

The graph on the left is from experiment 1 which is observational data (not assigning breakfasts but just asking people what they ate), but still:

Within the low-carb/protein group, 24% of subjects decided to reject unfair offers. In contrast, 53% of the high-carb/protein group decided to reject unfair offers.

I don’t care if it is p=0.03, I don’t expect to see this in a replication.
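For a sense of the arithmetic, a p-value in that neighborhood is roughly what a pooled two-proportion z-test gives with modest group sizes. Here is a stdlib-only Python sketch; the group sizes of 30 per group are invented for illustration, not the paper’s actual sample sizes.

```python
import math

def two_prop_z(p1, n1, p2, n2):
    # Pooled two-proportion z-test (normal approximation), two-sided.
    p = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 24% vs. 53% rejecting unfair offers; the 30s are hypothetical.
z, p_value = two_prop_z(0.24, 30, 0.53, 30)
```

The point is not the test itself but how easily a noisy comparison with small n clears the significance bar: with these numbers the observed difference is “significant,” yet the effect-size estimate is exactly the kind of thing that shrinks or vanishes in a replication.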

There’s a lot more data, and there could be something going on—I have no idea. I think they should do a Nosek, Spies, and Motyl and replicate the whole thing from scratch. Or someone else can do the replication.

Until then, I’m skeptical of these claims:

The findings indicate that, in a limited sense, “we are what we eat” and provide a perspective on a nutrition-driven modulation of cognition. The findings have implications for education, economics, and public policy, and emphasize that the importance of a balanced diet may extend beyond the mere physical benefits of adequate nutrition.

Or this:

In this study, we demonstrated that the macronutrient composition of food acutely influences our social decisions, showing a modulation in the dopamine precursor as the underlying mechanism.

Exploratory experimentation and analysis are fine—that’s what science is all about. Let’s just not forget that finding some statistically significant comparisons in data is not the same thing as scientifically “demonstrating” a hypothesis. Their hypothesis could well be true, or maybe not, or maybe it depends on context. Nothing special about this particular study, we just need to give such studies a modern reading.

## Seeking postdoc (or contractor) for next generation Stan language research and development

The Stan group at Columbia is looking to hire a postdoc* to work on the next generation compiler for the Stan open-source probabilistic programming language. Ideally, a candidate will bring language development experience and also have research interests in a related field such as programming languages, applied statistics, numerical analysis, or statistical computation.

The language features on the roadmap include lambdas with closures, sparse matrices and vectors, ragged arrays, tuples and structs, user-defined Jacobians, and variadic functions. The parser, intermediate representation, and code generation are written in OCaml using the Menhir compiler framework. The code is hosted on GitHub in the stanc3 repo; the current design documents are in the design docs repo. The generated code is templated C++ that depends on the automatic differentiation framework in the Stan math library and is used by Stan’s statistical inference algorithms.

The research and development for Stan will be carried out in collaboration with the larger Stan development team, which includes a large group of friendly and constructive collaborators within and outside of Columbia University. In addition to software development, the team has a diverse set of research interests in programming language semantics, applied statistics, statistical methodology, and statistical computation. Of particular relevance to this project is foundational theory work on programming language semantics for differentiable and probabilistic programming languages.

The position would be housed in the Applied Statistics Center at Columbia University and supervised by Bob Carpenter. The initial appointment will be for one year with a possible reappointment for a second year.

To apply, please send a CV and a statement of interest and experience in this area (if not included in the CV) to Bob Carpenter, carp@alias-i.com. The position is available immediately, and we will review applications as they arrive.

Thanks to Schmidt Futures for the funding to make this possible!

* We could also hire a contractor on an hourly basis. For that, I’d be looking for someone with experience who could hit the ground running with the OCaml code.

## “I am a writer for our school newspaper, the BHS Blueprint, and I am writing an article about our school’s new growth mindset initiative.”

Caleb VanArragon writes:

I am a student at Blaine High School in Blaine, Minnesota. I am a writer for our school newspaper, the BHS Blueprint, and I am writing an article about our school’s new growth mindset initiative. I was wondering if you would be willing to answer a couple of questions about your study of the statistical reliability of some growth mindset studies.

In this article, written in 2017, you said you believe that Carol Dweck is “using statistical methods that will allow them to find success no matter what,” and you have said similar things on your blog. Do you believe that all of Dweck’s studies have been conducted using poor statistical methods, or have some of them been conducted properly?

Have you seen any studies that have found correlation between growth mindset and academic performance that have been conducted using sound statistical methods?

No, I do not believe that Dweck’s studies have been conducted using poor statistical methods. I think that statistics is hard, and that various people have made too-strong claims from studies such as Dweck’s. See discussion here, and see here for further discussion on growth mindset studies. I hope this is helpful.

## What’s the origin of the term “chasing noise” as applying to overinterpreting noisy patterns in data?

Roy Mendelssohn writes:

In an internal discussion at work I used the term “chasing noise”, which really grabbed a number of people involved in the discussion. Now my memory is I first saw the term (or something similar) in your blog. But it made me interested in who may have first used the term? Did you hear it first from someone, or have any idea of who may have first used the term, or something close to it?

The term seems so natural. I don’t know if I heard it from somewhere. Here’s where I used it in 2013. I’ve also used the related term “noise mining.”

A quick google search came up with this 2012 article, Chasing Noise, by Brock Mendel and Andrei Shleifer in the Journal of Financial Economics, but they’re using the term slightly differently, referring not to overfitting explanations of noisy statistical findings, but to random economic behavior.

Roy then gave some background:

The term came up in the setting that I am a firm believer that if you ignore spatial and temporal correlation in space-time data, as many analyses do, you are uncovering patterns that are transitory in the dynamics sense, either because you have overestimated the effective sample size (as when the talks on Stan talk about ESS for analyzing the chains) or you are just being fooled by the seeming patterns caused by noise when data are dependent (actually even when they are independent – when state lotteries started I knew quite a few people who were positive they had found a pattern in the numbers, and sure enough they all lost a fair amount of money).
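Roy’s point about overestimating the effective sample size can be made concrete with a toy AR(1) example (the numbers below are invented; the ESS formula is the standard one for estimating the mean of an AR(1) process):

```python
import math
import random

def ar1_series(n, rho, sigma=1.0, seed=0):
    # Simulate x_t = rho * x_{t-1} + noise, started from the
    # stationary distribution.
    rng = random.Random(seed)
    x = [rng.gauss(0.0, sigma / math.sqrt(1 - rho ** 2))]
    while len(x) < n:
        x.append(rho * x[-1] + rng.gauss(0.0, sigma))
    return x

def ess_ar1(n, rho):
    # Effective sample size for estimating the mean of an AR(1)
    # process: n correlated points carry only about
    # n * (1 - rho) / (1 + rho) points' worth of information.
    return n * (1 - rho) / (1 + rho)

# 1,000 highly autocorrelated points are worth only ~53 independent ones.
ess = ess_ar1(1000, 0.9)
```

Treat a series like that as 1,000 independent observations and your standard errors are several times too small, which is exactly the setting in which seeming patterns in noise look convincing.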

Anyway, if any of you know further history on this use of the expression “chasing noise” as applying to overinterpretation of noisy patterns in data, please let us know in comments.

## When people make up victim stories

A couple of victim stories came up recently: in both cases these were people I’d never heard of until (a) they claimed to have been victimized, and (b) it seems that these claims were made up.

First case was Jussie Smollett, a cable-TV actor who claimed to be the victim of a racist homophobic attack, which seems to have never happened.

Next case was Jacob Wohl, a political operator of some sort who claimed to have received online death threats, but then it seems these threats came from accounts that he created.

So it seems that Smollett may have hired people to mug him, and Wohl may have sent death threats to himself.

These two cases reminded me of the much more obscure story from several years ago of Mary Rosh, a fictional online character created with presumed intent to deceive (that is, a “sock puppet”) by researcher and policy advocate John Lott. In Rosh’s (that is, Lott’s) words, “I have to say that he [Lott] was the best professor I ever had.” When questioned about this action, Lott wrote that, “it was a way to get information into the debate.” In this case, the information that Lott thinks that he’s a really really good teacher, which happens to be something he could’ve introduced into the debate directly, under his own name.

And this in turn reminds me of when cartoonist Scott Adams posted, under an assumed name, the paradoxical statement, “You’re talking about Scott Adams. He’s not talking about you.” Adams also apparently used his fake online persona to write, “I hate Adams for his success too.” Which may be true: I would not be surprised if Adams, like some other successful people, has mixed feelings about his success.

Rosh and Adams used fake identities to affect conversations about themselves; Wohl and Smollett faked their victimhood; but I don’t see these actions as so different. It’s just that being a victim is more of a “thing” now than it was a few years ago.

Anyway, I have a theory about all these cases, and other similar examples, which is that when people lie or misrepresent like this, they’re doing this out of the belief that they’re representing a larger truth. Even if Smollett did not receive these particular slurs at that particular time, he’s felt slighted on other occasions. Even if Wohl did not get those death threats, people have spoken harshly to him online at other times. Basically, Smollett and Wohl feel like oppressed victims, so to them it’s barely a lie at all if they make up these particular cases. Just as fiction can feel more true than the truth, Smollett and Wohl could well feel that these faked incidents capture the essence of what’s happening to them. And then, when the fakes get revealed, they can feel victimized again by all the people who are questioning them.

Similarly with Rosh and Adams: Rosh probably does think she’s a great teacher—indeed maybe some students in her real-life classes gave her some positive feedback on her teaching. And of course Adams is right that people talk about him when he’s not in the room. So, again, they were lying in the service of a larger truth. At least that’s how I conjecture they see it.

At this point, you might ask: If these people feel like they’re serving the larger truth, why not just tell that truth? Why doesn’t Smollett just recount real examples of when he’s been hassled, why doesn’t Wohl share real internet beef he’s received, why didn’t Rosh ask a real student to testify to her teaching prowess, why didn’t Adams . . . ummm, I’m not actually sure what point Adams was trying to make in that particular discussion, so I’ll skip that one.

Anyway, I conjecture that the reason these people don’t just recount true stories is that the truth isn’t good enough. Maybe Smollett received some rude stares but no in-your-face slurs, maybe Wohl received some angry emails but no threats, maybe Rosh didn’t actually have any former students at hand to argue in her favor.

As we say in statistics, if the data don’t make your case, impute from the model!

P.S. Why write about these sad stories at all? I’m interested for two reasons. First, as noted above, questions of truth and lies relate to more general concerns about learning from data and the scientific process, as discussed in my papers with Basbøll here (To throw away data: Plagiarism as a statistical crime) and here (When do stories work? Evidence and illustration in the social sciences).

Second, similar issues of trust can arise in scientific disputes, in which pseudo-evidence is used to support a claim that’s been questioned. Sometimes this can involve out-and-out misrepresentation; other times it’s what is sometimes charitably called questionable research practices, perhaps most notoriously Daryl Bem counting, as a successful replication of his 2011 ESP paper, a study on spider stimuli from 2005. The issue here is not lying; the concern is that vaguely relevant pieces of information are being treated as evidence. Again I suspect the belief is that this is all in support of a larger truth so the details don’t matter, also the people doing this sort of thing may feel beleaguered by criticism, which can make almost any tactic seem reasonable in response. So I think it’s worth thinking about how it is that people justify various behaviors involving constructing, selecting, or misrepresenting evidence.

## How much of the change from 2016 was due to different people voting vs. the same people changing their vote choice?

A colleague writes:

Whenever I think of appropriate Democratic strategies for 2020 I am drawn to ask how a candidate can get voters from Trump.

But a colleague frequently corrects my thinking by saying Karl Rove discovered that you want to rile up your base to get them to turn out and appealing to the median voter is nonsense. He is not the only one who says so.

Is there any evidence you have seen on the Rove hypothesis? What is the best evidence I wonder? Strikes me as a hard thing to study but terribly important.

I guess the first question I would have is how many people vote sometimes (say 30-80% of the time) vs. how many sometimes vote Democratic (say 30-80% of the time). What is the size of the two pools? Harder to get at, of course, is the elasticity or ease of changing those numbers. But if you assumed they were equally easy to shift, then the size of the pools would be determinative.

Or if the size was comparable and your prior was that shifting a political opinion is very hard but getting someone to decide to vote on a given occasion vs. stay home (if they sometimes vote) is less hard, then maybe with equally sized pools, you know the answer.

Before giving my answer (or, more precisely, Yair’s answer), I’ll just point out that the same question could be asked from the Republican side. Given that the Democrats are doing various things to try to win more votes in 2020, the Republicans can hardly just stand still in this new election and expect to squeak by again. So, although my colleague poses the question from a particular partisan perspective, it applies from either direction.

OK, now to the answer. It turns out I already posted on the topic in May, reporting a detailed analysis from Yair Ghitza, who asked:

How much of the change from 2016 was due to different people voting vs. the same people changing their vote choice?

and who concluded:

Two things happened between 2016 and 2018. First, there was a massive turnout boost that favored Democrats, at least compared to past midterms. . . . But if turnout was the only factor, then Democrats would not have seen nearly the gains that they ended up seeing. Changing vote choice accounted for a +4.5% margin change, out of the +5.0% margin change that was seen overall — a big piece of Democratic victory was due to 2016 Trump voters turning around and voting for Democrats in 2018.

And lots of data-rich detail in between.
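The decomposition Yair is describing can be sketched with toy numbers (all invented for illustration, not his actual data): hold the electorate fixed to isolate vote switching, then add the composition change from turnout.

```python
def two_party_margin(dem, rep):
    # Democratic margin as a share of the two-party vote
    return (dem - rep) / (dem + rep)

# Hypothetical electorate: 90 repeat voters, 10 net-new 2018 voters.
m_2016        = two_party_margin(44, 46)  # repeat voters' 2016 choices
m_2018_repeat = two_party_margin(48, 42)  # same people in 2018: some switched
m_2018_new    = two_party_margin(6, 4)    # new voters lean Democratic

w_repeat, w_new = 0.9, 0.1
m_2018 = w_repeat * m_2018_repeat + w_new * m_2018_new

vote_choice_effect = m_2018_repeat - m_2016  # electorate held fixed
turnout_effect     = m_2018 - m_2018_repeat  # composition change only
total_change       = m_2018 - m_2016         # the two effects sum to this
```

In this made-up example, as in Yair’s analysis, most of the overall margin change comes from the same people switching their votes rather than from who turned out.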

I’m reposting because it seems that at least one potential reader didn’t see this when it came up the first time.

## Beyond Power Calculations: Some questions, some answers

Brian Bucher (who describes himself as “just an engineer, not a statistician”) writes:

I’ve read your paper with John Carlin, Beyond Power Calculations. Would you happen to know of instances in the published or unpublished literature that implement this type of design analysis, especially using your retrodesign() function [here’s an updated version from Andy Timm], so I could see more examples of it in action? Would you be up for creating a blog post on the topic, sort of a “The use of this tool in the wild” type thing?

I [Bucher] found this from Clay Ford and this from Shravan Vasishth and plan on working my way through them, but it would be great to have even more examples.

I promised to write such a post asking for more examples—and here it is! So feel free to send some in. I have a couple examples in section 2 of this paper.

After I told Bucher the post is coming, he threw in another question:

I’d also be curious about if you would apply this methodology in cases where there was technically no statistical significance. I’m thinking primarily of these two cases:

(a) There was no alpha value chosen before the study, and the authors weren’t testing a p-value against an alpha, but just reporting a p-value (such as 0.06) and deciding that it was sufficiently small to conclude that there was likely an effect and worth further experimentation/investigation. (Fisher-ian?)

(b) There was an alpha value chosen (0.05), and the t-test didn’t reject the null because the p-value was 0.08. However, in addition to the frequentist analysis, the authors generated a Bayes factor of 2.0 and claimed this showed that a difference between the two groups was twice as likely as having no difference between groups, and, therefore, conclude a difference in groups.

Letter (a) is a decent description of the type of analyses that I often do (mostly DOEs), since I don’t use alpha-thresholds unless required by a third party.

Letter (b) is (basically) something from a paper that I’m analyzing, and it would be great if I could estimate the Type-S/M errors without violating any statistical laws.

I have my fingers crossed, because in your Beyond Power Calculations paper you do say,

If the result is not statistically significant, the chance of the estimate having the wrong sign is 49% (not shown in the Appendix; this is the probability of a Type S error conditional on nonsignificance)—so that the direction of the estimate gives almost no information on the sign of the true effect.

…so I do have hope that the methods are generally applicable to nonsignificant results as well.

Full disclosure, I [Bucher] posted a version of this question to stackexchange but have not (yet) received any comments.

We were thinking of type M and type S errors as frequency properties. The idea is that you define a statistical procedure and then work out its average properties over repeated use. So far, we’ve mostly thought about the procedure which is “do an analysis and report it if it’s ‘statistically significant’”—in my original paper with Tuerlinckx on type M and type S errors (full text here), we talked about the frequency properties of “claims with confidence.”

In your case it seems that you want inference about a particular effect size given available information, and I think you’d be best off just attacking the problem Bayesianly. Write down a reasonable prior distribution for your effect size and then go from there. Sure, there’s a challenge here in having to specify a prior, but that’s the price you have to pay: Without a prior, you can’t do much in the way of inference when your data are noisy.
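For readers who want the frequency-property calculation from the Beyond Power Calculations paper as runnable code, here is a rough Python sketch of a retrodesign-style function (my own, not from the paper; the illustrative effect size and standard error are made-up numbers):

```python
import numpy as np
from scipy import stats

def retrodesign(true_effect, se, alpha=0.05, sims=100_000, seed=0):
    """Power, Type S rate, and Type M (exaggeration) ratio for an estimate
    distributed Normal(true_effect, se), conditional on significance."""
    z = stats.norm.ppf(1 - alpha / 2)
    p_hi = 1 - stats.norm.cdf(z - true_effect / se)   # significant, correct sign
    p_lo = stats.norm.cdf(-z - true_effect / se)      # significant, wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power                             # P(wrong sign | significant)
    # Type M by simulation: average |estimate| among significant results
    est = np.random.default_rng(seed).normal(true_effect, se, sims)
    sig = np.abs(est) > z * se
    type_m = np.abs(est[sig]).mean() / true_effect
    return power, type_s, type_m

# A small true effect measured very noisily
power, type_s, type_m = retrodesign(true_effect=1.0, se=3.5)
```

With numbers like these, power is tiny, a nontrivial fraction of significant results have the wrong sign, and the significant estimates overstate the true effect several-fold.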

## “No, cardiac arrests are not more common on Monday mornings, study finds”

Paul Alper points us to this news report by Susan Perry. I have no idea how good this study is—I have not looked at it at all, except to pull out those two ugly-but-somewhat-functional graphs above (where “SCA” stands for “sudden cardiac arrest”)—but I wanted to convey my approval for a news story reporting a study with a negative conclusion.

P.S. This would be a good dataset for fitting this model (see also cover of BDA3).

## More on the piranha problem, the butterfly effect, unintended consequences, and the push-a-button, take-a-pill model of science

The other day we had some interesting discussion that I’d like to share.

I started by contrasting the butterfly effect—the idea that a small, seemingly trivial intervention at place A can potentially have a large, unpredictable effect at place B—with the “PNAS” or “Psychological Science” view of the world, in which a small, seemingly trivial intervention at place A can have large, consistent, and predictable effects at place B. My point was that the “butterfly effect” and what might be called “PNAS hypotheses” seem superficially to be similar but are actually much different.

Related to this is that butterfly effects are, presumably, not just inconsistent; it’s also that any particular butterfly effect will be rare. As John Cook puts it:

A butterfly flapping its wings usually has no effect, even in sensitive or chaotic systems. You might even say especially in sensitive or chaotic systems. . . . The lesson that many people draw from their first exposure to complex systems is that there are high leverage points, if only you can find them and manipulate them. They want to insert a butterfly at just the right time and place to bring about a desired outcome. Instead, we should humbly evaluate to what extent it is possible to steer complex systems at all.

I then connected this to the “piranha principle,” that there can be some large and predictable effects on behavior, but not a lot, because, if there were, then these different effects would interfere with each other, and as a result it would be hard to see any consistent effects of anything in observational data. As Kaiser puts it, “The piranha effect gives words to my unease with the ‘priming’ literature. My existence exposes me to all kinds of primes that interfere with each other (including many that have not been studied and are thus unknown), and what is their cumulative effect?”

Zbicyclist connects to the law of unintended consequences. I’ve on occasion expressed skepticism about this so-called law (see also here and, more recently, here). But, sure, unpredictability of outcomes is real.

Joe writes:

Can you explain the piranha thing? I read the linked post and I’m still not sure I’m getting it.

For example, there are thousands of easily identifiable causes that have extremely large and predictable effects on human mortality: being in a high velocity crash, eating cyanide, eating botulinum, being shot in the chest, exposure to high dose radiation… I could easily go on all day with these. This does not appear to present either an epistemological problem or a conceptual one.

Why can’t the same be true of social behavior? It may be challenging to estimate effects in observational data if there are thousands of large causes of, say, voting behavior, but I would think we can solve that with sufficient sample sizes and appropriately designed work.

It’s trivially true that there can’t be a large number of factors that explain a large proportion of the ultimate variance in whatever outcome, but what’s the problem with a large number of factors the independent and partial effect of which is large?

To which I respond:

Good question.

The difference is that the examples you give are rare, and they act immediately. The social science analogy would be something like this: being exposed to words relating to elderly people, having an older sibling, having a younger sibling, being in a certain part of the monthly cycle, having your local football team win or lose, being exposed to someone in an expansive or contractive posture, receiving implicitly racist signals, having an age ending in 9, riding in coach class on an airplane, hearing about a local hurricane, etc.: these are common, not rare, and they can all potentially interact.

To put it another way, your mortality example is like a set of piranhas that are each in their own tank, whereas “Psychological Science” or “PNAS”-style psychology research is like a set of piranhas that are all in a tank together.

And Jim continues:

IMO social behavior also just has a MUCH larger number of competing influences operating at similar magnitudes, where the physical world’s effects are spread over a large range of magnitudes so that at any given magnitude the number of competing factors is small. So, say two speeding vehicles collide. There are also sand grains blowing in the wind, but the impact of a sand grain on colliding cars is very very small. OTOH, put twenty people in a room and measure the impact if two of them get into a loud heated argument – the impact on the arguers is probably larger than for bystanders, but roughly of the same magnitude, and there are many interactions.

“The message that I disagree with is the statement that causal relationships can be discovered using automated analysis of observational data.”

Wish I could understand this. When my “check engine” light comes on, I hook up the code reader, and it tells me what the electronic diagnosis circuitry read. Then I (or my mechanic) fix it based upon what caused the anomalous performance. What am I missing? How deep into epistemology do I need to go for this to not make sense?

To which I respond:

It depends on the context. With your car engine light, there’s a lot of theory and engineering going into the system, and we understand the connection between the engine trouble and the light going on. The analogy in social science would be a clean randomized experiment, in which the design and data collection give us identification.

In observational data in social science, there is generally no such clear pattern. To use your analogy, someone might observe the engine in car A, the light in car B, and the circuitry in car C. No amount of analysis of such observational data would tell us much about causal relationships here.

And Daniel continues:

The check engine light is not observational data. It’s a system designed specifically to detect and diagnose issues. It’s subject to a huge quantity of design and testing to ensure that it does its job.

Observational data would be something like collecting the tire wear pattern, paint oxidation, zip code of owner, owner race, and owner educational attainment data of cars brought in to have their transmissions fixed and using some structural assumptions about people’s decision making skills and this observational data inferring something about the causal effects of culture, education, and income on maintenance behavior and its resulting impact on longevity.

I would say the computer in your car is not discovering causal relationships automatically in anything near the sense that Andrew means. It is not trying to discover how cars in general work from the data it gathers. Instead, it just filters data through a calibrated theoretical model of how cars work.

Anyway, I hope the above explanations are helpful for some readers out there.

In a world where things like this get published in top journals, followed by major media exposure, I think it’s important that we understand the systemic problem with these sorts of “Psychological Science” or “PNAS”-style claims, rather than just playing whack-a-mole when each one comes along.

To put it another way, there are two things that work together to keep cargo-cult science alive. First, there are the misunderstandings of statistical methods (researcher degrees of freedom, forking paths, problems with null hypothesis testing, etc.) which give researchers the tools to routinely make inappropriately certain claims from noisy data. Second, there are the theoretical misunderstandings by which researchers think that we live in a world populated by huge effects of priming, as if our every step is buffeted by mysterious forces that are beyond our conscious understanding yet can be easily manipulated by psychologists.

We (the statistics profession and quantitative social researchers) have spent a lot of time addressing that first misconception, but maybe not enough time addressing the second.

P.S. I still want to do that statistical theory research project where we model the piranha problem and prove (given some assumptions) that it’s unlikely to see many large effects in a single system.
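In the meantime, a toy simulation (my own, not the proposed research project) illustrates the core tension: if an outcome has bounded variance, then fifty “large” independent influences must each show up as a small correlation, because their squared correlations with the outcome can’t sum past 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10_000, 50
X = rng.standard_normal((n, k))          # 50 independent candidate influences
beta = np.full(k, 0.5)                   # every one of them has a "large" coefficient
y = X @ beta + rng.standard_normal(n)

# With independent predictors, the squared correlations with y sum to at
# most 1 (here ~0.93; the remainder is noise variance), so each individual
# correlation is forced to be small even though every coefficient is large.
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(k)])
print(np.abs(r).max(), (r ** 2).sum())
```

So a world full of large, always-on influences is self-limiting: they crowd each other out of the observable variance, which is one way to state the piranha problem.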

## Forming a hyper-precise numerical summary during a research crisis can improve an article’s chance of achieving its publication goals.

Speaking of measurement and numeracy . . . Kevin Lewis pointed me to this published article with the following abstract that starts out just fine but kinda spirals out of control:

Forming a military coalition during an international crisis can improve a state’s chances of achieving its political goals. We argue that the involvement of a coalition, however, can have unintended adverse effects on crisis outcomes by complicating the bargaining process and extending the duration of crises. This argument suggests that crises involving coalitions should be significantly longer than crises without coalitions. However, other factors that affect crisis duration are also likely to influence coalition formation. Therefore, taking into account the endogeneity of the presence of a coalition is essential to testing our hypothesis. To deal with this inferential challenge, we develop a new statistical model that is an extension of instrumental variable estimation in survival analysis. Our analysis of 255 post–World War II interstate crises demonstrates that, even after accounting for the endogeneity of coalition formation, military coalitions tend to extend the duration of crises by approximately 284 days.

Approximately 284, huh? What’s the precise number, 283.734908243098230498?

Somehow I’m reminded of my favorite sentence from any quantitative research ever:

Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001].

P.S. I have not read the paper on military coalitions and it might well be wonderful and important research. If you’re interested in the topic, go read it! Make your own call on its quality and on the relevance of this research to the real world. I just think that this “approximately 284 days” thing is hilarious, and I say this as someone who’s approximately 174.3 centimeters tall.

P.P.S. Lewis responds:

Cool. But what is the ideal phrasing? I assume you wouldn’t say “coalitions SIGNIFICANTLY extend crisis duration.” Is it ok to say “extend crisis duration by almost a year on average”?

My reply: To start with, I’d express the result in the past tense as these are past data! “Crises with military coalitions lasted 300 days more on average.” Something like that.

## The importance of talking about the importance of measurement: It depends on the subfield

An interesting point came up in a comment thread the other day and you might have missed it, so I’ll repeat it here.

Dan Goldstein wrote to me:

Many times I’ve heard you say people should improve the quality of their measurements. Have you considered that people may be quite close to the best quality of measurement they can achieve?
Have you thought about the degree of measurement improvement that might actually be achievable?
And what that would mean for the quality of statistical inferences?

Competent psychophysicists are getting measurements that are close to the best they can reasonably achieve. Equipment that costs ten times more might only reduce error by one thousandth. It’s the variation between people that gets ya.

I replied:

There are subfields where measurement is taken seriously. You mention psychophysics; other examples include psychometrics and old-fashioned physics and chemistry. In those fields, I agree that there can be diminishing returns from improved measurement.

What I was talking about are the many, many fields of social research where measurement is sloppy and noisy. I think the source of much of this is a statistical ideology that measurement doesn’t really matter.

The reasoning, I think, goes like this:

1. Measurement has bias and variance.

2. If you’re doing a randomized experiment, you don’t need to worry about bias because it cancels out in the two groups.

3. Variance matters because if your variance is higher, your standard errors will be higher and so you’ll be less likely to achieve statistical significance.

4. If your findings are statistically significant, then retroactively you can say that your standard error was not too high, hence measurement variance did not materially affect your results.

5. Another concern is that you were not measuring quite what you thought you were measuring. But that’s ok because you’ve still discovered something. If you claimed that Y is predicted from X, but you were actually measuring Z rather than X, then you just change the interpretation of your finding: you’ve now discovered that Y is predicted from Z, and you still have a finding.

Put the above 5 steps together and you can conclude that as long as you achieve statistical significance from a randomized experiment, you don’t have to worry about measurement. And, indeed, I’ve seen lots and lots of papers in top journals, written by respected researchers, that don’t seem to take measurement at all seriously (again, with exceptions, especially in fields such as psychometrics that are particularly focused on measurement).

I’ve never seen steps 1-5 listed explicitly in the above form, but it’s my impression that this is the implicit reasoning that allows many many researchers to go about their work without concern about measurement error. Their reasoning is, I think, that if measurement error were a problem, it would show up in the form of big standard errors. So when standard errors are big and results are not statistically significant, then they might start to worry about measurement error. But not before.

I think the apparent syllogism of steps 1-5 above is wrong. As Eric Loken and I have discussed, when you have noisy data, a statistically significant finding doesn’t tell you so much. The fact that a result is statistically significant does not imply that your measurement error was so low that your statistically significant finding can be trusted.
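A quick simulation makes the point concrete (my own sketch, not from the Loken and Gelman paper; the effect size and noise level are made-up numbers): with a small true effect and heavy measurement noise, the few estimates that clear the significance bar wildly overstate the effect.

```python
import numpy as np

rng = np.random.default_rng(3)
true_effect, n, sims = 0.1, 50, 10_000
meas_sd = 2.0                            # heavy measurement noise (assumed)

# Each row is one study: n measurements of a small true effect, plus noise
x = rng.normal(true_effect, 1.0, (sims, n)) + rng.normal(0, meas_sd, (sims, n))
est = x.mean(axis=1)
se = x.std(axis=1, ddof=1) / np.sqrt(n)
sig = np.abs(est / se) > 1.96            # the "statistically significant" studies

exaggeration = np.abs(est[sig]).mean() / true_effect
print(sig.mean(), exaggeration)          # few studies are significant; those overstate the effect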

If all of social and behavioral science were like psychometrics and psychophysics, I’d still have a lot to talk about, but I don’t think I’d need to talk so much about measurement error.

tl;dr: Measurement is always important and should always be emphasized, but in some fields there is already a broad recognition of the importance of measurement, and researchers in those fields don’t need me to harangue them about the importance of measurement. But even they often don’t mind that I talk about measurement so much, because they recognize researchers in other subfields are not always aware of the importance of measurement, with the unawareness arising perhaps from a misunderstanding of statistical significance and evidence.

Ummm, I guess I just violated the “tl;dr” principle by writing a tl;dr summary that itself was a long paragraph. That’s academic writing for ya! Whatever.

## More on why Cass Sunstein should be thanking, not smearing, people who ask for replications

Recently we discussed law professor and policy intellectual Cass Sunstein’s statement that people who ask for social science findings to be replicated are like the former East German secret police.

In that discussion I alluded to a few issues:

1. The replication movement is fueled in large part by high-profile work, lauded by Sunstein and other Ivy League luminaries, that did not replicate.

2. Until outsiders loudly criticized the unreplicated work, those unreplicated claims were essentially uncriticized in the popular and academic press. And the criticism had to be loud, Loud, LOUD. Recall the Javert paradox.

3. That work wasn’t just Gladwell and NPR-bait, it also had real-world implications.

For example, check this out from the Nudge blog, several years ago:

As noted above, Sunstein had no affiliation with that blog. My point is that his brand was, unwittingly, promoting bad research.

And this brings me to my main point for today. Sunstein likens research critics to the former East German secret police, echoing something that a psychology professor wrote a few years ago regarding “methodological terrorists.” But . . . without these hateful people who are some cross between the Stasi and Al Qaeda, those destructive little second-stringers etc. . . . without them, Sunstein would I assume still be promoting claims based on garbage research. (And, yes, sure, Wansink’s claims could still be true, research flaws notwithstanding: It’s possible that the guy just had a great intuition about behavior and was right every time—but then it’s still a mistake to present those intuitions as being evidence-based.)

For example, see this recent post:

The link states that “A field study and a laboratory study with American participants found that calorie counts to the left (vs. right) decreased calories ordered by 16.31%.” 16.31%, huh? OK, I’ll believe it when it’s replicated for real, not before. The point is that, without the research critics—including aggressive research critics, the Javerts who annoy Sunstein and his friends so much—junk science would expand until it entirely filled up the world of policy analysis. Gresham, baby, Gresham.

So, again, Sunstein should be thanking, not smearing, people who ask for replications.

P.S. Probably not a good idea to believe anything Brian Wansink has ever written, at least not until you see clearly documented replication. This overview by Elizabeth Nolan Brown gives some background on the problems with Wansink’s work, along with discussions of some political concerns:

For the better half of a decade, American public schools have been part of a grand experiment in “choice architecture” dressed up as simple, practical steps to spur healthy eating. But new research reveals the “Smarter Lunchrooms” program is based largely on junk science.

Smarter Lunchrooms, launched in 2010 with funding from the U.S. Department of Agriculture (USDA) . . . is full of “common sense,” TED-talk-ready, Malcolm Gladwell-esque insights into how school cafeterias can encourage students to select and eat more nutritious foods. . . . This “light touch” is the foundation upon which Wansink, a former executive director of the USDA’s Center for Nutrition Policy and Promotion and a drafter of U.S. Dietary Guidelines, has earned ample speaking and consulting gigs and media coverage. . . .

The first serious study testing the program’s effectiveness was published just this year. At the end of nine weeks, students in Smarter Lunchroom cafeterias consumed an average of 0.10 more fruit units per day—the equivalent of about one or two bites of an apple. Wansink and company called it a “significant” increase in fruit consumption.

But “whether this increase is meaningful and has real world benefit is questionable,” Robinson* writes.

Nonetheless, the USDA claims that the “strategies that the Smarter Lunchrooms Movement endorses have been studied and proven effective in a variety of schools across the nation.” More than 29,000 U.S. public schools now employ Smarter Lunchrooms strategies, and the number of school food service directors trained on these tactics increased threefold in 2015 over the year before.

Also this:

One study touted by the USDA even notes that since food service directors who belong to professional membership associations were more likely to know about the Smarter Lunchrooms program, policy makers and school districts “consider allocating funds to encourage [directors] to engage more fully in professional association meetings and activities.”

But now that Wansink’s work has been discredited, the government will back off and stop wasting all this time and money, right?

Ummm . . .

A spokesman for the USDA told The Washington Post that while they had some concerns about the research coming out of Cornell, “it’s important to remember that Smarter Lunchrooms strategies are based upon widely researched principles of behavioral economics, as well as a strong body of practice that supports their ongoing use.”

Brown summarizes:

We might disagree on whether federal authorities should micromanage lunchroom menus or if local school districts should have more control, and what dietary principles they should follow; whether the emphasis of school cafeterias should be fundraising or nutrition; or whether school meals need more funding. But confronting these challenges head-on is a hell of a lot better than a tepid consensus for feel-good fairytales about banana placement.

Or celebrating the “coolest behavioral finding of 2019.”

## Yes, you can include prior information on quantities of interest, not just on parameters in your model

Nick Kavanagh writes:

I studied economics in college and never heard more than a passing reference to Bayesian stats. I started to encounter Bayesian concepts in the workplace and decided to teach myself on the side.

I was hoping to get your advice on a problem that I recently encountered. It has to do with the best way to encode prior information into a model in which the prior knowledge pertains to the overall effect of some change (not the values of individual parameters). I haven’t seen this question addressed before and thought it might be a good topic for a blog post.

I’m building a model to understand the effects of advertising on sales, controlling for other factors like pricing. A simplified version of the model is presented below.

Additional units of advertising will, at some point, yield lower incremental sales. This non-linearity is incorporated into the model through a variable transformation — f(ad_spend, s) — where the parameter s determines the rate of diminishing returns.

sales = alpha + beta_ad * f(ad_spend, s) + beta_price * log(price)

Outside the model, I have estimates of the impact of advertising on sales obtained through randomized experiments. These experiments don’t provide estimates of beta_ad and s. They simply tell you that “increasing advertising spend by \$100K generated 400 [300, 500] incremental sales.” The challenge is that different sets of parameter values for beta_ad and s yield very similar results in terms of incremental sales. I’m struggling with the best way to incorporate the experimental results into the model.

In Stan this is super-easy: You can put priors on anything, including combinations of parameters. Consider this code fragment:

model {
  target += normal_lpdf(y | a + b*x, sigma);  // data model
  target += normal_lpdf(a | 0, 10);           // weak prior information on a
  target += normal_lpdf(b | 0, 10);           // weak prior information on b
  target += normal_lpdf(a + 5*b | 4.5, 0.2);  // strong prior information on a + 5*b
}


In this example, you have prior information on the linear combination, a + 5*b, an estimate of 4.5 with standard error 0.2, from some previous experiment.

The key is that prior information is, mathematically, just more data.

You should be able to do the same thing if you have information on a nonlinear function of parameters too, but then you need to fix the Jacobian, or maybe there’s some way to do this in Stan.

P.S. I’ve updated the comments on the above code in response to Lakeland’s suggestion in comments.

## Multilevel structured (regression) and post-stratification

My enemies are all too familiar. They’re the ones who used to call me friend – Jawbreaker

Well I am back from Australia where I gave a whole pile of talks and drank more coffee than is probably a good idea. So I’m pretty jetlagged and I’m supposed to be writing my tenure packet, so obviously I’m going to write a long-ish blog post about a paper that we’ve done on survey estimation that just appeared on arXiv. We, in this particular context, is my stellar grad student Alex Gao, the always stunning Lauren Kennedy, the eternally fabulous Andrew Gelman, and me.

What is our situation?

When data is a representative sample from the population of interest, life is peachy. Tragically, this never happens.

Maybe a less exciting way to say that would be that your sample is representative of a population, but it might not be an interesting population. An example of this would be a psychology experiment where the population is mostly psychology undergraduates at the PI’s university. The data can make reasonable conclusions about this population (assuming sufficient sample size and decent design etc), but this may not be a particularly interesting population for people outside of the PI’s lab. Lauren and Andrew have a really great paper about this!

It’s also possible that the population that is being represented by the data is difficult to quantify.  For instance, what is the population that an opt-in online survey generalizes to?

Moreover, it’s very possible that the strata of the population have been unevenly sampled on purpose. Why would someone visit such violence upon their statistical inference? There are many many reasons, but one of the big ones is ensuring that you get enough samples from a rare population that’s of particular interest to the study. Even though there are good reasons to do this, it can still bork your statistical analysis.

All in all, dealing with non-representative data is a difficult thing, and it will surprise exactly no one to hear that there are a whole pile of approaches that have been proposed from the middle of last century onwards.

Maybe we can weight it

Maybe the simplest method for dealing with non-representative data is to use sample weights. The purest form of this idea occurs when the population is stratified into $J$ subgroups of interest and data is drawn independently at random from the $j$th population with probability $\pi_j$.  From this data it is easy to compute the sample average for each subgroup, which we will call $\bar{y}_j$. But how do we get an estimate of the population average from this?

Well just taking the average of the averages probably won’t work–if one of the subgroups has a different average from the others it’s going to give you the wrong answer.  The correct answer, aka the one that gives an unbiased estimate of the mean, was derived by Horvitz and Thompson in the early 1950s. To get an unbiased estimate of the mean you need to use the subgroup means and the sampling probabilities.  The Horvitz-Thompson estimator has the form

$\frac{1}{J}\sum_{j=1}^J\frac{\bar{y}_j}{\pi_j}$.

Now, it is a truth universally acknowledged, if perhaps not universally understood, that unbiasedness is really only a meaningful thing if a lot of other things are going very well in your inference. In this case, it really only holds if the data was sampled from the population with the given probabilities.  Most of the time that doesn’t really happen. One of the problems is non-response bias, which (as you can maybe infer from the name) is the bias induced by non-response.

(There are ways through this, like raking, but I’m not going to talk about those today).

Poststratification: flipping the problem on its head

One way to think about poststratification is that instead of making assumptions about how the observed sample was produced from the population, we make assumptions about how the observed sample can be used to reconstruct the rest of the population.  We then use this reconstructed population to estimate the population quantities of interest (like the population mean).

The advantage of this viewpoint is that we are very good at prediction. It is one of the fundamental problems in statistics (and machine learning, because why not). This viewpoint also suggests that our target may not necessarily be unbiasedness but rather good prediction of the population. It also suggests that, if we can stomach a little bias, we can get much tighter estimates of the population quantity than survey weights can give. That is, we can trade off bias against variance!
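Here’s a minimal numeric sketch of that reweighting idea (my own toy numbers, not from the paper): deliberately oversample one stratum, then use the known population shares to reconstruct the population mean from the stratum means.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy population: three strata with known shares and different means
shares = np.array([0.5, 0.3, 0.2])       # N_j / N, known from the "census"
true_means = np.array([1.0, 2.0, 4.0])   # true population mean: 1.9

# Non-representative sample: stratum 3 is heavily oversampled on purpose
n_j = np.array([50, 30, 120])
ybar_j = np.array([rng.normal(m, 1 / np.sqrt(n)) for m, n in zip(true_means, n_j)])

naive = np.dot(n_j, ybar_j) / n_j.sum()  # raw sample mean: pulled toward stratum 3
poststrat = shares @ ybar_j              # reweight stratum means to population shares
print(naive, poststrat)
```

The raw sample mean lands near 3 because of the oversampling, while the poststratified estimate recovers something close to the true 1.9. MRP replaces the raw stratum means with multilevel-regression predictions, which matters when some strata have few or zero observations.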

Of course, anyone who tells you they’re doing assumption free inference is a dirty liar, and the fewer assumptions we have the more desperately we cling to them. (Beware the almost assumption-free inference. There be monsters!) So let’s talk about the two giant assumptions that we are going to make in order for this to work.

Giant assumption 1: We know the composition of our population. In order to reconstruct the population from the sample, we need to know how many people or things should be in each subgroup. This means that we are restricted in how we can stratify the population. For surveys of people, we typically build out our population information from census data, as well as from smaller official surveys like the American Community Survey (for estimating things about the US! The ACS is less useful in Belgium.).  (This assumption can be relaxed somewhat by clever people like Lauren and Andrew, but poststratifying to a variable that isn’t known in the population is definitely an advanced skill.)

Giant assumption 2: The people who didn’t answer the survey are like the people who did answer the survey. There are a few ways to formalize this, but one that is clear to me is that we need two things. First, that the people who were asked to participate in the survey in subgroup j are a random sample of subgroup j. Second, that the people who actually answered the survey in subgroup j are a random sample of the people who were asked.  These sort of missing at random or missing completely at random or ignorability assumptions are pretty much impossible to verify in practice. There are various clever things you can do to relax some of them (e.g. throw a handful of salt over your left shoulder and whisper “causality” into a mouldy tube sock found under a teenage boy’s bed), but for the most part this is the assumption that we are making.

A thing that I hadn’t really appreciated until recently is that this also gives us some way to do model assessment and checking.  There are two ways we can do this. Firstly we can treat the observed data as the full population and fit our model to a random subsample and use that to assess the fit by estimating the population quantity of interest (like the mean). The second method is to assess how well the prediction works on left out data in each subgroup. This is useful because poststratification explicitly estimates the response in the unobserved population, so how good the predictions are (in each subgroup!) is a good thing to know!

This means that tools like LOO-CV are still useful, although rather than looking at a global LOO-elpd score, it would be more useful to look at it for each unique combination of stratifying variables. That said, we have a lot more work to do on model choice for survey data.

So if we have a way to predict the responses for the unobserved members of the population, we can make estimates based on non-representative samples. So how do we do this prediction?

Enter Mister P

Mister P (or MRP) is a grand old dame. Since Andrew and Thomas Little introduced it in the mid-90s, a whole lot of hay has been made from the technique. It stands for Multilevel Regression and Poststratification and it kinda does what it says on the box.

It uses multilevel regression to predict what unobserved data in each subgroup would look like, and then uses poststratification to fill in the rest of the population values and make predictions about the quantities of interest.

(This is a touch misleading. What it does is estimate the distribution of each subgroup mean and then uses poststratification to turn these into an estimate of the distribution of the mean for the whole population. Mathematically it’s the same thing, but it’s much more convenient than filling in each response in the population.)
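As a concrete toy illustration of the poststratification step: given estimated subgroup means and known population cell counts, the population estimate is just a count-weighted average of the cell estimates. All the numbers below are made up.

```python
# Toy poststratification: combine estimated subgroup means with known
# population cell counts. Both the means and the counts are invented.
cell_means = {"18-29": 0.62, "30-44": 0.55, "45-64": 0.48, "65+": 0.40}
cell_counts = {"18-29": 210, "30-44": 260, "45-64": 330, "65+": 200}

total = sum(cell_counts.values())
poststrat_mean = sum(
    cell_means[g] * cell_counts[g] / total for g in cell_means
)
print(round(poststrat_mean, 4))  # count-weighted average of the cell means
```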

And there is scads of literature suggesting that this approach works very well. Especially if the multilevel structure and the group-level predictors are chosen well.

But no method is perfect, and in our paper we take aim at one corner of the framework that can be improved. In particular, we look at the effect that using structured priors within the multilevel regression has on the poststratified estimates. These changes turn out not to massively change whole-population quantities, but they can greatly improve the estimates within subpopulations.

What are the challenges with using multilevel regression in this context?

The standard formulation of Mister P treats each stratifying variable the same (allowing for a varying intercept and maybe some group-specific effects). But maybe not all stratifying variables are created equal.  (But all stratifying variables will be discrete because it is not the season for suffering. November is the season for suffering.)

Demographic variables like gender or race/ethnicity have a number of levels that are more or less exchangeable. Exchangeability has a technical definition, but one way to think about it is that a priori we think that the size of the effect of a particular gender on the response has the same distribution as the size of the effect of another gender on the response (perhaps after conditioning on some things).

From a modelling perspective, we can codify this as making the effect of each level of the demographic variable a different independent draw from the same normal distribution.

In this setup, information is shared between different levels of the demographic variable because we don’t know what the mean and standard deviation of the normal distribution will be. These parameters are (roughly) estimated using information from the overall effect of that variable (total pooling) and from the variability of the effects estimated independently for each group (no pooling).

But this doesn’t necessarily make sense for every type of demographic variable. One example that we used in the paper is age, where it may make more sense to pool information more strongly from nearby age groups than from distant age groups. A different example would be something like state, where it may make sense to pool information from nearby states rather than from the whole country.

We can incorporate this type of structured pooling using what we call structured priors in the multilevel model. Structured priors are everywhere: Gaussian processes, time series models (like AR(1) models), conditional autoregressive (CAR) models, random walk priors, and smoothing splines are all commonly used examples.

But just because you can do something doesn’t mean you should. This leads to the question that inspired this work:

When do structured priors help MRP?

Structured priors typically lead to more complex models than the iid varying-intercept model that a standard application of the MRP methodology uses. This extra complexity means that we have more space to achieve our goal of predicting the unobserved survey responses.

But as the great sages say: with low power comes great responsibility.

If the sample size is small or if the priors are set wrong, this extra flexibility can lead to high-variance predictions and will lead to worse estimation of the quantities of interest. So we need to be careful.

As much as I want it to, this isn’t going to turn into a(nother) blog post about priors. But it’s worth thinking about. I’ve written about it at length before and will write about it at length again. (Also there’s the wiki!)

But to get back to the question, the answer depends on how we want to pool information. In a standard multilevel model, we augment the information within each subgroup with whole-population information. For instance, if we are estimating a mean and we have one varying intercept, it’s a tedious algebra exercise to show that

$\mathbb{E}(\mu_j \mid y)\approx\frac{\frac{n_j}{\sigma^2} \bar{y}_j+\frac{1}{\tau^2}\bar{y}}{\frac{n_j}{\sigma^2}+\frac{1}{\tau^2}}$,

so we’ve borrowed some extra information from the raw mean of the data $\bar{y}$ to augment the local means $\bar{y}_j$ when they don’t have enough information.
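Plugging some invented numbers into that formula shows the behaviour: a small group gets pulled strongly towards the overall mean, while a large group keeps almost its raw mean. The helper function and all its argument values are hypothetical.

```python
# Numeric illustration of the partial-pooling formula above. sigma is the
# within-group sd, tau the between-group sd; all values are invented.
def pooled_mean(n_j, y_bar_j, y_bar, sigma=1.0, tau=0.5):
    w_local = n_j / sigma**2    # precision of the raw group mean
    w_global = 1 / tau**2       # precision contributed by the shared prior
    return (w_local * y_bar_j + w_global * y_bar) / (w_local + w_global)

small = pooled_mean(n_j=2, y_bar_j=10.0, y_bar=0.0)    # shrunk a lot
large = pooled_mean(n_j=200, y_bar_j=10.0, y_bar=0.0)  # barely shrunk
print(small, large)
```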

But if our population is severely unbalanced and the different groups have vastly different responses, this type of pooling may not be appropriate.

A canny reader might say “well, what if we put weights in so we can shrink to a better estimate of the population mean?”. Well, that turns out to be very difficult.

Everybody needs good neighbours (especially when millennials don’t answer the phone)

The solution we went with was to use a random walk prior on the age. This type of prior prioritizes pooling to nearby age categories.  We found that this makes a massive difference to the subpopulation estimates, especially when some age groups are less likely to answer the phone than others.
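A first-order random walk prior is easy to write down: each age effect is the previous one plus Gaussian noise, so neighbouring ages are tied together a priori while distant ages can drift apart. The sketch below just draws once from such a prior; the value of tau and the number of age groups are invented.

```python
import numpy as np

# One draw from a first-order random walk (RW1) prior on ordered age
# effects: increments u[j] - u[j-1] ~ Normal(0, tau^2), so neighbouring
# ages are pulled together while distant ages can drift apart.
rng = np.random.default_rng(42)
n_ages, tau = 10, 0.3                    # invented values
increments = rng.normal(0.0, tau, size=n_ages - 1)
u = np.concatenate([[0.0], np.cumsum(increments)])  # anchor first effect at 0

print(np.round(np.diff(u), 2))  # successive effects differ by O(tau)
```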

We put this all together into a detailed simulation study that showed that you can get some real advantages to doing this!

We also used this technique to analyze some phone survey data from The Annenberg Public Policy Center of the University of Pennsylvania about popular support for marriage equality in 2008. This example was chosen because, even in 2008, young people had a tendency not to answer their phones. Moreover, we expect the support for marriage equality to be different among different age groups.  Things went well.

How to bin ordinal variables (don’t!)

One of the advantages of our strategy is that we can treat variables like age at their natural resolution (e.g. year) while modelling, and then predict the distribution of the responses in an aggregated category where we have enough demographic information to do poststratification.

This breaks an awkward dependence between modelling choices and the assumptions needed to do poststratification.
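As a toy sketch of that idea: make model predictions at single-year resolution, then aggregate them into the coarser age bins for which population counts actually exist. Every number here (the age range, the fitted means, the counts, the bin boundaries) is invented.

```python
import numpy as np

# Model at fine resolution (single years of age), poststratify at coarse
# resolution (age bins with known population counts). All numbers invented.
ages = np.arange(18, 30)              # single years: 18, 19, ..., 29
pred = 0.8 - 0.01 * (ages - 18)       # hypothetical fitted mean per year
pop = np.full(len(ages), 100)         # hypothetical population count per year

# aggregate the yearly predictions into the bins used for poststratification
bins = {"18-23": ages < 24, "24-29": ages >= 24}
coarse = {k: float(np.average(pred[m], weights=pop[m])) for k, m in bins.items()}
print(coarse)
```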

Things that are still to be done!

No paper is complete, so there are a few things we think are worth looking at now that we know that this type of strategy works.

• Model selection: How can you tell which structure is best?
• Prior choice: Always an issue!
• Interactions: Some work has been done on using BART with MRP (they call it … BARP). This should cover interaction modelling, but doesn’t really allow for the types of structured modelling we’re using in this paper.
• Different structures: In this paper, we used an AR(1) model and a second-order random walk model (basically a spline!). Other options include spatial models and Gaussian process models. We expect them to work the same way.

What’s in a name? (AKA the tl;dr)

I (and really no one else) really want to call this Ms P, which would stand for Multilevel Structured regression with Poststratification.

But regardless of name, the big lessons of this paper are:

1. Using structured priors allows us to pool information in a more problem-appropriate way than standard multilevel models do, especially when stratifying our population according to an ordinal or spatial variable.
2. Structured priors are especially useful when one of the stratifying variables is ordinal (like age) and the response is expected to depend (possibly non-linearly) on this variable.
3. The gain from using structured priors increases when certain levels of the ordinal stratifying variable are over- or under-sampled. (E.g. if young people stop answering phone surveys.)

So go forth and introduce yourself to Ms P. You’ll like her.

Hi, everyone!

## Coney Island

Inspired by this story (“Good news! Researchers respond to a correction by acknowledging it and not trying to dodge its implications”):

Coming down from Psych Science
Stopping off at PNAS
Out all day datagathering
And the craic was good
Stopped off at the old lab
Early in the morning
Drove through Harvard taking pictures
And on to the stat department
Stopped off for Sunday papers at the
Journal office, just before Coney Island

On and on, over the hill to the Uni
In the jamjar, autumn sunshine, magnificent
And all shining through

Stop off at Uni for a couple of lines of
R code and some potted data in case
We get review reports before dinner

On and on, over the hill and the craic is good