
The Shrinkage Trilogy: How to be Bayesian when analyzing simple experiments

There are lots of examples of Bayesian inference for hierarchical models or in other complicated situations with lots of parameters or with clear prior information.

But what about the very common situation of simple experiments, where you have an estimate and standard error but no clear prior distribution? That comes up a lot! In such settings, we usually just go with a non-Bayesian approach, or we might assign priors to varying coefficients or latent parameters but not to the parameters of primary interest. But that’s not right: in many of these problems, uncertainties are large, and prior information makes a difference.

With that in mind, Erik van Zwet has done some research. He writes:

Our paper is now on arXiv where it forms a “shrinkage trilogy” with two other preprints. It would be really wonderful if you would advertise them on your blog – preferably without the 6 months delay! The three papers are:

1. The Significance Filter, the Winner’s Curse and the Need to Shrink (Erik van Zwet and Eric Cator)

2. A Proposal for Informative Default Priors Scaled by the Standard Error of Estimates (Erik van Zwet and Andrew Gelman)

3. The Statistical Properties of RCTs and a Proposal for Shrinkage (Erik van Zwet, Simon Schwab and Stephen Senn)

He summarizes:

Shrinkage is often viewed as a way to reduce the variance by increasing the bias. In the first paper, Eric Cator and I argue that shrinkage is important to reduce bias. We show that noisy estimates tend to be too large, and therefore they must be shrunk. The question remains: how much?

From a Bayesian perspective, the amount of shrinkage is determined by the prior. In the second paper, you and I propose a method to construct a default prior from a large collection of studies that are similar to the study of interest.

In the third paper, Simon Schwab, Stephen Senn and I apply these ideas on a large scale. We use the results of more than 20,000 RCTs from the Cochrane database to quantify the bias in the magnitude of effect estimates, and construct a shrinkage estimator to correct it.
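The claim in the first paper, that estimates which survive a significance filter tend to be too large, can be illustrated with a quick simulation. This is just my sketch; the normal distribution of true effects and the unit standard error are arbitrary assumptions for illustration, not anything from the papers themselves.

```python
import numpy as np

rng = np.random.default_rng(1)
n_studies = 100_000

# Assumed population of true effects (arbitrary choice for illustration),
# each estimated with standard error 1, so |estimate| > 1.96 is "significant."
true_effects = rng.normal(0.5, 0.5, n_studies)
estimates = true_effects + rng.normal(0, 1, n_studies)
significant = np.abs(estimates) > 1.96

# Conditional on significance, the estimates exaggerate the true effects:
print(f"mean |estimate| among significant:    "
      f"{np.abs(estimates[significant]).mean():.2f}")
print(f"mean |true effect| among significant: "
      f"{np.abs(true_effects[significant]).mean():.2f}")
```

The first number comes out well above the second: the significant estimates are systematically too large, which is why they need to be shrunk. How much to shrink depends on the distribution of true effects, i.e., the prior, which is the subject of the second and third papers.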

Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond

Charles Margossian, Aki Vehtari, Daniel Simpson, Raj Agrawal write:

Gaussian latent variable models are a key class of Bayesian hierarchical models with applications in many fields. Performing Bayesian inference on such models can be challenging as Markov chain Monte Carlo algorithms struggle with the geometry of the resulting posterior distribution and can be prohibitively slow. An alternative is to use a Laplace approximation to marginalize out the latent Gaussian variables and then integrate out the remaining hyperparameters using dynamic Hamiltonian Monte Carlo, a gradient-based Markov chain Monte Carlo sampler. To implement this scheme efficiently, we derive a novel adjoint method that propagates the minimal information needed to construct the gradient of the approximate marginal likelihood. This strategy yields a scalable differentiation method that is orders of magnitude faster than state of the art differentiation techniques when the hyperparameters are high dimensional. We prototype the method in the probabilistic programming framework Stan and test the utility of the embedded Laplace approximation on several models, including one where the dimension of the hyperparameter is ∼6,000. Depending on the cases, the benefits can include an alleviation of the geometric pathologies that frustrate Hamiltonian Monte Carlo and a dramatic speed-up.

“Orders of magnitude faster” . . . That’s pretty good!

Understanding Janet Yellen

I don’t know anything about Janet Yellen, the likely nominee for Secretary of the Treasury. For the purpose of this post, my ignorance is OK, even desirable, in that my goal is to try to understand mixed messages that I’m receiving.

Two contrasting views on the prospective Treasury Secretary

First, here’s Joseph Delaney:

So, I [Delaney] know that inflation is a potential menace and ignoring debt has gotten many an advanced nation into trouble. These are all reasonable things to be concerned about. But, via Yasha Levine, I want to bring your attention to the views of the frontrunner for incoming treasury secretary:

In a 2018 interview at the Charles Schwab Impact conference in Washington, Ms. Yellen said the United States’ debt path was “unsustainable” and offered a remedy: “If I had a magic wand, I would raise taxes and cut retirement spending.”

Last year, Ms. Yellen touched on the third rail of Democratic politics when she suggested more directly that cuts to Medicare, Medicaid and Social Security could be in order.

“I think it will not be solved without some additional revenues on the table, but I also find it hard to believe that it won’t be solved without some changes to those programs,” Ms. Yellen said at the National Investment Center for Seniors Housing & Care Fall Conference.

So, there are several issues all bundled together here. First, can we stop putting Medicare into the same bucket as the (less generous) Medicaid and the (quite sustainable) Social Security? The problem with Medicare, insofar as there is one, is an issue of medical cost inflation, and that’s an independent policy problem that has little to do with the budget (except as a motivation to solve it). . . .

I am not saying these programs should never be considered for cuts, but that we should be very careful about not framing this as a choice to have lower revenues which require cuts. . . .

This reminded me that I’d noticed a Paul Krugman column on Yellen . . . ok, here it is:

In Praise of Janet Yellen the Economist

She never forgot that economics is about people.

It’s hard to overstate the enthusiasm among economists over Joe Biden’s selection of Janet Yellen as the next secretary of the Treasury. . . . But the good news about Yellen goes beyond her ridiculously distinguished career in public service. Before she held office, she was a serious researcher. And she was, in particular, one of the leading figures in an intellectual movement that helped save macroeconomics as a useful discipline when that usefulness was under both external and internal assault. . . .

Krugman also argues that Yellen “got it right” in 2009 by fighting against the “inflation hawks” to expand the economy.

What’s the conflict?

It seems to me that, despite both coming from the left, or the center-left, Delaney and Krugman are painting very different pictures of Yellen. I say this because Delaney’s point—that it’s a mistake to use artificial budgetary constraints as a rationale for cutting benefits to the poor and middle class—is the kind of argument that I associate with Krugman, at least in his post-2000 incarnation. Krugman’s always saying we can afford Social Security, and he’s been pretty consistently criticizing those political figures who want to cut or otherwise restrict this retirement program. For example:

Social Security does not face a financial crisis; its long-term funding shortfall could easily be closed with modest increases in revenue.

Krugman goes on to offer a reason that some Republican politicians favor cutting Social Security: “it’s all about the big money.”

So here’s the conflict.

A. Yellen wants to cut Social Security (as Delaney notes, she puts it in the “Medicare, Medicaid and Social Security” category, but that’s a separate issue we won’t get into here). Her rationale is that the debt is unsustainable and can’t be fixed by raising taxes alone.

B. Krugman hates, absolutely hates, people who want to cut Social Security, and he’s dismissive of the argument that the retirement program is unaffordable.

C. Krugman looooves Yellen, both as an academic economist and a policy figure.

I’m finding it difficult to hold A, B, and C in my head at the same time. Lewis Carroll might resolve the problem by just adding a fourth statement:

D. A and B are consistent with C.

Of course then I’d wonder why I should believe D, but then Carroll could posit:

E. D is true.

I guess this would pretty much cover it!

Possible explanations

OK, here are some possible resolutions to the above puzzle:

1. Maybe Yellen was misquoted and she doesn’t really want to cut Social Security?

2. Maybe Krugman wasn’t aware of Yellen’s stance on Social Security when he wrote his column the other day?

3. Maybe Krugman knows about Yellen’s stance on Social Security and doesn’t like it, but in his column he was evaluating all her positions on the economy: perhaps she agreed with him on 9 out of 10 issues and so in his column he’s focusing on the places where they agree?

4. Maybe Krugman has changed his views and now he thinks Social Security really is unsustainable? Maybe Social Security was sustainable in 2015 but not in 2020?

I’m guessing it’s #3. But I’m still baffled by how Krugman is so enthusiastic for Yellen given that they seem to disagree on such a core political and economic issue.


I’m not sure what to think here. My point is not to drag Krugman for being inconsistent. Rather, my point is how difficult it is for an outsider to evaluate policy positions.

There are lots of examples where a policymaker is on the left, and he or she is criticized from the right, or vice versa. And examples where a centrist is criticized from both sides, for example by supporting enough environmental regulations to annoy the right, but not enough to satisfy the left. I’m not saying that the centrist position is correct here; I’m just saying I understand the debate, or at least I think I do.

There are also examples of controversy arising from multidimensionality in policy positions. For example, you might agree with a policymaker’s position on China but disagree with their stance on India. Or you could agree on gun rights but not on abortion rights. I get that.

The Yellen example is interesting to me because it’s not either of the above things. Delaney and Krugman have different tones (the polite statistician and the aggressive economist) but I think their political positions are pretty similar. Delaney’s on the outside and has some distrust of Ivy League economics professors, and Krugman’s an insider, so that somewhat explains their different views about a credentialed academic economist—but I don’t see that Delaney and Krugman are disagreeing on the relevant policy question.

And that brings us to the other point, which is that this does not seem to be a multidimensional issue. Delaney is suspicious of Yellen regarding Social Security—but Krugman cares about Social Security too!

When two people with the same views on the same issue have opposite takes on a policymaker, I’m not sure what to think. Which is why I’m saying that, for the purpose of this discussion, it’s good that I came into this knowing nothing about Yellen. I don’t really have any strong views about Social Security either, but that’s another story.

P.S. In comments, Jim offers another possibility:

Yellen’s previous comments on social security, medicare and medicaid were just thinking out loud and don’t reflect a policy position. Krugman and Yellen have probably had many conversations, so Krugman knows much more about her thinking than a few simple quotes can reflect.

That makes sense. If Krugman thinks that Yellen’s previous statements on Social Security or entitlement reform don’t reflect her current positions, then it would make sense for him not to get into those issues in his column.

Basbøll’s Audenesque paragraph on science writing, followed by a resurrection of a 10-year-old debate on Gladwell

I pointed Thomas Basbøll to my recent post, “Science is science writing; science writing is science,” and he in turn pointed me to his post from a few years ago, “Scientific Writing and ‘Science Writing,'” which stirringly begins:

For me, 2015 will be the year that I [Basbøll] finally lost all respect for “science writing”.

He continues: “especially since the invention of the TED talk (a “dark art”), it gave me the feeling of knowing without actually providing me with knowledge. Popular presentations of science tell us stories about what is known without giving us the critical foundations we need to engage with it, i.e., to question those stories.”

And leads to this stunning conclusion:

Knowledge was once something you acquired through years of study, guided by books, but framed by a classroom (other people), an observatory (other vistas), a laboratory (other experiences), a library (other books). If you did not have access to these “academic” conditions you did not presume to understand the topic. Scientists wrote about their discoveries for people who had the knowledge, intelligence, time and apparatus to test them. These days, “science” is becoming something that is produced in a lab and consumed in a book you buy at the airport.

I’m a sucker for nostalgia. But I still can’t bring myself to take the position that the old days were better—after all, the vast majority of people didn’t, and don’t, have the opportunity for these years of study—or, even if they did, it would only be in one narrow field—so I still like the idea of science writing, if we can get beyond the obsolete “science as hero” framework.

One thing I like about the above-quoted paragraph is its Audenesque rhythm. (“Yesterday all the past…”). Then again, Orwell roasted Auden for that particular poem, and years later Auden renounced it. Something can sound good and even make a certain kind of logical sense but still be factually or morally wrong. Orwell knew this all along; it took Auden a while to realize it; and there are lots of people who still don’t get the point.

Speaking of Malcolm Gladwell . . . In his 2015 post, Basbøll links to this blog discussion from 2010, which is kind of amazing in that Gladwell responds to Basbøll in the comments. And it wasn’t even Basbøll’s post! Blogs really used to matter, enough so that a big name like Malcolm Gladwell would engage with critic A in the comments section of a post by blogger B. And they went back and forth!

I’m not the world’s biggest Gladwell fan, but I admire that he engaged seriously with criticism in that way. Here’s an example, from late in the thread:

What strikes me [Gladwell] most—reading all the comments—is how unwilling many of the commenters (most of whom, I’m guessing, are academics) are to deal with the trade-off presented in the original post. Academics have the luxury, appropriately, of dealing with ideas and arguments and social science in its full complexity. Those of us who have chosen to swim in the lay pool do not. We have to make compromises. My book Blink, for example, was a compromise: an attempt to nudge people away from the reflexive position that intuition and instinct are invariably reliable or useful. A complete summary of the academic understanding of those questions would have been read by a fraction of the audience. Figuring out where to draw that line is difficult, and I don’t pretend that I always do it properly. But I do think that the effort to expose as wide an audience as possible to the wonders and mysteries of social science ought to be met with more than condescension—especially from a group of people who teach for a living.

I don’t think this response from Gladwell is perfect. For example, he does not address that in his books he wasn’t just making compromises and trade-offs; he was also actively promoting junk science such as John Gottman’s divorce predictions (see here—wow, that was from back in 2010 also! Such a long time has gone by), and I don’t know that he (Gladwell) has ever retracted his endorsement of Gottman’s claims.

So, yeah, I think Gladwell misses the point in his replies, in that his paragraph sounds reasonable in isolation but it doesn’t address his devastating combination of credulity and unwillingness to admit specific errors. But I still very much appreciate that he at least made the effort: he showed the critics some respect, which is more than you can say of David Brooks, Susan Fiske, Cass Sunstein, etc.

The other stunning thing in that thread from 2010 is when Brayden King, who wrote the blog post that started it all, added this in comments:

Lots of completely legitimate academic articles are liberally sprinkled with “premature conclusions or misleading anecdotes.” I don’t see them as harmful as you do in either case. The point of much empirical work is to push theoretical boundaries and to get people to think. Gladwell is doing the same thing, the main difference being the intended audience.


I mean, yeah, sure, lots of academics make mistakes and don’t ever issue corrections. ESP, ages ending in 9, pizzagate, the disgraced primatologist, that dude from Ohio State with the voodoo dolls, air rage, himmicanes, beauty and sex ratios, that sleep researcher, etc etc. But that’s a bad thing, right?? No matter what the intended audience.

I do think there are some solid defenses of Gladwell. One possible defense is that the man has a workflow, and if he were to fact-check his writing too carefully, it would destroy the spontaneity that makes it all hang together. The second possible defense is that to correct the errors would destroy the willing suspension of disbelief that makes traditional science writing so effective.

In either case, the argument is: (a) the pluses of Gladwell’s writing (the sharing of true facts, the reporting and publicizing of good research, the engagement of the reader in the process of social science) outweigh the minuses (the sharing of false claims, the reporting and publicizing of bad research, the misrepresentation of social science), and (b) that removal or correction of the errors would be impossible as it would in some way destroy the ability of Gladwell to produce this work.

I think this argument is plausible. But, to make it work, you need both (a) and (b). Either alone is not enough.

P.S. Thanks to Zad Chow for the above picture of Polynomial Cats. Happy new year, Zad!

“We’ve got to look at the analyses, the real granular data. It’s always tough when you’re looking at a press release to figure out what’s going on.”

Chris Arderne writes:

Surprised to see you hadn’t yet discussed the Oxford/AstraZeneca 60%/90% story on the blog.

They accidentally changed the dose for some patients without a hypothesis, saw that it worked out better, and are now (sort of) claiming 90% as a result…

Sounds like your kind of investigation?

I hadn’t heard about this so I googled *Oxford/AstraZeneca 60%/90%* and found this news article from Helen Branswell and Adam Feuerstein:

AstraZeneca said Monday that its coronavirus vaccine reduced the risk of symptomatic Covid-19 by an average of 70.4%, according to an interim analysis of large Phase 3 trials conducted in the United Kingdom and Brazil. . . .

The preliminary results on the AstraZeneca vaccine were based on a total of 131 Covid-19 cases in a study involving 11,363 participants. The findings were perplexing. Two full doses of the vaccine appeared to be only 62% effective at preventing disease, while a half dose, followed by a full dose, was about 90% effective. That latter analysis was conducted on a small subset of the study participants, only 2,741.

A U.S.-based trial, being supported by Operation Warp Speed, is testing the two-full-dose regimen. That may soon change. AstraZeneca plans to explore adding the half dose-full dose regimen to its ongoing clinical trials in discussions with regulatory agencies . . .

Fauci cautioned that full datasets — which the Oxford researchers said they intend to publish in a scientific journal — need to be pored over before conclusions can be drawn.

“We’ve got to look at the analyses, the real granular data. It’s always tough when you’re looking at a press release to figure out what’s going on,” Fauci said. . . .

Indeed, it’s hard to deconstruct a press release. In this case, the relevant N is not the number of people in the study; it’s the number of coronavirus cases. If the cases were proportional in the subset, then that’s (2741/11363)*131 ≈ 32 cases . . . OK, if there are equal numbers in the placebo and treatment groups, and the risk is reduced by 90%, then that would be something like 30 cases in the placebo group and 3 in the treatment group, if I’m thinking about this right. A 70% reduction would be about 9 cases in the treatment group. If you expect to see 9, then it would be unlikely to see only 3 . . . I guess I’d do the Bayes thing and estimate the efficacy to be somewhere between 70% and 90% . . . of course that’s making the usual assumption that the vaccine is as effective in the real world as in this trial.
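The back-of-the-envelope calculation above can be written out explicitly. This is just my arithmetic under the stated assumptions (cases split proportionally to subgroup size, equal-size treatment and placebo arms), not anything from the trial report:

```python
# Numbers from the press release.
total_cases = 131
total_participants = 11_363
subset_participants = 2_741

# Assume the cases split proportionally to subgroup size.
subset_cases = round(total_cases * subset_participants / total_participants)
print(subset_cases)  # 32

# With equal-size arms, efficacy e means the vaccine arm has (1 - e) times
# the placebo arm's cases, so n cases split as n/(2-e) and n(1-e)/(2-e).
def split_cases(n_cases, efficacy):
    placebo = n_cases / (2 - efficacy)
    vaccine = placebo * (1 - efficacy)
    return round(placebo), round(vaccine)

print(split_cases(subset_cases, 0.90))  # about 29 placebo cases vs. 3 vaccine
print(split_cases(subset_cases, 0.70))  # about 25 placebo cases vs. 7 vaccine
```

Holding the placebo count at roughly 30, as in the rough calculation above, a 70% reduction would give about 9 vaccine-arm cases; normalizing to 32 total cases gives about 7. Either way, observing only 3 cases would be surprising if the true efficacy were 70%.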

OK, so I guess I didn’t have that much to say on this one. Ultimately I hope they can learn not just from these topline numbers but by looking at more direct measurements of antibodies or whatever on the individual patients.

Full disclosure: I have done some work with AstraZeneca.

P.S. I think it’s useful to sometimes post on statistical issues like this where I have no special insight, if for no other reason than to remind people that nobody, myself included, has the answer all the time.

Unfair to James Watson?

A reader writes:

I usually enjoy your blog, but when I saw the first sentence of this recent post it left a bad taste that just didn’t go away.

The post in question was titled, “New England Journal of Medicine engages in typical academic corporate ass-covering behavior,” and its first sentence began, “James Watson (not the racist dude who, in 1998, said that a cancer cure was coming in 2 years) writes . . .”

The reader continues:

You could have said “not the co-discoverer of the structure of DNA,” but instead, to identify him, it is the “racist” label you had to use. Was that really necessary? What was the point of that? To show that you are among the enlightened and moral in identifying that as his most important characteristic? (Besides, racist is an entirely vague term now, as it can include someone saying “All lives matter” or disagreeing with an African-American.)

When a commenter took issue with his mention, your response was rather bizarre. So what if cancer cure forecasting and racism were his side gigs? That’s what he is famous for now? And no other great scientists have made bogus predictions? It was all rather petty, unbecoming, and unnecessary. The treatment of Watson has been a disgrace and is one of many episodes leading to the culture of fear in academia for saying something offhand that will get you unpersoned. Thanks for adding to that.

I disagree, but I appreciate the open criticism. Here is my reply:

1. Publicity goes both ways. There was nobody holding a gun to Watson’s head telling him to say, in 1998, that “Judah is going to cure cancer in two years.” Watson seems to love publicity. If the cancer cure had really come, Watson could rightly claim credit for calling it ahead of time. When it didn’t come . . . then, yeah, he’s due for some mockery. Don’t you think it’s a little bit irresponsible for one of the most famous biologists in the world to tout a nonexistent cancer cure? I don’t like it when Dr. Oz does this sort of thing either.

2. The reason I called Watson a racist is not that he said “All lives matter” or that he disagreed with an African American. I called him a racist because he’s said things like this:

Some anti-Semitism is justified.

All our social policies are based on the fact that [Africans’] intelligence is the same as ours – whereas all the testing says not really.

The one aspect of the Jewish brain that is not first class is that Jews are said to be bad in thinking in three dimensions . . . it is true.

I think now we’re in a terrible situation where we should pay the rich people to have children . . . if we don’t encourage procreation of wealthier citizens, IQ levels will most definitely fall.

Indians in [my] experience [are] servile . . . because of selection under the caste system.

East Asian students [tend] to be conformist, because of selection for conformity in ancient Chinese society.

3. You refer to “the culture of fear in academia for saying something offhand that will get you unpersoned.” First, I don’t know that Watson’s statements were so “offhand”; he seems to have pretty consistent views. Second, I’m not unpersoning the guy. He’s a person, and one of the things he’s done as a person is to trade in some of the fame he got from his youthful scientific accomplishments to promote racism and cancer cures that don’t work. Third, what about the culture of fear for ethnic minorities and women in science? Watson was head of a major lab and a big figure in American biology for many years. It doesn’t bother me so much that people might want to think twice before spewing some of the opinions that Watson’s expressed.

In the meantime, I don’t think they’ll be taking DNA out of the textbooks, even if one of its discoverers was Rosalind Franklin, who Watson apparently couldn’t stand. She couldn’t do maths, she couldn’t think in three dimensions very well, she didn’t even curl her hair . . . jeez! It’s amazing she could do science at all. I guess standards were lower back in the 1950s.

Look, I’m not saying Watson was evil. He was a complicated person, like all of us. But scientific politics, sexism, and racism were not just part of his private opinions. They were part of his public persona. If you go around saying “Some anti-Semitism is justified,” then, yeah, you’re gonna piss some people off!

Bishops of the Holy Church of Embodied Cognition and editors of the Proceedings of the National Academy of Christ

Paul Alper points to a recent New York Times article about astrology as a sign that the world is going to hell in a handbasket.

My reply:

Astrology don’t bug me so much cos it doesn’t pretend to be science. I’m more bothered by PNAS-style fake science because it pretends to be real science. Same thing with religion. I don’t get so worked up about Biblical fundamentalists. If someone wants to believe that someone parted the Red Sea, whatever. Similarly, if Prof. Susan Fiske and Prof. Robert Sternberg were Rev. Susan Fiske and Rev. Robert Sternberg, bishops of the Holy Church of Embodied Cognition and editors of the Proceedings of the National Academy of Christ, I’d be less annoyed, because their experiments would be reported in the religion section of the paper, not the science section.

Alper adds:

If you post on this, be sure to include a reference to truffle French fries, an item not normally found in the upper midwest. My guess is the stars will be aligned and many comments will be forthcoming whether or not Venus is retrograding.

We’ll see.

A very short statistical consulting story

I received the following email:

Professor Gelman,

My firm represents ** (Defendant) in a case pending in the U.S. District Court for the District of **. This case concerns [a topic in political science that you have written about].

I’ve reviewed your background and think that your research and interests, in particular your statistical background, may offer a valuable perspective in this matter.

I’ve attached a report drafted by Plaintiffs’ expert, **. The Plaintiffs have submitted this report in support of their Motion for Preliminary Injunction. Our response to the Plaintiffs’ Motion is due on **. This is the same date by which we would need to submit any rebuttal expert reports.

Do you have a few moments when I could discuss this case in further detail with you? If so, please let me know when I could give you a call and the best number to reach you.

Thank you,

I replied:

Hi—I took a look at **’s report and it looks pretty good. So I don’t know that you should be contesting it. He seems to have done a solid analysis.

That was an easy consulting job—actually, not a job at all, as I declined the opportunity to take this one on. People send me so many bad analyses to look at; it’s refreshing when they send me solid work for a change.

P.S. This all happened a year ago and appeared just now because of a combination of usual blog delay and bumping due to one of our coronavirus posts this spring.

2 PhD student positions on Bayesian workflow! With Paul Bürkner!

Paul Bürkner writes:

The newly established work group for Bayesian Statistics of Dr. Paul-Christian Bürkner at the Cluster of Excellence SimTech, University of Stuttgart (Germany), is looking for 2 PhD students to work on Bayesian workflow and Stan-related topics. The positions are fully funded for at least 3 years and people with a Master’s degree in any quantitative field can apply.
All details on the two positions can be found in the full announcement.


This sounds great! Some of our ideas on workflow are here.

Is causality as explicit in fake data simulation as it should be?

Sander Greenland recently published a paper with a very clear and thoughtful exposition on why causality, logic and context need full consideration in any statistical analysis, even in strictly descriptive or predictive analyses.

For instance, from the concluding section: “Statistical science (as opposed to mathematical statistics) involves far more than data – it requires realistic causal models for the generation of that data and the deduction of their empirical consequences. Evaluating the realism of those models in turn requires immersion in the subject matter (context) under study.”

Now, when I was reading the paper I started to think about how these three ingredients are, or should be, included in most or all fake data simulation. Whether one is simulating fake data for a randomized experiment or a non-randomized comparative study, the simulations need to adequately represent the likely underlying realities of the actual study. One only has to substitute “simulation” into this excerpt from the paper: “[Simulation] must deal with causation if it is to represent adequately the underlying reality of how we came to observe what was seen – that is, the causal network leading to the data.” For instance, it is obvious that sex is determined before treatment assignment or selection (and that ordering should be respected in the simulations), but some features may not be so obvious.

Once, someone offered me a proof that the simulated censored survival times they generated, where the censoring time was set before the survival time (or some weird variation on that), would meet the definition of non-informative censoring. Perhaps there was a flaw in the proof, but the properties of repeated trials we wanted to assess were noticeably different from those obtained when the survival times were generated first, and then the censoring times were generated and applied. Simulated in that order, the data likely better reflect the underlying reality as we understand it, and others (including future selves) are more likely to be able to spot and criticize any mismatch.
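To make the ordering point concrete, here is a minimal sketch of simulating non-informatively censored survival data in the order that mirrors the assumed reality: survival times first, then an independent censoring process applied to them. The exponential distributions and scale parameters are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# Step 1: generate the underlying survival times.
survival = rng.exponential(scale=2.0, size=n)

# Step 2: generate censoring times independently of the survival times,
# which is what non-informative censoring assumes.
censoring = rng.exponential(scale=3.0, size=n)

# Step 3: apply censoring to obtain what would actually be observed.
observed = np.minimum(survival, censoring)
event = survival <= censoring  # True where the event, not censoring, occurred

print(f"fraction censored: {1 - event.mean():.2f}")
```

The causal ordering of the steps is explicit here, so a reader can check it against the assumed reality of the study, which is exactly the kind of scrutiny the generate-censoring-first construction quietly evades.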

So I then worried about how clear I had been in my seminars and talks on using fake data simulation to better understand statistical inference, both frequentist and Bayes. At first I thought I had been clear, but on further thought I am not so sure. One possibly misleading footnote I gave on the bootstrap and cross-validation likely needs revision, as it did not reflect causation at all.


A new hot hand paradox

1. Effect sizes of just about everything are overestimated. Selection on statistical significance, motivation to find big effects to support favorite theories, researcher degrees of freedom, looking under the lamp-post, and various other biases. The Edlin factor is usually less than 1. (See here for a recent example.)

2. For the hot hand, it’s the opposite. Correlations between successive shots are low, but, along with Josh Miller and just about everybody else who’s played sports, I think the real effect is large.

How to reconcile 1 and 2? The answer has little to do with the conditional probability paradox that Miller and Sanjurjo discovered, and everything to do with measurement error.

Here’s how it goes. Suppose you are “hot” half the time and “cold” half the time, with Pr(success) equal to 0.6 in your hot spells and 0.4 in your cold spells. Then the probability of two successive shots having the same result is 0.6^2 + 0.4^2 = 0.52. So if you define the hot hand as the probability of success conditional on a previous success, minus the probability of success conditional on a previous failure, you’ll think the effect is only 0.04, even though in this simple model the true effect is 0.20.
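This latent-state calculation is easy to check by simulation. Here's a minimal sketch using the numbers from the example above; the 50/50 hot/cold split and the assumption that the state stays fixed across a pair of shots are part of the toy model, not a claim about basketball.

```python
import random

def hot_hand_gap(n_pairs=200_000, p_hot=0.6, p_cold=0.4, seed=1):
    """Estimate P(success | previous success) - P(success | previous failure)
    when a latent hot/cold state (equally likely) is fixed within each pair."""
    rng = random.Random(seed)
    s_after_s = s_after_f = n_s = n_f = 0
    for _ in range(n_pairs):
        p = p_hot if rng.random() < 0.5 else p_cold  # latent state for this pair
        first = rng.random() < p
        second = rng.random() < p
        if first:
            n_s += 1
            s_after_s += second
        else:
            n_f += 1
            s_after_f += second
    return s_after_s / n_s - s_after_f / n_f

gap = hot_hand_gap()
# Theory: 0.52 - 0.48 = 0.04, far below the true hot-vs-cold gap of 0.20.
```

The simulated conditional-probability gap lands near 0.04 even though the underlying hot-cold difference in success probability is 0.20, which is the attenuation the post describes.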

This is known as attenuation bias in statistics and econometrics and is a well-known effect of conditioning on a background variable that is measured with error. The attenuation bias is particularly large here because a binary outcome is about the noisiest thing there is. This application of attenuation bias to the hot hand is not new (it’s in some of the hot hand literature that predates Miller and Sanjurjo, and they cite it); I’m focusing on it here because of its relevance to effect sizes.

So one message here is that it’s a mistake to define the hot hand in terms of serial correlation (so I disagree with Uri Simonsohn here).

Fundamentally, the hot hand hypothesis is that sometimes you’re hot and sometimes you’re not, and that this difference corresponds to some real aspect of your ability (i.e., you’re not just retroactively declaring yourself “hot” just because you made a shot). Serial correlation can be an effect of the hot hand, but it would be a mistake to define serial correlation as the hot hand.

One thing that’s often left open in hot hand discussions is to what extent the “hot hand” represents a latent state (sometimes you’re hot and sometimes you’re not, with this state unaffected by your shot) and to what extent it’s causal (you make a shot, or more generally you are playing well, and this temporarily increases your ability, whether because of better confidence or muscle memory or whatever). I guess it’s both things; that’s what Miller and Sanjurjo say too.

Also, remember our discussion from a couple years ago:

The null model is that each player j has a probability p_j of making a given shot, and that p_j is constant for the player (considering only shots of some particular difficulty level). But where does p_j come from? Obviously players improve with practice, with game experience, with coaching, etc. So p_j isn’t really a constant. But if “p” varies among players, and “p” varies over the time scale of years or months for individual players, why shouldn’t “p” vary over shorter time scales too? In what sense is “constant probability” a sensible null model at all?

I can see that “constant probability for any given player during a one-year period” is a better model than “p varies wildly from 0.2 to 0.8 for any player during the game.” But that’s a different story.

Ability varies during a game, during a season, and during a career. So it seems strange to think of constant p_j as a reasonable model.

OK, fine. The hot hand exists, and estimates based on correlations will dramatically underestimate it because of attenuation bias.

But then, what about point 1 above, that the psychology and economics research literature (not about the hot hand, I’m talking here about applied estimates of causal effects more generally) typically overestimates effect size, sometimes by a huge amount. How is the hot hand problem different from all other problems? In all other problems, published estimates are overestimates. But in this problem, the published estimates are too small. Attenuation bias happens in other problems, no? Indeed, I suspect that one reason econometricians have been so slow to recognize the importance of type M errors and the Edlin factor is that they’ve been taught about attenuation bias and they’ve been trained to believe that noisy estimates are too low. From econometrics training, it’s natural to believe that your published estimates are “if anything, too conservative.”

The difference, I think, is that in most problems of policy analysis and causal inference, the parameter to be estimated is clearly defined, or can be clearly defined. In the hot hand, we’re trying to estimate something latent.

To put it another way, suppose the “true” hot hand effect really is a large 0.2, with your probability going from 40% to 60% when you go from cold to hot. There’s not so much that can be done with this in practice, given that you never really know your hot or cold state. So a large underlying hot hand effect would not necessarily be accessible. That doesn’t mean the hot hand is unimportant, just that it’s elusive. Concentration, flow, etc., these definitely seem real. It’s the difference between estimating a particular treatment effect (which is likely to be small) and an entire underlying phenomenon (which can be huge).

Further formalization of the “multiverse” idea in statistical modeling

Cristobal Young and Sheridan Stewart write:

Social scientists face a dual problem of model uncertainty and methodological abundance. . . . This ‘uncertainty among abundance’ offers spiraling opportunities to discover a statistically significant result. The problem is acute when models with significant results are published, while those with non-significant results go unmentioned. Multiverse analysis addresses this by recognizing ‘many worlds’ of modeling assumptions, using computational tools to show the full set of plausible estimates. . . . Our empirical cases examine racial disparity in mortgage lending, the role of education in voting for Donald Trump, and the effect of unemployment on subjective wellbeing. Estimating over 4,300 unique model specifications, we find that OLS, logit, and probit are close substitutes, but matching is much more unstable. . . .

My quick thought is that the multiverse is more conceptual than precise. Or, to put it another way, I don’t think the multiverse can ever really be defined. For example, in our multiverse paper we considered 168 possible analyses, but there were many other researcher degrees of freedom that we did not even consider. One guiding idea we had in defining the multiverse for any particular analysis was to consider other papers in the same subfield. Quite often, if you look at different papers in a subfield, or different papers by a single author, or even different studies in a single paper, you’ll see alternative analytical choices. So these represent a sort of minimal multiverse. This has some similarities to research in diplomatic history, where historians use documentary evidence to consider what alternative courses of action might have been considered by policymakers.
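To make the combinatorics concrete: a multiverse is essentially the Cartesian product of analytical choices. Here's a toy sketch with entirely hypothetical choices (these are not the 168 analyses from our paper); each combination is one "world" an analyst could defensibly have reported.

```python
from itertools import product

# Hypothetical researcher degrees of freedom; a real multiverse would use
# the actual choices available in the study at hand.
choices = {
    "outcome": ["raw", "log"],
    "exclusions": ["none", "drop_outliers"],
    "covariates": ["minimal", "full"],
    "estimator": ["ols", "logit", "probit"],
}

# One dict per unique analysis specification: 2 * 2 * 2 * 3 = 24 worlds.
multiverse = [dict(zip(choices, combo)) for combo in product(*choices.values())]
```

The point of the blog comment above is that any such grid is a lower bound: the listed choices are only the ones the analyst thought to enumerate.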

Also, regarding that last bit above on matching estimation, let me emphasize, following Rubin (1970), that it’s not matching or regression, it’s matching and regression (see also here).

Greek statistician is in trouble for . . . telling the truth!

Paul Alper points us to this news article by Catherine Rampell, which tells this story:

Georgiou is not a mobster. He’s not a hit man or a spy. He’s a statistician. And the sin at the heart of his supposed crimes was publishing correct budget numbers.

The government has brought a relentless series of criminal prosecutions against him. His countrymen have sought their own vengeance by hacking his emails, dragging him into court, even threatening his life. His lawyers in Greece are now preparing for his latest trial, which begins this month . . .

Politicians accused him of being a “Trojan horse” for international interests that wanted to place Greece under “foreign occupation.” It didn’t matter that his numbers were repeatedly validated by outside experts. Or that the deficit his agency calculated precisely matched the net amount Greece borrowed from capital markets in 2009.

The government prosecuted, cleared and re-prosecuted him anyway, for causing “extraordinary damage” to the Greek state and for “violation of duty.” In one case, he was given a suspended prison sentence of two years. Two criminal investigations remain open.

I’m reminded of this story, The Commissar for Traffic Presents the Latest Five-Year Plan. There sometimes seem to be incentives to give inaccurate forecasts that tell people what they want to hear.

Getting back to the Greek story, Alper writes:

Consider yourself personally lucky—Wansink, Brooks, Bem, etc.—that you don’t live in Greece because:

In layman’s terms, a court said he made statements that were true but that hurt someone’s reputation. (Yes, this is an actual crime in Greece.) If his appeal fails, he’ll be forced to pay and publicly apologize to his predecessor. This means the person who restored the credibility of Greek statistics will have to apologize to a person who had been fudging the data.

Wow. I guess whistleblowers have it hard there too.

The 200-year-old mentor

Carl Reiner died just this year and Mel Brooks is, amazingly, still alive. But in any case their torch will be carried forward, as long as there are social scientists who are not in full control of their data.

The background is the much-discussed paper, “The association between early career informal mentorship in academic collaborations and junior author performance.”

Dan Weeks decided to look into the data from this study. He reports:

I [Weeks] think there are a number of problematic aspects with the data used in this paper.

See Section 13 ‘Summary’ of

How can one have a set of mentors with an average age > 200? How can one have 91 mentors?

Always always graph your data!

Now whenever people discuss mentoring, I’m gonna hear that scratchy Mel Brooks voice in the back of my head.

Best comics of 2010-2019?

X linked to this list by Sam Thielman of the best comics of the decade. The praise is a bit over the top (“brimming with wit and pathos” . . . “Every page in Ferris’s enormous debut is a wonder” . . . “An astounding feat of craftsmanship and patience” . . . “never has an artist created a world so vivid without a single word spoken” etc.), but that’s been the style in pop-music criticism for a few decades, so I’m not surprised to see it in other pop-cultural criticism as well: the critic is juicing up the positivity because he’s promoting the entire genre.

It’s interesting how different these are than Franco-Belgian BD’s. Lately I’ve been continuing to read Emile Bravo and Riad Sattouf, among others.

U.S. comics are like indie movies, Franco-Belgian BD’s are like Hollywood productions. Even the BD’s written and drawn by a single person have certain production values, in contrast to the DIY attitude from independent comics in English.

Today in spam

1. From “William Jessup,” subject line “Invitation: Would you like to join GlobalWonks?”:

Dear Richard,

I wanted to follow up one last time about my invitation to join our expert-network.

We are happy to compensate you for up to $900 per hour for our client engagements. If you would like to join us, you may do so by signing up here.

If you already signed up, please ignore this email.

Hey, for $900/hour, you can call me Richard, no problem. Whatever you say, William!

I’ve kept in the link above in case any Richards in our readership would like to get in on this sweet, sweet deal. Just click and join; I’m sure the $900 checks will start rolling in.

2. From “Christina,” subject line “Re: Regarding Andrew Gelman’s Book”:

Dear Dr. Andrew Gelman,

I am Christina Batchelor, Editorial assistant from Index of Sciences Ltd. contacting you with the reference from our editorial department. Basing on your outstanding contribution to the scientific community, we would like to write a book for you.

Many Researchers like you wanted to write and publish a book to show their scientific achievements. But only a few researchers have published their books and yet there are researchers who still have the thought of writing a book and publishing it, but due to their busy schedule, they never get the time to write the book by themselves and publish it.

If you are one of those researchers who are very busy but still want to write a book and publish it? we can help you with the writing and publishing of your book.

With our book writing service, we can convert your research contributions or papers into common man’s language and draft it like a book. . . .

Dear Christina:

If you really want to hook me for this sort of scam, try calling me Richard. That’ll get my attention. Also, if this Index of Sciences Ltd. thing ever stops working out, you should look around for other opportunities. Maybe Wolfram Research is hiring?

Are female scientists worse mentors? This study pretends to know

A new paper in Nature Communications, The association between early career informal mentorship in academic collaborations and junior author performance, by AlShebli, Makovi, and Rahwan, caught my attention. There are a number of issues, but what bothered me the most is the post-hoc speculation about what might be driving the associations.

Here’s the abstract:

We study mentorship in scientific collaborations, where a junior scientist is supported by potentially multiple senior collaborators, without them necessarily having formal supervisory roles. We identify 3 million mentor–protégé pairs and survey a random sample, verifying that their relationship involved some form of mentorship. We find that mentorship quality predicts the scientific impact of the papers written by protégés post mentorship without their mentors. We also find that increasing the proportion of female mentors is associated not only with a reduction in post-mentorship impact of female protégés, but also a reduction in the gain of female mentors. While current diversity policies encourage same-gender mentorships to retain women in academia, our findings raise the possibility that opposite-gender mentorship may actually increase the impact of women who pursue a scientific career. These findings add a new perspective to the policy debate on how to best elevate the status of women in science.

To find these mentor-protégé pairs, they first do gender disambiguation on names in their dataset of 222 million papers from the Microsoft Academic Graph, then define a junior scholar as anyone within 7 years of their first publication in the set, and a senior scholar as anyone past 7 years. They argue that this looser definition of mentorship (anyone a junior person published with who had passed the senior mark at the time) is okay because a lot of the time there is informal mentorship from those other than one’s advisor, in the form of somehow helping or giving advice, and one could interpret the co-authorship of the paper itself as helping. It seems a little silly that after saying this they present results of a survey sample of 167 authors to argue that their assumption is good. But beyond the potential for dichotomizing experience to introduce researcher degrees of freedom, I don’t really have a problem with these assumptions.

To analyze the data, they define two measures of mentor quality as independent variables. First the “big shot” measure, which is the average impact of the mentors prior to mentorship, operationalized as “their average number of citations per annum up to the year of their first publication with the protégé.” Then the hub experience, defined as the average degree of the mentors in the network of scientific collaborations up to the year of their first publication with the protégé.  

They measure mentorship outcome, conceptualized as “the scientific impact of the protégé during their senior years without their mentors”, by calculating the average number of citations accumulated 5 years post publication across all the papers published when the academic age of the protégé was greater than 7 years and which included none of the scientists identified as their mentors.

I have some slight issues with their introduction of terminology like mentorship quality here…. Should we really call a citation-based measure of impact mentorship quality? Yes, it’s easy to remember what they are trying to get at when they call average citations per year “big shot” experience, but at the same time, gender is known to have a robust effect on citations. So defining mentorship quality based on average citations per year essentially bakes gender bias into the definition of quality – I would expect women to have lower big shot scores and lower mentorship outcomes on average based on their definitions. But whatever, this is mostly annoying labeling at this point.

They then do ‘coarsened exact matching’, matching groups of protégés who received a certain level of mentorship quality with another group with lower mentorship quality but comparable in terms of other characteristics like the number of mentors, year they first published, discipline, gender, rank of affiliation on their first mentored publication, number of years active post mentorship, and average academic age of their mentors, and hub experience or big shot experience, whichever one they are not analyzing at the time. To motivate this, they say “While this technique does not establish the existence of a causal effect, it is commonly used to infer causality from observational data.” Um, what? 
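For readers unfamiliar with the technique, coarsened exact matching can be sketched in a few lines: coarsen each matching covariate into bins, group units by the binned values, and keep only strata that contain both "treated" (here, higher mentorship quality) and control units. This is a toy illustration, not the authors' code; the field names and binning are hypothetical.

```python
from collections import defaultdict

def cem_strata(units, coarsen):
    """Group units by their coarsened covariate values; keep only strata
    containing at least one treated and one control unit."""
    strata = defaultdict(list)
    for u in units:
        strata[tuple(coarsen(u))].append(u)
    return {k: g for k, g in strata.items()
            if any(u["treated"] for u in g) and any(not u["treated"] for u in g)}

# Toy example: coarsen number of mentors into bins of width 2, match exactly
# on gender. The third unit has no control counterpart and is pruned.
units = [
    {"treated": True,  "n_mentors": 2, "gender": "f"},
    {"treated": False, "n_mentors": 3, "gender": "f"},
    {"treated": True,  "n_mentors": 6, "gender": "m"},
]
matched = cem_strata(units, lambda u: (u["n_mentors"] // 2, u["gender"]))
```

The method prunes unmatched units but, as the quoted sentence concedes, it does nothing by itself to license causal claims: it only balances the covariates you chose to coarsen.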

They compare quintiles separately for big-shot and hub experience, where treatment and control are the (i+1)th and ith quintiles. They do a bunch of significance tests, finding “an increase in big-shot experience is significantly associated with an increase in the post-mentorship impact of protégés by up to 35%. Similarly, the hub experience is associated with an increase [in] the post-mentorship impact of protégés, although the increase never exceeds 13%”. They conclude there’s a stronger association between mentorship outcome and big-shot experience than with hub experience, since changes to big-shot experience have more impact given their quintile-comparison approach.

Their main takeaways are about gender, though, which involves matching sets of protégés where everything is comparable except for the number of female mentors. They present some heatmaps, one for male protégés and one for female protégés, where given a certain number of mentors, one can see how increasing the proportion of female mentors generally decreases the protégés’ outcomes (recall that’s citations on papers with none of the mentors once the protégé reaches senior status). Many “*” for the significance tests. Graph b is more red overall, implying that the association between having more female mentors and receiving fewer citations is weaker for males.

[Figure: graphs of associations with female mentorship]


They also look at what mentoring a particular protégé does for the mentor, captured by the average impact (citations 5 years post publication) of the papers the mentor and protégé co-authored during the mentorship period. They match male and female protégés on discipline, affiliation rank, number of mentors, and the year in which they published their first mentored paper, then compare separately the gains from male versus female protégés for male and female mentors. The downward-extending bar chart shows that mentors of both genders see fewer citations for papers with female protégés, and would seem to suggest there’s a bigger difference between the citations a female mentor gets for papers with a female versus male protégé than a male mentor does.

These associations are kind of interesting. The supplemental material includes a bunch of versions of the charts broken down by discipline and where the authors vary their definitions of senior versus junior and of impact, by way of arguing that the patterns are robust. Based purely on my own experience, I can buy that there’s less payoff in terms of citations from co-authoring with females; I’ve come to generally expect that my papers with males, whether they are my PhD students or collaborators, will get more citations. But to what extent are these associations redundant with known gender effects in citations? Could, for example, the fact that someone had a female mentor mean they are more likely to collaborate later in their career with females who, according to past studies, tend to receive fewer citations on papers where they are in prominent author positions? The measures here are noisy, making it hard to ascertain what might be driving them more specifically.

However, that doesn’t stop the authors from speculating what might be going on here:  

Our study … suggests that female protégés who remain in academia reap more benefits when mentored by males rather than equally-impactful females. The specific drivers underlying this empirical fact could be multifold, such as female mentors serving on more committees, thereby reducing the time they are able to invest in their protégés, or women taking on less recognized topics that their protégés emulate, but these potential drivers are out of the scope of current study.

Seems like the authors are exercising their permission to draw some causal inferences here, because, hey, as they implied above, everybody else is doing it. Serving on more committees seems like grasping at straws – I have no reason to believe that women don’t get asked to do more service, but it seems implausible that inequity in time spent on service could be extreme enough to affect the citation counts of their mentees years later, given all the variation in a dataset like this. The possibility of “women taking on less recognized topics“ seems less implausible (see for instance this linguistic analysis of nearly all US PhD-recipients and their dissertations across three decades). Though I’d prefer to be spared these speculations.

Our findings also suggest that mentors benefit more when working with male protégés rather than working with comparable female protégés, especially if the mentor is female. These conclusions are all deduced from careful comparisons between protégés who published their first mentored paper in the same discipline, in the same cohort, and at the very same institution. Having said that, it should be noted that there are societal aspects that are not captured by our observational data, and the specific mechanisms behind these findings are yet to be uncovered. One potential explanation could be that, historically, male scientists had enjoyed more privileges and access to resources than their female counterparts, and thus were able to provide more support to their protégés. Alternatively, these findings may be attributed to sorting mechanisms within programs based on the quality of protégés and the gender of mentors.

So, again we jump to the conclusion that because there are associations between lower citations and working with female mentors or protégés, women must be doing a worse job somehow? What set of reviewers felt comfortable with these sudden jumps to causal inference? The dataset used here has some value, and the associations are interesting as an exploratory analysis, but seriously, I would expect more from the undergrads or masters students I teach data science to. I’m with Sander Greenland here: what science often needs most from a study is its data, not the authors naively expounding on the implications.

Mister P for the 2020 presidential election in Belarus

An anonymous group of authors writes:

Political situation

Belarus is often called the “last dictatorship” in Europe, and rightly so: Aliaksandr Lukashenka has served as the country’s president since 1994. In the 26 years of his rule, Lukashenka has consolidated and extended his power, which is today absolute. Rigging referendums has been an effective means of consolidating power. His re-elections have been no better: he has claimed about 80% of the vote in all of them, while none has been acknowledged by the international community as free and fair. Lukashenka’s dictatorial rule seemed unshakeable a mere half a year ago. Right now, all of this is history, as Lukashenka is scrambling to prop up his regime under the stress of 100,000- to 250,000-strong protest rallies every weekend since August 9th. So what happened? In this post, we discuss a preprint that we wrote under the pseudonym Ales Zahorski to analyze the actual support levels for Lukashenka coming into the presidential election on August 9th, 2020.

The 2020 presidential campaign proved to be unique for Belarus in many ways: The nonchalant approach of President Aliaksandr Lukashenka to the Covid-19 pandemic caused voluntary civil engagement in countering the threat of Covid-19. In turn, this led to increased political activity, providing fertile soil for the emergence of new political leaders, some of whom became presidential contenders. These new political leaders did not come from the conventional opposition and they had no obvious orientation towards Russia or the West. In addition, the new opposition leaders came from different backgrounds and had experience from a wide variety of professional fields. Thus they appealed to a much broader audience than their earlier counterparts.

Lukashenka, however, eliminated the three strongest candidates from the presidential race. To his dismay, the teams of those candidates united around Sviatlana Tsikhanouskaya, the wife of Siarhei Tsikhanouski, who was the third most popular candidate according to Internet surveys (see Table 1). She registered as a stand-in for her husband after his arrest. The Central Electoral Committee (CEC), a puppet body meant to oversee elections, allowed her to enter the race, probably because Lukashenka did not consider her a real threat. The CEC also registered three representatives from the conventional opposition: Siarhei Cherachen, Andrei Dmitriyeu and Hanna Kanapatskaya. None of them had any visible support in the population according to the media polls (see Table 1). From the early stages of the 2020 presidential campaign, it was clear that the fairness of the election would be in question. Independent candidates were barred from entering local election committees, which hinted at planned ballot stuffing. The Belarusian Ministry of Foreign Affairs did not invite any credible international observers.

Sociology on political topics is banned

Since independent sociology and independent surveys are banned in Belarus, we had to be inventive in order to obtain data on the popularity of each presidential candidate. There are some online polls performed by the media (which were, as of June 1, 2020, also forbidden), but these cannot be trusted as they lack sound scientific rigour.

The absence of independent polling institutes and the extremely contradictory results coming from different sources provided the impetus for the current study. The results of media polls are summarized in Table 1, while the Ecoom (a company hired by Belarusian authorities) polls are presented in Table 2. As one can see, these polls contradict each other. Thus, we came up with an initiative to carry out a national poll, and based on these data, we used the multilevel regression with poststratification (MRP) methodology to estimate the popularity of each candidate. With this study, it was our sincere aim to provide a politically unbiased account of what the presidential election results would likely have been in a counterfactual world: a Belarus with free and fair elections.


We employed two different methods for polling: (1) an online poll using Viber, the most popular messenger application in Belarus; and (2) a street poll taking place at different locations across the country. The questionnaires contained questions about which candidate the respondents intended to vote for, as well as questions about the socio-economic and demographic status of the participants, including age, gender, education level, region of residence, and type of area of residence, corresponding to the national census data. The latter allowed us to employ poststratification. We further added questions of common research interest: there were two additional questions in both the Viber and the street surveys about the family’s total monthly income and whether the respondent was willing to participate in early voting.

The invitation to participate in the Viber poll was advertised in various communities on social media and was also sent via SMS to random Belarusian phone numbers (see details in the paper). As a result, we obtained around 45,000 answers. After disregarding answers from persons younger than 18 years old, people without Belarusian citizenship, and responses from phone numbers outside of Belarus, 32,108 answers were kept in the clean-up.

For the street poll, we aimed to collect at least 500 responses to cover all possible categories of citizens with respect to gender, age, region, and type of area of residence. We used the official annual report for 2019 from Belstat (National Statistical Committee of the Republic of Belarus) to calculate the representative size of the statistical group for each category surveyed. As a result, we collected 1,124 responses, providing decent representativeness of the Belarusian population as compared to the official Belstat census data. Demographic biases in the collected samples against the official 2009 census and 2019 annual report are presented in Figure 1.

After preprocessing the data from the Viber and street polls, we joined the two samples as follows. The filtered Viber sample was randomly divided into two halves. One half was kept as a holdout set for testing the predictive uncertainty handling of our MRP model, while the other was merged with the street sample into a training set, where the street data were uniformly upsampled to the size of 50% of the whole Viber sample. This preprocessing equalizes the importance of the street and Viber data in the training set, while keeping approximately the same amount of information as in the Viber poll data.

The scripts used for merging the data are implemented in R as part of the statistical modelling pipeline and are freely available on the GitHub page of the project.


In short, the methodology we employ involves building a statistical model that adjusts for the fact that our survey respondents are not representative of the population as a whole. By appropriately weighting the predictions of our multilevel regression model, we generalise from the sample to the entire population. The procedure is called multilevel regression with poststratification (MRP). The inference was performed in INLA.
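In its simplest form, the poststratification step weights the model's cell-level predictions by census cell counts: the population estimate is sum_j N_j * theta_j divided by sum_j N_j, where theta_j is the model's prediction for cell j. A toy sketch (the cells and all numbers are purely illustrative, not from the Belarus analysis):

```python
def poststratify(cell_preds, cell_counts):
    """Population-level estimate: census-count-weighted average of the
    model's predictions over the poststratification cells."""
    total = sum(cell_counts.values())
    return sum(cell_counts[c] * cell_preds[c] for c in cell_preds) / total

# Two hypothetical age-by-region cells with made-up predictions and counts.
preds  = {("18-30", "Minsk"): 0.10, ("60+", "Minsk"): 0.45}
counts = {("18-30", "Minsk"): 300_000, ("60+", "Minsk"): 200_000}
est = poststratify(preds, counts)
# (300000 * 0.10 + 200000 * 0.45) / 500000 = 0.24
```

The multilevel regression supplies stable estimates of theta_j even for sparsely sampled cells; the weighting then corrects for the unrepresentative sample composition.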

We also adopted several recently published advancements from Gao et al. [2020] to improve MRP. In particular, the random effects corresponding to the ordinal categorical predictors (age and education) are assumed to have a latent AR1 structure between the categories, while the other factors as well as the intercept term have an i.i.d. latent structure. Additionally, a latent Gaussian Besag-York-Mollié (BYM2) field is included in the model to account for the spatial dependence of the probabilities between the regions and the variance that is explained neither by the covariates nor by the common latent factors included in the random intercept.

We also employed model selection using criteria including WAIC and MLIK to compare the suggested model to the baselines. The baselines were models without a latent AR1 structure between the categories and additionally without BYM2. The model with both AR1 and BYM2 included was found optimal with respect to these criteria.


We found that the results of the election announced by CEC and the results of the pro-governmental BRSM (BRSM here stands for Belarusian Republican Youth Union) poll strongly disagree with the estimated pre-election ratings of the candidates, whilst the results of the independent polls are much more consistent with our estimated ratings. In particular, we found that both the officially announced results of the election and the officially reported early voting rates are improbable according to the estimates we obtained from the merged Viber and street poll data.

As shown in the following figure, both the officially announced results of the election and the officially reported early voting rates are highly improbable. With a probability of at least 95%, Sviatlana Tikhanouskaya's rating lies between 75% and 80%, Aliaksandr Lukashenka's rating lies between 13% and 18%, and the early voting rate predicted by the method ranges from 9% to 13% of those who took part in the election. These results contradict the officially announced figures of 10.12%, 80.11%, and 49.54% respectively, which lie far outside even the 99.9% credible intervals predicted by our model. The estimated ratings of the other candidates and of voting “Against all” are small and consistent with the official results. The same conclusions hold when comparing the pre-election ratings to the pro-governmental BRSM poll.

As shown below, the only groups of people where the upper bounds of the 99.9% credible intervals of the rating of Lukashenka predicted by MRP are above 50% are people older than 60 and uneducated people.

For all other subgroups, including rural residents, even the upper bounds of 99.9% credible intervals for Lukashenka are far below 50%. The same is true for the population as a whole. Thus, with a probability of at least 99.9%, as predicted by MRP, Lukashenka could not have had enough electoral support to win the 2020 presidential election in Belarus.

Criticism and our responses
Important assumptions that must hold for our conclusions to be valid are discussed by Daniel Simpson in his scientific blogpost:

Assumption 1: The demographic composition of the population is known.

Assumption 2: The people who did not answer the survey in subgroup j correspond to a random sample of subgroup j and to a random sample of the people who were asked.

Regarding Assumption 1, we used data from the 2009 Belstat census. We had to assume, however, that the demographics of Belarus have not changed significantly since then. In the first figure presented in this blogpost, we show this to be true at least marginally for four of the demographic variables (when compared to the 2019 annual report), but 2019 data on the fifth (education levels) is not yet available. Assumption 1 will also get an additional check when the results of the 2019 census in Belarus are published; we will then be able to restratify the results if significant demographic changes appear.
Regarding Assumption 2, we agree with Simpson that this sort of missing-at-random assumption is almost impossible to verify in practice. Simpson mentions that there are various things one can do to relax it, but generally this is the assumption we are making. It is quite likely met for the street survey. Nevertheless, there is room for several sources of bias: (1) selection of respondents by interviewers, i.e. a tendency to select more approachable or friendly-looking people, although we gave explicit instructions to select people at random; (2) response or refusal of respondents when approached (those in a hurry, those afraid to answer because of their pro-opposition views, and possibly pro-governmental respondents unwilling to answer because of their distrust in polls and other activities around the election); and (3) item non-response, i.e. respondents not answering specific questions (some respondents did not want to report their income levels). The net effect of (1)-(3) is, of course, unknown.

Validating Assumption 2 in the Viber poll is much more difficult. According to Simpson's blogpost, one option is to assess how well the prediction works on some left-out data in each subgroup. This is useful because poststratification explicitly estimates the response in the unobserved population. This viewpoint suggests that our goal is not necessarily unbiasedness but rather a good prediction of the population. It also means that if we can accept a reasonable bias, we get the benefit of much tighter credible bounds on the population quantity than survey weights can give. Hence, we return to the familiar bias-variance trade-off.

We have tried to approach this assumption from several perspectives. First, in the Viber poll, we sampled random phone numbers to invite respondents and advertised at various venues frequented by people with different demographic and political backgrounds.
Secondly, to obtain better results in the bias-variance trade-off sense and to assess the predictive properties of the underlying Bayesian regression, we uniformly upsampled the street data to 50% of the size of the Viber data and randomly divided the Viber data into two halves: one half was merged with the upsampled street data to form the training sample, while the other was left as a hold-out set to test the handling of predictive uncertainty via the modified Brier score introduced in the paper. Here, we aimed at reducing variance, possibly at the cost of some bias, and at testing the predictive qualities of the model.

Lastly, to assess and confirm our findings on the joint sample, we performed the same analysis with MRP fitted on the street data only. This analysis is much less likely to violate Assumption 2 above, but the sample is significantly smaller, so in the bias-variance trade-off sense we are likely to have significantly increased variation in the posterior distributions of the focus parameters. At the same time, it lets us validate the results obtained by MRP on the joint sample. The resulting posterior quantiles of interest are presented in Figure 4. In short, even though the level of uncertainty is significantly increased due to the reduced sample size, all of the conclusions are equivalent to those presented above for the MRP trained on the joint sample, though for some important conclusions the confidence level drops from 99.9% to 99% or 95%. Moreover, the 99.9%, 99%, 95%, and 90% credible intervals of the MRP trained on the joint sample are almost always inside the corresponding intervals obtained on the street data alone. This suggests that we obtained a very reasonable bias-variance trade-off on the joint data, corroborating the conclusions we drew from the joint sample.
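
The hold-out check can be sketched with an ordinary multiclass Brier score. Note this is a stand-in: the modified Brier score introduced in the paper is not reproduced here, and the candidate labels, predictive distribution, and observed outcome below are invented for illustration.

```python
# Hedged sketch of scoring predictions on held-out respondents, using
# the ordinary multiclass Brier score as a stand-in for the paper's
# modified version. All numbers are made up.

def brier(probs, outcome, classes):
    """Ordinary multiclass Brier score: mean squared difference between
    predicted probabilities and the one-hot observed outcome.
    Lower is better; a perfect forecast scores 0."""
    return sum((probs[c] - (1.0 if c == outcome else 0.0)) ** 2
               for c in classes) / len(classes)

classes = ["Tikhanouskaya", "Lukashenka", "other"]

# Hypothetical predictive distribution for one held-out respondent's
# demographic cell, and that respondent's observed answer.
predicted = {"Tikhanouskaya": 0.75, "Lukashenka": 0.15, "other": 0.10}
score = brier(predicted, "Tikhanouskaya", classes)
print(round(score, 4))  # ≈ 0.0317
```

Averaging such scores over all held-out respondents gives a single number summarizing how well the model's predictive uncertainty is calibrated on data it never saw.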

The full article is here. It contains some tables and graphs.

I have not checked this analysis myself, and of course all conclusions depend on assumptions, but I like the general approach of adjusting survey data in this way. Even if this analysis has its imperfections, it can be the starting point for further work and can motivate similar studies in other countries.

Is vs. ought in the study of public opinion: Coronavirus “opening up” edition

I came across this argument between two of my former co-bloggers which illustrates a general difficulty when thinking about political attitudes, which is confusion between two things: (a) public opinion, and (b) what we want public opinion to be.

This is something I’ve been thinking about for many years, ever since our Red State Blue State project.

Longtime blog readers might recall our criticism of political reporter Michael Barone, who told his readers that richer people voted for Democrats and poorer people voted for Republicans—even though the data showed the opposite. And Barone was a data guy! I coined a phrase, “second-order availability bias,” just to try to understand this way of thinking.

The latest example is the debate over how fast to open up the economy. A clear description that I’ve seen of the confusion comes in this op-ed by Michelle Goldberg, who writes:

Lately some commentators have suggested that the coronavirus lockdowns pit an affluent professional class comfortable staying home indefinitely against a working class more willing to take risks to do their jobs. . . . Writing in The Post, Fareed Zakaria tried to make sense of the partisan split over coronavirus restrictions, describing a “class divide” with pro-lockdown experts on one side and those who work with their hands on the other. . . . The Wall Street Journal’s Peggy Noonan wrote: “Here’s a generalization based on a lifetime of experience and observation. The working-class people who are pushing back have had harder lives than those now determining their fate.”

But, no, it seems that Zakaria and Noonan are wrong. Goldberg continues:

The assumptions underlying this generalization, however, are not based on even a cursory look at actual data. In a recent Washington Post/Ipsos survey, 74 percent of respondents agreed that the “U.S. should keep trying to slow the spread of the coronavirus, even if that means keeping many businesses closed.” Agreement was slightly higher — 79 percent — among respondents who’d been laid off or furloughed. . . .

Goldberg can also do storytelling:

Meatpacking workers have been sickened with coronavirus at wildly disproportionate rates, and all over the country there have been protests outside of meatpacking plants demanding that they be temporarily closed, sometimes by the workers’ own children. Perhaps because those demonstrators have been unarmed, they’ve received far less coverage than those opposed to lockdown orders. . . . Meanwhile, financial elites are eager for everyone else to resume powering the economy. . . . when it comes to the coronavirus, willingness to ignore public health authorities isn’t a sign of flinty working-class realism. Often it’s the ultimate mark of privilege.

OK, that’s just a story too. But I was curious about the people who Goldberg cited at the beginning of her article, who so confidently got things wrong. So I clicked on each story.

First, Zakaria. He does a David Brooks-style shtick, with lines like, “Imagine you are an American who works with his hands — a truck driver, a construction worker, an oil rig mechanic — and you have just lost your job because of the lockdowns, as have more than 36 million people. You turn on the television and hear medical experts, academics, technocrats and journalists explain that we must keep the economy closed — in other words, keep you unemployed — because public health is important. . . .”

In this riff, Zakaria is exhibiting a failure of imagination. He talks about truck drivers who want to go back to work, but not about meatpacking workers who don’t want to be exposed to coronavirus. He talks about various experts who want to “keep you unemployed” but does not talk about the financial elites, not to mention “academics, technocrats and journalists” such as himself who are eager to see everyone else get back to work—even though they can keep working from home as long as they want. Do they just miss going into the TV studio?

There’s also a gender dimension to Zakaria’s article, in that he listed about three stereotypically male occupations. In general, men are less concerned about health and safety than women are. So he’s stacking the deck by talking about truck drivers, construction workers, and oil rig mechanics, rather than, say, nurse’s aides, housecleaners, and preschool teachers.

Zakaria is making an error, imputing to lower-social-class Americans a desire to open up the economy even though the data don’t show this, and even though there are lots of logical reasons why comfortable work-at-home pundits could be just fine with opening up, given that they get to pick and choose when and where to go to work.

Next, Noonan. Unfortunately this link is paywalled, but I do see the sub-headline, “Those who are anxious to open up the economy have led harder lives than those holding out for safety.” Perhaps someone with a Wall Street Journal subscription can tell me what data she cites on this one.

It could be that Noonan is right and Goldberg is wrong here. Goldberg cited this one survey, but that’s just one survey, and it was from 27 Apr to 4 May, and opinions have surely changed since then. For now I’ll go with Goldberg’s take because she brought data to the table.

The analyst I really trust for this sort of thing is sociologist David Weakliem. Let’s go to his blog and see if he wrote anything on this . . . yeah! Here it is:

Some people have said that the coronavirus epidemic will bring Americans together, uniting us behind a goal that transcends political differences. It doesn’t seem to be working out that way–whether to ease restrictions has become a political issue, with Republicans more in favor of a quick end and Democrats more in favor of keeping restrictions. There have been some claims that it’s also a class issue. The more common version is that the “elites” can work at home, so they are happy to keep going on that way, but most ordinary people can’t, so they want to get back to work (see this article for an entertainingly unhinged example). But you could also argue it the other way—affluent people are getting fed up with online meetings, and tend to have jobs that would let them keep more space from their co-workers, so they want to get back to normal; less affluent people have jobs that would expose them to infection, so they want to stay safe. I couldn’t find individual-level data for any survey, but I did find one report that breaks opinions down by some demographic variables.

It’s a Washington Post – University of Maryland survey from 21-26 Apr.

Here’s what Weakliem found:

The most relevant question is “Do you think current restrictions on how restaurants, stores and other businesses operate in your state are appropriate, are too restrictive or are they not restrictive enough?”

Too restrictive Appropriate Not enough
Republicans 29% 60% 11%
Democrats 8% 72% 19%

Although majorities of both parties say (or said—the survey was April 21-26) they were appropriate, there is a pretty big difference.

By education:

College grads 15% 72% 12%
Others 18% 63% 18%

or restricting it to whites:

College grads 17% 72% 10%
Others 20% 64% 15%

To the extent there is a difference, it’s that less educated people are more likely to have “extreme” opinions of both kinds. Maybe that’s because more educated people tend to have more trust in the authorities. But basically, it’s not a major factor.

A few other variables: income is similar to education, with lower income people more likely to take both “extreme” positions; non-whites, women, and younger people more likely to say “not restrictive enough” and less likely to say “too restrictive”. All of those differences are considerably smaller than the party differences. Region and urban/rural residence seem relevant in principle, but aren’t included in the report.

Interesting about more/less educated or higher/lower-income people taking more extreme positions, which gives a slightly different twist on Zakaria and Noonan. As with red state blue state, pundits love talking about the working class, but many of the most intense battles are happening within the elite.

But I promised I’d talk with you about my former co-bloggers . . .

Here’s Robin Hanson from 5 May:

The public is feeling the accumulated pain, and itching to break out. . . . Elites are now loudly and consistently saying that this is not time to open; we must stay closed and try harder to contain. . . . So while the public will uniformly push for more opening, elites and experts push in a dozen different directions. . . . elites and experts don’t speak with a unified voice, while the public does.

This makes no sense to me. To the extent that the polls were capturing public opinion, the public was speaking with a uniform voice in favor of restrictions—the exact opposite of what Hanson was saying.

My guess is that Hanson was frustrated that “experts” and “elites” (in his words) did not agree with his opening-up policy preferences, so he was enlisting “the public” to be on his side. Unfortunately, the public did not hold his position either.

Hanson continues, “Many are reading me as claiming that the public is unified in the sense of agreeing on everything. But I only said that the public pushes will tend to be correlated in a particular direction, in contrast with the elite pushes which are much more diverse. Some also read me as claiming that strong majorities of the public support fast opening, but again that’s not what I said.” I can’t figure out what he’s getting at here. He said, “The public is feeling the accumulated pain, and itching to break out” . . . but the polls didn’t support that take. He also said that the public “speaks with a unified voice”—but, to the extent that was true, the voice was the opposite of what Hanson was saying. Maybe now things have changed and the public is more divided on their policy preferences regarding restrictions or openings—but, if so, that’s really the opposite of a unified voice.

Hanson also cites a couple of polls he did on twitter, but he uses these incoherently: first as evidence of the opinions of elites and experts, then as evidence of public opinion. I don’t think twitter polls really represent elite opinion, expert opinion, or public opinion, but I guess it all depends on who responds.

And here’s Henry Farrell from 5 May, saying pretty much what I said above, but in a more structured way:

There is indeed survey evidence to suggest that the public has strong preferences on re-opening. The problem is that that evidence (or, at least, the evidence that I am aware of), is that large majorities of people don’t want to reopen anytime soon. . . . the best empirical evidence I know of as to what individual members of the public want runs exactly contrary to the claims made by public choice scholars (who are presumably methodological individualists) about what the public wants.

It’s not clear to me that Farrell should be taking the blogs of two people (Robin Hanson and Tyler Cowen, who linked to Hansen’s post) as representative of “public choice scholars” more generally. But Farrell does acknowledge they may be “talking about ‘what the public will inevitably end up wanting in the long run as the costs of freezing much economic activity become clear.'” The trouble with this sort of in-the-long-term-the-public-will-agree-with-me attitude, as Farrell points out, is (a) people might agree with you for the wrong reasons (maybe for reasons of partisanship rather than policy), and (b) “the problem with such loosely expressed arguments about what ‘the public wants’ is that they’re likely to blur together ideological priors and empirical claims in a manner that makes them impossible to distinguish.”

I agree. This is the public-opinion version of the difficulties that arise when people make empirical statements without the data.

If you’re interested, Farrell and Hanson continue the discussion here and here. The conversation goes in a different direction than my focus here: Farrell is focusing on the whole public-choice thing and Hanson starts talking about how communism can’t work. Farrell might be wrong on the economics, but I think Hanson makes Farrell’s point for him on the public opinion question, pretty much admitting that the belief that “the public” agrees with him, despite what the polls might say, is based on his (Hanson’s) reading of economic theory.

All this does not say that Hanson’s economic analysis and policy preferences are wrong (or that they’re right). That’s a separate question from the study of public opinion, although public opinion is relevant to the question. If you claim to be in agreement with the people, it helps if the people are in agreement with you. Also, opinions can change.

Authors repeat same error in 2019 that they acknowledged and admitted was wrong in 2015

David Allison points to this story:

Kobel et al. (2019) report results of a cluster randomized trial examining the effectiveness of the “Join the Healthy Boat” kindergarten intervention on BMI percentile, physical activity, and several exploratory outcomes. The authors pre-registered their study and described the outcomes and analysis plan in detail previously, which are to be commended. However, we noted four issues that some of us recently outlined in a paper on childhood obesity interventions: 1) ignoring clustering in studies that randomize groups of children, 2) changing the outcomes, 3) emphasizing results that were statistically significant from a host of analyses, and 4) using self-reported outcomes that are part of the intervention.

First and most critically, the statistical analyses reported in the article were inadequate and deviated from the analysis plan in the study’s methods article – an error the authors are aware of and had acknowledged after some of us identified it in one of their prior publications about this same program. . . .

Second, the authors switched their primary and secondary outcomes from their original plan. . . .

Third, while the authors focus on an effect of the intervention of p ≤ 0.04 in the abstract, controlling for migration background in their full model raised this to p = 0.153. Because inclusion or exclusion of migration background does not appear to be a pre-specified analytical decision, this selective reporting in the abstract amounts to spinning of the results to favor the intervention.

Fourth, “physical activity and other health behaviours … were assessed using a parental questionnaire.” Given that these variables were also part of the intervention itself, with the control having “no contact during that year,” subjective evaluation may have resulted in differential, social-desirability bias, which may be of particular concern in family research. Although the authors mention this in the limitations, the body of literature demonstrating the likelihood of these biases invalidating the measurements raises the question of whether they should be used at all.
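
To see why the first criticism (ignoring clustering) matters, here is a small simulation sketch with hypothetical numbers, not taken from the paper: children within the same kindergarten share a cluster effect, and a naive analysis that treats them as independent rejects a true null hypothesis far more often than the nominal 5% of the time.

```python
# Simulation sketch: ignoring clustering in a cluster-randomized trial
# inflates false-positive rates. All numbers are hypothetical.
import math
import random

random.seed(1)

def one_trial(n_clusters=10, n_per=20, cluster_sd=1.0, noise_sd=1.0):
    """Simulate one null trial (no true effect) with two arms of
    n_clusters kindergartens, n_per children each, and return the naive
    two-sample z statistic that treats children as independent."""
    arms = []
    for _ in range(2):
        vals = []
        for _ in range(n_clusters):
            cluster_effect = random.gauss(0, cluster_sd)  # shared within cluster
            vals += [cluster_effect + random.gauss(0, noise_sd)
                     for _ in range(n_per)]
        arms.append(vals)
    a, b = arms
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((x - ma) ** 2 for x in a) / (n - 1)
    vb = sum((x - mb) ** 2 for x in b) / (n - 1)
    return (ma - mb) / math.sqrt(va / n + vb / n)

# Fraction of null trials the naive analysis calls "significant" at 5%.
sims = 1000
rejections = sum(abs(one_trial()) > 1.96 for _ in range(sims))
rate = rejections / sims
print(rate)  # far above the nominal 0.05
```

With these settings the naive test rejects roughly half the time despite there being no true effect, which is why an analysis that accounts for the clustered randomization (e.g. a mixed model or cluster-robust standard errors) is essential.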

This is a big deal. The authors of the cited paper knew about these problems—to the extent of previously acknowledging them in print—but then did them again.

The authors did this thing of making a strong claim and then hedging it in their limitations. That’s bad. From the abstract of the linked paper:

Children in the IG [intervention group] spent significantly more days in sufficient PA [physical activity] than children in the CG [control group] (3.1 ± 2.1 days vs. 2.5 ± 1.9 days; p ≤ 0.005).

Then, deep within the paper:

Nonetheless, this study is not without limitations, which need to be considered when interpreting these results. Although this study has an acceptable sample size and body composition and endurance capacity were assessed objectively, the use of subjective measures (parental report) of physical activity and the associated recall biases is a limitation of this study. Furthermore, participating in this study may have led to an increased social desirability and potential over-reporting bias with regards to the measured variables as awareness was raised for the importance of physical activity and other health behaviours.

This is a limitation that the authors judge to be worth mentioning in the paper but not in the abstract or in the conclusion, where the authors write that their intervention “should become an integral part of all kindergartens” and is “ideal for integrating health promotion more intensively into the everyday life of children and into the education of kindergarten teachers.”

The point here is not to slam this particular research paper but rather to talk about a general problem with science communication, involving over-claiming of results and deliberate use of methods that are problematic but offer the short-term advantage of allowing researchers to make stronger claims and get published.

P.S. Allison follows up by pointing to this Pubpeer thread.