
AnnoNLP conference on data coding for natural language processing

This workshop should be really interesting:

Silviu Paun and Dirk Hovy are co-organizing it. They’re very organized and know this area as well as anyone. I’m on the program committee, but won’t be able to attend.

I really like the problem of crowdsourcing. Especially for machine learning data curation. It’s a fantastic problem that admits of really nice Bayesian hierarchical models (no surprise to this blog’s audience!).

The rest of this note’s a bit more personal, but I’d very much like to see others adopting similar plans for the future for data curation and application.

The past

Crowdsourcing is near and dear to my heart as it’s the first serious Bayesian modeling problem I worked on. Breck Baldwin and I were working on crowdsourcing for applied natural language processing in the mid 2000s. I couldn’t quite figure out a Bayesian model for it by myself, so I asked Andrew if he could help. He invited me to the “playroom” (a salon-like meeting he used to run every week at Columbia), where he and Jennifer Hill helped me formulate a crowdsourcing model.

As Andrew likes to say, every good model was invented decades ago for psychometrics, and this one’s no different. Phil Dawid had formulated exactly the same model (without the hierarchical component) back in 1979, estimating parameters with EM (itself only published in 1977). The key idea is treating the crowdsourced data like any other noisy measurement. Once you do that, it’s just down to details.

Part of my original motivation for developing Stan was to have a robust way to fit these models. Hamiltonian Monte Carlo (HMC) only handles continuous parameters, so like in Dawid’s application of EM, I had to marginalize out the discrete parameters. This marginalization’s the key to getting these models to sample effectively. Sampling discrete parameters that can be marginalized is a mug’s game.
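To make the marginalization concrete, here is a minimal Python sketch (not the Stan code from any of the papers mentioned here). It assumes a Dawid and Skene style setup: each item has an unknown true category, and each annotator has a confusion matrix giving the probability of each reported label given the true category. The function names, the prevalence vector, and the confusion matrices are all made up for illustration. Rather than sampling the discrete true category, the code sums it out on the log scale, and the same terms give the posterior over the true category for free.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-scale values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def item_log_terms(labels, prevalence, confusion):
    """Return log p(true = k) + sum_j log p(label_j | true = k) for each k.

    labels:     list of (annotator, observed_label) pairs for one item
    prevalence: prevalence[k] = p(true category is k)   (hypothetical values)
    confusion:  confusion[j][k][y] = p(annotator j says y | true category k)
    """
    terms = []
    for k, prev_k in enumerate(prevalence):
        lp = math.log(prev_k)
        for annotator, label in labels:
            lp += math.log(confusion[annotator][k][label])
        terms.append(lp)
    return terms

def item_log_marginal(labels, prevalence, confusion):
    """Log likelihood of one item's labels with the true category summed out."""
    return log_sum_exp(item_log_terms(labels, prevalence, confusion))

def item_posterior(labels, prevalence, confusion):
    """Posterior over the true category, recovered from the same terms."""
    terms = item_log_terms(labels, prevalence, confusion)
    total = log_sum_exp(terms)
    return [math.exp(t - total) for t in terms]

# Illustrative numbers only: two categories, three annotators of varying accuracy.
prevalence = [0.7, 0.3]
confusion = [
    [[0.9, 0.1], [0.1, 0.9]],  # annotator 0: 90% accurate
    [[0.8, 0.2], [0.2, 0.8]],  # annotator 1: 80% accurate
    [[0.6, 0.4], [0.4, 0.6]],  # annotator 2: 60% accurate
]
labels = [(0, 1), (1, 1), (2, 0)]  # two better annotators say 1, the weakest says 0

log_marginal = item_log_marginal(labels, prevalence, confusion)
posterior = item_posterior(labels, prevalence, confusion)
print(log_marginal, posterior)
```

In a full model the prevalence and confusion matrices would be parameters with priors (hierarchical ones, ideally), and the log marginal for each item would be added to the log density. The sketch only shows the summing-out step for fixed parameter values.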

The present

Coming full circle, I co-authored a paper with Silviu and Dirk recently, Comparing Bayesian models of annotation, that reformulated and evaluated a bunch of these models in Stan.

Editorial Aside: Every field should move to journals like TACL. Free to publish, fully open access, and roughly one month turnaround to first decision. You have to experience journals like this in action to believe it’s possible.

The future

I want to see these general techniques applied to creating probabilistic corpora, to online adaptive training data (aka active learning), to joint corpus inference and model training (a la Raykar et al.’s models), and to evaluation.

P.S. Cultural consensus theory

I’m not the only one who recreated Dawid and Skene’s model. It’s everywhere these days.

I recently discovered an entire literature dating back decades on cultural consensus theory, which uses very similar models (I’m pretty sure either Lauren Kennedy or Duco Veen pointed out the literature). The authors go more into the philosophical underpinnings of the notion of consensus driving these models (basically the underlying truth of which you are taking noisy measurements). One neat innovation in the cultural consensus theory literature is a mixture model of truth—you can assume multiple subcultures are coding the data with different standards. I’d thought of mixture models of coders (say experts, Mechanical Turkers, and undergrads), but not of the truth.

In yet another small world phenomenon, right after I discovered cultural consensus theory, I saw a cello concert organized through Groupmuse by a social scientist at NYU I’d originally met through a mutual friend of Andrew’s. He introduced the cellist, Iona Batchelder, and added as an aside she was the daughter of well known social scientists. Not just any social scientists, the developers of cultural consensus theory!

Another Regression Discontinuity Disaster and what can we learn from it

As the above image from Diana Senechal illustrates, a lot can happen near a discontinuity boundary.

Here’s a more disturbing picture, which comes from a recent research article, “The Bright Side of Unionization: The Case of Stock Price Crash Risk,” by Jeong-Bon Kim, Eliza Xia Zhang, and Kai Zhong:

which I learned about from the following email:

On Jun 18, 2019, at 11:29 AM, ** wrote:

Hi Professor Gelman,

This paper is making the rounds on social media:

Look at the RDD in Figure 3 [the above two graphs]. It strikes me as pretty weak and reminds me a lot of your earlier posts on the China air pollution paper. Might be worth blogging about?

If you do, please don’t cite this email or my email address in your blog post, as I would prefer to remain anonymous.

Thank you,

This anonymity thing comes up pretty often—it seems that there’s a lot of fear regarding the consequences of criticizing published research.

Anyway, yeah this is bad news. The discontinuity at the boundary looks big and negative, in large part because the fitted curves have a large positive slope in that region, which in turn seems to be driven by action on the boundary of the graph which is essentially irrelevant to the causal question being asked.

It’s indeed reminiscent of this notorious example from a few years ago:


And, as before, it’s stunning not just that the researchers made this mistake—after all, statistics is hard, and we all make mistakes—but that they could put a graph like the ones above directly into their paper and not realize the problem.

This is not a case of the chef burning the steak and burying it in a thick sauce. It’s more like the chef taking the burnt slab of meat and serving it with pride—not noticing its inedibility because . . . the recipe was faithfully applied!

What happened?

Bertrand Russell has this great quote, “This is one of those views which are so absurd that only very learned men could possibly adopt them.” On the other hand, there’s this from George Orwell: “To see what is in front of one’s nose needs a constant struggle.”

The point is that the above graphs are obviously ridiculous—but all these researchers and journal editors didn’t see the problem. They’d been trained to think that if they followed certain statistical methods blindly, all would be fine. It’s that all-too-common attitude that causal identification plus statistical significance equals discovery and truth. Not realizing that both causal identification and statistical significance rely on lots of assumptions.

The estimates above are bad. They can either be labeled as noisy (because the discontinuity of interest is perturbed by this super-noisy curvy function) or as biased (because in the particular case of the data the curves are augmenting the discontinuity by a lot). At a technical level, these estimates give overconfident confidence intervals (see this paper with Zelizer and this one with Imbens), but you hardly need all that theory and simulation to see the problem—just look at the above graphs without any ideological lenses.

Ideology—statistical ideology—is important here, I think. Researchers have this idea that regression discontinuity gives rigorous causal inference, and that statistical significance gives effective certainty, and that the rest is commentary. These attitudes are ridiculous, but we have to recognize that they’re there.

The authors do present some caveats but these are a bit weak for my taste:

Finally, we acknowledge the limitations of the RDD and alert readers to be cautious when generalizing our inferences in different contexts. The RDD exploits the local variation in unionization generated by union elections and compares crash risk between the two distinct samples of firms with the close-win and close-loss elections. Thus, it can have strong local validity, but weak external validity. In other words, the negative impact of unionization on crash risk may be only applicable to firms with vote shares falling in the close vicinity of the threshold. It should be noted, however, that in the presence of heterogeneous treatment effect, the RDD estimate can be interpreted as a weighted average treatment effect across all individuals, where the weights are proportional to the ex ante likelihood that the realized assignment variable will be near the threshold (Lee and Lemieux 2010). We therefore reiterate the point that “it remains the case that the treatment effect estimated using a RD design is averaged over a larger population than one would have anticipated from a purely ‘cutoff’ interpretation” (Lee and Lemieux 2010, 298).

I agree that generalization is a problem, but I’m not at all convinced that what they’ve found applies even to their data. Again, a big part of their negative discontinuity estimate is coming from that steep up-sloping curve which seems like nothing more than an artifact. To say it another way: including that quadratic curve fit adds a boost to the discontinuity which then pulls it over the threshold to statistical significance. It’s a demonstration of how bias and coverage problems work together (again, see my paper with Guido for more on this).
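Here is a rough sketch of why the curvy fits are so noisy at the boundary. It is not a reanalysis of the paper’s data. It assumes evenly spaced points on one side of the cutoff, independent errors with constant variance, and an ordinary least-squares polynomial fit, and it computes the factor x0'(X'X)^{-1}x0 that scales the variance of the fitted value at the cutoff. The helper names and the choice of 300 points are just for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x

def boundary_variance_factor(xs, degree):
    """Variance of the fitted value at x = 0, in units of the error variance.

    For a polynomial of the given degree fit by least squares to points xs,
    this is the (0, 0) entry of (X'X)^{-1}, i.e. the intercept's variance factor.
    """
    p = degree + 1
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(p)] for i in range(p)]
    e0 = [1.0] + [0.0] * degree
    return solve(xtx, e0)[0]

# Hypothetical design: 300 evenly spaced points on one side of a cutoff at 0.
n = 300
xs = [(i + 0.5) / n for i in range(n)]

factors = {d: boundary_variance_factor(xs, d) for d in range(4)}
# Standard deviation of the boundary estimate relative to a simple average.
relative_sd = {d: (factors[d] / factors[0]) ** 0.5 for d in factors}
for d in sorted(relative_sd):
    print(d, round(relative_sd[d], 2))
```

Under these assumptions the standard deviation of the fitted value at the cutoff grows roughly in proportion to the polynomial degree plus one, and the estimated discontinuity inherits that noise from both sides. That is the sense in which a quadratic or cubic fit can manufacture a large jump out of not much at all.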

This is not to say that the substantive conclusions of the article are wrong. I have no idea. All I’m saying is that the evidence is not as strong as is claimed. And also I’m open to the possibility that the substantive truth is the opposite of what is claimed in the article. Also don’t forget that, even had the discontinuity analysis not had this problem—even if there was a clear pattern in the data that didn’t need to be pulled out by adding that upward-sloping curve—we’d still only be learning about these two particular measures that are labeled as stock price crash risk.

How to better analyze these data?

To start with, I’d like to see a scatterplot. According to the descriptive statistics there are 687 data points, so the above graph must be showing binned averages or something like that. Show me the data!

Next, accept that this is an observational study, comparing companies that did or did not have unions. These two groups of companies differ in many ways, one of which is the voter share in the union election. But there are other differences too. Throwing them all in a regression will not necessarily do a good job of adjusting for all these variables.

The other thing I don’t really follow are their measures of stock price crash risk. These seem like pretty convoluted definitions; there must be lots of ways to measure this, at many time scales. This is a problem with the black-box approach to causal inference, but I’m not sure how this aspect of the problem could be handled better. The trouble is that stock prices are notoriously noisy, so it’s not like you could have a direct model of unionization affecting the prices—even beyond the obvious point that unionization, or the lack thereof, will have different effects in different companies. But if you go black-box and look at some measure of stock prices as an outcome, then the results could be sensitive to how and when you look at them. These particular measurement issues are not our first concern here—as the above graphs demonstrate, the estimation procedure being used here is a disaster—but if you want to study the problem more seriously, I’m not at all clear that looking at stock prices in this way will be helpful.

Larger lessons

Again, I’d draw a more general lesson from this episode, and others like it, that when doing science we should be aware of our ideologies. We’ve seen so many high-profile research articles in the past few years that have had such clear and serious flaws. On one hand it’s a social failure: not enough eyes on each article, nobody noticing or pointing out the obvious problems.

But, again, I also blame the reliance on canned research methods. And I blame pseudo-rigor, the idea that some researchers have that their proposed approach is automatically correct. And, yes, I’ve seen that attitude among Bayesians too. Rigor and proof and guarantee are fine, and they all come with assumptions. If you want the rigor, you need to take on the assumptions. Can’t have one without the other.

Finally, in case there’s a question that I’m being too harsh on an unpublished paper: If the topic is important enough to talk about, it’s important enough to criticize. I’m happy to get criticisms of my papers, published and unpublished. Better to have mistakes noticed sooner rather than later. And, sure, I understand that the authors may well have followed the rules as they understood them, and it’s too bad that resulted in bad work. Kind of like if I was driving along a pleasant country road at the speed limit of 30 mph and then I turned a corner and slammed into a brick wall. It’s really not my fault, it’s whoever put up the damn 30 mph sign. But my car will still be totaled. In the above post, I’m blaming the people who put up the speed limit sign (including me, in that in our textbooks our colleagues and I aren’t always so clear on how our methods can go wrong).

We should be open-minded, but not selectively open-minded.

I wrote this post awhile ago but it just appeared . . .

I liked this line so much I’m posting it on its own:

We should be open-minded, but not selectively open-minded.

This is related to the research incumbency effect and all sorts of other things we’ve talked about over the years.

There’s a Bayesian argument, or an implicitly Bayesian argument for believing everything you read in the tabloids, and the argument goes as follows: It’s hard to get a paper published, papers in peer-reviewed journals typically really do go through the peer review process, so the smart money is to trust the experts.

This believe-what-you-read heuristic is Bayesian, but not fully Bayesian: it does not condition on new information. The argument against Brian Wansink’s work is not that it was published in the journal Environment and Behavior. The argument against it is that the work has lots of mistakes, and then you can do some partial pooling, looking at other papers by this same author that had lots of mistakes.

Asymmetric open-mindedness—being open to claims published in scientific journals and publicized on NPR, Ted, etc., while not at all being open to their opposites—is, arguably, a reasonable position to take. But this position is only reasonable before you look carefully at the work in question. Conditional on that careful look, the fact of publication provides much less information.

To put it another way, defenders of junk science, and even people who might think of themselves as agnostic on the issue, are making the fallacy of the one-sided bet.

Here’s an example.

Several years ago, the sociologist Satoshi Kanazawa claimed that beautiful parents were more likely to have girl babies. This claim was reproduced by the Freakonomics team. It turns out that the underlying statistical analysis was flawed, and what was reported was essentially patterns in random numbers (the kangaroo problem).

So, fine. At this point you might say: Some people believe that beautiful parents are more likely to have girl babies, while other people are skeptical of that claim. As an outsider, you might take an intermediate position (beautiful parents might be more likely to have girl babies), and you could argue that Kanazawa’s work, while flawed, might still be valuable by introducing this hypothesis.

But that would be a mistake; you’d be making the fallacy of the one-sided bet. If you want to consider the hypothesis that beautiful parents are more likely to have girl babies, you should also consider the hypothesis that beautiful parents are more likely to have boy babies. If you don’t consider both possibilities, you’re biasing yourself—and you’re also giving an incentive for future Wansinks to influence policy through junk science.

P.S. I also liked this line that I gave in response to someone who defended Brian Wansink’s junk science on the grounds that “science has progressed”:

To use general scientific progress as a way of justifying scientific dead-end work . . . that’s kinda like saying that the Bills made a good choice to keep starting Nathan Peterman, because Patrick Mahomes has been doing so well.

A problem I see is that the defenders of junk science are putting themselves in the position where they’re defending Science as an entity.

And, if we really want to get real, let’s be open to the possibility that the effect is positive for some people in some scenarios, and negative for other people in other scenarios, and that in the existing state of our knowledge, we can’t say much about where the effect is positive and where it is negative.

Javier Benitez points us to this op-ed, “Massaging data to fit a theory is not the worst research sin,” where philosopher Martin Cohen writes:

The recent fall from grace of the Cornell University food marketing researcher Brian Wansink is very revealing of the state of play in modern research.

Wansink had for years embodied the ideal to which all academics aspire: innovative, highly cited and media-friendly.

I would just like to briefly interrupt to note that not all academics aspire to be media-friendly. I have that aspiration myself, and of course people who aspire to be media-friendly are overrepresented in the media—but I’ve met lots of academics who’d prefer to be left in peace and quiet to do their work and communicate just with specialists and students.

But that’s not the key point here. So let me continue quoting Cohen:

[Wansink’s] research, now criticised as fatally flawed, included studies suggesting that people who go grocery shopping while hungry buy more calories, that pre-ordering lunch can help you choose healthier food, and that serving people out of large bowls leads them to eat larger portions.

Such studies have been cited more than 20,000 times and even led to an appearance on The Oprah Winfrey Show [and, more to the point, the spending of millions of dollars of government money! — ed.]. But Wansink was accused of manipulating his data to achieve more striking results. Underlying it all is a suspicion that he was in the habit of forming hypotheses and then searching for data to support them. Yet, from a more generous perspective, this is, after all, only scientific method.

Behind the criticism of Wansink is a much broader critique not only of his work but of a certain kind of study: one that, while it might have quantitative elements, is in essence ethnographic and qualitative, its chief value being in storytelling and interpretation. . . .

We forget too easily that the history of science is rich with errors. In a dash to claim glory before Watson and Crick, Linus Pauling published a fundamentally incoherent hypothesis that the structure of DNA was a triple helix. Lord Kelvin misestimated the age of the Earth by more than an order of magnitude. In the early days of genetics, Francis Galton introduced an erroneous mathematical expression for the contributions of different ancestors to individuals’ inherited traits. We forget because these errors were part of broader narratives that came with brilliant insights.

I accept that Wansink may have been guilty of shoehorning data into preconceived patterns – and in the process may have mixed up some of the figures too. But if the latter is unforgivable, the former is surely research as normal.

Let me pause again here. If all that happened is that Wansink “may have mixed up some of the figures,” then this is not “unforgivable” at all. We all “mix up some of the figures” from time to time (here’s an embarrassing example from my own published work), and nobody who does creative work is immune from “shoehorning data into preconceived patterns.”

For some reason, Cohen seems to be on a project to minimize Wansink’s offenses. So let me spell it out. No, the problem with the notorious food researcher is not that he “may have mixed up some of the figures.” First, he definitely—not “may have”—mixed up many—not “some”—of his figures. We know this because many of his figures contradicted each other, and others made no sense (see, for example, here for many examples). Second, Wansink bobbed and weaved, over the period of years denying problems that were pointed out to him from all sorts of different directions.

Cohen continues:

The critics are indulging themselves in a myth of neutral observers uncovering “facts”, which rests on a view of knowledge as pristine and eternal as anything Plato might have dreamed of.

It is thanks to Western philosophy that, for thousands of years, we have believed that our thinking should strive to eliminate ideas that are vague, contradictory or ambiguous. Today’s orthodoxy is that the world is governed by iron laws, the most important of which is if P then Q. Part and parcel of this is a belief that the main goal of science is to provide deterministic – cause and effect – rules for all phenomena. . . .

Here I think Cohen’s getting things backward! It’s Wansink’s critics who have repeatedly stated that the world is complicated and that we should be wary of taking misreported data from 97 people in a diner to make general statements about eating behavior, men’s and women’s behaviors, nutrition policy, etc.

Contrariwise, it was Wansink and his promoters who were making general statements, claiming to have uncovered facts about human nature, etc.

Cohen continues a few paragraphs later:

Plato attempted to avoid contradictions by isolating the object of inquiry from all other relationships. But, in doing so, he abstracted and divorced those objects from a reality that is multi-relational and multitemporal. This same artificiality dogs much research.

Exactly! Wansink, like all of us, is subject to the Armstrong Principle (“If you promise more than you can deliver, then you have an incentive to cheat.”). Most scholars, myself included, are scaredy-cats: in order to avoid putting ourselves in a Lance Armstrong situation, we’re careful to underpromise. Wansink, though, he overpromised, presenting his artificial research as yielding general truths.

In short, we, the critics of Wansink and other practitioners of cargo-cult science, are on Cohen’s side. We’re the ones who are trying to express scientific method in a way that respects the disconnect between experiment and real world.

Cohen concludes:

Even if the quantitative elements don’t convince and need revising, studies like Wansink’s can be of value if they offer new clarity in looking at phenomena, and stimulate ideas for future investigations. Such understanding should be the researcher’s Holy Grail.

After all, according to the tenets of our current approach to facts and figures, much scientific endeavour of the past amounted to wasted effort, in fields with absolutely no yield of true scientific information. And yet science has progressed.

I don’t get the logic here. “Much endeavour amounted to wasted effort . . . And yet science has progressed.” Couldn’t it be that, to a large extent, the wasted effort and the progress have been done by different people, in different places?

To use general scientific progress as a way of justifying scientific dead-end work . . . that’s kinda like saying that the Bills made a good choice to keep starting Nathan Peterman, because Patrick Mahomes has been doing so well.

Who cares?

So what? Why keep talking about this pizzagate? Because I think misconceptions here can get in the way of future learning.

Let me state the situation as plainly as possible, without any reference to this particular case:

Step 1. A researcher performs a study that gets published. The study makes big claims and gets lots of attention, both from the news media and from influential policymakers.

Step 2. Then it turns out that (a) the published work was seriously flawed, and the published claims are not supported by the data being offered in their support: the claims may be true, in some ways, but no good evidence has been given; (b) other published studies that appear to show confirmation of the original claim have their own problems; and (c) statistical analysis shows that it is possible that the entire literature is chasing noise.

Step 3. A call goes out to be open-minded: just because some of these studies did not follow ideal scientific practices, we should not then conclude that their scientific claims are false.

And I agree with Step 3. But I’ve said it before and I’ll say it again: We should be open-minded, but not selectively open-minded.

Suppose the original claim is X, but the study purporting to demonstrate X is flawed, and the follow-up studies don’t provide strong evidence for X either. Then, of course we should be open to the possibility that X remains true (after all, for just about any hypothesis X there is always some qualitative evidence and some theoretical arguments that can be found in favor of X), and we should also be open to the possibility that there is no effect (or, to put it more precisely, an effect that is in practice indistinguishable from zero). Fine. But let’s also be open to the possibility of “minus X”; that is, the possibility that the posited intervention is counterproductive. And, if we really want to get real, let’s be open to the possibility that the effect is positive for some people in some scenarios, and negative for other people in other scenarios, and that in the existing state of our knowledge, we can’t say much about where the effect is positive and where it is negative. Let’s show some humility about what we can claim.
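A toy calculation can show what being open to “minus X” looks like in numbers. This is only a sketch under simple assumptions: a normal prior on the effect centered at zero (so positive and negative effects are equally plausible beforehand), and a single noisy estimate with a known standard error. The particular numbers (an estimate of 4 with standard error 2, and a prior standard deviation of 1) are hypothetical and chosen only because the estimate would count as “statistically significant.”

```python
from statistics import NormalDist

def posterior_for_effect(estimate, std_error, prior_sd):
    """Normal-normal update with a prior centered at zero.

    Returns the posterior mean and standard deviation of the effect.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / std_error ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * data_precision * estimate
    return post_mean, post_var ** 0.5

# Hypothetical inputs: estimate 4, standard error 2, prior sd 1.
post_mean, post_sd = posterior_for_effect(estimate=4.0, std_error=2.0, prior_sd=1.0)

# Posterior probability that the true effect has the opposite sign.
prob_opposite_sign = NormalDist(post_mean, post_sd).cdf(0.0)
print(post_mean, post_sd, prob_opposite_sign)
```

With these made-up inputs the posterior mean shrinks from 4 to 0.8, and there is a bit under a one-in-five chance that the effect goes the other way. The exact figure depends entirely on the assumed prior, which is the point: once the evidence is weak, “minus X” stays on the table.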

Accepting uncertainty does not mean that we can’t make decisions. After all, we were busy making decisions about topic X, whatever it was, before we had any data at all—so we can keep making decisions on a case-by-case basis using whatever information and hunches we have.

Here are some practical implications. First, if we’re not sure of the effect of an intervention, maybe we should think harder about costs, including opportunity costs. Second, it makes sense to gather information about what’s happening locally, to get a better sense of what the intervention is doing.

All the work that you haven’t heard of

The other thing I want to bring up is the selection bias involved in giving the benefit of the doubt to weak claims that happen to have received positive publicity. One big big problem here is that there are lots of claims in all sorts of directions that you haven’t heard about, because they haven’t appeared on Oprah, or NPR, or PNAS, or Freakonomics, or whatever. By the same logic as Cohen gives in the above-quoted piece, all those obscure claims also deserve our respect as “of value if they offer new clarity in looking at phenomena, and stimulate ideas for future investigations.” The problem is that we’re not seeing all that work.

As I’ve also said on various occasions, I have no problem when people combine anecdotes and theorizing to come up with ideas and policy proposals. My problem with Wansink is not that he had interesting ideas without strong empirical support: that happens all the time. Most of our new ideas don’t have strong empirical support, in part because ideas with strong empirical support tend to already exist so they won’t be new! No, my problem with Wansink is that he took weak evidence and presented it as if it were strong evidence. For this discussion, I don’t really care if he did this by accident or on purpose. Either way, now we know he had weak evidence, or no evidence at all. So I don’t see why his conjectures should be taken more seriously than any other evidence-free conjectures. Let a zillion flowers bloom.

A supposedly fun thing I definitely won’t be repeating (A Pride post)

“My friends and I don’t wanna be here if this isn’t an actively trans-affirming space. I’m only coming if all my sisters can.” – I have no music for you today, sorry. But I do have an article about cruise ships 

(This is obviously not Andrew)

A Sunday night quickie post, from the tired side of Toronto’s Pride weekend. It’s also Pride month, and it’s 50 years on Friday since the Stonewall riots, which were a major event in LGBT+ rights activism in the US and across the world. Stan has even gone rainbow for the occasion. (And many thanks to the glorious Michael Betancourt who made the badge.)

This is a great opportunity for a party and to see Bud Light et al. pretend they care deeply about LGBTQIA+ people. But really it should also be a time to think about how open workplaces, departments, universities, conferences, and any other places of work are to people who are lesbian, gay, bisexual, transgender, non-binary, two-spirit, gender non-conforming, intersex, or who otherwise lead lives (or wish to lead lives) that lie outside the cisgender, straight world that the majority occupies. People who aren’t spending a bunch of time trying to hide aspects of their life are usually happier and healthier and better able to contribute to things like science than those who are.

Which I guess is to say that diversity is about a lot more than making sure that there aren’t zero women as invited speakers. (Or being able to say “we invited women but they all said no”.) Diversity is about racial and ethnic diversity, diversity of gender, active and meaningful inclusion of disabled people, diversity of sexuality, intersections of these identities, and so much more. It is not an accounting game (although zero is still a notable number).

And regardless of how many professors or style guides or blogposts tell you otherwise, there is no single gold standard absolute perfect way to deliver information. Bring yourself to your delivery. Be gay. Be femme. Be masc. Be boring. Be sports obsessed. Be from whatever country and culture you are from. We can come along for the journey. And people who aren’t willing to are not worth your time.

Anyway, I said a pile of words that aren’t really about this but are about this for a podcast, which if you have not liked the previous three paragraphs you will definitely not enjoy. Otherwise I’m about 17 mins in (but the story about the alligators is also awesome.) If you do not like adult words, you definitely should not listen.

In the spirit of Pride month please spend some time finding love for and actively showing love to queer and trans folk. And for those of you in the UK especially (but everywhere else as well), please work especially hard to affirm and love and care for and support Trans* people who are under attack on many fronts. (Not least the recent rubbish about how being required to use people’s correct names and pronouns is somehow an affront to academic freedom, as if using the wrong pronoun or name for a student or colleague is an academic position.)

And should you find yourself with extra cash, you can always support someone like Rainbow Railroad. Or your local homeless or youth homeless charity. Or your local sex worker support charity like SWOP Behind Bars or the Sex Workers Project from the Urban Justice Centre. (LGBTQ+ people have much higher rates of homelessness [especially youth homelessness] and survival sex work than straight and cis people.)

Anyway, that’s enough for now. (Or nowhere near enough ever, but I’ve got other things to do.) Just recall what the extremely kind and glorious writer and academic Anthony Oliveira said in the Washington Post: (Also definitely read this from him because it’s amazing)

We do not know what “love is love” means when you say it, because unlike yours, ours is a love that has cost us everything. It has, in living memory, sent us into exterminations, into exorcisms, into daily indignities and compromises. We cannot hold jobs with certainty nor hands without fear; we cannot be sure when next the ax will fall with the stroke of a pen.

Hope you’re all well and I’ll see you again in LGBT+ wrath month. (Or, more accurately, some time later this week to talk about the asymptotic properties of PSIS.)


How to think about reported life hacks?

Interesting juxtaposition as two interesting pieces of spam happened to appear in my inbox on the same day:

1. Subject line “Why the power stance will be your go-to move in 2019”:

The power stance has been highlighted as one way to show your dominance at work and move through the ranks. While moving up in your career comes down to so much more, there may be a way to make your power stance practical while also boosting your motivation and energy at the office.

**’s range of standing desks is the perfect way to bring your power stance to your office while also helping you stay organized, motivated and energized during the typical 9-5. . . . not only are you able to move from sitting to standing (or power stand) with the push of a button, but you are able to completely customize your desk for optimal organization and efficiency. For example, you can customize your desk to include the keyboard platform and dual monitor arms to keep the top of your desk clean and organized to help keep your creativity flowing. . . . the perfect way to help you show your power stance off in the office without ever having to leave your desk.

A standing desk could be cool, but color me skeptical on the power stance. Last time I saw a review of the evidence on that claim, there didn’t seem to be much there.

2. Subject line “Why you’re more productive in a coffee shop…”:

Why “one step at a time” is scientifically proven to help you get more done. Say hello to microproductivity . . .

Readers’ Choice 2018 ⭐️ Why you get more done when you relocate to a coffee shop. Plot twist: it’s not the caffeine. . . .

Feel like you’re constantly working but never accomplishing anything? Use this sage advice to be more strategic.

I clicked on the link for why you get more done when you relocate to a coffee shop, and it all seemed plausible to me. I’ve long noticed that I can get lots more work done on a train ride than in the equivalent number of hours at my desk. The webpage on “the coffee shop effect” has various links, including an article in Psychology Today on “The Science of Accomplishing Your Goals” and a university press release from 2006 reporting on an fMRI study (uh oh) containing several experiments, each on N=14 people (!), with results such as a statistically significant interaction (p = 0.048!!) and this beauty: “A post hoc analysis showed a significant difference . . . in substantia nigra (one sample t test, p = 0.05, one tailed) . . . but not in the amygdala . . .” So, no, this doesn’t look like high-quality science.

On the other hand, I often am more productive on the train, and I could well believe that I could be more productive in the coffee shop. So what’s the role of the scientific research here? I have no doubt that research on productivity in coffee shops could have value. But does the existing work have any value at all? I have no idea.

Freud expert also a Korea expert

I received the following email:

Dear Dr Andrew Gelman,

I am writing to you on behalf of **. I hereby took this opportunity to humbly request you to consider being a guest speaker on our morning radio show, on 6th August, between 8.30-9.00 am (BST) to discuss North Korea working on new missiles

We would feel honoured to have you on our radio show. having you as a guest speaker would give us and our viewers a great insight into this topic, we would greatly appreciate it if you could give us 10-15 minutes of your time and not just enhance our but also our views knowledge on this topic.

We are anticipating your reply and look forward to possibly having you on our radio show.

Kind regards,


Note – All interviews are conducted over the phone

Note – Timing can be altered between 7.30- 9.00 am (BST)




This email is CONFIDENTIAL and LEGALLY PRIVILEGED. If you are not the intended recipient of this email and its attachments, you must take no action based upon them, nor must you copy or show them to anyone. If you believe you have received this email in error, please email **

I don’t know which aspect of this email is more bizarre, that they sent me an unsolicited email that concludes with bullying pseudo-legal instructions, or that they think I’m an expert on North Korea (I guess from this post; to be fair, it seems that I know more about North Korea than the people who run the World Values Survey). Don’t they know that my real expertise is on Freud?

How much is your vote worth?

Tyler Cowen writes:

If it were legal, and you tried to sell your vote and your vote alone, you might not get much more than 0.3 cents.

It depends where you live.

If you’re not voting in any close elections, then the value of your vote is indeed close to zero. For example, I am a resident of New York. Suppose someone could pay me $X to switch my vote (or, equivalently, pay me $X/2 to not vote, or, equivalently, pay a nonvoter $X/2 to vote in a desired direction) in the general election for president. Who’d want to do that? There’s not much reason at all, except possibly for a winning candidate who’d like the public relations value of winning by an even larger margin, or for a losing candidate who’d like to lose by a bit less, to look like a more credible candidate next time, or maybe for some organization that would like to see voter turnout reach some symbolic threshold such as 50% or 60%.

If you’re living in a district with a close election, the story is quite different, as Edlin, Kaplan, and I discussed in our paper. In some recent presidential elections, we’ve estimated the ex ante probability of your vote being decisive in the national election (that is, decisive in your state, and, conditional on that, your state being decisive in the electoral college) as being approximately 1 in 10 million in swing states.

Suppose you live in one of those states? Then, how much would someone pay for your vote, if it were legal and moral to do so? I’m pretty sure there are people out there who would pay a lot more than 0.3 cents. If a political party or organization would drop, say, $100M to determine the outcome of the election, then it would be worth $10 to switch one person’s vote in one of those swing states.
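The arithmetic behind that $10 is just the stake multiplied by the probability of casting the decisive vote. Here is a minimal sketch of it. The function name `vote_value` is made up for illustration, and the "1 in 10 million" input is simply the probability at which a $100M stake works out to $10 per vote.

```python
def vote_value(stake_dollars, one_in):
    """Expected dollar value of one vote: stake times P(vote is decisive).

    `one_in` is the odds denominator, e.g. 10_000_000 for "1 in 10 million".
    """
    return stake_dollars / one_in


# $100M at stake, decisive with probability 1 in 10 million
swing_state = vote_value(100_000_000, 10_000_000)

# Running it backwards: what total stake would make a vote worth only 0.3 cents?
implied_stake = 0.003 * 10_000_000

print(f"swing-state vote: ${swing_state:.2f}")
print(f"stake implied by 0.3 cents per vote: about ${implied_stake:,.0f}")
```

Read the other way, a price of 0.3 cents per swing-state vote would only make sense if the whole election were worth a few tens of thousands of dollars to the buyer.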

We can also talk about this empirically. Campaigns do spend money to flip people’s votes and to get voters to turn out. They spend a lot more than 0.3 cents per voter. Now, sure, not all this is for the immediate goal of winning the election right now: for example, some of it is to get people to become regular voters, in anticipation of the time when their vote will make a difference. There’s a difference between encouraging people to turn out and vote (which is about establishing an attitude and a regular behavior) and paying for a single vote with no expectation of future loyalty. That said, even a one-time single vote should be worth a lot more than 0.3 cents to a campaign in a swing state.

tl;dr. Voting matters. Your vote is, in expectation, worth something real.

How to simulate an instrumental variables problem?

Edward Hearn writes:

In an effort to buttress my own understanding of multi-level methods, especially pertaining to those involving instrumental variables, I have been working the examples and the exercises in Jennifer Hill’s and your book.

I can find general answers at the Github repo for ARM examples, but for Chapter 10, Exercise 3 (simulating an IV regression to test assumptions using a binary treatment and instrument) and for the book examples, no code is given and I simply cannot figure out the solution.

My reply:

I have no homework solutions to send. But maybe some blog commenters would like to help out?

Here’s the exercise:
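For readers who want a starting point, here is a minimal sketch of simulating an IV setup with a binary instrument and a binary treatment. It is not the book's solution. Everything in it is an assumption chosen for illustration: the data-generating process, the coefficients, the constant treatment effect of 2, and the function names. It uses only the Python standard library and the simple Wald (ratio) estimator.

```python
import random


def diff_in_means(values, groups):
    """Mean of `values` where group == 1 minus mean where group == 0."""
    ones = [v for v, g in zip(values, groups) if g == 1]
    zeros = [v for v, g in zip(values, groups) if g == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)


def simulate_iv(n=200_000, effect=2.0, seed=0):
    """Return (naive estimate, Wald IV estimate) from one simulated dataset."""
    rng = random.Random(seed)
    z, t, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                    # unobserved confounder
        zi = 1 if rng.random() < 0.5 else 0    # randomized binary instrument
        # Treatment uptake depends on the instrument AND the confounder;
        # the latent-threshold form keeps the instrument's push one-directional.
        ti = 1 if (-0.5 + 1.5 * zi + u + rng.gauss(0, 1)) > 0 else 0
        # Outcome depends on treatment and confounder, but on z only through t.
        yi = 1.0 + effect * ti + 1.5 * u + rng.gauss(0, 1)
        z.append(zi)
        t.append(ti)
        y.append(yi)

    naive = diff_in_means(y, t)        # confounded comparison of treated vs. not
    itt = diff_in_means(y, z)          # effect of the instrument on the outcome
    first_stage = diff_in_means(t, z)  # effect of the instrument on uptake
    return naive, itt / first_stage


naive, wald = simulate_iv()
print(f"naive: {naive:.2f}   IV (Wald): {wald:.2f}   truth: 2.00")
```

To test assumptions, you can break them one at a time. For example, add a direct term in `zi` to the outcome line (violating the exclusion restriction) or shrink the 1.5 coefficient on `zi` (weak instrument), then watch what happens to the IV estimate.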

“The writer who confesses that he is ‘not good at attention to detail’ is like a pianist who admits to being tone deaf”

Edward Winter wrote:

It is extraordinary how the unschooled manage to reduce complex issues to facile certainties. The writer who confesses that he is ‘not good at attention to detail’ (see page 17 of the November 1990 CHESS for that stark, though redundant, admission by the Weekend Wordspinner) is like a pianist who admits to being tone deaf. Broad sweeps are valueless. Unless an author has explored his terrain thoroughly, how will he be reasonably sure that his central thesis cannot be overturned? Facts count. Tentative theorizing may have a minor role once research paths have been exhausted but, as a general principle, rumour and guesswork, those tawdry journalistic mainstays, have no place in historical writing of any kind. . . .

He’s talking about chess, but the principle applies more generally.

What’s interesting to me is how many people, including scientists and even mathematicians (sorry, Chrissy), don’t think that way.

We’ve discussed various examples over the years where scientists write things in published papers that are obviously false, reporting results that could not possibly have been in their data, which we know either from simple concordances (i.e., the numbers don’t add up) or because the results report information that was never actually gathered in the study in question.

How do people do this? How can they possibly think this is a good idea?

Here are the explanations that have been proffered for this behavior, of publishing claims that are contradicted by, or have zero support from, their data:

1. Simple careerism, or what’s called “incentives”: Make big claims and you can get published in PNAS, get a prestigious job, fame, fortune, etc.

2. The distinction between truth and evidence: Researchers think their hypotheses are true, so they get sloppy on the evidence. To them, it doesn’t really matter if their data support their theory because they believe the theory in any case—and the theory is vague enough to support just about any pattern in data.

And, sure, that explains a lot, but it doesn’t explain some of the examples that Winter has given (for example, a chess book stating that a game occurred 10 years after one of the players had died; although, to be fair, that player was said to be the loser of said game). Or, in the realm of scientific research, the papers of Brian Wansink which contained numbers that were not consistent with any possible data.

One might ask: Why get such details wrong? Why not either look up the date, or, if you don’t want to bother, why give a date at all?

This leads us to a third explanation for error:

3. If an author makes zillions of statements, and truth just doesn’t matter for any particular one of the statements, then you’ll get errors.

I think this happens a lot. All of us make errors, and most of us understand that errors are inevitable. But there seems to be a divide between two sorts of people: (a) Edward Winter, and I expect most of the people who read this blog, who feel personally responsible for our errors and try to check as much as possible and correct what mistakes arise, and (b) Brian Wansink, David Brooks and, it seems, lots of other writers, who are more interested in the flow, and who don’t want to be slowed down by fact checking.

Winter argues that if you get the details wrong, or if you don’t care about the details, you can get the big things wrong too. And I’m inclined to agree. But maybe we’re wrong. Maybe it’s better to just plow on ahead, mistakes be damned, always on to the next project. I dunno.

P.S. Regarding Winter’s quote above, I bet it is possible to be a good pianist even if tone-deaf, if you can really bang it out and you have a good sense of rhythm. Just as you can be a good basketball player even if you’re really short. But it’s a handicap, that’s for sure.

Harvard dude calls us “online trolls”

Story here.

Background here (“How post-hoc power calculation is like a shit sandwich”) and here (“Post-Hoc Power PubPeer Dumpster Fire”).

OK, to be fair, “shit sandwich” could be considered kind of a trollish thing for me to have said. But the potty language in this context was not gratuitous; it furthered the larger point I was making. There’s a tradeoff: use clean language and that will help with certain readers; on the other hand, vivid language and good analogies make the writing more readable and can make the underlying argument more accessible. So no easy answers on that one.

In any case, the linked Pubpeer thread had no trolling at all, by me or anybody else.

Random patterns in data yield random conclusions.

Bert Gunter points to this New York Times article, “How Exercise May Make Us Healthier: People who exercise have different proteins moving through their bloodstreams than those who are generally sedentary,” writing that it is “hyping a Journal of Applied Physiology paper that is now my personal record holder for most extensive conclusions from practically no data by using all possible statistical (appearing) methodology . . . I [Gunter] find it breathtaking that it got through peer review.”

OK, to dispose of that last issue first, I’ve seen enough crap published by PNAS and Lancet to never find it breathtaking that anything gets through peer review.

But let’s look at the research paper itself, “Habitual aerobic exercise and circulating proteomic patterns in healthy adults: relation to indicators of healthspan,” by Jessica Santos-Parker, Keli Santos-Parker, Matthew McQueen, Christopher Martens, and Douglas Seals, which reports:

In this exploratory study, we assessed the plasma proteome (SOMAscan proteomic assay; 1,129 proteins) of healthy sedentary or aerobic exercise-trained young women and young and older men (n = 47). Using weighted correlation network analysis to identify clusters of highly co-expressed proteins, we characterized 10 distinct plasma proteomic modules (patterns).

Here’s what they found:

In healthy young men and women, 4 modules were associated with aerobic exercise status and 1 with participant sex. In healthy young and older men, 5 modules differed with age, but 2 of these were partially preserved at young adult levels in older men who exercised; among all men, 4 modules were associated with exercise status, including 3 of the 4 identified in young adults.

Uh oh. This does sound like a mess.

On the plus side, the study is described right in the abstract as “exploratory.” On the minus side, the word “exploratory” is not in the title, nor did it make it into the news article. The journal article concludes as follows:

Overall, these findings provide initial insight into circulating proteomic patterns modulated by habitual aerobic exercise in healthy young and older adults, the biological processes involved, and the relation between proteomic patterns and clinical and physiological indicators of human healthspan.

I do think this is a bit too strong. The “initial” in “initial insight” corresponds to the study being exploratory, but it does not seem like enough of a caveat to me, especially considering that the preceding sentences (“We were able to characterize . . . Habitual exercise-associated proteomic patterns were related to biological pathways . . . Several of the exercise-related proteomic patterns were associated . . .”) had no qualifications and were written exactly how you’d write them if the results came from a preregistered study of 10,000 randomly sampled people rather than an uncontrolled study of 47 people who happened to answer an ad.

How to analyze the data better?

But enough about the reporting. Let’s talk about how this exploratory study should’ve been analyzed. Or, for that matter, how it can be analyzed, as the data are still there, right?

To start with, don’t throw away data. For example, “Outliers were identified as protein values ≥ 3 standard deviations from the mean and were removed.” Huh?

Also this: “Because of the exploratory nature of this study, significance for all subsequent analyses was set at an uncorrected α < 0.05.” This makes no sense. Look at everything. Don’t use an arbitrary threshold.

Also there’s some weird thing in which proteins were divided into 5 categories. It’s kind of a mess. To be honest, I’m not quite sure what should be done here. They’re looking at 1129 different proteins so some sort of structuring needs to be done. But I don’t think it makes sense to do the structuring based on this little dataset from 47 people. A lot must already be known about these proteins, right? So I think the right way to go would be to use some pre-existing structuring of the proteins, then present the correlations of interest in a grid, then maybe fit some sort of multilevel model.

I fear that the analysis in the published paper is not so useful because it’s picking out a few random comparisons, and I’d guess that a replication study using the same methods would come up with results that are completely different.

Finally, I have no doubt that the subtitle of the news article, “People who exercise have different proteins moving through their bloodstreams than those who are generally sedentary,” is true, because any two groups of people will differ in all sorts of ways. I think the analysis as performed won’t help much in understanding these differences in the general population, but perhaps a multilevel model, along with more data, could give some insight.

P.S. Maybe the title of this post could be compressed to the following: Random in, random out.
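To get a feel for “random in, random out,” here is a rough sketch of what an uncorrected threshold does when there is nothing to find. It is not the paper's analysis (they used network modules, not per-protein tests). It assumes 1,129 independent pure-noise “proteins,” a hypothetical 24/23 split of the 47 people, and |t| > 2 as a stand-in for α < 0.05. Real proteins are correlated, which makes the count of “hits” even more erratic.

```python
import random
import statistics


def count_noise_hits(n_people=47, n_proteins=1129, t_cutoff=2.0, seed=1):
    """Count pure-noise proteins whose two-group t statistic exceeds the cutoff."""
    rng = random.Random(seed)
    n_a = (n_people + 1) // 2          # hypothetical "exercise" group size
    n_b = n_people - n_a               # hypothetical "sedentary" group size
    hits = 0
    for _ in range(n_proteins):
        # No true difference: both groups come from the same distribution.
        a = [rng.gauss(0, 1) for _ in range(n_a)]
        b = [rng.gauss(0, 1) for _ in range(n_b)]
        se = (statistics.variance(a) / n_a + statistics.variance(b) / n_b) ** 0.5
        t = (statistics.mean(a) - statistics.mean(b)) / se
        if abs(t) > t_cutoff:
            hits += 1
    return hits


hits = count_noise_hits()
print(f"{hits} of 1129 noise proteins look 'significant'")
```

You should expect something on the order of 5% of 1,129 to clear the bar, and a different seed (that is, a replication) will flag a different set.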

I agree it’s a problem but it doesn’t surprise me. It’s pretty random what these tabloids publish, as they get so many submissions.

Jeff Lax writes:

I’m probably not the only one telling you about this Science story, but just in case.

The link points to a new research article reporting a failed replication of a study from 2008. The journal that published that now-questionable result refuses to consider publishing the replication attempt.

My reply:

I agree it’s a problem but it doesn’t surprise me. It’s pretty random what these tabloids publish, as they get so many submissions. Sure, they couldn’t publish this particular paper, but maybe there was something more exciting submitted to Science that week, maybe a new manuscript by Michael Lacour?

Causal inference: I recommend the classical approach in which an observational study is understood in reference to a hypothetical controlled experiment

Amy Cohen asked me what I thought of this article, “Control of Confounding and Reporting of Results in Causal Inference Studies: Guidance for Authors from Editors of Respiratory, Sleep, and Critical Care Journals,” by David Lederer et al.

I replied that I liked some of their recommendations (downplaying p-values, graphing raw data, presenting results clearly) and I am supportive of their general goal to provide statistical advice for practitioners, but I was less happy about their recommendations for causal inference, which were focused on taking observational data and drawing causal graphs. Also I don’t think their phrase “causal association” has any useful meaning. A statement such as “Causal inference is the examination of causal associations to estimate the causal effect of an exposure on an outcome” looks pretty circular to me.

When it comes to causal inference, I prefer a more classical approach in which an observational study is understood in reference to a hypothetical controlled experiment.

I also think that the discussion of causal inference in the paper is misguided in part because of the authors’ non-quantitative approach. For example, they consider a hypothetical study estimating the effect of exercise on lung cancer and they say that “Controlling for ‘smoking’ will close the back-door path.” First off, given the effects of smoking on lung cancer, “controlling for smoking” won’t do the job at all, unless this is some incredibly precise model with smoking very well measured. The trouble is that the effect of smoking on lung cancer is so large that any biases in this measurement could easily overwhelm the effect they’d be trying to estimate. And this sort of thing comes up a lot in public health studies. Second, you’d need to control for lots of things, not just smoking. This example illustrates how I don’t see the point of all their discussion of colliders. If we instead simply take the classical approach, we’d start with a hypothetical controlled study of exercise on lung cancer, a randomized prospective study in which the experimenter assigns exercise levels to patients, who are then followed up, etc., then we move to the observational study and consider pre-treatment differences between people with different exercise levels. This makes it clear that there’s no “back-door path”; there are just differences between the groups, differences that you’d like to adjust for in the design and analysis of the study.
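The point about measurement can be made quantitative with a toy simulation. This is a sketch under made-up assumptions, not a model of any real study: exercise has exactly zero effect on the outcome, smoking has a large effect (10 units), smokers exercise less, and recorded smoking status is wrong 10% of the time. All numbers and function names are hypothetical.

```python
import random


def simulate(n=100_000, flip=0.1, seed=0):
    """Rows of (true_smoker, recorded_smoker, exercises, outcome)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        s = rng.random() < 0.3                        # true smoking status
        e = rng.random() < (0.3 if s else 0.6)        # smokers exercise less
        s_obs = (not s) if rng.random() < flip else s  # misrecorded 10% of the time
        y = 10.0 * s + rng.gauss(0, 1)                # exercise has NO effect on y
        rows.append((s, s_obs, e, y))
    return rows


def adjusted_effect(rows, key):
    """Exercise minus no-exercise difference in mean y, averaged over strata of column `key`."""
    total = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[key] == level]
        ex = [r[3] for r in stratum if r[2]]
        no_ex = [r[3] for r in stratum if not r[2]]
        total += len(stratum) / len(rows) * (sum(ex) / len(ex) - sum(no_ex) / len(no_ex))
    return total


rows = simulate()
with_truth = adjusted_effect(rows, key=0)   # adjusting for true smoking
with_noise = adjusted_effect(rows, key=1)   # adjusting for recorded smoking
print(f"adjusting for true smoking:     {with_truth:+.2f}")
print(f"adjusting for recorded smoking: {with_noise:+.2f}")
```

Adjusting for the true variable recovers roughly zero, while adjusting for the slightly mismeasured one leaves exercise looking clearly protective, purely because of leftover differences in smoking between the groups.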

Also I fear that this passage in the linked article could be misleading: “Causal inference studies require a clearly articulated hypothesis, careful attention to minimizing selection and information bias, and a deliberate and rigorous plan to control confounding. The latter is addressed in detail later in this document. Prediction models are fundamentally different than those used for causal inference. Prediction models use individual-level data (predictors) to estimate (predict) the value of an outcome. . . ” This seems misleading to me in that a good prediction study also requires a clearly articulated hypothesis, careful attention to minimizing selection and information bias, and a deliberate and rigorous plan to control confounding.

The point is that, once you’re concerned about out-of-sample (rather than within-sample) prediction, all these issues of measurement, selection, confounding, etc. arise. Also, a causal model is a special case of a predictive model where the prediction is conditional on some treatment being applied. So I think it’s a mistake to think of causal and predictive inference as being two different things.

P.S. Long comment thread below, and I think I need to clarify something. I’m not saying that researchers should not use graphical models when doing causal inference. Graphical models can be useful, and, in any case, many statistical methods that do not explicitly use graphical models, can be interpreted as using graphical models implicitly. If people want to argue in the comments about the utility or importance of graphical models for causal inference, that’s fine: just be clear that this is not the point of the above post. The above post is responding to a very specific article that gives what I see as some misleading advice. My problem with the article is not that it recommends the use of graphical models; rather, my problems are some specific issues stated above.

The publication asymmetry: What happens if the New England Journal of Medicine publishes something that you think is wrong?

After reading my news article on the replication crisis, retired cardiac surgeon Gerald Weinstein wrote:

I have long been disappointed by the quality of research articles written by people and published by editors who should know better. Previously, I had published two articles on experimental design written with your colleague Bruce Levin [of the Columbia University biostatistics department]:

Weinstein GS and Levin B: The coronary artery surgery study (CASS): a critical appraisal. J. Thorac. Cardiovasc. Surg. 1985;90:541-548.

Weinstein GS and Levin B: The effect of crossover on the statistical power of randomized studies. Ann. Thorac. Surg. 1989;48:490-495.

I [Weinstein] would like to point out some additional problems with such studies in the hope that you could address them in some future essays. I am focusing on one recent article in the New England Journal of Medicine because it is typical of so many other clinical studies:

Alirocumab and Cardiovascular Outcomes after Acute Coronary Syndrome

November 7, 2018 DOI: 10.1056/NEJMoa1801174


Patients who have had an acute coronary syndrome are at high risk for recurrent ischemic cardiovascular events. We sought to determine whether alirocumab, a human monoclonal antibody to proprotein convertase subtilisin–kexin type 9 (PCSK9), would improve cardiovascular outcomes after an acute coronary syndrome in patients receiving high-intensity statin therapy.


We conducted a multicenter, randomized, double-blind, placebo-controlled trial involving 18,924 patients who had an acute coronary syndrome 1 to 12 months earlier, had a low-density lipoprotein (LDL) cholesterol level of at least 70 mg per deciliter (1.8 mmol per liter), a non−high-density lipoprotein cholesterol level of at least 100 mg per deciliter (2.6 mmol per liter), or an apolipoprotein B level of at least 80 mg per deciliter, and were receiving statin therapy at a high-intensity dose or at the maximum tolerated dose. Patients were randomly assigned to receive alirocumab subcutaneously at a dose of 75 mg (9462 patients) or matching placebo (9462 patients) every 2 weeks. The dose of alirocumab was adjusted under blinded conditions to target an LDL cholesterol level of 25 to 50 mg per deciliter (0.6 to 1.3 mmol per liter). “The primary end point was a composite of death from coronary heart disease, nonfatal myocardial infarction, fatal or nonfatal ischemic stroke, or unstable angina requiring hospitalization.”


The median duration of follow-up was 2.8 years. A composite primary end-point event occurred in 903 patients (9.5%) in the alirocumab group and in 1052 patients (11.1%) in the placebo group (hazard ratio, 0.85; 95% confidence interval [CI], 0.78 to 0.93; P<0.001). A total of 334 patients (3.5%) in the alirocumab group and 392 patients (4.1%) in the placebo group died (hazard ratio, 0.85; 95% CI, 0.73 to 0.98). The absolute benefit of alirocumab with respect to the composite primary end point was greater among patients who had a baseline LDL cholesterol level of 100 mg or more per deciliter than among patients who had a lower baseline level. The incidence of adverse events was similar in the two groups, with the exception of local injection-site reactions (3.8% in the alirocumab group vs. 2.1% in the placebo group).

Here are some major problems I [Weinstein] have found in this study:

1. Misleading terminology: the “primary composite endpoint.” Many drug studies, such as those concerning PCSK9 inhibitors (which are supposed to lower LDL or “bad” cholesterol) use the term “primary endpoint” which is actually “a composite of death from coronary heart disease, nonfatal myocardial infarction, fatal or nonfatal ischemic stroke, or unstable angina requiring hospitalization.” [Emphasis added]

Obviously, a “composite primary endpoint” is an oxymoron (which of the primary colors are composites?) but, worse, the term is so broad that it casts doubt on any conclusions drawn. For example, stroke is generally an embolic phenomenon and may be caused by atherosclerosis, but also may be due to atrial fibrillation in at least 15% of cases. Including stroke in the “primary composite endpoint” is misleading, at best.

By casting such a broad net, the investigators seem to be seeking evidence from any of the four elements in the so-called primary endpoint. Instead of being specific as to which types of events are prevented, the composite primary endpoint obscures the clinical benefit.

2. The use of relative risks, odds ratios or hazard ratios to obscure clinically insignificant differences in absolute differences. “A composite primary end-point event occurred in 903 patients (9.5%) in the alirocumab group and in 1052 patients (11.1%) in the placebo group.” This is an absolute difference of only 1.6%. Such small differences are unlikely to be clinically important, or even replicated on subsequent studies, yet the authors obscure this fact by citing hazard ratios. Only in a supplemental appendix (available online), does this become apparent. Note the enlarged and prominently displayed hazard ratio, drawing attention away from the almost nonexistent difference in event rates (and lack of error bars). Of course, when the absolute differences are small, the ratio of two small numbers can be misleadingly large.

I am concerned because this type of thing is appearing more and more frequently. Minimally effective drugs are being promoted at great expense, and investigators are unthinkingly adopting questionable methods in search of new treatments. No wonder they can’t be repeated.
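To make the absolute-versus-relative distinction concrete, here is a small sketch using only the event counts quoted in the abstract above. Note that the risk ratio computed here is a crude ratio of proportions, which is not the same quantity as the hazard ratio of 0.85 reported in the paper. The variable names are just for illustration.

```python
# Event counts and group sizes as quoted in the abstract above
events_drug, n_drug = 903, 9462
events_placebo, n_placebo = 1052, 9462

risk_drug = events_drug / n_drug
risk_placebo = events_placebo / n_placebo

abs_diff = risk_placebo - risk_drug   # absolute risk reduction
risk_ratio = risk_drug / risk_placebo  # crude relative risk (not a hazard ratio)
nnt = 1 / abs_diff                     # patients treated per event avoided

print(f"risk: {risk_drug:.1%} vs {risk_placebo:.1%}")
print(f"absolute difference: {abs_diff:.1%}")
print(f"risk ratio: {risk_ratio:.2f}")
print(f"number needed to treat: about {nnt:.0f}")
```

The same data can be summarized as “a 14% relative reduction” or as “about 1.6 percentage points over a median 2.8 years,” and which one gets the large typeface matters for how a reader takes it.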

I suggested to Weinstein that he write a letter to the journal, and he replied:

Unfortunately, the New England Journal of Medicine has a strict limit on the number of words in a letter to the editor of 175 words.

In addition, they have not been very receptive to my previous submissions. Today they rejected my short letter on an article that reached a conclusion that was the opposite of the data due to a similar category error, even though I kept it within that word limit.

“I am sorry that we will not be able to publish your recent letter to the editor regarding the Perner article of 06-Dec-2018. The space available for correspondence is very limited, and we must use our judgment to present a representative selection of the material received.” Of course, they have the space to publish articles that are false on their face.

Here is the letter they rejected:

Re: Pantoprazole in Patients at Risk for Gastrointestinal Bleeding in the ICU

(December 6, 2018 N Engl J Med 2018; 379:2199-2208)

This article appears to reach an erroneous conclusion based on its own data. The study implies that pantoprazole is ineffective in preventing GI bleeding in ICU patients when, in fact, the results show that it is effective.

The purpose of the study was to evaluate the effectiveness of pantoprazole in preventing GI bleeding. Instead, the abstract shifts gears and uses death within 90 days as the primary endpoint and the Results section focuses on “at least one clinically important event (a composite of clinically important gastrointestinal bleeding, pneumonia, Clostridium difficile infection, or myocardial ischemia).” For mortality and for the composite “clinically important event,” relative risks, confidence intervals and p-values are given, indicating no significant difference between pantoprazole and control, but a p-value was not provided for GI bleeding, which is the real primary endpoint, even though “In the pantoprazole group, 2.5% of patients had clinically important gastrointestinal bleeding, as compared with 4.2% in the placebo group.” According to my calculations, the chi-square value is 7.23, with a p-value of 0.0072, indicating that pantoprazole is effective at the p<0.05 level in decreasing gastrointestinal bleeding in ICU patients. [emphasis added]

My concern is that clinicians may be misled into believing that pantoprazole is not effective in preventing GI bleeding in ICU patients when the study indicates that it is, in fact, effective.
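One part of this can be checked without the trial data: whether a chi-square of 7.23 (1 degree of freedom) really corresponds to p = 0.0072. The sketch below does that conversion with the standard library. It also includes a helper for the Pearson statistic of a 2x2 table, which anyone with the article's actual counts could use. The group sizes are not given in the text above, so no counts are filled in here. Function names are illustrative.

```python
import math


def chi2_pvalue_1df(x):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


p = chi2_pvalue_1df(7.23)
print(f"chi-square 7.23, 1 df -> p = {p:.4f}")
```

So the p-value in the letter is consistent with the chi-square value it reports. Whether 7.23 is the right statistic depends on the counts in the paper.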

This sort of mislabeling of end-points is now commonplace in many medical journals. I am hoping you can shed some light on this. Perhaps you might be able to get the NY Times or the NEJM to publish an essay by you on this subject, as I believe the quality of medical publications is suffering from this practice.

I have no idea. I’m a bit intimidated by medical research with all its specialized measurements and models. So I don’t think I’m the right person to write this essay; indeed I haven’t even put in the work to evaluate Weinstein’s claims above.

But I do think they’re worth sharing, just because there is this “publication asymmetry” in which, once something appears in print, especially in a prestigious journal, it becomes very difficult to criticize (except in certain cases when there’s a lot of money, politics, or publicity involved).

We’re done with our Applied Regression final exam (and solution to question 15)

We’re done with our exam.

And the solution to question 15:

15. Consider the following procedure.

• Set n = 100 and draw n continuous values x_i uniformly distributed between 0 and 10. Then simulate data from the model y_i = a + bx_i + error_i, for i = 1,…,n, with a = 2, b = 3, and independent errors from a normal distribution.

• Regress y on x. Look at the median and mad sd of b. Check to see if the interval formed by the median ± 2 mad sd includes the true value, b = 3.

• Repeat the above two steps 1000 times.

(a) True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

(b) Same as above, except the error distribution is bimodal, not normal. True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

Both (a) and (b) are true.

(a) is true because everything’s approximately normally distributed so you’d expect a 95% chance for an estimate +/- 2 se’s to contain the true value. In real life we’re concerned with model violations, but here it’s all simulated data so no worries about bias. And n=100 is large enough that we don’t have to worry about the t rather than normal distribution. (Actually, even if n were pretty small, we’d be doing ok with estimates +/- 2 sd’s because we’re using the mad sd which gets wider when the t degrees of freedom are low.)

And (b) is true too because of the central limit theorem. Switching from a normal to a bimodal distribution will affect predictions for individual cases but it will have essentially no effect on the distribution of the estimate, which is an average from 100 data points.
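A minimal simulation sketch of the procedure, in Python. It is a stand-in rather than the exam's exact setup: least squares with the estimate ± 2 standard errors replaces the median ± 2 mad sd from a fitted Bayesian regression, which should behave similarly at n = 100. The error sd of 1 and the particular bimodal mixture are assumptions, since the question does not specify them.

```python
import numpy as np

def coverage(error_sampler, n=100, a=2.0, b=3.0, n_sims=1000, seed=0):
    """Fraction of simulations in which estimate +/- 2 se covers the true slope b."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        x = rng.uniform(0, 10, size=n)
        y = a + b * x + error_sampler(rng, n)
        X = np.column_stack([np.ones(n), x])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        sigma2 = resid @ resid / (n - 2)            # residual variance estimate
        se_b = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
        hits += abs(coef[1] - b) <= 2 * se_b
    return hits / n_sims

def normal_errors(rng, n):
    # Assumed error sd of 1 (not given in the question)
    return rng.normal(0, 1, size=n)

def bimodal_errors(rng, n):
    # Assumed mean-zero mixture: half the errors near -2, half near +2
    centers = rng.choice([-2.0, 2.0], size=n)
    return centers + rng.normal(0, 0.5, size=n)

cov_normal = coverage(normal_errors)
cov_bimodal = coverage(bimodal_errors)
print("coverage, normal errors: ", cov_normal)
print("coverage, bimodal errors:", cov_bimodal)
```

Both coverage rates should come out close to 0.95, with simulation noise of a bit under a percentage point from using 1000 replications.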

Common mistakes

Most of the students got (a) correct but not (b). I guess I have to bang even harder on the relative unimportance of the error distribution (except when the goal is predicting individual cases).

Algorithmic bias and social bias

The “algorithmic bias” that concerns me is not so much a bias in an algorithm, but rather a social bias resulting from the demand for, and expectation of, certainty.

Pharmacometrics meeting in Paris on the afternoon of 11 July 2019

Julie Bertrand writes:

The pharmacometrics group led by France Mentre (IAME, INSERM, Univ Paris) is very pleased to host a free ISoP Statistics and Pharmacometrics (SxP) SIG local event at Faculté Bichat, 16 rue Henri Huchard, 75018 Paris, on Thursday afternoon the 11th of July 2019.

It will feature talks from Professor Andrew Gelman, Univ of Columbia (We’ve Got More Than One Model: Evaluating, comparing, and extending Bayesian predictions) and Professor Rob Bies, Univ of Buffalo (A hybrid genetic algorithm for NONMEM structural model optimization).

We welcome all of you (please register here). Registration is capped at 70 attendees.

If you would like to present some of your work (related to SxP), please contact us by July 1, 2019. Send a title and short abstract (

Question 15 of our Applied Regression final exam (and solution to question 14)

Here’s question 15 of our exam:

15. Consider the following procedure.

• Set n = 100 and draw n continuous values x_i uniformly distributed between 0 and 10. Then simulate data from the model y_i = a + bx_i + error_i, for i = 1,…,n, with a = 2, b = 3, and independent errors from a normal distribution.

• Regress y on x. Look at the median and mad sd of b. Check to see if the interval formed by the median ± 2 mad sd includes the true value, b = 3.

• Repeat the above two steps 1000 times.

(a) True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

(b) Same as above, except the error distribution is bimodal, not normal. True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

And the solution to question 14:

14. You are predicting whether a student passes a class given pre-test score. The fitted model is, Pr(Pass) = logit^−1(a_j + 0.1x),
for a student in classroom j whose pre-test score is x. The pre-test scores range from 0 to 50. The a_j’s are estimated to have a normal distribution with mean 1 and standard deviation 2.

(a) Draw the fitted curve Pr(Pass) given x, for students in an average classroom.

(b) Draw the fitted curve for students in a classroom at the 25th and the 75th percentile of classrooms.

(a) For an average classroom, the curve is invlogit(1 + 0.1x), so it goes through the 50% point at x = -10. So the easiest way to draw the curve is to extend it outside the range of the data. But in the graph, the x-axis should go from 0 to 50. Recall that invlogit(5) = 0.99, so the probability of passing reaches 99% when x reaches 40. From all this information, you can draw the curve.

(b) The 25th and 75th percentage points of the normal distribution are at the mean +/- 0.67 standard deviations. Thus, the 25th and 75th percentage points of the intercepts are 1 +/- 0.67*2, or -0.34, 2.34, so the curves to draw are invlogit(-0.34 + 0.1x) and invlogit(2.34 + 0.1x). These are just shifted versions of the curve from (a), shifted by 1.34/0.1 = 13.4 to the right and to the left, respectively.
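The numbers in this solution can be checked with a few lines of Python. This is only a sketch using the standard library; the function and variable names are made up for illustration. With the unrounded normal quantile (about 0.674 rather than 0.67), the intercepts and the shift come out slightly different in the second decimal place from the hand calculation.

```python
import math
from statistics import NormalDist

def invlogit(z):
    """Inverse logit: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

mean_a, sd_a, slope = 1.0, 2.0, 0.1    # values given in the question

z75 = NormalDist().inv_cdf(0.75)       # 75th percentile of the standard normal
a25 = mean_a - z75 * sd_a              # intercept for a 25th-percentile classroom
a75 = mean_a + z75 * sd_a              # intercept for a 75th-percentile classroom
shift = z75 * sd_a / slope             # horizontal shift of the curves, in x units

def pass_curve(intercept, xs):
    """Pr(Pass) at each pre-test score in xs for a given classroom intercept."""
    return [invlogit(intercept + slope * x) for x in xs]

xs = list(range(0, 51, 10))            # pre-test scores span 0 to 50
for label, a in [("25th pct", a25), ("average ", mean_a), ("75th pct", a75)]:
    print(label, [round(p, 2) for p in pass_curve(a, xs)])
print("shift in x:", round(shift, 1))
```

Evaluating the curves only on the range 0 to 50 mirrors the advice to label the axes with the actual range of the data before drawing.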

Common mistakes

Students didn’t always use the range of x. The most common bad answer was to just draw a logistic curve and then put some numbers on the axes.

A key lesson that I had not conveyed well in class: draw and label the axes first, then draw the curve.

Question 14 of our Applied Regression final exam (and solution to question 13)

Here’s question 14 of our exam:

14. You are predicting whether a student passes a class given pre-test score. The fitted model is, Pr(Pass) = logit^−1(a_j + 0.1x),
for a student in classroom j whose pre-test score is x. The pre-test scores range from 0 to 50. The a_j’s are estimated to have a normal distribution with mean 1 and standard deviation 2.

(a) Draw the fitted curve Pr(Pass) given x, for students in an average classroom.

(b) Draw the fitted curve for students in a classroom at the 25th and the 75th percentile of classrooms.

And the solution to question 13:

13. You fit a model of the form: y ∼ x + u_full + (1 | group). The estimated coefficients are 2.5, 0.7, and 0.5 respectively for the intercept, x, and u_full, with group and individual residual standard deviations estimated as 2.0 and 3.0 respectively. Write the above model as
y_i = a_j[i] + bx_i + ε_i
a_j = A + Bu_j + η_j.

(a) Give the estimates of b, A, and B together with the estimated distributions of the error terms.

(b) Ignoring uncertainty in the parameter estimates, give the predictive standard deviation for a new observation in an existing group and for a new observation in a new group.

(a) The estimates of b, A, and B are 0.7, 2.5, and 0.5, respectively, and the estimated distributions are ε ~ normal(0, 3.0) and η ~ normal(0, 2.0).

(b) 3.0 and sqrt(3.0^2 + 2.0^2) = 3.6.
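The logic of part (b) is that a new observation in an existing group only picks up the individual-level error, while a new observation in a new group also picks up an unknown group effect, so the two variances add. A short Python sketch of the arithmetic, with a simulation check; the variable names are illustrative, and uncertainty in the parameter estimates is ignored, as the question instructs.

```python
import math
import numpy as np

sigma_y = 3.0   # individual-level residual sd (from the question)
sigma_a = 2.0   # group-level sd (from the question)

# Existing group: the group intercept is treated as known, so only epsilon varies
sd_existing = sigma_y

# New group: independent group error eta and individual error epsilon both vary
sd_new = math.sqrt(sigma_y**2 + sigma_a**2)

# Simulation check: draw eta and epsilon independently and add them
rng = np.random.default_rng(1)
n_draws = 200_000
eps = rng.normal(0, sigma_y, n_draws)
eta = rng.normal(0, sigma_a, n_draws)
sim_sd_new = np.std(eta + eps)

print("existing group:", sd_existing)
print("new group (formula):", round(sd_new, 2))
print("new group (simulated):", round(float(sim_sd_new), 2))
```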

Common mistakes

Almost everyone got part (a) correct, and most people got (b) also, but there was some confusion about the uncertainty for a new observation in a new group.