Faye Flam wrote a solid article for the New York Times on Bayesian statistics, and as part of her research she spent some time on the phone with me awhile ago discussing the connections between Bayesian inference and the crisis in science criticism. My longer thoughts on this topic are in my recent article, “The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective,” but of course many more people will get the short version that appeared in the newspaper.

That’s fine, and Flam captured the general “affect” of our discussion—the idea that Bayes allows the use of prior information, and that p-values can’t be taken at face value. As I discuss below, I like Flam’s article, I’m glad it’s out there, and I’m grateful that she took the time to get my perspective.

Unfortunately, though, some of the details got garbled.

Flam never put quotation marks around anything I said, and I know that with journalism there isn’t always time to check every paragraph. After I saw the article online I pointed out the mistakes and Flam asked the NYT editors to correct them so I hope this will be done soon.

In the meantime, I’ll post the corrections here.

In the article, it says:

But there’s a danger in this [p-value] tradition, said Andrew Gelman, a statistics professor at Columbia. Even if scientists always did the calculations correctly — and they don’t, he argues — accepting everything with a p-value of 5 percent means that one in 20 “statistically significant” results are nothing but random noise.

No no no no no. I recommended correcting as follows:

But there’s a danger in this tradition, said Andrew Gelman, a statistics professor at Columbia. Even if scientists always did the calculations correctly — and they don’t, he argues — accepting everything with a p-value of 5 percent can lead to spurious findings—cases where an observed “statistically significant” pattern in data does not reflect a corresponding pattern in the population—far more than 5 percent of the time. The weaker the signal and the noisier the measurements, the more likely that a pattern, even if statistically significant, will not replicate.

To the outsider this might sound almost the same, but on a technical level it makes a big difference!
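To make the distinction concrete, here is a toy simulation sketch (my own illustration; the effect size, noise level, and sample size are invented, not taken from any real study):

```python
import numpy as np

rng = np.random.default_rng(0)

def one_study(true_effect=0.1, sd=1.0, n=50):
    # z-test for a difference of two means with known sd (a simplification):
    # the estimate is drawn around the (small) true effect with this se
    se = sd * np.sqrt(2.0 / n)
    est = rng.normal(true_effect, se)
    return est, abs(est) / se > 1.96

n_sims = 20000
n_sig = n_rep = n_wrong_sign = 0
for _ in range(n_sims):
    est, sig = one_study()
    if sig:
        n_sig += 1
        if est < 0:
            n_wrong_sign += 1          # "significant" but in the wrong direction
        est2, sig2 = one_study()       # an exact replication attempt
        if sig2 and np.sign(est2) == np.sign(est):
            n_rep += 1

print(f"significant:          {n_sig / n_sims:.1%}")
print(f"of those, replicated: {n_rep / n_sig:.1%}")
print(f"of those, wrong sign: {n_wrong_sign / n_sig:.1%}")
```

With a weak signal and noisy measurements, only a small fraction of the statistically significant results replicate, and a nontrivial share even have the wrong sign—which is the point of the suggested correction.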

The article then says that I say:

The proportion of wrong results published in prominent journals is probably even higher

I would change this to:

This could well be an even bigger problem with prominent journals

Later the article refers to the notorious fecundity-and-voting study and says:

Dr. Gelman re-evaluated the study using Bayesian statistics. That allowed him to look at probability not simply as a matter of results and sample sizes, but in the light of other information that could affect those results.

He factored in data showing that people rarely change their voting preference over an election cycle, let alone a menstrual cycle. When he did, the study’s statistical significance evaporated.

This is not correct. I did not re-evaluate the study using Bayesian methods, nor did I claim to have done so.

Here’s my suggested revision:

Dr. Gelman felt this result was not consistent with polling data showing that people rarely change their voting preference over an election cycle, let alone a menstrual cycle. And after accounting for the many different analyses that could have been performed on the data, the study’s statistical significance evaporated.

Finally, the article writes of me:

He suggests using Bayesian calculations not necessarily to replace classical statistics but to flag spurious results.

I wouldn’t quite put it that way! I prefer:

He says that in such studies there is strong prior information, which can be included using Bayesian methods or in other ways.

**Putting it into perspective**

I suppose journalists find it difficult to deal with academics because we’re so picky. As I noted above, I think the article captured the general sense of what I was saying, and overall I like the article. I like how Flam quoted people with varying perspectives; I think it’s important for people to see statistics as a pluralistic field with different tools for solving different problems.

But I do think the details matter (and I certainly don’t want people to think I said things I didn’t say, or that I did things I didn’t do) so I hope the corrections can be made soon. And I stand by the larger point that lots of bad stuff happens when people think that “statistically significant” + “vague theory” = truth. I can’t say that I’m *surprised* that Kristina Durante, the author of the fecundity-and-voting study, stands by those claims, but I think it’s too bad. The point is not that there’s anything horrible about Durante (a person whom I’ve never met), nor do I know of anything horrible about Daryl Bem, etc., but that there is widespread confusion about how to do statistics, especially when studying small effects in the presence of large measurement errors (that’s one of the things I discuss in my above-cited article), and I’m glad to get these concerns out there, as precisely as is possible within the format of a newspaper article.

In any case, this’ll be an excellent example for my statistical communication class!

**P.S.** I also just noticed this bit from the article:

The essence of the frequentist technique is to apply probability to data. If you suspect your friend has a weighted coin, for example, and you observe that it came up heads nine times out of 10, a frequentist would calculate the probability of getting such a result with an unweighted coin. The answer (about 1 percent) is not a direct measure of the probability that the coin is weighted; it’s a measure of how improbable the nine-in-10 result is — a piece of information that can be useful in investigating your suspicion.

By contrast, Bayesian calculations go straight for the probability of the hypothesis, factoring in not just the data from the coin-toss experiment but any other relevant information — including whether you’ve previously seen your friend use a weighted coin.

No!!!!!!!!!!!!!! Weighting a coin does not (appreciably) affect the probability that a coin lands heads. You can load a die but you can’t bias a coin. Yes, with practice you can *throw* a coin (weighted or otherwise) to generally land heads or tails, but, no, there is no such thing as a weighted coin which has an appreciably greater than 50% chance of generally landing heads. No big deal but this is one of my pet peeves. Also, beyond the flaws in this particular example, I don’t think it’s a good representation of science, in that the point to me is not to distinguish fair from unfair coins (equivalently, to distinguish randomness from non-randomness) but rather to understand the many real patterns in the world, which are not purely random but can be buried in noise if we’re not careful, hence motivating noise-reduction efforts such as this, with Sharad Goel, David Rothschild, and Doug Rivers. (And my point there was not to promote that work but to illustrate my general point with an example.)
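As an aside, the “about 1 percent” in the quoted passage is just a one-sided binomial tail probability, easy to check:

```python
from math import comb

# P(9 or more heads in 10 flips of a fair coin)
p = sum(comb(10, k) for k in (9, 10)) / 2**10
print(p)  # 11/1024 ≈ 0.0107
```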

What are some examples of problems considered impossible in 1995 that are solvable today just because science embraced Bayesianism?

Rahul:

In my opinion, anything you can do with Bayesian inference you can do in other ways. To me, Bayesian inference is a bit like calculus: You can do derivatives and integrals without calculus (indeed, mathematicians in pre-Newtonian times were able to compute limits, with care), but calculus makes it a lot easier. Similarly, I find that Bayesian inference makes it a lot easier to combine information. For example, I’m sure that someone could do MRP non-Bayesianly—and indeed there is a non-Bayesian tradition of partial pooling for small-area estimation in sample surveys—but I think it’s no coincidence that the widespread use of MRP has come along with the Bayesian approach. If you look at my applied research papers, you’ll see a lot of analyses that maybe could’ve been done in non-Bayesian ways but in fact which my colleagues and I did Bayesianly, and which I suspect would never have been solved had we not had Bayesian tools.
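To illustrate the partial-pooling idea mentioned here (a toy sketch with made-up numbers and a crude plug-in for the group-level scale, not MRP itself):

```python
import numpy as np

# Each small area j has a noisy estimate y[j] with standard error se[j];
# areas are modeled as draws from N(mu, tau^2), so each estimate is
# shrunk toward the overall mean in proportion to its noisiness.
y = np.array([0.30, -0.10, 0.55, 0.05, -0.25])  # raw small-area estimates
se = np.array([0.20, 0.10, 0.30, 0.15, 0.25])   # their standard errors
mu, tau = y.mean(), 0.15                        # crude plug-in hyperparameters

w = tau**2 / (tau**2 + se**2)                   # precision weights in [0, 1]
pooled = w * y + (1 - w) * mu                   # partially pooled estimates
for yj, pj in zip(y, pooled):
    print(f"raw {yj:+.2f} -> pooled {pj:+.2f}")
```

The noisiest estimates get pulled hardest toward the overall mean; a full Bayesian treatment would also estimate mu and tau from the data rather than plugging them in.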

There are also a lot of non-Bayesian success stories in statistics, but that’s fine, I don’t think that news article claims otherwise.

Bayesian inference is many things. It’s a set of tools for solving problems, also a framework for understanding statistical methods. Other statistical approaches similarly serve this dual duty, for example classical hypothesis testing is a set of methods and also a framework in which statistical inference is viewed as a set of testing problems. I don’t find that particular framework very helpful—indeed, I think it often gets in the way—but I do recognize that there are many problems for which methods developed in that tradition can be useful. Recall my recent discussion of lasso.

Excellent summary I think. I agree.

I just hate it that newspaper stories have to exaggerate so much. They could’ve done without the “impossible problems made possible now” bit, I think.

> “anything you can do with Bayesian inference you can do in other ways”

I think Bayesian methods have come into their own as computing power has increased. There are modern approaches that are computationally tractable only with Bayesian methods. E.g. Bayesian Markov Chain Monte Carlo for parameter estimation?
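For readers who haven’t seen this, a bare-bones random-walk Metropolis sampler fits in a few lines (a toy normal-mean model with a flat prior; the data and tuning constants are invented, and real applications would use specialized software):

```python
import math
import random

random.seed(1)

data = [1.2, 0.7, 1.9, 1.1, 0.4, 1.6]  # invented observations, sd assumed 1

def log_post(mu):
    # log posterior of a normal mean with known sd = 1 and a flat prior
    return -0.5 * sum((x - mu) ** 2 for x in data)

mu, samples = 0.0, []
for _ in range(20000):
    prop = mu + random.gauss(0, 0.5)    # symmetric random-walk proposal
    if math.log(random.random()) < log_post(prop) - log_post(mu):
        mu = prop                       # accept; otherwise keep current mu
    samples.append(mu)

kept = samples[5000:]                   # discard burn-in
post_mean = sum(kept) / len(kept)
print(round(post_mean, 2))              # should be near the data mean, 1.15
```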

Another point I disagree with is this one:

I think the crap is all spread out. I’d love to see evidence that top-journals are more wrong than others.

Rahul:

Please read carefully! In the post above I clearly state that I did *not* actually say that. My suggested replacement was, “This could well be an even bigger problem with prominent journals.” Weasel words, sure, but that’s cos I honestly have no idea. But I do think it *could* be a bigger problem, especially if “problem” is defined to include impact of the errors.

I was criticizing the article. Not you.

I’m a great fan of Andrew’s articles and books, and blog, and I try to understand his recommendations from the ground up (and I often bug him with questions, which he gracefully answers—thanks, Andrew!).

But I have a strong feeling that Andrew’s recommendations for psychology and related areas come from a non-practitioner’s perspective. Editors routinely reject our papers *because* we have replications of key results in the paper (“replications add nothing new”); if we mark a speculative and tentative conclusion as such, the paper is rejected because the result is not convincing. Top journals routinely publish low-power null results as if they are positive findings. Post-hoc explanations are dressed up as predictions made before the experiment was even run.

If one were to really implement Andrew’s ideas, there would be no publication possible because the gatekeepers are not on board. From the perspective of a user of statistics in my research, I’m pretty frustrated that I don’t see any point in following Andrew’s advice, unless I want to self-publish my work on my home page. I could do that; but not my students.

I just wanted to point out how ineffective Andrew’s advice is in changing real practice. I guess if enough people adopted saner statistical practice things would change. But right now I see Andrew’s advice as good to know and good for real understanding, but practically not useful. I’m happy to be corrected; maybe there is real change happening and I just don’t see it; I certainly experience the absence of change in my daily drudgery of revise-and-resubmit actions.

This is very sad and is what the NY Times should be explaining to the lay public.

> real change happening and I just don’t see it

Speculated a bit to Fernando about this last week.

> if they plan to make a living as academics

Life is not fair, but it may get fairer in research as preregistration, reproducibility and replication become more prevalent than enhancing one’s reputation with work for which the quality cannot be assessed. (As Don Rubin once said, “smart people do not like being repeatedly wrong,” and some smart people with resource control are starting to realise that believing much of the published research will do exactly that.)

But it will likely be the high (and difficult) road for some time.

Also why I worry about people reading too much into the successes of many who _might_ have taken the low road.

http://statmodeling.stat.columbia.edu/2014/09/24/study-published-2011-followed-successful-replication-2003-cool/#comment-191732

It sounds to me like your beef is with the reviewers in your field(s), not Andrew’s applied statistics recommendations. Ironically, of course, you’re a member of the group you’re complaining about. Do you do the same thing in reviews? Could you edit a special volume of a journal? Or become an area editor and invite some specific papers to address the shortcomings? It’s not even that hard to start a whole new journal — you could talk to Michael Collins and crew about their experience starting TACL (largely started, I believe, to deal with the incredibly slow and picky reviewing in the pre-existing Computational Linguistics journal).

It’s a chicken-and-egg problem. Once papers exist to cite and once journals see them being cited, they’ll want more such papers. And it’s easier to convince the grad students than the tenured faculty. The upside is that they’re the future reviewers whereas the current reviewers are future retirees.

The field of psychology is changing, as witnessed by the success of Kruschke’s book, the number of tutorials on Bayesian models I’ve seen at psych conferences (mainly psycholinguistics — I have a biased selection), and the new book by Lee and Wagenmakers. I don’t remember seeing anything like this 30 years ago in my first romp through stats for social science.

Ironically, academia is very conservative and very slowly paced compared to what I expected going in. Getting tenure is supposed to give professors all the freedom in the world, but the whole enterprise winds up reinforcing very narrow and traditional research. Professors go along with it for tenure, promotion, and grant funding. Students are forced to go along with it under the reasoning that they need to get publications in order to get jobs. I think it may have something to do with the reward structure encouraging people to concentrate on narrow research areas and the age bias toward older professors on editorial boards.

Certainly, my beef is not with Andrew’s recommendations, which make sense to me. And no, of course I don’t do the same thing in reviews. I understand the process you describe for instituting change. But that is a major project compared to the much more immediate goal I was talking about: simply implementing Andrew’s recommendations. I guess what you are saying is that one needs to take bigger steps to get to that point. Maybe.

Yes, it would be a major project — but I think it would be worth the effort.

Speaking for myself, I would be enthusiastic about contributing to something like this. And I am pretty sure there are others.

I guess this is hinging on a technicality, but a quick googling of weighted coins brings up this: https://izbicki.me/blog/how-to-create-an-unfair-coin-and-prove-it-with-math.html

Christoph:

Yes, bending. But not weighting, which is what was stated in the news article.

True. I think I was actually reading the “weighted” more loosely as “something funky is going on with this coin” instead of literal weight changes. Weightedness as a metaphor, so to speak.

You also have to bend the coin in a very obvious way, according to the article, to get a coin where the author was quite confident that it is unfair. His power is not very big, though.

If you suspect your friend has a biased coin… well, you ought to just insist on the von Neumann protocol. :)
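For anyone who hasn’t seen it, the von Neumann protocol turns any biased (but independent) coin into a fair one, and is easy to sketch:

```python
import random

def fair_bit(flip):
    # von Neumann's trick: flip twice; HT -> heads, TH -> tails,
    # HH/TT -> discard and flip again. P(HT) = P(TH) for any bias.
    while True:
        a, b = flip(), flip()
        if a != b:
            return a

random.seed(0)
biased = lambda: random.random() < 0.8          # lands heads 80% of the time
bits = [fair_bit(biased) for _ in range(10000)]
print(sum(bits) / len(bits))                    # close to 0.5 despite the bias
```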

What if it’s a very thick coin?

More seriously, there’s more than one definition of weighted.

https://www.google.com/search?ie=UTF-8&source=android-browser&q=weighted

Very interesting. I’ve seen a number of remarks of the kind “but with Bayesian statistics it all became better” in articles targeted at a general audience, and I’m delighted to see that the Bayesian himself here actually tries to make the wording more worthy of a scientist and less of a salesman. (Whether this is compatible with Faye Flam’s intentions we will see.)

+1

Bayesian is a great technique but just not a cure-all (some pieces make it sound like that). E.g., in applications like spell check, or spam filters, or expert systems, or handwriting / image recognition, or voice recognition, fault diagnosis, etc., Bayesian is perfectly awesome.

In certain other areas I’m not so sold on it.

I’m glad for poor Faye who had to struggle to get this published, and kick out whatever was not deemed P.C., but I don’t see much clarity on the issues here. I find it very odd that Gelman is presented as holding a view that is “the opposite” of frequentism, while he writes a nice article about the relevance, in criticizing the case at hand, of outcomes not observed and paths not taken, merely because they might have been taken. This is the very definition of an error statistician and is wildly at odds with what other Bayesians claim.

We are roundly criticized daily for considering such “could have beens” in reasoning from what has occurred. If Gelman wanted to have a pop article like this have a real impact, he’d focus on the need to critique the reliability of the methodology, taking into account outcomes and paths that might have been followed, thereby allowing leeway that gives grounds to question the actual conclusion. Merely calling what he advocates Bayesian just skirts what has always been the key issue: the need for a method to live up to controlling error probabilities. The mere fact that there’s background knowledge to make you suspect a result (e.g., political allegiances don’t change so fast or whatever) fails to constitute the actual grounds for criticizing the methodology. Why not just say what the real criticism is?

If you use probability distributions which aren’t frequencies then you’re Bayesian. No ifs, ands, or buts about it.

If you use probability distributions which aren’t frequencies to estimate the speed of light, you can compare your estimate to the actual speed if known, thereby checking the model assumptions.

If you use probability distributions which aren’t frequencies to estimate frequencies, you can compare the estimated frequencies to actual frequencies if known, thereby checking the model.

DOING SO DOESN’T MAKE YOU A FREQUENTIST. YOU’RE STILL USING PROBABILITY DISTRIBUTIONS WHICH AREN’T INTERPRETABLE AS FREQUENCIES.

And another thing. Within a given model the likelihood principle holds and updating is done through Bayes theorem. If the model is tentative, or based on assumptions, then Bayesians are free to change the model whenever they want after reexamining those assumptions. In particular they can change it because the model makes poor predictions.

Doing so doesn’t mean they’re secretly frequentists. They’re still using probability distributions which aren’t frequencies. When they go to check the accuracy of the model they still compute the model’s implications using Bayes theorem and all the rest.

The idea that you can make assumptions, work out their implications, and see if those implications are true isn’t owned by frequentists in any sense whatsoever. People were doing this long before frequentism came along. Bayesians can do this just like everyone else. Bayesians were doing this from day 1 (i.e., Laplace), just like everyone else.

And if some of those assumptions involve distributions which represent uncertainties rather than frequencies, and if they use things like Bayes theorem to get those “implications” in order to compare them to reality, they aren’t secretly frequentists or relying in any way on any part of frequentist ideology.
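The checking move described in this thread can be made concrete with a toy posterior predictive check (invented data; a uniform Beta(1,1) prior on a success probability):

```python
import random

random.seed(2)

# Observed: 3 successes in 20 trials. With a Beta(1,1) prior the posterior
# is Beta(4, 18). Simulate replicated datasets from the posterior and see
# where the observed count falls among them.
n, successes = 20, 3
sims, at_least_as_big = 5000, 0
for _ in range(sims):
    theta = random.betavariate(1 + successes, 1 + n - successes)
    rep = sum(random.random() < theta for _ in range(n))
    if rep >= successes:
        at_least_as_big += 1

ppp = at_least_as_big / sims   # posterior predictive p-value
print(round(ppp, 2))           # far from 0 and 1 here: no sign of misfit
```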

Most of Gelman’s critique of forking paths etc is critique of people using frequentist ideas and p values to make claims. His point is that the claimed error frequency isn’t anything like 0.05 etc because of the post-data post-hoc nature of the testing.

I’m pretty sure his recommendation to consulting clients etc wouldn’t be to do different frequentist p value based testing, but rather to build a model with some causal content, and use bayesian methods to fit it.


Interesting article about the swing voter. How do you explain the YouGov poll just days before the Scottish independence indicating a pro-independence vote, when the actual results were quite decisively against independence? Since there was extreme interest, it seems unlikely it was low response by slumping supporters? Is it possible that the polling techniques (internet based, primarily) of YouGov simply were inaccurate (these techniques have in the past produced real lemons, though maybe they are better now–I haven’t kept track)? If so, since the study on the mythical swing voter uses these techniques (Xbox sampling, actually, though one wonders how representative this is of the population as a whole), could this cast doubt on the conclusions of this study you cite? I see you cite Pew data as supporting your decline hypothesis, but Pew almost certainly uses self-identified party id, so a drop from 55% to 48% in partisan identification is not confirmation of non-response but rather a reflection that partisan identification is endogenous to some extent with vote choice. If you really wished to support your hypothesis of “slumping”, you would look at states that do report partisan identification of the voter registration rolls and then see if the decrease in response was associated with these a priori indicators of partisan preference, not endogenously affected self-report. And I don’t see where you describe anywhere changes in partisan identification over the Xbox panel time period, and how this compares with ANES panel changes (Figure 5 references final PID share, but nowhere do I see the dynamics–I will say, like most readers, I just went through the figures (aside from my comments on Pew, where I read the discussion)). Would these changes in self-reported PID be correlated with vote change, and would this not provide an alternative explanation for your results?

Also, I see no conflict of interest disclaimers in this study. As Polimetrix was acquired by YouGov, Rivers in particular probably still owns a chunk of stock in YouGov (in fact, a cynical comment on this paper is that it is an attempt to drum up business for internet sampling). Every reputable scientific journal requires this, and if political science wants to be more scientific than political, this should be included as a matter of course in any article (whether required by the

Numeric:

Lots of comments you have here, let me respond very quickly:

1. I don’t know anything about the polls or the election in Scotland. This is not to dismiss your remarks, just to say that I can’t really address them, one way or another.

2. You write, “Xbox sampling, actually, though one wonders how representative this is of the population as a whole.” The Xbox sample is indeed *not* representative of the population; we discuss this in our article.

3. Regarding the possibility that there are major swings in party ID during a one-week period during the campaign: For many reasons I don’t think this is plausible.

4. You suggest a state-level analysis. That could be a good idea. Go for it.

5. In the Xbox panel, party ID was asked only once, when the participant joined the survey. It was not asked repeatedly.

6. I don’t see the conflict of interest of which you speak. The survey was done using Xbox, not YouGov. I guess there’s a conflict of interest in that one of the authors works at Microsoft and we used Microsoft data, but this seems pretty clear; I don’t really see this as any worse than a Columbia University researcher writing a paper using Columbia University data, etc.

7. You can be as cynical as you want. I think you could probably find better targets for your cynicism than Doug and me, but that’s your call.

> Regarding the possibility that there are major swings in party ID during a one-week period during the campaign: For many reasons I don’t think this is plausible.

Major is an incorrect word and not my argument. Your argument is that short-term changes in candidate support following an “event” (a convention, a debate) are caused by some sort of anomie leading to a “slump” (decline in response rate). My argument is that self-reported partisan identification is endogenous to vote choice to some extent. In particular, the Pew self-report figure going from 55 to 48% is probably as much a result of changes in self-report PID as it is of slumping. This is not major (given the small sample size on surveys such as Pew, a 7% difference for one party is probably just within the bounds of confidence, a concept you don’t like). But it may be systematic, and I believe this endogeneity is what is behind your results.

In particular, this is an empirically testable hypothesis, as you can cross-reference back to the registered voter file for states that report partisan affiliation. You should be able to do this for a number of states from your X-box data, since you must have the names and addresses of these individuals or could relatively easily get them (note this is not doing a state-wide analysis, as you claim—this is substituting an exogenous variable for an endogenous variable—an obvious analogue is instrumental variables). Then it is trivial to match these back to the registered voter file and get the partisan identification at time of registration.

I think the overall point is that there is an obvious counter hypothesis to the “slump” hypothesis, one that is testable, and one that is not mentioned in your paper. I doubt if any pollsters will take it seriously otherwise (if the point is another academic trope, well, be my guest). As far as cynicism goes, though, I’m still waiting for the blog entry on the statistical methods in “Heterogeneity in Models of Electoral Choice.” As another somewhat cynical observation, your targets for statistical ire appear to rarely hit your political science colleagues.

Numeric:

I appreciate your taking the time to comment. It’s true that my immediate reaction to these sorts of comments is irritation, but that’s really my problem, not yours. I recognize that not all of my research is convincing to everyone—this is social science, after all—and, given that you do have these issues with our project, I think it’s good for myself and others to see your objections.

I don’t have time to respond to every comment but I will respond to your points.

1. I consider a swing of 5% of party ID during a one-week period to be a major swing, and for various reasons I do not consider it plausible. My colleagues and I have discussed this issue in our 1993 and 2001 papers that addressed party ID and voting. In the Xbox study it is not an issue at all since we have panel data and the party ID question is asked only once. So even if party ID were changing by these large amounts, it would not be an explanation for what we saw in our data.

2. We will do our best work and then worry about whether any pollsters will take it seriously. I’ve talked to various people involved in politics and polling and they do take our findings seriously. Adoption of any new method takes time and won’t be universal; indeed, it was 13 years ago that my colleagues and I published our paper on poststratification by party ID and this is still not standard practice. It takes awhile. But I do think we are making progress and, as we continue to improve the methods and find interesting empirical results, I think pollsters and political professionals will move in this direction.

A lot depends on the goals of a polling organization. In the short term, a poll gets headlines by producing more fluctuations: noise = news = publicity. Longer-term, though, I think accuracy has to be the way to go.

3. I actually have no idea what you’re talking about regarding heterogeneity in models of electoral choice. But if you think this is an important topic, you should feel free to write such an article yourself. There’s no need to wait on me to do it.

4. I do criticize political science claims when they come to my attention and when they bother me. Recently here and on the sister blog I criticized the party-id-and-smell paper and I criticized what I considered to be exaggerated claims regarding political effects of subliminal stimuli.

If you’d like to see criticisms of work by my own colleagues, again, you should feel free to write and publish such criticisms yourself. I’m not planning any time soon to write withering criticisms of the work of Jeff Lax, Justin Phillips, Bob Erikson, Bob Shapiro, etc etc., for the simple reason that I think their work is good!

If you recall the 2000 election, the polls varied dramatically over the course of the campaign (see http://en.wikipedia.org/wiki/Historical_polling_for_U.S._Presidential_elections#United_States_presidential_election.2C_2000). Yet there was no indication of large non-response problems by party from any of the polling organizations (maybe there was and I didn’t see it/it wasn’t reported, but my impression is there wasn’t and I didn’t see any of this in the polling I was doing). This is why I think your findings are probably not true (slump versus changing responses). Also, when you state that a 5% difference is “major”, you are ignoring that the Pew estimates are random and that the 55 to 47 percent difference in two successive samples may very well be within expected sampling error bounds, depending upon the Pew n (which I don’t know). Isn’t ignoring randomness in estimation a type M error?

Here’s the problem with “Heterogeneity in Models of Electoral Choice”. Rivers tests two competing regressions (discrete choice) of vote choice (MNL and COLOGIT) and wishes to make inferences on the difference of weights of ideology and party between the two models.

He estimates both models and calculates weights and standard errors for each model.

Here is Table 2:

Table 2 (standard errors in parentheses)

             MNL              COLOGIT
Ideology     -0.214 (0.051)   -0.112 (0.774)
Party        -0.313 (0.085)   -0.730 (2.995)

Here is his comparison:

“Table 2 compares the standard multinomial logit estimates of (19) which impose the homogeneity assumption of equal party and ideology weights for each voter with the average COLOGIT estimates. The discrepancies between the MNL and average COLOGIT estimates are striking. The average party weight estimated by the COLOGIT procedure is more than twice as large as the MNL estimate, while the average ideology weight is 50 percent less than the MNL estimate. In neither case is the average COLOGIT estimate within two standard errors of the MNL estimates.”

This sheds an interesting light on how a leading political methodologist performs statistical inference! In actual fact, of course, one cannot treat the COLOGIT estimate as fixed and use the standard error of the MNL estimator to establish statistical difference. Rather, both are random and what one needs to do is calculate the difference of the coefficients over the standard deviation of that difference. Letting the estimate of party be a under MNL and b under COLOGIT, the ratio (a – b)/root(var(a) + var(b) – 2 cov(a,b)) is what should be calculated. This can be done by the method of Cox (1961), but we can put some bounds on this ratio by the following. A minimization of the denominator is obtained by noting that cov(a,b) = sd(a)sd(b) when the correlation of a and b is set to one, so the denominator becomes root(sd(a)^2 + sd(b)^2 – 2 sd(a)sd(b)), which is |sd(a) – sd(b)|. Hence the ratio of the difference of the party weights to the standard deviation cannot exceed (-0.313 – (-0.730))/|2.995 – 0.085| = .14 (approximately). So Rivers’s Table 2 actually presents statistical evidence against his hypothesis of unequal weights.
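The arithmetic at the end of this comment checks out; in code (numbers copied from Table 2’s party row):

```python
a, sd_a = -0.313, 0.085   # MNL party estimate and standard error
b, sd_b = -0.730, 2.995   # COLOGIT party estimate and standard error

# Setting corr(a, b) = 1 minimizes sd(a - b) to |sd(a) - sd(b)|,
# which gives an upper bound on the standardized difference.
max_ratio = abs(a - b) / abs(sd_a - sd_b)
print(round(max_ratio, 2))  # 0.14
```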

1. When I say a 5% swing in party ID is large, I mean a 5% actual swing in the population. A 5% swing in the sample is no big deal as it can be caused by a combination of actual swing, nonresponse swing, and sampling variability.

2. I was not familiar with that particular paper by Doug. I have to admit I’m not knowledgeable about much of the political science literature.

Regarding the 5% swing, of course an actual swing of that magnitude is major, but how are you going to determine it? By sampling, of course, and my comments were based on the Pew change in PID mentioned in your paper.

Regarding Rivers and academic political science research more generally, the main problem from a practitioner’s point of view is that it can’t be trusted, and that is clear in both the heterogeneity paper and the swing-voter-myth paper (I won’t believe non-response is the driving force behind most poll changes until results like the 2000 polling can be explained by it). No one fakes the data (at least as far as I can tell), but the analysis is typically wrong or incomplete, and in a manner that benefits the author. This is apparently necessary to get the paper published. Would the heterogeneity paper have been published if a statistically valid analysis had been done on Table 2? And for the swing paper (I don’t know whether it has been published), would a title such as “Evidence of non-response biased by partisan affiliation after seminal campaign events” get any attention, or be published? I think not, yet that is how I read the results in your paper. The phenomenon I am describing is analogous to the .05-significance problem, and it hurts the science.

As regards the applicability of your paper to pollsters: in states where you can obtain a registered voter list, a very simple way to keep response unbiased across partisan and demographic categories is to sort the registered voter list by various criteria (partisan ID, sex, age, ethnicity through surname matching) and then create clusters of, say, 50 names for however many respondents (400, 800, or whatever n) you want to contact. This gives a rough equivalence to the actual composition of the district or state; you start with the first name in each cluster and call down the list until you get a respondent. There might be increased non-response in some of these clusters after a campaign event (I’ve never looked), but even if you have to call 4 names in a Democratic cluster as opposed to 3 in a Republican cluster (say, after a Republican convention), it doesn’t matter, unless you want to claim there is a vote-choice bias among the Democrats who do respond as opposed to those who have “slumped” (I’ve never seen this either). Anyway, this is why non-response isn’t typically a problem in states where you can get registered voter lists with party identification (obviously, you have to pay to get the phone numbers matched, since they typically won’t be on the registered voter file).

I suppose my final comment is a plea that you adopt a more stringent scientific approach to political science research, so that it is more science than politics. I know it is difficult to get published in poli-sci journals and your reference group is your fellow academic political scientists, but you are better than them: you have rigorous scientific training in statistics that they don’t. I suppose it is a public-goods problem: it would hurt you but help the field. You’re senior enough that the cost to you is bearable. Think of your legacy.

Did not Flam misidentify odds? She said:

“A Bayesian calculation would start with one-third odds that any given door hides the car, then update that knowledge with the new data: Door No. 2 had a goat. The odds that the contestant guessed right — that the car is behind No. 1 — remain one in three. Thus, the odds that she guessed wrong are two in three”

She is conflating the probability of 1/3 with odds. Odds are p/(1 – p), or (1/3)/(2/3) here, so the odds are 1 to 2.
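The conversion the commenter has in mind is a one-liner; a minimal illustration, using exact fractions to avoid floating-point noise:

```python
from fractions import Fraction

def to_odds(p):
    """Convert a probability p to odds p / (1 - p)."""
    return p / (1 - p)

print(to_odds(Fraction(1, 3)))  # 1/2, i.e. odds of 1 to 2, not "one in three"
```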

Hcg:

Sure, but the words “odds,” “likelihood,” “chance,” and “probability” are commonly used interchangeably in colloquial writing so this doesn’t really bug me.

Andrew,

I think you should also let the author know that the Monty Hall problem has nothing to do with Bayesians Vs. Frequentism – it isn’t even a statistics question at all!

