Do celebrated scientists have the free will to admit they are wrong? Not always!

Josh Miller points us to this post by Kevin Mitchell, who writes:

Gotta hand it to Sapolsky here . . . it’s quite ballsy to uber-confidently assert we do not have “the slightest scrap of agency” and then support that with one discredited social psych study after another…

I don’t understand Mitchell’s confusion at all here. It seems pretty clear. Robert Sapolsky is a Ted-talking science hero who’s accustomed to getting adoring press coverage. You don’t have to be a David Brooks-level sociologist to understand that these people do not have the slightest scrap of agency when it comes to backing down on any of their claims.

In the above post, Mitchell was referring to a podcast where Sapolsky is interviewed by credulous physicist Sean Carroll.

P.S. I’ve written about agency before! See here, here, and here. Also, like everyone else, I enjoy Dix pour cent.

More discussion of differential privacy at the Census

This is Jessica. As I’ve blogged about before, the U.S. Census Bureau’s adoption of differential privacy (DP) for the 2020 Census has sparked debates among demographers, computer scientists, redistrictors, and other users of census data. In a nutshell, mechanisms that satisfy DP provide a stability guarantee on the output of a function computed on data over changes in the input. This is typically achieved by adding a calibrated amount of noise to computed statistics, controlled by one or more privacy budget parameters. DP replaces more ad-hoc prior approaches to disclosure avoidance that the Census has used to preserve anonymity in releasing statistics, like swapping selected records for households so that households that are more unique in their block (e.g., in terms of racial or ethnic makeup) are less likely to be identifiable from doing inference on released statistics. 
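To make the "calibrated amount of noise" idea concrete, here is a minimal sketch of the textbook Laplace mechanism applied to a single count. It is an illustration only, not the Census Bureau's actual disclosure avoidance system, which is far more elaborate. The function names, the sensitivity of 1 (one person changes a count by at most one), and the block population of 37 are all assumptions made for the example.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws, each with
    mean `scale`, is Laplace-distributed with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count plus Laplace noise with scale sensitivity / epsilon.

    A smaller epsilon (a tighter privacy budget) means a larger noise scale.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
block_population = 37  # hypothetical block count, not real census data
for epsilon in (0.1, 1.0, 10.0):
    draws = [noisy_count(block_population, epsilon, rng) for _ in range(5)]
    print(epsilon, [round(d, 1) for d in draws])
```

Running this for several values of epsilon shows the basic trade-off: the released values scatter widely around 37 when epsilon is small and sit close to it when epsilon is large.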

A couple weeks ago the Center for Discrete Mathematics and Theoretical Computer Science held a workshop on the Analysis of Census Noisy Measurement Files and Differential Privacy, organized by Cynthia Dwork, Ruobin Gong, Weijie Su, and Linjun Zhang, where computer scientists, demographers and others paying attention to the new disclosure avoidance system (DAS) discussed some of the implications. I wasn’t able to make it, but Priyanka Nanayakkara filled me in and it sounded like some of the dimensions of the new DAS that we’ve considered on the blog came up, so I asked her to summarize.  

I (Priyanka) will offer a brief summary of some notable themes from the first day of the workshop, with the caveat that this is certainly not a comprehensive account of everything discussed.

Perceptions and demography use cases involving census data often omit or downplay uncertainty measures—debates about DP are surfacing and calling this into question. The topic of uncertainty quantification came up early. Data users such as demographers have been concerned that DP noise makes data too inaccurate for their work, particularly because census data are critical to methods for making population estimates. Computer scientists and statisticians have been suggesting uncertainty measures as a solution, in that quantifying uncertainty would help contextualize error owing to the DP-based disclosure avoidance system (DAS) relative to other forms of error already affecting the data (e.g., non-response error). Demographers argued, however, that population estimates are not a form of inferential statistics—a comment which was met with confusion from some members of the audience. As far as I could tell, they were drawing a distinction between statistical inference and evaluating estimates based on the decennial census, which is considered to be ground truth. The issue seems to be that demographic estimates are considered primarily as point values, and that available methods for producing uncertainty measures are either not applicable, considered unwieldy, or not used. Jessica’s previous posts have pointed out how much of the pushback to DP seems related to “conventional certitude,” in which census population figures are treated as ground truth point estimates, with negligible error; this theme came up at the workshop as well, with computer scientists and statisticians seeming to suggest that DP offers the chance to challenge this norm and normalize uncertainty quantification. In other words, census data are a “statistical imaginary,” or example of “conventional certitude,” treated as perfect. 
(For an in-depth characterization of this phenomenon, see boyd and Sarathy’s paper—it does an excellent job of pinpointing the epistemological rupture brought on by the Census Bureau’s adoption of DP.) 

Developing methods for effectively using noisy data is difficult when use cases are not predetermined. As per the workshop’s imperative, computer scientists and statisticians seemed eager for concrete research tasks to make using noisy census data more viable, yet the resulting discussion showed similar challenges to those occurring between the Census Bureau and stakeholders more broadly around trying to elicit critical use cases for census data. One demographer emphasized the variability in their work, describing the wide range of questions they receive and must attempt to answer. As an example, they cited the following question which they’d previously received: How many 18-year-olds are in the Upper Peninsula of Michigan? Although they immediately undercut the question by noting that it came from a militant group engaged in strategic planning for defending against a possible Canadian invasion of the Upper Peninsula, the point seemed to be that sometimes demographers are put in the position of answering bizarre queries and this makes it difficult to precisely predict the use cases for which they’ll need methods for adapting noisy counts.

The relationship between privacy and legitimacy is complicated. Separate from the Census Bureau’s Title 13 mandate to maintain confidentiality, what is the role of privacy in the Census? At least one privacy expert noted that a severe, bad privacy event (think most of the population being re-identified from census statistics and individual-level records published)—which they termed a “privacy Chernobyl”—could reduce trust in the Census Bureau, and lead to lower rates of participation in future censuses. The point of DP is to help prevent outcomes like this. But what if this isn’t the type of privacy threat people are actually concerned about? Participants debated whether people are actually concerned about the confidentiality of their responses, citing various reports with survey results on how much of the population (and which parts of the population) cites confidentiality of census responses as a concern. Some pointed out that historically, people have been more concerned about the Census Bureau misusing or inappropriately sharing confidential records as opposed to an outside party performing an attack on published census statistics. Clearly, there are merits in both privacy concerns, though they approach the matter from different angles—namely, who is the “attacker” of concern? And in light of an answer to that question, where does DP fit in?

Courts may weigh the Census Bureau’s mandates differently and interpret DP noise as conceptually distinct from other forms of error. A primary use of census data is for redistricting and upholding the Voting Rights Act. While acknowledging the existing published analyses on the extent to which DP will impact redistricting, there was consensus that it would be good to further investigate this, given the complexity and immensity of redistricting. In these discussions, workshop participants also discussed how courts might interpret DP. One legal expert noted that the enumeration of the population is constitutionally mandated, whereas the Census Bureau’s mandate to keep responses confidential does not appear in the Constitution. From my understanding, it seems that courts may weigh these two mandates differently considering the importance courts place on the Constitution generally. This point complicates the trade-off between privacy and accuracy as it implies that perhaps from a legal standpoint, accuracy is more important.

Second, workshop discussions noted that while we can all acknowledge several sources of error in census data, courts may consider DP noise to be of a different “flavor,” since there is something conceptually different about intentionally injected noise compared to other sources of error that are not intentional or widely known (e.g., error introduced from previous disclosure avoidance methods like swapping). There may also be a tendency to treat DP noise as different since it is added to census block counts, which the previous DAS held invariant. Relatedly, Jessica and I are currently working with Abie Flaxman and a law colleague to try and contextualize what exactly is different with DP (spoiler: a lot less than some of the pushback has suggested) for a law audience.

The trade-off is not just between privacy and accuracy; there are other dimensions, too. When it comes to DP, at least for the Census, the trade-off is not solely between privacy and accuracy. One presenter suggested that there is a third dimension related to legitimacy and trust in census data that the Census Bureau is taking into account when considering the trade-off. Noisy counts showing nonsensical values (e.g., negative population counts) harm this third dimension, perhaps explaining why the Census Bureau did not settle on an “optimal” balance between privacy and accuracy. One participant who works closely with residents noted how disastrous it would be to show people illogical census counts, since people would be alarmed at seeing what they would perceive as low-quality data given the amount of taxes that go into producing high-quality censuses. Computer scientists and statisticians have suggested, and continued to suggest at the workshop, that accounting for DP noise in analyses would be much easier if the Census Bureau released statistics without post-processing (the process by which the Census Bureau converted DP-noised data into “logical” [e.g., non-negative] values for publication). My takeaway here is that reasoning about trade-offs around disclosure avoidance and the Census requires including the role of human factors around perception of census data, and that these factors will be crucial to account for in future uses of DP, especially for the Census.
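One way to see why analysts ask for the counts without post-processing is a small simulation. The sketch below adds zero-mean Laplace noise to a small true count and then clamps negative results to zero. Clamping is only a crude stand-in for the Bureau's actual post-processing, which is more involved and is not reproduced here; the true count of 2, the noise scale of 5, and the function names are assumptions chosen for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) draw as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def simulate(true_count, scale, n_draws, rng):
    """Return (mean of raw noisy counts, mean after clamping at zero,
    share of raw noisy counts that were negative)."""
    raw = [true_count + laplace_noise(scale, rng) for _ in range(n_draws)]
    clamped = [max(0.0, x) for x in raw]  # the "make it logical" step
    share_negative = sum(x < 0 for x in raw) / n_draws
    return sum(raw) / n_draws, sum(clamped) / n_draws, share_negative


rng = random.Random(1)
raw_mean, clamped_mean, share_negative = simulate(
    true_count=2, scale=5.0, n_draws=100_000, rng=rng
)
print("mean of raw noisy counts:   ", round(raw_mean, 2))
print("mean after clamping at zero:", round(clamped_mean, 2))
print("share of negative raw counts:", round(share_negative, 3))
```

In this toy setup the raw noisy counts average out to the true value even though many are negative, while the clamped counts look sensible one at a time but are biased upward on average. That is the tension in miniature: the published values are easier to accept at face value but harder to model.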

As I’ve said before, I (Jessica) like the idea of noisy counts becoming normalized. It would be nice if those pushing the argument that releasing noisy counts would lead to chaos could provide more concrete examples of what that might look like; will courts halt completely when it comes to Voting Rights Act violations? It’s not clear to me how much we would have to explicitly change legal practices that involve rules like one-person-one-vote which are already recognized as somewhat absurd.

It would also be nice to see more direct attempts to get at this proposed relationship between perceptions of census data as private and willingness of populations to be included in data collection, related to the kind of indirect costs of not using DP that Priyanka alludes to. Even without a “privacy Chernobyl,” if the Census Bureau switched to DP out of fear that their liability under the old methods was too high and would threaten their credibility as an organization/lead to higher non-response error, could we try to quantify that trade-off? This would be exploratory of course but if the Bureau thinks certain hard-to-reach populations will become harder to reach if they don’t make a dramatic change, then it makes sense to ask how much harder to reach they would need to become (relative to current estimated non-response) for this trend to threaten data quality more than DP does. This could involve identifying how much evidence there is in past data for a link between greater privacy awareness in a population and higher non-response error. Hard to know if this would be hopelessly confounded without seeing any attempts.

Much of the discussion so far has been about PL 94-171 data, which is used for redistricting and voting legislation. However, attention is now turning to the Demographic and Housing Characteristics (DHC) files, which contain many more variables. Next month the bureau is holding a meeting to collect feedback from those who have evaluated the demonstration products they released which applied the new DAS to 2010 DHC data (so we should expect the same messy comparison of noised data to noised data). It’s unclear to me whether there are any groups of computer scientists or others trying to understand the implications of the DHC demo data for privacy. My sense from talking to others who have followed the discussions very closely is that what happens at that meeting could be pivotal for the strategy going forward.  

Thanks to Abie Flaxman, who also caught parts of the workshop, for reading a draft of this post.

Buying things vs. buying experiences (vs. buying nothing at all): Again, we see a stock-versus-flow confusion

Alex Tabarrok writes:

A nice, well-reasoned piece from Harold Lee pushing back on the idea that we should buy experiences not goods:

While I appreciate the Stoic-style appraisal of what really brings happiness, economically, this analysis seems precisely backward. It amounts to saying that in an age of industrialization and globalism, when material goods are cheaper than ever, we should avoid partaking of this abundance. Instead, we should consume services afflicted by Baumol’s cost disease, taking long vacations and getting expensive haircuts which are just as hard to produce as ever. . . .

. . . tools and possessions enable new experiences. A well-appointed kitchen allows you to cook healthy meals for yourself rather than ordering delivery night after night. A toolbox lets you fix things around the house and in the process learn to appreciate how our modern world was made. A spacious living room makes it easy for your friends to come over and catch up on one another’s lives. A hunting rifle can produce not only meat, but also camaraderie and a sense of connection with the natural world of our forefathers. . . .

The sectors of the economy that are becoming more expensive every year – which are preventing people from building durable wealth – include real estate and education, both items that are sold by the promise of irreplaceable “experiences.” Healthcare, too, is a modern experience that is best avoided. As a percent of GDP, these are the growing expenditures that are eating up people’s wallets, not durable goods. . . .

OK, first a few little things, then my main argument.

The little things

It’s fun to see someone pushing against the “buy experiences, not goods” thing, which has become a kind of counterintuitive orthodoxy. I wrote about this a few years ago, mocking descriptions of screensaver experiments and advice to go to bullfights. So, yeah, good to see this.

There are some weird bits in the quoted passage above. For one thing, that hunting rifle. What is it with happiness researchers and blood sports, anyway? Are they just all trying to show how rugged they are, or something? I eat meat, and I’m not offering any moral objection to hunting rifles—or bullfights, for that matter—but this seems like an odd example to use, given that you can get “camaraderie and a sense of connection with the natural world of our forefathers” by just taking a walk in the woods with your friends or family—no need to buy the expensive hunting rifle for that!

Also something’s off because in one place he’s using “a spacious living room” as an example of a material good that people should be spending on (it “makes it easy for your friends to come over and catch up on one another’s lives”), but then later he’s telling us to stop spending so much money on real estate. Huh? A spacious living room is real estate. Of course, real estate isn’t all about square footage, it’s also about location, location, and location—but, if your goal is to make it easy for your friends to come over, then it’s worth paying for location, no? Personally, I’d rather live around the corner from my friends and be able to walk over than to have a Lamborghini and have to shlep it through rush-hour traffic to get there. Anyway, my point is not that Lee should sell his Lambo and exchange it for a larger living room in a more convenient neighborhood; it just seems that his views are incoherent and indeed contradictory.

And then there are the slams against education and health care. I work in the education sector so I guess I have a conflict of interest in even discussing this one, but let me give Lee the benefit of the doubt and say that lots of education can be replaced by . . . books. And books are cheaper than ever! A lot of education is motivation, and maybe tricks of gamification can allow this to be done using less instructor labor. Even so, once you’ve bought the computer, these are still services (“experiences”), not durable goods. Indeed, if you’re reading your books online, then these are experiences too.

Talking about education gets people all riled up, so let’s try pushing the discussion sideways, to sports. Lee mentions “a functional kitchen and a home gym (or tennis rackets or cross-country skis).” You might want to pay someone to teach you how to use these things! I think we’re all familiar with the image of the yuppie who buys $100 sneakers and a $200 tennis racket and goes out to the court, doesn’t know what he’s doing, and throws out his back.

A lot of this seems like what Tyler Cowen calls “mood affiliation.” For example, Lee writes, “If you have a space for entertaining and are intentional about building up a web of friendships, you can be independent from the social pull of expensive cities. Build that network to the point of introducing people to jobs, and you can take the edge off, a little, of the pressure for credentialism.” I don’t get it. If you want a lifestyle that “makes it easy for your friends to come over and catch up on one another’s lives,” you might naturally want to buy a house with a large living room in a neighborhood where many of your friends live. Sure, this may be expensive, but who needs the fancy new car, the ski equipment you’ll never use, the home gym that deprives you of social connections, etc.? But nooooo. Lee doesn’t want you to do that! He’s cool with the large living room (somehow that doesn’t count as “real estate”), but he’s offended that you might want to live in an expensive city. Learn some economics, dude! Expensive places are expensive because people want to live there! People want to live there for a reason. Yes, I know that’s a simplification, and there are lots of distortions of the market, but that’s still the basic idea. Similarly, wassup with this “pressure for credentialism”? I introduce people to jobs all the time. People often are hirable because they’ve learned useful skills: is that this horrible “credentialism” thing?

The big thing

The big thing, though, is that I agree with Lee and Tabarrok—goods are cheap, and it does seem wise to buy a lot of them (environmental considerations aside)—but I think they’re missing the point, for a few reasons.

First, basic economics. To the extent that goods are getting cheaper and services aren’t, it makes sense that the trend would be (a) to be consuming relatively more goods and relatively fewer services than before, but (b) to be spending a relatively greater percentage of your money on services. Just think about that one for a moment.

Second, we consume lots more material goods than in the past. Most obviously, we substitute fuel for manual labor, both as individuals and as a society, for example using machines instead of laborers to dig ditches.

Third is the stock vs. flow thing mentioned in the title to this post. As noted, I agree with Lee and Tabarrok that it makes sense in our modern society to consume tons and tons of goods—and we do! We buy gas for our cars, we buy vacuum cleaners and washing machines and dishwashers and computers and home stereo systems and smartphones and toys for the kids and a zillion other things. The “buy experiences not things” advice is not starting from zero: it’s advice starting from the baseline that we buy lots and lots of things. We already have closets and garages and attics full of “thoughtfully chosen material goods can enable new activities can enrich your life, extend your capabilities, and deepen your understanding of the world” (to use Lee’s words).

To put it another way, we’re already in the Lee/Tabarrok world in which we’re surrounded by material possessions with more arriving in the post every week. But, as these goods become cheaper and cheaper, it makes sense that a greater proportion of our dollars will be spent on experiences. To try to make the flow of possessions come in even faster and more luxuriously, to the extent of abandoning your education, not going to the doctor, and not living in a desirable neighborhood: that just seems perverse, more of a sign of ideological commitment than anything else.
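The "basic economics" point above, that cheaper goods can mean both more goods consumed and a larger share of money spent on services, can be checked with a toy calculation. The sketch below assumes a consumer with equal-weight CES preferences and an elasticity of substitution below 1 (goods and services are poor substitutes). That assumption is not in the original argument, and it matters: with an elasticity above 1 the spending share would move the other way. Prices and function names are made up for illustration.

```python
def goods_to_services_quantity_ratio(p_goods, p_services, sigma):
    """Optimal quantity ratio goods/services under equal-weight CES
    preferences with elasticity of substitution `sigma`."""
    return (p_services / p_goods) ** sigma


def services_spending_share(p_goods, p_services, sigma):
    """Share of total spending that goes to services at the optimum."""
    ratio = goods_to_services_quantity_ratio(p_goods, p_services, sigma)
    goods_spend = p_goods * ratio  # spending on goods per unit of services
    services_spend = p_services    # spending on that one unit of services
    return services_spend / (goods_spend + services_spend)


sigma = 0.5  # assumed: goods and services are poor substitutes
for p_goods in (1.0, 0.5, 0.25):  # goods get cheaper; services stay at 1.0
    ratio = goods_to_services_quantity_ratio(p_goods, 1.0, sigma)
    share = services_spending_share(p_goods, 1.0, sigma)
    print(f"goods price {p_goods}: goods/services quantity ratio {ratio:.2f}, "
          f"services share of spending {share:.2f}")
```

Under these assumptions, as the price of goods falls the quantity ratio tilts toward goods while the spending share tilts toward services, which is both halves of the claim at once.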

One more thing

In an important way, all of this discussion, including mine, is in a bubble. If you’re choosing between a fancy kitchen and home gym, a dream vacation complete with bullfight tickets, and a Columbia University education, you’re already doing well financially.

So far we’ve been talking about two ways to spend your money: on things or experiences. But there’s a third goal: security. People buy the house in the nice neighborhood not just for the big living room (that’s a material good that Lee approves of) or to have a shorter commute (an experience, so he’s not so thrilled about that one, I guess), but also to avoid crime and to allow their kids to go to good schools. These are security concerns. Similarly we want reliable health care not for material gain or because it’s a fun experience but because we want some measure of security (while recognizing that none of us will live forever). Similarly for education too: we want the experience of learning and the shiny things we can buy with our future salaries but also future job and career security. So it’s complicated, but I don’t know that either of the solutions on offer—buying more home gym equipment or buying more bullfight tickets—is the answer.

“A much bigger problem is the tension between the difficulty of statistics and the demand for it to be simple and readily available.”

Christian Hennig writes (see here for context):

Statistics is hard. Well-trained, experienced and knowledgeable statisticians disagree about standard methods. . . .

The 2021 [American Statistical Association] task force statement states: “Indeed, P-values and significance tests are among the most studied and best understood statistical procedures in the statistics literature.” I do not disagree with this. Probability models assign probabilities to sets, and considering the probability of a well chosen data-dependent set is a very elementary way to assess the compatibility of a model with the data. . . .

Still, considering the P-value as “among the best understood”, it is remarkable how much controversy, lack of understanding, and misunderstanding regarding them exist. Indeed there are issues with tests and P-values about which there is disagreement even among the most proficient experts, such as when and how exactly corrections for multiple testing should be used, or under what exact conditions a model can be taken as “valid”. Such decisions depend on the details of the individual situation, and there is no way around personal judgement.

I do not think that this is a specific defect of P-values and tests. The task of quantifying evidence and reasoning under uncertainty is so hard that problems of these or other kinds arise with all alternative approaches as well.

This is well put. Hennig continues:

A much bigger problem is the tension between the difficulty of statistics and the demand for it to be simple and readily available. Data analysis is essential for science, industry, and society as a whole. Not all data analysis can be done by highly qualified statisticians, and society cannot wait with analysing data for statisticians to achieve perfect understanding and agreement. On top of this there are incentives for producing headline grabbing results, and society tends to attribute authority to those who convey certainty rather than to those who emphasise uncertainty. . . .

Another important tension exists between the requirement for individual judgement and decision-making depending on the specifics of a situation, and the demand for automated mechanical procedures that can be easily taught, easily transferred from one situation to another, justified by appealing to simple general rules . . .

P-values are so elementary and apparently simple a tool that they are particularly suitable for mechanical use and misuse. To have the data’s verdict about a scientific hypothesis summarised in a single number is a very tempting perspective, even more so if it comes without the requirement to specify a prior first, which puts many practitioners off a Bayesian approach. As a bonus, there are apparently well established cutoff values so that the number can even be reduced to a binary “accept or reject” statement. Of course all this belies the difficulty of statistics and a proper account of the specifics of the situation.

As said in the 2016 ASA Statement, the P-value is an expression of the compatibility of the data with the null model, in a certain respect that is formalised by the test statistic. As such, I have no issues with tests and P-values as long as they are not interpreted as something that they are not. . . . It seems more difficult to acknowledge how models can help us to handle reality without being true, and how finding an incompatibility between data and model can be a starting point of an investigation how exactly reality is different and what that means. . . .

And then:

As statisticians we face the dilemma that we want statistics to be popular, authoritative, and in widespread use, but we also want it to be applied carefully and correctly, avoiding oversimplification and misinterpretation. That these aims are in conflict is in my view a major reason for the trouble with P-values, and if P-values were to be replaced by other approaches, I am convinced that we would see very similar trouble with them, and to some extent we already do.

Ultimately I believe that as statisticians we should stand by the complexity and richness of our discipline, including the plurality of approaches. We should resist the temptation to give those who want a simple device to generate strong claims what they want, yet we also need to teach methods that can be widely applied, with a proper appreciation of pitfalls and limitations, because otherwise much data will be analysed with even less insight. Making reference to the second quote above, we exactly need to “contradict ourselves” in the sense of conveying what can be done, together with what the problems of any such approach are.

That’s what we try to do in Regression and Other Stories!

“Causal” is like “error term”: it’s what we say when we’re not trying to model the process.

After my talks at the University of North Carolina, Cindy Pang asked me a question regarding causal inference and spatial statistics: both topics are important in statistics but you don’t often see them together.

I brought up the classical example of agricultural studies in which different levels of fertilizer are applied to different plots, the plots have a spatial structure (for example, laid out in rows and columns), and the fertilizers can spread through the soil to affect neighboring plots. This is known in causal inference as the spillover problem, and the way I’d recommend attacking it is to set up a parametric model for the spillover as a function of distance, which affects the level of fertilizer going to each plot, so that you could directly fit a model of the effect of the fertilizer on the outcome.
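A minimal sketch of what such a parametric spillover model might look like follows. The exponential-decay kernel, the `length_scale` parameter, the grid layout, and the doses are all assumptions for illustration, not a method taken from any particular study. The kernel is also deliberately crude: it does not conserve the total amount of fertilizer.

```python
import math


def effective_doses(positions, applied, length_scale):
    """Effective fertilizer at each plot under an exponential-decay kernel.

    effective[i] = sum over j of applied[j] * exp(-distance(i, j) / length_scale)

    Each plot gets its own dose in full (distance 0) plus a share of every
    other plot's dose that decays with distance.
    """
    result = []
    for xi, yi in positions:
        total = 0.0
        for (xj, yj), dose in zip(positions, applied):
            distance = math.hypot(xi - xj, yi - yj)
            total += dose * math.exp(-distance / length_scale)
        result.append(total)
    return result


# Hypothetical 2 x 3 layout of plots, one unit apart, with made-up doses:
# only the middle plot of the first row is treated.
positions = [(row, col) for row in range(2) for col in range(3)]
applied = [0.0, 10.0, 0.0, 0.0, 0.0, 0.0]
for position, dose in zip(positions, effective_doses(positions, applied, 0.5)):
    print(position, round(dose, 3))
```

In a full analysis the outcome (say, yield) would then be modeled as a function of the effective dose rather than the applied dose, with `length_scale` estimated jointly with the dose-response parameters.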

The discussion got me thinking about the way in which we use the term “causal inference” in statistics.

Consider some familiar applications of “causal inference”:
– Clinical trials of drugs
– Observational studies of policies
– Survey experiments in psychology.

And consider some examples of problems that are not traditionally labeled as “causal” but do actually involve the estimation of effects, in the sense of predicting outcomes under different initial conditions that can be set by the experimenter:
– Dosing in pharmacology
– Reconstructing climate from tree rings
– Item response and ideal-point models in psychometrics.

So here’s my thought: statisticians use the term “causal inference” when we’re not trying to model the process. Causal inference is for black boxes. Once we have a mechanistic model, it just feels like “modeling,” not like “causal inference.” Issues of causal identification still matter, and selection bias can still kill you, but typically once we have the model for the diffusion of fertilizer or whatever, we just fit the model, and it doesn’t seem like a causal inference problem, it’s just an inference problem. To put it another way, causal inference is all about the aggregation of individual effects into average effects, and if you have a direct model for individual effects, then you just fit it directly.

This post should have no effect on how we do any particular statistical analysis; it’s just a way to help us structure our thinking on these problems.

P.S. Just to clarify: In my view, all the examples above are causal inference problems. The point of this post is that only the first set of examples is typically labeled as “causal.” For example, I consider dosing models in pharmacology to be causal, but I don’t think this sort of problem is typically included in the “causal inference” category in the statistics or econometrics literature.

We have really everything in common with machine learning nowadays, except, of course, language.

I had an interesting exchange with Bob regarding the differences between statistics and machine learning. If it were just differences in jargon, it would be no big deal—you could just translate back and forth—but it’s trickier than that, because the two subfields also have different priorities and concepts.

It started with this abstract by Satyen Kale in Columbia’s statistical machine learning seminar:

Learning linear predictors with the logistic loss—both in stochastic and online settings—is a fundamental task in machine learning and statistics, with direct connections to classification and boosting. Existing “fast rates” for this setting exhibit exponential dependence on the predictor norm, and Hazan et al. (2014) showed that this is unfortunately unimprovable. Starting with the simple observation that the logistic loss is 1-mixable, we design a new efficient improper learning algorithm for online logistic regression that circumvents the aforementioned lower bound with a regret bound exhibiting a doubly-exponential improvement in dependence on the predictor norm. This provides a positive resolution to a variant of the COLT 2012 open problem of McMahan and Streeter when improper learning is allowed. This improvement is obtained both in the online setting and, with some extra work, in the batch statistical setting with high probability. Leveraging this improved dependency on the predictor norm yields algorithms with tighter regret bounds for online bandit multiclass learning with the logistic loss, and for online multiclass boosting. Finally, we give information-theoretic bounds on the optimal rates for improper logistic regression with general function classes, thereby characterizing the extent to which our improvement for linear classes extends to other parametric and even nonparametric settings. This is joint work with Dylan J. Foster, Haipeng Luo, Mehryar Mohri and Karthik Sridharan.

What struck me was how difficult it was for me to follow what this abstract was saying!

“Learning linear predictors”: does this mean estimating coefficients, or deciding what predictors to include in the model?

“the logistic loss”: I’m confused here. In logistic regression we use the log loss (that is, the contribution to the likelihood is log(p)); the logistic comes in the link function. So I can’t figure this out. If for example we were instead doing probit regression, would it be probit loss? There’s something I’m not catching here. I’m guessing that they are using the term “loss” in a slightly different way than we would.

“exponential dependence on the predictor norm”: I have no idea what this is.

“1-mixable”: ?

“a regret bound”: I don’t know what that is either!

“the COLT 2012 open problem of McMahan and Streeter”: This is news to me. I’ve never even heard of COLT before.

“improper learning”: What’s that?

“online bandit multiclass learning”: Now I’m confused in a different way. I think of logistic regression for 0/1 data, but multiclass learning, I’m guessing that’s for data with more than 2 categories?

“online multiclass boosting”: I don’t actually know what this is either, but I’ve heard of “boosting” (even though I don’t know exactly what it is).

“improper logistic regression”: What is that? I don’t think it’s the same as improper priors.

I know I could google all the above terms, and I’m not faulting the speaker for using jargon, any more than I should be faulted for using Bayesian jargon in my talks. It’s just interesting how difficult the communication is.
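One of these terms can at least be checked numerically. Assuming “logistic loss” means what it usually does in the learning-theory literature, namely log(1 + exp(-y*z)) for a label y in {-1, +1} and a linear predictor z, it is the same number as the log loss (negative log-likelihood) from logistic regression with 0/1 labels. The sketch below is only an illustration; the function names are made up here and do not come from the abstract.

```python
import math

def logistic_loss(y_pm1, z):
    """Logistic loss for a label in {-1, +1} and linear predictor z."""
    return math.log1p(math.exp(-y_pm1 * z))

def log_loss(y01, z):
    """Negative log-likelihood for a label in {0, 1} under a logit link."""
    p = 1.0 / (1.0 + math.exp(-z))  # inverse logit
    return -math.log(p) if y01 == 1 else -math.log(1.0 - p)

# The two agree once labels are recoded from {0, 1} to {-1, +1}.
for z in (-2.0, -0.5, 0.0, 1.3):
    for y01 in (0, 1):
        y_pm1 = 2 * y01 - 1
        assert abs(logistic_loss(y_pm1, z) - log_loss(y01, z)) < 1e-12
```

On this reading, the difference is one of label coding and vocabulary rather than of the quantity being minimized.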

I sent the above to Bob, who replied:

Stats seems more concerned with asymptotic consistency and bias of estimators than about convergence rate. And I’ve never seen anyone in stats talk about regret. The whole setup is online (data streaming inference, which they call “stochastic”). But then I don’t get out much.

I have no idea why they say online logistic regression is a fundamental task. It seems more like an approach to classification than a task itself. But hey, that’s just me being picky about language.

I have no idea what 1-mixable is or what a predictor norm is, and I wasn’t enlightened as to what “improper” means after reading the abstract.

Again, not at all a slam on Satyen Kale. It would be just about as hard to figure out what I’m talking about, based on one of my more technical abstracts. This is just an instance of the general problem of specialized communication: jargon is confusing for outsiders but saves time for insiders.

I suggested to Bob that what we need is some sort of translation, and he responded:

I know some of this stuff. Regret is the expected difference in utility between the strategy you played and the optimal strategy (in simple bandit problems, optimal is always playing the best arm). Regret bounds are strategies for pulling arms in an explore/exploit way that bounds regret. This is what you’ll need to connect to the ML literature on online (subject at a time being assigned) A/B testing.

COLT is the conference on learning theory. That’s where they do PAC learning. If you don’t know what PAC learning is, well, google it. . . .

I think “logistic loss” was either a typo or to distinguish their use of logistic regression from general 0/1 loss stuff.

Online bandit multiclass: online means one example at a time or one subject at a time where you can control assignment to any of k treatments (or a control). Yes, you can use multi-logit for this as we were discussing the other day. It’s in the Stan user’s guide regression chapter.

Boosting is a technique where you iteratively train, upweighting examples each iteration where there were errors in the previous iteration, then you weight all the predictors at each iteration. Given its heuristic nature, there are a gazillion variants. It’s usually used with decision “stumps” (shallow decision trees) a la BART.

But I have no clue what “improper” means despite being in the title.
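To make the definition of regret above concrete, here is a minimal Python sketch of a simple bandit problem. It assumes Bernoulli arms with known true means (known to the simulation, not to the player) and uses an epsilon-greedy strategy, which is chosen here only for illustration and is not one of the algorithms from the abstract. Regret is accumulated as the expected shortfall from not playing the best arm in each round.

```python
import random

def epsilon_greedy_regret(arm_means, n_rounds, epsilon=0.1, seed=0):
    """Run epsilon-greedy on Bernoulli arms; return cumulative expected regret."""
    rng = random.Random(seed)
    best_mean = max(arm_means)
    pulls = [0] * len(arm_means)
    totals = [0.0] * len(arm_means)
    regret = 0.0
    for _ in range(n_rounds):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(len(arm_means))
        else:
            # Exploit: pick the arm with the highest observed mean reward.
            arm = max(range(len(arm_means)), key=lambda a: totals[a] / pulls[a])
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
        # Expected shortfall relative to always playing the best arm.
        regret += best_mean - arm_means[arm]
    return regret

# Hypothetical example: three treatments with made-up success rates.
print(epsilon_greedy_regret([0.2, 0.5, 0.7], n_rounds=1000))
```

A regret bound, in this vocabulary, is a guarantee on how fast that accumulated number can grow with the number of rounds for a given strategy.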

As an aside, I’m annoyed at the linguistic drift by which “classification and regression trees” have become “decision trees.” This seems inaccurate and misleading to me, as no decisions are involved. “Regression trees” or “Prediction trees” would seem more accurate to me. As with other such linguistic discussions, my goal here is not purity or correctness but rather accuracy and minimization of confusion.

Anyway, to continue the main thread, Bob summarizes the themes of the above discussion as: “us all being finite, academia being siloed, and communication being harder than math.” At a technical level, it seems that the key difference is that machine learning is focused on online learning, while statistics is focused on static learning. This is part of the general pattern that computer scientists work on larger problems than statisticians do.

New twitter feed: StatPapers

The other day we introduced our new twitter feed, StatRetro, which goes through all the posts in the history of this blog, in chronological order, popping out a new one every 8 hours.

We also created another feed, StatPapers, which tweets out our published research articles in chronological order, starting from my very first publication from 1984, with a new one appearing twice a week. The papers are already here (and my unpublished papers are here), but I thought that having a feed like this could be fun for anyone who wanted to follow along with all that we’ve been doing over the years.

Provisional draft of the Neurips code of ethics

This is Jessica. Recently a team associated with the machine learning conference Neurips released an interesting document detailing a code of ethics, essentially listing concerns considered fair game for critiquing (possibly even rejecting) submitted papers. Samy Bengio, Alina Beygelzimer, Kate Crawford, Jeanne Fromer, Iason Gabriel, Amanda Levendowski, Inioluwa Deborah Raji, Marc’Aurelio Ranzato write:

Abstract: Over the past few decades, research in machine learning and AI has had a tremendous impact in our society. The number of deployed applications has greatly increased, particularly in recent years. As a result, the NeurIPS Conference has received an increased number of submissions (approaching 10,000 in the past two years), with several papers describing research that has foreseeable deployment scenarios. With such great opportunity to impact the life of people comes also a great responsibility to ensure research has an overall positive effect in our society.  

Before 2020, the NeurIPS program chairs had the arduous task of assessing papers not only for their scientific merit, but also in terms of their ethical implications. As the number of submissions increased and as it became clear that ethical considerations were also becoming more complex, such ad hoc process had to be redesigned in order to properly support our community of submitting authors.

The program chairs of NeurIPS 2020 established an ethics review process, which was chaired by Iason Gabriel. As part of that process, papers flagged by technical reviewers could undergo an additional review process handled by reviewers with expertise at the intersection of AI and Ethics. These ethical reviewers based their assessment on guidelines that Iason Gabriel, in collaboration with the program chairs, drafted for the occasion. 

This pilot experiment was overall successful because it surfaced early on several papers that needed additional discussion and provided authors with additional feedback on how to improve their work. Extended and improved versions of such ethics review process were later adopted by the NeurIPS 2021 program chairs as well as by other leading machine learning conferences. 

One outstanding issue with such a process was the lack of transparency in the process and lack of guidelines for the authors. Early in 2021, the NeurIPS Board gave Marc’Aurelio Ranzato the mandate to form a committee to draft a code of Ethics to remedy this. The committee, which corresponds to the author list of this document, includes individuals with diverse background and views who have been working or have been involved in Ethics in AI as part of their research or professional service. The first outcome of the committee was the Ethics guidelines, which was published in May 2021.

The committee has worked for over a year to draft a Code of Ethics. This document is their current draft, which has not been approved by the NeurIPS Board as of yet. The Board decided to first engage with the community to gather feedback. We therefore invite reviews and comments on this document. We welcome your encouragement as well as your critical feedback. We will then revise this document accordingly and finalize the draft, hoping that this will become a useful resource for submitting authors, reviewers and presenters.

The idea that AI/ML researchers need to take certain values (fairness, environmental sustainability, privacy, etc.) seriously so as to avoid real-world harms if their tools are deployed has been pretty visible in recent years, but these conversations have been mostly “opt-in” until recently. Documents like the code of ethics and requirements like broader impacts statements attempt to introduce values that haven’t been baked into the domain the way that caring about scalability and efficiency is. Naturally things get interesting. It’s hard not to acknowledge that knowledge of how to evaluate or deliberate about ethics is sparse (not to mention how to institute guidelines). Some adamantly oppose any ethics regulation on this ground, or for other reasons, e.g., arguing that tech is neutral or that the future is too unpredictable to evaluate work on downstream consequences. Documenting what constitutes an ethics concern at the organization level requires deciding where to draw lines and what to leave out in a way that the individual papers about ethical concerns don’t have to contend with. I’ve heard some people refer to all this as a power struggle between the more activism-minded critics and those who are content, having benefited from the status quo.

At any rate, among ML conferences, Neurips organizers have been pretty brave in stepping up to experiment with ways to establish processes to encourage more responsible research, whether everyone in the community wants them or not. The authors of this doc took on a difficult challenge in agreeing to write it. 

My initial reaction from reading the code of ethics is that if there is a power struggle around how big the role of ethical deliberation should be in AI/ML, this is a rather small step. It’s an interesting mix of a) what seem like basic research integrity suggestions that should not ruffle any feathers and b) Western progressive politics made explicit. I have more questions than strong opinions. 

The biggest question, as some of the public comments on the draft get at, is to what extent the guidelines can be interpreted as suggestions versus mandates. It seems reasonable to think that if Neurips is going to do ethics review, there should be some log somewhere of what kinds of things might lead to an ethics flag being raised that is taken seriously in review, and the abstract above makes clear this was part of the motivation for the code of ethics. But being transparent about the content of the guidelines is not the same as being transparent about the intended use of the guidelines. I think the authors do attempt to clarify – they write for instance that “this document aims at supporting the comprehensive evaluation of ethics in AI research by offering material of reflection and further reading, as opposed to providing a strict set of rules.” At the same time, most Neurips authors know by now that it is within the realm of possibility that their paper submission could be rejected if there are ethics concerns that aren’t addressed to satisfaction, so naturally they are going to pay attention to the specific content. 

From my read I expect that this code is not meant to have the same power as the ethics reviewers themselves; it’s a reflection of stuff they care about that can change over time but which will probably always be incomplete and secondary to their authority. Language used throughout the document makes it read like a tentative set of suggestions. There’s a lot of “encourage”, “emphasize,” “to the extent possible” to imply it’s more like a wish list on the part of the Neurips organization. Though there are a few sentences that use slightly stronger terms, e.g., saying that the document “outlines conference expectations about the broad ethical practices that must be adopted by participants and by the wider community, including submitting authors, members of the program committee and speakers.” There’s also a reference to the code of ethics being meant to complement the Neurips code of conduct, which as far as I know is treated more like a set of rules that can get one reported, asked to leave events, etc. So it’s not surprising to see some confusion about what exactly its role will be.

When it comes to the specific content, many of the suggestions about general research ethics seem like fairly well-accepted best practices: paying participants (like in dataset creation) a fair wage, involving the IRB in data collection, trying to protect the privacy of individuals represented in a dataset, documenting model and dataset details (though I’m not sure templates are necessary, as they suggest, so much as making the effort to record everything that might be important for readers to know). 

Suggestions about research ethics that struck me as slightly bolder include aiming for datasets that are representative of a diverse population and using disaggregated (by sub-population) test sets when reporting model performance. I guess fairness has graduated from being a sub-area of AI/ML to being an expectation of all applications. I wonder about a weaker suggestion that could have been made, that authors simply be more mindful in making claims about what their contributions are capable of (e.g., don’t claim it can detect some attribute from images of people’s faces if your training and test data only included images of white people’s faces). The idea that it’s not enough to ask for transparency and accurate statements about limitations, that instead we should evaluate the potential to do good for the world, is admittedly where I struggle the most with ethics reform in CS. The less value-laden premise that researchers should be transparent and rigorous in their claims seems at least as important to express as a value. Do these exist somewhere else? I glanced at the code of conduct but that seems more about barring harassment or fraud.
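For readers unfamiliar with the term, reporting on a disaggregated test set just means computing the performance metric separately for each sub-population in addition to the overall number. A minimal Python sketch, using accuracy as the metric and made-up labels and group names (none of which come from the code of ethics):

```python
from collections import defaultdict

def disaggregated_accuracy(y_true, y_pred, groups):
    """Return (overall accuracy, dict of accuracy per group)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for yt, yp, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(yt == yp)
    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group

# Hypothetical toy data: the overall number hides a gap between groups.
overall, per_group = disaggregated_accuracy(
    y_true=[1, 0, 1, 1],
    y_pred=[1, 0, 0, 0],
    groups=["a", "a", "b", "b"],
)
print(overall, per_group)
```

The weaker suggestion discussed above would amount to reporting the per-group numbers honestly, without requiring that they meet any particular standard.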

The second part of the code provides guidance on how authors should reflect on societal impact. It says that all authors are expected to include reflection on societal impacts in a separate paper section, so that reads like a requirement. The high level idea that authors should to the best of their ability try to imagine ways that the technology might be used to cause harm does not seem contentious to me. Requiring sections in all papers on societal impacts does however raise the question of who they are for – readers, authors, both? When I first heard proposals a few years ago that AI and ML researchers need to be stating broader impacts of what they build, I wondered why anyone would expect such reflection to be useful, given how much uncertainty exists in the pipeline from research paper to deployment, and the fact that computer scientists have often taken very few humanities classes. Having paid attention to attempts to codify ethics over the last few years, I now have the impression that the motivation behind requiring these statements is driven more by a desire to normalize the idea of reflecting on ethics, i.e., to encourage awareness in a general way rather than producing concrete factual statements. Still, it’s often hard in reading these proposals to tell what the ethics advocates have in mind, so how the expectations about downstream impacts are phrased seems important. You don’t want to imply predictability so much as an expectation that researchers will not intentionally produce tools for societal evil. 

Related to this, early on the doc states that:

In general, what is legally permissible might not be considered ethically acceptable in AI research. There are two guiding principles that underlie the NeurIPS Code of Ethics. First, AI research is worthwhile only to the extent that it can have a positive net outcome for our society. Second, since the assessment of whether a research practice or outcome is ethical might be difficult to establish unequivocally, transparent and truthful communication is of paramount importance.

“Positive net outcome” is a strong statement if you try to take it literally as a criterion we should use to judge all research in the near term. The second principle seems to acknowledge this, so I doubt the authors intend this phrase to be taken very seriously. Still the wording does make me cringe a little, since it implies that there is some ultimate “objective” perspective on what is positive versus negative. And the word “net” makes it seem like something that’s at least theoretically measurable, like we can calculate the definitive goodness of a project, it’s just that we can’t observe it perfectly as humans. My mind inevitably goes to wondering how many research projects submitted to Neurips explore techniques that don’t feed directly into real world deployments, but maybe help the researchers learn important lessons they apply in later projects, but could still be said to have opportunity costs in terms of taking effort away from what seem like more obviously beneficial applications (say, more efficient methods for diagnosing certain illnesses)… etc. Researchers can be pretty good at making up problems, or exaggerating problems we think we can solve, and we can learn from these projects, but the impact is sort of meh. I feel that way about some of my own work actually! I would hate to have to weigh in on whether it was “net positive” or not. I don’t think the authors are advocating for thinking like this, but I guess I would prefer to avoid the positive / negative labels altogether and frame it more generally as an expectation that researchers will be careful in considering and acknowledging risks.

One guideline in the societal impacts section that stood out to me is worded less like a suggestion:

Research should not be designed for incorporation into weapons or weapons systems, either as a direct aim or primary outcome. We encourage researchers to consider ways in which their contribution could be used to develop, support, or expand weapons systems and to take measures to forestall this outcome.

It would seem from this that researchers should expect that mentioning weapons as an application will get their paper flagged for ethics review. I wonder here how the authors decided that weapons systems were off limits but not other uses of tech. I’m not disagreeing, but I’ve seen people call for bans on other types of applications, carceral for instance, so I wonder why only weapons systems are ruled out. In line with transparency, it would be great for Neurips to also publish more on the process for producing this document: what were the criteria that the group had in mind in thinking through how to write these guidelines, including both what types of concerns to include and how to phrase the directives? I imagine there might be some lessons to be learned from reflecting on the process for creating guidelines like this, where there may have been disagreements, etc.

It seems like it could also be an interesting exercise to collect many versions of a document like this, representing what different members of the Neurips community think is worth explicitly citing as community guidelines that could inform issues flagged for ethics review. There are some comments about how the vision in the document doesn’t really capture an international perspective as much as a Western one, for instance. It makes me wonder more about the part of the abstract that says the committee includes “individuals with diverse background and views,” and along what dimensions those views differed.

Jamaican me crazy one more time

Someone writes in:

After your recent Jamaican Me Crazy post, I dug into the new JECS paper a bit, and the problems are much deeper than what you mentioned. The main problems have to do with their block permutation approach to inference.

The article he’s referring to is “Effect of the Jamaica early childhood stimulation intervention on labor market outcomes at age 31”; it’s an NBER working paper and I blogged about it last month. I was surprised to hear that it had already been published in JECS.

I did some googling but couldn’t find the JECS version of the paper . . . maybe the title had been changed so I searched JECS and the author names, still couldn’t find it, then I realized I didn’t even know what JECS was: Journal of Economic . . . what, exactly? So I swallowed my pride and asked my correspondent what exactly was the new paper he was referring to, and he replied that it was the Jamaican Early Childhood Stimulation study, hence JECS. Of course! Jamaican me crazy, indeed.

Anyway, my correspondent followed up with specific concerns:

1. The first issue that I noticed is in their block 5. They lay out the blocks in Appendix A. The blocks as described in the body of the paper are stratified first by a mother’s education dummy and assignment to the nutritional supplement treatment arm (supposedly for being unbalanced at baseline), and then gender and an age dummy, which is how the study’s initial randomization was stratified. However, block 5 is only broken up by age, not gender. There’s no reason I can see for doing this – breaking it up by gender won’t create new blocks that are all treatment or control, nor will they be exceptionally small (current blocks 1, 3, and 6 are all smaller than what the resulting blocks would be). Regardless, this violates the exchangeability assumption of their permutation tests. Considering block 5 is 19% of their sample, splitting it could create a meaningful difference in their inferences.

2. Their block 1 is only mothers with higher education, it isn’t broken out by supplement, gender, or age. Again, this violates the exchangeability assumption, no reason is given as to why, and if you were only to read the body of the paper, you would have no idea that this is what they were doing. The actual design of the blocking I’ve attached here.

3. In the 2014 paper, the blocking uses mother’s education, mother’s employment status, a discretized weight-for-height variable, and then gender and age. No reason is given for why they dropped employment and weight-for-height and added supplement assignment – these are all baseline variables, if they were imbalanced in 2014, they’re still imbalanced now! Stranger still, the supplement assignment isn’t even imbalanced, since it was originally a treatment arm!
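The exchangeability point in items 1 and 2 can be sketched in code. In a block permutation test, treatment labels are shuffled only within blocks, so the choice of blocks determines which reassignments count as possible and hence the p-value. The Python below is a generic illustration under simple assumptions (difference in means as the test statistic, a two-sided test, made-up data). It is not the paper’s procedure or its replication code, and all names are hypothetical.

```python
import random
from collections import defaultdict

def diff_in_means(outcomes, treated):
    """Mean outcome of treated units minus mean outcome of controls."""
    t = [y for y, z in zip(outcomes, treated) if z == 1]
    c = [y for y, z in zip(outcomes, treated) if z == 0]
    return sum(t) / len(t) - sum(c) / len(c)

def block_permutation_pvalue(outcomes, treated, blocks, n_perm=2000, seed=0):
    """Two-sided p-value, permuting treatment labels within blocks only."""
    rng = random.Random(seed)
    # Group unit indices by block; labels are exchangeable only within a block.
    by_block = defaultdict(list)
    for i, b in enumerate(blocks):
        by_block[b].append(i)
    observed = diff_in_means(outcomes, treated)
    extreme = 0
    for _ in range(n_perm):
        permuted = list(treated)
        for idx in by_block.values():
            labels = [treated[i] for i in idx]
            rng.shuffle(labels)
            for i, z in zip(idx, labels):
                permuted[i] = z
        if abs(diff_in_means(outcomes, permuted)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction counts the observed assignment itself.
    return (extreme + 1) / (n_perm + 1)

# Made-up data: eight units in two blocks of four.
outcomes = [3.1, 2.9, 4.0, 3.8, 5.2, 5.0, 6.1, 5.9]
treated = [1, 0, 1, 0, 1, 0, 1, 0]
blocks = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(block_permutation_pvalue(outcomes, treated, blocks))
```

Passing a different `blocks` list, for example one that splits a large block further by gender, changes the set of permutations considered, which is why the blocking choices described above can matter for the reported inferences.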

I [the anonymous correspondent] found the 2014 data on ICPSR, and ran a handful of analyses, looking at the p-values you get if you run their 2021 blocking as they ran it, 2021 with block 5 split properly, some asymptotic robust p’s, and their 2014 blocking as I think they did it. I say “I think” because their replication code is 125 MATLAB files with no documentation. If you do it as described in the 2014 paper, you have 105 kids divided into 48 blocks, you end up with lots of empty or single observation blocks, so I’m sure that isn’t what they did, but it’s my best guess. I attached the table from running those here as well:

There were other issues, like their treatment of emigration, but this email is already long. You might also be interested in something on the academic side. I showed this to my advisor, and was basically told “great work, I don’t think you should pursue this.” . . . He recommended at most that I create a dummy email account, scrub my PDFs of any metadata, and send it anonymously to you and Uri Simonsohn. So at least for now, like your original correspondent, I live in the Midwest and have to walk across parking lots from time to time, so if you do blog, please keep me anonymous.

“Their replication code is 125 MATLAB files with no documentation”: Hey, that sounds a bit like my replication code sometimes! And I’ve been known to have some forking paths in my analyses. That’s one reason why I think the appropriate solution to multiplicity in statistical analysis is to fit multilevel models rather than to try to figure out multiple comparisons corrections. It should be about learning from the data, not about rejecting null hypotheses that we know ahead of time are false. So . . . I’m not particularly interested in the details of the permutation tests as discussed above—except to the extent that the results from those tests are presented as evidence in favor of the researchers’ preferred theories, in which case it’s useful to see the flaws in their reasoning.

Also, yeah, laffs and all that but for reals it’s absolutely horrible that people are afraid to express criticism of published work. What a horrible thing it says about our academic establishment that this sort of thing is happening. I don’t like being called Stasi or a terrorist or all the other crap that they throw at us, and I didn’t like it when my University of California colleagues flat-out lied about my research in order to do their friends a solid and stop my promotion, but, at least that was all in the past. The idea that it’s still going on . . . jeez. Remember this story, where an econ professor literally wrote, “I cam also play this hostage game,” threatening the careers of the students of a journal editor? Or the cynical but perhaps accurate remarks by Steven Levitt regarding scientific journals? I have no reason to think that economists behave worse than researchers in other fields; maybe they’re just more open about it, overtly talking about retaliation and using terms such as “hostage,” whereas people in other fields might just do these things and keep quiet about it.

Just to be on the safe side, though, look both ways before you cross that parking lot.

New twitter feed: StatRetro

For several years our blog has had a twitter feed, StatModeling, which tweets each of our posts as it appears.

I thought it would be fun to have a similar feed for our old posts, so we set something up, StatRetro, that starts with our very first post, from 12 Oct 2004, and then goes forward in chronological order, tweeting a new post every 8 hours. So you can relive all our old discussions. It’ll just take a couple decades to go through them all.

Just to get a sense of what’s coming, here are the first few posts we ever did:

A weblog for research in statistical modeling and applications, especially in social sciences

The Electoral College favors voters in small states

Why it’s rational to vote

Bayes and Popper

Overrepresentation of small states/provinces, and the USA Today effect

Sensitivity Analysis of Joanna Shepherd’s DP paper

Unequal representation: comments from David Samuels

Problems with Heterogeneous Choice Models

Morris Fiorina on C-SPAN

A fun demo for statistics class

Red State/Blue State Paradox

Statistical issues in modeling social space

. . .

You can use it as a teaching aid. With 3 new posts a day, there should be something for everyone. Again, the new twitter feed is StatRetro.

P.S. The above photo, courtesy of Zad, represents the overlay of the present and past that you will get from following StatRetro along with the new posts on StatModeling.

Public opinion on abortion (David Weakliem’s summary of the polls)


Although there are some puzzles on specific points, I [Weakliem] think that a pretty clear and consistent picture emerges.

1. Overall distribution of opinions: the most straightforward question is one by Gallup, which asks if abortion should be legal under any circumstances, under only certain circumstances, or illegal in all circumstances. . . .

2. Change: there has been little change in the overall distribution of opinions since the 1970s. If you look closely, you can find some changes, but they are small.

3. Gender: there is little difference in the opinions of men and women. . . .

4. OK, what does matter for opinions?: Education and age—more educated people and younger people are more favorable. Over time, younger generations replace older ones, and younger generations tend to be more highly educated, which would increase support for legal abortion. Given the absence of a clear trend, there must be age and/or period effects working in the other direction. . . . Religion . . . region . . . race . . .

5. Roe v. Wade: most people say that the Roe v. Wade decision should not be overturned. . . .

6. Parties: more people feel that they are closer to the Democrats on abortion and trust the Democrats to do a better job of dealing with abortion. The differences aren’t large, but they’ve been around for a long time and are pretty consistent. . . .

Ben Lerner is a reboot of Jonathan Franzen (and that’s a good thing)

Ben Lerner is to Jonathan Franzen as Richard Ford is to John Updike.

Lerner and Franzen write in the voices of painfully self-aware Midwesterners. I suspect they actually are painfully self-aware Midwesterners, but that could just be their personas. Ford and Updike write in the voices of confident but slightly goofy middle-aged Wasps. Updike himself is a bit of a rebooted John O’Hara; we’ve discussed this on the blog before.

What’s cool about Lerner and Ford is that they’re working within existing styles but they have better production values, as it were; they can convey the story more smoothly.

Seeing this sort of incremental improvement also gives a sense of why writers sometimes want to create new forms, so as not to get trapped in technique.

These observations are not new; they just came to mind recently and I wanted to share them with you.

How should journals handle submissions of non-reproducible research?

We had a big discussion recently about the article, “A Graph Placement Methodology for Fast Chip Design,” which was published in Nature, and was then followed up by a rebuttal, “Stronger Baselines for Evaluating Deep Learning in Chip Placement.” Both articles are based on work done at Google, and there was some controversy because it seems that Google management did not allow the Stronger Baselines article to be submitted to a journal, hence its appearance in samizdat form.

I won’t address the media coverage, the personal disputes in this case, or any issues of what has been or should be done within Google.

Rather, I want to discuss two questions relevant to outsiders such as myself:

1. How should we judge the claims in the two papers?

2. How should scientific journals such as Nature handle submissions of non-reproducible research?

The technical controversy

Just as a reminder, here are the abstracts to the two articles:

A Graph Placement Methodology for Fast Chip Design:

Stronger Baselines for Evaluating Deep Learning in Chip Placement:

The Nature paper focuses on the comparison to human designers, but here’s what is said in the conclusion of the Stronger Baselines paper:

So they’re just flat-out disagreeing, and this seems easy enough to resolve: just try the different methods on some well-defined examples and you’re done.

I’d say this comparison could be performed by some trusted third party, but the Stronger Baselines paper was already by a trusted third party, so I guess now we need some meta-level of trust. Anyway, it seems that it should be possible to resolve the question without having to go to the Google executives who are defending the Nature paper and ask them where exactly they disagree with the Stronger Baselines paper.

Although this comparison should be easy enough, it would require having all the code available, and maybe that isn’t possible. Which brings us to our second question . . .

How should scientific journals such as Nature handle submissions of non-reproducible research?

A big deal with reproducibility in computing is training, choosing hyperparameters, settings, options, etc. From the Nature paper:

For our method, we use a policy pre-trained on the largest dataset (20 TPU blocks) and then fine-tune it on five target unseen blocks (denoted as blocks 1–5) for no more than 6 h. For confidentiality reasons, we cannot disclose the details of these blocks, but each contains up to a few hundred macros and millions of standard cells. . . .

For confidentiality reasons, it’s not reproducible.

Fair enough. Google wants to make money and not give away their trade secrets. They have no obligation to share their code, and we have no obligation to believe their claims. (I’m not saying their claims are wrong, I’m just saying that in the absence of reproducible evidence, we have a choice of whether to accept their assertions.) They do have this GitHub page, which is better than you’ll see for most scientific papers; I still don’t know if the results in the published article can be reproduced from scratch.

But here’s the question. What should a journal such as Nature do with this sort of submission?

From one perspective, the answer is easy: if it’s not reproducible, it’s not public science, so don’t publish it. Nature gets tons of submissions; it wouldn’t kill them to only publish reproducible research.

At first, that’s where I stood: if it can’t be replicated, don’t publish it.

But then there’s a problem. In many subfields of science and engineering, the best work is done in private industry, and if you set a rule that you’re only publishing fully replicable work, you’d be excluding the state of the art, just publishing old or inferior stuff. It would be like watching amateur sports when the pros are playing elsewhere.

So where does this leave us? Publish stuff that can’t be replicated or rule out vast swathes of the forefront of research. Not a pleasant choice.

But I have a solution. Here it is: Journals can publish unreplicable work, but publish it in the News section, not the Research section.

Instead of “A Graph Placement Methodology for Fast Chip Design,” it would be “Google Researchers Claim a Graph Placement Methodology for Fast Chip Design.”

The paper could still be refereed (the review reports for this particular Nature article are here); it would just be a news report not a research article because the claims can’t be independently verified.

This solution should make everybody happy: corporate researchers can publish their results without giving away trade secrets, journals can stay up-to-date even in areas where the best work is being done in private, and readers and journalists can be made immediately aware of which articles in the journal are reproducible and which are not.

P.S. I’ve published lots of articles with non-reproducible research, sometimes because the data couldn’t be released for legal or commercial reasons, often simply because we did not put all the code in one place. I’m not proud of this; it’s just how things were generally done back then. I’m mentioning this just to emphasize that I don’t think it’s some kind of moral failure to have published non-reproducible research. What I’d like is for us to all do better going forward.

“How fake science is infiltrating scientific journals”

Chetan Chawla points us to this disturbing news article from Harriet Alexander:

In 2015, molecular oncologist Jennifer Byrne was surprised to discover during a scan of the academic literature that five papers had been written about a gene she had originally identified, but did not find particularly interesting.

“Looking at these papers, I thought they were really similar, they had some mistakes in them and they had some stuff that didn’t make sense at all,” she said. As she dug deeper, it dawned on her that the papers might have been produced by a third-party working for profit. . . .

The more she investigated, the more clear it became that a cottage industry in academic fraud was infecting the literature. In 2017, she uncovered 48 similarly suspicious papers and brought them to the attention of the journals, resulting in several retractions, but the response from the publishing industry was varied, she said.

“A lot of journals don’t really want to know,” she said. . . .

More recently, she and a French collaborator developed a software tool that identified 712 papers from a total of more than 11,700 which contain wrongly identified sequences that suggest they were produced in a paper mill. . . .

Even if the research was published in low-impact journals, it still had the potential to derail legitimate cancer research, and anybody who tried to build on it would be wasting time and grant money . . . Publishers and researchers have reported an extraordinary proliferation in junk science over the last decade, which has infiltrated even the most esteemed journals. Many bear the hallmarks of having been produced in a paper mill: submitted by authors at Chinese hospitals with similar templates or structures. Paper mills operate several models, including selling data (which may be fake), supplying entire manuscripts or selling authorship slots on manuscripts that have been accepted for publication.

The Sydney Morning Herald has learned of suicides among graduate students in China when they heard that their research might be questioned by authorities. Many universities have made publication a condition of students earning their masters or doctorates, and it is an open secret that the students fudge the data. . . .

In 2017, responding to a fake peer review scandal that resulted in the retraction of 107 papers from a Springer Nature journal, the Chinese government cracked down and created penalties for research fraud. Universities stopped making research output a condition of graduation or the number of articles a condition of promotion. . . . But those familiar with the industry say the publication culture has prevailed because universities still compete for research funding and rankings. . . . The Chinese government’s investigation of the 107 papers found only 11 per cent were produced by paper mills, with the remainder produced in universities. . . .

As Chawla writes, what’s scary is the idea that this Greshaming isn’t just happening in the Freakonomics/Gladwell/NPR/Ted/Psychological Science axis of bogus social science storytelling; it’s also occurring in fields such as cancer research, which we tend to think of as being more serious. OK, not always, but usually, right??

I continue to think that the way forward is to put everything on preprint servers and turn journals into recommender systems. The system would still have to deal with paper mills, but perhaps the problem would be easier to handle through post-publication review.

My (remote) talks at UNC biostat, 12-13 May

I was invited to give three talks at the biostatistics department of the University of North Carolina. I wasn’t sure what to talk about, so I gave them five options and asked them to choose three.

I’ll share the options with you, then you can guess which three they chose.

Here were the options:

1. All the ways that Bayes can go wrong

Probability theory is false. Weak priors give strong and implausible posteriors. If you could give me your subjective prior I wouldn’t need Bayesian inference. The best predictive model averaging is non-Bayesian. There will always be a need to improve our models. Nonetheless, we still find Bayesian inference to be useful. How can we make the best use of Bayesian methods in light of all their flaws?

2. Piranhas, kangaroos, and the failure of apparent open-mindedness: The connection between unreplicable research and the push-a-button, take-a-pill model of science

There is a replication crisis in much of science, and the resulting discussion has focused on issues of procedure (preregistration, publication incentives, and so forth) and statistical concepts such as p-values and statistical significance. But what about the scientific theories that were propped up by these unreplicable findings–what can we say about them? Many of these theories correspond to a simplistic view of the world which we argue is internally inconsistent (the piranha problem) involving quantities that cannot be accurately learned from data (the kangaroo problem). We discuss connections between these theoretical and statistical issues and argue that it has been a mistake to consider each of these studies and literatures on their own.

3. From sampling and causal inference to policy analysis: Interactions and the challenges of generalization

The three central challenges of statistics are generalizing from sample to population, generalizing from control to treated group, and generalizing from observed data to underlying constructs of interest. These are associated with separate problems of sampling, causal inference, and measurement, but in real decision problems all three issues arise. We discuss the way in which varying treatment effects (interactions) bring sampling concerns into causal inference, along with the real challenges of applying this insight to real problems. We consider applications in medical studies, A/B testing, social science research, and policy analysis.

4. Statistical workflow

Statistical modeling has three steps: model building, inference, and model checking, followed by possible improvements to the model and new data that allow the cycle to continue. But we have recently become aware of many other steps of statistical workflow, including simulated-data experimentation, model exploration and understanding, and visualizing models in relation to each other. Tools such as data graphics, sensitivity analysis, and predictive model evaluation can be used within the context of a topology of models, so that data analysis is a process akin to scientific exploration. We discuss these ideas of dynamic workflow along with the seemingly opposed idea that statistics is the science of defaults. We need to expand our idea of what data analysis is, in order to make the best use of all the new techniques being developed in statistical modeling and computation.

5. Putting it all together: Creating a statistics course combining modern topics with active student engagement

We envision creating a new introductory statistics course, combining several innovations: (a) a new textbook focusing on modeling, visualization, and computing, rather than estimation, testing, and mathematics; (b) exams and homework exercises following this perspective; (c) drills for in-class practice and learning; and (d) class-participation activities and discussion problems. We will discuss what’s been getting in the way of this happening already, along with our progress creating a collection of stories, class-participation activities, and computer demonstrations for a two-semester course on applied regression and causal inference.

OK, time to guess which three talks they picked. . . .

Ask the polling expert

I checked my email and saw this, dated 9:52am:

Good Morning Andrew,

My name is ** with **, the **-affiliate in **

I’m doing a story about a stat going around online: 70% of Americans support Roe v. Wade, and 30% support overturning it.
What I’m finding is it depends on what poll you look at.

I’d love to interview you this morning/early afternoon about how to interpret polls and what people should look at.

For instance, I’m seeing that the Gallup results vary from the Pew Research results vary from the Rassmussen results.

Do you have time for 10-15 minute Zoom interview?

By the time I saw this, the morning/early afternoon had passed. But I wanted to be helpful going forward, so I replied:

I never check my email before 4pm! In any case, if you have future such questions I recommend you contact David Weakliem at U. Connecticut, who is the real expert on this sort of public opinion question.

For those who don’t follow the blogroll, David Weakliem is here. And he’s written about abortion!

Opportunity for political scientists and economists to participate in the Multi100 replication project!

Barnabás Szászi writes:

I’m contacting you now regarding a project that I’m co-leading with Aczel Balazs. Here, we aim to estimate how robust published results and conclusions in social and behavioral sciences are to analysts’ analytical choices. What we do is that 100 empirical studies published in different disciplines of social and behavioral sciences are being re-analyzed by independent researchers.

More than 500 re-analysts applied for the project and we have almost all the papers being already analyzed by some, but we did not get enough volunteers for papers from Economics, International Relations, and Political Science. Probably, our network is very psychological.

As you are not just a statistician but also a political scientist, we were wondering if you had any options to put on your blog or send around the following ad to some bigger crowd in any of these areas?

OK, here it is!

Giving an honest talk

This is Jessica. I gave a talk about a month ago at the Martin Zelen symposium at Harvard, which was dedicated to visualization this year. Other speakers were Alberto Cairo, Amanda Cox, Jeff Heer, Alvitta Ottley, Lace Padilla, and Hadley Wickham, and if you’re interested in visualization you can watch them here (my talk starts around 2:08:00). But this post is not so much about the content as about the questions I found myself asking as I put together the talk.

As an academic, I give a lot of talks, and I’ve always spent a fair amount of time on making slides and practicing talks, I guess because I hate feeling unprepared. Having come up as a grad student in visualization/human computer interaction there was always a high premium on giving polished talks with well designed slides with lots of diagrams/visuals, which is annoying since I’m not very gifted at graphic design and have to do a lot of work to make things not look bad. But, as a student and junior faculty member I enjoyed giving talks because so many of my talks were technical talks of the kind you give at conferences. Even if they take a while to prepare they are easy, mindless even, because you’re constrained to presenting what you did. There is never time to get too far into the motivation, so you usually repeat whatever spiel you used to motivate the problem in the intro of your paper and then quickly get into the details. 

But, inevitably (at least in computer science) you get more senior and your students give most of the technical talks, while you do more invited lectures and keynotes where you’re expected to talk about whatever you want. So what do you do with that time? As a junior faculty member, I still treated the more open-ended invited talks I did like longer technical talks, since part of being pre-tenure is showing that you can lead students to produce a cohesive body of technical work. But the more senior I get, the more I question how to treat invited talks. 

One philosophy is to continue treating invited talks as an opportunity to present some recent subset of work in your lab, where you string together a few technical talks but with a little more motivation/vision. This is nice because you can highlight the work of the grad students and give a little bit more of a big picture but still get into some details. But lately the thought of giving talks like this seems constrained and rote to me, similar to the way individual paper tech talks do, since often most of the work is already done and concerns ideas I might have had a relatively long time ago. Also, because you’re focused on presenting the individual projects and what each contributes, you’re more or less stuck with the way you motivated the stuff in the original papers. So at least to me, it can feel like the bulk of the talk is dead and I’m just reviving it to parade around in front of people so that they get something polished with some high level message. If you give a lot of talks, it gets harder and harder to feign enthusiasm about topics you haven’t really thought twice about since you finished that project, or worse, topics you’ve thought a lot about and have some issues with.

As I’ve probably mentioned before on this blog, one issue I have lately is that I’m more interested in critiquing some of my past work than I am in selling it, since critiquing it helps me figure out where to go next. I tend to care, perhaps more than other computer scientists, about the philosophy that underlies the projects I take on. Repeatedly questioning why/if what I’m doing is useful helps me figure out which of the many things I could do next are actually worth the time. The specific technical solutions can be fun to think about but they don’t really do a good job of representing what I’ve learned over my career.

But since the vision, at least for me, shifts slightly with each project, how do you get the talks to feel dynamic in a way that matches that? Talking about work in progress could be better, but I find it hard to implement this in a successful way. Sometimes you sense that the reason you were invited is to impart some wisdom on the topic you work on, and so if the importance or potential impact of the in-progress work is still something you’re figuring out (which it generally is for me for the projects I’m most excited about), then you risk not delivering much for them to take away as a message. I guess my premise for the ideal talk is that it gives the audience something useful to think about, without alienating them, and provides me some value as presenter. It’s not clear to me that many people in most audiences I speak to would benefit from jumping with me into the weeds of what I haven’t yet figured out.

So, at least in making the talk for the Zelen symposium, which I knew would be seen by people I respect in my field (like the other speakers) but would also need to be accessible to the broader audience in attendance, I found myself racking my brain for what I could say that would be a) useful, b) interesting, but most importantly c) an honest portrayal of what I was questioning at that moment. Eventually I settled on something that seemed like a good compromise – motivating better uncertainty visualizations in the beginning, then admitting I didn’t really think visualizing uncertainty alone solves many problems because satisficing is so common, then suggesting the idea of visualizations as model checks as a broader framework for thinking about how to use a visualization to communicate. But it was very hard to fit this into 20 minutes. I had to motivate the problems that led to lots of my work, then carefully back up to question the solutions, but still provide some resolution or “insight” so it looked like I deserved to be invited to speak in the first place.

Anyway, the specifics don’t matter that much; the point is that sometimes it can be very hard to find a “statement” that both expresses your honest current viewpoint about a topic you’re an expert in and is somewhat palatable to people who don’t have nearly as much of the backstory. This is why I hate when people ask you to give a talk and then say, “You don’t have to prep anything new, you can just use an old talk.” No, actually I can’t.

In this case, I’m not sure how successful the Zelen talk was. It worked well for me, and I suspect some people in the audience liked it, but I also got fewer questions than anyone else, so it’s hard to say how many people I lost. It got me thinking that maybe the more honest and “current” the ideas I present in a talk, the less I will connect with audiences. It’s like the idea of an honest talk implies that you’ll lose more people, because you’ll have to tell a more complicated story than the one that you yourself were once fooled by. Sometimes to provide a sense of context on the types of problems your work tries to solve it makes more sense to admit the shortcomings of all the existing solutions, including your own, rather than cheerleading your old work. Maybe I just need to be ok with that and not give in to the pressures I perceive for polished, enthusiastic talks. It reminds me of Keith’s comment on a previous post: CS Peirce once wrote that the best compliment anyone ever paid him, though likely meant as an insult, was roughly “the author does not seem completely convinced by his own arguments”.

All of this also makes me think of Andrew’s talks, which he has mentioned he sometimes gets mixed responses to. As far as I can tell, he’s perfected the art of the honest talk. There are no slides; while there might be some high level talking points, what he says seems spontaneous; and there might not be any obvious take-home message, because it’s not some heavily scripted performance, it’s a window into how he’s thinking right now about some topic. I wish more people were willing to experiment and give us bad but honest talks.

Controversy over California Math Framework report

Gur Huberman points to this document from Stanford math professor Brian Conrad, criticizing a recent report on the California Math Framework, which is a controversial new school curriculum.

Conrad’s document includes two public comments.

Comment #1 is a recommended set of topics for high school mathematics prior to calculus. I really agree with some but not all of these recommendations (I like the bits about problem solving and modeling with functions; I’m skeptical that high school students need to learn how to add, subtract, multiply, and divide complex numbers; I don’t really buy how they recommend covering probability and statistics; and if it were up to me I’d drop trigonometry entirely). I think their plan is aspirational and the kind of thing that a couple of math professors might come up with; I wouldn’t characterize those topics as “crucial in high school math training for a student who might conceivably wish to pursue a quantitative major in college, including data science.” Sure, knowing sin and cos can’t hurt, but I don’t see them as crucial or even close to it.

Comment #2 is the fun part, eviscerating the California Math Framework report. Here’s how Conrad leads off:

The Mathematics Framework Second Field Review (often called the California Mathematics Framework, or CMF) is a 900+ page document that is the outcome of an 11-month revision by a 5-person writing team supervised by a 20-person oversight team. As a hefty document with a large number of citations, the CMF gives the impression of being a well-researched and evidence-based proposal. Unfortunately, this impression is incorrect.

I [Conrad] read the entire CMF, as well as many of the papers cited within it. The CMF contains false or misleading descriptions of many citations from the literature in neuroscience, acceleration, de-tracking, assessments, and more. (I consulted with three experts in neuroscience about the papers in that field which seemed to be used in the CMF in a concerning way.) Often the original papers arrive at conclusions opposite those claimed in the CMF. . . .

I’m not sure about this “conclusions opposite those claimed” thing, but it does seem that the CMF smoothed the rough edges of the published research, presenting narrow results as general statements. Conrad writes:

The CMF contains many misrepresentations of the literature on neuroscience, and statements betraying a lack of understanding of it. . . . A sample misleading quote is “Park and Brannon (2013) found that when students worked with numbers and also saw the numbers as visual objects, brain communication was enhanced and student achievement increased.” This single sentence contains multiple wrong statements (1) they worked with adults and not students; (2) their experiments involved no brain imaging, and so could not demonstrate brain communication; (3) the paper does not claim that participants saw numbers as visual objects: their focus was on training the approximate number system. . . .

The CMF selectively cites research to make points it wants to make. For example, Siegler and Ramani (2008) is cited to claim that “after four 15-minute sessions of playing a game with a number line, differences in knowledge between students from low-income backgrounds and those from middle-income backgrounds were eliminated”. In fact, the study was specifically for pre-schoolers playing a numerical board game similar to Chutes and Ladders and focused on their numerical knowledge, and at least five subsequent studies by the same authors with more rigorous methods showed smaller positive effects of playing the game that did not eliminate the differences. . . .

In some places, the CMF has no research-based evidence, as when it gives the advice “Do not include homework . . . as any part of grading. Homework is one of the most inequitable practices of education.” The research on homework is complex and mixed, and does not support such blanket statements. . . .

Chapter 8, lines 1044-1047: Here the CMF appeals to a paper (Sadler, Sonnert, 2018) as if that paper gives evidence in favor of delaying calculus to college. But the paper’s message is opposite what the CMF is suggesting. The paper controls for various things and finds that mastery of the fundamentals is a more important indicator of success in college calculus than is taking calculus in high school. There is nothing at all surprising about this: mastery of the fundamentals is most important. The paper is simply quantifying that effect (this is the CMF’s “double the positive impact”), and also studying some other things. What the paper does not find is that taking calculus first in college leads to greater success in that course. To the contrary, it finds that for students at all levels of ability who take calculus in high school and again in college (which the authors note near the end omits the population of strongest students who ace the AP and move on in college) do better in college calculus than those who didn’t take it in high school (controlling for other factors). The benefit accrued is higher for those who took it in high school with weaker background, which again is hardly a surprise if one thinks about it (as Sadler and Sonnert note, that high school experience reinforces fundamental skills, etc.). If one only looks at the paper’s abstract then one might get a mistaken sense as conveyed in the CMF about the meaning of the paper’s findings. But if one actually reads the paper, then the meaning of its conclusions becomes clearer, as described above. . . .

Here’s a juicy one:

Chapter 12, lines 221-228 . . . the CMF makes the dramatic unqualified claim that:

“if teachers shifted their practices and used predominantly formative assessment, it would raise the achievement of a country, as measured in international studies, from the middle of the pack to a place in the top five.”

Conrad goes on to explain how this claim was not supported by the study being cited, but, yeah, in any case it’s a ridiculous thing to be claiming in the first place, all the way to the pseudo-precision of “top five.”

One problem seems to be that the report had no critical readers on the inside who could take the trouble to go through the report with an eye to common sense. This is important stuff, dammit! The California school board, or whatever it’s called, should have higher standards than the National Academy of Sciences reporting on himmicanes or the American Economic Association promoting junk science regarding climate change.

I don’t agree with all of Conrad’s criticisms, though. For example, he writes:

The CMF claims Liang et al (2013) and Domina et al (2015) demonstrated that “widespread acceleration led to significant declines in overall mathematics achievement.” As discussed in §4, Liang et al actually shows that accelerated students did slightly better than non-accelerated ones in standardized tests. In Domina et al, the effect is 7% of a standard deviation (not “7%” in an absolute sense, merely 0.07 times a standard deviation, a very tiny effect). Such minor effects are often the result of other confounders, and are far below anything that could be considered “significant” in experimental work.

I agree that effects can often be explained by other confounders, but I wouldn’t say that a 0.07 standard deviation effect is “very tiny.” A standard deviation is huge, and 7% of a standard deviation is not nothing. I agree that the report isn’t helping by using the term “significant” here. The thing that really confuses me here is . . . did the report really claim that Liang et al. (2013) found that acceleration caused significant declines, when Liang et al. actually found that accelerated students did better? Whassup with that? I’m not completely sure but I think the paper he’s referring to is this one by Jian-Hua Liang, Paul E. Heckman, and Jamal Abedi (2012), which doesn’t seem to say anything about acceleration leading to significant declines, while at the same time I don’t see it reporting that accelerated students did better. That article concludes: “The algebra policy did encourage schools and districts to presumably enroll more students into algebra courses and then take the CST for Algebra I. However, among the students in our study, the algebra-for-all policy did not appear to have encouraged a more compelling set of classroom and school-wide learning conditions that enhanced student understanding and learning of critical knowledge and skills of algebra, as we have previously discussed.” I don’t see this supporting the claims of the CMF or of Conrad.
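
To put the 0.07 standard deviation figure discussed above on a more familiar scale, here is a rough sketch in Python. It assumes that test scores are normally distributed, which is an assumption made here purely for illustration and is not taken from Domina et al., the CMF, or Conrad. The function name percentile_after_shift is also made up for this example. The idea is simply to ask where a student who starts at a given percentile would land if their score moved by some fraction of a standard deviation.

```python
from statistics import NormalDist


def percentile_after_shift(effect_sd: float, start_percentile: float = 50.0) -> float:
    """Percentile reached after a score shifts by `effect_sd` standard deviations.

    Assumes normally distributed scores (an illustrative assumption only).
    """
    if not 0.0 < start_percentile < 100.0:
        raise ValueError("start_percentile must be strictly between 0 and 100")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Convert the starting percentile to a z-score, shift it, convert back.
    z_start = std_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * std_normal.cdf(z_start + effect_sd)


if __name__ == "__main__":
    # Compare the 0.07 SD effect with a full standard deviation.
    for effect in (0.07, 1.0):
        new_pct = percentile_after_shift(effect)
        print(f"shift of {effect:.2f} SD: 50th percentile -> {new_pct:.1f}th percentile")
```

Under this normality assumption, a shift of 0.07 standard deviations moves a student at the median to roughly the 53rd percentile, while a full standard deviation moves them to roughly the 84th. That is one way to read both halves of the statement above: a standard deviation is huge, and 7% of one is small but not nothing.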


It’s a complicated story. Conrad seems to be correct that this California Math Framework is doing sloppy science reporting in the style of Gladwell/Freakonomics/NPR/Ted, using published research to tell a story without ever getting a handle on what those research papers are really saying or whether the claims even make sense. Unfortunately, the real story seems to be:

(a) The different parties to this dispute (Conrad and the authors of the California report) each have strong opinions about mathematics education, opinions which have been formed by their own experiences and which they will support using their readings of the research literature, which is, unfortunately, uneven and inconclusive.

(b) We don’t know much about what works or doesn’t work.

(c) The things that work aren’t policies or mandates but rather things going on at the level of individual schools, teachers, and students.

The most important aspect of policy would then seem to be doing what it takes to facilitate learning. At the same time, some curricular standards need to be set. The CMF has strong views not really supported by the data; Conrad and his colleagues have their own strong views, which are open to debate as well. I don’t feel comfortable stating a position on the policy recommendations being thrown around, but I do think Conrad is doing a service by pointing out these issues.

A journal is like a crew

The police department, is like a crew
It does whatever they want to do
In society you have illegal and legal
We need both, to make things equal
So legal is tobacco, illegal is speed
Legal is aspirin, illegal is weed
Crack is illegal, cause they cannot stop ya
But cocaine is legal if it’s owned by a doctor
Everything you do in private is illegal
Everything’s legal if the government can see you
Don’t get me wrong, America is great place to live
But listen to the knowledge I give . . .
— BDP.

Someone pointed me to an iffy paper appearing in a prestigious scientific journal. At first I was annoyed. One more??? Himmicanes, air rage, ages ending in 9, and all the rest . . . that’s not enough for them?

But then I thought, nah, it’s all good. This outlet is their journal, just like Statistical Modeling, Causal Inference, and Social Science is our blog. We can publish anything we want here. Through years of diligent effort, we have built an audience of people who will read what we write here, and who will give us a hearing, even if they don’t always agree with us (and even if we, the bloggers here, don’t always agree with each other).

Similarly, the organization that publishes that journal has, through many years of diligent effort, built an effective brand. They decided in their wisdom to give publishing power for their proceedings to their editors, some of whom in turn use that power to promote a certain kind of science. I’d call it junk social science or scientism; I guess they would call it the real stuff. In any case, it’s their journal! To be upset that they publish unsupported claims in social science (along with lots of good stuff too) would be like being upset that Statistical Modeling, Causal Inference, and Social Science publishes too many cat pictures.

P.S. I have no link to the particular paper that was sent to me, partly because I don’t actually remember what it is or even what it’s about (I wrote this post a while ago!) and partly because the details don’t really matter. Indeed, you, the reader, might like this particular paper and not think it’s “scientistic” at all, in which case my point is better made in general terms, so that you can imagine some characteristically bad tabloid-style social science article in its place. You could also forget this particular journal entirely and think about some other journal such as the Journal of Economic Perspectives, which published and never retracted that notorious gremlins paper. The econ department is like a crew. It’s crews all the way down.

P.P.S. Again, this journal publishes lots of good stuff too, including but not limited to my own publications there! I guess it’s best to think of a journal not as a unified entity but rather as a loose agglomeration of mini-journals, some of which focus on the serious stuff and some of which go more for drama and publicity than for scientific accuracy. To slam all of this journal for the bad stuff would be like slamming everything coming out of Columbia University just because we have Dr. Oz, or slamming everything coming out of the University of California just because they have that sleep guy.