Academic jobs in Bayesian workflow and decision making

This job post (with two research topics) is by Aki (I promise that next time I post about something else).

I’m looking for postdocs and doctoral students to work with me on Bayesian workflow at Aalto University, Finland. You can apply through a joint call (with many other related topics); there are separate application forms for postdocs and for doctoral students.

We’re also looking for postdocs and doctoral students to work on probabilistic modeling for assisting human decision making, with funding from the Finnish Center for Artificial Intelligence. You can apply through a joint call (with many more probabilistic modeling topics) using the application form.

To get some idea of how we might approach these topics, you can check what I’ve recently been talking about and working on.

For five years straight, starting in 2018, the World Happiness Report has singled out Finland as the happiest country on the planet.

Kaiser Fung’s review of “Don’t Trust Your Gut: Using Data to Get What You Really Want in Life” (and a connection to “Evidence-based medicine eats itself”)

Kaiser writes:

Seth Stephens-Davidowitz has a new book out early this year, “Don’t Trust Your Gut”, which he kindly sent me for review. The book is Malcolm Gladwell meets Tim Ferriss – part counterintuition, part self-help. Seth tackles big questions: how to find love? how to raise kids? how to get rich? how to be happier? He invariably believes that big data reveal universal truths on such matters. . . .

Seth’s book interests me as a progress report on the state of “big data analytics”. . . .

The data are typically collected by passive observation (e.g. tax records, dating app usage, artist exhibit schedules). Meaningful controls are absent (e.g. no non-app users, no failed artists). The dataset is believed to be complete. The data aren’t specifically collected for the analysis (an important exception is the happiness data collected from apps for that specific purpose). Several datasets are merged to investigate correlations.

Much – though not all – of the analyses use the most rudimentary statistics, such as statistical averages. This can be appropriate, if one insists one has all the data, or “essentially” all. An unstated axiom is that the sheer quantity of data crowds out any bias. This is not a new belief: as long as Google has existed, marketing analysts have always claimed that Google search data are fully representative of all searches since Google dominates the market. . . .

If the analyst incorporates model adjustments, these adjusted models are treated as full cures of all statistical concerns. [For example, the] last few chapters on activities that cause happiness or unhappiness report numerous results from adjusted models of underlying data collected from 60,000 users of specially designed mobile apps. The researchers broke down 3 million logged events by 40 activity types, hour of day, day of week, season of year, location, among other factors. For argument’s sake, let’s say the users came from 100 places, ignore demographic segmentation, and apply zero exclusions. Then, the 3 million points fell into 40*24*7*4*100 = 2.7 million cells… unevenly but if evenly, each cell has an average of 1.1 events. That means many cells contain zero events. . . . The estimates in many cells reflect an underlying model that hasn’t been confirmed with data – and the credibility of these estimates rests with the reader’s trust in the model structure.

I observed a similar phenomenon when reading the well-known observational studies of Covid-19 vaccine effectiveness. Many of these studies adjust for age, an obvious confounder. Having included the age term, which quite a few studies proclaimed to be non-significant, the researchers spoke as if their models were free of any age bias.
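Kaiser’s back-of-the-envelope cell count is easy to check, and a quick Poisson sketch (using his 40/24/7/4/100 breakdown; the even-spread assumption is the generous one) shows just how sparse those cells get even in the best case:

```python
import math

# Kaiser's cell breakdown: 40 activities x 24 hours x 7 days x 4 seasons x 100 places
cells = 40 * 24 * 7 * 4 * 100          # 2,688,000 cells, i.e. ~2.7 million
events = 3_000_000
lam = events / cells                   # average events per cell, ~1.1

# Even under the most favorable assumption (events spread uniformly),
# the count in a cell is roughly Poisson(lam), so the expected share
# of cells with zero events is exp(-lam).
empty_share = math.exp(-lam)
print(f"{cells:,} cells, {lam:.2f} events/cell, ~{empty_share:.0%} expected empty")
# -> 2,688,000 cells, 1.12 events/cell, ~33% expected empty
```

So even in the rosiest scenario, roughly a third of the cells are empty and their estimates come entirely from the model structure, which is exactly Kaiser’s point.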

Kaiser continues:

A blurred line barely delineates using data as explanation and as prescription.

Take, for example, the revelation that people who own real estate businesses have the highest chance of being a top 0.1% earner in the U.S., relative to other industries. This descriptive statistic is turned into a life hack, that people who want to get rich should start real-estate businesses. Nevertheless, being able to explain past data is different from being able to predict the future. . . .

And, then, Kaiser’s big point:

Most of the featured big-data research aims to discover universal truths that apply to everyone.

For example, an eye-opening chart in the book shows that women who were rated bottom of the barrel in looks have half the chance of getting a response in a dating app when they messaged men in the most attractive bucket… but the absolute response was still about 30%. This produces the advice to send more messages to presumably “unattainable” prospects.

Such a conclusion assumes that the least attractive women are identical to the average women on factors other than attractiveness. It’s possible that such women who approach the most attractive-looking men have other desirable assets that the average woman does not possess.

It’s an irony because with “big data”, it should be possible to slice and dice the data into many more segments, moving away from the world of “universal truths,” which are statistical averages . . .

This reminds me of a post from a couple years ago, Evidence-based medicine eats itself, where I pointed out the contradiction between two strands of what is called “evidence-based medicine”: the goal of treatments targeted to individuals or subsets of the population, and the reliance on statistically significant results from randomized trials. Statistical significance is attained by averaging, which is the opposite of what needs to be done to make individualized or local recommendations.

Kaiser concludes with a positive recommendation:

As with Gladwell, I recommend reading this genre with a critical eye. Think of these books as offering fodder to exercise your critical thinking. Don’t Trust Your Gut is a light read, with some intriguing results of which I was not previously aware. I enjoyed the book, and have kept pages of notes about the materials. The above comments should give you a guide should you want to go deeper into the analytical issues.

I think there is a lot more that can be done with big data; we are just seeing the tip of the iceberg. So I agree with Seth that the potential is there. Seth is more optimistic about the current state than I am.

Imperfectly Bayesian-like intuitions reifying naive and dangerous views of human nature

Allan Cousins writes:

After reading your post entitled “People are complicated” and the discussion that ensued, I [Cousins] find it interesting that you and others didn’t relate the phenomenon to the human propensity to bound probabilities into 3 buckets (0%, coin toss, 100%), and how that interacts with anchoring bias. It seems natural that if we (meaning people at large) do that across most domains, we would apply the same in our assessment of others. Since we are likely to have more experiences with certain individuals on one side of the spectrum or the other (given we tend to only see people in particular rather than varied circumstances), it’s no wonder we tend to fall into the dichotomous trap of treating people as if they are only good or bad; obviously the same applies if we don’t have personal experiences but only see or hear things from afar. Similarly, even if we come to know other circumstances that would oppose our selection (e.g. someone we’ve classified as a “bad person” performs some heroic act), we are apt to have become anchored on our previous selection (good or bad), and that reduces the reliance we might place on the additional information in our character assessment. Naturally our human tendencies lead us to “forget” about that evidence if ever called upon to make a similar assessment in the future. In a way it’s not dissimilar to why we implement reverse counting in numerical analysis. When we perform these social assessments it is as if we are always adding small numbers (additional circumstances) to large numbers (our previous determination / anchor), and the small numbers, when compared to the large number, are truncated and rounded away, possibly leaving our determination hopelessly incorrect!
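Cousins’s numerical-analysis analogy is, I take it, about summing floating-point numbers smallest-first rather than largest-first; here’s a minimal demonstration of the truncation he has in mind (the specific magnitudes are my choice for illustration):

```python
# One big anchor plus a hundred small updates.
# At magnitude 1e16, double precision spaces adjacent values 2 apart,
# so a +1.0 step added to the running total is rounded away.
vals = [1e16] + [1.0] * 100

big_first = 0.0
for v in vals:               # anchor first, small updates after
    big_first += v           # each +1.0 vanishes against 1e16

small_first = 0.0
for v in reversed(vals):     # accumulate the small numbers first
    small_first += v         # they add up to 100.0 before meeting the anchor

print(big_first == 1e16)          # True: the hundred 1.0s were lost
print(small_first == 1e16 + 100)  # True: small-first summation keeps them
```

Same inputs, different order, different answer: once the anchor dominates, each new small piece of evidence rounds to nothing.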

This reminds me of the question that comes up from time to time, of what happens if we use rational or “Bayesian” inference without fully accounting for the biases involved in what information we see.

The simplest example is if someone rolls a die a bunch of times and tells us the results, which we use to estimate the probability that the die will come up 6. If that someone gives us a misleading stream of information (for example, telling us about all the 6’s but only a subset of the 1’s, 2’s, 3’s, 4’s, and 5’s) and we don’t know this, then we’ll be in trouble.
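Here’s a quick simulation of that die example (the 50% reporting rate for non-6s is an arbitrary choice for illustration):

```python
import random

random.seed(1)
rolls = [random.randrange(1, 7) for _ in range(100_000)]  # fair die

# The informant reports every 6, but each non-6 only with probability 0.5.
reported = [r for r in rolls if r == 6 or random.random() < 0.5]

# If we naively treat the reported rolls as a fair sample, our estimate
# of P(6) converges to (1/6) / (1/6 + (5/6)*0.5) = 2/7 ~ 0.286,
# well above the true 1/6 ~ 0.167.
naive_est = sum(r == 6 for r in reported) / len(reported)
print(f"naive estimate of P(6): {naive_est:.3f}")
```

The die is fair and every individual report is true; the bias comes entirely from the selection of which rolls we get to see.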

The linked discussion involves the idea that it’s easy for us to think of people as all good or all bad, and my story about a former colleague who had some clear episodes of good and clear episodes of bad is a good reminder that (a) people are complicated, and (b) we don’t always see this complication given the partial information available to us. From a Bayesian perspective, I’d say that Cousins is making the point that the partial information available to us can, if we’re not careful, be interpreted as supporting a naive bimodal view of human nature, thus leading to a vicious cycle or unfortunate feedback mechanism where we become more and more set in this erroneous model of human nature.

Propagation of responsibility

When studying statistical workflow, or just doing applied statistics, we think a lot about propagation of uncertainty. Today’s post is about something different: it’s about propagation of responsibility in a decision setting with many participants. I’ll briefly return to workflow at the end of this post.

The topic of propagation of responsibility came up in our discussion the other day of fake drug studies. The background was a news article by Johanna Ryan, reporting on medical research fraud:

In some cases the subjects never had the disease being studied or took the new drug to treat it. In others, those subjects didn’t exist at all.

I [Ryan] found out about this crime wave, not from the daily news, but from the law firm of King & Spaulding – attorneys for GlaxoSmithKline (GSK) and other major drug companies. K&S focused not so much on stopping the crime wave, as on advising its clients how to “position their companies as favorably as possible to prevent enforcement actions if the government comes knocking.” In other words, to make sure someone else, not GSK, takes the blame. . . .

So how do multi-national companies like GSK find local clinics like Zain Medical Center or Healing Touch C&C to do clinical trials? Most don’t do so directly. Instead, they rely on Contract Research Organizations (CROs): large commercial brokers that recruit and manage the hundreds of local sites and doctors in a “gig economy” of medical research. . . .

The doctors are independent contractors in these arrangements, much like a driver who transports passengers for Uber one day and pizza for DoorDash the next. If the pizza arrives cold or the ride is downright dangerous, both Uber and the pizza parlor will tell you they’re not to blame. The driver doesn’t work for them!

Likewise, when Dr. Bencosme was arrested, the system allowed GSK to position themselves as victims not suspects. . . .

Commenter Jeremiah concurred:

I want folks to be careful in giving the Sponsor (such as GSK) any pass and putting the blame on the CRO. The Good Clinical Practices (GCP) here are pretty strong that the responsibility lies with the Sponsor to do due diligence and have appropriate processes in place (for example, see ICH E6(r2) and ICH E8(r1)).

The FDA has been flagging problems with 21 CFR 312.50 for years. In FY2021 they identified multiple observations that boil down to a failure to select qualified investigators. The Sponsor owns that, and we should never give a pass because of the nature of contract organizations in our industry.

Agreed. Without making any comment on this particular case, which I know nothing about, I agree with your general point about responsibility going up and down the chain. Without such propagation of responsibility, there are huge and at times overwhelming incentives to cheat.

It does seem that when institutions are being set up and maintained, insufficient attention is paid to maintaining the smooth and reliable propagation of responsibility. Sometimes this is challenging—it’s not always so easy to internalize externalities (“tragedies of the commons”) through side payments—but in a highly regulated area such as medical research, it should be possible, no?

And this does seem to have some connections to ideas such as influence analysis and model validation that arise in statistical workflow. The trail of breadcrumbs.

Pizzagate and “Nudge”: An opportunity lost

We all make mistakes. What’s important is to engage with our mistakes and learn from them. When we don’t, we’re missing an opportunity to learn.

Here’s an example. A few years ago there was a Harvard study that notoriously claimed that North Carolina was less democratic than North Korea. When this came out, the directors of the study accepted that the North Korea estimate was problematic and they removed it from their dataset. But I don’t think they fully engaged with the error. They lost an opportunity to learn, even though they were admirably open about their process.

Here’s a more recent example. The authors of the influential policy book Nudge came out with a new edition. In the past they, like many others, had been fooled by junk science on eating behavior, most notably that of Cornell business school professor Brian Wansink. I was curious whassup with that, so I searched around and found this news article, which had some relevant bits:

They began the [first edition of their] book with the modest example of a school administrator rearranging the display of food at the cafeteria, increasing the likelihood that kids choose a healthier lunch. . . .

In addition to new things in the new version of the book, there are old things from the original version that are gone. That includes the research of former Cornell University professor Brian Wansink, a behavioral scientist who got caught producing shoddy research that fudged numbers and misled the public about his empirical findings. . . . Thaler is cheering on the social scientists probing academic literature to suss out what can be proved and what can’t. “That’s healthy science,” he says.

That’s cool. I like that the author has this attitude: instead of attacking critics as Stasi, he accepts the value of outside criticism.

Just one thing, though. Removing Wansink’s research from the book—that’s a start, but to really do it right you should engage with the error. I don’t have a copy of either edition of the book (and, hey, before you commenters start slamming me about writing a book I haven’t read: first, this is not a book review nor does it purport to be; second, a policy book is supposed to have influence among people who don’t read it. There’s no rule, nor should there be a rule, that I can’t write skeptical things about a book if I haven’t managed to get a copy of it into my hands), but I was able to go on Amazon and take a look at the index.

Here’s the last page of the index of the first edition:

And now the new edition:

Lots of interesting stuff here! But what I want to focus on are two things:

1. Wansink doesn’t play a large role even in the first edition. He’s only mentioned once, on page 43—that’s it! So let’s not overstate the importance of this story.

2. Wansink doesn’t appear at all in the second edition! That’s the lost opportunity, a chance for the authors to say, “Hey, nudge isn’t perfect; indeed the ideas of nudging have empowered sleazeballs like Wansink, and we got fooled too. Also, beyond this, the core idea of nudging—that small inputs can have large, predictable, and persistent effects—has some deep problems.” Even if they don’t have the space in their book to go into those problems, they could still discuss how they got conned. It’s a great story and fits in well with the larger themes of their book.

Not a gotcha

Connecting Nudge to pizzagate is not a “gotcha.” As I wrote last year after someone pointed out one of my published articles: It’s not that mistakes are a risk of doing science; mistakes are a necessary part of the process.

P.S. The first edition of Nudge mentioned the now-discredited argument that there is no hot hand. Good news is that the Lords realized this was a problem and they excised all mention of the hot hand from their second edition. Bad news is that they did not mention this excision—they just memory-holed the sucker. Another opportunity for learning from mistakes was lost! On the plus side, you’ll probably be hearing these guys on NPR very soon for something or another.

B.S. pseudo-expertise has many homes. (Don’t blame the Afghanistan/Iraq war on academic specialization.)

Bert Gunter writes:

I have no opinion on this (insufficient expertise to judge), but thought you and colleagues might find it interesting if you are not already familiar with the notion.

Couple of quotes:

“Academia is in some ways nearly ideally suited to produce the wrong kinds of expertise.”

“The British government in 2020 started a website that invites individuals to make predictions and ranks them based on accuracy; in future crises, it could consult the best forecasters.”

OK, I [Gunter] will express an opinion on this latter point—it’s crazy. A website to make predictions on the lottery would also identify such “expert predictors.” Same as stock market pundits.

The article in question is an op-ed by political commentator Richard Hanania, and the central example is the bad advice given over the past 20 years for the U.S. to send thousands of troops and spend zillions of dollars dropping bombs, flying planes, and shooting people in Afghanistan and Iraq. He writes that within the U.S., these disastrous plans were formulated by credentialed authorities within the military, and that the last president of the American-backed government in Afghanistan “has a Ph.D. from Columbia and was even a co-author of a book titled ‘Fixing Failed States.’” And here he is! Ouch.

Here’s what Hanania writes:

As radical as it sounds, just because someone has a Ph.D. in political science or speaks Pashto does not make that person more likely to be able to predict what is going to happen in Afghanistan than an equally intelligent person with knowledge that appears less directly relevant. Anthropology, economics and other fields may offer insight . . .

Not so fast, pal.

I googled, and it appears that the former Afghan president’s Columbia Ph.D. was in . . . anthropology! So this example does not support the claim that political science was a problem, nor does it support the claim that “anthropology, economics and other fields may offer insight.” It’s not a good sign when your example contradicts your claim.

Beyond this, though, my main problem with the proposal in the op-ed is that it seems to be replacing open expertise with closed authority. It says, “Government should set up forecasting tournaments and remove regulatory barriers to establishing prediction markets, in addition to funding them through programs like DARPA and the National Science Foundation”—and that’s fine, I like prediction markets too, we talk about them on the blog all the time—but I also recall that the Department of Defense ran a terrorism prediction market headed by war criminal John Poindexter. Having an actual terrorist running a terrorism prediction market—talk about a conflict of interest. I’m completely serious here when I say this bothers me a lot, especially when you’re using the Afghanistan/Iraq war as an example of bad judgment. It’s hard to imagine Poindexter’s organization being used to exercise policy restraint; rather, I’m guessing it would’ve just been one more tool used to justify the latest military “surge” or whatever. What next—should we put John Yoo in charge of a torture prediction market, have a prediction market in election fraud run by Ted Cruz or a criminal justice prediction market headed by Al Sharpton?

The other thing is that I feel it’s missing the point to place the blame on academia. The examples given in the op-ed are two U.S. army generals and a policy guy who, if I’m counting correctly from his Wikipedia page, taught at universities for 14 years. It says that he was considered as a possible secretary general of the United Nations. These aren’t academics in the mold of, say, Samuel Huntington or Robert Putnam; they’re more like the kind of military/government figures who spend some time in academia to bring their real-world expertise onto campus. One of the people criticized in the op-ed was a coauthor of a book on “Fixing Failed States”: in retrospect, that does seem laughable or sad, depending on how you look at it—but certainly you can’t tag him for relying on “credentials” and “narrow forms of knowledge.” His Ph.D. in anthropology is not that much of a credential, and “Fixing Failed States” is hardly an example of a “narrow form of knowledge”—actually it sounds like the kind of interdisciplinary research that’s not narrow at all—and I’d guess that it was his real-life experience, not his “highly specialized knowledge” or “credentials,” that got him his political support.


I agree with the main point of that op-ed but not with its specifics. Or maybe I agree with the specifics but not the main point. I’m not sure.

The place I agree is that I don’t think we should let authority figures suppress debate, which is done in part through intimidation (acting like only they have expertise) and in part through establishing “facts on the ground” (in the military situation this would be actual troops or political or logistical commitments; in an intellectual debate this can involve the use of strategic contacts in the news media or academia who can push a particular agenda, as for example we saw with Fox news promoting unfounded claims of U.S. election fraud). I think the op-ed is right that the ability of authority figures to suppress and channel debate is a problem, and this is a legitimately hard problem. We have legitimate distrust of purported experts but at the same time we need real expertise.

The place I half-agree is regarding prediction markets. I agree that prediction markets are a good idea and I agree that open predictions are helpful. But I disagree with any implication that markets are the only way or even the best way to share and discuss predictions. During the 2020 election campaign, we at the Economist posted a probabilistic forecast, our friends at another outlet posted theirs, and we had some discussions and disagreements. No market was necessary. There were also prediction markets for the election, and that’s fine, but these markets had their problems too. I think it was good to have many open predictions, some market-based and some otherwise. Markets are just one aggregation mechanism, and we should be aware of their limitations.

The place I fully disagree with the op-ed is in its focus on academia as the bad guy here. Don’t get me wrong—regular readers know I hate academic pseudo-expertise, not just people like the sleep guy (who misrepresented the research literature) and the pizzagate guy (who described experiments that may never have occurred), but also people who do B.S. research like the voodoo-doll study or the ovulation-and-voting study. Some of these researchers are probably wonderful human beings, but remember that honesty and transparency are not enuf. But I digress. Yes, I have lots of problems with B.S. academic pseudo-expertise, but the problems discussed in the op-ed are not coming from B.S. academic pseudo-expertise; they’re coming from B.S. military pseudo-expertise and B.S. global-elite pseudo-expertise. Remember that terrorist in the U.S. government who was running a terrorism futures program! And don’t get me started on B.S. rich-guy pseudo-expertise. Academia is a soft target—but in this case it’s the wrong target. I bang on this point not to defend academia but because I think it’s a major mistake to let military/government/corporate/globetrotter B.S. pseudo-expertise off the hook.

Buying things vs. buying experiences (vs. buying nothing at all): Again, we see a stock-versus-flow confusion

Alex Tabarrok writes:

A nice, well-reasoned piece from Harold Lee pushing back on the idea that we should buy experiences not goods:

While I appreciate the Stoic-style appraisal of what really brings happiness, economically, this analysis seems precisely backward. It amounts to saying that in an age of industrialization and globalism, when material goods are cheaper than ever, we should avoid partaking of this abundance. Instead, we should consume services afflicted by Baumol’s cost disease, taking long vacations and getting expensive haircuts which are just as hard to produce as ever. . . .

. . . tools and possessions enable new experiences. A well-appointed kitchen allows you to cook healthy meals for yourself rather than ordering delivery night after night. A toolbox lets you fix things around the house and in the process learn to appreciate how our modern world was made. A spacious living room makes it easy for your friends to come over and catch up on one another’s lives. A hunting rifle can produce not only meat, but also camaraderie and a sense of connection with the natural world of our forefathers. . . .

The sectors of the economy that are becoming more expensive every year – which are preventing people from building durable wealth – include real estate and education, both items that are sold by the promise of irreplaceable “experiences.” Healthcare, too, is a modern experience that is best avoided. As a percent of GDP, these are the growing expenditures that are eating up people’s wallets, not durable goods. . . .

OK, first a few little things, then my main argument.

The little things

It’s fun to see someone pushing against the “buy experiences, not goods” thing, which has become a kind of counterintuitive orthodoxy. I wrote about this a few years ago, mocking descriptions of screensaver experiments and advice to go to bullfights. So, yeah, good to see this.

There are some weird bits in the quoted passage above. For one thing, that hunting rifle. What is it with happiness researchers and blood sports, anyway? Are they just all trying to show how rugged they are, or something? I eat meat, and I’m not offering any moral objection to hunting rifles—or bullfights, for that matter—but this seems like an odd example to use, given that you can get “camaraderie and a sense of connection with the natural world of our forefathers” by just taking a walk in the woods with your friends or family—no need to buy the expensive hunting rifle for that!

Also something’s off because in one place he’s using “a spacious living room” as an example of a material good that people should be spending on (it “makes it easy for your friends to come over and catch up on one another’s lives”), but then later he’s telling us to stop spending so much money on real estate. Huh? A spacious living room is real estate. Of course, real estate isn’t all about square footage, it’s also about location, location, and location—but, if your goal is to make it easy for your friends to come over, then it’s worth paying for location, no? Personally, I’d rather live around the corner from my friends and be able to walk over than to have a Lamborghini and have to shlep it through rush-hour traffic to get there. Anyway, my point is not that Lee should sell his Lambo and exchange it for a larger living room in a more convenient neighborhood; it just seems that his views are incoherent and indeed contradictory.

And then there are the slams against education and health care. I work in the education sector so I guess I have a conflict of interest in even discussing this one, but let me give Lee the benefit of the doubt and say that lots of education can be replaced by . . . books. And books are cheaper than ever! A lot of education is motivation, and maybe tricks of gamification can allow this to be done using less labor of instructors. Still, once you’ve bought the computer, these still are services (“experiences”), not durable goods. Indeed, if you’re reading your books online, then these are experiences too.

Talking about education gets people all riled up, so let’s try pushing the discussion sideways, to sports. Lee mentions “a functional kitchen and a home gym (or tennis rackets or cross-country skis).” You might want to pay someone to teach you how to use these things! I think we’re all familiar with the image of the yuppie who buys $100 sneakers and a $200 tennis racket and goes out to the court, doesn’t know what he’s doing, and throws out his back.

A lot of this seems like what Tyler Cowen calls “mood affiliation.” For example, Lee writes, “If you have a space for entertaining and are intentional about building up a web of friendships, you can be independent from the social pull of expensive cities. Build that network to the point of introducing people to jobs, and you can take the edge off, a little, of the pressure for credentialism.” I don’t get it. If you want a lifestyle that “makes it easy for your friends to come over and catch up on one another’s lives,” you might naturally want to buy a house with a large living room in a neighborhood where many of your friends live. Sure, this may be expensive, but who needs the fancy new car, the ski equipment you’ll never use, the home gym that deprives you of social connections, etc. But nooooo. Lee doesn’t want you to do that! He’s cool with the large living room (somehow that doesn’t count as “real estate”), but he’s offended that you might want to live in an expensive city. Learn some economics, dude! Expensive places are expensive because people want to live there! People want to live there for a reason. Yes, I know that’s a simplification, and there are lots of distortions of the market, but that’s still the basic idea. Similarly, wassup with this “pressure for credentialism”? I introduce people to jobs all the time. People often are hirable because they’ve learned useful skills: is that this horrible “credentialism” thing?

The big thing

The big thing, though, is that I agree with Lee and Tabarrok—goods are cheap, and it does seem wise to buy a lot of them (environmental considerations aside)—but I think they’re missing the point, for a few reasons.

First, basic economics. To the extent that goods are getting cheaper and services aren’t, it makes sense that the trend would be (a) to be consuming relatively more goods and relatively fewer services than before, but (b) to be spending a relatively greater percentage of your money on services. Just think about that one for a moment.
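To make that concrete, here’s a toy two-good budget (all the numbers are invented for illustration): goods prices halve, you respond by consuming more goods units, and yet goods take a smaller share of your spending.

```python
# Before: goods and services both cost $10/unit; you buy 10 units of each.
goods_share_before = (10 * 10) / (10 * 10 + 10 * 10)   # goods are 50% of spending

# After: goods fall to $5/unit and you respond by buying 16 units
# (60% more goods than before); services stay at $10/unit, 10 units.
goods_spend = 5 * 16       # $80
services_spend = 10 * 10   # $100
goods_share_after = goods_spend / (goods_spend + services_spend)

print(f"goods share of spending: {goods_share_before:.0%} -> {goods_share_after:.0%}")
# More goods consumed, yet a larger share of the budget goes to services.
```

That’s (a) and (b) in one example: relatively more goods consumed, relatively more money spent on services.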

Second, we consume lots more material goods than in the past. Most obviously, we substitute fuel for manual labor, both as individuals and as a society, for example using machines instead of laborers to dig ditches.

Third is the stock vs. flow thing mentioned in the title to this post. As noted, I agree with Lee and Tabarrok that it makes sense in our modern society to consume tons and tons of goods—and we do! We buy gas for our cars, we buy vacuum cleaners and washing machines and dishwashers and computers and home stereo systems and smartphones and toys for the kids and a zillion other things. The “buy experiences not things” advice is not starting from zero: it’s advice starting from the baseline that we buy lots and lots of things. We already have closets and garages and attics full of thoughtfully chosen material goods that “can enable new activities, enrich your life, extend your capabilities, and deepen your understanding of the world” (to use Lee’s words).

To put it another way, we’re already in the Lee/Tabarrok world in which we’re surrounded by material possessions with more arriving in the post every week. But, as these goods become cheaper and cheaper, it makes sense that a greater proportion of our dollars will be spent on experiences. To try to make the flow of possessions come in even faster and more luxuriously, to the extent of abandoning your education, not going to the doctor, and not living in a desirable neighborhood—that just seems perverse, more of a sign of ideological commitment than anything else.

One more thing

In an important way, all of this discussion, including mine, is in a bubble. If you’re choosing between a fancy kitchen and home gym, a dream vacation complete with bullfight tickets, and a Columbia University education, you’re already doing well financially.

So far we’ve been talking about two ways to spend your money: on things or experiences. But there’s a third goal: security. People buy the house in the nice neighborhood not just for the big living room (that’s a material good that Lee approves of) or to have a shorter commute (an experience, so he’s not so thrilled about that one, I guess), but also to avoid crime and to allow their kids to go to good schools. These are security concerns. Similarly we want reliable health care not for material gain or because it’s a fun experience but because we want some measure of security (while recognizing that none of us will live forever). Similarly for education too: we want the experience of learning and the shiny things we can buy with our future salaries but also future job and career security. So it’s complicated, but I don’t know that either of the solutions on offer—buying more home gym equipment or buying more bullfight tickets—is the answer.

My (remote) talks at UNC biostat, 12-13 May

I was invited to give three talks at the biostatistics department of the University of North Carolina. I wasn’t sure what to talk about, so I gave them five options and asked them to choose three.

I’ll share the options with you, then you can guess which three they chose.

Here were the options:

1. All the ways that Bayes can go wrong

Probability theory is false. Weak priors give strong and implausible posteriors. If you could give me your subjective prior I wouldn’t need Bayesian inference. The best predictive model averaging is non-Bayesian. There will always be a need to improve our models. Nonetheless, we still find Bayesian inference to be useful. How can we make the best use of Bayesian methods in light of all their flaws?

2. Piranhas, kangaroos, and the failure of apparent open-mindedness: The connection between unreplicable research and the push-a-button, take-a-pill model of science

There is a replication crisis in much of science, and the resulting discussion has focused on issues of procedure (preregistration, publication incentives, and so forth) and statistical concepts such as p-values and statistical significance. But what about the scientific theories that were propped up by these unreplicable findings–what can we say about them? Many of these theories correspond to a simplistic view of the world which we argue is internally inconsistent (the piranha problem) involving quantities that cannot be accurately learned from data (the kangaroo problem). We discuss connections between these theoretical and statistical issues and argue that it has been a mistake to consider each of these studies and literatures on their own.

3. From sampling and causal inference to policy analysis: Interactions and the challenges of generalization

The three central challenges of statistics are generalizing from sample to population, generalizing from control to treated group, and generalizing from observed data to underlying constructs of interest. These are associated with separate problems of sampling, causal inference, and measurement, but in real decision problems all three issues arise. We discuss the way in which varying treatment effects (interactions) bring sampling concerns into causal inference, along with the challenges of applying this insight to real problems. We consider applications in medical studies, A/B testing, social science research, and policy analysis.

4. Statistical workflow

Statistical modeling has three steps: model building, inference, and model checking, followed by possible improvements to the model and new data that allow the cycle to continue. But we have recently become aware of many other steps of statistical workflow, including simulated-data experimentation, model exploration and understanding, and visualizing models in relation to each other. Tools such as data graphics, sensitivity analysis, and predictive model evaluation can be used within the context of a topology of models, so that data analysis is a process akin to scientific exploration. We discuss these ideas of dynamic workflow along with the seemingly opposed idea that statistics is the science of defaults. We need to expand our idea of what data analysis is, in order to make the best use of all the new techniques being developed in statistical modeling and computation.

5. Putting it all together: Creating a statistics course combining modern topics with active student engagement

We envision creating a new introductory statistics course, combining several innovations: (a) a new textbook focusing on modeling, visualization, and computing, rather than estimation, testing, and mathematics; (b) exams and homework exercises following this perspective; (c) drills for in-class practice and learning; and (d) class-participation activities and discussion problems. We will discuss what’s been getting in the way of this happening already, along with our progress creating a collection of stories, class-participation activities, and computer demonstrations for a two-semester course on applied regression and causal inference.

OK, time to guess which three talks they picked. . . .

Controversy over California Math Framework report

Gur Huberman points to this document from Stanford math professor Brian Conrad, criticizing a recent report on the California Math Framework, which is a controversial new school curriculum.

Conrad’s document includes two public comments.

Comment #1 is a recommended set of topics for high school mathematics prior to calculus. I agree with some but not all of these recommendations (I like the bits about problem solving and modeling with functions; I’m skeptical that high school students need to learn how to add, subtract, multiply, and divide complex numbers; I don’t really buy how they recommend covering probability and statistics; and if it were up to me I’d drop trigonometry entirely). I think their plan is aspirational and the kind of thing that a couple of math professors might come up with; I wouldn’t characterize those topics as “crucial in high school math training for a student who might conceivably wish to pursue a quantitative major in college, including data science.” Sure, knowing sin and cos can’t hurt, but I don’t see them as crucial or even close to it.

Comment #2 is the fun part, eviscerating the California Math Framework report. Here’s how Conrad leads off:

The Mathematics Framework Second Field Review (often called the California Mathematics Framework, or CMF) is a 900+ page document that is the outcome of an 11-month revision by a 5-person writing team supervised by a 20-person oversight team. As a hefty document with a large number of citations, the CMF gives the impression of being a well-researched and evidence-based proposal. Unfortunately, this impression is incorrect.

I [Conrad] read the entire CMF, as well as many of the papers cited within it. The CMF contains false or misleading descriptions of many citations from the literature in neuroscience, acceleration, de-tracking, assessments, and more. (I consulted with three experts in neuroscience about the papers in that field which seemed to be used in the CMF in a concerning way.) Often the original papers arrive at conclusions opposite those claimed in the CMF. . . .

I’m not sure about this “conclusions opposite those claimed” thing, but it does seem that the CMF smoothed the rough edges of the published research, presenting narrow results as general statements. Conrad writes:

The CMF contains many misrepresentations of the literature on neuroscience, and statements betraying a lack of understanding of it. . . . A sample misleading quote is “Park and Brannon (2013) found that when students worked with numbers and also saw the numbers as visual objects, brain communication was enhanced and student achievement increased.” This single sentence contains multiple wrong statements: (1) they worked with adults and not students; (2) their experiments involved no brain imaging, and so could not demonstrate brain communication; (3) the paper does not claim that participants saw numbers as visual objects: their focus was on training the approximate number system. . . .

The CMF selectively cites research to make points it wants to make. For example, Siegler and Ramani (2008) is cited to claim that “after four 15-minute sessions of playing a game with a number line, differences in knowledge between students from low-income backgrounds and those from middle-income backgrounds were eliminated”. In fact, the study was specifically for pre-schoolers playing a numerical board game similar to Chutes and Ladders and focused on their numerical knowledge, and at least five subsequent studies by the same authors with more rigorous methods showed smaller positive effects of playing the game that did not eliminate the differences. . . .

In some places, the CMF has no research-based evidence, as when it gives the advice “Do not include homework . . . as any part of grading. Homework is one of the most inequitable practices of education.” The research on homework is complex and mixed, and does not support such blanket statements. . . .

Chapter 8, lines 1044-1047: Here the CMF appeals to a paper (Sadler, Sonnert, 2018) as if that paper gives evidence in favor of delaying calculus to college. But the paper’s message is opposite what the CMF is suggesting. The paper controls for various things and finds that mastery of the fundamentals is a more important indicator of success in college calculus than is taking calculus in high school. There is nothing at all surprising about this: mastery of the fundamentals is most important. The paper is simply quantifying that effect (this is the CMF’s “double the positive impact”), and also studying some other things. What the paper does not find is that taking calculus first in college leads to greater success in that course. To the contrary, it finds that for students at all levels of ability who take calculus in high school and again in college (which the authors note near the end omits the population of strongest students who ace the AP and move on in college) do better in college calculus than those who didn’t take it in high school (controlling for other factors). The benefit accrued is higher for those who took it in high school with weaker background, which again is hardly a surprise if one thinks about it (as Sadler and Sonnert note, that high school experience reinforces fundamental skills, etc.). If one only looks at the paper’s abstract then one might get a mistaken sense as conveyed in the CMF about the meaning of the paper’s findings. But if one actually reads the paper, then the meaning of its conclusions becomes clearer, as described above. . . .

Here’s a juicy one:

Chapter 12, lines 221-228 . . . the CMF makes the dramatic unqualified claim that:

“if teachers shifted their practices and used predominantly formative assessment, it would raise the achievement of a country, as measured in international studies, from the middle of the pack to a place in the top five.”

Conrad goes on to explain how this claim was not supported by the study being cited, but, yeah, in any case it’s a ridiculous thing to be claiming in the first place, all the way to the pseudo-precision of “top five.”

One problem seems to be that the report had no critical readers on the inside who could take the trouble to go through the report with an eye to common sense. This is important stuff, dammit! The California school board, or whatever it’s called, should have higher standards than the National Academy of Sciences reporting on himmicanes or the American Economic Association promoting junk science regarding climate change.

I don’t agree with all of Conrad’s criticisms, though. For example, he writes:

The CMF claims Liang et al (2013) and Domina et al (2015) demonstrated that “widespread acceleration led to significant declines in overall mathematics achievement.” As discussed in §4, Liang et al actually shows that accelerated students did slightly better than non-accelerated ones in standardized tests. In Domina et al, the effect is 7% of a standard deviation (not “7%” in an absolute sense, merely 0.07 times a standard deviation, a very tiny effect). Such minor effects are often the result of other confounders, and are far below anything that could be considered “significant” in experimental work.

I agree that effects can often be explained by other confounders, but I wouldn’t say that a 0.07 standard deviation effect is “very tiny.” A standard deviation is huge, and 7% of a standard deviation is not nothing. I agree that the report isn’t helping by using the term “significant” here. The thing that really confuses me here is . . . did the report really claim that Liang et al. (2013) found that acceleration caused significant declines, when Liang et al. actually found that accelerated students did better? Whassup with that? I’m not completely sure but I think the paper he’s referring to is this one by Jian-Hua Liang, Paul E. Heckman, and Jamal Abedi (2012), which doesn’t seem to say anything about acceleration leading to significant declines, while at the same time I don’t see it reporting that accelerated students did better. That article concludes: “The algebra policy did encourage schools and districts to presumably enroll more students into algebra courses and then take the CST for Algebra I. However, among the students in our study, the algebra-for-all policy did not appear to have encouraged a more compelling set of classroom and school-wide learning conditions that enhanced student understanding and learning of critical knowledge and skills of algebra, as we have previously discussed.” I don’t see this supporting the claims of the CMF or of Conrad.


It’s a complicated story. Conrad seems to be correct that this California Math Framework is doing sloppy science reporting in the style of Gladwell/Freakonomics/NPR/Ted, using published research to tell a story without ever getting a handle on what those research papers are really saying or whether the claims even make sense. Unfortunately, the real story seems to be:

(a) The different parties to this dispute (Conrad and the authors of the California report) each have strong opinions about mathematics education, opinions which have been formed by their own experiences and which they will support using their readings of the research literature, which is, unfortunately, uneven and inconclusive.

(b) We don’t know much about what works or doesn’t work.

(c) The things that work aren’t policies or mandates but rather things going on at the level of individual schools, teachers, and students.

The most important aspect of policy would then seem to be doing what it takes to facilitate learning. At the same time, some curricular standards need to be set. The CMF has strong views not really supported by the data; Conrad and his colleagues have their own strong views, which are open to debate as well. I don’t feel comfortable stating a position on the policy recommendations being thrown around, but I do think Conrad is doing a service by pointing out these issues.

Google’s problems with reproducibility

Daisuke Wakabayashi and Cade Metz report that Google “has fired a researcher who questioned a paper it published on the abilities of a specialized type of artificial intelligence used in making computer chips”:

The researcher, Satrajit Chatterjee, led a team of scientists in challenging the celebrated research paper, which appeared last year in the scientific journal Nature and said computers were able to design certain parts of a computer chip faster and better than human beings.

Dr. Chatterjee, 43, was fired in March, shortly after Google told his team that it would not publish a paper that rebutted some of the claims made in Nature, said four people familiar with the situation who were not permitted to speak openly on the matter. . . . Google declined to elaborate about Dr. Chatterjee’s dismissal, but it offered a full-throated defense of the research he criticized and of its unwillingness to publish his assessment.

“We thoroughly vetted the original Nature paper and stand by the peer-reviewed results,” Zoubin Ghahramani, a vice president at Google Research, said in a written statement. “We also rigorously investigated the technical claims of a subsequent submission, and it did not meet our standards for publication.” . . .

The paper in Nature, published last June, promoted a technology called reinforcement learning, which the paper said could improve the design of computer chips. . . . Google had been working on applying the machine learning technique to chip design for years, and it published a similar paper a year earlier. Around that time, Google asked Dr. Chatterjee, who has a doctorate in computer science from the University of California, Berkeley, and had worked as a research scientist at Intel, to see if the approach could be sold or licensed to a chip design company . . .

But Dr. Chatterjee expressed reservations in an internal email about some of the paper’s claims and questioned whether the technology had been rigorously tested . . . While the debate about that research continued, Google pitched another paper to Nature. For the submission, Google made some adjustments to the earlier paper and removed the names of two authors, who had worked closely with Dr. Chatterjee and had also expressed concerns about the paper’s main claims . . .

Google allowed Dr. Chatterjee and a handful of internal and external researchers to work on a paper that challenged some of its claims.

The team submitted the rebuttal paper to a so-called resolution committee for publication approval. Months later, the paper was rejected.

There is another side to the story, though:

Ms. Goldie [one of the authors of the recently published article on chip design] said that Dr. Chatterjee had asked to manage their project in 2019 and that they had declined. When he later criticized it, she said, he could not substantiate his complaints and ignored the evidence they presented in response.

“Sat Chatterjee has waged a campaign of misinformation against me and [coauthor] Azalia for over two years now,” Ms. Goldie said in a written statement.

She said the work had been peer-reviewed by Nature, one of the most prestigious scientific publications. And she added that Google had used their methods to build new chips and that these chips were currently used in Google’s computer data centers.

And an outsider perspective:

After the rebuttal paper was shared with academics and other experts outside Google, the controversy spread throughout the global community of researchers who specialize in chip design.

The chip maker Nvidia says it has used methods for chip design that are similar to Google’s, but some experts are unsure what Google’s research means for the larger tech industry.

“If this is really working well, it would be a really great thing,” said Jens Lienig, a professor at the Dresden University of Technology in Germany, referring to the A.I. technology described in Google’s paper. “But it is not clear if it is working.”

The above-linked news article has links to the recent paper in Nature (“A graph placement methodology for fast chip design,” by Azalia Mirhoseini et al.) and the earlier preprint (“Chip placement with deep reinforcement learning”), but I didn’t see any link to the Chatterjee et al. response. The news article says that “the rebuttal paper was shared with academics and other experts outside Google,” so it must be out there somewhere, but I couldn’t find it in a quick, ummmmm, Google search. The closest I came was this news article by Subham Mitra that reports:

The new episode emerged after the scientific journal Nature in June published “A graph placement methodology for fast chip design,” led by Google scientists Azalia Mirhoseini and Anna Goldie. They discovered that AI could complete a key step in the design process for chips, known as floorplanning, faster and better than an unspecified human expert, a subjective reference point.

But other Google colleagues in a paper that was anonymously posted online in March – “Stronger Baselines for Evaluating Deep Reinforcement Learning in Chip Placement” – found that two alternative approaches based on basic software outperform the AI. One beat it on a well-known test, and the other on a proprietary Google rubric.

Google declined to comment on the leaked draft, but two workers confirmed its authenticity.

I searched on “Stronger Baselines for Evaluating Deep Reinforcement Learning in Chip Placement” but couldn’t find anything. So no opportunity to read the two papers side by side.

Comparison to humans or comparison to default software?

I can’t judge the technical controversy given the available information. From the abstract to “A graph placement methodology for fast chip design”:

Despite five decades of research, chip floorplanning has defied automation, requiring months of intense effort by physical design engineers to produce manufacturable layouts. Here we present a deep reinforcement learning approach to chip floorplanning. In under six hours, our method automatically generates chip floorplans that are superior or comparable to those produced by humans in all key metrics, including power consumption, performance and chip area. . . . Our method was used to design the next generation of Google’s artificial intelligence (AI) accelerators . . .

This abstract is all about comparisons with humans, but it seems that this is not the key issue, if the Stronger Baselines article was claiming that “two alternative approaches based on basic software outperform the AI.” I did find this bit near the end of the “A graph placement methodology” article:

Comparing with baseline methods. In this section, we compare our method with the state-of-the-art RePlAce and with the production design of the previous generation of TPU, which was generated by a team of human physical designers. . . . To perform a fair comparison, we ensured that all methods had the same experimental setup, including the same inputs and the same EDA tool settings. . . . For our method, we use a policy pre-trained on the largest dataset (20 TPU blocks) and then fine-tune it on five target unseen blocks (denoted as blocks 1–5) for no more than 6 h. For confidentiality reasons, we cannot disclose the details of these blocks, but each contains up to a few hundred macros and millions of standard cells. . . . As shown in Table 1, our method outperforms RePlAce in generating placements that meet design criteria. . . .

There’s a potential hole here in “For confidentiality reasons, we cannot disclose the details of these blocks,” but I don’t really know. The article is making some specific claims so I’d like to see the specifics in the rebuttal.

It doesn’t sound like there’s much dispute about the claim that automated methods can outperform human design. That is not a huge surprise, given that this is a well-defined optimization problem. Indeed, I’d like to see some discussion of what aspects of the problem make it so difficult that it wasn’t already machine-optimized. From the abstract: “Despite five decades of research, chip floorplanning has defied automation, requiring months of intense effort by physical design engineers to produce manufacturable layouts,” but the article also refers to “the state-of-the-art RePlAce,” so does that mean that RePlAce is only partly automatic?

The whole thing is a bit mysterious to me. I’m not saying the authors of this paper did anything wrong; I just don’t quite understand what’s being claimed here: in one place the big deal seems to be that this procedure is being automated; elsewhere the dispute seems to be a comparison to basic software.

Google’s problems with reproducibility

Google produces some great software. They also seem to follow the tech industry strategy of promoting vaporware, or, as we’d say in science, non-reproducible research.

We’ve seen two recent examples:

1. The LaMDA chatbot, which was extravagantly promoted by Google engineer Blaise Agüera y Arcas but with a bunch of non-reproducible examples. I posted on this multiple times and also contacted people within Google, but neither Agüera y Arcas nor anyone else has come forth with any evidence that the impressive conversational behavior claimed from LaMDA is reproducible. It might have happened or it might all be a product of careful editing, selection, and initialization—I have no idea!

2. University of California professor and Google employee Matthew Walker, who misrepresents data and promotes junk science regarding sleep.

That doesn’t mean that Chatterjee is correct in the above dispute. I’m just saying it’s a complicated world out there, and you can’t necessarily believe a scientific or engineering claim coming out of Google (or anywhere else).

P.S. From comments, it seems that Google no longer employs Matthew Walker. That makes sense. It always seemed a misfit for a data-focused company to be working with someone who’s famous for misrepresenting data.

P.P.S. An anonymous tipster sent me the mysterious Stronger Baselines paper. Here it is, and here’s the abstract:

This all leaves me even more confused. If Chatterjee et al. can “produce competitive layouts with computational resources smaller by orders of magnitude,” they could show that, no? Or, I guess not, because it’s all trade secrets.

This represents a real challenge for scholarly journals. On one hand, you don’t want them publishing research that can’t be reproduced; on the other hand, if the best work is being done in proprietary settings, what can you do? I don’t know the answer here. This is all setting aside personnel disputes at Google.

Hey, check this out, it’s really cool: A Bayesian framework for interpreting findings from impact evaluations.

Chris Mead points us to a new document by John Deke, Mariel Finucane, and Daniel Thal, prepared for the Department of Education’s Institute of Education Sciences. It’s called The BASIE (BAyeSian Interpretation of Estimates) Framework for Interpreting Findings from Impact Evaluations: A Practical Guide for Education Researchers, and here’s the summary:

BASIE is a framework for interpreting impact estimates from evaluations. It is an alternative to null hypothesis significance testing. This guide walks researchers through the key steps of applying BASIE, including selecting prior evidence, reporting impact estimates, interpreting impact estimates, and conducting sensitivity analyses. The guide also provides conceptual and technical details for evaluation methodologists.

I looove this, not just all the Bayesian stuff but also the respect it shows for the traditional goals of null hypothesis significance testing. They’re offering a replacement, not just an alternative.

Also they do good with the details. For example:

Probability is the key tool we need to assess uncertainty. By looking across multiple events, we can calculate what fraction of events had different types of outcomes and use that information to make better decisions. This fraction is an estimate of probability called a relative frequency. . . .

The prior distribution. In general, the prior distribution represents all previously available information regarding a parameter of interest. . . .

I really like that they express this in terms of “evidence” and “information” rather than “belief.”

They also discuss graphical displays and communication that is both clear and accurate; for example recommending summaries such as, “We estimate a 75 percent probability that the intervention increased reading test scores by at least 0.15 standard deviations, given our estimates and prior evidence on the impacts of reading programs for elementary school students.”
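A statement like that is just a tail probability of a posterior distribution. Here’s a minimal sketch with made-up numbers (a simple conjugate normal-normal update, not the actual BASIE model):

```python
from scipy.stats import norm

# Hypothetical prior from past reading interventions and a hypothetical
# new impact estimate; both are illustrative, not from the BASIE guide.
prior_mean, prior_sd = 0.10, 0.15   # prior on the effect (in sd units)
est, se = 0.25, 0.10                # new study's estimate and std. error

# Conjugate normal-normal update: precisions add, means are precision-weighted.
post_prec = 1 / prior_sd**2 + 1 / se**2
post_mean = (prior_mean / prior_sd**2 + est / se**2) / post_prec
post_sd = post_prec ** -0.5

p = norm.sf(0.15, loc=post_mean, scale=post_sd)  # P(impact >= 0.15 | data, prior)
print(f"P(impact >= 0.15) = {p:.2f}")
```

With these made-up numbers the answer comes out around 0.74; the point is just that a reported “75 percent probability that the intervention increased reading test scores by at least 0.15 standard deviations” is exactly this kind of posterior tail area.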

And it all runs in Stan, which is great partly because Stan is transparent and open-source and has good convergence diagnostics and a big user base and is all-around reliable for Bayesian inference, and also because Stan models are extendable: you can start with a simple hierarchical regression and then add measurement error, mixture components, and whatever else you want.

And this:

Local Stop: Why we do not recommend the flat prior

A prior that used to be very popular in Bayesian analysis is called the flat prior (also known as the improper uniform distribution). The flat prior has infinite variance (instead of a bell curve, a flat line). It was seen as objective because it assigns equal prior probability to all possible values of the impact; for example, impacts on test scores of 0, 0.1, 1, 10, and 100 percentile points are all treated as equally plausible.

When probability is defined in terms of belief rather than evidence, the flat prior might seem reasonable—one might imagine that the flat prior reflects the most impartial belief possible (Gelman et al., 2013, Section 2.8). As such, this prior was de rigueur for decades.

But when probability is based on evidence, the implausibility of the flat prior becomes apparent. For example, what evidence exists to support the notion that impacts on test scores of 0, 0.1, 1, 10, and 100 percentile points are all equally probable? No such evidence exists; in fact, quite a bit of evidence is completely inconsistent with this prior (for example, the distribution of impact estimates in the WWC [What Works Clearinghouse]). The practical implication is that the flat prior overestimates the probability of large effects. Following Gelman and Weakliem (2009), we reject the flat prior because it has no basis in evidence.

The implausibility of the flat prior also has an interesting connection to the misinterpretation of p-values. It turns out that the Bayesian posterior probability derived under a flat prior is identical (for simple models, at least) to a one-sided p-value. Therefore, if researchers switch to Bayesian methods but use a flat prior, they will likely continue to exaggerate the probability of large program effects (which is a common result when misinterpreting p-values). . . .
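That flat-prior/p-value identity is easy to check numerically in the simplest normal case (hypothetical estimate and standard error below):

```python
from scipy.stats import norm

est, se = 0.2, 0.1  # hypothetical impact estimate and standard error

# One-sided p-value for H0: effect <= 0, given the estimate.
p_one_sided = norm.sf(est / se)

# Posterior P(effect <= 0) under a flat prior: posterior is N(est, se).
post_prob_nonpositive = norm.cdf(0, loc=est, scale=se)

# The two numbers coincide, so misreading a one-sided p-value as
# "probability the effect is zero or negative" is exactly flat-prior Bayes.
print(p_one_sided, post_prob_nonpositive)
```

Both come out to the same value, which is the report’s point: switching to Bayes with a flat prior just reproduces the familiar exaggeration that comes from misinterpreting p-values.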

Yes, I’m happy they cite me, but the real point here is that they’re thinking in terms of modeling and evidence, also that they’re connecting to important principles in non-Bayesian inference. As the saying goes, there’s nothing so practical as a good theory.

What makes me particularly happy is the way in which Stan is enabling applied modeling.

This is not to say that all our problems are solved. Once we do cleaner inference, we realize the limitations of experimental data: with between-person studies, sample sizes are never large enough to get stable estimates of interactions of interest (recall 16), which implies the need for . . . more modeling, as well as open recognition of uncertainty in decision making. So lots more to think about going forward.

Full disclosure: My research is funded by the Institute of Education Sciences, and I know the authors of the above-linked report.

The perils of (apparently) logical thinking

Regular readers know that I’ve been screaming for years about the problems with binary thinking, for example here and here. Or, for an applied example from medical research, here.

This comes up in statistics all the time, when “statistically significant” results are treated as representing enduring truths (I’m looking at you, beauty-and-sex-ratio, ages-ending-in-9, ovulation-and-voting, subliminal-smiley-faces, ESP, and all the rest), and non-statistically-significant comparisons are treated as demonstrating no effect.

But today I came across another example which made me think that the flaws of binary reasoning arise in slightly more complicated settings too, when people take a probabilistic argument and reduce it to a false syllogism.

I’ll give the example and then discuss the more general problem.

The example

It comes from a post by Tyler Cowen, not something he’s saying himself, exactly, but I think by posting it he’s endorsing the argument, or maybe it’s a bit of mood affiliation, as Cowen might say. Anyway, here it is:

My first reaction upon hearing that boosters were rejected was to ask the same thing: would these same “experts” say that, because the vaccines are still effective without boosters, vaccinated persons don’t need to wear masks and can resume normal life? Of course not. They use the criterion “prevents hospitalization” for evaluating boosters (2a) but switch back to “prevents infection” when the question is masks and other restrictions. What about those that are willing to accept the tiny risk of side effects to prevent infection so that they can get back to fully normal life? The Science (TM) tells us that one can’t transmit the virus if one is never infected to begin with.

Also, one of the No votes on boosters said that he feared approval would effectively turn boosters into a mandate and change the definition of fully vaccinated. So, it appears that the overzealousness to demand vaccine mandates has actually contributed to fewer people getting access to (booster) vaccines, thus paradoxically contributing to spread. A vivid illustration of the problem with, “That which is not mandatory should be prohibited.”

Why the reasoning is wrong

I’m not saying the above passage is wrong on the merits. Drug approval involves lots of tricky questions, and it’s worth wrestling with these difficulties. Indeed, if the passage were just plain stupid, there’d be no point in discussing it. What I’m getting to is that I think the author of the passage is making a sort of logical error arising from inappropriate discretization of choices and outcomes.

So here’s my concern with the above passage: It takes decisions about mask wearing and social distancing and reduces them to the binary of:
(i) Wearing masks, or
(ii) Resuming normal life.
But it’s not just (i) or (ii). Last semester I started teaching in person and meeting in person, and I’m loving it. So that’s my bias, or my personal preference. But we were also wearing masks in the classroom (except I was allowed to not wear a mask when teaching and more than 6 feet away from any student, so sometimes I pulled off the mask to take a drag of air) and in the office (except, again, for the occasional socially-distanced drag). Now you might say that we shouldn’t have bothered with masks (or maybe that masks shouldn’t have been required)—but there were many people on campus who didn’t think we should’ve been teaching in person at all, and I understand the university had to respect that perspective. This gets to the cost-benefit issue: the cost of wearing masks is just not so high. But the cost of getting rid of full in-person classes is high, not in terms of health but in terms of education.

It seems to me that the passage quoted above is a sort of syllogism, nailing the so-called “experts” for a logical inconsistency—but the apparent illogic is only arising because the decision is being taken as binary.

The other discretization in the above quote, and this is important too, is in its treating of “these same ‘experts’” as a unified mass. I think the author of the quote is a bit too sure about what these “experts” (I guess it’s some FDA employees, or people on an FDA panel, or something like that?) would say about various other policy questions. You can make lots of errors in reasoning if you condition on facts that you’re not sure of. To put it in statistical terms, it might be that you have good reason to believe each of the statements A, B, C, D, and E—but if you condition on all five of these plausible statements, you might well lead yourself into a strong and unreasonable inference.

The general problem

Again, I can understand the frustration being expressed by the author of the above quote, and so don’t take this post as a defense of any particular policy regarding vaccine boosters, or as a defense of any particular mask policy. My point is a more general one, that the logical form of the syllogism is seductive, but it breaks down under uncertainty.

There are lots of examples. For example: “A is correlated with B” and “B is correlated with C” does not necessarily imply the statement, “A is correlated with C.” Confusion on this can lead to all sorts of problems. This syllogism thing is different, but it seems to me to be of the same family of fallacy. I’d be interested in reading a more systematic treatment of the topic: deterministic arguments fall apart in the presence of uncertainty. Another example is the “What does not kill my statistical significance makes it stronger” fallacy.
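The correlation example can be made concrete with a quick simulation (my construction, not from the post): let A and C load on a shared factor with opposite signs, and let B be their sum. Then A correlates with B, B correlates with C, yet A and C are negatively correlated.

```python
import random

def corr(u, v):
    """Pearson correlation, pure Python."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

random.seed(0)
n = 20000
z = [random.gauss(0, 1) for _ in range(n)]    # shared factor
A = [ zi + random.gauss(0, 1) for zi in z]    # loads positively on z
C = [-zi + random.gauss(0, 1) for zi in z]    # loads negatively on z
B = [a + c for a, c in zip(A, C)]             # sum of the two

# A-B and B-C correlations are positive; the A-C correlation is negative
print(round(corr(A, B), 2), round(corr(B, C), 2), round(corr(A, C), 2))
```

In population terms the three correlations here are 0.5, 0.5, and -0.5, so the "syllogism" from the two positive correlations to a third fails as badly as possible.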

It’s an interesting problem where the use of logic can pull you into fallacious reasoning—but, at the same time, the apparent logic makes the (erroneous) argument more compelling, which in turn can lead you to think that people on the other sides are knaves or fools, which then makes you less likely to listen to explanations of why your reasoning is wrong, which . . . It’s a vicious circle out there!

There’s also a political science angle here. Vaccine denial is annoying and it’s sad, and policymakers need to account for it. Politically-minded promotion of vaccine denial is even more horrible, and policymakers need to account for that too. A bit less annoying are people who are so scared of covid risks that they resist return to school. Policymakers need to account for them too! Wearing masks indoors and following testing protocols can be a way to satisfy that group.

To return to the “vaccinated persons don’t need to wear masks and can resume normal life” issue: I’d rather not wear a mask at all, but if I can teach classes and do meetings in person (and, in my non-work life, if I can travel and see people and go to the store etc.), then that’s close enough to “resuming normal life.” I mean, sure, I’d rather be able to ride the train without a mask—that’s not nothing—but “normal life” is a continuum.

Gigerenzer: “On the Supposed Evidence for Libertarian Paternalism”

From 2015. The scourge of all things heuristics and biases writes:

Can the general public learn to deal with risk and uncertainty, or do authorities need to steer people’s choices in the right direction? Libertarian paternalists argue that results from psychological research show that our reasoning is systematically flawed and that we are hardly educable because our cognitive biases resemble stable visual illusions. For that reason, they maintain, authorities who know what is best for us need to step in and steer our behavior with the help of “nudges.” Nudges are nothing new, but justifying them on the basis of a latent irrationality is. In this article, I analyze the scientific evidence presented for such a justification. It suffers from narrow logical norms, that is, a misunderstanding of the nature of rational thinking, and from a confirmation bias, that is, selective reporting of research. These two flaws focus the blame on individuals’ minds rather than on external causes, such as industries that spend billions to nudge people into unhealthy behavior. I conclude that the claim that we are hardly educable lacks evidence and forecloses the true alternative to nudging: teaching people to become risk savvy.

Good stuff here on three levels: (1) social science theories and models; (2) statistical reasoning and scientific evidence; and (3) science and society.

Gigerenzer’s article is interesting in itself and also as a counterpart to the institutionalized hype of the Nudgelords.

Confidence intervals, compatibility intervals, uncertainty intervals

“Communicating uncertainty is not just about recognizing its existence; it is also about placing that uncertainty within a larger web of conditional probability statements. . . . No model can include all such factors, thus all forecasts are conditional.” — Jessica, Chris, Elliott, and me (2020).

A couple years ago Sander Greenland and I published a discussion about renaming confidence intervals.

Confidence intervals

Neither of us likes the classical term, “confidence intervals,” for two reasons. First, the classical definition (a procedure that produces an interval which, under the stated assumptions, includes the true value at least 95% of the time in the long run) is not typically what is of interest when performing statistical inference. Second, the assumptions are wrong: as I put it, “Confidence intervals excluding the true value can result from failures in model assumptions (as we’ve found when assessing U.S. election polls) or from analysts seeking out statistically significant comparisons to report, thus inducing selection bias.”

Uncertainty intervals

I recommended the term “uncertainty intervals,” on the grounds that the way confidence intervals are used in practice is to express uncertainty about an inference. The wider the interval, the more uncertainty.

But Sander doesn’t like the label “uncertainty interval”; as he puts it, “the word ‘uncertainty’ gives the illusion that the interval properly accounts for all important uncertainties . . . misrepresenting uncertainty as if it were a known quantity.”

Compatibility intervals

Sander instead recommends the term “compatibility interval,” following the reasoning that the points outside the interval are outside because they are incompatible with the data and model (in a stochastic sense) and the points inside are compatible with our data and assumptions. What Sander says makes sense.

The missing point in both my article and Sander’s is how the different concepts fit together. As with many areas in mathematics, I think what’s going on is that a single object serves multiple functions, and it can be helpful to disentangle these different roles. Regarding interval estimation, this is something that I’ve been mulling over for many years, but it did not become clear to me until I started thinking hard about my discussion with Sander.

Purposes of interval estimation

Here’s the key point. Statistical intervals (whether they be confidence intervals or posterior intervals or bootstrap intervals or whatever) serve multiple purposes. One purpose they serve is to express uncertainty in a point estimate; another purpose they serve is to (probabilistically) rule out values outside the interval; yet another purpose is to tell us that values inside the interval are compatible with the data. The first of these goals corresponds to the uncertainty interval; the second and third correspond to the compatibility interval.

In a simple case such as linear regression or a well-behaved asymptotic estimate, all three goals are served by the same interval. In more complicated cases, no interval will serve all these purposes.

I’ll illustrate with a scenario that arose in a problem I worked on a bit over 30 years ago, and discussed here:

Sometimes you can get a reasonable confidence interval by inverting a hypothesis test. For example, the z or t test or, more generally, inference for a location parameter. But if your hypothesis test can ever reject the model entirely, then you’re in the situation shown above. Once you hit rejection, you suddenly go from a very tiny precise confidence interval to no interval at all. To put it another way, as your fit gets gradually worse, the inference from your confidence interval becomes more and more precise and then suddenly, discontinuously has no precision at all. (With an empty interval, you’d say that the model rejects and thus you can say nothing based on the model. You wouldn’t just say your interval is, say, [3.184, 3.184] so that your parameter is known exactly.)
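To see how an interval from test inversion can behave this way, here is a toy numerical sketch (my construction, not Gelman's original example): two measurements of the same quantity with known standard errors, and the interval of all theta values not rejected by a chi-squared fit statistic.

```python
# Keep every theta whose test statistic T(theta) = sum_i ((y_i - theta)/se_i)^2
# stays under a 95% chi-squared cutoff (about 5.99 for 2 measurements). As the
# two measurements disagree more, the interval shrinks and then vanishes.
def inverted_interval(y, se, crit=5.99):
    def T(theta):
        return sum(((yi - theta) / si) ** 2 for yi, si in zip(y, se))
    grid = [i / 1000 for i in range(-20000, 20001)]  # theta in [-20, 20]
    inside = [t for t in grid if T(t) <= crit]
    return (min(inside), max(inside)) if inside else None  # None = empty

print(inverted_interval([0.0, 1.0], [1.0, 1.0]))  # measurements agree: wide
print(inverted_interval([0.0, 3.0], [1.0, 1.0]))  # disagree more: narrower
print(inverted_interval([0.0, 6.0], [1.0, 1.0]))  # clash badly: empty (None)
```

The grid search is crude but makes the phenomenon visible: worse fit, tighter interval, then suddenly no interval at all.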

For our discussion here, the relevant point is that, if you believe your error model, this is a fine procedure for creating a compatibility interval—as your data becomes harder and harder to explain from the model, the compatibility interval becomes smaller and smaller, until it eventually becomes empty. That’s just fine; it makes sense; it’s how compatibility intervals should be.

But as an uncertainty interval, it’s terrible. Your model fits worse and worse, your uncertainty gets smaller and smaller, and then suddenly the interval becomes empty and you have no uncertainty statement at all—you just reject the model.

At this point Sander might stand up and say, Hey! That’s the point! You can’t get an uncertainty interval here so you should just be happy with the compatibility interval. To which I’d reply: Sure, but often the uncertainty interval isn’t what people want. To which Sander might reply: Yeah, but as statisticians we shouldn’t be giving people what they want, we should be giving people what we can legitimately give them. To which I’d reply: in decision problems, I want uncertainty. I know my uncertainty statements aren’t perfect, I know they’re based on assumptions, but that just pushes me to check my assumptions, etc. Ok, this argument could go on forever, so let me just return to my point that uncertainty and compatibility are two different (although connected) issues.

All intervals are conditional on assumptions

There’s one thing I disagree with in Sander’s article, though, and that’s his statement that “compatibility” is a more modest term than “confidence” or “uncertainty.” My take on this is that all these terms are mathematically valid within their assumptions, and none are in general valid when the assumptions are false. When the assumptions of model and sampling and reporting are false, there’s no reason to expect 95% intervals to contain the true value 95% of the time (hence, no confidence property), there’s no reason to think they will fully capture our uncertainty (hence, “uncertainty interval” is not correct), and no reason to think that the points inside the interval are compatible with the data and that the points outside are not compatible (hence, “compatibility interval” is also wrong).

All of these intervals represent mathematical statements and are conditional on assumptions, no matter how you translate them into words.

And that brings us to the quote from Jessica, Chris, Elliott, and me at the top of this post, from a paper on information, incentives, and goals in election forecasts, an example in which the most important uncertainties arise from nonsampling error.

All intervals are conditional on assumptions (which are sometimes called “guarantees”). Calling your interval an uncertainty interval or a compatibility interval doesn’t make that go away, any more than calling your probabilities “subjective” or “objective” absolves you from concerns about calibration.

He’s a devops engineer and he wants to set some thresholds. Where should he go to figure out what to do?

Someone who wants to remain anonymous writes:

I’m not a statistician, but a devops engineer. So basically managing servers, managing automated systems, databases, that kind of stuff.

A lot of that comes down to monitoring systems, producing a lot of time series data. Think cpu usage, number of requests, how long the servers take to respond to requests, that kind of thing. We’re using a tool called Datadog to collect all this, and aggregate, run some functions on them (average, P90…), make dashboards etc. We can also alert on this, so if a server is low on RAM, we page someone who then has to investigate and fix it.

When making these alerts you have to set thresholds, like if 10% of requests are errors over 5 minutes, then the person gets paged. I’m mostly just guessing these by eyeballing the graphs, but I don’t know anything about statistics; I feel this field probably has opinions on how to do this better!

So I’m wondering, can you recommend any beginner resources I can read to see some basics, and maybe get an idea what kind of stuff is possible with statistics? Maybe then I can try to approach this monitoring work in a bit more systematic way.

My reply: I’m not sure, but my guess is that the usual introductory statistics textbooks will be pretty useless here, as I don’t see the relevance of hypothesis testing, p-values, confidence intervals, and all the rest of those tools, nor do I think you’ll get much out of the usual sermons on the importance of random sampling and randomized experimentation. This sounds more like a quality control problem, so I guess I’d suggest you start with a basic textbook on quality control. I’m not sure what the statistical content of those books is, though.
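To give a flavor of the quality-control approach, here is a minimal sketch of a Shewhart-style control-chart rule (my illustration; the function names and numbers are made up, and this is not specific to Datadog): learn a baseline from history, then alert when a new observation lands more than k standard deviations away.

```python
from statistics import mean, stdev

def make_alert(history, k=3.0):
    """Alert when a value falls more than k standard deviations from the
    baseline mean (a Shewhart-style control-chart rule)."""
    mu, sigma = mean(history), stdev(history)
    def check(value):
        return abs(value - mu) > k * sigma
    return check

# hypothetical error-rate history (% of requests failing, per 5 minutes)
baseline = [1.1, 0.9, 1.3, 1.0, 0.8, 1.2, 1.0, 1.1, 0.9, 1.2]
alert = make_alert(baseline)
print(alert(1.4), alert(10.0))  # False True
```

In practice request metrics have trends and daily cycles, so the baseline would be computed over a rolling window or per time-of-day rather than over all history, but the threshold-from-baseline-variability idea is the core of it.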

Could the so-called “fragility index” be useful as a conceptual tool even though it would be a bad idea to actually use it?

Erik Drysdale writes:

You’ve mentioned the fragility index (FI) on your blog once before – a technique which does a post-hoc assessment on the number of patients that would need to change event status for a result to be insignificant. It’s quite popular in medical journals these days, and I have helped colleagues at my hospital use the technique for two papers. I haven’t seen a lot of analysis done on what exactly this quantity represents statistically (except for Potter 2019).

I’ve written a short blog post exploring the FI, and I show that its expected value (using some simplifying assumptions) is a function of the power of the test. While this formula can be used post-hoc to estimate the power of a test, this leads to a very noisy estimate (as you’ve pointed out many times before). On the plus side, this post-hoc power estimate is conservative and does not suffer from the usual problem of inflated power estimates because it explicitly conditions on statistical significance being achieved.

Overall, I agree closely with your original view that the FI is a neat idea, but rests on a flawed system. However, I am more positive towards its practical use because it seems to get doctors to think much more in terms of sample size rather than measured effect size.

From my previous post, the criticism of the so-called fragility index is that (a) it’s all about “statistical significance,” and (b) it’s noisy. So I wouldn’t really like people to be using it. I guess Drysdale’s point is that it could be useful as a conceptual tool even though it would be a bad idea to actually use it. Kind of like “statistical power,” which is based on “statistical significance” which is kinda horrible, but is still a super-useful concept in that it gets people thinking about the connections between design, data collection, measurement, inference, and decisions.
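For concreteness, here is a rough sketch of the computation the FI performs, as described above (my implementation of the general idea, using Fisher's exact test on a two-arm trial with binary outcomes; the function names are mine):

```python
from math import comb

def fisher_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]],
    computed from the hypergeometric distribution."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    def prob(x):
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # sum the probabilities of all tables as or less likely than the observed
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

def fragility_index(events_t, n_t, events_c, n_c, alpha=0.05):
    """Minimum number of treatment-arm patients whose event status must
    flip (toward the control rate) before the p-value crosses alpha."""
    a, b, c, d = events_t, n_t - events_t, events_c, n_c - events_c
    if fisher_p(a, b, c, d) >= alpha:
        return 0  # already non-significant
    step = 1 if events_t / n_t < events_c / n_c else -1
    for flips in range(1, n_t + 1):
        a2, b2 = a + step * flips, b - step * flips
        if a2 < 0 or b2 < 0:
            break
        if fisher_p(a2, b2, c, d) >= alpha:
            return flips
    return None
```

Running `fragility_index(0, 100, 20, 100)` on a hypothetical trial (0/100 events vs. 20/100) shows how few flipped outcomes it takes to lose "significance"—which is exactly why the measure inherits all the problems of the significance threshold it is built on.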

I guess the right way to go forward would be to create something with the best of both worlds: “Bayesian power analysis” or “Bayesian fragility index” or something like that. In our recent work we’ve used the general term “design analysis” to capture the general idea but without restricting to the classical concept of “power,” which is tied so closely to statistical significance. Similarly, “fragility” is related to “influence” of data points, so in a sense these methods are already out there, but there’s this particular connection to how inferences will be summarized.

DST controversy!

All I can say in response to this post from Paul Campos is that I had to be outside at 6:15 today, and it was still dark so I had to turn on the light on my bike. Also when I was a kid they had daylight time all year round one year and we had to walk to school in the dark.

Also, it’s annoying when you’re setting up a meeting in May or whatever and someone says it’s gonna be 10am EST and I have to either be Mister Picky and say, “Do you mean EDT?” or else just hope they didn’t schedule it as EST on their calendar somehow.

Really, though, if daylight is so damn wonderful, why not just leave the sun on all day long. Sure, it wastes some energy having it burning all through the night, but from an economic perspective the increased productivity will more than compensate for whatever small fraction of GDP is eaten up by global warming.

How to think about the large estimated effects from taking an ethnic studies class in high school?

Daniel Arovas asked for my impressions of this article, “Ethnic studies increases longer-run academic engagement and attainment,” by Sade Bonilla, Thomas Dee, and Emily Penner.

I responded that the article looked reasonable to me and asked if there were particular concerns that he had.

Arovas replied, “None in particular but seeing as this is both topical and in a fairly politicized field of research, I was a bit wary.”

I’ve been known to be suspicious of regression discontinuity analyses, but in this case the running variable (high school grade point average at the time of treatment assignment) is strongly predictive of the outcome, as we would expect it should be, and the curves are clean and monotonic; they don’t show the wacky edge effects that we’ve seen in problematic discontinuity regressions such as here, here, and here. As I wrote in a recent textbook review, “when you look at the successful examples of RDD, the running variable has a strong connection to the outcome, enough so that it’s plausible that adjusting for this variable will adjust for systematic differences between treatment and control groups.”
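As a sketch of what the design does (my toy simulation, not the paper's data or code; the paper assigned treatment based on a GPA cutoff, here abstracted to a generic running variable): fit a line on each side of the cutoff and read off the jump at the boundary.

```python
import random

def linfit(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def rd_estimate(x, y, cutoff=0.0, bandwidth=1.0):
    """Local-linear regression discontinuity: fit a line on each side of
    the cutoff within the bandwidth; the treatment-effect estimate is the
    difference of the two intercepts at the cutoff."""
    below = [(xi - cutoff, yi) for xi, yi in zip(x, y)
             if cutoff - bandwidth <= xi < cutoff]
    above = [(xi - cutoff, yi) for xi, yi in zip(x, y)
             if cutoff <= xi <= cutoff + bandwidth]
    a_below, _ = linfit(*zip(*below))
    a_above, _ = linfit(*zip(*above))
    return a_above - a_below

# simulated example: the outcome trends smoothly in the running variable,
# plus a true jump of 0.3 at the cutoff
random.seed(1)
x = [random.uniform(-2, 2) for _ in range(2000)]
y = [0.5 * xi + (0.3 if xi >= 0 else 0.0) + random.gauss(0, 0.2) for xi in x]
est = rd_estimate(x, y)
print(round(est, 2))
```

The point in the review quoted above is visible here: the method only works because the running variable strongly predicts the outcome on both sides, so the side-specific fits are stable and the jump is identified.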

What about the conclusion, that, for these students, taking an ethnic studies class will have such huge effects, reducing high school dropout rate by 40% and increasing college enrollment by 20% (that’s what it looks like from eyeballing Figures 1 and 2 of the paper)? I don’t fully understand these numbers (in particular, it seems that more kids are graduating from high school than were attending in year 4; they get into some of these details in the paper regarding students who drop out or move to another district or state, so I guess someone can check how they analyzed these cases), but the patterns seem clear. The paper also considers subgroup effects but I’m pretty sure those will be too noisy for us to learn much.

The thing I don’t fully understand is what are the treatment and control conditions. If a student was required or strongly urged to take the ethnic studies class, was this added to the schedule or did it take the place of another class? What they say in the article seems plausible, that you’re less likely to drop out of school if you’re taking a class that’s more relevant to your experiences. I’m still hung up on how large the estimated effect sizes are. I guess one way to think about this is that, if this class can reduce dropout rate by 40%, then lots of kids must be dropping out who really don’t need to, who are in some sense just on the edge of dropping out, and there should be lots of ways of keeping these kids in school. Which is not to say the ethnic studies class is a bad idea; just trying to put these results in context.

Elasticities are typically between 0 and 1

Tim Requarth discusses the implications of this innocuous but often-misunderstood statement.

The problem is, by some mixture of ideology, confusion, and too-clever-by-half-ness, people often like to argue that an elasticity will be greater than 1 (or less than 0, depending on how you define it). That is, they argue that a proposed policy will elicit such a strong reaction in the opposite direction as to have no effect or even backfire. As Requarth discusses, claims of this sort (for example, the argument that wearing seatbelts makes driving more dangerous) don’t make a lot of sense, nor are they typically supported by data. Their main role is to offer just enough surface plausibility to muddy the waters, I guess kind of like the anti-vax arguments we’ve been seeing lately.
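A toy calculation (mine, with made-up numbers) makes the arithmetic of the backfire claim explicit: the offsetting behavioral response has to exceed 100% of the direct effect before the policy reverses sign.

```python
# net effect of a policy push of size delta when behavior offsets a
# fraction e of it (the elasticity-style response)
def net_effect(delta, e):
    return delta * (1 - e)

print(net_effect(10, 0.4))  # 0 < e < 1: damped, but same sign as the push
print(net_effect(10, 1.5))  # backfire requires e > 1: the sign flips
```

Elasticities between 0 and 1 just mean the response is damped, which is the typical empirical finding; the backfire story needs the empirically unusual e > 1.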

This reminds me a bit of middle-school debate, where you win by (a) advancing enough arguments that the other side doesn’t get around to rebutting all of them, while (b) making the effort to offer some rebuttal to each of the other side’s arguments. Even a weak rebuttal not based on solid evidence is fine, in that it then forces the other side to waste its time rebutting the rebuttal, etc.

“An Investigation of the Facts Behind Columbia’s U.S. News Ranking”

Michael Thaddeus writes:

Nearly forty years after their inception, the U.S. News rankings of colleges and universities continue to fascinate students, parents, and alumni. . . . A selling point of the U.S. News rankings is that they claim to be based largely on uniform, objective figures like graduation rates and test scores. Twenty percent of an institution’s ranking is based on a “peer assessment survey” in which college presidents, provosts, and admissions deans are asked to rate other institutions, but the remaining 80% is based entirely on numerical data collected by the institution itself. . . .

Like other faculty members at Columbia University, I have followed Columbia’s position in the U.S. News ranking of National Universities with considerable interest. It has been gratifying to witness Columbia’s steady rise from 18th place, on its debut in 1988, to the lofty position of 2nd place which it attained this year . . .

A few other top-tier universities have also improved their standings, but none has matched Columbia’s extraordinary rise. It is natural to wonder what the reason might be. Why have Columbia’s fortunes improved so dramatically? One possibility that springs to mind is the general improvement in the quality of life in New York City, and specifically the decline in crime; but this can have at best an indirect effect, since the U.S. News formula uses only figures directly related to academic merit, not quality-of-life indicators or crime rates. To see what is really happening, we need to delve into these figures in more detail.

Thaddeus continues:

Can we be sure that the data accurately reflect the reality of life within the university? Regrettably, the answer is no. As we will see, several of the key figures supporting Columbia’s high ranking are inaccurate, dubious, or highly misleading.


And some details:

According to the 2022 U.S. News ranking pages, Columbia reports (a) that 82.5% of its undergraduate classes have under 20 students, whereas (e) only 8.9% have 50 students or more.

These figures are remarkably strong, especially for an institution as big as Columbia. The 82.5% figure for classes in range (a) is particularly extraordinary. By this measure, Columbia far surpasses all of its competitors in the top 100 universities; the nearest runners-up are Chicago and Rochester, which claim 78.9% and 78.5%, respectively.

Although there is no compulsory reporting of information on class sizes to the government, the vast majority of leading universities voluntarily disclose their Fall class size figures as part of the Common Data Set initiative. . . .

Columbia, however, does not issue a Common Data Set. This is highly unusual for a university of its stature. Every other Ivy League school posts a Common Data Set on its website, as do all but eight of the universities among the top 100 in the U.S. News ranking. (It is perhaps noteworthy that the runners-up mentioned above, Chicago and Rochester, are also among the eight that do not issue a Common Data Set.)

According to Lucy Drotning, Associate Provost in the Office of Planning and Institutional Research, Columbia prepares two Common Data Sets for internal use . . .

She added, however, that “The University does not share these.” Consequently, we know no details regarding how Columbia’s 82.5% figure was obtained.

On the other hand, there is a source, open to the public, containing extensive information about Columbia’s class sizes. Columbia makes a great deal of raw course information available online through its Directory of Classes . . . Course listings are taken down at the end of each semester but remain available from the Internet Archive.

Using these data, the author was able to compile a spreadsheet listing Columbia course numbers and enrollments during the semesters used in the 2022 U.S. News ranking (Fall 2019 and Fall 2020), and also during the recently concluded semester, Fall 2021. The entries in this spreadsheet are not merely a sampling of courses; they are meant to be a complete census of all courses offered during those semesters in subjects covered by Arts & Sciences and Engineering (as well as certain other courses aimed at undergraduates). . . .

[lots of details]

Two extreme cases can be imagined: that undergraduates took no 5000-, 6000-, and 8000-level courses, or that they took all such courses (except those already excluded from consideration). . . . Since the reality lies somewhere between these unrealistic extremes, it is reasonable to conclude that the true . . . percentage, among Columbia courses enrolling undergraduates, of those with under 20 students — probably lies somewhere between 62.7% and 66.9%. We can be quite confident that it is nowhere near the figure of 82.5% claimed by Columbia.

Reasoning similarly, we find that the true . . . percentage, among Columbia courses enrolling undergraduates, of those with 50 students or more — probably lies somewhere between 10.6% and 12.4%. Again, this is significantly worse than the figure of 8.9% claimed by Columbia . . .

These estimated figures indicate that Columbia’s class sizes are not particularly small compared to those of its peer institutions. Furthermore, the year-over-year data from 2019–2021 indicate that class sizes at Columbia are steadily growing.
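The two-extremes bounding argument used above is simple to sketch in code (the class sizes here are hypothetical numbers of mine, not from Thaddeus's spreadsheet): compute the statistic excluding all ambiguous grad-level courses, then including all of them, and the truth must lie between.

```python
def under20_bounds(known_ug_sizes, ambiguous_sizes):
    """Percent of classes with under 20 students, under the two extreme
    assumptions: ambiguous (grad-level) courses enroll no undergraduates,
    or all of them do."""
    def pct_under20(sizes):
        return 100 * sum(s < 20 for s in sizes) / len(sizes)
    exclude = pct_under20(known_ug_sizes)
    include = pct_under20(known_ug_sizes + ambiguous_sizes)
    return tuple(sorted((exclude, include)))

# hypothetical enrollments: larger undergrad classes, small grad seminars
ug = [15, 25, 30, 18, 45, 12, 60, 19, 22, 35]
ambiguous = [8, 9, 12, 30]
lo, hi = under20_bounds(ug, ambiguous)
print(lo, hi)
```

Because small grad seminars pull the percentage up, the inclusion/exclusion choice brackets the truth; Thaddeus's point is that even the most favorable bracket falls far short of the 82.5% Columbia reported.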


Thaddeus continues by shredding the administration’s figures on “percentage of faculty who are full time” (no, it seems that it’s not really “96.5%”), the “student-faculty ratio” (no, it seems that it’s not really “6 to 1”), “spending on instruction” (no, it seems that it’s not really higher than the corresponding figures for Harvard, Yale, and Princeton combined), “graduation and retention rates” (the reported numbers appear not to include transfer students).

Regarding that last part, Thaddeus writes:

The picture coming into focus is that of a two-tier university, which educates, side by side in the same classrooms, two large and quite distinct groups of undergraduates: non-transfer students and transfer students. The former students lead privileged lives: they are very selectively chosen, boast top-notch test scores, tend to hail from the wealthier ranks of society, receive ample financial aid, and turn out very successfully as measured by graduation rates. The latter students are significantly worse off: they are less selectively chosen, typically have lower test scores (one surmises, although acceptance rates and average test scores for the Combined Plan and General Studies are well-kept secrets), tend to come from less prosperous backgrounds (as their higher rate of Pell grants shows), receive much stingier financial aid, and have considerably more difficulty graduating.

No one would design a university this way, but it has been the status quo at Columbia for years. The situation is tolerated only because it is not widely understood.

Thaddeus summarizes:

No one should try to reform or rehabilitate the ranking. It is irredeemable. . . . Students are poorly served by rankings. To be sure, they need information when applying to colleges, but rankings provide the wrong information. . . .

College applicants are much better advised to rely on government websites like College Navigator and College Scorecard, which compare specific aspects of specific schools. A broad categorization of institutions, like the Carnegie Classification, may also be helpful — for it is perfectly true that some colleges are simply in a different league from others — but this is a far cry from a linear ranking. Still, it is hard to deny, and sometimes hard to resist, the visceral appeal of the ranking. Its allure is due partly to a semblance of authority, and partly to its spurious simplicity.

Perhaps even worse than the influence of the ranking on students is its influence on universities themselves. Almost any numerical standard, no matter how closely related to academic merit, becomes a malignant force as soon as universities know that it is the standard. A proxy for merit, rather than merit itself, becomes the goal. . . .

Even on its own terms, the ranking is a failure because the supposed facts on which it is based cannot be trusted. Eighty percent of the U.S. News ranking of a university is based on information reported by the university itself.

He concludes:

The role played by Columbia itself in this drama is troubling and strange. In some ways its conduct seems typical of an elite institution with a strong interest in crafting a positive image from the data that it collects. Its choice to count undergraduates only, contrary to the guidelines, when computing student-faculty ratios is an example of this. Many other institutions appear to do the same. Yet in other ways Columbia seems atypical, and indeed extreme, either in its actual features or in those that it dubiously claims. Examples of the former include its extremely high proportion of undergraduate transfer students, and its enormous number of graduate students overall; examples of the latter include its claim that 82.5% of undergraduate classes have under 20 students, and its claim that it spends more on instruction than Harvard, Yale, and Princeton put together. . . .

In 2003, when Columbia was ranked in 10th place by U.S. News, its president, Lee Bollinger, told the New York Times, “Rankings give a false sense of the world and an inauthentic view of what a college education really is.” These words ring true today. Even as Columbia has soared to 2nd place in the ranking, there is reason for concern that its ascendancy may largely be founded, not on an authentic presentation of the university’s strengths, but on a web of illusions.

It does not have to be this way. Columbia is a great university and, based on its legitimate merits, should attract students comparable to the best anywhere. By obsessively pursuing a ranking, however, it demeans itself. . . .

Michael Thaddeus is a professor of mathematics at Columbia University.

P.S. A news article appeared on this story. It features this response by a Columbia spokesman:

[The university stands] by the data we provided to U.S. News and World Report. . . . We take seriously our responsibility to accurately report information to federal and state entities, as well as to private rankings organizations. Our survey responses follow the different definitions and instructions of each specific survey.

Wow. That’s not a job I’d want to have, responding to news reporters who ask whether it’s true that your employer is giving out numbers that are “inaccurate, dubious, or highly misleading.” Michael Thaddeus and I are teachers at the university and we have academic freedom. If you’re a spokesman, though, you can’t just say what you think, right? That’s gotta be a really tough part of the job, to have to come up with responses in this sort of situation. I guess that’s what they’re paying you for, but still, it’s gotta be pretty awkward.