The behavioral economists’ researcher degree of freedom

A few years ago we talked about the two modes of pop-microeconomics:

1. People are rational and respond to incentives. Behavior that looks irrational is actually completely rational once you think like an economist.

2. People are irrational and they need economists, with their open minds, to show them how to be rational and efficient.

Argument 1 is associated with “why do they do that?” sorts of puzzles. Why do they charge so much for candy at the movie theater, why are airline ticket prices such a mess, why are people drug addicts, etc. The usual answer is that there’s some rational reason for what seems like silly or self-destructive behavior.

Argument 2 is associated with “we can do better” claims such as why we should fire 80% of public-school teachers or Moneyball-style stories about how some clever entrepreneur has made a zillion dollars by exploiting some inefficiency in the market.

The trick is knowing whether you’re gonna get 1 or 2 above. They’re complete opposites!

I thought of this when rereading this post from a few years ago, where we quoted Jason Collins, who wrote regarding the decades-long complacency of the academic psychology and economics establishment regarding the hot-hand fallacy fallacy:

We have a body of research that suggests that even slight cues in the environment can change our actions. Words associated with old people can slow us down. Images of money can make us selfish. And so on. Yet why haven’t these same researchers been asking why a basketball player would not be influenced by their earlier shots – surely a more salient part of the environment than the word “Florida”? The desire to show one bias allowed them to overlook another.

When writing the post with the above quote, I had been thinking specifically of issues with the hot hand.

Stepping back, I see this as part of the larger picture of researcher degrees of freedom in the fields of social psychology and behavioral economics.

You can apply the “two modes of thinking” idea to the hot hand:

Argument 1 goes like this: Believing in the hot hand sounds silly. But lots of successful players and coaches believe in it. Real money is at stake—this is not cheap talk! So it’s our duty to go beneath the surface and understand why, counterintuitively, belief in the hot hand makes sense, even though it might naively seem like a fallacy. Let’s prove that the pointy-headed professors outsmarted themselves and the blue-collar ordinary-Joe basketball coaches were right all along, following the anti-intellectual mode that was so successfully employed by the Alvin H. Baum Professor of Economics at the University of Chicago (for example, an unnamed academic says something stupid, only to be shot down by regular-guy “Chuck Esposito, a genial, quick-witted and thoroughly sports-fixated man who runs the race and sports book at Caesars Palace in Las Vegas.”)

Argument 2 goes the other way: Everybody thinks there’s a hot hand, but we, the savvy social psychologists and behavioral economists, know that because of evolution our brains make lots of shortcuts. Red Auerbach might think he’s an expert at basketball, but actually some Cornell professors have collected some data and have proved definitively that everything you thought about basketball was wrong.

Argument 1 is the “Econ 101” idea that when people have money on the line, they tend to make smart decisions, and we should be suspicious of academic theories that claim otherwise. Argument 2 is the “scientist as hero” idea that brilliant academics are making major discoveries every day, as reported to you by Ted, NPR, etc.

In the case of the hot hand, the psychology and economics establishment went with Argument 2. I don’t see any prior reason why they’d pick 1 or 2. In this case I think they just made an honest mistake: a team of researchers did a reasonable-seeming analysis and everyone went from there. Following the evidence—that’s a good idea! Indeed, for decades I believed that the hot hand was a fallacy. I believed in it, I talked about it, I used it as an example in class . . . until Josh Miller came to my office and explained to me how so many people, including me, had gotten it wrong.

So my point here is not to criticize economists and psychologists for getting this wrong. The hot hand is subtle, and it’s easy to get this one wrong. What interests me is how they chose—even if the choice was not made consciously—to follow Argument 2 rather than Argument 1 here. You could say the data led them to Argument 2, and that’s fine, but the same apparent strength of data could’ve led them to Argument 1. These are people who promote flat-out ridiculous models of the Argument 1 form such as the claim that “all deaths are to some extent suicides.” Sometimes they have a hard commitment to Argument 1. This time, though, they went with #2, and this time they were the foolish professors who got lost trying to model the real world.

I’m still working my way through the big picture here of trying to understand how Arguments 1 and 2 coexist, and how the psychologists and economists decide which one to go for in any particular example.

Interestingly enough, in the hot-hand example, after the behavioral economists saw their statistical argument overturned, they didn’t flip over to Argument 1 and extol the savvy of practical basketball coaches. Instead they pretty much tried to minimize their error and keep as much of Argument 2 as they could, for example arguing that, ok, maybe there is a hot hand but it’s much less than people think. They seem strongly committed to the idea that basketball players can’t be meaningfully influenced by previous shots, even while also being committed to the idea that words associated with old people can slow us down, images of money can make us selfish, and so on. I’m still chewing on this one.

Beverly Cleary is winner in third iteration of Greatest Seminar Speaker competition

Our third seminar speaker competition has come to an end, with the final round pitting Beverly “Ramona” Cleary against Laura “Ingalls” Wilder.

Before going on, I’d like to say that Alison Bechdel is the “Veronica Geng” of this particular competition, in that, even after she was defeated, she pretty much defined the parameters of the game. We’re still talking about her, whether or not she’s still in the running.

But the current matchup passes the Bechdel Test, so let’s move along.

Raghu writes:

I used my best slogan for Cleary (Cleary for the present!) in her prior round — a calculated risk, since it would have been worthless if Cleary hadn’t advanced. I am reminded of the dilemma of Karna in the Mahabharata, faced with the fearsome half-demon Ghatotkacha on the 15th day of the epic’s climactic war:

“Karna is unable to prevent Ghatotkacha from wreaking havoc on the Kaurava army, and even many of his celestial weapons are rendered useless. As the army breaks around him … Karna uses Vasavi Śhakti as a last resort. This weapon had been bestowed by Indra and could only be used once; Karṇa had been keeping it in reserve to use against Arjuna [his arch rival]. … The Pandavas were filled with grief at Ghatotkacha’s death. Krishna, however, couldn’t help but smile, knowing that Ghatotkacha has saved Arjuna from Karna.”

Cleary, then, is on her own.

Oncodoc leaps into the gap:

Let’s hear from Ms. Wilder. Her opinions without the editing imposed by her daughter might be interesting.

I don’t know about that! Rose was arguably the catalyst that made Laura’s writing so readable.

Dzhaughn riffs:

I remember these two together at a party out in the midwest summer home of William Carlos and Esther Williams. Beverly and John Cleary and the Absolute Monster Gentlemen were jamming, and I was chatting with Billy and Laura Ingalls Wilder and Melissa Gilbert and Brendan Sullivan besides some potted plants when suddenly from the north I saw an olive helicopter with Michael Landon land on the roof. Right behind him were Lorne Greene and Zha Zha Gabor, Denis and Doodles Weaver, Spike Jones, Spike Jonze, Spike Lee and Peggy Lee. Paul Lynde right in the middle of everything. Everybody was shouting, Studs Terkel was roasting beef on the bbq with gin, flames everywhere, William Carlos and Esther Williams had to jump in the pool and started swimming laps. Garrison Keillor was next door and said he was going to tell the story of this little house on A Prairie Home Companion.

OK fine but I have no idea who this would support.

Diana plays it straight:

Cleary, for a simple reason: her books made me laugh many times.
I read all the Little House books as a kid, read some of them multiple times, but I don’t remember being made to laugh by them.
Laughter during a seminar (here and there) is a good thing. Cleary can hold her own on the serious end too.

“She was not a slowpoke grownup. She was a girl who could not wait. Life was so interesting she had to find out what happened next.”
― Beverly Cleary, Ramona the Pest

Lots of arguments here. We’ll have to go to the very first commenter, Anonymous Pigeon, for a weighing of the evidence:

While seminars should be informative, they should also be fun. Laura can tell a good story but looks at it with less of the fun approach of Beverly. 1 point to the creator of Ramona The Pest. Neither one of them is better than the other on the writing scale, at least in my opinion. No points attributed there. So overall, Beverly Cleary should win because she would be the best seminar speaker.

Agreed.

Replacing the “zoo of named tests” by linear models

Gregory Gilderman writes:

The semi-viral tweet thread by Jonas Lindeløv linked below advocates abandoning the “zoo of named tests” for Stats 101 in favor of mathematically equivalent (I believe this is the argument) varieties of linear regression:

As an adult learner of statistics, perhaps only slightly beyond the 101 level, and an R user, I have wondered what the utility of some of these tests is when regression seems to get the same job done.

I believe this is of wider interest than my own curiosity and would love to hear your thoughts on your blog.

My reply: I don’t agree with everything in Lindeløv’s post—in particular, he doesn’t get into the connection between analysis of variance and multilevel models, and sometimes he’s a bit too casual with the causal language—but I like the general flow, the idea of trying to use a modeling framework and to demystify the zoo of tests. Lindeløv doesn’t mention Regression and Other Stories, but I think he’d like it, as it follows the general principle of working through linear models rather than presenting all these tests as if they are separate things.

Also, I agree 100% with Lindeløv that things like the Wilcoxon test are best understood as linear models applied to rank-transformed data. This is a point we made in the first edition of BDA way back in 1995, and we’ve also blogged it on occasion, for example here. So, yeah, I’m glad to see Lindeløv’s post and I hope that people continue to read it.
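
To make the equivalence concrete, here’s a minimal sketch in R with simulated data (base R only; the rank-based version matches the Wilcoxon test only approximately, which is part of Lindeløv’s point):

set.seed(123)
y <- c(rnorm(30, mean = 0), rnorm(30, mean = 0.5))
group <- rep(c(0, 1), each = 30)

t.test(y ~ group, var.equal = TRUE)   # classical two-sample t test
summary(lm(y ~ group))                # same t statistic and p-value for the group coefficient

wilcox.test(y ~ group)                # classical rank-sum test
summary(lm(rank(y) ~ group))          # linear model on ranks: a close approximation

The point is not that you’d run it this way in practice, but that seeing the named tests as special cases of the linear model makes them much easier to teach and to extend.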

Some nuggets from a week’s worth of comments

After returning from vacation I went in to approve the blog comments that were sitting in the possibly-spam folder. This gave me the chance to skim through a couple hundred comments, and there were some gems! We often have great comments but I just read them as they come in, one at a time. Seeing a week of comments all at once made me appreciate what we’ve got here. So, thanks everyone for contributing to our discussions!

Here are a few fun comments from last week:

From Anoneuoid:

A friend and I were talking the other day about all the people who seem to be on the phone (via headset) all day at work. This is particularly common for jobs like cab/uber driver, convenience store clerk, and mail/package delivery.

Who exactly is available to talk all day with them? Then we realized who must be on the other end of the conversation: other people with similar jobs.

That led to the next idea: what if for every “crazy” person who hears voices and talks to themself there is another “crazy” person on the other end?

Now yes, this theory requires assuming some form of natural long distance communication is possible. And also must be highly compressed/encrypted to escape detection for so long. But all of that sounds like what you would expect evolutionarily anyway.

I love how that comment starts with an amusing observation and then goes off the deep end. It’s kind of like science standup.

And this from Jd:

Random forest, neural nets, bagging, and boosting sounds so cool. You know, if your sponsors are staring across the horizon of big data and you offer something like random forest, they may just say, “yes…yes…that’s the thing.” Mr P is sorta cool, but it sounds kinda like a hip hop artist. But if you say, well we ran this Bayesian linear regression with a couple of variables, how are you gonna compete against the feedforward artificial neural network multilayer perceptron model?? It’s like boring old stats vs the Transformers.

Finally, a serious one from Anonymous, relating to the controversial unreproducible paper from Google on chip design:

This repository has been up for over a year, right? Have things changed since this statement from Andrew Kahng? https://docs.google.com/document/d/1vkPRgJEiLIyT22AkQNAxO8JtIKiL95diVdJ_O4AFtJ8/edit

Relevant quote:

“Further, I believe it is well-understood that the reported methods are not fully implementable based on what is provided in the Nature paper and the Circuit Training repository. While the RL method is open-source, key preprocessing steps and interfaces are not yet available. I have been informed that the Google team is actively working to address this. Remedying this gap is necessary to achieve scientific clarity and a foundation upon which the field can move forward.”

I wonder what ended up happening with that. My guess is that everyone moved forward and nobody cares about that particular method anymore, but I have no idea.

Messy data crashes into us

This short post is by Lizzie. 

Excellent job title alert! University of Nebraska-Lincoln has an open-rank position in messy data. Thanks to my former student, Dan, for sharing this.

In other news, I am in France (near the toy shop in this photo) where the radio also goes on strike. This means that at 8am last Thursday when I went to listen to the top-of-the-hour news I instead heard ‘There is a light and it never goes out, There is a light and it never goes out …’

An epidemiologist and a psychologist come down from the ivory tower to lecture us on “concrete values like freedom and equality” . . . What does that even mean?? The challenge of criticizing policies without considering their political and social contexts:

Flavio Bartmann writes:

I don’t know if you have seen this, the latest from John Ioannidis (together with Michaela Schippers this time), Saving Democracy From the Pandemic, an article at Tablet magazine. In spite of its grandiose title, it is a content free diatribe, mercifully short. It promotes skepticism in science, which is mildly ironic given his sensitivity when the skepticism is directed towards his work. Might merit some comment.

My reply: I took a quick read and couldn’t make much sense of the article. The authors talked about “concrete values like freedom and equality”: I have no idea what they’re talking about. It seemed like word salad.

Bartmann responded:

I believe that unexpected, material events lead many people (including some very smart ones) to strange places.

To return to the article under discussion: The topic of societal and governmental reactions to emergencies is important, very much worth writing about. I think where the article went wrong was in its framing of “health authorities and politicians” as the bad guys, without recognizing that these decisions are made in a larger context, in this case with lots of people being afraid of spreading covid, parents pulling their kids out of schools in March 2020, etc. As with other controversial government policies such as tax cuts and mandatory criminal sentencing laws, there’s a complicated push and pull between government, political entrepreneurs, and public opinion, and it’s a mistake to try to collapse this into a model of governments imposing policies on the public.

Also the authors could think a bit more about the context of their statements. For example, they write, “It is critical in free, democratic societies that media never become a vessel for a single, state-sanctioned, official narrative at the expense of public debate and freedom of speech. Removing content considered ‘fake’ or ‘false’ in order to limit the ability of ordinary people to judge information for themselves only inflames polarization and distrust of the public sphere.”—but we live in a social media environment where political and media leaders such as Ted Cruz and Alex Jones spread dangerous conspiracy theories. The example of Cruz illustrates that the “state” is not unitary; and the example of Jones illustrates issues with “media.”

I’m not saying that the authors of this one piece need to engage with all these complexities. I just think they should be aware of them, and if they want to make suggestions or criticisms of policies or attitudes, it would help for them to be specific rather than indulging in generalities about freedom and equality etc.

P.S. I clicked through the Tablet site and saw that it describes itself as “a daily online magazine of Jewish news, ideas, and culture.” I didn’t see anything Jewish-related in the above-linked article so maybe I’m missing something here?

P.P.S. Regarding the title of the post: Yes, I too am coming down from the ivory tower to lecture here. You’ll have to judge this post on its merits, not based on my qualifications. And if I go around using meaningless phrases such as “concrete values like freedom and equality,” please call me on it!

Multilevel modeling to make better decisions using data from schools: How can we do better?

Michael Nelson writes:

I wanted to point out a paper, Stabilizing Subgroup Proficiency Results to Improve the Identification of Low-Performing Schools, by Lauren Forrow, Jennifer Starling, and Brian Gill.

The authors use Mr. P to analyze proficiency scores of students in subgroups (disability, race, FRL, etc.). The paper’s been getting a good amount of attention among my education researcher colleagues. I think this is really cool—it’s the most attention Mr. P’s gotten from ed researchers since your JREE article. This article isn’t peer reviewed, but it’s being seen by far more policymakers than any journal article would.

All the more relevant that the authors’ framing of their results is fishy. They claim that some schools identified as underperforming, based on mean subgroup scores, actually aren’t, because they would’ve gotten higher means if the subgroup n’s weren’t so small. They’re selling the idea that adjustment by poststratification (which they brand as “score stabilization”) may rescue these schools from their “bad luck” with pre-adjustment scores. What they don’t mention is that schools with genuinely underperforming (but small) subgroups could be misclassified as well-performing if they have “good luck” with post-adjustment scores. In fact, they don’t use the word “bias” at all, as in: “Individual means will have less variance but will be biased toward the grand mean.” (I guess that’s implied when they say the adjusted scores are “more stable” rather than “more accurate,” but maybe only to those with technical knowledge.)

And bias matters as much as variance when institutions are making binary decisions based on differences in point estimates around a cutpoint. Obviously, net bias up or down will be 0, in the long run, and over the entire distribution. But bias will always be net positive at the bottom of the distribution, where the cutpoint is likely to be. Besides, relying on net bias and long-run performance to make practical, short-run decisions seems counter to the philosophy I know you share, that we should look at individual differences not averages whenever possible. My fear is that, in practice, Mr. P might be used to ignore or downplay individual differences–not just statistically but literally, given that we’re talking about equity among student subgroups.

To the authors’ credit, they note in their limitations section that they ought to have computed uncertainty intervals. They didn’t, because they didn’t have student-level data, but I think that’s a copout. If, as they note, most of the means that moved from one side of the cutoff to the other are quite near it already, you can easily infer that the change is within a very narrow interval. Also to their credit, they acknowledge that binary choices are bad and nuance is good. But, also to their discredit, the entire premise of their paper is that the education system will, and presumably should, continue using cutpoints for binary decisions on proficiency. (That’s the implication, at least, of the US Dept. of Ed disseminating it.) They could’ve described a nuanced *application* of Mr. P, or illustrated the absurd consequences of using their method within the existing system, but they didn’t.

Anyway, sorry this went so negative, but I think the way Mr. P is marketed to policymakers, and its potential unintended consequences, are important.

Nelson continues:

I’ve been interested in this general method (multilevel regression with poststratification, MRP) for a while, or at least the theory behind it. (I’m not a Bayesian so I’ve never actually used it.)

As I understand it, MRP takes the average over all subgroups (their grand mean) and moves the individual subgroup means toward that grand mean, with smaller subgroups getting moved more. You can see this in the main paper’s graphs, where low means go up and high means go down, especially on the left side (smaller n’s). The grand mean will be more precise and more accurate (due to something called superefficiency), while the individual subgroup means will be much more precise but can also be much more biased toward the grand mean. The rationale for using the biased means is that very small subgroups give you very little information beyond what the grand mean is already telling you, so you should probably just use the grand mean instead.

In my view, that’s an iffy rationale for using biased subgroup proficiency scores, though, which I think the authors should’ve emphasized more. (Maybe they’ll have to in the peer-reviewed version of the paper.) Normally, bias in individual means isn’t a big deal: we take for granted that, over the long run, upward bias will be balanced out by downward bias. But, for this method and this application, the bias won’t ever go away, at least not where it matters. If what we’re looking at is just the scores around the proficiency cutoff, that’s generally going to be near the bottom of the distribution, and means near the bottom will always go up. As a result, schools with “bad luck” (as the authors say) will be pulled above the cutoff where they belong, but so will schools with subgroups that are genuinely underperforming.

I have a paper under review that derives a method for correcting a similar problem for effect sizes—it moves individual estimates not toward a grand mean but toward the true mean, in a direction and distance determined by a measure of the data’s randomness.

I kinda see what Nelson is saying, but I still like the above-linked report because I think that in general it is better to work with regularized, partially-pooled estimates than with raw estimates, even if those raw estimates are adjusted for noise or multiple comparisons or whatever.

To help convey this, let me share a few thoughts regarding hierarchical modeling in this general context of comparing averages (in this case, from different schools, but similar issues arise in medicine, business, politics, etc.).

1. Many years ago, Rubin made the point that, when you start with a bunch of estimates and uncertainties, classical multiple comparisons adjustments effectively work by increasing the standard errors so that fewer comparisons are statistically significant, whereas Bayesian methods move the estimates around. Rubin’s point was that you can get the right level of uncertainty much more effectively by moving the intervals toward each other rather than by keeping their centers fixed and then making them wider. (I’m thinking now that a dynamic visualization would be helpful to make this clear.)

It’s funny because Bayesian estimates are often thought of as trading bias for variance, but in this case the Bayesian estimate is so direct, and it’s the multiple comparisons approaches that do the tradeoff, getting the desired level of statistical significance by effectively making all the intervals wider and thus weakening the claims that can be made from data. It’s kinda horrible that, under the classical approach, your inferences for particular groups and comparisons will on expectation get vaguer as you get data from more groups.

We explored this idea in our 2000 article, Type S error rates for classical and Bayesian single and multiple comparison procedures (see here for freely-available version) and more thoroughly in our 2011 article, Why we (usually) don’t have to worry about multiple comparisons. In particular, see the discussion on pages 196-197 of that latter paper (see here for freely-available version).

2. MRP, or multilevel modeling more generally, does not “move the individual subgroup means toward that grand mean.” It moves the error terms toward zero, which implies that it moves the local averages toward their predictions from the regression model. For example, if you’re predicting test scores given various school-level predictors, then multilevel modeling partially pools the individual school means toward the fitted model. It would not in general make sense to partially pool toward the grand mean—not in any sort of large study that includes all sorts of different schools. (Yes, in Rubin’s classic 8-schools study, the estimates were pooled toward the average, but these were 8 similar schools in suburban New Jersey, and there were no available school-level predictors to distinguish them.)

3. I agree with Nelson that it’s a mistake to summarize results using statistical significance, and this can lead to artifacts when comparing different models. There’s no good reason to make decisions based on whether a 95% interval includes zero.

4. I like multilevel models, but point estimates from any source—multilevel modeling or otherwise—have unavoidable problems when the goal is to convey uncertainty. See our 1999 article, All maps of parameter estimates are misleading.

In summary, I like the Forrow et al. article. The next step should be to go beyond point estimates and statistical significance and to think more carefully about decision making under uncertainty in this educational context.
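
To illustrate point 2 above, here’s a small self-contained simulation sketch in R (all numbers invented): the partially pooled estimate for each school is pulled toward the regression prediction for that school, not toward the grand mean, and the smallest schools get pulled the most.

set.seed(1)
J <- 40                                    # schools
x <- rnorm(J)                              # school-level predictor
theta <- 1.5 * x + rnorm(J, sd = 0.5)      # true school effects
n <- sample(10:200, J, replace = TRUE)     # school sample sizes
se <- 2 / sqrt(n)                          # standard errors of the raw school means
y_raw <- rnorm(J, mean = theta, sd = se)   # observed (noisy) school means

fit <- lm(y_raw ~ x, weights = 1 / se^2)   # school-level regression
pred <- fitted(fit)                        # model prediction for each school
tau2 <- max(var(y_raw - pred) - mean(se^2), 0.01)   # crude between-school variance

w <- tau2 / (tau2 + se^2)                  # normal-normal partial pooling weights
y_pooled <- w * y_raw + (1 - w) * pred     # pooled toward pred, not toward mean(y_raw)

round(cbind(n, raw = y_raw, prediction = pred, pooled = y_pooled)[1:5, ], 2)

A full multilevel model (in Stan, say) does this estimation jointly rather than in two steps, but the direction of the shrinkage is the same: toward the fitted model, with the amount determined by each school’s sample size.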

Someone has a plea to teach real-world math and statistics instead of “derivatives, quadratic equations, and the interior angles of rhombuses”

Robert Thornett writes:

What if, for example, instead of spending months learning about derivatives, quadratic equations, and the interior angles of rhombuses, students learned how to interpret financial and medical reports and climate, demographic, and electoral statistics? They would graduate far better equipped to understand math in the real world and to use math to make important life decisions later on.

I agree. I mean, I can’t be sure; he’s making a causal claim for which there is no direct evidence. But it makes sense to me.

Just one thing. The “interior angles of rhombuses” thing is indeed kinda silly, but I think it would be awesome to have a geometry class where students learn to solve problems like: Here’s the size of a room, here’s the location of the doorway opening and the width of the hallway, here are the dimensions of a couch, now how do you manipulate the couch to get it from the hall through the door into the room, or give a proof that it can’t be done. That would be cool, and I guess it would motivate some geometrical understanding.

In real life, though, yeah, learning standard high school and college math is all about turning yourself into an algorithm for solving exam problems. If the problem looks like A, do X. If it looks like B, do Y, etc.

Lots of basic statistics teaching looks like that too, I’m afraid. But statistics has the advantage of being one step closer to application, which should help a bit.

Also, yeah, I think we can all agree that “derivatives, quadratic equations, and the interior angles of rhombuses” are important too. The argument is not that these should not be taught, just that these should not be the first things that are taught. Learn “how to interpret financial and medical reports and climate, demographic, and electoral statistics” first, then if you need further math courses, go on to the derivatives and quadratic equations.

It’s the finals! Time to choose the ultimate seminar speaker: Beverly Cleary vs. Laura Ingalls Wilder

We’ve reached the endpoint of our third seminar speaker competition. Top seeds J. R. R. Tolkien, Miles Davis, David Bowie, Dr. Seuss, Hammurabi, Judas, Martha Stewart, and Yo-Yo Ma fell by the wayside (indeed, Davis, Judas, and Ma didn’t even get to round 2!); unseeded heavyweight Isaac Newton lost in round 3; and dark-horse favorites James Naismith, Henry Winkler, Alison Bechdel, and J. Robert Lennon couldn’t make the finish line either.

What we have is two beloved and long-lived children’s book authors. Cleary was more prolific, but maybe only because she got started at a younger age. Impish Ramona or serious Laura . . . who’s it gonna be?

Either way, I assume it will go better than this, from a few years ago:

CALL FOR APPLICATIONS: LATOUR SEMINAR — DUE DATE AUGUST 11 (extended)
The Brown Institute for Media Innovation, Alliance (Columbia University, École Polytechnique, Sciences Po, and Panthéon-Sorbonne University), The Center for Science and Society, and The Faculty of Arts and Sciences are proud to present

BRUNO LATOUR AT COLUMBIA UNIVERSITY, SEPTEMBER 22-25
You are invited to apply for a seminar led by Professor Bruno Latour on Tuesday, September 23, 12-3pm. Twenty-five graduate students from throughout the university will be selected to participate in this single seminar given by Prof. Latour. Students will organize themselves into a reading group to meet once or twice in early September for discussion of Prof. Latour’s work. They will then meet to continue this discussion with a small group of faculty on September 15, 12-2pm. Students and a few faculty will meet with Prof. Latour on September 23. A reading list will be distributed in advance.

If you are interested in this 3-4 session seminar (attendance at all 3-4 sessions is mandatory), please send

Name:
Uni:
Your School:
Your Department:
Year you began your terminal degree at Columbia:
Thesis or Dissertation title or topic:
Name of main advisor:

In one short, concise paragraph tell us what major themes/keywords from Latour’s work are most relevant to your own work, and why you would benefit from this seminar. Please submit this information via the site
http://brown.submittable.com/submit
The due date for applications is August 11 and successful applicants will be notified in mid-August.

That was the only time I’ve heard of a speaker who’s so important that you have to apply to attend his seminar! And, don’t forget, “attendance at all 3-4 sessions is mandatory.” I wonder what they did to the students who showed up to the first two seminars but then skipped #3 and 4.

Past matchup

Wilder faced Sendak in the last semifinal. Dzhaughn wrote:

This will be a really tight match up.

Sendak has won the Laura Ingalls Wilder Award. Yet no one has won more Maurice Sendak Awards than Wilder. And she was dead when he won it.

Maurice Sendak paid for his college by working at FAO Schwarz. That’s Big, isn’t it?

The Anagram Department notices “Serial Lulling Award,” not a good sign for a seminar speaker. “American Dukes” and “Armenia Sucked” are hardly top notch, but less ominous.

So, I come up with a narrow edge to Sendak but I hope there is a better reason.

“Serial Lulling Award” . . . that is indeed concerning!

Raghu offers some thoughts, which, although useless for determining who to advance to the final round, are so much in the spirit of this competition that I’ll repeat them here:

This morning I finished my few-page-a-day reading of the biography of basketball inventor and first-round loser James Naismith, and I was struck again by how well-suited he is to this tournament:

“It was shortly after seven o’clock, and the meal was over. He added briskly, ‘Let me show you some of the statistics I’ve collected about accidents in sports. I’ve got them in my study.’ He started to rise from the table and fell back into his chair. Ann recognized the symptoms. A cerebral hemorrhage had struck her father.” — “The Basketball Man, James Naismith” by Bernice Larson Webb

Statistics! Sports! Medical inference!

I am not, however, suggesting that the rules be bent; I’ve had enough of Naismith.

I finished Sendak’s “Higglety Pigglety Pop! Or, There Must Be More to Life” — this only took me 15 minutes or so. It is surreal, amoral, and fascinating, and I should read more by Sendak. Wilder is neither surreal nor amoral, though as I think I noted before, when I was a kid I found descriptions of playing ball with pig bladders as bizarre as science fiction. I don’t know who that’s a vote for.

I find it hard to read a book a few pages a day. I can do it for a while, but at some point I either lose interest and stop, or I want to find out what happens next so I just finish the damn book.

Diana offers a linguistic argument:

Afterthought and correction: The “n” should be considered a nasal and not a liquid, so Laura Ingalls Wilder has five liquids, a nasal, a fricative, a glide, and two plosives, whereas Maurice Sendak has two nasals, a liquid, two fricatives, and two plosives (and, if you count his middle name, three nasals, three liquids, two fricatives, and four plosives). So Wilder’s name actually has the greater variety of consonants, given the glide, but in Sendak’s name the various kinds are better balanced and a little more spill-resistant.

OK, sippy cups. Not so relevant for a talk at Columbia, though, given that there will be very few toddlers in the audience.

Anon offers what might appear at first to be a killer argument:

If you look at the chart, you can pretty clearly notice that the bracket is only as wide as it is because of Laura Ingalls Wilder’s prodigious name. I’ve got to throw my hat in the ring for Sendak, simply for storage.

+1 for talking about storage—optimization isn’t just about CPU time!—but this length-of-name argument reeks of sexism. In a less traditional society, Laura wouldn’t have had to add the Wilder to her name, and plain old “Laura Ingalls,” that’s a mere 13 characters wide, and two of them are lower-case l’s, which take up very little space (cue Ramanujan here). Alison Bechdel’s out of the competition now, but she’s still looking over my shoulder, as it were, scanning for this sort of bias.

And Ben offers a positive case for the pioneer girl:

There’s some sort of libertarian angle with Wilder though right?

What if we told Wilder about bitcoin and defi and whatnot? Surely that qualifies as surreal and amoral in the most entertaining kind of way. I know talking about these things in any context is a bit played out at this point but c’mon. This isn’t some tired old celebrity we’re selling here! This is author of an American classic, from the grave — any way she hits that ball is gonna be funny.

Sounds good to me!

“The hat”: A single shape that can tile the plane aperiodically but not periodically.

Z in comments points to a new discovery by David Smith, Joseph Samuel Myers, Craig Kaplan, and Chaim Goodman-Strauss, who write:

An aperiodic monotile . . . is a shape that tiles the plane, but never periodically. In this paper we present the first true aperiodic monotile, a shape that forces aperiodicity through geometry alone, with no additional constraints applied via matching conditions. We prove that this shape, a polykite that we call “the hat”, must assemble into tilings based on a substitution system.

All I can say is . . . wow. (That is, assuming the result is correct. I have no reason to think it’s not; I just haven’t tried to check it myself.)

First off, this is just amazing. Even more amazing is that I had no idea that this was even an open problem. I’d seen the Penrose two-shape tiling pattern years ago and loved it so much that I painted a tabletop with it (and sent a photo of the table to Penrose himself, who replied with a nice little note, which unfortunately I lost some years ago, or I’d reproduce it here), and it never even occurred to me to ask whether an aperiodic monotile was possible.

This is the biggest news of 2023 so far (again, conditional on the result being correct), and I doubt anything bigger will happen between now and the end of December.

OK, there’s one possibility . . .

Penrose did it with 2 unique tiles, Smith et al. just needed 1, . . . The next frontier in aperiodic tiling is to do it with 0. Whoever gets there will be the real genius.

P.S. Michael in comments points out that the Smith et al. pattern includes mirrored tiles. So let’s call it 1.5. Is there a theorem that you can’t do it with just 1 tile with no mirroring?

More on “I could care less about the twin primes conjecture”

As part of a discussion about research retractions, I remarked that I could care less about the twin primes conjecture.

This got some reactions in comments! Dmitri wrote:

I think it’s refreshing that Andrew doesn’t care about the twin primes conjecture. After thinking about it for a few seconds, I realized that I also don’t care about the twin primes conjecture.

It’s kind of interesting to think about what sorts of unanswered questions you actually care about. “Is there life on other planets?” Definitely. “What does Quantum Mechanics mean?” Totally. Twin primes, meh …

From the other direction, Ethan Bolker and Larry Gonick were disappointed, with Ethan writing, “Andrew did mathematics before he did what he does now and I thought some of that curiosity would remain.” Adede followed up with, “I find it interesting that someone can care about whether sqrt(2) is a normal number but not care about the twin primes conjecture. It can’t be a pure vs applied thing, both of them seem equally devoid of real-world applications (unless I am missing something).”

OK, so where are we?

First, I’m a big fan of the Cartoon Guide to Statistics, so if Larry Gonick is disappointed in me, that makes me sad and it motivates me to try to explain myself. Second, hey Ethan, I still have curiosity about mathematics, just not about the twin prime conjecture! For example, as Adede notes, I’m curious about the distribution of 0’s and 1’s in the binary expansion of the square root of 2, and that’s pure math with no relevant applications that I know of.

So here’s the question: Why do I care about the distribution of the digits of sqrt(2) but not twin primes?

I’m not really sure, but here are some guesses:

1. The distribution of the digits of sqrt(2) has a probability and statistics flavor; it’s a search for randomness. I’m interested in randomness.

2. Back when I was in high school and did math team and math olympiad training, there were two subjects that were waaay overestimated, to my taste: number theory and classical non-analytic geometry. We got so much propaganda for these subjects that I grew to hate them. A certain amount of number theory is necessary—factorization, things like that—and, yeah, I get that there are deep connections to group theory and other important topics, as well as connections to analysis. I’m glad that somewhere there are people working on the Riemann hypothesis, etc. But the twin primes conjecture, the 3n+1 problem, etc.: I get that they’re challenging, but they’ve never really engaged me.

Explanation #1 can’t be the whole story, because I also find questions about tilings to be interesting, even when no randomness is involved. And explanation #2 isn’t the whole story either. So I don’t really know. Maybe the best answer is that my understanding of mathematics is sufficient for me to understand lots of things in statistics but is not deep enough for me to have any real sense of what makes these particular problems difficult, and so my finding one or another of these problems “intriguing” or “boring” is just an idiosyncratic product of my personal history with no larger meaning.

To put it another way, when I tell you that the Fieller-Creasy problem is fundamentally uninteresting or that the so-called Fisher exact test is a bad idea or that Bayes factors typically don’t do what people want them to do, I’m saying these things for good reasons. You might disagree with me, and maybe I’m wrong and you’re right, but I have serious, explainable reasons for these views of mine. They’re not just matters of taste.

But when I say I care about the distribution of the digits of the square root of 2 but not about the twin primes conjecture, that’s just some uninformed attitude for which I’m not claiming any reasonable basis.

Bad stuff going down in biostat-land: Declaring null effect just cos p-value is more than 0.05, assuming proportional hazards where it makes no sense

Wesley Tansey writes:

This is no doubt something we both can agree is a sad and wrongheaded use of statistics, namely incredible reliance on null hypothesis significance testing. Here’s an example:

Phase III trial. Failed because their primary endpoint had a p-value of 0.053 instead of 0.05. Here’s the important actual outcome data though:

For the primary efficacy endpoint, INV-PFS, there was no significant difference in PFS between arms, with 243 (84%) of events having occurred (stratified HR, 0.77; 95% CI: 0.59, 1.00; P = 0.053; Fig. 2a and Table 2). The median PFS was 4.5 months (95% CI: 3.9, 5.6) for the atezolizumab arm and 4.3 months (95% CI: 4.2, 5.5) for the chemotherapy arm. The PFS rate was 24% (95% CI: 17, 31) in the atezolizumab arm versus 7% (95% CI: 2, 11; descriptive P < 0.0001) in the chemotherapy arm at 12 months and 14% (95% CI: 7, 21) versus 1% (95% CI: 0, 4; descriptive P = 0.0006), respectively, at 18 months (Fig. 2a). As the INV-PFS did not cross the 0.05 significance boundary, secondary endpoints were not formally tested.

The odds of atezolizumab being better than chemo are clearly high. Yet this entire article is being written as the treatment failing simply because the p-value was 0.003 too high.

He adds:

And these confidence intervals are based on proportional hazards assumptions. But this is an immunotherapy trial where we have good evidence that these trials violate the PH assumption. Basically, you get toxicity early on with immunotherapy, but patients that survive that have a much better outcome down the road. Same story here; see figure below. Early on the immunotherapy patients are doing a little worse than the chemo patients but the long-term survival is much better.

As usual, our recommended solution for the first problem is to acknowledge uncertainty and our recommended solution for the second problem is to expand the model, at the very least by adding an interaction.

Regarding acknowledging uncertainty: Yes, at some point decisions need to be made about choosing treatments for individual patients and making general clinical recommendations—but it’s a mistake to “prematurely collapse the wave function” here. This is a research paper on the effectiveness of the treatment, not a decision-making effort. Keep the uncertainty there; you’re not doing us any favors by acting as if you have certainty when you don’t.
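
For the second problem, here’s a hedged sketch in R of what “expand the model” could look like using the survival package. The data below are simulated with an invented crossing-hazards pattern, not the trial data, and a time-varying treatment effect via tt() is just one of several ways to relax proportional hazards:

library(survival)

set.seed(42)
n <- 300
arm <- rbinom(n, 1, 0.5)                       # 0 = chemotherapy, 1 = immunotherapy
time <- ifelse(arm == 1,                       # invented: a bit worse early, much better late on arm 1
               rweibull(n, shape = 0.8, scale = 14),
               rweibull(n, shape = 1.3, scale = 10))
status <- as.integer(time < 24)                # administrative censoring at 24 months
time <- pmin(time, 24)
d <- data.frame(time, status, arm)

fit_ph <- coxph(Surv(time, status) ~ arm, data = d)
cox.zph(fit_ph)                                # diagnostic for non-proportional hazards

fit_tt <- coxph(Surv(time, status) ~ arm + tt(arm), data = d,
                tt = function(x, t, ...) x * log(t))
summary(fit_tt)                                # treatment effect now varies with time

Plotting the Kaplan-Meier curves alongside this sort of expanded fit is usually more informative than any single p-value.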

Research is everywhere, even, on rare occasions, in boxes labeled research

This is Jessica. I remember once hearing one of my colleagues who is also a professor talking about the express train that runs through much of Chicago up to Northwestern campus. He said, “The purple line is fantastic. I get on in the morning, always get a seat and I can get research done. Then I get to campus, and all research ceases for 8 hours. But I get back on the train and I’m right back to doing research!”

It is no joke that the more senior you get in academia, the less time you get to do the things that made you choose that career in the first place. But the topic of this post is a different sort of irony. Right now it’s deadline time for my lab, when many of the PhD students are preparing papers for the big conference in our field. It’s a very “researchy” time. What is surprising is how easy it is to be surrounded by people doing research and not feel like there is much actual new knowledge or understanding happening.

There is a David Blackwell quote that I have come to really like:

 I’m not interested in doing research and I never have been, I’m interested in understanding, which is quite a different thing.

Andrew has previously commented on this quote, implying that this may have been true at Blackwell’s time, but things have since shifted and understanding is now recognized as a valuable part of research. But I tend to think that Blackwell’s sentiment is still very much relevant. 

For example, when I think about what most people would call “my research,” I think of papers I’ve published that propose or evaluate visualization techniques or other interactive tools we create. But I don’t necessarily associate most of this work with “understanding.” On some level we find things out, but it’s very easy to present some stuff you learned in a paper without it ever actually challenging anything we already know. It’s framed as brand new information but usually it’s actually 99% old information in the form of premises and assumptions with a tiny new bit of something. It might not actually answer any of the questions that get you out of bed in the morning. I think most researchers would relate to feeling like this at least sometimes.

Pursuing understanding is why I like my job. I think of it as tied to the questions that I am chewing on but I can’t yet fully answer, because the answer is going to be complicated, connecting to many other things I’ve thought about in the past but without the derivation chain being totally clear. Maybe it even contradicts things I’ve thought or said in the past. On some level I think of understanding as dynamic, about a shift in perspective. All of this makes it hard to circumscribe with linguistic boundaries. I find it’s more natural to express understanding in questions versus answers.

The problem is that questions don’t make for a good paper unless they can be answered with some satisfaction. As soon as you plan the thing that will fit nicely into the 10-15 page article, with a concise introduction, related work section, and a description of the methods and results, you have probably left the understanding behind. You are instead in the realm of “Making Statements Whose Assumptions and Implications More or Less Follow from One Another and are the Right Scope for a Research Article.” Your task becomes connecting the dots, e.g., making clear there’s motivating logic running from the data collection to the definitions or estimators to the inferences you draw in the end. This is of course usually already established by the time you write the paper, but it can still take a long time to write it all out, and hopefully you don’t discover an error in your logic, because then it’s even harder to make the pieces fit and you have to figure out how to talk about that.

But it’s the understanding that is the source of actual new information, in contrast to the veneer of new knowledge we usually get with a paper. I used to think that even though it was hard to really explore a problem in a single paper, the real learning or understanding would manifest through bodies of work. Like if you look at my papers over the last ten years, you can see what I’ve come to understand. But I don’t think that’s quite accurate. Certainly there is some knowledge accrual and some influence of what I’ve said in past papers on how I see the world now. But I would say the knowledge I’m most interested in, or most proud of having gained, is not well represented in the papers. It’s more about what intuitions I’ve developed over time, about things like what’s hard about studying behavior under uncertainty, what’s actually an important problem or an unanswered question when it comes to learning from data in different scenarios, what’s misleading or wrong in the way things get portrayed in the literature in my field, etc.

The conflict arises because understanding doesn’t care about connecting the dots. It happens in a realm where it’s well understood that the dots have only a tenuous relationship to the truth status of whatever claims you want to make. But it’s hard to write papers in that world. Strong assertions seem out of place.

Maybe this is why Blackwell’s papers tended to be short. 

It’s worth asking whether one can reach understanding without going through the motions of doing the research. I’m not sure. I think there’s value in attempting to take things seriously and make moderately simple statements about them of the type that can be put in a research paper. But then again something like blogging can have the same effect. 

On the bright side, if you can find a way to write a paper that you really believe in, then once you put the paper out there, you might get some critical feedback. And maybe then understanding enters the equation, because the critique jars your thinking enough to help you see beyond your old premises. But at least for me this is not the norm. I like getting critical feedback, but even when the paper is about something I’m still in the midst of trying to understand, often by the time things have been published and presented at some conference and the right people see it and weigh in, I’ve already reached some conclusions about the limitations of those ideas and moved on. For this reason it has always driven me crazy when people associate my current interests with things I’ve published a couple years ago. 

In terms of shifting the balance toward more understanding, being intentional about publishing fewer papers and being pickier about what problems you take on should help. Other possibilities I’ve posted about in the past should help too, like trying to normalize scientists admitting what they don’t know or when they have doubts about their own work in talks and the papers themselves. More pointing out of assertions and claims to generalization that aren’t warranted, even if the work is already published and it makes the authors uncomfortable, because it enforces the idea that we are doing research because we actually care about getting the understanding right, not just because we like clever ideas.

P.S. Probably the title should have been, Understanding is everywhere, even, on rare occasions, in boxes labeled research. But I like the recursion!

StanCon 2023: call for proposals is still open!

This year, we’re bringing back StanCon in person!

StanCon is an opportunity for members of the broader Stan community to come together and discuss applications of Stan, recent developments in Bayesian modeling, and (most importantly perhaps) unsolved problems. The conference attracts field practitioners, software developers, and researchers working on methods and theory. This year’s conference will take place on June 20 – 23 at Washington University in St Louis, Missouri.

The keynote speakers are:

  • Bob Carpenter (Flatiron Institute)
  • John Kruschke (Indiana University)
  • Mariel Finucane (Mathematica Policy Research)
  • Siddhartha Chib (Washington University in St. Louis)

Proposals for talks, sessions and tutorials are due on March 31st (though it looks like we’ll be able to extend the deadline). Posters are accepted on a rolling basis. From the website:

We are interested in a broad range of topics relevant to the Stan community, including:

  • Applications of Bayesian statistics using Stan in all domains
  • Software development to support or complement the Stan ecosystem
  • Methods for Bayesian modeling, relevant to a broad range of users
  • Theoretical insights on common Bayesian methods and models
  • Visualization techniques
  • Tools for teaching Bayesian modeling

Keep in mind that StanCon brings together a diverse audience. Material which focuses on an application should introduce the problem to non-field experts; theoretical insights should be linked to problems modelers are working on, etc.

What does it take, or should it take, for an empirical social science study to be convincing?

A frequent correspondent sends along a link to a recently published research article and writes:

I saw this paper on a social media site and it seems relevant given your post on the relative importance of social science research. At first, I thought it was an ingenious natural experiment, but the more I looked at it, the more questions I had. They sure put a lot of work into this, though, evidence of the subject’s importance.

I’m actually not sure how bad the work is, given that I haven’t spent much time with it. But the p values are a bit overdone (understatement there). And, for all the p-values they provide, I thought it was interesting that they never mention the R-squared from any of the models. I appreciate the lack of information the R-squared would provide, but I am always interested to know if it is 0.05 or 0.70. Not a mention. They do, however, find fairly large effects – a bit too large to be believable I think.

I didn’t have time to look into this one so I won’t actually link to the linked paper; instead I’ll give some general reactions.

There’s something about that sort of study that rubs me the wrong way and gives me skepticism, but, as my correspondent says, the topic is important so it makes sense to study it. My usual reaction to such studies is that I want to see the trail of breadcrumbs, starting from time series plots of local and aggregate data and leading to the conclusions. Just seeing the regression results isn’t enough for me, no matter how many robustness studies are attached to it. Again, this does not mean that the conclusions are wrong or even that there’s anything wrong with what the researchers are doing; I just think that the intermediate steps are required to be able to make sense of this sort of analysis of limited historical data.

Laura Ingalls Wilder vs. Maurice Sendak; Cleary advances

OK, two more children’s book authors. Both have been through a lot. Laura defeated cool person Banksy, lawgiver Steve Stigler, person known by initials Malcolm X, and then scored a come-from-behind victory against lawgiver Alison Bechdel. Meanwhile, Maurice dethroned alleged tax cheat Martha Stewart, namesake Steve McQueen, and fellow children’s book author Margaret Wise Brown.

Who’s it gonna be? I’d say Maurice because he’s an illustrator as well as a writer. On the other hand, Laura’s books have a lot more content than Maurice’s; also, as a political scientist, I appreciate the story of how Laura rewrote some of her life history to be more consistent with her co-author daughter’s political ideology.

Both authors are wilderness-friendly!

Past matchup

Raghu suggests we should sit here for the present.

Dzhaughn writes:

I have had the Cleary image of Ramona sitting in a basement taking one bite out of every apple for more than 90% of my life.

But Diana counters:

I don’t wanna go down to the basement.

Moving away from Ramona for a moment, Pedro writes:

A little bit of Googling reveals that Shakira once starred in a soap opera (Telenovela) in her teen years. Apparently embarrassed, she ended up buying the rights to the soap and now it’s no longer available in any legal way.

Although I’m very sympathetic towards her actions and feelings, this blog is very pro-open science and sharing data and her actions are as against that as possible…

Good point! Cleary is very open, as you can see if you read her two volumes of autobiography. Maybe if she comes to speak, we’ll hear some excerpts from volume 3?

Again, here are the announcement and the rules.

They did a graphical permutation test to see if students could reliably distinguish the observed data from permuted replications. Now the question is, how do we interpret the results?

1. Background: Comparing a graph of data to hypothetical replications under permutation

Last year, we had a post, I’m skeptical of that claim that “Cash Aid to Poor Mothers Increases Brain Activity in Babies”, discussing recently published “estimates of the causal impact of a poverty reduction intervention on brain activity in the first year of life.”

Here was the key figure in the published article:

As I wrote at the time, the preregistered plan was to look at both absolute and relative measures of alpha, gamma, and theta (beta was only included later; it was not in the preregistration). All the differences go in the right direction; on the other hand, when you look at the six preregistered comparisons, the best p-value was 0.04 . . . after adjustment it becomes 0.12 . . . Anyway, my point here is not to say that there’s no finding just because there’s no statistical significance; there’s just a lot of uncertainty. The above image looks convincing, but part of that is coming from the fact that the responses at neighboring frequencies are highly correlated.

To get a sense of uncertainty and variation, I re-did the above graph, randomly permuting the treatment assignments for the 435 babies in the study. Here are 9 random instances:
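In case it’s useful to others, here is a sketch of that permute-and-replot step. This is not the code I actually ran; the data are fake and the column names (id, hz, treatment, power) are hypothetical.

library(dplyr)
library(ggplot2)

# Fake data standing in for the EEG measurements: one row per baby x frequency.
set.seed(435)
fake <- expand.grid(id = 1:435, hz = 1:50)
fake$treatment <- rep(rbinom(435, 1, 0.5), times = 50)
fake$power <- rnorm(nrow(fake))

# One permuted replication: shuffle treatment labels across babies (keeping
# each baby's curve intact), then re-draw the group-mean curves.
permute_and_plot <- function(dat) {
  relabel <- dat %>%
    distinct(id, treatment) %>%
    mutate(treatment = sample(treatment))
  dat %>%
    select(-treatment) %>%
    left_join(relabel, by = "id") %>%
    group_by(hz, treatment) %>%
    summarise(power = mean(power), .groups = "drop") %>%
    ggplot(aes(x = hz, y = power, linetype = factor(treatment))) +
    geom_line()
}

permute_and_plot(fake)   # repeat nine times and arrange in a grid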

2. Planning an experiment

Greg Duncan, one of the authors of the article in question, followed up:

We almost asked students in our classes to guess which of ~15 EEG patterns best conformed to our general hypothesis of negative impacts for lower frequency bands and positive impacts for higher-frequency bands. One of the graphs would be the real one and the others would be generated randomly in the same manner as in your blog post about our article. I had suggested that we wait until we could generate age and baseline-covariate-adjusted versions of those graphs . . . I am still very interested in this novel way of “testing” data fit with hypotheses — even with the unadjusted data — so if you can send some version of the ~15 graphs then I will go ahead with trying it out on students here at UCI.

I sent Duncan some R code and some graphs, and he replied that he’d try it out. But first he wrote:

Suppose we generate 14 random + 1 actual graphs; recruit, say, 200 undergraduates and graduate students; describe the hypothesis (“less low-frequency power and more high-frequency power in the treatment group relative to the control group”); and ask them to identify their top and second choices for the graphs that appear to conform most closely with the hypothesis. I would also have them write a few sentences justifying their responses in order to coax them to take the exercise seriously.

The question: how would you judge whether the responses convincingly favored the actual data? More than x% first-place votes; more than y% first or second place votes? Most votes? It would be good to pre-specify some criteria like that.
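(A quick aside from me: one simple way to pre-specify such a rule is a binomial threshold under pure guessing. Here is a sketch, with the class size and the observed vote count made up for illustration; the pvisual() calculation from the nullabor package, which appears later in this post, reports the same binomial computation alongside a simulation-based p-value.)

# Chance baseline: with 16 panels, a pure guesser picks the real plot first
# with probability 1/16, and has it in their top two with probability 2/16.
n_students <- 200      # assumed class size
n_panels <- 16

# Possible pre-specified rule: call the result better than guessing if the
# number of first-place votes exceeds the 95th percentile under pure chance.
qbinom(0.95, n_students, 1 / n_panels)

# Tail probability for a hypothetical observed count, say 40 first-place votes:
pbinom(40 - 1, n_students, 1 / n_panels, lower.tail = FALSE)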

I replied that I’m not sure the results would be definitive, but I figured it would be interesting to see what happens.

Duncan responded:

I agree that the results are merely useful but not definitive.

I agree, and Drew Bailey, who was also involved in the discussion, added:

The earlier blog post used these graphs to show that the data, if manipulated with randomly-generated treatment dummies, produced an uncomfortable number of false positives. This new exercise would inform that intuition, even if we want to rely on formal statistics for the most systematic assessment of how confident we should be with the results.

3. Experimental conditions

Duncan was then ready to go. He wrote:

I am finally ready to test randomly generated graphs out on a large classroom of undergraduate students.

Paul Yoo used Stata to generate 15 random graphs plus the real one (see attached). The position of the PNAS graph among the 16 (10th) was determined by a random number draw. (We could randomize its position but that increases the scoring task considerably.) We put an edited version of the hypothesis that was preregistered/spelled out in our original NICHD R01 proposal below the graphs. My plan is to ask class members to select their first and second choices for the graph that conforms most closely to the hypothesis.
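(An aside: the construction Duncan describes, 15 null plots plus the real one hidden at a random position, is what the lineup() function in the nullabor package automates; Dianne Cook, one of the package’s authors, weighs in below. A generic sketch using a built-in R dataset rather than the EEG data:)

library(nullabor)
library(ggplot2)

# Embed the real mtcars scatterplot among 15 permutation nulls; the true
# panel's position is chosen at random and reported via an encrypted
# decrypt() message.
set.seed(2023)
d <- lineup(null_permute("mpg"), true = mtcars, n = 16)
ggplot(d, aes(x = mpg, y = wt)) +
  geom_point() +
  facet_wrap(~ .sample)   # 16 panels, one of which is the real data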

Bailey responded:

Yes, with the same caveat as before (namely, that the paths have already forked: we aren’t looking at a plot of frequency distributions for one of the many other preregistered outcomes in part because these impacts didn’t wind up on Andrew’s blog).

4. Results

Duncan reported:

97 students examined the 16 graphs shown in the 4th slide in the attached powerpoint file. The earlier slides set up the exercise and the hypothesis.

Almost 2/3rds chose the right figure (#10) on their first guess and 78% did so on their first or second guesses. Most of the other guesses are for figures that show more treatment-group power in the beta and gamma ranges but not alpha.

5. Discussion

I’m not quite sure what to make of this. It’s interesting and I think useful to run such experiments to help stimulate our thinking.

This is all related to the 2009 paper, Statistical inference for exploratory data analysis and model diagnostics, by Andreas Buja, Dianne Cook, Heike Hofmann, Michael Lawrence, Eun-Kyung Lee, Deborah Swayne, and Hadley Wickham.

As with hypothesis tests in general, I think the value of this sort of test comes when it does not reject the null hypothesis, which is a sort of negative signal telling us that we don’t have enough data to learn more about the topic.

The thing is, I’m not clear what to make of the result that almost two-thirds of the students chose the right figure (#10) on their first guess and 78% did so in their first two guesses. On one hand, this is a lot better than the 1/16 and 1/8 we would expect from pure chance. On the other hand, some of the alternatives were similar to the real data . . . this is all getting me confused! I wonder what Buja, Cook, etc., would say about this example.

6. Expert comments

Dianne Cook responded in detail in comments. All of this is directly related to our discussion so I’m copying her comment here:

The interpretation depends on the construction of the null sets. Here you have randomised the group. There is no control of the temporal dependence or any temporal trend, so where the lines cross, or how volatile they are, is possibly distracting.

You have also asked a very specific one-sided question – it took me some time to digest what your question is asking. Effectively it is: in which plot is the solid line much higher than the dashed line in only three of the zones? When you are randomising groups, the group labels have no relevance, so it would be a good idea to set the higher-valued one to be the solid line in all null sets. Otherwise, some plots would be automatically irrelevant. People don’t need to know the context of a problem to be an observer for you, and it is almost always better if the context is removed. If you had asked a different question, e.g., in which plot are the lines getting further apart at higher Hz, or in which plot are the two lines the most different, you would likely have gotten different responses. The question you ask matters. We typically try to keep it generic: “which plot is different” or “which plot shows the most difference between groups”. Being too specific can create the same problem as creating the hypothesis post-hoc after you have seen the data, e.g., you spot clusters and then do a MANOVA test. You pre-registered your hypothesis so this shouldn’t be a problem. Thus your null hypothesis is “There is NO difference in the high-frequency power between the two groups.”

When you see as much variability in the null sets as you have here, it would be recommended to make more null sets. With more variability, you need more comparisons. Unlike a conventional test where we see the full curve of the sampling distribution and can check if the observed test statistic has a value in the tails, with randomisation tests we have a finite number of draws from the sampling distribution on which to make a comparison. Numerically we could generate tons of draws but for visual testing, it’s not feasible to look at too many. However, you still might need more than your current 15 nulls to be able to gauge the extent of the variability.

For your results, it looks like 64 of the 97 students picked plot 10 as their first pick. Assuming that this was done independently and that they weren’t having side conversations in the room, you could use nullabor to calculate the p-value:

> library(nullabor)
> pvisual(64, 97, 16)
      x simulated binom
[1,] 64         0     0

which means that the probability that this many people would pick plot 10, if it really were a null sample, is 0. Thus we would reject the null hypothesis and conclude, with strong evidence, that there is more high-frequency power in the high-cash group. You can include the second votes by weighting the p-value calculation by two picks out of 16 instead of one, but here the p-value is still going to be 0.

To understand whether observers are choosing the data plot for reasons related to the hypothesis, you have to ask them why they made their choice. Again, this should be very specific here because you’ve asked a very specific question, things like “the lines are constantly further apart on the right side of the plot”. For people who chose null plots instead of 10, it would be interesting to know what they were looking at. In this set of nulls, there are so many other types of differences! Plot 3 has differences everywhere. We know there are no actual group differences, so this big of an observed difference is consistent with there being no true difference. It is ruled out as a contender only because the question asks whether there is a difference in 3 of the 4 zones. We see crossings of lines in many plots, so this is something we are very likely to see assuming the null is true. The big scissor pattern in 8 is interesting, but we know this has arisen by chance.

Well, this has taken some time to write. Congratulations on an interesting experiment, and interesting post. Care needs to be taken in designing data plots, constructing the null-generating mechanisms and wording questions appropriately when you apply the lineup protocol in practice.

This particular work was born from curiosity about a published data plot. It reminds me of our work in Roy Chowdhury et al (2015) (https://link.springer.com/article/10.1007/s00180-014-0534-x), which was also inspired by a plot in a published paper where the authors reported clustering. Our lineup study showed that this was an incorrect conclusion: the clustering was due to the high dimensionality. I think your conclusion now would be that the published plot does show the high-frequency difference reported.

She also lists a bunch of relevant references at the end of the linked comment.

Parallelization for Markov chain Monte Carlo with heterogeneous runtimes: a case-study on ODE-based models

(this post is by Charles)

Last week, BayesComp 2023 took place in Levi, Finland. The conference covered a broad range of topics in Bayesian computation, with many high-quality sessions, talks, and posters. Here’s a link to the talk abstracts. I presented two posters at the event. The first poster was on assessing the convergence of MCMC in the many-short-chains regime. I already blogged about this research (link): here’s the poster and the corresponding preprint.

The second poster was also on the topic of running many chains in parallel but in the context of models based on ordinary differential equations (ODEs). This was the outcome of a project led by Stanislas du Ché during his summer internship at Columbia University. We examined several pharmacometrics models, with likelihoods parameterized by the solution to an ODE. Having to solve an ODE inside a Bayesian model is challenging because the behavior of the ODE can change as the Markov chains journey across the parameter space. An ODE that is easy to solve at one point can be incredibly difficult to solve somewhere else. In the past, we analyzed this issue in the illustrative planetary motion example (Gelman et al (2020), Section 11). This is the type of problem where we need to be careful about how we initialize our Markov chains, and we should not rely on Stan’s defaults. Indeed, these defaults can start you in regions where your ODE is nearly impossible to solve and completely kill your computation! A popular heuristic is to draw the initial point from the prior distribution. On a related note, we need to construct priors carefully to exclude patently absurd parameter values and (hopefully) parameter values prone to frustrate our ODE solvers.
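To make the initialization point concrete, here is a sketch of how one might draw each chain’s starting point from the prior in cmdstanr. The model file, parameter names, and priors are invented for the example; they are not from Stanislas’s project.

library(cmdstanr)

# Instead of Stan's default inits (uniform on (-2, 2) on the unconstrained
# scale), draw each chain's starting point from the prior.
init_from_prior <- function() {
  list(
    k_a   = rlnorm(1, meanlog = log(1),   sdlog = 0.5),  # hypothetical absorption rate
    k_e   = rlnorm(1, meanlog = log(0.3), sdlog = 0.5),  # hypothetical elimination rate
    sigma = abs(rnorm(1, 0, 1))                          # measurement noise scale
  )
}

mod <- cmdstan_model("pk_model.stan")   # hypothetical ODE-based Stan model
fit <- mod$sample(
  data = stan_data,          # assumed to be defined elsewhere
  chains = 8,
  parallel_chains = 8,
  init = init_from_prior     # the function is called once per chain
)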

Even then—and especially if our priors are weakly informative—our Markov chains will likely journey through challenging regions. A common manifestation of this problem is that some chains lag behind because their random trajectories take them through areas that frustrate the ODE solver. Stanislas observed that this problem becomes more acute when we run many chains. Indeed, as we increase the number of chains, the probability that at least some of the chains get “stuck” increases. Then, even when running chains in parallel, the efficiency of MCMC as measured by effective sample size per second (ESS/s) eventually goes down as we add more chains because we are waiting for the slowest chain to finish!
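Here is a toy simulation of that effect (mine, not from the poster): suppose each chain independently has a small chance of landing in a hard region where the ODE solver runs 100 times slower, and wall-clock time is set by the slowest chain. The numbers are all assumptions chosen for illustration.

# ESS per second in a toy model of heterogeneous chain runtimes.
set.seed(123)
ess_per_chain <- 400   # assumed ESS contributed by each chain
p_stuck <- 0.05        # assumed probability that a chain hits a hard region
mean_ess_per_second <- function(n_chains, n_sims = 5000) {
  mean(replicate(n_sims, {
    runtimes <- ifelse(runif(n_chains) < p_stuck, 6000, 60)  # seconds per chain
    n_chains * ess_per_chain / max(runtimes)                 # wait for slowest chain
  }))
}
round(sapply(c(2, 4, 8, 16, 32, 64), mean_ess_per_second), 1)

In this toy setup, ESS per second rises at first but then drops once there are enough chains that at least one of them is almost always stuck.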

Ok. Well, we don’t want to be punished for throwing more computation at our problem. What if we instead waited for the fastest chains to finish? This is what Stanislas studied, proposing a strategy in which we stop the analysis once a target ESS is achieved, even if some chains are still warming up. An important question is what bias dropping chains introduces. One concern is that the fastest chains are biased because they fail to explore a region of the parameter space which contains a non-negligible amount of probability mass and where the ODE happens to be more difficult to solve. Stanislas tried to address this problem using stacking (Yao et al 2018), a strategy designed to correct for biased Markov chains. But stacking still assumes that all the chains somehow “cover” the region where the probability mass concentrates and, when properly weighted, produce unbiased Monte Carlo estimators.

We may also wonder about the behavior of the slow chains. If the slow chains are close to stationarity, then by excluding them we are throwing away samples that would reduce the variance of our Monte Carlo estimators; however, it’s not worth waiting for these chains to finish if we’ve already achieved the desired precision. What is more, as Andrew Gelman pointed out to me, slow chains can often be biased, for example if they get stuck in a pathological region during the warmup and never escape this region—as was the case in the planetary motion example. But we can’t expect this to always be the case.

In summary, I like the idea of waiting only for the fastest chains and I think understanding how to do this in a robust manner remains an open question. This work posed the problem and took steps in the right direction. There was a lot of traffic at the poster and I was pleased to see many people at the conference working on ODE-based models.

“On March 14th at 7pm ET, thought leader and Harvard professor Steven Pinker will release digital collectibles of his famous idea that ‘Free speech is fundamental.'”

A commenter points us to this juicy story:

John Glenn, huh? I had no idea. I guess it makes sense, though: after the whole astronaut thing ended, dude basically spent the last few decades of his life hanging out with rich people.

Following the link:

Two tiers will be available: the gold collectible, which is unique and grants the buyer the right to co-host the calls with Pinker, will be priced at $50,000; the standard collectibles, which are limited to 30 items and grant the buyers the right to access those video calls and ask questions to Pinker at the end, will be priced at 0.2 Ethereum (~$300).

Here’s the thing. Pinker’s selling collectibles of his idea, “Free speech is fundamental.” But we know from some very solid research that scientific citations are worth $100,000 each.

So does that mean that Pinker’s famous idea that “Free speech is fundamental” is only worth, at best, 0.5 citations? That doesn’t seem fair at all. Pinker’s being seriously ripped off here.

On the other hand, he could also sell collectibles for some of his other ideas, such as, “Did the crime rate go down in the 1990s because two decades earlier poor women aborted children who would have been prone to violence?”, “Are suicide terrorists well-educated, mentally healthy and morally driven?”, “Do African-American men have higher levels of testosterone, on average, than white men?”, or, my personal favorite, “Do parents have any effect on the character or intelligence of their children?” 50 thousand here, 50 thousand there, pretty soon you’re talking about real money.

All joking aside, I don’t see anything wrong with Pinker doing this. The NFT is a silly gimmick, sure, but what he’s really doing is coming up with a clever way to raise money for his research projects. If I had a way to get $50,000 donations, I’d do it too. It’s hard to believe that anyone buying the “NFT” is thinking that they’re getting their hands on a valuable, appreciating asset. It’s just a way for them to support Pinker’s professional work. One reason this topic interests me is that we’re always on the lookout for new sources of research funds. (We’ve talked about putting ads on the blog, but it seems like the amount of $ we’d end up getting wouldn’t be worth all the hassle involved in having ads.) As is often the case with humor, we laugh because we care.

And why is this particular story so funny? Maybe because it seems so time-bound, kind of as if someone were selling custom disco balls in the 1970s, or something like that. And he’s doing it with such a straight face (“* * * NOW LIVE . . . My first digital collectible . . .”)! If you’re gonna do it at all, you go all in, I guess.

P.S. Following the links on the above twitter feed led me to this website of McGill University’s Office for Science and Society, whose slogan is, “Separating Sense from Nonsense.” How cool is that?

What a great idea! I wonder how they fund it. They should have similar offices at Ohio State, Cornell, Harvard (also here), the University of California, Columbia, etc etc etc.