Zipf’s law and Heaps’s law. Also, when is big as bad as infinity? And why unit roots aren’t all that.

John Cook has a fun and thoughtful post on Zipf’s law, which “says that the frequency of the nth word in a language is proportional to n^(−s),” linking to an earlier post of his on Heaps’s law, which “says that the number of unique words in a text of n words is approximated by Kn^β, where K is a positive constant and β is between 0 and 1. According to the Wikipedia article on Heaps’ law, K is often between 10 and 100 and β is often between 0.4 and 0.6.” Unsurprisingly, you can derive one of these laws from the other; see links on the aforementioned Wikipedia page.
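As a quick illustration of that connection (a sketch of my own, not from Cook’s posts; it assumes numpy is available, and the vocabulary size and exponent are made-up values), you can draw word tokens from a Zipf distribution and watch the number of unique words grow roughly as a power of the text length:

```python
import numpy as np

rng = np.random.default_rng(0)

# Zipf frequencies over a fixed vocabulary: p(rank) proportional to rank^(-s)
V, s = 50_000, 1.1                      # vocabulary size and exponent (made up)
ranks = np.arange(1, V + 1)
probs = ranks ** (-s)
probs /= probs.sum()

# Draw a long "text" and count unique words seen in the first n tokens
n_tokens = 200_000
tokens = rng.choice(V, size=n_tokens, p=probs)
checkpoints = np.unique(np.logspace(2, np.log10(n_tokens), 10).astype(int))
uniques = [len(np.unique(tokens[:n])) for n in checkpoints]
for n, u in zip(checkpoints, uniques):
    print(f"n = {n:>7d}   unique words = {u:>6d}")

# Heaps-style fit: log(unique) vs log(n) should be roughly linear with slope beta
beta, logK = np.polyfit(np.log(checkpoints), np.log(uniques), 1)
print(f"estimated beta ~ {beta:.2f}, K ~ {np.exp(logK):.1f}")
```

The log-log fit at the end plays the role of the Heaps exponent; its exact value depends on the Zipf exponent and the vocabulary size you assume.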

In his post on Zipf, Cook discusses the idea that setting large numbers to infinity can work in some settings but not in others. In some ways this should already be clear to you—for example, if a = 3.4 + 1/x and x is very large, then for most purposes involving a you can just say a = 3.4, but if you care about x itself, you can’t just call it infinity. If you can use infinity, that simplifies your life. As Cook puts it, “Infinite is easier than big.” Another way of saying this is that, if you can use infinity, you can instead use some large number like 10^8, thus avoiding literal infinities but getting many of the benefits in simplicity and computation.

Cook continues:

Whether it is possible to model N [the number of words in the language] as infinite depends on s [the exponent in the Zipf formula]. The value of s that models word frequency in many languages is close to 1. . . . When s = 1, we don’t have a probability distribution because the sum of 1/n from 1 to ∞ diverges. And so no, we cannot assume N = ∞. Now you may object that all we have to do is set s to be slightly larger than 1. If s = 1.0000001, then the sum of n^(−s) converges. Problem solved. But not really.

When s = 1 the series diverges, but when s is slightly larger than 1 the sum is very large. Practically speaking this assigns too much probability to rare words.

I like how he translates the model error into a real-world issue.
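Cook’s numbers are easy to check. Here’s a small sketch (assuming numpy and scipy are available) of both halves of the point: at s = 1 the partial sums grow without bound, and at s = 1.0000001 the normalizing constant is finite but enormous, which is why so much probability ends up on rare words:

```python
import numpy as np
from scipy.special import zeta

# Partial sums of 1/n^s for s = 1 grow without bound, roughly like log(N):
for N in [10**3, 10**6, 10**9]:
    print(f"s = 1, N = {N:.0e}: partial sum ~ {np.log(N) + 0.5772:.1f}")

# For s just above 1 the infinite sum (the normalizing constant) is finite but huge:
for s in [1.0000001, 1.001, 1.1, 2.0]:
    print(f"s = {s}: zeta(s) = {zeta(s):.4g}")

# With s = 1.0000001, zeta(s) is about 1e7, so the most common word gets
# probability ~1e-7 and the long tail of rare words soaks up nearly all the mass.
```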

This all reminds me of a confusion that sometimes arises in statistical inference. As Cook says, if you have problems with infinity, you’ll often also have problems with large finite numbers. For example, it’s not good to have an estimate that has an infinite variance, but if it has a very large variance, you’ll still have instability. Convergence conditions aren’t just about yes or no, they’re also about how close you are. Similarly with all that crap in time series about unit roots. The right question is not, Is there a unit root? It’s, What are you trying to model?
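Here’s a minimal sketch of that variance point (my example, not Cook’s), using t-distributed draws as a stand-in for an estimator with heavy tails: the infinite-variance case is hopeless, and the finite-but-huge-variance case still gives noticeably unstable sample means compared to the well-behaved case.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample means of t-distributed draws across many replications.
# df = 1: infinite variance (Cauchy).  df = 2.05: variance exists but is huge.
# df = 30: essentially normal.  The middle case is "finite" on paper but the
# sample mean is still far noisier than in the well-behaved case.
n, reps = 100, 10_000
for df in [1.0, 2.05, 30.0]:
    means = rng.standard_t(df, size=(reps, n)).mean(axis=1)
    print(f"df = {df:5.2f}: sd of sample means = {means.std():10.3f}, "
          f"largest |mean| = {np.abs(means).max():10.2f}")
```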

Of course it’s preregistered. Just give me a sec

This is Jessica. I was going to post something on Bak-Coleman and Devezer’s response to the Protzko et al. paper on the replicability of research that uses rigor-enhancing practices like large samples, preregistration, confirmatory tests, and methodological transparency, but Andrew beat me to it. But since his post didn’t get into one of the surprising aspects of their analysis (beyond the paper making a causal claim without a study design capable of assessing causality), I’ll blog on it anyway.

Bak-Coleman and Devezer describe three ways in which the measure of replicability that Protzko et al. use to argue that the 16 effects they study are more replicable than effects in prior studies deviates from prior definitions of replicability:

  1. Protzko et al. define replicability as the chance that any replication achieves significance in the hypothesized direction as opposed to whether the results of the confirmation study and the replication were consistent 
  2. They include self-replications in calculating the rate
  3. They include repeated replications of the same effect and replications across different effects in calculating the rate

Could these deviations in how replicability is defined have been decided post-hoc, so that the authors could present positive evidence for their hypothesis that rigor-enhancing practices work? If they preregistered their definition of replicability, we would not be so concerned about this possibility.  Luckily, the authors report that “All confirmatory tests, replications and analyses were preregistered both in the individual studies (Supplementary Information section 3 and Supplementary Table 2) and for this meta-project (https://osf.io/6t9vm).”

But wait – according to Bak-Coleman and Devezer:

the analysis on which the titular claim depends was not preregistered. There is no mention of examining the relationship between replicability and rigor-improving methods, nor even how replicability would be operationalized despite extensive descriptions of the calculations of other quantities. With nothing indicating this comparison or metric it rests on were planned a priori, it is hard to distinguish the core claim in this paper from selective reporting and hypothesizing after the results are known. 

Uh-oh, that’s not good. At this point, some OSF sleuthing was needed. I poked around the link above, and the associated project containing analysis code. There are a couple analysis plans: Proposed Overarching Analyses for Decline Effect final.docx, from 2018, and Decline Effect Exploratory analyses and secondary data projects P4.docx, from 2019. However, these do not appear to describe the primary analysis of replicability in the paper (the first describes an analysis that ends up in the Appendix, and the second a bunch of exploratory analyses that don’t appear in the paper). About a year later, the analysis notebooks with the results they present in the main body of the paper were added. 

According to Bak-Coleman on X/Twitter: 

We emailed the authors a week ago. They’ve been responsive but as of now, they can’t say one way or another if the analyses correspond to a preregistration. They think they may be in some documentation.

In the best case scenario where the missing preregistration is soon found, this example suggests that there are still many readers and reviewers for whom some signal of rigor suffices even when the evidence of it is lacking. In this case, maybe the reputation of authors like Nosek reduced the perceived need on the part of the reviewers to track down the actual preregistration. But of course, even those who invented rigor-enhancing practices can still make mistakes!

In the alternative scenario where the preregistration is not found soon, what is the correct course of action? Surely at least a correction is in order? Otherwise we might all feel compelled to try our luck at signaling preregistration without having to inconvenience ourselves by actually doing it.

More optimistically, perhaps there are exciting new research directions that could come out of this. Like, wearable preregistration, since we know from centuries of research and practice that it’s harder to lose something when it’s sewn to your person. Or, we could submit our preregistrations to OpenAI, I mean Microsoft, who could make a ChatGPT-enabled Preregistration Buddy who not only trained on your preregistration, but also knows how to please a human judge who wants to ask questions about what it said.

More on possibly rigor-enhancing practices in quantitative psychology research

In a paper entitled, “Causal claims about scientific rigor require rigorous causal evidence,” Joseph Bak-Coleman and Berna Devezer write:

Protzko et al. (2023) claim that “High replicability of newly discovered social-behavioral findings is achievable.” They argue that the 86% rate of replication observed in their replication studies is due to “rigor-enhancing practices” such as confirmatory tests, large sample sizes, preregistration and methodological transparency. These findings promise hope as concerns over low rates of replication have plagued the social sciences for more than a decade. Unfortunately, the observational design of the study does not support its key causal claim. Instead, inference relies on a post hoc comparison of a tenuous metric of replicability to past research that relied on incommensurable metrics and sampling frames.

The article they’re referring to is by a team of psychologists (John Protzko, Jon Krosnick, et al.) reporting “an investigation by four coordinated laboratories of the prospective replicability of 16 novel experimental findings using rigor-enhancing practices: confirmatory tests, large sample sizes, preregistration, and methodological transparency. . . .”

When I heard about that paper, I teed off on their proposed list of rigor-enhancing practices.

I’ve got no problem with large sample sizes, preregistration, and methodological transparency. And confirmatory tests can be fine too, as long as they’re not misinterpreted and not used for decision making.

My biggest concern is that the authors or readers of that article will think that these are the best rigor-enhancing practices in science (or social science, or psychology, or social psychology, etc.), or the first rigor-enhancing practices that researchers should reach for, or the most important rigor-enhancing practices, or anything like that.

Instead, I gave my top 5 rigor-enhancing practices, in approximately decreasing order of importance:

1. Make it clear what you’re actually doing. Describe manipulations, exposures, and measurements fully and clearly.

2. Increase your effect size, e.g., do a more effective treatment.

3. Focus your study on the people and scenarios where effects are likely to be largest.

4. Improve your outcome measurement.

5. Improve pre-treatment measurements.

The suggestions of “confirmatory tests, large sample sizes, preregistration, and methodological transparency” are all fine, but I think all are less important than the 5 steps listed above. You can read the linked post to see my reasoning; also there’s Pam Davis-Kean’s summary, “Know what the hell you are doing with your research.” You might say that goes without saying, but it doesn’t, even in some papers published in top journals such as Psychological Science and PNAS!

You can also read a response to my post from Brian Nosek, a leader in the replication movement and one of the coauthors of the article being discussed.

In their new article, Bak-Coleman and Devezer take a different tack than me, in that they’re focused on challenges of measuring replicability of empirical claims in psychology, whereas I was more interested in the design of future studies. To a large extent, I find the whole replicability thing important to the extent that it gives researchers and users of research less trust in generic statistics-backed claims; I’d guess that actual effects typically vary so much based on context that new general findings are mostly not to be trusted. So I’d say that Protzko et al., Nosek, Bak-Coleman and Devezer, and I are coming from four different directions. (Yes, I recognize that Nosek is one of the authors of the Protzko et al. paper; still, in his blog comment he seemed to have a slightly different perspective). The article by Bak-Coleman and Devezer seems very relevant to any attempt to understand the empirical claims of Protzko et al.

The rise and fall of Seth Roberts and the Shangri-La diet

Here’s a post that’s suitable for the Thanksgiving season.

I no longer believe in the Shangri-La diet. Here’s the story.

Background

I met Seth Roberts back in the early 1990s when we were both professors at the University of California. He sometimes came to the statistics department seminar and we got to talking about various things; in particular we shared an interest in statistical graphics. Much of my work in this direction eventually went toward the use of graphical displays to understand fitted models. Seth went in another direction and got interested in the role of exploratory data analysis in science, the idea that we could use graphs not just to test or even understand a model but also as the source of new hypotheses. We continued to discuss these issues over the years.

At some point when we were at Berkeley the administration was encouraging the faculty to teach freshman seminars, and I had the idea of teaching a course on left-handedness. I’d just read the book by Stanley Coren and thought it would be fun to go through it with a class, chapter by chapter. But my knowledge of psychology was minimal so I contacted the one person I knew in the psychology department and asked him if he had any suggestions of someone who’d like to teach the course with me. Seth responded that he’d be interested in doing it himself, and we did it.

Seth was an unusual guy—not always in a good way, but some of his positive traits were friendliness, inquisitiveness, and an openness to consider new ideas. He also struggled with mood swings, social awkwardness, and difficulties with sleep, and he attempted to address these problems with self-experimentation.

After we taught the class together we got together regularly for lunch and Seth told me about his efforts in self-experimentation involving sleeping hours and mood. Most interesting to me was his discovery that seeing life-sized faces in the morning helped with his mood. I can’t remember how he came up with this idea, but perhaps he started by following the recommendation that is often given to people with insomnia to turn off TV and other sources of artificial light in the evening. Seth got in the habit of taping late-night talk-show monologues and then watching them in the morning while he ate breakfast. He found himself happier, did some experimentation, and concluded that we had evolved to talk with people in the morning, and that life-sized faces were necessary. Seth lived alone, so the more natural approach of talking over breakfast with a partner was not available.

Seth’s self-experimentation went slowly, with lots of dead-ends and restarts, which makes sense given the difficulty of his projects. I was always impressed by Seth’s dedication in this, putting in the effort day after day for years. Or maybe it did not represent a huge amount of labor for him, perhaps it was something like a diary or blog which is pleasurable to create, even if it seems from the outside to be a lot of work. In any case, from my perspective, the sustained focus was impressive. He had worked for years to solve his sleep problems and only then turned to the experiments on mood.

Seth’s academic career was unusual. He shot through college and graduate school to a tenure-track job at a top university, then continued to do publication-quality research for several years until receiving tenure. At that point he was not a superstar but I think he was still considered a respected member of the mainstream academic community. But during the years that followed, Seth lost interest in that thread of research. He told me once that his shift was motivated by teaching introductory undergraduate psychology: the students, he said, were interested in things that would affect their lives, and, compared to that, the kind of research that leads to a productive academic career did not seem so appealing.

I suppose that Seth could’ve tried to do research in clinical psychology (Berkeley’s department actually has a strong clinical program) but instead he moved in a different direction and tried different things to improve his sleep and then, later, his skin, his mood, and his diet. In this work, Seth applied what he later called his “insider/outsider perspective”: he was an insider in that he applied what he’d learned from years of research on animal behavior, an outsider in that he was not working within the existing paradigm of research in physiology and nutrition.

At the same time he was working on a book project, which I believe started as a new introductory psychology course focused on science and self-improvement but ultimately morphed into a trade book on ways in which our adaptations to Stone Age life were not serving us well in the modern era. I liked the book but I don’t think he found a publisher. In the years since, this general concept has been widely advanced and many books have been published on the topic.

When Seth came up with the connection between morning faces and depression, this seemed potentially hugely important. Were the faces really doing anything? I have no idea. On one hand, Seth was measuring his own happiness and doing his own treatments on his own hypothesis, so the potential for expectation effects is huge. On the other hand, he said the effect he discovered was a surprise to him and he also reported that the treatment worked with others. Neither he nor, as far as I know, anyone else has attempted a controlled trial of this idea.

In his self-experimentation, Seth lived the contradiction between the two tenets of evidence-based medicine: (1) Try everything, measure everything, record everything; and (2) Make general recommendations based on statistical evidence rather than anecdotes.

Seth’s ideas were extremely evidence-based in that they were based on data that he gathered himself or that people personally sent in to him, and he did use the statistical evidence of his self-measurements, but he did not put in much effort to reduce, control, or adjust for biases in his measurements, nor did he systematically gather data on multiple people.

The Shangri-La diet

Seth’s next success after curing his depression was losing 40 pounds on an unusual diet that he came up with, in which you can eat whatever you want as long as each day you drink a cup of unflavored sugar water, at least an hour before or after a meal. His theory for why the diet worked was that the carefully timed sugar water reduced the association between calories and flavor, thus lowering your weight set-point and making you uninterested in eating lots of food.

I asked Seth once if he thought I’d lose weight if I were to try his diet in a passive way, drinking the sugar water at the recommended time but not actively trying to reduce my caloric intake. He said he supposed not, that the diet would make it easier to lose weight but I’d probably still have to consciously eat less.

I described Seth’s diet to one of my psychologist colleagues at Columbia and asked what he thought of it. My colleague said he thought it was ridiculous. And, as with the depression treatment, Seth never had an interest in running a controlled trial, even for the purpose of convincing the skeptics.

I had a conversation with Seth about this. He said he’d tried lots of diets and none had worked for him. I suggested that maybe he was just ready at last to eat less and lose weight, and he said he’d been ready for a while but this was the first diet that allowed him to eat less without difficulty. I suggested that maybe the theory underlying Seth’s diet was compelling enough to act as a sort of placebo, motivating him to follow the protocol. Seth responded that other people had tried his diet and lost weight with it. He also reminded me that it’s generally accepted that “diets don’t work” and that people who lose weight while dieting will usually gain it all back. He felt that his diet was different in that it didn’t tell you what foods to eat or how much; rather, it changed your set point so that you didn’t want to eat so much. I found Seth’s arguments persuasive. I didn’t feel that his diet had been proved effective, but I thought it might really work, I told people about it, and I was happy about its success. Unlike my Columbia colleague, I didn’t think the idea was ridiculous.

Media exposure and success

Seth’s breakout success happened gradually, starting with a 2005 article on self-experimentation in Behavioral and Brain Sciences, a journal that publishes long articles followed by short discussions from many experts. Some of his findings from the ten experiments discussed in the article:

Seeing faces in the morning on television decreased mood in the evening and improved mood the next day . . . Standing 8 hours per day reduced early awakening and made sleep more restorative . . . Drinking unflavored fructose water caused a large weight loss that has lasted more than 1 year . . .

As Seth described it, self-experimentation generates new hypotheses and is also an inexpensive way to test and modify them. The article does not seem to have had a huge effect within research psychology (Google Scholar gives it 93 cites) but two of its contributions—the idea of systematic self-experimentation and the weight-loss method—have spread throughout the popular culture in various ways. Seth’s work was featured in a series of increasingly prominent blogs, which led to a newspaper article by the authors of Freakonomics and ultimately a successful diet book (not enough to make Seth rich, I think, but Seth had simple tastes and no desire to be rich, as far as I know). Meanwhile, Seth started a blog of his own which led to a message board for his diet that he told me had thousands of participants.

Seth achieved some measure of internet fame, with fans including Nassim Taleb, Steven Levitt, Dennis Prager, Tucker Max, Tyler Cowen, . . . and me! In retrospect, I don’t think having all this appreciation was good for him. On his blog and elsewhere Seth reported success with various self-experiments, the last of which was a claim of improved brain function after eating half a stick of butter a day. Even while maintaining interest in Seth’s ideas on mood and diet, I was entirely skeptical of his new claims, partly because of his increasing rate of claimed successes. It took Seth close to 10 years of sustained experimentation to fix his sleep problems, but in later years it seemed that all sorts of different things he tried were effective. His apparent success rate was implausibly high. What was going on? One problem is that sleep hours and weight can be measured fairly objectively, whereas if you measure brain function by giving yourself little quizzes, it doesn’t seem hard at all for a bit of unconscious bias to drive all your results. I also wonder if Seth’s blog audience was a problem: if you have people cheering on your every move, it can be that much easier to fool yourself.

Seth also started to go down some internet rabbit holes. On one hand, he was a left-wing Berkeley professor who supported universal health care, Amnesty International, and other liberal causes. On the other hand, his paleo-diet enthusiasm brought him close to various internet right-wingers, and he was into global warming denial and kinda sympathetic to Holocaust denial, not because he was a Nazi or anything but just because he had a distrust-of-authority thing going on. I guess that if he’d been an adult back in the 1950s and 1960s he would’ve been on the extreme left, but more recently it’s been the far right where the rebels are hanging out. Seth also had sympathy for some absolutely ridiculous and innumerate research on sex ratios and absolutely loved the since-discredited work of food behavior researcher Brian Wansink; see here and here. The point here is not that Seth believed things that turned out to be false—that happens to all of us—but rather that he had a soft spot for extreme claims that were wrapped in the language of science.

Back to Shangri-La

A few years ago, Seth passed away, and I didn’t think of him too often, but then a couple years ago my doctor told me that my cholesterol level was too high. He prescribed a pill, which I’m still taking every day, and he told me to switch to a mostly-plant diet and lose a bunch of weight.

My first thought was to try the Shangri-La diet: that cup of unflavored sugar water, at least an hour away from any meal. Or maybe I did the spoonful of unflavored olive oil; I can’t remember which. Anyway, I tried it for a few days, also following the advice to eat less. And then after a few days, I thought: if the point is to eat less, why not just do that? So that’s what I did. No sugar water or olive oil needed.

What’s the point of this story? Not that losing the weight was easy for me. For a few years before that fateful conversation, my doctor had been bugging me to lose weight, and I’d vaguely wanted that to happen, but it hadn’t. What worked was me having this clear goal and motivation. And it’s not like I’m starving all the time. I’m fine; I just changed my eating patterns, and I take in a lot less energy every day.

But here’s a funny thing. Suppose I’d stuck with the sugar water and everything else had been the same. Then I’d have lost all this weight, exactly when I’d switched to the new diet. I’d be another enthusiastic Shangri-La believer, and I’d be telling you, truthfully, that only since switching to that diet had I been able to comfortably eat less. But I didn’t stick with Shangri-La and I lost the weight anyway, so I won’t make that attribution.

OK, so after that experience I had a lot less belief in Seth’s diet. The flip side of being convinced by his earlier self-experiment was becoming unconvinced after my own self-experiment.

And that’s where I stood until I saw this post at the blog Slime Mold Time Mold about informal experimentation:

For the potato diet, we started with case studies like Andrew Taylor and Penn Jillette; we recruited some friends to try nothing but potatoes for several days; and one of the SMTM authors tried the all-potato diet for a couple weeks.

For the potassium trial, two SMTM hive mind members tried the low-dose potassium protocol for a couple of weeks and lost weight without any negative side effects. Then we got a couple of friends to try it for just a couple of days to make sure that there weren’t any side effects for them either.

For the half-tato diet, we didn’t explicitly organize things this way, but we looked at three very similar case studies that, taken together, are essentially an N = 3 pilot of the half-tato diet protocol. No idea if the half-tato effect will generalize beyond Nicky Case and M, but the fact that it generalizes between them is pretty interesting. We also happened to know about a couple of other friends who had also tried versions of the half-tato diet with good results.

My point here is not to delve into the details of these new diets, but rather to point out that they are like the Shangri-La diet: different from other diets, associated with some theory, evaluated through before-after studies on some people who wanted to lose weight, and apparently successful.

At this point, though, my conclusion is not that unflavored sugar water is effective in making it easy to lose weight, or that unflavored oil works, or that potatoes work, or that potassium works. Rather, the hypothesis that’s most plausible to me is that, if you’re at the right stage of motivation, anything can work.

Or, to put it another way, I now believe that the observed effect of the Shangri-La diet, the potato diet, etc., comes from a mixture of placebo and selection. The placebo is that just about any gimmick can help you lose weight, and keep the weight off, if it somehow motivates you to eat less. The selection is that, once you’re ready to try something like this diet, you might be ready to eat less.

But what about “diets don’t work”? I guess that diets don’t work for most people at most times. But the people trying these diets are not “most people at most times.” They’re people with a high motivation to eat less and lose weight.

I’m not saying I have an ironclad case here. I’m pretty much now in the position of my Columbia colleague who felt that there’s no good reason to believe that Seth’s diet is more effective than any other arbitrary series of rules that somewhere includes the suggestion to eat less. And, yes, I have the same impression of the potato diet and the other ideas mentioned above. It’s just funny that it took so long for me to reach this position.

Back to Seth

I wouldn’t say the internet killed Seth Roberts, but ultimately I don’t think it did him any favors to become an internet hero, in the same way that it’s not always good for an ungrounded person to become an academic hero, or an athletic hero, or a musical hero, or a literary hero, or a military hero, or any other kind of hero. The stuff that got you to heroism can be a great service to the world, but what comes next can be a challenge.

Seth ended up believing in his own hype. In this case, the hype was not that he was an amazing genius; rather, the hype was about his method, the idea that he had discovered modern self-experimentation (to the extent that this rediscovery can be attributed to anybody, it should be to Seth’s undergraduate adviser, Allen Neuringer, in this article from 1981). Maybe even without his internet fame Seth would’ve gone off the deep end and started to believe he was regularly making major discoveries; I don’t know.

From a scientific standpoint, Seth’s writings are an example of the principle that honesty and transparency are not enough. He clearly described what he did, but his experiments got to be so flawed as to be essentially useless.

After I posted my obituary of Seth (from which I took much of the beginning of this post), there were many moving tributes in the comments, and I concluded by writing, “It is good that he found an online community of people who valued him.” That’s how I felt at the time, but in retrospect, maybe not. If I could’ve done it all over again, I never would’ve promoted his diet, a promotion that led to all the rest.

I’d guess that the wide dissemination of Seth’s ideas was a net benefit to the world. Even if his diet idea is bogus, it seems to have made a difference to a lot of people. And even if the discoveries he reported from his self-experimentation (eating half a stick of butter a day improving brain functioning and all the rest) were nothing but artifacts of his hopeful measurement protocols, the idea of self-experimentation was empowering to people—and I’m assuming that even his true believers (other than himself) weren’t actually doing the butter thing.

Setting aside the effects on others, though, I don’t think that this online community was good for Seth in his own work or for his personal life. In some ways he was ahead of his time, as nowadays we’re hearing a lot about people getting sucked into cult-like vortexes of misinformation.

P.S. Lots of discussion in comments, including this from the Slime Mold Time Mold bloggers.

I disagree with Geoff Hinton regarding “glorified autocomplete”

Computer scientist and “godfather of AI” Geoff Hinton says this about chatbots:

“People say, It’s just glorified autocomplete . . . Now, let’s analyze that. Suppose you want to be really good at predicting the next word. If you want to be really good, you have to understand what’s being said. That’s the only way. So by training something to be really good at predicting the next word, you’re actually forcing it to understand. Yes, it’s ‘autocomplete’—but you didn’t think through what it means to have a really good autocomplete.”

This got me thinking about what I do at work, for example in a research meeting. I spend a lot of time doing “glorified autocomplete” in the style of a well-trained chatbot: Someone describes some problem, I listen and it reminds me of a related issue I’ve thought about before, and I’m acting as a sort of FAQ, but more like a chatbot than a FAQ in that the people who are talking with me do not need to navigate through the FAQ to find the answer that is most relevant to them; I’m doing that myself and giving a response.

I do that sort of thing a lot in meetings, and it can work well, indeed often I think this sort of shallow, associative response can be more effective than whatever I’d get from a direct attack on the problem in question. After all, the people I’m talking with have already thought for awhile about whatever it is they’re working on, and my initial thoughts may well be in the wrong direction, or else my thoughts are in the right direction but are just retracing my collaborators’ past ideas. From the other direction, my shallow thoughts can be useful in representing insights from problems that these collaborators had not ever thought about much before. Nonspecific suggestions on multilevel modeling or statistical graphics or simulation or whatever can really help!

At some point, though, I’ll typically have to bite the bullet and think hard, not necessarily reaching full understanding in the sense of mentally embedding the problem at hand into a coherent schema or logical framework, but still going through whatever steps of logical reasoning that I can. This feels different than autocomplete; it requires an additional level of focus. Often I need to consciously “flip the switch,” as it were, to turn on that focus and think rigorously. Other times, I’m doing autocomplete and either come to a sticking point or encounter an interesting idea, and this causes me to stop and think.

It’s almost like the difference between jogging and running. I can jog and jog and jog, thinking about all sorts of things and not feeling like I’m expending much effort, my legs pretty much move up and down of their own accord . . . but then if I need to run, that takes concentration.

Here’s another example. Yesterday I participated in the methods colloquium in our political science department. It was Don Green and me and a bunch of students, and the structure was that Don asked me questions, I responded with various statistics-related and social-science-related musings and stories, students followed up with questions, I responded with more stories, etc. Kinda like the way things go here on the blog, but spoken rather than typed. Anyway, the point is that most of my responses were a sort of autocomplete—not in a word-by-word chatbot style, more at a larger level of chunkiness, for example something would remind me of a story, and then I’d just insert the story into my conversation—but still at this shallow, pleasant level. Mellow conversation with no intellectual or social strain. But then, every once in awhile, I’d pull up short and have some new thought, some juxtaposition that had never occurred to me before, and I’d need to think things through.

This also happens when I give prepared talks. My prepared talks are not super-well prepared—this is on purpose, as I find that too much preparation can inhibit flow. In any case, I’ll often find myself stopping and pausing to reconsider something or another. Even when describing something I’ve done before, there are times when I feel the need to think it all through logically, as if for the first time. I noticed something similar when I saw my sister give a talk once: she had the same habit of pausing to work things out from first principles. I don’t see this behavior in every academic talk, though; different people have different styles of presentation.

This seems related to models of associative and logical reasoning in psychology. As a complete non-expert in that area, I’ll turn to wikipedia:

The foundations of dual process theory likely come from William James. He believed that there were two different kinds of thinking: associative and true reasoning. . . . images and thoughts would come to mind of past experiences, providing ideas of comparison or abstractions. He claimed that associative knowledge was only from past experiences describing it as “only reproductive”. James believed that true reasoning could enable overcoming “unprecedented situations” . . .

That sounds about right!

After describing various other theories from the past hundred years or so, Wikipedia continues:

Daniel Kahneman provided further interpretation by differentiating the two styles of processing more, calling them intuition and reasoning in 2003. Intuition (or system 1), similar to associative reasoning, was determined to be fast and automatic, usually with strong emotional bonds included in the reasoning process. Kahneman said that this kind of reasoning was based on formed habits and very difficult to change or manipulate. Reasoning (or system 2) was slower and much more volatile, being subject to conscious judgments and attitudes.

This sounds a bit different from what I was talking about above. When I’m doing “glorified autocomplete” thinking, I’m still thinking—this isn’t automatic and barely conscious behavior along the lines of driving to work along a route I’ve taken a hundred times before—I’m just thinking in a shallow way, trying to “autocomplete” the answer. It’s pattern-matching more than it is logical reasoning.

P.S. Just to be clear, I have a lot of respect for Hinton’s work; indeed, Aki and I included Hinton’s work in our brief review of 10 pathbreaking research articles during the past 50 years of statistics and machine learning. Also, I’m not trying to make a hardcore, AI-can’t-think argument. Although not myself a user of large language models, I respect Bob Carpenter’s respect for them.

I think that where Hinton got things wrong in the quote that led off this post was not in his characterization of chatbots, but rather in his assumptions about human thinking, in not distinguishing autocomplete-like associative reasoning from logical thinking. Maybe Hinton’s problem in understanding this is that he’s just too logical! At work, I do a lot of what seems like autocomplete—and, as I wrote above, I think it’s useful—but if I had more discipline, maybe I’d think more logically and carefully all the time. It could well be that Hinton has that habit or inclination to always be in focus. If Hinton does not have consistent personal experience of shallow, autocomplete-like thinking, he might not recognize it as something different, in which case he could be giving the chatbot credit for something it’s not doing.

Come to think of it, one thing that impresses me about Bob is that, when he’s working, he seems to always be in focus. I’ll be in a meeting, just coasting along, and Bob will interrupt someone to ask for clarification, and I suddenly realize that Bob absolutely demands understanding. He seems to have no interest in participating in a research meeting in a shallow way. I guess we just have different styles. It’s my impression that the vast majority of researchers are like me, just coasting on the surface most of the time (for some people, all of the time!), while Bob, and maybe Geoff Hinton, is one of the exceptions.

P.P.S. Sometimes we really want to be doing shallow, auto-complete-style thinking. For example, if we’re writing a play and want to simulate how some characters might interact. Or just as a way of casting the intellectual net more widely. When I’m in a research meeting and I free-associate, it might not help immediately solve the problem at hand, but it can bring in connections that will be helpful later. So I’m not knocking auto-complete; I’m just disagreeing with Hinton’s statement that “by training something to be really good at predicting the next word, you’re actually forcing it to understand.” As a person who does a lot of useful associative reasoning and also a bit of logical understanding, I think they’re different, both in how they feel and also in what they do.

P.P.P.S. Lots more discussion in comments; you might want to start here.

P.P.P.P.S. One more thing . . . actually, it might deserve its own post, but for now I’ll put it here: So far, it might seem like I’m denigrating associative thinking, or “acting like a chatbot,” or whatever it might be called. Indeed, I admire Bob Carpenter for doing very little of this at work! The general idea is that acting like a chatbot can be useful—I really can help lots of people solve their problems in that way, also every day I can write these blog posts that entertain and inform tens of thousands of people—but it’s not quite the same as focused thinking.

That’s all true (or, I should say, that’s my strong impression), but there’s more to it than that. As discussed in my comment linked to just above, “acting like a chatbot” is not “autocomplete” at all, indeed in some ways it’s kind of the opposite. Locally it’s kind of like autocomplete in that the sentences flow smoothly; I’m not suddenly jumping to completely unrelated topics—but when I do this associative or chatbot-like writing or talking, it can lead to all sorts of interesting places. I shuffle the deck and new hands come up. That’s one of the joys of “acting like a chatbot” and one reason I’ve been doing it for decades, long before chatbots ever existed! Walk along forking paths, and who knows where you’ll turn up! And all of you blog commenters (ok, most of you) play helpful roles in moving these discussions along.

What happens when someone you know goes off the political deep end?

Speaking of political polarization . . .

Around this time every year we get these news articles of the form, “I’m dreading going home to Thanksgiving this year because of my uncle, who used to be a normal guy who spent his time playing with his kids, mowing the lawn, and watching sports on TV, but has become a Fox News zombie, muttering about baby drag shows and saying that Alex Jones was right about those school shootings being false-flag operations.”

This all sounds horrible but, hey, that’s just other people, right? OK, actually I did have an uncle who started out normal and got weirder and weirder, starting in the late 1970s with those buy-gold-because-the-world-is-coming-to-an-end newsletters and then getting worse from there, with different aspects of his life falling apart as his beliefs got more and more extreme. Back in 1999 he was convinced that the Y2K bug (remember that?) would destroy society. After January 1 came and nothing happened, we asked him if he wanted to reassess. His reply: the Y2K bug would indeed take civilization down, but it would be gradual, over a period of months. And, yeah, he’d always had issues, but it did get worse and worse.

Anyway, reading about poll results is one thing; having it happen to people you know is another. Recently a friend told me about another friend, someone I hadn’t seen in a while. Last I spoke with that guy, a few years back, he was pushing JFK conspiracy theories. I don’t believe any of these JFK conspiracy theories (please don’t get into that in the comments here; just read this book instead), but lots of people believe JFK conspiracy theories, indeed they’re not as wacky as the ever-popular UFOs-as-space-aliens thing. I didn’t think much about it; he was otherwise a normal guy. Anyway, the news was that in the meantime he’d become a full-bore, all-in vaccine denier.

What happened? I have no idea, as I never knew this guy that well. He was a friend, or I guess in recent years an acquaintance. I don’t really have a take on whether he was always unhinged, or maybe the JFK thing started him on a path that spiraled out of control, or maybe he just spent too much time on the internet.

I was kinda curious how he’d justify his positions, though, so I sent him an email:

I hope all is well with you. I saw about your political activities online. I was surprised to see you endorse the statement that the covid vaccine is “the biggest crime ever committed on humanity.” Can you explain how you think that a vaccine that’s saved hundreds of thousands of lives is more of a crime committed on humanity than, say, Hitler and Stalin starting WW2?

I had no idea how he’d respond to this, maybe he’d send me a bunch of Qanon links, the electronic equivalent of a manila folder full of mimeographed screeds. It’s not like I was expecting to have any useful discussion with him—once you start with the position that a vaccine is a worse crime than invading countries and starting a world war, there’s really no place to turn. He did not respond to me, which I guess is fine. What was striking to me was how he didn’t just take a provocative view that was not supported by the evidence (JFK conspiracy theories, election denial, O.J. is innocent, etc.); instead he staked out a position that was well beyond the edge of sanity, almost as if the commitment to extremism was part of the appeal. Kind of like the people who go with Alex Jones on the school shootings.

Anyway, this sort of thing is always sad, but especially when it happens to someone you know, and then it doesn’t help that there are lots of unscrupulous operators out there who will do their best to further unmoor these people from reality and take their money.

From a political science perspective, the natural questions are: (1) How does this all happen?, and (2) Is this all worse than before, or do modern modes of communication just make us more aware of these extreme attitudes? After all, back in the 1960s there were many prominent Americans with ridiculous extreme-right and extreme-left views, and they had a lot of followers too. The polarization of American institutions has allowed some of these extreme views to get more political prominence, so that the likes of Alex Jones and Al Sharpton can get treated with respect by the leaders of the major political parties. Political leaders always would accept the support of extremists—a vote’s a vote, after all—but I have the feeling that in the past they were more at arm’s length.

This post is not meant to be a careful study of these questions, indeed I’m sure there’s a big literature on the topic. What happened is that my friend told me about our other friend going off the deep end, and that all got me thinking, in the way that a personal connection can make a statistical phenomenon feel so much more real.

P.S. Related is this post from last year on Seth Roberts and political polarization. Unlike my friend discussed above, Seth never got sucked into conspiracy theories, but he had this dangerous mix of over-skepticism and over-credulity, and I could well imagine that he could’ve ended up in some delusional spaces.

Simulations of measurement error and the replication crisis: Update

Last week we ran a post, “Simulations of measurement error and the replication crisis: Maybe Loken and I have a mistake in our paper?”, reporting some questions that neuroscience student Federico D’Atri asked about a paper that Eric Loken and I wrote a few years ago. It’s one of my favorite papers so it was good to get feedback on it. D’Atri had run a simulation and had some questions, and in my post I shared some old code from the paper. Eric and I then looked into it. We discussed with D’Atri and here’s what we found:

1. No, Loken and I did not have a mistake in our paper. (More on that below.)

2. The code I posted on the blog was not the final code for our paper. Eric had made the final versions. From Eric, here’s:
(a) the final code that we used to make the figures in the paper, where we looked at regression slopes, and
(b) a cleaned version of the code I’d posted, where we looked at correlations.
The code I posted last week was something in my files, but it was not the final version of the code, hence the confusions about what was being conditioned on in the analysis.

Regarding the code, Eric reports:

All in all we get the same results whether it’s correlations or t-tests of slopes. At small samples, and for small effects, the majority of the stat sig cors/slopes/t-tests are larger in the error than the non error (when you compare them paired). The graph’s curve does pop up through 0.5 and higher. It’s a lot higher if r = 0.08, and it’s not above 50% if the r is 0.4. It does require a relatively small effect, but we also have .8 reliability.

3. Some interesting questions remain. Federico writes:

I don’t think there’s an error in the code used to produce the graphs in the paper; rather I personally find that certain sentences in the paper may lead to some misunderstandings. I also concur with the main point made in their paper, that large estimates obtained from a small sample in high-noise conditions should not be trusted and I believe they do a good job of delivering this message.

What Andrew and Eric show is the proportion of larger correlations achieved when noisy measurements are selected for statistical significance, compared to the estimate one would obtain in the same scenario without measurement error and without selecting for statistical significance. What I had initially thought was that there was an equal level of selection for statistical significance applied to both scenarios. They essentially show that under conditions of insufficient power to detect the true underlying effect, doing enough selection based on statistical significance, can produce an overestimation much higher than the attenuation caused by measurement error.

This seems quite intuitive to me, and I would like to clarify it with an example. Consider a true underlying correlation in ideal conditions of 0.15 and a sample size of N = 25, and the extreme scenario where measurement error is infinite (in which case the noisy x and y will be uncorrelated). In this case, the measurements of x and y under ideal conditions will be totally uncorrelated with those obtained under noisy conditions, hence the correlation estimates in the two different scenarios as well. If I select for significance the correlations obtained under noisy conditions, I am only looking at correlations greater than 0.38 (for α = 0.05, two-tailed test), which I’ll be comparing to an average correlation of 0.15, since the two estimates are completely unrelated. It is clear then that the first estimate will almost always be greater than the second. The greater the noise, the more uncorrelated the correlation estimates obtained in the two different scenarios become, making it less likely that obtaining a large estimate in one case would also result in a large estimate in the other case.

My criticism is not about the correctness of the code (which is correct as far as I can see), but rather how relevant this scenario is in representing a real situation. Indeed, I believe it is very likely that the same hypothetical researchers who made selections for statistical significance in ‘noisy’ measurement conditions would also select for significance in ideal measurement conditions, and in that case, they would obtain an even higher frequency of effect overestimation when selecting for statistical significance (once selecting for the direction of the true effect) as well as a greater ease in achieving statistically significant results.

However, I think it could be possible that in research environments where measurement error is greater (and isn’t modeled), there might be an incentive, or a greater co-occurrence, of selection for statistical significance and poor research practices. Without evidence of this, though, I find it more interesting to compare the two scenarios assuming similar selection criteria.

Also I’m aware that in situations deviating from the simple assumptions of the case we are considering here (simple correlation between x and y and uncorrelated measurement errors), complexities can arise. For example, as you probably know better than me, in multiple regression scenarios where two predictors, x1 and x2, are correlated and their measurement errors are also correlated (which can occur with certain types of measures, such as self-reporting where individuals prone to overestimating x1 may also tend to overestimate x2), and only x1 is correlated with y, there is an inflation of Type I error for x2 and asymptotically β2 is biased away from zero.

Eric adds:

Glad we resolved the initial confusion about our article’s main point and associated code. When you [Federico] first read our article, you were interested in different questions than the one we covered. It’s a rich topic, with lots of work to be done, and you seem to have several ideas. Our article addressed the situation where someone might acknowledge measurement error, but then say “my finding is all the more impressive because if not for the measurement error I would have found an even bigger effect.” We target the intuition that if a dataset could be made error free by waving a wand, that the data would necessarily show a larger correlation. Of course the “iron law” (attenuation) holds in large samples. Unsurprisingly, however, in smaller samples, data with measurement error can have a larger realized correlation. And after conditioning on the statistical significance of the observed correlations, a majority of them could be larger than the corresponding error free correlation. We treated the error free effect (the “ideal study”) as the counterfactual (“if only I had no error in my measurements”), and thus filtered on the statistical significance of the observed error prone correlations. When you tried to reproduce that graph, you applied the filter differently, but you now find that what we did was appropriate for the question we were answering.

By the way, we deliberately kept the error modest. In our scenario, the x and y values have about 0.8 reliability—widely considered excellent measurement. I agree that if the error grows wildly, as with your hypothetical case, then the observed values are essentially uncorrelated with the thing being measured. Our example though was pretty realistic—small true effect, modest measurement error, range of sample sizes. I can see though that there are many factors to explore.
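To make the scenario Eric describes concrete, here’s a stripped-down simulation in that spirit (my sketch, not the code linked above; the parameter values are made up but chosen to roughly match the setup: small true correlation, small sample, reliability around 0.8):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

rho, n, reps = 0.15, 50, 20_000    # small true effect, small sample
sd_noise = 0.5                     # noise variance 0.25 -> reliability 1/1.25 = 0.8

bigger_with_error = []
for _ in range(reps):
    # "Ideal" measurements: x and y correlated at rho
    x, y = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
    # The same data observed with measurement error
    xo = x + sd_noise * rng.standard_normal(n)
    yo = y + sd_noise * rng.standard_normal(n)
    r_ideal = np.corrcoef(x, y)[0, 1]
    r_obs = np.corrcoef(xo, yo)[0, 1]
    # Two-sided p-value for the observed (noisy) correlation
    t = r_obs * np.sqrt((n - 2) / (1 - r_obs**2))
    p_obs = 2 * stats.t.sf(abs(t), df=n - 2)
    if p_obs < 0.05:                           # filter on the noisy estimate only
        bigger_with_error.append(abs(r_obs) > abs(r_ideal))

print(f"kept {len(bigger_with_error)} of {reps} replications")
print(f"P(|observed r| > |ideal r| given observed r significant): "
      f"{np.mean(bigger_with_error):.2f}")
# With a small sample and a small true effect this proportion tends to come out
# above 0.5; it drops as n or the true correlation grows.
```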

Different questions are of interest in different settings. One complication is that, when researchers say things like, “Despite limited statistical power . . .” they’re typically not recognizing that they have been selecting on statistical significance. In that way, they are comparing to the ideal setting with no selection.

And, for reasons discussed in my original paper with Eric, researchers often don’t seem to think about measurement error at all! because they have the (wrong) impression that having a “statistically significant” result gives retroactive assurance that their signal-to-noise ratio is high.

That’s what got us so frustrated to start with: not just that noisy studies get published all the time, but that many researchers seem to not even realize that noise can be a problem. Lots and lots of correspondence with researchers who seem to feel that if they’ve found a correlation between X and Y, where X is some super-noisy measurement with some connection to theoretical concept A, and Y is some super-noisy measurement with some connection to theoretical concept B, that they’ve proved that A causes B, or that they’ve discovered some general connection between A and B.

So, yeah, we encourage further research in this area.
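Along those lines, the multiple-regression complication Federico mentions above (correlated predictors with correlated measurement errors, where only one predictor truly affects y) is easy to poke at in simulation. Here’s a hedged sketch with made-up parameter values, checking the rejection rate for the coefficient that should be null:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def one_pvalue(n=200, rho_x=0.5, rho_e=0.5, sd_e=0.5, beta1=0.5):
    # True predictors x1, x2 are correlated; y depends only on x1
    x1, x2 = rng.multivariate_normal([0, 0], [[1, rho_x], [rho_x, 1]], size=n).T
    y = beta1 * x1 + rng.standard_normal(n)
    # Observed versions with correlated measurement errors
    e1, e2 = rng.multivariate_normal([0, 0], [[1, rho_e], [rho_e, 1]], size=n).T
    x1o, x2o = x1 + sd_e * e1, x2 + sd_e * e2
    # OLS of y on (1, x1o, x2o); return the two-sided p-value for x2o
    X = np.column_stack([np.ones(n), x1o, x2o])
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / (n - 3)
    se2 = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[2, 2])
    return 2 * stats.t.sf(abs(beta_hat[2] / se2), df=n - 3)

pvals = np.array([one_pvalue() for _ in range(2000)])
print(f"rejection rate for x2 at alpha = 0.05: {np.mean(pvals < 0.05):.3f}")
# x2 has no true effect on y, so the nominal rate would be 0.05; unmodeled
# measurement error in x1, with correlated predictors and errors, pushes the
# rate above that.
```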

Hey! Here are some amazing articles by George Box from around 1990. Also there’s some mysterious controversy regarding his center at the University of Wisconsin.

The webpage is maintained by John Hunter, son of Box’s collaborator William Hunter, and I came across it because I was searching for background on the paper-helicopter example that we use in our classes to teach principles of experimental design and data analysis.

There’s a lot to say about the helicopter example and I’ll save that for another post.

Here I just want to talk about how much I enjoyed reading these thirty-year-old Box articles.

A Box Set from 1990

Many of the themes in those articles continue to resonate today. For example:

• The process of learning. Here’s Box from his 1995 article, “Total Quality: Its Origins and its Future”:

Scientific method accelerated that process in at least three ways:

1. By experience in the deduction of the logical consequences of the group of facts each of which was individually known but had not previously been brought together.

2. By the passive observation of systems already in operation and the analysis of data coming from such systems.

3. By experimentation – the deliberate staging of artificial experiences which often might ordinarily never occur.

A misconception is that discovery is a “one shot” affair. This idea dies hard. . . .

• Variation over time. Here’s Box from his 1989 article, “Must We Randomize Our Experiment?”:

We all live in a non-stationary world; a world in which external factors never stay still. Indeed the idea of stationarity – of a stable world in which, without our intervention, things stay put over time – is a purely conceptual one. The concept of stationarity is useful only as a background against which the real non-stationary world can be judged. For example, the manufacture of parts is an operation involving machines and people. But the parts of a machine are not fixed entities. They are wearing out, changing their dimensions, and losing their adjustment. The behavior of the people who run the machines is not fixed either. A single operator forgets things over time and alters what he does. When a number of operators are involved, the opportunities for change because of failures to communicate are further multiplied. Thus, if left to itself any process will drift away from its initial state. . . .

Stationarity, and hence the uniformity of everything depending on it, is an unnatural state that requires a great deal of effort to achieve. That is why good quality control takes so much effort and is so important. All of this is true, not only for manufacturing processes, but for any operation that we would like to be done consistently, such as the taking of blood pressures in a hospital or the performing of chemical analyses in a laboratory. Having found the best way to do it, we would like it to be done that way consistently, but experience shows that very careful planning, checking, recalibration and sometimes appropriate intervention, is needed to ensure that this happens.

Here’s an example, from Box’s 1992 article, “How to Get Lucky”:

For illustration Figure 1(a) shows a set of data designed to seek out the source of unacceptably large variability which, it was suspected, might be due to small differences in five, supposedly identical, heads on a machine. To test this idea, the engineer arranged that material from each of the five heads was sampled at roughly equal intervals of time in each of six successive eight-hour periods. . . . the same analysis strongly suggested that real differences in means occurred between the six eight-hour periods of time during which the experiment was conducted. . . .

• Workflow. Here’s Box from his 1999 article, “Statistics as a Catalyst to Learning by Scientific Method Part II-Discussion”:

Most of the principles of design originally developed for agricultural experimentation would be of great value in industry, but most industrial experimentation differed from agricultural experimentation in two major respects. These I will call immediacy and sequentiality.

What I mean by immediacy is that for most of our investigations the results were available, if not within hours, then certainly within days and in rare cases, even within minutes. This was true whether the investigation was conducted in a laboratory, a pilot plant or on the full scale. Furthermore, because the experimental runs were usually made in sequence, the information obtained from each run, or small group of runs, was known and could be acted upon quickly and used to plan the next set of runs. I concluded that the chief quarrel that our experimenters had with using “statistics” was that they thought it would mean giving up the enormous advantages offered by immediacy and sequentiality. Quite rightly, they were not prepared to make these sacrifices. The need was to find ways of using statistics to catalyze a process of investigation that was not static, but dynamic.

There’s lots more. It’s funny to read these things that Box wrote back then, that I and others have been saying over and over again in various informal contexts, decades later. It’s a problem with our statistical education (including my own textbooks) that these important ideas are buried.

More Box

A bunch of articles by Box, with some overlap but not complete overlap with the above collection, is at the site of the University of Wisconsin, where he worked for many years. Enjoy.

Some kinda feud is going on

John Hunter’s page also has this:

The Center for Quality and Productivity Improvement was created by George Box and Bill Hunter at the University of Wisconsin-Madison in 1985.

In the first few years reports were published by leading international experts including: W. Edwards Deming, Kaoru Ishikawa, Peter Scholtes, Brian Joiner, William Hunter and George Box. William Hunter died in 1986. Subsequently excellent reports continued to be published by George Box and others including: Gipsie Ranney, Soren Bisgaard, Ron Snee and Bill Hill.

These reports were all available on the Center’s web site. After George Box’s death the reports were removed. . . .

It is a sad situation that the Center abandoned the ideas of George Box and Bill Hunter. I take what has been done to the Center as a personal insult to their memory. . . .

When diagnosed with cancer my father dedicated his remaining time to creating this center with George to promote the ideas George and he had worked on throughout their lives: because it was that important to him to do what he could. They did great work and their work provided great benefits for long after Dad’s death with the leadership of Bill Hill and Soren Bisgaard but then it deteriorated. And when George died the last restraint was eliminated and the deterioration was complete.

Wow. I wonder what the story was. I asked someone I know who works at the University of Wisconsin and he had no idea. Box died in 2013 so it’s not so long ago; there must be some people who know what happened here.

“You need 16 times the sample size to estimate an interaction than to estimate a main effect,” explained

This has come up before here, and it’s also in Section 16.4 of Regression and Other Stories (chapter 16: “Design and sample size decisions,” Section 16.4: “Interactions are harder to estimate than main effects”). But there was still some confusion about the point so I thought I’d try explaining it in a different way.

The basic reasoning

The “16” comes from the following four statements:

1. When estimating a main effect and an interaction from balanced data using simple averages (which is equivalent to least squares regression), the estimate of the interaction has twice the standard error as the estimate of a main effect.

2. It’s reasonable to suppose that an interaction will have half the magnitude of a main effect.

3. From 1 and 2 above, we can suppose that the true effect size divided by the standard error is 4 times higher for the interaction than for the main effect.

4. To achieve any desired level of statistical power for the interaction, you will need 4^2 = 16 times the sample size that you would need to attain that level of power for the main effect.

Statements 3 and 4 are unobjectionable. They somewhat limit the implications of the “16” statement, which does not in general apply to Bayesian or regularized estimates, nor does it consider goals other than statistical power (equivalently, the goal of estimating an effect to a desired relative precision). I don’t consider these limitations a problem; rather, I interpret the “16” statement as relevant to that particular set of questions, in the way that the application of any mathematical statement is conditional on the relevance of the framework under which it can be proved.

Statements 1 and 2 are a bit more subtle. Statement 1 depends on what is considered a “main effect,” and statement 2 is very clearly an assumption regarding the applied context of the problem being studied.

First, statement 1. Here’s the math for why the estimate of the interaction has twice the standard error of the estimate of the main effect. The scenario is an experiment with N people, of which half get treatment 1 and half get treatment 0, so that the estimated main effect is ybar_1 – ybar_0, comparing average under treatment and control. We further suppose the population is equally divided between two sorts of people, a and b, and half the people in each group get each treatment. Then the estimated interaction is (ybar_1a – ybar_0a) – (ybar_1b – ybar_0b).

The estimate of the main effect, ybar_1 – ybar_0, has standard error sqrt(sigma^2/(N/2) + sigma^2/(N/2)) = 2*sigma/sqrt(N); for simplicity I’m assuming a constant variance within groups, which will typically be a good approximation for binary data, for example. The estimate of the interaction, (ybar_1a – ybar_0a) – (ybar_1b – ybar_0b), has standard error sqrt(sigma^2/(N/4) + sigma^2/(N/4) + sigma^2/(N/4) + sigma^2/(N/4)) = 4*sigma/sqrt(N). I’m assuming that the within-cell standard deviation does not change after we’ve divided the population into 4 cells rather than 2; this is not exactly correct—to the extent that the effects are nonzero, we should expect the within-cell standard deviations to get smaller as we subdivide—again, however, it is common in applications for the within-cell standard deviation to be essentially unchanged after adding the interaction. This is equivalent to saying that you can add an important predictor without the R-squared going up much, and it’s the usual story in research areas such as psychology, public opinion, and medicine where individual outcomes are highly variable and so we look for effects in averages.
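
To make this concrete, here is a small simulation sketch (my own illustration, not code from the original post) checking that in a balanced design the interaction estimate has about twice the standard error of the main-effect estimate:

# Simulation check: SE of the interaction vs. SE of the main effect in a balanced design.
# True effects are set to zero; the standard errors don't depend on that choice here.
set.seed(123)
N <- 400        # total sample size
sigma <- 1      # within-cell standard deviation
n_sims <- 10000
main_est <- interaction_est <- rep(NA, n_sims)
for (s in 1:n_sims) {
  group <- rep(c("a", "b"), each = N/2)
  treat <- rep(c(0, 1), times = N/2)          # balanced: N/4 people per cell
  y <- rnorm(N, 0, sigma)
  ybar <- function(g, t) mean(y[group == g & treat == t])
  main_est[s] <- mean(y[treat == 1]) - mean(y[treat == 0])
  interaction_est[s] <- (ybar("a", 1) - ybar("a", 0)) - (ybar("b", 1) - ybar("b", 0))
}
c(se_main = sd(main_est),                 # close to 2*sigma/sqrt(N) = 0.10
  se_interaction = sd(interaction_est))   # close to 4*sigma/sqrt(N) = 0.20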

The biggest challenge with the reasoning in the above two paragraphs is not the bit about sigma being smaller when the cells are subdivided—this is typically a minor concern, and it’s easy enough to account for if necessary—, nor is it the definition of interaction. Rather, the challenge comes, perhaps surprisingly, from the definition of main effect.

Above I define the “main effect” as the average treatment effect in the population, which seems reasonable enough. There is an alternative, though. You could also define the main effect as the average treatment effect in the baseline category. In the notation above, the main effect would then be defined as ybar_1a – ybar_0a. In that case, the standard error of the estimated interaction is only sqrt(2) times the standard error of the estimated main effect, rather than the factor of 2 above.

Typically I’ll frame the main effect as the average effect in the population, but there are some settings where I’d frame it as the average effect in the baseline category. It depends on how you’re planning to extrapolate the inferences from your model. The important thing is to be clear in your definition.

Now on to statement 2. I’m supposing an interaction that is half the magnitude of the main effect. For example, if the main effect is 20 and the interaction is 10, that corresponds to an effect of 25 in group a and 15 in group b. To me, that’s a reasonable baseline: the treatment effect is not constant but it’s pretty stable, which is kinda what I think about when I hear “main effect.”

But there are other possibilities. Suppose that the effect is 30 in group a and 10 in group b, so the effect is consistently positive but now varies by a factor of 3 across the two conditions. In this case, the main effect is 20 and the interaction is 20. The main effect and the interaction are of equal size, and so you only need 4 times the sample size to estimate the interaction as to estimate the main effect.

Or suppose the effect is 40 in group a and 0 in group b. Then the main effect is 20 and the interaction is 40, and in that case you need the same sample size to estimate the main effect as to estimate the interaction. This can happen! In such a scenario, I don’t know that I’d be particularly interested in the “main effect”—I think I’d frame the problem in terms of effect in group a and effect in group b, without any particular desire to average over them. It will depend on context.

Why this is important

Before going on, let me copy something from our earlier post explaining the importance of this result: From the statement of the problem, we’ve assumed the interaction is half the size of the main effect. If the main effect is 2.8 on some scale with a standard error of 1 (and thus can be estimated with 80% power; see for example page 295 of Regression and Other Stories, where we explain why, for 80% power, the true value of the parameter must be 2.8 standard errors away from the comparison point), and the interaction is 1.4 with a standard error of 2, then the z-score of the interaction has a mean of 0.7 and a sd of 1, and the probability of seeing a statistically significant effect difference is pnorm(0.7, 1.96, 1) = 0.10. That’s right: if you have 80% power to estimate the main effect, you have 10% power to estimate the interaction.
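
As a quick check in R (this is just the arithmetic from the paragraph above, nothing new):

# Power for the main effect: the z-score has mean 2.8 and sd 1
1 - pnorm(1.96, mean = 2.8, sd = 1)   # approximately 0.80
# Power for the interaction: the z-score has mean 0.7 and sd 1
1 - pnorm(1.96, mean = 0.7, sd = 1)   # approximately 0.10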

And 10% power is really bad. It’s worse than it looks. 10% power kinda looks like it might be OK; after all, it still represents a 10% chance of a win. But that’s not right at all: if you do get “statistical significance” in that case, your estimate is a huge overestimate:

> raw <- rnorm(1e6, .7, 1)    # z-scores of the interaction estimate: mean 0.7, sd 1
> significant <- raw > 1.96   # select the "statistically significant" results
> mean(raw[significant])      # average estimate among the significant results
[1] 2.4

So, the 10% of results which do appear to be statistically significant give an estimate of 2.4, on average, which is over 3 times higher than the true effect.

So, yeah, you don’t want to be doing studies with 10% power, which implies that when you’re estimating that interaction, you have to forget about statistical significance; you need to just accept the uncertainty.

Explaining using a 2 x 2 table

Now to return to the main-effects-and-interactions thing:

One way to look at all this is by framing the population as a 2 x 2 table, showing the averages among control and treated conditions within groups a and b:

           Control  Treated  
Group a:  
Group b:  

For example, here’s an example where the treatment has a main effect of 20 and an interaction of 10:

           Control  Treated  
Group a:     100      115
Group b:     150      175

In this case, there’s a big “group effect,” not necessarily causal (I had vaguely in mind a setting where “Group” is an observational factor and “Treatment” is an experimental factor), but still a “main effect” in the sense of a linear model. Here, the main effect of group is 55. For the issues we’re discussing here, the group effect doesn’t really matter, but we need to specify something here in order to fill in the table.

If you’d prefer, you can set up a “null” setting where the two groups are identical, on average, under the control condition:

           Control  Treated  
Group a:     100      115
Group b:     100      125

Again, each of the numbers in these tables represents the population average within the four cells, and “effects” and “interactions” correspond to various averages and differences of the four numbers. We’re further assuming a balanced design with equal sample sizes and equal variances within each cell.

What would it look like if the interaction were twice the size of the main effect, for example a main effect of 20 and an interaction of 40? Here’s one possibility of the averages within each cell:

           Control  Treated  
Group a:     100      100
Group b:     100      140

If that’s what the world is like, then indeed you need exactly the same sample size (that is, the total sample size in the four cells) to estimate the interaction as to estimate the main effect.
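
If it helps to see the arithmetic spelled out, here is a little R bookkeeping (mine, just for illustration) that computes the main effect and interaction from the cell means of the three tables above:

# Cell means for each example table: rows are groups a and b, columns are control and treated
make_table <- function(a_control, a_treated, b_control, b_treated) {
  matrix(c(a_control, a_treated, b_control, b_treated), nrow = 2, byrow = TRUE,
         dimnames = list(c("a", "b"), c("control", "treated")))
}
effects <- function(cells) {
  effect_a <- cells["a", "treated"] - cells["a", "control"]
  effect_b <- cells["b", "treated"] - cells["b", "control"]
  # "interaction" here is the difference in treatment effects between the groups;
  # the sign depends on which group you subtract from which
  c(main_effect = mean(c(effect_a, effect_b)), interaction = effect_b - effect_a)
}
rbind(table_1 = effects(make_table(100, 115, 150, 175)),   # main effect 20, interaction 10
      table_2 = effects(make_table(100, 115, 100, 125)),   # main effect 20, interaction 10
      table_3 = effects(make_table(100, 100, 100, 140)))   # main effect 20, interaction 40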

When using regression with interactions

To reproduce the above results using linear regression, you’ll want to code the Group and Treatment variables on a {-0.5, 0.5} scale. That is, Group = -0.5 for a and +0.5 for b, and Treatment = -0.5 for control and +0.5 for treatment. That way, the main effect of each variable corresponds to the other variable equaling zero (thus, the average of a balanced population), and the interaction corresponds to the difference of treatment effects, comparing the two groups.

Alternatively we could code each variable on a {-1, 1} scale, in which case the main effects are divided by 2 and the interaction is divided by 4, but the standard errors are also divided in the same way, so the z-scores don’t change, and you still need the same multiple of the sample size to estimate the interaction as to estimate the main effect.

Or we could code each variable as {0, 1}, in which case, as discussed above, the main effect for each predictor is then defined as the effect of that predictor when the other predictor equals 0.
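
Here is a quick sketch of what those codings look like in R, simulating data from the cell means in the first table above (again, my own illustration rather than anything from the original post):

# Simulate a balanced dataset with the cell means from the first table (100, 115, 150, 175)
set.seed(1)
n_per_cell <- 1000
cells <- expand.grid(group = c("a", "b"), treat = c("control", "treated"))
cell_means <- c(100, 150, 115, 175)   # a/control, b/control, a/treated, b/treated
dat <- cells[rep(1:4, each = n_per_cell), ]
dat$y <- rnorm(nrow(dat), mean = rep(cell_means, each = n_per_cell), sd = 10)

# {-0.5, 0.5} coding: the treatment coefficient is the average treatment effect (about 20)
# and the interaction coefficient is the difference in effects between groups (about 10)
dat$g <- ifelse(dat$group == "b", 0.5, -0.5)
dat$t <- ifelse(dat$treat == "treated", 0.5, -0.5)
coef(lm(y ~ g * t, data = dat))

# {0, 1} coding: the treatment coefficient is now the effect in baseline group a (about 15)
dat$g01 <- ifelse(dat$group == "b", 1, 0)
dat$t01 <- ifelse(dat$treat == "treated", 1, 0)
coef(lm(y ~ g01 * t01, data = dat))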

Why do I make the default assumptions that I do in the above analyses?

The scenario I have in mind is studies in psychology or medicine where a and b are two groups of the population, for example women and men, or young and old people, and researchers start with a general idea, a “main effect,” but there is also interest in how these effects vary, that is, “interactions.” In my scenario, neither a nor b is a baseline, and so it makes sense to think of the main effect as some sort of average (which, as discussed here, can take many forms).

In the world of junk science, interactions represent a way out, a set of forking paths that allow researchers to declare a win in settings where their main effect does not pan out. Three examples we’ve discussed to death in this space are the claim of an effect of fat arms on men’s political attitudes (after interacting with parental SES), an effect of monthly cycle on women’s political attitudes (after interacting with partnership status), and an effect of monthly cycle on women’s clothing choices (after interacting with weather). In all these examples, the main effect was the big story and the interaction was the escape valve. The point of “You need 16 times the sample size to estimate an interaction than to estimate a main effect” is not to say that researchers shouldn’t look for interactions or that they should assume interactions are zero; rather, the point is that they should not be looking for statistically-significant interactions, given that their studies are, at best, barely powered to estimate main effects. Thinking about interactions is all about uncertainty.

In more solid science, interactions also come up: there are good reasons to think that certain treatments will be more effective on some people and in some scenarios. Again, though, in a setting where you’re thinking of interactions as variations on a theme of the main effect, your inferences for interactions will be highly uncertain, and the “16” advice should be helpful both in design and analysis.

Summary

In a balanced experiment, when the treatment effect is 15 in Group a and 25 in Group b (that is, the main effect is twice the size of the interaction), the estimate of the interaction will have twice the standard error as the estimate of the main effect, and so you’d need a sample size of 16*N to estimate the interaction at the same relative precision as you can estimate the main effect from the same design but with a sample size of N.

With other scenarios of effect sizes, the result is different. If the treatment effect is 10 in Group a and 30 in Group b, you’d need 4 times the sample size to estimate the interaction as to estimate the main effect. If the treatment effect is 0 in group a and 40 in Group b, you’d need equal sample sizes.

The problem with p-values is how they’re used

The above-titled article is from 2014. Key passage:

Hypothesis testing and p-values are so compelling in that they fit in so well with the Popperian model in which science advances via refutation of hypotheses. For both theoretical and practical reasons I am supportive of a (modified) Popperian philosophy of science in which models are advanced and then refuted. But a necessary part of falsificationism is that the models being rejected are worthy of consideration. If a group of researchers in some scientific field develops an interesting scientific model with predictive power, then I think it very appropriate to use this model for inference and to check it rigorously, eventually abandoning it and replacing it with something better if it fails to make accurate predictions in a definitive series of experiments. This is the form of hypothesis testing and falsification that is valuable to me. In common practice, however, the “null hypothesis” is a straw man that exists only to be rejected. In this case, I am typically much more interested in the size of the effect, its persistence, and how it varies across different situations. I would like to reserve hypothesis testing for the exploration of serious hypotheses and not as an indirect form of statistical inference that typically has the effect of reducing scientific explorations to yes/no conclusions.

The logical followup is that article I wrote the other day, Before data analysis: Additional recommendations for designing experiments to learn about the world.

But the real reason I’m bringing up this old paper is to link to this fun discussion revolving around how the article never appeared in the journal that invited it, because I found out they wanted to charge me $300 to publish it, and I preferred to just post it for free. (OK, not completely free; it does cost something to maintain these sites, but the cost is orders of magnitude less than $300 for 115 kilobytes of content.)

On a really bad paper on birth month and autism (and why there’s value in taking a look at a clear case of bad research, even if it’s obscure and from many years ago)

In an otherwise unrelated thread on Brutus vs. Mo Willems, an anonymous commenter wrote:

Researchers found that the risk of autism in twins depended on the month they were born in, with January being 80% riskier than December.

The link is from a 2005 article in the fun magazine New Scientist, “Autism: Lots of clues, but still no answers,” which begins:

The risk of autism in twins appears to be related to the month they are born in. The chance of both babies having the disorder is 80 per cent higher for January births than December births.

This was one of the many findings presented at the conference in Boston last week. It typifies the problems with many autism studies: the numbers are too small to be definitive – this one was based on just 161 multiple-birth babies – and even if the finding does stand up, it raises many more questions than it answers.

The article has an excellently skeptical title and lead-off, so I was curious what’s up with the author, Celeste Biever. A quick search shows that she’s currently Chief News and Features editor at Nature, so still in the science writing biz. That’s good!

The above link doesn’t give the full article but I was able to read the whole thing through the Columbia University library. The relevant part is that one of the authors of the birth-month study was Craig Newschaffer of the Johns Hopkins School of Public Health. I searched for *Craig Newschaffer autism birth month* on Google Scholar and found an article, “Variation in season of birth in singleton and multiple births concordant for autism spectrum disorders,” by L. C. Lee, C. J. Newschaffer, et al., published in 2008 in Paediatric and Perinatal Epidemiology.

I suppose that, between predatory journals and auto-writing tools such as Galactica, the scientific literature will be a complete mess in a few years, but for now we can still find papers from 2008 and be assured that they’re the real thing.

The searchable online version only gave the abstract and references, but again I could find the full article through the Columbia library. And I can report to you that the claim that the “chance of both babies having the disorder is 80 per cent higher for January births than December births,” is not supported by the data.

Let’s take a look. From the abstract:

This study aimed to determine whether the birth date distribution for individuals with autism spectrum disorders (ASD), including singletons and multiple births, differed from the general population. Two ASD case groups were studied: 907 singletons and 161 multiple births concordant for ASD.

161 multiple births . . . that’s about 13 per month, sounds basically impossible for there to be any real evidence of different frequencies comparing December to January. But let’s see what the data say.

From the article:

Although a pattern of birth seasonality in autism was first reported in the early 1980s, the findings have been inconsistent. The first study to examine autism births by month was conducted by Bartlik more than two decades ago. That study compared the birth month of 810 children diagnosed with autism with general births and reported that autism births were higher than expected in March and August; the effect was more pronounced in more severe cases. A later report analysed data from the Israeli national autism registry which had information on 188 individuals diagnosed with autistic disorder. It, too, demonstrated excess births in March and August. Some studies, however, found excess autism births in March only.

March and August, huh? Sounds like noise mining to me.

Anyway, that’s just the literature. Now on to the data. First they show cases by day:

Ok, that was silly, no real reason to have displayed it at all. Then they have graphs by month. They use some sort of smoothing technique called Empiric Mode Decomposition, whatever. Anyway, here’s what they’ve got, first for autistic singleton births and then for autistic twins:

Looks completely random to me. The article states:

In contrast to the trend of the singleton controls, which were relatively flat throughout the year, increases in the spring (April), the summer (late July) and the autumn (October) were found in the singleton ASD births (Fig. 2). Trends were also observed in the ASD concordant multiple births with peaks in the spring (March), early summer (June) and autumn (October). These trends were not seen in the multiple birth controls. Both ASD case distributions in Figs. 2 and 3 indicated a ‘valley’ during December and January. Results of the non-parametric time-series analyses suggested there were multiple peaks and troughs whose borders were not clearly bound by month.

C’mon. Are you kidding me??? Then this:

Caution should be used in interpreting the trend for multiple concordant births in these analyses because of the sparse available data.

Ya think?

Why don’t they cut out the middleman and just write up a bunch of die rolls.

Then this:

Figures 4 and 5 present relative risk estimates from Poisson regression after adjusting for cohort effects. Relative risk for multiple ASD concordant males was 87% less in December than in January with 95% CIs from 2% to 100%. In addition, excess ASD concordant multiple male births were indicated in March, May and September, although they were borderline for statistical significance.

Here are the actual graphs:

No shocker that if you look at 48 different comparisons, you’ll find something somewhere that’s statistically significant at the 5% level and a couple more items that are “borderline for statistical significance.”
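
To put a rough number on that (a back-of-the-envelope calculation of mine, treating the comparisons as independent, which they are not exactly):

# Chance of at least one nominally "significant" result among 48 null comparisons
1 - 0.95^48   # roughly 0.9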

This is one of these studies that (a) shows nothing, and (b) never had a chance. Unfortunately, statistics education and practice is focused on data analysis and statistical significance, not so much on design. This is just a ridiculously extreme case of noise mining.

In addition, I came across an article, The Epidemiology of Autism Spectrum Disorders, by Newschaffer et al. published in the Annual Review of Public Health in 2007 that doesn’t mention birth month at all. So, somewhere between 2005 and 2007, it seems that Newschaffer decided that, whatever birth-month effects were out there weren’t important enough to include in a 20-page review article. Then a year later they published a paper with all sorts of bold claims. Does not make a lot of sense to me.

Shooting a rabbit with a cannon?

Ok, this is getting ridiculous, you might say. Here we are picking to death an obscure paper from 15 years ago, an article we only heard about because it was indirectly referred to in a news article from 2005 that someone mentioned in a blog comment.

Is this the scientific equivalent to searching for offensive quotes on twitter and then getting offended? Am I just being mean to go through the flaws of this paper from the archives?

I don’t think so. I think there’s a value to this post, and I say it for two reasons.

1. Autism is important! There’s a reason why the government funds a lot of research on the topic. From the above-linked paper:

The authors gratefully acknowledge the following people and institutions for their resources and support on this manuscript:
1 The Autism Genetic Resource Exchange (AGRE) Consortium. AGRE is a programme of Cure Autism Now and is supported, in part, by Grant MH64547 from the National Institute of Mental Health to Daniel H. Geschwind.
2 Robert Hayman, PhD and Isabelle Horon, DrPH at the Maryland Department of Health and Mental Hygiene Vital Statistics Administration for making Maryland State aggregated birth data available for this analysis.
3 Rebecca A. Harrington, MPH, for editorial and graphic support.
Drs Lee and Newschaffer were supported by Centers for Disease Control and Prevention cooperative agreement U10/CCU320408-05, and Dr. Zimmerman and Ms. Shah were supported by Cure Autism Now and by Dr Barry and Mrs Renee Gordon. A preliminary version of this report was presented in part at the International Meeting for Autism Research, Boston, MA, May 2005.

This brings us to two points:

1a. All this tax money spent on a hopeless study of monthly variation in a tiny dataset is money that wasn’t spent on more serious research into autism or for that matter on direct services of some sort. Again, the problem with this study is not just that the data are indistinguishable from pure noise. The problem is that, even before starting the study, a competent analysis would’ve found that there was not enough data here to learn anything useful.

1b. Setting funding aside, attention given to this sort of study (for example, in that 2005 meeting and in the New Scientist article) is attention not being given to more serious research on the topic. To the extent that we are concerned about autism, we should be concerned about this diversion of attentional resources. At best, other researchers will just ignore this sort of pure-noise study; at worst, other researchers will take it seriously and waste more resources following it up in various ways.

Now, let me clarify that I’m not saying the authors who did this paper are bad people or that they were intending to waste government money and researchers’ attention. I can only assume they were 100% sincere and just working in a noise-mining statistical paradigm. This was 2005, remember, before “p-hacking,” “researcher degrees of freedom,” and “garden of forking paths” became commonly understood concepts in the scientific community. They didn’t know any better! They were just doing what they were trained to do: gather data, make comparisons, highlight “statistical significance” and “borderline statistical significance,” and tell stories. That’s what quantitative research was!

And that brings us to our final point:

2. That noise-mining paradigm is still what a lot of science and social science looks like. See here, for example. We’re talking about sincere, well-meaning researchers, plugged into the scientific literature and, unfortunately, pulling patterns out of what is essentially pure noise. Some of this work gets published in top journals, some of it gets adoring press treatment, some of it wins academic awards. We’re still there!

For that reason, I think there’s value in taking a look at a clear case of bad research. Not everything’s a judgment call. Some analyses are clearly valueless. Another example is the series of papers by that sex-ratio researcher, all of which are a mixture of speculative theory and pure noise mining, and all of which would be stronger without the distraction of data. Again, they’d be better off just reporting some die rolls; at least then the lack of relevant information content would be clearer.

P.S. One more time: I’m not saying the authors of these papers are bad people. They were just doing what they were trained to do. It’s our job as statistics teachers to change that training; it’s also the job of the scientific community not to reward noise-mining—even inadvertent noise-mining—as a career track.

Simulations of measurement error and the replication crisis: Maybe Loken and I have a mistake in our paper?

Federico D’Atri writes:

I am a PhD student in Neuroscience and Cognitive Science at the University of Trieste.

As an exercise, I was trying to replicate the simulation results in Measurement error and the replication crisis, however, what remains unclear to me is how introducing measurement error, even when selecting for statistical significance, can lead to higher estimates compared to ideal conditions in over 50% of cases.

Imagine the case of two variables, x and y, with a certain degree of true correlation. Introducing measurement noise to y will produce a new variable, y’, which has a diminished true correlation with x. The distribution of the correlation coefficient calculated on a sample is known and depends both on the sample size and the true correlation. If we have a reduced true correlation (due to noise), even when selecting for statistical significance (and hence truncating the distribution of sample correlation coefficients), shouldn’t we find that the correlation in the noise-free case is higher in the majority of the cases?

In your article, there are three graphs. I’ve managed to reproduce the first and second graphs, and I understand that increasing the sample size decreases the proportion of studies where the absolute effect size is greater when noise is added. In the article’s second graph, however, it seems that even for small sample sizes, the majority of the time the effect is larger in the “ideal study” scenario when selecting for statistical significance. The third graph, while correctly representing the monotonic decreasing trend of the proportion, seems to contradict the second graph regarding small samples. Even though the effect might be larger, I don’t think that introducing noise would result in an effect size estimate that’s larger than without noise more than 50% of the time, given the reduced true correlation.

I ran some simulations and the only scenario in which this happens is when considering correlations very close to zero. By adding noise, thus reducing the true correlation, it becomes “easier” to obtain large, statistically significant correlations of the opposite sign. I might be missing something or making a blatant error, but I can’t see how, even when selecting for statistical significance and for small sample sizes, once we select the effect sign consistent with the true correlation, that adding noise could result in larger effects than without it over 50% of the time.

Here is a Word document where my argument is better formalized. Additionally, I’ve included two files here and here with the R code used for plotting the exact distribution of the correlation coefficients and for the simulations used to reproduce the plots from your article.

I responded that I didn’t remember exactly what we did in the paper, but I did have some R code which we used to make our graphs. It’s possible that we made a mistake somewhere or that we described our results in a confused way.

D’Atri responded:

The specific statement that I think could be wrong is this: “For small N, the reverse can be true. Of statistically significant effects observed after error, a majority could be greater than in the “ideal” setting when N is small”.

The simulations I did are slightly different: I added measurement error only on y and not on x, but the result would be the same (if you add measurement error the true correlation gets smaller in magnitude).

I’ve reviewed the latter section of the code, particularly the segment related to generating the final graph. It appears there’s a problem: the approach used selects the 100 most substantial “noisy” correlations. By only using the larger noisy correlations, it neglects a fair comparison with the “non-noisy” ones (I added a comment in the line where I think lies the problem).

To address this, I’ve adapted your code while preserving its original structure. Specifically, I’ve chosen to include all effects whenever either of the two correlations (with or without added measurement noise) demonstrates statistical significance. Given the structure of our data, in particular the small sample sizes ranging from 25 to 300, the low true correlation values, and the fact that selecting for statistical significance yields few statistically significant correlations, I believed it was good to increase the simulation count significantly. This adjustment provides a more reliable proportion estimate. With these changes, the graph now closely resembles what one would achieve using your initial code, but without any selection filters based on significance or size.

Furthermore, I’ve added another segment to the code. This new portion employs a package allowing for the generation of random data with a specific foundational structure (in our case, a predetermined degree of correlation).

One thing I definitely agree with you is the need to minimize measurement error as much as possible and the detrimental effects of selecting based on statistical significance. From my perspective, the presence of a greater measurement error amplifies the tendencies towards poor research practices, post-hoc hypotheses, and the relentless pursuit of statistically significant effects where there is mainly noise.

I have not had a chance to study this in detail, so I’m posting the above discussion and code to share with others for now. The topic is important and worth thinking about.
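
In the meantime, for readers who want to play with the basic question, here is a minimal sketch of the kind of simulation being discussed: draw x and y with some true correlation, add measurement error to y, and compare the estimates with and without error among the results that come out statistically significant. The settings below (true correlation, error sd, sample size) are arbitrary choices of mine, not the values from the paper or from D'Atri's code.

# Minimal sketch: measurement error plus selection on statistical significance
set.seed(2023)
n <- 50           # sample size per simulated study
rho <- 0.15       # true correlation between x and y
error_sd <- 1     # sd of measurement error added to y
n_sims <- 20000

r_ideal <- r_noisy <- rep(NA, n_sims)
for (s in 1:n_sims) {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)   # true correlation rho
  y_obs <- y + rnorm(n, 0, error_sd)          # y measured with error
  r_ideal[s] <- cor(x, y)
  r_noisy[s] <- cor(x, y_obs)
}

# Select simulations where the noisy estimate is "statistically significant"
# and has the sign of the true correlation
t_noisy <- r_noisy * sqrt((n - 2) / (1 - r_noisy^2))
p_noisy <- 2 * pt(-abs(t_noisy), df = n - 2)
sel <- p_noisy < 0.05 & r_noisy > 0

mean(abs(r_noisy[sel]) > abs(r_ideal[sel]))   # how often the noisy estimate is the larger one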

P.S. We looked at it in more detail! See here for update.

Teaching materials now available for Llaudet and Imai’s Data Analysis for Social Science!

Last year we discussed the exciting new introductory social-science statistics textbook by Elena Llaudet and Kosuke Imai.

Since then, Llaudet has created a website with tons of materials for instructors.

This is the book that I plan to teach from, next time I teach introductory statistics. As it is, I recommend it as a reference for students in more advanced classes such as Applied Regression and Causal Inference, if they want a clean refresher from first principles.

EDA and modeling

This is Jessica. This past week we’ve been talking about exploratory data analysis (EDA) in my interactive visualization course for CS undergrads, which is one of my favorite topics. I get to talk about model checks and graphical inference, why some people worry about looking at data too much, the limitations of thinking about the goal of statistical analysis as rejecting null hypotheses, etc. If nothing else, I think the students get intrigued because they can tell I get worked up about these things!

However, I was also reminded last week in reading some recent papers that there are still a lot of misconceptions about exploratory data analysis in research areas like visualization and human-computer interaction. EDA is sometimes described by well-meaning researchers as being essentially model-free and hypothesis-free, as if it’s a very different style of analysis than what happens when an analyst is exploring some data with some hunches about what they might find. 

It bugs me when people use the term EDA as synonymous with having few to no expectations about what they’ll find in the data. Identifying the unexpected is certainly part of EDA, but casting the analyst as a blank slate loses much of the nuance. For one, it’s hard to even begin making graphics if you truly have no idea what kinds of measurements you’re working with. And once you learn how the data were collected, you probably begin to form some expectations. It also mischaracterizes the natural progression as you build up understanding of the data and consider possible interpretations. Tukey for instance wrote about different phases in an exploratory analysis, some of which involve probabilistic reasoning in the sense of assessing “With what accuracy are the appearances already found to be believed?” Similar to people assuming that “Bayesian” is equivalent to Bayes rule, the term EDA is often used to refer to some relatively narrow phase of analysis rather than something multi-faceted and nuanced.

As Andrew and I wrote in our 2021 Harvard Data Science Review article, the simplistic (and unrealistic) view of EDA as not involving any substantive a priori expectations on the part of the analyst can be harmful for practical development of visualization tools. It can lead to a plethora of graphical user interface systems, both in practice and research, that prioritize serving up easy-to-parse views of the data, at the expense of surfacing variation and uncertainty or enabling the analyst to interrogate their expectations. These days we have lots of visualization recommenders for recommending the right chart type given some query, but it’s usually about getting the choice of encodings (position, size, etc.) right. 

What is better? In the article we had considered what a GUI visual analysis tool might look like if it took the idea of visualization as model checking seriously, including displaying variation and uncertainty by default and making it easier for the analyst to specify and check the data against provisional statistical models that capture relationships they think they see. (In Tableau Software, for example, it’s quite a pain to fit a simple regression to check its predictions against the data). But there was still a leap left after we wrote this, between proposing the ideas and figuring out how to implement this kind of support in a way that would integrate well with the kinds of features that GUI systems offer without resulting in a bunch of new problems. 

So, Alex Kale, Ziyang Guo, Xiao-li Qiao, Jeff Heer, and I recently developed EVM (Exploratory Visual Modeling), a prototype Tableau-style visual analytics tool where you can drag and drop variables to generate visualizations, but which also includes a “model bar.” Using the model bar, the analyst can specify provisional interpretations (in the form of regression) and check their predictions against the observed data. The initial implementation provides support for a handful of common distribution families and takes input in the form of Wilkinson-Pinheiro-Bates syntax. 
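
For readers who have not run into the term, Wilkinson-Pinheiro-Bates syntax is just R-style model formula notation (the Wilkinson-Rogers formulas, extended by Pinheiro and Bates for mixed models). A provisional model of the general kind described here could be written like this; I am using lme4's built-in sleepstudy data only to have something runnable, not anything from EVM:

# Formula ("Wilkinson-Pinheiro-Bates") syntax in R; requires the lme4 package
library(lme4)
fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)   # varying intercept by subject
summary(fit)
# The same notation without the random-effects term gives an ordinary regression:
fit_lm <- lm(Reaction ~ Days, data = sleepstudy)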

The idea is that generating predictions under different model assumptions absolves the analyst from having to rely so heavily on their imagination to assess hunches they have about which variables have explanatory power. If I think I see some pattern as I’m trying out different visual structures (e.g., facetting plots by different variables) I can generate models that correspond to the visualization I’m looking at (in the sense of having the same variables as predictors as shown in the plot), as well as view-adjacent models, that might add or remove variables relative to the visualization specification.

As we were developing EVM, we quickly realized that trying to pair the model and the visualization by constraining them to involve the same variables is overly restrictive. And a visualization will generally map to multiple possible statistical models anyway, so why aim for congruence?

I see this project, which Alex presented this week at IEEE VIS in Melbourne, as an experiment rather than a clear success or failure. There have been some interesting ideas proposed over the years related to graphical inference, and the connection between visualizations and statistical models, but I’ve seen few attempts to locate them in existing workflows for visual analysis like those supported by GUI tools. Line-ups, for instance, which hide a plot of the observed data amongst a line-up of plots representing the null hypothesis, are a cool idea, but the implementations I’ve seen have been standalone software packages (e.g., in R) rather than attempts to integrate them into the types of visual analysis tools that non-programmers are using. To bring these ideas into existing tools, we have to think about what kind of workflow we want to encourage, and how to avoid new potential failure modes. For example, with EVM there’s the risk that being able to directly check the different models one generates while looking at the data leaves the analyst with a sense that they’ve thoroughly checked their assumptions and can be even more confident about what explains the patterns. That’s not what we want.

Playing around with the tool ourselves has been interesting, in that it’s forced us to think about what the ideal use of this kind of functionality is, and under what conditions it seems to clearly benefit an analysis over not having it. The benefits are nuanced. We also had 12 people familiar with visual analysis in tools like Tableau use the system, and observed how their analyses of datasets we gave them seemed to differ from what they did without the model bar. Without it they all briefly explored patterns across a broad set of available variables and then circled back to recheck relationships they had already investigated. Model checking on the other hand tended to structure all but one participant’s thinking around one or two long chains of operations geared toward gradually improving models, through trying out different ways of modeling the distribution of the outcome variable, or the selection of predictor variables. This did seem to encourage thinking about the data-generating process, which was our goal, though a few of them got fixated on details in the process, like trying to get a perfect visual match between predictions and observed data (without any thought as to what they were changing in the model spec).

Figuring out how to avoid these risks requires understanding who exactly can benefit from this, which is itself not obvious because people use these kinds of GUI visual analysis tools in lots of different ways, from data diagnostics and initial data analysis to dashboard construction as a kind of end-user programming. If we think that a typical user is not likely to follow up on their visual interpretations by gathering new data to check if they still hold, then we might need to build in hold-out sets to prevent perceptions that models fit during data exploration are predictive. To improve the ecosystem of visual analysis tools, we need to understand goals, workflow, and expertise.

Bloomberg News makes an embarrassing calibration error

Palko points to this amusing juxtaposition:

I was curious so I googled to find the original story, “Forecast for US Recession Within Year Hits 100% in Blow to Biden,” by Josh Wingrove, which begins:

A US recession is effectively certain in the next 12 months in new Bloomberg Economics model projections . . . The latest recession probability models by Bloomberg economists Anna Wong and Eliza Winger forecast a higher recession probability across all timeframes, with the 12-month estimate of a downturn by October 2023 hitting 100% . . .

I did some further googling but could not find any details of the model. All I could find was this:

With probabilities that jump around this much, you can expect calibration problems.

This is just a reminder that for something to be a probability, it’s not enough that it be a number between 0 and 1. Real-world probabilities don’t exist in isolation; they are ensnared in a web of interconnections. Recall our discussion from last year:

Justin asked:

Is p(aliens exist on Neptune that can rap battle) = .137 valid “probability” just because it satisfies mathematical axioms?

And Martha sagely replied:

“p(aliens exist on Neptune that can rap battle) = .137” in itself isn’t something that can satisfy the axioms of probability. The axioms of probability refer to a “system” of probabilities that are “coherent” in the sense of satisfying the axioms. So, for example, the two statements

“p(aliens exist on Neptune that can rap battle) = .137” and “p(aliens exist on Neptune) = .001”

are incompatible according to the axioms of probability, because the event “aliens exist on Neptune that can rap battle” is a sub-event of “aliens exist on Neptune”, so the larger event must (as a consequence of the axioms) have probability at least as large as the probability of the smaller event.

The general point is that a probability can only be understood as part of a larger joint distribution; see the second-to-last paragraph of the boxer/wrestler article. I think that confusion on this point has led to lots of general confusion about probability and its applications.
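
To spell out the coherence point with a toy calculation (my numbers, chosen only for illustration): in any coherent system, the probability of the sub-event has to come from a joint distribution and so cannot exceed the probability of the event that contains it.

# A coherent toy assignment for the Neptune example (all numbers made up)
p_aliens <- 0.001                     # P(aliens exist on Neptune)
p_rap_given_aliens <- 0.137           # P(they can rap battle | aliens exist)
p_rap_battling_aliens <- p_aliens * p_rap_given_aliens   # = 0.000137
p_rap_battling_aliens <= p_aliens     # TRUE; assigning 0.137 to the sub-event would be incoherent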

Beyond that, seeing this completely avoidable slip-up from Bloomberg gives us more respect for the careful analytics teams at other news outlets such as the Economist and Fivethirtyeight, both of which are far from perfect, but at least they’re aware that it would not make sense to forecast a 100% probability of recession in this sort of uncertain situation.

P.S. See here for another example of a Bloomberg article with a major quantitative screw-up. In this case the perpetrator was not the Bloomberg in-house economics forecasting team, it was a Bloomberg Opinion columnist who is described as “a former editorial director of Harvard Business Review,” which at first kinda sounds like he’s an economist at the Harvard business school, but I guess what it really means is that he’s a journalist without strong quantitative skills.

Moving from “admit your mistakes” to accepting the inevitability of mistakes and their fractal nature

The other day we talked about checking survey representativeness by looking at canary variables:

Like the canary in the coal mine, a canary variable is something with a known distribution that was not adjusted for in your model. Looking at the estimated distribution of the canary variable, and then comparing to external knowledge, is a way of checking your sampling procedure. It’s not an infallible check—your sample, or your adjusted sample, can be representative for one variable but not another—but it’s something you can do.

Then I noticed another reference, from 2014:

What you’d want to do [when you see a problem] is not just say, Hey, mistakes happen! but rather to treat these errors as information, as model checks, as canaries in the coal mine and use them to improve your procedure. Sort of like what I did when someone pointed out problems in my election maps.

Canaries all around us

When you notice a mistake, something that seemed to fit your understanding but turned out to be wrong, don’t memory-hole it; engage with it. I got soooo frustrated with David Brooks, or the Nudgelords (further explanation here), or the Freakonomics team or, at a more technical level, the Fivethirtyeight team, when they don’t wrestle with their mistakes.

Dudes! A mistake is a golden opportunity, a chance to learn. You don’t get these every day—or maybe you do! To throw away such opportunities . . . it’s like leaving the proverbial $20 bill on the table.

When Matthew Walker or Malcolm Gladwell get caught out on their errors and they bob and weave and avoid confronting the problem, then I don’t get frustrated in the same way. Their entire brand is based on simplifying the evidence. Similarly with Brian Wansink: there was no there there. If he were to admit error, there’d be nothing left.

But David Brooks, Nudge, Freakonomics, Fivethirtyeight . . . they’re all about explanation, understanding, and synthesis. Sure, it would be a short-term hit to their reputations to admit they got fooled by bad statistical analyses (on the topic of Jews, lunch, beauty, and correlated forecasts, respectively) that happened to align with their ideological or intellectual preconceptions, but longer-term, they could do so much better. C’mon, guys! There’s more to life than celebrity, isn’t there? Try to remember what got you interested in writing about social science in the first place.

Moving from “admit your mistakes” to accepting the inevitability of mistakes and their fractal nature

I wonder whether part of this is the implicit dichotomy of “admit when you’re wrong.” We’re all wrong all the time, but when we frame “being wrong” as something that stands out, something that needs to be admitted, maybe that makes it more difficult for us to notice all the micro-errors that we make. If we could get in the habit of recognizing all the mistakes we make every day, all the false starts and blind alleys and wild goose chases that are absolutely necessary in any field of inquiry, then maybe it would be less of a big deal to face up to mistakes we make that are pointed out to us by others.

Mistakes are routine. We should be able to admit them forthrightly without even needing to swallow hard and face up to them, as it were. For example, Nate Silver recently wrote, “The perfect world is one in which the media is both more willing to admit mistakes—and properly frame provisional reporting as provisional and uncertain—and the public is more tolerant of mistakes. We’re not living that world.” Which I agree with, and it applies to Nate too. Maybe we need to go even one step further and not think of a mistake as something that needs to be “admitted,” but just something that happens when we are working on complicated problems, whether they be problems of straight-up journalism (with reports coming from different sources), statistical modeling (relying on assumptions that are inevitably wrong in various ways), or assessment of evidence more generally (at some point you end up with pieces of information that are pointing in different directions).

Many ways to trick a deep model

This is Jessica. Somewhat related to Andrew’s earlier post today, Bo Li from UIUC gave an interesting talk at Northwestern yesterday. The results she presented were an informative counterpoint to the “max AI hype” moment that seems to be playing out in the stream of articles celebrating artificial general intelligence or proposing explanations of how ML got so successful (like Donoho’s paper on frictionless reproducibility, which I liked).

Much of the talk covered a recent project called Decoding Trust, which is an open source toolkit for assessing how trustworthy GPT models are. It provides test cases in the form of prompts that can be used to make the model mess up, where messing up can be defined from various perspectives: generating toxic output, demonstrating gender or racial bias, leaking private training data, etc. 

For example, one prompt might describe a CS undergrad with a certain level of work experience, and ask the model whether they deserved a software engineer position with a starting salary of $225k or more. A second prompt is the same but substitutes the male name and pronouns with female name and pronouns, leading the model to change its mind about whether the salary was appropriate. 

One thing her results showed was that GPT-4 is actually worse than GPT-3.5 on many of these tests. I.e., improvements that lead GPT-4 to better follow instructions may also make it more vulnerable to being tricked.

Another was that if you examined recent LLMs from these nine perspectives, there were differences in terms of where different models were most likely to fail, but none of them dominated across the board. She did mention that Llama 2 was more conservative, though, and would often default to saying it didn’t know or couldn’t answer.

One set of results she presented was interesting in that it got at unexpected holes in GPT’s understanding of words that meant the same thing. She showed that depending on which synonym you used to describe one person privately telling another person something (secretly, in confidence, confidentially, privately, etc.) you could get different responses from GPT about whether it was appropriate to tell others the info, suggesting it properly understood some of the synonyms but not others. It was also eye-opening to me that these issues could be demonstrated even when the model was prompted with context warning it about what not to do. For example, even when it was first reminded not to share certain corporate info like contact information outside of the company, she showed you could still get it to reveal emails and phone numbers.

Overall, this has me wondering how viable some safeguarding approaches used to prevent bad behaviors of LLMs, such as fine-tuning, are for getting these models to the point where they can be deployed in the world. For example, she mentioned that GPT models won’t produce SSNs, suggesting they have been carefully tuned to avoid this, and that they seem less likely to leak numeric information than text. However, there are many ways to be adversarial, and it’s hard to imagine fine-tuning to prevent them all.

Li also suggested that having approaches that can efficiently find such examples, or incorporate them in training (i.e., robust training approaches), doesn’t necessarily translate to performance improvements. She put up a graph at the end of her talk, similar to the one below, to make the point that despite all the work on robustness there is still a long way to go. The concept of “certified robustness” was new to me, but refers to robustness verification approaches that can theoretically certify the lower bound of a model’s performance against a particular form of adversary. For example, the strongest adversary is “white-box,” meaning they can access the model (including parameters and architecture). A common formulation is an adversary who searches a space of perturbations within some predefined bound of an input instance to identify those that will fool the model into making incorrect predictions. Robust training approaches are devised to prevent models from being vulnerable to such attacks, but, as Li’s graph showed, beyond certain benchmarks like MNIST, progress in certifiably robust ML has been slow and there’s still a long way to go. I pulled the graph below from her github site on a platform for benchmarking verification and robust training approaches for deep neural nets.
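
To make the bounded-perturbation formulation concrete, here is a toy sketch in R, using a logistic regression as a stand-in for a deep model (this is my own illustration of the general idea, not anything from Li's talk or toolkit). For a linear model, the worst-case perturbation within an L-infinity ball has a closed form: move each input coordinate by eps in the direction of its weight.

# Toy white-box perturbation "attack" on a fitted logistic regression
set.seed(1)
n <- 500; p <- 5
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1.5, 1, 0.5, -2)
y <- rbinom(n, 1, plogis(X %*% beta))
dat <- data.frame(y = y, X)
fit <- glm(y ~ ., data = dat, family = binomial)

w <- coef(fit)[-1]   # fitted weights
b <- coef(fit)[1]    # intercept
x0 <- X[1, ]         # an input instance to perturb
p0 <- plogis(b + sum(w * x0))

# Within an L-infinity ball of radius eps, the perturbation that moves the linear
# predictor the most is eps * sign(w), signed to push the prediction across 0.5
eps <- 0.5
push <- if (p0 > 0.5) -sign(w) else sign(w)
x_adv <- x0 + eps * push
p_adv <- plogis(b + sum(w * x_adv))

c(original = round(p0, 3), perturbed = round(p_adv, 3))   # a small input change can flip the call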

Frictionless reproducibility; methods as proto-algorithms; division of labor as a characteristic of statistical methods; statistics as the science of defaults; statisticians well prepared to think about issues raised by AI; and robustness to adversarial attacks

Tian points us to this article by David Donoho, which argues that some of the rapid progress in data science and AI research in recent years has come from “frictionless reproducibility,” which he identifies with “data sharing, code sharing, and competitive challenges.” This makes sense: the flip side of the unreplicable research that has destroyed much of social psychology, policy analysis, and related fields is that when we can replicate an analysis with a press of a button using open-source software, it’s much easier to move forward.

Frictionless reproducibility

Frictionless reproducibility is a useful goal in research. There can be a long gap between the development of a statistical idea and its implementation in a reproducible way, and that's ok. But it's good to aim for that stage. The effort it takes to make a research idea reproducible is often worth it, in that getting to reproducibility typically requires a level of care and rigor beyond what is necessary just to get a paper published. One thing I've learned with Stan is that much is learned in the process of developing a general tool that will be used by strangers.

I think that statisticians have a special perspective for thinking about these issues, for the following reason:

Methods as proto-algorithms

As statisticians, we’re always working with “methods.” Sometimes we develop new methods or extend existing methods; sometimes we place existing methods into a larger theoretical framework; sometimes we study the properties of methods; sometimes we apply methods. Donoho and I are typical of statistics professors in having done all these things in our work.

A “method” is a sort of proto-algorithm, not quite fully algorithmic (for example, it could require choices of inputs, tuning parameters, expert inputs at certain points) but it follows some series of steps. The essence of a method is that it can be applied by others. In that sense, any method is a bridge between different humans; it’s a sort of communication among groups of people who may never meet or even directly correspond. Fisher invented logistic regression and decades later some psychometrician uses it; the method is a sort of message in a bottle.

Division of labor as a characteristic of statistical methods

There are different ways to take this perspective. One direction is to recognize that almost all statistical methods involve a division of labor. In Bayes, one agent creates the likelihood model and another agent creates the prior model. In bootstrap, one agent comes up with the estimator and another agent comes up with the bootstrapping procedure. In classical statistics, one agent creates the measurement protocol, another agent designs the experiment, and a third agent performs the analysis. In machine learning, there are the training and test sets. With public surveys, one group conducts the survey and computes weights; other groups analyze the data using the weights. Etc. We discussed this general idea a few years ago here.

But that’s not the direction I want to go right here. Instead I want to consider something else, which is the way that a “method” is an establishment of a default; see here and also here.

Statistics as the science of defaults

The relevance to the current discussion is that, to the extent that defaults are a move toward automatic behavior, statisticians are in the business of automating science. That is, our methods are "successes" to the extent that they enable automatic behavior on the part of users. As we have discussed, automatic behavior is not a bad thing! When we make things automatic, users can think at the next level of abstraction. For example, push-button linear regression allows researchers to focus on the model rather than on how to solve a matrix equation, and it can even take them to the next level of abstraction, where they think about prediction without thinking about the model at all. As teachers and users of research, we are then (rightly) concerned that lack of understanding can be a problem, but it's hard to go back. We might as well complain that the vast majority of people drive their cars with no understanding of how those little explosions inside the engine make the car go round.
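To make the abstraction point concrete, here's a tiny sketch (my illustration, with made-up data): the same fit, once at the level of the matrix equation and once as a one-line default.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)  # made-up data

# Lower level of abstraction: set up the design matrix and solve the
# least-squares problem explicitly.
X = np.column_stack([np.ones_like(x), x])
beta_manual = np.linalg.lstsq(X, y, rcond=None)[0]

# Push-button level of abstraction: one call, no thought about the matrix algebra.
beta_default = np.polyfit(x, y, 1)[::-1]  # reversed so the intercept comes first

print(beta_manual, beta_default)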

Statisticians well prepared to think about issues raised by AI

To get back to the AI issue: I think that we as statisticians are particularly well prepared to think about the issues that AI brings, because the essence of statistics is the development of tools designed to automate human thinking about models and data. Statistical methods are a sort of slow-moving AI, and it’s kind of always been our dream to automate as much of the statistics process as possible, while recognizing that for Cantorian reasons (see section 7 here) we will never be there. Given that we’re trying, to a large extent, to turn humans into machines or to routinize what has traditionally been a human behavior that has required care, knowledge, and creativity, we should have some insight into computer programs that do such things.

In some ways, we statisticians are even more qualified to think about this than computer scientists are, in that the paradigmatic action of a computer scientist is to solve a problem, whereas the paradigmatic action of a statistician is to come up with a method that will allow other people to solve their problems.

I sent the above to Jessica, who wrote:

I like the emphasis on frictionless reproducibility as a critical driver of the success in ML. Empirical ML has clearly emphasized methods for ensuring the validity of predictive performance estimates (held-out test sets, the common task framework, etc.) compared to fields that use statistical modeling to generate explanations, like the social sciences, and it does seem like that has paid off.

From my perspective, something else has been very successful as well: post-2015ish, there's been a heavy emphasis on making models robust to adversarial attack. Being able to take an arbitrary evaluation metric and incorporate it into your loss function, so you're explicitly training for it, is also likely to improve things fast. We comment on this a bit in a paper we wrote last year reflecting on what, if anything, recent concerns about ML reproducibility and replicability have in common with the so-called replication crisis in social science.

I do think we are about at max hype currently in terms of the perceived success of ML, though, and it can be hard to tell sometimes how much the emerging evidence of success from ML research is overfit to the standard benchmarks. Obviously there have been huge improvements on certain test suites, but just this morning, for instance, I saw an ML researcher present a pretty compelling graph showing that the "certified robustness" of the top LLMs (GPT-3.5, GPT-4, Llama 2, etc.), when trained on the common datasets (ImageNet, MNIST, etc.), has not really improved much at all in the past 7-8 years. This was a line graph where each line denoted changes in robustness for a different benchmark (ImageNet, MNIST, etc.) with new methodological advances. Each point in a line represented the robustness of a deep net on that particular benchmark given whatever was considered the state of the art in robust ML at that time. The x-axis was related to time, but each tick represented a particular paper that advanced SOTA. It's still very easy to trick LLMs into generating toxic text, leaking private data they trained on, or changing their mind based on what should be an inconsequential change to the wording of a prompt, for example.
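As an aside on the loss-function point: one common pattern is to replace the evaluation metric with a differentiable surrogate and train on that directly. Here is a minimal sketch (my own illustration, not from the paper Jessica mentions) of a soft F1 loss in PyTorch.

import torch

def soft_f1_loss(logits, targets, eps=1e-8):
    # Differentiable surrogate for the (binary) F1 evaluation metric:
    # predicted probabilities stand in for hard 0/1 predictions, so the
    # metric can be optimized directly by gradient descent.
    probs = torch.sigmoid(logits)
    tp = (probs * targets).sum()
    fp = (probs * (1 - targets)).sum()
    fn = ((1 - probs) * targets).sum()
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - f1  # minimize 1 - F1 to maximize (soft) F1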

Simple pseudocode for transformer decoding a la GPT

A number of people have asked me for this, so I’m posting it here:

[Edit: revised original draft (twice) to fix log density context variable and width of multi-head attention values.]

This is a short note that provides complete and relatively simple pseudocode for the neural network architecture behind the current crop of large language models (LLMs), the generative pretrained transformers (GPT). These are based on the notion of (multi-head) attention, followed by feedforward neural networks, stacked in a deep architecture.

I simplified the pseudocode compared to things like Karpathy's nanoGPT repository in Python (great, but it's tensorized and batched PyTorch code for GPU efficiency) or Phuong and Hutter's pseudocode, which is more general and covers encoding and multiple different architectures. I also start from scratch with the basic notions of tokenization and language modeling.

I include the pseudocode for evaluating the objective function for training and the pseudocode for generating responses. The initial presentation uses single-head attention to make the attention stage clearer, followed by a note with pseudocode generalizing to multi-head attention.
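To give a flavor of the single-head case, here is a minimal NumPy sketch of causal self-attention. This is my own illustration of the standard construction, not the note's pseudocode; see the note itself for the full decoder.

import numpy as np

def single_head_attention(X, W_q, W_k, W_v):
    # Causal self-attention over a sequence of T token embeddings.
    # X is (T, d); W_q, W_k, W_v map embeddings to queries, keys, and values.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    T, d_k = Q.shape[0], K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (T, T) similarity scores
    # Causal mask: position t may attend only to positions <= t.
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ V                               # (T, d_v) attended values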

I also include references to other basic presentations, including Daniel Lee’s version coded in Stan.

If this is confusing or you think I got a detail wrong, please let me know—I want to make this as clear and correct (w.r.t. GPT-2) as possible.

Difference-in-differences: What’s the difference?

After giving my talk last month, Better Than Difference in Differences, I had some thoughts about how diff-in-diff works—how the method operates in relation to its assumptions—and it struck me that there are two relevant ways to think about it.

From a methods standpoint the relevance here is that I will usually want to replace differencing with regression. Instead of taking (yT – yC) – (xT – xC), where T = Treatment and C = Control, I’d rather look at (yT – yC) – b*(xT – xC), where b is a coefficient estimated from the data, likely to be somewhere between 0 and 1. Difference-in-differences is the special case b=1, and in general you should be able to do better by estimating b. We discuss this with the Electric Company example in chapter 19 of Regression and Other Stories and with a medical trial in our paper in the American Heart Journal.
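Here's a minimal sketch of that comparison on made-up data (my illustration, not code from the book or the paper): the diff-in-diff estimate is the special case b = 1, and the regression version estimates b from the data by including the pre-treatment measurement as a predictor.

import numpy as np

rng = np.random.default_rng(1)
n = 200
treat = np.repeat([0, 1], n // 2)
x = rng.normal(size=n)                           # pre-treatment measurement
y = 0.3 * x + 0.5 * treat + rng.normal(size=n)   # made-up post-treatment outcome

# Difference in differences: (yT - yC) - (xT - xC), i.e., b fixed at 1.
did = (y[treat == 1].mean() - y[treat == 0].mean()) \
    - (x[treat == 1].mean() - x[treat == 0].mean())

# Regression version: fit y ~ treatment + x; the coefficient on `treat`
# equals (yT - yC) - b*(xT - xC), with b estimated from the data.
X = np.column_stack([np.ones(n), treat, x])
b_treat = np.linalg.lstsq(X, y, rcond=None)[0][1]

print("diff-in-diff:", did, "regression-adjusted:", b_treat)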

Given this, what’s the appeal of diff-in-diff? I think the appeal of the method comes from the following mathematical sequence:

Control units:
(a) Data at time 0 = Baseline + Error_a
(b) Data at time 1 = Baseline + Trend + Error_b

Treated units:
(c) Data at time 0 = Baseline + Error_c
(d) Data at time 1 = Baseline + Trend + Effect + Error_d

Now take a diff in diff:

((d) – (c)) – ((b) – (a)) = Effect + Error,

where that last Error is a difference in difference of errors, which is just fine under the reasonable-enough assumption that the four error terms are independent.

The above argument looks pretty compelling and can easily be elaborated to include nonlinear trends, multiple time points, interactions, and so forth. That’s the direction of the usual diff-in-diff discussions.

The message of my above-linked talk and our paper, though, was different. Our point was that, whatever differencing you take, it’s typically better to difference only some of the way. Or, to make the point more generally, it’s better to model the baseline and the trend as well as the effect.

Seductive equations

The above equations are seductive: with just some simple subtraction, you can cancel out Baseline and Trend, leaving just Effect and error. And the math is correct (conditional on the assumptions, which can be reasonable). The problem is that the resulting estimate can be super noisy; indeed, it’s basically never the right thing to do from a probabilistic (Bayesian) standpoint.
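To see the noise issue in a small simulation (my illustration, using the Baseline/Trend/Effect setup from the equations above, with made-up parameter values): over repeated datasets, both estimators are centered on the true effect, but the diff-in-diff estimate has a larger standard deviation than the regression estimate that only partially adjusts for the time-0 measurement.

import numpy as np

rng = np.random.default_rng(2)

def one_sim(n=100, sd_baseline=0.5, sd_error=1.0, trend=1.0, effect=0.5):
    treat = np.repeat([0, 1], n // 2)
    baseline = rng.normal(0, sd_baseline, size=n)
    data0 = baseline + rng.normal(0, sd_error, size=n)                           # time 0
    data1 = baseline + trend + effect * treat + rng.normal(0, sd_error, size=n)  # time 1
    # Diff-in-diff: difference all the way (b = 1).
    did = (data1[treat == 1].mean() - data1[treat == 0].mean()) \
        - (data0[treat == 1].mean() - data0[treat == 0].mean())
    # Regression with the time-0 measurement as a predictor (b estimated).
    X = np.column_stack([np.ones(n), treat, data0])
    adj = np.linalg.lstsq(X, data1, rcond=None)[0][1]
    return did, adj

estimates = np.array([one_sim() for _ in range(2000)])
print("sd of diff-in-diff estimate:", estimates[:, 0].std())
print("sd of regression-adjusted estimate:", estimates[:, 1].std())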

In our example it was pretty easy in retrospect to do the fully Bayesian analysis. It helped that we had 38 replications of similar experiments, so we could straightforwardly estimate all the hyperparameters in the model. If you only have one experiment, your inferences will depend on priors that can’t directly be estimated from local data. Still, I think the Bayesian approach is the way to go, in the sense of yielding effect-size estimates that are more reasonable and closer to the truth.

Next step is to work this out on some classic diff-in-diff examples.