What Has the Internet Done to Media? (from 2017 but still relevant)

Aleks Jakulin writes:

The Internet emerged by connecting communities of researchers, but as Internet grew, antisocial behaviors were not adequately discouraged.

When I [Aleks] coauthored several internet standards (PNG, JPEG, MNG), I was guided by the vision of connecting humanity. . . .

The Internet was originally designed to connect a few academic institutions, namely universities and research labs. Academia is a community of academics, which has always been based on the openness of information. Perhaps the most important to the history of the Internet is the hacker community composed of computer scientists, administrators, and programmers, most of whom are not affiliated with academia directly but are employed by companies and institutions. Whenever there is a community, its members are much more likely to volunteer time and resources to it. It was these communities that created websites, wrote the software, and started providing internet services.

“Whenever there is a community, its members are much more likely to volunteer time and resources to it” . . . so true!

As I wrote a few years ago, Create your own community (if you need to).

But it’s not just about community; you also have to pay the bills.

Aleks continues:

The skills of the hacker community are highly sought after and compensated well, and hackers can afford to dedicate their spare time to the community. Society is funding universities and institutes who employ scholars. Within the academic community, the compensation is through citation, while plagiarism or falsification can destroy someone’s career. Institutions and communities have enforced these rules both formally and informally through members’ desire to maintain and grow their standing within the community.

Lots to chew on here. First, yeah, I have skills that allow me to be compensated well, and I can afford to dedicate my spare time to the community. This is not new: back in the early 1990s I wrote Bayesian Data Analysis in what was essentially my spare time; indeed, my department chair advised me not to do it at all—master of short-term thinking that he was. As Aleks points out, there was a time when a large proportion of internet users had this external compensation.

The other interesting thing about the above quote is that academics and tech workers have traditionally had an incentive to tell the truth, at least on things that can be checked. Repeatedly getting things wrong would be bad for your reputation. Or, to put it another way, you could be a successful academic and repeatedly get things wrong, but then you’d be crossing the John Yoo line and becoming a partisan hack. (Just to be clear, I’m not saying that being partisan makes you a hack. There are lots of scholars who express strong partisan views but with intellectual integrity. The “hack” part comes from getting stuff wrong, trying to pass yourself off as an expert on topics you know nothing about, ultimately being willing to say just about anything if you think it will make the people on your side happy.)

Aleks continues:

The values of academic community can be sustained within universities, but are not adequate outside of it. When businesses and general public joined the internet, many of the internet technologies and services were overwhelmed with the newcomers who didn’t share their values and were not members of the community. . . . False information is distracting people with untrue or irrelevant conspiracy theories, ineffective medical treatments, while facilitating terrorist organization recruiting and propaganda.

I’ve not looked at data on all these things, but, yeah, from what I’ve read, all that does seem to be happening.

Aleks then moves on to internet media:

It was the volunteers, webmasters, who created the first websites. Websites made information easily accessible. The website was property and a brand, vouching for the reputation of the content and data there. Users bookmarked those websites they liked so that they could revisit them later. . . .

In those days, I kept current about the developments in the field by following newsgroups and regularly visiting key websites that curated the information on a particular topic. Google entered the picture by downloading all of Internet and indexing it. . . . the perceived credit for finding information went to Google and no longer to the creators of the websites.

He continues:

After a few years of maintaining my website, I was no longer receiving much appreciation for this work, so I have given up maintaining the pages on my website and curating links. This must have happened around 2005. An increasing number of Wikipedia editors are giving up their unpaid efforts to maintain quality in the fight with vandalism or content spam. . . . On the other hand, marketers continue to have an incentive to put information online that would lead to sales. As a result of depriving contributors to the open web with brand and credit, search results on Google tend to be of worse quality.

And then:

When Internet search was gradually taking over from websites, there was one area where a writer’s personal property and personal brand were still protected: blogging. . . . The community connected through the comments on blog posts. The bloggers were known and personally subscribed to.

That’s where I came in!

Aleks continues:

Alas, whenever there’s an unprotected resource online, some startup will move in and harvest it. Social media tools simplified link sharing. Thus, an “influencer” could easily post a link to an article written by someone else within their own social media feed. The conversation was removed from the blog post and instead developed in the influencer’s feed. As a result, carefully written articles have become a mere resource for influencers. As a result, the number of new blogs has been falling.
Social media companies like Twitter and Facebook reduced barriers to entry by making it so easy to refer to others’ content . . .

I hadn’t thought about this, but, yeah, good point.

As a producer of “content”—for example, what I’m typing right now—I don’t really care if people come to this blog from Google, Facebook, Twitter, an RSS feed, or a link on their browser. (There have been cases where someone’s stripped the material from here and put it on their own site without acknowledging the source, but that’s happened only rarely.) Any of those legitimate ways of reaching this content is fine with me: my goal is just to get it out there, to inform people and to influence discussion. I already have a well-paying job, so I don’t need to make money off the blogging. If it did make money, that would be fine—I could use it to support a postdoc—but I don’t really have a clear sense of how that would happen, so I haven’t ever looked into it seriously.

The thing I hadn’t thought about was that, even if it doesn’t matter to me where our readers are coming from, this does matter to the larger community. Back in the day, if someone wanted to link or react to something on a blog, they’d do it in their own blog or in a comment section. Now they can do it from Facebook or Twitter. The link itself is no problem, but there is a problem in that there’s less of an expectation of providing new content along with the link. Also, Facebook and Twitter are their own communities, which have their strengths but which are different from those of blogs. In particular, blogging facilitates a form of writing where you fill in all the details of your argument, where you can go on tangents if you’d like, and where you link to all relevant sources. Twitter has the advantage of immediacy, but often it seems more like community without the content, where people can go on and say what they love or hate but without the space for giving their reasons.

Freakonomics and global warming: What happens to a team of “rogues” when there is no longer a stable center to push against? (a general problem with edgelords)

A few years ago there was a cottage industry among some contrarian journalists, making use of the fact that 1998 was a particularly hot year (by the standards of its period) to cast doubt on the global warming trend. Ummmm, where did I see this? . . . Here, I found it! It was a post by Stephen Dubner on the Freakonomics blog, entitled, “A Headline That Will Make Global-Warming Activists Apoplectic,” and continuing:

The BBC is responsible. The article, by the climate correspondent Paul Hudson, is called “What Happened to Global Warming?” Highlights:

For the last 11 years we have not observed any increase in global temperatures. And our climate models did not forecast it, even though man-made carbon dioxide, the gas thought to be responsible for warming our planet, has continued to rise. So what on Earth is going on?

And:

According to research conducted by Professor Don Easterbrook from Western Washington University last November, the oceans and global temperatures are correlated. . . . Professor Easterbrook says: “The PDO cool mode has replaced the warm mode in the Pacific Ocean, virtually assuring us of about 30 years of global cooling.”

Let the shouting begin. Will Paul Hudson be drummed out of the circle of environmental journalists? Look what happened here, when Al Gore was challenged by a particularly feisty questioner at a conference of environmental journalists.

We have a chapter in SuperFreakonomics about global warming and it too will likely produce a lot of shouting, name-calling, and accusations ranging from idiocy to venality. It is curious that the global-warming arena is so rife with shrillness and ridicule. Where does this shrillness come from? . . .

No shrillness here. Professor Don Easterbrook from Western Washington University seems to have screwed up his calculations somewhere, but that happens. And Dubner did not make this claim himself; he merely featured a news article that showcased this particular guy and treated him like an expert. Actually, Dubner and his co-author Levitt also wrote, “we believe that rising global temperatures are a man-made phenomenon and that global warming is an important issue to solve,” so I could never quite figure out why in their blog he was highlighting an obscure scientist who was claiming that we were virtually assured of 30 years of cooling.

Anyway, we all make mistakes; what’s important is to learn from them. I hope Dubner and his Freakonomics colleagues learn from this particular prediction that went awry. Remember, back in 2009 when Dubner was writing about “A Headline That Will Make Global-Warming Activists Apoplectic,” and Don Easterbrook was “virtually assuring us of about 30 years of global cooling,” the actual climate-science experts were telling us that things would be getting hotter. The experts were pointing out that oft-repeated claims such as “For the last 11 years we have not observed any increase in global temperatures . . .” were pivoting off the single data point of 1998, but Dubner and Levitt didn’t want to hear it. Fiddling while the planet burns, one might say.

It’s not that the experts are always right, but it can make sense to listen to their reasoning instead of going on about apoplectic activists, feisty questioners, and shrillness.

Freakonomists getting outflanked

The media landscape has changed since 2005 (when the first edition of Freakonomics came out), 2009 (when they ran that ridiculous post pushing climate-change denial), and 2018 (when the above post appeared; I updated it in 2021 with further discussion, and here’s the news from 2023).

Back in the day, Steven Levitt was a “rogue economist,” a genial rebel who held a mix of political opinions (for example, in 2008 thinking Obama would be “the greatest president in history” while pooh-poohing concerns about recession at the time), along with some soft contrarianism (most notoriously claiming that drunk walking was worse than drunk driving, but also various little things like saying that voting in a presidential election is not so smart). Basically, he was positioning himself as being a little more playful and creative than the usual economics professor. A rogue relative to a stable norm.

I wonder how the Freakonomics team feels now, in an era of quasi-academic celebrities such as Dr. Oz and Jordan Peterson, and podcasters like Joe Rogan who push all sorts of conspiracy theories: not just nutty-but-hey-why-not ideas such as UFOs and space aliens but also more dangerous positions such as vaccine denial.

Being a contrarian’s all fun and games when you’re defining yourself relative to a reasonable center, maybe not so much when you’re surrounded by crazies.

For example, what were Levitt and Dubner thinking back in 2009 when they published that credulous article featuring an eccentric climate change denier? I can’t know what they were thinking, but I suspect it was something like: “Hey, this guy deserves a hearing. And, in any case, we’re stirring things up. Conversation and debate are good things. Those global-warming activists are so shrill. Let’s make them apoplectic—that’ll be fun!”

The point is, this was all taking place in a media environment where climate change denial was marginalized. So they could run ridiculous pieces like the above-linked post without being concerned about bad effects. They were just joking around, taking the piss, setting up boring Al Gore as a foil for “a particularly feisty questioner,” promoting a fringe character such as Professor Don Easterbrook from Western Washington University (he who told us in 2009 that climatic conditions were “virtually assuring us of about 30 years of global cooling”), secure in the belief that no one would take this claim seriously. Just a poke in the eye at humorless liberals, that’s all.

Recall that, around the same time, Levitt and Dubner also wrote, “we believe that rising global temperatures are a man-made phenomenon and that global warming is an important issue to solve” (see also here) so my take on the whole episode is that they felt ok promoting a fringe climate-change denier without concern that they could be upsetting the larger consensus. They got to have the fun of being edgy by promoting the prediction of “30 years of global cooling” without ever actually believing that ridiculous claim.

Nowadays, though, things are getting out of control, both with the climate and with extremists and wild takes in news and social media and in politics, and I imagine that it’s more difficult for the Freakonomics team to feel comfortable as rogues. They no longer have a stable center to push against.

A political science perspective

In political science we sometimes talk about proximity or directional voting. In proximity voting, you choose the party or candidate closest to you in policy preferences; in directional voting, you choose the party or candidate whose position is most extreme relative to the center while being in the same general direction as yours (to be precise, if we consider your position and each party’s position as vectors in a multidimensional space, you’d choose the party that maximizes the dot product of your position and the party’s position, with that dot product being defined relative to some zero position in the center of the political spectrum). The rationale for proximity voting is obvious; the rationale for directional voting is that your vote has only a very small impact, which you can maximize by pushing the polity as far as you can in the desired direction.
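Here is a minimal sketch of the two rules with made-up positions in a two-dimensional policy space; the party labels and numbers are purely illustrative:

```python
import numpy as np

# Hypothetical positions (economic, social), centered so that (0, 0) is the political center.
voter = np.array([0.3, 0.2])
parties = {
    "center_left":  np.array([-0.4,  0.3]),
    "center_right": np.array([ 0.5, -0.2]),
    "far_right":    np.array([ 1.8, -1.5]),
}

# Proximity rule: pick the party closest to the voter.
proximity_choice = min(parties, key=lambda p: np.linalg.norm(voter - parties[p]))

# Directional rule: pick the party maximizing the dot product with the voter's
# position (both measured relative to the center at the origin).
directional_choice = max(parties, key=lambda p: voter @ parties[p])

print(proximity_choice, directional_choice)
```

With these numbers the proximity rule picks the nearest centrist party, while the directional rule picks the more extreme party on the voter’s general side.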

There is a logic to directional voting; the problem arises when many people do it, with the result that extreme parties get real influence and even attain political power in a country.

Some examples of directional voting, or directional position-taking, include Levitt and Dubner pushing climate-change denial, people who should know better on the right supporting election denial in 2020, or, on the other side, center-leftists supporting police defunding, presumably following the reasoning that the police would not be defunded and the pressure to defund would merely cause police funding to decrease. Once you think to look, you can find this sort of political behavior all the time: a way to oppose the party in power is to support its fiercest opponents, even if you would not ever want those opponents to be in power either.

But . . . directional voting falls apart when the center does not hold.

Faculty position in computation & politics at MIT

We have this tenure-track Assistant Professor position open at MIT. It is an unusual opportunity in being a shared position between the Department of Political Science and the College of Computing. (I say “unusual” compared with typical faculty lines, but by now MIT has hired faculty into several such shared positions.)

So we’re definitely inviting applications not just from social science PhDs, but also from, e.g., statisticians, mathematicians, and computer scientists:

We seek candidates whose research involves development and/or intensive use of computational and/or statistical methodologies, aimed at addressing substantive questions in political science.

Beyond advertising this specific position, perhaps this is an interesting example of the institutional forms that interdisciplinary hiring can take. Here the appointment would be in the Department of Political Science and then also within one of the relevant units of the College of Computing. And there are two search committees working together, one from the Department and one from the College. I am serving on the latter, which includes experts from all parts of the College.

[This post is by Dean Eckles.]

The connection between junk science and sloppy data handling: Why do they go together?

Nick Brown pointed me to a new paper, “The Impact of Incidental Environmental Factors on Vote Choice: Wind Speed is Related to More Prevention-Focused Voting,” to which his reaction was, “It makes himmicanes look plausible.” Indeed, one of the authors of this article had come up earlier on this blog as a coauthor of a paper with a fatally flawed statistical analysis. So, between the general theme of this new article (“How might irrelevant events infiltrate voting decisions?”), the specific claim that wind speed has large effects, and the track record of one of the authors, I came into this in a skeptical frame of mind.

That’s fine. Scientific papers are for everyone, not just the true believers. Skeptics are part of the audience too.

Anyway, I took a look at the article and replied to Nick:

The paper is a good “exercise for the reader” sort of thing to find how they managed to get all those pleasantly low p-values. It’s not as blatantly obvious as, say, the work of Daryl Bem. The funny thing is, back in 2011, lots of people thought Bem’s statistical analysis was state-of-the-art. It’s only in retrospect that his p-hacking looks about as crude as the fake photographs that fooled Arthur Conan Doyle. Figure 2 of this new paper looks so impressive! I don’t really feel like putting in the effort to figure out exactly how the trick was done in this case . . . Do you have any ideas?

Nick responded:

There are some hilarious errors in the paper. For example:
– On p. 7 of the PDF, they claim that “For Brexit, the “No” option advanced by the Stronger In campaign was seen as clearly prevention-oriented (Mean (M) = 4.5, Standard Error (SE) = 0.17, t(101) = 6.05, p < 0.001) whereas the “Yes” option put forward by the Vote Leave campaign was viewed as promotion-focused (M = 3.05, SE = 0.16, t(101) = 2.87, p = 0.003).": But the question was not "Do you want Brexit, Yes/No". It was "Should the UK Remain in the EU or Leave the EU". Hence why the pro-Brexit campaign was called "Vote Leave", geddit? Both sides agreed on before the referendum that this was fairer and clearer than Yes/No. Is "Remain" more prevention-focused than "Leave"? - On p. 12 of the PDF, they say "In the case of the Brexit vote, the Conservative Party advanced the campaign for the UK to leave the EU." This is again completely false. The Conservative government, including Prime Minister David Cameron, backed Remain. It's true that a number of Conservative politicians backed Leave, and after the referendum lots of Conservatives who had backed Remain pretended that they either really meant Leave or were now fine with it, but if you put that statement, "In the case of the Brexit vote, the Conservative Party advanced the campaign for the UK to leave the EU" in front of 100 UK political scientists, not one will agree with it. If the authors are able to get this sort of thing wrong then I certainly don't think any of their other analyses can be relied upon without extensive external verification. If you run the attached code on the data (mutatis mutandis for the directories in which the files live) you will get Figure 2 of the Mo et al. paper. Have a look at the data (the CSV file is an export of the DTA file, if you don't use Stata) and you will see that they collected a ton of other variables. To be fair they mention these in the paper ('Additionally, we collected data on other Election Day weather indicators (i.e., cloud cover, dew point, precipitation, pressure, and temperature), as well as historical wind speeds per council area.5 The inclusion of other Election Day weather indicators increases our confidence that we are detecting an association between wind speed and election outcomes, and not the effect of other weather indicators that may be correlated with wind speed.") My guess is that they went fishing and found that wind speed, as opposed to the other weather indicators that they mentioned, gave them a good story. Looking only at the Swiss data, I note that they also collected "Income", "Unemployment", "Age", "Race" (actually the percentage of foreign-born people; I doubt if Switzerland collects "Race" data; Supplement, Table S3, page 42), "Education", and "Rural", and threw those into their model as well. They also collected latitude and longitude (of the centroid?) for each canton, although those didn't make it into the analyses. Also they include "Turnout", but for any given Swiss referendum it seems that they only had the national turnout because this number is always the same for every "State" (canton) for any given "Election" (referendum). And the income data looks sketchy (people in Schwyz canton do not make 2.5 times what people in Zürich canton do). I think this whole process shows a degree of naivety about what "kitchen-sink" regression analyses (and more sophisticated versions thereof) can and can't do, especially with noisy measures (such as "Precipitation" coded as 0/1). 
Voter turnout is positively correlated with precipitation but negatively with cloud cover, whatever that means. Another glaring omission is any sort of weighting by population. The most populous canton in Switzerland has a population almost 100 times the least populous, yet every canton counts equally. There is no "population" variable in the dataset, although this would have been very easy to obtain. I guess this means they avoid the ecological fallacy, up to the point where they talk about individual voting behaviour (i.e., pretty much everywhere in the article).

Nick then came back with more:

I found another problem, and it’s huge:

For “Election 50”, the Humidity and Dew Point data are completely borked (“relative humidity” values around 1000 instead of 0.6 etc; dew point 0.4–0.6 instead of a Fahrenheit temperature slightly below the measured temperature in the 50–60 range). When I remove that referendum from the results, I get the attached version of Figure 2. I can’t run their Stata models, but by my interpretation of the model coefficients from the R model that went into making Figure 2, the value for the windspeed * condition interaction goes from 0.545 (SE=0.120, p=0.000006) to 0.266 (SE=0.114, p=0.02).

So it seems to me that a very big part of the effect, for the Swiss results anyway, is being driven by this data error in the covariates.

And then he posted a blog with further details, along with a link to some other criticisms from Erik Gahner Larsen.

The big question

Why do junk science and sloppy data handling so often go together? We’ve seen this a lot, for example the ovulation-and-voting and ovulation-and-clothing papers that used the wrong dates for peak fertility, the Excel error paper in economics, the gremlins paper in environmental economics, the analysis of air pollution in China, the collected work of Brian Wansink, . . . .

What’s going on? My hypothesis is as follows. There are lots of dead ends in science, including some bad ideas and some good ideas that just don’t work out. What makes something junk science is not just that it’s studying an effect that’s too small to be detected with noisy data; it’s that the studies appear to succeed. It’s the misleading apparent success that turns a scientific dead end into junk science.

As we’ve been aware since the classic Simmons et al. paper from 2011, researchers can and do use researcher degrees of freedom to obtain apparent strong effects from data that could well be pure noise. This effort can be done on purpose (“p-hacking”) or without the researchers realizing it (“forking paths”), or through some mixture of the two.

The point is that, in this sort of junk science, it’s possible to get very impressive-looking results (such as Figure 2 in the above-linked article) from just about any data at all! What that means is that data quality doesn’t really matter.
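As a rough illustration (a generic simulation, not a reanalysis of this paper), here is what happens if you generate a pure-noise outcome, try several candidate predictors, and keep whichever gives the smallest p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k, sims = 50, 8, 2000   # 50 units, 8 candidate predictors, everything pure noise

best_p = []
for _ in range(sims):
    y = rng.normal(size=n)
    X = rng.normal(size=(n, k))
    # Test each candidate predictor against the outcome and keep only the most
    # "impressive" (smallest) p-value, mimicking a search over forking paths.
    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(k)]
    best_p.append(min(pvals))

# With 8 independent tries, roughly 1 - 0.95**8, about a third, of pure-noise
# datasets yield at least one p < 0.05 "finding."
print(np.mean(np.array(best_p) < 0.05))
```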

If you’re studying a real effect, then you want to be really careful with your data: any noise you introduce, whether in measurement or through coding error, can be expected to attenuate your effect, making it harder to discover. When you’re doing real science you have a strong motivation to take accurate measurements and keep your data clean. Errors can still creep in, sometimes destroying a study, so I’m not saying it can’t happen. I’m just saying that the motivation is to get your data right.

In contrast, if you’re doing junk science, the data are not so relevant. You’ll get strong results one way or another. Indeed, there’s an advantage to not looking too closely at your data at first; that way if you don’t find the result you want, you can go through and clean things up until you reach success. I’m not saying the authors of the above-linked paper did any of that sort of thing on purpose; rather, what I’m saying is that they have no particular incentive to check their data, so from that standpoint maybe we shouldn’t be so surprised to see gross errors.

Unifying Design-Based and Model-Based Sampling Inference (my talk this Wednesday morning at the Joint Statistical Meetings in Toronto)

Wed 9 Aug 10:30am:

Unifying Design-Based and Model-Based Sampling Inference

A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But it is not clear how to apply this advice when fitting regressions that include only some of the weighting information, nor does it tell us what to do when analyzing already-collected surveys where the weighting procedure has not been clearly explained or where the weights depend in part on information that is not available in the data. It is also not clear how one is supposed to account for clustering in such analyses. We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest.

No slides, but I whipped up a paper on the topic which you can read if you want to get a sense of the idea.
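If you just want the flavor of “using design information within a model-based context,” here is a crude toy sketch. To be clear, this is not the procedure in the paper: it simply treats the sampling weight as a predictor of the outcome (via weight bins) and then poststratifies over those bins, with the population share of each bin estimated from the weights themselves.

```python
import numpy as np
import pandas as pd

def weight_poststratified_mean(y, w, n_cells=10):
    """Toy estimator of a population mean: bin units by their sampling weight,
    summarize the outcome within each bin by its sample mean, and reweight the
    bins by estimated population shares (proportional to the summed weights)."""
    df = pd.DataFrame({"y": np.asarray(y), "w": np.asarray(w)})
    df["cell"] = pd.qcut(df["w"], q=n_cells, duplicates="drop")
    cells = df.groupby("cell", observed=True).agg(ybar=("y", "mean"),
                                                  wsum=("w", "sum"))
    shares = cells["wsum"] / cells["wsum"].sum()   # estimated population shares
    return float((cells["ybar"] * shares).sum())
```

With one unit per cell this reduces to the classical weighted estimate; with fewer cells it smooths across units with similar weights, which is the general direction the abstract describes, though the paper’s actual approach is a joint regression plus poststratification, not this binning shortcut.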

What is a standard error?

I spoke at a session with the above title at the American Economic Association meeting a few months ago. It was organized by Serena Ng and Elie Tamer, and the other talks were given by Patrick Kline, James Powell, Jeffrey Wooldridge, and Bin Yu. In addition to speaking, Bin and I wrote short papers that will appear in the Journal of Econometrics. Here’s mine:

What is a standard error?

In statistics, the standard error has a clear technical definition: it is the estimated standard deviation of a parameter estimate. In practice, though, challenges arise when we go beyond the simple balls-in-urn model to consider generalizations beyond the population from which the data were sampled. This is important because generalization is nearly always the goal of quantitative studies. In this brief paper we consider three examples.

What is the standard error when the bias is unknown and changing (my bathroom scale)?

I recently bought a cheap bathroom scale. I took the scale home and zeroed it—there’s a little gear in front to turn. I tapped my foot on the scale, it went to -1 kg, I turned the gear a bit, then it went up to +2, then I turned a bit back to get it exactly to zero, and tapped again . . . it was back at -1. That was frustrating, but I still wanted to estimate my weight. So I got on and off the scale multiple times. The first few measurements were 66 kg, 65.5 kg, 68 kg, and 67 kg. A lot of variation! To get a good estimate in the presence of variation, it is recommended to take multiple measurements. So I did so. After 46 measurements, I got bored and stopped. The resulting measurements had mean 67.1 with standard deviation 0.7, hence a standard error of 0.7/sqrt(46) = 0.1.
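Just to spell out the arithmetic on those reported numbers:

```python
import math

n, sd, mean = 46, 0.7, 67.1
se = sd / math.sqrt(n)                      # 0.7 / 6.78 is about 0.10
ci = (mean - 1.96 * se, mean + 1.96 * se)   # the naive interval, about 67.1 +/- 0.2
print(round(se, 2), tuple(round(x, 1) for x in ci))
```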

Would I want to use the resulting 95% confidence interval, 67.1 +/- 0.2? Of course not! The whole scale is off by some unknown amount. What, then, to do? One approach would be to calibrate, either using a known object that weighs in the neighborhood of 67 kg or else my own weight measured on an accurate instrument. If that is not possible, then I would want a wider uncertainty interval to account for the uncertainty in the scale’s bias. The usual purpose of a standard error is to attach uncertainty to an estimate, and for that purpose, the usual standard error formula is inappropriate.

How do you interpret standard errors from a regression fit to the entire population (all 50 states)?

Sometimes we can all agree that if you have a whole population, your standard error is zero. This is basic finite population inference from survey sampling theory, if your goal is to estimate the population average or total. Consider a regression fit to data on all 50 states in the United States. This gives you an estimate and a standard error. Maybe the estimated coefficient of interest is only one standard error from zero, so it’s not “statistically significant.” But what does that mean, if you have the whole population? You might say that the standard error doesn’t matter, but the internal variation in the data still seems relevant, no?

One way to think about this is to imagine the regression being used for prediction. For example, you have all 50 states, but you might use the model to understand these states in a different year. So you can think of the data you have from the 50 states as being a sample from a larger population of state-years. It’s not a random or representative sample, though, in that it’s data from just one year. So to get the right uncertainty you’ll need to use a multilevel model or clustered standard errors. With data from only one cluster, some external assumptions will be needed to compute the standard error. Alternatively, one could just use the standard error that pops out of the regression, which would correspond to an implicit model of equal variation between and within years. Just because you have an exhaustive sample, that does not mean that the standard error is undefined or meaningless.

How should we account for nonsampling error when reporting uncertainties (election polls)?

In an analysis of state-level pre-election polls, we have found the standard deviation of empirical errors—the difference between the poll estimate and the election outcome—to be about twice as large as would be expected from the reported standard errors of the individual surveys. This sort of nonsampling error is usual in polling; what is special about election forecasting is that here we can observe the outcome and thus measure the total error directly. The question then arises: what standard error should a pollster report? The usual formula based on sampling balls from an urn (with some correction for weighting or survey adjustment) gives an internal measure of uncertainty but does not address the forecasting question. It would seem better to augment the standard error based on past levels of nonsampling error, but then the question arises of what to do in other sampling settings where no past calibration is available. In election polling we have some sense of that extra uncertainty; it seems wrong to implicitly set it to zero when we don’t know what to do about it.
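One simple way to operationalize this kind of augmentation (a sketch, not a prescription from the paper) is to add a nonsampling variance component sized so that the total error matches the empirical factor of roughly two:

```python
import math

reported_se = 1.5   # e.g., a poll reporting a 1.5-percentage-point standard error

# If total error runs about twice the reported sampling SE, the implied
# nonsampling component is sqrt((2*se)**2 - se**2) = sqrt(3) * se.
nonsampling_sd = math.sqrt(3) * reported_se
total_se = math.sqrt(reported_se**2 + nonsampling_sd**2)
print(round(nonsampling_sd, 2), round(total_se, 2))   # about 2.6 and 3.0
```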

How, then, should we interpret the standard error from textbook formulas or when fitting a regression? We can think of this as a lower-bound standard error or, more precisely, as a measure of variation corresponding to a particular model.

Summary

The appropriate standard error depends not just on the data and sampling model but also on the generalization of interest, and the model of variation across units and over time corresponding to the uses to which the estimate will be put. Deciding on a generalization of interest in a sampling or regression problem is similar to the problem of focusing on a particular average treatment effect in causal inference: thinking seriously about your replications (for the goal of getting the right standard error) and inferential goals, you might well get a better understanding of what you’re trying to do with your model.

Mundane corrections to the dishonesty literature

There is a good deal of coverage of the more shocking reasons that papers on the psychology of dishonesty by Dan Ariely and Francesca Gino need to be corrected or retracted. I thought I’d share a more mundane example — in this same literature, and in fact in the very same series of papers.

There is no allegation of further fraud here, and the errors are mundane, but maybe this is relevant to challenges in correcting the scientific record, etc.

Back in August 2021, Data Colada published the initial evidence of fraud in the field experiment in Shu, Mazar, Gino, Ariely & Bazerman (2012). They were able to do this because Kristal, Whillans, Bazerman, Gino, Shu, Mazar & Ariely (2020), which primarily reported failures to replicate the original lab experimental results, also reported some problems with the field experimental data (covariate imbalance inconsistent with randomization) and shared the spreadsheet with this data.

So I clicked through to the newer (2020) paper to check out the results. I came across this paragraph, reporting the main results from the preregistered direct replication (Study 6):

We failed to detect an effect of signing first on all three preregistered outcomes (percent of people cheating per condition, t[1,232.8] = −1.50, P = 0.8942, d = −0.07 95% confidence interval [CI] [−1.96, 0.976]; amount of cheating per condition, t[1,229.3] = −0.717, P = 0.7633, d = −0.04 95% CI[−1.96, 0.976]; and amount of expenses reported, t[1,208.9] = −1.099, P = 0.864, d = −0.06 95% CI[−1.96, 0.976]). The Bayes factors for these three outcome measures were between 7.7 and 12.5, revealing substantial support for the null hypothesis (6). This laboratory experiment provides the strongest evidence to date that signing first does not encourage honest reporting.

A couple things jumped out here. First, this text says the point estimate for the effect of signing at the top on amount of cheating is d = −0.04, but Figure 1 in the paper says it is d = 0.04:


Figure 1 of Kristal et al. (2020), where Study 6 is the pre-registered direct replication. [Update: This apparently has the wrong sign for all of the estimates here.]

So somehow the sign got switched somewhere.

Second, if you look at that paragraph again, there are some unusual things going on with the confidence intervals. They are all the same and aren’t really on the right scale or centered anywhere near the point estimates. In fact, it seems like a critical value (which would be ±1.96 for a z-test) and a cumulative fraction (which would be .025 and .975) got accidentally reported as the lower and upper ends of the 95% intervals. I imagine this could happen if doing these calculations in a spreadsheet.
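For contrast, the usual calculation is the estimate plus or minus 1.96 standard errors. Here is that calculation with the reported d and a made-up standard error (the quoted text does not report the actual SE; the value below is just roughly what a sample of about 1,235 would give):

```python
d = -0.04     # reported standardized effect
se_d = 0.057  # hypothetical standard error, roughly consistent with n of about 1,235

ci = (d - 1.96 * se_d, d + 1.96 * se_d)
print(tuple(round(x, 2) for x in ci))   # about (-0.15, 0.07), centered on d,
                                        # nothing like [-1.96, 0.976]
```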

So in August 2021 I emailed the first author and Francesco Gino to report that something was wrong here, concluding by saying: “Seems like this is just a reporting error, but I can imagine this might create even more confusion if not corrected.”

Professor Gino thanked me for bringing this to their attention. I followed up in October 2021 to provide more detail about my concerns about the CIs and ask:

This line of work came up the other day, and this prompted me to check on this and noticed there hasn’t been a correction issued, at least that I saw. Is that in the works?

First author Ariella Kristal helpfully responded immediately with [see update below] their understanding of the errors at the time (that the correct point estimate is positive, d = 0.04), and said a correction had not yet been submitted but they were “hoping to issue the correction ASAP”. OK, these things can take a little time — obviously important to make these corrections with care!

But still I was a bit disappointed when, in February 2022, I noticed that there was not yet any correction to the paper. So I emailed the editorial team at PPNAS, where this paper was published, writing in part:

I notified the authors of these problems in August.

I’m wondering if there is any progress on getting this article corrected? Have the authors requested it be corrected? (Their earlier response to me was somewhat ambiguous about whether PNAS had been contacted by them yet.)

I’m a bit surprised nothing visible has happened despite the passage of six months.

Staff confirmed then that a correction had been requested in October, but that the matter was still under review. (In retrospect, I can now wonder whether perhaps by this point this had become tied up in broader concerns about papers by Gino.)

In September 2022, with over a year passed since my initial email to the authors, I thought I should at least post a comment on PubPeer, so other readers might find some documentation of this issue.

As of writing this post, there is still no public notice of any existing or pending correction to “Signing at the beginning versus at the end does not decrease dishonesty”.

Of course, maybe this doesn’t really matter so much. The main result of the paper really is still a null result, and nothing key turns on whether the point estimate is 0.04 or –0.04 (I had thought it was the former, but gather now that it is the latter). And there is open data for this paper, so anyone who really wants to dig into that could figure out what the correct calculation is.

But maybe it is worth reflecting on just how slowly this is being corrected. I don’t know whether any of my emails after the first helped move this along, so maybe anything beyond the first email, which was easy for me to write, did nothing. Perhaps my lesson here should be to post publicly (e.g. on PubPeer) with less of a delay.

Update: After posting the above, Ariella Kristal, the first author of this study, contacted me to share this document, which details the corrections to the paper. As a result, I’ve edited my statements above about what the correct numbers are, as the correct value is apparently d = –0.04 after all. She also emphasized that she contacted the journal about this matter several times as well.

[This post is by Dean Eckles.]

Fully Bayesian computing: Don’t collapse the wavefunction until it’s absolutely necessary.

Kevin Gray writes:

In marketing research, it’s common practice to use averages of MCMC draws in Bayesian hierarchical models as estimates of individual consumer preferences.

For example, we might conduct choice modeling among 1,500 consumers and analyze the data with an HB multinomial logit model. The means or medians of the (say) 15,000 draws for each respondent are then used as parameter estimates for each respondent. In other words, by averaging the draws for each respondent we obtain an individual-level equation for each respondent and individual-level utilities.

Recently, there has been criticism of this practice by some marketing science people. For example, we can compare predictions of individuals or groups of individuals (e.g., men versus women), but not the parameters of these individuals or groups to identify differences in their preferences.

This is highly relevant because since the late 90s it has been common practice in marketing research to use these individual-level “utilities” to compare preferences (i.e., relative importance of attributes) of pre-defined groups or to cluster on the utilities with K-means (for example).

I’m not an authority on Bayes of course, but have not heard of this practice outside of marketing research, and have long been concerned. Marketing research is not terribly rigorous…

This all seems very standard to me and is implied by basic simulation summaries, as described for example in chapter 1 of Bayesian Data Analysis. Regarding people’s concerns: yeah, you shouldn’t first summarize simulations over people and then compare people. What you should do is compute any quantity of interest—for example, a comparison of groups of people—separately for each simulation draw, and then only at the end should you average over the simulations.
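Here is a minimal sketch of that workflow with hypothetical draws; the grouping variable and array sizes are made up (and smaller than the 15,000 draws in the example above, just to keep the sketch light):

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_resp = 4000, 1500
draws = rng.normal(size=(n_draws, n_resp))   # hypothetical posterior draws of one
                                             # utility for each respondent
is_female = rng.integers(0, 2, size=n_resp).astype(bool)   # made-up grouping

# Collapsing first: per-respondent point estimates, then a group comparison.
point_est = draws.mean(axis=0)
diff_collapsed = point_est[is_female].mean() - point_est[~is_female].mean()

# Fully Bayesian: compute the group comparison within each draw, then summarize.
diff_per_draw = draws[:, is_female].mean(axis=1) - draws[:, ~is_female].mean(axis=1)
print(diff_collapsed, diff_per_draw.mean(), np.percentile(diff_per_draw, [2.5, 97.5]))
```

For a linear comparison like this the two point estimates coincide, but only the per-draw version carries posterior uncertainty for the comparison, and for nonlinear summaries (clustering, rankings, “which respondent is highest”) the collapsed version can be badly misleading.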

Sometimes we say: Don’t prematurely collapse the wave function.

This is also related to the idea of probabilistic programming or, as Jouni and I called it, fully Bayesian computing. Here’s our article from 2004.

Statistical mistake in published paper is fixed. Correction is issued. Original paper still sitting online with the wrong findings. Press release still sitting online with the wrong findings. Elsevier!

David Allison points to this post:

Often indirectly, but sometimes directly, we hear from true believers in concepts attached to obesity, nutrition, and public policy. The embedded question is “Why do you doubt this article of faith?” Among the many articles of faith in this realm is the belief that if we deliver just the right education or just the right nudges to people (or children) at risk for obesity, we will finally “tackle” the problem. It’s understandable. In fact, we see no problem with bringing strong beliefs to the subject of obesity, so long as we bring along even stronger analyses to test them.

An excellent case in point appears in the February issue of the Journal of Nutrition Education and Behavior. The initial analysis of a study of cooking programs found support for its effectiveness with children. But a stronger analysis showed that the effect was nil.

The premise of this study was straightforward. Test the presumption that exposing children to healthy foods in a cooking program could nudge them toward more often choosing healthy foods afterward. Frans Folkvord, Doeschka Anschütz, and Marieke Geurts set up this experiment with 125 children between the ages of 10 and 12. They randomized the children for exposure to ten-minute video clips from a cooking program. The experimental group saw a clip that emphasized only healthy foods, but the control group viewed a clip that emphasized unhealthy foods. After those clips, children in both groups chose between healthy and unhealthy snacks as a reward for participating.

Initially, Folkvord et al reported positive results:

“These findings indicated a priming effect of the foods the children were exposed to, showing that nutrition education guided by reactivity theory can be promising.”

Unfortunately this initial analysis didn’t hold up under a closer look. . . . the initial finding was reversed and a corrigendum has been published . . .

Happy ending!

Just one thing. I clicked on the original article (there’s a link at the end of the above-linked post), and it still has the unsupported conclusions:

Results

Children who watched the cooking program with healthy foods had a higher probability of selecting healthy food than children who watched the cooking program with unhealthy foods (P = .027), or with the control condition (P = .039).

Conclusions and Implications

These findings indicated a priming effect of the foods the children were exposed to, showing that nutrition education guided by reactivity theory can be promising. Cooking programs may affect the food choices of children and could be an effective method in combination with other methods to improve their dietary intake.

Ummmmm, no! From the correction:

Results

Children who watched the cooking program with healthy foods did not have a higher probability to select the healthy food than children who watched the cooking program with unhealthy foods, or who were in the control condition.

Conclusions and Implications

These findings do not directly indicate a priming effect of the foods children were exposed to. The current study did not show that cooking programs affect food choices of children, although other studies showed that cooking programs could be an effective method in combination with other methods to improve children’s dietary intake.

The authors regret these errors.

The Journal of Nutrition Education and Behavior should put this correction on the main page of the original article.

As it is, if you go to the original article, you’ll see the bad abstract and then you have to scroll all the way down past all the references to get to a section called Linked Article where there’s a link to the Corrigendum. Or you can click on Linked Article on the main page. Either way, it’s not obvious, and there’s no good reason to keep the original abstract as the first thing that anyone sees.

This is not the fault of the authors, who were admirably open in the whole process. Everyone makes mistakes. It’s the publisher that should fix this. Also this completely uncorrected press release is still on the internet. Not a good look, Elsevier.

“They got a result they liked, and didn’t want to think about the data.” (A fish story related to Cannery Row)

John “Jaws” Williams writes:

Here is something about a century-old study that you may find interesting, and could file under “everything old is new again.”

In 1919, the California Division of Fish and Game began studying the developing sardine fishery in Monterey. Ten years later, W. L. Scofield published an amazingly thorough description of the fishery, the abstract of which begins as follows:

The object of this bulletin is to put on record a description of the Monterey sardine fishery which can be used as a basis for judging future changes in the conduct of this industry. Detailed knowledge of changes is essential to an understanding of the significance of total catch figures, or of records of catch per boat or per seine haul. It is particularly necessary when applying any form of catch analysis to a fishery as a means of illustrating the presence or absence of depletion or of natural fluctuations in supply.

As detailed in this and subsequent reports, the catch was initially limited by the market and the capacity of the fishing fleet, both of which grew rapidly for several decades and provided the background for John Steinbeck’s “Cannery Row.” Later, the sardine population famously collapsed, and never recovered.

Sure enough, just as Scofield feared, scientists who did not understand the data subsequently misused it as reflecting the sardine population, as I pointed out in this letter (which got the usual kind of response). They got a result they liked, and didn’t want to think about the data.

The Division of Fisheries was not the only agency to publish detailed descriptive reports. The USGS and other agencies did as well, but generally they have gone out of style; they take a lot of time and field work, are expensive to publish, and don’t get the authors much credit.

This comes to mind because I am working on a paper about a debris flood on a stream in one of the University of California’s natural reserves, and the length limits for the relevant print journals don’t allow for a reasonable description of the event and a discussion of what it means. However, now I can write a separate and more complete description, and have it go as on-line supplementary material. There is some progress.

Studying average associations between income and survey responses on happiness: Be careful about deterministic and causal interpretations that are not supported by these data.

Jonathan Falk writes:

This is an interesting story of heterogeneity of response, and an interesting story of “adversarial collaboration,” and an interesting PNAS piece. I need to read it again later this weekend, though, to see if the stats make sense.

The article in question, by Matthew Killingsworth, Daniel Kahneman, and Barbara Mellers, is called “Income and emotional well-being: A conflict resolved,” and it begins:

Do larger incomes make people happier? Two authors of the present paper have published contradictory answers. Using dichotomous questions about the preceding day, Kahneman and Deaton reported a flattening pattern: happiness increased steadily with log(income) up to a threshold and then plateaued. Using experience sampling with a continuous scale, Killingsworth reported a linear-log pattern in which average happiness rose consistently with log(income). We engaged in an adversarial collaboration to search for a coherent interpretation of both studies. A reanalysis of Killingsworth’s experienced sampling data confirmed the flattening pattern only for the least happy people. Happiness increases steadily with log(income) among happier people, and even accelerates in the happiest group. Complementary nonlinearities contribute to the overall linear-log relationship. . . .

I agree with Falk that the collaboration and evaluation of past published work is great, and I’m happy with the discussion, which is focused so strongly on data and measurement and how they map to conclusions. I don’t know why they call it “adversarial collaboration,” as I don’t see anything adversarial here. That’s a good thing! I’m glad they’re cooperating. Maybe they could just call it “collaboration from multiple perspectives” or something like that.

On the substance, I think the article has three main problems, all of which are exhibited by its very first line:

Do larger incomes make people happier?

Three problems here:

1. Determinism. The question, “Do larger incomes make people happier?”, does not admit variation. Larger incomes are gonna make some people happier in some settings.

2. Causal attribution. If I’m understanding correctly, the data being analyzed are cross-sectional; to put it colloquially, they’re looking at correlation, not causation.

3. Framing in terms of a null hypothesis. Neither of the two articles that motivated this work suggested a zero pattern.

Putting these together, the question, “Do larger incomes make people happier?”, would be more accurately written as, “How much happier are people with high incomes, compared to people with moderate incomes?”

Picky, Picky

You might say that I’m just being picky here; when they ask, “Do larger incomes make people happier?”, everybody knows they’re really talking about averages (not about “people” in general), that they’re talking about association (not about anything “making people happier”), and that they’re doing measurement, not answering a yes-or-no question.

And, sure, I’m a statistician. Being picky is my business. Guilty as charged.

But . . . I think my points 1, 2, 3 are relevant to the underlying questions of interest, and dismissing them as being picky would be a mistake.

Here’s why I say this.

First, the determinism and the null-hypothesis framing lead to a framing of the question as, “Can money buy happiness?” We already know that money can buy some happiness, some of the time. The question, “Are richer people happier, on average?”, that’s not the same, and I think it’s a mistake to confuse one with the other.

Second, the sloppiness about causality ends up avoiding some important issues. Start with the question, “Do larger incomes make people happier?” There are many ways to have larger incomes, and these can have different effects.

One way to see this is to flip the question around and ask, “Do smaller incomes make people unhappier?” The funny thing is, based on Kahneman’s earlier work on loss aversion, he’d probably say an emphatic Yes to that question. But we can also see that there are different ways to have a smaller income. You might choose to retire—or be forced to do so. You might get fired. Or you might take time off from work to take care of young children. Or maybe you’re just getting pulled by the tides of the national economy. All sorts of possibilities.

A common thread here is that it’s not necessarily the income causing the mood change; it’s that the change in income is happening along with other major events that can affect your mood. Indeed, it’s hard to imagine a big change in income that’s not associated with other big changes in your life.

Again, nothing wrong with looking at average associations of income and survey responses about happiness and life satisfaction. These average associations are interesting in their own right; no need to try to give them causal interpretations that they cannot bear.

Again, I like a lot of the above-linked paper. Within the context of the question, “How much happier are people with high incomes, compared to people with moderate incomes?”, they’re doing a clean, careful analysis, kinda like what my colleagues and I tried to do when reconciling different evaluations of the Millennium Villages Project, or as I tried to do when tracking down an iffy claim in political science. Starting with a discrepancy, getting into the details and figuring out what was going on, then stepping back and considering the larger implications: that’s what it’s all about.

In the real world people have goals and beliefs. In a controlled experiment, you have to endow them

This is Jessica. A couple weeks ago I posted on the lack of standardization in how people design experiments to study judgment and decision making, especially in applied areas of research like visualization, human-centered AI, privacy and security, NLP, etc. My recommendation was that researchers should be able to define the decision problems they are studying in terms of the uncertain state on which the decision or belief report in each trial is based, the action space defining the range of allowable responses, the scoring rule used to incentivize and/or evaluate the reports, and the process that generates the signals (i.e., stimuli) that inform on the state. And that not being able to define these things points to limitations in our ability to interpret the results we get.

I am still thinking about this topic, and why I feel strongly that when the participant isn’t given a clear goal to aim for in responding, i.e., one that is aligned with the reward they get on the task, it is hard to interpret the results. 

It’s fair to say that when we interpret the results of experiments involving human behavior, we tend to be optimistic about how what we observe in the experiment relates to people’s behavior in the “real world.” The default assumption is that the experiment results can help us understand how people behave in some realistic setting that the experimental task is meant to proxy for. There sometimes seems to be a divide among researchers, between a) those who believe that judgment and decision tasks studied in controlled experiments can be loosely based on real world tasks without worrying about things being well-defined in the context of the experiment and b) those who think that the experiment should provide (and communicate to participants) some unambiguously defined way to distinguish “correct” or at least “better” responses, even if we can’t necessarily show that this understanding matches some standard we expect to operate in the real world.

From what I see, there are more researchers running controlled studies in applied fields that are in the former camp, whereas the latter perspective is more standard in behavioral economics. Those in applied fields appear to think it’s ok to put people in a situation where they are presented with some choice or asked to report their beliefs about something but without spelling out to them exactly how what they report will be evaluated or how their payment for doing the experiment will be affected. And I will admit I too have run studies that use under-defined tasks in the past. 

Here are some reasons I’ve heard for not using a well-defined task in a study:

People won’t behave differently if I do that. People will sometimes cite evidence that behavior in experiments doesn’t seem very responsive to incentive schemes, extrapolating from this that giving people clear instructions on how they should think about their goals in responding (i.e., what constitutes good versus bad judgments or decisions) will not make a difference. So it’s perceived as valid to just present some stuff (treatments) and pose some questions and compare how people respond.

The real world version of this task is not well-defined. Imagine studying how people use dashboards giving information about a public health crisis, or election forecasts. Someone might argue that there is no single common decision or outcome to be predicted in the real world when people use such information, and even if we choose some decision like ‘should I wear a mask’ there is no clear single utility function, so it’s ok not to tell participants how their responses will be evaluated in the experiment. 

Having to understand a scoring rule will confuse people. Relatedly, people worry that constructing a task where there is some best response will require explaining complicated incentives to study participants. They might get confused, which will interfere with their “natural” judgment processes in this kind of situation. 

I do not find these reasons very satisfying. The problem is how to interpret the elicited responses. Sure, it may be true that in some situations, participants in experiments will act more or less the same when you put some display of information on X in front of them and say “make this decision based on what you know about X” and when you display the same information and ask the same thing but you also explain exactly how you will judge the quality of their decision. But – I don’t think it matters if they act the same. There is still a difference: in the latter case where you’ve defined what a good versus bad judgment or decision is, you know that the participants know (or at least that you’ve attempted to tell them) what their goal is when responding. And ideally you’ve given them a reason to try to achieve that goal (incentives). So you can interpret their responses as their attempt at fulfilling that goal given the information they had at hand. In terms of the loss you observe in responses relative to the best possible performance, you still can’t disambiguate the effect of their not understanding the instructions from their inability to perform well on the task despite understanding it. But you can safely consider the loss you observe as reflecting an inability to do that task (in the context of the experiment) properly. (Of course, if your scoring rule isn’t proper then you shouldn’t expect them to be truthful under perfect understanding of the task. But the point is that we can be fairly specific about the unknowns.)
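To make the scoring-rule point concrete, here is a minimal sketch (a toy illustration of my own, not taken from any particular study) of a well-defined belief-reporting task: the participant reports a probability q for a binary event and is paid according to a proper scoring rule, here the quadratic (Brier) score, so that expected payoff is maximized by reporting the true belief p.

import numpy as np

def brier_payoff(q, y):
    # Payoff for reporting probability q when the binary outcome is y in {0, 1}
    return 1.0 - (y - q) ** 2

p = 0.7                                  # participant's true belief
q_grid = np.linspace(0.01, 0.99, 99)     # candidate reports
expected = p * brier_payoff(q_grid, 1) + (1 - p) * brier_payoff(q_grid, 0)
best_q = q_grid[np.argmax(expected)]
print(f"expected payoff is maximized at q = {best_q:.2f} (true belief p = {p})")

Under this kind of setup, whatever loss you observe can be attributed to the task itself rather than to participants guessing at what the experimenter wanted.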

When you ask for some judgment or decision but don’t say anything about how that’s evaluated, you are building variation in how the participants interpret the task directly into your experiment design. You can’t say what their responses mean in any sort of normative sense, because you don’t know what scoring rule they had in mind. You can’t evaluate anything. 

Again this seems rather obvious, if you’re used to formulating statistical decision problems. But I encounter examples all around me that appear at odds with this perspective. I get the impression that it’s seen as a “subjective” decision for the researcher to make in fields like visualization or human-centered AI. I’ve heard studies that define tasks in a decision theoretic sense accused of “overcomplicating things.” But then when it’s time to interpret the results, the distinction is not acknowledged, and so researchers will engage in quasi-normative interpretation of responses to tasks that were never well defined to begin with.

This problem seems to stem from a failure to acknowledge the differences between behavior in the experimental world versus in the real world: We do experiments (almost always) to learn about human behavior in settings that we think are somehow related to real world settings. And in the real world, people have goals and prior beliefs. We might not be able to perceive what utility function each individual person is using, but we can assume that behavior is goal-directed in some way or another. Savage’s axioms and the derivation of expected utility theory tell us that for behavior to be “rationalizable”, a person’s choices should be consistent with their beliefs about the state and the payoffs they expect under different outcomes.

When people are in an experiment, the analogous real world goals and beliefs for that kind of task will not generally apply. For example, people might take actions in the real world for intrinsic value – e.g., I vote because I feel like I’m not a good citizen if I don’t vote. I consult the public health stats because I want to be perceived by others as informed. But it’s hard to motivate people to take actions based on intrinsic value in an experiment, unless the experiment is designed specifically to look at social behaviors like development of norms or to study how intrinsically motivated people appear to be to engage with certain content. So your experiment needs to give them a clear goal. Otherwise, they will make up a goal, and different people may do this in different ways. And so you should expect the data you get back to be a hot mess of heterogeneity. 

To be fair, the data you collect may well be a hot mess of heterogeneity anyway, because it’s hard to get people to interpret your instructions correctly. We have to be cautious interpreting the results of human-subjects experiments because there will usually be ambiguity about the participants’ understanding of the task. But at least with a well-defined task, we can point to a single source of uncertainty about our results. We can narrow down reasons for bad performance to either real challenges people face in doing that task or lack of understanding the instructions. When the task is not well-defined, the space of possible explanations of the results is huge. 

Another way of saying this is that we can only really learn things about behavior in the artificial world of the experiment. As much as we might want to equate it with some real world setting, extrapolating from the world of the controlled experiment to the real world will always be a leap of faith. So we better understand our experimental world. 

A challenge when you operate under this understanding is how to explain to people who have a more relaxed attitude about experiments why you don’t think that their results will be informative. One possible strategy is to tell people to try to see the task in their experiment from the perspective of an agent who is purely transactional or “rational”:

Imagine your experiment through the eyes of a purely transactional agent, whose every action is motivated by what external reward they perceive to be in it for them. (There are many such people in the world actually!) When a transactional agent does an experiment, they approach each question they are asked with their own question: How do I maximize my reward in answering this? When the task is well-defined and explained, they have no trouble figuring out what to do, and proceed with doing the experiment. 

However, when the transactional human reaches a question that they can’t determine how to maximize their reward on, because they haven’t been given enough information, they shut down. This is because they are (quite reasonably) unwilling to take a guess at what they should do when it hasn’t been made clear to them. 

But imagine that our experiment requires them to keep answering questions. How should we think about the responses they provide? 

We can imagine many strategies they might use to make up a response. Maybe they try to guess what you, as the experimenter, think is the right answer. Maybe they attempt to randomize. Maybe they can’t be bothered to think at all and they call in the nearest cat or three year old to act on their behalf. 

We could probably make this exercise more precise, but the point is that if you would not be comfortable interpreting the data you get under the above conditions, then you shouldn’t be comfortable interpreting the data you get from an experiment that uses an under-defined task.

It was an open secret for years and years that they were frauds, but nobody seemed to care.

I love this story so much I’m gonna tell it again:

I remember the Watergate thing happening when I was a kid, and I asked my dad, “So, when did you realize that Nixon was a crook?” My dad replied, “1946.” He wasn’t kidding. Nixon being an opportunistic liar was all out there from the very beginning of his career, and indeed this was much discussed in the press. Eventually just about everybody acknowledged it, but it took a while.

When I posted this before, I listed a few examples where some people were able to stay afloat for years and years after their lying or fraud or misrepresentation or impossible promises were apparent:

– Theranos: the famed blood-testing company faked a test in 2006, causing one of its chief executives to leave, but it wasn’t until 2018 that the whole thing went down. They stayed afloat for over a decade after the fraud.

– Pizzagate guy from Cornell: people had noticed major problems in his work, but he managed to dodge all criticism for several years before being caught.

– Some obscure Canadian biologist: The problems were first flagged in 2010, this dude continued doing suspicious things for over a decade, and it finally came out in 2022.

There’s also that Los Angeles tunnel that didn’t make sense back in 2018 and still makes no sense.

And here’s another one:

Effective Altruist Leaders Were Repeatedly Warned About Sam Bankman-Fried Years Before FTX Collapsed

Leaders of the Effective Altruism movement were repeatedly warned beginning in 2018 that Sam Bankman-Fried was unethical, duplicitous, and negligent . . . They apparently dismissed those warnings, sources say, before taking tens of millions of dollars from Bankman-Fried’s charitable fund for effective altruist causes. . . . When Alameda and Bankman-Fried’s cryptocurrency exchange FTX imploded in late 2022, these same effective altruist (EA) leaders professed outrage and ignorance.

“Think long-term. Act now.” What could possibly go wrong??

I don’t have any deep theories about this one. It’s just interesting to me as another example where the problems were clear to people in the know, but because of some combination of personal/political interests and restricted information flow, nothing happened for years.

Cross-validation FAQ

Here it is! It’s from Aki.

Aki linked to it last year in a post, “Moving cross-validation from a research idea to a routine step in Bayesian data analysis.” But I thought the FAQ deserved its own post. May it get a million views.

Here’s its current table of contents:

1 What is cross-validation?
1.1 Using cross-validation for a single model
1.2 Using cross-validation for many models
1.3 When not to use cross-validation?
2 Tutorial material on cross-validation
3 What are the parts of cross-validation?
4 How is cross-validation related to overfitting?
5 How to use cross-validation for model selection?
6 How to use cross-validation for model averaging?
7 When is cross-validation valid?
8 Can cross-validation be used for hierarchical / multilevel models?
9 Can cross-validation be used for time series?
10 Can cross-validation be used for spatial data?
11 Can other utility or loss functions be used than log predictive density?
12 What is the interpretation of ELPD / elpd_loo / elpd_diff?
13 Can cross-validation be used to compare different observation models / response distributions / likelihoods?

P.S. Also relevant is this discussion from the year before, “Rob Tibshirani, Yuling Yao, and Aki Vehtari on cross validation.”
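P.P.S. As a concrete illustration of the kind of quantity item 12 is about, here is a minimal sketch (plain NumPy rather than the loo package, and a toy maximum-likelihood version rather than the fully Bayesian workflow the FAQ covers) of estimating expected log predictive density by K-fold cross-validation for a simple Gaussian regression.

import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

def fit(x_tr, y_tr):
    # Ordinary least squares plus the maximum-likelihood residual sd
    X = np.column_stack([np.ones_like(x_tr), x_tr])
    beta, *_ = np.linalg.lstsq(X, y_tr, rcond=None)
    sigma = np.sqrt(np.mean((y_tr - X @ beta) ** 2))
    return beta, sigma

def log_pred_density(beta, sigma, x_te, y_te):
    mu = beta[0] + beta[1] * x_te
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (y_te - mu) ** 2 / (2 * sigma ** 2)

K = 10
folds = np.array_split(rng.permutation(n), K)
lpd = np.empty(n)
for test_idx in folds:
    train = np.setdiff1d(np.arange(n), test_idx)
    beta, sigma = fit(x[train], y[train])
    lpd[test_idx] = log_pred_density(beta, sigma, x[test_idx], y[test_idx])

print("elpd estimate:", lpd.sum())
print("approximate standard error:", np.sqrt(n * lpd.var()))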

We want to go beyond intent-to-treat analysis here, but we can’t. Why? Because of this: “Will the data collected for your study be made available to others?” “No”; “Would you like to offer context for your decision?” “–“. Millions of taxpayer dollars spent, and we don’t get to see the data.

Dale Lehman writes:

Let me be the first (or not) to ask you to blog about this just released NEJM study. Here are the study, supplementary appendix, and data sharing statement, and I’ve also included the editorial statement. The study is receiving wide media attention and is the continuation of a long-term trial that was reported on at a 10 year median follow-up. The current publication is for a 15 year median follow-up.

The overall picture is consistent with many other studies – prostate cancer is generally slow to develop and kills very few men. Intervention can have serious side effects and there is little evidence that it improves long-term survival, except (perhaps) in particular subgroups. Treatment and diagnosis has undergone considerable change in the past decade. The issue is of considerable interest to me – for statistical reasons as well as personal (since I have a prostate cancer diagnosis). Here are my concerns in brief:

This study once again brings up the issue of intention-to-treat vs actual treatment. The groups were randomized between active management (545 men), prostatectomy (553 men), and radiotherapy (545 men). The analysis was based on these groups, with deaths in the 3 groups of 17, 12, and 16 respectively. Figure 1 in the paper reveals that within the first year, 628 men were actually in the active surveillance group, and 488 in each of the other 2 groups: this is not surprising since many people resist the invasive treatment and possible side effects. I would consider those that chose different groups than the random assignment within the first year as the true effective group sizes. However, the paper does not provide data on the actual deaths for the people that switched between the random assignment and actual treatment within the first year. So, it is not possible to determine the actual death rates in the 3 groups.

The paper reports death rates of 3.1%, 2.2%, and 2.9% in the 3 groups. If we just change the denominators to the actual size of the 3 groups in the first year, the 3 death rates are 2.7%, 2.5%, and 3.3%, making intervention look even worse. If we assume that half of the deaths in the random prostatectomy and radiotherapy groups were among those that refused the initial treatment and opted for active surveillance, then the 3 death rates would be 4.9%, 1.2%, and 1.6% respectively, making active surveillance look rather risky. Of course, I think allocating half of the deaths in those groups in this manner is a fairly extreme assumption. Given the small numbers of deaths involved, the deviations from random assignment to actual treatment could matter.

The authors have the data to conduct both an intention to treat and actual treatment received comparison, but did not report this (and did not indicate that they did such a study). If they had reported details on the 45 total deaths, I could do that analysis myself, but they don’t provide that data. In fact, the data sharing statement (attached) is quite remarkable – will the data be provided? “No.” That really irks me. I don’t see that there is really any concern about privacy. Withholding the data serves to bolster the careers of the researchers and the prestige of the journal, but it doesn’t have to be that way. If the journal released the data publicly and it was carefully documented, both the authors and the journal could receive widespread recognition for their work. Instead, they (and much of the establishment) choose to rely on their analysis to bolster their reputations. But these days the analysis is the easy part, it is the data curation and quality that is hard. Once again, the incentives and rewards are at odds with what makes sense.

Another question that is not analyzed, but could be if the data were provided, is whether the time of randomization matters. The article (and the editorial) cites the improved monitoring as MRI images are increasingly used along with biopsies. Given this evolution, the relative performance of the 3 groups might be changing over time – but no analysis is provided based on the year in which a person entered the study.

One other thing that you’ve blogged about often. For me, the most interesting figure is Figure S1 that actually shows the 45 deaths for the 3 groups. Looking at it, I see a tendency for the deaths to occur earlier with active surveillance than either surgery or radiation. Of course, the p values suggest that this might just be random noise. Indeed it might be. But, as we often say, absence of evidence is not evidence of absence. The paper appears to overstate the findings, as does all the media reporting. Statements such as “Radical treatment resulted in a lower risk of disease progression than active monitoring but did not lower prostate cancer mortality” (page 10 of the article) amount to a finding of no effect rather than a failure to find a significant effect. Null hypothesis significance testing strikes again.
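Lehman’s arithmetic is easy to check. Here’s a quick sketch that just recomputes the three sets of death rates from the numbers in his letter:

# Deaths and group sizes as reported in the letter
deaths     = {"surveillance": 17, "prostatectomy": 12, "radiotherapy": 16}
randomized = {"surveillance": 545, "prostatectomy": 553, "radiotherapy": 545}
actual     = {"surveillance": 628, "prostatectomy": 488, "radiotherapy": 488}

for g in deaths:
    print(f"{g:14s} randomized denominator: {100 * deaths[g] / randomized[g]:.1f}%   "
          f"actual denominator: {100 * deaths[g] / actual[g]:.1f}%")

# Extreme scenario: half the deaths in the prostatectomy and radiotherapy arms
# were among men who actually opted for active surveillance
moved = deaths["prostatectomy"] // 2 + deaths["radiotherapy"] // 2
print(f"surveillance:  {100 * (deaths['surveillance'] + moved) / actual['surveillance']:.1f}%")  # 4.9%
print(f"prostatectomy: {100 * (deaths['prostatectomy'] // 2) / actual['prostatectomy']:.1f}%")   # 1.2%
print(f"radiotherapy:  {100 * (deaths['radiotherapy'] // 2) / actual['radiotherapy']:.1f}%")     # 1.6%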

Yeah, they should share the goddam data, which was collected using tons of taxpayer dollars.

Regarding the intent-to-treat thing: Yeah, this has come up before, and I’m not sure what to do; I just have the impression that our current standard approaches here have serious problems.

My short answer is that some modeling should be done. Yes, the resulting inferences will depend on the model, but that’s just the way things are; it’s the actual state of our knowledge. But that’s just cheap talk from me. I don’t have a model on offer here, I just think that’s the way to go: construct a probabilistic model for the joint distribution of all the variables (which treatment the patient chooses, along with the health outcome) conditional on patient characteristics, and go from there.
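Here’s a toy simulation sketch of the issue such a model would need to capture (invented numbers, not a model of this trial): treatment actually received depends on an unobserved frailty that also drives mortality, so the naive as-treated comparison is confounded, while the intent-to-treat comparison is clean but answers a different question.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
assigned = rng.integers(0, 2, size=n)   # 0 = surveillance, 1 = radical treatment
frailty = rng.normal(size=n)            # unobserved health status (higher = sicker)

# Sicker men assigned to radical treatment are more likely to refuse it and
# switch to surveillance; this switching mechanism is what needs to be modeled
refuse = (assigned == 1) & (rng.random(n) < 1 / (1 + np.exp(-frailty)))
received = np.where(refuse, 0, assigned)

# Mortality depends strongly on frailty and only weakly on treatment received
p_death = 1 / (1 + np.exp(-(-3.0 + 0.8 * frailty - 0.1 * received)))
death = rng.random(n) < p_death

itt = death[assigned == 1].mean() - death[assigned == 0].mean()
as_treated = death[received == 1].mean() - death[received == 0].mean()
print(f"intent-to-treat contrast:  {itt:+.4f}")
print(f"naive as-treated contrast: {as_treated:+.4f}  (confounded by who switches)")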

I agree with Lehman that the intent-to-treat analysis is not the main goal here. It’s fine to do that analysis but it’s not good to stop there, and it’s really not good to hide information that could be used to go further.

As Lehman puts it:

Intent-to-treat analysis makes sense from a public health point of view if it closely reflects the actual medical practice. But from a patient point of view of making a decision regarding treatment, the actual treatment is more meaningful than intent-to-treat. So, when the two estimates differ considerably, it seems to me that they should both be reported – or, at least, the data should be provided that would allow both analyses to be done.

Also, the topic is relevant to me cos all of a sudden I need to go to the bathroom all the time. My doctor says my PSA is ok so I shouldn’t worry about cancer, but it’s annoying!

I told this to Lehman, who responded:

Unfortunately, the study in question makes PSA testing even less worthwhile than previously thought. (I get mine checked regularly and that is my only current monitoring, but it is not looking like that is worth much. Or should I say there is no statistically significant (p>.05) evidence that it means anything?)

Damn.

Causal inference and the aggregation of micro effects into macro effects: The effects of wages on employment

James Traina writes:

I’m an economist at the SF Fed. I’m writing to ask for your first thoughts or suggested references on a particular problem that’s pervasive in my field: Aggregation of micro effects into macro effects.

This is an issue that has been studied since the 80s. For example, individual-level estimates of the effect of wages on employment using quasi-experimental tax variation are much smaller than aggregate-level estimates using time-series variation. More recently, there has been an active debate on how to port individual-level estimates of the effect of government transfers on consumption to macro policy.

Given your expertise, I was wondering if you had insight into how you or other folks in the stats / causal inference field would approach this problem structure more generally.

My reply: Here’s a paper from 2006, Multilevel (hierarchical) modeling: What it can and cannot do. The short answer is that you can estimate micro and macro effects in the same model, but you don’t necessarily have causal identification at both levels. It depends on the design.

You’ll also want a theoretical model. For example, in your model, if you want to talk about “the effects of wages,” it can help to consider potential interventions that could affect local wages. Such an intervention could be a minimum-wage law, or inflation that reduces real (not nominal) wages, or national economic conditions that make the labor market more or less competitive, etc. You can also think about potential interventions at an individual level, such as a person getting education or training, marrying or having a child, the person’s employer changing its policies, whatever.

I don’t know enough about your application to give more detail. The point is that “wages” is not in itself a treatment. Wages is a measured variable, and different wage-affecting treatments can have different effects on employment. You can think of these as instruments, even if you’re not actually doing an instrumental variables analysis. Also, treatments that affect individual wages will be different from treatments that affect aggregate wages, so it’s no surprise that they would have different effects on employment. There’s no strong theoretical reason to think that effects would be the same.
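As a toy illustration of why micro and macro estimates need not agree (my own made-up simulation, not Traina’s setting or anything from the linked paper), here the within-market association of wages and hours has one sign while the between-market association has the other:

import numpy as np

rng = np.random.default_rng(2)
J, n_per = 50, 200                          # 50 labor markets, 200 workers each
market = np.repeat(np.arange(J), n_per)     # market index for each worker
market_wage = rng.normal(size=J)            # market-level average log wage

within_slope, between_slope = 0.5, -1.0     # true micro and macro associations
wage_dev = rng.normal(size=J * n_per)       # worker-level wage deviation
wages = market_wage[market] + wage_dev
hours = (between_slope * market_wage[market] + within_slope * wage_dev
         + rng.normal(size=J * n_per))

# Micro: regress demeaned hours on demeaned wages within markets
mw = np.array([wages[market == j].mean() for j in range(J)])
mh = np.array([hours[market == j].mean() for j in range(J)])
b_within = np.polyfit(wages - mw[market], hours - mh[market], 1)[0]
# Macro: regress market-mean hours on market-mean wages
b_between = np.polyfit(mw, mh, 1)[0]
print(f"micro (within-market) slope:  {b_within:+.2f}")   # roughly +0.5
print(f"macro (between-market) slope: {b_between:+.2f}")  # roughly -1.0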

Finally, I don’t understand how government transfers connect to wages in your problem. Government transfers do not directly affect wages, do they? So I feel like I’m missing some context here.

New research on social media during the 2020 election, and my predictions

Back in 2020, leading academics and researchers at the company now known as Meta put together a large project to study social media and the 2020 US elections — particularly the roles of Instagram and Facebook. As Sinan Aral and I had written about how many paths for understanding effects of social media in elections could require new interventions and/or platform cooperation, this seemed like an important development. Originally the idea was for this work to be published in 2021, but there have been some delays, including simply because some of the data collection was extended as what one might call “election-related events” continued beyond November and into 2021. As of 2pm Eastern today, the news embargo for this work has been lifted on the first group of research papers.

I had heard about this project a long time ago and, frankly, had largely forgotten about it. But this past Saturday, I was participating in the SSRC Workshop on the Economics of Social Media and one session was dedicated to results-free presentations about this project, including the setup of the institutions involved and the design of the research. The organizers informally polled us with qualitative questions about some of the results. This intrigued me. I had recently reviewed an unrelated paper that included survey data from experts and laypeople about their expectations about the effects estimated in a field experiment, and I thought this data was helpful for contextualizing what “we” learned from that study.

So I thought it might be useful, at least for myself, to spend some time eliciting my own expectations about the quantities I understood would be reported in these papers. I’ve mainly kept up with the academic and  grey literature, I’d previously worked in the industry, and I’d reviewed some of this for my Senate testimony back in 2021. Along the way, I tried to articulate where my expectations and remaining uncertainty were coming from. I composed many of my thoughts on my phone Monday while taking the subway to and from the storage unit I was revisiting and then emptying in Brooklyn. I got a few comments from Solomon Messing and Tom Cunningham, and then uploaded my notes to OSF and posted a cheeky tweet.

Since then, starting yesterday, I’ve spoken with journalists and gotten to view the main text of papers for two of the randomized interventions for which I made predictions. These evaluated effects of (a) switching Facebook and Instagram users to a (reverse) chronological feed, (b) removing “reshares” from Facebook users’ feeds, and (c) downranking content by “like-minded” users, Pages, and Groups.

My guesses

My main expectations for those three interventions could be summed up as follows. These interventions, especially chronological ranking, would each reduce engagement with Facebook or Instagram. This makes sense if you think the status quo is somewhat-well optimized for showing engaging and relevant content. So some of the rest of the effects — on, e.g., polarization, news knowledge, and voter turnout — could be partially inferred from that decrease in use. This would point to reductions in news knowledge, issue polarization (or coherence/consistency), and small decreases in turnout, especially for chronological ranking. This is because people get some hard news and political commentary they wouldn’t have otherwise from social media. These reduced-engagement-driven effects should be weakest for the “soft” intervention of downranking some sources, since content predicted to be particularly relevant will still make it into users’ feeds.

Besides just reducing Facebook use (and everything that goes with that), I also expected swapping out feed ranking for reverse chron would expose users to more content from non-friends via, e.g., Groups, including large increases in untrustworthy content that would normally rank poorly. I expected some of the same would happen from removing reshares, which I expected would make up over 20% of views under the status quo, and so would be filled in by more Groups content. For downranking sources with the same estimated ideology, I expected this would reduce exposure to political content, as many of the non-same-ideology posts will be from sources with estimated ideology in the middle of the range, i.e. [0.4, 0.6], which are less likely to be posting politics and hard news. I’ll also note that much of my uncertainty about how chronological ranking would perform was because there were a lot of unknown but important “details” about implementation, such as exactly how much of the ranking system really gets turned off (e.g., how much likely spam/scam content still gets filtered out in an early stage?).

How’d I do?

Here’s a quick summary of my guesses and the results in these three papers:

[Table of predictions about effects of feed interventions and the results]

It looks like I was wrong in that the reductions in engagement were larger than I predicted: e.g., chronological ranking reduced time spent on Facebook by 21%, rather than the 8% I guessed, which was based on my background knowledge, a leaked report on a Facebook experiment, and this published experiment from Twitter.

Ex post I hypothesize that this is because the duration of these experiments allowed for continual declines in use over months, with various feedback loops (e.g., users with chronological feed log in less, so they post less, so they get fewer likes and comments, so they log in even less and post even less). As I dig into the 100s of pages of supplementary materials, I’ll be looking to understand what these declines looked like at earlier points in the experiment, such as by election day.

My estimates for the survey-based outcomes of primary interest, such as polarization, were mainly covered by the 95% confidence intervals, with the exception of two outcomes from the “no reshares” intervention.

One thing is that all these papers report weighted estimates for a broader population of US users (population average treatment effects, PATEs), which are less precise than the unweighted (sample average treatment effect, SATE) results. Here I focus mainly on the unweighted results, as I did not know there was going to be any weighting and these are also the narrower, and thus riskier, CIs for me. (There seems to have been some mismatch between the outcomes listed in the talk I saw and what’s in the papers, so I didn’t make predictions for some reported primary outcomes and some outcomes I made predictions for don’t seem to be reported, or I haven’t found them in the supplements yet.)

Now is a good time to note that I basically predicted what psychologists armed with Jacob Cohen’s rules of thumb might, extrapolating downward, call “minuscule” effect sizes. All my predictions for survey-based outcomes were 0.02 standard deviations or smaller. (Recall Cohen’s rules of thumb say 0.2 is small, 0.5 medium, and 0.8 large.)

Nearly all the results for these outcomes in these two papers were indistinguishable from the null (p > 0.05), with standard errors for survey outcomes at 0.01 SDs or more. This is consistent with my ex ante expectations that the experiments would face severe power problems, at least for the kind of effects I would expect. Perhaps by revealed preference, a number of other experts had different priors.
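To put the power problem in numbers, here is a quick back-of-the-envelope sketch (an added illustration, not a calculation from the papers): with standard errors around 0.01 SD, the smallest true effect detectable with 80% power in a two-sided test at the 0.05 level is larger than the roughly 0.02 SD effects guessed above.

from scipy.stats import norm

se = 0.01                                    # approximate SE for survey outcomes, in SD units
mde = (norm.ppf(0.975) + norm.ppf(0.80)) * se
print(f"minimum detectable effect ~ {mde:.3f} SD")   # about 0.028 SD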

A rare p < 0.05 result is that chronological ranking reduced news knowledge by 0.035 SDs with 95% CI [-0.061, -0.008], which includes my guess of -0.02 SDs. Removing reshares may have reduced news knowledge even more than chronological ranking — and by more than I guessed.

Even with so many null results I was still sticking my neck out a bit compared with just guessing zero everywhere, since in some cases if I had put the opposite sign my estimate wouldn’t have been in the 95% CI. For example, downranking “like-minded” sources produced a CI of [-0.031, 0.013] SDs, which includes my guess of -0.02, but not its negation. On the other hand, I got some of these wrong, where I guessed removing reshares would reduce affective polarization, but a 0.02 SD reduction is outside the resulting [-0.005, +0.030] interval.

It was actually quite a bit of work to compare my predictions to the results because I didn’t really know a lot of key details about exact analyses and reporting choices, which strikingly even differ a bit across these three papers. So I might yet find more places where I can, with a lot of reading and a bit of arithmetic, figure out where else I may have been wrong. (Feel free to point these out.)

Further reflections

I hope that this helps to contextualize the present results with expert consensus — or at least my idiosyncratic expectations. I’ll likely write a bit more about these new papers and further work released as part of this project.

It was probably an oversight for me not to make any predictions about the observational paper looking at polarization in exposure and consumption of news media. I felt like I had a better handle on thinking about simple treatment effects than these measures, but perhaps that was all the more reason to make predictions. Furthermore, given the limited precision of the experiments’ estimates, perhaps it would have been more informative (and riskier) to make point predictions about these precisely estimated observational quantities.

[This post is by Dean Eckles. I want to note that I was an employee or contractor of Facebook (now Meta) from 2010 through 2017. I have received funding for other research from Meta, Meta has sponsored a conference I organize, and I have coauthored with Meta employees as recently as earlier this month. I was also recently a consultant to Twitter, ending shortly after the Musk acquisition. You can find all my disclosures here.]


He’s looking for student-participation activities for a survey sampling class

Art Owen writes:

It is my turn to teach our sampling course again. Last time I did that, the highlight was that students could design and conduct their own online survey using Google Consumer Surveys. Today I learned that Google has discontinued that platform and I don’t see any successor product.

Do you know anybody offering something similar that could be used in a classroom?

Traditional sampling theory is pretty dated now and not often useful for sampling opinions. I compress the best parts of it into the first half or so of the class. Then I talk about sampling from databases, sampling from wildlife populations, and online sampling. I’ve been tempted to add something about how every time you interact with certain businesses (hotels, ride shares, etc.) you get nagged for a survey response, either on a 5-point scale or on the 10-point net-promoter scale about recommending the product. Mainly I find those things annoying, though I should probably add something about how they are or should be used.

My reply: For several years I taught a sampling class at Columbia. In the class I’d always have to spend some time discussing basic statistics and regression modeling . . . and this always was the part of the class that students found the most interesting! So I eventually just started teaching statistics and regression modeling, which led to our Regression and Other Stories book.

Also our new book, Active Statistics: Stories, Games, Problems, and Hands-on Demonstrations for Applied Regression and Causal Inference, has lots of fun material on sampling, including many in-class activities.

Regarding surveys that students could do, I like your idea of sampling from databases, biological sampling, etc. You can point out to students that a “blood sample” is indeed a sample!

Owen responded:

Your blood example reminds me that there is a whole field (now very old) on bulk sampling. People sample from production runs, from cotton samples, from coal samples and so on. Widgets might get sampled from the beginning, middle and end of the run. David Cox wrote some papers on sampling to find the quality of cotton as measured by fiber length. The process is to draw a blue line across the sample and see the length of fibers that intersect the line. This gives you a length-biased sample that you can nicely de-bias. There’s also an interesting example out there about tree sampling, literally on a tree, where branches get sampled at random and fruit is counted. I’m not sure if it’s practical.

Last time I found an interesting example where people would sample ocean tracts to see if there was a whale. If they saw one, they would then sample more intensely in the neighboring tracts. Then the trick was to correct for the bias that brings. It’s in the Sampling book by S. K. Thompson. There are also good mark-recapture examples for wildlife.

I hesitate to put a lot of regression in a sampling class; it is all too easy for every class to start looking like a regression/prediction/machine learning class. We need room for the ideas about where and how data arise, and it’s too easy to crowd those out by dwelling on the modeling ideas.

I’ll probably toss in some space-filling sampling plans and other ways to downsize data sets as well.

The old style, from the classic book by Cochran, was: get an estimator, show it is unbiased, find an expression for its variance, find an estimate of that variance, show this estimate is unbiased and maybe even find and compare variances of several competing variance estimates. I get why he did it but it can get dry. I include some of that but I don’t let it dominate the course. Choices you can make and their costs are more interesting.

I understand the appeal of a sampling class that focuses on measurement, data collection, and inference issues specific to sampling. The challenge I’ve seen is getting enough students interested in taking such a class.
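P.S. Regarding Owen’s cotton example: the de-biasing he mentions is simple enough to show in a few lines. Here’s a small simulation sketch (a toy with a made-up gamma distribution of fiber lengths, not anything from Cox’s papers) in which fibers are sampled with probability proportional to length and the naive mean is corrected by inverse-length weighting.

import numpy as np

rng = np.random.default_rng(3)
fibers = rng.gamma(shape=2.0, scale=1.0, size=100_000)   # true fiber lengths, mean 2.0

# A fiber crosses the line with probability proportional to its length
idx = rng.choice(fibers.size, size=5_000, p=fibers / fibers.sum())
sample = fibers[idx]

naive = sample.mean()                          # length-biased: overestimates the mean
debiased = sample.size / np.sum(1.0 / sample)  # inverse-length (Horvitz-Thompson-style) weighting
print(f"true mean {fibers.mean():.2f}, naive {naive:.2f}, de-biased {debiased:.2f}")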

Slides on large language models for statisticians

I was invited by David Banks to give an introductory talk on large language models at a regional American Statistical Association meeting. Here are the slides:

Most usefully, the slides include complete pseudocode up to but not including multi-head attention. After the talk, I added a couple of slides on scaling laws and an annotated bibliography of the main papers (if you want to catch up), which I didn’t have time to get to before the talk, as well as a slide describing multi-head attention, but without pseudocode.
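For readers who want a concrete reference point, here is a minimal NumPy sketch of the scaled dot-product attention that a multi-head layer runs several times in parallel. This is not the pseudocode from the slides, just a bare-bones illustration (single head, no masking, no learned projection matrices).

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V have shape (sequence_length, d); returns an array of shape (sequence_length, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each query attends to each key
    return softmax(scores, axis=-1) @ V       # attention-weighted average of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # 5 tokens with 8-dimensional representations
print(attention(X, X, X).shape)               # (5, 8)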

P.S. The meeting was yesterday at Columbia and I hadn’t been to the stats department since the pandemic started, so it felt very strange.

P.P.S. GPT-4 helped me generate the LaTeX Tikz code to the point where I did zero searching through doc or the web. It also generates all of my pandas and plotnine code (Python clones of R’s data frames and ggplot2) and a ton of my NumPy, SciPy, and general Python code. It can explain the techniques it uses, so I’m learning a lot, too. I almost never use StackOverflow any more!

What’s the story with news media insiders getting all excited about UFOs?

Philip Greengard writes:

This tweet by Nate Silver in which he says that the UFOs that made news recently are “almost definitely not aliens” reminded me of some discussions on your blog including this. I thought you had another blog post about motivations of pundits around uncertainty but I can’t find it. I’m not claiming this is any sort of “gotcha,” but I thought it was somewhat interesting/revealing that Nate felt the need to include “almost” in his prediction.

Yeah, that’s funny. Some further background is that a couple years ago a bunch of elite journalists were floating the idea that UFOs might actually be aliens, or that we should be taking UFOs seriously, or something like that. I had an exchange back in 2020 with a well-respected journalist who wrote, “The Navy is releasing videos of pilots watching something in the air that is blowing their minds, It’s worth exploring what it is! Likely it’s not aliens, but I’d like to know what it is.”

So I think Nate’s comment is just a reflection of the elite media bubble in which he lives. A couple other UFO-curious people who are well connected in the news media are Ezra Klein and Tyler Cowen.

Another way to put it is: Common sense and logic say that UFOs are not space aliens. On the other hand, it’s gotta be that millions of Americans believe it . . . aaaah, here it is, from Gallup:

Four in 10 Americans now think some UFOs that people have spotted have been alien spacecraft visiting Earth from other planets or galaxies. This is up from a third saying so two years ago. Half, however, believe all such sightings can be explained by human activity or natural phenomena, while an additional 9% are unsure.

And, as I never tire of reminding you, polls find that 30% of Americans believe in ghosts. OK, polls aren’t perfect, but no matter how you slice it, lots of people believe in these things.

From that perspective, yeah, if 40% of Americans believe something, it’s no surprise that some fraction of journalists believe it too. Maybe not 40%, but even if it’s only 20%, or 10%, that’s still a bunch. Enough so that some might be personal friends of Nate, so he’ll think, “Sure, that space aliens thing seems pretty implausible, but my friends X and Y are very reasonable people, and they believe it, so maybe there’s something there . . .”

It all sounds funny to Philip and me, but that’s just because we’re not linked into a big network of friends/colleagues who opinionate about such things. Nate Silver sees that Ezra Klein and Tyler Cowen believe that UFO’s might be space aliens, so . . . sure, why not at least hedge your bets on it, right? So I think there’s some contagion here, not exactly a consensus within the pundit class but within the range of consensus opinion.

P.S. More here from Palko, with a focus on the NYT being conned. I think part of this story, as noted above, is what might be called the lack of statistical independence of the news media: if Klein, Cowen, and Silver believe this, this is not three independent data points. The other thing is that there are all sorts of dangerous conspiracy theories out there now. Maybe some people like to entertain the theory that UFO’s are space aliens because this theory is so innocuous.

Just to summarize: Palko blames the New York Times for the mainstreaming of recent UFO hype. I guess this must be part of the story (UFO enthusiast gets the UFO assignment and runs with it), but I still see this is as more of a general elite media bubble. Neither Nate Silver nor Tyler Cowen work for the Times—indeed, either of them might well like to see the Times being taken down a peg—and I see their UFO-friendliness as an effect of being trapped in a media consensus.

Here’s another way to look at it. A few years ago, Nate wrote, “Elites can have whatever tastes they want and there’s no reason they ought to conform to mainstream culture. But I do think a lot of elites would fail a pop quiz on how the median American thinks and behaves, and that probably makes them a less effective advocate for their views.” There is of course no such thing as “the median American,” but I get what he’s saying. Lots of Americans believe in UFO’s, ghosts, etc. According to a Pew Research survey from 2021, 11% of Americans say that UFOs reported by people in the military are “definitely” evidence of intelligent life outside Earth (with 40% saying “probably,” 36% saying “probably not,” and 11% saying “definitely not”). Younger and less educated people are slightly more likely to answer “definitely” or “probably” to that question, but the variation by group is small; it’s close to 50% for every slice of the population. The “median American,” then, is torn on the issue, so Nate and the New York Times are in tune with America on this one. Next stop, ghosts (more here)!

P.P.S. Non-elite journalists are doing it too! But I guess they’re just leaping on the bandwagon. If it’s good enough for the New York Times, Nate Silver, Ezra Klein, and Tyler Cowen, it’s good enough for the smaller fish in the media pond.

P.P.P.S. Still more from Palko. It remains interesting that this hasn’t seemed to become political yet. But maybe it will drift to the right, in the same way as many other conspiracy theories, in which case our news media overlords might start telling us that, by laughing at the idea of UFOs being space aliens, we’re just out-of-touch elitists. Nate and the NYT are ahead of the game by taking these ideas seriously already. They’re in touch with the common people in a way that Greengard, Palko, I, and the other 50% of Americans who don’t believe that UFOs are space aliens will never be.

P.P.P.P.S. Still more here from Palko. It’s just sad to see all these media insiders fall for it. Again, though, millions of otherwise-savvy Americans believe in ghosts, too. I think the key is not so much that these otherwise-savvy media insiders are receptive to the UFOs-as-space-aliens theory as that something has changed and it’s now acceptable for them to share their views. Probably lots of media insiders are receptive to ghosts, astrology, and other classic bits of pseudoscience and fraud (not to mention more serious things such as racism and anti-vaccine messages), but they won’t talk much about it because these ideas are generally considered ridiculous or taboo. But some impressive efforts by a handful of conspiracy-minded space-alien theorists have pushed this UFO thing into the mainstream.