
“In any case, we have a headline optimizer that A/B tests different headlines . . .”

The above line is not a joke. It’s from Buzzfeed. Really.

Stephanie Lee interviewed a bunch of people, including me, for this Buzzfeed article, “Two Big Studies Say There Are Way More Coronavirus Infections Than We Think. Scientists Think They’re Wrong.”

I liked the article. My favorite part is a quote (not from me) that I’ll return to tomorrow. But right now I want to talk about titles.

After Lee pointed me to her article, I wrote, “Do you think your headline [Two Big Studies Say There Are Way More Coronavirus Infections Than We Think. Scientists Think They’re Wrong.] is too strong? Maybe ‘Scientists Think They May Be Wrong’ would be more accurate?”

Lee responded:

The other scientists I talked to (plus several others who made their feelings heard online) felt very strongly that there were flaws in the testing, analysis, etc. and, therefore, the estimates are flawed, which is why we initially went with “think they’re wrong.” I understand that you thought that the problems were more about how the researchers presented the uncertainty, not necessarily whether the conclusions were right or wrong, so I included your quote to that effect to distinguish you from the other folks.

In any case, we have a headline optimizer that A/B tests different headlines, and minutes ago it just chose a new headline: “Scientists Are Mad.” Which I think is apt here!

Buzzfeed uses a headline optimizer! That’s so Buzzfeed.

This would be a great Jackbox game: Optimize That Headline.


  1. Dave says:

    Scientists Think They’re Wrong.
    Scientists Think They May Be Wrong.
    Some Scientists Think They’re Wrong.
    Some Scientists Think They May Be Wrong.
    Some Scientists Think.

    That last one seems pretty safe…

  2. Tom says:

    While I’m not surprised that Buzzfeed uses A/B tests for headlines, I am surprised that they let it slip so casually.
    I am not a journalist, but that kinda seems to go against what journalism is about (assuming they optimize for clicks and views).

    • Andrew says:


      It doesn’t bother me. If I have 6 headlines, and they all seem like reasonable summaries of the article, I don’t see anything wrong with choosing the one that gets the most clicks. Seriously. It’s amusing, but it seems fine with me. If they start messing with the content of the article or the political slant, then, sure, at that point I’d say they’re going too far. But this sort of headline analysis seems fair game to me.

      • Tom says:

        I guess the reason that it bothers me is that while you may have six acceptable headlines, some are surely better objective summaries of the article than others. If you are always optimizing for clicks, you are implicitly accepting that some headlines are going to be lower quality than they need to be. And also, I’m not going to pretend that my previous experiences with some amusingly terrible Buzzfeed headlines did not factor into my complaint. But you are right in the sense that the main problem would arise from the headlines you are A/B testing rather than the A/B test itself.

        • Noah Motion says:

          Are headlines supposed to be optimally objective summaries? If so, that seems like a very difficult thing to figure out reliably and efficiently. I think a pretty strong case can be made that something like clicks is a better thing to optimize for headlines, even knowing that a lot of people will just read the headline and move on. If clicks are the objective, A/B tests are a very efficient way to find (local) optima.

          • Mendel says:

            I propose that the measurement to optimize for is “retention after clicking,” i.e. how much time does the viewer spend on the article once they have clicked?
            If I read the first paragraph, I have a better idea of what the article is about, and if the headline misled me, I’ll just leave. If the headline was a good summary, then readers ought to be more likely to spend time on the article.
            — fewer ad displays in the short term. If the aim is to make money, every click is golden.
            — better long-term reputation. Users learn that interesting headlines are not mere click-bait.
            — caveat: mind the magnitudes. If 400 users follow a click-bait headline and only 20% (80) stay to read, that is still a larger readership than 100 users following a good summary with 50% (50) staying to read.
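            The magnitude caveat can be made concrete with a quick sketch (all numbers are the hypothetical ones from the comment, not real traffic data):

```python
# Hypothetical traffic: a click-bait headline draws more clicks,
# but a smaller share of clickers stay to read the article.
clickbait_clicks, clickbait_retention = 400, 0.20
summary_clicks, summary_retention = 100, 0.50

clickbait_readers = clickbait_clicks * clickbait_retention  # 80.0
summary_readers = summary_clicks * summary_retention        # 50.0

# Even with 2.5x better retention, the accurate headline ends up
# with fewer actual readers in this scenario.
assert clickbait_readers > summary_readers
```

            So a pure retention objective can lose to click-bait on total readers; the two objectives would have to be weighted against each other.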

            P.S. This blog also does headline optimization, as evidenced by blog post URLs and headlines mismatching for some articles. I’m assuming these changes are driven by a different approach, though.

      • Kaiser says:

        Andrew: This optimizer is what brings us clickbait headlines. They have been around forever and almost every media site uses one. Buzzfeed is particularly famous for it because… well, Buzzfeed is synonymous with clickbait (listicles, etc.). Their Buzzfeed News section is a separate and newer entity, and I do like their reporting. It all goes back to the business model. These websites are “free” to readers and sponsored by advertisers. The objective function of the headline optimizer is to generate more clicks or more ad exposure. On some sites I visit, the headlines often say the opposite of what the articles say, but the optimizer succeeded since I clicked on the headline!

        The headline optimization problem is actually very interesting. There is a limited time within which each article is relevant, so all gains must happen quickly. There is not much directly usable data from previous articles and headlines, since those are different articles, so you have to learn latent variables.
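        The “all gains must happen quickly” constraint is what makes this a bandit problem rather than a classical fixed-horizon A/B test. A minimal sketch of one standard approach (Thompson sampling with Bernoulli clicks); the headlines and click-through rates here are invented for illustration, and real optimizers are surely more elaborate:

```python
import random

# Hypothetical headlines with unknown true click-through rates.
true_ctr = {
    "Scientists Think They're Wrong": 0.030,
    "Scientists Think They May Be Wrong": 0.025,
    "Scientists Are Mad": 0.045,
}

# Beta(1, 1) priors: one pseudo-click and one pseudo-skip per headline.
clicks = {h: 1 for h in true_ctr}
skips = {h: 1 for h in true_ctr}

random.seed(0)
for _ in range(20_000):  # each iteration: one reader is shown one headline
    # Thompson sampling: draw a plausible CTR for each headline from its
    # posterior, then show the headline with the highest draw.
    draws = {h: random.betavariate(clicks[h], skips[h]) for h in true_ctr}
    shown = max(draws, key=draws.get)
    if random.random() < true_ctr[shown]:
        clicks[shown] += 1
    else:
        skips[shown] += 1

# The headline with the best observed click rate after the run.
best = max(clicks, key=lambda h: clicks[h] / (clicks[h] + skips[h]))
```

        The appeal for a short-lived article is that traffic shifts toward the better headline during the test itself, rather than after a fixed experiment ends.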

        • jim says:

          “On some site I visit, the headlines often say the opposite of what the articles say”

          Don’t leave out newspapers and mags!!! No one will see any ads if they don’t pick it up and buy it.

    • Dave says:

      I think I remember seeing a job posting from them a few years back that mentioned headline A/B testing. Makes sense for their business model.

  3. Ethan Steinberg says:

    A/B tests are really quite fascinating when you think about it. Right at this very moment, thousands upon thousands of large, very well controlled randomized experiments are being performed on all sorts of aspects of human behavior and interaction with all sorts of websites. I only wish more of that A/B testing data was available for research. There is probably a lot we could learn about human behavior from that data (not only looking at immediate reactions such as clicks but also things like the types of comments people leave, etc, etc).

    I wonder if Buzzfeed would be willing to release their A/B results as a public dataset …

    • Kaiser says:

      I used to do a lot of A/B testing. Here is a sample of some interesting problems:
      a) interaction between tests. If a site is simultaneously running many tests on many pages, each user may be part of multiple tests. There are some dependency problems that are ignored.
      b) users flow from page to page through the site. But what they were exposed to on an earlier page (if effective) changes the composition of the population flowing into another test. Tests are therefore not independent. Also note that the order in which pages are visited is not controlled. Some go to page A before B while others go to B before A. Multiply this by a lot if the site is large.
      c) there may be all kinds of targeting and optimization happening all over the site while tests are being run. Those systems are usually poorly documented.
      d) do you randomize across the whole site or per test?
      e) there is often a need to establish the baseline behavior across all tests. How to do it?
      f) few systems are set up such that you can define analysis windows, so each test cell contains some users who joined the test 5 days ago and some who joined 1 day ago. If the response is a sale, that sale could take place immediately or within, say, 10 days. The lag differs by treatment.
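      The analysis-window issue in (f) can be sketched in a few lines. This is an illustration with made-up join and conversion dates, not any real system: the fix is to compare only users with equal exposure time, counting only conversions within that window.

```python
from datetime import date, timedelta

today = date(2020, 4, 25)  # hypothetical analysis date
# Made-up test log: (user, cell, joined, converted_on or None).
log = [
    ("u1", "A", today - timedelta(days=5), today - timedelta(days=3)),
    ("u2", "A", today - timedelta(days=1), None),  # only 1 day of exposure
    ("u3", "B", today - timedelta(days=5), None),
    ("u4", "B", today - timedelta(days=1), None),  # only 1 day of exposure
]

window = 3  # days of exposure every counted user must have had

def aligned_rate(cell):
    # Keep only users who joined at least `window` days ago...
    users = [(joined, conv) for _, g, joined, conv in log
             if g == cell and (today - joined).days >= window]
    # ...and count only conversions within `window` days of joining.
    hits = sum(1 for joined, conv in users
               if conv is not None and (conv - joined).days <= window)
    return hits / len(users) if users else float("nan")
```

      A naive rate computed over everyone mixes 1-day and 5-day exposures, so a treatment with a longer purchase lag looks artificially worse early in the test.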

  4. Joseph Candelora says:

    That’s funny. Have to admit, though, my favorite version was when I (mis)read Lee’s response to say that the headline would read simply: “Scientists Are Mad”.

    One aspect of this article and all this coverage that bothers me, though, is lines like the one in the Buzzfeed article that “Their estimates were jaw-dropping.” The new estimates really aren’t.

    Bendavid and Bhattacharya (with input from Sood) published last month in the Wall Street Journal an estimate that IFR could be 0.01%. _That_ was jaw-dropping. Their new IFR estimate of 0.12% – 0.2% in Santa Clara? If true, it moves the needle meaningfully, but we’re not talking about an order of magnitude. The Imperial College London study (which seemed to do the most to drive action in response to the threat) used a nationwide IFR estimate of 0.83% for its death count.

    What I really want is for one of these articles to pin the authors down on a revised nationwide IFR estimate. They were happy to do it with zero serological data, and said they needed these tests in order to make better estimates. So where are those estimates?

    But for argument’s sake, I’ll assume the national IFR would be higher than that for SC County, which is younger and healthier than the rest of the country. Let’s say we’re talking 0.3% nationwide IFR; sure, that cuts the estimated deaths by 65%, from about 2.2m unmitigated from Imp Col Lon to presumably about 0.8m. Would that change our policy responses? Maybe yes, maybe no. But it’s not jaw-dropping, it’s not the complete invalidation of everything that came before, it’s simply new information that should inform our decision making.

    Of course, with the prevalence of serology tests going on right now it makes the most sense to sit back a bit and gather more information, which again is an easier position to take if you’re not under the misapprehension that some new study has completely blown a hole in all prior assumptions.

    • Andrew says:


      I continue to think that it’s a mistake to talk about “the” IFR; I think that people who give low IFR numbers are implicitly not counting nursing homes and other places with high concentrations of people over the age of 75. Rather than trying to pin them down to a national IFR estimate, I’d rather pin them down to estimates as a function of age and maybe broken down in other ways too.

    • Joshua says:

      Joseph –

      > What I really want is for one of these articles to pin the authors down on a revised nationwide IFR estimate. They were happy to do it with zero serological data, and said they needed these tests in order to make better estimate. So where are those estimates?

      FWIW, in a video interview that Ioannidis did *after* the Santa Clara preprint came out, he stated with pretty much certainty that the IFR was pretty much the same as that of the seasonal flu. Not exactly being pinned down to an estimate, but maybe somewhat close? By the way, he also stated that the 2 million deaths without intervention was “science fiction.” At some point, I guess someone could reverse-engineer from that an upper bound on what he considers even possible.

  5. Adede says:

    I think A/B headline testers are pretty common. I’m sure I had an NY Times article change its headline on me recently…

  6. Joseph Candelora says:

    I agree with your preference to an extent, but still don’t agree with you on what the people who are giving low IFRs are talking about.

    When they calculate their IFR estimates, they aren’t excluding nursing homes/elderly, either explicitly or implicitly. All the known Covid nursing home deaths in Santa Clara County are in the death count (and by extension the death estimate) used in the SC County paper.

    When Ioannidis published his take in Stat back in May, he took Diamond Princess data and adjusted to a national average IFR based on the difference in age characteristics between the nation and Diamond Princess passengers, and then further adjusted because presumably the health of those on the ship was better than average in the oldest cohorts. And he used his IFR to estimate total national deaths from an unmitigated outbreak. Similarly, Bhattacharya and Bendavid published their 0.01% IFR estimate in the WSJ in March, and seemed to use it to generate (or at least inform) a potential range of 20k – 40k national deaths in an unmitigated outbreak.

    As to your preference for estimates as a function of age and comorbidities, I don’t disagree; more detail is better. But as far as a common currency in lay discussion — and that’s what I’m talking about, having these researchers update the numbers they’ve already put out in lay discussion in Stat and WSJ — you could do much worse than an estimate of overall national IFR.

    For me personally, it’s the critical feature, and getting some level of consensus on it is necessary to have meaningful policy debate. If it’s a virus that, unchecked, kills 20,000 Americans, you basically shouldn’t do anything. If it kills 20,000,000, then you should do everything. If the estimate is 200,000 or 2,000,000, I can see reasonable people disagreeing on how much to do.

    An article like this, which claims the new results, if true, would be “jaw-dropping,” and then discusses the 50-85x undercount estimates, is likely to leave readers with the impression that whatever prior estimate they had latched onto for deaths must’ve been vastly overstated. And that just pollutes the discussion. At least you could leaven it with “new IFR estimate from researchers suggests virus could be 30x more deadly than their earlier estimate, which was one of the lowest” or something like that.

    • Andrew says:


      I agree with everything you write here. I just think that if someone is on record saying the IFR is 0.02% or whatever, and then deaths come out to be higher than that, then they’ll discount the nursing-home cases, they’ll say that these people didn’t get the best care so their cases shouldn’t count in the true IFR, etc. It’s just the natural step to take in the argument.

    • Martha (Smith) says:

      Joseph wrote,
      “When Ioannidis published his take in Stat back in May, he took Diamond Princess data and …”

      I think you mean March.

  7. Joseph Candelora says:

    “When Ioannidis published his take in Stat back in *March”

    and my “at least you could leaven it” in final sentence was a generic “you” to reporters, not meant to reference Andrew. I know he didn’t write this BuzzFeed article, just contributed a quote.

  8. Beau Dure says:

    I’m a journalist, and I absolutely detest search-engine optimization (SEO). It turns writers into clickbait-generating robots.

    That said, don’t blame Buzzfeed or (SITE REDACTED) or (SITE REDACTED) and so forth.

    Blame Google. Their search algorithms stink.
