“Small Steps to Accuracy: Incremental Updaters are Better Forecasters”

Pavel Atanasov writes:

I noticed your 2016 post on belief updating. Here is the key bit:

From the perspective of the judgment and decision making literature, the challenge is integrating new information at the appropriate rate: not so fast that your predictions jump up and down like a yo-yo (the fate of naive poll-watchers) and not so slow that you’re glued to your prior information (as happened with the prediction markets leading up to Brexit).

I wanted to share a manuscript, co-authored with Jens Witkowski, Lyle Ungar, Barbara Mellers, and Philip Tetlock, that addresses a closely related question: How do accurate forecasters update their predictions, given the twin threats of over- and under-reaction? Here’s the abstract:

We study the belief updating patterns of real-world forecasters and relate those patterns to forecaster accuracy. We distinguish three aspects of belief updating: frequency of updates, magnitude of updates, and each forecaster’s confirmation propensity (i.e., a forecaster’s tendency to restate her preceding forecast). Drawing on data from a four-year forecasting tournament that elicited over 400,000 probabilistic predictions on almost 500 geopolitical questions, we find that the most accurate forecasters make frequent, small updates, while low-skill forecasters are prone to make infrequent, large revisions or to confirm their initial judgments. Relating these findings to behavioral and psychometric measures, we find that high-frequency updaters tend to demonstrate deeper subject-matter knowledge and more open-mindedness, access more information, and improve their accuracy over time. Small-increment updaters tend to score higher in fluid intelligence and obtain their advantage from superior accuracy in their initial forecasts. Hence, frequent, small forecast revisions offer reliable signals of skill.

Slowness to update . . . that’s one of my favorite topics! Good to see work in this area.

31 thoughts on "“Small Steps to Accuracy: Incremental Updaters are Better Forecasters”"

    • We look into Sam Wang’s 2016 forecasts in a different piece on the Monkey Cage blog. Nate refused to share data, so we did not score his forecasts for that piece, but based on a simple analysis of screen-captured 538 forecasts, he would have done relatively well, though not the best, among the series we scored.
      https://www.washingtonpost.com/news/monkey-cage/wp/2016/11/30/which-election-forecast-was-the-most-accurate-or-rather-the-least-wrong/

      • Pavel:

        Wang was so far off in 2016 that at some level he had to grapple with the fact that he’d screwed up. What’s been frustrating with Silver is that he’s made claims that are incoherent and aren’t supported by his own data. For example, he wrote, “We think it’s appropriate to make fairly conservative choices especially when it comes to the tails of your distributions. Historically this has led 538 to well-calibrated forecasts (our 20%s really mean 20%).” This is incoherent in that for prediction intervals to be conservative implies that their coverage is greater than their stated probabilities. Also, according to the calibration plots posted on the FiveThirtyEight website, in this domain a stated 20% really means 14%, and 80% really means 88%. We further discuss the issue in the second column of page 873 of this article.
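        To make concrete what conservative probabilities look like in a calibration plot, here’s a toy simulation in Python. The numbers are made up to roughly mimic the 20%-to-14% / 80%-to-88% pattern; this is not FiveThirtyEight’s actual data or code.

          import numpy as np

          # Toy simulation: stated forecast probabilities are pulled toward 50%,
          # so unlikely events are overstated and likely events understated.
          rng = np.random.default_rng(0)
          p_stated = rng.uniform(0.02, 0.98, 50_000)
          p_true = np.clip(0.5 + 1.2 * (p_stated - 0.5), 0, 1)  # true rates are more extreme
          outcome = rng.binomial(1, p_true)

          for lo in (0.15, 0.75):  # bins centered near 20% and 80%
              in_bin = (p_stated >= lo) & (p_stated < lo + 0.10)
              print(f"stated ~{lo + 0.05:.0%}: empirical rate {outcome[in_bin].mean():.0%}")
          # stated ~20% -> roughly 14%; stated ~80% -> roughly 86%

        A well-calibrated forecaster would show empirical rates matching the stated probabilities in every bin, which is what “our 20%s really mean 20%” should imply.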

        • Andrew:

          Yes, thank you for sharing the JDM journal article. It was great to see the research related to The Economist model discussed in such detail there, in an open-source format. That highlights the contrast between open-source and proprietary code/data. My two cents is that most of Nate Silver and 538’s special sauce is in the journalism/explanation behind his forecasts, rather than in the proprietary modeling choices he builds into his code, but I know he would disagree. The incorrect calibration claim is especially weird because he could have picked a different point on the scale where forecasts and observed event rates are more closely matched.

          I am sure you are aware of this work by Rajiv Sethi and colleagues on the accuracy of PredictIt and The Economist model in 2020, but I’m sharing it here for other readers:
          https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3767544

        • Andrew:

          It’s understandable that you are frustrated by FiveThirtyEight’s statements. However, by similar logic, please note that it can be frustrating when I and other blog commenters point out that your 2020 presidential election predictions are overconfident (prediction intervals too narrow, as shown by poor calibration and low p-values for 2008-2016), but you decide not to correct the issue.

          From my perspective, there is a similarity between choosing to be conservative (underconfident) in one’s predictions and choosing not to correct overconfidence in one’s predictions. I hope you see the similarity too, because I think overconfident election predictions have negative consequences for elections, so it would be valuable for your team to achieve better calibration in future election predictions.

        • Fogpine:

          I don’t know what you mean by “decide not to correct the issue.” Our forecasts are already done; we can’t go back and correct them. We’re working on various ways to make things better for next time.

        • Andrew:
          I mean comments made in the months before Nov. 3, such as those pointing out that the Economist model’s backtesting on 2008-16 had prediction interval coverage below the stated levels, assigned very small p-values to the set of actual 2008-16 election outcomes, had tail issues, etc.
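          For instance, the basic coverage check is just a tally (made-up numbers here, not the actual backtest output):

            import numpy as np

            # Hypothetical backtest: stated 95% intervals for state-level vote shares
            # and the outcomes actually observed. All numbers are invented.
            lower  = np.array([0.48, 0.44, 0.51, 0.46, 0.53])
            upper  = np.array([0.56, 0.52, 0.59, 0.54, 0.61])
            actual = np.array([0.47, 0.50, 0.60, 0.49, 0.55])

            covered = (actual >= lower) & (actual <= upper)
            print(f"stated 95% intervals, empirical coverage: {covered.mean():.0%}")
            # Coverage well below 95% across backtest years means the intervals are too narrow.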

          As someone who doesn’t run a popular blog, I’m not sure how memorable these comments are to you now, but they did provide concrete evidence of the issues.

          I’m glad you’ll work to improve predictions in the future.

        • +1

          I was following fogpine’s comments, and Andrew’s/Morris’ reactions to them, with a lot of interest at that time. I appreciated that you (Andrew) occasionally responded to these comments. I also appreciated that you cited fogpine’s comments in a great post on October 15, 2020 where you highlighted some of these issues with the tails of your election forecasts. Still, the decision not to adjust the model bothered me. I’d be curious to hear more about your own internal thought process at that time–it might cast light onto Nate’s. Perhaps fixing the model would have just meant putting in too much work in too little time, which I think is relatable.

        • Peter:

          Our model has lots of issues, and we’ll redo it for next time. The Economist people also had a model for congressional elections that I wasn’t involved in at all. The point is, they were busy, so yeah, “too much work in too little time” is about right. Adjusting a model isn’t easy; it’s not like there’s just one knob to turn. Indeed, I think some of the silly aspects of Nate’s model happened because he kept throwing in adjustments without understanding how they all fit together. In addition to all the distributional issues we discussed on the blog, there were also issues extraneous to the polls, such as concerns about voter turnout modeling and questions about how to account for factors that were not under the control of the voters at all, such as the possibility that Lindsey Graham could convince the officials in Georgia to throw out the actual election results. My point is not that we should not have fixed what we could in our model; rather, I’m just noting that even had we fixed that one thing with the tails of the distributions, there still would’ve been other unresolved and even unresolvable issues.

        • Andrew:

          Our model has lots of issues, and we’ll redo it for next time… My point is not that we should not have fixed what we could in our model; rather, I’m just noting that even had we fixed that one thing with the tails of the distributions, there still would’ve been other unresolved and even unresolvable issues.

          But this is the essence of the problem. How can you, on the one hand, acknowledge that a model has all these issues but, on the other hand, be OK asserting that Biden had a 97% chance of winning the election? I can’t think of a situation in which it would be reasonable for me to say “my model for X has lots of issues that should be addressed” and at the same time be confident enough to publicly and prominently state “X has a 97% chance of occurring”, especially when the topic is of national importance. It’s prediction overconfidence.

          P.S. I don’t want the important advantages of the Economist’s approach to get lost in this criticism, so I should add that it’s fantastic that the Economist team published the backtesting models’ code. Making the backtesting code available is both a lot of extra work and a brave choice. Also, FiveThirtyEight hasn’t done it.

        • I appreciate your response, Andrew, and your rejoinder, fogpine! You both make a lot of sense.

          I think the Economist’s/Andrew’s choice to open-source so much of the data and model code, and Andrew’s public engagement with criticism, are major strengths relative to the 538 team, and definitely brave.

          I’ve been impressed with the care, even anxiety, evident in the 538 data visualization team’s approach to communicating the results of their model in a way that conveys the uncertainty associated with the modeling process itself. Their major redesign of the election forecast webpage for 2020, especially their decision to present the reader with detailed results from a small number of draws from the posterior distribution, went a long way, I think, toward communicating additional sources of uncertainty associated with simulation error and model specification. I also think it was smart to make the reader scroll down literally an entire page in order to get to a number, with the topline instead displaying an adjective like “favored”…

      • I was 95% sure that Trump would win in 2016, even before the news of the Clinton emails; maybe that prediction was good luck. I based it on the content of Trump’s campaign speeches. He was a master communicator in the 2016 election.

  1. Aha, great that you all are discussing this. As I often mention, Expert Political Judgment by Prof. Tetlock was a relief for me to read because I have long been interested in the decision-making scholarship. I recall reading Sherman Kent and Irving Janis as a pre-teen and teen.

    • Werner:

      Thank you for the note. I do have an issue with your framing of our work as being work done by “Phil Tetlock and his group”. There is a reason why scientific citations are in the form Lead Author et al. (YEAR). In this case, the lead author is myself, but more generally, lead authors do much of the heavy lifting and deserve lead credit for a publication. I raise the issue because it is so prevalent in the way credit is given in science, and I will continue to raise it even if I am the senior/famous author some day.

      Readers are probably aware of the Matthew effect, whereby famous people often get credit for work done by many others. I don’t want to diminish in any way what senior authors, such as Phil Tetlock or Andrew, do on projects in which someone else takes the lead and therefore becomes lead author. But this Matthew effect has very real consequences for the careers of folks like me who work with famous senior co-authors. Please be more careful in attributing credit to groups in which some authors are more famous than others. It’s really easy to know who the first author is.

  2. RE: ‘Small-increment updaters tend to score higher in fluid intelligence and obtain their advantage from superior accuracy in their initial forecasts. Hence, frequent, small forecast revisions offer reliable signals of skill’

    I would love to read how you came to this conclusion. Maybe the comparison of ‘high-frequency’ updaters with ‘small-increment’ updaters is tripping me up.

    I would think that not having conflicts of interest makes a lot of difference as far as accuracy goes. I have watched how this plays out in foreign policy development.

  3. Pavel – I couldn’t tell from the description in the paper how the “Updating Simulation” (the test of what happens if updates were 70% or 130% as large) worked.

    Using the example in the paper of successive forecasts of 25, 45, and 9 (skipping the zero-update forecast) and the counterfactual experiment of updates scaled to 130%:

    First forecast in actual and counterfactual stream: 25
    Second actual forecast 45, so first actual update magnitude is 45 – 25 = 20
    First counterfactual update: 20*1.3 = 26, so second forecast in counterfactual stream is 25 + 26 = 51
    Third actual forecast 9, so second actual update magnitude is 9 – 45 = -36
    Second counterfactual update, options #1 and #2: -36*1.3 = -46.8
    Third forecast in counterfactual stream, option #1: 51 – 46.8 = 4.2
    Third forecast in counterfactual stream, option #2: 45 – 46.8 = -1.8 (beyond minimum value, so adjusted to 0)
    Second counterfactual update, option #3: (9 – 51)*1.3 = -54.6
    Third forecast in counterfactual stream, option #3: 51 – 54.6 = -3.6 (beyond minimum value, so adjusted to 0)

    The only option that appears to have the desirable property of allowing the larger-update and smaller-update counterfactual forecasts to converge to truth as knowledge approaches 100% is option #3. (Forecast confirmations improve the counterfactual forecasts.) But the text refers to “update magnitudes … set to 130% of their original values”, which implies either option #1 or option #2, since those are the two options that use the original update magnitude in the calculation.
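    To make the three readings explicit, here is a short Python sketch of how I’m interpreting them (my guesses at the procedure, not code from the paper):

      def counterfactual(forecasts, scale=1.3, option=1, lo=0, hi=100):
          # Three possible readings of "update magnitudes set to 130% of their
          # original values" -- my guesses, not the paper's definition.
          orig = list(forecasts)
          cf = [orig[0]]
          for t in range(1, len(orig)):
              if option == 1:    # scaled original update, chained on the counterfactual stream
                  new = cf[-1] + scale * (orig[t] - orig[t - 1])
              elif option == 2:  # scaled original update, applied to the original previous forecast
                  new = orig[t - 1] + scale * (orig[t] - orig[t - 1])
              else:              # option 3: update recomputed toward the original target forecast
                  new = cf[-1] + scale * (orig[t] - cf[-1])
              cf.append(min(hi, max(lo, new)))  # keep within the probability scale
          return cf

      for opt in (1, 2, 3):
          print(opt, [round(x, 1) for x in counterfactual([25, 45, 9], option=opt)])
      # option 1 -> [25, 51.0, 4.2]; option 2 -> [25, 51.0, 0]; option 3 -> [25, 51.0, 0]

    Only option #3 recomputes the update relative to the counterfactual stream, which is what gives it the convergence property described above.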

  4. Is it really slowness? That implies to me that they will eventually update like they should, but just aren’t ready yet. But what about cases where people show resistance to ever updating to become as certain as they should? (I’m thinking about what has been called non-belief in the law of large numbers, which seems hard to separate from slowness).

    • Jessica,

      Your question about slowness is useful, as the three belief updating predictors relate to slowness differently:
      – frequency: the opposite of slowness; these forecasters are quick to update when they find new info
      – confirmation: a sign of slowness; the forecaster comes back to the platform and chooses to keep the forecast as is
      – magnitude: it’s not clear whether small updates are a sign of slowness; if the forecaster was already close to an accurate estimate, she does not need to make a large update. But the small-magnitude updater may seem slow when you zoom out on the time series.

      Small-magnitude updating is moderately correlated with high-frequency updating (r ≈ .3), and perceived “slowness” depends on the balance between the two.

  5. An interesting follow-up study would force people to change their probability statement at regular intervals and by small, fixed increments. It would address to what extent prediction ability is moderated by updating style. Maybe if everyone were forced to update rarely and by small amounts, for example, the “true” visionaries would emerge!

  6. Atanasov et al. study people who are forecasters, not statistical models for forecasting. However, when Atanasov et al. conclude that better forecasters tend to modify predictions more often and in smaller steps, that conclusion feels familiar from my practical experience working with statistical models.

    In particular, some statistical models are quick and easy to change when new evidence says they should be, while other models are difficult and time-consuming to change, for example because they have many pieces that interact in hard-to-anticipate ways, a confusing likelihood function, long fitting run times, or unwieldy code organization. Statistical models that are more difficult to change are corrected less frequently, and often with larger accompanying changes in predictions.

    For example, Andrew mentions above that it would have been difficult and time-consuming to change the Economist’s 2020 election forecasting model to address the miscalibration. In my own research, I’ve also coded plenty of models that end up being obstructively difficult to modify, and surely many others have been in the same situation. Perhaps, then, ease and quickness of correction should be emphasized more as a strength of some statistical modeling approaches and a weakness of others. Unfortunately, the academic world of “publish your paper and move on” doesn’t provide much incentive for that goal.

    Returning to the topic of people who are forecasters, I’m tempted to view their internal, mental prediction processes as an analogue of the concrete statistical models written in code. Atanasov et al. focus on how better forecasters differ from others in terms of personality attributes like open-mindedness. But maybe the worse forecasters fell short not so much in personality attributes as because their internal representation of the world was the mental equivalent of a statistical model with many pieces that interact in awkward, hard-to-anticipate ways, making the model difficult and time-consuming to correct and thereby preventing the frequent small changes warranted by new evidence. Also, there’s probably overlap between personality attributes and the structure of one’s internal models of the world.

    • fogpine:

      This is a thought-provoking analogy, which I can relate to as a modeler, not just as someone who studies how people make forecasts. In my mind, the key variable connecting models and human forecasters is “need for consistency”. If one has a complicated model–mental or statistical–with many variables and a high need for consistency, updating is difficult and time-consuming. This is one criticism I have of analytical philosophers and some members of the rationality movement: their arguments and mental models are often so complex and so optimized for internal consistency that they are difficult to update when circumstances change. It’s the age-old friction between consistency and correspondence. I don’t think that’s the main driver of our results–many forecasters simply didn’t log in often enough, so effort is probably a more important limiting factor on update frequency than complicated mental models. But I find this line of reasoning useful in general.

      • Pavel:

        Agreed, models (and people!) can end up impressively self-consistent but far from reality.

        On the other hand, some models and people are undeniably successful even with all three characteristics we’re talking about: highly self-consistent, highly complex, and resistant to correction when there is new evidence. For instance, physics has many examples.

        …many forecasters simply didn’t log in often enough, so effort is probably a more important limiting factor on update frequency than complicated mental models.

        Ah, I see what you mean. Yes, effort makes more sense than my speculation.

    • Fogpine:

      Perhaps the two different types of models – those that are easy to update and those that are difficult to update, both mentally and in code – reflect preparation vs lack of preparation.

      Digression:
      One of the most important things a quarterback needs to do in American football is to recognize the defensive alignment and adjust his “model” of how the play will run according to the defensive alignment. If he can do that quickly, he can exploit whatever misalignments occur on the play. If he can’t, he might get sacked.

      But the quickness with which he recognizes the defensive alignment isn’t due to just raw speed of electrical impulses, right? It’s due more to the degree to which he’s developed the synapse connections that recognize and act on the defensive alignment: it’s due to the work he’s put in ahead of time, mapping out the potential pathways along which a play can develop.

      IMO both the mental processes of forecasting and the process of coding a model can be pre-patterned in the same way.

      One reason a good forecaster might be better at making small adjustments quickly is because s/he’s worked out the likely adjustment vectors well in advance. So while his/her *forecast* is changing, the model really isn’t. The model still has the same pathways. The only difference is which one the marble will roll down.

      The same thing can be done with coding, though of course that depends on the time available for the project. And obviously, on any given model, the more quickly a person understands and recognizes the parameters, the faster an appropriate model can be built that incorporates all the potential pathways of the outcome. How quickly one recognizes the parameters might depend a lot on how well a person can generalize from previous knowledge.

      • jim:

        Hm, yes. For example, I’m thinking back on the situations where my statistical models have been painfully difficult to modify. In all those cases, when I was initially planning the statistical approach and coding it, I could have benefitted from more foresight into the general kinds of modifications that might reasonably become necessary in the future. If I had had that foresight, it would have been quicker and easier to address the specific modifications that actually became necessary.

        Yet, it’s not the case that all statistical approaches accommodate correction with similar fluency. For example, suppose you apply some frequentist approaches and then find out that you need to address measurement error. You may have a huge obstacle ahead of you. In contrast, some Bayesian approaches can be theoretically easy to modify, but practically difficult because it can be an obstacle just to get the model to finish fitting at all, let alone to have it run fast enough.
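        As a rough illustration of what “theoretically easy to modify” means here, below is a toy sketch of adding classical measurement error to a regression in a Bayesian setting. It uses simulated data and PyMC as one example of a probabilistic-programming toolkit; it’s my own illustration, not code from any of the work discussed above.

          import numpy as np
          import pymc as pm

          # Simulated data: the predictor is only observed with noise.
          rng = np.random.default_rng(1)
          n = 200
          x_true_sim = rng.normal(0, 1, n)
          x_obs = x_true_sim + rng.normal(0, 0.5, n)       # noisy measurements of x
          y = 1.0 + 2.0 * x_true_sim + rng.normal(0, 1, n)

          with pm.Model():
              # The measurement-error "layer": latent true predictor values.
              x_true = pm.Normal("x_true", mu=0, sigma=1, shape=n)
              pm.Normal("x_meas", mu=x_true, sigma=0.5, observed=x_obs)

              # Otherwise an ordinary regression, now on the latent predictor.
              alpha = pm.Normal("alpha", 0, 5)
              beta = pm.Normal("beta", 0, 5)
              sigma = pm.HalfNormal("sigma", 1)
              pm.Normal("y", mu=alpha + beta * x_true, sigma=sigma, observed=y)

              idata = pm.sample()                          # this step can be slow

        The conceptual change is a couple of lines; the practical cost is sampling the n latent values, which is the fitting-time obstacle I was alluding to.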

  7. I wonder about low-frequency updaters… does the momentary accuracy of their predictions influence whether or not they update? I’m having a hard time expressing the idea, but I’m wondering if people who have a strong concern about being right shy away from updating, then find themselves forced to choose between a substantial revision (i.e., admitting being wrong) or “sticking to their guns” (thus perpetuating their wrongness)? If that makes any sense…
