The above sketch shows a decision tree.
The circles are uncertainty nodes and the squares are decision nodes. Read the tree from left to right: to start, there is uncertainty of which of the strata i=1,…,I you will be in. In any given stratum, you will have to decide between options 1 and 2, and for each of these decision options there is uncertainty about the payoff.
The goals are:
(a) Conditional on the stratum, pick the best decision. This is the local decision problem.
(b) Averaging over the strata, evaluate the expected value of the tree, that is, the expected value under an optimal decision analysis given the uncertainty.
The challenge is that you don’t know which internal decision is best, because there is uncertainty about the payoffs.
The “optimizer’s curse” is that if, for each stratum in step (a), you make the best decision given available information–that is, you estimate the expected payoff under each of the two decision options and then pick the the one whose expected payoff is higher–then if you use these expected payoffs in step (b) you will systematically overestimate the value of the tree.
The “curse” here is not that the optimizer is making bad decisions, it’s that a naive estimate will be overly optimistic about the net value because you’re selecting on choices that look good.
In 2007, Erwann Rogard, Hao Lu, and I published a paper on the topic, including the above diagram. Here’s our abstract:
The evaluation of decision trees under uncertainty is difficult because of the required nested operations of maximizing and averaging. Pure maximizing (for deterministic decision trees) or pure averaging (for probability trees) are both relatively simple because the maximum of a maximum is a maximum, and the average of an average is an average. But when the two operators are mixed, no simplification is possible, and one must evaluate the maximization and averaging operations in a nested fashion, following the structure of the tree. Nested evaluation requires large sample sizes (for data collection) or long computation times (for simulations).
An alternative to full nested evaluation is to perform a random sample of evaluations and use statistical methods to perform inference about the entire tree. We show that the most natural estimate is biased and consider two alternatives: the parametric bootstrap and hierarchical Bayes inference. We explore the properties of these inferences through a simulation study.
I kinda like the paper. I wouldn’t say it’s one of my all-time favorites, but I think it’s interesting, and I like that we offer two different solutions to the problem.
On the downside, the paper seems to have disappeared without a trace. In 20 years, it’s only been cited three times, and none of them look very impressive:

“Using Alternating Decision Treets,” indeed.
Maybe one problem with our paper was its dry-as-dust title, “Evaluation of multilevel decision trees.”
This all came to mind because Sean Manning pointed me to this post, “The best cause will disappoint you: An intro to the optimisers curse.” Now that’s a good title.
It seems that the term “optimizer’s curse” came from this 2006 paper by James Smith and Robert Winkler, which has a lot of overlap with our article that appeared a year later. Both papers use hierarchical Bayesian analysis. Their paper is better than ours, for sure, and not just in the title, as they make a much better case for the importance of the problem. But we were working independently. Too bad: had we joined forces we could’ve produced something better, as each of the two papers had lots of material that was not in the other. Smith and Winkler consider the problem of choosing among many options with different levels of uncertainty, whereas we consider a multiplicity of binary decisions. These are just two cases of the general principle.
The above-linked post, by someone who goes by the handle “titotal,” is good too. It doesn’t have any new technical material, but it explains the problem in plain English from first principles, goes through some examples, and discusses some of the policy implications.

I was confused for a long while because this observation seemed to violate the tower property of expectations. Now I see that the issue comes from the fact that the conditional payoff distributions are not known, but estimated by sampling, which means that the chosen decision in each stratum is also a random variable.
This Optimizer’s Curse sounds much the same as a section from The World Of Mathematics series, published in 1956, if I remember correctly from long ago. The application was to a set of bidders for a contract. If the bidders had a distribution of errors in their cost estimates then the low bidder very likely would have underestimated the cost, and would therefore likely lose money on the contract.
Tom:
The example you’re talking about is known as the winner’s curse, and I agree that it has a similar mathematical structure as the optimizer’s curse discussed in the above post.
Neural networks from locality sensitive hashing: https://sciencelimelight.blogspot.com/2026/06/atlas-lsh-neural-networks-geometry-as.html
I just cited the Smith and Winkler paper for the first time this week but hadn’t seen your paper; I’ll have to go through it. Picking the best choice at each stage, or among a set of competitors, is such an obvious thing to do, and the introduction of bias is somewhat surprising even to statisticians not familiar with the problem. This particular curse pops up in medical research all the time, especially early-stage research where a set of choices are made about what to continue studying, but given the sample sizes the best choice is often the most biased (related to the Type M errors).