More on prior distributions for climate sensitivity

In response to this post the other day on prior distributions for climate sensitivity, Nicholas Lewis wrote in:

Your post refers to comments I made at ATTP’s blog about the use of Jeffreys’ prior in estimating climate sensitivity. I would like to explain why, in some but not all cases, the Jeffreys’ prior for estimating climate sensitivity peaks at zero, a physically implausible sensitivity.

Unfortunately, observational uncertainty is high, and the choice of prior has a substantial effect on the posterior density when estimating climate sensitivity. The value of perfect information about a less uncertain parameter that is primarily determined by climate sensitivity was estimated in a recent paper as $10 trillion, so prior selection for this problem is quite an important issue.

Roughly speaking, when estimating climate sensitivity (ECS) from observational data over the industrial period, the reciprocal of climate sensitivity, the climate feedback parameter, is a location parameter with approximately normal (or t-distributed) uncertainty. If one does not want to incorporate prior information about the parameter being estimated, then a uniform prior seems to me the obvious choice when estimating a location parameter, here the climate feedback parameter. Even if one knows that the climate feedback parameter can’t be infinite (corresponding to ECS = 0), using a uniform prior does not bias estimation. On changing variable back to ECS, that prior transforms to c/ECS^2 (c being a constant), which peaks at zero. But the cutoff from the declining likelihood function is very sharp at low ECS, and it already makes the posterior density negligible at ECS values that are still well above zero.
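The change of variable described above can be written out in a few lines. This is only a sketch: the symbols S (for ECS), λ (for the climate feedback parameter), and the positive constants k and c_0 are notation assumed here for illustration and do not appear in the original comment.

```latex
% Assumed notation: S = ECS, \lambda = climate feedback parameter,
% k > 0 a constant linking them, c_0 the constant height of the flat prior.
\begin{align}
  p_\lambda(\lambda) &= c_0
    && \text{(uniform prior on the location parameter)} \\
  \lambda &= \frac{k}{S}, \qquad
  \left| \frac{d\lambda}{dS} \right| = \frac{k}{S^{2}}
    && \text{(reciprocal relationship)} \\
  p_S(S) &= p_\lambda\!\left(\frac{k}{S}\right)
            \left| \frac{d\lambda}{dS} \right|
          = \frac{c_0\,k}{S^{2}} = \frac{c}{S^{2}}
    && \text{(with } c = c_0 k \text{)}
\end{align}
```

Because the posterior is prior times likelihood in either parameterization, inference with a flat prior on λ and inference with the c/S^2 prior on S describe the same posterior, just expressed in different variables.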

Although I use computed noninformative (or minimally informative) priors in my published Bayesian climate sensitivity studies, I think I am the only climate scientist to do so. Many published climate sensitivity studies take a subjective Bayesian approach and either use a wide uniform prior for ECS (which hugely fattens the upper tail of the posterior PDF, as well as shifting the central estimate, relative to use of Jeffreys’ prior), or an “expert” prior that typically dominates the likelihood function. Some studies do, however, use non-Bayesian methods, and so avoid use of subjectively chosen priors.

No doubt, as a Bayesian, you don’t like likelihood ratio and profile likelihood methods of parameter estimation. But I will nevertheless point out that when I use such methods, they give results very close to those obtained by use of Jeffreys’ prior. So also does use of a reference prior.

I replied:

In any case, I think this is a challenging problem because of the decision aspect. As you note, there’s lots of attention drawn to the center of the prior distribution, but the tails are crucial when considering expected costs. This came out in some of the discussion in comments, for example here.

Another point that came up, but which I did not emphasize in my post, is that the Jeffreys prior depends on the likelihood function, thus indirectly on various assumptions. I’m not in general opposed to noninformative priors, I just think they should be taken for what they are. It always makes sense to understand the mapping from assumptions to conclusions.

Lewis then added:

The upper tail of the estimated posterior density for climate sensitivity is indeed critical for estimates of expected damages. Unfortunately, the likelihood typically only declines gently as ECS goes to high values, as they correspond to the climate feedback parameter approaching zero more and more slowly. Hence, in my view at least, the importance here of using a prior that declines at least approximately as 1/ECS^2, to reflect the data-parameter relationship, and not a uniform prior. It is a pity that the standard parameterization is in terms of ECS rather than climate feedback parameter; if it had been I think this issue would never have arisen.
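To make the tail behavior concrete, here is a minimal numerical sketch comparing a uniform prior in ECS with a prior proportional to 1/ECS^2. Every number in it is hypothetical and chosen only for illustration (the forcing constant `F_2X`, the feedback estimate `LAMBDA_HAT`, its standard error `SIGMA`, and the grid bounds); none comes from the studies discussed here. It assumes a normal likelihood in the feedback parameter, as described above.

```python
import math
from bisect import bisect_left
from itertools import accumulate

# All values below are hypothetical, for illustration only.
F_2X = 3.7        # assumed forcing from doubled CO2 (W/m^2)
LAMBDA_HAT = 2.0  # assumed observed feedback estimate (W/m^2/K)
SIGMA = 0.6       # assumed standard error of that estimate


def likelihood(ecs):
    """Normal likelihood in the feedback parameter lambda = F_2X / ECS."""
    lam = F_2X / ecs
    return math.exp(-0.5 * ((lam - LAMBDA_HAT) / SIGMA) ** 2)


def uniform_prior(ecs):
    """Flat prior in ECS over the grid range."""
    return 1.0


def inverse_square_prior(ecs):
    """Prior proportional to 1/ECS^2 (flat in the feedback parameter)."""
    return 1.0 / ecs ** 2


def posterior_quantiles(prior, probs, lo=0.05, hi=20.0, n=20000):
    """Approximate posterior quantiles of ECS on a midpoint grid."""
    step = (hi - lo) / n
    grid = [lo + (i + 0.5) * step for i in range(n)]
    weights = [prior(s) * likelihood(s) for s in grid]
    cdf = list(accumulate(weights))  # unnormalized cumulative posterior
    total = cdf[-1]
    return [grid[bisect_left(cdf, p * total)] for p in probs]


probs = [0.5, 0.95]
q_uniform = posterior_quantiles(uniform_prior, probs)
q_inverse = posterior_quantiles(inverse_square_prior, probs)
print("uniform in ECS: median, 95th percentile =", q_uniform)
print("1/ECS^2 prior:  median, 95th percentile =", q_inverse)
```

In this toy setup the likelihood flattens out to a nonzero value as ECS grows, so under the uniform prior the upper quantiles also depend on where the grid is cut off (the `hi` argument), whereas the 1/ECS^2 prior pulls the tail down.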

I accept that Jeffreys’ prior depends on the likelihood function, and hence on various assumptions, and I agree that one needs to understand how those assumptions affect the conclusions. In observationally-based climate sensitivity estimation, usually the likelihood is normal or t-distribution in the observations, or in some simple transformation thereof. I usually carry out the Bayesian inference in that parameterization, where the choice of prior appears more straightforward, and obtain the Jeffreys’ prior for the climate system parameters being estimated by carrying out a change of variable. (Some climate sensitivity studies by other scientists use the same method, but without saying that they are using a Bayesian approach.) The assumptions about the likelihood function for the observables and those about the relationship between the observables and the climate system parameters (model accuracy) can then be examined separately, and some sensitivity testing carried out.

Also, nuisance parameters don’t seem to be a particular problem in the studies I have carried out, I think partly because (unlike some other Bayesian studies) they estimate no more than two other uncertain parameters in addition to climate sensitivity and partly because plug-in estimates of observational etc. uncertainty taken from other sources are used, as is usual in such studies.

Lots to think about here. 20 or 30 years ago we would’ve agonized over what’s the appropriate noninformative prior, but now we’re all more comfortable talking about prior distributions as encoding useful information, and priors regularizing by downweighting scientifically implausible regions of parameter space.

9 thoughts on “More on prior distributions for climate sensitivity”

  1. Maybe I’m misunderstanding what Nic is suggesting, but that we may have to make difficult decisions on the basis of this research, should not – IMO – influence how we do the research. Estimating ECS, for example, should be something we do in order to gain understanding of how our climate may respond to increasing anthropogenic forcings. We may then make difficult decisions on the basis of that understanding, but I don’t think the choice of prior should be influenced, in any way, by this possibility – assuming that I have understood Nic’s justification.

    • Right, we shouldn’t use the Jeffreys prior simply because it downweights the upper tail like 1/x^2 and that in turn reduces the range of stuff we need to consider in the decision.

      If it’s scientifically appropriate based on an understanding of climate science to use a prior that falls off like 1/x^2 then we should use it, and take whatever good or bad comes out of the decision process.

      However, I think what he’s suggesting is more mild. We SHOULDN’T use things that don’t fall off fast enough (such as uniform) because they don’t correspond to the science, AND they have bad decision properties.

      If they didn’t correspond to the science, but they had no real effect on the decisions… then we could argue that we just don’t need to have the science that carefully tuned, the decision is invariant to our gross approximations. But that’s not the case here. I think he’s arguing for the Jeffreys prior over the uniform, because it doesn’t pollute the decision as badly as the uniform.

      • However, I think what he’s suggesting is more mild. We SHOULDN’T use things that don’t fall off fast enough (such as uniform) because they don’t correspond to the science, AND they have bad decision properties.
        Except, as I understand it, Nic Lewis is simply now using a uniform prior in climate feedback parameter, rather than a uniform prior in ECS itself. So, I think one could make the same argument about this uniform prior (wrt climate feedback parameter) as one could make for a uniform prior in ECS.

        I think, though, that Nic is probably right that the uniform prior in climate feedback parameter only becomes really unphysical in regions where the likelihood function declines sharply anyway, but that doesn’t seem like a very strong argument for using it instead of a prior that was motivated by some prior understanding.

      • Of course, we should not change science because we don’t like the results.

        But the story here is probably more nuanced. Usually, what we want to get right first is the order of magnitude, then the position of the maximum/average/median/whatever, then the spread (standard deviation, interquartile range, yadda, yadda). Then tails. It might happen though that, because of the costs of the decision, we should actually focus our attention on getting the tails right first. That would be a huge shift from the usual point of view, but maybe worth exploring ($10T sorta grabs one’s attention).

  2. Were 20 years ago that much of the dark ages? Is the concept of “prior distributions as encoding useful information” such a revolutionary / recent concept?

  3. FWIW, about six months ago a similar exercise was set off by Cliff Asness and Aaron Brown. Now Eli, were he so inclined, could provide a list of several interesting responses, but perhaps a link to one of them (which links to the others) would be a good place to start

    Mark Buchanan (down one on the food chain) summed the situation up:
    Anyway, for clarification, let’s use an analogy. Imagine there’s a black box with a red light fixed to it. The light flashes every second or so on average, but in a highly irregular and unpredictable way. Some people argue that the flashing is getting more frequent with time, and showing larger fluctuations from its average behavior. Others say, no way, that’s an illusion, it’s always been irregular and these apparent changes are only insignificant and temporary fluctuations.

    Two sets of people set to work to figure out if the pattern of flashing really is changing, and to predict how much we should expect it to change in the near future, if at all. The two teams go about their work in very different ways. Team A decides to work just with the mathematical pattern of flashing recorded over a not-too-distant interval of the past — say, one week. The other team, Team B, also uses that information, but decides to supplement it with other recordings of the flashing pattern from further in the past, some going back months, even years. Team B goes further too, using X-ray, MRI and ultrasound imaging of the box to work out a detailed, but certainly incomplete, picture of what goes on inside the box — gears and electronics and other stuff — to produce the flashing. They do experiments outside of the box to tease apart these mechanisms, and to get insight into how different mechanisms might interact within the box.

    As the system turns out to be highly complex, Team B also starts to build replicas of the box, as well as large-scale computer models designed to simulate the interplay of all the mechanisms inside of the box. They test and refine both the replicas and the simulations over time using real data from the box. The members of Team B, knowing how easy it is for people to confuse themselves, and to believe they understand more than they really do, also split themselves into a number of sub-teams which compete against each other on standard data sets so they can get objective measures of improvement of these simulations over time. Who can run a simulation, based on plausible mechanisms, which can reproduce what the box did between 10 and 12 weeks ago? How does a model, trained on that interval, do if applied to other intervals later on? In this way, Team B slowly builds up a capacity for understanding what goes on in the box, and for predicting how it will likely behave next.

    Now, suppose Team A and Team B make predictions for what they think is most likely to happen to the flashing pattern in the near future — say over the next 5 weeks. Both would acknowledge that the task is difficult given the complexity of the system. But which team do you think is more likely to make the better prediction? I think most people would naturally choose Team B, as they’re using a much richer set of information and data about the box and its behavior than Team A. They’re taking into account lots of things that Team A is not. Usually, the more information one brings to bear on a problem, the better one does on that problem. Indeed, most of the theories developed by Team A based on the short time series alone can be immediately shown to be highly unlikely by comparison with other data studied by Team B.

  4. > “Unfortunately, observational uncertainty is high, and the choice of prior has a substantial effect on the posterior density when estimating climate sensitivity. The value of perfect information about a less uncertain parameter that is primarily determined by climate sensitivity was estimated in a recent paper as $10 trillion, so prior selection for this problem is quite an important issue.”

    I think there may be a fundamental misunderstanding here. Look at equation three in this paper which is used to get what they call a pdf for the climate sensitivity:

    It has variables deltaT0 = constant and deltaT = sensitivity to doubling of CO2 (i.e., what is being estimated). Then there are f_hat and sigma_f, where f_hat is an estimated feedback factor and sigma_f is its standard deviation. As shown in their equation 1, these values are derived from a linear regression of annual temperature anomaly vs. radiative forcing; the lambda0 there is also a constant. My point is that the only real variables are f and sigma_f. Here, f = 1 – constant*(slope of the linear regression).

    It seems sigma_f is simply the residual standard error of this linear regression. This is a property of the linear fit and data, it will not become more precise by collecting more data (unless the relationship between temp anomaly and forcing becomes more linear). In other words, the range of values for climate sensitivity is a derived property of the data (plus assuming a linear relationship between temp and forcing). It is not supposed to get more precise any more than standard deviation is supposed to approach zero as you collect more data. The priors are supposed to be used on f and sigma_f, not on climate sensitivity.

    Further, simulating climate with complicated GCMs will not decrease sigma_f and thus not increase precision of the sensitivity estimate. The net radiative forcings for each year are inputs, the models use these to conserve total energy. The GCMs are also trained on the historical temperature data. Anyway, we don’t want them to give a more precise distribution, that would mean they do not fit the observations well.

    Anyone want to explain how I am misinterpreting this?
