Daniel Lakeland asks, “Where do likelihoods come from?” He describes a class of problems where you have a deterministic dynamic model that you want to fit to data. The data won’t fit perfectly so, if you want to do Bayesian inference, you need to introduce an error model. This looks a little bit different from the usual way that models are presented in statistics textbooks, where the focus is typically on the random error process, not on the deterministic part of the model. A focus on the error process makes sense in some applications that have inherent randomness or variation (for example, genetics, psychology, and survey sampling) but not so much in the physical sciences, where the deterministic model can be complicated and is typically the essence of the study. Often in these sorts of studies, the staring point (and sometimes the ending point) is what the physicists call “nonlinear least squares” or what we would call normally-distributed errors. That’s what we did for our toxicology and dilution-assay models. Sometimes it makes sense to have the error variance scale as a power of the magnitude of the measurement. The error terms in these models typically include model error as well as measurement variation. In other settings you might put errors in different places in the model, corresponding to different sources of variation and model error. For discrete data, Iven Van Mechelen and I suggested a generic approach for adding error to a deterministic model, but I don’t think this really would work with Lakeland’s examples.

You might wish to look (again) at the work of Gauss; David Sprott has written about him on this topic.

Before the quantitative masters were distracted by Peirce’s spreading doubt about whether “all clouds were clocks.”

I know it is just a typo, but “the staring point” is a great phrase. It means that part of the paper that grabs your attention and won’t let go because it just doesn’t make sense, right?

I think that this is a fascinating topic with many practical applications. People in fields such as civil, aerospace, or chemical engineering spend a great deal of time creating very complex deterministic models representing the behavior of their systems. Then comes the time to infer the values of some of the parameters of the model from observations, and it’s probably not the best idea to just minimize the sum of the squared errors!

I agree with Daniel that the likelihood is the tricky bit when performing Bayesian inference using deterministic models. It is generally possible to specify vague priors based on domain knowledge. Also, if we are not hardcore Bayesians, we can always interpret those priors as a way to regularize the inference…

I quite like the work by Tony O’Hagan and collaborators (e.g. http://goo.gl/WUfBk) where they develop a method to perform Bayesian inference to find values for the parameters of complex deterministic models (i.e. to calibrate the models). Those methods deal with continuous magnitudes and also provide answers to a problem raised by Daniel in his post: MCMC inference may need thousands of evaluations of the original deterministic model and a single run of a large ocean fluid flow model could take a day! Gaussian processes are used to approximate the deterministic model within the probabilistic model that encapsulates it.
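To make the emulation idea concrete, here is a minimal sketch (with a cheap stand-in simulator rather than an ocean model) of replacing an expensive deterministic model with the posterior mean of a Gaussian process under a squared-exponential covariance. All names and numbers here are invented for illustration, not taken from O’Hagan’s actual method:

```python
import numpy as np

def sq_exp_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_emulate(x_train, y_train, x_new, jitter=1e-6):
    """Posterior mean of a zero-mean GP fit to a few expensive model runs."""
    K = sq_exp_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    k_star = sq_exp_kernel(x_new, x_train)
    return k_star @ np.linalg.solve(K, y_train)

# Pretend this is the expensive simulator (here just a cheap stand-in):
simulator = np.sin

x_design = np.linspace(0.0, 3.0, 20)   # a handful of "expensive" runs
y_design = simulator(x_design)

x_query = np.array([1.234])            # cheap emulator call inside MCMC
print(gp_emulate(x_design, y_design, x_query))
```

The point is that each MCMC step queries the cheap GP surrogate instead of re-running the day-long fluid model.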

One thing that still puzzles me is that the likelihood should not only model the uncertainty in the measurements but also any sort of structural deficiency in the deterministic model. There could be simplifications in the modeled physics, simplifications in the numerical methods used to solve the physics or even small bugs in the code! How does one even try to introduce those things in the likelihood function?

Roger Frigola: > There could be simplifications in the modeled physics, simplifications in the numerical methods used to solve the physics or even small bugs in the code! How does one even try to introduce those things in the likelihood function?

That is why I feel less nervous if I also model the motivated actor who uses the model under consideration to take decisive action towards a measured satisfaction with the result. Then hit that “meta-model” with stress tests entirely in the computer, in small-scale experimental set-ups, and in small-scale real situations.

And if you find this “meta-model” to be too hairy to even imagine managing successfully, why were you so confident about managing successfully the original model? ;-)

I have exactly this problem right now. The data are measurements of pressure differences between inside and outside a house, and inside and outside a garage, as well as a flow rate into the house (through a “blower door”, basically a fan that pressurizes the house). Of course the house-garage pressure difference can also be inferred from the house-outdoor and garage-outdoor pressure differences. The fan is used at several flow rates, and under several different conditions (e.g. garage door open or closed). The goal is to estimate six physical parameters that describe the system. The pressures are measured at point locations. The indoor-outdoor pressure measurement (and the others) generally differs from the actual pressure difference averaged over the surface of the house, because the measurement will depend on wind speed and direction. Also, the pressure measurements vary in time (even when the surface-area-average pressure difference does not) because of changes in wind speed and direction during the test. In a phenomenon similar to “regression dilution,” simply using the best-fit parameters leads to a systematic mis-estimation of the parameters that describe the system; that is, a bias. Additionally, the deterministic model, though good, is known to not be perfect, so there is model misspecification error as well.

I have tried to fit what I think is an appropriate model in JAGS, but can’t get it to work. I haven’t (at least, not yet) found an example problem that is close enough for me to base it on. I am about to program it up myself.

At any rate it would be great if there were standard methods and standard software for handling this sort of thing.

Phil:

If you’re having problems getting it to run in Jags, you should (soon) try Stan!

Phil,

I had to deal with a model that was very similar to the one you describe. If your model doesn’t work I suggest you make it as simple as possible. Then, once it’s returning sensible results, you can increase its complexity.

You are trying to estimate six physical parameters based on a bunch of observations. First thing you need is a prior probability density over those parameters that represents your uncertainty before seeing any data. To keep things simple you can start with a normal distribution centered at six values of the parameters that you know are feasible and a diagonal covariance matrix with large variances for each of the parameters. It’s better to use variances that are too large than too small!

Then comes what I think is the most difficult bit for someone not used to probabilistic modeling: the likelihood function. Even if your underlying model is deterministic, the likelihood will encapsulate it and provide a model of uncertainty in the measurements and uncertainty in the deterministic model. Again, to keep things very simple you can use a Gaussian likelihood which could look like this: Normal( measurements | virtualMeasurements(params), Sigma). Here virtualMeasurements(params) would consist of simulated values of the measured magnitudes provided by the deterministic model as a function of the parameters. Sigma is the covariance matrix representing (amongst other things) measurement error. You can start with a diagonal matrix where you set the variances to what you think is a sensible error in the measurements. This provides a probabilistic model representing your error mechanism: “I, too, have a deterministic model that has no built-in error mechanism, yet I do not expect my data to exactly be fit by the model”.
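A minimal sketch of that setup, with a made-up two-parameter linear model standing in for virtualMeasurements and invented data, might look like this:

```python
import numpy as np
from scipy.stats import multivariate_normal

def virtual_measurements(params):
    """Hypothetical deterministic model: maps parameters to predicted data."""
    a, b = params
    x = np.array([1.0, 2.0, 3.0])
    return a * x + b                      # stand-in for the real physics code

measurements = np.array([2.1, 4.0, 5.9])  # observed data (made up)
Sigma = 0.1 ** 2 * np.eye(3)              # diagonal measurement-error covariance

def log_likelihood(params):
    """log N(measurements | virtual_measurements(params), Sigma),
    viewed as a function of params with the data held fixed."""
    return multivariate_normal.logpdf(measurements,
                                      mean=virtual_measurements(params),
                                      cov=Sigma)

print(log_likelihood([2.0, 0.0]))  # higher than for badly mismatched params
```

Parameters near the data-generating values get a much higher log-likelihood than implausible ones, which is all an MCMC sampler needs.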

Note that this expression of the likelihood is actually a function of the parameters: the measurements are fixed values and, in this simple model, you have to set the values of the matrix Sigma.

Proper statisticians will probably frown at something this simple but I think it can be a useful first test to see if your setup is working.

Andrew, I would be happy to try Stan but I think the problem is with my model specification, not with the software. It’s only about 15 lines so you’d think it wouldn’t be so hard, yet I struggle.

But my main point is that I, too, have a deterministic model that has no built-in error mechanism, yet I do not expect my data to exactly be fit by the model, so I need to figure out an approach. It’s a common real-world issue.

Phil:

Model specification shouldn’t be a problem with Stan, you should be able to write the log-joint density directly. (Once Stan is released, that is.)

Phil, I would be interested to hear more specifics about your problem. It’s entirely possible to get into a situation where the dynamic or deterministic model error is the main issue. For example, a temperature measurement can be made with a digital thermometer whose measurement errors are, say, normal ±0.01 °C, and yet your model might, for the optimal parameter values, predict say 2 degrees of error. It’s totally implausible to attribute this to measurement error (hundreds of sigmas away from the measured value), but it might actually be very good in the context of your model, which leaves out lots of physical effects.

But when the main source of error is an accumulation of missing physical effects, it can be totally implausible to use something like an iid normal error model. But then, what IS the proper likelihood? This is the kind of question I was trying to get at in my blog post.

Daniel,

What about explicitly modeling the discrepancy between the full physics and the deterministic model? I guess one could model this discrepancy in the likelihood by using a new function. This function could have a number of new parameters in such a way that the final posterior is not only over the parameters of the deterministic model but a joint posterior between those and the parameters of the discrepancy function. One could then integrate out the parameters of the discrepancy function to obtain the posterior over the parameters of the deterministic model.

In your example you could suspect that the error is not just iid noise but that your deterministic model has, say, a constant bias. You could add this bias to the likelihood function, put a prior over it, and compute the joint posterior p(parameters,bias|observations), which you can then marginalize to obtain p(parameters|observations). This would be a very simple discrepancy model, but one could imagine others using more parameters, or even a non-parametric one.
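As a toy illustration of this bias-marginalization idea (with a made-up one-parameter linear model, invented data, and a brute-force grid standing in for MCMC):

```python
import numpy as np
from scipy.stats import norm

# Toy deterministic model with one parameter, plus an unknown constant bias.
x = np.array([1.0, 2.0, 3.0, 4.0])
y_obs = 2.0 * x + 0.5 + np.array([0.05, -0.03, 0.02, -0.04])  # bias is 0.5

theta_grid = np.linspace(1.5, 2.5, 201)   # deterministic-model parameter
bias_grid = np.linspace(-1.0, 1.0, 201)   # discrepancy parameter

# log p(theta, bias | y) up to a constant: vague Normal(0, 5) priors,
# Normal(prediction, 0.1) likelihood
T, B = np.meshgrid(theta_grid, bias_grid, indexing="ij")
log_post = norm.logpdf(T, 0.0, 5.0) + norm.logpdf(B, 0.0, 5.0)
for xi, yi in zip(x, y_obs):
    log_post += norm.logpdf(yi, T * xi + B, 0.1)

post = np.exp(log_post - log_post.max())
post /= post.sum()

# Marginalize out the bias to get p(theta | y)
p_theta = post.sum(axis=1)
theta_hat = theta_grid[np.argmax(p_theta)]
print(theta_hat)   # near the true value 2.0
```

The grid is only feasible because the toy model has two unknowns; with a real model you would marginalize via MCMC instead.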

Although this seems reasonable to me in theory, I wonder if someone has written about doing it for very large models (e.g. environmental, aerodynamic…)?

For some kind of constant bias, this idea is pretty straightforward. But suppose you’re observing a dynamic system, such as for example a building being shaken by an earthquake. You have stiffnesses and interconnectedness of the various structural members, and you have the input ground motion measured by a seismometer, and you have displacements of the various floors measured by seismometers on various floors of the building. What you don’t have is the wind patterns during the earthquake, nor do you take into account certain types of structural damage (perhaps cracking of the plaster, or loosening of certain bolts or nails or screws or whatever).

Now the wind and the damage could be considered as random processes. Perhaps you can even write a stochastic differential equation for the motion of the building, and you can choose some kind of noise term for the stochastic differential equation based on physical principles. Now you’re in the best possible situation with respect to model specification, and yet you’re in trouble.

Why? Because, as Keith says below, the likelihood is the probability (density in the vicinity of) the observations given the parameter values and the probability model. The probability model is our stochastic differential equation, and in any given MCMC draw the parameters are given, but what’s the density in the vicinity of the measurements? The straightforward way would be to run a large number of realizations of your stochastic process and then, at the time points of the observations, calculate the density via, say, kernel density estimation from the simulated observations. But this involves running say 100 or 1000 stochastic processes PER MCMC step, and then you’re going to want to run thousands of MCMC steps. Holy cow, that could be a LOT of computation!
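A sketch of that brute-force scheme for a toy Ornstein-Uhlenbeck-type SDE and a single observation time (everything here, including the observation, is invented for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

def simulate_paths(theta, n_paths=500, n_steps=100, dt=0.01, x0=1.0):
    """Euler-Maruyama realizations of dX = -theta*X dt + 0.2 dW."""
    x = np.full(n_paths, x0)
    for _ in range(n_steps):
        x += -theta * x * dt + 0.2 * np.sqrt(dt) * rng.standard_normal(n_paths)
    return x   # state at the observation time t = n_steps * dt

def approx_log_likelihood(theta, x_obs):
    """Kernel density estimate of p(x_obs | theta) from simulated endpoints."""
    endpoints = simulate_paths(theta)
    return np.log(gaussian_kde(endpoints)(x_obs)[0])

x_obs = 0.35   # a single made-up observation at t = 1
print(approx_log_likelihood(1.0, x_obs))
```

Each call already simulates 500 paths; nesting this inside thousands of MCMC steps, with many observation times, is exactly the computational blow-up described above.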

Whereas Andrew says “you should be able to write the log-joint density directly” in fact the point is that we often have no closed form for the log joint density in these situations!

Phil-

I worked on this exact problem maybe 15 years ago, including even more complex versions involving additional zones such as duct systems (and actually invented the testing methods). The approach I took was generalized least squares adjustment as in Deming’s book Statistical Adjustment of Data (based on Gauss really) and described in more modern detail in Observations and Least Squares by Mikhail. You need to specify uncertainties in all of the measurements and can specify a full var-covar matrix if you have knowledge of correlations in the errors. The output includes estimates of the true measurements and the parameters and the uncertainties in all of them (including a posterior var-covar matrix, based on the usual assumptions). I actually implemented a version of this in Stata (before the advent of Mata, which would make it quicker) but you can even set this up to use the solver in Excel to get the point estimates. One fascinating aspect of the least squares adjustment approach is that there is no sharp distinction between measurements and parameters — you just tend to have much greater a priori uncertainty in the parameters.
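For the flavor of the computation: below is a rough sketch of generalized least squares with prior information on the parameters, where the large prior variances are what blur the measurement/parameter distinction. The data and matrices are made up, and a real adjustment à la Deming would also update the measurements themselves:

```python
import numpy as np

# Measurement model y = X @ beta + e, e ~ N(0, Sigma),
# with an a-priori value b0 for beta whose large variance P0 encodes
# "parameters are just quantities we are very uncertain about".
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])
y = np.array([3.1, 4.9, 7.2])
Sigma = 0.1 ** 2 * np.eye(3)         # measurement var-covar matrix
b0 = np.zeros(2)
P0 = 100.0 * np.eye(2)               # large prior variances on the parameters

Si = np.linalg.inv(Sigma)
P0i = np.linalg.inv(P0)

# Generalized least squares with prior information (normal equations)
post_cov = np.linalg.inv(X.T @ Si @ X + P0i)
beta_hat = post_cov @ (X.T @ Si @ y + P0i @ b0)
print(beta_hat)   # slope and intercept estimates; post_cov is their var-covar
```

Shrinking the variances in P0 would treat the “parameters” more and more like measured quantities, which is the fascinating aspect mentioned above.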

Yes, the only difference between measurements and parameters is for those who think that some things are (and other things are not) ‘observable’.

I am a bit confused here, as I think of likelihood as a _probability_ of observations given parameter values and a probability model!

So you will need a probability model for a discrepancy term…

Susie Bayarri presented something on this a couple of years ago:

http://www.samsi.info/sites/default/files/10PKPD_CoMo101.pdf

Exactly, and typically in these situations you have definitions for the parameters, and if you’re lucky you have a good set of observations, but the missing component is the probability model! In many situations the probability model is a real head scratcher.

My last comment above discusses a probability model for a very simple discrepancy term: a constant bias. The bias becomes a new parameter that is estimated in the same way as the parameters in the deterministic model. There is a prior over the bias and the bias needs to be introduced in the likelihood: p(observations|parameters,bias).

You can see this sort of approach in Susie Bayarri’s slide #5 where rather than being applied to a bias it is applied to the noise parameter lambda which is inferred at the same time as the parameters of the deterministic model.

Thanks for the link, it is quite interesting and relevant. One question though, is GaSP jargon for Gaussian Stochastic Process or is it something else?

Daniel,

GaSP here means Gaussian Process which, in this case, has a squared exponential covariance function.

The earthquake model you described above does look complex! I was thinking about slightly more “deterministic” models. For instance I could imagine a simple model of a metallic structure where we are uncertain about the values of the stiffness and damping of several of its elements. We perform a number of tests where we excite the structure at one location with sine waves of different frequencies and magnitudes and we measure the movement at other locations within the structure. Our simple deterministic model can be evaluated as the following function:

amplitudes = f(excitationAmplitude, excitationFreq, stiffnesses, dampings)

Now this could be a very simple model with linear stiffness and viscous damping which does not capture amplitude-dependent effects. If the real structure actually exhibits amplitude-dependent effects it would be a good idea to tell the likelihood that our deterministic model is actually not that great.

Therefore, the likelihood contains 1) a model of the discrepancy between the deterministic model and reality and 2) a model of the measurement error. The likelihood is of course a probability:

p(measurements | stiffness, damping, discrepancyFunctionParameters, measurementNoise)

This represents the case where we have specified a parametric discrepancy function. Gaussian processes would be the nice non-parametric alternative.
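As a sketch of what such a likelihood might look like for the sine-excitation example, with a hypothetical one-mode linear model and a quadratic-in-amplitude discrepancy term (all names and numbers invented for illustration):

```python
import numpy as np
from scipy.stats import norm

def f(excitation_amplitude, excitation_freq, stiffness, damping):
    """Hypothetical one-mode linear model of steady-state response amplitude."""
    omega = 2.0 * np.pi * excitation_freq
    return excitation_amplitude / np.sqrt((stiffness - omega ** 2) ** 2
                                          + (damping * omega) ** 2)

def log_likelihood(measurements, excitations, freqs,
                   stiffness, damping, discrepancy_coeff, noise_sd):
    """Deterministic prediction plus a parametric discrepancy term that grows
    with excitation amplitude (standing in for missed amplitude-dependent
    effects), wrapped in an iid Gaussian measurement-error model."""
    predicted = f(excitations, freqs, stiffness, damping)
    predicted = predicted + discrepancy_coeff * excitations ** 2
    return norm.logpdf(measurements, predicted, noise_sd).sum()

# Made-up test data generated from the model itself:
excitations = np.array([1.0, 2.0, 3.0])
freqs = np.array([0.5, 1.0, 1.5])
measurements = f(excitations, freqs, 100.0, 2.0) + 0.01 * excitations ** 2

print(log_likelihood(measurements, excitations, freqs,
                     100.0, 2.0, 0.01, 0.05))
```

Here discrepancy_coeff is inferred jointly with stiffness and damping; replacing the quadratic term with a Gaussian process would give the non-parametric alternative.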

