Bayesian model-building by pure thought: Some principles and examples

This is one of my favorite papers:

In applications, statistical models are often restricted to what produces reasonable estimates based on the data at hand. In many cases, however, the principles that allow a model to be restricted can be derived theoretically, in the absence of any data and with minimal applied context. We illustrate this point with three well-known theoretical examples from spatial statistics and time series. First, we show that an autoregressive model for local averages violates a principle of invariance under scaling. Second, we show how the Bayesian estimate of a strictly-increasing time series, using a uniform prior distribution, depends on the scale of estimation. Third, we interpret local smoothing of spatial lattice data as Bayesian estimation and show why uniform local smoothing does not make sense. In various forms, the results presented here have been derived in previous work; our contribution is to draw out some principles that can be derived theoretically, even though in the past they may have been presented in detail in the context of specific examples.

I just love this paper. But it’s only been cited 17 times (and four of those were by me), so I must have done something wrong. In retrospect I think it would’ve made more sense to write it as three separate papers; then each might have had its own impact. In any case, I hope the article provides some enjoyment and insight to those of you who click through.

5 thoughts on “Bayesian model-building by pure thought: Some principles and examples”

  1. Great paper. I love Jaynes’s solution to Bertrand’s paradox (“The Well-Posed Problem”) by invariance considerations. In that case there were enough invariance requirements to determine the needed distribution uniquely.

    But what I really liked about it was the deeper intuition that there is a frequency connection to this derivation from “invariance”. Namely, the frequency distribution you would actually get will resemble the theoretical distribution if the person performing the experiment is unskilled (has no fine control over the initial conditions of each “trial”).

    Or, alternatively, you could turn this around and think of it in terms of model checking: if the frequency distribution differs substantially from the theoretical one derived from invariance considerations, then the experimenter must have been highly skilled and able to tightly control the initial conditions. (A quick simulation of the “unskilled” case is sketched at the end of this comment.)

    Anyway, there is enough in that one little paper to drive just about every stripe of Statistician or Philosopher nuts.
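
    To make the frequency connection concrete, here is a rough Python sketch (my own illustration, not from Jaynes’s paper; the setup and numbers are made up): an “unskilled” thrower tosses lines at a circle with no fine control over where they land, and the resulting chord-length frequencies come out close to the invariance-derived answer of 1/2, whereas the “two random endpoints” recipe gives about 1/3.

    import numpy as np

    rng = np.random.default_rng(0)
    R = 1.0                        # circle radius
    side = np.sqrt(3) * R          # side of the inscribed equilateral triangle
    n = 200_000

    # "Unskilled thrower": each line is a random point in a large square plus a
    # uniformly random direction, i.e., no fine control over where it lands.
    p = rng.uniform(-10, 10, size=(n, 2))
    angle = rng.uniform(0, np.pi, size=n)
    u = np.column_stack([np.cos(angle), np.sin(angle)])

    d = np.abs(p[:, 0] * u[:, 1] - p[:, 1] * u[:, 0])   # distance from center to each line
    hit = d < R                                         # lines that actually cross the circle
    chord = 2 * np.sqrt(R**2 - d[hit]**2)
    print("straw tossing:    P(chord > side) =", np.mean(chord > side))   # roughly 1/2 (Jaynes's answer)

    # Contrast: the "two random endpoints on the circle" recipe.
    a, b = rng.uniform(0, 2 * np.pi, size=(2, n))
    chord2 = 2 * R * np.abs(np.sin((a - b) / 2))
    print("random endpoints: P(chord > side) =", np.mean(chord2 > side))  # roughly 1/3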

  2. Continuum modeling is something I’ve been thinking about a lot recently. In most areas of physics a continuum model can only be interpreted as a statistical model for some regional average; after all, the world is made up of discrete particles. Thinking about the consequences of choosing a scale for representing the continuum leads to some interesting ways of deriving new continuum models for phenomena not previously considered.

    Continuum models are what Grigory Barenblatt calls an “intermediate asymptotics”. The scale size has to be big enough that statistical averages over the particles are relatively meaningful, and yet small enough that they represent a level of detail which is much smaller than the entire body of interest. So long as you’re away from these two endpoints, your model has a chance of being meaningful.

    Your concept of classes of models which are closed under scaling is also relevant here. Sometimes we define equations that involve material properties whose measured values would change if the measurement were made at a smaller or larger scale. In other words, scale can itself become important in the description of the model.

  3. Very interesting to this non-statistician. I really learned something from this.

    This kind of discretization problem does come up in my own field of population ecology, as many species reproduce and die continuously but their population sizes are only sampled at discrete, arbitrarily-chosen intervals. My own preferred approach in such situations has been to fit a mechanistic, continuous-time model. I’d been aware that fitting an AR model in such situations isn’t advisable because your parameter estimates are sensitive to the arbitrary choice of discretization. But now I have a much deeper sense of why it’s not advisable (a quick numerical illustration follows at the end of this comment).

    Another population ecology context in which this comes up is the use of matrix models, which divide a population into discrete classes, with all the members of a given class assumed to have the same per-capita birth and death rates: for instance, juveniles and adults. You run into problems if birth and death rates actually vary continuously, e.g., as a function of a continuous variable like body size. For a while ecologists put a fair bit of work into trying to figure out the “optimal” discretization in such situations (e.g., how many size classes should you define?). But fortunately nowadays we have approaches like integral projection models, which respect the continuous nature of the underlying biology.
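
    To illustrate the AR-model sensitivity mentioned above, here is a rough Python sketch (my own, with made-up parameter values, not taken from any real data set): log-abundance is simulated from a continuous-time Ornstein-Uhlenbeck process and then “censused” at several arbitrary intervals. The fitted AR(1) coefficient changes with the census interval, while the implied continuous-time return rate stays essentially constant.

    import numpy as np

    # Continuous-time Ornstein-Uhlenbeck dynamics as a stand-in for log-abundance:
    #   dX = -theta * X dt + sigma dW, with return rate theta = 0.7 per year.
    rng = np.random.default_rng(1)
    theta, sigma, dt, T = 0.7, 0.5, 0.01, 2000.0
    n_steps = int(T / dt)
    x = np.empty(n_steps)
    x[0] = 0.0
    for t in range(1, n_steps):
        x[t] = x[t-1] - theta * x[t-1] * dt + sigma * np.sqrt(dt) * rng.standard_normal()

    # "Census" the same population at different, arbitrary sampling intervals (in years)
    # and fit an AR(1) model to each resulting series.
    for delta in (0.25, 0.5, 1.0, 2.0):
        y = x[::round(delta / dt)]
        phi = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)   # least-squares AR(1) coefficient
        print(f"interval {delta:.2f} y: AR(1) coef = {phi:.2f}, "
              f"implied continuous-time rate = {-np.log(phi) / delta:.2f}")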

  4. Re the 4 cites – take heart – sometimes it takes people time to catch up.
    I’m in biotech, and a well-known prof, H. Schachman (from Berkeley, I may not have the last name right), has this funny graph (in a very amusing volume from the Annals of the NY Academy of Sciences, on the history of protein science), with probability of funding (NIH) on the X axis and originality on the Y axis.
    The shape of the line is an inverted “U”, i.e., if you are too original, no funding….

  5. This is all certainly very interesting. A relevant paper is Peter McCullagh’s “What Is a Statistical Model?” (2002), Annals of Statistics. In that paper he finds it necessary to use category theory, but it seems that most concrete examples can be treated much more simply. Here is one example of mine, encountered when I worked on some data on the health effects of PM10 air contamination in La Paz, Bolivia.

    We used a Poisson regression model, but the point is the use of a model with a multiplicative structure. The engineers asked why we didn’t simply use linear regression, and the answer is that then the parameters are not interpretable!

    This is easy to see:
    Let Y_i be the number of cases attending a certain health center on day i, and x_i a vector of covariates. The model is
    \[
    \Pr(Y_i = y_i \mid x_i) = e^{-\lambda_i} \frac{\lambda_i^{y_i}}{y_i!},
    \]
    where \lambda_i = \exp(x_i^T \beta) = \exp(\eta_i)
    and \eta_i is the linear predictor.

    Let the (in practice unknown) size of the population that uses the health center when sick be A_i, and now think of a similar center (same covariate values) but with population base f A_i.

    The expected number of visits for this center becomes f \lambda_i = \exp(\log(f) + \eta_i),
    so the population base only influences the intercept, leaving the other parameters with an interpretation
    that is the same for health centers of different sizes, which is vital when the data have been collected from different centers.

    So, what happens if we use a linear model? Then
    \[
    E(Y_i \mid x_i) = x_i^T \beta = \eta_i,
    \]

    and when scaling the model to the other center with population base f A_i
    the expectation becomes f x_i^T \beta = x_i^T (f\beta_0, f\beta_1, \dots, f\beta_p),
    so the population base multiplies all the parameters, making them in practice uninterpretable.

    This simple example has exactly the same structure as McCullagh’s example in his Section 8, “Extensive response variable”.
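
    As a numerical check of the calculation above, here is a rough simulation sketch (my own, with made-up numbers; it assumes numpy and the statsmodels package): two centers with identical covariate values but population bases A and 4A. The Poisson-regression slope comes out essentially the same for both centers (only the intercept shifts, by about log 4), while every coefficient of the linear fit is multiplied by roughly 4.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n_days = 5000
    pm10 = rng.normal(size=n_days)               # hypothetical standardized PM10 covariate
    X = sm.add_constant(pm10)
    beta0, beta1 = 1.0, 0.3                      # made-up "true" effects on the log scale

    for f in (1.0, 4.0):                         # population base A vs. 4A
        mu = f * np.exp(beta0 + beta1 * pm10)    # expected daily visits at this center
        y = rng.poisson(mu)

        pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # log link by default
        lin = sm.OLS(y, X).fit()
        print(f"f = {f}: Poisson (intercept, slope) = {np.round(pois.params, 2)}, "
              f"linear (intercept, slope) = {np.round(lin.params, 2)}")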
