Bayes-respecting experimental design and other things

Dan Lakeland writes:

I have some questions about some basic statistical ideas and would like your opinion on them:

1) Parameters that manifestly DON’T exist:

It makes good sense to me to think about Bayesian statistics as narrowing in on the value of parameters based on a model and some data. But there are cases where “the parameter” simply doesn’t make sense as an actual thing. Yet, it’s not really a complete fiction, like unicorns either, it’s some kind of “effective” thing maybe.

Here’s an example of what I mean. I did a simple toy experiment where we dropped crumpled up balls of paper and timed their fall times. (see here: http://models.street-artists.org/?s=falling+ball ) It was pretty instructive actually, and I did it to figure out how to in a practical way use an ODE to get a likelihood in MCMC procedures. One of the parameters in the model is the radius of the spherical ball of paper. But the ball of paper isn’t a sphere, not even approximately. There’s no single value of r which could be plausibly thought of as giving the right time history of distance vs time for the entire trajectory. So it’s not a matter of there being a true value that I need to zero in on, as would be if I wanted to measure the speed of light or the mass of an electron or something. In fact, it must be really just some r which gives me on average the right value for the trials I actually performed.

In what sense does it make sense to talk about the posterior distribution of this parameter r? I have my own ideas on the subject but I’m sure you’ve got something interesting to say here too. I’m sure in social sciences it’s frequently the case that your model is obviously wrong from the start, so zeroing on the “true” value of the parameter is in some sense meaningless, and yet it can’t be meaningless in every sense, or we wouldn’t do mathematical modeling at all!

2) Bayesian “Design of Experiments”

In what sense do classical design of experiments concepts apply to an experiment that will be analyzed in a Bayesian way especially with multilevel models? It seems that the prior, when it is at least somewhat informative, can make effects partially identifiable, or reduce confounding in many ways. For example suppose you’re dropping a ball without air resistance, Newton’s laws say the fall time for height h will be t = sqrt(2*h/g). Now suppose you don’t know g, and you don’t know how well calibrated your stopwatch is. There is no way to determine whether your clock reads consistently say 10% too long of a time, or if 1/sqrt(g) is 10% bigger than some nominal value that someone told you, both will generate data where t is on average 10% bigger than the prediction. On the other hand, when we put an informative prior around the g value, say it is normally distributed with about 1% error around the nominal value you were told, suddenly you can much better calibrate your watch!
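A minimal sketch of the stopwatch example, assuming a multiplicative calibration factor c so that the observed time is t = c * sqrt(2*h/g) plus timing noise. The drop height, noise level, number of drops, and grid bounds are all hypothetical choices for illustration. It computes the posterior for c on a grid, once with a flat prior on g and once with the 1% normal prior described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (none of these numbers come from the post).
G_NOMINAL = 9.81             # nominal g "someone told you", m/s^2
TRUE_C = 1.10                # stopwatch reads 10% long
NOISE_SD = 0.02              # timing noise per drop, seconds
heights = np.full(50, 2.0)   # 50 drops from 2 m

# Simulated stopwatch readings: t = c * sqrt(2h/g) + noise
t_obs = TRUE_C * np.sqrt(2 * heights / G_NOMINAL) + rng.normal(0, NOISE_SD, heights.size)

# Grid over the calibration factor c and gravity g
c_grid = np.linspace(0.5, 1.5, 401)
g_grid = np.linspace(5.0, 15.0, 401)
C, G = np.meshgrid(c_grid, g_grid, indexing="ij")

# Gaussian log-likelihood accumulated one observation at a time
log_lik = np.zeros_like(C)
for h, t in zip(heights, t_obs):
    log_lik += -0.5 * ((t - C * np.sqrt(2 * h / G)) / NOISE_SD) ** 2

def summarize_c(log_prior_g):
    """Posterior mean and sd of c, marginalizing g over the grid."""
    log_post = log_lik + log_prior_g[None, :]
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    marg_c = post.sum(axis=1)
    mean = np.sum(c_grid * marg_c)
    sd = np.sqrt(np.sum((c_grid - mean) ** 2 * marg_c))
    return mean, sd

flat_prior = np.zeros_like(g_grid)   # flat over the grid range
informative_prior = -0.5 * ((g_grid - G_NOMINAL) / (0.01 * G_NOMINAL)) ** 2

mean_flat, sd_flat = summarize_c(flat_prior)
mean_inf, sd_inf = summarize_c(informative_prior)
print(f"flat prior on g:        c = {mean_flat:.3f} +/- {sd_flat:.3f}")
print(f"informative prior on g: c = {mean_inf:.3f} +/- {sd_inf:.3f}")
```

With the flat prior, the posterior for c spreads along the whole ridge where c/sqrt(g) is constant, so its width mostly reflects the arbitrary grid bounds. The informative prior on g collapses that ridge and the calibration factor becomes well determined.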

In design of experiments often the goal is to reduce confounding while saving money by not having to do the full set of all different possible combinations of different levels of some factors of interest. This doesn’t seem to me that it translates directly over to Bayesian designs, because of course the classical results don’t incorporate the priors and they don’t incorporate the partial pooling of the hierarchical model.

The other thing that is a little backwards is that in Bayesian analysis, the data is fixed, and it’s the parameters we’d like to infer things about. But design of experiments is all about deciding to collect certain kinds of data, so about deciding what values of certain portions of the data you will see in your analysis. Knowing that you’re only going to get certain data values for certain types of data might change the model you choose to use!

If I have moderately informative priors about various things, and I have a preliminary multilevel model for the data that I’m going to collect (in other words if I have the opportunity to model the experimental process ahead of time), shouldn’t there be some way for me to design what data I collect such that my posterior distribution will be maximally informative? I doubt very much that this procedure would look like classical design of experiments. Much more like maybe some stochastic optimization procedure I’d think.

I replied:

1. In your example, radius is not defined, but one can use the diameter of the smallest sphere that circumscribes the object. But that doesn’t yet solve the problem, in that if you model the crumpled ball as a solid sphere of that size, you’ll get the wrong answer. You could, however, include the physical diameter as data and then estimate a calibration function (actually a joint distribution) relating the physical diameter to the “diameter” parameter estimated from your air-resistance model.

2. I agree with your general point that, if you plan a Bayesian analysis, this should affect your design. In many cases, I think Bayesian awareness will make designs more conservative. A classical statistician can design for 80% power or even roll the dice and do a study with 20% power. Traditionally the only reason not to run studies with 20% power is that it’s considered irresponsible to waste resources in that way (not to mention the ethics of trying untested medical treatments on volunteers, in a study that is unlikely to succeed). But the problem is really much worse than that. As Tuerlinckx and I wrote in our 2000 paper, in a low-powered study, it is highly probable that any “statistically significant” effects will be in the wrong direction (for example, Weakliem and I estimated the probability of such a Type S error to be in the neighborhood of 40% in the notorious Kanazawa study—whoops, I guess I now have to file this post in the Zombies category) and they will also typically way overestimate the magnitude of any effects.
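The Type S and exaggeration claims can be checked with a quick simulation. This is only a sketch: the true effect and standard error below are hypothetical values chosen to give a low-powered study, not the numbers from the Kanazawa analysis. It assumes a normally distributed estimate and a two-sided test at the 5% level.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical low-powered design: true effect is small relative to the standard error.
true_effect = 1.0
se = 5.0
n_sims = 200_000

# Sampling distribution of the estimate across replicated studies
est = rng.normal(true_effect, se, n_sims)

# "Statistically significant" at the two-sided 5% level
significant = np.abs(est) > 1.96 * se

power = significant.mean()
# Type S error: significant estimate with the wrong sign
type_s = np.mean(est[significant] < 0)
# Exaggeration: average magnitude of significant estimates relative to the truth
exaggeration = np.mean(np.abs(est[significant])) / true_effect

print(f"power        = {power:.3f}")
print(f"Type S rate  = {type_s:.3f}")
print(f"exaggeration = {exaggeration:.1f}x")
```

In this setup only a small fraction of replications reach significance, a sizable share of those have the wrong sign, and the significant estimates overstate the true effect many times over.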

So, yeah, a lot changes when you move the goalposts from “achieving statistical significance” to “achieving high posterior probability.” This point is missed on many people, I think, who notice the mathematical equivalence of p-values and posterior probability _for fixed prior, data, and sample size_ but do not see the importance of the distinction when it comes to experimental design.

Jennifer and I discuss experimental design with multilevel models in chapter 20 of our book, but we don’t get into these important issues.

P.S. Regarding unicorns, here’s something I’ve always wondered: Why don’t natural history museums have stuffed narwhals? I’d love to see one of those.

Lakeland followed up with a comment on my point 1 above:

In fact, what I did was lay the balls of paper on a tape measure, write down some number that seemed “about right” based on the visual “core” of the ball (ignoring small wings of paper that maybe stuck out) and then just took a “typical” value for that measurement as the center for my normally distributed prior distribution for all the balls with a variability in the prior that was a few tens of percent, similar in size to the actual variation in all the measurements, but not fit exactly to their sample standard deviation or anything like that. I actually took 3 measurements for each ball, and could use those to estimate a joint distribution of measurements and “actual” r in a more rigorous way, but there *is no* actual r.

Here’s something to think about. Suppose I just dropped one ball over and over. If it really were a perfect sphere, I would expect that after a large number of drops the posterior probability would be peaked around r* which would be very close to the value I would get from a micrometer measurement. But if I drop my single crumpled ball of paper say 10^6 times I doubt very much the posterior would be more peaked than if I dropped it say 10^4 times. The asymptotic standard deviation of r would be some value that reflects the fact that in each drop the r that most accurately captured the aerodynamics would be a true random variable, changing from experiment to experiment a little. Also, drops from higher heights would have similar r values, as more of the time would be spent in falling at the “terminal velocity”. So the best r is likely to be a function of fall height, at least for the smallish fall heights considered (a few meters).

I guess part of my question which you didn’t exactly address, is what kind of a construct *is* r? How best to think about it? (In fact, the specifics of this dropping balls situation is not very interesting to me, since it’s just a toy problem, but toy problems are good when they get at some fundamental issues in a simple to understand way).

If I were doing maximum likelihood for example, there is a sense in which a single value of r could be pulled out of the analysis at the optimal point. In a Bayesian analysis we’re used to getting a distribution of r values, but we’re also used to having this distribution mean something about our lack of knowledge. In this case, it seems that it means something both about lack of knowledge, and inherent variability in outcomes together at the same time. This is a bit confusing.

And to my point 2:

If “Bayesian awareness will make designs more conservative” this means you will tend to do more samples I suppose? It seems to me, especially after reading a little in the last few days, that the best approach is a decision analysis, weighing the costs and benefits of running bigger studies vs smaller studies. Also, in such an analysis we MUST use informative priors (just as classical statisticians must guess at the variability that will be observed). Some of the associated costs would include the cost of doing the study, and some would include the cost of coming to various wrong answers. Finding the middle path will require explicitly defining the tradeoffs/utility function. I suspect that in lots of cases, especially in exploratory cases, we would choose to start with bayesian studies that are very small, hoping that by spending less money, and doing more studies, we will find interesting and relatively large results in some cases, and then we can spend our money on confirmatory studies for these large results. As you point out though, we should expect that some of these large magnitude results will be erased once we collect larger datasets.

And to my P.S.:

I suspect that collecting narwhals now would be considered uncouth, and few were collected back in the days of the-devil-may-care. Most of those great dioramas of gorillas and elephants and so forth were shot on expeditions in the 1920’s or so. The LA Natl. Hist. museum has a bunch of those, but no narwhals.

Sure, but how hard would it be for them to get just one narwhal for the Natural History Museum here?

31 Comments

  1. nick says:

    Splitting hairs here, I don’t believe you can stuff narwhals. It would be more like creating casts for marlins or other fish –realizing a narwhal is a whale of course. I wonder if we (AMNH) have a narwhal at the Whale Exhibit currently? I’ll follow up this afternoon to confirm.

  2. Corey says:

    Daniel Lakeland: chemical engineers (I was trained as one) deal with this sort of thing often; check out Reynolds number. Regarding convergence, I have a different intuition. Suppose that we have hierarchical data generation:

    (latent) mu_x_i ~ N(mu, true_sigma^2),
    x_i ~ N(mu_x_i, true_sigma_x^2),

    but we model it using one level:

    x_i ~ N(mu, sigma^2)

    We can expect mu_hat -> mu, sigma_hat^2 -> true_sigma^2 + true_sigma_x^2. I expect a similar thing would happen for the radius (more engineeringly, characteristic length) parameter.
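A small simulation of this claim, as a sketch with made-up values for mu, true_sigma, and true_sigma_x: data are generated from the two-level process and then summarized as if they came from a single normal.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical "true" values for the two-level generating process
mu, true_sigma, true_sigma_x = 3.0, 1.5, 0.8
n = 200_000

# (latent) mu_x_i ~ N(mu, true_sigma^2), then x_i ~ N(mu_x_i, true_sigma_x^2)
mu_x = rng.normal(mu, true_sigma, n)
x = rng.normal(mu_x, true_sigma_x)

# One-level fit: the MLEs are the sample mean and the sample variance
mu_hat = x.mean()
sigma2_hat = x.var()
target = true_sigma**2 + true_sigma_x**2

print(f"mu_hat      = {mu_hat:.3f} (mu = {mu})")
print(f"sigma_hat^2 = {sigma2_hat:.3f} (sum of variances = {target:.2f})")
```

The one-level variance estimate absorbs both sources of variation, which is the behavior expected for the characteristic-length parameter.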

    AG: Narwhals are indeed awesome — maybe too awesome for museums.

    • Yes, the Reynolds number is a well defined thing only when the geometry of the situation is extremely well defined (a very nearly perfectly spherical thing) but that doesn’t stop us from considering a Reynolds number in the case of a crumpled paper ball, and in fact that’s exactly what my model does: it uses an “effective spherical radius” to get a Reynolds number, plugs it into a sphere-drag-coefficient function, and gives me a nice nonlinear ODE to solve. But there is no single “r” which will consistently describe each fall of a given paper ball.

      So that gets to your second point, you can say that the drag on the paper ball is D = \rho * A * Cd(r'(v,t)) * v(t)^2/2 where r’ is no longer a constant as it would be for a nice bearing ball, but rather a random function of velocity and time (there might be consistent changes to the average at a given velocity, plus random perturbations by turbulence and changes in orientation of the ball for example). So if we do a lot of drops of the paper ball, we can eliminate the uncertainty associated with our ignorance of the r'(v,t) function (say we use a gaussian process prior for example), but we can’t eliminate the actual variability from drop to drop in this function.
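To make the constant-r simplification concrete, here is a rough sketch of a fall-time calculation with quadratic drag, D = rho * A * Cd * v^2 / 2 with A = pi * r^2. It is not the model from the linked post: it assumes a constant drag coefficient rather than a Reynolds-number-dependent Cd function, and the mass, radius, and air density are hypothetical values.

```python
import math

def fall_time(height, r, mass, cd=0.47, rho=1.2, g=9.81, dt=1e-4):
    """Time (s) to fall `height` meters from rest for a sphere of effective
    radius r (m) and mass (kg), with quadratic drag and a constant
    drag coefficient (a simplifying assumption)."""
    area = math.pi * r**2
    k = 0.5 * rho * area * cd / mass   # drag deceleration per unit v^2
    v = y = t = 0.0
    while y < height:
        a = g - k * v * v              # net downward acceleration
        v += a * dt                    # semi-implicit Euler step
        y += v * dt
        t += dt
    return t

# Hypothetical paper ball: about 4.5 g, effective radius 3 cm, dropped from 2 m
t_vacuum = math.sqrt(2 * 2.0 / 9.81)
t_small = fall_time(2.0, r=0.03, mass=0.0045)
t_large = fall_time(2.0, r=0.04, mass=0.0045)
print(f"no drag: {t_vacuum:.3f} s, r=3cm: {t_small:.3f} s, r=4cm: {t_large:.3f} s")
```

A function like this maps a single effective r to a predicted fall time. Letting r' vary randomly along the trajectory would give a different time on every drop even for the same ball.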

      Each time we drop the ball, we’ll get a different random path from the random function r'(v,t). And that data generating process will generate fall times that are distributed in some way (depending on the distribution of fall heights) and when we model the whole thing as a constant r value throughout the ODE prediction, that will result in a distribution of effective r values in our experiments which can not become any more narrowly peaked than the “frequency distribution”. At any given time, our distribution for the r value will reflect prior uncertainty in the form of the function, and “sampling uncertainty” in the actual path. After a long time, the “epistemic” uncertainty will be eliminated, we will know the sampling distribution of r(v,t) pretty darn well, but we won’t get rid of the fluctuations.

      To simplify somewhat, imagine your model turned around, there’s really only one level of data generation

      x_i ~ normal(mu,sigma)

      but we model it as

      x_i ~ normal(mu,s)
      s ~ exponential(1.0/s_guess);

      As time goes on, we will know s nearly exactly (its bayesian uncertainty will be highly peaked) but it won’t help us to predict future x_i any better than being normally distributed around mu with stddev = s*, the peak value of s. The only thing is in my ODE example, the x_i are themselves parameters (r values) and the only data is the fall time t_i.

      Bayesian parameters that are “effective aggregates of more detailed stuff” can not become more highly peaked than some asymptotic level of uncertainty associated with how well the aggregate helps us predict the data.

      • Corey says:

        ‘Bayesian parameters that are “effective aggregates of more detailed stuff” can not become more highly peaked than some asymptotic level of uncertainty associated with how well the aggregate helps us predict the data.’

        Let’s distinguish between posterior distributions of parameters and posterior predictive distributions. Under fairly general conditions, (MLE – theta_0)*(n*Sigma)^(1/2) converges in distribution to standard normal. The parameter theta_0 is the value within the parameter space that minimizes the (directed, but I forget which way) Kullback-Leibler distance between the “true” sampling distribution and model sampling distribution; Sigma is the Fisher information at theta_0.

        The posterior density qua sampling-theory-style random function inherits this behavior, which implies that the posterior distribution will contract like n^(-1/2). The “extra” variability due to unmodeled hierarchical variation will just look like noise, so it will inflate estimated sampling variance (relative to conditioning on known values of these hierarchical random variables). So posterior distributions of “effective aggregate” can indeed become arbitrarily sharply peaked; the extra variation that would be removed by disaggregation goes first into the estimated sampling variability, and from there into the posterior predictive distribution.
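The distinction between the two kinds of distribution can be sketched numerically. This assumes the one-level normal model fit to two-level data, a flat prior on mu, and a plug-in estimate of sigma, so the posterior sd of mu is approximated by s/sqrt(n). All parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_one_level(n, mu=3.0, true_sigma=1.5, true_sigma_x=0.8):
    """Simulate two-level data, fit a single normal, and return the
    approximate posterior sd of mu and the posterior predictive sd."""
    latent = rng.normal(mu, true_sigma, n)
    x = rng.normal(latent, true_sigma_x)
    s = x.std(ddof=1)
    post_sd_mu = s / np.sqrt(n)       # contracts like n^(-1/2)
    pred_sd = s * np.sqrt(1 + 1 / n)  # stays near sqrt(sigma^2 + sigma_x^2)
    return post_sd_mu, pred_sd

post_small, pred_small = fit_one_level(100)
post_large, pred_large = fit_one_level(10_000)
print(f"n=100:    posterior sd of mu = {post_small:.4f}, predictive sd = {pred_small:.3f}")
print(f"n=10000:  posterior sd of mu = {post_large:.4f}, predictive sd = {pred_large:.3f}")
```

The posterior for mu keeps sharpening as n grows, while the predictive sd settles near the total standard deviation of the generating process.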

        • Corey says:

          (Posterior distributions of “effective aggregate” parameters, I mean.)

        • So to be clear if we model it in stan like:

          sr ~ exponential(1/srtyp)
          r[b[i]] ~ normal(rtyp,sr)
          st ~ exponential(1/sttyp)
          t[i] ~ normal(f(r[b[i]]),st)

          where b[i] is the ball identifier

          Suppose we have 10000 drops of 10 balls (1000 drops each). What would you expect?

          I would expect sr,st to both be HIGHLY peaked. I guess I would expect r[k] to be highly peaked around some mean value, and the t[i] ~ normal(f(r[b[i]]),st) would account for the drop-to-drop variation.

          On the other hand, if you fixed the st parameter, based on say a known precision in timing errors, then suddenly the r values would absorb the modeling error and would no longer become highly peaked, I suspect. So it depends on your model specification, right?

          • Corey says:

            Yup, very model-dependent. Provided I’ve understood the above, your intuitions and mine are in accord.

  3. K? O'Rourke says:

    You may wish to look at Rubin (1984), “Bayesianly justifiable and relevant frequency calculations for the applied statistician,” especially the point that in design, data is not fixed but random even for Bayesians.

    There is a whole wealth of insight in this paper that seems not to have been picked up on much in 30 years!

    One of the first projects I was involved in was an NIH-funded study of the cost effectiveness of clinical trials, where we used priors for effect sizes and considered utilities and adoption rates of treatments identified as worthwhile. Optimal funding was much different than funding for claimed power of 80%, and a write-up called “Clinical trials are cheap” never made it into a journal. I think this has been eclipsed by work by Don Berry and others (adaptive designs).

    As for real parameters, to quote Peirce “All clocks are clouds”. The idea that there is one speed of light, one that has always been and always will be, is more faith than reality.

    • Thanks for the reference, seems very readable and interesting.

      When it comes to the NIH effectiveness study can you comment a little more on what you found? Optimal funding under your model was what, much larger than actual funding for 80% power? Smaller? What was the take home message of “Clinical trials are cheap”?

      The one true speed of light issue is of course faith since if it were asymptotically changing by 1% every 10 billion years we would have a hard time detecting such a thing… though perhaps astronomers looking at super-distant galaxies would have ways to detect such a thing, however I think it would be confounded with distance estimates and cosmic expansion estimates. I don’t know enough about astronomy, perhaps David Hogg can enlighten us as to the real problems associated with allowing c to be time varying.

      Of course from a standards standpoint, we *define* c to be a constant, and we define 1 second to be a constant, so what really varies is how big a meter really is.

      • K? O'Rourke says:

        It was a long time ago, but many more, much larger trials. For instance, an 80% power trial may get something adopted in practice much slower and less widely than a 95% power trial, and if the treatment works really well you take a humongous big hit on unnecessary health care costs (not to mention suffering). If it doesn’t work you are more likely to avoid funding a second trial. On the other hand those low power trials end up creating (for a small subset) exaggerated treatment effects that get adopted quickly and widely and you again take a humongous big hit on unnecessary health care costs (not to mention suffering).

        But it was all single point once only decisions being assessed – adaptive should be better.

  4. Dan: Independent of the statistical issues, but in terms of mathematical modeling issues, you might look at the beginning of Walter Meyer’s lovely book, Concepts of Mathematical Modeling. He treats the issue of the way Earth’s gravity acts on a “particle” in terms of the particle’s size. Thus, when an atomizer releases droplets of perfume one has to think in different terms from what happens when a coin is dropped from the top of the Empire State Building.

  5. jrc says:

    Dan,

    I’ve been wondering about this exact problem myself lately, but in a different context. The core problem to me is the idea of estimating something that is not actually some thing out there in the world, and what that could possibly mean (and what statistical inference or confidence or whatever might mean about something that isn’t actually in the world).

    So, you can measure the “radius” of your ball, drop it a bunch of times, and estimate the effect of “radius” on drop time. But it’s not radius that you are measuring, and you are using a stylized equation relating gravity, air, and paper that is demonstrably false. BUT – and this is my favorite part – your estimated equation will probably get you a pretty good prediction of the time it takes some similarly crumpled piece of paper to fall.

    This is so common in the social sciences. Here’s a list of other parameters that don’t exist in the world – or exist in such a stylized way that they resemble the “rate of fall as a function of ‘radius’ of a crumpled ball” – but might give us useful predictions and/or explain things: elasticity of labor supply, returns to education, risk aversion coefficients….OK, basically every parameter in economics, and probably in most social sciences.

    This seems like a bigger philosophical problem than Andrew has been giving it credit.

    • K? O'Rourke says:

      Please don’t take this the wrong way but my favourite quote on this is from Leonard Cohen

      “There is the word and there is the butterfly. If you confuse these two items people have the right to laugh at you.”

      Change “word” to representation or model and “butterfly” to what’s being represented or modeled – and from Box all models are wrong some are useful (some more than others).

      • jrc says:

        I wouldn’t take that the wrong way. Leonard Cohen is great. He used to walk by my old coffee shop in LA wearing a dapper suit and shuffling along with the aid of a beautiful woman 40 years his junior holding his elbow. Good life.

        Substantively: I think we generally agree on the idea that models do not perfectly correspond to reality. But the difference I’m getting at is whether the thing itself we are trying to describe is something real. The butterfly is a real thing out in the world. We believe that butterflies exist and want to know something about them. Structural parameters of economic models are different – they are by design simply useful mathematical constructs. Crumpledpaperballs are somewhere in between – they exist, but they are so dissimilar to each other that the category itself is just a useful catch-category for a lot of variously folded/bent pieces of paper.

        Now, all science may be just a useful mathematical construct*. But it seems to me like there is something different in degree, if not kind, between estimating “risk aversion” and “half-life of polonium 210”. In the case of radioactive metals, I think most people believe that a piece of metal behaves just like any other piece of the same metal, that decay rates are real things in the world, and that there is a “true” rate of decay. We could say the phrase “If I did this experiment 1,000 times, in about 950 of those my confidence interval would cover the true mean.” It would make sense as a statement that could be true or false.

        Not true with “risk aversion.” There is no “true parameter” out there that our confidence interval could theoretically cover. You can’t say “95% of the time this method generates a CI that covers the true parameter”, it is a nonsense claim to make. I don’t think this is purely a frequentist problem either – how would a Bayesian make a statement about the risk aversion parameter value they calculate that made sense statistically?

        *I’m sympathetic to the argument that my examples – radioactive decay and structural parameters from economic models – are both of the same kind: useful generalizations that do not correspond to some objective reality. But the difference in degree seems important to me, and the difference in interpretation. In one case we really believe that there is something out there in the world we are trying to get at, while in the other we begin by knowing that the thing we are estimating is just a useful theoretical construct that sort of gets at some set of features of people.

        • I would say you’ve hit exactly the same nail exactly on the same head I was thinking of when I first sent this question in. Even the radius of a nicely machined bearing ball is not quite a well defined concept, it’s not a perfect sphere, but it’s perfect enough that we’re not going to quibble. The crumpled paper balls are INTENTIONALLY a significant DEGREE different from perfect, and that’s just the beginning, as you say eventually we reach “risk aversion” which is so far from a real thing in the world that we’re perfectly happy to both call it a fictional construct even before we get any data. Also, there’s no real direct way to imagine measuring it even approximately, like I did when I laid the paper balls on a tape measure and “eyeballed” about how big they were.

          Furthermore, these aggregate thingamajigs we use for modeling are necessarily going to be sort of a fluctuating quantity. There is no “one true risk aversion” just as there is no “one true radius of my paper ball” unlike the “one true radius of a ball bearing” which is pretty darn well defined, to within a few micro-meters or so. Hence, Bayesian uncertainty intervals will be expected to asymptote to a non-zero width for risk aversion, even in the limit of infinite data, whereas for ball bearings this width will be tiny, negligible for almost all practical purposes.

        • konrad says:

          I lean to the “everything is a generalization” side. Yes, the degree of model imperfection can vary hugely, but even with ball bearings it is a conscious decision to ignore the imperfections and work with a mathematical abstraction at a particular level of simplicity. We need to bear the degree of imperfection in mind at all times, because it informs us about the degree of precision with which we expect our results to match reality – and if we need more precision, we are still willing and able to expand the model (i.e. make its imperfections explicit). The width of a confidence interval may become negligible, but there is still a clear _conceptual_ difference between “negligible” and “zero”.

        • K? O'Rourke says:

          CS Peirce chose to define _real_ as the representation an enquiring community would settle on after an infinite amount of deliberation/refinement. We may judge some representations have already (mostly) converged. He also defined representation as continuous (between any two is another) to address konrad’s point below.

          Can hardly wait to find out if he was right ;-)

          • konrad says:

            But is the chosen representation not a function of the purpose for which it is intended to be used?

            A community may settle on representing objects as point masses to which Newton’s law is applied because that leads to the most practical calculations for their purposes, while simultaneously acknowledging that much more sophisticated representations could be used in settings with unusually strict precision requirements.

            And in statistics we routinely choose representations of which the complexity matches the information content of the data set rather than the complexity of the phenomenon being represented.

  6. Mike Betancourt says:

    The “All models are wrong, some are less wrong” philosophy is ingrained in the physical sciences. Consider, for example, rerunning your experiment with a round ball. How round is it? What are its surface properties? Will any energy be lost to rotation induced by torque from uneven air currents on the boundary layer? Is it deformable? All of these generalizations would change the answer a bit, but almost all of them would be minuscule changes. Even the air resistance formula is just an approximation that makes calculations easy!

    And this doesn’t even get into the approximations inherent in classical mechanics, deterministic mechanics, non-relativistic quantum mechanics, etc. Drop your spherical cows and judge the model for its consistency with the observations, not on its ontology.

    • Even though all models are wrong, not all models will have similar statistical properties. The approximations related to radii inherent in dropping ~1cm steel ball bearings through 1 to 30 meters of still air will be negligible for almost all purposes. Bayesian uncertainty intervals on r will asymptotically go to nearly zero width. Not true for the crumpled paper balls.

      I’m perfectly happy with all models are wrong, some are useful, but I think we can move beyond that to discuss in what ways different types of models are wrong and what that will imply about the statistics and distributions of parameters. It seems that jrc above has got a similar idea to mine. I think it’s fair to take this discussion a little farther.

      • Devin Kilminster says:

        I’ve been interested in this sort of issue for a long time. Not only do I agree that it’s worth investigating how the different sorts of “model wrongness” will affect the outcome of a given statistical procedure (for example Bayesian inference), but that it can also be justified to consider adapting the statistical procedure to optimise the use for which we expect our “useful but wrong” model to perform. How the statistical procedure should be adapted to the preferred use of the model in the case of varying types of model wrongness is also something I think worthy of study.

        Although I might express some things a little differently now, an example in the context of modelling timeseries measurements of a nonlinear circuit is given in the first few sections of this conference paper:

        http://cats.lse.ac.uk/publications/papers/KilminsterMachete_NOLTA05.pdf

        • Thank you for the link. That seems highly relevant, and has some nice commentary on exactly this issue. I’ve recently been fitting a model to timeseries trying to discover the appropriate values of a certain parameter. I found that the convergence of the model was dependent on a proper specification of the likelihood, but there is no a-priori likelihood in this problem. In the end it took specifying a certain kind of gaussian process for the errors that incorporated certain kinds of knowledge of the problem. ie. that it was OK to have short duration excursions from good fit, and that at some time points the fit should be good and other time points it is OK for the fit to be less good (where we’re taking the log of small positive random noise). Took a couple of weeks to tune the likelihood and the jumping scales, considering how slowly the model ran thanks to having to run 40 ODEs with 20 dimensions each at each MCMC step.

          In the end it became clear to me that in some sense I was tuning the statistical procedure to enforce certain scientifically motivated constraints on the model performance. Perhaps in most statisticians’ worlds this kind of thing is rare, certainly for clinical trials or designed biological experiments, but I think in timeseries and dynamics in general the tuning of the statistical procedure to produce results that make sense is both necessary, and a bigger philosophical issue by far than the “subjectivity of the prior” that is so often railed against. Furthermore, it is something that enters whether you’re using maximum likelihood / frequentist methods or bayesian methods with priors.

          ABC / likelihood free methods might have been relevant for this problem, but I didn’t have any suitable software, so I didn’t investigate that possibility.

  7. konrad says:

    Question 1 is interesting and deep – it goes to the heart of what a model actually is. I can think of two strategies for answering the question, and it seems to me that the two are distinct (perhaps mutually incompatible):

    ————–

    Strategy 1: a model is an approximation of reality. The ball of paper _is_ approximately a sphere, because by adopting the proposed model we are _approximating_ the ball of paper as a sphere (this is true regardless of whether the approximation is a good one). The only issue is that we may dislike this description because we have a preconceived (and perhaps overly strict) idea of how good an approximation needs to be before we are willing to use the word “approximately” – but here I choose to use the word regardless of the quality of the approximation.

    Models (and model parameters) live in the idealized world of mathematics – they never exist in the real world other than “approximately”, and (in terms of how I choose to use the word above) they exist “approximately” in the real world _every time they are used_.

    In this strategy, r is the radius of a sphere (a mathematical construct) used to approximate the ball of paper (a real-world object). Note there is no requirement that the approximation has to be a good one – we are free to use any sphere we like, even if its radius is on the nanometer or light year scale – so r does not have a “true” value. The MLE (or other point estimate) of r is the radius of the sphere that best approximates the ball in some explicitly specified sense (ML, or whatever other criterion is used to get the point estimate). But it seems to me we run into trouble when we try to say what is meant by the distribution of r, if we take “distribution” to refer to a knowledge state about the value of a parameter, conditional on some set of information. To define a distribution of r in this Bayesian sense, we need r to have a “true” value.
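    As a rough illustration of Strategy 1, here is a minimal sketch of picking the sphere that "best approximates" the ball under one explicit criterion, least squares on fall times (which coincides with ML under independent Gaussian timing errors). Everything specific here is an assumption made for illustration and is not taken from the original experiment: the quadratic-drag fall model, the constants, the simulated drops (each generated with a slightly different radius to mimic a non-spherical ball), and the names `fall_time` and `effective_radius`.

```python
import numpy as np

# Assumed physical constants (illustrative values only).
G = 9.81        # gravitational acceleration, m/s^2
RHO_AIR = 1.2   # air density, kg/m^3
C_D = 0.47      # drag coefficient of a smooth sphere
MASS = 0.005    # mass of the paper ball, kg

def fall_time(h, r):
    """Time for a sphere of radius r to fall a height h from rest.

    Uses the closed-form solution of m dv/dt = m g - 0.5 rho C_d pi r^2 v^2,
    for which the distance fallen is (v_t^2 / g) * log(cosh(g t / v_t)).
    """
    v_term = np.sqrt(2 * MASS * G / (RHO_AIR * C_D * np.pi * r**2))
    return (v_term / G) * np.arccosh(np.exp(G * h / v_term**2))

def effective_radius(heights, times, grid):
    """Radius on the grid that minimizes the squared error in fall times."""
    sse = [np.sum((times - fall_time(heights, r)) ** 2) for r in grid]
    return grid[int(np.argmin(sse))]

# Simulated drops: every drop behaves like a sphere of a different radius,
# so no single r reproduces all of the data exactly.
rng = np.random.default_rng(0)
heights = np.array([1.0, 1.5, 2.0, 2.5, 3.0])            # drop heights, m
per_drop_r = rng.uniform(0.03, 0.05, size=heights.size)  # hidden radii, m
times = fall_time(heights, per_drop_r) + rng.normal(0, 0.005, size=heights.size)

grid = np.linspace(0.01, 0.10, 901)
r_hat = effective_radius(heights, times, grid)
print(f"effective radius: {r_hat:.4f} m")
```

    The point estimate is well defined even though no drop was generated by a sphere of exactly that radius; a different criterion (say, weighting the tall drops less) would in general pick a different sphere.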

    —————-

    Strategy 2: a model is a counterfactual statement. We know that the ball of paper is not a sphere, but for the purpose of analysis we pretend it _is_: all of our analysis is conditioned on the counterfactual claim that the model is actually true. This works nicely in the Bayesian framework: in the counterfactual world where the model is actually true, r has a true (but unknown) value; we start with a prior describing our knowledge state of what that value actually is and update it as measurements come in. We can allow for the fact that unobserved initial conditions vary across repetitions by allowing r to be different in different repetitions – this gives a hierarchical model where the variance of the posterior for r is not expected to shrink to a point.
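    One possible way to write the hierarchical version down is sketched here. The notation is an assumption for illustration: $T(h, r)$ stands for whatever fall-time prediction the sphere model gives for drop height $h$ and radius $r$, and the lognormal and normal forms are just convenient choices.

```latex
\begin{aligned}
r_i \mid \mu_r, \sigma_r &\sim \mathrm{LogNormal}(\mu_r, \sigma_r^2), \\
t_i \mid r_i, \sigma_t &\sim \mathrm{Normal}\bigl(T(h_i, r_i), \sigma_t^2\bigr), \qquad i = 1, \dots, n, \\
(\mu_r, \sigma_r, \sigma_t) &\sim p(\mu_r, \sigma_r, \sigma_t).
\end{aligned}
```

    With more drops the posterior for $\mu_r$ and $\sigma_r$ would be expected to tighten, but the implied distribution of the per-drop radius keeps a spread of order $\sigma_r$, which is the sense in which it does not shrink to a point.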

    In this strategy, we can only ever make statements like “if the ball really were a sphere, we would assign the following probabilities to hypotheses about its radius”. Such statements are counterfactual hypotheticals (http://en.wikipedia.org/wiki/Counterfactual_conditional). We can say that they are useful _to the extent_ that a sphere provides a good approximation to the ball, but the nature of the approximation is more abstract than in Strategy 1 because (a) we are using the _class_ of spheres to approximate the ball, rather than a _particular_ sphere; and (b) outside of model comparison settings, we do not have to find the “best” approximation, so we typically do not choose a criterion by which to evaluate the quality of the approximation. (Not having to make this choice is an advantage, but leaves us without a clear understanding of what we mean by “to the extent that a sphere provides a good approximation”.)

    ——————-

    One could take the view that, ultimately, all scientific knowledge is in the form of counterfactual hypotheticals: we only ever produce knowledge that is conditional on false assumptions, and hope that the quality of the knowledge will degrade gracefully as the quality of the assumptions degrades. I think this works nicely in the Bayesian but not so much the frequentist framework, so I suspect most frequentists would not like this view at all.

  8. Manoel Galdino says:

    I’m pretty sure I’ll make a stupid comment. No one mentioned it, but it seems so obvious that I’m afraid I’m missing something. In any case, don’t you guys think that de Finetti’s representation theorem ‘solves ? I’ll quote Koop et al.’s book on Bayesian econometrics:

    From de Finetti’s standpoint, both the quantity theta and the notion of independence are ‘mathematical fictions’ implicit in the researcher’s subjective assessment of arbitrarily long observable sequences of successes and failures. The parameter theta is of interest primarily because it constitutes a limiting form of predictive inference about the observable y-bar_t [mean y]. The mathematical construct theta may nonetheless be useful. However, the theorem implies that the subjective probability distribution need not apply to the ‘fictitious theta’ but only to the observable exchangeable sequence of successes and failures.
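    For reference, the theorem being invoked can be stated as follows for binary outcomes (the standard statement, in notation chosen here rather than taken from the quoted book). If $y_1, y_2, \dots$ is an infinite exchangeable sequence of 0/1 variables, then there is a distribution $F$ on $[0, 1]$ such that, for every $n$,

```latex
p(y_1, \dots, y_n) = \int_0^1 \theta^{s_n} (1 - \theta)^{n - s_n} \, dF(\theta),
\qquad s_n = \sum_{i=1}^{n} y_i ,
```

    and $F$ is the distribution of the limit of $\bar{y}_n = s_n / n$ as $n \to \infty$. So "theta" enters only as the limiting relative frequency of the observables, and the "prior" $F$ is fixed by beliefs about the observable sequence alone.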

    • Andrew says:

      Manoel:

      See my discussion here, in particular this part:

      I understand the appeal of the pure predictive approach, but what I think is missing here is that what we call “parameters” are often conduits to generalizability of inference. . . . There’s a saying, A chicken is nothing but an egg’s way of creating another egg. Similarly, the de Finetti philosophy might say that parameters are nothing but data’s way of predicting new data. But this misses the point. Parameterization encodes knowledge, and parameters with external validity encode knowledge particularly effectively.

      • Fernando says:

        +1 and add: Directed Acyclic Graphs encode knowledge about the external validity of causal parameters. (Not all parameters are created equal.)

      • Manoel Galdino says:

        Thank you for the response. I’ll think about it, because I see what you’re saying more as a complement to de Finetti’s point of view than as its opposite. Sometimes we may think of parameters as a means to our prediction goals. Sometimes we’re more interested in understanding the data-generating process (aka the population), and your view is more useful…

  9. Manoel Galdino says:

    Oops, the last line of the first paragraph should be:

    think that de Finetti’s representation theorem ‘solves’ the issue about parameters not being real? I’ll quote Koop et al.’s book on Bayesian econometrics: