> I’m not so old, I still have visions of progress!

So did Peirce when he was not old ;-)

http://en.wikipedia.org/wiki/Charles_Sanders_Peirce#Theory_of_categories

First, there is the task of writing a generative model: how were Jon’s and others’ heights determined, and how were the observations we got generated, both expressed as probabilistic representations.

Second, given a probabilistic representation, as Fernando put it, “given a prior, likelihood and data, we ‘already know’ the posterior,” so we have essentially error-free access to that; and with that in hand, we note where _Brute facts_ are somewhat different.

Third, we interpret the above judgmentally, deciding whether it “makes sense,” how it could be made less wrong, or perhaps whether to hold steady with it for now (at least until more serious _Brute facts_ come to our attention).

I find “Bayesian (Jaynesian) data analysis,” as you so colourfully put it, direct and straightforward in at least the first two steps.

]]>> like network meta-analysis, which can be motivated from a Bayesian point of view but also without Bayes.

In network meta-analysis, there is information completely missing on an entire treatment group, so I’m not sure it can be addressed without an informative prior for that group, which would make it Bayes. Yes, it is very common for folks to construct a phantom likelihood (or mean and SE) for that missing group, but that violates Andrew’s rule below:

(a) Each piece of information comes into the posterior distribution exactly once.

]]>I agree. No other example helped me understand Stan as well as this post! So, thanks! :)

]]>Rahul:

See here. The short answer is that it can be helpful to understand a method in the context of a simple example.

]]>Andrew:

If I may pick your brain more: what advantage did we get by modelling it this way rather than just using a set of point inequalities? I.e.

“So if Jon Lee Anderson is 10″ taller than Catalina Garcia, who is 2″ shorter than Philippe Petit, who is the same height as Claire Danes, then he is 6′ 2″ tall.”

That’d give you approximately the same point estimates for all their heights? OK, we lost the intervals but those are kinda too wide anyways to be of much help.

And adding in the prior about population heights doesn’t seem to have helped enhance accuracy by much in this particular case?
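For what it’s worth, the chain of point comparisons in the quote is easy to check by hand; a minimal sketch, assuming Claire Danes is a 5’6″ anchor (the quote itself doesn’t state her height, but this is consistent with the quoted 6’2″ conclusion):

```python
# Chaining the quoted point comparisons, in inches.
# Assumption: Claire Danes is 66" (5'6"); this is a guessed anchor.
danes = 66
petit = danes            # Petit is the same height as Danes
catalina = petit - 2     # Catalina is 2" shorter than Petit
jon = catalina + 10      # Jon is 10" taller than Catalina
print(jon // 12, jon % 12)  # 6 2, i.e. 6'2"
```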

]]>Like you say it’s fairly well defined. To first approx there is an unknown underlying parameter, what’s his face’s height. A frequentist is presumably interested in the reliability of the procedure of estimating height based on google image searches. Raising and tackling this question seems as worthwhile as formalizing back of the envelope calculations.

]]>Does the beta distribution not automatically constrain them?

]]>Is the Hispanic height distribution significantly different than the American? Is it ok to assume that Catalina Garcia is Hispanic?

Instead of *normal(5’3″,2″)*, another mean maybe?

Wonder if that’d change any estimates substantially?

]]>Steve:

Them being brothers can be included in the statistical analysis (and in the Stan model) via a correlation.

In this example, of course, all these data are overkill, but I can well imagine similar reasoning being used to good effect in sparse-data settings such as data from an archeological site (or, for that matter, tree-ring reconstruction, which is an area I have worked on where uncertainties are huge).

I like these simple examples such as estimating heights or ranking soccer teams or spell checking because, if you think about them carefully, it is possible to gain general insights and see some general statistical principles in a setting where the prior information is easy to grasp.

]]>John Lee Anderson is notably taller than his brother Scott Anderson, another writer.

I like how my reasoning naturally runs in circles: JLA is a lot taller than SA, so that raises the probability that JLA is tall, but if JLA is tall then SA probably isn’t short (they are, after all, brothers), which adds credence to JLA being tall.

]]>I think any “gambit” is elsewhere; the problem is pretty well-specified – just estimate John Lee A’s height. For most of the data (i.e. pictures obtained by typing the appropriate minor celebrity’s name into Google Images) random sampling from the essentially-infinite population of digitized pictures of them seems like an okay data-generation model, albeit not perfect.

Exactly how that data relates to John Lee A’s height is a tricky measurement error/missing data problem, but should be possible.

How would you fit it into frequentism? I agree it’s an interesting challenge.

]]>I think what I am trying to get at – especially in the context of a measurement model – is the notion of experimenter’s regress http://en.m.wikipedia.org/wiki/Experimenter's_regress

]]>Fernando:

I agree. This process can and should be formalized more. One reason for me to do little examples step by step is to build the intuition that can lead to formalizing. I’m not so old, I still have visions of progress! In particular, one issue is that there’s uncertainty. If a result “doesn’t make sense,” this implies a conflict with other information that might be included in the model, but at some point these “don’t make sense” claims are only probabilistic.

]]>Rahul:

If we have informative priors there are no real constraints, I think.

]]>Andrew:

I think I was not sufficiently clear. I completely understand why you are doing this, and so on. Moreover, I am not criticising this particular application. I think it is a great example!

But I was also trying to use this toy example to ask a broader question related to the style of inference you propose. You touch on it again when you state: “adding additional information as necessary if the conclusions don’t make sense”. Do we already know the conclusion? If so, why bother?

One answer might be that we add more information to get rid of nonsense like negative heel height say. That comes from the prior not the conclusion.

But at times it seems as if we know what the conclusion should look like, and are just iterating through the modeling to get there. Put bluntly, it’s like we are beating a confession out of the model rather than learning from it.

That may be ok when we want to describe data (we want y_rep to look like y), but in cases like this, where we are trying to solve a mystery as it were, it seems the approach can lead one astray. I don’t think it’s happening here, but I just think that the approach is still very heuristic.

PS I am not saying that is bad. It can be very useful. But it also suggests it can be improved and formalized more, I think.

]]>Can one under / over constrain the model?

]]>Fernando:

You ask what is gained by the above exercise. To start with, nobody (except maybe Jon Lee Anderson) cares much about Jon Lee Anderson’s height: it is a demonstration problem intended to illustrate methods.

The first thing it’s illustrating is Bayesian measurement error models. These are not widely understood, but if you do it carefully it’s surprisingly easy to just add error terms to express different aspects of bias and variance in measurement.

The second thing is it’s illustrating the use of Stan. Again, it’s (to me) impressively easy to implement the model straight in Stan in one take. Sure, this particular example could be done in other ways. But, to be honest, those other ways aren’t so easy, they involve various shortcuts and worries about how the data are being included. Once you’ve put in the overhead to learn Bayes and Stan (which you might have done, for example, to fit nonlinear hierarchical models), it’s pretty effortless to solve measurement-error problems too.

An analogy might be the use of a pocket calculator to solve a simple problem such as 5*(38-12)+46. Easy enough to do by hand, but if you have the calculator (which of course is the product of millions of dollars of R&D involving semiconductors, etc.) you might as well use it. Stan is the “calculator” here.

The third thing (and I don’t think I got this point across very well in my post) is to illustrate Bayesian (Jaynesian) data analysis: the method that works by laying out very clear assumptions, working out their implications, and then adding additional information as necessary if the conclusions don’t make sense.

Finally, just about any worked example—this one included—reveals some little things that can be useful. For example, the problem I had with the “pre” tag and the brackets in my Stan code, the feature that you can include arbitrary expressions on the left side of a squiggle sign in a Stan model, the use of the scaled beta as a quick prior for the shoe heights, the use of coefplot() to display the inferences.
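As a footnote to the measurement-error point above: the “just add error terms” idea is easy to sketch generatively. Here is a minimal simulation, with entirely made-up numbers, of how each photo comparison might be modeled as truth plus shoe bias plus eyeballing noise:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical generative sketch: each of 5 photo comparisons reports
# the true height difference plus an unknown shoe bias and noise.
true_diff = 9.0                        # true height gap in inches (made up)
shoe_bias = rng.uniform(0.0, 3.0, 5)   # unknown heel heights, one per photo
noise = rng.normal(0.0, 1.0, 5)        # eyeballing error
observed = true_diff + shoe_bias + noise

# A Bayesian measurement-error model treats true_diff and shoe_bias as
# parameters and infers them from observed, given priors on each term.
```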

]]>Now that I think about it, almost all the time I set up a Bayesian estimation model, it’s ultimately in service of performing contrasts…

]]>Depends on the definition of “already know”. If you define it loosely enough, given a prior, likelihood and data, we “already know” the posterior.

But, particularly when we’re dealing with higher dimensionality and missing data, it’s not trivial to make the inference by eyeball. And then when we start contrasting differences between parameters, it can get even trickier to eyeball parameter differences in a consistent way. A posterior on the difference in parameters makes the comparisons explicit.
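To make the contrast point concrete, here is a minimal sketch (hypothetical draws standing in for output from a fitted Stan model; real draws would also carry posterior correlation between the parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for posterior draws of two heights from a fitted model.
theta_a = rng.normal(74.0, 1.0, 4000)   # e.g. one person's height, inches
theta_b = rng.normal(64.5, 1.5, 4000)   # e.g. another's height, inches

# The posterior for the contrast is just the draw-by-draw difference;
# no eyeballing of two separate intervals required.
diff = theta_a - theta_b
print(round(diff.mean(), 1), np.quantile(diff, [0.025, 0.975]).round(1))
```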

]]>Yes, the key principles are:

(a) Each piece of information comes into the posterior distribution exactly once.

(b) To the extent the inferences don’t make sense, this is an indication that there is additional prior information not included in the model.

]]>Well, in my model, the jla_boost_above_avg is a parameter, and I put a prior over it that was informed by the book signing photo. So, it’s data-based but it doesn’t enter into what you’d call the “likelihood”.

The prior and likelihood breakdown is a little bit artificial I suppose.

]]>Not so much that your prior should be different. More that you start with the same prior but then the book signing photo is information that gets included as data.

]]>Definitely a book signing photo is different information than just a comment about “little twerp” so the information in your prior *should* be different for the two cases.

]]>oooh, the old “this problem was randomly chosen from a population of problems” gambit. Interesting.

]]>Of course, if you do have some specific information, like from the book signing event, it is reasonable to include it. But I would be hesitant to do it just on a hunch that he is tall because he calls others “little twerp”. Anyways, data always wins against speculation.

]]>To be more specific, I created a parameter jla_boost_above_avg and made it exponentially distributed with mean value 3/2 of a standard deviation (of the whole population of men in the US).

]]>In my model (above) I used the information originally given about Anderson at a book signing (that he looked taller than various random people who were probably average sized) to bump the mean value of his prior distribution up by 3/2 of a standard deviation (vs the overall male population). My model didn’t use shoe heights for males though (I figured women wearing “flats” were about as much taller than barefoot as men wearing typical loafers or whatever).

I disagree that we should make Anderson’s height prior flatter. The point is, if we have some general information that he seems “taller” than average, then make his prior height have an average which is taller than the average for men in general; don’t make it flatter, because a flatter prior would imply implausible heights, such as, say, 7 or 8 ft tall.

]]>Excellent.

Anderson appears to be more than half a foot taller.

]]>http://earth.columbia.edu/articles/view/3138

Here’s Bollinger with Bill Gates, whom Google calls 5’10”.

http://engineering.columbia.edu/web/newsletterarchive/fall05/billgates.php

Bollinger appears an inch or inch and a half taller than Gates.

]]>John Lee Anderson is several inches taller than his brother Scott:

http://sangam.org/2011/02/images/JonLeeAnersonandScottAndersonin1988.jpg

Here he’s bending way down to shake the hand of movie star Gael Garcia Bernal, whom Google.com says is 5’6″.

http://www.zimbio.com/photos/Jon+Lee+Anderson

Here he is with Hugo Chavez, whom Google says was 5-8, but they’re not side by side, so the picture isn’t too useful:

Here, even seated, Anderson towers over film maker Saul Landau, who is of unknown height:

http://www.zimbio.com/photos/Jauretsi+Saizarbitoria/Jon+Lee+Anderson

Here, slumped in a chair, he appears to be much taller than random female students:

http://www.frontlineclub.com/capturing-the-story-with-jon-lee-anderson/

Here Anderson towers over a man whose face has been blurred out presumably because he’s some kind of desperado:

Here Anderson is standing on stage in a lineup of prize winners with the president of Columbia U., Lee Bollinger (Anderson looks a couple of inches taller than Bollinger, whom Professor Gelman has likely met):

http://cabotprize.com/the-2013-cabot-prizes-ceremony/

Here is Anderson with Latin American intellectuals and Canadian writer John Ralston Saul. Anderson and Saul are similar in height, and looking at pictures of Saul with other people, he is clearly a six-footer at least.

http://grupopenta.com.co/1589/hay-festival-cartagena-de-indias-2013/

In general, Anderson appears to be among the tallest people in most pictures he is in, although he tends to get his photograph taken a lot with Latin Americans and Middle Easterners rather than Dutchmen and Serbs. He has the body language of a man of considerable height.

]]>I’m not die-hard anything, but had the same question. Here’s a start on an answer:

Presumably one could imagine a sampling distribution of pictures of John Lee A and Catalina G that we might observe, in a search for pictures of the guy, of which this is only one. Same goes for all the other pairings, presumably conditional on searching for pictures of Catalina. With n of 1 sampled from all these distributions, one would have to use other information to make assumptions about them, e.g. that the observations may reflect heights wearing plausibly-sized heels, that sometimes people slouch, whatever. This is okay; frequentist does not mean assumption-free, or even assumption-lite.

Putting it all together, you could presumably come up with an interval that, under valid assumptions, would cover the true height of John Lee A in a given percentage of replicate experiments of this sort – which one could assess for sensitivity to the assumptions, if desired.

It also struck me the estimation process is very like network meta-analysis, which can be motivated from a Bayesian point of view but also without Bayes.

]]>Let’s be even more explicit, because I think this is confusing. I haven’t thought too carefully about what Stan is doing on the inside, but it’s worth checking that I understand it.

Stan works with log(p(data | params) * p(params) * C), where C is some constant related both to 1/p(data) and possibly to normalizing factors left out of the calculation of some of the p(params), etc. It works internally with the logarithms, but I’ll keep that explicit here.

a statement

funky_func(y) ~ foo(theta);

where y and theta are parameters, enters into the p(params) portion, since it doesn’t have anything to do with the data. Writing F for funky_func, Stan does:

increment_log_prob(log(foo(F(y),theta)))

But Stan needs to calculate log(p(y)), not log(p(F)), and the former is log(p(F) dF/dy) *provided that funky_func is 1-1*,

so we need to do

increment_log_prob(log_dF_dy(y))

The whole thing is easier to understand if you explicitly make everything dimensionless by choosing an infinitesimal dy:

p(y) dy = p(F) dF = foo(F(y),theta) dF

p(y) = foo(F(y),theta) dF/dy

log(p(y)) = log(foo(F(y),theta)) + log(dF/dy)
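This change-of-variables identity is easy to check numerically. A minimal sketch taking funky_func = log (so dF/dy = 1/y) and foo = standard normal, which makes y lognormal with a known closed-form density:

```python
import math

def foo(x):
    # standard normal density
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

y = 1.7
# p(y) = foo(F(y)) * dF/dy with F = log, so dF/dy = 1/y
lhs = foo(math.log(y)) / y
# Known closed form: lognormal(0, 1) density at y
rhs = math.exp(-math.log(y) ** 2 / 2.0) / (y * math.sqrt(2.0 * math.pi))
assert abs(lhs - rhs) < 1e-12
```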

]]>Damn html ate my angle brackets! The bounds were in the code, they just disappeared when I put them into the blog post.

]]>I guess you can implement your idea directly by making it a 2-component model with 0.01 probability on “(Jon + Jon_shoe_1) – (Catalina + Catalina_shoe_1) ~ normal(9.5,1.5);” and 0.99 probability on some default, but 1) it seems to be unnecessarily complicated and 2) I’ve no idea how Stan programs a multi-component distribution. Heck, you can even make your 1:100 belief ratio a fitting parameter.

]]>What’s a range that you are “almost absolutely sure” that (Jon + Jon_shoe_1) – (Catalina + Catalina_shoe_1) falls into? Let’s say you’re willing to say you’re almost absolutely sure that it’s between, say, 3 inches and 15 inches… then make it

~ normal(9,2)

Don’t like that? Maybe you think it might be between 0 inches and 20 inches? Then make it ~normal(10,3) or maybe ~normal(10,4).

At some point, if you make it wide enough, you’ve got to believe that it’s in that range… with ~normal(9,20) I guarantee for sure it’s in the “high probability region” (defined as mean plus or minus 2 or 3 or 4 standard deviations) of that prior.
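That “mean plus or minus a few standard deviations” bookkeeping is mechanical enough to script; a small sketch:

```python
def covers(mean, sd, lo, hi, k=3):
    """Check whether the interval [lo, hi] lies inside mean +/- k*sd."""
    return mean - k * sd <= lo and hi <= mean + k * sd

# "almost absolutely sure it's between 3 and 15 inches" -> normal(9, 2):
assert covers(9, 2, 3, 15)       # 9 +/- 6 is exactly [3, 15]
# "between 0 and 20 inches" -> normal(10, 4) covers at 3 sd:
assert covers(10, 4, 0, 20)      # 10 +/- 12 is [-2, 22]
# normal(10, 3) needs the wider k = 4 to cover [0, 20]:
assert covers(10, 3, 0, 20, k=4)
```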

]]>I also like the exposition of using general formulae on the left hand side of the sampling statement.

I think the Stan manual has a few examples of this, but Andrew’s post made heavy use of it. It was a refreshing reminder.

]]>(My pre-tags, where did they go? I am doing something wrong.)

]]>So just to be as explicit as possible about what “it’s your job to account for any Jacobian adjustments they induce from non-linear transforms of parameters” entails:

funky_func(y) ~ foo(theta);

// and now I need to account for the log-Jacobian of funky via

increment_log_prob(loggy_J(y));

That about right?

]]>PS In game theory, for example, a common assumption is that agents update their beliefs using Bayes’ theorem. If so, if you show them the various pictures, they should get the same guess as they would if they used Stan. This is a useful assumption, but it is hard to believe.

]]>I’d like to push you here a little to put your finger on exactly the value added of, and what we learn from, the modeling exercise. (I’m coming at this not as a skeptic but as genuinely curious.) Here are two possible interpretations:

1. We are having the model tell us what we already know. That is, we have an idea in our head of what would be reasonable inferences, and we are tweaking the model to get it to spit that image out. I suspect we would have gotten pretty close if we had skipped Stan and everything and simply drawn the final graph by hand. Writing down a model is just a way of rationalizing in a public fashion how we came to our conclusion.

2. We are having the model tell us something we already know but in the process learn something new. I think here is the real value added, of course. The question is why we learn something new. Put differently, where did our brains fail? For example, I had not thought of the same person wearing different shoes on different occasions. But that is just one instance; what I am interested in are general classes of failures. For example:

– Processing capacity: A computer may be better at integrating tons of information (but humans play chess pretty well, and do many complex tasks easily that computers find impossible);

– Errors and omissions: A program, like algebra, can highlight errors and omissions in internal logic (but this seems somewhat serendipitous);

– Biases and heuristics: E.g. availability biases and other things that may be less powerful when dealing with code as opposed to direct guessing (but these biases may filter into the prior);

Any thoughts?

]]>The distributions clearly aren’t frequency distributions, at least not accurately so, and the thing they are distributions of aren’t “random draws” from a population. Yet at the end of the day Bayesians are estimating an objective fact (Anderson’s height, to within the usual tolerances).

The choice of distributions clearly isn’t arbitrary. General knowledge of shoes such as “people don’t casually wear three feet tall shoes because they’d fall over all the time” are being used and aren’t easily dismissed as “subjective opinions”.

So which of the frequentist gambits would they use to discount this? Would they go with the “this is secretly frequentist” gambit? If so, how would they pull that off?

]]>In the R code:

I’m guessing that “stan_run” is a custom function that you’ve created? I just replaced that with “stan”.

In the stan code:

I had to put bounds to all the “*_shoe*” variables, otherwise I would get a series of

Error in function stan::prob::beta_log(N4stan5agrad3varE): Random variable is -0.317581:0, but must be >= 0!Rejecting proposed initial value with zero density.

Error in function stan::prob::beta_log(N4stan5agrad3varE): Random variable is 1.13518:0, but must be less than or equal to 1Rejecting proposed initial value with zero density.

followed by

Initialization failed after 100 attempts. Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.

error occurred during calling the sampler; sampling not done

I still got a couple of those errors after setting the lower bounds, and some warnings of Metropolis proposals being rejected, but at the end it worked and I got pretty much the same results.

Perhaps the dev version of Stan has all sorts of improvements compared to 2.4.0, so that’s why Andrew didn’t get these problems?
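For reference, the bounds fix described above presumably looks something like this in the parameters block (variable names guessed from the post’s model; a beta-distributed quantity must be declared on [0, 1]):

```stan
parameters {
  // Beta-distributed shoe heights must be constrained to [0, 1]
  // (before any rescaling); without the bounds, random inits can
  // land outside the support and beta_log rejects them.
  real<lower=0, upper=1> Jon_shoe_1;
  real<lower=0, upper=1> Catalina_shoe_1;
}
```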

]]>Can you describe how, say, in this example?

Say, I trust the prior “Jon ~ normal(mu_men,sigma_men);” 100 times more than I trust the information that “(Jon + Jon_shoe_1) – (Catalina + Catalina_shoe_1) ~ normal(9.5,1.5);”

Is “100 times more” too vague? Any better way to state this?
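One way to make “trust X 100 times more” precise, though it is not something from the post, is to temper the distrusted term: multiply its log-density contribution by a small weight (a power-likelihood idea). A hypothetical sketch, with made-up numbers:

```python
import math

def normal_lpdf(y, mu, sigma):
    # log density up to an additive constant
    return -math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

def weighted_log_post(d, w=0.01):
    """Log posterior for the height difference d, down-weighting the
    distrusted photo-based term by w (w = 0.01 for '100x less trust')."""
    lp_trusted = normal_lpdf(d, 9.0, 4.0)   # stand-in for the trusted prior
    lp_photo = normal_lpdf(d, 9.5, 1.5)     # the distrusted photo information
    return lp_trusted + w * lp_photo
```

Note that the mapping from a verbal trust ratio to a numerical weight is itself a modeling choice, which is exactly the question being raised.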

]]>I visit another blog (Scott Aaronson’s Shtetl Optimized), and that seems to use WordPress & has a neat preview feature. Not sure how they manage it….

]]>Yes. It’s just priors and measurement error models all the way up.

]]>Yes, you can put general expressions on the left side of a sampling statement, but it’s your job to account for any Jacobian adjustments they induce from non-linear transforms of parameters. In general in Stan, a sampling statement like

y ~ foo(theta);

is just shorthand for incrementing the log probability with foo_log, the log probability function (density or mass) for foo,

increment_log_prob(foo_log(y,theta));

with one key difference: the sampling statement form drops constant terms (we’re planning to give users control over dropping constants in both sampling and functional forms in a future version, but we won’t change the default behaviors).
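The “drops constant terms” point can be seen outside Stan: the full log density and a kernel with parameter-free constants dropped differ only by an additive constant, which cancels in MCMC acceptance ratios. A small sketch using the normal density:

```python
import math

def normal_lpdf(y, mu, sigma):
    # full log density of normal(y | mu, sigma)
    return (-0.5 * math.log(2.0 * math.pi)
            - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def normal_kernel(y, mu, sigma):
    # same, with the constant term -0.5*log(2*pi) dropped, roughly as a
    # sampling statement does (-log(sigma) stays: sigma is a parameter)
    return -math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

# The two differ by the same constant everywhere, so inference is unchanged.
d1 = normal_lpdf(1.0, 0.0, 2.0) - normal_kernel(1.0, 0.0, 2.0)
d2 = normal_lpdf(-3.0, 5.0, 0.7) - normal_kernel(-3.0, 5.0, 0.7)
assert abs(d1 - d2) < 1e-12
```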

WordPress hasn’t fixed lack of preview in 10 years, so I wouldn’t hold my breath.

]]>