The thing about Bayesian methods that you have to remember as a person using them is that they all ask the question “if my model is correct, what do I know about the parameters?”

So two things are key: keeping in mind the dependence on the model being correct, and understanding what it means to be “correct” (i.e., what does the Stan statement my_data ~ my_distrib(my_params) mean?). The meanings are different in a Frequentist and a Bayesian analysis, and that’s also key for people moving to Bayesian analysis to understand.

]]>Sorry, but rather than being “helpful in detecting problems with the likelihood,” this is a failing of Normality assumptions: it is not obvious that something is wrong with the model, because the variance expands to hide the problem.

Work to make Bayesian approaches robust to data contamination and model misspecification is not finished but rather just starting.

If you have not yet read this one effort by David Dunson http://arxiv.org/pdf/1506.06101.pdf I think you would really enjoy it.

]]>Ditto Leicester.

]]>For a simple example, suppose that y ~ normal(t,1) and you model it as normal(mu,sigma) and take 10 observations at time t=0, and then 10 observations at time t=10… After putting the first 10 observations into the Bayesian machinery your posterior concentrates around mu = 0, sigma = 1. Very nice, the world is a consistent place that we all love, think of how wonderful “random” sampling is!

Now you add the next ten observations and it concentrates around mu = 5, sigma = 5 or so (half the points sit about 5 below the pooled mean and half about 5 above it)!!

It always concentrates, but it need not always concentrate on the same place after each round of updating, and this inconsistency is helpful in detecting problems with the likelihood. The Bayesian machinery lets you detect this by giving you a clear way to update information from one dataset to the next, taking into account an assumption of a consistent model (which you can then detect is an incorrect assumption).

Of course we can check things in a frequentist analysis as well and see that our model has inconsistency, but having a consistent logical framework that does this automatically is certainly helpful.
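The two-batch example can be checked numerically. Here is a minimal Python sketch, assuming a flat prior on (mu, log sigma), under which the posterior for mu is centered at the sample mean and the posterior for sigma sits near the sample standard deviation. The seed, the helper name `summarize`, and the choice of prior are assumptions made for illustration; they are not part of the original example.

```python
import math
import random
import statistics

def summarize(ys):
    """Rough posterior summaries for y ~ normal(mu, sigma).

    Assumes a flat prior on (mu, log sigma): the posterior for mu is then
    centered at the sample mean with scale s / sqrt(n), and the posterior
    for sigma sits near the sample standard deviation s.
    """
    n = len(ys)
    ybar = statistics.fmean(ys)
    s = statistics.stdev(ys)  # sample sd with n - 1 in the denominator
    return {"mu_center": ybar, "mu_scale": s / math.sqrt(n), "sigma_center": s}

rng = random.Random(1)  # arbitrary seed, only for repeatability
first = [rng.gauss(0.0, 1.0) for _ in range(10)]    # observations at t = 0
second = [rng.gauss(10.0, 1.0) for _ in range(10)]  # observations at t = 10

after_first = summarize(first)
after_both = summarize(first + second)

print("after 10 observations:", after_first)
print("after 20 observations:", after_both)
```

With ten draws around 0 and ten around 10, the pooled mean lands near 5 and the pooled standard deviation near sqrt(26), roughly 5.1. The posterior still narrows around a single (mu, sigma) pair, but that pair has moved a long way from where the first batch put it.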

]]>> When you collect more data and it fails to concentrate your parameter estimates, it indicates that there is no one parameter estimate that can explain the data sufficiently.

This requires the likelihood to have specified a random (common in distribution) parameter rather than a common parameter – with a common parameter, likelihood always concentrates with more independent data – see Review of likelihood and some of its properties in my old post here http://statmodeling.stat.columbia.edu/wp-content/uploads/2011/05/plot13.pdf

(If I am wrong I would really like to know!)

So it’s primarily getting the likelihood less wrong rather than any underlying Bayesian non-reproducibility.

]]>http://models.street-artists.org/2016/05/17/bayesian-models-are-also-non-reproducible/

]]>https://gemsandrhinestones.com/2016/05/04/why-did-bookmakers-lose-on-leicester/

https://gemsandrhinestones.com/2016/05/05/what-price-should-leicester-have-been/

]]>See this comment. The evidence from horse racing, at least, is that longshots are already overbet, so in general I don’t recommend longshot betting as a way of winning money. Especially since, if you want to make money reliably using this strategy, you’ll have to find *a lot* of these undervalued longshot bets. And I don’t think anyone suggests that there are many of these possibilities.

It’s going to be incredibly difficult to distinguish between 1:100, 1:500, 1:1000, and 1:5000. But of course that’s where some real money can be made. Find a bunch of longshots that are priced at 1:5000 but you think are actually 1:100, and you could make some real money. I bet (ha) that lots of people will be trying that this coming August. ]]>

If reasonable odds were 500-1 or even 200-1, it was still very impressive that Leicester won. Buster Douglas’s odds were only 42-1 and people are still talking about that one! I think we’re all in agreement that, prospectively, Leicester deserved long odds of winning the championship. Just not 5000-1.

You offer data from 23 prior seasons, ok, that’s fine, that gives us a factor of 1/23 right there. Still a long way from 1/23 to 1/5000. For example, if you want to call it 1/50 that any team in the bottom half will win the championship in any given year, and then say that Leicester was one of these 10 teams, that gives you 1/500. Obviously there are lots of other ways of doing these calculations (as I’ve said before, I think it makes sense to model based on precursor events), but, again, it takes a lot to get to 1/5000.
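The back-of-the-envelope calculation above can be written out exactly. This is only a sketch of that arithmetic in Python: the 1/50 figure is the illustrative number from the comment, the equal split across the 10 bottom-half teams is the implicit assumption, and the variable names are made up here. Strictly, odds of 5000-1 correspond to a probability of 1/5001, which the comment rounds to 1/5000.

```python
from fractions import Fraction

# Illustrative figure from the comment: chance that *some* bottom-half team wins the title.
p_bottom_half_champion = Fraction(1, 50)
teams_in_bottom_half = 10  # half of a 20-team league

# Implicit assumption: each bottom-half team is equally likely to be that champion.
p_leicester = p_bottom_half_champion / teams_in_bottom_half

# Probability implied by bookmaker odds of 5000-1 (ignoring any margin).
p_implied_5000_to_1 = Fraction(1, 5000 + 1)

ratio = p_leicester / p_implied_5000_to_1

print(p_leicester)   # 1/500
print(float(ratio))  # about 10: the rough estimate is ten times the quoted price
```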

]]>If you look at Leicester’s prior season – they were last in the league at Christmas and went on a good run to avoid relegation, then fired their coach after his son was in a sex tape using racist slurs, and hired a manager who had been out of work since being fired 8 months prior. Their main striker had scored all of 5 EPL goals, had better odds of being suspended for assault than of scoring 20, had just had his own video using racist slurs surface, and had a rep for showing up to training drunk.

The 5000-1 odds were there for good reasons. Maybe 1000-1 would have been better odds (how can you even calculate the difference?), but it was still absolutely incredible that Leicester did what they did. 50-1 odds for the 14th-ranked team (West Bromwich Albion) to win next year are absolutely awful. I wouldn’t be surprised if a team that has finished in the bottom half of the league the prior season never wins the EPL ever again (barring a giant cash infusion).

]]>Indeed, you can make an argument that 5000-1 is ridiculous odds in a 20-team league; see the article referred to in Ivan’s comment here in which everyone is in agreement that those 5000-1 odds were nuts.

Here you’re using 200-1 as a baseline. Again, there’s a big big difference between 200-1 and 5000-1: it’s a factor of 25! If the odds for Leicester had been 200-1 or even 500-1, I doubt Campos would’ve written that post.

You do raise a good question, though, which was why there were no savvy bettors scooping up this juicy lottery ticket ahead of time. That I don’t know. I wouldn’t’ve even thought of trawling through obscure sports betting opportunities to find this one.

My guess is that savvy bettors know about the longshot bias, and so the sorts of bettors who were looking for good-value, positive-expectation bets didn’t even think of looking for good-value longshots. The Leicester odds were hiding in plain sight. And of course if word had gotten out about these juicy odds and some people had started betting real money on it, the bookies would’ve adjusted accordingly, hence limiting their losses.

In future, bettors will be aware of the possibility of good-value longshots, but of course the bookies will be aware also, so now it’s probably too late to follow your “no problem, get the money” strategy.

]]>The main idea is that the exact same argument can be made retrospectively about any actual longshot winner; that Leicester City is not a particularly “special” extreme longshot in this regard (in a way that could be determined before the season).

So if it’s true that the fair price before the season was like 200-1 for them, I think the same argument, if you accept it, establishes that the fair prices for all the other longshot teams were similar (or better, since Leicester was generally judged to be one of the worst teams before the season). See how accepting this kind of retrospective argument leads to no team having really long odds and therefore longshots greater than 200-1 being good bets in general?

As an aside, there isn’t any vig in cases like these: if they offer you a better price than fair, you have positive expectancy from the bet; they don’t charge you something else if you win or something. The vig in markets like these comes from the fact that the entire market adds up to >100% probability. So if you think all these teams are 200/1 instead of 1000/1, no problem, get the money.
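To make the two points in the aside concrete, here is a small Python sketch. The four-outcome market and its prices are hypothetical numbers invented for illustration, as are the function names. It shows (a) that the margin appears as implied probabilities summing to more than 1, and (b) that a single bet priced longer than its fair odds has positive expected profit.

```python
from fractions import Fraction

def implied_probability(odds_to_one):
    """Probability implied by fractional odds of `odds_to_one`-1."""
    return Fraction(1, odds_to_one + 1)

def expected_profit(true_probability, odds_to_one):
    """Expected profit per unit staked: win `odds_to_one` with prob p, lose 1 otherwise."""
    return true_probability * odds_to_one - (1 - true_probability)

# Hypothetical four-outcome market, quoted as odds-to-one (1-1, 2-1, 4-1, 9-1).
hypothetical_market = [1, 2, 4, 9]
book_total = sum(implied_probability(o) for o in hypothetical_market)
overround = book_total - 1  # the bookmaker's margin across the whole market

# A bet whose fair price is 200-1 but which is offered at 1000-1.
fair_p = implied_probability(200)
edge = expected_profit(fair_p, 1000)

print(float(book_total))  # about 1.133, i.e. more than 100%
print(float(overround))   # about 0.133
print(float(edge))        # about 3.98 units of expected profit per unit staked
```

The margin lives in the sum over the whole market, not in any surcharge on a winning ticket, so a single mispriced longshot can still be a good bet even when the book as a whole favors the bookmaker.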

]]>Thanks for the link. Here’s the key bit:

William Hill said 25 customers took the 5,000-1 odds with the largest stake £20 from a customer in Manchester and the smallest 5p from a woman in Edinburgh.

So they didn’t have much of a motivation to get the odds right on this one.

]]>It’s hard to beat the vig, so the fact that it’s not easy to go out and “be a big winner” does not mean the odds are alright. And you say, “you can bet on eight teams at 500/1 or better.” But there’s a big difference between 500-1 and 5000-1!

Regarding your specific idea: In horseracing they talk about longshot bias which means that it’s the long odds which tend to be the bets to avoid. So it will depend on context. So I don’t recommend betting longshots as a general strategy.

Regarding Campos: He wasn’t arguing that all longshot bets are good, he was talking about this particular case. And, the fact that Leicester won is indeed *some* evidence that these 5000-1 odds were off. Yes, his argument was “handwaving” (that is, not quantitative), but I think a quantitative argument (what above I called a model for rare events using precursors, an idea that you appeared to be mocking in your comment above) could make this argument more precise.

Good luck with that.

]]>See the P.S. and P.P.S. above. The short answer is that, sure, you can label these as one-of-a-kind events and say that it is impossible to evaluate the probabilities retrospectively. But to me that’s just giving up too early. These are *not* one-of-a-kind events; there are other soccer games and other elections. Oddsmakers can make errors, and I find Campos’s argument above to be persuasive. Maybe better odds would’ve been 200-1 or whatever, but 5000-1 does seem extreme. Sure, it took the unexpected event to make us realize this, but that doesn’t mean that there weren’t problems there all along.

Consider this analogy: Hurricanes Katrina and Sandy made us realize that, in retrospect, New Orleans and New York had problems in their disaster preparedness. Yes, it took these hurricanes to make policymakers aware of the problem, but the problem was there, nonetheless.

]]>Absolutely not, no. I’m not saying that it’s all the same really. It does make a difference indeed. What I think is that the concepts of probability that we’re discussing now still have traces of older ideas and intuitions in them, and that understanding this contributes to the understanding of the concepts. I have this view that may seem paradoxical at first sight, namely that a) it is important and helpful to be clear about the differences between the different concepts of probability (on which basis it can be seen that different concepts can be appropriate for different applications) and b) that it is also important to understand to what extent all these different concepts trace back to certain common intuitions and original ideas – although this has to be qualified, because things were not monolithic in the past either, and different older strands of thought went into different later developments, not all with the same weight.

So for example I’d suspect, making reference to an old discussion we had on your blog, that Cox was quite keen on coming up with technical conditions that enabled him to show that the kind of measurement of evidence he was interested in is equivalent to already existing probabilities. This connected him to a huge existing culture and made him relevant. At the same time it meant that what he came up with cannot be completely and cleanly separated from what the rest of that culture did. To me it therefore seems funny when “Laplace” emphasizes (a little bit here and much more elsewhere) that the frequentists need to understand that whenever what they do works, it is “actually really” (Jaynesian/Coxian) Bayes. Historically the frequentists were not first, but they came before Cox and Jaynes, and Jaynes and Cox did what they did partly in order to accommodate what they thought had worked well before they came. I think it’s an illusion to think that at the same time they can be totally free of what caused issues with the stuff that did not work so well before.

]]>My main point was that those bookmakers’ odds do not necessarily represent a good or accurate estimate of the real probability of the event in question occurring! And they should not be interpreted like this.

]]>the fact that Laplace initially wrote about counting states (i.e., frequencies)

Christian, in your opinion, were Venn’s and Boole’s frequentist criticisms of Bernoulli’s and Laplace’s classical definition of probability actually just a case of distinction without a difference?

]]>+10

]]>However, I think that the fact that Laplace initially wrote about counting states (i.e., frequencies) illustrates very nicely that there is some “gravity” to frequencies as a standard conceptualisation of probabilities. If probabilities are just abstract measurements of plausibility that cannot be given a “closer to life” interpretation than that they fulfill Cox’s axioms (a major building block of Jaynes-style Bayesianism), it is hard to connect them to what we are interested in in life.

Frequencies that stem from hypothetical repetitions of the world are very abstract and far from life, too, and insofar they’re just a crutch for thinking, and a very weak one at that; except that any crutch is probably helpful (even if only a little bit) to make sense of Cox’s axioms. Which, I’d assume, made Laplace write about counting states.

The bookies will actually be interested in frequencies (although these are weighted, too, by the money that is bet); they ultimately need to predict distributions of bets (they may have a stab at predicting distributions of match results, too, although this is of secondary importance to their jobs), and any probability computation helpful to them should have an implication regarding such to-be-observed distributions. An abstract plausibility measurement won’t serve them.

]]>To view “non-reproducibility” as a feature is sort of upsetting my whole mental model of doing science. :)

]]>Planning to put up a blog post directly responding to your points, i.e., the point that your Bayesian models are not reproducible either. I consider this a feature! (Specifically, it’s a feature that you are aware that they don’t reproduce.)

]]>Getting fancier, expensive lipstick doesn’t really help much.

PS. We need “Lipstick on a Pig” in the Lexicon. :)

]]>Was there a parallel pari mutuel betting on the event too? It’d be interesting to see what the odds people placed on this outcome there.

Or is there no difference, in practice? i.e. Do the fixed-odds offered by bookies match the autogenously arising pari mutuel betting odds approximately?

]]>You don’t *have* to model the probability that Leicester is going to win the football championship, but modeling it seems to be a good idea if you want to place bets on the event!

I moved away from frequentist to Bayesian modeling in psycholinguistics some years ago (2012?), and my students and I have published maybe a dozen or more papers since then using Bayesian methods (Stan or JAGS). We still have the problem that we cannot replicate most of our and most of other people’s results. A lot of the stuff my field produces is just one-time hole-in-one lucky shots, never to be repeated, and it has nothing to do with the sophistication or philosophy one brings to the problem. Psychology is probably the same. (If you submit a paper with replications, editors and reviewers ask you: what’s the point of wasting journal space on replications?)

The problem lies with the kinds of questions we ask and the methods we use to answer them; we also don’t learn from experience. Even the assumption that mu has some fixed prior distribution is pure fiction (as Andrew has noted, I think). Right now I see a tendency among people to blithely ask even more subtle questions about language than we used to; we have learnt nothing from our embarrassing failure to even understand the drosophila of psycholinguistics, relative clauses.

]]>Like Andrew says, the biggest issue is using methods/models incorrectly, usually in the misguided pursuit of certainty or definitive answers from noisy situations.

]]>To channel Mayo, it’s incorrect frequentist calculations that lead to these mistakes. Power-pose researchers not realizing that the p-value depends on what you would’ve done had the data been different, Monty-Hall-style. Had they done freq methods correctly (via preregistration, as in the 50 Shades of Gray paper), they most likely wouldn’t have made those mistakes.

To channel me, it’s a problem with statistics in general (including Bayesian statistics) that it is sold as a way to get effective certainty from noisy data. Everybody’s textbooks are all too full of triumphant examples with uncertainty estimates that happily exclude zero. Students learn the lesson all too well.

]]>What do you mean power poses don’t mean anything? we have frequentist guarantees on the size of the effects!

]]>yup I actually agree. Personally I’ve been using both a pen-enabled tablet for derivations and a laptop with code/animations in lectures. Incidentally my Dad has finally decided to do his PhD and the topic is the use of pen-enabled tablets in mathematics education.

]]>Yes, what Tao writes is related to probabilistic programming, which is what’s done in Stan, in particular in the generated quantities block. But I disagree with Tao when he writes, “When teaching mathematics, the traditional method of lecturing in front of a blackboard is still hard to improve upon.” All the evidence I’ve seen, both formal and anecdotal, suggests that lecturing in front of a blackboard may be a great way for the instructor to learn the material, but it’s not such a good way to teach. Don’t get me wrong—it’s great to have a blackboard—but it’s my impression that blackboard derivations go in one ear and out the other.

]]>https://terrytao.wordpress.com/2016/05/13/visualising-random-variables/

]]>Ie – just as you can derive the same ODE models with standard analysis as you can with nonstandard analysis, you can usually give a probability model both a frequentist and plausibility interpretation. You may prefer one to the other, fine, but some people are equally ok with epsilons and deltas.

]]>1. As noted, the odds were the amount necessary to attract money to bet on Leicester, and that was before the season. This suggests the crowd under-estimated Leicester’s chances at the start and that the stories miss the point that of course the odds changed with each game.

2. I think it’s absolutely correct to say the pre-season odds were out of whack and specifically for 1 or 2 reasons, both referring to the model being used. That is, assuming the odds are set by the need to attract cash to a team, the crowd seems to have used a traditional big money Premiership model in which the only contenders for the actual title come from a short list. The second reason is that same model may have been used by the experts, but I don’t know if that’s true or to what extent. The model for the Premiership has changed as money has flowed to each team. For example, it was traditional that lower teams would sell their best players to the big boys as though that were a law. One of the contenders this year, Tottenham, has refused to sell Harry Kane and now teams are all rich enough that they don’t need to sell and certainly not within their own league. The change in transfers has been noted over the past few years. Another effect of more money is that even the worst team has solid players from all over the world; the average salary is now almost $3M, much more than any other league. The model used to be that ordinary teams hoped to generate talent they could sell and now the model is that each team has enough money to put some of the world’s best players on the pitch. The Premier League differs in this way from other top leagues. Consider Spain: a small number of top teams, meaning Barca, Real and Atletico, are essentially all-star teams that routinely dominate the rest. I’d say the changes in the Premier League were evident in prior seasons and that this year merely brought them out. Note that while we talk about Leicester, Tottenham was next and they also have never won. Under the prior model, ManU, Chelsea, Liverpool, etc. would have continued at the top because they would have scooped up the top players from lower teams and the other teams would not have the resources. The new model is closer to the NFL model of general parity.

]]>A given frequentist probability is a single number which may be considered (at least as a first cut) as a fraction representing a property of an ideal (hypothetical) population.

A given fraction may be interpreted as a frequentist probability using a well defined algorithm/model for generating samples given this probability. This may or may not be relevant to the world, just like assigning a ‘plausibility’ to something may or may not be relevant to the world.

What’s the plausibility God exists, Leicester city wins etc? Andrew seems to return to the question of defining a relevant reference population to answer this, which seems reasonable to me.

Both are probably best viewed as some summary of some property of (ie within) a model eg how many states of type 1 are there vs type 2 or, if I simulate the model ‘forever’ (ie a large number of times) how often would I expect states of type 1 or 2 to occur.
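The “simulate the model forever” reading can be illustrated with a toy simulation. This is only a sketch in Python: the two-state model, the value 0.3, the seed, and the function name are all assumptions chosen for illustration.

```python
import random

def long_run_frequency(p_type1, n_draws, seed=0):
    """Simulate a two-state model n_draws times; return the fraction of type-1 states."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_draws) if rng.random() < p_type1)
    return hits / n_draws

p = 0.3  # the "fraction" assigned to type-1 states inside the model
freq_small = long_run_frequency(p, 100)
freq_large = long_run_frequency(p, 100_000)

print(freq_small)  # noisy with only 100 simulated states
print(freq_large)  # settles close to 0.3 as the number of draws grows
```

Either way the number 0.3 is a property of the model; whether it says anything about the world depends on whether the model, or the reference population behind it, is relevant.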

]]>To fit me a useful model I’d rather choose a wise Bayesian over a stupid frequentist & a wise frequentist over a stupid Bayesian. Because it seems to me that, so long as they are both smart, they end up producing almost equally good models. Though they will call things by different names and passionately hate the other side’s model’s philosophical groundings.

On the other hand, if you get Cuddy or Fiske to write a model for you, it’s probably crap anyways, but that has little to do with Bayesian or Frequentist foundations.

]]>I can’t say much about Laplace’s models because I’ve never seen anything he’s modeled. But you can look at my dissertation here:

http://models.street-artists.org/wp-content/uploads/2014/03/dissertation.pdf

In it, I have a model for how waves travel along a “1D” bar of molecules. The computation is done using molecular dynamics. Using some modeling techniques based on ideas from Nonstandard Analysis, I derive an ODE for the wave propagation, that is, the propagation of statistical averages over the molecules. It looks like the wave equation, but with an added “momentum diffusion” term whose intensity scales like a certain function of the length of the bar as a fraction of the molecular length scale, the temperature, and so forth.

Given a coefficient which is specific to each bar and related to the effective momentum diffusivity, you can run the ODE and then get predictions for the measurements that are output from the molecular dynamics.

To find the coefficients for each bar, I run a Bayesian calculation in which I write a likelihood for the timeseries of the logarithm of total kinetic wave energy. This likelihood is a transformed non-stationary Gaussian process. It views the whole timeseries as a point in a high dimensional space. I had to do the sampling in the “mcmc” package in R, because Stan didn’t have an ODE solver at that point.

Writing down that likelihood was directly a result of changing how I viewed statistics. Instead of thinking in terms of repeated sampling of errors through time, I thought in terms of plausibilities of different observed functions based on what errors I knew would be reasonable to expect, and what errors were unreasonable. The choice of covariance function was directly based on my knowledge of this physical process, not of any concept of repeated sampling.

So, yes I think it has practical significance, it opens up modeling vastly.

]]>