A few years ago, David Rothschild and I wrote:

Prediction markets have a strong track record and people trust them. And that actually may be the problem right now. . . . a trader can buy a contract on an outcome, such as the Democratic nominee to win the 2016 presidential election, and it will be worth $1 if the outcome occurs and $0 if the outcome does not occur. The price at which people are willing to buy and sell that contract can be interpreted as the probability of the outcome occurring, or at least the collective subjective probability implicitly assigned by the crowd of people who trade in these markets. . . .

But more recently, prediction markets have developed an odd sort of problem. There seems to be a feedback mechanism now whereby the betting-market odds reify themselves. . . .

Traders are treating market odds as correct probabilities and not updating enough based on outside information. Belief in the correctness of prediction markets causes them to be too stable. . . . pollsters and pundits were also to some extent anchoring themselves off the prediction odds. . . .

And that’s what seems to have happened in the recent Australian election. As Adrian Beaumont wrote:

The poll failure was caused in part by “herding”: polls were artificially too close to each other, afraid to give results that may have seemed like outliers.

While this was a failure for the polls, it was also a failure of the betting markets, which many people believe are more accurate than the polls. . . . the Betfair odds . . . implying that the Coalition had only an 8% chance of winning. . . . It is long past time that the “betting markets know best” wisdom was dumped. . . .

I don’t want to overstate the case here. The prediction markets are fine for what they are. But they’re a summary of what goes into them, nothing more.

**P.S.** Yes, if all is calibrated, if the stated probability is 8%, then the event will occur 8% of the time. You can’t demonstrate lack of calibration from one prediction. So let me flip it around: why should we assume that the prediction markets are some sort of oracle? Prediction markets are a particular information aggregation tool that can be useful, especially if you don’t take them too seriously. The same goes for any other approach to information aggregation, including those that I’ve promoted.

“… why should we assume that the prediction markets are some sort of oracle? Prediction markets are a particular information aggregation tool that can be useful, especially if you don’t take them too seriously.”

How seriously to take markets is an empirical question. Actually, I would say “take seriously” is a category error in talking about prediction markets.

From this paper (Figure 5), it looks like there’s actually a strong positive longshot bias in election markets (n=1,883). Other markets are pretty well calibrated.

https://faculty.fuqua.duke.edu/~clemen/bio/Published%20Papers/45.PredictionMarkets-Page&Clemen-EJ-2013.pdf

This paper argues that, for political questions, polls have lower bias and markets higher precision:

https://irep.ntu.ac.uk/id/eprint/33821/1/11278_Vaughan-Williams.pdf

But the point of both is that you can do some calculation to see how good they are in terms of predicting outcomes of interest. Sure, at the end of the day, every situation is different. But we do have a fair amount of information on how well these aggregation channels actually perform.

In the previous post I hemmed and hawed while thinking about a good metric for the “goodness” of a prediction market. The final idea was essentially surprisal per prediction:

sum(-log(p(actual_outcome[i])), i=1..N) / N

It turns out that for binary outcomes this is the same as the well-known log loss… But it works just as well for multi-outcome distributions. It can even work for a “continuous” outcome provided we handle it correctly… as all measurement outcomes are actually discrete at some level (say a temperature measurement with a digital thermometer is often rounded to the nearest 0.1 C or 0.1 F, or a weight measurement on your bathroom scale is rounded to the nearest 1 lb or 0.5 lb, or something similar)

So, here’s a question, does anyone have any datasets on different prediction markets with the price history and the final outcomes for a large range of predictions? It would be interesting to compare the markets by this metric, and compare to frequentist “calibration” metrics…

Obviously the choice of logarithm base alters this by a constant factor, but for convenience of interpretation let’s use log2 so the measure is in bits/prediction.
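A minimal sketch of this metric in Python (the function name and the example probabilities are made up for illustration; the input is just the probability each market assigned to the outcome that actually occurred):

```python
import math

def mean_surprisal_bits(probs_assigned):
    """Average surprisal in bits per prediction. probs_assigned holds the
    probability each market/model gave to the outcome that actually occurred."""
    return sum(-math.log2(p) for p in probs_assigned) / len(probs_assigned)

# Three hypothetical predictions: the realized outcomes had been given
# probabilities 0.94, 0.5, and 0.08 beforehand.
print(mean_surprisal_bits([0.94, 0.5, 0.08]))  # ≈ 1.58 bits/prediction
```

A perfectly informed market would score 0 bits/prediction; the 0.08 event alone contributes about 3.6 bits of surprise.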

> It would be interesting to compare the markets by this metric, and compare to frequentist “calibration” metrics…

What do you think that this metric is, other than a frequentist calibration metric?

It’s an information content metric. It’s measuring the observed average quality of Bayesian predictions of one-off events. Each event is unique and has a different set of information informing its probability assignment. This metric asks how informed a person having the model is about what will actually happen. If they are well informed, this metric will be small: they are unsurprised by the final outcomes. Since there is no repetition of “the election that was held on day X,” there can be no frequency probability associated with it.

A frequency metric would look at all the events predicted say 60% and ask whether very close to 60% of them actually happened… it depends on things that *didn’t* happen.

The metric here *only* depends on the probability assigned to the thing that *did* happen.

Suppose Brexit was predicted 86% “for.” You can’t re-run Brexit and see whether 86% of the time you run Brexit the vote goes “for”…

The best you can do to construct a “frequentist” process is to choose a random event that was predicted at 86% and then see whether the fraction of such randomly chosen events that come out “for” is very close to 86%. If you have non-binary outcomes, a frequency metric should depend on the probabilities you assigned to the outcomes that didn’t happen: so if p1, p2, p3 are the probabilities you assigned to those outcomes, it should be a function of p1, p2, and p3… whereas surprisal per prediction is *only* dependent on the probability you assigned to the thing that did happen.

when you have binary outcomes, the prediction for the thing that didn’t happen is a fully symmetric fact about the thing that did happen. but with multiple outcomes possible this symmetry goes away.

For example suppose you have 3 outcomes possible A,B,C and you assign 50,25,25 %… and this is repeated 100 times… the outcome is always A

should this have the same frequentist evaluation as a model

50,45,5 % ?

It seems to me if we’re going to evaluate frequencies, we should be looking at predictions like “B should happen 25%” or “B should happen 45%” etc… You could argue that these should be symmetric and cancel, but that’s the same as saying all frequentist metrics should be linear functions of the probabilities right? log is certainly nonlinear, as is squared error in frequency…

The kind of thing I was thinking of when I said “Frequentist calibration metrics” would be things like chi-squared tests of goodness of fit… bin things into different predicted probability bins, then see whether the fraction of observed things that fall into that bin is statistically significantly different from the expected number under a random sampling model with the predicted p… etc.
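A rough sketch of that binning idea in Python (the function name, bin count, and example numbers are all illustrative; a real test would add a chi-squared statistic per bin):

```python
from collections import defaultdict

def binned_calibration(preds, outcomes, n_bins=10):
    """Bin binary predictions by predicted probability, then compare the
    mean prediction in each bin with the observed frequency of the event."""
    bins = defaultdict(list)
    for p, y in zip(preds, outcomes):
        b = min(int(p * n_bins), n_bins - 1)   # p = 1.0 goes in the top bin
        bins[b].append((p, y))
    return {
        b: (sum(p for p, _ in pairs) / len(pairs),   # mean predicted prob
            sum(y for _, y in pairs) / len(pairs),   # observed frequency
            len(pairs))                              # count, for a chi-squared test
        for b, pairs in sorted(bins.items())
    }

# Ten events all predicted at 60%, of which six happened: well calibrated.
report = binned_calibration([0.6] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
print(report)
```

Note how this metric genuinely depends on the events that *didn’t* happen, in contrast to the surprisal per prediction.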

I am pretty sure you want the bilinear loss function:

https://arxiv.org/abs/1704.06062

https://stats.stackexchange.com/questions/292445/how-to-implement-bilinear-loss-function-in-r-tensorflow

The bilinear loss is a generalization of the cross entropy loss, which is in turn a generalization of the log-loss.

Bilinear loss is a whole class of loss functions depending on what your “loss matrix” looks like. Since the classification vectors are 1 or 0, it is a fancy linear algebra way to write a sum of costs, picking out only the relevant costs. I often find linear algebra like this to be conceptually opaque.

The thing bilinear loss gives you is the ability to express a huge range of utility-type models as instances of a single kind of calculation. For example, when you have a particular application like in manufacturing, then you can penalize different things based on how much they matter to you in dollars or something… but this is obviously a utility measure, rather than an information measure. Of course the information measure I propose is one particular instance of the bilinear loss, where the cost a[i,j] = -log(p_j(actual_value[i]))

Using the surprisal, what you find out is in some sense how many bits of information you learned when the “true outcome” was revealed. A highly informed market should have very low surprisal, because you essentially “already know” what’s going to happen, and what your model told you was in fact correct.

As far as I can see, there’s nothing really Frequency based here at all… each prediction is based on a new set of information, aggregated over the market, and then when the true outcome is revealed, we have some amount of information gain. To the extent that the real outcome had probability close to 1 the information gain from the “reveal” was small, and to the extent it was close to 0, the information gain was large.
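One way to read the “fancy linear algebra” point above, sketched in Python/NumPy (the variable names and the 3-outcome example are illustrative, not from the linked paper): with a one-hot outcome vector, the bilinear form just selects the row of costs for the outcome that happened, and plugging in -log of the predicted probabilities recovers the surprisal.

```python
import numpy as np

# With a one-hot outcome vector y and a cost matrix A, the bilinear form
# y @ A @ q picks out only the costs relevant to the outcome that occurred.
def bilinear_loss(y_onehot, A, q):
    return y_onehot @ A @ q

p = np.array([0.4, 0.25, 0.35])   # predicted probabilities for A, B, C
y = np.array([0.0, 1.0, 0.0])     # outcome B occurred

# Special case: identity costs applied to -log(p) recovers the surprisal.
loss = bilinear_loss(y, np.eye(3), -np.log(p))
print(loss)  # == -log(0.25) ≈ 1.386
```

Swapping `np.eye(3)` for a matrix of dollar costs turns the same calculation into a utility measure instead of an information measure.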

Yes, if the cost of every type of error is equal, that is the cross entropy. What you want to predict in a prediction market is the proportion of votes going to each option, not just the winner.

That’s the cross entropy only if your model is all options equally probable. If the “cost” of predicting an outcome that happened is -log(p) where p is the probability you assigned to the thing that happened then we have the surprisal. For the most part surprisal is the cross entropy of a prediction with an empirical distribution assigning 100% probability to the thing that happened

if the cost is not a function of the probability but of, say, the dollar cost or the number of QALYs lost or whatever, those are all still expressible as bilinear cost functions

Cross entropy allows you to predict whatever you want for each option. Eg, you predict 43%, 27%, 20%, 10%. The actual outcome is 55%, 10%, 25%, 10%. Cross entropy tells you how well your predictions performed.

Right, but then the a_ij won’t follow what you said “the cost of every type of error is equal”

also, you are assuming a fixed model of repeated events where you can calculate the frequency. See below for some differences in the prediction market context, where each event is a unique event with a unique probability distribution predicted for all its possible outcomes.

https://statmodeling.stat.columbia.edu/2019/11/10/when-prediction-markets-fail/#comment-1163444

I think you are using “cost” in a different way. I am using it in the sense that when a model says to “buy/long” when the correct action was “sell/short”, this is worse than deciding “hold” instead of “sell/short”. Different errors have different costs.

I’m talking about information coding cost… every real world problem can be evaluated by lots of costs, different people have different economic situations for example…

but if you have a probabilistic model, you can use that model to design a Huffman code, where different events are described by different symbols that consist of different numbers of bits. So for example if your prediction market assigns probabilities to who wins each of 20 sports contests, the symbol “0” could mean “all the contests were won by the team that had the highest probability to win” thereby turning a 20 bit message into a 1 bit message…

this is the cost measured by surprisal… it is relevant regardless of whether you bet $1 on the first game or $1000
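The Huffman-code idea can be sketched in a few lines of Python (stdlib only; the two-match example and its probabilities are made up):

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code from outcome probabilities.
    probs: dict symbol -> probability; returns dict symbol -> bitstring."""
    # Heap entries are (probability, tiebreaker, partial codebook).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # merge the two least likely groups
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

# Hypothetical joint outcomes of two matches ("F" = favourite wins,
# "U" = upset), with "both favourites win" by far the most likely:
code = huffman_code({"FF": 0.7, "FU": 0.15, "UF": 0.1, "UU": 0.05})
print(code)  # the most probable outcome gets the shortest codeword
```

As in the 20-game example, the likeliest combined outcome gets a 1-bit symbol, and the rare upsets pay for it with longer codewords.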


Log loss is an example of a proper scoring rule. It’s the standard loss function to use for calibration. See, for example, Dawid’s “Well-calibrated Bayesian” or Gneiting et al.’s “Probabilistic forecasts, calibration and sharpness”. They work through the continuous generalizations. It’s also related to what the information theory folks call cross-entropy, which also generalizes to continuous outcomes.

Right, for binary outcomes, the total (or average) surprisal is minimized by a model outputting the vector of outcomes… so if the outcomes are 101100, and your probability for each to occur was 101100 then you can’t get less surprising than that.

The concept of calibration only applies if a fixed, unchanging model and IID outcomes are imposed on the problem.

in the prediction market, each problem is a new thing, with a different set of possible outcomes, and each prediction is based on a relevant information set for that thing. There’s nothing repeatable about it, however the surprisal is still a useful metric. For example, if the prediction market probabilities are widely disseminated through a separate channel, and there’s some well understood algorithm for constructing a code from the prediction market probs… You can transmit the outcomes of votes or sporting events or other events to Mars with a very small number of bits using the code by referencing the prediction market for decoding…

If the prediction market is very good, you basically never need to send any additional code bits, because the outcomes are 100% predictable from the prediction market probability data. However, if something happens that was given a very small probability under the prediction market… you’ll have to transmit a bunch of bits to inform Mars that this unexpected event happened.

You’ve lost me.

Forget the bilinear loss, since that seems to have confused things.

Say you have an election and a model that predicts the share of votes each candidate will receive (which is roughly the probability they will win). Then you get the actual election results. The usual way to assess the skill of such a model is the cross entropy. Same thing for a prediction market.

How is this superior? sum(-log(p(actual_outcome[i])), i=1..N) / N

This was supposed to be a response to Daniel here: https://statmodeling.stat.columbia.edu/2019/11/10/when-prediction-markets-fail/#comment-1164038

The actual concept of cross entropy is basically the average of this measure over a well-identified IID repetition process.

So, you have a loaded die, with known bias, and a model that says it’s unbiased and IID (the probabilities don’t change with each roll). Cross entropy averages how many extra bits you’ll need to describe a long string of these rolls due to the known bias.

“cross entropy” as calculated in ML should really be called something like “sample estimate of the cross entropy”, because to calculate the cross entropy you should average over the “true” distribution that the die has, which is always unknown, and therefore requires basically an infinite series of throws, so if you have 25 training data points, the estimate can be somewhat far from the quantity you’re trying to calculate.
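A quick simulation of that point (illustrative numbers; note I give the model a mildly wrong bias rather than a uniform one, because with a uniform model every roll costs the same log2(6) bits and the sample estimate wouldn’t fluctuate at all):

```python
import math
import random

random.seed(1)

# A loaded die (the "truth") and a model with a mildly wrong bias of its own.
p_true = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
p_model = [0.3, 0.14, 0.14, 0.14, 0.14, 0.14]

# Exact cross entropy: expectation of -log2 p_model under p_true.
exact = sum(pt * -math.log2(pm) for pt, pm in zip(p_true, p_model))

# The "sample estimate of the cross entropy" that ML code reports, from a
# finite string of 25 rolls -- it fluctuates around the exact value.
rolls = random.choices(range(6), weights=p_true, k=25)
estimate = sum(-math.log2(p_model[r]) for r in rolls) / len(rolls)

print(exact, estimate)
```

With only 25 rolls the sample estimate can sit noticeably away from the exact value; it converges only as the number of IID rolls grows.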

In any case, the concept only applies when there is a “true IID process” (like the loaded die) from which you are sampling and your model is an “IID process” whose probabilities may be different from the “true” ones.

When we’re dealing with a prediction market, each event is an unrepeatable event, and the model predicts a different probability for each of them because it’s an aggregate of a bunch of different information (the information about which team will win friday’s soccer game is totally different from the information about which person will win the local school board election… etc)

Since the events aren’t even in principle repeatable, at best you can say that whatever the real outcome was had probability 1 and the rest had probability 0… this is the frequency in “all the repetitions that are possible”.

When we average over this “true” probability distribution, the only thing that matters is the 1, since multiplying by 0 makes every other term drop out of the sum. So when you calculate the cross entropy between this “sequence of deterministic but unknown events” and the distribution where you don’t know what the event will be but treat it as having the probability from the prediction-market estimates, you get the sum I’m talking about above, because all the terms where the “true” probability isn’t 1 have “true” probability 0 and drop out.

Does that make more sense?

The interpretation of this measure is “how many more bits your code symbol needs because the model didn’t predict the actual outcome with certainty p=1”

It’s kind of a property of the code-book you’d make from your prediction market.

In math, the cross entropy between ptrue and pmodel is:

sum(ptrue(outcome) * -log(pmodel(outcome)))

it’s a property of two theoretical distributions, it can be calculated if you hypothesize any two distributions, in this sense ptrue doesn’t have to be a “physical fact” but normally it’s treated as if it were, and the sum is taken over all the possible outcomes. The only reason we calculate it from a sample is because ptrue isn’t actually known, so we get some estimate of it. Of course, when calculating from a sample, you have to assume the sample is IID samples from a constant ptrue… otherwise you should be changing ptrue for each event…

If for example you’re rolling a long series of differently weighted dice, indexed by i, you should do:

sum(ptrue[event=i](outcome=j) * -log(pmodel[event=i](outcome=j)), i,j range over all events and all outcomes within the events)

in the non-repeatable prediction market context, ptrue[event=i](outcome_that_occurred)=1 and it is 0 for all other outcomes… You can either treat this as a physical fact (all the repetitions that were possible caused outcome_that_occurred to happen, so the frequency was 1), or you can treat it as your Bayesian state of information after you observe the actual outcomes (I know with certainty that outcome_that_occurred actually occurred). In this context, ptrue actually *is* known, after the fact. so all the terms where ptrue isn’t 1 become 0, and the sum becomes

sum(-log(pmodel[event=i](outcome_that_occurred)), i ranges across all the predicted events)

dividing by N just changes the scale to “per prediction” rather than the total over all the predictions seen so far. This makes it an “intensive property” that doesn’t automatically get bigger with a larger number of predictions.
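In code, the collapse from the full cross-entropy sum to the surprisal of the realized outcome looks like this (a toy 3-outcome example):

```python
import math

def cross_entropy_bits(p_true, p_model):
    # Terms where p_true is 0 contribute nothing, so skip them
    # (this also avoids log(0) when p_model has zeros there).
    return sum(pt * -math.log2(pm) for pt, pm in zip(p_true, p_model) if pt > 0)

p_model = [0.4, 0.25, 0.35]   # the market's prediction over outcomes A, B, C
one_hot = [0, 1, 0]           # B actually occurred: "true" probability 1

ce = cross_entropy_bits(one_hot, p_model)
surprisal = -math.log2(p_model[1])
assert ce == surprisal        # the whole sum collapses to one term
```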

As I say, the interpretation is that if the model is shared by both endpoints of a communication channel, and an algorithm for constructing an optimal code is also shared, then one end of the channel can tell the other end what *really* happened using a code, and it will send sum(-log2(pmodel[event=i](outcome_that_occurred))) more bits than it would have if the model had just gone ahead and predicted all the correct outcomes with certainty.

if all the predictions were with certainty, and correct, then you could have sent 0 bits, and that would have meant “yep, it all went according to plan”, instead you’ll send some other symbol and it’ll mean “here are some corrections to the predictions you would make from the model” but because the model has some information, you send somewhat more than 0 bits, but less than the full cost of sending explicitly all the results.

I’d have to look at some examples to see how your metric behaves vs cross-entropy. Your concerns about IID samples, probabilities being one or zero, etc do not seem relevant to me.

If you have three outcomes where the model predicts:

A: 0.4

B: 0.25

C: 0.35

And the final vote goes:

A: 0.42

B: 0.1

C: 0.48

The cross entropy is:

-(0.42*log(0.4) + 0.1*log(0.25) + 0.48*log(0.35))

= -(-.38 -.14 -0.5)

= 1.02

If our predictions were 100% accurate we get 0.95. If we predicted 99% chance of B winning and half a percent for A/C then it would be 4.8.

If instead the final vote was 100% for B, the cross entropies for the above predictions would be 2.3, 0, and 0.01.

The advantage of this metric is that it rewards confidently predicting very skewed outcomes more than near uniform ones.
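The arithmetic in that example checks out in a few lines of Python (natural logs; the exact value of the first case is 1.03, which comes out as 1.02 above because the intermediate terms were rounded):

```python
import math

def cross_entropy(p_true, p_model):
    """Cross entropy (in nats) between an outcome distribution and a model."""
    return -sum(pt * math.log(pm) for pt, pm in zip(p_true, p_model))

actual = [0.42, 0.10, 0.48]   # final vote shares
market = [0.40, 0.25, 0.35]   # the model's prediction

print(round(cross_entropy(actual, market), 2))                # 1.03
print(round(cross_entropy(actual, actual), 2))                # 0.95 (entropy floor)
print(round(cross_entropy(actual, [0.005, 0.99, 0.005]), 2))  # 4.77
```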

ugh, editing on my phone, comment went to wrong place… copying it here…

Aha, I see some sources of confusion, so in your example you are talking about predicting a continuous measure of vote percentage. But a Bayesian model doesn’t predict just one vote split, it puts probability over each possible vote split… so you would look at the probability it assigned to the actual 0.42,0.1,0.48 split that occurred.

using surprisal in a continuous case is bound to lead to larger surprisals because 3 numbers rounded to 2 decimal digits need 6 decimal digits total to encode. that’s something like 20 bits.

I thought we were talking about judging the skill of a prediction market:

A prediction market will return an approximate set of probabilities to you. Eg, these add to over 100% so just softmax them and compare that to the eventual outcome: https://www.predictit.org/markets/detail/3698/Who-will-win-the-2020-US-presidential-election

But there definitely is not a distribution.

So, prediction markets sell contracts on events that can be well confirmed. For example on “Brexit referendum succeeds”… The closing price of “Brexit referendum succeeds” might be something like .94 the day before the vote… Then Brexit actually does succeed, so the outcome is 1; if it had failed, it would be 0.

so then the surprisal for this prediction is -log2(.94)…. if Brexit had failed, the quantity you’d compute is -log2(1-.94), which was essentially the price for “fail”.

So there’s a probability distribution over the outcome but the outcome is a single bit… succeed, or fail.

Now if you get into a more complicated situation, like say contracts on vote splits… I have actually never heard of such contracts, but the contract would have to be something like “Trump gets more than .45 and less than or equal to .46 of the popular vote rounded to 2 digits” and “Trump gets more than .46 and less than or equal to .47 …”

Otherwise I think you’re thinking about situations like there are 3 candidates, A,B,C and the contracts are “A wins” and “B wins” and “C wins”…

the outcome isn’t “A gets .42, B gets 0.1, C gets 0.48” it’s “A wins”, the purpose of the contracts isn’t to accurately predict the vote splits… it’s to put a probability on which of the candidates wins.

So suppose A wins…. then the calculation is -log2(.42) since there was a prediction of a .42 chance that A would win… suppose instead B wins, you’ll do -log2(.1)

Sorry, I wrote that A wins when you gave .42,.1,.48 vote split, so clearly C wins… so if C wins and the ‘normalized price’ of the “C wins” contract was .35 then your surprisal is -log2(.35)

in any case, the point is when you have a contract on a definite outcome whose price is X and there are related contracts on alternative outcomes, only the outcome that actually occurs matters for the surprisal, so -log2(price_of_contract_that_occurred)

obviously you need to normalize the prices to add to 1 as sometimes they don’t quite add up.
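Putting the normalization and the surprisal together (a sketch; the contract names and prices are invented):

```python
import math

def market_surprisal_bits(prices, winner):
    """Surprisal, in bits, of the realized outcome given closing contract
    prices. Prices may not sum to exactly 1, so normalize them first."""
    total = sum(prices.values())
    return -math.log2(prices[winner] / total)

# Hypothetical three-way race: prices add to 1.03, so normalize, then take
# -log2 of the winning contract's implied probability.
print(market_surprisal_bits({"A": 0.41, "B": 0.12, "C": 0.50}, "C"))  # ≈ 1.04

# Binary Brexit-style contract priced at .94 for "succeed":
print(market_surprisal_bits({"succeed": 0.94, "fail": 0.06}, "succeed"))  # ≈ 0.09
```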

Well you are not stuck with your bet on PredictIt. You buy or sell shares that are worth either $1 or $0 each once the result is known, but the value fluctuates until then. So if you bought Hillary Clinton shares today for $0.05 each because you thought she would run (but not necessarily win), you could sell them later for $0.20 each when she announces her candidacy. However, once you get a few days before the event the share price should approximate the probability of winning. I am saying to use the state of the prediction market at the last point before the event is determined.

Yes, so am I, or rather it’s possible to assess the goodness of the market at any point in time, but let’s agree to assess it the closing prices day before an event…

Your calculation was:

predicted A,B,C = .4, .25, .35

actual A,B,C = .42, .1, .48

But assuming this is 3 contracts, one on “A wins”, one on “B wins” and one on “C wins”…. then the “actual” was

A,B,C = 0,0,1

So I don’t see the relevance of your calculation using the numbers A,B,C = .42, .1, .48

those are the vote fractions, and the contract wasn’t on “what will the vote fractions be” the contract was on “wins the election”

If you want to assess the quality with which a prediction market predicts the vote fractions, you will need a whole series of contracts on “possible vote fractions” rather than just contracts on “A wins, B wins, C wins”

I’m assuming that in a “good” prediction market the odds of winning should be approximately the actual proportion of votes, or frequency with which each outcome occurs, etc.

Also, you can use 0/1 encoding for the outcome but then cross entropy reduces to simply -log(prob(winner)). So your equation seems to be a special case (but I am not sure why the cross entropy formula uses ln and you are using log2).

I don’t see why the price of a contract for “C wins” should converge to say 0.55 when the vote fractions seem likely to wind up being say .25,.20,.55

If this is what it looks like, everyone knows C is going to win, the price of C should converge near 1 right?

If I had good evidence that the vote fractions would be .25, .20,.55 I’d buy the C contract at 0.55 ALL DAY LONG

Because the payout is proportional to how probable the event is according to the market. You win more money by betting on the underdog(s).

That’s my point though, how probable it is to *win the election* is different from how probable it is to *win a randomly chosen single vote*

You win the election presumably if you get more votes than any other candidate (ignore runoffs etc). So if the market thinks that the votes are likely to go close to .25,.20,.55 then there’s basically *no chance in hell* that C will lose.

It’s like in a 2-party system where the polls are showing Hillary will get 75% of the votes and Trump 25%: who thinks these vote totals will be so far off that the real thing will turn out to be, say, 45% Hillary and 55% Trump, an error of 30 percentage points?

Also, to bolster my point, Hillary contracts were trading up around 90% shortly before the election, but no one was predicting her to get 90% of the popular vote…

https://www.zerohedge.com/sites/default/files/images/user3303/imageroot/2016/11/01/20161102_GSELECTION4.jpg

> but I am not sure why the cross entropy formula uses ln and you are using log2

That just introduces a constant factor of log(2); it’s a completely irrelevant rescaling.

Let’s see:

http://www.lasvegassportsbetting.com/2016-US-Presidential-Election-Las-Vegas-Odds_P7056.html

That means payout for Clinton was 1/4.5, and payout for Trump was 3.25/1. So the implied probability given for a Clinton win was 1 – 1/5.5 = 81.8% and Trump was 1 – 3.25/4.25 = 23.5%. This is close to the predictit results for Nov 7th, which were $0.82 for Clinton and $0.22 for Trump.

Softmaxing those gives 64.1% and 35.8% respectively. The electoral college then voted 56.5% Trump and 42.2% Clinton, which obviously surprised a lot of people relying on the polls but those types of percentages don’t seem unusual: https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin

I guess the “no chance in hell” was the “other” category which paid out 10k/1, and got 7/538 = 1.3% of electoral college votes. But obviously those votes were meant as a political statement rather than an attempt to get the candidate elected.

I would like to see the payouts in a real three-way race to see if the underdogs scale appropriately or not.

Anon:

You write, “So the implied probability given for a Clinton win was 1 – 1/5.5 = 81.8% and Trump was 1 – 3.25/4.25 = 23.5%.”

But 81.8% + 23.5% is greater than 100%.

You gotta correct for the vig (and other biases inherent in betting markets).

I have another post pending but will respond in the meantime. But this 90% disagrees with the source I found (see the other post).

I’d think you should use the electoral college votes since those are what actually matter (and determine campaign strategies, etc). Six presidents were elected with 90%+ of the electoral college vote: https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin

I don’t know how cross entropy was derived but I assume the base of the logarithm is not arbitrary.

Andrew wrote:

Yes, I took the softmax:

Is that wrong?

Yes, that is surely incorrect. I don’t know what a softmax is, but if the probabilities with vig are 81 and 23, then obviously the fair odds are not 65 and 35. If the margin has been applied equally, the fair odds are just 81/(81+23) and 23/(81+23). However, at least in betting markets set by bookmakers (which this is not), there is usually a favorite-longshot bias, meaning that more margin is applied to longer odds. So in this case you might apply 5-6% vig to the 23 (the longshot) and only 1-2% to the 81 (I’m making these numbers up; obviously the overall vig needs to still be consistent with the overround of 104%). The language I’m using would be more natural if we were talking about a bookmaker setting odds, as opposed to an exchange, but the logic is the same.

But ya… not sure how 65 and 35 made it past your common sense filter.
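The difference between the two normalizations is easy to see numerically (a sketch using the 81/23 prices from above):

```python
import math

# Two ways to turn the quoted prices 0.81 and 0.23 (sum 1.04, i.e. a 4%
# overround) into probabilities.
prices = [0.81, 0.23]

# Proportional normalization just divides out the overround.
proportional = [p / sum(prices) for p in prices]
# ≈ [0.779, 0.221]

# Softmax exponentiates first, which badly distorts near-1 vs near-0 prices.
exp = [math.exp(p) for p in prices]
softmax = [e / sum(exp) for e in exp]
# ≈ [0.641, 0.359]
```

The proportional version lands on the 77.9%/22.1% figures discussed in the thread, while the softmax drags both prices toward 50/50.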

Ah.. sorry this is too good, after looking up what softmax is. So inappropriate for this use case. Anoneuoid, the great critic of science and ‘statistics-in-a-box’ approaches, gets caught button-mashing. Using a store-bought, canned algorithm to do his deed, without applying some good ol’ common sense to the problem.

What exactly is wrong with it? And why could you not give the correct answer?

What we know is that prices were 81 and 23. If the fair odds were then 65 and 35, this would indicate that those holding the 23 contracts would have an enormous positive expected value (.65/.23 – 1), while those holding the 81 shares would have enormous negative expected value. Now, of course it’s _possible_ that 65 and 35 are the fair odds (or the *true* probabilities, if you will) here, but that would imply that the market is massively inefficient to be that far off.

I think it’s easiest to understand vig if you think of the way a bookmaker could hypothetically set lines. First, he would estimate the fair odds according to some in-house model he has; let’s say in this case he gets 77.9% and 22.1%. Then, he knows he wants to add a 4% margin to this (to ensure his advantage). How does he decide to add this margin? He could just apply 4% to both probabilities, giving offered odds of 81% and 23%. Or, he could apply a 10% margin to the longshot and a 2.3% margin to the favourite, yielding offered odds of 24.3% and 79.7%. Or he could do any combination he desires — ultimately though they do need to add up to 104%.

However, all we see are the offered odds; we have no way of knowing how the bookmaker applied their margin. We also don’t know that the bookmaker knew the *true* probabilities to begin with. Therefore there is no way of knowing what the fair odds are given a set of odds that add up to more than 100%. But, like I said above, it’s unlikely that the odds are 65 and 35 in this case because that would indicate that the market is incredibly inefficient, as I could just buy the 23 shares and enjoy a massive (expected) profit. If people behaved like that then the share price would get pushed up.. etc, etc.. which is why it’s not plausible for a market with any sort of volume to be off by that much.

A data-driven approach to estimating fair odds from juiced odds would be the following: aggregate the odds from many markets (from the same provider) and remove the margin using some method — e.g. assuming it’s applied proportionally, so you just divide each offered probability by the overround (81%/(23%+81%) in the above example). Then, depending on how much data we have, we bin our estimated fair odds and compare them to the observed frequencies. If our estimated fair odds correspond to actual frequencies (e.g. events given 5% fair odds happen 5% of the time), then it seems like we’ve found a decent method for removing margin. Of course, “decent” will depend on how finely the bins are defined.

As said above, in many markets it appears that margin is not applied equally: bookmakers tend to apply more of it to longshots. Empirically, the simplest way to see this is to aggregate a bunch of odds from a bookmaker and compare the return from betting blindly on longshots with the return from betting blindly on favourites; you’ll find that the longshot return is lower (i.e. more negative).
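The proportional de-vigging step described above amounts to one line of arithmetic; a minimal sketch, using the thread’s hypothetical prices:

```python
def remove_margin(offered):
    """Divide each offered probability by the overround.

    Assumes the margin was applied proportionally, which, as noted
    above, is often not the case in practice (longshots usually
    carry more of it).
    """
    overround = sum(offered)  # e.g. 0.81 + 0.23 = 1.04
    return [p / overround for p in offered]

fair = remove_margin([0.81, 0.23])
print([round(p, 3) for p in fair])  # -> [0.779, 0.221]
```

Note this recovers 77.9/22.1 — the normalization several commenters arrive at — rather than the softmax’s 65/35.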

The 65-35 estimate you gave would be a VERY extreme form of a reverse favourite-longshot bias — so much so that it would actually be massively positive expected value to bet on the longshot. Again, this is just very implausible.

Is there an independent bookmaker here? It’s just a market where individuals place orders, which then either clear or don’t. There’s not perfect efficiency, but basically if you see 81 and 23 for contracts on “A wins” vs. “B wins,” the implied probabilities are more or less 81/(81+23) = 77.9% and 23/(81+23) = 22.1%.

If B wins, then your surprisal is -log2(.221). And yes, the base of the logarithm *is* arbitrary; it’s basically the difference between cm and inches — a choice of units.

No independent bookmaker, no. And even with a bookmaker, obviously they adjust their odds in response to money flowing in — or at least the ones that don’t restrict bettors do. But even with an exchange it’s not obvious how to back out fair odds. It also depends on the commission that the exchange charges; most apply this only to profit, so it affects bettor behaviour.

These are not entirely trivial concerns, but they are second-order effects. If you think about just the correction that I did — dividing by the sum of the two contracts — whatever the effect of the transaction-cost issues is, it should be a minor additional correction, provided there is sufficient liquidity. I would expect for a liquid contract the effect is maybe a percent or smaller.

So if we just start with the simple first-order model, calculate the percentages by normalizing by the sum, and then calculate the surprisal, we can see that it’s the same as the cross-entropy when a single event occurs and the outcome is either a zero or a one for each contract. Things become much more complex if the contracts are not mutually exclusive events, or if the contract pays out in some complex way — for example, a contract that pays out $1 for every goal scored by a particular soccer player.
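That equivalence is easy to check numerically; a sketch with the thread’s hypothetical prices, supposing the longshot (B) wins:

```python
import math

# Normalize the contract prices to implied probabilities.
prices = [0.81, 0.23]
probs = [p / sum(prices) for p in prices]  # ~[0.779, 0.221]

# One-hot outcome vector: the second outcome (B wins) occurs.
outcome = [0, 1]

# Surprisal of the realized outcome, in bits.
surprisal = -math.log2(probs[1])

# Cross-entropy between the one-hot outcome and the implied probabilities.
cross_entropy = -sum(o * math.log2(q) for o, q in zip(outcome, probs))

print(round(surprisal, 3), round(cross_entropy, 3))  # identical, ~2.177 bits
```

With a one-hot outcome, every term of the cross-entropy sum vanishes except the realized event’s, which is exactly the surprisal.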

Anoneuoid: what’s wrong with softmax…

Softmax solves the problem of mapping (-inf, inf) into [0,1]. It works OK for having something like a highly nonlinear neural network model learn a function whose output is in [0,1], because it allows the output of the model to live in an intermediate unconstrained space and then be constrained into [0,1]. The nonlinearity of the transform is unimportant because neural networks are flexible enough to produce essentially any function, and can therefore compensate for it during training.

This is a little like building a model on the logit scale and then doing inv_logit(model_output) to make sure you get something on the [0,1] scale. In the modern world of electronic digital computers, you never really need probit or other nonlinear link functions, because you can always just fit a nonlinear model in logit space; back in the days when you looked things up in tables in the back of the book, it was useful to have a couple of different kinds of link function.
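As a sketch of that idea (the model function here is an arbitrary illustrative choice, not anyone’s actual model):

```python
import math

def inv_logit(x):
    """Map an unconstrained real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# An arbitrary nonlinear model working on the unconstrained logit scale...
def model(x):
    return 2.0 * x - 0.5 * x**3

# ...whose output is squashed to a probability only at the very end.
prob = inv_logit(model(0.7))
print(round(prob, 3))  # some value strictly inside (0, 1)
```

The constraint to [0,1] is handled once, at the output; all the flexibility lives in the unconstrained space.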

But softmax doesn’t behave in a way that makes sense when mapping essentially probabilities plus some error to probabilities.

Consider that exp(0) = 1 and exp(1) = 2.718, so if you have two probabilities p = 1 and (1 − p) = 0 and you softmax them, you’ll wind up with .73 and .27 — dramatically inflating small probabilities and dramatically deflating large ones.

In order to get something like (1, 0) out, you need to put in something like (1, -10), and if you have a near-zero probability plus a little error, you’re never going to get anything close to -10.
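A quick sketch of that behaviour, contrasting softmax with the plain normalization used elsewhere in this thread:

```python
import math

def softmax(xs):
    """Exponentiate and normalize: treats inputs as unconstrained logits."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def normalize(xs):
    """Divide by the sum: treats inputs as (possibly juiced) probabilities."""
    total = sum(xs)
    return [x / total for x in xs]

# Softmax squashes a certain outcome toward the middle...
print([round(p, 2) for p in softmax([1.0, 0.0])])    # -> [0.73, 0.27]

# ...while plain normalization leaves it alone.
print(normalize([1.0, 0.0]))                          # -> [1.0, 0.0]

# Recovering near-certainty from softmax requires inputs that are far
# apart on an unconstrained logit-like scale, e.g. (1, -10):
print([round(p, 5) for p in softmax([1.0, -10.0])])   # -> [0.99998, 0.00002]
```

Which is the point: softmax expects logits, and contract prices are not logits.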

Thanks for the responses matt/daniel. I got busy but will take a look later, I don’t want you to think your efforts are wasted.

> Their reliability, the very source of their prestige, is causing them to fail.

I suppose people can trade derivatives of the predictions before the final outcome too, in which case the goal of a trader is not necessarily to predict the physical outcome but rather the market price itself at some given time — which in turn implies a different meaning of market efficiency.

options on futures on prediction contracts on available commodity quantities…

Is anyone familiar with any studies where prediction markets were used as part of the evaluation of a program or intervention, maybe educational or therapeutic? For example, by allowing stakeholders (program recipients, implementers, administrators, beneficiaries, etc.) to “buy stock” in various potential outcomes, then using market information to inform implementation or to predict outcomes. Seems like it could be a way to get beyond problems with self-reported data.

Reminds me of the paper by Grossman and Stiglitz, “On the Impossibility of Informationally Efficient Markets,” American Economic Review 70: 393–408. The wiki entry https://en.wikipedia.org/wiki/Grossman-Stiglitz_Paradox puts it succinctly: because information is costly, prices cannot perfectly reflect the information that is available, since if they did, those who spent resources to obtain it would receive no compensation — leading to the conclusion that an informationally efficient market is impossible.

In this context, if people trust the prediction market, they will not invest to acquire additional (outside) information, which means that the prediction market will likely be biased.

Presumably people with information can make money by trading, and afterwards the prices are closer to the informationally efficient level. In other words, market dynamics lead towards efficiency, but the closer you get, the less the compensation for further progress.

The question then becomes how close do we get?

I agree that people will not invest in outside info if they believe the market prices are good. But we sometimes see small biases and sometimes large ones. Clinton-to-win vs. Trump had moderately large and persistent biases. These biases seem to have been built into the polls used to assess the question, rather than coming from the feedback of trusting the market. Paying to acquire good unbiased info might have led to large payoffs, and yet the markets showed 80-90% prices for Clinton…

People were spending plenty and yet getting nowhere. I think this is fascinating because it wasn’t just about market inefficiency due to transaction costs or an unwillingness to acquire new information. It really was about GIGO: the new information from the numerous polls was bad information.