Julia Galef mentioned “meta-uncertainty,” and how to characterize the difference between a 50% credence about a coin flip coming up heads, vs. a 50% credence about something like advanced AI being invented this century.

I wrote: Yes, I’ve written about this probability thing. The way to distinguish these two scenarios is to embed each of them in a larger setting. The question is, how would each probability change as additional information becomes available. The coin flip is “random” to the extent that intermediate information is not available that would change the probability. Indeed, the flip becomes less “random” to the extent that it is flipped. In other settings such as the outcome of an uncertain sports competition, intermediate information could be available (for example, maybe some key participants are sick or injured) hence it makes sense to speak of “uncertainty” as well as randomness.

It’s an interesting example because people have sometimes considered this to be merely a question of “philosophy” or interpretation, but the distinction between different sources of uncertainty can in fact be encoded in the mathematics of conditional probability.

Can this be analogized to something like the concept of higher order derivatives?

i.e. The probability may be 50% for both. But the probability of that probability is very very high in one case but lower in the other?

We have much stronger structural reasons to believe the coin-flip-50% than the AI-50%?

The latter case sounds like Knightian Uncertainty. If something is “random” you might still know the probabilities of all relevant outcomes. With Knightian uncertainty, all bets are off.

Jacob:

The point of my post is that so-called Knightian uncertainty can be modeled using the framework of probability theory.

Knightian uncertainty is distinct from what you describe. Say I give you an urn, and I tell you that it is filled with 100 balls. I tell you that some of them are red, and some of them are blue. I pick a ball from the urn. How likely is it to be red?

Maybe you (arbitrarily) choose a prior of 50%. The first ball turns out to be red. I pick another ball. How likely is this ball to be red? Maybe you make some (arbitrary) assumptions on various probabilities, or you model in some (arbitrary) way the process by which I filled the urn. Depending on what choices you make, your estimate might shift to 5%, it might shift to 50%, it might shift to 95%. This is Knightian uncertainty, and it’s independent of how much your posterior moves around. I wish it were so simple to embed this kind of uncertainty in a standard probability framework but it really is much more complicated and subtle.
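A rough sketch of how much the answer depends on the modeling choices. The three priors over the number of red balls below are illustrative assumptions, not anything specified in the comment above, and the draws are assumed to be without replacement. Each prior gives 50% for the first draw, yet they disagree about the second draw once a red ball has been seen.

```python
def prob_first_red(prior, n=100):
    """P(first ball is red), where prior maps K (number of red balls) to a weight."""
    total = sum(prior.values())
    return sum(w * k / n for k, w in prior.items()) / total


def prob_second_red_given_first_red(prior, n=100):
    """P(second ball red | first ball red), drawing without replacement."""
    # P(first red | K) = K/n and P(both red | K) = K(K-1) / (n(n-1)).
    # The prior's normalizing constant cancels in the ratio.
    p_first = sum(w * k / n for k, w in prior.items())
    p_both = sum(w * k * (k - 1) / (n * (n - 1)) for k, w in prior.items())
    return p_both / p_first


# Hypothetical priors, all symmetric around 50 red balls out of 100.
uniform = {k: 1 for k in range(1, 100)}   # any mix from 1 to 99 red, equally weighted
exactly_half = {50: 1}                    # the urn is known to be split 50/50
lopsided = {5: 1, 95: 1}                  # almost all blue or almost all red

for name, prior in [("uniform", uniform), ("exactly_half", exactly_half), ("lopsided", lopsided)]:
    print(name,
          round(prob_first_red(prior), 3),
          round(prob_second_red_given_first_red(prior), 3))
```

Under these assumptions the second-draw probability is about 0.66, 0.49, and 0.90 respectively, even though all three start at 0.5.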

what’s wrong with modeling uncertainty on the prior, like a beta binomial?

There is a very general divide between how people react to “uncertainty” of a distribution. Roughly speaking there are two strategies:

(1) Put a distribution on the distribution which hopefully reflects the uncertainty in the distribution. (doing a sensitivity analysis on the prior, for example, is effectively an example of this).

or

(2) Spreading the distribution out to the point to where in some sense it’s maximally uncertain.

The first is the more common approach because it jibes more with frequentist intuition, which even most so-called “Bayesians” use internally. The latter is a more purely Bayesian approach and reflects a correct understanding of how probability distributions model uncertainty. It’s less popular because few people really get Bayesian probabilities.

If that’s all there was to it, I’d recommend that people first understand Bayesians better and then use (2). Unfortunately there’s a subtle issue which results in a perfect storm of confusion on this topic. Much of the time people talk about a “probability distribution” they actually mean a “frequency distribution”. Instead of writing the “frequency distribution” f_1,…,f_n as they should, they write p_1,…,p_n and call it a probability distribution.

In cases like that the frequency distribution f_1,…,f_n should just be thought of as a physical observable outcome no different than any other measurable quantity. So if there’s uncertainty in it, you should model this with a probability distribution P(f_1,…,f_n). This is actually using approach (2) above in essence; however, the vast majority of statisticians will interpret it to be case (1)!

Like I said, it’s the perfect recipe for endless confusion. It’s another example of the insidious way Frequentism holds the subject back.

Could you give a concrete example of a problem, the common approach (1) and the more purely Bayesian approach (2) ?

Well, I don’t know about concrete, but I can sketch something. Suppose we want a distribution for your height. The statistician looking at this thinks that both N(6ft,.3ft) and N(5.8ft, .3ft) among several other possibilities are in some sense reasonable.

Method 1: Put a distribution on the space of distributions which matches your sense of what’s reasonable. In practice this would be a lot of work to do, so as an approximate shortcut, do the analysis with your favorite distribution N(6ft, .3ft) and then see how sensitive your answer is to other “reasonable” possibilities such as N(5.8ft, .3ft).

Method 2: Think about what it means to say both N(6ft, .3ft) and N(5.8ft, .3ft) are “reasonable” possibilities. What this means is that both 6.3ft and 5.5ft are reasonable potential values for your height even though neither is in the high probability region of both distributions. So why not just use a single distribution which gives some probability mass to every reasonable potential value?

In other words, choose a distribution which is “maximally uncertain” in the sense that it spreads the distribution out to cover all reasonable possibilities. In this example something like N(5.9ft,.4ft) might work.

Notice that N(5.9ft, .4ft) has higher entropy (i.e. is more spread out) than either N(6ft, .3ft) or N(5.8ft, .3ft) separately. Keep developing this idea along this line and you can dream up several quantitative ways to use the entropy expression to “combine” potential candidate distributions into a more uncertain (higher entropy) final distribution which in a way considers all potential values for your height which need to be considered.
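As a quick check of the entropy claim, here is a minimal sketch using the standard differential entropy of a normal distribution, 0.5 * ln(2*pi*e*sigma^2). The function name is made up for illustration. Note that the mean does not enter, so N(6ft, .3ft) and N(5.8ft, .3ft) have the same entropy.

```python
import math


def normal_entropy(sigma):
    """Differential entropy (in nats) of a normal distribution with standard deviation sigma."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


narrow = normal_entropy(0.3)  # applies to both N(6, .3) and N(5.8, .3)
wide = normal_entropy(0.4)    # N(5.9, .4)

# The gap depends only on the ratio of the standard deviations: ln(0.4 / 0.3).
print(round(wide - narrow, 4))
```

The wider distribution comes out about 0.29 nats higher, which is the sense in which it is "more spread out".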

You can call the resulting methods “maximum entropy” something or other.

Thanks for the example, but I don’t find it very illuminating. My height is well defined. I assume you want a distribution to reflect your knowledge about my height. It’s not clear to me why would you need a distribution of distributions in that case.

In any case, in the end what you do is what I understood by (1) “put a distribution on the distribution”. You use a distribution of the form N(mu,0.3), with mu distributed as N(5.9,0.26), which is the same as a distribution of the form N(5.9,0.4). This is not a maximum entropy distribution covering all reasonable possibilities, it’s a maximum entropy distribution with mean 5.9 and standard deviation 0.4.
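The equivalence claimed here follows from the fact that variances add when a normal mean is itself normally distributed. A small sketch, with a seeded simulation as a sanity check (the function name and sample size are arbitrary choices for illustration):

```python
import math
import random
import statistics

# Standard deviations add in quadrature: sqrt(0.3^2 + 0.26^2).
marginal_sd = math.hypot(0.3, 0.26)
print(round(marginal_sd, 3))  # close to 0.4


def simulate_heights(n, seed=0):
    """Draw mu ~ N(5.9, 0.26), then height ~ N(mu, 0.3), n times."""
    rng = random.Random(seed)
    return [rng.gauss(rng.gauss(5.9, 0.26), 0.3) for _ in range(n)]


draws = simulate_heights(100_000)
print(round(statistics.mean(draws), 2), round(statistics.pstdev(draws), 2))
```

The simulated mean and standard deviation should land near 5.9 and 0.4, matching the single normal N(5.9, 0.4).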

Carlos,

by “probability distribution” I only ever mean “a distribution to reflect your knowledge about ..”

“It’s not clear to me why would you need a distribution of distributions in that case.”

I don’t think you do, but that is effectively what people do quite often. It’s an often heard complaint that there seems to be more than one reasonable prior, so how do you choose? In some cases things are simple enough to explicitly consider a “probability of a probability”, but in most cases it would be too much trouble. As an approximate shortcut to this, they do a sensitivity analysis on the prior by varying it over “reasonable” choices and observing how that affects their inference.

“This is not a maximum entropy distribution covering all reasonable possibilities”

I didn’t say it was, I said it had *higher* entropy than the original two. I wasn’t giving a method for combining two “reasonable” distributions into a single better one, rather I was giving a hint of how you could create or motivate such a method.

It seems that we agree, then. If your prior distribution is too restrictive given your prior knowledge you have to widen your prior distribution. Putting a distribution on a distribution can also be a good idea in some cases. In your example, maybe the best distribution you can think of for my height is N(5.8,0.3) if I’m a man and N(5.3,0.3) if I’m a woman. A mixture of normals will be a better distribution than a wider normal. And the male/female probability you use for the mixture can be seen as the average of your prior distribution for that parameter: it might be more or less concentrated (this affects how easily you will change your estimate if you get additional data).

I think that how we model it depends on our goals. If we’re “only” statisticians, and we’re trying to characterize a joint distribution, we can simplify a bit, and we don’t need “second order probabilities,” in the terminology of Pearl’s discussions from the 90s – but we need the correct bayesian net in order to update given evidence. If we’re decision theorists, we want to be clear on the distinction between how we can reduce uncertainty, as follows.

If I have a coin of unknown bias, I have epistemological uncertainty about the bias of the coin – this uncertainty can be reduced by getting data about its behavior. This information will not reduce the “aleatory” uncertainty that exists in the outcome of a future flip. Once it is flipped, if I don’t know the outcome, I can reduce my uncertainty by observing the flip, which will reduce it to near-0, making it an observed quantity. (There is still some measurement error – you may record it incorrectly!)

It also lets you do value of information and decision analysis correctly. You can decide how much it’s worth to observe 100 flips, if you will be able to bet on the outcome afterwards. This is simple – but if you have modeled this correctly, you can also differentiate between the case where the coin’s owner offers you a bet before the coin flips (which may give you information about the bias of the coin,) versus where they offer you a bet after it lands (which should make you think they know the answer, in which case you can reject the bet.) This is obvious, but if you’re not careful with your model, it may not be able to support this distinction.

The question remains about whether there is any “real” aleatory uncertainty, but I’d agree with Dr. Gelman that probability theory (+ decision theory, which gets ignored too often) makes this question moot, as I explain a bit here: http://lesswrong.com/r/discussion/lw/n7o/mapterritoryuncertaintyrandomness_but_that_doesnt/

Randomness is variation inherent in the data. Uncertainty is measurement error.

(Fire at will!)

I’d say randomness is variation inherent in the model and uncertainty refers to the model itself. The following is a quote from Jaynes, chapter 18 (which has been cited in another comment), see also the charts at: http://i.stack.imgur.com/bhEQd.jpg

Suppose you have a penny and you are allowed to examine it carefully, and convince yourself that it is an honest coin; i.e. accurately round, with head and tail, and a center of gravity where it ought to be. Then you’re asked to assign a probability that this coin will come up heads on the first toss. I’m sure you’ll say 1/2. Now, suppose you are asked to assign a probability to the proposition that there was once life on Mars. Well, I don’t know what your opinion is there, but on the basis of all the things that I have read on the subject, I would again say about 1/2 for the probability. But, even though I have assigned the same ‘external’ probabilities to them, I have a very different ‘internal’ state of knowledge about those propositions.

To see this, imagine the effect of getting new information. Suppose we tossed the coin five times and it comes up tails every time. You ask me what’s my probability for heads on the next throw; I’ll still say 1/2. But if you tell me one more fact about Mars, I’m ready to change my probability assignment completely. There is something which makes my state of belief very stable in the case of the penny, but very unstable in the case of Mars.
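One way to make the stable-versus-unstable contrast concrete is with Beta priors on the probability of heads. This is only a sketch: the Beta(1000, 1000) and Beta(1, 1) priors are illustrative assumptions, not Jaynes's own construction, and the second is just a stand-in for a weakly informed state of knowledge. Both start at a predictive probability of 1/2.

```python
def prob_heads_next(a, b, heads, tails):
    """Posterior predictive P(heads) under a Beta(a, b) prior after the observed flips."""
    return (a + heads) / (a + b + heads + tails)


# Before any data, both states of knowledge say 1/2.
examined_penny_prior = prob_heads_next(1000, 1000, 0, 0)
vague_prior = prob_heads_next(1, 1, 0, 0)

# After five tails in a row, as in the quoted passage.
examined_penny_after = prob_heads_next(1000, 1000, 0, 5)  # 1000/2005
vague_after = prob_heads_next(1, 1, 0, 5)                 # 1/7

print(examined_penny_prior, vague_prior)
print(round(examined_penny_after, 4), round(vague_after, 4))
```

With these assumed priors the concentrated one barely moves (to roughly 0.499), while the diffuse one drops to roughly 0.143.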

Andrew, I like the “as information becomes more available” heuristic, but this depends on the type of analysis (i.e., model or information accumulation procedure) being performed. If we consider “randomness” and “uncertainty” to be a partition of the sources of variation we see in data, uncertainty is variation that the investigator is _willing_ to specify some structure for (i.e., model), while “randomness” is variation that the investigator is unwilling to model. For example, L. Mahadevan at SEAS has an interesting talk on the physical dynamics of coin flipping; because he’s willing to model the physics, there is information that can change the coin flip probability. So what is randomness and what is uncertainty (unless we’re dealing with true quantum systems) is often determined by the investigator.

“The coin flip is “random” to the extent that intermediate information is not available that would change the probability”

What if you learn that the coin belonged to a famous conman who always bet it would come up heads?

Deja-vu http://statmodeling.stat.columbia.edu/2005/09/01/p12_or_ep12/

I was waiting to see if someone would point that out.

I use these terms as follows: randomness is something that a human observer attributes to the outside world, uncertainty is located in the observer.

I think that it is important to understand that randomness, although located in the outside world by human definition, is still a human construct, as is the concept of probability.

The coin flip is random to the extent that the observer prefers modelling it as a random process instead of trying to analyse it as deterministic. Indeed this preference can be altered if more information comes in, but just flipping the coin more often is not the kind of information that I’d have in mind here, rather information about the exact physical process that is going on for a specific flip.

Having seen 51 heads in 100 flips gives me a clearer idea about what to expect from the coin indeed, but it doesn’t change in the least my perception of it as a random process.

Note that by “my perception of it as a random process” I do not mean that I believe that this is indeed “objectively” a random process, but rather that I decide that I think of it as one, ignoring for example the specific way the coin is flipped next time as opposed to last time (as long as my experience is that taking into account what I see of it doesn’t help me to predict the next outcome).

So I’d keep up the distinction between randomness and uncertainty, and I think that it is clearer to use the concept of probability explicitly for either one or the other (both being legitimate) rather than for something in which both are mixed up.

This despite agreeing that the coin flip could be a case for them both.

There is no formal distinction in probability between different sources of uncertainty; they’re all (conditional) probabilities at the end of the day.

The idea that physical randomness can be characterized as an insensitivity (“resiliency”) to conditionalization is explored in a classic (in philosophy circles) article by Skyrms: http://fitelson.org/269/Skyrms_RPACN.pdf

Yes, patterns that are thought of as “random” are in some sense very insensitive to assumptions. Once you realize this then the phenomenon of randomness disappears and is no longer a kind of physical force as most statisticians (and all Frequentists) think of it.

For example, suppose there are 1 million possibilities for something. 999,999 of those possibilities lead to patterns we interpret as “random”, while just 1 leads to a pattern we interpret as “ordered”. Then it’s no surprise when we observe “randomness”. In fact, “randomness” is what we expect to see even if we vary the underlying physical mechanism by huge amounts. The “random” outcome is incredibly insensitive to what’s actually physically happening. That’s why a series of real coin flips can be simulated by a deterministic digital algorithm! Physically the two are completely different but the “random” outcome is incredibly insensitive to that fact.

You could turn this around and note that if you observe a “random” pattern, that tells you almost nothing about the underlying physical cause. There are just too many wildly different “causes” which lead to the same pattern.

This example is a good illustration of information theory. Using Boltzmann’s and Shannon’s insights into entropy and information, we’ll define the entropy (information) to be S = ln W, where W is the size of the space involved.

So initially, we have S_1 = ln 10^6 ~ 13.815510. If you observe “randomness” however this reduces to S_2 = ln 999999 ~ 13.815509. Thus the amount of information gained by this observation is S_1-S_2 = .000001 and is incredibly small as is intuitive.

On the other hand, if you observe an “ordered” outcome then the entropy is S_2 = ln(1) = 0, so that the amount of information learned is the maximum possible S_1-S_2 = 13.815510. Which is also very intuitive.
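The two calculations above can be reproduced in a few lines. This is a sketch using S = ln W as defined in the comment; the helper name is made up, and the difference S_1 - S_2 is computed as a single logarithm of the ratio to avoid losing precision when subtracting two nearly equal numbers.

```python
import math


def information_gained(w_before, w_after):
    """S_1 - S_2 = ln(W_before) - ln(W_after), computed as ln(W_before / W_after)."""
    return math.log(w_before / w_after)


W = 10 ** 6

gain_if_random = information_gained(W, W - 1)  # 999,999 "random" possibilities remain
gain_if_ordered = information_gained(W, 1)     # only the single "ordered" possibility remains

print(f"{gain_if_random:.7f}")   # about one millionth of a nat
print(f"{gain_if_ordered:.4f}")  # about 13.8155 nats, i.e. all of ln(10^6)
```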

Also McElreath in his new book Statistical Rethinking has a near perfect (elementary) statement of the meaning of probabilities:

In modest terms, Bayesian inference is no more than counting the numbers of ways things can happen, according to our assumptions. Things that can happen more ways are more plausible. And since probability theory is just a calculus for counting, this means that we can use probability theory as a general way to represent plausibility.

The only thing I would add at this elementary level is that another way of saying “plausible” is “insensitive”. Normally, if a Bayesian puts a uniform distribution on the space of coin flip sequences they would say,

“there’s a very high probability the frequency of heads is near .5”

but an equivalent way of saying this is,

“if you vary the outcome among the set of all possible outcomes, then the most robust (insensitive) claim I can make about the frequency is that it’s near .5”

My initial take on this is that Cox/Jaynesian probability theory actually doesn’t make any distinction. Probability is a measure of how much credence to put on a given outcome conditional on a state of knowledge. If that state of knowledge is that you’re using Persi Diaconis’ perfect coin flipper machine ( http://statweb.stanford.edu/~susan/papers/headswithJ.pdf ) then you put one set of probabilities, and if your state of knowledge is that the coin flip is high into the air and will land on a concrete surface… then you put another set of probabilities.

Basically, outside of pure mathematics, there is no such thing as “randomness”. :-)

(note: since no-one who understands Cox/Jaynes probability has really figured out quantum mechanics exactly as far as I know, I’m going to exclude the issue of quantum “probability”)

Daniel,

This is a far simpler topic than everyone makes it. The first step towards clarity, as always, is to carefully distinguish between probability and frequency. Take for example the quote

“50% credence about a coin flip coming up heads”. That has two quite distinct meanings. It could mean:

(1) The most likely frequency is .5

or

(2) The marginal distributions for each flip are all equal and satisfying P(H)=.5

For almost everyone, including most of the so-called Bayesians here, these are identical statements. Yet in reality, they’re very different. In particular while (2) implies (1), it’s definitely not the case that (1) implies (2). That alone is enough to prove they’re very different. I won’t bother trolling everyone by trying to explain the details to the clowns in the statistics community, but without that understanding people are just wasting their time.
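The comment does not spell out why (1) fails to imply (2), so here is one hypothetical illustration, not the commenter's own argument. Two joint distributions over four flips are compared: a uniform distribution over all sequences, and a distribution that puts all its mass on the alternating sequence HTHT. In both, the most likely frequency of heads is .5, but only the first has P(H)=.5 for every flip.

```python
from collections import defaultdict
from itertools import product


def most_likely_frequency(joint):
    """Frequency of heads that carries the most probability under the joint distribution."""
    freq_prob = defaultdict(float)
    for seq, p in joint.items():
        freq_prob[seq.count("H") / len(seq)] += p
    return max(freq_prob, key=freq_prob.get)


def marginals(joint):
    """P(H) for each flip position, marginalizing over the other flips."""
    n = len(next(iter(joint)))
    return [sum(p for seq, p in joint.items() if seq[i] == "H") for i in range(n)]


n = 4
uniform = {"".join(s): 1 / 2 ** n for s in product("HT", repeat=n)}
alternating = {"HTHT": 1.0}  # a fully deterministic sequence

print(most_likely_frequency(uniform), marginals(uniform))
print(most_likely_frequency(alternating), marginals(alternating))
```

For the alternating case the marginals are 1, 0, 1, 0, so statement (1) holds while statement (2) does not.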

One thing which isn’t a waste of time is Jaynes’s mysterious chapter 18, which everyone ignores, but is 10 months pregnant with possibilities. It also talks directly about the topic of the post. If you don’t want to buy the book you can find a pdf of it here:

http://www.bayesianphilosophy.com/the-ap-distribution/

By the way Daniel, in that chapter, Jaynes uses what he calls the “Ap distribution” to both illuminate the topic of this post but also to examine Laplace’s (no relation) rule of succession. If you understand what Jaynes is really getting at though, you can use the “Ap” distribution in quite practical problems. I had a lot of fun with this a few years ago, but don’t intend to write any of it up because frankly, statisticians would never get it.

For my take on “uncertainty,” “probability,” and “random”, see Parts I, II, and III of http://www.ma.utexas.edu/users/mks/CommonMistakes2015/CommonMistakesDay1_2015.pdf

Who is going to end up Prime Minister of Canada and who will end up managing a strip club – red or blue corner? – https://www.youtube.com/watch?v=XuSpZ3_5pTc

Hi, I’m interested in statistics and probability just as a pastime hobby, but I fail to see the difficulty in here. Can’t we simply use hierarchical Bayes:

A = Flip of a fair coin

B = Sports competition that we have no knowledge of

A ~ Bern(p1), p1 ~ Degenerate(.5) (or p1 ~ Beta(x, x), x being an enormous number)

B ~ Bern(p2), p2 ~ Beta(.5, .5) (Jeffreys prior of complete ignorance)

I feel I’m missing something here but honestly I have no idea what, can someone please explain?

Volkan: If you just want to write down some kind of model for the two situations, this is fine. But what does it mean and what do you get from it? How is it justified that your model in part B assigns very specific probabilities to specific outcomes? Why Jeffreys’ prior and not uniform? (I know there are reasons for using Jeffreys, but there are reasons against it as well.) What do you intend to do with the probabilities you get from B, and should this have an impact on how you choose them? (You’re not going to bet your money on sports competitions you have no knowledge of, are you?)

A has an interpretation in terms of expected frequencies, B hasn’t. I think that’s quite a difference.

To me, uncertainty applies when there is some unknown true value. We reflect our uncertainty about that true value using probability. Using an example from a previous comment, there either is or is not life on Mars, and we can reflect our uncertainty using probability.

In contrast, randomness applies when there is a process that is inherently stochastic. We also reflect randomness using probability. For example, even if we are totally certain that the Pr(Coin Lands Heads) = 0.5, the flipping of a coin is stochastic, so we cannot be certain what the next realization of a flip will be, i.e., it is random. (We can quibble about whether coin-flipping is actually a stochastic phenomenon.)

Moreover, we can also be uncertain about Pr(Coin Lands Heads) = pi, and characterize our uncertainty about pi using a probability distribution. In this case, there is uncertainty about a stochastic process.

I think the term random/randomness is not very precise. To the lay-person, it seems to imply that a phenomenon is totally uncertain or unpredictable. However, we can be very certain about the properties of a random/stochastic phenomenon. I think stochastic may be better.

Backing in to aleatory versus epistemological chance? Or is there a difference here?

Yes, there is a difference, but it’s mostly terminology. Here’s a writeup where I tried to disambiguate it a bit: http://lesswrong.com/r/discussion/lw/n7o/mapterritoryuncertaintyrandomness_but_that_doesnt/

The discussion of uncertainty and randomness has thus far omitted a third closely related concept: immateriality.

For example, it is generally reasonable to assume that the outcome of an individual coin toss is uncertain. Traditionally, this type of uncertainty is modeled by regarding each observed outcome as a sample that has been drawn at random from an associated probability distribution. In my new book Rethinking Randomness (website is http://www.RethinkingRandomness.com), I refer to this nearly universal assumption as the sampling premise.

The sampling premise provides the bedrock for almost all analyses of uncertainty. It supports a variety of powerful assumptions regarding the form of the distributions that regulate uncertain behavior, and it is ingrained in the thinking of almost everyone who has taken an introductory course on probability.

Rethinking Randomness develops an alternative approach to the analysis of uncertain behavior. The new approach does not depend in any way on the sampling premise. Its origins lie instead in the fact that practitioners typically regard the observed values of uncertain quantities as immaterial details. This link between uncertainty and immateriality provides a springboard for the analysis that follows.

The next step is to note that most probabilistic models are validated by observing a large number of individual cases/events and then associating probabilities with computable proportions. Even though individual details are directly observable, the validation process is concerned primarily with proportions computed over entire populations or sub-populations: for example, the proportion of time a walker spends at a particular station during a random walk. The direction the walker heads on any individual turn is observable, but is an immaterial detail. There is no need to characterize this uncertain quantity as a sample drawn at random from an associated probability distribution.

The biggest challenge in developing the new framework is to introduce an alternative class of assumptions that provide enough mathematical structure to solve the problems being analyzed … and are likely to be satisfied, at least approximately, in practice. I refer to this new class of assumptions as loose constraints. There’s more information about all this in my new book and in the extensive set of excerpts I’ve posted on the companion website.