“Why should anyone believe that? Why does it make sense to model a series of astronomical events as though they were spins of a roulette wheel in Vegas?”

Deborah Mayo points us to a post by Stephen Senn discussing various aspects of induction and statistics, including the famous example of estimating the probability the sun will rise tomorrow. Senn correctly slams a journalistic account of the math problem:

The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child’s degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise.

[The above quote is not by Senn; it’s a quote of something he disagrees with!]
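
The marble bookkeeping in the quoted passage is just Laplace's rule of succession: after n sunrises the bag holds n+1 white marbles and 1 black one, so the predictive probability is (n+1)/(n+2). A minimal sketch of the scheme (illustrative only; the function name is made up):

```python
def p_white(white, black):
    """The child's degree of belief: chance of drawing a white marble."""
    return white / (white + black)

white, black = 1, 1                    # prior: one white, one black marble
print(p_white(white, black))           # 0.5 before the first sunrise
for day in range(1, 4):                # three sunrises
    white += 1                         # each sunrise adds a white marble
    print(day, p_white(white, black))  # 2/3, 3/4, 4/5, ...
```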

Canonical and wrong. X and I discuss this problem in section 3 of our article on the history of anti-Bayesianism (see also rejoinder to discussion here). We write:

The big, big problem with the Pr(sunrise tomorrow | sunrise in the past) argument is not in the prior but in the likelihood, which assumes a constant probability and independent events. Why should anyone believe that? Why does it make sense to model a series of astronomical events as though they were spins of a roulette wheel in Vegas? Why does stationarity apply to this series? That’s not frequentist, it is not Bayesian, it’s just dumb. Or, to put it more charitably, it is a plain vanilla default model that we should use only if we are ready to abandon it on the slightest pretext.

Strain at the gnat that is the prior and swallow the ungainly camel that is the iid likelihood. Senn’s discussion is good in that he keeps his eye on the ball without getting distracted.
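
To see how slight a pretext should suffice, here is a quick illustrative sketch (ours, not from the article): apply that same plain vanilla model to a series governed by a deterministic seasonal pattern. The iid likelihood is at its most confident exactly when the pattern is about to break.

```python
# Rule-of-succession predictive probability under the iid Bernoulli model
# with a uniform prior: (successes + 1) / (trials + 2).
def predict(successes, trials):
    return (successes + 1) / (trials + 2)

# A hypothetical series: 40 "sunrises" followed by a polar night.
series = [1] * 40 + [0] * 10

for t in range(39, 42):
    p = predict(sum(series[:t]), t)
    print(f"day {t}: P(rise) = {p:.3f}, actual = {series[t]}")
# day 40: P(rise) = 0.976, actual = 0 -- the model is most confident
# exactly when the deterministic pattern changes, and, being iid,
# it has no way to anticipate the change.
```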

46 thoughts on ““Why should anyone believe that? Why does it make sense to model a series of astronomical events as though they were spins of a roulette wheel in Vegas?””

  1. “But this number [the probability of the sun coming up tomorrow] is far greater for him who, seeing in the totality of phenomena the principle regulating the days and seasons, realizes that nothing at the present moment can arrest the course of it.”

    — Pierre-Simon Laplace

    • The crux is that it is very easy to criticize the precocious newborn’s reasoning. But what model should a smarter newborn use in Andrew’s world? That’s the interesting question.

      There’s a reason the story chooses a precocious newborn. It doesn’t have access to lots of other information about what it is trying to predict.

      • Okay, so suppose our precocious newborn dismisses celestial mechanics and constrains herself to statistical analysis. She has two hypotheses, A and B. Using the marble-in-the-bag approach, the black marble hypothesis (B) always turns out to be wrong. Always. Conversely, the white marble hypothesis always turns out to be right. Always. At some point a truly precocious newborn – perhaps now a toddler – would realize that the marble-in-the-bag decision approach is overestimating the probability of B and revise her methodology.

        • What do you mean by “hypothesis (B) always turns out to be wrong”? Always wrong in many multiverses? Or wrong every morning? If the latter, then are you proposing that one should always discard models with really low probability?

        • > If the latter, then are you proposing that one should always discard models with really low probability?

          The latter. I only live in this universe, in this solar system, on this planet. I’m saying that if every time you decide B you’re wrong then you better update your priors. (In practice, I discard some low probability models as irrelevant and assume the risk associated with the ridiculously low probability that I’m wrong.)

        • @Chris G.

          Imagine one such precocious infant that was born in Svalbard, Norway on the 1st day of October. After 40 successive, boring sunrises & sunsets he accumulates 40 white marbles and retains his one lonely black marble.

          And then on the 11th of November? The black marble isn’t as useless as it sometimes seems?
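
          (Counting the initial white marble, that is 41 white against 1 black, so the bag assigns probability 41/42, about 0.98, to a sunrise on the very morning the polar night begins.)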

        • @Rahul

          Touché.

          (Damn, foiled by my 42nd-parallel-centric view of the world!)

          That raises the question, though: does our newborn notice the changing length of the day? Does she notice that the rate of change is not constant? Does she attempt to extrapolate based on limited information? How many years of observation before she hypothesizes seasons? Questions aside, you want to use the right tool for the job. It should become apparent fairly quickly that there’s a predictable pattern to the sunrise and sunset. Maybe one starts out agnostic as to whether a statistical or deterministic model is appropriate, but observations should push you (hard) towards determinism.

        • @Chris

          Sure, there are various levels of sophistication in a model. Besides, we have the luxury of hindsight.

          But I still don’t see why Andrew would ridicule our poor newborn’s model as “dumb”. Building predictive models when very sparse information is available is challenging yet necessary.

  2. But … a precocious newborn wouldn’t know anything about “a series of astronomical events”, or that they might be driven by some regular laws. (I’m not exactly sure how a newborn would know to put marbles in a bag, or how that would relate to probability, but I suppose we can let that pass.)

    Seems to me that, with this degree of ignorance, the only thing to be done *is* to model events as if they were spins at Vegas, then later to see if that seems to agree with what else has been learned. Of course, that’s not what the ancients did in astronomy. They imposed philosophical (and religious) ideas that had little basis in fact, and then found ways to make the data fit.

    • Tom: “Of course, that’s not what the ancients did in astronomy. They imposed philosophical (and religious) ideas that had little basis in fact, and then found ways to make the data fit.”

      Ha! Sounds like a fitting description of modern-day social science…

  3. Just a little provocation:

    The definition for the conditional probability is similar to the definition of division: let x be the number such that

    P(A & B) = P(B) * x

    As 0 <= P(A & B) <= P(B) <= 1, this value x is a well-defined number between 0 and 1, provided that the probability of B is strictly greater than zero. We can understand this number x as a value of a function of the events A and B, x = f(A,B), since this number varies with the events A and B.

    It is possible to show that this function f(A,B) is a probability measure over the first argument, when B is fixed: that is f(.,B) is a probability measure: f(empty, B) = 0, f(Omega, B) = 1 and if C and D are disjoint sets, then f(C or D, B) = f(C,B) + f(D,B). Define P(A|B) = f(A,B).
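
    (The verification is direct when P(B) > 0: f(Omega, B) = P(Omega & B)/P(B) = P(B)/P(B) = 1; f(empty, B) = P(empty & B)/P(B) = 0; and if C and D are disjoint, then C & B and D & B are disjoint too, so f(C or D, B) = [P(C & B) + P(D & B)]/P(B) = f(C,B) + f(D,B).)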

    That is, the interpretation for P(A|B) as "the probability of A given that the event B has occurred" seems to be fictional, since P(A|B) is just a number such that P(A & B) = P(B)*P(A|B).

    All the best,
    Alexandre Patriota

    • Alexandre:

      Everyone agrees that P(A,B) = P(A|B)*P(B).

      The question is, what comes first?

      In traditional probability textbooks, P(A,B) is defined first, then P(A|B) is defined as P(A,B)/P(B), and the conditional probability is only defined if the joint probability exists.

      But it is completely consistent with the mathematics to first define P(A|B) and then to define P(A,B) as P(A|B)*P(B). In that case, the conditional probability can exist even if P(B) is not defined and even when there is no joint probability. (For example, a sampling model p(y|theta) is a collection of distributions indexed by theta, and it exists whether or not a prior p(theta) is ever specified.)

      The math works either way. But this does confuse people from time to time.

      • Thanks Andrew,

        Yes, you can begin from “below” or from “above”. But the interpretation of P(A|B) as “the probability of A given that B has occurred” still seems to be fictional.

        For instance, if you first define the function P(.|B), then you must have a sigma-field of events for the argument of the function P(.|B): since for each fixed B this function must be a probability, you must define a list of sets to be measurable by P(.|B), i.e., the domain of P(.|B). The symbol “|B” seems initially to mean that the probability P(.|B) was built by using the *information* contained in B, and you can write instead mu_B(.) = P(.|B). Well, this is how any type of probability is built, including likelihood functions, joint probabilities, marginal probabilities and so forth. The problem is how to justify a probability space for the conditioning events, since they may not be measurable in the probabilistic sense. For instance, the probability measure mu_B may be built by employing some deterministic laws, such as via differential equations; in this case, B contains our knowledge about differential equations, knowledge about the relations among the elements of interest and so on. Can it be measured in terms of probabilities? Some conceptual discussion is needed, and maybe this is not the right place to do it.

        Well, you want to start from the “conditional” probability mu_B(.) = P(.|B) — which is not really a conditional probability in the usual sense, since B might not be measurable in terms of probability laws — to get a “joint” probability measure.

        Let us assume that B is a measurable event in terms of probabilities. You must expand the initial measurable space to build a joint measurable space. First, you have to build all probability spaces (O_B, F_B, mu_B) such that mu_B(B) = 1 for all “conditional” sets B in K (a non-pathological sigma-field of the “conditional” subsets of B), and, finally, you must define a probability measure, say Q(.), to be applied to all “conditional” sets B in K. Then you define W(A & B) to be Q(B)*mu_B(A); naturally, mu_B(.) and Q(.) must both have special behaviors, otherwise W is not well defined. This was just an informal description.

        P(A|B) has a precise mathematical definition like division, but the usual interpretation “the probability of A given that B has occurred” is not well justifiable.

        • Just a correction at the end: when we use measurable events and start from “below”, the interpretation of P(A|B) as “the probability of A given that B has occurred” is indeed more justifiable than when we start from “above”.

        • Alexandre:

          Sorry, but what you call fictional, I just call mathematics. You want to first define a joint distribution, and that’s fine, it’s your preference. I don’t. The probability I use follows all the mathematical axioms. You might not like it, but that’s your problem, not mine! An expression such as P(A|B) simply represents a collection of distributions, indexed by B. From this can be derived a joint distribution, P(A,B), if a distribution P(B) is specified. When you write that my probability “is not really a conditional probability in the usual sense”: ok, sure, the usual sense based on whatever textbook you happen to have used. That doesn’t matter to me; it satisfies all the axioms and it’s no more “fictional” than 0, or pi, or sqrt(2), or i, or any other number that can be considered as a generalization of some more narrow framework.

          Again, you can use whatever notation you want. But my notation is no more informal than yours; it’s just different from yours, and you naturally think that something different must be less rigorous. But it’s not.

        • I think the point is that when we get something complicated like:

          P(A ball will land inside a certain region | a differential equation, some imprecise initial conditions, and some parameters)

          you have a hard time putting some kind of probability over all the possible differential equations that could be used to model the flight of the ball, so if you happen to say that the appropriate differential equation is something in particular, there’s no real way to deal with a probability space over all the other possible differential equations that you didn’t choose…

          I’m not sure if I’m getting at Alexandre’s point, but that’s how I understand the issue.

        • My solution to this issue is to reject measure theory as the basis for probability, and to reject frequentism as the basis for probability. Instead, I interpret probability in the context of nonstandard analysis, and Cox’s theorem as the basis for Bayesian reasoning.

          The result is that I interpret conditioning as essential: P(A|B) encodes the meaning *degree of reasonableness of A given that our state of knowledge is that B is true*. Sometimes the knowledge state is only that we know frequency histograms, and then we get the frequentist view of probability as a special case, and in that special case we demand P(A&B) = P(A|B)P(B). But there are plenty of Bayesian models in which there’s no real meaning to P(B) (i.e., when B is the differential equation of motion, we don’t have a probability space over all possible differential equations).

          So, I feel philosophically that conditioning is primary, and the frequentist notion is a special case.

        • Daniel, nice comment!

          Some comments:

          1. It is always good to know the assumptions of a theorem, since this is a very nice way to learn its limitations and possibly to expand the results under weaker assumptions.

          2. In 1946, Cox showed that the axioms of probability can be justified in terms of basic desiderata.

          3. In 1931, de Finetti anticipated Cox’s theorem (and Ramsey, in 1926, had already discussed the issue). However, both authors made many assumptions; one very important one is that the belief function must be strictly increasing in a specific sense: let [c|a] be the belief in c given the information provided in a, and let F([c | b & a], [b|a]) = [c & b | a]. Cox and de Finetti tacitly assumed that F(x,y) is a strictly increasing function in both arguments.

          4. In 2000, Marichal showed that by assuming that F is just non-decreasing in both arguments (which is the minimum required to maintain consistency), possibility measures emerge under some extra conditions (the measure is not additive; the sum operation is replaced by the max operation). That is, probabilities are not universally justifiable: there is always a domain where the result is valid and another domain where it is not.

          5. There are other arguments in favor of probabilities, but all of them have their own domains of application. For instance, the Dutch book argument imposes a linear restriction: a bet with stake S for the price p(E)*S, where S is won if E is true. All the other arguments make similar assumptions.

          6. In 1997, Waidacher exposed some hidden assumptions in the Dutch book argument and also showed that other measures must be used if the price is non-linear.

          7. Under incomplete and imprecise information, Dempster (1967), Walley (1991) and Millner et al. (2013) argued that upper and lower bounds for probabilities can be elicited. Needless to say, in practice we have only incomplete and imprecise information. Van Fraassen (2006) provided some examples of incomplete information.

          To sum up, probabilities are nice and can be applied to some types of events. But there are other domains where probability is not justifiable. For instance, the classical model is not a probability model; instead it is formed by a family of possible probability models to explain the observed data. Notice that we may assign a possibility measure to this family of probability models instead of a prior probability measure; this changes the interpretation of a statistical model. Unfortunately, it seems that statisticians are not quite interested in this interpretation.

          Some statisticians strongly believe that “every uncertain event MUST be modeled ONLY by probabilities”, based on Cox’s theorem, Dutch book arguments and so on. However, once you understand the limitations of those results, your belief in that statement significantly decreases.

          I feel that human coherence is primary; human coherence has a strong aesthetic component (it is not Bayesian coherence; there are many other ways to define coherence). Coherence goes well beyond these modern statistical principles. We can apply consistent mechanisms to infer about uncertain events (probabilities, possibilities, plausibilities, impossibilities, etc.), so why only probabilities? There is no non-dogmatic justification for that.

          Some references:

          Cox, R.T. (1946). Probability, Frequency and Reasonable Expectation, American Journal of Physics, 14, 1–13.

          Ramsey, F.P. (1926). Truth and Probability, in Kyburg and Smokler (1980), 23–52.

          de Finetti, B. (1931). Sul significato soggettivo della probabilità, Fundamenta Mathematicae, 17, 298–329.

          Waidacher, C. (1997). Hidden assumptions in the Dutch book argument, Theory and Decision, 43, 293–312.

          Dempster, A.P. (1967). Upper and lower probabilities induced by a multivalued mapping, Annals of Mathematical Statistics, 38, 325–339.

          Walley, P. (1991). Statistical Reasoning with Imprecise Probabilities, Chapman and Hall.

          Millner, et al. (2013). Do probabilistic expert elicitations capture scientists’ uncertainty about climate change?, Climatic Change, 116, 427–436.

          van Fraassen, B.C. (2006). Vague Expectation Value Loss, Philosophical Studies, 127, 483–491.

        • Andrew and Daniel,

          Andrew, I did not criticize you; I do not understand why you got so uncomfortable with that. The interpretation of mathematics is not mathematics, it is at least meta-mathematics. In my view, a mathematical quantity, in this case P(A|B), can be (re)interpreted in a number of ways. I am not saying that it has only one interpretation; I am trying to open the interpretation spectrum, since, as P(A|B) is a number x such that P(A & B) = P(B)*x, it can be interpreted in other ways, provided P(B) > 0.

          The notation and the way you build the theory are important because they induce one (or a class) of many possible interpretations. It is important to note that some conditional statements are less justifiably written in probabilistic terms than others. One example: the p-value is *NOT* a conditional probability, which is easy to see from its formal definition. Daniel got me right, thanks.

          The intention of my post was just to provoke some different thoughts and to question the rapid and usual interpretation of conditional probabilities. But sorry to write that…

        • Anonymous,

          I can’t see the relation between your claim about Pelé and Zeno’s paradox and my post, sorry. Pelé scored many goals; that is an observed fact registered by many cameras. However, we can indeed re-interpret the subject “Pelé” and the predicate “scored a goal”; we can draw different conclusions for each re-interpretation. We can consider Pelé only as a physical body or also as a mental entity (formed by a mixture of personalities). This opens the universe to other possible valid analyses and conclusions, since other concepts are being employed. That is more like what I am saying: some mathematical concepts can be re-interpreted, especially in logic, alternative logics, set theory, statistics and probability. For example, there are many different schools in logic, some of which do not accept logical principles accepted by the others; that is, some inferential rules are not accepted because a different interpretation of mathematical concepts is being used.

          I think it is a good exercise to open our analyses to other interpretations rather than just accepting the standard practical ones. After all, we want to explain, describe and interpret the observed data, and to infer or predict unknown quantities based on mathematical models. We must be free to think in other possible interpretations, other theories, other tools, other philosophies, other principles, other values and so on; otherwise, we will be stuck in the same toolbox, making the same types of interpretations and the same types of inferences, using the same types of models.

          In mathematics a definition must be eliminable and also non-creative. This means that all results obtained from a specific definition should be attainable without it; otherwise, contradictions emerge from the creative, non-eliminable definition. That a definition is eliminable means that its definiendum can be replaced by its definiens. The definition of P(A|B) must comply with these criteria; however, P(A|B) = P(A & B)/P(B) is not eliminable, because we cannot substitute the definiendum P(A|B) by any definiens in the case P(B) = 0. On the other hand, we can define P(A|B) as a number x in [0,1] such that P(A & B) = x*P(B). The latter is a genuine definition.

          All the best

          Sorry for the typos; I wrote without proofreading carefully. It is hard to discuss in a foreign language, even more so when you do not use the language every day. Some pictures/images are different in my native language and it is difficult to find proper surrogates. It is easier to discuss when the subject is purely technical, but when it comes to philosophical concepts, some analogies/metaphors/metonymies that are specific to each language and culture are required to make things easier to comprehend.

          I am quite open to discussing these subjects in depth and without embracing any ideologies, only human coherence (and we can discuss what human coherence is; it is not trivial). One of my main goals is to study the limitations/potentialities of some rules for making inferences under uncertainty. Notation plays an important role in avoiding misleading interpretations, since informal definitions promote many avoidable controversies.

          The subjects I am interested in are: logics, foundations of statistics/probability, set theory, Fuzzy sets, limits of logic/probability/statistics, philosophy in general, mathematical principles, meta-mathematics and related topics. If you are interested in these issues too, feel free to contact me.

          All the best,
          Alexandre.

        • Alexandre,

          Thanks for some interesting discussion.

          The point of relating this to Zeno’s paradox is to highlight the effect that your line of argument has on an applied statistician: rather than casting doubt on the interpretation of P(A|B) (which appeared to be your intention), it casts doubt on the validity of using mathematical descriptions based on measure theory (which did not appear to be your intention).

          I am surprised you did not mention chapter 2 of the book by Jaynes (2003), which gives a more recent exposition of Cox’s theorem that may be familiar to more readers (I for one have not read the original work by Cox). I do not recall noticing the tacit assumption you mention in Jaynes (my oversight I assume), so am interested in reading more. Do you have a reference for the work by Marichal?

        • Konrad,

          Sorry, I forgot to put Marichal’s work in the reference list. In Richard Cox’s original work, he did not state many important assumptions; for instance, he divides some quantities by the derivatives of the function F. The denominator of that quantity must be positive for all values of (x,y), that is, F must be strictly increasing. Paris (2006) formalized Cox’s theorem, and there all the hidden assumptions are made explicit.

          If you need any of these works, I can send them to you by email.

          All the best,
          Alexandre Patriota

          References:
          Marichal, J-L. (2000). On the associativity functional equation, Fuzzy Sets and Systems, 114, 381–389.

          Paris, J.B. (2006). The Uncertain Reasoner’s Companion: A Mathematical Perspective, Cambridge University Press.

    • Andrew, can you please erase these short comments and insert in my initial post the following statement: “provided that the probability of B is strictly greater than zero,” after:

      “As 0 <= P(A & B) <= P(B) <= 1,”

  4. The symbols for “provided that probability of B is strictly greater than zero” were not printed… maybe the symbols for “lesser” and “greater” cause some problems here…

    • Browsers interpret the “lesser” and “greater” symbols as the left and right brackets of HTML tags, which usually butchers the comment. It would be better if LaTeX math environments were allowed.

      • I think they are, but there’s no reminder/instructions about the particular form of LaTeX that this blog allows. Can we get a static page linked in the header that describes how to use LaTeX on this blog?

        • You can always use the HTML encodings for special characters. See here:

          https://www.utexas.edu/learn/html/spchar.html

          So, the basic idea is to start the special symbol with an ampersand &, continue with a special code for the desired character, and finish with a semicolon. I’ve used this code in encoding the ampersand above…we’ll see if it works!
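
          For example, typing &lt; produces the “lesser” symbol and typing &gt; produces the “greater” symbol, so the inequality in the comment above can be posted safely as: 0 &lt;= P(A & B) &lt;= P(B) &lt;= 1.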

  5. I cannot appreciate all the wonderful axiomatic complications you are talking about here, but part of the problem is that, beyond the amusement of mathematicians, mathematics exists to help formalize models of the empirical world. And if it happens that having a notion of P(A|B) even when P(A,B) does not exist is useful, axiomaticians would simply need to go back to the drawing board to figure out how to put it on a formal footing. Just think about the history of generalized functions. Next stop, QFT.

  6. The problem I have with the precocious newborn is that a certain state of knowledge is presumed, coupled not only with a general ignorance but with a nonsensical approach to the problem. As in, look around: if the sun doesn’t rise, how does life exist? How does gravity exist? Why would a person aware of this issue not ask anything? Why would that person decide out of thin air that the way to answer this is to count sunrises rather than figure out – by research or observation – that sunrise is a necessity for life?

  7. It seems to me that this is an argument over the terms of the hypothetical more than anything else. If the newborn were observing a daily coin flip (of a possibly weighted coin), this would seem more like a reasonable approach. If the hypothetical newborn is defined to be precocious in probability and marble counting (and nothing else), then it can be assumed to not be aware of the questions about the causes and effects of the sun rising that jonathan mentioned. Under these terms, the sun rising has no more real-world import from the newborn’s point of view than the coin flip. In that case, it would have no pretext by which to justify the rejection of the plain vanilla default model. If, on the other hand, the hypothetical is not quite so carefully defined, then we start, justifiably I suppose, asking whether a newborn that can count marbles could make inferences about what the sun is and does. I really don’t see a deep statistical question here; just an ill-defined hypothetical.

  8. If I may quote your article on the history of anti-Bayesianism:

    ‘More than that, though, the big, big problem with the Pr(sunrise tomorrow | sunrise in the past) argument is not in the prior but in the likelihood, which assumes a constant probability and independent events. Why should anyone believe that?’

    I don’t see anything wrong with making such simple assumptions. This is how science proceeds. Simple working hypotheses are used until conflicting information arises showing that such a model must be refined or rejected, i.e., Occam’s razor at work.

    Once this child learns more about astronomy, the age of the sun, and that ultimately the sun itself will collapse billions of years from now, he will by then realize that the probability that the sun will rise tomorrow depends on many things and that it doesn’t converge to 1 at all.

    For a newborn child, I don’t think this argument is unreasonable at all. In fact, the most intelligent adults I know reason in a similar way implicitly for lack of a better predictive model.

    • Aidan:

      Of course it’s fine to make simple assumptions. Then when they fail, we make our model more complicated.

      My problem with that sun-rising example is that it’s been taken as an argument-by-absurdity against Bayesian statistics (ha ha, that silly prior, those silly Bayesians!), but the absurd part is the data model or likelihood.

      • I agree that the absurd part is the data model and not the Bayesian approach. Moreover, I think the critics are the silly ones here.

        At least Bayesians can contemplate such things in a reasonable manner and improve their data model with new information. Frequentists, on the other hand, rely on their ‘many worlds’ idea, which is really silly.

        I think the ‘many worlds’ assumption is beyond reason.
