A probability isn’t just a number; it’s part of a network of conditional statements

This came up a while ago in comments, when Justin asked:

Is p(aliens exist on Neptune that can rap battle) = .137 valid “probability” just because it satisfies mathematical axioms?

And Martha sagely replied:

“p(aliens exist on Neptune that can rap battle) = .137” in itself isn’t something that can satisfy the axioms of probability. The axioms of probability refer to a “system” of probabilities that are “coherent” in the sense of satisfying the axioms. So, for example, the two statements

“p(aliens exist on Neptune that can rap battle) = .137” and “p(aliens exist on Neptune) = .001”

are incompatible according to the axioms of probability, because the event “aliens exist on Neptune that can rap battle” is a sub-event of “aliens exist on Neptune”, so the larger event must (as a consequence of the axioms) have probability at least as large as the probability of the smaller event.

The general point is that a probability can only be understood as part of a larger joint distribution; see the second-to-last paragraph of the boxer/wrestler article. I think that confusion on this point has led to lots of general confusion about probability and its applications.
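Martha's argument can be checked mechanically. The following is a minimal sketch, assuming we are handed a table of stated probabilities plus a list of known sub-event relations; the function name `check_monotonicity` and the event labels are illustrative, not from the original discussion.

```python
def check_monotonicity(probs, sub_events):
    """Return the (smaller, larger) pairs that violate P(smaller) <= P(larger).

    probs      -- dict mapping an event label to its stated probability
    sub_events -- list of (smaller, larger) pairs, each meaning that
                  `smaller` is a sub-event of `larger`
    """
    return [(small, large) for small, large in sub_events
            if probs[small] > probs[large]]


# The two statements from the comment (labels are shorthand).
stated = {
    "rapping aliens on Neptune": 0.137,
    "aliens on Neptune": 0.001,
}
nesting = [("rapping aliens on Neptune", "aliens on Neptune")]

violations = check_monotonicity(stated, nesting)
print(violations)
```

The check never asks whether either number is right about Neptune. It only compares the two numbers against each other, which is the sense in which a single probability cannot be judged alone.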

Along these lines, I also recommend chapter 1 of Bayesian Data Analysis, where we lean hard on the idea of probability as a real thing that can be measured, and we give several examples.

P.S. This is related to the idea that a story doesn’t exist on its own, it exists in context.

P.P.S. As commenters pointed out, the title of this post, “A probability isn’t just a number; it’s part of a network of conditional statements” is also correct if we replace the word “probability” by “mathematical statement.” For example, “x=3” has meaning in the context of other statements about x.

49 thoughts on “A probability isn’t just a number; it’s part of a network of conditional statements”

  1. I recall some large literature on ‘causal inference’ in this vein; happy to trade reference [only kept two, one not very useful as it hinges into statistical mechanics never borrowed into Econ, not for wrong reasons; the other tells how far things can go otherwise];
    all but note to self

  2. Another way of saying the same thing:

    you can calculate any ridiculous thing you want – you can “show” that apples fall upwards and “show” that p(aliens exist on Neptune that can rap battle) = 763 even if no aliens exist on Neptune at all – if you make incorrect assumptions about the real world.

    This applies to all mathematical representations of the real world, not just probability.

    It’s tragic that claims about the real world are unlikely to be true if they’re based on incorrect assumptions about the real world. But so it is.

    • This is not at all saying the same thing. Probabilities can be incoherent or incompatible with no reference to the real world at all. Hence, in the example above:

      “p(aliens exist on Neptune that can rap battle) = .137” and “p(aliens exist on Neptune) = .001”

      are incompatible according to the axioms of probability, because the event “aliens exist on Neptune that can rap battle” is a sub-event of “aliens exist on Neptune”, so the larger event must (as a consequence of the axioms) have probability at least as large as the probability of the smaller event.

      Martha can judge the statement to be wrong without any real world knowledge, simply because the probabilities are incompatible with each other. Another example is

      Probability of A = 0.8, probability of not A = 0.6

      must be wrong because the probabilities do not add up to 1.

      This is important because it’s helpful to separate different sources of error and lines of criticism: soundness, or internal consistency, and truth of the premises.

      https://en.wikipedia.org/wiki/Soundness?wprov=sfti1
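The second example in the comment above (0.8 and 0.6) can be tested in the same purely internal way. A minimal sketch, assuming a small numerical tolerance for rounding; the function name is illustrative.

```python
import math


def complement_is_coherent(p_a, p_not_a, tol=1e-9):
    """True when P(A) and P(not A) each lie in [0, 1] and sum to 1."""
    in_range = 0.0 <= p_a <= 1.0 and 0.0 <= p_not_a <= 1.0
    return in_range and math.isclose(p_a + p_not_a, 1.0, abs_tol=tol)


print(complement_is_coherent(0.8, 0.6))  # the pair from the comment
print(complement_is_coherent(0.8, 0.2))  # a coherent alternative
```

No fact about the world enters the function, only the two stated numbers.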

      • “Martha can judge the statement to be wrong without any real world knowledge, simply because the probabilities are incompatible with each other.”

        I disagree!

        The presumption that “aliens exist on Neptune that can rap battle” is a subset of “aliens exist on Neptune” is true only because it is a rational belief in the real world. In some fantasy world, it is possible that “aliens exist on Neptune that can rap battle” even though there are no “aliens [that] exist on Neptune”. In a fantasy world, anything is possible. There is no reason to presume that rationality as we know it in our world is universal. “Rationality” as we know it, *is defined by our understanding of the world we live in*. It does not exist independently of that world. We are able to understand many features of our world because our deductions are verified by real world experience. Without this verification, our deductions would be meaningless.
        For example, we hold that *Homo*, *Pan*, *Pongo*, and *Gorilla* are members of the family *Hominidae*. However, this belief results only from careful and extensive analysis of both the fossil record and the genetic information from each group.

      • “Martha can judge the statement to be wrong without any real world knowledge, simply because the probabilities are incompatible with each other.”

        The presumption that “aliens exist on Neptune that can rap battle” is a subset of “aliens exist on Neptune” is true only because it is a rational belief in the real world – a deduction from a general principle based on observation of the real world. You don’t recognize it as such because it’s so fundamental. Surely it’s been many millennia since humans recognized the general principle: “birds in the forest that can sing” is a subset of “birds in the forest”. This is a property of our universe. The probabilities are known to be incompatible from observation.

        • I do not mean to deny that probability theory was invented to model real world phenomena. I’m just pointing out that your statement is not at all paraphrasing the original blog post, in case it confuses someone.

          you can calculate any ridiculous thing you want – you can “show” that apples fall upwards and “show” that p(aliens exist on Neptune that can rap battle) = 763

          You actually cannot “show” that a probability = 763, no matter what your assumptions about the real world are, because by the Kolmogorov axioms of probability it must be between 0 and 1. The point being made above is that these axioms define the coherence of a system, or sigma algebra, of probabilities. They don’t let you say very much about any particular element of the system, but they imply a lot about how the elements of the system fit together.

          Yes, the Kolmogorov axiomatization was designed with real world applications in mind, but that’s just its history; the definition of probability, given where we are now, is independent of real world assumptions. Similarly, while the constant pi is meant to model the ratio between the circumference and the diameter of a circle, its value is independent of actual circles. In fact, the ratio between the circumference and diameter of “real circles”, even idealized perfect circles in the real universe, is not actually equal to pi because we live in a non-Euclidean geometry, and spacetime has a nonzero curvature. Even so, it would be incorrect to say that e^(i pi) =/= -1, because the definition is what it is. Confusing, I know!
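To make the "coherence of a system" point concrete, here is a minimal sketch for a finite sample space, where the Kolmogorov axioms reduce to two checks on the outcome weights. The helper names and the two-outcome "forced" example are assumptions made up for illustration.

```python
import math
from itertools import combinations


def is_probability_measure(weights, tol=1e-9):
    """Check non-negativity and total mass 1 for a finite sample space.

    `weights` maps each elementary outcome to its probability. Events get
    their probability by summing outcome weights, so additivity holds by
    construction and only these two checks remain.
    """
    nonnegative = all(w >= 0 for w in weights.values())
    normalized = math.isclose(sum(weights.values()), 1.0, abs_tol=tol)
    return nonnegative and normalized


def event_probabilities(weights):
    """Yield (event, probability) for every subset of the sample space."""
    outcomes = list(weights)
    for size in range(len(outcomes) + 1):
        for event in combinations(outcomes, size):
            yield event, sum(weights[o] for o in event)


fair_die = {face: 1 / 6 for face in range(1, 7)}
print(is_probability_measure(fair_die))
print(max(p for _, p in event_probabilities(fair_die)))

# Forcing one outcome up to 763 needs a negative weight elsewhere to keep
# the total at 1, and that breaks non-negativity.
forced = {"rap battle": 763.0, "no rap battle": -762.0}
print(is_probability_measure(forced))
```

Once the two checks pass, every one of the 64 events on the die automatically has a probability between 0 and 1, which is how the axioms constrain the whole system rather than any single number.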

        • Somebody:

          Your concern is based on the fact that probabilities are defined to be less than or equal to 1. Correct? Fair enough, I accept that.

          If I modify my statement to:

          “you can “show”…that p(aliens exist on Neptune that can rap battle) = .997 even if no aliens exist on Neptune at all”

          Is your concern satisfied?

    • My concern is that you said

      Another way of saying the same thing:

      And then talked about something completely different. The distinction between soundness and truth is important. As a subjective Bayesian, I cannot in principle dispute your subjective belief that the probability of rapping aliens is 0.997 (though in practice I wouldn’t trust that you’re reporting your beliefs honestly). But I can dispute whether or not you are rationally adjusting your beliefs in response to new information.

  3. The trouble with probability is that it doesn’t exist. Nothing ‘has’ a probability and it is not an inherent quality or property of any phenomenon. Things have mass, velocity, etc., but for probability we have to have a conscious observer to ‘realize’ that there is a probability when events are observed.

    • I wouldn’t say that probability doesn’t exist — but the definition and usage are relevant to a specific context. For example, the probability that event A occurs in situation 1 may be different from the probability that event A occurs in situation 2.

    • ” Nothing ‘has’ a probability and it is not an inherent quality or property of any phenomena. ”

      Probability is a property of a system, isn’t it?

      At the Pacific Science Center in Seattle, they used to have a big plastic tablet-shaped thing with a bunch of vertical columns. Balls would shoot into the top and bounce around then fall down one of the columns. When all the balls were out, the piles of balls in the column always formed a normal distribution. This distribution was a property of the system – it *always* came out the same.

      You could say the same about the probability of getting lung cancer from smoking. The system is the entire suite of human interactions with tobacco. One’s individual probability changes with different “slices” through the system along different dimensions – years smoked; age started; age quit; non-smoking interaction with tobacco; genetic characteristics; and possibly to a lesser extent many other aspects of the system like diet, physical activity, other chemical interactions and so on. It’s a very complicated system, which in turn makes the probability highly dependent on many specifications in the system. But it’s still a system and for a fixed set of conditions would probably produce the same result over and over, just like the ball system at the Pacific Science Center.
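The ball machine described above sounds like what is usually called a Galton board (an assumption; the comment does not name it). A minimal simulation sketch, assuming each peg sends a ball left or right with equal chance and using made-up sizes (5000 balls, 12 rows), shows the sense in which the pile shape is a property of the setup rather than of any one run.

```python
import random
from collections import Counter


def galton_board(n_balls, n_rows, rng):
    """Drop n_balls through n_rows of pegs and count balls per column.

    Each peg sends the ball right with probability 0.5, so the final
    column index is the number of rightward bounces.
    """
    counts = Counter()
    for _ in range(n_balls):
        column = sum(rng.random() < 0.5 for _ in range(n_rows))
        counts[column] += 1
    return counts


# Three independent runs: individual balls differ, the pile shape does not.
for seed in (1, 2, 3):
    counts = galton_board(n_balls=5000, n_rows=12, rng=random.Random(seed))
    print(seed, [counts[c] for c in range(13)])
```

The column counts follow a binomial distribution, which is close to a normal curve for this many rows. Every run prints a different list, but the bell-shaped profile recurs.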

      • Probability is only our guess at what will happen when we don’t know all the variables. If we knew all of the properties of a system, we wouldn’t have a Probability of x or y happening, we would be able to work out what will happen each time. So Probability is just a model of the system we make to try and predict an outcome. All models are wrong, but some are useful.

        • Magnus:

          According to quantum mechanics, your statement, “If we knew all of the properties of a system, we wouldn’t have a Probability of x or y happening, we would be able to work out what will happen each time,” is incorrect.

        • Andrew wrote:

          According to quantum mechanics, your statement, “If we knew all of the properties of a system, we wouldn’t have a Probability of x or y happening, we would be able to work out what will happen each time,” is incorrect.

          Quantum mechanics does not say this, only the most popular interpretation of the equations does. Look into the walking droplet experiments that fit the pilot wave interpretation from 2005 onward.

          If those were done when QM was developed no one would believe any of the mystical/weird stuff. I really can’t see any other explanation than inertia for why these results didn’t lead to a paradigm shift.

          I mean now we have a deterministic, local, realist, and causal model analogous to hydrodynamics. Or there is stuff going on so weird that no one can understand it going on.

      • At the Pacific Science Center in Seattle, they used to have a big plastic tablet-shaped thing with a bunch of vertical columns. Balls would shoot into the top and bounce around then fall down one of the columns. When all the balls were out, the piles of balls in the column always formed a normal distribution. This distribution was a property of the system – it *always* came out the same.

        Trouble is the closer you look, the more the probabilities disappear. If you could specify the exact configuration of the balls, the microabrasions on the surface, the little currents in the air, where all the balls end up becomes not probabilistic at all! It would become some particular exact configuration with probability one! The probabilities are changing, but nothing in the world is changing, just what you know about the balls and setup.

        • “becomes not probabilistic at all! ”

          Is there such a thing? If the outcome is fully predictable it still has a probability of 1 doesn’t it?

        • Yeah, you’re correct. I just mean that a very uncertain system with spread out probabilities can become a very certain one with tight probabilities without any changes to the system.

          But if you’re interested in the really nitty gritty semantics of it, “true” and “true with probability 1” are distinct. For example, for the classic points on a dartboard example, for every individual point, the probability of selecting that point is zero. Yet, obviously the dart lands somewhere. Mathematically, this reflects the true nature of probability as an integral over some set of points. But that’s not relevant to this discussion.
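The dartboard example can be sketched numerically. This is a toy setup under stated assumptions: the board is the unit disc, darts land uniformly on it, and the target is a small disc around the centre. All names are illustrative.

```python
import random


def hit_fraction(radius, n_darts, rng):
    """Fraction of uniform darts on the unit disc landing within
    `radius` of the centre."""
    hits = 0
    thrown = 0
    while thrown < n_darts:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y > 1:  # missed the board: throw again
            continue
        thrown += 1
        if x * x + y * y <= radius * radius:
            hits += 1
    return hits / n_darts


rng = random.Random(0)
for radius in (0.5, 0.1, 0.01, 0.0):
    # For a uniform dart the exact probability is the area ratio, radius**2.
    print(radius, hit_fraction(radius, 100_000, rng), "exact:", radius ** 2)
```

As the target shrinks to a single point its probability goes to zero, yet every dart still lands at some point. Probability here attaches to regions (through area), not to individual points.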

        • Reply to: ” “becomes not probabilistic at all! ”

          Is there such a thing? If the outcome is fully predictable it still has a probability of 1 doesn’t it?”

          According to my Internet, probabilistic means:

          “subject to or involving chance variation”

          so no. (Not that arguments over semantics accomplish much.)

        • Anoneuoid said,
          “I mean now we have a deterministic, local, realist, and causal model analogous to hydrodynamics. Or there is stuff going on so weird that no one can understand it going on.”

          This sounds too extreme — I’d say that often there is stuff going on that is weird to many (maybe even most) people.

      • Frequencies can be properties of physical systems. This happens when the final frequency distribution is insensitive to any uncontrolled initial conditions in the physical setup. This gives the final frequency distribution a “hard reality” since it always seems to occur no matter what the initial conditions were.

        Probabilities describe the uncertainty in the initial conditions. They are in part objective, since initial conditions are objective, and are in part subjective, in that they depend on how much is known/assumed/constrained about the initial conditions.

        Probabilities have many uses. One important use is to calculate how sensitive things like “the final frequency distribution” are to the unknown initial conditions.

        In certain very special cases there is a numerical coincidence, whereby the value of the probability is the same as the best guess for the frequency. The great tragedy of this subject is that some bozos a century ago decided this special case was the *only* case, and butchered the entire field of statistics in order to force-fit it into that one special example.

        • 100% agree with this. Yes, you can set up a system such that over many repetitions the outcomes have a stable frequency distribution because MOST possible inputs produce an outcome that is one of the high probability ones from the distribution of interest. This especially tends to happen when there is sensitive dependence on initial conditions such that very small perturbations will shift the outcome to a different portion of the phase space. For example, when you throw a die, even small changes in initial conditions can change the way things bounce such that it will wind up on a different side. But it only ever lands on a side; it doesn’t spontaneously balance on a corner or edge.
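The claim that a stable frequency distribution can come out of very different uncertain inputs can be illustrated with a toy model. This is not die physics: `toy_die` is a made-up deterministic rule in which the face depends sensitively on a single hypothetical "initial spin" number, and the two throwers are arbitrary assumptions.

```python
import random
from collections import Counter


def toy_die(spin, sensitivity=1000.0):
    """Deterministic toy 'die': the face is a fixed function of the spin.

    A large sensitivity means tiny changes in spin change the face.
    """
    return int(spin * sensitivity) % 6 + 1


def face_frequencies(draw_spin, n_throws, rng):
    """Relative frequency of each face when spins come from draw_spin(rng)."""
    counts = Counter(toy_die(draw_spin(rng)) for _ in range(n_throws))
    return [counts[face] / n_throws for face in range(1, 7)]


def uniform_thrower(rng):
    return rng.uniform(10.0, 11.0)


def gaussian_thrower(rng):
    return rng.gauss(20.0, 0.5)


rng = random.Random(0)
print(face_frequencies(uniform_thrower, 60_000, rng))
print(face_frequencies(gaussian_thrower, 60_000, rng))
```

Each throw is fully determined by its spin, and the two throwers have quite different uncertainty about that spin, yet both frequency lists come out close to 1/6 per face. The frequencies are insensitive to the initial-condition distribution, which is the situation the two comments above describe.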

        • I like your comment here, I hadn’t thought of that perspective. But a question: are there circumstances where events are inherently uncertain, intrinsically noisy, the messiness of the universe and biological systems, even despite precisely defined initial conditions?

        • Joe,

          Who knows?

          But more importantly, how would we know?

          One way would be to use probabilities to determine what should follow reliably from our model of the universe and see if it actually does occur reliably. If not, we’ve learned a bit more. Iterate ad infinitum.

  4. A friend and I were talking the other day about all the people who seem to be on the phone (via headset) all day at work. This is particularly common for jobs like cab/uber driver, convenience store clerk, and mail/package delivery.

    Who exactly is available to talk all day with them? Then we realized who must be on the other end of the conversation: other people with similar jobs.

    That led to the next idea: what if for every “crazy” person who hears voices and talks to themself there is another “crazy” person on the other end?

    Now yes, this theory requires assuming some form of natural long distance communication is possible. And it also must be highly compressed/encrypted to escape detection for so long. But all of that sounds like what you would expect evolutionarily anyway.

    And it is, in principle, possible to ask someone what they are hearing then go see if anyone else was saying the same thing at that moment. But in practice no one is going to really do this experiment, so we are left with only priors.

    And in that case you can really assume whatever you want. The prior is another (along with the model/likelihood) premise/assumption/axiom you are asking others to take as true, but that doesn’t mean they must accept your premise(s).

    The key is to come up with a set of premises everyone could agree on, priors included. My understanding is the axioms of Euclid’s elements were arrived at this way.

    • Anoneuoid said, “The key is to come up with a set of premises everyone could agree on, priors included. My understanding is the axioms of Euclid’s elements were arrived at this way.”

      Well, I don’t know if the axioms of Euclid’s elements were arrived at this way. Still, there is the question of “everyone” in its broadest sense, or just “everyone” in a certain well-defined group of people who were interested in Euclid’s work.

      • It would be stuff like if you tie a string between two sticks it follows the shortest path between them, then that is a “line”. Really, the goal is to find axioms no one can honestly argue against. And look how well geometry has withstood changing political and religious power over the millennia.

        Also, the ideal would be self proving axioms in the vein of “this sentence is false”.

        • It would be stuff like if you tie a string between two sticks it follows the shortest path between them, then that is a “line”. Really, the goal is to find axioms no one can honestly argue against.

          Not only do people argue against this, it’s generally considered to be false!

        • Not only do people argue against this, it’s generally considered to be false!

          Not sure what you mean. I’d guess you are thinking of non-Euclidean geometries.

          It took thousands of years to figure out the parallel postulate was a special case. And that is the exception that proves the rule, since people always suspected something was wrong with that one anyway.

          But either way, if you put two distant stakes in the ground and connect with a very long string/cable it will still give you the shortest distance over the surface, even though the surface is spherical. So the method still works.

          I would like to see the arguments against this that you mention.

        • The part that isn’t true is this

          then that is a “line”.

          It will not be a line, in the geometric sense of having zero curvature or in the algebraic sense of solving a linear equation. In general, the shortest path will be a geodesic, which happens to be a line in a Euclidean geometry.

          You can define “line” to be the shortest path between two points, but then lines are neither linear nor straight, which is linguistically too confusing to me.

        • You can define “line” to be the shortest path between two points, but then lines are neither linear nor straight, which is linguistically too confusing to me.

          You seem to be saying only use the term “line” if/when the parallel postulate holds. But that’s clearly not what was meant because then it wouldn’t be a separate postulate.

          In that case, how do you define/derive the concepts of “linear” and “straight”? Anyway, I think my question regarding the arguments was answered.

        • A couple things

          There are many ways to define a line. One important definition is the solution set of a system of linear equations

          x + y = constant

          The fact that distance minimizing paths do not in general coincide with this is the reason why the term “geodesic” was invented, to generalize the concept of a line beyond Euclidean geometry.

          In addition, the string tightening system has equilibria that do not necessarily correspond with shortest distances. On a spherical geometry, these would be the arcs on the opposite side of the sphere, though it would be an unstable equilibrium.

          Non-Euclidean geometry does not necessarily mean relaxing the fifth postulate specifically. In a spherical geometry, substituting geodesics for lines, you cannot extend the arcs infinitely or describe a unique circle with any radius, and there are at least 2 and possibly infinitely many lines between two distinct points. Indeed, wouldn’t that be weird, that a line is not uniquely specified by two points on it?
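The string-on-a-sphere discussion can be made concrete with a short calculation. A minimal sketch, assuming a perfect sphere and points given as latitude and longitude in degrees; the function name and the example points are illustrative.

```python
import math


def sphere_paths(lat1, lon1, lat2, lon2, radius=1.0):
    """Return (short_arc, long_arc, chord) between two points on a sphere.

    short_arc and long_arc are the two great-circle arcs joining the
    points; chord is the straight segment through the interior.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    # Spherical law of cosines for the central angle, clamped for rounding.
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(dlon))
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    short_arc = radius * angle
    long_arc = radius * (2 * math.pi - angle)
    chord = 2 * radius * math.sin(angle / 2)
    return short_arc, long_arc, chord


# Two points on the equator, a quarter turn apart.
print(sphere_paths(0, 0, 0, 90))
# Antipodal points: both arcs have the same length.
print(sphere_paths(0, 0, 0, 180))
```

For the quarter-turn pair the taut string follows the short arc, the long arc is the other equilibrium mentioned above, and the chord (the Euclidean straight line) is shorter than both but does not lie on the surface. For antipodal points the two arcs tie, and so does every other great circle through them.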

        • Anoneuoid said, “Really, the goal is to find axioms no one can honestly argue against.”

          This seems like a fool’s errand.

  5. Doesn’t quantum mechanics, at present, state clearly that probabilistic phenomena are the fabric of reality? Every probability statement is the likelihood of x given whatever; where whatever grows in information, the variance goes down until you hit quantum phenomena at which point you’re done? Happily prepared to be corrected x

    • Antoninus said,
      “Doesn’t quantum mechanics, at present, state clearly that probabilistic phenomena are the fabric of reality? Every probability statement is the likelihood of x given whatever; where whatever grows in information, the variance goes down until you hit quantum phenomena at which point you’re done? Happily prepared to be corrected x”

      I wouldn’t say that the following is correcting you, but my understanding of quantum mechanics is (at least in part) that we have limitations in what we can assert as certainty. If I am not mistaken, the original quantum mechanics situation said something like, “We cannot simultaneously know both the position and momentum of a certain type of elementary particle.”

  6. Sometimes we approximate discrete outcomes by a continuous outcome model; sometimes we approximate a score function (whose curl needs to be 0) by an arbitrary function; the same logic applies to fitting a network of conditional statements, and I find it often justified to even estimate a probability density function by some easy function that does not necessarily respect the probability axioms. Having a coherent probabilistic prediction network is nice; but often we have other purposes as well.

  7. “Along these lines, I also recommend chapter 1 of Bayesian Data Analysis, where we lean hard on the idea of probability as a real thing that can be measured, and we give several examples.”

    Speaking of probabilities, I think the folks that would read Bayesian Data Analysis and understand the content are not likely part of the group that would not understand:

    ““p(aliens exist on Neptune that can rap battle) = .137” and “p(aliens exist on Neptune) = .001”
    are incompatible according to the axioms of probability, because the event “aliens exist on Neptune that can rap battle” is a sub-event of “aliens exist on Neptune”, so the larger event must (as a consequence of the axioms) have probability at least as large as the probability of the smaller event.”

    BDA is just too hard for me to understand. So many formulas, so much calculus (in the first chapter!!!), and not enough written explanation. We get integrals like
    ∫ p(ỹ|θ) p(θ|y) dθ on page 7. What?

    Then likelihood ratios get a paragraph and:
    = [p(θ1) / p(θ2)] × [p(y|θ1) / p(y|θ2)]

    It’s just sort of dreadful, in a way.

    There are people that read the brief text, see the formula, get it, and move on. Then there are people who don’t get it, but who would otherwise understand the concept of a likelihood ratio through written text, some examples, and some figures. Likelihood ratios are part of any epidemiology course, for example, when authors get into sensitivity, specificity, and positive/negative predictive values.

    Ok, the authors of BDA warn of this in the Preface. “Although introductory in its early sections, the book is definitely not elementary in the sense of a first text in statistics.” Understatement.

    I get the feeling that Bayesian data analysis is just for the 1600 math SAT folks, I guess. How is this way of doing things going to get popularized if it is so difficult to understand? Would be very happy to read a book 4 times as long that took 4 times as much text to explain the stuff. =( I remember Bob Carpenter made a post some time ago with some very complicated title about Hamiltonian / Quasi-Newtonian something-something, and there was a short post that was just as complicated, and someone said it sounded interesting but they had no idea what it meant. Then Bob took the time out of his busy day to write out a much longer explanation of his post, and it was like holiday lights were suddenly put on–it all made a lot of sense and the work sounded absolutely fascinating.

    I wish BDA was like Bob’s long-form explanation, but it’s not. Maybe Regression and Other Stories is a better place to start for a dope like me, but it may not lean into BDA as much. I wonder where a good starting point would be, one that doesn’t have stuff like f(y_i) g(θ) exp(φ(θ)^T u(y_i)) on page 36?
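The likelihood-ratio expression quoted in the comment above (posterior odds equal prior odds times likelihood ratio) may be easier to follow with numbers from the epidemiology setting the commenter mentions. A minimal sketch: the prevalence, sensitivity, and specificity values below are hypothetical, chosen only for illustration.

```python
def posterior_odds(prior_1, prior_2, lik_1, lik_2):
    """Posterior odds of theta1 versus theta2:
    (prior odds) times (likelihood ratio)."""
    return (prior_1 / prior_2) * (lik_1 / lik_2)


# Hypothetical diagnostic test (all three numbers are assumptions).
prevalence = 0.01    # p(disease) before testing
sensitivity = 0.90   # p(positive | disease)
specificity = 0.95   # p(negative | no disease)

# theta1 = disease, theta2 = no disease, y = a positive test result.
odds = posterior_odds(prevalence, 1 - prevalence,
                      sensitivity, 1 - specificity)
positive_predictive_value = odds / (1 + odds)
print(odds, positive_predictive_value)
```

With these made-up inputs the likelihood ratio is 18, but the prior odds are only 1 to 99, so the positive predictive value comes out at roughly 15%. That is the same arithmetic as the formula, with θ1 and θ2 read as "disease" and "no disease".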

    • Sad:

      I think BDA is just wonderful but I agree that it’s not wonderful for everyone. If you want a book with pretty much the same attitude and perspective as BDA but written with a lot less math and with a lot more discussion, I recommend Statistical Thinking, by Richard McElreath’s book.

      • I guess it’s getting late and we are not at our best, especially with our typing, but I can’t help pointing out that Andrew’s comment, “If you want a book with pretty much the same attitude and perspective as BDA but written with a lot less math and with a lot more discussion, I recommend Statistical Thinking, by Richard McElreath’s book,” seems to need some proofreading; in particular, I don’t really think that Richard McElreath’s book wrote the book Statistical Thinking.

    • Where T is theory, O is observations, | = given, and 0:n refer to the various theories, then the posterior probability is:

      p(T|O) = p(T_0)*(p(O|T_0)/[ p(T_0)*(p(O|T_0) + p(T_1)*(p(O|T_1) + … + p(T_n)*(p(O|T_n) ]

      Basically you normalize how well your theory (T_0) fits the data to all the other theories out there, while weighting them according to how much you believed each theory before you made new observations. It is quite simple and describes what people naturally do anyway.

      Also, Theory = hypothesis = explanation, etc or whatever term you want to use. And the continuous version you seem to have a problem with is more computationally efficient but I agree it hinders understanding. Anyone who can understand basic algebra should be able to understand this equation.

      • Well there are two typos, sorry. It should be:

        p(T_0|O) = p(T_0)*p(O|T_0) / [ p(T_0)*p(O|T_0) + p(T_1)*p(O|T_1) + … + p(T_n)*p(O|T_n) ]

        Here is another way to write it:

        p(T_0)*p(O|T_0) / sum[ p(T_0:n)*p(O|T_0:n) ]
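The discrete formula above translates almost line for line into code. A minimal sketch, with three made-up theories whose priors and likelihoods are purely illustrative:

```python
def posterior(priors, likelihoods):
    """Posterior p(T_i | O) for every theory T_i.

    priors[i] is p(T_i) and likelihoods[i] is p(O | T_i). Each product
    p(T_i) * p(O | T_i) is divided by the sum of all such products.
    """
    joint = [p * lik for p, lik in zip(priors, likelihoods)]
    total = sum(joint)
    if total == 0:
        raise ValueError("observations have zero probability under every theory")
    return [j / total for j in joint]


# Hypothetical numbers: T_0 was the favourite beforehand but fits the data worst.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.10, 0.40, 0.70]
post = posterior(priors, likelihoods)
print(post)
```

Here the posterior is about 0.16, 0.39, and 0.45: the initially favoured theory drops to last place because the other two explain the observations better. The value for T_0 depends on every term in the denominator, so it cannot be computed without specifying the competing theories.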

  8. If someone says P(A)=q, then they are implicitly declaring a probability space with sigma algebra consisting of the empty set, A, complement of A, and the entire sample space. As long as q is a number in [0,1], then they have (implicitly) constructed a “valid” probability measure, at least according to the common standard mathematization of probability. Of course the point still stands that P(A)=q by itself doesn’t satisfy the axioms, but the implicitly constructed probability measure does indeed satisfy the axioms.
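Written out, the construction in this comment looks as follows. This is a sketch under the assumption that A is a nonempty proper subset of the sample space Ω; the symbols 𝓕 and Ω are notation introduced here, not in the comment.

```latex
\begin{align*}
\mathcal{F} &= \{\emptyset,\ A,\ A^{c},\ \Omega\}, \qquad q \in [0, 1], \\
P(\emptyset) &= 0, \quad P(A) = q, \quad P(A^{c}) = 1 - q, \quad P(\Omega) = 1. \\[4pt]
\text{Non-negativity:}\quad & 0 \le q \le 1 \ \Rightarrow\ P(E) \ge 0 \text{ for all } E \in \mathcal{F}. \\
\text{Normalization:}\quad & P(\Omega) = 1. \\
\text{Additivity:}\quad & P(A \cup A^{c}) = P(\Omega) = 1 = q + (1 - q) = P(A) + P(A^{c}).
\end{align*}
```

The only disjoint pair of nonempty events in 𝓕 is A and its complement, so that single line covers additivity; unions with the empty set are trivial. The measure is valid, but it says nothing about any event other than A, which is why a second stated probability (such as one for a sub-event of A) is what makes incoherence possible.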
