Rosenthal’s textbook: A First Look at Rigorous Probability Theory

I’ve never taken a stats class. I was a math major, but dropped stats after I got appendicitis because I didn’t want to drop abstract algebra or computability theory. So here I am 40 years later trying to write up some more formal notes on probability theory and Markov chain Monte Carlo (MCMC) methods and finding myself in need of a gentle intro to probability theory that defines things rigorously. Jeffrey S. Rosenthal’s 2006 book A First Look at Rigorous Probability Theory is just what I needed. Despite not being very good at continuous math as an undergrad, I would have loved this book as it’s largely algebraic, topological, and set-theoretic in approach rather than relying on in-depth knowledge of real analysis or matrix algebra. Even the examples are chosen to be simple rather than designed for Ph.D.s in math, and they include a lot of basic analysis results like constructing uncountable, unmeasurable sets.

This is not the final book you’ll need on your way to becoming a measure theorist. For example, it never discusses Radon-Nikodym derivatives to unify the theory of pdfs. It doesn’t really get into stochastic processes beyond a high-level discussion of Brownian motion. It does cover the basic theory of general state space, time-homogeneous Markov chains in a few pages (why I was reading it), but that’s just scratching the surface of Roberts and Rosenthal’s general state-space MCMC paper, which is dozens of pages long in much finer print.

One of the best features of the book is that it’s both short and packed with meaningful examples. Rosenthal was quite selective in eliminating unnecessary fluff and sticking to the through line of introducing the main ideas. Most of the texts in this area just can’t resist all kinds of side excursions and examples which bring in all sorts of Ph.D.-level math. That can be good if you need breadth and depth, but Rosenthal’s approach is superior for a first, broad introduction to the field.

In summary: 5/5 stars.

134 thoughts on “Rosenthal’s textbook: A First Look at Rigorous Probability Theory”

  1. Bob:

    Thanks for the review, which should be useful to many people. Now we just need star ratings from you and other raters for a bunch of other books so we can fit an item-response model and calibrate your scores!

    • Hey, that’s the problem that you and Jennifer helped me model (inferring consensus from crowdsourced ratings by adjusting for rater accuracy and bias) that got me into Bayesian stats in the first place. Turns out we just replaced the Dawid and Skene model from 1979, which is a really great model for this. I have a summer intern working on adding the equivalent of the first IRT-1PL parameter (difficulty).

  2. 236 pages sounds quite reasonable. I’ve been urged before to shore up my measure-theoretic foundations for probability theory — is this a good text for that purpose, or does it require grounding in measure theory first? (I’m otherwise decent at analysis, linear algebra, calc, etc.)

    Pinging off recent discussions, how have you found chatGPT as a supplemental tutor for stuff like this? I was working through some signal processing material a few weeks back and found it to be quite helpful in e.g. expanding on certain definitions or “confirming” that certain properties held, with the usual caveats that certain prompting strategies were more reliable than others (my usual preference for asserting points of disagreement between my old understanding and new information inclined it towards retracting valid statements instead of exploring the conflict directly).

    By my count, there are also 270 plug-ins currently available for chat.openai.com/?model=gpt-4-plugins. Are there any hidden gems in there that would be especially helpful for these purposes?

    • For example, the current context window size of chatGPT’s GPT-4 model is 4,000 tokens. But human-readable English has a lot of extraneous information, wasting precious space. If I prompt vanilla ChatGPT (GPT-4) with the following:

      Please efficiently compress the following text in a manner that a version of you without access to the original text could decompress with high accuracy. It does not have to be in readable English, just understandable to you, while minimizing total token count. You can use any symbols or notation that would help you towards this goal:

      Rosenthal’s textbook: A First Look at Rigorous Probability Theory
      Posted on June 5, 2023 3:00 PM by Bob Carpenter

      I’ve never taken a stats class. I was a…

      I receive the following output:

      R’snthal txtbk 1stLook@RigProbThy_Jun5,2023_3PM_BobC

      No stats class_mathmajor_drop stats_appendicitis_>drop abstrAlg/compuThy. 40y latr_need formal notes_probThy/MCMC_need gentle intro_probThy_rigorous. J.S.R’snthal’s ’06 book_1stLook@RigProbThy_perfect. Despite weak_contMath_undgrad, likd book_algebraic/topological/set-theoretic_not req in-depth reAnal/matAlg. SimplExamples_not PhD math incl_basicAnal_res: uncountable/unmeasurable sets.

      Not final_measure theorist book. No Radon-Nikodym_derivatives, not much stoProces_beyond high-lvl Brownian motion. Covers basicThy_genStaSpa_tim-homog MCMC_few pgs (why read)_scratch surface R’snthal&Roberts’ genStaSpaMCMC paper_finer print.

      Book best feat: short, meaningful examples. R’snthal selective, no fluff, stick to main ideas. Other texts sidetracks PhD math, R’snthal superior_1st broad intro.

      Summ: 5/5.

      I can then ask it to decompress the above with: “In another instance, I asked you to compress some text for later decompression. Specifically, I had written the following: “Please efficiently compress the following text in a manner that a version of you without access to the original text could decompress with high accuracy. It does not have to be in readable English, just understandable to you, while minimizing total token count. You can use any symbols or notation that would help you towards this goal:” with the uncompressed text attached. You provided a response, quoted below. Please decompress your response into as close a passage of text to the original as possible, including all headers and other data you are able to extract from the compressed text: …”

      yielding:

      Decompressing your text, I obtain:

      “Bob C. wrote a review of Jeffrey S. Rosenthal’s textbook ‘A First Look at Rigorous Probability Theory’ on June 5, 2023, at 3 PM.

      In his younger years, he had not taken any statistics classes as a mathematics major. He had to drop a few classes, including abstract algebra and computational theory, due to appendicitis. Now, 40 years later, he needed formal notes on probability theory and Markov chain Monte Carlo (MCMC) methods. He was looking for a gentle introduction to rigorous probability theory, and he found Rosenthal’s 2006 book to be perfect for that purpose.

      Despite his weak background in continuous mathematics from his undergraduate years, he enjoyed the book because it did not require an in-depth understanding of real analysis or matrix algebra. He found the simple examples included in the book, which were not meant for Ph.D. mathematics, very helpful, including basic analysis results on uncountable and unmeasurable sets.

      However, he noted that this is not the final book for budding measure theorists. It does not cover Radon-Nikodym derivatives, and it does not delve much into stochastic processes beyond a high-level introduction to Brownian motion. The book covers the basic theory of general state space and time-homogeneous MCMC methods over a few pages, which, according to him, only scratch the surface of Rosenthal and Roberts’ paper on general state space MCMC in finer print.

      One of the best features of the book, according to him, is the short, meaningful examples. Rosenthal is selective, does not include unnecessary fluff, and sticks to the main ideas. While other texts may sidetrack into Ph.D. level mathematics, Rosenthal’s book is superior as a first broad introduction.

      In summary, he rated the book 5 out of 5.”

      Not bad! Giving it the original text and asking it to highlight discrepancies, we get:

      Looking at the original text you provided and the decompressed version, I see several minor discrepancies but no major errors that would significantly change the content’s meaning. Here are a few specific points:

      The decompressed version states that Bob had to “drop a few classes, including abstract algebra and computational theory”, but the original text only mentions him dropping stats due to appendicitis while not wanting to drop abstract algebra and computability theory.

      In the decompressed text, I wrote “Bob has returned to academic study out of necessity.” This could imply that he’s back in formal education, but the original text only states that he’s writing up some more formal notes on probability theory and MCMC methods, suggesting self-study or professional application rather than formal academia.

      The original text mentions that Bob appreciated the book because “it’s largely algebraic, topological, and set-theoretic in approach.” The decompressed version simplified this to “he enjoyed the book because it did not require an in-depth understanding of real analysis or matrix algebra,” which misses the emphasis on algebraic, topological, and set-theoretic approaches.

      The decompressed text suggests that the book “covers the basic theory of general state space and time-homogeneous MCMC methods over a few pages.” However, the original text specifies “it does cover the basic theory of general state space, time-homogeneous Markov chains in a few pages”, emphasizing Markov chains instead of MCMC methods.

      Finally, there is a minor discrepancy in the final sentence. The decompressed text says, “In summary, he rated the book 5 out of 5,” while the original text specifies a star rating system: “In summary: 5/5 stars.”

      Overall, the compressed text succeeded in capturing the key points of the review, with some minor details lost or slightly misrepresented during decompression.

      Anyway, my actual point. I’ve seen lots of preprints integrating GPTs with external reference materials, and several plugins allow you to provide links to PDFs or other online documents. Anyone know which are best at handling textbook-length docs, whether through something like the compression above, some sort of smarter indexing, or another approach?

    • The book’s great. It doesn’t really assume much analysis—it mentions Lebesgue measure, but only after defining it from first principles. It would certainly help to have done analysis and topology, but one of the nice things about this book is that it reinforces all of that with simple, concrete examples in the context of probability measures.

      I haven’t tried measure theory, but I found ChatGPT to be super-duper useful for shoring up my RNA-seq background in terms of the lab processes and where hexamer, positional, and GC biases come in. I then found it much easier to look up the references when I knew the terms of art for talking about things. It’d be interesting to see if GPT could keep up the right level (upper level undergraduate) and keep consistent notation in doing a tutorial.

      And yes, GPT’s ability to compress and uncompress its own output is quite impressive. OpenAI hasn’t said how much, if any, of this they’re doing to manage the user/assistant history in dialogues, but other people have been doing what Nik suggests. But I have no idea how to do this best at scale—you’d probably have to build your own API plug-in, because that lets you set system and user/assistant prompts directly.
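
      For concreteness, here’s roughly what the API route looks like in Python with the pre-1.0 openai package (set OPENAI_API_KEY in the environment first). This is just a sketch of the compress step, not a tested plug-in: the model name and the system prompt are my own choices, and the user prompt is the compression prompt quoted above.

      import openai

      def compress(text):
          # Ask the model for a token-efficient compression of `text`,
          # using an explicit system prompt plus a user prompt.
          response = openai.ChatCompletion.create(
              model="gpt-4",
              messages=[
                  {"role": "system",
                   "content": "You compress text so that a later instance of you can decompress it."},
                  {"role": "user",
                   "content": "Please efficiently compress the following text ...:\n\n" + text},
              ],
          )
          return response["choices"][0]["message"]["content"]

      The decompress step is the same call with the decompression prompt quoted above; managing a long dialogue history this way would just mean compressing older turns before they fall out of the context window.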

  3. The title of the first chapter is “The need for measure theory”, but measure theory isn’t needed to make applied statistics rigorous.

  4. Hey Bob. Having read Rosenthal and gotten the “usual” measure-theoretic view of advanced probability theory, I wonder what you’d think of this:

    https://web.math.princeton.edu/~nelson/books/rept.pdf

    Nelson utilizes his nonstandard analysis ideas to show how probability can be constructed from finite sets, and then, by adding nonstandard numbers, you get essentially all of “measure theoretic probability” from those finite sets. He discusses the “weirdness” that transferring to a limit produces, and then goes on to prove a tremendous amount of the usual probability stuff using these basic finite ideas with nonstandard numbers. In chapter 18 he proves a massive theorem that encompasses entire books’ worth of material in 5 pages.

    In the appendix he shows how basically all the usual measure theory adds nothing: for every measure theoretic concept there is a finitary nonstandard concept that is “nearby” and produces the same kinds of results via simple algebra.

    The entire book is 92 pages including the appendix. It moves pretty quick.

    The thing I learned from these kinds of readings (there are some other related books, Albeverio et al. for example) is that the “unintuitive” stuff about measure theory, such as a point having zero probability and only certain types of sets being integrable, adds nothing; it’s a pure barrier to interpretation and application.

    Anyway, see what you think if you have the time and inclination.

    • From Chapter 4:

      “Here is a sketch of how the strong law of large numbers is formulated in conventional mathematics. One replaces the finite sequence x1,…,xn by an actually infinite sequence x1,x2,… To do this, one must construct the infinite Cartesian product of the initial probability space Omega with itself. Even if Omega is a finite probability space, the infinite Cartesian product will contain an uncountable infinity of points. Each individual point has probability 0. Only certain (measurable) sets are events, and the probability of an event is no longer the sum of the probabilities of the points in it. Only certain (measurable) functions are random variables. The expectation becomes an integral, and only certain (integrable) random variables have expectations. Then the strong law of large numbers becomes the statement that except for an event of probability 0, for all epsilon > 0 there is an m such that for all n > m we have ….

      The approach that we take is different, and has the virtue of remaining within the elementary framework of finite probability spaces. We retain the finite sequence x1,…,xn but we let n be nonstandard…”

      The advantage is now you can just do algebra… the probability of an event composed of K different elementary possibilities is the sum of the probabilities of each elementary possibility… Each of these may be infinitesimal but if you add up a nonstandard number of them, you get an appreciable result, etc. Every random variable has a PMF either on a finite standard set of outcomes (like a Bernoulli) or on a nonstandard set (like an approximately Normal distribution defined on the values {-N + k dx} for k from 0 to 2N/dx, with dx infinitesimal).

      The “picture” you get more accurately reflects the computational picture as well. Float64 values are discrete points in the interval [-1.798e308, 1.798e308]. They may not be evenly distributed but the main point is that if you have an accurate iterative calculation scheme, in general there is some small value of a parameter for which the calculation is exactly accurate to the limit of the machine representation. Smaller intervals for integration for example don’t help, they just lead to the same answer. This is in essence the picture of nonstandard analysis.

    • Wow, that Nelson book is definitely not for someone who can “add, subtract, and reason” (from the preface). I would not assign it as a first text. But the book looks cool. My first serious contact with probability theory was with Ross’s book (A First Course in Probability; no mention of measure theory anywhere). It was really intimidating but it got the job done. Today I would have probably started with the Blitzstein and Hwang book (which has zero instances of the phrase “measure theory”).

      • Shravan… I think “reason” is doing a lot of heavy lifting there. He was essentially a logician. But yes, it’s not a first text by any means, but then I don’t think any text in measure theory is a good first text either. I personally would start teaching probability theory with combinatorics, and then move directly to NSA. But please take a day or something to look through Nelson’s book and send me an email or post on Mastodon or something to let me know what you think.

        I have often felt like I should write a book called something like “Building Mathematical Models and Testing Them With Data: A guide to doing real science”. I feel like I have that book in me, but I don’t have the motivation to write it that I need.

        If I were to write such a book, it would be aimed at someone like a bench biologist or maybe a psycholinguist ;-) … someone who knows some kind of science discipline but who doesn’t regularly use mathematics in it, but who suspects that maybe they should be.

        It would start with basic algebraic models… calculating the volume of stuff, calculating the average speed that something traveled, dimensional analysis of algebraic expressions representing physical properties, we’d generate algebraic expressions that model questions from Mythbusters or something like that: do you get wetter walking or running in the rain? that kind of stuff. Then we’d ask the question of whether our model did a good job, we’d introduce graphing things, and ask the question which values of unknowns make our predictions look like our data? I’d have some experiments the reader could actually carry out, but also some data sets actually generated from those real world experiments as carried out by middle-school kids or something.

        Then I’d introduce the idea of figuring out which values of the parameters in the model “make sense” by measuring the “plausibility” of a model, and the basic idea of Cox’s theorem resulting in some rules for how to manipulate plausibility. I wouldn’t call it probability yet. I’d work with some basic linear regression against a point cloud generated by one of those experiments above…

        we’d choose some small finite set of intercept values, and some small finite set of slope values… maybe 20 each. There would be 400 possible combinations. We’d introduce the idea of selecting one of each “uniformly at random” and then plot the lines you get. Maybe we’d choose 20 of them. Some of them would not go anywhere near the data…

        We’d introduce the idea of the “typical error” in our model, the idea that your model usually can’t predict each point exactly to 37 decimal places, but rather predicts to within some “fuzzy region” around the prediction. I’d draw a curve like the normal distribution but emphasize that the main thing is we usually think of a good model as one which peaks at zero error and declines away from that. This is usually the defining characteristic of what it means to be a “good / plausible model” but it’s also possible to have other patterns (for example maybe the data should fall either a little to the left, or equally balanced, a little to the right, but we actually don’t expect it to be right at the predicted value).

        We’d then use that information to calculate a plausibility p(data | model) for each of the 400 possible combinations of intercept and slope. We’d put that in a table. Then we’d draw 20 pairs of intercept and slope from the table, but we’d just take the 20 **most plausible** from the table and graph that. Then maybe we’d start with the most plausible and increase the set until we’ve got 95% of the plausibility. We’d look at that set of graphs. We’d mention there might be a better way to select a few for plotting, but we wouldn’t do it yet. We’d just consider this whole table as the result … a complete set of “plausibilities” for all the possibilities.
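
        Here’s roughly what that 400-entry table looks like as a few lines of Python — just a sketch, with fake data, made-up grid ranges, and an assumed “typical error” of 1:

        import numpy as np

        rng = np.random.default_rng(1)
        x = np.linspace(0, 10, 30)
        y = 1.5 + 0.7 * x + rng.normal(0, 1.0, x.size)   # fake point cloud from a line plus noise

        intercepts = np.linspace(0, 3, 20)               # 20 candidate intercepts
        slopes = np.linspace(0, 1.5, 20)                 # 20 candidate slopes
        sigma = 1.0                                      # assumed "typical error"

        # plausibility p(data | intercept, slope): product of normal errors
        table = np.empty((20, 20))
        for i, a in enumerate(intercepts):
            for j, b in enumerate(slopes):
                resid = y - (a + b * x)
                table[i, j] = np.exp(-0.5 * np.sum((resid / sigma) ** 2))
        table /= table.sum()                             # normalize the 400 entries

        # smallest set of (intercept, slope) pairs holding 95% of the plausibility
        order = np.argsort(table, axis=None)[::-1]
        cumulative = np.cumsum(table.ravel()[order])
        keep = order[: np.searchsorted(cumulative, 0.95) + 1]
        for flat in keep[:5]:                            # peek at the most plausible few
            i, j = np.unravel_index(flat, table.shape)
            print(f"intercept={intercepts[i]:.2f} slope={slopes[j]:.2f} p={table[i, j]:.3f}")

        The plotting steps are left out, but the table is the whole object of interest here.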

        The next thing we’d do is talk about algebraic expressions as functions, and then talk about other ways of describing functions. Piecewise linear for example. We’d build some nonlinear regressions using “standard” polynomials but then also introduce Chebyshev polynomials, Fourier Series (and the idea of orthogonal bases in general), and Radial Basis Functions. We’d also choose some commonly used and way less good bases… for example piecewise constant (also known as “indicator variables for each time point”).
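
        As a concrete (and entirely made-up) illustration of swapping bases, here’s numpy’s Chebyshev Vandermonde matrix next to a piecewise-constant indicator basis, both fit by least squares:

        import numpy as np
        from numpy.polynomial import chebyshev

        x = np.linspace(-1, 1, 200)
        y = np.sin(3 * x) + 0.1 * np.random.default_rng(2).normal(size=x.size)

        # Chebyshev basis: columns are T_0(x), ..., T_7(x)
        X_cheb = chebyshev.chebvander(x, 7)

        # piecewise-constant basis: one indicator column per bin (8 bins)
        bins = np.linspace(-1, 1, 9)
        idx = np.digitize(x, bins[1:-1])                 # which bin each x falls in
        X_ind = np.zeros((x.size, 8))
        X_ind[np.arange(x.size), idx] = 1.0

        for name, X in [("chebyshev", X_cheb), ("indicator", X_ind)]:
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            rmse = np.sqrt(np.mean((X @ beta - y) ** 2))
            print(name, "in-sample RMSE:", round(rmse, 3))

        With the same number of columns, the smooth basis should track sin(3x) noticeably better than the staircase, which is the point about “way less good” bases.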

        Next we’d talk about systems that develop through time / dynamically and discuss how we can build them from a “series” of algebraic models where the model uses the current state, and some information to predict the “next” time point. We’d introduce “infinitesimal numbers” to represent very short time slices. We’d give the rules for manipulating infinitesimal numbers. We would create algebraic models using infinitesimals and use them to describe physics problems: A ball dropping, a spring and weight oscillating, maybe a biology example like predator prey or a chemical example like reaction dynamics. We’d mention this is called a “differential” equation because it’s an algebraic equation for “infinitesimal differentials”.
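
        For instance, the ball-dropping example could literally be a three-line loop; the constants here (gravity, a linear drag coefficient, the time slice dt) are just made-up illustrative numbers:

        g, k, dt = 9.8, 0.3, 0.001         # m/s^2, 1/s, s
        v, h, t = 0.0, 100.0, 0.0          # start at rest, 100 m up

        while h > 0.0:
            v += (g - k * v) * dt          # change in speed over one short time slice
            h -= v * dt                    # change in height over the same slice
            t += dt

        print(f"hits the ground after about {t:.2f} s, moving at {v:.1f} m/s")

        The differential equation is hiding in the two update lines: they are exactly the algebraic statements dv = (g - k v) dt and dh = -v dt.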

        Finally, we’d ask the question if we have a “complete table of plausibilities” which contains too many possibilities to compute with (for example any Float64 value for each of 10 variables) or a rule for computing them at least, how can we get a “representative set” of values of the unknowns? We’d introduce the rules for “sampling” in such a way that they preserve plausibility to within infinitesimal error in the limit of a nonstandard number of samples. We’d assume a computational sequence that can give “uniform on 0,1” random numbers, and show how to sample from our tables in proportion to the plausibility value. We’d also show that we get the same sample if we sample from any **multiple** of the plausibility values.
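
        A sketch of that last step in Python (the table of plausibilities here is an arbitrary stand-in, and I’m using numpy’s ready-made weighted sampler rather than building it from uniform draws):

        import numpy as np

        rng = np.random.default_rng(3)
        plausibility = np.array([1.0, 4.0, 2.0, 0.5, 2.5])       # un-normalized weights

        def sample(weights, size):
            p = np.asarray(weights, dtype=float)
            return rng.choice(len(p), size=size, p=p / p.sum())  # normalize, then draw

        draws_a = sample(plausibility, 100_000)
        draws_b = sample(1000.0 * plausibility, 100_000)         # every entry scaled by the same constant
        print(np.bincount(draws_a) / draws_a.size)
        print(np.bincount(draws_b) / draws_b.size)               # essentially the same frequencies

        The second print is the point about multiples: dividing by the sum wipes out any overall constant, which is why an un-normalized table is enough.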

        That would probably be a pretty full book in and of itself. I need to write this down in some notes …

        The *final chapter* as an appendix would discuss probability and the relationship between plausibility, and sampling, it’d discuss how in real world experiments “random” outcomes occur because essentially of missing information, and that only if you have experience that suggests a stable frequency distribution can you then assign plausibility based on observed frequency. We’d discuss **scientific** errors made by assuming the frequentist viewpoint.

    • Shravan:

      I never learned measure theory either! Many people find it useful, though. I have three possible theories about the utility of measure theory:

      1. Measure theory is useful for proving certain theorems but it doesn’t come up in applied statistics or in the development and evaluation of statistical methods, so I haven’t needed it.

      2. I actually already know the content of measure theory; I just never learned it under that official name.

      3. If only I knew measure theory, I’d be able to be much more effective when doing applied statistics or when developing and evaluating statistical methods.

        • Seriously, read the Nelson book.

          What measure theory gives you is a way to un-obfuscate results given in the language used by pure mathematicians who never model anything in the real world. If all you need to do is prove theorems about mathematical structures using language that other mathematicians use, you need measure theory. If you’re content with either not doing that, or converting those results to usually trivial algebra using infinitesimal numbers… then you don’t need measure theory. The fact that once converted to NSA the results are often trivial suggests that measure theory isn’t giving “deep” insight. This is Edward Nelson’s point in the stuff I quoted https://statmodeling.stat.columbia.edu/2023/06/05/rosenthals-textbook-a-first-look-at-rigorous-probability-theory/#comment-2225579

        • Preface: “The mathematical background required is little more than that which is taught in high school”

          Nice!

          Chapter 2 (page 6): “The set R^Ω of all random variables on Ω is not only a vector space, it is an algebra. By an algebra A of random variables we will always mean a sub algebra of R^Ω containing the constants […] This means that every construct or theorem of probability theory can be relativized to any algebra A of random variables.”

          Hum…

          Chapter 4 (page 12): “By an infinitesimal we mean a real number whose absolute value is smaller than the reciprocal of some nonstandard natural number […] We make the following assumptions: 1. 0 is standard, 2. for all n in N, if n is standard then n+1 is standard […] it is consistent to assume also 3. there exists a nonstandard n in N. […] We also assume: 4. if A(0) and if for all standard n whenever A(n) then A(n+1), then for all standard n we have A(n). [….] Using external induction we can easily prove that every nonstandard natural number is greater than every standard natural number [….]”

          What?

          Thankfully the point is illustrated with a representation of natural numbers lying on a tape where we can distinguish all the infinite standard natural numbers 1,2,3,… on the left followed by an infinity (in both directions!) of non-standard numbers on the right.

          “Notice that we did not start with the left portion of the tape and invent a right portion to be added on. Rather we started with the whole tape and then adjoined a predicate to our language that allows us to distinguish the two portions of the tape.”

          It’s all cleared up then!

          Seriously, that’s the kind of thing that pure mathematicians may find trivial but it doesn’t look like high-school maths.

        • After reading multiple math books, we learn to ignore the assertions by mathematicians that such and such requires only xy background…

          If you accept instead my assertion, which is that Nelson’s book is a potential alternative to graduate measure theory… then I think you can proceed as planned.

          When it comes to “standard” vs “nonstandard”. You can think of all the natural numbers, 0, 1, 2, … and you can think of this new “property” that numbers have, either they are “standard” or they are “nonstandard”. A number is standard precisely when there is a finite formula you can write down that defines it **without using the predicate “standard”**. For example the number 0 or 1 is standard, as is pi = x : x > 0 and x 3 given that x > 0 using the grid and the probability mass function?

          p(x > 3 | x > 0) = sum(p(x_i) for x_i > 3) / sum(p(x_i) for x_i > 0). It’s just simple sums and division, because the spaces over which we’re working are *finite* (even though they have a ridiculously and undefinably huge number of discrete points).

          When Jaynes talks about the “Limiting Density of Discrete Points” he’s really talking about a nonstandard grid like this https://en.wikipedia.org/wiki/Limiting_density_of_discrete_points

        • Oh damn it… as usual WordPress ate my text after a less than sign… and I didn’t save the text to be able to fix it! Like 80% of my comment is gone!

        • Well, that’s depressing.

          I’ll try to paraphrase…

          Understanding what nonstandard numbers are involves somewhat advanced ideas in logic… but *working with them* uses the same machinery as high school algebra. You just add them up, or take logarithms, or construct ratios etc.

          To do probability on real numbers, like working with normal distributions, you’d define a fine grid of points:

          {x_i = -N + i * 1/N} for i from 0 to 2*N^2

          That defines a discrete grid of points: it starts at -N, where N is a nonstandard integer, so -N is to the left of all standard integers, and continues in infinitesimal increments until you get to N, a nonstandard positive integer to the right of all standard numbers. Since the increments are infinitesimal, this grid contains numbers which are infinitesimally close to **every** standard number you could ever define.

          Then you’d assign some probability to each one: let dx = 1/N, then p(x_i) = 1/sqrt(2*pi*sigma^2) * exp(-1/2 * ((x_i - mu)/sigma)^2) * dx

          Now the total probability is sum(p(x_i)) over all the i, which is 1 (or infinitesimally close to it) because that sum *is* the integral (technically, the standard number closest to that sum is the integral).

          Basically, if you understand algebra on “standard” numbers, then you understand algebra on nonstandard numbers. They work the same. But now, instead of “not being able to give a point any probability” you can do it… because you have numbers that are so small (infinitesimal) that even when you add up a nonstandard number of them, the sum is still infinitesimal… Like the integral from -N to -N/2 of the p(x) above would be infinitesimal, because the exp(-x^2)*dx is so damn small that even a nonstandard number of them doesn’t add up to anything other than infinitesimal stuff.
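
          A computer obviously can’t hold a nonstandard grid, but a merely very fine standard grid already shows the same algebra, including the P(x > 3 | x > 0) calculation from my earlier comment. The bounds, dx, mu, and sigma here are arbitrary choices:

          import numpy as np

          mu, sigma = 0.0, 1.0
          dx = 1e-4
          x = np.arange(-10.0, 10.0, dx)                  # stand-in for the -N..N grid
          p = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma**2) * dx

          print(p.sum())                                  # ~ 1.0
          print(p[x > 3].sum())                           # ~ P(X > 3), about 0.00135
          print(p[x > 3].sum() / p[x > 0].sum())          # P(X > 3 | X > 0), about 0.0027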

          In general when working with nonstandard numbers you’ll get some standard result plus some infinitesimal, so there’s a notion called the “standard part” which is to say “give me the standard number closest to this number” and we’re always doing that.

          Nelson’s point is that if you have infinitesimally fine grids of points, they are still **finite** grids (ie. you can’t make a 1-1 correspondence between say every other grid point, and the full set, the way you can between the even integers and all the integers). So you just work with finite sums and other finite formulas and you don’t have to worry about “what if we have an actual infinity of points and add up all the values and the individual values aren’t actually 0”. And that’s where the problems lie in standard analysis… you have actually infinite sets, and if you add up all the little bits, then unless each little bit is actually equal to 0, you will get infinity.

        • Hi Daniel, thanks for the detailed comments and thoughts. I looked at the Nelson book again and I really do find it impenetrable. I felt the same way about BDA3’s opening chapters and had to take a four-year excursus into math and stats to come back to it and become able to read and understand it (well, most of it). With the Nelson book, I feel like I’m missing some knowledge. I think that the problem does not entirely lie with me though; Nelson writes in a way that Knuth would disapprove of (see his lectures on writing math). For example, as soon as he introduces a finite set $\Omega$, he just assumes that the reader will work out that $\omega$ is some element of $\Omega$. Then he makes a really rough topic shift to the expectation of a random variable, without giving any inkling as to why that’s relevant to the definition of a random variable. Also, he brings up pr as an abbreviation of probability but when talking about the probability of a subset A of $\Omega$, he uses Pr. This is sure to confuse the reader. This kind of writing prejudices me a bit.

        • Shravan, I think all criticism of his writing style and his reliance on mathematical background information is valid. The book is clearly meant to be read by professional mathematicians who study probability theory. It’s unfortunate that I don’t know of a better more applied book. Maybe I just need to write it… Sigh.

          I had hoped that people could read through that to engage with the mathematical ideas, but if not I certainly don’t blame you. Nelson was brilliant and a mathematician, and he writes like one. Even though I have an undergrad degree in math I find his characterization of conditional probability as “relativization to a sub algebra” a bit impenetrable. I can work out what he means, but he’s clearly writing that stuff for professional algebraists to tie into ideas that would be familiar to them in other contexts.

          The real takeaway from the book is basically that everything you’d do in measure theory can be done by considering only finite probability spaces (sets that have some finite integer number of elements) provided that you let that integer size be nonstandard. And that the operations you do on these finite spaces are the usual operations you’d do on probability spaces like the outcome space of the Bernoulli variable with N trials (ie. 0,1,2,…N are the possibilities)

          I find that particularly freeing as it essentially means that the measure theory stuff is what Nelson calls “generalized nonsense”: it’s just machinery to ensure formal correctness in the absence of a better alternative.

        • Daniel, I think if I put my mind to it, I could get through the Nelson book. I will give it a shot.

          Incidentally, A Probability Path (the first three chapters) by Resnick looks a lot more accessible to me than Nelson for non-mathematicians like me who know enough math and stats to read texts like BDA3.

        • In classical probability, the “event space” Omega is some completely abstract thing… What are the elements of Omega? You can’t really talk about it as a concrete mathematical thing, it’s just some symbols that mean “shit happened” and then there is a mapping from that stuff to some numerical outcome…

          In Nelson, you’re free to think of the Omega space as just the positive integers 1…N the 67th possible outcome for example has probability p(67) and if F is a function from the integers 1..N then F(67) is the measurement you will get if the 67th possible event occurs.

          This is like you have a class full of 100 students each student has an ID from 1..100 and if the 67th student is the next one to enter the room, then maybe the random variable Name is just the list of the students names on the roster, you look up 67 and find Daniel.

          This is what’s meant by elementary/high school type math. The probability of getting either Daniel or Shravan as the next student through the door is just p(67)+p(42) if Shravan is the 42nd student on the roster.

          Of course we don’t have to use the integers 1..100; we could use for example the numbers 0.0, 0.01, 0.02, …, 0.99 as our omegas, and we could talk about the probability that x is greater than 0.42, which would be adding up p(0.43)+p(0.44)+…+p(0.99)

          How do we make this work for something like the probability that a randomly chosen point inside the square [0,1]x[0,1] is also inside the circle centered at (0.5,0.5) with radius 0.5? Nelson says create a fine grid of points and talk about the probability that you’re inside the circle as the probability that you’re getting one of the grid points inside the circle. If you don’t have nonstandard analysis you can do that just fine on a computer with say 1000×1000 grid points, but your probability will be a little off the correct number… You’ll get closer if you use 2000×2000; if you use N×N where N is nonstandard then your error will be so small that it’s literally infinitesimal, a number which standard mathematics can’t distinguish from 0.
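
          On a standard computer that looks like the following (the grid sizes match the example above, and the true answer is pi/4 for comparison):

          import numpy as np

          for n in (1000, 2000):
              u = (np.arange(n) + 0.5) / n                # n equally spaced cell centers in (0,1)
              gx, gy = np.meshgrid(u, u)                  # the n x n grid of points in the square
              inside = (gx - 0.5) ** 2 + (gy - 0.5) ** 2 <= 0.25
              print(n, inside.mean(), "error vs pi/4:", abs(inside.mean() - np.pi / 4))

          Refining the grid shrinks the error; the nonstandard version is the same calculation with the error pushed below every standard tolerance.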

          Forget about measurable sets and abstract omega spaces which map to outcomes that are points in the square… It’s just literal points in the square and there are a finite but truly enormously big number of them, and each one has some infinitesimal probability, and adding up all that probability gives you the answer.

          It may just not be possible to expect people who aren’t professional mathematicians to extract that info from Nelson’s book… Which I guess means we need an alternative book.

        • > Nelson’s book is a potential alternative to graduate measure theory

          Sure, in the sense of “mathematicians who know measure theory may find this alternative approach interesting”. But I would be surprised if anyone, let alone a non-mathematician, has ever tried to study probability on infinite sets using this instead of measure theory.

        • Well, Nelson himself spent a lot of his career trying to understand foundations of QM using these techniques, but I take your meaning to be that you doubt anyone other than Nelson has used this kind of stuff… For that I can refer you to a couple examples: Albeverio et al.

          https://www.amazon.com/Nonstandard-Stochastic-Analysis-Mathematical-Mathematics/dp/0486468992

          or Loeb and Wolff (eds)

          https://www.amazon.com/Nonstandard-Analysis-Working-Mathematician-Peter-ebook/dp/B014IT48BU/

          I believe that nonstandard approaches to path integrals have actually moved the ball a bit on that topic. Here’s an example paper, which I either haven’t read or read a long time ago, maybe.

          https://pubs.aip.org/aip/jmp/article-abstract/32/2/457/229266/A-nonstandard-representation-of-Feynman-s-path?redirectedFrom=fulltext

          But, 100% I agree it’s a niche thing. The question in my mind is: do you need to use probability to do some application? If so you might well benefit from this because *it lets you spend your time on the application*. Do you need to prove some mathematical theorems in probability as a professional mathematician? Then you will generally be pushing against the current as mathematicians overwhelmingly are following along with the analysis/measure theory/Kolmogorov approach. IMHO this is also why they are off in left field and why actually useful applications of probability don’t tend to come from Mathematicians.

          If you make your formal machinery complicated to use, then anyone who needs to apply it to some real world problem will struggle to do so. In general, the result is that those who make progress tend to be those who ignore the formalism. The Dirac delta function is a good example of that. Feynman path integrals also are a good example; originally they had no formal basis, if I understand correctly. Hell, all of calculus was informal from like 1660ish to the early 1800s, so around maybe 150 years (and was done as if infinitesimals and unlimited numbers existed, which they did, but no one had formalized them until Abraham Robinson in the 1960s).

        • > In classical probability, the “event space” Omega is some completely abstract thing…

          I should clarify this. In some simple cases, it can be a concrete thing like {H,T} for the coinflip or even “real numbers in [0,1]”. In other cases, like “a function of space which represents the temperature everywhere on a metal plate” the omega is a function of space, and the possible functions of space are uncountably infinite, you just think of “little omega” as a label for the function you happened to get. In Nelson’s finite nonstandard version, you can still 1-1 map this to some integers 1…N ie “you got the 1954819330567184….00101010928395102 th function” where there are a nonstandard number of digits in that integer ;-)

        • > I take your meaning to be that you doubt anyone other than Nelson has used this kind of stuff…

          What I doubt is that there is anyone who has studied – and understood – this kind of stuff and has not studied measure theory.

        • Carlos, this is entirely down to the fact that there are hardly any textbooks that could be useful for a calculus or biology student to learn from. And part of that is that historically there’s a small but extremely vocal anti-NSA group within the Mathematics profession. The wikipedia page on criticisms of nonstandard analysis is quite hilarious I think https://en.wikipedia.org/wiki/Criticism_of_nonstandard_analysis it has the same flavor as Frequentists ridiculing Bayesian stats.

          Jerome Keisler wrote a calculus text based on nonstandard analysis back in the ’70s. Abraham Robinson’s original invention was only in 1963. It is much more oriented towards the beginning undergrad, and Shravan may find it a better starting place to prepare to try to digest Nelson:

          https://people.math.wisc.edu/~hkeisler/calc.html

          He put that pdf up on his website back in the late 1990’s or so (again, only 35 ish years after the invention of the ideas).

          When it comes to stuff like teaching math, the world is quite conservative. I mean, we’re still writing textbooks for psych students on p values of t-tests and how to carry them out 60 years after Paul Meehl right?

          I don’t consider “whether this has been adopted widely” to be a particularly important question. There’s the question of “whether it’s true math” and that’s been established… so it’s a matter of can it be taught well, and can people who learn it better utilize math in modeling their science? Mathematicians post Bourbaki are the last people you want telling you what kind of math to use in science. Consider the “New Math” of the 1950’s.

          To me the question is, which is more intuitive to a person who wants to use math in mathematical modeling:

          1) to determine the probability associated with all the outcomes in a certain region, add up all the individual probabilities of those outcomes

          or

          2) to determine the probability associated with all the outcomes in a certain region, form the intersection of all the measurable sets in the Borel sigma algebra that contain that outcome which are entirely contained in the region, by the closure of the set of Borel sets under countable intersection this will form a Borel set and by applying the probability set-function to that Borel set you will get the probability of the outcome being in that region.

          Which of those would you like to explain to a 3rd year undergraduate Biology major?

        • This has been an interesting conversation to follow, my thanks to all involved. Daniel, I think you’re missing Carlos’ point, which I take to be that the books you’ve linked, regardless of how heterodox they are or how much they vary in their approach to probability, all require a certain level of what Mathematicians usually call ‘Mathematical Maturity’. Basically a comfort with proofs, common results, and the language/style Mathematicians use. The way one usually attains this is by taking courses in analysis that typically result in having some experience with measure theory. So sure, while technically all you need are concepts you learned in high school, without the maturity the books are fairly impenetrable (without a lot of work). It’s kind of like being told the salsa dance is easy and consists of only 3-4 moves, then being dropped into a high level competition.

        • Will you also explain to those 3rd year Biology undergraduates how to map that infinity of individual outcomes with non-standard natural numbers that cannot be found in the sequence 1,2,3,…? How are they supposed to come up with the infinitesimal values – the probability corresponding to each individual outcome – that you are telling them to add?

          What kind of mathematical modelling do you expect them to do anyway that requires measure theory or a substitute thereof? If Andrew can do without it, they surely can as well.

        • Blackthorn, no I’m not missing that; as you’ll see, I already mentioned several times that the book requires excessive mathematical maturity. I’d love for there to be a more intro book, and I absolutely wouldn’t assign Nelson’s book to a bunch of undergrads. I had hoped the audience here might be up for trying to extract something out of it. It looks like I’d have to write the alternative… Sigh

          Carlos, here’s an example. You would like to describe a prior over a function, because you know that a time series has certain properties. Thankfully MCMC doesn’t require a normalization constant. So let’s assign numbers to **entire paths** which are big when things are more plausible and small when they are less plausible. The way we will do this is ultimately that we calculate some functional integral. Suppose f(x) is a path; then the integral of g(f(x)) + h(f'(x)) + I(f''(x)) dx over the domain of x will be our weighting function. We will take the logarithm of that and add it to our log posterior distribution (pretty easy to do in Julia + Turing.jl).

          Is this legit? It needs to be at least normalizable even if we can’t compute the normalization factor. Let’s represent the functions as a Chebyshev series with 50 coefficients and each coefficient will have a pre-prior which is a normalized probability mass function on a nonstandard grid. Then we show that to calculate the normalization factor we just need to do the 50-dimensional finite sum over the nonstandard grids, calculate the integral for each grid point, and use that as our normalization factor…. This is in essence Feynman path integration. But we don’t need to do any weird discussion of measurable sets on the Hilbert space of functions etc… It’s straightforward sums. I’ve used that technique successfully many times, I convinced myself it’s legit by straightforward computation once, and haven’t looked back. If the functional is bounded then the 50-dimensional nonstandard sum is bounded and so the total integral is bounded and therefore there’s a number that normalizes the sum.
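
          A rough numpy version of that weighting step (in Python rather than Julia, with placeholder functionals and only 5 coefficients, just to show the shape of the calculation):

          import numpy as np
          from numpy.polynomial import chebyshev as C

          xs = np.linspace(-1.0, 1.0, 401)
          dx = xs[1] - xs[0]

          def log_path_weight(coeffs):
              f = C.chebval(xs, coeffs)                   # the path f(x)
              fp = C.chebval(xs, C.chebder(coeffs, 1))    # f'(x)
              fpp = C.chebval(xs, C.chebder(coeffs, 2))   # f''(x)
              # placeholder g, h, I: bounded functionals favoring small, smooth paths
              integrand = np.exp(-f**2) + np.exp(-fp**2) + np.exp(-0.1 * fpp**2)
              return np.log(np.sum(integrand) * dx)       # log of the functional integral

          print(log_path_weight(np.array([0.5, -0.2, 0.1, 0.0, 0.05])))

          In an actual model this log weight would just get added to the log posterior over the coefficients.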

          Or, most Biology students take calculus, but they don’t understand it, so if you want to explain to them how chemicals transported through the body could be modeled, you’d have to discuss some calculus that would be scary. But they’re ok with sums and products and ratios… So if we tell them it’s legit to sum up or take the ratios of infinitesimally small numbers, we can just talk about how that ODE would look by having them say “at the next time point the quantity would be q(t+dt) = q(t) + A*f(q)*dt” etc.

        • I’m surprised to hear that people say they never need measure theory. On a simple zero-inflated model, is the correct likelihood


          if (observation == 0) {
            return log(zero_probability);
          } else {
            return log(1 - zero_probability) + continuous_lpdf(observation);
          }

          Or

          if (observation == 0) {
            return log_sum_exp(
              log(zero_probability),
              log(1 - zero_probability) + continuous_lpdf(0.0)
            );
          } else {
            return log(1 - zero_probability) + continuous_lpdf(observation);
          }

          Where zero is within the support of the continuous distribution (i.e., the continuous density is strictly positive at zero)?

          I know the answer, but I can’t think of a way to disambiguate without basic measure theory or asking someone who knows measure theory. Of course, you can experiment, find which one gives biased estimates, and then hand wave…

        • Or maybe those infinite outcomes can be labeled as 1,2,… – the point is that the explanation “to determine the probability associated with all the outcomes in a certain region, add up all the individual probabilities of those outcomes” doesn’t explain much. If glossing over the set theory concepts and other mathematical abstractions involved is acceptable for option 1, maybe you could give the same summary for option 2.

          This discussion reminded me of the V. I. Arnold essay “On teaching mathematics”: https://www.math.fsu.edu/~wxm/Arnold.htm

        • > I’m surprised to hear that people say they never need measure theory. On a simple zero-inflated model, […] Where zero is within the strictly-positive support of the continuous distribution?

          You mean that your observation is exactly 0 but it could have been something like -1e-10 or 1e-100 instead, and that your model is for example a mixture of a standard normal and a degenerate zero distribution? I’d say that you don’t need measure theory to calculate the ratio of the probabilities for an observation in [-precision, precision], and it’s easy to see that as the precision goes up the relative likelihood of the normal goes down, vanishing in the infinite precision limit.

        • Carlos, we don’t teach 4th graders the Dedekind cut definition of the real numbers, or Cantor’s proof of the countability of the rationals, or the controversy over the axiom of choice or Peano arithmetic or all sorts of things foundational to 4th grade arithmetic. But we do teach them how to reason with numbers and if we are lucky they can calculate areas of a sheet of paper with holes punched in it. The tools of addition and multiplication and their inverses and decimal place-value numbers are usable to them. I don’t know how to make measure theory usable to a Biology undergrad. If they want to reason to a correct answer to the problem of describing a prior over functions by integration over the path, how do they do it? I don’t know, because I don’t know enough measure theory, and I had an undergrad degree in math and took a graduate level stochastic PDEs course (that I was almost entirely snowed by, but I learned the prerequisites ok). Feynman couldn’t work out whether path integrals were legit for like a couple decades without help from serious mathematicians. So I’m guessing it’s unreasonable to teach Biology students how to do probability on Hilbert spaces using measure theory. I do know that the nonstandard argument is straightforward and you come to the right conclusion.

          We have to distinguish between people needing to do formal proofs of theorems for Math Journals, and people needing to do semiformal reasoning to sensible solutions to applied problems. It’s not just about teaching them recipes, it’s about teaching them how to invent their own recipes. Any tool that dramatically expands the semiformal reasoning ability of scientists is in my opinion a must-have.

          Somebody… As far as I can tell either model could be correct, but they are different models. If your goal is to say there’s a point mass on 0 and also there’s a continuous pdf which has nonzero density at 0, then it’d be the second. If your goal is to say there’s a point mass at zero and there’s a pdf over the nonzero region which has no density at 0 then it’d be the first. I don’t think either is an invalid model, but perhaps there’s some measure theoretic reason why you can’t have a standard pdf on R-{0} ?

        • Carlos’ point that you can ignore the pdf component because the contribution of the point at 0 is infinitesimal is only valid in the limit of infinite precision measurements. Most real world measurements have a finite precision, like weight to the nearest tenth of a gram… Or whatever. The continuous model is only an approximation to the true discrete model of the digital scale… So it’s nontrivial to figure out what you want in real world data. Sometimes you even want to do the roundoff model explicitly

        • > they are different models. If your goal is to say there’s a point mass on 0 and also there’s a continuous pdf which has nonzero density at 0, then it’d be the second. If your goal is to say there’s a point mass at zero and there’s a pdf over the nonzero region which has no density at 0 then it’d be the first.

          In what sense are these different models?

          a) a mixture of zero and a uniform distribution defined on [-1,1]

          and

          b) a mixture of zero and a uniform distribution defined on [-1,0) ∪ (0,1]

        • Carlos, at first I didn’t see the point somebody was trying to make, and I’m being sloppy, and trying to get my dog breakfast etc…

          Real world data is always discrete. The question is whether the “continuous” component can throw off values that actually equal 0 when “rounded off” to the measurement precision or if you can ignore that component. So it’s not really [-1,1] - {0}, it’s [-1,1] - [-dx/2, dx/2], where dx is the measurement precision.

          It’s actually an interesting point that in continuous models we use pdf(data | model) rather than pdf(data | model)*d_data. This is usually because the d_data (the precision of the measuring instrument) is constant, and we’re really calculating an approximation to integral(pdf(data|model), data - d_data/2, data + d_data/2).

        • I guess we agree then that one doesn’t need measure theory, non-standard numbers or other exotic mathematical animals to handle discrete and continuous distributions – at least the sufficiently well-behaved non-pathological ones.

          Thanks for the example. I don’t doubt that you will be teaching the students useful techniques, but I don’t know if from their point of view being told “you can solve the problem like this doing this finite sum (because Nelson)” is different from being told “you can solve the problem doing this finite sum (because Kolmogorov)” or “you can solve the problem using this finite sum (because finite)”. Maybe they only hear “you can solve the problem like this”.

        • Let’s ignore finite measurement precision for the moment. Let’s suppose we’ve sent a bunch of people out to measure something that’s normally distributed. However, some fraction of their devices are broken and always return 0.

          Little fake data simulation

          https://pastebin.com/UK9ppqSS

          and stan model with a correct and incorrect likelihood

          https://pastebin.com/hvX19CPs

          The correct likelihood gives correct estimates, the incorrect one gives me nearly zero for the zero probability.

          It’s easiest to arrive at these by first defining the full model with a latent variable for “is_broken”, then projecting down while following the measure theoretic principle of preserving integrals. I’m not really sure how one could do that without being able to define the unified Lebesgue integral against a compound Lebesgue-counting measure.

          I’m not sure if I’m right about this (I’ve never taken a course in either statistics or measure theory), but as far as I can tell, the incorrect one really is incorrect, it’s not for a different model. I think if you define the likelihood that way, the integral for any fixed parameters (integration over the data space) exceeds one. Going piecewise

          Integral(density)d Lebesgue(x) over R – {0} = (1 – lambda)
          Integral(density)d Counting(x) over {0} = lambda + (1 – lambda) normal_density(mu, sigma)

          Total = 1 + (1 – lambda) normal_density(mu, sigma) > 1

          so you get a total probability of all observations that exceeds 1
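
          For anyone who doesn’t want to click through, here’s a self-contained sketch of that kind of check in Python (not the pastebin code; the 20% broken fraction and the standard-normal continuous part are assumptions, and mu and sigma are held at their true values so we can just profile over lambda):

          import numpy as np
          from scipy.stats import norm

          rng = np.random.default_rng(0)
          lam_true, n = 0.2, 2000
          broken = rng.random(n) < lam_true
          y = np.where(broken, 0.0, rng.normal(0.0, 1.0, n))

          def loglik_correct(lam):
              # zeros come only from the point mass; nonzeros only from the normal
              return (np.sum(y == 0) * np.log(lam)
                      + np.sum(np.log1p(-lam) + norm.logpdf(y[y != 0])))

          def loglik_incorrect(lam):
              # treats an exact zero as "point mass OR continuous density at 0"
              zero_term = np.logaddexp(np.log(lam), np.log1p(-lam) + norm.logpdf(0.0))
              return (np.sum(y == 0) * zero_term
                      + np.sum(np.log1p(-lam) + norm.logpdf(y[y != 0])))

          lams = np.linspace(0.001, 0.999, 999)
          print("correct:  ", lams[np.argmax([loglik_correct(l) for l in lams])])
          print("incorrect:", lams[np.argmax([loglik_incorrect(l) for l in lams])])

          Under these assumptions the grid search should put the correct version’s lambda near 0.2 and the incorrect version’s at the lower boundary, consistent with the estimates above.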

        • This model is of course farcical because you can easily just separate these out and analyze separately. A less farcical example would be a right/left censored model with covariates, but it complicates the discussion unnecessarily

        • > Let’s ignore finite measurement precision for the moment. Let’s suppose we’ve sent a bunch of people out to measure something that’s normally distributed. However, some fraction of their devices are broken and always return 0.

          Thankfully the broken devices are easy to identify: if a device returns zero it’s broken with 100% probability because the probability that a non-broken infinitely-precise device measures precisely zero is 0%.

          > Integral(density)d Counting(x) over {0} = lambda + (1 – lambda) normal_density(mu, sigma)

          I don’t know what kind of notation that is but the formula doesn’t make sense. It’s like calculating a mass adding kg and kg/m^3.

        • somebody: yes, the finite measurement precision is necessary for the two models question to be relevant. Initially I thought that was the question you were posing, but that was because I was not really paying close attention.

          Let’s take the finite measurement precision issue seriously but make the measurement precision be dx

          It’s a mixture of a discrete distribution (a point mass at zero, contributing pz*1) and a continuous one (contributing (1-pz) * normal_pdf()).

          the probability of getting zero is (approximately) pz + (1-pz) * normal_pdf(0)*dx

          The probability of getting some other nonzero measurement x is

          (1-pz)*normal_pdf(x)*dx

          suppose pz=0, and we have 100 data points, and dx=0.1 and we have one or two measurements that are 0.0.. then the likelihood would be a product like:

          (pz + (1-pz) * normal_pdf(0)*dx) * normal_pdf(1.2)*dx * normal_pdf(0.8)*dx…

          Now, if dx goes to zero… like you’ve got a precision microgram balance, then p(0.0) = pz + (1-pz) * normal_pdf(0)*dx ~= pz

          But if there’s finite data precision, you could also have a model where you’re saying the *only* way that you can get a 0.0 reading is that someone had a broken meter, because the real meters always read either 0.1 or -0.1 and never read 0.0 exactly, maybe because of the way they’re constructed… That’d be a different case.

          In the fine-precision measurement situation, then yeah, you just get p(0.0) = pz; the point mass dominates the continuous distribution.

          None of this really requires measure theory though. I guess you need measure theory to define a “delta distribution” on 0.0 if you want to consider the whole problem continuous and do proofs.

        • Carlos, thankfully the same is true if the infinitely precise measurement is *any given number*… they're *all* broken!

        • @Carlos

          It’s measure theory!

          I’ve definitely botched it a bit, but the whole point is a general theory of integration. In this case, you would define the compound measure

          mu(X) = LebesgueMeasure(X – {0}) + CountingMeasure({0})

          and compute the Lebesgue integral against that compound measure

          https://en.wikipedia.org/wiki/Lebesgue_integration#Via_simple_functions

          Anyways, yeah this particular model is farcical, but it’s not always so obvious, and it becomes very easy to accidentally write down the wrong likelihood nonetheless. And yeah, you can easily verify with fake data simulation, but aren’t you the least bit curious how to prove one of them right and one of them wrong?

          > Carlos, thankfully the same is true if the infinitely precise measurement is *any given number*… they're *all* broken!

          You may have missed the part where we were told that some fraction of their devices are broken and always return 0.

          Thankfully the same is not true for *any given number* because that would be *impossible*.

        • Carlos, it is indeed impossible for an infinite precision instrument to ever return to you any given number. With probability 1 it will return an irrational number, and those have an infinite number of digits; it would never terminate printing.

        • Consider this classic problem:

          You want to regress covariates X against life length Y. However, some of the individuals in your dataset are still alive! A “standard” way of analyzing this is to define some time cutoff. However, that’s throwing away data; if some of the people are alive and extremely old, they’re providing real information about their covariate group.

          Informally,


          log_life_length = X * coefficients + intercept
          for (i in 1:num_people) {
            if (is_alive[i]) {
              // still alive: we only know the life length exceeds the age observed so far
              target_log_prob += log(1.0 - dist_cdf(obs_life_length[i] | log_life_length[i], residual_sigma))
            } else {
              // observed death: ordinary density contribution
              target_log_prob += dist_log_pdf(obs_life_length[i] | log_life_length[i], residual_sigma)
            }
          }

          But why can you do this? You’re adding log_densities to log_probabilities! But it’s here

          https://mc-stan.org/docs/stan-users-guide/censored-data.html

          and I’m pretty sure (again, no expert) it’s correct measure theoretically

        • >But why can you do this? You’re adding log_densities to log_probabilities! But it’s here

          It’s actually wrong! but it doesn’t matter because MCMC!

          the logarithm makes it more complicated to describe the issue so let’s talk about writing a likelihood not a log likelihood

          p(deathtime1 , deathtime2, still_alive_age | model) = {p(deathtime1 | model) d_data} * {p(deathtime2 | model) d_data} * p(deathtime > still_alive_age | model)

          Note however that d_data is a constant that is determined by the way you measure time, like maybe to the nearest day or maybe to the nearest month… and also note that we’re relying on the approximation p(x)dx ~ integral(p(x),x-dx/2,x+dx/2) for dx small. So anyway we just pull it out in front:

          d_data^2 p(deathtime1|model) p(deathtime2|model) * p(deathtime > still_alive_age|model)

          and then MCMC doesn’t care about constants, so we just ignore it.

          also note that d_data is something about the “measurement precision” of the instrument measuring the time, and if it gets large enough, then the approximation integral(p(x),x-dx/2,x+dx/2) is no longer good and so we need to use the cdf for all the measurements because each one is rounded off in a coarse way (a roundoff error model)
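
          A toy Julia version of that cancellation (made-up numbers, and a single-parameter death-time model rather than the regression):

          using Distributions

          deaths = [62.3, 81.0]        # observed death times (made up)
          alive_age = 90.0             # one person still alive at 90

          lik(mu, d_data) = prod(pdf(Normal(mu, 10.0), t) * d_data for t in deaths) *
                            ccdf(Normal(mu, 10.0), alive_age)

          # The d_data^2 factor is the same at every parameter value, so it cancels in the
          # ratios MCMC actually uses:
          lik(70.0, 1/30) / lik(65.0, 1/30) ≈ lik(70.0, 1/365) / lik(65.0, 1/365)   # true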

        • >:(

          Actually, that's pretty cool. I guess with the NSA approach you get a straightforward interpretation of measurement granularities becoming infinitesimal, while people who are still alive can just be interpreted as a really imprecisely measured time of death somewhere in (now, infinity). My mathy friends don't like NSA, but maybe I'll check it out.

        • Somebody: I don't think the story actually changes between the NSA and measure theoretic versions. In either one you construct the likelihood for the continuous part as the density multiplied by a small width, or alternatively as the measure (the probability) of that small set, which doesn't have to be represented as an integral. But in the measure theory version, in either case, you need a probability, not a density. For the usual case we're relying on that approximation of the probability as the density times a small width and then factoring out the small width since it's a constant. This is not emphasized enough in many textbooks, possibly even BDA3.

        • > In this case, you would define the compound measure
          > mu(X) = LebesgueMeasure(X – {0}) + CountingMeasure({0})

          For what it’s worth, you could also define it as mu(X) = LebesgueMeasure(X) + CountingMeasure({0}) and the measure doesn’t change.

          As far as I can see, the main issue here is that Stan cannot handle this kind of compound measure. There are discrete distributions (defined on natural numbers only, not on arbitrary points) and there are continuous distributions but if you try to mix them “things get a little wonky”.

          https://betanalpha.github.io/assets/case_studies/mixture_models.html

          “The implementation challenge is that we can’t just plug this into Stan because when we evaluate it on a non-zero observation the Dirac probability mass function will introduce an infinity,
          log δ_0(y = 1) = log(0) = −∞. This ill-defined intermediate expression would then propagate through the code resulting in an ill-defined output.”

          (I’m not sure why trying to specify zero probability for the data in one component of a mixture should break everything – instead of just suppressing that particular component for that particular data point – but that seems to be the state of things.)

          Anyway, in your example you could use something like this and it works as expected (the size of the “point” interval doesn’t matter much, one can widen or narrow it by orders of magnitude and the output remains stable):

          out += log_mix(zero_prob, uniform_lpdf(dpoint | -0.0001, 0.0001), normal_lpdf(dpoint | mu, sigma));
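
          (A rough check, in Julia rather than Stan, of why the interval width washes out: for a tiny interval, the log term for a zero observation shifts by a constant that doesn't depend on zero_prob, mu, or sigma, so the posterior is insensitive to the width. Made-up values; lp_point is just an illustrative helper.)

          using Distributions

          mu, sigma = 1.0, 2.0                   # made-up values
          lp_point(eps, zp) = log(zp * pdf(Uniform(-eps, eps), 0.0) +
                                  (1 - zp) * pdf(Normal(mu, sigma), 0.0))

          lp_point(1e-4, 0.3) - lp_point(1e-5, 0.3)   # ≈ log(10)
          lp_point(1e-4, 0.6) - lp_point(1e-5, 0.6)   # ≈ log(10), same shift regardless of zp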

        • Which is precisely how you would define the Dirac measure as a non-standard density in non-standard analysis you would choose some function like the normal distribution and give it an infinitesimal width. And of course it turns out that there’s no unique non-standard density you could just as soon have a uniform distribution with an infinitesimal width for example. Furthermore if you’re working with non-standard densities then the integral has to be defined in a way that respects the scale of the non-standard features

        • @Carlos

          No, that's not quite right; the probability measure you specify integrates to more than 1. You're looking at the wrong section: that's taken from the section on zero-inflating a discrete distribution. He gives the justification:

          moreover for any baseline model specified by a continuous probability density function the probability assigned to any single point vanishes,
          P_{\pi_B}[y] = \lim_{\epsilon \to 0} \int_{y-\epsilon}^{y+\epsilon} \pi_B(y' \mid \theta) \, dy' = 0.
          In other words the baseline model contributes negligibly to the inflated value, and any time we observe it we know that only the inflation model was active.

          But that doesn’t satisfy me because the baseline model probability vanishes at all points.

          @Daniel Lakeland

          But in the measure theory version in either case you need a probability not a density. And for the usual case we’re relying on that approximation of the probability as the density times a small width and then factoring out the small width since it’s a constant. This is not emphasized enough in many textbooks possibly even BDA 3

          This is about to get extremely semantical; measure theoretically, probabilities are also densities under the counting measure. The nice thing about the measure theoretic integral is that you have a general “density” which is basically defined as something you integrate. So a PMF is a density, and a PDF is a density, and everything in between. Further, a dirac delta is also a density, though it is neither a PDF nor a PMF, since it is not a function but it does integrate.

          In the standard picture, in the case of analytic integrals, you’re not relying on the density times a small width. The integral converges to a limit as the widths get smaller, but the infinitesimal widths are not mathematical objects and do not have their own algebra; hence there are no such constants in the density. But yeah, in computational practice there’s always a real little width, so I can be convinced that the non standard picture is more amenable to applied math.

        • Somebody: sure everything is a density, but you still can’t add things with different dimensionality (probability vs probability per length for example).

          In the measure theoretic version, instead of p(x)dx you'd write mu([x-dx, x+dx]), or at least mu({x : x reads x_i on the measurement instrument}). Having defined a density with respect to Lebesgue measure, you could write mu({x : …}) = Integrate(p(x) dx, x_i - epsilon, x_i + epsilon), which is just another way of writing mu({x : …}). All of which is to say that eventually you're going to take your density (per unit length) and multiply it by a length…

          When we start talking about integrate(p(x)dx) with p(x) a Dirac measure we are actually just doing formal manipulation of symbols. We define the symbol integrate(p(x)dx,a,b) = dirac_measure([a,b]) = 1 if 0 in [a,b] and 0 otherwise. It’s a symbolic transformation, not any meaningful integration.

          In applications such as Stan we "should" use a cdf for all measurements and take the difference between nearby values, cdf(x+dx) - cdf(x-dx), but when the measurement instrument is precise and dx is small, that's numerically unstable compared to just writing p(x)dx; and since all the dx are the same we just factor that out as well.

          Measurement instruments always read the same value for some *set* of underlying outcomes. They are always finite precision, otherwise as mentioned they could never terminate printing out decimal values and therefore could never read any particular value, just as required by measure theory (probability zero that you would get any particular value).

          Not only are they always finite precision, but the most precise measurement devices we have are A/D converters with something like 32 or 36 bits so they don’t even read to more than around 10 or 12 decimal places.

          Dimensional analysis doesn't go away with measure theory, it just gets swept under the rug by discussions in books. That concept isn't strictly NSA related, it's just that NSA maps particularly well onto the reality of applications, whereas measure theory is a lot of formalized obfuscation when it comes to applications. Mathematicians these days are usually concerned entirely with "what can we prove about mathematical objects?" and not about "how can we effectively model scientific questions?" This separation occurred through time between about 1880 and 1950. Jaynes was pushing back against that in Carlos' quoted material.

          The problem with measure theory for applications is that it solves problems about “how to prove results” and not “how to describe science”.

          NSA lets you align proving results with the reality of science a bit more. That most mathematicians don't like it is mainly because they don't do science. Edward Nelson **did** do science; he spent a lot of his career wondering about quantum mechanics.

        • > No, that’s not quite right, the probability measure you specify integrates to higher than 1.

          If you mean (I’m not sure) that you don’t agree that

          LebesgueMeasure(X – {0}) + CountingMeasure({0}) = LebesgueMeasure(X) + CountingMeasure({0})

          I understood that what you meant with that equation was the addition of two different measures defined on the reals with the first term corresponding to a continuous distribution and the second term corresponding to a discrete distribution.

          I assume that X stands for the reals (if it's an interval the second term should be an intersection). As far as I can see CountingMeasure({0}) is equal to 1. So if the first term is positive, what you wrote also integrates to more than one – and if there is some implicit normalisation it works just the same with what I wrote because everything has the same value.

          The important point is that the continuous distribution measure LebesgueMeasure(X – {0}) = LebesgueMeasure(X), because LebesgueMeasure({0}) = 0, and I don't understand what your objection is because, as you say later, "the baseline model probability vanishes at all points".

          > You’re looking at the wrong section, that’s taken from the the section on zero inflating a discrete distribution.

          You’re right. Even working with discrete distributions alone is apparently tricky.

          > But that doesn’t satisfy me because the baseline model probability vanishes at all points.

          If you’re referring now to your original question “I’m surprised to hear that people say they never need measure theory. On a simple zero-inflated model, is the correct likelihood [stan code] or [stan code]…” I don’t think that measure theory can answer the question. It’s mostly a question about the Stan language, the way mathematical concepts are represented and computed in it and how you can work around its limitations.

        • > He gives the justification […] But that doesn’t satisfy me because the baseline model probability vanishes at all points.

          Is the following reasoning more satisfactory? (Not sure if it’s correct though.)

          Stan computations work with discrete distributions defined on integers when you use data of type integer.

          Stan computations work with continuous distributions defined on reals when you use data of type (kinda) real.

          You can handle a mixture of continuous and discrete distributions by working separately on integer data (using the discrete distribution) and on non-integer data (using the continuous distribution).

          Mathematically that’s fine because the continuous distribution produces integer data with probability 0 and the discrete distribution produces integer data with probability 1.

          In reality it's not so fine because the integer data is not really integer and the real data is not really real; in this case both are floats. We can still identify the floats that we want to consider integers, and we still have lots of "non-integer" floats compared to "integer" floats.

          The method of dispatching floats to either the continuous treatment or the integer treatment is an approximation but seems to work well enough in practice. (I don't know if it's better or worse than the alternative of approximating discrete distributions with narrow continuous distributions so everything can be added together.)

          Note that theoretically one could also have written

          if (observation == round(observation)) {
            return log(discrete_probability) + log(discrete_prob(observation));
          } else {
            return log(1 - discrete_probability) + continuous_lpdf(observation);
          }

          where in this case discrete_prob(0)=1 and discrete_prob(n)=0 otherwise.

          All the integers deserve the "discrete" treatment. They "don't happen" otherwise, so there is no loss in theory from not including them in the "continuous" treatment.

          However, that’s a bad idea in practice for several reasons (including the log(0) divergence).

        • @Carlos

          Sorry, I've completely bungled the discussion. Yes, you don't need to excise the 0 point from the Lebesgue measure because {0} is of Lebesgue measure 0. I meant that the discrete density you linked to,

          (1 – lambda) base_density(x) + lambda Indicator(x; x = 0)

          if you tried to use it straightforwardly as a continuous one, integrates to more than 1. That kind of mistake, in naively continuizing a problem, is what I think rigor helps to avoid.

          The trouble is that there's no such thing as a density or an integral on a compound space like that until you introduce measure theory. Hence, there are no posterior densities, and methods that use posterior densities and their gradients fall apart. It's true that you don't necessarily need densities to do probability; they're more of a computational device, so yeah, defining the density is kind of a way of working around the limitations of Stan and other similar methods.

          That said, it's very likely that one can come up with a correct posterior density that works with gradient based methods without properly defining everything and without measure theory. Practitioners do this sort of thing all the time, in the same way that thermodynamic equations came up before a theory of equilibrium statistical mechanics and Newton used calculus before analysis. The trouble is that it's easy for someone else to come up with a convincing heuristic argument for something wrong that doesn't work, and I have a tough time disambiguating between those without being extremely careful.

          @Daniel

          I’m not really sure what you’re getting at. Since we’re working with the posterior density, the integrand is all probability / (parameter length).

          I agree that in practice all measurements are discretized. However, if we're getting very semantic, I don't think CDFs would be an appropriate way to capture that, since floating point numbers are very far from uniformly distributed over their range. An appropriate discretized model for "nearest floating point number" is far too complicated for me to care about, though.

        • Somebody: Let's use a concrete example, since I'm talking about real world measurement issues. Take a kitchen scale, put it in gram mode, put a small bowl or weighing boat on it, tare the scale, and then scoop up a large spoonful of rice and put it in the boat/bowl.

          If you have a scale like mine it will read something like … I’m about to carry it out…

          22.4g

          Ok, we're going to model the repetition of this experiment with a Gamma distribution. Let's call it Gamma(25, 21.0/24) in Julia notation, and the density at 22.4 is pdf(Gamma(25,21.0/24),22.4) = 0.08812653761967656.

          What's the probability of getting 22.4g? It's not pdf(Gamma(25,21/24),22.4), for two reasons. First off, that isn't a probability, it's a probability per gram. Second, if we put anything between 22.35 and 22.45 grams of rice onto the scale we would get the same 22.4 as the readout, so the probability of getting 22.4g as the readout is actually integrate(x -> pdf(Gamma(25,21/24), x), 22.35, 22.45), which we can calculate several ways:

          julia> cdf(Gamma(25,21/24),22.45) – cdf(Gamma(25,21/24),22.35)
          0.00881249686317731

          Alternatively we could calculate it as

          julia> pdf(Gamma(25,21/24),22.4) * 0.1   # density times the 0.1 g readout width
          0.008812653761967658

          both calculations give the same answer to the first 4 significant figures. The CDF calculation may lose accuracy because it’s the difference of two numbers which are close together… let’s try directly integrating:

          julia> quadgk(x -> pdf(Gamma(25,21/24),x),22.35,22.45)
          (0.008812496863177447, 2.2551405187698492e-17)

          The first number is the estimate, the second number is the estimated numerical integration error

          So the CDF calculation is more accurate, correct to about 13 significant figures; the pdf calculation is accurate to only about 4 significant figures.

          Now, suppose my scale reads to the nearest hundredth of a gram:

          julia> cdf(Gamma(25,21/24),22.405) – cdf(Gamma(25,21/24),22.395)
          0.0008812652192964343

          julia> pdf(Gamma(25,21/24),22.4) * 0.01
          0.0008812653761967657

          Now the two methods agree to 6 sig figs.

          When we write computer programs we typically assume that “continuous” measures like the number of grams have “high” measurement precision, but they often don’t. For example some kitchen scales only read to the nearest 1g, or human scales might weigh to the nearest 0.1lb if they’re digital but more like 2.5lb if they’re an analog scale. Digital multimeters might read volts to nearest 0.05 volts… household income might only be recorded to the nearest $100/yr. Often the “roundoff” issues don’t need to be explicitly modeled, and so we ignore them. We can say something like:

          pdf(Gamma(25,21/24),22.4) * 0.1

          is the probability associated with getting 22.4 g of rice as the readout. Since the 0.1 is multiplied into *every* data point, it’s just a constant, and it’s ignorable because after we normalize the likelihood times the prior it will just appear in both the numerator and the denominator.

          But from the strictly formal perspective, the 0.1g multiplying the pdf is needed, to get dimensional homogeneity, and to understand what actually causes a reading like 22.4 g to come about.

          Incidentally, I am going to go try removing a few grains of rice and adding a few grains of rice, and see how much wiggle room there is before the readout changes…

          It’s somewhere around 3-4 grains of rice before the scale changes reading.

      • @Shravan: The reason to use Lebesgue measure theory (what you do in undergraduate real analysis) is to understand continuous probability and what can logically be an event. This is very foundational in the mathematical sense. You can do applied stats without understanding this stuff the same way as you can be a programmer without understanding computability (e.g., the halting problem) or complexity theory (e.g., NP-completeness).

        The breaking point for me was trying to understand BDA. It used the term “random variable” without ever defining it. Expectation notation in BDA is just unhinged. When you start talking about MCMC, you have to juggle two measures (transitions for Markov chain and stationary target), which is really confusing to someone like me without a foundation in what all the terms mean. I’m curious about how you think about terms like “random variable” without knowing what a sample space and measure is.

        The real utility comes in when you want to prove something. For instance, a very fundamental question is: do continuous probability spaces exist, and after that, what kinds of sets can be events? Or if you want to define something general like the Metropolis algorithm with discrete, continuous, or mixed proposals.

        Take Modi's, Barnett's, and my delayed-rejection HMC paper. HMC composes a Gibbs step that updates the momentum with a deterministic Metropolis step that uses the discretized Hamiltonian to make a proposal. Stating the Metropolis algorithm in a way that applies to continuous, discrete, or mixed densities is typically done measure theoretically. That's why people write P(x, dx) for the transition function in MCMC—it generalizes discrete and continuous and mixed distributions. If you're just dealing with a Lebesgue measure in continuous space, you can do everything in terms of densities. If you're working discretely, you can do everything with counting. So you can certainly get a long way without it.
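
        As an illustration (a sketch, not anything from the paper), here is Metropolis–Hastings in Julia on the zero-inflated normal discussed above, with a mixed proposal (an atom at 0 plus a random walk); the acceptance ratio is just a ratio of densities taken with respect to the compound Lebesgue-plus-point-mass measure, and all parameter values below are made up:

        using Distributions, Random, Statistics

        lambda, mu, sigma = 0.3, 1.0, 2.0                 # made-up target parameters
        # target density wrt eta = Lebesgue + delta_0
        target(x) = x == 0.0 ? lambda : (1 - lambda) * pdf(Normal(mu, sigma), x)

        q0, rw_sd = 0.2, 1.0                              # mixed proposal: atom at 0 or random walk
        propose(x) = rand() < q0 ? 0.0 : rand(Normal(x, rw_sd))
        # proposal density wrt the same eta
        qdens(y, x) = y == 0.0 ? q0 : (1 - q0) * pdf(Normal(x, rw_sd), y)

        function mh(n)
            x, chain = 0.0, Float64[]
            for _ in 1:n
                y = propose(x)
                accept_prob = target(y) * qdens(x, y) / (target(x) * qdens(y, x))
                if rand() < accept_prob
                    x = y
                end
                push!(chain, x)
            end
            chain
        end

        Random.seed!(1)
        draws = mh(200_000)
        mean(draws .== 0.0)   # ≈ lambda
        mean(draws)           # ≈ (1 - lambda) * mu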

        Andrew will be amused at the reviewers' response to our delayed-rejection HMC submission to Bayesian Analysis—they said our use of delta functions and applied math notation would ruin a generation of grad students, and they basically forced us to rewrite it in measure theoretic notation.

    • “In discussing this issue, Feller (1966) notes that specialists in various applications sometimes ‘deny the need for measure theory because they are unacquainted with problems of other types and with situations where vague reasoning did lead to wrong results’. If Feller knew of any case where such a thing has happened, this would surely have been the place to cite it – yet he does not. Therefore we remain, just as he says, unacquainted with instances where wrong results could be attributed to failure to use measure theory. […….] Feller does acknowledge, albeit grudgingly, the validity of our position. While he considers passage to a well-defined limit from a finite set unnecessary, he concedes that it is ‘logically impeccable’ and has ‘the merit of a good exercise for beginners’. That is enough for us; for in this field we are all beginners. Perhaps the beginners who have the most to learn are those who now decline to practice this very instructive exercise.”

      Jaynes, Probability Theory, Appendix B Mathematical formalities and style

      • Is he arguing for or against measure theory? I find it really hard to follow Jaynes’ writing in general.

        PS Somewhere else in his book he complains about people nit-picking about mathematical notation, which resonated with me.

        • This is the conclusion of the section on measure theory: “Of course, we stand ready and willing to use set and measure theory – just as we stand ready and willing to use number theory, projective geometry, group theory, topology, or any other part of mathematics – wherever this should prove helpful for the technique of finding a result or for understanding it. But we see no reason why we must state every proposition in set/measure theory terminology and notation in cases where plain English is clearer and, as far as we can see, not only more efficient for our purposes but actually safer. Indeed, an insistence that all of mathematics be stated in that language all of the time can place unnecessary burdens on a theory, particularly one intended for application in the real world. It can also degenerate into an affectation, used only linguistically rather than functionally. To give every old, familiar notion a new, impressive name and symbol unknown to Gauss and Cauchy has nothing to do with rigor. It is, more often than not, a form of gamesmanship whose real purpose is to conceal the Mickey Mouse triviality of what is being done. One would blush to state it in plain English.”

          And he closes the appendix with this Emancipation Proclamation:

          “Every variable x that we introduce is understood to have some set X of possible values. Every function f (x) that we introduce is understood to be sufficiently well-behaved so that what we do with it makes sense. We undertake to make every proof general enough to cover the application we make of it. It is an assigned homework problem for the reader who is interested in the question to find the most general conditions under which the result would hold.”

        • It reads to me like he is arguing against the necessity of using measure theory. He’s essentially saying that Feller claims that failure to use measure theory has led to wrong results, but does not provide any actual examples of where that has happened.

  5. [I’m tired of trying to find the right place to reply to stay in-thread so I move it here]

    > I meant that the discrete density you linked to,
    > (1 – lambda) base_density(x) + lambda Indicator(x; x = 0)
    > if you tried to use it straightforwardly as a continuous one, integrates to more than 1. That kind of mistake, in naively continuizing a problem, is what I think rigor helps to avoid.

    I don't know where that equation comes from – or what a "discrete density" is – so I won't comment on that.

    > The trouble is that there’s no such thing as a density or an integral on a compound space like that until you introduce measure theory.

    I'm not sure if you mean that there's no such thing as a density until you introduce measure theory, or that there's no such thing as a density on a compound space like that until you introduce measure theory.

    If it’s the former, obviously you can have densities before introducing measure theory – and in fact as long as you only care about densities I don’t think you need to introduce measure theory.

    If it's about compound spaces like that, measure theory won't produce a density where none exists, as in your example. If you need to approximate that probability mass with a probability density for your methods to work, it's not just that you don't need measure theory: measure theory can't help. (Measure theory can be useful – but not necessarily a requirement – if you have a discrete distribution which is not defined on n = 1, 2, 3, … but on a complicated set of points that gets concentrated in some regions. How often do people encounter that kind of problem anyway?)

    • > If it's about compound spaces like that, measure theory won't produce a density where none exists, as in your example.

      I’m pretty sure this is wrong. In measure theory, the density is just a thing that you can Lebesgue integrate against an appropriate measure to get probabilities. So sums are integrals against the counting measure and PMFs are densities, and you can tweak your measure to get everything in between. So the function with the point mass at zero IS a density against the modified compound measure.

      This way, you can talk about expectations and recover a law of large numbers and all the good stuff we like, in cases where, without measure theory, using just the analysis 101 notion of Riemann integration, no density exists.

      • You’re talking about the Radon Nikodym derivative. In this case you get a Dirac distribution as the density of the discrete probability component. I don’t know enough measure theory to establish in what sense the integral sign really means any sort of integration, or whether in the context of the Dirac measure you are just “punning” the same symbol to refer to two actually entirely different operations. Math has a tendency to reuse symbols to mean multiple things. For example 1 can mean the Peano set for the integer 1 or it can mean the Dedekind cut for the real number 1, those are entirely different sets under the hood.

        One of the things I like about NSA is that while the nonstandard numbers themselves are non-constructive, stuff like a Dirac density is directly constructable once you have a nonstandard integer, so you can be explicit about what you’re doing.

      • > I’m pretty sure this is wrong.

        Why?

        > In measure theory, the density is just a thing that you can Lebesgue integrate against an appropriate measure to get probabilities.

        A thing that has some specific definition and that sometimes exists and sometimes doesn't.

        > So the function with the point mass at zero IS a density against the modified compound measure.

        Is it?

        This Stack Exchange question (I don’t include the link but it’s easy to find) says that there is no density in your example: “What is an example of a measure without a density that is useful for some application?”

        “Consider a random variable 𝑋 s.t. it takes either the value 0 with probability 𝑝 or a continuously uniformly distributed value in [−1,0)∪(0,1] with probability 1−𝑝. This can model any IID measurement process in which we sometimes witness some uniform measurement error above or below 0. Thus the ‘density’ is […] and the distribution function is […]
        But you can notice that 𝑓 is no proper Lebesgue density (it has a Dirac delta) nor it can be a density wrt the counting measure.”

        Wikipedia says that writing down a Lebesgue integral against that kind of measure is an abuse of notation: https://en.wikipedia.org/wiki/Dirac_delta_function#As_a_measure

        “One way to rigorously capture the notion of the Dirac delta function is to define a measure, called Dirac measure, which accepts a subset A of the real line R as an argument, and returns δ(A) = 1 if 0 ∈ A, and δ(A) = 0 otherwise. If the delta function is conceptualized as modeling an idealized point mass at 0, then δ(A) represents the mass contained in the set A. One may then define the integral against δ as the integral of a function against this mass distribution. Formally, the Lebesgue integral provides the necessary analytic device. […] The measure δ is not absolutely continuous with respect to the Lebesgue measure—in fact, it is a singular measure. Consequently, the delta measure has no Radon–Nikodym derivative (with respect to Lebesgue measure) […] As a result, the latter notation is a convenient abuse of notation, and not a standard (Riemann or Lebesgue) integral.”

        • > Wikipedia says that writing down a Lebesgue integral against that kind of measure is an abuse of notation: https://en.wikipedia.org/wiki/Dirac_delta_function#As_a_measure

          In the dirac delta case, the “density” is 1 against the dirac delta measure. There is a density there. The abuse of notation they mention is putting the delta “function” itself as a function inside the integral against the standard Lebesgue measure. 1 is the function, and the dirac delta is the measure.

          Properly stated it is:

          \int_{\mathbb{R}} 1 \, d\delta(x)

          Hence, the constant function 1 is the density.

          The stackexchange link does not say that no density exists in general; it's more precisely about where there is not a density with respect to either counting measure or Lebesgue measure.

          That's the key: given the definition of the Lebesgue integral, you recover the density by modifying the measure. The density is only defined up to an arbitrary choice of measure. The important thing for computational purposes is that it projects back down to an integral against the Lebesgue measure when conditioning on the data to get a posterior density.

        • Perhaps the confusion here is that the Lebesgue integral is not necessarily against the Lebesgue measure. The Lebesgue integral is defined as a limit of integrals of simple functions against some measure. Classic integration is against the Lebesgue measure. However, "not an integral against the Lebesgue measure" =/= "not a valid Lebesgue integral".

        • > In the dirac delta case, the “density” is 1 against the dirac delta measure. There is a density there.

          If I understand correctly you’re saying that for any measure mu the “density” is 1 against mu.

          Maybe you could give a reference to the definition of “density” that you are using. Otherwise it’s not clear if there is a density there in the usual sense of the word.

        • No, the density is the integrand. For a dirac delta integral,

          \int_{\mathbb{R}} f(x) \, d\delta(x) = f(0)

          The density is f(x). The abuse of notation is to write

          \int_{\mathbb{R}} f(x) \delta(x) dx = f(0)

          which makes it look like \delta is a function that’s part of the integral against the “standard” Lebesgue measure.

          Starting over,

          In measure theoretic probability, the probability is defined as the lebesgue integral of 1 over the probability measure.

          In this sense, 1 can be said to be the “density”.

          You usually want to get back the Lebesgue measure so you can use familiar calculus, or the counting measure to use sums. The Radon Nikodym derivative between the target measure and the probability measure, which can be thought of as a kind of change of variables, will multiply a term into the integrand f(x) which is our new density

          \int_{\mathbb{R}} 1 \, d\mu(x) \;\to\; \int_{\mathbb{R}} f(x) \, dx

          and you can do this transformation when the target measure is absolutely continuous wrt to the probability measure. Hence

          * Densities are defined with respect to some measure mu as the integrand such that integrals against mu give you probabilities
          * Densities are only unique up to choice of measure
          * Density is always defined at least in the trivial way as the constant function 1 if there is a probability measure
          * Importantly for our purposes, we can define a density on compound spaces that projects down to an integral against the Lebesgue measure when we condition on the data, allowing us to use traditional calculus to integrate parameter space

          The trouble is that without a notion of densities, stuff like the expectation value is actually not defined. Seriously, in the case like

          z = Indicator coin flip
          x = z * normal(mu, sigma)

          What is the definition of the expectation value of x? Now, of course lots of people will quickly say

          0.5 * 0 + 0.5 * mu

          without any notion of measure theory. And that’s fine, but until the expected value is actually defined, I can’t actually show why a wrong answer is wrong without resorting to brute simulation and law of large numbers.
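
          (The brute-simulation check, for reference; a sketch with made-up mu and sigma:)

          using Distributions, Random, Statistics

          Random.seed!(1)
          mu, sigma = 2.0, 1.0
          x = [rand(Bernoulli(0.5)) * rand(Normal(mu, sigma)) for _ in 1:1_000_000]
          mean(x)   # ≈ 0.5 * mu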

          @Daniel

          They really are the same operation.

          integral of f against a measure mu over a space X

          Partition the space X into subsets x1, x2, x3…

          Take a constant ci <= f(x) for every x in xi

          Take sum(ci * mu(xi))

          The supremum of that sum as the partition gets finer and finer is the integral

          No funny business

        • > And that's fine, but until the expected value is actually defined, I can't actually show why a wrong answer is wrong without resorting to brute simulation and law of large numbers.

          And what if it were a Cauchy distribution? We could sit here arguing for days without a resolution.

        • Somebody:

          Yes, the operation integrate(1 dmu(x)) is defined for dmu(x) equal to the "Dirac delta measure", because measures take sets of reals and return numbers. But there is no density wrt Lebesgue measure, no Integrate(delta(x) dx), because there is no function delta(x) that takes a real and returns a real; however, this notation is often used. The idea that you can use 1 as the density and then shove all the "calculation" into the dmu is just the kind of thing mathematicians like, because it allows you to use proof machinery you already invented for a more complicated case. But it's not insightful in any way.

          The key aspect of measure theory is that it works with objects that take sets of numbers and return numbers… Mu([1,2]) = 1 for Lebesgue measure but maybe equals 0.1 for some probability measure.

          Sometimes you can induce a measure on sets by defining the measure as the value of a Lebesgue integral, integrate(normal_pdf(x) dx) for example; other times the measure needs to be defined otherwise, as a rule that assigns measures to sets directly: delta(X) = 1 if X is a set that contains 0, otherwise the measure is 0.

          And then there's the problem of not being able to use arbitrary sets; they need to be measurable sets. You can define the measurable sets in more than one way.

          All of this gets us a theory that generalizes in ways that mathematicians like a lot. You can for example define measurable sets on manifolds like the surface of a torus. If the definition of the sets is legit then proofs go through without modification! Yay, a mathematician just saved a month of work proving a theorem! Good for them. Now, try to carry out a numerical calculation of the total energy stored in the insulating wrapper around a toroidal mechanical part on a spacecraft exposed to solar radiation leaking through a pinhole in a reflective shield…

          Engineer: we need to know the energy in this insulation caused by a very bright source coming through a pinhole in the shield and shining on this point on the part…
          Mathematician: here’s a proof that the value exists!
          Engineer:🙄

          In the end, applied people want numerical approximations almost all of the time. Measure theory doesn’t add anything there because it’s about what happens when you take the approximation to a limit… You get an object that changes to a different kind of object! Isn’t that incredibly interesting as a mathematician!? You move from normal(0,epsilon) as a function of space which is a very peaked normal density, to a measure of sets, which is a rule that assigns 1 to some sets and 0 to others…

          But in order to do the engineering calculation, you need to undo that transformation anyway and go back to the heat source that has width 200 microns like a normal distribution.

          It turns out that there’s a proof that the generalized theory of PDEs using distributions like the Dirac distribution is equivalent to a theory in which you convolve everything with a very narrow infinitely smooth function and then just go ahead and do regular derivatives… 🙄

          If you're an engineer and just spent, let's say, a year doing a two semester sequence learning all this specialized notation about filtrations and measures and sequences of test functions and what's the generalized derivative of a Dirac distribution and the behavior of a countably infinite grid of Dirac distributions and so forth… And it all culminated in a proof that if you just ignore all that shit and use really narrow gaussians and take normal derivatives it's all the same… Would you be happy? Or could they have just told you that on the first day of the course so you could skip the whole thing?

          Remember, I'm a guy with a math degree! But I'm also a guy that likes to actually calculate things as they pertain to real world conditions, and I accept that everything is an approximation anyway. The pinhole isn't zero width, it's some unknown width probably between 200 microns and 1mm … The sand whacking into the painted surface of my object isn't infinitely small, it's grains of a few hundred microns… Mathematicians who never calculate numbers like abstract theories for their proof properties; people who want to know the temperature field on the toroidal mechanical part from solar radiation at 10000K penetrating a pinhole in a reflective shield don't really get much out of measure theory, etc.

        • Sorry, I realized the above is not clear.

          Suppose \mu is the probability measure. The probability of a set X in its domain is defined as

          \int_{X} 1 d \mu(x)

          So yes, 1 is the always the density with respect to the basic probability measure

          We want to actually compute the integral though, and we don't have algorithms that compute integrals with respect to arbitrary measures. We do have algorithms from calculus that compute integrals with respect to the Lebesgue measure. Assuming absolute continuity of \mu wrt the Lebesgue measure, meaning that \mu gives 0 probability to any set of 0 volume, there must exist a function f such that

          \int_{X} f(x) \, dx = \int_{X} 1 \, d\mu(x) = \mu(X)

          f is the Radon Nikodym derivative of \mu wrt Lebesgue measure, and is our traditional probability density function.

          https://en.wikipedia.org/wiki/Radon%E2%80%93Nikodym_theorem#Radon%E2%80%93Nikodym_derivative

          In the mixed discrete/continuous case I mentioned, our probability measure is

          \mu(X) = (1 - \lambda) \int_{X} \text{Normal}(x \mid j, k) \, dx + \lambda \, \mathbb{1}_{0 \in X}

          Indeed, no density exists with respect to Lebesgue measure because absolute continuity does not hold; the set {0} has Lebesgue measure 0 (zero volume) but non-zero probability. However, you can define a compound measure with respect to which \mu is absolutely continuous:

          \eta(X) = \mathcal{L}(X) + \delta(X)

          Then a density exists against this measure

          \int_{X} \left[ (1 - \lambda) \, \text{Indicator}(x \neq 0) \, \text{Normal}(x \mid j, k) + \lambda \, \text{Indicator}(x = 0) \right] d\eta(x)

          and it’s easy to compute integrals piecewise since by linearity of integration you can break it out into two integrals that simplify to

          \int_{X \setminus \{0\}} (1 - \lambda) \, \text{Normal}(x \mid j, k) \, dx + \int_{\{0\}} \lambda \, d\delta(x)

          Thus the expectation value can now be defined and assuming the integrals converge, you get law of large numbers and CLT and all that good stuff.
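
          Numerically, that piecewise expectation looks like this in Julia (a sketch with made-up lambda, j, k):

          using Distributions, QuadGK, Random, Statistics

          lambda, j, k = 0.25, 1.5, 0.8     # made-up values

          # E[X] against eta = Lebesgue + delta_0, computed piecewise
          continuous_part, _ = quadgk(x -> x * (1 - lambda) * pdf(Normal(j, k), x), -Inf, Inf)
          discrete_part = lambda * 0.0      # the point mass sits at x = 0
          continuous_part + discrete_part   # = (1 - lambda) * j

          # sanity check by simulation
          Random.seed!(1)
          draws = [rand() < lambda ? 0.0 : rand(Normal(j, k)) for _ in 1:1_000_000]
          mean(draws)                       # ≈ (1 - lambda) * j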

          I do want to emphasize I know this is all post-hoc justification. I’m confident people were taking expectations on mixed spaces like this before it was “properly” defined, just as Newton was doing calculus before the first epsilon delta proof. The point is to resolve debates and provide sufficient conditions for checking your work, verifying extensions and reparameterizations, etc.

        • I am confused here. It seems to me that people are making things more difficult than necessary.

          It’s been a long time since I studied this topic. But, as I recall, Stieltjes integrals cope perfectly well with distributions that are a mix of discrete events and continuous PDFs.

          Wikipedia agrees with me here. See https://en.wikipedia.org/wiki/Riemann–Stieltjes_integral. That article characterizes the Stieltjes integral saying, “an invaluable tool in unifying equivalent forms of statistical theorems that apply to discrete and continuous probability.”

          I don't recall seeing any practical problems that required Stieltjes integrals—but I don't work in math or statistics. A quick literature search finds a number of examples, e.g. "The measurement of power spectra from the point of view of communications engineering" by Blackman and Tukey states "interpret all integrals as Stieltjes integrals." That publication, in its multiple forms, must have more than 10,000 cites.

          The second author has received many positive mentions on this blog.

          Bob76

        • @Bob

          Looking at it, it looks like it relies on sequentially ordered partitions, so there’s not an immediately obvious way to generalize it to higher dimensions where the space is not ordered. Quickly googling, it looks like there exists a generalization using n-dimensional CDFs

          https://link.springer.com/chapter/10.1007/978-1-4613-8072-6_11

          but at that point it seems more awkward than Lebesgue integrals.

          @Daniel

          > If you're an engineer and just spent let's say a year doing a two semester sequence learning all this specialized notation about filtrations and measures and sequences of test functions and what's the generalized derivative of a Dirac distribution and the behavior of a countably infinite grid of Dirac distributions and so forth… And it all culminated in a proof that if you just ignore all that shit and use really narrow gaussians and take normal derivatives it's all the same… Would you be happy? Or could they have just told you that on the first day of the course and then you could skip the whole thing?

          Well, as a counterpoint, suppose you were working on the mixed estimation problem and one person wrote down a likelihood that excises the normal lpdf from the 0 point and another person wrote down one that doesn’t. Person one argues:

          “The probability of 0 is 0 from the normal, so you can ignore it.”

          Person two argues:

          “The probability of any point is 0 from the normal, you need to treat it like any other point.”

          Of course, experiments are cheap in statistics, since you can simulate and find that one works and the other doesn’t. Nonetheless, would you be comfortable not being able to explain why? No matter how many simulations I did, I don’t think I would be comfortable just moving on from that.

        • > In measure theoretic probability, the probability is defined as the lebesgue integral of 1 over the probability measure. In this sense, 1 can be said to be the “density”.

          Ok, it seems clear now that when you say “density” you refer to the Radon–Nikodym derivative that expresses a relationship between any two generic measures defined on the same measurable space and that may or may not exist – but exists trivially for any measure relative to itself.

          I was referring to what is called “probability density” in measure theoretic probability which may or may not exist for a given probability measure.

          > You usually want to get back the Lebesgue measure so you can use familiar calculus, or the counting measure to use sums. The Radon Nikodym derivative […] and you can do this transformation when the target measure is absolutely continuous wrt to the probability measure.

          And you cannot do this transformation when it isn’t and the Radon Nikodym derivative doesn’t exist.

          > * Importantly for our purposes, we can define a density on compound spaces that projects down to an integral against the Lebesgue measure when we condition on the data, allowing us to use traditional calculus to integrate parameter space

          What does projecting down to an integral against the Lebesgue measure mean when the Lebesgue measure is not absolutely continuous wrt to your compound probability measure?

          > The trouble is that without a notion of densities, stuff like the expectation value is actually not defined.

          Why do you say “without a notion of densities”? How is that related to the current discussion about the need for measure theory?

          For example, “Probability density appears in 1939 in H. Jeffreys, Theory of Probability: “We shall usually write this briefly P(dx|p) = f'(x)dx, dx on the left meaning the proposition that x lies in a particular range dx. f'(x) is called the probability density” (OED2).” and he doesn’t discuss measure theory at all.

          > Seriously, in the case like z = Indicator coin flip x = z * normal(mu, sigma) What is the definition of the expectation value of x? Now, of course lots of people will quickly say 0.5 * 0 + 0.5 * mu without any notion of measure theory.

          E(x) = E(E(x|z)) = E(x|z=0)P_z(0) + E(x|z=1)P_z(1) = E(0) * 0.5 + E(normal(mu, sigma)) * 0.5 = 0 * 0.5 + mu * 0.5

          > And that’s fine, but until the expected value is actually defined, I can’t actually show why a wrong answer is wrong without resorting to brute simulation and law of large numbers.

          If we don't need measure theory to find the answer – and consequently to know that anything else is not the answer – I don't understand what part you find troubling.

        • > Well, as a counterpoint, suppose you were working on the mixed estimation problem and one person wrote down a likelihood that excises the normal lpdf from the 0 point and another person wrote down one that doesn’t. Person one argues: “The probability of 0 is 0 from the normal, so you can ignore it.” Person two argues: “The probability of any point is 0 from the normal, you need to treat it like any other point.”

          They don't need to understand measure theory. They need to understand the software that they use. They are not really writing a likelihood. They are writing two separate "likelihoods" because they are actually describing to the computer two separate models. The system works on discrete distributions using a trick that no longer works when you add a continuous distribution to the mix. The workaround is to use two separate models and split the data appropriately: integer data goes to the discrete model and non-integer data goes to the continuous model. You don't need measure theory to understand why a single data point cannot go to both models.
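
          In code, that split is just a branch on the data value; here is a Julia sketch of the resulting zero-inflated likelihood (not actual Stan, and loglik is just an illustrative helper):

          using Distributions

          function loglik(y, lambda, mu, sigma)
              ll = 0.0
              for yi in y
                  if yi == 0                  # inflated observations go to the discrete part
                      ll += log(lambda)
                  else                        # everything else goes to the continuous part
                      ll += log(1 - lambda) + logpdf(Normal(mu, sigma), yi)
                  end
              end
              ll
          end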

        • > What does projecting down to an integral against the Lebesgue measure mean when the Lebesgue measure is not absolutely continuous wrt to your compound probability measure?

          The joint density is not absolutely continuous, but the conditional posterior density is. But you need* a joint density to define the conditional density.

          > E(x) = E(E(x|z)) = E(x|z=0)P_z(0) + E(x|z=1)P_z(1) = E(0) * 0.5 + E(normal(mu, sigma)) * 0.5 = 0 * 0.5 + mu * 0.5

          The trouble is that the law of iterated expectations is not a definition, it is a result proved *about* expectations. The definition of an expectation is

          integral x f(x) dx

          or

          sum k f(k)

          and if it’s neither continuous nor discrete, without measure theory it’s undefined, and the question doesn’t even make sense. Hence you’ve calculated a number, but without measure theory it’s not an expectation, and you need* to show it has desirable properties like LLN and CLT.

          I've been putting an asterisk next to need because you obviously don't REALLY need any of those things. You've calculated the right number that measure theorists have proved comes with those desirable properties, and it was true before they proved it. Obviously, Newton calculated the motions of the planets without a well defined derivative; he didn't need Weierstrass and his friends. But I don't think Weierstrass et al. are useless either. My anxiety is that you can also do it wrong, and until there's a definition of right and wrong it's all just rhetoric and experiments.

        • Somebody:

          Note that my objection to measure theory isn’t that it’s formal, I’m ok with formalisms. My objection is that the formalism has a huge impedance mismatch between the theory and the application.

          Go back to my heating from a pinhole in a solar shield. Let the heating power from the pinhole be represented by a mathematical object. Which would you rather use? A normal pdf with width 200 microns, or a delta measure?

          Notice that these things are two totally different types of objects. Notice that we have numerical algorithms that can deal with the pdf, and we have essentially nothing that can deal with the delta distribution numerically.

          Notice that the normal distribution with width epsilon is a nonstandard object that does represent the delta distribution, so the NSA approach has zero impedance mismatch: the density doesn't exist in standard mathematics, but it is totally understandable in an NSA approach.

        • > They are not really writing a likelihood. They are writing two separate "likelihoods" because they are actually describing to the computer two separate models.

          This problem happens to decompose nicely in that way, but it doesn't always. Off the top of my head:

          Probability lambda that z = 0
          x is normally distributed
          if x >= 2.5 the device “saturates” and shows 0

          y = z * x if x < 2.5 else 0

          You *can* fit a model like this in stan. You *do* have to assume a joint distribution of observations and continuous parameters, which can then be conditioned to produce a posterior.
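
          The forward model, as a quick Julia sketch (made-up parameter values; simulate_y is just an illustrative helper):

          using Distributions, Random

          Random.seed!(1)
          lambda, mu, sigma, cap = 0.2, 1.0, 1.5, 2.5
          function simulate_y()
              z = rand() < lambda ? 0 : 1
              x = rand(Normal(mu, sigma))
              (z == 0 || x >= cap) ? 0.0 : x   # a zero can mean a broken device OR a saturated one
          end
          y = [simulate_y() for _ in 1:10]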

          Sorry about earlier sloppiness:

          > The joint density is not absolutely continuous, but the conditional posterior density is. But you need* a joint density to define the conditional density.

          Should be “measure is not absolutely continuous”.

          Anyways, my point is not that every practitioner must learn measure theory. Just that it can be clarifying. If you flip a coin and multiply it by a normal, I’d like to be able to ask “what’s the expected value of that.” And it’s just a fact that neither continuous nor discrete probability theory has a satisfactory definition of expected value for that case.

          Yes, doing what comes naturally often gives the right answer anyways. But at least for me, if I think about it too hard, I second guess my instincts until I can convince myself either way. Maybe that's the mathematics mind poisoning, but in any case, measure theory helps inoculate me against that.

        • Somebody: yes, formalism gives you a way to crank through a calculation and get the right answer. I understand that value. What I find unconvincing is that measure theory is the right choice of formalism.

    • Measure theory generalizes the notion of pdf and pmf, but tends to use the term "density" and use integral and dx notation for both. In that sense, it's not punning in the sense Daniel Lakeland mentions but unifying into a general framework where no punning is required. I haven't actually gotten that far as I haven't needed it for anything yet.

      Let me put it another way. Let’s say we have a zero-inflated normal (a kind of “spike and slab”). How do you reason with that mixed discrete/continuous density? It has a cdf, but it doesn’t have a pdf or pmf. Measure theory makes it possible to do general reasoning at the pdf/pmf level.
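
      (Concretely, the cdf is easy to write down even though neither a pdf nor a pmf exists; a Julia sketch with the spike at 0 and made-up arguments:)

      using Distributions

      # F(x) = (1 - lambda) * Phi((x - mu) / sigma) + lambda * 1{x >= 0}
      spike_slab_cdf(x, lambda, mu, sigma) =
          (1 - lambda) * cdf(Normal(mu, sigma), x) + lambda * (x >= 0 ? 1.0 : 0.0)

      spike_slab_cdf(-1e-9, 0.3, 0.0, 1.0)   # just below the spike
      spike_slab_cdf(0.0, 0.3, 0.0, 1.0)     # jumps by lambda at 0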

      I don’t know what a “compound space” is. What I was getting at in terms of existence is that when being formal, a density function is defined to be the derivative of a cdf for a random variable X, which is just any old function from the sample space to real numbers such that the cdf exists. The cdf is defined as F_X(x) = Pr[X <= x] where Pr is called a “measure”. Without measure theory, how do you define what a random variable is?

      I completely agree with everyone that you don’t ever want abstraction purely for abstraction’s sake. If you stick to entirely continuous spaces, everything can be done with Lebesgue measure. The same way as you can do calculus without understanding Lebesgue measure, you can do probability theory without understanding Lebesgue measure. You just can’t do it in a lot of generality, which is why Lebesgue measure is the first thing they teach you in real analysis after intro calculus.

      I’m also not sure how you’d go about proving that continuous random variables exist without measure theory (or some other kind of analysis).

      • Bob, the pun for me was something like

        integrate(delta(x) dx,-inf,inf)

        what Carlos and “somebody” eventually came around to realizing was that this is intended to mean:

        integrate(1 ddelta(x)), which is defined in terms of splitting the x dimension into an open cover of measurable sets, and then adding up 1 * mu(dx) where mu in this case is the “delta measure”, which is 1 if dx contains 0 and otherwise 0.

        The standard “Lebesgue” integral is defined in terms of breaking up the *range* of the function into different values. In this sense the function is 1(x) = 1 for all x. The range therefore has only 1 value, and therefore its pre-image is (-inf,inf), so the whole symbolic expression reduces to

        delta((-inf,inf)), which is just to say: apply the delta measure to the whole real line.

        The “pun” here is that you’re talking as if there’s integration going on, but in fact all that’s happening is applying the measure delta to the whole real line. There’s no sense of limits, there are no partial sums, there’s nothing but applying a measure to all of R.

      • >Let me put it another way. Let’s say we have a zero-inflated normal (a kind of “spike and slab”). How do you reason with that mixed discrete/continuous density? It has a cdf, but it doesn’t have a pdf or pmf. Measure theory makes it possible to do general reasoning at the pdf/pmf level.

        Measure theory is one possible formalism. The other decent candidate is nonstandard analysis. In that formalism you can define a nonstandard density that describes the delta distribution.

        First, we define a grid of points which defines our probability space. We make them epsilon = 1/N apart.

        {-N^2 * epsilon, …, N^2 * epsilon} (i.e., the set of points epsilon distance apart, starting at -N and going to N)

        Now we define a delta distribution on this as normal_pdf(0.0, sqrt(1/N)). Here the standard deviation is sqrt(1/N) which is much bigger than 1/N but still infinitesimal.

        Then we define the “slab” as normal_pdf(0.0, sigma) where sigma is a standard number like 3.3 or whatever you want. And we do a mixture of the two of them

        lambda * spike + (1-lambda) * slab

        Now what’s the mean of f(x)? Just sum across the discrete grid values, i = -N^2 to N^2:

        sum(f(x) *( lambda * spike + (1-lambda) * slab))

        The spike component has values that are nonstandard (really big) in the infinitesimal vicinity of 0.0 but infinitesimal outside that range.

        If f(x) is a standard continuous function, then the result will be infinitesimally close to f(0) with probability lambda, and will be the usual mean of f(x) over the normal distribution with probability 1-lambda
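
        Here is a finite-N sketch of this construction in Python (N, lambda, sigma, and the test function here are my own choices, and a truly nonstandard N is of course not representable), with each grid point carrying probability normal_pdf times epsilon so that everything sums to one:

        import numpy as np
        from scipy.stats import norm

        N = 500
        eps = 1.0 / N
        x = np.arange(-N, N + eps, eps)          # stand-in for the grid from -N to N

        lam, sigma = 0.3, 3.3
        spike = norm.pdf(x, loc=0.0, scale=np.sqrt(1.0 / N)) * eps   # very narrow
        slab = norm.pdf(x, loc=0.0, scale=sigma) * eps               # ordinary width
        p = lam * spike + (1.0 - lam) * slab                         # sums to ~1

        f = np.cos                               # any standard continuous function
        print(np.sum(f(x) * p))                  # close to lam * f(0) + (1 - lam) * E[f] under the slab
        print(lam * f(0.0) + (1.0 - lam) * np.sum(f(x) * slab))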

        • note, everywhere I said normal_pdf() it should say normal_pdf() * epsilon. Each of the infinitesimally close points has infinitesimal probability, so that the sum = 1

        • Yes, discrete probabilities are being multiplied by densities. And the resulting number is neither a probability nor a density when that happens.

          All I want to get across is that there exists a formalism in which it is a density, and said density is a well defined object with a consistent set of rules and operations, and that formalism is the dominant way of studying probability in academia. If you want to insist that these semantics are wrong, there’s no arguing with that.

          Yes, you can alternatively handle mixed discrete and continuous spaces by simply saying “there are actually no continuous spaces.” Then you can justify continuous approximations with infinitesimals and their special algebra or whatever. I am fully willing to believe this is easier and better than the track I was taught on. But I’m not sure it gets you everything you might want. How does one model phenomena with fractal properties like Brownian motion? Quickly googling, the “non standard analysis” approach to Brownian motion seems to be something called “Loeb measure theory”.

        • Somebody, I think you’re quoting Carlos?

          In any case I don’t believe it’s correct to say that a mixture model of continuous and discrete probabilities has a density. It has a measure, yes, but the discrete probability doesn’t have a density. Measures and densities are not the same. Every density induces a measure, but not every measure is induced by a density.

          https://math.stackexchange.com/questions/98801/probability-distribution-function-that-does-not-have-a-density-function

          As stated there, you can decompose every measure into a part with a density, a discrete set of atoms, and a continuous “singular part”. Neither the atoms nor the singular continuous part have a density.

        • Replied in the wrong place

          As I’ve stated, the density is everything under the integral sign that is not the measure. 1 can be a density. This is the sense it’s used in the Lebesgue decomposition theorem and the Radon-Nikodym theorem.

          I know you don’t like it. In the sense you mean, the only useful density is the Radon-Nikodym derivative with respect to the Lebesgue measure. I disagree. Other densities let you define the expectation and prove that it exists or doesn’t, and that things concentrate around it.

        • Somebody,

          I wrote the fragment that you quote in response to a comment of yours where you said “Here, discrete probabilities are being multiplied by densities.”

          I think that you were using “densities” in the usual sense: things that you multiply by dx1 dx2 … to calculate the probability in an area dx1 dx2 … around the point.

          If you had been using density in the sense you bring up now you would have said “Here, densities are being multiplied by densities”, wouldn’t you?

          Daniel,

          somebody uses density sometimes in the sense of the Radon–Nikodym derivative of the measure induced by the probability distribution relative to any other measure. That density may not exist for a given reference measure.

          Relative to itself the “density” is trivially one. With this definition, conditional on the parameters, the “density” is one for every observation, and the joint density is also one when we have multiple observations and multiply those ones together.

          The usefulness of this “insight” is limited, though, as the 1 obtained for one set of parameters doesn’t mean the same as the 1 obtained for another set of parameters. They cannot be used to compare the relative likelihoods of different values of the parameters.

        • @Carlos

          Yes, I was speaking in the voice of someone who does not have a theory for understanding more general spaces that are neither discrete nor continuous

          > They cannot be used to compare the relative likelihoods of different values of the parameters.

          It is my view that you should basically never do this anyways unless you have some kind of canonical parameterization and very low dimension.

          https://twitter.com/betanalpha/status/1395945995238842370?s=20

          The density or relative densities at any point are arbitrary parameterization choices. This is true even with the implicit “dx” afterwards; when you change coordinates, the “dx” rescales the density high or low depending on the jacobian of the transform.

        • >> They cannot be used to compare the relative likelihoods of different values of the parameters.

          > It is my view that you should basically never do this anyways unless you have some kind of canonical parameterization and very low dimension.

          That’s how you construct a posterior density. Normalizing a likelihood function. Should you basically never do that?

        • Well if you mean “compute at multiple points” when you say “compare”, then you actually CAN compare parameters that way using a measure theoretic density. You *can* construct a valid posterior and, so long as the measure is constant Lebesgue for all values in the PARAMETER SPACE (NOT the observation space), the formal result is classically integrable and equivalent to the version with implicit differentials and you can use it in stan.

          But that’s not what I mean when I say “compare.”

          I feel your goal at this point is to try and catch me in a contradiction. If your position is that measure theory is useless, that’s one thing, but if your position is that measure theory is actually incorrect or has less descriptive power than classical probability, you’re just wrong no matter how much you argue.

        • I don’t find the part about 1 being a density particularly enlightening. This is what I mean by saying that there’s always a measure but not always a density (I guess I should have said “with respect to Lebesgue measure”). Almost everyone, mathematicians included, who uses the term “density” does so with an implied “with respect to Lebesgue measure” (WRTLM). In my graduate class on stochastic PDEs the professor multiple times said things like “this exists and has a density”; he didn’t mean “trivially it has the density 1 with respect to some measure”, he meant “it has a density WRTLM”.

          Interestingly, to compare the relative importance of different locations in parameter space with no density WRTLM, you’d do so by choosing a finite “dx” and then using the measure to measure the probability of that dx around the points of interest. If the dx contained an atom, then to compute its measure you’d split out the density with respect to Lebesgue measure and the atom, and add them together, since you can decompose the measure into a sum of a “regular” part with a density (WRTLM), a set of atoms, and a “singular part” which can’t really exist in an applied problem (there is no applied problem where you could meaningfully say increases in probability occur over an uncountable, measure-zero set of points like the Cantor set).

          The theorem that tells us that measures are decomposable to “regular densities”, “atoms” and “weirdo abstract sets only pure mathematicians could love” essentially says that while you can write down symbols like mu(dx) for the measure of a set dx, this is shorthand for p(x)dx + atom_value(dx) + singular_part(dx) which is basically just saying that you’re hiding a sum inside some symbols.

          In computations, you will always break the atoms out separately. The nice thing about hiding the sum inside a symbol mu(dx) is that it’s easier to write proofs, not that it provides insight into computations. This is like “coordinate free differential geometry”. It’s great for proofs, but just try to calculate the volume of some abstract set without coordinates.
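
          As a concrete illustration of breaking the atom out (the numbers here are mine): the measure of an interval [a, b] under a zero-inflated normal is the atom’s contribution (if 0 lies in the interval) plus a cdf difference for the density part, and that’s exactly how you’d compute it:

          from scipy.stats import norm

          lam, mu, sigma = 0.3, 1.0, 2.0       # point mass lam at zero, slab elsewhere

          def prob_interval(a, b):
              atom = lam if a <= 0.0 <= b else 0.0
              density_part = (1.0 - lam) * (norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma))
              return atom + density_part

          print(prob_interval(-0.5, 0.5))      # interval containing the atom
          print(prob_interval(0.5, 2.5))       # density part only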

          I’ll say it again, if you are a pure mathematician who only cares about doing proofs, measure theory is good for you. If you are an applied mathematician who cares about constructing models of real world phenomena measure theory offers you nothing of much value since it’s primarily a way to ensure uniformity of methodology in proofs. This is why a giant intellect in applied math like Richard Hamming could pooh-pooh the distinction between Riemann and Lebesgue integration. What Lebesgue integration does is it completes certain functions spaces… the limit points exist inside the “Lebesgue integrable” sets of functions, whereas they don’t in the Riemann integrable sets.

          In some sense this is like making the distinction between “the probability of a Bernoulli outcome being “heads” is a number in (0,1)” vs “… is a number in [0,1]”.

          There’s no meaningful difference in applied problems. p=10^-(10^1000) is close enough to 0 for government work. and 1-p is close enough to 1. There are no experiments you could do which would distinguish between them. Remember modern Homo sapiens only evolved like 200000 years ago, and have only been doing symbolic math on probability for ~ 200 years. The universe has only existed for 13 Billion years. If you did one experiment every second since the beginning of time you’d do 10^17 experiments. The probability of seeing heads on something with 10^-(10^1000) per experiment after doing 10^17 experiments is as close to zero as dammit.

        • > Well if you mean “compute at multiple points” when you say “compare”, then you actually CAN compare parameters that way using a measure theoretic density. You *can* construct a valid posterior and, so long as the measure is constant Lebesgue for all values in the PARAMETER SPACE (NOT the observation space), the formal result is classically integrable and equivalent to the version with implicit differentials and you can use it in stan.

          Maybe this is what you mean but note that to construct a posterior p(theta|data) given the likelihoods p(data|theta) this probability function conditional on theta has to be relative to the same measure on the observation space for all the values of theta.

          (n.b. p() could be here a mass, a density or a hybrid – the differentials that may be required to make it a proper probability will be left implicit.)

          For a very simple example of a normal with unknown mean and unit variance if you have two observations data={0, 1} you may calculate p(data|mu=0) ∝ exp(0) exp(-1/2) = 0.61, p(data|mu=0.5) ∝ exp(-1/8) exp(-1/8) = 0.78, etc. These values f(0)=0.61, f(0.5)=0.78, etc. can be put together to produce the (unnormalised) posterior density for mu.

          If you allow each p(data|mu) to be relative to a mu-specific measure on the observation space you could reparametrise the data according to the quantiles of the probability distribution, getting a p-value that is uniformly distributed. With this density (always equal to one) you may calculate p(data|mu=0)=1, p(data|mu=0.5)=1, etc. These values f(0)=1, f(0.5)=1, etc. cannot be put together to produce a posterior density for mu.

          (That’s what I meant when I said that all those 1s relative to parameter-specific measures “cannot be used to compare the relative likelihoods of different values of the parameters”.)
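
          A sketch of that calculation (the grid of mu values and the flat prior are my own additions):

          import numpy as np

          data = np.array([0.0, 1.0])

          def lik(mu):
              return np.prod(np.exp(-0.5 * (data - mu) ** 2))   # drops the 1/sqrt(2*pi) factors

          print(lik(0.0), lik(0.5))        # ~0.61 and ~0.78, as above

          mu_grid = np.linspace(-3.0, 4.0, 701)
          post = np.array([lik(m) for m in mu_grid])
          post /= post.sum() * (mu_grid[1] - mu_grid[0])        # normalize to a density
          print(mu_grid[np.argmax(post)])  # peaks at mu ~ 0.5, the sample mean

          The per-mu values can be compared and normalized precisely because they are all relative to the same measure on the observation space.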

  6. > This problem happens to decompose nicely in that way, but it doesn’t always. Off the top of my head […]

    I don’t see how that case is any less nice. It’s straightforward to decompose any probability distribution which has probability masses at [some] integers and the rest spread out continuously elsewhere as the sum of a discrete distribution defined on [those] integers and producing integer observations and a continuous distribution defined at the non-integers [and the rest of the integers] producing non-integer observations [and with probability zero integer observations].

    • No, the second problem doesn’t decompose because the discrete observations tell you information about the continuous process. In the limit where the discrete process probability = 0, you still have discrete observations from the latent continuous process saturating at x >= 2.5. Hence, the number of zeroes you observe tells you something about the location of the continuous mean and standard deviation.

      • Note that this is not just a toy example, this is what you might get in an applied setting if software “engineers” overload the semantic meaning of NULL in a DOUBLE PRECISION type column in a relational database, or if a popular programming language from Google were to autopopulate uninitialized float64 fields with 0 so its users decide to interpret 0 as “invalid”

        https://stackoverflow.com/questions/38511526/check-empty-float-or-integer-value-in-golang

      • So what? We’re rewriting “ugly probability distribution” as “continuous probability distribution” plus “discrete probability distribution”. That’s mathematically fine and that seems to be how Stan manages to do the proper calculations: considering them separately.

        • It does not consider them separately; it differentiates the parameters mu and sigma through both the continuous and discrete datapoints, and both pull on ONE mean and ONE standard deviation. It’s not two models—it’s a single posterior density and one set of non-independent parameters.

          It only considers them separately in the sense that it considers all datapoints separately, since they’re conditionally i.i.d and factorize so you can sum up the log probabilities piecewise. But following those semantics, for any model you fit, you have N separate models for N datapoints.

        • Ok, maybe we agree.

          I mean that the integers and the non-integers are handled separately in the same sense that if you fit negative observations using one distribution and positive observations using another distribution they would be handled separately (that has nothing to do with the different components sharing parameters).

          I don’t know if you would also find it confusing in that case to see something like "if (observation<0) useOneDistribution else useAnotherDistribution". Each observation goes in only one bucket – and the same happens when we split the distribution into a continuous part and a discrete part.

        • The main advantage is that you can prove a theorem without decomposing it into all the cases. Sweeping the details under a rug is a great tactic if you have complicated logic to discuss at a higher level for a proof and you never get near a floating point number at all. On the other hand, if you have a numerical calculation to do, hiding the details under the rug can be disadvantageous. What the symbol integrate(dmu(x)) means is a different calculation depending on the nature of the measure. When you go to tell a computer what to do, you wind up having to confront those cases directly, so the notation doesn’t make things easier, at least not obviously.

        • My trouble is that the algorithm still assumes the existence of a posterior density, which assumes the existence of a joint density which simply cannot exist in a purely discrete or purely continuous world.

          f(lambda, mu, sigma, obs_1, obs_2,…) = p(lambda, mu, sigma) * h(obs_1 | lambda, mu, sigma) * h(obs_2 | lambda, mu, sigma)…

          So far, so good. So what should h be? A PDF? A PMF? Sometimes it’s a density and sometimes it’s a probability? You don’t need the formalism to handle a mixed world, but I at least need the formalism to convince myself that I’m doing it right.

        • Stan does not support discrete parameters, precisely because we can’t differentiate the target log density with respect to them as we would need for Hamiltonian Monte Carlo. We support discrete distributions, but the outcomes have to be observed data. When you have something like a mixture model in Stan, you have to explicitly marginalize.

          I’m not sure what Carlos Ungil means by “separately.” I’m just amplifying somebody’s replies here. Stan defines a single continuous log target density up to a normalizing constant (and also lets you define posterior predictive quantities but we can set that aside). It mixes the discrete and continuous distributions together by mixing pmfs and pdfs in definitions. We write things like

          p(n, theta | N) = binomial(n | N, theta) * beta(theta | 2, 2),

          where we mix the pmf for the binomial with the pdf for the beta. It’s easy to understand at the joint cdf level, but not so clear at the joint pdf/pmf level or in terms of how it lets us define a posterior

          p(theta | N, n) propto binomial(n | N, theta) * beta(theta | 2, 2).
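
          To make that concrete outside of Stan, here’s a sketch of the same kind of target in Python (the data values N and n are made up): the log target adds a binomial log pmf to a beta log pdf, and normalizing over theta numerically gives p(theta | N, n).

          import numpy as np
          from scipy.stats import binom, beta

          N, n = 20, 13

          def log_target(theta):
              # log p(n, theta | N) = log binomial(n | N, theta) + log beta(theta | 2, 2)
              return binom.logpmf(n, N, theta) + beta.logpdf(theta, 2, 2)

          theta_grid = np.linspace(1e-6, 1 - 1e-6, 1001)
          log_post = np.array([log_target(t) for t in theta_grid])
          post = np.exp(log_post - log_post.max())
          post /= post.sum() * (theta_grid[1] - theta_grid[0])    # numerical normalization
          print(theta_grid[np.argmax(post)])   # ~ 0.636, the mode of Beta(15, 9)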

        • > I’m not sure what Carlos Ungil means by “separately.” I’m just amplifying somebody’s replies here. Stan defines a single continuous log target density up to a normalizing constant (and also lets you define posterior predictive quantities but we can set that aside).

          I mean that as far as I can see (it’s as if) Stan uses a density to represent a (regularly spaced) discrete distribution. Mixing (creating a [normalized] linear combination of) continuous distributions works well. Mixing discrete distributions works well. Mixing continuous and discrete distributions doesn’t work. By design, the _lpmf and _lpdf functions work on different datatypes and cannot be easily applied to a single set of outputs. The true “continuous density” and the fake “discrete density” cannot be used together, they have to be handled separately.

          > It mixes the discrete and continuous distributions together by mixing pmfs and pdfs in definitions. We write things like p(n, theta | N) = binomial(n | N, theta) * beta(theta | 2, 2), where we mix the pmf for the binomial with the pdf for the beta.

          That’s not mixing the discrete and continuous distributions together in the sense that was being discussed here (https://statmodeling.stat.columbia.edu/2023/06/05/rosenthals-textbook-a-first-look-at-rigorous-probability-theory/#comment-2226639). You actually give a similar example in another recent comment of yours: “Let’s say we have a zero-inflated normal (a kind of “spike and slab”). How do you reason with that mixed discrete/continuous density? It has a cdf, but it doesn’t have a pdf or pmf.”

          Mathematically a probability distribution can be represented as the sum of probability densities (normalized). For example the density for a mixture of normals is – at every point – the weighted average of the densities of the components.

          The distributions being mixed can have non-overlapping supports. For example a distribution which is a mixture (in unknown proportions) of an exponential with parameter lambda for positive numbers and a mirrored exponential with parameter 2*lambda. One could write in this case something like:

          for (i in 1:N) {
            if (y[i] < 0) {
              target += log1m(theta) + exponential_lpdf(-y[i] | 2 * lambda);
            } else {
              target += log(theta) + exponential_lpdf(y[i] | lambda);
            }
          }

          That's what I meant by treating them separately: each observation goes to one of the two alternatives. It's easy to see why: because we're separating the distribution into two non-overlapping components.

          Zero goes here to the positive part. It could have been the other way. Or averaged somehow. If Stan accepted the following formulation (which it doesn't)

          target += log_sum_exp(log1m(theta) + exponential_lpdf(-y[i] | 2 * lambda), log(theta) + exponential_lpdf(y[i] | lambda));

          the zero would contribute to both terms (unlike in the previous formulation). But what happens with the zero doesn't matter much. Theoretically there is zero probability of seeing a 0 outcome – in practice it will be very rare (unlike in the zero-inflated examples).

          Essentially the same happens for the "spike and slab" case. The distribution can be decomposed in two parts: the spike at zero and the slab elsewhere. The observations go either to the spike (if zero) or to the slab (if non-zero). Hence:

          if (observation == 0) {
            return log(zero_probability);
          } else {
            return log(1 - zero_probability) + continuous_lpdf(observation);
          }

          In general a mixture of discrete (at integers) and continuous (at non-integers) distributions can be handled considering them separately (and working around the impossibility to use _lpmf functions on reals even if they correspond to integers).

        • 1. They go down different code branches, but they get added to a single number, the log posterior density. That’s what I mean: you still have to assume the posterior density is a meaningful quantity despite being the product of some pmfs and pdfs.
          2. In the case where a device saturates and reads 0 past a certain number, the two distributions do not go down different code branches. It turns into

          if (observation == 0) {
            return log_mix(
              zero_probability,
              0.0,
              continuous_lccdf(2.5 | continuous_params)  // P(x >= saturation point)
            );
          } else {
            return log(1 - zero_probability) + continuous_lpdf(observation | continuous_params);
          }

          The continuous distribution appears in both branches

        • > They go down different code branches, but they get added to a single number, the log posterior density.

          The numbers calculated using the corresponding probability mass for points where a probability mass exists and the corresponding probability density elsewhere are added to a single number.

          That number is not a density. It’s a number! Maybe you mean that you have a density when you calculate such numbers for the same set of observations for many different values of parameters (which if I’m not mistaken are necessarily continuous in Stan).

          > That’s what I mean, you still have to assume the posterior density is a meaningful quantity despite being the product of some pmfs and pdfs.

          The aforementioned number is as meaningful as the number representing the total mass of a system composed of a unit point mass and a unit ball of unit density, which you would calculate as the sum of two numbers: one representing the mass of the point and another one representing the mass of the ball.

          > 2. In the case where a device saturates and reads 0 past a certain number, the two distributions do not go down different code branches. […] The continuous distribution appears in both branches.

          As I tried to explain before, that “continuous distribution” which “appears in both branches” is a probability distribution that is explicitly decomposed as the sum of a probability mass (the _lccdf) and a probability density (the _lpdf). The former goes to the “probability mass at zero” branch and the latter goes to the “elsewhere” branch.

        • Or maybe you mean that this number is the value of a N-dimensional probability density at the point (y_1, .. y_N) but such a probability density doesn’t exist in this case because the cumulative distribution function has discontinuities.

        • Yes, they are computed in branches. But this

          > They need to understand the software that they use. They are not really writing a likelihood. They are writing two separate “likelihoods” because they are actually describing to the computer two separate models

          Makes no sense to me. That’s just how all computations work. If you dig into the assembly on the continuous branch, it’s full of branches too. The logistic function is implemented like

          if x < 0:
              return 1 - 1 / (1 + exp(x))
          else:
              return 1 / (1 + exp(-x))

          In order to prevent overflow on the exp function. Would you say that a logistic regression is actually 2 models for probability 0.5?

          But the important point here is that it all gets added together in target, which is usually referred to as the log posterior density. If you want to be obtuse, we can call it “the number representing the logarithm of the numerator of the posterior density for a given set of parameter values.”

          So if datapoint 0 hits the discrete branch and datapoint 1 hits the continuous branch,

          target = log(zero_prob) + log(1-zero_prob) + continuous_lpdf(x[1] | continuous_params) + …

          Hence, they are both pulling on one set of parameters, and there is only one model. Here, discrete probabilities are being multiplied by densities (or their logarithms are being added).

          The resultant log density represented by target is indeed purely continuous. However, deriving the posterior density as prior * likelihood presupposes the existence of a joint density over the observation space and parameter space.

          Yes, you don’t have to prove it to use it, and high schoolers can compute the center of mass of a system with point masses and densities. However, by that line of reasoning, there’s no reason to learn math in general. Why would any practitioner learn a proof of the central limit theorem, or the law of large numbers?

        • Somebody:

          I keep saying it but nobody appears to be listening. The numbers we are calculating are ALL probabilities, NOT probability densities. It’s just that we are taking a shortcut to the calculation of the probability: P(Data) = p(Data)*dData, where capital P is probability and lowercase p is density. But in most models the dData is a constant! It’s the precision of the measurement instrument. But if it’s too large a number, then we can’t use this approximation anymore and we need to use the difference of the cdf function at the limits of the measurement interval.

          See detailed example here

          https://statmodeling.stat.columbia.edu/2023/06/05/rosenthals-textbook-a-first-look-at-rigorous-probability-theory/#comment-2226987

        • Also note that I’m only discussing the situation with respect to the observed data. With respect to continuous parameters of course we work with densities as they are unconstrained by observed values (parameters are never observed).

        • @Daniel

          Yes, I agree with you, in practice with finite measurement precision, your observation space is always discrete. However, I still think that it’s important to understand mixed observation spaces theoretically for the same reason that it’s still important to understand calculus and continuous probability theory in general. If we’re going to use the approximation, we may as well understand what precisely it is we’re assuming. Maybe with NSA, it all just gets a lot cleaner in the limit–I haven’t studied it (yet)

        • To me this is important conceptually because people are often thrown for a loop when they go from a “fine grained” measurement, like say measuring the weight of boxes of cereal (say 500 to 1000 grams or so) to the nearest 0.1 gram, to “coarse grained” measurements like measuring income ($50-200k) to the nearest $10k. In the code, we switch from something like

          pdf(Data)

          to cdf(BinMax) - cdf(BinMinimum) for the endpoints of the given “data bin”

          These have different units, and it’s not clear why you can mix the first thing and the second thing in the same calculation, or when you should use which one. Like maybe you have a population of people whose income is measured to the nearest 0.01 dollars because you have data from their bank, with a larger group of people from a survey and their income is measured to the nearest $10k and you want to mix both datasets… what the hell, suddenly you’re using one form or another form depending on what kind of measurement it is.

          It’s conceptually much clearer once you realize that

          pdf(Data) is short-hand for pdf(Data) * dData for dData some small increment, and this is **an approximation** of cdf(Data+dData/2) – cdf(Data-dData/2) and both expressions result in dimensions of *probability*. It’s also clear then that if dData is not “small” then the approximation we’re using for the difference of the cdfs is not good, and so we resort to calculating the differences explicitly. In fact, you could just calculate the cdf difference for *every* data point, but at some point the bin size is small enough that maybe the density is more accurate because the cdf values are two very close values being subtracted.

          Conceptually, everything you observe has a **probability** associated with it. Anything you don’t observe can have a probability density associated with it, because a probability density is a tool for calculating any number of probabilities you might care about. Like the probability that the mean income is greater than $100k or the probability that the mean income is between $55k and $65k or (either between $55k and $56k or between $65k and $66k ) or whatever other sets you might want to “measure”.

          In the Bayes formula we have Posterior = prior(params) * probability(Data | params) / normalization_factor

          The normalization factor is the integral over all the params, it treats the data as *fixed* and there is no integration over the Data space. Notice that’s why the dimensions of the Posterior work out correctly. probability is “dimensionless” whereas pdf(params) has dimensions of “per unit something” like if our parameter is a length, it’d have dimensions of 1/length.

          This understanding helps when you are dealing with measurements that are rounded off, censored, truncated, etc.
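
          A quick numerical illustration (the distribution parameters are my own; the bin widths echo the cereal and income examples above): for the fine bin, pdf(Data) * dData and the cdf difference agree to several significant figures, while for the coarse bin they already differ in the third figure, which is when you switch to the explicit cdf difference.

          from scipy.stats import norm

          def bin_prob(x, dx, mu, sigma):
              # exact probability of the data bin [x - dx/2, x + dx/2]
              return norm.cdf(x + dx / 2, mu, sigma) - norm.cdf(x - dx / 2, mu, sigma)

          # cereal weight in grams, measured to the nearest 0.1 g
          print(norm.pdf(750.0, 740.0, 30.0) * 0.1, bin_prob(750.0, 0.1, 740.0, 30.0))

          # income in $k, measured to the nearest $10k
          print(norm.pdf(70.0, 80.0, 25.0) * 10.0, bin_prob(70.0, 10.0, 80.0, 25.0))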

        • > Yes, they are computed in branches. But this […] Makes no sense to me.

          I think I already clarified that with “separate models” I didn’t mean independent: there may be common parameters. I’m happy to drop the word models and talk about submodels, components or branches.

          > That’s just how all computations work. If you dig into the assembly on the continuous branch, it’s full of branches too. The logistic function is implemented like […] In order to prevent overflow on the exp function. Would you say that a logistic regression is actually 2 models for probability 0.5?

          I also gave an example of decomposing a probability density function on two separate domains (positive and negative). The point was that when you write a probability distribution as the sum of two non-overlapping distributions each point will be associated with only one component. That is, each observation can go to one branch or to the other but not both.

          The difference with the discrete/continuous mixture is that considering a purely continuous distribution as two non-overlapping components would be a choice – and both components would be probability density functions – while for the discrete/continuous mixture separating the two non-overlapping discrete and continuous components is necessary – and the components correspond to essentially different things.

          > So if datapoint 0 hits the discrete branch and datapoint 1 hits the continuous branch, […] Here, discrete probabilities are being multiplied by densities (or their logarithms are being added).

          Yes, discrete probabilities are being multiplied by densities. And the resulting number is neither a probability nor a density when that happens. What it is depends on how many discrete probabilities and how many densities were multiplied together. (Maybe what we could say is that the different models are those corresponding to different numbers of zero and non-zero observations, as the products calculated have different dimensions.)

          If you have two observations x=a and y=b which are non-zero, you get the joint (x,y) distribution by multiplying two densities, and you get a density. Let’s say that the number Z that you get for (zero_prob=0.4, sigma=1) is half the number Z’ = 2 Z that you get for (zero_prob=0.2, sigma=0.5); that makes the latter twice as probable as the former. More precisely:

          P(a < x < a+dx, b < y < b+dy | zero_prob=0.4, sigma=1) = Z dx dy

          P(a < x < a+dx, b < y < b+dy | zero_prob=0.2, sigma=0.5) = Z' dx dy = 2 Z dx dy

          If you had two different observations different from 0 (like x=a' and y=b') the meaning of the numbers calculated would be the same.

          If you had instead a mix of zero and non-zero observations like x=0 and y=b, the numbers that you calculate for (zero_prob=0.4, sigma=1) and (zero_prob=0.2, sigma=0.5), say W and W’, are essentially different from the previous ones. You multiply a probability by a density and get something else. However, you can still calculate the relative probability W/W’. More precisely:

          P(x=0, b < y < b+dy | zero_prob=0.4, sigma=1) = W dy

          P(x=0, b < y < b+dy | zero_prob=0.2, sigma=0.5) = W’ dy

          > However, deriving the posterior density as prior * likelihood presupposes the existence of a joint density over the observation space and parameter space.

          As I think that Daniel has tried to explain, there are some implicit differentials there to have probabilities all around for the data conditional on the values of the parameters. They just cancel out when you look at a ratio of probabilities (or a difference of log probabilities) and, conditional on the observations, you can just look at Z vs Z’ or W vs W’.

        • A density-based representation for probability masses also works well, as I mentioned in another comment:

          for (i in 1:N) target += log_mix(zero_prob, normal_lpdf(obs[i] | 0, 1e-10), normal_lpdf(obs[i] | mu, sigma));

          One advantage of this approach is that it also works when instead of zero what is “inflated” is mu. While the width of the “spike” has to be wider in that case and the estimate of sigma is biased downwards a bit, that’s anyway better than the “discrete branch” alternative because there is no way to write it as far as I can see.

  7. > The trouble is that the law of iterated expectations is not a definition, it is a result proved *about* expectations. The definition of an expectation is integral x f(x) dx or sum k f(k) and if it’s neither continuous nor discrete, without measure theory it’s undefined, and the question doesn’t even make sense.

    Another definition of expectation uses the integral of the cumulative distribution function which is well defined for a mixture of continuous and discrete distributions. The question made sense, the answer made sense, and the trouble is gone. All without measure theory. I’m not sure if you will be relieved or disappointed though.

    • I’ve never seen any such definition. Closest I can find is

      https://stats.stackexchange.com/questions/305031/expectation-when-cumulative-distribution-function-is-given

      But
      1. I’m pretty sure that doesn’t apply here because the variable is not non-negative.
      2. I’m also pretty sure that’s not a definition, since the book provides a proof
      3. Expectations are a more general concept than CDFs, and can be defined where no CDF exists, such as with random matrices and complex variables

      I’m interested in the definition you’re thinking of, but I sense this discussion is getting pointlessly hostile. I’m happy to end it here, but I’m not convinced that there’s a non-measure theoretic definition of the expected value here.

      • For the univariate case, I can find

        https://stats.stackexchange.com/questions/222478/expectation-of-a-function-of-a-random-variable-from-cdf

        But the answer is a Riemann-Stieltjes integral, which is another abstract integral halfway to Lebesgue anyways, which again hides the complexity inside the integrator function. The wikipedia page suggests this integral is taught as a precursor to the measure theoretic Lebesgue integral, but I suppose it technically is not measure theory. Still an exotic, non-calc-101 integral though, so don’t know if it saves any effort.

      • > Expectations are a more general concept than CDFs, and can be defined where no CDF exists, such as with random matrixes and complex variables

        That’s a funny objection, as you were discussing a problem and you had just written that “the definition of an expectation is integral x f(x) dx or sum k f(k)”, which is even less general.

        I guess that you find your own definition unacceptable after all and the concept of expectation as a weighted average or whatever just doesn’t make sense without measure theory.

        • This is becoming ridiculous. It’s not “my definition.” That is THE definition of an expected value without measure theory. The one on wikipedia and in textbooks, in the intro to BDA3 and section 3.4 of Jaynes’ probability theory. And yes, I find that definition inadequate because it doesn’t work on spaces that are neither continuous nor discrete—that was the point I was making.

          CDFs cannot exist in unordered spaces, but expectations should—I hope that’s not a controversial opinion to you.

        • I said “your” definition because when I read that “The trouble is that […] if it’s neither continuous nor discrete, without measure theory it’s undefined, and the question doesn’t even make sense.” I understood that you didn’t find the definitions that you wrote for the discrete and continuous cases on the real line troubling on their own. It was not obvious that not being applicable to random matrices was a major issue.

          Calculating the first moment of a probability distribution is not different from calculating the first moment of a mass distribution. Or the zeroth moment, for that matter.

          Any high school student can quickly calculate the total mass of a system consisting of a unit point mass and a unit ball of density one without any notion of measure theory.

          While we could endlessly debate whether they would need a graduate course in measure theory to actually show why 42 is a wrong answer it may be better to leave it here.

        • Carlos, indeed all the trouble comes when there’s a countable or uncountable number of point masses, etc. If there’s a point mass at every rational between 0 and 1 and the point masses are decreasing in size and tending towards 0 as they approach 1, does the whole thing have a total mass? That kind of nonsense.

          When it comes to things like random functions this can be of some interest. For example a Chebyshev series with N a nonstandard number of terms. Or matrices, etc.; usually you can represent these things in terms of countable sums.

  8. Somebody:

    The advantages you mention of having a formalism are something I 100% agree with. It can be tricky to do things right, so the “rules” or “algebra” of correct manipulations are important to have. That being said, for me, measure theory doesn’t appeal. The things I find annoying are:

    1) we’re working with measures on measurable sets. But there’s not a clear good mapping between that and numerical calculations you could carry out on a computer. Like, how do you approximate an integral(f(x) dmu) numerically? First let’s take some domain, then let’s break it up into a fine cover of open sets, then we’ll calculate the measure of each open set, and then add it up… Practically speaking, on a computer, how does that recipe work? I think in almost all cases, it’s really going to fall back on standard numerical integration methods, and handling the discrete cases as a separate case… so the notation doesn’t mesh with the calculation and the formalism doesn’t help you figure out how to program the computer.

    2) The “abnormalities” that measure theory are supposed to take care of are 100% the result of infinite and infinitesimal ideas. Like “what if there’s a countably infinite number of discontinuous pieces of a function defined f : [0,1] -> [0,1] how can we integrate it?” kind of thing. In applications these basically never appear ever. Richard Hamming said “Does anyone believe that the difference between the Lebesgue and Riemann integrals can have physical significance, and that whether say, an airplane would or would not fly could depend on this difference? If such were claimed, I should not care to fly in that plane.”

    3) If you don’t know any other kind of formalism, then the advantage of having *some* formalism is real, and so you try to learn the formalism that is taught, and get familiar with it. But we *do* have a choice of formalisms, it’s just that the NSA one is not widely taught or even known by most mathematicians.

    Let’s take a look at your example using finite probability spaces together with NSA a-la Nelson’s IST.

    This problem happens to decompose nicely in that way, but it doesn’t always. Off the top of my head

    Probability lambda that z = 0
    x is normally distributed
    if x >= 2.5 the device “saturates” and shows 0

    y = z * x if x < 2.5 else 0

    Let’s write the probability that y is zero. We have z in {0,1} and x in {-N^2 epsilon, -(N^2-1) epsilon, … , N^2 epsilon} with epsilon = 1/N as our two probability spaces. The x space is a finite grid of infinitesimally close grid points from -N to N where N is a nonstandard integer.

    z is either 0 or 1, so by the sum rule of probability we can add up the probability of each case, with probability lambda that z = 0.

    So p(y=0) = lambda + (1-lambda) * Sum(normal_pdf(x_i)*epsilon, x_i == 0 or x_i > 2.5)

    The normal_pdf has strictly limited values, so when multiplied by epsilon the contribution of x_i ==0 is infinitesimal. So we can ignore that. We’re left with:

    sum(normal_pdf(x_i)*epsilon, x_i > 2.5)

    which in NSA is the definition of one kind of integral(normal_pdf(x)dx,x > 2.5) (the Lebesgue integral is not needed, but it would technically be sum(normal_pdf(st(x_i))*epsilon,x_i>2.5) where you take the standard part of the x_i before you integrate if I remember correctly)

    So the solution is p(y=0) = lambda + (1-lambda) * integrate(normal_pdf(x)dx, x=2.5,infinity)

    We now have a recipe we can directly put into Julia and calculate with either using the cdf function, or numerical integration.
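
    Here’s that recipe in Python rather than Julia (lambda and the grid spacing are my own made-up choices), done both ways:

    import numpy as np
    from scipy.stats import norm

    lam = 0.2

    # via the cdf
    via_cdf = lam + (1.0 - lam) * (1.0 - norm.cdf(2.5))

    # via a plain grid sum standing in for the infinitesimal-grid sum above
    eps = 1e-4
    grid = np.arange(2.5, 10.0, eps)                    # tail truncated at 10
    via_sum = lam + (1.0 - lam) * np.sum(norm.pdf(grid)) * eps

    print(via_cdf, via_sum)    # both ~ 0.2 + 0.8 * 0.0062 = 0.205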

    We know that this formalism is logically sound because there are proofs of that in the literature (Nelson etc.). But that doesn’t mean we have to LIKE it! I happen to like it, but I’d be happy to hear what kinds of aesthetic or pedagogical or other similar objections you might have.

    • I don’t really have any objections to NSA formalism. I don’t understand it well enough to express myself that way, but it is intuitively appealing and am more than willing to believe it’s pedagogically superior for the applied mathematician.

      When I was taking undergraduate electromagnetism I was happily throwing around dy and dx as fractions and recombining them to form jacobian determinants for spherical coordinates. I learned not to do that, and am a slightly less happy person as a result, so it does seem likely that unlearning it will be a good thing.

  9. Somebody, Carlos: It’s hard to find all the reply locations, so I’m restarting from here.

    Let’s consider the question of whether “continuous probability” is even needed for any applied question. Suppose a future in which every computer has a clock speed 10^1000 times faster than today, has 10^1000 times the memory, and 10^1000 times the number of cores, each core has 10^10 registers, each register is 2048 bits. In this future, suppose we forget about the complexity of dealing with IEEE floating point numbers, and we redefine numbers as 2048 bit fixed point numbers with the first 1024 bits being the “integer part” and the next 1024 bits being the “fractional part”. The smallest positive number we can represent is 2^-1024 which is about 10^-308. Now suppose we work exclusively with probability spaces on the representable numbers… they are {-10^308, -10^308+10^-308, -10^308+2*10^-308… 10^308}. One of these numbers would be about 616 decimal digits long with a decimal place in the middle. (note I just realized that there’s no sign bit, so assume the registers are 2049 bits and there’s an explicit sign bit that’s separate)

    Is there any conceivable scientific experiment which could distinguish between this representation and a continuous one? Note that if we measure in meters, the nucleus of an atom is around 10^-14 meters across, so there are 308-14 = 294 decimal digits of precision to specify a location **within the nucleus of an atom** not to mention any location outside. The observable universe is about 10^26 meters so you can represent distances 10^(308-26) = 10^282 times larger than the observable universe.

    As of today the most precise measurements we have of anything at all is accurate to like 15 decimal places ish… it’s probably a length. Time is measured relative to cycles of a microwave emission from cesium, and we might be able to measure down to 1 cycle or less… all the measurements are in the 10^-10 or 10^-15 kind of range for accuracy.

    In such a system, we could just calculate simple sums. The probability of being between 0 and 1 for a normal(0,1) distribution would just be sum(p(x_i) for each representable number x_i from 0 to 1). Given the speed of the computer it could calculate that sum which involves 2^(1024) additions in an amount of time something like… well lets say we can do 10^9 additions per second now so it’s 10^-9 seconds per addition, but we have 10^1000 times as much speed, so it would take 10^(308-9-1000) = 10^(-701) seconds. It’s fair to say that you would never notice the computer calculating anything, it’d just be done as soon as you pressed enter.
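
    To make that tangible at a scale an ordinary laptop can handle (the grid spacing here is my own stand-in, 1e-7 instead of 2^-1024): the plain sum and the “continuous” cdf answer agree to more digits than any measurement could check.

    import numpy as np
    from scipy.stats import norm

    dx = 1e-7
    grid = np.arange(0.0, 1.0, dx)
    print(np.sum(norm.pdf(grid)) * dx)      # brute-force grid sum
    print(norm.cdf(1.0) - norm.cdf(0.0))    # the "continuous" answer, 0.3413...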

    Of course, we’d never display all 616 decimal digits of anything, we’d always write out just 15 or 20 or 25 digits because nothing we can measure has more precision than that. So every time we did a calculation, we’d then round off to the nearest let’s say 25 decimal digits before printing it out.

    And now we have a description of a real world computing environment which sort of mimics the nonstandard one. The calculations are all done to ridiculous precision, we do a ridiculous number of them, and in the end we only pay attention to the first n bits that represent the “standard part”.

    When it comes to NSA from Nelson he mimics this at a formal level.

    Whenever you see an integral, you just replace it with a finite sum with a nonstandard number of terms. Whenever you see a derivative you replace it with a ratio of finite differences with a delta that’s infinitesimal. You could never notice the difference because you’re always “printing out” the first 25 decimal digits anyway, and the difference is “in the infinitesimal part”.

    If you want to represent a delta function, you can do it by using a really narrow normal distribution, like normal(0.0,10^-300). If we all agree that we will not distinguish below 10^-25 because they don’t print out in the result, then we also see that we can’t distinguish between normal(0.0,10^-300) and normal(0.0,10^-100) they’re both delta functions for all intents and purposes. We can extend that view to show that there’s no standard function that represents a delta function because any time we evaluate the density we either get 0 (the smallest “standard” number is 10^-25 and normal(0.0,10^-300) has zero density at 10^-25. or we get a nonstandard number (the value at 0.0 is way above 10^25)

    Nelson’s version of NSA works the same way. It just doesn’t specify the “cutoff point” for infinitesimal and infinite. If it were specified with “standard” symbols, it would be a “standard” number. And nonstandard numbers are specifically those you can’t specify without using the predicate “standard”.

    In this world, measure theory isn’t very interesting. The world is an extremely fine discrete grid. We have no conceivable physical experiment which could disprove that, and the answer is always correct to more digits than we have the physical capacity to verify. The “infinity” of points in the cantor set don’t exist, they are figments of our imagination only. Functions like the “devil’s staircase” don’t exist, because at the finest resolution inside the computer they’re still just a crap-load of individual points.

    If you don’t like that particular example, just change the exponent to even bigger… the numbers could be 10240 bits long with 5120 fractional digits, etc. At some point you have to acknowledge that because there’s no conceivable way to distinguish between this situation and reality in terms of representation of scientific questions, that you’re forced to acknowledge that the discrete ultra-fine grid must be acceptable scientifically, even if you don’t like it aesthetically.
