A thought on Bayesian workflow: calculating a likelihood ratio for data compared to peak likelihood.

Daniel Lakeland writes:

Ok, so it’s really deep into the comments and I’m guessing there’s a chance you will miss it so I wanted to point at my comments here and here.

In particular, the second one, which suggests something that it might be useful to recommend for Bayesian workflows: calculating a likelihood ratio for data compared to peak likelihood.

I imagine in Stan using generated quantities to calculate say the .001, .01, .1, .25, .5 quantiles of log(L(Data)/Lmax) or something like that and using this as a measure of model misfit on a routine basis. I think it would be useful to know for example that for a given posterior draw from the parameters, the least likely data points are no less than say 10^-4 times as likely as the data value at the mode of the likelihood, and you’d definitely like to know if some percentage of the data is 10^-37 times less likely ;-) that would flag some serious model mis-fitting.

Since the idea deserves elaboration, but the comments on this blog post are sort of played out, what do you think I should do with this idea? Is there a productive place to discuss it or maybe set up some kind of example vignette or something?

I don’t have the energy to think this through so I thought I’d just post it here. My only quick thought is that the whole peak likelihood thing can’t work in general because (a) the likelihood can blow up, and (b) wherever is the peak likelihood can be super-noisy at times. So I’d replace “peak likelihood” with something using an informative prior. But maybe there’s something to the rest of this?

99 thoughts on “A thought on Bayesian workflow: calculating a likelihood ratio for data compared to peak likelihood.

  1. I think it would be useful to know for example that for a given posterior draw from the parameters, the least likely data points are no less than say 10^-4 times as likely as the data value at the mode of the likelihood, and you’d definitely like to know if some percentage of the data is 10^-37 times less likely ;-) that would flag some serious model mis-fitting.

    Isn’t this nearly the same as the posterior probability with a uniform prior?

    For a uniform prior, we assume:
    1) p(H[0]|D) = p(H[0])*p(D|H[0])/sum(p(H[0:n])*p(D|H[0:n]))
    2) p(H[0]) = p(H[0:n])

    3) p(H[0]|D) ~ p(D|H[0])/sum(p(D|H[0:n]))

    You want to consider cases where :
    4) p(D|H[0]) << max(p(D|H[0:n]))

    Then we drop all negligible terms from the denominator, including the one under consideration p(D|H[i]):
    5) idx = which(( p(D|H[0:n]) ~ max(p(D|H[0:n])) )
    6a) sum(p(D|H[idx])) ~ sum(p(D|H[0:n]))

    So (usually) we are left with:
    7a) p(H[i]|D) ~ p(D|H[i])/sum(p(D|H[idx]))

    Instead you want to approximate the sum in the denominator by dropping all but the largest term:
    6b) max(p(D|H[0:n])) ~ sum(p(D|H[0:n]))

    7b) p(H[i]|D) ~ p(D|H[i])/max(p(D|H[0:n]))

  2. >(a) the likelihood can blow up, and (b) wherever is the peak likelihood can be super-noisy at times. So I’d replace “peak likelihood” with something using an informative prior. But maybe there’s something to the rest of this?

    Yes, certainly an issue. I wasn’t actually thinking max marginal likelihood or anything like that, I was considering max conditional likelihood max(p(Data | sampled,values,of,parameters)) where the maximum is taken over the Data. The idea being to check each sampled parameter value to see if it induces a model over the data that leaves some of the data in a very non-predicted region of data space. Although it’s possible to have likelihoods that have unlimited density, it seems rare enough that it’s a special case.

    • Over the data space, not over the actually observed data, so basically look at some quantities related to -log(p(Data[i] | params) / p(Data* | params)) with Data* being the point that has “peak density for the given parameters” for each sampled parameter value, and be able to track this as you sample, to see if the model is predicting silly things…

      for example it might be that no matter how you adjust the parameters, most of the data points are far away from the most likely value in some sense. Like for example normal(0,1) likelihood for data that’s bimodal normal(5,1) and normal(-5,1) for example.

    • An excellent suggestion but perhaps Daniel should first communicate his idea paying careful attention to what other statisticians mean by certain terms such as likelihood.

      From wiki as a start “In statistics, a likelihood function (often simply a likelihood) is a particular function of the parameter of a statistical model given data.” So don’t call things over the sample space likelihoods but say rather likelihood based quantities and clearly define them and if possible give some motivation.

      My sense is that you are doing some analogous to what I was doing with almost individual likelihood functions (they all add up to the total likelihood but one can see how they differ). https://statmodeling.stat.columbia.edu/wp-content/uploads/2011/05/plot13.pdf

      Additionally, I think you will be fighting an uphill battle as many statisticians have not thought that much about likelihood functions and any quantities based on them other than MLE and its SE.

      On the other hand, I have heard statisticians talk about likelihood for other other outcomes than the given data (including Andrew?) but I think they should be using another label other than likelihood.

      • you’re right Keith, there should be some other terminology here. likelihood is used for fixed data, whereas the max value I describe is for a theoretical best fit data point.

        what I’m talking about is a conditional probability density ratio. p(data|parameters)/max where max is the maximum that density can be over the data space for fixed parameters. this quantity is an ensemble of values, one for each data point. it describes how far from the most highly predicted value the data actually falls. if you take a logarithm of the ratio it tells you something like an entropy of the predictive error. how many bits would it take to encode the actual data if you got the peak probability prediction for free and used the probabilistic data model to build the code.

        the usefulness of this is it could help to detect poorly predicting models. while the machine learning world often cares about prediction only, the Bayesian machinery operates by trying to tune the parameter space until it drops the predictive distribution more or less onto the actual data… but the posterior is always a probability distribution, even if there are no parameters that fit “well”. this suggestion tries to generalize the notion of posterior predictive checking to ask how well we could tune the model, not just what are the best fit parameter values.

            • Hey! that’s great. Thanks for these references. In your blog post you say:

              “It does not need alternative parameter values to be considered, rather it needs alternative data to be considered. Clearly it is related in spirit to frequentist concerns of the form ‘what if the data were different?’.”

              and that’s definitely one way to use the idea, but I’m actually thinking more along the lines: given this data, should we consider an alternative model because the data is outside the realm where the model thinks it should be (say the realm where the probability ratio is greater than 1/100 or 1/1000 or something). For example in the unit normal case, a data point at 3.6 is below the 1/1000 ratio for a normal(0,1)

              Key to this idea is that it works even if the predictive density isn’t simple unimodal. for example you might have a predictive density that’s a mixture of 8/10 normal(0,1) and 1/10 normal(5,1) and 1/10 normal(-5,1)… your model is that either the prediction is dead on near 0 error… or the errors are of a slightly different form that align symmetrically about 0.

              Now if you are getting data points at say 2.5 or 3 you have data that disagrees with your model even though the tail area is still quite large due to the second error mechanism.

              Ideally models should put relatively large density in places where the data are found… Bayes can tune a posterior distribution even if that isn’t the case. How can we check to see if it’s the case? that’s the motivation here.

              Incidentally I hate the usage of the term “plausibility” in the Barndorf-Nielsen naming convention… it strongly conflicts with the notion of probability as plausibility in the Cox framework. oh well.

              • Funnily enough I think you are in fact effectively thinking of pvalues, in particular what is sometimes called an ‘absolute’ test (no specific alternative cf NP approaches) and the related ‘ordinate test’.

                A good reference is the other (statistical) Cox:


                Note in particular that a pvalue requires a test statistic specifying the extent (or really, relative extent) to which any observations ‘agrees’ or ‘disagrees’ with a specific distribution. As discussed in the above Cox paper, with no other info on what sort of discrepancy is of interest, a natural idea is to use the density ordering – an observation is considered more extreme the lower the density of this observation is under that distribution.

                This means that the ‘tail’ area is really defined in terms of density level sets – eg it is something like 1-prob of the ‘smallest’ (in terms of eg Lebesgue measure defined on the sample space) high density region containing the observed data. This works for multimodal distributions in principle.

                A common discussion point is the role of invariance in absolute tests – are such tails invariant under a change of variables (I don’t think they are)? Should they be?

                Similarly the usual issues to do with the behaviour of densities comes up (as Andrew gets at in part) as well as overfitting as raised by Yuling below.

                It seems to me perhaps that allowing the fit to be assessed only relative to other data from model itself, rather than an ‘external’ (eg held out or resampled or perturbed) source you could allow the fit to arbitrarily good for silly models. You can probably ‘regularise’ this away by only checking ‘interesting’ models that are actually ‘interesting’ (ie deliberately tend to underfit or smooth in some way!), in which case you perhaps only really need to worry about underfitting more that overfitting.

                (Re: terminology – I think BN’s ‘plausibility theory’ could actually fit well in eg Halpern’s discussion in ‘Reasoning about uncertainty’ which explains the sense in which eg Cox-Jaynes is a special case of a broad family of ‘plausibility’ logics. But don’t want to kick that one off ;-)

              • Ojm:

                I should perhaps devote a full post to this, but very briefly: Ordering based on the data density can be reasonable in some settings; where I think the big problem comes is in the calibration by probabilities..

                To put it another way: in a simple symmetric unimodal problem, the so-called p-value can be defined in two ways:

                1. The p-value: the probability of observing something as extreme or more than the data, if the null hypothesis is true.

                2. The u-value: the value of the predictive density at the data conditional on the null hypothesis, recalibrated to have a uniform distribution if the null hypothesis is true.

                For certain simple examples the p-value and u-value are the same, but with composite null hypotheses they’re generally different: a p-value (under definition 1) is a random variable, and it can be approximated by something that’s not a u-value; conversely, a u-value is not a p-value (under definition 1) in that it can’t be interpreted as a tail-area probability.

                This is sometimes stated as the p-value not being “calibrated,” but I’d prefer to split the concepts of p-value and u-value and just recognize that they do not in general go together. I’ve discussed this distinction between p-values and u-values before (see section 2.3 of this 2003 paper), but the message has not really got out there. People just seem to think of there being such a thing as a “p-value” with both properties, even though in general these properties are different.

                Anyway, my issue here is not the difference between the p-value and the u-value, important as it is, but rather the implicit idea that you should want a data summary that has a uniform distribution under the null hypothesis. This idea of “calibration” sounds appealing, but the more I look at it, the less sense it makes to me. What exactly can you do with this calibration? What relevance does the type 1 error rate have for decisions in which the null hypothesis is always false? I don’t see it. Tuerlinckx and I discuss this in our 2001 paper on type M and S errors.

                To put it another way, suppose you accept Cox’s idea of constructing a data summary, representing distance from the null hypothesis, ordered by the value of the probability of the data conditional on the null hypothesis being true. Fine. But then why convert it into a p-value? Why not just present the probability, or the distance from the null, directly?

              • > Fine. But then why convert it into a p-value? Why not just present the probability, or the distance from the null, directly?

                I’ve wondered about this myself more than a few times!

                For example in Cox and Hinkley’s Theoretical Statistics book they actually say in Section 3.2 ‘the larger the value of t the stronger the evidence of departure from H0’.

                If the realised t really does represent ‘the evidence of departure’ from your probability model, then why not just report this?

                They then say they take the realised p0 as a measure of the consistency between the data and the probability model, with an interpretation in terms of what amounts to type I errors.

                But again, why not just use T=t as the measure of consistency?

                I’m not saying there aren’t good reasons for this, but I also can’t say I’m very convinced by the usual arguments.

              • > an observation is considered more extreme the lower the density of this observation is under that distribution.

                Sure. In particular here I’m thinking of how we can decide whether we need to do more work in the modeling realm, ie. considering what we might have thought were second order effects, or getting more realistic in our description of the mechanisms etc

                I don’t particularly care for the idea of frequency based analysis: check a p value and see if it’s true that the frequency with which your model makes a certain error matches the frequency with which it really does make a certain error… but I do think a model’s p(data | params) describes what the model expects from its own errors, and when the data sits in low density regions it seems like the model is telling you that it expects one thing, and the data is telling you it’s getting something else. That’s useful info.

              • As long as we all agree that Jeffrey’s was a bit off when he criticised pvalues on the basis that:

                “An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure”

              • If we have a real frequency hypothesis, and we reject it with a p value, it seems to be useful information. It’s rejecting default null hypotheses that is the worst abuse of p values (or bayes factors or whatever). My biggest concern is that relatively infrequently do we have the volume of information required to construct a frequency hypothesis for relatively complicated scenarios, usually it’s only available to us using asymptotic attractors that let us avoid having to collect detailed statistics for frequency. We don’t have that kind of situation in doing many important things.

              • > If we have a real frequency hypothesis,

                What do you mean by a ‘frequency hypothesis’

                When someone writes p(y; H) they mean the predicted density p(y) for every value y under the assumption that theory H is true. This could be e.g. ‘H = general relativity in context C’ or whatever. This theory implies predictions about y. The predicted ‘frequencies’ are for observables y.

                Because of measurement noise and other uncertainties, instead of y = f(H), i.e. H maps to y, we just have a distribution-valued prediction of the form H maps to p(y).

                As far as I’m aware, most people interpret the predicted distribution over y in this way. Nothing much changes if you interpret p(y) as giving e.g. the relative plausibility of each y prediction under the model. A pvalue computed under the density ordering is then related to the ‘tail’ defined via density level sets.

                I don’t see how this differs from Bayes, except for potentially *more* flexibility in what H is (doesn’t technically need to be defined on a probability space).

              • If y is an observable quantity in the world (like a position or a length or temperature or whatever) then for a Bayesian model, p(y=y0 | H)dy is a quantity that describes the plausibility that y will take on the values in the interval [y0,y0+dy] under the information contained in H. It’s a description of an information state that we have about a compatibility between a theory of the world and measurements.

                In a frequentist model, p(y=y0 ; H)dy is a description of the frequency with which y will take on values in [y0,y0+dy] in a large number of repeated measurements of y, usually in future experiments of the same type.

                These are potentially completely different scientific ideas. The second type of idea, can be refuted by showing that the frequency with which we see y in [y0,y0+dy] in a large number of measurements of repeated experiments, is substantially different from p(y=y0 ; H) using say a p value type frequency based test.

                The first statement can’t be refuted in that way, since it’s not a statement about frequencies with which something specifically will occur in repeated measurements. As Jaynes said, it’s like “like trying to verify a boy’s love for his dog by performing experiments on the dog”.

                Of course, sometimes the Bayesian version can be about stable frequencies in repeated experiments… but it need not be.

              • A pvalue is just the probability of obtaining a test statistic as extreme or more extreme (where extreme means less compatible) under a probability model.

                You don’t have to interpret the implied probability model in terms of (absolute or relative) frequencies, though drawing samples from it for visualisation certainly seems to imply that’s one way people interpret these models.

                The definition of pvalue above is perfectly compatible with Bayes.

              • ojm, it seems to me that the distributional assumption is stronger in the case of p-values, based on test statistics computed in the usual frequentist way. Hence the proliferation of ‘checks’ on normality, etc. For instance, in a bog standard ANOVA you really do want the residuals to be normally distributed in order for the p-value derived from your F ratio to be meaningful.
                Bayesian reasoning also tends to reject the tail-area computation for violating the Likelihood Principle.

              • it seems fine to me to discuss the posterior probability under the model that future data would be such and such, since future data is not observed, but our actual data is incorporated in our fit.

                when we want to use our model to make frequency predictions of this type, we should use some basis function for frequency distributions such as a gaussian mixture model or a dirichlet process or what have you. then we learn the shape of the distribution and can make appropriate statements. when the distribution is fixed ahead of time the Bayesian model makes the same mistakes the Frequentist model makes when it makes predictions about future data, namely it claims knowledge of the shape of the distribution that it doesn’t have. using maximum entropy distributions for data is a kind of conservativism in choice of data distribution designed to reduce the effect of this assumption. it can give valid inference for parameters without giving valid posterior frequencies for future data.

              • Another way to put this is that using a p value or some such thing only works in a meta-statistical way if we have some reason to believe it *should* work. When we make a distributional assumption like normal errors or gamma distributed errors or t distributed errors etc, we incorporate only very basic information into our model, information like “errors are not very big” or “errors are positive and have a mean and a mean logarithm” or “errors occasionally are well outside the usual region (far out in tails)” etc.

                When we find that hold-out data doesn’t have the frequency distribution we assumed, or even that the data we used to fit doesn’t have the frequency distribution we assumed, it might “surprise” the model, but it doesn’t surprise the statistician at the meta-statistical level. Specific information about the shape of the distribution was explicitly avoided in construction of the model.

                On the other hand, if you do use something like a GMM or a finite approximation of a dirichlet process, then you *can* be meta-surprised when hold out data dramatically fails to conform to the model predictions, as you *did* try to learn the distributional shape.

              • What motivates the proposal that’s here in the original post is that even when we’re making very very simple assumptions about the shape of the data distribution, it should always be the case that when the density goes to zero, we will be surprised (at the meta-statistical level of the modeler) to see values in that region.

                So looking at p(Data[i])/p(Data_maxp) and seeing that it’s very small should tell us that something occurred which we didn’t expect. Whereas for example looking at the interquartile range of the posterior data distribution and finding that 64% of the data is there instead of 50% isn’t so interesting in the usual case (the one where we’re not using a GMM or a dirichlet process or whatever) even if we can use a p value to reject the fit using some chi-squared test of goodness of fit or something.

              • Chris – re distributional assumptions, I don’t really understand. Bayes typically makes far stronger distributional assumptions than frequentist analysis, Bayesians being ‘slaves to the likelihood’ and all that.

                But regardless, the notion of a tail area, whether defined via a density ordering or some other ordering, is defined for any probability distribution. It’s intimately tied to ‘high probability regions’ (basically being the complement of these).

                Re likelihood principle. Do prior and/or posterior predictive checks violate the likelihood principle? If so, so much the worse for the principle imo.

                The idea that you need to consider data other than that observed, in some capacity at least, seems to me a more compelling ‘principle’.

              • Thought experiment –

                You have a single probability model over y. No free parameters.

                You observe a single realisation y0.

                Can you define some notion of ‘goodness of fit’ or other relation between observed data and distribution p(y)?

                Why or why not? What is it? Etc

              • ojm re your thought experiment, p(y)/p(ymax) is exactly a dimensionless measure of goodness of fit. it is imho the only measure that makes sense for single realizations

                as far as Bayes making stronger distributional assumptions, I disagree, since it never uses properties other than the value of the density at the observed locations.

                the discussion here is about meta stats, how to detect the need for a better model.

              • – Sure. Pretty sure this is a 1-1 function of the observed pvalue as defined for the density ordering. You have discovered the ordinate and/or absolute test of significance!

                – As mentioned this is also BN’s ‘plausibility function’. It is eg modified by optional stopping, violates at least some versions of the likelihood principle etc.

                – While it may seem obvious to define a measure of fit for a single probability model, many Bayesians and eg Neyman both considered it to not make much sense. Eg Lindley (I think it was) said something like ‘Neyman was wrong about a lot but he was right that no hypothesis can be considered without an alternative’

                So when exactly do we and don’t we consider data other than that observed? What is the difference between statistical and meta-statistical?

              • Pretty sure this is a 1-1 function of the observed pvalue as defined for the density ordering.

                It isn’t though — density orderings are subject to arbitrariness due to Jacobians of monotonic transformations; ratios of densities aren’t subject to that kind of arbitrariness because the Jacobian cancels. (The Feldman-Cousins confidence interval procedure is based on a likelihood ratio ordering rather than a density ordering for precisely this reason.) Pretty sure the same cancellation goes for modifications to the normalization constant resulting from optional stopping — pretty sure those modifications have parameter dependence but not data dependence; data spaces get integrated over — but I haven’t actually written it out to check.

              • >So when exactly do we and don’t we consider data other than that observed? What is the difference between statistical and meta-statistical?

                If I’ve fit a Bayesian model and I want to know if it makes predictions that make sense, I can ask whether the predictions it made when it was fully fit “came true”. So I find the posterior distribution, and then using samples from the posterior I look at the individual predictions it would make for the data points… p(y|params)/p(ymax|params) tells me how much less likely was the actual data y compared to the most highly predicted ymax data point.

                As Corey mentions, this ratio is dimensionless (so it’s invariant to changes in units of measurement of the data), and invariant to any Jacobian type corrections (though we can’t just “reparameterize” data anyway, as it’s stuff that comes out of measurement instruments). More to the point, it’s invariant to pure scaling due to normalization factors, which is good because normalization factors are affected by the parameters of the model, so we can compare this ratio across multiple realizations of the posterior samples.

              • > As Corey mentions, this ratio is dimensionless (so it’s invariant to changes in units of measurement of the data), and invariant to any Jacobian type corrections

                I don’t think that’s correct. Jacobians cancel out when we calculate a ratio of two densities at a single point but what you propose is a ratio of the densities at two distinct points and one of the points is not even fixed but depends on the parametrization.

              • Ah good, Carlos!

                Yes, I think you’re right, if “reparameterizing” data were a thing, we’d be concerned I guess. But it’s not. The density is with respect to lebesgue measure on the data space. In order for a thing to be data (that is, an independently verifiable fact about the world) it must be the case that two people observing it from essentially the same place agree on the value. This is to within a multiplicative constant related to the units of measure. Two people must agree on the dimensionless ratio of the measured quantity and a reference quantity, both measured in the same units.

                Further, in order for this to be true, the dimensions are always a power law of the fundamental dimensions length, time, mass (maybe temperature, and a few other esoteric units), or counts of discrete objects. Which means it’s not possible to nonlinearly transform data without changing the meaning of the correspondence to a physical quantity.

                So the density ratio over data space is dimensionless in that the density is “per unit Q” where Q has dimensions, and the dimensions in the ratio cancel. which is all the invariance we need I think.

              • Also, when you say “one of the points isn’t fixed but depends on the parameterization” I don’t think that’s true. I’d say that for a given model, the predictions in data space are fixed. In fact, I’ll define a model-preserving-transform as any 1-1 transformation of the code that implements the model while leaving the probabilities over the data invariant.

                Of course it’s the case that if we change the model, we’ll get different predictions, and different probabilities over the data… we are explicitly looking for ways to identify when this would be a good idea.

              • > In order for a thing to be data (that is, an independently verifiable fact about the world)

                I’m not sure if the following qualifies as “data” for you: two different observers look at the same monochromous lightsource differently. Alices measuries wavelength, Bob measures frequency. The dimensionless ratios relating their measurements with the measurements (in the same units) of a reference light may be different. And in your case they won’t even use the same reference for the ratio: if they share a common prior distribution the location of the maxima in the frequency representation and in the wavelength representation will not be the same.

              • Re: invariance
                The ratio of the density at two different points is not invariant to re-param of the sample space. Neither is the tail area based on the density ordering. This seems like a legitimate issue with both.

                However, here is a counter argument that I actually kind of hate – all measurements are discrete hence all prob distributions over observables are discrete. Then you can take the ratio/ordering as in terms of prob mass at each point.

                In the discrete setting both the ratio and tail area are invariant to 1-1 sample space re-parameterisation (which amounts to re-ordering but not changing the prob mass function values – think eg shuffling a histogram).

                Like I said, I kind of hate this argument…but it exists.

                Re optional stopping, BN explicitly shows an examples where his plausibility function is modified, either in the paper linked above or in one of his books.

              • Carlos: of course, both Alice and Bob are taking data about the light source. But they’re taking *different* data. And, their models for that data would be expected to be different in general. If they want to predict frequency from some information, they’d construct a frequency model… and if they want to predict wavelength, they’d construct a wavelength model… and the measurement errors inherent in a frequency measuring device would be different from the measurement errors inherent in a wavelength measuring device… and It is in fact a specific model of the world that says that frequency and wavelength are connected via a constant speed of light. That’s a thing we have to arrive at by measuring frequency and wavelength of multiple light sources and forming the ratio wavelength*frequency and finding out that it’s constant within measurement imprecision. That is, we need to construct a specific model of the world. And one way we might discover that we need such a specific “constant speed of light” model is to detect that some other model predicts poorly, using a method like the one we’re discussing.

                However, if both people take frequency measurements, say of two different light sources, and they calculate ratios of those two frequencies, and they arrive at substantially different numbers… then apparently one of them isn’t really measuring frequency, they’ve got a broken instrument. alternatively, suppose they’re measuring something else, and no two people can agree on that thing even using the same method… then the thing isn’t an independently verifiable fact about the world apparently, so it’s not really data.

              • Like, for example, two people listen to the same comedy routine, and Alice “thinks it’s funny” and Bob “think’s it’s stupid”…

                Multiple people can objectively determine “Alice think’s it’s funny” or “Bob thinks its stupid”, but it’s not possible to determine “the record is funny” as an objective fact (data) about the world because it doesn’t behave in an invariant way.

              • For the same physical reality and the same physical model and the same physical measurements I think it would be desirable that we do not arrive at different conclusions depending on the representation chosen. It seems to me that you’re renouncing to this elemental invariance property for no reason.

              • OJM: it turns out that all measurements are effectively ultimately of the positions of electrons and are discrete… but I don’t think we need to go there, we can admit an “effectively continuous” underlying quantity, but still deal with the fact that in reality all measurement *instruments* are discrete, reading to a “least significant digit” which gives us an explicit size for “ddata” in a calculation p(data | params) ddata / p(datamax | params) ddata

                it turns out that ddata is invariant and cancels. if we receive instead an invertible function of data like f(data) we can invert finv(f(data)) and get back the underlying measurement, where we have a discretized space, and therefore respect the information we have about the discrete measurement instrument in an invariant way. People often work with this kind of model when there is explicit “rounding” to somewhat annoyingly few digits.

                But let’s consider the case where Carlos might come along and say “hey, how about if the function is non-invertible” like for example suppose I have a model for temperature as a function of space x, so I have T(x) and what’s reported to me is the average of T(x) over several known locations x[i].

                F(T(x)) is non-invertible, because there are many T(x) which are compatible with having a given average at the x[i]. So now what?

                The thing is in this case we’ll need to provide a plausibility assignment over the F(T(x)) values, and this will imply some plausibility assignments over T(x) functions, and we’ll just have to accept whatever they are, in essence T(x) is now a parameter, even though in principle we could have been given vectors of direct measurements T(x[i])

                when we collect different data, we will wind up with different models.

              • For the same physical reality and the same physical model and the same physical measurements I think it would be desirable that we do not arrive at different conclusions depending on the representation chosen. It seems to me that you’re renouncing to this elemental invariance property for no reason.

                I don’t think so. In fact, I think exactly the opposite. OJM’s point about discretization of measurement instruments provides the connection I think. Even if an underlying quantity is effectively continuous (ie. maybe is quantized at the electron level but we’re measuring coulombs worth of electrons) the quantity coming out of measurement instruments always has effectively a “least significant digit” and is rounded to that digit (or binary digit or whatever). a density over measurement instrument outputs is always really a device to assign probability mass to discrete measurement quantities.

                Sometimes we have explicit rounding and wind up using CDF values to handle it, other times we have pretty fine grained measurements and its sufficient to approximate CDF(data+ddata)-CDF(data) as pdf(data) ddata

                but if someone say reports the logarithm of an instrument output to you, you should invert that back to the measurement space and calculate the probability mass associated with that discrete measurement outcome. Measurement instrument models are always discrete.

              • I’m willing to entertain the idea that all *observations* (rather than all theoretical constructs) are in some sense ‘discrete’, but I would like to see a more mathematically satisfying treatment of this (which quite possibly exists!). BTW both Fisher and BN talk about needing to first decide on a proper grouping of the sample space before using the ordinate test – same thing, basically.

                Side note – re invariance for discrete vars under non-1-1 transformations. While probability is known to be non-invariant in this setting, there is a sense in which *possibility* remains invariant even for non-1-1 transformations. See eg 7.4 here:


                Either might be desirable depending on circumstance imo.

              • Let’s say the reading in a measurement device can be done in different units: the dial is marked in meters and Hz. And your model, the same exact model, can be expressed in both ways. Please don’t tell me that you could be using different models, we’re discussing what should happen when different representations of the same model are considered. You may get radically different results and it’s not just a question of rounding errors.

              • Yes, I think all observations are discrete, and all Bayesian models on measurement spaces are probability mass assignments. When we use pdf it’s just a shortcut to assigning p(x)dx for fixed and sufficiently small dx determined by the measurement instrument. Sometimes when we have explicit rounding, we wind up not being able to use that approximation and have to use CDFs explicitly.

                I’m not clear on whether length is underlying a really a discrete quantity, and/or time. It seems likely that mass is. But regardless of whether underlying constructs are discrete or continuous, the most sensitive measurements instruments we have available today are something like 32 bit A/D converters:


                obviously it becomes exponentially harder to add bits, so let’s suppose we get really really good at it… we’re still unlikely to ever have something like accurate 64 bit A/D converters. A coulomb of electrons is 6.242e18 whereas 2^64 is 1.84e19, so we’d have to be able to count every electron entering a millifarad capacitor as we charge it to over 3000 volts to get an accurate 64 bit A/D converter.

                so let’s say we never get better than 48bit ~ 2.8e14 it’s pretty obvious that these are discrete measurements.

                As for the non-invertible case. Well, we just need to provide plausibility assignments F(T(x))dF. And this model *is different* than a model placed directly on T(x) and we just need to acknowledge that.

              • >And your model, the same exact model, can be expressed in both ways

                I think the key thing I’m actually saying is that in order to assess a model, we need to assess its probability assignment, and it’s not really the density ratios that we need to calculate, it’s the ratio of the probability mass assignments in a fixed sized vicinity of the space on which we assigned the probability, ie. the space in which our information lies.

                If we work in some other space, we need to back out the quantities into the space in which the assignment was made in order to decide how well we’ve done in creating our models.

                So the rule would be not “p(q|params)/p(qmax|params)” for any pushforward measure over any q derived from data d, but rather p(d|params)/p(dmax|params) in the actual space in which we made our decisions about how to assign probability.

                If you’re working in space q which is a nonlinear transform of data d where you constructed your model, you should back q into d to assess the goodness of your modeling.

              • To be a little more mathematically explicit, suppose m and h are two different physical quantities that my theory says are related to other data (like temperature and voltages and etc) through two different functions:

                m = F(covariates) + errorm


                h = G(covariates) + errorh

                Now, I can assign an error distribution on m … pm(errorm) and it will immediately mathematically imply a density on errorh, or I can assign error distribution on h ph(errorh) and it will imply a density on errorm.

                to assess the quality of the model, I should compare errors in the space where the probability assignment is made, so if I assign error probability in errorm I should compare my actual errors to predicted errors in the errorm space.

              • A bit of a google just led me to this fairly recent (2010) thesis:

                ‘Invariant Procedures for Model Checking, Checking for Prior-Data Conflict and Bayesian Inference’

                by Gun Ho Jang (and supervised by Mike Evans!).

                See here: https://tspace.library.utoronto.ca/bitstream/1807/24771/1/Jang_Gun_Ho_201006_PhD_thesis.pdf

                This defines p-values for discrete distributions in terms of the probability mass ordering I mentioned above. It then takes the ‘all observations are discrete route’ considers the continuous case as an approximation to the discrete definition. This effectively allows them to give an invariant p-value for continuous distributions via an approximation process.

                So, as mentioned above, if you take this ‘discrete observations’ route, both the ‘tail area’ (defined via a probability mass ordering) and the ‘plausibility’ p(y0)/ (sup p(y)) are invariant under 1-1 changes of variable.

                I’m still not super happy with the ‘elegance’ of how this ‘every observation is discrete’ is treated here or elsewhere, but it seems like one could probably tidy it all up somewhat.

                [PS, while I’m mostly OK with ‘all observations are discrete’, properly tidied up, I’m definitely anti ‘all theoretical constructs are discrete’ and/or assuming additivity over possibly theoretical constructs. Hence probability distributions over observations ~ mostly OK, while probability distributions over models/theories etc ~ not in general .]

              • This seems to highlight again the difference between Frequentism and Bayesianism…

                If you have a Frequency model, then the Frequency is a fact about the world, which you could in principle verify by data collection: does the data fill up the histogram shape of the distribution? Furthermore, the invariance of “reparameterization” of data would be implied by the mathematics… if the frequency of being in [a,b] is such and such… then the frequency of being between [f(a), f(b)] needs to be the same thing in a transformed measurement space, this is implied by the math.

                The Bayesian model on the other hand is not a fact about the world, but a fact about our willingness to make predictions. The space in which we assign the probability is special, because it’s the space where we have sufficient information to assign probability to fixed width slices of the space. Probabilities in alternative spaces are mechanically imposed by our initial assignment in whatever space we assigned our probabilities.

                If we want to assess how well we did in modeling, the best way to construct a measure of “good modeling” is going to be *in the space where the model probabilities were assigned*.

                of course, you could do a mathematically equivalent calculation in a different transformed space, but it will need to be equivalent to the above idea if you want it to assess the appropriate thing, in which case as Carlos points out it will require a ratio of Jacobians.

                In many many fields, we don’t have pure precise invariant mathematical relationships between quantities in the world other than unit conversions. Like wavelength * frequency = constant for all observers.

                For example, you might easily have mean(annual pretax income) = f(some covariates) + errors, and mean(monthly posttax income) = g(some covariates) + errors, but while the conversion from annual to monthly has some precise mathematical relationship, the conversion from pretax to posttax does not have a precisely understood mathematical relationship. Inevitably our probability assignments in these two modeling scenarios will not be mathematically compatible.

              • Back on topic, here’s a weird little observation:

                Suppose you’re testing an essentially unrestricted (but discrete) family of models for ‘self-consistency’ with the observed data y0 by using consistency(theta, y0) = p(y0;theta)/ (sup_y [p(y;theta)]) for each theta in turn.

                Then both the fully deterministic model:

                p(y) = 1 if y = y0,
                p(y) = 0 otherwise

                and the ‘uniform randomness’ model:

                p(y) = 1/n for a set of n points including y0
                p(y) = 0 otherwise

                are both *equally consistent* with the data, i.e. consistency(theta, y0) = 1 for both models.

              • > I think the key thing I’m actually saying is that in order to assess a model, we need to assess its probability assignment,

                Ok. So you can have a model in representation A (wavelength), you do a probability assignment, and you conclude that the model is a failure because it doesn’t match well the data. But I can choose a different representation B (frequency), choose as probability assignment the transformation of yours, and conclude that my model is a success because it matches well the same data. That seems a problem to me, it seems a nice property to you, and we can agree to disagree.

                Unrelated to reparametrizations, what about the problem we discussed some time ago that makes many distributions impossible to model successfully (according to your method of assessment) when most of the probability lies in low-density regions.

                An example discussed often around here are high-dimensional Gaussians, which are “concentrated” (I don’t really like that way of putting it) in the low-density areas far from the center.

              • > An example discussed often around here are high-dimensional Gaussians, which are “concentrated” (I don’t really like that way of putting it) in the low-density areas far from the center.

                Putting aside the re-parameterisation issue for now (eg assuming it can be ‘solved’ by discretisation), I think the ‘level set’ p-value possibly works OK for this case, being equal to 1-the prob of the high density region that just excludes/includes the observed data.

                This is in (possible) contrast with the modal approach, though I’m not sure how this case behaves under discretisation…

              • ojm, I’ve not followed in detail the current discussion but I think I agree with you. As I said to Daniel last time, I can’t see a satisfactory solution not involving tail areas.

              • Carlos, no I don’t think that is a nice property, but then I’m not claiming this technique is definitive, just that it’s capable of detecting problems. in case of model A it did detect the problem, in case of model B which is equivalent but built on some transformed data, it fails to detect the problem. cest la vie. would we prefer to not detect a problem in either case?

                oh, yes you show that wide distributions will generally be ok with data over a wide range. I think that’s a feature.

                Carlos, in my proposal I am imagining conditionalizing to individual data points. so for example if you have a timeseries with 1000 time points you wouldn’t look at the entire sequence relative to the highest probability sequence, you’d look at time point 1, then 2, then 3, etc. are there individual time points whose dx neighborhood is thousands of times less probable than the max probability for that time point? this avoids the “curse of dimensionality” issues

              • Daniel, the main issue is detecting problems where there is none. In my example individual datapoints are high-dimensional. If you want a one-dimensional example of a problematic distribution we already talked about a mixture of a very narrow distribution and a very wide distribution. Even if you have the right distribution you are very likely to conclude it has a problem.

                I really don’t know in what sense do you think this “measure” is better (apart from being easier to compute) than the one based on tail areas (which doesn’t have the problem discussed in the previous paragraph).

              • beyond just calculating the ratio, there is also the question of evaluating how extreme is any given ratio. obviously this should be compared to some measure of how extreme the ratio is likely to be.

                suppose we simulated a few hundred draws from the distribution, and then calculate the mean of the log of the ratio. this is a measure like differential entropy. we then compare our actual quantity to this reference quantity, it turns out that this may help as well for the transform of variables, as now the reference level is going to change under the transformation.

                in a high entropy scenario, we are less surprised by extreme ratios, in a low entropy scenario we are more unhappy by extreme realizations.

              • Ok, I take back the “easier to compute” remark but I still don’t see the advantage of this method.

              • Carlos:

                The method flags individual data points that indicate discrepancy between model expectations and actual data collected in a way that works regardless of the shape of the distribution, particularly even if the distribution is multi-modal.

                Furthermore, the evaluation method is “local”, that is it compares the probability to be in a local neighborhood of the actual data point to the expected probability to be in a local neighborhood of a randomly selected predicted data point, and it uses a measure that is on an interpretable scale and has no dependence on the units of measurement.

                I think those properties are good, whereas I think a p value based on a tail area asks questions about probability to be in “distant” regions of space that are irrelevant.

                To me it has a flavor of lebesgue vs reimann integration. My proposed measure asks about being in level sets of the probability density, whereas tail-areas ask questions about probability to be in a continuous region of the data space that crosses potentially *many* level sets some of which are vastly different in probability density.

                I’m not claiming a fully worked out idea here, but I think this distinction between comparison of “how likely is it for a data point to sit at a given “depth” (or potential energy in the HMC formalism)” is a more interesting question than “how much probability is contained farther away in euclidean distance from the mode than this data point” or some such thing you can interpret a p value as meaning.

              • > I think a p value based on a tail area asks questions about probability to be in “distant” regions of space that are irrelevant.

                Note that I’m not talking about p-values based on the statistic “value of the parameter” or “distance to some central point”, or anything like that. I’m talking about p-values based on the statistic “p(x) = likelihood of the parameter given the data”. It’s not about “distant” regions, it’s about “low-density” regions.

              • One possible computational method to handle this stuff would be:

                generate MCMC sample from posterior:
                Sact = -log(p(Data|params)/p(Datamax|params))

                generate a random data point from the predictive distribution Dpred

                Spred = -log(p(Dpred|params)/p(Datamax|params))

                now in your final sample you can do things like mean(Dpred) and mean(Spred), you can plot the distribution of Dpred-Spred, you can even calculate probability for Dpred > Spred which is a kind of one-sided p value in “entropy space”

              • > likelihood of the parameter given the data

                Or probability of the data given the model, I’m not sure what’s the thing under discussion. The point is that there is no issue with multi-modal distributions and there is no issue with spiked distributions either. To put a point in the context of the probability distribution, looking at the prevalence of higher-density regions seems more natural than looking just at the density relative to the maximum value. The latter doesn’t have a clear interpretation, if I understand correctly you suggest putting it into a context of a sampling distribution to give it some meaning…

              • A little confused because you mention parameter space when I think we’re discussing data space, but assuming that was a typo, if you’re suggesting that I take samples of the value -log(p(data | params)) and just calculate tail areas in this “entropy space” then perhaps we’re not so far off from each other in the first place.

                fundamental to my intuition is I want a measure of goodness of fit which respects the local notion of “probability to be in the vicinity of where the data was actually found” rather than things like “probability to be farther out in the tail of the data space than where the data was” which relies on properties of our model which involve data *infinitely* far away from wherever our data is, and also on properties of the tail of our model which is usually not the most highly reliable portion of our data model.

                -log(p(data | params)/p(datamax | params)) is just a way to make this notion invariant to units of measure and interpretable in a reasonable way, what to do with it is I think still very much up for grabs. If you suggest calculating “probability in the MCMC sample for Sdata > Scomp” as defined in my previous post, I can see the utility in that.

              • > A little confused because you mention parameter space when I think we’re discussing data space, but assuming that was a typo,

                Yes, I was mixing things a bit, my clarification has crossed with your comment.

                >if you’re suggesting that I take samples of the value -log(p(data | params)) and just calculate tail areas in this “entropy space” then perhaps we’re not so far off from each other in the first place.

                That is what I suggested already in our previous discussion ( https://statmodeling.stat.columbia.edu/2019/03/20/retire-statistical-significance-the-discussion/#comment-1006108 ):

                “It seems you would like to check if the measurement is in the HDR of the model distribution and the more strightforward way to do so seems to calculate a p-value (using the probability density as statistic). I can’t see a satisfactory solution not involving tail areas.”

              • Also rather than “context of a sampling distribution” I’d rather say “context of the predictive distribution”

                sampling distributions are about where data actually is found… so for example you could resample your data to see what a new sample of data might look like in the future under some kind of stationarity/independence/representativeness assumption

                the predictive distribution however, is about *where your model **thinks** the data will be found* rather than where it actually is found in repeated sampling.

                This proposal is about comparing “how deep into the predictive distribution the data actually was” (or how high in potential energy it was, or how many bits of entropy change it had or something similar) relative to what the model predicts for it.

                so I think it has two components:

                1) calculating some kind of relative entropy thing -log(p(data|params) / p(datamax | params))… This is an “anchored log density”, it doesn’t have the problem of taking a logarithm of a thing with units.

                2) working with this relative entropy and comparing its realized values to what the model’s predictive distribution expects the realized values to be.

                none of it relies on frequency properties, which say a bootstrapping procedure would rely on.

              • I guess I said “sampling distribution” based on your description “suppose we simulated a few hundred draws from the distribution”. The point is that you calculate a quantity from data that ignores everything about the probability distribution except the maximum and then you proceed to simulate data using the probability distribution to see what is the expected distribution of this quantity.

                It really seems much simpler to calculate a p-value using the shape of the probability distribution (and not just the maximum value) as suggested by ojm and myself. Then you don’t need to simulate data using the probability distribution to see what is the expected distribution of this quantity… because it’s a p-value!


                1) calculating some kind of relative entropy thing

                2) comparing its realized values to what the model’s predictive distribution expects the realized values to be.


                1’) calculating a p-value

                2’) comparing with the expected distribution which is uniform in [0 1]

              • this “p value whose distribution is uniform(0,1)” is just the CDF of the predictive distribution.

                So, you’re suggesting rather than considering what is the probability to be sitting at a certain depth in density, I should just ask “what is the probability to be to the left (or using 1- to the right) of the data point” and that this somehow is a more useful way to compare where a data point fell vs where it was expected to fall.

                I guess I see questions like “were we exceedingly far to the left” or “were we exceedingly far to the right” are nonlocal questions. The question I want to know is “are data points within epsilon of this point particularly weird compared to our predictive distribution’s expectation for the probability of epsilon neighborhoods”. Epsilon neighborhoods often have a very good interpretation because as we said above all observations / measurements are actually discrete, and so there is a distinguishing value for epsilon that actually corresponds to a physical quantity, the “least count” of the measurement instrument.

              • > So, you’re suggesting rather than considering what is the probability to be sitting at a certain depth in density, I should just ask “what is the probability to be to the left (or using 1- to the right) of the data point”

                Definitely that is NOT what I’m suggesting.





              • AHA, now I understand! you’re saying sample a bunch of x values from the predictive distribution, calculate p(x) and then compare where our given p(data) is vs the distribution of p(x) values.

                Now that I understand let me consider the idea and see if I can notice any important differences.


                Daniel – I’m also talking about this BTW

              • Ok, so the thing I’m thinking about is using p(x) itself as the statistic of interest rather than p(x)/p(xmax) or its negative log transform…

                A particular thing of interest is the fact that we still are considering the behavior over the entire posterior distribution of the fit… so suppose you have something like data = 1 whose predictive distribution is normal_pdf(data,0,s) and s includes values in the range 100 to 200.

                now 1 is very close to 0 relative to 100 or 200. But due to normalization, a distribution with width 100 is 2 times larger in the vicinity of 1 compared to a distribution with width 200.

                so in my proposal -log(p(1 | params)/p(0 | params)) will be approximately 0 for the entire range from 100 to 200, but will range over a whole range as s ranges over the 100 to 200 range…

                Now, in my opinion, the -log(p/pmax) captures the idea “this is very close to its maximum density region” whereas p(x) tends to compare apples to oranges. It fails to have the symmetry I think you’d want to answer questions of interest. On the other hand, the sensitivity to the vagueness of the distribution may be desirable in some settings, so I feel like there’s potential to argue for either one. It’s particularly the case when the data itself is already reduced to a dimensionless form by forming a dimensionless group. If the data itself has units, I think it’s less good to work directly with the density.

              • When we use Srel = -log(p(data|param)/p(datamax|param)) then Srel is a number guaranteed to be in [0,inf] and numbers closer to 0 are “deeper in the high probability region” and numbers farther from 0 are “out of the high probability region”.

                In order to understand how far out of the high probability region things really are, we do need to sample from the predictive distribution, and see what the predictive distribution of the Srel statistic is… we can’t just rely on the one realized value, as shown by the spike and slab example Carlos gave.

                when we work with p(data|param) itself, due to the normalization constant, the number can be anywhere in [0,inf] but numbers nearer 0 are “either in a low probability region of a concentrated distribution, or in the high probability region of a broad distribution” and numbers near infinity are “in the high probability region of an increasingly concentrated distribution”. Basically it mixes together concentration and high probability region.

                So, if you want a number that indicates “relative distance outside the high probability region” which is invariant to the effect of normalization constant due to the varied concentration across the posterior parameter distribution, you want the Srel I think.

              • What’s wrong with the p-value then?

                Of course the p-value calculated based on p(data|param) is identical to the p-value calculated based on -log(p(data|param)/p(datamax|param)) and to the p-value calculated based on any other monotonic function of p(data|param).

              • Carlos. I think there’s nothing wrong with this p value concept as a way to determine if the density near your data point is unusual but while your statement about monotonic transforms applies for a single sample of the posterior parameters, across multiple posterior parameters, when pooled, the Srel behaves differently than the p(x) because it’s invariant to changes in the normalization constant that occur as you move around the parameter space.

              • I thought that that you wanted was to measure the fit of the assumed probability distribution looking at single realizations (for some reason). If what you want is to compare the model with the empirical distribution corresponding multiple realizations, I think in the best case you’re going to rediscover Kullback-Leibler divergence of something like that.

              • Or maybe I’ve completely misunderstood your comment, because now it’s me who is confused by the references to parameter space.

              • Carlos, ultimately I want to think of a procedure that allows me to detect when data is showing up in regions of data space that are not predicted by the model. so I want a quantity I can use to detect this and I want a procedure that does the detections. I want to use the entirety of the posterior distribution during my detection phase, ideally I want a procedure that can be done by adding a few extra calculations to each step of an MCMC type procedure.

                here’s the proposal so far…

                1) sample parameters, calculate Srel for the actual data
                2) sample a predictive data point, calculate Srel for predictive data
                3) repeat 1,2 until you have sufficient parameter sample
                4) plot a density plot of Srel for data and for predictive, visually decide whether these distributions indicate extreme values for Srel for data compared to predictive
                5) possibly use some p value type calculation to determine if Srel for data has some particular mathematical property, such as probability to exceed a Srel predictive greater than some value.

                since steps 1,2,3 involve traversing the parameter space, the statistic Srel should be invariant to changes in normalization constant caused by changes in parameters, hence the proposal for -log(p(data|params)/p(datamax|params))

              • > statistic Srel should be invariant to changes in normalization constant caused by changes in parameters

                This makes no sense to me. When you change the parameters the shape of the density function changes, the location of the mode changes, the maximum value changes, the density at the data value changes, the ratio between the two preceding item changes… what do you think it’s invariant precisely? Hopefully you agree that when you set the parameters using MLE the ratio becomes one.

              • Carlos, it’s not invariant to changes in the parameters, it’s just invariant to changes in the *normalization constant* as the parameters change.

                For example, if I have a data point right at the peak density data value, I want this to always return the same thing, in the case of Srel this is namely 0. With p(data) it doesn’t do that, if the shape of the distribution changes, but your data remains at the peak predictive location, nevertheless p(x) changes based entirely on just the constant needed to renormalize the distribution.

                If for all parameter values of interest your data is always say at 1/2 the maximum density, still as you change the shape of the distribution, p(x) can vary over many many orders of magnitude even while staying at half the maximum value…

                It’s a little like the Reynold’s number, if I make my discussion of drag on a sphere be in terms of the fluid used and the velocity, and the size… then I can’t speak to the commonality inherent in certain combinations of fluid, velocity, and size. On the other hand if I do my drag analysis in terms of a dimensionless ratio rho*v*L/mu then I can see that whenever this combination is the same, the drag pressure is similar.

                The same idea applies to normalizing my statistic in terms of peak data density. Now I have a meaningful reference scale for the density for comparing across multiple parameter values.

                I could *for each parameter generated in step 1* generate many predictive data values, and then calculate the p value you mention, and then use this as a measure of atypicality. It’s also dimensionless. Yes, that’s possible.

                But if I remove the multiple order of magnitude changes that could occur due to the normalization constant, I can pool Srel across the whole MCMC sample, and then I generate just one random predictive data point for each MCMC step and do my analysis in pooled version at the end of the whole sampling procedure: did Srel typically fall far from 0 compared to where Srelpred typically fell?

                does this make sense?

                also note, that because there’s a Bayesian posterior, I don’t really care intensely about the precise p value *within a given fixed parameter sample* what I care about is if there was a consistent problem, *regardless* of which parameter sample I use. This indicates that Bayes was unable to find any region of the parameter space that did a good job.

              • The p-value is arguably more “invariant” then, becase not only the value for the most probable data is fixed (1) but the distribution over the data distribution is also fixed (uniform in [0 1]) as you move on the parameters space.

                I asked “what’s wrong with the p-value”, you could have answered “nothing” and it would have been faster for you to type and easier for me to understand. :-)

              • Well it took quite a few iterations for me to understand the p value you were talking about wasn’t the p value I thought you were talking about (ie. the p value for the data as statistic, instead of the p value for density as statistic).

                It turns out you’re just ahead of the game and are dragging me up the mountain as usual ;-)

                Anyway, I’ve found the conversation useful because it helped clarify the ideas. In any case, now that we’re talking about the same thing, Yes I agree that the predictive p value for the density calculated over the fixed parameters is invariant to the change in parameters. It’s substantially more computational work to calculate, but I think it would work fine to detect the case “data is in regions that are abnormally low density compared to what we expect under the predictive distribution”

            • > If you have a Frequency model, then the Frequency is a fact about the world, which you could in principle verify by data collection: does the data fill up the histogram shape of the distribution?

              You don’t have to interpret a probability model in terms of frequency, but you *are* constrained by the math to accept additivity when working with probability as a ‘plausibility’ measure.

              I’m much more comfortable accepting additivity over observable events in general – e.g. either A or not A happens (in general anyway…) – while *much* less comfortable accepting additivity of ‘plausibility’ measures over *models* or theories etc.

              Others are too – hence e.g. the existence of non-additive plausibility measures. But we’ve discussed this enough in the past….

              • >additivity of ‘plausibility’ measures over *models* or theories etc.

                We’ve been over that before. Again I think of this as a provisional calculation: if these are the only models we’re considering… what does that imply. The goal of projects like calculating my suggested p(data|model)/p(datamax|model) is specifically to highlight instances where what’s possible within the model fails in a meta-statistical way. I think of this as the Godel incompleteness analog of statistics. We always have to look at a meta level to discover that we need a broader model.

    • my understanding is that Evans is interested in how the probability over parameters changes with data. a very useful idea. but here I’m proposing to check how well the models predictions match the data after tuning (ie. using posterior values for parameters).

      the idea is that a posteriori the model makes predictions about where each data point will be, but the data doesn’t sit right at the most highly predicted value usually. where does it sit? are there any extreme outliers? are most of the values outliers? etc

      • I find this post interesting but I am wondering to what extent the discrepancy between the observed data to the most-likely-data “would flag some serious model mis-fitting.” parameter can overfit data, but the hypnotically generated data can also overfit the model too. In particular, if a silly model overfits the data by predicting each y_i as it is through a point mass 1(y_i), then the observed data remains the most likely one, so such measurement is encouraging overfitting? But if I am only worried about “underfitting”, why not just look at the empirical error in the sample?

        Mathematically I think this is equivalent to likelihood ratio test by flipping data and parameter, so under null the log ratio should a chi square distribution. But we know any model is wrong a priori so I will essentially reject my model. OK actually I do not know if it is asymptotically equivalent to compare the model fit to MLE.

        Furthermore, Bayesian posterior predictive checking seems to take all this into account, where I can compare any function f(y_generate, theta) with the data y. I will literally recover that chi-square test by taking f=likelihood.

        More philosophically, it seems there are two principles that we can ask for when it comes to data perturbation. 1) In your proposal, you are asking for “the best data” should not be much better. In contrast, robustness or adversarial-learning are asking for the worst data (in the neighborhood) cannot be much worse. There should be some further principles…

  3. *Jeffreys.

    And off because a) considering data that could of but didn’t occur is actually very important and b) that’s not really how pvalues work

    • A bit OT. I find Sander Greenland’s S-values the most useful and least misleading implementation of p-values. I think the general notion of a ‘continuous measure of discrepancy between model and data’ is indeed what Daniel is looking for.

      • An alternative I like is to think of the pvalue calculated under H as a measure of *possibility* of H. High p, pretty possible. Low p, not that possible.

        Most people intuitively understand that possibility is non additive so that just because one H is possible doesn’t mean that another can’t be as or more possible.

        To measure the extent to which alternatives are also possible, you can take the necessity of H as 1-max{poss(Ha)} over alternatives Ha. Most H have at least one fully possible Ha, so most H aren’t necessary. Such is life.

        • (The idea also being to offer a *positive* interpretation of p that doesn’t commit the affirming the consequent probabilistic fallacy. This gets away from focusing on *inference by rejection* that plagues a lot of hypothesis testing)

        • I find this appealing as a way to explain what p values actually tell you. Small p means the null hypothesis is not a possible explanation for the data. It then follows that “something else” has to be the explanation, but it’s maybe a little clearer that it doesn’t have to be “my favorite hypothesis”

  4. Consider a simple model of two chained binomials:

    Y ~ Binom(X; p)
    Z ~ Binom(Y; q)
    p ~ Uniform(0, 1)
    q ~ Uniform(0, 1)

    Assume our data is X=1000000 and Z=1
    Lets make an inference for Y, p and q, ie p(Y, p, q| X=a lot, Z=1)

    It appears that likelihood (~~posterior) at point Y=1, q=1, p=1/X is way bigger then likelihood of any point near the posterior mode. What should I conclude out of this?

      • Yeah, I just got one data point, X=many, Z=1
        And I need to make an inference for Y anyway

        For example X is the population size, Y is the number of people infected with some hidden bug and Z is the observed number of infections. We only know X and Z, but we would like to say something about Y.

        • I think I finally got it. see below, the frequency inference for Y will be a broad posterior centered on 0.5, it will include samples that make our one data point 10 or 20 or even maybe 100 times less likely than the alternative data point but on average over the posterior not so extreme.

          • Well, you point was that if probability in the median is much smaller than the probability in the mode, something is wrong

            I pointed out to very simple (discrete!) model, that looks appropriate, and produce a distribution with median being much less likely than mode.

            I would imagine in models with high-dimension parameter space you can spread your posterior really thinly and still get a lot of volume.

            But maybe you a right, and chained binomial model is dangerous. It may look familiar, but there is a lot of hidden complexity. The point with X=many, Y=Z=few is a black hole where I lost my inference for a couple of times.

    • Binomials are kind of weird, right, because you really only ever get 1 data point, the data point is a sequence, and all sequences with a given number of successes are considered equally probable. You collapse the problem down to *two* numbers N and s where N is the sequence length and s is the successes.

      So, you see N=1e6 and s=1, you do inference for f, the frequency of success, and evidently find f approximately 1e-6. Now you ask what’s the ratio p(s | N,f)/p(smax|N,f). Since smax = 1, when f=1e-6, this ratio is 1.

    • Ok, I think I’m wrapping my head around your problem now. In addition to the s=1 we have the chained second situation which is NN=1 and ss=1. The posterior mode for ff=1/2 given the uniform prior. the probability p(ss=1 | NN=1,N=1,f=1e-6,ff=0.5) / p(ssmax | NN=1,N=1,f=1e-6,ff=0.5)

      ssmax = 1 or 0 when ff=0.5, they’re the same both are 1/2, so the ratio is also 1

Leave a Reply to Daniel Lakeland Cancel reply

Your email address will not be published.