It turns out you’re just ahead of the game and are dragging me up the mountain as usual ;-)

Anyway, I’ve found the conversation useful because it helped clarify the ideas. In any case, now that we’re talking about the same thing: yes, I agree that the predictive p value for the density calculated over the fixed parameters is invariant to the change in parameters. It’s substantially more computational work to calculate, but I think it would work fine to detect the case “data is in regions that are abnormally low density compared to what we expect under the predictive distribution”

]]>I asked “what’s wrong with the p-value”, you could have answered “nothing” and it would have been faster for you to type and easier for me to understand. :-)

]]>For example, if I have a data point right at the peak-density data value, I want the statistic to always return the same thing; in the case of Srel, that value is 0. With p(data) it doesn’t work that way: if the shape of the distribution changes but your data remains at the peak predictive location, p(x) nevertheless changes, based entirely on the constant needed to renormalize the distribution.

If for all parameter values of interest your data always sits at, say, 1/2 the maximum density, then as you change the shape of the distribution, p(x) can still vary over many, many orders of magnitude even while staying at half the maximum value…

It’s a little like the Reynolds number: if I frame my discussion of drag on a sphere in terms of the fluid used, the velocity, and the size… then I can’t speak to the commonality inherent in certain combinations of fluid, velocity, and size. On the other hand, if I do my drag analysis in terms of the dimensionless ratio rho*v*L/mu, then I can see that whenever this combination is the same, the drag pressure is similar.

The same idea applies to normalizing my statistic in terms of peak data density. Now I have a meaningful reference scale for the density for comparing across multiple parameter values.

I could *for each parameter generated in step 1* generate many predictive data values, and then calculate the p value you mention, and then use this as a measure of atypicality. It’s also dimensionless. Yes, that’s possible.

But if I remove the multiple-orders-of-magnitude changes that could occur due to the normalization constant, I can pool Srel across the whole MCMC sample: I generate just one random predictive data point for each MCMC step, and at the end of the whole sampling procedure I do my analysis on the pooled version. Did Srel typically fall far from 0 compared to where Srelpred typically fell?

does this make sense?

also note that, because there’s a Bayesian posterior, I don’t really care intensely about the precise p value *within a given fixed parameter sample*; what I care about is whether there was a consistent problem, *regardless* of which parameter sample I use. That would indicate that Bayes was unable to find any region of the parameter space that did a good job.

]]>This makes no sense to me. When you change the parameters, the shape of the density function changes, the location of the mode changes, the maximum value changes, the density at the data value changes, the ratio between the two preceding items changes… what exactly do you think is invariant? Hopefully you agree that when you set the parameters using MLE the ratio becomes one.

]]>here’s the proposal so far…

1) sample parameters, calculate Srel for the actual data

2) sample a predictive data point, calculate Srel for predictive data

3) repeat 1,2 until you have sufficient parameter sample

4) plot a density plot of Srel for data and for predictive, visually decide whether these distributions indicate extreme values for Srel for data compared to predictive

5) possibly use some p-value-type calculation to determine whether Srel for the data has some particular mathematical property, such as the probability of exceeding the predictive Srel by more than some value.

since steps 1, 2, and 3 involve traversing the parameter space, the statistic Srel should be invariant to changes in the normalization constant caused by changes in parameters; hence the proposal for -log(p(data|params)/p(datamax|params))

]]>I pointed to a very simple (discrete!) model that looks appropriate and produces a distribution whose median is much less likely than its mode.

I would imagine in models with high-dimension parameter space you can spread your posterior really thinly and still get a lot of volume.

But maybe you are right, and the chained binomial model is dangerous. It may look familiar, but there is a lot of hidden complexity. The point with X=many, Y=Z=few is a black hole where I lost my inference a couple of times.

]]>Of course the p-value calculated based on p(data|param) is identical to the p-value calculated based on -log(p(data|param)/p(datamax|param)) and to the p-value calculated based on any other monotonic function of p(data|param).

]]>In order to understand how far out of the high probability region things really are, we do need to sample from the predictive distribution, and see what the predictive distribution of the Srel statistic is… we can’t just rely on the one realized value, as shown by the spike and slab example Carlos gave.

when we work with p(data|param) itself, due to the normalization constant, the number can be anywhere in [0,inf] but numbers nearer 0 are “either in a low probability region of a concentrated distribution, or in the high probability region of a broad distribution” and numbers near infinity are “in the high probability region of an increasingly concentrated distribution”. Basically it mixes together concentration and high probability region.

So, if you want a number that indicates “relative distance outside the high probability region” which is invariant to the effect of normalization constant due to the varied concentration across the posterior parameter distribution, you want the Srel I think.

]]>A particular thing of interest is the fact that we still are considering the behavior over the entire posterior distribution of the fit… so suppose you have something like data = 1 whose predictive distribution is normal_pdf(data,0,s) and s includes values in the range 100 to 200.

now 1 is very close to 0 relative to 100 or 200. But due to normalization, a distribution with width 100 is 2 times larger in the vicinity of 1 compared to a distribution with width 200.

so in my proposal -log(p(1 | params)/p(0 | params)) will be approximately 0 for the entire range of s from 100 to 200, whereas p(1 | params) itself varies by a factor of 2 as s ranges over 100 to 200…
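To make the numbers concrete, here is a minimal sketch of this example (the normal model and the range of s are from the text; the helper function is my own):

```python
import math

def normal_pdf(x, mu, s):
    """Density of Normal(mu, s) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

for s in (100.0, 200.0):
    p1 = normal_pdf(1.0, 0.0, s)                    # raw density at the data point
    srel = -math.log(p1 / normal_pdf(0.0, 0.0, s))  # anchored to the mode: = 1/(2 s^2)
    print(f"s={s}: p(1|s)={p1:.6f}  Srel={srel:.2e}")
# p(1|s) halves (~0.00399 -> ~0.00199) while Srel stays essentially 0
# (5.00e-05 -> 1.25e-05).
```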

Now, in my opinion, the -log(p/pmax) captures the idea “this is very close to its maximum density region,” whereas p(x) tends to compare apples to oranges. It fails to have the symmetry I think you’d want to answer the questions of interest. On the other hand, the sensitivity to the vagueness of the distribution may be desirable in some settings, so I feel like there’s potential to argue for either one. That’s particularly the case when the data itself has already been reduced to a dimensionless form by forming a dimensionless group. If the data itself has units, I think it’s less good to work directly with the density.

]]>Daniel – I’m also talking about this BTW

]]>Now that I understand let me consider the idea and see if I can notice any important differences.

]]>Definitely that is NOT what I’m suggesting.

IT’S NOT ABOUT “DISTANT” REGIONS, IT’S ABOUT “LOW-DENSITY” REGIONS.

CALCULATE A P-VALUE (USING THE PROBABILITY DENSITY AS STATISTIC)

]]>So, you’re suggesting rather than considering what is the probability to be sitting at a certain depth in density, I should just ask “what is the probability to be to the left (or using 1- to the right) of the data point” and that this somehow is a more useful way to compare where a data point fell vs where it was expected to fall.

I guess I see questions like “were we exceedingly far to the left” or “were we exceedingly far to the right” are nonlocal questions. The question I want to know is “are data points within epsilon of this point particularly weird compared to our predictive distribution’s expectation for the probability of epsilon neighborhoods”. Epsilon neighborhoods often have a very good interpretation because as we said above all observations / measurements are actually discrete, and so there is a distinguishing value for epsilon that actually corresponds to a physical quantity, the “least count” of the measurement instrument.

]]>It really seems much simpler to calculate a p-value using the shape of the probability distribution (and not just the maximum value) as suggested by ojm and myself. Then you don’t need to simulate data using the probability distribution to see what is the expected distribution of this quantity… because it’s a p-value!

Compare

1) calculating some kind of relative entropy thing

2) comparing its realized values to what the model’s predictive distribution expects the realized values to be.

With

1’) calculating a p-value

2’) comparing with the expected distribution, which is uniform on [0, 1]
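A quick simulation sketch of 1’)/2’), assuming a toy standard-normal model (the model choice and sample sizes are my own illustrative assumptions, not anything from the discussion):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: standard normal, statistic = density at the point. Then the
# p-value P(p(Y) <= p(y0)) equals P(|Y| >= |y0|), estimated here by Monte Carlo.
ref = np.abs(rng.normal(size=20_000))        # reference draws from the model
y0 = rng.normal(size=500)                    # "data" drawn from the model itself
pvals = np.array([np.mean(ref >= abs(y)) for y in y0])

# Under the model these p-values are (approximately) uniform on [0, 1]:
print(pvals.mean())                          # close to 0.5
print(pvals.std())                           # close to 1/sqrt(12) ~ 0.289
```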

]]>sampling distributions are about where data actually is found… so for example you could resample your data to see what a new sample of data might look like in the future under some kind of stationarity/independence/representativeness assumption

the predictive distribution however, is about *where your model **thinks** the data will be found* rather than where it actually is found in repeated sampling.

This proposal is about comparing “how deep into the predictive distribution the data actually was” (or how high in potential energy it was, or how many bits of entropy change it had or something similar) relative to what the model predicts for it.

so I think it has two components:

1) calculating some kind of relative entropy thing, -log(p(data|params) / p(datamax | params))… This is an “anchored log density”; it doesn’t have the problem of taking the logarithm of a quantity with units.

2) working with this relative entropy and comparing its realized values to what the model’s predictive distribution expects the realized values to be.

none of it relies on frequency properties, which, say, a bootstrapping procedure would rely on.

]]>Yes, I was mixing things a bit, my clarification has crossed with your comment.

>if you’re suggesting that I take samples of the value -log(p(data | params)) and just calculate tail areas in this “entropy space” then perhaps we’re not so far off from each other in the first place.

That is what I suggested already in our previous discussion ( https://statmodeling.stat.columbia.edu/2019/03/20/retire-statistical-significance-the-discussion/#comment-1006108 ):

“It seems you would like to check if the measurement is in the HDR of the model distribution, and the more straightforward way to do so seems to be to calculate a p-value (using the probability density as statistic). I can’t see a satisfactory solution not involving tail areas.”

]]>fundamental to my intuition is that I want a measure of goodness of fit which respects the local notion of “probability to be in the vicinity of where the data was actually found,” rather than things like “probability to be farther out in the tail of the data space than where the data was,” which relies on properties of our model involving data *infinitely* far away from wherever our data is, and also on properties of the tail of our model, which is usually not the most reliable portion of our data model.

-log(p(data | params)/p(datamax | params)) is just a way to make this notion invariant to units of measure and interpretable in a reasonable way; what to do with it is, I think, still very much up for grabs. If you suggest calculating “probability in the MCMC sample that Sdata > Scomp,” as defined in my previous post, I can see the utility in that.

]]>Or probability of the data given the model; I’m not sure which is the thing under discussion. The point is that there is no issue with multi-modal distributions, and there is no issue with spiked distributions either. To put a point in the context of the probability distribution, looking at the prevalence of higher-density regions seems more natural than looking just at the density relative to the maximum value. The latter doesn’t have a clear interpretation; if I understand correctly, you suggest putting it into the context of a sampling distribution to give it some meaning…

]]>generate an MCMC sample from the posterior:

Calculate

Sact = -log(p(Data|params)/p(Datamax|params))

generate a random data point from the predictive distribution Dpred

calculate

Spred = -log(p(Dpred|params)/p(Datamax|params))

now in your final sample you can do things like mean(Sact) and mean(Spred), you can plot the distribution of Sact - Spred, and you can even calculate the probability that Sact > Spred, which is a kind of one-sided p value in “entropy space”
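Here is a minimal runnable sketch of these steps, assuming a toy normal observation model and a fabricated stand-in for the MCMC posterior sample (every name and number is illustrative, not the actual model under discussion):

```python
import numpy as np

rng = np.random.default_rng(0)

def srel(x, mu, sigma):
    """-log(p(x|params)/p(xmax|params)) for a normal observation model.
    The mode is at x = mu, so the normalization constant cancels and
    Srel = (x - mu)^2 / (2 sigma^2)."""
    return (x - mu) ** 2 / (2 * sigma ** 2)

data = 1.7  # a single observed data point (illustrative)

# Stand-in for the MCMC posterior sample of (mu, sigma); in a real analysis
# these would come from the sampler.
mu = rng.normal(0.0, 0.5, size=4000)
sigma = np.exp(rng.normal(0.0, 0.3, size=4000))

s_act = srel(data, mu, sigma)            # Sact, one value per posterior draw
d_pred = rng.normal(mu, sigma)           # one predictive draw per posterior draw
s_pred = srel(d_pred, mu, sigma)         # Spred

print(s_act.mean(), s_pred.mean())
print(np.mean(s_act > s_pred))           # one-sided p value in "entropy space"
```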

]]>Note that I’m not talking about p-values based on the statistic “value of the parameter” or “distance to some central point”, or anything like that. I’m talking about p-values based on the statistic “p(x) = likelihood of the parameter given the data”. It’s not about “distant” regions, it’s about “low-density” regions.

]]>The method flags individual data points that indicate discrepancy between model expectations and actual data collected in a way that works regardless of the shape of the distribution, particularly even if the distribution is multi-modal.

Furthermore, the evaluation method is “local”, that is it compares the probability to be in a local neighborhood of the actual data point to the expected probability to be in a local neighborhood of a randomly selected predicted data point, and it uses a measure that is on an interpretable scale and has no dependence on the units of measurement.

I think those properties are good, whereas I think a p value based on a tail area asks questions about probability to be in “distant” regions of space that are irrelevant.

To me it has a flavor of Lebesgue vs. Riemann integration. My proposed measure asks about being in level sets of the probability density, whereas tail areas ask about the probability to be in a continuous region of the data space that crosses potentially *many* level sets, some of which have vastly different probability densities.

I’m not claiming a fully worked-out idea here, but I think the distinction matters: “how likely is it for a data point to sit at a given ‘depth’ (or potential energy, in the HMC formalism)” is a more interesting question than “how much probability is contained farther away in Euclidean distance from the mode than this data point,” or some such thing you can interpret a p value as meaning.

]]>suppose we simulated a few hundred draws from the distribution and then calculated the mean of the log of the ratio. This is a measure like differential entropy. We then compare our actual quantity to this reference quantity. It turns out this may also help with transforms of variables, as the reference level will now change under the transformation.

in a high-entropy scenario we are less surprised by extreme ratios; in a low-entropy scenario we are more unhappy with extreme realizations.

]]>I really don’t know in what sense you think this “measure” is better (apart from being easier to compute) than the one based on tail areas (which doesn’t have the problem discussed in the previous paragraph).

]]>oh, yes, you show that wide distributions will generally be OK with data over a wide range. I think that’s a feature.

Carlos, in my proposal I am imagining conditionalizing on individual data points. So, for example, if you have a timeseries with 1000 time points, you wouldn’t look at the entire sequence relative to the highest-probability sequence; you’d look at time point 1, then 2, then 3, etc. Are there individual time points whose dx neighborhood is thousands of times less probable than the max probability for that time point? This avoids the “curse of dimensionality” issues
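As a sketch of this per-time-point idea, assuming a toy AR(1)-style series where each point’s conditional mode is available in closed form (all names and numbers are my own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy timeseries model: y[t] ~ Normal(phi * y[t-1], sigma). Rather than scoring
# the whole 1000-point sequence against the single highest-probability
# sequence, score each time point against its own conditional maximum density.
phi, sigma, T = 0.9, 1.0, 1000
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.normal(0.0, sigma)

# Per-time-point Srel: the conditional mode is phi*y[t-1], so the
# normalization constant cancels and Srel[t] = resid[t]^2 / (2 sigma^2).
resid = y[1:] - phi * y[:-1]
srel_t = resid ** 2 / (2 * sigma ** 2)

# Flag time points whose dx-neighborhood is ~e^8 times less probable than
# that time point's conditional maximum:
print(np.flatnonzero(srel_t > 8.0))
```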

]]>And I need to make an inference for Y anyway

For example X is the population size, Y is the number of people infected with some hidden bug and Z is the observed number of infections. We only know X and Z, but we would like to say something about Y.

]]>Putting aside the re-parameterisation issue for now (e.g. assuming it can be ‘solved’ by discretisation), I think the ‘level set’ p-value possibly works OK for this case, being equal to 1 minus the probability of the high-density region that just excludes/includes the observed data.

This is in (possible) contrast with the modal approach, though I’m not sure how this case behaves under discretisation…

]]>Ok. So you can have a model in representation A (wavelength), you make a probability assignment, and you conclude that the model is a failure because it doesn’t match the data well. But I can choose a different representation B (frequency), choose as my probability assignment the transformation of yours, and conclude that my model is a success because it matches the same data well. That seems a problem to me, it seems a nice property to you, and we can agree to disagree.

Unrelated to reparametrizations: what about the problem we discussed some time ago, that many distributions are impossible to model successfully (according to your method of assessment) when most of the probability lies in low-density regions?

An example discussed often around here are high-dimensional Gaussians, which are “concentrated” (I don’t really like that way of putting it) in the low-density areas far from the center.

]]>Suppose you’re testing an essentially unrestricted (but discrete) family of models for ‘self-consistency’ with the observed data y0 by using consistency(theta, y0) = p(y0;theta)/ (sup_y [p(y;theta)]) for each theta in turn.

Then both the fully deterministic model:

p(y) = 1 if y = y0,

p(y) = 0 otherwise

and the ‘uniform randomness’ model:

p(y) = 1/n for a set of n points including y0

p(y) = 0 otherwise

are *equally consistent* with the data, i.e. consistency(theta, y0) = 1 for both models.
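This can be verified directly; a minimal sketch with an assumed n = 10 and y0 = 3 (both values are my own illustrative choices):

```python
# The two discrete models from the example, represented as dicts y -> p(y).
def consistency(p, y0):
    """p(y0; theta) / sup_y p(y; theta) for a discrete model."""
    return p.get(y0, 0.0) / max(p.values())

y0, n = 3, 10
deterministic = {y0: 1.0}                      # p(y) = 1 iff y == y0
uniform = {y: 1.0 / n for y in range(n)}       # p(y) = 1/n on n points incl. y0

print(consistency(deterministic, y0))          # 1.0
print(consistency(uniform, y0))                # 1.0 -- equally "consistent"
```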

]]>We’ve been over that before. Again I think of this as a provisional calculation: if these are the only models we’re considering… what does that imply. The goal of projects like calculating my suggested p(data|model)/p(datamax|model) is specifically to highlight instances where what’s possible within the model fails in a meta-statistical way. I think of this as the Godel incompleteness analog of statistics. We always have to look at a meta level to discover that we need a broader model.

]]>You don’t have to interpret a probability model in terms of frequency, but you *are* constrained by the math to accept additivity when working with probability as a ‘plausibility’ measure.

I’m much more comfortable accepting additivity over observable events in general – e.g. either A or not A happens (in general anyway…) – while *much* less comfortable accepting additivity of ‘plausibility’ measures over *models* or theories etc.

Others are too – hence e.g. the existence of non-additive plausibility measures. But we’ve discussed this enough in the past….

]]>If you have a frequency model, then the frequency is a fact about the world, which you could in principle verify by data collection: does the data fill up the histogram shape of the distribution? Furthermore, the invariance under “reparameterization” of the data would be implied by the mathematics… if the frequency of being in [a,b] is such and such, then the frequency of being in [f(a), f(b)] needs to be the same thing in the transformed measurement space; this is implied by the math.

The Bayesian model on the other hand is not a fact about the world, but a fact about our willingness to make predictions. The space in which we assign the probability is special, because it’s the space where we have sufficient information to assign probability to fixed width slices of the space. Probabilities in alternative spaces are mechanically imposed by our initial assignment in whatever space we assigned our probabilities.

If we want to assess how well we did in modeling, the best way to construct a measure of “good modeling” is going to be *in the space where the model probabilities were assigned*.

of course, you could do a mathematically equivalent calculation in a different transformed space, but it will need to be equivalent to the above idea if you want it to assess the appropriate thing, in which case as Carlos points out it will require a ratio of Jacobians.

In many, many fields, we don’t have pure, precise, invariant mathematical relationships between quantities in the world other than unit conversions, like wavelength * frequency = constant for all observers.

For example, you might easily have mean(annual pretax income) = f(some covariates) + errors, and mean(monthly posttax income) = g(some covariates) + errors, but while the conversion from annual to monthly has some precise mathematical relationship, the conversion from pretax to posttax does not have a precisely understood mathematical relationship. Inevitably our probability assignments in these two modeling scenarios will not be mathematically compatible.

]]>“Invariant P-values for Model Checking”

by Michael Evans and Gun Ho Jang

https://projecteuclid.org/download/pdfview_1/euclid.aos/1262271622

]]>‘Invariant Procedures for Model Checking, Checking for Prior-Data Conflict and Bayesian Inference’

by Gun Ho Jang (and supervised by Mike Evans!).

See here: https://tspace.library.utoronto.ca/bitstream/1807/24771/1/Jang_Gun_Ho_201006_PhD_thesis.pdf

This defines p-values for discrete distributions in terms of the probability-mass ordering I mentioned above. It then takes the ‘all observations are discrete’ route and considers the continuous case as an approximation to the discrete definition. This effectively allows them to give an invariant p-value for continuous distributions via an approximation process.

So, as mentioned above, if you take this ‘discrete observations’ route, both the ‘tail area’ (defined via a probability mass ordering) and the ‘plausibility’ p(y0)/ (sup p(y)) are invariant under 1-1 changes of variable.

I’m still not super happy with the ‘elegance’ of how this ‘every observation is discrete’ is treated here or elsewhere, but it seems like one could probably tidy it all up somewhat.

[PS: while I’m mostly OK with ‘all observations are discrete’, properly tidied up, I’m definitely anti ‘all theoretical constructs are discrete’ and/or assuming additivity over theoretical constructs. Hence probability distributions over observations ~ mostly OK, while probability distributions over models/theories etc. ~ not in general.]

]]>m = F(covariates) + errorm

and

h = G(covariates) + errorh

Now, I can assign an error distribution on m … pm(errorm) and it will immediately mathematically imply a density on errorh, or I can assign error distribution on h ph(errorh) and it will imply a density on errorm.

to assess the quality of the model, I should compare errors in the space where the probability assignment is made, so if I assign error probability in errorm I should compare my actual errors to predicted errors in the errorm space.

]]>I think the key thing I’m actually saying is that in order to assess a model, we need to assess its probability assignment, and it’s not really the density ratios that we need to calculate, it’s the ratio of the probability mass assignments in a fixed sized vicinity of the space on which we assigned the probability, ie. the space in which our information lies.

If we work in some other space, we need to back out the quantities into the space in which the assignment was made in order to decide how well we’ve done in creating our models.

So the rule would be not “p(q|params)/p(qmax|params)” for any pushforward measure over any q derived from data d, but rather p(d|params)/p(dmax|params) in the actual space in which we made our decisions about how to assign probability.

If you’re working in space q which is a nonlinear transform of data d where you constructed your model, you should back q into d to assess the goodness of your modeling.

]]>I’m not clear on whether length, and/or time, really is a discrete quantity underneath. It seems likely that mass is. But regardless of whether the underlying constructs are discrete or continuous, the most sensitive measurement instruments we have available today are something like 32-bit A/D converters:

obviously it becomes exponentially harder to add bits, so let’s suppose we get really, really good at it… we’re still unlikely ever to have something like accurate 64-bit A/D converters. A coulomb of electrons is 6.242e18, whereas 2^64 is 1.84e19, so we’d have to be able to count every electron entering a millifarad capacitor as we charge it to over 3000 volts to get an accurate 64-bit A/D converter.

so let’s say we never get better than 48 bits ~ 2.8e14; it’s pretty obvious that these are discrete measurements.
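A quick sanity check of that arithmetic (the physical constants here are approximate):

```python
# Electrons per coulomb (approximate: 1/elementary charge).
ELECTRONS_PER_COULOMB = 6.242e18

charge = 1e-3 * 3000                   # Q = C*V: a millifarad capacitor at 3000 V -> 3 C
electrons = charge * ELECTRONS_PER_COULOMB
print(f"{electrons:.3e}")              # ~1.873e+19 electrons to count
print(f"{2**64:.3e}")                  # ~1.845e+19 levels for 64 bits
print(f"{2**48:.3e}")                  # ~2.815e+14 levels for 48 bits
```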

As for the non-invertible case: well, we just need to provide plausibility assignments over F(T(x)). And this model *is different* from a model placed directly on T(x); we just need to acknowledge that.

]]>Side note – re invariance for discrete vars under non-1-1 transformations. While probability is known to be non-invariant in this setting, there is a sense in which *possibility* remains invariant even for non-1-1 transformations. See eg 7.4 here:

https://arxiv.org/abs/1801.04369

Either might be desirable depending on circumstance imo.

]]>For the same physical reality, the same physical model, and the same physical measurements, I think it would be desirable that we not arrive at different conclusions depending on the representation chosen. It seems to me that you’re renouncing this elementary invariance property for no reason.

I don’t think so. In fact, I think exactly the opposite. OJM’s point about the discretization of measurement instruments provides the connection, I think. Even if an underlying quantity is effectively continuous (i.e. maybe it’s quantized at the electron level but we’re measuring coulombs’ worth of electrons), the quantity coming out of a measurement instrument always has, effectively, a “least significant digit” and is rounded to that digit (or binary digit or whatever). A density over measurement-instrument outputs is always really a device for assigning probability mass to discrete measurement quantities.

Sometimes we have explicit rounding and wind up using CDF values to handle it; other times we have pretty fine-grained measurements and it’s sufficient to approximate CDF(data + ddata) - CDF(data) as pdf(data) * ddata.

but if someone, say, reports the logarithm of an instrument output to you, you should invert that back to the measurement space and calculate the probability mass associated with that discrete measurement outcome. Measurement-instrument models are always discrete.

]]>it turns out that ddata is invariant and cancels. If instead we receive an invertible function of the data, like f(data), we can invert via finv(f(data)) and get back the underlying measurement, where we have a discretized space, and therefore respect the information we have about the discrete measurement instrument in an invariant way. People often work with this kind of model when there is explicit “rounding” to annoyingly few digits.
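A minimal sketch of this invert-then-compute-mass idea, assuming a toy Exponential(1) measurement model with a least count of 0.01 (every value here is an illustrative assumption):

```python
import math

# Toy measurement model: data ~ Exponential(1), reported with a least count
# of 0.01, so the probability mass of a reported value d is CDF(d+dd)-CDF(d).
dd = 0.01
cdf = lambda x: 1.0 - math.exp(-x)

d = 2.37                                   # the reported measurement
mass_direct = cdf(d + dd) - cdf(d)

# If we are instead handed f(data) = log(data), we invert back to the
# measurement space and compute the same discrete mass:
f_val = math.log(d)
d_back = math.exp(f_val)                   # finv(f(data))
mass_via_f = cdf(d_back + dd) - cdf(d_back)

print(mass_direct, mass_via_f)             # equal up to floating-point error
```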

But let’s consider the case where Carlos might come along and say, “hey, what about when the function is non-invertible?” For example, suppose I have a model for temperature as a function of space x, so I have T(x), and what’s reported to me is the average of T(x) over several known locations x[i].

F(T(x)) is non-invertible, because there are many T(x) compatible with having a given average at the x[i]. So now what?

The thing is, in this case we’ll need to provide a plausibility assignment over the F(T(x)) values, and this will imply some plausibility assignment over the T(x) functions, and we’ll just have to accept whatever it is; in essence, T(x) is now a parameter, even though in principle we could have been given vectors of direct measurements T(x[i]).

when we collect different data, we will wind up with different models.

]]>