Michael McLaughlin sent me the following query with the above title.

Some time ago, I [McLaughlin] was handed a dataset that needed to be modeled. It was generated as follows:

1. Random navigation errors, historically a binary mixture of normal and Laplace with a common mean, were collected by observation.

2. Sadly, these data were recorded with too few decimal places so that the resulting quantization is clearly visible in a scatterplot.

3. The quantized data were then interpolated (to an unobserved location).

The final result looks like fuzzy points (small scale jitter) at quantized intervals spanning a much larger scale (the parent mixture distribution). This fuzziness, likely ~normal or ~Laplace, results from the interpolation. Otherwise, the data would look like a discrete analogue of the normal/Laplace mixture.

I would like to characterize the latent normal/Laplace mixture distribution but the quantization is “getting in the way”. When I tried MCMC on this problem (using JAGS) and simply ignoring the quantization, it turned out that the “signal” coming from the jitter overwhelmed that due to the underlying parent mixture so that the MCMC process did not “see” the normal/Laplace distribution very well. That is, it returned unrealistic parameters for it, with values more in tune with the jitter than with the parent distribution. In other words, it thought it was modeling the jitter.

So I cannot ignore the quantization. Unfortunately, I have not found any way to incorporate it into a valid MCMC model either. I’ve searched everywhere I can find but have uncovered no examples for treating quantization error corrupting a parent distribution.

Do you, or your readers, have any suggestions?

My reply: I think this should work fine in Stan (or any other Bayesian software) if you just model the steps 1, 2, 3 directly:

For step 1, you have your mixture model. You just need to put an informative prior distribution on the parameters of the mixture components to get a stable estimate.
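The step-1 mixture density can be sketched directly. Here is a minimal Python version, assuming (as the query states) a normal and a Laplace component sharing a common mean; the parameter names `w`, `mu`, `sigma`, `b` are my own notation, not from the original data set:

```python
import numpy as np
from scipy import stats

def mixture_logpdf(z, w, mu, sigma, b):
    """Log density of the two-component mixture
    w * Normal(mu, sigma) + (1 - w) * Laplace(mu, b),
    with the common mean mu described in step 1."""
    log_norm = np.log(w) + stats.norm.logpdf(z, loc=mu, scale=sigma)
    log_lap = np.log1p(-w) + stats.laplace.logpdf(z, loc=mu, scale=b)
    # logaddexp combines the two components on the log scale stably
    return np.logaddexp(log_norm, log_lap)
```

In Stan or JAGS the same density would be written with the built-in normal and double-exponential distributions, with informative priors on `w`, `sigma`, and `b`.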

For step 2, just model the quantization: if the data from step 1 are z_i, and the rounded data are y_i, then set up a model for y|z, maybe just a simple rounding model. The z's are now missing data. Or in this case (a mixture of normal and Laplace densities for z), you should be able to simply integrate out the missingness to get a probability distribution for the rounded data, y.
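Integrating out the latent z over each rounding bin is just a difference of mixture CDFs: if y is z rounded to the nearest multiple of h, then P(y) = F(y + h/2) - F(y - h/2). A sketch, with the mixture parameterization (`w`, `mu`, `sigma`, `b`) and grid width `h` chosen for illustration:

```python
import numpy as np
from scipy import stats

def rounded_loglik(y, h, w, mu, sigma, b):
    """Log-likelihood of data y recorded on a grid of width h,
    integrating the latent z over each rounding bin:
    P(y) = F(y + h/2) - F(y - h/2), where F is the mixture CDF."""
    def cdf(x):
        return (w * stats.norm.cdf(x, loc=mu, scale=sigma)
                + (1 - w) * stats.laplace.cdf(x, loc=mu, scale=b))
    p = cdf(y + h / 2) - cdf(y - h / 2)
    return np.sum(np.log(p))
```

This is exactly the "interval" likelihood that Stan's or JAGS's rounding/censoring constructions compute; no z's need to be sampled at all.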

For step 3, I don’t quite know what you mean by “interpolated to an unobserved location.” Again, though, this is just some process that can be modeled.

The moral of the story is: likelihood inference really works! The trick is to model the data and then do the inference, rather than trying to jump directly to create an estimate from the data.

Maybe a simpler way to see it would be as ABC (approximate Bayesian computation).

Generate parameters from the prior, generate the data, and then condition on appropriately quantized and _interpolated to location_ data.

The target posterior is the correct one for that data process, and you could get a simulation approximation for the likelihood ~ c * posterior/prior. (You could even split these up into fully accurate data, quantized data, and _interpolated to location_ data.)

Doing it analytically is much better, and likely the only feasible way for the full data set, but running the simulation on just a small subset might help clarify and guide the analytical approach.
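The ABC idea above can be sketched on a small simulated subset. Everything numeric here is an illustrative assumption (mixture weight 0.7, scales 1.0 and 1.5, grid width 1.0, a summary-statistic tolerance of 0.1, and inference on the common mean only, with the other parameters held fixed):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(z, h):
    """Round to the nearest multiple of the grid width h (step 2)."""
    return np.round(z / h) * h

def simulate(mu, n):
    """Step 1: draw n points from the normal/Laplace mixture
    (weight, scales fixed at assumed values for this sketch)."""
    is_norm = rng.random(n) < 0.7
    return np.where(is_norm, rng.normal(mu, 1.0, n),
                    rng.laplace(mu, 1.5, n))

# "Observed" quantized data, generated with a known mu = 2.0.
h, n = 1.0, 200
y_obs = quantize(simulate(2.0, n), h)

# Rejection ABC: draw mu from the prior, push it through the full
# generative process (mixture -> quantization), and keep draws whose
# quantized data match the observed data in their mean.
accepted = []
while len(accepted) < 200:
    mu = rng.normal(0.0, 5.0)                 # prior draw
    y = quantize(simulate(mu, n), h)
    if abs(y.mean() - y_obs.mean()) < 0.1:    # tolerance on the summary
        accepted.append(mu)
post_mean = np.mean(accepted)
```

Because the accepted draws condition on the quantized data rather than ignoring the rounding, `post_mean` recovers the parent-mixture mean rather than chasing the jitter.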

But the principle is really simple, and it has no choice but to work!

See the following:

Fernandez, C. and Steel, M. F. J. (1998) On the dangers of modelling through continuous distributions: A Bayesian perspective, Bayesian Statistics 6, pp. 213-238.

Radford:

I’ll keep this paper as an example that you can publish a 15+ page journal paper about absolutely nothing – worrying about measure zero in applications ;-)

I don’t think Cox and Hinkley spent more than a paragraph or two adequately dealing with it – though, as we know, even really smart people forget about it.

Models are representations of what is being represented – and what is being represented does not immediately or always inherit the properties of the representation. Here, probability-generating models represent observations; continuous ones are very, very convenient, but the observations cannot really be continuous with infinite precision.

Now it may not be very convenient when the continuous assumptions cause problems and need to be fixed up, but I don’t think that’s the case here?

I think this might be relevant:

Alston, Clair & Mengersen, Kerrie (2010) Allowing for the effect of data binning in a Bayesian Normal mixture model. Computational Statistics and Data Analysis, 54(4), pp. 916-923.