https://pubmed.ncbi.nlm.nih.gov/16446352/ ]]>

Cool – thanks for letting me know!

Not surprised, as the math of likelihoods and priors is the same.

The paper is on my list to read.

Also reminds me of courses I took from Don Fraser many years ago.

]]>For likelihood-based models, this is exactly the marginal likelihood, but marginalized w.r.t. the Haar measure on the group of transformations under consideration.

]]>> Information is lost, but the model does better.

I think the model is doing better because useful information is being brought in by the prior specification.

Now with two competing prior specifications, one analytical and the other through augmented data: if the augmented-data one would converge to the analytical one given exhaustive enough “examples”, then information is lost by using the augmented-data one. So the model does better with the augmented one, but not as well as it would with the analytical one.

But one needs to keep in mind Daniel’s warning about the “human information processing” issue, which suggests the analytical one will likely be misspecified: https://statmodeling.stat.columbia.edu/2019/12/02/a-bayesian-view-of-data-augmentation/#comment-1200514

]]>x is not random, it’s a realized thing, and ax is also a realization. This information theoretic measure is a property of two distributions, not a property of a particular sample.

To the extent that we have a Bayesian model, u is a distribution that induces a predictive data distribution D*, and potential future observations are a distribution D, so the relevant information-theoretic calculation is I(D*,D): the posterior predictive data vs. the real future data.

]]>Ah right, although now that I think of it, the requirement that the transformation be independent of x is spurious.

One only needs the process to be a Markov chain.

“the mapping from x to ax is not a random variable”

a -> ax *is* a random transformation, no? Say when adding random Gaussian noise to each pixel.

The same holds if you rotate the picture by a random amount.

Moreover, if the transformation is deterministic, then we trivially achieve the upper bound in the inequality: we haven’t reduced the mutual information, but we haven’t increased it either.

Therefore, the usefulness of data augmentation can’t be explained in terms of “information” in the information-theoretic sense.

I don’t think this is correct. The mapping from x to ax is not a random variable, and the augmentation procedure obviously *does* depend on x if you take x and rotate it, or take x and add noise to it.

]]>Thanks (I’m anon, forgot to login above)

Alex Ameni clarified properly what I meant.

One can view the transformation from data x to augmented data ax to parameters u as a Markov chain x -> ax -> u.

(Provided the choice of augmentation procedure is not conditioned on x, and it isn’t when we just rotate or add noise indiscriminately.)

When the Markov property holds, there’s a strict sense in which information is lost along the chain.

As a result, I’m not sure thinking in terms of “information” is the right angle. Information is lost, but the model does better. Perhaps the connection between prior and regularization is a more fruitful avenue.

]]>Practically speaking, for high dimensional parameters, I think data augmentation is far far more likely to converge to something we can believe in than any amount of tweaking an analytical expression could do.

A big part of that is that specifying dependencies in high dimensions is very tricky, but they fall out automatically when you use a data-augmentation method.

A simple example: specify a prior over Fourier coefficients that gives you functions on [0,1] which increase rapidly for x in [0,.5] up to a maximum near y=1 and stay flat out to x=1.

One method: start with a very broad prior on all the coefficients, then create some fake x,y values that have the properties you want, and define a normal(y,.25) iid model over the fake measurements. Then calculate the derivative at each x value and strongly downweight the coefficients if they produce any negative derivatives, and see what the posterior looks like. It will have a high probability for fairly smooth monotonically increasing functions. Good luck doing that analytically.
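A rough sketch of that recipe, using plain importance weighting of broad-prior draws rather than a real sampler. The number of harmonics, the cos/sin(pi k x) basis, the fake points, the prior scale of 2, and the penalty of 5 per negative slope are all made-up choices for illustration; only the normal(y,.25) fake likelihood and the idea of penalising negative derivatives come from the description above.

```python
import math
import random

random.seed(0)

K = 4                                  # number of harmonics (assumed)
GRID = [i / 20 for i in range(21)]     # where slopes are checked (assumed)

# Hypothetical fake data: rises quickly on [0, 0.5], then flat near y = 1.
FAKE = [(0.0, 0.0), (0.25, 0.6), (0.5, 1.0), (0.75, 1.0), (1.0, 1.0)]

def f(coef, x):
    """Truncated Fourier series a0 + sum_k a_k cos(pi k x) + b_k sin(pi k x)."""
    a0, a, b = coef
    return a0 + sum(a[k] * math.cos(math.pi * (k + 1) * x)
                    + b[k] * math.sin(math.pi * (k + 1) * x)
                    for k in range(K))

def draw_broad_prior():
    """One draw of (a0, a, b) from a deliberately vague normal(0, 2) prior."""
    return (random.gauss(0, 2),
            [random.gauss(0, 2) for _ in range(K)],
            [random.gauss(0, 2) for _ in range(K)])

def energy(coef):
    """Negative log of the fake-data weight, up to an additive constant."""
    # normal(y, .25) iid model over the fake measurements
    sse = sum((y - f(coef, x)) ** 2 for x, y in FAKE)
    # finite-difference slopes on the grid; penalise every negative one
    vals = [f(coef, x) for x in GRID]
    n_negative = sum(1 for lo, hi in zip(vals, vals[1:]) if hi < lo)
    return sse / (2 * 0.25 ** 2) + 5.0 * n_negative

draws = [draw_broad_prior() for _ in range(5000)]
energies = [energy(c) for c in draws]

# Subtract the minimum before exponentiating to avoid underflow to all zeros.
e_min = min(energies)
raw = [math.exp(-(e - e_min)) for e in energies]
total = sum(raw)
weights = [w / total for w in raw]

prior_mean_energy = sum(energies) / len(energies)
weighted_mean_energy = sum(w * e for w, e in zip(weights, energies))
```

The reweighted draws play the role of the squeezed prior: they concentrate on coefficient vectors whose curves pass near the fake points and rarely slope downward. With only a few thousand draws most of the weight lands on a handful of samples, so for real work you would put the same two terms into the log density of a proper sampler.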

Realistically speaking I think we’re better off talking about doing a good job of choosing your augmented data than we are discouraging this kind of thing. It’s a hugely valuable tool to impose real world knowledge on a problem.

]]>Daniel:

My p.s. presumed the analytical prior could be specified without error and that P(u) * P(ax|u) would eventually converge to that with extensive enough data augmentation. In that sense, I would argue the analytical prior is more satisfying.

Now the prior predictive for fake ax would be a nice way to check on the analytical prior specification.

Now the rest of your comment does seem consistent with the equations I gave in my post. If it’s not, please let me know.

]]>correction: where the *prior predictive* density of air pollution

]]>Actually adding fake data *is* encoding knowledge into the prior only.

The posterior is (proportional to): p(data | parameters) p(parameters)

where p(parameters) is any normalizable non-negative function encoding your information about the parameters.

so, let p(parameters) = p_pre(parameters) * p_fake(fake_data | parameters)

where p_pre(parameters) is a valid density encoding only vague bounds, and p_fake(fake_data | parameters) is a masking function that squishes the density in regions of parameter space that are far away from predicting fake_data (in particular you can for example provide a different standard deviation than for your real data, so p_fake need not be the same function as p(data|parameters) for your measurements)

now p(parameters) is proportional to a valid density, and it encodes your knowledge that the parameters are within some bounds and tend to predict things like fake_data…

QED

It *is* possible to do this badly, just as it’s possible to do other ways of encoding priors badly. But this is actually a good way to encode priors.

Remember the case that Dan Simpson mentioned a while back where the density of air pollution was somewhere between the density of concrete and the density of neutron stars? Well if he’d just put some fake data where it was around 1e-6 to 1e-4 times the density of air… he’d have squished his priors into line with his knowledge, without having to tune individual hyper-priors or whatever…
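Here is a minimal grid sketch of the factorisation p(parameters) = p_pre(parameters) * p_fake(fake_data | parameters) for a single location parameter mu. The bounds, the fake and real data values, and both standard deviations are invented for illustration; the only thing taken from the argument above is the structure, including the point that the fake data can use a different standard deviation from the real data.

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Grid over the parameter; p_pre only encodes vague bounds on it.
grid = [-10 + 20 * i / 2000 for i in range(2001)]

def p_pre(mu):
    return 1.0 if -10 <= mu <= 10 else 0.0

fake_data = [1.0, 2.0, 3.0]        # hypothetical "things the model should predict"
FAKE_SD = 2.0                      # looser than the real measurement error
real_data = [2.4, 2.9, 3.1, 2.6]   # hypothetical measurements
REAL_SD = 0.5

def p_fake(mu):
    """Masking function: squishes density far from predicting fake_data."""
    return math.prod(normal_pdf(y, mu, FAKE_SD) for y in fake_data)

def likelihood(mu):
    return math.prod(normal_pdf(y, mu, REAL_SD) for y in real_data)

def normalise(values):
    total = sum(values)
    return [v / total for v in values]

def mean_and_sd(probs):
    m = sum(p * mu for p, mu in zip(probs, grid))
    var = sum(p * (mu - m) ** 2 for p, mu in zip(probs, grid))
    return m, math.sqrt(var)

vague = normalise([p_pre(mu) for mu in grid])
prior = normalise([p_pre(mu) * p_fake(mu) for mu in grid])
posterior = normalise([p_pre(mu) * p_fake(mu) * likelihood(mu) for mu in grid])

vague_mean, vague_sd = mean_and_sd(vague)
prior_mean, prior_sd = mean_and_sd(prior)
post_mean, post_sd = mean_and_sd(posterior)
```

The fake data turn a flat prior with standard deviation near 5.8 into one centred on the fake values with standard deviation a bit above 1, and the real data then pull the posterior most of the way to their own mean.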

]]>Mark: Please see the p.s. in my post.

]]>Anonymous: Please see the p.s. in my post.

]]>It turns out that what I was describing is also called Multi-Input Multi-Output (MIMO) prediction in the machine learning literature.

]]>As I am not 100% sure what is meant by data augmentation, I am thinking about it in terms of time series models where you need to enforce stationarity. So for instance, an AR(p) model. When p=1, it is easy to impose the constraints to ensure this, but it becomes more complicated for larger p.

If the original model is y_t = B * y_t_minus_1 + e_t with some prior, we could augment this with y_t_plus_j = B_j * y_t_minus_1 + e_t_plus_j for some values of j, where B_j is calculated from B and the standard deviation of e_t_plus_j is calculated from the standard deviation of e_t. So if I add j=1 or more, then this would be data augmentation?

I would think that reducing the chance that I get a B that is bad at forecasting many periods out (it is explosive or something) would be a good thing. I just wasn’t sure what the justification for doing this was. It just seemed like a hack.
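For the AR(1) case the multi-step pieces can be written down directly, so here is a small sketch. Substituting the model into itself gives y_t_plus_j = B^(j+1) * y_t_minus_1 plus a sum of j+1 shocks, so B_j = B^(j+1) and the variance of the combined error is sigma^2 * (1 + B^2 + ... + B^(2j)). The function names are made up, and adding the horizons together as if they were independent is an assumption of this sketch: the combined errors overlap across horizons, so this is a composite likelihood rather than an exact one.

```python
import math

def multi_step_coefficients(b, sigma, j):
    """Return (B_j, sd_j) such that y[t+j] = B_j * y[t-1] + error with sd sd_j."""
    b_j = b ** (j + 1)
    var_j = sigma ** 2 * sum(b ** (2 * k) for k in range(j + 1))
    return b_j, math.sqrt(var_j)

def augmented_log_lik(y, b, sigma, horizons):
    """Sum of normal log densities over every requested forecast horizon.

    horizons=[0] is the ordinary one-step AR(1) likelihood; adding j >= 1
    also scores how well b forecasts j extra periods ahead.
    """
    total = 0.0
    for j in horizons:
        b_j, sd_j = multi_step_coefficients(b, sigma, j)
        for t in range(1, len(y) - j):
            resid = y[t + j] - b_j * y[t - 1]
            total += (-0.5 * (resid / sd_j) ** 2
                      - math.log(sd_j)
                      - 0.5 * math.log(2 * math.pi))
    return total

# Toy series that decays exactly like an AR(1) with B = 0.5 and no shocks.
series = [0.5 ** t for t in range(12)]
stable = augmented_log_lik(series, 0.5, 1.0, horizons=[0, 1, 2])
explosive = augmented_log_lik(series, 1.1, 1.0, horizons=[0, 1, 2])
```

An explosive B is penalised more heavily at longer horizons because B^(j+1) grows with j, which is the effect described above.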

]]>I’m sympathetic to this point. I’m still not entirely sure what they mean when they say data augmentation, even after seeing Keith’s comment below.

]]>Alex – thanks for the clarification.

]]>Nice point, A ~ B does imply B ~ A.

]]>Mark – Thanks, on my to read list.

]]>Andrew J. please see response below next comment.

]]>Andrew J.

You are right that I should have distinguished data augmentation as it’s understood in ML versus sampling from the posterior as it’s often used in Bayes. I meant to and then forgot.

To be more candid than I perhaps should be, I was about to delete the post and then noticed the blog seemed a bit slow recovering from the Thanksgiving holiday. I am glad I posted it, as the comments are interesting rather than, as I feared, this being “old news”.

The ax is augmented data and P.au is the augmented prior for u: P.au(u) = P(u) * P(ax|u).

]]>Somehow the start of my comment was lost:

Not the original Anonymous, but I believe they were invoking an information-theoretic interpretation, assuming the inference procedure only looks at augmented data and the augmentation procedure doesn’t depend on the inferred parameters, so that we satisfy the …

]]>Markov chain x->ax->u, then the mutual information between the data and the inferred parameters (I(x;u)) has to be less than or equal to the mutual information between the augmented data and the inferred parameters (I(ax;u)) by the data processing inequality (https://en.wikipedia.org/wiki/Data_processing_inequality): I(x;u) <= I(ax;u) (I think they made a typo); also I(x;u) <= I(x;ax). So there is a strict and formal sense in which, if you do data augmentation, you are never extracting more information from the data. Intuitively, the noise injected in the augmentation procedure shields the data from the inference procedure. Granted, this is assuming the choice of augmentation procedure doesn't depend on the data itself, but scribbling on a napkin, I still have that I({x}, u) ax, and do inference on the whole set {ax}.
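A quick numerical check of the two inequalities on a toy chain x -> ax -> u. Everything about the setup is an assumption for illustration: x is a fair coin, augmentation flips it with probability eps_aug, and "inference" flips ax with probability eps_inf. The helper names are made up.

```python
import math
from itertools import product

def mutual_information(joint):
    """I(A;B) in bits for a joint distribution given as {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def flip(bit_in, bit_out, eps):
    """Probability that a channel with flip probability eps maps bit_in to bit_out."""
    return 1 - eps if bit_in == bit_out else eps

def chain_joint(eps_aug, eps_inf):
    """Joint p(x, ax, u) for the Markov chain x -> ax -> u with x a fair coin."""
    return {(x, ax, u): 0.5 * flip(x, ax, eps_aug) * flip(ax, u, eps_inf)
            for x, ax, u in product((0, 1), repeat=3)}

def marginal(joint3, keep):
    """Marginalise a three-variable joint down to the two positions in keep."""
    out = {}
    for key, p in joint3.items():
        pair = tuple(key[i] for i in keep)
        out[pair] = out.get(pair, 0.0) + p
    return out

def chain_informations(eps_aug, eps_inf):
    """Return (I(x;ax), I(ax;u), I(x;u)) in bits."""
    joint = chain_joint(eps_aug, eps_inf)
    return (mutual_information(marginal(joint, (0, 1))),
            mutual_information(marginal(joint, (1, 2))),
            mutual_information(marginal(joint, (0, 2))))

noisy = chain_informations(0.1, 0.2)          # random augmentation
deterministic = chain_informations(0.0, 0.2)  # augmentation that never flips
```

In the noisy case I(x;u) comes out below both I(ax;u) and I(x;ax); with a deterministic augmentation I(x;u) equals I(ax;u), the equality case mentioned elsewhere in the thread.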

]]>We start with a prior over functions p(g) that have no invariance properties. We construct a prior over more invariant functions by averaging over the data augmentation process p(ax | x) (in the paper we show how this relates to invariances):

f(x) = \int g(ax) p(ax | x ) d{ax}

In our construction, p(ax | x) is more a means to an end to parameterise the prior on f( ).
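To make the averaging construction f(x) = \int g(ax) p(ax | x) d{ax} concrete, here is a toy sketch, not the construction from the paper: x is a 2D point, the augmentation is a rotation about the origin, and the integral is replaced by an average over a fixed set of equally spaced angles. The function g and all names are made up for illustration.

```python
import math

def rotate(point, angle):
    """Rotate a 2D point about the origin by angle (radians)."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def g(point):
    """An arbitrary function with no rotational invariance."""
    x, y = point
    return x + 2.0 * y ** 2

def f(point, n_angles=360):
    """Average g over rotated copies of the point.

    Stands in for integrating g(ax) against p(ax | x) when p(ax | x)
    is a uniform distribution over rotations of x.
    """
    angles = [2 * math.pi * k / n_angles for k in range(n_angles)]
    return sum(g(rotate(point, a)) for a in angles) / n_angles

p = (1.0, 0.0)
q = rotate(p, math.pi / 2)   # the same point turned by 90 degrees

g_gap = abs(g(p) - g(q))     # g tells the two apart
f_gap = abs(f(p) - f(q))     # f (almost exactly) does not
```

Narrowing the range of angles in `f` would give a function that is only insensitive to small rotations, which is the kind of knob the marginal likelihood is used to tune.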

My favourite part of our paper is that we put a Gaussian process prior over g( ) (and equivalently f( )), which allows us to learn the data augmentation through marginal likelihood maximisation. One could also think about learning a joint posterior over functions and invariance properties as a hierarchical model.

This allows you to automatically figure out transformations that leave your image label intact, such as skews and scales for MNIST, or full rotation invariance for rot-MNIST (see paper or a toy example [2]). Of course, you always need to specify the models/transformations that can be considered!

[1] Learning Invariances using the Marginal Likelihood: https://papers.nips.cc/paper/8199-learning-invariances-using-the-marginal-likelihood

[2] https://twitter.com/markvanderwilk/status/1184235600414683136

Apologies that my previous comment used markdown, which apparently is not accepted here. This version may look marginally better:

This post could benefit from a definition of (or at least a link to) “data augmentation”. As far as I can tell, when it is used in a machine-learning context (e.g., https://towardsdatascience.com/data-augmentation-for-deep-learning-4fe21d1a4eb9) it does not mean the same thing as when it is used in a more standard Bayesian context (e.g., in Gibbs sampling as in https://link.springer.com/article/10.1007/s11222-013-9435-z).

Also, I hope I’m not nit-picking, but the post could also benefit from a little clarity of notation (what’s “ax”? what’s “P.au”?) and, if possible, using actual latexified equations.

]]>Timothy: Thanks, I thought about referencing those but did not as there the explicit purpose is to incorporate a prior via fake data.

]]>Anonymous: Can you expand a bit or give a reference?

]]>This concept goes hand in hand with the “masking” concept we talked about a while back and which I wrote up on my blog: http://models.street-artists.org/2019/09/04/informative-priors-from-masking/

Imagine, for example, you want to describe a prior over functions expressed as Fourier series. You expect these functions to be smooth, but say mostly increasing across the range of x values. You set up the Fourier coefficients to have vague priors, leading to a prior distribution where realizations are all sorts of strange oscillating functions…

Then you draw on a piece of paper 5 representative curves, read off the value of the curves at 4 or 5 points each, and create a dataset y[i,j] where i indexes the curves, and j indexes the points on the curve.

Then you set up a “fake likelihood”: a mixture model of likelihoods across all the y[i,j], with iid normal error.

You’ll wind up with a prior over the coefficients in which functions not too far from the sample of 5 representative curves are common, and functions that are weird compared to any of those samples are uncommon.

Now add in your real data, use your Fourier coefficient model to describe whatever you needed it to describe, and get your real posterior over the coefficients.

]]>“Data augmentation priors for Bayesian and semi‐Bayes analyses of conditional‐logistic and proportional‐hazards regression” by Greenland and Christensen (2001)

https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.902

“Prior data for non‐normal priors” by Greenland (2007)

https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.2788

“A New Perspective on Priors for Generalized Linear Models” by Bedrick, Christensen, and Johnson (1996)

https://www.tandfonline.com/doi/abs/10.1080/01621459.1996.10476713

a -> ax -> u can be viewed as a Markov chain, and so we should have that I(a,u) >= I(ax, u). ]]>