I think the model is doing better because useful information is being brought in by the prior specification.

Now suppose there are two competing prior specifications, one analytical and the other through augmented data. If the augmented-data prior would converge to the analytical one given exhaustive enough “examples”, then some information is lost by using the augmented-data prior. So the model does better with the augmented-data prior, but not as well as it would with the analytical one.

But one needs to keep in mind Daniel’s warning about the “human information processing” issue, which suggests the analytical one will likely be misspecified: https://statmodeling.stat.columbia.edu/2019/12/02/a-bayesian-view-of-data-augmentation/#comment-1200514

To the extent that we have a Bayesian model, u is a distribution that induces a predictive data distribution D*, and potential future observations follow a distribution D. So the relevant information-theoretic calculation is I(D*, D): the posterior predictive data vs. the real future data.

One only needs the process to be a Markov chain.

“the mapping from x to ax is not a random variable”

x -> ax *is* a random transformation, no? Say, when adding random Gaussian noise to each pixel.

The same holds if you rotate the picture by a random amount.

Moreover, if the transformation is deterministic, then we trivially achieve the upper bound on the inequality, in which case we haven’t reduced the mutual information, but haven’t increased it, either.

Therefore, the usefulness of data augmentation can’t be explained in terms of “information” in the information-theoretic sense.

Alex Ameni clarified properly what I meant.

One can view the transformation from data x to augmented data ax to parameters u as a Markov chain x -> ax -> u.

(Provided the choice of augmentation procedure is not conditioned on x, and it isn’t when we just rotate or add noise indiscriminately.)

When the Markov property holds, there’s a strict sense in which information is lost along the chain.

As a result, I’m not sure thinking in terms of “information” is the right angle. Information is lost, but the model does better. Perhaps the connection between prior and regularization is a more fruitful avenue.
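To make the “information is lost along the chain” claim concrete, here is a small numeric sketch of the data processing inequality for a discrete Markov chain x -> ax -> u. The alphabets and transition tables are made up for illustration and are not from the post; the only point is that I(x; u) cannot exceed I(x; ax) when u depends on x only through ax.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# Hypothetical chain x -> ax -> u over small discrete alphabets.
p_x = np.array([0.5, 0.5])
# Noisy augmentation: rows index x, columns index ax.
p_ax_given_x = np.array([[0.9, 0.1],
                         [0.2, 0.8]])
# A step that only sees ax: rows index ax, columns index u.
p_u_given_ax = np.array([[0.7, 0.3],
                         [0.1, 0.9]])

joint_x_ax = p_x[:, None] * p_ax_given_x   # p(x, ax)
joint_x_u = joint_x_ax @ p_u_given_ax      # p(x, u), valid because u depends on x only via ax

i_x_ax = mutual_information(joint_x_ax)
i_x_u = mutual_information(joint_x_u)
print(f"I(x; ax) = {i_x_ax:.4f} nats, I(x; u) = {i_x_u:.4f} nats")
```

The inequality says nothing about predictive performance, which is consistent with the observation that the model can do better even though information is lost.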

A big part of that is that specifying dependencies in high dimensions is very tricky, but they fall out automatically when you use a data-augmentation method.

A simple example: specify a prior over Fourier coefficients that gives you functions on [0,1] which increase rapidly for x in [0,.5] up to a maximum near y=1 and then stay flat out to x=1.

One method: start with a very broad prior on all the coefficients, then create some fake x,y values that have the properties you want and define an iid normal(y, .25) model over the fake measurements. Then calculate the derivative at each x value and strongly downweight the coefficients if they produce any negative derivatives, and see what the posterior looks like. It will have a high probability for fairly smooth, monotonically increasing functions. Good luck doing that analytically.
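The fake-data half of this recipe can be sketched in closed form, because a Fourier series is linear in its coefficients: a broad Gaussian prior times a Gaussian fake-data likelihood gives a Gaussian over the coefficients. Everything specific below is an assumption for illustration: the basis (three harmonics with period 2), the prior scale of 5, and the eleven fake points. The derivative-based downweighting is left out, since it breaks the conjugacy and would need sampling.

```python
import numpy as np

def fourier_basis(x, n_harmonics=3):
    """Design matrix [1, cos(pi k x), sin(pi k x)] for k = 1..n_harmonics."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.cos(np.pi * k * x))
        cols.append(np.sin(np.pi * k * x))
    return np.column_stack(cols)

# Hypothetical fake data: rises quickly on [0, 0.5], then flat near y = 1.
x_fake = np.linspace(0.0, 1.0, 11)
y_fake = np.minimum(2.0 * x_fake, 1.0)

prior_sd = 5.0   # very broad prior on every coefficient (assumed value)
fake_sd = 0.25   # the normal(y, .25) scale from the comment

# Conjugate update: broad Gaussian prior times Gaussian fake-data likelihood.
Phi = fourier_basis(x_fake)
precision = Phi.T @ Phi / fake_sd**2 + np.eye(Phi.shape[1]) / prior_sd**2
cov = np.linalg.inv(precision)
cov = (cov + cov.T) / 2.0                      # remove round-off asymmetry
mean = cov @ Phi.T @ y_fake / fake_sd**2

# Draw functions from the resulting fake-data prior and evaluate them on a grid.
rng = np.random.default_rng(0)
coef_draws = rng.multivariate_normal(mean, cov, size=200)
x_grid = np.linspace(0.0, 1.0, 101)
curves = coef_draws @ fourier_basis(x_grid).T  # shape (200, 101)
mean_curve = mean @ fourier_basis(x_grid).T
print("prior mean at x = 0, 0.5, 1:", mean_curve[[0, 50, 100]].round(2))
```

Plotting a handful of rows of `curves` is a quick prior predictive check on whether the fake points encode what you intended.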

Realistically speaking I think we’re better off talking about doing a good job of choosing your augmented data than we are discouraging this kind of thing. It’s a hugely valuable tool to impose real world knowledge on a problem.

My p.s. presumed the analytical prior could be specified without error and that P(u) * P(ax|u) would eventually converge to that with extensive enough data augmentation. In that sense, I would argue the analytical prior is more satisfying.

Now the prior predictive for fake ax would be a nice way to check on the analytical prior specification.

Now the rest of your comment does seem consistent with the equations I gave in my post. If it’s not, please let me know.

Actually adding fake data *is* encoding knowledge into the prior only.

The posterior is (proportional to): p(data | parameters) p(parameters)

where p(parameters) is any normalizable non-negative function encoding your information about the parameters.

So, let p(parameters) = p_pre(parameters) * p_fake(fake_data | parameters)

where p_pre(parameters) is a valid density encoding only vague bounds, and p_fake(fake_data | parameters) is a masking function that squishes the density in regions of parameter space that are far away from predicting fake_data. (In particular, you can, for example, provide a different standard deviation than for your real data, so p_fake need not be the same function as p(data | parameters) for your measurements.)

Now p(parameters) is proportional to a valid density, and it encodes your knowledge that the parameters are within some bounds and tend to predict things like fake_data…

QED
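A minimal grid sketch of this factorization for a single scalar parameter, assuming normal likelihoods. The bounds, the fake data values, the real data values, and both standard deviations are invented for illustration; the names `p_pre` and `p_fake` follow the comment above.

```python
import numpy as np

def normal_logpdf(y, mean, sd):
    """Log density of normal(mean, sd) evaluated at y."""
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((y - mean) / sd) ** 2

# One scalar parameter mu on a grid; p_pre only encodes vague bounds.
mu = np.linspace(-50.0, 50.0, 20001)
d_mu = mu[1] - mu[0]
log_p_pre = np.zeros_like(mu)                 # flat on [-50, 50]

# Hypothetical fake data saying "mu is probably somewhere around 2 to 4".
fake_data = np.array([2.0, 3.0, 4.0])
fake_sd = 2.0                                 # deliberately wider than the real-data sd
log_p_fake = normal_logpdf(fake_data[:, None], mu[None, :], fake_sd).sum(axis=0)

# p(parameters) = p_pre * p_fake, normalized on the grid.
log_prior = log_p_pre + log_p_fake
prior = np.exp(log_prior - log_prior.max())
prior /= prior.sum() * d_mu

# Real data enter through their own likelihood with their own sd.
real_data = np.array([5.1, 4.7, 5.4, 4.9])
real_sd = 1.0
log_lik = normal_logpdf(real_data[:, None], mu[None, :], real_sd).sum(axis=0)
post = prior * np.exp(log_lik - log_lik.max())
post /= post.sum() * d_mu

prior_mean = np.sum(mu * prior) * d_mu
post_mean = np.sum(mu * post) * d_mu
print(f"prior mean = {prior_mean:.3f}, posterior mean = {post_mean:.3f}")
```

Because `fake_sd` is larger than `real_sd`, each fake observation counts for less than a real one, which is one way to control how strongly the fake data pull on the posterior.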

It *is* possible to do this badly, just as it’s possible to do other ways of encoding priors badly. But this is actually a good way to encode priors.

Remember the case that Dan Simpson mentioned a while back where the density of air pollution was somewhere between the density of concrete and the density of neutron stars? Well if he’d just put some fake data where it was around 1e-6 to 1e-4 times the density of air… he’d have squished his priors into line with his knowledge, without having to tune individual hyper-priors or whatever…

If the original model is y_t = B * y_t_minus_1 + e_t with some prior, we could augment this with y_t_plus_j = B_j * y_t_minus_1 + e_t_plus_j for some values of j, where B_j is calculated from B and the standard deviation of e_t_plus_j is calculated from the standard deviation of e_t. So if I add j=1 or more, would this be data augmentation?

I would think that reducing the chance that I get a B that is bad at forecasting many periods out (it is explosive or something) would be a good thing. I just wasn’t sure what the justification for doing this was. It just seemed like a hack.
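For reference, here is how B_j and the error scale follow from B in the simplest case. This assumes B is a scalar and the e_t are iid with mean zero and standard deviation \sigma, neither of which is stated in the comment; e^{*}_{t+j} is just a label for the accumulated error. Substituting the one-step model into itself j times gives:

```latex
\begin{aligned}
y_{t+j} &= B^{\,j+1}\, y_{t-1} + \sum_{k=0}^{j} B^{k}\, e_{t+j-k}, \\
B_j &= B^{\,j+1}, \\
\operatorname{sd}\!\left(e^{*}_{t+j}\right) &= \sigma \sqrt{\sum_{k=0}^{j} B^{2k}},
\qquad e^{*}_{t+j} = \sum_{k=0}^{j} B^{k}\, e_{t+j-k}.
\end{aligned}
```

With j = 0 this reduces to the original one-step model. The accumulated errors for different j share terms, so they are correlated with each other rather than independent.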

You are right that I should have distinguished data augmentation as it’s understood in ML versus sampling from the posterior as it’s often used in Bayes. I meant to and then forgot.

To be more candid than I perhaps should be, I was about to delete the post and then noticed the blog seemed a bit slow recovering from the Thanksgiving holiday. I am glad I posted it, as the comments are interesting rather than, as I feared, this being “old news”.

The ax is augmented data and P.au is the augmented prior for u: P.au(u) = P(u) * P(ax|u).

Not the original Anonymous, but I believe they were invoking an information-theoretic interpretation, assuming the inference procedure only looks at augmented data and the augmentation procedure doesn’t depend on the inferred parameters, so that we satisfy the …

We start with a prior over functions p(g) that have no invariance properties. We construct a prior over more invariant functions by averaging over the data augmentation process p(ax | x) (in the paper we show how this relates to invariances):

f(x) = \int g(ax) p(ax | x ) d{ax}

In our construction, p(ax | x) is more a means to an end to parameterise the prior on f( ).
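A toy Monte Carlo sketch of that averaging, not the Gaussian process construction from the paper. It assumes 2-D inputs, a uniformly random rotation as p(ax | x), and an arbitrary non-invariant g; all three are illustrative choices. Averaging g over the augmentation distribution gives an f that is approximately rotation invariant.

```python
import numpy as np

def rotate(x, angles):
    """Rotate the 2-D point x by each angle; returns shape (len(angles), 2)."""
    c, s = np.cos(angles), np.sin(angles)
    return np.column_stack([c * x[0] - s * x[1], s * x[0] + c * x[1]])

def g(points):
    """A function with no rotational invariance: the squared first coordinate."""
    return points[:, 0] ** 2

def f(x, n_samples=20000, seed=0):
    """Monte Carlo estimate of f(x) = E[g(ax)] with ax ~ p(ax | x)."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n_samples)  # p(ax | x): uniform rotation
    return float(np.mean(g(rotate(x, angles))))

x1 = np.array([2.0, 0.0])
x2 = np.array([0.0, 2.0])      # same radius as x1, rotated by 90 degrees
print("g:", g(x1[None, :])[0], g(x2[None, :])[0])   # g differs between the two points
print("f:", f(x1), f(x2))                            # f is approximately equal
```

Narrowing the range of angles would give a function that is only invariant to small rotations, which is the kind of augmentation parameter the marginal likelihood is used to learn in the paper.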

My favourite part of our paper is that we put a Gaussian process prior over g( ) (and equivalently f( )), which allows us to learn the data augmentation through marginal likelihood maximisation. One could also think about learning a joint posterior over functions and invariance properties as a hierarchical model.

This allows you to automatically figure out transformations that leave your image label intact, such as skews and scales for MNIST, or full rotation invariance for rot-MNIST (see paper or a toy example [2]). Of course, you always need to specify the models/transformations that can be considered!

[1] Learning Invariances using the Marginal Likelihood: https://papers.nips.cc/paper/8199-learning-invariances-using-the-marginal-likelihood

[2] https://twitter.com/markvanderwilk/status/1184235600414683136

This post could benefit from a definition of (or at least a link to) “data augmentation”. As far as I can tell, when it is used in a machine-learning context (e.g., https://towardsdatascience.com/data-augmentation-for-deep-learning-4fe21d1a4eb9) it does not mean the same thing as when it is used in a more standard Bayesian context (e.g., in Gibbs sampling as in https://link.springer.com/article/10.1007/s11222-013-9435-z).

Also, I hope I’m not nit-picking, but the post could also benefit from a little clarity of notation (what’s “ax”? what’s “P.au”?) and, if possible, using actual latexified equations.

Imagine, for example, you want to describe a prior over functions expressed as Fourier series. You expect these functions to be smooth, but say mostly increasing across the range of x values. You set up the Fourier coefficients to have vague priors, leading to a prior distribution where realizations are all sorts of strange oscillating functions…

Then you draw on a piece of paper 5 representative curves, read off the value of the curves at 4 or 5 points each, and create a dataset y[i,j] where i indexes the curves, and j indexes the points on the curve.

Then you set up a “fake likelihood” that is a mixture model of likelihoods across all the y[i,j] + iid normal error.

You’ll wind up with a prior over the coefficients in which functions not too far from the sample of 5 representative curves are common, and functions that are weird compared to any of those samples are uncommon.

Now add in your real data, use your Fourier coefficient model to describe whatever you needed it to describe, and get your real posterior over the coefficients.

“Data augmentation priors for Bayesian and semi‐Bayes analyses of conditional‐logistic and proportional‐hazards regression” by Greenland and Christensen (2001)

https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.902

“Prior data for non‐normal priors” by Greenland (2007)

https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.2788

“A New Perspective on Priors for Generalized Linear Models” by Bedrick, Christensen, and Johnson (1996)

https://www.tandfonline.com/doi/abs/10.1080/01621459.1996.10476713

a -> ax -> u can be viewed as a Markov chain, and so we should have that I(a,u) >= I(ax, u).