After my lecture on Principled Bayesian Workflow for a group of machine learners back in August, a discussion arose about data augmentation. The comments were about how it made the data more informative. I questioned that, as there is only so much information in the data: given the model assumptions, just the likelihood. So simply modifying the data should not increase the information, only possibly decrease it (if the modification is non-invertible).

Later, when I actually saw an example of data augmentation and thought about this more carefully, I changed my mind. I now realise background knowledge is being brought to bear on how the data is being modified. So data augmentation is just a way of being Bayesian by incorporating prior probabilities. Right?

Then thinking some more, it became all trivial as the equations below show.

P(u|x) ~ P(u) * P(x|u) [Bayes with just the data.]

~ P(u) * P(x|u) * P(ax|u) [Add the augmented data.]

P(u|x,ax) ~ P(u) * P(x|u) * P(ax|u) [That’s just the posterior given ax.]

P(u|x,ax) ~ P(u) * P(ax|u) * P(x|u) [Change the order of x and ax.]

Now, augmented data is not real data and should not be conditioned on as real. Arguably it is just part of (re)making the prior specification from P(u) into P.au(u) = P(u) * P(ax|u).

So change the notation to P(u|x) ~ P.au(u) * P(x|u).
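
The equations above can be checked numerically. The sketch below is only an illustration under assumptions that are not in the post: u is the mean of a normal model with known standard deviation 1, the numbers in x and ax are made up, and the posterior is approximated on a grid. It compares conditioning on x and ax together with first folding ax into the prior P.au(u) and then conditioning on x alone.

```python
import math

def normal_lik(data, u, sd=1.0):
    """Likelihood of iid normal data with mean u and known sd."""
    return math.prod(
        math.exp(-0.5 * ((d - u) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        for d in data
    )

def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical numbers, for illustration only.
x = [1.2, 0.7, 1.9]                          # real observations
ax = [1.3, 0.6, 2.0, 1.1]                    # augmented (fake) observations
grid = [i / 100 for i in range(-300, 501)]   # candidate values of u
flat_prior = [1.0 for _ in grid]             # P(u) = 1

# Route 1: condition on x and ax together.
post_joint = normalize([p * normal_lik(x, u) * normal_lik(ax, u)
                        for p, u in zip(flat_prior, grid)])

# Route 2: fold ax into the prior, P.au(u) = P(u) * P(ax|u), then use x only.
prior_au = normalize([p * normal_lik(ax, u) for p, u in zip(flat_prior, grid)])
post_via_prior = normalize([p * normal_lik(x, u)
                            for p, u in zip(prior_au, grid)])

max_gap = max(abs(a - b) for a, b in zip(post_joint, post_via_prior))

def grid_sd(weights):
    """Standard deviation of u under grid weights."""
    mean = sum(w * u for w, u in zip(weights, grid))
    return math.sqrt(sum(w * (u - mean) ** 2 for w, u in zip(weights, grid)))

post_x_only = normalize([normal_lik(x, u) for u in grid])
print(max_gap)                                    # floating-point noise only
print(grid_sd(post_x_only), grid_sd(post_joint))  # second value is smaller
```

The two routes agree up to floating-point error, and the resulting posterior is narrower than the one based on x alone, because ax is being counted as extra observations.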

If you data augment (and you are using likelihood based ML, implicitly starting with P(u) = 1), you are being a Bayesian whether you like it or not.

So I googled a bit and asked a colleague in ML about the above. They said “it makes sense to me when I think about it, but that was not immediately obvious to me.” They also said it was not common knowledge, so here it is.

Now better googling gets more stuff, such as “Augmentation is also a form of adding prior knowledge to a model; e.g. images are rotated, which you know does not change the class label.” and the paper “A Kernel Theory of Modern Data Augmentation” by Dao et al., where in the introduction they state “Data augmentation can encode prior knowledge about data or task-specific invariances, act as regularizer to make the resulting model more robust, and provide resources to data-hungry deep learning models.” The connection to Bayes does not seem to be discussed in either, though.

Further scholarship likely would lead me to consider deleting this post, but what’s the fun in that?

P.S. In the comments, Anonymous argued “we should have that I(a,u) >= I(ax, u)”, which I am now guessing was about putting the augmentation into the model instead of introducing it through fake data examples. That is, instead of modifying the data in ways that are irrelevant to the prediction (e.g. small translations, rotations, or deformations for handwritten digits), put that knowledge into the prior: instead of obtaining P.axu(u) = P(u) * P(ax|u) based on n augmentations of the data, construct P.au(u) mathematically (sort of an infinite number of augmentations of the data).

Then Mark van der Wilk adds a comment about actually doing that for multiple possible P.au(u)s and then comparing these using the marginal likelihood, in a paper with colleagues.

Now, there could not be a better motivation for my post than this from their introduction: “This human input makes data augmentation undesirable from a machine learning perspective, akin to hand-crafting features. It is also unsatisfactory from a Bayesian perspective, according to which assumptions and expert knowledge should be explicitly encoded in the prior distribution only. By adding data that are not true observations, the posterior may become overconfident, and the marginal likelihood can no longer be used to compare to other models.”

Thanks Mark.

What about the data processing inequality?

a -> ax -> u can be viewed as a Markov chain, and so we should have that I(a,u) >= I(ax, u).

Anonymous: Can you expand a bit or give a reference?

If x -> ax -> u is a Markov chain, then the mutual information between the data and the inferred parameters, I(x;u), has to be less than or equal to the mutual information between the augmented data and the inferred parameters, I(ax;u), by the [data processing inequality](https://en.wikipedia.org/wiki/Data_processing_inequality): I(x;u) <= I(ax;u) (I think they made a typo). Also, I(x;u) <= I(x;ax). So there is a strict and formal sense in which, if you do data augmentation, you are always extracting less information from the data. Intuitively, the noise injected in the augmentation procedure shields the data from the inference procedure. Granted, this is assuming the choice of augmentation procedure doesn't depend on the data itself, but scribbling on a napkin, I still have that I({x}, u) ax, and do inference on the whole set {ax}.

Somehow the start of my comment was lost:

Not the original Anonymous but I believe they were invoking an information-theoretic interpretation, assuming the inference procedure only looks at augmented data and the augmentation procedure doesn’t depend on the inferred parameters, so that we satisfy the …

Alex – thanks for the clarification.

Anonymous: Please see the p.s. in my post.

Thanks (I’m anon, forgot to login above)

Alex Ameni clarified properly what I meant.

One can view the transformation from data x to augmented data ax to parameters u as a Markov chain x -> ax -> u.

(Provided the choice of augmentation procedure is not conditioned on x — and it doesn’t when we just rotate or add noise indiscriminately).

When the Markov property holds, there’s a strict sense in which information is lost along the chain.

As a result, I’m not sure thinking in terms of “information” is the right angle. Information is lost, but the model does better. Perhaps the connection between prior and regularization is a more fruitful avenue.
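
The inequality invoked in this thread can be checked on a toy chain. The sketch below is only an illustration under assumptions not made anywhere above: x, ax and u are single bits, x is a fair coin, and each arrow in x -> ax -> u is a binary symmetric channel with a made-up flip probability. It computes the three mutual informations directly from the joint distribution.

```python
import math

def mutual_information(joint):
    """Mutual information in bits for a joint pmf given as {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def flip(bit, eps):
    """Binary symmetric channel: {output bit: probability} for an input bit."""
    return {bit: 1.0 - eps, 1 - bit: eps}

def add(joint, key, p):
    """Accumulate probability mass p on a pair of values."""
    joint[key] = joint.get(key, 0.0) + p

eps_aug = 0.1   # hypothetical noise in the augmentation step x -> ax
eps_fit = 0.2   # hypothetical noise in the inference step ax -> u

joint_x_ax, joint_ax_u, joint_x_u = {}, {}, {}
for x in (0, 1):                                  # x is a fair coin
    for ax, p_ax in flip(x, eps_aug).items():
        for u, p_u in flip(ax, eps_fit).items():
            p = 0.5 * p_ax * p_u                  # P(x) * P(ax|x) * P(u|ax)
            add(joint_x_ax, (x, ax), p)
            add(joint_ax_u, (ax, u), p)
            add(joint_x_u, (x, u), p)

i_x_ax = mutual_information(joint_x_ax)
i_ax_u = mutual_information(joint_ax_u)
i_x_u = mutual_information(joint_x_u)
print(i_x_ax, i_ax_u, i_x_u)   # I(x;u) is the smallest of the three
```

In this toy chain I(x;u) is no larger than either I(x;ax) or I(ax;u), which is all the data processing inequality says; it does not by itself say anything about how well the fitted model predicts.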

I don’t think this is correct. the mapping from x to ax is not a random variable, and the augmentation procedure obviously *does* depend on x if you take x and rotate it, or take x and add noise to it.

Ah right, although now that I think of it, the requirement that the transformation be independent of x is spurious.

One only needs the process to be a Markov chain.

“the mapping from x to ax is not a random variable”

x -> ax *is* a random transformation, no? Say when adding random Gaussian noise to each pixel.

The same holds if you rotate the picture by a random amount.

Moreover, if the transformation is deterministic, then we trivially achieve the upper bound on the inequality, in which we haven’t reduced the mutual information, but haven’t increased it, either.

Therefore, the usefulness of data augmentation can’t be explained in terms of “information” in the information-theoretic sense.

x is not random, it’s a realized thing, and ax is also a realization. This information theoretic measure is a property of two distributions, not a property of a particular sample.

To the extent that we have a Bayesian model, u is a distribution that induces a predictive data distribution D*, and potential future observations are a distribution D, so the relevant information-theoretic calculation is I(D*,D): the posterior predictive data versus the real future data.

> Information is lost, but the model does better.

I think the model is doing better because useful information is being brought in by the prior specification.

Now with two competing prior specifications, one analytical and the other through augmented data, if the augmented data one would converge to the analytical one with exhaustive enough “examples”, then there is a loss of information by using the augmented data one. So the model does better with the augmented one but not as well as it would with the analytical one.

But one needs to keep in mind Daniel’s warning about the “human information processing” issue, which suggests the analytical one likely will be misspecified: https://statmodeling.stat.columbia.edu/2019/12/02/a-bayesian-view-of-data-augmentation/#comment-1200514

I think there’s definitely “prior” precedent for viewing data augmentation as encoding prior beliefs. For example:

“Data augmentation priors for Bayesian and semi‐Bayes analyses of conditional‐logistic and proportional‐hazards regression” by Greenland and Christensen (2001)

https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.902

“Prior data for non‐normal priors” by Greenland (2007)

https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.2788

“A New Perspective on Priors for Generalized Linear Models” by Bedrick, Christensen, and Johnson (1996)

https://www.tandfonline.com/doi/abs/10.1080/01621459.1996.10476713

Timothy: Thanks, I thought about referencing those but did not as there the explicit purpose is to incorporate a prior via fake data.

We sometimes talk about looking at the prior predictive distribution to help us calibrate our priors to what we actually think. Perhaps a better way to handle that in practice is to specify our vague priors, and then a set of augmented data, such that with the augmented data, the posterior better describes what we meant the prior to describe…. We then add the real data and get the “real” posterior. The augmented data is just a trick to calculate the prior we wanted in the first place.

This concept goes hand in hand with the “masking” concept we talked about a while back and which I wrote up on my blog: http://models.street-artists.org/2019/09/04/informative-priors-from-masking/

Imagine for example you want to describe a prior over functions expressed as fourier series. You expect these functions to be smooth, but say mostly increasing across the range of x values. You set up the fourier coefficients to have vague priors, leading to a prior distribution where realizations are all sorts of strange oscillating functions…

Then you draw on a piece of paper 5 representative curves, read off the value of the curves at 4 or 5 points each, and create a dataset y[i,j] where i indexes the curves, and j indexes the points on the curve.

Then you set up a “fake likelihood”: a mixture model of likelihoods across all the y[i,j], with iid normal error.

You’ll wind up with a prior over the coefficients in which functions not too far from the sample of 5 representative curves are common, and functions that are weird compared to any of those samples are uncommon.

Now add in your real data, use your fourier coefficient model to describe whatever you needed it to describe, and get your real posterior over the coefficients.
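
A minimal sketch of the recipe above, under loudly stated assumptions: the curve values, the five x locations, the noise level, the vague prior scale and the function names are all hypothetical, and the series has period 1 with just two harmonics. It evaluates the log of the “fake data” prior (vague prior times an equal-weight mixture over the hand-drawn curves) up to a constant.

```python
import math

def fourier(x, coefs):
    """Truncated Fourier series with coefs = [a0, s1, c1, s2, c2, ...]."""
    value = coefs[0]
    for k in range(1, (len(coefs) - 1) // 2 + 1):
        value += coefs[2 * k - 1] * math.sin(2 * math.pi * k * x)
        value += coefs[2 * k] * math.cos(2 * math.pi * k * x)
    return value

def log_normal_pdf(y, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * ((y - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

# Hypothetical values read off hand-drawn curves: y[i][j] is curve i at xs[j].
xs = [0.1, 0.3, 0.5, 0.7, 0.9]
y = [
    [0.10, 0.30, 0.55, 0.75, 0.90],
    [0.05, 0.40, 0.70, 0.85, 0.95],
    [0.20, 0.25, 0.45, 0.70, 0.80],
]

def log_augmented_prior(coefs, noise_sd=0.2, vague_sd=10.0):
    """Log of (vague prior) * (mixture over curves of iid normal likelihoods)."""
    log_vague = sum(log_normal_pdf(c, 0.0, vague_sd) for c in coefs)
    per_curve = [
        sum(log_normal_pdf(yij, fourier(xj, coefs), noise_sd)
            for xj, yij in zip(xs, curve))
        for curve in y
    ]
    top = max(per_curve)  # log-sum-exp for numerical stability
    log_mix = top + math.log(
        sum(math.exp(v - top) for v in per_curve) / len(per_curve))
    return log_vague + log_mix

gentle = [0.5, -0.3, -0.1, 0.0, 0.0]   # stays close to the drawn curves
wild = [0.0, 3.0, -2.0, 4.0, 1.5]      # large oscillations
print(log_augmented_prior(gentle), log_augmented_prior(wild))
```

Coefficients that produce functions near any one of the drawn curves get much higher prior density than wildly oscillating ones. The real data would then enter through a separate, ordinary likelihood term.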

You could think of augmentation as adding information to the data in a model without directly including prior data–as you say, augmented data are not real data. But you could also think of it as just another way of adding assumptions to the model. I think it’s best to keep that last perspective in mind regardless. Because we know to be cautious, even skeptical, when we talk about adding “assumptions,” whereas talking about adding “information” or “background knowledge” at least sounds like we’re doing something less questionable. Hopefully your machine learners realize their assumptions are still fallible, as is the decision about the right/best way to realize those assumptions as augmentations. (Of course, such caution may be less relevant if you’re trying to build a purely predictive, and potentially self-correcting, algorithm, as opposed to trying to describe the real world.)

This post could benefit from a definition of (or at least a link to) *data augmentation*. As far as I can tell, when it is used in a [machine-learning context](https://towardsdatascience.com/data-augmentation-for-deep-learning-4fe21d1a4eb9) it does not mean the same thing as when it is used in a more standard Bayesian context (e.g., in [Gibbs sampling](https://link.springer.com/article/10.1007/s11222-013-9435-z)).

Also, I hope I’m not nit-picking, but the post could also benefit from a little clarity of notation (what’s “ax”? what’s “P.au”?) and, if possible, using actual latexified equations.

Andrew J. please see response below next comment.

I’m sympathetic to this point. I’m still not entirely sure what they mean when they say data augmentation, even after seeing Keith’s comment below.

As I am not 100% on what is meant by data augmentation, I am thinking about it in terms of time series models where you need to enforce stationarity. So for instance, an AR(p) model. When p=1, it is easy to impose the constraints to ensure this, but it becomes more complicated for larger p.

If the original model is y_t = B * y_t_minus_1 + e_t with some prior, we could augment this with y_t_plus_j = B_j * y_t_minus_1 + e_t_plus_j, for some js. Where B_j is calculated from B and the standard deviation of e_t_plus_j is calculated from the standard deviation of e_t. So if I add j=1 or more, then this would be data augmentation?

I would think that reducing the chance that I get a B that is bad at forecasting many periods out (it is explosive or something) would be a good thing. I just wasn’t sure what the justification for doing this was. It just seemed like a hack.
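
For the AR(1) case described in this comment, the multi-step coefficients follow from iterating the one-step equation: with independent errors of standard deviation sigma, y_t_plus_j = B^(j+1) * y_t_minus_1 + (accumulated error), where the accumulated error has standard deviation sigma * sqrt(1 + B^2 + ... + B^(2j)). A minimal sketch follows; the function names, the example series and the choice of horizons are hypothetical. Note that the summed score is a composite (pseudo) likelihood, since the terms for different horizons reuse the same observations and are not independent.

```python
import math

def multi_step_ar1(B, sigma, j):
    """Coefficient and error sd for y[t+j] given y[t-1] under an AR(1) model."""
    coef = B ** (j + 1)
    sd = sigma * math.sqrt(sum(B ** (2 * k) for k in range(j + 1)))
    return coef, sd

def multi_step_log_score(y, B, sigma, horizons=(0, 1, 2)):
    """Sum of normal log densities of y[t+j] given y[t-1] over all horizons.

    This is a composite score, not a proper joint likelihood.
    """
    total = 0.0
    for j in horizons:
        coef, sd = multi_step_ar1(B, sigma, j)
        for t in range(1, len(y) - j):
            resid = y[t + j] - coef * y[t - 1]
            total += (-0.5 * (resid / sd) ** 2
                      - math.log(sd * math.sqrt(2 * math.pi)))
    return total

# Hypothetical series, for illustration only.
series = [1.0, 0.6, 0.2, 0.3, -0.1, 0.1, 0.0, -0.2]
print(multi_step_ar1(0.5, 1.0, 1))
print(multi_step_log_score(series, B=0.5, sigma=1.0))
```

With horizons=(0,) this reduces to the usual one-step AR(1) conditional log-likelihood; adding further horizons is the augmentation the comment asks about.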

It turns out that what I was describing is also called Multi-Input Multi-Output (MIMO) prediction in the machine learning literature.

Around the same time as the Dao et al “kernel theory on data augmentation” we published a paper [1] arguing that data augmentation constrains the prior over the mappings we want to learn (e.g. images to labels) to be invariant to the transformations in the augmentation. However, we use a different construction than the ‘additional likelihood’, although we end up with the same kernel as Dao et al.

We start with a prior over functions p(g) that have no invariance properties. We construct a prior over more invariant functions by averaging over the data augmentation process p(ax | x) (in the paper we show how this relates to invariances):

f(x) = \int g(ax) p(ax | x ) d{ax}

In our construction, p(ax | x) is more a means to an end to parameterise the prior on f( ).
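
As a toy illustration of the averaging construction above (not the Gaussian process model of the paper), suppose p(ax | x) is uniform over the four rotations of a 2D point by multiples of 90 degrees. The function g below is an arbitrary, hypothetical non-invariant function; averaging it over the augmentation distribution gives an f that is exactly invariant to those rotations.

```python
def rotate90(point, times):
    """Rotate a 2D point counter-clockwise by 90 degrees, `times` times."""
    x, y = point
    for _ in range(times % 4):
        x, y = -y, x
    return (x, y)

def g(point):
    """An arbitrary function with no rotation invariance (hypothetical)."""
    x, y = point
    return x * x * y + 3.0 * x + y * y

def f(point):
    """f(x) = sum over ax of g(ax) * p(ax | x), with p uniform on 4 rotations."""
    return sum(g(rotate90(point, k)) for k in range(4)) / 4.0

p = (0.7, -1.2)
q = rotate90(p, 1)        # the same point, rotated by 90 degrees
print(g(p), g(q))         # differ: g is not invariant
print(f(p), f(q))         # agree up to rounding: f is invariant
```

With a continuous augmentation distribution the sum becomes the integral in the comment, which could be approximated by averaging g over random draws of ax.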

My favourite part of our paper is that we put a Gaussian process prior over g( ) (and equivalently f( )), which allows us to learn the data augmentation through marginal likelihood maximisation. One could also think about learning a joint posterior over functions and invariance properties as a hierarchical model.

This allows you to automatically figure out transformations that leave your image label intact, such as skews and scales for MNIST, or full rotation invariance for rot-MNIST (see paper or a toy example [2]). Of course, you always need to specify the models/transformations that can be considered!

[1] Learning Invariances using the Marginal Likelihood: https://papers.nips.cc/paper/8199-learning-invariances-using-the-marginal-likelihood

[2] https://twitter.com/markvanderwilk/status/1184235600414683136

Andrew J.

You are right that I should have distinguished data augmentation as it’s understood in ML versus sampling from the posterior as it’s often used in Bayes. I meant to and then forgot.

To be more candid than I perhaps should be, I was about to delete the post and then noticed the blog seemed a bit slow recovering from the thanksgiving holiday. I am glad I posted it as the comments are interesting rather than as I feared this is “old news”.

The ax is augmented data and P.au is the augmented prior for u: P.au(u) = P(u) * P(ax|u).

Mark – Thanks, on my to read list.

Mark: Please see the p.s. in my post.

The opposite narrative to “data augmentation is just a prior” is “a prior is just (fake) data augmentation, and beyond that a prior is more (real) data”, in the sense that a prior may encode some robustness property such as rotation invariance. Therefore, adversarial training should be viewed as a way to construct a prior.

Nice point, A ~ B does imply B ~ A.

From the PS: >It is also unsatisfactory from a Bayesian perspective, according to which assumptions and expert knowledge should be explicitly encoded in the prior distribution only.

Actually adding fake data *is* encoding knowledge into the prior only.

The posterior is (proportional to): p(data | parameters) p(parameters)

where p(parameters) is any normalizable non-negative function encoding your information about the parameters.

so, let p(parameters) = p_pre(parameters) * p_fake(fake_data | parameters)

where p_pre(parameters) is a valid density encoding only vague bounds, and p_fake(fake_data | parameters) is a masking function that squishes the density in regions of parameter space that are far away from predicting fake_data (in particular you can for example provide a different standard deviation than for your real data, so p_fake need not be the same function as p(data|parameters) for your measurements)

now p(parameters) is proportional to a valid density, and it encodes your knowledge that the parameters are within some bounds and tend to predict things like fake_data…

QED

It *is* possible to do this badly, just as it’s possible to do other ways of encoding priors badly. But this is actually a good way to encode priors.

Remember the case that Dan Simpson mentioned a while back where the density of air pollution was somewhere between the density of concrete and the density of neutron stars? Well if he’d just put some fake data where it was around 1e-6 to 1e-4 times the density of air… he’d have squished his priors into line with his knowledge, without having to tune individual hyper-priors or whatever…

correction: where the *prior predictive* density of air pollution

Daniel:

My p.s. presumed the analytical prior could be specified without error and that P(u) * P(ax|u) would eventually converge to that with extensive enough data augmentation. In that sense, I would argue the analytical prior is more satisfying.

Now the prior predictive for fake ax would be a nice way to check on the analytical prior specification.

Now the rest of your comment does seem consistent with the equations I gave in my post. If it’s not, please let me know.

Practically speaking, for high dimensional parameters, I think data augmentation is far far more likely to converge to something we can believe in than any amount of tweaking an analytical expression could do.

A big part of that is that specifying dependencies in high dimensions is very tricky, but they fall out automatically when you use a data-augmentation method.

A simple example: specify a prior over fourier coefficients that gives you functions on [0,1] which increase rapidly between x in [0,.5] up to a maximum near y=1 and stay flat out to x=1

one method: start with a very broad prior on all the coefficients, then create some fake x,y values that have the properties you want, define a normal(y,.25) iid model over the fake measurements. Then calculate the derivative at each x value and strongly downweight the coefficients if they produce any negative derivatives… and see what the posterior looks like…. it will have a high probability for fairly smooth monotonically increasing functions. Good luck doing that analytically.
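
A minimal sketch of the method just described, with several assumptions that are not in the comment: a cosine series on [0,1] is used as the Fourier basis, the fake x,y values are made up, “strongly downweight” is implemented as a quadratic penalty on negative slopes over a grid, and all scales and weights are hypothetical. It only evaluates the unnormalized log prior; sampling from it would need an MCMC step on top.

```python
import math

def cosine_series(x, coefs):
    """f(x) = sum_k coefs[k] * cos(pi * k * x) on [0, 1]."""
    return sum(c * math.cos(math.pi * k * x) for k, c in enumerate(coefs))

def cosine_series_slope(x, coefs):
    """Derivative of cosine_series with respect to x."""
    return sum(-c * math.pi * k * math.sin(math.pi * k * x)
               for k, c in enumerate(coefs))

# Made-up fake measurements: rise on [0, .5] to about y = 1, then flat.
fake_xy = [(0.0, 0.0), (0.125, 0.3), (0.25, 0.6), (0.375, 0.85),
           (0.5, 1.0), (0.75, 1.0), (1.0, 1.0)]

def log_prior(coefs, broad_sd=10.0, fake_sd=0.25, slope_weight=100.0):
    """Unnormalized log prior: broad prior + fake data + slope penalty."""
    log_broad = sum(-0.5 * (c / broad_sd) ** 2 for c in coefs)
    log_fake = sum(-0.5 * ((y - cosine_series(x, coefs)) / fake_sd) ** 2
                   for x, y in fake_xy)
    slope_grid = [i / 50 for i in range(51)]
    penalty = sum(max(0.0, -cosine_series_slope(x, coefs)) ** 2
                  for x in slope_grid)
    return log_broad + log_fake - slope_weight * penalty

increasing = [0.7, -0.45, 0.0]   # f rises from 0.25 to 1.15 on [0, 1]
decreasing = [0.7, 0.45, 0.0]    # mirror image: f falls on [0, 1]
print(log_prior(increasing), log_prior(decreasing))
```

The increasing candidate scores far higher than the decreasing one, and the dependencies among coefficients that produce that preference never have to be written down explicitly.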

Realistically speaking I think we’re better off talking about doing a good job of choosing your augmented data than we are discouraging this kind of thing. It’s a hugely valuable tool to impose real world knowledge on a problem.