Comments on: A Bayesian view of data augmentation.

By: Dave C.

Dave C. — Sat, 04 Jul 2020 13:43:52 +0000

This paper from Greenland, “Bayesian perspectives for epidemiological research”, gives further examples. Interested in others’ reactions.
https://pubmed.ncbi.nlm.nih.gov/16446352/

By: Keith O'Rourke

Keith O'Rourke — Mon, 02 Mar 2020 22:45:53 +0000

In reply to Shuxiao Chen.

Cool – thanks for letting me know!

Not surprised, as the math of likelihoods and priors is the same.

The paper is on my list to read.

Also reminds me of courses I took from Don Fraser many years ago.

By: Shuxiao Chen

Shuxiao Chen — Mon, 02 Mar 2020 22:21:13 +0000

Nice Bayesian interpretation. We posted a paper last year (https://arxiv.org/abs/1907.10905), which gives a frequentist treatment of data augmentation. We start with the assumption that the data X has some probabilistic symmetries: g(X) is approximately equal to X in distribution for any g in a group G. This is the “prior” we have from a frequentist’s point of view. We then show that training with data augmentation is equivalent to optimizing an “orbit-averaged” objective, which provably leads to variance reduction.

For likelihood-based models, this is exactly the marginal likelihood, but marginalized w.r.t. the Haar measure on the group of transformations under consideration.
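A toy illustration of the variance-reduction claim, not the paper's construction: assume X is standard normal, so it is symmetric under the two-element group {x -> x, x -> -x}, and take a hypothetical loss f(x) = (x + 1)^2. Averaging f over the orbit {x, -x} leaves the expectation unchanged but removes the part of f that is odd in x, which is pure noise here.

```python
import random
import statistics

def f(x):
    # Hypothetical per-sample objective; any function would do.
    return (x + 1.0) ** 2

def orbit_average(func, x):
    # Average over the orbit of x under the sign-flip group {+1, -1}.
    return 0.5 * (func(x) + func(-x))

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(20000)]

plain = [f(x) for x in xs]
averaged = [orbit_average(f, x) for x in xs]

mean_plain = statistics.mean(plain)
mean_averaged = statistics.mean(averaged)
var_plain = statistics.pvariance(plain)
var_averaged = statistics.pvariance(averaged)

print("means:", mean_plain, mean_averaged)
print("variances:", var_plain, var_averaged)
```

Both means estimate the same quantity, while the orbit-averaged terms have the smaller spread.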

By: Keith O'Rourke

Keith O'Rourke — Thu, 05 Dec 2019 13:33:26 +0000

In reply to jgyou.

> Information is lost, but the model does better.
I think the model is doing better because useful information is being brought in by the prior specification.

Now with two competing prior specifications, one analytical and the other through augmented data, if the augmented-data one would converge to the analytical one given exhaustive enough “examples”, then there is a loss of information in using the augmented-data one. So the model does better with the augmented one, but not as well as it would with the analytical one.

But one needs to keep in mind Daniel’s warning about the “human information processing” issue, which suggests the analytical one will likely be misspecified: https://statmodeling.stat.columbia.edu/2019/12/02/a-bayesian-view-of-data-augmentation/#comment-1200514

By: Daniel Lakeland

Daniel Lakeland — Thu, 05 Dec 2019 02:14:08 +0000

In reply to jgyyou. x is not random; it's a realized thing, and ax is also a realization. This information-theoretic measure is a property of two distributions, not of a particular sample. To the extent that we have a Bayesian model, u is a distribution that induces a predictive data distribution D*, and potential future observations are a distribution D, so the relevant information-theoretic calculation is I(D*,D): the posterior predictive data vs. the real future data.

By: jgyyou

jgyyou — Thu, 05 Dec 2019 00:48:38 +0000

In reply to Daniel Lakeland.

Ah right, although now that I think of it, the requirement that the transformation be independent of x is spurious.
One only needs the process to be a Markov chain.

“the mapping from x to ax is not a random variable”
a -> ax *is* a random transformation, no? Say when adding random Gaussian noise to each pixel.
The same holds if you rotate the picture by a random amount.

Moreover, if the transformation is deterministic, then we trivially achieve the upper bound of the inequality: we haven’t reduced the mutual information, but we haven’t increased it either.
Therefore, the usefulness of data augmentation can’t be explained in terms of “information” in the information-theoretic sense.

By: Daniel Lakeland

Daniel Lakeland — Wed, 04 Dec 2019 22:27:05 +0000

In reply to jgyou. I don't think this is correct. The mapping from x to ax is not a random variable, and the augmentation procedure obviously *does* depend on x if you take x and rotate it, or take x and add noise to it.

By: jgyou

jgyou — Wed, 04 Dec 2019 22:01:25 +0000

In reply to Keith O’Rourke.

Thanks (I’m anon, forgot to login above)

Alex Alemi clarified properly what I meant.
One can view the transformation from data x to augmented data ax to parameters u as a Markov chain x -> ax -> u.
(Provided the choice of augmentation procedure is not conditioned on x, and it isn’t when we just rotate or add noise indiscriminately.)
When the Markov property holds, there’s a strict sense in which information is lost along the chain.

As a result, I’m not sure thinking in terms of “information” is the right angle. Information is lost, but the model does better. Perhaps the connection between prior and regularization is a more fruitful avenue.

By: Daniel Lakeland

Daniel Lakeland — Wed, 04 Dec 2019 21:12:38 +0000

In reply to Keith O’Rourke.

Practically speaking, for high dimensional parameters, I think data augmentation is far far more likely to converge to something we can believe in than any amount of tweaking an analytical expression could do.

A big part of that is that specifying dependencies in high dimensions is very tricky, but they fall out automatically when you use a data-augmentation method.

A simple example: specify a prior over Fourier coefficients that gives you functions on [0,1] which increase rapidly for x in [0,.5] up to a maximum near y=1 and stay flat out to x=1.

One method: start with a very broad prior on all the coefficients, then create some fake x,y values that have the properties you want, and define a normal(y,.25) iid model over the fake measurements. Then calculate the derivative at each x value and strongly downweight the coefficients if they produce any negative derivatives… and see what the posterior looks like. It will have a high probability for fairly smooth, monotonically increasing functions. Good luck doing that analytically.

Realistically speaking I think we’re better off talking about doing a good job of choosing your augmented data than we are discouraging this kind of thing. It’s a hugely valuable tool to impose real world knowledge on a problem.

By: Keith O'Rourke

Keith O'Rourke — Wed, 04 Dec 2019 20:50:46 +0000

In reply to Daniel Lakeland. Daniel: My p.s. presumed the analytical prior could be specified without error and that P(u) * P(ax|u) would eventually converge to it with extensive enough data augmentation. In that sense, I would argue the analytical prior is more satisfying. The prior predictive for fake ax would then be a nice way to check the analytical prior specification. The rest of your comment does seem consistent with the equations I gave in my post. If it's not, please let me know.

By: Daniel Lakeland

Daniel Lakeland — Wed, 04 Dec 2019 16:21:13 +0000

In reply to Daniel Lakeland. correction: where the *prior predictive* density of air pollution

By: Daniel Lakeland

Daniel Lakeland — Wed, 04 Dec 2019 16:03:18 +0000

From the PS: >It is also unsatisfactory from a Bayesian perspective, according to which assumptions and expert knowledge should be explicitly encoded in the prior distribution only.

Actually adding fake data *is* encoding knowledge into the prior only.

The posterior is (proportional to): p(data | parameters) p(parameters)

where p(parameters) is any normalizable non-negative function encoding your information about the parameters.

so, let p(parameters) = p_pre(parameters) * p_fake(fake_data | parameters)

where p_pre(parameters) is a valid density encoding only vague bounds, and p_fake(fake_data | parameters) is a masking function that squishes the density in regions of parameter space that are far away from predicting fake_data (in particular you can for example provide a different standard deviation than for your real data, so p_fake need not be the same function as p(data|parameters) for your measurements)

now p(parameters) is proportional to a valid density, and it encodes your knowledge that the parameters are within some bounds and tend to predict things like fake_data…

QED
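A minimal grid sketch of the construction above, assuming a single location parameter mu, a p_pre that is flat on [-10, 20], and normal likelihoods. All numbers (the fake data, the real data, both standard deviations) are made up for illustration. The fake data get their own, wider standard deviation, as the comment allows.

```python
import math

def normal_logpdf(y, mean, sd):
    return -0.5 * ((y - mean) / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))

def normalize(log_weights):
    # Turn unnormalized log weights on the grid into probabilities.
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

def mean_sd(grid, probs):
    mean = sum(g * p for g, p in zip(grid, probs))
    var = sum((g - mean) ** 2 * p for g, p in zip(grid, probs))
    return mean, math.sqrt(var)

grid = [-10.0 + 0.01 * i for i in range(3001)]  # mu from -10 to 20

# p_pre: a valid density encoding only vague bounds (flat on the grid).
log_pre = [0.0 for _ in grid]

# p_fake: masks p_pre toward values of mu that predict the fake data.
fake_data, fake_sd = [-5.5, -5.0, -4.5], 1.0
log_prior = [
    lp + sum(normal_logpdf(y, mu, fake_sd) for y in fake_data)
    for mu, lp in zip(grid, log_pre)
]

# The real likelihood is applied on top of the masked prior.
real_data, real_sd = [-4.8, -5.3, -5.1, -4.9], 0.5
log_post = [
    lp + sum(normal_logpdf(y, mu, real_sd) for y in real_data)
    for mu, lp in zip(grid, log_prior)
]

pre_mean, pre_sd = mean_sd(grid, normalize(log_pre))
prior_mean, prior_sd = mean_sd(grid, normalize(log_prior))
post_mean, post_sd = mean_sd(grid, normalize(log_post))
print(pre_sd, prior_sd, post_sd)
```

The three printed spreads shrink in order: vague bounds, then the fake-data prior, then the posterior with real data.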

It *is* possible to do this badly, just as it’s possible to do other ways of encoding priors badly. But this is actually a good way to encode priors.

Remember the case that Dan Simpson mentioned a while back where the density of air pollution was somewhere between the density of concrete and the density of neutron stars? Well if he’d just put some fake data where it was around 1e-6 to 1e-4 times the density of air… he’d have squished his priors into line with his knowledge, without having to tune individual hyper-priors or whatever…

By: Keith O'Rourke

Keith O'Rourke — Wed, 04 Dec 2019 14:29:51 +0000

In reply to Mark van der Wilk. Mark: Please see the p.s. in my post.

By: Keith O'Rourke

Keith O'Rourke — Wed, 04 Dec 2019 14:28:55 +0000

In reply to Anonymous. Anonymous: Please see the p.s. in my post.

By: John Hall

John Hall — Wed, 04 Dec 2019 13:45:02 +0000

In reply to John Hall. It turns out that what I was describing is also called Multi-Input Multi-Output (MIMO) prediction in the machine learning literature.

By: John Hall

John Hall — Tue, 03 Dec 2019 15:11:00 +0000

In reply to John Hall.

As I am not 100% on what is meant by data augmentation, I am thinking about it in terms of time series models where you need to enforce stationarity. So for instance, an AR(p) model. When p=1, it is easy to impose the constraints to ensure this, but it becomes more complicated for larger p.

If the original model is y_t = B * y_t_minus_1 + e_t with some prior, we could augment this with y_t_plus_j = B_j * y_t_minus_1 + e_t_plus_j, for some js. Where B_j is calculated from B and the standard deviation of e_t_plus_j is calculated from the standard deviation of e_t. So if I add j=1 or more, then this would be data augmentation?

I would think that reducing the chance of getting a B that is bad at forecasting many periods out (it is explosive or something) would be a good thing. I just wasn’t sure what the justification for doing this was. It just seemed like a hack.
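For the AR(1) case, substituting the recursion into itself gives B_j = B^(j+1), and an error standard deviation of sigma * sqrt(1 + B^2 + ... + B^(2j)) for the j-steps-further equation. A rough sketch of the resulting augmented log-likelihood follows; the function names are hypothetical, and treating the overlapping multi-step errors as independent is an approximation, since they share shocks.

```python
import math

def multi_step_coef(b, j):
    # y_{t+j} = b**(j+1) * y_{t-1} + (accumulated error)
    return b ** (j + 1)

def multi_step_sd(b, sigma, j):
    # The accumulated error is the sum over k = 0..j of b**k * e_{t+j-k}.
    return sigma * math.sqrt(sum(b ** (2 * k) for k in range(j + 1)))

def augmented_loglik(y, b, sigma, horizons):
    # horizons = [0] recovers the usual conditional AR(1) log-likelihood;
    # adding j >= 1 appends the multi-step "augmented" terms.
    total = 0.0
    for j in horizons:
        coef = multi_step_coef(b, j)
        sd = multi_step_sd(b, sigma, j)
        for t in range(1, len(y) - j):
            resid = y[t + j] - coef * y[t - 1]
            total += -0.5 * (resid / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))
    return total

y = [1.0, 0.5, 0.25, 0.125]
print(augmented_loglik(y, 0.5, 1.0, [0]))
print(augmented_loglik(y, 0.5, 1.0, [0, 1, 2]))
```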

By: John Hall

John Hall — Tue, 03 Dec 2019 14:58:12 +0000

In reply to Andrew Jaffe. I'm sympathetic to this point. I'm still not entirely sure what they mean when they say data augmentation, even after seeing Keith's comment below.

By: Keith O'Rourke

Keith O'Rourke — Tue, 03 Dec 2019 13:41:25 +0000

In reply to Alex Alemi. Alex - thanks for the clarification.

By: Keith O'Rourke

Keith O'Rourke — Tue, 03 Dec 2019 13:39:50 +0000

In reply to Yuling Yao. Nice point, A ~ B does imply B ~ A.

By: Keith O'Rourke

Keith O'Rourke — Tue, 03 Dec 2019 13:38:09 +0000

In reply to Mark van der Wilk. Mark - Thanks, on my to read list.

By: Keith O'Rourke

Keith O'Rourke — Tue, 03 Dec 2019 13:37:05 +0000

In reply to Andrew Jaffe. Andrew J. please see response below next comment.

By: Keith O'Rourke

Keith O'Rourke — Tue, 03 Dec 2019 13:35:50 +0000

In reply to Mark van der Wilk.

Andrew J.

You are right that I should have distinguished data augmentation as it's understood in ML from sampling from the posterior as it's often used in Bayes. I meant to and then forgot.

To be more candid than I perhaps should be, I was about to delete the post and then noticed the blog seemed a bit slow recovering from the Thanksgiving holiday. I am glad I posted it, as the comments are interesting rather than, as I feared, this being “old news”.

The ax is augmented data and P.au is the augmented prior for u: P.au(u) = P(u) * P(ax|u).
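Written out in LaTeX, and assuming the real data x and the augmented data ax are conditionally independent given u (an assumption made here to keep the bookkeeping simple), the augmented prior enters the posterior as follows:

```latex
\begin{align}
P_{\mathrm{au}}(u) &\propto P(u)\, P(ax \mid u), \\
P(u \mid x, ax) &\propto P(u)\, P(ax \mid u)\, P(x \mid u)
  \;\propto\; P_{\mathrm{au}}(u)\, P(x \mid u).
\end{align}
```

The proportionality signs absorb the normalizing constant of P.au, which does not depend on u.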

By: Yuling Yao

Yuling Yao — Tue, 03 Dec 2019 05:13:05 +0000

The opposite narrative to “data augmentation is just a prior” is “a prior is just (fake) data augmentation” (besides the view that a prior is just more (real) data), in the sense that a prior may encode some robustness property such as rotation invariance. Therefore, adversarial training should be viewed as a way to construct a prior.

By: Alex Alemi

Alex Alemi — Tue, 03 Dec 2019 04:20:39 +0000

In reply to Alex Alemi.

Somehow the start of my comment was lost:

Not the original Anonymous, but I believe they were invoking an information-theoretic interpretation, assuming the inference procedure only looks at augmented data and the augmentation procedure doesn’t depend on the inferred parameters, so that we satisfy the …

By: Alex Alemi

Alex Alemi — Tue, 03 Dec 2019 01:27:26 +0000

In reply to Keith O’Rourke.

Markov chain x->ax->u, then the mutual information between the data and the inferred parameters (I(x;u)) has to be less than or equal to the mutual information between the augmented data and the inferred parameters (I(ax;u)) by the [data processing inequality](https://en.wikipedia.org/wiki/Data_processing_inequality): I(x;u) <= I(ax;u) (I think they made a typo), and also I(x;u) <= I(x;ax). So there is a strict and formal sense in which, if you do data augmentation, you are always extracting less information from the data. Intuitively, the noise injected in the augmentation procedure shields the data from the inference procedure. Granted, this is assuming the choice of augmentation procedure doesn't depend on the data itself, but scribbling on a napkin, I still have that I({x}, u) ax, and do inference on the whole set {ax}.
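The two inequalities can be checked numerically on a small discrete chain. The sketch below assumes a binary x, an augmentation step that flips x with probability 0.1, and an "inference" step that flips ax with probability 0.2. These channels and numbers are illustrative assumptions, not anything from the post.

```python
import math
from itertools import product

def flip(p):
    # Transition matrix of a binary channel that flips its input with probability p.
    return [[1.0 - p, p], [p, 1.0 - p]]

def mutual_information(joint):
    # Mutual information in bits for a 2-D joint probability table.
    row_marg = [sum(row) for row in joint]
    col_marg = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pij in enumerate(row):
            if pij > 0.0:
                mi += pij * math.log2(pij / (row_marg[i] * col_marg[j]))
    return mi

p_x = [0.5, 0.5]
augment = flip(0.1)  # p(ax | x)
infer = flip(0.2)    # p(u | ax); u depends on x only through ax

# Joint distribution of the Markov chain x -> ax -> u.
joint3 = {
    (x, a, u): p_x[x] * augment[x][a] * infer[a][u]
    for x, a, u in product(range(2), repeat=3)
}

def pair_table(joint, first, second):
    # Marginalize the three-way joint down to the chosen pair of variables.
    table = [[0.0, 0.0], [0.0, 0.0]]
    for key, p in joint.items():
        table[key[first]][key[second]] += p
    return table

i_x_ax = mutual_information(pair_table(joint3, 0, 1))
i_ax_u = mutual_information(pair_table(joint3, 1, 2))
i_x_u = mutual_information(pair_table(joint3, 0, 2))
print(i_x_u, i_ax_u, i_x_ax)
```

Here I(x;u) comes out smallest of the three, as the data processing inequality requires.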

By: Mark van der Wilk

Mark van der Wilk — Tue, 03 Dec 2019 01:24:36 +0000

Around the same time as the Dao et al “kernel theory on data augmentation” we published a paper [1] arguing that data augmentation constrains the prior over the mappings we want to learn (e.g. images to labels) to be invariant to the transformations in the augmentation. However, we use a different construction than the ‘additional likelihood’, although we end up with the same kernel as Dao et al.

We start with a prior over functions p(g) that have no invariance properties. We construct a prior over more invariant functions by averaging over the data augmentation process p(ax | x) (in the paper we show how this relates to invariances):

f(x) = \int g(ax) p(ax | x ) d{ax}

In our construction, p(ax | x) is more a means to an end to parameterise the prior on f( ).
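A toy version of this averaging, assuming p(ax | x) is uniform over the four quarter-turn rotations of a 2-D point, so that the integral becomes a finite sum. The function g below is an arbitrary hypothetical choice with no invariance of its own.

```python
import math

def rotate(point, quarter_turns):
    # Rotate a 2-D point counterclockwise by a multiple of 90 degrees.
    x, y = point
    for _ in range(quarter_turns % 4):
        x, y = -y, x
    return (x, y)

def g(point):
    # Arbitrary non-invariant function (hypothetical).
    x, y = point
    return math.sin(x) + 0.5 * y * y + x * y

def f(point):
    # f(x) = sum over ax of g(ax) p(ax | x), with p uniform on the 4 rotations.
    return sum(g(rotate(point, k)) for k in range(4)) / 4.0

p = (0.3, -1.2)
q = rotate(p, 1)
print(g(p), g(q))  # g changes under rotation
print(f(p), f(q))  # f does not, up to rounding
```

Because the rotations form a group, p and q have the same orbit, so the average is the same for both.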

My favourite part of our paper is that we put a Gaussian process prior over g( ) (and equivalently f( )), which allows us to learn the data augmentation through marginal likelihood maximisation. One could also think about learning a joint posterior over functions and invariance properties as a hierarchical model.

This allows you to automatically figure out transformations that leave your image label intact, such as skews and scales for MNIST, or full rotation invariance for rot-MNIST (see paper or a toy example [2]). Of course, you always need to specify the models/transformations that can be considered!

[1] Learning Invariances using the Marginal Likelihood: https://papers.nips.cc/paper/8199-learning-invariances-using-the-marginal-likelihood
[2] https://twitter.com/markvanderwilk/status/1184235600414683136

By: Andrew Jaffe

Andrew Jaffe — Mon, 02 Dec 2019 23:36:36 +0000

In reply to Andrew Jaffe.

Apologies that my previous comment used markdown, which apparently is not accepted here. This version may look marginally better:

This post could benefit from a definition of (or at least a link to) “data augmentation”. As far as I can tell, when it is used in a machine-learning context (e.g., https://towardsdatascience.com/data-augmentation-for-deep-learning-4fe21d1a4eb9) it does not mean the same thing as when it is used in a more standard Bayesian context (e.g., in Gibbs sampling as in https://link.springer.com/article/10.1007/s11222-013-9435-z).

Also, I hope I’m not nit-picking, but the post could also benefit from a little clarity of notation (what’s “ax”? what “P.au”?) and, if possible, using actual latexified equations.

By: Andrew Jaffe

Andrew Jaffe — Mon, 02 Dec 2019 23:33:35 +0000

This post could benefit from a definition of (or at least a link to) *data augmentation*. As far as I can tell, when it is used in a [machine-learning context](https://towardsdatascience.com/data-augmentation-for-deep-learning-4fe21d1a4eb9) it does not mean the same thing as when it is used in a more standard Bayesian context (e.g., in [Gibbs sampling](https://link.springer.com/article/10.1007/s11222-013-9435-z)).

Also, I hope I’m not nit-picking, but the post could also benefit from a little clarity of notation (what’s “ax”? what “P.au”?) and, if possible, using actual latexified equations.

By: Keith O’Rourke

Keith O’Rourke — Mon, 02 Dec 2019 22:19:50 +0000

In reply to Timothy. Timothy: Thanks, I thought about referencing those but did not as there the explicit purpose is to incorporate a prior via fake data.

By: Keith O’Rourke

Keith O’Rourke — Mon, 02 Dec 2019 22:17:17 +0000

In reply to Anonymous. Anonymous: Can you expand a bit or give a reference?

By: Michael Nelson

Michael Nelson — Mon, 02 Dec 2019 22:02:09 +0000

You could think of augmentation as adding information to the data in a model without directly including prior data–as you say, augmented data are not real data. But you could also think of it as just another way of adding assumptions to the model. I think it’s best to keep that last perspective in mind regardless. Because we know to be cautious, even skeptical, when we talk about adding “assumptions,” whereas talking about adding “information” or “background knowledge” at least sounds like we’re doing something less questionable. Hopefully your machine learners realize their assumptions are still fallible, as is the decision about the right/best way to realize those assumptions as augmentations. (Of course, such caution may be less relevant if you’re trying to build a purely predictive, and potentially self-correcting, algorithm, as opposed to trying to describe the real world.)

By: Daniel Lakeland

Daniel Lakeland — Mon, 02 Dec 2019 21:20:32 +0000

In reply to Daniel Lakeland.

This concept goes hand in hand with the “masking” concept we talked about a while back and which I wrote up on my blog: http://models.street-artists.org/2019/09/04/informative-priors-from-masking/

Imagine, for example, you want to describe a prior over functions expressed as Fourier series. You expect these functions to be smooth, but, say, mostly increasing across the range of x values. You set up the Fourier coefficients to have vague priors, leading to a prior distribution where realizations are all sorts of strange oscillating functions…

Then you draw on a piece of paper 5 representative curves, read off the value of the curves at 4 or 5 points each, and create a dataset y[i,j] where i indexes the curves, and j indexes the points on the curve.

Then you set up a “fake likelihood” where it’s a mixture model of likelihoods across all the y[i,j] + iid normal error.

You’ll wind up with a prior over the coefficients in which functions not too far from the sample of 5 representative curves are common, and functions that are weird compared to any of those samples are uncommon.

Now add in your real data, use your Fourier coefficient model to describe whatever you needed it to describe, and get your real posterior over the coefficients.
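A rough sketch of the first half of this recipe (building the masked prior), using importance weights rather than a full sampler. Assumptions that are not in the comment: a three-term cosine series on [0,1], standard normal vague priors, three hand-made curves read off at five points, a fake error sd of 0.2, and "the function ends higher than it starts" as a crude stand-in for "mostly increasing".

```python
import math
import random

XS = [0.0, 0.25, 0.5, 0.75, 1.0]
FAKE_CURVES = [  # y[i][j]: curve i read off at point j (made-up values)
    [0.0, 0.3, 0.6, 0.85, 1.0],
    [0.0, 0.5, 0.8, 0.95, 1.0],
    [0.1, 0.2, 0.5, 0.8, 0.9],
]
FAKE_SD = 0.2

def series(coefs, x):
    # Half-range cosine series: sum over k of c_k * cos(pi * k * x).
    return sum(c * math.cos(math.pi * k * x) for k, c in enumerate(coefs))

def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fake_likelihood(coefs):
    # Equal-weight mixture over curves of iid normal likelihoods.
    fitted = [series(coefs, x) for x in XS]
    per_curve = [
        math.prod(normal_pdf(y, m, FAKE_SD) for y, m in zip(curve, fitted))
        for curve in FAKE_CURVES
    ]
    return sum(per_curve) / len(per_curve)

def rises(coefs):
    return series(coefs, 1.0) > series(coefs, 0.0)

rng = random.Random(1)
draws = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(20000)]
weights = [fake_likelihood(c) for c in draws]

# Share of rising functions under the vague prior vs. the masked prior.
frac_vague = sum(rises(c) for c in draws) / len(draws)
frac_masked = sum(w for c, w in zip(draws, weights) if rises(c)) / sum(weights)
print(frac_vague, frac_masked)
```

The real data would then multiply these weights by the real likelihood, or the masked prior would go into a proper sampler.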

By: Daniel Lakeland

Daniel Lakeland — Mon, 02 Dec 2019 21:07:46 +0000

We sometimes talk about looking at the prior predictive distribution to help us calibrate our priors to what we actually think. Perhaps a better way to handle that in practice is to specify our vague priors, and then a set of augmented data, such that with the augmented data, the posterior better describes what we meant the prior to describe…. We then add the real data and get the “real” posterior. The augmented data is just a trick to calculate the prior we wanted in the first place.

By: Timothy

Timothy — Mon, 02 Dec 2019 20:45:10 +0000

I think there’s definitely “prior” precedent for viewing data augmentation as encoding prior beliefs. For example:

“Data augmentation priors for Bayesian and semi‐Bayes analyses of conditional‐logistic and proportional‐hazards regression” by Greenland and Christensen (2001)
https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.902

“Prior data for non‐normal priors” by Greenland (2007)
https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.2788

“A New Perspective on Priors for Generalized Linear Models” by Bedrick, Christensen, and Johnson (1996)
https://www.tandfonline.com/doi/abs/10.1080/01621459.1996.10476713

By: Anonymous

Anonymous — Mon, 02 Dec 2019 20:45:03 +0000

What about the data processing inequality?
a -> ax -> u can be viewed as a Markov chain, and so we should have that I(a,u) >= I(ax, u).