Thanks, will take a look :-)

]]>OK one last quick comment. The gist (or a gist) of the argument is that I’m not convinced Bayesian inference is fully well-defined, particularly for continuous and/or infinite dimensional problems.

One way to make a possibly ill-defined problem better defined is to introduce a concrete procedure for computing what you mean.

Eg MCMC for actually sampling ‘the’ posterior. But multiple posteriors yield the same expectations, especially with finite computation power. Unfortunately these can also be arbitrarily far apart in the strong topology.

So what are we actually doing? What is ‘the’ posterior target in general if we only care about expectations? Do we care about strong or weak topologies etc?

So I’m not fully convinced the answer to ‘what are we doing’ is properly defined, though open to convincing. On the other hand I see alternative, more direct methods which relate directly to the actual boring mathematics underlying eg MCMC ‘technology’ and wonder why we don’t just consider those the fundamental concepts instead. Others seem to have done so.

]]>I think these responses miss the point somewhat but as Chris mentions we are unlikely to resolve the issues here. Plus I just landed in Hawaii for a holiday :-)

]]>There are two reasons to model a distribution as normal: 1) we know it’s a good description of the actual distribution; 2) we don’t know much about the actual distribution but want to keep the model simple. The Gaussian is the maximum entropy distribution with the first two moments fixed, but of course nothing prevents you from adding higher order terms (more parameters in the model). I’m not sure what the fundamental problem with a single sample is (apart from the low information content). We agree that statistics is easier (even unnecessary) if you have enough data.
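For what it’s worth, the single-sample case is easy to write down explicitly in the conjugate normal setting. A toy sketch (the prior, noise scale, and observation are all made up for illustration):

```python
import numpy as np

# Hypothetical numbers: prior N(0, 2^2) on the unknown mean, known noise sd 1,
# and a single observation y = 3.
mu0, tau0 = 0.0, 2.0   # prior mean and sd
sigma = 1.0            # known observation sd
y = 3.0                # the single sample

# Conjugate update: posterior precision adds, posterior mean is the
# precision-weighted average of prior mean and the observation.
post_prec = 1/tau0**2 + 1/sigma**2
post_var = 1/post_prec
post_mean = post_var * (mu0/tau0**2 + y/sigma**2)

print(post_mean, np.sqrt(post_var))  # pulled from y back toward the prior
```

With these numbers the posterior mean is 2.4: the single observation dominates but the prior still matters, which is exactly the sensitivity to prior information people point to in this example.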

I agree with Daniel about the typical set. That’s just a property of distributions when the number of dimensions grows to infinity. Nothing has to be Bayesian about it, in the same way that there is nothing Bayesian about MCMC or other numerical integration schemes. But if you have a probability distribution you can use MCMC to calculate expectations, and if you have a very high-dimensional probability distribution you may see the concentration of measure (I still don’t know what makes it interesting, though).
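The concentration being discussed is easy to see numerically: draw from a standard Gaussian in n dimensions and look at distances from the mode at the origin. The dimensions and draw counts below are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [2, 20, 2000]:
    x = rng.standard_normal((5000, n))   # 5000 draws from N(0, I_n)
    r = np.linalg.norm(x, axis=1)        # distance from the mode at 0
    # The radii pile up near sqrt(n); almost no probability mass sits
    # near the mode, and the relative spread shrinks as n grows.
    print(n, r.mean() / np.sqrt(n), r.std() / r.mean())
```

Nothing Bayesian is happening here: it is just the chi distribution concentrating, i.e. the typical set pulling away from the high-density point.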

]]>ojm, my (possibly incorrect) understanding is that the infinite Gaussian mixture model doesn’t have a Radon-Nikodym derivative that’s useful for Bayesian inference. For example, I can’t spot a clear use of the density version of Bayes’s theorem in this paper. IIRC the papers I’ve read on it give algorithms for sampling from the posterior measure without defining a posterior density as such. A related example is this paper on the Mondrian process, in which a consistency property is used to give an algorithm for posterior sampling; I can’t spot a posterior density per se or a use of the density version of Bayes’s theorem.

]]>Daniel, you are right of course about MCMC as a machine for Bayesian inference, rather than the inference itself. I think the meat of the matter here is that ojm just doesn’t like Bayesian inference anymore (nor, apparently, likelihood), for some set of reasons that I don’t fully understand (but have concluded we are unlikely to resolve here :) )

]]>I think we need to make the distinction between focusing on large sample size of scientific data, vs focusing on large sample size of MCMC. Large sample size of MCMC is more or less like using a small step size for an ODE solver. It just makes the calculation more accurate, it doesn’t help you get closer to the true science if your model is wrong.

On the other hand, regarding focusing on large sample size for data collection, I think I agree with you. When you are in fact imposing sufficient random structure on the problem (i.e. using a computer RNG to sub-sample a finite, well-defined population of things, for example), then large sample size helps you. But when you’re working with non-randomly-selected samples you shouldn’t be focusing on CLT-type theorems, because the non-randomness of your sample violates those assumptions. Then you’re stuck using an appropriate Bayesian model that acknowledges this. For example, in polling Trump vs Hillary there should have been a model structure that acknowledged an unknown bias in the polling, and made some prior guesses about its possible size. Instead, relying on a CLT-type result, you get polls that decide that Hillary has a 95% chance of winning… If results were truly from random samples with 100% compliance etc. it would have been true. But they weren’t.
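A toy simulation of the polling point (all numbers invented, not real polling data): if every poll can only reach a systematically biased slice of the electorate, nominal 95% CLT intervals stop covering the truth no matter how many polls are run.

```python
import numpy as np

rng = np.random.default_rng(1)
true_p, bias, n = 0.48, 0.03, 1000   # hypothetical support, bias, poll size
polls = 2000

# Each poll samples from the *reachable* (biased) population, not the electorate.
p_reached = true_p + bias
phat = rng.binomial(n, p_reached, size=polls) / n
se = np.sqrt(phat * (1 - phat) / n)

# Nominal 95% CLT intervals for the true support:
covered = (phat - 1.96*se <= true_p) & (true_p <= phat + 1.96*se)
print(covered.mean())   # far below 0.95: more polls can't fix a systematic bias
```

The CLT machinery is working exactly as advertised; it is the random-sampling assumption that fails, which is why a model with an explicit bias term (and a prior on its size) is the honest fix.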

]]>ojm: it’s just focusing on sampling the distribution. In high dimensions samples aren’t in that little corner of the high probability set at the mode, because the volume there is too small, that’s all. No one’s actually keeping the sampler out of that region, it just won’t ever get there because of the math.

When it comes to relying on mathematical probability while doing MCMC sampling, this is because Bayesian probability as quantification of scientific uncertainty ends when you write down the distribution you want to use. In the same way that PDEs as a description of a scientific process involving momentum and energy and whatever ends after you write down the equations and the boundary conditions. After that it’s mechanical calculation that becomes the issue.

As far as mechanical calculation goes, in PDEs we have finite elements or finite volume or finite difference equations or whatever, and in Bayesian stats we have MCMC of some sort. Both are designed just as pure mechanical deduction… this equation and these boundary conditions imply this new state at this later time… or similarly, this density over the parameters implies this chain of samples converges to give the right expectations. Numerical analysis of difference equations isn’t scientific modeling of the behavior of physical objects, and MCMC sampling isn’t scientific analysis of the uncertainty in models and data… but both are techniques for calculation that apply to their individual purposes.
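To illustrate the “pure mechanical deduction” point, here is a minimal random-walk Metropolis sampler. Nothing in it knows anything about the scientific meaning of the density it targets; it is just an algorithm that converts a log-density into samples whose expectations converge. (The target and tuning values below are arbitrary illustrations.)

```python
import numpy as np

def metropolis(logp, x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis: purely mechanical; logp is just a function."""
    rng = np.random.default_rng(seed)
    x, lp, out = x0, logp(x0), []
    for _ in range(n_steps):
        prop = x + step * rng.standard_normal()   # symmetric proposal
        lp_prop = logp(prop)
        # Accept with probability min(1, p(prop)/p(x)), in log space.
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
        out.append(x)
    return np.array(out)

# Target: standard normal, known only up to a constant.
draws = metropolis(lambda x: -0.5 * x**2, x0=0.0, n_steps=20000)
print(draws.mean(), draws.var())  # expectations converge toward 0 and 1
```

The analogy with a finite-difference scheme holds up: the normalizing constant, like the continuum limit, never needs to be computed; the recipe only uses ratios, just as the difference equations only use local stencils.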

]]>Daniel – I mean focusing on sampling the typical set rather than the full high probability set, which as we’ve discussed are not quite the same thing.

More generally, taking advantage of probabilistic convergence results with increasing sample size seems to me pretty frequentist, and/or ‘empirical inference’ in the Vapnik etc style.

As far as I am aware none of the convergence results/concentration of measure/ whatever are very Bayesian in spirit. They introduce explicit sample size, asymptotics etc which seem to do most of the work and have nothing to do with logical or a priori probabilistic modelling (which ironically usually assume an _exact_ model of uncertainty)

]]>ojm: I’m not sure what you mean by “typical set sampling”. All the sampling methods are MCMC, and it’s just a mathematical fact that basically all the probability mass is in the typical set, and so that’s where MCMC spends its time.

Why do we only compute expectations? Because in say 2000 dimensions, even just one sample on each side of the median in each dimension is 2^2000 samples.

Yet we can get good expectations with 20 or 100 or in extreme cases maybe 1000 effective samples.
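The gap between covering the space and estimating an expectation is easy to demonstrate numerically (the dimension and sample count here are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_samples = 2000, 100          # a tiny sample in a huge space
x = rng.standard_normal((n_samples, dim))

# Estimate E[||x||^2] = dim with only 100 iid samples.
# Monte Carlo error scales like 1/sqrt(n_samples), not with dimension,
# even though the 2^2000 orthants are almost entirely unexplored.
est = (x**2).sum(axis=1).mean()
print(est / dim)   # close to 1
```

This is the whole trick: expectations of well-behaved functions only need the sampler to land in the typical set, not to enumerate the space.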

]]>Eg someone like Vapnik would argue, I assume, that we begin from given data not a given model.

]]>Fair enough re ‘even Andrew’.

RE the problem and coherence. I think it brings out how unrealistic the concept of ‘a single sample known to be drawn from a particular distribution’ is in practice (and perhaps in principle).

]]>I agree uncertainty among models brings us to some deep philosophical waters…another time. As to estimating mean of normal with one observation – I view that as a corner case for showing how sensitive inference is (and should be!) to prior/external information, when data are very limiting. I don’t think posterior distributions should be taken literally like some fundamentalists claim to read the Bible – rather, the inference is relative to the modeling assumptions…

]]>> OK, but even Andrew refuses to use probability to quantify the uncertainty _of the models themselves_.

“Even Andrew”? I’m not sure how to interpret that, in particular after your remarks a few weeks ago about his attitude to Bayesian foundations.

> how do you feel about quantifying the uncertainty in a model parameter such as the mean of a Normal distribution on the basis of a single sample (as in Andrew’s ‘new favourite example’)?

How should we feel? Two samples are one single sample after another, etc. If I have a coherent method to update the uncertainty with each single sample, I don’t see where the problem is.
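The “one single sample after another” point can be checked directly in the conjugate normal case: updating sequentially on two observations gives exactly the same posterior as updating once on both. (Observations and prior below are made up.)

```python
import numpy as np

def update(mu, tau2, y, sigma2=1.0):
    """One conjugate normal-mean update: prior N(mu, tau2), known obs variance."""
    prec = 1/tau2 + 1/sigma2
    post_var = 1/prec
    return post_var * (mu/tau2 + y/sigma2), post_var

# Hypothetical data: two observations, vague prior N(0, 100).
y1, y2 = 1.3, 0.7
mu, tau2 = 0.0, 100.0

# One sample after another...
m1, v1 = update(mu, tau2, y1)
m_seq, v_seq = update(m1, v1, y2)

# ...equals updating on both at once (sufficient statistic: the mean of the
# two observations, whose variance is sigma2/2 = 0.5).
m_batch, v_batch = update(mu, tau2, (y1 + y2) / 2, sigma2=0.5)

print(m_seq, m_batch)  # identical: coherent updating doesn't care about batching
```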

]]>BTW – my point about density-based inference was a mathematical one…I’m open to correction but I honestly have looked and haven’t yet seen a good Bayesian response, while I have seen a few people abandon Bayes on this basis…

Also, how do you feel about quantifying the uncertainty in a model parameter such as the mean of a Normal distribution on the basis of a single sample (as in Andrew’s ‘new favourite example’)?

]]>Chris, this has probably reached the point of not being productive any more but…

> ‘quantifying uncertainty in parameters of statistical models’

OK, but even Andrew refuses to use probability to quantify the uncertainty _of the models themselves_. Given this is usually a massive source of ‘uncertainty’, this is a good example of a case where the role probability is contested, even by Bayesians.

Furthermore, Neyman himself said, for his brand of Frequentist inference, that his aim was

“to construct a theory of mathematical statistics independent of the conception of likelihood…entirely based on the classical theory of probability”

So even what it means to ‘quantify uncertainty in parameters of statistical models’ using probability is somewhat ambiguous (or uncertain…).

]]>Yes I mean quantifying uncertainty in the parameters of statistical models. Contested, sure – e.g. Fisher ;) But it is clearly possible and successful in many applications.

]]>Corey:

I should say I’m talking about continuous and/or infinite-dimensional problems. I’m actually curious – I’ve only seen infinite-dimensional problems tackled via RN derivatives. Can you point me to a good reference tackling it in other ways?

Chris, re:

> the Bayesian premise – use of probability to quantify uncertainty – is not generally valid. That seems like a rather extraordinary claim in general.

It depends what ‘not generally valid’ means.

Probability can quantify _some_ types of uncertainty, sure, so I am certainly not saying it is invalid in all cases.

But to say that it quantifies _all_ types of uncertainty seems to me like an extraordinary claim, and one that has been hotly contested (successfully imo) by many people.

]]>ojm, clearly there is more at work here than either of us really are interested in dissecting at the moment. Whether formulated more precisely as probability measures (as Corey is saying) or summarized offhand as ‘density-based inference’ why is this especially problematic for you? It seems like you are saying that the Bayesian premise – use of probability to quantify uncertainty – is not generally valid. That seems like a rather extraordinary claim in general. But maybe I’m just not going to get it ;)

]]>ojm, Bayes’s theorem as usually stated is about probabilities of events; the density version of Bayes’s theorem can be derived from the event-space version if the Radon-Nikodym derivative of the probability measure exists. So if there’s a problem to be found in the Bayesian approach, I feel that it cannot be merely that it operates on densities, because it only does so if the Radon-Nikodym derivative exists. When the Radon-Nikodym derivative doesn’t exist (as is typically the case in Bayesian nonparametrics) it just means that the posterior probability measure can’t be defined via the density form of Bayes’s theorem, not that it can’t be defined at all.

]]>Point being, as far as I am aware Bayesians define Bayes’ theorem via densities (in infinite dimensions, Radon-Nikodym derivatives).

They then say ‘oh, I’m only interested in expectations over this density’. But either a) they assume ‘this density’ is something that actually exists, in which case ‘density-based inference’ seems reasonable, or b) they actually only want some sort of expectation and don’t care about different densities that give the same expectation. But since this doesn’t uniquely define what posterior they are actually interested in, the problem appears to me to be ill-defined. Why not tackle the problem more directly, without the Bayesian detour?

(Additional criteria like ‘typical set’ sampling are added without a Bayesian justification implicitly, in my view, to make the problem better posed).

]]>How do you define Bayes’ theorem?

Also, expectations are not sufficient to uniquely determine posteriors (definitely not given finite samples) hence the need for something like ‘typical set’ sampling, which as far as I can tell is not a Bayesian concept as such.
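To illustrate the non-uniqueness numerically: a standard Gaussian and a symmetric two-point distribution share the same mean and variance, yet as distributions they are as far apart as possible in total variation (one is continuous, the other discrete).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100000

# Two very different "posteriors" with identical mean (0) and variance (1):
gauss = rng.standard_normal(n)
two_point = rng.choice([-1.0, 1.0], size=n)   # mass 1/2 at -1 and +1

print(gauss.mean(), two_point.mean())   # both near 0
print(gauss.var(), two_point.var())     # both near 1
# Matching any finite list of expectations never pins down the distribution.
```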

]]>ojm, what do you mean by “density-based inference”? Computation of expectations wrt posterior distribution? If so, that’s a rather misleading formulation…

]]>>> OK, but then say what it is

> Machine learning?

So then https://en.wikipedia.org/wiki/Machine_learning ;-)

Maybe https://global.oup.com/ushe/product/discussion-of-the-method-9780195155990;jsessionid=68946F16BBF13953608DB8B898B78870?cc=&lang=en& when I have time!

]]>Correction – Billy Vaughn Koen

]]>> It’s just that this case is almost entirely as *an engineering exercise* rather than as a *scientific inquiry method*.

Agreed. Side point – I once read an interesting philosophical book on the engineering method called ‘Discussion of the Method’ by Billy Van Koen. Makes an interesting case for its universality, which is somewhat relevant to the general philosophical discussion here.

]]>> OK, but then say what it is

Machine learning?

> don’t see what’s not to _like_ about the fact that it could be recast as this Bayesian model which provides more ways to view something?

Because I suspect that this itself is just one limited way to view things, obscured by the idea that having a correspondence to some Bayesian model under some particular circumstances means it ‘really is’ Bayesian.

Maybe Bayes is a special case of machine learning, valid under certain special circumstances?

Speaking of the ‘safety’ issue, I don’t like Bayes’ over-reliance on simple parametric models and, even worse, density-based inference. This isn’t safe either*.

Procedures can of course be both mathematically and empirically studied. You don’t need to convert it to a Bayesian model to prove a theorem about an algorithm!

(*PS just switching to Gaussian processes, as seems to be the trend, doesn’t really suffice to address these concerns either, imo.)

]]>Daniel (re 11:30 am post): Good points

]]>Daniel: “area of study makes a big difference”

Agreed. I sometimes argue a randomized study would be hopeless if, for instance, the world changes faster than such studies could learn about it.

“special case or approximation of some general Bayesian technique helps us design better techniques”

That is the bigger point: it’s the ensemble of future predictors and predictions we want to be less wrong about.

“visual recognition algorithms in autonomous vehicles. “

But someone has to regulate those, approve them as being safe enough, and doing that with a black-box approach with unknowable failure behavior – that has some folks concerned.

]]>Keith: I think the area of study makes a big difference. For example, one area that machine learning has done well in is product recommendations. What is the utility of knowing *why* people at time t prefer headphones X vs headphones Y, particularly when both products’ entire lifecycles are about 8 months? Developing N causal models for N recommendation problems whose lifetime of utility is 3-12 months is pretty clearly a waste of time.

Another area where machine learning has done well is things like visual recognition algorithms in autonomous vehicles. In the end, I’d rather have lower observed risk of accidents in a wide variety of test conditions than understand *why* an algorithm is better able to discern the difference between a stop sign and a yield sign.

Sure, if understanding some ML technique as a special case or approximation of some general Bayesian technique helps us design better techniques, then fine. But I think there is a place for pure predictive, model-free inference. It’s just that this case is almost entirely as *an engineering exercise* rather than as a *scientific inquiry method*.

When your goal is “optimize some well defined engineering objective function in any way that works” then… that’s your goal and it makes sense to carefully use whatever tool gets you that result. (and I say carefully because you’d better be testing in a *wide* variety of real-world conditions)

On the other hand, if your objective is to understand fundamental aspects of the causal structure of a problem, such as *why* certain types of mutations cause certain types of cancer, then ML doesn’t address that question directly, and if you’re going to use it, it can be useful to know that it’s really a fast computational approximation of some general Bayes model or the like, and then try to extract the approximate underlying Bayesian inference.

]]>ojm: “I don’t like the ‘oh that’s just secretly Bayes’ or whatever attitude. Maybe it’s something else, ya know?”

OK, but then say what it is. Until you can do that, I don’t see what’s not to _like_ about the fact that it could be recast as this Bayesian model which provides more ways to view something?

On the whole, this discussion reminds me of the ongoing arguments between David Cox (models) and Brian Ripley (just prediction) when I was at Oxford. It was a year later, at John Nelder’s festschrift, that Ripley publicly admitted it had been hard for him to grasp the additional value of having a model – but now he did. Maybe he was wrong.

]]>I get the feeling that, by whatever criteria you are using, there are no sufficient, general frameworks :)

]]>Haha, well likelihood has its flaws too. I wouldn’t say likelihood or Bayes are sufficient frameworks.

But I have an example with an ODE model and real data that I’ve been meaning to look at properly at some point. Will let you know when I get around to it.

]]>ojm, I’m still waiting for you to write up a demonstration where profile likelihood is clearly better than Bayes…

]]>I agree that they _can_ be complementary but I think there is also a good case that modelling for prediction and modelling for explanation can be a tradeoff.

I’m often surprised when people justify their inferential parameter estimation methods on the basis of ‘well the ultimate goal is prediction’. If that really is the case then you can often bypass many steps of traditional parameter estimation.

So I tend to think both explanation and prediction are good but that they are not identical and can often conflict. Of course this gets into many philosophical issues about what exactly these terms mean, etc., but I’ve found the heuristic helpful.

]]>Ojm:

I read the paper you link to. I agree with them on the substance, but I disagree with the framing in which prediction is opposed to explanation. I think predictions work better in the context of good explanatory models, and that explanatory models are predictive.

]]>Related to present discussion and many themes of this blog, just saw this (haven’t read properly):

http://pilab.psy.utexas.edu/publications/Yarkoni_Westfall_PPS_in_press.pdf

‘Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning’

]]>Speaking of engineering, I think machine learning fits pretty seamlessly into an engineering curriculum. Traditional stats including Bayes less so. The good thing is these students also get a large amount of modelling too. So they can work well together.

]]>Hi Keith,

See above for a comment on the ‘oh that’s just Bayes’ approach. Not that it’s not a useful exercise; it’s just that I’m often interested in thinking through things from a different perspective. I think it’s that I don’t find Bayes to be the appropriate general framework so much anymore.

]]>Yeah I think we broadly agree too…but that’s no fun :-)

Generally speaking I think I advocate more ML for modelling folks and more modelling for ML folks. I haven’t really reached a consistent opinion! But I do try to be open to new ways of doing things. I don’t like the ‘oh that’s just secretly Bayes’ or whatever attitude. Maybe it’s something else, ya know?

]]>Thanks Keith. I don’t have time at the moment to get into these but I will have a gander tonight.

]]>I think we are in agreement on the broad strokes. If you think the only way to make accurate predictions (or inferences) is to have a model that exactly replicates the phenomenon under study then you’re unlikely to be successful; abstractions are an essential component of mathematical modeling.

I come from an engineering background so when I say theory I don’t necessarily mean that to imply an understanding of exactly what happened to produce the data (I also don’t not mean that either); engineers are supposed to be practical people after all! What I usually settle for is some moderate understanding of the mechanism that produced the data so that I may furnish reasonable lower/upper bounds on the parameters, predictions, etc. so that I can either a) constrain the model or b) know when it’s getting a bit wonky or c) look at changes in the real life phenomenon for which I collect data and have some understanding if the changes will have an impact on the model predictions that I should be concerned about (prior to it blowing up in my face).

I think Keith’s links down below touch on this.

]]>Allan: I agree with you more than ojm, but ojm has a point you may wish to more directly deal with.

ML encourages operational or algorithmic or abstract thinking, which is a different way of thinking than more ‘realist’ approaches – I think that’s true. The way I have seen that in my career is folks use/prefer favored algorithmic choices – absolute norm (Tibshirani), neural analogies (Hinton), biological analogies (whoever thought of dropout for deep learning), etc. They happen upon procedures that have very good properties (relative to other approaches) but with little sense of why, how to improve them, and when they will fail.

Then folks discern how these same things could be seen as (or close to) what data-generating models would set out – Laplace priors for the Lasso (me, in one of the first seminars Tibshirani gave on the Lasso), Gaussian processes for neural nets (Radford Neal), Bayesian priors for dropout (Yarin Gal 2016), etc.

So we have a larger community zig-zagging from happened-upon but not understood algorithms that work surprisingly well, to then discerning Bayesian models that would imply similar algorithms that are better understood, can be tweaked to be less wrong, and come with some sense of when they will fail.
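The Lasso/Laplace correspondence mentioned above is easy to verify numerically: with a Gaussian likelihood and independent Laplace priors on the coefficients, the negative log posterior is exactly the Lasso objective up to an additive constant. A sketch with made-up data and made-up hyperparameters:

```python
import numpy as np

# For y ~ N(X b, sigma^2 I) with independent Laplace(0, scale 1/lam) priors on b,
# -log posterior = ||y - X b||^2 / (2 sigma^2) + lam * sum|b_j| + const,
# i.e. the Lasso objective. Here sigma^2 = 1 so the penalties match directly.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)
sigma2, lam = 1.0, 0.7           # illustrative values

def neg_log_post(b):
    """Negative log posterior (dropping constants)."""
    return ((y - X @ b)**2).sum() / (2 * sigma2) + lam * np.abs(b).sum()

def lasso_obj(b):
    """Standard Lasso objective."""
    return 0.5 * ((y - X @ b)**2).sum() + lam * np.abs(b).sum()

b = rng.standard_normal(3)       # any coefficient vector
print(neg_log_post(b), lasso_obj(b))   # identical
```

So the MAP estimate under the Laplace prior is the Lasso solution, which is the sense in which the happened-upon algorithm acquires a model-based reading.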

My favorite talk that describes doing that is by Mike West https://www.youtube.com/watch?v=oBYoPtEHzTE&list=PL3T2Ppt4bgDJBiGZlan-qNY6PsLOGXdAB&index=3

A shorter talk would be this one on the History of Bayesian Neural Networks by Zoubin Ghahramani https://www.youtube.com/watch?v=FD8l2vPU5FY (first 10 minutes should do).

]]>…last follow up…

I think ML encourages operational or algorithmic or abstract thinking which is a different way of thinking than more ‘realist’ approaches.

While realist thinking can be useful I think it can also hinder people. Eg when learning mathematics you can worry about what something ‘really is’ (complex numbers, irrational numbers, infinity etc) or you can worry about what they ‘do’. The latter is usually easier to get to grips with than the former.

]]>And what do you mean by ‘theory’ exactly? Someone like Vapnik has plenty of theory, it’s just generally not in the form of ‘literal model of thing’.

I think ‘literalist’ modelling is a trap too many fall into. ‘More realistic’ or ‘more literal representation’ doesn’t always mean better ‘model’.

Often quite abstract ideas can be very powerful. ML doesn’t necessarily have less theory it just has less literalist modelling. I find being willing to consider non-literal models surprisingly helpful when dealing with many complex problems.

]]>Sure, though I’m increasingly skeptical that ‘the data generating mechanism’ is a meaningful phrase.

And prediction is hard, especially about the future and especially with bad data etc etc.

BTW I come from a theory-first mathematical modelling background so I like the complement of data-driven ML approaches that do one simple thing well and can be used alongside theory if you want.

I have to admit that I’m not sure I really trust ML folk who haven’t originally come from modelling backgrounds, even though those who do probably don’t need to explicitly use it anymore. Something about Wittgenstein’s ladder, I guess.

]]>ojm: “I wouldn’t really agree with this – I’d say it’s most suited to situations where you don’t have a good theory and hence are willing to replace theory with less well understood relationships.”

This makes good sense. However, that doesn’t preclude information about the data generating mechanism from being beneficial in both cases (ML and statistical).

The issue I have with the typical discussion of the bias-variance trade-off is that it is usually framed in the context of the difference between performance on the training/test sets. My point is that you could get good predictions on both the training and the test set but be entirely inaccurate on future predictions if the original sample was not representative of subsequent data. And without a theory of the data generating mechanism, that’s a hard thing to know when you do the original analysis.

If I understand your analogies correctly, you are using examples where ML algorithms are updated as more data becomes available and therefore the models become more accurate through time despite a lack of theoretical understanding of the data generating process. In other words, as the collected data becomes less sparse across some unknown parameter space, ML algorithms can more or less accurately tease out stable relationships. I agree with this but it again involves having a training set that is representative. Knowing when this is the case without a theory behind the data is pretty much impossible; which becomes problematic when you have to trust the predictions of the model.
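A toy sketch of the representativeness point (all numbers invented): a model that looks fine on a held-out test set drawn from the same mechanism can fail silently once the mechanism drifts, with no warning visible in the original train/test split.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, slope):
    """Simple linear mechanism with a tunable slope."""
    x = rng.uniform(0, 1, n)
    return x, slope * x + 0.1 * rng.standard_normal(n)

# Train and test drawn from the same mechanism: both look fine.
x_tr, y_tr = make_data(200, slope=2.0)
x_te, y_te = make_data(200, slope=2.0)
w = np.polyfit(x_tr, y_tr, 1)          # fit a line

# Later, the mechanism drifts (the slope changes); error jumps.
x_fu, y_fu = make_data(200, slope=3.0)

mse = lambda x, y: np.mean((np.polyval(w, x) - y)**2)
print(mse(x_te, y_te), mse(x_fu, y_fu))  # small vs. large
```

Without some theory of the mechanism, nothing in the original train/test comparison distinguishes this scenario from a stable one.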

]]>To mix the various analogies above, there should be sufficient data to reach near ‘equilibrium’ with the environment. If the environment changes, hopefully it can adapt fast enough, yet in a stable way (eg an ‘online’ learning scenario). Bias-variance and all that, really.

Typically you hope there is some lower-dimensional representation of higher dimensional data, you just don’t know or impose what it is. Instead you let it ‘emerge’ and, possibly, adapt if the data/scenario changes.

]]>> most comfortably be utilized in situations (similar to statistics) where there is an underlying theory of the data generating mechanism.

I wouldn’t really agree with this – I’d say it’s most suited to situations where you don’t have a good theory and hence are willing to replace theory with less well understood relationships.

> unless the dataset is truly close to a representative sample from some collective

This is probably more important.

In general, ML usually relies on having a well-defined _task_ and feedback mechanism for learning about this task. Practice (train) how you play (test) and all that.

Think about eg learning to recognise people in photos, control a robot to reach a goal, play video games etc. More like emergent understanding via evolutionary-style learning than a priori imposed rules. But sure, no feedback and it will drift off to who knows where.

]]>Anoneuoid, ojm,

It’s been my impression that ML techniques can most comfortably be utilized in situations (similar to statistics) where there is an underlying theory of the data generating mechanism. In the other circumstances, good performance on the training, validation, and test sets is as likely as not to be chasing noise (unless the dataset is truly close to a representative sample from some collective, which is usually far from the case, if only due to variance of the data generating mechanism through time).

Would you agree or am I missing some theoretical or practical reason why ML can overcome overfitting without substantive theory?
