One way to make a possibly ill-defined problem better defined is to introduce a concrete procedure for computing what you mean.

E.g. MCMC for actually sampling ‘the’ posterior. But multiple posteriors can yield the same expectations, especially with finite computational power. Unfortunately, these can also be arbitrarily far apart in the strong topology.
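To make the ‘far apart in the strong topology’ point concrete, here is a minimal numpy sketch (my own toy example, not from the comment): two distributions that agree exactly on a given expectation yet sit at the maximum possible total-variation distance.

```python
import numpy as np

# Two discrete "posteriors" on {0, 1/2, 1} with identical means but
# total-variation distance 1: a point mass at 1/2 versus a 50/50
# mixture on {0, 1}.
support = np.array([0.0, 0.5, 1.0])
p = np.array([0.0, 1.0, 0.0])   # point mass at 1/2
q = np.array([0.5, 0.0, 0.5])   # uniform on {0, 1}

mean_p = np.dot(support, p)
mean_q = np.dot(support, q)
tv = 0.5 * np.abs(p - q).sum()

print(mean_p, mean_q)  # both 0.5: identical expectations of the identity
print(tv)              # 1.0: maximally far apart in total variation
```

Matching *all* expectations (of every bounded test function) would of course pin the measure down; the issue is that in practice only finitely many expectations are ever computed.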

So what are we actually doing? What is ‘the’ posterior target in general if we only care about expectations? Do we care about strong or weak topologies, etc.?

So I’m not fully convinced the answer to ‘what are we doing?’ is properly defined, though I’m open to being convinced. On the other hand, I see alternative, more direct methods which relate directly to the actual boring mathematics underlying, e.g., MCMC ‘technology’, and I wonder why we don’t just consider those the fundamental concepts instead. Others seem to have done so.

I agree with Daniel about the typical set. That’s just a property of distributions when the number of dimensions grows to infinity. Nothing has to be Bayesian about it, in the same way that there is nothing Bayesian about MCMC or other numerical integration schemes. But if you have a probability distribution you can use MCMC to calculate expectations, and if you have a very high-dimensional probability distribution you may see concentration of measure (I still don’t know what makes it interesting, though).
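For what it’s worth, the concentration phenomenon is easy to demonstrate numerically. A small sketch (my own illustration, assuming only numpy): samples from a d-dimensional standard Gaussian concentrate in a thin shell of radius about √d, far from the mode at the origin.

```python
import numpy as np

# Concentration of measure for a standard Gaussian in d dimensions:
# almost all samples lie in a thin shell of radius ~sqrt(d), even
# though the density is maximal at the origin.
rng = np.random.default_rng(0)
for d in (2, 200, 2000):
    x = rng.standard_normal((10_000, d))
    r = np.linalg.norm(x, axis=1)
    # mean radius / sqrt(d) -> 1, and the shell width stays O(1)
    print(d, round(r.mean() / np.sqrt(d), 3), round(r.std(), 3))
```

As d grows, the normalized mean radius approaches 1 while the standard deviation of the radius stays near 1/√2, which is the ‘typical set’ picture in miniature.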

ojm, my (possibly incorrect) understanding is that the infinite Gaussian mixture model doesn’t have a Radon-Nikodym derivative that’s useful for Bayesian inference. For example, I can’t spot a clear use of the density version of Bayes’s theorem in this paper. IIRC the papers I’ve read on it give algorithms for sampling from the posterior measure without defining a posterior density as such. A related example is this paper on the Mondrian process, in which a consistency property is used to give an algorithm for posterior sampling; I can’t spot a posterior density per se or a use of the density version of Bayes’s theorem.

On the other hand, focusing on large sample size for data collection, I think I agree with you. When you are in fact imposing sufficient random structure on the problem (i.e. using a computer RNG to sub-sample a finite, well-defined population of things, for example), then large sample size helps you. But when you’re working with non-randomly-selected samples you shouldn’t be focusing on CLT-type theorems, because the non-randomness of your sample violates those assumptions. Then you’re stuck using an appropriate Bayesian model that acknowledges this. For example, in polling Trump vs. Hillary there should have been a model structure that acknowledged an unknown bias in the polling and made some prior guesses about its possible size. Instead, relying on a CLT-type result, you get polls that decide that Hillary has a 95% chance of winning… If results had truly come from random samples with 100% compliance etc., it would have been true. But they weren’t.
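The polling point can be illustrated with a toy simulation (all numbers invented for illustration): a fixed selection bias of a few points overwhelms a CLT-style 95% interval as the sample grows, because the interval shrinks like 1/√n while the bias does not.

```python
import numpy as np

# Toy illustration: a poll of size n drawn from a population that
# leans 3 points away from the target population. The naive CLT
# interval's coverage collapses as n grows, no matter how large
# the sample gets.
rng = np.random.default_rng(1)
true_p, bias = 0.48, 0.03

for n in (100, 1_000, 100_000):
    hits = 0
    for _ in range(500):
        phat = rng.binomial(n, true_p + bias) / n
        half = 1.96 * np.sqrt(phat * (1 - phat) / n)   # CLT interval
        hits += (phat - half <= true_p <= phat + half)
    print(n, hits / 500)   # nominal 95% coverage; actual coverage -> 0
```

With n = 100 the bias hides inside the sampling noise and coverage is close to nominal; by n = 100,000 the interval essentially never contains the truth.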

When it comes to relying on mathematical probability while doing MCMC sampling, this is because Bayesian probability as a quantification of scientific uncertainty ends when you write down the distribution you want to use. In the same way, PDEs as a description of a scientific process involving momentum and energy and whatever end after you write down the equations and the boundary conditions. After that, mechanical calculation becomes the issue.

As far as mechanical calculation goes, in PDEs we have finite elements or finite volume or finite difference equations or whatever, and in Bayesian stats we have MCMC of some sort. Both are designed just as pure mechanical deduction… this equation and these boundary conditions imply this new state at this later time… or similarly, this density over the parameters implies this chain of samples converges to give the right expectations. Numerical analysis of difference equations isn’t scientific modeling of the behavior of physical objects, and MCMC sampling isn’t scientific analysis of the uncertainty in models and data… but both are techniques for calculation that serve their respective purposes.
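As a concrete instance of that ‘pure mechanical deduction’, here is a minimal random-walk Metropolis sampler (a generic textbook sketch, not anyone’s production code): given an unnormalized log-density, the chain mechanically produces samples whose averages converge to the right expectations.

```python
import numpy as np

# Minimal random-walk Metropolis: accept a proposed move with
# probability min(1, density_ratio); otherwise stay put. The recorded
# states give Monte Carlo estimates of expectations under the target.
def metropolis(log_density, x0, n_steps, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    lp = log_density(x)
    for _ in range(n_steps):
        prop = x + step * rng.standard_normal()
        lp_prop = log_density(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject
            x, lp = prop, lp_prop
        samples.append(x)
    return np.array(samples)

# Target: standard normal, specified only up to a constant.
draws = metropolis(lambda x: -0.5 * x * x, x0=0.0, n_steps=50_000)
print(draws.mean(), draws.var())   # approximately 0 and 1
```

Nothing in the loop ‘knows’ about uncertainty quantification; it is deduction from the stated density, exactly as a finite-difference scheme is deduction from the stated PDE.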

More generally, taking advantage of probabilistic convergence results with increasing sample size seems to me pretty frequentist, and/or ‘empirical inference’ in the style of Vapnik et al.

As far as I am aware, none of the convergence results, concentration-of-measure results, or whatever are very Bayesian in spirit. They introduce explicit sample sizes, asymptotics, etc., which seem to do most of the work and have nothing to do with logical or a priori probabilistic modelling (which, ironically, usually assumes an _exact_ model of uncertainty).

Why do we only compute expectations? Because in, say, 2000 dimensions, even just one sample on each side of the median in each dimension is 2^2000 samples.

Yet we can get good expectations with 20 or 100, or in extreme cases maybe 1000, effective samples.
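A quick numerical illustration of both halves of this claim (my own sketch): 2^2000 is astronomically large, yet a plain Monte Carlo average over 100 draws already pins down a simple expectation in 2000 dimensions.

```python
import numpy as np

d = 2000
# Covering just one point per orthant (one sample on each side of the
# median in every dimension) would need 2**d samples:
print(len(str(2**d)))   # number of *digits* in 2**2000: 603

# Yet a plain Monte Carlo estimate of an expectation, e.g.
# E[||x||^2 / d] = 1 for a standard normal in d dimensions, is
# already accurate with 100 samples:
rng = np.random.default_rng(0)
x = rng.standard_normal((100, d))
print((x ** 2).sum(axis=1).mean() / d)   # close to 1
```

The Monte Carlo error here scales like the standard deviation of the integrand divided by √(number of samples), independent of d covering requirements.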

Re: the problem and coherence. I think it brings out how unrealistic the concept of ‘a single sample known to be drawn from a particular distribution’ is in practice (and perhaps in principle).

“Even Andrew”? I’m not sure how to interpret that, in particular after your remarks a few weeks ago about his attitude to Bayesian foundations.

> how do you feel about quantifying the uncertainty in a model parameter such as the mean of a Normal distribution on the basis of a single sample (as in Andrew’s ‘new favourite example’)?

How should we feel? Two samples are one single sample after another, etc. If I have a coherent method to update the uncertainty with each single sample, I don’t see where the problem is.

Also, how do you feel about quantifying the uncertainty in a model parameter such as the mean of a Normal distribution on the basis of a single sample (as in Andrew’s ‘new favourite example’)?

> ‘quantifying uncertainty in parameters of statistical models’

OK, but even Andrew refuses to use probability to quantify the uncertainty _of the models themselves_. Given this is usually a massive source of ‘uncertainty’, it is a good example of a case where the role of probability is contested, even by Bayesians.

Furthermore, Neyman himself said, for his brand of Frequentist inference, that his aim was

“to construct a theory of mathematical statistics independent of the conception of likelihood…entirely based on the classical theory of probability”

So even what it means to ‘quantify uncertainty in parameters of statistical models’ using probability is somewhat ambiguous (or uncertain…).

I should say I’m talking about continuous and/or infinite-dimensional problems. I’m actually curious – I’ve only seen infinite-dimensional problems tackled via Radon-Nikodym derivatives. Can you point me to a good reference tackling them in other ways?

Chris, re:

> the Bayesian premise – use of probability to quantify uncertainty – is not generally valid. That seems like a rather extraordinary claim in general.

It depends what ‘not generally valid’ means.

Probability can quantify _some_ types of uncertainty, sure, so I am certainly not saying it is invalid in all cases.

But to say that it quantifies _all_ types of uncertainty seems to me like an extraordinary claim, and one that has been hotly contested (successfully imo) by many people.

They then say ‘oh, I’m only interested in expectations over this density’. But either (a) they assume ‘this density’ is something that actually exists, in which case ‘density-based inference’ seems reasonable, or (b) they actually only want some sort of expectation and don’t care about different densities that give the same expectation. But since the latter doesn’t uniquely define which posterior they are actually interested in, the problem appears to me to be ill-defined. Why not tackle the problem more directly, without the Bayesian detour?

(Additional criteria like ‘typical set’ sampling are added, implicitly and without Bayesian justification in my view, to make the problem better posed.)

Also, expectations are not sufficient to uniquely determine posteriors (definitely not given finite samples), hence the need for something like ‘typical set’ sampling, which as far as I can tell is not a Bayesian concept as such.

> Machine learning?

So then https://en.wikipedia.org/wiki/Machine_learning ;-)

Maybe https://global.oup.com/ushe/product/discussion-of-the-method-9780195155990 when I have time!

Agreed. Side point – I once read an interesting philosophical book on the engineering method called ‘Discussion of the Method’ by Billy Vaughn Koen. It makes an interesting case for the method’s universality, which is somewhat relevant to the general philosophical discussion here.

Machine learning?

> I don’t see what’s not to _like_ about the fact that it could be recast as this Bayesian model, which provides more ways to view something?

Because I suspect that this itself is just one limited way to view things, obscured by the idea that having a correspondence to some Bayesian model under some particular circumstances means it ‘really is’ Bayesian.

Maybe Bayes is a special case of machine learning, valid under certain special circumstances?

Speaking of the ‘safety’ issue, I don’t like Bayes’ over-reliance on simple parametric models and, even worse, density-based inference. This isn’t safe either.*

Procedures can of course be studied both mathematically and empirically. You don’t need to convert a procedure to a Bayesian model to prove a theorem about an algorithm!

(*PS: just switching to Gaussian processes, as seems to be the trend, doesn’t really suffice to address these concerns either, imo.)

Agreed. I sometimes argue a randomized study would be hopeless if, for instance, the world changes faster than such studies could learn about it.

“special case or approximation of some general Bayesian technique helps us design better techniques”

That is the bigger point: it’s the ensemble of future predictors and predictions we want to be less wrong about.

“visual recognition algorithms in autonomous vehicles”

But someone has to regulate those and approve them as being safe enough, and doing that with a black-box approach with unknowable failure behavior – that has some folks concerned.

Another area where machine learning has done well is things like visual recognition algorithms in autonomous vehicles. In the end, I’d rather have a lower observed risk of accidents in a wide variety of test conditions than understand *why* an algorithm is better able to discern the difference between a stop sign and a yield sign.

Sure, if understanding some ML technique as a special case or approximation of some general Bayesian technique helps us design better techniques, then fine. But I think there is a place for purely predictive, model-free inference. It’s just that its place is almost entirely as *an engineering exercise* rather than as a *scientific inquiry method*.

When your goal is "optimize some well-defined engineering objective function in any way that works", then… that’s your goal, and it makes sense to carefully use whatever tool gets you that result. (And I say carefully because you’d better be testing in a *wide* variety of real-world conditions.)

On the other hand, if your objective is to understand fundamental aspects of the causal structure of a problem, such as *why* certain types of mutations cause certain types of cancer, then ML doesn’t address that question directly. If you’re going to use it, it can be useful to know that it’s really a fast computational approximation of some general Bayes model or the like, and then try to extract the approximate underlying Bayesian inference.

OK, but then say what it is. Until you can do that, I don’t see what’s not to _like_ about the fact that it could be recast as this Bayesian model, which provides more ways to view something.

On the whole, this discussion reminds me of the ongoing arguments between David Cox (models) and Brian Ripley (just prediction) when I was at Oxford. It was a year later, at John Nelder’s festschrift, that Ripley publicly admitted it had been hard for him to grasp the additional value of having a model – but now he did. Maybe he was wrong.

But I have an example with an ODE model and real data that I’ve been meaning to look at properly at some point. Will let you know when I get around to it.

I’m often surprised when people justify their inferential parameter estimation methods on the basis of ‘well, the ultimate goal is prediction’. If that really is the case, then you can often bypass many steps of traditional parameter estimation.

So I tend to think both explanation and prediction are good, but they are not identical and can often conflict. Of course this gets into many philosophical issues about what exactly these terms mean, etc., but I’ve found the heuristic helpful.

I read the paper you link to. I agree with the authors on the substance, but I disagree with the framing in which prediction is opposed to explanation. I think predictions work better in the context of good explanatory models, and that explanatory models are predictive.

http://pilab.psy.utexas.edu/publications/Yarkoni_Westfall_PPS_in_press.pdf

‘Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning’

See above for a comment on the ‘oh, that’s just Bayes’ approach. It’s not that it’s not a useful exercise; it’s just that I’m often interested in thinking through things from a different perspective. I think it’s that I don’t find Bayes to be the appropriate general framework so much anymore.

Generally speaking, I think I advocate more ML for modelling folks and more modelling for ML folks. I haven’t really reached a consistent opinion! But I do try to be open to new ways of doing things. I don’t like the ‘oh, that’s just secretly Bayes’ or whatever attitude. Maybe it’s something else, ya know?

I come from an engineering background, so when I say theory I don’t necessarily mean an understanding of exactly what happened to produce the data (I also don’t not mean that); engineers are supposed to be practical people, after all! What I usually settle for is some moderate understanding of the mechanism that produced the data, so that I may furnish reasonable lower/upper bounds on the parameters, predictions, etc., and so that I can either a) constrain the model, or b) know when it’s getting a bit wonky, or c) look at changes in the real-life phenomenon for which I collect data and have some understanding of whether the changes will have an impact on the model predictions that I should be concerned about (prior to it blowing up in my face).

I think Keith’s links down below touch on this.

That ML encourages operational or algorithmic or abstract thinking, which is a different way of thinking than more ‘realist’ approaches, is I think true. The way I have seen that in my career is that folks use/prefer favored algorithmic choices – the absolute norm (Tibshirani), neural analogies (Hinton), biological analogies (whoever thought of dropout for deep learning), etc. They happen upon procedures that have very good properties (relative to other approaches) but with little sense of why, how to improve them, and when they will fail.

Then folks discern how these same things could be seen as (or close to) what data-generating models would set out – Laplace priors for the Lasso (me, in one of the first seminars Tibshirani gave on the Lasso), Gaussian processes for neural nets (Radford Neal), Bayesian priors for dropout (Yarin Gal, 2016), etc.

So we have a larger community zig-zagging from happened-upon but not understood algorithms that work surprisingly well, to discerning Bayesian models that would imply similar algorithms – algorithms that are better understood, can be tweaked to be less wrong, and come with some sense of when they will fail.
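The Laplace-prior/Lasso correspondence mentioned above can be checked in its simplest one-dimensional case (an illustrative sketch with invented numbers): the MAP estimate under a Laplace prior with a Gaussian likelihood coincides with the Lasso’s soft-thresholding rule.

```python
import numpy as np

# Simplest case of the Lasso <-> Laplace-prior correspondence:
# one observation y ~ N(beta, 1) with prior p(beta) ∝ exp(-lam*|beta|).
# Minimizing the negative log-posterior is exactly L1-penalized least
# squares, whose solution is soft thresholding.
def neg_log_posterior(beta, y, lam):
    return 0.5 * (y - beta) ** 2 + lam * np.abs(beta)

def soft_threshold(y, lam):
    return np.sign(y) * max(abs(y) - lam, 0.0)

y, lam = 2.5, 1.0
grid = np.linspace(-5, 5, 200_001)
map_est = grid[np.argmin(neg_log_posterior(grid, y, lam))]
print(map_est, soft_threshold(y, lam))   # both approximately 1.5
```

The grid minimizer of the negative log-posterior and the Lasso’s closed-form shrinkage rule agree, which is the one-coefficient version of the general correspondence.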

My favorite talk that describes doing that is by Mike West https://www.youtube.com/watch?v=oBYoPtEHzTE&list=PL3T2Ppt4bgDJBiGZlan-qNY6PsLOGXdAB&index=3

A shorter talk would be this one on the History of Bayesian Neural Networks by Zoubin Ghahramani https://www.youtube.com/watch?v=FD8l2vPU5FY (first 10 minutes should do).

I think ML encourages operational or algorithmic or abstract thinking, which is a different way of thinking than more ‘realist’ approaches.

While realist thinking can be useful, I think it can also hinder people. E.g. when learning mathematics you can worry about what something ‘really is’ (complex numbers, irrational numbers, infinity, etc.) or you can worry about what it ‘does’. The latter is usually easier to get to grips with than the former.

I think ‘literalist’ modelling is a trap too many fall into. ‘More realistic’ or ‘a more literal representation’ doesn’t always mean a better ‘model’.

Often quite abstract ideas can be very powerful. ML doesn’t necessarily have less theory; it just has less literalist modelling. I find being willing to consider non-literal models surprisingly helpful when dealing with many complex problems.

And prediction is hard, especially about the future, and especially with bad data, etc.

BTW I come from a theory-first mathematical modelling background so I like the complement of data-driven ML approaches that do one simple thing well and can be used alongside theory if you want.

I have to admit that I’m not sure I really trust ML folk who haven’t originally come from modelling backgrounds, even though those who have probably don’t need to use that background explicitly anymore. Something about Wittgenstein’s ladder, I guess.

This makes good sense. However, that doesn’t preclude information about the data-generating mechanism from being beneficial in both cases (ML and statistical).

The issue I have with the typical discussion of the bias-variance trade-off is that it is usually framed in terms of the difference between performance on the training and test sets. My point is that you could get good predictions on both the training and the test set but be entirely inaccurate on future predictions if the original sample was not representative of subsequent data. And without a theory of the data-generating mechanism, that’s a hard thing to know when you do the original analysis.

If I understand your analogies correctly, you are using examples where ML algorithms are updated as more data becomes available, and therefore the models become more accurate through time despite a lack of theoretical understanding of the data-generating process. In other words, as the collected data becomes less sparse across some unknown parameter space, ML algorithms can more or less accurately tease out stable relationships. I agree with this, but it again involves having a training set that is representative. Knowing when this is the case without a theory behind the data is pretty much impossible, which becomes problematic when you have to trust the predictions of the model.
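A toy sketch of the representativeness point (entirely invented data, a least-squares line fit with numpy only): a model does fine on a test set drawn from the same mechanism as the training set, but fails on ‘future’ data once the mechanism drifts.

```python
import numpy as np

# Train/test from the same mechanism look fine; future data from a
# drifted mechanism are not, and nothing in the train/test comparison
# warns you about it.
rng = np.random.default_rng(0)

def make_data(n, slope):
    x = rng.uniform(0, 1, n)
    return x, slope * x + 0.1 * rng.standard_normal(n)

x_tr, y_tr = make_data(200, slope=1.0)    # original sample
x_te, y_te = make_data(200, slope=1.0)    # test set: same mechanism
x_fu, y_fu = make_data(200, slope=2.0)    # 'future': mechanism drifted

b, a = np.polyfit(x_tr, y_tr, 1)          # fitted slope and intercept
mse = lambda x, y: np.mean((a + b * x - y) ** 2)
print(round(mse(x_te, y_te), 3), round(mse(x_fu, y_fu), 3))
# test error stays near the noise floor; 'future' error blows up
```

The held-out test error sits near the 0.01 noise variance, while the error on the drifted data is an order of magnitude larger, which is exactly the failure mode the comment describes.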

Typically you hope there is some lower-dimensional representation of the higher-dimensional data; you just don’t know or impose what it is. Instead you let it ‘emerge’ and, possibly, adapt if the data/scenario changes.

I wouldn’t really agree with this – I’d say it’s most suited to situations where you don’t have a good theory and hence are willing to replace theory with less well understood relationships.

> unless the dataset is truly close to a representative sample from some collective

This is probably more important.

In general, ML usually relies on having a well-defined _task_ and a feedback mechanism for learning about this task. Practice (train) how you play (test), and all that.

Think about, e.g., learning to recognise people in photos, controlling a robot to reach a goal, playing video games, etc. More like emergent understanding via evolutionary-style learning than a priori imposed rules. But sure, with no feedback it will drift off to who knows where.

It’s been my impression that ML techniques can most comfortably be utilized in situations (similar to statistics) where there is an underlying theory of the data-generating mechanism. In the other circumstances, good performance on the training, validation, and test sets is as likely as not to be chasing noise (unless the dataset is truly close to a representative sample from some collective, which is usually far from the case, if only due to variance of the data-generating mechanism through time).

Would you agree or am I missing some theoretical or practical reason why ML can overcome overfitting without substantive theory?
