Antti Rasinen writes:

I’m a former undergrad machine learning student and a current software engineer with a Bayesian hobby. Today my two worlds collided. I ask for some enlightenment.

On your blog you’ve repeatedly advocated continuous distributions with Bayesian models. Today I read this article by Ricky Ho, who writes:

The strength of Bayesian network is it is highly scalable and can learn incrementally because all we do is to count the observed variables and update the probability distribution table. Similar to Neural Network, Bayesian network expects all data to be binary, categorical variable will need to be transformed into multiple binary variable as described above. Numeric variable is generally not a good fit for Bayesian network.

The last sentence seems to be at odds with what you’ve said. Sadly, I don’t have enough expertise to say which view of the world is correct. During my undergrad years our team wrote an implementation of the Junction Tree algorithm. We really did not consider continuous variables at all. I know continuous distributions are fine with small hierarchical models, but…

How well do continuous distributions work with large graphs? Do you perhaps have a good reference to a well-known example of a large Bayesian network with several “numeric variables”?

My reply: The term “Bayesian network” is general and includes the possibility of continuous variables. Ho is not wrong, exactly, but he’s only talking about a subset of possible Bayesian models. I disagree with his recommendation to avoid continuous variables, but perhaps this is good advice for the particular software he is working with.

In answer to the question, I don’t have any experience with large problems. Here’s a small but difficult Bayesian analysis that involved many continuous parameters. Here’s an example with some discrete parameters (which are often thought of as latent data) and some continuous parameters.

Even models that seem completely discrete can have continuous parameters representing the probabilities.
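As a minimal sketch of that point, consider binary data whose success probability is itself a continuous unknown. The function names, the toy data, and the uniform Beta(1, 1) prior below are illustrative assumptions, not anything taken from the post or from Ho's article:

```python
def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Return (a, b) of the Beta posterior for a Bernoulli probability.

    The data are discrete counts, but the parameter being learned
    (the probability) lives on the continuous interval [0, 1].
    """
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical binary observations (1 = success, 0 = failure).
data = [1, 0, 1, 1, 0, 1, 1, 1]
successes = sum(data)
failures = len(data) - successes

a, b = beta_posterior(successes, failures)
posterior_mean = beta_mean(a, b)
```

Here the observed variable is binary, yet the posterior is a full continuous distribution over the probability rather than a single table entry.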

With regard to this part of a Bayesian analysis

posterior/prior = c * likelihood (or prior[data=data.o]/prior)

obviously data.o must be discrete (we never observe to an unlimited number of decimal places)

(S Stigler has a nice paper on how Galton realised the above)

Trying to condition on continuous outcomes perhaps does not even make sense.

But continuity is a wonderfully useful approximation and only really harmful if folks forget that it is an approximation (e.g. Borel paradox, McCullagh’s example of non-unique ancillary that Barnard objected to, etc.)
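One way to see how mild the approximation usually is: treat a recorded value as an interval of width equal to the recording precision, and compare the exact interval probability with density times width. This is only a sketch under assumed settings (a standard normal model, a value recorded to one decimal place); the helper names are made up for illustration:

```python
import math


def normal_cdf(x, mu, sigma):
    """Normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def normal_pdf(x, mu, sigma):
    """Normal probability density function."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def interval_likelihood(y_recorded, width, mu, sigma):
    """P(true value lies within width/2 of the recorded value)."""
    lo = y_recorded - width / 2.0
    hi = y_recorded + width / 2.0
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)


# Assumed example: the value 1.3 recorded to one decimal place.
y, width = 1.3, 0.1
exact = interval_likelihood(y, width, mu=0.0, sigma=1.0)
approx = normal_pdf(y, mu=0.0, sigma=1.0) * width  # the "c * likelihood" form
relative_error = abs(exact - approx) / exact
```

In this setting the constant c plays the role of the interval width, and the two quantities agree closely because the density barely changes over so narrow an interval.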

First, it’s helpful to realize that the things being called “Bayesian networks” here are not Bayesian. (They could be treated in a Bayesian way, but the description of how they are estimated leads me to think that’s not what Ho has in mind.)

Second, it’s clear that Ho has in mind a very restricted subset of “Bayesian networks”, with only discrete variables, which is indeed the setting for the usual junction tree algorithm. The further restriction to binary variables seems unmotivated, and saying that neural networks are restricted to binary variables is even more bizarre, since common multilayer perceptron (aka, “backprop”) networks are naturally defined with real inputs. (It’s true that non-binary categorical inputs might be transformed to a 1-of-n binary encoding, but there’s no reason to convert real inputs to binary.)
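The parenthetical about encodings can be sketched as follows. The function names, the category list, and the example row are hypothetical, chosen only to show a categorical input expanded to 1-of-n indicators while a real input is passed through untouched:

```python
def one_hot(value, categories):
    """1-of-n encoding: a list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def encode_row(color, temperature, colors=("red", "green", "blue")):
    """Build a network input vector from one categorical and one real input."""
    # The categorical input is expanded to binary indicators;
    # the real-valued input is kept as a single real number.
    return one_hot(color, colors) + [float(temperature)]


row = encode_row("green", 21.5)
```

Nothing in this construction calls for binarizing the temperature; it enters the input vector as a real number.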

Third, he must have in mind problems in which all the variables in the network are observed, since otherwise estimation cannot be done by simple counting.
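A rough sketch of what "simple counting" looks like when every variable is observed, assuming a two-node network with an invented pair of binary variables (the names `rain` and `wet` and the records are made up for illustration):

```python
from collections import Counter


def estimate_cpt(records, child, parent):
    """Estimate P(child | parent) by relative frequencies.

    This only works because every record contains an observed value
    for both variables; with hidden or missing values there would be
    nothing to count directly.
    """
    joint = Counter((r[parent], r[child]) for r in records)
    parent_totals = Counter(r[parent] for r in records)
    return {(p, c): n / parent_totals[p] for (p, c), n in joint.items()}


# Hypothetical fully observed data.
records = [
    {"rain": 1, "wet": 1},
    {"rain": 1, "wet": 1},
    {"rain": 1, "wet": 0},
    {"rain": 0, "wet": 0},
    {"rain": 0, "wet": 0},
    {"rain": 0, "wet": 1},
    {"rain": 0, "wet": 0},
]
cpt = estimate_cpt(records, child="wet", parent="rain")
```

The keys of `cpt` are (parent value, child value) pairs. If `rain` were a latent variable, the counts in `joint` could not be formed, and some other estimation method would be needed.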

I would not recommend trying to solve real problems, big or small, with this restricted mode of thinking.

When Mr Ho writes about Bayesian networks, he is referring to a specific set of methods that have little to do procedurally with what Prof Gelman calls Bayesian models. Bayesian networks in my limited experience refer to using a graph to find the class of conditional distributions and creating the appropriate contingency tables. There isn’t a likelihood, a prior or a Bayes theorem in sight.

People agree that connecting dots with arrows is a useful intellectual exercise, but then how one attaches numbers to the dots and arrows is contentious. You have to read very carefully whether one is using a Bayesian network, a Bayesian model, a causal model or something else, once a graph has appeared.