I suppose the point of my responses has been: being able to define every N \to N function is not particularly relevant in the context of probabilistic programming. Probabilistic programming languages are about defining probabilistic models and so then it is sensible to ask: what is the appropriate notion of (Turing) completeness in this setting and what are the implications? This has been my research focus for the past 6 or so years. Emphatically, I’m not saying that the only interesting languages are those that can define any (computable) probabilistic model. Indeed, the runaway success of STAN is testament to that.

Now… you mentioned “generated block”. I’m not exactly sure what these are, but I’m definitely curious to learn more. Having just read over the STAN manual, it seems like these are bits of code run on the output of inference. E.g., one can use these to make predictions, conditional on the posterior. But it also seems that one could put the entire model into the generated block part by implementing a rejection sampler there with a while loop. Is that right? I.e., sample parameters from priors (written in the generated block as potentially recursive probabilistic procedures), then calculate the likelihood and accept or reject based on some appropriate test. Repeat if rejected, otherwise output these values.
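The rejection-sampler idea above can be sketched in Python (the prior, likelihood, and bound here are invented for illustration; this is not Stan code):

```python
import random

def rejection_sample(sample_prior, likelihood, max_lik):
    """Draw one posterior sample: propose from the prior, then
    accept with probability likelihood(theta) / max_lik; repeat if rejected."""
    while True:
        theta = sample_prior()
        if random.random() < likelihood(theta) / max_lik:
            return theta

# Toy model: bias theta ~ Uniform(0,1), data = 3 heads in 3 coin flips,
# so the likelihood is theta^3, maximized at theta = 1.
draws = [rejection_sample(random.random, lambda t: t ** 3, 1.0)
         for _ in range(10000)]
# The posterior is Beta(4,1), whose mean is 0.8.
print(sum(draws) / len(draws))
```

The while loop is the key point: whether the loop terminates depends on the random draws, which is exactly the kind of control flow the question above is about.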

If that’s possible within a generated block, then I would say, yes, this mechanism makes STAN complete in the sense I think is relevant. On the other hand, it seems that these generated blocks are just evaluated and that inference doesn’t take place within a generated block.

]]>In the usual textbook sense (say, as defined on the Wikipedia), C is Turing complete (any function computable in C is computable by Turing machine and vice-versa).

Stan can compute any function C can and can thus compute any function a Turing machine can. This is the sense in which I mean Stan is Turing complete.

We could literally implement Church inside the generated quantities block in a Stan program if we wanted to (in theory — it’d be a rather Herculean task in practice without defining intermediate languages).

To exclude Stan from the class of Turing-complete languages would require a redefinition of what “Turing complete” means to something other than the standard textbook definition (as represented in the Wikipedia entry). Hence my analogy to Pluto being defined out of planet status — all that was needed was a new definition of the word “planet”.

The model block of a Stan program defines a log density (over the variables defined in the parameters block), up to an additive constant, with support on $latex \mathbb{R}^N$. This gets combined with the Jacobians induced by the change of variables that transforms all constrained parameters to the unconstrained scale. And it does this in such a way that the log density is differentiable.
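As a sketch of what that transform-plus-Jacobian bookkeeping means (a minimal single-parameter example in Python, not Stan's actual internals): for a positive-constrained parameter sigma, one works with u = log(sigma), which ranges over all of R, and adds the log Jacobian log |d sigma / d u| = u so the density stays correctly normalized.

```python
import math

def log_density_constrained(sigma):
    # log p(sigma) for sigma ~ Exponential(1) on (0, inf), dropping constants
    return -sigma

def log_density_unconstrained(u):
    # sigma = exp(u) maps all of R onto (0, inf);
    # the change of variables adds log |d sigma / d u| = u
    return log_density_constrained(math.exp(u)) + u

# Sanity check: the adjusted density still integrates to 1 over all of R
# (crude left Riemann sum on a wide grid).
total = sum(math.exp(log_density_unconstrained(-20 + 0.01 * i)) * 0.01
            for i in range(4000))
print(total)  # approximately 1
```

Without the `+ u` term, the same integral would not come out to 1, which is why the Jacobian adjustment is not optional.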

This density can be a Bayesian posterior $latex p(\theta | y)$ defined by defining a joint density $latex p(\theta, y)$, which in turn can be factored into a likelihood (or sampling function) times prior $latex p(y | \theta) \, p(\theta)$, but it doesn’t have to be.

The transformed parameters and generated quantities blocks let you define quantities of interest using functions defined on random variables. This is the sense in which I mean that Stan is a probabilistic programming language. That, and the fact that we compute expectations of functions of parameters (estimates, event probabilities, decisions, predictions) conditioned on observed data. But that computation uses the density defined by the Stan program to sample. We can also optimize and do variational approximations.

The generated quantities block allows you to call pseudo-random number generators.

Stan has print() statements. Stan has resizable arrays (you need to use continuation passing style in Stan functions to define them, because Stan’s basic arrays are not directly mutable in size, but it’s theoretically possible). Stan has conditionals and general while loops. Stan lets you define functions recursively.
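A sketch of that continuation-passing trick, written in Python (Python lists are of course already resizable; the point is only to illustrate how recursion plus continuations can stand in for a mutable, growable array, as one would have to do in Stan):

```python
def collatz_cps(n, k):
    # k is the continuation: it receives the fully built trajectory.
    # No array is ever resized in place; each step wraps the previous
    # continuation so the array of unknown final length is built on return.
    if n == 1:
        return k([1])
    nxt = n // 2 if n % 2 == 0 else 3 * n + 1
    return collatz_cps(nxt, lambda rest: k([n] + rest))

traj = collatz_cps(6, lambda xs: xs)  # identity continuation at the top
print(traj)  # [6, 3, 10, 5, 16, 8, 4, 2, 1]
```

The trajectory length is not known in advance, which is exactly the situation where a fixed-size-array language needs this kind of encoding.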

The following is easy to make rigorous:

THEOREM: For any computable (in the Church-Turing sense) $latex f : \mathbb{N} \rightarrow \mathbb{N}$, it is possible to define a Stan program that compiles to a C++ program that prints $latex f(n)$ given input $latex n$ as data.

Of course, you get all the usual halting problem (undecidability) issues you get with a language that can define arbitrary recursively enumerable sets.

]]>IIUC, the purpose of STAN code is to calculate the log likelihood of the parameters given data. Because the set of functions computable by a Turing machine is the same as the set of functions computable by a probabilistic Turing machine (an old result due to de Leeuw, Moore, Shannon, and Shapiro in the 1950s), there is no gain in representational power by adding randomness. In contrast, a language built to represent distributions by way of allowing users to specify algorithms that generate samples requires randomness even to make sense.

As for consequences:

Perhaps counterintuitively, a rather simple sampling algorithm can have a rather complex log likelihood. Indeed, it is efficient to sample a sequence of n binary digits and return as output the cryptographic hash of those digits. The evaluation of the density of that hash is hard under standard cryptographic assumptions, though. (Perhaps the key idea of Church is that the underlying space on which the Markov chain acts is the space of program executions, and the probability density here is easy to calculate, though it is defined on an, a priori, unbounded space.)
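A Python sketch of that asymmetry (with SHA-256 standing in for a generic cryptographic hash):

```python
import hashlib
import random

def sample_hashed_bits(n):
    """Sampling is trivial: n fair coin flips pushed through a hash."""
    bits = bytes(random.getrandbits(1) for _ in range(n))  # one bit per byte
    return hashlib.sha256(bits).hexdigest()

draw = sample_hashed_bits(128)
print(draw)
# Evaluating the probability mass of `draw`, by contrast, would mean
# counting its preimages under SHA-256 restricted to the 2^128 possible
# inputs -- infeasible under standard cryptographic assumptions.
```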

Finally, I could define a probabilistic programming language that builds on C/C++ to allow users to construct finite Bayesian networks with conditional probability tables represented by arrays. The Turing-completeness of C would be useful for making these programs compact and easy to write, but the power of C would never let me escape the limitations of a finite discrete Bayesian network. STAN is obviously more powerful than this straw man language, but this example highlights that it may be useful to understand what distributions can be defined, rather than focusing solely on the language being used to specify the models.
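A Python sketch of that straw-man language's expressive power (the network and all numbers are invented): however convenient the host language makes it to write these tables down, every definable distribution lives on a fixed finite product space.

```python
import random

# Binary nodes Rain -> Wet <- Sprinkler, with CPTs as plain tables.
P_RAIN = 0.2
P_SPRINKLER = 0.3
P_WET = {(False, False): 0.01, (False, True): 0.90,
         (True, False): 0.80, (True, True): 0.99}

def ancestral_sample():
    rain = random.random() < P_RAIN
    sprinkler = random.random() < P_SPRINKLER
    wet = random.random() < P_WET[(rain, sprinkler)]
    return rain, sprinkler, wet

def joint_prob(rain, sprinkler, wet):
    p = (P_RAIN if rain else 1 - P_RAIN) \
        * (P_SPRINKLER if sprinkler else 1 - P_SPRINKLER)
    pw = P_WET[(rain, sprinkler)]
    return p * (pw if wet else 1 - pw)

# The whole distribution is just a table over 2^3 outcomes, nothing more.
total = sum(joint_prob(r, s, w)
            for r in (False, True)
            for s in (False, True)
            for w in (False, True))
print(total)  # sums to 1 over the 8 outcomes
```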

]]>The practical consequences are a whole different matter, and one where I think it’s really important to separate the representation of a distribution and the _efficient_ sampling from that representation. A definitive sampling algorithm is highly unlikely to be all that useful in practice, a fact with which many PPLs are struggling. We have taken a different approach, focusing on the subset of distributions representable by the Stan language that are amenable to high-performance sampling.

Let’s use more precise language so we don’t bicker about pedantry and can go back to bickering about practical consequences. :-D

]]>All these questions can be studied rigorously using the relatively mature framework of computable analysis, which extends Turing’s original work defining a computable real number. It now covers all areas of analysis, including measure and probability theory.

For more precise definitions on more general spaces see my work and references therein:

e.g., LICS 2011 paper on noncomputable conditional distributions: http://danroy.org/papers/AckFreRoy-LICS-2011.pdf

e.g., my thesis, which has a softer introduction: http://danroy.org/papers/Roy-PHD-2011.pdf

There are several equivalent notions of computability for probability measures on a complete separable metric space. The Naturals are even discrete and so every prob measure is dominated by counting measure and so things simplify, but you’d commit mistakes if you over-generalized to the more abstract setting.

Thm. Let mu be a probability measure on {1,2,…}. The following are equivalent.

1. The probability mass function x \mapsto mu{x} is computable (as a function from Naturals to Reals). More carefully, for every Natural x and nonnegative integer k, one can compute a rational q such that | mu{x} – q | &lt; 2^{-k}.

2. The cumulative distribution function x \mapsto mu{1, …, x} is computable as a function from Naturals to [0,1].

3. The measure A \mapsto mu(A) is computable, where sets A are represented by their characteristic functions 1_A : Naturals \to {0,1}.

4. There is a probabilistic Turing machine whose output, interpreted as an encoding of a Natural in some standard way, has the distribution mu.
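A concrete instance of the equivalence of (1) and (4), for the geometric measure mu{n} = 2^{-n} on {1, 2, …} (Python standing in for the probabilistic Turing machine):

```python
import random

def pmf(n):
    # mu{n} = 2^{-n}: clearly computable in the sense of (1)
    return 0.5 ** n

def sample():
    # A "probabilistic Turing machine" for mu: flip a fair coin until the
    # first heads; the number of flips has distribution mu.  It halts with
    # probability 1, but there is no finite bound on its running time.
    n = 1
    while random.getrandbits(1) == 0:
        n += 1
    return n

draws = [sample() for _ in range(100_000)]
print(draws.count(1) / len(draws))  # close to mu{1} = 0.5
```

The "halts with probability 1 but with no worst-case bound" behavior is a small taste of the subtleties that show up in the more general (non-discrete) setting.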

Happy to answer questions.

]]>Can you elaborate on what you mean by a computable distribution on the naturals? Do you mean that the program can _evaluate_ the density of the distribution with respect to the counting measure? Or something more elaborate like sampling from the distribution? And is this all conditioned on a given sample space, and hence a constant dimension?

Does your claim that most languages can’t handle all real functions just reduce to the fundamental limitations of floating-point arithmetic? If so, is Church et al. valid because of the availability of arbitrary-precision arithmetic?

Thanks!

]]>When I said Stan is Turing complete, what I meant is that our language for defining functions allows any computable function to be defined. Nothing at all to do with probabilities.

We don’t really care about the original notion of Turing completeness, but I’m more curious about what the definition of a computable density is.

I was mainly just trying to clarify that Stan is not like BUGS in the sense that it’s not just a language for defining (finite) directed acyclic graphical models, which is a common misconception among people talking about Stan because our simple models look just like BUGS but with type declarations (and semicolons).

]]>So… what is the statement? Once we have it reasonably pinned down, we can answer this issue definitively.

Warning: I’m pretty sure that once we nail this down, the STAN community is going to think: “That’s what it means to be Turing-complete? We don’t care about that.”

]]>I disagree. The tool, as I see it, would actually be quite simple, and powerful. Besides, one should not underestimate what kids can accomplish.

But don’t get me wrong, I am not a High School teacher. My point is that science, and causal inference in particular, is made unnecessarily complicated. You don’t need Stats 101-103 to do it. Using graphical tools like DAGs and a simple UI (though not a simplistic solution), even high school kids can address important problems and make important contributions. That is my belief. I draw inspiration from things like:

http://worrydream.com/#!/MediaForThinkingTheUnthinkable

http://worrydream.com/#!/LearnableProgramming

https://probmods.org/generative-models.html

One thing I liked about Harvard is that PhD courses were often open to all students, including some high school ones.

]]>Why would you need such a complicated and in-depth tool for High School students? I don’t see the necessity; I actually think that might even make things more complicated for them than you would want.

]]>@bob

I’m not making this ruckus to have some toy example javascript on a web browser. But I’m not going to spell it out. Just think DAGitty with a simulation and inference engine, one that knows about d-separation.

]]>Something simpler involving a few models and Javascript in the browser may be a better model than Stan for teaching.

DAGitty already does exactly that, right?

]]>At a basic level the U is what probabilistic programming languages like Church automate, with routines like flip. But in causal inference this is often not the contentious part, unless you want to run simulations or make inferences about unobserved variables. The acrimony is typically about theory and identification, none of which require statistics per se.

Moreover, DAGs are intrinsically heterogeneous so they include all possible interactions. The issue is how you want to estimate them.

So I don’t know about applied statisticians, but applied scientists can get very far with this framework, and do things in a simple fashion where many statisticians still struggle.

]]>Many problems that are considered statistical problems — from Simpson’s paradox, to missing data, etc — do not require any knowledge of probability. You can understand and address them using DAGs, and d-separation, period. Indeed you can even use DAGs + d-separation + data & visualization to make non-parametric inferences. At which point the main benefit of statistics is regularization for sparse strata (and, having gotten here I wonder whether regularization is statistics or a convenient kludge).

To wit, it is easy to show using this framework where and why MRP will go wrong, and how to test for it in specific applications before believing your results. Presumably this is of interest to statisticians. Unfortunately probability is useless in this regard bc it cannot differentiate causation from correlation. It is an ambiguous language for causal inference, an ambiguity Stan will inherit if you ignore DAGs.

]]>We’re going to be diving into tutorials, but we don’t have any plans for deterministic graphs or really for graphical models at all other than as a subset of more general models. After all, (smoking -> cancer) is hardly deterministic. We’re trying to make a tool for applied statisticians, not logic students.

]]>We agree. The thing is you come at Stan from the perspective of a tool to solve problems that previously were very hard. I come from the perspective that I would like to teach high school kids how to do science using something like DAGs + Stan. I think it could be very simple, kind of a Logo version if you will. So all I am saying is not to lose track of other great potential uses of the language, which could be as important.

BTW I have a side project looking at HIV transmission among high school kids in Western Kenya and was thinking of using an agent-based model. I am new to them but I think they can be very useful, and could usefully be coded in Stan I imagine. For example, Andrew’s 4-part model of toxicokinetics could be recast as an agent-based model where each organ is an agent.

Why do this? In the case of human behavior experts might have much better priors about the parameters governing human choices than a differential equation of disease transmission that results from their interaction. It can also simplify model specification, as simple behaviors can engender pretty complex phenomena. But again, I am new to this. Grateful for any pointers or suggestions.

]]>Yes, that is the difference between a (deterministic) causal model, and a probabilistic causal model. But you only need the probabilistic part for some purposes and not others. And you can get very far without it (e.g. you can understand identification, selection bias, missing data, instrumental variables, exclusion restrictions, etc. using graphs alone, and often much more easily and intuitively than with probabilistic models).

From a didactic perspective I think this is useful bc we need not scare people away with complex math from the get-go. They can get very far with much less machinery. (But I understand that you want to cater to people already advanced who want to do even more advanced models easily.)

BTW this is in no way to criticize what you are doing. I expect to only use Stan or something like it in the future. I just don’t think it does Stan justice to describe it as a language to specify a log posterior up to an additive constant even if indeed that is exactly what it does. I mean, is that the mission statement for the Stan team: To create a language for specifying a log posterior up to an additive constant?

]]>Our motivation in building Stan was that we had models we wanted to fit that other software couldn’t fit, or at least not as efficiently as we wanted. We of course still have a long way to go.

Some problems don’t have simple solutions. Or at least don’t currently have simple solutions. What’s considered “simple” changes over time as more people learn it. For instance, I now believe calculus, linear algebra, and computer programming are all considered simple in that we teach them to every engineering undergraduate, and teach the first and third to high school students.

As an example, we have been working on drug-disease models with Novartis for clinical trials. These models intrinsically involve differential equations to model drug diffusion and then relate them to observable effects, like visual acuity tests. There’s not a simple way to express such a model as a DAG. Yes, you can do it in PKBUGS, but not with the BUGS language for graphical modeling — you have to extend the notion to include nodes that are function definitions for the system of differential equations and then call an integrator to solve the differential equation. And then you need noise models for your noisy measurements to fit things like rate equations in the first place. Then you need to take into account that patients vary by sex and by age and by a host of other genetic and environmental factors that we don’t even have measurements for. And I almost forgot that we also want to take into account a meta-analysis of previous experiments, and have several drugs, dosages, and placebo controls.

It’s not that complicated, but it’s hardly smoking -> cancer.

]]>The directed acyclic graph structure is only part of the model. You also need the density or mass function for each node conditioned on its parent nodes. And then you usually have “plates,” which in theory can be blown out to simple DAGs. But this is very inefficient, as you can see by, say, trying to scale up an IRT model in both students and questions. BUGS and JAGS scale in memory with the number of edges, whereas Stan scales with the number of nodes.

]]>Some models can be easily expressed in a simple directed acyclic graph specification language. Any such model without discrete parameters can be directly translated line-by-line from BUGS/JAGS to Stan.

And while my original conception of Stan’s language was following BUGS, we all soon realized it was much more general. It’s really just a way to specify a log posterior. So you don’t have to specify the full joint probability in Stan, just something equal to the log posterior up to an additive constant. This opens up many more programming possibilities.

Another example is that we can do lots of imperative programming in the model, such as including conditionals that are quite a pain to express in BUGS even if it is technically possible in many cases.

There are also many researchers (mostly in AI-related fields as far as I can tell) using DAGs for causal inference and/or graph structure induction (I’m not sure how these are related, to be honest, despite being a professor in the same department as Clark Glymour and Peter Spirtes for almost a decade — I did logic and natural language semantics, not stats, back then).

]]>Hmm, what do you mean by non-parametric? Isn’t a multivariate Gaussian used practically always for continuous data? (Note: most of what I have read about graphical models doesn’t actually involve causality, only conditional dependencies. So the way I see DAGs is probably quite a bit different from what you actually mean in this case.)

]]>No, they are not. DAGs are non-parametric, and causal in nature. Hierarchical models need not be.

]]>I take it back. Sorry.

]]>Why aren’t DAGs/hierarchical models used more often relative to, say, trying to shoehorn every applied data analysis problem into combinations of t-tests, linear regressions, bootstraps, Fisher exact tests, and multiple comparison corrections? I would say this is an unfortunate consequence of 4-5 generations of failed statistical education.

]]>Now that my life is (in expectation) quite a bit more than half over, and I’ve had heart problems, I’m really starting to dislike that expression. No joke.

]]>My sense is they are becoming much more widely used, and their use will grow exponentially. But science advances one dead statistician at a time :-)

Graphical models more generally are very widely used in ML. Just open a recent textbook like Murphy.

]]>I guess it depends on what you want to do. For the sorts of problems I work on, it is computation that limits me from making my models as complicated as I’d like. For example, I do a lot of work on sample surveys. To generalize from sample to population requires a lot of struggles, as the saying goes. If we could easily fit models with deep interactions, that would help. And I think the same principles arise in causal inference. As for natural experiments, they’re fine as far as they go, but as we’ve seen on this blog, people often get a bit too excited by them. If sample size is small and variation is high, you’ll run into trouble no matter what.

]]>I disagree with you. The latest technology is not always the best solution. That is why we use Russian rockets to send satellites into space, or why the MiG fighter jet, as I understand it, was such good value for money.

You like regularization in statistics. I like regularization in research practice. Here is what I mean:

Despite a century of research we still don’t know why people get fat. In this time we have used increasingly complex statistical techniques, faster computers, etc., yet arguably 19th-century doctors had a better understanding of the underlying causes. There are many reasons for this, including many you discuss in this blog (incentives, publication bias, etc.)

But here is my hypothesis, and counterfactual scenario. Suppose the only stats we had ever taught researchers over the past century was how to compute a difference in means and a CI, and suppose instead we made them focus on theory, concepts, measures, measurement instruments, and research design. Then I bet we would have much better knowledge today about the causes of obesity. Admittedly this is just a hypothesis. But in my defense I would adduce that much of the “credibility revolution” in econometrics is a back-to-basics approach. Natural experiments try to recreate simple experiments, where often all you need is a simple difference in means.

Most people (myself included) don’t know how to handle complex methods, and are taught to believe that science is running a regression — the more complex the better. I believe this has it all upside down. You are building powerful stuff, but that does not immediately imply you will get powerful results (not you personally, who know how to use it, but broadly). In an ideal situation yes, in practice, I am not so sure. This is why, when Bob in the post asks for suggestions, I suggest he focus on usability and simplicity. But feel free to ignore me.

]]>You write, “If even Andrew has trouble understanding VB I think there is something wrong.” I disagree. You might as well argue that, given that almost none of us understand how a jet engine works (let alone the principles of heavier-than-air flight or the metallurgy needed to build an aluminum fuselage), that we should be going across the Atlantic in sailboats.

We use complicated methods such as HMC and EP and VB because we have problems to solve. We want to get to Europe in 7 hours rather than 7 weeks, as it were, and we’re willing to use the latest technology to do so. It is the nature of the latest technology that few people understand it.

]]>Smoking $latex \Rightarrow$ Cancer is a generative model in the sense that it captures what we think is the data generating process. But to get it to actually generate data you need to add a few random variables, like the flip routine in Church.
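For example (all probabilities below are invented purely for illustration), the arrow becomes a data-generating program in Python once each node gets a flip:

```python
import random

def flip(p):
    # the Church-style primitive: a biased coin
    return random.random() < p

def simulate_person():
    smoking = flip(0.25)                      # exogenous cause
    cancer = flip(0.15 if smoking else 0.02)  # effect depends on its parent
    return smoking, cancer

people = [simulate_person() for _ in range(50_000)]
p_cancer = sum(c for _, c in people) / len(people)
print(p_cancer)  # marginal P(cancer) = 0.25*0.15 + 0.75*0.02 = 0.0525
```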

You don’t need formal Bayes for any of this, though it can be very useful. But all I’m trying to say is the goal ought to be to keep things simple. If even Andrew has trouble understanding VB I think there is something wrong. I find simplicity a great virtue. What I’ve learned using DAGs is that things can be a lot simpler than using stats. You can search for Simpson’s paradox in this blog to see my point.

]]>You’re right about the subtlety — I didn’t understand the point about science vs. statistics in the first post or either of the second posts. Maybe I’m missing what you mean by “generative.”

I think of Stan as working in two stages — the language defines a log probability function, then you can use it for sampling or optimization. And we plan to add marginal maximum likelihood, EP, and VB over the next year or two, so there will be more options in terms of inference to plug in for the models defined in the language. But the language itself is agnostic as to how the log probability, data, and parameters defined by the model will be used.

Probabilistic programming languages like Figaro (object oriented) or Church (functional) don’t seem to derive from graphical model representation languages like BUGS, at least as far as I can tell.

]]>Stan is not based on R. There’s no way to access R functions, nor do we plan to. For any C++ function we use in Stan, it needs to be fully templated so that we can automatically differentiate it and every function used in the definition needs to be one that we define. So there’s no way we can roll in any R functions, much less all of them.

Is there something specific you’re looking for?

]]>probabilistic programming language = representation + generation

I think your quote lays more emphasis on the generation than the representation, while mine focuses on the representation. So to be exact maybe we should say: “STAN is a language for encoding your knowledge about a domain _and replicating its output_ using generative models”. After all, much of BDA is about y_rep. But again, the focus is on the use, not the stats under the hood.

]]>I hope that is not what STAN aims to be. IMHO STAN should aim to be something more high level and abstract. Personally, I would aim for: “STAN is a language for encoding your knowledge about a domain using generative models”.

The difference is subtle but critical. The latter focuses on usability not statistics. (Science $latex \neq$ statistics.)

]]>I find the Stan imperative approach very easy to think with. It has made simple some models that would require lots of data reformatting in BUGS/JAGS. Or maybe I’m just not fluent enough with the BUGS/JAGS declarative style.

For example, I’ve been working on a big hidden Markov model in Stan. There are hundreds of individual agents, each with multiple behavior sequences, and each sequence needs its own forward algorithm pass. With Stan, I could just feed in the data in its natural (grouped-by-sequence) form, and then use conditionals to detect when to restart the forward algorithm and update the log posterior.

It’s even more complicated, because the “emissions” in the model are latencies, and they are censored at start and end of each sequence. So the log-probability of each emission is a conditional likelihood depending upon position in the sequence. Again, it was easy to code this in Stan. It worked on the first try, actually.

I’m not sure how I’d do all of this in BUGS/JAGS, to be honest. Maybe it’d be simple. But my imperative programmer mind (I learned C on a SPARCstation) got it to go right away with Stan. And the resulting model is very efficient.
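For anyone curious about the restart pattern described above, here is a stripped-down Python sketch (two hidden states, invented probabilities, and without the censoring of the real model): the forward recursion runs over one flat data stream and is re-initialized whenever the sequence id changes.

```python
import math

trans = [[0.9, 0.1], [0.2, 0.8]]  # trans[i][j] = P(next state j | state i)
init = [0.5, 0.5]                 # initial state distribution
emit = [[0.7, 0.3], [0.1, 0.9]]   # emit[state][observation]

obs = [0, 0, 1, 1, 1, 0]          # observations, fed in flat
seq = [1, 1, 1, 2, 2, 2]          # sequence id for each observation

log_lik = 0.0
alpha = None
for t in range(len(obs)):
    if t == 0 or seq[t] != seq[t - 1]:
        if alpha is not None:               # close out the previous sequence
            log_lik += math.log(sum(alpha))
        alpha = [init[k] * emit[k][obs[t]] for k in range(2)]  # restart
    else:
        alpha = [emit[k][obs[t]]
                 * sum(alpha[j] * trans[j][k] for j in range(2))
                 for k in range(2)]
log_lik += math.log(sum(alpha))             # close out the final sequence
print(log_lik)
```

In actual Stan this would be a loop in the model block with `target +=` (or `increment_log_prob` in older versions) playing the role of `log_lik +=`.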

So it may be that users who haven’t done a lot of imperative programming will prefer a JAGS-style declarative approach. But I find the Stan approach very cromulent.

]]>