Stan is Turing Complete. So what?

This post is by Bob Carpenter.

Stan is Turing complete!

There seems to be a persistent misconception that Stan isn’t Turing complete.1, 2 My guess is that it stems from Stan’s (not coincidental) superficial similarity to BUGS and JAGS, which provide directed graphical model specification languages. Stan’s Turing completeness follows from its support of array data structures, while loops, and conditionals.

What is a probabilistic programming language?

A free, early-access chapter of:

Avi Pfeffer. 2014 partial draft. Practical Probabilistic Programming. Manning Publications Co.

neatly sums up (on p. 9) the emerging definition of “probabilistic programming language”:

Representation language: a language for encoding your knowledge about a domain in a model

Probabilistic Programming Language: A probabilistic representation language that uses a Turing-complete programming language to representation [sic] knowledge.

The next paragraph continues with

Although probabilistic representation languages with programming language features have been developed that are not Turing-complete, like BUGS and Stan, this book focuses on Turing-complete languages.

And yes, I’ve let the author know Stan’s Turing complete on the Manning feedback forum, and he’s already responded that he’ll fix the next draft.

I’m also curious why Frank’s tutorial on probabilistic programming lists Stan on page 7 under the heading “Application Driven” rather than the heading “Probabilistic Programming Language”.

Stan’s really an imperative language for specifying a log posterior (or penalized likelihood function if you want to roll that way) up to an additive constant. Bayes’s rule shows that it’s sufficient to specify the log joint up to a constant. We then translate to a C++ class that reads in data in the constructor and provides the log posterior as a function of the parameters as a method in such a way that it can be differentiated at first (or higher) order. We then use this class for inference with MCMC or for point estimation with L-BFGS or for letting users explore their own inference algorithms.
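The “up to an additive constant” point is easy to see in code. Here is a minimal sketch (plain Python, not Stan’s actual machinery): a Metropolis sampler only ever looks at differences of log densities, so any constant added to the log posterior cancels in the accept ratio.

```python
import math
import random

def log_posterior(theta, const=0.0):
    # unnormalized log density of a standard normal, plus an
    # arbitrary additive constant that inference never needs
    return -0.5 * theta * theta + const

def metropolis(logp, n_iter, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    draws = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, 1.0)
        # only the *difference* of log densities appears here,
        # so an additive constant in logp cancels
        if math.log(rng.random()) < logp(proposal) - logp(theta):
            theta = proposal
        draws.append(theta)
    return draws

a = metropolis(lambda t: log_posterior(t, const=0.0), 5000)
b = metropolis(lambda t: log_posterior(t, const=123.4), 5000)
assert a == b  # same seed, same chain: the constant is irrelevant
```

The same cancellation is why HMC only needs the gradient of the log posterior, which is also unaffected by an additive constant.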

If a taxonomist forced me to split, I’d say Stan is an imperative probabilistic programming language; Pfeffer’s language, Figaro, is an object-oriented probabilistic programming language; and Church is a functional probabilistic programming language.

Expressive in practice or in theory?

I’d also like to discuss the following claim, on page 9 of Pfeffer’s pre-release book version, and echoed in every other intro to probabilistic programming I’ve seen:

Naturally, representation languages with greater expressive power will be able to capture more knowledge, making them more useful.

The true measure of usefulness for a user is how easy it is to express the class of models the user cares about and how efficiently, scalably, and robustly the software can fit the model. Assembly language is Turing-complete and can encode a pseudo-random number generator, but that doesn’t make it “able to capture more knowledge” than BUGS in any practical sense.

Mark Chu-Carroll has an insightful discussion on his Good Math/Bad Math blog of Turing equivalence and Turing completeness and what it’s good for.

Whither Modularity?

What we’re really after is modularity. Stan’s going to add user-defined functions in version 2.3.0, which will be handy. For one thing, that should make it clear that Stan’s Turing equivalent. But functions are not going to solve our modularity problem.

By modularity, I mean the ability to express a Bayesian “submodel” directly in the language. For instance, I might drop in a hierarchical logistic regression module with a configurable set of hyperpriors and then re-use it in many models. But I can’t do that in Stan as it will be in version 2.3.0, because that would require somehow exporting variables for parameters and data from the function to the global environment so that our inference algorithms, like HMC and L-BFGS, can see them jointly with all other parameters. Maybe in Stan 3.
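To make the modularity point concrete, here is a sketch in Python of what a reusable submodel would have to do: contribute both its own parameters and a log-density term to the joint model, so that the inference algorithm can see all parameters at once. All names here are hypothetical; nothing like this exists in Stan.

```python
def hier_submodel(prefix):
    """A reusable 'submodel' (hypothetical sketch, not a Stan feature):
    declares its own parameters and returns a function giving its
    log-density contribution. Stan functions can't do this because
    they can't export parameters to the global model environment."""
    params = [prefix + ".mu", prefix + ".log_sigma"]

    def log_prior(values):
        mu = values[prefix + ".mu"]
        log_sigma = values[prefix + ".log_sigma"]
        # e.g. normal(0, 5) prior on mu, normal(0, 1) on log_sigma
        return -0.5 * (mu / 5.0) ** 2 - 0.5 * log_sigma ** 2

    return params, log_prior

# reuse the module twice in one model; the joint log posterior is the
# sum of the pieces, and HMC would need *all* the parameters jointly
params_a, logp_a = hier_submodel("groupA")
params_b, logp_b = hier_submodel("groupB")
all_params = params_a + params_b

values = {p: 0.1 for p in all_params}
joint = logp_a(values) + logp_b(values)
```

The hard part in Stan is the middle step: a function can return a log-density value, but it cannot declare new parameters visible to the sampler.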

Suggestions welcome, as always!


1 Informally speaking, a computable system of defining functions is Turing complete if it can compute any computable function. All of the standard programming languages (C, Java, Python, R, Lisp, etc.) are Turing complete, as are Turing machines, general recursive functions, the untyped lambda calculus, and first-order logic (the connection to Gödel’s incompleteness theorems runs through first-order logic).

2 Technically speaking, no modern computer is Turing complete because there is a finite physical bound on available storage, whereas the full definition requires unbounded storage. So we squint a little and pretend we have infinite amounts of storage.

50 thoughts on “Stan is Turing Complete. So what?”

  1. I’ve been assuming that Stan is Turing complete, so nice to hear confirmation that it is.

    I find the Stan imperative approach is very easy to think with. It has made simple some models that would require lots of data reformatting in BUGS/JAGS. Or maybe I’m just not fluent enough with the BUGS/JAGS declarative style.

    For example, I’ve been working on a big hidden Markov model in Stan. There are hundreds of individual agents, each with multiple behavior sequences, and each sequence needs its own forward algorithm pass. With Stan, I could just feed in the data in its natural (grouped by sequence) form, and then use conditionals to detect when to restart the forward algorithm and update the log posterior.

    It’s even more complicated, because the “emissions” in the model are latencies, and they are censored at start and end of each sequence. So the log-probability of each emission is a conditional likelihood depending upon position in the sequence. Again, it was easy to code this in Stan. It worked on the first try, actually.
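The bookkeeping described above — one forward pass per sequence, restarted whenever the sequence id changes — can be sketched in Python (a toy 2-state HMM with made-up parameters, nothing like the actual model):

```python
import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_logprob(obs, seq_id, log_init, log_trans, log_emit):
    """Log-likelihood of observations grouped by sequence, restarting
    the forward recursion whenever the sequence id changes."""
    total = 0.0
    alpha = None
    prev_id = None
    K = len(log_init)
    for y, s in zip(obs, seq_id):
        if s != prev_id:                      # new sequence: flush and restart
            if alpha is not None:
                total += log_sum_exp(alpha)
            alpha = [log_init[k] + log_emit[k][y] for k in range(K)]
            prev_id = s
        else:                                 # ordinary forward recursion step
            alpha = [log_sum_exp([alpha[j] + log_trans[j][k]
                                  for j in range(K)]) + log_emit[k][y]
                     for k in range(K)]
    if alpha is not None:
        total += log_sum_exp(alpha)
    return total

# toy 2-state HMM with binary emissions, two concatenated sequences
log_init = [math.log(0.5)] * 2
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
log_emit = [[math.log(0.8), math.log(0.2)],   # state 0: p(y=0) = 0.8
            [math.log(0.3), math.log(0.7)]]   # state 1: p(y=0) = 0.3
ll = forward_logprob([0, 0, 1, 1, 0], [1, 1, 1, 2, 2],
                     log_init, log_trans, log_emit)
```

In Stan the same loop would increment the log posterior directly; the point is that the restart logic is just a conditional on the grouping variable.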

    I’m not sure how I’d do all of this in BUGS/JAGS, to be honest. Maybe it’d be simple. But my imperative programmer mind (I learned C on a SPARCstation) got it to go right away with Stan. And the resulting model is very efficient.

    So it may be that users who haven’t done a lot of imperative programming will prefer a JAGS-style declarative approach. But I find the Stan approach very cromulent.

    • They’ll let you define functions in a C++ or C-like style that can be used as statements or expressions in the language.

      Stan is not based on R. There’s no way to access R functions, nor do we plan to. For any C++ function we use in Stan, it needs to be fully templated so that we can automatically differentiate it and every function used in the definition needs to be one that we define. So there’s no way we can roll in any R functions, much less all of them.

      Is there something specific you’re looking for?

  2. BOB: “Stan’s really an imperative language for specifying a log posterior (or penalized likelihood function if you want to roll that way) up to an additive constant”

    I hope that is not what STAN aims to be. IMHO STAN should aim to be something more high level and abstract. Personally, I would aim for: “STAN is a language for encoding your knowledge about a domain using generative models”.

    The difference is subtle but critical. The latter focuses on usability not statistics. (Science ≠ statistics.)

    • PS DAGs also meet the criteria of being “a language for encoding your knowledge about a domain using generative models” except they do not generate anything by themselves. Even so, you can still answer a lot of causal questions with them. In this sense we have:

      probabilistic programming language = representation + generation

      I think your quote lays more emphasis on the generation than the representation, while mine focuses on the representation. So to be exact maybe we should say: “STAN is a language for encoding your knowledge about a domain _and replicating its output_ using generative models”. After all, much of BDA is about y_rep. But again, the focus is on the use, not the stats under the hood.

        • @Rahul

          My sense is they are becoming much more widely used, and their use will grow exponentially. But science advances one dead statistician at a time :-)

          Graphical models more generally are very widely used in ML. Just open a recent textbook like Murphy.

        • Fernando:

          Now that my life is (in expectation) quite a bit more than half over, and I’ve had heart problems, I’m really starting to dislike that expression. No joke.

        • Hierarchical/multilevel models are basically DAGS. Even Andrew/Bob often refer to Stan as “A (Bayesian) Directed Graphical Model Compiler”

          Why aren’t DAGs/hierarchical models used more often relative to say, trying to shoehorn every applied data analysis problem into combinations of t-tests, linear regressions, bootstraps, fisher exact tests, and multiple comparison corrections? I would say this is an unfortunate consequence of 4-5 generations of failed statistical education.

        • Anon: “Hierarchical/multilevel models are basically DAGS.”

          No they are not. DAGs are non parametric, and causal in nature. Hierarchical models need not be.

        • “No they are not. DAGs are non parametric, and causal in nature. Hierarchical models need not be.”

          Hmm, what do you mean by non parametric? Isn’t the multivariate Gaussian used practically always for continuous data? (Note: Most of what I have read about graphical models doesn’t actually involve causality, only conditional dependencies. So the way I see DAGs is probably quite a bit different from what you actually mean in this case)

        • PS to be clear I am not arguing for DAGs and against hierarchical models. On the contrary, I think their combination is awesome. But they are not the same thing. I think the former are better for identification, and the latter for estimation.

        • I’d like to make that past tense! I did a presentation at the ML Meetup in NY before Stan was fully implemented and before we fully realized how general the language was ourselves. We try not to refer to directed graphs at all any more, because they’re not actually intrinsic to Stan’s modeling languages. Now all we need to do is expunge it from our C++ namespaces, which still have the model parser and code generator in stan::gm.

          The directed acyclic graph structure is only part of the model. You also need the density or mass function for each node conditioned on its parent nodes. And then you usually have “plates,” which in theory can be blown out to simple DAGs. But this is very inefficient, as you can see by, say, trying to scale up an IRT model in both students and questions. BUGS and JAGS scale up in memory with the number of edges, whereas Stan scales with the number of nodes.

        • Bob: “The directed acyclic graph structure is only part of the model. You also need the density or mass function for each node conditioned on its parent nodes.”

          Yes, that is the difference between a (deterministic) causal model and a probabilistic causal model. But you only need the probabilistic part for some purposes and not others. And you can get very far without it (e.g. you can understand identification, selection bias, missing data, instrumental variables, exclusion restrictions, etc. using graphs alone, and often much more easily and intuitively than with probabilistic models).

          From a didactic perspective I think this is useful bc we need not scare people away with complex math, etc.. from the get go. They can get very far with much less machinery. (But I understand that you want to cater to people already advanced who want to do even more advanced models easily.)

          BTW this is in no way to criticize what you are doing. I expect to only use Stan or something like it in the future. I just don’t think it does Stan justice to describe it as a language to specify a log posterior up to an additive constant even if indeed that is exactly what it does. I mean, is that the mission statement for the Stan team: To create a language for specifying a log posterior up to an additive constant?

        • One of our main goals for the Stan language is to provide an easy and expressive and efficient way of specifying the log posterior. As well as for generating derived quantities probabilistically. That’s not an end itself. We also want to provide estimation and inference and model checking over that.

          We’re going to be diving into tutorials, but we don’t have any plans for deterministic graphs or really for graphical models at all other than as a subset of more general models. After all, (smoking -> cancer) is hardly deterministic. We’re trying to make a tool for applied statisticians, not logic students.

        • Bob: “After all, (smoking -> cancer) is hardly deterministic. We’re trying to make a tool for applied statisticians, not logic students.”

          Many problems that are considered statistical problems — from Simpson’s paradox, to missing data, etc — do not require any knowledge of probability. You can understand and address them using DAGs, and d-separation, period. Indeed you can even use DAGs + d-separation + data & visualization to make non-parametric inferences. At which point the main benefit of statistics is regularization for sparse strata (and, having gotten here I wonder whether regularization is statistics or a convenient kludge).

          To wit, it is easy to show using this framework where and why MRP will go wrong, and how to test for it in specific applications before believing your results. Presumably this is of interest to statisticians. Unfortunately probability is useless in this regard bc it cannot differentiate causation from correlation. It is an ambiguous language for causal inference, an ambiguity Stan will inherit if you ignore DAGs.

        • PS: OK you need basic knowledge of probability distributions and conditional independence but that is about it.

        • smoking → cancer becomes probabilistic once you specify a distribution for the background causes U, as in smoking → cancer ← U.

          At a basic level, the U is what probabilistic programming languages like Church automate, with routines like flip. But in causal inference this is often not the contentious part, unless you want to run simulations or make inferences about unobserved variables. The acrimony is typically about theory and identification, none of which require statistics per se.
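Concretely, a Church-style flip for the background cause U is a one-liner. A Python sketch with made-up probabilities (0.3, 0.1, 0.2 are illustrative, not empirical):

```python
import random

def flip(rng, p):
    # Church's `flip`: a single Bernoulli(p) draw
    return rng.random() < p

def simulate(rng):
    # smoking -> cancer <- U: the unobserved background cause U is
    # what makes the deterministic arrow probabilistic
    smoking = flip(rng, 0.3)
    u = flip(rng, 0.1)
    cancer = (smoking and flip(rng, 0.2)) or u
    return smoking, cancer

rng = random.Random(1)
draws = [simulate(rng) for _ in range(10000)]
p_cancer_given_smoking = (
    sum(1 for s, c in draws if s and c) / sum(1 for s, _ in draws if s))
p_cancer_given_not = (
    sum(1 for s, c in draws if not s and c) / sum(1 for s, _ in draws if not s))
```

Running the generative program forward recovers the conditional probabilities (here about 0.28 vs. 0.10), which is the “generation” half of the representation + generation split above.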

          Moreover, DAGs are intrinsically heterogeneous so they include all possible interactions. The issue is how you want to estimate them.

          So I don’t know about applied statisticians, but applied scientists can get very far with this framework, and do things in a simple fashion where many statisticians still struggle.

        • What kind of use do you have in mind? Because my impression matches Fernando’s.

          Some models can be easily expressed in a simple directed acyclic graph specification language. Any such model without discrete parameters can be directly translated line-by-line from BUGS/JAGS to Stan.

          And while my original conception of Stan’s language was following BUGS, we all soon realized it was much more general. It’s really just a way to specify a log posterior. So you don’t have to specify the full joint probability in Stan, just something equal to the log posterior up to an additive constant. This opens up many more programming possibilities.

          Another example is that we can do lots of imperative programming in the model, such as including conditionals that are quite a pain to express in BUGS even if it is technically possible in many cases.

          There are also many researchers (mostly in AI-related fields as far as I can tell) using DAGS for causal inference and/or graph structure induction (I’m not sure how these are related, to be honest, despite being a professor in the same department as Clark Glymour and Peter Spirtes for almost a decade — I did logic and natural language semantics, not stats, back then).

    • PPS One should also be clearer about the meaning of “domain knowledge” and “generative models”. From the graphical roots of probabilistic programming languages I surmise these refer to causal knowledge and structural models. But they need not. They could refer to knowledge about correlations and predictive models.

    • It’s not so much what Stan aims to be as what Stan actually is.

      You’re right about the subtlety — I didn’t understand the point about science vs. statistics in the first post or either of the second posts. Maybe I’m missing what you mean by “generative.”

      I think of Stan as working in two stages — the language defines a log probability function, then you can use it for sampling or optimization. And we plan to add marginal maximum likelihood, EP, and VB over the next year or two, so there will be more options in terms of inference to plug in for the models defined in the language. But the language itself is agnostic as to how the log probability, data, and parameters defined by the model will be used.

      Probabilistic programming languages like Figaro (object oriented) or Church (functional) don’t seem to derive from graphical model representation languages like BUGS, at least as far as I can tell.

      • My understanding is that Church came up as a way of extending graphical models, which become ugly and unwieldy when you have too many nodes.

        Smoking ⇒ Cancer is a generative model in the sense that it captures what we think is the data generating process. But to get it to actually generate data you need to add a few random variables, like the flip routine in Church.

        You don’t need formal Bayes for any of this, though it can be very useful. But all I’m trying to say is the goal ought to be to keep things simple. If even Andrew has trouble understanding VB I think there is something wrong. I find simplicity a great virtue. What I’ve learned using DAGs is that things can be a lot simpler than using stats. You can search for Simpson’s paradox in this blog to see my point.

        • Fernando:

          You write, “If even Andrew has trouble understanding VB I think there is something wrong.” I disagree. You might as well argue that, given that almost none of us understand how a jet engine works (let alone the principles of heavier-than-air flight or the metallurgy needed to build an aluminum fuselage), that we should be going across the Atlantic in sailboats.

          We use complicated methods such as HMC and EP and VB because we have problems to solve. We want to get to Europe in 7 hours rather than 7 weeks, as it were, and we’re willing to use the latest technology to do so. It is the nature of the latest technology that few people understand it.

        • Andrew:

          I disagree with you. The latest technology is not always the best solution. That is why we use Russian rockets to send satellites into space, or why the MiG fighter jet, as I understand it, was such good value for money.

          You like regularization in statistics. I like regularization in research practice. Here is what I mean:

          Despite a century of research we still don’t know why people get fat. In this time we have used increasingly complex statistical techniques, faster computers, etc.. yet arguably 19th century doctors had a better understanding of the underlying causes. There are many reasons for this, including many you discuss in this blog (incentives, publication bias, etc.)

          But here is my hypothesis, and counterfactual scenario. Suppose the only stats we had ever taught researchers over the past century was how to compute a difference in means and a CI, and suppose we had instead made them focus on theory, concepts, measures, measurement instruments, and research design. Then I bet we would have much better knowledge today about the causes of obesity. Admittedly this is just a hypothesis. But in my defense I would adduce that much of the “credibility revolution” in econometrics is a back-to-basics approach. Natural experiments try to recreate simple experiments, where often all you need is a simple difference in means.

          Most people (myself included) don’t know how to handle complex methods, and are taught to believe that science is running a regression — the more complex the better. I believe this has it all upside down. You are building powerful stuff, but that does not immediately imply you will get powerful results (not you personally, who know how to use it, but broadly). In an ideal situation yes, in practice, I am not so sure. This is why, when Bob in the post asks for suggestions, I suggest he focus on usability and simplicity. But feel free to ignore me.

        • Fernando:

          I guess it depends on what you want to do. For the sorts of problems I work on, it is computation that limits me from making my models as complicated as I’d like. For example, I do a lot of work on sample surveys. To generalize from sample to population requires a lot of struggles, as the saying goes. If we could easily fit models with deep interactions, that would help. And I think the same principles arise in causal inference. As for natural experiments, they’re fine as far as they go, but as we’ve seen on this blog, people often get a bit too excited by them. If sample size is small and variation is high, you’ll run into trouble no matter what.

        • I agree that for some problems you really need a super computer, and the latest fancy software. I am just reacting to the idea that the more complex it sounds the better it is. That may be true for some applications but not all (perhaps not even many).

        • But we are trying to focus on usability and simplicity!

          Our motivation in building Stan was that we had models we wanted to fit that other software couldn’t fit, or at least not as efficiently as we wanted. We of course still have a long way to go.

          Some problems don’t have simple solutions. Or at least don’t currently have simple solutions. What’s considered “simple” changes over time as more people learn it. For instance, I now believe calculus, linear algebra, and computer programming are all considered simple in that we teach them to every engineering undergraduate, and teach the first and third to high school students.

          As an example, we have been working on drug-disease models with Novartis for clinical trials. These models intrinsically involve differential equations to model drug diffusion and then relate them to observable effects, like visual acuity tests. There’s not a simple way to express such a model as a DAG. Yes, you can do it in PKBUGS, but not with the BUGS language for graphical modeling — you have to extend the notion to include nodes that are function definitions for the system of differential equations and then call an integrator to solve the differential equation. And then you need noise models for your noisy measurements to fit things like rate equations in the first place. Then you need to take into account that patients vary by sex and by age and by a host of other genetic and environmental factors that we don’t even have measurements for. And I almost forgot that we also want to take into account a meta-analysis of previous experiments, and have several drugs, dosages, and placebo controls.

          It’s not that complicated, but it’s hardly smoking -> cancer.
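To give a flavor of the ODE part without any of the real complexity, here is a deliberately tiny Python sketch: a one-compartment elimination model, dC/dt = −k·C after a bolus dose, integrated with Euler steps. The parameters are hypothetical and this is nothing like the actual drug-disease model; in a full Bayesian treatment you would put a noise model on observed concentrations and priors on k.

```python
import math

def one_compartment(dose, k_elim, dt, t_max):
    """Euler integration of dC/dt = -k_elim * C with a bolus dose at t=0.
    A toy stand-in for the drug-diffusion ODEs described above."""
    c = dose
    trajectory = [c]
    for _ in range(int(t_max / dt)):
        c += dt * (-k_elim * c)   # one Euler step
        trajectory.append(c)
    return trajectory

traj = one_compartment(dose=100.0, k_elim=0.5, dt=0.001, t_max=4.0)
# the Euler solution approaches the analytic dose * exp(-k * t)
analytic = 100.0 * math.exp(-0.5 * 4.0)
```

The extension BUGS needed (and Stan needs too) is exactly this: a node that is a function definition for the ODE system plus a call to an integrator, embedded inside the noise model.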

        • Bob:

          We agree. The thing is you come at Stan from the perspective of a tool to solve problems that previously were very hard. I come from the perspective that I would like to teach high school kids how to do science using something like DAGs + Stan. I think it could be very simple, kind of a Logo version if you will. So all I am saying is not to lose track of other great potential uses of the language, which could be as important.

          BTW I have a side project looking at HIV transmission among high school kids in Western Kenya and was thinking of using an agent-based model. I am new to them but I think they can be very useful, and could usefully be coded in Stan I imagine. For example, Andrew’s 4-part model of toxicokinetics could be recast as an agent-based model where each organ is an agent.

          Why do this? In the case of human behavior experts might have much better priors about the parameters governing human choices than a differential equation of disease transmission that results from their interaction. It can also simplify model specification, as simple behaviors can engender pretty complex phenomena. But again, I am new to this. Grateful for any pointers or suggestions.

        • @Fernando: It may be possible to wrap Stan with something simpler that could be used for teaching. But given all of our other goals, this isn’t even a low-priority item for us. Something simpler involving a few models and Javascript on the browser may be a better model than Stan for teaching.

        • @Rahul
          I’m not making this ruckus to have some toy example javascript on a web browser. But I’m not going to spell it out. Just think dagitty with a simulation and inference engine, one that knows about d-separation.

        • Fernando:

          Why would you need such a complicated and in-depth tool for High School students? I don’t see the necessity, I actually think that might even make things more complicated for them than you would want.

        • Daniel:

          I disagree. The tool, as I see it, would actually be quite simple, and powerful. Besides one should not underestimate what kids can accomplish.

          But don’t get me wrong, I am not a High School teacher. My point is that science, and causal inference in particular, is made unnecessarily complicated. You don’t need Stats 101–103 to do it. Using graphical tools like DAGs and a simple UI (though not a simplistic solution) even high school kids can address important problems and make important contributions. That is my belief. I draw inspiration from things like:


          One thing I liked about Harvard is that PhD courses were often open to all students, including some high school ones.

  3. Pingback: The State of Probabilistic Programming « Some Thoughts on a Mysterious Universe

  5. Since I’m partially responsible for making claims in the past about Turing-completeness of probabilistic programming languages, I’m simply curious what is meant, in technical terms, by the statement “STAN is Turing complete.” I’m not that familiar with STAN. My recollection is that it focuses on continuous random variables, by which we actually mean distributions on R^n that are absolutely continuous w.r.t. Lebesgue measure on R^n and so admit a joint density f from R^n to R+. There are essentially no languages, with one exception in the theory community (called RealPCF++), that are capable of representing every computable joint density. Therefore, it cannot be that STAN is “Turing-complete” because it can represent every computable density function.

    So… what is the statement? Once we have it reasonably pinned down, we can answer this issue definitively.

    Warning: I’m pretty sure that once we nail this down, the STAN community is going to think…. “that’s what it meant to be Turing-complete? We don’t care about that.”

    • By no languages, I don’t mean no probabilistic programming languages. I mean there are essentially no languages such that every computable (or computable a.e.) function from R to R is representable by a term in the language. The exception, I believe, is RealPCF++.

    • That’s what I mean when I say this is just an issue of semantics. We can make Stan not Turing complete the way Pluto was excluded from planethood by redefining what “Turing complete” means.

      When I said Stan is Turing complete, what I meant is that our language for defining functions allows any computable function to be defined. Nothing at all to do with probabilities.

      We don’t really care about the original notion of Turing completeness, but I’m more curious about what the definition of a computable density is.

      I was mainly just trying to clarify that Stan is not like BUGS in the sense that it’s not just a language for defining (finite) directed acyclic graphical models, which is a common misconception among people talking about Stan because our simple models look just like BUGS but with type declarations (and semicolons).

      • I can probably accept some blame for introducing the term “Turing complete” in the context of probabilistic programming languages, when instead I should have called it something different. But what I originally meant was to say that a probabilistic programming language was Turing complete if every computable distribution on the Naturals was definable. This definition can be extended to the Reals if one fixes a concrete representation of the Reals as, say, rapidly-converging Cauchy sequences: i.e., functions f from the Naturals to the Rationals such that | f(n) – f(n+1) | < 2^{-n}. I believe Church, e.g., is complete in both the sense of Naturals and Reals represented in this way (or any Turing-equivalent way). Unfortunately, you cannot use Scheme’s or Scala’s or really any language’s built-in support for Real numbers to build a Turing-complete language in this sense, because these languages cannot define every computable real function. The main missing feature is that I cannot turn a rapidly-converging Cauchy sequence (i.e., the interface above) into a variable of type Real. Computable densities are simply those densities f where f is moreover (Turing) computable (see, e.g., Weihrauch 2001, or e.g., my thesis, or my work on computability and probability).
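Dan’s representation of a computable real — a function from Naturals to Rationals whose values converge rapidly — can be made concrete in a few lines of Python. This is just an illustration with the single real 1/3; the interesting point is that a program *is* the real number, and any desired precision can be read off it.

```python
from fractions import Fraction

def third(n):
    """The real number 1/3 in the Cauchy-sequence representation:
    a function from Naturals to Rationals with
    |third(n) - third(n+1)| < 2**-n. (Illustration only; any
    computable real admits such a program.)"""
    denom = 1 << (n + 2)                 # 2**(n+2)
    return Fraction(denom // 3, denom)   # within 2**-(n+2) of 1/3

def approximate(x, k):
    """Read off a rational within 2**-k of the real x from its
    Cauchy-sequence representation."""
    return x(k + 1)

q = approximate(third, 20)   # a rational within 2**-20 of 1/3
```

The “missing feature” Dan mentions is that ordinary languages give you no way to pass a function like `third` where a value of the built-in Real (floating point) type is expected.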

        • Dan,

          Can you elaborate on what you mean by a computable distribution on the naturals? Do you mean that the program can _evaluate_ the density of the distribution with respect to the counting measure? Or something more elaborate like sampling from the distribution? And is this all conditioned on a given sample space, and hence a constant dimension?

          Does your claim that most languages can’t handle all real functions just reduce to the fundamental limitations of floating point arithmetic? If so, is Church et al. valid because of the availability of arbitrary precision arithmetic?


        • Actually, being able to compute the density (w.r.t. counting measure) is equivalent to sampling exactly. I’m not sure what you mean by given a sample space: a distribution on the Naturals is specified by, in general, an infinite number of nonnegative reals that sum to one. Many of these can be represented by, e.g., a sampling algorithm (see more precise notions below) or, equivalently, a density evaluator. But, while Church’s support for real arithmetic is not good enough to represent every computable probability mass function, Church can represent every distribution by sampling. That said, if I create an extensional representation for reals (as, say, lazy sequences), then every language where such a structure can be built can probably define every computable density.

          All these questions can be studied rigorously using the relatively mature framework of computable analysis, which extends Turing’s original work defining a computable real number. It now covers all areas of analysis, including measure and probability theory.

          For more precise definitions on more general spaces, see my work and references therein: e.g., my LICS 2011 paper on noncomputable conditional distributions, or my thesis, which has a softer introduction.

          There are several equivalent notions of computability for probability measures on a complete separable metric space. The Naturals are even discrete and so every prob measure is dominated by counting measure and so things simplify, but you’d commit mistakes if you over-generalized to the more abstract setting.

          Thm. Let mu be a probability measure on {1,2,…}. The following are equivalent.
          1. The probability mass function x \mapsto mu{x} is computable (as a function from Naturals to Reals). More carefully, for every Natural x and nonnegative integer k, one can compute a rational q such that | mu{x} – q | < 2^{-k}.
          2. The distribution function x \mapsto mu({1, …, x}) is computable (as a function from Naturals to [0,1]).
          3. The measure A \mapsto mu(A) is computable, where sets A are represented by their characteristic functions 1_A : Naturals \to {0,1}.
          4. There is a probabilistic Turing machine whose output, interpreted as an encoding of a Natural in some standard way, has the distribution mu.
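One direction of the theorem (a computable pmf yields a sampling algorithm, i.e., 1 ⇒ 4) can be sketched concretely. The following Python sketch is my own illustration, assuming the pmf returns exact rationals; it samples by lazily refining a uniform draw via inversion:

```python
from fractions import Fraction
import random

def sample(pmf, rand_bit=lambda: random.getrandbits(1)):
    """Exact sampling from a pmf on the naturals via lazy inversion.

    pmf(n) must return the exact rational probability of n.  We draw a
    uniform u in [0,1) one random bit at a time, tracking the dyadic
    interval [lo, lo + width) that contains u, and output n as soon as
    that interval fits inside [F(n-1), F(n)), where F is the CDF.
    """
    lo, width = Fraction(0), Fraction(1)
    n, cdf, cdf_next = 0, Fraction(0), pmf(0)
    while True:
        if cdf <= lo and lo + width <= cdf_next:
            return n                      # u certainly lands on outcome n
        if lo >= cdf_next:                # u is certainly past outcome n
            n += 1
            cdf, cdf_next = cdf_next, cdf_next + pmf(n)
            continue
        width /= 2                        # otherwise refine u by one bit
        if rand_bit():
            lo += width

# Geometric(1/2) pmf with exact rationals: P(n) = 2^-(n+1).
geometric = lambda n: Fraction(1, 2 ** (n + 1))
```

With probability one, only finitely many bits are needed before the interval for u falls inside a single outcome's CDF cell, so the sampler halts almost surely.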

          Happy to answer questions.

        • So isn’t any Turing-complete language endowed with a simple RNG (say binary or U[0, 1]) then “probabilistically” Turing complete per your definition? If so, then the Stan _language_ would be probabilistically Turing complete in principle, as we claim. In any case, if claims are being made in reference to this probabilistic generalization of Turing completeness, then I think it’s fair to use a new name and not overload “Turing completeness” to everyone’s confusion.

          The practical consequences are a whole different matter, and one where I think it’s really important to separate the representation of a distribution and the _efficient_ sampling from that representation. A definitive sampling algorithm is highly unlikely to be all that useful in practice, a fact with which many PPLs are struggling. We have taken a different approach, focusing on the subset of distributions representable by the Stan language that are amenable to high-performance sampling.

          Let’s use more precise language so we don’t bicker about pedantry and can go back to bickering about practical consequences. :-D

        • I don’t think we’re bickering, though I’ve not read the comments above my post. :)

          IIUC, the purpose of STAN code is to calculate the log likelihood of the parameters given data. Because the set of functions computable by a Turing machine is the same as the set of functions computable by a probabilistic Turing machine (an old result due to de Leeuw, Moore, Shannon, et al. in the 1950s), there is no gain in representational power by adding randomness. In contrast, a language built to represent distributions by way of allowing users to specify algorithms that generate samples requires randomness even to make sense.

          As for consequences:

          Perhaps counterintuitively, a rather simple sampling algorithm can have a rather complex log likelihood. Indeed, it is efficient to sample a sequence of n binary digits and return as output the cryptographic hash of those digits. The evaluation of the density of that hash is hard under standard cryptographic assumptions, though. (Perhaps the key idea of Church is that the underlying space on which the Markov chain acts is the space of program executions, and the probability density here is easy to calculate, though it is defined on an, a priori, unbounded space.)
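To make the hash example concrete, here is a hedged Python sketch (the hash function and bit length are illustrative choices of mine, not anything specified above):

```python
import hashlib
import secrets

def sample_hashed(n_bits=128):
    """Sampling is easy: draw n_bits uniform random bits and return
    the SHA-256 hash of them as a hex string."""
    x = secrets.token_bytes(n_bits // 8)
    return hashlib.sha256(x).hexdigest()

# Evaluating the pmf of the output is another story: P(h) equals the
# number of n-bit preimages of h divided by 2^n, and counting preimages
# of a cryptographic hash is infeasible under standard assumptions.
```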

          Finally, I could define a probabilistic programming language that builds on C/C++ to allow users to construct finite Bayesian networks with conditional probability tables represented by arrays. The Turing-completeness of C would be useful for making these programs compact and easy to write, but the power of C would never let me escape the limitations of a finite discrete Bayesian network. STAN is obviously more powerful than this straw man language, but this example highlights that it may be useful to understand what distributions can be defined, rather than focusing solely on the language being used to specify the models.

        • @Daniel Roy:

          In the usual textbook sense (say, as defined on the Wikipedia), C is Turing complete (any function computable in C is computable by Turing machine and vice-versa).

          Stan can compute any function C can and can thus compute any function a Turing machine can. This is the sense in which I mean Stan is Turing complete.

          We could literally implement Church inside the generated quantities block in a Stan program if we wanted to (in theory — it’d be a rather Herculean task in practice without defining intermediate languages).

          To exclude Stan from the class of Turing-complete languages would require a redefinition of what “Turing complete” means to something other than the standard textbook definition (as represented in the Wikipedia entry). Hence my analogy to Pluto being defined out of planet status — all that was needed was a new definition of the word “planet”.

          The model block of a Stan program defines a log density (over the variables defined in the parameters block) up to an additive constant, with support on $latex \mathbb{R}^N$. This gets combined with the Jacobian adjustments induced by the change of variables that transforms all constrained parameters to the unconstrained scale. And it does all this in such a way that the log density is differentiable.

          This density can be a Bayesian posterior $latex p(\theta | y)$ defined by specifying a joint density $latex p(\theta, y)$, which in turn can be factored into a likelihood (or sampling function) times prior, $latex p(y | \theta) \, p(\theta)$, but it doesn’t have to be.
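As a hedged illustration of that factorization (a toy conjugate model of my own, not anything from the post): the log joint, and hence the log posterior up to an additive constant, is just log likelihood plus log prior:

```python
def log_posterior(theta, y, prior_sd=10.0):
    """Unnormalized log posterior, up to an additive constant, for the
    toy model y_i ~ Normal(theta, 1) with prior theta ~ Normal(0, prior_sd)."""
    log_lik = sum(-0.5 * (yi - theta) ** 2 for yi in y)   # log p(y|theta) + const
    log_prior = -0.5 * (theta / prior_sd) ** 2            # log p(theta) + const
    return log_lik + log_prior
```

Dropping the normalizing constants changes nothing for MCMC or optimization, which only need the density up to proportionality.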

          The transformed parameters and generated quantities blocks let you define quantities of interest using functions defined on random variables. This is the sense in which I mean that Stan is a probabilistic programming language. That, and that we compute expectations of functions of parameters (estimates, event probabilities, decisions, predictions) conditioned on observed data. But that computation uses the density defined by the Stan program to sample. We can also optimize and do variational approximations.

          The generated quantities block allows you to call pseudo-random number generators.

          Stan has print() statements. Stan has resizable arrays (you need to use continuation passing style in Stan functions to define them, because Stan’s basic arrays are not directly mutable in size, but it’s theoretically possible). Stan has conditionals and general while loops. Stan lets you define functions recursively.

          The following is easy to make rigorous:

          THEOREM: For any computable (in the Church-Turing sense) $latex f : \mathbb{N} \rightarrow \mathbb{N}$, it is possible to define a Stan program that compiles to a C++ program that prints $latex f(n)$ given input $latex n$ as data.

          Of course, you get all the usual halting problem (undecidability) issues you get with a language that can define arbitrary recursively enumerable sets.

        • I think we’ve exhausted the nested-levels here and so I’m responding to @Bob Carpenter.

          I suppose the point of my responses has been: being able to define every N \to N function is not particularly relevant in the context of probabilistic programming. Probabilistic programming languages are about defining probabilistic models and so then it is sensible to ask: what is the appropriate notion of (Turing) completeness in this setting and what are the implications? This has been my research focus for the past 6 or so years. Emphatically, I’m not saying that the only interesting languages are those that can define any (computable) probabilistic model. Indeed, the runaway success of STAN is testament to that.

          Now… you mentioned “generated block”. I’m not exactly sure what these are, but I’m definitely curious to learn more. Having just read over the STAN manual, it seems like these are bits of code run on the output of inference. E.g., one can use these to make predictions, conditional on the posterior. But it also seems that one could put the entire model into the generated block part by implementing a rejection sampler there with a while loop. Is that right? I.e., sample parameters from priors (written in the generated block as potentially recursive probabilistic procedures), then calculate the likelihood and accept or reject based on some appropriate test. Repeat if rejected, otherwise output these values.
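The rejection scheme described here is easy to sketch outside Stan. A hedged Python illustration (the model, names, and bound are my own toy choices, not from the thread):

```python
import random

def rejection_sample(sample_prior, likelihood, bound, data, rng=random.random):
    """Draw parameters from the prior; accept with probability
    likelihood / bound.  Accepted draws follow the posterior, provided
    bound >= likelihood everywhere.  Repeat until acceptance."""
    while True:
        theta = sample_prior()
        if rng() < likelihood(data, theta) / bound:
            return theta

# Toy model: uniform prior on theta, Bernoulli(theta) observations.
def bern_lik(data, theta):
    p = 1.0
    for y in data:
        p *= theta if y else 1.0 - theta
    return p
```

For data [1, 1, 0] the posterior is Beta(3, 2), so accepted draws average about 0.6; bound = 1 is valid here since a Bernoulli likelihood never exceeds 1. The acceptance rate shrinks rapidly with more data, which is why this is a thought experiment rather than a practical inference method.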

          If that’s possible within a generated block, then I would say, yes, this mechanism makes STAN complete in the sense I think is relevant. On the other hand, it seems that these generated blocks are just evaluated and that inference doesn’t take place within a generated block.

  5. Pingback: Penn and Teller (3) vs. Pele; Alan Turing advances « Statistical Modeling, Causal Inference, and Social Science
