“Which user?”: When chunking your math and your code, what seems thickerer depends on where you’re coming from.

One of my favorite joke scenarios is the one about the comedians who give a number to each joke. To me, what’s funny about the number thing is that it’s true—or, to be precise, it’s true that sometimes all you have to do is remember the label for some story, and that’s enough to make you laugh, without the need to play through the entire narrative.

I was thinking about this recently when working with Bob Carpenter on a simple Stan program.

Here’s what I started with:

data {
  int J;
  int T;
  int shock[J,T];
}
transformed data {
  int prev_shock[J,T];
  int prev_avoid[J,T];
  for (j in 1:J){
    prev_shock[j,1] = 0;
    prev_avoid[j,1] = 0;
    for (t in 2:T){
      prev_shock[j,t] = prev_shock[j,t-1] + shock[j,t-1];
      prev_avoid[j,t] = prev_avoid[j,t-1] + 1 - shock[j,t-1];
    }
  }
}
parameters {
  real<lower=0, upper=1> a;
  real<lower=0, upper=1> b;
}
model {
  real p[J,T];
  for (j in 1:J){
    for (t in 1:T){
      p[j,t] = a^prev_shock[j,t] * b^prev_avoid[j,t];
      shock[j,t] ~ bernoulli(p[j,t]);
    }
  }
}

And here’s Bob’s cleaned-up version:

data {
  int<lower=0> J;
  int<lower=0> T;
  array[J, T] int<lower=0, upper=1> shock;
}
transformed data {
  matrix<lower=0>[J, T] prev_shock;
  matrix<lower=0>[J, T] prev_avoid;
  for (j in 1:J) {
    row_vector[T - 1] shock_j = to_row_vector(shock[j, 1:T - 1]);
    prev_avoid[j] = cumulative_sum(append_col(0, 1 - shock_j));
    prev_shock[j] = cumulative_sum(append_col(0, shock_j));
  }
}
parameters {
  real<lower=0, upper=1> a;
  real<lower=0, upper=1> b;
}
transformed parameters {
  matrix<lower=0, upper=1>[J, T] p = a.^prev_shock .* b.^prev_avoid;
}
model {
  for (j in 1:J) {
    shock[j] ~ bernoulli(p[j]);
  }
}

Looking at the change in code, I see a few things:
1. Bob changed int shock[J,T]; to array[J,T] int shock;
2. Bob added bounds on the data and transformed data.
3. Bob changed my looping into various vector things such as append_col, to_row_vector, and cumulative_sum.
4. Bob did something where he referred to shock[j] and p[j] which are either rows or columns of matrices.

Just to go through these quickly:
1. This is a fix in the Stan code, as I was using an old formulation for arrays.
2. This is a matter of taste, but I guess Bob’s approach is better here. Including these bounds explicitly serves as an error check: for example, if you feed it shock data other than 0’s and 1’s, the program will now return an error.
3. I have mixed feelings on this bit. I assume this is more efficient, and to Bob it’s cleaner; to me it’s much more confusing! I guess that when writing a case study, it will make sense to explain the pluses and minuses of the two formulations. (See the little worked trace right after this list.)
4. I find this really confusing, the idea of using a single index to pull out one row or column of an array, so I’d much prefer to do this with two nested loops.
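
Regarding point 3, here’s a little worked trace with a made-up row of data, shock[j] = (0, 1, 1, 0) and T = 4, showing that Bob’s vectorized lines build up the same counts as my loop:

row_vector[T - 1] shock_j = to_row_vector(shock[j, 1:T - 1]);  // (0, 1, 1)
prev_shock[j] = cumulative_sum(append_col(0, shock_j));        // (0, 0, 1, 2)
prev_avoid[j] = cumulative_sum(append_col(0, 1 - shock_j));    // (0, 1, 1, 1)
// same as the loop version: starting from 0, the running count of previous
// shocks is (0, 0, 1, 2) and of previous avoidances is (0, 1, 1, 1)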

Bob responded to my third and fourth points by saying that, to him, the matrix version is clearer. I get that. I like the loops and explicit indexing, but I can see these indexes also offer opportunities for bugs, for example if you type i instead of j somewhere. Similarly with the thing where you use a single index to pull out a single row or column of a matrix: to me it looks like a trick, but if, like Bob, you’re familiar with the programming language it’s a transparent step.
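
To spell out point 4: shock[j] is the j-th row of the shock data array and p[j] is the j-th row of the matrix p, so inside a loop over j these two ways of writing the likelihood are the same:

// Bob's version: one vectorized statement per row of the data
shock[j] ~ bernoulli(p[j]);
// my version: the same likelihood, written out element by element
for (t in 1:T) {
  shock[j, t] ~ bernoulli(p[j, t]);
}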

Example of the logistic transform

I’ll write logistic regression like this:

Pr(y=1) = logit^{-1} (a + bx).

But a lot of people will write:

Pr(y=1) = e^{a + bx} / (1 + e^{a + bx})

or some other thing like that. To me, the “logit^{-1}” or “invlogit” formulation is clearest: invlogit is that S-shaped curve with a slope of 1/4 at 0, and as its argument goes from -5 to 0 to 5, its value goes from 0.01 to 0.5 to 0.99. The ratio-of-exponentials version, by contrast, is a bit confusing; indeed, I think the best way to read it is to chunk it into “invlogit.” But to people who are less familiar with statistical notation, the thing with the exponentials is clearer. For this example, I’m the “Bob Carpenter” who has internalized the transformation and prefers the chunked version, while these other users are still stuck on the other side (as I am with vectors and matrices in Stan), so they find the clunky notation more explicit and clear.
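
The same choice comes up in Stan. All of these express the same model (a, b, x, and y here are just generic names for this example), and the last form is generally recommended as the most numerically stable:

// three equivalent ways to write a logistic regression likelihood,
// with predictor vector x and binary outcomes y
y ~ bernoulli(inv_logit(a + b * x));
y ~ bernoulli(exp(a + b * x) ./ (1 + exp(a + b * x)));
y ~ bernoulli_logit(a + b * x);  // works directly on the log-odds scale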

I say this because when considering coding options, we sometimes talk about a tradeoff between computational efficiency and clarity for the user, and, yes, sometimes this tradeoff does exist, but often the answer to the question, “What is clear code for the user?”, depends on another question: “Which user?”

13 thoughts on ““Which user?”: When chunking your math and your code, what seems thickerer depends on where you’re coming from.”

  1. if there are bounds on the cleaned-up version, then the declarations would read:

    array[J, T] int<lower=0, upper=1> shock;

    also, there’s a typo in your description of change #1 – uppercase J for the size.

  2. Right, I would expect that changes like the ones in #3 lead to more efficient code generation, as you suspect, since vector operations can be delegated to highly optimized linear algebra libraries, and things like cumsum can be parallelized. Of course, in this simple case, a “smart” compiler might be able to automatically do some of this kind of vectorization, but in more sophisticated situations it can be tough. I suppose your invlogit example has a related property: even if the two encodings are mathematically equivalent, a library implementation of invlogit might have better precision than e^{a + bx} / (1 + e^{a + bx}), so it’s preferable to use if possible.
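
    For instance, the naive ratio of exponentials overflows once a + bx gets large and positive, whereas a library implementation will typically switch to an algebraically equivalent form. A rough sketch of the idea (not Stan’s actual implementation):

    functions {
      // naive form: exp(x) overflows for large positive x, giving inf / inf = NaN
      real inv_logit_naive(real x) {
        return exp(x) / (1 + exp(x));
      }
      // stable rewrite: never exponentiates a large positive number
      real inv_logit_stable(real x) {
        return x > 0 ? 1 / (1 + exp(-x)) : exp(x) / (1 + exp(x));
      }
    }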

    I like to think of this communication with the compiler as another form of the “which user?” question, but broadened to include the “compiler” as a user (or consumer) of the code. If we use “higher level” chunks like cumsum, we can more directly express certain intents to the compiler so that it can generate the best code possible. Like Bob Carpenter, compilers know the language and its libraries very well. But unlike Bob, they don’t know much else, and they can be much harder to train than users…

    Of course, people usually (rightly) treat the compiler as the least important “user”.

  3. you refer to bounds, but the cleaned-up version doesn’t have any – line 4 should be

    array[J,T] int<lower=0, upper=1> shock;

    there should also be lower bounds of either 0 or 1 on integers J and T, assuming both are counts.
    your description of the rewrite of the array declaration has a lowercase ‘j’ – that should be uppercase; we always use uppercase for integers which denote size.

    for the variable “prev_shock” – lower bound is 0 – what is the upper bound? are there J subjects and T trials?
    then upper bound is T or T-1, if lower bound on T is 1.
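
    e.g., something like this (just a sketch, assuming T >= 1, so each running count falls between 0 and T - 1):

    matrix<lower=0, upper=T - 1>[J, T] prev_shock;
    matrix<lower=0, upper=T - 1>[J, T] prev_avoid;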


    I do love the giving a joke a number joke, but I’m not sure you know how to tell a model.

    more seriously, if the data comes from an experiment with J subjects, each of which underwent T trials,
    and you’re interested in figuring out the distribution of behaviors across subjects, then it makes sense to be using row indices and vectorized operations. not only is it more efficient, it’s a more direct description of what the model is about.
    a discussion of coding choices should be prefaced with at least this much description of the model and the data.

    • Mitzi:

      1. Ah, yes, annoying html thing fixed.

      2. I think you’re right about the model. The challenge is that this requires understanding of the idea of a matrix being a set of vectors, and I can never remember what direction this goes in. As with the example of logit, I think I’ll be able to be a better Stan programmer once I commit to fully learning this new concept. But then I will also have to commit to being able to explain this concept to others. I accept what you and Bob say, that this is worth doing; it will just take an investment in learning and teaching on my part.

  4. “ratio of exponentials thing is a bit confusing”

    Think this was also a small critique you had of Ruth Etzioni’s Statistics for Health Data Science book.

    I like the ratio of exponentials thing!! We’re taught that logistic regression results in log odds: log-odds = a + bx. We are often taught in our very first stats class that we can convert probability to odds as P/(1-P) and odds to probability as Odds/(1+Odds).

    So in that sense, it’s intuitive that if we wanna convert the logistic regression outcome to a probability, we substitute the exponentiated log odds–i.e. exp(a+bx)–for the odds in the above formula.
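
    Written out, that substitution is just:

    log(P/(1-P)) = a + bx
    P/(1-P) = e^{a+bx}
    P = e^{a+bx} / (1 + e^{a+bx})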

  5. Sometimes I really wish for Stan to support libraries/packages and unit tests in some form, so I could

    1. code up log-posteriors made up of composable functions,

    2. test these functions with a bunch of correctness tests, so e.g. I could keep the loop version in the tests and use the optimized version in the code (I prefer to reason about loops) – see the sketch at the end of this comment,

    3. reuse the pieces I find useful in other problems (in other ways than copy/pasting code).

    Yes, I know there are open issues about these things and they are difficult. It’s just my wishlist, in case Santa Claus is reading the blog.
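
    In the meantime, the closest workaround I know of is abusing reject() as a poor man’s assertion, e.g. something along these lines inside transformed data, where prev_shock_loop is a made-up name for a second copy computed with the explicit loops:

    // crude stand-in for a unit test: compute the quantity both ways
    // and abort if they ever disagree
    for (j in 1:J) {
      for (t in 1:T) {
        if (prev_shock_loop[j, t] != prev_shock[j, t]) {
          reject("loop and vectorized versions disagree at j = ", j, ", t = ", t);
        }
      }
    }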

  6. I almost always start Andrew-style — explicit loops, go element by element — and then look for ways to improve computational efficiency by making things more Bob-style. But I usually don’t go as far as someone like Bob would, because (a) I don’t know how, and (b) it makes it much harder for me to read and debug. I don’t know how long it would take for me to get so comfortable with something like “row_vector[T - 1] shock_j = to_row_vector(shock[j, 1:T - 1]); prev_avoid[j] = cumulative_sum(append_col(0, 1 - shock_j));” that it would seem more natural to me than spelling it out elementwise. But it would be much longer than I have spent so far!

    • I guess I could have mentioned: Stan’s way of handling (or, rather, not handling) latent categorical variables is really confusing to me. I’m able to do it in cases that clearly map to the examples in the manual but I have to think through every step, and heaven help me if I were to run across a problem that didn’t map to one of the manual’s examples. I guess this represents a shallowness of understanding (on my part) of some aspect of the problems.

  7. I think one relevant style-affecting axis on which users differ might just be their prior exposure to style standards and statistical modeling notation / syntax. For example, when sharing Stan code with quantitative researchers, I’ve encountered a lot of confusion any time a greek letter outside {alpha, beta, mu, sigma} makes an appearance, and stuff like theta[n, p, t] is right out. In those cases, writing code that’s “aggressively self-documenting” has been the clearest solution I’ve found, but you sacrifice a lot of compactness, which makes the whole thing much more confusing and ugly to those used to reading relatively abbreviated models.
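
    For example (with made-up names), the contrast is roughly:

    // compact, statistics-notation style
    array[N, P, T] real theta;
    // "aggressively self-documenting" style
    array[n_subjects, n_predictors, n_timepoints] real effect_by_subject_predictor_time;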

    I think having some sort of hover-over tooltip might be the best compromise, or maybe an auto-generated companion document in the style of this https://github.com/synercys/annotated_latex_equations, though I can understand how work on “non-technical usability” tools is not a very pressing priority!
