## Shlemiel the Software Developer and Unknown Unknowns

The Stan meeting today reminded me of Joel Spolsky’s recasting of the Yiddish joke about Shlemiel the Painter. Joel retold it on his blog, Joel on Software, in the post Back to Basics:

Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. “That’s pretty good!” says his boss, “you’re a fast worker!” and pays him a kopeck.

The next day Shlemiel only gets 150 yards done. “Well, that’s not nearly as good as yesterday, but you’re still a fast worker. 150 yards is respectable,” and pays him a kopeck.

The next day Shlemiel paints 30 yards of the road. “Only 30!” shouts his boss. “That’s unacceptable! On the first day you did ten times that much work! What’s going on?”

“I can’t help it,” says Shlemiel. “Every day I get farther and farther away from the paint can!”

Joel used it as an example of the kind of string processing naive programmers are prone to use.

The reason I bring it up is that software development almost inevitably employs the Shlemiel the Painter algorithm.

Here’s the problem in a nutshell: The more moving pieces your software has, the longer it takes to add a new feature or change an existing feature.

For example, when we have N special functions defined, if we want to change the way error handling works in all of them, it takes N units of work. If we have K interfaces defined, the amount of work to roll out a new user-facing feature increases K-fold over the situation where there’s only a single interface.

So the first feature takes one unit of time, the second two units, and so on. And we all know where this goes, though if you’re like me rather than like Gauss, you didn’t derive the result in your head in primary school:

$\displaystyle \sum_{n=1}^N \, n= \frac{N\, (N+1)}{2}.$

The upshot is that to add $N$ features takes time proportional to $N^2$.

This is why I’m always arguing that adding new features, as simple as they look, isn’t free. We have to test them and we have to document them and we have to build them into all of our interfaces. Then when things change, we have to change all of those moving pieces.

Sometimes you can design around the problem with modularity. In an ideal world, you look into the future when designing the algorithm the first time and imagine all the ways it might change and design something simple with that in mind. At least that’s how software design works in theory.

In practice, it’s nearly impossible to write simple and modular code with oracular accuracy about how it’ll be used in the future. It’s the unknown unknowns that get you every time. As every engineer knows, but as Donald Rumsfeld was mocked for pointing out,

There are known knowns; there are things we know that we know.

There are known unknowns; that is to say, there are things that we now know we don’t know.

But there are also unknown unknowns – there are things we do not know we don’t know.

It’s the unknown unknowns that get you every time. What usually happens is that you only figure out where the modularity has to be after the $N^2$ work is done.

At this point, you can spend even more time and refactor the code and hopefully not have to do yet another round of $N$ work in the future. If you’re lucky. One issue is that those without a lot of experience in software development can see refactoring or trying to design modularly in the first place as akin to rearranging the deck chairs on the Titanic. But what it’s really about is trying to wrestle software into a manageable state.

Just an example related to Stan — we’re about to go through and rewrite everyone one of our distribution functions yet again so that we can take higher-order derivatives (we need this for some optimization, for Laplace approximations, and for RHMC). We didn’t even know Stan would be used by anyone other than us when we first wrote the code. The second rewrite was for vectorization. The third time around we added template metaprogramming to drop constant terms. On the fourth rewrite, we added expression templates to make the vectorization efficient. And now we’re on the fifth time, adding higher-order derivative-compatible code. Each time, the amount of work we had to do was proportional to the number of distributions we supported, which continues to grow with both new distributions and more convenient reparameterizations of existing distributions, like Benroulli on the logit scale or Poisson on the log scale or multivariate normal with a Cholesky factor covariance or precision. And that work’s not just in defining the function, but in testing them. We had basic tests, then we needed tests for vectorization, then tests for all the varying ways the functions could be called with data and parameters. And now we need tests for the second-order derivatives. But it’s not just tests. It’s documentation. And debugging. And providing support to users, because the doc gets more confusing as we support more options.

And at one point, we actually simplified all the distributions (again requiring N units of work) to get rid of the traits-based error-handling configuration that we anticipated needed but never needed. It just added drag to all of our development and seriously complicated our testing. One of the biggest reasons to keep things simple is to simplify testing and documentation.

1. Steve says:

So true. Thanks for this. You might be interested Rich Hickey’s thoughts on this topic: http://www.infoq.com/presentations/Simple-Made-Easy

2. Rahul says:

Is Stan experiencing mission creep and feature bloat?

• So far, Daniel and I have been trying to keep the sprawl under control. I don’t think we’ve hit feature bloat yet, but we’re continuing with refactoring to try to make things easier to develop in the future.

I was just trying to explain the cost of adding new features.

3. John Mashey says:

See 8 slides starting here, from a talk first given in 1977. See also these 2 slides.

Refactoring is the modern name for what we called “crunch meetings”, where we’d review masses of code and look for examples of doing almost the same thing, and then identify the general program to be written, and then the list of places that could call it.
All software degrades unless explicit effort is made to clean it up, even without falling prey to “creeping featurism.”
It takes strong architectural will to lessen the combinatorial explosions.

Note also the lifeboat and sinking lifeboat approaches.

• Rahul says:

Nice slides! Thanks.

• I started reading them and was nodding so hard my wife (also a software developer) joined me. We read through all of these slides with delight and amusement. That talk is so packed with insight that we were just blown away. Even the quotes were amazing, and I generally dislike quotes of pithy sayings.

How well do you think a general audience of programmers without a lot of software development experience would understand them? Maybe Andrew can have a look and see if he nods along or is perplexed. One thing I find with software development (and stats and math and just about everything else) is that I often don’t understand someone’s solution to a problem until I have to solve it myself and therefore understand the problem itself a lot better.

I continue to be surprised when someone says “we did that back in the 1970s” then backs it up with slides or papers. This happened to me all the time as a grad student in the mid-80s talking to Ron Kaplan and Bonnie Webber, who’d been developing natural language processing systems and parsers in the early 70s. Between the projects those two worked on, they foresaw the entire next couple of decades (and then everything went to statistics, which almost nobody saw coming).

The slides also remind me of what Andrew always says: if you develop a new model, it was probably already invented by a psychometrician in the 1950s. For me, the first model I invented in stats with the help of Andrew and Jennifer had already been studied by Phil Dawid in a 1979 paper.

• K? O'Rourke says:

> understand the problem itself a lot better

That’s why Don Rubin suggested reading the earliest papers on a topic – they are clearer about the problems they are trying to solve.

And problems can’t be clearer than when you run into them head on and need to at least try to solve them yourself.

> would understand them?

It is very hard to grasp the stories others are induced to tell themselves upon hearing the same story.

• John Mashey says:

Thanks for the kind words. The context for that talk was BSDCON 2002. An old friend, Sam Leffler was Chair and asked me if I could come give those old talks, as he thought many of the younger folks needed to hear them.
So, I did. At the end, one of the audience (and old Bell Labs guy), leapt up and said:
“I heard the original! and we haven’t progressed one bit.”

I said, that wasn’t entirely true, we at least had better tools and especially faster machines :-)

We sent the paper over to USENIX to scan … the webmaster asked me for the PowerPoints…

As to general audiences: Small is Beautiful was done ~50 times inside Bell Labs, for ACM chapters and others, Software Army maybe half a dozen times. Every once in a while since 2002, they get used for some audience, although the have been updated with a few extras, including the slide build sequence of creeping featurism and cartoons of a small (cute) and grown (big, nasty) creeping feature.

Generally, the S.I.B. talk resonated really well with experienced software people, who often would say “Argh, my boss should have been here.” However, this was often done for ACM student chapters or other university lectures, and they usually got at least some of it … the hope being, of course, that when they first ran into the issues, they’d recognize them.

4. Rahul says:

Maybe one solution is to not provide support for many different interfaces at an early stage in the lifecycle?

• That ship has sailed. By design. We wanted to reach out to a broad set of users beyond the R community.

And if we get the next grant we applied for, Stata is next. And we’ve already promised the NSF we’d also do MATLAB.

5. C++ is just not a very expressive language for the kinds of mathematical programming that stan requires. Of course, it does produce some really fast code. So there are tradeoffs. Still I can’t help but think Common Lisp every time I read about you guys implementing stan.

• An axiom of comments on posts about programming is that someone will always suggest everything would be better with a different programming language. We’ve been told we should use a flexible interpreted language like Python and are often told we should use functional programming (with a strong division between the Lisp-like suggestions and the ML-like suggestions, and a further split on lazy vs. eager eval). These days, we get lots of suggestions for Julia. And of course lots of people tell us we should be focusing on running on clusters or on GPUs. We’ve also had suggestions to follow HBC or Passage and write the compilers in Haskell and then generate efficient C++ code.

We still get (misguided) comments that we should use C because it’s faster. For efficiency, it’s the static processing that’s such a win. Some highly typed functional languages like ML in its various forms compile down to very efficient code for the same reason as Fortran — you can do a huge amount of static analysis. See my reply to a post of John Cook’s blog:

We haven’t found C++ to be a limiting factor. It was a struggle for me moving from Java and a background in programming languages that focused on functional and logic programming languages (I used to do domain theory as an academic). But now that I’ve learned it, I love C++. There are, of course, lots of things I’d change, but backward compatibility’s a good argument for why they are like they are. C++11 is likely to be a game changer when it gets more traction — we haven’t turned it on for Stan yet, but we’re working on it.

• Rahul says:

Regarding clusters & GPUs: Do you often bump into a computational time bottleneck on applications of Stan? Just curious.

• Any time a model takes longer than 4 or 5 seconds to run, it’s a bottleneck. It means you’ll do something else and get distracted before it finishes. And most Stan models at optimization level 3 won’t even compile in 4 or 5 seconds. Most of the models I’m interested in still take hours to run. And the only reason I don’t run bigger models or bigger data is that a few hours is about the limit of my patience.

Another place we run into issues is running large evaluations. For instance, we’re currently running dozens of models from Gelman and Hill’s regression book, each for enough iterations to guarantee convergence and then running multiple times to get average effective samples per second time estimates with variation.

• Rahul says:

Interesting how perspectives differ: I spent some time in the Computational Quantum Chemistry community & for us 3-4 hours was a “fast” run. So, yeah I think you are “spoilt” if 5 secs. seems a long wait. :)

In any case, my point was there’s a certain run-time horizon below which it rarely makes sense to invest the time & effort to explicitly parallelize a code or port it to GPUs. If at all, one can always get around by poor-man’s-parallalism i.e. run parameterized serial runs on different machines with something like Condor.

What you describe seems an excellent fit for a Condor like system that just farms out runs to a cluster of machines. The headache required to explicitly parallelize might be quite high.

• Also, I should say after having read your post on C++ speed, that it’s interesting how important meta-programming is in your view of C++’s performance. Meta-programming is exactly the reason why I think “common lisp” when I hear about what you guys are doing.

So, in the interest of perhaps providing more useful perspective than my original one-line throwaway comment, perhaps even perspective that can benefit you in your C++ context here’s what I sort of imagine implementing stan would look like in CL:

Stan in CL would look like a translator of stan model files to CL code (as opposed to its current technique of translating to C++ code) followed by a call to the built-in “compile”. However, rather than direct translation, a typical method would be to translate the stan file into an intermediate set of custom CL macros that express common and important cases triggered by stan model files.

There are probably a lot of very common sub-expressions, like updates to lp__ and the ones you mention a+x*b for matrix scalar accumulation etc. The translator would most likely translate the stan code into an intermediate representation for those very common sub-expressions. These intermediate representations would themselves be implemented as common lisp macros which would expand the simple and concise intermediate representation into heavily type-declared highly efficient common lisp code that also implements auto-differentiation. Once that highly type-declared auto-differentiated raw low-level CL code was generated, simply calling the built-in “compile” function would turn it all into machine code. A CL implementation like SBCL would take advantage of those type declarations to do static analysis and produce code that was hopefully similarly fast to C++.

With this kind of architecture, adding new Stan features would look like creating new intermediate representations, and writing the macro code to convert them to raw low-level CL. Optimizing your compiler would probably involve changing the implementation of certain CL macros, or inventing new intermediate representations that let you take advantage of your domain knowledge (ie. the fact that you’re implementing a stan model file) to create new opportunities for optimization.

• Also, this technique would give you a two-level view of correctness and speed issues. You could implement a new stan feature in terms of defining new intermediate expressions, and debug that stage by reading the output, which would presumably look a lot like math in s-expression form. The correctness and efficiency of your macro implementations that actually translate that intermediate form into raw CL code would be a separate step, and presumably each intermediate form would be implemented and debugged and unit-tested separately.

To give a sense of what I mean, here’s some pseudo-version of what I might output from one of your stan examples from the manual page 76 (a missing data thingy, it was just the first example I found in the manual).

(compilefile
(declaredata (integer 0 maxint) N_obs)
(declaredata (integer 0 maxint) N_miss)
(declaredata real y_obs N_obs)
(declareparam real mu)
(declareparam (real 0 maxreal) sigma)
(declareparam real y_miss N_miss)
(modelsteps
(for n 1 N_obs (updatelp__ (lp_normal (index yobs n) mu sigma)))
(for n 1 N_miss (updatelp__ (lp_normal (index ymiss n) mu sigma)))…)

(you’ll have to excuse me for not closing the parens without a proper text editor)

compilefile would be a top-level macro which macro-expands all the sub-expressions and then calls compile on them. things like declaredata would allocate the various data structures using let statements with type declarations, and write a macro that replaces N_obs in the body with some kind of thing like (the (integer 0 maxint) N_obs) so that common lisp knows all about the type of N_obs and modelsteps would do a lot of expansion of the model statements, including expanding for loops into common lisp (do) syntax, and in the process if it could figure out to vectorize something it could automatically vectorize it or something like that, updatelp__ would generate statements that incremented the lp__ variable and its derivatives via auto-diff…

So I assume from thinking this through that you have a bunch of C++ template code that does similar things but I think you can see that it at least isn’t surprising that I think of CL when I read about what you’re up to.

• Of course, you could also use Common Lisp just to do the code generation, having it emit C/C++ code.

BTW, I’m not sure that CL is the best language for something like this. I’ve played around a bit with trying to use CL to automate parts of the process of developing MCMC code, and what I’ve found is that you end up trying to re-implement a lot of functionality that a computer algebra system such as Mathematica already does much better. So I suspect that writing a model compiler in Mathematica could be a very effective approach. The downside, of course, is that Mathematica licenses are quite expensive.

• Bob, I didn’t mean to imply that you’d made the wrong choice or that everything would be better with a different language, or that you should actually switch languages or anything like that. There is no single best language of course, and even when a given language has certain advantages, it can make sense to choose another one because of the overall tradeoffs faced by a project.

However, the kinds of things I hear about you doing as I follow along from my armchair at home just sound like the kinds of things I’d really like to have Common Lisp for, examples include implementing auto-diff via macros and reading stan model files, translating them to some kind of intermediate representation (in this case common lisp s-expressions), and then compiling them to machine code (which would basically be calling “compile” on your s-expressions). Also there’s good old Greenspun’s Law to deal with :-).

You guys know a lot more about what’s really under the hood in Stan than I do of course, and getting common lisp to behave well in numerics is possible but takes a lot of skill, whereas getting C++ to behave well in numerics is more straightforward, but getting C++ meta-programming right is possible but takes a lot of skill.

In either case, I really do appreciate that you guys are producing a high level tool that keeps me from having to write code in lower level languages when I want to fit a Bayesian model. In fact, I appreciate it a LOT, more than I can really articulate here.

6. Bill Harris says:

If you are adding interfaces, what about J (http://www.jsoftware.com/)? It’s a powerful language, but I think it could benefit greatly from Stan, and I don’t have the time to try to work on it for the foreseeable future.

They do list a number of other interfaces at http://www.jsoftware.com/jwiki/Interfaces. (Now if they only had an org-mode connection!)

• I don’t think we’ll have time to do any more ourselves through 2014, but if someone else wants to volunteer to build a J interface and maintain it, we’re glad to help out where we can.

7. Jb says:

Hope you folks know how grateful we all are for your time.

8. Wayne says:

Bob,

This reminds me of an analogy I use a lot in regard to adding “one more” of a particular feature. Supposedly, there are tribes that only have three numbers, “none”, “one”, and “many”, and it’s like that with programming: none is easy, one is straightforward, but after that it doesn’t make much difference if we need to display two windows or three, or 300. Once we move beyond more than one, it’s another level of design and implementation.

Oh, you should implement Stan in Scala. It would be more like the Java you know, without all of the warts. And it combines the best of functional and object-oriented programming in a single language. (I’m surprised you like C++, which is the base language from which Java inherited its warts.) Not trying to start a language war or anything. Just saying.