Ugly code is buggy code

People have been (correctly) mocking my 1990s-style code. They’re right to mock! My coding style works for me, kinda, but it does have problems. Here’s an example from a little project I’m working on right now. I was motivated to write this partly as a followup to Bob’s post yesterday about coding practices.

I fit two models to some coronavirus testing data, with the goal of estimating the ratio of prevalence between two groups. First I fit the simple model implicitly assuming specificity and sensitivity of 1:

          mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
p[1]     0.023   0.000 0.004    0.015    0.020    0.023    0.026    0.033 33939    1
p[2]     0.020   0.000 0.005    0.012    0.017    0.020    0.023    0.031 31022    1
ratio    1.218   0.002 0.406    0.632    0.933    1.152    1.427    2.191 28751    1
lp__  -202.859   0.008 0.999 -205.567 -203.246 -202.554 -202.146 -201.879 17156    1

The huge effective sample sizes came because I ran Stan for 10,000 warmup and 10,000 sampling iterations, getting a ton of draws so that the 2.5% and 97.5% quantiles would be estimated precisely. That precision shouldn’t really matter for the application (it’s hard to think of a case where we should care about a super-precise estimate of the endpoints of an uncertainty interval), but it’s convenient for getting a single stable set of numbers to report in the published article. Anyway, that’s not my main point. I just wanted to let you know why I was straying from my usual practice.
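To give a rough sense of scale (a toy simulation with independent normal draws, not anything from the actual model), the Monte Carlo noise in an estimated 97.5% quantile shrinks roughly like one over the square root of the number of draws:

set.seed(123)
endpoint_sd <- function(n_draws, n_reps=1000) {
  # sd across repeated simulations of the estimated 97.5% quantile
  sd(replicate(n_reps, quantile(rnorm(n_draws), 0.975)))
}
endpoint_sd(1e3)   # noticeably noisy
endpoint_sd(1e4)   # roughly 3x tighter (a factor of sqrt(10))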

To continue, I then re-fit the model with the data on specificity and sensitivity. Prior tests were perfect on specificity—this had to do with the stringent cutoff they were using to declare a positive result—and also had high sensitivity, so the results shouldn’t change much, especially given that we’re focusing on the ratio. But the uncertainty in the ratio should go up a bit, I’d think.
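(For reference, the usual way to fold in imperfect tests, and presumably roughly what compare_2.stan does, is to model the probability that someone in the sample tests positive as p*sensitivity + (1 - p)*(1 - specificity), with the calibration counts, y_spec out of n_spec and y_sens out of n_sens, entering the likelihood as additional binomial terms.)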

Here were the results:

          mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
p[1]     0.023   0.000 0.005    0.015    0.020    0.023    0.026    0.033 34764    1
p[2]     0.020   0.000 0.005    0.012    0.017    0.020    0.023    0.031 34133    1
ratio    1.218   0.002 0.405    0.626    0.932    1.151    1.431    2.185 30453    1
lp__  -202.868   0.008 1.011 -205.582 -203.255 -202.560 -202.148 -201.878 17880    1

Not much different! The lower bound of the interval has dropped slightly, from 0.632 to 0.626, and the upper bound has barely moved. Interesting. Let’s check the inferences for the two probabilities individually . . . they’re exactly the same . . . Wait a minute!

Let me check the code.

OK, I have a bug in my R code. Here are the lines of code fitting that second model:

data_2 <-list(y_sample=y, n_sample=n, y_spec=130, n_spec=130, y_sens=137, n_spec=149)

compare_2 <- cmdstan_model("compare_1.stan")
fit_2 <- compare_2$sample(data = data_2, refresh=0, parallel_chains=4, iter_warmup=1e4, iter_sampling=1e4)
print(stanfit(fit_2), digits=3)

Do you see it? In the second line, I took the old file, compare_1.stan, and put it in the object for the new model, compare_2. Here's what I need to do instead:

compare_2 <- cmdstan_model("compare_2.stan")

And now it works! Well, almost. First it turned out I had some little errors, such as not correctly defining a vector variable, which were caught during compilation. Then, when the model finally ran, it gave a "variable does not exist" error, and I went back and caught the typo in the data definition above: in the copy-and-paste process I had mistakenly defined n_spec twice, so n_sens was never defined at all. (The repeat of 130 was not a mistake, though; their specificity testing data really were 130 negative tests out of 130. And R itself never complained about the duplicate; see the note below.) So I fixed it to:

data_2 <-list(y_sample=y, n_sample=n, y_spec=130, n_spec=130, y_sens=137, n_sens=149)
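Incidentally, it's no accident that R stayed quiet about the duplicated name: lists happily accept repeated names, and $ just returns the first match, so the missing n_sens only surfaced when Stan went looking for it. A quick illustration with toy values, not the project data:

d <- list(y_spec=130, n_spec=130, y_sens=137, n_spec=149)
names(d)   # "y_spec" "n_spec" "y_sens" "n_spec" -- the duplicate is accepted, no warning
d$n_spec   # 130, the first match; d$n_sens is NULL because it was never defined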

And with that fix the code really did run, yielding:

          mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
p[1]     0.019   0.000 0.007    0.004    0.014    0.019    0.024    0.032  5808 1.000
p[2]     0.016   0.000 0.007    0.002    0.011    0.016    0.021    0.030  8207 1.000
spec     0.994   0.000 0.005    0.983    0.991    0.995    0.998    1.000  4998 1.000
sens     0.913   0.000 0.023    0.862    0.898    0.915    0.929    0.952 19939 1.000
ratio    1.693   0.052 3.859    0.453    0.901    1.206    1.656    4.917  5457 1.001
lp__  -254.658   0.020 1.562 -258.611 -255.451 -254.309 -253.501 -252.688  5840 1.001

The 95% interval for the ratio is now super-wide (and the estimated upper endpoint is itself noisy, even after 10,000 iterations; re-fitting the model with a new random seed yields an upper bound of 4.5), which makes sense: the data leave open a small possibility that the false-positive rate is high enough to mess things up.

Anyway, the point of this post is just to demonstrate how, with sloppy code, you really have to keep your hands on the wheel at all times. Even such simple fitting is hard to do if you're not continually checking results against your expectations.
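For what it's worth, here's one way to structure this kind of script so that at least the wrong-model-file bug is harder to commit. This is a sketch, not the actual project code, and fit_compare is a made-up helper: the idea is just that the model name and its data travel together, and "compare_2" appears in exactly one place instead of being copy-pasted across lines.

# Hypothetical helper, assuming cmdstanr: the model file name is built from a
# single argument, so it can't silently disagree with the rest of the call.
fit_compare <- function(model_name, data, iter=1e4) {
  model <- cmdstanr::cmdstan_model(paste0(model_name, ".stan"))
  model$sample(data=data, refresh=0, parallel_chains=4,
               iter_warmup=iter, iter_sampling=iter)
}
fit_2 <- fit_compare("compare_2", data_2)

It wouldn't have caught the n_spec/n_sens typo, but it would have made the copy-paste mismatch between the model object and the .stan file much less likely.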

I expect that the particular things that happened to me here would not be so bad for people who use more modern coding practices and workflows. From the other direction, there are lots of researchers whose code and workflow are much sloppier than mine, and they'll have to be especially eagle-eyed to avoid tripping up and even publishing erroneous results.

To return to the title of this post: Ugly code is buggy code not just because it's easier to introduce bugs and harder to find them if your code is poorly structured. It's also that, for code to be truly non-buggy, it doesn't just have to be free of bugs. You also have to know it's free of bugs, so that you can use it confidently.

Also it took me about 20 minutes to write this post. My blogging workflow is efficient (or perhaps inefficient in a larger sense, in that blogging is so easy that I do lots of it, perhaps taking the place of more productive activities).

17 thoughts on “Ugly code is buggy code”

  1. Usually, I have difficulty expressing my ideas in English (I’m not a native speaker).
    But I find it super satisfying to write code that works (of course :D) and is elegant and smooth. For that, I spend much more time optimising than writing. I cannot find a more satisfying professional activity than finding a way to make my code more concise and easier to follow logically.

    I think that all applied statisticians (I try to be one) should develop a set of tools that they will use very often. In this context, unit tests play a fundamental role.

    Typos are very subtle (they don’t involve the logic of the algorithm). To catch them, one needs a lot of patience or (better) a friend looking at the code.

    • I cannot find a more satisfying professional activity than finding a way to make my code more concise and easier to follow logically.

      This is a very common trait among programmers. I actually have to fight my tendency to do this and balance it against shipping a product. The biggest win for this effort is when multiple people work on a project (including you in the future!).

      The reason you want to break functionality down into small units is that it makes them easier to test. When you know the problem is in 10 lines of code, it’s a lot easier to find than if it’s in 100.
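      For example (a toy sketch, not anyone’s actual code), a five-line function can be sanity-checked on its own before it disappears into a larger script:

      prevalence_ratio <- function(p1, p2) {
        stopifnot(p1 >= 0, p1 <= 1, p2 > 0, p2 <= 1)  # fail loudly on bad inputs
        p1 / p2
      }
      stopifnot(prevalence_ratio(0.5, 0.25) == 2)  # a one-line unit check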

      • Agree 100%! Sometimes I find myself going back 1000 times to the person I wrote the code for, saying: “Please update the code, I’ve added something cool that makes it 0.0001% faster!” :D
        They know me and just smile (it’s good being surrounded by patient people :D).

        For statistical code in particular, I find it very useful to read the code when I’m not sure how something is done.
        It’s like reading the formulas after several paragraphs that try to explain a concept.

        That’s why I love open source. It makes it easier to learn and also to collaborate with others.

  2. 1990s-style R isn’t the problem. It’s the wall of cut-and-paste code. And the choice of R in the first place, which goes out of its way to paper over any potential errors in user code (see my comments on rgamma in yesterday’s post, for example).
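    A couple of toy examples in that spirit (not the rgamma issue itself): R quietly does things a stricter language would refuse to do.

    c(1, 2, 3, 4) + c(10, 20)    # silently recycles the shorter vector: 11 22 13 24
    list(n_sample=50)$n          # partial name matching: returns 50, no warning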

    Moving from working on one’s own projects to working with a group is the biggest jump in programming (after learning recursion and pointers, that is). It requires much more focus on code being readable. And much more skill on everyone’s part in reading the code of others. One of the nice aspects of code review is that there’s an emphasis on code readability. If the reviewer can’t follow the twisted logic, they can’t review the pull request.

    • Bob,

      I took sort of the opposite journey from you and went from a professional programmer to a Data Scientist — by way of videography, but that’s another story — and along the way my programming skills atrophied and I’ve reverted to cut/paste too many times. I have to say that Python makes you sweat the details (list comprehensions, I’m looking at you) though Pandas is a, um, mess, and I’ve been following Julia for a couple of years now hoping that it would become broad-spectrum viable but it’s not quite there yet. It’s a beautiful language and I still hope to do a project in Julia.

      • I’ve been watching Julia for a few years myself. Last year I decided it was time to learn it… and then the time I tried to carve out went away with a summer trip and kids … I also discovered that they were in the middle of a major improvement in how threading works and had just completed a major revision of the package system. So it felt not quite fully cooked. This summer, when I restarted that project, it seemed fully cooked. There were no major revisions planned, it works well, it runs fast, and the major packages I want to use seem stable and have been for months to years. Talks given in 2018 at JuliaCon are still largely relevant today without major changes in anything.
