People have been (correctly) mocking my 1990s-style code. They’re right to mock! My coding style works for me, kinda, but it does have problems. Here’s an example from a little project I’m working on right now. I was motivated to write this partly as a followup to Bob’s post yesterday about coding practices.
I fit two models to some coronavirus testing data, with the goal of estimating the ratio of prevalence between two groups. First I fit the simple model implicitly assuming specificity and sensitivity of 1:
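(For concreteness, the first fit would have looked roughly like this. This is a reconstruction from the second fit shown below, not the actual script: the data objects y and n and the stanfit() helper are defined elsewhere, and the exact contents of data_1 are a guess.)

library(cmdstanr)
# sketch of the first fit: prevalence only, no specificity/sensitivity data
data_1 <- list(y_sample=y, n_sample=n)
compare_1 <- cmdstan_model("compare_1.stan")
fit_1 <- compare_1$sample(data = data_1, refresh=0, parallel_chains=4,
                          iter_warmup=1e4, iter_sampling=1e4)
print(stanfit(fit_1), digits=3)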
           mean  se_mean     sd     2.5%      25%      50%      75%    97.5%  n_eff  Rhat
p[1]      0.023    0.000  0.004    0.015    0.020    0.023    0.026    0.033  33939     1
p[2]      0.020    0.000  0.005    0.012    0.017    0.020    0.023    0.031  31022     1
ratio     1.218    0.002  0.406    0.632    0.933    1.152    1.427    2.191  28751     1
lp__   -202.859    0.008  0.999 -205.567 -203.246 -202.554 -202.146 -201.879  17156     1
The huge effective sample sizes came because I ran Stan for 10,000 iterations in warmup and 10,000 in sampling, getting a ton of draws for good precision on the 2.5% and 97.5% quantiles. That precision shouldn’t really matter for the application—it’s hard to think of a case where we should care about a super-precise estimate of the endpoints of an uncertainty interval—but it’s convenient for getting a single set of numbers to report in the published article. Anyway, that’s not my main point; I just wanted to explain why I was straying from my usual practice.
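(To see how much noisier an extreme quantile is than the middle of the distribution, here’s a quick simulation—nothing to do with the model itself, just standard normal draws:)

# Monte Carlo error of an estimated quantile, from 100 replications
mc_error <- function(n_draws, prob)
  sd(replicate(100, quantile(rnorm(n_draws), prob)))
mc_error(1000, 0.50)    # median: relatively stable
mc_error(1000, 0.975)   # 97.5% endpoint: noticeably noisier
mc_error(20000, 0.975)  # many more draws are needed to pin down the endpoint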
To continue, I then re-fit the model with the data on specificity and sensitivity. Prior tests were perfect on specificity—this had to do with the stringent cutoff they were using to declare a positive result—and also had high sensitivity, so the results shouldn’t change much, especially given that we’re focusing on the ratio. But the uncertainty in the ratio should go up a bit, I’d think.
Here were the results:
           mean  se_mean     sd     2.5%      25%      50%      75%    97.5%  n_eff  Rhat
p[1]      0.023    0.000  0.005    0.015    0.020    0.023    0.026    0.033  34764     1
p[2]      0.020    0.000  0.005    0.012    0.017    0.020    0.023    0.031  34133     1
ratio     1.218    0.002  0.405    0.626    0.932    1.151    1.431    2.185  30453     1
lp__   -202.868    0.008  1.011 -205.582 -203.255 -202.560 -202.148 -201.878  17880     1
Not much different! The lower bound of the interval has dropped slightly, from 0.632 to 0.626, and the upper bound from 2.191 to 2.185. Interesting. Let’s check the inferences for the two probabilities individually . . . they’re exactly the same . . . Wait a minute!
Let me check the code.
OK, I have a bug in my R code. Here are the lines of code fitting that second model:
data_2 <- list(y_sample=y, n_sample=n, y_spec=130, n_spec=130, y_sens=137, n_spec=149)
compare_2 <- cmdstan_model("compare_1.stan")
fit_2 <- compare_2$sample(data = data_2, refresh=0, parallel_chains=4,
                          iter_warmup=1e4, iter_sampling=1e4)
print(stanfit(fit_2), digits=3)
Do you see it? In the second line I took the old model file, compare_1.stan, and compiled it into the object for the new model, compare_2. Here's what I need to do instead:
compare_2 <- cmdstan_model("compare_2.stan")
And now it works! Well, almost. First it turned out I had some little errors, such as an incorrectly defined vector variable, which were caught during compilation. Then, when the model finally ran, it gave a "variable does not exist" error, and I went back and caught the typo in the data definition above: in the copy-and-paste I had mistakenly defined n_spec twice. (The repeated 130 was not a mistake, though; their specificity testing data really were 130 negative tests out of 130.) So I fixed it to:
data_2 <- list(y_sample=y, n_sample=n, y_spec=130, n_spec=130, y_sens=137, n_sens=149)
And then the code really did run, yielding:
           mean  se_mean     sd     2.5%      25%      50%      75%    97.5%  n_eff   Rhat
p[1]      0.019    0.000  0.007    0.004    0.014    0.019    0.024    0.032   5808  1.000
p[2]      0.016    0.000  0.007    0.002    0.011    0.016    0.021    0.030   8207  1.000
spec      0.994    0.000  0.005    0.983    0.991    0.995    0.998    1.000   4998  1.000
sens      0.913    0.000  0.023    0.862    0.898    0.915    0.929    0.952  19939  1.000
ratio     1.693    0.052  3.859    0.453    0.901    1.206    1.656    4.917   5457  1.001
lp__   -254.658    0.020  1.562 -258.611 -255.451 -254.309 -253.501 -252.688   5840  1.001
The 95% interval for the ratio is now super-wide (and the estimated interval endpoint is itself noisy, even after 10,000 iterations; re-fitting the model with a new random seed yields an upper endpoint of 4.5), which makes sense given the small but real possibility from the data that the false positive rate is high enough to mess things up.
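(Here’s a quick way to see that instability—a sketch assuming compare_2 and data_2 from above are still in the workspace; seed is cmdstanr’s random-seed argument:)

# re-fit with a few different seeds and compare the upper interval endpoints
for (s in 1:3) {
  fit_s <- compare_2$sample(data = data_2, refresh=0, parallel_chains=4,
                            iter_warmup=1e4, iter_sampling=1e4, seed=s)
  print(quantile(as.vector(fit_s$draws("ratio")), 0.975))
}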
Anyway, the point of this post is just to demonstrate how, with sloppy code, you really have to keep your hands on the wheel at all times. Even a simple fit like this is hard to get right if you're not continually checking results against your expectations.
I expect that the particular things that happened to me here would not be so bad for people who use more modern coding practices and workflows. From the other direction, there are lots of researchers whose code and workflow are much sloppier than mine, and they'll have to be especially eagle-eyed to avoid tripping up and even publishing erroneous results.
To return to the title of this post: Ugly code is buggy code not just because it's easier to introduce bugs and harder to find them if your code is poorly structured. It's also that, for code to be truly non-buggy, it doesn't just have to be free of bugs. You also have to know it's free of bugs, so that you can use it confidently.
Also it took me about 20 minutes to write this post. My blogging workflow is efficient (or perhaps inefficient in a larger sense, in that blogging is so easy that I do lots of it, perhaps taking the place of more productive activities).
I’ve been writing good code since I read “Professional Pascal: Essays on the Practice of Programming” by Henry F. Ledgard, 1986, https://www.google.com/books/edition/Professional_Pascal/v_UmAAAAMAAJ . I read it in 1986 when it was published, so I was writing good code in the 1990s.
What about this one?
https://devhumor.com/media/writing-code-that-nobody-else-can-read
Seriously, “The Practice of Programming” by Brian W. Kernighan and Rob Pike (1999),
https://books.google.co.uk/books?id=j9T6AgAAQBAJ&lpg=PP1&dq=the%20practice%20of%20programming&pg=PP1#v=onepage&q=the%20practice%20of%20programming&f=false
was very useful to me.
Nobody writes good code! The best we can hope for is to write less bad code. That’s basically the thesis of Hunt and Thomas’s classic programming methodology book, The Pragmatic Programmer. You know coding’s hard and almost always buggy, so how do you mitigate the risk so you can sleep at night?
I really got a lot out of The Craft of Prolog (speaking of 1980s languages), AI programming in Lisp (great programming tips for an even older language), Refactoring (Java, but also relevant for C++), and Effective Java, because I read them as I was transitioning from academia to programming.
I’ve yet to find an R or Python book I like. I’m slogging through the OCaml book, but it’s not a fun slog. Maybe I’m just tired of learning new programming languages :-)
Hunt and Thomas updated The Pragmatic Programmer and published a new edition last year. I come to programming from the scientific side, and I’m about halfway through the book, discussing it via email with a friend who is a software developer. It has already been extremely useful, and I expect to continue to get a lot out of it.
Cool. I should pick up a copy of the update. I read the first one around 2000 when I was transitioning from research to coding.
> Nobody writes good code!
How do you know that? I’ve read and written a lot of code. There is good code, but not a lot of it.
If you want me to show you how to write good code, contact me.
Let’s see your github pal.
I ran across Ledgard in 1986 and found a coherent approach to writing programs that made sense to me. Also, K&P’s Practice of Programming (mentioned above) and Jon Bentley’s Programming Pearls were very helpful in terms of relating technique to problem formulation.
What astonishes me is that people publish ugly and often inefficient code without the slightest embarrassment about the quality of their product.
Usually I have difficulty expressing my ideas in English (I’m not a native speaker).
But I find it super satisfying to write code that works (of course :D) and is elegant and smooth. For that, I spend much more time optimising than writing. I cannot find a more satisfying professional activity than finding a way to make my code more concise and easier to follow logically.
I think that all applied statisticians (I try to be one) should develop a set of tools that they will use very often. In this context, unit tests play a fundamental role.
Typos are very subtle (they don’t involve the logic of the algorithm). Catching them takes a lot of patience or (better) a friend looking at the code.
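(A minimal example of the kind of unit test I mean, using R’s testthat package; the logit function here is just a stand-in:)

library(testthat)
logit <- function(p) log(p / (1 - p))
test_that("logit behaves at known points", {
  expect_equal(logit(0.5), 0)  # logit(0.5) is exactly 0
  expect_gt(logit(0.9), 0)     # values above 0.5 map to positive numbers
})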
This is a very common trait among programmers. I actually have to fight my tendency to do this and balance it against shipping a product. The biggest win for this effort is when multiple people work on a project (including you in the future!).
The reason you want to break functionality down into small units is that makes them easier to test. When you know the problem is in 10 lines of code, it’s a lot easier to find than if it’s in 100.
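(A toy sketch in R, not code from the post: pull one small piece of logic into its own function and it becomes trivially checkable on its own.)

# a small, pure function: if a check fails, the bug is in three lines, not a hundred
prevalence_ratio <- function(p1, p2) {
  stopifnot(p1 >= 0, p2 > 0)
  p1 / p2
}
stopifnot(prevalence_ratio(0.023, 0.020) > 1)   # group 1 more prevalent
stopifnot(prevalence_ratio(0.020, 0.020) == 1)  # equal prevalence gives ratio 1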
Agree 100%! Sometimes I find myself going back 1000 times to the person I wrote the code for, saying: “Please update the code, I’ve added something cool that makes it 0.0001% faster!” :D
They know me and just smile (it’s good being surrounded by patient people :D).
For statistical code in particular, I find it very useful to read the source when I’m not sure how something is done.
It’s like reading the formulas after several paragraphs trying to explain a concept.
That’s why I love open source. It makes it easier to learn and also to collaborate with others.
1990s-style R isn’t the problem. It’s the wall of cut-and-paste code. And the choice of R in the first place, which goes out of its way to paper over any potential errors in user code (see my comments on rgamma in yesterday’s post, for example).
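(I won’t rehash the rgamma details here, but one example of the general pattern: R’s random-number generators recycle parameter vectors silently, so a length mismatch that is almost surely a bug produces no warning at all.)

set.seed(1)
# shape is silently recycled to c(1, 1000, 1, 1000); R raises no warning
rgamma(4, shape = c(1, 1000))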
Moving from working on one’s own projects to working with a group is the biggest jump in programming (after learning recursion and pointers, that is). It requires much more focus on code being readable. And much more skill on everyone’s part in reading the code of others. One of the nice aspects of code review is that there’s an emphasis on code readability. If the reviewer can’t follow the twisted logic, they can’t review the pull request.
Bob,
I took sort of the opposite journey from you and went from a professional programmer to a Data Scientist — by way of videography, but that’s another story — and along the way my programming skills atrophied and I’ve reverted to cut/paste too many times. I have to say that Python makes you sweat the details (list comprehensions, I’m looking at you) though Pandas is a, um, mess, and I’ve been following Julia for a couple of years now hoping that it would become broad-spectrum viable but it’s not quite there yet. It’s a beautiful language and I still hope to do a project in Julia.
I’ve been watching Julia for a few years myself. Last year I decided it was time to learn it… and then the time I tried to carve out went away with a summer trip and kids… I also discovered that they were in the middle of a major improvement in how threading works and had just completed a major revision of the package system, so it felt not quite fully cooked. This summer when I restarted that project, it seemed fully cooked: no major revisions planned, it works well, it runs fast, and the major packages I want to use have been stable for months to years. Talks given in 2018 at JuliaCon are still largely relevant today without major changes in anything.
Another nice read on this topic:
http://theorangeduck.com/page/reproduce-their-results
“Coding Guidelines: Finding the Art in the Science” by Robert Green and Henry Ledgard, ACM Queue, vol. 9, No. 11, 2011-11-02, https://queue.acm.org/detail.cfm?id=2063168