Should computer programming be a prerequisite for learning statistics?

[cat picture]

This came up in a recent discussion thread; I can’t remember exactly where. A commenter pointed out, correctly, that you shouldn’t require computer programming as a prerequisite for a statistics course: there’s a lot in statistics that can be learned without knowing how to program. Sure, if you can program you can do a better job of statistics, but you can still do a bit with just point and click.

Here’s what I will say, though:

In the twentieth century, it was said that if you wanted to do statistics, you had to be a bit of a mathematician, whether you wanted to or not. In the twenty-first century, if you want to do statistics, you have to be a bit of a programmer, whether you want to or not.

So, sure, no need to learn programming before you take that statistics course. No need to learn math, either. If you had to choose between them, I’d choose programming. Better to have both, though. Programming and math are both useful. Programming’s more useful, but math helps too.

64 thoughts on “Should computer programming be a prerequisite for learning statistics?”

  1. Maybe the converse question is the more urgent one?

    The flood of investment in machine learning and the easy availability of powerful toolkits such as Tensorflow and higher-level libraries like Keras make it hugely tempting for anyone with a bit of programming skill to build new products and services based around machine learning.

    We’ll see a lot of innovative work and we’ll also sadly see a lot of crap, possibly dangerous crap, as those with the programming skills and a profitable use case push difficult questions about the statistical basis for their work to one side.

    Statistics as a prerequisite for programming would not prevent all the bad applications that we are about to be exposed to, but it may help prevent some of them.

  2. You could even argue that practical applications of statistics are a great way to get a taste of programming (variables, logic, functions).

  3. Here’s a more interesting question…should (inferential) statistics be a prerequisite for data analysis? If I were to collect data using a group design (and I have), the first thing I would do is to plot the data in various ways and, quite literally, look at them – you know…so as to cause my verbal behavior to be affected. I would also plot the data in ways that do not obscure individual data – at least, for large N, the distributions should be shown (Is that little sliver what you really want to write home about?). For smaller data sets each subject’s data could be shown. The above description might characterize the last thing I do with the data as well as the first (not in most of what I published that was a group design – many journals require NHST). How well do you think science would do if this were how things were done? If the methods of an experiment are sound, then the data – fully explored graphically – should be published. This presupposes having the raw data at hand, to be shared freely on request.

  4. I think programming is essential to statistics and can’t be separated from it. Modern statistics is computational statistics. To interact with the computer to do stats, you need some way to communicate with the computer. Currently, this is through a programming language. It can be a relatively high level one like Mathematica, or a low level one like Python or R. But it will still be programming. Put boldly, if you aren’t programming you aren’t doing statistics.

    From my point of view, the pedagogical question is which programming language to teach.

    • Strange to hear Python or R described as “low level”. I’d put Mathematica, Python, and R all at the “high level”. None of them require explicit memory management for example. Low level would be something like C or C++.

      • I think R can be anything from low level (if you are writing in C) to high level (if you are creating Shiny Apps or doing something like mean(variable)).

        • If you are writing in C, then you are not writing in R. R is a high level language (dynamic typing, garbage collection, first class functions, etc.) with glue to a bunch of low level languages (C, C++, Fortran, etc.).

    • Programming is instrumental to science; the only question is what level you want to work at.

      Scripting languages (Mathematica/Julia/Python) – suited mainly to ad hoc and small-scale data crunching, which is about where statistics is being held … glorified graphing calculators, really.

      Real languages, say Java/Scala or Ruby – for synthesizing systems and pipelines and leveraging computing resources.

      LISP/Clojure/Prolog – weakened in their printed form, but their spirit is what true data science aspires to.

      At some point we’ll get a syntax-light but science-oriented language that can stretch from the everyday scale up to the PhD scale.

  5. “Programming and math are both useful. Programming’s more useful, but math helps too.”

    What kind and level of math do you have in mind? I agree that programming is important for stats, but without a fair amount of math, people won’t understand whatever they’re (supposed to be) programming.

    • +1

      And probably wouldn’t understand a lot of the statistical concepts (in particular, probably wouldn’t understand what the concepts are *not* saying — which is too often the case currently).

    • I think it really depends what kind of students you mean. College freshmen? No question that combining some basic programming and some basic statistics is very powerful. They often have a lot of trouble with the concept of a variable, for example, and working with programming can really help with that. Even the idea of X_i is clearer when you see it in code.
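
      For instance, a trivial R illustration of that point (the numbers and names here are made up):

        x <- c(2.3, 4.1, 5.0, 3.7)   # a sample: X_1, ..., X_4
        i <- 2
        x[i]                         # X_i, here X_2 = 4.1
        mean(x)                      # the "x bar" that appears in the formulas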

  6. I’m more a practitioner than an academic, but if I were trying to teach statistics nowadays, I would teach it from a simulation-based perspective first. Introduce it that way. Then you can say, yeah, we have all these cool analytic results we can prove with just math.

    • Yes, simulations can often help learners understand both the statistical and math concepts (so that they can talk about concepts rather than just about words that are names for something they don’t really have a concept of.)

      • Agreed. A simulation is also a more concrete representation of the statistical theory, which I think is especially helpful for novices, but also helpful for the more expert among us to explore the limitations of their models.

        • This brings to mind the idea that “physical” simulations (e.g., everyone has a sample of M and M’s and counts them) can be especially helpful — I suspect that many students need to do these before computer simulations can help them understand the concepts.

    • Exactly where we’re going with the Stan materials. What we’re debating is how much math is required to understand the simulations. Specifically, pretty much everything we do requires an integral. For example, predictions for new data are given by this integral

      $latex p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \ p(\theta \mid y) \ \mathrm{d}\theta.$

      but are simulated/computed by

      $latex p(\tilde{y} \mid y) \approx \frac{1}{M} \sum_{m=1}^M p(\tilde{y} \mid \theta^{(m)})$

      where $latex \theta^{(1)}, \ldots, \theta^{(M)}$ are drawn with marginal distribution $latex p(\theta^{(m)}) = p(\theta \mid y)$.

      The error in that approximation is then governed by the MCMC central limit theorem, which is like the regular central limit theorem, but adjusts for the autocorrelation in the $latex \theta^{(m)}$. And it only works if the MCMC sampler has certain ergodicity properties.
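
      As a minimal R sketch of that approximation, here is a toy normal-mean model with known sd (everything below – the fake data, the flat-prior posterior, the variable names – is made up for illustration and is not Stan output):

        set.seed(1)
        y <- rnorm(20, mean = 3, sd = 1)    # fake data
        M <- 4000
        # posterior for theta under a flat prior with known sd = 1: N(mean(y), 1/sqrt(n))
        theta <- rnorm(M, mean = mean(y), sd = 1 / sqrt(length(y)))
        y_tilde <- 2.5                      # a new observation to predict
        # Monte Carlo estimate of p(y_tilde | y): average p(y_tilde | theta^(m)) over the draws
        mean(dnorm(y_tilde, mean = theta, sd = 1))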

  7. I’m inclined to say math is way more important than programming. You can always rely on someone else to code for you when you try to solve stat/math/data analysis problems – a math/applied math/physics/stat professor can easily hire a programmer as an assistant. But if you know tons of programming but only a bit of math, you will be screwed if you want to do some nontrivial math calculations. You can ask for collaboration, but more often than not, mathematicians are hard to attract to non-math problems; and since the most difficult part of the problem is probably the math part, your own contribution may not be comparable to that of the mathematician you are collaborating with.

    • Dave:

      It has nothing to do with MCMC. Stan is a programming language and you can do more in Stan if you know how to program. If you don’t know how to program you can wing it by copying and modifying existing programs, but it will be difficult to make much progress. Similarly, if you don’t know any math, you can use existing formulas but it will be tough to figure out what to do when you get stuck.

    • It depends what you mean by “use Stan”. If you use a package like rstanarm, you’ll get the same kind of interface as you get for lm and glm in R (and lmer and glmer in the lme4 package in R), but it’ll use MCMC and Stan under the hood. You can use other people’s models in Stan without knowing more programming than how to run R, the same way you can use programs in any language without understanding the programming language.
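
      For instance, something along these lines (a sketch assuming rstanarm is installed; d is a hypothetical data frame with columns y and x):

        library(rstanarm)
        # same formula interface as glm(); MCMC via Stan runs under the hood
        fit <- stan_glm(y ~ x, data = d, family = gaussian())
        print(fit)
        posterior_interval(fit)   # summaries of the posterior draws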

  8. I think that before taking a statistics course you need to learn *some* math. And I don’t mean just 2 plus 2, or even 13 squared (https://www.ft.com/content/3174d5ce-30e7-11e7-9555-23ef563ecf9a).

    How can you learn statistics without a good understanding of probability and enough familiarity with concepts like functions, limits, derivatives, and integrals? But I guess it’s true that you don’t need to learn that *before*… you may learn everything at the same time.

    Programming on the other hand is nice to have, but not really required to learn or apply statistics in general. In many cases, you can use a computer to do the statistical analysis that you need for your problem without any programming. For example, using Excel (several add-ins for statistical analysis exist), JMP, WinBUGS, or Stan.

    • I think Andrew implicitly means something by “math” that is above and beyond a certain level. I’m not sure what that level is; it might be useful for Andrew to be explicit there. My personal take is that you NEED the basics of algebra (can you set up and solve linear equations? Do you understand what it means for a number to “solve” an equation? Can you do unit conversions, etc.?) and some core understanding of what functions are, along with basic concepts like continuous functions, in order to get much of anywhere with statistics. A lot of what statistics is comes down to finding functions that approximate what happens in your data.

      But if you have 1 year of high school algebra, and one semester of college calculus, you are way better off learning more programming next than more math, as your next step.

      Still, one should mention that the more math and programming you know, the better your tool box is.

      My general opinion is that at the point where you want to do some practical data analysis and get a good result, you will be able to do a decent job if you have as your background ALL of the following:

      1) A good understanding of Algebra and experience manipulating equations to isolate a variable or substitute one equation into another.

      2) A semester of calculus and a good understanding of the *concept* of continuity, derivative, and integral (regardless of whether you can sit down and carry out the process of differentiation or integration symbolically). Some experience with the idea of function approximation (say the concept of a Taylor series and of a Fourier Series).

      3) Computer programming at the level where you understand what variables are, how to write functions that compute basic things, lists, arrays, iteration, and basic data structures, plus some knowledge of plotting commands in your language of choice. Can you write a function to pick out the largest 3 elements of some sequence? Or translate a relatively simple mathematical formula into calculation instructions for a high level computer language so that the function can be plotted? (See the sketch after this list.)

      4) Some experience with SQL databases, and writing queries to select data to be analyzed, and with file input-output functions to read and write data sets.

      If you don’t have that level of background you should still consider yourself a beginner, and work towards learning more of each of those things. At the point where you feel that you have all of those, you can move into intermediate territory.
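
      To make the level in (3) concrete, here is roughly what is meant, in R (the function and the formula are chosen arbitrarily):

        # pick out the largest 3 elements of a numeric vector
        top3 <- function(x) sort(x, decreasing = TRUE)[1:3]
        top3(c(5, 1, 9, 4, 7))    # 9 7 5

        # translate a simple formula, f(x) = x * exp(-x^2), into code and plot it
        f <- function(x) x * exp(-x^2)
        curve(f, from = -3, to = 3)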

      • Completely agree with (1)–(3). In fact, Michael and I have been talking about exactly this in terms of trying to teach people stats at the same time as Bayesian modeling and Stan.

        As to (4), I know SQL, but I haven’t written a query in a dozen years. SQL and Excel can be important for “traditional” business data, but more and more modern data is coming from less structured and more heterogeneous sources (like through APIs), which makes combining data from multiple sources the bottleneck skill.

        • Understanding SQL can help you avoid a LOT of reinventing the wheel. For example, I have *all* of the ACS microdata in CSV files from the Census. How do I work with it? It’s a disaster to try to write code directly against it. The successful method was basically to hook MariaDB into the CSV files using CONNECT tables (which read CSV files as if they were database tables), run massive queries to collect all the data into a single usable table, and then sub-select it for analysis. You could do all that by writing your own code to read and write CSV files and connect things together and blablabla, but why? It’s what DBs are for.

          When it comes to data access and data cleaning, it’s a mistake to not KNOW SQL even if you don’t wind up using it.
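
          As a small illustration of the idea in R (using DBI with RSQLite rather than MariaDB CONNECT tables; the file name and columns are hypothetical):

            library(DBI)
            con <- dbConnect(RSQLite::SQLite(), ":memory:")
            # load a CSV into a table once, then query it like any other table
            dbWriteTable(con, "acs", read.csv("acs_sample.csv"))
            adults <- dbGetQuery(con, "SELECT * FROM acs WHERE age >= 18")
            dbDisconnect(con)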

    • Carlos:

      You ask, “How can you learn statistics without a good understanding of probability and enough familiarity with concepts like functions, limits, derivatives and integrals?”

      I don’t know, but thousands of people use statistical methods every day without using any of these ideas. Thousands of people are running regressions, computing p-values, fitting deep learning models, etc. It’s great to know about functions, limits, derivatives, and integrals—but I think it’s even better to be able to simulate fake data from your model and then graph these data.
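
      For example, a bare-bones fake-data simulation in R (all the numbers are made up):

        set.seed(123)
        n <- 100
        x <- runif(n, 0, 10)
        y <- 2 + 0.7 * x + rnorm(n, sd = 2)   # data simulated from an assumed model
        fit <- lm(y ~ x)
        plot(x, y)
        abline(fit)   # does the fit recover the assumed line?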

      • People also test water pH without any knowledge of chemistry. Doing without understanding can be problematic. One of the running themes in your blog is people misunderstanding what they do (p-values, power, error rates, replications, decisions, etc). The proper remedy to this issue is learning statistics, not programming.

      • > be able to simulate fake data from your model and then graph these data
        I do think this is the great equalizer (it lessens the disadvantage of having less math and only higher-level programming skill), but even with this, some very abstract notions – distribution, equal in distribution, approximate, confounded, counterfactual, the reference class of could-have-happeneds, etc. – have to be grasped somehow. That is: what to simulate, how, and then what to make of the simulations.

        Maybe math is one of the better routes to learning how to get and work with abstract notions (if some sense is also developed of how to make them concrete).

        I do suspect there is much more ideal math that could be learned for the current computationally infused field of statistics.

  9. It seems to me that the statement “you shouldn’t require computer programming as a prerequisite for a statistics course” needs to define what “a statistics course” means. There are a lot of different types of statistics courses, with different audiences and different learning goals. It’s impossible, I would think, to give a generic answer, which is why I think there are a lot of vague, though reasonable, comments above about math vs. programming.

    Not that I’m a statistician, but similar discussions come up in Physics all the time. They also don’t have any resolution, though at least they’re usually phrased in terms of types of students rather than some generic “course” — e.g. should we expect 2nd year graduate students to be able to program well, and how do we make this happen?

  10. I wouldn’t have thought that the idea that programming (in any language) and math (of any nature) are helpful to the budding statistician would be a highly contested supposition.

    This suggests that the conversation should be framed a little differently. To me, the discussion should be about whether there are courses taking up space in the existing curriculum that would serve students better if replaced with more programming. There are of course other issues to consider, such as the timing of the courses and the scope (e.g., which languages to teach, to what depth, how much programming theory), etc.

  11. My recent experience taking biostatistics classes was somewhere in the middle. It was expected that you would pick up enough Stata and SAS to do basic data cleaning and running existing modelling commands, and they did teach “programming” at least to the extent of covering saving variables and running for-loops. As someone with existing knowledge of R and Python it was all fairly basic stuff.

    I wouldn’t say that they really delved into the full potential of programming though, in the sense of “if you really need to achieve task x there’s a way to program it yourself”. I think part of the problem in this kind of traditional statistics is that languages like Stata don’t necessarily lend themselves easily to that kind of programming, and there doesn’t seem to be a big overlap between dedicated Stata users and people who consider themselves “programmers”.

  12. Both math and programming are important for the doing (calculating) of statistics. Both are also important for describing what is being done (i.e., how to interpret statistical concepts such as mean, variance, etc.). But how important is either to understanding how and why a particular inference, out of a set of potential inferences, is appropriate? Perhaps a bit of philosophy and logic would also be useful.

      • Peter:

        No, I think that most of the studies behind the recent replication problems did not have working code. Bem, Fiske, Cuddy, Wansink, etc.: they didn’t have working, end-to-end code. That was a big part of the problem: not only was their work unreplicable, they themselves also didn’t have full control of what was going on in their studies.

    • Understanding how statistical algorithms operate is important to understanding what they can and cannot tell us. Knowing how the data are processed iteratively helps us understand how a given algorithm fits a model to the data and how it treats and resolves uncertainty about that fit. While a conceptual understanding of this is important, most people will not fully appreciate it unless they have some concrete understanding of the code required to execute the process. This is one area where I am not sure any of the proposed reforms to statistics training are focused, but where there might be some benefit. There are substantial areas of misunderstanding that require conceptual bridges, and those might benefit from a clear explication of the technical bridges. Some level of concreteness in how the algorithms are coded and how they operate is helpful. I know that solving a linear model by hand was helpful when I took a graduate-level statistics course, though I think coding it would be even more educational.
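
      For example, least squares coded directly from the normal equations rather than through lm() (a toy sketch with simulated data):

        set.seed(1)
        x <- rnorm(50)
        y <- 1 + 2 * x + rnorm(50)
        X <- cbind(1, x)                            # design matrix with an intercept
        beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y, the "by hand" solution
        beta_hat
        coef(lm(y ~ x))                             # should match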

  13. Let me state up front that I believe computer programming should be a basic requirement for anyone intending to major in a quantitative STEM program (whether that be math, statistics, computer science, any type of engineering, physics, etc.). I would go further: computer programming and computer literacy should be basic skills acquired before graduation from high school.

    Setting all of that aside, the question is: should programming be a prerequisite for learning statistics? That really depends on what kind of statistics course you are teaching. If you are thinking of a research methods course for, say, students majoring in biology, chemistry, or the social sciences, probably not — it’s more important for those students to develop basic statistical reasoning and understanding, which doesn’t have much to do with programming.

    If you are thinking of those majoring in statistics at the undergraduate level (or some cognate program), programming or some CS course should be a co-requisite with whatever statistics/math courses are available.

      • Martha,

        I agree with you that some graduate students in biology (e.g., molecular biology, certain areas of ecology, etc.) will also need some computer programming for their research. However, I would contend that that is orthogonal to their need to understand statistics.

        • It may be orthogonal in some areas of biology, but not in others — e.g., in phylogenetics, one might want to run simulations that incorporate statistical analysis.

  14. I struggle with this a lot teaching undergraduate statistics. I teach a class of ~250 2nd year psychology students a half-credit course in statistics. The students may or may not have a prior course in math, programming or statistics. I’ve come to a few conclusions:

    1. In the context of a single class (i.e., a half credit Fall or Winter class), there is not enough time to teach both statistics and programming to undergraduates without sacrificing too much material.

    2. You can certainly teach statistics without programming. RShiny apps have been a godsend for visualizing tough concepts and actually running some analyses, while removing the need to teach programming. (A minimal example follows this list.)

    3. However, if students ever want to DO anything with the knowledge of statistics (i.e., actually analyze real data), they need to know some programming & database management.
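
    As a minimal sketch of the kind of Shiny app meant in point 2 (a made-up example, not one of the apps I actually use):

      library(shiny)
      ui <- fluidPage(
        sliderInput("n", "Sample size", min = 5, max = 500, value = 30),
        plotOutput("hist")
      )
      server <- function(input, output) {
        output$hist <- renderPlot(hist(rnorm(input$n), main = "Draws from N(0, 1)"))
      }
      shinyApp(ui, server)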

    In an ideal world, programming would be a co-requisite with statistics; that is, they’d be taught side by side in a complementary way. The bigger problem tends to be that each of the three most important components (math, statistics, programming) is taught by different people from different departments, with little integration. I think it’s hard for students to see the connection between the three — especially between programming and statistics.

    I agree with Andrew though, that if I could only learn one, I’d pick programming. I’d have spent a lot more time on programming during my education if I had known how valuable it would be to me now. I never really saw the connections until after my undergraduate years.

  15. “I struggle with this a lot teaching undergraduate statistics. I teach a class of ~250 2nd year psychology students a half-credit course in statistics. The students may or may not have a prior course in math, programming or statistics.”

    That’s a really tough job!

    “RShiny apps have been a godsend for visualizing tough concepts”

    Yes!

  16. From the perspective of programming languages I would agree with you about low vs. high level. I was using “level” from the user perspective, meaning something like “how much work does the user have to do to get the computer to do what they want?” Also, how much does the application assist the user in the task, acting like an intelligent assistant? Maybe I should have called it “friction”.

    In this regard R is quite low level. As a stats-focused programming language it’s fine, but R development environments don’t do much to actually help you with your analysis.

  17. Do you think that publishing studies in something like a Jupyter Notebook would help with this problem? Of course, they couldn’t be published in a print journal but then again maybe print journals are now obsolete.

  18. More math is better, period. There are 500 times I would have done better science and better stats if I had known more math, and everyone I’ve brought this up to has agreed.
