The easygoing relationship between computer scientists and null hypothesis significance testing

This is Jessica. As you might expect, as a professor in a computer science department I spend a lot of time around computer scientists. As someone who is probably more outward looking than the average faculty member, there are things I like about CS, like the emphasis on getting the abstractions right and on creating new things rather than just studying the old, but also some dimensions where I know the things I like to think about or find interesting are “out of distribution.” One thing that continues to surprise me is the unquestioning way in which many faculty and students assume significance testing is the standard for scientific data analysis. 

Some examples, by area: 

ML, systems: We rely too heavily on comparing point estimates to assess performance across different [models/methods/systems]. Let’s fix this with significance testing! 

Privacy: Let’s noise up this data, but better make sure they can still do t-tests!

Big data/databases: Let’s do zillions of t-tests simultaneously! 

Theory: Let’s design a mechanism to allow for optimal data-driven science, by which we mean NHST! 

Visualization: Let’s turn graphs into devices for NHST!

HCI: Let’s make a GUI so people can do NHST without any prior exposure to statistics! 

On some level, this is not that surprising. CS majors often take a probability class, but when it comes to stats for data analysis, many don’t go beyond a basic intro stats course. And early non-major stats courses often devote a lot of time to statistical testing. Estimation, exploratory analysis, and anything else that might precede NHST are treated as mostly instrumental. So classical stats becomes synonymous with NHST for many. Of course in CS, prediction gets a lot of attention, but it’s sort of its own beast, treated like an engineering tool that powers everything everywhere.

I expect the average computer scientist sees little reason to care, for example, about what happened when a bunch of psychologists doing small-N studies overrelied on NHST. There’s a fallback attitude that issues caused by humans will never be very relevant objects of study because the primary artifacts are code, and that that kind of squishy social science stuff doesn’t belong in CS (though as the joke beloved by people who deal with the human-computer interface goes, “The three hardest problems in computer science are people and off-by-one errors.”)

And so, I seem to find myself somewhat regularly in a position where I am more or less performing some angst over the problems with NHST in an effort to get students or faculty colleagues to reconsider their unquestioning assumption that significance testing is how scientists analyze data. I can think of many situations where I’ve tried to explain why NHST, as practiced and philosophically, is not so rational. I bring up Andrew-isms like it doesn’t make sense to treat effects as present or absent because there’s always some effect, the question is how big, or what to do about the fact that the difference between significant and not significant is not significant, etc. Sometimes I can tell I capture their attention for a moment, but rarely do I feel like I’ve really convinced someone there’s a problem that might affect their research. For instance, I get responses that start with phrases like ‘If this is true …’  and I’m pretty sure it isn’t just me getting blown off for being female, because I’ve seen similar reactions when like-minded colleagues point out the issues. 

Repeatedly encountering all this resistance can almost make one feel a little bit guilty, like here your colleagues are obviously having a fulfilling relationship with their chosen interpretation of statistics and yet you’re insisting for some reason on dredging up weird anomalies with seemingly weak links to what they do, like some sort of witch determined to sow doubts in the healthy partnership between computer science and stats. But of course I don’t actually feel guilty because I think they need to hear it, even if I derail a few conversations.  

I guess one question is how a computer scientist’s orientation to NHST is qualitatively different from that of someone in another field that uses stats. For example, how does a psychology researcher’s perspective on NHST differ from that of a computer science researcher? One thing I would expect computer scientists to be worse at than psychologists is anticipating misuse, again because understanding human behavior has never been perceived as critical to doing great CS research. I think there’s a genuine belief that NHST is the answer, based on believing that if it’s used properly (which can’t be that hard, right? just don’t fake the data and make sure there’s enough of it), it provides the most direct answer to the question people care about: is this thing real? On the surface, it can seem like a concise solution to a large class of problems, one that doesn’t deserve to be conflated with the flaws of some humans who used it for very different-seeming purposes.

I also think there’s a genuine confusion about what the alternative would be if one doesn’t use NHST. Sometimes researchers make it explicit that they can’t imagine alternatives (e.g., here), in which case at least the value that someone like me can provide is clearer (giving them examples of alternative ways of expressing findings from an analysis). But, for that to work, I first have to convince them there’s a problem. Maybe the resistance is also partly a function of discrete thinking being built into CS. Advocating against NHST to some computer scientists can certainly feel like trying to convince them that we should replace binary.

On a more positive note, when I realized that much of the stat/science reform discussion hasn’t reached many computer scientists, I started including some background on it in a CS research class I teach to first-year PhDs. I’ve taught it a few times and they seem interested when I present some of the core issues and draw connections to CS research (like we do here). I’m also teaching a graduate seminar course next quarter on explanation and reproducibility in data-driven science, where we’ll discuss papers from stats, social science, and ML related to what it means for an explanation of model behavior to be valid and reproducible. Maybe all this will help me figure out how to better target my anti-NHST spiel to CS assumptions.

30 thoughts on “The easygoing relationship between computer scientists and null hypothesis significance testing”

  1. I don’t spend much time around computer scientists, so I’ll defer to your opinions. However, your focus on data about humans seems misplaced to me – my perception is that computer scientists are agnostic regarding whether the data is about human behaviors, physical quantities (e.g. snowfall), or anything else. The most salient characteristic I would have thought – which you didn’t mention – is a belief that quantity of data usurps statistics. I’ve seen a prevailing belief (perhaps more with data scientists than computer scientists, if that is a meaningful distinction) that quantity of data substitutes for issues regarding inference. So, if inferential analysis is to be done, NHST is as good as anything, but is largely unnecessary. And, if your sample size is big enough, then everything is likely to show statistical significance, so why worry about whether NHST makes any sense.
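    To make the last point concrete, here is a minimal simulation (illustrative numbers, assuming numpy and scipy are available): a mean difference of 0.01 standard deviations, practically nothing, comes out decisively “significant” at n = 1,000,000.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 1_000_000  # "big data" sample size
    a = rng.normal(loc=0.00, scale=1.0, size=n)  # group A
    b = rng.normal(loc=0.01, scale=1.0, size=n)  # group B: trivially larger mean

    t, p = stats.ttest_ind(a, b)
    print(f"t = {t:.2f}, p = {p:.2g}")  # tiny p despite a negligible effect
    ```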

      • In that case, I think the idea that quantity of data can substitute for inference is potentially more pertinent. If we can use data scientists and computer scientists interchangeably here (and that is a big if – so it may not be as relevant as I think), then most data sets computer scientists see are large, and if they believe the quantity of data substitutes for the need to do inferential work, then NHST would seem to be a distraction. Do computer scientists really accept NHST as meaningful, or do they not question it because it is mostly irrelevant to them?

        • I do think there’s a confusion between what having big data absolves you of and what it means to, say, support a research claim about some big data system. Like an ML researcher assuming that all we need are point estimates to compare models or optimization tweaks, because the datasets the models train on are so big.

  2. Maybe the audience for this blog understands your point, but I’d appreciate it if you could expand a bit on the NHST risks. For example, PAC learning theory is essentially a form of NHST, and it gives us strong guarantees (though sometimes rather loose ones). Stopping rules for Monte Carlo simulations are sequential tests, and they also give us good guidelines for terminating the computation. I’m less familiar with differential privacy, which is also based on NHST-style reasoning; is there a weakness there? I strongly agree that when computer scientists do social science (e.g., in human-computer interaction experiments) or biostatistics (e.g., in AI applications in radiology), they encounter all of the usual problems of small sample sizes, insufficient modeling of variability, investigator degrees of freedom, etc., and they trust NHST far too much.

    Of course another strong tendency in computer science is to adopt an overly strong Bayesian formulation that treats Bayesian computation as normative rather than as a modeling enterprise in the Andrew Gelman tradition. This is another uphill battle for statisticians when talking to computer scientists!

    • Thomas:

      I’ve done a lot of work on stopping rules for Monte Carlo simulations and I don’t use null hypothesis significance tests to do it. I don’t think it’s correct to think that any decision rule has to represent NHST-style reasoning. In many applications we need to make decisions. There are lots of decision rules that don’t involve null hypothesis significance testing.
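      As one illustration (a minimal sketch, with made-up names and tolerances), here is a precision-based stopping rule: sample until the Monte Carlo standard error of the running estimate drops below a target, with no null hypothesis in sight.

      ```python
      import numpy as np

      def run_until_precise(draw, tol=0.005, batch=1_000, max_draws=1_000_000):
          """Stop when the standard error of the running mean falls below tol.
          A decision rule for terminating a simulation, with no null hypothesis."""
          rng = np.random.default_rng(1)
          samples = np.array([])
          while samples.size < max_draws:
              samples = np.concatenate([samples, draw(rng, batch)])
              se = samples.std(ddof=1) / np.sqrt(samples.size)
              if se < tol:
                  break
          return samples.mean(), se, samples.size

      # Example: estimate E[X^2] for X ~ Normal(0, 1) (true value is 1)
      est, se, n = run_until_precise(lambda rng, k: rng.normal(size=k) ** 2)
      print(f"estimate = {est:.3f} +/- {se:.3f} after {n} draws")
      ```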

      Also, yes, I too am annoyed by the attitude that Bayesian inference = logical reasoning. So much depends on what goes into the model!

    • Good question, and I don’t know that I have very satisfying answers. But one thing I’m commenting on is how, in cases where CS researchers see themselves building tools to serve scientists, they don’t bother to think about alternatives to NHST that might better support the goals of those scientists; they just design as though answering a yes/no question about whether some effect exists is all that matters. For example, maybe a scientific user of differentially private data shouldn’t be trying to say only whether there’s a difference in two rates; they should be trying to understand how much rates can vary across subpopulations and how this affects downstream decisions. If we design the tools we develop around NHST use cases, we can end up overlooking how this style of analysis can be bad for many types of questions.

      But then there’s the basis of CS methods like you mention. I get your point that there are many situations where you want a decision rule, and so NHST seems like the obvious choice. I’ve been won over by methods like PAC as the best tool for some problems, over alternatives that are less agnostic to the data generating process. But I’ve also seen computer scientists try to apply them in cases where it would make more sense (e.g., given the domain-specific goal of the application) to use methods that are less agnostic to how data are generated. It can suggest a sort of superficial understanding of what the goal of quantifying uncertainty is. And I get the impression there isn’t much knowledge in CS of what it looks like to carry uncertainty through an analysis.

      I agree on the tendency to assume that Bayesian computation is normative in CS. I guess one of the things that’s curious to me is how Bayesian logic is well understood but things like Bayesian statistical modeling seem mysterious and questionable to some computer scientists.

  3. Jessica:

    I don’t think the reliance on hypothesis testing is just a practical issue. It’s also a conceptual barrier for some computer scientists. I thought about all this when writing my review article, Causality and statistical learning, where I learned that a dominant philosophy of causal inference among computer scientists is essentially deterministic. I guess this makes sense: digital computers rely on deterministic manipulation of 1’s and 0’s, so it is natural to think of the natural and social worlds as big networks of switches. In that view, the master problem of causal inference is to discover the causal structure, which is often formulated as discovery of patterns of conditional independence of random variables. But in the social world, nothing is statistically independent, or conditionally independent, of anything else (with the exception of random sampling in the design of an experiment or survey), hence this purported discovery of conditional independence ends up being lack-of-rejection of hypothesis tests of conditional independence. This can be done classically or using so-called Bayesian hypothesis testing; either way, it’s null hypothesis significance testing and it doesn’t make sense to me, at least not in social science.
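    A small simulation makes this concrete (illustrative numbers, assuming numpy and scipy): give X and Y a trivial dependence beyond their common cause Z, and at a realistic observational sample size a conditional independence test rejects decisively, so a claimed “discovery” of conditional independence is really just a failure to reject.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 200_000  # a large observational data set

    z = rng.normal(size=n)                 # common cause
    x = z + rng.normal(size=n)
    y = z + 0.02 * x + rng.normal(size=n)  # tiny dependence on x beyond z

    # Test X independent of Y given Z via partial correlation:
    # regress z out of both variables, then correlate the residuals
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    r, p = stats.pearsonr(rx, ry)
    print(f"partial r = {r:.4f}, p = {p:.2g}")  # trivial dependence, decisive rejection
    ```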

    Anyway, my point is that computer scientists’ embrace of null hypothesis significance testing is deeply baked into their conception of causality. I don’t think the problem is them not being “good at thinking about human behavior”; I think the problem is that they are working within a framework where there is some true sparse causal structure to be discovered, and I don’t think they think very hard about what their procedures would mean in a world in which that structure does not exist.

    On the other hand, when computer scientists aren’t thinking about causality—for example, when they’re writing programs to generate text or drive cars or play ping-pong or classify images or whatever—then they don’t seem at all tied to deterministic thinking or hypothesis testing.

    • Yes, I agree that we computer scientists tend to think deterministically.

      But I also think that some of the unanticipated side effects of digital advertising and recommender systems could have been anticipated if computer scientists had better training in social science (or stronger collaborations with social scientists). I think we computer scientists are not very good at thinking about human behavior (except for the human-computer interaction community, which does have closer ties to social science). In particular, the whole framing of artificial intelligence and computing as a form of automation aimed at replacing people rather than as the creation of “super tools” aimed at empowering people leads to a focus on the technology rather than on the resulting sociotechnical system.

    • Interesting, I will read your review article. This would help explain why it can be hard to convince computer scientists that potential model misspecification is a big issue in social science, like how you can’t just assume that some low-dimensional regression model someone threw some data into is giving valid estimates, given the complexity of human behavior. I’ve definitely wasted some breath on that.

    • Is it not a bit ungenerous (of you and Jessica) to suggest that computer scientists are overly deterministic in their thinking, tend to see the world in terms of on/off switches, and that squishy uncertainty isn’t something they deal with?

      Surely you could equally make the opposite argument: they also need to worry about things like floating point precision and rigorous testing (of computer code, that is!) to make sure all is working as expected, with debugging being a huge, time-consuming part of CS in a vaguely analogous way to data cleaning and munging in the “data sciences”.

      Anyway, I would not presuppose that computer scientists are particularly worse than biologists or psychologists (or physicians?) with regard to NHST. Not saying the theory is wrong, just that it probably needs some empirical support before we philosophise or hand-wring too much, no? :)

      • Ricardo:

        My goal is not to be “generous” but to be accurate. In any case, I have a huge admiration for computer scientists—including admiration for all of my computer scientist colleagues. As I’ve said many times, computer scientists can do all sorts of things that statisticians can’t, and computer science has revolutionized statistics. It’s also true that success in one area can lead to confusion in others. Discrete thinking has been very valuable in computer science so it would be no surprise if on occasion computer scientists would take discrete thinking too far.

        I completely agree with your point that computer scientists “need to worry about things like floating point precision,” rigorous testing, and debugging. I don’t see this as at all in contradiction with discrete thinking. Indeed, non-computer scientists such as myself often think of floating-point numbers as continuous things, and sometimes we need to be reminded by computer scientists that, no, floating point math on the computer is, at bottom, discrete.

        Finally, I’m not at all saying that computer scientists are worse than biologists, psychologists, or physicians with regard to hypothesis testing, and I don’t think Jessica is saying that either! You’re attributing to us a “theory” that we are not advancing!

        • Sure, sure, “advancing a theory” is a bit strong. But there was a bit of stereotyping and pop psychologising going on!

          My point (maybe poorly made) was that computer science is just as “squishy” as the social sciences in many respects. Them seeing the world as combinations of 0s and 1s is a little cartoonish.

          If discrete thinking occurs in inference, but then not when they are sitting down to write code to solve a problem, then maybe there are more complex reasons underlying this.

  4. I wonder if NHST is getting some unfair blame for a lot of the poor statistical practices in computer science research. As some CS folks have tried venturing into graphical inference or Bayesian stats, I’ve seen many similar errors (e.g., a dozen unadjusted pairwise comparisons). I think the bigger issue is that research methods (experimental methods and statistics) and philosophy of science (what is the difference between a hypothesis and a guess?) are rarely part of the CS curriculum, even among those who primarily do research that relies on experiments as evidence. So CS researchers tend to treat statistical methods like any library: tweak some example code and use it without understanding what it’s doing or why (or whether that statistical method is even appropriate for their question).

  5. “Sometimes I can tell I capture their attention for a moment, but rarely do I feel like I’ve really convinced someone there’s a problem that might affect their research. For instance, I get responses that start with phrases like ‘If this is true …’ and I’m pretty sure it isn’t just me getting blown off for being female, because I’ve seen similar reactions when like-minded colleagues point out the issues.”

    I very much sympathize with this. As an ecologist, I have the exact same responses from my ecology colleagues. On a somewhat brighter note, the graduate students and postdocs in my field are all enthusiastic about learning alternatives to NHST (I teach a graduate Bayes course without NHST). But the challenge is that they then have to convince their advisors to use Bayes. It often doesn’t go well as they are facing a wall built on decades of ritualistic application of NHST.

  6. One thing worth mentioning is that statistical tests are the appropriate tool to determine whether a sampling algorithm is doing the “right thing”. That is, a stream of “random” numbers is random precisely if it passes tests of randomness. The size of a Monte Carlo error follows a certain distribution precisely when the assumptions of sufficiently random behavior are met, which is basically when tests of random behavior don’t fail. The one thing you do want to use “hypothesis testing” for is to show your algorithm behaves “as if random”.
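    For example (a minimal sketch, assuming numpy and scipy), a chi-squared test of bin counts against the uniform distribution is one standard check that a generator behaves “as if random”:

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    draws = rng.random(100_000)  # stream to be checked

    # Bin the stream and compare the counts to the uniform expectation
    counts, _ = np.histogram(draws, bins=20, range=(0.0, 1.0))
    chi2, p = stats.chisquare(counts)  # H0: all bins equally likely
    print(f"chi2 = {chi2:.1f}, p = {p:.3f}")  # a small p would flag a bad generator
    ```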

    • Well, on some level a lot of my research is about how to visualize/communicate uncertainty without having to hide behind p-values. But, I have some thoughts/current work related to some of the examples on my snarky list (like ML performance comparisons and analysis of differentially private data) that I hope to have ready to share soon.

      In the meantime, I think this reframing of p-values as surprisal and confidence intervals as “compatibility intervals” gives one nice example of what it looks like to not accept conventional use: https://link.springer.com/article/10.1186/s12874-020-01105-9
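      For a flavor of the surprisal idea from that paper: the s-value is just s = -log2(p), the number of bits of information against the test model, so p = 0.05 carries only about 4.3 bits, roughly as surprising as four heads in a row.

      ```python
      import numpy as np

      # Surprisal (s-value) of a p-value: s = -log2(p), in bits against the model
      for p in [0.25, 0.05, 0.005]:
          print(f"p = {p:<5} -> s = {-np.log2(p):.1f} bits")
      ```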

  7. Please say some more about the statistics courses CS majors are likely to take at your school (and others if you can say).
    In giving talks on the statistics of benchmarking at a few schools, I’ve seen very bimodal reactions, where some (grad) students thought the stats part of the discussion was basic (it was) and others thought it was very tough.
    During the mid-2000s, it seemed:
    1) Some CS (& EE) majors took a service course run by the statistics dept that seemed more aimed at social sciences/medicine, so they found no examples they thought relevant.
    2) If statistics was an option, sometimes they looked at a syllabus like 1) and skipped it.
    3) In some cases, engineering schools ran their own statistics courses to cover topics and examples they thought more relevant.

    Maybe put another way, can you recommend a good syllabus for 1st & 2nd statistics courses that would be good for CS majors?

    • At Northwestern they take a couple calc courses, probability and statistics, and discrete math. If they are majors but not in the engineering school or if they transfer in late, they might get exposure to an intro applied stats course that covers simple regression, anova, statistical testing.

      If they are in a data science track, they get more exposure to EDA and statistical programming, or other undergrad stats courses like here: https://statistics.northwestern.edu/courses/full_course_list.html

      I would prefer it if more of them took design of experiments, a dedicated regression course, and decision theory.

  8. “We rely too heavily on comparing point estimates to assess performance across different [models/methods/systems]. Let’s fix this with significance testing!” This one I’d probably agree with, at least if we add confidence intervals. I’ve seen many papers in which “superiority statements” of one method compared to another are made based on point estimators, and I want to know whether results are compatible with just random variation. Tests are fine for this, I’d say (if done correctly, and if considering relevant effect sizes is not forgotten), and I hope that no one advises against tests for this (or other cases where they might be fine) on the basis that “NHST is always evil”.
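    To be concrete (a hedged sketch with made-up per-example scores, assuming numpy), a paired bootstrap over test examples gives exactly this kind of interval for an accuracy difference:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical per-example correctness (0/1) for two models on one test set
    acc_a = rng.binomial(1, 0.83, size=2_000)
    acc_b = rng.binomial(1, 0.81, size=2_000)

    # Paired bootstrap: resample test examples, recompute the accuracy difference
    n = len(acc_a)
    diffs = []
    for _ in range(10_000):
        idx = rng.integers(0, n, size=n)  # resample test items with replacement
        diffs.append(acc_a[idx].mean() - acc_b[idx].mean())

    lo, hi = np.percentile(diffs, [2.5, 97.5])
    print(f"95% bootstrap CI for the accuracy difference: [{lo:.3f}, {hi:.3f}]")
    ```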

    • Hi Christian,

      I can see how NHST could be seen as an improvement over point estimates with no expression of uncertainty and no statistical testing in ML performance comparisons. However, it’s hard for me in good conscience to get behind calls that NHST should be the convention for reporting performance comparisons in ML, because most of my experience watching use of NHST in other fields (and the more social science-y parts of CS like HCI) suggests that when NHST is seen as the standard, people fall into writing results sections in ways that suppress uncertainty and don’t encourage reflection on relevant questions like how big a performance difference is, what might be causing variance, whether it’s likely to be a meaningful difference in relevant downstream applications, etc. Getting publishable results becomes synonymous with getting a p-value below a threshold, but we’ve seen plenty of evidence that people don’t have a good understanding of how p-values work. So to me it feels like it would be a little irresponsible to tell a group of researchers with not a lot of exposure to alternatives to just do NHST, because it gives them even less motivation to explore other ways to express uncertainty in their results or get better at reasoning about sources of variance. That said, I do recognize that the alternatives are not necessarily simple or error-proof either.

        • I agree about not capturing all sources of uncertainty. I’ve seen a few papers present sensitivity analyses with performance results, showing how performance varies as you change hyperparameters.

          I’ve heard different things from different people about what’s status quo when presenting comparisons, but I’ll trust your response. I’m curious if you know of any resources that students are given suggesting how to report performance comparisons, or if it’s all learned through practice with more senior collaborators (not to mention whether you think CIs are working).

      • Hi Jessica,
        I don’t think that mindless application of a “standard”, ignoring anything else that can be done, is a problem exclusive to NHST. I’m all for informative data visualisation and confidence intervals (Bayesian not so much in this setup), but still, in many cases I’ve seen, a test of an “equal performance” null hypothesis would’ve been informative, and I never get why people discuss p-values vs. confidence intervals when they could have both. Also, my experience is that many papers indeed stop at point estimators, or maybe point estimators plus standard errors that ignore possible pairwise dependencies and other subtleties of the experimental design.

  9. Despite the bad reputation of NHST, I think many modern stats/ML methods still have their roots in hypothesis testing, even in this big data setting. For example, one key component of a GAN is a classifier, which is effectively a hypothesis test of whether two distributions are the same.
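    A minimal sketch of that classifier two-sample test idea (synthetic data; assumes numpy, scipy, and scikit-learn): train a classifier to distinguish the two samples, then check its held-out accuracy against the chance level of 0.5.

    ```python
    import numpy as np
    from scipy import stats
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Samples from the two distributions being compared
    x_p = rng.normal(0.0, 1.0, size=(1_000, 5))
    x_q = rng.normal(0.2, 1.0, size=(1_000, 5))

    X = np.vstack([x_p, x_q])
    y = np.concatenate([np.zeros(1_000), np.ones(1_000)])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    # If the distributions were identical, held-out accuracy would hover near 0.5;
    # compare the number of correct predictions to chance with a binomial test
    acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
    result = stats.binomtest(int(round(acc * len(y_te))), len(y_te), p=0.5, alternative="greater")
    print(f"held-out accuracy = {acc:.3f}, p = {result.pvalue:.2g}")
    ```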
