The 5-sigma rule in physics

Eliot Johnson writes:

You’ve devoted quite a few blog posts to challenging orthodox views regarding statistical significance. If there’s been discussion of this as it relates to the 5-sigma rule in physics, then I’ve missed that thread. If not, why not open up a critical discussion about it?

Here’s a link to one blog post about 5-sigma.

My reply: Physics is an interesting realm to consider for hypothesis testing, because it’s one area where the null hypothesis really might be true, or at least true to several decimal places. On the other hand, with experimental data there will always be measurement error, and your measurement error model will be imperfect.

It’s hard for me to imagine a world in which it makes sense to identify 5 sigma as a “discovery”—but maybe that just indicates the poverty of my imagination!

In all seriousness, I guess I’d have to look more carefully at some particular example. Maybe some physicist could help on this one. My intuition would be that in any problem for which we might want to use such a threshold, we’d be better off fitting a hierarchical model.

77 thoughts on “The 5-sigma rule in physics”

  1. Physics is lately becoming more like the lesser fields of research in that physicists test a null hypothesis different from what is predicted by theory.

    E.g., GR predicted twice the deflection of starlight during the eclipse that Newtonian gravity did. To check this, they compared the observed magnitude to the predicted magnitude.

    Recent discoveries like the Higgs boson and gravitational waves instead compare observed data to a model of background noise. Deviations from that model are then taken as confirmation of the theory (QM or GR).

    This is a big mistake. They need to compare precise predictions to observations collected after those predictions were made.

    • > This is a big mistake. They need to compare precise predictions to observations collected after those predictions were made.

      This is currently not possible in many high-profile studies because the “new physics” theories beyond the standard model do not make precise predictions. In particular, supersymmetric theories of particle physics provide no lower bound on the scattering cross section of the particles they’re looking for. So a perturbation distinguishable from noise by null-hypothesis logic can “discover” new particles, while the absence of any perturbation only indicates that the detector is insufficiently sensitive.

      Good for NSF grants though

      • See Meehl’s concept of a spielraum:

        Paul E. Meehl (1990). Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It. Psychological Inquiry, 1(2), 108–141. http://meehl.umn.edu/sites/meehl.dl.umn.edu/files/147appraisingamending.pdf

        Basically, you compare the range of observations consistent with the theory to the total possible range (the spielraum). Vague predictions take up large portions of the spielraum, and so observing results consistent with the theory does not offer much support.
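
        As a rough numeric sketch of the idea (only an illustration, not Meehl’s actual corroboration index; all numbers are made up):

            # Meehl's "spielraum": how much of the total possible outcome range
            # does the theory's prediction occupy? A confirming observation is
            # more impressive when the theory tolerated only a narrow slice.

            def spielraum_fraction(predicted_range, possible_range):
                pred_lo, pred_hi = predicted_range
                tot_lo, tot_hi = possible_range
                return (pred_hi - pred_lo) / (tot_hi - tot_lo)

            # Hypothetical effect that could land anywhere in [0, 100]
            vague = spielraum_fraction((0, 80), (0, 100))      # 0.80: a weak test
            precise = spielraum_fraction((49, 51), (0, 100))   # 0.02: a risky test
            print(vague, precise)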

        • That sounds like a sensible idea, but how is that different from saying that predictive distributions must be normalized?

          I agree that ultimately everything Meehl talked about is included in Bayes’ rule. But understand the pressure he was under at the time not to appeal to either the frequentist or the Bayesian camp. He was just trying to get people to think logically again.

          And you can use Bayes factors to do the same thing as NHST, so the point he makes is orthogonal to that debate.

      • >>because the “new physics” beyond the standard model do not make precise predictions.

        That seems… somewhat problematic.

        But a lot of modern theoretical physics strikes me as more philosophy/pure math than actual science, given limited connection to anything testable (or zero connection – I think multiverse theories are untestable even in principle).

        But I’m a biology major, I’m sure physicists would have a rebuttal to that.

        • Sabine Hossenfelder, physicist, agrees with your assessment; check out her blog here (http://backreaction.blogspot.com/).

          There are many physicists who would disagree with her pessimistic assessment of modern theoretical physics, but I think she makes the stronger argument by far.

          In summary, much of current theoretical physics is focused on arguments that are essentially aesthetic in nature, because the theories cannot be distinguished in terms of predictions that are currently testable. The fact that the theories are in principle testable means that theorists have a “get out of philosophy free” card. In the meantime, the proliferation of untestable theories is actually useful for convincing funding agencies to spend billions and billions of dollars on new particle accelerators which have the possibility of reaching energy levels at which the theories become testable. It’s a pretty sweet gig if you can get it!

        • > The fact that the theories are in principle testable means that theorists have a “get out of philosophy free” card

          I can see this up until we start talking multiverses. Anything beyond our universe would seem to me to be unobservable in principle, even with instruments a million years advanced beyond what our technology can provide.

        • There is a cosmic horizon surrounding us, and we will never receive any information about what lies beyond it. On the other hand general relativity predicts that our location is not special, that such a horizon exists around any location, and that the universe very much should extend past this region.

          Is trusting general relativity on the existence of unobservable regions of spacetime “untestable physics”? Kind of, but the philosophical debate doesn’t really matter: no one is hinging anything serious on this claim one way or another. All claims of some sort of multiverse in physics are similar. Someone might claim that taking a theory seriously means that we should believe in such things but this doesn’t really impact actual work much one way or the other.

        • Oh, I was mostly talking about string theory claims involving some enormous number of universes with slightly varying physical constants, 10^500 or something like that.

          Places beyond the “cosmic horizon”/”observable universe” would at least share our physical constants and originate with the same Big Bang – they are still unobservable, but at least there is some connection (common origin, etc.) with observable reality.

        • @confused: String theory has fundamentally simple strings, and gets our comparatively complicated particle physics through complicated shapes that the microscopic spacetime dimensions can form. So there is not an “electron string” and a “quark string” as separate things; there is one kind of string that can be wound around the shape in either an electron or quark configuration. If the universe takes a different shape then we would see different particles. The core claim here (a somewhat tenuous one) is that the various ways the dimensions can curl up are all actually possible from the same starting point, and which one you end up in is a matter of chaotic dynamics.

          If we take a many-worlds view of quantum physics then there are other branches where the shape formed differently, just as there are branches where the Earth doesn’t exist.

          Alternatively under some pictures of early-universe evolution the region where the dimensions curled up the way ours did is finite in size, and there are likely other disconnected bubbles where the low-energy physics and particle content came out differently. This is morally the same as the GR point, including originating with the same Big Bang, just pushing what counts as “fundamental physical constants” further back. Having a multiverse of sorts is not a separate postulate.

        • We don’t really have much confusing data that needs explaining in the first place. Certainly not enough to be able to seriously distinguish alternatives.

          Any particular proposal for physics beyond the standard model (e.g. there is a new particle with these properties, mass M, charge Q, etc) will make precise predictions, but who would stake out a claim on a specific such theory? Someone might believe that a particular *class* of modifications makes sense (e.g. a particle of some particular type exists), but such a class will usually have a few free parameters in it and so would need a few more experiments as “input” to fix those parameters before it starts making unambiguous predictions. If precision tests of the standard model reveal discrepancies then we would immediately have many competing predictions on what should come after those. What is happening right now is more like mapping the full collection of simple-looking standard model extensions well ahead of time, which would be filtered through once more data comes in.

          I agree that there’s not really much point in doing this early, and figuring out better experimental designs should be high priority. On the other hand people also object to building another collider, so it’s hard to make anyone happy.

        • So maybe it would make more sense to stop doing fundamental physics for a few centuries until we have the technology to investigate black holes close-up, handle super-high energies where string theory effects would be obvious, etc.? ;)

        • Possibly. The actual distribution of physics research already roughly looks like this; there are few people doing fundamental theoretical research these days compared to basically any other subfield. Even then a lot of that research is staying relevant by dovetailing with better ways of understanding the mathematical structure of quantum field theory. A lot of what passes for “string theory” research now is about trying to pull mathematical techniques that were first found in a stringy context back to ordinary physics.

          There are a couple of things that are used to justify further near-term experiments. One is the few anomalies that we do currently see: mainly the flavour anomaly and the muon magnetic moment. The (so-called?) statistical significance for those is slowly creeping up, and is taken as reason to keep going until the situation is resolved one way or the other. Experiments are noisy and complicated, but structurally distinct sets of experiments seem to agree on the direction that this weak discrepancy is in, so there’s cautious optimism here.

          Two is the more detailed understanding of the Higgs. The Higgs has an energy potential. The vacuum value of the Higgs field is the location of the minimum, and the Higgs boson is the signature ripple seen when kicking the field value off the bottom of the potential. The Higgs mass gives us the curvature of the bowl. So we know that this potential exists now, but we don’t know much else about it. What is its actual detailed shape? Since the Higgs mass is weirdly right on the edge of letting the universe be stable at all, this is considered a pretty interesting question. It’s an unmeasured feature of something we know is there, where we understand more or less how to get accurate measurements by building better Higgs factories. There’s still debate on whether or not that overall endeavour is worth it, but at least it’s an experiment that doesn’t rely on speculation to deliver meaningful results.

        • That was not a completely serious comment, thus the “;)” – I don’t think one can really ‘shut down’ a field of research like that. (If the gap in research was larger than one career length much institutional knowledge not completely captured in published text would be lost, IMO).

          I am however somewhat skeptical of the cost/benefit ratio of things like the LHC vs. other uses for limited governmental science funding… but then, being personally involved in the environmental field, I am biased…

    • This comment is nonsense and ahistorical. It was great that Einstein was able to figure out GR from “thought alone” so to speak, but that is generally not how physics proceeds. Rather, one always has an existing theory and looks for deviations from that theory. In fact, for many physicists, the Higgs, as it was discovered, was the nightmare scenario, because it was the simplest mechanism that could achieve the symmetry breaking needed in the standard model and gave no hints about what new physics might look like (if anything, it made supersymmetry look less likely).

      I would say that almost every physicist would love to see something totally unexpected rather than a complete confirmation of an existing theory.

      Also, I don’t agree that deviations from the model are taken as confirmation of the theory. For the Higgs, we didn’t know its mass, but its couplings and how it should appear in scattering amplitudes were well understood. It would be great to have a theory with zero free parameters, but that’s not what we have right now.

      Finally, physicists do take into account the issue of multiple hypothesis testing, under the name of the “look elsewhere effect”. Regardless, it just seems to be the case that 3 sigma “discoveries” pretty much never work out. Particle physics is lucky enough that we can get to 5 sigma to be super sure. I’m sure someone out there has done an analysis of why 3 sigma things go away at a much higher rate than expected, but you’d have to ask an experimentalist. One reason might be that hadron collisions are messy and backgrounds are hard to calculate.

      • > This comment is nonsense and ahistorical. It was great that Einstein was able to figure out GR from “thought alone” so to speak, but that is generally not how physics proceeds.

        My account is accurate historically, yours is not. Einstein saw existing anomalies such as the orbit of Mercury, etc., and used them to develop his theory.

        If you look up the laws of thermodynamics, electromagnetism, etc you will see the same process.

        • > My account is accurate historically, yours is not. Einstein saw existing anomalies such as the orbit of mercury, etc and used it to develop his theory.

          The anomalous precession of mercury was an output of the theory, not an input. Einstein developed GR to reconcile gravity and special relativity.

        • So then what are you claiming? That Einstein knew about the various anomalous observations but totally ignored them and went on to develop GR as if they did not exist?

          No, instead he knew about them and developed GR to explain the anomalies checking his results against them along the way.

          They were basically his training data.

        • If I remember correctly, indeed Ab is right. I am replying to Ab because there is no ‘Reply to this comment’ button under Anoneuoid’s comment.

          @Anoneuoid: GR had been in development for at least 7 years before Einstein published his paper on the Mercury case. While it is true that correcting planetary orbits was one of the motivations of GR, I am quite certain that he did not use the case of Mercury as his “training data”. He found approximate solutions for the orbit of Mercury in order to demonstrate the applications of his theory in 1915, while GR had been in development since 1907. This is not a “training data” story as presented.

          > They were basically his training data.

          A loosely related remark: I am always bewildered by the confidence of statements like the above. Ab is simply right; this is not how the theory was developed.

        • You just described Einstein using the orbit of mercury to check his theory.

          If it hadn’t made the correct prediction then he would have gone and modified something. Indeed, it is very possible he did deduce the orbit multiple times before figuring out the correct method.

          This is the same thing as if I check the performance of a set of models against some dataset then select the ones that perform best. You will get overfitting to that specific data.

          https://stats.stackexchange.com/questions/20010/how-can-i-help-ensure-testing-data-does-not-leak-into-training-data

        • The Einstein-Besso manuscript, written in 1913 and 1914, is one of just two known working manuscripts that show Einstein’s thought processes as he was developing the general theory of relativity, his crowning glory.

          Writing in neat, precise figures, and crossing out as he goes along, Einstein tries to calculate an anomaly in the orbit of Mercury around the Sun. It is, in effect, a peek over Einstein’s shoulder as he wends his way through the mistakes and discoveries that would culminate, in November 1915, in the general theory.

          https://www.nytimes.com/1996/11/06/arts/dark-side-of-einstein-emerges-in-his-letters.html

          Also, according to this he was using the orbit of Mercury to select and discard ideas as he went along.

        • The idea that the orbit of mercury served as “training data” for the general theory of relativity is so misinformed about how GR actually works, it’s hard to know where to start.

        • > The idea that the orbit of mercury served as “training data” for the general theory of relativity is so misinformed about how GR actually works, it’s hard to know where to start.

          Here it is in his own words:

          In 1915, Einstein wrote to Hans Albert, “I have just completed the most splendid work of my life,” probably the final, and correct, calculation of the anomaly in Mercury’s orbit, one of the significant proofs of the general theory.

          As the other quote shows he was working on this for years trying to derive the correct prediction.

          You just seem to have a very naive understanding of data leakage.

      • > I would say that almost every physicist would love to see something totally unexpected rather than a complete confirmation of an existing theory.

        > Also, I don’t agree that deviations from the model are taken as confirmation of the theory. For the Higgs, we didn’t know it’s mass, but its couplings and how it should appear in scattering amplitudes was well understood. It would be great to have a theory with zero free parameters, but that’s not what we have right now.

        My issue is when experiments proceed with almost nothing but free parameters. Consider the search for supersymmetric WIMPs. First of all, in that case deviations from the standard model’s predictions of no modulation would absolutely have been interpreted as confirmation of supersymmetry. Second, with no real theoretical boundaries on its scattering behavior, is it possible to disprove the theory with ANY observed data? As far as I can tell, repeated null results have only been interpreted as “we need a more sensitive detector.” Third, it’s also not clear what would be learned from a signal. Had the DAMA3 3-sigma modulation signature checked out, what would that have meant for new physics? What did it mean for the theory in the short period where people were credulous?

        • > First of all, in that case deviations from the standard model’s predictions of no modulation would absolutely have been interpreted as confirmation of supersymmetry.

          This sounds like you’re confusing neutrino masses and supersymmetry?

          As for the rest, this isn’t really the right place to get into the merits of supersymmetry. It was a very good idea that sadly seems less and less likely given recent experimental evidence. Unfortunately, there’s a lot of misinformation about the actual practice of theoretical physics out there, promulgated by various popular authors.

        • > This sounds like your confusing neutrino masses and supersymmetry?

          I don’t think I am. DAMA/Libra style dark matter detectors sought to detect evidence of supersymmetric WIMPs with crystal scintillators. The theory was that supersymmetric WIMPs would produce an annual modulation signal due to the movement of the Earth through the galactic dark matter halo, whereas the standard model predicted no modulation. Therefore, a modulation signature at 3 sigma was briefly thought to be weak evidence of supersymmetry. The reasoning is still null hypothesis thinking: the standard model predicts no signal, so if there is any discernible signal, it proves our alternative, though our alternative provides little description of what the signal should look like.

          Follow-ups in more sensitive and carefully water-shielded noble gas direct detectors lined with photomultipliers have since failed to reproduce the result, but all that’s meant is that we need larger detectors and more water shielding. Consider

          https://arxiv.org/pdf/1707.06277.pdf

          In particular figures 5 and then 13. Note that the projected Xenon1N sensitivities cannot rule out the credible region for a CSSM neutralino; it can rule out the standard model and “discover” SUSY dark matter if it finds a signal, but if it cannot find a signal, SUSY stands and we just need another detector. Not to mention, this is just one candidate.

          From the article:

          > Within this class of WIMPs from thermal freeze-out a large number of particle candidates for DM have been proposed in the literature, and new ones (sometimes in a reincarnated form) appear on a frequent basis. For one, this means that it is actually fairly easy to invent DM-like WIMPs. On the other hand, it is fair to say that, from the perspective of today’s (or foreseeable) experimental sensitivity, it would be very difficult to distinguish many (perhaps most?) of them from each other or from well established and popular candidates, like the lightest neutralino of SUSY. Furthermore, one should not forget that the underlying frameworks that predict many, if not most, of them, while interesting, very often lack a deeper or more complete theoretical basis and instead invoke some sort of “dark portal”. For instance, many such approaches lack a UV completion, do not address other serious questions in particle physics or cosmology, etc.

          It seems you know what you’re talking about here, so I am genuinely, earnestly asking you: what is the point of these experiments? I’ll repeat again from the article:
          “On the other hand, it is fair to say that, from the perspective of today’s (or foreseeable) experimental sensitivity, it would be very difficult to distinguish many (perhaps most?) of them from each other or from well established and popular candidates”
          What exactly is the point, then? Is it really just to transfer funds from NSF grants to physics PhDs? Is it for someone to maybe get 5 sigma and a consequent Nobel prize? If I’m being misled, can you explain how? That’s not a jab—I really would rather live in a world where physics makes sense.

        • Are you against the experiment itself, or just the claimed justification in terms of a particular theory? Something like “If dark matter is made of particles we might hope to detect them due to our motion through the solar system; let’s develop detectors for that” seems to make perfect sense as an experimental strategy.

          It seems that this thread consists of both complaining about further developing theory and also complaining about any experiment that doesn’t have a well developed theory attached to it. If we treat this as a form of exploratory data gathering, which is simultaneously used to investigate several theoretical proposals, what exactly is the problem? What kinds of experiments are physicists allowed to do?

        • > Something like “If dark matter is made of particles we might hope to detect them due to our motion through the solar system; let’s develop detectors for that” seems to make perfect sense as an experimental strategy.

          Not to me! I’d like to first know what can potentially be learned from the experiment.

          If the experiment has a signal, we learn a little about the mass/scattering cross section, but the cardinality of potential theories is still very high and there’s no way to tell if what we saw is actually responsible for the anomalous galactic angular velocities.

          If the experiment sees no signal, then we apparently don’t rule out any theories, just lower the upper bound on the cross section. And here’s the rub: since there’s no lower bound, there’s no limit to how big the detector needs to be to see a signal. The amount of grant money demanded by this family of theories can increase without limit. This may sound a bit profane, but I think experiments should justify the cost.

        • I don’t know enough about the field to have a strong opinion as to how bad the situation really is, but there *is* the possibility that a field of science can just sort of dead-end for a while (or even start to go down a wrong track, as in biology in the ‘eclipse of Darwinism’ in the end of the 19th century/beginning of the 20th)*…

          *Although those people’s problems with Darwin’s theory were actually largely correct (Darwin didn’t have a good enough mechanism for introducing new variation, and the idea of ‘blending inheritance’ held at the time would tend to remove what did exist), the theories they introduced were much more wrong, while Darwin’s was merely incomplete.

        • Although it wasn’t *entirely* just a matter of incompleteness in Darwin’s theory — there was also a completely wrong, yet (given the state of science at the time) entirely reasonable, objection. Astronomy c. 1900 didn’t allow for life to be as old as evolution probably needed.

          Life on earth couldn’t be older than the sun, and the astronomy/physics of the time couldn’t get the sun’s lifespan to be more than about 100 million years — since it clearly hasn’t burned out yet, the earth could only be some tens of millions of years old.

          (Once it was realized the sun was powered by nuclear fusion, which has vastly more energy per mass than anything known at the time, this problem went away. But that was still a few decades in the future.

          And by that time, Mendel’s work had been rediscovered and expanded, Mendelian genetics had been shown to be compatible with evolution by natural selection [unlike the old idea of ‘blending inheritance’], and the basics of population genetics had been worked out.)

        • Ah. I see the confusion. It’s worth stating, for what it’s worth, that WIMPs are fairly independent of SUSY. Some flavors of SUSY predict a WIMP, which was considered a success for the theory, but you can have WIMPs without SUSY.

          Anyways, the point of the experiment is to detect a WIMP. This would be a huge, huge deal. It would explain (some fraction of) the dark matter in the universe. Then the model builders would get to work with SUSY WIMPs, non-SUSY and god knows what else exotica to explain the experimental result. And hopefully those theories would make new predictions that could then be verified or falsified. But we’d have a WIMP. That’s Nobel prize winning work. You’d be in the history books. Figuring out the theory is great, but finding something new about the world? Also great.

          Which was sort of my point above. Science isn’t always hypothesis, test, yadda yadda. Sometimes it’s wtf is that? How can we explain it. Nobody had a prediction for dark energy that people set out to test. But it was there. And we still don’t understand it.

    • > compare precise predictions to observations collected after those predictions were made.

      I think I see where you’re going with this, but what if you observe something before you predict it? Doesn’t that sort of bar you from using data you collected without a theory and isn’t that sort of self-defeating (what if you run out of unique new measurements before you get the right theory)?

      And so is it this ordering in the Higgs boson (HB) case that bothers you? Or are you saying that the HB models didn’t make predictions? Or is the background noise model the worrisome part? Or that the predictions weren’t precise enough?

      • > I think I see where you’re going with this, but what if you observe something before you predict it? Doesn’t that sort of bar you from using data you collected without a theory and isn’t that sort of self-defeating (what if you run out of unique new measurements before you get the right theory)?

        Not at all. That is a key part of the scientific process: https://en.wikipedia.org/wiki/Abductive_reasoning

        But you can’t use the same observations used to develop a theory to confirm it post-hoc.

        • I think that the rule is really there to account for the fact that, in reality, the null hypothesis probably *isn’t* true, at least not when you include the full halo of necessary ancillary hypotheses about instrumental effects, the noise statistics, etc. — i.e., what is sometimes called “systematic effects”. (You could also say it is to account for the fact that you’re really testing aspects of the whole research program in the sense of Lakatos, not just the “hard core”.)

      • > Or that the predictions weren’t precise enough?

        This is a separate issue; see the Meehl paper I linked above and this one:

        The purpose of the present paper is not so much to propound a doctrine or defend a thesis (especially as I should be surprised if either psychologists or statisticians were to disagree with whatever in the nature of a “thesis” it advances), but to call the attention of logicians and philosophers of science to a puzzling state of affairs in the currently accepted methodology of the behavior sciences which I, a psychologist, have been unable to resolve to my satisfaction. The puzzle, sufficiently striking (when clearly discerned) to be entitled to the designation “paradox,” is the following: In the physical sciences, the usual result of an improvement in experimental design, instrumentation, or numerical mass of data, is to increase the difficulty of the “observational hurdle” which the physical theory of interest must successfully surmount; whereas, in psychology and some of the allied behavior sciences, the usual effect of such improvement in experimental precision is to provide an easier hurdle for the theory to surmount. Hence what we would normally think of as improvements in our experimental method tend (when predictions materialize) to yield stronger corroboration of the theory in physics, since to remain unrefuted the theory must have survived a more difficult test; by contrast, such experimental improvement in psychology typically results in a weaker corroboration of the theory, since it has now been required to survive a more lenient test [3] [9] [10].

        Paul E. Meehl (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103–115. http://meehl.umn.edu/sites/meehl.dl.umn.edu/files/074theorytestingparadox.pdf

        Physics is adopting all the bad things from the lesser researchers.

    • For the Higgs, the Standard Model predicted essentially all of its properties (spin, interactions, how it decays, etc.) except its mass. So the comparison was to a model of all the known relevant “background” processes vs. a model of the background plus a Higgs with very well defined properties except its mass. 5-sigma was a useful threshold in this case because that was a clear enough signal to verify that the decay modes matched the predictions.

      LIGO similarly compares a model of the observed noise to the observed noise plus the signal predicted by GR for black hole mergers.

      In both cases, “deviations from that model” alone were not enough to constitute confirmation. In both cases, the “deviations” had to agree with detailed predictions of the Standard Model or GR.
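
      As a toy sketch of what that background-only vs. background-plus-signal comparison looks like in the simplest counting case (one bin, background assumed perfectly known, no systematics; all numbers made up), the usual asymptotic discovery significance is:

          import math
          from scipy.stats import norm

          n_obs = 1130      # observed events (hypothetical)
          b = 1000.0        # expected background-only yield (hypothetical)

          # Asymptotic significance for rejecting the background-only hypothesis,
          # from the Poisson likelihood ratio of the two models
          q0 = 2.0 * (n_obs * math.log(n_obs / b) - (n_obs - b))
          z = math.sqrt(q0)
          p = norm.sf(z)    # one-sided p-value

          print(f"{z:.1f} sigma, p = {p:.1e}")   # ~4.0 sigma with these numbers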

      • > In both cases, “deviations from that model” alone were not enough to constitute confirmation. In both cases, the “deviations” had to agree with detailed predictions of the Standard Model or GR.

        This is how NHST infects a field. Statistical significance is introduced as some minor thing, and slowly gains in importance. After a few generations it is all that matters for considering something “real” (worth publishing). It seems to take 2-3 generations. The pre-existing generation of scientists resists it and trains their students, with some benefit, but a few generations later researchers no longer understand science and use NHST for everything instead, due to selection via grant funds.

        It is all in the public literature for anyone to see. It started in education research during the 1940s. You can see it most prominently in psychology, and then, delayed by about 30 years, in biology.

    • What can be done when the predictions and/or observations do not have infinite precision?

      Is there a way to compare them to identify a “discovery”?

      • Yes, the “sigma” includes all uncertainties:
        – statistical (most often from counting)
        – systematic from background subtraction
        – systematic from measurement error
        – systematic from theory prediction
        The first one is on very solid grounds.
        The other ones, often not so much.
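
        A minimal sketch of how such a combined sigma is often quoted, assuming the common convention of adding independent uncertainties in quadrature (all numbers hypothetical):

            import math

            excess = 130.0             # observed excess over the expected background

            stat = math.sqrt(1000.0)   # counting (Poisson) uncertainty on the background
            syst_bkg = 15.0            # background-subtraction systematic
            syst_meas = 10.0           # measurement/detector systematic
            syst_theory = 5.0          # theory-prediction systematic

            total = math.sqrt(stat**2 + syst_bkg**2 + syst_meas**2 + syst_theory**2)
            print(f"{excess / total:.1f} sigma")   # ~3.5 sigma with these made-up numbers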

    • I am a little bit puzzled by the original statement. I will try to express my confusion by focusing on the gravitational waves (GWs) example, which is also my expertise.

      So, what happens is that GR predicts such signals, and we have built machines to detect them. These machines, as expected, are limited by instrumental noise. When a detection is claimed we need to take into account these limitations. The signal however, if measured confidently, is exactly as predicted by the theory. By confidently, I mean that it’s quite bright, and it is measured in coincidence by all the detectors in the network.

      So, the confirmation does not come with deviations from the noise model, but with the confirmation of the theoretical model which predicts the measurements. The noise model is used to derive the accuracy with which we measure the parameters of the particular astrophysical system that produced the signal.

      It is when the detector catches something unmodelled, or unpredicted, or unbounded by the theory, that we focus on the deviations from the noise model (or when “new physics” is the goal).

      > This is a big mistake. They need to compare precise predictions to observations collected after those predictions were made.

      In my eyes, this is exactly what happens in physics (most of the time). An experiment always aims to test the theory. I mean, we knew what the GW signals would look like decades before we actually measured them. This is why huge projects like LIGO get funding. The same happened with the Higgs discovery. It was found exactly where it was predicted to be. Or maybe I am missing your point entirely. Could you elaborate on this?

  2. A point the blog post mentions and that is important to us physicists is that there is another threshold called “evidence for” which is three sigma (and of course there’s also the Twitter rumors threshold at over 2 sigma). One could ponder the distinction between “evidence for” and “discovery of” but I would argue the distinction is primarily important for how funding agencies and news media deal with results and that there’s an obvious continuum of increasing belief as 2 sigma turns into 5 sigma turns into last year’s discovery being this year’s irritating background.
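
    For reference, under a normal approximation those thresholds correspond roughly to the following one-sided tail probabilities (a quick sketch using scipy):

        from scipy.stats import norm

        for n_sigma in (2, 3, 5):
            print(f"{n_sigma} sigma -> one-sided p ~ {norm.sf(n_sigma):.2e}")

        # 2 sigma -> one-sided p ~ 2.28e-02   (Twitter rumors)
        # 3 sigma -> one-sided p ~ 1.35e-03   ("evidence for")
        # 5 sigma -> one-sided p ~ 2.87e-07   ("discovery of")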

  3. Back in the day, when the Higgs Boson discovery was announced, I attended a presentation on it from the ATLAS group. I remember them announcing “we found it” with the 5 sigma noise threshold. Then they asked “but is it the Higgs Boson? Well it certainly quacks like a duck” and presented a comparison of theoretical expectations to data collected. So I think it’s understood that the null hypothesis rejection actually tells you very little and is more of a formality. The null hypothesis really should be the theoretical predictions about the particle under test though, not the nonexistence of a particle. The problem is that sometimes there are no real predictions, so that gets conveniently ignored.

    I skimmed this once—it looks good!

    https://arxiv.org/pdf/2012.14341.pdf

  4. The linked blog post seems to contain a logical fallacy. The general statements about five sigma sometimes not being enough are based upon an example of a failed experiment to argue that things can still go wrong. If all your data is hosed up by a faulty experiment your threshold does not matter: a ten sigma result is the same as a five sigma result.

    Is there a meaningful threshold for a “discovery” in terms of the magnitude of the standard deviation in purely stochastic processes? That is the pertinent question IMO.

    • Yep, this is a major problem for NHST but not for science which uses otherwise surprising predictions and impressive feats rather than statistical significance.

    • “Does it ever make sense to identify anything as a “discovery”? There is always measurement error.”

      Sure it does. At some point, scientists discovered that the tops of thunderstorms can enter the stratosphere. It might be really important. Measurement issues are entirely irrelevant unless you want to argue from the realm of semantics. There are lots and lots of discoveries that are not challenged at all by measurement uncertainty.

  5. Tony O’Hagan looked into this a few years ago with a view to replacing this 5-sigma rule with a Bayesian approach. A discussion started and a physicist chipped in. As I recall, the problem is generally that you’re looking for a bump on an energy curve but you don’t know where along the curve it is and you don’t know how big a bump to expect. Physicists knew that just looking at the bump and computing its significance was wrong because of multiple comparisons, but rather than trying to work out the true significance they just increased the number of sigmas until false discoveries were rare. The normal-theory probability statements about the significance levels are all wrong.

    • I don’t think that their calculation is really about sigmas (standard errors). They compute a probability in a very complicated way and present that result in terms of sigmas, if I remember correctly.

    • In particle physics that’s known as the “look elsewhere” correction, it’s usually estimated either analytically or by toy simulations to see how likely a fluctuation that size is anywhere in the mass range of the search. For unexpected results, there’s often a “local significance” that doesn’t correct for multiple comparisons, and “global significance” which does correct for that.
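
      A toy-simulation sketch of that local-vs-global distinction (hypothetical setup: 50 independent mass bins, background only; how often does at least one bin fluctuate up by 3 sigma?):

          import numpy as np
          from scipy.stats import norm

          rng = np.random.default_rng(0)
          n_bins, n_toys = 50, 100_000

          # Background-only pseudo-experiments: each bin's local significance ~ N(0, 1)
          local_z = rng.standard_normal((n_toys, n_bins))
          max_z = local_z.max(axis=1)

          local_p = norm.sf(3.0)            # a 3-sigma bump in one pre-chosen bin
          global_p = (max_z >= 3.0).mean()  # a >= 3-sigma bump anywhere in the range

          print(local_p, global_p)          # ~1.3e-3 locally vs. ~6-7% globally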

      I think the 5-sigma is more because there are many physicists looking at the same data, some looking for the same particle but in different final states, and some looking for different particles or predicted effects. With that many people analyzing the same data for different effects, you’ll eventually get a 3-sigma fluctuation.

      Maybe you could do some kind of hierarchical model for the same particle decaying to different final states–I haven’t thought about it much (and don’t know much about hierarchical models).

      In particle physics we do publish our null results–the literature is full of papers with titles of the form “Search for …”, including many published Higgs searches before it was observed. Those papers set upper limits on how large the searched-for signal could be and still go undetected at some confidence level.

  6. I have nothing to offer but pedantry, so here you go.

    Andrew said “Physics is an interesting realm to consider for hypothesis testing, because it’s one area where the null hypothesis really might be true, or at least true to several decimal places.”

    I just want to de-weasel that claim: you can get rid of the entire last clause. The rest mass of the positron really could be exactly equal to the rest mass of the electron, not just “to several decimal places.” The neutrino really might have exactly zero electromagnetic interaction with other particles. The speed of light in a vacuum might be perfectly independent of direction. Perhaps none of those are true, but they might be.

    This is pedantry because Andrew is right that you aren’t going to have a perfect measurement error model, and it’s for sure that your measurement of the positron rest mass is not exactly equal to your measurement of the electron rest mass, and so on.

  7. As a physicist who somehow now designs experiments on agricultural products, I’m a bit more charitable to significance testing in physics than in agriculture. In my career I’ve gone from 15 digits of precision to 2 digits, if I’m lucky.

    In physics, one is often trying to detect an extremely tiny perturbation or interaction term on a well-modeled main effect. Much of modern experimental physics is a project of iterative equipment characterization and calibration. It’s very easy to find a 2 sigma result caused by a picayune issue like a bad solder joint. It also means that before physicists can publish the 5 sigma result, they must show in painstaking detail that their apparatus is calibrated.

    Those publications do not make the front cover of Nature, but internal to the physics community they are taken seriously and strongly respected. For example, here’s a paper* (in Nature!) on the progress towards measuring anti-hydrogen’s energy transition as compared to hydrogen. This paper is still 1000x less sensitive than their overall goal, which is to see if the gravitational attraction between anti-proton and anti-electron in anti-hydrogen differs from that between proton and electron. Yet it’s in Nature.

    These additional checks beyond significance testing mitigate the risk of publishing accepted results based purely on strongly significant tests that are really Type M errors.

    *https://www.nature.com/articles/s41586-018-0017-2

  8. My physics friend writes:

    Let
    P(A) be the probability of obtaining a result
    and
    P(B) be the probability that the hypothesis is true.

    We measure P(A|B) but we want P(B|A).

    P(A) can be estimated (including look elsewhere effect)
    but we have no sensible knowledge about P(B).

    5-sigma is an empirical rule that encodes our ignorance of the prior P(B).

    and points to:

    https://arxiv.org/pdf/1310.1284.pdf
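
    A generic Bayes-rule sketch of that last point (not the calculation in the linked paper): even data that favor the hypothesis by an enormous likelihood ratio give a modest posterior P(B|A) when the prior P(B) is tiny, which is roughly what a very strict sigma threshold is compensating for.

        def posterior(prior, likelihood_ratio):
            """P(B|A) from Bayes' rule, given P(B) and P(A|B)/P(A|not B)."""
            prior_odds = prior / (1.0 - prior)
            post_odds = likelihood_ratio * prior_odds
            return post_odds / (1.0 + post_odds)

        lr = 1e5  # hypothetical: data favor the new-physics hypothesis 100,000-to-1
        for prior in (0.5, 1e-3, 1e-6):
            print(prior, posterior(prior, lr))

        # 0.5   -> ~0.99999
        # 1e-3  -> ~0.99
        # 1e-6  -> ~0.09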

  9. Physicists use a 5-sigma threshold because we can. As David S’s friend says, it is an ad hoc rule that does a reasonable job. Because experimental physics is much easier (in terms of achieving accuracy) than the life or social sciences, in the end we can take the attitude (updating Rutherford’s reputed view) that if we are arguing over statistics, we should just do a better experiment. (But that doesn’t seem to stop us from arguing over statistics.)

    Measurements in life sciences are no more likely to disagree than in physics, but the experimental physics results that we care about are easier to improve. When a new measurement is reported in particle physics, it is typically 1.3-2.0 times better than the best previous measurement of the same quantity (https://royalsocietypublishing.org/doi/10.1098/rsos.160600). A new quantitative result in medical sciences is typically significantly less precise than the best previous measurement of the “same” quantity, which makes progress slow and a 5-sigma criterion socially impractical.

    • Genome-wide association studies (GWAS) use a threshold of 5 x 10^-8, though.

      The reality is that the threshold is collectively chosen based on what gives the “right” rate of discoveries. This differs depending on the cost of each data point and the precision of the measurements.

      If the threshold is either too strict or too lax the illusion that NHST is telling you anything of value falls apart.

      The “right rate” seems to be whatever makes around 1/3 of experiments “work”. Then the vast majority of graduate projects have something to report without it being too easy.

  10. A hunch I have as a casual skimmer of physics papers is that Bayesian methods are far more common in cosmology and astrophysics than particle/high energy physics. I’m not sure why that cultural divide exists – in fact, I’m not even sure it’s there since my reading habits are definitely biased. Curious to hear if other people have noticed the same pattern tho.

  11. Experimental particle physicist here.
    Several comments:

    (1) The three sigma (“evidence for”) and five sigma (“discovery of”) rules are essentially a particle physics convention. I don’t think that other fields of physics are too concerned with that.

    (2) They are a useful convention to protect against false positives and also against the fact that many of the uncertainties we deal with are what we call “systematic” in nature, e.g. they have to do with how well we understand our detector and other processes that can contaminate our signals. These systematic uncertainties can easily be underestimated.

    (3) In particle physics we have something called the Standard Model (SM) that in principle accounts for everything except gravity. It predicts various particle physics phenomena with accuracies that can vary between ~ 10% and ten parts per billion. (Some calculations are harder than others, sigh). However, we strongly believe that the SM is incomplete. Theorists write papers speculating on what the “Beyond the Standard Model” (BSM) theory might be and also make better and better SM calculations. Experimentalists test BSM theories, and falsify them (at least falsify some of the parameter space of a given BSM theory). Experimental results that exclude BSM theories (null results) are quantified and published all the time, and much progress in the field is made that way.

    (4) Discovery of BSM would be an extraordinary claim, and extraordinary claims require extraordinary evidence (5 sigma).

    (5) We search for signs of BSM also by making more and more precise SM measurements and compare them with the SM theory, as well as searching for very rare SM-allowed processes that we have not seen yet. We still use the 3 sigma/5 sigma standard to claim that a new rare SM process has been seen. This is a bit silly, since the claim is not necessarily extraordinary, but it is a harmless convention. Most people in the field understand that.

    (6) In searching for BSM the null hypothesis is always the SM.

    (7) For the Higgs discovery, the null hypothesis was the SM without the Higgs. The Higgs was part of the SM but a very untested part, and there were several alternatives available. This seems strange at first, but the point to understand is that many SM processes can mimic the Higgs signature. Now we are measuring Higgs properties as precisely as we can. In testing for BSM now, the SM with a Higgs is the null hypothesis.

    (8) Someone said: “First of all, in that case deviations from the standard model’s predictions of no modulation would absolutely have been interpreted as confirmation of supersymmetry”. As someone who worked on supersymmetry searches I can attest that this is not true. Most of the deviations from the SM predictions that we searched for could have been due to other (non-supersymmetric) BSM theories. Everyone in the field was aware of that. In fact before the Large Hadron Collider came online, there were several papers and conferences dedicated to the problem of “once we find a deviation from the SM, how will we ever figure out what it is”. Google “LHC Olympics” for examples.

    (9) Some theorists (not all) like to speculate about things that cannot be tested experimentally. It is very exciting and they tend to get a lot of press. But thankfully there is much more to particle physics.

    • >>However, we strongly believe that the SM is incomplete.

      Is this because of the strong experimental evidence for dark matter/dark energy? Are there other reasons?

      (Well, besides the historical parallel that in the late 1800s physics was thought to have been “mostly all figured out” except for a couple of small lingering questions, IIRC…)

    • > Someone said: “First of all, in that case deviations from the standard model’s predictions of no modulation would absolutely have been interpreted as confirmation of supersymmetry”. As someone who worked on supersymmetry searches I can attest that this is not true.

      That may be true of physicists in reality, but there are certainly experiments that use CSSM/MSSM as motivating examples in their debut papers and, I suspect without any evidence, in their grant proposals.

      > Most of the deviations from the SM predictions that we searched for could have been due to other (non-supersymmetric) BSM theories. Everyone in the field was aware of that. In fact before the Large Hadron Collider came on line, there were several papers and conferences dedicated to the problem of “once we find a deviation from the SM, how will we ever figure out what it is”. Google “LHC Olympics” for examples.

      This is exactly my issue though–the experiments go on, when it really isn’t clear what can be learned. The two outcomes are:

      1. If the proposed signal is not found–that is, the standard model is not falsified–neither are the candidate BSM theories that motivate the experiment because the parameter search space is so large. Supposing the SM actually is true, the research program to find BSM particles can nonetheless proceed indefinitely because the potential sensitivity or kinetic energies required are infinite.

      2. If the proposed signal is found, we’ve narrowed the field from the seemingly infinite set of candidate theories to the still-seemingly-infinite subset of theories with ANY particle of a compatible mass/scattering cross section.

      The only progress that can be made is to falsify the null hypothesis of the standard model that we already don’t really believe in.

      > (4) Discovery of BSM would be an extraordinary claim, and extraordinary claims require extraordinary evidence (5 sigma).

      I personally don’t really care very much about claims of discovery. What I do care about is whether or not some other theory explains the data, or whether the data suggest some theory. I’d like to see a model proposed and then the log-likelihood of the data under that model, and if the log-likelihood is too small the data is anomalous. “BWbjte-SM predicts a snark of mass x eV/c^2 with a scattering cross section between y1 and y2, implying scintillations signals of poisson rate k when Mercury is in retrograde. We saw kf scintillations per hour, which is too small, but also still anomalous under SM with background, suggesting that we need to do more math.” Focusing positive attention on *any* observations that sufficiently deviate from SM physics, whether we know anything about that discovery or not, really feels like building a hype machine.

      • > … use CSSM/MSSM as motivating examples in their debut papers and, I suspect without any evidence, in their grant proposals

        In our papers we use plausible and well motivated models to demonstrate what kind of BSM our experiments are sensitive to, and to exclude BSM parameter space if the data agrees with the SM. Of course we write about the projected sensitivity of our experiments in our proposals. No-one should finance a new experiment if that experiment is not more precise/sensitive than what has been done before.

        > …. the research program to find BSM particles can nonetheless proceed indefinitely because the potential sensitivity or kinetic energies required are infinite.

        The program is to try to understand what happens at higher and higher energies or, equivalently, shorter and shorter distances. According to your argument one should never have bothered to invent the optical microscope. A microscope could only look at things on the few-micrometer scale, but there is an infinite range of shorter distances that could not be probed by it. We now know that on the nanometer scale there are atoms, on the femtometer scale there are nuclei, that nuclei are made of protons and neutrons, that protons and neutrons are made of quarks, and that quarks are at least 10,000 times smaller than protons.

        > The only progress that can be made is to falsify the null hypothesis of the standard model that we already don’t really believe in…..What I do care about is whether or not some other theory explains the data, or whether the data suggest some theory.

        We believe in the SM, we just do not think it is the full story. Just like we believe in Newton’s theory of gravity as a damn good approximation to Einstein’s general relativity. The goal is then not to falsify the SM per se, the goal is to go beyond it. At the moment, by definition, all viable BSM theories must be able to explain the data that we have. This (pretty much) means that all BSM theories contain the SM, at least as a low energy/long distance approximation. In order to get BSM guidance we search for places where the SM fails. Once we find that, we would study *how* it fails. In other words, we would move to the next phase, i.e., the right BSM theory would be the one that (a) explains the particular failure that has been seen, qualitatively as well as quantitatively, (b) does not predict other BSM effects that have already been ruled out, and (c) (the gold standard) also predicts some other BSM effect that has yet to be seen and that through some new clever experiment we will succeed in measuring.

  12. Hi Andrew. I am a particle physicist with interests in statistics, especially foundations, Bayes factors and techniques for computing the marginal likelihood. I didn’t exactly understand your comment that ‘in any problem for which we might want to use such a threshold, we’d be better off fitting a hierarchical model.’

    The motivations for the 5 sigma criterion are to some degree unclear even to me, as the proponents sometimes motivate it as a desired tiny error rate in a Neyman-Pearson framework and sometimes as an extreme threshold on evidence in a Fisherian interpretation of p-values (e.g., extraordinary claims require extraordinary evidence).

    I don’t see how a hierarchical model is related to either a tiny type-1 error rate or a massive threshold on evidence, though, so I don’t understand your comment.

    • Andrew (other):

      The connection between hypothesis testing and hierarchical modeling is that both involve a sequence of related problems. To speak of type 1 error or extraordinary claims in a frequentist sense involves a reference set, a sequence of tests or claims for which you are interested in some average behavior. Similarly, a hierarchical model has a distribution of scenarios that is to be averaged over. As I wrote, Bayesians are frequentists.
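
      For concreteness, a minimal sketch of the kind of partial pooling a hierarchical model gives you, applied to several related searches (say, the same particle in different decay channels). All numbers are hypothetical, and a real analysis would fit the model properly (e.g., in Stan) rather than use this crude moment estimate:

          import numpy as np

          y  = np.array([2.1, 0.3, 1.4, -0.5, 0.9])   # estimated signal strengths per channel
          se = np.array([0.8, 0.7, 0.9,  0.6, 0.8])   # their standard errors

          # Crude method-of-moments estimate of the between-channel variance tau^2
          tau2 = max(np.var(y, ddof=1) - np.mean(se**2), 0.0)

          # Partial pooling: shrink each channel toward the precision-weighted mean,
          # more strongly when its own standard error is large relative to tau
          w = 1.0 / (se**2 + tau2)
          mu_hat = np.sum(w * y) / np.sum(w)
          shrink = se**2 / (se**2 + tau2)
          theta_hat = shrink * mu_hat + (1.0 - shrink) * y

          print(mu_hat, theta_hat)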
