Cage match: Null-hypothesis-significance-testing meets incrementalism. Nobody comes out alive.

[cat picture]

It goes like this. Null-hypothesis-significance-testing (NHST) only works when you have enough accuracy to confidently reject the null hypothesis. You get this accuracy from a large sample of measurements with low bias and low variance. But you also need a large effect size, or at least an effect size that is large compared to the accuracy of your experiment.

But . . . we’ve grabbed all the low-hanging fruit. In medicine, public health, social science, etc etc etc, we’re studying smaller and smaller effects. These effects can still be important in aggregate, but each individual effect is small.

To study smaller and smaller effects using NHST, you need better measurements and larger sample sizes. The strategy of run-a-crappy-study, get p less than 0.05, come up with a cute story based on evolutionary psychology, and PROFIT . . . well, I wanna say it doesn’t work anymore. OK, maybe it still can work if your goal is to get published in PPNAS, get tenure, give Ted talks, and make boatloads of money in speaking fees. But it won’t work in the real sense, the important sense of learning about the world.
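
To make the arithmetic concrete, here's a rough sketch in R (base power.t.test; the effect sizes are illustrative, not tied to any particular study):

    # Sample size needed for 80% power in a two-sample t-test, alpha = 0.05,
    # as the true effect (in sd units) shrinks.  Base R, no packages.
    effects <- c(0.5, 0.2, 0.1)
    n_per_group <- sapply(effects, function(d)
      ceiling(power.t.test(delta = d, sd = 1, sig.level = 0.05, power = 0.8)$n))
    rbind(effect = effects, n_per_group = n_per_group)
    # Roughly 64, 394, and 1571 subjects per group: halve the effect size and
    # you need about four times as many subjects.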

What, then, should you be doing? You should be fitting multilevel models. Take advantage of the fact that lots and lots of these studies are being done. Forget about getting definitive results from a single experiment; instead embrace variation, accept uncertainty, and learn what you can.
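
For concreteness, here is one minimal version of that advice in R, with simulated "many small studies" data; lme4 is used below, but the same formula works with rstanarm's stan_lmer if you want full Bayes (the numbers are made up):

    # Partial pooling across many noisy studies of small, varying effects.
    library(lme4)
    set.seed(1)
    n_studies <- 40; n_per <- 50
    true_effect <- rnorm(n_studies, mean = 0.1, sd = 0.2)   # small effects that vary
    d <- do.call(rbind, lapply(seq_len(n_studies), function(j) {
      x <- rbinom(n_per, 1, 0.5)                            # treatment indicator
      data.frame(study = factor(j), x = x,
                 y = true_effect[j] * x + rnorm(n_per))
    }))
    fit <- lmer(y ~ x + (1 + x | study), data = d)  # effect allowed to vary by study
    summary(fit)  # average effect across studies, plus the between-study sd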

I think there’s room for a more formal article on this topic, but the above is a start.

68 thoughts on “Cage match: Null-hypothesis-significance-testing meets incrementalism. Nobody comes out alive.”

  1. I’ve been waiting for this post because I completely agree. As a technology, NHST has run its course. This is, culturally, very hard for people to accept.

    My question for you is: What are the limits of multilevel models? I’ve gotten very bearish on big data as well, especially as it applies to medicine, where we see huge inter-patient variability, dynamic intra-patient changes (I am not the same today as I was yesterday), and sometimes significant variability in the application of the treatments or measurements themselves. Is there a limit to what we can know in these noisy, complex systems? At what point do we say ‘nothing to see here – move along’?

    • My answer to this is that there is no inherent limitation in Bayes as a technology. So if you adopt a Bayesian strategy and use multilevel models you won’t have to abandon it and move on to the “next statistical thing” later.

      However, and this is critical, adopting Bayes by itself is like adopting English: it’s spoken widely, it has all the latest technical terms, and you can say pretty much anything you want to say in it… but that includes everything from a 3rd-grade book report all the way up to a Nabokov novel. It’s up to you to have something good to say.

      Which brings us to the next bit: Design of Measurement, Design of Substantive Models, Design of How To Use Available Data (whether it’s “Big” or “Little”), Design of Experimental Protocols, Design of Decision Making Rules (Utility functions for use in Bayesian Decision Theory)…

      What we need more of now is thinking, and less button pushing. Burn your SAS manuals and start reading the Stan manual. Start thinking in terms of “*Every measurement* has measurement error; is it small enough in this case to ignore?” Start thinking in terms of “What are the variables involved in this process? How can I organize them into a smaller number of dimensionless groups? What are the order-of-magnitude sizes of these dimensionless groups in practice? What constraints exist that alter outcomes? How can I represent this function in a simple way?” And so on.
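
      A quick illustration of the “is the measurement error small enough to ignore?” question, as a base-R simulation (the numbers are purely illustrative):

        # Error in a predictor attenuates the regression slope toward zero.
        set.seed(2)
        n <- 1e4
        x_true <- rnorm(n)
        y <- 0.3 * x_true + rnorm(n)                  # true slope is 0.3
        for (err_sd in c(0, 0.5, 1, 2)) {
          x_obs <- x_true + rnorm(n, sd = err_sd)     # what you actually measure
          cat(sprintf("error sd %.1f -> estimated slope %.2f\n",
                      err_sd, coef(lm(y ~ x_obs))[2]))
        }
        # The slope shrinks by roughly 1 / (1 + err_sd^2); error you "ignore"
        # can easily cut the apparent effect in half.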

      • I don’t see much, if any, use of dimensionless groups outside of engineering. Is there? Am I missing something?

        Are there areas of Econ / Poli Sci / Psych / Medicine etc. that are heavy on dimensionless groups?

        • This is a big area of opportunity. Of course, secretly there are dimensionless groups inside lots of things. The CPI, for example, is a dimensionless ratio: Dollar Cost of Basket of Goods Today / Dollar Cost of the Same Basket at a Specific Previous Time Point.

          In medicine there is mass of drug dosed / weight of patient… (but why not, for example, Area Under the Curve of Glucose Concentration vs. Time / Area Under the Curve for a Reference Person without diabetes of the same age, sex, and weight?).

          The main thing seems to be that when a dimensionless group smacks people in the face they can recognize it, but if they thought about such groups consciously they would get farther.

        • I’m conflicted. I’m not sure whether the lack of use of dimensionless groups in these areas is just low-hanging fruit, or whether they are out of vogue for some reason.

          Reminds me of Chesterton’s fence: just because there doesn’t seem to be a reason for the near-total lack of use of dimensionless groups in these subjects doesn’t mean there isn’t a reason…

        • The thing is, there *isn’t* a lack of use of dimensionless groups; what there is is ad hoc, unprincipled use.

          For example, the concept of mg/kg of body weight for drug dosage is something people sort of understand in medicine. But, of course, you have to make a pill, so the mg is fixed and you can take 1, 2, or 3 of them. Still, in a clinical trial where efficacy isn’t yet established, people will regress effectiveness on the mg dosage, when what they should be doing is effectiveness vs. mg/kg: since you have variability across your patients, a fixed mg of dosage can’t possibly be the factor that is actually of interest.

          Similarly for outcome measures: area under the curve of glucose vs. time for 3 hours after a meal, for example. This analysis should be set up by comparing the patient’s AUC to the predicted AUC for a control patient of the same height, weight, and sex without diabetes, and it should be analyzed in terms of food intake / body mass. Instead, you see raw AUC vs. food intake in grams, broken out for “children under 5, children 5-10, children 10-15, adults 15-30, adults 30-60, seniors over 60,”

          and the analysis relies basically on the average body type within each sex/age group being the implicit reference scale, and on it not varying too much, so you can pretend it’s constant across patients. You WILL get a LOT better results if you analyze the dimensionless ratios directly, because they are the more fundamental thing and they carry less noise: you are eliminating a source of noise by comparing apples to apples (that is, a person of a specific height and weight taking in a specific amount of food as a fraction of their body weight, compared to what you would predict, all those things being equal, from a regression in the non-diabetic control group).
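
          A minimal sketch in R of that contrast, with made-up numbers (the reference AUC below is simply simulated, standing in for a real control-group prediction):

            set.seed(3)
            n <- 500
            mass    <- rnorm(n, 70, 15)                    # kg
            intake  <- rnorm(n, 80, 20)                    # g of food
            dose    <- intake / (mass * 1000)              # dimensionless: intake / body mass
            auc_ref <- 100 + 0.5 * mass + rnorm(n, 0, 5)   # stand-in for the control-group prediction
            auc_obs <- auc_ref * (1 + 400 * dose) + rnorm(n, 0, 10)
            raw   <- lm(auc_obs ~ intake)                  # the usual raw-units analysis
            ratio <- lm(I(auc_obs / auc_ref) ~ dose)       # the dimensionless version
            c(raw = summary(raw)$r.squared, ratio = summary(ratio)$r.squared)
            # The ratio model picks up the signal much more cleanly, because
            # body-size variation has been divided out instead of left in as noise.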

        • The concept of mg/kg of body weight is not as relevant for drug dosage as you believe. Actually a correction for body surface area is more frequent, but unfortunately that’s not dimensionless.

        • After reading Daniel’s and Carlos’s comments, I decided to try to refresh my memory on body mass vs body surface area (in particular, my recollection that BSA is defined in more than one way), and came upon this interesting article: http://theoncologist.alphamedpress.org/content/12/8/913.full.

          It has some good points and some not so good points — one of the good points being it is fairly accessible to the non-specialist; one of the bad points being (unsurprisingly) that the discussion of statistics is not very sophisticated.

        • Carlos, yes, I have kind of heard about that. The correction for surface area may well be secretly motivated by (or at least explainable in terms of) transport across membranes such as the intestines or the kidneys.

          If you were to take a full grown human and scale them down in every way by multiplying every length in their body by 0.9 you would reduce their intestinal surface area by a factor 0.9^2 and you would also reduce their skin surface area by the same 0.9^2. If you model the absorption of the drug in terms of a difference in concentration between inside the intestines and outside the intestines… you would find that all else equal the rate at which the drug crosses the intestine would be proportional to the surface area.

          Implicitly, then, there’s a dimensionless group C * v * A/V, which describes the process in terms of a desired drug concentration C, a transport rate per unit concentration difference v (with dimensions of length / time / concentration), a surface area of absorption A, and a volume of the body V.

          Since C is some arbitrary desired concentration and v is a property of the membranes, what matters is A/V; and since V is directly proportional to mass, if you have a specified body mass, then when you model this dimensionally you will be “correcting for surface area.”

          Mathematically, however, what sits inside the process is a more sophisticated, dimensionally homogeneous way of looking at the problem.

          I’m not saying that this is the only effect, but I am saying that the mathematical symmetry of dimensionlessness exists as a thing, regardless of whether people acknowledge it or not.

        • @Martha: thanks for that reference, interesting.

          From the dimensionless-group argument I just gave, BSA-based dosing would be expected to matter primarily when the drug is delivered orally, primarily excreted by the kidneys, or has to diffuse from one tissue into another across a membrane (for example, into the synovial fluid or some such place where there is little vasculature).

          As far as I know most cancer drugs are delivered IV and so absorption through the intestine isn’t the main issue, and tumors are often highly vascularized, so the distance from the bloodstream is going to be small and diffusion across the vascular membranes will not be the dominant factor.

          I don’t know whether cancer drugs are primarily eliminated in the urine or metabolized in the liver or what.

          So, somehow it isn’t surprising to me that BSA based dosing isn’t really all that appropriate for cancer drugs, but it wouldn’t be surprising to me at all if BSA based dosing strategies were important for orally administered drugs.

        • In my analysis I neglected a factor T, which is a kind of duration of treatment.

          T*C represents an “equivalent area under the curve of concentration vs. time,” which is the “treatment dose,” so the group should have been:

          T * C * v * A/V

          where v is a property of the membrane describing a transport velocity (length / time / concentration difference),

          (A/V) is the property that summarizes the geometry of the body you’re treating,

          and with the factor T, the group is dimensionless.
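
          A little bookkeeping check of that claim, in R, treating concentration as amount per volume and v as length / time / concentration, per the description above:

            # Represent each quantity by its exponents over base dimensions
            # L (length), T (time), N (amount); exponents add under multiplication.
            dims <- function(L = 0, T = 0, N = 0) c(L = L, T = T, N = N)
            T_trt <- dims(T = 1)                 # treatment duration
            C     <- dims(L = -3, N = 1)         # concentration = amount / volume
            v     <- dims(L = 4, T = -1, N = -1) # length / time / concentration
            A     <- dims(L = 2)                 # absorbing surface area
            V     <- dims(L = 3)                 # body volume
            T_trt + C + v + A - V                # all zeros => the group is dimensionless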

        • Also, as a general physical concept, you’d expect that when a drug is rapidly absorbed and transported and primarily eliminated through metabolism, the dosing would depend primarily on the mg/kg ratio, as this indexes the peak concentration in the body under those conditions.

          When the absorption is delayed and happens through a membrane process, and elimination is through excretion either into the bowels or through the kidneys, then transport across surface areas is important and the TCvA/V group would be of interest in addition to the mg/kg group. If both absorption and elimination are across membranes, you’d expect the balance of two groups, TCv1A/V and TCv2A/V, to be important; these could be reworked into TCv1A/V and (v1-v2)/v1, both of which are dimensionless.

          When the drug is absorbed through time, and primarily metabolized, you’d expect TCvA/V to be involved in the absorption and mg/kg to be involved in the metabolism rate. If people gain weight due to adiposity and this doesn’t increase their liver size, then you might also expect something like m/M to be important where m is the liver mass and M is the body mass.

          Thinking things through with dimensional analysis *will* give useful insight compared to grouping people by age or by weight and “correcting for” surface area or adiposity, etc.

        • If dimensionless-group-based analysis had been widely taught to everyone for the last 60 years and medicine and psychology still resisted using it, you’d have a Chesterton’s-fence-type argument… But in the absence of basically any education on the topic outside physics and engineering, I think it’s sufficient to explain away the lack with “people just don’t know anything about the technique.”

        • From the point of view of how physicists (at least) use the concept, something like mg/kg is very weak. Instead, you identify several potentially important variables and look for some combination of them (including their powers) that is dimensionless.

          Here is one example, relating to gravitational waves:

          “In general relativity the generation of gravitational waves is given quantitatively by combining the third time derivative of the quadrupole moment described above, with the appropriate coupling constant. The latter can only depend on the constants G and c (for classical waves) and by dimensional analysis this constant must have the form G/c^5.”

          [http://www3.mpifr-bonn.mpg.de/staff/sbritzen/gravwaves]

          This illustrates how one might learn that interesting but unexpected powers of variables could play a useful role.
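
          For the curious, the dimensional bookkeeping behind that quoted claim goes like this (standard unit-checking, nothing specific to the linked page):

            % [G] = m^3 kg^{-1} s^{-2}, [c] = m s^{-1}, [\dddot{Q}] = kg m^2 s^{-3}
            \[
            \Bigl[\tfrac{G}{c^{5}}\Bigr]
              = \frac{\mathrm{m}^{3}\,\mathrm{kg}^{-1}\,\mathrm{s}^{-2}}{\mathrm{m}^{5}\,\mathrm{s}^{-5}}
              = \mathrm{m}^{-2}\,\mathrm{kg}^{-1}\,\mathrm{s}^{3},
            \qquad
            \Bigl[\tfrac{G}{c^{5}}\,\dddot{Q}^{2}\Bigr]
              = \mathrm{m}^{-2}\,\mathrm{kg}^{-1}\,\mathrm{s}^{3}
                \cdot \mathrm{kg}^{2}\,\mathrm{m}^{4}\,\mathrm{s}^{-6}
              = \mathrm{kg}\,\mathrm{m}^{2}\,\mathrm{s}^{-3} = \mathrm{W},
            \]

          so the coupling G/c^5 is exactly what turns the squared third time derivative of the quadrupole moment into a radiated power.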

      • Are Bayes and SAS really completely incompatible? Must the manual be thrown out? Is Stan the only way to do legitimate analysis? I hate to think the answer to these questions is yes. Surely even good descriptive analysis has its place, and as many of the discussions on this site have said, classical statistical analysis is not worthless – it is often misused and more often misinterpreted.

        I don’t see how rejection of NHST means that we must abandon all approaches other than Bayesian. And I don’t think drawing such a bright line between “legitimate” and “illegitimate” analysis is helpful. It is sort of like saying because so many have adopted English, there is no reason for anybody to speak French any more (alright, maybe that is a poor example….).

        • SAS offers the MCMC procedure in its SAS/STAT module. It is as easy and intuitive a framework for fitting models as any I’ve come across, and it handles almost any of the models in Gelman’s books. I use it literally all day long and can move quickly among data management, reporting, and Bayesian analysis in the same session. Everyone I work with appreciates its capabilities.

          This is not to say that SAS is always the best option, but to call for us to throw away our SAS manuals in favor of Stan is unnecessary hyperbole. People have strong, and frankly tedious, opinions on software issues.

        • “Burn the SAS manual” was intended as hyperbole and metaphor, not literally. In other words, stop looking for a canned PROC that will “do the canned X analysis of Y,” which is a lot of what SAS makes available to you as far as I can tell. (I left SAS behind for a pre-1.0 version of R back in the ’90s, so I’m no SAS expert. Still, as a metaphor for “stop looking for a canned procedure,” I think it works.)

        • Also, the reason to read the Stan manual isn’t so much to learn how to use Stan as a piece of software; it’s that the Stan manual is also secretly about building and interpreting models. Under chapter II, “Programming Techniques,” the sub-chapters are things like:

          Regression Models, Time-Series Models, Missing Data & Partially Known Parameters, Truncated Or Censored Data, Finite Mixtures, Measurement Error and Meta-Analysis, Latent Discrete Parameters, Sparse and Ragged Data Structures, Clustering Models, Gaussian Processes…..

          The fact that they give you specific tools to encode these models is in some sense secondary to the fact that they give you perspective on how to handle these issues.

          So, sorry if I was a bit obscure in my hyperbolic metaphor.

        • I agree completely, and I think this point should be emphasized. The Stan manual, along with BDA and Spiegelhalter’s ‘Bayesian Approaches to Clinical Trials’, is always open for reference on my desk.

        • When I was a student, one of my stats profs recommended learning SAS as it would guarantee a job. I don’t know if that’s entirely true, but I highly respect this person’s opinion.

        • That’s sort of a corollary to the “legacy reasons / lock-in / switch-over costs” argument, isn’t it?

          If I’ve used SAS all my life, I probably want to hire people who know SAS ’coz I’m most comfortable using SAS on my projects.

        • Based on my experience, the “it would guarantee a job” may mean “There are a lot of ads for experienced SAS programmers.”

        • Way too much for independent users. They have free online versions as well, but I haven’t experimented with them.

        • Unless you are a legacy user, and the switch-over costs are high, is there a good reason for learning SAS over R?

          It’s not a rhetorical question. I’m genuinely interested in knowing.

        • I am not an experienced R user, so I can’t give a fair comparison. As I said in a comment above, I have found that there are many employment opportunities for experienced SAS programmers. I can’t say quite the same for R programmers, but I haven’t paid that much attention.

          A SAS license has always been available to me in all areas of academics/post-docs/employment, so cost was never an issue.

          SAS offers an online version for free, but I haven’t experimented with it.

        • From a couple-decades-old, fairly poor knowledge of SAS: SAS itself is not Turing complete, so it’s impossible to get anything complicated done in it directly. Base SAS is more or less a bunch of instructions that transform one data set into another (called a “data step”), plus a bunch of PROCs that do fixed, canned analyses of data sets.

          On top of SAS is a Turing-complete macro language, which will *write a potentially unbounded sequence of base SAS instructions* for you that, when executed, will eventually do the thing you want done. Looping constructs and so forth are typically done in the macro language.

          In addition, there’s the issue that SAS is organized into PROCs (kind of pre-programmed procedures). Some of these PROCs themselves contain a whole language for describing a particular kind of computation; PROC MCMC, for example, has inside it a language for describing an MCMC model, much the way rstan uses the Stan language to describe a model…

          With all that complexity comes a flavor of FORTRAN 60 and a per-seat license cost that is enormous, something like maybe $50,000 for a full set of SAS tools per SEAT (user), and starter packs of reduced functionality are still thousands of dollars.

          Back in the 90’s the people I worked with used it because it could do lots of canned regressions like linear regression and generalized linear regression on “big” datasets containing maybe a couple hundred thousand to a couple million observations using off-line on-disk techniques that didn’t choke back when 128MB of RAM was a lot. So for example you could read the whole stock market ticker feed and compute summary statistics on every stock that traded for the day without grinding your machine to a halt.

          These days I load 32 Gigabytes of data into RAM, and I run MariaDB queries on public use microdata sets from the american community survey containing tens of millions of rows of survey responses doing complex joins that complete in a second or two on a laptop. So the offline batch processing type organization of SAS is of less importance probably.

          I’m not going to bash SAS as a piece of software; that wasn’t my intention originally (see elsewhere). But I think you can form your own answer to this question from the above, and I don’t think the description is grossly incorrect, even though it is definitely out of date.
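
          (For anyone curious how that looks from R: a hedged sketch using DBI + RMariaDB, where the database, table, and column names are made up for illustration and credentials are assumed to live in ~/.my.cnf.)

            library(DBI)
            con <- dbConnect(RMariaDB::MariaDB(), dbname = "acs")  # hypothetical local database
            res <- dbGetQuery(con, "
              SELECT state, COUNT(*) AS n, AVG(household_income) AS mean_income
              FROM pums_households            -- hypothetical table of ACS microdata
              GROUP BY state")
            dbDisconnect(con)
            head(res)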

        • @Daniel

          >>>These days I load 32 Gigabytes of data into RAM, and I run MariaDB queries on public use microdata sets from the american community survey<<<

          Just curious, how much RAM does your machine have? I've been agonizing recently about what config. my next laptop should be.

          Also, you said “MariaDB queries”. Is that the same as SQL queries? Did you say MariaDB ’coz the specific American Community Survey dataset happens to *reside* on a MariaDB server? Or…?

        • Rahul: MariaDB is a fork of MySQL with some improvements, developed by the original developer after he left Oracle. I have it on my desktop machine and that’s where I loaded up the datasets after downloading them as text/csv type files.

          My desktop and laptop machines have 32 GB of RAM, so I was exaggerating a little bit, but I have loaded up, say, 25 GB data tables. If you really do need 32 GB+ datasets, it’s not hard to spec out a 64 GB desktop machine, but the few times I ran out of RAM it was because I was doing something stupid.

          These days if you actually needed even bigger “big iron” you can still get it in a single pedestal server case for quite cheap. For example something like a dual LGA 2011 and 512 Gig of ECC RAM and a RAID array of 8 disks at 4TB each + a 512Gig flash drive for the OS and cache. Choose a Motherboard like this:

          https://www.newegg.com/Product/Product.aspx?Item=N82E16813182967

          and fill it with WD Red spinning disks and ECC RAM DIMMs. That machine would be a monster and cost you less than $10k, maybe less than $5k.

        • My personal opinion isn’t so much that SAS is incompatible with good analysis; it’s that SAS is designed to facilitate, and make very easy, a button-pushing default analysis. I’m sure you CAN use SAS for something good, just like you CAN use Excel, but I don’t think I’d recommend it based on my limited experience, and the manual will tend to guide you toward one of the many canned PROCs, which is really why I say “burn the manual.”

          As far as I’m concerned, Bayes vs. frequentism has a pretty bright line. If it doesn’t make sense for you to say, “as far as I know, the entirety of my knowledge about this process is that it might as well be a random number generator, but I have a lot of data, so I can fit a frequency model and try to predict the frequency of future occurrences of various types” (that is, if you aren’t in that situation), then thinking about your problem as a frequentist problem is like trying to tighten a screw with a hammer.

        • I believe your complaint is about how statistical computing is taught, not about a particular software product.

          I couldn’t agree more.

        • Tangential remark, but no matter how many flaws of Excel I’m made aware of, I still tend to use it a great deal.

          Sure, I’m not one of those crazies that will soil their hands writing VB macros or have a dozen linked spreadsheets. Yet, I find it hard to deny that the darn tool must be getting something right for it to turn out to be so useful.

        • The thing that Excel gets right is that it integrates mathematics, data, and user interface. If what you need is to be able to “play with numbers on screen” (ie. answer questions like “what would it take in terms of changing X and Z to make Y come out to something bigger than 1”) then Excel does that for you in an interactive sense. It lets you literally sit there and tweak things to get a feeling for how they change. That’s valuable, it’s just that there are many many situations where you can go wrong with Excel.

          If you already know how to use say SQL and R, then you can drop back to Excel when that’s all you need (I do it all the time for things like analyzing fairly simple cash flows in terms of present value). But if you don’t know how to step out of Excel and use a different tool, the fact that Excel maybe CAN do a lot more tends to cause projects to creep out of control and become totally unmaintainable horrible nightmares of hidden bugs. For example, if you are doing ODE solutions using Euler’s method in a column of 2000 rows of numbers and then asking Excel’s GoalSeek to tweak a coefficient until you satisfy the far-end boundary condition… you are probably doing it wrong.
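
          To make that concrete: the Euler-column-plus-GoalSeek pattern is a shooting method, and a few lines of base R do the same job reproducibly (toy problem and numbers invented for illustration):

            # Choose k so that y' = -k * y with y(0) = 1 ends at y(1) = 0.5.
            euler_end <- function(k, n_steps = 2000) {
              dt <- 1 / n_steps
              y <- 1
              for (i in seq_len(n_steps)) y <- y + dt * (-k * y)  # Euler step
              y
            }
            k_star <- uniroot(function(k) euler_end(k) - 0.5, c(0, 10))$root
            k_star   # ~log(2) = 0.693, matching the exact solution y(1) = exp(-k)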

        • Totally agree. +1

          You need to know when to ditch Excel. My rules of thumb have been: if you need to write a macro, cross-reference data from multiple sheets, or use nested conditional formulas, it’s about time to switch tools.

        • Also, government agencies, please publish your data as CSV files; don’t put stuff into big, nicely formatted, multi-column, cross-tabulated Excel sheets. I mean, sure, you could generate those too for people who just want to look at the data for a few minutes, but for those of us who want to suck your raw data down into a SQL table and analyze it… god, I hate that.

        • I wade further into this discussion with just a little apprehension….
          I guess I don’t see what’s wrong with making analysis easier to do. Sure, I can see problems if people have no idea what they are doing or if they don’t know the limits of their knowledge (which I think I do), but I don’t think making analysis easier = making it poorer. Personally, I use JMP. Occasionally it does not do what I need and I am forced to use something else (for example, running R from within JMP), but it does almost all of what I want and does it much more simply and visually. JMP is rarely mentioned anywhere (despite its 40-year existence), at least partially because SAS would rather keep it a secret (so as not to cannibalize its expensive flagship product). In my experience JMP makes obtaining meaningful insights from data far easier, more intuitive, and more productive than the alternatives.

          I hope we don’t need to get into a debate about which software program is best – I don’t think there will be any resolution and precious little insight produced. What might be worth discussion is whether analysis must somehow be difficult in order for people to do it. More specifically, if you don’t have to do any programming, can an analysis be good/useful? I don’t think programming is necessary or sufficient for good analysis – I know some people don’t agree with that position. But I frankly don’t understand it. We use word processors to write documents – why not require writers to program their own written documents (there are languages for doing that)?

        • >>>I hope we don’t need to get into a debate about which software program is best – I don’t think there will be any resolution and precious little insight produced.<<<

          I think the vim vs emacs debate was finally settled by way of annual paintball battles.

        • Now I’m sorry I phrased my hyperbole in terms of SAS :-) I could just as easily have used a pocket-calculator metaphor.

          It wasn’t meant to be a point about software products; it was meant as advice to think first and then go find whatever tool will let you implement your thoughts, rather than trying to find some pre-canned buttons to push.

          The same button-pushing stuff goes on all the time in bioinformatics. There are whole expensive suites of tools designed so that biologists with no training can run canned procedures to produce numbers. My wife went to an “expert” hired by her med school to get some advice on how to use the one they have a license for. The person basically said things like “first you set the number in this box to 1, then you check this box, then you pull down this menu and select this option, then you press Go, and then you come back and get your report, which will give you your p values so you can publish your results.” I just sat there and kept my mouth SHUT (later, my wife told me she could figuratively hear my teeth grinding).

          I’m sorry, but that isn’t “making analysis easier”; that’s selling coconuts with straps pre-installed so you can put them on your ears for cargo-cult purposes.

        • And then there’s a biologist I knew who was using a program that had a command inappropriately labeled “hyperprior” (as I recall, it allowed a choice of options for the parameter for a certain prior, including one option that allowed using a certain type of hyperprior), and so she confused parameter and hyperprior.

        • @Daniel

          I was recently speaking to a biologist friend who wants to get started in bioinformatics. He was confused about which tool might be best.

          Is R (& the Bioconductor set of libraries) a good choice?

          Thoughts? Comments? How mature is the bioinformatics toolkit within R? Is there some other tool that may make more sense?

          Maybe you can pick your wife’s brain too for me. :)

        • @Rahul: I think Bioconductor is the right place to start. I have a friend who does bioinformatics for a group at Cedars Sinai hospital here in the LA area, and his group uses a mixture of specialized tools for particular batch processing tasks, and R for all the standard data processing and glue code.

        • Is R (& the Bioconductor set of libraries) a good choice?

          I have never had a good time with Bioconductor packages (dependency hell, errors while running 5-line examples from the docs, etc.), to the point where now I don’t even try; I just roll my own if that is the only other option.

        • @Anoneuoid

          That’s interesting to hear. Any alternative packages you like when Bioconductor sucks?

          Or do you feel R itself isn’t the best way for bioinformatics workflows?

        • Interestingly, the more I get investigators to think through the kinds of issues that Daniel points out above, the easier the analysis becomes.

        • +1. That is more or less the point of the thinking-through-the-issues part. You then know what you’re focusing on and can analyze the things that are relevant.

          If you just “let the data speak for themselves,” so to speak, you find out that somewhere “in the data” is the fact that every Thursday Joe Schmegegge buys a Pop Tart at the corner store outside his place of work between the hours of 11:30 and 12 because he likes flirting with the cashier… Is it relevant to your theory of the time course of type 2 diabetes in an urban population of sedentary desk workers? Maybe, or maybe not, but compared to a Poisson process for Pop Tarts it’s far too regular, p < 0.001.

        • “A theory of the atheoretical theory of learning. I love it.”

          GS: More the philosophy of “…the atheoretical theory of learning.” Plus, as Fred Skinner points out right away, and had to point out many times, he was talking about a particular kind of theory.

    • My argument would be that the limits (currently) are dealing with sampling weights. I know there are some solutions (forget about weights, use MRP?), but this can be hard to do in many national surveys, which only provide final sampling weights (as opposed to weights at each level + poststratification weights), and might not give sufficient detail because of privacy concerns about which cell a respondent falls into. I know this is an active area of research (I like this article http://www.stat.columbia.edu/~gelman/research/published/STS226.pdf, and I was happy to see that the postdoc position Andrew just posted is in part to help figure out better models for doing this!), but unless I’m missing a big piece of literature (which is certainly possible) I don’t know that we are there yet.

    • Seth asks about limits of multilevel models. I have a question that may be related — How do multilevel models deal with a situation like research on adverse effects of drugs? Here there are two goals.

      First, to find adverse effects where they exist. For example, Drug A sometimes causes heart arrhythmia.

      Second, to show adverse effects are probably absent when they don’t exist. For example, although there is an elevated rate of lung cancer in patients taking Drug A, after controlling for age and smoking status, we find that taking Drug A is not associated with any substantial increase in risk of lung cancer.

      I don’t understand how multilevel models deal with the second goal. Also, sorry if I’m being completely clueless. Clearly, I don’t know much about this.

      • Multilevel models are more or less a way to help quantify uncertainty. They’re a mathematical tool for calculating, not a magic bullet for design of experiments or whatever.

        In the context of “after controlling for age and smoking status, we find that taking Drug A is not associated with any substantial increase in risk of lung cancer,” what is necessary is first to decide that the only things other than Drug A that might be affecting patients’ lung cancer risk are smoking and smoking-related effects.

        Then, your causal model is Cancer Rate = f(Drug Dose, Smoking)

        Once you accept that causal model, you can use multilevel models to do the calculations, helping you get relatively precise estimates of, say, the coefficients that describe the drug-dose dependency. When you get precise estimates and they are concentrated near zero, then you can make some claims about the drug not affecting the cancer rate. All of it is always conditional on the model assumptions. If you have several candidate models, you can embed them into one big meta-model to help distinguish between those possibilities.

        Multilevel models don’t make any causal inference magic happen until after you accept some form for the causal description. Then, they help you make maximal use of the available data within the context of that model.
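
        A sketch of that “precise estimate concentrated near zero” idea in R, with simulated data (rstanarm is one convenient way to get the posterior; the setup and numbers are invented):

          library(rstanarm)
          set.seed(4)
          n <- 5000
          smoking <- rbinom(n, 1, 0.3)
          dose    <- runif(n, 0, 2)                           # drug dose, arbitrary units
          p       <- plogis(-4 + 1.5 * smoking + 0 * dose)    # truth: dose has no effect
          cancer  <- rbinom(n, 1, p)
          fit <- stan_glm(cancer ~ dose + smoking, family = binomial(),
                          data = data.frame(cancer, dose, smoking), refresh = 0)
          posterior_interval(fit, prob = 0.95, pars = "dose")
          # A narrow interval hugging zero is what backs a "no substantial increase
          # in risk" claim; always conditional on the assumed causal model.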

  2. It never worked. The illusion that it has any use is dissipating as people trained before NHST retire or die. They were in senior positions and could still demand that real science get done despite the growing popularity of NHST.

  3. Andrew, from a scientific perspective you are right about the low hanging fruit having been harvested. But from a career perspective you can simply pick up one of those harvested fruit, give it a different name, and pretend that you’ve just discovered something novel. TED talks and fame will follow. This is how the world ended up with grit and many other “novel” constructs in psychology.

  4. Never mind that illicit animal called NHST (where one allegedly goes straight from a stat-sig result to a research claim, and no proper alternative is stated). Statistical hypothesis testing (which many, including Neyman, equate to significance testing) isn’t going anywhere, not if you care to falsify and to employ error probabilities to evaluate well-testedness. Ironically, though, on this issue the big complaint by critics of testers over many years has been that the tests can be made so sensitive that you’re picking up on trivial effects. This is tied to the Jeffreys-Lindley “disagreement.” (Another example of a criticism that comes in entirely opposite forms: there’s a high “prevalence” of true nulls (Ioannidis), and all nulls are false.) A statistically significant result needs to be interpreted in a way that takes the sample size into account, as with confidence intervals. That’s what severity does. (Others recommend decreasing the required p-value as n grows.)

    Of course, in physics minuscule discrepancies are of enormous importance, and physicists continually expend great energy detecting known effects (e.g., the deflection effect, gravity waves) at ever more precise levels. More generally, in the good sciences, if you have a genuine experimental phenomenon, you show it by changing and impinging on other known effects. That’s how you build triangulation. Size alone isn’t what matters.

  5. > But . . . we’ve grabbed all the low-hanging fruit.

    No critique of the main thrust of the post, but this is probably not true in general. The implication is that the remainder of the gaps in our scientific knowledge consist of small individual effects. I can think of several areas where that’s not the case.

    Of course, that says nothing against the methodological recommendations here. But I do think everything-is-a-small-effect-nowadays is a misleading way to think about what’s going on.
