I agree.

For example, suppose we characterize the current standard approach as:

Approach 0: Compute classical confidence intervals and then report YES THERE’S AN EFFECT if the interval clearly excludes zero and report MAYBE THERE’S A SMALL EFFECT if the endpoint of the interval is very close to zero and report THERE’S NO EFFECT if zero is well within the interval.

Now consider the following reform:

Approach 1: Use the same classification rule as above but with Bayesian posterior intervals. I think this approach would be an improvement, because it lets us include prior information. But it still has major problems.

Then we can move to:

Approach 2: Do Approach 1, but instead of looking at comparisons or estimates one at a time, look at all of them at once, if possible embedding them in a hierarchical model. I think this would be a further improvement, because it uses more information and helps us avoid selection bias relating to forking paths. But it still has the problem that it’s extracting certainty from uncertainty.

So this moves to some sort of:

Approach 3: Do good modeling and report uncertainty intervals conditional on the model, but don’t use overlap-with-zero as a way of making strong deterministic-sounding statements.
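
To make the contrast concrete, here is a minimal sketch in Python (the numbers, the 1.96 multiplier, and the "close to zero" tolerance are all made up for illustration; this is a sketch of the reporting rules, not anyone's actual procedure):

```python
import numpy as np

def approach_0(estimate, se, z=1.96, tol=0.25):
    """Approach 0: collapse a classical interval into a categorical verdict.
    The z multiplier and the 'close to zero' tolerance are arbitrary choices."""
    lo, hi = estimate - z * se, estimate + z * se
    if lo > 0 or hi < 0:                      # interval excludes zero
        return "YES THERE'S AN EFFECT"
    if min(abs(lo), abs(hi)) < tol * se:      # an endpoint sits very close to zero
        return "MAYBE THERE'S A SMALL EFFECT"
    return "THERE'S NO EFFECT"                # zero inside the interval, not near an endpoint

def approach_3(posterior_draws):
    """Approach 3: report the uncertainty interval itself, conditional on the model,
    with no overlap-with-zero verdict attached."""
    lo, hi = np.percentile(posterior_draws, [2.5, 97.5])
    return f"posterior mean {np.mean(posterior_draws):.2f}, 95% interval [{lo:.2f}, {hi:.2f}]"

# A made-up posterior whose interval overlaps zero: Approach 0 turns it into a
# deterministic-sounding verdict; Approach 3 just reports what the model says.
draws = np.random.default_rng(0).normal(0.15, 0.10, size=4000)
print(approach_0(draws.mean(), draws.std()))
print(approach_3(draws))
```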

I was thinking the other day: it’s great that there is a group of people with extraordinary statistical expertise who can identify problems with NHST and suggest alternatives; but if any method is going to trickle down into daily use and standard practice, it’s going to be used by a much broader group of people with substantially less statistical expertise. Under those conditions, there will always be people who just want to put guts in the machine and get sausage out, and not worry too much about what happens inside. What will the shortcomings of the alternative methods be under those circumstances?

Not defending NHST by any means. But the more widely any method is used, the more widely it will be abused. So that’s something to consider.

1. Run a bunch of analyses on everything

2. Report which ones have p < 0.05

4. Report LS means or sample statistics for the responses passing #2.
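
As a rough illustration of how that procedure behaves on pure noise (a hypothetical simulation; the 100 outcomes and 20-per-group sample sizes are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_outcomes, n_per_group = 100, 20      # hypothetical: 100 responses analyzed, all true effects zero

significant = []
for _ in range(n_outcomes):
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)          # same distribution: no real effect
    t_stat, p_value = stats.ttest_ind(a, b)
    if p_value < 0.05:                             # step 2: keep whatever clears the threshold
        significant.append(b.mean() - a.mean())    # later steps: report its sample "effect"

print(f"{len(significant)} of {n_outcomes} comparisons came out significant")
print("reported differences for those:", np.round(significant, 2))
# Around 5 of 100 pass by chance alone, and the differences reported for the
# survivors are necessarily large in magnitude, even though every true effect is zero.
```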

I see a lot of sophisticated stat types getting on Andrew’s case for “strawman NHST”, but what he is describing is rampant and widespread. Besides, I have never seen a rigorous research program out in the wild cleaving to a Neyman-Pearson decision framework consistently for long enough for type I error rates to matter…

If researchers had zero personal incentives to do this, then sure… But in the presence of career incentives to publish stuff… then the literature would be totally polluted with bullshit.

hey wait…

I don’t agree that that is how significance testing is typically used. The wording is odd to me. If I said there is evidence from a well-designed experiment (or experiments) to suggest a coin is unfair, I am not stating that as a truth with capital-T certainty, but as evidence for it, at a certain alpha level, and I allow for errors and discuss any assumptions.

“A quick calculation finds that it takes 16 times the sample size to estimate an interaction as a main effect, and given that we are lucky if our studies are powered well enough to estimate main effects of interest, it will typically be hopeless to try to obtain the near-certainty regarding interactions”

Then change the design to not do interactions, and/or get a larger sample (may need to save up some $). That still might be preferable to using a prior to get at the interaction. And did that 16 come from replacing uncertainty with certainty? ;)

Justin

I once spent ~2 weeks trying to find a definition for “housing unit” (roughly house/apartment, maybe?). It appeared that in the USA & Canada it was a case of “I know one when I see it”.

The question is, would this benefit have occurred without RCTs, just by clinicians and researchers trying different things and publishing their qualitative findings? I have no idea (by which I really mean I have no idea, not that I’m saying that RCTs have no value).

Same for social science. OMG I fear the day when there is a precise definition of “food desert”; when we know what “quality preschool” is and what it does; when it’s known that we’ve become an “equal” society.

Thousands – nae! Tens of thousands! – would be out of work!

Save NHST! Save the economy!

The amelioration of symptoms and the prognosis of almost every common disease have improved since I started clinical medicine in 1987; progress built on very many RCTs, none of them perfect but together forming a tapestry of overlapping evidential strands that can be read.

Fair enough. All such advice is context dependent.

Like, for example, suppose you know women are different from men, and body weight is important in a medical treatment… So you split by women and men, and you put them into 3 groups: weight 1, weight 2, and weight 3… so you have 3 * 2 = 6 different groups, then you randomize within each group between drug A and drug B… Now you decide you want to have, say, 100 people in each category for some reason, so you need 1200 people, which means you need to recruit somewhat more than that because you’re demanding balance among all the groups… maybe you need to see 2000 people, sort them, and put them all into the various groups. Now your medical treatment is $5000 per patient and you’ve got 1200 people: $6M to run the trial. Sure, this is doable for some people; for others it’s 2 orders of magnitude more money than they have.
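
A back-of-the-envelope version of that arithmetic (the numbers are the ones in the comment; the assignment function is just a sketch of randomizing within each stratum):

```python
import itertools
import random

sexes = ["women", "men"]
weight_groups = ["weight 1", "weight 2", "weight 3"]
arms = ["drug A", "drug B"]
per_cell = 100
cost_per_patient = 5_000                       # dollars, as in the comment

cells = list(itertools.product(sexes, weight_groups, arms))
n_needed = per_cell * len(cells)               # 2 * 3 * 2 cells * 100 = 1200 patients
print(f"{len(cells)} cells, {n_needed} patients, ${n_needed * cost_per_patient:,} to run the trial")

def randomize_within_stratum(patient_ids, rng=random.Random(0)):
    """Randomly split one (sex, weight-group) stratum between the two arms."""
    ids = list(patient_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {"drug A": ids[:half], "drug B": ids[half:]}
```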

I think this would most likely be a problem with trials that use simple randomization, and it would largely depend on the size of the study; it would also give you large standard errors that reflect the uncertainty. But then again, most experienced trialists and statisticians avoid simple randomization for exactly this reason, because of the potential for imbalances, and instead focus on blocking and stratifying based on prior knowledge of potential confounding variables.

We can run an experiment where, for example, we use some prior knowledge and decision theory to choose a treatment, then observe the outcome and model the treatment response using known confounders. You can’t eliminate all confounders with large sample sizes using this method, but you can learn a lot, and in practice you can’t eliminate confounders with high sample sizes in RCTs either, because you never get to a large enough N anyway due to cost constraints, etc.

I agree that 16 is not a magic number; it’s the product of assumptions. The larger the interactions are, the smaller this number will be. I don’t think that my number of 16 is “B.S.”; it’s clearly derived from its assumptions.

Just one thing: In your comment, you write, “in a simple continuous-mean comparison from a 2×2 orthogonal randomized design, the standard error (SE) for the interaction contrast will only be double the SE of a single main-effect contrast from the trial.” I agree. But that’s what I say too! The factor of 16 in sample size comes from two factors: the factor of 4 in sample size arising from the factor of 2 in SE that you mention, and my assumption that interactions are half the size of main effects, which requires halving the SE again and hence another factor of 4. If you’re disagreeing with me on the factor of 16, it’s because you’re saying that your interactions of interest are more than half the size of main effects. It’s hard to know about this, but I agree that the number we get will depend on this assumption.

We seem to be learning via Mendelian Randomization that there are few meaningful subgroup effects in medicine (very few piranhas swim in biological systems).

See Professor George Davey Smith, “Some constraints on the scope and potential of personalised medicine”: https://www.youtube.com/watch?v=uiCd9m6tmt0&t=2467s

Consider that in a simple continuous-mean comparison from a 2×2 orthogonal randomized design, the standard error (SE) for the interaction contrast will only be double the SE of a single main-effect contrast from the trial, meaning that only 4 times the sample size would be needed to get the interaction SE down to what the main effect SE was. For tests, I published some not-so-quick calculations long ago for binary-data settings of interest in my applications, which in most cases gave sizes much less than 16 times those for main effects (Greenland S (1983). Tests for interaction in epidemiologic studies: a review and a study of power. Statistics in Medicine 2:243-251), similar to what others got in the same type of setting.
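
A quick simulation of that continuous-mean setting (hypothetical data from a balanced 2×2 design with all true effects zero; this is not the binary-data calculation in the cited paper) shows the factor of 2 directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_cell, sigma, reps = 50, 1.0, 20_000      # made-up design: 50 per cell, unit residual SD

main_contrasts, interaction_contrasts = [], []
for _ in range(reps):
    # Cell means of a balanced 2x2 randomized design; all true cell means are zero.
    m = {cell: rng.normal(0.0, sigma, n_per_cell).mean()
         for cell in ("a1b1", "a1b2", "a2b1", "a2b2")}
    # Main-effect contrast for factor A: difference of the two marginal means.
    main_contrasts.append((m["a1b1"] + m["a1b2"]) / 2 - (m["a2b1"] + m["a2b2"]) / 2)
    # Interaction contrast: difference of differences.
    interaction_contrasts.append((m["a1b1"] - m["a1b2"]) - (m["a2b1"] - m["a2b2"]))

se_main = np.std(main_contrasts)
se_inter = np.std(interaction_contrasts)
print(f"SE(main) = {se_main:.3f}, SE(interaction) = {se_inter:.3f}, ratio = {se_inter / se_main:.2f}")
# The ratio is ~2, so matching the main-effect SE takes 4x the sample size; if the
# interaction is further assumed to be half the size of the main effect, the factor becomes 16.
```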

I would not propose that observational studies are preferred to RCTs, but I do see them as lying on a continuum rather than being stark alternatives. Both types of studies have practical limitations that make them more similar than the NEJM article suggests. I often (too often these days) find myself looking for evidence on a medical condition or treatment, only to find that there are no reasonably close RCTs (especially given Andrew’s point about the need to see the effects on particular subgroups rather than looking for average effects), and that the observational data I would like to see is simply unavailable (although, in theory, much more observational data could be made available, were it not for the insane private insurance model we use in the US, with little standardization or sharing of data).

Do you perhaps mean “suspension of disbelief”?

I mean imagine if they’d said “we polled 2000 people and we’ve concluded given the possibility of nonresponse bias that Hillary Clinton will receive between 40 and 60% of the vote and has a 50% chance of winning.

Your grandmother could have told you that for free.”

Even if she died long before 2016? Even if she died long before I was born? By some miracle or prescience? ;~)

Also taken up as the claim that blinding via randomization is more important than ensuring that imbalances in important confounders rarely occur.

However, randomization is the only known cure for ignorance, with the main side effect being loss of precision.

Its value will depend on the subject matter, but in medicine Mendelian Randomization is making it clearer that, for treatment/exposure comparisons, it’s extremely important.

The key is finding out, by repeatedly causing something, what the downstream effects of causing that thing are. This information is valuable even if you don’t have asymptotically zero correlations. Ideally your model can include these correlations and correctly account for the size of your uncertainty.

For example, telephone surveys during the 2016 election cycle should have had a nonresponse bias built into their model… “we might be consistently seeing bias on the order of 5% in either direction” was a fairly safe bet given how polls work… But acknowledging it would have made the poll output worthless, and so to make money pollsters ignored this issue.

I mean imagine if they’d said “we polled 2000 people and we’ve concluded given the possibility of nonresponse bias that Hillary Clinton will receive between 40 and 60% of the vote and has a 50% chance of winning.

Your grandmother could have told you that for free.
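
A crude sketch of what carrying that bias term through to the reported interval looks like (the poll of 2000 and the roughly-5-point bias are the figures in the comment; treating the bias as a normal term with a 5-point standard deviation is my own made-up choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n, true_support, draws = 2000, 0.50, 100_000

# Sampling error alone: a poll of n = 2000 has a margin of error of about +/- 2 points.
sampling_only = rng.binomial(n, true_support, draws) / n
# Add a nonresponse-bias term "on the order of 5% in either direction".
bias = rng.normal(0.0, 0.05, draws)
with_bias = sampling_only + bias

for label, x in [("sampling error only", sampling_only), ("with nonresponse bias", with_bias)]:
    lo, hi = np.percentile(x, [2.5, 97.5])
    print(f"{label}: 95% interval {100 * lo:.0f}% to {100 * hi:.0f}%")
# The second interval is several times wider, roughly the "between 40 and 60 percent"
# statement in the comment, which is why acknowledging the bias makes the poll look uninformative.
```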

Also, there are potential issues with randomization depending on how the sampling frame is constructed: you could have a randomized selection procedure, but it might not be randomized with respect to the full population of interest (e.g., the famous example of randomized dialing of landline numbers).

“reliance on evidence from controlled experiments with random assignment and blinding when possible”: in other words, the controlled experiment is essential; random assignment and blinding are nice to have.

“Controlled” is more important than “randomized,” I think.

This all may be obvious to you, but unfortunately it’s not obvious to many researchers. Indeed, it wasn’t obvious to me until recently! The purpose of much of academic research and writing is to figure out and explore ideas, looking at them in enough different ways that the ideas come to seem obvious to us.

I will be very happy if we reach a time when the ideas of the above post are considered obvious by most statisticians, medical and social scientists, and quantitative analysts.
