It will seem to work great until you add the new feature or specification that switches d from positive to negative. Then how will you explain such a sudden reversal to the layperson who has been listening to your model and making decisions based on the idea that d is greater than 0.4? Progress will become your enemy.

]]>Michael:

This all seems a bit tricky and indirect to me. I’d rather just attack the decision problem directly, or give inferences for everything, than try to make hard decisions based on whether an effect size exceeds some threshold, as that does not seem to correspond to real losses and gains.

]]>I should say that this is very similar to reporting the false discovery rate, except that you’re describing the probability that any of the observed effects is truly meaningful, whereas I believe the FDR describes the probability that a particular effect observed to be statistically significant is truly not zero. So maybe what I’m proposing is really just telling people who are already reporting the FDR to stop drawing conclusions that are meaningless (there are no null effects) and arbitrary (there’s nothing magical about any alpha level) and selective (don’t just report significant comparisons).

]]>In fact, you’re not limited to reporting a single p-value. If you define D as a matrix of the effect sizes actually observed along with the number of comparisons that yielded an effect at least as large, then you can report a semi-continuous probability distribution across the range of D reflecting the probability of observing that number of effects at each level, under the assumption that none of the effects are greater than d among all of the comparisons. The drawback to this analytical approach would be that you could never say anything more certain than “Hey, guys, something important is very likely in this data set!” The benefit is that you’re using all of the data simultaneously and reporting all of your results concisely, while drawing conclusions that depend on only a small number of assumptions (i.e., you defined the minimum meaningful effect correctly, you’ve correctly specified the distribution of observed effects given none of them are real, you’ve accounted for the plausible ranges/ceilings of effect sizes, you’ve accounted for multicollinearity that renders some comparisons redundant through removal, composites, or weighting/pooling).

]]>Suppose you have 40 different things you can in theory estimate. Then you have 2^40 possible different subsets of these 40 things you could choose to include in your estimates. That’s about 1 trillion different subsets you could choose from…

Or you could just estimate all of them.

Now, comparing two things, you could do choose(40,2) = 780 different comparisons between two things… Second order comparisons (is a-b dramatically different from c-d) could be done in around choose(780,2) = 304000 different ways.

All of these things can be handled, correctly, by computing the posterior distribution over all 40 items.
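To make the counting concrete, here is a minimal Python sketch. It checks the combinatorics and then shows how any pairwise or second-order comparison can be read off one set of joint posterior draws. The draws are simulated stand-ins (independent normal noise around made-up means), not output from a fitted model, and the names `prob_greater` and `prob_second_order` are hypothetical helpers.

```python
import math
import random

n_items = 40
n_subsets = 2 ** n_items                 # every subset of the 40 estimands
n_pairs = math.comb(n_items, 2)          # pairwise comparisons
n_second_order = math.comb(n_pairs, 2)   # comparisons of pairwise differences

# ASSUMPTION: stand-in "posterior draws". In a real analysis these would be
# joint draws from a model fitted to all 40 items at once.
rng = random.Random(0)
n_draws = 2000
centers = [rng.gauss(0, 1) for _ in range(n_items)]
draws = [[rng.gauss(m, 0.5) for m in centers] for _ in range(n_draws)]

def prob_greater(draws, a, b):
    """Posterior probability that item a exceeds item b."""
    return sum(d[a] > d[b] for d in draws) / len(draws)

def prob_second_order(draws, a, b, c, d):
    """Posterior probability that (a - b) exceeds (c - d)."""
    return sum((s[a] - s[b]) > (s[c] - s[d]) for s in draws) / len(draws)

print(n_subsets, n_pairs, n_second_order)
print(prob_greater(draws, 0, 1))
print(prob_second_order(draws, 0, 1, 2, 3))
```

The point of the design is that nothing is selected ahead of time: each of the 780 or roughly 304,000 comparisons is just a different summary of the same draws.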

]]>Forgot to add this – not all choices can be important https://statmodeling.stat.columbia.edu/2019/06/02/still-at-work-on-the-piranha-theorems/

]]>Choices based on noisy data are risky – so just reduce the number of choices based on data that you can.

So place your bet, make some choices and hope to win (well not lose too badly) and partially pool all back towards a safe? bet.

But you can’t doubt everything at once and should not expect to learn anything reliable from a single study*.

* Perhaps an exception – Intravenous Immunoglobulin Therapy for Streptococcal Toxic Shock Syndrome—A Comparative Observational Study. Kaul et al.

I was originally pushed off the project by statisticians who argued they could find the best out of 2^k possible adjusted effect models for k possible confounders and investigators should only report that. Reviewers stepped on that and I was invited back on and they left.

I reported the minimum effect estimate. “Because of uncertainty about optimal model selection given the small sample size, we looked at the estimated IVIG effect for all possible models. The minimum estimated odds ratio for survival associated with an IVIG effect was 4.3.” So finding the best adjustment was not even necessary given low cost and fairly well understood side-effects – almost any effect was worth adopting. Of course with one observational study – no one really knows…

]]>Stuart:

It would depend on the example, but the short answer is that it would make sense to build these options into the model rather than to consider them as discrete alternatives. I think of the multiverse as more of a conceptual demonstration than a recommendation for any particular applied analysis.

I tried to locate the thread in which you commented on Samuel Huntington. I did not have all that much contact with him really. I recall him as a child, and I recognized several of his colleagues in his book Clash of Civilizations. Nor had I known that Huntington even knew Serge Lang. Lang’s book Challenges chronicles Huntington’s assignment of probabilities to political events in South Africa. It’s worth a read.

]]>The authors examine three key large-scale data sets, two from the United States and one from the United Kingdom, that include information about teenager well-being, digital-technology use and a host of other variables. Instead of running one or a handful of statistical analyses, they run all theoretically plausible analyses (combinations of dependent and independent variables, with or without co-variates) — in the case of one data set, more than 40,000. This allows the authors to map how the association between digital-technology use and well-being can vary — from negative to non-significant to positive — depending on how the same data set is used.

https://www.nature.com/articles/d41586-019-00137-6

Three hundred and seventy-two justifiable specifications for the YRBS, 40,966 plausible specifications for the MTF and a total of 603,979,752 defensible specifications for the MCS were identified.

[…]

After noting all specifications, the result of every possible combination of these specifications was computed for each dataset. The standardized β-coefficient for the association of technology use with well-being was then plotted for each specification.

https://doi.org/10.1038/s41562-018-0506-1
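For readers who have not seen this kind of analysis, here is a rough Python sketch of the enumeration step: every combination of outcome, predictor, and covariate subset gets its own regression, and the standardized coefficients are sorted for plotting. This is not the authors' code. The variable names and the random data are made up for illustration, and the tiny least-squares solver is only there to keep the example free of dependencies.

```python
import itertools
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def standardize(v):
    """Center to mean 0 and scale to standard deviation 1."""
    m = sum(v) / len(v)
    sd = (sum((x - m) ** 2 for x in v) / len(v)) ** 0.5
    return [(x - m) / sd for x in v]

def std_beta(y, x, covs):
    """Standardized OLS coefficient on x, adjusting for the given covariates."""
    cols = [standardize(x)] + [standardize(c) for c in covs]
    ys = standardize(y)
    # All columns are centered, so no intercept term is needed.
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(a * b for a, b in zip(ci, ys)) for ci in cols]
    return solve(XtX, Xty)[0]

# ASSUMPTION: hypothetical variable names and pure-noise data.
outcomes = ["wellbeing_a", "wellbeing_b"]
predictors = ["tech_use_a", "tech_use_b"]
covariates = ["cov1", "cov2", "cov3"]
rng = random.Random(1)
data = {name: [rng.gauss(0, 1) for _ in range(200)]
        for name in outcomes + predictors + covariates}

# Every subset of covariates, including the empty set: 2^3 = 8 subsets.
cov_subsets = [s for k in range(len(covariates) + 1)
               for s in itertools.combinations(covariates, k)]

results = []
for y, x, covs in itertools.product(outcomes, predictors, cov_subsets):
    beta = std_beta(data[y], data[x], [data[c] for c in covs])
    results.append((beta, y, x, covs))
results.sort(key=lambda r: r[0])  # sorted coefficients form the "curve"

print(len(results), results[0][0], results[-1][0])
```

Even this toy setup yields 2 × 2 × 8 = 32 specifications, which hints at how quickly the count grows with a realistic number of options.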

The coefficients of the arbitrary models they are talking about are meaningless. The coefficients only mean something if the model is specified correctly, which they must admit is not the case or else they wouldn’t be trying out all these different specifications. I wouldn’t even think it is likely that any single one of these hundreds of millions of models is the “correct specification”, since the variables are limited to whatever data was available.

How can taking the average of 600 million meaningless values result in a meaningful value?

]]>