Is that what Andrew said originally? Hmmmm I can’t see how a single method can apply either

]]>Some differences might arise because our topic makes us very aware of the benefits of giving discretion to those who report information they have that others don’t. Sure, they might use discretion to misrepresent, but they can also use that discretion to communicate more effectively, so mandatory rules governing what you say and how you say it can hinder good communication. So we might be more inclined to support author discretion.

]]>For instance, this from a draft of a new statistics text written in bookdown: https://moderndive.com/10-hypothesis-testing.html#statistical-significance

“If data at least as extreme would be very unlikely if the null hypothesis were true, we say the data are *statistically significant*. Statistically significant data provide *convincing evidence* against the null hypothesis in *favor of the alternative*, and allow us to *generalize* our sample results to the claim about the population.”

]]>So the probability the default makes sense in your particular study is not exactly zero?

So then, since the probability that gambling trust funds at the local casino will increase their value is not exactly zero, that would not be a misuse of trust funds?

]]>I think your comment got in the wrong place. ]]>

It would also be interesting to see your experiment tried in other fields, to see if results are similar or vary from field to field. ]]>

Justin

]]>That isn’t wrong, that is exactly right if your p-value is less than alpha = .05 and the alpha makes sense for your study. Now whether that is useful, or practically significant, is another matter altogether.

Justin

]]>Very interesting description.

]]>No System is Perfect: Understanding How Registration-Based Editorial Processes Affect Reproducibility and Investment in Research Quality

]]>Before turning in, I will provide a rough outline of why I agree with your point that people are trying to use a single method to answer many different questions.

I would like to do things differently by first assessing the reliability of the data by examining the normalized likelihood distribution of the study data alone (as opposed to assessing statistical significance). I would also perform ‘severe testing’ by applying a checklist and estimating the probability of impeccable methodology using another theorem derived from the extended form of Bayes’ rule (described in the Oxford Handbook of Clinical Diagnosis) that allows ‘abductive reasoning’. I would then apply a form of sensitivity analysis to assess how these likelihood distributions are affected by taking into account the probability of impeccable methodological consistency. I would also assess how the probability of replication within a sensible range would be affected with or without other data or prior probabilities by using Bayesian analyses. I would then consider how they might be used in diagnostic classification and treatment decisions in a decision analysis. In response to your 5 scenarios:

1. I would set out the probabilities of benefit, harm and cost, and the utilities of these outcomes, for treatments A and B, taking into account how the severity of the disease and other factors such as the patient’s age and gender affect those probabilities. I would then perform a decision analysis based on the various patient features.

2. I would perform an RCT comparing the new treatment to placebo or an existing treatment.

3. I would try to model the expected result of an RCT (e.g. using decision analysis techniques) to see if the unexpected pattern was as promising as first suggested and, if so, perform an RCT comparing the new treatment to placebo or an existing treatment.

4. I would perform a meta-analysis and then perform a decision analysis as in (1)

5. I would plot the likelihood probability distribution based on the observational data and, if the findings were important and merited further investigation, design an RCT to obtain a fresh likelihood distribution of the difference.
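To show what a normalized likelihood distribution of a difference might look like, here is a minimal Python sketch. It is only an illustration under assumptions of my own: a normal likelihood with a known standard error, a hypothetical observed difference of 2.0 with standard error 1.0, and a hypothetical helper called `normalized_likelihood`. Normalizing over a grid like this is numerically the same as a posterior under a flat prior on that grid.

```python
import math

def normalized_likelihood(observed_diff, std_error, grid):
    """Normal likelihood of observed_diff at each candidate true difference,
    rescaled so the values over the grid sum to 1."""
    raw = [math.exp(-0.5 * ((observed_diff - delta) / std_error) ** 2)
           for delta in grid]
    total = sum(raw)
    return [value / total for value in raw]

# Hypothetical study result: observed difference 2.0, standard error 1.0.
grid = [i / 100 for i in range(-400, 801)]  # candidate differences, -4.00 to 8.00
weights = normalized_likelihood(2.0, 1.0, grid)

# Share of the normalized likelihood lying on differences greater than zero.
mass_above_zero = sum(w for delta, w in zip(grid, weights) if delta > 0)
peak = grid[weights.index(max(weights))]
print(f"peak at {peak}, mass above zero {mass_above_zero:.3f}")
```

The same weights could be summed over any "sensible range" of differences rather than just the range above zero.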

I would suggest that there are other, easier exploratory controlled studies which can be performed based on principles of diagnosis and treatment selection that examine efficacy in a provisional way without randomization. Also, in diagnostic and scientific hypothesis testing it becomes necessary again to use the theorem derived from the extended form of Bayes’ rule to examine, by hypothetico-deductive ‘severe testing’, alternative hypotheses to real efficacy (e.g. spurious results due to bias from poor study design).
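To make the decision analysis in scenario 1 concrete, here is a minimal Python sketch of an expected-utility comparison. Every number in it is made up for illustration, and the `Treatment` class and the `expected_utility` and `choose` functions are hypothetical names of my own. A real analysis would need probabilities and utilities estimated for the patient in question.

```python
from dataclasses import dataclass

@dataclass
class Treatment:
    name: str
    p_benefit: float   # probability the patient benefits
    p_harm: float      # probability of a harmful side effect
    u_benefit: float   # utility of the benefit
    u_harm: float      # utility of the harm (negative)
    cost: float        # cost, on the same utility scale

def expected_utility(t):
    """Probability-weighted utility of benefit and harm, minus cost."""
    return t.p_benefit * t.u_benefit + t.p_harm * t.u_harm - t.cost

def choose(treatments):
    """Pick the treatment with the highest expected utility."""
    return max(treatments, key=expected_utility)

# Hypothetical probabilities that depend on disease severity.
mild = [
    Treatment("A", p_benefit=0.30, p_harm=0.05, u_benefit=10, u_harm=-20, cost=1),
    Treatment("B", p_benefit=0.40, p_harm=0.20, u_benefit=10, u_harm=-20, cost=1),
]
severe = [
    Treatment("A", p_benefit=0.40, p_harm=0.05, u_benefit=10, u_harm=-20, cost=1),
    Treatment("B", p_benefit=0.80, p_harm=0.20, u_benefit=10, u_harm=-20, cost=1),
]

for label, options in [("mild", mild), ("severe", severe)]:
    best = choose(options)
    print(f"{label}: choose {best.name} (EU = {expected_utility(best):.1f})")
```

With these invented numbers the preferred treatment changes with severity, which is the point of conditioning the analysis on patient features instead of applying one threshold to everyone.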

]]>The feminist in me was bothered by the “grandmother” line, so I wrote the thing about grandpa, but then I worried I was being ageist so I threw in the slam against people like me.

]]>Also the first line of the 2nd reply should have been (without typo) “I agree that no single…”

]]>There is an analogous problem in medicine in terms of over-simplified and inappropriate probability thresholds. This applies to screening test results, diagnostic criteria and treatment thresholds. These may be useful for beginners and the inexperienced but are unfortunately in widespread use in an analogous way to statistical significance tests. This leads to damaging over-investigation, over-diagnosis and over-treatment that are directly analogous to the problems in science.

]]>I agree that it would be useful to move in this direction. I think that one problem is that people are trying to use a single method to answer many different questions.

Here are a few scenarios:

– A decision needs to be made, for example use procedure A or procedure B to treat some disease.

– Someone has an idea for a new treatment, and there’s a desire to test this new idea.

– Some data arise suggesting some unexpected pattern (this could arise from a study of existing data or as a byproduct of newly gathered data) and you want to decide how much to believe that this pattern is real, whether to follow it up with further study, whether to implement the new idea on patients right away, etc.

– An idea has been studied by many research teams and you want to do a meta-analysis with the goal of making recommendations regarding treatments, further studies, etc.

– You suspect that a certain treatment *doesn’t* work, or doesn’t work in a consistent way, and you have observational or experimental data to address this question.

My first statement is that no single method will give good answers in all these scenarios. My second statement is that I don’t think that statistical significance gives good answers in *any* of these scenarios. The next step is to provide alternatives. I do think that I and others have good alternative answers, but the starting point is that the alternatives will be different for these different questions.

My first task is to assess the reliability of medical findings. Secondly, based on the findings, I assess the probability of diagnoses / hypotheses. Thirdly, on the basis of the various diagnoses and their probabilities, I make decisions. As more information comes along, the cycle repeats (i.e. back to findings, diagnoses and decisions).

The first task involves (1) assessing the probability of replication due to random variation, perhaps based on a random sampling model that assumes the methodology is impeccably consistent, but also (2) assessing the methodology for such consistency by going through a checklist (severe testing?). If either (1) or (2) is hopeless then I may discard the ‘finding’. If (1) is promising (i.e. passes some test of preliminary significance) I do (2) carefully and reassess the probability of replication by combining (1) and (2). I may also look for an independent observation (e.g. by another doctor) of the same finding and, after assessing it as in (1) and (2), combine the independent observations of the same finding.

I tend to think of assessing scientific findings, hypotheses and decisions in an analogous way. Does this correspond to how discussants or readers of this blog think?

]]>Doing typical structural design office calculations isn’t quite like preparing your taxes, but it’s closer to preparing your taxes than computational fluid mechanics is. I’ve had licensed structural engineers come to me to ask me why the calculations work (one engineer was calculating the stiffness of a support and had an intuitive idea that they shouldn’t just add together two quantities they’d calculated, but didn’t know why).

]]>Grandma in a wheelchair may well be quite a bit faster than grandpa hobbling along, not using the wheelchair because he thinks he’s “still got it”! Or maybe both are faster than middle-aged dude who can’t be bothered to get up off the couch at all.

]]>P.S. As a former structural engineer I can assure you that you shouldn’t be using it as an exemplar of a profession that doesn’t have unqualified (not necessarily the same thing as uneducated) people performing the calculations. Statistics is not as unique as you might think in that regard.

]]>The point of statistical analysis is not to please me, the point is to make sense of data and learn about the world beyond the particular data at hand. My problem with all of 1, 2, 3, and 4 is that these methods are not doing that. For examples of analyses that I believe do make sense of data and learn about the world, I can point you to my books and published articles. (Also of course lots and lots of stuff not written by me, but it’s easiest as a starting point for me to point you to my own work.)

]]>Can you explain how and give an example?

I believe the argument here is that declaring “statistical significance” if the p-value is less than, say, a default of .05 _is_ a misuse of p-values.

It is not saying p-values should not exist or not be used, but rather that they should not be used simplistically in generic dichotomous yes/no ways.

]]>My own contribution to the topic humbly submitted for your consideration here.

]]>While p-values and statistical significance create a lot of problems, many of the problems would still be there regardless. For instance, suppose p-values did not exist and a researcher wants to weave a story about a positive effect. The researcher can “remove outliers” and slice and dice the data until it shows a positive effect.

The discussions should center more on experiment design and causal diagrams, but they can coexist with significance testing. People might be more receptive to “collect data in a better way” rather than “stop using p-values”. A lot about p-values seems off (e.g. binary decisions of p = 0.051 vs p = 0.049), but they can be useful to support the analysis as long as the study is well-designed. Running a t-test on an RCT with a sufficiently large sample size seems reasonable, for instance. But fishing for p-values in an n = 20 study with a lot of noise is not reasonable.
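A small simulation can illustrate the fishing problem. This is only a sketch under assumptions of my own: there is no true effect, the noise is normal with a known standard deviation of 1 (so a z-test stands in for the t-test), each study has 10 patients per arm, and the researcher looks at 20 independent outcomes and reports the best one. The function names `two_sided_p` and `false_positive_rate` are hypothetical.

```python
import random
from statistics import NormalDist, mean

def two_sided_p(treated, control, sd=1.0):
    """Two-sided z-test p-value for a difference in means, assuming known sd."""
    se = sd * (1 / len(treated) + 1 / len(control)) ** 0.5
    z = (mean(treated) - mean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_outcomes, n_per_arm=10, n_sims=2000, alpha=0.05, seed=1):
    """Share of simulated null studies where at least one outcome has p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_outcomes):
            # Both arms are drawn from the same distribution: no real effect.
            treated = [rng.gauss(0, 1) for _ in range(n_per_arm)]
            control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
            p_values.append(two_sided_p(treated, control))
        if min(p_values) < alpha:
            hits += 1
    return hits / n_sims

prespecified = false_positive_rate(n_outcomes=1)
fished = false_positive_rate(n_outcomes=20)
print(f"one pre-specified test: {prespecified:.3f}")
print(f"best of 20 looks:       {fished:.3f}")
```

With one pre-specified test the false positive rate should sit near the nominal 5%. With 20 independent looks, theory gives 1 − 0.95^20 ≈ 0.64, so "significance" turns up in most studies of pure noise.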

]]>Okay, I’m a little tongue-in-cheek, but what do you think of Mayo’s argument?

]]>——

It makes sense to withhold characterization of any part of the discussion as scientific, to control for the anchoring effect.