That is often an indication of a serious problem, and neither study should be taken as informative until the lack of overlap is understood. ]]>

The second trial certainly makes us more confident in rejecting the null hypothesis, doesn’t it?

No, I am 100% confident the null hypothesis in this case should be rejected. The data is irrelevant.

If I do trials of what jelly bean colors cure cancer, I’ll come up with some number of statistically significant effects… but almost none will be replicated.

There is going to be some weird correlation between the color of jelly beans chosen and recovery from cancer under whatever specific circumstances. With sufficient sample size you will detect this, and if it is worthless information then 50% of the time the statistically significant effect will be in the same direction.

]]>You end up with a severely overfit model, you mean.

]]>If we get different results on an attempt to reproduce a finding, there are two things that may be contributing to varying degrees: chance and methodology. If we treat the two studies as a combined sampling in our meta-analysis, we miss the former.

In effect, we have a combined sample, but a very confounded one– consider them tuples of methodology and subject; we have 300 subjects combined with methodology B and 30 subjects combined with methodology A.

]]>I'm also not sure why you say the a priori probability of replication is 50%. Can you elaborate? If I do trials of what jelly bean colors cure cancer, I'll come up with some number of statistically significant effects… but almost none will be replicated.

]]>We start with a hypothesis that has some kind of plausible causal relationship.

We do our best to control our research to eliminate common causes, etc.

And then we seek out a p value of statistical significance. If we have well-controlled other confounds, and found p<0.05, Bayes will tell us that our plausible causal relationship is now 20x as likely as before (assuming it was not very likely), and it's a good candidate for other people to try and reproduce and study other ways.

It's not a crazy way to do things. Yes, we need to implicitly consider the prior to some extent. If we find p<0.01 evidence for ESP under some new trial conditions, … it's a significant finding in that it's 100x as likely as before. But it still isn't very likely.

Of course, we also need to consider effect size. A p<0.00001 finding of an effect, with a strong explanation for why we would believe in a causal relationship, tested and explored multiple ways… is not very interesting if the effect magnitude is 2%.

]]>2. Preventing tracking of votes. If you were to weight votes, you’d need to attach the weightings to the votes. This means that in a given precinct you’d have a lot more information of whom to attribute votes to.

3. Audit / preventing cheating. If you have different weightings for different votes, were the votes weighted properly? You want to prevent attribution (#2), but you also want to make sure that votes count the right amount– these two goals are in direct opposition. (As compared to just tracking whether or not someone has voted and whether you have the right number of ballots).

4. Distortive effects. Youth turnout is already low. If votes are weighted differently, this is likely to discourage low-weighted groups from participation. ]]>

That’s why we use statistical tools to try and separate the wheat from the chaff. But even they are error-prone and potentially misleading. Even so, I wouldn’t throw them out. ;)

There is no substitute for very, very, very careful and skeptical reasoning.

]]>send you to your death then you are entitled to participate in the election of those politicians. ]]>

> An arbitrary statistical **model**

> the curve doesn’t fit that **well**

> The correct use **of**

Say I started with a population of N0 cells that on average undergo a binary division r times per day so my model for number of cells after t days is N(t) = N0*2^(r*t). This is an equation derived from some basic principles no one has a problem with. If we fit it to observations of the number of cells after 2, 4, and 6 days then our estimates of N0 and r have well defined meaning.
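As a rough illustration of that fitting step, here is a minimal Python sketch. The counts are made up (generated from a hypothetical N0 = 100 and r = 1.5 with no measurement noise), and the log-linear fit via `numpy.polyfit` is just one convenient way to do it, not necessarily how anyone in this thread would.

```python
import numpy as np

# Hypothetical observations at t = 2, 4, 6 days, generated from
# N(t) = N0 * 2^(r*t) with N0 = 100 cells and r = 1.5 divisions/day.
t = np.array([2.0, 4.0, 6.0])
true_n0, true_r = 100.0, 1.5
counts = true_n0 * 2.0 ** (true_r * t)

# Taking log2 gives log2 N(t) = log2(N0) + r*t, a straight line in t,
# so the slope estimates r and the intercept estimates log2(N0).
r_hat, log2_n0_hat = np.polyfit(t, np.log2(counts), 1)
n0_hat = 2.0 ** log2_n0_hat

print(f"r = {r_hat:.3f} divisions/day, N0 = {n0_hat:.1f} cells")
```

Because the model comes from the division process itself, the fitted slope reads directly as divisions per day and the intercept as the starting population.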

If then I say “the curve doesn’t fit that well, so let’s add in terms for apoptosis rates, senescence when the cell density gets too high, etc.” **the meaning of N0 and r does not change**.

For a statistical model not derived from any first principles the meaning of each coefficient depends on what else is included in the model. Almost always what gets included is a matter of convenience. **So the meaning of the coefficient changes, as does the value.**

The correct use of a statistical model is to make predictions, not interpret the arbitrary values of the coefficients. A rationally derived model can be used for both.
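To make the point about coefficients concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the variables `x1` and `x2`, the data-generating coefficients, and the `ols` helper are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two correlated predictors; y depends on both.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
y = 1.0 * x1 - 2.0 * x2 + rng.normal(size=n)

def ols(columns, y):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    X = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_short = ols([x1], y)[1]      # coefficient on x1 when x2 is left out
b_long = ols([x1, x2], y)[1]   # coefficient on x1 when x2 is included

print(f"x1 coefficient without x2: {b_short:+.2f}")
print(f"x1 coefficient with x2:    {b_long:+.2f}")
```

In this setup the coefficient on `x1` is negative when `x2` is omitted and positive when it is included, even though the data and the predictions for `y` come from the same process. Which number is "the effect of x1" depends entirely on what else was put in the model.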

]]>As for that Garden of Forking Paths example, some models are clearly more plausible than others. Of course with many variables there are many permutations possible, hence the hundreds of models. But, at the end of the day, in statistical modelling you are going to have to take a stand on some model or other if you want to say something beyond throwing your hands in the air and saying “I don’t know” (which, I’ll admit, could be an improvement in a lot of cases). In most developed economic literatures, there are often only a few specifications that anyone would take to be reasonable given the body of theoretical work that has preceded it. Therefore if you have added or removed some variables and it’s not guided by theory, people will be very suspicious.

In general, you keep straw-manning social science empirical work, acting as though all social scientists are imbeciles plugging variables at random into STATA regression models and looking at p-values to decide which model “worked”. This characterization fits with your narrative, so you run with it. But the reality is a little more complex; there is a lot of high-quality empirical work being done in economics, for example. Lots of problems too, but nobody is publishing papers that are of the PNAS variety in Econ, I can assure you that. The “p-hacking” (or equivalent of) done in modern Econ is mostly in structural work, I’d argue.

]]>Student makes it here:

https://errorstatistics.com/2015/03/16/stephen-senn-the-pathetic-p-value-guest-post/#comment-120537

I know I’ve seen Neyman do it, but don’t remember the paper at the moment. Here is a random applied Neyman paper I just found:

(i)

did the silver iodide seeding in any of the other completed experiments show significant effects, positive or negative, on precipitation in areas far removed from the intended target?

[…]

The two Arizona experiments (6, 7) were performed during the summer months of 1957-60 and in 1961, 1962, and 1964. The target area was an isolated body of mountains known as the Santa Catalina Mountains, with dimensions of roughly 15 by 20 miles. Seeding was performed over a period of 2-4 hr, and began at 12:30 p.m. The experimental unit was a “suitable” day. Determination of the suitability of a given day was made in the morning; the essential criterion was a high level of precipitable water. The experimental design was in randomized pairs of suitable days, subject to the restriction that the 2 days of a pair be separated by not more than 1 day diagnosed as not suitable. For the first day of each pair, the decision whether to seed or not was purely random. Whatever this decision was, it required a contrary decision for the second day. The second experiment differed from the first in the following respects: more gages scattered over a somewhat smaller area, level of seeding, and more stringent selection of suitable days. The original evaluation of possible effects of seeding was based (6, 7) on the average rainfall over the 5-hr period from 1300 to 1800, MST, as measured by a substantial number of recording gages scattered in the target. In both experiments the results of the evaluation were about the same: a not significant 30% apparent loss of rain ascribable to seeding. On days when cloud bases were high, these apparent losses were heavier than on days when the cloud bases were low.

[…]

The first shows that the seeding over Santa Catalina Mountains was actually accompanied by a significant apparent 40% loss in 24-hr rainfall at a distance of 65 miles from the intended target, P = 0.025. This, then, constitutes an affirmative answer to question (i).

[…]

The stratification reflected in the last two double lines of Table 1, stimulated by the thoughts of Horace Byers (1, pp. 551-2), was performed because the design is randomized pairs: only the first day of each pair was selected for the experiment, without prior knowledge whether it would be seeded or not. Table 1 shows that the difference between the category of “first days” and the category of “second days” is quite sharp, but its sign is opposite to that visualized by Byers.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC389009/

So first of all we see some p-hacking going on here. Second we see him going from rejecting the null model (I didn’t follow to ref 5 to learn exactly what it was) of something like “rainfall for the next 24 hrs on seeded vs non-seeded days is sampled from the same distribution”, to conclude “silver iodide seeding reduced 24 hr rainfall by 40%”.

Can anyone find an example of Fisher committing this error?

]]>When that model can be convincingly said to represent some real-world phenomenon, then the coefficients have some real-world meaning

Yes, they have meaning when the model is “correctly specified”, i.e., it includes all the relevant variables and no irrelevant ones, etc. You can easily prove this to yourself by adding/removing variables to the model and seeing the others change. Attempting to interpret the meaning of these coefficients is like looking at the individual weights of a neural network.

Here is an example of people attempting to explore all plausible linear models for one dataset. They come up with over 600 million different values for the same coefficient ranging from positive to negative: https://statmodeling.stat.columbia.edu/2019/08/01/the-garden-of-forking-paths/

After all that, they conclude:

]]>Because we are examining something inherently complex, the likelihood of unaccounted factors affecting both technology use and well-being is high. It is therefore possible that the associations we document, and those that previous authors have documented, are spurious.

For the sake of simplicity and comparison, simple linear regressions were used in this study, overlooking the fact that the relationship of interest is probably more complex, non-linear or hierarchical [13].

we wind up with models involving 10 or so different causative factors each of which affects the outcome in a well defined way, and each of which has an associated set of parameters and posterior distribution of those parameters…

that’s the reality. NHST *is* “just theory” produced by mathematicians with no connection to real experiments. Ronald Fisher who did real actual experiments on agriculture is quoted above somewhere saying how stupid NHST is and how it will result in people ritually doing stupid things in the name of “science”.

https://statmodeling.stat.columbia.edu/2019/09/06/__trashed-2/#comment-1117232

]]>I suspect that we are looking at the difference between those who do experiments, and those who just theorize. ]]>

It’s the use of a threshold to determine significance that bothers Anoneuoid. Is that correct?

No… your question has been answered twice in this thread already:

https://statmodeling.stat.columbia.edu/2019/09/06/__trashed-2/#comment-1117359

https://statmodeling.stat.columbia.edu/2019/09/06/__trashed-2/#comment-1117518

But yea, using an arbitrary threshold that in practice gets adjusted so that the “right” amount of “discoveries” get published is another fun piece of nonsense that acts as a red herring to distract from the real problem.

]]>I guess different people have different concerns. You’d probably get a big howl out of a lot of the things I think should be done! :)

Generally I see your point. I’m not sure I agree with it. I mean on the one hand we have people advocating for more rights for youth but often the same people are telling us that we can’t hold young men responsible for their actions because their brains aren’t fully formed until age 21 or something – and that’s before you count their individual abilities to acquire knowledge and their accumulated experience.

So I credit you with good intentions! :)

]]>Thanks, much appreciated, that’s useful. My understanding is that Anoneuoid is bothered not by the P values themselves, but by the use of significance testing. It’s the use of a threshold to determine significance that bothers Anoneuoid. Is that correct?

I agree with Anoneuoid on this point.

I was trying to make a distinction between cutting edge research and routine work. Seems to me like there is lots of regular work where the data is sound (say, information on stock trades), n is large and there’s a fairly smooth distribution, where using a cut-off to select among models is a sensible thing to do.

]]>1. Because I said, for the sake of example, that you know the distributions are normal.

No you said “what users want to know is the probability that their results occurred by chance”, and then substituted “sampled from a normal distribution” for “chance”. This is a bait and switch.

It would certainly be useful to have evidence that the two groups did *not* come from the same distribution.

The answer is they did not. They never do in real life except perhaps in cases where it is theoretically predicted to be so (eg that subatomic particles have identical properties, etc). Otherwise, I can’t think of anything to do with this information.

3. I doubt whether users would find at all useful your advice that the false positive risk is zero.

Yes, and I would agree with them. I can’t think of any reason doing what you describe could be useful.

]]>that’s what a typical “hypothesis test” does, and it gives us reasonably useful information when it *fails to reject* because then we know that this abstract mathematical model is for the moment sufficiently good that we can’t distinguish between it and whatever actually happened…

]]>1. Because I said, for the sake of example, that you know the distributions are normal.

2. “What purpose would it serve to know the probability that two groups were sampled from exactly the same distribution?”

It would certainly be useful to have evidence that the two groups did *not* come from the same distribution.

3. I doubt whether users would find at all useful your advice that the false positive risk is zero.

]]>So how would you estimate that probability in, for example, the case of comparing the means of two independent samples (assume normal dist, equal variances)?

1) Assuming an observation resulted from sampling from a particular normal distribution is just one possible definition of “chance”. Why can’t I assume it is a sample from a lognormal distribution and refer to that as “chance”? I would get rid of this “chance” terminology altogether.

2) What purpose would it serve to know the probability that two groups were sampled from exactly the same distribution?

3) As per recent discussion on this blog, most likely we can deduce the probability is zero and the data is irrelevant to our conclusion: https://statmodeling.stat.columbia.edu/2019/08/28/beyond-power-calculations-some-questions-some-answers/#comment-1108650

]]>You say

“Knowing “the probability that the results occurred by chance” at least seems like a possibly useful piece of information.”

Good. We agree at least on that point.

So how would you estimate that probability in, for example, the case of comparing the means of two independent samples (assume normal dist, equal variances)?

]]>that response surely tells you that what users want to know is the probability that their results occurred by chance.

No, they want information about their research hypothesis, not some null hypothesis. They are trained to calculate a p-value for a default null model and compare it to a significance threshold. This process makes zero sense so they come up with myths to explain why they and everyone else is doing it.

Knowing “the probability that the results occurred by chance” at least seems like a possibly useful piece of information. Compare to “the probability of observing a deviation at least that extreme from what we would predict if we assumed a null model everyone knows was false to begin with was actually true”.

]]>1) Come up with a ml/statistical model and use it to make predictions so some form of cost benefit analysis can be done and a decision made. Every intervention will have many “effects”, and they will vary depending on the circumstances.

2) Testing predictions derived from a theory, pretty much as described here: https://statmodeling.stat.columbia.edu/2019/09/06/__trashed-2/#comment-1116876

An effect is basically the coefficient of a linear model, it will change depending on what you include in the model (variables, interactions, etc), so really is just an arbitrary number.

]]>You say

“I want to know how certain we can be of the direction of the effect (and its magnitude of course)”

Yes sure. But before estimating the size and direction of the effect, you want to be as sure as you can that there is an effect there to measure. That is the problem in almost all biomedical research. Despite decades of work by statisticians, most users still think that this is what the p value tells them. Most users, when asked what a p value means, say that “it’s the probability that my results occurred by chance”. Of course it isn’t, but that response surely tells you that what users want to know is the probability that their results occurred by chance.

The problem with that question is that it has an infinitude of answers. But many of the answers, including mine, suggest that if you have observed p = 0.049 in a well-powered experiment and claim an effect exists, the probability that you are wrong is between 20 and 30% (and much higher for an implausible hypothesis).

What would really help users would be for you to say what your estimate of that false positive risk is.
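For readers who want a number to play with, one simple calibration is the Sellke-Bayarri-Berger bound, which says the Bayes factor in favour of the null is at least -e·p·ln(p) when p < 1/e. The sketch below uses that bound as a rough stand-in; it is not necessarily the calculation behind the 20 to 30% figure above, and the function names and the default 50:50 prior are assumptions made for illustration.

```python
import math

def min_bayes_factor_h0(p):
    """Sellke-Bayarri-Berger lower bound on the Bayes factor for H0 vs H1."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound only applies for 0 < p < 1/e")
    return -math.e * p * math.log(p)

def min_false_positive_risk(p, prior_h1=0.5):
    """Smallest posterior probability of H0 implied by the bound above."""
    prior_odds_h0 = (1.0 - prior_h1) / prior_h1
    posterior_odds_h0 = min_bayes_factor_h0(p) * prior_odds_h0
    return posterior_odds_h0 / (1.0 + posterior_odds_h0)

print(f"{min_false_positive_risk(0.049):.3f}")                 # 50:50 prior
print(f"{min_false_positive_risk(0.049, prior_h1=0.1):.3f}")   # implausible hypothesis
```

With a 50:50 prior this gives roughly 0.29 for p = 0.049, and roughly 0.78 when the prior probability of a real effect is only 0.1, which is in the same range as the figures quoted in this comment.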

]]>Interestingly, if we use maximum likelihood, having a point null (theta=0) or a composite null (theta less than or equal to zero) makes no difference because (at least in the normal mean model), the point null will always be the supremum of the composite null. ]]>

it seems natural to users who wish to know how likely their observations would be if there were really no effect

No it doesn’t come naturally. I remember being taught this and feeling something was very wrong. I was kept too busy at the time to put much thought into it though.

]]>Math and empirical skills .NE. political skills. I don’t want Einstein fixing the brakes on my car.

]]>In any case, I don’t think “statisticians have been unable to agree on what to do about the problem” even rates as a real reason science is so badly done today. There are whole fields that *desperately want to keep doing ritualistic cargo cult science* because it brings them power and makes them money.

the whole phrase “cargo cult science” comes from Feynman in his 1974 commencement address at Caltech. It’s 2019 today…

]]>Some people don’t like the assumption of a point null that’s made in my proposed approach to calculating the false positive risk, but it seems natural to users who wish to know how likely their observations would be if there were really no effect (i.e. if the point null were true).

It’s often claimed that the null is never exactly true. This isn’t necessarily so (just give identical treatments to both groups). But more importantly, it doesn’t matter. We aren’t saying that the effect size is exactly zero: that would obviously be impossible. We are looking at the likelihood ratio, i.e. at the probability of the data if the true effect size were not zero relative to the probability of the data if the effect size were zero. If the latter probability is bigger than the former then clearly we can’t be sure that there is anything but chance at work. This does not mean that we are saying that the effect size is exactly zero. From the point of view of the experimenter, testing the point null makes total sense.

[From section 4 in https://www.tandfonline.com/doi/full/10.1080/00031305.2018.1529622 ]

]]>I don’t know about that. But I think either of my sisters would be a better president of the United States than I could ever be.

]]>Oh dear, you seem to be very angry.

Surely it was obvious that, when I said statisticians are ignored, I meant that their advice to stop relying on p values is ignored. And part of the reason is surely that statisticians have been unable to agree on what to do about the problem. ]]>

Though when using possibility theory, there’s no important role for impossibility since the focus is on direct evaluations of things of interest, not NHST style reasoning. Furthermore, it is a continuous *degree of possibility*, so you’d call it a 5 percent possibility interval rather than a 95 percent confidence interval. Things outside have possibility 5 percent or less, rather than being ‘impossible’.

The main reason I mention ‘possibility’ is that there is a formal theory of it that tracks the same ideas as compatibility:

https://en.m.wikipedia.org/wiki/Possibility_theory

But I’d happily see some formal aspects of the compatibility interpretation developed in a similar manner.

In particular – just because something is very possible doesn’t mean it is ‘very necessary’. You can measure the degree of necessity of H relative to Ha via 1- Poss(Ha) from any alternative.

So, if replacing degree of possibility by compatibility, it would be nice to have an analogue for degree of necessity too. Something about uniqueness or precision? I don’t know, but I feel like this is a useful aspect of possibility theory – in short, it being non-additive, so two theories can have degree of possibility 1 with no contradiction.

]]>I would have appreciated a properly argued rejection of the idea that p = 0.05 corresponds to approximately a likelihood ratio of at most 3 in favour of H1. That’s the heart of my argument. Of course the people who object to the whole idea of a point null might find fault with it, but I wouldn’t have expected you to do so. ]]>

I am far more arrogant than Anoneuoid could ever hope to be in his wildest dreams! Anoneuoid isn’t fit to sniff my socks when it comes to arrogance!

]]>Of course, this would be an open invitation to gross manipulation of the civics test to achieve partisan results.

Therefore, I would only agree if I was put in charge of the implementation.

]]>Are two datasets with p=0.005 and p=0.2 close? Modifying slightly Keith’s example with coins, getting 8/1 and 6/3 in 9 flips does not seem so close. On the other hand, under the null hypothesis the average distance between p-values is 0.33 so a distance of 0.195 probably can be considered as close. But the subtle sleight of hand is that you are not looking at the closeness between the p=0.2 and p=0.005 results under the null hypothesis. In the coins example, getting 8 heads in 9 flips is about 7 times more likely when the binomial probability is 2/3 than when it is 1/2.
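The likelihood ratio in the coin example is quick to check. A minimal sketch, where `binom_pmf` is just a hypothetical helper name:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips with head probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# 8 heads in 9 flips: probability under p = 2/3 relative to p = 1/2.
ratio_8_of_9 = binom_pmf(8, 9, 2 / 3) / binom_pmf(8, 9, 1 / 2)

# Same comparison for 6 heads in 9 flips.
ratio_6_of_9 = binom_pmf(6, 9, 2 / 3) / binom_pmf(6, 9, 1 / 2)

print(f"8/9 heads: {ratio_8_of_9:.2f}, 6/9 heads: {ratio_6_of_9:.2f}")
```

The first ratio works out to 131072/19683, a little under 7, while the second is 32768/19683, under 2.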

The following example where the argument doesn’t make much sense may help to make my point clear. Let’s consider that instead of normal errors we have a bilinear error term: the distribution has a triangular shape and the error is in the interval [-1 1] with a maximum at 0. For the null hypothesis mu=0, the positive results with (two-sided) p-values 0.2 and 0.005 are 0.55 and 0.9. The p=0.05 threshold is at 0.78. Consider the results 0.55 (p=0.2) and 1.30 (p=0): following the reasoning in your post one can say that going from non-significant to impossible results can correspond to small, non-significant changes in the underlying quantities.
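The numbers in the triangular example can be reproduced in a few lines. This sketch assumes the density is f(e) = 1 - |e| on [-1, 1], which gives a two-sided p-value of (1 - |x|)^2; the function names are hypothetical.

```python
import math

def two_sided_p(x):
    """P(|E| >= |x|) for the triangular density 1 - |e| on [-1, 1]."""
    a = min(abs(x), 1.0)   # anything beyond the support has probability 0
    return (1.0 - a) ** 2

def critical_value(p):
    """Positive x whose two-sided p-value equals p (inverse of the above)."""
    return 1.0 - math.sqrt(p)

for p in (0.2, 0.05, 0.005):
    print(f"p = {p}: x = {critical_value(p):.3f}")

print(f"x = 1.30: p = {two_sided_p(1.30)}")
```

Under that assumption the cut-offs come out at about 0.553, 0.776 and 0.929, so the 0.9 quoted for p = 0.005 is a rounded value, and any observation beyond 1 has p exactly 0.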

]]>from Naked: “P-values are a useful continuous measure of discordance with the null hypothesis.”

This seems to me to relate to NHST. Naked characterizes P as a continuous measure, in contrast to the discontinuous nature of the NHST.

from Me: “They’re the appropriate method for a restricted range of problems”

By which I mean p-value thresholds or NHST significance testing. I don’t think that reflects a misunderstanding on my part. I don’t know exactly what Anoneuoid refers to. I suspect that Anoneuoid’s statement is a reaction against NHST made without really understanding what was said, partly because I wasn’t clear that I was referring specifically to P-thresholds. But I dunno, without clarification from Anoneuoid.

]]>