Justin:

Two things. First, there’s a reason people use different statistical methods. They’re not all equivalent, and there’s no reason to think that every statistical method is a good first pass at a problem. Second, a lot of the problems with statistical significance are exactly that it is used as a “filter/gatekeeper.” We discuss general problems with filter/gatekeeper/etc. here.

]]>It won’t, and it was never, ever claimed to, but statistical significance is a good ‘first pass’ at the problem, filter/gatekeeper what have you. But hey, same with Bayes factors and everything else.

Justin

]]>I updated to this link: http://www.stat.columbia.edu/~gelman/research/published/abandon_final.pdf

]]>:(

I give you that it is quite esoteric. I will try to unwind some peculiarities.

The exps in the function “pYes” are there so that par[1] and par[2] are always positive. Conversely, in the vector priorMeans the means are log’d so that they would be more easily understood–but maybe they aren’t. The core equation itself–(x/alpha)^beta–is widely used in psychophysics to relate physical signal level to the internal signal-to-noise ratio (cf. e.g. Kontsevich and Tyler from the previous post). The intercept in the model corresponds to a “decision criterion”–if you think of it as a latent variable model (https://en.wikipedia.org/wiki/Logistic_regression#As_a_latent-variable_model).

The logs in the function informationGain are related to the calculation of the entropy of a Bernoulli distribution. As I said, the heuristic is to minimize the entropy of the posterior distribution. Here, instead, it is the probability distribution of the responses in which the entropy is minimized; the algorithm (during the optimisation step inside the main loop) chooses the stimulus that reduces the entropy of the Bernoulli distribution the most. Kujala and Lukka have more information about this in their article.
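To make the entropy bookkeeping concrete, here is a minimal vectorized sketch of the quantity that informationGain computes. The function names bernoulliEntropy and expectedGain and the two-particle example are made up for illustration; the posted code returns the negative of this value because optimise() minimizes.

```r
# Entropy (in nats) of a Bernoulli distribution with success probability p.
bernoulliEntropy <- function(p) {
  -(p * log(p) + (1 - p) * log(1 - p))
}

# Expected information gain of a stimulus, given the per-particle
# "yes" probabilities and the normalized particle weights:
# entropy of the weighted-average prediction minus the weighted
# average of the per-particle entropies.
expectedGain <- function(pyes, weights) {
  marginal <- sum(pyes * weights)   # predictive P(yes) across particles
  bernoulliEntropy(marginal) - sum(bernoulliEntropy(pyes) * weights)
}

# Hypothetical example with two equally weighted particles.
gainDisagree <- expectedGain(c(0.1, 0.9), c(0.5, 0.5))  # particles disagree
gainAgree    <- expectedGain(c(0.5, 0.5), c(0.5, 0.5))  # particles agree
```

A stimulus on which the particles disagree has a positive expected gain, while one on which they all predict the same response has none, which is why the algorithm prefers the former.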

The arbitrary constants in pYes are indeed arbitrary. The constants 0.98 and, conversely, 0.02 are mixing coefficients; they denote how much of the response is dictated by the core equation and how much of it is due to unbiased noise (the coefficient 0.5). This principle is elaborated on in the Zeigenfuse reference. Also I think Kruschke wrote about this sort of mixture modeling in his book, calling it “robust regression”, but I can’t put my finger on it.
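A small sketch of what the mixing does in practice, reusing the pYes definition from the posted code. The stimulus values 0 and 1000 are arbitrary picks for illustration.

```r
# Same form as pYes in the posted code: 98% core model, 2% unbiased guessing.
pYes <- function(x, par) {
  0.98 * pnorm(-par[3] + (x / exp(par[1]))^exp(par[2])) + 0.02 * 0.5
}

par <- c(log(2), log(1), 1.2)   # the prior means from the posted code
pLow  <- pYes(0, par)           # weakest possible stimulus
pHigh <- pYes(1000, par)        # very strong stimulus
```

Because pnorm() lies between 0 and 1, the response probability always stays between 0.01 and 0.99, so a single lapse by the observer can never drive a particle's weight to exactly zero.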

]]>Yes, normal distribution assumptions often make sense in the absence of anything better because they involve the minimum of assumptions. That doesn’t mean testing whether such a model is correct makes any sense.

My ever-present position is to test a model you have derived from a theory/explanation/whatever and then work from there.

]]>Paul Meehl gave the nice delivery and all I am doing is parroting him. So I don’t see the point of putting forth effort on that front.

]]>One reason normality assumptions are so popular is that the normal distribution is a mathematical attractor. So the assumption can hold approximately without much assumption. Of course a thing that mathematically has to be true isn’t of much interest to test scientifically.

In the end, hypotheses about which rng generated your data are stupid things to test. What we want is mechanistic predictive models with bounds on the imprecision. That’s what Bayesian gives you.

]]>+1 for the idea; -1/2 for the delivery.

]]>1. HA: ‘treatment’ is better than baseline

2. HA: ‘treatment’ is worse than baseline

[…]

So yes, the null hypothesis of no effect is often a priori false, however, the one sided nulls are not false.

Not at all. That is not the null hypothesis, alternative hypothesis, or any hypothesis being tested. Your null hypothesis is whatever model you actually calculated a p-value (or whatever) based on. This will include other assumptions besides mean1 = mean2, such as normality, iid data, etc. **There is no reason to privilege the mean1 = mean2 assumption.**

In this sequential sampling case, the iid assumption that a lot of these default statistical models make is violated. I think people don’t even understand the first thing about what they are testing, which leads to all these problems. I know a lot of readers probably think I am hyperbolic but it really is idiotic if you understand what is going on:

]]>Ian:

That’s all fine, except (a) effects can be highly variable, hence an effect size of +0.002 in a particular experiment, even if several standard errors from zero, doesn’t tell us much about what might happen next time (I’m assuming a scale in which effect sizes on the order of 0.1 are interesting); and (b) all this type 1 error rate control stuff is not really relevant to questions of distinguishing positive from negative effects.

]]>One last thought on this. Most people interpret point null hypotheses in a way that I think you’d approve of. Namely, a point null hypothesis can be thought of as two tests:

1. HA: ‘treatment’ is better than baseline

2. HA: ‘treatment’ is worse than baseline

A point null hypothesis test tests both of these controlling for the family wise error rate. This actually corresponds to how people actually interpret the results. People don’t say: “The treatment was shown to have a non-zero effect on the condition,” rather they interpret it directionally: “the treatment was shown to improve the condition.”

So yes, the null hypothesis of no effect is often a priori false, however, the one sided nulls are not false. So here is a counterintuitive response to your statement that “the null hypothesis is false.”:

The null hypothesis is true. The question the test is addressing is which one.

]]>Cool.

If I had the time to waste I could redo this with a rejected study with frequency based analyses where the reviewers stated that the study was too noisy (underpowered) and had not adjusted properly for multiple analyses. The resubmit would do a Bayesian analysis with a flat prior and prattle on about the advantages of now knowing the posterior probabilities, highlighting credible intervals that are almost identical to the previous confidence intervals.

]]>yup

]]>I think you may mean more precisely that a point null hypothesis is false, as it is obviously not the case that any null hypothesis is false. It is quite common to have basic business practices in place and then have the research question “does doing X improve Y”, where Y is revenue or some such thing. X does not always improve Y, so the null hypothesis that we should keep with the status quo is sometimes true.

]]>Contagious laughter.

]]>Why are there so many logs and exps and arbitrary constants and weird equations in this code? If it’s meant to demonstrate something about sequential testing, that makes it unnecessarily difficult to understand.

]]>Caspar:

I think the thing with the p-values is irrelevant to good practice in that we should not be using p-values to make inferences or decisions. I disagree entirely with your “false discovery rate” attitude in that I do not think the purpose of a study is, or should be, the “discovery” of nonzero differences. All differences are nonzero. Just get N=10^6 and you can get as many discoveries as you want.

Regarding the point estimates: yes, any selection on statistical significance will bias your point estimates. This arises with sequential or non-sequential designs. However, if you perform a sequential design and report all your data, there should not be a problem.

In addition, I disagree completely with your conclusion that a researcher should “increasing your sample size in small bits until you meet some threshold.” It’s always better to get more data. The reason for not getting more data is some combination of cost, convenience, and urgency—not a statistical significance threshold. Again, the null hypothesis of exactly zero effect and zero systematic error will never be true, so I have no interest in rejecting it 5% of the time or whatever. This is a game that I have no interest in playing, and which I don’t think researchers should be playing. And, for that matter, I don’t think Alan Turing used statistical significance thresholds when cracking codes (or, at least, I haven’t heard of him doing so).

]]>Tim:

No, that is not correct. See my above post.

]]>““Researcher’s degrees of freedom” is the scariest phrase in Science.”

Or the funniest:

]]>That doesn’t matter much in the sense that parameter estimations are just as affected/biased by badly done sequential analyses.

]]> Bill Drissel

Frisco, TX

Ian:

I’m happy to be basically guaranteed to reject the null hypothesis. The null hypothesis is false.

]]>I agree that sequential designs are just fine, and should be actually encouraged; however, I do feel that Andrew is engaging in a bit of a dodge there. If you don’t use sequentially valid frequentist inference procedures, you are basically guaranteed to reject the null hypothesis eventually.

Andrew’s argument, if I understand correctly, is: “Whatever, NHST (frequentist and Bayesian) is useless and broken so who cares if you do the sequential analysis wrong.”

I think you can make the argument that NHST has serious problems, as Andrew often does. Whatever your bottom line decision rule should be, which has always been less clear to me in Andrew’s writing, you’ve got to correctly account for your sequential design. Sometimes you get this for free by virtue of being in a Bayesian paradigm and sometimes you don’t.

]]>I wrote a blog post outlining the consequence of sticking to NHST and not adjusting for sequential data collection. I hope it can help as an eye opener to some, as it clearly shows how large the bias in the p-values *and* the effect size estimates is when applying this approach:

http://blog.casperalbers.nl/science/statistics/the-problem-of-unadjusted-sequential-analyses/

I have a small disagreement with this statement.

(A) It IS useful to learn about an effect size being small

(B) The usefulness of (A) is predicated on having a large enough sample. And one way that will occur is if your ‘true’ effect size is very small and you have a statistical significance based stopping rule.

So while I agree that p-value based stopping rules are not a generally coherent framework, a side effect of implementing them is that a ‘precisely estimated zero’ obtained from doing so is quite useful. Think of this as the inverse to the type-M problem.

]]>The motivation for this is, at least it used to be, rather practical: if we are interested in, e.g., the faintest stimulus the subject can detect, it doesn’t really make sense to present them with stimuli they always are able to notice. This resulted in different sorts of “non-parametric” sequential tests, in which some simple rule would be used to determine the next stimulus. Later, as was said in the beginning of this post, more mathematical methods for stimulus selection were developed, since in the more complex models the stimulus placement is dependent on more things than just the psychophysical threshold.

To make everyone more bored, I’ve attached some quickly put together R code for a simple adaptive psychophysical task. I scripted it while on a tea break, so it lacks a certain robustness in the programming sense… but still, I thought that maybe people could find it fun to play around with it. It uses sequential importance sampling, at this point, so particle degeneracy can become a problem if one wants to run longer simulations. In these cases I’d recommend adding a “resample-move” step, as in Chopin (2002).

Also, since it is all in native R, and I was too lazy to figure out some vectorizations, it is also really slow, so be aware. The model in itself is quite simple. There’s an observer making binary decisions, basing their decision on the “internal” strength of the signal (which depends on where the signal-to-noise ratio is 1 and on the non-linearity of the internal scale) and a decisional bound, pretty much like in basic probit models. The probability is “padded” a bit by mixing in some non-cognitive factors (like in Zeigenfuse and Lee 2010, if I recall correctly). So there it is.
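On the particle degeneracy point, a minimal sketch of how one might monitor it and resample. This covers only the resampling half; the MCMC “move” step is not shown, and the function names, the simulated particles and the deliberately uneven weights are all made up for illustration.

```r
# Effective sample size of a set of normalized importance weights.
effectiveSampleSize <- function(weights) {
  1 / sum(weights^2)
}

# Multinomial resampling: draw particles in proportion to their weights
# and reset the weights to uniform.
resampleParticles <- function(particles, weights) {
  n <- nrow(particles)
  idx <- sample.int(n, size = n, replace = TRUE, prob = weights)
  list(particles = particles[idx, , drop = FALSE],
       weights = rep(1 / n, n))
}

set.seed(1)
particles <- matrix(rnorm(3000), ncol = 3)   # hypothetical particle set
weights <- rexp(1000)^4                      # deliberately uneven weights
weights <- weights / sum(weights)

essBefore <- effectiveSampleSize(weights)
resampled <- resampleParticles(particles, weights)
essAfter  <- effectiveSampleSize(resampled$weights)
```

A common rule of thumb is to resample only when the effective sample size drops below some fraction of the particle count. Without a move step the resampled set contains many duplicated particles, which is presumably why the move step is recommended.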

References:

Chopin, N. (2002). A sequential particle filter for static models. Biometrika.

Dimattina, C. (2015). Fast Adaptive Estimation of Multidimensional Psychometric Functions. Journal of Vision.

Kontsevich, L.L. and Tyler, C.W. (1999). Bayesian Adaptive Estimation of Psychometric Slope and Threshold. Vision Research.

Kujala, J.V. and Lukka, T.J. (2006). Bayesian Adaptive Estimation: the next dimension. Journal of Mathematical Psychology.

Shen, Y. and Richards, V.M. (2013). Bayesian Adaptive Estimation of the Auditory Filter. Journal of the Acoustical Society of America.

Zeigenfuse, M.D. and Lee, M.D. (2010). A General Latent Assignment Approach for Modeling Psychological Contaminants. Journal of Mathematical Psychology.

APPENDIX (CODE CODE CODE AAH)

# Some Functions
pYes = function(x, par) {
  0.98 * pnorm(-par[3] + (x / exp(par[1])) ^ exp(par[2])) + 0.02 * 0.5
}

informationGain = function(stimulus, particles, weights) {
  pyes = rep(0.5, length(weights))
  sum1 = 0
  sum2 = 0
  for(i in 1:length(weights)){
    pyes[i] = pYes(stimulus, particles[i,])
    sum1 = sum1 + pyes[i] * weights[i]
    sum2 = sum2 + (-(pyes[i] * log(pyes[i]) + (1 - pyes[i]) * log(1 - pyes[i]))) * weights[i]
  }
  sum1 = (-(sum1 * log(sum1) + (1 - sum1) * log(1 - sum1)))
  return(-(sum1 - sum2))
}

# Particle set
priorMeans = c(log(2), log(1), 1.2)
priorSd = c(1, 1, 1)
nParticles = 1000
particles = matrix(NaN, ncol = 3, nrow = nParticles)
particles[,1] = rnorm(nParticles, priorMeans[1], priorSd[1])
particles[,2] = rnorm(nParticles, priorMeans[2], priorSd[2])
particles[,3] = rnorm(nParticles, priorMeans[3], priorSd[3])
weights = rep(1 / nParticles, nParticles)

# Parameters for the simulation
nTrials = 100
answers = c()
stimuli = c()
generatingValues = c(1, 0.5, 1)

# Run simulation:
for(t in 1:nTrials) {
  # Choose stimulus:
  stimuli[t] = optimise(informationGain, lower = 0, upper = 10, particles = particles, weights = weights)$minimum
  answers[t] = rbinom(1, 1, pYes(stimuli[t], generatingValues))
  # Update prior
  for(i in 1:length(weights)) {
    weights[i] = weights[i] * (answers[t] * pYes(stimuli[t], particles[i,]) + (1 - answers[t]) * (1 - pYes(stimuli[t], particles[i,])))
  }
  weights = weights / sum(weights)
}
]]>

Thanks – the actual second link is http://statmodeling.stat.columbia.edu/2017/04/19/representists-versus-propertyists-rabbitducks-good/

]]>Keith, did you mean for those to be the same link?

]]>This strategy gives you a significant result 98.8%(!) of the times if there actually is no effect.

https://pbs.twimg.com/media/DcL81C-W0AAuPk5.jpg:large

This is wrong, groupA and groupB should be initialized inside the outer loop. As it is now they grow to very large sample sizes. I get ~20% significant results for that scenario.

]]>Björn:

I had very much the same experience with a statistical colleague a couple months ago. Before and afterwards, I sent them some material on how bad this actually is. Have no idea what the impact was/will be.

Largely, I think it is bad meta-physics or meta-statistics at the root of this and why it is so hard to get folks to take criticism seriously. For instance, to some the likelihood principle means frequency properties are irrelevant, so they will just dismiss looking at frequency properties.

If you can get someone’s attention and time, this simulation based exposition of the issue by Andrew may be a good bet http://statmodeling.stat.columbia.edu/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/

I discussed it in a wider context here (where it is Case study 1) http://statmodeling.stat.columbia.edu/2016/08/22/bayesian-inference-completely-solves-the-multiple-comparisons-problem/

]]>> pilot studies are often not randomized. Or, if they are, the point is to check that the randomization is feasible

This certainly would make sense in clinical research, for instance when trying to carefully balance the control and intervention groups. But again, the primary motivation is to evaluate feasibility, safety, compliance, timing, costs, etc.

Even most hard core frequentists won’t complain about using informative priors for _design_.

]]>Here is a model of sequentially collecting data from two equivalent groups, performing a t-test, and collecting more until you either get a “significant” result or run out of money:

https://pastebin.com/EcbU8sC7

Here are the results of the above for 100 simulations. It is the distribution of p-values you get by taking either the final or lowest p-value. About 35% were less than 0.05 in that case:

https://image.ibb.co/mZU6KS/seq_sample.png

If you didn’t do sequential sampling then those histograms would look like uniform distributions and ~5% of p-values would be below 0.05.
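For readers who would rather not follow the links, the procedure can be sketched in a few lines of R. This is not the pastebin code, just a minimal sketch under assumed settings: the function name sequentialTrial and the values of startN, batch, maxN and nSims are hypothetical, so the exact percentage will differ from the figure quoted above.

```r
# Optional stopping: two groups drawn from the SAME distribution,
# a t-test after every batch, stop at p < alpha or when the budget runs out.
sequentialTrial <- function(startN = 10, batch = 5, maxN = 100, alpha = 0.05) {
  # Groups are created fresh for each simulated study, so sample sizes
  # do not carry over from one study to the next.
  groupA <- rnorm(startN)
  groupB <- rnorm(startN)
  repeat {
    p <- t.test(groupA, groupB)$p.value
    if (p < alpha || length(groupA) >= maxN) return(p)
    groupA <- c(groupA, rnorm(batch))
    groupB <- c(groupB, rnorm(batch))
  }
}

set.seed(123)
nSims <- 1000
finalP <- replicate(nSims, sequentialTrial())
fixedP <- replicate(nSims, t.test(rnorm(100), rnorm(100))$p.value)

rateSequential <- mean(finalP < 0.05)  # share "significant" with optional stopping
rateFixed      <- mean(fixedP < 0.05)  # share "significant" with fixed n = 100
```

Comparing rateSequential with rateFixed shows how much the stopping rule alone inflates the share of “significant” results when there is no effect at all.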

]]>You need to model the process you think generated the data. Ie, if there is sequential sampling you need to model sequential sampling.

“We continuously increased the number of animals until statistical significance was reached to support our conclusions”

“We assumed the data was iid but then made collection of new data dependent on the outcome of the previous data so it wasn’t iid. Then we rejected the iid model and concluded that we know the cure for cancer (or whatever).”

That is how stupid this is. And do not be mistaken, it is widespread and has been for decades. The main problem at this point is that the human mind recoils at the thought of the consequences.

]]>While I agree with most of what you’ve said, I do think that it is important to criticize (or warn against, educate about, etc.) these kinds of sequential designs, as they just provide noise. Yes, other practices *also* lead to noise, but that means we should criticize those practices too, not ignore some practices because others are also bad or worse.

You do make a great point that had this study used an a-priori fixed sampling procedure instead of a post-hoc sequential one, it would not have been much better. While that is true in this case, in many other cases this does not hold. As such, I do think that it is good to focus (to some degree) on this particular bad approach, without losing sight of other problematic practices.

]]>Maybe we need more examples of how to do such trials properly and interpret it in an appropriate manner. For example, what would a possible Bayesian model including time look like? Picking some sensible priors, what would be a good analysis of this experiment and what would be an appropriately cautious interpretation?

]]>Bjorn:

Just to be clear:

– Bayesian methods don’t require an adjustment for sequential design. They do, however, require a model for the outcome that includes all variables used in the design (in the case of a sequential design, the key variable is time).

– I think decision rules based on null hypothesis significance testing (these include p-values, Bayes factors, and decisions based on whether a 95% confidence or posterior interval includes zero) make no sense and will in general have bad statistical properties.

– I think it’s a mistake to think you’ve “won big” if you get a huge effect size estimate along with a large standard error. I discussed this problem in section 2.1 of the paper linked to above.

– I disagree completely with the claim that Bayesian analyses with vague priors have great frequency properties. I’ve talked and written about this a lot: Bayesian analysis with vague priors leads to the following sort of statement: If you have an estimate that’s 1 se from zero, you end up with 5:1 odds that the true effect is positive. Go around giving 5:1 odds based on pure noise and you’re gonna lose a lot of bets.

– The likelihood principle is what it is. In any case, you can do most of the above reasoning without worrying about the likelihood principle, just looking at frequency properties.

– The practical effect I’m hoping for from this post is for people to focus on important statistical issues. To criticize the above-linked study based on its sequential design is, to me, ridiculous, as it would have almost all the same problems had the sample size been fixed. The sequential design is a minor part of the study, and to pick on that seems to me like a distraction. For influential people (including the editor of a leading psychology journal) to focus on this seems to me to miss the point, and it’s perhaps one clue to how so many crappy papers get published in top journals: there’s an attitude that if various arbitrary rules are followed (no sequential design, p less than 0.05, etc.), a paper gets to be published. That led to the Bem ESP debacle.
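The 5:1 figure in the point about vague priors can be checked with a line of arithmetic. Here is a minimal sketch, assuming a flat prior and a normal likelihood so that the posterior for the effect is normal, centered at the estimate, with standard deviation equal to the standard error.

```r
# Flat prior + normal likelihood: posterior is Normal(estimate, se^2),
# so P(effect > 0) = pnorm(estimate / se).
z <- 1                                   # estimate is 1 standard error from zero
probPositive <- pnorm(z)                 # about 0.84
oddsPositive <- probPositive / (1 - probPositive)   # a bit over 5 to 1
```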

]]>As long as it's done this way, this approach is a double-whammy: you get the chance to "win big" with a huge effect size early on based on a type S/M error, and later on based on showing an irrelevant effect.

In my experience, when I tried to discuss the potential issues (which admittedly have more to do with the significant-or-not interpretation than with the sequential data collection), I just got told that I am too stupid to understand the likelihood principle (and that I should read some article by Berry and Berry that explains it even for people like me). So these kinds of posts do really worry me in terms of the practical effects they have, although here at least the "this is fine in a sense, if you don't care about the type 1 error rate or other frequentist operating characteristics"-disclaimer is clear. However, I would have wished for it to be even bigger and more clearly spelled out. One can never be too clear.

]]>Daniel:

Yes, I agree, your scenario is different from the sort of pilot study we see in statistics where the range of the data is pretty much known ahead of time (a simple example being binary data with a roughly known frequency).

]]>I guess what you call a pilot study and what I call a pilot study may differ significantly, or maybe just it differs by field… In forensic engineering for example, I’ve definitely done things like sent an inspector into the field to take observations of say 8 randomly selected windows to figure out what measurements need to be taken, and get a handle on visually discernible failure rates. Rates could be anything from 0 to 1… and the “real” study needs to quantify failure across 1500 windows installed on a 15 story building, using measurement techniques appropriate to the types of failures likely to occur.

The pilot study might cost $800, and collect very basic info about failure types and so forth, but the final study will need to look at N windows, with pressure tests or chipping away stucco to reveal installation techniques, or whatever. Maybe $5000 per window by the time scaffolding is set up, and 4 or 5 simultaneous workers per window.

You definitely don’t want to do some kind of industrial process control textbook formula for sample size and tell the inspection team to completely strip 400 windows out of the building at a total cost exceeding twice the quantity requested in the settlement discussions.

A real-world cost based decision analysis is a real thing here, and even low grade biased measurements from a pilot of 8 windows are a damn sight better than any other technique for determining the sample size.

]]>Daniel:

it’s not just that. Given that variance will be so high with small N, there’s not really any point in working to control bias at this stage. That’s one reason that pilot studies are often not randomized. Or, if they are, the point is to check that the randomization is feasible, not to worry about balance in some group of 4 patients or whatever.

]]>“there’s no real reason to put lots of effort into controlling biases”

unless you’re doing it explicitly to make a decision about the size of your follow-up study, in which case you should try to get the best data you can so you can feed it all into a decision analysis.

I think normally this kind of decision analysis based followup isn’t done, and that’s why people put less effort into their pilot studies.

]]>A:

I’d need to see the example. In any example I’ve ever seen, the interval based on 3 or 4 pairs is so wide as to include huge swathes of completely unrealistic parameter values.

]]>“But another way of putting it is that, if you have any kind of reasonable prior, the likelihood from your pilot study will be so weak as to have virtually no effect on your posterior.”

Based on my experiences, this is not true. I’ve seen lots of pilot studies that are something like matched pairs with 3 or 4 pairs. In those cases it was definitely not true that the researchers’ prior sd for the effect size was considerably less than half of the sd of the difference of a randomly selected pair.

To be clear, I’m not talking about ideal worlds here.

]]>Daniel:

Sure. But another way of putting it is that, if you have any kind of reasonable prior, the likelihood from your pilot study will be so weak as to have virtually no effect on your posterior. And, in addition to that, a pilot study will typically have lots of bias: you’re doing the pilot to make sure the treatment can be implemented as planned, and there’s no real reason to put lots of effort into controlling biases.

]]>Here’s an example of one of the things you could do with your pilot posterior:

draw N posterior samples, for each i = 1…N generate a fake dataset of Q data points according to the generating process, then run a Bayesian inference on this fake dataset, and determine some posterior samples for k, an important parameter.

Vary Q

Using a utility function that encodes how much you really care about knowing the best value for k, choose a Q that maximizes your expected utility across the N possible parameter vectors.

Now carry out your “real” study using sample size Q

And *that* is what “Bayesian Power Analysis” should look like, and it *doesn’t* suffer from the “noisy point estimate” problem.
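The steps above can be sketched in R. This is only a toy illustration under loud assumptions: k is the mean of normal data with known sd, the pilot posterior for k is taken to be normal (so the update is conjugate and no MCMC is needed), and the utility function and every number in it are hypothetical. Instead of posterior samples it scores the posterior mean, which keeps the sketch short.

```r
set.seed(42)

# Hypothetical pilot posterior for k, and a known observation sd.
pilotMean <- 0.3
pilotSd   <- 0.5
obsSd     <- 1
kDraws <- rnorm(500, pilotMean, pilotSd)   # N draws from the pilot posterior

# Hypothetical utility: squared error in k costs 1000 units,
# and each data point costs 1 unit.
expectedUtility <- function(Q) {
  utilities <- sapply(kDraws, function(k) {
    fake <- rnorm(Q, k, obsSd)             # fake dataset of Q data points
    # Conjugate normal update, using the pilot posterior as the prior.
    postPrec <- 1 / pilotSd^2 + Q / obsSd^2
    postMean <- (pilotMean / pilotSd^2 + sum(fake) / obsSd^2) / postPrec
    -1000 * (postMean - k)^2 - 1 * Q
  })
  mean(utilities)                          # average over the N parameter draws
}

candidateQ <- c(5, 25, 125, 625)           # vary Q
utilityByQ <- sapply(candidateQ, expectedUtility)
bestQ <- candidateQ[which.max(utilityByQ)]
```

In a realistic problem the conjugate update would be replaced by a full Bayesian fit to each fake dataset, and the utility would encode the actual costs and stakes of the study.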

]]>Daniel:

Of course don’t use a flat prior for the design. Use prior information. There’s always prior information, otherwise why are they doing the experiment in the first place?

]]>Point estimates from tiny pilots are usually crappy, but the entire posterior distribution is a useful thing to compute. I’d much rather use that than a “flat” prior in my follow up study.

]]>Jeff:

1. Yes, with some sequential designs you can increase the probability of getting statistical significance. So what? Statistical significance, by itself, tells us nothing.

2. I disagree with your statement that, “in theory, a pilot test might be a good way to generate an estimate for a priori power analysis.” Even in theory the pilot study is a bad way to generate this estimate. It will be too noisy. The lower limit of an 80% interval from a pilot study is not a conservative estimate of anything; it’s just a random number!
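To see how noisy such an estimate is, here is a minimal simulation sketch. The true standardized effect of 0.2 and the pilot size of 4 pairs are hypothetical values chosen for illustration.

```r
set.seed(7)
trueEffect <- 0.2   # hypothetical true standardized effect
nPairs     <- 4     # hypothetical pilot size

# Repeat the pilot many times and record the estimated effect size.
pilotEstimates <- replicate(5000, {
  diffs <- rnorm(nPairs, mean = trueEffect, sd = 1)  # paired differences
  mean(diffs) / sd(diffs)                            # estimated standardized effect
})

spread <- unname(quantile(pilotEstimates, c(0.1, 0.9)))  # middle 80% of estimates
shareWrongSign <- mean(pilotEstimates < 0)               # estimates with the wrong sign
```

Looking at spread and shareWrongSign shows how far the middle 80% of pilot estimates ranges around the true value and how often the estimate even has the wrong sign, so plugging any one of them into a power calculation mostly propagates noise.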

]]>