# utils

# Noisy logistic curve: asymptote a, midpoint b, scale c, evaluated at times t,
# with iid Gaussian noise added.
logistic <- function(a = 10, b = 5, c = 0.5, t = 1:10, mean = 0, sd = 1) {
  y <- a / (1 + exp((b - t) / c))
  y <- y + rnorm(length(y), mean = mean, sd = sd)
  return(data.frame(t = t, y = y))
}

# All pairwise differences between time points, computed within each column
# (replicate) of dat.
calcDiffs <- function(dat) {
  grid <- t(combn(1:nrow(dat), 2))
  res <- sapply(1:ncol(dat),
                function(j) sapply(1:nrow(grid),
                                   function(i) diff(dat[grid[i, ], j])))
  return(res)
}

# Plot the replicate trajectories, then a histogram of the per-pair variances.
plotRes <- function(dat, diffs, main = "") {
  plot(0, 0,
       xlab = "Time",
       ylab = "y",
       xlim = c(1, nrow(dat)),
       ylim = range(dat),
       main = main)
  apply(dat, 2, lines)
  hist(apply(diffs, 1, var), breaks = 20, xlab = "Variance", main = main)
}

# sim

nt <- 20      # time points per replicate
nrep <- 1000  # replicates

dat0 <- replicate(nrep, rnorm(nt))                # white-noise null
dat1 <- replicate(nrep, logistic(t = 1:nt)[, 2])  # fixed logistic null
dat2 <- replicate(nrep, logistic(t = 1:nt,
                                 b = runif(nt, 4, 6))[, 2])  # random midpoints

diffs0 <- calcDiffs(dat0)
diffs1 <- calcDiffs(dat1)
diffs2 <- calcDiffs(dat2)

par(mfrow = c(3, 2))
plotRes(dat0, diffs0, main = "Normal Null Model")
plotRes(dat1, diffs1, main = "Logistic Null Model")
plotRes(dat2, diffs2, main = "Logistic Model")

In regards to mazes, I worked a lot with our behavioral core. They were very meticulous about how the experiments were done, no dropping of data, etc.

Check the final papers. If they do multiple tests on the same animals, are the reported sample sizes the same? If so, that would be the exception, not the rule. There are always things that go wrong.

P.S.

I have an aside regarding using repeated measures anova for behavior. One problem is usually that some form of learning occurs but the test assumes sphericity. So this assumption makes no sense at all in this case, especially when they are studying learning to begin with… To make it worse, popular stats software had a bug for decades in (one of the) corrections for such violations:

And here’s our first really, really serious problem. If you have a between-subjects factor, SPSS’s computation of the Huynh-Feldt epsilon is WRONG. Yep, it’s just plain incorrect. SPSS has known about this bug for decades, but hasn’t fixed it yet. Hard to believe, but true. It’s not way off, but it’s a little bit off. R gets it right, below, so look there to see the correct number. How they get away with this year after year, version after version, is simply beyond me.

https://mikebyrnehfhci.wordpress.com/2015/08/03/translating-spss-to-r-mixed-repeated-measures-anova/

Eventually they came out with the “enhanced Huynh-Feldt” (either SPSS or SAS gave it some name like that), which was simply the correct calculation.

That is all pretty much irrelevant to me now though. Due to the extremely weak relationship between the null model and the research hypothesis, I don’t care about such tests to begin with.

Yes, I agree that these issues are often ignored. Here’s what I wrote in the above-linked post that started this discussion:

the stopping rule enters Bayesian data analysis in two places: inference and model checking:

1. For inference, the key is that the stopping rule is only ignorable if time is included in the model. To put it another way, treatment effects (or whatever it is that you’re measuring) can vary over time, and that possibility should be allowed for in your model, if you’re using a data-dependent stopping rule. To put it yet another way, if you use a data-dependent stopping rule and don’t allow for possible time trends in your outcome, then your analysis will not be robust to failures with that assumption.

2. For model checking, the key is that if you’re comparing observed data to hypothetical replications under the model (for example, using a p-value), these hypothetical replications depend on the design of your data collection. If you use a data-dependent stopping rule, this should be included in your data model, otherwise your p-value isn’t what it claims to be.

Both these points involve model misspecification, and we made both these points in our 1995 book. But not everyone seems to have paid attention! So if there is more of an active interest in these problems now, I’m glad to hear it.

There are articles about it, such as Rosenbaum & Rubin’s “Sensitivity of Bayes Inference with Data-Dependent Stopping Rules” (http://amstat.tandfonline.com/doi/abs/10.1080/00031305.1984.10483176#.WlDGY1Q-dQI), but with 33 citations in 23 years even though the authors are such prominent figures, this has been seemingly mostly ignored.

Do you think this is connected?

As a side note, I am hopeful that there is now an active interest in taking misspecification seriously in Bayesian statistics; there are more and more articles on related topics. For a long time people would dismiss the issue, and say “just go back to the drawing board and come up with a better model”. This ignores that 1) the next model will be misspecified too, and that 2) it is possible to come up with principled statistical methods without assuming that the model is correct, so why not try?

Yes, during my last week as a biostatistician, someone told me about standard practices for western blots. It was a little unnerving (although for the record, it doesn’t sound like any statistical tests were run: my understanding is that you just grab the three most normal-looking blots from control and the three most abnormal from treatment and publish your pictures).

In regards to mazes, I worked a lot with our behavioral core. They were very meticulous about how the experiments were done, no dropping of data, etc. On the other hand, academic standard practice was to use repeated measures anova on the meticulously collected data, which we showed to have a rejection rate around 0.25 on maze data under the null hypothesis (estimated by randomly permuting labels). We did address this issue in one of our papers, although I’m sure the new statistical methods haven’t caught on.
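The label-permutation check described here can be sketched in R roughly as follows. Everything below is a hypothetical toy (sizes, model formula, and simulated responses are my assumptions, not the actual maze analysis); with iid normal responses the check should come out near the nominal level, and the interesting miscalibration only appears when it is run on real repeated measures data with learning effects.

```r
## Sketch of a label-permutation calibration check (toy data, not maze data).
## Refit a repeated measures ANOVA after randomly permuting group labels;
## under permuted labels any "group" effect is spurious, so a calibrated
## test should reject at about the nominal alpha.
set.seed(1)

permCheck <- function(nsim = 200, nSubj = 20, nTrial = 5, alpha = 0.05) {
  reject <- logical(nsim)
  for (s in seq_len(nsim)) {
    d <- data.frame(
      subj  = factor(rep(seq_len(nSubj), each = nTrial)),
      trial = factor(rep(seq_len(nTrial), times = nSubj)),
      ## randomly permuted group labels: the null holds by construction
      group = factor(rep(sample(rep(c("A", "B"), nSubj / 2)), each = nTrial)),
      y     = rnorm(nSubj * nTrial)
    )
    fit <- summary(aov(y ~ group * trial + Error(subj), data = d))
    p <- fit[["Error: subj"]][[1]][["Pr(>F)"]][1]  # p-value for group
    reject[s] <- p < alpha
  }
  mean(reject)  # near alpha here; reportedly far from it on real maze data
}
```

Running `permCheck()` on real data (permuting the real labels but keeping the real responses) is the version the comment describes.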

I misread your comment to be stating that intentional dropping of “unpleasant” data was the norm.

I am saying it is the norm. Say you do 3 western blots and one doesn’t fit with the others; the norm is to do another. If it fits, then you throw out the outlier and assume something went wrong that time.

Say you are testing the speed of mice searching a maze for a food reward and some take so long they are making you late for dinner. It is the norm to figure those mice just weren’t hungry enough that day (or whatever), and leave them out.

By the way, I don’t like the term “objective”, and even less the term “subjective”, for what I think is the proper formulation of Bayesian methods. (Jaynes: “any probability assignment is necessarily ‘subjective’ in the sense that it describes only a state of knowledge [… but] these probability assignments [are] completely ‘objective’ in the sense that they are independent of the personality of the user. They are a means of describing the information given in the statement of a problem, independently of whatever personal feelings you or I might have about the propositions involved.”)

One doesn’t have to “fully” believe the prior, but as for every other assumption in the model the results depend on it so at least one should find it reasonable. If you “approximately” believe the prior, you will trust the results to be approximately correct. If the prior is more or less representative of the range of values that you “expect” for the parameter, you don’t expect the (average) impact of the stopping rule to be an issue. Of course if the prior is very different from what you “know” about the parameter before the experiment, then you can expect things to go wrong and you won’t take the results at face value. Garbage in, garbage out.

Yes, I do agree that unintentional biasing of data must happen quite a bit (I’ve only been involved with 2 data collections in my life, so not much empirical evidence here). I misread your comment to be stating that intentional dropping of “unpleasant” data was the norm.

So I would encourage a thorough read of that paper, as it becomes clear that the general statement about Bayes and optional stopping is not warranted, and people who state so strongly that it is no problem can be doing more harm than good in applied settings.

Finally (lol), Rouder (2014) did a few very small simulations. One cannot draw strong conclusions from such a minimal characterization of the problem at hand. I have a paper I am working on with 180 conditions, five models, and three priors, and even then I would not make general statements about what to do in practice.

That is not how I read the article at all. You may have gotten that if you just scanned it. They pretty much critiqued all of the Bayes factor advocacy in psychology, and did it very well. Most notable of which is that optional stopping does not work out for how almost all Bayesian psychologists use and advocate Bayes factors (even those that introduced BF to psychology, interestingly enough). That is, default prior distributions, which in the extreme case cannot even be sampled from in the BayesFactor package because they are improper. So the very package used to compute Bayes factors does not even have the necessary priors to obtain non-problematic optional stopping. So, in fact, not many practice actual subjective Bayesian methods, and assuredly not those that are the strongest proponents of Bayes factors. There can be belief attached to 50:50 odds and default priors, as these are largely in the “objective” camp. If the default prior does capture belief, then this is implicit: we believe all effects are a priori the same, as this is what the use of a default means.

Also, the word “objective” is a misnomer. I NEVER think the prior or posterior captures any aspect of my beliefs; they are simply sets of assumptions from which to possibly learn something, all the while knowing my model is incorrect. However, I do not kid myself into thinking what I do is inherently objective, and I am 100% Bayesian (in that I do not use classical methods in practice, but I do study frequentist properties of Bayesian models). The illusion of objectivity is just that, and really takes all the fun out of building interesting statistical models.

So in a general sense, your take home is correct, but also in a general sense, that take home does not apply to many situations in practice!

a researcher who drops outliers because they disagree with their hypothesis

It’s probably difficult to realize without actually collecting a lot of data yourself, but there is (almost?) always a legitimate reason available to drop any given datapoint. The practice is something like this:

Most of the former agents had defended the practice of parallel construction because no falsified evidence or illegally obtained material were presented in courts.

https://en.wikipedia.org/wiki/Parallel_construction

It doesn’t even need to be “on purpose”, though. Just analyze the data different ways while unblinded and your bias will leak into the results. That’s why the first thing to do with any empirical paper is ctrl-F “blind”.

Bayesian inference doesn’t require you to *know* the true joint probability distribution. It’s not even clear to me what you are referring to, because I would say that there is a reality out there and the true value of the parameter of interest is fixed even though we don’t know the precise value. (Proper) Bayesian inference works by allowing you to incorporate your *uncertainty* into the analysis via the prior.

Maybe I’ve just made a name for myself as a cranky statistician, but I’ve never met a researcher who drops outliers because they disagree with their hypothesis. I’ve heard stories of young bio grad students who didn’t know better and of PIs that ask people to do this, but the frequency is low enough that I’ve not encountered anyone that uses this practice. Again, maybe they all just keep quiet when I’m around, but given how widely it is known that this practice is bad, I would be surprised; at the very least, I can say that I’ve never met a researcher who doesn’t know better. And if we’re worried that the majority of researchers are knowingly breaking very serious inference rules, well, nothing will save us then.

The article addresses whether optional stopping is problematic for Bayesian inference with Bayes factors and his response is “no”. He reviews arguments like yours and concludes that one shouldn’t evaluate and interpret Bayesian statistics as if they were frequentist statistics and that optional stopping can be used and data can be analyzed ignoring stopping rules.

You think that Bayesian inference has a problem with stopping rules. Another option is that Bayesian inference has no problem with stopping rules, but you have a problem with Bayesian inference.

It says exactly what I am saying.

1) They choose a generative model at random, with odds set to the prior odds.

2) They generate data from the randomly chosen model.

3) They compute posterior model odds.

4) Lo and behold, the inferences are equally *interpretable* regardless of stopping.

That doesn’t change the fact that 1) Prior odds are known 2) The generative model is randomly chosen on each iteration 3) The probability of finding evidence for one model or another is affected by the stopping rule (look at their plots). They admit this; they also did exactly what I said people do: skirt the problem of the assumptions. Their simulation literally assumes that at any given time, there is a 3.5:1 odds that the data will be generated according to some H2 as opposed to some H1; as in, the state of the data generating process changes at random from being zilch to being not-zilch. That is the schrodinger’s hypothesis that I can’t fathom exists in any real context.

ASSUMING that the true model can randomly switch between zilch and not-zilch at some known prior odds, then the asymptotic odds match the randomly flipping model odds.

But that is an insane assumption in reality. I just really strongly doubt any of us outside of quantum mechanics study a phenomenon that simultaneously exists and doesn’t exist at some odds.

In such a case, we fix a reality and see what happens under optional stopping. As I’ve shown, as they themselves have shown, the probability of finding some desired effect increases. They also admit this in a footnote.

Their argument that comparing Bayesian methods to hypothetical truths makes no sense doesn’t convince me, given that in simulations the truth is not hypothetical; it is in fact known. The parameter is set. The joint model is specified. In the end, their argument reduces down to “it remains interpretable” (true) and “reflects some subjective updating quotient” (true, and also unsatisfying).

I don’t believe in schrodinger’s hypotheses. When I simulate, I know the truth. When I analyze, I do not. Bayesian inference is only 100% guaranteed calibrated and free from optional stopping if you know the true joint system, which we don’t in practice. That’s why in these little toy examples, you can specify a joint system, then analyze with the same joint system, and the probabilities hold up (Bob Carpenter had a post about this as well); but the point is moot, because we in practice don’t know the true joint system.
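For concreteness, the kind of simulation being debated here (random model indicator at fixed prior odds, sequential Bayes factor, optional stopping) can be sketched as below. The models, priors, thresholds, and 1:1 odds are toy assumptions of mine, not the settings of any cited paper: H1 fixes delta = 0, H2 draws delta ~ N(0, 1), and data y_i ~ N(delta, 1) accrue until the Bayes factor crosses 3 (or 1/3) or n hits maxN.

```r
## Toy Rouder-style simulation: the true model is itself redrawn each run.
set.seed(2)

logBF21 <- function(y) {
  # log Bayes factor H2/H1 for y_i ~ N(delta, 1): H1 has delta = 0,
  # H2 has delta ~ N(0, 1); closed form via conjugacy.
  n <- length(y); s <- sum(y)
  0.5 * s^2 / (n + 1) - 0.5 * log(n + 1)
}

runOnce <- function(maxN = 100) {
  h2 <- runif(1) < 0.5                 # 1) pick the true model at 1:1 prior odds
  delta <- if (h2) rnorm(1) else 0     #    under H2, delta comes from its prior
  y <- c()
  repeat {                             # 2) accrue data, 3) recheck the odds
    y <- c(y, rnorm(1, mean = delta))
    lb <- logBF21(y)
    if (abs(lb) > log(3) || length(y) >= maxN) break
  }
  c(h2 = h2, postH2 = plogis(lb))      # posterior P(H2 | data) at 1:1 odds
}

res <- t(replicate(2000, runOnce()))
## Calibration: among runs whose reported P(H2 | data) is high, H2 should be
## true about that often, stopping rule or not.
bin <- res[, "postH2"] > 0.7
c(reported = mean(res[bin, "postH2"]), actual = mean(res[bin, "h2"]))
```

In this toy world the reported and actual frequencies roughly agree, while the probability of *reaching* strong evidence for either model still depends on the stopping rule, which is exactly the distinction the surrounding comments are arguing over.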

It is a response to people who claim that stopping rules are problematic for Bayesian inference using arguments similar to the ones you present. This is one of the conclusions/recommendations of Rouder:

“Researchers who use Bayesian testing should use the proper interpretation as updated beliefs about the relative plausibility of models in light of data. The critical error of Yu et al. (2013) and Sanborn and Hills (2013) is studying Bayesian updating conditional on some hypothetical truth rather than conditional on data. This error is easy to make because it is what we have been taught and grown familiar with in our frequentist training. In my opinion, the key to understanding Bayesian analysis is to focus on the degree of belief for considered models, which need not and should not be calibrated relative to some hypothetical truth.”

what if we just looked at our data point by data point and stop as soon as our posterior meets the publication process requirement

[…]

a decrease in reliability of publications than using Frequentist methods

I’m pretty sure this “what if” scenario describes the majority of what goes on in many areas, except the data points are coming in batches of 10 or whatever instead of one by one. However, to make up for that hurdle, data pruning occurs at the end.

Grouping frequentist methods in with that practice is really unfair though. As usual, the real problem is people testing some default hypothesis (in this case it assumes a certain independence between the datapoints, which is violated by early stopping). Instead, they should be testing their model of a process that actually may have generated the data. In that case, it would become obvious they are just checking whether a model they made false on purpose is actually false.

Inference about theta will depend only on your prior probability distribution for theta and on the likelihood function L(theta; y_1, …, y_k) = p(y_1 | theta) p(y_2 | theta) … p(y_k | theta).

Given the data y_1, y_2, …, y_k, the fact that you used some particular stopping rule is completely irrelevant. If this dataset had been generated by fixing k beforehand (or by any other stopping rule) the likelihood function would also be L(theta; y_1, y_2, …, y_k) = p(y_1 | theta) p(y_2 | theta) … p(y_k | theta) and the inference would be the same.
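A standard concrete check of this irrelevance (a textbook binomial example, my addition rather than the commenter's): the same data, 7 successes in 10 trials, can arise from a fixed-n design or from a sample-until-the-7th-success design, and the two likelihood functions differ only by a constant factor, so they produce identical posteriors.

```r
## Same data -- 7 successes in 10 trials -- under two stopping rules.
theta <- seq(0.01, 0.99, by = 0.01)
likFixedN  <- dbinom(7, size = 10, prob = theta)  # binomial: n = 10 fixed
likStopAt7 <- dnbinom(3, size = 7, prob = theta)  # neg. binomial: 3 failures
                                                  # before the 7th success
ratio <- likFixedN / likStopAt7
range(ratio)  # both ends 10/7: constant in theta, so proportional likelihoods
```

The constant is choose(10, 7) / choose(9, 3) = 120 / 84, and any constant multiple of the likelihood drops out of Bayes' rule.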

> But the real problem is that the data generated given some stopping rule is not the same as the data generated given some other stopping rule.

This is not a problem in a Bayesian analysis, because it is conditional on the data generated. How can the data generated not be “the same” if it is a given?

When people argue to use Bayes because optional stopping isn’t a problem at all, they don’t specify that optional stopping isn’t a problem … if:

1) The prior exactly matches the distribution of parameter values

2) The model is exactly correct

They will just say “it’s not an issue!”, skirt around those conditions altogether, then call up the likelihood principle and say that the interpretation doesn’t matter.

They seem to fail to mention that in reality those conditions are never met. If we knew the exact distribution of parameter values, I’d argue we barely need to even do the study in the first place; if we know for a fact that theta ~ N(.4,.1), then what do we have to gain by even collecting data in most cases? Just to figure out which of those parameters generate some one particular dataset? Meh.

And 2), of course, no analytic model is ever correct.

So the optional stopping is only not a problem if we’re already omniscient enough to perfectly define the closed probability system. Great, then I won’t even need to collect data! I already know it.

“The ONLY time that I can see, in which optional stopping is not a problem for Bayes, is IF the effect 1) does in fact vary 2) your prior is exactly representative of the extent that it varies. So basically, if an effect varies, and you already know everything there is to know about the effect but don’t know which random parameter is responsible for YOUR particular data, then optional stopping is unaffected.”

In a full model, you may have p(y^N, N, theta|S), where S is a stopping rule, theta is an unknown parameter, and N is the length of the observation vector.

Typically, N is fixed or treated as totally random, so it is irrelevant.

Imagine though that you are generating data from a process that uses stopping rules, such as “stop when 0 is not in the 95% HDI”.

These are the generative steps:

1) Choose theta from p(theta) [could be a fixed point]

2) Generate data from p(y|theta) of some length

3) Obtain posterior for current N, assess HDI

4) If HDI excludes 0, stop.

5) Else, generate some N-delta more from p(y|theta).

Do this a few thousand times. Look at the distribution of data. Does that match what the likelihood implies? Nope. That means the analytic model will need to be revised in order to account for the fact that the generated data from a process employing optional stopping is different from the more naive model that just assumes p(y|theta). N is no longer fixed, nor fully independent [N is conditioned on some stopping rule and previous observations, so to speak]. I’m not sure if this is a resolved issue. But the real problem is that the data generated given some stopping rule is not the same as the data generated given some other stopping rule. That needs to influence the analytic model in some way.
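The generative steps above can be sketched with a toy conjugate model (my choice for illustration, not anyone's actual analysis): y_i ~ N(theta, 1) with prior theta ~ N(0, 1), so the posterior after n observations is N(S/(n+1), 1/(n+1)) with S = sum(y). Fixing theta = 0 makes the mismatch concrete.

```r
## Steps 1-5 as code: theta fixed at 0, stop when the 95% posterior interval
## excludes 0, else generate a few more observations (up to maxN).
set.seed(3)

stopRate <- function(nsim = 2000, maxN = 200, batch = 10) {
  excluded <- logical(nsim)
  for (s in seq_len(nsim)) {
    y <- c()
    repeat {
      y <- c(y, rnorm(batch, mean = 0))      # theta is exactly 0 here
      n <- length(y)
      m <- sum(y) / (n + 1); sd_post <- 1 / sqrt(n + 1)
      out <- (m - 1.96 * sd_post > 0) || (m + 1.96 * sd_post < 0)
      if (out || n >= maxN) { excluded[s] <- out; break }
    }
  }
  mean(excluded)
}

## For comparison: analyze all maxN observations exactly once.
fixedRate <- mean(replicate(2000, {
  y <- rnorm(200)
  m <- sum(y) / 201; s <- 1 / sqrt(201)
  (m - 1.96 * s > 0) || (m + 1.96 * s < 0)
}))

c(optional = stopRate(), fixedN = fixedRate)
```

With repeated checking, the rate at which the interval ends up excluding the true theta = 0 is noticeably higher than under the fixed-N analysis, even though each individual posterior is computed from the same naive p(y|theta).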

Although I question optional stopping below, I’ll chime in on the other side of the argument here.

The idea that Bayesian inference has no problem with optional stopping is *conditional on the assumption that we have the correct prior for the theta generating process*. In the case that theta just is 0, never a question of randomness, then the prior is P(theta = 0) = 1, and no data is necessary. So this leaves the Bayesian two choices:

1. Use the correct prior and so no data is necessary for inference

2. Use the incorrect prior and then all Bayesian guarantees are out

Now, if there’s a process that is first spitting out theta’s and then spitting out data conditional on that theta, and the Bayesian knows the distribution that’s spitting out thetas and the distribution of the data conditional on theta, then the Bayesian is free to do checks on the data as it comes in and stop whenever they would like and still perform valid inference.

But if their prior is wrong, I *believe* their type S error rate will be much higher than if they used a fixed N with the same wrong prior.

Since academic research is competitive, let’s come up with a strategy to maximize publications. Let’s just pretend that one gets to publish when the posterior probability that some theta is greater than some delta is larger than 1 – alpha.

Well, we could collect all our data and compute the posterior probability. But instead, what if we just looked at our data point by data point and stop as soon as our posterior meets the publication process requirement? This clearly increases our publication probability, as eventually we will use all our data, so our probability of publication using this method is strictly greater than if we had just looked at all our data at once.

Better yet, Bayesians don’t have to care about multiple comparisons, so we can repeat this process with a random permutation of the data (assuming exchangeability) and repeat the process until we get to publish. Of course, with probability 1, every possible permutation will be met, so to save computation time, we can just immediately throw out the data that disagrees with the hypothesis that theta is greater than delta and then just compute the posterior as though there were no truncation, as that would have eventually happened at some point anyways.

Of course, I’m not saying Bayesian statistics is illogical; quite the contrary, if you don’t read Frequentist results with a Bayesian view of the world, you’re dooming yourself to make silly decisions about the world. But my point is that just viewing things from a Bayesian standpoint does not make everything clear and simple; the issue with the multiple comparisons problem above is that from a Bayesian standpoint, it’s okay to do multiple comparisons *on multiple hypotheses, rather than the same hypotheses*, which is a bit subtle. Ignoring the second part of the obtuse strategy, I think it’s clear that this method will lead to a decrease in the reliability of publications compared with using Frequentist methods…although it’s not so clear to me why that’s wrong with this strategy given the likelihood principle.

My (admittedly naive) thought is that for the second part, after we’ve done one data-peek, we need to condition the likelihood on the fact that on our next peek, we would have stopped had we seen the results we wanted. Is there some argument that conditioning on this fact factors out and we are just left with something proportional (in regards to the parameters, not the data) to the original likelihood? I would think this proof would be required for the likelihood principle to allow for optional stopping, and my prior is that it has more likely than not been proved somewhere, but I’m not familiar with the proof.

No, it doesn’t change. Or maybe you can offer an example?

Upon observing these data, from an analyst perspective, the value is unknown. An analyst wouldn’t say delta = 0, period, because that’s just 100% non-data based. They may be right, but it’s not even a realistic scenario. Basically you’re saying “well, if the analyst knew everything already, then they’d be right more than a Bayesian analyst who doesn’t know everything”, and of course that’s true. And it’s irrelevant. It’s not even an inference, it’s just that somehow this person already knows the answer, and when comparing to someone who is making an inference based on observations, they are right more often – of course that’s trivially true. If you know the true answer, and I don’t, you will be correct. That’s just tautological.

And optional stopping is a problem for Bayes, for the empirical fact that those who make Bayesian inferences after optional stopping will make inferences that are different from those who make Bayesian inferences without optional stopping. That is a problem, since the claim is that it shouldn’t matter what the stopping rule is. The fact is, it does matter, in the sense that if you had two people observing the exact same stream of data, one is more likely to stop observing early and make an erroneous inference than the other who fixes their N.

And as I said, the math and interpretation of the Bayesian inferential quantities do not change according to stopping rules. But that doesn’t mean optional stopping is consequence-free for Bayesians.

And how would the analysis change?

You’d have to incorporate the stopping rule itself as a condition to the model. If the data generating process changes as a consequence of the stopping rule (some sets of data are more probable than others given a stopping rule), then the likelihood could change. If some parameter values are a-priori less probable given a stopping rule (as in, due to the stopping rule constraint and some max N, some parameter estimate is unlikely to be recovered), you may want to adjust the priors for the parameter. It’s not an easy thing, and I don’t think it’s solved. But it does exist, and other Bayesians seem to acknowledge this.

Another way of thinking about it, is if you had two papers on the same topic, using the same method, everything is the same. One analyst finds support for some theory A, and fixed their N to 400. The other analyst finds support for some opposing theory B, but engaged in optional stopping until theory B was supported. Which would you give greater credibility to?

I would give more support to theory A, all else being equal. Why? Because optional stopping can capitalize on the repeated probability of a chance ordering that supports one theory or another.

The math may not change, but intention matters, the modification of the possible sequences matters, the marginal probability of estimating some effect changes and matters.

I love Bayes, but if some state is true and fixed, optional stopping can increase the probability of choosing the wrong state/statement by over 15x. That renders the whole “bayesian inference has no problem with optional stopping” argument moot to me. Although, the rule for stopping may change the severity of the problem (stopping conditional on amount of information or posterior precision isn’t particularly problematic).

A: use a fixed sample size and perform Bayesian inference on the data

B: use one particular optional stopping rule and perform Bayesian inference on the data

You find that for delta=0, procedure B gets it wrong more often than procedure A.

We agree that the probability of making an erroneous inference changes, but I’m not sure how this shows that the Bayesian method has a problem with optional stopping.

If you consider in your simulation the procedure

C: don’t bother doing any experiment and conclude that delta=0 because I say so, period.

you will notice that it will always make the correct inference. Does it mean that Bayesian analysis also has a problem even in the absence of optional stopping, given that procedure A makes erroneous inferences much more often than procedure C?

How is optional stopping a problem for Bayes? Do you agree that, given the observed data, the fact that it was obtained using some non-fixed stopping rule is irrelevant? If not, how would the analysis change to account for the “problem”?

And no, I don’t need to simulate different values of delta. In this simulation, delta IS zero, period. Zero. It is set to zero, because I am God, and I can set it to zero. There is no uncertainty from a data generation perspective. It is factually zero, because I am in control of the process.

The “scientist” will have uncertainty, and that’s already encoded into the analysis. If you are simulating thousands of datasets for the exact same simulated experiment, and you permit delta to vary according to a distribution, you have a Schrodinger’s hypothesis. Some statement is randomly true or false, and that’s a very strange statement to make indeed.

The ONLY time that I can see, in which optional stopping is not a problem for Bayes, is IF the effect 1) does in fact vary 2) your prior is exactly representative of the extent that it varies. So basically, if an effect varies, and you already know everything there is to know about the effect but don’t know which random parameter is responsible for YOUR particular data, then optional stopping is unaffected.

But that’s never going to happen. Your prior will never, ever ever ever, perfectly represent the true distribution of effects, in practice.

If H0 is true, then you have shown that you can expect a difference between the results obtained with the two stopping rules. But you don’t know if H0 is true! If you repeat the simulation using different values of delta (generated according to the prior distribution) I think that you may find that there is no bias.

But of course the two experimental setups as you described them won’t give identical results: in aggregate, when the stopping rule is “N=400” you will have more data than when the rule allows for early stopping. What the likelihood principle says is that if you stop early with N=4 the inference is the same as if you had fixed N=4, not that it is the same as if you continue to N=400.
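The repeat-the-simulation-under-the-prior suggestion can be sketched with a toy conjugate setup (my assumptions: y_i ~ N(delta, 1), prior delta ~ N(0, 1), posterior N(S/(n+1), 1/(n+1)) with S = sum(y)). When delta really is redrawn from the prior each run, the 95% posterior intervals keep close to 95% coverage even under an aggressive stopping rule.

```r
## Toy check: draw delta from its prior each run, stop early when the 95%
## posterior interval excludes 0, then ask how often the final interval
## covers the true delta. With the prior matching the generator, coverage
## stays near 95% despite the stopping rule.
set.seed(4)

coverage <- function(nsim = 2000, maxN = 200, batch = 10) {
  hit <- logical(nsim)
  for (s in seq_len(nsim)) {
    delta <- rnorm(1)                  # delta generated from the prior
    y <- c()
    repeat {
      y <- c(y, rnorm(batch, mean = delta))
      n <- length(y)
      m <- sum(y) / (n + 1); sd_post <- 1 / sqrt(n + 1)
      stopNow <- (m - 1.96 * sd_post > 0) || (m + 1.96 * sd_post < 0)
      if (stopNow || n >= maxN) break
    }
    hit[s] <- abs(m - delta) < 1.96 * sd_post
  }
  mean(hit)   # close to 0.95 in this setup
}
```

This is the flip side of the fixed-delta simulation upthread: with delta pinned at 0 the stopping rule inflates erroneous exclusions, while averaged over the prior the posterior statements remain calibrated.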

That post is specifically about the kind of hypothesis tests that you hate (excluding zero from credible intervals; Bayes factors), but all to make the point that these sorts of goals are affected by optional stopping. Inspired by many Twitter debates.

]]>