In an otherwise pointless comment thread the other day, Dan Lakeland contributed the following gem:

A p-value is the probability of seeing data as extreme or more extreme than the result, under the assumption that the result was produced by a specific random number generator (called the null hypothesis).

I couldn't care less about p-values, but I really, really like the identification of a null hypothesis with a random number generator. That's exactly the point.

The only thing missing is to specify that “as extreme or more extreme” is defined in terms of a test statistic which itself needs to be defined for every possible outcome of the random number generator. For more on this last point, see section 1.2 of the forking paths paper:

The statistical framework of this paper is frequentist: we consider the statistical properties of hypothesis tests under hypothetical replications of the data. Consider the following testing procedures:

1. Simple classical test based on a unique test statistic, T, which when applied to the observed data yields T(y).

2. Classical test pre-chosen from a set of possible tests: thus, T(y;φ), with preregistered φ. For example, φ might correspond to choices of control variables in a regression, transformations, and data coding and excluding rules, as well as the decision of which main effect or interaction to focus on.

3. Researcher degrees of freedom without fishing: computing a single test based on the data, but in an environment where a different test would have been performed given different data; thus T(y;φ(y)), where the function φ(·) is observed in the observed case.

4. “Fishing”: computing T(y;φj) for j = 1,…,J: that is, performing J tests and then reporting the best result given the data, thus T(y; φbest(y)).

Our claim is that researchers are doing #3, but the confusion is that, when we say this, researchers think we’re accusing them of doing #4. To put it another way, researchers assert that they are not doing #4 and the implication is that they are doing #2. In the present paper we focus on possibility #3, arguing that, even without explicit fishing, a study can have a huge number of researcher degrees of freedom, following what de Groot (1956) refers to as “trying and selecting” of associations. . . .

It might seem unfair that we are criticizing published papers based on a claim about what they would have done had the data been different. But this is the (somewhat paradoxical) nature of frequentist reasoning: if you accept the concept of the p-value, you have to respect the legitimacy of modeling what would have been done under alternative data. . . .

**Summary**

These are the three counterintuitive aspects of the p-value, the three things that students and researchers often don’t understand:

– The null hypothesis is not a scientific model. Rather, it is, as Lakeland writes, “a specific random number generator.”

– The p-value is not the probability that the null hypothesis is true. Rather, it is the probability of seeing a test statistic as large or larger than was observed, conditional on the data coming from this specific random number generator.

– The p-value depends entirely on what would have been done under other possible datasets. It is not rude to speculate on what a researcher *would have done* had the data been different; actually, such specification is *required* in order to interpret the p-value, in the same way that the only way to answer the Monty Hall problem is to specify what Monty would have done under alternative scenarios.

Is there a difference between ‘random number generator’ and ‘probability model’?

Yeah I’m not sure what calling the null hypothesis a “specific random number generator” is supposed to convey. It seems to me that in most cases, the null hypothesis is a probabilistic model of what certain features of the data would look like if some quantity (that typically has scientific meaning) had a value of 0. It’s a (limited, but hopefully reasonable) model of the world that has observable consequences. I don’t see what’s at stake in calling that a kind of “scientific model.”

Anon, Olav:

Yes, “specific random number generator” is the same as “probability model.” The problem is that researchers commonly make two mistakes: (a) taking a probability model to be the same as a scientific model, and (b) taking rejection of the probability model to be the same as proof of a preferred alternative scientific model. The point of calling it a “specific random number generator” instead of a “null hypothesis” is to demystify it a bit.

But not all p-values come from full probability models. If (say) the p-value is computed using assumptions on the mean outcome, plus assumptions of independence, then there is not just one “specific random number generator” that accords with the null, but a huge class of them. So I don’t think this identification entirely works.

Also, this discussion skips over the issue of which other possible datasets one must consider to make the corresponding p-value useful – and it’s not an easy problem, see e.g. Cox 1958.

“But not all p-values come from full probability models”

I don't even know what to make of that statement. By definition, it's impossible to calculate a given p-value without a probability model associated with it.

Now, the probability model need not be of the data; sometimes it's directly on the test statistic. I.e., independent of what the data distribution is, the fact that it has a mean and a standard deviation means that its sample averages are distributed like some random number generator R, provided the sample size N is large enough (central limit theorems, etc.).

But every p-value is the tail probability of some probability model!
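A quick sketch of the central-limit point: the individual data points below come from a skewed exponential distribution, but their sample averages behave like draws from a normal random number generator. The exponential choice and the sample sizes are arbitrary:

```python
import random, statistics

random.seed(1)

# Skewed "data" distribution: exponential with mean 1 and sd 1.
# We never model the individual data points, only the sample mean.
N = 50
means = [statistics.fmean(random.expovariate(1.0) for _ in range(N))
         for _ in range(20_000)]

# CLT prediction: the means look like draws from normal(1, 1/sqrt(N)).
m = statistics.fmean(means)
s = statistics.stdev(means)
```

The simulated mean and spread of the averages match the CLT prediction even though no individual observation looks remotely normal.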

“Now, the probability model need not be of the data…”

hi daniel, i think that is george’s point? he states “full” probability models.

so i do not think you two are necessarily disagreeing.

To be clear: by “full probability model” I mean a data-generating model for the data. I was interpreting your comment that “the result was produced by a specific random number generator” to mean that the data (and not just some summary of it) was produced by some completely-specified process. Andrew’s comment that the “test statistic … itself needs to be defined for every possible outcome of the random number generator” does support this interpretation, I believe – but apologies if it is wrong.

And to be (hopefully) clear on why I think it matters: specifying what the null means is fairly easy in an assumed full probability model – typically some regression coefficient is zero, and by assumption that implies a lot about the true state of nature. With fewer assumptions the interpretation is harder; some summary having a “null” value (e.g. mean difference of zero, comparing two groups) doesn’t mean that some broader null hypothesis (e.g. no difference in distribution across two groups) can be ruled out.

With the second version, it’s easier to engage non-experts in discussions about what else might be going on, beyond the one aspect quantified by the test statistic.

Yes, I see what you’re saying. Sometimes we can specify a full probability model for how the individual data points occur. Other times we just specify a probability model for how the sample statistics occur, which leaves an infinity of data generating processes we could consider.

Of course, **sample statistics are themselves data** so it is possible to cause confusion.

Typically, the specification at the aggregated level is more correct distribution-wise, and less specific. The (RNG based) sampling distribution of the mean of N values “is normal” for a huge number of distributions when N is even moderately large (typically 20?).

However, the “Bootstrapping” methods all basically specify that the data comes from a distribution that looks a lot like sampling the data with replacement. The smoothed bootstrap for example is sampling from the kernel density estimate. There you’re actually claiming something at the data level. “Future data values will look like samples from the KDE!”
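Here's a small sketch of both bootstraps; the data, the Gaussian kernel, and the bandwidth of 1.0 are made up for illustration:

```python
import random, statistics

random.seed(2)
data = [random.gauss(10, 2) for _ in range(30)]   # invented observed sample

def bootstrap_means(sample, reps=5000, smooth=0.0):
    # smooth=0: ordinary bootstrap (resample with replacement);
    # smooth>0: smoothed bootstrap, i.e. sampling from a Gaussian kernel
    # density estimate centered at the data points.
    out = []
    for _ in range(reps):
        resample = [random.choice(sample) + random.gauss(0, smooth)
                    for _ in sample]
        out.append(statistics.fmean(resample))
    return out

plain = bootstrap_means(data)
smoothed = bootstrap_means(data, smooth=1.0)
```

With smooth=0 this is the ordinary bootstrap; with smooth>0 each resampled point gets kernel noise, i.e. a draw from the KDE, which is exactly the data-level claim described above: "future data values will look like samples from the KDE."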

@daniel this is roughly what I meant the other day by saying a frequentist analysis of a given full model typically makes weaker assumptions on the truth of the model. Since you’re familiar with PDEs think: weak solution of a PDE as satisfying an integral condition on the original PDE and thus not requiring the PDE to be exactly satisfied at all points.

@ojm. Sure it’s fine to say that you’re only modeling the distribution of the test statistic, but you’re still making a VERY STRONG assumption to say that your data occurs as a random iid sample from some population.

Let's just give an example that gives the flavor. R(i) is an MCMC sampler from probability density p(x); it returns a sequence of samples.

Now, p(x) is "hard" to sample from for this sampler. It does, however, run many cycles very quickly. So you grab R(i) for i = 1..100, even though internally it does 10,000 jumps and only reports every 100th, so there isn't much auto-correlation… but it never gets outside of some local maximum which is nowhere near the global maximum. Let's say the time to transition from the initial conditions into the "main probability mass" is 1e24 jumps, so taking any reasonable number of additional samples doesn't help.

A simpler concept: take 20 random samples from a strange-looking distribution, write them down, and hand them to a 5-year-old. Ask her which numbers she "likes the best" and have her pick out 10 of them. Then sample with replacement from the rest and do your frequentist statistics on that sample. How close will your predictions for the averages and standard deviations be, compared to sampling with replacement from all 20? It depends a LOT on what "my 5 year old likes the best," which is not the kind of fact I expect to be implied by "frequency guarantees."

This is the same problem as saying "we took a representative sample of people in NY who came into the Foo clinic and gave them drug X; it was successful at reducing cholesterol by factor y; therefore we can give it to the whole population and it will on average produce a reduction of about y in cholesterol (p < 0.05), and if we repeat this procedure hundreds of times we're guaranteed to be right 95% of the time."

The extrapolation to the larger population is automatic when you really do randomly sample the full well-defined population.

But the use of sampling concepts in probability is totally SPURIOUS to get estimates and standard errors on this other problem where you have just some data and no way to define a specific population and a specific method of sampling.

I'm not saying Bayesians can solve all these problems, but they at least have a clear path to attempt to tackle them, because they can build a model for "what it means for a 5 year old to like a number" or "how hard is it to move from the initial conditions into the main probability mass" or whatever.

No amount of bootstrapping a confidence interval on the 10 numbers will help you deal with the fact that a 5 year old picked her favorites first.
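A simulation of the 5-year-old experiment, with everything invented for illustration: a two-component mixture stands in for the "strange looking" distribution, and "keep the 10 largest values" stands in for the child's unmodeled preferences.

```python
import random, statistics

random.seed(3)

# A "strange looking" population: mixture of two well-separated normals.
def strange():
    return random.gauss(0, 1) if random.random() < 0.7 else random.gauss(20, 1)

sample20 = [strange() for _ in range(20)]

# The "5 year old" keeps her 10 favorites -- here, the 10 largest values.
# (Any unmodeled selection rule would do; this one is just for illustration.)
remaining = sorted(sample20)[:10]

# Bootstrap a 95% interval for the mean from the 10 remaining values.
boot = sorted(
    statistics.fmean(random.choices(remaining, k=10)) for _ in range(5000)
)
lo, hi = boot[124], boot[4874]          # approx 2.5% and 97.5% quantiles

true_mean = statistics.fmean(sample20)  # what we would want to recover
```

In this simulated run the bootstrap interval built from the post-selection numbers sits entirely below the full-sample mean: resampling the 10 remaining values cannot recover the information the selection step threw away.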

also, you might say "well, a frequentist could create a model of this more complex sampling process," and while that's true, they *cannot make a probability model of it*

specifically: they can’t put probability distributions over the possible frequency distributions of the whole population, and they can’t put probability distributions over the possible ways in which the sampling deviates from IID.

why? because there's no sense in which the actual shape of the population distribution is a sample from some meta-distribution of "populations" in alternate universes or whatever. Also, you could imagine that the child's effect on the sampling process is variable in some way or another, but you have only one realization of this effect, and no knowledge of the real population distribution. You could imagine, for example, that the child's effect is equivalent to one of several hundred possible computer programs, but you can't put a "prior" over those computer programs, so there's no sense in which you can give a probabilistic statement here.

In the end, the Frequentist is really left with the following statement:

“If you tell me your guess for what the full population looks like from which the 20 numbers were drawn, and you tell me your guess for the computer program that emulates the child, I will show you how to calculate some information about the 20 numbers”

This maps pretty directly to:

"If you tell me the true population distribution of cholesterol reductions across the whole world, and you give me a computer program that simulates the process of 'being in New York and choosing to go to the doctor at clinic X' and a computer program that simulates the process of 'being in Hamburg and choosing to go to the doctor,' then I will give you frequency guarantees on the differences you'd see between New York and Hamburg when you try this process over and over."

What I’m trying to say is that it (might) help to think of the frequentist approach as a method of *analysis* of a given model or method and the bayesian approach as a way of *constructing* a model.

Think Box, Gelman etc and posterior predictive checks.

Not to say that there aren’t Bayesian interpretations of frequentist analyses and frequentist interpretations of bayesian models, which I think there often are. I just think of this as the ‘general orientation’ of each approach.

Not only does “specific random number generator” demystify things, but actually in many cases there IS a difference between an RNG and a probability model. Specifically:

Frequentist: suppose D ~ normal(mu,1); and m = mean(D), then mu = m ± I with p = 0.05

This specifically assumes that the actual data D is drawn from a random number generator such as rnorm(N(D),mu,1) for some unknown value of mu. This assumption is absolutely false in the vast majority of actual uses. In many cases, there is no sense in which it can be even approximately true.

Bayesian: suppose p(D | mu) = normal(mu,1) and p(mu) = some_dist(constants) then p(mu|D) = p(D|mu)p(mu)/C

p(D|mu) is a function of *mu* for fixed D. It doesn’t say anything about random sampling of D. The only thing required of p(D|mu) to get good bayesian inference is that as a function of mu, it has a peak near the “real” value of mu. The goodness of the inference is not dependent on D looking anything like a random sample from p(D|mu).

For an example of the latter see:

http://models.street-artists.org/2014/03/21/the-bayesian-approach-to-frequentist-sampling-theory/

in which the likelihood function has substantial density in regions of space that are physically impossible and yet the inference for the average is still “good”. The sample looks nothing like a sample from p(D|mu) for mu equal to the actual average value.
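Here's a self-contained sketch of the same phenomenon (not the linked example itself): the data below are uniform on [0, 10], so they look nothing like draws from normal(mu, 1), yet a grid posterior built from the normal(mu, 1) likelihood and a flat prior still concentrates near the true average of 5.

```python
import math, random

random.seed(4)

# Data that is emphatically NOT normal(mu, 1): uniform on [0, 10].
data = [random.uniform(0, 10) for _ in range(200)]
sample_mean = sum(data) / len(data)   # close to the true mean, 5

# Grid posterior for mu: normal(mu, 1) likelihood, flat prior.
# p(D|mu) is used purely as a function of mu for the fixed, observed D.
grid = [i / 100 for i in range(1001)]   # mu from 0 to 10
logs = [sum(-0.5 * (y - mu) ** 2 for y in data) for mu in grid]
mmax = max(logs)
w = [math.exp(l - mmax) for l in logs]
post_mean = sum(mu * wi for mu, wi in zip(grid, w)) / sum(w)
```

The posterior peaks near the sample mean even though the data look nothing like a random sample from normal(mu, 1): what mattered was only that the likelihood, as a function of mu, peaks near the right value.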

To make the importance of this clearer imagine the difference between the following scenarios:

1) There are 100 crates of orange juice, 100 cartons in each crate, each carton has a serial number. I use an RNG to select 100 cartons and measure their weight. How much will all the cartons put together weigh?

2) Every living person in 20 first world countries (including the US) is assigned a serial number, a random number generator is used to get a random sample of 10,000 of them. Their serum cholesterol is measured at day 0, and they are put on drug X, their serum cholesterol is measured 6 months later, and the average change in cholesterol was deltaC. How big would the average change be for another random sample of 10,000 patients from the same population?

3) 10,000 patients come into a clinic in New York City last year, their serum cholesterol is measured and then they are put on drug X and 6 months later their serum cholesterol is measured. How big will the average change in cholesterol be if you give this drug to 10,000 other patients in Hamburg Germany?

In problems 1 and 2 there is actual random selection from a defined finite population. Mathematically, the use of the RNG ensures that your sample cannot be "too different" from other samples of the same size or from the whole population.
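A sketch of that guarantee for scenario 1, with invented carton weights:

```python
import random

random.seed(5)

# Scenario 1: 100 crates x 100 cartons = 10,000 cartons with serial numbers.
# The weights (in grams) are invented for illustration.
population = [random.gauss(1000, 50) for _ in range(10_000)]
true_total = sum(population)

# Repeatedly draw an RNG-based sample of 100 serials and scale up.
estimates = [sum(random.sample(population, 100)) / 100 * 10_000
             for _ in range(2000)]

# Random selection keeps every estimate close to the true total.
worst_rel_error = max(abs(e - true_total) / true_total for e in estimates)
```

Over 2000 replications of the RNG-based sampling, no estimate of the total weight is off by more than a couple of percent; that is the sense in which random selection mathematically forbids the sample from being "too different" from the population.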

In problem 3 there is no notion of random sampling at all. In what sense do we get any “Frequency guarantees” in any frequentist analysis of problem 3?

Well, like I always say to my undergrads when I talk about random sampling: at least with random sampling there is some math that says that, on average, it's going to be a good estimate, even if you can't know whether yours is one of the really bad samples. If you just survey people who walk by you, you have no idea.

Yes, but ignoring the issue and moving forward with a rote analysis which then gives you a tiny standard error that you publish which then, a few years later, is shown in replication with a broader more representative sample to have been a complete and utter joke… that’s not funny!

Thank you. I still am not quite sure what’s wrong with (a), though Lakeland’s explanation below is somewhat illuminating. The null hypothesis certainly *ought* to be scientifically grounded, and not just some arbitrary random number generator… The real sin seems to me to be (b), or setting up a bogus null hypothesis and then taking its rejection to imply proof of one’s own pet hypothesis.

Sometimes there's nothing wrong with (a) (the idea that the random number generator is a scientific hypothesis), and then the Bayesian and the Frequentist are both right. This would be my cases (1) and (2) (the orange juice and the forced worldwide trial of the drug).

In these cases, the RNG imposes a fact on the world, and by doing so it becomes a scientific fact. Namely, the fact that it’s really hard to find a set of 10,000 people who are “very different” from everyone else by trying sets of 10,000 that come out of an RNG. How hard it is to find that subset by just asking the RNG for a new sample is in some sense measured by the p value.

But sometimes (a) is really wrong. How hard is it to find a sample of people that are “very different” from people who choose to go to the doctor in Hamburg Germany by looking at people who choose to go to the doctor in New York USA? It’s potentially really super easy, and potentially there could be “quite large” differences between the two, maybe orders of magnitude larger than would be predicted from assuming the NY sample is somehow applicable to other places and people in the world.

The Bayesian can’t get away from the problem that they need more data to predict Hamburg, but the Bayesian can do two things the Frequentist can’t:

1) use other sources of information probabilistically through the prior on parameters.

2) predict parameters without a commitment to a model involving random sampling. The Bayesian can use any scientific information they have to pick any distribution which has the property that p(Data | Params) has higher density near the true value of the Params than away from the true value of the Params, and still get good inference. In particular, there is no need to assume IID sampling from a given distribution. There is no need to assume the tails fall off “accurately” in some long-run frequency sense, and many other relaxations are possible as well.

In the majority of cases I can think of where (2) happens and people call it "Frequentist," they usually just mean "Bayesian with a flat prior," and the inference is done through maximum likelihood, which is a Bayesian calculation (by a Bayesian calculation I mean any calculation that obeys the sum and product rules of Bayesian probability, which the maximum likelihood calculation does).

When you think of truly “Frequentist” methods you should think of things like Bootstrapping, Permutation Testing, Kolmogorov-Smirnov testing, Chi-Squared goodness of fit testing, Mann-Whitney testing etc etc. These inherently have as their basis an infinite sequence of values of which your data is the prefix.
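For instance, a permutation test really does lean on a collective of relabelings; a minimal version with invented group data:

```python
import random, statistics

random.seed(6)

# Two invented groups; the null says the group labels are exchangeable.
a = [random.gauss(0.0, 1) for _ in range(25)]
b = [random.gauss(0.8, 1) for _ in range(25)]
observed = statistics.fmean(b) - statistics.fmean(a)

# The reference distribution is generated by shuffling the labels:
# an (approximate) enumeration of the sequence of possible relabelings.
pooled = a + b
reps = 5000
count = 0
for _ in range(reps):
    random.shuffle(pooled)
    diff = statistics.fmean(pooled[25:]) - statistics.fmean(pooled[:25])
    if diff >= observed:
        count += 1
p_value = count / reps   # one-sided permutation p-value
```

The probability here is purely a frequency over relabelings; no likelihood or prior is anywhere in sight.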

Well, sometimes it’s not possible to set up a sensible null hypothesis, as in your (3), perhaps. I’m not sure what the null hypothesis is supposed to be there.

As for your comments on frequentism. To me, a “frequentist” is just someone who refuses to explicitly assign priors to hypotheses. As you note, frequentist methods can often be understood as implicitly assigning a kind of “prior,” but those priors are often not genuine Bayesian priors or are priors only in a mathematical sense. For example, the flat prior that corresponds to MLE is improper (and therefore does not really obey the sum rule); a prior corresponding to AIC penalization makes no Bayesian sense since it assigns a lower probability to a model than to a submodel, even though the model has to be true if the submodel is; shrinkage estimators can be mimicked with “empirical Bayes” priors, but those are priors that are constructed from the data and are therefore not legitimate Bayesian priors; etc.

Exactly, I think the “It’s not possible to set up a sensible null hypothesis (in terms of random sampling from a population)” is actually the *USUAL* case or at least a really common case in actual practice. For example in Himmicanes and Mechanical Turk surveys of women’s shirt colors and menstrual cycles, and a whole bunch of Biology and cancer and medical research and etc etc. In almost every case where people have these weird claims, they’re improperly implicitly assuming that there is some well defined population that they’re thinking of which their sample is a well defined random sample from.

“a frequentist is just someone who refuses to explicitly assign priors to hypotheses” is not a good definition I think, and I certainly don’t want to confuse anyone into thinking that is the definition I’m using.

To me, A frequentist is someone who interprets the meaning of probability as solely “frequency in repeated sampling”. At least when I use the term that’s what I mean.

In fact, almost any likelihood calculation is a Bayesian calculation. The flat prior that corresponds to MLE I interpret as a nonstandard object with infinitesimal density over a nonstandard length. It results in a proper posterior, its interpretation is Bayesian, and it obeys the sum rule (the integral over the nonstandard domain is 1, etc.). It's no more illegitimate than the definition of the integral: int(f(x)dx,a,b) = st(sum(f(a+(b-a)/N*i) * dx, i=1..N)) for N nonstandard. There is no sense in which "how often" something happens is used in that calculation, unless you explicitly check whether the assigned likelihood fits the generated data using statistical tests. That is typically never done; you couldn't, really, since it would take a huge amount of data to verify the frequency properties of the distribution, and once you had that data you wouldn't need likelihood calculations at all: you'd just compute the sample statistic and have a tiny standard error.

I suspect an AIC penalization is really a choice of likelihood as a mixture over different likelihoods with prior mixture probability assigned to different models that happen to be submodels of each other but I haven’t looked carefully there.

In any case it’s my belief that if you do a likelihood based calculation and you haven’t got a massive dataset to validate the shape of the likelihood function as a proper frequency distribution, you’re just pretending to not be a Bayesian.

>>>To me, A frequentist is someone who interprets the meaning of probability as solely “frequency in repeated sampling”. At least when I use the term that’s what I mean.<<<

Sounds like a straw man to me. Are there many who stick to that dogmatic position, outside of those who relish the philosophical debate?

In practical contexts, isn't the presence of an explicit, considered prior a good working definition to split the body of applied work into Bayesian vs. Frequentist approaches?

Rahul. No I don’t think so AT ALL. Anyone who picks a likelihood without doing validation of the Frequency distribution and then fits maximum likelihood has the following questions to ask themselves:

By what logic does this calculation make any sense? And, what has to hold in order for it to perform well?

The answers to those things are going to be that the logic that makes it work is the Cox axioms, and it will work well when the chosen likelihood puts the data into the high probability region when the parameters have their proper values. In other words, it’s a Bayesian calculation. The fact that it has flat nonstandard priors doesn’t mean that it’s not inherently a Bayesian calculation.

Suppose that instead of ML estimation stats textbooks taught that you should always choose normal(0,10^360000) priors for unknown parameters and then do a Bayesian calculation. Would it fail to be Bayesian? Would it produce any different results than an ML calculation in any real-world problem? What would be the “frequentist” logic behind it?
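For the normal-mean case this is easy to check numerically; the conjugate-posterior formula below assumes a normal(mu, 1) likelihood and a normal(0, tau) prior, with invented data:

```python
import random

random.seed(7)
data = [random.gauss(3.0, 1.0) for _ in range(50)]
n = len(data)

# Maximum likelihood for the mean of normal(mu, 1): the sample mean.
mle = sum(data) / n

# Conjugate posterior mean under a normal(0, tau) prior: as tau grows,
# the prior flattens out and the posterior mean converges to the MLE.
def posterior_mean(tau):
    return sum(data) / (n + 1.0 / tau ** 2)

nearly_flat = posterior_mean(1e6)   # a huge-variance prior: same as MLE
informative = posterior_mean(0.5)   # a tight prior visibly shrinks toward 0
```

Numerically, the huge-variance "Bayesian" answer is indistinguishable from the ML answer, which is the point of the rhetorical question above.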

@Daniel

To illustrate the point, can you link to applied studies (not methods papers) that you’d say are Bayesian but have no use / mention of any explicit prior?

Rahul, I don’t have any links for you, and when I try to search I find methods papers because in methods papers they’re very explicit about their use of maximum likelihood, whereas in applied papers people just tend to do stuff and not give a big fanfare, but just go into your favorite journals and find:

1) Anyone doing least-squares fits, either to fit simple lines or low-order polynomials in Excel, or something a little more sophisticated (maybe a theoretically pre-determined family of curves), or to find coefficients for an ODE. Least squares = maximum likelihood under a normal error model, and almost never will you find anyone "validating" the normal error model with something like a Shapiro-Wilk test; it's an assumption, and it's a Bayesian one. Even if they validate it, it then falls into the "both Bayesian and Frequentist" category.

2) Anyone explicitly fitting maximum-likelihood models. Where did they get their likelihood? It's usually either a default like "these events are rare and cannot occur at the same time, so they have a Poisson distribution with unknown parameter lambda" (that's Bayesian), or they say "the variance was sometimes bigger than the mean, so we used a negative binomial model" (again, Bayesian; there's no reason to think there's a collective of stuff in the world that has a negative binomial distribution that you're sampling from).
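The least-squares-equals-maximum-likelihood point in (1) can be seen directly; the line and the noise level below are invented:

```python
import random

random.seed(8)

# Invented data: y = 2x + 1 + normal noise.
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + random.gauss(0, 0.5) for x in xs]

# Closed-form least-squares line.
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
intercept = ybar - slope * xbar

# Gaussian log-likelihood with fixed sigma: maximizing it is the same
# thing as minimizing the sum of squares.
def log_lik(a, b):
    return sum(-0.5 * (y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

best = log_lik(slope, intercept)
worse = log_lik(slope + 0.05, intercept - 0.05)  # any other line is worse
```

The least-squares coefficients are exactly the maximizers of the normal-error likelihood; no one ever checks whether the residuals really have normal frequency properties before using them.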

Here's an example I finally found in PLoS One that is full of "tests" and uses terrible logic (such as stepwise exclusion of parameters based on chi-squared fits or something), but at heart it's a Bayesian model: they chose a likelihood for the data based on the negative binomial, because it has a separate mean and variance, instead of the Poisson, which has them equal. They do a generalized-linear-model fit via maximum likelihood based on this assumed likelihood…

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0091105

This stuff is rampant, and it’s Bayesian. if you’re choosing a model based on what you know not what the data actually can prove to you through frequencies, you’re doing a Bayesian calculation.

“if you’re choosing a model based on what you know not what the data actually can prove to you through frequencies, you’re doing a Bayesian calculation”

The arguments are probably getting tiresome but this is simply not true!

ojm. I’d be happy to discuss the issue with you over at my blog where it might be less tiresome :-) I’ll put up a post later today.

Ok, here’s my new thoughts on the issue, thanks to ojm once again for pushing back and getting me closer to a well thought out position.

http://models.street-artists.org/2016/05/09/confusion-of-definitions-bayesian-vs-frequentist/

The TL;DR version. I think a lot of classical statistics is not inference at all, but rather communications engineering. It basically starts with a Bayesian model and whittles it down based on Frequentist criteria until it makes for a good compression scheme.

That’s basically what’s going on in the example paper at PLoS One on bacterial cultures. They start with a Bayesian model based on an assumed likelihood, and remove stuff from the likelihood using Frequentist tests effectively on a guiding principle of “reduced model size” and then maximize the final likelihood (Bayesian) to produce a model that would be a good starting point for a channel encoding scheme to send a whole bunch of data of the same type if they were to collect it (ie. an assumption of a stable process).

Forgive me if I have mistaken this somewhere, I am not an excellent mathematician. However, I think the point can be described from a measure theoretic and philosophical standpoint as such:

Any hypothesis, null or not, is essentially an indicator random variable defined on the event space (the sigma-algebra on the sample space), or, more indirectly, on the topological space defined by the image of some random variable defined thereupon. Now, this is indeed how it should be since, from a philosophy-of-statistical-science point of view, hypotheses are formulated as binary questions and data collection is essentially a plebiscite.

That said, by the very definition of a hypothesis as a measurable function from one measurable space to another, our hypothesis induces a probability measure as the pushforward of the original, albeit typically unknown, measure. Thus the hypothesis is, by the very axiomatization of probability theory, a random variable in its own right, and can therefore be thought of as a random number generator adhering to the distribution it induces.

If I might opine for a moment: the issue here is not that the hypothesis is a form of scientific model, for one could just as well take any indicator random variable and its corresponding pushforward measure as a scientific model with varying degrees of predictive power, but rather the interpretation of the p-value as some form of epistemic truth about this variable.

The p-value itself is essentially a random variable that gives the probability of observing a value at least as extreme (for real-valued random variables) as that which is generated by the hypothesis you have chosen as the default, and the choice of this default is essentially unbounded.

I might finally add that there is some philosophical distinction between scientific and statistical models. One might make the existential argument that a scientific model has some sort of semantics encoded therein which represents, to the best of our senses, how the universe actually works.

One might, then, from a philosophical standpoint always choose the null hypothesis to be the currently most accepted / most highly predictive scientific model, and always compare new scientific hypotheses against it. That said, the mathematics does not make a distinction in this respect; both must be formulated as measurable functions, regardless of how abstruse they are to those trying to understand them in human terms.

After all, you could build a Rube Goldberg machine of a hypothesis and still have it be statistically significant and even existentially true, but that doesn't really help *humans* understand the universe any better, which, ostensibly, is the philosophical purpose of science and statistics.

Finally, to digress a bit: at some point we may indeed have to forgo our desire for simplistic models, and indeed, with computational power increasing, it may be our computers coming up with and testing hypotheses all on their own. The mathematical language they use may be very different from ours, as a terabyte-long proof is just as intelligible to them as a two-sentence one is to us.

"– The p-value is not the probability that the null hypothesis is true. Rather, it is the probability of seeing a test statistic as large or larger than was observed, conditional on the data coming from this specific random number generator."

And there are an infinite number of other random number generators that could have produced substantially the same results with similar probabilities.

A P-value is calculated using data x and, say, a parametric family of probability models indexed by theta. P-values are probabilities of particular events and are calculated within the model, either through direct calculations where possible or by simulations. A null hypothesis may specify a parameter value theta_0 exactly, or only some components of theta, say the mean mu of the N(mu,sigma^2) model but not the sigma. Given a numerical value associated with the data x, one can ask whether this value is consistent with theta_0. For example, if the mean of the data is 1.145 this would not be considered as being consistent with the N(100,1) model.

Another way of interpreting a P-value is as a measure of approximation. The P-value specifies how weak your concept of approximation has to be for the model theta_0 to be accepted as an approximation to the data x with respect to the numerical quantity under consideration. There is no need to restrict the calculation of P-values to just one numerical value. For the normal model N(mu,sigma^2) one may ask whether the mean, the variance, the degree of outlyingness, and the distance from the model are all typical for data generated under the model. Each has a P-value, and the overall P-value could be defined as the minimum of the separate P-values.

Suppose that the overall P-value is so small that it is accepted that the data are not consistent with theta_0 with respect to the quantities considered. Then one has to find out why. At this stage the model and the parameter theta_0 have to be related to the world. Possible explanations are a badly calibrated measuring instrument, the wrong data, outliers, dependency between the observations, transcription errors, etc. If all these can be eliminated to some extent, then we can provisionally and speculatively claim that the small P-value does indeed indicate that the quantity of copper in this sample of drinking water does not exceed the legal limit.
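The "calculated within the model, either through direct calculations where possible or by simulations" point connects directly back to the random-number-generator framing: treat the null as an RNG and count how often it produces a statistic at least as extreme as the one observed. Here is a minimal sketch; the sample size, shift, and choice of statistic are made-up numbers for illustration, not anything from the comment.

```python
import random
import statistics

random.seed(0)

n = 25

def null_draw():
    """The null hypothesis as a random number generator: n draws from N(0, 1)."""
    return [random.gauss(0, 1) for _ in range(n)]

def stat(sample):
    """Test statistic: absolute sample mean.
    Note it is defined for every possible outcome of the generator."""
    return abs(statistics.fmean(sample))

# Made-up "observed" data that actually come from a shifted distribution.
observed = [random.gauss(1.0, 1) for _ in range(n)]
obs_stat = stat(observed)

# Simulated p-value: fraction of null draws at least as extreme as observed.
sims = [stat(null_draw()) for _ in range(10_000)]
p = sum(s >= obs_stat for s in sims) / len(sims)
```

With these invented numbers the shift is large relative to the sampling noise, so the simulated p-value comes out small; the same machinery works for any statistic you can compute from a draw, which is what licenses the "minimum of several P-values" idea in the comment.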

One word: probability integral transform.

Only a linguist would call that “one word” :-)

*Sigh*. I always have to explain my jokes on this blog. I meant it as a joke ;).

I thought you were going to point out that in German, it’s one word.

probabilitieintegraltransformierung?

No idea how to say it in German (I wouldn’t even know how to say it in Hindi or Japanese or French, the other languages I speak). Wahrscheinlichkeitsintegraltransformation is my guess.

I believe this post shows the value of parroting the same issues repeatedly, over and over again.

On the probability parlance: there was quite a disagreement among the ASA p-value panel and commentators about what was meant by the null hypothesis (being true).

Some took it to be “the null hypothesis is the complete set of assumptions under which the p-value is calculated” (e.g., Stark, Mayo), while others held that the “null hypothesis may specify a parameter value theta_0 exactly,” with there being other assumptions about the scientific model to be concerned about (e.g., Greenland et al.).

Under the first conception, the distribution of p-values would by definition be uniform if the null hypothesis were true; under the second conception it could be very non-uniform, depending on whether the wider assumptions about the scientific model warranted treating the data essentially as randomized comparisons.

(Wish I had pointed that out here: http://www.stat.columbia.edu/~gelman/research/published/GelmanORourkeBiostatistics.pdf )
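The uniform-versus-non-uniform contrast is easy to check by simulation. Below, the "complete null" satisfies every assumption behind the p-value calculation, while the "partial null" keeps the parameter restriction (mean zero) but violates the wider assumptions via a shared random effect; the particular violation is my own illustrative choice, not one from the thread.

```python
import math
import random

random.seed(2)

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def p_value(sample):
    """Two-sided p-value for mean = 0, assuming i.i.d. N(mu, 1) data."""
    n = len(sample)
    z = math.sqrt(n) * (sum(sample) / n)
    return 2 * (1 - phi(abs(z)))

n, reps = 30, 4000

# Complete null: every assumption used to compute the p-value holds.
p_complete = [p_value([random.gauss(0, 1) for _ in range(n)])
              for _ in range(reps)]

# Partial null: the mean really is 0, but a shared random effect makes the
# observations dependent, violating the independence assumption.
def dependent_sample():
    shared = random.gauss(0, 1)
    return [shared + random.gauss(0, 1) for _ in range(n)]

p_partial = [p_value(dependent_sample()) for _ in range(reps)]

frac_complete = sum(p < 0.05 for p in p_complete) / reps
frac_partial = sum(p < 0.05 for p in p_partial) / reps
```

Under the complete null roughly 5% of p-values fall below 0.05 (the uniform case); under the partial null the computed z statistic assumes far too small a variance for the sample mean, and the small-p fraction is many times larger.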

But I had not realized how neat this notation T(y;φ(y)) was.

I am currently working on simulations and animations of estimation behavior given test outcome based decisions.

Here T(y;φ(y)) -> E.m(y;T(y)), where E.c is the combined estimate and E.s the separate estimates, chosen according to whether T(y), a test for differences, was not significant or significant (an early paper would be Bancroft 1944, http://projecteuclid.org/euclid.aoms/1177731284 ).

The problems have been narrowly known since 1944 but only sporadically appreciated more widely over time (I do remember David Cox around the mid-1980s speculating that a φ(y) such as choosing a Box-Cox transformation hopefully would not affect the performance of T(y;φ(y))), so one should expect this to be a troublesome issue.

But the patterns I see in these simulations I am doing now are actually weirder than I expected (which is why I am animating the simulation results).

Anyway, thanks for the distraction.

Would love to see some animations. Any chance you’ll put it on the web somewhere?

Simulating the Garden of Forking Paths?

Yep, the challenge being whether φ(y) can be made explicit enough to program.

If not, then the properties of T(y;φ(y)) if not undefined will be unknown.

(Now if I insist φ(y) must be continuous, someone will find publishable “statistical” paradoxes.)

Perhaps the differential value of simulation in understanding statistical issues, for those writing the simulations (or reviewing the code), is the need for explicit considerations. For instance, in the Monty Hall problem, instructions for which door Monty opens. That’s key to understanding the puzzle of what it means for something to be informative (it changes the prior) and how that varies (it depends on how the door is chosen). Now, is someone coding the simulation likely to do more than just automatically make the choice random and equal, without thinking about or trying some variations? Good mathematicians will quickly identify an implicit assumption about the door choice, and better ones will investigate varying it.
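The Monty Hall point can be made concrete: coding the simulation forces the door-opening rule to be explicit. The two rules and trial counts below are my own illustrative choices. Under the standard rule (Monty always opens an unpicked, non-prize door) switching wins about 2/3 of the time; under a variant where Monty opens a random unpicked door and we condition on the prize not being revealed, switching wins only about half the time.

```python
import random

random.seed(3)

def monty_trial(monty_knows):
    """One game; returns whether switching wins, or None for discarded trials."""
    doors = [0, 1, 2]
    prize = random.choice(doors)
    pick = random.choice(doors)
    if monty_knows:
        # Standard rule: Monty always opens an unpicked, non-prize door.
        opened = random.choice([d for d in doors if d not in (pick, prize)])
    else:
        # Variant rule: Monty opens a random unpicked door; discard the
        # trials where he happens to reveal the prize.
        opened = random.choice([d for d in doors if d != pick])
        if opened == prize:
            return None
    switched_to = next(d for d in doors if d not in (pick, opened))
    return switched_to == prize

def switch_win_rate(monty_knows, trials=20_000):
    results = [monty_trial(monty_knows) for _ in range(trials)]
    kept = [r for r in results if r is not None]
    return sum(kept) / len(kept)

rate_standard = switch_win_rate(True)   # about 2/3
rate_variant = switch_win_rate(False)   # about 1/2
```

The one-line difference between the two branches is exactly the implicit assumption the comment is talking about: whether the opened door is informative depends on how it was chosen.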

For those watching the simulations, some may and likely most won’t get much out of it.

To _me_, this animation of a two-stage simulation should provide most if not all of the understanding of Bayesian analyses needed by those who are not trying to be their own statisticians but want to understand Bayesian analyses: https://galtonbayesianmachine.shinyapps.io/GaltonBayesianMachine/

Most don’t seem to get much out of it.

Richard McElreath has worked much harder along a similar approach https://www.youtube.com/playlist?list=PLDcUM9US4XdMdZOhJWJJD4mDBMnbTWw_z and has tried using the same Galton two-stage quincunx.

I guess you are trying to specify “data cleaning and model choosing” procedures, and that is the hard part, yeah? Getting some set of decisions that mimics the way researchers work sufficiently well that you can get at the Garden?

I have this not-fully-fleshed-out thought that maybe there is a “sufficient statistic” type of thing here. Suppose that the Garden-type decisions people make are independent of treatment status: that is, they look at distributions and moments and whatever in the data when deciding how to analyze it, but these choices were not related to treatment assignment (so blind to treatment, or only using the control group).

Then you take that (real) data, randomly assign treatment, and run the exact same model the authors run, on whatever interactions and covariates they employ. Then you do it 10,000 times. I wonder what kind of rejection rates you would get.

I think there are really two thoughts here, only one of which might be helpful. The first (and the one I think is more likely useful) is: instead of using simulated data, use real data. Brian Nosek (sorry Brian!) has tons of it on the OSF, and you could sample in any way you wanted (you could choose N, you could choose only from the control group, you could do whatever). Then you would have data that actually looked like (is) real world data with all the messy covariances and omitted variables.

The second is: even if you don’t know the decision tree that led to some particular sample and model being the “final” choices, all of that information is sort of already there for you, hidden in the data and the estimator choice. So I wonder if rejection rates on placebo assignment, given the actually used sample and estimator, can teach us something about the ability of the decision-making process to influence the properties of the analysis (in terms of frequentist properties of the procedures actually used in “Science”).

I suspect that if you just do whatever the researchers actually did, but assign treatment randomly and thus know the effect is 0, you would get rejection rates well above the specified alpha, and that would teach us something about how the decision-making process that leads to particular samples/estimators can affect inference. And then you could do it on the raw data using some robust estimator and see if you get something closer to the actual expected alpha.

But I also have a nagging feeling I’m missing some bit here… still, figured there was something useful in this thinking.
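The placebo-reassignment idea above can be sketched on fake "real-looking" data; the dataset, sample size, and the particular forking choices below are all invented stand-ins, not anyone's actual study. Treatment is assigned at random, so the true effect is exactly zero, and we compare the rejection rate of one preregistered test against a procedure that tries a few analyses and keeps the best-looking one.

```python
import math
import random

random.seed(4)

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def two_sample_p(x, y):
    """Large-sample z-test p-value for a difference in means."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return 2 * (1 - phi(abs(z)))

# Invented stand-in for a real dataset: an outcome driven partly by a covariate.
n = 100
covariate = [random.gauss(0, 1) for _ in range(n)]
outcome = [0.5 * c + random.gauss(0, 1) for c in covariate]

def placebo_p(forking):
    treated = set(random.sample(range(n), n // 2))  # placebo: true effect is 0
    def split_p(values):
        t = [v for i, v in enumerate(values) if i in treated]
        c = [v for i, v in enumerate(values) if i not in treated]
        return two_sample_p(t, c)
    if not forking:
        return split_p(outcome)  # the one preregistered analysis
    # Forking paths: raw outcome, a covariate-defined subgroup, and a
    # covariate-"adjusted" outcome; report the best result.
    subgroup = two_sample_p(
        [outcome[i] for i in range(n) if i in treated and covariate[i] > 0],
        [outcome[i] for i in range(n) if i not in treated and covariate[i] > 0])
    adjusted = [o - 0.5 * c for o, c in zip(outcome, covariate)]
    return min(split_p(outcome), subgroup, split_p(adjusted))

reps = 2000
reject_fixed = sum(placebo_p(False) < 0.05 for _ in range(reps)) / reps
reject_forked = sum(placebo_p(True) < 0.05 for _ in range(reps)) / reps
```

The preregistered test rejects close to the nominal 5% under placebo assignment; reporting the minimum of even three analyses inflates the rejection rate well above alpha, which is the kind of diagnostic the comment is proposing for actually-used samples and estimators.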

I guess the notation T(y;φ(y)) indicates essentially (to re-phrase it in modelling terms) that you have adjustable ‘structural’ parameters that you are estimating based on the data at hand?

So, while these *should* be ‘external/structural’ parameters that your ‘internal’ parameters are estimated conditional on, you are allowing your model structure to alter based on the data? This invalidates the estimates of the parameters of interest?

Presumably a hierarchical model relaxes the ‘fully fixed’ aspect and allows the structural parameters to vary somewhat, but makes the different levels explicit.

Alternatively, perhaps, you could say that once you allow all your parameters (structure/hypotheses included) to vary (or vary too much – e.g. priors are too weak) with the data at hand, you are working both ‘within a model’ and ‘within a dataset’ (since everything, structure included, now depends on the given data) and hence you give up any potential generalisability?

So…how do you interpret T(y;φ(y))? Also +1 for seeing some simulations.

It sounds like he’s taking two samples, doing a test of whether there is a “significant” difference between them, and then if yes doing separate estimates of something, and if no doing an estimate pooling everything together. In other words, the decision about how to “model” the data (as two separate samples or as one sample) is dependent on the data itself through the outcome of a test.

Exactly.

The weirdness comes from estimators having properties that depend rather heavily on the sample size (which, because of the setting, can’t really be avoided, nor can I share the results publicly).

More generally this is really old stuff; I likely wrote a simulation back in 1985/6 to investigate the properties of using a test of heterogeneity to decide whether to use fixed (completely pooled) or random effects (partially pooled) estimation in meta-analysis, to confirm it was a really bad idea. I was always a big fan of doing simulations, which infuriated my professors in Biostatistics, one of whom told me “a professional statistician should not need to stoop to doing simulations.”
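That test-then-pool procedure is straightforward to simulate; the numbers below (n = 20 per group, a true difference of 0.5, known unit variance) are invented for illustration, not the actual setting discussed in the thread. The point is that each branch of the estimator looks innocent, but the data-dependent choice between them, the T(y;φ(y)) pattern, makes the combined procedure biased.

```python
import math
import random

random.seed(5)

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n = 20
true_mean_y = 0.5  # group means: 0 for x, 0.5 for y (invented numbers)

def one_run():
    """Estimate the mean of group y, pooling with x when a difference
    test is not significant: a toy version of E.m(y; T(y))."""
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [random.gauss(true_mean_y, 1) for _ in range(n)]
    mx, my = sum(x) / n, sum(y) / n
    z = (my - mx) / math.sqrt(2 / n)    # known sigma = 1 for simplicity
    p = 2 * (1 - phi(abs(z)))
    pretest_est = my if p < 0.05 else (mx + my) / 2
    return pretest_est, my

reps = 20_000
runs = [one_run() for _ in range(reps)]
bias_pretest = sum(e for e, _ in runs) / reps - true_mean_y
bias_separate = sum(m for _, m in runs) / reps - true_mean_y
```

With these numbers the always-separate estimate is unbiased, while the test-then-pool estimate is pulled noticeably below the true value: non-significant runs get dragged toward the pooled mean, and the significant runs that remain are a selected, inflated subset. How large the distortion is shifts with n and the effect size, which is one way sample size sneaks into the estimator's properties.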

Animations are new and now easy with Shiny, I’ll try to put some examples together to share if I get time.

This is a fascinating post. Could someone please explain the consequences of following a similar procedure to that outlined in research path #3 in the post, but with analysis carried out in a Bayesian framework (say in BUGS), with reported credible intervals instead of p-values? The data would still be analysed in a different way depending on the specific circumstances of the study right? So even though there is no reference to a frequentist view in the philosophy the outcome of modelling would still be dependent on these factors? Is the issue one of not reporting all the parameter estimates you’ve found on the way to your final model? My brain can’t take this in the morning, please help me!!

I’ll try to explain my view on the differences in a few hours. The thing to remember is that the Bayesian analysis doesn’t depend on frequencies, it depends on model assumptions about what is and what isn’t plausible to occur under some structural model with “correct” values of parameters plugged in (and also, what you think about those parameters before you see data, in the prior)

The “goodness” of the “structural” model is the main question, and post-data you may be able to narrow down your structural model in a *helpful* way. whereas narrowing down your test to one you think should be relevant is *unhelpful* for the test’s ability to “check” or “falsify” your overall concept.

Ok, let’s recap #3 to start off:

“3. Researcher degrees of freedom without fishing: computing a single test based on the data, but in an environment where a different test would have been performed given different data; thus T(y;φ(y)), where the function φ(·) is observed in the observed case.”

First off, there’s nothing really Bayesian about *testing*, so let’s rewrite the Bayesian version as: “Decide on and fit a model in a case where the data you observe informs the model that you decide to fit.”

We’ve actually been over this before on the blog. As a Bayesian before you get data, you might have a variety of models you’d consider to describe your process. As a Bayesian you should be able to put some kind of prior over which of them is the best model. After seeing the data, you can be in a situation in which the data immediately makes the probability of several of these models negligible (it contains data that contradicts those models under pretty much any parameter values). At this point, it’s a valid approximation to just fit the models remaining. If one model stands out strongly as having highest posterior probability, it’s an approximation to simply fit that one model and truncate the others out of the analysis.

So, we’ve established that it can be consistent with Bayesian reasoning to choose your one model after you see your data, provided that you don’t have any other plausible models left once you see the data.

In the case where you do have plausible models left, it seems like the right thing to do is to either fit each of them separately, or try to write one big model where they’re embedded together and fit that. Not doing so is a kind of “cheating”, basically ignoring alternative explanations that you are aware of because you don’t like them.

So, it’s not so much about whether you fit the model in BUGS or Stan and whether you report credible intervals vs p values, it’s more an issue of:

Frequentist: whether you’re relying on your test to reject a bad idea for you, even though you chose which test based on what seemed interesting (ie. seemed like it couldn’t be rejected by your data) after seeing the data

Bayesian: whether you’re relying on the data to eliminate models you would have considered if you hadn’t seen the data, and whether the approximate update you did “in your head” to eliminate those models really does correspond to a proper Bayesian calculation, or you’re just fooling yourself because you “don’t like” the other models.
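The "data makes the probability of several models negligible" step can be sketched with two toy models for the same data; the models, priors, and closed-form marginal likelihood below are illustrative assumptions of mine, not anything from the thread. M0 says the observations are N(0,1); M1 says they are N(mu,1) with a N(0,1) prior on mu, which has a closed-form marginal likelihood because only the sample mean carries information about mu.

```python
import math
import random

random.seed(6)

def log_norm_pdf(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def log_marginal_m0(y):
    """Marginal likelihood of M0: y_i ~ N(0, 1), no free parameters."""
    return sum(log_norm_pdf(v, 0, 1) for v in y)

def log_marginal_m1(y):
    """Marginal likelihood of M1: y_i ~ N(mu, 1), prior mu ~ N(0, 1).
    Integrating mu out in closed form: ybar ~ N(0, 1 + 1/n) carries all
    the information about mu; the within-sample part factors out."""
    n = len(y)
    ybar = sum(y) / n
    return (sum(log_norm_pdf(v, ybar, 1) for v in y)
            + log_norm_pdf(ybar, 0, 1 + 1 / n)
            - log_norm_pdf(ybar, ybar, 1 / n))

def posterior_prob_m1(y):
    """Posterior probability of M1 under equal prior model probabilities."""
    log_bf = log_marginal_m1(y) - log_marginal_m0(y)
    return 1 / (1 + math.exp(-log_bf))

y_effect = [random.gauss(1.5, 1) for _ in range(30)]  # data with a clear effect
y_null = [random.gauss(0.0, 1) for _ in range(30)]    # data consistent with M0

post_effect = posterior_prob_m1(y_effect)
post_null = posterior_prob_m1(y_null)
```

In the first case the posterior probability of M0 is negligible, so fitting M1 alone is the valid truncation described above; in the second case both models retain non-trivial probability, which is exactly the situation where dropping one of them amounts to the "cheating" the comment warns about.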

Hi Daniel, that’s really helpful. Thanks for taking the time to explain, I appreciate it.

@Gelman

“I could care less about p-values but…”

I think you mean “I *couldn’t* care less about…” ?

That’s what “I could care less” means: “I couldn’t care less.”

Mind blown.

A relevant message from David Mitchell on behalf of the Queen about her English:

https://www.youtube.com/watch?v=om7O0MFkmpw

Shravan:

Yes, exactly. It’s an idiom in English, kinda like how “flammable” and “inflammable” mean the same thing.

I think you meant to address PABP ;).

Since you’re the linguist, maybe you’d have some info on this. I always assumed that “I could care less” came about because the pronunciation of “couldn’t care” is more complex than “could care” (it requires more complex tongue manipulation). So in essence it became “could’ care,” where the apostrophe indicates the listener is supposed to fill in the missing sounds, just as with the word “couldn’t” you’re supposed to fill in “could not.”

So when someone like PABP fails to comprehend it, it’s confusing for the speaker, because they basically fail to share a common agreement on how the language works.

Oh, I don’t think linguists are supposed to help you with practical issues.

That’s like asking a Computer Scientist for help when Windows won’t boot up.

As a guy with an undergrad CS minor, whenever someone asks me for help with windows not booting I am always happy to help. However, when I hand them a Debian install disk, somehow they rarely decide to accept my help…. oh well..

Actually, Rahul is right that linguists are useless with practical issues. But this is not a practical but a theoretical issue. It seems like an idiom to me (an Americanism). There are things called negative polarity items in language, like “I don’t have a red cent”, but if you remove the negation, the meaning changes to its literal meaning. Not so with “I could care less”. I’d say it’s just an idiomatic variant.

My impression is that “I could care less” started as “I could care less?”, implying in a sarcastic way that one couldn’t care less.

“I *could* care less if I spent some time thinking of something that I might care less about because it is hard to think of what it might be.”

Just to add mud, I think “I could care less” is a more American than British idiom – it looks and sounds a bit odd to British ears (at least mine) although fairly familiar. But who knows what the kids are saying these days?

It certainly does sound odd to this non-native English speaker.

There’s a lot on Language Log (linguists) on “couldn’t care less”. The most recent that I see is at http://languagelog.ldc.upenn.edu/nll/?p=21170 , which has links to the earlier discussions.

Coincidentally, three months ago a student asked the following question in AM 207 this semester:

“what does true randomness mean, and how is it known that something is truly random?”

to which I replied:

“To some extent it is not possible to tell whether a sequence of random numbers is “truly random”; however, a hypothesis test may help. In other words, if one posits that a sequence of numbers is generated by some null probability distribution, a test can be performed to asses whether this sequence of numbers is consistent with this distribution or not, where we may use a notion of extremity like a p-value.”

I have realized that upon deeper reflection on the true meaning of randomness and further assessment of the of this comment, I was not actually referring to a group of donkeys (in verb form).

It’s a really interesting question. Per Martin-Lof answered it in a similar way. Basically he shows that there exists some sort of “best” computable test, along the lines of an “ultra die-hard test” and then anything that passes this test we should consider to be a random sequence. As the sequence length goes to infinity, the number of possible sequences grows faster than the number that would be rejected… so virtually all long bit sequences are “random” ones by this definition

http://models.street-artists.org/2013/03/13/per-martin-lof-and-a-definition-of-random-sequences/

Here’s a link to the “die harder” tests, which are among the main modern ways to assess random number generator algorithms: http://www.phy.duke.edu/~rgb/General/dieharder.php

And here are the NIST tests: http://csrc.nist.gov/groups/ST/toolkit/rng/

so in practice, approximating Per Martin-Lof’s ultimate test is the route used to test RNGs
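For a flavor of what those test batteries actually do, here is a sketch of the simplest NIST-style check, the frequency (monobit) test: under the null that the generator emits i.i.d. fair bits, the standardized bit-sum is approximately standard normal, and the test p-value is erfc(|S|/sqrt(2n)). The bit counts and the 51%-biased comparison generator are my own illustrative choices; the real suites run dozens of such tests.

```python
import math
import random

random.seed(7)

def monobit_p(bits):
    """NIST SP 800-22 frequency (monobit) test: under the null that the
    bits are i.i.d. fair coin flips, the sum of +/-1 values divided by
    sqrt(n) is approximately N(0, 1); small p flags a non-random source."""
    n = len(bits)
    s = sum(1 if b else -1 for b in bits)
    return math.erfc(abs(s) / math.sqrt(2 * n))

# A decent generator versus a slightly biased one (51% ones).
good = [random.getrandbits(1) for _ in range(100_000)]
biased = [1 if random.random() < 0.51 else 0 for _ in range(100_000)]

p_good = monobit_p(good)
p_biased = monobit_p(biased)
```

Even a 1% bias is decisively rejected at this sequence length, while the unbiased stream typically produces an unremarkable p-value; failing any test in the battery is the practical stand-in for failing Martin-Lof's universal test.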

Cool, thanks for the links.

meaning of* (missing data)