Forgive me if I have mistaken this somewhere, I am not an excellent mathematician. However, I think the point can be described from a measure theoretic and philosophical standpoint as such:

Any hypothesis, null or not, is essentially an indicator random variable defined on the event space (the sigma algebra on the sample space)–or more indirectly on the topological space defined by the image of some random variable defined thereupon. Now, this is indeed how it should be since, from a philosophy of statistical science point of view, hypotheses are formulated as binary questions and data collection is essentially a plebiscite.

That said, by the very definition of a hypothesis as a measurable function from one measurable space to another, our hypothesis induces a probability measure as the push forward measure from the original–albeit typically unknown–measure. Thus, the hypothesis is, by the very axiomatization of probability theory, a random variable in its own right, and therefore can be thought of as a random number generator adhering to the distribution it permits.

If I might opine for a moment: the issue here is not that the hypothesis is a form of scientific model–for one could just as well take any indicator random variable and its corresponding push forward measure as a scientific model with varying degrees of predictive power–but with the interpretation of the p-value as some form of epistemic truth of this variable.

The p-value itself is essentially a random variable that decrees the probability of observing a value at least as extreme (for real-valued random variables) as that which is generated by the hypothesis you have chosen as the default–and the choice of this is essentially unbounded.

I might finally add, in addition, that there is some philosophical distinction between scientific and statistical models. One might make the existential argument that a scientific model have some sort of semantics encoded therein which represent, to the best of our senses, how the universe actually works.

One might, then, from a philosophical standpoint always choose the null hypothesis to be the currently most accepted / most highly predictive scientific model, and always compare new scientific hypotheses against it. That all said, the mathematics does not make a distinction in this respect; both must be formulated as measurable functions–regardless of how abstruse they are to those trying to understand them in human terms.

After all, you could build a Rube Goldberg machine of a hypothesis and still have it be statistically significant and even existentially true, but that doesn’t really help *humans* understand the universe any better, which, ostensibly, is the philosophical purpose of science and statistics.

Finally, to digress a bit: at some point, we may indeed have to forgo our desire for simplistic models, and indeed, with computational power increasing, it may be our computers coming up with and testing hypotheses all on their own. The mathematical language they use may be very different from ours, as a terabyte-long proof for them is just as intelligible as a two-sentence one is to us.

]]>Ok, here’s my new thoughts on the issue, thanks to ojm once again for pushing back and getting me closer to a well thought out position.

http://models.street-artists.org/2016/05/09/confusion-of-definitions-bayesian-vs-frequentist/

The TL;DR version. I think a lot of classical statistics is not inference at all, but rather communications engineering. It basically starts with a Bayesian model and whittles it down based on Frequentist criteria until it makes for a good compression scheme.

That’s basically what’s going on in the example paper at PLoS One on bacterial cultures. They start with a Bayesian model based on an assumed likelihood, and remove stuff from the likelihood using Frequentist tests effectively on a guiding principle of “reduced model size” and then maximize the final likelihood (Bayesian) to produce a model that would be a good starting point for a channel encoding scheme to send a whole bunch of data of the same type if they were to collect it (ie. an assumption of a stable process).

]]>ojm. I’d be happy to discuss the issue with you over at my blog where it might be less tiresome :-) I’ll put up a post later today.

]]>There’s a lot on Language Log (linguists) on “couldn’t care less”. The most recent that I see is at http://languagelog.ldc.upenn.edu/nll/?p=21170 , which has links to the earlier discussions.

]]>“if you’re choosing a model based on what you know not what the data actually can prove to you through frequencies, you’re doing a Bayesian calculation”

The arguments are probably getting tiresome but this is simply not true!

]]>Rahul, I don’t have any links for you, and when I try to search I find methods papers because in methods papers they’re very explicit about their use of maximum likelihood, whereas in applied papers people just tend to do stuff and not give a big fanfare, but just go into your favorite journals and find:

1) Anyone doing least squares fits either to fit simple lines or low order polynomials in Excel, or something a little more sophisticated (maybe a theoretically pre-determined family of curves), or to find coefficients for an ODE. Least squares = maximum likelihood for a normal error model. Almost never will you find anyone “validating” the normal error model with something like a Shapiro-Wilk test; it’s an assumption, and it’s a Bayesian one. Even if they validate it, it then falls into the “both Bayesian and Frequentist” category.

2) Anyone explicitly fitting maximum likelihood models. Where did they get their likelihood? it’s usually either a default like “these events are rare and can not occur at the same time, so they have a poisson distribution with unknown parameter lambda”, that’s Bayesian. Or they say “the variance was sometimes bigger than the mean, so we used a negative binomial model” again, Bayesian, there’s no reason to think there’s a collective of stuff in the world that has a negative binomial distribution that you’re sampling from.
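To make point 1) concrete, here is a minimal sketch (the data, the noise level `sigma`, and all function names are made up for illustration) showing that the least squares line is also the maximizer of a normal-error log-likelihood, since that log-likelihood is just a constant minus SSE/(2σ²):

```python
import math
import random

def ols_line(xs, ys):
    """Closed-form least squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def sse(a, b, xs, ys):
    """Sum of squared residuals for the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def normal_loglik(a, b, sigma, xs, ys):
    """Log-likelihood under y_i ~ normal(a + b*x_i, sigma)."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sse(a, b, xs, ys) / (2 * sigma ** 2))

# Hypothetical data: a straight line plus normal noise.
rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.5) for x in xs]

sigma = 0.5  # assumed known here; any fixed sigma gives the same argmax
a_hat, b_hat = ols_line(xs, ys)
best = normal_loglik(a_hat, b_hat, sigma, xs, ys)

# Perturb the least squares coefficients: the normal log-likelihood only drops.
perturbed = [normal_loglik(a_hat + da, b_hat + db, sigma, xs, ys)
             for da in (-0.1, 0.0, 0.1) for db in (-0.1, 0.0, 0.1)
             if (da, db) != (0.0, 0.0)]
print(a_hat, b_hat, all(v < best for v in perturbed))
```

Nothing in this calculation checks whether the residuals actually have a normal frequency distribution; the normal error model enters purely as an assumption.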

Here’s an example I finally found in PLoS One that is all full of “tests” and uses terrible logic (such as stepwise exclusion of parameters based on chi-squared fits or something) but at heart it’s a Bayesian model, they chose a likelihood for the data based on negative binomial because it had the property that it had a separate mean and variance, instead of poisson which has them equal. They do a generalized linear model fit via maximum likelihood based on this assumed likelihood…

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0091105

This stuff is rampant, and it’s Bayesian. if you’re choosing a model based on what you know not what the data actually can prove to you through frequencies, you’re doing a Bayesian calculation.

]]>Cool, thanks for the links.

]]>@Daniel

To illustrate the point, can you link to applied studies (not methods papers) that you’d say are Bayesian but have no use / mention of any explicit prior?

]]>It certainly does sound odd to this non-native English speaker.

]]>Rahul. No I don’t think so AT ALL. Anyone who picks a likelihood without doing validation of the Frequency distribution and then fits maximum likelihood has the following questions to ask themselves:

By what logic does this calculation make any sense? And, what has to hold in order for it to perform well?

The answers to those things are going to be that the logic that makes it work is the Cox axioms, and it will work well when the chosen likelihood puts the data into the high probability region when the parameters have their proper values. In other words, it’s a Bayesian calculation. The fact that it has flat nonstandard priors doesn’t mean that it’s not inherently a Bayesian calculation.

Suppose that instead of ML estimation stats textbooks taught that you should always choose normal(0,10^360000) priors for unknown parameters and then do a Bayesian calculation. Would it fail to be Bayesian? Would it produce any different results than an ML calculation in any real-world problem? What would be the “frequentist” logic behind it?
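As a rough numerical illustration of that thought experiment, the sketch below uses the conjugate normal model with known `sigma`. The prior width `1e150` is only a stand-in, because 10^360000 does not fit in a float; the data and function name are hypothetical.

```python
import math
import random

def posterior_mean_normal(xs, sigma, prior_mean, prior_sd):
    """Posterior mean of mu for x_i ~ normal(mu, sigma), mu ~ normal(prior_mean, prior_sd)."""
    data_precision = len(xs) / sigma ** 2
    prior_precision = 1.0 / prior_sd ** 2
    weighted_sum = prior_precision * prior_mean + sum(xs) / sigma ** 2
    return weighted_sum / (prior_precision + data_precision)

rng = random.Random(42)
sigma = 2.0
xs = [rng.gauss(3.0, sigma) for _ in range(30)]  # hypothetical measurements

mle = sum(xs) / len(xs)                                   # maximum likelihood estimate
wide = posterior_mean_normal(xs, sigma, 0.0, 1e150)       # absurdly wide prior
tight = posterior_mean_normal(xs, sigma, 0.0, 1.0)        # informative prior, for contrast

print(mle, wide, tight)
print(math.isclose(mle, wide, rel_tol=1e-12))
```

With the wide prior the posterior mean agrees with the ML estimate to floating point precision, while the informative prior visibly shrinks the estimate toward zero.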

]]>Just to add mud, I think “I could care less” is a more American than British idiom – it looks and sounds a bit odd to British ears (at least mine) although fairly familiar. But who knows what the kids are saying these days?

]]>Here’s a link to the “die harder” tests which are one of the main modern ways to assess random number generator algorithms: http://www.phy.duke.edu/~rgb/General/dieharder.php

And here are the NIST tests: http://csrc.nist.gov/groups/ST/toolkit/rng/

so in practice, approximating Per Martin-Lof’s ultimate test is the route used to test RNGs
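For flavor, here is a sketch of the simplest test in the NIST suite, the frequency (monobit) test. The function name is my own; the statistic and the erfc-based p-value follow the description in NIST SP 800-22, where p < 0.01 is treated as evidence of non-randomness.

```python
import math
import random

def monobit_p_value(bits):
    """Frequency (monobit) test: are ones and zeros roughly balanced?"""
    n = len(bits)
    s = sum(1 if b else -1 for b in bits)   # map 1 -> +1, 0 -> -1 and add up
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))

rng = random.Random(7)
random_bits = [rng.getrandbits(1) for _ in range(1000)]
all_ones = [1] * 1000
alternating = [0, 1] * 500

print(monobit_p_value(random_bits))   # depends on the generator and seed
print(monobit_p_value(all_ones))      # essentially zero: rejected
print(monobit_p_value(alternating))   # passes, although it is obviously not random
```

The alternating sequence passing is the reason real test suites run many different tests: each one only rejects a particular kind of regularity.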

]]>It’s a really interesting question. Per Martin-Lof answered it in a similar way. Basically he shows that there exists some sort of “best” computable test, along the lines of an “ultra die-hard test” and then anything that passes this test we should consider to be a random sequence. As the sequence length goes to infinity, the number of possible sequences grows faster than the number that would be rejected… so virtually all long bit sequences are “random” ones by this definition

http://models.street-artists.org/2013/03/13/per-martin-lof-and-a-definition-of-random-sequences/

]]>>>>To me, A frequentist is someone who interprets the meaning of probability as solely “frequency in repeated sampling”. At least when I use the term that’s what I mean.<<<

Sounds like a straw-man to me. Are there many who stick to that dogmatic position outside of those that relish the philosophical debate?

In practical contexts, isn't the presence of an explicit, considered prior a good working definition to split the body of applied work into Bayesian vs Frequentist approaches?

]]>I guess you are trying to specify “data cleaning and model choosing” procedures, and that is the hard part, yeah? Getting some set of decisions that mimics the way researchers work sufficiently well that you can get at the Garden?

I have this not-fully-fleshed-out thought that maybe there is a “sufficient statistic” type of thing here. Suppose that the Garden-type decisions people make are independent of treatment status – that is, they look at distributions and moments and whatever in the data when they decided how to analyze it, but these were not related to treatment assignment (so blind to treatment or only using the control group).

Then you take that (real) data, and randomly assign treatment and run the exact same model the authors ran, on whatever interactions and covariates they employ. Then you do it 10,000 times. I wonder what kind of rejection rates you would get.

I think there are really two thoughts here, only one of which might be helpful. The first (and the one I think is more likely useful) is: instead of using simulated data, use real data. Brian Nosek (sorry Brian!) has tons of it on the OSF, and you could sample in any way you wanted (you could choose N, you could choose only from the control group, you could do whatever). Then you would have data that actually looked like (is) real world data with all the messy covariances and omitted variables.

The second is: even if you don’t know the decision tree that led to some particular sample and model being the “final” choices, all of that information is sort of already there for you, hidden in the data and the estimator choice. So I wonder if rejection rates on placebo assignment, given the actually used sample and estimator, can teach us something about the ability of the decision-making process to influence the properties of the analysis (in terms of frequentist properties of the procedures actually used in “Science”).

I suspect that if you just do whatever the researchers actually did, but assign treatment randomly and thus know the effect is 0, you would get rejection rates well above the specified alpha, and that would teach us something about how the decision-making process that leads to particular samples/estimators can affect inference. And then you could do it on the raw data using some robust estimator and see if you get something closer to the actual expected alpha.
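A minimal sketch of that placebo exercise, under loud assumptions: the outcome vector `y` is simulated here as a stand-in for real data, and `forking` is a hypothetical analysis that reports the best of a full-sample test and two subgroup tests. Treatment is re-assigned at random, so the true effect is zero by construction.

```python
import math
import random

def mean_var(values):
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)

def p_value(y, treat):
    """Two-sided p-value for a difference in means (normal approximation)."""
    t = [yi for yi, ti in zip(y, treat) if ti]
    c = [yi for yi, ti in zip(y, treat) if not ti]
    mt, vt = mean_var(t)
    mc, vc = mean_var(c)
    z = (mt - mc) / math.sqrt(vt / len(t) + vc / len(c))
    return math.erfc(abs(z) / math.sqrt(2))

def prespecified(y, group, treat):
    """One test, chosen before looking at anything."""
    return p_value(y, treat)

def forking(y, group, treat):
    """Hypothetical forking path: report the smallest of three p-values."""
    ps = [p_value(y, treat)]
    for g in (0, 1):
        keep = [i for i, gi in enumerate(group) if gi == g]
        ps.append(p_value([y[i] for i in keep], [treat[i] for i in keep]))
    return min(ps)

def placebo_rejection_rate(analysis, y, group, reps=1000, alpha=0.05, seed=0):
    """Share of random (placebo) treatment assignments that reach p < alpha."""
    rng = random.Random(seed)
    labels = [1] * (len(y) // 2) + [0] * (len(y) - len(y) // 2)
    rejections = 0
    for _ in range(reps):
        rng.shuffle(labels)
        rejections += analysis(y, group, labels) < alpha
    return rejections / reps

data_rng = random.Random(1)
y = [data_rng.gauss(0, 1) for _ in range(200)]   # stand-in for real outcomes
group = [i % 2 for i in range(200)]              # stand-in covariate

honest_rate = placebo_rejection_rate(prespecified, y, group)
forking_rate = placebo_rejection_rate(forking, y, group)
print(honest_rate, forking_rate)
```

Because both runs use the same seed, they see the same placebo assignments, so the forking analysis rejects at least as often as the pre-specified one. With real data and the researchers' actual procedure in place of `forking`, the gap between the two rates would be the quantity of interest.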

But I also have a nagging feeling I’m missing some bit here… still, figured there was something useful in this thinking.

]]>I have realized that upon deeper reflection on the true meaning of randomness and further assessment of this comment, I was not actually referring to a group of donkeys (in verb form).

]]>“what does true randomness mean, and how is it known that something is truly random?”

to which I replied:

“To some extent it is not possible to tell whether a sequence of random numbers is “truly random”; however, a hypothesis test may help. In other words, if one posits that a sequence of numbers is generated by some null probability distribution, a test can be performed to asses whether this sequence of numbers is consistent with this distribution or not, where we may use a notion of extremity like a p-value.”

]]>“I *could* care less if I spent some time thinking of something that I might care less about because it is hard to think of what it might be.”

]]>My impression is that “I could care less” started as “I could care less?”, implying in a sarcastic way that one couldn’t care less.

]]>A relevant message from David Mitchell on behalf of the Queen about her English:

https://www.youtube.com/watch?v=om7O0MFkmpw

Hi Daniel, that’s really helpful. Thanks for taking the time to explain, I appreciate it.

]]>Yes, but ignoring the issue and moving forward with a rote analysis which then gives you a tiny standard error that you publish which then, a few years later, is shown in replication with a broader more representative sample to have been a complete and utter joke… that’s not funny!

]]>Actually, Rahul is right that linguists are useless with practical issues. But this is not a practical but a theoretical issue. It seems like an idiom to me (an Americanism). There are things called negative polarity items in language, like “I don’t have a red cent”, but if you remove the negation, the meaning changes to its literal meaning. Not so with “I could care less”. I’d say it’s just an idiomatic variant.

]]>Well, like I always say to my undergrads when I talk about random sampling, at least with random sampling there is some math that says on average it’s going to be a good estimate, even if you can’t know whether yours is one of the really bad samples. If you survey people who walk by you, you have no idea.

]]>As a guy with an undergrad CS minor, whenever someone asks me for help with windows not booting I am always happy to help. However, when I hand them a Debian install disk, somehow they rarely decide to accept my help…. oh well..

]]>Oh, I don’t think linguists are supposed to help you with practical issues.

That’s like asking a Computer Scientist for help when Windows won’t boot up.

]]>Since you’re the linguist maybe you’d have some info on this. I always assumed that “I could care less” came about because the pronunciation of “couldn’t care” is more complex than “could care” (it requires more complex tongue manipulation). So in essence it became “could’ care” where the apostrophe indicates the listener is supposed to fill in the missing sounds just like the word “couldn’t” you’re supposed to fill in “could not”

So when someone like PABP fails to comprehend it it’s confusing for the speaker because they basically fail to share a common agreement on how the language works.

]]>Ok, let’s recap #3 to start off:

“3. Researcher degrees of freedom without fishing: computing a single test based on the data, but in an environment where a different test would have been performed given different data; thus T(y;φ(y)), where the function φ(·) is observed in the observed case.”

First off, there’s nothing really Bayesian about *testing* so, let’s rewrite the Bayesian version as: “Decide on and fit a model in a case where the data you observe informs the model that you decide to fit”

We’ve actually been over this before on the blog. As a Bayesian before you get data, you might have a variety of models you’d consider to describe your process. As a Bayesian you should be able to put some kind of prior over which of them is the best model. After seeing the data, you can be in a situation in which the data immediately makes the probability of several of these models negligible (it contains data that contradicts those models under pretty much any parameter values). At this point, it’s a valid approximation to just fit the models remaining. If one model stands out strongly as having highest posterior probability, it’s an approximation to simply fit that one model and truncate the others out of the analysis.

So, we’ve established that it can be consistent with Bayesian reasoning to choose your one model after you see your data, provided that you don’t have any other plausible models left once you see the data.
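Here is a small sketch of that reasoning, with everything in it assumed for illustration: three candidate models, a uniform prior over a parameter grid within each model, equal prior model probabilities, and made-up data. The uniform model is restricted to theta <= 1, so a single observation above 1 rules it out under every parameter value.

```python
import math

def log_marginal(loglik, grid, data):
    """Log of the likelihood averaged over a uniform prior on a parameter grid."""
    per_param = [sum(loglik(x, p) for x in data) for p in grid]
    top = max(per_param)
    if top == -math.inf:
        return -math.inf   # the data are impossible under every parameter value
    return top + math.log(sum(math.exp(v - top) for v in per_param) / len(per_param))

def uniform_ll(x, theta):
    return -math.log(theta) if 0.0 <= x <= theta else -math.inf

def exponential_ll(x, rate):
    return math.log(rate) - rate * x if x >= 0.0 else -math.inf

def normal_ll(x, mu):
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2

models = {
    "uniform(0, theta), theta <= 1": (uniform_ll, [i / 10 for i in range(1, 11)]),
    "exponential(rate)": (exponential_ll, [i / 10 for i in range(1, 31)]),
    "normal(mu, 1)": (normal_ll, [i / 10 for i in range(-30, 31)]),
}
prior = {name: 1 / len(models) for name in models}

data = [0.3, 1.7, 0.2, 2.9, 0.6, 0.1, 1.1, 0.4]   # made-up observations

log_m = {name: log_marginal(ll, grid, data) for name, (ll, grid) in models.items()}
top = max(log_m.values())
weights = {name: prior[name] * math.exp(log_m[name] - top) for name in models}
total = sum(weights.values())
posterior = {name: w / total for name, w in weights.items()}

for name, prob in posterior.items():
    print(f"{name}: {prob:.4f}")
```

Dropping the uniform model from the analysis after seeing this data changes nothing, which is the sense in which choosing models post-data can be a valid approximation to the full calculation.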

In the case where you do have plausible models left, it seems like the right thing to do is to either fit each of them separately, or try to write one big model where they’re embedded together and fit that. Not doing so is a kind of “cheating”, basically ignoring alternative explanations that you are aware of because you don’t like them.

So, it’s not so much about whether you fit the model in BUGS or Stan and whether you report credible intervals vs p values, it’s more an issue of:

Frequentist: whether you’re relying on your test to reject a bad idea for you, even though you chose which test based on what seemed interesting (ie. seemed like it couldn’t be rejected by your data) after seeing the data

Bayesian: whether you’re relying on the data to eliminate models you would have considered if you hadn’t seen the data, and whether the approximate update you did “in your head” to eliminate those models really does correspond to a proper Bayesian calculation, or you’re just fooling yourself because you “don’t like” the other models.

]]>I think you meant to address PABP ;).

]]>Shravan:

Yes, exactly. It’s an idiom in English, kinda like how “flammable” and “inflammable” mean the same thing.

]]>I’ll try to explain my view on the differences in a few hours. The thing to remember is that the Bayesian analysis doesn’t depend on frequencies, it depends on model assumptions about what is and what isn’t plausible to occur under some structural model with “correct” values of parameters plugged in (and also, what you think about those parameters before you see data, in the prior)

The “goodness” of the “structural” model is the main question, and post-data you may be able to narrow down your structural model in a *helpful* way, whereas narrowing down your test to one you think should be relevant is *unhelpful* for the test’s ability to “check” or “falsify” your overall concept.

]]>Mind blown.

]]>That’s what I could care less means; I couldn’t care less.

]]>“I could care less about p-values but…”

I think you mean “I *couldn’t* care less about…” ?

]]>Yep, the challenge being whether φ(y) can be made explicit enough to program.

If not, then the properties of T(y;φ(y)) if not undefined will be unknown.

(Now if I insist φ(y) must be continuous, someone will find publishable “statistical” paradoxes.)

Perhaps the differential value of simulation in understanding statistical issues for those writing the simulations (or reviewing the code) is the need for explicit considerations. For instance, in the Monty Hall problem, you need instructions for which door Monty opens. That’s key to understanding the puzzle of what it means for something to be informative (it changes the prior) and how that varies (it depends on how the door is chosen). Now, is someone coding the simulation likely to do more than just automatically make the choice random and equal, without thinking about or trying some variations? Good mathematicians will quickly identify an implicit assumption about the choice of door, and better ones will investigate varying that.
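A sketch of the kind of variation meant here, with my own parameterization: the player always picks door 0, and when the car is behind door 0 the host opens door 1 with probability `q` (the textbook version has q = 0.5). Conditional on the host opening door 2, switching wins with probability 1/(2 - q), so what the opened door tells you depends on the host's rule.

```python
import random

def switch_win_rate_given_opened(q, opened, trials, rng):
    """Estimate P(switching wins | host opened door `opened`), player holds door 0."""
    wins = seen = 0
    for _ in range(trials):
        car = rng.randrange(3)
        if car == 0:
            # Host has a free choice between doors 1 and 2.
            host = 1 if rng.random() < q else 2
        else:
            # Host must open the only remaining goat door.
            host = 2 if car == 1 else 1
        if host != opened:
            continue
        seen += 1
        wins += car != 0   # switching wins exactly when the first pick was a goat
    return wins / seen

rng = random.Random(11)
fair = switch_win_rate_given_opened(0.5, 2, 60000, rng)        # theory: 2/3
prefers_2 = switch_win_rate_given_opened(0.0, 2, 60000, rng)   # theory: 1/2
avoids_2 = switch_win_rate_given_opened(1.0, 2, 60000, rng)    # theory: 1
print(fair, prefers_2, avoids_2)
```

Averaged over both possible opened doors, switching still wins two thirds of the time for any `q`; only the conditional answer moves.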

For those watching the simulations, some may and likely most won’t get much out of it.

To _me_ this animation of two stage simulation should provide most if not all understanding of Bayesian analyses for those who are not trying to be their own statisticians but want to understand Bayesian analyses https://galtonbayesianmachine.shinyapps.io/GaltonBayesianMachine/

Most don’t seem to get much out of it.

Richard McElreath has worked much harder along a similar approach https://www.youtube.com/playlist?list=PLDcUM9US4XdMdZOhJWJJD4mDBMnbTWw_z and has tried using the same Galton two stage quincunx.

]]>Exactly.

The weirdness comes from estimators having properties that depend rather heavily on the sample size (which because of the setting can’t really be avoided nor can I share them publicly).

More generally this is really old stuff, I likely wrote a simulation to investigate the properties of using a test of heterogeneity to decide on whether to use fixed (completely pooled) or random effects (partially pooled) estimation in meta-analysis to confirm it was a really bad idea back in 1985/6. I was always a big fan of doing simulations which infuriated my professors in Biostatistics, one of whom told me “a professional statistician should not need to stoop to doing simulations.”

Animations are new and now easy with Shiny, I’ll try to put some examples together to share if I get time.

]]>No idea how to say it in German (I wouldn’t even know how to say it in Hindi or Japanese or French, the other languages I speak). Wahrscheinlichkeitsintegraltransformation is my guess.

]]>Exactly, I think the “It’s not possible to set up a sensible null hypothesis (in terms of random sampling from a population)” is actually the *USUAL* case or at least a really common case in actual practice. For example in Himmicanes and Mechanical Turk surveys of women’s shirt colors and menstrual cycles, and a whole bunch of Biology and cancer and medical research and etc etc. In almost every case where people have these weird claims, they’re improperly implicitly assuming that there is some well defined population that they’re thinking of which their sample is a well defined random sample from.

“a frequentist is just someone who refuses to explicitly assign priors to hypotheses” is not a good definition I think, and I certainly don’t want to confuse anyone into thinking that is the definition I’m using.

To me, A frequentist is someone who interprets the meaning of probability as solely “frequency in repeated sampling”. At least when I use the term that’s what I mean.

In fact, almost any likelihood calculation is a bayesian calculation. The flat prior that corresponds to MLE I interpret as a nonstandard object with infinitesimal density over a nonstandard length. It results in a proper posterior, its interpretation is Bayesian, it obeys the sum rule (the integral over the nonstandard domain is 1 etc). It’s no more illegitimate than the definition of the integral: int(f(x)dx,a,b) = st(sum(f(a+(b-a)/N*i) * dx,i=1,N)) for N nonstandard. There is no sense in which “how often” something happens is used in that calculation unless you explicitly check whether the likelihood assigned fits the generated data using statistical tests, which is typically never done, you couldn’t, it would take a huge amount of data for you to verify the frequency properties of the distribution, and when you did that you wouldn’t need to do likelihood calcs, you’d just calculate the sample statistic and have a tiny standard error.

I suspect an AIC penalization is really a choice of likelihood as a mixture over different likelihoods with prior mixture probability assigned to different models that happen to be submodels of each other but I haven’t looked carefully there.

In any case it’s my belief that if you do a likelihood based calculation and you haven’t got a massive dataset to validate the shape of the likelihood function as a proper frequency distribution, you’re just pretending to not be a Bayesian.

]]>Well, sometimes it’s not possible to set up a sensible null hypothesis, as in your (3), perhaps. I’m not sure what the null hypothesis is supposed to be there.

As for your comments on frequentism. To me, a “frequentist” is just someone who refuses to explicitly assign priors to hypotheses. As you note, frequentist methods can often be understood as implicitly assigning a kind of “prior,” but those priors are often not genuine Bayesian priors or are priors only in a mathematical sense. For example, the flat prior that corresponds to MLE is improper (and therefore does not really obey the sum rule); a prior corresponding to AIC penalization makes no Bayesian sense since it assigns a lower probability to a model than to a submodel, even though the model has to be true if the submodel is; shrinkage estimators can be mimicked with “empirical Bayes” priors, but those are priors that are constructed from the data and are therefore not legitimate Bayesian priors; etc.

]]>What I’m trying to say is that it (might) help to think of the frequentist approach as a method of *analysis* of a given model or method and the bayesian approach as a way of *constructing* a model.

Think Box, Gelman etc and posterior predictive checks.

Not to say that there aren’t Bayesian interpretations of frequentist analyses and frequentist interpretations of bayesian models, which I think there often are. I just think of this as the ‘general orientation’ of each approach.

]]>also, you might say “well a frequentist could create a model of this more complex sampling process” and while that’s true, they *can not make a probability model of it*

specifically: they can’t put probability distributions over the possible frequency distributions of the whole population, and they can’t put probability distributions over the possible ways in which the sampling deviates from IID.

why? because there’s no sense in which the actual shape of the population distribution is a sample from some meta-distribution of “populations” in alternate universes or whatever… also you could imagine that the child’s effect on the sampling process is variable in some way or another, but you have only one realization of this effect, and no knowledge of the real population distribution. You could imagine for example that the child’s effect could be equivalent to one of say several hundred possible computer programs, but you can’t put a “prior” over those computer programs, so there’s no sense in which you can give a probabilistic statement here.

In the end, the Frequentist is really left with the following statement:

“If you tell me your guess for what the full population looks like from which the 20 numbers were drawn, and you tell me your guess for the computer program that emulates the child, I will show you how to calculate some information about the 20 numbers”

This maps pretty directly to:

“If you tell me the true population distribution of cholesterol reductions across the whole world, and you give me a computer program that simulates the process of “being in new york and choosing to go to the doctor at clinic X” and a computer program that simulates the process of “being in Hamburg and choosing to go to the doctor” then I will give you frequency guarantees on the differences you’d see between new york and hamburg when you try this process over and over”

]]>It sounds like he’s taking two samples, doing a test of whether there is a “significant” difference between them, and then if yes doing separate estimates of something, and if no doing an estimate pooling everything together. In other words, the decision about how to “model” the data (as two separate samples or as one sample) is dependent on the data itself through the outcome of a test.
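A toy version of that procedure can be simulated directly. Everything below is an assumption for illustration: two normal samples with known unit variance, a z-test at the 5% level to decide between pooling and not pooling, and a nominal 95% interval for the first group's mean computed as if the chosen model had been fixed in advance.

```python
import math
import random

def coverage(mu1, mu2, n, reps, rng, pretest=True):
    """Share of nominal 95% intervals for mu1 that actually contain mu1."""
    z = 1.96
    hits = 0
    for _ in range(reps):
        a = [rng.gauss(mu1, 1.0) for _ in range(n)]
        b = [rng.gauss(mu2, 1.0) for _ in range(n)]
        mean_a, mean_b = sum(a) / n, sum(b) / n
        significant = abs(mean_a - mean_b) / math.sqrt(2.0 / n) > z
        if pretest and not significant:
            # "No difference": pool both samples into one estimate.
            estimate, se = (mean_a + mean_b) / 2, 1.0 / math.sqrt(2 * n)
        else:
            # "Difference" (or no pretest at all): use sample 1 alone.
            estimate, se = mean_a, 1.0 / math.sqrt(n)
        hits += abs(estimate - mu1) <= z * se
    return hits / reps

rng = random.Random(5)
with_pretest = coverage(0.4, 0.0, 25, 5000, rng, pretest=True)
without_pretest = coverage(0.4, 0.0, 25, 5000, rng, pretest=False)
print(with_pretest, without_pretest)
```

With a modest true difference the test usually fails to reject, the biased pooled estimate gets reported with a too-narrow interval, and coverage falls well short of the nominal 95%.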

]]>@ojm. Sure it’s fine to say that you’re only modeling the distribution of the test statistic, but you’re still making a VERY STRONG assumption to say that your data occurs as a random iid sample from some population.

Let’s just give an example that gives the flavor. R(i) is an MCMC sampler from probability density p(x) it returns a sequence of samples.

Now, p(x) is “hard” to sample from for this sampler. It does, however, run many cycles very quickly. So you grab R(i) for i = 1..100. Internally it does 10,000 jumps and only reports every 100’th, so there isn’t much auto-correlation… but it never gets outside of some local maximum which is nowhere near the global maximum… Let’s say the time to transition from the initial conditions into the “main probability mass” is 1e24 jumps, so taking any reasonable number of additional samples doesn’t help.

A simpler concept: take 20 random samples from a strange looking distribution. Write them down and hand them to a 5 year old. Ask them which numbers they “like the best” and have them pick out 10 of them. Then sample with replacement from the rest and do your frequentist statistics on the sample. How close will your predictions be for the averages and standard deviations compared to sampling with replacement from the 20? It depends a LOT on what “my 5 year old likes the best” which is not the kind of fact I expect to be implied by “frequency guarantees”
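Here is one way to play with this. The rule `likes_big_numbers` is a purely hypothetical stand-in for the child (she keeps the 10 largest numbers), and the lognormal is just an arbitrary "strange looking" distribution.

```python
import random

def likes_big_numbers(numbers, k):
    """Hypothetical child: her favourites are simply the k largest numbers."""
    ranked = sorted(numbers)
    return ranked[-k:], ranked[:-k]   # (favourites, rest)

def bootstrap_mean_interval(values, rng, resamples=2000):
    """Percentile bootstrap 95% interval for the mean of `values`."""
    means = sorted(
        sum(rng.choice(values) for _ in values) / len(values)
        for _ in range(resamples)
    )
    return means[int(0.025 * resamples)], means[int(0.975 * resamples) - 1]

rng = random.Random(3)
twenty = [rng.lognormvariate(0, 1) for _ in range(20)]
favourites, rest = likes_big_numbers(twenty, 10)

mean_all = sum(twenty) / len(twenty)
mean_rest = sum(rest) / len(rest)
low, high = bootstrap_mean_interval(rest, rng)

print("mean of all 20:", mean_all)
print("mean of the 10 left over:", mean_rest)
print("bootstrap interval from the 10:", (low, high))
```

Under this rule the bootstrap interval can never reach above the largest leftover number, so it will typically sit below the mean of the original 20. Swap in a different rule for the child and the answer changes, which is the point: the interval's behaviour is determined by the selection rule, not by the resampling.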

This is the same problem as saying “we took a representative sample of people in NY who came into the Foo clinic and gave them drug X, it was successful at reducing cholesterol by factor y, therefore we can give it to the whole population of people and it will on average produce reduction about equal to y in cholesterol (p < 0.05) if we repeat this procedure hundreds of times we're guaranteed to be right 95% of the time".

The extrapolation to the larger population is automatic when you really do randomly sample the full well-defined population.

But the use of sampling concepts in probability is totally SPURIOUS to get estimates and standard errors on this other problem where you have just some data and no way to define a specific population and a specific method of sampling.

I'm not saying Bayesians can solve all these problems, but they at least have a clear path to attempt to tackle them, because they can build a model for "what it means for a 5 year old to like a number" or "how hard is it to move from the initial conditions into the main probability mass" or whatever.

No amount of bootstrapping a confidence interval on the 10 numbers will help you deal with the fact that a 5 year old picked her favorites first.

]]>I guess the notation T(y;φ(y)) indicates essentially (to re-phrase it in modelling terms) that you have adjustable ‘structural’ parameters that you are estimating based on the data at hand?

So, while these *should* be ‘external/structural’ parameters that your ‘internal’ parameters are estimated conditional on, you are allowing your model structure to alter based on the data? This invalidates the estimates of the parameters of interest?

Presumably a hierarchical model relaxes the ‘fully fixed’ aspect and allows the structural parameters to vary somewhat, but makes the different levels explicit.

Alternatively, perhaps, you could say that once you allow all your parameters (structure/hypotheses included) to vary (or vary too much – e.g. priors are too weak) with the data at hand, you are working both ‘within a model’ and ‘within a dataset’ (since everything, structure included, now depends on the given data) and hence you give up any potential generalisability?

So…how do you interpret T(y;φ(y))? Also +1 for seeing some simulations.

]]>Simulating the Garden of Forking Paths?

]]>Would love to see some animations. Any chance you’ll put it on the web somewhere?

]]>probabilitieintegraltransformierung?

]]>@daniel this is roughly what I meant the other day by saying a frequentist analysis of a given full model typically makes weaker assumptions on the truth of the model. Since you’re familiar with PDEs think: weak solution of a PDE as satisfying an integral condition on the original PDE and thus not requiring the PDE to be exactly satisfied at all points.

]]>