* or you got very unlucky

]]>Michael, I agree with you! The problem with Frequentist statistics isn’t with the p-values. They mean what they mean; it’s a true statement that “you’d rarely get as large an effect of green jelly beans on acne as you saw if you sampled from such-and-such a random number generator”…

The problem is in the interpretation. The correct logical statement from that is:

“So, either you aren’t sampling from a random number generator, or you are and it doesn’t have the properties you used in your test.”

Since we know you aren’t sampling from an RNG to begin with… well…

]]>Yes, and there is nothing wrong with that. Decisions about what to study have to be decisions by thoughtful minds, not mindless statistics. If Alice is thoughtless enough to assume that my advice is a recipe to be followed ad infinitum then that is her stupidity, not a flaw in the advice.

]]>> she did find that there is reason to suppose that the green jelly beans had an effect and so she should design an experiment to test just that hypothesis.

Is that really the answer that you want to give to the “how should she proceed” question in this (literally) cartoonish but unfortunately all-too-real example?

Say that she does a larger “green jelly beans” study. Sadly she cannot find evidence for the effect of green jelly beans anymore…

But looking at the data in detail, it favours the idea that green jelly beans are linked with acne in female smokers with no children.

She is cautious about making a firm conclusion, of course, but there is reason to suppose that there is an effect.

She should design an experiment to test just that hypothesis.

Etc.

]]>Alice finds that:

A1] the data provides evidence of the effect on acne of green jelly beans

A2] the data provides evidence of the effect on acne of jelly beans of a particular color, namely green

A3] the data provides evidence of the effect on acne of jelly beans of a particular color

Bob finds that the data doesn’t provide evidence of the effect on acne of jelly beans of any particular color.

Mind you, I do understand that they are doing different tests. But if the goal is to make the (never defined) “evidential meaning” of p-values understandable to non-statisticians the whole thing remains a bit confusing. It seems to me that you need some “non-standard” concept of “strength of evidence”, because in another context such a contradiction wouldn’t be dismissed so easily.

Imagine that looking at the same body of evidence, Detective Anderson claims that there is evidence that Mr. X murdered Epstein while Detective Brown claims that there is no evidence that Epstein was murdered by anyone. We would say that there is a conflict, or at least that they don’t use the concept of “evidence” consistently.

]]>Bob and Alice have tested different null hypotheses, and so the different p-values are of no consequence. Alice focussed on the evidence regarding the effects of each colour of jelly bean, whereas Bob used a method that minimises the possibility of falsely discarding a true null hypothesis. Their results do not conflict.

Alice cannot say that she has evidence against Bob’s hypothesis because she did not test that hypothesis. She might say, however, that Bob’s hypothesis is not very interesting. And she might point out that if Bob’s hypothesis were of interest then Bob’s experiment would have needed far larger samples than hers, because his approach to multiple tests costs a lot of power.

The relevant inferential question is how Alice should proceed. She should recognise that the probability that she would find evidence favouring an effect of at least one of the colours was quite high, and so she should be cautious about making a firm conclusion on the basis of just those experimental results. However, she did find that there is reason to suppose that the green jelly beans had an effect, and so she should design an experiment to test just that hypothesis. (Or publish the lot as a preliminary study.)

What should Bob do? He should read my papers and see how badly served he is by Neyman-Pearsonian statistics!

]]>Yes, Carlos, I do tend towards the likelihood principle, but I don’t believe that it should be interpreted as saying that _only_ the likelihood function should be taken into account when making inferences. That interpretation is silly, and it is hard to understand why it has remained the standard understanding of the principle.

This is how I define the likelihood principle (from https://arxiv.org/abs/1507.08394):

Two likelihood functions which are proportional to each other over the whole of their parameter spaces have the same evidential content. Thus, within the framework of a statistical model, all of the information provided by observed data concerning the relative merits of the possible values of a parameter of interest is contained in the likelihood function of that parameter based on the observed data. The degree to which the data support one parameter value relative to another on the same likelihood function is given by the ratio of the likelihoods of those parameter values.

Notice how recognition of the role of the statistical model and the restricted scope of the likelihood function (“within the framework of a statistical model”) precludes the false notion that _only_ the likelihood function should be taken into account when making inferences. Statistical models cannot capture all of the information relevant to an inference, except in trivial model cases, and a likelihood function cannot be any better or more relevant than the statistical model from which it comes.
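
A toy numerical illustration of the proportionality claim in that definition (hypothetical numbers, not from the paper): a fixed-n binomial design and a stop-at-the-kth-success negative binomial design give likelihood functions that differ only by a constant factor, so every likelihood ratio between parameter values agrees across the two designs.

```python
from math import comb

k, n = 7, 20  # hypothetical data: 7 successes observed in 20 trials

def binom_lik(theta):
    """Likelihood under a fixed-n (binomial) design."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def negbinom_lik(theta):
    """Likelihood under a stop-at-the-kth-success (negative binomial) design."""
    return comb(n - 1, k - 1) * theta**k * (1 - theta)**(n - k)

# The two functions differ only by the constant comb(n, k) vs comb(n-1, k-1),
# so the likelihood ratio of theta = 0.3 relative to theta = 0.5 is the same.
r_b = binom_lik(0.3) / binom_lik(0.5)
r_nb = negbinom_lik(0.3) / negbinom_lik(0.5)
print(round(r_b, 3), round(r_nb, 3))  # identical ratios, about 2.222
```

The stopping rule changes the sampling distribution (and hence any p-value) but not the likelihood ratios, which is exactly the tension with frequentist treatments of optional stopping discussed elsewhere in this thread.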

]]>> What’s the evidential interpretation of the p=0.45 obtained by Bob?

Correction: it’s actually p=0.57. Not very different, but maybe crossing the 0.5 equator is relevant for the evidential interpretation.

(In my previous comment I had rounded it from 0.43 to 0.45, but I didn’t realize I was looking at it from the wrong side.)

]]>When I say that “The model could be applicable” I mean that it may be the case that you have the option of considering a more applicable model, not that I’m going to provide you with one. If the model is “good”, the p-value means something. From that perspective, the “worse” the model gets, the less meaningful the p-value is.

You say that the “evidential meaning” of the data doesn’t depend on the experimental protocol, only on the data and on a model that doesn’t have to change when the design of the experiment changes. That would make more sense, I think, if you were embracing the likelihood principle and saying that the inference has to be based on the likelihood function only. But given that you say that inference has to take into account those details, I don’t see what you gain by claiming that p-values give the “evidential meaning” of the data/model when no inference is possible without additional evidence/data/assumptions. You split the evidence into local/global and you still need to report and take into consideration every piece.

Is that “local” evidence useful by itself? What would you say is “the meaning conveyed by an observed P-value of 0.002”? As far as I can see, you don’t give an answer beyond saying that it could have the “usual” interpretation of unlikely-things-happening if the model is correct but could also be low because the model is not applicable. But once the model is not applicable, does it mean anything concrete? If there is no way to tell if a value is low or high, what’s the utility?

> If you agreed with the section in my paper regarding the XKCD jelly beans example then you should be able to understand the optional stopping problem.

“For example, in the case of the cartoon, the evidence in the data favour the idea that green jelly beans are linked with acne (and if we had an exact P-value then we could specify the strength of favouring) but because the data were obtained by a method with a substantial false positive error rate we should be somewhat reluctant to take that evidence at face value.”

I agree that we shouldn’t take “that evidence” at face value, because the “strength of evidence” is weakened by the multiple comparisons even though they are irrelevant for the calculation of the p-value according to the model that ignores them.

Say the original study had used 20 different colors of jelly beans but looked only at the aggregate (non) effect.

They published the data and two researchers A and B looked at the subgroups. The standardized effects are:

-2.052 -1.331 -1.266 -1.203 -1.062 -0.970 -0.928 -0.599 -0.594 -0.431 0.089 0.302 0.535 0.576 0.628 0.685 0.946 1.081 1.354 1.553

Alice calculates p-values using independent models, with the null hypothesis being no effect for one color, because it’s obvious that the existence of people taking yellow jelly beans doesn’t change the data for those who took blue jelly beans.

She finds evidence for a negative effect (say that the subjects have acne and it can get better or worse) of green jelly beans: p=0.0402, the evidence for other effects is less strong (p>0.1).

Bob considers a model where the null hypothesis is that the effect is zero for every color. This means that if any of the independent null hypotheses considered by his colleague is false then this one is also false.

He calculates a p-value using as statistic the absolute value of the largest observed effect: p=0.45. The evidence against the null hypothesis that the effect is zero for every color is extremely weak, if any.
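
Both numbers can be reproduced from the listed standardized effects, treating them as z-scores under independent standard-normal nulls (and assuming the −2.052 entry is the green subgroup). A short sketch:

```python
import math

# Standardized effects for the 20 subgroups, treated as z-scores
effects = [-2.052, -1.331, -1.266, -1.203, -1.062, -0.970, -0.928,
           -0.599, -0.594, -0.431, 0.089, 0.302, 0.535, 0.576,
           0.628, 0.685, 0.946, 1.081, 1.354, 1.553]

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Alice: independent two-sided test for one color (assumed: green = -2.052)
p_alice = 2.0 * (1.0 - phi(abs(effects[0])))

# Bob: global null "every effect is zero", statistic = max |z| over 20 colors
max_abs = max(abs(z) for z in effects)
p_bob = 1.0 - (2.0 * phi(max_abs) - 1.0) ** 20

print(round(p_alice, 4), round(p_bob, 2))  # 0.0402 0.56
```

This reproduces Alice’s p=0.0402, and gives about 0.56 for Bob’s statistic; the 0.45 figure quoted here is roughly the complement (0.44), looked at “from the wrong side”, as the later correction in the thread notes.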

What is the evidential interpretation of the p=0.0402 obtained by Alice? What’s the evidential interpretation of the p=0.45 obtained by Bob?

Does the data collected in the original study favour the idea that green jelly beans are linked with acne?

What is the “strength of favouring” given that the exact p-value is 0.0402?

Can Alice say that she has “strong” evidence against Bob’s null hypothesis? (If green jelly beans have an effect, Bob’s null hypothesis is also false.)

(I agree that statistics is difficult but I don’t think that all p-values are equal: I find that Bob’s are better than Alice’s.)

]]>Again, what exactly is your model that takes the optional stopping into account? Does it distinguish between the subset of possible outcomes that stop early and those that go on to the large sample size? Assuming that it does, does it treat optional-stopping results at, say, n=20 differently from the equivalent data that notionally came from a fixed n=20 protocol?

It is easy to say that the models should take all of the sampling rules into account, but not so easy to condition on the sampling rule and the actual (i.e. observed) sample size at the same time. I think that you are suggesting that the situation calls for a model that cannot simultaneously have all of the properties that you would need.

An equivalent issue arises when the results from multiple tests are ‘corrected for multiplicity’. The ‘corrected’ tests involve a different null hypothesis from the null hypotheses of the individual significance tests. If you agreed with the section in my paper regarding the XKCD jelly beans example then you should be able to understand the optional stopping problem.

]]>At least we agree on something: I don’t get it.

To extract the _evidential meaning_ of the data we use a model which ignores a number of things about the experiment, so the differences between that model and the experimental details become irrelevant for the calculation of p-values. However, I still have to take into account those details for inference, because that _evidential meaning_ may not be so meaningful by itself.

The model could be applicable, the sampling distribution used to calculate the p-values could be correct, the p-value could have a clear inferential meaning. If I understand your position, a correct understanding of p-values means forgetting about that and assuming that different models are to be used for _evidential considerations_ and for inference from that evidence.

]]>No, you are not getting the point. The model that can extract the _evidential meaning_ of the data via a p-value is not the same as the model that you want to use. That’s why it is necessary to have the evidential considerations explicitly separate from the considerations of how to respond to that evidence.

]]>Do we agree that by “observed p-value” you mean “the p-value calculated using a flawed/inapplicable model” according to your own discussion of model adequacy?

How can the use of a less flawed / more applicable model, if available, be a bad thing? If no better model is available I understand that you may want to use whatever you have, but saying that p-values calculated according to a flawed model are fine when “correctly understood” is a stretch.

I don’t think that distinguishing p-values calculated according to an applicable model and p-values calculated according to an inapplicable model is a silly thing to do.

]]>Carlos, it is easy to make such arguments, but what exactly is your different model? The frequentist approaches to ‘dealing with’ interim analyses by taking them into account in the statistical model seem to me to discard the evidential nature of the observed p-value in favour of preservation of long run type I error rates. If you want to do that then use the Neyman-Pearson hypothesis testing framework, but take note of the serious shortcomings of that approach for scientific inference formation.

]]>Carlos, I half agree and half disagree:

The actual goal of Frequentist inference *in my opinion* is to find out things about the real world, and to do so by doing calculations on problems that mimic reality as an approximation. The idea behind saying “probability is the frequency in large repetitions” is that you can then make a 1-1, at least approximate, correspondence between an RNG computational algorithm and what you think will happen in collecting real-world data.

Unfortunately, this is where Frequentism usually goes wrong: it’s used in huge numbers of contexts where validating the choice of distribution against a dataset is impossible. So it relies on inference on things like averages, etc., where CLT-type results are available so you can choose a distribution independent of the observed shape of individual data points… but the problem with that is that you get *one* data point from the sample-average distribution, for example.

However, there’s nothing *mathematically* wrong with calculating a p-value for a model that doesn’t mimic reality; it just doesn’t tell you what the usual Frequentist statistician wants to know: the probability of making a *real world* error.

In some sense this possibility is normally not acknowledged in the typical NHST rubric, rather, it’s assumed that the statistical model is adequately describing reality *first*, and all that is needed is a parameter value. Then people reject the null parameter value, and normally immediately falsely conclude that the parameter value is close to the maximum likelihood value or some other estimator value, while never even considering the idea that the model is totally inadequate in the first place.

They ignore the 3rd option you mention above “the statistical model is flawed or inapplicable” and this is *almost always* the correct option in typical application of NHST because people who do adequate models have a tendency to be using Bayes to fit them.

In fact, Bayes is basically a search through model space to see which models are adequate, using likelihood to filter out those that are inadequate.

This is why Anoneuoid always goes on about *test your actual research hypothesis*. The “null hypothesis” is usually just a thing in a book. You don’t care about it, it doesn’t match reality in any way… it’s just a stupid straw man. The guy who wrote the book didn’t have the slightest clue what science you were doing.

But none of that invalidates the *math* of the p-value. The p-value tells you “hey, this stupid thing you chose to check is not reality” and *that’s all*. It doesn’t tell you “hey, this other thing near the MLE is reality” or any of the stuff people want it to tell them.

Anyone who formulates a sufficiently complex model to describe the reality of their experiment, hypothesizes some parameters, and then tests their model against data using NHST and a battery of tests… well they get a silver star… they get a gold star if instead of hypothesizing some point parameters, they acknowledge the possibility for those parameters to have some wiggle room and explore that wiggle room… But then they’re doing Bayes.

Bayes isn’t much different from NHST using likelihood ratio tests + hypothesized wiggle room for parameters. It’s just that given the generality of the applicability, you also get to spend your time specifying a model that matches your understanding of the experiment/observations relatively closely.

It’s not really about running Stan on dumb models + some priors and everything gets better; it’s about getting a tool where you can use “not dumb” models, and suddenly you’re in a different realm where golf putting depends on things like the precision with which people can estimate angles, the length over which the putt has to travel, and the rate at which energy is taken out of the ball by friction and so forth.

“just do bayes” won’t cut it… but “just learn Bayes so you can in theory fit any kind of model, and then learn about how to build good models!” that can cut it.

]]>Michael,

> interim analyses cannot affect the p-value (when correctly understood). After all, the interim analyses do not influence the data or the null hypothesis.

The p-value also depends on the model, as you discuss extensively at that link. The interim analyses do influence the model. The model should take them into account.
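
Concretely (a hedged sketch with assumed numbers: a z-test with known sd = 1, unadjusted looks at five interim sample sizes), a model that ignores the interim analyses is no longer calibrated — the realized false positive rate under optional stopping is well above the nominal 0.05:

```python
import random, math

random.seed(2)

def z_p(xs):
    """Two-sided z-test p-value for mean 0, known sd = 1."""
    z = sum(xs) / math.sqrt(len(xs))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

sims, looks = 4000, [20, 40, 60, 80, 100]
fixed_fp = optional_fp = 0
for _ in range(sims):
    xs = [random.gauss(0, 1) for _ in range(100)]  # the null is true
    if z_p(xs) < 0.05:                             # single look at n = 100
        fixed_fp += 1
    if any(z_p(xs[:n]) < 0.05 for n in looks):     # stop at first "significant" look
        optional_fp += 1

print(fixed_fp / sims, optional_fp / sims)
```

The fixed-n test hits its nominal rate (about 5%), while peeking at each interim look roughly doubles or triples it, which is what the frequentist adjustments are trying to repair.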

“What is a statistical model?”

“A statistical model is what allows the formation of calibrated statistical inferences and non-trivial probabilistic statements in response to data. The model does that by assigning probabilities to potential arrangements of data.”

You acknowledge that you won’t get calibrated statistical inferences if you ignore the interim analyses: “That sub-optimality should be accounted for in the inferences that made [sic] from the evidence”.

“Consider the meaning conveyed by an observed P-value of 0.002. It indicates that the data are strange or unusual compared to the expectations of the statistical model when the parameter of interest is set to the value specified by the null hypothesis. The statistical model expects a P-value of, say, 0.002 to occur only 2 times out of a thousand on average when the null is true. If such a P-value is observed then one of these situations has arisen:

• a two in a thousand accident of random sampling has occurred;

• the null hypothesised parameter value is not close to the true value;

• the statistical model is flawed or inapplicable because one or more of the assumptions underlying its application are erroneous.”

In the latter case, I don’t think it makes much sense to use a model known to be inapplicable to calculate a p-value that no longer has a clear meaning.

]]>“I disagree that there’s no such thing” [as a ‘representative sample’], but “every real-world sample is nonrepresentative.” And, thus, Bertrand Russell was the Pope (he may actually have been, I don’t rightly know).

]]>Daniel captures my intention perfectly.

Ideally a scientist would be able to make the best possible use of statistics as a relatively reliable step towards making optimal scientific inferences. In the real world a scientist has to be able to usually do a good enough job with statistical inference that it does not substantially detract from the scientific inferences. (Many, many scientists will have to up their game to meet the standards of that second sentence!)

]]>I disagree. Good data analysis varies greatly depending on the data generating process and the particular observations on the components of this process that you happen to observe in your experiment or study. General guidelines could be given, but creating a giant list or flow chart to follow would seem quite difficult if not impossible to me. Also, I think flow charts of canned solutions tend to encourage people to turn their brains off.

A flow chart or standardized procedure for data analysis would be like writing an SOP to drive a car. It’s nice to know that at a stop sign one should brake the car to a halt, but it’s also a good idea to brake the car to a halt when a child steps out into the street, or a dog, or if smoke starts coming out from under the hood, or when it starts raining too hard to see, etc. At some point you need to think for yourself and apply the skills you know in a common-sense and reasonably intelligent way.

Martha Smith puts this better than I have in the comments below.

]]>> What if you make your null as robust/severe/extreme as you can? It seems that if you exhaustively list every potential source of bias in the data and fully characterize them into estimations of error, and then sum those into a global estimate of error, you could build a null from that, right?

Now you’re doing Bayesian analysis. Specifically, you’re using your scientific theory to build a generative model of the data collection process. That’s all the fanatics like Anoneuoid and I ask for ;-) Just actually think about your science and build a model of what happens, and then check that against data! In doing so, incorporate as many of the known or hypothesized effects and biases as you can…

As soon as you bother to do that, you’ll have questions other than “does this or does this not equal 0”. No one really cares about whether parameters equal zero in real models, because as soon as you build the model in your head, the parameters take on *meaning* whereas when you’re push-buttoning the “pocket calculator” version of statistics the parameters have no meaning, you just want to know if they are or aren’t greater than 0 or whatever.

But don’t you think it matters whether say your ability to measure how much effort people put into avoiding hurricanes is dominated by the variation caused by the male/female name of the hurricane, or by the extremely poor data that you have on what the reported locations of landfall were and how that was communicated so that people knew whether they were or were not in danger?

I mean, that’s actual scientific knowledge there: communication about hurricane risk affects people’s perception of danger and choice of response… and perhaps the hurricane’s name affects the communications, or perhaps it affects the response, maybe both…

]]>To the extent that you are running the same physical process multiple times, then yes, you can also run the same mathematical model of that physical process multiple times…

the idea however that you can somehow create a canned procedure which need not know anything about the physical process or social science process or literary process or whatever and can still magically tell you what your data means… that’s extremely problematic.

]]>I disagree, it’s more like saying that to determine what waveforms your device is producing you must understand the concept of real numbers, of functions, of voltage, of variation of voltage in time, of input impedance and probe capacitance and the way that attaching the probe might change the waveform your device produces, of radio frequency radiation and shielding to know whether signals come from outside your device…

Do you disagree that those are important to using an oscilloscope, and are separate from design of the instrument itself?

]]>Jim: if you have a historical dataset of curated “nothing is going on” data, and you get a new data point, then you can compare this data point to the historical set and say “this (is/is not) unusual compared to history”

This is an actually fine application of hypothesis testing, the reason a thing is considered unusual is specifically because it was rare (had low frequency) in the past.
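
A minimal sketch of that kind of comparison, with made-up historical data: the “p-value” here is just the observed frequency of equally extreme values in the curated history, so rarity is defined by actual past frequency rather than an assumed distribution.

```python
# Hypothetical curated "nothing is going on" historical measurements
historical = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9,
              5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 5.3, 4.8, 5.2, 5.0]
new_point = 6.1

# Empirical frequency of historical values at least this extreme
# (one-sided: high values are the unusual direction here)
frac = sum(1 for h in historical if h >= new_point) / len(historical)
unusual = frac < 0.05
print(frac, unusual)  # 0.0 True
```

With a real reference set you would want far more than 20 points before calling anything with empirical frequency 0 “unusual”, but the logic is the same.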

The problem is, most hypothesis testing isn’t like that at all.

1) It’s rare to compare to a large database of data; instead some stand-in distribution is used without ever checking whether it adequately represents the “real” distribution.

2) Another strategy is that only a single parameter is compared, say the mean, using an asymptotic limit theorem to define the sampling distribution in a way that’s independent of what the data look like (as long as they have a mean and standard deviation…). But the fact that two distributions have the same / different mean is often not sufficient to establish much.

3) Even if the difference in means is meaningful, people often throw away that information and make decisions based only on “significant vs. not significant”, i.e. they decide on a single binary digit of information (present/absent). But different-sized effects can’t all have the same theoretical use.

4) When first filtering on significant/nonsignificant and then looking at effect size, the effect size is dramatically biased upwards in magnitude. Failure to replicate a similar magnitude effect is our expectation from this kind of research.

So, there are multiple issues at play. The problem with p values isn’t the p value, it’s the failure to properly align the research question with the mathematical question.
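
Point 4 is easy to demonstrate by simulation (a sketch with assumed numbers: true standardized effect 0.2, n = 20 per study, known sd = 1): the average estimate among “significant” results is far larger than the true effect.

```python
import random, math

random.seed(0)
true_effect, n, sims = 0.2, 20, 2000
all_estimates, sig_estimates = [], []

for _ in range(sims):
    sample = [random.gauss(true_effect, 1.0) for _ in range(n)]
    est = sum(sample) / n
    se = 1.0 / math.sqrt(n)          # known sd = 1, so a z-test
    all_estimates.append(est)
    if abs(est / se) > 1.96:         # two-sided "significant at 0.05"
        sig_estimates.append(est)

avg_all = sum(all_estimates) / len(all_estimates)
avg_sig = sum(sig_estimates) / len(sig_estimates)
print(round(avg_all, 2), round(avg_sig, 2))
```

The unconditional average recovers roughly 0.2, while the significance-filtered average lands near 0.5, so an honest replication of the true effect will look like a “failure to replicate” the published magnitude.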

]]>Michael, thanks for the link to your paper. I am working my way through it. I guess I do disagree with your view “that good scientists [need to be] capable of dealing with the intricacies of statistical thinking.” To me that is like saying that to determine what waveforms your device is producing, it is not enough to be able to read them on an oscilloscope, you have to be able to design an oscilloscope yourself. I frankly feel that it is an abrogation of responsibility by the statistical community to just toss the statistical aspects of proper inference methods to the unwashed masses (like me). Because we will fight like ignorant dogs. And we will attempt to baffle our reviewers with statistical treatments unique, dense and long, and dare them to wade in deep enough to criticize.

]]>“best practices is to just stop doing it entirely”

I am in violent agreement! I would never approach anything that way. But that is a standard approach in psychology, at least according to the practitioner I know. He would say that you are not trying to understand buoyancy, that is impossibly hard, so all you can do is establish whether a coating of butter has an effect in isolation. But I am with you, I wouldn’t feel comfortable with anything about a butter coating if I did not have an adequate understanding of the concept of buoyancy.

]]>“the nil null is programmed into every push button piece of software and is spit out by all sorts of analyses and plopped into papers Willy nilly. it’s not like it’s a straw man”

Sure, but isn’t the question whether the baby should go out with the bathwater, that is, whether misuse of p values means we should end significance testing?

What if you make your null as robust/severe/extreme as you can? It seems that if you exhaustively list every potential source of bias in the data and fully characterize them into estimations of error, and then sum those into a global estimate of error, you could build a null from that, right?

Suppose I am interested in himmicanes and I decide that after working through all the possible sources of bias that I can dream up or find in the literature, I discover that 97% of the potential error is caused by two factors, the magnitude of the hurricane and the percent that flee. After estimating those biases and establishing error bars, my model shows that any value between [people spend 5% more on female-named hurricanes] and [people spend 18% more on male-named hurricanes] is indistinguishable from “no effect.” Now I back-calculate from the worst case error, the 18%, that to get a p of 0.05, I need a result greater than [people spend 68% more on male-named hurricanes] to show a statistically significant result based purely upon the gender of the hurricane name. I don’t really care how big the effect is! At this point, I finally interrogate the raw data and discover that it shows [people spend 74% more on male-named hurricanes]. Can I declare victory? I am curious what problem this approach would fail to prevent, and if so, what would work better in a simple example like this.

]]>Let them eat cake

]]>“Try to define what a “representative sample” is without getting into a reductio ad absurdum.”

ha ha, yeah, you’re right, but you still know what one is right? :)

]]>Matt,

FWIW I’m not a statistician or medical researcher, just a geologist :) But it seems obvious to me that, properly applied, this approach should be suitable to test for the efficacy of a treatment at some level of accuracy or precision.

All that’s going on in a test like this is comparing the distribution of measurements from treatment samples to the distribution of measurements from an ideal set of samples in which there is no effect. The distribution of measurements in which there is no effect is presumably normal, thus the idea that it’s a “random number generator”. I was confused by that for a while but I think that’s what Daniel and Phil are on about. A random number generator would produce a normal distribution.

That being said, the method is only as good as the data, and P = 0.05 isn’t absolute proof of efficacy; it’s the *odds* that the difference in measurement distributions results from chance. So it’s an indirect comparison with certain odds of failure even if everything else is perfect.

]]>“Creating a flow chart of canned analyses sounds like a bad idea…”

I don’t know why it would be. Physicists use a set of “canned” or standardized procedures to find and verify exoplanets. Geologists use a standard set of procedures to identify and process specimens for isotopic ages. Drug researchers use a standard set of procedures to extract natural chemicals from plants and test them for efficacy in the lab.

Everything in science is done by standardized procedures. If you’re not using one, you’re making one up to get other people to use. The procedures aren’t the problem. The problem is the people who don’t follow procedures but claim they did. A lot of that is not confirming that the assumptions required for the procedure to work are actually met.

]]>Wow, well this might explain quite a bit about the confusion that occurs between the two of us.

I see Frequentism as a very definite ontological theory: the frequency is a thing in the world that can in principle be verified by taking a large sample of observable things, and probability calculations are acceptable to the extent that they approximate the real world frequencies in a reasonably accurate way.

I have absolutely no problem with “if such and such random process were to occur, it would produce observed data or more extreme as often as p = 0.0021” that’s a perfectly valid mathematical calculation. The problem is usually either:

1) It isn’t anything validated about the world, so it doesn’t correspond to the Frequentism theory mentioned above, so we need a test of goodness of fit or a proof of an appropriate asymptotic result or something before we can call it Frequentist.

or

2) It is a purely hypothetical calculation, in which case it is actually a (usually quite poor) Bayesian model, and I ask “why did you check that particular model using that particular likelihood?”. In particular, when used with p = 0.05 threshold, it kind of corresponds to a strange ABC Bayesian model, in which your summary statistic is p

You can create a Bayesian model as follows, using ABC methodology:

1) Define a stochastic RNG sampling model, like normal(m,s) or even some complex “random effects” model like linear regression with iid errors and individual slopes and intercepts per person or per school etc.

2) Let the prior for the parameters be uniform over [-MaxFloat,MaxFloat] for all parameters (or, if a parameter must be positive, [0,MaxFloat], similarly for negative, or uniform between two logical bounding values, maybe [0,1], etc.)

3) Generate a random parameter vector from the prior

4) Using the parameter vector, calculate the p value for the test statistic function applied to the data under that parameter + sampling model (basically: generate a data set, calculate the test statistic, repeat many times, then find the quantile of the real data’s test statistic in this big database)

5) Keep the parameter vector if p > 0.05, reject the parameter vector if p < 0.05

6) Go to 3 until a sufficient sample is collected.

Is this a Bayesian model or not? I suggest that, to the extent that you haven’t tested the sampling distribution against the data for goodness of fit, and to the extent that you actually hypothesize this uniform prior, which can’t possibly be a real sampling distribution of anything, it’s actually a Bayesian procedure, and an obviously non-optimal one at that.
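For concreteness, here is a runnable sketch of the steps above (my own toy construction, not from any reference): a normal(m, 1) sampling model, the sample mean as test statistic, and a “flat” uniform prior standing in for [-MaxFloat, MaxFloat].

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, size=30)       # stand-in for the observed data

def p_value(m, data, n_rep=100):
    # Sampling model: normal(m, 1); test statistic: the sample mean.
    # Step 4: simulate replicate data sets, locate the real statistic's quantile.
    t_obs = data.mean()
    t_rep = rng.normal(m, 1.0, size=(n_rep, len(data))).mean(axis=1)
    q = (t_rep <= t_obs).mean()
    return 2 * min(q, 1 - q)               # two-sided p from the quantile

kept = []
while len(kept) < 200:
    m = rng.uniform(-10, 10)               # step 3: draw from a "flat" prior
    if p_value(m, data) > 0.05:            # step 5: keep iff p > 0.05
        kept.append(m)

kept = np.array(kept)
print(kept.mean(), kept.std())             # kept draws concentrate near the data mean
```

The kept draws pile up around the values of m the test fails to reject, which is exactly why the procedure behaves like a (crude) posterior sample.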

]]>Fair enough, you’ll probably find plenty of references in which frequentism is described like this. I think this is very problematic. Nobody has ever managed to nail down properly what an objectively “existing in the world” frequentist probability actually is, and it’s hard to say how it could be “measured” without a conception of what it is. Personally I think it makes more sense to define it as a way of thinking: “We think of a process as if…” Even “repetition” is to some extent a subjective construct, in the sense that no “repetition” ever is really identical (and one would have a very hard time defining what could be meant by “an approximate repetition”). For me frequentism is basically about thought constructs, and one can just say that these thought constructs are more or less strongly and convincingly associated with how we perceive the “real world”: very weakly in situations in which we have very few or even just one observation (although it may be conceivable to gather more), much more strongly in cases in which many observations can be had without obvious threats to “randomness” such as dependence, shifts of conditions, etc. Random effects are somewhere in between; often you have many observations, but of course you don’t observe the parameters directly but rather observations generated from them (according to the thought construct).

]]>> It’s about what we think a model means, not about how the model looks.

Specifically, a Frequentist model should be one in which everything that has probability assigned to it should be a verifiable, observable outcome which can be repeated and varies from one experiment/observation to another.

From Wikipedia, which I think agrees with most everything I’ve read: “The relative frequency of occurrence of an event, observed in a number of repetitions of the experiment, is a measure of the probability of that event. This is the core conception of probability in the frequentist interpretation.”

also “Probabilities can be found (in principle) by a repeatable objective process (and are thus ideally devoid of opinion).”

So “observed in a number of repetitions” is essential, and probabilities are verifiable facts about the world, not about our assumptions. This is why in “good” frequentist practice, like what’s taught in “Statistical Analysis and Data Display” (Heiberger and Holland), you are told how to run things like Anderson-Darling tests or Kolmogorov-Smirnov tests, or to do transformations to make your data more normally distributed, before running your ANOVA or whatever. The idea is that whatever calculation you’re doing is an approximation of some “real, true” probability distribution, and you should make sure the distributional assumptions map at least approximately onto the verifiable facts of the world. This is also why there are so many nonparametric tests and goodness of fit tests in such books: they use mathematical tricks to incorporate these transformations into the tests.
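As a toy illustration of that practice (my example, not one from the book): run a goodness-of-fit check first, transform, and re-check before trusting normal-theory machinery.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=200)   # skewed raw data

# Anderson-Darling against the normal family: a statistic far above the
# critical values signals a poor fit to the distributional assumption.
raw = stats.anderson(x, dist='norm')
print(raw.statistic, raw.critical_values)

# A log transform makes these data (here, exactly) normal; re-check the fit.
logged = stats.anderson(np.log(x), dist='norm')
print(logged.statistic, logged.critical_values)
```

Only after the second check passes would the normal-theory calculation plausibly approximate a “real, true” distribution in the sense described above.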

This is why I say in most cases random effects models are Bayesian models in which people just avoid thinking about the prior explicitly. Somewhere deep in these mixed models is the idea that some parameter (like the mean) that describes a distribution over verifiable outcomes itself has a distribution placed over it and each sub-group has a different realization of this parameter value.

But the interpretation of the distribution over the parameters cannot be anything but Bayesian, since the parameters themselves are not observable, nor can they be found by “a repeatable objective process”, nor is there any way to verify their distribution.

You *might* argue that with enough data on each subgroup, the sample mean is close enough to the real mean, that the observed distribution of the sample means across the groups then approximates the distribution of the parameter, and that you could therefore adjust the assumed hierarchical distribution over parameters to match the shape of this “observed” distribution across groups.

I have never ONCE seen that done.
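If someone did do it, the check could be as simple as this sketch (entirely my construction): give each group a deliberately heavy-tailed true effect, and test whether the “observed” group means look like the normal shape the mixed model assumes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_groups, n_per = 100, 200
true_effects = rng.standard_cauchy(size=n_groups)   # decidedly non-normal effects

# With lots of data per group, each sample mean sits close to its group's
# effect, so the collection of means is a near-observation of the effect
# distribution that the hierarchical model places over the parameters.
group_means = np.array([rng.normal(e, 1.0, size=n_per).mean()
                        for e in true_effects])

stat, p = stats.shapiro(group_means)   # test the assumed normal shape
print(p)                               # a tiny p flags the hierarchical assumption
```

Here the Shapiro-Wilk test on the group means catches the failure of the usual normal random-effects assumption that the raw fit would silently absorb.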

]]>For me “frequentist” is an interpretation of probability. It’s about what we think a model means, not about how the model looks. Bayesian, on the other hand, is a way of computing things, and there are different interpretations of this by different people (objective, subjective, falsificationist…). One can give a frequentist interpretation to a Bayesian prior. Sometimes that’s very far-fetched (if the idea is that in fact only one value of the parameter generates all the data), sometimes not so much. Many people interpret random effects in a frequentist manner (even more people don’t care about interpretation, so can be classified neither as frequentist nor as any variety of Bayesian), which often makes sense insofar as they model something that in reality is repeated (e.g., when a random effect is assigned to each of n test persons). In my use of terminology it doesn’t matter whether this actually is a distribution over a parameter. I’m not aware that papers are rejected because they use priors, by the way. What I see more is that people avoid priors because they have no idea where to get them from, and how sensitive what they do would be to the choice of prior.

]]>Thanks Nick. I would be very pleased to receive your reasoning about the sidedness of the p-values. My email is Michael with an extra L at the end, then @unimelb.edu.au

]]>Michael,

I actually came across your article last week and was very impressed.

I do believe, however, that for a number of reasons your argument only applies to one-sided p values against a range null, not to a two-sided p value against a point null.

(Also, if we’re discussing p values, where is Anoneudoid?)

]]>Mark:

I agree that the term “representative sample” is vague and depends on context, but I disagree that there’s no such thing.

To put it another way, every real-world sample is nonrepresentative. It is important in data collection to control the nonrepresentativeness, and it is important in analysis to recognize and adjust for the nonrepresentativeness.

]]>For samples from a finite population at a point in time (like surveys), a representative sample is one in which all measurements are of similar magnitude to what would be expected from a pure random sample drawn using a calibrated RNG.

There’s nothing circular or absurd about that. You don’t have to actually use a calibrated RNG to collect your sample, but if it isn’t “representative” of what you might have gotten from an RNG, it isn’t a representative sample.
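A small sketch (my own operationalization of the definition above, with made-up toy data): build the reference distribution of a statistic under calibrated-RNG sampling, then ask whether the sample at hand falls inside it.

```python
import numpy as np

rng = np.random.default_rng(4)
population = rng.exponential(scale=2.0, size=100_000)  # toy finite population

# Reference: sample means across many genuinely random samples of size 500.
ref = np.array([rng.choice(population, 500, replace=False).mean()
                for _ in range(2000)])
lo, hi = np.quantile(ref, [0.025, 0.975])

biased = np.sort(population)[-500:]        # a sample of only the largest values
print(lo <= biased.mean() <= hi)           # False: not "representative"
```

The biased sample’s mean sits far outside the calibrated-RNG reference band, so by this definition it fails to be representative, with no appeal to “relevant characteristics” needed.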

]]>Jim, there is no such thing as a “representative sample”; that’s a non-statistical concept that many statisticians seem to like throwing around simply to justify their methods. It’s ill-defined. Don’t believe me? Try to define what a “representative sample” is without getting into a reductio ad absurdum. Well… a representative sample is a sample in which the relevant characteristics are similar to those in the population. Who defines which are the “relevant characteristics”? What does “similar” mean? Sure, the proportions of African Americans, and females, and those from the northeast are “similar” to those in the census (or whatever)… but what about the proportion of female African Americans from the northeast (a perfectly well-defined sub-population, clearly as relevant as any of the marginal sub-populations)? Thus, it’s an absurd concept.

]]>Matter for what? Matter for the accuracy of the mathematical statement “such and such a random process wouldn’t have produced this data, p = 0.002”, or for the accuracy of the scientific statement “such and such a thing is true about the world”?

A representative sample doesn’t matter for the mathematical truth, but of course it does matter for the statement about the world.


]]>The problem isn’t that the p value is wrong or biased; the problem is that the model used to calculate the p value doesn’t model known facts about the actual data collection process. Another way to say this: standard practice among frequentists is to try to base the p value on a model of the data collection experiment, and therefore to be able to say something more than “if you had just grabbed a sample from rnorm() you probably wouldn’t have gotten this data”. What they want to say is “if you actually repeat this exact experiment in the world, and the true parameter is 0, you will rarely get data like this”. In other words, only the value of the parameter is left as a hypothetical; everything else is intended as a statement about the way the world actually operates.

it’s this attempt to make frequency a physical fact that galls me about frequentism; I think it’s also what irritated Jaynes (his mind projection fallacy, for example). Mayo is on record here saying that statistical claims can’t just be “conditional on some assumed process… the probability of this thing occurring is p” but rather should be unconditional claims about what actually does happen. To me that’s the essence of the conflict. I have no problem with Keith’s favored interpretation of Bayes as testing hypothetical RNGs to see which subset are sufficiently lifelike.

]]>I honestly don’t know what the majority of Bayesian publications do, I mostly read about fairly good ones… but it’s rare to see them at all in say Biology or engineering or medicine or economics, fields I care about.

Often people count maximum likelihood random effects models as frequentist because they don’t specify an explicit prior, but I don’t think they are… they apply probability to parameters without the parameters being an observable quantity that can have repetitions, or a verifiable physical distributional shape. They are just a poor man’s Bayes for people afraid that explicit priors will get their papers rejected.

of course you can bootstrap things rather a lot, and people do. and you can do lots of permutation testing and placebo treatments and try to discover things that way… it’s not totally hopeless, at least this kind of thing respects the observed distribution of data

I’m beginning to like Keith’s explanation of Bayes as filtering out which random number generator models could/couldn’t have generated the data using likelihood as the filter against the prior as the proposal. it is a unifying concept that seems intuitive to explain to beginners.

]]>oh, sure, for that kind of thing, best practices is to just stop doing it entirely!

Imagine if people tried to understand buoyancy that way; we would have a hopeless mishmash of thousands of papers: x sinks in a bucket of water… x floats in a bucket of water when coated in butter, x sinks in water if the coating of butter is too thin, x floats in water when powdered… actually x sinks in water when powdered if the experiment is performed in the southern hemisphere… x definitely floats in water when powdered, contrary to eminent southern researchers’ findings, x plus a thick coating of butter actually sinks if the water has sufficient temperature… blah blah blah

]]>You can model all kinds of things, such as systematically biased sampling, in a frequentist way too. Unfortunately it’s hardly ever done. (If a Bayesian does it, I’m fine with that, but you won’t tell me that the majority of Bayesian analyses that we see published do this in appropriate ways either.)

]]>Yes! Thank you!

Collecting data until your black box spits out the “right” answer is poor practice. Why not just pick the one sample that creates statistical significance, use that first, and save yourself the trouble of doing all the other ones? :))))

Andrew, it’s strange that you approve of this practice, since it’s likely to lead to your favorite p value, p = 0.049999999999, not to mention tempting people to be selective in which values they add to the model.

]]>The thing is that the theory on which the (standard) test is based doesn’t take into account that you make decisions to gather more data conditionally on what happened earlier, which may bias the p-value. There’s sequential analysis that does take these things into account.

The analogous problem in Bayesian statistics is that if you do optional stopping in a Bayesian analysis and don’t model it (which as a Bayesian you don’t have to), then, assuming there is a true underlying model, one can show that this can increase your chances of making wrong decisions, i.e., of increasing the posterior probability of something wrong, compared with an analysis with a fixed rule for how many observations you gather, just the same as if you’re computing p-values ignoring optional stopping. Chances are this has been mentioned/papers have been linked elsewhere in this thread that show this (I haven’t followed links).
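A quick simulation (my sketch, not from any paper linked in the thread) shows the frequentist side of this: under a true null, testing after every new batch and stopping at the first p < 0.05 inflates the type I error rate well beyond the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def rejects_with_peeking(n_start=10, n_max=100, batch=5):
    x = list(rng.standard_normal(n_start))     # null is true: mean really is 0
    while len(x) <= n_max:
        if stats.ttest_1samp(x, 0.0).pvalue < 0.05:
            return True                        # stop early, declare "significant"
        x.extend(rng.standard_normal(batch))   # otherwise gather another batch
    return False

n_sims = 2000
rate = sum(rejects_with_peeking() for _ in range(n_sims)) / n_sims
print(rate)    # substantially above the nominal 0.05
```

Sequential analysis fixes this on the frequentist side by adjusting the thresholds for the number of looks taken at the data.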

Now of course a Bayesian could say, why should we be interested in what happens if we assume a true frequentist model in which we don’t believe anyway, but it’s how we can analyse the performance of methods, by setting up artificial models and see whether what happens there makes good sense. (Chances are I don’t have to tell you this Andrew, but other readers of this blog…;-)

]]>the question was: does a representative sample matter? The answer is: yes.

]]>“the isotope ratio of Ru in minerals from different places cannot be exactly equal to an infinite number of decimal places.”

“cannot” or “probably isn’t”?

“you usually don’t care about the probability that you would have gotten the data that you saw, if the true effect were zero. “

No, I don’t agree with that.

]]>“The question that is addressed is whether the data can *distinguish* what has happened from a model in which the true effect is 0.0000.”

Perfect! That’s what I understood but was having a hard time expressing clearly.

“Surely if the answer is “no”, the data can’t be used as evidence for anything else, and that’s of interest and relevant in a lot of cases.”

“if the answer is “yes”, this doesn’t necessarily mean that something substantially meaningful is going on”

Badabing and Badaboom.

]]>