the problem is in the interpretation. the correct logical statement from that is:

“So, either you aren’t sampling from a random number generator, or you are and it doesn’t have the properties you used in your test.”

Since we know you aren’t sampling from an RNG to begin with… well…

]]>Is that really the answer that you want to give to the “how should she proceed” question in this (literally) cartoonish but unfortunately all-too-real example?

Say that she does a larger “green jelly beans” study. Sadly she cannot find evidence for the effect of green jelly beans anymore…

But looking at the data in detail, it favours the idea that green jelly beans are linked with acne in female smokers with no children.

She is cautious about making a firm conclusion, of course, but there is reason to suppose that there is an effect.

She should design an experiment to test just that hypothesis.

Etc.

]]>A1] the data provides evidence of the effect on acne of green jelly beans

A2] the data provides evidence of the effect on acne of jelly beans of a particular color, namely green

A3] the data provides evidence of the effect on acne of jelly beans of a particular color

Bob finds that the data doesn’t provide evidence of the effect on acne of jelly beans of any particular color.

Mind you, I do understand that they are doing different tests. But if the goal is to make the (never defined) “evidential meaning” of p-values understandable to non-statisticians the whole thing remains a bit confusing. It seems to me that you need some “non-standard” concept of “strength of evidence”, because in another context such a contradiction wouldn’t be dismissed so easily.

Imagine that looking at the same body of evidence, Detective Anderson claims that there is evidence that Mr. X murdered Epstein while Detective Brown claims that there is no evidence that Epstein was murdered by anyone. We would say that there is a conflict, or at least that they don’t use the concept of “evidence” consistently.

]]>Alice cannot say that she has evidence against Bob’s hypothesis because she did not test that hypothesis. She might say, however, that Bob’s hypothesis is not very interesting. And she might point out that if Bob’s hypothesis were of interest then Bob’s experiment would have needed far larger samples than hers because his approach to multiple tests costs a lot of power.

The relevant inferential question is how Carol should proceed. She should recognise that the probability that she would find evidence favouring an effect of at least one of the colours was quite high and so she should be cautious about making a firm conclusion on the basis of just those experimental results. However, she did find that there is reason to suppose that the green jelly beans had an effect and so she should design an experiment to test just that hypothesis. (Or publish the lot as a preliminary study.)
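The “quite high” probability can be made concrete. Assuming the 20 colours from the cartoon are tested independently at the 0.05 level and every null is true:

```python
# Chance that at least one of 20 independent tests at alpha = 0.05
# crosses the threshold when every null hypothesis is true.
alpha, k = 0.05, 20
p_at_least_one = 1 - (1 - alpha) ** k
print(round(p_at_least_one, 4))  # 0.6415
```

So roughly a two-in-three chance of an apparently positive colour even with no real effect anywhere, which is why Carol should be cautious.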

What should Bob do? He should read my papers and see how badly served he is by Neyman-Pearsonian statistics!

]]>This is how I define the likelihood principle (from https://arxiv.org/abs/1507.08394):

Two likelihood functions which are proportional to each other over the whole of their parameter spaces have the same evidential content. Thus, within the framework of a statistical model, all of the information provided by observed data concerning the relative merits of the possible values of a parameter of interest is contained in the likelihood function of that parameter based on the observed data. The degree to which the data support one parameter value relative to another on the same likelihood function is given by the ratio of the likelihoods of those parameter values.

Notice how recognition of the role of the statistical model and the restricted scope of the likelihood function (“within the framework of a statistical model”) precludes the false notion that _only_ the likelihood function should be taken into account when making inferences. Statistical models cannot capture all of the information relevant to an inference, except in trivial model cases, and a likelihood function cannot be any better or more relevant than the statistical model from which it comes.
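The proportionality point is often illustrated with the classic binomial versus negative-binomial case (a standard textbook example, not taken from the linked paper): the same 7 successes and 5 failures give likelihoods that differ only by a constant factor, so every likelihood ratio between parameter values agrees.

```python
from math import comb

# Same data -- 7 successes, 5 failures -- under two stopping rules.
# Binomial: fix n = 12 trials.
# Negative binomial: sample until the 5th failure (so the last
# observation is a failure and the first 11 hold the 7 successes).
def lik_binomial(theta):
    return comb(12, 7) * theta**7 * (1 - theta)**5

def lik_negbinomial(theta):
    return comb(11, 7) * theta**7 * (1 - theta)**5

r_binom = lik_binomial(0.7) / lik_binomial(0.5)
r_negbin = lik_negbinomial(0.7) / lik_negbinomial(0.5)
print(r_binom, r_negbin)  # identical: the binomial coefficients cancel
```

The two experiments would give different p-values for the same data, but under the likelihood principle as stated above their evidential content is the same.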

]]>Correction: it’s actually p=0.57. Not very different, but maybe crossing the 0.5 equator is relevant for the evidential interpretation.

(In my previous comment I had rounded it from 0.43 to 0.45, but I didn’t realize I was looking at it from the wrong side.)

]]>You say that the “evidential meaning” of the data doesn’t depend on the experimental protocol, only on the data and on a model that doesn’t have to change when the design of the experiment changes. That would make more sense, I think, if you were embracing the likelihood principle and saying that the inference has to be based on the likelihood function only. But given that you say that inference has to take into account those details, I don’t see what you gain by claiming that p-values give the “evidential meaning” of the data/model when no inference is possible without additional evidence/data/assumptions. You split the evidence into local/global and you still need to report and take into consideration every piece.

Is that “local” evidence useful by itself? What would you say is “the meaning conveyed by an observed P-value of 0.002”? As far as I can see, you don’t give an answer beyond saying that it could have the “usual” interpretation of unlikely-things-happening if the model is correct but could also be low because the model is not applicable. But once the model is not applicable, does it mean anything concrete? If there is no way to tell if a value is low or high, what’s the utility?

> If you agreed with the section in my paper regarding the XKCD jelly beans example then you should be able to understand the optional stopping problem.

“For example, in the case of the cartoon, the evidence in the data favour the idea that green jelly beans are linked with acne (and if we had an exact P-value then we could specify the strength of favouring) but because the data were obtained by a method with a substantial false positive error rate we should be somewhat reluctant to take that evidence at face value.”

I agree that we shouldn’t take “that evidence” at face value, because the “strength of evidence” is weakened by the multiple comparisons even though they are irrelevant for the calculation of the p-value according to the model that ignores them.

Say the original study had used 20 different colors of jelly beans but looked only at the aggregate (non) effect.

They published the data and two researchers A and B looked at the subgroups. The standardized effects are:

-2.052 -1.331 -1.266 -1.203 -1.062 -0.970 -0.928 -0.599 -0.594 -0.431 0.089 0.302 0.535 0.576 0.628 0.685 0.946 1.081 1.354 1.553

Alice calculates p-values using independent models, with the null hypothesis being no effect for one color, because it’s obvious that the existence of people taking yellow jelly beans doesn’t change the data for those who took blue jelly beans.

She finds evidence for a negative effect (say that the subjects have acne and it can get better or worse) of green jelly beans: p=0.0402; the evidence for other effects is less strong (p>0.1).

Bob considers a model where the null hypothesis is that the effect is zero for every color. This means that if any of the independent null hypotheses considered by his colleague is false, then this one is also false.

He calculates a p-value using as statistic the largest absolute value of the observed effects: p=0.45. The evidence against the null hypothesis that the effect is zero for every color is extremely weak, if any.
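For what it’s worth, both numbers can be reproduced from the listed effects (a sketch: I’m assuming the standardized effects are z-scores under a standard normal null, that the 20 tests are independent, and I use the analytic form of the max-statistic null distribution rather than simulation, which lands near the corrected p≈0.57 mentioned elsewhere in the thread):

```python
from math import erf, sqrt

def phi(z):  # standard normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

effects = [-2.052, -1.331, -1.266, -1.203, -1.062, -0.970, -0.928,
           -0.599, -0.594, -0.431, 0.089, 0.302, 0.535, 0.576,
           0.628, 0.685, 0.946, 1.081, 1.354, 1.553]

# Alice: a separate two-sided test per colour.
p_alice = [2 * phi(-abs(z)) for z in effects]
p_green = min(p_alice)           # ~0.0402, from z = -2.052

# Bob: null distribution of max |z| over 20 independent standard
# normals: P(max |z| >= m) = 1 - (1 - p_single)^20.
p_bob = 1 - (1 - p_green) ** 20  # ~0.56
print(round(p_green, 4), round(p_bob, 2))
```

The same observed extreme effect yields p=0.0402 or p≈0.56 depending only on which null the model encodes.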

What is the evidential interpretation of the p=0.0402 obtained by Alice? What’s the evidential interpretation of the p=0.45 obtained by Bob?

Does the data collected in the original study favour the idea that green jelly beans are linked with acne?

What is the “strength of favouring” given that the exact p-value is 0.0402?

Can Alice say that she has “strong” evidence against Bob’s null hypothesis? (If green jelly beans have an effect, Bob’s null hypothesis is also false.)

(I agree that statistics is difficult but I don’t think that all p-values are equal: I find that Bob’s are better than Alice’s.)

]]>It is easy to say that the models should take all of the sampling rules into account, but not so easy to condition on the sampling rule and the actual (i.e. observed) sample size at the same time. I think that you are suggesting that the situation calls for a model that cannot simultaneously have all of the properties that you would need.

An equivalent issue arises when the results from multiple tests are ‘corrected for multiplicity’. The ‘corrected’ tests involve a different null hypothesis from the null hypotheses of the individual significance tests. If you agreed with the section in my paper regarding the XKCD jelly beans example then you should be able to understand the optional stopping problem.

]]>To extract the _evidential meaning_ of the data we use a model which ignores a number of things about the experiment, so the differences between that model and the experimental details become irrelevant for the calculation of p-values. However, I still need to take those details into account for inference, because that _evidential meaning_ may not be so meaningful by itself.

The model could be applicable, the sampling distribution used to calculate the p-values could be correct, the p-value could have a clear inferential meaning. If I understand your position, a correct understanding of p-values means forgetting about that and assuming that different models are to be used for _evidential considerations_ and for inference from that evidence.

]]>How can the use of a less flawed / more applicable model, if available, be a bad thing? If no better model is available, I understand that you may want to use whatever you have, but saying that these p-values calculated according to a flawed model are fine when “correctly understood” is a stretch.

I don’t think that distinguishing p-values calculated according to an applicable model and p-values calculated according to an inapplicable model is a silly thing to do.

]]>The actual goal of Frequentist inference *in my opinion* is to find out things about the real world, and to do so by doing calculations on problems that mimic reality as an approximation. The idea behind saying “probability is the frequency in large repetitions” is that then… you can make a 1-1, at least approximate, correspondence between an RNG computational algorithm and what you think will happen in collecting real world data.

Unfortunately, this is where Frequentism usually goes wrong: it’s used in huge numbers of contexts where validating the choice of distribution against a data-set is impossible. So, it relies on inference on things like averages etc., where CLT-type results are available, so you can choose a distribution independent of the observed shape of the individual data points… but the problem with that is you get only *one* data point from, for example, the sampling distribution of the average.

however, there’s nothing *mathematically* wrong with calculating a p value for a model that doesn’t mimic reality, it just doesn’t tell you what the usual Frequentist statistician wants to know: the probability of making a *real world* error.

In some sense this possibility is normally not acknowledged in the typical NHST rubric, rather, it’s assumed that the statistical model is adequately describing reality *first*, and all that is needed is a parameter value. Then people reject the null parameter value, and normally immediately falsely conclude that the parameter value is close to the maximum likelihood value or some other estimator value, while never even considering the idea that the model is totally inadequate in the first place.

They ignore the 3rd option you mention above “the statistical model is flawed or inapplicable” and this is *almost always* the correct option in typical application of NHST because people who do adequate models have a tendency to be using Bayes to fit them.

In fact, Bayes is basically a search through model space to see which models are adequate, using likelihood to filter out those that are inadequate.

This is why Anoneuoid always goes on about *test your actual research hypothesis*. The “null hypothesis” is usually just a thing in a book. You don’t care about it, it doesn’t match reality in any way… it’s just a stupid straw man. The guy who wrote the book didn’t have the slightest clue what science you were doing.

But none of that invalidates the *math* of the p value. The p value tells you “hey, this stupid thing you chose to check is not reality” and *that’s all*. It doesn’t tell you “hey this other thing near the MLE is reality” or any of the stuff people want it to tell them.

Anyone who formulates a sufficiently complex model to describe the reality of their experiment, hypothesizes some parameters, and then tests their model against data using NHST and a battery of tests… well they get a silver star… they get a gold star if instead of hypothesizing some point parameters, they acknowledge the possibility for those parameters to have some wiggle room and explore that wiggle room… But then they’re doing Bayes.

Bayes isn’t much different from NHST using likelihood ratio tests + hypothesized wiggle room for parameters. It’s just that given the generality of the applicability, you also get to spend your time specifying a model that matches your understanding of the experiment/observations relatively closely.

It’s not really about running Stan on dumb models + some priors and everything gets better, it’s about getting a tool where you can use “not dumb” models, and suddenly you’re in a different realm where golf putting depends on things like the precision with which people can estimate angles, the length over which the putt has to travel, and the rate at which energy is taken out of the ball by friction and so forth.

“just do bayes” won’t cut it… but “just learn Bayes so you can in theory fit any kind of model, and then learn about how to build good models!” that can cut it.

]]>> interim analyses cannot affect the p-value (when correctly understood). After all, the interim analyses do not influence the data or the null hypothesis.

The p-value also depends on the model, as you discuss extensively at that link. The interim analyses do influence the model. The model should take them into account.

“What is a statistical model?”

“A statistical model is what allows the formation of calibrated statistical inferences and non-trivial probabilistic statements in response to data. The model does that by assigning probabilities to potential arrangements of data.”

You acknowledge that you won’t get calibrated statistical inferences if you ignore the interim analyses: “That sub-optimality should be accounted for in the inferences that made [sic] from the evidence”.

“Consider the meaning conveyed by an observed P-value of 0.002. It indicates that the data are strange or unusual compared to the expectations of the statistical model when the parameter of interest is set to the value specified by the null hypothesis. The statistical model expects a P-value of, say, 0.002 to occur only 2 times out of a thousand on average when the null is true. If such a P-value is observed then one of these situations has arisen:

• a two in a thousand accident of random sampling has occurred;

• the null hypothesised parameter value is not close to the true value;

• the statistical model is flawed or inapplicable because one or more of the assumptions underlying its application are erroneous.”

In the latter case, I don’t think it makes much sense to use a model known to be inapplicable to calculate a p-value that no longer has a clear meaning.

]]>Ideally a scientist would be able to make the best possible use of statistics as a relatively reliable step towards making optimal scientific inferences. In the real world a scientist has to be able to do a good enough job with statistical inference that it does not substantially detract from the scientific inferences. (Many, many scientists will have to up their game to meet the standards of that second sentence!)

]]>A flow chart or standardized procedures for data analysis would be like writing an SOP to drive a car. It’s nice to know that at a stop sign one should brake the car to a halt, but it’s also a good idea to brake the car to a halt when a child steps out into the street, or a dog, or if smoke starts coming out from under the hood, or when it starts raining too hard to see, etc. At some point you need to think for yourself and apply the skills you know in a common sense and reasonably intelligent way.

Martha Smith puts this better than I have in the comments below.

]]>Now you’re doing Bayesian analysis. Specifically, you’re using your scientific theory to build a generative model of the data collection process. That’s all the fanatics like Anoneuoid and I ask for ;-) Just actually think about your science and build a model of what happens, and then check that against data! In doing so, incorporate as many of the known or hypothesized effects and biases as you can…

As soon as you bother to do that, you’ll have questions other than “does this or does this not equal 0”. No one really cares about whether parameters equal zero in real models, because as soon as you build the model in your head, the parameters take on *meaning* whereas when you’re push-buttoning the “pocket calculator” version of statistics the parameters have no meaning, you just want to know if they are or aren’t greater than 0 or whatever.

But don’t you think it matters whether say your ability to measure how much effort people put into avoiding hurricanes is dominated by the variation caused by the male/female name of the hurricane, or by the extremely poor data that you have on what the reported locations of landfall were and how that was communicated so that people knew whether they were or were not in danger?

I mean, that’s actual scientific knowledge there: communication about hurricane risk affects people’s perception of danger and choice of response… and perhaps the hurricane’s name affects the communications, or perhaps it affects the response, maybe both…

]]>the idea however that you can somehow create a canned procedure which need not know anything about the physical process or social science process or literary process or whatever and can still magically tell you what your data means… that’s extremely problematic.

]]>Do you disagree that those are important to using an oscilloscope, and are separate from design of the instrument itself?

]]>This is actually a fine application of hypothesis testing: the reason a thing is considered unusual is specifically that it was rare (had low frequency) in the past.

The problem is, most hypothesis testing isn’t like that at all.

1) it’s rare to compare to a large database of data; instead some stand-in distribution is used without ever checking whether it adequately represents the “real” distribution.

2) Another strategy is that only a single parameter is compared, say the mean, using an asymptotic limit theorem to define the sampling distribution in a way that’s independent of what the data look like (as long as they have a mean and standard deviation…). But that two distributions have the same / different mean is often not sufficient to establish much.

3) Even if the difference in means is meaningful, people often throw away that information and make decisions based only on whether something is “significant vs not significant”… i.e. they decide on a single binary digit of information (present/absent). But different-sized effects can’t all have the same theoretical use.

4) When first filtering on significant/nonsignificant and then looking at effect size, the effect size is dramatically biased upwards in magnitude. Failure to replicate a similar magnitude effect is our expectation from this kind of research.

So, there are multiple issues at play. The problem with p values isn’t the p value, it’s the failure to properly align the research question with the mathematical question.
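Point 4 above is easy to demonstrate by simulation (a sketch with made-up numbers: a small true standardized effect of 0.2, a two-sided significance filter at |z| > 1.96):

```python
import random

random.seed(1)
true_effect = 0.2  # hypothetical small true standardized effect
zs = [random.gauss(true_effect, 1.0) for _ in range(100_000)]

# Keep only the "significant" results, then look at their magnitude.
significant = [z for z in zs if abs(z) > 1.96]
mean_mag = sum(abs(z) for z in significant) / len(significant)
print(round(mean_mag, 2))  # around 2.3 -- an order of magnitude above 0.2
```

A replication powered to detect the published (filtered) magnitude will routinely “fail”, exactly as the comment says.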

]]>I am in violent agreement! I would never approach anything that way. But that is a standard approach in psychology, at least according to the practitioner I know. He would say that you are not trying to understand buoyancy, that is impossibly hard, so all you can do is establish whether a coating of butter has an effect in isolation. But I am with you, I wouldn’t feel comfortable with anything about a butter coating if I did not have an adequate understanding of the concept of buoyancy.

]]>Sure, but isn’t the question whether the baby should go out with the bathwater, that is, whether misuse of p values means we should end significance testing?

What if you make your null as robust/severe/extreme as you can? It seems that if you exhaustively list every potential source of bias in the data and fully characterize them into estimations of error, and then sum those into a global estimate of error, you could build a null from that, right?

Suppose I am interested in himmicanes and I decide that after working through all the possible sources of bias that I can dream up or find in the literature, I discover that 97% of the potential error is caused by two factors, the magnitude of the hurricane and the percent that flee. After estimating those biases and establishing error bars, my model shows that any value between [people spend 5% more on female-named hurricanes] and [people spend 18% more on male-named hurricanes] is indistinguishable from “no effect.” Now I back-calculate from the worst case error, the 18%, that to get a p of 0.05, I need a result greater than [people spend 68% more on male-named hurricanes] to show a statistically significant result based purely upon the gender of the hurricane name. I don’t really care how big the effect is! At this point, I finally interrogate the raw data and discover that it shows [people spend 74% more on male-named hurricanes]. Can I declare victory? I am curious what problem this approach would fail to prevent, and if so, what would work better in a simple example like this.

]]>ha ha, yeah, you’re right, but you still know what one is right? :)

]]>FWIW I’m not a statistician or medical researcher, just a geologist :) But it seems obvious to me that, properly applied, this approach should be suitable to test for the efficacy of a treatment at some level of accuracy or precision.

All that’s going on in a test like this is comparing the distribution of measurements from treatment samples to the distribution of measurements from an ideal set of samples in which there is no effect. The distribution of measurements in which there is no effect is presumably normal, thus the idea that it’s a “random number generator”. I was confused by that for a while but I think that’s what Daniel and Phil are on about. A random number generator would produce a normal distribution.

That being said the method is only as good as the data and P = 0.05 isn’t absolute proof of efficacy, it’s the *odds* that the difference in measurement distributions result from chance. So it’s an indirect comparison with certain odds of failure even if everything else is perfect.
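The comparison described above can also be done without assuming any distributional shape, via a permutation test: pool the two groups, reshuffle the labels many times, and see how often a difference in means as large as the observed one arises by chance (a sketch with made-up measurement values):

```python
import random

random.seed(0)
control   = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1]   # hypothetical
treatment = [11.0, 11.4, 10.9, 11.2, 11.5, 11.1, 10.8, 11.3] # hypothetical

observed = sum(treatment) / len(treatment) - sum(control) / len(control)
pooled = control + treatment
n_c = len(control)

count, n_perm = 0, 5000
for _ in range(n_perm):
    random.shuffle(pooled)  # random relabelling of the 16 measurements
    diff = (sum(pooled[n_c:]) / (len(pooled) - n_c)
            - sum(pooled[:n_c]) / n_c)
    if abs(diff) >= abs(observed):
        count += 1

p = count / n_perm
print(p)  # tiny here: the two groups barely overlap
```

This respects the observed data directly instead of standing in a theoretical distribution for it.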

]]>I don’t know why it would be. Physicists use a set of “canned” or standardized procedures to find and verify exoplanets. Geologists use a standard set of procedures to identify and process specimens for isotopic ages. Drug researchers use a standard set of procedures to extract natural chemicals from plants and test them for efficacy in the lab.

Everything in science is done by standardized procedures. If you’re not using one, you’re making one up to get other people to use. The procedures aren’t the problem. The problem is the people who don’t follow procedures but claim they did. A lot of that is not confirming that the assumptions required for the procedure to work are actually met.

]]>I see Frequentism as a very definite ontological theory: the frequency is a thing in the world that can in principle be verified by taking a large sample of observable things, and probability calculations are acceptable to the extent that they approximate the real world frequencies in a reasonably accurate way.

I have absolutely no problem with “if such and such random process were to occur, it would produce observed data or more extreme as often as p = 0.0021” that’s a perfectly valid mathematical calculation. The problem is usually either:

1) It isn’t anything validated about the world, so it doesn’t correspond to the Frequentism theory mentioned above, so we need a test of goodness of fit or a proof of an appropriate asymptotic result or something before we can call it Frequentist.

or

2) It is a purely hypothetical calculation, in which case it is actually a (usually quite poor) Bayesian model, and I ask “why did you check that particular model using that particular likelihood?”. In particular, when used with p = 0.05 threshold, it kind of corresponds to a strange ABC Bayesian model, in which your summary statistic is p

You can create a Bayesian model as follows, using ABC methodology:

1) Define a stochastic RNG sampling model, like normal(m,s) or even some complex “random effects” model like linear regression with iid errors and individual slopes and intercepts per person or per school etc.

2) Let the prior for the parameters be uniform over [-MaxFloat,MaxFloat] for all parameters (or, if a parameter must be positive, then [0,MaxFloat], or similarly for negative, or uniform between two logical bounding values, maybe [0,1], etc.)

3) Generate a random parameter vector from the prior

4) using the parameter vector calculate the p value for the test statistic function applied to the data under that parameter + sampling model (basically generate a data set, calculate the test statistic, repeat, then find out what the quantile of the real data test statistic is in this big database)

5) keep the parameter vector if p > 0.05, reject the parameter vector if p < 0.05

6) goto 3 until a sufficient sample is collected.

Is this a Bayesian model or not? I suggest that to the extent that you haven't tested the sampling distribution against the data for goodness of fit, and to the extent that you actually hypothesize this uniform prior which can't possibly be a real sampling distribution of anything, it's actually a Bayesian procedure and an obviously non-optimal one at that.
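For concreteness, here is a minimal sketch of that procedure for a normal(m, s) sampling model with the sample mean as the test statistic (the data, bounds, and acceptance count are all made up; an analytic p-value stands in for the simulated test-statistic database of step 4):

```python
import random
from math import erf, sqrt

def phi(z):  # standard normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

random.seed(42)
data = [random.gauss(3.0, 1.0) for _ in range(50)]  # stand-in data
n = len(data)
xbar = sum(data) / n

accepted = []
while len(accepted) < 200:
    # steps 2-3: draw (m, s) from a wide uniform "prior"
    m = random.uniform(-10, 10)
    s = random.uniform(0.01, 10)
    # step 4: two-sided p-value of the observed mean under normal(m, s),
    # where the sample mean has standard deviation s / sqrt(n)
    z = (xbar - m) / (s / sqrt(n))
    p = 2 * phi(-abs(z))
    # step 5: keep the parameter vector only if p > 0.05
    if p > 0.05:
        accepted.append((m, s))

means = [m for m, s in accepted]
print(min(means), max(means))  # accepted m cluster loosely around xbar
```

Note how large s values let m wander far from the data while still passing the filter, which is one concrete sense in which this implicit “posterior” is obviously non-optimal.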

]]>Specifically, a Frequentist model should be one in which everything that has probability assigned to it should be a verifiable, observable outcome which can be repeated and varies from one experiment/observation to another.

From wikipedia which I think agrees with most of everything I’ve read: “The relative frequency of occurrence of an event, observed in a number of repetitions of the experiment, is a measure of the probability of that event. This is the core conception of probability in the frequentist interpretation. “

also “Probabilities can be found (in principle) by a repeatable objective process (and are thus ideally devoid of opinion).”

So “observed in a number of repetitions” is essential, and probabilities are verifiable facts about the world, not about our assumptions. This is why in “good” frequentist practice like what’s taught in “Statistical Analysis and Data Display” (Heiberger and Holland) you are told how to run things like Anderson-Darling tests or Kolmogorov-Smirnov tests or do transformations to make your data more normally distributed before running your ANOVA or whatever. The idea is that whatever calculation you’re doing is an approximation of some “real, true” probability distribution and you should make sure the distributional assumptions map at least approximately onto the verifiable facts of the world. This is also why there are so many nonparametric tests and goodness of fit tests and things in such books because they use mathematical tricks to incorporate these transformations into the tests.

This is why I say in most cases random effects models are Bayesian models in which people just avoid thinking about the prior explicitly. Somewhere deep in these mixed models is the idea that some parameter (like the mean) that describes a distribution over verifiable outcomes itself has a distribution placed over it and each sub-group has a different realization of this parameter value.

But the interpretation of the distribution over the parameters cannot be anything but Bayesian since the parameters themselves are not observable nor can they be found by “a repeatable objective process” nor is there any way to verify it.

You *might* argue that with enough data on each subgroup, the sample mean is close enough to the real mean, and then the observed distribution of the sample means across the groups forms the sampling distribution of the parameter, and then you could adjust the assumption of the hierarchical distribution over parameters to match the shape of the “observed” distribution over people.

I have never ONCE seen that done.

]]>I actually came across your article last week and was very impressed.

I do believe, however, that for a number of reasons your argument only applies to one-sided p values against a range null, not a two-sided p value against a point null.

(Also, if we’re discussing p values, where is Anoneuoid?)

]]>I agree that the term “representative sample” is vague and depends on context, but I disagree that there’s no such thing.

To put it another way, every real-world sample is nonrepresentative. It is important in data collection to control the nonrepresentativeness, and it is important in analysis to recognize and adjust for the nonrepresentativeness.

]]>There’s nothing circular or absurd about that. You don’t have to actually use a calibrated RNG to collect your sample, but if it isn’t “representative” of what you might have gotten from an RNG it isn’t a representative sample.

]]>a representative sample doesn’t matter for the mathematical truth, but of course does matter for the statement about the world.

]]>it’s this attempt to make frequency a physical fact that galls me about frequentism; I think it’s also what irritated Jaynes, his mind projection fallacy for example. Mayo is on record here saying that statistical claims can’t just be “conditional on some assumed process… probability of this thing occurring is p” but rather should be unconditional claims about what actually does happen. To me that’s the essence of the conflict. I have no problem with Keith’s favored interpretation of Bayes as testing hypothetical RNGs to see which subset are sufficiently lifelike.

]]>Often people count maximum likelihood random effects models as Frequentist because they don’t specify an explicit prior, but I don’t think they are… they apply probability to parameters without parameters being an observable quantity that can have repetitions, or a verifiable physical distributional shape. They are just a poor man’s Bayes for people afraid that explicit priors will get their papers rejected.

of course you can bootstrap things rather a lot, and people do. And you can do lots of permutation testing and placebo treatments and try to discover things that way… it’s not totally hopeless; at least this kind of thing respects the observed distribution of data.

I’m beginning to like Keith’s explanation of Bayes as filtering out which random number generator models could/couldn’t have generated the data using likelihood as the filter against the prior as the proposal. it is a unifying concept that seems intuitive to explain to beginners.

]]>Imagine if people tried to understand buoyancy that way, we would have a hopeless mishmash of thousands of papers: x sinks in bucket of water… x floats in bucket of water when coated in butter, x sinks in water if coating of butter is too thin, x floats in water when powdered… actually x sinks in water when powdered if experiment performed in southern hemisphere… x definitely floats in water when powdered contrary to eminent southern researcher’s findings, x plus thick coating of butter actually sinks if water has sufficient temperature… blah blah blah

]]>Collecting data until your black box spits out the “right” answer is poor practice. Why not just pick the one sample that creates statistical significance and use that first and save yourself the trouble of doing all the other ones? :))))

Andrew, it’s strange that you approve of this practice, since it’s likely to lead to your favorite P value, P = 0.049999999999, not to mention tempting people to be selective in which values they add to the model.

]]>The analogous problem in Bayesian statistics is this: if you do optional stopping in a Bayesian analysis and don’t model it (which as a Bayesian you don’t have to), then, assuming there is a true underlying model, one can show that this increases your chances of making wrong decisions, i.e., of raising the posterior probability of something wrong, compared with an analysis in which you have a fixed rule for how many observations you gather, just the same as if you’re computing p-values while ignoring optional stopping. Chances are this has been mentioned/papers have been linked elsewhere in this thread that show this (I haven’t followed links).

Now of course a Bayesian could say, why should we be interested in what happens if we assume a true frequentist model in which we don’t believe anyway, but it’s how we can analyse the performance of methods, by setting up artificial models and see whether what happens there makes good sense. (Chances are I don’t have to tell you this Andrew, but other readers of this blog…;-)
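That inflation is easy to see in a toy simulation (a sketch under assumed conditions: true mean exactly 0, known variance 1, a flat prior on the mean, stopping as soon as the posterior probability that mu > 0 exceeds 0.95, with at most 100 observations):

```python
import random
from math import sqrt

random.seed(7)

def confident_wrong(max_n, optional):
    """True if we end up >95% posterior-sure that mu > 0,
    even though the data really come from mu = 0."""
    total, n = 0.0, 0
    while n < max_n:
        total += random.gauss(0.0, 1.0)
        n += 1
        # flat prior, known sigma = 1: posterior mu ~ N(xbar, 1/n),
        # so P(mu > 0 | data) > 0.95 iff xbar * sqrt(n) > 1.645
        if optional and (total / n) * sqrt(n) > 1.645:
            return True
    return (total / n) * sqrt(n) > 1.645

sims = 2000
rate_fixed = sum(confident_wrong(100, False) for _ in range(sims)) / sims
rate_opt   = sum(confident_wrong(100, True)  for _ in range(sims)) / sims
print(rate_fixed, rate_opt)  # optional stopping inflates the error rate
```

With a fixed n the rate of confidently wrong conclusions sits near 5%; checking after every observation and stopping at the first crossing pushes it several times higher, under this artificial true-model evaluation.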

]]>“cannot” or “probably isn’t”?

“you usually don’t care about the probability that you would have gotten the data that you saw, if the true effect were zero. “

No, I don’t agree with that.

]]>Perfect! That’s what I understood but was having a hard time expressing clearly.

“Surely if the answer is “no”, the data can’t be used as evidence for anything else, and that’s of interest and relevant in a lot of cases.”

“if the answer is “yes”, this doesn’t necessarily mean that something substantially meaningful is going on”

Badabing and Badaboom.

]]>