Well, I wrote before that in principle non-rejection is the stronger result because it clearly shows that a non-clustering model is compatible with the data, so I’m OK with you being skeptical about the implications of rejection.

However, you should realise that I’m making a weaker statement than “I have evidence in favour of another (specific) model”. I’m just saying that there is significantly more clustering in the data (as measured by T) than expected under M. If you are happy to call *just this* Hcluster, it’s Hcluster indeed, but this Hcluster is not a very strong or specific statement. There is nothing to prove here; I’m just explaining how I interpret what was shown.

OK, this is going round in circles now.

What would it mean to you to say that a clustering is “real”, and how would you go about showing it?

Christian,

“I just don’t subscribe to the dogma that that’s the only way of doing this, and there may be good reasons to take another way.”

To be clear, I am not so much interested in the best way to handle this particular case; I am trying to figure out why people seem to think that disproving one model provides evidence in favor of another (Inf-1 = Inf). It became clear to me in my own work that the usual “model” being rejected, that “two groups of animals are exactly the same on average”, was not helpful to me in any way, because there will always be differences at baseline and “lurking variables” that arise during the course of the study. It is at best a spurious step on top of estimating effect sizes. At worst it has been a source of widespread confusion.

Your use of a null model is different in some ways, but I still do not understand why you think “rejecting” it has helped you. By rejecting, you are implying that some sort of deduction can be made from the premise that H0 is false; I can see no other purpose to rejecting. I can, however, see a purpose in comparing the model to the data in order to 1) make predictions and 2) compare it to other models. The usefulness of your paper is to provide a null model that future models should be compared to.

You write: “What is important is the *direction* of clustering formalised in my T”. I interpret this as “if H0 is false (or a bad fit or whatever) then clustering is more likely”. Please make this into the form of a proof as I suggested above so that I can understand:

“You are doing something like the following:

If H0 then p>=0.05

p<0.05 therefore ~H0

…

~H0 AND (no one bothered to try anything else) therefore Hcluster"

This was a reply to “question” above.

I’m not in principle against having a more comprehensive model including clustering and comparing the models. I just don’t subscribe to the dogma that that’s the only way of doing this, and there may be good reasons to take another way.

Oh, and let me reiterate: I don’t reject the “perfection” of the null model and you’re right, I don’t believe it literally anyway. What is important is the *direction* of clustering formalised in my T, against which I’m rejecting the model.

The statistic T is a better formalisation of what we’re interested in in this application than the likelihood ratio of two specific models.

Christian,

I meant no offense by the “fail” terminology; I don’t think your attempts to deal with this data were a fail at all.

I just really don’t understand your justification for rejecting/disproving the null model in order to claim something about a different model. It seems to me you should not do this, because there are reasons other than the presence of clusters for the model to be rejected. It is very straightforward.

“We have proposed a null model that formalises what some biogeographers were saying informally (it’s up to them to decide whether we succeeded in this), so rejection should tell them something. Some others were mentioning that they expect some kind of clustering – and that was the *direction* in which our rejection points.”

I mostly agree with this. Your model is not a strawman, it appears to be a good attempt at describing the data and a good tool for exploring it. My problem is 100% with this “rejection” step. No one expects it to be perfect to begin with so rejecting the perfection of the model does not seem capable of providing useful information. Checking how good the fit is, however, does provide useful information if compared to other models.

The way I see it there should be two models, one with clusters and one without. They should be compared by p-value, AIC, BIC, whatever. The deviations from the models should be explored (as you did in your paper) to see if anyone comes up with an idea to improve them. Then some non-algorithmic thoughts run through people’s brains and they decide (informally, as recommended by Fisher) which model seems best or whether more data is necessary.
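As a toy sketch of this two-model comparison (nothing here is from the paper under discussion: the “no clusters” model is a single Gaussian and the “clusters” model a two-component mixture, both invented for illustration), the AIC step could look like:

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def fit_single(x):
    """MLE of a single Gaussian ("no clusters"); 2 free parameters."""
    mu, sigma = x.mean(), x.std()
    return gaussian_logpdf(x, mu, sigma).sum(), 2

def fit_mixture(x, n_iter=200):
    """Crude EM for a two-component 1-D Gaussian mixture ("clusters"); 5 parameters."""
    mu1, mu2 = np.percentile(x, 25), np.percentile(x, 75)
    s1 = s2 = x.std()
    w = 0.5
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each point
        p1 = w * np.exp(gaussian_logpdf(x, mu1, s1))
        p2 = (1 - w) * np.exp(gaussian_logpdf(x, mu2, s2))
        r = p1 / (p1 + p2)
        # M-step: reweighted parameter updates
        w = r.mean()
        mu1 = (r * x).sum() / r.sum()
        mu2 = ((1 - r) * x).sum() / (1 - r).sum()
        s1 = np.sqrt((r * (x - mu1)**2).sum() / r.sum())
        s2 = np.sqrt(((1 - r) * (x - mu2)**2).sum() / (1 - r).sum())
    ll = np.logaddexp(np.log(w) + gaussian_logpdf(x, mu1, s1),
                      np.log(1 - w) + gaussian_logpdf(x, mu2, s2)).sum()
    return ll, 5

def aic(ll, k):
    return 2 * k - 2 * ll

rng = np.random.default_rng(0)
# Clearly bimodal data: two well-separated groups
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
ll0, k0 = fit_single(x)
ll1, k1 = fit_mixture(x)
print("AIC single :", round(aic(ll0, k0), 1))
print("AIC mixture:", round(aic(ll1, k1), 1))
```

Lower AIC wins, and the extra three parameters of the mixture are penalized automatically, so no rejection step is involved anywhere: the two models are simply put side by side.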

I should, regarding the “magical” decision boundary, add that in such parametric bootstrap simulations we cannot, for computational reasons, accurately figure out p-values like 10^{-8}. What I take as really strong evidence against M is not p&lt;0.05, but rather the observed T being some distance away from the most extreme value seen under the null model. With the model and statistics in the paper, we have seen this happen for several datasets, as well as non-significance, so the procedure certainly has some discriminative power.
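The granularity point is simple arithmetic; in this sketch the simulation count B = 9999 and the Gaussian stand-in for the null distribution of T are both made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
B = 9999                      # number of null-model simulations (illustrative)
t_null = rng.normal(size=B)   # stand-in for T simulated under the null model M
t_obs = 8.0                   # stand-in observed statistic, far beyond the null range

# Standard Monte Carlo p-value: bounded below by 1/(B + 1), so p-values
# like 10^{-8} are unreachable no matter how extreme t_obs is.
p = (1 + np.sum(t_null >= t_obs)) / (B + 1)
print("p =", p)
print("distance beyond the null extreme:", round(t_obs - t_null.max(), 2))
```

The gap between the observed T and the most extreme simulated value is therefore the informative quantity once the p-value has hit its floor of 1/(B+1).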

question,

“So… we agree that rejecting the null model fails to achieve your goal of demonstrating any real clustering.” The thing is, what are the standards here? I was making a rather modest statement conceding the limitations of what I have done, and you make a “fail” out of it.

It is true that if I claim “there is a real clustering” because of this, it can be objected that there may exist a model, which I didn’t try, that demonstrates otherwise. However, this is the case with *whatever* model-based approach is taken, because *no* parametric model exhausts the full space of possibilities, and there is no work whatsoever, in any area, that can argue convincingly that all possible models formalising non-clustering (including the model that states that with probability one the data had to look exactly how they look) are ruled out based on the data. Usually such models are just ruled out by assumption, which of course doesn’t restrict reality. So if you think that I failed for this very reason, no success is possible at all.

However, if I can convince a user that the model that I rejected is actually a pretty good attempt to explain the patterns in the data by means other than clustering, this is about as good as it gets. What I can say is that if T measures the amount of clustering and the observed value is too large for what we expect under M, then there is *significantly more clustering in the data than expected under M*. If you want to translate this into an alternative hypothesis: I have rejected M in favour of the class of models under which T can be expected to produce larger values. We could take this as a *definition* of your “Hcluster” – how does that feel? (This certainly depends on whether I can convince a user that the definition of T is appropriate for measuring the kind of clustering they are interested in.)

I don’t mind much about the “magical” 5%. If you want a crisp yes/no-decision, you need a cutoff, but I’m rather happy to say that with p between about 0.01 and 0.07, say, evidence is moderate but not strong, and I know very well that there is no rational reason to defend precise values. (I probably got your initial text on “arbitrary criteria” wrong because I thought that you were not only talking about the significance borderline but also about the test statistic.)

The thing is, the biogeographical theories the exploration of which the work was meant for don’t come as precisely specified statistical models with parameter values. We have proposed a null model that formalises what some biogeographers were saying informally (it’s up to them to decide whether we succeeded in this), so rejection should tell them something. Some others were mentioning that they expect some kind of clustering – and that was the *direction* in which our rejection points.

Something happened to that post…

“however, as long as nobody can give me a model that a) formalises non-clustering and b) fits the data well, I will treat the data as significantly clustered.”

Ignoring the issue of choosing a rejection criterion, what is your reasoning behind (where the tilde ~ = NOT) “~H0 therefore Hcluster”? This is what I mean by writing up some sort of proof. For example, if we had a theory T that predicted parameter A=100, and we then measured A=10, we could write:

If T then A=100

A=10

10 ≠ 100

A ≠ 100, therefore ~T

You are doing something like the following:

If H0 then p>=0.05

p<0.05 therefore ~H0

…

~H0 AND (no one bothered to try anything else) therefore Hcluster

How do we get to that last step? Where does it come from?
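One way to make the gap explicit is to treat the schema above as pure propositional logic (ignoring, as the schema itself does, that p&lt;0.05 only makes ~H0 probable rather than certain) and brute-force the truth tables:

```python
from itertools import product

def entails(premise, conclusion, n_vars):
    """Check semantic entailment by brute-force truth table:
    every assignment satisfying the premise must satisfy the conclusion."""
    return all(conclusion(*vals)
               for vals in product([False, True], repeat=n_vars)
               if premise(*vals))

# Modus tollens: from (H0 -> P) and not-P, infer not-H0.  Valid.
mt = entails(lambda h0, p: ((not h0) or p) and (not p),
             lambda h0, p: not h0, 2)

# The contested last step: from not-H0 alone, infer Hcluster.  Invalid:
# the assignment H0 = False, Hcluster = False is a counterexample.
step = entails(lambda h0, hc: not h0,
               lambda h0, hc: hc, 2)

print("modus tollens valid:", mt)         # True
print("~H0 entails Hcluster:", step)      # False
```

Of course the check only formalises what the schema already says: the last step needs an extra premise (for example, that Hcluster and H0 together exhaust the possibilities) before it goes through.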

Christian,

“However, in order to demonstrate that *any* real clustering is going on, we see whether there is significantly more clustering than what would be expected under M using T as a measure…in case of significance one could still suspect that another non-clustering model could explain the observed amount of clustering”

So… we agree that rejecting the null model fails to achieve your goal of demonstrating any real clustering. Also, earlier you wrote: “Rejection criteria are not arbitrary to me, but have to be related to what you want to find out about.” Yet in this paper you have decided to use the “magical” p=0.05.

p<0.05 therefore ~H0

…

~H0 AND (no one bothered to try anything else) therefore Hcluster

How do we get to that last step? Where does it come from?

question,

the paper you link is the right one. I don’t understand what kind of proof you want. If you have a test statistic T that quantifies the amount of clustering, and you have a model M that can be interpreted as “no clustering, all structure comes from other aspects than clustering such as spatial autocorrelation”, and the value of T in your data does not significantly differ from what is expected in M, I can say that in terms of clustering (as far as it’s measured by T), the data cannot be significantly distinguished from M and are therefore not significantly clustered.

This holds regardless of whether I can find a model that can indeed be interpreted as clustering and fits the data better. Actually, in the paper we were *not* interested in finding a “best model” for the data (you may have been confused by the fact that we did specify an alternative model for power simulations, but for the logic I’m referring to here this is not needed). The clustering that we usually do for such data partitions the dataset based on defining a distance and running an MDS, so we get a clustering, but this does *not* come with a model for the underlying data-generating process. We don’t need the latter if only the clustering itself is of interest. However, in order to demonstrate that *any* real clustering is going on, we check whether there is significantly more clustering than would be expected under M, using T as a measure.
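A schematic version of this procedure, with placeholder choices throughout – the null model here is a plain fitted Gaussian (the real M would have to model spatial autocorrelation and the like, which is not attempted), and T is a simple mean nearest-neighbour distance (small values = more clustering), not the statistic from the paper:

```python
import numpy as np

def mean_nn_distance(points):
    """T: mean nearest-neighbour distance; small values indicate clustering."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)    # ignore self-distances
    return d.min(axis=1).mean()

def bootstrap_test(points, n_sim=499, seed=0):
    """Parametric bootstrap: compare the observed T with T simulated under a
    fitted Gaussian null model (a stand-in for a real non-clustering M)."""
    rng = np.random.default_rng(seed)
    mu, cov = points.mean(axis=0), np.cov(points.T)
    t_obs = mean_nn_distance(points)
    t_sim = np.array([mean_nn_distance(rng.multivariate_normal(mu, cov, len(points)))
                      for _ in range(n_sim)])
    # One-sided in the clustering direction: small T = more clustered than M expects
    p = (1 + np.sum(t_sim <= t_obs)) / (n_sim + 1)
    return t_obs, p

rng = np.random.default_rng(42)
centres = rng.uniform(-5, 5, size=(3, 2))
clustered = np.concatenate([c + 0.1 * rng.normal(size=(30, 2)) for c in centres])
blob = rng.normal(size=(90, 2))          # homogeneous: no clustering

t_c, p_c = bootstrap_test(clustered)
t_b, p_b = bootstrap_test(blob)
print(f"clustered: T = {t_c:.3f}, p = {p_c:.3f}")
print(f"blob:      T = {t_b:.3f}, p = {p_b:.3f}")
```

Note that no clustering model appears anywhere: the test only asks whether the observed T is in the lower (clustering) tail of what the non-clustering null model produces.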

The argument is less strong in the case of significance than in the case of insignificance, because in the case of significance one could still suspect that another non-clustering model could explain the observed amount of clustering, which cannot be ruled out given how many non-clustering models are conceivable; however, as long as nobody can give me a model that a) formalises non-clustering and b) fits the data well, I will treat the data as significantly clustered.

Generally I’m quite happy to avoid relying on restrictive model assumptions for this kind of thing. As long as there is a big set of models that could fit the data well, I don’t mind much whether in terms of likelihood one is better than another. As long as we cannot reject a model, it cannot be ruled out, whether there is a better one or not.

Read Laurie Davies’s work on “Data Features” and “Approximating Data” to get a taste of this kind of philosophy (although I’m not strictly following his ideas), and especially for why he thinks (as do I) that comparing likelihoods is a very questionable tool.

Keith:

That Peirce quote is quite awesome; do you happen to have any kind of source for it? Of course, what you describe would be even more preferable, but I don’t think it’s necessary for a scientist to be *happy* for his own work and theories to be falsified. I think it’s okay to have some kind of division of labor insofar as you usually (try to) falsify other people’s work and not your own. What we really need is acceptance of and openness towards falsification and criticism on the one hand, and empirical (statistical) research that in itself is more akin to falsificationism on the other.

Christian,

Can you write out some kind of proof (or link to such) of how the specific procedure you are talking about is helping you cluster? Possibly give an example of this occurring as well. Your link did not work but now I have accessed this paper which I presume is one: http://www.sciencedirect.com/science/article/pii/S0167947303000914.

As I keep saying, you appear to be talking about comparing the relative merits of different models based on fit/complexity; this is not NHST. One model will always be better or worse than the other based on AIC or whatever score you choose to use. If you are indeed using NHST filtering steps, this is most likely spurious, and this does appear to be going on in Hennig and Hausdorf 2004: you run the simulations once to reject the null model or not, then again to compare the null model with the alternative model. Why not just do the second?

Well, if an apparent clustering turns out to be not significantly more clustered than what can be expected under a certain non-clustering null model, this seems to be very much NHST to me.

“What do you mean by “much better fit”? A significantly better one? (This *has* to do with rejections.) If it’s not significantly better, the data are compatible with the worse one, too.”

This would depend on context and the practical consequences of deviation from the fit. Also, once again, what we appear to be doing in this scenario is comparing the relative merits of multiple models; this is not NHST.

Claim A is of course relative to the test statistic used. What I had in mind here (but did not write down) is a test statistic that formalizes how strong the clustering is, as used for example in C. Hennig and B. Hausdorf: Distance-based parametric bootstrap tests for clustering of species ranges, Computational Statistics and Data Analysis 45 (2004), 875–896, and ftp://ftp.stat.math.ethz.ch/Research-Reports/110.html. Rejection criteria are not arbitrary to me, but have to be related to what you want to find out about. It is true that there may be more than one test statistic worth looking at.

What do you mean by “much better fit”? A significantly better one? (This *has* to do with rejections.) If it’s not significantly better, the data are compatible with the worse one, too.

Claim B is not really a precise claim – what would need to be shown is that there is no model that isn’t interpreted as “clustering” and fits the data so well that it could have generated the data. If I reject just a single one, that’s a rather weak step in that direction. However, if I reject something that takes into account all non-clustering structure we can imagine, that’d be much better (although still no mathematical proof that no other possibility is left).

Certainly I’m not saying which model we should “work” with – I’m interested in sets of models compatible or incompatible with the data in the sense of Laurie Davies’s “Data Features” (1995, Statistica Neerlandica). I’m not interested in ending up with a single “right” model. Also, I’m pretty agnostic about how we should arrive at such models (this would probably need to be qualified when challenged).

You make these two claims:

A) “The informative value is that if the null model is not rejected, it is clear that the given dataset cannot be used to argue that whatever clustering was found is real and meaningful (although of course it doesn’t mean that the H0 is true).”

B) “On the other hand, a significant rejection is the more convincing, the harder the researcher tries and works to find a null model that models the data as well as possible (rejecting a naive Gaussian distribution with very low p is usually not enough).”

Claim A appears to be incorrect. If “model 2” is a much better fit (enough to cancel out any extra complexity), the results can be used to argue that the clustering is real; “rejecting” the null model has nothing to do with it. That the rejection criterion is usually arbitrary should really drive this point home intuitively.

Claim B I also do not think is correct. It talks about a family of “null models”, which is interesting. The implication seems to be that we should choose to work with the simplest model consistent with the data. First, I would say that the origins of the model also play a role (was it derived from first principles, is it totally ad hoc, what assumptions are necessary, etc.). Second, we are once again talking about the relative complexity/accuracy tradeoff rather than trying to reject a model. In this case I think p-values may be useful by indexing likelihoods (at least for simple cases such as the t-test; see Michael Lew’s findings here: http://arxiv.org/abs/1311.0081), but the “rejection” step is not.

If you could attempt to write out your thoughts on this in a more formal fashion perhaps it can help.

According to the discussion given here, this would look “confirmationalist”, because the researcher doesn’t really believe the null model, but rather a clustering alternative. The informative value is that if the null model is not rejected, it is clear that the given dataset cannot be used to argue that whatever clustering was found is real and meaningful (although of course it doesn’t mean that the H0 is true). On the other hand, a significant rejection is the more convincing, the harder the researcher tries and works to find a null model that models the data as well as possible (rejecting a naive Gaussian distribution with very low p is usually not enough).

However, interpreting the result appropriately, it is clear that rejection doesn’t “confirm” the specific clustering that was found in the data by the researcher’s favourite method. It is rather an earlier step and only says that “some kind of clustering is going on here” (or even something else more complex than what was in the null model). Better than nothing (often enough there is indeed no evidence for this), but far away from making a “confirmative” statement about anything; so it isn’t really “confirmationist”, or only confirmationist regarding the quite weak hypothesis that “something is going on”.

To me there seems to be nothing wrong with this, apart from the fact that, as was discussed before, people want to make (and read) stronger statements – and they do, regardless of whether these are justified. People don’t like to admit that what could be found in their data was quite weak (this would probably be easier if everyone else were more modest about their statements as well) and would need to be subjected to much more research, including serious falsification attempts, in order to generate reliable knowledge.

Anonymous:

Please be polite. Regarding your comment: I don’t think there are a lot of perks to being a philosopher; I think philosophers such as Mayo are doing their best to formalize what scientists do. Jaynes is great but there are lots of ways of doing statistics and I value what Mayo does even if I don’t agree with everything she writes. For that matter, I got a lot out of reading Popper and Lakatos, even though neither of them offered any methods that I could use.

“The bottom line is: we can regard criticisms of something called NHST as relevant only insofar as all cases of insevere corroboration are condemned.”

To the extent “Severity” has been made concrete, it has only been tested on 200-year-old problems where it’s numerically and functionally the same as using the Bayesian posterior. In other words, it hasn’t been tested.

To the extent “Severity” is a meta-principle, it’s infinitely malleable and can always be fudged on an ad hoc basis to save face. This leads to a kind of statistical Zeno’s paradox: as each new problem is found and fixed, the methods inch ever closer to the Bayesian answer (just like ‘severity’ brings p-value methods closer to posteriors), without anyone having to admit that’s where they’re headed. At least when Abraham Wald encountered this phenomenon, he had the mathematical skill to see where perfection led and the integrity to name them “Bayes strategies”.

Oh the delicious irony of Popperites (Popperazzi?) using Popper’s words to swear allegiance to untested theories and unfalsifiable ideologies.

Of course if you don’t know enough math to verify these claims you can deny them indefinitely. That’s the chief perk of being a Philosopher I suppose. The statistical community will eventually discover the truth though if they take “Severity” seriously enough. The truth always wins with this sort of thing and it won’t make a spit of difference what the highly credentialed super geniuses on this blog have to say – just like the world learned the truth about Classical Statistical methods no matter how thoroughly the super stars of Frequentism indoctrinated each new generation and enforced their ideological prescriptions.

So enjoy your adulation while it lasts Mayo, and pray those statisticians who’ve praised your work continue to spend their time complimenting it rather than using it.

Oh and to the extent there’s a kernel of truth in “severity”, Jaynes of course did it 100 times better, with mathematical details, in yet another of his articles pregnant with useful ideas that any statistician worth their salt could turn into half a dozen profitable research programs: http://bayes.wustl.edu/etj/articles/what.question.pdf

Rahul:

You write, “I think the flaws & blind spots of NHST are well recognized. But practitioners aren’t stupid. Mostly.” It’s not about stupidity. Statistics is hard! Recall our discussion the other day. Even brilliant scientists such as Turing and Kahneman can get snowed by what seems to them as overwhelming statistical evidence.

I think it’s worth returning to these topics over and over again because they are difficult enough that non-stupid people get them wrong, over and over. I don’t delude myself that I can change everyone by one blog post or even by 100 posts and 10 journal articles. But I think these issues are worth thinking about, and I think that we make progress by elaborating and discussing them.

And you write, “What’s not available or practical in many use-cases is an improved alternative to NHST.” Here, I think we have to return to the question of incentives. An alternative statistical method such as multilevel modeling can be an improvement in some ways (for example, giving more accurate and reproducible estimates of effects) but could be considered as negative in other ways (with multilevel modeling, it’s harder to get statistically significant p-values (see my 2000 paper with Tuerlinckx), hence harder to get publication). So I agree with those people who say that, along with working on statistical methods, we have to work to change some of the perverse incentives of the system.

“An earlier $58 million request for the Centers for Disease Control would help the agency ramp up production and testing of the experimental drug ZMapp, which has shown promise in fighting the Ebola epidemic in western Africa.”

http://www.washingtontimes.com/news/2014/sep/5/white-house-asks-for-58m-for-ebola-drugs/

Really, this is the evidence:

Reversion of advanced Ebola virus disease in nonhuman primates with ZMapp. Nature (2014), doi:10.1038/nature13777

http://go.nature.com/oY8pGI

Why did editors/reviewers fail to make them include a description of the “clinical score” protocol? Why would you spend so much money on a project then fail to blind the people measuring your primary outcome?

This is grade-school stuff.

I think the flaws & blind spots of NHST are well recognized. But practitioners aren’t stupid. Mostly.

What’s not available or practical in many use-cases is an improved alternative to NHST.

Almost.

“What makes a methodology pseudoscientific isn’t that it refuses to falsify so much as being unable to reliably pinpoint the blame of any apparent anomalies.”

I think this is an apt description. The question is how to “pinpoint the blame” when we have a theory capable of only vague (higher/lower, no/some relationship) conditions. It seems to me you simply cannot do so; the solution is to:

1) Observe carefully

2) Record as much about the phenomenon as possible

3) Think about what may be going on

4) Guess (adduce) an explanation

5) Formulate some assumptions of your guess in mathematical form

6) Deduce precise predictions from these assumptions (upper/lower bounds, existence of a phenomenon, exact values)

7) Check how close your predictions are to new data

I made this a while back based on some of Meehl’s publications; I wonder what you guys think of it:

http://s29.postimg.org/3n52c0iqv/logical_structure.png

It was based on these two papers:

Meehl, P (1990). “Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It”. Psychological Inquiry 1 (2): 108–141. doi:10.1207/s15327965pli0102_1

http://rhowell.ba.ttu.edu/meehl1.pdf

Meehl, P. E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 393-425). Mahwah, NJ: Erlbaum.

http://www.tc.umn.edu/~pemeehl/169ProblemIsEpistemology.pdf

Mayo,

Good example with the Ebola drug. Let’s take the recent ZMapp study (Qiu et al. 2014). The increased survival was because they euthanized the control animals for having high ‘clinical scores’ (bad symptoms). They do not tell us how this clinical score was calculated, but they do mention that the study was not blinded.

So we have a situation of unblinded researchers deciding how long monkeys survive according to unknown criteria, then claiming increased survival in the treatment group… the stats do not even enter into the decision-making process.

Mayo:

I don’t know why you continue to write things like “something called NHST” or “so-called NHST” or “no real methodology.” What you call “something called NHST” is what the rest of us call “NHST.” You might as well refer to “something called evolution” or “something called the Eiffel Tower.” “NHST” is a phrase that refers to a statistical method that is prevalent in psychology research and elsewhere. The method goes like this: a researcher has a substantive theory X and, from the data, he or she tests a null hypothesis Y that he or she does not believe, then gets p&lt;0.05, declares rejection of Y, and then claims X is correct. In recent years, NHST has been used to demonstrate all sorts of things, including the existence of ESP and huge effects of ovulation on vote preferences.

From a statistical perspective, NHST is flawed for several well-known reasons that I have discussed many times on this blog and in published and unpublished papers. As we have discussed from time to time, the problem is not with p-values – the same errors will arise if people use confidence intervals instead – but rather with the NHST framework. Also as we’ve discussed, NHST has various properties that make it appealing to practitioners, as well as a superficial (but, as I’ve argued in the above post, false) connection to Popperian falsification. So for all these reasons, I think NHST is worth discussing. As I’ve also written many times, I don’t think the framework of Type 1 and Type 2 errors is helpful in most of the applications I’ve seen.
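The “null hypothesis Y that he or she does not believe” part has a well-known numerical consequence, sketched here with invented numbers: with enough data, a point null is rejected even when the true effect is scientifically negligible, and the leap from rejecting Y to the substantive theory X is a separate, unsupported step.

```python
import math
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
true_effect = 0.01            # 1% of a standard deviation: scientifically negligible
x = rng.normal(true_effect, 1.0, n)

# One-sample z-test of the point null "mean = 0" (the null nobody believes)
z = x.mean() / (x.std() / math.sqrt(n))
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"z = {z:.1f}, reject at 0.05: {p < 0.05}")
# Rejection was a foregone conclusion at this n; by itself it offers no support
# for whatever substantive theory X motivated the test.
```

The same data that “reject Y” here are equally compatible with any number of theories predicting a tiny nonzero mean, which is exactly why the rejection cannot single out X.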

PPS where I wrote H1-H3 above I meant H0-H2.

PS I think what Mayo says is that NHST is fine under a framework that includes H0-H2. If so, I agree.

I think Andrew is criticising NHST under the standard framework that includes only H0-H1. If so, I agree.

Yet the problem here is not NHST so much as the underlying framework.

Peter:

That is exactly the point made by Jaynes.

For rejection of H0 to count as evidence for H1, we must ensure that these two hypotheses form a partition of the space of hypotheses. If so, it must be the case that if H0 is unlikely then H1 must be likely, as beliefs must sum to 1.

However, science is a human endeavor prone to failure, so we must always consider a third hypothesis, H2: that the evidence reflects some artifact, measurement error, etc. If so, the logic above breaks: H0 may be unlikely, but H1 even more so, in which case H2 must be the culprit.

The point of good design, thorough implementation, reliable instruments, etc. is precisely to minimize the probability of H2, so that evidence against H0 counts as evidence for H1.

Bayesians can also go wrong if they ignore H2.

If you include H1-H3 in the analysis, then the results will inform you about the null, the alternative, and the probability of an artifact. Note that H3 can be a stand-in for all the ways in which a study might go awry; there is no need to specify each and every possibility, though doing so can be more informative.
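A toy posterior calculation (every prior and likelihood below is invented for illustration) shows how including the artifact hypothesis changes where the belief freed up by rejecting the null actually goes:

```python
# Toy three-hypothesis update; all numbers are invented.
# H0: no effect, H1: real effect, H2: artifact / measurement error.
priors = {"H0": 0.40, "H1": 0.40, "H2": 0.20}
# Assumed likelihoods of the observed data under each hypothesis:
# the data look very unlike H0, but an artifact would explain them
# even better than a real effect would.
likelihoods = {"H0": 0.01, "H1": 0.10, "H2": 0.90}

# Bayes' rule over the three-way partition: beliefs must sum to 1
evidence = sum(priors[h] * likelihoods[h] for h in priors)
posterior = {h: priors[h] * likelihoods[h] / evidence for h in priors}

for h, prob in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{h}: {prob:.3f}")
# H0 is nearly ruled out, yet most of the belief lands on H2, not H1.
```

With H2 omitted, the same likelihoods would force nearly all the posterior onto H1, which is exactly the broken “evidence against H0 counts as evidence for H1” step described above.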

If you are criticizing an NHST move from a stat-sig effect to a substantive claim T, you should add, as did Meehl: “For the corroboration to be strong, we have to have ‘Popperian risk’ (Popper, 1959/1977, 1962), ‘severe test’ (Mayo, 1991, 1996)” (Meehl and Waller 2002).

So you’d have to be critical of all statistical affirming of the consequent, where even though T “fits” or “explains” or “confirms” or is given a B-boost by the observed effect, Prob(so good a fit, even if not T) = not low.

That would NOT be to criticize statistical significance tests – certainly not any error-statistical test – which must control two error probabilities: erroneously finding evidence against, and erroneously failing to. Strictly speaking, N-P statistical tests only concern statistical hypotheses, but we can extend the reasoning to any level. Your example concerns the former (essentially a Type 1 error). If one moves from a statistically significant effect to a substantive claim T – where that claim has not had its errors probed, so that T has not been corroborated with severity – then you fail to control that error. That would not condemn tests where that error was controlled. Thus to criticize all statistical significance tests would be to preclude such inferences even where they are warranted, and in fact it would preclude empirical falsification in science. Consider a case where we’d allow it to be warranted:

1. If this Ebola drug didn’t work, then they wouldn’t be able to show such-and-such improved survival.

2. They show improved survival (statistically).

This is evidence the drug works.

The bottom line is: we can regard criticisms of something called NHST as relevant only insofar as all cases of insevere corroboration are condemned.

I think more clarity is needed here between the concepts of Theory and Hypothesis (for which I don’t think there are adequately accepted definitions). I think of Theory as describing the theorized and unobservable *causal* relationship being considered, and the Hypothesis as the observable ‘correlational’ relationship *implied* by the theory. The theoretical relationship is between ‘theoretical level’ concepts (A&B), while the Hypothesis is about the relationship between *operationalized* measures of those theoretical concepts (m(A) & m(B)).

There is then a logical step involved: (A influences B) implies (m(A) correlates_with m(B))

This logical step is open to critique: are m(A) and m(B) valid measures of the concepts A and B? Is the theorized relationship correct (e.g. linear vs. various forms of non-linearity)?

Testing then proceeds under the assumption that the above situation is correct.

Falsification seeks a ‘reductio ad absurdum’ situation, in which m(A) does not correlate with m(B), implying that one of the *many* assumptions of the test is false. The central assumption is that “(A influences B)”, but it is not the only assumption that should be considered. The critique questions above are two others. Others could involve whether there is another theory that would also imply that “(m(A) correlates_with m(B))”. Possibilities include reverse causality (“B influences A”) and tertium quid (“C influences both A and B”). Another is sample bias: were the samples actually representative of (e.g. selected randomly from) the population to which the results are to be generalized?
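The tertium quid possibility is easy to demonstrate by simulation. A minimal sketch, with all numbers invented: a common cause C drives both measures, so m(A) and m(B) correlate even though A has no influence on B:

```python
import random

random.seed(1)
n = 10_000
A, B = [], []
for _ in range(n):
    c = random.gauss(0, 1)            # the lurking common cause C
    A.append(c + random.gauss(0, 1))  # m(A): C plus measurement noise
    B.append(c + random.gauss(0, 1))  # m(B): C plus independent noise

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

r = pearson(A, B)
print(round(r, 2))  # near the theoretical value of 1/2 for this setup
```

A test of “m(A) correlates with m(B)” would succeed resoundingly here, yet by construction there is no causal arrow from A to B.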

“Confirmation” (in which m(A) *does* correlate with m(B)) says little. All of the same assumptions need to be considered. But once those have been considered, and assuming we’re pretty sure the assumptions hold, we would *tentatively* accept the idea that “A influences B”. But “*tentatively*” is a hard concept for our species to hold onto; we are simply too enamoured of certainty. Confirmation bias gets in the way of falsificationism. There may be a better theory out there that explains the correlational relationship without any causal connection between the two concepts, but which we haven’t been able to think up at present. If someone comes up with such a theory, we then need to look for operational-level consequences of the old and the new theory that are inconsistent with each other, and see which of the two we can reject. (Which gets into the idea of competing research hypotheses, and *comparative* analysis, rather than simply testing a single theory’s Hypothesis against its own Null Hypothesis. And then there’s the idea of generating multiple operationalizations of the same theory: what other *testable* research hypotheses does the theory imply… and what if the different research and null hypotheses produce different results?)

Similarly, the idea that we can ‘confirm’ a theory doesn’t work, unless we have full knowledge of all the other possible theories. We can only compare a theory against another theory (with the Null Hypothesis representing a ‘Null Theory’), and recognize that the theories have some assumptions involved that we can be aware of, and others that we are not yet equipped to be aware of.

If you can argue (convince yourself? convince others?) that those other assumptions are valid, then “m(A) does (not) correlate with m(B)” does imply that “A does (not) influence B”. But the arguments around those assumptions are never certain… they too are subject to future insights, theorizing and testing.

Science is *process* not *certainty*!

=Peter

]]>Shravan:

I’m happy to admit my mistakes; see for example here:

http://www.stat.columbia.edu/~gelman/research/published/AOAS641.pdf

and here:

http://statmodeling.stat.columbia.edu/2014/05/12/results-shown/

and here:

http://www.stat.columbia.edu/~gelman/research/published/GelmanSpeedCorrection.pdf

and here:

http://statmodeling.stat.columbia.edu/2014/07/15/stan-world-cup-update/

and here:

http://statmodeling.stat.columbia.edu/2009/05/11/discussion_and/

And my colleagues such as Bob Carpenter, Jennifer Hill, Phil Price, etc., are the same way. Indeed I would find it difficult to do science *without* admitting errors when they occur. It is often through recognition of our errors that we learn the most.

You write, “I’m sure that you also rarely back down from a position or opinion that you have; it would involve loss of face. It’s easy to fool oneself into thinking, that no, it’s not about loss of face, I’m really right about this.” Of course this is an impossible argument to refute, but really it would be silly for me not to back down from a position when I made a mistake, as that’s how I learn.

In any case, sure, I realize that human nature is what it is, and I’m not expecting Mark Hauser, Ed Wegman, Anil Potti, etc., to admit they were wrong—they’ve had their chances to admit wrongdoing and haven’t taken those opportunities—nor do I expect Daryl Bem and the various “Psychological Science”-type researchers to admit that they have been spending years chasing noise. I agree with you that in these cases it would just be too difficult for these people to admit, even to themselves, what they’ve done. I suspect that even the out-and-out cheaters have a way to explain to themselves what they’ve done (for example, that they’re being attacked by haters, or that their critics are politically motivated, etc.). For example, when I asked him about retracting the false numbers he’d put in his column, David Brooks characterized his critics as “intemperate.” I think that in his mind it got him off the hook.

So, sure, that’s the way it is. But I don’t have to like it. From an empirical or statistical standpoint, yes, I understand. But on an emotional level, I am continually surprised when people refuse to admit their errors. It just seems so weird to me.

]]>You wrote: “The issue is that research projects are framed as quests for confirmation of a theory.” And elsewhere you said you don’t understand why people are not willing to consider that they might be wrong.

Ignoring the obvious reasons for all this (the fame, the money, tenure), why is all this surprising to you? Who doubts themselves (publicly)? If researchers really were to abandon blind loyalty to their own ideas, it would be a personal defeat for them. I’m sure that you also rarely back down from a position or opinion that you have; it would involve loss of face. It’s easy to fool oneself into thinking, that no, it’s not about loss of face, I’m really right about this.

I think that’s a primary driver of the behavior of scientists and their theory-development, not any real belief in their theory. It’s mostly about not losing face.

At least in my field, I have yet to encounter a scientist who backs down from a position, science-related or not, that they have taken a stand on publicly. People are not even willing to express uncertainty about their beliefs; it’s a binary decision. I’m sure there must be people out there in other fields who are willing to express uncertainty publicly and in writing, but I’ve never met one.

]]>It appears to be, although Bookstein doesn’t use the name “two-stage”.

]]>Mayo:

As you know, I’m a big fan of model checks, of comparing data to predictions from the model. I’m a big fan of using a model to make strong claims, then checking those claims with data. I think this is particularly important to do with models that I like.

But I’m not a big fan of NHST, which is the procedure of trying to reject straw-man null hypothesis A in order to make a claim that a preferred hypothesis B is correct. NHST may very well refer to “no real methodology” (as you put it), but it’s a non-real methodology that is used a lot in psychology and elsewhere. Daryl Bem, Jessica Tracy, and Alec Beall are not the only scientists to take p<0.05 as strong evidence that their preferred model is true. This attitude appears all over the place. Again, I like model checking; indeed, I’ve gone to a lot of effort to emphasize the continuity between classical and Bayesian goodness-of-fit tests. Posterior predictive checks, for example, can be viewed as a generalization of classical tests to settings where there is uncertainty about the parameters and no easy analytical solution. But I don’t like NHST.
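To illustrate the kind of model check described above, here is a toy posterior predictive check. The normal model, the invented data, and the choice of test statistic (the sample maximum) are all assumptions of this sketch, not anything from the post:

```python
import random

random.seed(7)
y = [2.1, 1.8, 2.5, 1.9, 2.2, 6.0]  # invented data with one large value
n, sigma = len(y), 1.0              # known sd; flat prior on the mean
post_mean, post_sd = sum(y) / n, sigma / n ** 0.5  # posterior for the mean

T_obs = max(y)   # test statistic: the sample maximum
exceed = 0
draws = 4000
for _ in range(draws):
    mu = random.gauss(post_mean, post_sd)          # draw from the posterior
    y_rep = [random.gauss(mu, sigma) for _ in y]   # replicated dataset
    if max(y_rep) >= T_obs:
        exceed += 1

ppp = exceed / draws  # posterior predictive p-value for T = max(y)
print(round(ppp, 2))
```

A small posterior predictive p-value flags that the fitted normal model rarely reproduces a maximum as extreme as the observed one, i.e., the model misfits in that direction; there is no straw-man null being rejected in favor of a preferred alternative.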

]]>The last para on Meehl wasn’t to be included. It’s correct but goes onto a different topic that I decided not to take up here. It’s quite important, but I don’t want to mix issues.

]]>Andrew: I think the term was invented in psychology and alludes to no real methodology. I agree it alludes to a fallacy of (probabilistic) affirming the consequent, one that is embraced by those who view evidence as a matter of increasing probabilistic confirmation. But I think there is a danger in confusing significance tests, be they “pure significance tests” (as David Cox calls them) or N-P tests (entities with their own confusions and misuses), with the affirming-the-consequent move from an observed effect to a substantive theory T (where T might be thought to render that effect more probable). By jumbling these together, much of the discussion has degenerated. The error control that is vouchsafed in proper significance testing and cognate methods is absent in the illicit form of these methods. It would be like criticizing a method while not referring to that method at all. Worse, many of the critics of significance tests and related methods say that what we really want are ways to boost the confirmation of theory T (whether absolute or comparative). So they go right back to recommending confirmation boosts in place of NHST, which boils down to recommending a version of NHST instead of a methodology that would have precluded the unreliable move to T (whose errors have been poorly probed by merely finding effect x).

Meehl was wrong about a number of things in the midst of his criticisms of tests in general. Like many people, he claimed that if you are going to allow an observed effect x to be evidence for theory T, then you must take the absence of the effect x as grounds to deny T.

]]>This is more complicated than a mere blog comment can do justice to, but the thing is, setting out to falsify, formulating your “test” as a modus tollens, need not warrant the denial of the antecedent (in the case of a failed consequent) in the least. In the case at hand, turning the example (under criticism) into modus tollens doesn’t turn it into a critical affair or spare it from being questionable science. What makes a methodology pseudoscientific isn’t that it refuses to falsify so much as that it is unable to reliably pinpoint the blame for any apparent anomalies. Popper’s philosophy denied we could solve such “Duhemian problems” reliably (even though, personally, he thought we must manage to). At most, for Popper, you can infer something is wrong somewhere. That’s where Popper’s methodology falls apart (and mine, I hope, goes beyond his). This is on p. 1 of Error and the Growth of Experimental Knowledge (ch. 1, “Learning from Error”). To claim one is being stringent or self-critical simply because one is going to try to find flaws in a model or hypothesis is empty.

]]>Not to even mention the “you can’t get a paper into Nature unless you publish a proper p value” type of editorial policy created by statistically naive editors at top journals. This makes it hard for researchers to publish high-quality work when they actually know what they’re doing and purposefully choose to use something like a Bayesian method, or even a frequentist method that focuses on estimation and confidence bounds instead of p values. I’m especially pointing at biology, because that’s the area where I have first- or second-hand experience of this kind of thing going on.

]]>