In which I side with Neyman over Fisher

As a data analyst and a scientist, Fisher > Neyman, no question. But as a theorist, Fisher came up with ideas that worked just fine in his applications but can fall apart when people try to apply them too generally.

Here’s an example that recently came up.

Deborah Mayo pointed me to a comment by Stephen Senn on the so-called Fisher and Neyman null hypotheses. In an experiment with n participants (or, as we used to say, subjects or experimental units), the Fisher null hypothesis is that the treatment effect is exactly 0 for every one of the n units, while the Neyman null hypothesis is that the individual treatment effects can be negative or positive but have an average of zero.
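
In symbols, writing tau_i for the treatment effect on unit i (this is just a restatement of the definitions above), the two nulls are:

Fisher (sharp) null: tau_i = 0 for every unit i = 1, …, n
Neyman (weak) null: (tau_1 + … + tau_n)/n = 0, with the individual tau_i otherwise unrestricted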

Senn explains why Neyman’s hypothesis in general makes no sense—the short story is that Fisher’s hypothesis seems relevant in some problems (sometimes we really are studying effects that are zero or close enough for all practical purposes), whereas Neyman’s hypothesis just seems weird (it’s implausible that a bunch of nonzero effects would exactly cancel). And I remember a similar discussion as a student, many years ago, when Rubin talked about that silly Neyman null hypothesis.

Thinking about it more, though, I side with Neyman over Fisher, because the interesting problem for me is not testing the null hypothesis, which in nontrivial problems can never be true anyway, but estimation. And in estimation I am interested in an average effect, not an effect that is identical across all people. I could imagine a model in which the variance of the treatment effect is proportional to its mean—this would bridge the Neyman and Fisher ideas—but this is not a model that anyone ever fits.

So, just to say it again: if it’s a pure null hypothesis, sure, go with Fisher. But if you’re inverting a family of hypothesis tests to get a confidence interval (something which I’d almost never want to do, but let’s go with this, since that’s the common application of these ideas), I’d go with Neyman, as it omits the implausible requirement that the treatment effect be exactly identical on all items.

56 thoughts on “In which I side with Neyman over Fisher”

  1. Even when testing, Neyman’s hypothesis could be seen as sensible, at least for one-sided testing: is a new treatment better than an established old one?
    Assuming that both treatments do something and are reasonably different, it makes a lot of sense to believe that one is better for one patient and another is better for another patient. Now Senn is right that one wouldn’t believe that on average both are exactly the same, but still the question of whether there is evidence that one is better than the other could legitimately be translated as whether there is one-sided evidence against the null that both are on average the same (as a borderline case of the other treatment being better on average).

      • Another pragmatic reason for taking the Neyman approach in testing: the overwhelming bulk of the information about deviations from Fisher’s null may be given by the signal for the Neyman-esque average effect – particularly when the effects are roughly the same size and direction. You won’t get full efficiency, in general, but for analyses with some degree of robustness it’s a good place to start.

  2. As an attempt to amplify Christian’s point: in abstract mathematical reasoning, once it is pointed out that A cannot be equal to B, it is wrong to represent A with B.

    In purposeful (sensible) empirical reasoning (modelling), pointing out that the model representing treatment effects can only be wrong (as Fisher/Senn does) does not rule it out for representing treatment effects, as we know all models are wrong (as representations of empirical objects) and we must choose among wrong ones anyway.

  3. Is variance in outcome proportional to outcome really that rare? If you fit a line through log-transformed data with least squares, isn’t that what you’re doing? I’d have to think about whether you’re assuming variance or standard deviation proportional to outcome (those darned square roots). I like to think in terms of standard deviations in all cases; they are on the same scale as the data.

    • Daniel:

      Log scale is fine but that’s a slightly different issue. For example, suppose you have data you wouldn’t log (for example, test scores), and then you are considering a treatment effect that might be positive. So you could model the sd of treatment effects as being proportional to the average effect. Such a model could well be a good idea but it’s not standard practice.
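
      A minimal simulation sketch of such a model (all names and numbers here are invented for illustration; k = 0 recovers the constant-effect, Fisher-style setup, and mu = 0 then forces every effect to be exactly zero):

import numpy as np

rng = np.random.default_rng(0)

def simulated_diff_in_means(mu, k=0.5, n=200, baseline_sd=10.0):
    """Simulate test scores where unit-level treatment effects have
    standard deviation proportional to the average effect (sd = k * |mu|)."""
    treated = rng.integers(0, 2, size=n)        # 1 = treated, 0 = control
    tau = rng.normal(mu, k * abs(mu), size=n)   # unit-level treatment effects
    y = rng.normal(50.0, baseline_sd, size=n) + treated * tau
    return y[treated == 1].mean() - y[treated == 0].mean()

print(simulated_diff_in_means(mu=0.0))  # Fisher's sharp null holds exactly
print(simulated_diff_in_means(mu=5.0))  # nonzero average effect, heterogeneous across units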

  4. Given the variation in humans, wouldn’t it also be reasonable to expect that there is a drug*subject interaction present? There could be sub-populations where the effect was negative and some where it was positive, but overall we might see no effect.

    When considering the efficacy of many drugs, my prior would be that we would see a different effect in men than in women. That brings up the issue that a number of drugs were tested in men only and then assumed to work in women.

  5. The finite-variance null (zero mean, finite variance) is a more flexible model than the zero-mean, zero-variance one, so in some sense it is much more “conservative”.

  6. Sir David Cox, in his 1958 article “The Interpretation of the Effects of Non-Additivity in the Latin Square”, provided a rather unique viewpoint on Neyman’s null in the Neyman-Fisher controversy of 1935 (which dealt with randomized complete block and Latin square designs). After first summarizing Wilk and Kempthorne’s earlier results by stating that it is usually the case that the expected mean residual sum of squares is larger than the expected mean treatment sum of squares for Latin squares, Cox then considered the practical importance of this difference of expectations, which he correctly recognized as being related to interactions between the treatment and blocking factors. Cox (1958: p. 73) raised the interesting question of whether, for a Latin square design, the practical scientific interest of the null

    H_0 : E(MSTreatment) = E(MSResidual)

    is comparable to, or greater than, Neyman’s null, especially when the difference between these expected mean sums of squares is considered important. He concluded that testing Neyman’s null when there is no unit-treatment additivity may not be helpful:

    “… if substantial variations in treatment effect from unit to unit do occur, one’s understanding of the experimental situation will be very incomplete until the basis of this variation is discovered and any extension of the conclusions to a general set of experimental units will be hazardous. The mean treatment effect, averaged over all units in the experiment, or over the finite population of units from which they are randomly drawn, may in such cases not be too helpful. Particularly if appreciable systematic treatment-unit interactions are suspected, the experiment should be set out so these may be detected and explained.” (Cox, 1958: p. 73)

    Oscar Kempthorne also had interesting comments on the nulls of Fisher and Neyman in his book “The Design and Analysis of Experiments”:

    “If the experimenter is interested in the more fundamental research work, Fisher’s null hypothesis is more satisfactory, for one should be interested in discovering the fact that treatments have different effects on different plots and in trying to explain why such differences exist. It is only in technological experiments designed to answer specific questions about a particular batch of materials which is later to be used for production of some sort that Neyman’s null hypothesis appears satisfactory … Neyman’s hypothesis appears artificial in this respect, that a series of repetitions is envisaged, the experimental conditions remaining the same but the technical errors being different.” (Kempthorne, 1952: p. 133)

  7. On practical matters, how does this affect what I do? If I look at a p-value, I’m being Fisherian, right? And then I’m assuming that the unit-level effects are exactly zero under the null?

    And when I try to compute the power of a test, I’m being Neymanian(?), right? In this case, I’m assuming the effect is on average zero (under the null) when I set alpha to, say, 5%?

    I guess my question is: how does this affect the calculations I do? And the interpretations?

    I already have a hard time interpreting a p-value (and please remember that people as good as Wasserman and Gelman couldn’t agree on whether there is any conditioning when computing a p-value). Should I be concerned about my interpretations of any other stuff?

  8. Andrew, I think that you underestimate the problems with the Neyman model. To put it another way, it violates Nelder’s marginality principle by contemplating models which have interactions but not their marginal main effects. I don’t believe that any experienced data analyst would countenance them. They cause all sorts of problems.

    I am pretty much in sympathy (I think) with Gaius.

    A related problem occurs in Bayesian models for meta-analysis that specify independent prior distributions for random treatment by trial interactions and for the average effect of treatments. See ‘Trying to be precise about vagueness’, Stat Med, 2006:
    http://onlinelibrary.wiley.com/doi/10.1002/sim.2639/abstract

    • By implication, Gelman is not an “experienced data analyst?”

      I suspect you are concerned with some kind of technical aspect of something specific, whereas Gelman is being maybe more general. Also notice that Gelman is really interested in estimation of effect size rather than null-hypothesis-testing.

      In general it does seem reasonable to me that averaged over some population, the difference between treatment A and treatment B could be close to zero, whereas in specific subpopulations these effects might have consistent directions. I think this is the core of Gelman’s issue.

      Of course, given enough data, we’d love to tease out the subpopulation effects and interactions and things, but in practice sometimes we need to just take an average effect over a big population, because we aren’t going to have control over which subpopulation will be treated or whatever. (Think over-the-counter medicine, for example.)

      • No. By implication, Gelman does NOT use the Neyman model. I will be surprised if he does.

        The issue is whether you would ever use a model in which the main effect was constrained to be zero but the interaction would be allowed to be anything at all.
        Stephen

        • But Gelman right here says: “if it’s a pure null hypothesis, sure, go with Fisher. But if you’re inverting a family of hypothesis tests to get a confidence interval (something which I’d almost never want to do, but let’s go with this, since that’s the common application of these ideas), I’d go with Neyman”

          so you’re right, he doesn’t use hypothesis tests at all personally, but if you’re going to he recommends Neyman’s model to get confidence intervals, so I’d love to hear the two of you hash it out because I’m sure I’d learn something.

        • And my example seems to be a perfectly reasonable one where I might be interested in whether the main effect (ie. averaged over the whole population) was near zero, even if interactions with sub-populations could be all sorts of stuff. I think that seems like a pretty reasonable model to a lot of people here on this blog at first glance at least.

    • For some reason, this post by Andrew led me to re-read Meng’s epilogue on H likelihood here http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdfview_1&handle=euclid.ss/1270041255

      For instance, in 2008 I invited David Freedman to the SAMSI meta-analysis program as a proponent of the Neyman null (mostly because I don’t get it); he declined but confirmed his strong preference for the Neyman null.

      And despite Gaius’ accurate quotations, I don’t believe David had a strong preference for one over the other, saying something like: that 1958 paper just provided technical support for Fisher’s interpretation of the Latin square.

      I also really don’t get H likelihood other than as an approximation, but believe I should just point to Meng’s epilogue.

      “I must confess that my study of the h-likelihood framework is largely carried by both the authors’ faith in their methods and my faith in the authors—they must have seen signs that most discussants did not.”

      p.s. really liked the precise about vagueness paper – I should re-read it.

  9. It would have been better if Fisher hadn’t been such a lone pathbreaker; then we wouldn’t have gotten stuck with his awkward techniques. If others had been keeping up with Fisher, then there would have been competing methodologies to choose among. But Fisher was so far out ahead of his time that we got stuck with his way of doing things rather than a more refined methodology that would have emerged from competition.

    It’s a little like how Newton’s notation for doing calculus works fine if you are as smart as Newton, but Leibniz’s dy/dx style is better for everybody else. Out of patriotism, the British stuck with Newton’s notation, which held back British math for a century, while the rest of the world gratefully adopted Leibniz’s methods.

  10. This is not necessarily an example, but might be.
    Back in the mid-1990s, my cardiologist wished to prescribe a cholesterol-lowering drug (this was before Lipitor became available to the public) and told me:
    there are 3 drugs, one of which will almost certainly work for you, but we can’t predict which one.
    We’ll try each one and see which works well enough (and one did).

    For any given person, perhaps response would be 0,0,1 or 0,1,0 or 1,0,0 to the 3 drugs, but if there were other people for whom the effect was -1, 0, 1 or -.5, -.5, 1 then it might fit.

  11. “I could imagine a model in which the variance of the treatment effect is proportional to its mean—this would bridge between the Neyman and Fisher ideas—but this is not a model that anyone ever fits.”

    Poisson regression…
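
    For what it’s worth, a quick check of the property being invoked (for a Poisson outcome the variance equals, and hence is proportional to, the mean):

import numpy as np

y = np.random.default_rng(1).poisson(lam=4.0, size=100_000)
print(y.mean(), y.var())  # both are close to 4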

  12. Let me make it concrete. In a Latin square, a model in which the treatments do nothing is one in which ‘row’ and ‘column’ effects only apply (call this model F0). Now in the Fisher and Nelder school of modelling this model can be made progressively more complex by adding additive treatment effects, T (call this model F1), and then adding interactions, I (call this model F2). Neyman wants to create a model which has I but NOT T, to create a new model N0, and then see how this compares to F2 (which is then his N1). To use the British English vernacular, this is just “barking”. Note that the issue is not whether eventually you end up with a model like F2 (I believe it is F2 that Gelman uses for estimation); the issue is simply whether N0 can ever be an interesting model to entertain, and even whether (speaking more practically and accepting that all models are approximate) you would be prepared to have a scheme in which you could end up with N0 as an approximation but could never end up with F1 as an approximation. (The hierarchy is sketched schematically at the end of this comment.)

    This particular attitude causes endless harm in medicine, where we currently have many drug companies spending billions on chasing individual response for treatments for which there is no evidence that they actually do anything on average. This is back to front from the practical point of view. (Dichotomania is an obsessive-compulsive disorder that is a symptom of this disease. See http://www.emwa.org/JournalArticles/JA_V18_I3_Senn1.pdf )

    Finally, to explain Fisher’s exasperation: he had only just explained to Wilks, who considered he had provided a rigorous proof of Fisher’s analysis, that no such proof was needed (Cochran eventually convinced Wilks that Fisher’s proof was correct), only to find that Neyman had now popped up with a proof that Fisher’s approach was NOT correct. This was typical behaviour of mathematicians in love with formalism. As Savage later admitted, Fisher was a much better mathematician than he had realised. It is a pity that Wilks and Neyman did not slug it out together.

    See Added Values, http://onlinelibrary.wiley.com/doi/10.1002/sim.2074/abstract for further discussion.
    Stephen
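
    Schematically, the F0/F1/F2/N0 hierarchy described above might be written as follows (the notation here is assumed for illustration, not Senn’s own: Y_ij is the response in row i and column j of the square, k(i,j) the treatment applied to that cell, tau the additive treatment effect T, and delta_ij a unit-specific departure from the average effect, i.e. the interaction I):

    F0: Y_ij = mu + row_i + col_j + error_ij
    F1: Y_ij = mu + row_i + col_j + tau_k(i,j) + error_ij
    F2: Y_ij = mu + row_i + col_j + tau_k(i,j) + delta_ij + error_ij
    N0: Y_ij = mu + row_i + col_j + delta_ij + error_ij   (interaction I present, average treatment effect T forced to zero)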

    • It sounds like your issue with this is partly due to the emphasis on hypothesis testing, and partly due to the classical identifiability/Latin-square design framing rather than a semi-causal mechanistic + Bayesian modeling approach.

      Imagine I have 2 treatments T in {A,B} and 4 groups of people P1,P2,P3,P4 who are genetically different in some way which may be hidden. The outcome function is F. I can easily imagine a model where F(P,T) represents the expected value of F in a population of people of type P under treatment T

      F(P,T)

      whose expected value over the population is:

      F(P1,Ta) P(P1,Ta) + F(P2,Ta) P(P2,Ta) + … + F(P4,Tb)P(P4,Tb) = E[F] ~ 0

      but where some of the terms are positive and some are negative. In addition, in each population there is some noise, so that the actual outcome f for person i, of hidden genetic type P_i, and given treatment T_i is

      f(i, P_i,T_i) = F(P_i,T_i) + epsilon_i

      and epsilon_i is randomly distributed in some manner.

      Under this kind of model it seems reasonable to try to identify the hidden P_i values where F is positive and then apply treatment only to that subgroup. A perfect example would be specialized cancer treatments, tailored to genetic variations in the cancer cells that are unknown to us.

      No Bayesian would ever bother to put a delta function prior on EF=0 (the population expectation) though, they’d simply say that EF is most likely in the vicinity of 0, and be perfectly happy if it turned out to be some small value.

      • “Imagine I have 2 treatments T in {A,B} and 4 groups of people P1,P2,P3,P4 who are genetically different in some way which may be hidden.”

        “Under this kind of model it seems reasonable to try to identify the hidden P_i values where F is positive and to then apply treatment only to that subgroup. A perfect example would be specialized cancer treatments, specialized to certain unknown to us genetic variations in the cancer cells.”

        This is similar to what I described about (statin) cholesterol-lowering drugs, where there likely was an unknown genetic determination of which drug would work. If the side effects had been worse or the costs higher, maybe they would have worked harder to figure out the genetic determinants, but then Lipitor came along and seemed to work pretty well for most people.

  13. I don’t necessarily disagree with this, but the point is whether the model in which the main effect is zero and the interaction is allowed to be anything at all is ever worth fitting. By fitting the richer model (interaction and main effect) you may well discover that the main effect is small, but that is another matter.

    Imagine that you have a group of patients, group 1, for which the average effect is zero but the patient-by-treatment interaction is large, and another group of patients, group 2, for which the same is the case. Now suppose that you construct a group 3 by randomly drawing some members from group 1 and others from group 2. Although the treatment-by-patient interaction is likely to remain large for this group, it is almost impossible for the group average to be zero. It is this fact that makes the Neyman model absurd.

    As well as Fisher’s attitude to hypothesis testing, John Nelder’s approach to modelling and marginality is relevant. See ‘A reformulation of linear models’: http://www.jstor.org/discover/10.2307/2344517, as is the long-running dispute over Type II (acceptable) and Type III (unacceptable) sums of squares.

    • I think one of the things I love about Bayesian inference is that it obviates the need for specialized interpretations of the linear algebra of classical linear least-squares problems. I’ve always found that unsatisfactory and more or less an artificial constraint on statistical inference induced by the need to have a tractable method of solution. The geometry of least squares is, I think, an artificial constraint on statistics (a useful one, but artificial); this becomes quite obvious when you suddenly have some nonlinear model and the geometry just goes away in a puff of smoke.

      In Bayesian inference, non-orthogonality, classical nonidentifiability/confounding, unbalanced designs, and other similar considerations are significantly less important, or have a different character entirely. One reason is the possibility that we have some kind of useful prior information: for example, we might know that quantity A should be tightly constrained around some known value while quantity B may be much less constrained. When this is the case, we can fit a model in which A*B or A+B appears without having an impossible task due to confounding. A similar case occurs in this discussion, I think. For the Bayesian, specifying that (in my example) EF ~ 0 can be done in terms of something like a fake data point whose value is 0, treated as a draw from a normal distribution whose mean is the sum given on the left of my previous post (the expectation sum), where the P( , ) values are unknown parameters (probably given a Dirichlet distribution in practice) and the normal distribution for the fake data point has some a priori smallish standard deviation. We can then use this essentially as a constraint to help us infer the P values, which would tell us what we know about the proportion of the population that responds well to the drug (large F) vs what proportion responds poorly (negative F). (See the code sketch at the end of this comment.)

      I can *absolutely* see that this would be a useful model under some conditions, for example where a study had been previously performed showing that there was essentially no average effect for a drug, but where a follow up data analysis reveals that there were a small number of patients who had an excellent effect, and we are now analyzing a followup clinical study to try to determine what caused that good effect in the small populations, and how big those small populations are.

      Thanks for the interesting discussion.
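
      A minimal sketch of the soft-constraint construction described in this comment, assuming PyMC (version 4 or later) is available; the four subgroups, the prior scales, and the 0.1 standard deviation on the “fake data point” are illustrative choices only:

import numpy as np
import pymc as pm

with pm.Model() as soft_zero_model:
    # unknown shares of the population in each of 4 hidden subgroups
    p = pm.Dirichlet("p", a=np.ones(4))
    # subgroup-specific treatment effects, only weakly constrained a priori
    F = pm.Normal("F", mu=0.0, sigma=1.0, shape=4)
    # the "fake data point": the population-average effect is observed to be ~0,
    # encoded as a tight normal likelihood (sd = 0.1) centered on the weighted sum
    pm.Normal("average_effect_near_zero", mu=pm.math.dot(p, F), sigma=0.1, observed=0.0)
    idata = pm.sample()  # posterior over p and F consistent with a near-zero average

      The posterior on p and F then indicates how large a well-responding subgroup is still compatible with an average effect pinned near zero.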

      • Unfortunately, the Bayesian response to classical designs with orthogonality, Latin squares, fractional factorial designs, etc., seems to be a sometimes rather computationally intensive optimization/decision-theory problem, so we give up potentially a lot of tractability.

  14. IMHO the prior-information bit is a red herring. Prior information can be regarded as pseudo-data; appeal to it turns a smaller data problem into a larger one. I have a lot of sympathy with the Bayesian idea that one should use information rather than throw it away, but I don’t think your example gets round the problem of the combination of zero main effects in the presence of interactions. Since Andrew started this debate by claiming to side with Neyman, perhaps he could cite one of his own papers in which he has used a vague prior for an interaction together with a highly informative prior that one of its marginal main effects is zero? (Or at least a lump of probability that a main effect is zero.) Then we can see more clearly the practical implementation in a Bayesian framework.

    To cite an empirical investigation, Nicola Greenlaw did an MSc with me in which she examined the relationship between the random effect variance and the main effect of treatment in 125 meta-analyses and found that the two were correlated. See http://theses.gla.ac.uk/1497/1/2010greenlaw1msc.pdf

    • Stephen:

      Thanks for the feedback. For all the reasons you have stated, I indeed would not use “a vague prior for an interaction together with a highly informative prior that one of its marginal main effects is zero? (Or at least a lump of probability that a main effect is zero.)”

      What I am objecting to is the opposite model, the vague prior for a main effect along with an assumption that the interaction (or, more generally, the variation in the treatment effect) is exactly zero. That is what is equivalent to the so-called Fisher null hypothesis in the classical approach in which a confidence interval for the effect is constructed by inverting a family of hypothesis tests corresponding to different values for the main effect.
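
      To make the construction concrete, here is a minimal illustrative sketch (not code from any of the discussants) of the inversion in question: for each candidate constant effect delta, subtract delta from the treated outcomes and run a permutation test of the sharp “no effect on anyone” null; the interval is the set of deltas not rejected. The constant-shift step is exactly where the assumption of zero variation in the treatment effect enters.

import numpy as np

rng = np.random.default_rng(0)

def perm_pvalue(y_treat, y_ctrl, n_perm=2000):
    """Two-sided permutation p-value for the sharp null of no effect on any unit."""
    pooled = np.concatenate([y_treat, y_ctrl])
    n_t = len(y_treat)
    obs = y_treat.mean() - y_ctrl.mean()
    stats = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(pooled)
        stats[b] = perm[:n_t].mean() - perm[n_t:].mean()
    return np.mean(np.abs(stats) >= abs(obs))

def inverted_test_interval(y_treat, y_ctrl, grid, alpha=0.05):
    """Keep every constant shift delta whose constant-effect null is not rejected."""
    kept = [d for d in grid if perm_pvalue(y_treat - d, y_ctrl) > alpha]
    return min(kept), max(kept)

# made-up example data
y_ctrl = rng.normal(50, 10, size=60)
y_treat = rng.normal(55, 10, size=60)
print(inverted_test_interval(y_treat, y_ctrl, grid=np.linspace(-5, 15, 81)))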

      • Isn’t that model of “vague for interactions but highly informative at zero for marginal main effect” exactly the model I proposed in the case where some drug requires some particular rare genotype to work though? Of course my model only makes sense in the presence of real actual prior data about this drug, but I don’t see that anyone has convinced me that such a thing is always a bad idea.

        On the other hand, I can surely see that it doesn’t make sense as a first go-to type model, that’s clear. We’re almost always considering things that we expect to have some average effect over the whole population.

        As an aside, I find the “main effects” and “marginal effects” and “interactions” terminology itself problematic for my own understanding, perhaps because I don’t do this type of stuff on a daily basis. My preferred language is functional dependencies, perhaps because most of my work is in mechanistic modeling, and I don’t do clinical trial type stuff. I’m sure there’s a straightforward translation between this typical “classically designed experiment” terminology and things I’m more familiar with, but I don’t have those translations built into my brain at the moment.

        • First of all, Andrew was not proposing Neyman’s model (moving from a null of interactions but no effect on average to an alternative of interactions plus some average effect) as an occasional alternative to Fisher’s approach; he was suggesting that it is generally more appropriate. This claim cannot be rescued by referring to the odd case where it might apply.

          Second, I don’t think that such odd cases exist but if they do I would be interested to have a concrete example. In the context where I work, of clinical trials, I would have to be given the example of real disease + real population + real treatment(s).

          As I said before, there are a lot of things claimed about pharmacogenetics, but I do not know of any treatments that don’t work on average but do work for some.

        • Perhaps you’ve had more emphasis on confirmatory studies, and this makes the case seem odd to you. In an exploratory study, say a screen for a new drug, you might expect the treatment (injection of something at a low dose) to have on average zero effect, but there would be some small population of candidate drugs which are very effective. Identifying the “interaction” of *which* drug you inject is the whole point of the study! Now perhaps this seems odd to you because it’s some nonstandard usage of the word “interaction”. And if so, I’ll chalk that up to my own discomfort with this terminology, as I’ve said before I prefer a more mathematical / functional dependence type terminology.

          Let me give you another case that might make sense to you and that’s closer to my own type of research. Suppose you are interested in degradation in infrastructure (say steel bridges). You would like to study the effect that random breakage of bolts has on the strength and safety of an old steel bridge. You build a computer model and you start randomly assigning bolts to break. You might easily find that a randomly chosen bolt will have no effect on the overall safety of the bridge, thanks to redundancy. But you might also find that a very small number of bolts are critically important: their breakage causes collapse due to an unforeseen mechanism of stress transfer. The average effect is near zero because the number of critical bolts is so small, but the interaction of the treatment (breakage of a bolt) with the location of the bolt has important nonzero components. The whole point of the study is to quantify that risk (i.e., find out how many such critical locations there might be).

          I think the very nature of the terminology “main effect” and “interaction” betrays a bias in the type of situation that is being analyzed in these kinds of designed experiments. Usually we’re not going to run an expensive designed drug trial unless we’ve already run a bunch of screenings and have some reason to believe that there is an average effect. On the other hand, in the bridge study, we’re really hoping that there is ZERO average effect thanks to designed-in redundancy; otherwise we’re going to close the bridge right away and go over it with a fine-toothed comb.

        • Thanks, Daniel, but I don’t see how your hypothetical example corresponds to a Neyman model.

          I repeat what I said about Fisher’s approach. He is not committed to models without interactions. He objects to models with interactions but not main effects. I think his approach is natural. You start with A, you move to A+B, you then move to A+B+A.B. Bayesian or frequentist, starting with A + A.B and moving to A+B+A.B makes no sense.

        • Stephen:

          I agree with you completely. A model with interaction but no main effect is weird. There are some settings with natural constraints where it makes sense to consider an interaction but no main effect, but such examples are unusual.

          My criticism of the “Fisher null hypothesis,” when applied to a family of null hypotheses which are inverted to obtain a confidence interval, is that it corresponds to a model of potentially large main effects (indeed, if the effects could not possibly be large they would generally not be studied at all, as the goal in such studies is to reject the zero hypothesis and thus demonstrate a real and important effect) but with no variation at all in the treatment effect. Such a model seems wrong to me. I can see how it can be useful, but I certainly don’t see it as preferable to a model in which the effect, if nonzero, is allowed to vary.

        • Both a “model with interaction but no main effect” and a model with main effect “with no variation at all in the treatment effect” are wrong (with possibly occasional exceptions).

          Though the first (given your very clear explanation) seems _much wronger_ … can’t remember why David Freedman still went for the first.

        • At the time that I was working on the beta-agonist formoterol, in the late 1980s and early 90s, the three main beta-agonists being used in general treatment of asthma were salbutamol (albuterol in the US), terbutaline and fenoterol. None of these had a proven action beyond 5 to 6 hours after treatment. We were looking for 12 hours’ duration. In proving, as we did, that formoterol had a 12-hour duration of action by measuring the forced expiratory volume in one second (FEV1) at 12 hours in comparison to salbutamol, I consider that very little harm was done in comparing a null model (in which formoterol had no effect) to an alternative (in which it had an additive effect). Of course we did not commit ourselves to literal belief in this model, but the most you would get to indicate that it was wrong would be an increase in variance in the formoterol group. Indeed, our usual habit was to log transform the data, a concession to this belief. However, while I concede that you could possibly improve on the simple classical shift comparison (on the log scale in our case) by somehow simultaneously allowing for some change in variance, I would really have to see a lot more by way of technical detail to concede that this had much value. One thing I am sure of: nobody would have entertained as remotely reasonable a biological model in which formoterol suppressed FEV1 at 12 hours in some patients and increased it in others to leave them on average no better.

        • I’m learning a lot from this. One of the things I’m learning is that terminology across disciplines is an issue. I gave an example (bridge testing) which I think had the character of a Neyman-type situation, but you say that it does not correspond to a Neyman model. I don’t disagree; I just don’t understand the disconnect. I think this comes down to some difficulty in the two of us agreeing on what is and what is not a Neyman-type situation, and I think it has to do with my experience mainly being outside the area of RCT-type situations.

          I also seem to be learning that the type of problems I work on have a fairly different quality from the kinds of things that “normal” statisticians usually work on. This is probably not that surprising. For example:

          another situation I’ve worked on where I can imagine a natural zero average effect and yet substantial variance in individual effects is finance. Perhaps a company has a variety of news items that come out, the company historically tracks its industry benchmark fairly well. We are interested in an efficient market hypothesis, where it is not possible to consistently beat the market thanks to arbitrage. We expect the average effect of news to be zero (relative to the benchmark) yet each news item might cause a rise or fall in stock price. The difference in log returns between the company and the benchmark would be expected to be zero in the long run, yet the news items might easily contribute to increased variance in returns (risk) and this risk is exactly what we’re interested in predicting.

          Perhaps one of the things that is different is that I often work on problems where there is some kind of human intervention that might force average effects to zero. For example feedback control systems, or engineered physical designs, etc.

          I feel like I could probably come up with examples like this all day long. Of course they’re not the *only* examples of interesting areas of research, but I don’t feel like they’re so “weird”. On the other hand, you seem to feel that there’s something essentially different about my examples. I don’t disagree but I am intrigued by the disconnect.

  15. To return to my model framework a few posts ago, the issue is really whether the order in which hypotheses should be asserted is F0 then F1 then F2 and whether the Neyman hybrid N0 (interactions but no main effect) can ever be reasonable. (The issue is NOT that Fisher would never countenance F2.) Also relevant is the practical problem that any Bayesian interested in prediction has to have highly informative priors that most interactions are small. Otherwise there would be no point in using the results of clinical trials to inform medical practice (the effect in any possible subgroup one has not studied could be anything at all).

    Although I am not a great fan of Jeffreys’s approach to modelling I do think that it was a sign of his genius that he realised that uninformative priors would only work as a system if embedded in an approach whereby they were conditionally uninformative given the model postulated AND one had an a priori belief that simple models were more likely than complex ones. Naive Bayesians (which is not what I accuse you of being) seem to think one can survive on uninformative priors only. This is impossible since you have a highly informative prior distribution that the effect is zero of every single thing (and the list is endless) that is not in your model.

    I think that Jeffreys would find Fisher much more reasonable than Neyman.

    • I agree, choosing N important variables and putting vague priors on them essentially means that all the other possible variables (around 10^80 atoms in the visible universe for example) have a strong prior on them as being irrelevant.

      In my work, I always use mildly informative priors for pretty much everything. Usually the information comes in the form of rescaling variables via an order of magnitude type calculation and then putting a prior on the rescaled variable that its magnitude is somewhere in the vicinity of 1. But the stuff I do is often mechanistic, building newtonian physical models for stuff, and in this context my approach is probably a lot easier than in the context of say drug trials.

    • Stephen:

      You write, “I think that Jeffreys would find Fisher much more reasonable than Neyman.”

      I too find Fisher much more reasonable than Neyman! But there are places where Fisher took methods that worked well in his problems and tried to extract general principles that did not make so much sense. Also there are places where later statisticians took Fisher’s ideas too far. An example, I think, is the inversion of hypothesis tests using the so-called Fisher null hypothesis.

      I’ll say this: If you want to invert a series of hypothesis tests to obtain a confidence interval, don’t use the Fisher null hypothesis. If you want to use the Fisher null hypothesis, use it only as a null hypothesis, not as a family of null hypotheses for the purpose of obtaining inference about a nonzero parameter.

      • There is a tradition of this “If you want to use the Fisher null hypothesis, use it only as a null hypothesis” amongst some statisticians dealing with randomised clinical trials [e.g. it is only/most important to get statistical properties correct under the null]. In fact, Don Rubin would often say things like: if there is an effect, you are not going to learn much (generalizable) about it in a standard RCT.

        (This provides a rationale for why Neyman’s model is _much wronger_: a wonky null is considered worse than a wonky alternative.)

  16. I’m replying to K? O’Rourke’s comment:
    http://statmodeling.stat.columbia.edu/2013/05/24/in-which-i-side-with-neyman-over-fisher/#comment-146573

    From Freedman’s writings (including the excellent chapters on significance tests in the Freedman-Pisani-Purves textbook), my impression is that (1) he preferred confidence intervals to tests, and (2) he found Neyman’s framework useful for studying the properties of CIs for average treatment effects under relatively weak assumptions.

    I agree with Andrew in preferring Neyman to Fisher (on this issue) and estimation to testing. Fisher-style permutation tests are valid for the strong null hypothesis that treatment had no effect on anyone, but in general they’re sensitive to certain kinds of departures from the strong null and not others. These properties complicate the tests’ interpretation and are probably not well-known to most of their users (but see the Chung & Romano papers below and their references for tests that are exact for a Fisher-style null and asymptotically valid for a Neyman-style null.) These issues are relevant not only to tests based on the difference in means, but also to rank tests such as the Wilcoxon-Mann-Whitney.

    This may all be cryptic, so here are some papers I’ve found helpful.

    PRATT, J. W. (1964). Robustness of some procedures for the two-sample location problem. J. Amer. Statist. Assoc. 59 665–680.

    GAIL, M. H., MARK, S. D., CARROLL, R. J., GREEN, S. B. and PEE, D. (1996). On design considerations and randomization-based inference for community intervention trials. Stat. Med. 15 1069–1092.

    STONEHOUSE, J. M. and FORRESTER, G. J. (1998). Robustness of the t and U tests under combined assumption violations. J. Appl. Stat. 25 63–74.

    REICHARDT, C. S. and GOLLOB, H. F. (1999). Justifying the use and increasing the power of a t test for a randomized experiment with a convenience sample. Psychol. Meth. 4 117–128.

    BRUNNER, E. and MUNZEL, U. (2000). The nonparametric Behrens-Fisher problem: Asymptotic theory and a small-sample approximation. Biom. J. 42 17–25.

    NEUBERT, K. and BRUNNER, E. (2007). A studentized permutation test for the non-parametric Behrens–Fisher problem. Comput. Statist. Data Anal. 51 5192–5204.

    HO, A. D. (2009). A nonparametric framework for comparing trends and gaps across tests. J. Educ. Behav. Stat. 34 201–228.

    ROMANO, J. P. (2009). Discussion of “Parametric versus nonparametrics: Two alternative methodologies” by E. L. Lehmann. J. Nonparametr. Stat. 21 419–424.

    FAY, M. P. and PROSCHAN, M. A. (2010). Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules. Statistics Surveys 4 1-39.

    CHUNG, E. Y. and ROMANO, J. P. (2011). Asymptotically valid and exact permutation tests based on two-sample U-statistics. Technical Report 2011-09, Dept. of Statistics, Stanford Univ.

    CHUNG, E. Y. and ROMANO, J. P. (2012). Exact and asymptotically robust permutation tests. Ann. Statist. To appear.

    • Thanks, I was unable to find Freedman’s draft that I _recall_ focussed directly on Fisher versus Neyman null but I do recall material on both being in that very insightful _introductory_ statistics book of his!

      I am aware of some of the references you have provided. Here is an additional one that has nice graphs showing some of the complications http://journal.r-project.org/archive/2010-1/RJournal_2010-1_Fay.pdf

      But I believe that, except as an initial start on the analysis in randomised studies, nulls of any variety do not make much sense at all. That is, with nonrandomised comparisons, or once one realises there was more than just a trivial amount of informative non-compliance, unblinding, or missingness in the randomised studies, a zero-on-average contrast makes no sense – is that when the confounding perfectly cancels out the unknown true effect of what the contrast is trying to get at?

      As an aside, with the Freedman-Pisani-Purves textbook after I had covered the Fisher null in a general first year undergrad stats course, I asked one of the more experienced faculty how they had handled the Neyman null. He replied he always omitted both when he taught – way too difficult for undergrads.

  17. Example attempt (in which I also attempt to determine whether or not I have any understanding of this debate):

    An example where the main treatment effect “ought” to be constrained to 0, while sub-group treatment effects “ought” to be allowed to vary in any manner, yes? That is what we are trying to construct? With “ought” meaning something like “we’d get a more efficient estimator?”

    OK, here’s a try: the effect of Obama saying “We will have universal, single-payer health care next month in this country” on attitudes towards Obama. This could quite definitely have a 0 mean effect across the population. But if you interacted “treatment” with a Liberal and Conservative dummy (and everyone in your sample was defined to be one of the two) it seems pretty likely to me that you’d get one strong positive treatment effect, and one strong negative one. By excluding the main effect (which, if estimated, would be real close to 0) and thereby forcing it to be 0, you’d get more efficient estimates on your two sub-group treatment effects. Right?

    Personally, I always think of a treatment effect as an average across a distribution of treatment effects. That’s because I think people are all weird and different from each other. Now here is where I’m lost: is this simply a metaphysical argument, or is there a fundamental mathematical difference that relates to the choice between using analytic standard errors (say, sandwich estimators) and using permutation tests? Because I don’t get that part. And it seems like there is something there I’m missing in this conversation.

    • jrc, yes, there’s a fundamental difference between sandwich SEs and the usual permutation tests.

      The sandwich SE allows any form of heteroskedasticity, so in particular, it allows treatment to affect the variance of the outcome.

      The Fisher-Pitman permutation test looks at the distribution of the difference in means under a null hypothesis in which treatment has no effect on anyone’s outcome (and therefore no effect on the variance). The inverted-test confidence intervals that Andrew’s criticizing are typically based on a family of nulls, each of which assumes a homogeneous treatment effect (so again, treatment has no effect on the variance of the outcome).

      The Fisher-Pitman test is asymptotically equivalent to the pooled-variance two-sample t-test, while a test based on the sandwich SE (in an OLS regression of the outcome on an intercept, a treatment dummy, and nothing else) is asymptotically equivalent to Welch’s unequal-variances t test.

      Suppose treatment actually increases the variance of the outcome. What do we lose by using the pooled-variance t-test? We’ll underestimate var(y1) and overestimate var(y0), where y1 is the outcome with treatment and y0 is the outcome in the absence of treatment. With a balanced design, two wrongs make a right and we still have a consistent estimate of the SE of the difference in means. But if the treatment group is smaller than the control group, the more important wrong is our underestimation of var(y1), and we’ll tend to underestimate the SE of the difference in means. Similarly, if the treatment group is larger, we’ll tend to overestimate that SE.

      (Everything I’ve just said is focused on large-sample properties. In small samples, sandwich SEs can have substantial downward bias and high variability.)

      Some of the references I cited in my other comment (such as Gail et al., Stonehouse & Forrester, and Reichardt & Gollob) are helpful on these issues, and also this paper by Peter Aronow & Cyrus Samii:

      https://files.nyu.edu/cds2083/public/docs/samii_aronow_equivalencies_white.pdf

      A newish approach is to do a studentized permutation test–e.g., instead of using the permutation distribution of the difference in means, use the permutation distribution of a heteroskedasticity-robust t-statistic (such as the difference in means divided by the sandwich SE). This is discussed in Chung & Romano’s Annals of Stats paper:

      http://arxiv.org/abs/1304.5939
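
      A minimal sketch of that studentized variant (permuting a Welch-type t statistic rather than the raw difference in means; illustrative code only, not the authors’ own):

import numpy as np

rng = np.random.default_rng(0)

def welch_t(y1, y0):
    """Heteroskedasticity-robust t statistic: difference in means over the unpooled SE."""
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
    return (y1.mean() - y0.mean()) / se

def studentized_perm_test(y1, y0, n_perm=2000):
    """Permutation test that permutes the studentized statistic (cf. Chung & Romano)."""
    pooled = np.concatenate([y1, y0])
    n1 = len(y1)
    obs = welch_t(y1, y0)
    stats = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(pooled)
        stats[b] = welch_t(perm[:n1], perm[n1:])
    return np.mean(np.abs(stats) >= abs(obs))  # two-sided p-value

# made-up unbalanced, unequal-variance example
print(studentized_perm_test(rng.normal(0, 20, size=30), rng.normal(0, 10, size=300)))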

  18. So here’s my summary, an attempt to build consensus:

    1) The Bayesians in the crowd seem to be uninterested in hypothesis testing in general and prefer to focus on effect sizes.

    2) In the case of confidence/credible interval calculations, the Neyman model corresponds to constraining the average effect over a population to zero (the “main effect”), while allowing effects that are specific to sub-categories of observations to vary (the “interactions”). Some of us find this a very odd model, others seem to find it at least not as odd, or even a fairly normal idea. I, and a few others, identified some cases where it seems more normal to me in part because there is feedback or engineered design present intended to ensure that the average effect was near zero (engineering design, feedback control, arbitrage, and in jrc’s example political competition). Such “mechanism design” is not normally present in RCT drug trials. (Note: the “set point” theory for weight in which the brain supposedly has feedback mechanisms to maintain stable weight might be such a case in RCT of weight loss methods)

    3) The biggest point that I take from Stephen Senn’s posts is that we should probably not START with constrained main effects; instead allow them to be nonzero, fit models, ensure that those models result in small main effects, and if not, ensure that we *really mean it* when we proceed to fit a null main effect model. This makes pretty good sense and is probably why the Neyman model is thought of as wacky.

    4) Andrew dislikes the sharp null of Fisher for inverting hypothesis tests to obtain confidence intervals. I do this so rarely that I’m not even sure what the difference between the two cases really looks like for this purpose. Prior to learning about Bayesian MCMC computation, I almost always worked with maximum-likelihood-based or bootstrapping intervals, or accepted whatever fell out of canned R procedures for simple cases.

  19. It is easy to understand the difference between the two viewpoints by formulating the problem in the potential outcomes framework.

    Let Y_i(0) be the outcome of subject i when assigned to treatment 0, and let Y_i(1) be the outcome of subject i when assigned to treatment 1. Though in reality we can observe only Y_i(0) or Y_i(1), we can conceptually define the individual causal effect as the difference Delta_i = Y_i(1) – Y_i(0).

    In the Fisher perspective, Delta_i should be the same for all i. In the Neyman perspective, Delta_i may differ. Phrased in other words: Fisher assumes a correlation between potential outcomes of 1, whereas Neyman leaves it unspecified. I would say the latter is more general, so I side with Neyman.

    The key difference is the correlation r(Y(0),Y(1)) between the two potential outcomes. It is interesting that we can actually estimate r(Y(0), Y(1)) if we impute the missing potential outcome. This possibility provides a new handle to identify treatment heterogeneity.

    • Stef:

      That’s exactly what I was thinking. If the avg treatment effect is zero, I can see it being zero everywhere, but if the avg is not zero, it’s hard for me to imagine the effect being a constant.

      • Andrew, thanks.

        Just a small addition. If the average effect is zero, then the effect can be zero for everybody, but this need not be the case. Suppose we have a pill that makes 50% of the people happy, and 50% of the people depressed. The average causal effect will be zero, and we would conclude that the pill is not effective, even though it is effective for 100% of the people.

        I think you are right that it is generally harder to assume that an effect is homogeneous when it is nonzero.

        • I have already given reasons for suggesting that it is impossible for the average effect to be zero if individual values are not. In your example, suppose that there are 200 people in the trial. Even if by some miracle it were true for the 200, if we recruit one more, it’s not true. It’s also not true for most of the randomly chosen subsets of any given size of the patients recruited.

          I agree that if there is a treatment effect it is unlikely to be constant on the scale chosen but Neyman’s model does not make this better.

        • Stephen:

          I am no huge fan of Neyman’s model but I really don’t like the so-called Fisher model of constant effects, and what I really don’t like is the combination of the two approaches in which the Fisher-inspired model of constant effects is combined with the Neyman-inspired approach of inverting hypothesis tests to obtain a confidence interval.

        • Stephen,

          I was not suggesting that we constrain the effect estimate to zero. In your example, if person 201 comes in, and this person happens to be within the ‘happy’ group, then the average causal effect will equal ACE = ((101 * 1) + (100 * -1)) / 201 = 1/201. It is true that the ACE cannot be exactly zero here, but it will probably be close enough to make the test result insignificant. Hence, we are still in the situation where we have a test that says ‘no effect’, while in fact there is an effect for 100% of the persons.

        • However, the disagreement between Fisher and Neyman was not about what the test said but about what the null hypothesis assumed. Neyman’s hypothesis assumes that the effect is literally exactly zero but the interaction is not. So your situation does not satisfy the condition of a Neyman null hypothesis.

          I agree with Andrew that assuming the interactions are zero can be harmful in many situations. Here is a case where I argued exactly that: http://onlinelibrary.wiley.com/doi/10.1002/sim.4780100905/abstract
          But if the component of variation is not identifiable, it may not be. On the other hand, if you are a Bayesian, putting an uninformative prior distribution on an interaction can be a very bad idea.

        • Stephen, thanks. That clears it up. I was starting from the much older Neyman 1923 paper where he introduced potential outcomes, a perspective that I find useful since it allows us to express treatment heterogeneity as the correlation between potential outcomes.
