One-tailed or two-tailed?

Someone writes:

Suppose I have two groups of people, A and B, which differ on some characteristic of interest to me; and for each person I measure a single real-valued quantity X. I have a theory that group A has a higher mean value of X than group B. I test this theory by using a t-test. Am I entitled to use a *one-tailed* t-test? Or should I use a *two-tailed* one (thereby giving a p-value that is twice as large)?

I know you will probably answer: Forget the t-test; you should use Bayesian methods instead.

But what is the standard frequentist answer to this question?

My reply:

The quick answer is that different people will do different things here. I would say the two-tailed p-value is more standard, but some people will insist on the one-tailed version, and it’s hard to take a strong stand on this one, given all the other problems with p-values in practice:
http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf
http://www.stat.columbia.edu/~gelman/research/published/pvalues3.pdf
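
For readers who want to see the arithmetic, here is a minimal sketch (my addition, using simulated data; scipy 1.6+ is assumed for the alternative argument): when the observed difference is in the hypothesized direction, the two-tailed p-value is exactly twice the one-tailed one.

```python
# Minimal sketch: one-tailed vs. two-tailed t-test on made-up data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
A = rng.normal(0.5, 1.0, size=30)  # group A, hypothesized to have the higher mean
B = rng.normal(0.0, 1.0, size=30)  # group B

t, p_two = stats.ttest_ind(A, B)                         # two-tailed
_, p_one = stats.ttest_ind(A, B, alternative="greater")  # one-tailed: mean(A) > mean(B)
print(f"t = {t:.2f}, two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
# When t > 0, p_two equals 2 * p_one (up to floating point), which is the
# "twice as large" factor mentioned in the question.
```
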
P.S. In the comments, Sameer Gauria summarizes a key point:

It’s inappropriate to view a low P value (indicating a misfit of the null hypothesis to data) as strong evidence in favor of a specific alternative hypothesis, rather than other, perhaps more scientifically plausible, alternatives.

This is so important. You can take lots and lots of examples (most notably, all those Psychological Science-type papers) with statistically significant p-values, and just say: Sure, the p-value is 0.03 or whatever. I agree that this is evidence against the null hypothesis, which in these settings typically has the following five aspects:
1. The relevant comparison or difference or effect in the population is exactly zero.
2. The sample is representative of the population.
3. The measurement in the data corresponds to the quantities of interest in the population.
4. The researchers looked at exactly one comparison.
5. The data coding and analysis would have been the same had the data been different.
But, as noted above, evidence against the null hypothesis is not, in general, strong evidence in favor of a specific alternative hypothesis, rather than other, perhaps more scientifically plausible, alternatives.

103 thoughts on “One-tailed or two-tailed?”

  1. Why do a statistical test at all? Your hypothesis is that group A has a higher mean value of X than group B. Just take the averages and make your judgment.

    Now, if your groups A and B are samples from something else, yada yada, that’s a different story.

    • This is correct if you have samples for _everyone_ in group A and _everyone_ in group B. Which is probably not what was meant in the original question.

      • But the question did mean _everyone_: “for each person I measure a single real-valued quantity X”. So I think Tom’s answer is correct. Maybe the groups are part of a larger population.

  2. These two arguments have the same structure:

    1) “If Bill Gates owned Fort Knox, then he would be rich.
    Bill Gates is rich, therefore he owns Fort Knox.”

    2) “If my characteristic of interest affects the outcome I am measuring, then Group A and Group B will be different.
    Group A and Group B are different, therefore my characteristic of interest affects the outcome.”

    Actually #2 is closer to this one:
    “If Bill Gates owned Fort Knox, then he would be able to afford a pair of socks.
    Bill Gates can afford a pair of socks, therefore he owns Fort Knox.”

    • The one-tailed test is only slightly better.

      “If Bill Gates owned Fort Knox, then he would have at least $3k in the bank.
      Bill Gates has at least $3k in the bank, therefore he owns Fort Knox.”

      • These objections to the argument structure are compelling, but only if accepting/rejecting a hypothesis test is read as something nearly the same as “believe that H is true (false)”. If one instead paraphrases “accept” as “believe it remains plausible that” and reject as “believe it is implausible”, the argument structure you give isn’t so bad any more (or at least not bad in the same way). Not that I’m saying this is what accept/reject mean to a NHST practitioner either (honestly I have no idea, though there’s good reason to say that it’s not supposed to be as simple as “believe is true/false”). But I think this does show your objection needs an assumption spelled out a bit more explicitly.

        • bxg,

          Take any two groups of people and split them up randomly. Attempt to treat them exactly the same. Then measure something about them. What would you measure that would result in a plausible belief that they are EXACTLY the same on average? The only time this works is when the null hypothesis is plausible. For example, zero ability to bend spoons with their mind.

          In almost all use cases the hypothesis that two groups of people are exactly the same is not plausible to begin with. Because of this, the assumption of the test that (one sided) p-values follow a uniform distribution when your treatment does nothing is incorrect. Instead the distribution when your *research* hypothesis is false is now a function of sample size and p-hacking techniques.
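
          A quick simulation (my own sketch with arbitrary numbers, not part of the original comment) illustrates the point: when the difference is exactly zero the one-sided test rejects at roughly the nominal 5% rate at any sample size, but with even a trivially small true difference the rejection rate becomes a function of n.

          ```python
          # Sketch: one-sided rejection rates when the null is exactly true vs. only approximately true.
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(1)

          def rejection_rate(true_diff, n, reps=2000, alpha=0.05):
              hits = 0
              for _ in range(reps):
                  a = rng.normal(true_diff, 1.0, n)
                  b = rng.normal(0.0, 1.0, n)
                  hits += stats.ttest_ind(a, b, alternative="greater").pvalue < alpha
              return hits / reps

          for n in (20, 200, 2000):
              print(n,
                    rejection_rate(0.0, n),    # exactly-zero difference: stays near 0.05
                    rejection_rate(0.05, n))   # tiny true difference: grows with n
          ```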

    • These two arguments have the same structure:

      1) “If Bill Gates owned Fort Knox, then he would be rich.
      Bill Gates is rich, therefore he owns Fort Knox.”

      2) “If my characteristic of interest affects the outcome I am measuring, then Group A and Group B will be different.
      Group A and Group B are different, therefore my characteristic of interest affects the outcome.”

      There is an important difference between what you have written, above, and what the deductive analogue of a conventional hypothesis test is. In the latter, we assume the null hypothesis is true and argue to a contradiction. That is, we would argue:

      “If Bill Gates did not own Fort Knox, then he would not be rich.
      Bill Gates is rich; therefore, he owns Fort Knox.”

      That is a valid deductive argument. It is unsound, of course, because the first premise is false.

      A conventional null hypothesis significance test (NHST) is a sort-of probabilistic analogue of the above argument. That is, we assume that the null hypothesis is true, and if we observe data that is improbable under it (ie, a test statistic in the tail of the distribution), we argue that the null has been contradicted. Except that’s not valid. Such a probabilistic version of the argument requires us to take into account the prior probabilities of the null and alternative hypotheses, factors that NHST famously ignores.

      • Jay,

        We should also not confuse the research hypothesis with the statistical hypothesis. Your procedure (the common, strawman one) adds the extra step of making the statistical hypothesis the opposite of what is predicted by the research hypothesis. This is not part of NHST per se, but it is part of the strawman nil NHST procedure.

        We should distinguish between the two, because this extra step means that it is impossible to ever falsify the research hypothesis. Either
        1) The result is “not significant” and we can say nothing about whether a difference exists. (A layer of confusion exists here due to the hybrid of Fisher with the Neyman-Pearson inductive behaviour paradigm)
        Or
        2) The result “is significant” and we have made a Modus Tollens argument against the statistical hypothesis, but can only affirm the consequent in terms of our research hypothesis.

        The further problem regarding transposing the conditional then occurs on top of this. This is where the bayesian vs NHST argument comes in. A Bayesian procedure that “disproves” the strawman opposite of the research hypothesis still suffers the same flaw noted above.

        However, I argue that the logical problem above is actually not the most fundamental flaw. Even more fundamental is that the effect size estimate includes information about both the direction and magnitude, while the output of NHST is only existential (an effect exists) or directional (+/-).

        Since we always have an effect size estimate anyway, it is impossible that the extra “significance test” aspect ever performs a useful function. I am talking specifically about the case when the null hypothesis is a strawman.

        • I am open to the argument that somehow strawman NHST allows us to distinguish “signal” from “noise”, but I have found all who have attempted to convince me of this believe strange, even insane, myths about what they are doing. For example, they attribute unexplained magical powers to “the math” or claim that concluding “H0 is false” does not suggest “H0 is false” is true.

        • Concluding H0 is false surely implies you believe that “H0 is false” is true.
          So should I presume you are talking about someone who (in the sense of NHST) “rejects” H0? (Of course “rejects” here is a technical term, not an English word, except where it is convenient to the statistician to allow the confusion.)

          But still…

          For someone to claim that an NHST “rejection” does not even _suggest_ that H0 is false? I know NHST rejection is weird, but this seems even more damning than any other criticism I’ve seen. Can you give an example where someone might feel the need to say something so strange?

        • (Previous link not from me; is it a reply to me from ‘question’?)

          Linked page contains such things as:

          “If someone has arguments in favor of the usefulness of strawman NHST that do not require thinking ‘reject the hypothesis that the rate of warming is the same, but DO NOT take this to imply that the rate of warming is the not same’ is a rational statement, I would welcome your comments.”

          That’s not at all related to what I thought ‘question’ was saying, which was that “rejecting” H_0 might not even _suggest_ H_0 is false! Suggest != imply. I wouldn’t put anything past NHST, but I’d like to know if this (the ‘suggest’ form of the criticism) is true.

          As invited by the linked page, it’s easy to comment on the global warming statement. “Rejection” here does not mean what it does in English, and with this understanding the statement is not irrational (it may not be very useful though). If you fall for the central con of statistics, which is that technical terms can safely be used as ordinary speech (significance, bias, accept, reject, etc.), you are in for a lifetime of confusion. You’ll think relatively smart people are idiots for saying things that are almost insanely irrational, whereas they aren’t saying what you think they are, and your criticism needs to be somewhat more charitable and nuanced to even be on the right page.

        • bxg,

          I don’t know what happened with the names. If the technical meaning of “reject no change” does not even imply “there is a change” (its logical opposite), what could possibly be the purpose of “rejecting”?

          Suggest/Indicate/Imply/whatever. I understand the Neyman-Pearson hypothesis test perspective in which one does not draw conclusions, but I don’t think that was what was being discussed in that link.

        • It’s not strange at all. Say you know a priori that 3 dice are being rolled.

          Let’s say your measurement is the total of the dice rolls. That follows a not-quite-normal distribution. You observe the number 18, which is on the far right tail of the distribution. The probability of observing a number as large or larger than 18 is small (p = 0.005).

          However, you know a priori that P(H0) = 1, so even though H0 is rejected, it does not suggest that H0 is false. This is not as artificial an example as it might seem. You can have less extreme versions, say P(H0) = 0.99, that basically follow the same argument. These sorts of situations are ubiquitous in research (the prior probability of the null hypothesis being high).
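
          The dice numbers can be checked by brute-force enumeration; this little sketch (my addition) reproduces the p = 0.005 figure quoted above.

          ```python
          # Sketch: exact tail probability for the sum of three fair dice.
          from itertools import product

          outcomes = list(product(range(1, 7), repeat=3))
          p = sum(1 for roll in outcomes if sum(roll) >= 18) / len(outcomes)
          print(p)  # 1/216 ≈ 0.0046, i.e. the p ≈ 0.005 quoted above
          ```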

        • Anon,

          Ok, but in that case why would you reject the null hypothesis? Clearly the cutoff should be adjusted until rejecting the hypothesis does imply that it is false. This seems to be just another way to misuse NHST.

          The procedure has two outcomes. You reject the hypothesis or you don’t reject. If you are going to come to the same conclusion either way because your cutoff is too lenient, what is the point?

        • How do you suggest adjusting the cutoff to account for the prior probability? If you’re going to account for the prior, then use the prior in the first place!

          By the way, imo a lot of the multiple comparison adjustments are basically roundabout ways of accounting for the low prior probability in these large-scale inference settings without admitting to being Bayesian.

        • Anon,

          “How do you suggest adjusting the cutoff to account for the prior probability? If you’re going to account for the prior, then use the prior in the first place!

          By the way, imo a lot of the multiple comparison adjustments are basically roundabout ways of accounting for the low prior probability in these large-scale inference settings without admitting to being Bayesian.”

          My position is that if you do not have a hypothesis worth testing (one that predicts much more precise outcomes than a coinflip), testing the hypothesis is worthless. Instead you describe and explore the data. You either hide it under your bed or share with others in the hope they come up with a hypothesis worth testing based on the data.

          I also claim that disproving the opposite of your research hypothesis and taking that as evidence for your research hypothesis is not something that has ever been useful to anyone at any time in history, and that it is actually impossible that it ever could have been. I am not 100% sure about this last point and am trying to incite responses to it.

        • The rejection of a null hypothesis in a Neyman-Pearson hypothesis test (which is where the term applies) most certainly does not imply, nor necessarily suggest, that the null is false. Nor does the null being true imply that rejecting it is the wrong decision, since the decision to reject or accept should be based on Type 1 and Type 2 error probabilities determined by consideration of the relative costs of those errors.

          A classic example of the compatibility of rejecting a null hypothesis when you know that the null is probably true comes from testing for disease. If the prevalence of the disease is low, then a positive test result will usually be false, even if the test is highly accurate. Now, in addition, if the consequences of treating the patient for the disease when the disease is not present are low, but the consequences of failing to treat when the disease is present are high, then the correct decision is to treat; that is, to reject the null hypothesis of no disease, even though the null hypothesis is probably true.
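
          To make the disease-testing example concrete, here is a small worked calculation with illustrative numbers of my own choosing (1% prevalence, 95% sensitivity and specificity): a positive result from this quite accurate test is still usually a false positive.

          ```python
          # Sketch: Bayes' rule for a rare disease and an accurate test (made-up numbers).
          prevalence = 0.01    # P(disease)
          sensitivity = 0.95   # P(positive | disease)
          specificity = 0.95   # P(negative | no disease)

          p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
          p_disease_given_positive = sensitivity * prevalence / p_positive
          print(round(p_disease_given_positive, 2))  # ≈ 0.16: the null ("no disease") is still probably true
          ```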

          I’m sure that Deborah Mayo could help me out here. Unfortunately, I doubt she will after having been subjected to such offensive treatment by Entsophy.

        • Jay,

          Are there any conditions you believe should be met before one performs a N-P hypothesis test rather than a fisher significance test? Or do you believe these are essentially interchangeable approaches?

        • [It appears that this reply will appear above the comment it is in reply to.]

          @question:

          Are there any conditions you believe should be met before one performs a N-P hypothesis test rather than a fisher significance test? Or do you believe these are essentially interchangeable approaches?

          I think that Neyman-Pearson hypothesis tests are best-suited to answer the question “What should I do?” rather than “What is true?” The result of a Fisherian test, a p-value, doesn’t answer either of those questions very well. Unfortunately, a low p-value is often misinterpreted as a Bayesian posterior probability. Fortunately, if the prior probability of the alternative is not too low, the null will often be false if the p-value is low enough.

        • Jay,

          I was nearly driven off topic here, but it is my own fault. The situation you describe appears to be one in which there is a valid cost-benefit tradeoff and a valid null hypothesis. I am not taking issue with that. However, if the null hypothesis is something considered by most to be implausible even before the study is performed, I am looking for an argument as to how exactly this is more useful than simply estimating the effect size.

        • “I’m sure that Deborah Mayo could help me out here. Unfortunately, I doubt she will after having been subjected to such offensive treatment by Entsophy.”

          Mayo is a fraud. She’s done no applied statistics and has a weak freshman calc knowledge of math at best. On that basis she passes herself off as a statistics guru and tries to convince people Bayes should be abandoned wholesale (and sometimes succeeds).

        • Joseph:

          Please be polite. Each of us has our strengths and weaknesses. I am interested in the philosophy of statistics and find it relevant to statistical practice but I know very little of the recent philosophical literature. Mayo has expertise in this area and I very much appreciate that she participates in our discussions.

        • Entsophy: you yourself have said something along the lines of “everyone who passes freshman calculus is a statistician” which I took to mean that lack of knowledge of advanced mathematics isn’t really a major barrier to understanding.

          In my experience it seems as if you and Mayo talk past each other. Though I think you understand what Mayo is saying and simply disagree with it, it seems to me that she doesn’t understand or engage with what you are saying very much. I have no idea really if she’s even tried, she may have written you off as being basically a troll.

          I honestly find Mayo’s stuff extremely hard to read. It’s like the old SF Chronicle columnist Herb Caen. People loved his style, but I’m not one of those people. Everything was written as kind of an in-joke between the reader and the writer, and if you weren’t familiar with the writer’s circle of friends and experiences… it was more or less opaque. That’s how I find all her “comedy hour” references and so forth. Your writing is more or less the opposite of that, simple, direct, and to the point. I’m not surprised the two of you rankle each other.

          One thing that’s not clear to me though is whether Mayo does in fact “[try] to convince people Bayes should be abandoned wholesale”. Honestly her writings are so difficult for me to read it’s not clear to me whether she’s saying something like “look, frequentist inference using my severity principle is a valid way to do science” or more like “look, frequentist inference using my severity principle is how to do science and Bayes isn’t” which is something fairly different.

        • @daniel: Thank you. I’ve been trying to follow her arguments but I have a really hard time with them, mainly because, as someone who practices Bayesian statistics, if someone has a valid criticism of a fundamental flaw in or argument against my approach, I want to hear them out.

          I’m glad at least a smart guy like you has the same issue and it’s not just me.

        • my biggest pet peeve is people who talk about “noise” or “randomness” as if it’s well defined when they are actually talking about a very specific noise signal.

          Logically, these people dichotomize things such that if it’s not consistent with a particular noise signal (the one defined by the null being tested), they conclude “cannot be explained by chance”. But it could be under a different noise model!

          Conversely, if a failure to reject is found they say, “oh it’s just random.” No it’s not, it’s consistent with your particular model of randomness.

        • For example, they… claim that concluding “H0 is false” does not suggest “H0 is false” is true.

          But that’s just what the tortoise said to Achilles!

          “And at last we’ve got to the end of this ideal racecourse! Now that you accept A and B and C and D, of course you accept Z.”
          “Do I?” said the Tortoise innocently. “Let’s make that quite clear. I accept A and B and C and D. Suppose I still refused to accept Z?”
          “Then Logic would take you by the throat, and force you to do it!” Achilles triumphantly replied. “Logic would tell you, ‘You can’t help yourself. Now that you’ve accepted A and B and C and D, you must accept Z!’ So you’ve no choice, you see.”
          “Whatever Logic is good enough to tell me is worth writing down,” said the Tortoise. “So enter it in your notebook, please. We will call it
          (E) If A and B and C and D are true, Z must be true.
          Until I’ve granted that, of course I needn’t grant Z. So it’s quite a necessary step, you see?”
          “I see,” said Achilles; and there was a touch of sadness in his tone.

  3. Andrew,

    I am probably one of the least statistically literate readers of this blog, and (probably hence) I found this line in the second paper especially insightful (slightly rephrased):

    … it’s inappropriate to view a low P value (indicating a misfit of the null hypothesis to data) as strong evidence in favor of a specific alternative hypothesis … rather than other, perhaps more scientifically plausible, alternatives …

    Thanks!

    • Using a Bayesian approach, you can calculate the probability that the mean of A is greater than the mean of B.

      With the classical approach, you get the probability that, if the means are equal, you would have seen a result at least as extreme as the one that you saw. That’s rarely what you’re interested in.

      To answer the original questioner, though: I think the textbook answer to your question is that you should do a one-tailed test, because you’re not just interested in whether A-B is consistent with 0; you want to know whether its inconsistency with 0 is in a specific direction.
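
      One way to make the first claim concrete, as a sketch only (a normal model with the standard noninformative prior, not necessarily the model Phil has in mind): the marginal posterior for each group mean is a shifted, scaled t distribution, and P(mean_A > mean_B) can be estimated by drawing from both posteriors.

      ```python
      # Sketch: Pr(mean_A > mean_B | data) under a normal model with the standard
      # noninformative prior p(mu, sigma^2) ∝ 1/sigma^2 for each group (made-up data).
      import numpy as np

      rng = np.random.default_rng(2)
      A = rng.normal(0.5, 1.0, 30)
      B = rng.normal(0.0, 1.0, 30)

      def posterior_mean_draws(x, ndraws=100_000):
          # Marginal posterior of the mean: x_bar + (s / sqrt(n)) * t_{n-1}.
          n, xbar, s = len(x), x.mean(), x.std(ddof=1)
          return xbar + (s / np.sqrt(n)) * rng.standard_t(n - 1, size=ndraws)

      prob = np.mean(posterior_mean_draws(A) > posterior_mean_draws(B))
      print(f"Pr(mean_A > mean_B | data) ≈ {prob:.3f}")
      ```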

      • Phil,

        See the comments above. Even if you can correctly say “the mean of A is greater than the mean of B”, this does not allow you to argue that “the mean of A is greater than the mean of B because of the reason I think it is”. To make that argument plausible requires ruling out alternative explanations. Even proper randomization with large samples only makes it *unlikely* that mean(A) >> mean(B), where “unlikely” and “much greater” are ill defined.

      • The Bayesian can “calculate the probability that the mean of A is greater than the mean of B”.
        Oh really? And where do you get your prior? How do you interpret it? Your degree of belief about the means? Why would I want to mix that into the data analysis? The most highly criticized studies are conducted by true believers…. If it’s a frequentist prior, where does that come from? Or is it just one of these priors “I know not what it is” except that it’s a mathematical construct used for a posterior? Being a Bayesian apparently means never having to say what your priors (and therefore your posteriors) signify; they just know it’s better somehow. The frequentist error statistical test or estimate at least has control over seriously misleading interpretations. You may say the posterior is what you’re interested in, but I fail to see why we’d want to mix your (it-means-whatever-I-want-it-to-mean-but-I-cannot-tell-you) prior in with the data analysis. I didn’t think Andrew wanted to either…

        • “The frequentist error statistical test or estimate at least has control over seriously misleading interpretations.”

          How does your recommended procedure address the problem of alternative explanations for the mere existence of a difference between two groups of people? That is the main source of “seriously misleading interpretations”, and I do not believe it is (or can be) addressed by any statistical method.

        • Question: I was alluding to misleading interpretations of what the statistical methods teach; if you think that the question of discriminating a (given) substantive explanation cannot be “addressed by any statistical method”, then the initial contrast between Bayesian/frequentist relevance doesn’t enter. But since you ask, I would employ the same error statistical reasoning (qualitative or quantitative) to discriminate (or reveal I cannot discriminate) substantive explanations. For just one example, having found, say, a deflection effect, we may distinguish if it is due to a gravitational effect (as described in GTR), or a corona effect, a shadow effect, etc. because they are each bent in very different ways (so we argue from coincidence, just as in significance tests). But all this is, largely, beside the point of my comment, except for one thing. I think the use and interpretation of statistical methods should be continuous with science generally, that scientists jump in and out of probing one question using formal statistics, another, non-statistically. They do not give prior probability assignments to an exhaustive set in the non-statistical realm, nor update probabilistically. A Bayesian philosopher’s (artificial) reconstruction of the above example would assign probabilities to theories about corona effects, shadow effects, light deflection effects, etc., and then see which gets the higher posterior, but that is not at all what occurs, they didn’t have and wouldn’t have wanted to use “plausibility” assignments to “theories” about these effects (which did not exist). They only need ways to distinguish the different effects, and they do so in a piece-meal manner. Well, this is too far afield. Thanks.

        • Mayo,

          If you look into the history of GR, you will find that its original acceptance was largely due to Eddington deciding to throw out 1/3 of the data and give a favorable interpretation of the rest. Further, the motivation for GR was the Michelson-Morley experiment, the results of which were actually consistent with a geocentric universe, but people refused to accept that interpretation on philosophical rather than empirical grounds. In other words, historical context and personalities appear to be very important to how data is interpreted. I don’t see how this can be seen as something other than a kind of prior.

        • Question: Oh my God, now you’re repeating a famous howler regarding GR. I discuss it elsewhere (e.g., Error and the Growth of Experimental Knowledge, 1996, chapter 8), and even one of the famous proponents of that mistaken allegation (about Eddington)–Clark Glymour agrees. Eddington and all the hundreds of scientists who studied the sets of eclipse results that were destroyed by exposure to the sun, preventing a valid estimate of error, agreed that the results were destroyed by exposure to the sun. Period. The story about Eddington throwing out data is total B.S. Further, GR was never accepted from the 1919 results; the tests were horribly insevere, as were all eclipse results for decades. It was only with radioastronomy in the 1970s that they developed severe tests, and only really with pulsars. Please check the history (e.g., Clifford Will, Was Einstein Right?)

          And yes, we can describe the Newtonians who developed ways to retain Newton in the face of anomalies as having strong belief in Newton. Oliver Lodge, for example, believed he could communicate with his dead son, Raymond, through the Newtonian ether and still wanted to believe. But guess what? They had to keep their priors in T out of the data analysis supposedly testing T. (And this too I discuss in the same chapter). No scientists, not even the staunch believers, thought they could combine their strong beliefs in Newton into the data to show Newton was not actually confronting anomalies. Everyone was free to develop Newton-saving theories, but when they were shown to have inconsistent results or predict observations radically at odds with results, they had to give them up. That’s why scientists wanting to find something out do not incorporate disbeliefs/beliefs in the very theory T under test, into the data that is supposed to test T! That their Newtonian beliefs led them to work hard to save Newton is something else. There’s a huge difference there. Well, I can’t teach this stuff here….look at my Error and the Growth–you can find a scan of it off my blog.

        • Mayo, plenty of people have answered these questions. Either you refuse to read, or more likely can’t read, their answers. So instead you give a series of questions which would have made sense to ask Savage circa 1955, but are bizarrely anachronistic today.

        • Quite the opposite: Subjective Bayesians like Savage were far clearer about what they meant by their priors than current day Bayesians (read, for example, Lindley chastising his own student Bernardo!). Current day Bayesians by and large have run as far away as possible from subjective priors, preferring priors that, in some sense (on which they do not agree), allow the data to have the greatest weight. They haven’t found uninformative priors, so there are a bunch of different conventions for getting them. (Kass and Wasserman (1996) delineate their problems). Of course there are other kinds, invariant priors, maxent, conjugates, etc. By and large we are told that the prior is an undefined entity that is used to obtain a posterior. Frequentist matching priors are amongst the most popular. Current day Bayesians also run away from other principles to which Savage, Lindley and others adhered, e.g., the Likelihood Principle. They have abandoned Dutch Book arguments, and many, if not most, even reject inductive inference by updating priors. (Gelman is one.) It’s contemporary Bayesianism that has lost its foundations. To think otherwise is anachronistic.

        • If you have a fixed constant m, then you can do a sensitivity analysis with it. You can create a range of reasonable values for m (one of which is presumably the true value), vary it within that range and see how it affects whatever final results you care about.

          That p(m) is just a formal way of carrying out that sensitivity analysis. The high density region of p(m) is just the “reasonable range”. By averaging over this distribution you’re doing the “varying” part.

          That P(e_1,…,e_n) for the errors, which you think approximates the shape of a mythical long-run frequency of errors, is actually a formal sensitivity analysis, just like the prior.

          The errors in the data actually taken represent fixed constants. You can again do a sensitivity analysis by considering a reasonable range of values for those errors, varying them within that range to see how it affects whatever you’re trying to do.

          The high density region of the posterior (i.e. the Bayesian credibility interval) is just the total result you get from conducting that sensitivity analysis using both a prior range of reasonable values for m and the range of reasonable values for the errors in the data and the knowledge that they’re connected by y_i=m+e_i.

          Moreover this is objective in all the ways we need it to be. We use true information to get a reasonable range of values of m (that objectively contains the true value) and we use true facts about the measuring device to get a reasonable range of values for the errors in the data actually taken.

          This is not new. It’s not difficult to understand. You keep making the same false, bad-faith claims about Bayesian statistics, and no matter how many times someone corrects you on it, you just repeat the same stuff over and over again.

        • A prior as an initial guess, e.g. as in a root-finding algorithm, or, more generally, as some sort of regularization to constrain the type of solution you’re looking for, OK. You’re essentially ignoring/filtering some of the data in favour of background info. Which is not in general a bad thing, depending on your goal.

          On the other hand I’d say the goal of a sensitivity analysis is to do something like vary the inputs one by one, holding the others fixed, so as to assess the dependence of the output on each individual component separately. This could be done eg via a local linearization and obviously this has complications in complex problems (say, where to linearize etc). However, the general idea is to isolate the effects of each part separately.

          So I concede priors may have a reasonable interpretation under many practical circumstances, but I don’t really see them as a sensitivity analysis – more as a filter to ignore solutions you don’t want and emphasise those you do.

        • “Being a Bayesian apparently means never having to say what your priors (and therefore your posteriors) signify”

          Ok Mayo, I’ll spell this out as simply as I can. Take y_i = mu+e_i. The prior P(mu) has a high probability region. This region is where the true value of mu lies (assuming the prior is a good one).

          The likelihood P(e_1,…,e_n) has a high probability region. This region is where the actual errors in the data taken lie (assuming the likelihood is a good one).

          The posterior P(mu|e_1,…,e_n) has a high probability region. This region is where mu lies given that we know it is in the high probability region of P(mu) and the actual errors are in the high probability region of the likelihood. Presumably its high probability region is smaller than that of P(mu) because we learned something from taking the data.

          So please, go on and tell us Bayesian how silly we are, but for heaven’s sake stop claiming Bayesians never say what priors or posteriors signify. I just did.
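
          A tiny conjugate-normal example (my own numbers, added just to make the “regions” tangible): with a wide prior on mu and a couple dozen observations y_i = mu + e_i with known error sd, the posterior’s high probability region is much narrower than the prior’s.

          ```python
          # Sketch: y_i = mu + e_i, normal prior on mu, known error sd; the posterior
          # high-probability region shrinks relative to the prior's.
          import numpy as np

          rng = np.random.default_rng(3)
          mu_true, err_sd = 3.0, 2.0
          y = mu_true + rng.normal(0.0, err_sd, size=25)

          prior_mean, prior_sd = 0.0, 10.0   # wide prior: mu almost surely in (-20, 20)
          post_var = 1.0 / (1.0 / prior_sd**2 + len(y) / err_sd**2)
          post_mean = post_var * (prior_mean / prior_sd**2 + y.sum() / err_sd**2)
          post_sd = post_var**0.5

          print(f"prior 95% region:     ({prior_mean - 1.96 * prior_sd:.1f}, {prior_mean + 1.96 * prior_sd:.1f})")
          print(f"posterior 95% region: ({post_mean - 1.96 * post_sd:.2f}, {post_mean + 1.96 * post_sd:.2f})")
          ```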

        • I just don’t see how it helps to say:

          “The prior P(mu) has a high probability region. This region is where the true value of mu lies (assuming the prior is a good one).”

          P(mu) “has” a high probability region. Say it is .9. If we assume “the prior is a good one”, you say, this region is where the true value of mu lies. What happened to the .9? This is what we wanted to know. Does it mean that 90% of the time mu is within this region? Does it mean I have strong evidence (perhaps obtained through severe testing) that mu is in the region—the kind of evidence that would be wrong only 10% of the time about where mu is? Does it mean I’d bet that mu is in this region just as I would wrt an event that has frequentist probability .9?

        • “Does it mean that 90% of the time mu is within this region?”

          m is fixed. There is only one time. The high density area is a kind of a bubble that describes our uncertainty about m. The bigger the bubble, the bigger the uncertainty. The smaller the bubble, the more we can pin m down to an exact value.

          In some cases that .9 means something like “90% of possible sub-states are consistent with m being in the given region”. Nate Silver’s Bayesian probability of 90% for Obama’s election was of that form, where each “sub-state” is a potential vote pattern consistent with whatever inputs Nate was using.

          In other instances it’s for convenience. I may know my weight is between 150 and 250 but I may use a N(200,50) prior because it’s easier. My true weight is in the high probability region of either; using the latter will spread out my posterior distribution more, but many times it’s not worth the effort to get a tighter interval estimate.

          In other cases we consider high probability regions (i.e. Bayesian credibility intervals) because we often have to make best guesses based on partial information. If for example I know that (0,100) contains m with certainty, but 99% of the possibilities consistent with my partial knowledge about m put m in (45,55), then I may be willing to trade off the lack of certainty for a much reduced interval depending on what I’m doing.

          In general though, the modelers job is to use whatever partial information they have about m to choose a distribution which puts the true value of m as high in the high probability manifold as possible. They either objectively achieve that goal or they don’t.

        • I agree with Entsophy pretty much entirely here. Bayesian probabilities just usually don’t have the interpretation “x percent of the time”, except possibly when the likelihood is built entirely out of a belief that future data will look like historical data without any “mechanism” (ie. the very least informed likelihood). This also is a big difference Between Bayesian and Frequentist modeling. The likelihood in Bayesian models is pretty much anything that predicts a “bubble” of high probability around the data.

        • I’m personally fine with using Bayesian models in this way to, roughly speaking, explore the space of plausibility. At least in observational settings, most of the time I’m ultimately interested in the value of the physical parameter I’m trying to estimate.

          I wonder though, this use of probability seems to imply that a prior can almost never be too wide, but it can be too narrow. Would you agree with that? If so, wouldn’t the definition of “too narrow” fall back on a frequentist definition of probability? You don’t have to be on board with a full-on calibrated prior (as Nate Silver prefers), but if the 95% credible interval contains the true value too little of the time, wouldn’t the model be wrong?

        • What it means to “be a good prior” is that we’ve come up with a way to put the one true mu in the high probability region. There are basically two ways: 1) have some information that constrains mu, and create ANY prior that puts mu in the high probability region or 2) make the high probability region of the prior extremely large so that it’s just implausible to be outside this region.

          Entsophy has given multiple examples. But here is one:

          There are two drugs; the outcomes of interest from the drug trials are, say, duration of headache. I have no prior information about how well the new drug works, but I have some previous trials of the old drug which suggest that people mostly get rid of their headache within 1 hour of taking the drug. My prior on mean duration for drug 1 is therefore normal(mu=1, sd=2); although this includes impossible values (-1, for example), it does place high probability in the region around 1 hour and doesn’t constrain this region to exclude, say, 4 hours, and that seems reasonable and easy.

          For drug 2, I have no idea, except that I know it can’t be negative, and I should use this information since it’s the only information I have. So I’ll place a maximum entropy prior on a positive value with a given average, which is an exponential distribution. Since from experience it’s implausible that people will have headaches longer than several days, I’ll use say exponential(1/48). This includes in its high probability region everything from 0 hours to 1 week. Most headaches go away well before 1 week even if you don’t take any drugs. This prior will do to constrain the range for the “sensitivity” analysis Entsophy is talking about (which is what the Bayesian machinery does).

          The big issue for a Bayesian is: what is the likelihood (or alternatively put: what is the model which I will use to describe how to predict data)? you see, I’m free to choose any model that successfully approximately predicts the real data. There is no need to approximate the “frequency” of any process.

          In any case, depending on my knowledge of the patients, the biochemistry, and the neurology I come up with some way to predict data and specify that… then if there exists a region of the parameter space where the real data appears in the high probability region of the prediction distribution… the Bayesian machinery tells me I should use values for the parameters in the region which makes those consistent predictions.

          That’s all Bayesian models do.

          Now, although the Frequentist interpretation of the Bayesian scientific model is void, it is always possible to mathematically construct a random number generator which has Frequency properties that fit the Bayesian model. Once you’ve done that, it is possible to ask the question: “given this Bayesian model, is the actual data consistent with the kind of pseudo-data spit out by the random number generator?” Since you’ve constructed the RNG to have exactly ideal frequentist properties, now the typical question of a hypothesis test “if my model were true, would I have seen data that looks like my real data” is actually a valid and useful question about how well my model fits the actual data. If there are say 100 data points and my Bayesian model predicts things in the same region as 90 of them and would almost never predict the output of the RNG to be in the region of the other 10… this identifies something that is wrong with my model, it doesn’t take into account some process that does actually occur.
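
          Here is a deliberately crude sketch of that last check, with made-up headache durations and the Normal(1, 2) prior from the example above; to keep it short, the error sd is simply fixed at the sample sd rather than given its own prior. The idea is the one described in the previous paragraph: simulate replicate datasets from the fitted model’s RNG and ask whether the real data look like them.

          ```python
          # Sketch of a posterior predictive check: does fake data from the fitted
          # Bayesian model look like the real data? (Made-up durations, crude model.)
          import numpy as np

          rng = np.random.default_rng(4)
          hours = rng.lognormal(mean=0.0, sigma=0.6, size=100)   # "real" headache durations, always positive

          sigma = hours.std(ddof=1)                 # crude: fix the sd instead of modeling it
          prior_mean, prior_sd = 1.0, 2.0           # the Normal(1, 2) prior on the mean from the example
          post_var = 1.0 / (1.0 / prior_sd**2 + len(hours) / sigma**2)
          post_mean = post_var * (prior_mean / prior_sd**2 + hours.sum() / sigma**2)

          stat = lambda x: np.mean(x < 0.0)         # durations can never be negative
          reps = [stat(rng.normal(rng.normal(post_mean, post_var**0.5), sigma, size=len(hours)))
                  for _ in range(2000)]
          print("observed fraction below zero:", stat(hours))
          print("replicates' average fraction below zero:", round(float(np.mean(reps)), 3))
          # The normal model routinely generates negative durations the real data can never
          # show: exactly the kind of unmodeled process this check is meant to expose.
          ```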

        • “And where do you get your prior? How do you interpret it? Your degree of belief about the means? Why would I want to mix that into the data analysis?”

          As someone who has used frequentist methods in my day job (due to the conventions in my field), I think it’s fair to say that even researchers who use frequentist methods have unstated prior beliefs about the means (or rather, the difference in means). So, for any given published paper, it is possible to elicit a prior from the author(s) about what they think the difference in means is likely to be before they see the data. Nobody does this, of course, but it’s there in the background. It’s also driving the data analysis, implicitly.

          Given that it’s already in the picture, it seems reasonable to explicitly include it in the model.

          Although this may not be standard among Bayesians (I don’t know), I often use previous data as a prior for my current analysis with new data. This explicitly takes into account the implicit belief we work with in the frequentist world. This approach seems reasonable to me.

          I guess the biggest problem with the frequentist tools is not that they are fundamentally problematic (and anyway, for the type of datasets I work with, the likelihood is going to dominate in most cases), but that almost nobody understands what a p-value is. This leads to bizarre, high profile papers published in top psycholinguistics journals touting positive conclusions from null results derived from a low-powered study. Even the titles of papers have wording like “No effect of X in situation Y”. There are huge debates in psycholinguistics about the evidence in favor of H_0: \mu = 0. Often the argument in favor of the null is that repeated studies show null results. But if one’s power is 0.20, what did they expect? Even in 2014 I get to read dissertations coming out of top departments with the null result as the main point.

          The second biggest problem is that model assumptions are considered irrelevant (or, as Andrew would put it, model checking is not done). The data analysis procedure has been simplified to (a) load data, (b) run lmer command, (c) publish estimate, standard error, and t-value (I have done this myself). In many cases, people haven’t even bothered to take a cursory look at their data to check if the data has the structure they think it has (that it has been generated in the way they think it has). A cool example is a common data analysis done in eye-tracking research, with a dependent measure called re-reading time. Here, usually 80% or so of the data points are 0 millisecond values, and the others are non-zero. It is standard to fit a linear mixed model or anova with 80% or so of the data containing 0 ms values, and it is standard to assume that the underlying distribution is

          X ~ Normal(\mu,\sigma^2)

          where \mu is quite a bit larger than 0. When I analyze re-reading time without these 0 values, reviewers actually prevent me from publishing that analysis, on the grounds that it is non-standard! The central problem is that it is easy when using frequentist tools to forget (or to never even know) that we are assuming something like

          X ~ Normal(\mu,\sigma^2)

          The sheer simplicity of the frequentist tools, which I consider a huge plus in general, ironically becomes a burden. One thing I learnt while fitting Bayesian hierarchical models was that one has to always declare *how the data were generated* when defining the model. This relatively simple insight had escaped me for the 10 years I spent fitting frequentist models (maybe this was my own fault, but I am not alone). The dramatic amount of work one has to put into a Bayesian hierarchical model specification (compared to a one-liner lmer function) gave me a new appreciation about what I am implicitly assuming about how the data were generated. This is now how I teach frequentist methods too in my introductory statistics classes, which are purely frequentist; I teach students to think about the underlying generative process.
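
          To illustrate the re-reading-time point, here is a quick simulation (entirely made-up data, not from any real eye-tracking study): a single Normal(mu, sigma^2) summary of a variable that is 0 ms on about 80% of trials mostly reflects the mixture, whereas a simple two-part description matches how the data were actually generated.

          ```python
          # Sketch: re-reading times where ~80% of trials are exactly 0 ms (simulated).
          import numpy as np

          rng = np.random.default_rng(5)
          n = 1000
          reread = rng.lognormal(mean=5.5, sigma=0.5, size=n)   # a few hundred ms when re-reading happens
          reread[rng.random(n) < 0.8] = 0.0                     # ~80% of trials: no re-reading at all

          # One-Normal summary criticized in the comment:
          print(f"naive: mean = {reread.mean():.0f} ms, sd = {reread.std(ddof=1):.0f} ms")

          # Two-part description, matching how the data were generated:
          nonzero = reread[reread > 0]
          print(f"P(re-read) = {np.mean(reread > 0):.2f}, "
                f"mean given re-reading = {nonzero.mean():.0f} ms, sd = {nonzero.std(ddof=1):.0f} ms")
          ```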

        • Shravan: There’s too much to respond to, except to note that testing the assumptions of statistical models is carried out using frequentist methods! We’re the ones with the model validation tools, which is why, for example, George Box declared that Bayesian statistics couldn’t possibly be all we need because it would preclude scientific discovery. This is close to a quote and I can look it up if you like. The reason it would preclude scientific discovery, says Box, is that it would require an exhaustive set of models and priors on each.

          On your point about background info: we do take it into account in all the ways that are required for critical analysis (quite a lot under “background” on my blog, if interested). Yes, there might always be biased beliefs back there, but your suggestion that therefore we might as well mix them in with the data analysis is rather at odds with what we want to do in learning from data.

          “except to note that testing the assumptions of statistical models is carried out using frequentist methods!”

          This is absurd and I’ve seen Gelman and others correct you on this at least a dozen times. I won’t repeat all that since it seems not to register. For everyone else’s benefit, here is the big picture:

          1) Sometimes model assumptions are checked in a frequentist sense, sometimes they’re checked in more general Bayesian senses. The latter includes the former as special cases (whenever they actually make sense, that is).

          2) Those frequentist tests don’t “validate the model”; what they actually do is confirm the model is a good “summary” of the old data. Whether the new data looks anything like the old data is a completely separate question left totally unanswered by those “tests”. This is especially problematic because you can find a good model “summary” of any data set regardless of what physically caused the data.

          This is important because this is the single biggest reason we have the crisis of reproducibility right now. It’s not because of the nefarious effect of Bayesian “subjective” priors, as Mayo likes to claim; it’s because frequentists have allowed fantasies about their supposed knowledge of infinite frequencies to fool themselves into believing those tests are doing something they ain’t.

        • Let me back the claim “Sometimes model assumptions are checked in a frequentist sense, sometimes they’re checked in more general Bayesian senses. The latter includes the former as special cases (whenever they actually make sense, that is).” with a specific example.

          In the post linked below I showed explicitly how testing whether a past data set $\vec{x}_{past}$ is in the high probability manifold of a Bayesian probability distribution (which is an entirely Bayesian test!) reduces to checking whether a given chi-squared term is small.

          So go ahead and deny the claim. The mathematics stands as true as ever regardless of what any of us say.

          Here you go:

          http://www.entsophy.net/blog/?p=271

          “testing the assumptions of statistical models is carried out using frequentist methods”

          I have no problem with that. I’ll use whatever method makes sense in a particular context. I guess I don’t understand why it has to be either Bayes or frequentist, but not both, as needed. I do understand the philosophical arguments, but as an end-user of statistics, they don’t touch my life.

          “your suggestion that therefore we might as well mix them in [priors] with the data analysis is rather at odds with what we want to do in learning from data”

          I don’t know what you mean. I don’t want to ignore prior information if I have it (and I sometimes do). With a standard frequentist analysis, every data set is analyzed as if we know nothing about the subject (usually false in my field). We do informally argue about “converging evidence”; if so, why not just formally check the convergence?

          Anyway, currently my biggest problem with frequentist methods is that it leads to abuse. Although I guess that once people become sophisticated enough to fit models in Stan or whatever, we will see abuse too (in fact, I think Andrew thinks I abuse Bayesian methods, because I use credible intervals to do inference, almost like a frequentist; I even use Bayes factors, which I think Andrew once characterized as “crap”).

        • Rahul:

          One point that Rubin always made is that arguments about what model to use (I wouldn’t limit this to “what prior to use,” as of course the choice of data model is typically important too) are valuable in that they point the way toward scientific discussions. So, even if it can be work to set up a reasonable statistical model, this work can yield insight.

        • No, not in my field. But in my field it usually doesn’t matter much; the likelihood will dominate. When it does matter, I just present all results with all priors that seem reasonable given what we know so far. E.g., after 14 years of doing experiments I can make a pretty good guess about what a plausible range of the standard deviation for a reading time measurement is going to be.

          Either that, or I just do whatever Andrew says :).

        • MAYO, any statistical analysis requires some kind of model, so you don’t really have a choice as far as that is concerned. Your choice is in what model to use, not whether to have one. That’s true of frequentist and Bayesian approaches, and any other approach too.

          A standard, textbook approach to this problem would assume that the samples from groups A and B are from normal distributions, with variances estimated from the data, then do a one- or two-tailed hypothesis test, with the hypothesis being that the means are equal. Notice that you have to assume a normal distribution in order to do this. Or you could do a different test with a different statistical distribution…you still need to assume you know the distribution.

          A Bayesian might fit a model where the samples from a are assumed to come from N(mean = mu_a, standard deviation = sd_a) and similarly for b, where the mu’s and sd’s have wide prior distributions…indeed, if there are at least two samples from A and B, those priors can be infinitely wide and the posterior distribution will still be defined, but in almost any real situation there is at least some prior information that lets you narrow things down at least somewhat. A prudent statistician will try several priors and make sure the choice of prior isn’t dominating the analysis.

          This is very similar to the eight schools problem, where we might be interested in whether School A has a better teaching program than School B.

          To me, a strong argument for _some_ sort of approach that is not based purely on a hypothesis test is that the hypothesis test doesn’t even try to answer the question you’re interested in. You’re not interested in “how likely would I have been to see this, if the means of A and B were equal”; you’re interested in “given that I’ve seen this, how likely is it that A > B”. Actually, it’s likely that that’s not what you’re interested in, either, but at least it’s closer. One can argue whether a Bayesian approach is the best way to answer that second question, but it’s inarguable that a hypothesis test does NOT answer it, or even try to. If you want to answer that second question, you have to do -something- other than a hypothesis test.

        • I’ve found that at least in the kind of data I end up looking at, most assumptions about distributions tend to be unsupported, if not actually lousy, and as a consequence I’ve found myself reaching more and more for nonparametric techniques (such as, for example, rank tests).

        • Jake:

          We discuss this briefly in Bayesian Data Analysis. In short, my suggestion is, if you were going to do a rank test, I’d instead use a compressing transformation of the data and then fit the usual sort of regression model. The trouble with rank tests is that they can only be used to answer a very limited set of questions. But the general approach of compressing transformations (of which the rank is a special case) can allow you to get rid of any outliers and then make use of the full range of statistical modeling techniques.
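
          A rough sketch of that suggestion (my own illustration with made-up heavy-tailed data, using the log as one possible compressing transformation and a plain two-group comparison in place of a fuller regression): the transformed-scale analysis gives an interpretable effect estimate with a standard error, which a rank test alone does not.

          ```python
          # Sketch: compressing transformation + ordinary comparison vs. a rank test,
          # on made-up heavy-tailed data with a multiplicative group difference.
          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(6)
          a = np.exp(rng.normal(1.2, 1.0, 50))   # group A, heavy right tail
          b = np.exp(rng.normal(1.0, 1.0, 50))   # group B

          print("Mann-Whitney p:", stats.mannwhitneyu(a, b, alternative="two-sided").pvalue)

          la, lb = np.log(a), np.log(b)          # the compressing transformation
          diff = la.mean() - lb.mean()
          se = np.sqrt(la.var(ddof=1) / len(la) + lb.var(ddof=1) / len(lb))
          print(f"log-scale difference = {diff:.2f} (se {se:.2f}), "
                f"i.e. a multiplicative effect of about {np.exp(diff):.2f}")
          ```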

        • Phil: Unfortunately, you’ve got frequentist inference quite wrong: we do not report the likelihood, we always report the error probabilities associated with a test statistic (be it likelihoods or something else–even your posterior). We want to appraise the warrant of a hypothesis all right, but the way to do that is to ascertain how well or poorly probed it was by the test/data in question. It’s too bad that PopBayes so cavalierly distorts frequentist methods, I’m not necessarily blaming you, it’s rampant.
          Perhaps see these blogposts:
          http://errorstatistics.com/2012/11/30/error-statistics-brief-overview/
          http://errorstatistics.com/2012/10/05/deconstructing-gelman-part-1-a-bayesian-wants-everybody-else-to-be-a-non-bayesian/

        • Mayo: c’mon. It doesn’t take a lot of exploration of the literature to find purported frequentist inferences that don’t report error probabilities, and there’s an entire school of analysis in which reporting the likelihood is the answer to everything.

          So, claiming “we” practitioners of frequentist inference do what you say, without exception, is demonstrably wrong. The traditional definitions of the goals of frequentist inference are laid out here; please either stick to them or use language that distinguishes your approach from them. Calling what you do the “error statistical approach” is fine, claiming it to be the *only* version of frequentist inference is not.

        • I read both of those blog entries but I’m afraid I didn’t understand either of them.

          I’ve seen quite a few statistics books, and when it comes to hypothesis testing all of the ones that I’ve seen take the approach of testing “how likely is it that I would have seen what I saw, if the true effect were exactly 0” [or, sometimes, exactly equal to some other number]. I assure you, this is a common thing that people test. If what you’re saying is that not _everybody_ tests that, then that’s great, I’m glad to hear it! But, unfortunately, many people do the test just the way I said.

          I’m one of the people who learns best if I have an example in front of me (not everybody is like this). Even a drawing-balls-from-urns example would probably suffice in this case. For instance, suppose I have two big groups of people, A and B, and I think that on average people from Group A might have higher systolic blood pressure than people from Group B. I draw a random sample of 10 people from each group, so I have measurements a1, a2, a3, …a10 from group A, and b1, b2, … b10 from group B. What’s the standard frequentist approach for assessing the probability that the mean blood pressure in Group A is higher than that in Group B?

        • “What’s the standard frequentist approach for assessing the probability that the mean blood pressure in Group A is higher than that in Group B?”

          I lol’d.

        • Phil:

          You ask, “What’s the standard frequentist approach for assessing the probability that the mean blood pressure in Group A is higher than that in Group B?”

          The answer is that in a frequentist approach, you can only assign probabilities to random variables. You cannot assign probabilities to parameters. Thus, in this problem the natural frequentist approach is to treat the mean blood pressure in each group as a random variable. The distributions of these random variables need to be externally specified or estimated from data. Traditionally, frequentist statisticians have required that parameters either be externally specified or estimated from data, but there is nothing deep in frequentist principles that would stop them from using data to modify a range of prior guesses, or using prior information to regularize a data-based estimate.

          In short, I think that the most natural frequentist approach to your problem would be to fit a hierarchical Bayesian model (and then, if desired, perform a bunch of research to evaluate the statistical properties of that approach). The hierarchical part comes from your desire to estimate a probability, and the Bayesian part comes because you can’t get any reasonable pure-data-based point estimate of the hyperparameters of a hierarchical model given data from just two groups.
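
          As a rough illustration of the kind of output Phil is asking for, here is a sketch of a much simpler calculation than the hierarchical model described above: independent normal models for the two groups with the standard noninformative prior, and Pr(mu_A > mu_B) approximated by posterior simulation. The data, prior, and model here are assumptions made purely for illustration, not a reconstruction of anyone’s preferred analysis:

```python
# Sketch only: a flat-prior, non-hierarchical stand-in, just to show what
# "probability that the group A mean exceeds the group B mean" can look like
# as an output.  All numbers are invented.
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(130, 15, size=10)  # hypothetical group A measurements
b = rng.normal(125, 15, size=10)  # hypothetical group B measurements

def posterior_mean_draws(y, ndraws, rng):
    # Under a normal model with the standard noninformative prior
    # p(mu, sigma^2) proportional to 1/sigma^2, the marginal posterior of mu
    # is ybar + (s / sqrt(n)) * t_{n-1}.
    n = len(y)
    return y.mean() + y.std(ddof=1) / np.sqrt(n) * rng.standard_t(n - 1, size=ndraws)

mu_a = posterior_mean_draws(a, 10_000, rng)
mu_b = posterior_mean_draws(b, 10_000, rng)
print("Pr(mu_A > mu_B | data, this model) ~", (mu_a > mu_b).mean())
```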

        • Andrew, you say “I think that the most natural frequentist approach to your problem would be to fit a hierarchical Bayesian model (and then, if desired, perform a bunch of research to evaluate the statistical properties of that approach).” I’ll be interested to hear if Mayo agrees with you.

          Mayo, the question was for you, really. I’ve made a measurements on a sample from Group A and a sample from Group B, and I want to know the probability that the Group A mean is greater than the Group B mean. I know how I would do this using a Bayesian approach, but I don’t know how I would do this using a Frequentist approach. Enlighten me?

        • “What’s the standard frequentist approach for assessing the probability that the mean blood pressure in Group A is higher than that in Group B?”

          A frequentist would say that probability is either 1.0 or 0.0, as the underlying reality is that, if you measured the whole population, either Group A or Group B has the higher mean blood pressure.

          Frequentist confidence intervals and hypothesis tests are designed so that the procedure producing the statement has a particular error rate. For instance, when the assumptions of the test are met, a level 0.05 test will have a false positive rate of 5%. The practicality of the assumptions and their application in real-world problems aside, frequentist results do not return probability statements about the parameters (as Andrew states); they make statements about the error rate of the estimating procedure. If you build a 95% confidence interval and state that the true value is in that interval, you will be wrong 5% of the time. This is what Mayo refers to as controlling the error rate (I believe). Likewise, you can perform a level 0.01 test of the null hypothesis that "the mean blood pressure in Group A is less than or equal to that in Group B" and, if it rejects, conclude that Group A is higher with the knowledge that you will only be wrong 1% of the time when that null is true.

          Those error rates all depend critically on the assumptions built into the test, as Andrew points out.

          As a quick aside back to the original problem, I think the proper way to do the NHST would be to do the two-sided test, since, as I see it, the two-sided test is essentially a Bonferroni-corrected pair of one-sided tests, one testing A > B and one testing B > A (a quick simulation illustrating this equivalence appears after this comment). When someone is debating using a one-sided test, it is likely because they have already implicitly done the other test in their head and concluded that it would not reject: if they expect A > B, it is because they have already "tested" B > A against their prior information and found no evidence for it. Just a thought.

          If I am misrepresenting Mayo’s argument above I would appreciate a correction, but I believe that I am summarizing the error rate standpoint reasonably.
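
          As a side note on the Bonferroni point above: here is a quick simulation sketch (an editorial illustration, not Mike’s own code) showing that, for the symmetric two-sample t-test, rejecting when either one-sided p-value falls below alpha/2 picks out exactly the same datasets as the two-sided test at level alpha. It assumes a recent SciPy, where `ttest_ind` accepts an `alternative` argument:

```python
# Check of "two-sided test at alpha = two one-sided tests at alpha/2",
# using simulated null data (both groups drawn from the same distribution).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, nsims, agree = 0.05, 2000, 0
for _ in range(nsims):
    a = rng.normal(0, 1, size=10)
    b = rng.normal(0, 1, size=10)
    p_two = stats.ttest_ind(a, b).pvalue
    p_greater = stats.ttest_ind(a, b, alternative="greater").pvalue
    p_less = stats.ttest_ind(a, b, alternative="less").pvalue
    reject_two_sided = p_two < alpha
    reject_either_one_sided = min(p_greater, p_less) < alpha / 2
    agree += (reject_two_sided == reject_either_one_sided)
print(agree / nsims)  # should be 1.0: the two rules reject on the same datasets
```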

        • Mike:

          As noted in my comment just above, frequentist inference does allow for probabilistic statements—a frequentist does not need to say that a probability is either 1 or 0. The restriction is that in frequentist inference you can not make probability statements about parameters. But you can make frequentist inference about predictive quantities. Hence the frequentist literature on multilevel models.

          A frequentist could indeed reply to Phil’s question by saying that he or she is not interested in making these probabilistic predictions. But to the extent that these predictions are of interest, they can be made within a frequentist predictive framework.

        • Andrew:

          Perhaps one might put it as “The hierarchical part comes from your desire to estimate a probability, and _avoiding_ the Bayesian part _comes because you interpret that probability as aetiologic or physical and that often you can_ [can’t] get [any] reasonable pure-data-based point estimate of the hyperparameters of a hierarchical model given data from _many_ [just two] groups.”

          At least that seems to have been David Cox’s reply to Lindley and Smith’s Bayes estimates for the Linear Model paper.

        • Mike G, I understand what you’re saying but from a practical standpoint it seems ridiculous. Except for quantum mechanical behavior the probability of _everything_ is either 0 or 1. But what good is that, unless I know the answer for the specific case I’m interested in?

          Let’s separate the question of _terminology_ from the question of _practical application_. A Bayesian or Frequentist would both agree that either the Group A mean is or isn’t higher than the Group B mean, so I guess I can see why one would feel uncomfortable about using the term “probability.” This still seems funny, though, because it means you can’t really use the term “probability” to apply to anything. Either it will or won’t rain a week from today; either the next coin I flip will or won’t come up heads. I’d like to say that the probability of the coin coming up heads is 50%, or very close to it, but if you want to say “no, that’s ridiculous, it’s either 0 or 1, only you don’t know which” then I think you’re using “probability” in a way different from the plain-English meaning but OK, let’s play it your way. Call it what you will.

          So let me re-phrase the question. I have samples from Group A and Group B. A colleague who has the same information that I have says “you know, the mean of the sample from Group A is 10% higher than the mean from Group B, but I think that if we measure everyone in group A and everyone in Group B, we’re going to find that the mean from group A is less than 2% higher than Group B. If you offer me 8:7 odds, I’ll bet you on it.” How should I decide whether to take the bet?

        • Phil:

          I understand what you’re saying but from a practical standpoint it seems ridiculous. Except for quantum mechanical behavior the probability of _everything_ is either 0 or 1. But what good is that, unless I know the answer for the specific case I’m interested in?

          Bayes uses probability to describe what’s known about parameters; these need not be only zero or one. In frequentist work probability describes only random variables.

          Let’s separate the question of _terminology_ from the question of _practical application_. A Bayesian or Frequentist would both agree that either the Group A mean is or isn’t higher than the Group B mean, so I guess I can see why one would feel uncomfortable about using the term “probability.” This still seems funny, though, because it means you can’t really use the term “probability” to apply to anything. Either it will or won’t rain a week from today; either the next coin I flip will or won’t come up heads. I’d like to say that the probability of the coin coming up heads is 50%, or very close to it, but if you want to say “no, that’s ridiculous, it’s either 0 or 1, only you don’t know which” then I think you’re using “probability” in a way different from the plain-English meaning but OK, let’s play it your way. Call it what you will.

          Your discussion is of the use of probability to describe random variables, and so applies to both Bayes and frequentist work.

          So let me re-phrase the question. I have samples from Group A and Group B. A colleague who has the same information that I have says “you know, the mean of the sample from Group A is 10% higher than the mean from Group B, but I think that if we measure everyone in group A and everyone in Group B, we’re going to find that the mean from group A is less than 2% higher than Group B. If you offer me 8:7 odds, I’ll bet you on it.” How should I decide whether to take the bet?

          This is a rather different question than the one you originally posed. The variability in the measurements and (for Bayesian work) in the prior is going to affect how you decide.

        • Andrew:

          Thank you for your clarification. Perhaps I should have been saying “error statistician” instead of frequentist, as I think this statement does somewhat fall outside the error statistical framework, though, doesn’t it? If there is one pair of groups A and B and a frequentist used the approach you described to assign, say, 87% probability that group A > group B, then you cannot assess the accuracy of that statement within the error statistical framework. If you measured all of groups A and B and compared their means, you would find that either A>B or A<B, so you cannot say whether the 87% probability was "right" or "wrong" and you cannot speak of the error properties of the procedure. Treating the means as random assumes a larger conceptual population, but you cannot possibly measure these hypothetical groups because you only have the two groups, so again it makes no sense to speak of the error properties of prediction. I think an error statistician would forgo making any probabilistic statement of the sort. I could be failing to understand something, so I apologize if I am.

          Phil:

          As george points out, your new question is very different from your old one and lends itself naturally to the Bayesian philosophy of probability (betting odds and so forth). That said, assuming the data appeared to meet the assumptions of a two-sample t test, an error statistician could test at level "alpha" the null hypothesis that the difference between groups was less than or equal to 2%, where alpha was picked based on how willing the error statistician was to be wrong. They could then make the bet if the test rejected, with the knowledge that they would be wrong only a fraction "alpha" of the time (a rough sketch of such a test follows this comment). Again, this test does require assumptions, but I believe this is the way an error statistician would view this problem. Mayo, please correct me if I am misrepresenting things or failing to make important things clear.
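
          A rough sketch of the one-sided test just described, treating the colleague’s 2% as a fixed shift delta on the measurement scale purely to keep the example simple (in reality the threshold is relative to Group B’s unknown mean, which complicates things). All numbers are invented, and a recent SciPy is assumed for the `alternative` argument:

```python
# Sketch of the one-sided test described above: H0: mu_A - mu_B <= delta
# versus H1: mu_A - mu_B > delta, implemented by shifting group A by delta.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(132, 15, size=10)   # hypothetical group A sample
b = rng.normal(120, 15, size=10)   # hypothetical group B sample
delta = 0.02 * 120                 # "2% higher" treated as a fixed shift (a simplification)

res = stats.ttest_ind(a - delta, b, alternative="greater")
alpha = 0.05  # chosen by how willing you are to reject a true null
print("one-sided p =", round(res.pvalue, 3), "reject H0:", res.pvalue < alpha)
```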

        • Mayo:

          You ask, “where do you get your prior?” We get it from the same place we get our linear models and our assumptions of additivity and Cox regression models and logistic regressions and lassos and everything else. We put together our models based on some combination of substantive knowledge and comfort given existing practice, then we evaluate our models in various ways including simulations and analytical calculations, we check how our models fit data, we see how our inferences change under different assumptions, we gather new data to identify troublesome aspects of our model, etc. My recent paper with Yair gives some sense of how this works in practice.

  4. One of the problems with using a one-tailed test (because you’ve hypothesised that M(A) *ought* to be greater than M(B)) is that if it turns out non-significant, the CI will typically include the possibility that M(B) > M(A), at which point you have some explaining to do.

    I’ve seen it suggested that the only time it’s legitimate to use a one-tailed test is if the other tail is mechanically impossible (e.g., you’re comparing the number of people alive at t1 with those alive at t2). Another suggestion is that it’s OK to use a one-tailed test if you’re prepared to react to a significant result in the “impossible” direction in exactly the same way as you would to a n/s result; for example, if you’re testing a new medicine to see if it’s better than an existing one, you don’t distinguish between “it’s (statistically) significantly worse” and “there’s no (statistically) significant difference” because in either case, you’re not beating the current best treatment and your new pill is going in the trash.

    But the main real-world justification for using a one-tailed test seems to be that the researchers’ need for a p < .05 result is greater than the pointing and laughing in their direction that might ensue when they publish. I'd be *very* interested to see a redux of Masicampo and Lalande (2012) with a comparison between one- and two-tailed p-values; my guess is that one-tailed tests would show a considerably higher preponderance of borderline results.

    I recently saw an article which stated (wording and numbers changed to protect the guilty) that "a t-test showed that M(A) was greater than M(B), t = etc, df = etc, two-tailed p = .08, one-tailed p = .04)". So they got all the benefits of a p < .05 result, while also hedging their bets by not actually having insisted on a two-tailed test. (The PI on this study is a Very Big Name, who uses a *lot* of one-tailed tests to get below .05.)

  5. The best treatment of this I have seen is in Meier, Sacks and Zabell where they say: “A simple rule, which we adopt here, is to take a one-sided 2.5% level criterion (equivalent to a two-sided 5% level)…” So it’s fine to use a one-sided test as long as you maintain the size of each tail. That would solve the problem Nick discusses above.

    • Just to elaborate slightly: if you’re so confident about the direction of the effect, you shouldn’t mind giving up the area in the other tail ex ante to prove your point.

      • Indeed. But that assumes that you are the kind of scientist who is disinterested and genuinely curious about the outcome, rather than having a substantial interest (in terms of getting published, prestige, power, and/or speaking fees) in the numbers coming out a particular way.

        The fact that something as simple as this can be gamed by researchers with an agenda because we can’t get reviewers to agree on it, tells us that something is badly flawed in even our most basic statistical methods. Yet we launch multi-million dollar public policy initiatives on the back of this sort of “science”.

        • If you aren’t genuinely curious about the outcome, you aren’t a scientist, and no amount of statistical expertise will help you.

        • Fair enough. I was being glib. My real point was that NHST at its best gives useful information to someone who is genuinely curious. As a species of rhetoric (which is what most applied published statistical work is), it is weak just because it is so malleable in ways that the reader can’t discern. So the skeptic (using Akerlof’s Market for Lemons) naturally assumes the worst, at which point NHST arguments are unconvincing. In classic rhetorical terms, it’s like the argument from authority: only valid to the extent that the reader either (a) trusts the arguer; or (b) is willing to hunt down the authority himself and see if the cite is accurate, accurately reflects the authority, etc. etc.

        • “if you’re so confident about the direction of the effect, you shouldn’t mind giving up the area in the other tail ex ante to prove your point.”

          In that case why would you expect significance to hold just because the direction of the effect is correct?

        • I don’t. I neither expect nor don’t expect significance. It will either come or not, as the size of the effect, the underlying model, and the sample size warrant. But if I want to eschew one tail as being (theoretically, practically, spiritually) incorrect, I should have the right to do so. But that doesn’t give me license to loosen my standards of significance in the “right” direction.

    • There is no “magic” in the 5%-rule anyway, so why should it be helpful regarding the original question to suggest that the 2.5% should be used? Nobody stops the researcher from even using 1%…

  6. In deciding between one- and two-tailed P-values one needs to first decide if one is interested in evidence or just attempting to discard a null hypothesis. In the latter case the many arguments for two-tailed tests have some merit. However, in the context of an assessment of the evidence in the data the one-tailed P-value makes more sense. (Arguably it makes little difference in practice as long as the number of tails is specified.)

    See this paper for an extensive discussion of tails and evidence: arxiv.org/abs/1311.0081

  7. I think that your correspondent’s question is ill-posed.

    Suppose your correspondent decides from the data either (1) that the mean of group A is higher than the mean of group B, or, alternatively, (2) that this is not the case.

    The important question is this: How will he behave, that is, what decision will he make, should he decide that (1) is probably true, versus deciding that (2) is probably true?

    This hypothesis testing business, in a context that doesn’t ask how actual behavior should change depending on the result of the test, is vacuous.

    Now, the behavior could be: If the hypothesis is not supported, give it up and don’t publish anything, but if it is supported, publish a paper (which may then enter into the collection of papers that are later not supported by subsequent research, as John Ioannidis has noted). (Which, parenthetically, exacerbates the problem of unpublished non-significant results).

    http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0020124

    My take is that you have to consider questions like this in the context of decision theory, which may be either frequentist or Bayesian (because of the complete class theorem). Your choice.

    But this hypothesis testing paradigm (frequentist or Bayesian) is fundamentally flawed because it doesn’t ask “so what? What next?”

    • This decision theoretic approach is so unappealing to me.

      Everyone asks ‘so what, what next?’, whether formally or informally, but not everyone wants to mix that all together with what has typically been considered the interesting part of science – (modelling) the ‘truth’ about the external world.

      If you want an additional layer of decision theory, fine, but I’d prefer to make sure it’s a separate model from that of the phenomena.

      • hjk, I don’t understand your comment. A decision theory analysis is always going to be built on a full probability model of the underlying phenomenon. It is just an additional bit that guides you, given the output of the probability model, into choosing the best action to take, given what you know. How is this unsatisfactory?

  8. If I may be permitted a plug, my book Statistical Issues in Drug Development http://www.senns.demon.co.uk/SIDD.html has a chapter devoted to one-sided versus two-sided tests. It is downloadable here http://www.senns.demon.co.uk/c12.pdf. Bayesians might also like to look at Laurence Freedman’s article:
    Freedman LS. An analysis of the controversy over classical one-sided tests. Clin Trials 2008; 5:635-40. http://www.ncbi.nlm.nih.gov/pubmed/19029216
    Stephen

  9. I was advised to always use two-tailed tests, because you might just end up wanting to report a strong effect that was the reverse of your hypothesis. However, I note that if you use an error budget, you can express a prior belief in the design of your frequentist statistical test. That is, if I wish to report an effect falsely no more often than 1% of the time, I could report an effect when the p-value works out as 0.0001 or lower in the left tail, or 0.0099 or lower in the right tail.
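
    A sketch of what that asymmetric error budget looks like as critical values (the 0.0001/0.0099 split is taken from the comment above; the use of a z statistic is a simplification for illustration):

```python
# Asymmetric split of a 1% error budget across the two tails:
# 0.0001 in the "unexpected" tail, 0.0099 in the "expected" tail.
from scipy import stats

alpha_left, alpha_right = 0.0001, 0.0099
z_left = stats.norm.ppf(alpha_left)    # report an effect if z falls below this
z_right = stats.norm.isf(alpha_right)  # report an effect if z rises above this
print(round(z_left, 2), round(z_right, 2))   # roughly -3.72 and +2.33
print(alpha_left + alpha_right)              # total false-positive rate: 0.01
```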

  10. I would echo Andrew’s (and others’) thoughts in that, at least with respect to statistical design, the power of the test is important. It is the resolution of the lens through which we are able to discern one ‘thing’, be it group/object/event/other, from another. In teaching business analysts who don’t have a strong skillset in statistics, period, much less differentiate between Bayesian and classical methods, I use the analogy of digital camera resolution to emphasize the necessity of expressing power as part of the uncertainty or ‘sanity’ of any test, whether it is a post hoc/post mortem analysis or part of an experimental design (which none of them adhere to anyway – but baby steps, right?). A rough power calculation, for illustration, follows at the end of this comment.

    Two cameras of different resolutions (in megapixels) viewing the same object can change how one interprets any ‘significance’ of a test; that is a very visual way of describing the difference between two distributions of information. If someone were to change the hue or gamma coefficient of one of the photos, the distributions of information look different, because they are different. Thus exploratory data analysis must be conducted, and one should never assume normality – a rather easy process that, even cursorily conducted, can avoid a lot of future confusion about previously reported results. That they don’t understand decisioning recommendations (particularly as a system of games) either I also find problematic, as that framing can help clarify the information an analyst presents to a layperson. But suffice it to say that I teach them to treat the ideas of significance with respect to ‘rejection’ and ‘acceptance’ as not concrete (the monikers are confusing to the layperson anyway), and rather as ‘evidence (marginally/mildly/moderately/highly) supports’ or ‘evidence (marginally/mildly/moderately/highly) does not support’ the test hypothesis. I’m throwing out some terms on the fly here, but each term, as a ‘score of uncertainty’, can be assigned an intuitive probabilistic value given the problem being solved.

    That is a practice not normally followed, but for most tests of means/proportions/weighted statistics, etc., this is something I believe should be given consideration as part of a ‘best practice’ framework where it may apply.
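
    As a rough companion to the point about power (a sketch only, with an invented effect size and sample size), here is how the power of one- and two-sided two-sample t-tests can be computed from the noncentral t distribution:

```python
# Power of a two-sample t-test (equal n, equal variances), comparing the
# one-sided and two-sided versions.  Effect size and n are made up.
import numpy as np
from scipy import stats

d, n, alpha = 0.5, 30, 0.05      # standardized effect size, n per group, level
df = 2 * n - 2
ncp = d * np.sqrt(n / 2)         # noncentrality parameter of the t statistic

t_crit_one = stats.t.isf(alpha, df)
power_one = stats.nct.sf(t_crit_one, df, ncp)

t_crit_two = stats.t.isf(alpha / 2, df)
power_two = stats.nct.sf(t_crit_two, df, ncp) + stats.nct.cdf(-t_crit_two, df, ncp)

print(f"one-sided power ~ {power_one:.2f}, two-sided power ~ {power_two:.2f}")
# The one-sided test buys some power at the same level, at the price of
# ignoring effects in the other direction entirely.
```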

  11. Straight answer to the original question:
    Leaving issues of model assumptions aside, using a one-sided p-value means that the theory is tested against “no difference, or a difference opposite to what the theory suggests”. Looks fine to me if you decide this before looking at the data. (The two-tailed test will of course not always give a p-value twice as large; it does so only when the observed difference is in the expected direction. A small numerical illustration follows this comment.)

    Of course afterwards, whatever your p-value is, don’t over-interpret it…
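
    A small numerical illustration of the parenthetical point above, using invented data and assuming a recent SciPy: the one-sided p-value is half the two-sided one only when the observed difference goes in the hypothesized direction.

```python
# Relationship between one- and two-sided p-values for a symmetric test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
b = rng.normal(0.0, 1.0, size=30)
for shift in (+0.8, -0.8):        # effect with, then against, the theory "A > B"
    a = rng.normal(shift, 1.0, size=30)
    p_two = stats.ttest_ind(a, b).pvalue
    p_one = stats.ttest_ind(a, b, alternative="greater").pvalue
    print(f"shift={shift:+.1f}  two-sided p={p_two:.3f}  one-sided p={p_one:.3f}")
# When the sample difference is in the expected direction, p_one = p_two / 2;
# when it goes the other way, p_one = 1 - p_two / 2 (close to 1, not half).
```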

  12. There is a lot of misinformation and misunderstanding about one-tailed / one-sided tests and p-values. And almost no one talks about one-sided confidence intervals. I believe addressing this issue is also a way to resolve some other misunderstandings and malpractices related to p-values while staying within the frequentist framework, part of the solution being to make explicit the null hypothesis being rejected. Medicine and the behavioral sciences are notorious for their misuse of two-sided tests (p-values and CIs) when they are clearly making directional claims and, from a practical standpoint, directional decisions. I’ve made the argument for one-sided tests in a series of articles on https://www.onesided.org/
