Moving statistical theory from a “discovery” framework to a “measurement” framework

Avi Adler points to this post by Felix Schönbrodt on “What’s the probability that a significant p-value indicates a true effect?” I’m sympathetic to the goal of better understanding what’s in a p-value (see for example my paper with John Carlin on type M and type S errors) but I really don’t like the framing in terms of true and false effects, false positives and false negatives, etc. I work in social and environmental science. And in these fields it almost never makes sense to me to think about zero effects. Real-world effects vary, they can be difficult to measure, and statistical theory can be useful in quantifying available information—that I agree with. But I don’t get anything out of statements such as “Prob(effect is real | p-value is significant).”

This is not a particular dispute with Schönbrodt’s work; rather, it’s a more general problem I have with setting up the statistical inference problem in that way. I have a similar problem with “false discovery rate,” in that I don’t see inferences (“discoveries”) as being true or false. Just for example, does the notorious “power pose” paper represent a false discovery? In a way, sure, in that the researchers were way overstating their statistical evidence. But I think the true effect of power pose has to be highly variable, and I don’t see the benefit of trying to categorize it as true or false.

Another way to put it is that I prefer to think of statistics via a “measurement” paradigm rather than a “discovery” paradigm. Discoveries and anomalies do happen—that’s what model checking and exploratory data analysis are all about—but I don’t really get anything out of the whole true/false thing. Hence my preference for looking at type M and type S errors, which avoid having to worry about whether some effect is zero.
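
As a rough illustration of what those quantities look like, here is a minimal simulation sketch (my own toy example, not the code from the paper with Carlin; the true effect of 0.1 and standard error of 0.3 are made-up numbers):

    import numpy as np
    from scipy import stats

    def type_s_m(true_effect, se, alpha=0.05, n_sims=1_000_000, seed=0):
        """Simulate replications of a study with the given true effect and standard error."""
        rng = np.random.default_rng(seed)
        estimates = rng.normal(true_effect, se, n_sims)       # sampling distribution of the estimate
        z_crit = stats.norm.ppf(1 - alpha / 2)
        significant = np.abs(estimates) > z_crit * se         # the usual two-sided significance filter
        power = significant.mean()
        sig_est = estimates[significant]
        type_s = np.mean(np.sign(sig_est) != np.sign(true_effect))   # wrong sign, given significance
        type_m = np.mean(np.abs(sig_est)) / abs(true_effect)         # exaggeration ratio, given significance
        return power, type_s, type_m

    # a small true effect measured with a noisy design: low power, real sign risk, big exaggeration
    print(type_s_m(true_effect=0.1, se=0.3))

Nothing in that calculation asks whether the effect is exactly zero; it only asks what the significance filter does to estimates of an effect of a given size.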

That all said, I know that many people like the true/false framework, so feel free to follow the above link and see what Schönbrodt is doing.

44 thoughts on “Moving statistical theory from a “discovery” framework to a “measurement” framework”

  1. “But I think the true effect of power pose has to be highly variable”

    Really? You think it has to be the case that there are some people for whom power pose is highly effective?

    • Doesn’t have to be consistent on a per person basis… could be a combination of people and time. For example, I suspect that pointing a firearm at someone and yelling “Get on the ground right now!” is probably a really effective power pose.

      • Separating the point from the joke: sure, it doesn’t have to be consistent within people, but I still doubt literally anyone would perform much better in a high-stakes task at any time if they make a Superman pose privately beforehand (even if the high-stakes task is an audition to be Superman’s flying stand-in).

    • What is “being highly effective”? I can think of several distinct questions:

      Question 1: Will power poses make some people feel significantly better, more confident, powerful, less anxious, etc? I bet it will (just like some other people would feel much better after prayer, Stoic exercises, a walk in the park, being whipped by Mistress, etc.)

      Question 2: Will it affect some people better than placebo? Well, better than other placebos, anyway? I bet it will; in the psychological domain, everyone probably has their own favorite placebos. And, in the psychological domain, a placebo is as good as a cure; if you feel better, then, by definition, you feel better.

      Question 3: Will it work for a majority of people, to a significant degree? I bet it won’t.

      Question 4: Will power poses affect the hormones of certain people? I don’t see why not. I mean, if a videogame makes me excited, it affects my hormones. Someone out there must get very angry whenever they see a rainbow, meaning that for this person rainbows would increase their adrenalin, testosterone or whatever.

      But there’s a rhetorical trick here: a lot of writeups on power poses say that it affected hormones, implying that therefore it must be real/worth it/not placebo. Even leaving aside the failure to replicate, this is a fallacy, because hormones aren’t just the cause of psychological states, they’re also the result of psychological states. The same goes for the oft-seen argument that such-and-such effect is really real because it was measured to “rewire the brain” or because it lit up something in an fMRI. Everything we experience affects hormones, rewires the brain, and lights up fMRIs. Unless we believe in souls, all mental states must be reflected physically somewhere. Therefore, measuring a physical change correlated with a psychological state doesn’t prove anything about causality or efficacy.

      • Also, as Andrew once pointed out, if everyone adopts the power pose, the advantage is lost. So if Amy Cuddy is successful in converting everyone to her “mind hack”, it will become ineffective. In her success lie the seeds of failure. Perhaps fittingly.

        • I don’t like that specific criticism (not that I like power poses or Cuddy’s ideas).

          We adopt strategies that benefit specific individuals all the time.

          Sure, at equilibrium there’s no advantage, but we rely all the time on strategies that wouldn’t work at full equilibrium. If someone invents a magic pill that makes you 10x more confident, sure, it will sell, notwithstanding the fact that if the entire world consumed said pill we’d all probably be no better off.

        • But the whole point of the Amy Cuddy approach is to gain an advantage over others (e.g., in an interview setting, everyone goes to the bathroom and stands in front of the mirror and power poses). A pill that makes you 10x more confident is also implicitly intended to defeat your opponents.

          I forgot to add the assumption that Amy Cuddy doesn’t add the caveat to her magic potion that your mileage may vary. You Will Make It Once You Fake It, is the message. If there were individual differences claimed in the approach, then sure, even if everyone does it, not everyone will benefit equally, so some winners will still emerge.

        • My point is, you shouldn’t shun a strategy for gaining an advantage over others just because it would stop working if everyone adopted it.

          That seemed the gist of your argument above and I think that’s a bad argument.

  2. I don’t think psychology or linguistics can really afford to do what you are doing. There we are in the business of evaluating claims, which amount to investigating whether theta is greater than 0, or the like.

    • “There we are in the business of evaluating claims”.

      But should we be? :) Many Psych papers are centered around questions like “does x affect y”, and in many (even most?) cases this question can be answered from the armchair with “Of course x affects y, at least in some small way”. (Of course IQ correlates with timing ability, of course variations in verbal instructions affect speed of walking, of course sex affects taste preference, at least in some tiny tiny way.) The more interesting question should be “how much and in what way does x affect y”, and for that an estimation focus seems more suitable. But that’s a more difficult question and psychological research seems to like traditions… :)

    • I don’t know: I do more sociolinguistics, which I think should be more concerned with measurement. The types of questions I’m concerned with involve investigating which groups of people use which variants with which degree of frequency on which occasions.

  3. Coming from physics and working in medical statistics now, I fully agree with your “measurement” framework. However, I also have to take seriously my medical colleagues’ troubles: they cannot work with a 3µg/l lower gastrin concentration, they want to know if they should send the patient to surgery or psychological treatment. And try explaining to patients that they have a 3µg effect size – we are the patients!

    I tend to argue the “don’t dichotomize too early” way. Don’t look at 5 p-values from the last conference; look at 5 effect sizes and try to make a back-of-the-head multivariate decision. Physicians are very good at this when it comes to combining posture/language/skin color/movement of the patients into a decision pattern; they should use the same combinatorial skill when looking at laboratory values. We should support this in education by training people to take priors into account more seriously.

    And stop all editors from requesting p-values of everything …

      • Yes, but at the same time, just like with roundoff error, if you have a lot of intermediate values where the information has been limited to just a few bits, you can wind up with a very lousy and noisy final decision. So I agree with Dieter, and I think he’s right that doctors are pretty good at combining continuous information into decisions; if you need to do it for them, you should accumulate all the continuous information you can and put it into an algorithm rather than making intermediate dichotomizations, etc.
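
        As a toy version of the roundoff analogy (everything here is invented for illustration: a hypothetical disease indicator, five weakly informative lab values, and an arbitrary abnormality cutoff of 0.25):

            import numpy as np

            rng = np.random.default_rng(1)
            n = 5_000
            disease = rng.binomial(1, 0.3, n)                  # hypothetical patient status
            # five noisy continuous measurements, each shifted a bit when disease is present
            labs = disease[:, None] * 0.5 + rng.normal(0, 1, (n, 5))

            continuous_score = labs.mean(axis=1)               # pool the raw measurements
            dichotomized_score = (labs > 0.25).sum(axis=1)     # count how many labs are "abnormal"

            def auc(score, truth):
                """P(a diseased case scores above a healthy one); ties get half credit."""
                pos, neg = score[truth == 1], score[truth == 0]
                greater = (pos[:, None] > neg[None, :]).mean()
                ties = (pos[:, None] == neg[None, :]).mean()
                return greater + 0.5 * ties

            print("pooling the continuous values: ", auc(continuous_score, disease))
            print("dichotomizing each value first:", auc(dichotomized_score, disease))

        The second score throws away bits at every intermediate step and pays for it in discrimination, which is the roundoff point in miniature.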

        • Yes, I agree with Dieter on that too.

          What bugs me is there isn’t much discussion about how to translate the continuous intermediate values into that final discrete decision.

        • This means in essence coming up with utility functions, and my impression is that utility functions “fix” certain political power relationships into numerical quantities, and so people tend to avoid them at all costs. :-)

          You might think you’re arguing about “what the decision making should be about the assumed loads on structures so as to calculate the required strength of connectors” but what you’re really arguing about is “whether it’s going to be more economically favorable to build a concrete structure or a steel structure and therefore how much money the concrete people are going to make vs how much money the steel fabricators are going to make”

          similarly for medical decisions, it might seem simple enough to say your problem is “given these three lab measurements, come up with a decision about whether to treat with drug A or to monitor for several months with monitoring process B” but in reality your decision will be “given these three labs, will insurance companies C, D, E pay a lot of money to the drug company that makes drug A or less money to the laboratories that do monitoring process B” etc etc
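
          To make that concrete with a toy expected-utility calculation (the probability and the four payoff numbers below are placeholders, and they are precisely the quantities people end up fighting over):

              # decision: treat with drug A versus monitoring process B, given P(patient needs the drug)
              p_needs_drug = 0.35                      # hypothetical probability from the three labs

              utility = {                              # hypothetical payoffs on an arbitrary scale
                  ("drug A", "needs drug"): 0.9,
                  ("drug A", "does not"): 0.2,         # side effects and cost with little benefit
                  ("monitor", "needs drug"): 0.3,      # delay hurts if treatment was needed
                  ("monitor", "does not"): 0.8,
              }

              def expected_utility(action, p):
                  return p * utility[(action, "needs drug")] + (1 - p) * utility[(action, "does not")]

              for action in ("drug A", "monitor"):
                  print(action, round(expected_utility(action, p_needs_drug), 3))

          Whoever gets to write down those four payoff numbers is settling exactly the question of whose costs and whose revenues count, and by how much.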

        • So long as the academics writing the model are not getting paid by the steel fabricators nor the concrete people there’s no conflict of interest right?

          And academics write things that annoy various stakeholders all the time so what’s another one.

  4. It’s a real pity there is no way we can build computational models of the processes we study. That way we could just evaluate model predictions by looking at the best estimates we have from the data. Sad!

      • It’s both. Some areas of psych and ling ask questions that can be studied by building models; but people often don’t do that but instead try to look up the p-value to decide on the truth of their theory. The problem with hand-waving theories is that there are so many degrees of freedom. One reason I quit working in theoretical linguistics was that there was always an escape hatch, once a counterexample was identified. But there are other questions (“does striking a superman pose improve your performance in a job interview?”) that don’t easily admit to formal modeling (“A computational model of power posing”; I guess this paper will be coming out soon), and there all you can really do is sit down and fart in that puddle. Or watch the waveform wiggle around indefinitely. Then there are problems that fall in between those extremes—I think psycholinguistics is in that gray zone. The questions are too complex to model realistically, but one can make crude approximations. I suppose those crude approximations are the way to go for now.

        • One wacky thought is that we should add a specific section to all articles listing empirical things which, if found, would be strong evidence that the authors’ proposed hypothesis was wrong.

          I.e., along with your conclusions, brainstorm potential experiments that would yield evidence contradicting your pet model.

        • Part of the graduate level training in linguistics (at least in the US, where I did my PhD) is to argue against any counterexample. It’s emerging that it’s the same in other disciplines too, like psych. It’s a bit like courtroom dramas on TV—the lawyer for the defence is going to defend his/her client, no matter what the facts. Isn’t that how debate teams work in schools too? In my school (St Columba’s, in Delhi), basically each side was assigned a position and then their job was to argue for it. In academia, the position is assigned by virtue of your working with a particular PhD advisor. The moment you are associated with an advisor, your job is to defend his/her position. You won’t find many PhD students publishing papers against their advisor’s favorite position, in any field, but more so in psych and ling.

        • Well, not all systems of justice are adversarial. Maybe it’s more apt to think of academic publishing like the Napoleonic systems, where it’s not all about vigorous representation but more about finding out the truth.

        • Oh OK, googled it. But unfortunately we all watch too many American TV dramas. All we ever do is defend the hell out of our adopted position (sometimes adopted by the vagaries of chance, e.g., which graduate school you happen to land up in).

        • If you do want to go with the courtroom model, we should have the Journal Editors put up submitted manuscripts for public review so that we can have an actual, skeptical adversary.

          Right now the power poses & himmicanes crap is going through unquestioned.

        • “You won’t find many PhD students publishing papers against their advisor’s favorite position, in any field, but more so in psych and ling.”

          Sad. Makes me glad I’m a mathematician, where finding a counterexample is considered a good thing.

  5. I fully agree with the approach. The only question in application, a.k.a. decision-making, is not whether an effect is real but how much it matters.

  6. As with most things, context makes a difference. “False discovery” and “false positive” do make sense in (for example) a context where you are screening genes for possible effects on a certain medical/biological condition or outcome. (Of course, the frequent language of “linked to” is often misleading in these contexts — unless the “possible discovery” has been backed up by a plausible mechanism of effect.)

    • I think that’s a statement about the typical spectrum of effect sizes in these fields: it’s not unreasonable to divide genes into “large effect” (conditional on experimental/observational setup, environment, genetic background, etc etc …) and “small effect” ≈ “no effect”. In reality, there are probably *no* genes of zero effect for a given trait – everything interacts with everything else/epistasis is rampant. But the “true/false positive” approximation may be less silly in genetics/genomics, at least at our current levels of resolution, than in other fields …

      • So I would advocate thinking of a false discovery as a Type M(agnitude) error instead of a Type I error, although the error could result from a deflated (due to bias or sampling) standard error rather than an inflated (due to bias or sampling) coefficient.
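
        For reference, the screening machinery usually behind “false discovery rate” claims in this setting is something like the Benjamini–Hochberg step-up procedure; a generic sketch (the p-values below are simulated, not real gene data):

            import numpy as np

            def benjamini_hochberg(p_values, q=0.10):
                """Return a boolean mask of 'discoveries' controlling the FDR at level q."""
                p = np.asarray(p_values)
                m = len(p)
                order = np.argsort(p)
                thresholds = q * np.arange(1, m + 1) / m           # BH step-up thresholds
                below = p[order] <= thresholds
                k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
                discoveries = np.zeros(m, dtype=bool)
                discoveries[order[:k]] = True                      # reject the k smallest p-values
                return discoveries

            # example: 5 "large effect" genes hidden among 995 near-null ones (all simulated)
            rng = np.random.default_rng(2)
            p = np.concatenate([rng.uniform(0, 1e-4, 5), rng.uniform(0, 1, 995)])
            print(benjamini_hochberg(p).sum(), "discoveries")

        The procedure is agnostic about magnitudes, which is exactly the gap the Type M framing above is meant to fill.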

  7. Question: Suppose I have a point estimate and a standard error. What range of values can I “rule out”? Example: I estimate the effect of air pollution on child mortality using more, higher quality data than any previous study. I want to know whether my results “rule out” previous estimates in the literature.

    I could just say “Confidence intervals generated in this manner cover the true effect with a probability alpha” and leave it at that. But that ignores the power bit – maybe I want to say I can reject an effect size of Delta with power = 0.8. Or maybe I want some function of alpha/power that can rule out certain parameter values at various levels of confidence. Or maybe I want to invert a set of equivalence tests (or maybe that reduces to something like one of the above).

    My point is just that this seems like a fundamental problem to understand in moving from statistics as “discovery” to statistics as “measurement”. And yet there is almost no discussion of this sort of thing in the social science literature (at least in most empirical papers, I understand there is a methodological literature here, though not one I’m super familiar with).

    I also think that this kind of exercise has potential to weaken the “file drawer” effect – at least for well-executed studies that generate fairly precise estimates with CIs that still cover 0. If new, better work can be framed so as to rule out the effects of previous, noisy work, maybe journals would be more interested (compared to just saying “we find no effect”). It also helps address the problem of poorly-done replication studies – just because you find 0 where previous authors found not-0, you can’t say much if you can’t rule-out their effect size.

    So I guess two questions: a) what are the best ways to make this argument (in terms of statistical calculations; one possibility is sketched below); and b) why don’t we see this happening in the literature?
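
    On (a), one candidate calculation is simply to invert one-sided tests: report the smallest effect size the new study rejects, and the power it had to rule out a previous estimate under some assumed truth. A sketch with invented numbers (new estimate 0.01, standard error 0.02, previous literature value 0.10):

        from scipy import stats

        def rule_out_bound(estimate, se, alpha=0.05):
            """Effect sizes above this bound are rejected by a one-sided test at level alpha."""
            return estimate + stats.norm.ppf(1 - alpha) * se

        def power_to_rule_out(delta, se, true_effect=0.0, alpha=0.05):
            """P(the study rejects 'effect >= delta') when the true effect is true_effect."""
            z = stats.norm.ppf(1 - alpha)
            # reject when estimate < delta - z*se, with estimate ~ Normal(true_effect, se)
            return stats.norm.cdf((delta - z * se - true_effect) / se)

        print(rule_out_bound(0.01, 0.02))      # effects above roughly 0.043 are ruled out at the 5% level
        print(power_to_rule_out(0.10, 0.02))   # power to rule out 0.10 had the true effect been 0

    Running the same inversion over a grid of delta values gives the “set of equivalence tests” framing mentioned above.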

  8. What is as important, and perhaps more important, is the difference in values (means or medians) in practical terms. When we have p = .001 in large samples for small differences that are trivial in the larger context, we have to reduce the difference to some practical change in application – or not. The p-value is only a tool, only one tool among several.
