29 thoughts on “Science and personal decision making”

  1. I think this is one of your clearest, most intuitive elucidations of the Garden of Forking Paths:

    “Suppose you get people to roll a die 600 times and hypothesize that of all six faces, some are more likely to come up than others. The problem here is that we have many different things to look at: we could compare the 1’s to the 6’s, or compare low numbers to high numbers, or compare even numbers to odd numbers, or look at some other combination such as comparing men who roll even numbers to women who roll odd numbers. We could have a story for any of these. Any specific reported claim typically represents only one of many analyses that could have been performed on a dataset.”

    And this psychological point seems helpful in making your argument that all this isn’t really just about some bad (in whatever sense) researchers or subfields, it is about a pervasive problem in both statistical training and human psychology:

    “And it’s not just about seeing patterns in tea leaves, or silly psychology studies. I’m afraid we do this same sort of reasoning all the time, seeing patterns from noise in everyday life. When our children were little, we were always trying to figure out how to help them get to sleep. For a while we focused on regular bath times, then there were the different variants of swaddling, the favorite stuffed animals, and so forth. We’d try something new and it would work–but were we just lucky? That said, there was no harm in us experimenting.”

    P-values are the epistemological grounding for truth creation in (some) social science. The problem is they don’t protect against the human need to interpret order from chaos nearly as well as we previously thought (or would have to think in order for p-values to be epistemologically useful).

    I think personalizing it, bringing it into everyday experiences and choices that people make, and giving intuitive probability examples really makes the argument stand well on its own, away from all the baggage associated with specific academic disputes. Of course, you need both kinds of rhetorical framings, but the argument you make here seems like a new tack to me, and a successful one.

    • St:

      I’m not obsessed with power pose. What happened here was that Maryna interviewed me on the replication crisis, and power pose is a good example as it’s been in the news lately. Power pose is a great example here because it’s received tons of promotion from Ted talks, NPR, etc. I don’t see why you’d be nauseated by my mention of power pose. But if it really does bother you, I guess you could consider trying to calm yourself down by holding your body in an open stance for two minutes; I’ve heard this has lots of positive effects!

        • Shravan:

          It’s interesting what can nauseate people. Personally, I find peanut butter nauseating. I’m not allergic to the stuff, I just can’t stand the smell or the taste, even in small concentrations.

          • I’m starting to get nauseated by my own field because it feels too much like power posing! I’ve been desperately computing estimates of power from previous studies for planned experiments in my lab, and the numbers are so pathetically low it seems there may be no point in even running experiments! And repeating published experiments that reported significant effects, and finding…nothing. Or repeating experiments that found nothing and finding…something. Or repeating experiments and finding…the opposite of the published result. I’m trying to recall what statistical phenomenon this wild fluctuation reflects, but nothing really comes to mind.

            Maybe one has to stop worrying about finding the truth about a phenomenon. The journey is the destination. Experiments are a philosophical meditation: what if this were true?

  2. Yes, good interview. This part is important too:

    “So, sure, if the two alternatives are: (a) Try nothing until you have definitive proof, or (b) Try lots of things and see what works for you, then I’d go with option b. But, again, be open about your evidence, or lack thereof. If power pose is worth a shot, then I think people might just as well try contractive anti-power-poses as well. And then if the recommendation is to just try different things and see what works for you, that’s fine but then don’t claim you have scientific evidence for one particular intervention when you don’t.”

    One of the biggest problems is that people take intuitive/experiential findings and then try to present them as “science.” This is especially prevalent in “action research” (in education, for instance), where, with the sanction of education departments, school districts, etc., teachers try new things in the classroom and then write up the results as “research” (which often gets published).

    It’s great to try new things in the classroom. It’s often good (and possibly great) to write up your findings for the benefit of others. But there’s no need to call it science or “action research” (or the preferred phrase in education, “data-driven inquiry,” which really just means that you’re looking into what you see before you, but which sounds official and definitive). Good education research exists, but it’s rather rare; in the meantime, there’s plenty of room for informal investigation, as long as it’s presented as such.

  3. This quote from the interview

    Suppose you get people to roll a die 600 times and hypothesize that of all six faces, some are more likely to come up than others. The problem here is that we have many different things to look at: we could compare the 1’s to the 6’s, or compare low numbers to high numbers, or compare even numbers to odd numbers, or look at some other combination such as comparing men who roll even numbers to women who roll odd numbers. We could have a story for any of these. Any specific reported claim typically represents only one of many analyses that could have been performed on a dataset.

    would make a nice project: generate the dataset correctly (randomly) – 300 men and 300 women – and ask students to come up with a publishable story, documenting their (forking) path to the result. How many different stories would N students come up with?
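    A minimal sketch of that project in Python (the function names and the particular set of comparisons are my own choices): generate 600 fair rolls for 300 “men” and 300 “women,” then run several of the comparisons from the quote. Each test alone is calibrated under the null, but scanning all of them is how a “publishable story” can emerge from pure noise.

```python
import math
import random

def two_sided_p(successes, trials, p0=0.5):
    """Two-sided normal-approximation test of a binomial proportion."""
    if trials == 0:
        return 1.0
    z = (successes - trials * p0) / math.sqrt(trials * p0 * (1 - p0))
    return math.erfc(abs(z) / math.sqrt(2))

def forked_pvalues(rolls):
    """rolls: list of (sex, face) pairs; returns one p-value per candidate 'story'."""
    ones = sum(1 for _, f in rolls if f == 1)
    sixes = sum(1 for _, f in rolls if f == 6)
    low = sum(1 for _, f in rolls if f <= 3)
    even = sum(1 for _, f in rolls if f % 2 == 0)
    men_even = sum(1 for sex, f in rolls if sex == "M" and f % 2 == 0)
    women_odd = sum(1 for sex, f in rolls if sex == "F" and f % 2 == 1)
    return {
        "1s vs 6s": two_sided_p(ones, ones + sixes),
        "low vs high": two_sided_p(low, len(rolls)),
        "even vs odd": two_sided_p(even, len(rolls)),
        "men-even vs women-odd": two_sided_p(men_even, men_even + women_odd),
    }

# 300 men and 300 women, all rolling the same fair die
rng = random.Random(1)
rolls = [("M", rng.randint(1, 6)) for _ in range(300)] + \
        [("F", rng.randint(1, 6)) for _ in range(300)]
pvals = forked_pvalues(rolls)
for story, p in pvals.items():
    print(f"{story}: p = {p:.3f}")
```

    Since everything comes from a fair die, each comparison is below 0.05 about 5% of the time; with many forks available, the chance that some story reaches significance grows accordingly.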

    • “How many different stories would N students come up with?”

      I’ve always been terrible at probability counting, but I think the answer has, like, an “N”, a “K” and an “!” in it.

      • This one’s easier than those with N’s and K’s and factorials. Theoretically, at most N. I was asking for an experimental result – I’d like to see what happens when N students tackle the problem separately, with the same data set. I hope someone tries it and lets me know.

        • I told you I was terrible at counting!

          I guess the “a publishable story” thing does make this a trick question with a clear upper bound. But what if we had a million graduate students and a million laptops…. if my probability theory is right, eventually two groups would find that Rosencrantz and Guildenstern were, in fact, both flipping a different coin over and over through a recurring single moment in time, AND flipping the same coin over and over through moving time. Both based on p<0.05 testing the null hypothesis of "there is no paper to publish here". H_a wins again!

          OK but seriously – I could see that being a great term project for an upper-division undergrad methods course. Getting kids to ridicule something seems to me like it is usually a good way to get them to not want to do that thing in the future. But I'd make sure they understood it was a frame-up from the beginning…that way they never mentally associate p-hacking with actual scientific research. P-hacking is just a fun game you can play if you aren't serious about learning anything about how the world works.

      • But that’s pure combinatorics (the probability counting jrc refers to). The probability is computable and experiments will agree with the prediction – statistically, of course, perhaps not for the one class you try it in. This is a fact people find surprising. Forking paths lead to surprising conclusions that aren’t facts. So I don’t think it’s a good analogy.

        • Forking paths lead to surprising conclusions that ARE facts, they’re just not facts about the world, they’re facts about random number generators. Specifically, facts of the form

          “Specific random number generator Foo produces data with test statistic t(MyData) less than fraction p of the time you ask it to produce datasets of the size Size(MyData)”

          The problem is, people are all mixed up as to whether facts about random number generators and facts about the world are the same kind of thing (short answer: no, they aren’t)

          My high school chemistry teacher used to introduce pH indicator dyes with stories about battles between red fairies and blue fairies…. Those stories have the same relevance to science that stories about random number generators do.
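          The kind of “fact about a random number generator” quoted above can be written out directly (the test statistic and all names here are my own choices): for one specific generator, a fair six-sided die, the p-value is by construction just the frequency with which that generator produces a statistic at least as extreme as the one observed.

```python
import random

def t_stat(faces):
    """Test statistic: absolute excess of even faces over the expected half."""
    return abs(sum(1 for f in faces if f % 2 == 0) - len(faces) / 2)

def null_p_value(t_obs, n_rolls=600, n_sims=2000, seed=0):
    """Fraction of simulated fair-die datasets with a statistic >= t_obs."""
    rng = random.Random(seed)
    hits = sum(
        1
        for _ in range(n_sims)
        if t_stat([rng.randint(1, 6) for _ in range(n_rolls)]) >= t_obs
    )
    return hits / n_sims

rng = random.Random(42)
observed = [rng.randint(1, 6) for _ in range(600)]
p_obs = null_p_value(t_stat(observed))
print(f"p = {p_obs:.3f}")  # a statement about the generator, not about the world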

          • It’s great to have a high fidelity model for p(data|H_i) but it’s a means to an end not an end in itself. When it comes to decision making what I need to know is p(H_i|data). I need to know the random number generators for all my hypotheses not just the null hypothesis.

            • The fact that you can take any probability distribution and construct a random number generator to match frequencies to probabilities does not mean that p(H_i | data) is a random number generator that matches probabilities to the frequencies with which experiments produce outcomes.

              • Or that’s more like p(data | H), but p(H|data) is not a probability that a random number generator p(data|H) is the “right” rng for the universe or whatever… basically, RNGs are unrelated to most science.

              • If p(H_i|data) isn’t consistent with observed frequencies of outcomes then something’s wrong.

                I can fit two different (nested) regression models to my data and compute an F-statistic. With the presumption that one of the models is the correct one and that the measurement noise is Gaussian, if I set a threshold and use the value of the F-statistic to decide which model is the right one then the probability that I pick the right one is described by an F-distribution. My random number generator is configured to generate that F-distribution.
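                That construction can be sketched for the simplest nested pair, an intercept-only model inside an intercept-plus-slope model (the data below are invented for illustration; under the null of zero slope with Gaussian noise, this F follows an F(1, n−2) distribution):

```python
def rss_mean_model(y):
    """Residual sum of squares for the intercept-only model."""
    m = sum(y) / len(y)
    return sum((yi - m) ** 2 for yi in y)

def rss_line_model(x, y):
    """Residual sum of squares for simple linear regression y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def f_statistic(x, y):
    """F statistic for the slope: (RSS0 - RSS1)/1 over RSS1/(n - 2)."""
    rss0, rss1 = rss_mean_model(y), rss_line_model(x, y)
    n = len(y)
    return (rss0 - rss1) / (rss1 / (n - 2))

x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0.1, 1.2, 1.9, 3.2, 3.8, 5.1, 6.0, 6.8]  # roughly y = x + noise
print(f"F = {f_statistic(x, y):.1f}")  # a large F favors the slope model
```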

              • > or that’s more like p(data | H) but p(H|data) is not a probability that a random numer generator p(data|H) is the “right” rng for the universe or whatever… basically RNGs are unrelated to most science.

                Our comments crossed in the ether. Now I think I get what you’re saying.

            • Chris:

              The expression p(H_i|data) typically doesn’t make much sense, at least not for models that do not have truly generative prior distributions. See Chapter 7 of BDA3 for some discussion of this issue.

              • Andrew,

                My copy of BDA3 is at work. I’ll check it tomorrow. In the interim, I go back to my hazardous material detection days. My sniffer measures a signal which I treat as a random vector, x. To keep it simple, say my two signal hypotheses are:

                H0: x = b, where b ~ N(mu, Sigma) for some covariance Sigma
                H1: x = a*s + b, where b is as above, s is the signature of the hazardous material I want to detect, and a is its abundance.

                Making toxin absent/toxin present decisions based on p(x|H1)/p(x|H0) > threshold is highly effective. You can’t meaningfully assign values of p(H0) and p(H1), but you can plausibly argue that p(H0)/p(H1) > big_number under normal circumstances, which in turn lets you tell when p(H1|data) clears the threshold above which the alarm should sound.
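                A stripped-down sketch of such a detector, simplified to white noise (covariance sigma² I) and a known abundance a so it fits in a few lines; the signature and all numbers are invented:

```python
import math

def log_likelihood_ratio(x, mu, s, a, sigma):
    """log p(x|H1) - log p(x|H0) for x = a*s + b vs x = b, b ~ N(mu, sigma^2 I)."""
    r0 = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))                  # ||x - mu||^2
    r1 = sum((xi - a * si - mi) ** 2 for xi, si, mi in zip(x, s, mu))  # ||x - a*s - mu||^2
    return (r0 - r1) / (2 * sigma ** 2)

def alarm(x, mu, s, a, sigma, threshold):
    """Toxin-present decision: likelihood ratio above the chosen threshold."""
    return log_likelihood_ratio(x, mu, s, a, sigma) > math.log(threshold)

mu = [0.0] * 5
s = [0.0, 1.0, 2.0, 1.0, 0.0]       # invented signature of the material
clean = [0.1, -0.2, 0.1, 0.0, 0.2]  # background only
contaminated = [xi + 0.8 * si for xi, si in zip(clean, s)]

for name, x in [("clean", clean), ("contaminated", contaminated)]:
    print(name, alarm(x, mu, s, a=0.8, sigma=0.3, threshold=100.0))
```

                Only the ratio of the two likelihoods enters the decision, which is why the detector works without ever assigning p(H0) and p(H1) themselves.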

              • >I say you put a prior on the parameter a and go from there.

                No one’s figured out a meaningful prior for the class of problems I was working on – at least not to the best of my knowledge. (I tried on a few occasions but didn’t get anywhere useful.) That said, if one could establish a meaningful prior for a, that would be an impressive step forward.

              • Chris, we just had a very brief but related question on here a while ago in the mixture model post, regarding “spike and slab” type priors, where something is most often 0 but then when it isn’t zero it could be almost anything. I’m actually working on a similar issue right now with a model for my wife’s RNA seq data.

                Some kind of very long tailed prior on a makes sense in these contexts. Since a is a positive number, a half-cauchy prior around 0 with a small scale parameter expresses essentially “most of the time nothing’s going on, but when it is… it could be anything even a very large number”
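                A small sketch of what that half-Cauchy prior expresses, with a half-normal of the same scale for comparison (the scale value here is invented): with a small scale, most mass sits near “nothing going on,” while the heavy tail keeps even very large abundances from being ruled out a priori.

```python
import math

def half_cauchy_pdf(a, scale):
    """Density of a half-Cauchy(0, scale) prior on a >= 0."""
    if a < 0:
        return 0.0
    return 2.0 / (math.pi * scale * (1.0 + (a / scale) ** 2))

def half_normal_pdf(a, scale):
    """Half-normal density, for comparison: its tail dies off much faster."""
    if a < 0:
        return 0.0
    return math.sqrt(2.0 / math.pi) / scale * math.exp(-0.5 * (a / scale) ** 2)

scale = 0.1
for a in [0.0, 0.1, 1.0, 10.0]:
    print(f"a={a}: half-Cauchy {half_cauchy_pdf(a, scale):.2e}, "
          f"half-normal {half_normal_pdf(a, scale):.2e}")
```

                At a = 10 with scale 0.1 the half-normal density is astronomically small while the half-Cauchy density is still appreciable, which is exactly the “could be anything, even a very large number” behavior described above.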

              • >Some kind of very long tailed prior on a makes sense in these contexts.

                Daniel,
                What you describe sounds promising. The (unproductive) direction I took was to model what the signal would be from a particular contamination event and then take my prior from the weighted average over plausible events. Maybe an interesting idea in the abstract, but in practice it was a mess. (I made a mess of it, at least ;-)
