Workflow and the role of hypothesis-free data analysis

In our discussion a couple days ago on the role of hypotheses in science, Lakeland wrote:

Even “this data is relevant to the question we’re studying” is already a hypothesis. There’s no such thing as hypothesis free data analysis.

I’ve sometimes said similar things, in that I like to interpret exploratory graphics as model checks, where the model being checked might be implicit; see for example this recent paper with Jessica Hullman.

But, thinking about this more, I wouldn’t quite go so far as Lakeland. I’m thinking there’s a connection between his point and the idea of workflow, or performing multiple analyses on data. For example: I just went on Baby Name Voyager and started typing in names. This was as close to hypothesis-free data analysis as you can get. But after I saw a few patterns, I started to form hypotheses. For example, I typed in Stephanie and saw how the name frequency has dropped so fast during the past twenty years. Then I had a hypothesis: could it be alternative spellings? So I tried Stefany etc. Then I got to wondering about Stephen. That wasn’t a hypothesis, exactly, more of a direction to look. I had a meta-hypothesis that I might learn something by looking at the time trend for Stephen. I saw a big drop since the 1950s. Also for Steven (recall that earlier hypothesis about alternative spellings). And so on.

My point is that a single static data analysis (for example, looking up Stephanie in the Baby Name Voyager) can be motivated by curiosity or a meta-hypothesis that I might learn something interesting, but as I start going through workflow, hypothesizing is inevitably involved.

I’m thinking now that this is a big deal, connecting some of our statistical thoughts about modeling and model checking and hypotheses with scientific practice and the philosophy of science. Statistical theory and textbooks and computation tend to focus on one model at a time, or one statistical procedure at a time; in the workflow perspective we recognize that we are performing a series of statistical analyses.

It’s hard for me to imagine doing a series of analyses without forming some hypotheses and without thinking of how to refine these hypotheses or adjudicate among alternative theories of the world. One quick data analysis, though, that’s different. I sincerely think I looked at that Stephanie graph out of pure curiosity. As noted above, deciding to look at some data out of curiosity could be said to reflect a meta-hypothesis that something interesting may turn up, but I would not classify that as much of a hypothesis at all. After looking at the graph, though, the decision of what to look at next is definitely hypothesis-informed.

Similarly, I can conduct a survey and ask a bunch of questions without having any hypothesis of how people respond; I can just think it’s a good idea to gather these data. But I think it would be hard to conduct a follow-up survey without making some hypotheses. (Again, I’m speaking here of scientific or engineering hypotheses, not “hypotheses” in the sense of that horrible statistical theory of “hypothesis testing.”)

So . . . hypothesizing plays a crucial role in statistical workflow, even though I don’t think a hypothesis is necessary to get started.

13 thoughts on “Workflow and the role of hypothesis-free data analysis”

  1. Doesn’t this just really depend on your definition of hypothesis?
    “I can conduct a survey and ask a bunch of questions without having any hypothesis of how people respond; I can just think it’s a good idea to gather these data.”
    So you don’t have a hypothesis on response, but isn’t “good idea to gather these data” an underlying hypothesis about what data to gather and what is a good idea?

    • Jd:

      What you’re talking about is what I was calling in my post “a meta-hypothesis that something interesting may turn up.” I agree that such a meta-hypothesis has some scientific content, but it seems too vague for me to consider it a scientific hypothesis. Just for example, a scientific hypothesis (in my terminology) might be “People become more ideological in their views when they move to another state.” A vaguer scientific hypothesis is “People’s political views change (more than they would otherwise) when they move to another state.” A meta-hypothesis is, “It could be interesting to look at a bunch of political attitudes and demographics on a bunch of survey respondents.” This meta-hypothesis might well be informed by a substantive hypothesis, but often it does seem to me that we do something close to pure exploratory analysis, where we look at some data in the vague hope or expectation that we might find something interesting.

      To put it another way, much of the time what is labeled as exploratory data analysis is more focused and more clearly motivated by hypotheses or models; that’s what Jessica and I talk about in our paper. To refer to everything as a hypothesis is to make that statement of ours kind of empty.

      • I see. Well, it seems useful to define ‘scientific hypothesis’ as something rather specific or narrow, as in your first example, but I can see Lakeland’s point because it seems it’s really all hypothesis on some continuum of specificity.

      • I agree that if “everything” is a hypothesis, then it’s an empty statement. But I don’t think everything is a hypothesis, only suppositions that can be answered by collecting data. This contrasts with, say, “opinions” or “religious beliefs” or “mindfulness meditation” or whatever. It also isn’t a supposition to just ask a question like “what happened to the name Stephanie over the last few years?” It can be answered with a graph, but it doesn’t have any supposition involved… But for example “did Stephanie get a lot less popular recently? I haven’t heard that name in a while” is definitely a hypothesis.

        I think you’re saying that sometimes you start with a pure question without suppositions, like “what happened to Stephanie?” but once we know something about the question, we rapidly proceed to wonder about specific suppositions. I think that’s about right.

        • Daniel:

          Yes to what you’re saying, that as we move through our workflow, we form hypotheses as we go along.

          “What happened to Stephanie?” is already edging toward a hypothesis, in the sense of my Why Ask Why paper with Guido. The question that I think is more hypothesis-free is, “What’s Stephanie been doing?” I know from experience that these name graphs are often interesting and unexpected, so I suspect that if I look at Stephanie, I might find something interesting.

        • One way to think about this is in terms of possibilities, actualities and interpretabilities rather than hypothesis, deduction and induction, and to treat these as just phases that meld into one another.

          For instance, a possibility once reflected upon deductively and/or inductively may become a favored possibility (hypothesis) worthy of investigation.

          And “What happened to Stephanie?” might start mainly as interpreting what possibilities seemed to have been actualities (induction). The mistake, I think, is trying to treat them as separate categories or buckets. Thinking is far too fluid for that to be the case.

  2. This discussion gets a bit tricky because it boils down to inferring the causes of our own or others’ behavior. For example, even if any particular analytical step might not be motivated by one or a few clear hypotheses, there are likely a huge number of “latent” hypotheses that we could entertain at any given time, and those surely guide our behavior even if we make no attempt to formalize them.

    For example, looking up Stephanie, chances are one would have many ideas about how the graph *might* look, each of which might be connected with some potential causal mechanism to get it to look that way. Once we see that graph, some of those ideas will be confirmed/disconfirmed. Seeing the graph activates latent hypotheses that we have accumulated from prior experience, classroom instruction, wherever.

    So whether an analysis is “hypothesis free” is tricky: Seeing the result of any analysis will, if we are curious, lead us to come up with hypotheses. And the particular analysis done would almost certainly be informed by the set of latent hypotheses we’ve acquired. Even if an analysis was not motivated by a specific hypothesis, it might have been motivated by an amalgam of latent hypotheses. It may also have been motivated by a desire to come up with some specific hypotheses.

    • Gec:

      I agree, and one of my reasons for bringing this topic up is that I think that, when doing statistical analysis of any kind—graphical or otherwise—it is helpful for us to introspect and try to be as explicit as possible about the hypotheses we’re working with.

  3. “Even ‘this data is relevant to the question we’re studying’ is already a hypothesis.
    There’s no such thing as hypothesis free data analysis.”

    The relevant term here is human “bias.”

    Everything done by a human is subjective.

    Objectively objective human analysts do not exist, on any subject.

    Even the initial undertaking of any analysis is a subjective/biased choice by the analyst involved.

    • Crsw:

      1. See my response to jd above.

      2. Regarding subjectivity and objectivity, I recommend this paper with Hennig. The terms subjective and objective are imperfect but I think there’s value in considering the attributes associated with these words.

  4. A dimension that seems relevant here, and perhaps too in the difference between “day science” and “night science”, may not be that we might have a reason to look for something — a meta-hypothesis in your parlance — but that we are willing to say that we have that reason before we look. As a consequence, we can plan a mechanism for looking that would help us confirm if our reason appears valid or not.

    For example, I’m interested in how name use changes over time so I look haphazardly at plots for some names, but then I see what looks like a trend I think I can identify: “some names have more spellings in recent years and this might contribute to a decline in the traditional spelling of a similar name.” Now, even though I’m still just acting on interests, I have the chance to define a protocol for learning from that interest. In this case, we might systematically examine the trends of all names that have multiple spellings etc.

    To me, it’s this situation of defining a method, where the method can be critiqued, that leads to a change in the kinds of things we can learn. I feel the act of being interested is not quite a hypothesis in this way of thinking, unless we try to pin down what we think would constitute it being interesting and consider how we might measure that in this situation.

    Further, I think it’s possible to define a defensible method without defining a hypothesis, and to me, that’s OK. I don’t have to have a belief about something to measure it in an accurate and replicable way, and the notion that I think it’s worth measuring is responding to a different question, in my mind; that is, it is not necessarily a hypothesis about the thing being measured.

  5. Just a few thoughts after reading the posts: As others have said, I find the conversation about hypotheses tricky (and sometimes unproductive) in part because there are so many different uses and understandings of what is meant by the word within science. I think we often don’t even have a sense of when our own understandings or implications differ from those of others. Despite that, or maybe in addition to it, I haven’t been able to convince myself that explicit hypothesis-talk has much value (at least in the projects I have been involved in over the last 20 years). In my experience, it’s often superficial and post-hoc.

    I see more value in trying to back up a step and explicitly talk about research questions, and then the researcher methods and assumptions used to investigate the question. I think a question comes first – before any “hypothesis” – and it seems more like the logical starting point in the scientific process or workflow. The research question is also easier to critique from a more open and broad perspective – without shifting focus to a particular outcome as is often associated with hypothesis. It can be tempting to evaluate a hypothesis conditional on a question, without first backing up and considering the worth of the question. Modeling comes in as methods (with researcher assumptions) used to address a research question in a particular researcher-dependent way. I think the question + methods/assumptions fits in well with Andrew’s workflow perspective. I’m not convinced that forming and stating hypotheses (at least how it is often done) improves the scientific process – but instead takes focus away from justification of other parts of the workflow. It seems that “hypotheses” are often part research question and part researcher assumptions – wrapped up in a way that makes it really hard to disentangle them.

  6. I agree with Megan that “question” is often a better starting point for research than “hypothesis.” Often a research program begins with “we don’t know much about X, so let’s create a set of data.” It often happens that as we collect that data we discover hypotheses we try to verify or falsify, but “It is a capital mistake to theorize before one has data.”

    Another model which might be helpful is the hermeneutic cycle between the general and the particular.
