In general, hypothesis testing is overrated and hypothesis generation is underrated, so it’s fine for these data to be collected with exploration in mind.

In preparation for writing this news article, Kelly Servick asked me what I thought about the Kavli HUMAN Project (see here and here).

Here’s what I wrote:

The general idea of gathering comprehensive data seems reasonable to me. I’ve often made the point that careful data collection and measurement are important. Data analysis is the glamour boy of statistics, but you can’t do much if your data are no good. Regarding your other question: I prefer not to use the pejorative term “fishing”; rather, I’d say it’s great that they’re gathering a large dataset that will allow them to make discoveries and formulate hypotheses that can always be tested on new data. In general, hypothesis testing is overrated and hypothesis generation is underrated, so it’s fine for these data to be collected with exploration in mind.

Finally, I’m skeptical about their claims that with “big data” they will be able to somehow resolve causal issues. For example, there’s this quote: “while correlations have been found between retirement and cognitive decline, it is not known whether this is due to retirees’ lower levels of mental engagement, reverse causation whereby cognitive decline induces retirement, or the resulting reduction in social contact. . . . The inability to resolve issues of causation reflects, to put it bluntly, a data limitation.” I disagree, and I think it’s naive of them to think that by gathering a bunch of observational data they’ll be able to answer such questions. To answer such causal questions you need experimental data or quasi-experiments. Sure, some genetic inputs can be taken as approximately equivalent to randomization, so there’s some leverage you can get from such large-scale observational studies, but ultimately if you want to understand the effects of a reduction in social contact, you have to experiment on this or observe some natural experiment. Getting a big pile of data can help you formulate some hypotheses, but I do think it’s naive, or maybe just hype-y, to attribute the inability to resolve issues of causation to a “data limitation.”
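
As a toy illustration of the naive-correlation versus quasi-experiment distinction, here is a minimal simulated sketch in Python. The variable names, coefficients, and the instrument are all made up, so read it as a sketch of the idea rather than an analysis of any real data:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Hypothetical setup: u is an unobserved confounder, z is a randomized-like
    # instrument (think of something loosely analogous to a genetic variant),
    # x is the exposure, y is the outcome. The true causal effect of x on y is 0.5.
    u = rng.normal(size=n)
    z = rng.binomial(1, 0.5, size=n)
    x = 0.8 * z + u + rng.normal(size=n)
    y = 0.5 * x + u + rng.normal(size=n)

    # Naive observational estimate: slope of y on x, biased upward because
    # u pushes x and y in the same direction.
    naive = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

    # Simple instrumental-variables (Wald) estimate: cov(z, y) / cov(z, x).
    # Because z is randomized and affects y only through x, this recovers ~0.5.
    iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

    print(f"naive slope: {naive:.2f}   IV estimate: {iv:.2f}")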

It’s too bad they had to spoil this interesting project with all that hype. I’m putting up a little fight against the hype by waiting until February to post my thoughts on this big big October news story.

P.S. Hannah Bayer, chief scientist of the project, gives some details here on their plans for data sharing.

26 thoughts on “In general, hypothesis testing is overrated and hypothesis generation is underrated, so it’s fine for these data to be collected with exploration in mind.”

  1. One aspect of this project seems vague to me but is important. It is unclear whether the data used in the studies will be publicly available or not. There are extensive discussions in their papers and on the website about the layers of protection of individuals’ identities (appropriate) but it is not clear whether this means public access to data used in their published work will be provided or not. In one article they do state: “While preserving the privacy and security of participants will be essential, an open door policy for access to the data is required if the potential of the dataset is to be realized.” If this means what it says, then I applaud it. But if it means that data is only available to the researchers involved in the studies, then I am concerned. I would like to see a clear commitment to providing data for replication (in its many forms). Many of their studies will have policy implications and I believe that any study that professes to involve policy should make its data publicly available (subject to appropriate, not draconian, privacy protection).

  2. Totally agree with your comment on larger samples and causation, but I do think it’s more misconception than hype. I think a major misconception in the biomedical sciences is that the smaller error that comes with larger samples in observational designs gives one more confidence that the effect is causal. The problem is, the estimated error doesn’t account for omitted-variable bias (unless this is explicitly modeled, which it rarely is), and that bias doesn’t really decrease with sample size. So we are fooled by these error bars (which don’t represent the magnitude of the true error).
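
    A minimal simulation sketch of this point, with made-up numbers: as the sample grows, the conventional standard error shrinks, but the omitted-variable bias stays the same size, so the interval just becomes confidently wrong.

        import numpy as np

        rng = np.random.default_rng(1)
        true_effect = 0.5  # hypothetical true causal effect of x on y

        for n in (1_000, 10_000, 100_000):
            u = rng.normal(size=n)      # unobserved confounder
            x = u + rng.normal(size=n)  # exposure, confounded by u
            y = true_effect * x + u + rng.normal(size=n)

            # OLS slope of y on x, ignoring u, with its conventional standard error
            X = np.column_stack([np.ones(n), x])
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = y - X @ beta
            se = np.sqrt(np.sum(resid**2) / (n - 2) * np.linalg.inv(X.T @ X)[1, 1])
            print(f"n={n:>7}  estimate={beta[1]:.3f}  se={se:.4f}  "
                  f"bias={beta[1] - true_effect:+.3f}")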

  3. It’s certainly true in general that adding more observations of the same variables will not resolve causal identification issues, but adding more variables (as they seem to be doing in the HUMAN project) and observing those variables over time might, e.g. if the new set of variables includes all important confounders. They do sound overconfident that they’ve picked the right additional variables to adjust for, but it is in theory possible, right? (A small simulation of this confounder-adjustment point is sketched at the end of this comment.)

    Also, they seem to have a specific set of possible causal models in mind when they say “it is not known whether this is due to retirees’ lower levels of mental engagement, reverse causation whereby cognitive decline induces retirement, or the resulting reduction in social contact. . .” Of course, the set of causal models they’re considering is undoubtedly way too small and way too simple, but given the clearly wrong assumption that the true causal model is in that set, observing the relevant variables over time could be sufficient to resolve the causal question.

    Basically, I agree with you, but I give these researchers credit for not making the usual stupid ‘Big Data can solve causal inference’ mistake and instead being hubristic about their ability to identify possible confounders.
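
    To make the “in theory possible” point concrete, here is a minimal simulated sketch with made-up numbers: if the important confounder u is actually measured and included in the regression, the true effect is recovered; leave it out and no amount of extra data helps.

        import numpy as np

        rng = np.random.default_rng(2)
        n = 100_000
        u = rng.normal(size=n)                # the "important confounder"
        x = u + rng.normal(size=n)            # exposure
        y = 0.5 * x + u + rng.normal(size=n)  # true effect of x on y is 0.5

        def slope_on_x(outcome, *cols):
            # coefficient on x (the first entry of cols) from a regression of
            # the outcome on an intercept plus the given columns
            X = np.column_stack([np.ones(len(outcome))] + list(cols))
            return np.linalg.lstsq(X, outcome, rcond=None)[0][1]

        print("omitting u:     ", round(slope_on_x(y, x), 3))     # ~1.0, biased
        print("adjusting for u:", round(slope_on_x(y, x, u), 3))  # ~0.5, about right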

  4. What about Granger causation? Or Judea Pearl’s DAGs?

    If you really had a huge pile of fine-grained data, would either of those help deduce causation?

    e.g. If you had a lot of cognition testing data points spread over time, could you use Granger causality to answer the question about retirement and cognitive decline?
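
    Concretely, a Granger-style check on such data would amount to something like the following sketch, with entirely made-up series (and it only gets at predictive precedence, which is weaker than causation proper): regress the outcome on its own lags, then add lags of the candidate cause and F-test the improvement.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(3)
        T, p = 400, 2  # length of the made-up series and number of lags

        # Hypothetical series in which x "Granger-causes" y with a one-period delay
        x = rng.normal(size=T)
        y = np.zeros(T)
        for t in range(1, T):
            y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.normal()

        Y = y[p:]  # targets y[p], ..., y[T-1]
        lags_y = np.column_stack([y[p - k:T - k] for k in range(1, p + 1)])
        lags_x = np.column_stack([x[p - k:T - k] for k in range(1, p + 1)])

        def rss(X, target):
            beta, *_ = np.linalg.lstsq(X, target, rcond=None)
            return np.sum((target - X @ beta) ** 2)

        ones = np.ones((len(Y), 1))
        rss_restricted = rss(np.hstack([ones, lags_y]), Y)    # y's own lags only
        rss_full = rss(np.hstack([ones, lags_y, lags_x]), Y)  # plus lags of x

        # F-test: do the p lags of x jointly improve the fit?
        df1, df2 = p, len(Y) - (2 * p + 1)
        F = ((rss_restricted - rss_full) / df1) / (rss_full / df2)
        print(f"F = {F:.1f}, p-value = {stats.f.sf(F, df1, df2):.2g}")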

  5. I don’t think that causality is determined empirically. I believe that causality is the consequence of theoretical architecture that identifies relationships among phenomena relevant to that theory. No amount of data or data analysis will identify, once and for all, if X causes Y.

    However, I agree that data analysis can generate hypotheses about what causes what, and I think that specific empirical consequences can be deduced from belief that X causes Y.

    • Garnett – I think we agree, and I think it is an important point to emphasize.

      My version of it goes: “Causation is always a rhetorical argument about the world.” By that I mean that it isn’t within the statistics that you can go from some coefficient/estimate to causality, it is from Epistemology, Logic, Metaphysics, History, Social Science…whatever. In the case of RCTs, the rhetorical argument has to do with the relationship between treatment assignment and unobserved variables – it isn’t like we can ever prove this within one study, we can only argue that, in general, random assignment leads to no specific correlation between treatment status and individual characteristics. The rhetorical argument, simple as it is, is just that “There is no reason to believe that the treatment and control groups would have been different in the absence of treatment.” But that isn’t an observable, verifiable claim, it is just an argument from logic and meta-statistics (defining the world in which the statistical apparatus operates). If we thought that everyone in the world was completely different and responded differently to treatment, this rhetorical argument would not carry any force.

      It is even more obvious in cases of quasi-experimental variation – say instrumental variables. There are no perfect instruments – there are better and less good ones. The believability of the causal interpretation comes from a rhetorical argument about the relationship between the variation induced by the instrument and the people upon which it is acting. Same with difference-in-difference, regression discontinuity… you name it – it is the argument about the relationship between “treatment assignment” or “identifying variation” in the world that carries the weight, not the model or statistics, which are just encoding that rhetorical argument in a statistical framework.

      • If I understand you correctly, when you say that “Causation is always a rhetorical argument about the world” one might conclude that ‘causation’ is not a particularly relevant aspect of statistical analysis. I think this is true, though I may have misunderstood your intent.

  6. So if causality can only be established with controlled experiments, should we force people to smoke or not for 10 years to see if they die? Try CO2 forcing on one set of planets and the opposite on another one? There are entire branches of science that are purely observational (astrophysics for one). I recognize the unique advantages of controlled experiments, but they can only answer a limited range of questions because of practical or ethical limits in performing them. The way observational science progresses from correlation to causation is by excluding confounding factors and contrasting alternative theories. The last one standing is considered the current accepted one — was that called falsificationism or some such? It seems to me statisticians make up their own restricted version of science and then spend time bashing everything else.

    • AP:

      Are you arguing against anything particular that I wrote above? I agree with your general point that it is possible to learn about causation without doing controlled experiments, but what I don’t understand is whether you believe that your comment is in contradiction to anything that I have written.

        • Andrew, you wrote “To answer such causal questions you need experimental data”. I thought that was as opposed to observational data. Sorry if I read too much into it. Some of the comments reinforced my interpretation, maybe in the wrong direction.

        • AP:

          No, I wrote, “To answer such causal questions you need experimental data or quasi-experiments.” A quasi-experiment is a sort of statistical analysis of observational data.

        • Sorry for chopping your statement halfway, that was a bit aggressive quoting. So in your opinion there are three sorts of empirical data: experimental, quasi-experimental, and observational. The first two can help establish causal relationships; the last is limited to hypothesis generation, to be probed later by one of the other two.

        • AP:

          No, I guess my sentence wasn’t clear. A quasi-experiment is just an analysis of observational data that is focused on a particular causal question. Some data are more suitable for quasi-experimental analysis than others, but I wouldn’t draw a sharp line.

        • I think the quasi-experimental part comes from the identifying variation in the world, not exactly from a feature of the data. I just point it out because often the part of the data that you use for harnessing the identifying variation in a quasi-experiment comes from a dataset separate from the outcome/covariate dataset. An obvious example is when you use some sort of weather change or geographical information as an instrument – that is usually merged in from somewhere else.

          The basic point is just that quasi-experimental modes of statistical inference are both observational and capable of producing causal estimates. As such, they have features of both observational and experimental work – observational data and an epistemology based on experimental thinking (which is justified by the quasi-experimental/randomly-assigned identifying variation).

    • AP: For smoking, the relevant experiment is to enroll people, randomize to treatment and control, where the treatment is a smoking cessation program. This is relevant because that’s what one might want to do if one discovered that smoking does cause disease and death. (There are other things one can do as well, but many of those do not involve intervention at the individual level.)

      In any case, the observational evidence was rather striking. See Sec. 8 of http://www.stat.berkeley.edu/~census/521.pdf for a beautiful summary.

      One can also read http://www.wiley.com/legacy/wileychi/eosbs/pdfs/bsa598.pdf for a lucid discussion of the relationship of causal inference and statistical models. A further improved version is in Sections 6.4 and 6.5 of the revised edition of Freedman’s “Statistical Models”.

        • Good point Russell, maybe my smoking example was not the best, but you may agree enforcing cessation for, say, 5 years, is a though design to implement in practice. I can tell you some funny stories from the Tobacco Strike days in my country … But you pick a better example, please. Thanks for the link, I knew it but it never hurts to give it a second pass, great stuff.

          My point was not specifically about smoking, but about “observational science” being possible and necessary. Sometimes I have the impression that there is an “evidence based anything” movement whereby science = test on treatment/control experiment = statistics. Thereby statistics is the scientific method and statisticians its custodians (google that and you’ll find multiple references). It seems to me a point of view that accepts that the scientific method is more heterogeneous is important at a time when science is in a crisis.

          One of the dangers which is concrete in the biomedical sciences is that we are going to toss all observational studies in favor of experimental ones. Take this quote from “Forbes” (OK not a scientific source of the highest quality): “How do we get so many erroneous conclusions from observational studies? In most of them, from dozens to hundreds or even thousands of questions are asked. Statistical significance will happen by chance about 5% of the time, yielding false-positive results.” But I learn from many sources including this blog that the issues of p-value interpretation and multiple testing are orthogonal to the issue of experimental vs observational evidence.

          You say that the “observational evidence” establishing the harms from tobacco smoking was striking (agreed). Can we make that more general and rigorous? Why is the evidence that we should provide broadband to the poor (correlation of income and internet bandwidth) not striking? Is it a matter of feel and opinion, lack of alternative explanations, something else? When is the observational evidence striking enough that it is redundant to set up a (expensive or dangerous or impossible) controlled experiment? How do we build consensus from observational evidence alone? I think it’s an important question for the biomedical sciences in particular.

          I thought it was implicit in Andrew’s answer that an observational data set, no matter how rich and large, can only help with hypothesis generation, not testing, and I think he’s wrong on that one. But from his answer maybe it’s my own misunderstanding. My bad, but maybe I am not the only one who could use a clarification.

        • AP: I don’t understand everything you write.

          > enforcing cessation for, say, 5 years, is a though design to implement in practice.

          Is “though” supposed to be “thought” or “thorough”? In any case, I agree it would be very useful if one could get enough volunteers and solve any practical difficulties.

          > I can tell you some funny stories from the Tobacco Strike days in my country

          Is this what http://www.nytimes.com/1992/11/26/world/a-tobacco-strike-is-driving-italians-to-desperate-ends.html?pagewanted=all is about?

          > But you pick a better example, please.

          Not sure what you mean. But many people say just what you said: you can’t force people to smoke for the sake of an experiment. That’s one reason I felt it worth responding. More generally, in various situations, people often feel there is no suitable experiment that can be done. But many times this is simply for lack of trying to come up with one, or for lack of cleverness. After all, many experiments are very clever. Good science is not easy.

          > You say that the “observational evidence” establishing the harms from tobacco smoking was striking (agreed). Can we make that more general and rigorous? Why is the evidence that we should provide broadband to the poor (correlation of income and internet bandwidth) not striking? Is it a matter of feel and opinion, lack of alternative explanations, something else?

          The first link I gave was exactly about the issue of deducing causation from obs. evidence. This is not something that can be mechanized. No precise rules will suffice. In almost all situations, one data set alone will not be convincing.

          > When is the observational evidence striking enough that it is redundant to set up a (expensive or dangerous or impossible) controlled experiment?

          If one is trying to decide whether an intervention at the level of individuals is warranted, then it would be rare that an experiment would not also be warranted. By definition, such an experiment is possible.

          > Is “though” supposed to be “thought” or “thorough”? In any case, I agree it would be very useful if one could get enough volunteers and solve any practical difficulties.

          – “tough”, sorry for the typo

          > Is this what http://www.nytimes.com/1992/11/26/world/a-tobacco-strike-is-driving-italians-to-desperate-ends.html?pagewanted=all is about?

          – that and more

          > Not sure what you mean. But many people say just what you said: you can’t force people to smoke for the sake of an experiment. That’s one reason I felt it worth responding. More generally, in various situations, people often feel there is no suitable experiment that can be done. But many times this is simply for lack of trying to come up with one, or for lack of cleverness. After all, many experiments are very clever. Good science is not easy.

          – so you are saying for every hypothesis there is a possible experiment that is practical and ethical. I meant an example, in any science, where controlled experiments are not possible. Take cosmology or climatology that study an object that exists in a single copy. Where is my control planet or control universe? I think there are entire branches of science that make do without experiments.

          > The first link I gave was exactly about the issue of deducing causation from obs. evidence. This is not something that can be mechanized. No precise rules will suffice. In almost all situations, one data set alone will not be convincing.

          – I agree with you, but somewhat unsatisfied with the answer — is that all we have?

          > If one is trying to decide whether an intervention at the level of individuals is warranted, then it would be rare that an experiment would not also be warranted. By definition, such an experiment is possible.

          – I disagree with you, when you consider e.g. long term lifestyle changes as the treatment and hard endpoints such as lifespan and lifetime hospitalizations

        • >”Sometimes I have the impression that there is an “evidence based anything” movement whereby science = test on treatment/control experiment = statistics. Thereby statistics is the scientific method and statisticians its custodians (google that and you’ll find multiple references).”

          Yes, this is exactly what has happened. Statistical Methods != Scientific Methods. For an example, check out the claimed detection of a gravitational wave by LIGO:
          http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.116.061102

          If you read carefully you will find their statistical argument is fatally flawed (they erroneously believe the p-value is the probability their signal is a false positive), but this does not really affect the scientific argument (they ruled out *all* the sources for the signal that anyone could think of, their theory is able to explain the signal very precisely). In physics, it seems statistics can be essentially irrelevant (despite all the space they inappropriately devoted to that argument), whereas in these lesser “evidence-based” fields it is the entire argument.

  7. Andrew, thanks so much for your comments, and thanks to everyone who has contributed to this discussion; the leadership of the Kavli HUMAN Project has been following it quite closely.

    We thought we should chime in on one important point raised early in the thread:
    Regarding our open data policy, a major goal of the KHP is to create a resource that can be used by the scholarly community to address a nearly limitless range of questions. We strongly believe that we increase the value of the data by making it accessible. Data governance policies will, of course, be necessary to protect the security of the data and the privacy of the participants, but we do not expect that they will be an impediment to researchers. More details about our data access and governance plans can be found at http://kavlihumanproject.org/wp-content/uploads/06_KHP-CD1-Ch5-Privacy-and-Security.pdf.

    That having been said, if any of you have suggestions that might improve the project, we would REALLY welcome them. We really do appreciate and consider all feedback, as we want to ensure that the resource we are building will ultimately be useful to researchers (like you). As you may know, we have an annual request for information, a process by which we can get new ideas from outside the project, and as part of that initiative we invite anyone interested to submit a response to our recent request for information (http://kavlihumanproject.org/khp-rfi-01-120115/).

    The truth is that we closed this year’s RFI about a week ago, but this thread is so interesting that I’ve had our Chief of Public Affairs reopen it, just in case one of you wants to submit something. We’ll keep it open for you for 3 weeks – but if you ever want to ask us a question please don’t hesitate to write one of us directly.

    Hannah Bayer, PhD
    Chief Scientist, Kavli HUMAN Project
    Research Associate Professor of Decision Sciences
    New York University
