How much should theories guide learning through experiments?

This is Jessica. I recently wrote about the role of theory in fields like psych. Here’s a related thought experiment:

A researcher is planning a behavioral experiment. They make various decisions that prescribe the nature of the data they collect: what interventions to test (including the style of the intervention and any informational content), what population to recruit subjects from, and what aspects of subjects’ behavior to study. The researcher uses their hunches to make these choices, which may be informed by explanations of prior evidence that have been proposed in their field. This approach could be called “strongly” theory-driven: they have tentative explanations of what drives behavior that strongly influence how they sample these spaces (note that these theories may or may not be based on prior evidence). 

Now imagine a second world in which the researcher stops and asks themselves, as they make each of these decisions, what is a tractable representation of the larger space from which I am sampling, and how can I instead randomly sample from that? For example, if they are focused on some domain-specific form of judgment and behavior (e.g., political attitudes, economic behavior), they might consider what the space of intervention formats with possible effects on those behaviors is, and draw a random sample from this space rather than designing an experiment around some format they have a hunch about.

Is scientific knowledge gain better in the first world or the second?

Before trying to answer this question, here’s a slightly more concrete example scenario: Imagine a researcher interested in doing empirical research on graphical perception, where the theories take the form of explanations of why people perform some task better with certain visual encodings over others. In the first world, they might approach designing an experiment with implications of a pre-existing theory in mind, conjecturing, for example, that length encodings are better than area encodings because the estimated exponent of Stevens’ power law from prior experiments is closer to 1 for length compared to area. Or they might start with some new hunch they came up with, like density encodings are better than blur encodings, where there isn’t much prior data. Either way they design an experiment to test these expectations, choosing some type of visual judgment (e.g., judging proportion, or choosing which of two stimuli is longer/bigger/blurrier, etc., in a forced choice), as well as the structure and distribution of the data they visualize, how they render the encodings, what subjects they recruit, etc. Where there have been prior experiments, these decisions will probably be heavily influenced by choices made in those. How exactly they make these decisions will also be informed by their theory-related goal: do they want to confirm the theory they have in mind, disconfirm it, or test it against some alternative theory? They do their experiment and depending on the results, they might keep the theory as is, refine it, or produce a completely new theory. The results get shared with the research community.
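As a rough illustration of why the exponent matters (using approximate textbook exponent values of about 1.0 for length and 0.7 for area, not estimates from any particular study): under the power law, a true 2:1 ratio is perceived as roughly 2^exponent : 1, so judgments get compressed whenever the exponent falls below 1.

```python
# Rough sketch of how a Stevens-style exponent below 1 compresses perceived ratios.
# The exponent values are approximate textbook figures, used for illustration only.
true_ratio = 2.0
for encoding, exponent in [("length", 1.0), ("area", 0.7)]:
    perceived_ratio = true_ratio ** exponent   # perceived-magnitude ratio under the power law
    print(f"{encoding:>6}: a true 2:1 ratio is perceived as about {perceived_ratio:.2f}:1")
```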

In the “theory-less” version of the scenario, the researcher puts aside any hunches or prior domain knowledge they have about visual encoding performance. They randomly choose some set of visual encodings to compare, some visual judgment task, and some type of data structure/distribution compatible with those encodings, etc. After obtaining results they similarly use them to derive an explanation, and share their results with the community.
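To make the “theory-less” version a bit more concrete, here is a minimal sketch of what explicitly enumerating and uniformly sampling a design space could look like. The dimensions and levels below are my own illustrative placeholders, not a vetted taxonomy of graphical perception experiments.

```python
# Hypothetical design space for a graphical perception experiment; the dimensions
# and levels are illustrative placeholders, not a real taxonomy.
import itertools
import random

design_space = {
    "encodings": [("length", "area"), ("density", "blur"), ("position", "angle")],
    "task": ["proportion judgment", "forced-choice comparison"],
    "data_distribution": ["uniform", "skewed", "clustered"],
    "n_marks": [10, 50, 200],
}

random.seed(0)
# Draw one design uniformly at random instead of starting from a hunch about one cell.
design = {dim: random.choice(levels) for dim, levels in design_space.items()}
print(design)
print("cells in the full space:", len(list(itertools.product(*design_space.values()))))
```

Of course, the hard part is deciding what belongs in the design space at all, which is where theory tends to sneak back in (more on this below).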

So which produces better scientific knowledge? This question is inspired by a recent preprint by Dubova, Moskvichev, and Zollman, which uses agent-based modeling to ask whether theory-motivated experimentation is good for science. The learning problem they model is researchers using data collected from experiments to derive theories, i.e., lower-dimensional explanations designed to most efficiently and representatively account for the ground truth space (in their framework these are autoencoders with one hidden layer, trained using gradient descent). As theory-informed data collection strategies, they consider confirmation, falsification, crucial experimentation (e.g., sampling new observations based on where theories disagree), and novelty (e.g., sampling a new observation that is very different from the agent’s previously collected observations), and compare these to random sampling. They evaluate the theories produced by each strategy in terms of perceived epistemic success (how well does the theory account for only the data they collected) and “objective performance” (how well they account for representative samples from the full ground truth distribution).
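To give a flavor of the setup, here is a toy re-implementation of the intuition (not the authors’ code; the ground truth, network size, training details, and strategy definitions are all my own simplifications): a “confirmation” agent keeps collecting observations that its current theory already reconstructs well, a “random” agent samples the ground truth uniformly, and we compare error on the agent’s own data against error on fresh samples from the ground truth. The point of the sketch is only to make the two evaluation metrics concrete, not to reproduce their results.

```python
# Toy sketch (my own simplification, not the paper's model): "theories" are linear
# one-hidden-layer autoencoders fit by gradient descent; a "confirmation" agent keeps
# sampling observations its current theory already reconstructs well, while a "random"
# agent samples the ground truth uniformly.
import numpy as np

rng = np.random.default_rng(0)
D, H = 10, 2                                   # observation and hidden ("theory") dimensions
cov = np.diag(rng.uniform(0.5, 3.0, D))        # arbitrary ground-truth covariance
ground_truth = lambda n: rng.multivariate_normal(np.zeros(D), cov, n)

def fit_theory(X, epochs=500, lr=0.01):
    """Gradient descent on a linear one-hidden-layer autoencoder."""
    W1 = rng.normal(0, 0.1, (D, H))
    W2 = rng.normal(0, 0.1, (H, D))
    for _ in range(epochs):
        err = X @ W1 @ W2 - X                  # reconstruction error
        W2 -= lr * (X @ W1).T @ err / len(X)
        W1 -= lr * X.T @ (err @ W2.T) / len(X)
    return W1, W2

def recon_error(X, W1, W2):
    return np.mean((X @ W1 @ W2 - X) ** 2)

def run(strategy, rounds=30, batch=20):
    data = ground_truth(batch)                 # initial pilot observations
    for _ in range(rounds):
        W1, W2 = fit_theory(data)
        candidates = ground_truth(200)         # possible next experiments
        if strategy == "random":
            pick = rng.choice(len(candidates), batch, replace=False)
        else:                                  # "confirmation": easiest-to-explain candidates
            errs = np.mean((candidates @ W1 @ W2 - candidates) ** 2, axis=1)
            pick = np.argsort(errs)[:batch]
        data = np.vstack([data, candidates[pick]])
    W1, W2 = fit_theory(data)
    return recon_error(data, W1, W2), recon_error(ground_truth(5000), W1, W2)

for strategy in ["confirmation", "random"]:
    own, objective = run(strategy)
    print(f"{strategy:>12}: error on own data {own:.3f}, on fresh ground-truth samples {objective:.3f}")
```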

They conclude from their simulations that “theoretically motivated experiment choice is potentially damaging for science, but in a way that will not be apparent to the scientists themselves.” The reason is overfitting:

The agents aiming to confirm, falsify theories, or resolve theoretical disagreements end up with an illusion of epistemic success: they develop promising accounts for the data they collected, while completely misrepresenting the ground truth that they intended to learn about. Agents experimenting in theory-motivated ways acquire less diverse or less representative samples from the ground truth that are also easier to account for. 

Of course, as in any attempt to model scientific knowledge production, there are many specific parameter choices they make in their analyses that should be assessed in terms of how well they capture real-world experimentation strategies, theory building and social learning, before we place too much faith in this claim. For the purposes of this post though, I’m more interested in the intuition behind their conclusions. 

At a high level, the possibility that theory-motivated data collection reduces variation in the environment being studied seems plausible to me. It helps explain why I worry about degrees of freedom in experiment design, especially when one can pilot test different combinations of design parameters and one knows what they want to see in the results. It’s easy to lose sight of how representative your experimental situation is relative to the full space of situations in which some phenomenon or type of behavior occurs when you’re hell-bent on proving some hunch. And when subsequent researchers design experiments informed by the same theory and set-up you used, the knowledge that is developed may become even more specialized to a particular set of assumptions. Related to the graphical perception example above, there are regularly complaints among people who do visualization research about how overfit certain design principles (e.g., choose encodings based on perceptual accuracy) are to a certain class of narrowly-defined psychophysics experiments. On top of this, the new experiments we design on less explored topics (uncertainty visualization, visualization for machine learning, narrative-driven formats, etc.) can be similarly driven by hunches researchers have, and quickly converge on a small-ish set of tasks, data-generating conditions or benchmark datasets, and visualization formats.

So I find the idea of random sampling compelling. But things get interesting when I try to imagine applying it in the real world. For example, on a practical level, to randomly sample a space implies you have some theory, if not formally at least implicitly, about the scope of the ground truth distribution. How much does the value of random sampling depend on how this is conceived of? Does this model need to be defined at the level of the research community? On some level this is the kind of implicit theory we tend to see in papers already, where researchers argue why their particular set of experimental conditions, data inputs, behaviors etc. covers a complete enough scope to enable characterizing some phenomena.   

Or maybe one can pseudo-randomly sample without defining the space that’s being sampled, and this pseudo-random sampling is still an improvement over the theory-driven alternative. Still, it seems hard to conceptually separate the theory-driven experiment design from the more arbitrary version without having some theory of how well people can intentionally randomize. For example, how can I be sure that whatever conditions I decide to test when I “randomly sample” aren’t actually driven by some subconscious presupposition I’m making about what matters? There’s also the question of what it means, in a truly theory-free approach, to choose what to work on in any real sense. For various reasons researchers often end up specializing in some sub-area of their field. Can this be read as an implicit statement about where they think the big or important effects are likely to be?

I also wonder how random sampling might affect human learning in the real world, where how we learn from empirical research is shaped by conventions, incentives, ego, various cognitive limits, etc. I expect it could feel hard to experiment without any of the personal commitments to certain intuitions or theories that currently play a role. Could real humans find it harder to learn without theory? I know I have learned a lot by seeing certain hunches I had fail in light of data; if there was no place for expectations in conducting research, would I feel the same level of engagement with what I do? Would scientific attention or learning, on a personal level at least, be affected if we were supposed to leave our intuitions or personal interests at the door? There’s also the whole question about how randomizing experimenters would fare under the current incentive structures, which tend to reward perceived epistemic success.

I tend to think we can address some theory problems, including the ones I perceive in my field, by being more explicit in stating the conditions that our theories are intended to address and that our claims are based on. For example, there are many times when we could do a better job of formalizing the spaces that are being sampled from to design an experiment. This may push researchers to recognize the narrowness of their scope and sample a bit more representatively. To take an example Gigerenzer used to argue that psychologists should do more stimulus sampling: in studies of overconfidence, rather than asking people questions framed around what might be a small set of unusual examples (e.g., “Which city lies further south: Rome or New York? How confident are you?”, where Rome is further north yet warmer), the researcher would do better to consider the larger set of stimuli from which these extreme examples are sampled, like all possible pairs of large cities in the world. It’s not theory-free, but would be a less drastic change that would presumably have some of the same effect of reducing overfitting. It seems worth exploring what lies between theoretically-motivated gaming of experiments and trying to remove personal judgment and intuitions about the objects of study altogether.
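A tiny sketch of the difference (the city list below is an arbitrary placeholder, not a curated stimulus set): define the population of items first, then draw question pairs from it at random rather than hand-picking the counterintuitive ones.

```python
# Gigerenzer-style stimulus sampling in miniature; the city list is an arbitrary
# placeholder population, not a real stimulus set.
import itertools
import random

cities = ["Rome", "New York", "Cairo", "Tokyo", "Madrid", "Oslo", "Mumbai", "Sydney"]
all_pairs = list(itertools.combinations(cities, 2))    # the full stimulus space

random.seed(1)
sampled_pairs = random.sample(all_pairs, 10)           # a representative draw
hand_picked = [("Rome", "New York")]                   # the classic counterintuitive pair

print(f"{len(all_pairs)} possible pairs; 10 sampled at random:")
for a, b in sampled_pairs:
    print(f"  Which city lies further south: {a} or {b}? How confident are you?")
```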

41 thoughts on “How much should theories guide learning through experiments?”

  1. Jessica:

    I feel like I’m missing something here. In the usual descriptions of science, experimentation is strongly theory-generated. The idea is you start with a theory and you come up with an experiment to falsify it. Or you have two theories and you come up with an experiment to distinguish between them. Yes, there are examples of experiments that are conducted without a clear theory in mind—for example testing a bunch of different compounds in a bunch of different settings and seeing what they do—but it was my impression that the norm in science, and indeed the way that I usually do science, starts with a theory. So I’m surprised to hear people claiming otherwise. Are they just being too-clever-by-half contrarian or is there something more going on in this discussion?

    • I agree! The paper that led to the post isn’t claiming science is done without theory, but rather that if we tried to drop the theory, our understanding of the world would be more accurate because the evidence we would collect would be more diverse and representative of the underlying ground truth. I wrote the post trying to imagine what it would look like (and whether it’s even possible) to drop theory from experimentation. I still don’t have a good sense of how it would ever work in practice, because anything we try to do is in a sense based on some attempt at characterizing what we think matters. So for me the “theory” that theory is bad for experimentation is hard to connect to reality. But at the same time, the idea that we often need to step back and define what space we’re sampling from, and try to sample in a less biased way, aligns with certain thoughts I have about how to improve generalizability in my field. So the idea of random sampling as a kind of extreme we could try to take seriously is interesting to me as a kind of thinking prompt.

      • A few quick comments.

        You can’t rule out a theory by the way it was generated, but you need to choose wisely what theories you attempt to get empirical knowledge about.

        Think of “without theory” as considering all possible representations of the world – that is an uncountably infinite set (between any two representations there will be an intermediate representation) and empirical knowledge comes with an error rate.

        That was one of the reasons why CS Peirce argued that we need a _logic_ of abduction. However he was unable to justify that claim satisfactorily.

        • The part about theory-free meaning “considering all possible representations of the world” reminds me of how some theorists have characterized the task scientists face in trying to explain some phenomena (e.g., aspects of cognition https://escholarship.org/content/qt8cr8x1c4/qt8cr8x1c4.pdf) as intractable, even if you remove all the uncertainty about what effects are real or how well they generalize.

          Your comment also makes more obvious to me that implicit in the idea of random sampling leading to better science is a belief that human intuition / prior knowledge doesn’t benefit the overall need for explanations. But it seems hard to assess without acknowledging how success is very conditional on what humans perceive as useful.

    • @Andrew This story about theory-based vs. non-theory based testing played out in behavioral game theory about 20 years ago. The question was, is Nash equilibrium a good prediction for how people will actually choose. There were two camps, one that believed in Yes, and another one that believed in No. And whenever people from the Yes-camp ran an experiment, they found that Nash predicts really well, and whenever people from the No-camp ran an experiment, they found that Nash predicts really badly (so it wasn’t too-clever-by-half contrarian, but good old confirmation bias). The thing is that there is an immense number of possible games, and hence of possible experiments that you could run. And people in each camp basically selected the type of game in which Nash equilibrium really does perform well (the Yes-camp) or poorly (the No-camp). That’s why the idea of testing the theory on randomly selected games came up (the Al Roth et al. paper I’ve linked above).

      • Sandro:

        That’s interesting. Behavioral game theory seems like a pretty narrow area of research, so I don’t see how this finding is relevant to the research that I do in political science or public health, but it’s always good to hear about how they do things in other fields too.

    • @Andrew you are right that most people in the social and natural sciences view theory-guided experimentation (especially falsification- or crucial-experiment-driven, which you describe using in your work) as a gold standard. Why exactly is this the case? Is there a particular set of compelling experiments, proofs, or simulations that led you to believe that these strategies are beneficial for scientific progress?

      We noticed that these theory-based strategies originate in philosophical proposals and have only been verbally defended. Our goal in this work was to put these different strategies to a test using computational modeling. Before conducting these experiments all of us (authors) believed that the falsification and crucial experimentation strategies would result in superior theories developed by the agents. However, this is not what we found: falsification, crucial experimentation, and other strategies led to worse theories developed by the agents as compared to a random baseline. We conducted many additional simulations and analyses to figure out what “went wrong” in our simulations and why the result is the way it is. Despite this additional effort, we could not find a single context in which the agents experimenting in a theory-guided way develop better theories than the agents experimenting without theory-based considerations. Some other results of our simulations (e.g. concerning the “apparent” success of the theory-based experimentation strategies vs. their actual success) potentially explain why it seems to us that theory-guided experimentation efficiently guides our learning in science, while in fact it might not.

      Despite many limitations of this work (e.g. we had to choose a specific modeling framework, particular implementations of the theory-based experimentation strategies, etc.), we believe that it should raise some skepticism about the commonly assumed virtue of theory-motivated experimentation.

      @Jessica it is hard for us to imagine theory-neutral experimentation in real life too. There are people who are working on potential implementations of “less theory-driven” experimentation in behavioral & social sciences (Beth Baribault, Abdullah Almaatouq, Thomas Griffiths, and others), but the perfect “theory-neutral” experimentation is probably unachievable. This is definitely a puzzle to think about further, but I believe that it is not a unique puzzle for the random or theory-neutral strategies. Do you think we can ever implement perfect falsification experiments, for example? If yes, can you imagine ideal falsification-driven experimentation at scale (e.g. each scientist is driven by falsification AND a group as a whole also behaves falsification-driven, and not like a Frankenstein where the same experiment is a perfect falsification for one scientist but not at all for the other, etc)?

      • Thanks for commenting Marina, your work is definitely thought-provoking. I don’t think perfect theory-driven experimentation is possible – a question that came to me in writing the post was how much more random than expected new observations might be, given that researchers trying to use strategies like confirmation or falsification are only approximating those strategies.

        I also wonder to what extent humans attempting random sampling might resemble humans using novelty-driven approaches. I’m thinking about evidence of how well people do at tasks like constructing random sequences, where they tend to underestimate the probability of repetition.

  2. As a strong proponent of theory based inquiry I’d like to raise a few objections.

    1) The benefit of finding a robust and interesting theoretical explanation is larger than that of, say, confirming that there is no interesting pattern. The goal isn’t pure accuracy about the world but useful accuracy even at the expense of some useless accuracy.

    2) We often claim to believe that the world is often described by stunningly simple patterns/theories. If the experiment didn’t replicate that aspect of the world it may be misleading.

    3) A great deal depends on how you define theory-guided. If you genuinely have strong rewards for showing violations of the received theory it will be an effective way to search through potentially useful hypotheses.

    The danger mentioned seems real, but it just seems to be another way of saying that we often twist our theories into near unfalsifiability.

    • I tend to agree with you. Some of the same reasons behind striving for more random sampling motivate the need to replace bad/unfalsifiable theory with more rigorously defined theory, so as I was writing the post, it occurred to me that arguing for better theory versus more random sampling could be seen as two sides of the same coin. But at the same time random sampling is meant to be incompatible with being theory-driven, so there’s something missing in relating the two. I wonder what it would look like to formally define “useful accuracy” as what’s not captured when experiments are treated as something we can generate using random sampling.

      One aspect I didn’t comment on in the post, but comes to mind when I think about “useful accuracy,” is the need for theories to align with how people think/be interpretable. For instance there might be cases where a less faithful explanation is much more user-friendly, and therefore influential, than a more faithful but less understandable one. This seems hard to capture in formal frameworks. For the paper discussed in the post, it might mean there’s a constraint on the set of autoencoders that can be learned or that can be successfully transferred as knowledge within a community.

    • Peter G. wrote:
      “1) The benefit of finding a robust and interesting theoretical explanation is larger than that of, say, confirming that there is no interesting pattern. The goal isn’t pure accuracy about the world but useful accuracy even at the expense of some useless accuracy.”

      Yes.

      At least for the kinds of theories I’d be interested in, you wouldn’t be able to recognize that there even were patterns without a theory. It seems to me that the human brain is a complex enough system that any view of it that lacks a theory has no chance of saying something meaningful. So I don’t get the idea of looking at data without a theory.

      This is analogous to a favorite joke. Hedy Lamarr proved that the SETI types can’t see what they are looking for. This is because in spread-spectrum communication systems (which she and the composer George Antheil invented (sort of: see Wiki)), if you don’t have an exact model (theory!) of what the transmitter is doing (e.g. the pattern with which the transmitter is frequency hopping), you can’t even recognize that there’s a signal there to try to figure out (that is, you can’t tell the difference between a signal and noise; everything looks like noise until you look at the frequency spectrum with the right pattern). Even if there are aliens out there talking to each other, there’s no way to see those signals without the details of the transmission technology used.

      Psychology should be a bit easier than SETI. At least we speak the same language and have native informants to talk to. Why do people have a “tip of the tongue” problem? A data-only approach would have you saying things like “People are more likely to forget words between 3:00 and 5:00 pm.” This may or may not be true, but it’s boring in the extreme.

    • Generalizing is the entire task of science. Without generalization science is pointless. Imagine if we had to come up with a new equation for every triangle!

      However, if you’re getting at the idea that *claims* about the generalizability of a given concept frequently turn out wrong, that’s a fair point. It’s not enough to claim generalizability. Every claim about generalizability needs to be demonstrated in every aspect.

  3. In studying decision making psychologists routinely ignore the fact that people are often presented with information that is purported to be irrefutable and turns out later to be wildly wrong. So if you’re going to investigate “decision making” and people’s “overconfidence” you better take that into account somehow!! Experts are often wrong and other people lie. You’re presenting charts and graphs to people that you claim are irrefutably true, why are **you** so confident? Why should people trust you over their intuition when experts are often wrong?

    It wasn’t that long ago that several eminent scientists – even physicists – were **big** on-board with Peak Oil, which turned out to be a big laugher. People had all kinds of sophisticated charts and data and theory and bla bla bla bla which was all wrong, because it was based on incorrect fundamental assumptions.

    So how did the “peak-oil-is-wrong” people know it was wrong? Were they able to give a perfectly rational accounting of why they thought it was wrong that later turned out to be true? Probably not! They sensed it was wrong from other cues, like how the “theory” lined up so perfectly with pre-existing political desires of peak oil advocates, or just from the general history of exploration – what?

    So you see you’re making a pretty big assumption when you present data to people and claim they *should* accept your claims about it. It’s a bad assumption because “you” (the generalized expert presenter) are frequently wrong, and people know that.

    • I guess in my mind it would be pretty useful to understand why people reject supposedly irrefutable ideas or theories – not so their objections can be contradicted in the press (a la global warming persuasion psychology), but so you can understand how they recognized theories as false that later turned out to be false.

  4. Removing theory-laden decision making from experimental design is neither feasible nor desirable IMHO. First, what you even choose to look at is inextricably linked to one’s corpus of belief (never mind the experimental design itself). And in a practical sense, there needs to be some justification for the design, lest one be unable to convince funding agencies to green light the project.

    Second, proper science is based on making surprising predictions (surprising in the sense that the prediction would be an unlikely result sans the theory) and seeing if they pan out. In Popperian lingo we need to pit theories against sufficiently risky tests. The only way to evaluate if a test is risky or not is to have some concept of the possible outcomes for the domain you are in, to compare the outcome to. Paul Meehl talks about this backdrop as the Spielraum. Sampling designs at random and creating post-hoc explanations does away with this entirely, and I’m fairly certain the philosophers of old would tell us that we can do that – free country and all – but we ain’t doing science any more.

    • Interesting, hadn’t heard of the Spielraum, but the idea of a “backdrop” as being impossible to define without theory captures why I find it difficult to conceive of any experiments being truly theory-free. I haven’t studied philosophy of science so it’s useful to hear where this comes up.

    • Yup, generally agree. However in reality some degree of theory-independent experimentation/observation is useful. A concrete example might be Charles Darwin where much of his initial “experimentation/data collection” was not theory-led even if he had a corpus of beliefs (his were largely “creationist” at the outset). He had no need to “convince funding agencies” since he was funded by his Dad and he had a free voyage!

      He made loads of observations, collected loads of samples which he eventually organized, discussed with experts and came up with a “post-hoc explanation”. If you replace the rather pejorative “post-hoc explanation” with “theory” then that seems perfectly fine to me. Once you have the theory then the more Popperian approaches of theory testing can take over.

      This approach is quite widespread in my experience and I’ve done some myself – e.g. you do loads of unfunded experimentation since you feel there may be something interesting to find or you have what you think is a nice experimental technique; you’re not sure what you might find but expect that some ideas might begin to crystallize. These approaches are productive so long as you don’t consider any post-hoc explanations as the end of the story, but as part of the beginning of the story (i.e. a testable hypothesis).

      • Also see Brahe and Kepler. For ~1,500 years no one collected data precise and accurate enough to do it, but once that data was available it took one generation to overturn the age old assumption that orbits must consist of some combination of circles.

        Of course, there were a number of related technologies (eg, telescope and printing press) and political aspects (Reformation then European Renaissance) that may have also been required.

  5. In the wider context experimentation/data collection in the absence of theory is a fundamental element of science. Those years wandering the world collating information on the natural world by Darwin and Alfred Russel Wallace were done, at least initially, in a theory-free context. The extent and quality of their observations led to a theory. Likewise with the early determination of simple molecular structures by Linus Pauling – this was done in the expectation that this knowledge would enable understanding down the line and led Pauling to a number of theories about the nature of the chemical bond, basic structural elements of proteins etc. Much of science is like this.

    E.g. the entire Human Genome Project has been undertaken in a theory-free context – of course one might say that this isn’t science but the application of technology. However, the accruing results of the genome project have provided a vast body of data that enables testing existing theories and developing new ones about evolution, the molecular basis of genetic diseases, human genetic and phenotypic variability and so on. So it’s very much part of science.

    These “stamp-collecting” approaches have value because reliable and carefully-obtained and curated data about the natural world will always have something to tell us.

    The idea that over-reliance on a theoretical context for experimentation might be problematic if it tends to direct scientific thought down a prescribed channel that samples only a part of the wider reality is a good one and may well have some truth, including the likelihood described in the top article “that (this flaw) will not be apparent to the scientists themselves”! It seems very likely, for example, that some of the realities underlying evolutionary processes will require ideas and theories that we have no clear way of accessing within our current theoretical framework.

    • I get your point: basic facts need to be gathered. That’s why we have geological surveys and biological surveys. But I don’t think this statement:

      “Those years wandering the world collating information on the natural world by Darwin and Alfred Russel Wallace were done, at least initially, in a theory-free context.”

      is accurate. These two made important new observations, but they were made in the context of existing theories and hypotheses. I believe Erasmus Darwin, Chuck’s grandfather, had already proposed some mechanism of inheritance, while Linnaeus’ classification scheme was the fundamental framework that allowed organisms and their relationships to one another to be accurately recognized.

      Modern biological and geological surveys heavily utilize “theory” to prioritize what info to collect and where to collect it. Biologists would use aerial photos to identify different habitat zones for species surveys and geologists would use aerial photography and topography to extend mapping from adjacent areas into a new area and choose the best places to “ground truth” the air photo interps.

      • I understood the topic to be about “theory-motivated experimentation”/“theory-motivated data collection” (to quote some of the descriptions in the top article). Of course we can’t escape the background of existing thought/theory in whatever field we might work in. But given that, it is quite possible to do experimentation/data collection that isn’t “motivated” by theory. The Human Genome Project was/is a massive data collection exercise whose motivation was the expectation that the data would be useful, and so it has turned out. But it wasn’t done to address any particular theory/ies.

        Darwin’s early years are actually a good example of “theory-less” experimentation/data collection! Yes, there existed ideas about the relationships of species in Darwin’s time, but I don’t think Darwin’s initial explorations were motivated by theory. They were done out of an interest in exploring the natural world, in collecting and sending samples back to England. At some point his ideas about speciation and common origins developed out of what was a somewhat random (wherever the Beagle went!) data collection, but this was only after returning to England where he could organize and study his samples and most importantly discuss these with experts.

        One can contrast this with his later “theory-motivated experimentation” e.g. on breeding and crossing pigeons which was done to explore specific theory about the origins of variation.

        In relation to the specific subject of this thread – one might say that making careful observations/experimentation of the natural world without addressing any particular theory is a fundamental way in which new theories may develop. Constraining oneself to work within an existing theoretical framework may result in a canalization of thought that limits wider realizations and as pointed out in the top article the practitioners won’t be aware of the limitations they’ve set themselves. Thus the examples of advances made by youngsters or individuals entering a new field, where their investigations are less constrained by existing theoretical frameworks.

        • I think you’re seriously mistaken.

          On the contrary, Darwin’s work is a model of applying theory derived from an existing and spotty subset of data on life forms to the much larger set of data on life forms that had previously been unexplored.

          He took proposed theoretical generalizations – the Linnean system and Huttonian ideas of geology – and tested their fit to the unexplored world. In the case of the Linnean system, his primary focus, he found it fit very well. His survey filled in many empty spaces in the Linnean chart. It fit so well that he was able to propose a mechanism for inheritance and environmental fitness from the data.

          Had Darwin been just randomly jotting, without centuries of preceding thought and development on the organization of life forms and inheritance, he would have done no better than Aristotle and probably much worse.

          “one might say that making careful observations/experimentation of the natural world without addressing any particular theory is a fundamental way in which new theories may develop.”

          I most vehemently disagree. Rather, I’d say making observations independent of theory or pre-existing ideas is impossible, since just choosing what to observe is based on existing knowledge.

        • Are you missing the point? I’m sticking to the point of the top article and seeing if I can find some correspondence in biology with the two approaches described (classical/Popperian “theory-motivated experimentation”/”theory-motivated data collection” where you design an experiment to directly test a theory/hypothesis or its implications) and the alternative (a little more vague IMO) where rather than focussing on a pre-planned, designed sampling to directly address a theory you broaden the enquiry to sample a little more randomly within the study arena and then inspect the outcomes. The idea being that doing the latter may protect against oversampling and neglecting to see outside a potentially constraining theoretical framework.

          IMO (you are more than welcome to disagree!) Darwin’s experimentation/data collection was less of the former and more of the latter. Darwin didn’t sit down and plan his study (we’ll go here and there and collect these samples and this should tell us this or that in relation to some theory). In reality his sampling (at least of animal and plants) was somewhat random and dictated by the route of the Beagle and he remained a creationist (in the old sense of the word – i.e. considering that species were permanent) throughout the voyage. His sampling was not designed to address a theory and its implication even if he expected that his sampling would shed some light within the existing theoretical framework. It was only after he returned to England that he could organise his samples, consult with experts and gradually come to some interpretations about the relationships of species, their mutability and so on.

          So the point relates to the initiation/design of a study rather than its overall progression and conclusions.

          Here’s another example. Galileo realized that a good enough telescope would allow him to address a question about the moon (is it a perfect sphere a la Aristotle?). He did some “theory-motivated” observation and discovered that the Aristotelian theory was incorrect. Subsequently he made some more random (not theory-motivated) observations out of an interest in studying the heavens, happened to focus his telescope on Jupiter, found some odd faint “stars”, observed these a bit more and found that they must be circling Jupiter and so he stumbled across a rather profound observation (if those “moons” are circling Jupiter then all celestial bodies can’t be circling the Earth).

          Interest-driven “theory-independent” experimentation/observation works in science because the natural world always has something to tell us so long as we make observations carefully and interpret them in good faith. The entire Human Genome Project is not “theory-motivated” (the motivation is that existing technologies allow us to collect data that will have fantastic bearing on understanding down the road).

        • Galileo realized that a good enough telescope would allow him to address a question about the moon (is it a perfect sphere a la Aristotle?). He did some “theory-motivated” observation and discovered that the Aristotelian theory was incorrect.

          This is another case (like the other thread about molecules/proteins randomly diffusing around in the cytosol) where I don’t think many people who actually spent time studying the issue carefully really believed it. I mean anyone can look at the moon and see the irregularities.

          Eg, here is a discussion from Plutarch (~ 100 AD) about “the face on the moon”*: https://archive.org/details/plutarchonfacewh00plut

          With his telescope Galileo could make out more detailed shadows but at the periphery it still appeared spherical (no mountains were visible as the non-spherical theory predicted), which led him to hypothesize an atmosphere like the Earth’s to blur it. So it is hard to believe the new evidence would have convinced someone who really thought the moon must be a perfect sphere for whatever theoretical/religious/political reasons.

          * There is also an interesting story about Carthaginians making a voyage west of Britain every 30 years to a place where night lasted only 1 hr a day for a month, then eventually going further west to a great continent. Then there is the discussion about the nature of life on the moon.

        • This is another case (like the other thread about molecules/proteins randomly diffusing around in the cytosol) where I don’t think many people who actually spent time studying the issue carefully really believed it. I mean anyone can look at the moon and see the irregularities.

          With his telescope Galileo could make out more detailed shadows but at the periphery it still appeared spherical (no mountains were visible as the non-spherical theory predicted), which led him to hypothesize an atmosphere like the Earth’s to blur it. So it is hard to believe the new evidence would have convinced someone who really thought the moon must be a perfect sphere for whatever theoretical/religious/political reasons.

          Yes Plutarch had proposed that and Kepler also liked the idea (he wrote what’s considered to be the first work of science fiction about a journey to the moon!). But the orthodoxy was very much the Aristotelian view (see below). In fact Galileo’s telescope was powerful enough to see that the terminator (the line dividing dark and light sides) rather than being unbroken (required by a perfect sphere) was intercepted by dark patches where there should be light and light where there should be dark. His picture in “Starry Messenger” in 1610 showed this clearly with the patches interpreted as craters (dark patches = shadows/light patches = highlights as in a mountain range illuminated by the rising sun). He included a large crater that isn’t actually there presumably as a close up illustration of his interpretation. Galileo was probably also helped by his familiarity and understanding of perspective.

          Why wasn’t this lack of lunar perfection obvious before as you suggest? Partly due (i) to how one interprets what one sees but mostly due to (ii) the powerful hold of prevailing natural philosophy.

          On (i) Thomas Harriot looked at the moon through a (less powerful) telescope before Galileo, and his sketch (26 July 1609) showed an irregular terminator but with no interpretation. He did another sketch in 1610 (after reading “Starry Messenger”) and lo and behold the terminator he “saw” now had craters, including the fictitious one that Galileo added that isn’t actually there. It was easy before Galileo to explain away the obvious variations in the moon’s surface colouring, for example.

          On (ii) it’s difficult to envisage the hold that Aristotelian philosophy had on understanding and the way that it was considered totally to trump direct experience. E.g. throughout the Middle Ages and in some cases up to Galileo’s time natural philosophers held and taught the views (a) that it was unknown whether the equatorial regions were too hot for human habitation even though direct evidence was available that Portuguese sailors had reached the equator in 1475 and that Sri Lanka (known to be near the equator) was populated, (b) that ice is heavier than water (Aristotle) (despite the obvious shortcomings of that idea!) and (c) that nerves are connected to the heart (Aristotle) even though Galen had shown that they are connected to the brain.

          David Wootton’s “The Invention of Science” is an astonishingly good account of these subjects IMO.

        • But the orthodoxy was very much the Aristotelian view

          […]

          On (ii) it’s difficult to envisage the hold that Aristotelian philosophy had on understanding and the way that it was considered totally to trump direct experience.

          I just think there were likely many people paying lip-service to the “orthodoxy”, because that was what was required for political/financial support.

          It is like NHST today. There is really no one that knows what they are talking about that thinks it makes any sense, yet it is ubiquitous. So you are forced to litter your paper with irrelevant p-values just to publish.

  6. Isn’t this the same as choosing a sampling method for Monte Carlo experiments?

    You can a priori choose between random, Latin hypercube, orthogonal sampling, etc. But you can get much faster convergence by using a Markov chain, at the risk of getting stuck in a local optimum. The practical solution to that is running multiple chains.

    What would be nice is a model of the approximate relative number of steps each method would take. E.g., model it as a hypercube with D dimensions of S bins each. Then an exhaustive search would require n = S^D steps. Then a single Monte Carlo sample is a good approximation of a much smaller hypercube with volume s^D, where the magnitude of s is somehow a function of the “complexity” of the posterior and efficiency of the sampling strategy. Then a good fit would be found when n*s^D = S^D, i.e., after n = (S/s)^D steps.
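    As a quick numerical check of that heuristic (the values of S, s, and D below are made up, just to show the scale):

```python
# Made-up values of S, s, D, just to show the scale of the step-count heuristic.
S, s, D = 10, 2, 5
exhaustive = S ** D        # 100000 grid cells to visit exhaustively
heuristic = (S / s) ** D   # about 3125 samples if each draw "covers" an s^D block
print(exhaustive, heuristic)
```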

    I don’t know how to work out the complexity and sampling efficiency aspects, but in general people try to *reduce* random walk behavior, e.g. that is the point of using Hamiltonian MCMC. So I would expect random sampling to perform much worse than theory-driven design. An exception is when the most popular theory/chain is stuck in a local optimum and alternative theories/chains are suppressed, which could very well be quite common.

  7. That is a very thought-provoking paper and I really like seeing the scientific enterprise simulated – thanks to the authors and Jessica for presenting it!
    However, in my opinion the paper has very strong and also misleading assumptions (or specifically, imprecise translations from the real world into the modeling space) that make it difficult to apply it to real science.

    1) It is far from being clear that you can model a “ground truth” as a fixed multivariate distribution and reduce science to estimate its properties/relations. I think some people here already mentioned that the search space is potentially infinite and you do not know which dimensions will be important (and they might even change or become more numerous the more you get to know). All observation is theory-laden, so the choice of experiments must be as well. A pure “random” strategy is therefore not possible at all, as many discussions about whether two experiments investigate the same thing or not show. I would say that it is in fact done where possible, e.g. when researchers try to use different sets of stimuli, a wide range of participants, etc. – so basically in a quite circumscribed area of investigation, at least when they are not directly interested in refutation, using specific sets of stimuli/groups of participants. But it cannot be a recipe for science as a whole.

    Due to us not knowing the relevant dimensions, it is not a fact that we would “know” or “understand” problems and their solutions just by describing the data we gathered very accurately in statistical terms – sparsity is only one aspect of good theories. If it were that easy, we would have totally different discussions about what science is, and we would only need to do data crunching to come up with new stuff (robots might have already replaced us as scientists if that were the case). So theories are not (only) sparse representations of all data we have access to, but the results of creative processes about how the world behaves. Again, coming close very fast with statistical models might not be a relevant metric.
    2) Falsification has the goal of finding and conducting experiments which rule out certain explanations. It is not necessarily the same as sampling observations close to the ones we so far explain badly with our statistical model; it is not deciding about which direction to go in the search space based on the least fitting observation, but thinking about what would refute our ideas of regularities we have so far. It also does not have to be gradual, but can be very discontinuous. So falsification might in fact underperform in this model.

    I think, in the end, there are always reasons for what is being studied, and these are grounded in theory. The authors of this paper seem to do it this way as well even though they try to get across that it is not the best strategy: Based on theory, they pick five strategies for how experiments are chosen, and then do an experiment/simulation to see which ones work best, or even falsify that falsification is the best one. According to their results, wouldn’t it be better to test 100 or even more strategies, for example by extending the “random” strategy to other probability functions (not only uniform, also exponential, normal, log-normal, etc.), and not base this choice on past studies or the history of philosophy of science? Did they deliberately forgo this increase in accuracy in order to gain something else instead, or is it indeed valuable to proceed this way, even in terms of accuracy?

    • Thanks, these are all excellent points.

      >According to their results, wouldn’t it be better to test 100 or even more strategies
      It’s hard not to wonder about the origin story of the reported experiment when reading a paper like this. Marina gives some info above that suggests it was theory-driven. This is one reason I admire the authors for writing it – they put themselves in a difficult position to deliver the findings.

      • > This is one reason I admire the authors for writing it – they put themselves in a difficult position to deliver the findings.

        Yes, me too! And I would have loved it if they had openly discussed that in the paper. I think this would make for pretty interesting follow-up talk/research/blogging, even though it might feel uncomfortable to write it that way.
        I also really like that they simulate theorizing. Statistics profits enormously from the ability to just create worlds where everything is known in order to check your models and methods, and I was recently wondering whether it is possible to translate that to theorizing. I am still not convinced that it is possible, due to what I think are serious shortcomings in this approach mentioned in my initial comment, though I really like that they found something like a closed form to do it and were able to derive very interesting consequences.

        • @Finn @Jessica I really like the idea of having a subsection on “What this study would look like if the authors were guided by the random strategy” in the discussion. I might add it if the space permits (will make sure to acknowledge both of you in this case)!

    • @Finn thanks so much for the thoughtful comments!

      1. I agree that our model is a very particular formalization, and there are many other ways to formalize science. My hope is that this work stimulates others to develop alternative models to formally study experimentation strategies in science. I would be excited to see other models that potentially lead to different/more nuanced results.
      I agree that our formalization of the “ground truth” is oversimplified, and the space of possible measurements is open-ended and potentially infinite. However, in a moment of designing a new experiment scientists often do seem to work with a limited set of dimensions that they can choose from (the possible dimensions can be limited by a theory — our results do not speak against/for this possibility). Moreover, I cannot think of a way to model learning about the “infinite ground truth” that would be computationally tractable — do you have ideas on how we could have addressed this limitation in the modeling approach?

      2. You are right that our formalization of falsification is somewhat different (i.e. more gradual) than the exact principle that Popper proposed. One reason for this is that we did not find it realistic to represent scientific theories as logical statements, and logic-based representations are probably the only ones that allow clear-cut refutations. However, my hunch is that our formalization of falsification is more favorable to its success than the alternative you are suggesting: continuous signal is typically more informative for learning than “all or none” signal.

      3. I appreciate your meta-skepticism of our study — yeah, as I described above (and as Jessica kindly pointed out), when planning these experiments, we did not know that the results were going to favor the random experimentation strategy, so we did not design the experiments with this knowledge in mind. I’m excited to think about ways to apply these insights in my future work though!

      • Hi Marina,

        Thanks for your deep engagement with my thoughts!

        I struggled a bit to find a good angle for my answer because I think there are a couple of layers here and I am not sure what a good order is to approach them – this is also why the reply is a bit lengthy.

        One issue is that I am not quite sure what scope your model is supposed to have. When you write: > “(the possible dimensions can be limited by a theory — our results do not speak against/for this possibility)”, this implies to me that you limit your model to cases where it is absolutely clear which variables are important in order to answer an already well defined research question and use the data to model statistical relations. Is that correct? Because if it is, this feels very narrow to me, and does not have very far-reaching consequences as suggested, since it is more a statistical problem instead of an epistemological one. You basically say that random sampling is the best approach for statistical estimation. Now this is not that surprising, because any approach of statistical estimation should be prone to overfitting as long as you make your sampling dependent on the values of variables you already sampled, right?

        However, this basically becomes your definition of working with a theory, so I think there are some problems here: First, your definition of theoretical work (especially the falsification strategy) is quite narrow – falsification could also mean to sample values that are more extreme compared to the ones you sampled before, but in the model it is dependent on the values one already has. I think allowing more extreme sampling would move the falsification strategy closer to or even above the novelty strategy. This was also the problem I had with the definition of falsification: It is not that you defined it as gradual (although I am not sure what this would look like when the problem is less statistical), but that it does not max out the potential falsification has in reality, and that this strategy thus underperforms.

        The next point is (and that follows from the first) that your definition of theory as results of data reduction implies that theories are only about data reduction and that the prediction of data points is the important metric to judge them. However, a theory is more than compression of data which can predict new data points – and I am not trying to explain that to you because I am sure you thought about that and made a deliberate choice in order to reduce the complexity of the problem – and that is totally fine. I just think your definition might be too specific a formalization that can have consequences regarding the success of the strategies. This is because each dimension and thus every part of a theory is weighted equally in your model, whereas theories are hierarchical in a sense that they have core parts and more auxiliary parts, where refutations do not have equally severe consequences. While it might be true that theories are not the same as logical statements, I think they contain logical statements in the form of hypotheses, which can falsify a theory when proven wrong – eventually, and when the issue is at the core of a theory, whereas falsification of auxiliaries might not be fatal. In your model it is hard to see how theories can be falsified that way.

        So if that part would be implemented in your model, the falsification strategy involving theory could be better than it looks now, because the core assumptions could be targeted more accurately when identified by theory in comparison to a random approach in experimentation choice. Now you said in the statement I quoted earlier that your model is agnostic to that, and a prerequisite of that is of course that you allow theories to be something other than results of data reduction and also allow variable selection to be a part of theories – I think this is important for a fair comparison, and my hunch is that this needs to involve causal connections between variables.

        This is because when theorizing we usually want to know and speculate about WHY something is (un)related to something else and I think this is a crucial part of theories. If I understand your manuscript correctly, this is absent in the model as the ground truth was designed without causal connections between variables. Is that right? I assumed this from this sentence: “The dimensions do not have to reflect the distribution of properties, however: they can also be interpreted as scientific manipulations and their outcomes”. This makes me wonder if the definition of experiment is specific enough. I would understand experiment as a study where a variable is perturbed to isolate its effects on another downstream variable to estimate its specific effects. On the other hand, a mere observation of the same variables might lead to different results due to other variables influencing the relation. However, based on your sentence, this is equivalent in your model because of the correlational structure that is ignorant of causality. However, this would mean that an important point of theories – namely thinking about causal structures, and designing experiments or observational studies with optimal adjustment sets – is not part of your model. I think this matters because in my eyes causality is the way theories reduce data dimensions. And giving the ground truth model more complexity by introducing direction between variables should make theories stronger compared to random approaches.

        Maybe in the end it might not matter that much for the model that the potential space is infinite in reality but bounded in models? I don’t know. I am also not sure how to model new findings, or complete falsification of concepts (like the ether hypothesis). I think however, this needs to include causality, variable selection by theories, and maybe also scientific concepts which predict certain variable structures (“if there exists this state A, then we should observe effects X, Y, Z”)?
        Together, this might enable a theory to be only about a small (more or less encapsulated) part of the ground truth which predicts states or relations between variables based on causal mechanisms, which need to be inferred by experiments or correlation patterns of the variables known to the experimenter (or which the experimenter deems relevant to the question/theory of interest). So each theory could be something like the prediction of a statistical relation, but also the prediction of a causal association. In the following iterations, scientists would then try to extend/confirm/challenge the theory by including more variables into their model in different roles (mediator, collider, confounder) and decide about their associations with the other variables. The different strategies are then evaluated by e.g. how fast the estimated parameters approach the ones in the ground truth, or how fast wrong models are weeded out, something like that.
        I think this makes the random approach much more difficult because the search space for new variables and potential roles of variables become very big very fast, and most are not helpful, especially when some are not likely to contribute at all when not excluded by theory (variables totally unrelated to your question, like from a different field). This also would do more justice to falsification because the correct choice of variables and their role can easily falsify a whole theory, e.g. by including a confounder which makes a relation go away.
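        As a toy version of that last point (all numbers below are arbitrary): if X and Y share a confounder C and have no direct link, the raw X-Y association looks like a relation but vanishes once C is adjusted for, and choosing to include C is exactly the kind of theory-guided variable selection a blind random strategy would rarely make.

```python
# Arbitrary toy example of a confounder making an apparent relation go away:
# X and Y are both driven by C and have no direct link between them.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
C = rng.normal(size=n)                    # confounder
X = 0.8 * C + rng.normal(size=n)          # no direct X -> Y effect
Y = 0.8 * C + rng.normal(size=n)

print("raw corr(X, Y):", round(np.corrcoef(X, Y)[0, 1], 2))              # spurious, about 0.4
X_resid = X - np.polyfit(C, X, 1)[0] * C                                  # adjust for C
Y_resid = Y - np.polyfit(C, Y, 1)[0] * C
print("corr(X, Y | C):", round(np.corrcoef(X_resid, Y_resid)[0, 1], 2))   # about 0
```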

        I am not sure about the implementation details and if that really would work out, but my feeling is that it includes some details which are important to model.

        That was quite the essay, I hope there is something in there you find useful!

  8. This got me thinking about Broadbent’s ideas regarding experiment design. He argued for a binary reduction method where one attempts to classify the parameters into ever narrower paths and potentially exclude large tracts of previously plausible theories rather than go after supporting specific ones. I always felt the critical thing was to make such selections based entirely on observable properties rather than theoretical ones. One might start with a phenomenon one wants to study and then attempt to map out and classify the possible method space. Then you can construct a theory.

  9. Long time reader, first time commenter:

    > to randomly sample a space implies you have some theory, if not formally at least implicitly, about the scope of the ground truth distribution. How much does the value of random sampling depend on how this is conceived of?

    This is a big concern of (statistical) learning theory. The answer is, almost totally. There is no structure/bias/scope-free inference, and there is No Free Lunch – any learning that succeeds on one class of hypotheses necessarily fails on some other class. If you want to recapitulate a distribution, that distribution is based on the structure of its parameters (even if you memorize, which just means a bias against generalizing). This is true even when your samples are “random presentations” from any distribution with support on the instance set. Every distribution belongs to a class, and sampling “randomly” often means just iterating through an implicit class.

    > how can I be sure that whatever conditions I decide to test when I “randomly sample” aren’t actually driven by some subconscious presupposition I’m making about what matters?
    You can’t :) In other words, we shouldn’t confuse ignorance of bias with absence of bias.

    > Could real humans find it harder to learn without theory?
    children learn their languages in ways that clearly defy adult distributions ( https://sites.socsci.uci.edu/~lpearl/courses/readings/Pearl2019Ms_PovStimWithoutTears.pdf )
