External vs. internal validity of causal inference from natural experiments: The example of charter school lottery studies

Alex Hoffman writes:

I recently was discussing/arguing about the value of charter school lottery studies. I suggested that their validity was questionable because of all the data that they ignore. (1) They ignore all charter schools (and their students) that are not so oversubscribed that they need to use lotteries for admission. (2) They ignore all the students at the public schools who did not apply to a lottery.

The response I received was that they may lack external validity, but that’s just because the researcher focused so much on internal validity.

What do you think that we should do with this kind of defense of supposedly policy-relevant research? Is there something I am missing? Is an admission of a lack of external validity ameliorated by stronger internal validity?

This strikes me as the same issue as how the assessment industry has so focused on maximizing reliability (alpha) that they are not willing to give a little on alpha in exchange for greater validity. They don’t sample from the entire construct/curriculum/standards because they can’t get enough items on a test if they do, and thus the construct is predictably underrepresented.

I believe that the goal should be external validity. Sure, you need internal validity as an intermediate step, but internal validity should not be the goal, in and of itself.

Am I missing something?

My reply: One way to think about it is that you can get estimated causal effects for everyone, but your estimates are most believable for the core group (in this case, the intersection of (a) oversubscribed charter schools and (b) students who applied for a lottery) and rely on increasingly strong assumptions as you move away from the core.

At this point, you have two choices: You can report your estimate for the core, which is narrow but less assumption-bound (less external validity, more internal validity), or you can construct estimates for everyone, and then you’ll pay the price in internal validity. I think it’s fine to do this latter strategy; you should just make your assumptions clear.

The thing you don’t want to do is take the estimate for the core and report it as an estimate for the general population without modeling how the treatment effect might vary. In the example above, you’ll want to think about how the effect of charter school compared to public school could be different, (a) for charter schools that are not oversubscribed, and (b) for students who did not apply for a lottery.
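To make the trade-off concrete, here is a minimal simulated sketch. Everything in it is an assumption made up for illustration and not taken from any real charter school study: a single hypothetical student covariate `x`, an application rule that favors high-`x` students, and a true effect `0.2 + 0.3 * x` that varies with `x`. It compares the estimate for the core (lottery winners vs. losers) with an extrapolation to all students that models how the effect varies with `x`.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for regressing ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def simulate(n=20000, seed=1):
    """Simulate students; all numbers here are hypothetical assumptions."""
    rng = random.Random(seed)
    students = []
    for _ in range(n):
        x = rng.gauss(0, 1)                    # student covariate
        applied = x + rng.gauss(0, 1) > 1      # applicants skew toward high x
        won = applied and rng.random() < 0.5   # fair lottery among applicants
        effect = 0.2 + 0.3 * x                 # assumed effect, varies with x
        y = x + (effect if won else 0.0) + rng.gauss(0, 1)
        students.append((x, applied, won, effect, y))
    return students

students = simulate()

# Average effect over all students. Knowable only because this is a simulation.
true_population_effect = sum(s[3] for s in students) / len(students)

winners = [(s[0], s[4]) for s in students if s[1] and s[2]]
losers = [(s[0], s[4]) for s in students if s[1] and not s[2]]

# Estimate for the core: difference in mean outcomes among lottery applicants.
core_estimate = (sum(y for _, y in winners) / len(winners)
                 - sum(y for _, y in losers) / len(losers))

# Extrapolation: fit outcome-vs-x lines separately for winners and losers,
# then average the implied effect over the x values of ALL students.
# This leans on the assumption that the linear pattern seen among
# applicants also holds for students who never applied.
a_w, b_w = fit_line([x for x, _ in winners], [y for _, y in winners])
a_l, b_l = fit_line([x for x, _ in losers], [y for _, y in losers])
mean_x_all = sum(s[0] for s in students) / len(students)
extrapolated_estimate = (a_w - a_l) + (b_w - b_l) * mean_x_all

print("core estimate:", round(core_estimate, 3))
print("extrapolated estimate:", round(extrapolated_estimate, 3))
print("true population effect:", round(true_population_effect, 3))
```

In this setup the core estimate is a good estimate of the effect among applicants but a poor stand-in for the population effect, because applicants differ on `x`. The extrapolation does better here only because the simulation was built so that the linearity assumption is true; with real data that assumption would have to be stated and defended.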

32 Comments

  1. Steve says:

    I may be wrong, but the charter studies I have seen also are not doing an intention-to-treat analysis, and that is a big deal because many of the charters are pushing out students that don’t meet expectations. They do this either overtly, by telling parents this is not the place for your kid, or simply by making life so difficult for certain children with disabilities or other problems that parents give up. But I don’t think those children are followed, so we don’t know if their experience at the school made things better or worse. At any rate, isn’t the failure to do an intention-to-treat analysis affecting internal validity, not just external validity, or am I confused?

    • mpledger says:

      There is also an effect of disappointment. A “winner” goes to a school that their parents have put a lot of effort into getting them into (a huge emotional pay-out for “success”), and a “loser” goes to a school that their parents wanted to avoid (a huge emotional loss for “failure”). If both schools were even-stevens otherwise, which kid is more likely to flourish: the kid in the school everyone wanted them to go to, or the kid going to a school s/he didn’t want to go to?

    • This is also a problem in public schools… kids with more socioeconomic power move out of public schools at, say, middle school or high school and into private schools. The remaining students move into the public high schools and make them look bad in a naive analysis of average achievement level… but the real issue is just filtering the student body down to those who have lower-income parents and so forth.

    • Demosthenes says:

      Who can say what you’ve seen, but you’re wrong that ITTs are not reported. Check the Angrist et al. papers.

  2. jim says:

    To me this sounds like someone wanting a study that produces results they like (external validity), rather than a study that produces results they don’t like (internal validity).

    All research has trade-offs and assumptions, and social science research of any type has ***A LOT*** of trade-offs and assumptions. Ideally there would be a type of study that blends internal and external validity, but that would have more trade-offs and assumptions still.

    The real question in public schools is whether parents will allow their children to be judged by a universal rigorous standard. If that was in place, the job to deliver the goods would be on the student and parents, which is where it should be. Whenever anyone attempts that, though, parents have a cow and insist that their child is so smart that no test can identify their brilliance and that this standard (it doesn’t matter what the actual standard is) just isn’t fair to their poor child!

    • jim says:

      Teachers, by the way, hate rigorous standards too.

      My 7th grade science teacher was a National Guard chopper pilot. Needless to say, we spent **a lot** of time talking about his weekend adventures with the Guard. Every once in a while he’d get on some other kick and spend the day on that kick. I’ll never forget his lecture on nuclear power, which in retrospect wildly overstated the risk, but influenced me for a long, long time.

      If this teacher was teaching to rigorous standards, all his lectures on flying around with the guard and his personal hobby horses (nuclear power etc) would all have to go.

      THAT is why teachers hate testing. They’d have to work every day instead of just play at whatever they like.

      • Phil says:

        That may be why that teacher hated testing. I’m sure that is not the only teacher who hates the emphasis on standardized testing for that reason. But to say that’s the only reason or even the main reason many teachers hate all of the standardized tests… that’s just ridiculous.

        • jim says:

          You’re right.

          They also hate standardized testing because they’re afraid that they’ll be inappropriately judged by students’ results (and to some degree rightly so).

        • jim says:

          Phil, it’s amazing that we have this group of statisticians and scientists who flatly reject the idea that a standardized test can monitor educational progress! But they’re more than happy to apply all sorts of bizarre and untested methods to determine the minuscule level of efficacy of some drug or social interaction!

          Hilarious!

    • steve says:

      You are wildly overstating how easy it is to have a “universal rigorous standard.” Defining any standard will result in teaching to the standard, which immediately skews any results. It is hard to state one universal standard for what all kids should be learning. And why should we even want that? Different skills are needed for different occupations.

      • jim says:

        “You are wildly overstating how easy it is to have a “universal rigorous standard.”

        ??? All you have to do is create a test and deliver it. And, yes, we do want teachers to teach to the test. The test is the guideline for the curriculum. So they should teach to it. That’s their job.

        The purpose of public education is to give people foundational skills, like reading, writing and math, isn’t it? Even a poet needs basic math skills, and even a statistician must be able to read and write.

        • Steve says:

          Sorry. Imperial China had a standardized test. It did not work out well for them.

          • jrkrideau says:

            Well, Imperial China only used civil service exams from around the 600s to about 1900 CE. They probably had not worked out all the kinks.

          • jim says:

            I’m curious what drives the economy in your region.

            Here in Seattle, where the economy is driven very strongly by education, people are flooding in from countries all over the world where standardized testing is the norm.

            That’s great, more power to them. But it says a lot about US education.

            • Jim, here in CA, standardized testing is the norm: every student takes standardized tests in 3rd through 8th grade and again in 11th grade. I’ve been analyzing the data to understand our local school system. It’s definitely useful. But I’ve also experienced the difference between an AP class and an actual college course. In high school, an AP history class is basically a course in which you are taught to answer a whole bunch of questions on a test: a bunch of facts about things that happened in the past. In a college course, the class is about learning to understand the relationships between different events in history and how they shaped later events, etc. You have to write papers in which you argue for ideas, you have to do some research, read some original source materials, etc. The two are very different. I can tell you that the standardized AP test is a major hindrance to learning the essential thing, which is the skill of developing ideas from facts, models of the way things worked… It does, however, get people to learn a lot of facts.

        • Martha (Smith) says:

          Steve said,

          “You are wildly overstating how easy it is to have a “universal rigorous standard.”

          Jim responded:

          “??? All you have to do is create a test and deliver it. And, yes, we do want teachers to teach to the test. The test is the guideline for the curriculum. So they should teach to it. That’s their job.”

          In my experience, the word, “rigorous” has more than one meaning. Some people use it to say that something is difficult or exacting. Others use it to say that something is created with high standards. A test that is rigorous in the second sense is usually rigorous in the first sense, but not vice versa. In particular, the only way a single test can be a guideline for the curriculum is if the curriculum does not have high standards.

          • Steve says:

            Yes, and currently the common core testing is “rigorous” only in the sense of being hard or having a few hard questions. The math tests that I have seen test very little math content, but contain tricky word problems that turn on the student reading the question in the “right” way. The difficult ELA questions require the student to pick at the “correct” interpretation of a story that is of course ambiguous. So, the test is hard because you have to share certain assumptions with the test makers, but it isn’t making the child face some original problem.

            • Martha (Smith) says:

              Steve said,
              “The math tests that I have seen test very little math content, but contain tricky word problems that turn on the student reading the question in the “right” way.”

              In my experience, the phrase “math content” can mean quite different things to different people. Speaking as a mathematician who has taught quite a variety and range of mathematics courses (as well as a few statistics courses, and as well as having done research in mathematics), I am very aware that many people (especially many college freshmen) think of “math” as just doing calculations. I tell my students that if they think of math as just being able to do calculations, then they can be replaced by a computer. Real math goes beyond just doing calculations: it involves carefully reading the background of the problem, considering the context in which the problem occurs, understanding terminology, understanding limitations and assumptions in mathematical techniques, and choosing a plan of solution that takes all of these into account. For more elaboration, you may look at the following website, which is the “first day handout” for a course I used to teach for prospective secondary math teachers: https://web.ma.utexas.edu/users/mks/360M05/360M05home.html

  3. LemmusLemmus says:

    Well, what you could do is do a low-external/high-internal validity analysis and a high-external/low-internal validity analysis and hope that the results are similar.

    • Steve says:

      Or you could just do experiments where you develop teaching methods in a lab and see which ones actually work, script the lessons, and then, when you find out what works, reproduce them en masse. This has actually been done. It is called Direct Instruction, but our schools won’t use it. Instead, they try various en masse changes to the curriculum that then cannot be studied except through extremely noisy observational studies that leave everyone scratching their heads about whether anything was proved.

      • Martha (Smith) says:

        “Or you could just do experiments where you develop teaching methods in a lab and see which ones actually work”

        And how do you define/verify that “they work”? If the criterion is just “teaching for the test” and the test is easily gamed — well …

        • jim says:

          While “DI” is still controversial, that’s mostly because educators hate it for reasons I noted above. I’ve read about it in several different contexts and the research backing it up appears to be solid.

          https://en.wikipedia.org/wiki/Direct_instruction

        • Steve says:

          I agree with you that there will be no neutral arbiter of “it works.” But we will have a clear view of what “it works” means. I did the direct instruction version of teaching my kid to read in 100 lessons, and by the end of the lessons he could read, and he was three years old. There will never be an objective (a word I hate) measure of what works in education. But we can certainly define broadly what our objective is and meet it. I emphasize: we need to understand that our standards are arbitrary, in some sense.

        • Steve says:

          Well, that’s the problem in education, because we don’t really know if there is a difference between gaming the test and learning. Are good students anything other than people who are good at pleasing their teachers? We cannot actually give students tests that are unrelated to their instruction. But at least with Direct Instruction the lessons are scripted, so we know what we are testing, not some vague concept of how to do instruction that will then vary from teacher to teacher.

          • jim says:

            “we know what we are testing, not some vague concept of how to do instruction that will then vary from teacher to teacher.”

            Bingo.

            Is there any such thing as “gaming” a test? To a minor degree, you can play strategy games, but that only gets you so far, right? You can eliminate two out of five answers because they’re ridiculous. But you still have to choose one of three.

      • jim says:

        “It is called Direct Instruction”

        Excellent! One of the biggest reasons we don’t use DI is because teachers and educators hate it. It takes away their claim to being the magic fairy that brings knowledge and makes them just a conveyor of words.

  4. LemmusLemmus says:

    “Instead, they try various en masse changes to the curriculum that then cannot be studied except through extremely noisy observational studies that leave everyone scratching their heads about whether anything was proved.”

    That’s the big problem with politics generally. We basically know how to test whether A works better than B (assuming we’ve agreed on what “it works” means), but policy changes are not set up for testability.
