Greg:

I agree with your general point. In this case, I suspect the sample size is too small to learn anything useful. The estimate of the effect of air filters will be near zero and not statistically significant, but I think the uncertainty interval will be so wide that it would not be appropriate from these data to claim that the effect size is near zero.

It’s a cool natural experiment, but it might be just too small to give any useful information. But, sure, the researcher could do some time-series plots and see if anything turns up.

]]>Put differently, is there some way to learn something useful about air pollution from this data? It is an important question. Lots of work in this area finds comparable results with comparable (or worse) methodological faults. Is the most we can learn about the question (not the methodology) really nothing?

]]>This is the bit I find baffling. If there wasn’t any gas present, and the filters are meant to be catching some other pollutant, why on earth would the slope of a regression of change in score on distance from the leak be evidence of anything? I mean, why not regress the change in the scores on how many kids are Hannah Montana fans, or the phases of the moon?

]]>Dean:

I think there’s a larger problem here, which is considering this as a “regression discontinuity” problem rather than just starting from scratch and thinking of it as an observational study. There are lots of potential differences between the different schools, and distance from the gas leak is only one of these differences. It doesn’t matter what you do with this one variable, it still doesn’t adjust for anything else.

And of course in this particular example, there’s no difference between the outcomes in the treatment and control groups, and we always have to worry about a method that creates a statistical difference when there was no difference there in the first place. Yes, this can happen, but we should be concerned.

]]>One robustness check, now quite widespread, is to report results with double and half the selected bandwidth.

Figure A3 reports results for varied bandwidths, though not up to double the bandwidth, perhaps for good reason (“I stop the figure at 2.5 miles as extending it further would start including the two school that were evacuated due to their proximity to the leak”). The results are not significant at half the bandwidth and are not so strong with larger bandwidths either.

Of course, declaring a result only when this many robustness checks pass can also distort inference, so I’m not sure we should expect all checks to pass…

]]>These kinds of spatial RDs typically collapse the spatial discontinuity to a single dimension and then use local linear regression, but better estimators are available.

]]>Nathan, take a look below at this comment:

In general, you should think of regression discontinuity as just another kind of nonlinear model. if y(x) has a discontinuity in overall level, it can be approximated as a quickly rising function such as inverse_logit((x-d)/s) where d is the discontinuity point, and s is the scale for how fast the function rises/falls.

if it has a kink in the slope, you can use something like what andrew posted about a “continuous hinge” a couple years back:

https://statmodeling.stat.columbia.edu/2017/05/19/continuous-hinge-function-bayesian-modeling/

In general I recommend using these continuous smooth functions rather than discontinuous ones. On the one hand, real things are rarely entirely discontinuous, so it’s the step functions and so forth that are the approximations usually, and on the other hand, smooth functions tend to work better for computational reasons.

Beyond that, you should realize that you *do* have scientific/background information, and use these interpretable basis functions such as the above to help you constrain the problem and regularize it…. For example if you’re looking at test data through time… and someone suddenly implements a new teaching method, no one expects suddenly that the rate of improvement of math scores will go from say 100 points per year to 1000… so you have background information, use it! put constraining priors on coefficients and things.

In general you will get a better fit if you:

1) treat the problem as a nonlinear function approximation problem

2) Use an interpretable set of functions to describe the behavior you could expect, with smooth functions.

3) Provide realistic prior information for the coefficients of the interpretable model

4) Use at least the very basic theoretical knowledge (in this case for example, with the absence of measurable gas levels before the filters, there is no theoretical reason why school performance should be related to distance from a gas leak)
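To make the smooth basis functions above concrete, here is a minimal sketch in Python. The function names and the values of d and s are illustrative assumptions. The hinge shown is simply the integral of the smooth step (a softplus), which may not match the exact form used in the linked post.

```python
import numpy as np

def inverse_logit(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def smooth_step(x, d, s):
    """Smooth stand-in for a unit step at x = d; s sets how fast it rises."""
    return inverse_logit((np.asarray(x, dtype=float) - d) / s)

def smooth_hinge(x, d, s):
    """Integral of smooth_step: about 0 left of d, about (x - d) right of d."""
    z = (np.asarray(x, dtype=float) - d) / s
    return s * np.logaddexp(0.0, z)  # s * log(1 + exp(z)), computed stably

x = np.linspace(0.0, 10.0, 11)
# Illustrative values only: discontinuity point d = 5, transition scale s = 0.1.
step = smooth_step(x, d=5.0, s=0.1)
hinge = smooth_hinge(x, d=5.0, s=0.1)
```

Either function can be used as a column in a regression design matrix, with k multiplying the step (change in level) and j multiplying the hinge (change in slope). As s shrinks, both approach their discontinuous counterparts.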

Hi Daniel, those graphs are very convincing but the wikipedia article goes over my head a bit, would you mind expanding on the problem and what solutions we should turn to when using RD?

Thanks a lot

]]>Um, during the same time period at that same location, lead was removed from gasoline and the air. Lead is much more harmful to student performance than is smog generally.

]]>Ha, that’s great Daniel! Looks just like the data in the paper.

]]>Umm…no academic in her right mind would publish a study that would hint at the benefits of using natural gas.

]]>“Until somebody analyses the filters we won’t know.”

+1

it’s almost funny that no one even thought of it.

]]>It isn’t just that the g(x) should be windowed at the boundary, but also for regularization it should be from a more limited family of functions than the f(x). In this case they use linear functions for both f and g and then g can “undo” all of f, they become independent functions on either side… if f(x) is from a larger function space, then g(x) is more of a perturbation to it.

This is another form of a prior really.

]]>There are an infinity of models with a discontinuity at a given point. In this example, *everything* is discontinuous… the whole model of the world changes from one place to another…

In a more restricted/regularized design you might choose:

y = f(x) + k * step_function(x-d)

which specifies that y steps up its average value by an amount k at the point d…

you might also do:

y = f(x) + k * step_function(x-d) + j * integrate(step_function(x-d),x)

which now says in addition to stepping up its average value, it now has an additional trend.

As it stands, if f(x) is a+b*x then this is equivalent to what was done here. It can represent exactly the same functions as the specification a+b*x for x less than the discontinuity and a2+b2*x for x greater than the discontinuity… (on the other hand, if the f(x) is a more complex nonlinear function then it represents the smooth nonlinear trend).
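The equivalence claimed here can be checked numerically. The sketch below uses made-up noise data and a hypothetical boundary at d = 5. It fits the single model with a step and a slope-change term, then fits two separate lines, and compares the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5.0                                   # hypothetical boundary
x = rng.uniform(2.5, 7.5, size=60)        # made-up distances
y = rng.normal(size=60)                   # pure noise, for illustration only

# Single model: y = a + b*x + k*step(x - d) + j*integral of step
step = (x > d).astype(float)
ramp = np.maximum(x - d, 0.0)
X = np.column_stack([np.ones_like(x), x, step, ramp])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted_joint = X @ coef

# Two unrelated straight lines, one on each side of the boundary
fitted_separate = np.empty_like(y)
for side in (x <= d, x > d):
    slope, intercept = np.polyfit(x[side], y[side], deg=1)
    fitted_separate[side] = intercept + slope * x[side]

max_gap = np.max(np.abs(fitted_joint - fitted_separate))
```

With unpenalized least squares, `max_gap` is zero up to floating-point error: four free coefficients describe two independent lines either way. The two specifications only start to differ once f comes from a richer family or the coefficients get informative priors.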

So, now, we need to bring in background knowledge. First off… we need to acknowledge that if there weren’t any natural gas pollutants, then there is no real reason to believe that the slope is meaningful, so we should put a strong prior that b ~ 0 and j ~= 0 (where ~= is a little stronger than ~ ) with whatever difference from zero there is being due to some geography or whatever.

Next we should say that if there’s a difference in test scores, it shouldn’t be dramatically different from the approximate sizes of the differences in observed test scores in different parts of the county as pollution varies across the county… so k ~ normal(0,s) with s some small number probably like 0.05 to 0.1 or so on the scale they’re using here.

Now, refit the model, and you can get some kind of reasonable inference.
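One hedged way to sketch that refit: with normal priors on the coefficients and a known noise scale, the posterior mode has a closed form (a generalized ridge regression). Everything numeric below is an assumption for illustration: the data are simulated noise, sigma is treated as known, x is centered at d so that k is the jump at the boundary, and the prior scales for a, b, and j are made up. Only the 0.1 scale for k comes from the range suggested above. A full Bayesian fit (in Stan, say) would also estimate sigma and give uncertainty intervals.

```python
import numpy as np

def map_fit(X, y, sigma, prior_sd):
    """Posterior mode for y ~ normal(X @ beta, sigma), beta_i ~ normal(0, prior_sd[i])."""
    precision = np.diag(1.0 / np.asarray(prior_sd, dtype=float) ** 2)
    A = X.T @ X / sigma**2 + precision
    rhs = X.T @ y / sigma**2
    return np.linalg.solve(A, rhs)

rng = np.random.default_rng(2)
d = 5.0                                     # hypothetical boundary
x = rng.uniform(2.5, 7.5, size=60)          # made-up distances
y = rng.normal(scale=0.3, size=60)          # made-up outcomes with no true effect

xc = x - d                                  # center at the boundary
X = np.column_stack([
    np.ones_like(xc),                       # a: overall level
    xc,                                     # b: common slope, prior near 0
    (xc > 0).astype(float),                 # k: step at the boundary
    np.maximum(xc, 0.0),                    # j: extra slope past the boundary
])

# Prior scales: weak on a, tight on b, tighter on j, 0.1 on k.
a, b, k, j = map_fit(X, y, sigma=0.3, prior_sd=[10.0, 0.05, 0.1, 0.02])

# Sanity check: with very weak priors the estimate reduces to least squares.
ols, *_ = np.linalg.lstsq(X, y, rcond=None)
weak = map_fit(X, y, sigma=0.3, prior_sd=[1e6] * 4)
```

Comparing `k` with `ols[2]` on the same data shows how much of the unregularized jump estimate survives once the slopes are constrained.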

]]>Southern California engaged in a vast natural experiment between about 1965 and 2000 in the near virtual elimination of severe smog:

https://slideplayer.com/slide/5875876/19/images/7/Ozone+trends+in+Los+Angeles.jpg

Around 1966, there were 75 days per year with bad stage 2 smog alerts. After the late 1980s there were close to zero.

Lesser Stage 1 smog alerts declined from about 180 per year in the mid-1960s to single digits annually by the late 1990s.

Has anybody ever analyzed the effects of this huge environmental improvement on school test scores?

]]>+1

]]>Yes, I’ve encountered talks where (for some reason I can’t really fathom), saying you use “novel methods” seems to get you brownie points.

]]>I wasn’t aware of the Runge thing. I’d seen the Gibbs thing before. Neat.

]]>If only we could get journalists to spend less time on twitter and more time reading these comment sections. I can’t expect this from most reporters, but I’d hope that an outlet such as Vox, which is specifically dedicated to explaining the news, could do this.

]]>Nice simulation. Very convincing.

]]>I don’t have a reference, but the way to think about this is that you’re representing a nonlinear function, so the place to read about it is in function approximation theory. you’re doing a basis expansion basically.

]]>Good point.

If the filters are working by reducing common pollutants and not by reducing natural gas pollutants, then the effect would not increase with distance from the leak. There would be flat lines on both sides of the 5-mile point and a simple comparison of filter versus no-filter schools would show the effect.

]]>Adam:

Two things I’ve consistently said about the null hypothesis are: (a) it’s always false, and (b) if we can’t distinguish the data from simulations from the null hypothesis, that tells us something: it tells us that the data are not informative to learn very much.

If you have prior information that air filters can help test performance, that’s fine, go for it. But in this case the data provide approximately zero evidence. That was the point of my P.S. above. It’s fine to recommend air filters, but this recommendation shouldn’t be motivated by this particular study, where there was essentially no difference in outcomes between treated and control schools.

]]>“I think the default should go the other way: we start with the observation that there was no apparent difference between the outcomes in the two groups of schools, and then if people want to make a positive claim from there, the burden should be on them.”

Is this Mayo or Gelman? Didn’t expect the logic of NHST to come out of your mouth.

Why not be Bayesian about it and give each hypothesis some probability by default (equal odds if we’re totally agnostic pre-data) and update in light of the data?

]]>The way RDD as an identification strategy was taught to me was to fit one regression with an indicator variable for the point of discontinuity. This way, you’re not letting the function do different things at both sides, and smooth trends in one side inform the function on the other side. It’s meant to be an identification strategy that controls for smooth trends in covariates, though is still susceptible to confounding for many interesting discontinuities (political boundaries would be one). I don’t know what this “compare two totally different regressions on either side of the discontinuity” thing is, or if it even has a name.

]]>Daniel,

Your conditioning argument below seems very sound to me, and clear. Thank you!

As someone who has never fitted any type of discontinuity analysis, do you have a reference/method for performing your suggested regularized model: eg. y(x) = f(x) + g(x : x > boundary).

]]>I don’t know. I’ve seen reviewers accept a paper by virtue of the “novelty” of its statistical methods.

]]>Z:

Yes, the data show what the data show, but the real question is about generalizations to other settings, and to do that we need to make lots of assumptions in any case. The simpler analysis does allow the data to speak, but without assumptions the data do not speak to the larger questions of interest.

]]>Sure, maybe I’m picking up on unintended implications. To me it came across as “The tortured RD analysis says X but the data show (if you just let them speak for themselves through my simpler analysis) Y”. But I wouldn’t report results from your proposed study as “the data show…” any more than I would report results from the bad RD study as “the data show…”. Even though your analysis is simpler in some sense, it’s not like the simpler analysis is just letting the data speak. Both are “data + assumptions imply…”

]]>Kyle:

I found the thread. The best part was where Smith wrote, “His criticisms seem taken directly off of Twitter (or a summary someone sent him, or a very cursory perusal).” I don’t think that Smith realizes that when I write, I follow academic rules, which means that if I get an idea from something I read, I cite the source! If my criticisms had been “taken directly off of Twitter,” I would’ve linked to the relevant tweets!

I continue to think that a big problem here is that people take a published (or, in this case, unpublished) claim as a starting point. In this case, the starting point was “Putting air filters in classrooms increased test scores by 0.20 standard deviations,” and the idea is that the claim stands until someone shoots it down. I think the default should go the other way: we start with the observation that there was no apparent difference between the outcomes in the two groups of schools, and then if people want to make a positive claim from there, the burden should be on them. Relatedly, I think more graphs should be absolutely required in this sort of work. Not just that discontinuity graph (although it’s a start) but the pre/post graph I suggested in my above post. The final analysis might end up looking like a discontinuity regression, that could be fine, but we have to work our way there.

]]>Demosthenes:

There’s one robustness check they didn’t do, which is the direct comparison between the two groups of schools, which would show essentially zero difference!

]]>Demosthenes:

No one is saying to believe the study just because it has the “regression discontinuity” label. But I think they pretty much are acting that way. But I guess we could flip this around and ask why the Vox writer reported on the study so credulously. We can’t blame PNAS on this one, as the paper hasn’t even been published! I’m not quite sure what cues in the technical report made it seem reasonable for a reporter to accept the claims without question. Perhaps there was a certain professionalism in the presentation of the results. I do think that part of the appeal was that it was a natural experiment. Maybe not the discontinuity analysis per se, but something about the setup that made it look like trustworthy research, I dunno.

]]>Z:

I agree. When I say these data “do not show any clear effect” and that the study “found a null result,” I don’t mean that the true effect is zero. What I mean is that the data are consistent with a zero true effect.

The point of my proposed simpler study is not that it gives the right answer but rather that I think it is a reasonable starting point, in the same way that, if you have survey data, a reasonable starting point is to do some demographic adjustments to line up the sample to the population. That’s not the end point, just a start.

]]>I disagree with the notion that the analysis Andrew favors deserves a privileged status. Andrew goes so far as to summarize the hypothetical results from his analysis as “*the data showed* no effect”. No, the data alone do not show anything about whether there is a causal effect. Only a combination of assumptions, models, and data can show that. Andrew thinks the assumptions and models of the regression discontinuity are too cumbersome, but I also find the strong assumptions underlying his simpler approach unlikely.

However, I do share Andrew’s implicit intuition that if there’s not a big difference in a simple comparison between treated and untreated, there’s probably not a sizable causal effect. Basically, what are the odds that confounding would almost exactly cancel out the effect? When an involved analysis turns up an effect that wasn’t apparent from a simple comparison, it does seem more likely that there’s really no effect and the analysis screwed something up than that there was an effect that was perfectly cancelled out by confounding that the analysis removed.

]]>I data thieved the plots. Basically, the means and variances of the whole groups and the sub-groups are identical. (The means for the math scores are just about 1 standard deviation of the mean away from each other, the English score means are even closer.) Now, that’s not the most sophisticated thing one can do, but it does mean that at first blush the two sub-groups behave more or less like you’d expect any two randomly drawn sub-groups to behave, coming from the same parent population.

You basically have to first show that there’s some statistically significant trend with *something* (but how many somethings did you look at???), and then show that the chosen sub-groups clearly violate that, and at a statistical level that more than makes up for how many ways of parsing the data that you looked at.

There’s nothing in the quick look at the data that would make me think that these two groups are in any way different.

]]>Umm, the table in the paper does (a not completely exhaustive) set of robustness checks for the different control variables. You’re being lazy here.

And how do you think this addresses the problem? Eg, if there is another important variable that should be included that they lack data on, how is a robustness check going to discover it?

]]>Right, he snuck that in there about the trend and I knew what he meant, but he could have been more explicit about the fact that the trend is meaningless, distance doesn’t play any role… this is very close to monkey pushes button and sees if he gets a potato chip this time…

see below, where you’ll notice that when you fit noise like this you get a potato chip about 90% of the time: https://statmodeling.stat.columbia.edu/2020/01/09/no-i-dont-think-that-this-study-offers-good-evidence-that-installing-air-filters-in-classrooms-has-surprisingly-large-educational-benefits/#comment-1221843

]]>Of course, but there are lots of different filters for lots of different environments and it would be rather surprising if for example a vapor cartridge filter turned out to be an effective particulate filter. Leery of lawsuits, manufacturers send along detailed warnings to disabuse users of the notion that filters designed for one class of contaminants might also guard against another. So my question is: if the filter wasn’t removing what it was designed to remove, what, if anything, did it remove? Until somebody analyses the filters we won’t know.

]]>The fact is the discontinuity is known to create this phenomenon. With a discontinuity you have information from only *one side* of the interval informing the fit… when you fit through the discontinuity you get a dramatically different result because under the null hypothesis, the information content is continuous through the discontinuity (slope, overall value, even second or third derivatives) whereas under the alternative hypothesis there is no connection at all between the two halves. Unfortunately for these people who keep doing this all the time… the world isn’t binary. It’s not “everything is the same across this boundary” or “everything is totally different across this boundary”; the opposite of “everything is the same across this boundary” is “there exists something that is different across this boundary”.

Put another way, regularization is needed here… you want to fit y=f(x) + g(x : x > boundary)

where f incorporates all the data, and g incorporates just the data on one side of the boundary, to show how the results there differ from the overall fit…

instead they fit:

y = f(x : x < boundary) + g(x : x > boundary)

so there is no information shared across the boundary at all… as if on one side of the boundary you have venusians being taught on teleprompters by 7 eyed tentacle beasts, and the other side is humans or something.

]]>Filters have no effect because they are a binary intervention: Filters = 1 if inside 5 miles and Filters = 0 outside. To the extent there’s a real trend in scores related to distance, in either or both conditions, it cannot be caused by the filters. A continuous trend cannot be caused by a binary intervention.

]]>More to the point, why does distance from the leak predict growth at all, inside or outside 5 miles? The authors set this up as a binary issue–either you have the filters or you don’t, as determined by whether you were inside the radius or you weren’t. The only reason for looking at distance is if there really was contamination with gas, and readings taken in the schools show there wasn’t. The slope of the score growth as a function of distance is theoretically irrelevant–only the intercept (mean) matters if the filters are the cause of differences between groups. They have instead found evidence that, if there is a real difference, it cannot be caused by filters!

I realize now this is what Andrew meant when he wrote that it’s “a linear trend which makes no theoretical sense,” but I originally thought he was referring to the size/direction of the trend being nonsense, not the fact that there’s no sense in even looking at the trend.

]]>1. If you can look at the graph and draw a virtually horizontal line from a point on the left margin to a point on the right margin, staying between confidence limits around both of the regression lines, doesn’t that mean there’s a plausible regression line with slope ~= 0? It’s not a sophisticated evaluation of the data, but it raises a red flag for me.

2. If the air filters suddenly stopped being installed at 5 miles, why does distance greater than 5 miles still predict test score growth? Shouldn’t growth outside of that radius flatline as a function of distance? I didn’t read the paper, so maybe they establish somewhere that there’s a non-filter-related variable associated with scores and that also decreases with distance–like SES or general pollution levels. But then that would mean the authors chose not to adjust the graphed data points for that variable, requiring the reader to mentally rotate the two lines until the one on the right is flat in order to see the net difference due to filters. Right?

]]>I’m replying to a comment that hasn’t been approved yet, but whatever… hope it doesn’t break the blog…

Here is a very simple demonstration of why, when you run a regression discontinuity on noise, you almost always get two very different results on either side of the discontinuity, even (or maybe especially) when the data is pure noise.

Includes a simulation and 20 pdf graphs of the results… there’s always something going on in the graph.
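The original simulation is not reproduced in this thread. As a stand-in, here is a minimal Python sketch of the same idea, not the code referred to above. The boundary, number of schools, and number of replications are made-up values. It generates pure noise, fits a separate line on each side of the boundary, and compares the spread of the estimated "jump" at the boundary with the spread of a plain difference in means.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, n_sims = 5.0, 40, 2000       # hypothetical boundary, schools, replications
x = np.linspace(2.5, 7.5, n)       # fixed distances, 20 on each side of d
left, right = x < d, x > d

jumps = np.empty(n_sims)
mean_diffs = np.empty(n_sims)
for i in range(n_sims):
    y = rng.normal(size=n)         # pure noise: no effect, no trend
    bl, al = np.polyfit(x[left], y[left], deg=1)     # slope, intercept
    br, ar = np.polyfit(x[right], y[right], deg=1)
    jumps[i] = (al + bl * d) - (ar + br * d)         # gap between the lines at d
    mean_diffs[i] = y[left].mean() - y[right].mean()

sd_jump, sd_mean = jumps.std(), mean_diffs.std()
```

In this setup the jump estimate is roughly twice as variable as the difference in means, because each line is evaluated at the very end of its own data, where a fitted line is least constrained. Apparent gaps at the boundary are therefore the expected outcome under noise, not a sign that something happened there.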

]]>I too find it highly plausible that people living over a gas field would do better on cognitive tests if everything didn’t smell like rotten eggs. But in this case, “Air testing conducted inside schools during the leak (but before air filters were installed) showed no presence of natural gas pollutants, implying that the effectiveness of air filters came from removing common air pollutants and so these results should extend to other settings.”

IOW, if the filters had an effect it was not on leaked natural gas, but on “common air pollutants,” which I think would probably not include noticeable levels of mercaptans.

]]>https://en.wikipedia.org/wiki/Runge%27s_phenomenon

In the middle of the interval, the function gets close to the right value, because it is a continuous function and it has data to constrain not only its value but also its slope, and maybe 2nd derivative, etc…

But at the end of the interval, particularly with noise involved, it has information from only one side of the interval. For example, here there is an upward outlier at about 4.95 and another downward outlier at 5.75 (math score graph). If the regression went through that region, those two would cancel out… the curve would stay somewhere in the middle… But because we drew an arbitrary boundary there, and then allowed the functions on each side of that boundary to do *completely different* things, the functions do exactly that, completely different things. This then becomes a “discovery” because we have an “identification strategy”. It’s like saying you discovered your feet are different sizes because when you allowed yourself to put on two different shoes you didn’t trip. People are pretty good at not tripping. Clowns can wear enormously oversized shoes… it’s normal for you to not trip (it’s normal for the curves to go off in oscillations at the edges of a boundary) nothing about finding that the usual thing happened should make you think anything unusual is going on.
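For readers who want to see the edge behavior directly, here is a small Python sketch of the classic Runge example from the linked article. It interpolates 1/(1+25x^2) at 11 equally spaced points with a degree-10 polynomial and compares the error in the middle of the interval with the error near the ends. The split at |x| = 0.5 is an arbitrary choice for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

def runge(x):
    """The standard Runge test function."""
    return 1.0 / (1.0 + 25.0 * x**2)

nodes = np.linspace(-1.0, 1.0, 11)                  # equally spaced data points
p = Polynomial.fit(nodes, runge(nodes), deg=10)     # passes through all 11 nodes

grid = np.linspace(-1.0, 1.0, 2001)
err = np.abs(p(grid) - runge(grid))
middle = np.abs(grid) <= 0.5                        # arbitrary "middle" region
err_middle, err_edges = err[middle].max(), err[~middle].max()
```

The interpolant tracks the function reasonably well in the middle and swings far away from it near the ends, where data sit on only one side. This is the noiseless version of the problem. With noisy data and an artificial boundary in the middle of the range, each separate fit gets its own "end of the interval" right at the boundary.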

]]>In particular, yes, some different kernels were tried, but what about a constant fit? If neither the data nor prior reasoning strongly support a particular functional form, you should have to report fits with a wide range of functional forms, not just some cherry-picked ones that happen to give the desired significant results.

]]>Umm, the table in the paper does (a not completely exhaustive) set of robustness checks for the different control variables. You’re being lazy here.

]]>“But I think it’s a mistake of people to believe the results of such an analysis by default, just because it has the “regression discontinuity” label.”

This seems like a huge straw man, Andrew. Literally no one is saying that.

]]>Our coefficient of interest is Beta, which represents the effect of being just within 5 miles of the gas leak (and thus receiving air filters) compared to being just outside (and not receiving air filters).

Nope, the meaning of this coefficient depends on what you included in the model. In this case that is “geographic location”, “lagged test scores”, “student demographics and fixed school characteristics”, etc.

Change the model, change the meaning of the coefficient. Why is this model specification any better than the millions of other ones they could have used? It seems to be an arbitrary one using coefficients of convenience.

See here for an example of someone actually checking the value of the coefficient for some other data: https://statmodeling.stat.columbia.edu/2019/08/01/the-garden-of-forking-paths/

]]>