I agree with your general point. In this case, I suspect the sample size is too small to learn anything useful. The estimate of the effect of air filters will be near zero and not statistically significant, but I think the uncertainty interval will be so wide that it would not be appropriate from these data to claim that the effect size is near zero.

It’s a cool natural experiment, but it might be just too small to give any useful information. But, sure, the researcher could do some time-series plots and see if anything turns up.

]]>Put differently, is there some way to learn something useful about air pollution from these data? It is an important question. Lots of work in this area finds comparable results with comparable (or worse) methodological faults. Is the most we can learn about the question (not the methodology) really nothing?

]]>I think there’s a larger problem here, which is considering this as a “regression discontinuity” problem rather than just starting from scratch and thinking of it as an observational study. There are lots of potential differences between the different schools, and distance from the gas leak is only one of these differences. It doesn’t matter what you do with this one variable, it still doesn’t adjust for anything else.

And of course in this particular example, there’s no difference between the outcomes in the treatment and control groups, and we always have to worry about a method that creates a statistical difference when there was no difference there in the first place. Yes, this can happen, but we should be concerned.

]]>Figure A3 reports results for varied bandwidths, though not up to double the bandwidth, for which there may be a good reason (“I stop the figure at 2.5 miles as extending it further would start including the two school that were evacuated due to their proximity to the leak”). The results are not significant at half the bandwidth and are not so strong with larger bandwidths either.

Of course, making the declaration of a result conditional on so many robustness checks passing can also distort inference, so I’m not sure we should expect all checks to pass…

]]>These kinds of spatial RDs typically collapse the spatial discontinuity to a single dimension and then use local linear regression, but better estimators are available.

]]>In general, you should think of regression discontinuity as just another kind of nonlinear model. If y(x) has a discontinuity in overall level, it can be approximated as a quickly rising function such as inverse_logit((x-d)/s), where d is the discontinuity point and s is the scale for how fast the function rises/falls.

If it has a kink in the slope, you can use something like what Andrew posted about a “continuous hinge” a couple years back:

https://statmodeling.stat.columbia.edu/2017/05/19/continuous-hinge-function-bayesian-modeling/

In general I recommend using these continuous smooth functions rather than discontinuous ones. On the one hand, real things are rarely entirely discontinuous, so it’s usually the step functions and so forth that are the approximations, and on the other hand, smooth functions tend to work better for computational reasons.
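To make the two shapes concrete, here is a minimal Python sketch. The function names, the argument order, and the softplus-based form of the hinge are my own assumptions for illustration; it is not necessarily the exact parameterization used in the linked post.

```python
import numpy as np

def inverse_logit(z):
    """Logistic function 1 / (1 + exp(-z)), written via tanh so it does not overflow."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(z, dtype=float)))

def smooth_step(x, d, s, k=1.0):
    """Level shift of size k centred at d; s is the scale over which it rises.
    As s shrinks toward 0 this approaches k * step_function(x - d)."""
    x = np.asarray(x, dtype=float)
    return k * inverse_logit((x - d) / s)

def smooth_hinge(x, d, s, a, b0, b1):
    """Line with slope b0 well below d and slope b1 well above d,
    joined smoothly over a region of width roughly s around d.
    (Hypothetical softplus-based form, chosen for illustration.)"""
    x = np.asarray(x, dtype=float)
    z = (x - d) / s
    # np.logaddexp(0, z) is log(1 + exp(z)), computed stably
    return a + b0 * (x - d) + (b1 - b0) * s * np.logaddexp(0.0, z)

# Example: a shift of 0.1 at d = 5 that takes place over roughly a quarter mile
grid = np.linspace(3.0, 7.0, 9)
print(smooth_step(grid, d=5.0, s=0.25, k=0.1))
```

Both functions are differentiable everywhere in x, d and s, which is what makes them easier for gradient-based samplers and optimizers than a hard step.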

Beyond that, you should realize that you *do* have scientific/background information, and use these interpretable basis functions such as the above to help you constrain the problem and regularize it…. For example if you’re looking at test data through time… and someone suddenly implements a new teaching method, no one expects suddenly that the rate of improvement of math scores will go from say 100 points per year to 1000… so you have background information, use it! Put constraining priors on the coefficients.

In general you will get a better fit if you:

1) treat the problem as a nonlinear function approximation problem

2) Use an interpretable set of functions to describe the behavior you could expect, with smooth functions.

3) Provide realistic prior information for the coefficients of the interpretable model

4) Use at least the very basic theoretical knowledge (in this case for example, with the absence of measurable gas levels before the filters, there is no theoretical reason why school performance should be related to distance from a gas leak)

Thanks a lot

]]>+1

it’s almost funny that no one even thought of it.

]]>This is another form of a prior really.

]]>In a more restricted/regularized design you might choose:

y = f(x) + k * step_function(x-d)

which specifies that y steps up its average value by an amount k at the point d…

you might also do:

y = f(x) + k * step_function(x-d) + j * integrate(step_function(x-d),x)

which now says in addition to stepping up its average value, it now has an additional trend.

As it stands, if f(x) is a+b*x then this is equivalent to what was done here. It can represent exactly the same functions as the specification a+b*x for x less than the discontinuity and a2+b2*x for x greater than the discontinuity… (On the other hand, if f(x) is a more complex nonlinear function, then it represents the smooth nonlinear trend.)
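A quick numerical check of that equivalence, as a sketch on synthetic data. The cutoff d = 5, the sample size, and the noise outcome are placeholders I made up; nothing here comes from the paper's data. The integral of step_function(x-d) is the ramp max(x-d, 0).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5.0                                   # hypothetical discontinuity point
x = rng.uniform(3.0, 7.0, size=200)       # synthetic distances
y = rng.normal(0.0, 1.0, size=200)        # synthetic outcome: pure noise

step = (x > d).astype(float)              # step_function(x - d)
ramp = np.maximum(x - d, 0.0)             # integral of the step: 0 below d, x - d above

# Parameterization 1: y = a + b*x + k*step + j*ramp, fit by least squares
X1 = np.column_stack([np.ones_like(x), x, step, ramp])
coef1, *_ = np.linalg.lstsq(X1, y, rcond=None)
fit1 = X1 @ coef1

# Parameterization 2: a separate straight line on each side of d
fit2 = np.empty_like(y)
for side in (x <= d, x > d):
    Xs = np.column_stack([np.ones(side.sum()), x[side]])
    cs, *_ = np.linalg.lstsq(Xs, y[side], rcond=None)
    fit2[side] = Xs @ cs

# The two sets of fitted values agree up to floating-point rounding
print("max abs difference in fitted values:", np.max(np.abs(fit1 - fit2)))
print("k (jump at d) from parameterization 1:", coef1[2])
```

The advantage of the first form is only that k and j are explicit coefficients, so priors can be put on them directly.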

So, now, we need to bring in background knowledge. First off… we need to acknowledge that if there weren’t any natural gas pollutants, then there is no real reason to believe that the slope is meaningful, so we should put a strong prior that b ~ 0 and j ~= 0 (where ~= is a little stronger than ~ ) with whatever difference from zero there is being due to some geography or whatever.

Next we should say that if there’s a difference in test scores, it shouldn’t be dramatically different from the approximate sizes of the differences in observed test scores in different parts of the county as pollution varies across the county… so k ~ normal(0,s) with s some small number probably like 0.05 to 0.1 or so on the scale they’re using here.

Now, refit the model, and you can get some kind of reasonable inference.
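Here is one way the refit could look, as a rough sketch rather than a recipe. I assume a known noise standard deviation so the normal priors give a closed-form posterior; in practice you would estimate it, for example in Stan. The data are synthetic noise, and the prior scales for a, b and j are placeholders of my own. Only the 0.05 to 0.1 range for k comes from the comment above.

```python
import numpy as np

def posterior_normal(X, y, prior_sd, noise_sd):
    """Posterior mean and covariance of beta in y = X @ beta + noise,
    with independent normal(0, prior_sd[i]) priors and a known noise_sd."""
    prior_prec = np.diag(1.0 / np.asarray(prior_sd, dtype=float) ** 2)
    post_prec = X.T @ X / noise_sd**2 + prior_prec
    post_cov = np.linalg.inv(post_prec)
    post_mean = post_cov @ (X.T @ y) / noise_sd**2
    return post_mean, post_cov

rng = np.random.default_rng(1)
d = 5.0                                   # hypothetical cutoff
x = rng.uniform(3.0, 7.0, size=40)        # synthetic distances
y = rng.normal(0.0, 0.3, size=40)         # synthetic outcome: pure noise

xc = x - d                                # centring only changes the meaning of a
X = np.column_stack([
    np.ones_like(xc),                     # a: overall level
    xc,                                   # b: shared slope
    (xc > 0).astype(float),               # k: step at d
    np.maximum(xc, 0.0),                  # j: extra slope past d
])

# Prior sds for (a, b, k, j); j is tighter than b, in the spirit of "~=" vs "~"
prior_sd = [10.0, 0.05, 0.1, 0.02]
mean, cov = posterior_normal(X, y, prior_sd, noise_sd=0.3)
ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("k, no priors:  ", ols[2])
print("k, with priors:", mean[2], "+/-", np.sqrt(cov[2, 2]))
```

The posterior sd for k can never exceed its prior sd, so the priors cap how large a jump a small, noisy sample can report.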

]]>https://slideplayer.com/slide/5875876/19/images/7/Ozone+trends+in+Los+Angeles.jpg

Around 1966, there were 75 days per year with bad stage 2 smog alerts. After the late 1980s there were close to zero.

Lesser Stage 1 smog alerts declined from about 180 per year in the mid-1960s to single digits annually by the late 1990s.

Has anybody ever analyzed the effects of this huge environmental improvement on school test scores?

]]>If the filters are working by reducing common pollutants and not by reducing natural gas pollutants, then the effect would not increase with distance from the leak. There would be flat lines on both sides of the 5-mile point and a simple comparison of filter versus no-filter schools would show the effect.

]]>Two things I’ve consistently said about the null hypothesis are: (a) it’s always false, and (b) if we can’t distinguish the data from simulations from the null hypothesis, that tells us something: it tells us that the data are not informative to learn very much.

If you have prior information that air filters can help test performance, that’s fine, go for it. But in this case the data provide approximately zero evidence. That was the point of my P.S. above. It’s fine to recommend air filters, but this recommendation shouldn’t be motivated by this particular study, where there was essentially no difference in outcomes between treated and control schools.
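One simple way to do the "can we distinguish the data from simulations of the null" check is a permutation simulation of the direct treated-versus-control comparison. This is only a sketch: the function name, the group sizes, and the outcome values are made-up placeholders, not the study's data.

```python
import numpy as np

def permutation_null(treated, control, n_sims=5000, seed=0):
    """Compare the observed difference in means with differences obtained
    by randomly reshuffling the treated/control labels."""
    rng = np.random.default_rng(seed)
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    observed = treated.mean() - control.mean()
    pooled = np.concatenate([treated, control])
    n_t = len(treated)
    sims = np.empty(n_sims)
    for i in range(n_sims):
        perm = rng.permutation(pooled)
        sims[i] = perm[:n_t].mean() - perm[n_t:].mean()
    # Share of reshuffled differences at least as large as the observed one
    frac = float(np.mean(np.abs(sims) >= abs(observed)))
    return observed, sims, frac

# Placeholder data: score gains for hypothetical treated and control schools
rng = np.random.default_rng(2)
treated = rng.normal(0.0, 1.0, size=18)
control = rng.normal(0.0, 1.0, size=26)

observed, sims, frac = permutation_null(treated, control)
print("observed difference:", observed)
print("central 95% of reshuffled differences:", np.percentile(sims, [2.5, 97.5]))
print("share of reshuffles at least as extreme:", frac)
```

If the observed difference sits well inside the reshuffled range, that does not show the effect is zero. It shows that these data cannot tell you much either way.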

]]>Is this Mayo or Gelman? Didn’t expect the logic of NHST to come out of your mouth.

Why not be Bayesian about it and give each hypothesis some probability by default (equal odds if we’re totally agnostic pre-data) and update in light of the data?

]]>Your conditioning argument below seems very sound to me, and clear. Thank you!

As someone who has never fitted any type of discontinuity analysis, do you have a reference/method for performing your suggested regularized model: eg. y(x) = f(x) + g(x : x > boundary).

]]>Yes, the data show what the data show, but the real question is about generalizations to other settings, and to do that we need to make lots of assumptions in any case. The simpler analysis does allow the data to speak, but without assumptions the data do not speak to the larger questions of interest.

]]>I found the thread. The best part was where Smith wrote, “His criticisms seem taken directly off of Twitter (or a summary someone sent him, or a very cursory perusal).” I don’t think that Smith realizes that when I write, I follow academic rules, which means that if I get an idea from something I read, I cite the source! If my criticisms had been “taken directly off of Twitter,” I would’ve linked to the relevant tweets!

I continue to think that a big problem here is that people take a published (or, in this case, unpublished) claim as a starting point. In this case, the starting point was “Putting air filters in classrooms increased test scores by 0.20 standard deviations,” and the idea is that the claim stands until someone shoots it down. I think the default should go the other way: we start with the observation that there was no apparent difference between the outcomes in the two groups of schools, and then if people want to make a positive claim from there, the burden should be on them. Relatedly, I think more graphs should be absolutely required in this sort of work. Not just that discontinuity graph (although it’s a start) but the pre/post graph I suggested in my above post. The final analysis might end up looking like a discontinuity regression, that could be fine, but we have to work our way there.

]]>There’s one robustness check they didn’t do, which is the direct comparison between the two groups of schools, which would show essentially zero difference!

]]>No one is saying to believe the study just because it has the “regression discontinuity” label. But I think they pretty much are acting that way. But I guess we could flip this around and ask why the Vox writer reported on the study so credulously. We can’t blame PNAS on this one, as the paper hasn’t even been published! I’m not quite sure what cues in the technical report made it seem reasonable for a reporter to accept the claims without question. Perhaps there was a certain professionalism in the presentation of the results. I do think that part of the appeal was that it was a natural experiment. Maybe not the discontinuity analysis per se, but something about the setup that made it look like trustworthy research, I dunno.

]]>I agree. When I say these data “do not show any clear effect” and that the study “found a null result,” I don’t mean that the true effect is zero. What I mean is that the data are consistent with a zero true effect.

The point of my proposed simpler study is not that it gives the right answer but rather that I think it is a reasonable starting point, in the same way that, if you have survey data, a reasonable starting point is to do some demographic adjustments to line up the sample to the population. That’s not the end point, just a start.

]]>I disagree with the notion that the analysis Andrew favors deserves a privileged status. Andrew goes so far as to summarize the hypothetical results from his analysis as “*the data showed* no effect”. No, the data alone do not show anything about whether there is a causal effect. Only a combination of assumptions, models, and data can show that. Andrew thinks the assumptions and models of the regression discontinuity are too cumbersome, but I also find the strong assumptions underlying his simpler approach unlikely.

However, I do share Andrew’s implicit intuition that if there’s not a big difference in a simple comparison between treated and untreated, there’s probably not a sizable causal effect. Basically, what are the odds that confounding would almost exactly cancel out the effect? When an involved analysis turns up an effect that wasn’t apparent from a simple comparison, it does seem more likely that there’s really no effect and the analysis screwed something up than that there was an effect that was perfectly cancelled out by confounding that the analysis removed.

]]>You basically have to first show that there’s some statistically significant trend with *something* (but how many somethings did you look at???), and then show that the chosen sub-groups clearly violate that, and at a statistical level that more than makes up for how many ways of parsing the data that you looked at.

There’s nothing in the quick look at the data that would make me think that these two groups are in any way different.

]]>Umm, the table in the paper does a (not completely exhaustive) set of robustness checks for the different control variables. You’re being lazy here.

And how do you think this addresses the problem? Eg, if there is another important variable that should be included that they lack data on, how is a robustness check going to discover it?

]]>see below, where you’ll notice that when you fit noise like this you get a potato chip about 90% of the time: https://statmodeling.stat.columbia.edu/2020/01/09/no-i-dont-think-that-this-study-offers-good-evidence-that-installing-air-filters-in-classrooms-has-surprisingly-large-educational-benefits/#comment-1221843

]]>The fact is that the discontinuity is known to create this phenomenon. With a discontinuity you have information from only *one side* of the interval informing the fit. When you fit through the discontinuity you get a dramatically different result, because under the null hypothesis the information content is continuous through the discontinuity (slope, overall value, even second or third derivatives), whereas under the alternative hypothesis there is no connection at all between the two halves. Unfortunately for the people who keep doing this, the world isn’t binary. It’s not “everything is the same across this boundary” or “everything is totally different across this boundary”; the opposite of “everything is the same across this boundary” is “there exists something that is different across this boundary.”

Put another way, regularization is needed here… you want to fit y=f(x) + g(x : x > boundary)

where f incorporates all the data, and g incorporates just the data on one side of the boundary, to show how the results there differ from the overall fit…

instead they fit:

y = f(x : x < boundary) + g(x : x > boundary)

so there is no information shared across the boundary at all… as if on one side of the boundary you have venusians being taught on teleprompters by 7 eyed tentacle beasts, and the other side is humans or something.

]]>I realize now this is what Andrew meant when he wrote that it’s “a linear trend which makes no theoretical sense,” but I originally thought he was referring to the size/direction of the trend being nonsense, not the fact that there’s no sense in even looking at the trend.

]]>1. If you can look at the graph and draw a virtually horizontal line from a point on the left margin to a point on the right margin, staying between confidence limits around both of the regression lines, doesn’t that mean there’s a plausible regression line with slope ~= 0? It’s not a sophisticated evaluation of the data, but it raises a red flag for me.

2. If the air filters suddenly stopped being installed at 5 miles, why does distance greater than 5 miles still predict test score growth? Shouldn’t growth outside of that radius flatline as a function of distance? I didn’t read the paper, so maybe they establish somewhere that there’s a non-filter-related variable associated with scores and that also decreases with distance–like SES or general pollution levels. But then that would mean the authors chose not to adjust the graphed data points for that variable, requiring the reader to mentally rotate the two lines until the one on the right is flat in order to see the net difference due to filters. Right?

]]>Here is a very simple demonstration of why, when you run a regression discontinuity on noise, you almost always get two very different results on either side of the discontinuity, even when (or maybe especially when) the data are pure noise.

Includes a simulation and 20 pdf graphs of the results… there’s always something going on in the graph.
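For anyone who wants to try the idea without the graphs, here is a small self-contained sketch along the same lines. It is not the simulation linked above: the evenly spaced design, the sample size, and the cutoff at 5 are all my own assumptions. It fits a separate straight line to pure noise on each side of the cutoff and records the apparent jump at the boundary, alongside the plain difference in means.

```python
import numpy as np

def jump_estimates(n=40, d=5.0, half_width=2.0, n_sims=2000, seed=0):
    """Simulate pure-noise outcomes and return, for each simulation,
    (a) the gap at d between separate linear fits on each side, and
    (b) the simple difference in means between the two sides."""
    rng = np.random.default_rng(seed)
    # Evenly spaced x with an even n, so no point sits exactly on the cutoff
    x = np.linspace(d - half_width, d + half_width, n)
    left, right = x < d, x > d
    rd_jump = np.empty(n_sims)
    mean_diff = np.empty(n_sims)
    for i in range(n_sims):
        y = rng.normal(0.0, 1.0, size=n)             # no effect anywhere
        fit_left = np.polyfit(x[left], y[left], 1)    # [slope, intercept]
        fit_right = np.polyfit(x[right], y[right], 1)
        rd_jump[i] = np.polyval(fit_right, d) - np.polyval(fit_left, d)
        mean_diff[i] = y[right].mean() - y[left].mean()
    return rd_jump, mean_diff

rd_jump, mean_diff = jump_estimates()
print("sd of jump from separate linear fits:", rd_jump.std())
print("sd of simple difference in means:    ", mean_diff.std())
print("ratio:", rd_jump.std() / mean_diff.std())
```

Each line is extrapolated to the edge of its own data, where it is least constrained, so the gap between the two lines at the boundary is much noisier than the difference in means. For this evenly spaced design the ratio should come out near 2.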

]]>IOW, if the filters had an effect it was not on leaked natural gas, but on “common air pollutants,” which I think would probably not include noticeable levels of mercaptans.

]]>https://en.wikipedia.org/wiki/Runge%27s_phenomenon

In the middle of the interval, the function gets close to the right value, because it is a continuous function and it has data to constrain not only its value but also its slope, and maybe 2nd derivative, etc…

But at the end of the interval, particularly with noise involved, it has information from only one side of the interval. For example, here there is an upward outlier at about 4.95 and another downward outlier at 5.75 (math score graph). If the regression went through that region, those two would cancel out… the curve would stay somewhere in the middle… But because we drew an arbitrary boundary there, and then allowed the functions on each side of that boundary to do *completely different* things, the functions do exactly that, completely different things. This then becomes a “discovery” because we have an “identification strategy”. It’s like saying you discovered your feet are different sizes because when you allowed yourself to put on two different shoes you didn’t trip. People are pretty good at not tripping. Clowns can wear enormously oversized shoes… it’s normal for you to not trip (it’s normal for the curves to go off in oscillations at the edges of a boundary) nothing about finding that the usual thing happened should make you think anything unusual is going on.

]]>In particular, yes, some different kernels were tried, but what about a constant fit? If neither the data nor prior reasoning strongly support a particular functional form, you should have to report fits with a wide range of functional forms, not just some cherry-picked ones that happen to give the desired significant results.

]]>This seems like a huge straw man, Andrew. Literally no one is saying that.

]]>Our coefficient of interest is Beta, which represents the effect of being just within 5 miles of the gas leak (and thus receiving air filters) compared to being just outside (and not receiving air filters).

Nope, the meaning of this coefficient depends on what you included in the model. In this case that is “geographic location”, “lagged test scores”, “student demographics and fixed school characteristics”, etc.

Change the model, change the meaning of the coefficient. Why is this model specification any better than the millions of other ones they could have used? It seems to be an arbitrary one using coefficients of convenience.

See here for an example of someone actually checking the value of the coefficient for some other data: https://statmodeling.stat.columbia.edu/2019/08/01/the-garden-of-forking-paths/

]]>