Ah good, at least I’m not crazy.

I still don’t quite understand what they’re plotting though, I don’t think it’s by school, it says “treatment and comparison girls” by pre-score, so how many treatment girls are there with small pre-scores?

or is “treatment” here “goes to a school that received some grants” even if the individual girl didn’t get the treatment?

But then how similar are schools that received no grants vs did receive some grants?

I still don’t think this plot is the way to answer the question, though given your explanation it may not be as silly as I first thought. What you want is, as I said above, something like Post minus Pre as a function of things like “total dollars received by school” and “change in teacher attendance rate” and other mechanisms for improved teaching that plausibly could have been caused by receiving money

]]>Daniel,

They are plotting the difference between Y in two different groups (treatment schools that got scholarships and control schools that did not); so they are plotting Yt-Yc across X, not Y-X across X.

I think you would have to have differential regression to the mean across treatment/control to see what they are showing, but maybe I missed a salient point of the conversation above.

]]>Thanks.

]]>Anon:

This is a well known thing and it is indeed a special case of regression to the mean.

]]>if you generate two normally distributed independent random variables and plot y-x vs x you will get a negative sloped curve

Yes, this x vs y-x thing is an interesting “trick”. If you want a positive correlation, you can do x vs x-y.

I asked above but didn’t get an answer. This is totally different from regression to the mean, right? Is there a name for this?

]]>again the problem is this graph does NOTHING to help us determine if there is any externality effect. if you generate two normally distributed independent random variables and plot y-x vs x you will get a negative sloped curve…

first an analysis must be thought of that could answer the question before we can even discuss further

]]>First, the wild divergence of the confidence limits at the ends of the x axis is probably the result of too little data in those two regions. Whatever the non-parametric kernel was, it has some effective window width, and we can get a sense of that width by looking at the width of the structures – the humps – in the curve. Probably a simple LOWESS fit would have been better (because simpler).

Second, the high-frequency wiggles in the CL curves suggest to me some numerical problem, reminiscent of using too high a power in the smoothing kernel.

With these points in mind, I would not consider parts of the curve below about -1 or above about +1.25 on the x axis.

What then are we left with? The remaining curve suggests that there is a negative slope. But with these confidence limits, you could also plausibly fit a horizontal line. So there is some evidence for some effect having a negative slope, but this idea has not been severely tested, as Mayo would say. And there is no support for any particular explanation, if indeed the effect is real.

Earlier commenters have made good suggestions about what kind of data and analyses might let one learn more. For myself, I would like to see some kind of confirmation that there is a “real” effect before spending too much more effort on explanations.

]]>Let Y = b + rX + sqrt(1-r^2) e, where e is normal with the same SD as X.

Now plot Y-X vs X; you will see a similar slight negative slope. This is purely an artifact of the regression.

What is this sqrt(1 – r^2) doing? I see a negative slope just with y = b + r*x that becomes more prominent as r approaches 0.

I did:

n = 1e3
r = 0.9
b = 0
x = rnorm(n, 0, 1)
y = b + r*x + sqrt(1 - r^2)*rnorm(n, 0, 1)
# y = b + r*x

This is something different than regression to the mean right? At first that is what I thought you meant.

]]>It's actually quite easy to simulate.

Let X be the pretest, give it a normal distribution.

Let Y = b + rX + sqrt(1-r^2) e, where e is normal with the same SD as X.

This is just a linear shift with correlation r (the reliability of the test, or maybe I’m missing a square or a root).

Now plot Y-X vs X; you will see a similar slight negative slope. This is purely an artifact of the regression.

I think getting at the interaction effect gets around this, but the figure above is a nice example for teaching the regression effect, but not much else.

]]>It is even worse than that. If your explanation predicts a flat curve, you need to explain why there is some other factor(s) that is *exactly* cancelling out the regression to the mean effect…

Actually let’s say you used a 99% CI instead, so that everywhere the effect is “not statistically significantly different from zero”. It would be surprising if the true curve were flat, because regression to the mean is a well known phenomenon (Wikipedia even mentions students taking tests as a prototypical example[1]). Ie, you would need to explain why no regression to the mean is observed here.

[1] https://en.m.wikipedia.org/wiki/Regression_toward_the_mean

]]>there also appears to be evidence for the treatment effect being meaningfully greater than 0 for pre-test scores

How are you determining what is meaningful here?

And let’s say the data actually looks like that curve (which I highly doubt). Ie, the CI is roughly the local sd and most of the points are clustered around the mean line.

Then the job becomes to come up with a model to explain it and/or compare it to the predictions of some preexisting theory the study was designed to test.

So, can regression to the mean explain this? We would analytically work out (or more likely simulate) what we expect the data would look like if regression to the mean was going on (which would be always). Then, if the pattern looks similar we could say “regression to the mean can explain this pattern”.
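A hedged sketch of what that simulation might look like (constant treatment effect tau, identical regression to the mean in both groups; every number below is made up for illustration). The binned treatment-minus-control difference comes out roughly flat at tau, so a sloped Yt-Yc curve would need something beyond garden-variety regression to the mean, e.g. differential reliability across the two groups:

```python
import random
import statistics

# Both groups regress to the mean the same way (post = r*pre + noise);
# treatment adds a constant tau. Binning the treatment-control difference
# in post-test score by pre-test score then gives a roughly flat curve
# at tau; a slope appears only if r (or the noise) differs by group.
random.seed(3)
n, r, tau = 50_000, 0.7, 0.15

def posttest(x, effect):
    return effect + r * x + (1 - r**2) ** 0.5 * random.gauss(0, 1)

diff_by_bin = {}
for _ in range(n):
    x = random.gauss(0, 1)
    yt = posttest(x, tau)   # a treatment girl at this pretest score
    yc = posttest(x, 0.0)   # a comparison girl at the same pretest score
    diff_by_bin.setdefault(round(x), []).append(yt - yc)

for bin_center in sorted(b for b in diff_by_bin if -1 <= b <= 1):
    print(bin_center, round(statistics.mean(diff_by_bin[bin_center]), 2))
```

If the pattern in the real data looks like this flat line plus noise, “regression to the mean plus a constant effect” explains it; a genuine slope across pre-score is what the externality story has to produce over and above that.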

However, Ken Carlson above also proposed another explanation involving more effort from students with middling pre-test scores. So we should also figure out how to write down this theory to derive a quantitative prediction from it, that one will probably have some free parameters that need to be estimated from some other type of data. But the point is to get a curve on the chart that reflects that explanation. You can even fit the curve and use the parameter values you get as predictions for what other observations should be in future data.

]]>+1. The figure looks quite a bit regression-to-the-mean-ish

]]>My interpretation of the figure is pretty much the same as yours. I’d add that the overall effect size is ? with a 95% CI of ? — ?, and there also appears to be evidence for the treatment effect being meaningfully greater than 0 for pre-test scores in the range of about -1 to 0.5. But I see these additions as a relatively small difference between our interpretations, given the overall level of uncertainty in the results.

The weird thing is that I read Reyes’s paragraph as having approximately the same meaning, too.* It could be that I am so used to seeing p-value-based overconfidence that I instinctively compensate in my brain, and that Reyes was actually trying to express more certain conclusions than your or my summaries. Alternatively, I could be reading the text with the same level of caution that was originally intended by Reyes. It can be difficult to tell the difference.

*Apart from the last sentence, which appears to be wrong because the results are consistent with both large positive and large negative treatment effects at pre-test scores less than -1.

]]>Thanks a lot for your response Andrew. That does help considerably, though I was more directly interested in the kind of text that summarizes the information presented in the figure, instead of explaining the methods behind it. Kind of your version of Reyes’s original paragraph, which maybe is what he was asking for too.

I guess you could say “the application of the methods to the data is the information” or something along those lines, but my experience is that that kind of approach only works with audiences who are fluent in statistics.

]]>Also, “with an apparent lack of randomly assigned control schools, we do not know whether the variation in post test score is caused by the presence of the grants at a given school or not”

]]>My interpretation of the fact about the world would be something like:

“Based on the graph, it seems reasonable to expect people who do poorly on pre-test to do somewhat better on post test, while people who do well on pre-test do somewhat less well on post test.”

]]>I don’t think the AnonymousCommentator is looking for an explanation of the mathematical object that’s being shown, he/she wants an explanation of the inference about the world that can be drawn from the graph, because people used to NHST are used to the idea that either a graph “shows a real thing about the world” or “it shows nothing is going on”

but that’s just not what graphs / models do. And I think you and I and Anoneuoid are indirectly pointing this out by failing to say something like “there’s an 80% posterior probability that a positive externality exists for the low scoring students” or some such similar thing…

it’s not just about translating what a frequentist NHST person would say into a proper bayesian language… it’s about interpreting the world in an entirely different way than what the NHST person would do.

]]>AnonymousCommentator:

My interpretation of the figure is pretty simple: they fit some sort of nonparametric model and the curve shows the best estimate of the treatment effect, conditional on the assumptions of the model. The confidence bands are the result of some uncertainty calculation. Just to fix an idea in my mind, I’ll imagine these bounds were made by repeatedly re-fitting the model after bootstrapping the data. The point is that these curves represent variation in the estimate of the line, conditional on the model.
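This is not the authors' actual procedure, just the bootstrap idea described above sketched in a few lines (a crude binned-mean "smoother" on fake data; the slope, noise level, bin width, and number of resamples are all illustrative):

```python
import random
import statistics

# Percentile bootstrap bands for a binned-mean curve: resample rows with
# replacement, refit the curve, repeat, then take the 2.5% and 97.5%
# quantiles of the refitted curves within each x bin.
random.seed(4)
data = [(x, -0.1 * x + random.gauss(0, 0.5))
        for x in (random.gauss(0, 1) for _ in range(2000))]

def binned_curve(points):
    bins = {}
    for x, y in points:
        bins.setdefault(round(x), []).append(y)
    # drop sparse bins at the edges (cf. the wild CIs at the ends of the x axis)
    return {b: statistics.mean(ys) for b, ys in bins.items() if len(ys) > 20}

curves = [binned_curve(random.choices(data, k=len(data))) for _ in range(200)]

bands = {}
for b in sorted(binned_curve(data)):
    fits = sorted(c[b] for c in curves if b in c)
    bands[b] = (fits[int(0.025 * len(fits))], fits[int(0.975 * len(fits)) - 1])
    print(b, bands[b])
```

The point carries over: the bands describe variation in the estimated curve conditional on the model and the resampling scheme, not the probability that any particular effect is real.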

]]>The original paper itself looks to be about 15,000 words long, so I doubt that any of us have actually read it, which would be a necessary precursor to detailed discussion of whether the analyses used by the authors are or are not suitable.

I disagree that is necessary, once you have experience looking at this stuff you learn to use heuristics so you don’t waste time. Lack of a scatter plot and concern about whether the interval contains zero are big red flags.

]]>I think all we can say is there appears to be some evidence for variation as a function of pre test score, which is consistent with a wide variety of explanations.

]]>Anoneuoid and Daniel — Andrew wrote, “Nonparametric is good… and with care can be interpreted just fine.” He also wrote, “If the researchers want to put in the extra effort to fit the nonparametric model and interpret it appropriately, I’m fine with that.”

Based on this, I would like to know what an appropriate interpretation of the results in the figure is, from Andrew’s perspective (or, if you are interested, from yours). I would like to know this so I can compare it side by side with Reyes’s paragraph, to help me understand how Andrew’s perspective on results akin to those shown in the figure compares with the kind of text that is produced by p-value-focused, old-school statistical thinking.

The original paper itself looks to be about 15,000 words long, so I doubt that any of us have actually read it, which would be a necessary precursor to detailed discussion of whether the analyses used by the authors are or are not suitable. (Great if you’ve read it though!) I get the urge to speculate about how you would have done the analyses differently, and how issues like regression to the mean might affect the results, but I don’t want to get into depth about any of that, since I would feel obligated to have read the paper before doing so. All I’d like is a basic summary of the figure, in your language.

]]>So basically regression to the mean? I don’t know if that curve actually means anything anyway though.

]]>Let’s just unpack how poorly the plotted curve answers the question by considering an alternative mechanism:

People whose pre-tests were low were much more likely to be “having a bad day” and upon taking the post-test will be much more likely to get higher scores because they’ll be feeling better, have slept better, etc… whereas people who had a very high pre-test were “having a good day” and when they take the post test they are much more likely to be having a more normal day and so they will test lower… People in the middle were probably having a typical day, and on post-test will still be having a typical day… and will test neutrally.

what kind of curve will you expect to see under this model? How does it compare to the curve we do see?

]]>Right I was going to try to say something too, but then I couldn’t, because I think this graph is the wrong way to answer the question of whether people who go to schools where some people receive grants are benefitting from the grants even if they personally don’t receive them.

The way to answer that, is to unpack the question and try to figure out what a direct answer to this question looks like:

look at people who didn’t receive grants, calculate Post-Pre score differences, and then plot them vs a measure of something you think was affected by the grants, such as teacher attendance or total influx of money into the school or percentage of children in the school who did receive grants, or all of the above in multiple panels…

Compare those plots for people who did receive the grants…

The problem here is the method of answering the question that was actually chosen doesn’t answer the question very well, if at all, but it’s exactly the kind of thing one would do if you have been taught “statistically significant things are real, and statistically insignificant things are zero” so all I have to do is go and look to see if statistical significance exists for low scoring students, and if it does, then they must be getting an externality!

ugh it makes my facial muscles hurt from frowning.

]]>I guess confusion is really just a lay term for “operating under an incorrect assumption”.

Specifically, I take it as a principle that correlations/effects are ubiquitous (although the vast majority are negligible or otherwise uninteresting). People looking for differences are assuming that they are rare and special.

]]>Not that you asked for my opinion, but what would make that hard is that I would never make a figure like this to begin with… so you would be asking me to explain something I would never do.

Often it goes all the way back to the beginning, I would have never even collected that type of data to begin with. I am not interested in the answer to the question that motivated it (“is there a difference?”), and believe the only reason people think they are interested in it is confusion.

]]>That would be extremely helpful for those of us who are trying to learn from your approach to statistics. Thanks!

]]>Thanks for the laugh.

]]>It isn’t clear that quantile effects are what you would want here, at least not if you have concerns about rank change in test scores due to the heterogeneity in treatment effects – the women you can pinpoint in the pre-intervention distribution as the ones likely to benefit the most may not be the same women who, ex-post, are the ones at that same point in the outcome distribution. And in fact we know there was some rank-churning in the data, and it was relatively severe:

“the odds of winning were only 3% for the bottom quartile of girls in the baseline test distribution and 5% for the second quartile, compared to 13% and 55% in the top two baseline quartiles”

So if some of this rank-churning was treatment-induced, then the quantile effects don’t get you what you want (the women at the 30th percentile after treatment are not the women who were at 30th percentile before treatment – so you aren’t getting the right people who should have gotten the extra “externality” boost). Only the heterogeneity across pre-score would get you the thing you want. Of course if the tests measure different things or are just really noisy, then it is something of a toss-up which is “better” or “better theoretically motivated”, but if we assume the tests pre/post are equally good measures of learning, then I think the authors did the conceptually appropriate thing here and the quantile regression results would not answer the question they are trying to answer.

]]>Brian

]]>that’s why I type with my eyes closed

]]>Also I believe a big part of the reason this kind of analysis isn’t done is that when you do this sort of thing it’s hard to get some “statistically significant result” that you can then publish, because you’re essentially estimating a bunch of interactions between different aspects of how the outcome occurs, and so precision of these estimates is not particularly fabulous. If you think in terms of “everything statistically significant is real, and everything non-significant is equal to 0” you will find that “everything is equal to 0 [sic]” in every study you do.

Thinking instead about finding that the most likely values of certain parameters suggests that teacher attendance is really important and helps both grant receiving students and non-grant-receiving students, and so it would be a good bet to start spending money on providing grants to kids so that schools get better overall… well you can only do that kind of thing if you can *assign different credence to different values of unobserved parameters* and that can *only* be done in a Bayesian analysis by definition.

]]>The part about positive externalities suggests some mechanism like “because kids who received grants went to the same schools as kids who didn’t, and the grants improved the availability of high quality educational materials and teacher attendance, even the kids who didn’t get grants benefited from improvement of the overall educational quality of the school experience”

It’s misguided to try to measure improvements as a function of pre-test score and attribute the difference to “positive externalities”. Instead, we should hypothesize mechanisms by which the positive externalities occur (for example improved teacher attendance or improved educational materials) and then look at our measurements of teacher attendance and educational materials, and see how children’s test scores vary with the variation in the hypothesized determinant of the outcome.

Graphs of relevance would be for example test score improvement vs change in teacher attendance colored by groups: low pre-test, medium pre-test, high pre-test or similar things.

]]>Yea, it would be interesting to compare. I saw they only had a couple thousand data points so they should be able to get them all on a scatter plot. If not (too much overlapping), they could divide it into a grid and make a heatmap. I couldn't care less about this arbitrary polynomial they fit. Do future papers check its predictive skill?

]]>I can’t resist pointing out what I assume was an uncaught autocorrect: “I would also think for a monument…” :~)

]]>I had the same question. There is some insight based on the fine scale pattern of the CIs. … and, the variance:signal looks like it is going to be so big we might not want to spend time looking further.

]]>