## “Incentives to Learn”: How to interpret this estimate of a varying treatment effect?

Germán Jeremias Reyes writes:

I am currently taking a course on Applied Econometrics and would like to ask you about how you would interpret a particular piece of evidence.

Some background: In 2009, Michael Kremer et al. published an article called “Incentives to learn.” This is from the abstract (emphasis is mine):

We study a randomized evaluation of a merit scholarship program in which Kenyan girls who scored well on academic exams had school fees paid and received a grant. Girls showed substantial exam score gains, and teacher attendance improved in program schools. There were positive externalities for girls with low pretest scores, who were unlikely to win a scholarship.

With that in mind, for my applied econometrics class, I had to replicate one of the main figures of the paper, which shows how the treatment effect varied according to the baseline test scores. The idea being that, if we observe a positive treatment effect in the lower end of the baseline test score distribution, that would be evidence of positive externalities. This is what I found:

My question is about how to interpret this figure. Before reading your blog, my interpretation would have been something like this:

“Results show that the estimated treatment effect is not statistically different from zero for most of the values of the 2000 test score distribution. There are two exceptions: girls with a baseline test score slightly above -1 and girls with a baseline test score around 0.5. In both of those cases, we detect treatment effects statistically different from zero at the 95% level, calculated with 300 bootstrap replications. These results suggest that the test score gains are concentrated in the lower middle of the test score distribution, i.e., among students with baseline test scores between -1 and -1.5 (and, perhaps, for students slightly above the mean, i.e., with a baseline score close to 0.5). This stems from the fact that those are the only estimated treatment effects statistically different from zero at the usual levels. Furthermore, the fact that the treatment effect is not statistically different from zero for students with a baseline score lower than -1 (i.e., on the lower side of the test score distribution) shows that the program did not have negative externalities on low-achieving students.”

However, your blog made me wary about reaching conclusions only based on p-values/standard errors/statistical significance. So, my question to you is, what do you think we should learn about the treatment effect based on the figure above?

Based on the above picture, I’d be inclined to just fit a model with a linear interaction of test score with treatment. There will be uncertainty in the slope—I’m guessing the data are consistent with a slope of zero, that is a constant treatment effect—but the interaction could be of policy interest so it could be worth estimating even if the resulting estimate is highly uncertain.

P.S. I don’t think there’s anything wrong with a nonparametric model being used in the above graph: Nonparametric is good, it’s more general than parametric and with care can be interpreted just fine. I’m recommending a linear model just because I think it would do the job here. If the researchers want to put in the extra effort to fit the nonparametric model and interpret it appropriately, I’m fine with that.
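The linear interaction suggested above can be sketched in a few lines of R. This is only an illustration on simulated data: the variable names (`pre`, `post`, `treat`) and every number in the data-generating step are assumptions, not values from the paper.

```r
# Simulated stand-in data; none of these numbers come from the actual study.
set.seed(1)
n <- 2000
pre <- rnorm(n)                 # standardized baseline score
treat <- rbinom(n, 1, 0.5)      # 1 = program school, 0 = comparison school
# Assumed truth: effect of 0.2 at pre = 0, declining by 0.05 per SD of pre
post <- 0.7 * pre + treat * (0.2 - 0.05 * pre) + rnorm(n, sd = 0.7)

# Main effects plus the pre:treat interaction
fit <- lm(post ~ pre * treat)
est <- coef(summary(fit))
print(round(est, 3))

# Implied treatment effect at a few baseline scores, with a rough +/- 2 SE band
grid <- c(-1.5, 0, 1.5)
V <- vcov(fit)
effect <- est["treat", "Estimate"] + est["pre:treat", "Estimate"] * grid
se <- sqrt(V["treat", "treat"] + grid^2 * V["pre:treat", "pre:treat"] +
           2 * grid * V["treat", "pre:treat"])
print(data.frame(pre = grid, effect = effect,
                 lower = effect - 2 * se, upper = effect + 2 * se))
```

The `pre:treat` coefficient is the slope of the treatment effect in the baseline score. Even with 2,000 simulated observations its interval is wide relative to the assumed slope, which is the kind of uncertainty the post anticipates.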

1. Anoneuoid says:

What does the actual data look like? I would guess it looks nothing like that wiggly curve. That is based purely on past experience with papers that only include model output instead of scatter plots though.

• kcummins says:

I had the same question. There is some insight based on the fine scale pattern of the CIs. … and, the variance:signal looks like it is going to be so big we might not want to spend time looking further.

• Anoneuoid says:

Yea, it would be interesting to compare. I saw they only had a couple thousand data points so they should be able to get them all on a scatter plot. If not (too much overlapping), they could divide it into a grid and make a heatmap. I could care less about this arbitrary polynomial they fit. Do future papers check its predictive skill?
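As a rough sketch of the two displays mentioned here (a raw scatter plot, and a gridded heatmap for when points overlap too much), the following uses simulated scores. The data and the 20 x 20 grid are illustrative assumptions only.

```r
# Simulated stand-in for a couple thousand pre/post test scores
set.seed(2)
n <- 2000
pre <- rnorm(n)
post <- 0.7 * pre + rnorm(n, sd = 0.7)

# Raw scatter plot; semi-transparent points reduce overplotting
plot(pre, post, pch = 16, col = rgb(0, 0, 0, 0.2),
     xlab = "baseline score", ylab = "follow-up score")

# Heatmap alternative: count observations in a 20 x 20 grid
breaks_x <- seq(min(pre), max(pre), length.out = 21)
breaks_y <- seq(min(post), max(post), length.out = 21)
counts <- table(cut(pre, breaks_x, include.lowest = TRUE),
                cut(post, breaks_y, include.lowest = TRUE))
z <- matrix(as.numeric(counts), nrow = 20)
image(breaks_x, breaks_y, z,
      xlab = "baseline score", ylab = "follow-up score")
```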

2. kcummins says:

AG’s route is probably what I would suggest. I would also think for a moment about curvilinear/threshold associations, which I probably would have written into my original analysis plan because my content areas often contain threshold associations. In that case, I would couch this analysis as exploratory and fit a corresponding linear model which can accommodate such functional features, but cast RA Fisher’s shadow on the discussion. An “isolated record” of a result does not warrant attribution of a genuine effect. So, the next step in the program would be a requisite confirmation.

3. Ken Carlson says:

I’m not sure an interaction coded as PxT (pretest times treatment) would capture how incentives are supposed to work. If I’m far above the threshold for receiving the award, I don’t gain much by improving my test score; if I’m far below it, the gain might be large, but the effort required and the risk of failure may exceed the value of the award. For someone right at the threshold, the effort is small and the reward is large. You’re looking for a treatment effect that’s zero at the ends and has a bend in the middle. (And, for what it’s worth, I started composing this comment before I switched to a larger screen and saw the picture.)

• The part about positive externalities suggests some mechanism like “because kids who received grants went to the same schools as kids who didn’t, and the grants improved the availability of high quality educational materials and teacher attendance, even the kids who didn’t get grants benefited from improvement of the overall educational quality of the school experience”

It’s misguided to try to measure improvements as a function of pre-test score and attribute the difference to “positive externalities”. Instead, we should hypothesize mechanisms by which the positive externalities occur (for example improved teacher attendance or improved educational materials) and then look at our measurements of teacher attendance and educational materials, and see how children’s test scores vary with the variation in the hypothesized determinant of the outcome.

Graphs of relevance would be for example test score improvement vs change in teacher attendance colored by groups: low pre-test, medium pre-test, high pre-test or similar things.
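A minimal sketch of such a graph, using made-up data: `attend_change` (a hypothetical change in teacher attendance) and the way `gain` is generated from it are assumptions for illustration, not measurements from the study.

```r
# Simulated data; the attendance variable and all effect sizes are hypothetical
set.seed(3)
n <- 1500
pre <- rnorm(n)
attend_change <- runif(n, -0.1, 0.3)    # hypothetical change in teacher attendance
gain <- 0.5 * attend_change - 0.3 * pre + rnorm(n, sd = 0.7)   # post minus pre

# Split girls into low / medium / high pre-test thirds
group <- cut(pre, quantile(pre, c(0, 1/3, 2/3, 1)), include.lowest = TRUE,
             labels = c("low", "medium", "high"))
cols <- c(low = "firebrick", medium = "grey40", high = "steelblue")

# Score gain vs attendance change, colored by pre-test group, one line per group
plot(attend_change, gain, col = cols[as.character(group)], pch = 16, cex = 0.5,
     xlab = "change in teacher attendance", ylab = "test score gain")
for (g in levels(group)) {
  abline(lm(gain ~ attend_change, subset = group == g), col = cols[g], lwd = 2)
}
legend("topleft", legend = levels(group), col = cols, pch = 16)
```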

• Also I believe a big part of the reason this kind of analysis isn’t done is that when you do this sort of thing it’s hard to get some “statistically significant result” that you can then publish, because you’re essentially estimating a bunch of interactions between different aspects of how the outcome occurs, and so precision of these estimates is not particularly fabulous. If you think in terms of “everything statistically significant is real, and everything non-significant is equal to 0” you will find that “everything is equal to 0 [sic]” in every study you do.

Thinking instead about finding that the most likely values of certain parameters suggests that teacher attendance is really important and helps both grant receiving students and non-grant-receiving students, and so it would be a good bet to start spending money on providing grants to kids so that schools get better overall… well you can only do that kind of thing if you can *assign different credence to different values of unobserved parameters* and that can *only* be done in a Bayesian analysis by definition.

Seems like an analysis that is crying out for quantile regression estimates. We could easily specify a linear quantile regression model like rq(testscore2001 ~ testscore2000 + treatment + treatment:testscore2000, tau=c(1:99/100), data=thedata, contrasts=list(treatment="contr.treatment")), assuming treatment is a factor variable denoting the treated and comparison girls. Estimates by quantiles and confidence intervals can be graphed for various effects or predictions by quantile level. The linear testscore2000 (pre) effect can be made more flexibly nonlinear by say using splines (e.g., bs(testscore2000,degree=3,df=6)) in the linear quantile regression model, but then the interaction with treatment gets a little more problematic to interpret but not impossible.

Brian
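For readers who want to try the model described above, here is a hedged sketch on simulated data. It assumes the `quantreg` package is installed (the fit is skipped otherwise), uses only five quantile levels to keep the output short, and relies on R's default treatment contrasts for the factor. The data frame `thedata` is simulated and none of its numbers come from the paper.

```r
# Simulated stand-in for the data frame named in the comment above
set.seed(4)
n <- 1000
thedata <- data.frame(
  testscore2000 = rnorm(n),
  treatment = factor(rbinom(n, 1, 0.5), labels = c("comparison", "treated"))
)
# Assumed constant effect of 0.2 for treated girls
thedata$testscore2001 <- 0.7 * thedata$testscore2000 +
  0.2 * (thedata$treatment == "treated") + rnorm(n, sd = 0.7)

# Fit only if the optional quantreg package is available
if (requireNamespace("quantreg", quietly = TRUE)) {
  taus <- c(0.1, 0.25, 0.5, 0.75, 0.9)
  fit <- quantreg::rq(testscore2001 ~ testscore2000 * treatment,
                      tau = taus, data = thedata)
  print(round(coef(fit), 3))   # one column of coefficients per quantile level
}
```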

• jrc says:

It isn’t clear that quantile effects are what you would want here, at least not if you have concerns about rank change in test scores due to the heterogeneity in treatment effects – the women you can pinpoint in the pre-intervention distribution as the ones likely to benefit the most may not be the same women who, ex-post, are the ones at that same point in the outcome distribution. And in fact we know there was some rank-churning in the data, and it was relatively severe:

“the odds of winning were only 3% for the bottom quartile of girls in the baseline test distribution and 5% for the second quartile, compared to 13% and 55% in the top two baseline quartiles”

So if some of this rank-churning was treatment-induced, then the quantile effects don’t get you what you want (the women at the 30th percentile after treatment are not the women who were at 30th percentile before treatment – so you aren’t getting the right people who should have gotten the extra “externality” boost). Only the heterogeneity across pre-score would get you the thing you want. Of course if the tests measure different things or are just really noisy, then it is something of a toss-up which is “better” or “better theoretically motivated”, but if we assume the tests pre/post are equally good measures of learning, then I think the authors did the conceptually appropriate thing here and the quantile regression results would not answer the question they are trying to answer.

5. AnonymousCommentator says:

Andrew — In his email, Germán Jeremias Reyes provides a paragraph-length interpretation of the figure using old-school, p-value based language and ideas. Could you provide an alternative paragraph that interprets the results in the figure from your preferred, p-value-less point of view?

That would be extremely helpful for those of us who are trying to learn from your approach to statistics. Thanks!

• Anoneuoid says:

Not that you asked for my opinion, but what would make that hard is that I would never make a figure like this to begin with… so you would be asking me to explain something I would never do.

Often it goes all the way back to the beginning, I would have never even collected that type of data to begin with. I am not interested in the answer to the question that motivated it (“is there a difference?”), and believe the only reason people think they are interested in it is confusion.

• Anoneuoid says:

I guess confusion is really just a lay term for “operating under an incorrect assumption”.

Specifically, I take it as a principle that correlations/effects are ubiquitous (although the vast majority are negligible or otherwise uninteresting). People looking for differences are assuming that they are rare and special.

• Right I was going to try to say something too, but then I couldn’t, because I think this graph is the wrong way to answer the question of whether people who go to schools where some people receive grants are benefitting from the grants even if they personally don’t receive them.

The way to answer that, is to unpack the question and try to figure out what a direct answer to this question looks like:

look at people who didn’t receive grants, calculate Post-Pre score differences, and then plot them vs a measure of something you think was affected by the grants, such as teacher attendance or total influx of money into the school or percentage of children in the school who did receive grants, or all of the above in multiple panels…

Compare those plots for people who did receive the grants…

The problem here is the method of answering the question that was actually chosen doesn’t answer the question very well, if at all, but it’s exactly the kind of thing one would do if you have been taught “statistically significant things are real, and statistically insignificant things are zero” so all I have to do is go and look to see if statistical significance exists for low scoring students, and if it does, then they must be getting an externality!

ugh it makes my facial muscles hurt from frowning.

• Let’s just unpack how poorly the plotted curve answers the question by considering an alternative mechanism:

People whose pre-tests were low were much more likely to be “having a bad day” and upon taking the post-test will be much more likely to get higher scores because they’ll be feeling better, have slept better, etc… whereas people who had a very high pre-test were “having a good day” and when they take the post test they are much more likely to be having a more normal day and so they will test lower… People in the middle were probably having a typical day, and on post-test will still be having a typical day… and will test neutrally.

what kind of curve will you expect to see under this model? How does it compare to the curve we do see?
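One way to explore that question is to simulate the "good day / bad day" mechanism with no treatment at all. In this sketch each girl has a stable `ability` plus independent day-to-day noise on each test; the noise SD of 0.6 is an arbitrary assumption.

```r
# Pure test-day noise, no treatment anywhere in this simulation
set.seed(5)
n <- 2000
ability <- rnorm(n)                     # stable underlying skill
pre  <- ability + rnorm(n, sd = 0.6)    # pre-test with good-day / bad-day noise
post <- ability + rnorm(n, sd = 0.6)    # post-test with fresh, independent noise
change <- post - pre

# Slope of the change score on the pre-test score
slope <- unname(coef(lm(change ~ pre))["pre"])
print(slope)

plot(pre, change, pch = 16, col = rgb(0, 0, 0, 0.2),
     xlab = "pre-test score", ylab = "post minus pre")
lines(lowess(pre, change), col = "red", lwd = 2)
abline(h = 0, lty = 2)
```

Under these assumptions the expected slope is -0.36/1.36, about -0.26: low pre-test scorers gain on average and high scorers lose, with no intervention involved.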

• AnonymousCommentator says:

Anoneuoid and Daniel — Andrew wrote, “Nonparametric is good… and with care can be interpreted just fine.” He also wrote, “If the researchers want to put in the extra effort to fit the nonparametric model and interpret it appropriately, I’m fine with that.”

Based on this, I would like to know what an appropriate interpretation of the results in the figure is, from Andrew’s perspective (or, if you are interested, from yours). I would like to know this so I can compare it side by side with Reyes’s paragraph, to help me understand how Andrew’s perspective on results akin to those shown in the figure compares with the kind of text that is produced by p-value-focused, old-school statistical thinking.

The original paper itself looks to be about 15,000 words long, so I doubt that any of us have actually read it, which would be a necessary precursor to detailed discussion of whether the analyses used by the authors are or are not suitable. (Great if you’ve read it though!) I get the urge to speculate about how you would have done the analyses differently, and how issues like regression to the mean might affect the results, but I don’t want to get into depth about any of that, since I would feel obligated to have read the paper before doing so. All I’d like is a basic summary of the figure, in your language.

• I think all we can say is there appears to be some evidence for variation as a function of pre test score, which is consistent with a wide variety of explanations.

• Anoneuoid says:

The original paper itself looks to be about 15,000 words long, so I doubt that any of us have actually read it, which would be a necessary precursor to detailed discussion of whether the analyses used by the authors are or are not suitable.

I disagree that it is necessary. Once you have experience looking at this stuff, you learn to use heuristics so you don’t waste time. Lack of a scatter plot and concern about whether the interval contains zero are big red flags.

• Andrew says:

AnonymousCommentator:

My interpretation of the figure is pretty simple: they fit some sort of nonparametric model and the curve shows the best estimate of the treatment effect, conditional on the assumptions of the model. The confidence bands are the result of some uncertainty calculation. Just to fix an idea in my mind, I’ll imagine these bounds were made by repeatedly re-fitting the model after bootstrapping the data. The point is that these curves represent variation in the estimate of the line, conditional on the model.
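The bootstrap picture described above can be sketched as follows. This is not the paper's procedure: the smoother (`loess` fit separately in each arm), the 200 replications, the grid, and the simulated data with a constant effect of 0.2 are all assumptions chosen for illustration.

```r
# Simulated data with an assumed constant treatment effect of 0.2
set.seed(6)
n <- 1000
d <- data.frame(pre = rnorm(n), treat = rbinom(n, 1, 0.5))
d$post <- 0.7 * d$pre + 0.2 * d$treat + rnorm(n, sd = 0.7)

grid <- seq(-1.5, 1.5, by = 0.25)       # baseline scores at which to evaluate

# Estimated effect curve: smoothed treated mean minus smoothed comparison mean
effect_curve <- function(dat) {
  smooth_arm <- function(arm) {
    fit <- loess(post ~ pre, data = dat[dat$treat == arm, ], span = 0.75,
                 control = loess.control(surface = "direct"))
    predict(fit, newdata = data.frame(pre = grid))
  }
  smooth_arm(1) - smooth_arm(0)
}

estimate <- effect_curve(d)

# Re-fit on resampled rows; each column of boot is one bootstrap curve
boot <- replicate(200, effect_curve(d[sample(nrow(d), replace = TRUE), ]))
bands <- apply(boot, 1, quantile, probs = c(0.025, 0.975))

plot(grid, estimate, type = "l", ylim = range(bands),
     xlab = "baseline score", ylab = "estimated treatment effect")
lines(grid, bands[1, ], lty = 2)
lines(grid, bands[2, ], lty = 2)
abline(h = 0, col = "grey")
```

The dashed lines summarize how much the fitted curve moves across resamples, conditional on this particular smoother. They do not check whether the smoother itself is a good model.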

• I don’t think the AnonymousCommentator is looking for an explanation of the mathematical object that’s being shown, he/she wants an explanation of the inference about the world that can be drawn from the graph, because people used to NHST are used to the idea that either a graph “shows a real thing about the world” or “it shows nothing is going on”

but that’s just not what graphs / models do. And I think you and I and Anoneuoid are indirectly pointing this out by failing to say something like “there’s an 80% posterior probability that a positive externality exists for the low scoring students” or some such similar thing…

it’s not just about translating what a frequentist NHST person would say into a proper bayesian language… it’s about interpreting the world in an entirely different way than what the NHST person would do.

• My interpretation of the fact about the world would be something like:

“Based on the graph, it seems reasonable to expect people who do poorly on pre-test to do somewhat better on post test, while people who do well on pre-test do somewhat less well on post test.”

• Also, “with an apparent lack of randomly assigned control schools, we do not know whether the variation in post test score is caused by the presence of the grants at a given school or not”

• AnonymousCommentator says:

Thanks a lot for your response Andrew. That does help considerably, though I was more directly interested in the kind of text that summarizes the information presented in the figure, instead of explaining the methods behind it. Kind of your version of Reyes’s original paragraph, which maybe is what he was asking for too.

I guess you could say “the application of the methods to the data is the information” or something along those lines, but my experience is that that kind of approach only works with audiences who are fluent in statistics.

6. Anonymous-Commentator says:

Hi Daniel – This response is to your first interpretation of the figure above, which starts “I think all we can say is…” I can’t seem to respond to that post directly, so am trying here…

My interpretation of the figure is pretty much the same as yours. I’d add that the overall effect size is ? with a 95% CI of ? — ?, and there also appears to be evidence for the treatment effect being meaningfully greater than 0 for pre-test scores in the range of about -1 to 0.5. But I see these additions as a relatively small difference between our interpretations, given the overall level of uncertainty in the results.

The weird thing is that I read Reyes’s paragraph as having approximately the same meaning, too.* It could be that I am so used to seeing p-value-based overconfidence that I instinctively compensate in my brain, and that Reyes was actually trying to express more certain conclusions than your or my summaries. Alternatively, I could be reading the text with the same level of caution that was originally intended by Reyes. It can be difficult to tell the difference.

*Apart from the last sentence, which appears to be wrong because the results are consistent with both large positive and large negative treatment effects at pre-test scores less than -1.

• Anoneuoid says:

there also appears to be evidence for the treatment effect being meaningfully greater than 0 for pre-test scores

How are you determining what is meaningful here?

And let’s say the data actually looks like that curve (which I highly doubt). Ie, the CI is roughly the local sd and most of the points are clustered around the mean line.

Then the job becomes to come up with a model to explain it and/or compare it to the predictions of some preexisting theory the study was designed to test.

So, can regression to the mean explain this? We would analytically work out (or more likely simulate) what we expect the data would look like if regression to the mean was going on (which would be always). Then, if the pattern looks similar we could say “regression to the mean can explain this pattern”.

However, Ken Carlson above also proposed another explanation involving more effort from students with middling pre-test scores. So we should also figure out how to write down this theory to derive a quantitative prediction from it, that one will probably have some free parameters that need to be estimated from some other type of data. But the point is to get a curve on the chart that reflects that explanation. You can even fit the curve and use the parameter values you get as predictions for what other observations should be in future data.

• Anoneuoid says:

Actually let’s say you used a 99% CI instead, so everywhere the “effect is not statistically significant from zero”. It would be surprising if the true curve was flat because regression to the mean is a well known phenomenon (Wikipedia even mentions students taking tests as a prototypical example[1]). Ie, you would need to explain why there is no regression to the mean observed here.

• Anoneuoid says:

It is even worse than that. If your explanation predicts a flat curve, you need to explain why there is some other factor(s) that is exactly cancelling out the regression to the mean effect…

7. Actually, this looks like classic regression effect to me. He is plotting a difference score (post-pre) versus the pretest score and seeing a higher residual at the lower end of the scale.

It’s actually quite easy to simulate.

Let X be the pretest, give it a normal distribution.

Let Y = b + rX + sqrt(1-r^2) e, where e is normal with the same SD as X.

This is just a linear shift with correlation r (the reliability of the test, or maybe I’m missing a square or a root).

Now plot Y-X vs X; you will see a similar slight negative slope. This is purely an artifact of the regression.

I think getting at the interaction effect gets around this, but the figure above is a nice example for teaching the regression effect, but not much else.
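The simulation described in this comment can be run directly. In the sketch below the values of `r`, the sample size, and the seed are arbitrary choices. Because Cov(Y - X, X) = (r - 1) Var(X), the slope of Y - X on X should come out close to r - 1.

```r
# Simulate Y = b + r*X + sqrt(1 - r^2)*e for several values of r
set.seed(7)
n <- 1e4
b <- 0
rs <- c(0.9, 0.5, 0)

slopes <- sapply(rs, function(r) {
  x <- rnorm(n)
  y <- b + r * x + sqrt(1 - r^2) * rnorm(n)
  unname(coef(lm(I(y - x) ~ x))["x"])   # slope of the difference score on X
})

print(data.frame(r = rs, slope = round(slopes, 3), expected = rs - 1))
```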

• Anoneuoid says:

Let Y = b + rX + sqrt(1-r^2) e, where e is normal with the same SD as X.

Now plot Y-X vs X; you will see a similar slight negative slope. This is purely an artifact of the regression.

What is this sqrt(1 - r^2) doing? I see a negative slope just with y = b + r*x that becomes more prominent as r approaches 0.

I did:

    n = 1e3                  # sample size
    r = 0.9                  # correlation between x and y
    b = 0                    # intercept
    x = rnorm(n, 0, 1)
    y = b + r*x + sqrt(1 - r^2)*rnorm(n, 0, 1)
    # y = b + r*x          # variant without the noise term

This is something different than regression to the mean right? At first that is what I thought you meant.
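A quick check of what the sqrt(1 - r^2) term contributes, written as a sketch with an arbitrary seed and sample size: it keeps Var(Y) at 1, the same as X, so that r is the correlation between the two scores. The slope of y - x on x is r - 1 with or without it.

```r
set.seed(8)
n <- 1e4
r <- 0.9
b <- 0
x <- rnorm(n)
y_noise <- b + r * x + sqrt(1 - r^2) * rnorm(n)   # with the noise term
y_plain <- b + r * x                              # deterministic variant

# Variance: near 1 with the noise term, near r^2 without it
print(c(with_noise = var(y_noise), without = var(y_plain)))

# Correlation with x: near r with the noise term, exactly 1 without it
print(c(with_noise = cor(x, y_noise), without = cor(x, y_plain)))

# Slope of the difference score on x: r - 1 in both cases
slope_noise <- unname(coef(lm(I(y_noise - x) ~ x))["x"])
slope_plain <- unname(coef(lm(I(y_plain - x) ~ x))["x"])
print(c(with_noise = slope_noise, without = slope_plain, expected = r - 1))
```

With the noise term, X and Y are two equally variable scores correlated at r, and E[Y | X] = b + rX sits closer to the mean than X does. That is the standard regression-to-the-mean setup, and the r - 1 slope is the same effect expressed in difference scores.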

8. Tom Passin says:

Let’s remember that the original task of the original question was to “replicate” the graph using the data in the paper. It was not to deconstruct the paper or to suggest better experiments. So let’s see how far we can go with just the graph.

First, the wild divergence of the confidence limits at the ends of the x axis probably are the result of too little data in those two regions. Whatever the non-parametric kernel was, it has some effective window width, and we can get a sense of that width by looking at the width of the structures – the humps – in the curve. Probably a simple LOWESS fit with a would have been better (because simpler).

Second, the high frequency in the CL curves suggests to me some numerical problem, reminiscent of using too high a power in the smoothing kernel.

With these points in mind, I would not consider parts of the curve below about -1 or above about +1.25 on the x axis.

What then are we left with? The remaining curve suggests that there is a negative slope. But with these confidence limits, you could also plausibly fit a horizontal line. So there is some evidence for some effect having a negative slope, but this idea has not been severely tested, as Mayo would say. And there is no support for any particular explanation, if indeed the effect is real.

Earlier commenters have made good suggestions about what kind of data and analyses might let one learn more. For myself, I would like to see some kind of confirmation that there is a “real” effect before spending too much more effort on explanations.

• again the problem is this graph does NOTHING to help us determine if there is any externality effect. if you generate two normally distributed independent random variables and plot y-x vs x you will get a negative sloped curve…

first an analysis must be thought of that could answer the question before we can even discuss further

• Anoneuoid says:

if you generate two normally distributed independent random variables and plot y-x vs x you will get a negative sloped curve

Yes, this x vs y-x thing is an interesting “trick”. If you want a positive correlation, you can do x vs x-y.

I asked above but didn’t get an answer. This is totally different from regression to the mean, right? Is there a name for this?
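The independent-variables case is easy to check numerically. In this sketch (arbitrary seed and sample size), x and y are unrelated standard normals, so Cov(x, y - x) = -Var(x): the slope of y - x on x is about -1 and the correlation is about -1/sqrt(2). Flipping the difference to x - y flips the sign.

```r
set.seed(9)
n <- 1e4
x <- rnorm(n)
y <- rnorm(n)                      # generated independently of x

c_xy    <- cor(x, y)               # near 0
c_minus <- cor(x, y - x)           # near -1/sqrt(2), about -0.71
c_plus  <- cor(x, x - y)           # near +1/sqrt(2), about +0.71
slope   <- unname(coef(lm(I(y - x) ~ x))["x"])   # near -1

print(round(c(cor_xy = c_xy, cor_x_ymx = c_minus,
              cor_x_xmy = c_plus, slope = slope), 3))
```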

• jrc says:

Daniel,

They are plotting the difference between Y in two different groups (treatment schools that got scholarships and control schools that did not); so they are plotting Yt-Yc across X, not Y-X across X.

I think you would have to have differential regression to the mean across treatment/control to see what they are showing, but maybe I missed a salient point of the conversation above.
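To illustrate the distinction drawn here, the sketch below simulates two arms with identical regression to the mean and no treatment effect at all. The noise SD, the quintile bins, and the sample size are assumptions. The pooled change score still slopes downward across baseline bins, but the treated-minus-comparison gap within each bin stays near zero.

```r
# Both arms share the same noise process; there is no treatment effect here
set.seed(10)
n <- 4000
ability <- rnorm(n)
pre   <- ability + rnorm(n, sd = 0.6)
treat <- rbinom(n, 1, 0.5)              # random assignment, unrelated to scores
post  <- ability + rnorm(n, sd = 0.6)

# Quintile bins of the baseline score
bins <- cut(pre, quantile(pre, 0:5 / 5), include.lowest = TRUE)

# Y - X by bin (pooled): shows regression to the mean
change_by_bin <- tapply(post - pre, bins, mean)

# Yt - Yc by bin: the quantity plotted across X in the figure under discussion
means <- tapply(post, list(bins, treat), mean)
gap <- means[, "1"] - means[, "0"]

print(round(cbind(change = change_by_bin, gap = gap), 3))
```

Only regression to the mean that differs between arms would show up in the `gap` column.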

• Ah good, at least I’m not crazy.

I still don’t quite understand what they’re plotting though, I don’t think it’s by school, it says “treatment and comparison girls” by pre-score, so how many treatment girls are there with small pre-scores?

or is “treatment” here “goes to a school that received some grants” even if the individual girl didn’t get the treatment?

But then how similar are schools that received no grants vs did receive some grants?

I still don’t think this plot is the way to answer the question, though given your explanation it may not be as silly as I first thought it was. What you want is as I said above, something about Post-pre as a function of things like “total dollars received by school” and “change in teacher attendance rate” and other mechanisms for improved teaching that plausibly could have been caused by receiving money