
“Peer assessment enhances student learning”

Dennis Sun, Naftali Harris, Guenther Walther, and Michael Baiocchi write:

Peer assessment has received attention lately as a way of providing personalized feedback that scales to large classes. . . . By conducting a randomized controlled trial in an introductory statistics class, we provide evidence that peer assessment causes significant gains in student achievement. The strength of our conclusions depends critically on the careful design of the experiment, which was made possible by a web-based platform that we developed. Hence, our study is also a proof of concept of the high-quality experiments that are possible with online tools.

Sun wrote to me:

We conducted a crossover study to see whether students who participated in peer assessment learned more than students who didn’t. In our new study, we took into account your suggestions about our first study, especially about principles of educational measurement. For one, we designed harder exams (somewhat to the chagrin of the students).

I have not looked at the paper in detail but, just speaking generally, I love this sort of study. Even if the end result is “no effect” or “no statistically significant effect,” I still think it’s important in that it pushes us to think harder about what we want our students to learn. As we discussed yesterday, measurement is super-important and is, I think, an underrated aspect of statistics.

I do have one suggestion: it’s a suggestion that’s pretty much universal in any study of this sort:

Make a scatterplot where each dot is a student, and you plot “after” score vs. “before” score, using different colored dots for treated and control students, and you can also draw the regression lines on the graph.

I find this sort of graph to be essential in the understanding of any study of this sort.
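A minimal sketch of that graph, assuming a study that records both a "before" and an "after" score for each student plus a treated/control flag. The function names and colors are illustrative choices, not anything from the paper, and the plotting helper assumes matplotlib is installed.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys on xs; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def group_lines(before, after, treated):
    """Fit one before -> after line per group: False = control, True = treated."""
    lines = {}
    for flag in (False, True):
        xs = [b for b, t in zip(before, treated) if t == flag]
        ys = [a for a, t in zip(after, treated) if t == flag]
        lines[flag] = fit_line(xs, ys)
    return lines


def plot_before_after(before, after, treated):
    """Scatter "after" vs. "before", one color per group, with fitted lines."""
    import matplotlib.pyplot as plt  # assumption: matplotlib is available

    lines = group_lines(before, after, treated)
    lo, hi = min(before), max(before)
    groups = ((False, "tab:blue", "control"), (True, "tab:red", "treated"))
    for flag, color, label in groups:
        xs = [b for b, t in zip(before, treated) if t == flag]
        ys = [a for a, t in zip(after, treated) if t == flag]
        plt.scatter(xs, ys, color=color, label=label)
        intercept, slope = lines[flag]
        # Draw the group's regression line across the full range of before scores.
        plt.plot([lo, hi], [intercept + slope * lo, intercept + slope * hi], color=color)
    plt.xlabel("before score")
    plt.ylabel("after score")
    plt.legend()
    plt.show()
```

The vertical gap between the two lines at a given "before" score is the estimated treatment effect for students who started at that level.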


  1. Christian Hennig says:

    Measurement may be super-important, but the link doesn’t seem to work.

  2. Peer-to-peer teaching (more widely known as collaborative learning) has also been widely studied in “offline” education research, at least since the 1980s when I first became aware of it. I believe it is still generally considered a Very Good Thing. Every professor knows that you don’t really know something until you can teach it, or at least write about it coherently.

    In my own experience, having students grade each other’s papers is very eye-opening for everyone (grader, gradee, and teacher of the actual class). When we did this at Carnegie Mellon, students seemed (no real data to back it up) to work harder to impress their peers. But be careful — students can be harsh critics of each other’s work.

    We used to do all sorts of “peer-to-peer” education in our year-long MS practical course. One task would be to assign a student to ask a question about another student’s presentation (exactly what a good host does when nobody else in the audience asks questions). Another would be to write reviews for another student’s paper just like a review for a journal. That’s win-win-win-win because the students know their peers will be reading their work, the peers reading it have to figure out what it means, everyone has to learn how to comment and understand other students’ comments, and it provides a great insight into the reviewing process. Kind of like moot court in law school!

    There is also a whole lot of room to apply models of grading to correct for inaccurate or biased graders. This is true not just in classes, but for journals, conferences, etc. If it’s multiple choice, you can use the Dawid-Skene model (from 1979).
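To make the Dawid-Skene idea concrete, here is a rough sketch of it fitted by EM: each item has an unknown true category, each grader has their own confusion matrix, and the two are estimated jointly. The function name, the data layout, and the small smoothing constant are assumptions for illustration, not anything taken from the comment or the paper.

```python
import math


def dawid_skene(labels, n_classes, n_iter=50, smooth=0.01):
    """EM for a Dawid-Skene-style model.

    labels: list of (item, rater, label) with label in range(n_classes).
    Returns (post, conf): post[item][k] is the posterior probability that the
    item's true class is k; conf[rater][k][l] estimates P(rater says l | true k).
    """
    items = sorted({i for i, _, _ in labels})
    raters = sorted({j for _, j, _ in labels})

    # Initialise each item's posterior with its vote shares (soft majority vote).
    post = {i: [0.0] * n_classes for i in items}
    for i, _, l in labels:
        post[i][l] += 1.0
    for i in items:
        total = sum(post[i])
        post[i] = [v / total for v in post[i]]

    conf = {}
    for _ in range(n_iter):
        # M-step: class prevalence and per-rater confusion matrices.
        prior = [sum(post[i][k] for i in items) / len(items) for k in range(n_classes)]
        conf = {j: [[smooth] * n_classes for _ in range(n_classes)] for j in raters}
        for i, j, l in labels:
            for k in range(n_classes):
                conf[j][k][l] += post[i][k]
        for j in raters:
            for k in range(n_classes):
                row_total = sum(conf[j][k])
                conf[j][k] = [v / row_total for v in conf[j][k]]

        # E-step: posterior over each item's true class, computed in log space.
        log_post = {i: [math.log(p + 1e-12) for p in prior] for i in items}
        for i, j, l in labels:
            for k in range(n_classes):
                log_post[i][k] += math.log(conf[j][k][l])
        for i in items:
            top = max(log_post[i])
            weights = [math.exp(v - top) for v in log_post[i]]
            total = sum(weights)
            post[i] = [w / total for w in weights]
    return post, conf
```

A grader who is consistently wrong in the same direction ends up with large off-diagonal entries in their confusion matrix, so their votes still carry information once the model has learned how to read them.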

    • Radford Neal says:

      “students seemed (no real data to back it up) to work harder to impress their peers”

      This seems like a problem in any assessment of how well such techniques work. If these techniques lead to better results **for the same amount of student effort**, then they’re clearly beneficial. But if they are just a motivational technique that prompts students to spend more time on the course, it is quite unclear whether or not they are beneficial. More time on this course means less time on another course, or less time spent on other activities that might actually be more important (e.g., going to a party where you might meet your future spouse).

    • Rahul says:

      At least on reading this current paper the effect, if any, seems so tiny that I might as well disregard it.

    • Martha says:

      “But be careful — students can be harsh critics of each other’s work.”

      This may depend on the particular student “audience.” In some instances where I’ve tried to have students grade their classmates’ work, I’ve found that some students have an “It isn’t nice to criticize someone” philosophy. In some classes, it was necessary to have a discussion of the difference between “criticizing the person” and “critiquing the work.”

  3. Rahul says:

    The authors use p-value misconceptions among students as their example.

    It would be funny if their authoritative “correct answer” itself was in error. (Isn’t it?)

  4. Rahul says:


    You asked for graphs? Here:

    3D Scatter Plots in Full Color. With spaghetti lines. First two figures of that paper. (Interestingly, the authors are not even MBAs.)

  5. jrc says:

    Andrew, re: Make a scatterplot where each dot is a student, and you plot “after” score vs. “before” score, using different colored dots for treated and control students, and you can also draw the regression lines on the graph.

    First – yes, this is a really important graph in any kind of study like this. I drew one for my undergrads the other day and they immediately understood what it was telling them. My only addition: for “include the regression line” – use a local linear/polynomial regression, and plot the T/C groups’ best-fit curves with dashed lines matching the group-specific dot color, but then subtract them from each other at each point, and draw a solid black line showing the difference between T/C (the treatment effect at each point in the pre-score distribution). I think that is my favorite way to draw that graph.
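A small sketch of the curve-difference idea, under stated assumptions: a Gaussian kernel, a single hand-picked bandwidth, and hypothetical function names. It computes the solid-line values (treated curve minus control curve) on a grid of pre-scores, and leaves the drawing to whatever plotting tool you prefer.

```python
import math


def local_linear(xs, ys, x0, bandwidth):
    """Local linear regression estimate of E[y | x = x0], Gaussian kernel.

    Solves the kernel-weighted least-squares fit of y on (x - x0) in closed
    form; the fitted intercept is the estimate at x0.
    """
    sw = sx = sxx = sy = sxy = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        w = math.exp(-0.5 * (d / bandwidth) ** 2)
        sw += w
        sx += w * d
        sxx += w * d * d
        sy += w * y
        sxy += w * d * y
    return (sxx * sy - sx * sxy) / (sw * sxx - sx * sx)


def effect_curve(pre_t, post_t, pre_c, post_c, grid, bandwidth):
    """Treated-minus-control fitted curves, evaluated at each pre-score in grid."""
    return [
        local_linear(pre_t, post_t, g, bandwidth)
        - local_linear(pre_c, post_c, g, bandwidth)
        for g in grid
    ]
```

The bandwidth matters a lot here: too small and the difference curve chases noise, too large and it flattens toward the difference of two straight lines.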

    Second – I don’t think that is possible here. There is no obvious “pre-score” in this setup, I don’t think. I guess you could rank kids based on overall GPA (or math GPA, or math SAT scores, or something) and then do this, but it is not clear that that is the “ranking” you (or we) are interested in. Which brings me to…

    Third – what is the ranking we are interested in here?… Well, I have some ideas about that. Wanna hear it? Here it goes***….

    The thought experiment behind this graph is the idea that somehow a pre-intervention test score is a measure of ability. We are interested in the extent to which a treatment effect varies across the ability distribution (and to model check in the sense of seeing that post-intervention scores increase in pre-intervention scores – but mostly it’s the heterogeneity in treatment effect across ability).

    Plotting post-score across pre-score then is essentially looking at treatment effects across an ability ranking. If the pre-score is a good measure of ability, then great, we have a good ranking of people, and we can legitimately ask if the “higher ability” or “lower ability” students improved. But often in these types of interventions – such as this one, where there is no obvious pre-intervention score that is comparable across all students – there is no obviously good pre-intervention ability ranking.

    So suppose we have a very noisy measure of ability in the pre-score. Well – what about another ranking of ability, namely, the post-intervention test score? Oftentimes, that can be a better measure of pure ability (albeit ability after the intervention) because it is well designed to measure the particular educational outcomes of interest.

    And so finally my argument: Quantile treatment effects (meaning distance between two inverted CDFs – the test score at the 20th percentile of the treatment group distribution minus the 20th percentile score in the control group distribution) can be a more convincing measure of heterogeneous effects across the ability distribution than the local regression (or linear, or binned, or whatever) across the (less precisely measured or relevant) pre-intervention test score.
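As a rough illustration of that definition, the quantile treatment effect at level q is the q-th quantile of the treated group's scores minus the q-th quantile of the control group's scores. The sketch below uses linear interpolation between order statistics, which is one of several common quantile conventions and is an assumption here; the function names are hypothetical.

```python
def quantile(values, q):
    """Empirical quantile (0 <= q <= 1), interpolating between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def quantile_treatment_effects(treated, control, qs):
    """Treated quantile minus control quantile at each level in qs."""
    return {q: quantile(treated, q) - quantile(control, q) for q in qs}
```

Note that the two groups need not be the same size, and no pre-score is required: the comparison is between the two post-score distributions, not between matched students.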

    This also relates to the question of “rank preservation” in the interpretation of treatment effects across the outcome distribution, but in general, with how noisy test scores are and how much ranks change from test to test just because of noise, I think that conversation is not super helpful (though I reserve the right to still change my mind on that, because I think the question itself is sort of interesting, and it is really an “open” question that could use some attention).

    ***”Ah ha ha ahhhh ahhhh. Thank you very much!”:
