What to think about in 2015: How can the principles of statistical quality control be applied to statistics education?

Happy new year!

A few years ago, Eric Loken and I wrote, Statisticians: When we teach, we don’t practice what we preach:

As statisticians, we give firm guidance in our consulting and research on the virtues of random sampling, randomized treatment assignments, valid and reliable measurements, and clear specification of the statistical procedures that will be applied to data. With self-assured confidence that we occupy the moral high ground, we share horror stories about convenience samples, selection bias, multiple comparisons, and other problems that arise when those less enlightened about proper methodology don’t follow the rules.

But are we really consistent in all aspects of our professional lives? How do we approach teaching? The following generalizations apply to most of us:

We assign grades based on exams that would almost surely be revealed to be low in both reliability and validity if we were to ever actually examine their psychometric properties. Despite teaching the same courses year after year, we rarely use standardized tests.

We almost never use pre-tests at the beginning of the semester, either to adjust for differences between students in different sections of a course or even for the more direct goal of assessing what has actually been learned by students in our classes.

We evaluate teachers based on student evaluations which, in addition to all their problems as measuring instruments, are presumably subject to huge nonresponse biases. Would we tolerate client satisfaction surveys as the only measure of hospital quality?

We try out new ideas haphazardly. Not only do we not do randomized experiments, we generally do not perform any systematic comparisons of treatments at all. As one high-level administrator put it to us recently, “It would be good if we introduced our new teaching methods based on something more than a ‘hunch.’”

We continued:

The statistical field of quality control emphasizes the process of monitoring and improving a system, rather than focusing on individual cases. When we teach, however, we tend to focus on what seems to work or not work in an individual course, rather than on improving the process or the sequence. Consider how entrenched the freshman science sequence is at many large universities.

The contradiction is especially clear because we actually teach the stuff we believe in our classes and expect the students to parrot it back. However, we do not, in general, conduct our classes in a manner consistent with the principles we teach. . . .

And we concluded:

Being empirical about teaching is hard. Lack of incentives aside, we feel like we move from case study to case study as college instructors and that our teaching is a multifaceted craft difficult to decompose into discrete malleable elements. . . .

In making our practice more research-based and our teaching more practically focused, it would make sense to involve the entire educational team, including members of college and university administrations who set curricula, permanent faculty who organize courses, adjuncts and teaching assistants who perform much of the grading and face-to-face teaching, and writers of textbooks and educational materials.

This all sounds good. But what have I done about this since we wrote the above paragraphs? Lots of preaching, no practicing.

So, on this first day of the new year, I think we should all reflect on how to apply the statistical principles of quality control to statistics teaching. And to education more generally, of course, but let’s start with the problems that are right in front of us.
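As one concrete starting point, the monitoring idea from quality control can be sketched as a simple control chart on course-level results. This is only an illustration under stated assumptions: the function names and the per-semester mean scores are hypothetical, and it assumes the same common exam is given each semester. It also uses the sample standard deviation of the baseline for the limits, which is a simplification of standard control-chart practice.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits computed from baseline values.

    Assumption: limits are center +/- k sample standard deviations.
    """
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center, center + k * spread


def out_of_control(values, limits):
    """Return the indices of values that fall outside the limits."""
    lower, _, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical mean scores on a common final exam, one value per semester.
baseline_semesters = [71.0, 73.5, 72.0, 70.5, 74.0, 72.5]
new_semesters = [73.0, 79.5, 71.5]

limits = control_limits(baseline_semesters)
flagged = out_of_control(new_semesters, limits)
print(limits, flagged)
```

A flagged semester is a prompt to look at what changed in the process (instructor, textbook, exam form, student intake), not a verdict on any individual course.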


  1. gwern says:

    As always when this topic (or most complaints about academia, for that matter) comes up: incentives matter. As always, the cobbler’s children go unshod.

    Yes, there’s a fair bit of knowledge about how to teach. Testing, spaced repetition, commitment devices, interaction, graphs – all of these we know to considerably improve educational performance. But professors are not rewarded for teaching well, and students aren’t there to learn but get credentials. Hence, the status quo of ad hoc tests, teaching fads, the farce of student evaluations, and so on.

    • Clyde Schechter says:

      I don’t think it’s a matter of incentives so much as a matter of institutions and infrastructure. While most statisticians know something about psychometric analysis of assessment instruments (and some lack even that), the more daunting problem is actually developing, updating, securing, and maintaining test items. Having been involved heavily in this activity in graduate medical education earlier in my career, I can tell you that it is a very labor-intensive and expensive undertaking. It is, for practical purposes, impossible for a single teacher to do this, and would be impractical for all but the largest institutions’ faculties to do on their own. To be done properly, it needs to be done on an industrial scale.

      The problem has been solved in medical education. The National Board of Medical Examiners, which develops and administers the national licensing examination for physicians, also sells “shelf exams” to medical schools to do student assessments in individual courses. The use of these well-developed, psychometrically sound instruments has become nearly universal in the last decade and is beginning to make its impact felt on teaching methods as well. The trend is spreading to graduate medical education also. Some of the specialty certification boards now also provide in-service examinations that residency programs can use to evaluate the progress and learning of their trainees. While not all specialties do this, and not all residency programs avail themselves of it, the practice is spreading rapidly. The key is that there are institutions with vast resources and a compatible mission that are able to do this. I can’t think of anything like this in statistics (or really any other academic discipline) that could undertake this role.

      • Rahul says:

        Well, if there was a latent demand I guess these institutions would emerge for Statistics too? For one, attaining industrial scale on a Stat-101 class is far easier than a Medical Specialty.

        OTOH, the question is, is there a demand? Perhaps we, as a society, value quality & reliability in our Physicians more than we do in our Statisticians.

      • Martha says:

        “I don’t think it’s a matter of incentives so much as a matter of institutions and infrastructure.”

        Incentives are intertwined with institutions and infrastructure. Changing incentives may result in changes in institutions and infrastructure; changes in institutions and infrastructure may change incentives (sometimes unintentionally, and sometimes for the worse); and it may be necessary to change institutions and infrastructure in order to change incentives.

        “I can tell you that it is a very labor-intensive and expensive undertaking. It is, for practical purposes, impossible for a single teacher to do this, and would be impractical for all but the largest institutions’ faculties to do on their own.”

        This makes sense to me.

        “The problem has been solved in medical education.”

        I’m skeptical; there may appear to be a solution, but (unfortunately) “solutions” often bring problems with them. I would guess that, for example, there is criticism of the “shelf exams” (as well as the national licensing exams) on the grounds of omitting (or not adequately testing) some important skills or knowledge, or on the grounds of being (at least partially) obsolete as soon as they are widely used.

        It’s a never-ending problem, always subject to disagreements about what is important and to obsolescence and to the need to correct mistakes. Still, there is a lot of room for improvement, so it’s worth trying to implement incentives as much as feasible (which leaves open the big problem of how to do this).

        • Clyde Schechter says:

          “I’m skeptical; there may appear to be a solution, but (unfortunately) “solutions” often bring problems with them. I would guess that, for example, there is criticism of the “shelf exams” (as well as the national licensing exams) on the grounds of omitting (or not adequately testing) some important skills or knowledge, or on the grounds of being (at least partially) obsolete as soon as they are widely used.”

          A couple of comments. First, I should have said not that the problem has been solved but that a solution is evolving. The use of shelf exams began only recently–though it has caught on very rapidly. The use of in-service exams is still spotty. And the availability of these tests is only now beginning to lead to serious attempts to compare different approaches to medical education. But at least medical education is now clearly moving along that path.

          The National Board of Medical Examiners has broad representation from all segments of the medical profession and there is very little controversy about the appropriateness of their exams for the purposes they are used for. It is true that some questions go out of date shortly after they are first used, but part of the process of maintaining the exams is identifying and weeding out obsolete questions. Even in the event that circumstances change between the time an exam is first compiled and when it is administered, part of the scoring methodology identifies items with unexpectedly poor performance. Such items are reviewed for content, and are eliminated from scoring if it is recognized that the keyed answer is no longer correct in light of recent developments. The item is then either revised or dropped altogether from future use.

          The appropriateness of the content of specialty certification exams is more a subject of dispute. This is particularly true of the re-certification exams that are required periodically to maintain certification. Physicians in practice may focus their work in a restricted subarea and feel it is unfair to be examined on parts of their specialty that they no longer practice. Some of the larger specialties are accommodating this by developing exams tailored to some of the more common subareas. The smaller specialties are unable to do this because it is not feasible to develop an exam for a handful of examinees (or at least not at a reasonable price).

          So, while things are not perfect, they are better than we see in much of academics, and are moving in the right direction.
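The item review described in this comment (identifying items with unexpectedly poor performance) can be sketched with two classical item statistics: difficulty (proportion correct) and discrimination (correlation between the item and the score on the remaining items). This is a minimal sketch, not the scoring methodology of any actual board. The function names, thresholds, and response matrix are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def item_analysis(responses):
    """responses: one row per examinee, one 0/1 entry per item."""
    totals = [sum(row) for row in responses]
    results = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        # Rest score excludes the item itself to avoid inflating the correlation.
        rest = [t - s for t, s in zip(totals, item)]
        results.append({
            "item": j,
            "difficulty": sum(item) / len(item),
            "discrimination": pearson(item, rest),
        })
    return results


def flag_items(results, min_difficulty=0.2, min_discrimination=0.0):
    """Flag items that are very hard or that stronger examinees tend to miss."""
    return [r["item"] for r in results
            if r["difficulty"] < min_difficulty
            or r["discrimination"] < min_discrimination]


# Hypothetical data: 8 examinees, 5 items; the last item behaves oddly.
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
results = item_analysis(responses)
print(flag_items(results))
```

An item with negative discrimination, where the strongest examinees tend to get it "wrong", is the kind of item that would be sent for content review, for example because the keyed answer may be out of date.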

      • gwern says:

        > I don’t think it’s a matter of incentives so much as a matter of institutions and infrastructure.

        The potential equivalence aside (do not institutions and infrastructures determine much of the relevant incentives?), that would be a reasonable explanation if all the improvements were as expensive and difficult as you say good test batteries would be. This is not the case, though: spaced repetition and testing effects in varying strengths can be applied for free simply by teachers doing weekly quizzes, doing cumulative tests rather than focusing solely on just taught material, or mixing up questions over time; not to mention that students can do this themselves with the many free spaced repetition software packages. And somehow these institutions, incapable of sustained effort to create good tests, are able to survey their entire student body many times a year for teacher evaluations…
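The low-cost version suggested here (weekly quizzes that mix new material with questions from earlier weeks) can be sketched in a few lines. This is an illustration only: the `build_quiz` function, the question bank layout, and the question labels are hypothetical, and the split between new and review questions is an arbitrary choice.

```python
import random


def build_quiz(bank, week, n_new=3, n_review=2, seed=0):
    """Draw n_new questions from the current week and n_review from earlier weeks.

    bank maps a week number to a list of question identifiers.
    """
    rng = random.Random(seed + week)  # reproducible draw per week
    current = bank.get(week, [])
    earlier = [q for w, qs in bank.items() if w < week for q in qs]
    new_part = rng.sample(current, min(n_new, len(current)))
    review_part = rng.sample(earlier, min(n_review, len(earlier)))
    return new_part + review_part


# Hypothetical question bank for three weeks of a course.
bank = {
    1: ["w1-q1", "w1-q2", "w1-q3"],
    2: ["w2-q1", "w2-q2", "w2-q3"],
    3: ["w3-q1", "w3-q2", "w3-q3", "w3-q4"],
}
quiz = build_quiz(bank, week=3)
print(quiz)
```

In week 1 there is nothing to review, so the quiz simply contains new questions; from week 2 onward, each quiz revisits earlier material.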

  2. John says:

    I’ve extensively overhauled how we teach statistics at our honours level and haven’t done one iota of testing. I defend this for two reasons: I’m ironing out the fine details of how I’d like the new system to work, and it’s the last class most of them take at the end of their degree, so there is no (easy) way to test longer-term benefits, if any.

    But I do anticipate over the next couple of years I am going to be doing some testing. I think I’ve solved most of the problems the course has been having and at the point I’m pretty happy with it I want to do some assessment. Any thoughts on this thread would be appreciated.

  3. […] Applying statistical thinking to education. […]

  4. Fernando says:

    We should apply the principles of statistical quality control to research, as well as teaching.

    Scientists are in the business of manufacturing inferences.

    Presently our manufacturing processes are as reliable as the process used to manufacture the Yugo.

  5. ezra abrams says:

    a) you keep teaching statistics till you retire
    b) you give up stats, and try and drag the college teaching profession, kicking and screaming, into the 20th century

    your prior for which is likely to have a bigger positive impact on society ?

    PS: most people are waiting for a leader

  6. Elin says:

    I think there has to be a combination of being a data person and being realistic. It’s ridiculous to think that you could do a RCT on your own teaching, obviously you are not blind to the intervention (except maybe in some very obscure scenario like randomly assigning different books). And each university and college (and program within universities and colleges) is so different and has such unique selection effects that the idea of generalizability is really questionable.

    That said, the accumulation of lots of smaller analyses is helpful in a very practical way. Even more, trying to look at whether or not your students actually learn what you want them to learn is personally helpful even if it’s not something that is great, earth shattering research design. I do pretests and posttests on some of my teaching and I find it helpful (if sometimes painful) to look at the results. I think that’s probably more helpful than a massive, administration driven assessment undertaking that inevitably will be required to always show success. The accrediting agencies want you to do assessment, but they are not going to accredit you if the assessment results show you are failing. So if you are going to be data focused you need to remove anything that incentivizes good results. So we have to muddle through, aiming for “good enough” data to help us know what’s going on in our own classrooms.

    When I talk to students who are going to be k-8 teachers about why taking a research class and learning to deal with data is good for them, I point out that they are doing mini experiments all the time, e.g. when they separate two kids who are disruptive or try changing up the spelling homework, and if they are consciously observing and collecting data all along they might be able to at least start to evaluate whether those strategies are working or if it just seems like they are working because of wishful thinking or because the bad spellers were absent on the day of the test that week.
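The pretest/posttest comparison mentioned in this comment can be summarized in a few lines. This is a minimal sketch with hypothetical scores and function names. It assumes the same students took both tests, so students absent for either test would have to be dropped or handled separately, and it makes no causal claim, since there is no comparison group.

```python
from math import sqrt
from statistics import mean, stdev


def paired_gain(pre, post):
    """Mean per-student gain and its standard error (paired scores)."""
    gains = [b - a for a, b in zip(pre, post)]
    return mean(gains), stdev(gains) / sqrt(len(gains))


def normalized_gain(pre, post, max_score):
    """Class-average gain as a fraction of the room left to improve."""
    return (mean(post) - mean(pre)) / (max_score - mean(pre))


# Hypothetical scores out of 20 for six students, in the same order.
pre = [8, 12, 10, 15, 9, 11]
post = [14, 15, 13, 18, 10, 16]

avg_gain, se_gain = paired_gain(pre, post)
g = normalized_gain(pre, post, max_score=20)
print(avg_gain, se_gain, g)
```

Keeping the per-student gains, rather than only the two class averages, makes it easier to see whether an apparent improvement is driven by a few students.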

  7. Christian Hennig says:

    I’m not against proper research including RCTs on statistics teaching.
    But there is any number of good reasons to be skeptical about it and to do things in teaching that are hardly compatible with such a research programme.

    A few aspects:
    1) My experience as student and lecturer is that most lecturers are much better when they use the style and material that they like most. I have seen a department trying to impose something that was tested by proper educational research, but in the hands of those who didn’t like it, it was a disaster (although of course I can’t know whether everything would have been a disaster in the hands of these professors…).

    2) The most helpful evaluations of my teaching are those in which I ask the students to write down one or two things they like and don’t like about my course in free form on a sheet of paper. Nothing quantitative (except that I realise if three people complain about the same thing), no standardised formats.

    3) The students my teaching is targeted at are exactly the group in the room (and much of what works is influenced by social processes among the students, which can turn out totally different in a different group). I want to be in contact with them and adapt my style to the group that I have. And I will give students lots of opportunities to bring in something creative, related to their own thoughts, that can be discussed by the group, and how this turns out depends strongly on the specific group.

    4) If we want to test whether students can adapt their knowledge to situations that deviate from the examples in class, we need to use surprising tests, not standardised ones.

    • Christian Hennig says:

      I should add that I don’t always practice what I preach either… setting up and marking surprising tests is more work than standardised ones…

    • KMC says:

      I know how much work it can be to adapt to students’ evolving status, as well as developing assessments that gauge students’ ability to transfer knowledge. I’m guessing that some of your students don’t recognize the effort you put in for them… but I do.

      We should consider those aspects that you list. I would suggest that we should also guide our curriculum decisions on theories supported with empirical evidence.

      I’m open for collaborating.

    • Martha says:

      Christian said, “and much of what works is influenced by social processes among the students, which can turn out totally different in a different group”

      Yup — This (plus the particulars of the students in the classroom) is part of the “classroom effect” — this is one reason why evaluating teachers on results of standardized tests is inherently inaccurate and hence unfair to the teachers.

      But Christian’s point(s) also bring up one problem in RCT’s for teaching: It would be a really big deal to design an RCT that can give a good estimate of the variability in the effect of a teaching method that captures the variability coming from the classroom effect and the teacher effect. In other words, we are dealing with really noisy data; a huge sample size (of classes as well as teachers) would be needed. A study of reasonable size could give effect estimates that would not be reasonable to extrapolate beyond the teachers and types of students involved in the study.

      • Martha says:

        Clarification on reading over my last comment: In the last sentence, I meant “reasonable size” in the sense of “feasible size,” not in the sense of “large enough to give a good estimate of effect size.”
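The sample-size concern raised in this thread can be made concrete with the standard design-effect formula for cluster-randomized designs with equal cluster sizes, DEFF = 1 + (m - 1) * ICC, where m is the class size and ICC is the intraclass correlation. This is a rough sketch only: the function names and the numbers are hypothetical, and the formula covers clustering of students within classes, not the additional teacher-by-method variability discussed above.

```python
def design_effect(class_size, icc):
    """Variance inflation from randomizing whole classes instead of students."""
    return 1.0 + (class_size - 1) * icc


def effective_sample_size(n_classes, class_size, icc):
    """Number of independent students the clustered sample is worth."""
    return n_classes * class_size / design_effect(class_size, icc)


# Hypothetical study: 10 classes of 30 students, with an assumed ICC of 0.2.
deff = design_effect(class_size=30, icc=0.2)
n_eff = effective_sample_size(n_classes=10, class_size=30, icc=0.2)
print(deff, n_eff)
```

Under these assumed numbers, 300 students carry roughly the information of 44 independent ones, which illustrates why a study of feasible size may say little beyond the classes and teachers it actually included.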

  8. KMC says:

    I teach a mastery course. Before I began using this approach I had great concerns about reliability and validity. My concerns remain, but are strongly attenuated. Now my biggest questions are related to what topics/skills should my students possess to be awarded a particular letter grade (there are no points in the course, only awards/badges issued for a particular topic/concept/tool). I would love to get some feedback on that, for my introductory course. My syllabus is at if you are willing and interested in looking at my current grade matrix.

  9. Eli Rabett says:

    Teaching evaluations are either popularity contests (great or lousy teacher) or measures of what the students have learned (standardized tests). The focus is wrong, they should be formative, e.g. more ppt presentations or more problems worked in class on the board? basically questions about how instruction should be changed.

  10. David Condon says:

    I would add to that college professors mostly don’t spend any time reading educational research on classroom methods. I would say this would be more useful than testing out new methods or a teacher’s own particular implementation of the existing methods provided he/she was well-versed in the existing literature.
