Debate involving a bad analysis of GRE scores

This is one of these academic ping-pong stories of a general opinion, an article that challenges the general opinion, a rebuttal to that article, a rebuttal to the rebuttal, etc. I’ll label the positions as A1, B1, A2, B2, and so forth:

A1: The starting point is that Ph.D. programs in the United States typically require that applicants take the Graduate Record Examination (GRE) as part of the admissions process.

B1: In 2019, Miller, Zwick, Posselt, Silvestrini, and Hodapp published an article saying:

Multivariate statistical analysis of roughly one in eight physics Ph.D. students from 2000 to 2010 indicates that the traditional admissions metrics of undergraduate grade point average (GPA) and the Graduate Record Examination (GRE) Quantitative, Verbal, and Physics Subject Tests do not predict completion as effectively as admissions committees presume. Significant associations with completion were found for undergraduate GPA in all models and for GRE Quantitative in two of four studied models; GRE Physics and GRE Verbal were not significant in any model. It is notable that completion changed by less than 10% for U.S. physics major test takers scoring in the 10th versus 90th percentile on the Quantitative test.

They fit logistic regressions predicting Ph.D. completion given undergraduate grade point average, three GRE scores (Quantitative, Verbal, Physics), indicators for whether the Ph.D. program is Tier 1 or Tier 2, indicators for six different ethnic categories (with white as a baseline), an indicator for sex, and an indicator for whether the student came from the United States. Their results are summarized by the statistical significance of the coefficients for the GRE predictors. Their conclusion:

The weight of evidence in this paper contradicts conventional wisdom and indicates that lower than average scores on admissions exams do not imply a lower than average probability of earning a physics Ph.D. Continued overreliance on metrics that do not predict Ph.D. completion but have large gaps based on demographics works against both the fairness of admissions practices and the health of physics as a discipline.
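
To fix ideas, here is a minimal sketch (in R, on fake data) of the general form of regression they describe. This is only an illustration: every variable name, coefficient, and sample size below is invented, and the ethnicity and sex indicators are omitted for brevity.

# Simulated data standing in for the real admissions records
set.seed(1)
n <- 3000
d <- data.frame(
  gpa   = rnorm(n, 3.5, 0.3),   # undergraduate GPA
  gre_q = rnorm(n),             # standardized GRE Quantitative
  gre_v = rnorm(n),             # GRE Verbal
  gre_p = rnorm(n),             # GRE Physics
  tier1 = rbinom(n, 1, 0.3),    # Tier 1 program indicator
  us    = rbinom(n, 1, 0.6)     # U.S. student indicator
)
d$complete <- rbinom(n, 1, plogis(-2 + 0.8 * d$gpa + 0.3 * d$gre_q))

# Unregularized logistic regression of completion on the admissions metrics,
# with coefficients then read off by statistical significance
fit <- glm(complete ~ gpa + gre_q + gre_v + gre_p + tier1 + us,
           family = binomial, data = d)
summary(fit)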

A2: Weissman responded with an article in the same journal, saying:

A recent paper in Science Advances by Miller et al. concludes that Graduate Record Examinations (GREs) do not help predict whether physics graduate students will get Ph.D.’s. Here, I argue that the presented analyses reflect collider-like stratification bias, variance inflation by collinearity and range restriction, omission of parts of a needed correlation matrix, a peculiar choice of null hypothesis on subsamples, blurring the distinction between failure to reject a null and accepting a null, and an unusual procedure that inflates the confidence intervals in a figure. Release of results of a model that leaves out stratification by the rank of the graduate program would fix many of the problems.

One point that Weissman makes is that the GRE Quantitative and Physics scores are positively correlated, so (a) when you include them both as predictors in the model, each of their individual coefficients will end up with a larger standard error, and (b) the statement in the original article that “completion changed by less than 10% for U.S. physics major test takers scoring in the 10th versus 90th percentile on the Quantitative test” is incorrect: students scoring higher on the Quantitative test will, on average, score higher on the Physics test too, and you have to account for that in making your prediction.
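
A toy simulation makes both points concrete. This is only a sketch: the 0.55 correlation between the two scores is the value that comes up later in this exchange, and everything else below is invented.

# Correlated GRE-Q and GRE-P scores
library(MASS)
set.seed(2)
n <- 2300
S <- matrix(c(1, 0.55, 0.55, 1), 2, 2)      # corr(Q, P) = 0.55
x <- mvrnorm(n, mu = c(0, 0), Sigma = S)
q <- x[, 1]; p <- x[, 2]
y <- rbinom(n, 1, plogis(0.5 + 0.25 * q + 0.2 * p))   # simulated completion

# (a) Q's standard error grows when the correlated P is also in the model
summary(glm(y ~ q,     family = binomial))$coefficients["q", "Std. Error"]
summary(glm(y ~ q + p, family = binomial))$coefficients["q", "Std. Error"]

# (b) comparing 10th vs. 90th percentile students on Q while holding P fixed
# understates the difference, because high-Q students tend to have high P too
fit <- glm(y ~ q + p, family = binomial)
lo <- quantile(q, 0.1); hi <- quantile(q, 0.9)
predict(fit, data.frame(q = c(lo, hi), p = c(0, 0)), type = "response")
predict(fit, data.frame(q = c(lo, hi), p = 0.55 * c(lo, hi)), type = "response")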

Weissman also points out that adjusting for the tier of the student’s graduate program is misleading: the purpose of the analysis is to consider admissions decisions, but the graduate program is not determined until after admissions. Miller et al. are thus making the mistake of adjusting for post-treatment variables (see section 19.6 of Regression and Other Stories for more on why you shouldn’t do that).
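
Here is a minimal sketch of why that matters, assuming (purely for illustration) that program tier is determined by GRE plus unmeasured strengths that also affect completion:

# Post-treatment ("collider") adjustment in a toy setting; all numbers invented
set.seed(3)
n <- 4000
gre   <- rnorm(n)                                        # observed admissions score
other <- rnorm(n)                                        # unmeasured strengths (letters, research, ...)
tier1 <- rbinom(n, 1, plogis(2 * (gre + other)))         # top-tier placement depends on both
y     <- rbinom(n, 1, plogis(0.5 + 0.3 * gre + 0.5 * other))  # completion; no direct tier effect

coef(glm(y ~ gre,         family = binomial))["gre"]  # roughly recovers the simulated 0.3
coef(glm(y ~ gre + tier1, family = binomial))["gre"]  # attenuated: conditioning on tier1
                                                      # induces a negative gre-"other" association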

At the end of his discussion, Weissman recommends preregistration, but I don’t see the point of that. The analysis made mistakes. If these are preregistered, they’re still mistakes. More relevant would be making the data available. But I guess a serious analysis of this topic would not limit itself to physics students, as I assume the same issues arise in other science Ph.D. programs.

B2: Miller et al. then responded:

We provide statistical measures and additional analyses showing that our original analyses were sound. We use a generalized linear mixed model to account for program-to-program differences with program as a random effect without stratifying with tier and found the GRE-P (Graduate Record Examination physics test) effect is not different from our previous findings, thereby alleviating concern of collider bias. Variance inflation factors for each variable were low, showing that multicollinearity was not a concern.

Noooooooo! They totally missed the point. “Program-to-program differences” are post-treatment variables (“colliders”), so you don’t want to adjust for them. And the issue of multicollinearity is not that it’s a concern with the regression (although it is, as the regression is unregularized, but that’s another story); the problem is in the interpretation of the coefficients.
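
And swapping the tier indicator for a program random effect doesn't escape the problem, at least in a toy version of the same setup (a sketch assuming the lme4 package; placement still depends on GRE plus unmeasured strengths, and every number is invented):

# Sorting students into 20 "programs" by GRE plus unmeasured strengths
library(lme4)
set.seed(4)
n <- 4000
gre   <- rnorm(n)
other <- rnorm(n)                                    # unmeasured strengths
s     <- gre + other
program <- cut(s, breaks = quantile(s, 0:20 / 20),
               labels = FALSE, include.lowest = TRUE)
d <- data.frame(gre, other, program,
                y = rbinom(n, 1, plogis(0.3 * gre + 0.3 * other)))  # no direct program effect

coef(glm(y ~ gre, family = binomial, data = d))["gre"]              # roughly the simulated 0.3
fixef(glmer(y ~ gre + (1 | program), family = binomial, data = d))["gre"]
# the program intercepts absorb the GRE-driven differences between programs,
# so the GRE coefficient is estimated mostly from within-program variation,
# where it is offset by the induced negative association with "other"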

A3: Weissman posted on Arxiv a response to the response; go to page 18 here.

Weissman doesn’t say this, but what I’m seeing here is the old, old story of researchers making a mistake, getting it into print, and then sticking to the conclusion no matter what. No surprise, really: that’s how scientists are trained to respond to criticism.

It’s sad, though: Miller et al. are physicists, and they have so much technical knowledge, but their statistical analysis is so crude. To try to answer tough predictive questions with unregularized logistic regression and statistical significance thresholding . . . that’s just not the way to go.

Weissman summarizes in an email:

Although the GRE issue is petty in the current context, the deeper issue of honesty and competence in science is as important as ever. Here are some concerns.

1. Science Advances tells me they (and Science?) are eliminating Technical Comments. How will they then deal with pure crap that makes it through fallible peer review?

2. Correct interpretation of observational studies can obviously be a matter of life or death now. The big general-purpose journals need editors who are experts in modern causal inference.

3. Two of the authors of the atrocious papers now serve on a select NAS policy panel for this topic.

4. Honesty still matters.

My take on this is slightly different. I don’t see any evidence for dishonesty here; this just seems like run-of-the-mill incompetence. Remember Clarke’s law. I think calls for honesty miss the point, because then you’ll just get honest but incompetent people who think that they’re in the clear because they’re honest.

Regarding item 3 . . . yeah, the National Academy of Sciences. That’s a political organization. I guess there’s no alternative: if a group has power, then politics will be involved. But, by now, I can’t take the National Academy of Sciences seriously. I guess they’ll do some things right, but you’ll really have to look carefully at who’s on the committee in question.

82 thoughts on “Debate involving a bad analysis of GRE scores”

  1. Isn’t the idea behind suggesting preregistration as a remedy that, consciously or unconsciously, the authors of this study garden-pathed it, that is, the bizarre/wrong choices of how to carry out their analysis are there in part because they led to the desired answer?

    • Yes, exactly. Andrew is missing the point that several of these authors have a very strong prior public commitment to these results. Their careers are based on them, including ~$3.9M in grants acknowledged in the paper. All the multiple errors go in the direction of confirming the prior claims.

  2. > Correct interpretation of observational studies can obviously be a matter of life or death now. The big general-purpose journals need editors who are experts in modern causal inference.
    That would explain all those low-quality papers published from observational studies on Covid that flooded into medical journals at the beginning of the current pandemic?

  3. A digression from the major discussion, but one that struck me: why the interest in the relationship between GRE scores and PhD completion in Physics? I admit I have no direct experience with PhD programs in Physics, but I do in some other fields. Completion has always seemed to me to be more related to life experiences than to academic ability. I would think that grades would be more related to GRE scores than degree completion is. Wouldn’t that have been a more relevant question? Of course, many of the same issues in this debate would arise, but the focus on degree completion just strikes me as a less interesting question than some (admittedly poor) measure of degree quality.

    • Feel free to suggest *any* “measure of degree quality” that faculty anywhere have been able or willing to provide. This same question comes up repeatedly in my (Physics) department, and never goes anywhere. Completion is simple to measure, and is definitely correlated with aptitude, though the correlation isn’t perfect. In many cases, as you note, “life experiences” get in the way of completion; in many others, inability to handle a Ph.D. does.

      • As someone who interacts with physicists and physics departments on occasion, I get the impression that many Physics Ph.D. programs tend to admit more first-year students than other programs. The expectation is that many will not finish, or will leave with a Masters. They let the qualifying exams “narrow the field”. Is this at all an accurate impression?

        If so, I suppose completion could arguably be an informative outcome under that regime. (Though I also think it’s a pretty crummy way to do things–why admit a student if you really think there’s a substantial chance they won’t finish? Alternatively, doesn’t it speak ill of the grad courses if a large number of students can’t pass their quals?)

        • I think this was much more the case ~30 years ago than it is now. My own department got rid of its qualifying exam about 5 years ago, and I’m aware of several others in the past decade that have also done so. (That’s an interesting story in itself, and a complex one. I switched from being pro-exam to anti-exam, by the way.) It is true that many departments admit more students than will finish, but it is also true that the departments want people to finish — we’d be delighted if the dropout rate dropped. What’s going on isn’t a deliberate winnowing, and certainly isn’t admission to the program of people one expects won’t finish, but rather the reality that a decent fraction of people don’t complete a Ph.D. There are many causes of this. It is also not clear that this is “good” or “bad” — many people finish a Ph.D. who shouldn’t, and are granted degrees essentially for sticking around, to the frustration of hard-working and capable students.

        • Thanks, that makes sense!

          And to be clear, I agree that all Ph.D. programs would love it if all their students finished—that is the point, after all! I was only referring to a specific policy of using exams to “weed out” students, which I’m happy to hear is no longer popular (if it ever was). As you and Dale say, completion is often a matter of other demands on the students.

          I’d add that another reason some students don’t finish is that their interests change. Getting a Ph.D. takes a long time, during which people are learning and developing. Sometimes in the course of learning a lot about a specific field, you find that it doesn’t actually address what you care about or that there’s another domain that you’d prefer to contribute to instead.

        • “many people finish a Ph.D. who shouldn’t, and are granted degrees essentially for sticking around, to the frustration of hard-working and capable students.”

          Baddabing. This is the case with ***ALL*** degrees, and a lot of the time in companies too, so the idea that test scores will predict completion rates for grad students or success in the “after life” is a crock of shit to begin with.

        • You’d think so a priori, but it turns out the data say otherwise for completion rates. You’d have to ask somebody with actual knowledge of the field to see if anything is known about broader outcome measures.

        • Michael:

          Are you really making a strong inference based on statistical adjustment for data with cells as small as 23 using measures we know to be crude in their ability to differentiate at small increments?

          This study is not worth spending the time on, but the inference you are making is patently absurd.

        • @Curious. No. Read the damn papers. The original authors did something crazy to get groups of ~23 to artificially inflate confidence intervals to hide an effect. That particular data subset actually had N= ~2300, and the overall set had N=~4000. It turns out that was enough to see trends at better than 4 sigma level despite some collinearity with other predictors and despite range restriction plus artificially enhanced collider stratification bias.

        • Share the data and formulas you used Michael, because what I’m looking at are substantively tiny effects (though statistically significant – made larger by correcting for range restriction) produced by large sample sizes that are effectively meaningless given the crudeness of the measures.

        • @curious I agree that effect size is more important than statistical significance. That’s why I made a point of giving it in the paper, along with the calculation method. I have no access to any data other than what Miller et al. published. In fact, our interaction started when I asked for one correlation coefficient (GRE-Q and GRE-P) and the lead author wrote that he couldn’t give it because of “human subjects” issues. It took months and some pressure from others to get that coefficient.

          As I wrote in the papers, the effect size isn’t huge (around a factor of 2 odds ratio for GREs toward the top of the enrolled range compared to toward the bottom, holding GPA constant). The GPA effect was smaller, holding GRE constant. None of my published estimates make any correction for range restriction, but informally I’d estimate that would raise that odds ratio to ~3 for the current enrollees. Following the policy recommendation of Miller et al. to drop the exams entirely would mean extending admissions to people who are below the bottom of that current range, and thus would give a bigger odds ratio.

          These mediocre predictors are all that people have to work with, which is why they try to combine several of them to get better guesses.

        • Michael:

          There is plenty to criticize about that study, but at least they put their analytic results in a table to make it easy on the reader.

        • Curious:

          Indeed, this relates to an interesting point, which is that open data is a plus, regardless of the quality of a study. A high-quality study is even better if the data are available, and a low-quality study can be valuable despite its flaws if its data can be accessed.

        • Michael:

          Now that I’ve taken your sage advice and read your paper in greater detail, I see how the 1.5*GRE_P + GRE_Q brings the subgroup logit effects to ~ equal. However, I am wondering:

          1. Was the correlation between UGPA ~ GRE subtest scores shared with you?
          2. In the second paragraph on page 2 of your paper you state:

          “The net logit change between the 10th and 90th percentiles on that combined score would be reduced from the sum of the separate effects of the two scores [~0.46 and ~0.36 in the United States for Q and P, respectively, estimated from Figure 2 of (1)] by a factor (1.55/2)^(1/2) since their correlation is 0.55, giving a net logit effect of ~0.72.”

          It’s not clear to me where the effects ~0.46 and ~0.36 are being pulled from Figure 2 in the Miller et al paper and how you are calculating the combined effect of ~ 0.72.

        • @curious Good questions.
          1. The UGPA- GRE correlation was not initially shared with me. Finally under pressure it appeared in their arXiv follow-up. I forget what the exact value was but it wasn’t huge. You can look it up.
          2. I took Fig. 2 and blew it up to full size with a copying machine. Then I measured the points with a ruler and calculated the corresponding logits. You may notice that the numbers changed very slightly between my arXiv versions because I tried to be extra careful when I realized that an official publication would result.
          If these two predictors were fully collinear, then the range of their sum would be the sum of their ranges. They aren’t quite, so the range is reduced by the factor given in the paper. You take that range and multiply by the slope of the sum to get the effect size for the sum.

          BTW, I want to brag about something. In the first arXiv versions, when the authors were still hiding all the correlation coefficients, I had to guess the GRE-P-GRE-Q correlation. Guided by other test correlations, I guessed 0.707, out of convenience. I wrote ETS, who said they didn’t have it but later came back with the actual value: 0.70 for both Spearman and Pearson. They asked how I knew. The value of 0.55 in the data of the paper is reduced by range-restriction.

        • Michael:

          Another couple questions:

          1. Which formula should be used?

          https://advances.sciencemag.org/content/6/23/eaax3787/tab-pdf

          “From Figure 2 of (1) , we see that the GRE-P range in the U.S. is about 1.5 times as large as the GRE-Q range. Adding the Q coefficient to 1.5 times the GRE-P coefficient [from Table 2 of (1)]…”

          equation from sciencemag: GRE_Q + 1.5*GRE_P

          > gre_qp_sciencemag
          # A tibble: 4 x 5
          group GRE_P GRE_Q GRE_V GRE_QP

          1 All_N_3962_Logit 0.003 0.013 -0.001 0.0175
          2 US_female_N_402_Logit 0.0002 0.017 -0.001 0.0173
          3 US_male_N_1913_Logit 0.005 0.01 -0.000005 0.0175
          4 US_N_2315_Logit 0.005 0.01 -0.0001 0.0175

          https://arxiv.org/ftp/arxiv/papers/1902/1902.09442.pdf

          “The 10th to 90th percentile ranges for the U.S. group can be seen in Fig. 2 of (1), with GRE-P having ~1.5 times as large a range as GRE-Q in this cohort, so the equal-weighted sum is close to GRE-P+1.5*GRE-Q, i.e. 1.5*GRE-Q has about the same range as GRE-P.”

          equation from arxiv: 1.5*GRE_Q + GRE_P

          > gre_qp_arxiv
          # A tibble: 4 x 5
          group GRE_P GRE_Q GRE_V GRE_QP

          1 All_N_3962_Logit 0.003 0.013 -0.001 0.0225
          2 US_female_N_402_Logit 0.0002 0.017 -0.001 0.0257
          3 US_male_N_1913_Logit 0.005 0.01 -0.000005 0.02
          4 US_N_2315_Logit 0.005 0.01 -0.0001 0.02

          2. In an effort to “maintain minimal standards of competence and transparency” will you please share the values you used when you “… measured the points with a ruler and calculated the corresponding logit” from Miller et al Figure 2? If they are in the arxiv version, please point me to the page because I’ve missed them.

        • @Curious. You seem to no longer believe “This study is not worth spending the time on”.

          On “which equation”: If you multiply the GRE-Q score by 1.5 before adding it to GRE-P to make equal-range contributions, then you have to divide the GRE-Q slope by 1.5 due to the change of units. For the purposes of checking whether the net slopes on the subgroups are the same, that’s equivalent to multiplying the GRE-P slope by 1.5, since the overall scale is irrelevant.

          I’m very late for another obligation now but later will scrounge the figures to get the percentile ranges. You can easily do it yourself to check, just using these percentile ranges and the published slopes (together with the correlation coefficient) to get the effect size. Reading the y-axes to get logits is a consistency check.

          I should note that in their lengthy response, which took months and disputed more or less everything else, Miller et al. did not dispute that my read of their numbers was correct.

          I don’t understand the computer output you included.

        • Michael:

          I will say it’s now exceeded my interest given the person who claimed their paper was about transparency and research competence is pretending they did not make an error in describing their methods.

        • @curious. OK, I got off the phone and found the blown-up xeroxes of Fig 2.
          For GRE-P I got that the female odds went up from 62/38 to 70/30, for a logit of 0.358. For males 70/30 to 77/23, logit 0.361. Weighted average 0.36.
          For GRE-Q, females 59/41 to 70/30, logit 0.483. Males 68/32 to 77/23, logit 0.454, weighted average 0.46.
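
          A quick arithmetic check of those reads, and of the ~0.72 net logit and factor-of-2 odds ratio mentioned upthread (a sketch; the subgroup Ns are the ones in the table pasted earlier):

          logit_diff <- function(p_hi, p_lo) qlogis(p_hi) - qlogis(p_lo)
          p_f <- logit_diff(0.70, 0.62); p_m <- logit_diff(0.77, 0.70)   # GRE-P: 0.358, 0.361
          q_f <- logit_diff(0.70, 0.59); q_m <- logit_diff(0.77, 0.68)   # GRE-Q: 0.483, 0.454
          p_eff <- (402 * p_f + 1913 * p_m) / 2315                       # ~0.36
          q_eff <- (402 * q_f + 1913 * q_m) / 2315                       # ~0.46
          net <- (q_eff + p_eff) * sqrt((1 + 0.55) / 2)                  # ~0.72, since corr(Q, P) = 0.55
          exp(net)                                                       # about 2: the factor-of-2 odds ratio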

          That said, what the hell is going on with you?

        • @curious It feels stupid to still be answering this but what we have here is a simple change of variables in a linear equation. What you multiply the old variable by to get the new one is the inverse of what you multiply its coefficient by. What appeared in print was exactly right. I’d like to be snide about it but actually the first time I looked at it I made the same mistake as you.

        • Michael:

          I did not make a mistake, the error in description was yours.

          I simply calculated exactly as described in each paper, which said to multiply the logit for GRE-P in the sciencemag version and GRE-Q in the arxiv version. If the calculation should be conducted differently, it is your responsibility as the author to describe precisely how it is done so that it can be understood by your readers. Better yet, you could have included the actual equation, as you did in the arxiv version, and replicated Miller et al Table 2 with the new effects based on your calculations.

          That is the entire point of your article according to the arxiv version.

        • @ curious

          Once more into the breach.

          From arXiv:
          “ so the equal-weighted sum is close to GRE-P+1.5*GRE-Q, i.e. 1.5*GRE-Q has about the same range as GRE-P. Using data from Table 2 of (1) its coefficient (GRE-P coefficient +(1/1.5)*GRE-Q coefficient) is virtually identical in the entire sample (“All Students”) and the three subgroups described”

          From Sci. Adv.
          “Adding the Q coefficient to 1.5 times the GRE-P coefficient [from Table 2 of (1)], we find that the predictive coefficient of the equal-weight sum is the same to within a 1% range in the “All Students” total sample and in each of the three subsamples described:”

          In one case I multiplied the P coefficient by 1.5 before adding, in the other case divided the Q coefficient by 1.5 before adding. Since the question was whether the result was nearly invariant under choice of subsample, the constant scale factor of 1.5 between these two choices is irrelevant.
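
          A quick check with the Table 2 coefficients pasted upthread, using only numbers already quoted in this thread (All, US female, US male, US):

          gre_p <- c(0.003, 0.0002, 0.005, 0.005)
          gre_q <- c(0.013, 0.017,  0.010, 0.010)
          gre_q + 1.5 * gre_p    # Sci. Adv. form:  0.0175 0.0173 0.0175 0.0175
          gre_p + gre_q / 1.5    # arXiv form: the same numbers divided by 1.5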

          Maybe there’s a reason that GRE-Q, based on 9th grade math, is a good predictor.

        • Michael:

          I pointed out an error in your description. You ignored it and pretended it is the responsibility of the reader to understand what you meant. When you make your paper about transparency and competency, it is your responsibility to acknowledge any lack of clarity brought to your attention.

    • > I would think that grades would be more related to GRE scores than degree completion is.

      How would you handle drop-outs? Whose grades do you look at? Completion is a “hard endpoint”. I guess it may be more relevant than grades conditional on completion (assuming the question is not “what do GRE scores predict” but “who should we admit into the program”).

    • Besides the issues with collider bias and range restriction that are usually pointed out in these discussions, I’ve also always found program completion in technical, quantitative programs to be a strange metric to use, without first more thoroughly investigating reasons for why the student dropped out. All of my friends in grad school who mastered out of their physics / math / cs degrees did so after receiving very comfortable offers from tech startups and other industry employers, so their decision seemed less motivated by a difference in ability than values.

      • That’s great if your goal is to improve students’ lives. Not so great if your goal is to replicate a professoriate in your own image. As my advisor told me when I told him I was planning to leave to go into consulting: “Why did we even bother to educate you?”

        • That is a fascinating comment by your advisor. I only teach at the undergraduate level, but I have to admit that I feel like I want to send people to graduate school to get PhDs. The bias/drive is strong. Luckily my paycheck really only depends on teaching, so having my advisees change major or leave is no big deal. I’ve seen a few students fail to complete grad school, and that kinda hurts, but as long as they are successful at something in the end that is tangentially related to their degree, I am happy. I can imagine for an R1 professor when a graduate student leaves after investing time in their research, it is a big blow. That just highlights that we need to rethink graduate education. A graduate degree is the new/modern bachelors degree. If you want a high level quantitative/technical job, you really need advanced training.

    • All of us have the general impression that degree completion tends to be determined by things like seeing how miserable assistant profs are, getting an offer from Google, etc. So it would make a perfect outcome to use to show that GREs aren’t predictive. It’s surprising that a conventional competent analysis of the actual data showed that GREs actually are predictive. In my opinion that was strong motivation for the authors to then do many weird incompetent things to hide the predictive power.

  4. The model here is simply that a score on an achievement test taken prior to graduate school is predictive of variation in completion of the program post acceptance. This is a quite commonly modeled prediction and one where quite often the GRE is assumed to be a proxy for IQ (which it is not, though they are certainly correlated). This seems like a level of measurement precision well beyond that of a standardized test such as the GRE.

    What assumptions are implied by this model?

    1. That the GRE is able to measure the skills and abilities necessary to complete a graduate program in physics with enough precision that small variations in scores at the high end of the distribution will vary with completion of the program.

    2. Andrew said it is incorrect to adjust for quality of program, but I do not see how this would be different than the hierarchical models he often recommends for other modeling problems. (If the effect disappears when including program quality, that would tell me that the precision of the test to predict has been exceeded. Perhaps I am missing his point.)

  5. I know why conditioning on a post-selection variable is an error, but if I were on the admissions committee of either tier I would certainly want to see the dataset restricted to my particular tier, with sample selection adjustments made, either Heckman-style with an inverse Mills ratio on the probability of matriculation (where the probability of matriculation uses only grades to avoid [but not eliminate] confounding) or by a sample which adjusts for those accepted to both low- and high-tier programs.

    • Yes, here’s what I say on arXiv
      Interaction terms between rank and other predictors could be used to help different programs choose different criteria, but no such terms are included in the model reported (1) or in the later addition….
      As justification for the use of a highly stratified model, which they now concede can introduce bias, they argue that programs of different types may have different needs. (5) While true, the mathematical expression of such differences takes the form of interaction terms between the predictors of interest (e.g. GRE-Q) and the program rank or (in the new version) individual program variables. Such interaction terms were not included in the previous reports (1) (2), so that the reported results could not be used to obtain different criteria for different programs. The current report now says that such interaction terms were insignificant (5), thus again negating the stated rationale for including the biasing rank or program covariates.

      • Am I correct in this interpretation:

        Not including individual or tier level interaction is a mistake because GRE scores, etc., are used for admissions and then conditional on those scores that get a student admitted, other tier or program specific variables will determine completion?

        For example, when looking at SAT scores and college completion I take a sample that includes students from Cal Tech and Cal State Long Beach. Assume those schools have similar completion rates. If I ran my regression on the whole sample, even with random effects for programs, I would still not find a significant and positive relationship between SAT and college completion, even though we would expect students who score in the 90th percentile on the SAT to have higher completion rates than those who score in the 10th percentile if all those students went to the same program.

        • That is one sort of effect that could be present, that it’s harder to graduate from the top programs. There are other effects (e.g. differential funding) that can make it easier to graduate from the top programs. So the net effect there is of uncertain sign.

          The bias introduced in the S.A. paper and the response was of a different type. Their model included GRE & GPA plus a few less interesting predictors. It had to leave out ones with less standard scales, like undergrad research experience, letters of recommendation, quality of essays,… When you ask “what rank of program did a student end up in?” that outcome variable has causal antecedents both inside the model (GRE, GPA) and outside the model (see above). It’s called a collider between those causes. So when you stratify on it high inside-model predictors become systematically negatively correlated with outside-model predictors. That collider stratification gives a systematic negative bias to all the predictive coefficients for variables inside the model. It can be a very large effect, even flipping signs of coefficients. My paper includes some estimates for simple analytic distributions.

          Although many of the problems in this paper were uniquely comical, collider stratification is much more general. Almost every program will tell you that they don’t see much dependence of success on X, where X is an admissions criterion. That’s not just because of small-N stats. Each X collides with many other causes in determining which program a student ends up enrolling in. If a noticeable success vs. X remained in the enrollees, then that says the program should add or subtract weight to X in admissions evaluations, depending on the sign of the dependence. Under normal conditions, very little dependence should remain.
          So most of the single-program anecdotes just say that things are working about the way they are intended to work.

  6. The fact that program tier is the “most significant factor” but that their title is about something else feels like a red flag, no? If they tried a few variations of their model to see how the fit changed when variables were left out vs included, and the overall results didn’t change much, then that would be the important thing, yes? I feel like the most important thing is to always test a few different models to make sure you aren’t just creating a model with little justification that shows what you want to find. Otherwise you are just p-hacking.

  7. Wow, I think you’re burying the lede here. Science is eliminating Technical Comments? What a disaster. They’ve published some really atrocious articles that have never been retracted (e.g. google “arsenic life”). Recently, there was a bad paper published. I and 5+ other groups independently submitted technical comments, and only one got published. I guess they are dropping even the pretense that they are dedicated to research as a field of inquiry, as opposed to dogmatic announcements. Between this and their paywall policy (one of the most restrictive of the major journals), they are a disgrace to the field they purport to represent.

    • Science does have a technical comment in the current issue, and has eLetters as an alternative that are linked to the on-line version of the article and don’t have to be submitted within a couple of months of the publication. eLetters aren’t edited, which is fine, but are text only so they can’t include figures, which is not fine.

      I kind of like an old-fashioned style in which papers got published along with comments by named discussants, plus a response to the discussants by the authors.

      • The eLetters are a joke. They are far less visible than the main articles, and as you said can’t have figures. AFAIK, they don’t have DOIs either, so in terms of indexing they basically go down the memory hole.

  8. Isn’t the real issue that Weissman raises the difficulties with effectively correcting flawed claims and methods? I used to work in an area of applied ecology that was amazingly backwards methodologically, so I had a lot of experience with trying to do that, but I don’t have keen suggestions about successful approaches. Post-publication peer review seems like an advance, but as this example shows, it isn’t enough.

  9. Apart from statistical issues, it’s not clear to me what’s the point of Miller et al.

    The title says: “Aside from these limitations in predicting Ph.D. completion overall, overreliance on GRE scores in admissions processes also selects against underrepresented groups.”

    The introduction explains: “Unfortunately, nontrivial barriers impede admission to Ph.D. programs for some demographic groups. Undergraduate grades, college selectivity, and GRE scores are the three criteria that best predict admission to U.S. graduate programs, but these parameters are not evenly distributed by race and gender.”

    The conclusion seems to be that GRE scores shouldn’t be used because they select against underrepresented groups and they don’t work anyway: you can predict doctoral completion just as well with undergraduate grades. But no reason is given to think that admissions processes based on GPA only, ignoring GRE scores completely, would be less selective against underrepresented groups. The distribution of GPA by race and gender is not discussed, apart from a mention of underrepresented minorities going to public universities where grades are lower than in private universities, so “applying UGPA thresholds would indirectly favor White students, posing a risk to broadening participation aims.”

    Looking for distributions of GPA, I found another paper on the subject that may be interesting. (Disclaimer: I’ve not read it, I see they do some mediation analysis but I don’t know if it makes sense)

    The Physics GRE does not help applicants “stand out”
    Young and Caballero
    https://arxiv.org/pdf/2008.10712.pdf

    “We find that for applicants who might otherwise have been missed (e.g. have a low GPA or attended a small or less selective school) having a high physics GRE score did not seem to increase the applicant’s chances of being admitted to the schools. However, having a low physics GRE score seemed to penalize otherwise competitive applicants.”

    • > The title says: “Aside from these limitations in predicting Ph.D. completion overall, overreliance on GRE scores in admissions processes also selects against underrepresented groups.”

      I meant the abstract. The title is also a variation on the same theme: “Typical physics Ph.D. admissions criteria limit access to underrepresented groups but fail to predict doctoral completion”

    • Yeah, I read that one recently. What it shows is that GRE is used moderately in admissions. It can help or hurt near the borders. Their point is that there are few applicants with otherwise very weak records who get in just via GREs. That sounds pretty reasonable.
      I suppose if they saw the opposite result the point would be that clearly unqualified applicants are using the GRE to replace qualified ones. The conclusion, GRE=bad, is a given.

    • ‘The conclusion seems to be that GRE scores shouldn’t be used because they select against underrepresented groups and they don’t work anyway: you can predict doctoral completion just as well with undergraduate grades. But no reason is given to think that admissions processes based on GPA only, ignoring GRE scores completely, would be less selective against underrepresented groups. The distribution of GPA by race and gender is not discussed, apart from a mention to underrepresented minorities going to public universities where grades are lower than in private universities, so “applying UGPA thresholds would indirectly favor White students, posing a risk to broadening participation aims.”’

      I think this is all part of the larger debate over whether GRE scores of various sorts should be used for grad school admissions, and specifically whether they cause equity problems. AFAICT progressive opinion has dramatically shifted on this in recent years. Current opinion is that the tests are unfair to underrepresented groups because (basically) wealth buys better access to preparation (whether general schooling or test prep courses). What I don’t get is how reducing admissions criteria to basically just GPA, undergrad school reputation, and letters of reference helps — all of these seem CLEARLY to be likely to be much MORE subject to favour-the-well-connected-and-rich bias than the GRE test. I can’t fathom why this isn’t the primary issue being raised in the GRE/admissions discussion. Maybe it’s just easier to delete the GRE and thus claim you did something about equity in admissions.

      I also wonder why the key statistical goal has become “criteria that predict PhD completion.” Presumably, if 2 students had identical physics ability, the rich one would have a better chance on average of finishing a PhD than the poor one, for all kinds of obvious reasons (rich students have less need to work other jobs, can take more risks since they have a backup source of support, they may be less motivated to leave for industry, they have more academics in their family and thus less “are you finished with school yet” pressure, etc. etc.).

      But I assume we all agree that it would be a bad idea for admissions committees to rank wealthier students higher! (although arguably the reputation of the undergraduate school is a close correlate anyway)

      • Yes to all that. But for me the big issue is not admissions policy. It’s whether in our pursuit of various goals (virtue, status, grant money,…) we completely shit on basic scientific methods. If we do, what was it that we were supposed to be offering the world? Why would anybody listen to us about a pandemic or the climate?

      • I’m very late to this conversation, but a basic reason to focus on graduation rates is that – in general – failing to graduate is the “outcome to be avoided.”

        Pursuing an undergraduate or graduate degree is generally a good investment (regardless of your familial income), but failing to receive your degree – but paying for a few years of education – is generally a bad idea, both for the University and the failing student. So choosing to admit a student who won’t get anything out of the degree and also won’t be paying for the full term is something to be avoided.

  10. Stated finding: “…do not predict completion”
    Reviewer comment: “…the purpose of the analysis is to consider admissions decisions”
    Not all issues for one DV are issues for another DV, so getting the DV right is important to evaluating the analysis.

    “And the issue of multicollinearity is not that it’s a concern with the regression … the problem is in the interpretation of the coefficients.”
    Well, seeing as the two points are “(a) when you include them both as predictors in the model, each of their individual coefficients will end up with a larger standard error, and (b) … students scoring higher on the Quantitative test will, on average, score higher on the Physics test too, and you have to account for that in making your prediction,” it seems to be both.
    With VIFs <2, (a) doesn't seem to be a huge issue. And for (b), it depends on the question you're interested in- a picture of the incremental contribution of the measure or just the overall predictive value of the measure. The former is clearly the interest of this analysis, and the interest of admissions committees seeking applicants likely to complete the program- unless they're starting off UGPA-blind.
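
    As a rough check, treating it as just the two correlated scores (a simplification, since the real model has more predictors, so its VIFs need not equal this exactly):

    # With only two correlated predictors, VIF = 1 / (1 - rho^2); rho = 0.55 from upthread
    1 / (1 - 0.55^2)   # about 1.43, consistent with the reported VIFs < 2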

    Weissman also focuses on using an equal-weight sum instead of both predictors separately, which further highlights the point of the original paper- at worst these indicators are not providing unique information, and at best GRE-Q might provide a little. A common factor, like "underlying quant ability," adequately indicated by UGPA, might be enough. If you require the GRE (which includes GRE-Q), then you don't need to require the GRE-P. It's definitely good that he generated the response from the authors that had additional analyses to drive the point home and do some model-checking though.

  11. My analysis specifically described the incremental predictive power in a model that included GPA. It was substantial and easily statistically significant. In the US group it exceeded the incremental predictive power of GPA by a bit. Overall, it exceeded the incremental power of GPA by a lot. And then all those coefficients are underestimates due to collider stratification bias.

  12. Why did I suggest pre-registration? Partly just to try to include something constructive. But also I think that it might have been easier for reviewers to suggest more valid statistical methods before the data came in, because afterwards it was clear that standard methods would lead to an un-desired conclusion. Maybe it wouldn’t have mattered.

  13. “…the Graduate Record Examination (GRE) Quantitative, Verbal, and Physics Subject Tests do not predict completion as effectively as admissions committees presume.”

    The use of “and” here pretty clearly indicates a joint hypothesis, yet nowhere is this hypothesis tested.

  14. What surprised me about the article and the discussions here is that I thought it is intuitive NOT to adjust for post-treatment variables. Clearly, I was wrong.

    A study in 2016 showed that post-treatment conditioning is present in nearly half of the published papers in American Political Science Review, the American Journal of Political Science, and Journal of Politics. (source: https://www.dartmouth.edu/~nyhan/post-treatment-bias.pdf)

  15. It shouldn’t surprise anyone that test scores don’t correlate to either grad rates or “success” because:

    1) many profs are desperate for students and will take any crappy student and push them through so they can have a department funded slave;
    2) many students get sick of the academic bullshit and just get a job and leave;
    3) even in companies, just not pissing anyone off is a more effective way to rise in the company than performing well – which often means pissing off entrenched people and blocking the path to advancement.

    Claiming the **testing** is an “unfair admissions policy” is comical. No doubt a claim made by someone who didn’t do well on the test.

    The problem isn’t the admissions policy. It’s the graduation and success criterion.

    • Yes, good points. Except that the test scores do correlate significantly positively with graduation, even when controlling for undergrad GPA and despite negative compensatory effect bias from other predictors. Only strenuous butchering of statistical methods allowed the authors to hide that correlation. Maybe when they set out to collect the data they didn’t expect a correlation.

  16. > Miller et al. are thus making the mistake of adjusting for post-treatment variables (see section 19.6 of Regression and Other Stories for more on why you shouldn’t do that).

    I’m not sure I get it. What’s the treatment here? The GRE score?

    • Only following this loosely, haven’t read everything, but I think…

      (a) the “treatment” is the GRE score(s) (left in or out as one of the predictors of completion)

      (b) the response is completion/non-completion, and

      (c) the rank of the Physics program the student was in, is being treated as another predictor — when in fact it’s another response (because it may itself depend on GRE scores etc.)

      Adjusting for (c) may therefore weaken or reverse any signal for (a) predicting (b). (? – best guess)

      • If GREs are the treatment, someone who lives by the rule of “never adjust for post-treatment variables” would also want to leave out GPAs. The GRE is often taken before graduation, sometimes even before finishing the first year of college.

        • Yeah, living by rules can be a problem. A somewhat better rule would be more like “Don’t adjust for downstream variables in the causal diagram”. Time order is just a way to guarantee that something isn’t causally downstream. In this case, GRE probably has little effect on GPA. But GPA and GRE both have strong causal effects on which grad school you get into.

        • I had only skimmed your paper, but I think you discuss the issue in more detail than saying “you just don’t”. My comment was about Andrew’s remark. He points to chapter 19.6 “Do not adjust for post-treatment variables”. To be fair, the message is weakened already in the first paragraph where it says “it is generally not a good idea”. I think the next chapter, 19.7 Intermediate outcomes and causal paths, is also relevant. Even if the bottom line is still “if you do that bad things can happen to you”, it makes clear that the issue is complex.

          (By the way, I tried a regression using the data in Figure 19.10 and I got a coefficient of -1.62, not -1.5. It’s possible that I did something wrong, though.)

        • But to clarify further: The real treatment here is admissions committees looking at GREs.

          But there was little direct way to look at that, since at the time the look-at-GREs treatment was universal. So (although their description was very muddled) what Miller et al. did was to construct a model of what causes graduation, and try to infer what the effect of the admissions looking at GREs treatment would have been from the coefficients of that other causal model. Whatever influences GRE scores is one of the causal factors in this other model.

          That really wasn’t a crazy thing to do, except that they screwed up everything about how they did it.

        • If the real treatment is admissions committees looking at GREs[1], the real effect we want to estimate is the difference between the case where they look at some scores and the counterfactual where they look at different scores (but the student is the same)?

          Say I change the treatment assignment and increase the scores for one student in the application documents that one admission committee will receive. What would be the effect? If the student was admitted (and enrolled [2]) either way, would the probability of completing the degree be different?

          One could imagine a mechanism, depending on the “scope” of the treatment. Maybe it gives access to more money, or a better advisor, increasing chances. (or maybe higher expectations lead to disappointment, conflict and quitting). But if the “treatment” ends there, why should the probability of completing the degree change?

          In the case where the “treatment” (which is a manipulation of the scores the admission committees consider, not a change in the student’s ability) results in admission to a different program, it’s more plausible that the probability of completing the degree changes. Many things will change as a result. Maybe a better program is harder, increasing the probability of dropping out. Maybe having better resources, or more motivation because the post-degree outlook is brighter, increases the probability of doctorate completion. But it would also make sense in that case to say that the “real treatment” was admission to that program and not the score.

          [1] I guess you have then many treatments, one per admissions committees (let’s assume for simplicity that all the members look at them at the same time and the “treatment” happens when they meet to look at the scores). For each student, including some that won’t be accepted anywhere, there will be one or more treatments and the outcome (finishing or no the degree) will be measured for at most one of the treatments (conditional on the program accepting the student and the student “accepting” the program).

          [2] The “change of treatment” doesn’t affect the other admission processes.

        • That’s not quite it. This is exactly the somewhat tricky point that Jamie forced me to think more clearly about back in early 2019. (Seems like another universe.)
          The initial question is not “What do you change to increase the likelihood that a given student would graduate?”
          It’s “How can an admissions committee pick a set of students who will have a high graduation rate? Does including GREs help pick?” The treatment is for the committee to look at the real scores. Because there were basically no places that didn’t include GREs, they couldn’t even use sophisticated techniques to simulate an RCT on that question.

          So what they did in effect was to look at a different question: “What measurable individual traits predict graduation? Do GREs help predict?” This is close to the causal question of “What traits cause a student to graduate?”, which encouraged me to use some causal language (“collider bias”) to distinguish ways in which their analysis systematically biased their estimates from ways in which their analysis systematically lost signal-to-noise.

          What does the second question, which they address though incompetently, have to do with the first one, the real policy question? The implicit idea is that if admissions committees choose the individual students with the strongest markers of traits leading to graduation, they will end up with a cohort with a higher graduation rate. With various caveats, that part is useful.

          I think you are suggesting another experiment: substitute randomized scores for the real ones, and see how those scores affect graduation rates, mediated by their effect on which program the student goes to. That would actually be a good way (though unethical) to measure the direct effect of program rank on graduation rate. Then in the “what makes a student likely to graduate?” question, one could properly adjust for that direct effect of rank. The sign of that direct effect is unknown.
          The adjustment used for program rank instead ended up adjusting for a classic collider, giving systematic negative bias. Not because colliders always have to give negative bias, but because we’re pretty sure that the things collided with in admissions here (prior research work, reputation of undergrad program, letters, …) all are counted with a positive effect sign by the admissions committees.

        • Michael:

          To do this properly one must also be able to measure and adjust for the possibility of disparate treatment at the individual & group levels post acceptance.

        • @curious I don’t think that would be needed if there were a genuine randomized input, such as randomly faked GRE scores. For the sort of pseudo-random effects I used to make an informal argument about this, since nothing else was available, yes, you certainly need to think about such possible confounders.

        • > The treatment is for the committee to look at the real scores.

          What does “treatment” mean? Is there a way, even hypothetical, that a different “treatment” had been applied? I don’t see how, if they look at the real pre-existing scores. Or maybe the alternative “treatment” would be not to look at the scores? If “treatment” is just an empty label that we can attach wherever, what’s the relevance for the analysis?

          If the question you care about is “How can an admissions committee pick a set of students who will have a high graduation rate?” then you have as many questions as admissions committees. Each admissions committee wants to pick a set of students who will have a high graduation rate in their program. Wouldn’t then the relevant correlation be conditional on the program? If in every program better scores mean better outcomes then it seems that the mistake could be to look at pooled data and conclude the opposite (Simpson’s “paradox”).

        • @ Carlos The treatment choices are “look at GREs” or “don’t look at GREs” . Both of these choices are now actually in use at different departments. Also some intermediates like “look at GREs if submitted but don’t require them”. I believe some departments have a flat policy that no one is allowed to submit GRE scores.

          So in principle one could compare otherwise very similar departments to see which policies are getting better results for the department, not for individuals. I’ve proposed an entirely feasible RCT that departments that can’t decide should form a pooled group and accept random assignments of GRE policy, which would save many argumentative department meetings. Nobody is interested.

          Yes, the effect of these admissions policy choices on graduation rate will depend on the program. I discuss that above under “interaction effects”. You could get that the treatment effect depends on things like program rank, theory/experiment balance, etc. Miller et al. used that as a retrospective excuse to include rank in their regressions, but since they didn’t include interactions with rank they introduced collider bias while getting exactly zero sensitivity to those differences in treatment effect between programs.

        • Thanks, I was confused and thought that “the real treatment here” meant the treatment that should be considered in the discussion of the Miller et al. analysis.

          In that case the treatment is at the PhD program level. It has the values “include GREs in the selection process” and “do not look at GREs in the selection process”. The time of the treatment would be when that choice was made at each department. Many things would be post-treatment, and some of them could have a meaningful impact on the outcomes, like the choices of students of what programs to apply for. Outcomes for one “subject” (department) would also depend on the “treatments” received by the other “subjects”. Changes in admissions policies elsewhere would change the composition of the pool of applicants that would enter the program if accepted. Many complex dynamics could appear. For example, assuming GREs are informative, maybe second-tier programs would see the “information content” of GREs increase when the programs upstream make a less efficient use of that information.

          In any case, that would be a very different “treatment” and doesn’t correspond to the “effect” estimated in that paper. It doesn’t seem relevant for the “mistake of adjusting for post-treatment variables” question.

          To be clear, I don’t say it’s not a mistake in that case. I say it ain’t necessarily so.

          For example, if we wanted to determine if GRE predicts doctorate completion and we had data about two programs with 40 students each

          A: 50/50 mix of scores 2 and 3, 50% completion rate for the former, 75% for the latter

          B: 50/50 mix of scores 3 and 4, 25% completion rate for the former, 50% for the latter

          and the regressions looked like this: https://imgz.org/i8vFtUPX/

          conditioning on school seems more appropriate than not doing it.

          Of course many assumptions about what we are seeing and what we want to know are required. That’s the point.
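
          For concreteness, here is a sketch of that toy example with exact counts matching the stated rates (the code is mine, not taken from the linked figure):

          cell <- function(program, score, n, n_complete) {
            data.frame(program = program, score = score,
                       complete = rep(c(1, 0), c(n_complete, n - n_complete)))
          }
          d <- rbind(cell("A", 2, 20, 10), cell("A", 3, 20, 15),
                     cell("B", 3, 20,  5), cell("B", 4, 20, 10))

          coef(glm(complete ~ score,           family = binomial, data = d))["score"]  # ~0: pooled, no signal
          coef(glm(complete ~ score + program, family = binomial, data = d))["score"]  # positive within programs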

  17. Skimming Weissman’s article and the response, which ends with:

    “Meanwhile, the question remains of what use should be made now of the actual predictive power of the GREs. That involves non-technical considerations rather than p-values. The issue of how our profession should choose its new members faces a variety of not always parallel social goals and is fraught with uncertainties. Despite these difficulties, finding the best selection method is trivial in one limiting case. If we do not try to maintain minimal standards of competence and transparency or even basic logic in our treatment of data, then the optimum group of students whom we should be educating is the empty set.”

    (bolds left out)

    Ouch! Tell us how you really feel, why dontcha!

  18. P.s. Another physics education paper has come out recently, without a hodge-podge of creative errors but with one giant interpretative error. See https://arxiv.org/abs/2011.06678.

    The journal (Physical Review Physics Education Research) has declined to publish my comment but has suggested that a survey of messed-up causal inference in their published papers would be welcome.
    “…I’d like to encourage Dr. Weissman to comb through PRPER to find some other examples of authors making the same unwarranted leap in their “implications” sections, and write a Short Paper for PRPER explaining this critique and arguing that it applies fairly widely–that it’s a mistake authors commonly make.”

    So please, if anybody knows of any papers that should be covered, please let me know. If anybody is really enthusiastic about this project maybe we could collaborate. It’s quite unusual for a journal to request critical comments on their own papers.

    • I think I remember an Irish student many years ago describing having gone through a system very much like that. If I understand this Wikipedia passage, it seems that it’s still in use.
      “Ireland
      In Ireland, students in their final year of secondary education apply to the Central Applications Office, listing several courses at any of the third-level institutions in order of preference. Students then receive points based on their Leaving Certificate, and places on courses are offered to those who applied who received the highest points.”
