I sometimes but do not always ping people on these. In this case, motivated by your above question, I sent the person an email.

I’d try to work directly with quantities of interest, such as probabilistic predictions. For instance, look at the calibration of probabilistic predictions: if the algorithm says result A is 70% likely in 100 cases, how many of those cases were really category A? The standard hypothesis test would take binomial(100, 0.7) as the expected distribution of the number of category-A cases for a well-calibrated model.
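
To make the calibration check concrete, here is a minimal sketch of that binomial test using only the standard library. The function names (`binom_pmf`, `calibration_p_value`, `central_interval`) are made up for illustration, and the two-sided p-value uses the "outcomes no more likely than the observed one" convention, which is one choice among several.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def calibration_p_value(observed, n, p):
    """Two-sided exact binomial p-value: total probability of all counts
    that are no more likely than the observed count under binomial(n, p)."""
    p_obs = binom_pmf(observed, n, p)
    return sum(
        binom_pmf(k, n, p)
        for k in range(n + 1)
        if binom_pmf(k, n, p) <= p_obs * (1 + 1e-9)  # tolerance for float ties
    )

def central_interval(n, p, level=0.95):
    """Counts at the (1-level)/2 and (1+level)/2 quantiles of binomial(n, p)."""
    lo_q, hi_q = (1 - level) / 2, (1 + level) / 2
    cdf, lo, hi = 0.0, None, None
    for k in range(n + 1):
        cdf += binom_pmf(k, n, p)
        if lo is None and cdf >= lo_q:
            lo = k
        if hi is None and cdf >= hi_q:
            hi = k
    return lo, hi

if __name__ == "__main__":
    # 100 cases predicted at 70%; suppose 58 of them were really category A.
    print(central_interval(100, 0.7))
    print(calibration_p_value(58, 100, 0.7))
```

By the normal approximation the central 95% range is roughly 70 plus or minus 9, so an observed count far outside that range would suggest the 70% predictions are miscalibrated.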

We also recommend a lot of posterior predictive checks, so I’m surprised Andrew didn’t do that. It’s a bit tricky with logistic regression because there’s no generative model of the features, so you can only replicate the predictions. But that can help you see at least if the marginals are right (number of instances of each category).
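
A rough sketch of that kind of check on the marginals, assuming you already have posterior draws of the predicted probabilities for each item (the `prob_draws` argument below is hypothetical: one list of per-item probabilities per posterior draw). It replicates the labels from each draw and compares the replicated count of category-1 items with the observed count.

```python
import random

def ppc_count_p_value(y_obs, prob_draws, seed=0):
    """Posterior predictive check on the number of category-1 items.

    y_obs:      observed 0/1 labels.
    prob_draws: one list of predicted probabilities per posterior draw
                (assumed input; each list has the same length as y_obs).
    Returns the observed count, the replicated counts, and the fraction
    of replicated counts that are at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = sum(y_obs)
    rep_counts = []
    for probs in prob_draws:
        # Simulate one replicated data set from this draw's predictions.
        y_rep = [1 if rng.random() < p else 0 for p in probs]
        rep_counts.append(sum(y_rep))
    p_value = sum(c >= observed for c in rep_counts) / len(rep_counts)
    return observed, rep_counts, p_value
```

A fraction very close to 0 or 1 would indicate that the model's replicated data sets systematically contain fewer or more category-1 items than the real data.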

If you’re going to keep AUC, at least come to grips with the fact that the average AUC comes with some uncertainty and try to report that. When the weatherperson reports a 70% chance of rain, I want to know if their prediction is reliable (whether there really is a 70% chance of rain), not what its AUC is. And given two calibrated weather forecasters, I want the one making sharper predictions (closer to 1 or 0 for categorical results, or smaller posterior intervals for continuous predictions).
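
One simple way to attach uncertainty to an AUC is a percentile bootstrap over the test items. The sketch below is only an illustration under stated assumptions: binary 0/1 labels with both classes present, the pairwise (Mann-Whitney) definition of AUC, and made-up helper names `auc` and `bootstrap_auc_interval`. It is not a substitute for full Bayesian predictive inference.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random positive item outscores a
    random negative item, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_interval(labels, scores, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the AUC.
    Assumes labels contain both classes; resamples that contain only one
    class are redrawn because the AUC is undefined for them."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lo, hi
```

Reporting the interval alongside the point estimate makes it easier to tell whether two classifiers' AUCs actually differ.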

You are using full Bayesian predictive posterior inference for those regressions, right?

I agree that DV, Inshallah, or something secular like “If all goes well” is often very appropriate in discussing research (especially research plans).

“Start with a known universe, simulate fake data from that universe, then apply procedures (1) and (2) and see if they give much different answers.”

Or maybe just run Daryl Bem papers through it?

How is “ground truth” ascertained in terms of replication?

I kinda like the idea of researchers inserting the word “Inshallah” at appropriate points throughout their text. “Our results will replicate, inshallah. . . . Our code has no more bugs, inshallah,” etc.

Or actually, I’d start with something simpler: count chi-squared stats for unigrams, bigrams and trigrams without any document vectors. Cause you’re more interested in insight than just having a marginally better classifier, right?
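
As a rough sketch of what that could look like, the code below scores each unigram, bigram and trigram by a 2x2 Pearson chi-squared statistic (n-gram present or absent in a document, against a binary label). The assumptions are mine: documents are already tokenized, labels are 0/1 (say, replicated or not), presence is counted per document, and the function names are invented for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def rank_ngrams(docs, labels, max_n=3):
    """docs: list of token lists; labels: 0/1 per document.
    Returns (n-gram, chi-squared) pairs, highest score first."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count each n-gram at most once per document.
        grams = {g for n in range(1, max_n + 1) for g in ngrams(tokens, n)}
        (pos_df if y == 1 else neg_df).update(grams)
    scores = {}
    for g in set(pos_df) | set(neg_df):
        a, c = pos_df[g], neg_df[g]
        scores[g] = chi2_2x2(a, n_pos - a, c, n_neg - c)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Reading the top of the ranked list gives a direct view of which phrases separate the two groups, which is the kind of insight a document-vector classifier tends to hide. With many n-grams and few documents the scores are noisy, so they are better read as a ranking than as formal tests.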

Anyway, don’t forget to run your paper through this, cause if it says “unreproducible” you’ve got a Gödelian nightmare!

Cool idea for a study! Though I agree with Andrew’s comment about not wanting to think of study outcomes in a binary sense… Just to make sure I’ve got this straight, these are your two proposed methods? Please excuse the craptastic pseudocode:

(1)

for i in 1:100
    split data randomly into groups A, B, C
    for Te in [A, B, C]
        Tr = the remaining data not in Te
        train doc2vec on Tr
        infer doc2vec vectors for Tr and Te separately
        train LR on Tr vectors
        Y[i,Te] = apply LR on Te vectors
compute single AUC score, from average Y (over shuffles)

(2)

for i in 1:100
    split data randomly into groups A, B, C
    for Te in [A, B, C]
        for j in 1:100
            Tr = the remaining data not in Te
            train doc2vec on Tr
            infer doc2vec vectors for Tr and Te separately
            train LR on Tr vectors
            Y[j,Te] = apply LR on Te vectors
    aucs[i] = compute AUC from average Y (over the j re-trainings)

So for method 2 you get a distribution of AUC scores. I assume the bottleneck here is training doc2vec (as opposed to the logistic regression)? Looks like with method 1 you only have to train a doc2vec model 300 times, but with method 2 you’ll have to train a doc2vec model 30,000 different times! I don’t do NLP so correct me if I’m wrong but seems like that’ll take… a while.

But, you’ve got a good point about wanting to separate the variation due to sampling (the random sampling that occurs during your cross-validation splits) from the variation due to the non-determinism of training a doc2vec model. Or at least I like the idea of getting a distribution of AUC scores instead of a single value.

Maybe as a compromise between methods 1 and 2, you could train doc2vec models on an independent dataset and then use those (already-trained) for inference during your shuffles. For example, you could train 100 doc2vec models on some independent dataset X (of papers in the same field as the ones in your current dataset or something, but for which you don’t need info about replications, just the text). Then, for each of those models, you could run 100 shuffles of the cross-validated logistic regressions. So, in the form of more crappy pseudocode,

for i in 1:100
    re-train doc2vec on independent dataset X
    for j in 1:100
        split data randomly into groups A, B, C
        for Te in [A, B, C]
            Tr = the remaining data not in Te
            infer doc2vec vectors for Tr and Te separately
            train LR on Tr vectors
            Y[Te] = apply LR on Te vectors
        aucs[i,j] = compute AUC from Y

That way you only need to train a doc2vec model 100 times. But you can still see how the variance of the AUC is affected by sampling (by averaging aucs over the i dimension) separately from how the variance is affected by the non-determinism of doc2vec (by averaging aucs over the j dimension). Though it seems to me it would be most useful to just measure the total uncertainty (by looking at the variance of all the values in aucs, without averaging).
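
Here is a small sketch of how those summaries could be computed once the grid of scores exists. It assumes `aucs[i][j]` holds the AUC for doc2vec model i and shuffle j, and that shuffle j uses the same random split for every model i (for example by seeding the split with j). The pseudocode above does not guarantee that, and without it the per-shuffle averages are harder to interpret. The function name `summarize_auc_grid` is made up.

```python
from statistics import mean, pvariance

def summarize_auc_grid(aucs):
    """aucs[i][j]: AUC for doc2vec model i and cross-validation shuffle j."""
    n_models, n_shuffles = len(aucs), len(aucs[0])
    # Average over j (shuffles): what remains varies with the doc2vec model.
    per_model = [mean(row) for row in aucs]
    # Average over i (models): what remains varies with the data split.
    per_shuffle = [
        mean(aucs[i][j] for i in range(n_models)) for j in range(n_shuffles)
    ]
    flat = [a for row in aucs for a in row]
    return {
        "doc2vec_variance": pvariance(per_model),
        "sampling_variance": pvariance(per_shuffle),
        "total_variance": pvariance(flat),
    }

if __name__ == "__main__":
    toy = [[0.6, 0.7], [0.8, 0.9]]  # toy numbers, not real results
    print(summarize_auc_grid(toy))
```

The total variance is the one to report if the goal is simply an honest uncertainty for the AUC; the two components are mainly useful for deciding where more replication would help.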

Is DV a typo (i.e., was CV, for cross-validation, intended?), or does it mean “Deo Volente” (literally “God willing”, or figuratively “If all goes well”)?

I found the following study concerning the means of such “repeated CVs”, and while I haven’t read it in depth, it appears to be pretty pessimistic about the mean of CV statistics being useful at all:

https://core.ac.uk/download/pdf/34528641.pdf