Brendan Nyhan writes:
I’d love to see you put some data in here that you know well and evaluate how the site handles it.
The webpage in question says:
Upload a data set, and the automatic statistician will attempt to describe the final column of your data in terms of the rest of the data. After constructing a model of your data, it will then attempt to falsify its claims to see if there is any aspect of the data that has not been well captured by its model.
I can’t imagine this could really work with today’s statistical technology, but I’m supportive of the idea. You have to start somewhere, and even a demo project that doesn’t do much will still give some insight into what can be done and what some useful next steps might be. So I wish them luck.
To try it out, I entered this dataset of well switching in Bangladesh, which we used to illustrate logistic regression in chapter 5 of ARM.
In the first version of this post I reported that running the program with this dataset returned an error, but commenter Zbicyclist pointed out that the file had some missing data, and the webpage for this program very clearly says that no missing data are allowed (which is the case in Stan as well). So I went back and removed the rows with missing data (yeah, I know, imputing the missing values would’ve been better).
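Dropping the incomplete rows is quick in R. A minimal sketch, assuming the data have been read into a data frame called wells (the filenames here are hypothetical):

# read the well-switching data; "wells.csv" stands in for the actual file
wells <- read.csv("wells.csv")

# keep only the complete cases (again, imputation would have been better)
wells <- na.omit(wells)

# write the cleaned data back out to upload to the Automatic Statistician
write.csv(wells, "wells_complete.csv", row.names = FALSE)

I reran with the cleaned file, and then I got the following result: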
Fair enough. So I waited a few minutes (actually, I typed the above paragraph and loaded in the above image) and then hit refresh on the page.
It then gave me a 6-page report, which didn’t look so bad! It pretty much did what I might do—almost. It fit a model, plotted a bunch of residuals, and computed some numerical summaries.
I’ll give some comments, but first, here’s the report:
And here are my quick thoughts:
1. As I said before, it looks pretty good. Small, crisp graphs, clear captions, clearly labeled sections, and a fairly seamless integration of the canned phrases, so that it reads like written English, not like robot-speak.
2. It’s mostly transparent. What I mean here is that, when it uses natural-language constructions, it’s typically pretty clear to me what underlying computation was being done.
3. The report is written in the first person (“I have interpreted the data,” “I have compared,” etc.). I think I understand why they did this: it personalizes the computer program, and maybe that’s a good way to get the user involved. But I find it confusing. I’d prefer for “I” to be replaced by “Autostat” (a shorthand for the name of the program, “Automatic Statistician”), thus “Autostat interpreted the data,” etc. One big trouble with “I” is that it’s easy to imagine a researcher cutting and pasting chunks of this into a report without ever reading it!
Just imagine if Ed Wegman got his hands on this program.
4. The program fit a linear regression model, which of course isn’t ideal for a binary outcome. But that’s fair; the documentation actually says that linear regression is all it can do right now, as this is still a prototype version of the program.
5. The data came from a survey of people in some villages in Bangladesh who were living in homes with high levels of arsenic in their drinking water. The variable being predicted, “switch,” equals 1 if the person being surveyed expressed willingness to switch to a neighbor’s well with low-arsenic drinking water, and 0 otherwise. The predictors are “arsenic” (the current arsenic level of the person’s well), “dist” (the distance to the neighbor’s well), “assoc” (whether the person is a member of a community association), and “educ” (the respondent’s number of years of formal education).
I was surprised to see that Autostat only wanted to include arsenic as a predictor: when I was fitting the model for chapter 5 in ARM, I ended up including arsenic, dist, their interaction, and also educ.
What’s going on is that there’s a lot more information in the logistic than in the linear regression. Here’s a quick linear regression in R:
lm(formula = switch ~ arsenic + dist + assoc + educ)
            coef.est coef.se
(Intercept)  0.21     0.01
arsenic      0.19     0.01
dist         0.00     0.00
assoc       -0.01     0.01
educ         0.00     0.00
---
n = 6498, k = 5
residual sd = 0.44, R-Squared = 0.18
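That output is in the format of the display() function from the arm package; for anyone following along, the call would be something like this (assuming the data frame is named wells):

library("arm")  # for display()

# linear probability model: regress the binary switch outcome on all four predictors
fit.1 <- lm(switch ~ arsenic + dist + assoc + educ, data = wells)
display(fit.1)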
Umm, let’s rescale distance and education to make this more interpretable:
lm(formula = switch ~ arsenic + I(dist/100) + assoc + I(educ/4))
            coef.est coef.se
(Intercept)  0.21     0.01
arsenic      0.19     0.01
I(dist/100) -0.02     0.02
assoc       -0.01     0.01
I(educ/4)    0.01     0.01
---
n = 6498, k = 5
residual sd = 0.44, R-Squared = 0.18
OK, so after rescaling, the coefficients for distance and education are about the same size as their standard errors: with a linear model there really isn’t room to estimate coefficients for distance and education. Who knew?
Just to check, I’d better re-run the logistic regression:
glm(formula = switch ~ arsenic + I(dist/100) + assoc + I(educ/4),
    family = binomial(link = "logit"))
            coef.est coef.se
(Intercept) -1.40     0.06
arsenic      1.06     0.04
I(dist/100) -0.30     0.10
assoc       -0.06     0.06
I(educ/4)    0.07     0.03
---
n = 6498, k = 5
residual deviance = 7383.6, null deviance = 8702.4 (difference = 1318.7)
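Same assumptions as before; the corresponding call would be something like:

# logistic regression on the same rescaled predictors
fit.2 <- glm(switch ~ arsenic + I(dist/100) + assoc + I(educ/4),
             family = binomial(link = "logit"), data = wells)
display(fit.2)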
Hmm, this isn’t quite what we got in our book, but there we had only 3020 data points and here we seem to have 6498, so as you can see I don’t have complete control of my dataset. Maybe I did something dumb like write the data into the file twice.
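If I wanted to check that guess, one quick way (again assuming the data frame is called wells):

# count exact duplicate rows; if the file had been written out twice,
# roughly half the rows would show up as duplicates
sum(duplicated(wells))
nrow(unique(wells))  # compare to nrow(wells)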
Whatever, no time to worry about this now. My real point is that the Autostat program does seem to have done something reasonable, given that it’s limited to linear models.
So this seems like an excellent start.
I think that for the program to be useful to researchers, it should link to the code that fit the models; that way a researcher could play around and fit alternatives. (In the arsenic example, for instance, once we’re fitting logistic regressions, we know we want to log the arsenic predictor and we know we want to include interactions; see the sketch below.)
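To give a sense of what I mean, here’s roughly the model we ended up with in chapter 5 of ARM, as a sketch under the same assumption that the data sit in a data frame called wells:

# logistic regression with the arsenic predictor logged, distance rescaled,
# and their interaction, plus (rescaled) education
fit.3 <- glm(switch ~ log(arsenic) * I(dist/100) + I(educ/4),
             family = binomial(link = "logit"), data = wells)
display(fit.3)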