Fabio Rojas writes:
In much of the social sciences outside economics, it’s very common for people to take a regression course or two in graduate school and then stop their statistical education. This creates a situation where you have a large pool of people who have some knowledge, but not a lot of knowledge. As a result, you have a pretty big gap between people like yourself, who are heavily invested in the cutting edge of applied statistics, and other folks.
So here is the question: What are the major lessons about good statistical practice that “rank and file” social scientists should know? Sure, most people can recite “Correlation is not causation” or “statistical significance is not substantive significance.” But what are the other big lessons?
This question comes from my own experience. I have a math degree and took regression analysis in graduate school, but I definitely do not have the level of knowledge of a statistician. I also do mixed method research, and field work is very time intensive. I often feel that I face a tough choice – I can delve into more advanced statistics, but that often requires a huge investment on my part. Is there a middle ground between the naive user of regression analysis and what you do?
My reply: You can take a look at my book with Jennifer Hill. Chapters 3-5 hit the basics, then you can jump to chapters 9-10 for causal inference.
More specifically, here are some tips:
– The difference between “significant” and “non-significant” is not itself statistically significant.
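This point is easy to see in numbers. A minimal sketch (the estimates and standard errors below are made up for illustration): one estimate is "significant," the other is not, yet their difference is nowhere near significant.

```python
import math

# Illustrative values: estimate A is "significant", B is not.
est_a, se_a = 25.0, 10.0   # z = 2.5
est_b, se_b = 10.0, 10.0   # z = 1.0

z_a = est_a / se_a
z_b = est_b / se_b

# But the comparison of interest is the *difference* between them:
diff = est_a - est_b                      # 15
se_diff = math.sqrt(se_a**2 + se_b**2)    # ~14.1 (independent estimates)
z_diff = diff / se_diff                   # ~1.06 — not significant
```

So two estimates can sit on opposite sides of the significance threshold while being statistically indistinguishable from each other.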
– Don’t just analyze your variables straight out of the box. You can break continuous variables into categories (for example, instead of age and age-squared, you can use indicators for 19-29, 30-44, 45-64, 65+), and, from the other direction, you can average several related variables to create a combined score.
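Both moves are a couple of lines in practice. A minimal numpy sketch, using the post's age bins and invented data (the ages and item scores here are hypothetical):

```python
import numpy as np

# Hypothetical ages; bin edges follow the post's example.
ages = np.array([22, 35, 50, 70, 29, 45, 65])
edges = [19, 30, 45, 65, 200]            # 19-29, 30-44, 45-64, 65+
labels = ["19-29", "30-44", "45-64", "65+"]
idx = np.digitize(ages, edges) - 1       # bin index for each age
indicators = np.eye(len(labels))[idx]    # one indicator column per bin

# The other direction: average related items into a combined score.
item1 = np.array([3., 4., 2., 5., 3., 4., 1.])
item2 = np.array([2., 5., 3., 4., 3., 5., 2.])
score = (item1 + item2) / 2
```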
– You can typically treat a discrete outcome (for example, responses on a 1-5 scale) as numeric. Don't worry about ordered logit/probit/etc., just run your regression already.
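To see why this often works, here is a sketch with simulated 1-5 responses (the data-generating process is invented for illustration): a plain linear regression on the discrete outcome still recovers the direction and rough size of the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Simulated 1-5 responses that rise with x (true latent slope 0.8).
latent = 3 + 0.8 * x + rng.normal(scale=0.7, size=200)
y = np.clip(np.round(latent), 1, 5)

# Ordinary least squares, treating the 1-5 outcome as numeric.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[1] is the estimated per-unit change in the mean response;
# it comes out positive and near the latent slope, slightly
# attenuated by the rounding and clipping at 1 and 5.
```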
– Take the two most important input variables in your regression and throw in their interaction.
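Adding the interaction is just one more column in the design matrix. A minimal sketch with simulated data (the coefficients are invented; the point is the mechanics):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Simulated outcome with a genuine interaction (coefficient 0.5).
y = 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + rng.normal(size=n)

# Throw in the product of the two main inputs as a fourth column.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[3] estimates the interaction; if it is sizable, the effect
# of x1 genuinely depends on the level of x2.
```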
– The key assumptions of a regression model are validity and additivity. Except when you’re focused on predictions, don’t spend one minute worrying about distributional issues such as normality or equal variance of the errors.
Possibly the readers of this blog could offer some suggested tips of their own?
I don't see non-statisticians use cross-validation enough.
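Cross-validation needs no special software. A minimal numpy-only sketch of k-fold cross-validation for a regression (the data and the overfit-prone comparison model are simulated for illustration):

```python
import numpy as np

def kfold_mse(X, y, k=5):
    """Plain k-fold cross-validated mean squared error for OLS."""
    n = len(y)
    idx = np.arange(n)
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x])                      # true model
X2 = np.column_stack([X1] + [x**p for p in range(2, 7)])   # extra terms
# Comparing kfold_mse(X1, y) with kfold_mse(X2, y) shows whether the
# extra polynomial terms actually help out of sample (here they
# typically don't, since the true relationship is linear).
```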
These are great suggestions, which I think can be dimension-reduced to two:
1. Let the data speak.
2. Don't take frequentist statistics too seriously.
You just blew the mind of every MS student spending their weekend with a copy of CDA and problem sets that involve fitting probit models and the like to various horseshoe crab, alligator, and snoring data.
Yes, but "letting the data speak" is not always so easy. Often you need a model with a lot of structure in order to let the data speak. This is the subject of much of my recent and current research.
Great list. To your second point I'd add examining the pairwise scatterplots of your outcome variable with your main variables of interest.
Don't take frequentist statistics too seriously
All statistical methods have frequentist properties; are you suggesting we take none of them seriously?
I would clarify or even take issue with two of Andrew's comments, even though my own suggestions are in mild tension with each other:
1) Analyzing a continuous variable as a categorical one is in general a Bad Thing. We rarely expect the predicted response to be a constant over a range and then to make a discrete jump at a point that we choose arbitrarily a priori. I would oppose Andrew's suggestion to do this.
2) Collapsing an actual ordinal variable into a continuous one makes sense if there are enough levels, but 5 is in the gray area, and dark gray at that. I would not do this if there were 3 levels, say.
I think they should know how and when to ask a statistician for more help.
As a statistical consultant, I am not an expert in any of the substantive areas that my clients are working on – epilepsy, aging, the outbreak of war, bullying in college, the role of group therapy, etc. etc. Why should those people, who ARE experts in substantive areas, be experts in statistics?
Like many people reading this blog, I've had many statistics courses and have done a bunch of self-educating beyond those courses.
As the post says, many social scientists (and others!) have two statistics courses (or one!) and that's it. So, they learn means and standard deviation and so on, and regression, probably linear regression and maybe basic logistic regression.
They don't even know what statistical questions they can ask, much less what answers can be given.
Ask questions early, ask often, and ask questions based on your substantive issues, not on what you think might be the statistical questions involved.
My two cents:
Check your model (by looking at the residuals or some sort of posterior predictive checks).
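A residual check can be as simple as a few lines. A sketch with simulated data where the true relationship is quadratic but the fitted model is linear; a quick numeric stand-in for a residual plot flags the missed curvature:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=400)
y = x**2 + rng.normal(scale=0.3, size=400)   # truly quadratic

# Fit a straight line anyway.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# In place of eyeballing a residual plot: fit a quadratic to the
# residuals; a large leading coefficient flags the curvature the
# linear model missed (here it recovers roughly the true 1.0).
curv = np.polyfit(x, resid, 2)[0]
```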
But don't put too much trust in statistics like R^2 or pseudo-R^2.
Good point about improving communication and having the statistician play a useful role. One problem, though, is that many Ph.D. statisticians have some of the bad attitudes that we're warning about.
It all depends on the context. As I wrote above, "You can . . ."; I didn't say "You must . . ." or "You always should . . ."
In giving my recommendations, I'm pushing against the usual approach to statistics, in which the statistical analysis is driven by the form that the variables happen to have in the data file at hand.
Look at scatterplots of your data. Put some loess fits on them to get a rough idea of what's going on.
Check for curvature.
Look at scatterplots of your residuals.
Check for curvature.
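If a loess routine isn't handy, binned means of y against x are a serviceable stand-in. A sketch with simulated data (the log-shaped relationship is invented for illustration); plotting the bin means reveals the concave trend before you commit to a straight line:

```python
import numpy as np

def binned_means(x, y, nbins=10):
    """Poor man's loess: mean of y within equal-width bins of x."""
    edges = np.linspace(x.min(), x.max(), nbins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, nbins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.array([y[idx == b].mean() for b in range(nbins)])
    return centers, means

rng = np.random.default_rng(4)
x = rng.uniform(0, 3, 500)
y = np.log1p(x) + rng.normal(scale=0.1, size=500)
centers, means = binned_means(x, y)
# The bin means rise steeply at first and flatten out later —
# curvature that a raw scatterplot with a line fit can hide.
```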
Use theory from the problem domain to inform your choice of functional form for the regression. You may need to transform the predictors or response variable. And always ask yourself, "Do I really need that intercept term?"
Do your assumptions matter? That is, I run a regression a la Gauss, then I run a Bayesian regression with different priors. Do any of my substantive conclusions differ?
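One cheap version of this sensitivity check: the posterior mode under an independent Gaussian prior on the coefficients is just ridge regression, so comparing OLS with a ridge fit shows whether a prior of that form would move the substantive conclusion. A sketch with simulated data (the prior scale `lam` is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

def ridge(X, y, lam):
    # Posterior mode under an independent N(0, sigma^2/lam) prior on
    # the coefficients (including the intercept here); lam=0 is OLS.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

ols = ridge(X, y, 0.0)
bayes = ridge(X, y, 5.0)
# If ols[1] and bayes[1] tell the same substantive story, the
# conclusion is robust to this particular prior choice.
```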
I'm not a statistician, but the more I look at research, the more I get the impression that the following are more important than heteroskedasticity-adjusted standard errors and all the other maths stuff:
1. Is it really a good idea to control for that variable? It is not that unusual for papers to look at the influence of immigration on city crime rates while controlling for ethnic heterogeneity.
2. What does that significance test measure? An oldie but goodie.
Think about the contrast you're looking for and check for influence. Are a few points really driving your conclusions? Particularly, are a few extreme points doing so? Are you that confident in your assumptions for them? Scatterplots and dFBetas help. This is true of any technique, but in regression it gets at the key assumptions above of linearity and additivity.
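Leverage and DFBETA-style checks are easy to compute directly. A sketch with simulated data in which one deliberately planted point dominates the fit (the point's coordinates are invented):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)
x[0], y[0] = 10.0, -5.0          # plant one wildly influential point

X = np.column_stack([np.ones(n), x])

# Leverage: diagonal of the hat matrix.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

def slope(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# DFBETA-style check: how much does the slope move if we drop point i?
full = slope(X, y)
dfbeta = np.array([full - slope(np.delete(X, i, 0), np.delete(y, i))
                   for i in range(n)])
# The planted point has by far the largest leverage and the largest
# |dfbeta| — a few lines of arithmetic expose what drives the fit.
```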
Great post. My initial reactions are:
– Regarding your last point, we shouldn't worry about distributional assumptions even when doing predictive models… which leads to my next point
– which Ryan alluded to above: validation (e.g. against holdout data, cross-validation, bootstrapping, etc.) is a must, especially for predictions. Too many job candidates are lost for words when I ask them about how they validate their models.
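A holdout check is only a few lines. A sketch with simulated data (the split and the polynomial degrees are arbitrary illustrative choices): the flexible model looks great on the training half and pays for it on the holdout.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
y = 1 + x + rng.normal(size=n)

# Simple split: first 150 observations to fit, last 50 held out.
train, hold = np.arange(0, 150), np.arange(150, 200)

def holdout_mse(degree):
    X = np.column_stack([x**p for p in range(degree + 1)])
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    return float(np.mean((y[hold] - X[hold] @ beta) ** 2))

# holdout_mse(1) will usually beat holdout_mse(12): the high-degree
# fit chases noise in the training half and the holdout reveals it.
```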
– I'd add to the list: interpretability. In practice, we have to explain the model to our clients. Say, if we don't pay attention to multicollinearity, then the effects may take the "wrong" signs. Interpretability restricts the types of methods.
I would like to speak in favor of the much-maligned r^2. So many people dismiss it because it's not a "perfect measure" (as if any measure were).
Like all measures, r^2 has a common understanding which over-simplifies its meaning (i.e. it's spoken of in terms of raw "explanatory power"; when, in fact, it is simply a rough way of measuring explained variance). Nonetheless, r^2 is very important for statistical models. I would say it rivals "the mean" (i.e. an average) for its foundational importance. Why?
Because, r^2 (when a person is properly aware of its limitations) helps us know when we are barking up the wrong tree. Consider a regression model predicting cancer.
In this model, you may get a number of variables with high coefficient values predicting cancer. But which ones are truly contributing *the most* to cancer incidence? Each significant coefficient will tell you that a variable is likely to lead to cancer; but (and this is the crux of the matter) they won't tell you how many cases of cancer *in general* they tend to explain. E.g.: coefficients can tell you that exposure to radiation contributes very directly to cancer, *but* they don't tell you how many cases of cancer are likely caused by radiation exposure. If you see r^2 (or, in this case, pseudo-r^2) go up after you enter a variable, this tells you that the variable is not just significant — it is likely a major contributor to cancer, overall.
Of course, there are limitations, and r^2 and pseudo-r^2 can be horribly flawed. However, we need general benchmarks for measuring explanatory power, and I don't see why r^2 is not a good, "all purpose" benchmark.
Good discussion. One comment, two additions:
1. Beware, and embrace, measurement error. Nearly everything is measured with error, and do not forget that it can have profound consequences.
2. Don't ignore "nuisance" parameters. Too many papers focus only on the "parameter of interest" and ignore that things like measurement error or endogeneity of "nuisance" variables can mess up what you are interested in.
3. Cross-validation is important when examining correlations, as in forecasting, but not causal inference. E.g., IV will always forecast less well out of sample than OLS (assuming everything else is well specified), but OLS does not have a causal interpretation.
Frequently, users of regression don't think carefully about the intertwined issues of:
1. alternative parameterizations;
2. interpretation of coefficients; and
3. interpretation of test statistics.
I see frequent interpretive mistakes in presentations and papers that can be traced to misunderstandings of the parameterization someone has chosen for a regression model.
So I would add: Think very carefully about the parameterization you choose, what it means, and what that implies for the meaning of tests.
I'm no expert, but for getting at what JL is talking about, I like looking at the difference in prediction when you change the value of the inputs you're talking about (with a measure of uncertainty as well), where the change in input is measured in standard deviations, or some unit natural to that variable. In a linear model with no interactions, this 'predictive difference' will be independent of the other predictors. Otherwise you can look at an 'average predictive difference' or some such, averaging over the other inputs.
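Concretely, in a model with an interaction, the predictive difference for a one-sd change in one input depends on the other input, so you average it over the observed data. A sketch with simulated data (the coefficients are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2 * x1 + x2 + 0.5 * x1 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(b, a1, a2):
    return b[0] + b[1] * a1 + b[2] * a2 + b[3] * a1 * a2

# Average predictive difference for a 1-sd increase in x1,
# averaging over the observed values of x2.
s = x1.std()
apd = np.mean(predict(beta, x1 + s, x2) - predict(beta, x1, x2))
# With the interaction, the per-observation differences vary with
# x2; apd summarizes them in one interpretable number (~2 here).
```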
There's a nice discussion of this in Andrew Gelman's book! He has an interesting blog too…
I can follow you here: R^2 might be good but it has *nothing* to do with causation.
Don't bother with the scatterplotting and loess fits and stuff like that. Pretty much everything is linear, anyway.
Holdout samples should be avoided at all costs. Once you get a good model, all a holdout sample can do is to show you that your model isn't good enough. Who wants to know that?
If you insist on using a holdout sample and your model doesn't look good, accidentally lose the holdout file. Another strategy is to accidentally use part of the data you actually modeled as your holdout.
The model with the best r^2 is the best. Improve your r^2 by figuring out some way to get autocorrelation into your model without being too obvious about it.
The lever is one of the great tools of mankind. If you have an observation with high leverage, it can be helpful in getting a high r^2. If, instead, this observation lowers r^2, it's an outlier for sure and should be dropped.
Stepwise can be a big time saver, especially if you are good at thinking up explanations for curious relationships. They publish fiction in the New Yorker, and it's one of the most respected publications in the world.
Is it April 1 yet?
1) Sample from your data. Lots of companies try to fit complicated regression models with billions of observations. Wouldn't "many millions" do just as well? (This also ties into holdout samples.)
2) Prefer a simple regression model that you can understand over a complicated model that you can't. (This also ties into overfitting.)
Is there a good reference to read for the discussion on treating a discrete outcome as numeric? I see it mentioned here with some regularity, but don't know where I might go about looking up the underlying arguments for/against.