Should we always be using the t and robit instead of the normal and logit?

My (coauthored) books on Bayesian data analysis and applied regression are like almost all the other statistics textbooks out there, in that we spend most of our time on the basic distributions such as normal and logistic and then, only as an aside, discuss robust models such as t and robit.

Why aren’t the t and robit front and center? Sure, I can see starting with the normal (at least in the Bayesian book, where we actually work out all the algebra), but then why don’t we move on immediately to the real stuff?

This isn’t just (or mainly) a question of textbooks or teaching; I’m really thinking here about statistical practice. My statistical practice. Should t and robit be the default? If not, why not?

Some possible answers:

10. Estimating the degrees of freedom in the error distribution isn’t so easy, and throwing this extra parameter into the model could make inference unstable.

9. Real data usually don’t have outliers. In practice, fitting a robust model costs you more in efficiency than you gain in robustness. It might be useful to fit a contamination model as part of your data cleaning process but it’s not necessary once you get serious.

8. If you do have contamination, better to model it directly rather than sweeping it under the rug of a wide-tailed error distribution.

7. Inferential instability: t distributions can yield multimodal likelihoods, which are a computational pain in their own right and also, via the folk theorem, suggest a problem with the model.

6. To make that last argument in reverse: the normal and logistic distributions have various excellent properties which make them work well even if they are not perfect fits to the data.

5. As Jennifer and I discuss in chapters 3 and 4 of our book, the error distribution is not the most important part of a regression model anyway. To the extent there is long-tailed variation, we’d like to explain this through long-tailed predictors or even a latent-variable model if necessary.

4. A related idea is that robust models are not generally worth the effort; it would be better to place our modeling efforts elsewhere.

3. Robust models are, fundamentally, mixture models, and fitting such a model in a serious way requires a level of thought about the error process that is not necessarily worth it. Normal and logistic models have their problems but they have the advantage of being more directly interpretable.

2. The problem is 100% computational. Once stan is up and running, you’ll never see me fit a normal model again.

1. Clippy!

I don’t know what to think. Right now I’m leaning toward answer #2 above, but at the same time it’s hard for me to imagine such a massive change in statistical practice. It might well be that in most cases the robust model won’t make much of a difference, but I’m still bothered that the normal is my default choice. If computation weren’t a constraint, I think I’d want to estimate the t (with some innocuous prior distribution to average over the degrees of freedom and get a reasonable answer in those small-sample problems where the df would not be easy to estimate), or if I had to pick, maybe I’d go with a t with 7 degrees of freedom. Infinity degrees of freedom doesn’t seem like a good default choice to me.
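To make the comparison concrete, here is a minimal sketch (my own toy example, not code from the books) of fitting a linear regression with normal errors versus t errors, estimating the degrees of freedom by maximum likelihood, using only numpy and scipy:

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
# Simulate data with heavy-tailed (t_3) errors.
y = 1.0 + 2.0 * x + stats.t.rvs(df=3, size=n, random_state=rng)

def neg_loglik_t(params, x, y):
    # t location-scale log likelihood: log f((y - a - b*x)/sigma; df) - log sigma
    a, b, log_sigma, log_df = params
    sigma, df = np.exp(log_sigma), np.exp(log_df)
    return -np.sum(stats.t.logpdf((y - a - b * x) / sigma, df=df) - np.log(sigma))

# Normal-errors fit is just ordinary least squares.
b_ols, a_ols = np.polyfit(x, y, 1)

# t fit: optimize over intercept, slope, log scale, and log degrees of freedom.
res = optimize.minimize(neg_loglik_t, x0=[0.0, 1.0, 0.0, np.log(7.0)],
                        args=(x, y), method="Nelder-Mead",
                        options={"maxiter": 2000, "xatol": 1e-6, "fatol": 1e-6})
a_t, b_t = res.x[0], res.x[1]
df_hat = np.exp(res.x[3])
print(f"OLS slope: {b_ols:.2f}, t-model slope: {b_t:.2f}, estimated df: {df_hat:.1f}")
```

With heavy-tailed errors the t model typically gives a more stable slope estimate; in small samples, though, the df estimate can wander, which is exactly why a prior on the degrees of freedom (point #10 above) matters.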

16 thoughts on “Should we always be using the t and robit instead of the normal and logit?”

  1. I haven't seen this before and search turned up nothing useful, so

    What is "stan" and why does it solve such computational issues?

  2. I am not sure whether it is appropriate to ask a question on this blog, but could you please spare some time to read the following paragraphs and help me figure out some problems I face in my research?

    I am currently working on my master's thesis, a multilevel analysis of the spatial distribution of crime and related socioeconomic characteristics in six Canadian cities.

    The main purpose of my present research is to investigate whether the crime rate of a given neighbourhood is related to both neighbourhood and city characteristics in six Canadian cities. The dependent variable is the crime rate at the neighbourhood level (census tract), while the independent variables are socioeconomic variables measured at both the neighbourhood and city levels. Hierarchical linear modelling was employed, with neighbourhoods (census tracts) as the micro level of analysis and cities as the second level. I used HLM 7.0 to implement these two-level models.

    Also, I want to take the spatial dependence of neighbourhoods into account by using spatial regression models. That is, I want to model spatial dependence at the first level after controlling for other level-1 covariates. For this purpose, I created a spatial lag term (i.e., the weighted average of crime rates in neighbouring locations) using GeoDa software and included it as a level-1 predictor variable in the two-level model estimated by HLM 7.0. However, because such an equation has an endogenous variable (the spatial lag Wy) on the right-hand side, it must be estimated using either a maximum likelihood (ML) or two-stage least squares (2SLS) approach. But to my knowledge, HLM uses generalized least squares (GLS) to estimate the fixed effects. If I include the spatial lag term at the first level and estimate by GLS, the spatial lag term is not independent of the random effects (i.e., the level-2 residual variance), which violates the model assumptions. Indeed, as Wy is necessarily endogenous, the GLS estimates will suffer from simultaneity bias. It seems that the method I used is incorrect, but I am stuck on how to use another approach to deal with this issue.

    Could you help me figure it out? Many thanks in advance.
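    [A hedged numerical sketch of the 2SLS idea the commenter mentions, in plain numpy. This toy example uses a generic instrument z rather than the spatially lagged exogenous covariates (WX) that are the usual instruments for a spatial-lag model; the data and coefficients are made up for illustration.]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=n)                       # instrument (exogenous)
u = rng.normal(size=n)                       # structural error
w = 0.8 * z + 0.5 * u + rng.normal(size=n)   # endogenous regressor: correlated with u
y = 1.0 + 2.0 * w + u                        # outcome; true slope is 2

X = np.column_stack([np.ones(n), w])         # regressors (with intercept)
Z = np.column_stack([np.ones(n), z])         # instruments

# OLS is biased here, because w is correlated with the error u.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# 2SLS: first stage projects X onto the instrument space; the second stage
# regresses y on the fitted values.
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
beta_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]
print(beta_ols[1], beta_2sls[1])  # OLS slope is biased upward; 2SLS is near 2
```

    The same projection logic carries over when the endogenous regressor is a spatial lag Wy, though the multilevel random effects add complications that this sketch ignores.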

  3. When you say "basic distributions such as normal and logistic," are you including lognormal or other transforms of normal?

    I ask because of computer benchmarking. If one has a sample of N benchmarks, runs them on 2 machines X and Y, and looks at the distribution of ratios Ri = Xi/Yi, normal doesn't work, because one must get consistent statistics whether X or Y is arbitrarily chosen as the denominator, but lognormal often provides a good fit. This is unsurprising, as many of the factors generating differences tend to be multiplicative, not additive. Of course, for small variance, normal isn't too far off, but it cannot be mathematically correct.

    However, if one compares a scalar processor and a vector or parallel system, one often finds a heavy tail in one direction, as the vector/parallel systems generate a cluster of apparent outliers. They aren't really outliers, though, but rather a strong indication of non-uniformity in the population.
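    [A quick numerical illustration of the denominator-symmetry point above, with made-up runtimes rather than real benchmark data: the arithmetic mean of ratios depends on which machine is the denominator, while the geometric mean, i.e., working on the log scale as a lognormal model does, is consistent either way.]

```python
import math

x = [1.0, 2.0, 4.0]   # runtimes on machine X (made up)
y = [2.0, 2.0, 2.0]   # runtimes on machine Y (made up)

r_xy = [xi / yi for xi, yi in zip(x, y)]   # X/Y ratios: [0.5, 1.0, 2.0]
r_yx = [yi / xi for xi, yi in zip(x, y)]   # Y/X ratios: [2.0, 1.0, 0.5]

def amean(v):
    return sum(v) / len(v)

def gmean(v):
    return math.exp(sum(math.log(vi) for vi in v) / len(v))

print(amean(r_xy), 1 / amean(r_yx))   # ~1.167 vs ~0.857: inconsistent
print(gmean(r_xy), 1 / gmean(r_yx))   # 1.0 and 1.0: consistent
```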

  4. Justin: No bad data, just bad models; there is even a paper on that by John Nelder entitled "There are no outliers in the stack loss data."


  5. Andrew: #4 simply fails to consider all the benefits/costs in the long run, and it should be 2.

    Or: in the long run it must be 2 (as eventually someone will realize this and be convincing about it).


  6. John: Yes, I'm including lognormal within the normal category.

    Charlie: What Keith said.

    Keith: Possibly it's 2, but maybe not. Maybe robust models have some statistical problems of which we are not aware, problems that might show up if these models are used more routinely.

  7. I'm having difficulty finding the John Nelder paper on the stack-loss data. Google Scholar only pulls up 4 cites, not the paper itself. The paper was apparently published in a journal called "Student", and I've had no luck finding *that* with Google either.

  8. For those intrigued or frustrated by the allusive references here:

    Nelder, J.A. 2000. There are no outliers in the stack-loss data. Student 3(3): 211-218.

    I don't have access to the journal, but I think I recall dimly that it is published through one of the Swiss universities and is rather elusive.

  9. Nick: I did not realize the Nelder paper was so elusive. At the bottom of page 37 there is a short summary of it here.

    By the way figure 3 was the first almost individual observation likelihood plot I made in 2001. I did not include the plot for Nelder's model, but in it all the likelihoods were essentially on top of each other.

    Andrew: Agree with your additional point on 2. Here I like the work of Ken Rice and Sander Greenland on identifying the full joint probability model that leads to the same techniques. If _we_ can make these both easy to identify and essentially costless to look at, my point about 2 would be firmer, notwithstanding Tukey's earlier point about "not the model that leads to the technique but how the technique works."


  10. I don't think robust models are the way of the future.

    Outliers can be the result of an error in data collection, data entry, or data preparation. In that case robust models just hide problems in the data that you should have dealt with before you started modeling. Ideally, you would go back to the original data and recover the real values; when that is not possible you could exclude that observation from your model, impute it, or do something else that takes this process seriously.

    Alternatively, outliers can be genuine datapoints. In that case these outliers often represent either very desirable or very undesirable situations (e.g. it may be a very wealthy person or someone that dies very young). In these situations the outliers contain the most interesting information. Using that information needs careful thought and modeling.

    However, the robust models that I have seen do not seem appropriate for either situation. They just "downweight" the outlier, hide the outlier in a more flexible error distribution, or make that information more normal/ignorable.

    There may be cases where the error distribution genuinely is a t-distribution, or where some other robust model may be appropriate. But I would use such models only as the exception, for situations where I have ruled out data errors and/or genuinely interesting outliers.

  11. "In these situations the outliers contain the most interesting information." YES!

    Yes, on a small scale, typical practice on some program committees has been for a dozen people to classify papers/abstracts from 1 to 5, put the numbers in a spreadsheet, and drive many acceptance decisions from the computed averages. When I was running the committee, I insisted on looking at the outliers and asking the outlier raters why. For example, most people might rank something 3-4 (Accept) while one person ranked it 1 (No). Sometimes it would turn out that the paper looked OK but was somewhat unfamiliar to most people, while the outlier knew the work well and could explain why it was junk. Occasionally, with abstracts, the outlier went the other direction and convinced the rest that although the abstract wasn't very good, they knew the work and knew the author could give a good talk.

    The movie version of this is "12 Angry Men."

    Likewise, in the benchmark studies, too-fast outliers were often hints either of benchmark-specific optimizations (hacks), of legitimate compiler optimizations that had overrun too-simple benchmarks, or when one benchmark was especially slow, a hint that compilers were doing something badly and needed attention.

  12. I think the problem is computational, and robust models should be used if they give better predictions. We have had good experience using the negative binomial and zero-inflated negative binomial instead of the Poisson. In classification, we have not yet seen data with much information about the degrees of freedom in the robit (at least when using a non-linear mean function). We have used the logit, which corresponds to the robit with nu=7, and it seems to be more robust than the probit (nu=infinity). The t is computationally more problematic (unless using MCMC), but we think we have solved that. Our (non-MCMC) solution for a robust and efficient implementation of Gaussian process regression with a Student-t observation model was posted yesterday to arXiv, and the code will soon be included in our GPstuff toolbox.
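    [The nu=7 correspondence mentioned above can be checked numerically: the logistic CDF is closely approximated by a t_7 CDF after rescaling. The scale factor of about 1.55 used here is my assumption, chosen to match the variances of the two distributions.]

```python
import numpy as np
from scipy import stats

x = np.linspace(-6, 6, 241)
logit_cdf = stats.logistic.cdf(x)
robit7_cdf = stats.t.cdf(x / 1.5484, df=7)   # scaled t_7 link ("robit")
max_gap = np.max(np.abs(logit_cdf - robit7_cdf))
print(f"max |logistic - scaled t_7| over [-6, 6]: {max_gap:.4f}")
```

    The two link functions are nearly indistinguishable over the range where data carry information, which is consistent with the comment's observation that the data rarely say much about the degrees of freedom.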
