It’s Appendix A of ARM:

**A.1. Fit many models**

Think of a series of models, starting with the too-simple and continuing through to the hopelessly messy. Generally it’s a good idea to start simple. Or start complex if you’d like, but prepare to quickly drop things out and move to the simpler model to help understand what’s going on. Working with simple models is not a research goal—in the problems we work on, we usually find complicated models more believable—but rather a technique to help understand the fitting process.

A corollary of this principle is the need to be able to fit models relatively quickly. Realistically, you don’t know what model you want to be fitting, so it’s rarely a good idea to run the computer overnight fitting a single model. At least, wait until you’ve developed some understanding by fitting many models.
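The "fit many models, quickly" workflow can be sketched in a few lines. The data here are simulated, and the particular ladder of models (intercept-only through cubic) is just an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 1 + 2 * x + rng.normal(0, 0.3, size=x.size)  # simulated straight-line data

# Fit a quick ladder of models: intercept-only, linear, quadratic, cubic.
residual_sds = []
for degree in range(4):
    X = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ...
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual_sds.append((y - X @ coef).std())
    print(f"degree {degree}: residual sd {residual_sds[-1]:.3f}")
```

Each fit takes milliseconds, so comparing the whole sequence costs nothing; the point is to see where added complexity stops buying anything, not to crown a winner.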

**A.2. Do a little work to make your computations faster and more reliable**

This sounds like computational advice but is really about statistics: if you can fit models faster, you can fit more models and better understand both data and model. But getting the model to run faster often has some startup cost, either in data preparation or in model complexity.

*Data subsetting* . . .

*Fake-data and predictive simulation* . . .

**A.3. Graphing the relevant and not the irrelevant**

*Graphing the fitted model*

Graphing the data is fine (see Appendix B) but it is also useful to graph the estimated model itself (see lots of examples of regression lines and curves throughout this book). A table of regression coefficients does not give you the same sense as graphs of the model. This point should seem obvious but can be obscured in statistical textbooks that focus so strongly on plots for raw data and for regression diagnostics, forgetting the simple plots that help us understand a model.

*Don’t graph the irrelevant*

Are you sure you really want to make those quantile-quantile plots, influence diagrams, and all the other things that spew out of a statistical regression package? What are you going to do with all that? Just forget about it and focus on something more important. A quick rule: any graph you show, be prepared to explain.

**A.4. Transformations**

Consider transforming every variable in sight:

• Logarithms of all-positive variables (primarily because this leads to multiplicative models on the original scale, which often makes sense)

• Standardizing based on the scale or potential range of the data (so that coefficients can be more directly interpreted and scaled); an alternative is to present coefficients in scaled and unscaled forms

• Transforming before multilevel modeling (thus attempting to make coefficients more comparable, thus allowing more effective second-level regressions, which in turn improve partial pooling).

Plots of raw data and residuals can also be informative when considering transformations (as with the log transformation for arsenic levels in Section 5.6).

In addition to univariate transformations, consider interactions and predictors created by combining inputs (for example, adding several related survey responses to create a “total score”). The goal is to create models that could make sense (and can then be fit and compared to data) and that include all relevant information.
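A minimal sketch of these transformations on simulated data; the variable names are illustrative, and the divide-by-two-standard-deviations scaling is one convention for making coefficients comparable:

```python
import numpy as np

rng = np.random.default_rng(1)
income = rng.lognormal(10, 1, size=200)   # an all-positive variable
age = rng.uniform(18, 90, size=200)
q1, q2, q3 = (rng.integers(1, 6, size=200) for _ in range(3))  # related survey items

log_income = np.log(income)   # multiplicative on the original scale -> additive here

def standardize(v):
    """Center and scale (here by two standard deviations, so that binary
    and continuous coefficients are roughly comparable)."""
    return (v - v.mean()) / (2 * v.std())

z_age = standardize(age)
total_score = q1 + q2 + q3    # combine related inputs into a "total score"
```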

**A.5. Consider all coefficients as potentially varying**

Don’t get hung up on whether a coefficient “should” vary by group. Just allow it to vary in the model, and then, if the estimated scale of variation is small (as with the varying slopes for the radon model in Section 13.1), maybe you can ignore it if that would be more convenient.

Practical concerns sometimes limit the feasible complexity of a model—for example, we might fit a varying-intercept model first, then allow slopes to vary, then add group-level predictors, and so forth. Generally, however, it is only the difficulties of fitting and, especially, understanding the models that keeps us from adding even more complexity, more varying coefficients, and more interactions.

**A.6. Estimate causal inferences in a targeted way, not as a byproduct of a large regression**

Don’t assume that a regression coefficient can be interpreted causally. If you are interested in causal inference, consider your treatment variable carefully and use the tools of Chapters 9, 10, and 23 to address the difficulties of comparing comparable units to estimate a treatment effect and its variation across the population. It can be tempting to set up a single large regression to answer several causal questions at once; however, in observational settings (including experiments in which certain conditions of interest are observational), this is not appropriate, as we discuss at the end of Chapter 9.

I strongly agree with A.4!

But isn’t A.1 a recipe for overfitting… for settling on a model that fits every irrelevant quirk in your particular data set? Shouldn’t there be an A.7: test your model using one or more holdout sets?

I don’t think Andrew is suggesting to add complexity “willy-nilly” until you achieve perfect fit; he’s just saying that parsimony is overrated as a virtue. I’ve worked with plenty of people who insist on, say, polynomial regression when some kind of non-linear model both makes more sense theoretically and provides more interpretable parameters, because they “don’t want to get into that complicated non-linear stuff, and look! The AIC says it’s just fine!” I’d choose a complex model with interpretable parameters over a simple one any day, AIC be damned.

I think you’re missing Phil’s point. Whether it’s polynomials or explanatory variables or non-linearity is irrelevant. Fitting multiple models is one form of a “forking path,” so when you obtain a model that “fits,” perhaps it’s overfit.

Personally, I’m on the fence about the whole forking paths business. I think multiple comparisons (and their adjustments) are the right answer to the wrong question.

Maybe he’s less worried about this in a modeling context with Bayesian shrinkage, but I’m never sure if the shrinkage is “enough”; there’s no theorem that tells you how much shrinkage is sufficient to avoid overfitting and how much is too much.

So yeah, what seems like a “quick tip” is actually more complicated, at least it seems so to me.

Some models tend to appear correct simply because they are looser. For example, suppose two forecasters want to predict the price of a stock XYZ and they have the following forecasts:

Forecaster A: “XYZ stock will close between $.05 and $500”

Forecaster B: “XYZ stock will close between $100 and $100.05”

Without knowing anything else about their models, which one is more likely to appear correct after the fact? Clearly Forecaster A. This principle determines a prior on models!

Mathematically, the usual Bayes situation is where we know P(data|theta) and P(theta) which are then used to find P(theta|data). But there’s nothing stopping us from starting with P(data|theta) and P(data) instead. Mathematically, that’s equivalent to determining a P(theta) and hence P(theta |data).

It also has a natural interpretation. The high-probability region of P(data) can be thought of as the “universe” of potential values for the data (forecast) within which the model P(data|theta) must lie. For a stock we typically know it will be between $0 and $1,000, for example. The model P(data|theta) must then fit forecasts inside this “universe”.

Thetas that cover more of this “universe” will have higher prior probability than those that don’t. Thetas which forecast prices outside of this “universe” are penalized heavily.

Mathematically speaking, P(data) induces a distribution P(theta), which creates in the usual Bayesian way a “penalty” for over-fitting. Instead of choosing a theta which maximizes the likelihood P(data|theta), you have to choose one which maximizes P(data|theta)P(theta). Intuitively, this balances two competing effects. The P(data|theta) term causes you to pick thetas which closely model the data, while the P(theta) term causes you to pick models which are inherently more likely to be correct (as in the example I began with). These two work against each other, and the theta which balances them tends to be predictively accurate going forward (hence avoiding over-fitting).
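This likelihood-times-prior balance is exactly what ridge regression computes. A minimal sketch on simulated data, where an assumed Gaussian prior sd tau on each coefficient plays the role of P(theta) and fixes the penalty:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[0] = 2.0                      # only one coefficient really matters
y = X @ beta_true + rng.normal(0, 1.0, size=n)

sigma, tau = 1.0, 0.5                   # noise sd and prior sd on each coefficient
lam = (sigma / tau) ** 2                # ridge penalty implied by that prior

# Maximizing P(y|beta) * P(beta) is the same as minimizing
# ||y - X b||^2 + lam * ||b||^2, which has a closed form:
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_mle = np.linalg.lstsq(X, y, rcond=None)[0]   # likelihood-only fit
```

The prior term shrinks the estimate toward zero relative to the maximum-likelihood fit, which is the “penalty” for over-fitting in concrete form.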

This is exactly what things like regularization, AIC, DIC, the lasso, and so on are doing (or attempting to do). The Bayesian version I just described is far more general however, both in theory and in practice, and doesn’t require “tuning” parameters.

Sorry for being stupid here, but how do we know P(data)?

Let’s say our data is 1,2,3,4,20 and P(data|theta) is assumed to be N(theta,5^2). What’s P(theta) then?

PS: You may be unhappy about my model assumption N(theta,5^2), but if I don’t make such an assumption, how to find P(data) looks even more mysterious to me.

For better or worse, same way you’d choose the prior for P(theta), right? E.g. invariance etc. Maybe some sort of higher-order modelling? But what one would call ‘data’ are just ‘observable’ parameters, and what we call ‘parameters’ are just ‘unobserv(ed/able)’ parameters in the Bayes approach, no?

Say in one approach you think it is a good idea for P(data) to be invariant/non-informative or whatever.

But instead of trying to specify P(data) directly you choose P(data|theta) and P(theta). Can’t you then get different P(data) under some conditions (according to the Borel paradox or something)? So the P(data) induced is not invariant? Or is this wrong?

There are lots of situations where the “data” is more immediate and interpretable, and where we have much more prior knowledge, than “theta”. So finding a prior P(data) is easier than finding P(theta).

Mathematically, knowing P(data|theta) and P(theta) is equivalent to knowing P(data|theta) and P(data).

Essentially, what’s happening is that historical Bayes is really just a special case of full Bayes. Historical Bayes decided to call P(theta) a “prior” and so made it seem like this is the one we know first. But there’s nothing in the mathematics saying we have to know this distribution first and full Bayes is a lot more flexible.

Anonymous:

Sometimes we speak of “external information” rather than “prior information” to clarify this point.

Christian,

Take forecasting a stock’s price next week as an example. Ideally we want a P(price |theta) that accurately gives us a small range for the price so we can make money.

Before we even begin making that model though, we know a (large) potential range for next week’s stock price. We know it has to be greater than 0 and can easily get an upper bound. If nothing else, the market capitalization (=price*number of shares) can’t be impossibly large.

Given that range, P(price) can be a distribution whose high-probability region (probability mass) covers 0 to the upper bound. If you were really to do this and were trying to forecast a stock price one week away, you could almost certainly come up with a narrower region for P(price).

Once you have P(price) and P(price |theta), then P(theta) is necessarily determined from the interval equation:

P(price) = \int P(price|theta) P(theta) dtheta
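On a discrete grid this integral equation becomes a linear system and can be solved, however crudely, for the induced prior. The grids, the Normal(theta, 5^2) likelihood, and the roughly uniform P(price) below are all illustrative assumptions:

```python
import numpy as np

prices = np.linspace(0, 100, 201)   # grid for the observable
thetas = np.linspace(0, 100, 101)   # grid for the parameter

# Assumed likelihood: price | theta ~ Normal(theta, 5^2), column-normalized
# so each column of A is a proper discrete distribution over prices.
A = np.exp(-0.5 * ((prices[:, None] - thetas[None, :]) / 5.0) ** 2)
A /= A.sum(axis=0)

p_price = np.full(prices.size, 1.0 / prices.size)  # the assumed "universe"

# Discrete version of P(price) = \int P(price|theta) P(theta) dtheta,
# solved crudely by least squares, then clipped and renormalized.
p_theta, *_ = np.linalg.lstsq(A, p_price, rcond=None)
p_theta = np.clip(p_theta, 0.0, None)
p_theta /= p_theta.sum()
```

Real deconvolution problems of this kind are ill-posed and usually need regularization of their own; the sketch only shows the mechanics of backing P(theta) out of P(price).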

So you are getting an empirically informed prior: back-calculating from features of the distribution of past observations, assuming theta varied, to get the prior. Seems quite reasonable.

Keith,

It wouldn’t have to be empirically informed. Think again about forecasting a stock price. You could possibly determine an “empirical” P(price) based off observing general characteristics of stock movements.

Or you can get “theoretical” upper and lower limits for the price by recalling that “price” is related to Market Capitalization which can’t be completely arbitrary.

Generally speaking, the latter is less likely to lead you astray (i.e., less likely to create a P(price) for which the true price turns out to have low probability).

P.s. I meant “integral equation” in the previous comment

OK thanks, I think I understand. You’re not using the observed data you want to analyse to find P(data), rather some prior information. I was kind of familiar with that, just hadn’t recognized it in your initial posting.

Christian,

The easiest thing to do is check what I’m saying on examples where there’s conjugate prior eliminating the need to solve an integral equation. But here’s an attempt to clarify what’s happening in slightly different language.

Let x be the thing we’re trying to forecast. Then P(x) will define a high-probability region for x. Let W be this region and |W| the “size” of this region. With each model P(x|theta) we can also identify the intersection of its high-probability region with W. Call this region W(theta) and its size |W(theta)|.

Essentially, the prior P(theta) induced by P(x) and P(x|theta) will be related to the size |W(theta)|.

So given some data x* what happens when you go to maximize P(x*|theta)P(theta)? Well the P(x*|theta) wants to pick a theta for which W(theta) is small and sharply concentrated around x*. This is the over-fitting problem. W(theta) is then a very small target and future x’s are unlikely to be inside it.

The P(theta) term counterbalances this, however. It wants to pick thetas where |W(theta)| is as large as possible, since that creates a very big target and future x’s are likely to be inside it.

In other words, you’re directly balancing two competing desires. One is to pick a model which focuses on x* and one is to pick a model which is inherently more likely to work simply because the Bayesian Credibility interval you form from it is large and thus more robust.

Anonymous: Out of curiosity, is this explained like this in any literature, e.g., by Jaynes?

Actually given that P(x) is constructed a priori, it reminds me of de Finetti’s way of thinking about things; particularly about the fact that he saw decomposing P(x) into P(theta)P(x|theta) just as a technical device but treated the predictive distribution for x, i.e., P(x), as the “real” prior against which bets could be evaluated and that should be specified, be it through P(theta)P(x|theta) or otherwise.

Christian:

This has become a thread of a thread, and it probably deserves its own post, but . . . I see the appeal of a purely predictive approach to inference, but I think that the predictivist purists are missing the point that parameters (the “theta” in the model) represent what can generalize. Ultimately we’re interested in predicting new things, and it’s the thetas that give us the leverage to do so.

Andrew:

Do you mean that a purely predictive approach is more likely to yield non-generalizable parameters?

Christian,

It’s sort of in Jaynes. He doesn’t do any big applications, but chapter 20 on model comparison points out, with Bayesian analytic details, that “simpler” models tend to give rise to bigger |W(theta)|, which is why they are preferred. They are in a sense more predictively robust in that they create a bigger target region for your predictions to hit.

I stress the word *tend*: there are plenty of “simpler” models that do not have that property.

Having said that, everything I’m saying is in strong agreement with Jaynes overall. I take the Bayesian mathematics as primary, and my attempts to interpret what’s going on are *not* a justification. It’s just seeing at a deeper level what’s happening. My intuition, like everyone else’s, isn’t good enough to serve as the foundation of statistics. If my intuition contradicts the Bayesian mathematics, I improve my intuition or do a deeper Bayesian analysis; I don’t drop the Bayes and look for an ad hoc stopgap.

In particular, I’m not advocating a purely predictive approach; that’s just where the math leads in some examples.

I got it from looking at things like the lasso and regularization from a Bayesian point of view and seeing how they’re doing what they’re doing. One immediate consequence was that specifying P(x), which is actually pretty easy to do, relatively speaking, in most forecasting-type situations, fixes any “tuning” parameters present. Then after a little analytical work it was clear that this whole thing is seriously generalizable, well beyond current ridge regression/lasso type stuff.

One other thing to note, Christian: when you do ridge regression or lasso, you don’t have to think of it in terms of specifying P(x). Instead you could specify tuning parameters in the penalty term. The thing is, the mathematics doesn’t care. It’s doing its own thing independent of what we think.

In particular, there’s an implicit P(x) present no matter what. Given definite tuning parameters you can calculate the implied P(x) in ridge regression or lasso and graph it. If the true x* doesn’t appear in the high probability region of that P(x) the whole thing isn’t going to work very well.

So you may not be thinking in terms of P(x), but the mathematics forces it on you whether you like it or not. It’s better to just deal with it directly. If I have to forecast tomorrow’s temperature I know it’s not going to be 212 degrees Fahrenheit (100 degrees Celsius). I can use that knowledge to pick P(temp) before I even get started creating serious weather models P(temp| theta). The prior range P(temp) together with a family of models P(temp |theta) determines P(theta) via Bayes theorem (as well as implicitly determines any tuning parameters present).
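Computing the implied P(x) for given tuning parameters is just a prior-predictive simulation. In this sketch every number (the penalty, the noise sd, the predictor values) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, sigma = 4.0, 1.0
tau = sigma / np.sqrt(lam)          # prior sd on each coefficient implied by lam

x_new = np.array([1.0, -0.5, 2.0])  # hypothetical predictor values

# Prior predictive: draw beta from the implicit Gaussian prior,
# then draw the forecast y from the model given beta.
betas = rng.normal(0.0, tau, size=(10_000, x_new.size))
y_sim = betas @ x_new + rng.normal(0.0, sigma, size=10_000)

lo, hi = np.percentile(y_sim, [2.5, 97.5])
print(f"95% of implied forecasts fall in [{lo:.1f}, {hi:.1f}]")
```

If the plausible range of the outcome falls well outside this interval, the chosen penalty implies an unreasonable P(x) and the fit is unlikely to work well.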

Anonymous: Thanks for the explanations, much appreciated. Could be worthwhile to write this down as one closed paper with some examples. There are issues with it, such as the definition of the “high probability region” (any part of the data space can be excluded by a suitable high probability region of any continuous distribution), and how precisely P(data) can be determined a priori in any real application, but not sure whether this discussion belongs here.

Ok so say you want to model p(d,theta) and so try using p(d|theta) and p(theta). This automatically generates a ‘prior’ distribution p(d).

Now you observe d0. You do the usual update by conditioning on d0 *while holding the normalization factor \int p(d|theta)p(theta) dtheta = p(d) fixed*.

This means your new marginal for p(d), over the posterior using your updated parameters theta’ should be unchanged. Thus motivating Andrew’s posterior predictive checks? (I should really write out all the conditioning carefully in Andrew’s notation).

What I don’t get is why hold p(d) fixed? Firstly it is d0 that is being conditioned on not p(d). Secondly, as mentioned p(d) can be considered a prior – why not allow this to vary based on d0 too to get a true ‘posterior for your data’?

That is, given priors p(d) and p(theta) and model p(d,theta) could we more generally update *both* priors conditional on d0?
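For what it’s worth, a posterior predictive check in its simplest form can be sketched with the toy dataset from earlier in the thread (1, 2, 3, 4, 20 with an assumed N(theta, 5^2) likelihood and a flat prior):

```python
import numpy as np

rng = np.random.default_rng(4)
d0 = np.array([1.0, 2.0, 3.0, 4.0, 20.0])  # observed data
sigma = 5.0                                # assumed known sd
n = d0.size

# Posterior for theta under a flat prior: Normal(mean(d0), sigma^2 / n)
theta_draws = rng.normal(d0.mean(), sigma / np.sqrt(n), size=5_000)

# Replicated datasets from the posterior predictive, and a test statistic
d_rep = rng.normal(theta_draws[:, None], sigma, size=(5_000, n))
p_value = (d_rep.max(axis=1) >= d0.max()).mean()
print(f"posterior predictive p-value for max(d): {p_value:.2f}")
```

An extreme p-value for a statistic like the maximum flags the kind of misfit (here, the outlying 20) that the fitted model cannot reproduce.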

I like that first point. There’s a difference between correctness & usefulness of models.

e.g. One past example from this blog I recall is regarding the prediction of goal differentials in the soccer World Cup.

I could make a model with huge predictive intervals (95% CIs, error bars whatever) on the goal differential & then I’d almost always be right. e.g. Team A or Team B will win with a difference of 1 to 10 goals.

That was a very correct model but mostly useless.

Well, it may still help you win bets against people who are overconfident in their models giving you more precise intervals. Also, it may show correctly how much uncertainty there is in predicting soccer.

Oh, by the way, if you give only 5% to draws, my imprecise model will still beat yours. :-)

The problem with models giving you very, very large correct prediction intervals in real applications is probably that they’re *not* correct in the sense that they oversimplify on the cautious side, and one could in fact outbet them easily if it’s not about intervals but about point predictions with penalty by distance.

What about evaluating the accuracy of prediction? On unseen data.

i.e. Any tips on how to best produce a regression with external validity. Not a superstar regression performing amazingly only on the data collected.
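One answer is the holdout check: evaluate on data the fit never saw. A minimal sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(0, 1.0, size=n)   # simulated data

# Hold out a random quarter of the data before fitting anything.
idx = rng.permutation(n)
train, test = idx[:150], idx[150:]

X_train = np.column_stack([np.ones(train.size), x[train]])
coef, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)

X_test = np.column_stack([np.ones(test.size), x[test]])
rmse = np.sqrt(np.mean((y[test] - X_test @ coef) ** 2))
print(f"holdout RMSE: {rmse:.2f}")  # estimates out-of-sample, not in-sample, error
```

Any model selection (including informal “fit many models” exploration) should happen using the training portion only, or the holdout estimate stops being honest.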

Wonderful. It’s a pity such practical advice on strategies is so rarely written. Just today, as luck would have it, I leafed through Cox & Snell’s 1981 “Applied Statistics” which is all about these strategies on a case-by-case basis. And that has stood the test of time.

Perhaps see “One can hope to discover only that which time would reveal through a learner’s sufficient experience [able to anticipate a complex model with interpretable parameters] anyway, so the point is to expedite it; economy of research is what demands the leap, so to speak, of abduction and governs its art.[152]” and especially the link of [152] at http://en.wikipedia.org/wiki/Charles_Sanders_Peirce#cite_note-econ-152

Let’s take the con out of econometrics.

There is A.0 Get yourself a decent dataset. Statistics is not white magic to elicit answers to interesting questions from lousy datasets.

Indeed an A.-1. Have an interesting problem. (Indeed, the counting gets silly here.)

Nick:

I agree. Your point A.-1 is clearly true, although perhaps outside the bounds of statistics. But your point A.0 *is* statistics and it’s important. Statistics books (including my own) spend tons of space on data analysis, and a bit on sampling and data collection, but almost nothing on measurement. I do think that measurement is an underrated aspect of statistics. We all know about “garbage in, garbage out,” but our textbooks tend to just assume that available data are of reasonable quality and that they address the applied questions of interest.

And I believe it was Gertrude Cox who said that the most valuable thing the statistician might do is to really pin down the researcher on what the hell they are trying to find out. (I paraphrase)

Guess you mean this quote from Gertrude Cox: “The statistician […] finds repeatedly that he makes his most valuable contribution simply by persuading the investigator to explain why he wishes to do the experiment, by persuading him to justify the experimental treatments, and to explain why it is that the experiment, when completed, will assist him in his research.”

For starters, are there ways to quantify “lousy” in a dataset? What should we be looking for in measurements from a statistical POV?

Any heuristics or tips? What are some typical starting points to assess the quality of a dataset?

IOW, is it easy to sniff out the garbage on its way in?

There is no alternative to knowing something about how the data were generated, so it really helps to know who was involved and how they view the importance of correctly and fully recorded data. You can at least ask. Pharma companies tend to be very good at accurate data entry, especially since the FDA may check on that.

You can offer to re-enter a random subset from the records and check (that might be the most helpful thing you can do for them – I once found 4 errors in a random sample of 10 observations in a finalised data set!)
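The re-entry check is easy to script once both versions exist; the IDs and values here are hypothetical:

```python
# Original entries vs. a re-entered random subset (all values hypothetical).
original  = {"id07": 12.4, "id23": 8.0, "id41": 99.0, "id58": 3.1}
reentered = {"id07": 12.4, "id23": 8.0, "id41": 9.9,  "id58": 3.1}

mismatches = {k: (original[k], reentered[k])
              for k in reentered if original[k] != reentered[k]}
error_rate = len(mismatches) / len(reentered)
print(f"{len(mismatches)} mismatches out of {len(reentered)}: {mismatches}")
```

Even a small re-entered sample gives a rough error rate for the whole file, which is often the most useful single number to report back.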

If they refuse but are new at doing research, you likely can notice anomalies. (You can do a lot of analysis and bill them for it before you point this out, or after; one great way to make extra money.)

If they are experienced at doing research they will have someone remove the anomalies. That does not mean detect and correct data entry/coding errors; it means getting rid of noticeable ones (by deleting or modifying them). The expression one group used was “cleaning the data”.

Most of all presume the data you are first given is full of errors. And maybe the second set.

Reality sucks.

Nice tips. Thanks!

+1

If you fit many models, how do you compute standard errors (or perform any sort of statistical inference)? How can you possibly know the distribution of your estimator, when you’ve added an ad-hoc model selection step to the inference process?

Ivan:

Each model comes with its own standard error. If you want some sort of combined standard error for a model that is chosen in some way then, yes, you’d want to model that model-choice process too. Or try out your model on new data.

That is precisely what I wanted to get at. Ad-hoc model-choice processes (e.g., looking at some subjective interpretability criteria) are too common in statistical practice. Yet, nobody takes that process into account to compute standard errors. Practitioners just report the standard error associated with the final model, assuming it was specified a priori. Even more so, how can we correctly interpret the coefficients of a given regression model, if, for every new dataset from the same data-generating mechanism, we are possibly choosing different regression models? (I’m thinking about it from a frequentist point of view, but I’m sure Bayesian considerations also apply.)

Ivan:

I wouldn’t say that “nobody” takes this into account. There is some research on the topic, also there are generic approaches such as data splitting and external validation. A lot depends on the purpose of the modeling. You can look at my textbooks and applied papers for many examples.

Thanks for replying, Andrew. I say “nobody” because nobody can incorporate a subjective model selection process into the inference procedure. If data analysts were forced to write out the mathematical algorithms carried out to select a model (e.g., in the style of step-wise or cross-validation algorithms), then there would be hope, and I believe that is the direction of the research you mention. However, imagine a data analyst looking at the results of a model fit and subjectively deciding that it is not interpretable, and then running iterations of this type until he feels satisfied. I think it is impossible to introduce that process into any principled statistical inference framework.

Hi,

I want to predict some numeric values from a dataset. As of now, I am using simple linear regression. But before predicting the value, I want to filter the training data based on some criteria, and then train the model and do the predictions. This has to be done every iteration. Is there any method or neural network to achieve this?

Thank you.
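One way to sketch what this commenter describes, with a hypothetical filtering criterion on simulated data (no neural network is needed for the filtering step itself):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=300)
y = 0.5 + 1.5 * x + rng.normal(0, 1.0, size=300)

def fit_filtered(x, y, keep):
    """Filter the training data with a boolean mask, then fit y = a + b*x."""
    X = np.column_stack([np.ones(keep.sum()), x[keep]])
    return np.linalg.lstsq(X, y[keep], rcond=None)[0]

# Re-filter and re-fit on each iteration, here with a moving threshold on x.
for threshold in (-2.0, -1.0, 0.0):
    a, b = fit_filtered(x, y, x > threshold)
    print(f"x > {threshold}: intercept {a:.2f}, slope {b:.2f}")
```

Beware that data-dependent filtering is itself a model-choice step of the kind discussed above, so any reported standard errors should account for it, or the predictions should be validated on held-out data.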