Tongxi Hu writes:

Could you please answer a question about the application of statistical models. Let’s take regression models as an example.

In the real world, we use statistical models to find out relationships between different variables because we do not know the true relationship. For example, the crop yield, temperature, and precipitation. But when we apply statistical models, do we need to care about whether a model can retrieve the relationship between variables?

Examples:

Suppose the true relationship between crop yield (Y), temperature (T), and precipitation (P) is:

Y = T + sin(T/6) + P + exp{-(P-160)/4}

Suppose we also simulate some observations of Y, T, and P, and then fit these simulated observations with a regression model. I am sure we can fit them, and fit them well, with some statistical model. Let’s say the fitted model is:

Y = a*T + b*T^2 + c*P + d*P^2 + e

Apparently, the fitted model can’t retrieve the real relationship between Y, T, and P. Can we really use the fitted model to do inference? Many researchers predict future crop yields using statistical models fitted to historical observations, and some of this work is published in top journals such as Science and Nature. I doubt their conclusions. My argument is that if we are unable to make sure a model is capable of retrieving the true relationships, inference from these models can be misleading.
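The setup in the question is easy to check by simulation. Here is a minimal sketch (the variable ranges, noise level, and sample size are my own assumptions, not from the question): simulate from the stated "true" model, fit the quadratic regression by least squares, and observe that the fit can be excellent even though the coefficients say nothing about the sin or exp terms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from the "true" model in the question (ranges are my assumption).
n = 500
T = rng.uniform(10, 35, n)      # temperature
P = rng.uniform(155, 200, n)    # precipitation, kept near 160 so the exp term stays moderate
Y = T + np.sin(T / 6) + P + np.exp(-(P - 160) / 4) + rng.normal(0, 0.5, n)

# Fit the quadratic regression Y = a*T + b*T^2 + c*P + d*P^2 + e by least squares.
X = np.column_stack([T, T**2, P, P**2, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
fitted = X @ coef

# The in-sample fit can be very good...
r2 = 1 - np.sum((Y - fitted)**2) / np.sum((Y - Y.mean())**2)
print("R^2:", round(r2, 3))

# ...yet the coefficients (a, b, c, d, e) do not recover sin() or exp():
print("coefficients (a, b, c, d, e):", np.round(coef, 4))
```

The point of the sketch is exactly the question's worry: a high R^2 tells you the approximation is good over the observed range, not that the functional form is right.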

My reply:

If you simulate fake data, there’s a true model. But in real life there’s just about never a true model, for two reasons:

First, to go back to your example: whatever the actual function E(y|t,p) is in the population, it will not have any parametric form. E(y|t,p) could be approximated by a linear model or a model with sin and exp or whatever, but there will be no true parametric form.

Second, there is no single E(y|t,p), as this expectation or regression function will vary over space, time, different types of crop, etc. Just as when estimating a treatment effect there is really no single “treatment effect” to estimate, when estimating a predictive relationship there is no single relationship to estimate.

In practice, all models are approximations, both in their functional form and in their implicit assumption of stability. (Yes, you could extend your model to allow variation in space, time, and type of crop—and that could be a good idea—but there’d still be variation according to other factors you did not account for.)

You write, “if we are unable to make sure a model is capable of retrieving the true relationships, inference from these models can be misleading.” This is a legitimate concern. Just remember that there are no “true relationships” to recover.

Also consider that models can be substantively motivated—that is, justified in part from the underlying science of the problem being modeled. There are strong motivations such as with compartmental models in toxicology, or our golf putting model, and weaker motivations such as with models predicting elections from the economy. Substantive motivation can be seen as a kind of regularization. One advantage of a substantive model is that there can be natural ways to extend it, as in the golf example.

As it’s Christmas Eve, and there are no other comments, can I just say that, for a scientist trying to fit models to data, this blog, and its many contributors above and below the line, is the best thing on the internet. Thanks all!

No, the parameter values of the equation are meaningless unless you have derived it from some assumptions. The reason is that the meaning of each parameter depends on the form and what gets included in the model, which is almost always a matter of convenience.

Lots of research should get thrown out due to this mistake, and many people have wasted their lives on BS and can never admit it (it is an even more fundamental and widespread mistake than NHST). There is a good paper Andrew posted this past year where they check the value of one coefficient across millions of different plausible linear model specifications and see it vary greatly; then they (correctly) say the true model is probably nonlinear anyway. The conclusions drawn about parameters of such models are totally subjective and arbitrary.
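The specification-dependence point is easy to demonstrate on a toy example (the data-generating function and the three specifications below are my own illustration, not from the paper mentioned): when the truth is nonlinear, the "coefficient on t" changes substantially depending on which other terms happen to be included.

```python
import numpy as np

rng = np.random.default_rng(2)

# Nonlinear truth; the "coefficient on t" depends entirely on the specification.
n = 2000
t = rng.uniform(0, 10, n)
y = np.exp(0.3 * t) + rng.normal(0, 0.2, n)

def coef_on_t(columns):
    """Least-squares fit of y on the given columns plus an intercept;
    return the coefficient on the first column, which is t."""
    X = np.column_stack(columns + [np.ones(n)])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b[0]

# Three plausible specifications, three quite different "effects of t":
print("y ~ t:            ", round(coef_on_t([t]), 3))
print("y ~ t + t^2:      ", round(coef_on_t([t, t**2]), 3))
print("y ~ t + t^2 + t^3:", round(coef_on_t([t, t**2, t**3]), 3))
```

None of these coefficients is "the effect of t" in any absolute sense; each is an artifact of the chosen approximation.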

An arbitrary statistical model can still be used to make predictions; all of machine learning is based on this. I wouldn’t personally call that inference… but sometimes it is referred to that way.

The best way to think about modeling of this sort is that you are partitioning the answer into two compartments: the model and the error. With frequentist inference the error is seen as a kind of fancy die that gets rolled and tacked on to the answer… that is bunk of the highest order. In a Bayesian model, probability is used to quantify how much you know about the answer, and errors can be errors in modeling, or in measurement, or in anything.

So, if your model is F(a,b,c)+epsilon and a more accurate model is F(a,b,c)+G(d,e,f), then by algebra epsilon = G(d,e,f), and as long as G is in the high-probability region of the distribution you chose for epsilon, your model’s assumptions are satisfied. Sure, it’s not ideal; it would be good to discover G or something that approximates G… but essentially all of modeling is pulling structure out of the error term, forcing the error to have a smaller scale or be less biased, etc.
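The idea of pulling structure out of the error term can be sketched in a few lines (F, G, and the data here are my own toy choices): fit only F at first, so everything else lands in epsilon, then look for the structure G inside those residuals.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy version of "model + error": truth is F(t) + G(p), but we start by fitting F only.
n = 1000
t = rng.uniform(0, 10, n)
p = rng.uniform(0, 10, n)
y = 2.0 * t + np.sin(p) + rng.normal(0, 0.1, n)    # F(t) = 2t, G(p) = sin(p)

# Stage 1: fit only F; everything else gets dumped into epsilon.
X1 = np.column_stack([t, np.ones(n)])
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
eps = y - X1 @ b1                                   # epsilon absorbs G(p) plus noise

# Stage 2: look for structure in epsilon; here, regress it on sin(p).
X2 = np.column_stack([np.sin(p), np.ones(n)])
b2, *_ = np.linalg.lstsq(X2, eps, rcond=None)

# Pulling G out of the error term shrinks the residual scale:
print("error scale before:", round(eps.std(), 3))
print("error scale after: ", round((eps - X2 @ b2).std(), 3))
```

Each stage moves some of what was "error" into the model, which is exactly the iterative process described above.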

So the question to ask is whether you are satisfying the assumptions your model makes, and whether you can come up with a better model. Don’t worry about going straight to the true model. Andrew’s golf putting example is a perfect example of how iterative model development pulls structure out of the errors.