See yesterday’s post for background.

Here’s the question:

In the helicopter activity, pairs of students design paper “helicopters” and compete to create the copter that takes longest to reach the ground when dropped from a fixed height. The two parameters of the helicopter, a and b, correspond to the length of certain cuts in the paper, parameterized so that each of a and b must be more than 0 and less than 1. In the activity, students are allowed to make 20 test helicopters at design points (a,b) of their choosing. The students measure how long each copter takes to reach the ground, and then they are supposed to fit a simple regression (not hierarchical or even Bayesian) to model this outcome as a function of a and b. Based on this model, they choose the optimal a,b and then submit this to the class. Here is the question. Why is it inappropriate for that regression model to be linear?

And here’s the answer: For a linear model the optimum is necessarily on the boundary. But we already know the solution can’t be on the boundary (each of a and b must be more than 0 and less than 1). You need a nonlinear model to get an internal optimum.
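To see this concretely, here is a minimal sketch in Python (simulated data; the “true” response surface below is invented for illustration, not the actual helicopter physics). Fitting the linear model to 20 noisy measurements and then maximizing the fitted plane over a grid puts the argmax on the boundary regardless of what the data say:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented "true" descent time, peaked in the interior (not real helicopter physics)
def true_time(a, b):
    return 2.0 - (a - 0.6)**2 - (b - 0.4)**2

# 20 test helicopters at random design points, with measurement noise
a = rng.uniform(0.05, 0.95, 20)
b = rng.uniform(0.05, 0.95, 20)
y = true_time(a, b) + rng.normal(0, 0.05, 20)

# Fit the linear model y = beta0 + beta1*a + beta2*b by least squares
X = np.column_stack([np.ones(20), a, b])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Maximize the fitted plane over a grid inside (0,1)^2:
# the argmax of a plane over a box is always on the boundary
grid = np.linspace(0.01, 0.99, 99)
A, B = np.meshgrid(grid, grid)
fit = beta[0] + beta[1]*A + beta[2]*B
i, j = np.unravel_index(np.argmax(fit), fit.shape)
print(A[i, j], B[i, j])  # lands at an extreme of the grid in both coordinates
```

The point is that the location of the fitted plane's maximum is determined by the signs of the slope estimates, not by where the data actually peak.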

I was happy to see that all 4 of the students got this correct.

**P.S.** I see there was some confusion among the commenters on the definition of a linear model, perhaps because I labeled the design variables as “a” and “b” rather than “x1” and “x2.” The distinction between linear and nonlinear models is important in applied statistics. In this case, a linear regression model would be of the form y = beta_0 + beta_1*a + beta_2*b + error. A nonlinear model could have quadratic terms (for example), or it could look like y = (beta_0 + beta_1*a + beta_2*b + error_beta)/(gamma_0 + gamma_1*a + gamma_2*b + error_gamma), or it could be mathematically constructed to give the right answers at the boundaries. All sorts of nonlinear models are possible. We don’t really teach our students how to construct such nonlinear models, but writing this post is making me think it should be in the curriculum.
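For instance, here is a sketch (simulated data again, with an invented true surface) of fitting the quadratic-terms version and solving for its stationary point, which, unlike the linear fit, can land in the interior:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented true surface with an interior peak (not the real helicopter physics)
def true_time(a, b):
    return 2.0 - 3*(a - 0.6)**2 - 2*(b - 0.4)**2

a = rng.uniform(0.05, 0.95, 20)
b = rng.uniform(0.05, 0.95, 20)
y = true_time(a, b) + rng.normal(0, 0.05, 20)

# Quadratic regression: y = c0 + c1*a + c2*b + c3*a^2 + c4*b^2 + c5*a*b
X = np.column_stack([np.ones(20), a, b, a**2, b**2, a*b])
c, *_ = np.linalg.lstsq(X, y, rcond=None)

# Stationary point: set the gradient to zero, i.e. solve
#   c1 + 2*c3*a + c5*b = 0
#   c2 + c5*a + 2*c4*b = 0
H = np.array([[2*c[3], c[5]], [c[5], 2*c[4]]])
opt = np.linalg.solve(H, -np.array([c[1], c[2]]))
print(opt)  # an interior point, near the invented optimum (0.6, 0.4)
```

Note that this quadratic model is still linear in its coefficients, so it can be fit by ordinary least squares; it is nonlinear in the design variables, which is what allows an interior optimum.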

Naive question: for regression, “linear” means linear in the parameters, right?

So

“time = const1 * a + const2 * b^2 + const3” would be a linear model, right? Can this not have an internal maximum?

Also: what stops the (0,1) parameterization from covering a small enough range that the optimum (a,b) within that space ends up on its boundary?

George:

Perhaps it was not clear enough in the problem statement, but in this case 0 and 1 were physical constraints: “each of a and b must be more than 0 and less than 1” because 0 or 1 would not be possible (it was “more than” and “less than,” not “at least” and “no more than”).

My interpretation was the same as Rahul’s, which made me think the problem was with the OLS assumptions of the error distribution (non-negativity of fall time).

Rahul:

It depends on the context but in this sort of setting, “linear model” implies linear in the predictors.

Ok. Funnily enough, I came full circle. I went through college thinking linear models mean

“y = ax + b”

then was surprised to use R’s glm, “linear” by name yet capable of fitting

“y = ax^2 + b”

etc., and then Wikipedia’s comprehensive article on linear regression had me convinced that “linear” means linear in the parameters (i.e., x^2 etc. is ok). And now again I see “linear” being used as linear in predictors.

The semantics is super confusing!

Maybe if you distinguished “linear model” and “linear regression” it would help?

Oh, does the term “linear” mean something different when prefixed to “model” versus “regression”?

What the word means depends on convention. Per Wikipedia, in statistics “linear model” means linear in parameters and “linear regression” means both linear in parameters and in covariates. I am not sure that Wiki is a great source of usage advice, but that’s the best I can do.

Would you like to proffer any supporting evidence for your interpretation of the terms? This is the first time I have heard it suggested that “linear regression” means linear in both parameters and covariates.

All (text) books I read pretty much use “simple linear regression” for the case of one covariate and an intercept, otherwise it is “multiple linear regression”. Polynomial regression is definitely handled in “linear regression” text books, and it is clearly not linear in covariates.

Thus, I would have had problems with this question from the start, as “fit a simple regression […] to model this outcome as a function of a and b” implies that a multiple linear regression model is to be fitted, yet it is called a simple regression?

Rahul: “The semantics is super confusing!”

That’s why I don’t bother with these sorts of exams. I always do terribly because I tend to find seven tails to a cat.

Okay, I’ll admit that my primary emphasis isn’t in statistics, so I’m probably missing some big points, but I’m not sure why the optimum has to be on the boundary. I figured that linear models were inappropriate for the typical reasons that linear regression on a bounded space is inappropriate (i.e., misfitting that might allow for false extrapolation outside the boundaries), and that logistic regression might be more appropriate, but that assumes that the optimal point (or, rather, set of points) still lies inside the region.

I will say that, in typing this, I just realized that unless the best fit has either infinite or zero slope with respect to one of the measures, the fitted time necessarily increases with any increase (or decrease) in a or b until a or b hits a boundary. But I wonder if there is more to it than that, especially considering that zero slope implies infinitely many “optimal” solutions and infinite slope implies some really weird physical phenomena that are probably never going to happen in most people’s lives.

Suppose they run the regression and then try to argmax{Aa+Bb} subject to constraints from above where A and B are coefficients from the regression.

This would suggest the optimal design based on the signs of A and B:

A>0 and B>0 would suggest optimal (a,b)=(1,1)

A<0 and B>0 would suggest optimal (a,b)=(0,1)

A<0 and B<0 would suggest optimal (a,b)=(0,0)

A>0 and B<0 would suggest optimal (a,b)=(1,0)

or rather, if I did it with Matlab, they would be numbers like 1 − 1e-10 (for 1) or 1e-10 (for 0)
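In code, the sign logic above is just a one-liner (a small sketch; the function name is made up, and it ignores the open-boundary issue by returning the corner itself):

```python
def linear_argmax(A, B):
    # Maximizing A*a + B*b over the unit square: the sign of each
    # coefficient picks out the corresponding corner
    return (1.0 if A > 0 else 0.0, 1.0 if B > 0 else 0.0)

print(linear_argmax(0.3, -1.2))  # (1.0, 0.0)
```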

I'm not sure there's anything, as the problem is written, to suggest that these couldn't reasonably be answers. Sure you claim the constraint is less than 1 rather than equal to 1. But this question is about actually cutting something (i.e. not a theoretical question). I could just make the relevant cuts to precision such that they are just barely acceptable.

I had gone through both the (0,1) and (0,0) cases, but it seems like it got cut off.

But if you think the solution is one of those four cases, why do you need a regression at all? Just try the four cases!

Good point.

I like it when the answer is much shorter than the question.

In reality, it does matter where you think the best parameters are. If they are near the corners of the feasibility set, then why waste time measuring trials in the middle? And if you are pretty sure (say, based on theory) that extreme values of the parameters are not the answer, then clearly a linear model is not your choice.

I overthought it. Assumed the model allowed for an (a,b) interaction which (I think?) means that we’re not left with Zeno’s paradox on the max/min values. So I went with the idea that the time for a falling body to reach the earth is not strictly linear, especially when air resistance is factored in: http://demonstrations.wolfram.com/FallingBodyWithZeroLinearOrQuadraticAirResistance/

Besides that, I wanted to say that besides the problem of 0 or 1 always being predicted as the max (assuming Bi != 0), and the possibility of predicting a fall faster than that due to gravity alone, there was no indication that a linear model made sense or that there was no interaction between a and b. In fact, the design of the experiment seems to imply there was interest in whether there would be some interaction. But then again, I also thought that if you are doing an experiment like this with students, you should have them plot the data :).

I don’t think interactions have anything to do with it. y ~ r*a + s*b + t*a*b does not have its optimum at an interior point either.
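That can be checked numerically (a quick sketch): a bilinear surface is linear in each variable with the other held fixed, so its maximum over the unit square always coincides with the best of the four corners:

```python
import numpy as np

rng = np.random.default_rng(2)
grid = np.linspace(0, 1, 101)
A, B = np.meshgrid(grid, grid)

for _ in range(100):
    r, s, t = rng.normal(size=3)
    surface = r*A + s*B + t*A*B
    corners = [r*a + s*b + t*a*b for a in (0, 1) for b in (0, 1)]
    # The grid maximum is always attained at one of the four corners
    assert np.isclose(surface.max(), max(corners))

print("corner maximum matched in all 100 random cases")
```

Any interior stationary point of a bilinear surface is a saddle, not a peak, which is why the interaction term by itself doesn't rescue the model for optimization.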

Andrew writes “But we already know the solution can’t be on the boundary (each of a and b must be more than 0 and less than 1).” Why is that necessarily true? It’s only true if we already know that there *is* a solution within the chosen parameter space, i.e. that the function must attain some maximum among the allowed values of a and b. I don’t see any reason why that should be true without further assumptions about how the fall time depends on a and b, in order to guarantee that the maximum must occur within the interior of the region rather than on the boundary (in which case the function simply has no maximum on the allowed region since we’re not including the boundary). Am I missing something?

So, does the question have anything specific to do with regression, errors, noise, etc., or is it more general? I.e., even if we had a non-regression model, say from first principles or some other source, it’d still be silly to find a true optimum with a linear function, right?

Basically, all this is saying is

“y = mx + c” cannot have internal optima?

Rahul:

We say y=ax+b, not mx+c, but otherwise, yes, that’s exactly the point. In this example, the optimum cannot be on the boundary (for physical reasons) so it’s inappropriate to use a linear model to estimate the function for the purpose of finding an optimum.

Gotcha! Thanks!

Andrew:

Regarding your P.S. At least two decent sources, R and Wikipedia, regard “linear” as including some of what you term nonlinear.

Maybe the semantics here is so fuzzy that one should always spell out what exactly one means by linear.

Such nonlinear models are crucial in engineering, so it might be worthwhile learning the nuances of how to fit them. E.g., here’s a very common model from the chemical industry, looking sort of similar to Andrew’s illustration above:

http://pubs.rsc.org/services/images/RSCpubs.ePlatform.Service.FreeContent.ImageService.svc/ImageService/Articleimage/2014/CS/c3cs60395d/c3cs60395d-t2_hi-res.gif

Hi Andrew,

I have a question and a comment.

Question: Consider the scenario where it is known in advance that the underlying model for the helicopter problem is linear (y=beta_0 + beta_1*a + beta_2*b + error) but the beta parameters are unknown. In that scenario, wouldn’t the linear model be an appropriate model to fit? And if so, isn’t this a special case of the problem you described above?

Comment: The constraints for a and b being strictly between 0 and 1 feels silly. Assuming a nontrivial amount of error, I can’t imagine there being a difference between making a one atom cut in a piece of paper (a > 0) vs. not cutting the paper at all (a=0). Similarly, cutting all but one atom of a piece of paper (a < 1) seems equivalent to cutting the paper fully (a = 1) since the slightest motion would break that last atomic bond.

These comments are related to those from Nathaniel and Jonathan above, but I thought a rephrasing might clarify the issues.

Andy:

For physical reasons it is clear that the maximum cannot be on the boundary, so the underlying model cannot be linear. But in any case, such a real-world problem would never have an underlying model that is linear; the point here is not that the linear model is incorrect (as it must be) but rather that it is inappropriate to be used to find a maximum. Certain models are inappropriate for certain purposes.

Thanks Andrew, the main point you’re making is much clearer to me now — that in general, if you’re looking for an optimum, you want a model that allows the optimum to be anywhere in the parameter space of interest rather than a model that a priori restricts where the optima can occur.

With that in mind, I think the boundary condition is actually a bit of a red herring. Even if the constraints included the points 0 and 1, a linear model would still be a bad choice, right? Also, this is now a very minor point, but although you say it’s clear that for physical reasons the maximum cannot be on the boundary, it’s not immediately clear to me why this is the case. If it’d be helpful, I could certainly construct a physical problem where the optimum would be on a boundary point. But that’s a minor quibble, and I think I understand the main point now, so thank you for your response.

If that were known, you would not need to fit any model, merely test at the boundaries. The optimum would necessarily lie at one of four points: (1,0), (0,1), (0,0), (1,1).

I could be wrong.

I’m late to the party, but I think the given solution makes no sense at all (and perhaps points to a fundamental difference between the training of statisticians and domain-focused scientists):

A model is appropriate if and only if we have reason to believe that it provides a sensible description of the thing being described. So if someone proposes a linear model, our first and only question should be whether we have reason to believe that the descent time is sensibly approximated by a linear function of a and b. If not, we shouldn’t model it that way – end of story. We should _never_ choose a model simply because we happen to have a handy implementation of it sitting on the shelf. But this question reinforces my impression that such an attitude is actively encouraged when teaching statistics.

Suppose instead the question was why it is inappropriate to use a sinusoidal model (or whatever other crazy parametric form you care to invent). The answer would be the same – if we have no reason to think that such a model provides a sensible description of the physical reality being modeled, we have no reason to use it.

The issue of where the optimum will be is a red herring. Sure, if we choose a model that ignores some of our practical constraints it can happen that it tells us to implement a solution that is not feasible in practice. In some situations this would be a good reason not to ignore those constraints when choosing the model. But in the present case we can implement solutions arbitrarily close to the boundaries, so in practice that is not an issue at all. If we have an analytical proof that certain solutions cannot be optimal under our model assumptions, and we believe the model assumptions to be reasonable, then we should discard those solutions rather than the model.

Konrad:

The optimum is not on the boundary or near the boundary, hence it does not make sense to try to find the optimum by fitting a linear model, which will have an optimum on the boundary.

A priori I have no reason to expect that the optimum would _not_ be on or near the boundary. My reasons for being unconvinced that a linear model is appropriate are much stronger than any expectation I might have about where the optimum might be. In particular, the absence of even an _attempt_ to justify the choice of model is a far stronger reason to discard it.

It definitely would’ve helped had I included in the problem a picture of how the helicopter is constructed. At the boundary the helicopter is not helicopter-like at all and will fall quickly.

“a and b, correspond to the length of certain cuts in the paper, parameterized so that each of a and b must be more than 0 and less than 1.”

The problem says nothing about the shape of the cuts, only their lengths. Given only the information in the problem, the lengths could correspond to shapes of cuts such that lengths of 0 and 1 do correspond to a helicopter.


In any case, all 4 of the students got this one correct, so for whatever reason they got the point!

It seems to me this is a very serious issue in statistical training – teaching the idea that it can be ok to use linear regression without first attempting to justify linearity assumptions (one imagines large numbers of trained practitioners who may not even be _aware_ that they are making modeling assumptions).

It is related to the curious absence of “where does your model come from?” questions in the ranks of those asking “where does your prior come from?”

I agree with you, though I also think Andrew has some strong physical or prior-knowledge intuition regarding the results of this particular experiment which is missing in the discussion. Note that if the design DOES have an interior optimum, then locally in the vicinity of that optimum the Taylor series will have a dominant second-order term, so it then makes sense to consider quadratic models, at least over a small region of the parameter space near the optimum.

So, in general, I think your point is VERY IMPORTANT but in this particular case, I think what’s really missing is the actual description of the “helicopter experiment” which would help make the assumptions clear and would allow justifying some model assumptions.
