# Leif and Uri need to hang out with a better class of statisticians

Noted psychology researchers and methods skeptics Leif Nelson and Uri Simonsohn write:

A recent Psych Science (.pdf) paper found that sports teams can perform worse when they have too much talent.

For example, in Study 3 they found that NBA teams with a higher percentage of talented players win more games, but that teams with the highest levels of talented players win fewer games.

The hypothesis is easy enough to articulate, but pause for a moment and ask yourself, “How would you test it?”

So far, so good. But then they come up with this stunner:

If you are like everyone we talked to over the last several weeks, you would run a quadratic regression (y = β0 + β1x + β2x²), check whether β2 is significant, and whether plotting the resulting equation yields the predicted u-shape.

This is horrible! Not a negative comment on Leif and Uri, who don’t like that approach and suggest a different analysis (which I don’t love, but which I agree would, for many purposes, be better than simply fitting a quadratic), but a negative comment on their social circle.

If “everyone you talk to over several weeks” gives a bad idea, maybe you should consider talking about statistics problems with more thoughtful and knowledgeable statisticians.

I’m not joking here.

But, before going on, let me emphasize that, although I have some disagreements with Leif and Uri on their methods, I generally think their post is clear and informative and like their general approach of forging strong links between the data, the statistical model, and the research question. Ultimately what’s most important in these sorts of problems is not picking “the right model” or “the right analysis” but, rather, understanding what the model is doing.
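To see what the quadratic fit is doing in a case like this, here is a minimal sketch. It is not taken from Leif and Uri's post: the helper name `fit_quadratic`, the noiseless log curve, and the grid of x values are all assumptions made for illustration. The data follow a strictly increasing function, and the code asks where the fitted parabola peaks.

```python
import math

def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x**2 via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]                  # sums of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # sums of y*x^k
    # Augmented 3x4 system, solved by Gauss-Jordan elimination with pivoting.
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

# Hypothetical data: a saturating, strictly increasing curve with no noise.
xs = [k / 1000 for k in range(1, 1001)]
ys = [math.log(x) for x in xs]

b0, b1, b2 = fit_quadratic(xs, ys)
vertex = -b1 / (2 * b2)  # where the fitted parabola turns over
print(f"b2 = {b2:.2f}, fitted peak at x = {vertex:.2f}")
```

On this monotone curve the fitted β2 is negative and the fitted parabola peaks inside the observed range, so neither the sign of β2 nor the plotted equation can separate saturation from an actual downturn.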

## Who should we be talking to?

Now let me return to my contention that Leif and Uri are talking with the wrong people.

Perhaps it would help, when considering a statistical problem, to think about five classes of people who might be interested in the results and whom you might ask about methods:

1. Completely non-quantitative people who might be interested in the substantive claim (in this case, that sports teams can perform worse when they have too much talent) but have no interest in how it could be estimated from data.

2. People with only a basic statistical education: these might be “civilians” or they could be researchers—perhaps excellent researchers—who focus on the science and who rely on others to advise them on methods. These people might well be able to fit the quadratic regression being considered, and they could evaluate the advice coming from Leif and Uri, but they would not consider themselves statistical experts.

3. Statisticians or methodologists (I guess in psychology they’re called “psychometricians”) who trust their own judgment and might teach statistics or research methods and might have published some research articles on the topic. These people might make mistakes in controversial areas (recommending a 5th-degree polynomial control in a regression discontinuity analysis or, as in the example above, naively thinking that a quadratic regression fit demonstrates non-monotonicity).

4. General experts in this area of statistics: people such as Leif Nelson and Uri Simonsohn, or E. J. Wagenmakers, or various other people (including me!), who (a) possess general statistical knowledge and (b) have thought about, and may have even worked on, this sort of problem before, and can give out-of-the-box suggestions if appropriate.

5. Experts in this particular subfield, which might in this case include people who have analyzed a lot of sports data or statisticians who specialize in nonlinear models.

My guess is that the people Leif and Uri “talked to over the last several weeks” were in categories 2 and 3. This is fine—it’s useful to know what rank-and-file practitioners and methodologists would do—but it’s also a good idea to talk with some real experts! In some way, Leif and Uri don’t need this, as they themselves are experts, but I find that conversations with top people can give me insights.

## 42 thoughts on “Leif and Uri need to hang out with a better class of statisticians”

• Gabby:

I don’t think this question, “determine if the data has a U relationship,” is the right way to frame it. It’s similar to the question, when studying an interaction effect, of whether the two lines really cross. In these U-curve settings, the U is typically explained as the sum of two effects, one positive but decreasing in slope, and one negative but increasing in slope, and what is of interest are the individual effects, not so much whether their sum happens to be non-monotonic or U-shaped.

That said, of course I do think descriptive analysis can be important, and there are a lot of examples where we see a pattern in a sample and we’re interested in inference for the pattern in the general population (or, we see a pattern in noisy data and we’re interested in inference for the underlying pattern), and, in such settings, I guess I’d generally want to use a nonparametric regression method such as a spline or Gaussian process, maybe with some pretty strong constraints, depending on the context.

• Prof. Gelman, I politely remind you that you recently (i.e. a few months ago) wrote an exam question that relied heavily on a function having a maximum within the defined parameter range. Sometimes figuring out whether the underlying function reaches the maximum can be an interesting problem in its own right.

• D.O.:

Thanks for the polite reminder. Yes, I agree that such questions can arise, especially in optimization problems.

• There are times when the “U relationship” matters. One serious occurrence is “umbrella” dose-response curves, which are common in immunological treatments. Typically a dose-response curve is sigmoidal, but umbrella curves drop off as the dose level increases. There are reasons a doctor chooses to be on either side of the umbrella. Also, I spent a few months in a class called ‘advanced experimental design’ learning about response surface modeling with quadratic terms. I never had to use this technique until I had to look at calibration well plates. A well plate is a tray of rows and columns of specimen-holding wells that is subjected to some process for biochemical measurement. The machine I had gave a perfect bubble with both a quadratic row and edge effect.

In economic and social research I agree with Andrew Gelman that it is better to look for combinations of effects or split up the response with splines etc. One item that comes to mind is income and mortgage amount. Among homeowners it increases with income until we get to higher incomes where people pay cash for homes. I use Loess because the bandwidths adjust and the idea of spliced together regression lines is easy to explain.

• I’d recommend using a linear spline with one joint, which consumes four degrees of freedom. The method that Leif and Uri propose on their site consumes 8 degrees of freedom (4 degrees of freedom for the original quadratic in order to locate the potential extremum, then 2 degrees of freedom for each of two linear regressions). Plus their method produces a discontinuity at the extremum, which seems contrary to the substantive motivation.
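A minimal sketch of a one-joint linear spline, assuming the joint is chosen by a grid search over candidate locations and each candidate is fitted by ordinary least squares. The helper names `ols` and `fit_hinge` and the toy data are hypothetical, not from the comment or from Leif and Uri's post. The four estimated quantities are the intercept, the initial slope, the change in slope, and the joint location, which is one way to read the "four degrees of freedom" above.

```python
def ols(columns, ys):
    """Least squares via the normal equations; columns is a list of regressors."""
    p = len(columns)
    a = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(p)]
         + [sum(u * y for u, y in zip(columns[i], ys))] for i in range(p)]
    for col in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    beta = [a[i][p] / a[i][i] for i in range(p)]
    fitted = [sum(b * c[k] for b, c in zip(beta, columns)) for k in range(len(ys))]
    sse = sum((y - yhat) ** 2 for y, yhat in zip(ys, fitted))
    return beta, sse

def fit_hinge(xs, ys, knots):
    """Continuous two-segment line: y = b0 + b1*x + b2*max(0, x - knot)."""
    best = None
    for knot in knots:
        cols = [[1.0] * len(xs), list(xs), [max(0.0, x - knot) for x in xs]]
        beta, sse = ols(cols, ys)
        if best is None or sse < best[2]:
            best = (knot, beta, sse)  # keep the knot with the smallest error
    return best

# Hypothetical data: slope +2 up to x = 6, slope -1 afterwards, no noise.
xs = list(range(11))
ys = [2 * x if x <= 6 else 12 - (x - 6) for x in xs]
knot, (b0, b1, b2), sse = fit_hinge(xs, ys, knots=range(1, 10))
print(knot, round(b1, 3), round(b1 + b2, 3))  # joint, slope before, slope after
```

Because the second segment is built from `max(0, x - knot)`, the two lines meet at the joint by construction, so the fit has no jump there.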

• Paul:

And with Stan, such models are easy to fit.

1. “…better class of statisticians”

“…think about five classes of people…”

“1. ….”
“2. ….”
“3. ….”
“4. ….”

Am I missing something? :-)

• Bill:

I don’t remember now, but I must have been thinking of experts in that particular area. I’ll add this.

• Maybe the really great statisticians understand that 5 always means 5 +/- 1. (Now if Andrew had written “5.0 classes of people”…)

2. Leif and Uri’s alternative analysis has the virtue of being simple and straightforward. After reading a blog post I understand how to do it.

But to the point: isn’t the use of quadratics one of those techniques that should generally be relegated to the dustbin? This is particularly true in forecasting contexts, where extrapolating to future time periods is perilous enough without a time^2 term sitting there. But quadratics seem to show up in most forecasting texts and in most intro stat texts. (I’ve just checked two, and quadratic is the first transformation mentioned — ahead of, for example, the log transformation.)

• Is there any utility in doing an automatic analysis of this sort on any linear regression? I’ve seen datasets where people happily fit a single straight line over the entire range whereas it seems likely that what they are seeing is, in fact, a +ive slope followed by a flat line or -ive slope line. A lot of real world phenomena seem to possess this characteristic of saturation or decrease.

Could one just take any linear regression dataset & auto-partition it at random points to see if such a change in slope happens at some threshold in the domain?

• Rahul, I’ve seen this done in forecasting as a way to test for “trend breaks” using the second derivative (the change in the rate of change). Scott Armstrong wrote this up somewhere, although given the prolific nature of Armstrong that doesn’t narrow it down enough.

3. Andrew:

Could you elaborate on the “some disagreements with Leif and Uri on their methods” part?

Would you propose to do the U-testing in some other way?

4. I’m going to push back a little here. I take it that the people they’re talking to are interested in the exploratory data analysis question: “In the data, do we see that the highest talent concentration teams do in fact do worse?”

Of course, the first thing to do is to plot “achievement” vs “talent concentration” (well, the first thing to do is to figure out how to quantify those things).

While I think the quadratic is probably not going to fit well to the entire spectrum of talent concentration (for low talent teams, probably a linear or even a convex-upward curve might make more sense), if you look at the top say 20 percent of the talent concentration range, and fit a quadratic, and it doesn’t have a strong negative quadratic term, the basic hypothesis is probably ruled out.

Now, if you want to actually model the causal question of how could teams do worse in some sense even though they have more talent… you’re going to need *some* area specific knowledge. I know nothing about sports data analysis, so the idea that there’d be two additive effects one of which begins to dominate the other as talent concentrates… well it’d take me a while to think of that idea. I’d probably have to start to think about mechanisms by which talent concentration would work against a team… mechanisms like personality conflicts, like talent mismatch (ie. people with different methods of performing that because of their specialized methods can’t work well in a team) or whatever, and from that kind of insight try to build a model, and key to all of that would be talking with “sports people”.

So, from an exploratory data analysis perspective, rather than a sports-specific modeling and causal analysis perspective, I think the advice to graph y vs x and fit a polynomial is perfectly fine (though like I said, I would probably restrict the polynomial to the data in the upper end of the range… and justify that theoretically by Taylor expansion).

• Daniel:

I’m with you all the way until you introduce the word “polynomial” in your last paragraph. I think it makes sense to graph y vs. x and fit a nonlinear regression model. But, in any case, the procedure that Leif and Uri were criticizing was not just “fit a polynomial” or even “fit a quadratic,” but rather the more specific “you would run a quadratic regression (y = β0 + β1x + β2x²), check whether β2 is significant, and whether plotting the resulting equation yields the predicted u-shape.” Throwing statistical significance in here is toxic.

• > toxic

Well this comment certainly appeared to be toxic –
“(first slope is significantly positive, p<.001, but the second slope is not significant, p=.53)"

• If you were going to test at all, you’d test the difference between the two slopes. You’d need some kind of correction for the fact that the researchers split the data at the high point of the quadratic, thus maximizing the chance the slopes would be different.

This is a hint that I’d like to know what sort of correction would be suggested. If I don’t get an answer here I’ll try the time-series tag on CrossValidated.

• Great, I agree with you about statistical significance being probably irrelevant at this point, so we’re on the same page so far as everything goes except “polynomial”. So I’ll explain the polynomial bit a little more:

Polynomials are very flexible: there’s a proof that they form a complete basis for continuous functions on closed intervals (the Weierstrass Approximation Theorem). An important part of that is the “closed intervals” part: as the x variable goes to infinity, polynomials are always dominated by their highest power term. This is a big problem for *extrapolation*, which is often what we’re interested in, but I don’t think so in this case.

In this case we’re interested in some data which we think comes from some nonlinear model, like:

y_i = f(x_i) + epsilon_i

And we’re going to check to see if f has a certain character, namely that as x gets near some largish but achievable value, f will reach a maximum and start to decline. We’re going to assume f is a smooth function, because in the presence of the noise and with a small dataset, we are never going to get much in the way of information about its roughness. If we knew that maximum point, we could write a Taylor series for this smooth function around the max point:

f(x) = a + b(x-xmax) + c(x-xmax)^2/2! + d(x-xmax)^3/3! + …

So long as (x-xmax) is small, the higher order terms will be ever smaller in magnitude, so that we don’t need very many terms before the rest are likely irrelevant. And, if there is a maximum at xmax, then the 1st order term is zero (b = 0); if f is also roughly symmetric about the maximum, so that the 3rd order term is negligible, the model:

f(x) = a + c(x-xmax)^2/2 + O((x-xmax)^4)

is good to 4th order!

Now, let’s further define x in such a way that all the x values are O(1). Specifically, x is going to be something like “the total talent on the team” as measured by some metrics of talent… let’s divide this number by the talent you would get by picking a team full of the most talented player we know… so all the x values will be in the interval [0,1].

Now, in the vicinity of xmax, the Taylor series obviously is dominated by the 0th and 2nd order term as described above, AND the statement of the problem is a statement ABOUT the range over which xmax exists… somewhere near the upper end of the observed range…

So my advice to restrict yourself to the upper portion of the observed x range, and then fit a 2nd order polynomial will be good advice so long as you

a) don’t over-interpret the results extrapolated outside the range
b) don’t have so much noise that the noise spuriously amplifies the 1,3, and higher order terms in an OLS regression

If you suspect you have b), then most likely the trend is not going to be visible by eye and you aren’t going to be able to get much out of your data without a strong Bayesian causal model of the type you describe… but that’s well beyond an exploratory analysis.

• I should also say c) the maximum really is in the vicinity of where you think it is. So bad fit becomes indicator of a bad prior on xmax.

This kind of analysis is essentially incorporating a poor man’s Bayesian prior on the location of the maximum and the scale of x over which the effect occurs.

5. Why not a change point analysis? (I’ve never run one myself, but I continually see papers on them, including claims that they are now O(n log n) in estimation.)

6. I don’t think that there’s anything fundamentally wrong with taking a quadratic model as a starting point, if you *actually* think you might have a u-shaped distribution and want to test it. The main issue I see here has to do with model-checking. Just like other linear models, a quadratic model makes certain assumptions about e.g., error distributions that can (and should) be checked. I mean, I suppose you could just go around fitting polynomials to everything and never look at your model fit or report anything about your residuals, but that’s terrible practice for *any* statistical technique.

Rahul, numeric: A changepoint analysis would certainly work to “automate” the sort of piecewise regression model used in the linked blog; however, at that point you’re basically doing an exploratory analysis, and I would think you might as well use a GAM or GP model or something.

7. All other issues aside, it would seem that any statistical test should focus on the term (B1 + 2*B2*x) at values of x in the upper parts of its distribution. Even if the data generating mechanism can be modelled with a quadratic, the feature of interest is a negative slope at a particular point, not concavity.
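A sketch of that quantity, assuming the quadratic is fitted by ordinary least squares and the usual estimated covariance matrix of the coefficients is available. The symbols x_0 (the chosen evaluation point) and ŝ (the estimated slope there) are introduced here for illustration:

```latex
\hat{s}(x_0) = \hat{\beta}_1 + 2\hat{\beta}_2 x_0,
\qquad
\widehat{\operatorname{Var}}\bigl[\hat{s}(x_0)\bigr]
  = \widehat{\operatorname{Var}}(\hat{\beta}_1)
  + 4x_0^2\,\widehat{\operatorname{Var}}(\hat{\beta}_2)
  + 4x_0\,\widehat{\operatorname{Cov}}(\hat{\beta}_1,\hat{\beta}_2).
```

The slope is a linear combination of the two coefficients, so its standard error needs the covariance term as well as the two individual variances.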

With their proposed solution, it would seem like the sequence of thoughts might be:
1. “Hey, why am I taking the location of my cutpoint from the quadratic fit? Let’s make it random, determined by another modelled parameter.”
2. “Hey, one cut point is fairly restrictive, perhaps I need more to model my data better.”
3. “Hey, what I am doing is just a hack version of a non-parametric model, let’s just use a spline (or insert your favorite non-parametric technique).”

8. I had the same response to that blog post. And the statistically-minded folks I discussed it with all started talking about model comparisons between non-parametric regressions with and without a monotonicity restriction…

9. I guess in psychology they’re called “psychometricians”

A smallish, but notable, subset are called “mathematical psychologists” rather than “psychometricians.”

10. Shouldn’t talent be synonymous with winning games? Things are simple in single player sports. It’s unlikely that someone would say athlete A is more “talented” at running than athlete B if B wins most of the races. You might say chess player A was more “talented” than player B if A is prone to swing between brilliance and blundering.

Team sports (and other team activities) are difficult to evaluate as to the contributions of individual players. Part of the issue is an interaction effect: players A, B, and C might have similar marginal abilities, but A and B might play better together than B and C or A and C. Five ballhandling point guards or five towering centers do not a successful NBA team make.

For running a science lab, you probably want to make the same considerations for recruiting a half dozen postdocs as a general manager might make in drafting a sports team. Five brilliant bench technicians with world-class lab hands would not be the ideal composition to generate Science papers and NIH grants, nor would five systems computer scientists who can build Google-scale clusters in the lab basement.

• What you’re saying seems pretty similar to the psych paper, especially in terms of how they operationalized it. They used ESPN’s EWA formula as a proxy for ‘talent’, which is computed from ‘newspaper statistics’ (such as points, assists, rebounds). So one could interpret their finding as “there are diminishing returns to putting a bunch of stat hogs together.”

It would be interesting to substitute one of the newer measures of player value such as “NBA Real Plus Minus” (“RPM”). Anecdotal observation: on fivethirtyeight, Nate Silver used RPM to project 65 wins (79% of the regular season games) for the Cavaliers (following the acquisition of Kevin Love). The Cavaliers are currently under .500 (6-7).

• “there are diminishing returns to putting a bunch of stat hogs together.”

yes, there still is only one ball, and in basketball and soccer if you have the ball, I don’t.

Which is why it’s notable that the result doesn’t occur in baseball, perhaps because (1) baseball is batter by batter; the ‘one ball’ notion doesn’t really apply here, or (2) sabermetrics is better developed in baseball so talent is likely to be measured with smaller error.

• Yes, both reasons baseball is different are probably at work. Heh, even though there is only one ball in baseball, too!

The place where the one ball constraint applies in baseball is defense. I don’t know if the current state-of-the-art sabermetrics are sufficient to reveal the effect but I have seen it in softball: the more superb defensive players you put together, the fewer chances each of them will have to make a play. A great defensive team can reduce its opponent’s on base percentage from, say, 0.700 to 0.600. This reduces the expected number of opponent plate appearances in a 7 inning game from 70 down to 52.5. The actual reduction in expected plate appearances is even greater because of the increased probability that the great defensive team will win in an earlier inning by slaughter rule.
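The plate-appearance arithmetic can be checked with a short sketch. It assumes each plate appearance independently ends either with the batter reaching base (probability equal to the on base percentage) or with exactly one out, that no outs are made on the bases, and that the full game is played; the function name is hypothetical. Under those assumptions the expected number of plate appearances needed to record all the outs is outs / (1 - OBP).

```python
def expected_plate_appearances(obp, innings=7, outs_per_inning=3):
    """Expected opponent plate appearances needed to record every out.

    Assumes each plate appearance is an independent trial that yields an
    out with probability 1 - obp (mean of a negative binomial count).
    """
    outs = innings * outs_per_inning
    return outs / (1 - obp)

for obp in (0.700, 0.600):
    print(obp, round(expected_plate_appearances(obp), 1))
```

With 21 outs this gives 21 / 0.3 = 70 and 21 / 0.4 = 52.5, matching the figures in the comment. Double plays, runners thrown out, and games ended early would all lower the counts further.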

11. I think a horizontal asymptote is more likely than anything. At some point, there’s no more room for your win percentage to go up, no matter how much talent you add to the team (same with extremely unskilled players: eventually, you have no room to move down, and you’ll stop decreasing). My first impulse would be to fit some kind of logistic function.

• Another possibility is that talent is measured with error, so the teams that seem the most talented are somewhat less talented in reality.

• This (Corson’s point) seems very good. A reasonable model would be that the effect of talent (let’s accept the possibility that construct is reasonable for a moment) saturates, then either there is some noise in the data, or there is some other orthogonal thing you could have thought of but didn’t that ends up defining the small differences between teams. Either way, it looks like “talent” is the main thing, but then at the upper reaches, it fails to be.

Setting that aside, one could even imagine the whole description of the data being true… there is a breakpoint (the sign of the effect really does change)… but it’s nothing to do with the kind of theory most social psychologists would advance (in this case “status conflict”). Actually at a high level there might be a tradeoff, and front offices that spend too much time on recruitment of players could fail to fill out support staff (strength and skill coaches, physical therapy, etc etc). The idea that the thing a social psychologist would go to is “haha counterintuitive TOO MUCH TALENT cooperation problem” or something like that is pretty hopeless, although even I have to admit it’s possible that if you go dig in the noise you might find a tiny little slice of that effect. Sports teams are more than their players. Psychologists who care only for their little toy theories and not for the reality that produces the data they use are not going to do anything useful or serious. If this sounds harsh, it’s because I think it should be. It’s the product of sitting through more talks like this than I can stomach.

12. What if we just randomly choose (x,y) = (talent,performance) pairs and evaluate the slope all over the domain, and then plot the probability of obtaining a negative slope against talent level?

Then compare against the same probability under the assumption of Gaussian noise.
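One possible reading of the first half of this proposal, as a rough sketch: draw random pairs of observations, compute the slope of the line joining each pair, and tabulate the share of negative slopes by the pair's midpoint talent level. The function name, the binning by midpoint, and the toy data are assumptions made for illustration, and the comparison against a Gaussian-noise reference is not attempted here.

```python
import random

def negative_slope_share(xs, ys, n_bins=5, n_pairs=2000, seed=0):
    """Share of randomly drawn point pairs with negative slope, by midpoint bin."""
    rng = random.Random(seed)
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    neg = [0] * n_bins
    tot = [0] * n_bins
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(xs)), 2)
        if xs[i] == xs[j]:
            continue  # slope is undefined for tied x values
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        mid = (xs[i] + xs[j]) / 2
        b = min(int((mid - lo) / width), n_bins - 1)  # clamp the top edge
        tot[b] += 1
        neg[b] += slope < 0
    # None marks a bin that received no pairs.
    return [n / t if t else None for n, t in zip(neg, tot)]

# Hypothetical noiseless inverted U that peaks at x = 10.
xs = list(range(21))
ys = [-(x - 10) ** 2 for x in xs]
shares = negative_slope_share(xs, ys, n_bins=2)
print(shares)
```

On this toy curve no pair with a midpoint below the peak has a negative slope, while most pairs with a midpoint at or above the peak do. With noisy data the shares would move toward one half, which is where a reference distribution would be needed.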

13. Love the segmentation of the geeks!

If you ask N statisticians a methodological question, you can expect to get N-1 different answers most of which will address the question at issue, one way or another. Only if the questioner is doing a dissertation does it become a matter of right vs wrong…and even then there is usually enough controversy on even the simplest issue to produce ambiguity. It is more likely to be the case that the answers provided will fall on a continuum anchored between uncompromising academic/scientific at one end vs “practical” at the other. The quadratic model form falls at the latter end of that spectrum.

Borrowing an analogy from literary criticism, Harold Bloom in his book The Anxiety of Influence compared the adjudicating critical process to the survival of the fittest in the literary jungle where only the strongest, loudest, most confident yaps carry the day. From the point of view of comparative homologous structure, is it that different in the realm of stats?

Borrowing another analogy to barroom debates, once the discussants at the bar have resolved a hotly contested issue, a new drunk will walk in the door articulating the initial, incorrect premises and the whole process will start over again. The scientific method isn’t that different. Everybody relies on “evidence-based” ignorance in their opinions, an equilibrium in agreement is rarely found, questions aren’t answered so much as they just fade away with the influx of new questions.

14. I think this post would be 10^3 x better if you actually did the analysis you think makes more sense — more than the “polynomial” solution (which I would have thought was a pretty fair way to discipline an inference one could draw from observing raw data, assuming one had a good theory-based reason for expecting to see such a relationship; I think that’s what D.Lakeland is saying, no?) or the L/U alternative (which strikes me as having the obvious problems associated with splitting one’s data & trying to draw inferences from difference in effects observed in a small, underpowered slice…)

Can you ask L/U for the data set & either do the analyses yourself or upload them to a location where others who’ve suggested solutions — ones that seem interesting but also inadequately explained — might get at them?

But it’s more interesting & satisfying to learn how to participate in the insights of those who have figured out how to do things better.

• Dan:

Indeed, yes, I am working on this! In my “day job” as a statistician, I’m working with some students on procedures for routine nonlinear regression modeling using Gaussian processes, and I’m hoping that soon the method will be ready for problems such as this.

I also feel that there are some interesting social problems to look at, in this case the idea that here is a pair of statistically-savvy psychology researchers, who work at a university with a top statistics department, yet appear to only encounter people with a narrow approach to this particular statistics problem. That’s a social science puzzle in its own right, something I don’t mind thinking about during those times that I’m not working on statistics problems.

• > social science puzzle in its own right

On that note, when I was about to leave the University of Toronto (1997) I heard that meta-analysis was becoming a topic of interest in business. As I had been an MBA student and enjoyed the program I thought it would be nice to give a talk on meta-analysis at the MBA school before I left the city.

After some back and forth about how to avoid making the talk too advanced for their faculty I received an email that roughly stated “there is no way we are going to allow a statistician to come in and give a talk and make our faculty look like fools.”

15. Not being a statistician, I would be tempted to do a binning plot before doing anything else, for example plotting the mean of the target in each quantile against the quantile (maybe ten bins, or slightly more). This would at least give some indication if the regression function (crudely estimated in this way) looks at all quadratic. A poor man’s spline. If the function doesn’t look roughly quadratic, I’m not sure why it makes sense to try to fit a quadratic unless there is some other reason unrelated to this data set to believe the function is quadratic which I kind of doubt.
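The binning idea can be sketched in a few lines, assuming equal-count bins formed after sorting by x; the function name `binned_means` and the toy data are hypothetical. Each bin is summarized by its mean x and mean y, which can then be plotted or simply printed.

```python
from statistics import mean

def binned_means(xs, ys, n_bins=10):
    """Mean x and mean y within equal-count bins of x (a crude regression curve)."""
    pairs = sorted(zip(xs, ys))  # order observations by x
    n = len(pairs)
    out = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:  # skip empty bins when there are fewer points than bins
            out.append((mean(x for x, _ in chunk), mean(y for _, y in chunk)))
    return out

# Hypothetical data: an inverted U that peaks at x = 70.
xs = list(range(100))
ys = [-(x - 70) ** 2 for x in xs]
bins = binned_means(xs, ys)
for mx, my in bins:
    print(f"{mx:6.1f} {my:10.1f}")
```

Looking at where the bin means stop rising gives a quick check on whether a quadratic, or any particular shape, is plausible before fitting it.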

• Jeff:

I agree, but I think one issue is that psychologists are looking not just for exploration but for confirmation (that is, p-values), which points them toward more formal modes of inference.