
Random patterns in data yield random conclusions.

Bert Gunter points to this New York Times article, “How Exercise May Make Us Healthier: People who exercise have different proteins moving through their bloodstreams than those who are generally sedentary,” writing that it is “hyping a Journal of Applied Physiology paper that is now my personal record holder for most extensive conclusions from practically no data by using all possible statistical (appearing) methodology . . . I [Gunter] find it breathtaking that it got through peer review.”

OK, to dispose of that last issue first, I’ve seen enough crap published by PNAS and Lancet to never find it breathtaking that anything gets through peer review.

But let’s look at the research paper itself, “Habitual aerobic exercise and circulating proteomic patterns in healthy adults: relation to indicators of healthspan,” by Jessica Santos-Parker, Keli Santos-Parker, Matthew McQueen, Christopher Martens, and Douglas Seals, which reports:

In this exploratory study, we assessed the plasma proteome (SOMAscan proteomic assay; 1,129 proteins) of healthy sedentary or aerobic exercise-trained young women and young and older men (n = 47). Using weighted correlation network analysis to identify clusters of highly co-expressed proteins, we characterized 10 distinct plasma proteomic modules (patterns).

Here’s what they found:

In healthy young men and women, 4 modules were associated with aerobic exercise status and 1 with participant sex. In healthy young and older men, 5 modules differed with age, but 2 of these were partially preserved at young adult levels in older men who exercised; among all men, 4 modules were associated with exercise status, including 3 of the 4 identified in young adults.

Uh oh. This does sound like a mess.

On the plus side, the study is described right in the abstract as “exploratory.” On the minus side, the word “exploratory” is not in the title, nor did it make it into the news article. The journal article concludes as follows:

Overall, these findings provide initial insight into circulating proteomic patterns modulated by habitual aerobic exercise in healthy young and older adults, the biological processes involved, and the relation between proteomic patterns and clinical and physiological indicators of human healthspan.

I do think this is a bit too strong. The “initial” in “initial insight” corresponds to the study being exploratory, but it does not seem like enough of a caveat to me, especially considering that the preceding sentences (“We were able to characterize . . . Habitual exercise-associated proteomic patterns were related to biological pathways . . . Several of the exercise-related proteomic patterns were associated . . .”) had no qualifications and were written exactly how you’d write them if the results came from a preregistered study of 10,000 randomly sampled people rather than an uncontrolled study of 47 people who happened to answer an ad.

How to analyze the data better?

But enough about the reporting. Let’s talk about how this exploratory study should’ve been analyzed. Or, for that matter, how it can be analyzed, as the data are still there, right?

To start with, don’t throw away data. For example, “Outliers were identified as protein values ≥ 3 standard deviations from the mean and were removed.” Huh?

Also this: “Because of the exploratory nature of this study, significance for all subsequent analyses was set at an uncorrected α < 0.05.” This makes no sense. Look at everything. Don’t use an arbitrary threshold.

Also there’s some weird thing in which proteins were divided into 5 categories. It’s kind of a mess. To be honest, I’m not quite sure what should be done here. They’re looking at 1,129 different proteins, so some sort of structuring needs to be done. But I don’t think it makes sense to do the structuring based on this little dataset from 47 people. A lot must already be known about these proteins, right? So I think the right way to go would be to use some pre-existing structuring of the proteins, then present the correlations of interest in a grid, then maybe fit some sort of multilevel model; a rough sketch of that workflow is at the end of this post.

I fear that the analysis in the published paper is not so useful because it’s picking out a few random comparisons, and I’d guess that a replication study using the same methods would come up with results that are completely different.

Finally, I have no doubt that the subtitle of the news article, “People who exercise have different proteins moving through their bloodstreams than those who are generally sedentary,” is true, because any two groups of people will differ in all sorts of ways. I think the analysis as performed won’t help much in understanding these differences in the general population, but perhaps a multilevel model, along with more data, could give some insight.

P.S. Maybe the title of this post could be compressed to the following: Random in, random out.
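
P.P.S. Here’s a rough sketch, with fake data, of the workflow suggested above: take a pre-existing grouping of proteins into pathways, compute the correlation of each protein with exercise status, look at the grid of correlations by pathway, and then partially pool those per-protein correlations with a multilevel model. Everything here (the pathway labels, the number of proteins, the use of lmer) is made up for illustration; it’s not the paper’s analysis.

library(lme4)

set.seed(123)
n_people   <- 47
n_proteins <- 60                                       # small for illustration; the assay has 1,129
pathway    <- rep(paste0("pathway_", 1:6), each = 10)  # hypothetical pre-existing grouping
exercise   <- rbinom(n_people, 1, 0.5)                 # 1 = exerciser, 0 = sedentary

## fake protein measurements: pure noise here, so the grid should look flat
proteins <- matrix(rnorm(n_people * n_proteins), n_people, n_proteins)

## one correlation per protein, then the "grid" summarized by pathway
cors <- apply(proteins, 2, cor, y = exercise)
round(tapply(cors, pathway, mean), 2)

## partial pooling of the per-protein correlations by pathway
fit <- lmer(cors ~ 1 + (1 | pathway))
summary(fit)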


Jeff Lax writes:

I’m probably not the only one telling you about this Science story, but just in case.

The link points to a new research article reporting a failed replication of a study from 2008. The journal that published that now-questionable result refuses to consider publishing the replication attempt.

My reply:

I agree it’s a problem but it doesn’t surprise me. It’s pretty random what these tabloids publish, as they get so many submissions. Sure, they could’ve published this particular paper, but maybe there was something more exciting submitted to Science that week, maybe a new manuscript by Michael Lacour?

Causal inference: I recommend the classical approach in which an observational study is understood in reference to a hypothetical controlled experiment

Amy Cohen asked me what I thought of this article, “Control of Confounding and Reporting of Results in Causal Inference Studies: Guidance for Authors from Editors of Respiratory, Sleep, and Critical Care Journals,” by David Lederer et al.

I replied that I liked some of their recommendations (downplaying p-values, graphing raw data, presenting results clearly), and I’m supportive of their general goal of providing statistical advice for practitioners, but I was less happy about their recommendations for causal inference, which focused on taking observational data and drawing causal graphs. Also I don’t think their phrase “causal association” has any useful meaning. A statement such as “Causal inference is the examination of causal associations to estimate the causal effect of an exposure on an outcome” looks pretty circular to me.

When it comes to causal inference, I prefer a more classical approach in which an observational study is understood in reference to a hypothetical controlled experiment.

I also think that the discussion of causal inference in the paper is misguided in part because of the authors’ non-quantitative approach. For example, they consider a hypothetical study estimating the effect of exercise on lung cancer and they say that “Controlling for ‘smoking’ will close the back-door path.” First off, given the effects of smoking on lung cancer, “controlling for smoking” won’t do the job at all, unless this is some incredibly precise model with smoking very well measured. The trouble is that the effect of smoking on lung cancer is so large that any biases in this measurement could easily overwhelm the effect they’d be trying to estimate. And this sort of thing comes up a lot in public health studies. Second, you’d need to control for lots of things, not just smoking. This example illustrates how I don’t see the point of all their discussion of colliders. If we instead simply take the classical approach, we’d start with a hypothetical controlled study of exercise on lung cancer, a randomized prospective study in which the experimenter assigns exercise levels to patients, who are then followed up, etc., then we move to the observational study and consider pre-treatment differences between people with different exercise levels. This makes it clear that there’s no “back-door path”; there are just differences between the groups, differences that you’d like to adjust for in the design and analysis of the study.

Also I fear that this passage in the linked article could be misleading: “Causal inference studies require a clearly articulated hypothesis, careful attention to minimizing selection and information bias, and a deliberate and rigorous plan to control confounding. The latter is addressed in detail later in this document. Prediction models are fundamentally different than those used for causal inference. Prediction models use individual-level data (predictors) to estimate (predict) the value of an outcome. . . ” This seems misleading to me in that a good prediction study also requires a clearly articulated hypothesis, careful attention to minimizing selection and information bias, and a deliberate and rigorous plan to control confounding.

The point is that, once you’re concerned about out-of-sample (rather than within-sample) prediction, all these issues of measurement, selection, confounding, etc. arise. Also, a causal model is a special case of a predictive model where the prediction is conditional on some treatment being applied. So I think it’s a mistake to think of causal and predictive inference as being two different things.

The publication asymmetry: What happens if the New England Journal of Medicine publishes something that you think is wrong?

After reading my news article on the replication crisis, retired cardiac surgeon Gerald Weinstein wrote:

I have long been disappointed by the quality of research articles written by people and published by editors who should know better. Previously, I had published two articles on experimental design written with your colleague Bruce Levin [of the Columbia University biostatistics department]:

Weinstein GS and Levin B: The coronary artery surgery study (CASS): a critical appraisal. J. Thorac. Cardiovasc. Surg. 1985;90:541-548.

Weinstein GS and Levin B: The effect of crossover on the statistical power of randomized studies. Ann. Thorac. Surg. 1989;48:490-495.

I [Weinstein] would like to point out some additional problems with such studies in the hope that you could address them in some future essays. I am focusing on one recent article in the New England Journal of Medicine because it is typical of so many other clinical studies:

Alirocumab and Cardiovascular Outcomes after Acute Coronary Syndrome

November 7, 2018 DOI: 10.1056/NEJMoa1801174

BACKGROUND

Patients who have had an acute coronary syndrome are at high risk for recurrent ischemic cardiovascular events. We sought to determine whether alirocumab, a human monoclonal antibody to proprotein convertase subtilisin–kexin type 9 (PCSK9), would improve cardiovascular outcomes after an acute coronary syndrome in patients receiving high-intensity statin therapy.

METHODS

We conducted a multicenter, randomized, double-blind, placebo-controlled trial involving 18,924 patients who had an acute coronary syndrome 1 to 12 months earlier, had a low-density lipoprotein (LDL) cholesterol level of at least 70 mg per deciliter (1.8 mmol per liter), a non−high-density lipoprotein cholesterol level of at least 100 mg per deciliter (2.6 mmol per liter), or an apolipoprotein B level of at least 80 mg per deciliter, and were receiving statin therapy at a high-intensity dose or at the maximum tolerated dose. Patients were randomly assigned to receive alirocumab subcutaneously at a dose of 75 mg (9462 patients) or matching placebo (9462 patients) every 2 weeks. The dose of alirocumab was adjusted under blinded conditions to target an LDL cholesterol level of 25 to 50 mg per deciliter (0.6 to 1.3 mmol per liter). “The primary end point was a composite of death from coronary heart disease, nonfatal myocardial infarction, fatal or nonfatal ischemic stroke, or unstable angina requiring hospitalization.”

RESULTS

The median duration of follow-up was 2.8 years. A composite primary end-point event occurred in 903 patients (9.5%) in the alirocumab group and in 1052 patients (11.1%) in the placebo group (hazard ratio, 0.85; 95% confidence interval [CI], 0.78 to 0.93; P<0.001). A total of 334 patients (3.5%) in the alirocumab group and 392 patients (4.1%) in the placebo group died (hazard ratio, 0.85; 95% CI, 0.73 to 0.98). The absolute benefit of alirocumab with respect to the composite primary end point was greater among patients who had a baseline LDL cholesterol level of 100 mg or more per deciliter than among patients who had a lower baseline level. The incidence of adverse events was similar in the two groups, with the exception of local injection-site reactions (3.8% in the alirocumab group vs. 2.1% in the placebo group).

Here are some major problems I [Weinstein] have found in this study:

1. Misleading terminology: the “primary composite endpoint.” Many drug studies, such as those concerning PCSK9 inhibitors (which are supposed to lower LDL or “bad” cholesterol) use the term “primary endpoint” which is actually “a composite of death from coronary heart disease, nonfatal myocardial infarction, fatal or nonfatal ischemic stroke, or unstable angina requiring hospitalization.” [Emphasis added]

Obviously, a “composite primary endpoint” is an oxymoron (which of the primary colors are composites?) but, worse, the term is so broad that it casts doubt on any conclusions drawn. For example, stroke is generally an embolic phenomenon and may be caused by atherosclerosis, but also may be due to atrial fibrillation in at least 15% of cases. Including stroke in the “primary composite endpoint” is misleading, at best.

By casting such a broad net, the investigators seem to be seeking evidence from any of the four elements in the so-called primary endpoint. Instead of being specific as to which types of events are prevented, the composite primary endpoint obscures the clinical benefit.

2. The use of relative risks, odds ratios, or hazard ratios to obscure clinically insignificant absolute differences. “A composite primary end-point event occurred in 903 patients (9.5%) in the alirocumab group and in 1052 patients (11.1%) in the placebo group.” This is an absolute difference of only 1.6%. Such small differences are unlikely to be clinically important, or even replicated in subsequent studies, yet the authors obscure this fact by citing hazard ratios. Only in a supplemental appendix (available online) does this become apparent. Note the enlarged and prominently displayed hazard ratio, drawing attention away from the almost nonexistent difference in event rates (and the lack of error bars). Of course, when the absolute differences are small, the ratio of two small numbers can be misleadingly large.

I am concerned because this type of thing is appearing more and more frequently. Minimally effective drugs are being promoted at great expense, and investigators are unthinkingly adopting questionable methods in search of new treatments. No wonder they can’t be repeated.

I suggested to Weinstein that he write a letter to the journal, and he replied:

Unfortunately, the New England Journal of Medicine has a strict 175-word limit on letters to the editor.

In addition, they have not been very receptive to my previous submissions. Today they rejected my short letter on an article that reached a conclusion that was the opposite of the data due to a similar category error, even though I kept it within that word limit.

“I am sorry that we will not be able to publish your recent letter to the editor regarding the Perner article of 06-Dec-2018. The space available for correspondence is very limited, and we must use our judgment to present a representative selection of the material received.” Of course, they have the space to publish articles that are false on their face.

Here is the letter they rejected:

Re: Pantoprazole in Patients at Risk for Gastrointestinal Bleeding in the ICU

(December 6, 2018 N Engl J Med 2018; 379:2199-2208)

This article appears to reach an erroneous conclusion based on its own data. The study implies that pantoprazole is ineffective in preventing GI bleeding in ICU patients when, in fact, the results show that it is effective.

The purpose of the study was to evaluate the effectiveness of pantoprazole in preventing GI bleeding. Instead, the abstract shifts gears and uses death within 90 days as the primary endpoint and the Results section focuses on “at least one clinically important event (a composite of clinically important gastrointestinal bleeding, pneumonia, Clostridium difficile infection, or myocardial ischemia).” For mortality and for the composite “clinically important event,” relative risks, confidence intervals and p-values are given, indicating no significant difference between pantoprazole and control, but a p-value was not provided for GI bleeding, which is the real primary endpoint, even though “In the pantoprazole group, 2.5% of patients had clinically important gastrointestinal bleeding, as compared with 4.2% in the placebo group.” According to my calculations, the chi-square value is 7.23, with a p-value of 0.0072, indicating that pantoprazole is effective at the p<0.05 level in decreasing gastrointestinal bleeding in ICU patients. [emphasis added]

My concern is that clinicians may be misled into believing that pantoprazole is not effective in preventing GI bleeding in ICU patients when the study indicates that it is, in fact, effective.

This sort of mislabeling of end-points is now commonplace in many medical journals. I am hoping you can shed some light on this. Perhaps you might be able to get the NY Times or the NEJM to publish an essay by you on this subject, as I believe the quality of medical publications is suffering from this practice.

I have no idea. I’m a bit intimidated by medical research with all its specialized measurements and models. So I don’t think I’m the right person to write this essay; indeed I haven’t even put in the work to evaluate Weinstein’s claims above.

But I do think they’re worth sharing, just because there is this “publication asymmetry” in which, once something appears in print, especially in a prestigious journal, it becomes very difficult to criticize (except in certain cases when there’s a lot of money, politics, or publicity involved).

We’re done with our Applied Regression final exam (and solution to question 15)

We’re done with our exam.

And the solution to question 15:

15. Consider the following procedure.

• Set n = 100 and draw n continuous values x_i uniformly distributed between 0 and 10. Then simulate data from the model y_i = a + bx_i + error_i, for i = 1,…,n, with a = 2, b = 3, and independent errors from a normal distribution.

• Regress y on x. Look at the median and mad sd of b. Check to see if the interval formed by the median ± 2 mad sd includes the true value, b = 3.

• Repeat the above two steps 1000 times.

(a) True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

(b) Same as above, except the error distribution is bimodal, not normal. True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

Both (a) and (b) are true.

(a) is true because everything’s approximately normally distributed so you’d expect a 95% chance for an estimate +/- 2 se’s to contain the true value. In real life we’re concerned with model violations, but here it’s all simulated data so no worries about bias. And n=100 is large enough that we don’t have to worry about the t rather than normal distribution. (Actually, even if n were pretty small, we’d be doing ok with estimates +/- 2 sd’s because we’re using the mad sd which gets wider when the t degrees of freedom are low.)

And (b) is true too because of the central limit theorem. Switching from a normal to a bimodal distribution will affect predictions for individual cases but it will have essentially no effect on the distribution of the estimate, which is an average from 100 data points.
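
If you want to check this by simulation, here’s a minimal sketch. For speed it uses lm(), with the least-squares estimate and standard error standing in for the median and mad sd that a Bayesian fit would report, and it has to pick specific error distributions since the problem doesn’t (sd-1 normal errors for (a), a 50/50 mixture of two normals for (b)); the coverage logic is the same.

set.seed(123)
n_sims <- 1000
n <- 100
covered_normal  <- logical(n_sims)
covered_bimodal <- logical(n_sims)
for (s in 1:n_sims) {
  x <- runif(n, 0, 10)
  ## (a) normal errors
  y <- 2 + 3*x + rnorm(n, 0, 1)
  fit <- lm(y ~ x)
  b_hat <- coef(fit)["x"]
  b_se  <- summary(fit)$coefficients["x", "Std. Error"]
  covered_normal[s] <- abs(b_hat - 3) < 2*b_se
  ## (b) bimodal errors: 50/50 mixture of normals centered at -2 and 2
  err <- ifelse(rbinom(n, 1, 0.5) == 1, rnorm(n, -2, 0.5), rnorm(n, 2, 0.5))
  y2 <- 2 + 3*x + err
  fit2 <- lm(y2 ~ x)
  b_hat2 <- coef(fit2)["x"]
  b_se2  <- summary(fit2)$coefficients["x", "Std. Error"]
  covered_bimodal[s] <- abs(b_hat2 - 3) < 2*b_se2
}
mean(covered_normal)   # close to 0.95
mean(covered_bimodal)  # also close to 0.95, per the central limit theorem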

Common mistakes

Most of the students got (a) correct but not (b). I guess I have to bang even harder on the relative unimportance of the error distribution (except when the goal is predicting individual cases).

Algorithmic bias and social bias

The “algorithmic bias” that concerns me is not so much a bias in an algorithm, but rather a social bias resulting from the demand for, and expectation of, certainty.

Pharmacometrics meeting in Paris on the afternoon of 11 July 2019

Julie Bertrand writes:

The pharmacometrics group led by France Mentre (IAME, INSERM, Univ Paris) is very pleased to host a free ISoP Statistics and Pharmacometrics (SxP) SIG local event at Faculté Bichat, 16 rue Henri Huchard, 75018 Paris, on Thursday afternoon the 11th of July 2019.

It will feature talks from Professor Andrew Gelman, Columbia Univ (We’ve Got More Than One Model: Evaluating, comparing, and extending Bayesian predictions) and Professor Rob Bies, Univ of Buffalo (A hybrid genetic algorithm for NONMEM structural model optimization).

We welcome all of you (please register here). Registration is capped at 70 attendees.

If you would like to present some of your work (related to SxP), please contact us by July 1, 2019. Send a title and short abstract (julie.bertrand@inserm.fr).

Question 15 of our Applied Regression final exam (and solution to question 14)

Here’s question 15 of our exam:

15. Consider the following procedure.

• Set n = 100 and draw n continuous values x_i uniformly distributed between 0 and 10. Then simulate data from the model y_i = a + bx_i + error_i, for i = 1,…,n, with a = 2, b = 3, and independent errors from a normal distribution.

• Regress y on x. Look at the median and mad sd of b. Check to see if the interval formed by the median ± 2 mad sd includes the true value, b = 3.

• Repeat the above two steps 1000 times.

(a) True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

(b) Same as above, except the error distribution is bimodal, not normal. True or false: You would expect the interval to contain the true value approximately 950 times. Explain your answer (in one sentence).

And the solution to question 14:

14. You are predicting whether a student passes a class given pre-test score. The fitted model is, Pr(Pass) = logit^−1(a_j + 0.1x),
for a student in classroom j whose pre-test score is x. The pre-test scores range from 0 to 50. The a_j’s are estimated to have a normal distribution with mean 1 and standard deviation 2.

(a) Draw the fitted curve Pr(Pass) given x, for students in an average classroom.

(b) Draw the fitted curve for students in a classroom at the 25th and the 75th percentile of classrooms.

(a) For an average classroom, the curve is invlogit(1 + 0.1x), so it goes through the 50% point at x = -10. So the easiest way to draw the curve is to extend it outside the range of the data. But in the graph, the x-axis should go from 0 to 50. Recall that invlogit(5) = 0.99, so the probability of passing reaches 99% when x reaches 40. From all this information, you can draw the curve.

(b) The 25th and 75th percentiles of the normal distribution are at the mean +/- 0.67 standard deviations. Thus, the 25th and 75th percentiles of the intercepts are 1 +/- 0.67*2, or -0.34 and 2.34, so the curves to draw are invlogit(-0.34 + 0.1x) and invlogit(2.34 + 0.1x). These are just shifted versions of the curve from (a), shifted by 1.34/0.1 = 13.4 to the left and the right.
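
If you want to check the sketch in R, plogis() is the inverse logit; here are the three curves over the 0-to-50 range of the pre-test scores:

curve(plogis(1 + 0.1*x), from = 0, to = 50, ylim = c(0, 1),
      xlab = "pre-test score", ylab = "Pr(Pass)")
curve(plogis(-0.34 + 0.1*x), from = 0, to = 50, add = TRUE, lty = 2)  # 25th percentile classroom
curve(plogis(2.34 + 0.1*x), from = 0, to = 50, add = TRUE, lty = 2)   # 75th percentile classroom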

Common mistakes

Students didn’t always use the range of x. The most common bad answer was to just draw a logistic curve and then put some numbers on the axes.

A key lesson that I had not conveyed well in class: draw and label the axes first, then draw the curve.

Question 14 of our Applied Regression final exam (and solution to question 13)

Here’s question 14 of our exam:

14. You are predicting whether a student passes a class given pre-test score. The fitted model is, Pr(Pass) = logit^−1(a_j + 0.1x),
for a student in classroom j whose pre-test score is x. The pre-test scores range from 0 to 50. The a_j’s are estimated to have a normal distribution with mean 1 and standard deviation 2.

(a) Draw the fitted curve Pr(Pass) given x, for students in an average classroom.

(b) Draw the fitted curve for students in a classroom at the 25th and the 75th percentile of classrooms.

And the solution to question 13:

13. You fit a model of the form: y ∼ x + u_full + (1 | group). The estimated coefficients are 2.5, 0.7, and 0.5 respectively for the intercept, x, and u_full, with group and individual residual standard deviations estimated as 2.0 and 3.0 respectively. Write the above model as
y_i = a_j[i] + b x_i + ε_i
a_j = A + B u_j + η_j.

(a) Give the estimates of b, A, and B together with the estimated distributions of the error terms.

(b) Ignoring uncertainty in the parameter estimates, give the predictive standard deviation for a new observation in an existing group and for a new observation in a new group.

(a) The estimates of b, A, and B are 0.7, 2.5, and 0.5, respectively, and the estimated distributions are ε ~ normal(0, 3.0) and η ~ normal(0, 2.0).

(b) 3.0 and sqrt(3.0^2 + 2.0^2) = 3.6.
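
A quick simulation, using the estimated sds from above, confirms the two predictive standard deviations:

sigma_y <- 3.0   # individual-level residual sd
sigma_a <- 2.0   # group-level sd
n_sims <- 1e5
sd(rnorm(n_sims, 0, sigma_y))                              # new obs, existing group: about 3.0
sd(rnorm(n_sims, 0, sigma_a) + rnorm(n_sims, 0, sigma_y))  # new obs, new group: about sqrt(13) = 3.6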

Common mistakes

Almost everyone got part (a) correct, and most people got (b) also, but there was some confusion about the uncertainty for a new observation in a new group.

Naomi Wolf and David Brooks

Palko makes a good point:

Parul Sehgal has a devastating review of the latest from Naomi Wolf, but while Sehgal is being justly praised for her sharp and relentless treatment of her subject, she stops short before she gets to the most disturbing and important implication of the story.

There’s an excellent case made here that Wolf’s career should have collapsed long ago under the weight of her contradictions and factual errors, but the question of responsibility, of how enablers have sustained that career and how many other journalistic all-stars owe their successes to the turning of blind eyes, goes unexamined.

For example, Sehgal’s review ran in the New York Times. One of, if not the most prominent voice of that paper is David Brooks. . . .

Really these columnists should just stick to football writing, where it’s enough just to be entertaining, and accuracy and consistency don’t matter so much.

Question 13 of our Applied Regression final exam (and solution to question 12)

Here’s question 13 of our exam:

13. You fit a model of the form: y ∼ x + u_full + (1 | group). The estimated coefficients are 2.5, 0.7, and 0.5 respectively for the intercept, x, and u_full, with group and individual residual standard deviations estimated as 2.0 and 3.0 respectively. Write the above model as
y_i = a_j[i] + b x_i + ε_i
a_j = A + B u_j + η_j.

(a) Give the estimates of b, A, and B together with the estimated distributions of the error terms.

(b) Ignoring uncertainty in the parameter estimates, give the predictive standard deviation for a new observation in an existing group and for a new observation in a new group.

And the solution to question 12:

12. In the regression above, suppose you replaced height in inches by height in centimeters. What would then be the intercept and slope of the regression? (One inch is 2.54 centimeters.)

The intercept remains the same at -21.51, and the slope is divided by 2.54, so it becomes 0.28/2.54 = 0.11: a change of one centimeter corresponds to 1/2.54 inches.
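
One way to check the answer is to simulate data from the question-11 model and refit with height in centimeters (the heights and sample size below are made up; only the rescaling matters):

set.seed(123)
height_in <- runif(2000, 60, 78)
heavy <- rbinom(2000, 1, plogis(-21.51 + 0.28*height_in))
height_cm <- 2.54 * height_in
coef(glm(heavy ~ height_in, family = binomial(link = "logit")))
coef(glm(heavy ~ height_cm, family = binomial(link = "logit")))  # same intercept, slope divided by 2.54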

Common mistakes

About half the students got the right answer here; the others guessed in various ways: some changed the intercept as well as the slope, some multiplied by 2.54 instead of dividing. The problem’s trickier than it looks, and to make sure my answer was correct I had to think twice and check that I wasn’t getting it backward.

I’m not quite sure how I could teach the class better so that students would get this question right. Except of course by drilling them, giving them lots of homeworks covering this material.

Question 12 of our Applied Regression final exam (and solution to question 11)

Here’s question 12 of our exam:

12. In the regression above, suppose you replaced height in inches by height in centimeters. What would then be the intercept and slope of the regression? (One inch is 2.54 centimeters.)

And the solution to question 11:

11. We defined a new variable based on weight (in pounds):

heavy <- weight>200

and then ran a logistic regression, predicting “heavy” from height (in inches):

glm(formula = heavy ~ height, family = binomial(link = "logit"))
            coef.est coef.se
(Intercept) -21.51     1.60
height        0.28     0.02
---
  n = 1984, k = 2

(a) Graph the logistic regression curve (the probability that someone is heavy) over the approximate range of the data. Be clear where the line goes through the 50% probability point.

(b) Fill in the blank: near the 50% point, comparing two people who differ by one inch in height, you’ll expect a difference of ____ in the probability of being heavy.

(a) The x-axis should range from approximately 60 to 80 (most people have heights between 60 and 80 inches), and the y-axis should range from 0 to 1. The easiest way to draw the logistic regression curve is to first figure out where it goes through 0.5. That’s when the linear predictor equals 0, thus -21.51 + 0.28*x = 0, so x = 21.51/0.28 ≈ 77. Then at that point the line has slope 0.07 (remember the divide-by-4 rule), and that will be enough to get something pretty close to the fitted curve.

(b) As just noted, the divide-by-4 rule gives us an answer of 0.07.
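
In R, the fitted curve over the range of the data can be drawn directly (plogis() is the inverse logit):

curve(plogis(-21.51 + 0.28*x), from = 60, to = 80, ylim = c(0, 1),
      xlab = "height (inches)", ylab = "Pr(heavy)")
abline(h = 0.5, v = 21.51/0.28, lty = 2)   # 50% point at about 77 inches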

Common mistakes

In making the graphs, most of the students didn’t think about the range of x, for example having x go from 0 to 100, which doesn’t make sense as there aren’t any people close to 0 or 100 inches tall. To demand that the range of the curve fit the range of the data is not just being picky: it changes the entire interpretation of the fitted model because changing the range of x also changes the range of probabilities on the y-axis.

How statistics is used to crush (scientific) dissent.

Lakeland writes:

When we interpret powerful as political power, I think it’s clear that Classical Statistics has the most political power, that is, the power to get people to believe things and change policy or alter funding decisions etc… Today Bayes is questioned at every turn, and ridiculed for being “subjective” with a focus on the prior, or modeling “belief”. People in current power to make decisions about resources etc are predominantly users of Classical type methods (hypothesis testing, straw man NHST specifically, and to a lesser extent maximum likelihood fitting and in econ Difference In Difference analysis and synthetic controls and robust standard errors and etc all based on sampling theory typically without mechanistic models…).

The alternative is hard: model mechanisms directly, use Bayes to constrain the model to the reasonable range of applicability, and do a lot of computing to get fitted results that are difficult for anyone without a lot of Bayesian background to understand, and that specifically make a lot of assumptions and choices that are easy to question. It’s hard to argue against “model free inference procedures” that “guarantee unbiased estimates of causal effects” and etc. But it’s easy to argue that some specific structural assumption might be wrong and therefore the result of a Bayesian analysis might not hold…

So from a political perspective, I see Classical Stats as it’s applied in many areas as a way to try to wield power to crush dissent.

My reply:

Yup. But the funny thing is that I think that a lot of the people doing bad science also feel that they’re being pounded by classical statistics.

It goes like this:
– Researcher X has an idea for an experiment.
– X does the experiment and gathers data, would love to publish.
– Because of the annoying hegemony of classical statistics, X needs to do a zillion analyses to find statistical significance.
– Publication! NPR! Gladwell! Freakonomics, etc.
– Methodologist Y points to problems with the statistical analysis, the nominal p-values aren’t correct, etc.
– X is angry: first the statistical establishment required statistical significance, now the statistical establishment is saying that statistical significance isn’t good enough.
– From Researcher X’s point of view, statistics is being used to crush new ideas and it’s being used to force creative science into narrow conventional pathways.

This is a narrative that’s held by some people who detest me (and, no, I’m not Methodologist Y; this might be Greg Francis or Uri Simonsohn or all sorts of people). There’s some truth to the narrative, which is one thing that makes things complicated.

Question 11 of our Applied Regression final exam (and solution to question 10)

Here’s question 11 of our exam:

11. We defined a new variable based on weight (in pounds):

heavy <- weight>200

and then ran a logistic regression, predicting “heavy” from height (in inches):

glm(formula = heavy ~ height, family = binomial(link = "logit"))
            coef.est coef.se
(Intercept) -21.51     1.60
height        0.28     0.02
---
  n = 1984, k = 2

(a) Graph the logistic regression curve (the probability that someone is heavy) over the approximate range of the data. Be clear where the line goes through the 50% probability point.

(b) Fill in the blank: near the 50% point, comparing two people who differ by one inch in height, you’ll expect a difference of ____ in the probability of being heavy.

And the solution to question 10:

10. For the above example, we then created indicator variables, age18_29, age30_44, age45_64, and age65up, for four age categories. We then fit a new regression:

lm(formula = weight ~ age30_44 + age45_64 + age65up)
             coef.est coef.se
(Intercept)     157.2     5.4
age30_44TRUE     19.1     7.0
age45_64TRUE     27.2     7.6
age65upTRUE       8.5     8.7
  n = 2009, k = 4
  residual sd = 119.4, R-Squared = 0.01

Make a graph of weight versus age (that is, weight in pounds on y-axis, age in years on x-axis) and draw the fitted regression model. Again, this graph should be consistent with the above computer output.

The graph of weight vs. age should be identical to that in the previous problem. Fitting a new model does not change the data. The fitted regression model is four horizontal lines: a line from x=18 to x=30 at the level 157.2, a line from x=30 to x=45 at the level 157.2 + 19.1, a line from x=45 to x=65 at the level 157.2 + 27.2, and a line from x=65 to x=90 at the level 157.2 + 8.5.
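
In R, the fitted step function can be drawn with four horizontal segments:

plot(NULL, xlim = c(18, 90), ylim = c(100, 300),
     xlab = "age (years)", ylab = "weight (pounds)")
segments(18, 157.2,        30, 157.2)         # age 18-29
segments(30, 157.2 + 19.1, 45, 157.2 + 19.1)  # age 30-44
segments(45, 157.2 + 27.2, 65, 157.2 + 27.2)  # age 45-64
segments(65, 157.2 +  8.5, 90, 157.2 +  8.5)  # age 65 and up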

Common mistakes

The biggest mistake was discretizing x. Another common mistake was to draw the regression line and then draw the dots relative to the line, so that there were jumps in the underlying data at the cutpoints in the model. Also, as in the previous problem, when students drew the dots, they almost always didn’t include enough vertical spread, and their scatterplots looked nothing like what the data might look like. Again, when teaching, I need to clarify the distinction between the model and the data.

Question 10 of our Applied Regression final exam (and solution to question 9)

Here’s question 10 of our exam:

10. For the above example, we then created indicator variables, age18_29, age30_44, age45_64, and age65up, for four age categories. We then fit a new regression:

lm(formula = weight ~ age30_44 + age45_64 + age65up)
             coef.est coef.se
(Intercept)     157.2     5.4
age30_44TRUE     19.1     7.0
age45_64TRUE     27.2     7.6
age65upTRUE       8.5     8.7
  n = 2009, k = 4
  residual sd = 119.4, R-Squared = 0.01

Make a graph of weight versus age (that is, weight in pounds on y-axis, age in years on x-axis) and draw the fitted regression model. Again, this graph should be consistent with the above computer output.

And the solution to question 9:

9. We downloaded data with weight (in pounds) and age (in years) from a random sample of American adults. We created a new variable, age10 = age/10. We then fit a regression:

lm(formula = weight ~ age10)
            coef.est coef.se
(Intercept)    161.0     7.3
age10            2.6     1.6
  n = 2009, k = 2
  residual sd = 119.7, R-Squared = 0.00

Make a graph of weight versus age (that is, weight in pounds on y-axis, age in years on x-axis). Label the axes appropriately, draw the fitted regression line, and make a scatterplot of a bunch of points consistent with the information given and with ages ranging roughly uniformly between 18 and 90.

The x-axis should go from 18 to 90 (or from 0 to 90), and the y-axis should go from approximately 100 to 300 (or from 0 to 300). It’s easy enough to draw the regression line, as the intercept and slope are right there. The scatterplot should have enough vertical spread to be consistent with a residual sd of 120. Recall that approximately 2/3 of the points should fall within +/- 1 sd of the regression line in vertical distance.
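
Here’s one way to sketch such a scatterplot in R, simulating fake data consistent with the fitted model (uniform ages, the fitted line, and a residual sd of about 120):

set.seed(123)
age    <- runif(2009, 18, 90)
weight <- 161.0 + 2.6*(age/10) + rnorm(2009, 0, 119.7)
plot(age, weight, xlab = "age (years)", ylab = "weight (pounds)")
abline(161.0, 2.6/10)   # the slope per year is the age10 slope divided by 10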

Common mistakes

Everyone could draw the regression line; nearly nobody could draw a good scatterplot. Typical scatterplots were very tightly clustered around the regression line, not at all consistent with a residual sd of 120 and an R-squared of essentially zero.

I guess we should have more assignments where students draw scatterplots and sketch possible data.

Question 9 of our Applied Regression final exam (and solution to question 8)

Here’s question 9 of our exam:

9. We downloaded data with weight (in pounds) and age (in years) from a random sample of American adults. We created a new variable, age10 = age/10. We then fit a regression:

lm(formula = weight ~ age10)
            coef.est coef.se
(Intercept)    161.0     7.3
age10            2.6     1.6
  n = 2009, k = 2
  residual sd = 119.7, R-Squared = 0.00

Make a graph of weight versus age (that is, weight in pounds on y-axis, age in years on x-axis). Label the axes appropriately, draw the fitted regression line, and make a scatterplot of a bunch of points consistent with the information given and with ages ranging roughly uniformly between 18 and 90.

And the solution to question 8:

8. Out of a random sample of 50 Americans, zero report having ever held political office. From this information, give a 95% confidence interval for the proportion of Americans who have ever held political office.

This is a job for the Agresti-Coull interval. y* = y + 2, n* = n + 4, p* = y*/n* = 2/54 = 0.037, with standard error sqrt(p*(1-p*)/n*) = sqrt((2/54)*(52/54)/54) = 0.026. Estimate is [p* +/- 2se] = [-0.014, 0.088], but the probability can’t be negative, so [0, 0.088] or simply [0, 0.09].
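
The same calculation in R:

y <- 0; n <- 50
y_star <- y + 2
n_star <- n + 4
p_star <- y_star/n_star                   # 0.037
se <- sqrt(p_star*(1 - p_star)/n_star)    # 0.026
c(max(0, p_star - 2*se), p_star + 2*se)   # [0, 0.088], truncating at zero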

Common mistakes

Most of the students remembered the Agresti-Coull interval, but some made the mistake of giving confidence intervals that excluded zero (which can’t be right, given that the data are 0/50) or that included negative values.

Question 8 of our Applied Regression final exam (and solution to question 7)

Here’s question 8 of our exam:

8. Out of a random sample of 50 Americans, zero report having ever held political office. From this information, give a 95% confidence interval for the proportion of Americans who have ever held political office.

And the solution to question 7:

7. You conduct an experiment in which some people get a special get-out-the-vote message and others do not. Then you follow up with a sample, after the election, to see if they voted. If you follow up with 500 people, how large an effect would you be able to detect so that, if the result had the expected outcome, the observed difference would be statistically significant?

Assume 250 got the treatment and 250 got the control. Then the standard error of the estimated treatment effect is sqrt(0.5^2/250 + 0.5^2/250) = 0.045. An estimate is statistically significant if it is at least 2 standard errors from 0, so the answer to the question is 0.09, an effect of 9 percentage points.
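
The same calculation in R, again assuming 250 in each arm and using p = 0.5 for the binomial sd:

se_diff <- sqrt(0.5^2/250 + 0.5^2/250)   # 0.045
2*se_diff                                # 0.089, i.e., about 9 percentage points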

Common mistakes

Most of the students couldn’t handle this one. One problem was forgivable: I didn’t actually say that half the people got the treatment and half got the control. I guess I should’ve made that clear in the statement of the problem.

But that wasn’t the only issue. Many of the students weren’t clear on how to get started on this one. One key point is that you can plug p=0.5 into the sqrt(p*(1-p)/n) formula.

Question 7 of our Applied Regression final exam (and solution to question 6)

Here’s question 7 of our exam:

7. You conduct an experiment in which some people get a special get-out-the-vote message and others do not. Then you follow up with a sample, after the election, to see if they voted. If you follow up with 500 people, how large an effect would you be able to detect so that, if the result had the expected outcome, the observed difference would be statistically significant?

And the solution to question 6:

6. You are applying hierarchical logistic regression on a survey of 1500 people to estimate support for a federal jobs program. The model is fit using, as a state-level predictor, the Republican presidential vote in the state. Which of the following two statements is basically true?

(a) Adding a predictor specifically for this model (for example, state-level unemployment) could improve the estimates of state-level opinion.

(b) It would not be appropriate to add a predictor such as state-level unemployment: by adding such a predictor to the model, you would essentially be assuming what you are trying to prove.

Briefly explain your answer in one to two sentences.

(a) is true, (b) is false. The problem is purely predictive, and adding a good predictor should help (on average; sure, you could find individual examples where it would make things worse, but there’s no reason to think it wouldn’t help in the generically-described example above). When the goal is prediction (rather than estimating regression coefficients which will be given a direct causal interpretation), there’s no problem with adding this sort of informative predictor.

Common mistakes

Just about all the students got this one correct.

Question 6 of our Applied Regression final exam (and solution to question 5)

Here’s question 6 of our exam:

6. You are applying hierarchical logistic regression on a survey of 1500 people to estimate support for a federal jobs program. The model is fit using, as a state-level predictor, the Republican presidential vote in the state. Which of the following two statements is basically true?

(a) Adding a predictor specifically for this model (for example, state-level unemployment) could improve the estimates of state-level opinion.

(b) It would not be appropriate to add a predictor such as state-level unemployment: by adding such a predictor to the model, you would essentially be assuming what you are trying to prove.

Briefly explain your answer in one to two sentences.

And the solution to question 5:

5. You have just graded an exam with 28 questions and 15 students. You fit a logistic item-response model estimating ability, difficulty, and discrimination parameters. Which of the following statements are basically true?

(a) If a question is answered correctly by students with low ability, but is missed by students with high ability, then its discrimination parameter will be near zero.

(b) It is not possible to fit an item-response model when you have more questions than students. In order to fit the model, you either need to reduce the number of questions (for example, by discarding some questions or by putting together some questions into a combined score) or increase the number of students in the dataset.

Briefly explain your answer in one to two sentences.

(a) is false. If a question is answered correctly by students with low ability, but is missed by students with high ability, then its discrimination parameter will be negative.

(b) is false. It’s no problem at all to have more questions than students. Even in a classical regression, even without a multilevel model, this is typically no problem as long as each question is answered by a few different students.

Common mistakes

Most of the students had the impression that one of (a) or (b) had to be true, so a common response was to work through one of the two options, figure out that it was false, and then mistakenly conclude that the other one was true. I guess I should rephrase the question. Instead of “Which of the following statements are basically true?”, I could say, “For each of the following statements, say whether it is true or false.”

Question 5 of our Applied Regression final exam (and solution to question 4)

Here’s question 5 of our exam:

5. You have just graded an exam with 28 questions and 15 students. You fit a logistic item-response model estimating ability, difficulty, and discrimination parameters. Which of the following statements are basically true?

(a) If a question is answered correctly by students with low ability, but is missed by students with high ability, then its discrimination parameter will be near zero.

(b) It is not possible to fit an item-response model when you have more questions than students. In order to fit the model, you either need to reduce the number of questions (for example, by discarding some questions or by putting together some questions into a combined score) or increase the number of students in the dataset.

Briefly explain your answer in one to two sentences.

And the solution to question 4:

4. A researcher is imputing missing responses for income in a social survey of American households, using for the imputation a regression model given demographic variables. Which of the following two statements is basically true?

(a) If you impute income deterministically using a fitted regression model (that is, imputing using Xβ rather than Xβ + ε), you will tend to impute too many people as rich or poor: A deterministic procedure overstates your certainty, making you more likely to impute extreme values.

(b) If you impute income deterministically using a fitted regression model (that is, imputing using Xβ rather than Xβ + ε), you will tend to impute too many people as middle class: By not using the error term, you’ll impute too many values in the middle of the distribution.

Option (a) is wrong and option (b) is right. We discuss this in the missing-data chapter of the book. The point prediction from a regression model gives you something in the middle of the distribution. You need to add noise in order to approximate the correct spread.
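
Here’s a minimal simulated illustration of why (b) is right (all numbers made up): deterministic imputation using X*beta compresses the imputed values toward the middle, and adding the error term restores the spread.

set.seed(123)
n <- 10000
x <- rnorm(n)
income <- 50 + 10*x + rnorm(n, 0, 20)                 # fake "incomes," in $thousands
fit <- lm(income ~ x)
imp_det   <- predict(fit)                             # X*beta only
imp_noise <- predict(fit) + rnorm(n, 0, sigma(fit))   # X*beta + epsilon
c(sd(income), sd(imp_det), sd(imp_noise))             # deterministic imputations are much too narrow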

Common mistakes

Almost all the students got this one correct.