Something fishy in political science p-values, or, it’s tacky to set up quantitative research in terms of “hypotheses”

Posted on September 21, 2006 12:26 AM by Andrew

A commenter pointed out this note by Kevin Drum on this cool paper by Alan Gerber and Neil Malhotra on p-values in published political science papers. They find that there are surprisingly many papers with results that are just barely statistically significant (t=1.96 to 2.06) and surprisingly few that are just barely not significant (t=1.85 to 1.95). Perhaps people are fudging their results or selecting analyses to get significance. Gerber and Malhotra’s analysis is excellent: clean and thorough.
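
A pattern like this can be checked with a simple comparison: count the t-statistics that land just above and just below 1.96 and ask whether the split is plausibly 50/50. The sketch below is only an illustration, not necessarily the procedure Gerber and Malhotra used. The function name and the counts are made up, and it assumes that, absent any selection, a result falling in one of two equally narrow windows is about equally likely to land in either.

```python
from math import comb


def caliper_p_value(n_just_over: int, n_just_under: int) -> float:
    """One-sided exact binomial p-value for 'too many results just over the cutoff'.

    Under the (assumed) null, each result in the two windows is equally likely
    to fall on either side, so the count just over is Binomial(n, 0.5).
    Returns P(X >= n_just_over).
    """
    n = n_just_over + n_just_under
    tail = sum(comb(n, k) for k in range(n_just_over, n + 1))
    return tail / 2 ** n


# Hypothetical counts, NOT taken from the paper:
# 30 results with t in [1.96, 2.06] and 10 with t in [1.85, 1.95].
p = caliper_p_value(30, 10)
print(f"one-sided p-value for the hypothetical counts: {p:.5f}")
```

A small p-value here says only that the imbalance is hard to explain by chance under the equal-split assumption. It does not say whether the cause is fudging, selective analysis, or journals declining to publish near misses.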

Just one note: the finding is interesting, and I love the graphs, but, as Gerber and Malhotra note,

We only examined papers that listed a set of hypotheses prior to presenting the statistical results. . . .

I think it’s kind of tacky to state a formal “hypothesis,” especially in a social science paper, partly because, in much (most?) of my research, the most interesting finding was not anything we’d hypothesized ahead of time. (See here for some favorite examples.) I think there’s a problem with the whole mode of research that focuses on “rejecting hypotheses” using statistical significance, and so I’m sort of happy to find that Gerber and Malhotra notice a problem with studies formulated in this way.

Slightly related

In practice, t-statistics are rarely much more than 2. Why? Because, if they’re much more than 2, you’ll probably subdivide the data (e.g., look at effects among men and among women) until subsample sizes are too small to learn much. Knowing this can affect experimental design, as I discuss in my paper, “Should we take measurements at an intermediate design point?”

5 thoughts on “Something fishy in political science p-values, or, it’s tacky to set up quantitative research in terms of “hypotheses””

waldtest on September 20, 2006 at 12:21 pm said:

There is an interesting parallel in the economics literature on the impact of minimum wages. In 1994, Card and Krueger published a study in the American Economic Review finding no effect of a minimum wage increase in NJ on employment. They were widely attacked for a finding that was so contrary to prior published work and economic theory.

A year later they published a meta-analysis of the prior research in AER (Time-Series Minimum-Wage Studies: A Meta-analysis, American Economic Review, 85(2):238-243 (1995)). Basically, they find evidence that the prior studies torture the data to get the "right" answer, i.e., that higher minimum wages reduce employment.

More specifically: If the effect of minimum wages on employment is consistent, then estimates of the effect from different studies would cluster around the true effect size. Studies based on data with fewer subjects, which are less precise, might vary more than larger studies, but the effects should be symmetrical around the true effect size. What Card and Krueger found was that they weren't.

The standard tests for statistical significance, t-tests or z-tests, require a value of 1.96 (call it 2) for the researcher to assert that the probability the true effect is zero is less than 5%, which is the magic number for publishability. As sample size goes down, the size of the effect needed to achieve a t-test or z-test of 2 goes up. Card and Krueger found that the reported t-stats in the literature were almost all around 2, the minimum needed for publishability, and that smaller sample studies systematically found larger effects. This is not the pattern one would observe if the studies were getting unbiased estimates of the same effect.
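
The selection argument above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not Card and Krueger's method. Every study estimates the same true effect with unit-variance noise, so the standard error is 1/sqrt(n) and is treated as known. A study is "published" only if its statistic reaches 1.96 in the expected direction. The function name, the true effect of 0.1, and the sample sizes are all hypothetical.

```python
import random
from statistics import mean


def simulate_published_effects(true_effect, n, n_studies, rng, selective):
    """Simulate n_studies estimates of the same effect at sample size n.

    If selective is True, keep only estimates whose z-statistic is >= 1.96;
    otherwise keep every estimate. Returns the list of kept estimates.
    """
    se = 1.0 / n ** 0.5  # standard error of a mean with unit-variance noise
    kept = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)
        z_stat = estimate / se
        if not selective or z_stat >= 1.96:
            kept.append(estimate)
    return kept


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (25, 100, 400, 1600):
    everything = simulate_published_effects(0.1, n, 20000, rng, selective=False)
    significant = simulate_published_effects(0.1, n, 20000, rng, selective=True)
    print(
        f"n={n:5d}  mean of all estimates={mean(everything):.3f}  "
        f"mean of 'published' estimates={mean(significant):.3f}"
    )
```

Without selection, the average estimate stays near the true effect at every sample size. With selection, every published estimate from a small study must exceed 1.96 standard errors, so small studies report inflated effects: the asymmetry described above.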

There are lots of ways to edit one's data when one is getting the wrong answer. And for an economist writing for a professional audience that believes economic theory supports finding an effect of the minimum wage, there is plenty of reason to believe that the small, unpublishable effect or contrary effect one observes in one's initial analysis is wrong. And so, one edits, and keeps on editing, until one gets the "right" answer. Card and Krueger's analysis provides substantial evidence that this occurred in the prior studies.
John Thacker on September 21, 2006 at 8:40 am said:

I think there's a problem with the whole mode of research that focuses on "rejecting hypotheses" using statistical significance.

Indeed, the results suggest many of the papers which state a hypothesis followed the "Actual Method" listed in this quite funny comic.
Martin Ringo on September 22, 2006 at 9:06 am said:

Any 2nd or 3rd year graduate student in a discipline which extensively uses multivariate statistics who was not aware that in the published literature the reported significance of the primary effects (variables or what have you) tends to lie at 5% or below (95% or above for those in school before 1970) should have thought … well, thought of changing fields. And if those 1st and 2nd year reading lists weren’t enough, there was the “learning by doing” experience of the dissertation, which should have left no qualms. Science may be science, but people are people and we mortals practice science to our own greater benefit, albeit the latter has the traditional class and individual variation ratios.

These experiences have left most users of the literature with a set of cynicism guidelines. Mine are: is the model (including all transformations and functional forms) the most natural implementation of the theory? Is the data the most obvious source for the test? That is, I am trying to find out if there has been undue specification searching or data mining. I mean undue in the sense that working from received theory is already a specification search and data mining exercise of enormous consequence but in practice unavoidable. If we are going to get anywhere, we have to stand on somebody’s shoulders, but we don’t have to erect a model and data scaffolding to keep from falling.

In that regard let me comment on the Card and Krueger meta-analysis study in the 1995 American Economic Assoc. Proceedings mentioned by waldtest in comment 1. C&K were looking at a set of 15 studies on the effects of minimum wages on employment to see if the absolute value of the t-statistic on minimum wage increased proportionally with the square root of the degrees of freedom.[1] C&K found an insignificant, indeed slightly negative, relationship and concluded that not all was kosher in minimum wage studies.[2] The statistical hanky-panky in minimum wage studies is hardly news (think about the empirical problem and it looks almost hopeless for a non-hanky-panky analysis), and the C&K tests are hardly a confirmation, since they did not reject any hypothesis of proportional increase. Rather, they failed to reject the lack of proportional increase. Further, their own model is unconvincing because 1) it is not a comparison of the same Data Generating Processes[3] and 2) it is a selection of 15 t-statistic values out of a much larger universe of minimum wage studies.[4] Thus, while I do not disagree with their conclusions about being careful in interpreting the published empirical results, I do not think that C&K’s analysis, other than by offering a less than stellar model and test, showed much.

Finally, let me offer the following hypothetical: you are an industrial consultant. You have been asked whether the modulation of the plant temperatures would produce an increased mechanical/power efficiency in the industrial process. Of course you will estimate the effect, but are you going to present it, as opposed to dismiss it, without at least an implicit hypothesis test?

Footnote [1] If you have a simple model Y(i) = A + B*X(i) + e(i), i = 1,…, N, then assuming the model holds for all i AND X is stationary, the t-statistic on B will probabilistically grow with the square root of degrees of freedom. Variations of the model and hypotheses make for interesting Monte Carlo studies in a topics in regression class.
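
The claim in Footnote [1] is easy to check with a small Monte Carlo run of the kind the footnote mentions. The sketch below is one possible setup, not anything from C&K. It assumes X and the errors are independent standard normal draws, and the coefficients A = 1.0 and B = 0.2 are arbitrary. The function names are hypothetical.

```python
import random
from statistics import mean


def slope_t_stat(n, a, b, rng):
    """Simulate Y = a + b*X + e with n observations and return the OLS t-statistic on b."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [a + b * x + rng.gauss(0.0, 1.0) for x in xs]
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    rss = sum((y - a_hat - b_hat * x) ** 2 for x, y in zip(xs, ys))
    se_b = (rss / (n - 2) / sxx) ** 0.5  # usual OLS standard error of the slope
    return b_hat / se_b


def mean_t_stat(n, reps, rng, a=1.0, b=0.2):
    """Average the slope t-statistic over reps simulated data sets of size n."""
    return mean(slope_t_stat(n, a, b, rng) for _ in range(reps))


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (100, 400, 1600):
    print(f"N={n:5d}  average t on B = {mean_t_stat(n, 200, rng):.2f}")
```

Each fourfold increase in N should roughly double the average t-statistic, which is the square-root growth the footnote describes when the same model holds for every observation.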

Footnote [2] Within the economics profession, the article must be placed in the context of the primary C&K agenda: showing that the classical and neoclassical law of demand does not strictly apply to minimum wage labor markets.

Footnote [3] The result holds for each DGP, thus even with a random mixing of DGPs it should hold as the number of studies compared, call that M as opposed to N, becomes large. However, for N in the size of the minimum wage studies, less than 1,000, M has to become very large to assure the result. 15 isn’t a very large M in this case.

Footnote [4] As early as 1915 the Bureau of Labor Statistics published a report on the effects of minimum wages on employment.
David Gal on September 23, 2006 at 5:56 pm said:

A similar finding was recently published with regard to medical journals:

http://bmj.bmjjournals.com/cgi/content/full/bmj;3…
John Thacker on October 2, 2006 at 6:15 am said:

They were widely attacked for a finding that was so contrary to prior published work and economic theory.

Interestingly, their original study also contains some suspect methodology of its own. First off, they based their results on survey data from managers. If their study is replicated using payroll data instead of surveys, then the opposite result is obtained. That's quite suspicious right there, and worth attacking.
