Carl Bialik of the Wall Street Journal writes:

I’m working on a column this week about numerical/statistical tips and resolutions for writers and people in other fields in the new year. 2013 is the International Year of Statistics, so I’d like to offer some ways to better grapple with statistics in the year ahead. Here’s where you come in. If you have time in the next couple of days, please send me an idea or three about what people often do incorrectly when it comes to numbers, and how they could do better, without making things much more complicated. Maybe with an example of something you’ve seen that rubbed you the wrong way, and how you’d fix it. Bonus if you can tie it in to some sort of statistic that will be particularly relevant in 2013.

Any ideas of yours I use, I’ll credit, of course.

Examples of what I have in mind:

–Don’t report on how ubiquitous or important something is by saying there are 130,000 Google search results for it. Or if you do, at least check that you typed the query in a way that it’s only finding relevant results; and if you’re using the data to make the argument it’s becoming trendy, compare it to how many results there were a year ago, and compare that growth to the typical growth in search results. Or, much better, find a better data point.

–Don’t, on a controversial issue, find a study from one advocacy group on one side of the issue, a study from a group on the other, mention both and consider your work done. Look for research from less biased sources, and see what independent researchers think of the body of work.

–When two candidates are separated in the polls by a margin less than the statistical margin of error, don’t say they’re statistically tied. Especially if one candidate is leading the other in nearly every poll.

–Check your numbers, and then check them against a smell test. I’ve mixed up millions and billions, too, but if I’d looked twice I would have realized there can’t be 11 billion people in Ohio. Hopefully this conveys the idea.

My deadline is Thursday, 9 a.m. Eastern time.

Oddly enough, I can’t think of any good examples, nor can I think of any good suggestions beyond generic advice such as, “Just because something is counterintuitive, it doesn’t mean it’s true,” “Don’t trust anything written by Gregg Easterbrook,” and, of course, the ever-popular, “Hey—I don’t like that graph!” Maybe you can come up with something better for the readers of the Wall Street Journal?

When reporting government spending, don’t just say “the program costs $10 billion per year.” Put it in context. Something like “The program costs $10b per year, which is X% of GDP” or “… which is Y% of total gov’t spending.”

Here are a few things that have bothered me for years:

1. Failure to report base rates. For example, if grilling over charcoal “doubles your risk” of a certain type of cancer, what is the incidence of that cancer? Going from 1 in 10 to 2 in 10 is a lot different than going from 1 in 500,000 to 2 in 500,000.
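The comment’s own numbers make the point; here is a tiny sketch (Python, illustrative figures only) of why the same relative risk can mean very different absolute changes:

```python
# Illustrative numbers from the comment: "doubles your risk" implies
# very different absolute changes depending on the base rate.

def absolute_increase(base_rate, relative_risk):
    """Extra cases per person implied by a relative risk at a base rate."""
    return base_rate * (relative_risk - 1)

common = absolute_increase(1 / 10, 2.0)      # 1 in 10 -> 2 in 10
rare = absolute_increase(1 / 500_000, 2.0)   # 1 in 500,000 -> 2 in 500,000

print(f"common cancer: {common:.4f} extra cases per person")   # 0.1000
print(f"rare cancer:   {rare:.8f} extra cases per person")     # 0.00000200
```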

2. On a related note, reporting that a study found a “statistically significant” effect, rather than reporting (in addition to, or instead of) the effect size. Just because an effect is statistically significant does not mean that the effect is significant in any important, reality-based way. (Of course, removing all references to statistical significance entirely might be too much to ask.)

3. When reporting that a test, or other predictive measure, is “X% accurate,” define what that accuracy means, also in the context of the base rate. I believe that most of the time, the reporter is referring to hit rate. Instead, please report the rates of *both* false positives and false negatives. Only with both pieces of information can we evaluate, and communicate, the true incremental value of the test.

4. More selfishly, please do not write that “a study by [Name of] University found that…” Please give credit to the authors, or at least the principal investigators. University research activities are not as centralized as those in private companies. And I have felt quite bad when I was interviewed for an article that referred to one of my papers but ignored my co-authors.

Sadly, I have brought all of these issues up to friends who are professional journalists (including one at the WSJ!), only to be told that either (a) we only have so much time or so many inches to tell the story, so we can’t include that much detail; or (b) the reporter is on deadline and does not have enough time to gather the information that is really important. Neither response seems like much of an excuse for not getting the story correct.

As a variant of Mr. Bialik’s third point, and Michael Braun’s second, if a result is not statistically significant, don’t pretend that that makes the estimate “indistinguishable from zero”/”no effect”, etc.

Of course, that’s not specifically a journalists’ problem, but can be seen in the research literature all the time. Seth Roberts recently linked to a talk about malpractice in reporting medical research results. One of the examples given was that a treatment group had a much elevated risk of suicide, but it was not significant (suicide is rare, even in depressed populations), and was hence reported as though there were no problem.

So what work does “statistical significance” ever do?

I cannot agree more strongly with the complaint that a finding of “statistical significance” should not, but often is interpreted to, imply some sort of substantive significance. There’s no validity in this implication at all, ever, and it is an ongoing fraud that the statistical profession “accidentally” co-opted the word “significance” this way and somehow, just somehow, can’t get around to correcting the situation, even when there are far better, more honest alternatives (“statistically discernible,” for instance).

But now you want the other direction too? Even with all the padding one is given to help find statistical significance, a failure to find it can’t be interpreted as evidence of no interesting effect? (Outside physics there is always _some_ effect, so the usual question is whether it is an interesting one.) That seems to be your concern, no?

If I put these together, does this complaint not boil down to “never report nor talk about _any_ research whose results are phrased in terms of statistical significance,” because, whether the results are positive or negative, they could be very misleading? (N.b. I myself would sign on to this.)

If not, perhaps you, or perhaps someone else, should suggest a layman’s/reporter’s quick guide to “when might a statistical significance test possibly be saying something in the faintest bit interesting about the real world?”

I wouldn’t sign on to the extreme statement about never reporting any results in terms of statistical significance.

The test of statistical significance tells you the probability of finding a statistical association as extreme as or more extreme than the one observed in the sample if the null hypothesis is true in the universe (the null hypothesis typically being, no difference between groups or, effect not different from zero). Among other things, the test is sensitive to sample size, as it should be. That means that you can set out to find “no significant difference” by deliberately employing small samples.
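A quick sketch of that sample-size sensitivity (a two-sample z test on made-up numbers; the observed effect is identical in both cases):

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic (normal approximation)."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def z_stat(diff, sd, n_per_group):
    """z statistic for a difference in means between two equal groups."""
    return diff / (sd * math.sqrt(2 / n_per_group))

# Same observed difference (0.2 standard deviations) either way:
p_small = two_sided_p(z_stat(0.2, 1.0, 20))    # ~0.53: "no significant difference"
p_large = two_sided_p(z_stat(0.2, 1.0, 2000))  # far below 0.001
print(round(p_small, 2), p_large < 0.001)
```

So a deliberately small sample lets you report “no significant difference” for an effect a larger study would easily detect.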

I would suggest a three-step procedure when looking at statistical associations: (i) Is the sample (somewhat) representative of an identifiable universe? If so, (ii) is the size of the association interesting? If so, (iii) is the association statistically significant? If it isn’t, the likelihood that the association was observed by chance is deemed so high that one should be cautious about generalizing the result.

Notes: (a) The universe is the total class of units about which you want to make a statement, e.g., all people who voted in the most recent election. (b) An effect that looks small might nonetheless be of substantive interest, e.g., treatment kills one in 10,000 patients. (c) Whether an association represents a causal effect is a different matter altogether, which is not addressed by significance tests.

Bxg:

As I wrote a few weeks ago:

Help people figure out what a 70% probability means by giving an example from daily life (as you, Gelman, did during the election when you noted that Obama’s lead was like leading in football under such-and-such circumstances).

Ecological fallacy — e.g. assuming that individuals have the characteristics of groups and vice-versa.

Good example from Tim Harford on “More or Less”: a study in the New England Journal of Medicine on the health effects of chocolate/cocoa showing a correlation between per capita chocolate consumption in a country (group) and the number of Nobel Prize winners (individual).

Wasn’t the NEJM paper a fairly obvious satire? I’m surprised by how much serious attention it got.

1. Going along with Andy W.’s comment, I would divide by the relevant population. So, if a government program costs $100 million in Pennsylvania, report that it will cost about $10 per person.

2. A politician proudly reports that his state is responsible for 50% of the gains in nationwide employment in some quarter. This is a meaningless statistic, because some states gain jobs and others lose them. If the national economy has a net gain of zero jobs, then any state with a positive net gain can claim responsibility for an infinite percentage of the gain (if that is how you define division by zero). Instead, report a state’s ranking, e.g., that it had the 5th largest percentage drop in unemployment.
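The instability is easy to see with made-up numbers:

```python
# Hypothetical figures: "share of national job gains" explodes as the
# national net change approaches zero, so it is not a stable statistic.

def share_of_national_gain(state_gain, national_gain):
    return state_gain / national_gain

print(share_of_national_gain(10_000, 20_000))  # 0.5   -> "50% of the gains"
print(share_of_national_gain(10_000, 1_000))   # 10.0  -> "1000% of the gains"
# At national_gain == 0 the ratio is undefined (ZeroDivisionError).
```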

Check out healthnewsreview.org for a lot of examples of bad–and good–reporting.


Exclude “don’t know”, “not applicable”, “no response” or other such responses from the denominator when calculating percentages. A better answer, of course, is to think carefully about whether to include these responses given the context of the analysis, but I recognize that this is hardly suggesting a quick and easy fix. Excluding them usually estimates a quantity of interest even if including them is also of interest. For example, if a survey asks about smartphone OSs “n/a” probably mostly means the respondent doesn’t own a smartphone, and analysis of smartphone owners is probably good even if some analysis of all Americans is important as well. As a further quick fix, one valid reason for separating these categories out would be to allow the reader to do a sensitivity analysis – but if that’s the intention, then do that math and report it instead of making readers do it themselves.
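A sketch with made-up survey counts, showing both the conditional share and the sensitivity bounds described above:

```python
# Hypothetical counts. "n/a" here mostly means "doesn't own a smartphone."
responses = {"iOS": 300, "Android": 350, "Other": 50, "n/a": 300}

owners = sum(v for k, v in responses.items() if k != "n/a")  # 700
total = sum(responses.values())                              # 1000

ios_among_owners = responses["iOS"] / owners   # ~42.9%
# Sensitivity analysis for the all-respondents share: at the extremes,
# none or all of the "n/a" group could hold the attribute.
low = responses["iOS"] / total                          # 30.0%
high = (responses["iOS"] + responses["n/a"]) / total    # 60.0%
print(f"{ios_among_owners:.1%} of owners; {low:.0%}-{high:.0%} of all respondents")
```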

As an example, here’s a webpage that presents data on the winning percentage of different first moves in professional Go games (http://senseis.xmp.net/?MoveOneLosesTheGame). Although this is a niche interest, the example shows how far wrong this mistake can take you.

In games where Black plays in one or more empty corners at:

4,4 – B wins 54.7% W wins 44.7% No result 0.6% (precision 1.0%)

3,4 – B wins 54.0% W wins 43.4% No result 2.6% (precision 0.6%)

3,5 – B wins 47.4% W wins 42.8% No result 9.8% (precision 3.0%)

4,5 – B wins 47.0% W wins 45.4% No result 7.6% (precision 5.2%)

3,3 – B wins 48.2% W wins 51.3% No result 0.6% (precision 5.4%)

In a 2 player game, there’s a big difference between winning 47.4% and winning 52.5%! This problem is exacerbated because of the substantial difference in “no result” rates for the different moves, making the win percentages for the different moves difficult to compare. But that comparison is the whole point of this table!
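One quick fix is to condition on games that actually reached a result (numbers copied from the table above):

```python
# Win shares among decided games only (B wins, W wins), from the table above.
rows = {
    "4,4": (54.7, 44.7),
    "3,4": (54.0, 43.4),
    "3,5": (47.4, 42.8),
    "4,5": (47.0, 45.4),
    "3,3": (48.2, 51.3),
}
for move, (b, w) in rows.items():
    print(move, f"B wins {b / (b + w):.1%} of decided games")
# 3,5 comes out near 52.5% -- the number the comparison in the text uses.
```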

The obvious one (maybe someone listed it already): Don’t assume a statistically significant effect is evidence for a theory that seems to “explain” the effect. Distinguish statistical from substantive significance.

My favorite: don’t assume that a significant difference between two groups means you will observe the same effect between particular members of each group. This ties into Michael Braun’s point about effect size, too.

I think Jorge Cham puts it very well:

http://www.phdcomics.com/comics/archive.php?comicid=1271

MPG is actually pretty hard to interpret intuitively. The marginal improvement in fuel used per 10000 miles going from 16 to 20 mpg is greater than from 34 to 50 mpg, but that’s not always intuitive/obvious to most people.
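The arithmetic, in gallons per 10,000 miles (a more directly comparable unit):

```python
def gallons_per_10k_miles(mpg):
    """Fuel consumed over 10,000 miles at a given mpg."""
    return 10_000 / mpg

saved_16_to_20 = gallons_per_10k_miles(16) - gallons_per_10k_miles(20)
saved_34_to_50 = gallons_per_10k_miles(34) - gallons_per_10k_miles(50)
print(saved_16_to_20)          # 125.0 gallons saved
print(round(saved_34_to_50))   # 94 gallons saved
```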

Here’s one idea: Do *not* take a good (or poor) fit between model and data as evidence for accepting (or rejecting) a hypothesis (e.g., $R^{2}$ supposedly explaining the variance). Example: Treisman and Gelade (discussing Experiment II) claim, “In both [easy and difficult discrimination] conditions we have evidence supporting serial, self-terminating search through the display for the conjunction targets” (Treisman & Gelade 1980, “A feature-integration theory of attention”). The primary reason for this claim is linearity supposedly being preserved across both easy and difficult stimuli; in other words, the authors are relying on a measure between 0 and 1.0, the coefficient of determination ($R^{2}$), to provide strong *evidence* for their claim.

Whenever you write about a prediction report the confidence interval of the prediction.

Predictions without confidence intervals aren’t worth much.

Not quite statistics… but, report real prices, not nominal prices, when making comparisons over time.
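A minimal sketch of the deflation, using approximate CPI-U annual averages (check the published series for real work):

```python
# Approximate CPI-U annual averages; verify against the official series.
cpi = {1993: 144.5, 2013: 233.0}

def real_price(nominal, year, base_year=2013):
    """Re-express a nominal price in base-year dollars via a CPI ratio."""
    return nominal * cpi[base_year] / cpi[year]

print(round(real_price(1.00, 1993), 2))  # $1.00 in 1993 ~ $1.61 in 2013 dollars
```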


“Plot graphs with raw data or as close to raw data as much as possible”

I hate graphs with an axis like “Seasonally adjusted homicide rate (excluding self-fatalities) with 1999 as 100”.

If a variable seems grossly convoluted the author probably has a tendentious reason for using it.

Not strictly statistical but:

Read the actual reports/papers and look at the numbers. Don’t just quote the press release or even the abstract.

I have seen papers whose reported results, as I interpret them, just don’t match either the press release or the abstract.