James Heathers reports on the article, “Contagion or restitution? When bad apples can motivate ethical behavior,” by Gino, Gu, and Zhong (2009):
There is some sentiment data reported in Experiment 3, which seems to be reported in whole units.
They also indicated how guilty they would feel about the behavior of the person who took all the money along with some unrelated emotional measures (1 = not at all, 5 = very much)… participants in the in-group selfish condition felt more guilty (M = 4.61, SD = 1.64) about the person’s selfish behavior than the participants in the out-group selfish condition (M = 3.26, SD = 1.54), t(80) = 3.82, p < .001.
If you have a 1 to 5 scale, it isn’t possible to have M = 4.61, SD = 1.64.
Huh? Really? Yeah!
Let’s work it out. If your measurements are on a 1-5 scale, the way to maximize their standard deviation for any given mean is to put the data all at 1 and 5. If the mean is 4.61, that would imply that (4.61 – 1)/(5 – 1) = 0.9025 of the data take on the value 5, and 1 – 0.9025 = 0.0975 take on the value 1. (Just to check, 0.0975*1 + 0.9025*5 = 4.61.)
For this extreme dataset, the standard deviation is sqrt(0.0975*(1 – 4.61)^2 + 0.9025*(5 – 4.61)^2) = 1.19. So, yeah, there’s no way to get a standard deviation of 1.64 from these data. Just not possible!
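The step "put the data all at 1 and 5" can be backed by a short derivation. This is a sketch using the divide-by-n variance; the symbols a and b (scale endpoints) and m (the mean) are notation introduced here for illustration, not taken from the paper or from Heathers's post.

```latex
% For any x in [a, b]:  (x - a)(b - x) >= 0,  so  x^2 <= (a + b) x - ab.
% Averaging that inequality over the n observations gives:
\begin{align*}
\operatorname{Var}(x) &= \frac{1}{n}\sum_{i=1}^{n} x_i^2 - m^2 \\
  &\le (a + b)\,m - ab - m^2 \\
  &= (m - a)(b - m),
\end{align*}
% with equality only when every x_i sits at a or b.
% Plugging in a = 1, b = 5, m = 4.61:
\[
\sqrt{(4.61 - 1)(5 - 4.61)} = \sqrt{3.61 \times 0.39} = \sqrt{1.4079} \approx 1.19 .
\]
```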
Just to make sure, we can check our calculation via simulation:
n <- 1e6
y <- sample(c(1,5), n, replace=TRUE, prob=c(0.0975, 0.9025))
print(c(mean(y), sd(y)))
Here's what we get:
[1] 4.610172 1.186317
Check.
OK, let's try one more thing. Maybe n is so small that there's some kinda 1/sqrt(n-1) thing in the denominator driving the result? I don't think so. The trouble is that, to get a mean of 4.61, you need enough data (in his post, Heathers guesses "n=41 (as 189/41 = 4.6098)") that the difference between 1/sqrt(n) and 1/sqrt(n-1) wouldn't be enough to take you from 1.19 all the way up to 1.64 or even close. Also, it's kinda implausible that all the observations would be 1's and 5's anyway.
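Here is a quick finite-sample version of that check. It is only a sketch, and it assumes Heathers's guess of n = 41 responses summing to 189; with nothing but 1's and 5's, solving 5k + (41 - k) = 189 gives k = 37 fives.

```r
# Assumption: n = 41 responses summing to 189 (Heathers's guess),
# all placed at the endpoints of the 1-5 scale.
n <- 41
k <- (189 - n) / 4                      # number of 5's: 5k + (n - k) = 189
y <- c(rep(5, k), rep(1, n - k))

sd_n1 <- sd(y)                          # R's sd() divides by n - 1
sd_n  <- sqrt(mean((y - mean(y))^2))    # same data, dividing by n

print(round(c(mean = mean(y), sd_n1 = sd_n1, sd_n = sd_n), 3))
```

The n - 1 denominator nudges the most extreme possible standard deviation up to about 1.20, still nowhere near 1.64.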
So what happened?
It's always easier to figure out what didn't happen than to figure out what did happen.
Here are some speculations.
One possibility is a typo, but Heathers doubts that because other calculations in that paper are consistent with the above-reported impossible numbers.
A related possibility is that this was a typo that was then propagated into the rest of the paper. For example, the mean was 3.61, it was typed in the paper as 4.61, and then this typed-in number was used in later calculations. This would be bad workflow---you want all the computations to be done in a single script---but people use bad workflow all the time. I use bad workflow myself sometimes and end up with wrong numbers or wrongly-labeled graphs.
Another possibility is that the mean and standard deviation were calculated from two different datasets. That might sound kind of weird, but it can happen all the time, due to sloppiness or because of goofs in data processing. For example, you read in the data, calculate the mean and standard deviation for each variable, then perform some data-exclusion rule, perhaps removing data with incomplete responses to some of the questions, then you do further statistical analysis, recalculating the mean and standard deviation, among other things---but then when you pull together your numbers, you take the mean from some place and the standard deviation from the other place.
Yet another possibility is that someone involved in the data analysis or writeup was cheating in order to get a statistically-significant and thus publishable result, for example changing 3.61 to 4.61 to get a big fat difference but not touching the standard deviation. This would be a great way to cheat, because if you get caught, you can just say that you made a typo!
In any case, it's a fun little statistics example. And it's worth checking your data, even if you have no suspicion of cheating. I've often had incoherent data in problems I've worked on. Lots of things can go wrong in data processing and analysis, and we have to check things in all sorts of ways.
I forget where I came across this (maybe Charles Manski cited it in something I read), but the economist Paul Samuelson wrote up a paper on this in 1968 with the excellent title “How Deviant Can You Be?” (https://www.jstor.org/stable/2285901). It provides bounds both on how far from the mean a single observation can be as well as a set of r observations. “For a finite universe of N items, it is proved no one can lie more than sqrt(N-1) standard deviations away from the mean. This is an improvement over the result given by Tchebycheff’s inequality: and a similar improvement is possible when speaking of how far from the mean any odd-number r out of N observations can lie.” He also goes into the median absolute deviation a bit. “No one of N observations can be more than N mean-absolute-deviations away from the median.”
I’d have to think a bit more to apply the results here, where the summary stats don’t include a maximum or minimum value, but it’s worth sharing for the title alone.
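For anyone who wants to see the quoted sqrt(N-1) bound in action, here is a small sketch. It assumes the standard deviation in the bound is the divide-by-N version, and both the one-outlier dataset and the helper name pop_sd are made up for illustration.

```r
# Samuelson's bound: |x_i - mean(x)| <= sqrt(N - 1) * sd,
# with sd computed using divisor N (an assumption stated above).
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

N <- 41
x <- c(rep(0, N - 1), 1)                # hypothetical extreme case: one outlier
ratio <- max(abs(x - mean(x))) / pop_sd(x)
print(c(ratio = ratio, bound = sqrt(N - 1)))

# An arbitrary sample of 1-5 ratings sits comfortably inside the bound
set.seed(1)
z <- sample(1:5, N, replace = TRUE)
z_ratio <- max(abs(z - mean(z))) / pop_sd(z)
print(c(z_ratio = z_ratio, bound = sqrt(N - 1)))
```

The single-outlier dataset hits the bound exactly, which is the sense in which sqrt(N-1) is the most "deviant" any one observation can be.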
I thought maybe also 0 was allowed. But even if X = 0, 1,…,5 the maximum standard deviation is 1.34!
Or they used some software package that treats missing values as 0 when calculating a sum (which is then divided by the number of samples) but treats them as missing when calculating an SD. I seem to remember some cases where Excel does this.
That’s an interesting idea, but I don’t think it helps in this case. Under that scenario, it seems that the mean of the non-missing data would actually be higher than 4.61 which would lead to a maximum possible standard deviation that is even smaller than the 1.19 that was calculated in the post. So, a standard deviation of 1.64 would still be ruled out. To see this, you can make the sample mean a variable in the procedure outlined in the post to get a function for the maximum possible SD for any possible mean between 1 and 5. If you graph it, you’ll see it is unimodal with a peak of 2 when the sample mean is 3 and, of course, is zero at sample means of 1 and 5. It is actually the top curve in the “umbrella plot” in the linked post by James Heathers.
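A minimal sketch of that function, for anyone who wants to check it. The name max_sd and the grid spacing are choices made here for illustration, and the formula is the divide-by-n maximum from the all-at-the-endpoints construction in the post.

```r
# Maximum possible SD (divisor n) on a bounded scale, as a function of the mean.
# max_sd is a helper name introduced here for illustration.
max_sd <- function(m, lo = 1, hi = 5) sqrt((m - lo) * (hi - m))

m_grid <- seq(1, 5, by = 0.01)
s_grid <- max_sd(m_grid)

peak_mean <- m_grid[which.max(s_grid)]
print(c(peak_mean = peak_mean, peak_sd = max(s_grid)))

# The ceiling shrinks as the mean moves toward the top of the scale
print(round(max_sd(c(4.61, 4.8, 5)), 2))

# To draw the curve: plot(m_grid, s_grid, type = "l")
```

So if the mean of the non-missing data were above 4.61, the ceiling on the standard deviation would be below 1.19, not above it.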
Just tried out an example; it seems that in my previous comment I got things in reverse:
In Excel, define A1:A50 to be 37 values of 5, 9 missing values, and 4 values of 1, then =AVERAGE(A1:A50) gives 4.61.
Define B1:B50 to be (A – 4.61)^2, then =SQRT(SUM(B1:B50)/40) gives a “standard deviation” of 2.5.
If you are not aware of the STDEV function in Excel (and have your brain turned off and are not paying attention to missing values), this may be a possible mistake to make.
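The same spreadsheet slip can be mimicked in R. This is a sketch under the assumption that blank cells act as 0 inside the formula (A - 4.61)^2 while AVERAGE and STDEV skip them; the variable names are made up for illustration.

```r
# 37 fives, 9 missing values, 4 ones, as in the Excel example above
a <- c(rep(5, 37), rep(NA, 9), rep(1, 4))

avg <- mean(a, na.rm = TRUE)            # like =AVERAGE(), which skips blanks

# Assumption: blanks behave as 0 in the hand-built squared deviations
a_zero <- ifelse(is.na(a), 0, a)
b <- (a_zero - 4.61)^2
wrong_sd <- sqrt(sum(b) / 40)           # the hand-rolled "standard deviation"

right_sd <- sd(a, na.rm = TRUE)         # like =STDEV(), which skips blanks

print(round(c(avg = avg, wrong_sd = wrong_sd, right_sd = right_sd), 2))
```

The nine blanks each contribute (0 - 4.61)^2 to the sum, which is what inflates the hand-rolled figure to about 2.5 when the correct value is about 1.2.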
“Never attribute to malice that which is adequately explained by stupidity.”
Yep, that could do it. Nice one!
More generally, though, surely there’s a better measure of dispersion than a standard deviation for ordered categorical data, no?
The values of 1-5 are sentiment categories, not numeric values. Calculating anything like a mean or standard deviation is ridiculous. There are legitimate methods of analyzing categorical data that can be used instead.
This seems to happen a lot with Likert data, especially by management consultants who are good at selling their survey techniques and poor at doing data analysis.
What’s the most prevalent alternative?
https://link.springer.com/referenceworkentry/10.1007/978-3-642-04898-2_608
Rdn:
I disagree with you! Yes, there are cases where a discrete survey response behaves in an unexpected way, but usually I think it’s just fine to model a five-point response as a continuous outcome; indeed, that’s what I recommend. I will, however, recommend something like an ordered logit in those cases where there is interest in predicting the response on the original scale.
Andrew
I just wanted to say that I remember reading this advice of yours in Regression and Other Stories. That and the order of important assumptions in linear regressions have been incredibly helpful in grounding me with what pitfalls to focus the most effort in when doing stats. Analyses can go wrong in so many ways that it’s really, really hard just to get the biggest things right.
I find my workflow so much easier to just start with the quick linear assumption, iterate rapidly on the biggest issues, and then eventually go back and check with CV that ordinality isn’t doing anything funky.
I was also reminded of all this when talking with an econometrician the other day who was advocating preregistration. Despite largely agreeing, we still talked past one another. He kept praising it for dealing with things like researchers using significance to determine stopping criteria. I know stopping criteria are a can of worms vis-a-vis frequentism and Bayesianism, but even to frequentists, it can’t be even a top 10 contributor to false discoveries or failed research. It’s just strange how much bandwidth is spent on relatively minor statistical minutiae compared to the elephants like model misspecification, good hypothesis development, and careful design and measurement.
I find with 1-4 or even 5 responses that you get lots of distributional issues suggesting that effects aren’t what a mean would reflect, such as responses being pushed to extremes. The most common issue though is the difference between selecting 3 very commonly and a flat distribution. The means are the same but the meanings are very different. Unless you pay close attention to the variance you may miss it using means.
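A toy illustration of that point, with two made-up sets of 100 responses on a 1-5 scale (neither comes from any real survey):

```r
# Hypothetical responses: everyone picks the midpoint vs. an even spread
peaked <- rep(3, 100)
flat   <- rep(1:5, each = 20)

print(c(mean_peaked = mean(peaked), mean_flat = mean(flat)))
print(c(sd_peaked = sd(peaked), sd_flat = sd(flat)))

# The frequency tables make the difference obvious
print(table(factor(peaked, levels = 1:5)))
print(table(factor(flat, levels = 1:5)))
```

Both means are 3, so only the standard deviation or the raw counts reveal that the two groups answered very differently.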
Maybe it’s not as common in poli-sci where you have hung out a lot.
It’s not Likert, it’s ratings. A Likert scale is something different and you have more justification for getting means.
In the context of ‘honest error’, I think it is important to note that the first author of the paper referred to is Francesca Gino, who has been accused of data manipulation and is currently fighting these accusations (and her dismissal from Harvard Business School) in court.
Raphael:
Recall Clarke’s Law.
For what it’s worth, 1.19^3 = 1.68, which is close.
(My first thought was that they reported the variance instead of the s.d., but that’s only 1.42.)
Another (relatively benign) possibility is that the scale was 1 to 7 rather than the reported 1 to 5.
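A quick way to compare the candidate scales raised in this thread is to compute the largest standard deviation each allows at a mean of 4.61. This is a sketch using the divide-by-n maximum sqrt((m - lo) * (hi - m)); the list of scales and the variable names are just the ones set up here for illustration.

```r
m <- 4.61
reported_sd <- 1.64

# Candidate response scales: the reported 1-5, plus 0-5 and 1-7
scales <- list("1 to 5" = c(1, 5), "0 to 5" = c(0, 5), "1 to 7" = c(1, 7))

# Largest SD possible at mean m: all responses at the two endpoints
max_sd <- sapply(scales, function(r) sqrt((m - r[1]) * (r[2] - m)))

print(round(max_sd, 2))
print(max_sd >= reported_sd)
```

Of the three, only the 1 to 7 scale leaves room for a standard deviation of 1.64 at that mean.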