
Combining apparently contradictory evidence

I want to write a more formal article about this, but in the meantime here’s a placeholder.

The topic is the combination of apparently contradictory evidence.

Let’s start with a simple example: you have some ratings on a 1-10 scale. These could be, for example, research proposals being rated by a funding committee, or, umm, I dunno, gymnasts being rated by Olympic judges. Suppose there are 3 judges doing the ratings, and consider two gymnasts: one receives ratings of 8, 8, 8; the other is rated 6, 8, 10. Or, forget about ratings, just consider students taking multiple exams in a class. Consider two students: Amy, whose three test scores are 80, 80, 80; and Beth, who had scores 80, 100, 60. (I’ve purposely scrambled the order of those last three so that we don’t have to think about trends. Forget about time trends; that’s not my point here.)

How to compare those two students? A naive reader of test scores will say that Amy is consistent while Beth is flaky; or you might even say that you think Beth is better as she has a higher potential. But if you have some experience with psychometrics, you’ll be wary of overinterpreting results from three exam scores. Inference about an average from N=3 is tough; inference about variance from N=3 is close to impossible. Long story short: from a psychometrics perspective, there’s very little you can say about the relative consistency of Amy and Beth’s test-taking based on just three scores.
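To put a rough number on "close to impossible", here is a minimal sketch. It assumes (this is an assumption for illustration, not something the scores can tell us) that a student's test scores are independent draws from a normal distribution with unknown standard deviation sigma. With N=3, the quantity 2*s^2/sigma^2 follows a chi-square distribution with 2 degrees of freedom, whose quantile function has the closed form -2*log(1-p), so the textbook interval for sigma needs only the standard library. The function name is made up for this example.

```python
import math
import statistics


def sigma_interval_n3(scores, level=0.95):
    """Confidence interval for the true SD from exactly three scores.

    Assumes independent normal draws (an illustrative assumption).
    Then 2 * s**2 / sigma**2 is chi-square with 2 degrees of freedom,
    whose quantile function is -2 * log(1 - p).
    """
    if len(scores) != 3:
        raise ValueError("this closed form is specific to N = 3")
    s = statistics.stdev(scores)  # sample SD, divisor N - 1 = 2
    alpha = 1.0 - level
    q_low = -2.0 * math.log(1.0 - alpha / 2.0)  # lower chi-square quantile
    q_high = -2.0 * math.log(alpha / 2.0)       # upper chi-square quantile
    return s * math.sqrt(2.0 / q_high), s * math.sqrt(2.0 / q_low)


beth_low, beth_high = sigma_interval_n3([80, 100, 60])
print(f"Beth: sample SD 20, 95% interval for sigma: {beth_low:.1f} to {beth_high:.1f}")
```

Under these assumptions, Beth's sample SD of 20 is compatible with a true SD anywhere from about 10 to about 125 points, a factor of roughly 12 between the ends. Amy's three identical scores give a sample SD of 0 and a degenerate interval, which mostly shows that the normal model is too crude for rounded scores, not that her true variation is zero.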

Academic researchers will recognize this problem when considering reviews of their own papers that they’ve submitted to journals. When you send in a paper, you’ll typically get a few reviews, and these reviews can differ dramatically in their messages.

Here’s a hilarious example supplied to me by Wolfgang Gaissmaier and Julian Marewski, from reviews of their 2011 article, “Forecasting elections with mere recognition from small, lousy samples: A comparison of collective recognition, wisdom of crowds, and representative polls.”

Here are some positive reviewer comments:

– This is a very interesting piece of work that raises a number of important questions related to public opinion. The major finding — that for elections with large numbers of parties, small non-probability samples looking only at party name recognition do as well as medium-sized probability samples looking at voter intent — is stunning.

– There is a lot to like about this short paper… I’m surprised by the strength of the results… If these results are correct (and I have no real reason to suspect otherwise), then the authors are more than justified in their praise of recognition-based forecasts. This could be an extremely useful forecasting technique not just for the multi-party European elections discussed by the authors, but also in relatively low-salience American local elections.

– This is a concise, high-quality paper that demonstrates that the predictive power of (collective) recognition extends to the important domain of political elections.

And now the fun stuff. The negative comments:

– This is probably the strangest manuscript that I have ever been asked to review… Even if the argument is correct, I’m not sure that it tells us anything useful. The fact that recognition can be used to predict the winners of tennis tournaments and soccer matches is unsurprising – people are more likely to recognize the better players/teams, and the better players/teams usually win. It’s like saying that a football team wins 90% (or whatever) of the games in which it leads going into the fourth quarter. So what?

– To be frank, this is an exercise in nonsense. Twofold nonsense. For one thing, to forecast election outcomes based on whether or not voters recognize the parties/candidates makes no sense… Two, why should we pay any attention to unrepresentative samples, which is what the authors use in this analysis? They call them, even in the title, “lousy.” Self-deprecating humor? Or are the authors laughing at a gullible audience?

So, their paper is either “a very interesting piece of work” whose main finding is “stunning”—or it is “an exercise in nonsense” aimed at “a gullible audience.”


  1. Terry says:

    consider students taking multiple exams in a class. Consider two students: Amy, whose three test scores are 80, 80, 80; and Beth, who had scores 80, 100, 60. … Inference about an average from N=3 is tough; inference about variance from N=3 is close to impossible. Long story short: from a psychometrics perspective, there’s very little you can say about the relative consistency of Amy and Beth’s test-taking based on just three scores.

    I can see why you might draw these conclusions from the other examples, but is it really true for this test-taking example? Aren’t tests made up of many questions? Isn’t performance on each question a different data point? Isn’t that (possibly) a lot of data to draw conclusions from?

    Why wouldn’t the following model be (possibly) quite informative? Say that the probability that student i gets question j on test k correct is a function of student i’s base ability plus a random shock to student i’s ability on the day of test k (iid across test days, but the variance of the shock varies across students), plus an iid error-term for each question (iid across questions and students).

    If we conceptualize the three test scores as the aggregate of all test scores for an entire year for three years, wouldn’t the difference in test scores probably be quite informative? If so, why is it different when we aggregate scores across tests?

    • Andrew says:


      If everybody had consistent test scores (e.g., Amy’s scores were 80, 79, 80; Beth’s were 70, 72, 71; etc.) then your point would be valid. But if the variation between tests is as given in the above post (Amy’s scores were 80, 80, 80; Beth’s were 80, 100, 60; etc.) then the point is that what you are calling “random shocks” represent some amount of test-to-test variation, and with only 3 tests per person, it’s hard to get a handle on the variation between people of this test-to-test variation. That is, with only 3 data points, the difference between Amy’s apparent consistency and Beth’s apparent variation can itself easily be explained by noise.

      • Terry says:

        Right. My example actually illustrates your point. Even though there are many questions on each test, there are only three observations of the “shock” on each test date, so the estimation of the variance of the shocks is extremely weak.

        I should have seen that.
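A minimal simulation sketch of the model discussed in this thread. The specific numbers are assumptions chosen for illustration: every simulated student has the same base probability of a correct answer (0.8) and the same test-day shock SD (0.05), and each question is an independent right-or-wrong draw. The function names are made up. The point is to see how much the observed SD across three tests varies between students who are, by construction, equally consistent, and whether adding questions per test helps.

```python
import random
import statistics


def observed_sd(n_questions, rng, base=0.8, shock_sd=0.05, n_tests=3):
    """Sample SD of one student's percent scores across n_tests tests."""
    scores = []
    for _ in range(n_tests):
        # One ability shock per test day, shared by every question on that test.
        p = min(1.0, max(0.0, base + rng.gauss(0.0, shock_sd)))
        correct = sum(rng.random() < p for _ in range(n_questions))
        scores.append(100.0 * correct / n_questions)
    return statistics.stdev(scores)


def percentile_range(n_questions, n_students=500, seed=0):
    """Approximate 5th and 95th percentiles of observed SD across identical students."""
    rng = random.Random(seed)
    sds = sorted(observed_sd(n_questions, rng) for _ in range(n_students))
    k = n_students // 20  # index offset for the 5% tails
    return sds[k], sds[n_students - 1 - k]


for n_questions in (20, 500):
    low, high = percentile_range(n_questions)
    print(f"{n_questions:>3} questions per test: observed SD from {low:.1f} to {high:.1f}")
```

Even with 500 questions per test, which nearly removes question-level noise, the observed SDs of identical students should still differ several-fold, because each student contributes only three draws of the test-day shock.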

        • jim says:

          I think you have to see each test as an event, and consider the many other factors that could impact that event. When you think of it that way the real issue is that 3 scores aren’t enough to average out the impacts of competing events: 1) other course loads; 2) prev knowledge of subj; 3) timing and difficulty of exams in other classes; 4) outside activity (like job) schedule; 5) Personal issues (e.g., family events); 6) other unexpected events (car breakdowns, lost bus card…)…

  2. Corey says:

    The striking thing about these two reviews is that one of them takes note of — or let us say, updates on — the presented data and the other is pure argument from incredulity — or let us say, from the prior. I think both reviewers shared the same prior, in fact…

    • Yes, the positive comments in this case seem to respond to the substance and the negative ones to the premise. In this instance, the positive comments seem to reflect a more careful reading–in that they take note of the data and arguments, as you point out. This is not always the case; the reverse can be true, or positive and negative comments can be of similar quality and depth.

  3. Eric Pedersen says:

    NSERC (the scientific grants agency in Canada) has explicitly taken this into account when evaluating Discovery Grants (which provide baseline funding in Canada). Panel members rank different components of the grant (researcher track record, grant quality, training of qualified personnel) on a qualitative, ordinal scale (with ratings like “moderate”, “strong”, and “exceptional”), then for each component they just take the median rating of the panel. In general, it seems to work well, avoiding problems like one really negative review sinking a grant.

    It seems like papers are often evaluated using an argmin criterion instead: if the worst review says “useless, reject”, the paper ends up rejected.

    • Martha (Smith) says:

      Yes, taking a median here rather than a mean does make sense — since especially with small samples, outliers can have a large effect on the mean. It’s similar to the reasoning of using median rather than mean for evaluating housing prices in a particular locality — even though samples there are not small, a few very expensive houses can affect the mean strongly — and the interest is in what is “typical”.

      (However, I would not use a median for evaluating performance on a single exam — since, in my opinion, a good exam includes problems at a range of difficulties.)

  4. Keith O'Rourke says:

    Andrew, did we not argue that combining requires one to be convinced that there is something common while appropriately being able to allow for what is different?

    Among what these reviewers perceived of the paper – what is common?

    K Pearson went with further inquiry when things seemed very different.

    On the other hand, judgement is something else altogether…

  5. Dale Lehman says:

    The referee comments add another dimension to the problem. The referees add their own variability – on top of the variability in the quality of papers. For the test example, there is variability in student performance as well as variability in the testers’ ability to write exams and grade them. So, three referee reports (and often there are only 2) seem like far too few to determine anything. I remember the first paper I ever submitted (right after grad school) – one referee said that the paper was either obvious or wrong. It was not accepted. Since I no longer care about tenure, I’ve given up on “peer reviewed” publications entirely. The process is just too inconsistent and political. It makes the figure skating competition look like an exercise in objectivity.

  6. Jon O Johnson says:

    Amazon reviews always show a number of 1s. If you only read the negative reviews, you would save a lot of money. I usually look at the shape of the curve – lots of 5s, fewer 4s, on down to a few 1s. I figure this is normal for most products. If the number of 1s sticks out and breaks the pattern of the curve, I probably move on and buy something else. (Amazon gets my money either way.)

    • James Whanger says:

      Jon O Johnson:

      You make an interesting point. Let’s unpack this process that many people may use.

      1. Order a product.
      2. Not happy with the product.
      3. Check the reviews again.
      4. Compare to reviews for product you liked.
      5. Observe more 1’s in the product with which you are unsatisfied.
      6. Conclude that the number of 1’s differentiates products with some level of reliability.

      The value of the ratings function at Amazon is not to get your money, but to provide a method of comparison between products for both the consumer and for Amazon. Given that this is the intended function, the validity of the information is important. In reality it is not possible for such a process to provide even reasonably objective information, but that doesn’t stop us from devising idiosyncratic methods to convince ourselves it does.

      Sources of subjectivity are:

      1. Varying expectations of quality.
      2. Varying amounts of experience with a specific or similar product.
      3. Varying individual thresholds for the quality delta we are willing to accept.
      4. Intentional stuffing of the ballot box by sellers and manufacturers.
      5. Intentional stuffing of the ballot box by competitors.

      Where ratings such as these can be useful is to identify those customers who had addressable problems from a service or quality perspective.

  7. Mike Maltz says:

    It seems to me that you’re looking too closely at the *statistical* aspects of the situation instead of its *contextual* aspects. What did the three tests measure, the same things or different concepts? Was Beth sick when she scored the lowest value? What are the characteristics of the three populations? As I have often said, smell the data before applying your favorite method. And Happy New Year to all.

  8. zbicyclist says:

    Whenever the Olympics rolls around, I find myself fascinated by the difference in how events are scored.

    In some events, it’s the total score with multiple runs. In some events (notably downhill skiing), it’s the best run. In a lot of events, you have to do well in the preliminary heats to make it to the finals, but the scores from the preliminary runs are not counted (e.g. 100 meter dash).

    In individual diving, there are seven judges, with the top two and bottom two scores discarded, and the sum of the remaining three multiplied by the degree of difficulty.
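A minimal sketch of that individual-diving rule as I read it: sort the seven marks, drop two from each end, sum the middle three, and multiply by the degree of difficulty. The function name and the example marks are hypothetical.

```python
def dive_score(judge_scores, difficulty):
    """Score one individual dive from seven judges' marks.

    Drops the two highest and two lowest marks, sums the middle three,
    and multiplies by the degree of difficulty.
    """
    if len(judge_scores) != 7:
        raise ValueError("expected seven judges")
    middle_three = sorted(judge_scores)[2:-2]
    return sum(middle_three) * difficulty


# Hypothetical marks: one harsh judge (4.0) and one generous judge (9.5).
print(dive_score([4.0, 7.5, 8.0, 8.0, 8.5, 8.5, 9.5], difficulty=3.0))
```

Because the extremes are discarded, the harsh 4.0 could just as well have been a 0.0 without changing the result, which is the sense in which this rule protects against a single outlying judge.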

    “For synchronized events, there are 11 judges. Three judge the execution of each diver, and five judge the synchronization. Only the median execution score for each diver is considered, along with the middle three scores for the synchronization, and the sum of these five scores is multiplied by the degree of difficulty. Men, or teams of men, perform six dives each round, while women, or women’s teams, perform five dives. The round is scored by the sum of all dives—that is, each round is cumulative. It doesn’t matter if one dive blows the judges away; divers need to be consistent across all their dives in each round.”

    Decathlon-type events give you a certain number of points depending on performance in each individual event, regardless of your ranking.

    Many of these rules reflect some inherent characteristic of the sport, but a lot seem arbitrary. And it’s the effect of the arbitrary nature of the scoring that is a subject worthy of study.

    A simple example: football (soccer) and hockey typically awarded 2 points for a win, 1 for a tie. But to reduce the number of ties, point systems were changed to reward winning more.

    A thought experiment: what would happen if academic papers were accepted depending on the best review only?
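One way to play with that thought experiment is to apply several aggregation rules to the two sets of ratings from the post (8, 8, 8 and 6, 8, 10) and see who comes out ahead under each. This is only a sketch; the labels "steady" and "spread" and the helper name are made up for illustration.

```python
import statistics

# Ratings from the post: two gymnasts, three judges each.
ratings = {"steady": [8, 8, 8], "spread": [6, 8, 10]}

# Candidate ways to collapse several ratings into one number.
rules = {
    "worst review": min,
    "median": statistics.median,
    "mean": statistics.mean,
    "best review": max,
}


def winner(rule):
    """Return the top-rated name under a rule, or 'tie' if the scores match."""
    scores = {name: rule(values) for name, values in ratings.items()}
    best = max(scores.values())
    leaders = [name for name, score in scores.items() if score == best]
    return leaders[0] if len(leaders) == 1 else "tie"


for label, rule in rules.items():
    print(f"{label:>12}: {winner(rule)}")
```

The same six numbers give three different verdicts: the worst-review rule favors the 8, 8, 8 candidate, the best-review rule favors the 6, 8, 10 candidate, and the median and mean call it a tie.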

  9. Yuling says:

    One thing I find interesting is that the result of such “aggregation” really depends on what the model looks like. If I model individual scores by a normal distribution, there will never be any *contradiction*, as the posterior is always a unimodal normal, or, mathematically, the product of log-concave densities is still log-concave. But if instead I use a Cauchy distribution then I might get a multimodal posterior and it amounts to what we shall call data-data conflict.

    Anyway, this goes back to the old idea that a single number summary is often insufficient for model-evaluation, or maybe student-evaluation and journal reviews.
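A small numerical sketch of the normal-versus-Cauchy contrast, under assumptions made only for illustration: two hypothetical reviews at 0 and 10, unit scale for both models, and a flat prior on the location mu, so the posterior is proportional to the likelihood. The helper names are made up; the code simply counts local maxima of the log-likelihood on a grid.

```python
import math


def count_modes(log_density, data, lo=-5.0, hi=15.0, step=0.01):
    """Count strict local maxima of the log-likelihood of a location mu on a grid."""
    n = int(round((hi - lo) / step)) + 1
    values = [sum(log_density(x - (lo + i * step)) for x in data) for i in range(n)]
    return sum(
        1
        for i in range(1, n - 1)
        if values[i] > values[i - 1] and values[i] > values[i + 1]
    )


def normal_logpdf(z):
    return -0.5 * z * z  # unit scale, additive constants dropped


def cauchy_logpdf(z):
    return -math.log1p(z * z)  # unit scale, additive constants dropped


# Two hypothetical reviews that disagree sharply, on a 0-10 scale.
reviews = [0.0, 10.0]
print("normal modes:", count_modes(normal_logpdf, reviews))
print("cauchy modes:", count_modes(cauchy_logpdf, reviews))
```

With the normal model the two reviews are averaged into a single peak at 5; with the Cauchy model the likelihood keeps one peak near each review, so the disagreement stays visible instead of being smoothed away.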

  10. Kaiser says:

    I came across this specific problem when reading applications for the MS program. We required three references. Most applicants have a good undergrad degree in some STEM field. GPAs as we know are next to useless since there is almost no variability. (Actually, they are *worse* than useless because the variability is explained by things like when they got the degree, whether the school/department uses some kind of grade deflation policy, etc. for which I had incomplete data.) Few of these applicants have any useful work experience.

    At first I thought the reference letters would be useful. Then I realized that most applicants have three good references. Occasionally, an applicant would receive a poor reference but in all cases, the one bad reference is invalidated by the other two good references. I typically don’t know the authors of these references and so have no external info on reliability. I was really struggling with 3 references being too small a sample. One is tempted to think the mere fact that this applicant got one poor review while the majority didn’t is a “signal” but like you said, that would be over-confident in estimating variability!

  11. Nat says:

    Inference about an average from N=3 is tough; inference about variance from N=3 is close to impossible.

    That is, with only 3 data points, the difference between Amy’s apparent consistency and Beth’s apparent variation can itself easily be explained by noise.

    So even though the evidence appears contradictory based on N = 3, it may not actually be contradictory, and therefore we can combine the evidence and conclude that the evidence in fact may or may not be contradictory?
