
What are the standards for reliability in experimental psychology?

An experimental psychologist was wondering about the standards in that field for “acceptable reliability” (when looking at inter-rater reliability in coding data). He wondered, for example, if some variation on signal detectability theory might be applied to adjust for inter-rater differences in criteria for saying some code is present.

What about Cohen’s kappa? The psychologist wrote:

Cohen’s kappa does adjust for “guessing,” but its assumptions are not well motivated, perhaps not any more than adjustments for guessing versus the application of signal detectability theory where that can be applied. But one can’t do a straightforward application of signal detectability theory for reliability in that you don’t know whether the signal is present or not.

I think measurement issues are important but I don’t have enough experience in this area to answer the question without knowing more about the problem that this researcher is working on.

I’m posting it here because I imagine that some of the psychometricians out there might have some comments.


  1. Stephen says:

    I would recommend looking into Doug Steinley’s work on the Hubert-Arabie adjusted Rand index.
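For readers who have not met the adjusted Rand index, here is a minimal sketch of the Hubert-Arabie formula applied to two coders' labels. The function name and the toy data are illustrative assumptions, not taken from Steinley's papers. The index treats each coder's labels as a partition of the items, so it does not care what the categories are called.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(codes_a, codes_b):
    """Hubert-Arabie adjusted Rand index between two labelings of the same items."""
    n = len(codes_a)
    if n != len(codes_b) or n < 2:
        raise ValueError("need two equal-length labelings of at least two items")
    # Pairs of items that share a category under both coders, under A, and under B.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(codes_a, codes_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(codes_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(codes_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)   # chance-expected value of pairs_joint
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. both coders put everything in one category);
        # returning 1.0 here is a convention, not part of the formula.
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)


# Same grouping with the category names swapped: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Groupings that cut across each other: worse than chance.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the index ignores category names, it suits settings where coders sort items into groups more than settings where the codes themselves have fixed meanings.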

  2. Marko says:

    In communication research we are often concerned with inter-rater reliability when doing content analysis of mass media content. The measure I would recommend most is Krippendorff’s alpha. For a short overview see this article by Krippendorff and Hayes:
    More detailed information can be found in Krippendorff’s work on content analysis.

    • Jonathan says:

      Most of the work I have done for Communications professors has used Cohen’s kappa. It has major drawbacks, though, when the distribution of responses is heavily skewed toward one category, so keep that in mind.
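The skewness problem is easy to see with a toy calculation. The sketch below computes Cohen's kappa by hand for two made-up sets of codes that both have 90% raw agreement. The counts and helper names are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' nominal codes on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty code lists")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the product of each rater's marginal proportions.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    if expected == 1.0:
        return float("nan")  # both raters used a single, identical code
    return (observed - expected) / (1 - expected)


def make_codes(both_yes, a_only, b_only, both_no):
    """Expand a 2x2 agreement table into two lists of 0/1 codes (hypothetical helper)."""
    a = [1] * both_yes + [1] * a_only + [0] * b_only + [0] * both_no
    b = [1] * both_yes + [0] * a_only + [1] * b_only + [0] * both_no
    return a, b


balanced = make_codes(45, 5, 5, 45)  # code present about half the time
skewed = make_codes(2, 5, 5, 88)     # code present only rarely

print(round(cohens_kappa(*balanced), 2))
print(round(cohens_kappa(*skewed), 2))
```

Both tables have the raters agreeing on 90 of 100 items, yet kappa is about 0.80 in the balanced case and about 0.23 in the skewed one, because chance agreement is already near 87% when almost everything is coded "absent."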

  3. There are huge problems with Cohen’s kappa and Krippendorff’s alpha. In no particular order: they aren’t predictive of anything, they assume homogeneous annotators, they’re intrinsically pairwise, they treat two wrong answers that agree as equivalent to a right one, and they don’t model the difficulty of the items being coded (some are hard to label, some are easy to label).

    I’ve been building Bayesian hierarchical models for inter-annotator agreement. Similar models have been used in so-called “model-based” epidemiology (and continually reinvented elsewhere), at least since Dawid and Skene’s 1979 paper:

    Here’s a link to a tutorial of mine and Massimo Poesio’s from LREC (linguistic annotation conference) on model-based methods:

    An earlier talk version has my discussion of problems with kappa:

    and an even earlier tech report has a survey of the related “model-based” epidemiology literature:
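To give a flavor of the model-based approach, here is a minimal sketch of the Dawid and Skene model fit by EM. It is the plain maximum-likelihood version with light smoothing, not the Bayesian hierarchical models described in this comment. The function name, the smoothing constant, and the toy data are all assumptions made for illustration, and every item is assumed to have at least one label.

```python
def dawid_skene(labels, n_classes, n_iter=50):
    """Fit a Dawid-Skene model by EM.

    labels: list of (item, rater, code) triples, all integers starting at 0.
    Returns (item posteriors, class prevalence, per-rater confusion matrices).
    """
    n_items = 1 + max(i for i, _, _ in labels)
    n_raters = 1 + max(r for _, r, _ in labels)

    # Initialize each item's class probabilities from its vote shares.
    post = [[0.0] * n_classes for _ in range(n_items)]
    for i, _, k in labels:
        post[i][k] += 1.0
    post = [[v / sum(row) for v in row] for row in post]

    for _ in range(n_iter):
        # M-step: class prevalence and each rater's confusion matrix,
        # confusion[r][c][k] = P(rater r says k | true class c), lightly smoothed.
        prevalence = [sum(row[c] for row in post) / n_items for c in range(n_classes)]
        confusion = [[[0.01] * n_classes for _ in range(n_classes)]
                     for _ in range(n_raters)]
        for i, r, k in labels:
            for c in range(n_classes):
                confusion[r][c][k] += post[i][c]
        for r in range(n_raters):
            for c in range(n_classes):
                total = sum(confusion[r][c])
                confusion[r][c] = [v / total for v in confusion[r][c]]

        # E-step: posterior over each item's true class given all its labels.
        post = [list(prevalence) for _ in range(n_items)]
        for i, r, k in labels:
            for c in range(n_classes):
                post[i][c] *= confusion[r][c][k]
        post = [[v / sum(row) for v in row] for row in post]

    return post, prevalence, confusion


# Toy data: raters 0 and 1 agree on every item, rater 2 disagrees on items 0 and 3.
codes_by_rater = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
]
labels = [(i, r, k) for r, codes in enumerate(codes_by_rater)
          for i, k in enumerate(codes)]
post, prevalence, confusion = dawid_skene(labels, n_classes=2)
print([row.index(max(row)) for row in post])      # most probable class per item
print(confusion[0][1][1], confusion[2][1][1])     # estimated hit rates, raters 0 and 2
```

Unlike kappa, the fitted model gives each rater their own error rates and each item a probability for its true class, so a noisy rater is down-weighted instead of being averaged in on equal terms.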

    • K? O'Rourke says:

      Excellent comment!

      I had to deal with this a few years ago and it was interesting to quickly check the overlap of your survey of the epi literature. Think I mainly went with Qu et al informed by Albert and Dodd’s cautionary tale.

      If I have to deal with this again, I’ll start with your links here.

  4. Jon Baron says:

    My usual answer to questions like this is, “All generalizations are false.” And I think that is appropriate here. The answer to the question depends on the purpose. If, for example, you have a measure that is theoretically exactly right, and your hypothesis depends on this measure being correlated with something else, then a very low reliability (assessed, e.g., with intraclass correlation) will suffice. (Psychologists do test null hypotheses, because they work very hard to design studies in which the null hypothesis will be exactly true if their alternative hypothesis is false.) If you want a test that you will use to reject people for jobs, then you want something else.

  5. John says:

    Stradivarius’s constant is most often used in practice…

  6. Matthias says:

    “Psychometric Theory” by Nunnally and Bernstein has a whole chapter on reliability.

  7. Antony Unwin says:

    I’m surprised no one has mentioned graphics so far. The measures are difficult to evaluate and interpret on their own. Graphics can often help to show what is going on and they complement the measures nicely. My group had a paper at the 2011 ISI meeting on this, using the agreement between the three major financial rating agencies on their ratings of countries as the main example.

  8. Gennady says:

    There’s a small section in “Elementary Signal Detection Theory” by Wickens on reliability in signal detection theory.

    If you’re rating something subjective (e.g., pretty vs. not pretty), then signal detection theory won’t help.

    If not, one possibility might be to compute the criterion, sensitivity and bias and the corresponding variances for the average rater and then make confidence intervals. However, I can’t think of any papers that have done this. There are lots of assumptions you need to make when aggregating these sorts of data and your standard errors are very likely to be too small. Classic applications of signal detection theory to psychology are usually at the individual level.
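As a rough illustration of the per-rater quantities, here is a minimal sketch of sensitivity (d') and criterion under the equal-variance Gaussian model. It assumes you have a subset of items whose true status is known, which is exactly what is often missing in reliability studies. The 0.5 added to each cell is one common correction for hit or false-alarm rates of 0 or 1 and is my choice here. The counts are invented.

```python
from statistics import NormalDist


def sdt_indices(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) for one rater from counts on items of known status."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    # Add 0.5 to each cell so that rates of exactly 0 or 1 do not give infinite z.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))  # negative = liberal, positive = strict
    return d_prime, criterion


# Hypothetical counts for two raters coding 50 "present" and 50 "absent" items.
neutral_rater = sdt_indices(hits=40, misses=10, false_alarms=10, correct_rejections=40)
liberal_rater = sdt_indices(hits=48, misses=2, false_alarms=25, correct_rejections=25)
print(neutral_rater)
print(liberal_rater)
```

Two raters can disagree often because their criteria differ even when their sensitivities are similar, which is the distinction a single agreement coefficient cannot make. Getting honest standard errors for an "average rater" would need more than this.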

  9. Ilya Goldin says:

    Some very helpful references are:

    Lyle D. Broemeling. (2009) Bayesian Methods for Measures of Agreement

    Mohamed M. Shoukri. (2010) Measures of Interobserver Agreement and Reliability