What are the standards for reliability in experimental psychology?

An experimental psychologist was wondering about the standards in that field for “acceptable reliability” (when looking at inter-rater reliability in coding data). He wondered, for example, if some variation on signal detectability theory might be applied to adjust for inter-rater differences in criteria for saying some code is present.

What about Cohen’s kappa? The psychologist wrote:

Cohen’s kappa does adjust for “guessing,” but its assumptions are not well motivated, perhaps not any more than adjustments for guessing versus the application of signal detectability theory where that can be applied. But one can’t do a straightforward application of signal detectability theory for reliability in that you don’t know whether the signal is present or not.

I think measurement issues are important but I don’t have enough experience in this area to answer the question without knowing more about the problem that this researcher is working on.

I’m posting it here because I imagine that some of the psychometricians out there might have some comments.

11 thoughts on “What are the standards for reliability in experimental psychology?”

  1. In communication research we are often concerned with inter-rater reliability when doing content analysis of mass media content. The most widely recommended measure is Krippendorff’s alpha. For a short overview, see this article by Krippendorff and Hayes: http://www.afhayes.com/public/cmm2007.pdf
    More detailed information can be found in Krippendorff’s work on content analysis.

    • Most of the work I have done for Communications professors has used Cohen’s kappa. A warning, though: kappa has major drawbacks when the distribution of codes is heavily skewed, so keep that in mind.
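
      A small numeric sketch (mine, with invented data) of how the skew bites: two raters who agree on 94% of items can still come out with a kappa below 0.4 when the code is rare. The helper below is just the textbook kappa formula in numpy; for Krippendorff’s alpha, the krippendorff package on PyPI is one option.

      ```python
      import numpy as np

      def cohens_kappa(r1, r2):
          """Cohen's kappa for two raters and nominal codes (textbook formula)."""
          r1, r2 = np.asarray(r1), np.asarray(r2)
          observed = np.mean(r1 == r2)
          # chance agreement from the two raters' marginal code frequencies
          expected = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in np.union1d(r1, r2))
          return (observed - expected) / (1.0 - expected)

      # 100 items; each rater marks only 5 as "present", overlapping on 2 of them
      r_a = np.zeros(100, dtype=int); r_a[:5] = 1
      r_b = np.zeros(100, dtype=int); r_b[3:8] = 1
      print(np.mean(r_a == r_b))     # 0.94 raw agreement -- looks great
      print(cohens_kappa(r_a, r_b))  # about 0.37 -- much less reassuring
      ```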

  2. There are huge problems with Cohen’s kappa and Krippendorff’s alpha. In no particular order: they aren’t predictive of anything, they assume homogeneous annotators, they’re intrinsically pairwise, two raters agreeing on a wrong answer counts the same as agreeing on the right one, and they don’t model the difficulty of the items being coded (some are hard to label, some are easy).

    I’ve been building Bayesian hierarchical models for inter-annotator agreement. Similar models have been used in so-called “model-based” epidemiology (and continually reinvented elsewhere), at least since Dawid and Skene’s 1979 paper (a toy numerical sketch of that model appears after this comment):

    http://www.jstor.org/pss/2346806

    Here’s a link to a tutorial of mine and Massimo Poesio’s from LREC (the Language Resources and Evaluation Conference) on model-based methods:

    http://lingpipe-blog.com/2010/05/17/lrec-2010-tutorial-modeling-data-annotation/

    An earlier talk version has my discussion of problems with kappa:

    http://lingpipe.files.wordpress.com/2008/04/ed-2010-slides.pdf

    and an even earlier tech report has a survey of the related “model-based” epidemiology literature:

    http://lingpipe.files.wordpress.com/2008/11/carp-bayesian-multilevel-annotation.pdf

    • Excellent comment!

      I had to deal with this a few years ago, and it was interesting to quickly check your survey of the epi literature for overlap. I think I mainly went with Qu et al., informed by Albert and Dodd’s cautionary tale.

      If I have to deal with this again, I’ll start with your links here.
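
    For readers who want to see what a model-based alternative looks like in code, here is a minimal numpy sketch of EM for the Dawid and Skene (1979) model; the function and variable names are invented for illustration, missing labels are not handled, and the Bayesian hierarchical models linked above go well beyond this.

    ```python
    import numpy as np

    def dawid_skene(labels, n_classes, n_iter=50, smoothing=0.01):
        """EM for the Dawid & Skene (1979) model of annotator agreement.

        labels: int array, shape (n_items, n_raters), with every cell observed
        (handling missing labels is omitted to keep the sketch short).
        Returns (class posteriors, prevalence, confusion), where
        confusion[j, k, l] = P(rater j reports code l | true code is k).
        """
        n_items, n_raters = labels.shape
        onehot = np.eye(n_classes)[labels]   # (items, raters, classes)
        post = onehot.mean(axis=1)           # init posteriors with vote shares

        for _ in range(n_iter):
            # M-step: prevalence of each code and a confusion matrix per rater
            prevalence = post.mean(axis=0)
            confusion = np.einsum('ik,ijl->jkl', post, onehot) + smoothing
            confusion /= confusion.sum(axis=2, keepdims=True)

            # E-step: posterior over each item's latent true code
            log_post = np.tile(np.log(prevalence), (n_items, 1))
            for j in range(n_raters):
                log_post += np.log(confusion[j][:, labels[:, j]]).T
            post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
            post /= post.sum(axis=1, keepdims=True)

        return post, prevalence, confusion

    # toy usage: 30 items, 3 raters, a binary code, labels drawn at random
    rng = np.random.default_rng(0)
    toy = rng.integers(0, 2, size=(30, 3))
    posteriors, prevalence, confusion = dawid_skene(toy, n_classes=2)
    ```

    For a binary code, each rater’s estimated confusion matrix gives per-rater sensitivity and specificity of the kind the original question’s signal-detection framing was after, without requiring a known gold standard.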

  3. My usual answer to questions like this is, “All generalizations are false.” And I think that is appropriate here. The answer to the question depends on the purpose. If, for example, you have a measure that is theoretically exactly right, and your hypothesis depends on this measure being correlated with something else, then even a very low reliability (e.g., as assessed with an intraclass correlation) will suffice. (Psychologists do test null hypotheses, because they work very hard to design studies in which the null hypothesis will be exactly true if their alternative hypothesis is false.) If you want a test that you will use to reject people for jobs, then you want something else.
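
    The “low reliability can suffice” point can be made concrete with Spearman’s classical attenuation formula (my illustration, with made-up numbers): the correlation you observe between two noisy measures is the true correlation shrunk by the square roots of their reliabilities.

    ```python
    import numpy as np

    def observed_corr(true_corr, rel_x, rel_y):
        """Spearman's attenuation: expected correlation between two noisy measures."""
        return true_corr * np.sqrt(rel_x * rel_y)

    # even with inter-rater reliability of only 0.4 for the coded measure,
    # a true correlation of 0.5 still shows up at about 0.30
    print(observed_corr(true_corr=0.5, rel_x=0.4, rel_y=0.9))
    ```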

  4. I’m surprised no one has mentioned graphics so far. The measures are difficult to evaluate and interpret on their own. Graphics can often help to show what is going on and they complement the measures nicely. My group had a paper at the 2011 ISI meeting on this, using the agreement between the three major financial rating agencies on their ratings of countries as the main example (isi2011.congressplanner.eu/showabstract.php?congress=ISI2011&id=1290).
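
    The paper’s specific displays are not reproduced here; as a generic illustration of the idea, a cross-rater agreement table can be drawn as an annotated heatmap (the ratings below are invented, and the three-category scale is only a stand-in for the agencies’ finer scales).

    ```python
    import numpy as np
    import matplotlib.pyplot as plt

    categories = ["high", "medium", "low"]             # stand-in rating scale
    rater1 = np.array([0, 0, 1, 2, 1, 0, 2, 1, 0, 2])  # invented ratings
    rater2 = np.array([0, 1, 1, 2, 1, 0, 2, 2, 0, 2])

    # cross-tabulate: rows are rater 1's codes, columns are rater 2's
    table = np.zeros((3, 3), dtype=int)
    for a, b in zip(rater1, rater2):
        table[a, b] += 1

    fig, ax = plt.subplots()
    ax.imshow(table, cmap="Blues")
    ax.set_xticks(range(3)); ax.set_xticklabels(categories)
    ax.set_yticks(range(3)); ax.set_yticklabels(categories)
    ax.set_xlabel("rater 2"); ax.set_ylabel("rater 1")
    for i in range(3):
        for j in range(3):
            ax.text(j, i, table[i, j], ha="center", va="center")
    plt.show()
    ```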

  5. There’s a small section in “Elementary Signal Detection Theory” by Wickens on reliability in signal detection theory.

    If you’re rating something subjective (e.g., pretty vs. not pretty), then signal detection theory won’t help.

    If not, one possibility might be to compute sensitivity (d′) and the criterion (response bias) for the average rater, along with the corresponding variances, and then construct confidence intervals. However, I can’t think of any papers that have done this. There are lots of assumptions you need to make when aggregating these sorts of data, and your standard errors are very likely to be too small. Classic applications of signal detection theory in psychology are usually at the individual level.
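
    If a gold standard (or a model standing in for one) is available, so that hits and false alarms are defined, the basic equal-variance quantities are easy to compute; the counts below are hypothetical, and adding 0.5 to each cell is just one common correction that keeps the z-transform finite.

    ```python
    from scipy.stats import norm

    def dprime_and_criterion(hits, misses, false_alarms, correct_rejections):
        """Equal-variance Gaussian SDT: sensitivity d' and criterion c from counts."""
        hit_rate = (hits + 0.5) / (hits + misses + 1.0)
        fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
        z_hit, z_fa = norm.ppf(hit_rate), norm.ppf(fa_rate)
        return z_hit - z_fa, -0.5 * (z_hit + z_fa)

    # hypothetical counts for one rater scored against a gold standard
    print(dprime_and_criterion(hits=40, misses=10, false_alarms=5, correct_rejections=45))
    ```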

  6. Some very helpful references are:

    Lyle D. Broemeling. (2009) Bayesian Methods for Measures of Agreement

    Mohamed M. Shoukri. (2010) Measures of Interobserver Agreement and Reliability
