Statistical Methods for Rater and Diagnostic Agreement

This site is a resource for the analysis of agreement among raters, diagnostic tests, observers, judges or experts. It contains background discussion on different methods, examples, references, software, and information on recent methodological developments.



Basic Considerations

In many fields it is common to study agreement among ratings of multiple judges, experts, diagnostic tests, etc. We are concerned here with categorical ratings: dichotomous (Yes/No, Present/Absent, etc.), ordered categorical (Low, Medium, High, etc.), and nominal (Schizophrenic, Bi-Polar, Major Depression, etc.) ratings. Likert-type ratings, which are intermediate between ordered-categorical and interval-level ratings, are also considered.

There is little consensus about what statistical methods are best to analyze rater agreement. (We will use the generic words "raters" and "ratings" here to include observers, judges, diagnostic tests, etc., and their ratings/results.) To the non-statistician, the number of alternatives and the lack of consistency in the literature are no doubt cause for concern. This site aims to reduce confusion and help researchers select appropriate methods for their applications.

Despite the many apparent options for analyzing agreement data, the basic issues are very simple. Usually there are one or two methods best for a particular application. But it is necessary to clearly identify the purpose of analysis and the substantive questions to be answered.

 Know the goals


The most common mistake made when analyzing agreement data is not having an explicit goal. It is not enough for the goal to be "measuring agreement" or "finding out if raters agree." There is presumably some reason why one wants to measure agreement. Which statistical method is best depends on this reason.

For example, rating agreement studies are often used to evaluate a new rating system or instrument. If such a study is being conducted during the development phase of the instrument, one may wish to analyze the data using methods that identify how the instrument could be changed to improve agreement. However, if an instrument is already in its final format, the same methods might not be helpful.

Very often agreement studies are an indirect attempt to validate a new rating system or instrument. That is, lacking a definitive criterion variable or "gold standard," the accuracy of a scale or instrument is assessed by comparing its results when used by different raters. Here one may wish to use methods that address the issue of real concern--how well do ratings reflect the true trait one wants to measure?

In other situations one may be considering combining the ratings of two or more raters to obtain evaluations of suitable accuracy. If so, again, specific methods suitable for this purpose should be used.

 Consider theory


A second common problem in analyzing agreement is the failure to think about the data from the standpoint of theory. Nearly all statistical methods for analyzing agreement make assumptions. If one has not thought about the data from a theoretical point of view, it will be hard to select an appropriate method. The theoretical questions one asks do not need to be complicated. Even simple questions help, such as: is the trait being measured really discrete, like presence/absence of a pathogen, or is the trait really continuous and being divided into discrete levels (e.g., "low," "medium," "high") for convenience? If the latter, is it reasonable to assume that the trait is normally distributed? Or is some other distribution plausible?

Sometimes one will not know the answers to these questions. That is fine, too, because there are methods suitable for that case also. The main point is to be inclined to think about data in this way, and to be attuned to the issue of matching method and data on this basis.

These two issues, knowing one's goals and considering theory, are the main keys to successful analysis of agreement data. Following are some other, more specific issues that pertain to the selection of methods appropriate to a given study.

 Reliability vs. validity


One can broadly distinguish two reasons for studying rating agreement. Sometimes the goal is to estimate the validity (accuracy) of ratings in the absence of a "gold standard." This is a reasonable use of agreement data: if two ratings disagree, then at least one of them must be incorrect. Proper analysis of agreement data therefore permits certain inferences about how likely a given rating is to be correct.

Other times one merely wants to know the consistency of ratings made by different raters. In some cases, the issue of accuracy may even have no meaning; for example, ratings may concern opinions, attitudes, or values.

 Modeling vs. description


One should also distinguish between modeling vs. describing agreement. Ultimately, there are only a few simple ways to describe the amount of agreement: for example, the proportion of times two ratings of the same case agree, the proportion of times raters agree on specific categories, the proportions of times different raters use the various rating levels, etc.
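
As a minimal sketch of these descriptive quantities for two raters, the following Python function computes the overall proportion of agreement, the agreement specific to each category, and each rater's base rates from a square cross-classification table. The function name `describe_agreement` and the counts are hypothetical, chosen only for illustration.

```python
def describe_agreement(table):
    """Describe agreement in a square two-rater table.

    table[i][j] is the number of cases that rater 1 placed in
    category i and rater 2 placed in category j.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(table[i]) for i in range(k)]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    # Proportion of cases on which the two ratings agree.
    overall = sum(table[i][i] for i in range(k)) / n

    # Agreement specific to category i: of all uses of category i
    # by either rater, the share matched by the other rater.
    specific = []
    for i in range(k):
        uses = row_totals[i] + col_totals[i]
        specific.append(2 * table[i][i] / uses if uses else None)

    return {
        "overall": overall,
        "specific": specific,
        "base_rates_rater1": [r / n for r in row_totals],
        "base_rates_rater2": [c / n for c in col_totals],
    }


# Hypothetical counts for a Yes/No rating of 100 cases.
example = [[40, 10],
           [5, 45]]
summary = describe_agreement(example)
print(summary)
```

For these made-up counts the overall agreement is 0.85, and the two raters use "Yes" for 50% and 45% of cases respectively. A category that neither rater uses has no defined specific agreement, so the sketch reports `None` for it.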

The quantification of agreement in any other way inevitably involves a model about how ratings are made and why raters agree or disagree. This model is either explicit, as with latent structure models, or implicit, as with the kappa coefficient. With this in mind, two basic principles are evident:

  • It is better to have a model that is explicitly understood than one which is only implicit and potentially not understood.
  • The model should be testable.
Methods vary with respect to how well they meet these two criteria.

 Components of disagreement


Consider that disagreement has different components. With ordered-category (including dichotomous) ratings, one can distinguish between two different sources of disagreement. Raters may differ: (a) in the definition of the trait itself; or (b) in their definitions of specific rating levels or categories.

A trait definition can be thought of as a weighted composite of several variables. Different raters may define or understand the trait as different weighted combinations. For example, to one rater Intelligence may mean 50% verbal skill and 50% mathematical skill; to another it may mean 33% verbal skill, 33% mathematical skill, and 33% motor skill. Thus their essential definitions of what the trait means differ. Similarity in raters' trait definitions can be assessed with various estimates of the correlation of their ratings, or analogous measures of association.

Category definitions, on the other hand, differ because raters divide the trait into different intervals. For example, by "low skill" one rater may mean subjects from the 1st to the 20th percentile. Another rater, though, may take it to mean subjects from the 1st to the 10th percentile. When this occurs, rater thresholds can usually be adjusted to improve agreement. Similarity of category definitions is reflected as marginal homogeneity between raters. Marginal homogeneity means that the frequencies (or, equivalently, the "base rates") with which two raters use various rating categories are the same.

Because disagreement on trait definition and disagreement on rating category widths are distinct components of disagreement, with different practical implications, a statistical approach to the data should ideally quantify each separately.
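
The percentile example above can be made concrete with a deliberately idealized Python sketch. It assumes, hypothetically, one case at each percentile and two raters who order the cases identically (the same trait definition) but place the "low" cutoff at the 20th and 10th percentile respectively. All names are illustrative.

```python
def rate(percentile, low_cutoff):
    """Label a case 'low' if its trait percentile is at or below the cutoff."""
    return "low" if percentile <= low_cutoff else "not low"


def proportion_agreeing(ratings_a, ratings_b):
    """Share of cases on which two lists of ratings match."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


cases = range(1, 101)  # one hypothetical case at each percentile

# Same ordering of cases, different category definitions.
rater_a = [rate(p, 20) for p in cases]
rater_b = [rate(p, 10) for p in cases]

base_rate_a = rater_a.count("low") / len(rater_a)          # 0.20
base_rate_b = rater_b.count("low") / len(rater_b)          # 0.10
agreement_before = proportion_agreeing(rater_a, rater_b)   # 0.90

# Moving rater B's cutoff to match rater A removes the disagreement.
rater_b_adjusted = [rate(p, 20) for p in cases]
agreement_after = proportion_agreeing(rater_a, rater_b_adjusted)  # 1.0

print(base_rate_a, base_rate_b, agreement_before, agreement_after)
```

Here all of the disagreement comes from category definitions, which shows up as unequal base rates. With real raters, some disagreement would remain after aligning thresholds, because perceptions of the trait also differ.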

 Keep it simple


All other things being equal, a simpler statistical method is preferable to a more complicated one. Very basic methods can reveal far more about agreement data than is commonly realized. For the most part, advanced methods are complements to, not substitutes for, simple methods.

An example

To illustrate these principles, consider the example of rater agreement on screening mammograms, a diagnostic imaging method for detecting possible breast cancer. Radiologists often score mammograms on a scale such as "no cancer," "benign cancer," "possible malignancy," or "malignancy." Many studies have examined rater agreement on applying these categories to the same set of images.

In choosing a suitable statistical approach, one would first consider theoretical aspects of the data. The trait being measured, degree of evidence for cancer, is continuous. So the actual rating levels would be viewed as somewhat arbitrary discretizations of the underlying trait. A reasonable view is that, in the mind of a rater, the overall weight of evidence for cancer is an aggregate composed of various physical image features and weights attached to each feature. Raters may vary in terms of which features they notice and the weights they associate with each.

One would also consider the purpose of analyzing the data. In this application, the purpose of studying rater agreement is not usually to estimate the accuracy of ratings by a single rater. That can be done directly in a validity study, which compares ratings to a definitive diagnosis made from a biopsy.

Instead, the aim is more to understand the factors that cause raters to disagree, with an ultimate goal of improving their consistency and accuracy. For this, one should separately assess whether raters have the same definition of the basic trait (that is, whether different raters weight various image features similarly) and whether they have similar widths for the various rating levels. The former can be accomplished with, for example, latent trait models. Moreover, latent trait models are consistent with the theoretical assumptions about the data noted above. Raters' rating category widths can be studied by visually representing raters' rates of use for the different rating levels and/or their thresholds for the various levels, and statistically comparing them with tests of marginal homogeneity.

Another possibility would be to examine whether some raters are biased such that they make generally higher or lower ratings than other raters. One might also note which images are the subject of the most disagreement and then try to identify the specific image features that cause the disagreement.

Such steps can help one identify specific ways to improve ratings. For example, raters who seem to define the trait much differently than other raters, or use a particular category too often, can have this pointed out to them, and this feedback may promote their making ratings in a way more consistent with other raters.


Recommended Methods

This section suggests statistical methods suitable for various levels of measurement based on the principles outlined above. These are general guidelines only--it follows from the discussion that no one method is best for all applications. But these suggestions will at least give the reader an idea of where to start.

Dichotomous data

  • Two raters

  • Multiple raters

    • Assess raw agreement, overall and specific to each category.
    • Calculate the appropriate intraclass correlation for the data. If different raters are used for each subject, an alternative is the Fleiss kappa.
    • If the trait being rated is assumed to be latently discrete, consider use of latent class models.
    • If the trait being rated can be interpreted as latently continuous, latent trait models can be used to assess association among raters and to estimate the correlation of ratings with the true trait; these models can also be used to assess marginal homogeneity.
    • In some cases latent class and latent trait models can be used to estimate the accuracy (e.g., Sensitivity and Specificity) of diagnostic ratings even when a 'gold standard' is lacking.
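
For the multiple-rater case, the Fleiss kappa mentioned above can be computed directly from a table of category counts per subject. The following is a minimal sketch using the standard formula; the function name `fleiss_kappa` and the counts are hypothetical, and the sketch assumes every subject receives the same number of ratings.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa.

    counts[i][j] is the number of raters who assigned subject i to
    category j. Every subject must have the same number of ratings.
    """
    n_subjects = len(counts)
    n_ratings = sum(counts[0])
    if n_ratings < 2 or any(sum(row) != n_ratings for row in counts):
        raise ValueError("each subject needs the same number (>= 2) of ratings")

    # Observed agreement: share of agreeing rater pairs, averaged over subjects.
    per_subject = [
        (sum(c * c for c in row) - n_ratings) / (n_ratings * (n_ratings - 1))
        for row in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Chance agreement computed from the pooled category proportions.
    n_categories = len(counts[0])
    pooled = [
        sum(row[j] for row in counts) / (n_subjects * n_ratings)
        for j in range(n_categories)
    ]
    p_chance = sum(p * p for p in pooled)

    # Undefined (division by zero) if every rating falls in one category.
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical data: 4 subjects, 3 ratings each, 2 categories.
example_counts = [[3, 0],
                  [2, 1],
                  [0, 3],
                  [1, 2]]
kappa = fleiss_kappa(example_counts)
print(round(kappa, 4))
```

For these made-up counts the observed agreement is 2/3 and the chance agreement is 1/2, giving a kappa of 1/3.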

Ordered-category data

  • Two raters

    • Use weighted kappa with Fleiss-Cohen (quadratic) weights; note that quadratic weights are not the default with SAS and you must specify (WT=FC) with the AGREE option in PROC FREQ.
    • Alternatively, estimate the intraclass correlation.
    • Ordered rating levels often imply a latently continuous trait; if so, measure association between the raters with the polychoric correlation or one of its generalizations.
    • Test overall marginal homogeneity using the Stuart-Maxwell test or the Bhapkar test.
    • Test (a) for differences in rater thresholds associated with each rating category and (b) for a difference between the raters' overall bias using the respectively applicable McNemar tests.
    • Optionally, use graphical displays to visually compare the proportion of times raters use each category (base rates).
    • Consider association models and related methods for ordered category data. (See Agresti A., Categorical Data Analysis, New York: Wiley, 2002).
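
As a rough illustration of the first suggestion in this list, weighted kappa with quadratic (Fleiss-Cohen) weights can be computed from a two-rater table as one minus the ratio of observed to chance-expected weighted disagreement. The function name `weighted_kappa`, its `weights` argument, and the counts below are hypothetical; this sketch is not a substitute for a statistical package, which would also report standard errors.

```python
def weighted_kappa(table, weights="quadratic"):
    """Kappa for a square two-rater table with ordered categories.

    table[i][j] is the number of cases rated i by rater 1 and j by rater 2.
    weights="quadratic" penalizes a disagreement by its squared distance
    (Fleiss-Cohen); weights="unweighted" gives Cohen's unweighted kappa.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(table[i]) for i in range(k)]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        if weights == "quadratic":
            return ((i - j) / (k - 1)) ** 2
        if weights == "unweighted":
            return 0.0 if i == j else 1.0
        raise ValueError("weights must be 'quadratic' or 'unweighted'")

    # Weighted disagreement actually observed.
    observed = sum(disagreement(i, j) * table[i][j]
                   for i in range(k) for j in range(k)) / n
    # Weighted disagreement expected if the raters were independent.
    expected = sum(disagreement(i, j) * row_totals[i] * col_totals[j]
                   for i in range(k) for j in range(k)) / (n * n)
    return 1 - observed / expected


# Hypothetical counts for a Low/Medium/High rating of 80 cases.
example = [[20, 5, 0],
           [5, 20, 5],
           [0, 5, 20]]
kappa_quadratic = weighted_kappa(example)
kappa_unweighted = weighted_kappa(example, weights="unweighted")
print(round(kappa_quadratic, 4), round(kappa_unweighted, 4))
```

For these made-up counts the quadratic-weighted value is 0.8, while the unweighted value is about 0.62. The gap arises because every disagreement in the table is between adjacent categories, which quadratic weights penalize only lightly.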

  • Multiple raters

    • Estimate the intraclass correlation.
    • Test for differences in rater bias using ANOVA or the Friedman test.
    • Use latent trait analysis as a multi-rater generalization of the polychoric correlation. Latent trait models can also be used to test for differences among raters in individual rating category thresholds.
    • Graphically examine and compare rater base rates and/or thresholds for various rating categories.
    • Alternatively, consider each pair of raters and proceed as described for two raters.
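
The first two suggestions in this list can be sketched together from a two-way ANOVA decomposition of a complete subjects-by-raters table. The code below assumes one particular intraclass correlation, the two-way random-effects, single-rating, absolute-agreement form often labeled ICC(2,1); other forms may suit other designs. It also returns the rater means and the F ratio for the rater effect as a simple check on rater bias. The function name is hypothetical, and the data are the six-subject, four-rater example widely reproduced from Shrout and Fleiss (1979).

```python
def two_way_icc(ratings):
    """ICC(2,1) plus rater means and the F ratio for rater differences.

    ratings[i][j] is the numeric rating of subject i by rater j;
    the table must be complete (every rater rates every subject).
    """
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for subjects, raters, and the residual.
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_raters = n * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_subjects - ss_raters

    ms_subjects = ss_subjects / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    icc = (ms_subjects - ms_error) / (
        ms_subjects + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )
    f_raters = ms_raters / ms_error  # compare with F(k-1, (n-1)(k-1))
    return icc, rater_means, f_raters


ratings = [[9, 2, 5, 8],
           [6, 1, 3, 2],
           [8, 4, 6, 8],
           [7, 1, 2, 6],
           [10, 5, 6, 9],
           [6, 2, 4, 7]]
icc, rater_means, f_raters = two_way_icc(ratings)
print(round(icc, 2), [round(m, 2) for m in rater_means], round(f_raters, 1))
```

For this data set the ICC is about 0.29. The large F ratio reflects the clearly different rater means, which is the kind of rater bias the ANOVA or Friedman test is meant to detect. A p-value for the F ratio requires an F table or a statistics library.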

Nominal data

  • Two raters

    • Assess raw agreement, overall and specific to each category.
    • Use the p-value of Cohen's unweighted kappa to verify that raters agree more than chance alone would predict.
    • Often (perhaps usually), disregard the actual magnitude of kappa here; it is problematic with nominal data because ordinarily one can neither assume that all types of disagreement are equally serious (unweighted kappa) nor choose an objective set of differential disagreement weights (weighted kappa). If, however, it is genuinely true that all pairs of rating categories are equally "disparate", then the magnitude of Cohen's unweighted kappa can be interpreted as a form of intraclass correlation.
    • Test overall marginal homogeneity using the Stuart-Maxwell test or the Bhapkar test.
    • Test marginal homogeneity relative to individual categories using McNemar tests.
    • Consider use of latent class models.
    • Another possibility is use of loglinear, association, or quasi-symmetry models.
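
The two marginal homogeneity tests named in this list can be sketched as follows. The code assumes the standard uncorrected forms: the Stuart-Maxwell chi-square on k-1 degrees of freedom for the table as a whole, and a McNemar chi-square on 1 degree of freedom for each category after collapsing the table to "this category vs. all others". Function names and counts are hypothetical, and no adjustment is made for running several McNemar tests.

```python
import math


def _margins(table):
    k = len(table)
    rows = [sum(table[i]) for i in range(k)]
    cols = [sum(table[i][j] for i in range(k)) for j in range(k)]
    return rows, cols


def _solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: test is not defined")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def stuart_maxwell(table):
    """Return (chi-square, degrees of freedom) for overall marginal homogeneity."""
    k = len(table)
    rows, cols = _margins(table)
    # Differences in marginal totals, dropping the last (redundant) category.
    d = [rows[i] - cols[i] for i in range(k - 1)]
    s = [[rows[i] + cols[i] - 2 * table[i][i] if i == j
          else -(table[i][j] + table[j][i])
          for j in range(k - 1)] for i in range(k - 1)]
    x = _solve(s, d)
    return sum(di * xi for di, xi in zip(d, x)), k - 1


def mcnemar_by_category(table):
    """Return a (chi-square, p-value) pair for each category."""
    rows, cols = _margins(table)
    results = []
    for i in range(len(table)):
        b = rows[i] - table[i][i]  # rater 1 used category i, rater 2 did not
        c = cols[i] - table[i][i]  # rater 2 used category i, rater 1 did not
        if b + c == 0:
            results.append((None, None))
            continue
        chi2 = (b - c) ** 2 / (b + c)
        # Upper tail of chi-square with 1 df, via the normal distribution.
        results.append((chi2, math.erfc(math.sqrt(chi2 / 2))))
    return results


# Hypothetical counts for three nominal categories, 80 cases.
example = [[20, 10, 0],
           [2, 20, 8],
           [0, 2, 18]]
overall_chi2, overall_df = stuart_maxwell(example)
per_category = mcnemar_by_category(example)
print(round(overall_chi2, 3), overall_df)
print([(round(c, 3), round(p, 4)) for c, p in per_category])
```

For these made-up counts the overall statistic is about 8.93 on 2 degrees of freedom. The per-category tests suggest that the raters differ mainly in how often they use the first category.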

  • Multiple raters

    • Assess raw agreement, overall and specific to each category.
    • If different raters are used for different subjects, use the Fleiss kappa statistic; again, as with nominal data/two raters, attend only to the p-value of the test unless one has a genuine basis for regarding all pairs of rating categories as equally "disparate".
    • Use latent class modeling. Conditional tests of marginal homogeneity can be made within the context of latent class modeling.
    • Use graphical displays to visually compare the proportion of times raters use each category (base rates).
    • Alternatively, consider each pair of raters individually and proceed as described for two raters.

Likert-type items

Very often, Likert-type items can be assumed to produce interval-level data. (By a "Likert-type item" here we mean one where the format clearly implies to the rater that rating levels are evenly spaced, such as the following.)

    lowest                                    highest
    |-------|-------|-------|-------|-------|-------|
    1       2       3       4       5       6       7
               (circle level that applies)

  • Two raters

  • Multiple raters

    • Perform a one-factor common factor analysis; examine/report the correlation of each rater with the common factor (for details, see the section Methods for Likert-type or interval-level data).
    • Test for differences in rater bias using two-way ANOVA models.
    • Possibly estimate the intraclass correlation.
    • Use histograms to describe raters' marginal distributions.
    • If greater detail is required, consider each pair of raters and proceed as described for two raters.

Last updated: 14 Apr 2010 (new LSM page)

John Uebersax Enterprises LLC
(c) 2000-2011 John Uebersax PhD