Statistical Methods for Rater and Diagnostic Agreement
This site is a resource for the analysis of agreement among raters,
diagnostic tests, observers, judges or experts. It contains background
discussion on different methods, examples, references, software, and
information on recent methodological developments.
Basic Considerations
In many fields it is common to study agreement among ratings of multiple
judges, experts, diagnostic tests, etc. We are concerned here with
categorical ratings:
dichotomous (Yes/No, Present/Absent, etc.),
ordered categorical (Low, Medium, High, etc.), and
nominal
(Schizophrenic, Bipolar, Major Depression, etc.) ratings. Likert-type
ratings, intermediate between ordered-categorical and
interval-level ratings, are also considered.
There is little consensus about what statistical methods are best to
analyze rater agreement (we will use the generic words "raters" and
"ratings" here to include observers, judges, diagnostic tests, etc. and
their ratings/results). To the non-statistician, the number of
alternatives and lack of consistency in the literature are no doubt cause
for concern. This site aims to reduce confusion and help researchers
select appropriate methods for their applications.
Despite the many apparent options for analyzing agreement data, the
basic issues are very simple. Usually there are one or two methods best
for a particular application. But it is necessary to clearly identify
the purpose of analysis and the substantive questions to be answered.
The most common mistake made when analyzing agreement data is
not having an explicit goal. It is not enough for the goal to be
"measuring agreement" or "finding out if raters agree." There is
presumably some reason why one wants to measure agreement. Which
statistical method is best depends on this reason.
For example, rating agreement studies are often used to evaluate a new
rating system or instrument. If such a study is being conducted during
the development phase of the instrument, one may wish to analyze the
data using methods that identify how the instrument could be changed to
improve agreement. However, if an instrument is already in a final
format, the same methods might not be helpful.
Very often agreement studies are an indirect attempt to validate
a new rating system or instrument. That is, lacking a definitive criterion
variable or "gold standard," the accuracy of a scale or instrument is assessed
by comparing its results when used by different raters. Here one may wish
to use methods that address the issue of real concern: how well do
ratings reflect the true trait one wants to measure?
In other situations one may be considering combining the ratings of
two or more raters to obtain evaluations of suitable accuracy. If so, again,
specific methods suitable for this purpose should be used.
A second common problem in analyzing agreement is the failure to
think about the data from the standpoint of theory. Nearly all
statistical methods for analyzing agreement make assumptions.
If one has not thought about the data from a theoretical
point of view it will be hard to select an appropriate method. The
theoretical questions one asks do not need to be complicated. Even
simple questions help, such as: is the trait being measured really discrete,
like presence/absence of a pathogen, or is the trait really continuous
and being divided into discrete levels (e.g., "low," "medium," "high")
for convenience? If the latter, is it reasonable to assume that the
trait is normally distributed? Or is some other distribution plausible?
Sometimes one will not know the answers to these questions. That is
fine, too, because there are methods suitable for that case also.
The main point is to be inclined to think about data in this way,
and to be attuned to the issue of matching method and data on this basis.
These two issues, knowing one's goals and considering theory, are the
main keys to successful analysis of agreement data. Following are some
other, more specific issues that pertain to the selection of methods
appropriate to a given study.
One can broadly distinguish two reasons for studying rating
agreement. Sometimes the goal is to estimate the validity (accuracy) of
ratings in the absence of a "gold standard." This is a reasonable use
of agreement data: if two ratings disagree, then at least one of them
must be incorrect. Proper analysis of agreement data therefore permits
certain inferences about how likely a given rating is to be correct.
Other times one merely wants to know the consistency of ratings made
by different raters. In some cases, the issue of accuracy may even have
no meaning; for example, ratings may concern opinions, attitudes, or
values.
One should also distinguish between modeling vs. describing agreement.
Ultimately, there are only a few simple ways to describe the amount of
agreement: for example, the proportion of times two ratings of the same
case agree, the proportion of times raters agree on specific categories,
the proportions of times different raters use the various rating levels,
etc.
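The descriptive quantities just listed can be computed directly from paired ratings. The following is a minimal sketch, not a definitive implementation: the function name `describe_agreement` and the example ratings are invented for illustration, and specific agreement for a category is taken here as twice the number of joint uses of that category divided by the total number of times either rater used it.

```python
from collections import Counter


def describe_agreement(ratings_a, ratings_b):
    """Return overall and category-specific raw agreement for two raters.

    ratings_a[i] and ratings_b[i] are the two ratings of the same case.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    pairs = list(zip(ratings_a, ratings_b))

    # Overall agreement: proportion of cases given identical ratings.
    overall = sum(a == b for a, b in pairs) / n

    used_a = Counter(ratings_a)                      # times rater A used each level
    used_b = Counter(ratings_b)                      # times rater B used each level
    both = Counter(a for a, b in pairs if a == b)    # joint uses of each level

    # Specific agreement: 2 * joint uses / (uses by A + uses by B).
    specific = {}
    for cat in sorted(set(used_a) | set(used_b)):
        specific[cat] = 2 * both[cat] / (used_a[cat] + used_b[cat])
    return overall, specific


# Hypothetical ratings of eight cases by two raters.
rater_1 = ["yes", "yes", "no", "no", "yes", "no", "no", "no"]
rater_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "no"]
overall, specific = describe_agreement(rater_1, rater_2)
print(overall, specific)
```

Reporting agreement separately for each category matters most when one category is rare, since overall agreement can then be high even if raters seldom agree on the rare category.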
The quantification of agreement in any other way inevitably involves
a model about how ratings are made and why raters agree or disagree.
This model is either explicit, as with latent structure models, or
implicit, as with the kappa coefficient. With this in mind, two basic
principles are evident:
 It is better to have a model that is
explicitly understood than one which is only implicit and potentially
not understood.
 The model should be testable.
Methods vary with respect to how well they meet these two criteria.
Components of disagreement

Consider that disagreement has different components. With
ordered-category (including dichotomous) ratings, one can distinguish
between two different sources of disagreement. Raters may differ: (a)
in the definition of the trait itself; or (b) in their definitions of
specific rating levels or categories.
A trait definition can be thought of as a weighted composite of
several variables. Different raters may define or understand the trait
as different weighted combinations. For example, to one rater
Intelligence may mean 50% verbal skill and 50% mathematical skill; to
another it may mean 33% verbal skill, 33% mathematical skill, and 33%
motor skill. Thus their essential definitions of what the trait means
differ. Similarity in raters' trait definitions can be assessed with
various estimates of the correlation of their ratings, or
analogous measures of association.
Category definitions, on the other hand, differ because raters divide
the trait into different intervals. For example, by "low skill" one rater
may mean subjects from the 1st to the 20th percentile. Another rater,
though, may take it to mean subjects from the 1st to the 10th
percentile. When this occurs, rater thresholds can usually be adjusted
to improve agreement. Similarity of category definitions is reflected
as marginal homogeneity between raters.
Marginal homogeneity means that the frequencies (or, equivalently, the
"base rates") with which two raters use various rating categories are the
same.
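As a purely descriptive first look at marginal homogeneity, one can tabulate each rater's base rates and their differences. This is only a sketch with made-up ratings; the helper name `base_rates` is an assumption, and a formal test would still be needed to judge whether the differences exceed sampling error.

```python
from collections import Counter


def base_rates(ratings, categories):
    """Proportion of times a rater used each rating category."""
    counts = Counter(ratings)
    return {c: counts[c] / len(ratings) for c in categories}


categories = ["low", "medium", "high"]

# Hypothetical ratings of the same ten cases by two raters.
rater_1 = ["low", "low", "medium", "medium", "high",
           "high", "high", "medium", "low", "medium"]
rater_2 = ["low", "medium", "medium", "medium", "high",
           "high", "high", "high", "low", "medium"]

rates_1 = base_rates(rater_1, categories)
rates_2 = base_rates(rater_2, categories)

# Under marginal homogeneity these differences would all be near zero.
differences = {c: rates_1[c] - rates_2[c] for c in categories}
for c in categories:
    print(f"{c:>6}: {rates_1[c]:.2f} vs {rates_2[c]:.2f} (diff {differences[c]:+.2f})")
```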
Because disagreement on trait definition and disagreement on rating
category widths are distinct components of disagreement, with different
practical implications, a statistical approach to the data should
ideally quantify each separately.
All other things being equal, a simpler statistical method is preferable
to a more complicated one. Very basic methods can reveal far more about
agreement data than is commonly realized. For the most part, advanced
methods are complements to, not substitutes for, simple methods.
An example
To illustrate these principles, consider the example of rater agreement
on screening mammograms, a diagnostic imaging method for detecting
possible breast cancer. Radiologists often score mammograms on a scale
such as "no cancer," "benign cancer," "possible malignancy," or
"malignancy." Many studies have examined rater agreement
on applying these categories to the same set of images.
In choosing a suitable statistical approach, one would first consider
theoretical aspects of the data. The trait being measured, degree of
evidence for cancer, is continuous. So the actual rating levels would
be viewed as somewhat arbitrary discretizations of the underlying trait.
A reasonable view is that, in the mind of a rater, the overall weight of
evidence for cancer is an aggregate composed of various physical image
features and weights attached to each feature. Raters may vary in terms
of which features they notice and the weights they associate with each.
One would also consider the purpose of analyzing the data. In this
application, the purpose of studying rater agreement is not usually to
estimate the accuracy of ratings by a single rater. That can be done
directly in a validity study, which compares ratings to a
definitive diagnosis made from a biopsy.
Instead, the aim is more to understand the factors that cause raters
to disagree, with an ultimate goal of improving their consistency
and accuracy. For this, one should separately assess whether raters
have the same definition of the basic trait (that is, whether different raters
weight various image features similarly) and whether they have similar
widths for the various rating levels. The former can be accomplished
with, for example, latent trait models. Moreover, latent trait models
are consistent with the theoretical assumptions about the data noted
above. Raters' rating category widths can be studied by visually representing raters' rates of use
for the different rating levels and/or their thresholds for the various
levels, and statistically comparing them with tests of
marginal homogeneity.
Another possibility would be to examine if some raters are biased
such that they make generally higher or lower ratings than other raters.
One might also note which images are the subject of the most
disagreement and then try to identify the specific image features that
are the cause of the disagreement.
Such steps can help one identify specific ways to improve ratings.
For example, raters who seem to define the trait much differently than
other raters, or use a particular category too often, can have this
pointed out to them, and this feedback may promote their making ratings
in a way more consistent with other raters.
Recommended Methods
This section suggests statistical methods suitable for various levels of
measurement based on the principles outlined above. These are general
guidelines only; it follows from the discussion that no one method is
best for all applications. But these suggestions will at least give the
reader an idea of where to start.
Dichotomous data

Two raters
 Multiple raters

Assess raw agreement, overall and specific to each category.

Calculate the appropriate intraclass correlation for the data.
If different raters are used for each subject, an alternative is the Fleiss
kappa.

If the trait being rated is assumed to be latently discrete, consider use of latent class models.

If the trait being rated can be interpreted as latently continuous, latent trait models can be used to assess association among raters
and to estimate the correlation of ratings with the true trait; these models can also
be used to assess marginal homogeneity.

In some cases latent class and latent trait models can be used to estimate the
accuracy (e.g., Sensitivity and Specificity) of diagnostic ratings even when a 'gold
standard' is lacking.
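For the case where different raters are used for each subject, the Fleiss kappa mentioned above can be sketched in a few lines. This assumes the usual textbook form of the statistic (mean pairwise agreement per subject, corrected by chance agreement from pooled category proportions) and the same number of ratings per subject; the function name and the counts are illustrative only.

```python
def fleiss_kappa(counts):
    """Fleiss kappa from a subjects-by-categories table of rating counts.

    counts[i][j] = number of raters who assigned subject i to category j.
    Every subject must receive the same number of ratings (at least 2).
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: mean proportion of agreeing rater pairs per subject.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects

    # Chance agreement from the pooled proportion of ratings in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_exp = sum(p * p for p in p_cat)

    # Undefined (division by zero) if all ratings fall in a single category.
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical dichotomous ratings: 4 subjects, 3 raters each,
# columns are counts of "present" and "absent".
counts = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(counts)
print(round(kappa, 3))
```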
Ordered-category data
 Two raters

Use weighted kappa with Fleiss-Cohen (quadratic) weights; note
that quadratic weights are not the default with SAS and you must specify (WT=FC) with
the AGREE option in PROC FREQ.

Alternatively, estimate the intraclass correlation.

Ordered rating levels often imply a latently continuous trait; if so, measure
association between the raters with the polychoric correlation
or one of its generalizations.

Test overall marginal homogeneity using the Stuart-Maxwell test or the
Bhapkar test.

Test (a) for differences in rater thresholds associated with each rating category and
(b) for a difference between the raters' overall bias using the respectively applicable
McNemar tests.

Optionally, use graphical displays to visually compare
the proportion of times raters use each category (base rates).

Consider association models and related methods for
ordered category data. (See Agresti A., Categorical Data Analysis, New York:
Wiley, 2002).
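For readers not working in SAS, weighted kappa with quadratic weights can be computed by hand from the two-rater cross-classification table. The sketch below assumes the common definition of the quadratic agreement weights, w(i, j) = 1 - (i - j)^2 / (k - 1)^2 for k ordered categories; the function name and the example table are hypothetical.

```python
def weighted_kappa_quadratic(table):
    """Weighted kappa with quadratic agreement weights.

    table[i][j] = number of cases rated level i by rater 1 and level j by
    rater 2, with levels in their natural order (square k x k table).
    """
    k = len(table)
    if k < 2 or any(len(row) != k for row in table):
        raise ValueError("table must be square with at least two categories")
    n = sum(sum(row) for row in table)

    row_p = [sum(row) / n for row in table]                              # rater 1 base rates
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]   # rater 2 base rates

    p_obs = 0.0  # weighted observed agreement
    p_exp = 0.0  # weighted agreement expected from the base rates alone
    for i in range(k):
        for j in range(k):
            weight = 1 - (i - j) ** 2 / (k - 1) ** 2
            p_obs += weight * table[i][j] / n
            p_exp += weight * row_p[i] * col_p[j]
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical table for three ordered levels (Low, Medium, High).
table = [
    [12, 4, 1],
    [3, 10, 5],
    [0, 4, 11],
]
print(round(weighted_kappa_quadratic(table), 3))
```

With only two categories the quadratic weights reduce to exact-match weights, so the result coincides with unweighted kappa.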
 Multiple raters

Estimate the intraclass correlation.

Test for differences in rater bias using ANOVA or the Friedman test.

Use latent trait analysis as a multi-rater generalization of
the polychoric correlation. Latent trait models can also be used to test for
differences among raters in individual rating category thresholds.

Graphically examine and compare rater base rates and/or
thresholds for various rating categories.

Alternatively, consider each pair of raters and proceed as described for two raters.
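As one illustration of estimating an intraclass correlation with multiple raters, the sketch below uses the one-way random-effects form, which treats the raters of each subject as interchangeable. That choice is an assumption made for brevity; other forms of the intraclass correlation may suit a given design better. Rating levels are assumed to be coded as numbers, and the function name and data are invented.

```python
def icc_oneway(ratings):
    """One-way random-effects intraclass correlation for single ratings.

    ratings[i] = list of numeric ratings given to subject i; every subject
    must have the same number of ratings (at least 2).
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("need >= 2 subjects, each with the same >= 2 ratings")

    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    subject_means = [sum(r) / k for r in ratings]

    # Mean squares from a one-way ANOVA with subjects as the factor.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, subject_means) for x in r
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical ordered ratings (coded 1-5) of four subjects by three raters.
ratings = [
    [1, 2, 1],
    [3, 3, 4],
    [5, 4, 5],
    [2, 2, 3],
]
print(round(icc_oneway(ratings), 3))
```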
Nominal data

Two raters

Assess raw agreement, overall and specific to each category.

Use the p-value of Cohen's unweighted kappa to verify that
raters agree more than chance alone would predict.

Often (perhaps usually), disregard the actual magnitude of kappa here; it is
problematic with nominal data because ordinarily one can neither assume that all types
of disagreement are equally serious (unweighted kappa) nor choose an
objective set of differential disagreement weights (weighted kappa).
If, however, it is genuinely true that all pairs of rating categories are equally
"disparate", then the magnitude of Cohen's unweighted kappa can be interpreted as a
form of intraclass correlation.

Test overall marginal homogeneity using the Stuart-Maxwell test or the
Bhapkar test.

Test marginal homogeneity relative to individual categories using
McNemar tests.

Consider use of latent class models.

Another possibility is use of log-linear, association, or
quasi-symmetry models.
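The category-by-category McNemar tests in the list above can be sketched as follows. For each category the table is collapsed to "this category vs. any other," and the two discordant counts are compared. This sketch uses the uncorrected chi-square form with 1 degree of freedom (no continuity correction, which is an assumption), and the function name and table are illustrative.

```python
import math


def mcnemar_by_category(table):
    """McNemar test of marginal homogeneity for each category in turn.

    table[i][j] = number of cases placed in category i by rater 1 and in
    category j by rater 2. Returns a list of (category index, chi-square,
    p-value) tuples, one per category.
    """
    k = len(table)
    results = []
    for j in range(k):
        row_total = sum(table[j])
        col_total = sum(table[i][j] for i in range(k))
        b = row_total - table[j][j]   # rater 1 chose j, rater 2 did not
        c = col_total - table[j][j]   # rater 2 chose j, rater 1 did not
        if b + c == 0:
            # No discordant cases involve this category: no evidence of a difference.
            results.append((j, 0.0, 1.0))
            continue
        chi2 = (b - c) ** 2 / (b + c)
        # Upper tail of a chi-square with 1 df, via the complementary error function.
        p_value = math.erfc(math.sqrt(chi2 / 2))
        results.append((j, chi2, p_value))
    return results


# Hypothetical cross-classification for three nominal categories.
table = [
    [10, 8, 2],
    [2, 10, 3],
    [0, 1, 14],
]
results = mcnemar_by_category(table)
for category, chi2, p_value in results:
    print(category, round(chi2, 3), round(p_value, 4))
```

Because one test is run per category, some adjustment for multiple comparisons may be worth considering when there are many categories.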

Multiple raters

Assess raw agreement, overall and specific to each category.

If different raters are used for different subjects, use the Fleiss
kappa statistic; again, as with nominal data/two raters,
attend only to the p-value of the test unless one has a genuine basis for regarding all
pairs of rating categories as equally "disparate".

Use latent class modeling. Conditional tests of
marginal homogeneity can be made within the context of latent
class modeling.

Use graphical displays to visually compare
the proportion of times raters use each category (base rates).

Alternatively, consider each pair of raters individually and proceed as described for
two raters.
Likert-type items
Very often, Likert-type items can be assumed to produce interval-level data.
(By a "Likert-type item" here we mean one where the format clearly implies to the rater
that rating levels are evenly spaced, such as the following.)

lowest                  highest
1   2   3   4   5   6   7
(circle level that applies)


Two raters

Multiple raters

Perform a one-factor common factor analysis; examine/report the correlation of each
rater with the common factor (for details, see the section Methods
for Likert-type or interval-level data).

Test for differences in rater bias using two-way ANOVA models.

Possibly estimate the intraclass correlation.

Use histograms to describe raters' marginal
distributions.

If greater detail is required, consider each pair of raters and proceed as described
for two raters.