Latent Trait Models for Rater Agreement

Introduction

Of all the methods discussed here for analyzing rater agreement, latent trait modeling is arguably the best method for handling ordered category ratings. The latent trait model is intrinsically plausible. More than most other approaches, it applies a natural view of rater decision making. If one were interested only in developing a good model of how experts make ratings--without concern for the subject of agreement--one could easily arrive at the latent trait model. The latent trait model is closely related to signal detection theory, modern psychometric theory, and factor analysis. The latent trait agreement model is also very flexible and can be adapted to the specific needs of a given study.

Given its advantages, it is surprising that this method is not used more often. The likely explanation is its relative unfamiliarity and a mistaken perception that it is difficult or esoteric. In truth, it is no more complex than many standard methods for categorical data analysis.

The basic principles of latent trait models for rater agreement are sketched here (this will be expanded as time permits). For more details, the reader may consult Uebersax (1992; oriented to non-statisticians) and Uebersax (1993; a technical exposition), or other references listed in the bibliography below.

If there are only two raters, the latent trait model is the same as the polychoric correlation coefficient model.

Measurement Model

The essence of the latent trait agreement model is contained in the measurement model,



    Y = bT + e                                            (1)


where:


    T is the latent trait level of a given case;
    Y is the perception or impression of a given rater of
      the case's trait level;
    b is a regression coefficient; and
    e is measurement error.

The latent trait is what the ratings are intended to measure--for example, disease severity, subject ability, or treatment effectiveness; this corresponds to the "signal" emitted by the case being rated.

The term e corresponds to random measurement error or noise. The combined effect of T and e is to produce a continuous variable, Y, which is the rater's impression of the signal. These continuous impressions are converted to ordered category ratings as the rater applies thresholds associated with the rating categories.
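The measurement model and the threshold step can be illustrated with a short simulation. This is only a sketch under stated assumptions: the latent trait and the measurement error are both drawn from a standard normal distribution, each rater is described by a hypothetical pair (b, thresholds), and the function name simulate_ratings is introduced here for illustration.

```python
import random
from bisect import bisect_right

def simulate_ratings(raters, n_cases, seed=0):
    """Simulate ordered-category ratings from Y = b*T + e.

    raters: list of (b, thresholds) pairs, thresholds sorted ascending.
    Returns one tuple of ratings per case, one rating per rater.
    Assumes T ~ N(0, 1) and e ~ N(0, 1); both scalings are illustrative.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_cases):
        t = rng.gauss(0.0, 1.0)            # latent trait level of this case
        row = []
        for b, thresholds in raters:
            e = rng.gauss(0.0, 1.0)        # this rater's measurement error
            y = b * t + e                  # rater's continuous impression
            # Rating = 1 + number of thresholds at or below the impression
            row.append(bisect_right(thresholds, y) + 1)
        data.append(tuple(row))
    return data

# Two hypothetical raters: one uses 4 categories, the other 3
raters = [(1.5, [-1.0, 0.0, 1.0]), (0.8, [-0.5, 0.5])]
data = simulate_ratings(raters, n_cases=1000)
```

All raters see the same T for a given case but draw their own error, which is what produces agreement between raters without making it perfect.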

Model parameters are estimated from observed data. The basic estimated parameters are: (1) parameters that describe the distribution of the latent trait in the sample or population; (2) the regression coefficient, b, for each rater; and (3) the threshold locations for each rater. Model variations may have more or fewer parameters.

Parameters are estimated by a computer algorithm that iteratively tests and revises parameter values to find those which best fit the observed data; usually "best" means the maximum likelihood estimates. Many different algorithms can be used for this.
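To make "best fit" concrete, the following sketch computes the log-likelihood that such an algorithm would maximize. It is not taken from any of the programs discussed here. It assumes standard normal measurement error, errors that are independent across raters given the latent trait, and a standard normal latent trait approximated on an evenly spaced grid; the function names and the grid settings are illustrative choices.

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for both T and e

def category_prob(k, t, b, thresholds):
    """P(rating = k | T = t) when e ~ N(0, 1); k runs from 1 to len(thresholds) + 1."""
    upper = STD_NORMAL.cdf(thresholds[k - 1] - b * t) if k <= len(thresholds) else 1.0
    lower = STD_NORMAL.cdf(thresholds[k - 2] - b * t) if k >= 2 else 0.0
    return upper - lower

def trait_grid(n_points=41, lo=-4.0, hi=4.0):
    """Grid points and normalized weights approximating a N(0, 1) latent trait."""
    step = (hi - lo) / (n_points - 1)
    points = [lo + i * step for i in range(n_points)]
    density = [STD_NORMAL.pdf(p) for p in points]
    total = sum(density)
    return points, [d / total for d in density]

def pattern_prob(pattern, raters, grid):
    """Marginal probability of one case's ratings, summed over the trait grid."""
    points, weights = grid
    total = 0.0
    for t, w in zip(points, weights):
        p = w
        for k, (b, thresholds) in zip(pattern, raters):
            p *= category_prob(k, t, b, thresholds)
        total += p
    return total

def log_likelihood(data, raters, grid):
    """Sum of log pattern probabilities over all rated cases."""
    return sum(log(pattern_prob(row, raters, grid)) for row in data)

# Hypothetical parameter values and a few rated cases
raters = [(1.5, [-1.0, 0.0, 1.0]), (0.8, [-0.5, 0.5])]
grid = trait_grid()
data = [(1, 1), (2, 2), (4, 3), (3, 2)]
print(log_likelihood(data, raters, grid))
```

An estimation routine would repeatedly adjust each rater's b and thresholds (and any trait distribution parameters), keeping changes that raise this value.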

Evaluating the Assumptions

The assumptions of the latent trait model are very mild. (Moreover, it should be noted that the assumptions are tested by evaluating the fit of a model to the observed data.)

The existence of a continuous latent trait, a simple additive model of "signal plus noise," and thresholds that map a rater's continuous impressions into discrete rating categories are very plausible.

One has latitude in choosing the form of the latent trait distribution. A normal (Gaussian) distribution is most often assumed. If, as in many medical applications, this is considered unsuitable, one can consider an asymmetric distribution; this is readily modeled as, say, a beta distribution, which can be substituted for a normal distribution with no difficulty.

Still more flexible are versions that use a nonparametric latent trait distribution. This approach models the latent trait distribution in a way analogous to a histogram, where the user controls the number of bars, and each bar's height is optimized to best fit the data. In this way nearly any latent trait distribution can be well approximated.
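One common way to optimize the bar heights is an EM-style update. The sketch below shows it with the bar locations and the rater parameters held fixed. It is an illustration of the general idea and is not claimed to be the procedure used by any particular program. The likelihood table lik is hypothetical: entry lik[i][j] stands for the probability of case i's observed ratings given that the latent trait sits at bar j.

```python
from math import log

def em_update_weights(lik, weights):
    """One EM step for the bar heights of a histogram-like trait distribution.

    lik[i][j] = P(observed ratings of case i | T at location j), held fixed.
    Each new height is the average posterior probability of its bar.
    """
    n_cases = len(lik)
    new_weights = [0.0] * len(weights)
    for row in lik:
        joint = [w * l for w, l in zip(weights, row)]
        total = sum(joint)
        for j, value in enumerate(joint):
            new_weights[j] += value / total / n_cases
    return new_weights

def log_lik(lik, weights):
    """Log-likelihood of all cases under the current bar heights."""
    return sum(log(sum(w * l for w, l in zip(weights, row))) for row in lik)

# Hypothetical likelihoods for 4 cases at 3 latent trait locations
lik = [[0.60, 0.30, 0.05],
       [0.50, 0.35, 0.10],
       [0.10, 0.40, 0.45],
       [0.05, 0.25, 0.70]]
start = [1 / 3, 1 / 3, 1 / 3]
weights = start
for _ in range(20):
    weights = em_update_weights(lik, weights)
```

The heights always sum to one, and each step leaves the log-likelihood no lower than before, so the user only has to choose how many bars to allow.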

The usual latent trait agreement model makes two assumptions about measurement error. The first is that it is normally distributed. The second is that, for any rater, measurement error variance is constant.

Similar assumptions are made in many statistical models. Still, one might wish to relax them. Hutchinson (2000) showed how non-constant measurement error variance can be easily included in latent trait agreement models. For example, measurement error variance can be lower for cases with very high or low latent trait levels, or may increase from low to high levels of the latent trait.

What the Model Provides

The latent trait agreement model supplies parameters that separately evaluate the degree of association between raters, and differences in their category definitions. The separation of these components of agreement and disagreement enables one to precisely target interventions to improve rater consistency.

Association is expressed as a correlation between each rater's impressions (Y) and the latent trait. A higher correlation means that a rater's impressions are more strongly associated with the "average" impression of all other raters. A simple statistical test permits assessment of the significance of rater differences in their correlation with the latent trait. One can also use the model to express association between a pair of raters as a correlation between one rater's impressions and those of the other; this measure is related to the polychoric correlation coefficient.
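The link between the regression coefficient b and these correlations can be sketched as follows, under a scaling that is assumed here rather than stated above: var(T) = 1, var(e) = 1, and errors independent across raters. Then cov(Y, T) = b and var(Y) = b*b + 1, so corr(Y, T) = b / sqrt(b*b + 1), and the correlation between two raters' impressions is the product of their correlations with the latent trait. The function names are illustrative.

```python
from math import sqrt

def rater_trait_correlation(b):
    """corr(Y, T) for Y = b*T + e, assuming var(T) = var(e) = 1."""
    return b / sqrt(b * b + 1.0)

def rater_pair_correlation(b1, b2):
    """corr(Y1, Y2) for two raters whose errors are independent of each other."""
    return rater_trait_correlation(b1) * rater_trait_correlation(b2)

# Hypothetical coefficients for two raters
print(rater_trait_correlation(1.5))
print(rater_pair_correlation(1.5, 0.8))
```

Because the pairwise value is a product of two quantities below one, two raters correlate less with each other than either does with the latent trait.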

Estimated rater thresholds can be displayed graphically. Their inspection, with particular attention given to the distance between successive thresholds of a rater, shows how raters may differ in the definition and widths of the rating categories. Again, these differences can be statistically tested.
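As a numerical companion to a graphical display, the sketch below compares two raters' thresholds. The threshold values are hypothetical and are assumed to be already expressed on a common scale; the function names are introduced for illustration only.

```python
def category_widths(thresholds):
    """Distances between successive thresholds (widths of the interior categories)."""
    return [upper - lower for lower, upper in zip(thresholds, thresholds[1:])]

def threshold_shifts(thresholds_a, thresholds_b):
    """Rater B's threshold minus rater A's, threshold by threshold."""
    return [tb - ta for ta, tb in zip(thresholds_a, thresholds_b)]

# Hypothetical estimated thresholds for two raters using four categories
rater_a = [-1.0, 0.0, 1.5]
rater_b = [-0.5, 0.5, 1.0]

print(category_widths(rater_a))            # [1.0, 1.5]
print(category_widths(rater_b))            # [1.0, 0.5]
print(threshold_shifts(rater_a, rater_b))  # [0.5, 0.5, -0.5]
```

In this made-up example the two raters agree on the width of the second category, but rater B's third category is much narrower, which is the kind of difference in category definition that inspection of thresholds is meant to reveal.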

Finally, the model can be used to measure the extent to which one rater's impressions may be systematically higher or lower than those of other raters--that is, to test for the existence of rater bias.

Software

Specialized software is required to estimate the latent trait agreement model, but this should not be seen as an obstacle. The LLCA (Located latent class analysis; Uebersax, 1993) program estimates latent trait agreement models using flexible nonparametric latent trait distribution assumptions (Heinen, 1996). This should suit a wide range of applications. The program, with documentation, can be downloaded here.

Another program, LTMA (Latent trait mixture analysis; Uebersax & Grove, 1993), is similar, but assumes normal latent trait distributions.

A new program that combines the features of LLCA and LTMA and adds new features is currently in progress.

If there are only two raters, the POLYCORR program can be used to estimate the polychoric correlation coefficient and rater thresholds.

The LEM program by Jeroen Vermunt can estimate several versions of the latent trait agreement model, although it is not specifically designed for this purpose.

It is possible to estimate the latent trait agreement model with a normal latent trait distribution by factor analysis of polychoric correlations. By this method, the model can be estimated using existing software packages such as SAS. For a description of this approach, click here.
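For the special case of exactly three raters, the one-factor solution has a closed form, which gives a feel for what the factor analysis does. Under a one-factor model each polychoric correlation is the product of two loadings, r_ij = l_i * l_j, so l_1 = sqrt(r12 * r13 / r23), and similarly for the others. The sketch below is illustrative only: it assumes all three correlations are positive, and with more raters a statistical package would fit the loadings iteratively instead.

```python
from math import sqrt

def one_factor_loadings(r12, r13, r23):
    """Loadings of three raters on a single factor, from their pairwise correlations.

    Uses r_ij = l_i * l_j; requires all three correlations to be positive.
    """
    if min(r12, r13, r23) <= 0.0:
        raise ValueError("closed form assumes positive correlations")
    l1 = sqrt(r12 * r13 / r23)
    l2 = sqrt(r12 * r23 / r13)
    l3 = sqrt(r13 * r23 / r12)
    return l1, l2, l3

# Hypothetical polychoric correlations among three raters
loadings = one_factor_loadings(0.56, 0.48, 0.42)
print(loadings)
```

Each loading can be read as that rater's correlation with the latent trait. A loading above 1 signals that the three correlations are not consistent with a single factor.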

Johnson and Albert (1999) provide details on estimating latent trait agreement models with a normally distributed latent trait using Bayesian statistical methods, including Markov chain Monte Carlo (MCMC) and Gibbs sampler estimation. Their programs are available for download via ftp. Consult their book, or contact Valen Johnson of the Duke University Statistics department for details.


References

Heinen T. Latent class and discrete latent trait models: Similarities and differences. Thousand Oaks, California: Sage, 1996.

Hutchinson TP. Assessing the health of plants: Simulation helps us understand observer disagreements. Environmetrics, in press (2000).

Johnson VE, Albert JH. Modeling ordinal data. New York: Springer, 1999.

Uebersax JS. A review of modeling approaches for the analysis of observer agreement. Investigative Radiology, 1992, 27, 738-743.

Uebersax JS. Statistical modeling of expert ratings on medical treatment appropriateness. Journal of the American Statistical Association, 1993, 88, 421-427.

Uebersax JS, Grove WM. A latent trait finite mixture model for the analysis of rating agreement. Biometrics, 1993, 49, 823-835.


Revised: 26 September 2000