Latent Structure Models for the Analysis of Rater Agreement and Multiple Diagnostic Tests
Introduction
This page concerns latent structure analysis (LSA) as applied to rater agreement and multiple diagnostic tests. LSA is a broad category that subsumes several individual methods, including latent class analysis (LCA) and latent trait analysis (LTA). The topics covered on this page include the basic rationale for these models, their history, issues concerning their use, and some suggestions for further research and application directions.
The purpose of LSA is to infer, from observed variables (manifest variables), the structure of other, more fundamental variables that are not directly observed (latent variables). Both manifest variables and latent variables can be binary, nominal, ordered-categorical, or interval/continuous -- leading to a large number of different combinations and different methods. For example, classical latent class analysis considers binary, nominal, or ordered-categorical manifest variables and nominal latent variables, and latent trait analysis considers binary or ordered-categorical manifest variables and continuous latent variables.
LSA is implicitly Bayesian in formulation and implementation. Its aim is to determine the most likely structure of latent variables given information on manifest variables. In a sense, the method does this by considering the probability distribution of all possible states of latent variables given observed data and choosing the most likely latent structure. For this it must estimate:
Pr(S | data)
across various possible latent structure states, S. This is done using Bayes' rule from a probability model one constructs to express:
Pr(data | S)
Traditionally this is done by constructing a likelihood function that corresponds to Pr(data | S), and finding the parameter estimates associated with latent variables that maximize the log of this likelihood function. However, as we shall see, other computational approaches, e.g., MCMC estimation, are also possible. From recent literature one might get the impression that only latent structure analysis that uses MCMC estimation and software like WinBUGS is Bayesian. That is not the case. The distinctions primarily concern choices among methods of estimating parameters. The models themselves are all Bayesian.
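As a concrete illustration of the Bayes' rule step, the following minimal sketch computes the posterior probability of each latent class for one observed rating pattern, under a conventional two-class latent class model with conditionally independent binary ratings. The function name, the prevalence, and the per-rater probabilities are illustrative assumptions, not values from any of the papers cited here.

```python
def class_posterior(pattern, prevalence, p_pos):
    """Posterior Pr(class | pattern) for a two-class latent class model.

    pattern    -- tuple of 0/1 ratings, one per rater
    prevalence -- Pr(class = 1), the prior probability of a positive case
    p_pos      -- p_pos[c][j] = Pr(rater j rates positive | class c)
    Ratings are assumed conditionally independent given the latent class.
    """
    priors = [1.0 - prevalence, prevalence]
    joint = []
    for c in (0, 1):
        like = 1.0
        for j, x in enumerate(pattern):
            p = p_pos[c][j]
            like *= p if x == 1 else 1.0 - p
        joint.append(priors[c] * like)   # Pr(class) * Pr(pattern | class)
    total = sum(joint)                   # Pr(pattern), the normalizing constant
    return [v / total for v in joint]


# Hypothetical parameters: 3 raters, 20% prevalence.
p_pos = [
    [0.10, 0.05, 0.15],  # false-positive rates (class 0)
    [0.85, 0.90, 0.80],  # sensitivities (class 1)
]
posterior = class_posterior((1, 1, 0), prevalence=0.2, p_pos=p_pos)
print(posterior)
```

With these made-up values, two positive ratings out of three move the probability that the case is a true positive from the prior of 0.2 to roughly 0.9.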
History
There is now a substantial literature concerning the use of LSA methods for the analysis and interpretation of rater agreement and associated problems with multiple diagnostic tests. This literature has not been reviewed, nor has there been an attempt at an overall integration of the material. (I feel partly responsible for this, as I should probably have already covered this in a textbook. The present page will hopefully suffice in the interim.)
As is often the case with a fundamentally sound idea, latent structure analysis has been repeatedly and independently invented by many researchers in different disciplines. The first examples, as shown by Gelfand and Solomon (1973, 1974, 1975), go back over two hundred years to the work of Poisson and Laplace on modeling jury verdicts. As a distinct methodology, however, latent structure analysis is mainly associated with the pioneering work of sociologist Paul Lazarsfeld. The book Latent Structure Analysis, published by Lazarsfeld and Neil Henry in 1968, supplied a coherent theoretical framework and analytic methods for both latent class and latent trait models. Lazarsfeld's estimation methods, however, were fairly primitive and not very practical. A major breakthrough in latent class analysis came when Goodman (1974) and Haberman (1979) demonstrated more efficient maximum-likelihood estimation algorithms.
Meanwhile, simultaneous with Lazarsfeld's work, important developments were occurring in the field of educational testing or psychometrics. Models similar to what Lazarsfeld called latent trait analysis were developed by Frederick Lord and others; this body of work was termed item response theory (IRT). Similar methods were developed by the Danish researcher Georg Rasch. Bock and Lieberman (1970) and Bock and Aitkin (1981) described a flexible class of IRT/latent structure methods, based on an estimation framework they called marginal maximum likelihood. Samejima (1969) described IRT/latent structure methods for use with ordered-category manifest variables, and Mislevy (1984) showed how to model a latent trait distributed as a mixture of continuous normal distributions.
Up to that point, except for Laplace and Poisson and one highly original paper by Fleiss (1965), none of these models had been used in connection with rater agreement or multiple diagnostic test data. That would soon change. The first modern paper demonstrating use of a latent class-like approach to analyze rater agreement appears to be Dawid and Skene (1979), with another early paper being Dillon and Mulani (1984). In the epidemiological literature Stephen Walter and colleagues presented mathematical models for estimating the accuracy of two diagnostic tests without a gold standard under special conditions (Hui and Walter, 1980). Walter gradually expanded his work to consider multiple diagnostic tests (Walter, 1984; Walter & Irwig, 1988), effectively, in a sense, re-inventing latent class analysis, though perhaps not at first seeing the connection with the earlier work of Lazarsfeld, Goodman, and others.
Espeland and Handelman (1989) extended Haberman's approach to LCA to study multiple diagnostic opinions. Meanwhile I, individually and together with William Grove, was developing latent class and latent trait agreement models in the psychological/psychiatric literature, at the RAND Corporation (Uebersax) and the University of Minnesota (Grove). This produced a series of papers, monographs, and articles from 1987 to 1993 (see References). One important feature of this work is that it supplied a comprehensive framework that tied together a variety of different latent structure models, covering (1) nominal, binary, and ordered-categorical manifest variables, (2) both discrete (latent class) and continuous (latent trait, latent distribution) latent variables, and (3) different kinds of rating studies. Uebersax (1992) gave a brief overview of the subject, intended for non-statisticians. (This paper is available online and is recommended as a starting point.)
Latent Distribution and Random Effect Latent Class Model
Uebersax and Grove (1993) presented a very flexible latent structure model for the analysis of rater agreement and multiple diagnostic tests and, under ideal conditions, for making inferences about rater/test accuracy. They called this method latent trait mixture analysis (LTMA); see also Uebersax (1988). Slightly earlier, Mislevy (1984), in the psychometric literature, developed a similar latent structure model, although he did not consider its application and extensions for the analysis of rater agreement and diagnostic tests, or ordered-categorical data.
Like Uebersax (1988), Uebersax and Grove (1993) sought to develop an analytic approach based on a plausible theoretical/psychological model of how ratings or diagnostic classifications are made. One way to understand their approach is as a latent signal detection model. That is, they proposed to model a disease process (or other trait being rated) in terms of (1) a mixture of two latent distributions (negatives and positives) along (2) a single latent continuum (disease severity, intensity, or 'diagnosability'), and (3) discretizing thresholds which mark the minimum levels of perceived disease severity that would cause a rater or test to assign each possible ordered manifest diagnostic category.
The features of the model can be understood with reference to Figure 1.
Figure 1. The Uebersax & Grove (1993) latent trait mixture analysis (LTMA) model.
Here the x-axis represents a latent trait, θ -- e.g., severity of cancer. Subjects or patients are assumed to correspond to two latent types - e.g., negatives (left distribution) and positives (right distribution). Both distributions are assumed normal (Gaussian), although the model could potentially consider other distributional forms.
Thresholds along the x-axis (t1, t2, t3) indicate discrete cut-points corresponding to the perceived latent trait levels at which different raters (raters 1, 2, 3) distinguish a negative vs. a positive diagnosis. Figure 1 considers only binary decisions, but in the case of ordered category ratings (e.g., cancer staging), each rater would have k-1 ordered thresholds that divide the range of possible observations into k graded rating levels.
The three s-shaped functions describe the probability of a positive rating for each of three raters. As shown, for each rater, the probability of a positive diagnosis increases monotonically according to the latent trait level of a case. At t1, t2, and t3, these functions have a probability of .5. On the assumption that rater observations are a function of true trait level plus random, normally distributed error, these functions have the shape of cumulative distribution functions (cdf's) of a normal distribution.
Uebersax and Grove (1993) used logistic ogives as convenient approximations to Gaussian cdf's. In the subsequently developed software (LTMA), this was replaced with an accurate polynomial approximation of the normal cdf. (Thus, in Figure 1 it would be more technically correct to replace Ψ1, Ψ2, and Ψ3 with the more conventional notation for normal cdf's of Φ1, Φ2, and Φ3.)
Uebersax and Grove (1993) also showed how to include in this model specific parameters to characterize the relative bias of different raters, test whether raters have the same or different discretizing thresholds, and to test whether raters define the latent trait differently or not.
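The structure in Figure 1 can be sketched numerically. The code below is a simplified illustration under stated assumptions, not the LTMA program: it evaluates a normal-cdf response function for each rater and obtains the marginal probability of a binary rating pattern by integrating over a two-normal mixture with a plain midpoint rule. The thresholds, error standard deviations, class means, and prevalence are hypothetical, and the normal cdf is computed exactly from the error function rather than by a logistic or polynomial approximation.

```python
import math


def normal_cdf(z):
    """Standard normal cdf, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def p_positive(theta, threshold, error_sd):
    """Pr(rater gives a positive rating | latent trait level theta)."""
    return normal_cdf((theta - threshold) / error_sd)


def pattern_probability(pattern, thresholds, error_sds, prevalence,
                        means=(-1.0, 1.0), sd=1.0, lo=-8.0, hi=8.0, steps=2000):
    """Marginal Pr(pattern): integrate over theta under a two-normal mixture."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width          # midpoint rule
        density = ((1.0 - prevalence) * normal_pdf(theta, means[0], sd)
                   + prevalence * normal_pdf(theta, means[1], sd))
        like = 1.0
        for x, t, s in zip(pattern, thresholds, error_sds):
            p = p_positive(theta, t, s)
            like *= p if x == 1 else 1.0 - p    # ratings independent given theta
        total += like * density * width
    return total


thresholds = (-0.3, 0.0, 0.4)   # hypothetical t1, t2, t3
error_sds = (0.5, 0.5, 0.5)     # hypothetical rater error SDs

patterns = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
probs = {p: pattern_probability(p, thresholds, error_sds, prevalence=0.3)
         for p in patterns}
```

Each response function equals .5 at its own threshold, and the eight pattern probabilities sum to 1 up to integration error. In an actual analysis these pattern probabilities would be compared with the observed pattern frequencies to form the likelihood.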
Partly based on this work, Qu, Tan and Kutner (1996) developed a related model which they called a random effects latent class model. Unfortunately they neglected to cite the 1993 Uebersax and Grove paper, citing instead a different paper, Uebersax (1993). Partly as a result, it seems, some later researchers were unaware of the simpler method described by Uebersax and Grove (1993). This had a cascading effect, so that more recent papers like Albert (2007) and Albert and Dodd (2008) discussed only the random effects latent class model, possibly unaware of the earlier work.
The relationship between the Uebersax and Grove LTMA model and the random effects latent class model was eventually clarified by Uebersax (1999). That paper showed that the LTMA model is a special case of the random effects latent class model, and that both of these -- along with many other models -- are all special cases of a superordinate model described by Everitt (1988), Everitt and Merette (1990), and Henkelman, Kay and Bronskill (1990). The first two of these papers described a method for estimating latent distributions in data based on multivariate integrals performed in the latent space; the last authors independently developed an equivalent model for specific application to observer agreement. Two difficulties with the approach of these three papers are (1) the computational burden imposed by performing multidimensional integration (e.g., with 10 manifest variables, a 10-dimensional integral is required), and (2) the sheer number of parameters (including, for each latent distribution, the covariances between all latent variables).
The Uebersax and Grove LTMA model and the random effects latent class model remedy both of these problems by, in effect, placing plausible restrictions on the parameters. The latter requires that the covariances of latent variables within each latent class have, in effect, a one-dimensional or one-factor structure. The LTMA model adds to this the further restriction of homogeneity of latent variable covariances across latent classes, and also constrains the means of the latent classes (see Figure 2a). This simpler form follows directly from the theoretical assumptions of a latent signal detection paradigm.
The greater parametric efficiency of the Uebersax and Grove model has an additional advantage. A common problem with this class of models is identifiability. This refers to the ability to identify a unique solution, i.e., a single set of parameter values most likely to account for observed data. Nonidentifiability may occur for different reasons, including issues associated with the rating design (how many raters/tests per subject) and patterns of missing data. It is sometimes difficult to determine in advance whether a model is identifiable or not; however, the greater the number of estimated parameters a model has, the more likely it is that the solution will be nonidentifiable or weakly identifiable (weak identifiability means that parameters can be estimated only with large standard errors). Therefore with latent structure analysis it is very helpful to keep models parametrically efficient.
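A simple counting check illustrates why the rating design matters for identifiability. For a conventional two-class latent class model with m binary raters or tests in a single population, there are 2^m - 1 independent response-pattern proportions but 2m + 1 parameters (one prevalence plus two response probabilities per rater). Having at least as many data degrees of freedom as parameters is only a necessary condition, not a sufficient one, and this sketch does not cover the LTMA or random effects models. The function names are illustrative.

```python
def data_df(n_raters):
    """Independent cell proportions in the 2^m table of binary rating patterns."""
    return 2 ** n_raters - 1


def lca_parameters(n_raters, n_classes=2):
    """Free parameters of a conventional latent class model with binary ratings:
    (n_classes - 1) class proportions plus one positive-rating probability
    per rater per class."""
    return (n_classes - 1) + n_classes * n_raters


for m in (2, 3, 4, 5):
    df, k = data_df(m), lca_parameters(m)
    status = "fails counting condition" if k > df else "passes counting condition"
    print(f"{m} raters: {df} data df, {k} parameters -> {status}")
```

With two raters the count fails (3 degrees of freedom against 5 parameters), which is consistent with the remark above that two tests can be handled only under special conditions. With three raters the model is just at the boundary (7 against 7).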
Figure 2 helps show the differences between the LTMA and random effects latent class models. In each panel, the ellipses indicate two latent classes/distributions of cases (e.g., disease-negative and disease-positive subjects). Each manifest variable is associated with a pre-discretized latent continuous variable. For simplicity, two latent continuous variables (y1 and y2) are shown, but usually there would be more. With the random effects latent class model (2b) there is a different latent trait associated with each latent class. These latent traits, along which cases in a latent class vary, account for heterogeneity in the probabilities of eliciting positive or negative ratings by a given test or rater for different cases of the same latent class. (With conventional LCA, these probabilities are homogeneous.) The latent trait of each distribution is the same thing as the random effect. These latent traits are illustrated by the lines drawn within the ellipses. In Figure 2b, these lines are not parallel. That is because the random effects/latent traits are different between the two latent classes, or at least potentially so.
Figure 2. The Uebersax & Grove (1993) LTMA model (a) and the random effects latent class model (b) represented with respect to two latent continuous variables, y1 and y2.
Figure 2a corresponds to the Uebersax and Grove LTMA model. Here there is a single latent trait or principal latent axis: it is the same within each latent class, and the means of the distributions are also aligned along this principal axis. A latent trait has the same relationship to latent continuous variables that a common factor (as in common factor analysis) has with continuous variables. Thus, with these models, there are the equivalent of factor loadings of each latent continuous variable on a latent trait. With the random effects latent class model, these loadings are different for each latent class, but with the LTMA model, they are the same, so that fewer parameters require estimation. This makes the LTMA model computationally simpler and more readily identifiable. Moreover, the situation in panel 2a is isomorphic to the unidimensional trait representation in Figure 1 (except that the latter considers three manifest variables; for discussion of this isomorphism see Takane and de Leeuw, 1987). Thus the LTMA model corresponds exactly to a latent signal detection theory paradigm, which is a natural way to understand how ratings or diagnostic test results are produced. When we think of a disease, say cancer, we ordinarily think of it as constituting a single continuum -- e.g., severity of cancer or degree of evidence for cancer. Disease-negative and disease-positive cases vary with respect to this same continuum. This corresponds to the LTMA model. The random effects latent class model, however, effectively assumes that disease-positive and disease-negative cases each have their own separate continuous dimension of variation. This is not how we usually think of a diagnostic situation. Indeed, it threatens to discard the fundamental idea of there being a single disease entity or concept at all.
In truth, there might be some difference in the latent trait between disease-negative and disease-positive cases; but in practice it is very questionable whether formally accounting for this possible difference is very important, and it certainly adds considerable complexity and parametric burden to the model.
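In sum, the Uebersax and Grove LTMA model has three advantages relative to the random effects latent class model: it requires fewer parameters, it is computationally simpler and more readily identifiable, and it corresponds directly to a latent signal detection paradigm.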
Estimation: MCMC or Direct Maximum Likelihood?
The subject of WinBUGS and related software for MCMC estimation raises a separate point. Some papers have used MCMC to estimate random effects latent class models and have suggested that this is a Bayesian approach, whereas other latent structure models are not. That is not the case. As noted at the outset, all the LSA models considered here are Bayesian. What's mainly at issue is a choice of estimation algorithms.
Basically there are two ways to estimate models of this type: by direct maximum likelihood estimation (MLE), or using MCMC simulation. Earlier papers in the field tended to use MLE, whereas more recent papers show a preference for MCMC estimation. However, this difference is arguably more a matter of researcher preference than superiority of one method over another. MLE actually has several advantages relative to MCMC estimation.
It is important in any case to note that both estimation approaches are Bayesian. A potential advantage of MCMC, however, is that it is more flexible in terms of possible prior distributions on various parameters. At present, that does not seem like a major practical consideration, but it could become more so in the future.
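To make the direct maximum likelihood option concrete, here is a minimal sketch of an EM algorithm for a conventional two-class latent class model with conditionally independent binary ratings, in the spirit of the EM approach of Dawid and Skene (1979). It is not the estimation code of the LTMA or LLCA programs, and the pattern counts, starting values, and function names are hypothetical.

```python
import math


def em_latent_class(counts, n_raters, prevalence, p_pos, n_iter=200):
    """EM for a two-class latent class model with binary ratings.

    counts     -- dict mapping rating pattern (tuple of 0/1) to frequency
    prevalence -- starting value for Pr(class 1)
    p_pos      -- starting values, p_pos[c][j] = Pr(rater j positive | class c)
    Returns (prevalence, p_pos, log-likelihood trace).
    """
    n = sum(counts.values())
    trace = []
    for _ in range(n_iter):
        # E-step: posterior class membership for each observed pattern.
        loglik = 0.0
        w1 = {}  # pattern -> Pr(class 1 | pattern)
        for pattern, freq in counts.items():
            joint = []
            for c, prior in enumerate((1.0 - prevalence, prevalence)):
                like = prior
                for j, x in enumerate(pattern):
                    like *= p_pos[c][j] if x else 1.0 - p_pos[c][j]
                joint.append(like)
            total = joint[0] + joint[1]
            w1[pattern] = joint[1] / total
            loglik += freq * math.log(total)
        trace.append(loglik)

        # M-step: weighted proportions using the posterior weights.
        n1 = sum(freq * w1[p] for p, freq in counts.items())
        n0 = n - n1
        prevalence = n1 / n
        p_pos = [
            [sum(f * (1.0 - w1[p]) * p[j] for p, f in counts.items()) / n0
             for j in range(n_raters)],
            [sum(f * w1[p] * p[j] for p, f in counts.items()) / n1
             for j in range(n_raters)],
        ]
    return prevalence, p_pos, trace


# Hypothetical pattern frequencies for 3 raters and 200 cases.
counts = {
    (0, 0, 0): 110, (0, 0, 1): 12, (0, 1, 0): 8, (1, 0, 0): 10,
    (0, 1, 1): 9, (1, 0, 1): 11, (1, 1, 0): 10, (1, 1, 1): 30,
}
start = [[0.2, 0.2, 0.2], [0.7, 0.7, 0.7]]
prevalence, p_pos, trace = em_latent_class(counts, 3, 0.5, start)
```

The log-likelihood recorded in trace should never decrease from one iteration to the next, which is a convenient check on an implementation like this one. An MCMC analysis would use the same likelihood but sample from the posterior distribution of the parameters instead of maximizing.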
Estimating Accuracy without a Gold Standard
A certain amount of misunderstanding and disagreement exists concerning the proper use of these methods. In theory it is possible to use them to estimate the accuracy (e.g., sensitivity (Se), specificity (Sp), and positive/negative predictive validity) of ratings or diagnostic tests in the absence of a gold standard. Some authors, however, have expressed strong reservations about whether in practice this can actually be done.
However, a different view is that the main advantages of latent structure modeling of rater and diagnostic agreement are twofold. First, these methods force one to think carefully about the nature of the rating/diagnostic process. Such a disciplined, theoretical consideration has many fruits; it provides a structured framework for approaching a problem, which is very valuable in itself; otherwise one is prone to approach a problem in a haphazard and ineffective way. An explicit model identifies key issues, leading to improved theoretical understanding. An explicit model is further useful because it has a cumulative aspect; other researchers can build on the model, gradually improving it or suiting it to new applications. In short, a theoretical model contributes much in a holistic sense to our theoretical and practical understanding of a rating or diagnostic problem.
The second advantage of latent structure modeling is that it helps one to identify and quantify the specific sources of disagreement. Do raters, for example, disagree because they define a trait differently, or because they have different decision cutpoints? Investigating this can lead to specific interventions -- feedback or training for raters, for example -- so that their ratings or diagnoses are more consistent.
Ordinarily, however, it would seem too ambitious to try to use these methods to estimate Se, Sp, etc. in the absence of a gold standard. If used for this purpose, the methods are potentially too sensitive to departures from assumptions about the shapes of the latent trait distribution and the distribution of rater errors. Perhaps in certain special cases, say involving completely automated ratings, or in engineering applications, the methods could be used for this purpose. The key would be to have a very exact idea of the form of latent distributions and other model parameters. Otherwise the principal merits of these methods would seem more along the other lines stated above.
Located Latent Class Analysis
Uebersax (1993) described what he called a located latent class analysis (LLCA) model for analysis of rater agreement data. This is, in a sense, intermediate between a latent class model and a latent trait model. It can be understood as a latent trait model where the latent trait has an arbitrary unidimensional distribution, approximated by a set of C > 2 latent trait levels, each with a probability density. In other words, whereas a traditional latent trait model posits a latent trait distributed as, say, a single normal distribution (or, as in Figure 1, a mixture of two normal distributions), the LLCA model describes the latent distribution as analogous to a histogram. The LLCA model can also be understood as an application of the marginal maximum likelihood latent trait model of Bock and Aitkin (1981), extended to ordered category data and with several other elaborations relevant to the analysis of rater or test agreement. The LLCA model is perhaps the most flexible and informative model available for analyzing agreement among human experts making ordered categorical ratings. Without making strong distributional assumptions, it permits quantification of separate components of agreement/disagreement associated with rater differences in (1) definition of the trait, (2) discretizing thresholds, and (3) bias.
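The histogram-like latent distribution and the role of ordered thresholds can be sketched as follows. This is an illustrative simplification, not the LLCA program itself: it assumes normally distributed rater error, and the level locations, level probabilities, thresholds, and error standard deviation are hypothetical values.

```python
import math


def normal_cdf(z):
    """Standard normal cdf, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def category_probs(location, thresholds, error_sd=1.0):
    """Pr(rating = k | latent level) for k = 0..K-1, from K-1 ordered thresholds."""
    # cum[k] = Pr(rating <= k | latent level)
    cum = [normal_cdf((t - location) / error_sd) for t in thresholds]
    bounds = [0.0] + cum + [1.0]
    return [bounds[k + 1] - bounds[k] for k in range(len(bounds) - 1)]


def pattern_probability(ratings, locations, weights, rater_thresholds):
    """Marginal probability of one set of ratings: a weighted sum over the
    discrete latent levels, with ratings independent given the level."""
    total = 0.0
    for loc, w in zip(locations, weights):
        like = 1.0
        for rating, thresholds in zip(ratings, rater_thresholds):
            like *= category_probs(loc, thresholds)[rating]
        total += w * like
    return total


locations = (-1.5, 0.0, 1.5)        # hypothetical latent class locations
weights = (0.5, 0.3, 0.2)           # hypothetical class probabilities (sum to 1)
rater_thresholds = [
    (-0.5, 0.8),                    # rater 1: thresholds for a 3-category scale
    (-0.2, 1.2),                    # rater 2: stricter cut-points
]

print(pattern_probability((2, 1), locations, weights, rater_thresholds))
```

Compared with the integral needed for a continuous latent trait, the marginal probability here is a finite sum over the located classes, which is what makes the histogram analogy apt.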
Polychoric Correlation Coefficient and Generalized Latent Correlation Coefficient
Around 2000, Uebersax (online publication, 2000) suggested that the polychoric correlation coefficient could be used as a simple latent structure model to analyze agreement between pairs of raters or tests with ordered category data. He also showed that within the same general framework the usual assumption of the polychoric correlation coefficient -- namely that of a bivariate normal distribution for the latent continuous variables -- could be relaxed, leading to estimation of what could be termed a generalized latent correlation coefficient.
Issues for the Future
Research in this area is alive and well. Indeed, due in part to the increased prominence of diagnostic tests in the medical field, interest in such models seems likely to increase in the coming years.
At present, there appear to be two major needs for the field. The first is for better software. For latent class models, a commercial product, LatentGold, will satisfy the needs of many or most researchers. However, there needs to be a better alternative for the LTMA model (Uebersax and Grove, 1993) and the LLCA model (Uebersax, 1993). Currently these models can be estimated with two standalone programs. These are compiled Fortran 90 programs, designed to be run in a Command Prompt window on any Windows computer. The programs still run effectively, although they are based on a batch computing model, derived from mainframe computer days. Perhaps the best way to use them is to first run the benchmark examples supplied, and then to edit the command files as required to suit one's own data. Nevertheless some users will find the batch computing approach unfamiliar and cumbersome, or in any case would prefer a modern graphical user interface (GUI).
One solution would be to re-write both these programs in a more Windows-friendly language. However, an alternative -- something of a compromise -- might be more expedient. That would be to keep the original programs as computational engines, but to make them accessible online -- i.e., a 'cloud computing' paradigm -- allowing users to submit data and commands using a web form interface. From the web form, the appropriate batch command file could be written automatically, the model analyzed, and results given back to the user via the web form. (If anyone is interested in collaborating on this and might be willing to place the programs on their university or institution server, please let me know.) Yet another possibility would be to write R programs to estimate these models.
A second need is to extend this class of methods to other rating designs. A common problem with rater agreement analysis is that one may wish to generalize to a larger population of raters based on a study that uses a few raters. For example, one may collect data from five pathologists rating slides, but the goal is to estimate the degree of agreement among all pathologists, not just these five.
In order to do this effectively, a separate category of random rater effects models will need to be developed. This is different from the random effects models discussed above; there the random effects referred to random effects among the cases that are rated; here the random effects pertain to raters -- such that each rater/test observed is understood as being a random representative from a population of raters/tests, and one wishes to make inferences about the larger population of raters/tests based on those in the immediate study.
This, then, will suffice as an overview. Additional pages on this website supply further details on latent class models and latent trait models. (These pages were written before this introductory section, and will need to be updated accordingly.) Further information can be found in the reprints accessible online. As already noted, of these, the paper Uebersax (1992) was intended as a general introduction and is still helpful for that purpose, and Uebersax (1999) supplies an integrative framework for latent class, latent trait, and random effects models.
A computational example illustrating use of latent trait, located latent class, and latent distribution analysis will be supplied here (in progress).
References
Albert PS, Dodd LE. On estimating diagnostic accuracy from studies with multiple raters and partial gold standard evaluation. J Am Stat Assoc. 2008 Mar 1;103(481):61-73.
Bock RD, Lieberman M. Fitting a response curve model for dichotomously scored items. Psychometrika, 1970, 35, 179-198.
Bock RD, Aitkin M. Marginal maximum likelihood estimation of item parameters: application of an EM algorithm. Psychometrika, 1981, 46, 443-459.
Dawid AP, Skene AM. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 1979, 28, 20-28.
Dillon WR, Mulani N. A probabilistic latent class model for assessing inter-judge reliability. Multivariate Behavioral Research, 1984, 19, 438-458.
Espeland MA, Handelman SL. Using latent class models to characterize and assess relative error in discrete measurements. Biometrics, 1989, 45, 587-599.
Everitt BS. A finite mixture model for the clustering of mixed-mode data. Statistics and Probability Letters, 1988, 6, 305-309.
Everitt BS, Merette C. The clustering of mixed-mode data: A comparison of possible approaches. Journal of Applied Statistics, 1990, 17, 283-297.
Fanshawe TR, Lynch AG, Ellis IO, Green AR, Hanka R. Assessing agreement between multiple raters with missing rating information, applied to breast cancer tumour grading. PLoS ONE 2008 3(8): e2925. doi:10.1371/journal.pone.0002925
Fleiss JL. Estimating the accuracy of dichotomous judgments. Psychometrika, 1965, 30, 469-479.
Gaffikin L, McGrath JA, Arbyn M, Blumenthal PD. Visual inspection with acetic acid as a cervical cancer test: accuracy validated using latent class analysis. BMC Med Res Methodol. 2007; 7: 36.
Gelfand AE, Solomon H. A study of Poisson's models for jury verdicts in criminal and civil trials. Journal of the American Statistical Association, 1973, 68, 271-278.
Gelfand AE, Solomon H. Modeling jury verdicts in the American legal system. Journal of the American Statistical Association, 1974, 69, 32-37.
Gelfand AE, Solomon H. Analyzing the decision-making process of the American jury. Journal of the American Statistical Association, 1975, 70, 305-310.
Goodman LA. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 1974, 61, 215-231.
Haberman SJ. Analysis of qualitative data. Vol. 2. Academic Press, New York, 1979.
Henkelman RM, Kay I, Bronskill MJ. Receiver operator characteristic (ROC) analysis without truth. Medical Decision Making, 1990, 10, 24-29.
Hui SL, Walter SD. Estimating the error rates of diagnostic tests. Biometrics, 1980, 36, 167-171.
Hutchinson TP. Assessing the health of plants: simulation helps us understand observer disagreements. Environmetrics, 2000, vol. 11, 305-314.
Lazarsfeld PF, Henry NW. Latent structure analysis. Boston: Houghton Mifflin, 1968.
Lord FM, Novick MR. Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley, 1968.
Mislevy RJ. Estimating latent distributions. Psychometrika, 1984, 49, 359-381.
Qu Y, Tan M, Kutner MH. Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics, 1996, 52, 797-810.
Rindskopf R, Rindskopf W. The value of latent class analysis in medical diagnosis. Statistics in Medicine, 1986, 5, 21-27.
Samejima F. Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 1969, 34, 100-114.
Takane Y, de Leeuw J. On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 1987, 52, 393-408.
Uebersax JS. Validity inferences from interobserver agreement. Psychological Bulletin, 1988, 104, 405-416.
Uebersax JS. Quantitative methods for the analysis of observer agreement: toward a unifying model. RAND Paper P-7686. The RAND Corporation, 1991a.
Uebersax JS. Latent class agreement analysis with ordered rating categories. RAND Paper P-7694. The RAND Corporation, 1991b.
Uebersax JS. A review of modeling approaches for the analysis of observer agreement. Investigative Radiology, 1992, 27, 738-743.
Uebersax JS. Statistical modeling of expert ratings on medical treatment appropriateness. Journal of the American Statistical Association, 1993, 88, 421-427.
Uebersax JS. Probit latent class analysis: conditional independence and conditional dependence models. Applied Psychological Measurement, 1999, 23(4), 283-297.
Uebersax JS, Grove WM. Latent structure agreement analysis. RAND N-3029-RC. The RAND Corporation, Santa Monica, CA, October 1989.
Uebersax JS, Grove WM. Latent class analysis of diagnostic agreement. Statistics in Medicine, 1990, 9, 559-572.
Uebersax JS, Grove WM. A latent trait finite mixture model for the analysis of rating agreement. Biometrics, 1993, 49, 823-835.
Walter SD. Measuring the reliability of clinical data: the case for using three observers. Revue d'Epidemiologie et de Sante Publique, 1984, 32, 206-211.
Walter SD, Irwig LM. Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review. Journal of Clinical Epidemiology, 1988, 41, 923-937.
Xu H, Craig BA. A probit latent class model with general correlation structures for evaluating accuracy of diagnostic tests. Biometrics. 2009 Dec;65(4):1145-55.
(c) 2000-2010 John Uebersax PhD
First draft: 10 April 2010