A clear, concise description of the tetrachoric and polychoric correlation coefficients, including issues relating to their estimation, is found in Drasgow (1988). Olsson (1979) is also helpful.
What distinguishes the present discussion is the view that the tetrachoric and polychoric correlation models are special cases of latent trait modeling. (This is not a new observation, but it is sometimes overlooked.) Recognizing this opens up important new possibilities. In particular, it allows one to relax the distributional assumptions that are the most limiting feature of the "classical" tetrachoric and polychoric correlation models.
These statistics make certain assumptions, however. With the polychoric correlation, the assumptions can be tested. The assumptions cannot be tested with the tetrachoric correlation if there are only two raters; in some applications, though, theoretical considerations may justify the use of the tetrachoric correlation without a test of model fit.
Consider the example of two psychiatrists (Raters 1 and 2) making a diagnosis for presence/absence of Major Depression. Though the diagnosis is dichotomous, we allow that depression as a trait is continuously distributed in the population.
[Figure 1 (draft). Latent continuous variable (depression severity, Y) and discretizing threshold (t). A bell-shaped distribution is drawn over the Y axis; cases below t are rated "not depressed" and cases above t are rated "depressed."]
In diagnosing a given case, a rater considers the case's level of depression, Y, relative to some threshold, t: if the judged level is above the threshold, a positive diagnosis is made; otherwise the diagnosis is negative.
Figure 2 portrays the situation for two raters. It shows the distribution of cases in terms of depression level as judged by Rater 1 and Rater 2.
a, b, c and d denote the proportion of cases that fall in each region defined by the two raters' thresholds. For example, a is the proportion below both raters' thresholds and therefore diagnosed negative by both.
These proportions correspond to a summary of data as a 2 x 2 cross-classification of the raters' ratings.
                     Rater 1
                   -       +
               +-------+-------+
             - |   a   |   b   |  a + b
   Rater 2     +-------+-------+
             + |   c   |   d   |  c + d
               +-------+-------+
                 a + c   b + d     1

Figure 3 (draft). Cross-classification proportions for binary ratings by two raters.
Again, a, b, c and d in Figure 3 represent proportions (not frequencies).
Once we know the observed cross-classification proportions a, b, c and d for a study, it is a simple matter to estimate the model represented by Figure 2. Specifically, we estimate the location of the discretizing thresholds, t1 and t2, and a third parameter, rho, which determines the "fatness" of the ellipse. Rho is the tetrachoric correlation, or r*. It can be interpreted here as the correlation between judged disease severity (before application of thresholds) as viewed by Rater 1 and Rater 2.
The principle of estimation is simple: basically, a computer program tries various combinations for t1, t2 and r* until values are found for which the expected proportions for a, b, c and d in Figure 2 are as close as possible to the observed proportions in Figure 3. The parameter values that do so are regarded as (estimates of) the true, population values.
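The search described above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used by any of the programs mentioned on this page. It assumes that for a 2 x 2 table the thresholds can be read directly from the marginal proportions, so that only r* needs to be searched for (here by bisection). The function names `bvn_cdf` and `tetrachoric` are hypothetical, and the bivariate normal probability is obtained by simple numerical integration. The frequencies in the usage line are those of Table 1 further down the page, arranged as in Figure 3.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def bvn_cdf(h, k, rho, steps=200):
    """P(Y1 < h, Y2 < k) for a standard bivariate normal with correlation rho.

    Uses the identity Phi2(h, k; rho) = Phi(h) * Phi(k) plus the integral,
    from 0 to rho, of the bivariate normal density at (h, k). The integral
    is evaluated with Simpson's rule (steps must be even).
    """
    def density(r):
        one_minus_r2 = 1.0 - r * r
        expo = -(h * h - 2.0 * r * h * k + k * k) / (2.0 * one_minus_r2)
        return math.exp(expo) / (2.0 * math.pi * math.sqrt(one_minus_r2))

    width = rho / steps
    total = density(0.0) + density(rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * width)
    return STD_NORMAL.cdf(h) * STD_NORMAL.cdf(k) + total * width / 3.0


def tetrachoric(n_a, n_b, n_c, n_d):
    """Estimate (t1, t2, rho) from cell counts laid out as in Figure 3.

    Assumes every marginal total is non-zero; otherwise a threshold
    would be infinite and inv_cdf raises an error.
    """
    n = n_a + n_b + n_c + n_d
    a = n_a / n
    # Thresholds reproduce each rater's proportion of negative ratings.
    t1 = STD_NORMAL.inv_cdf((n_a + n_c) / n)  # Rater 1 negative: a + c
    t2 = STD_NORMAL.inv_cdf((n_a + n_b) / n)  # Rater 2 negative: a + b
    # Bisection on rho: the expected proportion a increases with rho.
    lo, hi = -0.999, 0.999
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if bvn_cdf(t1, t2, mid) < a:
            lo = mid
        else:
            hi = mid
    return t1, t2, (lo + hi) / 2.0


# Table 1 counts: both negative = 40, only Rater 1 positive = 20,
# only Rater 2 positive = 10, both positive = 30.
t1, t2, rho = tetrachoric(40, 20, 10, 30)
print(f"t1 = {t1:.4f}, t2 = {t2:.4f}, rho = {rho:.4f}")
```

The printed values should be close to those reported for Table 1 (thresholds 0.0000 and 0.2533, rho 0.6071). Standard errors are not computed in this sketch.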
The polychoric correlation, used when there are more than two ordered rating levels, is a straightforward extension of the model above. The difference is that there are more thresholds, more regions in Figure 2, and more cells in Figure 3. But again the idea is to find the values for thresholds and r* that maximize similarity between model-expected and observed cross-classification proportions.
In many situations, even though a trait may be continuous, it may be convenient to divide it into ordered levels. For example, for research purposes, one may classify levels of headache pain into the categories none, mild, moderate and severe. Even for a trait usually viewed as discrete, one might still consider continuous gradations--for example, people infected with the flu virus exhibit varying levels of symptom intensity.
The tetrachoric correlation and polychoric correlation coefficients are appropriate when the latent trait that forms the basis of ratings can be viewed as continuous. We will outline here the measurement model and assumptions for the tetrachoric correlation. The model and assumptions for the polychoric correlation are the same--the only difference is that there are more threshold parameters for the polychoric correlation, corresponding to the greater number of ordered rating levels.
We begin with some notation and definitions. Let:
X1 and X2 be the manifest (observed) ratings by Raters (or procedures, diagnostic tests, etc.) 1 and 2; these are discrete-valued variables;
Y1, Y2 be latent continuous variables associated with X1 and X2; these are the pre-discretized, continuous "impressions" of the trait level, as judged by Raters 1 and 2;
T be the true, latent trait level of a case.
A rating or diagnosis of a case begins with the case's true trait level, T. This information, along with "noise" (random error) and perhaps other information unrelated to the true trait which a given rater may consider (unique variation), leads to each rater's impression of the case's trait level (Y1 and Y2). Each rater applies discretizing thresholds to this judged trait level to yield a dichotomous or ordered-category rating (X1 and X2).
Stated more formally, we have:

   Y1 = b1 T + u1 + e1,
   Y2 = b2 T + u2 + e2,

where b1 and b2 are regression coefficients, u1 and u2 are the unique components of the raters' impressions, and e1 and e2 represent random error or noise. It turns out that unique variation and error variation behave more or less the same in the model, and the former can be subsumed under the latter. Thus we may consider the simpler model:

   Y1 = b1 T + e1,
   Y2 = b2 T + e2.
The tetrachoric correlation assumes that the latent trait T is normally distributed. As scaling is arbitrary, we specify that T ~ N(0, 1). Error is similarly assumed to be normally distributed (and independent both between raters and across cases). For reasons we need not pursue here, the model loses no generality by assuming that var(e1) = var(e2). We therefore stipulate that e1, e2 ~ N(0, sigma_e). A consequence of these assumptions is that Y1 and Y2 must also be normally distributed. To fix the scale, we specify that var(Y1) = var(Y2) = 1. It follows that b1 = b2 = b = the correlation of both Y1 and Y2 with the latent trait.
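A small simulation can make the measurement model concrete. The sketch below is illustrative only: the function names `simulate_ratings` and `pearson`, the value b = 0.8 and the thresholds are all assumptions chosen for the example. The error standard deviation is set to sqrt(1 - b^2), which is what the scaling var(Y1) = var(Y2) = 1 requires when T ~ N(0, 1) and error is independent of T.

```python
import math
import random


def simulate_ratings(n_cases, b, t1, t2, seed=0):
    """Simulate (Y1, Y2, X1, X2) for n_cases cases under the model above."""
    rng = random.Random(seed)
    sigma_e = math.sqrt(1.0 - b * b)  # gives var(Y1) = var(Y2) = 1
    rows = []
    for _ in range(n_cases):
        trait = rng.gauss(0.0, 1.0)               # true trait level T
        y1 = b * trait + rng.gauss(0.0, sigma_e)  # Rater 1's impression
        y2 = b * trait + rng.gauss(0.0, sigma_e)  # Rater 2's impression
        # Discretize: a rating of 1 (positive) if the impression exceeds
        # the rater's threshold, otherwise 0 (negative).
        rows.append((y1, y2, int(y1 > t1), int(y2 > t2)))
    return rows


def pearson(xs, ys):
    """Ordinary Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rows = simulate_ratings(20000, b=0.8, t1=0.0, t2=0.25)
latent_r = pearson([r[0] for r in rows], [r[1] for r in rows])
binary_r = pearson([r[2] for r in rows], [r[3] for r in rows])
print(f"correlation of Y1, Y2: {latent_r:.3f}  (b squared = {0.8 ** 2:.2f})")
print(f"correlation of X1, X2: {binary_r:.3f}")
```

In runs of this kind the correlation of the latent impressions should fall near b squared, while the Pearson correlation of the dichotomized ratings is noticeably smaller.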
We define the tetrachoric correlation, r*, as
r* = b^2
A simple "path diagram" may clarify this:
              b           b
        Y1 <----- T -----> Y2

Figure 4 (draft). Path diagram.
Here b is the path coefficient that reflects the influence of T on both Y1 and Y2. Those familiar with the rules of path analysis will see that the correlation of Y1 and Y2 is simply the product of their degree of dependence on T--that is, b^2.
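For readers who prefer algebra to path-tracing rules, the same result can be written out directly. The short derivation below assumes, as the model does, that e1 and e2 are independent of each other and of T; sigma_e^2 is used here for the common error variance.

```latex
\begin{align*}
\operatorname{var}(Y_i) &= b^2 \operatorname{var}(T) + \operatorname{var}(e_i)
  = b^2 + \sigma_e^2 = 1, \qquad i = 1, 2, \\
\operatorname{cov}(Y_1, Y_2) &= \operatorname{cov}(bT + e_1,\; bT + e_2)
  = b^2 \operatorname{var}(T) = b^2, \\
r^* = \operatorname{corr}(Y_1, Y_2)
  &= \frac{\operatorname{cov}(Y_1, Y_2)}
          {\sqrt{\operatorname{var}(Y_1)\,\operatorname{var}(Y_2)}} = b^2 .
\end{align*}
```

The first line also shows that fixing var(Y1) = var(Y2) = 1 determines the error variance once b is known: sigma_e^2 = 1 - b^2.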
As an aside, one might consider that the value of b is interesting in its own right, inasmuch as it offers a measure of the association of ratings with the true latent trait--i.e., a measure of rating validity or accuracy.
The tetrachoric correlation r* is readily interpretable as a measure of the association between the ratings of Rater 1 and Rater 2. Because it estimates the correlation that exists between the pre-discretized judgements of the raters, it is, in theory, not affected by (1) the number of rating levels, or (2) the marginal proportions for rating levels (i.e., the 'base rates.') The fact that this association is expressed in the familiar form of a correlation is also helpful.
The assumptions of the tetrachoric correlation coefficient may be expressed as follows:
Assumptions 1--4 can be alternatively expressed as the assumption that Y1 and Y2 follow a bivariate normal distribution.
We will assume that one has sufficient theoretical understanding of the application to accept the assumption of latent continuity.
The second assumption--that of a normal distribution for T--is potentially more questionable. Absolute normality, however, is probably not necessary; a unimodal, roughly symmetrical distribution may be close enough. Also, the model implicitly allows for a monotonic transformation of the latent continuous variables. That is, a more exact way to express Assumptions 1-4 is that one can obtain a bivariate normal distribution by some monotonic transformation of Y1 and Y2.
The model assumptions can be tested for the polychoric correlation by comparing the observed numbers of cases for each combination of rating levels with those predicted by the model. The comparison uses the likelihood-ratio chi-squared test, G^2 (Bishop, Fienberg & Holland, 1975), which is similar to the usual Pearson chi-squared test. (The Pearson chi-squared test can also be used; for more information on these tests, see the FAQ for testing model fit on the Latent Class Analysis web site.)
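The G^2 test is assessed by considering the associated p value, with the appropriate degrees of freedom (df). The df are given by:

   df = RC - R - C,

where R and C are the numbers of rating levels used by the two raters (the numbers of rows and columns of the cross-classification table).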
For the tetrachoric correlation R = C = 2, and there are no df with which to test the model. It is possible to test the model, though, when there are more than two raters.
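The degrees of freedom can be checked by counting. An R x C table has RC - 1 free cell proportions, and the model estimates (R - 1) + (C - 1) thresholds plus one correlation, leaving RC - R - C. This agrees with the values quoted on this page (0 df for a 2 x 2 table, 3 df for a 3 x 3 table, 24 df for a 6 x 6 table). The sketch below encodes that count together with the G^2 statistic; the function names are hypothetical, and the expected frequencies are assumed to come from whatever program fitted the model.

```python
import math


def model_df(n_rows, n_cols):
    """Degrees of freedom for the polychoric fit test: RC - R - C."""
    return n_rows * n_cols - n_rows - n_cols


def g_squared(observed, expected):
    """Likelihood-ratio chi-squared: 2 * sum(obs * ln(obs / exp)).

    Both arguments are tables (lists of rows) of frequencies. Cells with
    an observed frequency of 0 contribute nothing to the sum.
    """
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for obs, exp in zip(obs_row, exp_row):
            if obs > 0:
                total += obs * math.log(obs / exp)
    return 2.0 * total


for size in (2, 3, 6):
    print(f"{size} x {size} table: df = {model_df(size, size)}")
```

A G^2 value of zero means the model reproduces the observed table exactly, which is always the case for the tetrachoric model fitted to a single 2 x 2 table.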
For this a computer program, such as those described in the software section, is required.
The next step is to determine if the assumptions of the polychoric correlation are empirically valid. This is done with the goodness-of-fit test that compares observed crossclassification frequencies to model-predicted frequencies described previously. As noted, this test cannot be done for the tetrachoric correlation.
PRELIS includes a test of model fit when estimating the polychoric correlation. It is unknown whether SAS PROC FREQ includes such a test.
Assuming that model fit is acceptable, the next step is to note the magnitude of the polychoric correlation. Its value is interpreted in the same way as a Pearson correlation. As the value approaches 1.0, more agreement on the trait definition is indicated. Values near 0 indicate little agreement on the trait definition.
One may wish to test the null hypothesis of no correlation between raters. There are at least two ways to do this. The first makes use of the estimated standard error of the polychoric correlation under the null hypothesis of r* = 0. At least for the tetrachoric correlation, there is a simple closed-form expression for this standard error (Brown, 1977). Knowing this value, one may calculate a z value as:
   z = r* / sigma_r*(0)

where the denominator is the standard error of r* when r* = 0. One may then assess statistical significance by evaluating the z value in terms of the associated tail probabilities of the standard normal curve.
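As a rough illustration of the z test for the tetrachoric case, the sketch below uses a large-sample expression that is often given for the standard error under independence: sqrt(p1 q1 p2 q2 / N) divided by the product of the standard normal densities at the two thresholds. Treat this formula as an assumption of the example; it has not been checked here against the expression in Brown (1977). The function name and argument names are hypothetical, and the inputs are the Table 1 values reported further down this page.

```python
import math
from statistics import NormalDist


def z_test_tetrachoric(r_star, p1, p2, n_cases):
    """z test of r* = 0 for a 2 x 2 table.

    p1 and p2 are the two raters' marginal proportions of negative
    ratings and n_cases is the number of cases rated.
    """
    std = NormalDist()
    t1, t2 = std.inv_cdf(p1), std.inv_cdf(p2)  # thresholds from margins
    # Assumed null standard error (see the lead-in for the caveat).
    se0 = math.sqrt(p1 * (1 - p1) * p2 * (1 - p2) / n_cases)
    se0 /= std.pdf(t1) * std.pdf(t2)
    z = r_star / se0
    p_two_sided = 2.0 * (1.0 - std.cdf(abs(z)))
    return se0, z, p_two_sided


se0, z, p = z_test_tetrachoric(0.6071, p1=0.5, p2=0.6, n_cases=100)
print(f"se under H0 = {se0:.4f}, z = {z:.2f}, two-sided p = {p:.5f}")
```

Note that the null standard error differs from the standard error reported alongside the estimate, which is evaluated at the estimated value of r* rather than at 0.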
The second method is via a chi-squared test. If r* = 0, the polychoric correlation model is the same as the model of statistical independence. It therefore seems reasonable to test the null hypothesis of r* = 0 by testing the statistical independence model. Either the Pearson (X^2) or likelihood-ratio (G^2) chi-squared statistics can be used to test the independence model. The df for either test is (R - 1)(C - 1). A significant chi-squared value implies that r* is not equal to 0.
[I now question whether the above is correct. For the polychoric correlation, data may fail the test of independence even when r* = 0 (i.e., there may be some other kind of 'structure' to the data). If so, a better alternative would be to calculate a difference G^2,

   G^2(H0) - G^2(H1),

where G^2(H0) is the likelihood-ratio chi-squared for the independence model and G^2(H1) is the likelihood-ratio chi-squared for the polychoric correlation model. The difference G^2 can be evaluated as a chi-squared value with 1 df. -- JSU, 27 Jul 00]
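The independence-model statistics mentioned above are easy to compute by hand. The following sketch (the function name `independence_tests` is hypothetical) returns the Pearson and likelihood-ratio chi-squared values and their df for any table of counts; the usage line applies it to the 2 x 2 frequencies of Table 1.

```python
import math


def independence_tests(table):
    """Return (X2, G2, df) for the independence model on a table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    x2 = 0.0
    g2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected frequency under independence of rows and columns.
            exp = row_totals[i] * col_totals[j] / n
            x2 += (obs - exp) ** 2 / exp
            if obs > 0:  # empty cells add nothing to G2
                g2 += 2.0 * obs * math.log(obs / exp)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df


x2, g2, df = independence_tests([[40, 10], [20, 30]])
print(f"X2 = {x2:.2f}, G2 = {g2:.2f}, df = {df}")
```

For the difference test suggested in the note, the G^2 returned here would play the role of G^2(H0); G^2(H1) has to come from a program that fits the polychoric model.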
Equality of thresholds between raters can be tested by estimating what may be termed a threshold-constrained polychoric correlation. That is, one estimates the polychoric correlation with the added constraint(s) that the threshold(s) of Rater 1 is/are the same as Rater 2's threshold(s). A difference G^2 test is then made comparing the G^2 statistic for this constrained model with the G^2 for the unconstrained polychoric correlation model. The difference G^2 statistic is evaluated as a chi-squared value with df = R - 1, where R is the number of rating levels (this test only applies when both raters use the same number of rating levels).
Here we briefly note some extensions and generalizations of the tetrachoric/polychoric correlation approach to analyzing rater agreement:
Skewed distributions. A new page describing what might be colloquially called "skewed tetrachoric or polychoric correlation," but would be more accurately termed the latent correlation with a skewed latent distribution has been added. This page also describes a simple computer program to implement the model for binary ratings.
Nonparametric distributions. Example 3 below describes an alternative approach based on a nonparametric latent trait distribution.
   --------------------------------
                   Rater 2
                 -----------
   Rater 1         -      +   Total
   --------------------------------
      -           40     10     50
      +           20     30     50
   --------------------------------
   Total          60     40    100
   --------------------------------
   Table 1 (draft)

For these data, the tetrachoric correlation (std. error) is:

   rho       0.6071 (0.1152)

which is much larger than the Pearson correlation of 0.4082 calculated for the same data.
The thresholds (std. errors) for the two raters are estimated as:

   Rater 1   0.0000 (0.1253)
   Rater 2   0.2533 (0.1268)
Tallis suggested that the number of lambs born is a manifestation of the ewe's fertility--a continuous and potentially normally distributed variable. Clearly the situation is more complex than the simple "continuous normal variable plus discretizing thresholds" assumptions allow for. We consider the data simply for the sake of a computational example.
   ------------------------------------
   Lambs       Lambs born in 1952
   born in     ------------------
   1953        None     1     2   Total
   ------------------------------------
   None          58    52     1    111
   1             26    58     3     87
   2              8    12     9     29
   ------------------------------------
   Total         92   122    13    227
   ------------------------------------
   Table 2 (draft)
Drasgow (1988; see also Olsson, 1979) described two different ways to calculate the polychoric correlation. The first method, the joint maximum likelihood (ML) approach, estimates all model parameters--i.e., rho and the thresholds--at the same time.
The second method, two-step ML estimation, first estimates the thresholds from the one-way marginal frequencies, then estimates rho, conditional on these thresholds, via maximum likelihood. For the tetrachoric correlation, both methods produce the same results; for the polychoric correlation, they may produce slightly different results.
The data in Table 2 are analyzed with the POLYCORR program (Uebersax, 2000). Application of the joint ML approach produces the following estimates (standard errors):
   rho                  0.4192 (0.0761)
   threshold 2, 1952   -0.2421 (0.0836)
   threshold 3, 1952    1.5938 (0.1372)
   threshold 2, 1953   -0.0297 (0.0830)
   threshold 3, 1953    1.1331 (0.1063)

With two-step estimation the results are:

   rho                  0.4199 (0.0747)
   threshold 2, 1952   -0.2397
   threshold 3, 1952    1.5781
   threshold 2, 1953   -0.0276
   threshold 3, 1953    1.1371

However, the G^2 statistics testing model fit for the joint ML and two-step estimates are 11.54 and 11.55, respectively, each with 3 df. The corresponding p-values, less than .01, suggest poor model fit and implausibility of the polychoric model assumptions. Acceptable fit could possibly be obtained by considering a skewed latent trait distribution.
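The two-step procedure can be sketched with the standard library alone. This is an illustrative implementation under stated assumptions, not the POLYCORR algorithm: thresholds are taken from the cumulative marginal proportions, the bivariate normal probabilities are computed by simple numerical integration, and rho is found by a golden-section search, which assumes the likelihood has a single maximum in rho. All function names are hypothetical, and standard errors are not computed.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def bvn_cdf(h, k, rho, steps=100):
    """P(Y1 < h, Y2 < k), standard bivariate normal; h, k may be infinite."""
    if h == -math.inf or k == -math.inf:
        return 0.0
    if h == math.inf:
        return 1.0 if k == math.inf else STD.cdf(k)
    if k == math.inf:
        return STD.cdf(h)

    def density(r):
        q = 1.0 - r * r
        expo = -(h * h - 2.0 * r * h * k + k * k) / (2.0 * q)
        return math.exp(expo) / (2.0 * math.pi * math.sqrt(q))

    # Phi(h) * Phi(k) plus the integral of the density over [0, rho],
    # evaluated with Simpson's rule (steps is even).
    width = rho / steps
    total = density(0.0) + density(rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * width)
    return STD.cdf(h) * STD.cdf(k) + total * width / 3.0


def thresholds(marginal_counts):
    """Step 1: thresholds from cumulative marginal proportions."""
    n = sum(marginal_counts)
    cuts, running = [-math.inf], 0
    for count in marginal_counts[:-1]:
        running += count
        cuts.append(STD.inv_cdf(running / n))
    return cuts + [math.inf]


def neg_loglik(table, row_cuts, col_cuts, rho):
    """Negative multinomial log-likelihood of the table given rho."""
    total = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            # Probability of the rectangle bounded by adjacent thresholds.
            p = (bvn_cdf(row_cuts[i + 1], col_cuts[j + 1], rho)
                 - bvn_cdf(row_cuts[i], col_cuts[j + 1], rho)
                 - bvn_cdf(row_cuts[i + 1], col_cuts[j], rho)
                 + bvn_cdf(row_cuts[i], col_cuts[j], rho))
            if n_ij > 0:
                total -= n_ij * math.log(max(p, 1e-300))
    return total


def two_step_polychoric(table):
    """Step 2: maximize the likelihood over rho with thresholds fixed."""
    row_cuts = thresholds([sum(row) for row in table])
    col_cuts = thresholds([sum(col) for col in zip(*table)])
    lo, hi = -0.99, 0.99
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(40):  # golden-section search
        x1 = hi - ratio * (hi - lo)
        x2 = lo + ratio * (hi - lo)
        if (neg_loglik(table, row_cuts, col_cuts, x1)
                < neg_loglik(table, row_cuts, col_cuts, x2)):
            hi = x2
        else:
            lo = x1
    return row_cuts, col_cuts, (lo + hi) / 2.0


table2 = [[58, 52, 1], [26, 58, 3], [8, 12, 9]]  # rows: 1953, columns: 1952
rows_1953, cols_1952, rho = two_step_polychoric(table2)
print("1953 thresholds:", [round(t, 4) for t in rows_1953[1:-1]])
print("1952 thresholds:", [round(t, 4) for t in cols_1952[1:-1]])
print("rho:", round(rho, 4))
```

The output should be close to the two-step estimates listed above. Joint ML estimation would instead search over rho and all four thresholds at once, which requires a multivariate optimizer.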
   ----------------------------------------------------
   Rating            Rating of Rater 1
   of Rater   -----------------------------------
   2            1     2     3     4     5     6   Total
   ----------------------------------------------------
   1           30     1     0     0     0     0     31
   2            0    10     2     0     0     0     12
   3            0     4     8     3     1     0     16
   4            0     3     3    37     9     0     52
   5            0     0     1    25    71    49    146
   6            0     0     0     2    20   181    203
   ----------------------------------------------------
   Total       30    18    14    67   101   230    460
   ----------------------------------------------------
   Table 3 (draft). Ratings of plant health by two judges
The polychoric correlation for these data is .954 using joint estimation. However, there is reason to doubt the assumptions of the standard polychoric correlation model; the G^2 model fit statistic is 57.33 on 24 df (p < .001).
Hutchinson (2000) showed that the data can be fit by allowing measurement error variance to differ from low to high levels of the latent trait. Instead, we relax the assumption of a normally distributed latent trait. Using the LLCA program (Uebersax, 1993a) a latent trait model with a nonparametric latent trait distribution was fit to the data. The distribution was represented as six equally-spaced locations (located latent classes) along a unidimensional continuum, the density at each location (latent class prevalence) being estimated.
Model fit, assessed by the G^2 statistic, was 15.65 on 19 df (p = .660). The LLCA program gave the correlation of each variable with the latent trait as .963. This value squared, .927, estimates what the correlation of the raters would be if they made their ratings on a continuous scale. This is a generalization of the polychoric correlation, though perhaps we should reserve that term for the latent bivariate normal case. Instead, we simply term this the latent correlation between the raters.
The distribution of the latent trait estimated by the model is as follows:
[Figure 5 (draft). Estimated latent trait distribution. Vertical axis: density, from 0 to .5. Horizontal axis: latent trait level, at the six locations -2.5, -1.5, -0.5, 0.5, 1.5 and 2.5.]
Tcorr is a simple utility for estimating a single tetrachoric correlation coefficient and its standard error. Just enter the frequencies of a fourfold table and get the answer. Also supplies threshold estimates.
Dirk Enzmann has written an SPSS macro to estimate a matrix of tetrachoric correlations. He also has a standalone version.
Jim Fleming also has a program to estimate a matrix of tetrachoric correlations and optionally smooth a poorly conditioned matrix.
Brown's (1977) algorithm AS 116, a Fortran subroutine to calculate the tetrachoric correlation and its standard error, can be found at StatLib. Alternatively, you can download my program, Tcorr, above, which includes simple source code with an actual working version of Brown's subroutine.
TESTFACT is a very sophisticated program for item analysis using both classical and modern psychometric (IRT) methods. It includes provisions for calculating tetrachoric correlations.
POLYCORR is a program I've written to estimate the polychoric correlation and its standard error using either joint ML or two-step estimation. Goodness-of-fit and a lot of other information are also provided. Note: this program is just for a single pair of variables, or a few considered two at a time. It does not estimate a matrix of tetra- or polychoric correlations.
PRELIS. A useful program for estimating a matrix of polychoric or tetrachoric correlations is PRELIS. It includes a goodness-of-fit test for each pair of variables. Standard errors can be requested. PRELIS uses two-step estimation. Because it is supplied with LISREL, PRELIS is widely available. Most university computation centers probably already have copies and/or site licenses.
Mplus can estimate a matrix of polychoric and tetrachoric correlations and estimate their standard errors. Two-step estimation is used. Features similar to PRELIS/LISREL.
will estimate polychoric and tetrachoric correlations and standard errors. Provisions for
smoothing an improper correlation matrix are supplied. No
goodness-of-fit tests. A free student version that
handles up to thirty variables can be downloaded.
Also does factor analysis.
A single polychoric or tetrachoric correlation can be calculated with the PLCORR option of SAS PROC FREQ. Example:
   proc freq;
      /* PLCORR requests the polychoric (tetrachoric for a 2x2 table) correlation */
      tables var1*var2 / plcorr maxiter=100;
   run;

Joint estimation is used. The standard error is supplied, but not thresholds. No goodness-of-fit test is performed.
A SAS macro, %POLYCHOR, can construct a matrix of polychoric or tetrachoric correlations. For tetrachoric correlations, if there is a single 0 frequency in the 2x2 cross-classification table for a pair of variables (see Figure 3 above), plcorr and %POLYCHOR may unnecessarily supply a missing value result, at least if maxiter is left at the default value of 20. So far I have found this problem is avoided by setting maxiter higher, e.g., to 40, 50 or 100. (Increasing the value of maxiter should not significantly increase run times.) In any case, it's a good idea to check your SAS log, which will contain a message if estimation did not converge for any pair of variables.
The macro is relatively slow (e.g., on a PC, a 50 x 50 matrix can take 5 minutes to estimate; a 100 x 100 matrix four times as long).
John Fox has written an R program to estimate the polychoric correlation and its standard error with R. A goodness-of-fit test is performed. Another R program for polychorics has been written by David Duffy.
Stata's internal function for tetrachoric correlations is a very rough approximation (e.g., actual tetrachoric correlation = .5172, Stata reports .6169!) based on Edwards and Edwards (1984) and is unsuitable for many or most applications. A more accurate external module has been written by Stas Kolenikov to estimate a matrix of polychoric or tetrachoric correlations and their standard errors.
The skewed distribution is modeled as a mixture of two Gaussian distributions, the parameters of which the user supplies; that is, one specifies in advance the shape of the latent trait distribution, based on prior beliefs/knowledge. This program is much simpler to use than those described below. Several sets of data (summarized as a series of 2×2 tables) can be analyzed in a single run.
The LTMA program can similarly be used to estimate a generalized polychoric correlation, based on a latent trait mixture model (Uebersax & Grove, 1993). This is basically a fancier version of the glc program: (1) it handles ordered-categorical as well as dichotomous variables, and (2) it will estimate the shape of the latent trait distribution from the data (again, modeling it as a mixture of two component Gaussians).
The LLCA program can be used to estimate a polychoric correlation with nonparametric distributional assumptions. The latent trait is represented as a sequence of latent classes on a single continuum (Uebersax 1993b). That is, the latent trait distribution is modeled as a "histogram," where the densities at each point are estimated, rather than as a continuous parametric distribution.
Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and practice. Cambridge, Massachusetts: MIT Press, 1975.
Brown MB. Algorithm AS 116: the tetrachoric correlation and its standard error. Applied Statistics, 1977, 26, 343-351.
Drasgow F. Polychoric and polyserial correlations. In Kotz S, Johnson NL (Eds.), Encyclopedia of statistical sciences. Vol. 7 (pp. 69-74). New York: Wiley, 1988.
Edwards JH, Edwards AWF. Approximating the tetrachoric correlation coefficient. Biometrics, 1984, 40, 563.
Harris B. Tetrachoric correlation coefficient. In Kotz S, Johnson NL (Eds.), Encyclopedia of statistical sciences. Vol. 9 (pp. 223-225). New York: Wiley, 1988.
Hutchinson TP. Kappa muddles together two sources of disagreement: tetrachoric correlation is better. Research in Nursing and Health, 1993, 16, 313-315.
Hutchinson TP. Assessing the health of plants: Simulation helps us understand observer disagreements. Environmetrics, 2000, 11, 305-314.
Joreskog KG, Sorbom D. PRELIS User's Manual, Version 2. Chicago: Scientific Software, Inc., 1996.
Knol DL, ten Berge JMF. Least-squares approximation of an improper correlation matrix by a proper one. Psychometrika, 1989, 54, 53-61.
Loehlin JC. Latent variable models, 3rd ed. Lawrence Erlbaum, 1999.
Olsson U. Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 1979, 44(4), 443-460.
Pearson K. Mathematical contributions to the theory of evolution. VII. On the correlation of characters not quantitatively measurable. Philosophical Transactions of the Royal Society of London, Series A, 1900, 195, 1-47.
Tallis GM. The maximum likelihood estimation of correlation from contingency tables. Biometrics, 1962, 342-353.
Uebersax JS. LLCA: Located latent class analysis. Computer program documentation, 1993a.
Uebersax JS. Statistical modeling of expert ratings on medical treatment appropriateness. Journal of the American Statistical Association, 1993b, 88, 421-427.
Uebersax JS. POLYCORR: A program for estimation of the standard and extended polychoric correlation coefficient. Computer program documentation, 2000.
Uebersax JS, Grove WM. A latent trait finite mixture model for the analysis of rating agreement. Biometrics, 1993, 49, 823-835.
Uebersax JS. The tetrachoric and polychoric correlation coefficients. Statistical Methods for Rater Agreement web site. 2006. Available at: http://john-uebersax.com/stat/tetra.htm . Accessed mmmm dd, yyyy.
Last updated: 27 Apr 2011 (updated R link)