We consider here estimation of a fully Gaussian latent trait model. "Fully Gaussian" means that both the latent trait and measurement error are assumed to be normally distributed. This model, equivalent to what Item Response Theory (IRT) terms a two-parameter normal (2PN) model with a normally distributed latent trait, is often used in educational testing, social science research, and, increasingly, in health research. It is also useful--perhaps one of the best methods--for binary data factor analysis.
For general background on latent trait and IRT models (we use these terms interchangeably), the two classic texts are Lord and Novick (1968) and Lazarsfeld and Henry (1968; 1968 was a good year!). Heinen (1996) and Hulin, Drasgow and Parsons (1983) are also good sources. For background on Gaussian latent trait models see Bock and Aitkin (1981), Bock and Lieberman (1970), and Bock, Gibbons and Muraki (1988)--the last reference is less technical. For interesting discussion of newer, Bayesian methods for estimating IRT models, see Johnson and Albert (1999).
The best way to estimate the fully Gaussian latent trait model, a method developed by Bock and colleagues, is FIML (full-information maximum likelihood) estimation. With this approach, all model parameters are estimated simultaneously by maximum likelihood. The method is very effective, but, for multidimensional latent trait models, requires specialized software, such as TESTFACT (Wilson, Wood & Gibbons, 1998). [Unidimensional versions can be estimated with LEM (Vermunt, 1988), or LTM (Uebersax, 2000), both free programs.]
Another way to estimate the fully Gaussian latent trait model has long been known--in fact, it was the only option for many years. The Bock method was rightly seen as an improvement. Fairly recently, though, studies (Knol & Berger, 1991; Parry & McArdle, 1991) reexamined the older method, and concluded that it is still very viable.
This older method--sometimes called the heuristic method--is based on factor analysis of the tetrachoric correlations between item or variable pairs. (For convenience, we will use the term 'item' herein to refer generically to any binary variable.)
It is no surprise that the FIML and heuristic estimation methods produce very similar results; despite apparent differences, both are based on the same fundamental model. The FIML method formulates the problem as a standard latent trait or IRT model, whereas the heuristic method views it as a factor analysis problem. Mislevy (1986), and, more formally, Takane and de Leeuw (1987) describe how these models are the same.
In the following sections, we will describe the fully Gaussian latent trait model and its parameters, explain how to estimate the parameters by the heuristic method, and consider an example.
This discussion will not consider the case of ordered-polytomous data. However, methods similar to those described here are applicable to such data.
As noted above, the fully Gaussian latent trait model can be understood as a factor analysis model (to be precise, a common-factor model with probit link functions; this will become clear), or a latent trait/IRT model. In the former case, the model assumes that, underlying each dichotomous manifest variable (observed variable), X_{i}, is a corresponding latent continuous variable, Y_{i}. Dichotomous responses result from the operation of thresholds relative to the latent continuous variables.
The latent trait/IRT version is most easily described for the unidimensional case: (a) there is a normally distributed latent trait; (b) associated with each item is an item response function that gives the probability of a positive response for a respondent/examinee with each level of the latent trait; and (c) item response functions have a characteristic "S" shape--specifically, that of a "normal ogive," better known as the cumulative distribution function (cdf) of a normal curve.
Both models--the factor analysis model and the latent trait/IRT model--have two sets of parameters. These have various names; we shall simply call them the correlation parameters and the threshold parameters. We will consider these parameters mainly in terms of the factor analysis model, which is consistent with present aims.
These are the values of thresholds that discretize latent continuous variables to produce dichtomous observed values. These correspond--that is, they are related in a simple way--to what are alternatively termed location, difficulty, or b parameters in various latent trait/IRT models.
These are the correlations of latent continuous variables with the underlying factors. That is, they are latent factor loadings. These correspond--that is, again, are related by very simple functions--to what are variously known as slope, dispersion, a, or measurement error parameters in latent trait/IRT models.
We shall now consider how to estimate the threshold and correlation parameters by the heuristic method.
Threshold parameters are estimated by applying the probit function to the marginal response proportions for each item.
The probit function is the inverse of the cumulative distribution function (cdf) of the standard normal curve. We typically denote the latter by Phi() [i.e., F()] and the former function by Phi^{-1}(), or probit(). The Phi function is familiar from its use in significance testing. For some normal deviate, z, Phi(z) is the area under the standard normal curve to the left of z. For example, if z = 0, then
because half the standard normal curve is to the left of 0.
The probit function does the reverse: one supplies the proportion of the curve to the left of some z value, and the probit function gives the z value. So, for example,
The heuristic estimation of threshold parameters is, therefore, quite simple. First, for each item, one notes the proportion of negative responses. This is the proportion of cases that fall below the response threshold on the corresponding latent continuous variable--for example, the proportion of patients who are below a physician's threshold for making a positive diagnosis, or the proportion of respondents whose opinion strength is below the threshold required to answer "yes" to a survey question. Let p_{i} denote this proportion. Letting t_{i} denote the threshold for item i,
Doing this calculation for each item gives the heuristic item threshold estimates.
The probit function is usually calculated with a polynomial approximation. Several good routines are available. Some are described in Abramowitz and Stegun (1972) and Kennedy and Gentle (1980).
The thresholds calculated by this method are relative to the latent continuous variables. For a unidimensional latent trait model, one can also consider the location of the thresholds relative to the latent trait. These latter thresholds, sometimes called the location parameters in IRT, correspond to the latent trait level where probability of a positive response for an item is .5 (i.e., where the item response function is .5, at the inflection point of the "S" curve.) Letting b_{i} denote these thresholds,
The tetrachoric correlation (Drasgow, 1988; Harris, 1988; the former reference is more complete) between two dichotomous items estimates the Pearson correlation one would obtain if the two constructs were measured continuously. In terms of our definitions, the tetrachoric correlation for manifest variables X_{i} and X_{j}, which we denote by r^{*}_{ij}, is equal to the Pearson correlation between the corresponding latent continuous variables Y_{1} and Y_{2}. That is:
To dispel any possible confusion, it bears mentioning that we have referred to two different types of latent correlations. What we've termed the correlation parameters of the latent trait model are the correlations of latent continuous variables with factors. These we will derive from the tetrachoric correlations, which are the correlations of latent continuous variables with each other.
Figure 1 portrays the joint distribution of the latent continuous variables, Y_{1} and Y_{2}, corresponding to manifest items X_{1} and X_{2}.
a, b, c and d denote the proportions of cases that fall in each region defined by the two items' thresholds. For example, a is the proportion below both items' thresholds--e.g., the cases that fail or respond "no" to both items.Figure 1 (draft) Y_{2} | . | . . | . | . | . | t | . d | c . 2 +------------+---------------- | . a | b . | . | | . | +------------|-------------------> Y_{1} t_{1}
Bivariate normal distribution of latent continuous variables Y_{1} and Y_{2} and discretizing thresholds t_{1} and t_{2}
These proportions correspond to the summary of data as a 2 x 2 cross-classification table, as in Table 1.
Table 1 (draft) Item 1 0 1 +-------+-------+ 0 | a | b | p_{2} Item 2 +-------+-------+ 1 | c | d | q_{2} +-------+-------+ p_{1} q_{1} 1
Cross-classification of responses for Item 1 and Item 2
Note that a, b, c and d here are defined as proportions, not frequencies.
Once we know the observed cross-classification proportions a, b, c and d for a study, it is a simple matter to estimate the model represented by Figure 1. Specifically, we want to know the locations of the discretizing thresholds, t_{1} and t_{2}, and a third parameter, which determines the "fatness" of the ellipse: this third parameter is the tetrachoric correlation for Items 1 and 2, or r^{*}_{ij}.
The principle of estimation is very simple: basically, a computer program tries various combinations for t_{1}, t_{2} and r^{*}, until values are found for which the expected proportions for a, b, c and d in Figure 1 are as close as possible to the actual, observed proportions in Table 1. The parameter values that do so are regarded as (estimates of) the true, population values.
In fact, our probit-based threshold estimates are exactly right (this is so for the tetrachoric, but not the polychoric correlation, a generalization of the tetrachoric to ordered category data), so we need only find the right value of r^{*}.
A computer program is necessary to calculate the tetrachoric correlation. Several algorithms are available. One method (the original approach) is based on a series expansion, called the tetrachoric series. More recent approaches rely on numerical integration. Nearly all modern algorithms are accurate to many decimal places. Older algorithms that rely on a series expansion with too few terms (e.g., a sixth-degree polynomial expansion) are potentially inaccurate. A good subroutine (Brown, 1977) can be downloaded from the Applied Statistics section of StatLib.
For more information about software options, click here.
SAS will calculate tetrachoric correlations, as an option for PROC FREQ. The only drawback is that it calculates these one at a time, rather than constructing a matrix of tetrachoric correlations between all item pairs. A search of the SAS web site for the term "polychoric", though, will reveal a macro for constructing such a matrix.
Once one has constructed a matrix of tetrachoric correlations between all item pairs (i, j), the next step is to factor analyze the matrix.
One may factor analyze the matrix of tetrachoric correlations just as one would a matrix of Pearson correlations. One can use any software that will estimate a common factor model.
It is possible for a tetrachoric correlation matrix to not be positive definite. Knol and Berger (1991) suggest that, with ML or Generalized (Weighted) least-squares (GLS/WLS) estimation of the common factor model, if the matrix is indefinite or ill-conditioned with respect to inversion, a smoothing procedure is required; they also describe a procedure for this.
The extent to which this extra step is required is unclear. It perhaps becomes more of an issue as the number of items increases. If one uses iterated principal factor analysis or unweighted least-squares estimation, then the smoothing procedure is not required.
LISREL can be used for the factor analysis. LISREL works well for a 1-factor model (see Example below). However, it may be awkward for multi-factor models; LISREL is better suited for confirmatory than for exploratory factor analysis. If LISREL is used, one should not use ML estimation, as the assumptions of that method do not apply to tetrachoric correlations.
Another possibility is to use SAS PROC FACTOR. For this, one reads the matrix of tetrachoric correlations into a special SAS data set that is specified as a correlation matrix. This data set is then used as the input for SAS PROC FACTOR.
For example, let "matrix.in" be the name of an external file (e.g., a DOS, Windows, or Unix file) that contains the tetrachoric correlations among five items, as follows:
ITEM1 1.000 . . . . ITEM2 .170 1.000 . . . ITEM3 .228 .189 1.000 . . ITEM4 .107 .111 .187 1.000 . ITEM5 .067 .172 .105 .201 1.000Note that: (1) this matrix has the form of a lower-half matrix, with main diagonal elements included; (2) the main diagonal elements are unities; (3) elements of the upper half of the matrix are explicitly supplied as missing values ("."); and (4) a variable name begins each row of the matrix.
I believe one could also supply a full correlation matrix here. However it is important to supply *some* values for the upper half of the matrix; otherwise when reading the data SAS will advance to the next line looking for the values above the main diagonal. (In theory you could use the SAS input option MISSOVER to prevent that, but not if any row of the correlation matrix occupies more than one physical line of the file.)
Also, this is just an example; in practice one might better supply the correlations to more decimal places.
A sample SAS program to analyze such data is as follows:
* Input the matrix of tetrachoric correlations ; * and store as a SAS correlation data set (refer ; * to SAS manual under PROC CORR for details) ;data one (type=corr); infile 'matrix.in'; _type_='corr'; input _name_ $ item1-item5; run;
* Do the factor analysis;
proc factor data = one method = prinit n = 2 rotate = v scree ; run;
The PROC FACTOR step requests estimation by the "PRINIT" (iterated principal factor analysis or IPFA) method, a two-factor model, varimax rotation, and a scree test of eigenvalues.
Based on limited experience, I have found the PRINIT method better for factoring tetrachorics than most other SAS factoring methods (a comparable method is available with SPSS). Knol and Berger found good results with this method, but suggested that unweighted least-squares estimation may be preferable in order to avoid possible "Heywood cases." (A Heywood case is an estimation problem where a commonality estimate becomes 1.0 or greater).
Regardless of the program used, one will find on the output a listing of factor loadings of each item on each factor. These loadings, the item--factor correlations, are the correlation parameters we seek--that is, the estimated correlations of latent continuous variables with the factors.
For a unidimensional model, we denote this value by rho_{i} (i.e., r_{i}).
For a multidimensional model, we could denote the latent correlation of item i with Factor c by rho_{ic} (i.e., r_{ic}). We could alternatively denote these as rho(Y_{i}, theta_{c}) [i.e., r(Y_{i}, q_{c})], that is, the correlation of latent continuous variable Y_{i} with the c'th factor/latent trait, theta_{c}.
As with ordinary factor analysis, one may test models with varying numbers of factors. Or one may use analysis to confirm certain hypotheses. However, any goodness-of-fit statistic that depends on the assumption of continuous variables does not strictly apply; such statistics may roughly guide model selection and comparison, but disregard the p-values.
This example uses the LSAT-6 data analyzed by Bock and Lieberman (1970) and others. The data are responses to five items of the Law School Admissions Test (LSAT) test by 1000 examinees (see Bock & Lieberman for data). We proceed in two steps.
For this, we use the PRELIS (Joreskog & Sorbom, 1988) program. The PRELIS command file is as follows:
LSAT-6 data (Bock & Lieberman, 1970) da ni = 5 nobs = 1000 missing = -9 treatment = listwise raw-data-from file = lsat6.raw labels item1 item2 item3 item4 item5 ordinal item1 - item5 output matrix = pmatrix sm = lsat6.mat sa=test.acv
These commands are for PRELIS Version 1; commands for Version 2 might differ slightly. The commands state that there are five variables (items 1 to 5), to be treated as ordinal, not continuous data. A matrix of estimated Pearson correlations (pmatrix) is requested; since the data are dichotomous, PRELIS will estimate the Pearson correlations as tetrachoric correlations. Raw data are supplied in the file LSAT6.RAW. The tetrachoric correlation matrix will be written to file LSAT6.MAT. The asymptotic variance/covariance matrix for estimated parameters is written to the file TEST.ACV. LISREL uses this matrix for Weighted Least Squares estimation (if you plan to use Unweighted Least Squares estimation with LISREL, this matrix is not needed). PRELIS will only calculate this matrix if listwise (not pairwise) deletion of missing cases is specified.
The output file produced by these commands gives the one-way item marginal totals as follows:
CATEGORY VARIABLE 1 2 ________ ITEM1 76 924 ITEM2 291 709 ITEM3 447 553 ITEM4 237 763 ITEM5 130 870
From these we calculate, for each item i, the marginal proportion p_{i} and threshold t_{i} = probit(p_{i}). For example, for the first item,
and
Doing this for all items gives the heuristic threshold estimates shown in Table 2.
Note: It appears that PRELIS 2 supplies these thresholds directly; there is no need to calculate them from marginal proportions.
The matrix of tetrachoric correlations, stored in file LSAT6.MAT, is as follows:
(6D13.6) .100000D+01 .170316D+00 .100000D+01 .227522D+00 .189091D+00 .100000D+01 .107186D+00 .111147D+00 .186680D+00 .100000D+01 .665002D-01 .172422D+00 .105492D+00 .200924D+00 .100000D+01
The matrix as shown on the printed output, more readable, is:
ESTIMATED CORRELATION MATRIX ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ________ ________ ________ ________ ________ ITEM1 1.000 ITEM2 .170 1.000 ITEM3 .228 .189 1.000 ITEM4 .107 .111 .187 1.000 ITEM5 .067 .172 .105 .201 1.000
For this we use LISREL. The command file is:
FACTOR ANALYSIS OF DICHOTOMOUS VARIABLES: LSAT6 DATA -- WEIGHTED LEAST SQUARES DA NI=5 NO=1000 MA=PM PM .100000D+01 .170316D+00 .100000D+01 .227522D+00 .189091D+00 .100000D+01 .107186D+00 .111147D+00 .186680D+00 .100000D+01 .665002D-01 .172422D+00 .105492D+00 .200924D+00 .100000D+01 AC FI=TEST.ACV MO NX=5 NK=1 LX=FU,FR PH=ST TD=SY,FI FR TD 1 1 TD 2 2 TD 3 3 TD 4 4 TD 5 5 OU
This simply requests a 1-common factor model of the data. The AC command tells LISREL where to find an asymptotic variance/covariance matrix for used with Weighted Least Squares (WLS) estimation. Because of this, LISREL will use WLS estimation even though it is not explicitly requested.
The LISREL output shows the loadings of each latent continuous variable on the common factor, as follows:
LISREL ESTIMATES (WEIGHTED LEAST SQUARES) LAMBDA X KSI 1 ________ VAR 1 .389 VAR 2 .397 VAR 3 .471 VAR 4 .377 VAR 5 .342
These values (here labeled KSI 1) are the heuristic estimates of the correlation parameters rho_{i} of the latent trait model.
Table 2 compares the heuristic threshold and correlation parameter estimates with the FIML estimates reported by Bock and Lieberman (1970).
Table 2. Heuristic and Full Information Maximum Likelihood (FIML) Parameter Estimates for LSAT-6 Data ------------------------------------------------ Heuristic estimates FIML estimates^{a} ------------------- ---------------- Thresh- Corre- Thresh- Corre- Item olds lations olds lations (i) (t_{i}) (rho_{i}) (t_{i}) (rho_{i}) ------------------------------------------------- 1 -1.4325 .389 -1.4329 .3856 2 -.5505 .397 -.5505 .3976 3 -.1332 .471 -.1332 .4732 4 -.7160 .377 -.7159 .3749 5 -1.1264 .342 -1.1264 .3377 ------------------------------------------------- ^{a}Bock & Lieberman (1970); p. 191.
The heuristic and FIML estimates are very close, especially for the threshold parameters. Possibly, using more accurate FIML estimation than was available in 1970, the threshold estimates would be identical.
For some studies one will be content simply to reveal the dimensional structure of the data. Other times one will want to locate each object in the multidimensional latent trait space. This is equivalent to calculating a factor score for each case, relative to each factor.
Bock and Aitkin (1981) describe three methods to estimate such factor scores: Maximum Likelihood (ML), Maximum a posteriori (MAP), and Expected a posteriori (EAP) estimation. The EAP estimates are fairly easily obtained, especially for a one-factor model; for details, click here .
At the outset it was suggested that the heuristic method may be useful mainly to those who do not have access to the specialized software needed for FIML estimation of a multidimensional latent trait/IRT model. However, after writing this, it occurs to me that the heuristic model may be somewhat more flexible than current versions of the FIML approach. For example, with the heuristic method, one can easily combine binary and continuous variables in the same analysis. One would simply construct a correlation matrix where correlations are estimated as:
Similarly, it would appear that the heuristic method can more easily permit certain uses of covariates to improve estimation of latent trait scores for individual cases.
These same things can, in principle, be done with FIML estimation, but it is unclear that software exists to do them.
Abramowitz M, Stegun IA. Handbook of mathematical functions. Washington DC: U. S. Government Printing Office, 1972.
Bock RD, Aitkin M. Marginal maximum likelihood estimation of item parameters: application of an EM algorithm. Psychometrika, 1981, 46, 443-459.
Bock RD, Lieberman M. Fitting a response curve model for dichotomously scored items. Psychometrika, 1970, 35, 179-198.
Bock RD, Gibbons R, Muraki E. Full-information item factor analysis. Applied Psychological Measurement, 1988, 12, 261-280.
Brown MB. Algorithm AS 116: the tetrachoric correlation and its standard error. Applied Statistics, 1977, 26, 343-351.
Drasgow F. Polychoric and polyserial correlations. In Kotz L, Johnson NL (Eds.), Encyclopedia of statistical sciences. Vol. 7 (pp. 69-74). New York: Wiley, 1988.
Gibbons RD. TESTFACT Version 3 users manual. Chicago: Scientific Software International, 1998.
Harris B. Tetrachoric correlation coefficient. In Kotz L, Johnson NL (Eds.), Encyclopedia of statistical sciences. Vol. 9 (pp. 223-225). New York: Wiley, 1988.
Heinen T. (1996). Latent class and discrete latent trait models: similarities and differences. Thousand Oaks, California: Sage, 1996.
Hulin CL, Drasgow F, Parsons CK. Item response theory. Homewood, Illinois: Dow Jones-Irwin, 1983.
Johnson VE, Albert JH. Modeling ordinal data. New York: Springer, 1999.
Joreskog KG, Sorbom, D. PRELIS 1 User's Manual, Version 1. Chicago: Scientific Software, Inc., 1988.
Kennedy WJ Jr, Gentle JE. Statistical computing. New York: Dekker, 1980.
Knol DL, Berger MP. Empirical comparison between factor analysis and multidimensional item response models. Multivariate Behavioral Research, 1991, 46, 457-477.
Lazarsfeld PF, Henry NW. Latent Structure Analysis, Boston: Houghton Mifflin, 1968.
Lord FM, Novick MR. Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley, 1968.
Mislevy RJ. Recent developments in the factor analysis of categorical variables. Journal of Educational Statistics, 1986, 11, 3-31.
Parry CD, McArdle JJ. An applied comparison of methods for least-squares factor analysis of dichotomous variables. Applied Psychological Measurement, 1991, 15, 35-46
Takane Y, de Leeuw J. On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 1987, 52, 393-408.
Uebersax J. LTM: a program for unidimensional latent trait modeling. In preparation, 2000.
Vermunt J. LEM users manual. Tilburg, The Netherlands: Tilburg University, 1998.
Wichura MJ. Algorithm AS 241: Percentage points of the normal distribution. Applied Statistics, 1988, 37(3) 477-484.
This page maintained by John Uebersax PhD
Revised: 10 August 2006 (corrected example matrix, new software link)