User Guide for POLYCORR 1.1 (Advanced Version)



1  Introduction

POLYCORR is a simple program to estimate the polychoric correlation (Drasgow, 1988; Olsson, 1979; online, see Uebersax, 2000) between two dichotomous or ordered-category variables. This document describes use of the POLYCORR program (advanced version).

The advanced version of POLYCORR has several features that the basic version of POLYCORR does not include. The basic version, with only a few simple commands, is mainly intended as a teaching tool. Added features of the advanced version include: ability to have different numbers of levels for the row variable and column variable, ability to combine data cells, and options to control the accuracy of estimation. Appendix A gives a complete list of the advanced features.

Data are supplied as a table of observed frequencies. Output includes the polychoric correlation and its standard error, estimated thresholds and, possibly, their standard errors, and model fit statistics.

The user can choose joint maximum likelihood (ML) or two-step estimation of parameters (Drasgow, 1988).
 

2  Running POLYCORR

To run the advanced version of POLYCORR, while in DOS or a DOS window, navigate to the directory where XPC.EXE resides. Type the command:


     XPC 

The program can also be run from the Windows File Manager, or from the Windows "Run program" prompt. If you are using a pre-Pentium machine without a math coprocessor, this version of XPC will not run; contact the author to obtain a suitable version.

The program will first prompt for input and output filenames. In response to each of these prompts, supply a valid DOS file name, including, if appropriate, a path, for example:


     c:\datasets\laruche.xpc

Simply pressing the Return key will cause the default file name to be used. The default input and output filenames are input.txt and output.txt, respectively.

Numbers will then scroll past as the program runs. These are the values of the likelihood-ratio chi-squared statistic calculated at each iteration. These values should generally decrease.

With two-step estimation, fewer than 50 iterations may be needed; with joint ML estimation, 1000 or more may be required for a large table. If the program doesn't converge in the number of allotted iterations, enter a "1" in Command Line 3 of the input file and re-run the program. POLYCORR will then resume estimation where it left off.
 

3  Input File

To run POLYCORR you must construct an input file. It must contain 14 command lines and the data to be analyzed. (It may also contain meta-cell definitions, as described below.)

3.1  Command lines

The 14 command lines of the input file are as follows:

The following lines are more technical. Many users can leave these set to the default values of 0.

Line 10. Algorithm used to calculate normal cdf. The default value of 0 means ALNORM (Applied Statistics algorithm AS 66) is used to calculate values for the normal cumulative distribution function (cdf). This should be adequate for most applications. If this value is 1 POLYCORR will use a more accurate cdf routine (NORMP). If the value is 2, an alternative accurate routine (NPROB) is used.

Line 11. Latent trait range. This defines the range of the latent trait over which integration is performed in the calculation of expected frequencies. The default value is the range (relative to a standard normal curve) of -/+ 5. To extend the range, supply a (positive) value of up to 10.0. The latent trait range will be set to minus/plus this value; for example, if 10.0 is specified, the range will be from -10 to 10. The format is F4.0. (If you include a decimal point, it will override the F4.0 format, but the value must be in Columns 1-4.)

Line 12. Number of quadrature points for integration. Integration is performed by dividing the latent trait into a finite number of equally-spaced points. A value of 0 in this field results in the default number of 51 points being used. It is recommended that this value not be changed unless there is a reason. For more accuracy, a larger number of up to 81 can be specified. For technical reasons it is probably better to specify an odd number. A number less than 51 will increase program speed, but this should probably not be done without a good reason (in any case, the number should never be less than 21).

Line 13. Output format. This controls the number of decimal places for printing of expected frequencies, as follows:

     Value          Number of decimal places printed
     0 (default)    2
     1 to 7         1 to 7, respectively
     8              18
     9 or more      0

Line 14. Number of meta-cells. A meta-cell is the combination of two or more cells in the original data table. When cells are combined, their observed and expected frequencies are pooled for purposes of parameter estimation. Up to 20 meta-cells can be defined.

For Command Lines 2--14 (except Line 11), values are supplied in I4 format--that is, the integer value must (a) be in Columns 1-4 and (b) be right-justified. Leaving Columns 1-4 blank is the same as supplying a value of 0.

Comments can be supplied on Command Lines 2--14 anywhere after Column 4. It is recommended that comments be used to identify the option associated with each line.
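
For example, a hypothetical fragment covering Command Lines 10-12 might look like the following (each value occupies Columns 1-4 of its line, and the text beginning after Column 4 is a comment):

   1    Line 10: use the more accurate NORMP cdf routine
 6.0    Line 11: latent trait range of -/+ 6
  61    Line 12: use 61 quadrature points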

The file input.txt supplied with POLYCORR shows proper construction of an input file.

3.2  Observed frequencies

Following the command lines are the data. These are the observed frequencies for every combination of levels of the two variables (the full observed cross-classification table). The format is free-field, but it is usually best to format the frequencies as a table.

3.3  Meta-cell pattern matrix (optional)

Meta-cells are a convenient way to combine certain table frequencies for purposes of estimation. If the meta-cells option is selected, you must supply a meta-cell pattern matrix.

The elements of the meta-cell pattern matrix correspond one-for-one with the cells of the observed frequency table. Supply a "0" in the pattern matrix to show that a cell is not to be combined. Supply a positive integer from 1 to 20 to indicate meta-cell membership; all data cells with the same nonzero pattern value comprise the corresponding meta-cell. For example, all cells with a "1" in the pattern matrix define Meta-cell 1, all cells with a "2" define Meta-cell 2, etc.

The following example meta-cell pattern matrix:

      0  0  0  1  1
      0  0  0  0  1
      0  0  0  0  0
      2  0  0  0  0
      2  2  0  0  0

specifies that cells (4, 1), (5, 1), and (5, 2) of the data table are to be combined (Meta-cell 2), and cells (1, 4), (1, 5), and (2, 5) of the data table are to be combined (Meta-cell 1) for purposes of estimation.

Format for the pattern matrix is free-field. One or more blank lines can separate the observed frequency table and the pattern matrix.

The use of meta-cells is experimental at this point. The idea is to improve parameter estimation by reducing data sparseness. Definitely do not use meta-cells to combine entire rows or columns of the data table--doing so will make the solution unidentified; instead, collapse the rows or columns before running POLYCORR. Nonidentifiability can possibly result in other situations as well. There probably should be at least one cell in each row and each column that is not combined with other cells.

4  Output file

The output file has five sections, as follows:

4.1  Design information

This reports the number of row and column levels, whether joint ML or two-step estimation was used, and run options.

4.2  Model fit statistics

This reports the likelihood-ratio chi-squared (G-squared) and Pearson chi-squared (X-squared) statistics and their associated p values. These test fit of the polychoric correlation model. A significant p value implies that the model assumptions (a latent bivariate normal distribution and fixed, discretizing thresholds) do not apply.

Chi-squared df are calculated as (R × C) - 1 - k, where R and C are the numbers of row and column levels and k is the number of estimated parameters. For joint ML estimation, k = R + C - 1; for two-step estimation, k = 1. For example, for a 4 × 4 table with joint ML estimation, df = 16 - 1 - 7 = 8.

If meta-cells are defined, df are adjusted (reduced) accordingly.

If the G-squared and/or X-squared statistics show significant lack of model fit (e.g., p < .10), the user may consider the following options:

  • Use of meta-cells to combine cells with small observed frequencies sometimes substantially improves model fit.

  • It is possible to relax the assumptions of the polychoric correlation model. One method is via a nonparametric latent trait distribution; models of this kind can be estimated with the LLCA program. Models with relaxed assumptions typically show improved model fit.

  • Another way to relax assumptions is to let measurement error vary with the latent trait level (Hutchinson, 2000). A newer version of POLYCORR with this feature is in testing.

  • Some researchers accept the value of the polychoric correlation even if moderate lack of fit is observed. The rationale is that when model assumptions are relaxed, the value of the polychoric correlation often does not change substantially.

This section also reports whether the program converged or not.

4.3  Parameter estimates

This section first reports the estimated polychoric correlation (rho) and its standard error.

Next it reports a test of a zero polychoric correlation. This is simply a chi-squared test of statistical independence for the data, to which the polychoric model reduces when rho = 0. A non-significant result means that a model that assumes a zero polychoric correlation fits the data; this can be interpreted as evidence that the null hypothesis H0: rho = 0 is tenable. At present, POLYCORR does not consider meta-cells when performing this test.
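
For readers who want to reproduce such an independence test outside POLYCORR, the following is a minimal sketch (not part of POLYCORR; it assumes Python with numpy and scipy is available, and uses a hypothetical 3 x 3 frequency table):

     import numpy as np
     from scipy.stats import chi2_contingency

     # Hypothetical 3 x 3 table of observed frequencies
     observed = np.array([[20,  5,  2],
                          [ 8, 15,  6],
                          [ 3,  7, 18]])

     # Pearson chi-squared test of independence; pass lambda_="log-likelihood"
     # to obtain the likelihood-ratio (G-squared) version instead.
     chi2, p, dof, expected = chi2_contingency(observed)
     print(f"X-squared = {chi2:.3f}, df = {dof}, p = {p:.4f}")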

Next the Pearson correlation between the two manifest variables is reported (i.e., the correlation obtained treating the variables as interval data).

Following this the threshold estimates are reported. Standard errors of threshold estimates are not calculated if two-step estimation of the polychoric correlation is used.

4.4  Observed/expected frequencies

This section shows, for each combination of levels of the row and column variables, the observed and expected frequency. Observed and expected marginal frequencies are also reported.

If meta-cells have been defined, meta-cell memberships are shown. The observed and expected meta-cell frequencies are also printed.

4.5  First derivatives

If the program meets its internal convergence criteria, the first derivative of G-squared relative to each estimated parameter will be printed (this is the same as the first derivative of -2 log L relative to each parameter). For a true convergent solution, these values should be close to 0--ideally, less than 0.001. An occasional value as large as 0.1 might be no cause for concern. However, a large value--especially one much larger than 0.1--means the program did not converge.

First derivatives are printed twice, once to 4 decimal places, and once in scientific notation.
 

5  Model and Estimation Method

5.1  Model

POLYCORR reformulates the polychoric correlation model as a latent trait or "variable-in-common" model (Hutchinson, 1993). The latent trait model reformulation is not an approximation--it is isomorphically equivalent to the usual bivariate-normal view of the polychoric correlation.

Let X1 and X2 denote the observed levels of the row and column variables, respectively, for a given case. Let Y1 and Y2 denote values of the pre-discretized continuous variables associated with X1 and X2.

The measurement model is:

Y1 = bT + e1,
Y2 = bT + e2.

In the above equations, T is a latent trait--analogous to a common factor--which Y1 and Y2 have in common and which accounts for their correlation; b is a regression coefficient, and e1 and e2 represent random errors.

The standard model assumes that the latent trait T is normally distributed. As scaling is arbitrary, we specify that T ~ N(0, 1). Error is similarly assumed to be normally distributed (and independent both between raters and across cases). A consequence of these assumptions is that Y1 and Y2 must also be normally distributed. To fix the scale, we specify that var(Y1) = var(Y2) = 1. It follows that b is the correlation of both Y1 and Y2 with the latent trait, and that b^2 is the correlation of Y1 and Y2 (it is also the polychoric correlation of X1 and X2--the correlation of the two variables we would observe if both variables were measured continuously).
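
As a check on this last point, the following minimal simulation sketch (not part of POLYCORR; it assumes Python with numpy is available) draws data from the measurement model and confirms that the correlation of Y1 and Y2 is approximately b^2:

     import numpy as np

     rng = np.random.default_rng(seed=1)
     n = 1_000_000
     b = 0.7                                  # hypothetical loading on the latent trait
     s = np.sqrt(1.0 - b**2)                  # error SD chosen so that var(Y1) = var(Y2) = 1

     T  = rng.standard_normal(n)              # latent trait, T ~ N(0, 1)
     Y1 = b * T + s * rng.standard_normal(n)  # pre-discretized continuous variables
     Y2 = b * T + s * rng.standard_normal(n)

     print(np.corrcoef(Y1, Y2)[0, 1])         # approximately b**2 = 0.49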

The assumptions of the polychoric correlation coefficient may be summarized as follows:

  1. There is a latent trait which is common to both variables and accounts for their correlation.

  2. The latent trait is normally distributed.

  3. Rating errors are normally distributed.

  4. Var(e) is homogeneous across levels of T.

  5. Errors are independent between raters.

  6. Errors are independent between cases.

Assumption 1 is essentially true by definition. The existence of a latent trait is implied by the existence of a nonzero polychoric correlation and vice versa. Just as with a common factor in factor analysis, the latent trait is "what the variables have in common." It may correspond to a more-or-less real but unobserved variable--such as intelligence or disease severity. Or it may simply be a shared component of variation.

Assumptions 2, 3 and 4 can be alternatively expressed as the assumption that Y1 and Y2 follow a bivariate normal distribution.

Assumption 5 is essentially true by definition, since any consistent association between the two variables is accounted for by the latent trait. Assumption 6, a standard assumption for statistical methods, is usually considered met with random sampling.

Assumptions 2, 3 and 4, then, are the main assumptions tested with model fit statistics. Assumption 2 can be relaxed by considering other distributional forms for the latent trait, or modeling a nonparametric latent trait distribution. Methods for relaxing Assumption 4 are described by Hutchinson (2000); a version of POLYCORR that permits relaxation of this assumption is currently being tested (users may contact the author to obtain a preliminary version.)

5.2  Estimation method

Expected frequencies are calculated by numerical integration over the range of the latent trait, T. The method is described in Uebersax (1993). Bivariate integration is not necessary. At each level of T, the program calculates the product of two normal cumulative distribution function values (obtained via an accurate polynomial approximation), one associated with Y1 and one associated with Y2.

Accuracy depends on the following:

  1. The range of the latent trait considered
  2. The number of finite levels (quadrature points) into which the latent trait is divided for integration
  3. The method of integration
  4. The accuracy of the normal cdf algorithm

Based both on experience and reference to earlier literature (e.g., Bock and Aitkin, 1981), a latent trait range of -/+ 5 (relative to a standard normal curve) is taken as the default.

POLYCORR uses the most elementary integration method--literally "integration by rectangles." Greater efficiency could be obtained by using Simpson's rule or Gauss-Hermite quadrature. However, with 51 quadrature points over the range +/- 5, this simpler method is sufficient. (Doubling the number of quadrature points, for example, has little effect on results).
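
The following sketch (an illustration only, not POLYCORR's actual code; it assumes Python with numpy and scipy, a positive rho, and hypothetical thresholds) shows this style of calculation for a single cell's expected probability:

     import numpy as np
     from scipy.stats import norm

     def cell_prob(rho, row_cuts, col_cuts, i, j, lo=-5.0, hi=5.0, n_points=51):
         """Expected probability of cell (i, j) under the latent trait model.

         row_cuts and col_cuts are threshold vectors padded with -inf and +inf,
         so level i corresponds to the interval [cuts[i], cuts[i + 1]).
         """
         b = np.sqrt(rho)                   # loading; rho = b**2 (positive rho assumed)
         s = np.sqrt(1.0 - rho)             # standard deviation of the error terms
         t = np.linspace(lo, hi, n_points)  # equally spaced quadrature points
         w = norm.pdf(t) * (t[1] - t[0])    # "integration by rectangles" weights
         p_row = (norm.cdf((row_cuts[i + 1] - b * t) / s)
                  - norm.cdf((row_cuts[i] - b * t) / s))
         p_col = (norm.cdf((col_cuts[j + 1] - b * t) / s)
                  - norm.cdf((col_cuts[j] - b * t) / s))
         return np.sum(w * p_row * p_col)

     # Hypothetical thresholds for a 3 x 3 table; multiply by N for an expected frequency.
     cuts = np.array([-np.inf, -0.5, 0.5, np.inf])
     print(cell_prob(0.49, cuts, cuts, 1, 1))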

Parameter estimates are obtained by iteratively adjusting parameter values to find those that best fit the observed data by the criterion of maximum likelihood (or, if specified, minimum X-squared). The iterative adjustments are handled by STEPIT, a general algorithm for multivariate minimization/maximization (Chandler, 1969).

With joint ML estimation, all parameters (the polychoric correlation and thresholds) are estimated by this means. With two-step estimation, thresholds are estimated directly from cumulative marginal proportions, and only rho is estimated iteratively.
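
A minimal sketch of the two-step threshold calculation (an illustration, not POLYCORR's source code; it assumes Python with numpy and scipy and a hypothetical frequency table) is:

     import numpy as np
     from scipy.stats import norm

     observed = np.array([[20,  5,  2],      # hypothetical observed frequency table
                          [ 8, 15,  6],
                          [ 3,  7, 18]])
     n = observed.sum()

     row_marg = observed.sum(axis=1) / n     # marginal proportions of the row variable
     col_marg = observed.sum(axis=0) / n     # marginal proportions of the column variable

     # Thresholds are the inverse normal cdf of the cumulative marginal proportions.
     row_thresholds = norm.ppf(np.cumsum(row_marg)[:-1])   # R - 1 thresholds
     col_thresholds = norm.ppf(np.cumsum(col_marg)[:-1])   # C - 1 thresholds
     print(row_thresholds, col_thresholds)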

Standard errors are calculated by inverting the observed information matrix (the matrix of second derivatives of -log L with respect to the model parameters). The observed information matrix is calculated by finite differences. For two-step estimation, when estimating the standard error of rho, the thresholds are viewed as fixed parameters. This appears consistent with Drasgow (1988) and others. It is debatable, however, as the thresholds are still subject to sampling variability even when calculated from the marginals. At present, the question of standard errors for two-step estimation is left open.
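
For a single estimated parameter, such as rho under two-step estimation, the general idea can be sketched as follows (an assumption about the general approach, not POLYCORR's code; neg_log_lik stands for a user-written function returning -log L for a given rho, for example built from cell probabilities like those in the cell_prob() sketch above):

     import numpy as np

     def observed_info(neg_log_lik, theta, h=1e-4):
         """Second derivative of -log L at theta, approximated by central differences."""
         return (neg_log_lik(theta + h) - 2.0 * neg_log_lik(theta)
                 + neg_log_lik(theta - h)) / h**2

     def standard_error(neg_log_lik, theta_hat):
         """SE of a single estimate: inverse square root of the observed information."""
         return 1.0 / np.sqrt(observed_info(neg_log_lik, theta_hat))

     # Usage (hypothetical): se_rho = standard_error(my_neg_log_lik, rho_hat)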

POLYCORR has been benchmarked against: PRELIS Version 1.0 (Joreskog & Sorbom, 1993) for two-step estimation; against SAS PROC FREQ PLCORR and the calculations of Tallis (1962) and Drasgow (1988) for joint ML estimation; and against Applied Statistics algorithm AS 116 (Brown, 1977) for the tetrachoric correlation. In each case POLYCORR appears at least as accurate as the benchmark source.
 

6  User-Supplied Start Values

One can specify the initial parameter values by constructing a special file. The file, named START.XPC, has k lines, where k is the number of estimated parameters.

For joint ML estimation, k = R + C - 1, where R is the number of row levels and C is the number of column levels. The first line gives the start value for rho. Next, on successive lines, are the start values for thresholds 2, 3, ..., R for the first item/rater (row variable), followed by start values for thresholds 2, 3, ..., C for the second item/rater (column variable). With respect to each variable, it is important that threshold start values be in ascending order--that is, within the row variable and within the column variable, higher-numbered thresholds must be greater than lower-numbered thresholds. In general, one can use successive integers, e.g., -2., -1., 0., 1., 2., as start values for each rater's thresholds.

For two-step estimation, k = 1. There is only one line, containing the start value for rho.

Values must include a decimal point and be one per line, with no blank lines. Other than that, the format is unimportant. To see an example, run POLYCORR and examine the START.XPC file it creates.
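
For instance, a hypothetical START.XPC for joint ML estimation with R = 4 row levels and C = 3 column levels (so k = 4 + 3 - 1 = 6) might contain:

      0.3
     -1.0
      0.0
      1.0
     -0.5
      0.5

Here the first line is the start value for rho, the next three lines are ascending start values for row-variable thresholds 2-4, and the last two lines are ascending start values for column-variable thresholds 2-3.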
 

7  Negative Polychoric Correlations

A minor adjustment must be made to the latent trait model to accommodate a negative polychoric correlation. For technical reasons, in a given run POLYCORR will estimate rho either within the range 0 to 1.0 or -1.0 to 0, but not both.

This will not likely affect the user. The default start value for rho is the Pearson r calculated for the data. If the Pearson r is positive, a positive rho will be estimated; if the Pearson r is negative, a negative rho is estimated.

It is unlikely that rho would have a sign opposite of the Pearson r. Still, should this be the case, the user has an option. Suppose that the Pearson r is positive, and that POLYCORR attempts to estimate a positive rho. If the true rho is negative, one of two things will happen: (1) rho will be reported as 0; or (2) the program will terminate with an error message.

In either case the user should re-run the program using user-supplied start values. A negative value should be specified for the rho start value. This will cause POLYCORR to estimate a negative-valued rho.

Similarly, a user-supplied positive-valued rho will cause POLYCORR to estimate a positive-valued rho.
 

8  Limitations

It is possible to construct unusual data sets where POLYCORR will fail. (The same is probably true of any program to estimate the polychoric correlation. For example, even SAS has reported bugs associated with PROC FREQ PLCORR.) Estimating the polychoric correlation, like many forms of latent structure modeling, is a fairly complex numerical procedure and cannot be guaranteed to work in every case. However, that does not mean one should doubt the results in the large majority of cases where it does work.

With POLYCORR, any computational problem that might occur is usually obvious. Signs that something is wrong include a negative G-squared value or a program crash. If these occur, first try two-step estimation to see if that eliminates the problem. If that doesn't work, please send me email (including the input file) and I will try to correct the problem.

For added assurance that POLYCORR has worked correctly, examine the first derivatives in the printed output. If these are all near-zero, it is likely that the estimates are correct.
 

9  Technical Output

The STEPIT subroutine writes a small amount of output to the file STEPIT.OUT. Most users need not be concerned with this file. The most useful information is potentially the matrix of second derivatives of the objective function (in this case G-squared) relative to estimated model parameters, which is produced if standard errors are estimated.



Availability

Copyright Notice

POLYCORR is copyrighted (all rights reserved). It may be downloaded from this site, and the user may retain multiple copies of the downloaded version for his or her personal use. But it may not be transmitted to other users. It may not be translated to other programming languages without the express permission and consent of the author. You may not decompile, disassemble, modify, decrypt, or otherwise exploit this program.

The POLYCORR program can be downloaded at:

http://www.john-uebersax.com/bin/xpc.zip.

This user guide is available at:

http://www.john-uebersax.com/stat/xpc.htm.

I hope you find the POLYCORR program helpful. Please notify me if the program does not work correctly, or to suggest additions or changes that might make it more useful.

John Uebersax PhD

Disclaimer

This program is distributed as-is. It has not undergone extensive testing. The author does not guarantee accuracy and assumes no responsibility for unintended consequences of its use.

 
 
 

References


Appendix A

The following are the features of the advanced version of POLYCORR that, as of September, 2000, are not included with the simple version:


Appendix B

The file xpc.zip contains the following files:

xpc.htm   User guide for the POLYCORR program (advanced version); HTML format.
xpc.exe   Executable version of POLYCORR (advanced version)
input.txt   Sample input file
output.txt   Sample output file
BENCHMARK\   Folder containing benchmark input and output files


Last updated: 5 November 2010 (corrected links; xpc.exe now compatible with 64-bit Windows 7)


(c) 2006-2010 John Uebersax PhD