Tests of Marginal Homogeneity



Introduction

Consider symptom ratings (1 = low, 2 = moderate, 3 = high) by two raters on the same sample of subjects, summarized by a 3×3 table as follows:

Table 1. Summarization of ratings by Rater 1 (rows) and Rater 2 (columns).
               Rater 2
            1     2     3    Total
Rater 1  1  p11   p12   p13  p1.
         2  p21   p22   p23  p2.
         3  p31   p32   p33  p3.
     Total  p.1   p.2   p.3  1.0

Here pij denotes the proportion of all cases assigned to category i by Rater 1 and category j by Rater 2. (The table elements could as easily be frequencies.) The terms p1., p2., and p3. denote the marginal proportions for Rater 1, i.e., the total proportion of times Rater 1 uses categories 1, 2 and 3, respectively. Similarly, p.1, p.2, and p.3 are the marginal proportions for Rater 2.
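As a small illustration of this notation, the following Python sketch computes the joint and marginal proportions from a frequency table. The counts are hypothetical and chosen only for illustration; they are not data from any study.

```python
# Hypothetical 3x3 frequency table (illustrative counts only).
# Rows are Rater 1's categories, columns are Rater 2's categories.
counts = [
    [20, 5, 1],
    [4, 30, 6],
    [2, 7, 25],
]

n = sum(sum(row) for row in counts)                  # total number of cases
p = [[cell / n for cell in row] for row in counts]   # joint proportions pij

row_marginals = [sum(row) for row in p]              # p1., p2., p3. (Rater 1)
col_marginals = [sum(col) for col in zip(*p)]        # p.1, p.2, p.3 (Rater 2)

for k, (r, c) in enumerate(zip(row_marginals, col_marginals), start=1):
    print(f"category {k}: Rater 1 = {r:.3f}, Rater 2 = {c:.3f}")
```

With these counts the two raters use category 1 equally often and differ slightly on categories 2 and 3.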

Marginal homogeneity refers to equality (lack of significant difference) between one or more of the row marginal proportions and the corresponding column proportion(s). Testing marginal homogeneity is often useful in analyzing rater agreement. One reason raters disagree is that they differ in their propensity to use each rating category. When such differences are observed, it may be possible to provide feedback or improve instructions to make raters' marginal proportions more similar and improve agreement.

Differences in raters' marginal rates can be formally assessed with statistical tests of marginal homogeneity (Barlow, 1998; Bishop, Fienberg & Holland, 1975, Ch. 8). If each rater rates different cases, testing marginal homogeneity is straightforward: one can compare the marginal frequencies of different raters with a simple chi-squared test. However, this cannot be done when different raters rate the same cases, which is the usual situation in rater agreement studies. In that case the ratings of different raters are not statistically independent, and this must be accounted for.
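For the simpler case in which each rater rates different cases, the chi-squared comparison might be sketched as follows. The counts are hypothetical, and the closed-form p-value used here is valid only for three categories (df = 2); with another number of categories a general chi-squared routine would be needed.

```python
import math

# Hypothetical category counts for two raters who rated DIFFERENT cases.
rater_a = [26, 40, 34]
rater_b = [35, 45, 20]

def pearson_chi2(a, b):
    """Pearson chi-squared statistic for a 2 x K table of rater-by-category counts."""
    n_a, n_b = sum(a), sum(b)
    n = n_a + n_b
    stat = 0.0
    for x, y in zip(a, b):
        col_total = x + y
        for observed, row_total in ((x, n_a), (y, n_b)):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    return stat

chi2 = pearson_chi2(rater_a, rater_b)
df = len(rater_a) - 1

# Assumption: df == 2, where the chi-squared upper tail is exactly exp(-x/2).
assert df == 2
p_value = math.exp(-chi2 / 2)
print(f"chi-squared = {chi2:.3f}, df = {df}, p = {p_value:.4f}")
```

This calculation treats the two sets of ratings as independent samples, so it does not apply when both raters rate the same cases.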

Several statistical approaches to this problem are available. Alternatives include:
 

  • Nonparametric tests

  • Bootstrap methods

  • Loglinear, association, and quasi-symmetry models

  • Latent trait and related models
 
These approaches are outlined here.
 

Graphical and descriptive methods

Before discussing formal statistical methods, non-statistical methods for comparing raters' marginal distributions should be briefly mentioned. Simple descriptive methods can be very useful. For example, a table might report each rater's rate of use for each category. Graphical methods are especially helpful. A histogram can show the distribution of each rater's ratings across categories. The following example is from the output of the MH program:
 
             Marginal Distributions of Categories
              for Rater 1 (**) and Rater 2 (==)
 
0.304 +                                  **
      |                                  ** ==
      |                                  ** ==      ==
      |                          ** ==   ** ==   ** ==
      |                          ** ==   ** ==   ** ==
      |                          ** ==   ** ==   ** ==
      |                  ** ==   ** ==   ** ==   ** ==
      |          ** ==   ** ==   ** ==   ** ==   ** ==
      |  ** ==   ** ==   ** ==   ** ==   ** ==   ** ==
      |  ** ==   ** ==   ** ==   ** ==   ** ==   ** ==
    0 +----+-------+-------+-------+-------+-------+----
           1       2       3       4       5       6
 
      Notes:  x-axis is category number or level.
              y-axis is proportion of cases.

Vertical or horizontal stacked-bar histograms are good ways to summarize the data. With ordered-category ratings, a related type of figure shows the cumulative proportion of cases below each rating level for each rater. An example, again from the MH program, is as follows:
 

Proportion of cases below each level
 
1   2 3 4     5                   6
*---*-*-*-----*-------------------*--------------------------  Rater 1
*---*-*-*--------*------------*------------------------------  Rater 2
1   2 3 4        5            6
 
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+  Scale
0    .1    .2    .3    .4    .5    .6    .7    .8    .9   1.0

These are merely examples. Many other ways to graphically compare marginal distributions are possible.
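The quantities plotted in the cumulative figure above can be computed directly from each rater's marginal proportions. The sketch below uses hypothetical marginal proportions for a three-category rating; it is not the MH program's own code.

```python
from itertools import accumulate

# Hypothetical marginal proportions for categories 1, 2, 3.
rater1 = [0.26, 0.40, 0.34]
rater2 = [0.26, 0.42, 0.32]

def proportion_below(marginals):
    """Proportion of cases rated below each level (level 1 has nothing below it)."""
    running = list(accumulate(marginals))
    return [0.0] + running[:-1]

for name, marginals in (("Rater 1", rater1), ("Rater 2", rater2)):
    below = proportion_below(marginals)
    print(name, [round(x, 2) for x in below])
```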
 
 

Nonparametric tests

The main nonparametric test for assessing marginal homogeneity is the McNemar test. The McNemar test assesses marginal homogeneity in a 2×2 table. Suppose, however, that one has an N×N crossclassification frequency table that summarizes ratings by two raters for an N-category rating system. By collapsing the N×N table into various 2×2 tables, one can use the McNemar test to assess marginal homogeneity of each rating category. With ordered-category data one can also collapse the N×N table in other ways to test rater equality of category thresholds, or to test raters for overall bias (i.e., a tendency to make higher or lower ratings than other raters).

The Stuart-Maxwell test can be used to test marginal homogeneity between two raters across all categories simultaneously. It thus complements McNemar tests of individual categories by providing an overall significance value.
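For a three-category rating such as Table 1, the Stuart-Maxwell statistic can be computed with a 2×2 matrix inverse written out by hand. The sketch below assumes a 3×3 table of hypothetical counts. The p-value uses the closed form for a chi-squared variate with 2 df, so it should not be reused for tables of other sizes.

```python
import math

# Hypothetical 3x3 frequency table: rows = Rater 1, columns = Rater 2.
counts = [
    [20, 5, 1],
    [4, 30, 6],
    [2, 7, 25],
]

def stuart_maxwell_3x3(t):
    """Stuart-Maxwell chi-squared (2 df) and p-value for a 3x3 table of counts."""
    row = [sum(r) for r in t]
    col = [sum(c) for c in zip(*t)]
    # Marginal differences for the first two categories (the third is redundant).
    d1 = row[0] - col[0]
    d2 = row[1] - col[1]
    # Estimated covariance matrix of (d1, d2) under marginal homogeneity.
    s11 = row[0] + col[0] - 2 * t[0][0]
    s22 = row[1] + col[1] - 2 * t[1][1]
    s12 = -(t[0][1] + t[1][0])
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("covariance matrix is singular; statistic is undefined")
    # Quadratic form d' S^{-1} d using the explicit 2x2 inverse.
    chi2 = (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det
    p_value = math.exp(-chi2 / 2)   # exact upper tail for 2 df
    return chi2, p_value

chi2, p_value = stuart_maxwell_3x3(counts)
print(f"Stuart-Maxwell chi-squared = {chi2:.4f}, df = 2, p = {p_value:.4f}")
```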

Further explanation of these methods and their calculation is given in the separate discussions of the McNemar and Stuart-Maxwell tests.

MH, a computer program for testing marginal homogeneity with these methods, is available online.

These tests are remarkably easy to use and are usually just as effective as more complex methods. Because the tests are nonparametric, they make few or no assumptions about the data. While some of the methods described below are potentially more powerful, this comes at the price of making assumptions which may or may not be true. The simplicity of the nonparametric tests lends persuasiveness to their results.

A mild limitation is that these tests apply only for comparisons of two raters. With more than two raters, of course, one can apply the tests for each pair of raters.
 
 

Bootstrapping

Bootstrap and related jackknife methods (Efron, 1982; Efron & Tibshirani, 1993) provide a very general and flexible framework for testing marginal homogeneity. Again, suppose one has an N×N crossclassification frequency table summarizing agreement between two raters on an N-category rating. Using what is termed the nonparametric bootstrap, one would repeatedly sample from this table to produce a large number (e.g., 500) of pseudo-tables, each with the same total frequency as the original table.

Various measures of marginal homogeneity would be calculated for each pseudo-table; for example, one might calculate the difference between the row marginal proportion and the column marginal proportion for each category, or construct an overall measure of row vs. column marginal differences.

Let d* denote such a measure calculated for a given pseudo-table, and let d denote the same measure calculated for the original table. From the pseudo-tables, one can empirically calculate the standard deviation of d*, or sd*. Let d' denote the true population value of d. Assuming that d' = 0 corresponds to the null hypothesis of marginal homogeneity, one can test this null hypothesis by calculating the z value:

z = d/sd*

and determining the significance of the standard normal deviate z by usual methods (e.g., a table of z value probabilities).
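The procedure just described might be sketched in Python as follows. The table is hypothetical, the measure d is taken to be the row-minus-column marginal difference for a single category, and the function names, the 500 replications, and the fixed random seed are illustrative choices rather than part of any published program.

```python
import math
import random

# Hypothetical 3x3 frequency table: rows = Rater 1, columns = Rater 2.
counts = [
    [20, 5, 1],
    [4, 30, 6],
    [2, 7, 25],
]

def marginal_diff(table, cat):
    """d: row marginal proportion minus column marginal proportion for one category."""
    n = sum(sum(r) for r in table)
    row_total = sum(table[cat])
    col_total = sum(r[cat] for r in table)
    return (row_total - col_total) / n

def bootstrap_z(table, cat, n_boot=500, seed=0):
    """Nonparametric bootstrap z test of d' = 0 for one category."""
    rng = random.Random(seed)
    size = len(table)
    cells = [(i, j) for i in range(size) for j in range(size)]
    weights = [table[i][j] for i, j in cells]
    n = sum(weights)
    d = marginal_diff(table, cat)
    d_stars = []
    for _ in range(n_boot):
        # Draw n cases with replacement from the observed table (a pseudo-table).
        pseudo = [[0] * size for _ in range(size)]
        for i, j in rng.choices(cells, weights=weights, k=n):
            pseudo[i][j] += 1
        d_stars.append(marginal_diff(pseudo, cat))
    mean = sum(d_stars) / n_boot
    sd_star = math.sqrt(sum((x - mean) ** 2 for x in d_stars) / (n_boot - 1))
    z = d / sd_star
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return d, sd_star, z, p_value

d, sd_star, z, p_value = bootstrap_z(counts, cat=1)
print(f"d = {d:.3f}, sd* = {sd_star:.4f}, z = {z:.2f}, p = {p_value:.3f}")
```

Because each pseudo-table is drawn by resampling whole cases (cells of the cross-classification), the dependence between the two raters' ratings is preserved in the bootstrap distribution.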

The method above is merely an example. Many variations are possible within the framework of bootstrap and jackknife methods.

An advantage of bootstrap and jackknife methods is their flexibility. For example, one could potentially adapt them for simultaneous comparisons among more than two raters.

A potential disadvantage of these methods is that the user may need to write a computer program to apply them. However, such a program could also be used for other purposes, such as providing bootstrap significance tests and/or confidence intervals for various raw agreement indices.
 
 

Loglinear, association and quasi-symmetry modeling

If one is using a loglinear, association or quasi-symmetry model to analyze agreement data, one can adapt the model to test marginal homogeneity.

For each type of model the basic approach is the same. First one estimates a general form of the model--that is, one without assuming marginal homogeneity; let this be termed the "unrestricted model." Next one adds the assumption of marginal homogeneity to the model. This is done by applying equality restrictions to some model parameters so as to require homogeneity of one or more marginal probabilities (Barlow, 1998). Let this be termed the "restricted model."

Marginal homogeneity can then be tested using the difference G2 statistic, calculated as:

difference G2 = G2(restricted) - G2(unrestricted)

where

G2(restricted) and G2(unrestricted) are the likelihood-ratio chi-squared model fit statistics (Bishop, Fienberg & Holland, 1975) calculated for the restricted and unrestricted models.

The difference G2 can be interpreted as a chi-squared value and its significance determined from a table of chi-squared probabilities. The df are equal to the difference in df for the unrestricted and restricted models. A significant value implies that the rater marginal probabilities are not homogeneous.
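Once the two models have been fitted by whatever software one uses, the difference test itself is simple arithmetic. The sketch below assumes the G2 values and df are already available; the numbers shown are hypothetical, and the chi-squared tail probability is computed with a basic series expansion in place of a printed table.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, t = df / 2.0, x / 2.0
    # Power series for the regularized lower incomplete gamma function P(a, t).
    term = 1.0 / a
    total = term
    for n in range(1, 10000):
        term *= t / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    log_lower = a * math.log(t) - t - math.lgamma(a) + math.log(total)
    return max(0.0, 1.0 - math.exp(log_lower))

def difference_g2_test(g2_restricted, df_restricted, g2_unrestricted, df_unrestricted):
    """Difference G2, its df, and p-value for nested restricted/unrestricted models."""
    diff = g2_restricted - g2_unrestricted
    df = df_restricted - df_unrestricted
    if df <= 0:
        raise ValueError("the restricted model must have more df than the unrestricted one")
    return diff, df, chi2_sf(diff, df)

# Hypothetical fit statistics, for illustration only.
diff_g2, diff_df, p_value = difference_g2_test(12.4, 5, 3.1, 3)
print(f"difference G2 = {diff_g2:.2f}, df = {diff_df}, p = {p_value:.4f}")
```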

An advantage of this approach is that one can test marginal homogeneity for one category, several categories, or all categories using a unified approach. Another is that, if one is already analyzing the data with a loglinear, association, or quasi-symmetry model, the addition of marginal homogeneity tests may require relatively little extra work.

A possible limitation is that loglinear, association, and quasi-symmetry models are only well-developed for analysis of two-way tables. Another is that use of the difference G2 test typically requires that the unrestricted model fit the data, which sometimes might not be the case.

For an excellent discussion of these and related models (including linear-by-linear models), see Agresti (2002).
 
 

Latent trait and related models

Latent trait models and related methods such as the tetrachoric and polychoric correlation coefficients can be used to test marginal homogeneity for dichotomous or ordered-category ratings. The general strategy using these methods is similar to that described for loglinear and related models. That is, one estimates both an unrestricted version of the model and a restricted version that assumes marginal homogeneity, and compares the two models with a difference G2 test.

With latent trait and related models, the restricted models are usually constructed by assuming that the thresholds for one or more rating levels are equal across raters.

A variation of this method tests overall rater bias. That is done by estimating a restricted model in which the thresholds of one rater are equal to those of another plus a fixed constant. A comparison of this restricted model with the corresponding unrestricted model tests the hypothesis that the fixed constant, which corresponds to bias of a rater, is 0.

Another way to test marginal homogeneity using latent trait models is with the asymptotic standard errors of estimated category thresholds. These can be used to estimate the standard error of the difference between the thresholds of two raters for a given category, and this standard error used to test the significance of the observed difference.
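A minimal sketch of this standard-error approach is given below. The threshold estimates and standard errors are hypothetical. The covariance between the two estimates is set to 0 by default, which is an assumption; if the fitting software reports the covariance, it should be supplied.

```python
import math

def threshold_difference_test(t1, se1, t2, se2, cov=0.0):
    """z test for the difference between two raters' thresholds for one category.

    cov is the estimated covariance between the two threshold estimates;
    the default of 0.0 is an assumption, not a property of the models.
    """
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * cov)
    z = (t1 - t2) / se_diff
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return z, p_value

# Hypothetical threshold estimates and asymptotic standard errors.
z, p_value = threshold_difference_test(t1=0.50, se1=0.10, t2=0.20, se2=0.12)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```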

An advantage of the latent trait approach is that it can be used to assess marginal homogeneity among any number of raters simultaneously. A disadvantage is that these methods require more computation than nonparametric tests. If one is only interested in testing marginal homogeneity, the nonparametric methods might be a better choice. If one is already using latent trait models for other reasons, such as to estimate accuracy of individual raters or to estimate the correlation of their ratings, one might also use them to examine marginal homogeneity, although even in this case it might be simpler to use the nonparametric tests.

If there are many raters and categories, data may be sparse (i.e., many possible patterns of ratings across raters with 0 observed frequencies). With very sparse data, the difference G2 statistic is no longer distributed as chi-squared, so that standard methods cannot be used to determine its statistical significance.
 


References



Last updated: 31 August 2006 (added reference, counter)


(c) 2006 John Uebersax PhD