The basic McNemar test applies to 2×2 tables. Consider such a table that summarizes agreement between two raters on a dichotomous trait:
 | - | + | |
- | a | b | a + b |
+ | c | d | c + d |
 | a + c | b + d | total |
Marginal homogeneity implies that each row total equals the corresponding column total:
(a + b) = (a + c)
(c + d) = (b + d)
Since the a and the d on both sides of the equations cancel, this implies b = c; this is the basis of the McNemar test.
The McNemar statistic is calculated as
X^{2} = (b - c)^{2} / (b + c)     (Eq. 1)
The value X^{2} can be viewed as a chi-squared statistic with 1 df.
Some authors recommend a version of the McNemar test with a continuity correction, calculated as:
X^{2} = (|b - c| - 1)^{2} / (b + c)     (Eq. 2)
but this is controversial.
Statistical significance is determined by evaluating the probability of X^{2} with reference to a table of cumulative probabilities of the chi-squared distribution or a comparable computer function. A significant result implies that marginal frequencies (or proportions) are not homogeneous. The test is inherently two-tailed. For a one-tailed test, one could divide the obtained p value by two.
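As a minimal sketch of Eq. 1 and Eq. 2, the following Python computes both versions of the statistic and a two-tailed p value. The function names and the illustrative counts b = 10, c = 20 are assumptions made for this example. The p value uses the identity that, for 1 df, the upper-tail chi-squared probability equals erfc(sqrt(X^{2}/2)).

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variate with 1 df."""
    # For 1 df, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

def mcnemar(b, c, continuity=False):
    """Return (X^2, two-tailed p) for the discordant counts b and c."""
    if b + c == 0:
        raise ValueError("b + c must be positive")
    diff = abs(b - c)
    if continuity:
        diff -= 1  # Eq. 2: subtract 1 from |b - c| before squaring
    x2 = diff ** 2 / (b + c)
    return x2, chi2_sf_1df(x2)

# Illustrative discordant counts (hypothetical): b = 10, c = 20.
x2, p = mcnemar(10, 20)
x2_cc, p_cc = mcnemar(10, 20, continuity=True)
print(f"Eq. 1: X2 = {x2:.2f}, p = {p:.3f}")
print(f"Eq. 2: X2 = {x2_cc:.2f}, p = {p_cc:.3f}")
```

Only the two off-diagonal cells enter the calculation; the agreement cells a and d do not affect the result.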
When b and/or c are small, the McNemar test X^{2} is not well approximated by the chi-squared distribution. When, say, (b + c) < 10, a two-tailed exact test based on the cumulative binomial distribution with p = q = .5 can be used instead.
 | - | + |
- | 40 | 10 |
+ | 20 | 50 |
By Eq. 1, the McNemar test X^{2} = (10 - 20)^{2}/(10 + 20) = 100/30 = 3.33 (1 df, p = .068).
Using the continuity correction (Eq. 2), X^{2} = 2.70 (1 df, p = .100).
With the exact test, p = 0.099.
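The exact test can be sketched in a few lines of Python. Under the null hypothesis, the smaller of b and c follows a binomial distribution with n = b + c and p = .5; the two-tailed p value doubles the lower-tail probability. The function name is an assumption for this example, and capping the doubled tail at 1 is a choice made here, not something specified above.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-tailed exact p value for discordant counts b and c."""
    n = b + c
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double for a two-tailed test; cap at 1 (a choice made in this sketch).
    return min(1.0, 2.0 * tail)

p_exact = mcnemar_exact(10, 20)
print(f"exact two-tailed p = {p_exact:.3f}")
```

With b = 10 and c = 20 this agrees with the exact p value reported above.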
 | low | mod. | high | row total |
low | n_{11} | n_{12} | n_{13} | n_{1.} |
moderate | n_{21} | n_{22} | n_{23} | n_{2.} |
high | n_{31} | n_{32} | n_{33} | n_{3.} |
column total | n_{.1} | n_{.2} | n_{.3} | n_{..} |
To test marginal homogeneity for a single category, one collapses the full table into a 2×2 table. Specifically, to test row/column marginal homogeneity for category k, one collapses all rows and columns corresponding to the other categories. For example, to test marginal homogeneity for the category "low," one would collapse the table above to produce:
Rater 1 (rows) by Rater 2 (columns):
 | low | moderate or high |
low | n_{11} | n_{12}+n_{13} |
moderate or high | n_{21}+n_{31} | n_{22}+n_{23}+n_{32}+n_{33} |
and then apply the basic McNemar test to this table. The test has 1 df. A significant X^{2} value would imply that the Rater 1 and Rater 2 marginals for this category differ.
Similarly, to test the raters' marginal rates for the "moderate" category, one would collapse rows/columns 1 and 3 to produce the 2×2 table:
 | moderate | low or high |
moderate | n_{22} | n_{21}+n_{23} |
low or high | n_{12}+n_{32} | n_{11}+n_{13}+n_{31}+n_{33} |
and perform the basic McNemar test on this table.
In this way marginal homogeneity with respect to each category can be
tested. Because there are multiple tests, one may wish to adjust the
overall alpha. For example, a simple Bonferroni adjustment can be
applied. With K categories, there are K - 1 independent tests. For an
"experiment-wise" alpha of .05, the Bonferroni method would make
.05/(K - 1) the significance criterion for each test.
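The collapsing procedure and the Bonferroni criterion can be sketched as follows. After collapsing on category k, the two discordant cells are simply the row k total minus n_{kk} and the column k total minus n_{kk}, so the 2×2 table never has to be built explicitly. The function names and the 3×3 counts are illustrative assumptions; returning X^{2} = 0 and p = 1 when b + c = 0 is a choice made for this sketch.

```python
import math

def mcnemar(b, c):
    """X^2 in the Eq. 1 form and its 1-df chi-squared p value."""
    if b + c == 0:
        return 0.0, 1.0  # no discordant cases (a choice made in this sketch)
    x2 = (b - c) ** 2 / (b + c)
    return x2, math.erfc(math.sqrt(x2 / 2.0))

def category_tests(table, alpha=0.05):
    """Test marginal homogeneity for each category of a K x K table."""
    K = len(table)
    row_tot = [sum(r) for r in table]
    col_tot = [sum(r[j] for r in table) for j in range(K)]
    criterion = alpha / (K - 1)  # Bonferroni criterion described above
    results = []
    for k in range(K):
        b = row_tot[k] - table[k][k]  # row k, other columns collapsed
        c = col_tot[k] - table[k][k]  # column k, other rows collapsed
        x2, p = mcnemar(b, c)
        results.append((k + 1, b, c, x2, p, p < criterion))
    return results

# Illustrative 3 x 3 table: rows are Rater 1, columns are Rater 2.
table = [[20, 10, 5], [3, 30, 15], [0, 5, 40]]
results = category_tests(table)
for k, b, c, x2, p, significant in results:
    print(f"category {k}: b={b}, c={c}, X2={x2:.2f}, p={p:.4f}, "
          f"significant={significant}")
```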
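The Stuart-Maxwell test is calculated in the following way. Consider a K × K frequency table of the same form as Table 3. Let column vector d contain any K - 1 of the values d_{1}, d_{2}, ..., d_{K}, where
d_{i} = n_{i.} - n_{.i}
is the difference between the ith row total and the ith column total.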
Let S denote the (K - 1) × (K - 1) matrix of the variances and covariances of the elements of d. The elements of S are equal to:
s_{ii} = n_{i.} + n_{.i} - 2n_{ii}
s_{ij} = -(n_{ij} + n_{ji})     (i ≠ j)
The Stuart-Maxwell statistic is calculated as:
X^{2} = d' S^{-1} d
where d' is the transpose of d and matrix S^{-1} is the inverse of S.
X^{2} is interpreted as a chi-squared value with df equal to K - 1. In the case of K = 2, the Stuart-Maxwell statistic and the McNemar statistic (Eq. 1) are identically equal.
If there is perfect agreement for any category k, that category must be omitted in order to invert matrix S. (Note that if there is perfect agreement on a category, the corresponding row and column marginal frequencies are equal.) Such categories should be ignored in calculations and the Stuart-Maxwell test performed with respect to the remaining categories. The df in this case can still be considered K - 1, where K is the number of original categories; this treats omitted categories as if they were included but contributed 0 to the value of X^{2}--a reasonable view since such categories have equal row and column marginals.
 | low | mod. | high | row total |
low | 20 | 10 | 5 | 35 |
moderate | 3 | 30 | 15 | 48 |
high | 0 | 5 | 40 | 45 |
column total | 23 | 45 | 60 | 128 |
We first calculate any K - 1 of the (row sum - column sum) differences; we arbitrarily choose those for rows/columns 1 and 2. This produces d as follows:
d = (12, 3)'
The corresponding variance/covariance matrix S is:
 18  -13
-13   33
The inverse, S^{-1}, is:
0.0776  0.0306
0.0306  0.0424
The value of d' S^{-1} d = X^{2} = 13.76. With 2 df, p = 0.001.
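The worked example above can be checked with a short pure-Python sketch. It builds d and S from the first K - 1 categories and solves S x = d by Gauss-Jordan elimination instead of forming S^{-1} explicitly. The helper names are assumptions for this example; it does not handle the perfect-agreement case (it raises an error if S is singular), and the closed-form p value is valid for 2 df only.

```python
import math

def solve(A, y):
    """Solve A x = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(y)
    M = [[float(v) for v in A[i]] + [float(y[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("S is singular; omit perfectly agreeing categories")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def stuart_maxwell(table):
    """Return (X^2, df) using the first K - 1 categories."""
    K = len(table)
    row = [sum(r) for r in table]
    col = [sum(r[j] for r in table) for j in range(K)]
    idx = range(K - 1)
    d = [row[i] - col[i] for i in idx]
    S = [[row[i] + col[i] - 2 * table[i][i] if i == j
          else -(table[i][j] + table[j][i]) for j in idx] for i in idx]
    x = solve(S, d)  # x = S^{-1} d
    x2 = sum(di * xi for di, xi in zip(d, x))  # d' S^{-1} d
    return x2, K - 1

table = [[20, 10, 5], [3, 30, 15], [0, 5, 40]]
x2, df = stuart_maxwell(table)
p = math.exp(-x2 / 2.0)  # chi-squared upper tail; valid for df = 2 only
print(f"X2 = {x2:.2f}, df = {df}, p = {p:.3f}")
```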
The Bhapkar (1966) test is a more powerful alternative to the Stuart-Maxwell test. It is calculated in a way similar to the Stuart-Maxwell test (see above), but the formulas for the elements of S are different. See Agresti (2002) for details.
The Bhapkar and Stuart-Maxwell tests are asymptotically equivalent (Keefe, 1982). With a large N, both will produce the same chi-squared value. As the Bhapkar test is more powerful, it is preferred.
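The asymptotic equivalence can be illustrated numerically. The sketch below assumes the closed-form relation X^{2}(Bhapkar) = X^{2}(SM) / (1 - X^{2}(SM)/N), where N is the total number of cases. This relation is not given on this page and is used here as an assumption; the function name is likewise hypothetical. As N grows with X^{2}(SM) held fixed, the denominator approaches 1 and the two statistics converge.

```python
def bhapkar_from_stuart_maxwell(x2_sm, n):
    """Convert a Stuart-Maxwell X^2 to a Bhapkar X^2 (assumed relation)."""
    if x2_sm >= n:
        raise ValueError("relation requires X^2 < N")
    return x2_sm / (1.0 - x2_sm / n)

x2_sm = 5850 / 425  # Stuart-Maxwell value from the 3 x 3 example (13.76)
x2_b = bhapkar_from_stuart_maxwell(x2_sm, 128)
print(f"Stuart-Maxwell X2 = {x2_sm:.2f}, Bhapkar X2 = {x2_b:.2f}")

# With a much larger N and the same X2, the two values nearly coincide.
print(f"N = 100000: {bhapkar_from_stuart_maxwell(x2_sm, 100000):.2f}")
```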
The MH program, free software available on this website,
will perform both the Bhapkar test and the Stuart-Maxwell test.
With ordered-category ratings, it is often theoretically reasonable and intuitively appealing to consider the idea of rater thresholds. By this view, raters begin with a subjective continuous impression of how much of the trait a case has. Then they apply subjective thresholds or cutpoints which map that impression into a particular rating category. For example, if the trait is "mobility," a rater first perceives a given patient's level as falling somewhere on a continuum. The rater then applies thresholds to assign a specific rating category of, say, low, moderate, or high, as illustrated below.
   low      moderate         high
<--------|------------|---------------->
         t2           t3
In the example above, a case whose judged trait level is below threshold t_{2} would be assigned the rating category "low." A case whose judged trait level is above threshold t_{3} would be assigned the rating category "high." A case whose judged trait level is between the two thresholds would be assigned the rating category "moderate."
Threshold t_{k} (k = 2, ..., K) is the minimum trait level a case must display to be assigned rating level k or higher. There is no threshold t_{1}; a case is assigned rating level 1 if the case's trait level does not exceed threshold t_{2}.
Threshold locations potentially differ between raters. The locations of a rater's thresholds determine how often the rater uses each rating category. For example in the situation below,
<--------|------------|------------>   Rater 1
         t2           t3
We now return to the 3×3 crossclassification in Table 3. Suppose one wishes to test whether the lowest threshold (t_{2}) is the same for both raters. To do this one would first collapse all rows after Row 1 and all columns after Column 1. Then one would perform the McNemar test on the resulting 2×2 table. A significant result would imply that threshold t_{2} differs between the two raters. (Note that here the 2×2 table and associated McNemar test are the same as with Table 4.)
To test equality of threshold t_{3} between raters, one would collapse Rows 1 and 2, and Columns 1 and 2 to produce the following 2×2 table.
 | low or moderate | high |
low or moderate | n_{11}+n_{12}+n_{21}+n_{22} | n_{13}+n_{23} |
high | n_{31}+n_{32} | n_{33} |
In general, with a K × K table, one can test equality of a given threshold k (k = 2, ..., K) by collapsing rows/columns 1 to k-1 and collapsing rows/columns k to K, and performing the basic McNemar test on the resulting 2×2 table.
The tests for thresholds t_{2} and
t_{K} are identical to the tests of marginal
homogeneity for categories 1 and K (although the results are interpreted
differently). However, the tests for thresholds t_{3}, ...,
t_{K-1} are unique.
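The general collapsing rule for thresholds can be sketched as follows. For threshold k, the discordant cell b sums the frequencies in rows 1 to k-1 and columns k to K, and c sums those in rows k to K and columns 1 to k-1. The function name and the illustrative 3×3 counts are assumptions; the statistic uses the Eq. 1 form, with X^{2} set to 0 when b + c = 0 (a choice made for this sketch).

```python
import math

def threshold_tests(table):
    """McNemar test (Eq. 1 form) for each threshold t_2, ..., t_K."""
    K = len(table)
    results = {}
    for k in range(2, K + 1):  # k is the 1-based threshold index
        cut = k - 1            # categories 1..k-1 have 0-based index < cut
        b = sum(table[i][j] for i in range(cut) for j in range(cut, K))
        c = sum(table[i][j] for i in range(cut, K) for j in range(cut))
        x2 = (b - c) ** 2 / (b + c) if b + c else 0.0
        p = math.erfc(math.sqrt(x2 / 2.0))  # 1-df chi-squared upper tail
        results[k] = (b, c, x2, p)
    return results

# Illustrative 3 x 3 table: rows are Rater 1, columns are Rater 2.
table = [[20, 10, 5], [3, 30, 15], [0, 5, 40]]
res = threshold_tests(table)
for k, (b, c, x2, p) in res.items():
    print(f"t{k}: b={b}, c={c}, X2={x2:.2f}, p={p:.4f}")
```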
This simple test is described by Bishop, Fienberg and Holland (1975; pp. 284-285). For a K × K table, let b = the sum of frequencies in cells above the main diagonal, and let c = the sum of frequencies in cells below the main diagonal. For example, with reference to Table 3,
b = n_{12} + n_{13} + n_{23}
c = n_{21} + n_{31} + n_{32}
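A minimal sketch of this calculation follows. It assumes that b and c, once summed, are entered into the basic McNemar statistic (Eq. 1) with 1 df; this page does not state the statistic at this point, so that step is an assumption, as are the function name and the illustrative counts.

```python
import math

def overall_bias_test(table):
    """Sum the cells above and below the diagonal, then apply Eq. 1."""
    K = len(table)
    b = sum(table[i][j] for i in range(K) for j in range(i + 1, K))  # above
    c = sum(table[i][j] for i in range(K) for j in range(i))         # below
    # Assumed: the basic McNemar form with 1 df is applied to b and c.
    x2 = (b - c) ** 2 / (b + c) if b + c else 0.0
    p = math.erfc(math.sqrt(x2 / 2.0))
    return b, c, x2, p

# Illustrative 3 x 3 table: rows are Rater 1, columns are Rater 2.
table = [[20, 10, 5], [3, 30, 15], [0, 5, 40]]
b, c, x2, p = overall_bias_test(table)
print(f"b={b}, c={c}, X2={x2:.2f}, p={p:.5f}")
```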
The MH program will perform all the tests described on this page for a K × K crossclassification table, where K can be as large as 50.
SAS will perform a McNemar test for 2×2 tables. It is possible that
SPSS has similar features. Other specialized biostatistics and
epidemiological software packages, such as Epistat, perform the McNemar
test. For additional suggestions, one might search the web using
the keywords "McNemar test" and "software".
Barlow W. Modeling of categorical agreement. The encyclopedia of biostatistics, P. Armitage, T. Colton, eds., pp. 541-545. New York: Wiley, 1998.
Bhapkar VP. A note on the equivalence of two test criteria for hypotheses in categorical data. Journal of the American Statistical Association, 1966, 61, 228-235.
Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and practice. Cambridge, Massachusetts: MIT Press, 1975.
Everitt BS. The analysis of contingency tables. London: Chapman & Hall, 1977.
Fleiss JL. Statistical methods for rates and proportions (second ed.). New York: Wiley, 1981.
Keefe TJ. On the relationship between two tests for homogeneity of the marginal distributions in a two-way classification. Biometrika, 1982, 69(3), 683-684.
Maxwell AE. Comparing the classification of subjects by two independent judges. British Journal of Psychiatry, 1970, 116, 651-655.
McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 1947, 12, 153-157.
Sheskin DJ. Handbook of parametric and nonparametric statistical procedures (second edition). Boca Raton: Chapman & Hall, 2000.
Somes G. McNemar test. Encyclopedia of statistical sciences, vol. 5, S. Kotz & N. Johnson, eds., pp. 361-363. New York: Wiley, 1983.
Stuart AA. A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 1955, 42, 412-416.
Last updated: 30 August 2006 (Bhapkar test)