There is wide disagreement about the usefulness of kappa statistics to assess rater agreement. At the least, it can be said that (1) kappa statistics should not be viewed as the unequivocal standard or default way to quantify agreement; (2) one should be concerned about using a statistic that is the source of so much controversy; and (3) one should consider alternatives and make an informed choice.
One can distinguish between two possible uses of kappa: as a way to test rater independence (i.e., as a test statistic), and as a way to quantify the level of agreement (i.e., as an effect-size measure). The first use involves testing the null hypothesis that there is no more agreement than might occur by chance given random guessing; that is, one makes a qualitative, "yes or no" decision about whether raters are independent or not. Kappa is appropriate for this purpose (although to know that raters are not independent is not very informative; raters are dependent by definition, inasmuch as they are rating the same cases).
It is the second use of kappa--quantifying actual levels of agreement--that is the source of concern. Kappa's calculation uses a term called the proportion of chance (or expected) agreement. This is interpreted as the proportion of times raters would agree by chance alone. However, the term is relevant only under the conditions of statistical independence of raters. Since raters are clearly not independent, the relevance of this term, and its appropriateness as a correction to actual agreement levels, is very questionable.
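To make the role of the chance-agreement term concrete, here is a minimal sketch of the usual two-rater calculation. The function name `cohen_kappa` and the 2 x 2 table of counts are hypothetical, chosen only for illustration. Note that the expected-agreement term is built entirely from the product of the two raters' marginal proportions, which is exactly the independence assumption questioned above.

```python
def cohen_kappa(table):
    """Return (p_o, p_e, kappa) for a square table of counts.

    Rows index rater A's category, columns index rater B's category.
    """
    n = sum(sum(row) for row in table)
    k = len(table)

    # Observed agreement: proportion of cases on the main diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n

    # Marginal proportions for each rater.
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    # "Chance" (expected) agreement: the diagonal mass that would arise
    # if the two raters' ratings were statistically independent.
    p_e = sum(row_p[i] * col_p[i] for i in range(k))

    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, p_e, kappa


# Hypothetical data: 100 cases rated present/absent by two raters.
table = [[40, 10],
         [5, 45]]
p_o, p_e, kappa = cohen_kappa(table)
print(f"p_o = {p_o:.2f}, p_e = {p_e:.2f}, kappa = {kappa:.2f}")
```

For this illustrative table the raters agree on 85 of 100 cases; the independence-based term comes to 0.50, so kappa is reported as 0.70. Whether 0.50 describes anything about how these two raters actually make decisions is the point at issue.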
Thus, the common statement that kappa is a "chance-corrected measure of agreement" is misleading. As a test statistic, kappa can verify that agreement exceeds chance levels. But as a measure of the level of agreement, kappa is not "chance-corrected"; indeed, in the absence of some explicit model of rater decisionmaking, it is by no means clear how chance affects the decisions of actual raters and how one might correct for it.
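The test-statistic use can be sketched as a large-sample z test of the null hypothesis of rater independence. The example below is an illustration under stated assumptions, not a definitive procedure: the function name `kappa_z_test` and the table are hypothetical, the null variance is the large-sample formula usually attributed to Fleiss, Cohen and Everitt (1969) for unweighted kappa, and the normal approximation assumes a reasonably large number of cases.

```python
import math


def kappa_z_test(table):
    """Return (kappa, z, one_sided_p) for a square table of counts.

    Tests H0: raters are independent (kappa = 0) against kappa > 0,
    using the large-sample null variance of unweighted kappa.
    """
    n = sum(sum(row) for row in table)
    k = len(table)

    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    p_o = sum(table[i][i] for i in range(k)) / n
    p_e = sum(row_p[i] * col_p[i] for i in range(k))
    kappa = (p_o - p_e) / (1 - p_e)

    # Null variance: [p_e + p_e^2 - sum_i r_i c_i (r_i + c_i)] / [n (1 - p_e)^2]
    s = sum(row_p[i] * col_p[i] * (row_p[i] + col_p[i]) for i in range(k))
    var0 = (p_e + p_e ** 2 - s) / (n * (1 - p_e) ** 2)

    z = kappa / math.sqrt(var0)
    # Upper-tail probability of the standard normal distribution.
    one_sided_p = 0.5 * math.erfc(z / math.sqrt(2))
    return kappa, z, one_sided_p


# Hypothetical data: 100 cases rated present/absent by two raters.
table = [[40, 10],
         [5, 45]]
kappa, z, one_sided_p = kappa_z_test(table)
print(f"kappa = {kappa:.2f}, z = {z:.2f}, one-sided p = {one_sided_p:.2g}")
```

A large z here only licenses the "yes or no" conclusion that the raters are not independent; it says nothing about how much they agree.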
A better case for using kappa to quantify rater agreement is that, under certain conditions, it approximates the intraclass correlation. But this too is problematic in that (1) these conditions are not always met, and (2) one could instead directly calculate the intraclass correlation.
Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 20:37-46, 1960.
Cohen J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin. 70:213-20, 1968.
Cook RJ. Kappa. In: The Encyclopedia of Biostatistics, P. Armitage, T. Colton, eds., pp. 2160-2166. New York: Wiley, 1998.
Cook RJ. Kappa and its dependence on marginal rates. In: The Encyclopedia of Biostatistics, P. Armitage, T. Colton, eds., pp. 2166-2168. New York: Wiley, 1998.
Hutchinson TP. Focus on Psychometrics. Kappa muddles together two sources of disagreement: tetrachoric correlation is preferable. Research in Nursing & Health, 1993, 16, 313-316.
McKenzie DP, Mackinnon AJ, Peladeau N, Onghena P, Bruce PC, Clarke DM, Harrigan S, McGorry PD. Comparing correlated kappas by resampling: is one level of agreement significantly different from another? Journal of Psychiatric Research, 1996, 30, 483-492.
Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology, 1987, 126, 161-169.
Uebersax JS. Diversity of decision-making models and the measurement of interrater agreement. Psychological Bulletin, 1987, 101, 140-146.
Cook RJ. Kappa. In: The Encyclopedia of Biostatistics, P. Armitage, T. Colton, eds., pp. 2160-2166. New York: Wiley, 1998.
Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York: John Wiley, 1981, 38-46.
Fleiss JL, Levin B, Paik MC. Statistical methods for rates and proportions. 3rd ed. New York: Wiley, 2004.
Kraemer HC. Measurement of reliability for categorical data in medical research. Statistical Methods in Medical Research. 1(2):183-99, 1992.
Shrout PE. Measurement reliability and agreement in psychiatry. Statistical Methods in Medical Research. 7(3):301-17, 1998 Sep.
von Eye A, Mun EY. Analyzing rater agreement: manifest variable methods. New York: Routledge, 2005.
Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 20:37-46, 1960.
Fleiss JL. Measuring nominal scale agreement among many raters. Psychological Bulletin. 76:378-81, 1971.
Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York: John Wiley, 1981, 38-46.
Cicchetti DV. Comparison of the null distributions of weighted kappa and the C ordinal statistic. Applied Psychological Measurement, 1977, 1, 195-201.
Cohen J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin. 70:213-20, 1968.
Fleiss JL, Cohen, J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 1973, 33, 613-619.
Brenner H. Kliebsch U. Dependence of weighted kappa coefficients on the number of categories. Epidemiology. 7(2):199-202, 1996 Mar.
Byrt T. Bishop J. Carlin JB. Bias, prevalence and kappa. Journal of Clinical Epidemiology. 46(5):423-9, 1993 May.
Cicchetti DV. Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology. 43(6):551-8, 1990.
Cook RJ. Kappa and its dependence on marginal rates. In: The Encyclopedia of Biostatistics, P. Armitage, T. Colton, eds., pp. 2166-2168. New York: Wiley, 1998.
Feinstein AR. Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes [see comments]. Journal of Clinical Epidemiology. 43(6):543-9, 1990.
Grove WM, Andreasen NC, McDonald-Scott P, Keller MB, Shapiro RW. Reliability studies of psychiatric diagnosis. Theory and practice. Archives of General Psychiatry. 38(4):408-13, 1981 Apr.
Guggenmoos-Holzmann I. How reliable are chance-corrected measures of agreement? Statistics in Medicine. 12(23):2191-205, 1993 Dec 15.
Hutchinson TP. Focus on Psychometrics. Kappa muddles together two sources of disagreement: tetrachoric correlation is preferable. Research in Nursing & Health. 16(4):313-6, 1993 Aug.
Kraemer HC, Bloch DA. Kappa coefficients in epidemiology: an appraisal of a reappraisal. Journal of Clinical Epidemiology, 1988, 41, 959-68.
Lantz CA. Nebenzahl E. Behavior and interpretation of the kappa statistic: resolution of the two paradoxes. Journal of Clinical Epidemiology. 49(4):431-4, 1996 Apr.
Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology. 126(2):161-9, 1987 Aug. [dissenting letter and reply appear in Am J Epidemiol 1988 Nov;128(5):1179-81].
Spitznagel EL, Helzer JE. A proposed solution to the base rate problem in the kappa statistic. Archives of General Psychiatry. 42(7):725-8, 1985 Jul.
Stewart, G. W., J. M. Rey, "A Partial Solution to the Base Rate Problem of the κ Statistic," Archives of General Psychiatry, Vol. 45, 504-505, 1988.
Thompson WD. Walter SD. A reappraisal of the kappa coefficient. Journal of Clinical Epidemiology. 41(10):949-58, 1988.
Thompson WD. Walter SD. Kappa and the concept of independent errors. Journal of Clinical Epidemiology, 1988, 41, 969-70.
Uebersax JS. Measuring diagnostic reliability: Reply to Spitznagel and Helzer (letter). Archives of General Psychiatry, 1987, 44, 193-194.
Uebersax, J. S. (1987). Diversity of decision-making models and the measurement of interrater agreement. Psychological Bulletin, 101, 140-146.
Blackman NJ, Koval JJ. Interval estimation for Cohen's kappa as a measure of agreement. Statistics in Medicine. 19(5):723-741, 2000 Mar.
Donner A. Sample size requirements for the comparison of two or more coefficients of inter-observer agreement. Statistics in Medicine. 17(10):1157-68, 1998 May.
Donner A. Eliasziw M. A goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance-testing and sample size estimation [see comments]. Statistics in Medicine. 11(11):1511-9, 1992 Aug.
Donner A. Eliasziw M. Klar N. Testing the homogeneity of kappa statistics. Biometrics. 52(1):176-83, 1996 Mar.
Fleiss, J. L., J. Cohen, B. S. Everitt, "Large Sample Standard Errors of Kappa and Weighted Kappa," Psychological Bulletin, Vol. 72, 323-327, 1969.
Fleiss JL, Nee JCM, Landis JR. Large sample variance of kappa in the case of different sets of raters. Psychological Bulletin, 1979, 86, 974-77.
Hale CA. Fleiss JL. Interval estimation under two study designs for kappa with binary classifications. Biometrics. 49(2):523-34, 1993 Jun.
Lee J. Fung KP. Confidence interval of the kappa coefficient by bootstrap resampling [letter]. Psychiatry Research. 49(1):97-8, 1993 Oct.
Lehmann M. Daures JP. Mottet N. Navratil H. Comparison between exact and parametric distributions of multiple inter-raters agreement coefficient. Computer Methods & Programs in Biomedicine. 47(2):113-21, 1995 Jul.
Lui KJ. Kelly C. A note on interval estimation of kappa in a series of 2 x 2 tables. Statistics in Medicine. 18(15):2041-9, 1999 Aug 15.
McKenzie DP. Mackinnon AJ. Peladeau N. Onghena P. Bruce PC. Clarke DM. Harrigan S. McGorry PD. Comparing correlated kappas by resampling: is one level of agreement significantly different from another? Journal of Psychiatric Research. 30(6):483-92, 1996 Nov-Dec.
Barlow W. Lai MY. Azen SP. A comparison of methods for calculating a stratified kappa. Statistics in Medicine. 10(9):1465-72, 1991 Sep.
Donner A. Klar N. The statistical analysis of kappa statistics in multiple samples. Journal of Clinical Epidemiology. 49(9):1053-8, 1996 Sep.
Fleiss J, Spitzer R, Endicott J, Cohen J. Quantification of agreement in multiple psychiatric diagnosis. Archives of General Psychiatry, 1972, 26, 168-71.
Gross ST. The kappa coefficient of agreement for multiple observers when the number of subjects is small. Biometrics. 42(4):883-93, 1986 Dec.
Haley SM. Osberg JS. Kappa coefficient calculation using multiple ratings per subject: a special communication. Physical Therapy. 69(11):970-4, 1989 Nov.
Kupper LL. Hafner KB. On assessing interrater agreement for multiple attribute responses. Biometrics. 45(3):957-67, 1989 Sep.
Kvalseth TO. A coefficient of agreement for nominal scales: An asymmetric version of Kappa. Educational and Psychological Measurement. 1991 Spr; Vol 51(1): 95-101.
Lau T. Higher-order kappa-type statistics for a dichotomous attribute in multiple ratings. Biometrics. 49(2):535-42, 1993 Jun.
O'Connell, D. L., Dobson, A. J. (1984). General observer-agreement measures on individual subjects and groups of subjects. Biometrics, 40, 973-983.
Posner, K. L., Sampson, P. D., Caplan, R. A., Ward, R. J., Cheney, F. W. (1990). Measuring interrater reliability among multiple raters: An example of methods for nominal data. Statistics in Medicine, 9, 1103-1115.
Roberts C. McNamee R. A matrix of kappa-type coefficients to assess the reliability of nominal scales. Statistics in Medicine. 17(4):471-88, 1998 Feb 28.
Schouten HJA. Measuring pairwise interobserver agreement when all subjects are judged by the same observers. Statistica Neerlandica, 1982, 36, 45-61.
Schouten HJ. Estimating kappa from binocular data and comparing marginal probabilities. Statistics in Medicine. 12(23):2207-17, 1993 Dec 15.
Shoukri MM. Martin SW. Mian IU. Maximum likelihood estimation of the kappa coefficient from models of matched binary responses. Statistics in Medicine. 14(1):83-99, 1995 Jan 15.
Shoukri MM. Mian IU. Maximum likelihood estimation of the kappa coefficient from bivariate logistic regression. Statistics in Medicine. 15(13):1409-19, 1996 Jul 15.
Spitzer R, Cohen J, Fleiss J, Endicott J. Quantification of agreement in psychiatry diagnosis: A new approach. Archives of General Psychiatry, 1967, 17, 83-87.
Szalai JP. Kappa-sub(sc): A measure of agreement on a single rating category for a single item or object rated by multiple raters. Psychological Reports. 1998 Jun; Vol 82(3, Pt 2): 1321-1322.
Uebersax JS. A design-independent method for measuring the reliability of psychiatric diagnosis. Journal of Psychiatric Research. 1982-1983; Vol 17(4): 335-342.
Uebersax JS. A generalized kappa coefficient. Educational and Psychological Measurement. 1982 Spr; Vol 42(1): 181-183.
Ahn CW. Mezzich JE. PROPOV-K: a FORTRAN program for computing a kappa coefficient using a proportional overlap procedure. Computers & Biomedical Research. 22(5):415-23, 1989 Oct.
Aiken LR. Program for computing and evaluating reliability coefficients for criterion-referenced tests. Educational and Psychological Measurement. 1988 Fal; Vol 48(3): 697-700.
Berk RA, Campbell KL. A FORTRAN program for Cohen's kappa coefficient of observer agreement. Behavior Research Methods, Instruments and Computers. 1976 Aug; Vol 8(4): 396.
Boushka WM. Marinez YN. Prihoda TJ. Dunford R. Barnwell GM. A computer program for calculating kappa: application to interexaminer agreement in periodontal research. Computer Methods & Programs in Biomedicine. 33(1):35-41, 1990 Sep.
Gamsu CV. Calculating reliability measures for ordinal data. British Journal of Clinical Psychology. 1986 Nov; Vol 25(4): 307-308.
Moussa MA. The measurement of interobserver agreement based on categorical scales. Computer Programs in Biomedicine. 19(2-3):221-8, 1985.
Oud JH, Sattler JM. Generalized kappa coefficient: A Microsoft BASIC program. Behavior Research Methods, Instruments and Computers. 1984 Oct; Vol 16(5): 481.
Strube MJ. A general program for the calculation of the kappa coefficient. Behavior Research Methods, Instruments and Computers. 1989 Dec; Vol 21(6): 643-644.
Uebersax JS. GKAPPA: Generalized kappa coefficient (computer program abstract). Applied Psychological Measurement, 1983, 5, 28.
Valiquette CAM, Lesage AD, Cyr M, Toupin J. Computing Cohen's kappa coefficients using SPSS MATRIX. Behavior Research Methods, Instruments and Computers, 1994, 26, 60-61.
Vierkant RA. A SAS macro for calculating bootstrapped confidence intervals about a kappa coefficient. Paper presented at the annual SUGI (SAS User's Group) Meeting, 2000?
Updated:
01 Oct 2009 (Myth of chance correction)
18 Mar 2010 (link updated)