The Intraclass Correlation (ICC) assesses rating reliability by comparing the variability of different ratings of the same subject to the total variation across all ratings and all subjects.
The theoretical formula for the ICC is:

    ICC = s²(b) / [s²(b) + s²(w)]
where s²(w) is the pooled variance within subjects, and s²(b) is the variance of the trait between subjects.
It is easily shown that s²(b) + s²(w) equals the total variance of ratings--i.e., the variance of all ratings, regardless of whether they are for the same subject or not. Hence the interpretation of the ICC as the proportion of total variance accounted for by between-subject (true trait) variation.
The equation above would apply if we knew the true values s²(w) and s²(b). But we rarely do, and must instead estimate them from sample data. For this we wish to use all available information; this adds terms to the equation.
For example, s²(b) is the variance of true trait levels between subjects. Since we do not know a subject's true trait level, we estimate it from the subject's mean rating across the raters who rate the subject. Each mean rating is subject to sampling variation--deviation from the subject's true trait level, or its surrogate, the mean rating that would be obtained from a very large number of raters. Since actual mean ratings are often based on only two or a few ratings, these deviations are appreciable and inflate the estimate of between-subject variance.
We can estimate the amount of this extra, error variation and correct for it. If all subjects have k ratings, then for the Case 1 ICC (see definition below) the extra variation is estimated as (1/k) s²(w), where s²(w) is the pooled estimate of within-subject variance. When all subjects have k ratings, s²(w) equals the average variance of the k ratings of each subject (each calculated using k-1 as the denominator). To get the ICC we then: (1) estimate s²(b) as the variance of the subject mean ratings minus (1/k) s²(w); and (2) take the ratio s²(b) / [s²(b) + s²(w)].
| Case   | Design                                                      |
|--------|-------------------------------------------------------------|
| Case 1 | Raters for each subject are selected at random.             |
| Case 2 | The same raters rate each case. These are a random sample.  |
| Case 3 | The same raters rate each case. These are the only raters.  |
Case 1. One has a pool of raters. For each subject, one randomly samples from the rater pool k different raters to rate this subject. Therefore the raters who rate one subject are not necessarily the same as those who rate another. This design corresponds to a 1-way Analysis of Variance (ANOVA) in which Subject is a random effect, and Rater is viewed as measurement error.
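The Case 1 estimation steps described above can be sketched in Python; the ratings below are made-up data for illustration only:

```python
from statistics import mean, variance  # variance uses the k-1 denominator

# Hypothetical Case 1 data: rows are subjects, columns the k ratings each
# subject received (the raters need not be the same across subjects).
ratings = [
    [4, 5],
    [2, 3],
    [5, 5],
    [3, 2],
    [1, 2],
    [4, 4],
]
k = len(ratings[0])

# Step 1: pooled within-subject variance = average of per-subject variances.
s2_w = mean(variance(row) for row in ratings)

# Step 2: between-subject variance = variance of subject mean ratings,
# corrected for the (1/k) s2(w) inflation described above.
subject_means = [mean(row) for row in ratings]
s2_b = variance(subject_means) - s2_w / k

icc_case1 = s2_b / (s2_b + s2_w)
print(round(icc_case1, 3))  # → 0.836
```

This is algebraically the same as the Shrout-Fleiss one-way ANOVA estimator ICC(1,1) = (BMS - WMS) / [BMS + (k-1) WMS].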
Case 2. The same set of k raters rates each subject. This corresponds to a fully-crossed (Rater × Subject), 2-way ANOVA design in which both Subject and Rater are separate effects. In Case 2, Rater is considered a random effect; this means the k raters in the study are considered a random sample from a population of potential raters. The Case 2 ICC estimates the reliability of the larger population of raters.
Case 3. This is like Case 2--a fully-crossed, 2-way ANOVA design. But here one estimates the ICC that applies only to the k raters in the study. Since this does not permit generalization to other raters, the Case 3 ICC is not often used.
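A sketch of the Case 2 and Case 3 single-rating estimators, using the standard Shrout-Fleiss mean-square formulas from the two-way ANOVA; the data are hypothetical:

```python
from statistics import mean

# Hypothetical fully crossed data: rows = subjects, columns = the same k raters.
x = [
    [4, 5, 4],
    [2, 3, 3],
    [5, 5, 4],
    [3, 2, 3],
    [1, 2, 1],
]
n, k = len(x), len(x[0])
g = mean(v for row in x for v in row)                       # grand mean

bms = k * sum((mean(row) - g) ** 2 for row in x) / (n - 1)  # between-subjects MS
jms = n * sum((mean(row[j] for row in x) - g) ** 2
              for j in range(k)) / (k - 1)                  # between-raters MS
ss_err = (sum((v - g) ** 2 for row in x for v in row)
          - bms * (n - 1) - jms * (k - 1))                  # residual SS
ems = ss_err / ((n - 1) * (k - 1))                          # residual (error) MS

# Shrout-Fleiss single-rating ICCs for the two designs:
icc_2_1 = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)  # Case 2
icc_3_1 = (bms - ems) / (bms + (k - 1) * ems)                        # Case 3
print(round(icc_2_1, 3), round(icc_3_1, 3))  # → 0.84 0.833
```

Note that Case 2 charges rater mean differences (the JMS term) against reliability, while Case 3 does not; with these data the two estimates are close because the raters differ little in their means.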
Shrout and Fleiss (1979) also show that for each of the three Cases above, one can use the ICC in two ways: to estimate the reliability of a single rating, or the reliability of the mean of the k ratings for each subject.
The ICC, and more broadly, ANOVA analysis of ratings, is very flexible. Besides the six ICCs discussed above, one can consider more complex designs, such as a grouping factor among raters (e.g., experts vs. nonexperts), or covariates. See Landis and Koch (1977a,b) for examples.
Software to estimate the ICC is readily available (e.g., SPSS and SAS). Output from almost any ANOVA software will contain the values needed to calculate the ICC.
The ICC allows estimation of the reliability of both single and mean ratings. "Prophecy" formulas let one predict the reliability of mean ratings based on any number of raters.
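One such prophecy formula is the classical Spearman-Brown formula; a minimal sketch:

```python
def mean_rating_reliability(icc_single, m):
    """Spearman-Brown 'prophecy' formula: predicted reliability of the
    mean of m ratings, given the reliability (ICC) of a single rating."""
    return m * icc_single / (1 + (m - 1) * icc_single)

# If a single rating has ICC = 0.6, the mean of 3 raters' ratings is
# predicted to have reliability:
print(round(mean_rating_reliability(0.6, 3), 3))  # → 0.818
```

This makes explicit why averaging over more raters raises reliability: the error (within-subject) variance of a mean shrinks by a factor of 1/m while the between-subject variance is unchanged.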
An alternative to the ICC for Cases 2 and 3 is to calculate the Pearson correlation between all pairs of raters. The Pearson correlation measures association between raters, but is insensitive to rater mean differences (bias). The ICC decreases in response to both lower correlation between raters and larger rater mean differences. Some may see this as an advantage of the ICC, but others (see Cons) as a limitation.
The ICC can be used to compare the reliability of different instruments. For example, the reliability of a 3-level rating scale can be compared to the reliability of a 5-level scale (provided they are assessed relative to the same sample or population; see Cons).
The ICC is strongly influenced by the variance of the trait in the sample/population in which it is assessed. ICCs measured for different populations might not be comparable.
For example, suppose one has a depression rating scale. When applied to a random sample of the adult population the scale might have a high ICC. However, if the scale is applied to a very homogeneous population--such as patients hospitalized for acute depression--it might have a low ICC.
This is evident from the definition of the ICC as s²(b) / [s²(b) + s²(w)]. In both populations above, s²(w), the variance of different raters' opinions of the same subject, may be the same. But the between-subject variance, s²(b), may be much smaller in the clinical population than in the general population. Therefore the ICC would be smaller in the clinical population.
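A quick numeric illustration of this point; all variance values are hypothetical:

```python
# Both populations share the same rater disagreement, s2(w); only the
# trait variance s2(b) differs between them.
s2_w = 1.0           # within-subject (rater) variance, same in both groups
s2_b_general = 9.0   # trait variance, general adult population
s2_b_clinic = 1.0    # trait variance, homogeneous inpatient population

icc_general = s2_b_general / (s2_b_general + s2_w)
icc_clinic = s2_b_clinic / (s2_b_clinic + s2_w)
print(icc_general, icc_clinic)  # → 0.9 0.5
```

The raters behave identically in both groups, yet the scale looks "reliable" in one population and mediocre in the other.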
This issue is similar to, and just as much a concern as, the "base rate" problem of the kappa coefficient. It means that the same instrument may be judged "reliable" or "unreliable," depending on the population in which it is assessed.
For more discussion of the implications of this topic, see The Comparability Issue below.
To use the ICC with ordered-category ratings, one must assign the rating categories numeric values. Usually categories are assigned values 1, 2, ..., C, where C is the number of rating categories; this assumes all categories are equally wide, which may not be true. An alternative is to assign ordered categories numeric values from their cumulative frequencies via probit (for a normally distributed trait) or ridit (for a rectangularly distributed trait) scoring; see Fleiss (1981).
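A sketch of ridit and probit scoring from category frequencies, using only Python's standard library; the frequencies are hypothetical:

```python
from statistics import NormalDist

# Hypothetical category frequencies for a 4-level ordered scale,
# pooled over all ratings in the sample.
freqs = [10, 40, 35, 15]
n = sum(freqs)

ridits, probits, cum = [], [], 0
for f in freqs:
    mid = (cum + f / 2) / n    # cumulative proportion up to the category midpoint
    ridits.append(mid)                         # ridit: rectangular-trait score
    probits.append(NormalDist().inv_cdf(mid))  # probit: normal-trait score
    cum += f

print([round(r, 3) for r in ridits])  # → [0.05, 0.3, 0.675, 0.925]
```

Either set of scores can then replace the values 1, 2, ..., C in the ICC calculation, so that categories are spaced according to how the sample actually uses them rather than assumed equally wide.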
The ICC combines, or some might say confounds, two ways in which raters differ: (1) association, which concerns whether the raters understand the meaning of the trait in the same way, and (2) bias, which concerns whether some raters' mean ratings are higher or lower than others'. If a goal is to give feedback to raters to improve future ratings, one should distinguish between these two sources of disagreement. For discussion of alternatives that separate these components, see the Likert Scale page of this website.
With ordered-category or Likert-type data, the ICC discounts the fact that we have a natural unit to evaluate rating consistency: the number or percent of agreements on each rating category. Raw agreement is simple, intuitive, and clinically meaningful. With ordered category data, it is not clear why one would prefer the ICC to raw agreement rates, especially in light of the comparability issue discussed below. A good idea is to report reliability using both the ICC and raw agreement rates.
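Raw agreement rates are easy to compute; a sketch for two raters, with hypothetical ratings:

```python
# Hypothetical ratings by two raters of the same 10 subjects on a 5-level scale.
r1 = [3, 4, 2, 5, 1, 3, 4, 2, 3, 5]
r2 = [3, 5, 2, 4, 1, 3, 3, 2, 4, 5]
n = len(r1)

exact = sum(a == b for a, b in zip(r1, r2)) / n               # same category
within_one = sum(abs(a - b) <= 1 for a, b in zip(r1, r2)) / n # within one level
print(exact, within_one)  # → 0.6 1.0
```

Figures like "exact agreement 60%, agreement within one level 100%" are immediately interpretable by clinical readers in a way a single ICC value is not.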
Above it was noted that the ICC is strongly dependent on the trait variance within the population for which it is measured. This can complicate comparisons of ICCs measured in different populations, or in generalizing results from a single population.
Some suggest avoiding this problem by eliminating or holding constant the "problematic" term, s²(b).
Holding the term constant would mean choosing some fixed value for s²(b), and using this in place of the different value estimated in each population. For example, one might pick as s²(b) the trait variance in the general adult population--regardless of what population the ICC is measured in.
However, if one is going to hold s²(b) constant, one may well question using it at all! Why not simply report as the index of unreliability the value of s²(w) for a study? Indeed, this has been suggested, though it is not much used in practice.
But if one is going to disregard s²(b) because it complicates comparisons, why not go a step further and express reliability simply as raw agreement rates--for example, the percent of times two raters agree on the exact same category, and the percent of times they are within one level of one another?
An advantage of including s²(b) is that it automatically controls for the scaling factor of an instrument. Thus (at least within the same population), ICCs for instruments with different numbers of categories can be meaningfully compared. Such is not the case with raw agreement measures or with s²(w) alone. Therefore, someone reporting reliability of a new scale may wish to include the ICC along with other measures if they expect later researchers might compare their results to those of a new or different instrument with fewer or more categories.
SPSS has excellent features for calculating the ICC. The sources below explain them:
Barrett P. Assessing the reliability of rating data.
Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York: John Wiley, 1981, 38-46.
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977a;33:159-174.
Landis JR, Koch GG. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 1977b; 33: 363-374.
McGraw KO, Wong SP. Forming inferences about some intraclass correlations. Psychological Methods 1996;1:30-46.
Müller R, Büttner P. A critical discussion of intraclass correlation coefficients. Statistics in Medicine 1994;13(23-24):2465-2476.
Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin 1979;86:420-428.
Shrout PE. Measurement reliability and agreement in psychiatry. Statistical Methods in Medical Research 1998;7(3):301-317.
Last updated: 2 April 2007 (removed background)