Inter-Rater Reliability Measures in R

Intraclass Correlation Coefficient in R

The Intraclass Correlation Coefficient (ICC) measures the strength of inter-rater agreement when the rating scale is continuous or ordinal. It is suitable for studies with two or more raters. The ICC can also be used for test-retest (repeated measures of the same subject) and intra-rater (multiple scores from the same rater) reliability analysis.

Generally speaking, the ICC determines the reliability of ratings by comparing the variability of different ratings of the same individuals to the total variation across all ratings and all individuals.

• A high ICC (close to 1) indicates high similarity between values from the same group (here, the ratings given to the same subject).
• A low ICC (close to zero) means that values from the same group are not similar.
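To make these two cases concrete, here is a minimal base-R sketch. It assumes the one-way definition of the ICC (between-subject variance relative to total variance, estimated from ANOVA mean squares). The helper name icc_oneway and both small data sets are hypothetical and were invented for illustration only.

```r
# One-way ICC for a subjects-by-raters matrix (rows = subjects, columns = ratings).
# Hypothetical helper, not part of any package.
icc_oneway <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)                     # number of subjects
  k <- ncol(x)                     # number of ratings per subject
  subject_means <- rowMeans(x)
  # Between-subject and within-subject mean squares
  msb <- k * sum((subject_means - mean(x))^2) / (n - 1)
  msw <- sum((x - subject_means)^2) / (n * (k - 1))
  (msb - msw) / (msb + (k - 1) * msw)
}

# Made-up ratings: raters largely agree on each subject
ratings_similar <- matrix(c(1, 1, 2,  3, 3, 3,  5, 5, 4,  7, 7, 7,  9, 8, 9),
                          ncol = 3, byrow = TRUE)
# Made-up ratings: raters disagree strongly on each subject
ratings_dissimilar <- matrix(c(2, 6, 4,  7, 3, 5,  4, 8, 3,  6, 2, 7,  5, 7, 6),
                             ncol = 3, byrow = TRUE)

icc_similar <- icc_oneway(ratings_similar)        # close to 1
icc_dissimilar <- icc_oneway(ratings_dissimilar)  # low; ANOVA estimates can even be negative
print(c(similar = icc_similar, dissimilar = icc_dissimilar))
```

In the second data set the ratings of one subject differ more from each other than the subjects differ from one another, which is why the estimate drops so low.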

There are multiple forms of ICC (Koo and Li 2016). This article describes how to:

• choose the correct ICC form for inter-rater reliability studies.
• compute the intraclass correlation coefficient in R.

How to choose the correct ICC form

There are different forms of ICC that can give different results when applied to the same set of data (Koo and Li 2016). The forms of ICC can be defined based on the:

• model: one-way random effects, two-way random effects or two-way mixed effects.
• unit: single rater or the mean of k raters.
• type of relationship considered to be important: consistency or absolute agreement.

There are three models:

• ICC1: One-way random-effects model. In this model, each subject is rated by a different set of randomly chosen raters. Here, raters are considered as random effects. In practice, this model is rarely used in clinical reliability analysis because the majority of such studies involve the same set of raters measuring all individuals. An exception would be multicenter studies in which the physical distance between centers makes it impossible to use the same set of raters to rate all subjects. In such a situation, the one-way random-effects model should be used (Koo and Li 2016).
• ICC2: Two-way random-effects model. A set of k raters is randomly selected, and each subject is then measured by the same set of k raters. In this model, both subjects and raters are viewed as random effects. The two-way random-effects model is chosen if we plan to generalize our reliability results to any raters who possess the same characteristics as the raters selected for the reliability study. This model is appropriate for evaluating rater-based clinical assessment methods that are designed for routine clinical use.
• ICC3: Two-way mixed-effects model. Here the raters are considered as fixed. We should use the two-way mixed-effects model if the selected raters are the only raters of interest. With this model, the results only represent the reliability of the specific raters involved in the reliability experiment. They cannot be generalized to other raters, even if those raters have characteristics similar to those of the selected raters. The two-way mixed-effects model is less commonly used in inter-rater reliability analysis.
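The three models lead to different formulas. The base-R sketch below computes the single-rating versions from ANOVA mean squares, following the formulas usually attributed to Shrout and Fleiss (1979). The helper name icc_single_forms is hypothetical, and the 6 x 4 matrix is, to our knowledge, the small example table from that paper; treat both as illustration rather than as the implementation used by any package.

```r
# Single-rating ICCs from ANOVA mean squares (rows = subjects, columns = raters).
# Hypothetical helper written for illustration.
icc_single_forms <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)                     # subjects
  k <- ncol(x)                     # raters
  grand <- mean(x)
  ss_total <- sum((x - grand)^2)
  ss_subjects <- k * sum((rowMeans(x) - grand)^2)
  ss_raters <- n * sum((colMeans(x) - grand)^2)
  msb <- ss_subjects / (n - 1)                                      # between subjects
  msj <- ss_raters / (k - 1)                                        # between raters
  msw <- (ss_total - ss_subjects) / (n * (k - 1))                   # within subjects
  mse <- (ss_total - ss_subjects - ss_raters) / ((n - 1) * (k - 1)) # residual
  c(
    ICC1 = (msb - msw) / (msb + (k - 1) * msw),                       # one-way random
    ICC2 = (msb - mse) / (msb + (k - 1) * mse + k * (msj - mse) / n), # two-way random, agreement
    ICC3 = (msb - mse) / (msb + (k - 1) * mse)                        # two-way mixed, consistency
  )
}

# 6 subjects rated by 4 judges
sf <- matrix(c(9, 2, 5, 8,
               6, 1, 3, 2,
               8, 4, 6, 8,
               7, 1, 2, 6,
               10, 5, 6, 9,
               6, 2, 4, 7), ncol = 4, byrow = TRUE)
forms <- icc_single_forms(sf)
print(round(forms, 2))
```

On this table the three forms differ widely because the judges have very different mean ratings. The rater mean square enters the ICC2 denominator but is left out of ICC3.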

Unit of ratings. For each of these 3 models, reliability can be estimated for a single rating or for the average of k ratings. The selection between “single” and “average” depends on how the measurement protocol will be conducted in the actual application (Koo and Li 2016). For example:

• If we plan to use the mean value of k raters as an assessment basis, the experimental design of the reliability study should involve k raters (for example, 3 raters if the mean of 3 ratings will be used), and the “average of k raters” type should be selected.
• Conversely, if we plan to use the measurement from a single rater as the basis of the actual measurement, the “single rater” type should be selected even though the reliability experiment involves 2 or more raters.
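The single and average forms of the same model are linked by the Spearman-Brown formula. The sketch below applies it to the rounded single-rating values printed by ICC() for the anxiety data later in this article (0.18, 0.20 and 0.22, with k = 3 raters). The function name step_up is hypothetical. Because the inputs are rounded, the results can differ from the ICC1k, ICC2k and ICC3k rows of that table in the second decimal.

```r
# Spearman-Brown step-up: reliability of the mean of k ratings,
# given the reliability of a single rating. Hypothetical helper name.
step_up <- function(icc_single, k) {
  k * icc_single / (1 + (k - 1) * icc_single)
}

# Rounded single-rating values taken from the ICC() output shown in this article
single <- c(ICC1 = 0.18, ICC2 = 0.20, ICC3 = 0.22)
average <- step_up(single, k = 3)
names(average) <- paste0(names(single), "k")
print(round(average, 2))
```

The average-rating reliability is never lower than the single-rating one when the single-rating ICC is positive, which is why the choice of unit must match how the measurement will be used in practice.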

In the next sections, we’ll use the terms:

• ICC1, ICC2 and ICC3 to designate the reliability of a single rating; and
• ICC1k, ICC2k and ICC3k to designate the reliability of the average of k raters.

Consistency or absolute agreement. In the one-way model, the ICC is always a measure of absolute agreement. In the two-way models, a choice can be made between two types: consistency, when systematic differences between raters are irrelevant, and absolute agreement, when systematic differences are relevant. In other words, absolute agreement measures the extent to which different raters assign the same score to the same subject. Conversely, consistency concerns whether raters’ scores for the same group of subjects are correlated in an additive manner (Koo and Li 2016).
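The difference between the two types shows up clearly when one rater is systematically more severe than another. In the base-R sketch below, the second rater always scores exactly 2 points higher than the first. The data and the helper name icc_twoway_single are hypothetical, and the formulas are the usual mean-square expressions for single ratings in a two-way design.

```r
# Consistency and absolute-agreement ICC for single ratings in a two-way design.
# Hypothetical helper written for illustration.
icc_twoway_single <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- ncol(x)
  grand <- mean(x)
  ss_total <- sum((x - grand)^2)
  ss_subjects <- k * sum((rowMeans(x) - grand)^2)
  ss_raters <- n * sum((colMeans(x) - grand)^2)
  msb <- ss_subjects / (n - 1)
  msj <- ss_raters / (k - 1)
  mse <- (ss_total - ss_subjects - ss_raters) / ((n - 1) * (k - 1))
  c(
    consistency = (msb - mse) / (msb + (k - 1) * mse),
    agreement = (msb - mse) / (msb + (k - 1) * mse + k * (msj - mse) / n)
  )
}

# Made-up scores: rater2 ranks subjects exactly like rater1 but adds 2 points
rater1 <- c(1, 2, 3, 4, 5)
shifted <- cbind(rater1 = rater1, rater2 = rater1 + 2)
both_types <- icc_twoway_single(shifted)
print(round(both_types, 3))
```

Consistency is perfect here because the raters differ only by a constant, while absolute agreement is clearly lower because the two raters never assign the same score.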

Note that the two-way mixed-effects model with absolute agreement is recommended for both test-retest and intra-rater reliability studies (Koo and Li 2016).

ICC Interpretation

Koo and Li (2016) give the following suggestions for interpreting the ICC:

• below 0.50: poor
• between 0.50 and 0.75: moderate
• between 0.75 and 0.90: good
• above 0.90: excellent
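These cut-offs can be turned into a small labeling helper. This is only a sketch: the function name interpret_icc is hypothetical, and assigning a value that falls exactly on a threshold (0.50, 0.75 or 0.90) to the higher category is our assumption, since the list above does not say how ties are handled.

```r
# Label ICC values with the categories listed above.
# Values exactly on a threshold go to the higher category (an assumption).
interpret_icc <- function(icc) {
  cut(
    icc,
    breaks = c(-Inf, 0.50, 0.75, 0.90, Inf),
    labels = c("poor", "moderate", "good", "excellent"),
    right = FALSE
  )
}

labels <- as.character(interpret_icc(c(0.198, 0.62, 0.80, 0.95)))
print(labels)
# "poor" "moderate" "good" "excellent"
```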

Example of data

We’ll use the anxiety data [irr package], which contains the anxiety ratings of 20 subjects, rated by 3 raters. Values range from 1 (not anxious at all) to 6 (extremely anxious).

data("anxiety", package = "irr")
head(anxiety, 4)
##   rater1 rater2 rater3
## 1      3      3      2
## 2      3      6      1
## 3      3      4      4
## 4      4      6      4

We want to compute the inter-rater agreement using ICC2.

Computing ICC in R

There are many functions and R packages for computing the ICC. Here, we’ll consider the function icc() [irr package] and the function ICC() [psych package].

Using the irr package

Recall that there are different forms of ICC. When considering which form is appropriate for an actual set of data, one has to make several decisions (Shrout and Fleiss 1979):

1. Should only the subjects be considered as random effects (“oneway” model), or are both subjects and raters randomly chosen from a bigger pool of persons (“twoway” model)?
2. If differences in judges’ mean ratings are of interest, inter-rater “agreement” instead of “consistency” should be computed.
3. If the unit of analysis is a mean of several ratings, unit should be changed to “average”. In most cases, however, single values (unit = “single”) are of interest.

You can specify the different parameters as follows:

library("irr")
icc(
  anxiety, model = "twoway",
  type = "agreement", unit = "single"
)
##  Single Score Intraclass Correlation
##
##    Model: twoway
##    Type : agreement
##
##    Subjects = 20
##      Raters = 3
##    ICC(A,1) = 0.198
##
##  F-Test, H0: r0 = 0 ; H1: r0 > 0
##  F(19,39.7) = 1.83 , p = 0.0543
##
##  95%-Confidence Interval for ICC Population Values:
##   -0.039 < ICC < 0.494
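If the mean of the three raters were used in practice, only the unit argument would change. The sketch below assumes the irr package is installed and skips the computation otherwise. The component names value, lbound, ubound and p.value are the ones we understand icc() to return; running str() on the result is a quick way to confirm them in your installed version.

```r
# Average-rating version of the same model; runs only if irr is available.
if (requireNamespace("irr", quietly = TRUE)) {
  data("anxiety", package = "irr")
  res_avg <- irr::icc(
    anxiety, model = "twoway",
    type = "agreement", unit = "average"
  )
  # Pull the point estimate, confidence bounds and p-value out of the result
  print(round(c(
    icc = res_avg$value,
    lower = res_avg$lbound,
    upper = res_avg$ubound,
    p = res_avg$p.value
  ), 3))
}
```

The estimate should correspond to the ICC2k row of the ICC() output from the psych package.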

Using the psych package

If you use the ICC() function, you don’t need to specify the model, type or unit. The function computes all six forms, and you simply select the one that matches your study design. The output looks like this:

# install.packages("psych")
library(psych)
ICC(anxiety)
## Call: ICC(x = anxiety)
##
## Intraclass correlation coefficients
##                          type  ICC   F df1 df2     p lower bound upper bound
## Single_raters_absolute   ICC1 0.18 1.6  19  40 0.094      -0.077        0.48
## Single_random_raters     ICC2 0.20 1.8  19  38 0.056      -0.039        0.49
## Single_fixed_raters      ICC3 0.22 1.8  19  38 0.056      -0.046        0.52
## Average_raters_absolute ICC1k 0.39 1.6  19  40 0.094      -0.275        0.74
## Average_random_raters   ICC2k 0.43 1.8  19  38 0.056      -0.127        0.75
## Average_fixed_raters    ICC3k 0.45 1.8  19  38 0.056      -0.153        0.77
##
##  Number of subjects = 20     Number of Judges =  3

The rows of the table correspond to the following ICC, respectively: ICC1, ICC2, ICC3, ICC1k, ICC2k and ICC3k. In our example, we will consider the ICC2 form.

Note that, by default, the ICC() function fits the model with the lmer() function [lme4 package], which can handle missing data and unbalanced designs.
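Rather than reading the value off the printed table, the row of interest can be extracted from the returned object. The sketch below assumes that the psych and irr packages are installed, that ICC() stores its table in a results component with a type column (as the printed output suggests), and it sets lmer = FALSE so that the ANOVA-based computation is used and lme4 is not needed. These are assumptions to check against your installed version.

```r
# Extract the ICC2 row from the ICC() result; runs only if both packages are available.
if (requireNamespace("psych", quietly = TRUE) &&
    requireNamespace("irr", quietly = TRUE)) {
  data("anxiety", package = "irr")
  res_all <- psych::ICC(anxiety, lmer = FALSE)   # ANOVA-based estimates
  icc2_row <- res_all$results[res_all$results$type == "ICC2", ]
  print(icc2_row)
}
```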

Report

The intraclass correlation coefficient was computed to assess the agreement between three doctors in rating the anxiety levels of 20 individuals. There was poor absolute agreement between the three doctors, using the two-way random-effects model and “single rater” unit, ICC = 0.20, p = 0.056.

Summary

This chapter explains the basics of the intraclass correlation coefficient (ICC), which can be used to measure the agreement between multiple raters rating on ordinal or continuous scales. We also show how to compute and interpret ICC values using R.

References

Koo, Terry, and Mae Li. 2016. “A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research.” Journal of Chiropractic Medicine 15 (March). doi:10.1016/j.jcm.2016.02.012.

Shrout, P.E., and J.L. Fleiss. 1979. “Intraclass Correlation: Uses in Assessing Rater Reliability.” Psychological Bulletin 86: 420–28.

Comment ( 1 )

• Artyom

Thanks a lot for the great post! I would like to ask whether one could use the ICC with questionnaire scales instead of doctors, as in the example above. For instance, suppose 20 participants filled out a questionnaire with 3 factors that all measure the same construct (e.g., to measure alcoholism, the three factors could be the number of shots drunk per week, the interval between each time a person drinks and the number of friends who drink alcohol). Could one use these three factors instead of “doctors” above (i.e., as the columns in the ICC)? Thank you!