Inter-Rater Reliability Measures in R

Inter-Rater Reliability Analyses: Quick R Code

This chapter provides quick-start R code for computing the main statistical measures used to analyze inter-rater reliability (agreement). These include:

  • Cohen’s Kappa: can be used for two nominal or two ordinal variables. It accounts only for strict agreement between the two raters and is most appropriate for nominal data.
  • Weighted Kappa: should be considered for two ordinal variables only. It gives credit for partial agreement.
  • Light’s Kappa: the average of the Cohen’s Kappa values computed over all pairs of raters, used when there are more than two raters.
  • Fleiss Kappa: for two or more raters giving categorical ratings (nominal or ordinal).
  • Intraclass correlation coefficient (ICC): for continuous or ordinal data.


Related Book: Inter-Rater Reliability Essentials: Practical Guide in R

R packages

There are many R packages and functions for inter-rater agreement analyses, including:

Measure           R function [package]
Cohen’s kappa     Kappa() [vcd], kappa2() [irr]
Weighted kappa    Kappa() [vcd], kappa2() [irr]
Light’s kappa     kappam.light() [irr]
Fleiss kappa      kappam.fleiss() [irr]
ICC               icc() [irr], ICC() [psych]

Prerequisites

In the next sections, we’ll use only functions from the irr package. Make sure it is installed.

Load the package:

# install.packages("irr")
library(irr)

Example data

  • psychiatric diagnoses data [irr package], provided by 6 raters. A total of 30 patients were enrolled and classified by each rater into 5 nominal categories (Fleiss and others 1971): 1. Depression, 2. Personality Disorder, 3. Schizophrenia, 4. Neurosis, 5. Other.
  • anxiety data [irr package], which contains the anxiety ratings of 20 subjects, rated by 3 raters on an ordinal scale. Values range from 1 (not anxious at all) to 6 (extremely anxious).

Inspect the data:

# Diagnoses data
data("diagnoses", package = "irr")
head(diagnoses[, 1:3])
##                    rater1                  rater2                  rater3
## 1             4. Neurosis             4. Neurosis             4. Neurosis
## 2 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
## 3 2. Personality Disorder        3. Schizophrenia        3. Schizophrenia
## 4                5. Other                5. Other                5. Other
## 5 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
## 6           1. Depression           1. Depression        3. Schizophrenia
# Anxiety data
data("anxiety", package = "irr")
head(anxiety, 4)
##   rater1 rater2 rater3
## 1      3      3      2
## 2      3      6      1
## 3      3      4      4
## 4      4      6      4
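
Before computing agreement statistics, it can be useful to see how each rater uses the rating scale. A minimal sketch using base R functions:

# Frequency of each diagnostic category, per rater
lapply(diagnoses[, 1:3], table)

# Distribution of the anxiety ratings
summary(anxiety)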

Cohen’s Kappa: two raters

Cohen’s kappa corresponds to the unweighted kappa. It can be used for two nominal or two ordinal categorical variables.

kappa2(diagnoses[, c("rater1", "rater2")], weight = "unweighted")
##  Cohen's Kappa for 2 Raters (Weights: unweighted)
## 
##  Subjects = 30 
##    Raters = 2 
##     Kappa = 0.651 
## 
##         z = 7 
##   p-value = 2.63e-12
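
For context, the raw percentage of agreement between the two raters, which kappa corrects for chance, can be obtained with the agree() function from the irr package:

# Raw percent agreement between rater1 and rater2 (not chance-corrected)
agree(diagnoses[, c("rater1", "rater2")])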

Weighted kappa: ordinal scales

Weighted kappa should be considered only when the ratings are made on an ordinal scale, as in the following example.

kappa2(anxiety[, c("rater1", "rater2")], weight = "equal")
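
The weight argument of kappa2() also accepts "squared", which applies quadratic weights and penalizes large disagreements more heavily; for example:

# Quadratic (squared) weights penalize distant disagreements more strongly
kappa2(anxiety[, c("rater1", "rater2")], weight = "squared")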

Light’s kappa: multiple raters

kappam.light() returns the average Cohen’s kappa, computed over all pairs of raters, when you have multiple raters.

kappam.light(diagnoses[, 1:3])
##  Light's Kappa for m Raters
## 
##  Subjects = 30 
##    Raters = 3 
##     Kappa = 0.555 
## 
##         z = NaN 
##   p-value = NaN
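
To see where this value comes from, Light’s kappa can be reproduced by averaging the pairwise Cohen’s kappas; a minimal sketch, assuming the value element returned by kappa2():

# Light's kappa is the mean of the three pairwise Cohen's kappas
k12 <- kappa2(diagnoses[, c("rater1", "rater2")])$value
k13 <- kappa2(diagnoses[, c("rater1", "rater3")])$value
k23 <- kappa2(diagnoses[, c("rater2", "rater3")])$value
mean(c(k12, k13, k23))  # should match the Light's kappa reported above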

Fleiss’ kappa: multiple raters

Fleiss’ kappa is appropriate when there are multiple raters; the raters are not assumed to be the same for all subjects.

kappam.fleiss(diagnoses[, 1:3])
##  Fleiss' Kappa for m Raters
## 
##  Subjects = 30 
##    Raters = 3 
##     Kappa = 0.534 
## 
##         z = 9.89 
##   p-value = 0
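
Category-wise kappas can also be requested, assuming the detail argument of kappam.fleiss():

# Overall Fleiss' kappa plus a kappa for each diagnostic category
kappam.fleiss(diagnoses[, 1:3], detail = TRUE)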

Intraclass correlation coefficients: continuous scales

Read more in the chapter on the intraclass correlation coefficient:

icc(
  anxiety, model = "twoway", 
  type = "agreement", unit = "single"
  )
##  Single Score Intraclass Correlation
## 
##    Model: twoway 
##    Type : agreement 
## 
##    Subjects = 20 
##      Raters = 3 
##    ICC(A,1) = 0.198
## 
##  F-Test, H0: r0 = 0 ; H1: r0 > 0 
##  F(19,39.7) = 1.83 , p = 0.0543 
## 
##  95%-Confidence Interval for ICC Population Values:
##   -0.039 < ICC < 0.494
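
If the scores used in practice are the average of the three raters’ ratings, the average-measures ICC can be requested instead by setting unit = "average":

# ICC for the mean of the 3 raters' scores, rather than a single rating
icc(
  anxiety, model = "twoway",
  type = "agreement", unit = "average"
  )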

Summary

This article describes how to compute the main inter-rater agreement measures using the irr package.

References

Fleiss, J.L., and others. 1971. “Measuring Nominal Scale Agreement Among Many Raters.” Psychological Bulletin 76 (5): 378–82.


