Inter-Rater Reliability Measures in R

Inter-Rater Reliability Analyses: Quick R Code

This chapter provides quick-start R code for computing the main statistical measures used to analyze inter-rater reliability (agreement). These include:

  • Cohen’s Kappa: can be used for two nominal or two ordinal variables. It accounts only for strict agreement between the two raters and is most appropriate for nominal data.
  • Weighted Kappa: should be considered for two ordinal variables only. It gives credit for partial agreement.
  • Light’s Kappa: the average of the Cohen’s Kappa values computed over all pairs of raters, used when there are more than two raters.
  • Fleiss Kappa: for two or more raters giving categorical ratings (nominal or ordinal).
  • Intraclass correlation coefficient (ICC): for continuous or ordinal data.


Related Book: Inter-Rater Reliability Essentials: Practical Guide in R

R packages

There are many R packages and functions for inter-rater agreement analyses, including:

Measure           R function [package]
Cohen’s kappa     Kappa() [vcd], kappa2() [irr]
Weighted kappa    Kappa() [vcd], kappa2() [irr]
Light’s kappa     kappam.light() [irr]
Fleiss kappa      kappam.fleiss() [irr]
ICC               icc() [irr], ICC() [psych]

Prerequisites

In the next sections, we’ll use only functions from the irr package. Make sure it is installed.

Load the package:

# install.packages("irr")
library(irr)

Example data

  • psychiatric diagnoses data [irr package], provided by 6 raters. A total of 30 patients were enrolled and classified by each rater into 5 nominal categories (Fleiss and others 1971): 1. Depression, 2. Personality Disorder, 3. Schizophrenia, 4. Neurosis, 5. Other.
  • anxiety data [irr package], which contains the anxiety ratings of 20 subjects, rated by 3 raters on an ordinal scale. Values range from 1 (not anxious at all) to 6 (extremely anxious).

Inspect the data:

# Diagnoses data
data("diagnoses", package = "irr")
head(diagnoses[, 1:3])
##                    rater1                  rater2                  rater3
## 1             4. Neurosis             4. Neurosis             4. Neurosis
## 2 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
## 3 2. Personality Disorder        3. Schizophrenia        3. Schizophrenia
## 4                5. Other                5. Other                5. Other
## 5 2. Personality Disorder 2. Personality Disorder 2. Personality Disorder
## 6           1. Depression           1. Depression        3. Schizophrenia
# Anxiety data
data("anxiety", package = "irr")
head(anxiety, 4)
##   rater1 rater2 rater3
## 1      3      3      2
## 2      3      6      1
## 3      3      4      4
## 4      4      6      4
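
Before computing agreement statistics, it can be useful to see how each rater uses the rating scale. A minimal sketch using base R functions:

# Frequency of each diagnostic category, per rater
lapply(diagnoses[, 1:3], table)

# Distribution of the anxiety ratings
summary(anxiety)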

Cohen’s Kappa: two raters

Cohen’s kappa corresponds to the unweighted kappa. It can be used for two nominal or two ordinal categorical variables.

kappa2(diagnoses[, c("rater1", "rater2")], weight = "unweighted")
##  Cohen's Kappa for 2 Raters (Weights: unweighted)
## 
##  Subjects = 30 
##    Raters = 2 
##     Kappa = 0.651 
## 
##         z = 7 
##   p-value = 2.63e-12
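
For context, the raw percentage of agreement between the two raters, which kappa corrects for chance, can be obtained with the agree() function from the irr package:

# Raw percent agreement between rater1 and rater2 (not chance-corrected)
agree(diagnoses[, c("rater1", "rater2")])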

Weighted kappa: ordinal scales

Weighted kappa should be considered only when the ratings are made on an ordinal scale, as in the following example.

kappa2(anxiety[, c("rater1", "rater2")], weight = "equal")
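
The weight argument of kappa2() also accepts "squared", which applies quadratic weights and penalizes large disagreements more heavily; for example:

# Quadratic (squared) weights penalize distant disagreements more strongly
kappa2(anxiety[, c("rater1", "rater2")], weight = "squared")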

Light’s kappa: multiple raters

kappam.light() returns the average Cohen’s kappa, computed over all pairs of raters, when you have multiple raters.

kappam.light(diagnoses[, 1:3])
##  Light's Kappa for m Raters
## 
##  Subjects = 30 
##    Raters = 3 
##     Kappa = 0.555 
## 
##         z = NaN 
##   p-value = NaN
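
To see where this value comes from, Light’s kappa can be reproduced by averaging the pairwise Cohen’s kappas; a minimal sketch, assuming the value element returned by kappa2():

# Light's kappa is the mean of the three pairwise Cohen's kappas
k12 <- kappa2(diagnoses[, c("rater1", "rater2")])$value
k13 <- kappa2(diagnoses[, c("rater1", "rater3")])$value
k23 <- kappa2(diagnoses[, c("rater2", "rater3")])$value
mean(c(k12, k13, k23))  # should match the Light's kappa reported above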

Fleiss’ kappa: multiple raters

Fleiss’ kappa is appropriate when there are multiple raters; the raters are not assumed to be the same for all subjects.

kappam.fleiss(diagnoses[, 1:3])
##  Fleiss' Kappa for m Raters
## 
##  Subjects = 30 
##    Raters = 3 
##     Kappa = 0.534 
## 
##         z = 9.89 
##   p-value = 0
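
Category-wise kappas can also be requested, assuming the detail argument of kappam.fleiss():

# Overall Fleiss' kappa plus a kappa for each diagnostic category
kappam.fleiss(diagnoses[, 1:3], detail = TRUE)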

Intraclass correlation coefficients: continuous scales

Read more in the chapter on the intraclass correlation coefficient:

icc(
  anxiety, model = "twoway", 
  type = "agreement", unit = "single"
  )
##  Single Score Intraclass Correlation
## 
##    Model: twoway 
##    Type : agreement 
## 
##    Subjects = 20 
##      Raters = 3 
##    ICC(A,1) = 0.198
## 
##  F-Test, H0: r0 = 0 ; H1: r0 > 0 
##  F(19,39.7) = 1.83 , p = 0.0543 
## 
##  95%-Confidence Interval for ICC Population Values:
##   -0.039 < ICC < 0.494
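
If the scores used in practice are the average of the three raters’ ratings, the average-measures ICC can be requested instead by setting unit = "average":

# ICC for the mean of the 3 raters' scores, rather than a single rating
icc(
  anxiety, model = "twoway",
  type = "agreement", unit = "average"
  )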

Summary

This article describes how to compute the main inter-rater agreement measures using the irr package.

References

Fleiss, J.L., and others. 1971. “Measuring Nominal Scale Agreement Among Many Raters.” Psychological Bulletin 76 (5): 378–82.


