This article describes how to interpret the kappa coefficient, which is used to assess inter-rater reliability or agreement.
In most applications, there is usually more interest in the magnitude of kappa than in its statistical significance. The following classification has been suggested for interpreting the strength of agreement based on the Cohen's kappa value (Altman 1999; Landis and Koch 1977).
| Value of k  | Strength of agreement |
|-------------|-----------------------|
| < 0         | Poor                  |
| 0.01 - 0.20 | Slight                |
| 0.21 - 0.40 | Fair                  |
| 0.41 - 0.60 | Moderate              |
| 0.61 - 0.80 | Substantial           |
| 0.81 - 1.00 | Almost perfect        |
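As a concrete illustration, kappa can be computed from two raters' labels and mapped onto this scale. The following is a minimal sketch; the function names and example data are illustrative, not taken from the cited sources:

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater1)
    # Observed proportion of agreement
    po = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected agreement by chance, from each rater's marginal frequencies
    c1, c2 = Counter(rater1), Counter(rater2)
    pe = sum(c1[cat] * c2[cat] for cat in c1) / n ** 2
    return (po - pe) / (1 - pe)

def strength_of_agreement(kappa):
    """Landis and Koch (1977) label for a given kappa value."""
    if kappa < 0:
        return "Poor"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
                         (0.80, "Substantial"), (1.00, "Almost perfect")]:
        if kappa <= upper:
            return label

# Two raters classifying 8 items as "yes"/"no" (made-up data)
r1 = ["yes", "yes", "no", "yes", "no", "yes", "no",  "no"]
r2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "no"]
k = cohen_kappa(r1, r2)
print(k, strength_of_agreement(k))  # 0.5 Moderate
```

Here the raters agree on 6 of 8 items (observed agreement 0.75), but both rate "yes" and "no" half the time each, so chance agreement is 0.50, giving kappa = (0.75 - 0.50) / (1 - 0.50) = 0.5.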
However, this interpretation allows very little agreement among raters to be described as "substantial". According to the table, a kappa of 0.61 is considered good, but this can immediately be seen as problematic depending on the field: it can mean that almost 40% of the data in the dataset are faulty. In healthcare research, this could lead to recommendations for changing practice based on faulty evidence. For a clinical laboratory, having 40% of sample evaluations be wrong would be an extremely serious quality problem (McHugh 2012).
This is why many texts recommend 80% agreement as the minimum acceptable inter-rater agreement. A kappa below 0.60 indicates inadequate agreement among the raters, and little confidence should be placed in the study results.
Fleiss et al. (2003) stated that for most purposes,
- values greater than 0.75 or so may be taken to represent excellent agreement beyond chance,
- values below 0.40 or so may be taken to represent poor agreement beyond chance, and
- values between 0.40 and 0.75 may be taken to represent fair to good agreement beyond chance.
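Fleiss's rule of thumb translates directly into a small helper. This is a sketch; the function name is illustrative:

```python
def fleiss_category(kappa):
    """Fleiss et al. (2003) rule of thumb for agreement beyond chance."""
    if kappa > 0.75:
        return "excellent"
    if kappa < 0.40:
        return "poor"
    return "fair to good"

print(fleiss_category(0.82))  # excellent
print(fleiss_category(0.55))  # fair to good
print(fleiss_category(0.25))  # poor
```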
Another logical interpretation of kappa, suggested by McHugh (2012), is given in the table below:
| Value of k  | Level of agreement | % of data that are reliable |
|-------------|--------------------|-----------------------------|
| 0 - 0.20    | None               | 0 - 4%                      |
| 0.21 - 0.39 | Minimal            | 4 - 15%                     |
| 0.40 - 0.59 | Weak               | 15 - 35%                    |
| 0.60 - 0.79 | Moderate           | 35 - 63%                    |
| 0.80 - 0.90 | Strong             | 64 - 81%                    |
| Above 0.90  | Almost Perfect     | 82 - 100%                   |
In the table above, the column "% of data that are reliable" corresponds to the squared kappa, analogous to the squared correlation coefficient, which is directly interpretable.
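This squared-kappa reading can be checked directly: a strong kappa of 0.80, for example, corresponds to about 64% reliable data, matching the "Strong" row. A minimal sketch, with an illustrative function name:

```python
def percent_reliable(kappa):
    """McHugh's reading: kappa squared, expressed as a percentage."""
    return round(kappa ** 2 * 100, 1)

print(percent_reliable(0.80))  # 64.0, lower edge of the "Strong" band (64 - 81%)
print(percent_reliable(0.59))  # 34.8, upper edge of the "Weak" band (15 - 35%)
```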
Altman, Douglas G. 1999. Practical Statistics for Medical Research. Chapman & Hall/CRC Press.
Landis, J. Richard, and Gary G. Koch. 1977. "The Measurement of Observer Agreement for Categorical Data." Biometrics 33 (1): 159–74.
McHugh, Mary L. 2012. "Interrater Reliability: The Kappa Statistic." Biochemia Medica 22 (October): 276–82. doi:10.11613/BM.2012.031.