Measures of Agreement with Multiple Raters: Fréchet Variances and Inference.

Psychometrika (IF 3.1; CAS Zone 2, Psychology; JCR Q1, Mathematics, Interdisciplinary Applications). Pub Date: 2024-06-01. Epub Date: 2024-01-08. DOI: 10.1007/s11336-023-09945-2

Jonas Moss

Psychometrika, pp. 517-541. Journal Article. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11164747/pdf/

Abstract

Most measures of agreement are chance-corrected. They differ in three dimensions: their definition of chance agreement, their choice of disagreement function, and how they handle multiple raters. Chance agreement is usually defined in a pairwise manner, following either Cohen's kappa or Fleiss's kappa. The disagreement function is usually a nominal, quadratic, or absolute value function. But how to handle multiple raters is contentious, with the main contenders being Fleiss's kappa, Conger's kappa, and Hubert's kappa, the variant of Fleiss's kappa where agreement is said to occur only if every rater agrees. More generally, multi-rater agreement coefficients can be defined in a g-wise way, where the disagreement weighting function uses g raters instead of two. This paper contains two main contributions. (a) We propose using Fréchet variances to handle the case of multiple raters. The Fréchet variances are intuitive disagreement measures and turn out to generalize the nominal, quadratic, and absolute value functions to the case of more than two raters. (b) We derive the limit theory of g-wise weighted agreement coefficients, with chance agreement of the Cohen-type or Fleiss-type, for the case where every item is rated by the same number of raters. Trying out three confidence interval constructions, we end up recommending calculating confidence intervals using the arcsine transform or the Fisher transform.
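
To make the ingredients named in the abstract concrete, here is a minimal sketch of a pairwise, Fleiss-type chance-corrected coefficient with a pluggable disagreement function (nominal, quadratic, or absolute value). It assumes every item is rated by the same number of raters and that chance disagreement is taken over two independent draws from the pooled rating distribution. The function names (`fleiss_type_agreement`, `nominal`, `quadratic`, `absolute`) are illustrative and not taken from the paper, and the sketch covers only the pairwise case: it does not implement the Fréchet variances, the g-wise generalization, or the paper's limit theory.

```python
from collections import Counter
from itertools import combinations


def nominal(a, b):
    """Nominal disagreement: 1 if the two ratings differ, else 0."""
    return 0.0 if a == b else 1.0


def quadratic(a, b):
    """Quadratic disagreement for numeric ratings."""
    return float((a - b) ** 2)


def absolute(a, b):
    """Absolute-value disagreement for numeric ratings."""
    return float(abs(a - b))


def fleiss_type_agreement(ratings, disagreement=nominal):
    """Pairwise chance-corrected agreement, 1 - observed / expected.

    ratings: list of items, each a list holding one rating per rater.
    Illustrative helper (an assumption of this sketch, not the paper's code).
    """
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Observed disagreement: mean over items of the mean pairwise disagreement.
    pairs = list(combinations(range(n_raters), 2))
    observed = sum(
        sum(disagreement(item[i], item[j]) for i, j in pairs) / len(pairs)
        for item in ratings
    ) / len(ratings)

    # Fleiss-type chance disagreement: two independent draws from the
    # pooled distribution of all ratings.
    counts = Counter(r for item in ratings for r in item)
    total = sum(counts.values())
    expected = sum(
        (ca / total) * (cb / total) * disagreement(a, b)
        for a, ca in counts.items()
        for b, cb in counts.items()
    )
    if expected == 0:
        raise ValueError("chance disagreement is zero; coefficient undefined")
    return 1.0 - observed / expected


if __name__ == "__main__":
    data = [[1, 1, 1], [2, 2, 2], [1, 2, 1], [3, 3, 3]]
    for name, d in [("nominal", nominal), ("quadratic", quadratic), ("absolute", absolute)]:
        print(name, fleiss_type_agreement(data, d))
```

With the nominal disagreement function this reduces to the usual Fleiss's kappa; swapping in `quadratic` or `absolute` changes only how strongly larger rating gaps are penalized.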

Source journal: Psychometrika (Mathematics, Interdisciplinary Applications)

CiteScore: 4.40

Self-citation rate: 10.00%

Review time: >12 weeks

Journal description: The journal Psychometrika is devoted to the advancement of theory and methodology for behavioral data in psychology, education and the social and behavioral sciences generally. Its coverage is offered in two sections: Theory and Methods (T&M), and Application Reviews and Case Studies (ARCS). T&M articles present original research and reviews on the development of quantitative models, statistical methods, and mathematical techniques for evaluating data from psychology, the social and behavioral sciences and related fields. Application Reviews can be integrative, drawing together disparate methodologies for applications, or comparative and evaluative, discussing advantages and disadvantages of one or more methodologies in applications. Case Studies highlight methodology that deepens understanding of substantive phenomena through more informative data analysis, or more elegant data description.