G4 & the balanced metric family - a novel approach to solving binary classification problems in medical device validation & verification studies.

IF 4 | Zone 3 (Biology) | Q1, MATHEMATICAL & COMPUTATIONAL BIOLOGY | Biodata Mining | Pub Date: 2024-10-23 | DOI: 10.1186/s13040-024-00402-z
Andrew Marra
Biodata Mining 17(1), 43. Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11515465/pdf/

Abstract

Background: In medical device validation and verification studies, the area under the receiver operating characteristic curve (AUROC) is often used as a primary endpoint despite multiple reports of its limitations. Researchers are therefore encouraged to consider alternative metrics as primary endpoints. This manuscript presents a new metric, G4, defined as the geometric mean of sensitivity, specificity, positive predictive value, and negative predictive value. G4 belongs to a balanced metric family that also includes the Unified Performance Measure (also known as P4) and the Matthews correlation coefficient (MCC). The purpose of this manuscript is to demonstrate the benefits of using G4 together with the rest of the balanced metric family when analyzing the overall performance of binary classifiers.
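
As a minimal sketch of the definitions above, the snippet below computes G4 from the four cells of a confusion matrix. The abstract defines only G4; the P4 formula (harmonic mean of the same four rates) and the MCC formula are the commonly used definitions and are assumed here, as are the function name `balanced_metrics` and its argument names.

```python
import math


def balanced_metrics(tp, fp, tn, fn):
    """Return G4, P4 and MCC for one 2x2 confusion matrix (illustrative helper)."""
    # Every marginal total must be non-zero, otherwise one of the rates is undefined.
    if 0 in (tp + fn, tn + fp, tp + fp, tn + fn):
        raise ValueError("each row and column of the confusion matrix must be non-empty")

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)  # positive predictive value
    npv = tn / (tn + fn)  # negative predictive value
    rates = (sensitivity, specificity, ppv, npv)

    # G4: geometric mean of the four rates (definition taken from the abstract).
    g4 = math.prod(rates) ** 0.25

    # P4: harmonic mean of the same four rates (assumed standard definition);
    # it is taken as 0 when any rate is 0.
    p4 = 0.0 if 0 in rates else 4 / sum(1 / r for r in rates)

    # MCC: standard definition, on a -1..1 scale rather than 0..1.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"G4": g4, "P4": p4, "MCC": mcc}
```

For example, `balanced_metrics(tp=40, fp=10, tn=40, fn=10)` has all four rates equal to 0.8, so G4 and P4 are both about 0.8 while MCC is about 0.6. MCC lives on a different scale, so its raw value is not directly comparable to the other two.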

Results: Simulated datasets covering different prevalence rates of the minority class were analyzed under a multi-reader, multi-case study design. Data from an independently published study that tested the performance of a unique ultrasound artificial intelligence (AI) algorithm for breast cancer detection were also considered. Within each dataset, AUROC was reported alongside the balanced metric family for comparison. When the dataset prevalence and bias of the minority class approached 50%, all three balanced metrics provided equivalent interpretations of an AI's performance. As the prevalence rate increased or decreased and the data became more imbalanced, AUROC tended to overvalue or undervalue the true classifier performance, while the balanced metric family was resistant to such imbalance. In certain circumstances where data imbalance was strong (minority-class prevalence < 10%), MCC was preferred for standalone assessments, while P4 provided a stronger effect size in between-groups analyses. G4 acted as a middle ground, serving both standalone assessments and between-groups analyses.
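
To illustrate why prevalence matters, the sketch below holds sensitivity and specificity fixed and recomputes PPV, NPV, G4 and P4 at several prevalence values using Bayes' rule. This is not the study's multi-reader, multi-case simulation: the function name, the 0.9 operating point and the prevalence grid are hypothetical, P4 is assumed to be the harmonic mean of the four rates, and the single-threshold AUROC is taken as (sensitivity + specificity) / 2, the trapezoidal area for one operating point.

```python
def metrics_at_prevalence(sensitivity, specificity, prevalence):
    """Illustrative helper: predictive values and balanced metrics at one prevalence.

    Assumes 0 < sensitivity, specificity, prevalence < 1 so that no rate is zero.
    """
    # Expected cell proportions of the confusion matrix at this prevalence.
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)

    # Bayes' rule: predictive values depend on prevalence.
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    rates = (sensitivity, specificity, ppv, npv)

    g4 = (sensitivity * specificity * ppv * npv) ** 0.25  # geometric mean
    p4 = 4 / sum(1 / r for r in rates)                    # harmonic mean (assumed)

    # Area under a one-point ROC curve; it does not depend on prevalence.
    auroc_single_point = (sensitivity + specificity) / 2
    return {
        "PPV": ppv,
        "NPV": npv,
        "G4": g4,
        "P4": p4,
        "AUROC_single_point": auroc_single_point,
    }


# Hypothetical operating point and prevalence grid, for illustration only.
for prevalence in (0.5, 0.1, 0.01):
    m = metrics_at_prevalence(0.9, 0.9, prevalence)
    print(prevalence, {name: round(value, 3) for name, value in m.items()})
```

With this operating point, lowering prevalence from 0.5 to 0.1 drops PPV from 0.9 to 0.5 and G4 from 0.9 to roughly 0.80, while the single-point AUROC stays at 0.9.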

Conclusions: Using AUROC as the primary endpoint in binary classification problems yields increasingly misleading results as the dataset becomes more imbalanced. This is particularly evident when AUROC is used in medical device validation and verification studies. G4, P4, and MCC do not share this limitation and paint a more complete picture of a medical device's performance in a clinical setting. Researchers are therefore encouraged to explore the balanced metric family when evaluating binary classification problems.

Source journal: Biodata Mining (MATHEMATICAL & COMPUTATIONAL BIOLOGY)
CiteScore: 7.90
Self-citation rate: 0.00%
Articles published: 28
Review time: 23 weeks
About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:
- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.