G4 & the balanced metric family - a novel approach to solving binary classification problems in medical device validation & verification studies.

IF 6.1 | Region 3 (Biology) | JCR Q1, MATHEMATICAL & COMPUTATIONAL BIOLOGY | Biodata Mining | Pub Date: 2024-10-23 | DOI: 10.1186/s13040-024-00402-z

Andrew Marra

Biodata Mining, volume 17 (1), article 43. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11515465/pdf/

Citations: 0

Abstract

Background: In medical device validation and verification studies, the area under the receiver operating characteristic curve (AUROC) is often used as a primary endpoint despite multiple reports showing its limitations. Researchers are therefore encouraged to consider alternative metrics as primary endpoints. This manuscript presents a new metric, G4, defined as the geometric mean of sensitivity, specificity, positive predictive value, and negative predictive value. G4 belongs to a balanced metric family that also includes the Unified Performance Measure (also known as P4) and the Matthews correlation coefficient (MCC). The purpose of this manuscript is to show the benefits of using G4 together with the rest of the balanced metric family when analyzing the overall performance of binary classifiers.
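The definition above can be written out as a minimal sketch. G4 follows the abstract's wording (geometric mean of the four rates). The P4 and MCC lines use their commonly published definitions (P4 as the harmonic mean of the same four rates, MCC as the correlation form on the 2x2 table) and are an assumption here, since the abstract does not restate them. The function name `balanced_metrics` and the example counts are hypothetical.

```python
import math


def balanced_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute G4, P4 and MCC from the four cells of a 2x2 confusion matrix."""
    # Every marginal total must be non-zero, otherwise at least one of the
    # component rates (and the MCC denominator) is undefined.
    if min(tp + fn, tn + fp, tp + fp, tn + fn) == 0:
        raise ValueError("every row and column total must be non-zero")

    sens = tp / (tp + fn)  # sensitivity (recall)
    spec = tn / (tn + fp)  # specificity
    ppv = tp / (tp + fp)   # positive predictive value (precision)
    npv = tn / (tn + fn)   # negative predictive value
    rates = (sens, spec, ppv, npv)

    # G4: geometric mean of the four rates, as described in the abstract.
    g4 = math.prod(rates) ** 0.25
    # P4 (assumed definition): harmonic mean of the same four rates;
    # taken as 0 when any rate is 0.
    p4 = 0.0 if 0.0 in rates else 4 / sum(1 / r for r in rates)
    # MCC (standard definition); ranges over [-1, 1], unlike G4 and P4.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"sensitivity": sens, "specificity": spec, "PPV": ppv,
            "NPV": npv, "G4": g4, "P4": p4, "MCC": mcc}


# Hypothetical counts for illustration only (not data from the paper).
print(balanced_metrics(tp=90, fp=10, fn=10, tn=90))
```

Because G4 and P4 are means of the same four rates, either one falls whenever any single rate is poor; the geometric mean is never smaller than the harmonic mean, so G4 is always at least as large as P4 on the same table.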

Results: Simulated datasets spanning different prevalence rates of the minority class were analyzed under a multi-reader multi-case study design. Data from an independently published study that tested the performance of an ultrasound artificial intelligence (AI) algorithm for breast cancer detection were also considered. Within each dataset, AUROC was reported alongside the balanced metric family for comparison. When the dataset prevalence and bias of the minority class approached 50%, all three balanced metrics provided equivalent interpretations of an AI's performance. As the prevalence rate increased or decreased and the data became more imbalanced, AUROC tended to overvalue or undervalue the true classifier performance, while the balanced metric family was resistant to such imbalance. Where data imbalance was strong (minority-class prevalence < 10%), MCC was preferred for standalone assessments, while P4 provided a stronger effect size in between-groups analyses. G4 acted as a middle ground, serving both standalone assessments and between-groups analyses.
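The role of prevalence can be illustrated with a small calculation. This is not the paper's simulation design, which the abstract does not describe; it is a sketch that holds a classifier's sensitivity and specificity fixed at hypothetical values (0.9 each) and recomputes the predictive values and the three balanced metrics as the minority-class prevalence changes. Sensitivity and specificity do not depend on prevalence, whereas PPV and NPV do. The P4 and MCC formulas are the commonly published ones and are assumed here; the function name `metrics_at_prevalence` is hypothetical.

```python
import math


def metrics_at_prevalence(sens: float, spec: float, prevalence: float) -> dict[str, float]:
    """Expected PPV, NPV, G4, P4 and MCC for fixed sensitivity/specificity.

    Assumes 0 < sens <= 1, 0 < spec <= 1 and 0 < prevalence < 1.
    """
    # Expected cell proportions of the 2x2 table at this prevalence.
    tp = sens * prevalence
    fn = (1 - sens) * prevalence
    tn = spec * (1 - prevalence)
    fp = (1 - spec) * (1 - prevalence)

    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)

    g4 = (sens * spec * ppv * npv) ** 0.25            # geometric mean
    p4 = 4 / (1 / sens + 1 / spec + 1 / ppv + 1 / npv)  # harmonic mean (assumed)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"PPV": ppv, "NPV": npv, "G4": g4, "P4": p4, "MCC": mcc}


# Hypothetical operating point and prevalence grid, for illustration only.
for prev in (0.50, 0.20, 0.10, 0.05, 0.01):
    m = metrics_at_prevalence(sens=0.9, spec=0.9, prevalence=prev)
    print(f"prevalence={prev:.2f}  " + "  ".join(f"{k}={v:.3f}" for k, v in m.items()))
```

Running the loop shows the three balanced metrics moving with the predictive values as the positive class becomes rarer, even though the sensitivity and specificity passed in never change.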

Conclusions: Using AUROC as the primary endpoint in binary classification problems gives increasingly misleading results as the dataset becomes more imbalanced. This is explicitly observed when AUROC is used in medical device validation and verification studies. G4, P4, and MCC do not share this limitation and paint a more complete picture of a medical device's performance in a clinical setting. Researchers are therefore encouraged to explore the balanced metric family when evaluating binary classification problems.

Source journal

Biodata Mining (MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore: 7.90

Self-citation rate: 0.00%

Review time: 23 weeks

About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:
- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development, and integration of databases, software, and web services for the storage, management, retrieval, and analysis of data from large-scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.