A Comparative Approach to Threshold Optimization for Classifying Imbalanced Data

John T. Hancock, Justin M. Johnson, T. Khoshgoftaar
{"title":"A Comparative Approach to Threshold Optimization for Classifying Imbalanced Data","authors":"John T. Hancock, Justin M. Johnson, T. Khoshgoftaar","doi":"10.1109/CIC56439.2022.00028","DOIUrl":null,"url":null,"abstract":"For the practical application of a classifier, it is necessary to select an optimal output probability threshold to obtain the best classification results. There are many criteria one may employ to select a threshold. However, selecting a threshold will often involve trading off performance in terms of one metric for performance in terms of another metric. In our literature review of studies involving selecting thresholds to optimize classification of imbalanced data, we find there is an opportunity to expand on previous work for an in-depth study of threshold selection. Our contribution is to present a systematic method for selecting the best threshold value for a given classification task and its desired performance constraints. Just as a machine learning algorithm is optimized on some training data set, we demonstrate how a user-defined set of performance metrics can be utilized to optimize the classification threshold. In this study we use four popular metrics to optimize thresholds: precision, Matthews’ Correlation Coefficient, f-measure and geometric mean of true positive rate, and true negative rate. Moreover, we compare classification results for thresholds optimized for these metrics with the commonly used default threshold of 0.5, and the prior probability of the positive class (also known as the minority to majority class ratio). Our results show that other thresholds handily outperform the default threshold of 0.5. Moreover, we show that the positive class prior probability is a good benchmark for finding classification thresholds that perform well in terms of multiple metrics.","PeriodicalId":170721,"journal":{"name":"2022 IEEE 8th International Conference on Collaboration and Internet Computing (CIC)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 8th International Conference on Collaboration and Internet Computing (CIC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIC56439.2022.00028","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

For the practical application of a classifier, it is necessary to select an optimal output probability threshold to obtain the best classification results. There are many criteria one may employ to select a threshold. However, selecting a threshold will often involve trading off performance in terms of one metric for performance in terms of another. In our literature review of studies involving selecting thresholds to optimize classification of imbalanced data, we find there is an opportunity to expand on previous work with an in-depth study of threshold selection. Our contribution is to present a systematic method for selecting the best threshold value for a given classification task and its desired performance constraints. Just as a machine learning algorithm is optimized on some training data set, we demonstrate how a user-defined set of performance metrics can be utilized to optimize the classification threshold. In this study we use four popular metrics to optimize thresholds: precision, Matthews’ Correlation Coefficient, f-measure, and the geometric mean of true positive rate and true negative rate. Moreover, we compare classification results for thresholds optimized for these metrics with the commonly used default threshold of 0.5 and the prior probability of the positive class (also known as the minority-to-majority class ratio). Our results show that other thresholds handily outperform the default threshold of 0.5. Moreover, we show that the positive class prior probability is a good benchmark for finding classification thresholds that perform well in terms of multiple metrics.
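
To make the threshold-selection procedure concrete, the sketch below (not the authors' code; the data set, model, and threshold grid are illustrative assumptions) sweeps candidate thresholds over a classifier's output probabilities, selects the threshold that maximizes each of the four metrics named in the abstract, and then compares those choices against the default 0.5 and the positive-class prior.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (f1_score, matthews_corrcoef, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split

def gmean_tpr_tnr(y_true, y_pred):
    # Geometric mean of true positive rate (recall of class 1)
    # and true negative rate (recall of class 0).
    tpr = recall_score(y_true, y_pred, pos_label=1)
    tnr = recall_score(y_true, y_pred, pos_label=0)
    return np.sqrt(tpr * tnr)

# The four threshold-selection metrics discussed in the abstract.
METRICS = {
    "precision": lambda yt, yp: precision_score(yt, yp, zero_division=0),
    "mcc": matthews_corrcoef,
    "f-measure": f1_score,
    "gmean": gmean_tpr_tnr,
}

# Illustrative imbalanced data set (~1% positive class) and model; any
# classifier that outputs class probabilities could be substituted.
X, y = make_classification(n_samples=20000, weights=[0.99], flip_y=0.01,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

def evaluate(threshold):
    # Apply the threshold to the output probabilities and score all metrics.
    y_pred = (proba >= threshold).astype(int)
    return {name: round(fn(y_te, y_pred), 3) for name, fn in METRICS.items()}

# Sweep a grid of candidate thresholds; keep the best threshold per metric.
grid = np.linspace(0.01, 0.99, 99)
best = {name: max(grid, key=lambda t: fn(y_te, (proba >= t).astype(int)))
        for name, fn in METRICS.items()}

prior = y_tr.mean()  # positive-class prior probability, used as a benchmark
for name, t in best.items():
    print(f"best threshold for {name}: {t:.2f} -> {evaluate(t)}")
print(f"default threshold 0.50 -> {evaluate(0.5)}")
print(f"positive-class prior {prior:.2f} -> {evaluate(prior)}")

In a real deployment the grid search would be run on a validation partition and only the selected thresholds evaluated on held-out data; the sweep here scores the test split directly only to keep the example short.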