A Comparative Approach to Threshold Optimization for Classifying Imbalanced Data

John T. Hancock, Justin M. Johnson, T. Khoshgoftaar
{"title":"A Comparative Approach to Threshold Optimization for Classifying Imbalanced Data","authors":"John T. Hancock, Justin M. Johnson, T. Khoshgoftaar","doi":"10.1109/CIC56439.2022.00028","DOIUrl":null,"url":null,"abstract":"For the practical application of a classifier, it is necessary to select an optimal output probability threshold to obtain the best classification results. There are many criteria one may employ to select a threshold. However, selecting a threshold will often involve trading off performance in terms of one metric for performance in terms of another metric. In our literature review of studies involving selecting thresholds to optimize classification of imbalanced data, we find there is an opportunity to expand on previous work for an in-depth study of threshold selection. Our contribution is to present a systematic method for selecting the best threshold value for a given classification task and its desired performance constraints. Just as a machine learning algorithm is optimized on some training data set, we demonstrate how a user-defined set of performance metrics can be utilized to optimize the classification threshold. In this study we use four popular metrics to optimize thresholds: precision, Matthews’ Correlation Coefficient, f-measure and geometric mean of true positive rate, and true negative rate. Moreover, we compare classification results for thresholds optimized for these metrics with the commonly used default threshold of 0.5, and the prior probability of the positive class (also known as the minority to majority class ratio). Our results show that other thresholds handily outperform the default threshold of 0.5. Moreover, we show that the positive class prior probability is a good benchmark for finding classification thresholds that perform well in terms of multiple metrics.","PeriodicalId":170721,"journal":{"name":"2022 IEEE 8th International Conference on Collaboration and Internet Computing (CIC)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 8th International Conference on Collaboration and Internet Computing (CIC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIC56439.2022.00028","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

For the practical application of a classifier, it is necessary to select an optimal output probability threshold to obtain the best classification results. There are many criteria one may employ to select a threshold. However, selecting a threshold will often involve trading off performance in terms of one metric for performance in terms of another. In our literature review of studies involving selecting thresholds to optimize classification of imbalanced data, we find there is an opportunity to expand on previous work with an in-depth study of threshold selection. Our contribution is to present a systematic method for selecting the best threshold value for a given classification task and its desired performance constraints. Just as a machine learning algorithm is optimized on some training data set, we demonstrate how a user-defined set of performance metrics can be utilized to optimize the classification threshold. In this study we use four popular metrics to optimize thresholds: precision, Matthews’ Correlation Coefficient, f-measure, and the geometric mean of true positive rate and true negative rate. Moreover, we compare classification results for thresholds optimized for these metrics with the commonly used default threshold of 0.5 and the prior probability of the positive class (also known as the minority-to-majority class ratio). Our results show that other thresholds handily outperform the default threshold of 0.5. Moreover, we show that the positive class prior probability is a good benchmark for finding classification thresholds that perform well in terms of multiple metrics.
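
To make the threshold-selection procedure concrete, the sketch below (not the authors' code; the data set, model, and threshold grid are illustrative assumptions) sweeps candidate thresholds over a classifier's output probabilities, selects the threshold that maximizes each of the four metrics named in the abstract, and then compares those choices against the default 0.5 and the positive-class prior.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (f1_score, matthews_corrcoef, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split

def gmean_tpr_tnr(y_true, y_pred):
    # Geometric mean of true positive rate (recall of class 1)
    # and true negative rate (recall of class 0).
    tpr = recall_score(y_true, y_pred, pos_label=1)
    tnr = recall_score(y_true, y_pred, pos_label=0)
    return np.sqrt(tpr * tnr)

# The four threshold-selection metrics discussed in the abstract.
METRICS = {
    "precision": lambda yt, yp: precision_score(yt, yp, zero_division=0),
    "mcc": matthews_corrcoef,
    "f-measure": f1_score,
    "gmean": gmean_tpr_tnr,
}

# Illustrative imbalanced data set (~1% positive class) and model; any
# classifier that outputs class probabilities could be substituted.
X, y = make_classification(n_samples=20000, weights=[0.99], flip_y=0.01,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

def evaluate(threshold):
    # Apply the threshold to the output probabilities and score all metrics.
    y_pred = (proba >= threshold).astype(int)
    return {name: round(fn(y_te, y_pred), 3) for name, fn in METRICS.items()}

# Sweep a grid of candidate thresholds; keep the best threshold per metric.
grid = np.linspace(0.01, 0.99, 99)
best = {name: max(grid, key=lambda t: fn(y_te, (proba >= t).astype(int)))
        for name, fn in METRICS.items()}

prior = y_tr.mean()  # positive-class prior probability, used as a benchmark
for name, t in best.items():
    print(f"best threshold for {name}: {t:.2f} -> {evaluate(t)}")
print(f"default threshold 0.50 -> {evaluate(0.5)}")
print(f"positive-class prior {prior:.2f} -> {evaluate(prior)}")

In a real deployment the grid search would be run on a validation partition and only the selected thresholds evaluated on held-out data; the sweep here scores the test split directly only to keep the example short.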