Data Mining Techniques in Direct Marketing on Imbalanced Data using Tomek Link Combined with Random Under-sampling

Ümit Yılmaz, C. Gezer, Z. Aydın, V. C. Gungor
{"title":"Data Mining Techniques in Direct Marketing on Imbalanced Data using Tomek Link Combined with Random Under-sampling","authors":"Ümit Yılmaz, C. Gezer, Z. Aydın, V. C. Gungor","doi":"10.1145/3471287.3471299","DOIUrl":null,"url":null,"abstract":"Determining the potential customers is very important in direct marketing. Data mining techniques are one of the most important methods for companies to determine potential customers. However, since the number of potential customers is very low compared to the number of non-potential customers, there is a class imbalance problem that significantly affects the performance of data mining techniques. In this paper, different combinations of basic and advanced resampling techniques such as Synthetic Minority Oversampling Technique (SMOTE), Tomek Link, RUS, and ROS were evaluated to improve the performance of customer classification. Different feature selection techniques are used in order the decrease the number of non-informative features from the data such as Information Gain, Gain Ratio, Chi-squared, and Relief. Classification performance was compared and utilized using several data mining techniques, such as LightGBM, XGBoost, Gradient Boost, Random Forest, AdaBoost, ANN, Logistic Regression, Decision Trees, SVC, Bagging Classifier based on ROC AUC and sensitivity metrics. A combination of Tomek Link and Random Under-Sampling as a resampling technique and Chi-squared method as feature selection algorithm showed superior performance among the other combinations. Detailed performance evaluations demonstrated that with the proposed approach, LightGBM, which is a gradient boosting algorithm based on decision tree, gave the best results among the other classifiers with 0.947 sensitivity and 0.896 ROC AUC value.","PeriodicalId":306474,"journal":{"name":"2021 the 5th International Conference on Information System and Data Mining","volume":"IM-36 4","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 the 5th International Conference on Information System and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3471287.3471299","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Determining the potential customers is very important in direct marketing. Data mining techniques are one of the most important methods for companies to determine potential customers. However, since the number of potential customers is very low compared to the number of non-potential customers, there is a class imbalance problem that significantly affects the performance of data mining techniques. In this paper, different combinations of basic and advanced resampling techniques such as Synthetic Minority Oversampling Technique (SMOTE), Tomek Link, RUS, and ROS were evaluated to improve the performance of customer classification. Different feature selection techniques are used in order the decrease the number of non-informative features from the data such as Information Gain, Gain Ratio, Chi-squared, and Relief. Classification performance was compared and utilized using several data mining techniques, such as LightGBM, XGBoost, Gradient Boost, Random Forest, AdaBoost, ANN, Logistic Regression, Decision Trees, SVC, Bagging Classifier based on ROC AUC and sensitivity metrics. A combination of Tomek Link and Random Under-Sampling as a resampling technique and Chi-squared method as feature selection algorithm showed superior performance among the other combinations. Detailed performance evaluations demonstrated that with the proposed approach, LightGBM, which is a gradient boosting algorithm based on decision tree, gave the best results among the other classifiers with 0.947 sensitivity and 0.896 ROC AUC value.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于Tomek Link和随机欠抽样的不平衡直销数据挖掘技术
确定潜在客户在直销中是非常重要的。数据挖掘技术是企业确定潜在客户的最重要方法之一。然而,由于潜在客户的数量与非潜在客户的数量相比非常低,因此存在类不平衡问题,这严重影响了数据挖掘技术的性能。本文通过综合少数派过采样技术(Synthetic Minority Oversampling Technique, SMOTE)、Tomek Link、RUS和ROS等基本重采样技术和高级重采样技术的不同组合进行评估,以提高客户分类的性能。为了减少数据中的非信息特征(如信息增益、增益比、卡方和救济)的数量,使用了不同的特征选择技术。使用LightGBM、XGBoost、Gradient Boost、Random Forest、AdaBoost、ANN、Logistic回归、决策树、SVC、Bagging Classifier等基于ROC AUC和灵敏度指标的数据挖掘技术对分类性能进行了比较和利用。结合Tomek Link和Random undersampling作为重采样技术和卡方方法作为特征选择算法的组合在其他组合中表现出更好的性能。详细的性能评估表明,基于决策树的梯度增强算法LightGBM在其他分类器中表现最佳,灵敏度为0.947,ROC AUC值为0.896。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Ethnicity Based Consumer Buying Behavior Analysis and Prediction on Online Clothing Platforms in Sri Lanka Email Clustering & Generating Email Templates Based on Their Topics LASTD: A Manually Annotated and Tested Large Arabic Sentiment Tweets Dataset Selection and Verification of Privacy Parameters for Local Differentially Private Data Aggregation MDNet: A Multi-image Input Real-Time Semantic Segmentation Model with Decoupled Supervision
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1