Bipartite graphs can be used to model many real-world relationships, with applications in many domains such as medicine and social networks. We present an application of maximal bicliques of bipartite graphs to recommendation systems that makes use of the notion of biclique similarity of a set of vertices in order to recommend items to users in a certain order of preference. Experimental results using real-world datasets that justify our approach are presented.
{"title":"On Biclique Connectivity in Bipartite Graphs and Recommendation Systems","authors":"Cristina Maier, D. Simovici","doi":"10.1145/3471287.3471302","DOIUrl":"https://doi.org/10.1145/3471287.3471302","url":null,"PeriodicalId":306474,"journal":{"name":"2021 the 5th International Conference on Information System and Data Mining","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124845578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
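As an illustration of the maximal bicliques mentioned in the abstract above, the following is a minimal brute-force sketch for very small bipartite graphs. It is not the authors' algorithm, and it does not implement their biclique similarity measure, which the abstract does not define. The function name `maximal_bicliques`, the adjacency-dictionary input, and the user and item labels are all assumptions made for illustration.

```python
from itertools import combinations


def maximal_bicliques(adj):
    """Enumerate maximal bicliques of a small bipartite graph by brute force.

    `adj` maps each left vertex (for example a user) to the set of right
    vertices (for example items) adjacent to it. The running time is
    exponential in the number of left vertices, so this is for
    illustration only.
    """
    left = sorted(adj)
    found = set()
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            # Right vertices adjacent to every left vertex in the subset.
            common = set.intersection(*(adj[u] for u in subset))
            if not common:
                continue
            # Close the left side: all left vertices adjacent to all of `common`.
            closed = frozenset(u for u in left if common <= adj[u])
            found.add((closed, frozenset(common)))
    return found


# Hypothetical toy data: three users and three items.
toy = {"u1": {"a", "b"}, "u2": {"a", "b", "c"}, "u3": {"c"}}
for users, items in sorted(maximal_bicliques(toy), key=lambda p: sorted(p[0])):
    print(sorted(users), sorted(items))
```

Each returned pair is closed on both sides, so neither side can be extended without breaking the biclique property.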
This paper investigates reinforcement learning (RL) based finite-time control (FTC) of uncertain robotic systems. The proposed methodology consists of a terminal sliding mode based finite-time controller and an Actor-Critic (AC)-based RL loop that adjusts the output of the neural network. The terminal sliding mode controller is designed to ensure a calculable settling time, in contrast to conventional asymptotic stability. The AC-based RL loop uses a recursive least squares technique to update the critic network and a policy gradient algorithm to estimate the parameters of the actor network. We show that the AC loop improves the robustness of the terminal sliding mode controller both in the approaching stage and near equilibrium. The performance of the proposed controller is compared with that of a terminal sliding mode controller alone. The simulation results show that the proposed controller outperforms the pure terminal sliding mode controller, and that AC is a successful supplement to FTC.
{"title":"Actor-Critic Neural Network Based Finite-time Control for Uncertain Robotic Systems","authors":"Changyi Lei","doi":"10.1145/3471287.3471288","DOIUrl":"https://doi.org/10.1145/3471287.3471288","url":null,"PeriodicalId":306474,"journal":{"name":"2021 the 5th International Conference on Information System and Data Mining","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133954608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pronab Ghosh, S. Azam, Asif Karim, M. Jonkman, Md. Zahid Hasan
Cardiovascular disease has become one of the world's major causes of death, so accurate and timely diagnosis is of crucial importance. We constructed an intelligent diagnostic framework for the prediction of heart disease, using the Cleveland Heart Disease dataset. We used three machine learning approaches, Decision Tree (DT), K-Nearest Neighbor (KNN), and Random Forest (RF), in combination with different sets of features. We applied the three techniques to the full set of features, to a set of ten features selected by the Pearson's correlation technique, and to a set of six features selected by the Relief algorithm. Results were evaluated based on accuracy, precision, sensitivity, and several other indices. The best results were obtained with the combination of the RF classifier and the features selected by Relief, achieving an accuracy of 98.36%. This could be further improved by employing a 5-fold Cross Validation (CV) approach, resulting in an accuracy of 99.337%. CCS CONCEPTS • Applied computing • Life and medical sciences • Health informatics
{"title":"Use of Efficient Machine Learning Techniques in the Identification of Patients with Heart Diseases","authors":"Pronab Ghosh, S. Azam, Asif Karim, M. Jonkman, Md. Zahid Hasan","doi":"10.1145/3471287.3471297","DOIUrl":"https://doi.org/10.1145/3471287.3471297","url":null,"PeriodicalId":306474,"journal":{"name":"2021 the 5th International Conference on Information System and Data Mining","volume":"56 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126126126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
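The heart disease abstract above mentions selecting ten features with Pearson's correlation. One common reading of that step, assumed here and not confirmed by the abstract, is to rank features by the absolute value of their correlation with the target and keep the top k. The sketch below uses only the standard library; the function names and the toy columns are hypothetical, and the values are not taken from the Cleveland dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # A constant column carries no linear signal.
    return cov / (dev_x * dev_y)


def top_k_features(columns, target, k):
    """Return the k feature names with the largest |correlation| to the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data with three features and a binary target.
columns = {
    "age": [1, 2, 3, 4],
    "noise": [1, -1, 1, -1],
    "chol": [3, 1, 2, 4],
}
target = [0, 0, 1, 1]
print(top_k_features(columns, target, k=2))
```

Ranking by absolute value treats strongly negative correlations as equally informative as strongly positive ones, which is a design choice of this sketch.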
The features of retinal blood vessels are essential indicators that help doctors judge and diagnose eye diseases. These features can sometimes also serve as indicators in examinations for hypertension, coronary heart disease, and diabetes. However, retinal blood vessels are often very small and complex in distribution, which makes segmenting them difficult for doctors. Although deep learning methods represented by U-Net have performed very well in retinal blood vessel image segmentation in recent years, this difficulty has still not been effectively resolved. To improve segmentation accuracy and address this difficulty, we propose a network called SF-U-Net, which uses accurate shape estimation and feature restoration. We follow the structure of Fully Convolutional Networks (FCN) and the skip connections of U-Net, and use deformable convolution to accurately capture the shape of blood vessels when extracting features at the coding layer, which overcomes the problem of complex blood vessel distribution. At the decoding layer, we adopt a novel dual-stream up-sampling method to achieve accurate feature restoration. Experimental results show that SF-U-Net conspicuously improves the segmentation of retinal blood vessels. In the experiments, we use two fundus image datasets, DRIVE and CHASE-DB1, and the results on multiple indicators surpass those of other deep learning methods significantly. On the DRIVE dataset, the results of the SF-U-Net model on a variety of indicators exceed those of the currently most advanced methods: the mean accuracy is 0.9602, the area under the curve (AUC) is 0.9848, and the sensitivity is 0.8567.
{"title":"SF-U-Net: Using Accurate Shape Estimation and Feature Restoration to Improve Retinal Vessel Segmentation","authors":"Wen-Chun Yang","doi":"10.1145/3471287.3471377","DOIUrl":"https://doi.org/10.1145/3471287.3471377","url":null,"PeriodicalId":306474,"journal":{"name":"2021 the 5th International Conference on Information System and Data Mining","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126413715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rumors are defined as widely spread talk with no reliable source to back it up. In modern society, rumors spread widely on social networks, and their spread poses great challenges for society. A "fake news" story can rile up your emotions and change your mood. Some rumors can even cause social panic and economic losses. As such, the influence of rumors can be far-reaching and long-lasting. Efficient and intelligent rumor control strategies are necessary to constrain the spread of rumors. Existing rumor control strategies are designed for controlling a single rumor. However, many rumors usually exist on social networks at once, and only a limited number can be removed at a time due to limited detection capacity and CPU performance. Consequently, when dealing with multiple rumors, we should remove them in a certain order. We argue that the order of removal matters because different rumors possess different properties, e.g., acceptance rate and propagation speed. Unfortunately, to the best of our knowledge, there is no prior work on removing multiple rumors or on the order in which to remove them. To this end, this paper proposes two novel rumor control strategies for removing multiple rumors. We also extend the classical Susceptible Infected Recovered (SIR) model to simulate the dynamics of rumor propagation in a more practical manner. We evaluate the performance of the strategies, and the experiments show that our proposed rumor control strategies clearly outperform the benchmark strategy.
{"title":"Rumor Remove Order Strategy on Social Networks","authors":"Yuanda Wang, Haibo Wang, Shigang Chen, Ye Xia","doi":"10.1145/3471287.3471294","DOIUrl":"https://doi.org/10.1145/3471287.3471294","url":null,"PeriodicalId":306474,"journal":{"name":"2021 the 5th International Conference on Information System and Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129067616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
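For readers unfamiliar with the classical Susceptible Infected Recovered (SIR) model named in the rumor control abstract above, the following is a minimal discrete-time sketch of the classical model only. The authors' extension and their removal strategies are not reproduced here. The function name `simulate_sir`, the parameter values, and the population sizes are assumptions chosen for illustration, with "infected" standing for users currently spreading a rumor.

```python
def simulate_sir(s0, i0, r0, beta, gamma, steps):
    """Discrete-time classical SIR model on a well-mixed population.

    beta is the per-step transmission rate and gamma the per-step recovery
    rate (assumed to lie in [0, 1]). Returns a list of (S, I, R) tuples,
    one per time step including the initial state.
    """
    total = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    for _ in range(steps):
        # New spreaders cannot exceed the remaining susceptible population.
        new_infected = min(s, beta * s * i / total)
        # Spreaders who stop spreading during this step.
        new_recovered = min(i, gamma * i)
        s -= new_infected
        i += new_infected - new_recovered
        r += new_recovered
        history.append((s, i, r))
    return history


# Hypothetical parameters: 1000 users, 10 of them initially spreading.
trajectory = simulate_sir(990, 10, 0, beta=0.3, gamma=0.1, steps=50)
peak_step = max(range(len(trajectory)), key=lambda t: trajectory[t][1])
print("peak number of spreaders occurs at step", peak_step)
```

The total population is conserved at every step, which is a quick sanity check when modifying the update rules.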
In this paper, we introduce an original neural network, FOB-Net, for semantic segmentation tasks. The network's innovation is that it pays more attention to boundary information and obtains the supervised information directly. After extensive study of several networks used in semantic segmentation application scenarios, we analyzed the various accepted network structures. In the end, we revised an existing network to obtain FOB-Net.
{"title":"FOB-Net: A Semantic Segmentation Method Focusing on Boundary Information","authors":"Jiayu Wang","doi":"10.1145/3471287.3471290","DOIUrl":"https://doi.org/10.1145/3471287.3471290","url":null,"PeriodicalId":306474,"journal":{"name":"2021 the 5th International Conference on Information System and Data Mining","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131662576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The rising popularity of facial recognition technology has prompted many questions about its application, reliability, safety, and legality. The ability of a machine to identify an individual and their emotions from an image with near-perfect accuracy is a testament to how far artificial intelligence (AI) models have come. This study rigorously analyzes and consolidates several reputable materials with the purpose of answering the following questions: What is facial recognition? How is data acquired? What is the machine learning process? How does the Convolutional Neural Network (CNN) work? It also explores potential obstructions, such as face masks, that affect the machine's accuracy, as well as the security vulnerabilities, reliability, and legal concerns of the technology.
{"title":"A Survey of CNN and Facial Recognition Methods in the Age of COVID-19","authors":"Adinma Chidumije, Fatima Gowher, Ehsan Kamalinejad, Justine Mercado, Jiwanjot Soni, Jiaofei Zhong","doi":"10.1145/3471287.3471292","DOIUrl":"https://doi.org/10.1145/3471287.3471292","url":null,"PeriodicalId":306474,"journal":{"name":"2021 the 5th International Conference on Information System and Data Mining","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124569396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With their busy lives, people nowadays feel uncomfortable going to a shop to buy things and have less time to do so. E-shopping has grown rapidly during the last decade simply because people are able to purchase items from online platforms 24x7. Online shopping is a process in which consumers buy goods and services from a shop or a seller over the internet. Popular websites such as AliExpress and eBay are good examples. This online shopping concept is now also very popular in Sri Lanka. In this research, the authors concentrate on Sri Lankan online clothing stores and examine online consumers' buying behavior based on ethnicity. The authors chose Sri Lanka because it is a multi-ethnic country. Four main ethnicities live in Sri Lanka: Sinhalese, Tamils, Muslims, and Burghers. This research mainly considers the two main ethnicities, Sinhalese and Tamils. The authors analyze the buying behavior of those two ethnicities, such as cloth types and favorite colors, on online clothing shopping platforms. Moreover, the authors implement classification models that can predict the cloth type and the color of the clothes based on consumers' ethnicities. The main benefit of this research would go to Sri Lankan online sellers, who would be able to get a clear understanding of the buying behaviors and expectations of consumers based on their ethnicities. From this, sellers could improve their sales, and online shopping users could get an attractive and good online shopping experience.
{"title":"Ethnicity Based Consumer Buying Behavior Analysis and Prediction on Online Clothing Platforms in Sri Lanka","authors":"T. Ginige, K. Mahima","doi":"10.1145/3471287.3471291","DOIUrl":"https://doi.org/10.1145/3471287.3471291","url":null,"PeriodicalId":306474,"journal":{"name":"2021 the 5th International Conference on Information System and Data Mining","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114451720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This work is about the use of machine learning methods to improve the monitoring of water quality. The work aims to use machine learning to predict the normal values of a quality indicator (pH, salinity, etc.). When actual measurements deviate significantly from the predicted values, monitoring scientists would be alerted to the need to inspect the waterway more closely, thereby reducing the possibility of missing a problem and speeding up the determination of water quality issues. This study compares methods for predicting water quality parameters using water data from Lake Pontchartrain in Southeast Louisiana. K-nearest neighbors, decision trees, and an artificial neural network were used to determine which method most accurately predicts water quality parameters such as pH, temperature, salinity, specific conductance, and dissolved oxygen. The decision tree and k-nearest neighbors algorithms produced similar results, which were only slightly below the standard deviation of the data. However, a neural network was able to predict the values with much higher accuracy.
{"title":"Predicting Water Quality Parameters in Lake Pontchartrain using Machine Learning: A comparison on K-Nearest Neighbors, Decision Trees, and Neural Networks to Predict Water Quality","authors":"A. Daniels, C. Koutsougeras","doi":"10.1145/3471287.3471308","DOIUrl":"https://doi.org/10.1145/3471287.3471308","url":null,"PeriodicalId":306474,"journal":{"name":"2021 the 5th International Conference on Information System and Data Mining","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127262038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
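The water quality abstract above compares k-nearest neighbors with other regressors. As a rough illustration of how k-nearest neighbors regression produces a prediction, the sketch below averages the target values of the k closest training points. The function name `knn_predict`, the single input feature, and the numbers are hypothetical and are not taken from the Lake Pontchartrain data; the abstract does not state which features or which k the authors used.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict a value as the mean target of the k nearest training points.

    train_x is a list of feature tuples, train_y the matching target values,
    and query a feature tuple of the same length. Distance is Euclidean.
    """
    ranked = sorted(range(len(train_x)), key=lambda idx: math.dist(train_x[idx], query))
    nearest = ranked[:k]
    return sum(train_y[idx] for idx in nearest) / len(nearest)


# Hypothetical readings: one feature (for example temperature) against pH.
features = [(0.0,), (1.0,), (2.0,), (10.0,)]
ph_values = [7.0, 7.2, 7.4, 9.0]
print(knn_predict(features, ph_values, (1.0,), k=3))
```

In a monitoring setting like the one described, a prediction of this kind would be compared against the actual measurement, and a large gap would trigger closer inspection.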
Determining potential customers is very important in direct marketing. Data mining techniques are among the most important methods companies use to determine potential customers. However, since the number of potential customers is very low compared to the number of non-potential customers, there is a class imbalance problem that significantly affects the performance of data mining techniques. In this paper, different combinations of basic and advanced resampling techniques, such as Synthetic Minority Oversampling Technique (SMOTE), Tomek Link, Random Under-Sampling (RUS), and Random Over-Sampling (ROS), were evaluated to improve the performance of customer classification. Different feature selection techniques, such as Information Gain, Gain Ratio, Chi-squared, and Relief, were used to decrease the number of non-informative features in the data. Classification performance was compared across several data mining techniques, namely LightGBM, XGBoost, Gradient Boost, Random Forest, AdaBoost, ANN, Logistic Regression, Decision Trees, SVC, and Bagging Classifier, based on ROC AUC and sensitivity metrics. The combination of Tomek Link and Random Under-Sampling as the resampling technique and the Chi-squared method as the feature selection algorithm showed superior performance over the other combinations. Detailed performance evaluations demonstrated that, with the proposed approach, LightGBM, a gradient boosting algorithm based on decision trees, gave the best results among the classifiers, with 0.947 sensitivity and 0.896 ROC AUC.
{"title":"Data Mining Techniques in Direct Marketing on Imbalanced Data using Tomek Link Combined with Random Under-sampling","authors":"Ümit Yılmaz, C. Gezer, Z. Aydın, V. C. Gungor","doi":"10.1145/3471287.3471299","DOIUrl":"https://doi.org/10.1145/3471287.3471299","url":null,"PeriodicalId":306474,"journal":{"name":"2021 the 5th International Conference on Information System and Data Mining","volume":"IM-36 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132899537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
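The direct marketing abstract above combines Tomek Link cleaning with Random Under-Sampling. A Tomek link is a pair of samples from different classes that are each other's nearest neighbor. The sketch below is a small standard-library illustration of that idea, not the authors' pipeline. It assumes that only the majority-class member of each link is dropped and that the majority class is then undersampled to the minority size; the function names and the toy points are hypothetical.

```python
import math
import random


def nearest_neighbor(points, idx):
    """Index of the point closest to points[idx], excluding itself."""
    best, best_dist = None, math.inf
    for other, point in enumerate(points):
        if other == idx:
            continue
        dist = math.dist(points[idx], point)
        if dist < best_dist:
            best, best_dist = other, dist
    return best


def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if j is not None and i < j and nn[j] == i and labels[i] != labels[j]
    ]


def tomek_then_random_undersample(points, labels, majority, rng):
    """Drop majority members of Tomek links, then undersample the majority class."""
    drop = {k for pair in tomek_links(points, labels) for k in pair if labels[k] == majority}
    keep = [k for k in range(len(points)) if k not in drop]
    major = [k for k in keep if labels[k] == majority]
    minor = [k for k in keep if labels[k] != majority]
    if len(major) > len(minor):
        # Random under-sampling: keep as many majority samples as minority ones.
        major = rng.sample(major, len(minor))
    chosen = sorted(major + minor)
    return [points[k] for k in chosen], [labels[k] for k in chosen]


# Hypothetical toy data: label 1 is the rare "potential customer" class.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0), (9.0, 9.0), (10.0, 10.0)]
labs = [0, 1, 0, 0, 0, 0]
print(tomek_links(pts, labs))
print(tomek_then_random_undersample(pts, labs, majority=0, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the undersampling step reproducible, which helps when comparing resampling combinations.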