A Clustering Based Priority Driven Sampling Technique for Imbalance Data Classification
Authors: Iftakhar Ali Khandokar, Abdullah-All-Tanvir, Tanvina Khondokar, Nabila Tabassum Jhilik, Swakkhar Shatabda
Venue: 2022 14th International Conference on Software, Knowledge, Information Management and Applications (SKIMA)
Publication date: 2022-12-02
DOI: https://doi.org/10.1109/SKIMA57145.2022.10029565
Citation count: 1
Abstract
Classification of imbalanced data is one of the most vital tasks in machine learning, because most available real-life datasets have an imbalanced distribution of class labels. The effect is severe: a predictive model trained on imbalanced data faces problems such as overfitting, where the model becomes biased towards the majority class. Many techniques, such as oversampling and undersampling, have been proposed over time to deal with the problems caused by imbalanced distributions, although oversampling has not been able to match the performance achieved by undersampling. One baseline undersampling method clusters the majority-class data into multiple clusters and then randomly samples within them to thin out redundant data, but we believe that random sampling risks losing informative samples. In this work, we therefore propose two clustering-based priority sampling methods that boost the performance of the predictive model compared with clustering-based random sampling techniques.
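The baseline mentioned in the abstract (cluster the majority class, then sample at random inside each cluster) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name a clustering algorithm or a per-cluster quota rule, so plain k-means and a size-proportional quota are assumed here, and the names `kmeans` and `cluster_random_undersample` are hypothetical. The two priority-driven variants proposed in the paper are not sketched, because the abstract does not say how priority is assigned.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples of floats (an assumed choice of
    clustering algorithm). Returns one cluster id per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_random_undersample(majority, n_keep, k=3, seed=0):
    """Keep roughly n_keep majority samples, drawn uniformly at random from
    each cluster in proportion to the cluster's size (assumed quota rule)."""
    rng = random.Random(seed)
    labels = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        quota = round(n_keep * len(members) / len(majority))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


# Toy majority class: 90 synthetic 2-D points around three centres.
data_rng = random.Random(1)
majority = [
    (data_rng.gauss(cx, 0.3), data_rng.gauss(cy, 0.3))
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(30)
]
kept = cluster_random_undersample(majority, n_keep=30)
```

Because each cluster's quota is rounded independently, the number of retained samples can differ slightly from `n_keep`. The uniform `rng.sample` call is the step the abstract argues against: it treats every point in a cluster as equally expendable, which is where an informative sample could be lost.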