Weighted support vector machine for extremely imbalanced data
Authors: Jongmin Mun, Sungwan Bang, Jaeoh Kim
Journal: Computational Statistics & Data Analysis, volume 203, Article 108078 (published 2024-11-04)
DOI: 10.1016/j.csda.2024.108078
URL: https://www.sciencedirect.com/science/article/pii/S0167947324001622
Based on an asymptotically optimal weighted support vector machine (SVM) that introduces label shift, a systematic procedure is derived for applying oversampling and weighted SVM to extremely imbalanced datasets with a cluster-structured positive class. This method formalizes three intuitions: (i) oversampling should reflect the structure of the positive class; (ii) weights should account for both the imbalance and oversampling ratios; (iii) synthetic samples should carry less weight than the original samples. The proposed method generates synthetic samples from the estimated positive class distribution using a Gaussian mixture model. To prevent overfitting to excessive synthetic samples, different misclassification penalties are assigned to the original positive class, synthetic positive class, and negative class. The proposed method is numerically validated through simulations and an analysis of Republic of Korea Army artillery training data.
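The pipeline described in the abstract (Gaussian-mixture oversampling followed by an SVM with three distinct misclassification penalties) might be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `fit_three_weight_svm`, the `synth_discount` parameter, and above all the weight formula are hypothetical placeholders, since the abstract does not give the paper's asymptotically derived weights. The sketch uses scikit-learn's `GaussianMixture` and the `sample_weight` argument of `SVC.fit`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC


def fit_three_weight_svm(X_neg, X_pos, n_synth, n_components=2,
                         synth_discount=0.5, C=1.0, seed=0):
    """Oversample positives with a GMM, then fit an SVM with three penalties."""
    # Intuition (i): estimate the positive-class distribution with a Gaussian
    # mixture and draw synthetic positives from it, so that oversampling
    # follows the cluster structure of the positive class.
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    X_synth, _ = gmm.fit(X_pos).sample(n_synth)

    # Intuitions (ii) and (iii), using an ASSUMED heuristic that is NOT the
    # paper's formula: the weights depend on the imbalance (n_neg vs. n_pos)
    # and on the amount of oversampling (n_synth), and each synthetic sample
    # gets a fraction synth_discount < 1 of an original positive's weight.
    # The total positive weight is scaled to equal the total negative weight.
    n_neg, n_pos = len(X_neg), len(X_pos)
    w_pos = n_neg / (n_pos + synth_discount * n_synth)
    w_synth = synth_discount * w_pos
    weights = np.concatenate([
        np.ones(n_neg),              # negative class
        np.full(n_pos, w_pos),       # original positives
        np.full(n_synth, w_synth),   # synthetic positives
    ])

    X = np.vstack([X_neg, X_pos, X_synth])
    y = np.concatenate([-np.ones(n_neg), np.ones(n_pos + n_synth)])

    # sample_weight rescales C per sample, which gives each of the three
    # groups its own misclassification penalty.
    clf = SVC(kernel="rbf", C=C, gamma="scale")
    clf.fit(X, y, sample_weight=weights)
    return clf, weights


# Toy data: 500 negatives, 10 positives split across two clusters.
rng = np.random.default_rng(0)
X_neg = rng.normal(0.0, 1.0, size=(500, 2))
X_pos = np.vstack([
    rng.normal([3.0, 3.0], 0.3, size=(5, 2)),
    rng.normal([-3.0, 3.0], 0.3, size=(5, 2)),
])
clf, weights = fit_three_weight_svm(X_neg, X_pos, n_synth=90)
print(clf.predict([[3.0, 3.0], [-3.0, 3.0], [0.0, 0.0]]))
```

In practice the number of mixture components, the oversampling amount, and the three penalties would be chosen as the paper prescribes or by validation; the values above are only for demonstration.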
About the journal:
Computational Statistics and Data Analysis (CSDA), an Official Publication of the network Computational and Methodological Statistics (CMStatistics) and of the International Association for Statistical Computing (IASC), is an international journal dedicated to the dissemination of methodological research and applications in the areas of computational statistics and data analysis. The journal consists of four refereed sections which are divided into the following subject areas:
I) Computational Statistics - Manuscripts dealing with: 1) the explicit impact of computers on statistical methodology (e.g., Bayesian computing, bioinformatics, computer graphics, computer-intensive inferential methods, data exploration, data mining, expert systems, heuristics, knowledge-based systems, machine learning, neural networks, numerical and optimization methods, parallel computing, statistical databases, statistical systems), and 2) the development, evaluation and validation of statistical software and algorithms. Software and algorithms can be submitted with manuscripts and will be stored together with the online article.
II) Statistical Methodology for Data Analysis - Manuscripts dealing with novel and original data analytical strategies and methodologies applied in biostatistics (design and analytic methods for clinical trials, epidemiological studies, statistical genetics, or genetic/environmental interactions), chemometrics, classification, data exploration, density estimation, design of experiments, environmetrics, education, image analysis, marketing, model free data exploration, pattern recognition, psychometrics, statistical physics, image processing, robust procedures.
[...]
III) Special Applications - [...]
IV) Annals of Statistical Data Science [...]