Z. Lee, Chou-Yuan Lee, So-Tsung Chou, Wei-Ping Ma, Fulan Ye, Zhen Chen
2019 IEEE International Conference of Intelligent Applied Systems on Engineering (ICIASE), published 2019-04-01. DOI: 10.1109/ICIASE45644.2019.9074009 (https://doi.org/10.1109/ICIASE45644.2019.9074009)
A Distributed Intelligent Algorithm Applied to Imbalanced Data
Data mining is the discovery of valuable information in databases or data sets. In imbalanced data, some classes are represented by an extremely small number of samples, and such problems are not easy to solve with traditional data mining methods. This paper proposes a distributed intelligent algorithm for imbalanced data. Apache Spark serves as the distributed framework of the proposed algorithm; its cluster computing framework, with an in-memory data processing engine, can perform analytics on large volumes of data. Within this framework, the synthetic minority oversampling technique (SMOTE) is first applied on Apache Spark to process the imbalanced data. A support vector machine (SVM) is then used to classify the data. The zoo data set from the UCI repository is used to verify the correctness of the proposed algorithm. The results show that the proposed distributed intelligent algorithm performs better than the traditional classifiers it is compared against.
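The oversampling step named in the abstract can be illustrated with a minimal single-machine sketch of the SMOTE idea: each synthetic sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. This is not the authors' Spark implementation, and it simplifies the original SMOTE procedure by drawing the base sample at random. The function name `smote`, its parameters, and the toy data are all assumptions made for illustration. In the pipeline described above, the balanced data would then be passed to an SVM classifier, which is not shown here.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric feature vectors (one class only).
    k: number of nearest minority neighbours considered per sample.
    Hypothetical helper for illustration, not the paper's implementation.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    # For each minority sample, find the indices of its k nearest minority
    # neighbours (Euclidean distance, excluding the sample itself).
    neighbours = []
    for i, x in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbours.append(others[:k])

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))   # base sample, chosen at random
        j = rng.choice(neighbours[i])      # one of its nearest neighbours
        gap = rng.random()                 # position along the segment
        x, nb = minority[i], minority[j]
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Toy data, made up for illustration: 8 majority points, 3 minority points.
majority = [[float(i), float(i % 3)] for i in range(8)]
minority = [[10.0, 10.0], [11.0, 10.5], [10.5, 11.5]]

# Create enough synthetic minority points to match the majority class size.
new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class. For data with binary or categorical attributes, plain interpolation yields fractional values, so a variant that handles such attributes may be more appropriate.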