T. Sutojo, A. Syukur, Supriadi Rustad, Guruh Fajar Shidik, Heru Agus Santoso, Purwanto Purwanto, Muljono Muljono
2020 International Seminar on Application for Technology of Information and Communication (iSemantic), published 2020-09-19. DOI: 10.1109/iSemantic50169.2020.9234265 (https://doi.org/10.1109/iSemantic50169.2020.9234265)
Investigating the Impact of Synthetic Data Distribution on the Performance of Regression Models to Overcome Small Dataset Problems
Machine learning is widely used in various fields because of its ability to learn from data without having to specify the functional relationships that govern a system. However, small datasets often make it difficult for learning algorithms to make accurate predictions. Oversampling is one way to overcome this. For regression models it is not straightforward, because every synthetic point placed in the feature space must be accompanied by an appropriate target value, usually obtained from an estimation function. In this paper, oversampling is therefore done by distributing synthetic data according to the Bus, Star, and Mesh topologies, using the SMOTE (Synthetic Minority Over-sampling Technique) method. In the experiments, one public ISE (Istanbul Stock Exchange) dataset and one real CF (Color Filter) dataset were used to measure the performance of the proposed oversampling technique. Results obtained on the same datasets with the MPV, FCM, and MMPV methods were used for comparison. The results show that oversampling with the Bus, Star, or Mesh distribution gives better performance than no oversampling. On the ISE dataset, the proposed method has a smaller average RMSE than the MPV, FCM, and MMPV methods. On the CF dataset, the proposed method has a smaller average RMSE than the MPV, FCM, and MMPV methods when the amount of training data is smaller than the amount of testing data.
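To make the idea concrete, here is a minimal sketch of SMOTE-style oversampling for regression, in which the target value is interpolated together with the features. The abstract does not define how the Bus, Star, and Mesh topologies pair up samples, so the pairings below (chain, single hub, all pairs) are assumptions borrowed from the network topologies of the same names. The function names `topology_pairs`, `oversample`, and `rmse` are hypothetical, and linear interpolation of the target stands in for whatever estimation function the paper actually uses.

```python
import itertools
import math
import random


def topology_pairs(n, topology):
    """Return index pairs (i, j) to interpolate between, for n samples.

    Assumed pairings, not the paper's definition.
    """
    if topology == "bus":   # chain: each sample linked to the next one
        return [(i, i + 1) for i in range(n - 1)]
    if topology == "star":  # hub: sample 0 linked to every other sample
        return [(0, j) for j in range(1, n)]
    if topology == "mesh":  # every sample linked to every other sample
        return list(itertools.combinations(range(n), 2))
    raise ValueError(f"unknown topology: {topology}")


def oversample(X, y, topology, per_pair=1, seed=0):
    """SMOTE-style interpolation applied to features and target together."""
    rng = random.Random(seed)
    X_new, y_new = [], []
    for i, j in topology_pairs(len(X), topology):
        for _ in range(per_pair):
            t = rng.random()  # random position along the segment i -> j
            X_new.append([a + t * (b - a) for a, b in zip(X[i], X[j])])
            # Assumption: the target is interpolated linearly as well.
            y_new.append(y[i] + t * (y[j] - y[i]))
    return X + X_new, y + y_new


def rmse(y_true, y_pred):
    """Root mean squared error, the metric reported in the abstract."""
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data (made up for illustration): four points on the line y = 2x.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 2.0, 4.0, 6.0]
for name in ("bus", "star", "mesh"):
    X_aug, y_aug = oversample(X, y, name)
    print(name, len(X_aug))  # bus 7, star 7, mesh 10
```

With four samples, the chain and hub pairings each add three synthetic points, while the all-pairs pairing adds six. In an experiment like the one described, a regression model would be trained on the augmented set and `rmse` evaluated on held-out test data.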