Raul Sena Ferreira, Bruno M. A. da Silva, W. Teixeira, Geraldo Zimbrão, L. Alvim
{"title":"具有极端验证延迟的非平稳环境下基于密度的核心支持提取","authors":"Raul Sena Ferreira, Bruno M. A. da Silva, W. Teixeira, Geraldo Zimbrão, L. Alvim","doi":"10.1109/BRACIS.2018.00039","DOIUrl":null,"url":null,"abstract":"Machine learning solutions usually consider that the train and test data has the same probabilistic distribution, that is, the data is stationary. However, in streaming scenarios, data distribution generally change through the time, that is, the data is non-stationary. The main challenge in such online environment is the model adaptation for the constant drifts in data distribution. Besides, other important restriction may happen in online scenarios: the extreme latency to verify the labels. Worth to mention that the incremental drift assumption is that class distributions overlap at subsequent time steps. Hence, the core region of data distribution have significant overlap with incoming data. Therefore, selecting samples from these core regions helps to retain the most important instances that represent the new distribution. This selection is denominated core support extraction (CSE). Thus, we present a study about density-based algorithms applied in non-stationary environments. We compared KDE, GMM and two variations of DBSCAN against single semi-supervised approaches. We validated these approaches in seventeen synthetic datasets and a real one, showing the strengths and weaknesses of these CSE methods through many metrics. We show that a semi-supervised classifier is improved up to 68% on a real dataset when it is applied along with a density-based CSE algorithm. The results between KDE and GMM, as CSE methods, were close but the approach using KDE is more practical due to having less parameters.","PeriodicalId":405190,"journal":{"name":"2018 7th Brazilian Conference on Intelligent Systems (BRACIS)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Density-Based Core Support Extraction for Non-stationary Environments with Extreme Verification Latency\",\"authors\":\"Raul Sena Ferreira, Bruno M. A. da Silva, W. Teixeira, Geraldo Zimbrão, L. Alvim\",\"doi\":\"10.1109/BRACIS.2018.00039\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Machine learning solutions usually consider that the train and test data has the same probabilistic distribution, that is, the data is stationary. However, in streaming scenarios, data distribution generally change through the time, that is, the data is non-stationary. The main challenge in such online environment is the model adaptation for the constant drifts in data distribution. Besides, other important restriction may happen in online scenarios: the extreme latency to verify the labels. Worth to mention that the incremental drift assumption is that class distributions overlap at subsequent time steps. Hence, the core region of data distribution have significant overlap with incoming data. Therefore, selecting samples from these core regions helps to retain the most important instances that represent the new distribution. This selection is denominated core support extraction (CSE). Thus, we present a study about density-based algorithms applied in non-stationary environments. We compared KDE, GMM and two variations of DBSCAN against single semi-supervised approaches. We validated these approaches in seventeen synthetic datasets and a real one, showing the strengths and weaknesses of these CSE methods through many metrics. We show that a semi-supervised classifier is improved up to 68% on a real dataset when it is applied along with a density-based CSE algorithm. The results between KDE and GMM, as CSE methods, were close but the approach using KDE is more practical due to having less parameters.\",\"PeriodicalId\":405190,\"journal\":{\"name\":\"2018 7th Brazilian Conference on Intelligent Systems (BRACIS)\",\"volume\":\"100 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 7th Brazilian Conference on Intelligent Systems (BRACIS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/BRACIS.2018.00039\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 7th Brazilian Conference on Intelligent Systems (BRACIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/BRACIS.2018.00039","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Density-Based Core Support Extraction for Non-stationary Environments with Extreme Verification Latency
Machine learning solutions usually consider that the train and test data has the same probabilistic distribution, that is, the data is stationary. However, in streaming scenarios, data distribution generally change through the time, that is, the data is non-stationary. The main challenge in such online environment is the model adaptation for the constant drifts in data distribution. Besides, other important restriction may happen in online scenarios: the extreme latency to verify the labels. Worth to mention that the incremental drift assumption is that class distributions overlap at subsequent time steps. Hence, the core region of data distribution have significant overlap with incoming data. Therefore, selecting samples from these core regions helps to retain the most important instances that represent the new distribution. This selection is denominated core support extraction (CSE). Thus, we present a study about density-based algorithms applied in non-stationary environments. We compared KDE, GMM and two variations of DBSCAN against single semi-supervised approaches. We validated these approaches in seventeen synthetic datasets and a real one, showing the strengths and weaknesses of these CSE methods through many metrics. We show that a semi-supervised classifier is improved up to 68% on a real dataset when it is applied along with a density-based CSE algorithm. The results between KDE and GMM, as CSE methods, were close but the approach using KDE is more practical due to having less parameters.