Density-Based Core Support Extraction for Non-stationary Environments with Extreme Verification Latency

2018 7th Brazilian Conference on Intelligent Systems (BRACIS) Pub Date : 2018-10-01 DOI:10.1109/BRACIS.2018.00039

Raul Sena Ferreira, Bruno M. A. da Silva, W. Teixeira, Geraldo Zimbrão, L. Alvim


Citations: 6

Abstract

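Machine learning solutions usually assume that training and test data share the same probability distribution, that is, that the data is stationary. In streaming scenarios, however, the data distribution generally changes over time, that is, the data is non-stationary. The main challenge in such online environments is adapting the model to the constant drift in the data distribution. Online scenarios may impose another important restriction: extreme latency in verifying the labels. Notably, the incremental drift assumption holds that class distributions overlap at subsequent time steps. Hence, the core region of the data distribution overlaps significantly with incoming data, and selecting samples from these core regions helps to retain the most important instances that represent the new distribution. This selection is called core support extraction (CSE). We present a study of density-based algorithms applied to non-stationary environments. We compared KDE, GMM, and two variations of DBSCAN against stand-alone semi-supervised approaches. We validated these approaches on seventeen synthetic datasets and one real dataset, showing the strengths and weaknesses of these CSE methods across many metrics. We show that a semi-supervised classifier improves by up to 68% on a real dataset when it is applied together with a density-based CSE algorithm. As CSE methods, KDE and GMM produced close results, but the KDE-based approach is more practical because it has fewer parameters.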
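The abstract describes core support extraction only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation: estimate a density at every sample with a Gaussian KDE and keep the samples in the densest region. The function names, the `keep_fraction` and `bandwidth` parameters, and the toy data are all assumptions introduced for illustration.

```python
import math
import random


def gaussian_kde_scores(points, bandwidth):
    """Return a Gaussian kernel density estimate evaluated at each point."""
    n = len(points)
    dim = len(points[0])
    # Normalising constant of an isotropic Gaussian kernel in `dim` dimensions.
    norm = n * (bandwidth * math.sqrt(2.0 * math.pi)) ** dim
    scores = []
    for p in points:
        total = 0.0
        for q in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        scores.append(total / norm)
    return scores


def extract_core_supports(points, keep_fraction=0.5, bandwidth=0.5):
    """Return sorted indices of the highest-density points (hypothetical helper).

    `keep_fraction` and `bandwidth` are illustrative assumptions, not values
    taken from the paper.
    """
    scores = gaussian_kde_scores(points, bandwidth)
    n_keep = max(1, int(round(keep_fraction * len(points))))
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy data: one tight cluster (indices 0..39) plus two far-away points (40, 41).
rng = random.Random(0)
cluster = [(rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(40)]
outliers = [(3.0, 3.0), (-3.0, 2.5)]
data = cluster + outliers

core_idx = extract_core_supports(data, keep_fraction=0.5, bandwidth=0.5)
print(len(core_idx), "core supports kept out of", len(data))
```

In a streaming setting, a step like this would presumably run once per class at each time step, with the retained samples serving as the labeled seed for the semi-supervised classifier on the next batch. A GMM or DBSCAN variant would replace only the scoring function.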


