具有极端验证延迟的非平稳环境下基于密度的核心支持提取

Raul Sena Ferreira, Bruno M. A. da Silva, W. Teixeira, Geraldo Zimbrão, L. Alvim
{"title":"具有极端验证延迟的非平稳环境下基于密度的核心支持提取","authors":"Raul Sena Ferreira, Bruno M. A. da Silva, W. Teixeira, Geraldo Zimbrão, L. Alvim","doi":"10.1109/BRACIS.2018.00039","DOIUrl":null,"url":null,"abstract":"Machine learning solutions usually consider that the train and test data has the same probabilistic distribution, that is, the data is stationary. However, in streaming scenarios, data distribution generally change through the time, that is, the data is non-stationary. The main challenge in such online environment is the model adaptation for the constant drifts in data distribution. Besides, other important restriction may happen in online scenarios: the extreme latency to verify the labels. Worth to mention that the incremental drift assumption is that class distributions overlap at subsequent time steps. Hence, the core region of data distribution have significant overlap with incoming data. Therefore, selecting samples from these core regions helps to retain the most important instances that represent the new distribution. This selection is denominated core support extraction (CSE). Thus, we present a study about density-based algorithms applied in non-stationary environments. We compared KDE, GMM and two variations of DBSCAN against single semi-supervised approaches. We validated these approaches in seventeen synthetic datasets and a real one, showing the strengths and weaknesses of these CSE methods through many metrics. We show that a semi-supervised classifier is improved up to 68% on a real dataset when it is applied along with a density-based CSE algorithm. The results between KDE and GMM, as CSE methods, were close but the approach using KDE is more practical due to having less parameters.","PeriodicalId":405190,"journal":{"name":"2018 7th Brazilian Conference on Intelligent Systems (BRACIS)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Density-Based Core Support Extraction for Non-stationary Environments with Extreme Verification Latency\",\"authors\":\"Raul Sena Ferreira, Bruno M. A. da Silva, W. Teixeira, Geraldo Zimbrão, L. Alvim\",\"doi\":\"10.1109/BRACIS.2018.00039\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Machine learning solutions usually consider that the train and test data has the same probabilistic distribution, that is, the data is stationary. However, in streaming scenarios, data distribution generally change through the time, that is, the data is non-stationary. The main challenge in such online environment is the model adaptation for the constant drifts in data distribution. Besides, other important restriction may happen in online scenarios: the extreme latency to verify the labels. Worth to mention that the incremental drift assumption is that class distributions overlap at subsequent time steps. Hence, the core region of data distribution have significant overlap with incoming data. Therefore, selecting samples from these core regions helps to retain the most important instances that represent the new distribution. This selection is denominated core support extraction (CSE). Thus, we present a study about density-based algorithms applied in non-stationary environments. We compared KDE, GMM and two variations of DBSCAN against single semi-supervised approaches. We validated these approaches in seventeen synthetic datasets and a real one, showing the strengths and weaknesses of these CSE methods through many metrics. We show that a semi-supervised classifier is improved up to 68% on a real dataset when it is applied along with a density-based CSE algorithm. The results between KDE and GMM, as CSE methods, were close but the approach using KDE is more practical due to having less parameters.\",\"PeriodicalId\":405190,\"journal\":{\"name\":\"2018 7th Brazilian Conference on Intelligent Systems (BRACIS)\",\"volume\":\"100 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 7th Brazilian Conference on Intelligent Systems (BRACIS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/BRACIS.2018.00039\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 7th Brazilian Conference on Intelligent Systems (BRACIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/BRACIS.2018.00039","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

摘要

机器学习解决方案通常认为训练数据和测试数据具有相同的概率分布,即数据是平稳的。但在流场景下,数据的分布一般会随着时间的变化而变化,即数据是非平稳的。这种在线环境下的主要挑战是如何适应数据分布的不断变化。此外,在在线场景中可能会出现其他重要的限制:验证标签的极端延迟。值得一提的是,增量漂移假设是类分布在随后的时间步骤中重叠。因此,数据分布的核心区域与输入数据有明显的重叠。因此,从这些核心区域中选择样本有助于保留代表新分布的最重要的实例。这个选择被命名为核心支持提取(CSE)。因此,我们提出了一项关于应用于非平稳环境的基于密度的算法的研究。我们将KDE、GMM和DBSCAN的两个变体与单一的半监督方法进行了比较。我们在17个合成数据集和一个真实数据集中验证了这些方法,通过许多度量显示了这些CSE方法的优点和缺点。我们表明,当与基于密度的CSE算法一起应用时,半监督分类器在真实数据集上的改进高达68%。作为CSE方法,KDE和GMM之间的结果非常接近,但使用KDE的方法由于参数较少而更加实用。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Density-Based Core Support Extraction for Non-stationary Environments with Extreme Verification Latency
Machine learning solutions usually consider that the train and test data has the same probabilistic distribution, that is, the data is stationary. However, in streaming scenarios, data distribution generally change through the time, that is, the data is non-stationary. The main challenge in such online environment is the model adaptation for the constant drifts in data distribution. Besides, other important restriction may happen in online scenarios: the extreme latency to verify the labels. Worth to mention that the incremental drift assumption is that class distributions overlap at subsequent time steps. Hence, the core region of data distribution have significant overlap with incoming data. Therefore, selecting samples from these core regions helps to retain the most important instances that represent the new distribution. This selection is denominated core support extraction (CSE). Thus, we present a study about density-based algorithms applied in non-stationary environments. We compared KDE, GMM and two variations of DBSCAN against single semi-supervised approaches. We validated these approaches in seventeen synthetic datasets and a real one, showing the strengths and weaknesses of these CSE methods through many metrics. We show that a semi-supervised classifier is improved up to 68% on a real dataset when it is applied along with a density-based CSE algorithm. The results between KDE and GMM, as CSE methods, were close but the approach using KDE is more practical due to having less parameters.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Exploring the Data Using Extended Association Rule Network SPt: A Text Mining Process to Extract Relevant Areas from SW Documents to Exploratory Tests Gene Essentiality Prediction Using Topological Features From Metabolic Networks Bio-Inspired and Heuristic Methods Applied to a Benchmark of the Task Scheduling Problem A New Genetic Algorithm-Based Pruning Approach for Optimum-Path Forest
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1