Feature-Selected and -Preserved Sampling for High-Dimensional Stream Data Summary

Ling Lin, Qian Yu, Wen Ji, Yang Gao
Published in: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), 2019-11-01
DOI: 10.1109/ICTAI.2019.00198
Citations: 1

Abstract

With the growth of the mobile Internet, large volumes of stream data have emerged. Stream data cannot be stored completely in memory because of its massive volume and continuous arrival. Moreover, because repeated access is expensive, each record should be read only once and processed promptly. The nature of stream data therefore calls for a summary maintained in main memory, which enables fast incremental learning under limited time and memory. Sampling is one of the most commonly used methods for constructing data stream summaries. Traditional random sampling can deviate from the real data distribution because it does not consider the true distribution of the stream's attribute values, so we propose a novel sampling algorithm based on feature selection and feature preservation. We first use matrix approximation to select important features in the stream data. Then a feature-preserved sampling algorithm generates high-quality representative samples over a sliding window. The algorithm is designed to keep the distribution of attribute values in the sample highly consistent with that in the population (the entire data). Experiments on real datasets show that the proposed algorithm can select a representative sample with high efficiency.
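The idea of keeping a sample's attribute distribution close to that of the population can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a single categorical feature has already been selected (the matrix-approximation step is omitted), keeps the most recent records in a fixed-size sliding window, and draws a proportionally stratified sample from it. The names `stratified_sample`, `feature_idx`, and the synthetic `(id, category)` stream are hypothetical.

```python
import random
from collections import Counter, defaultdict, deque


def stratified_sample(window, feature_idx, k, rng):
    """Draw k records so each value of the chosen feature keeps
    roughly the same share in the sample as in the window."""
    n = len(window)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the window size")

    # Group the window's records by the value of the selected feature.
    groups = defaultdict(list)
    for rec in window:
        groups[rec[feature_idx]].append(rec)

    # Largest-remainder allocation: quotas are proportional to group
    # size and sum to exactly k.
    exact = {v: k * len(g) / n for v, g in groups.items()}
    quota = {v: int(e) for v, e in exact.items()}
    leftover = k - sum(quota.values())
    by_remainder = sorted(exact, key=lambda v: exact[v] - quota[v], reverse=True)
    for v in by_remainder[:leftover]:
        quota[v] += 1

    # Sample uniformly without replacement inside each group.
    sample = []
    for v, g in groups.items():
        sample.extend(rng.sample(g, quota[v]))
    return sample


# Hypothetical stream of (id, category) records, skewed 60/30/10.
rng = random.Random(0)
window = deque(maxlen=100)  # sliding window: keeps only the latest 100 records
for i in range(250):
    cat = "a" if i % 10 < 6 else ("b" if i % 10 < 9 else "c")
    window.append((i, cat))

sample = stratified_sample(list(window), feature_idx=1, k=10, rng=rng)
counts = Counter(rec[1] for rec in sample)
```

In this sketch the window holds 60, 30, and 10 records of categories `a`, `b`, and `c`, so the 10-record sample is allocated 6, 3, and 1 records respectively. Plain uniform sampling would match these proportions only in expectation.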