A performance study of the chain sampling algorithm

Rayane El Sibai, Yousra Chabchoub, J. Demerjian, Zakia Kazi-Aoul, Kabalan Barbar
{"title":"A performance study of the chain sampling algorithm","authors":"Rayane El Sibai, Yousra Chabchoub, J. Demerjian, Zakia Kazi-Aoul, Kabalan Barbar","doi":"10.1109/INTELCIS.2015.7397265","DOIUrl":null,"url":null,"abstract":"On-line data stream analysis is an important challenge today because of the always-increasing rates of the streams issued from multiple heterogeneous sources, in many application domains. To reduce the amount of the data stream, several sampling methods were designed by the data stream research community. We focus in this paper, on the chain sampling algorithm proposed by Babcock et al. The aim of this algorithm is to select randomly and at any time, a given fixed proportion from the most recent items of the stream contained in the last sliding window. This algorithm is well adapted to the stream context, as only one pass over the data is performed. Moreover it uses a small memory, as it does not store all the items of the current sliding window. We show in this paper that the chain sampling algorithm suffers from some collision or redundancy problems. The collision occurs when the same item is selected as a sample more than once during the execution of the algorithm. We propose two approaches to overcome this weakness and improve the chain sampling algorithm. The first one is called “inverting the selection for a high sampling rate” and the second one is inspired from the “divide to conquer strategy”. Different experimentations are performed to show the efficiency of these two improvements, in particular their impact on the execution time of the algorithm.","PeriodicalId":6478,"journal":{"name":"2015 IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS)","volume":"91 1","pages":"487-494"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/INTELCIS.2015.7397265","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

On-line data stream analysis is an important challenge today because of the always-increasing rates of the streams issued from multiple heterogeneous sources, in many application domains. To reduce the amount of the data stream, several sampling methods were designed by the data stream research community. We focus in this paper, on the chain sampling algorithm proposed by Babcock et al. The aim of this algorithm is to select randomly and at any time, a given fixed proportion from the most recent items of the stream contained in the last sliding window. This algorithm is well adapted to the stream context, as only one pass over the data is performed. Moreover it uses a small memory, as it does not store all the items of the current sliding window. We show in this paper that the chain sampling algorithm suffers from some collision or redundancy problems. The collision occurs when the same item is selected as a sample more than once during the execution of the algorithm. We propose two approaches to overcome this weakness and improve the chain sampling algorithm. The first one is called “inverting the selection for a high sampling rate” and the second one is inspired from the “divide to conquer strategy”. Different experimentations are performed to show the efficiency of these two improvements, in particular their impact on the execution time of the algorithm.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
链式采样算法的性能研究
在线数据流分析是当今的一个重要挑战,因为在许多应用领域中,来自多个异构源的数据流的比率一直在增加。为了减少数据流的量,数据流研究界设计了几种采样方法。本文主要研究Babcock等人提出的链式采样算法。该算法的目的是在任何时候随机地从包含在最后一个滑动窗口中的流的最近项中选择给定的固定比例。该算法很好地适应了流上下文,因为只对数据执行一次传递。此外,它使用很小的内存,因为它不存储当前滑动窗口的所有项目。本文证明了链式采样算法存在一些碰撞或冗余问题。当在算法执行期间多次选择同一项作为样本时,就会发生冲突。我们提出了两种方法来克服这一缺点并改进链采样算法。第一种方法被称为“高采样率的反向选择”,第二种方法的灵感来自于“分而治之”策略。通过不同的实验来证明这两种改进的效率,特别是它们对算法执行时间的影响。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
On the use of probabilistic model-checking for the verification of prognostics applications Prospective, knowledge based clinical risk analysis: The OPT-model Partial deduction in predicate calculus as a tool for artificial intelligence problem complexity decreasing XML summarization: A survey Finding the pin in the haystack: A Bot Traceback service for public clouds
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1