A performance study of the chain sampling algorithm

2015 IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS) Pub Date : 2015-12-01 DOI:10.1109/INTELCIS.2015.7397265

Rayane El Sibai, Yousra Chabchoub, J. Demerjian, Zakia Kazi-Aoul, Kabalan Barbar

Pages: 487-494

Citations: 7

Abstract

On-line data stream analysis is an important challenge today because streams issued from multiple heterogeneous sources arrive at ever-increasing rates in many application domains. To reduce the volume of stream data, the data stream research community has designed several sampling methods. In this paper, we focus on the chain sampling algorithm proposed by Babcock et al. The aim of this algorithm is to select, randomly and at any time, a given fixed proportion of the most recent items of the stream, namely those contained in the last sliding window. The algorithm is well adapted to the stream context, as it performs only one pass over the data. Moreover, it uses little memory, as it does not store all the items of the current sliding window. We show in this paper that the chain sampling algorithm suffers from collision, or redundancy, problems. A collision occurs when the same item is selected as a sample more than once during the execution of the algorithm. We propose two approaches to overcome this weakness and improve the chain sampling algorithm. The first is called “inverting the selection for a high sampling rate” and the second is inspired by the “divide and conquer” strategy. Several experiments are performed to show the efficiency of these two improvements, in particular their impact on the execution time of the algorithm.
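
To make the collision problem concrete, here is a minimal Python sketch of chain sampling over a count-based sliding window, in the form the algorithm is commonly described: each chain keeps one sample, and a sample of size k is obtained by running k independent chains. This is an illustration under stated assumptions, not the authors' implementation. The names `ChainSampler`, `sample_window`, and `count_collisions` are hypothetical, and the two improvements proposed in the paper are not reproduced because the abstract does not specify them.

```python
import random
from collections import deque


class ChainSampler:
    """One chain: keeps a single random sample from the last n items (illustrative)."""

    def __init__(self, window_size, rng):
        self.n = window_size
        self.rng = rng
        self.chain = deque()    # (index, item) pairs; the head is the current sample
        self.next_index = None  # index of the item that will extend the chain
        self.i = 0              # number of items seen so far (1-based index of last item)

    def offer(self, item):
        self.i += 1
        i, n = self.i, self.n
        # The window holds indices i-n+1 .. i; drop the head once it has expired.
        if self.chain and self.chain[0][0] <= i - n:
            self.chain.popleft()
        if self.rng.random() < 1.0 / min(i, n):
            # The new item becomes the sample; the old chain is discarded.
            self.chain.clear()
            self.chain.append((i, item))
            self.next_index = self.rng.randint(i + 1, i + n)
        elif i == self.next_index:
            # The pre-selected successor has arrived: store it and pick its successor.
            self.chain.append((i, item))
            self.next_index = self.rng.randint(i + 1, i + n)

    def sample(self):
        return self.chain[0] if self.chain else None


def sample_window(stream, window_size, k, seed=0):
    """Run k independent chains over the stream and return their current samples."""
    rng = random.Random(seed)
    chains = [ChainSampler(window_size, rng) for _ in range(k)]
    for item in stream:
        for chain in chains:
            chain.offer(item)
    return [chain.sample() for chain in chains]


def count_collisions(samples):
    """Number of redundant samples: chains whose sample index is already taken."""
    indices = [idx for idx, _ in samples]
    return len(indices) - len(set(indices))
```

For example, `sample_window(range(1000), window_size=100, k=20)` returns 20 `(index, item)` pairs drawn from the last 100 items, and `count_collisions` on that result reports how many of them duplicate another chain's choice. Because the chains are independent, such duplicates become more likely as k approaches the window size, which is the redundancy the abstract refers to.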
