Space-efficient tracking of persistent items in a massive data stream

IF 3.6 · Zone 4 (Mathematics) · Q3 · COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Statistical Analysis and Data Mining · Pub Date: 2011-07-11 · DOI: 10.1145/2002259.2002294

Bibudh Lahiri, S. Tirthapura, J. Chandrashekar

Journal: Statistical Analysis and Data Mining · Volume: 45 1 1 · Pages: 70-92 · Publication type: Journal Article · Open access: no · Platform: Semanticscholar

Citations: 19

Abstract

Motivated by scenarios in network anomaly detection, we consider the problem of detecting persistent items in a data stream, that is, items that occur "regularly" in the stream. In contrast to heavy hitters, persistent items do not necessarily contribute significantly to the volume of a stream, and may escape detection by traditional volume-based anomaly detectors. We first show that any online algorithm that tracks persistent items exactly must necessarily use a large workspace, and is therefore infeasible to run on a traffic monitoring node. In light of this lower bound, we introduce an approximate formulation of the problem and present a small-space algorithm to approximately track persistent items over a large data stream. Our experiments on a real traffic dataset show that, in typical cases, the algorithm achieves a physical space compression of 5x-7x while incurring very few false positives (< 1%) and false negatives (< 4%). To our knowledge, this is the first systematic study of the problem of detecting persistent items in a data stream, and our work can help detect anomalies that are temporal rather than volume-based.
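The distinction between persistent items and heavy hitters can be illustrated with a small sketch. It is not the small-space algorithm from the paper; it is a naive exact baseline, and it rests on assumptions the abstract does not spell out: the stream is taken to be divided into numbered time slots, an item's persistence is taken to be the fraction of slots in which it appears at least once, and the threshold `alpha` and the names `exact_persistence` and `persistent_items` are hypothetical. Because it stores a set of slots for every distinct item, it also hints at why exact tracking needs a large workspace.

```python
from collections import Counter, defaultdict


def exact_persistence(stream):
    """Naive exact baseline (assumed definition, not the paper's algorithm).

    stream: iterable of (slot, item) pairs.
    Returns (persistence, volume), where persistence[item] is the fraction
    of observed slots containing the item and volume[item] is its raw count.
    """
    slots_seen = defaultdict(set)  # item -> set of slots it appeared in
    volume = Counter()             # item -> total number of occurrences
    all_slots = set()
    for slot, item in stream:
        slots_seen[item].add(slot)
        volume[item] += 1
        all_slots.add(slot)
    n = len(all_slots)
    persistence = {item: len(s) / n for item, s in slots_seen.items()}
    return persistence, volume


def persistent_items(stream, alpha):
    """Items whose persistence is at least alpha (hypothetical threshold)."""
    persistence, _ = exact_persistence(stream)
    return {item for item, p in persistence.items() if p >= alpha}


# Toy stream over 10 slots: "burst" is a heavy hitter confined to slot 0,
# "beacon" appears once in every slot, and each "noise-k" appears only once.
stream = [(0, "burst")] * 50
for slot in range(10):
    stream.append((slot, "beacon"))
    stream.append((slot, f"noise-{slot}"))

persistence, volume = exact_persistence(stream)
print(volume.most_common(1))          # highest-volume item
print(persistent_items(stream, 0.8))  # items present in >= 80% of slots
```

In this toy stream the highest-volume item and the persistent item differ, which is the case a purely volume-based detector would miss. The memory used grows with the number of distinct items times the number of slots each appears in, which is what an approximate formulation would aim to avoid.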

Source journal

Statistical Analysis and Data Mining (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS)

CiteScore

3.20

Self-citation rate

7.70%

Journal description: Statistical Analysis and Data Mining addresses the broad area of data analysis, including statistical approaches, machine learning, data mining, and applications. Topics include statistical and computational approaches for analyzing massive and complex datasets, novel statistical and/or machine learning methods and theory, and state-of-the-art applications with high impact. Of special interest are articles that describe innovative analytical techniques and discuss their application to real problems in a way that is accessible and beneficial to domain experts across science, engineering, and commerce. The journal focuses on papers that satisfy one or more of the following criteria: (1) solve data analysis problems associated with massive, complex datasets; (2) develop innovative statistical approaches, machine learning algorithms, or methods integrating ideas across disciplines, e.g., statistics, computer science, electrical engineering, and operations research; (3) formulate and solve high-impact real-world problems that challenge existing paradigms via new statistical and/or computational models; (4) provide surveys of prominent research topics.