StreamKDD '10: Latest Papers

Fully decentralized computation of aggregates over data streams

StreamKDD '10

Pub Date : 2011-03-31 DOI: 10.1145/1833280.1833281

L. Becchetti, Ilaria Bordino, S. Leonardi, A. Rosén

In several emerging applications, data is collected in massive streams at several distributed points of observation. A basic and challenging task is to allow every node to monitor a neighbourhood of interest by issuing continuous aggregate queries on the streams observed in its vicinity. Algorithms for this task are fully decentralized and diffusive in nature: collecting all data at a few central nodes is infeasible in networks of low-capability devices or in the presence of massive data sets. The main difficulty in designing diffusive algorithms is coping with duplicate detections. These arise from the observation of the same event at several nodes of the network, from the receipt of the same aggregated information along multiple diffusion paths, or both. In this paper, we consider fully decentralized algorithms that locally answer continuous aggregate queries on the number of distinct events, the total number of events, and the second frequency moment in the scenario outlined above. The proposed algorithms use sublinear space at every node, either in the worst case or on realistic distributions. We also propose strategies that minimize the communication needed to update the aggregates when new events are observed. Finally, we present an experimental analysis providing evidence for the efficiency and accuracy of our algorithms on realistic simulated scenarios.
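
To illustrate what coping with duplicate detections can involve for the distinct-event query, here is a minimal sketch of a k-minimum-values (KMV) summary, a standard duplicate-insensitive technique. It is not taken from the paper, whose abstract does not name its data structures. The class `KMVSketch`, the helper `hash01`, and the parameter `k` are illustrative assumptions.

```python
import hashlib


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) (illustrative helper)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Keeps the k smallest hash values seen; hypothetical, not the paper's structure."""

    def __init__(self, k=64):
        self.k = k
        self.values = set()

    def _trim(self):
        # Retain only the k smallest hash values.
        if len(self.values) > self.k:
            self.values = set(sorted(self.values)[: self.k])

    def add(self, item):
        # Re-adding the same event is a no-op because the hash is already stored.
        self.values.add(hash01(item))
        self._trim()

    def merge(self, other):
        # Set union is idempotent, so a summary received twice changes nothing.
        self.values |= other.values
        self._trim()

    def estimate(self):
        # Exact while fewer than k distinct hashes are stored.
        if len(self.values) < self.k:
            return float(len(self.values))
        # Otherwise use the standard KMV estimator (k - 1) / (k-th smallest hash).
        return (self.k - 1) / max(self.values)


node_a, node_b = KMVSketch(), KMVSketch()
for event in ["e1", "e2", "e3"]:
    node_a.add(event)
for event in ["e2", "e3", "e4"]:
    node_b.add(event)

node_a.merge(node_b)
node_a.merge(node_b)  # same summary arriving along a second path
print(node_a.estimate())  # 4.0
```

Because merging is a set union followed by truncation, observing the same event at several nodes or receiving the same summary along several diffusion paths leaves the estimate unchanged. Each node stores at most `k` values regardless of stream length.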

Citations: 2

Evolutionary clustering using frequent itemsets

StreamKDD '10

Pub Date : 2010-07-25 DOI: 10.1145/1833280.1833284

R. Shankar, G. V. Kiran, Vikram Pudi

Evolutionary clustering is an emerging research area that addresses the problem of clustering dynamic data. An evolutionary clustering must balance two conflicting criteria: preserving the quality of the current clustering and not deviating too much from the recent history. In this paper we propose an algorithm for evolutionary clustering using frequent itemsets. A frequent-itemset-based approach to evolutionary clustering is natural, and it automatically satisfies both criteria. We provide theoretical as well as experimental support for our claims. We evaluated our approach on different datasets, and the results show that it is comparable to most existing algorithms for evolutionary clustering.

Citations: 15

CALDS: context-aware learning from data streams

StreamKDD '10

Pub Date : 2010-07-25 DOI: 10.1145/1833280.1833283

J. Gomes, Ernestina Menasalvas Ruiz, Pedro A. C. Sousa

Drift detection methods in data streams can detect changes in incoming data so that learned models can be used to represent the underlying population. In many real-world scenarios, context information is available and could be exploited to improve existing approaches by detecting or even anticipating recurring concepts in the underlying population. Several applications, among them health-care and recommender systems, lend themselves to using such information, since sensor data is available but currently unused. Nevertheless, new challenges arise when integrating context with drift detection methods. Modeling and comparing context information, representing the history of contexts and concepts, and storing previously learned concepts for reuse are some of the critical problems. In this work, we propose the Context-aware Learning from Data Streams (CALDS) system to improve existing drift detection methods by exploiting available context information. Our enhancement is seamless: we use the association between context information and learned concepts to improve detection of, and adaptation to, drift when concepts reappear. We present and discuss our preliminary experimental results with synthetic and real datasets.

Citations: 17

Towards subspace clustering on dynamic data: an incremental version of PreDeCon

StreamKDD '10

Pub Date : 2010-07-25 DOI: 10.1145/1833280.1833285

H. Kriegel, Peer Kröger, Eirini Ntoutsi, A. Zimek

Today's data are high dimensional and dynamic, so clustering such data is rather complicated. To deal with the high-dimensionality problem, the research area of subspace clustering has recently emerged; it aims at finding clusters in subspaces of the original feature space. So far, subspace clustering methods have been mainly static and thus cannot address the dynamic nature of modern data. In this paper, we propose an incremental version of the density-based projected clustering algorithm PreDeCon, called incPreDeCon. The proposed algorithm efficiently updates only those subspace clusters that might be affected by the population update.

Citations: 11

Visual analysis of news streams with article threads

StreamKDD '10

Pub Date : 2010-07-25 DOI: 10.1145/1833280.1833286

Milos Krstajic, E. Bertini, Florian Mansmann, D. Keim

The analysis of large quantities of news is an emerging area in the field of data analysis and visualization. International agencies collect thousands of news items every day from a large number of sources, and making sense of them is becoming increasingly difficult because of the rate of incoming news and the inherent complexity of analyzing large, evolving text corpora. Current visual techniques that deal with the temporal evolution of such complex datasets, together with research efforts in related domains such as text mining and topic detection and tracking, represent early attempts to understand and gain insight into these data. Despite these initial proposals, there is still a lack of techniques that deal directly with the problem of visualizing news streams in an "on-line" fashion, that is, in a way that lets the operator monitor the evolution of news in real time. In this paper we propose a purely visual technique that makes it possible to see the evolution of news in real time. The technique shows the stream of news items as they enter the system, as well as a series of important threads that are computed on the fly. By merging single articles into threads, the technique reduces the load on the visualization and retains only the most relevant information. The proposed technique is applied to the visualization of news streams generated by a news aggregation system that monitors over 4000 sites from 1600 key news portals world-wide and retrieves over 80000 reports per day in 43 languages.

Citations: 23

Detecting outliers on arbitrary data streams using anytime approaches

StreamKDD '10

Pub Date : 2010-07-25 DOI: 10.1145/1833280.1833282

I. Assent, P. Kranen, C. Baldauf, T. Seidl

Data streams are gaining importance in many sensing and monitoring environments. Common mining tasks on data streams include classification, modeling and outlier detection. Because data arrival rates often vary, anytime algorithms have been proposed for stream clustering and classification; these deliver a fast first result and improve it if more time is available. In this work, we propose the novel concept of anytime outlier detection and introduce an algorithm for it based on a hierarchical cluster representation. We show promising results in preliminary experiments and discuss future research on anytime outlier detection.

Citations: 9

Research issues in mining multiple data streams

StreamKDD '10

Pub Date : 2010-07-25 DOI: 10.1145/1833280.1833288

Wenyan Wu, L. Gruenwald

Several emerging data stream applications have mining requirements. Although mining of single data streams has been extensively studied, little research has been done on mining multiple data streams (MDS), which are more complex than single data streams and arise in many real-world applications. This paper discusses the characteristics of MDS, proposes a formal definition for them, analyzes MDS applications in terms of their mining requirements, and identifies research issues for MDS mining.

Citations: 27

Conformal prediction for distribution-independent anomaly detection in streaming vessel data

StreamKDD '10

Pub Date : 2010-07-25 DOI: 10.1145/1833280.1833287

Rikard Laxhammar, G. Falkman

This paper presents a novel application of the theory of conformal prediction to distribution-independent on-line learning and anomaly detection. We exploit the fact that conformal predictors give valid prediction sets at specified confidence levels under the relatively weak assumption that the (normal) training data and the (normal) observations to be predicted have been generated from the same distribution. If the actual observation is not included in the possibly empty prediction set, it is classified as anomalous at the corresponding significance level. Interpreting the significance level as an upper bound on the probability that a normal observation is mistakenly classified as anomalous, we can conveniently adjust the sensitivity to anomalies while controlling the rate of false alarms, without having to find any application-specific thresholds. The proposed method has been evaluated in the domain of sea surveillance using recorded data assumed to be normal. The validity of the prediction sets is supported by the empirical error rate, which is just below the significance level. In addition, experiments with simulated anomalous data indicate that the anomaly detection sensitivity is superior to that of two previously proposed methods.
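
The decision rule described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the nearest-neighbour nonconformity measure, the one-dimensional data, and the function names `nonconformity`, `conformal_p_value` and `is_anomalous` are all hypothetical choices made for the example.

```python
def nonconformity(x, others):
    """Distance to the nearest other observation (an assumed nonconformity measure)."""
    return min(abs(x - o) for o in others)


def conformal_p_value(train, x):
    """Fraction of observations in the bag that are at least as nonconforming as x."""
    bag = list(train) + [x]
    # Score every observation, including x, against the rest of the bag.
    scores = [
        nonconformity(z, bag[:i] + bag[i + 1:])
        for i, z in enumerate(bag)
    ]
    test_score = scores[-1]
    return sum(s >= test_score for s in scores) / len(bag)


def is_anomalous(train, x, epsilon):
    """Flag x as anomalous at significance level epsilon."""
    return conformal_p_value(train, x) < epsilon


normal = [1.0, 1.1, 0.9, 1.05, 0.95, 1.2, 0.8, 1.15, 0.85, 1.0]
print(is_anomalous(normal, 5.0, epsilon=0.1))   # True
print(is_anomalous(normal, 1.02, epsilon=0.1))  # False
```

Here `epsilon` plays the role of the significance level: when the new observation comes from the same distribution as the training data, the probability of flagging it is at most `epsilon`, so the false-alarm rate is set directly rather than through an application-specific distance threshold.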

Citations: 44
