
Latest publications: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

Exponential random graph estimation under differential privacy
Wentian Lu, G. Miklau
The effective analysis of social networks and graph-structured data is often limited by the privacy concerns of individuals whose data make up these networks. Differential privacy offers individuals a rigorous and appealing guarantee of privacy. But while differentially private algorithms for computing basic graph properties have been proposed, most graph modeling tasks common in the data mining community cannot yet be carried out privately. In this work we propose algorithms for privately estimating the parameters of exponential random graph models (ERGMs). We break the estimation problem into two steps: computing private sufficient statistics, then using them to estimate the model parameters. We consider specific alternating statistics that are in common use for ERGM models and describe a method for estimating them privately by adding noise proportional to a high-confidence bound on their local sensitivity. In addition, we propose an estimation algorithm that considers the noise distribution of the private statistics and offers better accuracy than performing standard parameter estimation using the private statistics.
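The noise-addition step can be illustrated with the standard Laplace mechanism. A minimal sketch, assuming the sensitivity bound has already been computed (the paper's high-confidence bound on local sensitivity is itself nontrivial to obtain privately, which this sketch does not address):

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_statistic(true_value, sensitivity_bound, epsilon, rng=None):
    """Release a statistic with Laplace noise whose scale is proportional
    to a sensitivity bound -- the first step of the two-step pipeline:
    compute noisy sufficient statistics, then estimate parameters."""
    rng = rng or random.Random(0)
    return true_value + laplace_noise(sensitivity_bound / epsilon, rng)

# e.g., a noisy edge count (global sensitivity of the edge count is 1)
noisy_edges = private_statistic(true_value=120.0, sensitivity_bound=1.0, epsilon=0.5)
```

The second step would then feed such noisy statistics into an estimator that accounts for the known Laplace noise distribution, as the abstract describes.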
DOI: 10.1145/2623330.2623683 (published 2014-08-24)
Citations: 70
Learning time-series shapelets
Josif Grabocka, Nicolas Schilling, Martin Wistuba, L. Schmidt-Thieme
Shapelets are discriminative sub-sequences of time series that best predict the target variable. For this reason, shapelet discovery has recently attracted considerable interest within the time-series research community. Currently, shapelets are found by evaluating the prediction qualities of numerous candidates extracted from the series segments. In contrast to the state-of-the-art, this paper proposes a novel perspective of learning shapelets. A new mathematical formalization of the task via a classification objective function is proposed and a tailored stochastic gradient learning algorithm is applied. The proposed method enables learning near-to-optimal shapelets directly without the need to try out lots of candidates. Furthermore, our method can learn true top-K shapelets by capturing their interaction. Extensive experimentation demonstrates statistically significant improvement in terms of wins and ranks against 13 baselines over 28 time-series datasets.
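The learning perspective can be sketched as follows: treat the minimum distance between a candidate shapelet and a series as a feature, score it with a logistic classifier, and update both by gradient descent. This toy version learns a single shapelet with a hard minimum (a subgradient), whereas the paper uses a soft minimum and learns the top-K shapelets jointly; all hyperparameters here are illustrative:

```python
import math
import random

def min_dist(series, shapelet):
    """Minimum mean squared distance between the shapelet and any window
    of the series, plus the offset of the best-matching window."""
    L = len(shapelet)
    best_d, best_j = float("inf"), 0
    for j in range(len(series) - L + 1):
        d = sum((series[j + l] - shapelet[l]) ** 2 for l in range(L)) / L
        if d < best_d:
            best_d, best_j = d, j
    return best_d, best_j

def train_shapelet(data, labels, L=3, lr=0.05, epochs=100):
    """Jointly learn one shapelet and a logistic classifier on its
    min-distance feature by (sub)gradient descent."""
    rng = random.Random(0)
    s = [rng.gauss(0.0, 0.5) for _ in range(L)]
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            d, j = min_dist(x, s)
            logit = max(-30.0, min(30.0, w * d + b))  # clip to avoid overflow
            p = 1.0 / (1.0 + math.exp(-logit))
            g = p - y                        # d(log-loss)/d(logit)
            for l in range(L):               # chain rule through the min window
                s[l] -= lr * g * w * 2.0 * (s[l] - x[j + l]) / L
            w -= lr * g * d
            b -= lr * g
    return s, w, b
```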
DOI: 10.1145/2623330.2623613 (published 2014-08-24)
Citations: 377
Active learning for sparse bayesian multilabel classification
Deepak Vasisht, Andreas C. Damianou, M. Varma, Ashish Kapoor
We study the problem of active learning for multilabel classification. We focus on the real-world scenario where the average number of positive (relevant) labels per data point is small leading to positive label sparsity. Carrying out mutual information based near-optimal active learning in this setting is a challenging task since the computational complexity involved is exponential in the total number of labels. We propose a novel inference algorithm for the sparse Bayesian multilabel model of [17]. The benefit of this alternate inference scheme is that it enables a natural approximation of the mutual information objective. We prove that the approximation leads to an identical solution to the exact optimization problem but at a fraction of the optimization cost. This allows us to carry out efficient, non-myopic, and near-optimal active learning for sparse multilabel classification. Extensive experiments reveal the effectiveness of the method.
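As a rough illustration of choosing informative points in the multilabel setting, the sketch below greedily picks the points with the highest summed per-label Bernoulli entropy. This myopic uncertainty score is only a stand-in for the paper's non-myopic, near-optimal mutual-information objective:

```python
import math

def bernoulli_entropy(p):
    """Entropy (in nats) of a Bernoulli label with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def select_batch(label_probs, k):
    """Pick the k unlabeled points whose predicted label vectors are most
    uncertain, scoring each point by the sum of per-label entropies."""
    score = lambda i: sum(bernoulli_entropy(p) for p in label_probs[i])
    return sorted(range(len(label_probs)), key=score, reverse=True)[:k]

# per-point predicted probabilities for two labels
probs = [[0.5, 0.5], [0.95, 0.01], [0.6, 0.9]]
```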
DOI: 10.1145/2623330.2623759 (published 2014-08-24)
Citations: 69
Distance metric learning using dropout: a structured regularization approach
Qi Qian, Juhua Hu, Rong Jin, J. Pei, Shenghuo Zhu
Distance metric learning (DML) aims to learn a distance metric better than Euclidean distance. It has been successfully applied to various tasks, e.g., classification, clustering and information retrieval. Many DML algorithms suffer from the over-fitting problem because of a large number of parameters to be determined in DML. In this paper, we exploit the dropout technique, which has been successfully applied in deep learning to alleviate the over-fitting problem, for DML. Different from the previous studies that only apply dropout to training data, we apply dropout to both the learned metrics and the training data. We illustrate that application of dropout to DML is essentially equivalent to matrix norm based regularization. Compared with the standard regularization scheme in DML, dropout is advantageous in simulating the structured regularizers which have shown consistently better performance than non structured regularizers. We verify, both empirically and theoretically, that dropout is effective in regulating the learned metric to avoid the over-fitting problem. Last, we examine the idea of wrapping the dropout technique in the state-of-the-art DML methods and observe that the dropout technique can significantly improve the performance of the original DML methods.
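The basic operation — inverted dropout applied both to data vectors and to the metric — can be sketched as below. The helper names are hypothetical, the metric here is diagonal for brevity, and the paper works with full Mahalanobis matrices:

```python
import random

def dropout(vec, p, rng):
    """Inverted dropout: zero each coordinate with probability p and
    rescale survivors by 1/(1-p), so the expectation is unchanged."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in vec]

def dropped_distance(x, y, metric_diag, p, rng):
    """Squared distance under a diagonal metric, with dropout applied to
    both the metric entries and the data points -- mirroring the idea of
    dropping out the learned metric as well as the training examples."""
    m = dropout(metric_diag, p, rng)
    xd, yd = dropout(x, p, rng), dropout(y, p, rng)
    return sum(mi * (a - b) ** 2 for mi, a, b in zip(m, xd, yd))
```

Because the dropout mask is mean-preserving, averaging over masks recovers the original metric in expectation, which is the hook to the matrix-norm regularization view in the abstract.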
DOI: 10.1145/2623330.2623678 (published 2014-08-24)
Citations: 28
Inferring gas consumption and pollution emission of vehicles throughout a city
Jingbo Shang, Yu Zheng, Wenzhu Tong, Eric Chang, Yong Yu
This paper instantly infers the gas consumption and pollution emission of vehicles traveling on a city's road network in a current time slot, using GPS trajectories from a sample of vehicles (e.g., taxicabs). The knowledge can be used to suggest cost-efficient driving routes as well as to identify road segments where gas has been wasted significantly. The instant estimation of the emissions from vehicles can enable pollution alerts and help diagnose the root cause of air pollution in the long run. In our method, we first compute the travel speed of each road segment using the GPS trajectories received recently. As many road segments are not traversed by trajectories (i.e., data sparsity), we propose a Travel Speed Estimation (TSE) model based on a context-aware matrix factorization approach. TSE leverages features learned from other data sources, e.g., map data and historical trajectories, to deal with the data sparsity problem. We then propose a Traffic Volume Inference (TVI) model to infer the number of vehicles passing each road segment per minute. TVI is an unsupervised Bayesian Network that incorporates multiple factors, such as travel speed, weather conditions and geographical features of a road. Given the travel speed and traffic volume of a road segment, gas consumption and emissions can be calculated based on existing environmental theories. We evaluate our method based on extensive experiments using GPS trajectories generated by over 32,000 taxis in Beijing over a period of two months. The results demonstrate the advantages of our method over baselines, validating the contribution of its components and yielding interesting discoveries for the benefit of society.
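The matrix-completion step behind TSE can be illustrated with a bare-bones SGD factorization over the observed (road, time-slot, speed) entries; the actual model additionally factorizes context features from map data and historical trajectories, which this sketch omits:

```python
import random

def mf_complete(observed, n_roads, n_slots, k=2, lr=0.05, reg=0.01, epochs=1000):
    """Plain SGD matrix factorization on observed (road, slot, speed)
    triples; speeds for unobserved cells are predicted from the learned
    latent factors. A stripped-down stand-in for context-aware
    factorization, without the side-information matrices."""
    rng = random.Random(0)
    P = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n_roads)]
    Q = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n_slots)]
    for _ in range(epochs):
        for r, c, v in observed:
            e = v - sum(P[r][f] * Q[c][f] for f in range(k))
            for f in range(k):
                pr, qc = P[r][f], Q[c][f]
                P[r][f] += lr * (e * qc - reg * pr)  # gradient step on road factor
                Q[c][f] += lr * (e * pr - reg * qc)  # gradient step on slot factor
    return lambda r, c: sum(P[r][f] * Q[c][f] for f in range(k))
```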
DOI: 10.1145/2623330.2623653 (published 2014-08-24)
Citations: 262
Who are experts specializing in landscape photography?: analyzing topic-specific authority on content sharing services
Bin Bi, B. Kao, Chang Wan, Junghoo Cho
With the rapid growth of Web 2.0, a variety of content sharing services, such as Flickr, YouTube, Blogger, and TripAdvisor, have become extremely popular over the last decade. On these websites, users have created and shared with each other various kinds of resources, such as photos, video, and travel blogs. The sheer amount of user-generated content varies greatly in quality, which calls for a principled method to identify a set of authorities, who created high-quality resources, from a massive number of contributors of content. Since most previous studies only infer global authoritativeness of a user, there is no way to differentiate the authoritativeness in different aspects of life (topics). In this paper, we propose a novel model of Topic-specific Authority Analysis (TAA), which addresses the limitations of the previous approaches, to identify authorities specific to given query topic(s) on a content sharing service. This model jointly leverages the usage data collected from the sharing log and the favorite log. The parameters in TAA are learned from a constructed training dataset, for which a novel logistic likelihood function is specifically designed. To perform Bayesian inference for TAA with the new logistic likelihood, we extend typical Gibbs sampling by introducing auxiliary variables. Thorough experiments with two real-world datasets demonstrate the effectiveness of TAA in topic-specific authority identification as well as the generalizability of the TAA generative model.
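The Gibbs-sampling backbone can be illustrated on a model with tractable conditionals. The sketch below alternates exact conditional draws for a standard bivariate normal; it shows only the typical alternating scheme that the paper extends with auxiliary variables to handle its logistic likelihood:

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Textbook Gibbs sampler for a standard bivariate normal with
    correlation rho: alternately draw x | y and y | x from their exact
    conditional distributions."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    x, y, out = 0.0, 0.0, []
    for t in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)   # x | y ~ N(rho*y, 1 - rho^2)
        y = rng.gauss(rho * x, sd)   # y | x ~ N(rho*x, 1 - rho^2)
        if t >= burn_in:
            out.append((x, y))
    return out
```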
DOI: 10.1145/2623330.2623752 (published 2014-08-24)
Citations: 5
Prototype-based learning on concept-drifting data streams
Junming Shao, Zahra Ahmadi, Stefan Kramer
Data stream mining has gained growing attention due to its wide range of emerging applications such as target marketing, email filtering and network intrusion detection. In this paper, we propose a prototype-based classification model for evolving data streams, called SyncStream, which dynamically models time-changing concepts and makes predictions in a local fashion. Instead of learning a single model on a sliding window or ensemble learning, SyncStream captures evolving concepts by dynamically maintaining a set of prototypes in a new data structure called the P-tree. The prototypes are obtained by error-driven representativeness learning and synchronization-inspired constrained clustering. To identify abrupt concept drift in data streams, PCA and statistics based heuristic approaches are employed. SyncStream has several attractive benefits: (a) It is capable of dynamically modeling evolving concepts from even a small set of prototypes and is robust against noisy examples. (b) Owing to synchronization-based constrained clustering and the P-Tree, it supports an efficient and effective data representation and maintenance. (c) Gradual and abrupt concept drift can be effectively detected.
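A deliberately simplified sketch of error-driven prototype maintenance: stream points that the current prototype set misclassifies are promoted to prototypes, and the oldest are evicted at a fixed budget. SyncStream's P-tree, synchronization-based clustering, and drift detection are all omitted here:

```python
class PrototypeStream:
    """Nearest-prototype stream classifier with error-driven updates:
    only mistakes are stored, keeping the prototype set small."""

    def __init__(self, budget=50):
        self.protos = []          # list of (vector, label), oldest first
        self.budget = budget

    def predict(self, x):
        if not self.protos:
            return None
        _, label = min(self.protos,
                       key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
        return label

    def learn_one(self, x, y):
        if self.predict(x) != y:  # error-driven: store only misclassified points
            self.protos.append((x, y))
            if len(self.protos) > self.budget:
                self.protos.pop(0)  # evict the oldest prototype
```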
Empirical results show that our method achieves good predictive performance compared to state-of-the-art algorithms and that it requires much less time than another instance-based stream mining algorithm.
DOI: 10.1145/2623330.2623609 (published 2014-08-24)
Citations: 65
Optimal recommendations under attraction, aversion, and social influence
Wei Lu, Stratis Ioannidis, Smriti Bhagat, L. Lakshmanan
People's interests are dynamically evolving, often affected by external factors such as trends promoted by the media or adopted by their friends. In this work, we model interest evolution through dynamic interest cascades: we consider a scenario where a user's interests may be affected by (a) the interests of other users in her social circle, as well as (b) suggestions she receives from a recommender system. In the latter case, we model user reactions through either attraction or aversion towards past suggestions. We study this interest evolution process, and the utility accrued by recommendations, as a function of the system's recommendation strategy. We show that, in steady state, the optimal strategy can be computed as the solution of a semi-definite program (SDP). Using datasets of user ratings, we provide evidence for the existence of aversion and attraction in real-life data, and show that our optimal strategy can lead to significantly improved recommendations over systems that ignore aversion and attraction.
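The interest dynamics can be sketched as a linear update mixing inertia, social influence, and the user's reaction to the recommended item; the weights below are illustrative, not taken from the paper:

```python
def interest_step(u, neighbor_interests, rec, alpha=0.6, beta=0.2, gamma=0.2):
    """One round of interest evolution: alpha weights the user's current
    interest vector (inertia), beta the average interest of her social
    circle, and gamma her reaction to the recommendation vector
    (gamma > 0 models attraction, gamma < 0 aversion)."""
    n = len(u)
    avg = [sum(v[i] for v in neighbor_interests) / len(neighbor_interests)
           for i in range(n)]
    return [alpha * u[i] + beta * avg[i] + gamma * rec[i] for i in range(n)]
```

Optimizing the recommendation strategy over the steady state of such dynamics is what the paper casts as a semi-definite program.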
DOI: 10.1145/2623330.2623744 (published 2014-08-24)
Citations: 24
Integrating spreadsheet data via accurate and low-effort extraction
Zhe Chen, Michael J. Cafarella
Spreadsheets contain valuable data on many topics. However, spreadsheets are difficult to integrate with other data sources. Converting spreadsheet data to the relational model would allow data analysts to use relational integration tools. We propose a two-phase semiautomatic system that extracts accurate relational metadata while minimizing user effort. Based on an undirected graphical model, our system enables downstream spreadsheet integration applications. First, the automatic extractor uses hints from spreadsheets' graphical style and recovered metadata to extract the spreadsheet data as accurately as possible. Second, the interactive repair identifies similar regions in distinct spreadsheets scattered across large spreadsheet corpora, allowing a user's single manual repair to be amortized over many possible extraction errors. Our experiments show that a human can obtain accurate extractions with just 31% of the manual operations required by a standard classification-based technique on two real-world datasets.
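A single hand-written hint of the kind the extractor combines can be sketched as a row classifier that labels rows by their fraction of numeric cells; the actual system learns over many style and layout hints jointly in an undirected graphical model:

```python
def is_numeric(cell):
    """True if the cell string parses as a number."""
    try:
        float(cell)
        return True
    except ValueError:
        return False

def classify_rows(rows):
    """Label each spreadsheet row 'header' or 'data' by the fraction of
    numeric cells among its non-empty cells -- one crude hint, standing in
    for the system's learned combination of style and metadata hints."""
    labels = []
    for row in rows:
        filled = [c for c in row if c.strip()]
        numeric = sum(is_numeric(c) for c in filled)
        labels.append("data" if filled and numeric / len(filled) > 0.5
                      else "header")
    return labels
```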
Published in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014-08-24. DOI: 10.1145/2623330.2623617
Citations: 68
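To make the relational-conversion step concrete, here is a minimal sketch that flattens a spreadsheet grid with a spanned two-row header into relational tuples. The grid layout and the `grid_to_tuples` helper are hypothetical illustrations; the paper's system infers such structure automatically from graphical style hints and recovered metadata rather than assuming it.

```python
# Minimal sketch of converting a spreadsheet grid with a two-row
# hierarchical header into flat relational tuples. The layout below is
# an illustrative assumption: grid[0] holds a spanned top header
# (empty cells continue the span to the left), grid[1] holds column
# names, and the remaining rows are data.

def grid_to_tuples(grid):
    """Return (top_header, column, row_label, value) tuples."""
    top, cols = grid[0], grid[1]
    # Fill forward the spanned top-header cells.
    filled, last = [], ""
    for cell in top:
        last = cell if cell else last
        filled.append(last)
    tuples = []
    for row in grid[2:]:
        label = row[0]
        for j in range(1, len(row)):
            tuples.append((filled[j], cols[j], label, row[j]))
    return tuples

grid = [
    ["",      "2013", "",   "2014", ""  ],
    ["Item",  "Q1",   "Q2", "Q1",   "Q2"],
    ["Sales", 10,      12,   11,     15 ],
]
rows = grid_to_tuples(grid)
# rows[0] == ("2013", "Q1", "Sales", 10)
```

Once flattened like this, the data can be loaded into any relational integration tool, which is the goal the abstract motivates.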
Unveiling clusters of events for alert and incident management in large-scale enterprise IT
Derek Lin, R. Raghu, V. Ramamurthy, Jin Yu, R. Radhakrishnan, Joseph Fernandez
Large enterprise IT (Information Technology) infrastructure components generate large volumes of alerts and incident tickets. These are manually screened, but it is otherwise difficult to extract information from them automatically to gain insights and improve operational efficiency. We propose a framework to cluster alerts and incident tickets based on the text in them, using unsupervised machine learning. This would be a step towards eliminating manual classification of the alerts and incidents, which is labor-intensive and costly. Our framework can handle the semi-structured text in alerts generated by IT infrastructure components such as storage devices, network devices, and servers, as well as the unstructured text in incident tickets created manually by operations support personnel. After text pre-processing and application of appropriate distance metrics, we apply different graph-theoretic approaches to cluster the alerts and incident tickets, based on their semi-structured and unstructured text respectively. For automated interpretation and readability of semi-structured text clusters, we propose a visualization method that preserves the structure and human-readability of the text data, in contrast to traditional word clouds, where the text structure is not preserved; for unstructured text clusters, we find a simple way to define cluster prototypes for easy interpretation. This framework for clustering and visualization will enable enterprises to prioritize the issues in their IT infrastructure and improve the reliability and availability of their services.
Published in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014-08-24. DOI: 10.1145/2623330.2623360
Citations: 22
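As a rough illustration of the graph-theoretic clustering this abstract describes, the sketch below links alerts whose token sets exceed a Jaccard-similarity threshold and reports connected components as clusters. The Jaccard metric and the 0.5 threshold are assumptions made for illustration; the paper evaluates its own choice of distance metrics and graph algorithms.

```python
# Illustrative graph-theoretic clustering of alert messages: build a
# similarity graph whose edges connect alerts with high token-overlap
# (Jaccard) similarity, then take connected components as clusters.

def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def cluster_alerts(alerts, threshold=0.5):
    n = len(alerts)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(alerts[i], alerts[j]) >= threshold:
                adj[i].append(j)
                adj[j].append(i)
    # Connected components via depth-first search.
    seen, clusters = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, comp = [i], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v])
        clusters.append(sorted(comp))
    return clusters

alerts = [
    "disk usage above 90 percent on host-a",
    "disk usage above 90 percent on host-b",
    "network link down on switch-7",
]
clusters = cluster_alerts(alerts)
```

The two near-duplicate disk alerts land in one cluster while the unrelated network alert forms its own, which is the kind of grouping an operations team could then prioritize.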
Journal: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining