Social scientists increasingly criticize the use of machine learning techniques to understand human behavior. Criticisms include: (1) They are atheoretical and hence of limited scientific value; (2) They do not address causality and are hence of limited policy value; and (3) They are uninterpretable and hence of limited generalizability (outside contexts very narrowly similar to the training dataset). These criticisms, I argue, miss the enormous opportunity offered by ML techniques to fundamentally improve the practice of empirical social science. Yet each criticism does contain a grain of truth, and overcoming them will require innovations to existing methodologies. Some of these innovations are being developed today, and some are yet to be tackled. In this talk I will sketch (1) what these innovations look like or should look like; (2) why they are needed; and (3) the technical challenges they raise. I will illustrate my points using a set of applications that range from financial markets to social policy problems to computational models of basic psychological processes. This talk describes joint work with Jon Kleinberg and individual projects with Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, Anuj Shah, Chenhao Tan, Mike Yeomans and Tom Zimmerman.
"Bugbears or legitimate threats? (Social) scientists' criticisms of machine learning," S. Mullainathan. KDD 2014. DOI: 10.1145/2623330.2630818
Mohamed F. Ghalwash, Vladan Radosavljevic, Z. Obradovic
Early classification of time series is prevalent in many time-sensitive applications, such as early warning of disease outcomes and early warning of crises in the stock market. For example, early diagnosis allows physicians to design appropriate therapeutic strategies at early stages of a disease. However, practical adoption of early classification of time series requires an easy-to-understand explanation (interpretability) and a measure of confidence in the prediction results (uncertainty estimates). Previous studies of early time series classification did not address these two aspects jointly, forcing a difficult choice between them. In this study, we propose a simple yet effective method that provides uncertainty estimates for an interpretable early classification method. The question we address here is "how to provide estimates of uncertainty in regard to interpretable early prediction." In an extensive evaluation on twenty time series datasets, we show that the proposed method has several advantages over the state-of-the-art method that provides reliability estimates in early classification: it is more effective, simple to implement, and provides interpretable results.
"Utilizing temporal patterns for estimating uncertainty in interpretable early decision making," Mohamed F. Ghalwash, Vladan Radosavljevic, Z. Obradovic. KDD 2014. DOI: 10.1145/2623330.2623694
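As a rough illustration of the interface such a method exposes (not the authors' temporal-pattern approach), the sketch below trains one probabilistic classifier per time-series prefix length and emits a label as soon as the predicted class probability clears a confidence threshold; the classifier choice, threshold, and prefix scheme are all assumptions.

```python
# Hypothetical sketch, not the paper's method: one classifier per prefix length,
# with the top class probability serving as the uncertainty estimate that gates
# early predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_prefix_classifiers(X, y, prefix_lengths):
    """X: (n_series, series_length) array; one probabilistic classifier per prefix."""
    return {L: LogisticRegression(max_iter=1000).fit(X[:, :L], y)
            for L in prefix_lengths}

def early_classify(x, classifiers, threshold=0.9):
    """Emit a label as soon as the top class probability clears `threshold`."""
    for L in sorted(classifiers):
        proba = classifiers[L].predict_proba(x[:L].reshape(1, -1))[0]
        if proba.max() >= threshold:
            return L, classifiers[L].classes_[proba.argmax()], float(proba.max())
    # fall back to the longest available prefix if the threshold is never met
    return L, classifiers[L].classes_[proba.argmax()], float(proba.max())
```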
Kernel-based regression represents an important family of learning techniques for solving challenging regression tasks with non-linear patterns. Despite being studied extensively, most existing work suffers from two major drawbacks: (i) it is often designed for regression in a batch learning setting, making it not only computationally inefficient but also poorly scalable in real-world applications where data arrives sequentially; and (ii) it usually assumes a fixed kernel function is given prior to the learning task, which can result in poor performance if the chosen kernel is inappropriate. To overcome these drawbacks, this paper presents a novel scheme of Online Multiple Kernel Regression (OMKR), which learns the kernel-based regressor sequentially in an online and scalable fashion and dynamically explores a pool of multiple diverse kernels, avoiding reliance on a single fixed (and possibly poor) kernel and thus remedying the drawback of manual or heuristic kernel selection. The OMKR problem is more challenging than regular kernel-based regression since we must determine, on the fly, both the optimal kernel-based regressor for each individual kernel and the best combination of the multiple kernel regressors. In this paper, we propose a family of OMKR algorithms for regression and discuss their application to time series prediction tasks. We also analyze the theoretical bounds of the proposed OMKR methods and conduct extensive experiments to evaluate their empirical performance on both real-world regression and time series prediction tasks.
"Online multiple kernel regression," Doyen Sahoo, S. Hoi, Bin Li. KDD 2014. DOI: 10.1145/2623330.2623712
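A minimal sketch of the OMKR idea, under simplifying assumptions: each kernel keeps its own online regressor updated by functional gradient descent on the squared loss, and a Hedge-style multiplicative weighting combines the per-kernel predictions. The kernels, learning rate eta, and discount factor beta below are illustrative choices, not the paper's exact algorithm or the setting of its bounds.

```python
import numpy as np

class OnlineKernelRegressor:
    """Online kernel regression via functional gradient descent on squared loss."""
    def __init__(self, kernel, eta=0.1):
        self.kernel, self.eta = kernel, eta
        self.support, self.alphas = [], []

    def predict(self, x):
        return sum(a * self.kernel(s, x) for s, a in zip(self.support, self.alphas))

    def update(self, x, y):
        err = self.predict(x) - y              # gradient of squared loss w.r.t. f(x)
        self.support.append(x)
        self.alphas.append(-self.eta * err)
        return err ** 2

class OMKRSketch:
    """Combine per-kernel regressors with Hedge-style multiplicative weights."""
    def __init__(self, kernels, eta=0.1, beta=0.8):
        self.learners = [OnlineKernelRegressor(k, eta) for k in kernels]
        self.weights = np.ones(len(kernels))
        self.beta = beta                        # Hedge discount factor

    def predict(self, x):
        w = self.weights / self.weights.sum()
        return float(np.dot(w, [l.predict(x) for l in self.learners]))

    def update(self, x, y):
        losses = np.array([l.update(x, y) for l in self.learners])
        self.weights *= self.beta ** losses     # downweight kernels with large loss
        self.weights /= self.weights.sum()      # renormalize to avoid underflow

# Example usage with two RBF kernels of different bandwidths:
rbf = lambda gamma: (lambda a, b: np.exp(-gamma * np.sum((a - b) ** 2)))
model = OMKRSketch(kernels=[rbf(0.1), rbf(1.0)])
```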
Takeshi Kurashima, Tomoharu Iwata, Noriko Takaya, H. Sawada
The diffusion of information, rumors, and diseases is assumed to be a probabilistic process over some network structure. An event starts at one node of the network and then spreads along the edges of the network. In most cases, the underlying network structure that generates the diffusion process is unobserved, and we only observe the times at which each node is altered/influenced by the process. This paper proposes a probabilistic model for inferring the diffusion network, which we call Probabilistic Latent Network Visualization (PLNV); it is based on cascade data, a record of the observed times of node influence. An important characteristic of our approach is that it infers the network by embedding it into a low-dimensional visualization space. We assume that each node in the network has latent coordinates in the visualization space, and diffusion is more likely to occur between nodes that are placed close together. Our model uses maximum a posteriori estimation to learn the latent coordinates of nodes that best explain the observed cascade data. The latent coordinates of nodes in the visualization space can 1) enable the system to suggest network layouts most suitable for browsing, and 2) lead to high accuracy in inferring the underlying network when analyzing the diffusion of new or rare information, rumors, and diseases.
"Probabilistic latent network visualization: inferring and embedding diffusion networks," Takeshi Kurashima, Tomoharu Iwata, Noriko Takaya, H. Sawada. KDD 2014. DOI: 10.1145/2623330.2623646
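The core modeling idea can be sketched as follows, with several simplifying assumptions (an exponential-delay cascade likelihood, a Gaussian prior on coordinates, and generic numerical optimization in place of the paper's estimation procedure): pairwise transmission rates decay with the distance between latent 2-D coordinates, which are fit by MAP estimation on observed cascades.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(flat_z, cascades, n_nodes, dim=2, prior_scale=1.0):
    """cascades: list of dicts mapping node id -> observed infection time."""
    Z = flat_z.reshape(n_nodes, dim)
    nll = 0.0
    for times in cascades:
        infected = sorted(times, key=times.get)
        for k in range(1, len(infected)):
            j = infected[k]
            parents = infected[:k]              # nodes infected before j
            rates = np.array([np.exp(-np.sum((Z[i] - Z[j]) ** 2)) for i in parents])
            delays = np.array([times[j] - times[i] for i in parents])
            # exponential-delay survival terms plus log of the total hazard at t_j
            nll += np.sum(rates * delays) - np.log(np.sum(rates) + 1e-12)
    nll += np.sum(Z ** 2) / (2 * prior_scale ** 2)   # Gaussian prior on coordinates
    return nll

def fit_plnv_like(cascades, n_nodes, dim=2, seed=0):
    z0 = np.random.default_rng(seed).normal(size=n_nodes * dim)
    res = minimize(neg_log_posterior, z0, args=(cascades, n_nodes, dim))
    return res.x.reshape(n_nodes, dim)          # MAP latent coordinates
```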
Liangda Li, Hongbo Deng, Anlei Dong, Yi Chang, H. Zha
We consider a search task to be a set of queries that serve the same user information need. Analyzing search tasks from user query streams plays an important role in building modern tools to improve search engine performance. In this paper, we propose a probabilistic method for identifying and labeling search tasks based on the following intuitive observations: queries issued temporally close together in a user's query sequence are likely to belong to the same search task, while different users with the same information need tend to submit topically coherent search queries. To capture these intuitions, we directly model query temporal patterns using a special class of point processes called Hawkes processes, and combine topic models with Hawkes processes to simultaneously identify and label search tasks. Essentially, the Hawkes processes use their self-exciting property to identify search tasks when influence exists among a sequence of queries from an individual user, while the topic model exploits query co-occurrence across different users to discover the latent information needed for labeling search tasks. More importantly, the mutual reinforcement between the Hawkes processes and the topic model in the unified model enhances the performance of both. We evaluate our method on both synthetic data and real-world query log data, and we also apply our model to query clustering and search task identification. Comparisons with state-of-the-art methods demonstrate that the improvement achieved by our approach is consistent and promising.
"Identifying and labeling search tasks via query-based Hawkes processes," Liangda Li, Hongbo Deng, Anlei Dong, Yi Chang, H. Zha. KDD 2014. DOI: 10.1145/2623330.2623679
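To make the self-exciting intuition concrete, here is a toy sketch (not the paper's joint model, which also couples a topic model over query text): under a Hawkes process with an exponential kernel, a new query is attributed to the same task as the past query that excites it most, or starts a new task when the baseline intensity dominates. The parameters mu, alpha, and beta are illustrative.

```python
import numpy as np

def assign_tasks(times, mu=0.1, alpha=0.8, beta=1.0):
    """times: sorted query timestamps for one user. Returns a task id per query."""
    tasks = []
    for i, t in enumerate(times):
        if i == 0:
            tasks.append(0)
            continue
        # excitation contributed by each earlier query under an exponential kernel
        excitations = alpha * beta * np.exp(-beta * (t - np.asarray(times[:i])))
        if excitations.max() > mu:                  # self-excitation dominates
            tasks.append(tasks[int(excitations.argmax())])
        else:                                       # baseline dominates: new task
            tasks.append(max(tasks) + 1)
    return tasks

print(assign_tasks([0.0, 0.5, 0.9, 10.0, 10.2]))    # -> [0, 0, 0, 1, 1]
```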
Shuang-Hong Yang, A. Kolcz, A. Schlaikjer, Pankaj Gupta
We are interested in organizing a continuous stream of sparse and noisy texts, known as "tweets," in real time into an ontology of hundreds of topics with measurable and stringently high precision. This inference is performed over a full-scale stream of Twitter data, whose statistical distribution evolves rapidly over time. Implementing the system in an industrial setting, with the potential of affecting and being visible to real users, made it necessary to overcome a host of practical challenges. We present a spectrum of topic modeling techniques that contribute to a deployed system. These include non-topical tweet detection, automatic labeled-data acquisition, evaluation with human computation, diagnostic and corrective learning, and, most importantly, high-precision topic inference. The latter comprises a novel two-stage training algorithm for tweet text classification and a closed-loop inference mechanism for combining texts with additional sources of information. The resulting system achieves 93% precision at substantial overall coverage.
"Large-scale high-precision topic modeling on Twitter," Shuang-Hong Yang, A. Kolcz, A. Schlaikjer, Pankaj Gupta. KDD 2014. DOI: 10.1145/2623330.2623336
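One small ingredient implied by the stringent precision target can be sketched independently of the deployed system: given any per-topic scorer, choose the lowest decision threshold whose validation precision still meets the target, trading coverage for precision. The function below is a generic illustration, not the paper's two-stage training or closed-loop inference machinery.

```python
import numpy as np

def precision_threshold(scores, y_true, target_precision=0.93):
    """scores, y_true: validation scores and binary labels for one topic."""
    order = np.argsort(-np.asarray(scores))
    tp, fp, best = 0, 0, None
    for idx in order:                      # sweep candidate thresholds, high to low
        tp += y_true[idx]
        fp += 1 - y_true[idx]
        if tp / (tp + fp) >= target_precision:
            best = scores[idx]             # lowest threshold so far meeting the target
    return best                            # None if the target is never met
```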
Andreas Züfle, Tobias Emrich, Klaus Arthur Schmid, N. Mamoulis, A. Zimek, M. Renz
This paper targets the problem of computing meaningful clusterings from uncertain data sets. Existing methods for clustering uncertain data compute a single clustering without any indication of its quality and reliability; thus, decisions based on their results are questionable. In this paper, we describe a framework based on possible-worlds semantics: when applied to an uncertain dataset, it computes a set of representative clusterings, each of which has a probabilistic guarantee not to exceed some maximum distance to the ground truth clustering, i.e., the clustering of the actual (but unknown) data. Our framework can be combined with any existing clustering algorithm and is the first to provide quality guarantees about its result. In addition, our experimental evaluation shows that our representative clusterings have a much smaller deviation from the ground truth clustering than existing approaches, thus reducing the effect of uncertainty.
"Representative clustering of uncertain data," Andreas Züfle, Tobias Emrich, Klaus Arthur Schmid, N. Mamoulis, A. Zimek, M. Renz. KDD 2014. DOI: 10.1145/2623330.2623725
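A minimal possible-worlds sketch, assuming Gaussian per-object uncertainty and k-means as the plug-in clustering algorithm: sample worlds, cluster each, and return the medoid clustering as a representative. The paper's framework additionally attaches the probabilistic distance guarantees described above, which this sketch does not provide.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def representative_clustering(means, stds, k=3, n_worlds=50, seed=0):
    """means, stds: (n_objects, n_dims) arrays describing per-object uncertainty."""
    rng = np.random.default_rng(seed)
    labelings = []
    for _ in range(n_worlds):
        world = rng.normal(means, stds)                    # draw one possible world
        labelings.append(KMeans(n_clusters=k, n_init=10,
                                random_state=0).fit_predict(world))
    # medoid: the clustering with the highest average similarity to all others
    sim = np.array([[adjusted_rand_score(a, b) for b in labelings] for a in labelings])
    return labelings[int(sim.mean(axis=1).argmax())]
```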
This paper addresses geospatial interpolation for meteorological measurements, in which we estimate the values of climatic metrics at unsampled sites from existing observations. Providing climatological and meteorological conditions covering a large region is potentially useful in many applications, such as the smart grid. However, existing research on interpolation either requires a large number of complex calculations or lacks high accuracy. We propose a non-parametric statistical model based on Bayesian compressed sensing to efficiently perform the spatial interpolation task. Student-t priors are employed to model the sparsity of the unknown signals' coefficients, and an Approximated Variational Inference (AVI) method is provided for effective and fast learning. The presented model has been deployed at IBM to aid the intelligent management of the smart grid. Evaluations on two real-world datasets demonstrate that our algorithm achieves state-of-the-art performance in both effectiveness and efficiency.
"Novel geospatial interpolation analytics for general meteorological measurements," Bingsheng Wang, Jinjun Xiong. KDD 2014. DOI: 10.1145/2623330.2623367
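As a hedged stand-in for the described interpolator, the sketch below represents the climatic field as a sparse combination of RBF basis functions centered at the observed sites and fits the coefficients with scikit-learn's ARD regression. The paper instead uses Student-t priors with approximated variational inference, so this only mirrors the sparse-Bayesian flavor; the bandwidth gamma is an assumed hyperparameter.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

def rbf_features(sites, centers, gamma=0.5):
    """RBF dictionary: one basis function per observation site."""
    d2 = ((sites[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def interpolate(obs_coords, obs_values, query_coords, gamma=0.5):
    model = ARDRegression()                    # sparsity via automatic relevance determination
    model.fit(rbf_features(obs_coords, obs_coords, gamma), obs_values)
    mean, std = model.predict(rbf_features(query_coords, obs_coords, gamma),
                              return_std=True)
    return mean, std                           # posterior mean and uncertainty per query site
```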
B. Dalessandro, Daizhuo Chen, T. Raeder, C. Perlich, Melinda Han Williams, F. Provost
Internet display advertising is a critical revenue source for publishers and online content providers, and is supported by massive amounts of user and publisher data. Targeting display ads can be improved substantially with machine learning methods, but building many models on massive data becomes prohibitively expensive computationally. This paper presents a combination of strategies, deployed by the online advertising firm Dstillery, for learning many models from extremely high-dimensional data efficiently and without human intervention. This combination includes: (i) A method for simple-yet-effective transfer learning where a model learned from data that is relatively abundant and cheap is taken as a prior for Bayesian logistic regression trained with stochastic gradient descent (SGD) from the more expensive target data. (ii) A new update rule for automatic learning rate adaptation, to support learning from sparse, high-dimensional data, as well as the integration with adaptive regularization. We present an experimental analysis across 100 different ad campaigns, showing that the transfer learning indeed improves performance across a large number of them, especially at the start of the campaigns. The combined "hands-free" method needs no fiddling with the SGD learning rate, and we show that it is just as effective as using expensive grid search to set the regularization parameter for each campaign.
"Scalable hands-free transfer learning for online advertising," B. Dalessandro, Daizhuo Chen, T. Raeder, C. Perlich, Melinda Han Williams, F. Provost. KDD 2014. DOI: 10.1145/2623330.2623349
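A compact sketch of the transfer idea, with illustrative hyperparameters: weights learned on the cheap, abundant source data serve as the Gaussian prior mean for a logistic regression trained by SGD on the target data, using an AdaGrad-style per-feature step size as a stand-in for the paper's specific update rule and adaptive regularization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_with_prior(X, y, w_prior, lambda_=0.1, eta0=0.5, epochs=5):
    """y in {0, 1}; w_prior: weights from the source-domain model (prior mean)."""
    w = w_prior.copy()
    g2 = np.full_like(w, 1e-8)                  # accumulated squared gradients (AdaGrad)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # logistic loss gradient plus a pull toward the source-domain prior
            grad = (sigmoid(x_i @ w) - y_i) * x_i + lambda_ * (w - w_prior)
            g2 += grad ** 2
            w -= eta0 / np.sqrt(g2) * grad      # per-feature adaptive step
    return w
```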
The increasing sophistication of malicious software calls for new defensive techniques that are harder to evade and are capable of protecting users against novel threats. We present AESOP, a scalable algorithm that identifies malicious executable files by applying Aesop's moral that "a man is known by the company he keeps." We use a large dataset voluntarily contributed by the members of Norton Community Watch, consisting of partial lists of the files that exist on their machines, to identify close relationships between files that often appear together on machines. AESOP leverages locality-sensitive hashing to measure the strength of these inter-file relationships and construct a graph, on which it performs large-scale inference by propagating information from the labeled files (benign or malicious) to the preponderance of unlabeled files. AESOP attained early labeling of 99% of benign files and 79% of malicious files, over a week before they are labeled by state-of-the-art techniques, with a 0.9961 true positive rate at flagging malware at a 0.0001 false positive rate.
"Guilt by association: large scale malware detection by mining file-relation graphs," Acar Tamersoy, Kevin A. Roundy, Duen Horng Chau. KDD 2014. DOI: 10.1145/2623330.2623342
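A toy rendering of the two ingredients named above, with illustrative signature sizes and a simple averaging propagation in place of the deployed large-scale inference: minhash signatures over each file's set of machines approximate co-occurrence strength, and labels are then propagated from known benign/malicious files to unlabeled neighbors.

```python
import numpy as np

def minhash_signature(machine_ids, n_hashes=64, seed=1):
    """Minhash signature of the set of machines on which a file appears."""
    rng = np.random.default_rng(seed)
    a, b = rng.integers(1, 2**31, (2, n_hashes))
    ids = np.asarray(sorted(machine_ids))
    return ((a[:, None] * ids[None, :] + b[:, None]) % (2**31 - 1)).min(axis=1)

def estimated_jaccard(sig1, sig2):
    """Fraction of matching minhash values approximates set similarity."""
    return float(np.mean(sig1 == sig2))

def propagate(adjacency, labels, n_iter=20):
    """adjacency: file -> list of similar files (every file is a key);
    labels: file -> +1 (malicious) / -1 (benign); all other files start at 0."""
    score = {f: labels.get(f, 0.0) for f in adjacency}
    for _ in range(n_iter):
        for f, neighbors in adjacency.items():
            if f not in labels and neighbors:          # keep seed labels fixed
                score[f] = np.mean([score[g] for g in neighbors])
    return score
```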