
Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: latest publications

Simultaneous feature and feature group selection through hard thresholding
Shuo Xiang, Tao Yang, Jieping Ye
Selecting an informative subset of features has important applications in many data mining tasks, especially for high-dimensional data. Recently, simultaneous selection of features and feature groups (a.k.a. bi-level selection) has become increasingly popular, since it not only reduces the number of features but also reveals the underlying grouping effect in the data, a valuable capability in applications such as bioinformatics and web data mining. One major challenge of bi-level selection (or even feature selection alone) is that computing a globally optimal solution carries a prohibitive computational cost. To overcome this challenge, current research falls mainly into two categories. The first focuses on finding suitable continuous surrogates for the discrete objective, leading to various convex and nonconvex optimization models. Although efficient, convex models usually deliver sub-optimal performance, while nonconvex models require significantly more computational effort. The other direction uses greedy algorithms to solve the discrete optimization directly. However, existing algorithms handle single-level selection only, and extending them to bi-level selection remains challenging. In this paper, we fill this gap by introducing an efficient sparse group hard thresholding algorithm. Our main contributions are: (1) we propose a novel bi-level selection model and show that the key combinatorial problem admits a globally optimal solution via dynamic programming; (2) we provide an error bound between our solution and the global optimum under the RIP (Restricted Isometry Property) framework. Our experiments on synthetic and real data demonstrate that the proposed algorithm delivers encouraging performance while keeping computational efficiency comparable to convex relaxation models.
Pub Date: 2014-08-24 | DOI: 10.1145/2623330.2623662
Citations: 25
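The bi-level selection step can be sketched in simplified form. The snippet below is a greedy illustration, not the authors' dynamic program: it keeps the `g` highest-energy groups, then the `k` largest coefficients within them. The function name, group representation, and greedy group ranking are all illustrative assumptions.

```python
import numpy as np

def sparse_group_hard_threshold(w, groups, g, k):
    """Greedy sketch of bi-level (feature + group) hard thresholding.

    Keeps at most g groups (ranked by squared-norm energy), then at most
    k individual coefficients within the kept groups.
    """
    w = np.asarray(w, dtype=float)
    # Rank groups by their energy ||w_G||^2 and keep the top g.
    energies = {gid: np.sum(w[idx] ** 2) for gid, idx in groups.items()}
    kept_groups = sorted(energies, key=energies.get, reverse=True)[:g]
    mask = np.zeros_like(w, dtype=bool)
    for gid in kept_groups:
        mask[groups[gid]] = True
    # Within the kept groups, keep the k largest coefficients by magnitude.
    candidates = np.where(mask)[0]
    top = candidates[np.argsort(np.abs(w[candidates]))[::-1][:k]]
    out = np.zeros_like(w)
    out[top] = w[top]
    return out

w = [0.1, 3.0, -2.0, 0.2, 0.1, 4.0]
groups = {0: [0, 1], 1: [2, 3], 2: [4, 5]}
print(sparse_group_hard_threshold(w, groups, g=2, k=2))
```

Unlike this greedy sketch, the paper shows the exact combinatorial problem can be solved globally by dynamic programming.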
Prediction of human emergency behavior and their mobility following large-scale disaster
Xuan Song, Quanshi Zhang, Y. Sekimoto, R. Shibasaki
The frequency and intensity of natural disasters have increased significantly over the past decades, and this trend is predicted to continue. Facing such unexpected disasters, accurately predicting human emergency behavior and mobility becomes critical for planning effective humanitarian relief, disaster management, and long-term societal reconstruction. In this paper, we build a large human mobility database (GPS records of 1.6 million users over one year) and several other datasets to capture and analyze human emergency behavior and mobility following the Great East Japan Earthquake and the Fukushima nuclear accident. Our empirical analysis of these data shows that human behavior and mobility following a large-scale disaster sometimes correlate with mobility patterns during normal times, and are also strongly influenced by social relationships, disaster intensity, damage level, government-appointed shelters, news reporting, and large population flows. On the basis of these findings, we develop a model of human behavior that accounts for these factors to accurately predict human emergency behavior and mobility following a large-scale disaster. The experimental results and validations demonstrate the effectiveness of our behavior model, and suggest that human behavior and movement during disasters may be significantly more predictable than previously thought.
Pub Date: 2014-08-24 | DOI: 10.1145/2623330.2623628
Citations: 171
Seven rules of thumb for web site experimenters
Ron Kohavi, Alex Deng, R. Longbotham, Ya Xu
Web site owners, from small sites to the largest properties, including Amazon, Facebook, Google, LinkedIn, Microsoft, and Yahoo, attempt to improve their web sites, optimizing for criteria ranging from repeat usage and time on site to revenue. Having been involved in running thousands of controlled experiments at Amazon, Booking.com, LinkedIn, and multiple Microsoft properties, we share seven rules of thumb for experimenters, generalized from these experiments and their results. These are principles that we believe have broad applicability in web optimization and analytics beyond controlled experiments, yet they are not provably correct, and in some cases exceptions are known. To support these rules of thumb, we share multiple real examples, most appearing in a public paper for the first time. Some rules of thumb have been stated before, such as 'speed matters,' but we describe the assumptions in the experimental design and share additional experiments that improved our understanding of where speed matters more: certain areas of the web page are more critical. This paper serves two goals. First, it can guide experimenters with rules of thumb that help them optimize their sites. Second, it provides the KDD community with new research challenges on the applicability, exceptions, and extensions of these rules, one of the goals of KDD's industrial track.
Pub Date: 2014-08-24 | DOI: 10.1145/2623330.2623341
Citations: 186
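As a minimal illustration of the kind of analysis behind such controlled experiments (not code from the paper), a standard two-proportion z-test comparing control and treatment conversion rates, using only the normal approximation:

```python
import math

def two_proportion_z(conv_c, n_c, conv_t, n_t):
    """Two-sided z-test for a difference in conversion rates
    between control (c) and treatment (t)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    # Normal-approximation two-sided p-value via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 2.00% vs. 2.24% conversion over 50k users per arm.
z, p = two_proportion_z(conv_c=1000, n_c=50000, conv_t=1120, n_t=50000)
print(round(z, 2), p < 0.05)
```

The numbers above are invented; the paper's rules of thumb concern how such experiments are designed and interpreted, not this particular test.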
Scalable near real-time failure localization of data center networks
H. Herodotou, Bolin Ding, S. Balakrishnan, G. Outhred, Percy Fitter
Large-scale data center networks are complex, comprising several thousand network devices and several hundred thousand links, and form the critical infrastructure on which all higher-level services depend. Despite the built-in redundancy in data center networks, performance issues and device or link failures can lead to user-perceived service interruptions. Therefore, determining and localizing user-impacting availability and performance issues in near real time is crucial. Traditionally, both passive and active monitoring approaches have been used for failure localization. However, data from passive monitoring is often too noisy and does not effectively capture silent or gray failures, whereas active monitoring is potent at detecting faults but limited in its ability to isolate the exact fault location, depending on its scale and granularity. Our key idea is to apply statistical data mining techniques to large-scale active monitoring data to determine a ranked list of suspect causes, which we refine with passive monitoring signals. In particular, we compute a failure probability for devices and links in near real time using data from active monitoring, and look for statistically significant increases in that probability. We also correlate the probabilistic output with other failure signals from passive monitoring to increase the confidence of the probabilistic analysis. We have implemented our approach in the Windows Azure production environment and have validated its effectiveness in terms of localization accuracy, precision, and time to localization using known network incidents from the past three months. The correlated, ranked list of devices and links is surfaced as a report that network operators use to investigate current issues and identify probable root causes.
Pub Date: 2014-08-24 | DOI: 10.1145/2623330.2623365
Citations: 32
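A toy rendition of the statistical flagging step (the production system described above is far richer): estimate per-link probe-failure rates and flag links whose current rate is significantly above their baseline, using a normal approximation. The function name, data shapes, and threshold are invented for illustration.

```python
import math

def flag_links(baseline, current, z_threshold=3.0):
    """Flag links whose current probe-failure rate shows a statistically
    significant increase over baseline (normal approximation).

    baseline/current: {link_name: (failures, probes)}
    Returns a suspect list ranked by decreasing significance.
    """
    flagged = []
    for link, (f0, n0) in baseline.items():
        f1, n1 = current[link]
        p0, p1 = f0 / n0, f1 / n1
        pool = (f0 + f1) / (n0 + n1)
        se = math.sqrt(pool * (1 - pool) * (1 / n0 + 1 / n1)) or 1e-12
        z = (p1 - p0) / se
        if z > z_threshold:
            flagged.append((link, round(z, 1)))
    return sorted(flagged, key=lambda t: -t[1])

baseline = {"linkA": (5, 10000), "linkB": (4, 10000)}
current = {"linkA": (80, 10000), "linkB": (5, 10000)}
print(flag_links(baseline, current))
```

In the paper this kind of ranked suspect list is then refined with correlated passive-monitoring signals before being surfaced to operators.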
Matching users and items across domains to improve the recommendation quality
Chung-Yi Li, Shou-de Lin
Given two homogeneous rating matrices with some overlapping users/items whose mappings are unknown, this paper aims at answering two questions. First, can we identify the unknown mapping between the users and/or items? Second, can we further utilize the identified mappings to improve the quality of recommendation in either domain? Our solution integrates a latent-space matching procedure with a refining process based on the optimization of prediction to identify the matching. We then design a transfer-based method to improve the recommendation performance. Using both synthetic and real data, we have done extensive experiments under different real-life scenarios to verify the effectiveness of our models. The code and other materials are available at http://www.csie.ntu.edu.tw/~r00922051/matching/
Pub Date: 2014-08-24 | DOI: 10.1145/2623330.2623657
Citations: 73
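The latent-space matching idea can be caricatured as an assignment problem, a simplification of the paper's procedure: after factorizing both rating matrices, recover the unknown user mapping by matching rows of the two latent-factor matrices. The brute-force search below is illustrative and only feasible for tiny examples.

```python
from itertools import permutations

def match_latent_rows(U_a, U_b):
    """Brute-force optimal matching of rows of two small latent-factor
    matrices: the permutation of U_b's rows minimizing total squared
    distance to U_a's rows. Exponential cost; for illustration only."""
    n = len(U_a)

    def cost(perm):
        return sum(
            sum((x - y) ** 2 for x, y in zip(U_a[i], U_b[perm[i]]))
            for i in range(n)
        )

    return min(permutations(range(n)), key=cost)

# Domain-A user factors, and a shuffled, slightly noisy copy in domain B.
U_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
U_b = [[0.69, 0.72], [0.98, 0.05], [0.02, 0.99]]
print(match_latent_rows(U_a, U_b))  # maps A-row i to B-row perm[i]
```

At realistic scale one would use a polynomial-time assignment solver (e.g. the Hungarian algorithm) rather than enumeration, and the paper additionally refines the match by optimizing prediction quality.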
Open-domain quantity queries on web tables: annotation, response, and consensus models
Sunita Sarawagi, Soumen Chakrabarti
Over 40% of columns in hundreds of millions of Web tables contain numeric quantities. Tables are a richer source of structured knowledge than free text. We harness Web tables to answer queries whose target is a quantity with natural variation, such as net worth of zuckerburg, battery life of ipad, half life of plutonium, and calories in pizza. Our goal is to respond to such queries with a ranked list of quantity distributions, suitably represented. Apart from the challenges of informal schema and noisy extractions, which have been known since tables were first used for non-quantity information extraction, we face additional problems of noisy number formats, as well as unit specifications that are often contextual and ambiguous. Early "hardening" of extraction decisions at the table level leads to poor accuracy. Instead, we use a probabilistic context-free grammar (PCFG)-based unit extractor on the tables, and retain several top-scoring extractions of quantities and numerals. We then inject these into a new collective inference framework that makes global decisions about the relevance of candidate table snippets, the interpretation of the query's target quantity type, the value distributions to be ranked and presented, and the degree of consensus that can be built to support the proposed quantity distributions. Experiments with over 25 million Web tables and 350 diverse queries show robust, large benefits from our quantity catalog, unit extractor, and collective inference.
Pub Date: 2014-08-24 | DOI: 10.1145/2623330.2623749
Citations: 54
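The unit-extraction subproblem can be hinted at with a regex toy; the paper's PCFG-based extractor handles vastly more formats, compound units, and contextual ambiguity than this sketch. The lexicon, function name, and normalization scheme below are invented for illustration.

```python
import re

# Toy unit lexicon mapping unit words to multipliers; purely illustrative.
UNIT_SCALE = {"billion": 1e9, "million": 1e6, "hours": 1.0, "hrs": 1.0}

def parse_quantity(cell):
    """Parse strings like '$13.3 billion' or '10 hrs' into
    (normalized_value, unit_word), applying the scale multiplier."""
    m = re.search(r"([\d][\d,]*\.?\d*)\s*([A-Za-z]+)?", cell)
    if not m:
        return None
    value = float(m.group(1).replace(",", ""))
    word = (m.group(2) or "").lower()
    return value * UNIT_SCALE.get(word, 1.0), word

print(parse_quantity("$13.3 billion"))
print(parse_quantity("10 hrs"))
```

A rule-based parser like this hardens one interpretation per cell; the paper instead keeps several top-scoring extractions and resolves them jointly in its collective inference framework.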
Data, predictions, and decisions in support of people and society
E. Horvitz
Deep societal benefits will spring from advances in data availability and in computational procedures for mining insights and inferences from large data sets. I will describe efforts to harness data for making predictions and guiding decisions, touching on work in transportation, healthcare, online services, and interactive systems. I will start with efforts to learn and field predictive models that forecast flows of traffic in greater city regions. Moving from the ground to the air, I will discuss fusing data from aircraft to make inferences about atmospheric conditions and using these results to enhance air transport. I will then focus on experiences with building and fielding predictive models in clinical medicine. I will show how inferences about outcomes and interventions can provide insights and guide decision making. Moving beyond data captured by hospitals, I will discuss the promise of transforming anonymized behavioral data drawn from web services into large-scale sensor networks for public health, including efforts to identify adverse effects of medications and to understand illness in populations. I will conclude by describing how we can use machine learning to leverage the complementarity of human and machine intellect to solve challenging problems in science and society.
Pub Date: 2014-08-24 | DOI: 10.1145/2623330.2630815
Citations: 0
Learning with dual heterogeneity: a nonparametric bayes model
Hongxia Yang, Jingrui He
Traditional data mining techniques are designed to model a single type of heterogeneity, such as multi-task learning for modeling task heterogeneity and multi-view learning for modeling view heterogeneity. Recently, a variety of real applications have emerged that exhibit dual heterogeneity, namely both task heterogeneity and view heterogeneity. Examples include insider threat detection across multiple organizations and web image classification in different domains. Existing methods for addressing such problems typically assume that multiple tasks are equally related and multiple views are equally consistent, which limits their application in complex settings with varying task relatedness and view consistency. In this paper, we advance the state of the art by adaptively modeling task relatedness and view consistency via a nonparametric Bayes model: we model task relatedness using a normal penalty with sparse covariances, and view consistency using a matrix Dirichlet process. Based on this model, we propose the NOBLE algorithm using an efficient Gibbs sampler. Experimental results on multiple real data sets demonstrate the effectiveness of the proposed algorithm.
DOI: 10.1145/2623330.2623727 · Published: 2014-08-24
Citations: 14
Network mining and analysis for social applications
Feida Zhu, Huan Sun, Xifeng Yan
The recent blossoming of social network and communication services in both public and corporate settings has generated a staggering amount of network data of all kinds. Unlike the bio-networks and chemical compound graph data often used in traditional network mining and analysis, the new network data grown out of social applications are characterized by rich attributes, high heterogeneity, enormous size, and complex patterns with various semantic meanings, all of which pose significant research challenges to the graph/network mining community. In this tutorial, we examine recent advances in network mining and analysis for social applications, covering a diverse collection of methodologies and applications from the perspectives of event, relationship, collaboration, and network pattern. For each perspective we present the problem settings, the challenges, recent research advances, and some future directions. Topics include but are not limited to correlation mining, iceberg finding, anomaly detection, relationship discovery, information flow, task routing, and pattern mining.
DOI: 10.1145/2623330.2630810 · Published: 2014-08-24
Citations: 3
A cost-effective recommender system for taxi drivers
Meng Qu, Hengshu Zhu, Junming Liu, Guannan Liu, Hui Xiong
GPS technology and new forms of urban geography have changed the paradigm for mobile services. The abundant availability of GPS traces has enabled new ways of doing taxi business, and recent efforts have been made toward developing mobile recommender systems for taxi drivers using taxi GPS traces. These systems can recommend a sequence of pick-up points so as to maximize the probability of finding a customer within the shortest driving distance. In the real world, however, the income of taxi drivers is strongly correlated with their effective driving hours. In other words, it is more critical for taxi drivers to know the actual driving routes that minimize the driving time before finding a customer. To this end, in this paper we develop a cost-effective recommender system for taxi drivers. The design goal is to maximize their profits when following the recommended routes for finding passengers. Specifically, we first design a net-profit objective function for evaluating the potential profits of driving routes. Then, we build a graph representation of the road network by mining historical taxi GPS traces and provide a brute-force strategy to generate the optimal driving route for recommendation. A critical challenge along this line, however, is the high computational cost of the graph-based approach. We therefore develop a novel recursion strategy, based on the special form of the net-profit function, to search for optimal candidate routes efficiently. In particular, instead of recommending a sequence of pick-up points and letting the driver decide how to get to those points, our recommender system provides an entire driving route, and drivers can find a customer with the largest potential profit by following the recommendations. This makes our recommender system more practical and profitable than existing recommender systems.
Finally, we carry out extensive experiments on a real-world data set collected from the San Francisco Bay area and the experimental results clearly validate the effectiveness of the proposed recommender system.
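The brute-force strategy described above can be sketched on a toy example: enumerate simple routes through a small road graph and pick the one maximizing an expected net profit. Everything below — the graph, the pickup probabilities, and the particular net-profit form (fare minus a per-minute driving cost, weighted by the chance the driver is still empty when reaching each point) — is an illustrative assumption, not the paper's actual objective function.

```python
# Hypothetical road graph: node -> {neighbor: travel minutes}.
GRAPH = {
    "A": {"B": 3, "C": 5},
    "B": {"C": 2, "D": 4},
    "C": {"D": 3},
    "D": {},
}
# Hypothetical probability of finding a passenger at each point.
PICKUP_PROB = {"A": 0.1, "B": 0.3, "C": 0.2, "D": 0.5}
FARE, COST_PER_MIN = 15.0, 0.5  # illustrative fare and driving cost

def expected_profit(route):
    """Expected net profit of a route: at each point the driver picks up a
    passenger with that point's probability (if still empty), earning the
    fare minus the cost of the time driven so far."""
    profit, survive, t = 0.0, 1.0, 0.0
    prev = None
    for node in route:
        if prev is not None:
            t += GRAPH[prev][node]          # accumulate travel time
        profit += survive * PICKUP_PROB[node] * (FARE - COST_PER_MIN * t)
        survive *= 1.0 - PICKUP_PROB[node]  # chance of still being empty
        prev = node
    return profit

def routes(node, path=None):
    """Enumerate all simple routes starting at `node` (brute force)."""
    path = (path or []) + [node]
    yield path
    for nxt in GRAPH[node]:
        if nxt not in path:
            yield from routes(nxt, path)

best = max(routes("A"), key=expected_profit)
print(best, expected_profit(best))
```

Even this tiny example shows why brute force is costly: the number of simple routes grows combinatorially with graph size, which is what motivates the paper's recursion strategy over the net-profit function.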
DOI: 10.1145/2623330.2623668 · Published: 2014-08-24
Citations: 180
Journal
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining