
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining: latest publications

Trace complexity of network inference
B. Abrahao, Flavio Chierichetti, Robert D. Kleinberg, A. Panconesi
The network inference problem consists of reconstructing the edge set of a network given traces representing the chronology of infection times as epidemics spread through the network. This problem is a paradigmatic representative of prediction tasks in machine learning that require deducing a latent structure from observed patterns of activity in a network, which often require an unrealistically large number of resources (e.g., amount of available data, or computational time). A fundamental question is to understand which properties we can predict with a reasonable degree of accuracy with the available resources, and which we cannot. We define the trace complexity as the number of distinct traces required to achieve high fidelity in reconstructing the topology of the unobserved network or, more generally, some of its properties. We give algorithms that are competitive with, while being simpler and more efficient than, existing network inference approaches. Moreover, we prove that our algorithms are nearly optimal, by proving an information-theoretic lower bound on the number of traces that an optimal inference algorithm requires for performing this task in the general case. Given these strong lower bounds, we turn our attention to special cases, such as trees and bounded-degree graphs, and to property recovery tasks, such as reconstructing the degree distribution without inferring the network. We show that these problems require a much smaller (and more realistic) number of traces, making them potentially solvable in practice.
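As a rough Python illustration of working with such traces (not the authors' algorithms or bounds), the sketch below applies a simple first-edge style heuristic: within each trace, the second node to become infected was almost certainly infected by the first, so that pair is very likely a true edge. The trace format and the infer_edges helper are assumptions made for illustration.

def infer_edges(traces):
    """traces: list of dicts mapping node -> infection time, one dict per epidemic."""
    edges = set()
    for trace in traces:
        ordered = sorted(trace, key=trace.get)  # nodes in order of infection time
        if len(ordered) >= 2:
            u, v = ordered[0], ordered[1]
            # the second infected node was almost surely infected by the first
            edges.add(tuple(sorted((u, v))))
    return edges

# toy usage: two traces over a four-node network
traces = [{"a": 0.0, "b": 1.2, "c": 3.5}, {"b": 0.0, "d": 0.7, "a": 2.1}]
print(infer_edges(traces))  # {('a', 'b'), ('b', 'd')}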
{"title":"Trace complexity of network inference","authors":"B. Abrahao, Flavio Chierichetti, Robert D. Kleinberg, A. Panconesi","doi":"10.1145/2487575.2487664","DOIUrl":"https://doi.org/10.1145/2487575.2487664","url":null,"abstract":"The network inference problem consists of reconstructing the edge set of a network given traces representing the chronology of infection times as epidemics spread through the network. This problem is a paradigmatic representative of prediction tasks in machine learning that require deducing a latent structure from observed patterns of activity in a network, which often require an unrealistically large number of resources (e.g., amount of available data, or computational time). A fundamental question is to understand which properties we can predict with a reasonable degree of accuracy with the available resources, and which we cannot. We define the trace complexity as the number of distinct traces required to achieve high fidelity in reconstructing the topology of the unobserved network or, more generally, some of its properties. We give algorithms that are competitive with, while being simpler and more efficient than, existing network inference approaches. Moreover, we prove that our algorithms are nearly optimal, by proving an information-theoretic lower bound on the number of traces that an optimal inference algorithm requires for performing this task in the general case. Given these strong lower bounds, we turn our attention to special cases, such as trees and bounded-degree graphs, and to property recovery tasks, such as reconstructing the degree distribution without inferring the network. We show that these problems require a much smaller (and more realistic) number of traces, making them potentially solvable in practice.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78188387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 85
Exploiting user clicks for automatic seed set generation for entity matching
Xiao Bai, F. Junqueira, Srinivasan H. Sengamedu
Matching entities from different information sources is a very important problem in data analysis and data integration. It is, however, challenging due to the number and diversity of information sources involved, and the significant editorial efforts required to collect sufficient training data. In this paper, we present an approach that leverages user clicks during Web search to automatically generate training data for entity matching. The key insight of our approach is that Web pages clicked for a given query are likely to be about the same entity. We use random walk with restart to reduce data sparseness, rely on co-clustering to group queries and Web pages, and exploit page similarity to improve matching precision. Experimental results show that: (i) With 360K pages from 6 major travel websites, we obtain 84K matchings (of 179K pages) that refer to the same entities, with an average precision of 0.826; (ii) The quality of matching obtained from a classifier trained on the resulting seed data is promising: the performance matches that of editorial data at small size and improves with size.
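One building block named above, random walk with restart, can be sketched in Python as follows; the toy click matrix, restart probability, and iteration count are illustrative assumptions, and this is not the authors' full co-clustering pipeline.

import numpy as np

def rwr(adj, seed_idx, restart=0.15, iters=50):
    """adj: (n, n) adjacency matrix; returns proximity of every node to the seed."""
    n = adj.shape[0]
    col_sums = adj.sum(axis=0, keepdims=True)
    P = adj / np.where(col_sums == 0, 1, col_sums)  # column-stochastic transitions
    r = np.zeros(n)
    r[seed_idx] = 1.0                                # restart distribution
    p = r.copy()
    for _ in range(iters):
        p = (1 - restart) * P @ p + restart * r
    return p

# toy click graph: 2 queries x 3 pages, symmetrized into one 5-node graph
clicks = np.array([[3, 1, 0], [0, 2, 4]], dtype=float)
adj = np.block([[np.zeros((2, 2)), clicks], [clicks.T, np.zeros((3, 3))]])
print(rwr(adj, seed_idx=0))  # smoothed proximity of queries/pages to query 0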
{"title":"Exploiting user clicks for automatic seed set generation for entity matching","authors":"Xiao Bai, F. Junqueira, Srinivasan H. Sengamedu","doi":"10.1145/2487575.2487662","DOIUrl":"https://doi.org/10.1145/2487575.2487662","url":null,"abstract":"Matching entities from different information sources is a very important problem in data analysis and data integration. It is, however, challenging due to the number and diversity of information sources involved, and the significant editorial efforts required to collect sufficient training data. In this paper, we present an approach that leverages user clicks during Web search to automatically generate training data for entity matching. The key insight of our approach is that Web pages clicked for a given query are likely to be about the same entity. We use random walk with restart to reduce data sparseness, rely on co-clustering to group queries and Web pages, and exploit page similarity to improve matching precision. Experimental results show that: (i) With 360K pages from 6 major travel websites, we obtain 84K matchings (of 179K pages) that refer to the same entities, with an average precision of 0.826; (ii) The quality of matching obtained from a classifier trained on the resulted seed data is promising: the performance matches that of editorial data at small size and improves with size.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90063965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Beyond myopic inference in big data pipelines
Karthik Raman, Adith Swaminathan, J. Gehrke, T. Joachims
Big Data Pipelines decompose complex analyses of large data sets into a series of simpler tasks, with independently tuned components for each task. This modular setup allows re-use of components across several different pipelines. However, the interaction of independently tuned pipeline components yields poor end-to-end performance as errors introduced by one component cascade through the whole pipeline, affecting overall accuracy. We propose a novel model for reasoning across components of Big Data Pipelines in a probabilistically well-founded manner. Our key idea is to view the interaction of components as dependencies on an underlying graphical model. Different message passing schemes on this graphical model provide various inference algorithms to trade-off end-to-end performance and computational cost. We instantiate our framework with an efficient beam search algorithm, and demonstrate its efficiency on two Big Data Pipelines: parsing and relation extraction.
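A minimal Python sketch of the beam-search idea over a two-stage pipeline is shown below; the stage functions, scores, and beam width are invented for illustration and do not reproduce the paper's graphical model or message-passing schemes.

import heapq

def beam_search(stages, beam_width=3):
    """stages: functions mapping a partial hypothesis to (extension, log_score) candidates."""
    beam = [((), 0.0)]  # (hypothesis so far, cumulative log-score)
    for stage in stages:
        candidates = []
        for hyp, score in beam:
            for ext, s in stage(hyp):
                candidates.append((hyp + (ext,), score + s))
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return beam

# toy stages: a "parser" proposing two parses, then a "relation extractor"
parser = lambda hyp: [("parse_A", -0.1), ("parse_B", -0.5)]
extractor = lambda hyp: [("rel_X", -0.2), ("rel_Y", -0.3)]
print(beam_search([parser, extractor]))  # top joint hypotheses, not just the 1-best chain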
{"title":"Beyond myopic inference in big data pipelines","authors":"Karthik Raman, Adith Swaminathan, J. Gehrke, T. Joachims","doi":"10.1145/2487575.2487588","DOIUrl":"https://doi.org/10.1145/2487575.2487588","url":null,"abstract":"Big Data Pipelines decompose complex analyses of large data sets into a series of simpler tasks, with independently tuned components for each task. This modular setup allows re-use of components across several different pipelines. However, the interaction of independently tuned pipeline components yields poor end-to-end performance as errors introduced by one component cascade through the whole pipeline, affecting overall accuracy. We propose a novel model for reasoning across components of Big Data Pipelines in a probabilistically well-founded manner. Our key idea is to view the interaction of components as dependencies on an underlying graphical model. Different message passing schemes on this graphical model provide various inference algorithms to trade-off end-to-end performance and computational cost. We instantiate our framework with an efficient beam search algorithm, and demonstrate its efficiency on two Big Data Pipelines: parsing and relation extraction.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72961949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
DTW-D: time series semi-supervised learning from a single example
Yanping Chen, Bing Hu, Eamonn J. Keogh, Gustavo E. A. P. A. Batista
Classification of time series data is an important problem with applications in virtually every scientific endeavor. The large research community working on time series classification has typically used the UCR Archive to test their algorithms. In this work we argue that the availability of this resource has isolated much of the research community from the following reality: labeled time series data is often very difficult to obtain. The obvious solution to this problem is the application of semi-supervised learning; however, as we shall show, direct applications of off-the-shelf semi-supervised learning algorithms do not typically work well for time series. In this work we explain why semi-supervised learning algorithms typically fail for time series problems, and we introduce a simple but very effective fix. We demonstrate our ideas on diverse real-world problems.
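DTW-D is commonly described as the ratio of the DTW distance to the Euclidean distance between two series; the Python sketch below computes that ratio for a toy pair (the epsilon guard is an assumption) and omits the paper's semi-supervised labeling procedure.

import numpy as np

def dtw(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

def dtw_d(x, y, eps=1e-9):
    ed = np.linalg.norm(np.asarray(x) - np.asarray(y))  # requires equal-length series
    return dtw(x, y) / (ed + eps)

x = [0.0, 1.0, 2.0, 1.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 1.0]  # same shape, shifted in time
print(dtw_d(x, y))             # well below 1: DTW absorbs the shift, Euclidean does not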
{"title":"DTW-D: time series semi-supervised learning from a single example","authors":"Yanping Chen, Bing Hu, Eamonn J. Keogh, Gustavo E. A. P. A. Batista","doi":"10.1145/2487575.2487633","DOIUrl":"https://doi.org/10.1145/2487575.2487633","url":null,"abstract":"Classification of time series data is an important problem with applications in virtually every scientific endeavor. The large research community working on time series classification has typically used the UCR Archive to test their algorithms. In this work we argue that the availability of this resource has isolated much of the research community from the following reality, labeled time series data is often very difficult to obtain. The obvious solution to this problem is the application of semi-supervised learning; however, as we shall show, direct applications of off-the-shelf semi-supervised learning algorithms do not typically work well for time series. In this work we explain why semi-supervised learning algorithms typically fail for time series problems, and we introduce a simple but very effective fix. We demonstrate our ideas on diverse real word problems.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75851634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 111
Robust principal component analysis via capped norms
Qian Sun, Shuo Xiang, Jieping Ye
In many applications such as image and video processing, the data matrix often possesses simultaneously a low-rank structure capturing the global information and a sparse component capturing the local information. How to accurately extract the low-rank and sparse components is a major challenge. Robust Principal Component Analysis (RPCA) is a general framework to extract such structures. It is well studied that under certain assumptions, convex optimization using the trace norm and l1-norm can be an effective computational surrogate of the difficult RPCA problem. However, such convex formulation is based on a strong assumption which may not hold in real-world applications, and the approximation error in these convex relaxations often cannot be neglected. In this paper, we present a novel non-convex formulation for the RPCA problem using the capped trace norm and the capped l1-norm. In addition, we present two algorithms to solve the non-convex optimization: one is based on the Difference of Convex functions (DC) framework and the other attempts to solve the sub-problems via a greedy approach. Our empirical evaluations on synthetic and real-world data show that both of the proposed algorithms achieve higher accuracy than existing convex formulations. Furthermore, between the two proposed algorithms, the greedy algorithm is more efficient than the DC programming, while they achieve comparable accuracy.
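For concreteness, the two capped penalties named above can be computed as in the Python sketch below, truncating each singular value or entry magnitude at a threshold; the thresholds and the way they are combined into the full RPCA objective are assumptions here, not the paper's exact formulation.

import numpy as np

def capped_trace_norm(L, theta):
    sigma = np.linalg.svd(L, compute_uv=False)
    return np.sum(np.minimum(sigma, theta))      # caps each singular value at theta

def capped_l1_norm(S, theta):
    return np.sum(np.minimum(np.abs(S), theta))  # caps each entry's magnitude at theta

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
print(capped_trace_norm(X, theta=1.0), capped_l1_norm(X, theta=0.5))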
{"title":"Robust principal component analysis via capped norms","authors":"Qian Sun, Shuo Xiang, Jieping Ye","doi":"10.1145/2487575.2487604","DOIUrl":"https://doi.org/10.1145/2487575.2487604","url":null,"abstract":"In many applications such as image and video processing, the data matrix often possesses simultaneously a low-rank structure capturing the global information and a sparse component capturing the local information. How to accurately extract the low-rank and sparse components is a major challenge. Robust Principal Component Analysis (RPCA) is a general framework to extract such structures. It is well studied that under certain assumptions, convex optimization using the trace norm and l1-norm can be an effective computation surrogate of the difficult RPCA problem. However, such convex formulation is based on a strong assumption which may not hold in real-world applications, and the approximation error in these convex relaxations often cannot be neglected. In this paper, we present a novel non-convex formulation for the RPCA problem using the capped trace norm and the capped l1-norm. In addition, we present two algorithms to solve the non-convex optimization: one is based on the Difference of Convex functions (DC) framework and the other attempts to solve the sub-problems via a greedy approach. Our empirical evaluations on synthetic and real-world data show that both of the proposed algorithms achieve higher accuracy than existing convex formulations. Furthermore, between the two proposed algorithms, the greedy algorithm is more efficient than the DC programming, while they achieve comparable accuracy.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78954851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 87
Succinct interval-splitting tree for scalable similarity search of compound-protein pairs with property constraints
Yasuo Tabei, Akihiro Kishimoto, Masaaki Kotera, Yoshihiro Yamanishi
Analyzing functional interactions between small compounds and proteins is indispensable in genomic drug discovery. Since rich information on various compound-protein interactions is available in recent molecular databases, strong demands for making best use of such databases require to invent powerful methods to help us find new functional compound-protein pairs on a large scale. We present the succinct interval-splitting tree algorithm (SITA) that efficiently performs similarity search in databases for compound-protein pairs with respect to both binary fingerprints and real-valued properties. SITA achieves both time and space efficiency by developing the data structure called interval-splitting trees, which enables to efficiently prune the useless portions of search space, and by incorporating the ideas behind wavelet tree, a succinct data structure to compactly represent trees. We experimentally test SITA on the ability to retrieve similar compound-protein pairs/substrate-product pairs for a query from large databases with over 200 million compound-protein pairs/substrate-product pairs and show that SITA performs better than other possible approaches.
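As a naive Python baseline that only illustrates the query semantics (a similar fingerprint plus a real-valued property inside a requested interval), and not the interval-splitting-tree or wavelet-tree machinery itself, consider the sketch below; the database layout, similarity measure, and threshold are assumptions.

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0

def search(db, query_fp, prop_range, min_sim):
    lo, hi = prop_range
    return [pid for pid, (fp, prop) in db.items()
            if lo <= prop <= hi and jaccard(fp, query_fp) >= min_sim]

# toy database: id -> (fingerprint bit set, molecular-weight-like property)
db = {"p1": ({1, 4, 7}, 180.0), "p2": ({1, 4, 9}, 410.0), "p3": ({2, 3}, 175.0)}
print(search(db, query_fp={1, 4, 7, 9}, prop_range=(150.0, 200.0), min_sim=0.5))  # ['p1']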
{"title":"Succinct interval-splitting tree for scalable similarity search of compound-protein pairs with property constraints","authors":"Yasuo Tabei, Akihiro Kishimoto, Masaaki Kotera, Yoshihiro Yamanishi","doi":"10.1145/2487575.2487637","DOIUrl":"https://doi.org/10.1145/2487575.2487637","url":null,"abstract":"Analyzing functional interactions between small compounds and proteins is indispensable in genomic drug discovery. Since rich information on various compound-protein inter- actions is available in recent molecular databases, strong demands for making best use of such databases require to in- vent powerful methods to help us find new functional compound-protein pairs on a large scale. We present the succinct interval-splitting tree algorithm (SITA) that efficiently per- forms similarity search in databases for compound-protein pairs with respect to both binary fingerprints and real-valued properties. SITA achieves both time and space efficiency by developing the data structure called interval-splitting trees, which enables to efficiently prune the useless portions of search space, and by incorporating the ideas behind wavelet tree, a succinct data structure to compactly represent trees. We experimentally test SITA on the ability to retrieve similar compound-protein pairs/substrate-product pairs for a query from large databases with over 200 million compound- protein pairs/substrate-product pairs and show that SITA performs better than other possible approaches.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79023550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Estimating sharer reputation via social data calibration
Jaewon Yang, Bee-Chung Chen, D. Agarwal
Online social networks have become important channels for users to share content with their connections and diffuse information. Although much work has been done to identify socially influential users, the problem of finding "reputable" sharers, who share good content, has received relatively little attention. Availability of such reputation scores can be useful for various applications like recommending people to follow, procuring high quality content in a scalable way, creating a content reputation economy to incentivize high quality sharing, and many more. To estimate sharer reputation, it is intuitive to leverage data that records how recipients respond (through clicking, liking, etc.) to content items shared by a sharer. However, such data is usually biased --- it has a selection bias since the shared items can only be seen and responded to by users connected to the sharer in most social networks, and it has a response bias since the response is usually influenced by the relationship between the sharer and the recipient (which may not indicate whether the shared content is good). To correct for such biases, we propose to utilize an additional data source that provides unbiased goodness estimates for a small set of shared items, and calibrate biased social data through a novel multi-level hierarchical model that describes how the unbiased data and biased data are jointly generated according to sharer reputation scores. The unbiased data also provides the ground truth for quantitative evaluation of different methods. Experiments based on such ground-truth data show that our proposed model significantly outperforms existing methods that estimate social influence using biased social data.
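A very loose way to picture the calibration idea is empirical-Bayes shrinkage of a sharer's biased response rate toward a goodness prior estimated from the small unbiased sample; the Python sketch below uses made-up counts and a made-up shrinkage strength and is not the paper's multi-level hierarchical model.

def calibrated_reputation(clicks, impressions, unbiased_goodness, strength=20.0):
    """clicks/impressions: biased per-sharer counts; unbiased_goodness: prior in [0, 1]."""
    return (clicks + strength * unbiased_goodness) / (impressions + strength)

# a sharer with many impressions is dominated by their own data,
# a sharer with few impressions falls back toward the unbiased prior
print(calibrated_reputation(90, 1000, unbiased_goodness=0.3))  # ~0.09
print(calibrated_reputation(2, 5, unbiased_goodness=0.3))      # ~0.32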
{"title":"Estimating sharer reputation via social data calibration","authors":"Jaewon Yang, Bee-Chung Chen, D. Agarwal","doi":"10.1145/2487575.2487685","DOIUrl":"https://doi.org/10.1145/2487575.2487685","url":null,"abstract":"Online social networks have become important channels for users to share content with their connections and diffuse information. Although much work has been done to identify socially influential users, the problem of finding \"reputable\" sharers, who share good content, has received relatively little attention. Availability of such reputation scores can be useful or various applications like recommending people to follow, procuring high quality content in a scalable way, creating a content reputation economy to incentivize high quality sharing, and many more. To estimate sharer reputation, it is intuitive to leverage data that records how recipients respond (through clicking, liking, etc.) to content items shared by a sharer. However, such data is usually biased --- it has a selection bias since the shared items can only be seen and responded to by users connected to the sharer in most social networks, and it has a response bias since the response is usually influenced by the relationship between the sharer and the recipient (which may not indicate whether the shared content is good). To correct for such biases, we propose to utilize an additional data source that provides unbiased goodness estimates for a small set of shared items, and calibrate biased social data through a novel multi-level hierarchical model that describes how the unbiased data and biased data are jointly generated according to sharer reputation scores. The unbiased data also provides the ground truth for quantitative evaluation of different methods. Experiments based on such ground-truth data show that our proposed model significantly outperforms existing methods that estimate social influence using biased social data.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81119909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees
Charalampos E. Tsourakakis, F. Bonchi, A. Gionis, Francesco Gullo, M. A. Tsiarli
Finding dense subgraphs is an important graph-mining task with many applications. Given that the direct optimization of edge density is not meaningful, as even a single edge achieves maximum density, research has focused on optimizing alternative density functions. A very popular one among such functions is the average degree, whose maximization leads to the well-known densest-subgraph notion. Surprisingly enough, however, densest subgraphs are typically large graphs, with small edge density and large diameter. In this paper, we define a novel density function, which gives subgraphs of much higher quality than densest subgraphs: the graphs found by our method are compact, dense, and with smaller diameter. We show that the proposed function can be derived from a general framework, which includes other important density functions as subcases and for which we show interesting general theoretical properties. To optimize the proposed function we provide an additive approximation algorithm and a local-search heuristic. Both algorithms are very efficient and scale well to large graphs. We evaluate our algorithms on real and synthetic datasets, and we also devise several application studies as variants of our original problem. When compared with the method that finds the subgraph of the largest average degree, our algorithms return denser subgraphs with smaller diameter. Finally, we discuss new interesting research directions that our problem leaves open.
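To make the contrast with average degree concrete, the Python sketch below greedily peels minimum-degree nodes while tracking an edge-surplus style objective f(S) = edges(S) - alpha * |S| * (|S| - 1) / 2; the objective form, the alpha value, and the toy graph are illustrative assumptions rather than the paper's exact definitions, algorithms, or guarantees.

def edge_surplus(nodes, adj, alpha):
    s = set(nodes)
    e = sum(1 for u in s for v in adj[u] if v in s) // 2
    return e - alpha * len(s) * (len(s) - 1) / 2

def greedy_peel(adj, alpha):
    nodes = set(adj)
    best, best_val = set(nodes), edge_surplus(nodes, adj, alpha)
    while len(nodes) > 2:
        # repeatedly drop the node of minimum degree inside the current set
        u = min(nodes, key=lambda n: sum(1 for v in adj[n] if v in nodes))
        nodes.remove(u)
        val = edge_surplus(nodes, adj, alpha)
        if val > best_val:
            best, best_val = set(nodes), val
    return best, best_val

# toy graph: a triangle {a, b, c} with a pendant node d
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(greedy_peel(adj, alpha=0.6))  # the triangle wins: compact and dense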
{"title":"Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees","authors":"Charalampos E. Tsourakakis, F. Bonchi, A. Gionis, Francesco Gullo, M. A. Tsiarli","doi":"10.1145/2487575.2487645","DOIUrl":"https://doi.org/10.1145/2487575.2487645","url":null,"abstract":"Finding dense subgraphs is an important graph-mining task with many applications. Given that the direct optimization of edge density is not meaningful, as even a single edge achieves maximum density, research has focused on optimizing alternative density functions. A very popular among such functions is the average degree, whose maximization leads to the well-known densest-subgraph notion. Surprisingly enough, however, densest subgraphs are typically large graphs, with small edge density and large diameter. In this paper, we define a novel density function, which gives subgraphs of much higher quality than densest subgraphs: the graphs found by our method are compact, dense, and with smaller diameter. We show that the proposed function can be derived from a general framework, which includes other important density functions as subcases and for which we show interesting general theoretical properties. To optimize the proposed function we provide an additive approximation algorithm and a local-search heuristic. Both algorithms are very efficient and scale well to large graphs. We evaluate our algorithms on real and synthetic datasets, and we also devise several application studies as variants of our original problem. When compared with the method that finds the subgraph of the largest average degree, our algorithms return denser subgraphs with smaller diameter. Finally, we discuss new interesting research directions that our problem leaves open.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75348551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 276
Predicting the present with search engine data
H. Varian
Many businesses now have almost real time data available about their operations. This data can be helpful in contemporaneous prediction ("nowcasting") of various economic indicators. We illustrate how one can use Google search data to nowcast economic metrics of interest, and discuss some of the ramifications for research and policy. Our approach combines three Bayesian techniques: Kalman filtering, spike-and-slab regression, and model averaging. We use Kalman filtering to whiten the time series in question by removing the trend and seasonal behavior. Spike-and-slab regression is a Bayesian method for variable selection that works even in cases where the number of predictors is far larger than the number of observations. Finally, we use Markov Chain Monte Carlo methods to sample from the posterior distribution for our model; the final forecast is an average over thousands of draws from the posterior. An advantage of the Bayesian approach is that it allows us to specify informative priors that affect the number and type of predictors in a flexible way.
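As a crude stand-in for the spike-and-slab and model-averaging machinery described above, the Python sketch below enumerates small predictor subsets, fits ordinary least squares, and BIC-weights the resulting nowcasts; the synthetic data, subset size, and weighting scheme are assumptions, not Varian's actual Bayesian procedure.

import itertools
import numpy as np

def bic_averaged_nowcast(X, y, x_new, max_vars=2):
    n = len(y)
    preds, bics = [], []
    for k in range(1, max_vars + 1):
        for cols in itertools.combinations(range(X.shape[1]), k):
            A = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            bics.append(n * np.log(rss / n) + (k + 1) * np.log(n))  # BIC of this subset
            preds.append(np.concatenate(([1.0], x_new[list(cols)])) @ beta)
    w = np.exp(-0.5 * (np.array(bics) - min(bics)))                 # BIC-based weights
    return float(np.dot(w, preds) / w.sum())

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5))                       # five search-query indices
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.1 * rng.standard_normal(40)
print(bic_averaged_nowcast(X, y, x_new=rng.standard_normal(5)))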
{"title":"Predicting the present with search engine data","authors":"H. Varian","doi":"10.1145/2487575.2492150","DOIUrl":"https://doi.org/10.1145/2487575.2492150","url":null,"abstract":"Many businesses now have almost real time data available about their operations. This data can be helpful in contemporaneous prediction (\"nowcasting\") of various economic indicators. We illustrate how one can use Google search data to nowcast economic metrics of interest, and discuss some of the ramifications for research and policy. Our approach combines three Bayesian techniques: Kalman filtering, spike-and-slab regression, and model averaging. We use Kalman filtering to whiten the time series in question by removing the trend and seasonal behavior. Spike-and-slab regression is a Bayesian method for variable selection that works even in cases where the number of predictors is far larger than the number of observations. Finally, we use Markov Chain Monte Carlo methods to sample from the posterior distribution for our model; the final forecast is an average over thousands of draws from the posterior. An advantage of the Bayesian approach is that it allows us to specify informative priors that affect the number and type of predictors in a flexible way.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86570633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Repetition-aware content placement in navigational networks
D. Erdös, Vatche Isahagian, Azer Bestavros, Evimaria Terzi
Arguably, the most effective technique to ensure wide adoption of a concept (or product) is by repeatedly exposing individuals to messages that reinforce the concept (or promote the product). Recognizing the role of repeated exposure to a message, in this paper we propose a novel framework for the effective placement of content: Given the navigational patterns of users in a network, e.g., web graph, hyperlinked corpus, or road network, and given a model of the relationship between content-adoption and frequency of exposition, we define the repetition-aware content-placement (RACP) problem as that of identifying the set of B nodes on which content should be placed so that the expected number of users adopting that content is maximized. The key contribution of our work is the introduction of memory into the navigation process, by making user conversion dependent on the number of her exposures to that content. This dependency is captured using a conversion model that is general enough to capture arbitrary dependencies. Our solution to this general problem builds upon the notion of absorbing random walks, which we extend appropriately in order to address the technicalities of our definitions. Although we show the RACP problem to be NP-hard, we propose a general and efficient algorithmic solution. Our experimental results demonstrate the efficacy and the efficiency of our methods in multiple real-world datasets obtained from different application domains.
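The quantity being optimized, the expected number of users who adopt after repeated exposures to content placed on a node set B, can be approximated by simulation as in the Python sketch below; the toy graph, walk length, start nodes, and conversion curve g are assumptions, and the paper's absorbing-random-walk formulation and placement algorithm are not reproduced.

import random

def expected_adopters(adj, placement, starts, g, steps=10, trials=2000):
    adopters = 0.0
    for _ in range(trials):
        for s in starts:                      # one simulated user per start node
            node, exposures = s, 0
            for _ in range(steps):
                if node in placement:
                    exposures += 1
                node = random.choice(adj[node])
            adopters += g(exposures)          # probability of adoption after the walk
    return adopters / trials

adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
g = lambda k: 1 - 0.6 ** k                    # more exposures -> higher conversion
print(expected_adopters(adj, placement={"c"}, starts=["a", "d"], g=g))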
{"title":"Repetition-aware content placement in navigational networks","authors":"D. Erdös, Vatche Isahagian, Azer Bestavros, Evimaria Terzi","doi":"10.1145/2487575.2487622","DOIUrl":"https://doi.org/10.1145/2487575.2487622","url":null,"abstract":"Arguably, the most effective technique to ensure wide adoption of a concept (or product) is by repeatedly exposing individuals to messages that reinforce the concept (or promote the product). Recognizing the role of repeated exposure to a message, in this paper we propose a novel framework for the effective placement of content: Given the navigational patterns of users in a network, e.g., web graph, hyperlinked corpus, or road network, and given a model of the relationship between content-adoption and frequency of exposition, we define the repetition-aware content-placement (RACP) problem as that of identifying the set of B nodes on which content should be placed so that the expected number of users adopting that content is maximized. The key contribution of our work is the introduction of memory into the navigation process, by making user conversion dependent on the number of her exposures to that content. This dependency is captured using a conversion model that is general enough to capture arbitrary dependencies. Our solution to this general problem builds upon the notion of absorbing random walks, which we extend appropriately in order to address the technicalities of our definitions. Although we show the RACP problem to be NP-hard, we propose a general and efficient algorithmic solution. Our experimental results demonstrate the efficacy and the efficiency of our methods in multiple real-world datasets obtained from different application domains.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84266584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1