LinkedIn is the largest professional social network in the world, with more than 238M members. It provides a platform for advertisers to reach professionals and target them using rich profile and behavioral data; online advertising is therefore an important business for LinkedIn. In this talk, I will give an overview of the machine learning and optimization components that power LinkedIn's self-serve display advertising systems. The talk will focus not only on machine learning and optimization methods, but also on the practical challenges that arise when running such components in a real production environment, and on how we overcome some of these challenges to bridge the gap between theory and practice. The major components described in detail include the following.

Response prediction: The goal of this component is to estimate the click-through rate (CTR) when an ad is shown to a user in a given context. Given the data sparseness caused by the low CTRs typical of advertising applications, and the curse of dimensionality, estimating such user-ad interactions is known to be challenging. Furthermore, the goal of the system is to maximize expected revenue, so this is an explore/exploit problem rather than a pure supervised learning problem. Our approach uses supervised learning to reduce dimensionality and couples it with classical explore/exploit schemes to balance the explore/exploit tradeoff. In particular, we use large-scale logistic regression to estimate user-ad interactions. These interactions comprise two additive terms: a) stable interactions, captured by user and ad features whose coefficients change slowly over time, and b) ephemeral interactions, which capture ad-specific residual idiosyncrasies missed by the stable component. Exploration is introduced via Thompson sampling on the ephemeral interactions (coefficients are sampled from their posterior distribution), since the stable part is estimated from large amounts of data and is subject to very little statistical variance. Our model training pipeline estimates the stable part using a scatter-and-gather approach via the ADMM algorithm; the ephemeral part is estimated more frequently by learning a per-ad correction through an ad-specific logistic regression. Scoring thousands of ads at runtime under tight latency constraints is a formidable challenge with such models; the talk will describe methods to scale these computations at runtime.

Automatic format selection: The presentation of ads in a given slot on a page has a significant impact on how users interact with them. Web designers are adept at creating good formats for ad display, but selecting the best among them automatically is a machine learning task. I will describe the machine learning approach we use to solve this problem. It is again an explore/exploit problem, but its dimensionality is much lower than that of the ad selection problem. I will also provide a detailed description of how we de
{"title":"Computational advertising: the linkedin way","authors":"D. Agarwal","doi":"10.1145/2505515.2514690","DOIUrl":"https://doi.org/10.1145/2505515.2514690","url":null,"abstract":"LinkedIn is the largest professional social network in the world with more than 238M members. It provides a platform for advertisers to reach out to professionals and target them using rich profile and behavioral data. Thus, online advertising is an important business for LinkedIn. In this talk, I will give an overview of machine learning and optimization components that power LinkedIn self-serve display advertising systems. The talk will not only focus on machine learning and optimization methods, but various practical challenges that arise when running such components in a real production environment. I will describe how we overcome some of these challenges to bridge the gap between theory and practice. The major components that will be described in details include Response prediction: The goal of this component is to estimate click-through rates (CTR) when an ad is shown to a user in a given context. Given the data sparseness due to low CTR for advertising applications in general and the curse of dimensionality, estimating such interactions is known to be a challenging. Furthermore, the goal of the system is to maximize expected revenue, hence this is an explore/exploit problem and not a supervised learning problem. Our approach takes recourse to supervised learning to reduce dimensionality and couples it with classical explore/exploit schemes to balance the explore/exploit tradeoff. In particular, we use a large scale logistic regression to estimate user and ad interactions. Such interactions are comprised of two additive terms a) stable interactions captured by using features for both users and ads whose coefficients change slowly over time, and b) ephemeral interactions that capture ad-specific residual idiosyncrasies that are missed by the stable component. Exploration is introduced via Thompson sampling on the ephemeral interactions (sample coefficients from the posterior distribution), since the stable part is estimated using large amounts of data and subject to very little statistical variance. Our model training pipeline estimates the stable part using a scatter and gather approach via the ADMM algorithm, ephemeral part is estimated more frequently by learning a per ad correction through an ad-specific logistic regression. Scoring thousands of ads at runtime under tight latency constraints is a formidable challenge when using such models, the talk will describe methods to scale such computations at runtime. Automatic Format Selection: The presentation of ads in a given slot on a page has a significant impact on how users interact with them. Web designers are adept at creating good formats to facilitate ad display but selecting the best among those automatically is a machine learning task. I will describe a machine learning approach we use to solve this problem. It is again an explore/exploit problem but the dimensionality of this problem is much less than the ad selection problem. 
I will also provide a detailed description of how we de","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91348892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
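To make the two-part CTR model concrete, the following is a minimal Python sketch (not LinkedIn's production system) of Thompson sampling over a per-ad "ephemeral" correction added to a fixed "stable" logistic score. The class name, the Gaussian posterior on the per-ad offset, and the crude online update are illustrative assumptions; the real pipeline fits the stable part with ADMM and the ephemeral part with ad-specific logistic regressions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ThompsonAdScorer:
    """Toy two-part CTR model: a fixed 'stable' logistic score per (user, ad)
    feature vector plus a per-ad 'ephemeral' correction with a Gaussian
    posterior that is sampled at serve time (Thompson sampling)."""

    def __init__(self, stable_weights, ad_ids, prior_var=1.0):
        self.w = np.asarray(stable_weights, dtype=float)
        # Posterior mean/variance of the per-ad log-odds offset (assumed Gaussian).
        self.mu = {a: 0.0 for a in ad_ids}
        self.var = {a: prior_var for a in ad_ids}

    def sampled_ctr(self, ad_id, features):
        stable = float(self.w @ features)                             # slowly varying part
        eph = rng.normal(self.mu[ad_id], np.sqrt(self.var[ad_id]))    # sampled correction
        return sigmoid(stable + eph)

    def pick_ad(self, candidate_ads, features, bids):
        # Rank by sampled expected revenue = sampled CTR * bid.
        return max(candidate_ads, key=lambda a: self.sampled_ctr(a, features) * bids[a])

    def update(self, ad_id, features, clicked, lr=0.1):
        # Crude online update of the per-ad correction from observed feedback;
        # a real pipeline would periodically refit an ad-specific logistic regression.
        p = sigmoid(float(self.w @ features) + self.mu[ad_id])
        self.mu[ad_id] += lr * (clicked - p)
        self.var[ad_id] = max(0.05, self.var[ad_id] * 0.99)  # shrink uncertainty as data accrues

# Usage with made-up ads, features and bids.
ads = ["ad1", "ad2", "ad3"]
scorer = ThompsonAdScorer(stable_weights=[0.4, -0.2, 0.1], ad_ids=ads)
x = np.array([1.0, 0.5, 2.0])
bids = {"ad1": 1.2, "ad2": 0.8, "ad3": 1.0}
chosen = scorer.pick_ad(ads, x, bids)
scorer.update(chosen, x, clicked=1)
print(chosen)
```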
Betweenness centrality is an important centrality measure widely used in social network analysis, route planning, and other areas. However, even for mid-sized networks, computing exact betweenness scores is practically intractable. In this paper, we propose a generic randomized framework for unbiased approximation of betweenness centrality. The framework can be instantiated with different sampling techniques, yielding a variety of methods. We discuss the conditions a promising sampling technique should satisfy to minimize the approximation error and present a sampling method that partially satisfies these conditions. Extensive experiments show the high efficiency and accuracy of the proposed method.
{"title":"An efficient algorithm for approximate betweenness centrality computation","authors":"Mostafa Haghir Chehreghani","doi":"10.1145/2505515.2507826","DOIUrl":"https://doi.org/10.1145/2505515.2507826","url":null,"abstract":"Betweenness centrality is an important centrality measure widely used in social network analysis, route planning etc. However, even for mid-size networks, it is practically intractable to compute exact betweenness scores. In this paper, we propose a generic randomized framework for unbiased approximation of betweenness centrality. The proposed framework can be adapted with different sampling techniques and give diverse methods. We discuss the conditions a promising sampling technique should satisfy to minimize the approximation error and present a sampling method partially satisfying the conditions. We perform extensive experiments and show the high efficiency and accuracy of the proposed method.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86914176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One of the biggest challenges software developers face when building real-world predictive applications with machine learning is the steep learning curve of data processing frameworks, learning algorithms and scalable system infrastructure. We present PredictionIO, an open source machine learning server with a step-by-step graphical user interface that lets developers (i) evaluate, compare and deploy scalable learning algorithms, (ii) tune algorithm hyperparameters manually or automatically and (iii) monitor model training status. The system also provides an Application Programming Interface (API) through which software applications send data and retrieve predictions. The whole PredictionIO infrastructure is horizontally scalable, with a distributed computing component based on Hadoop. The demonstration shows a live example and the workflow of building a real-world predictive application with PredictionIO's graphical user interface, from data collection, algorithm tuning and selection, model training and re-training to real-time prediction querying.
{"title":"PredictionIO: a distributed machine learning server for practical software development","authors":"Simon Chan, T. Stone, Kit Pang Szeto, Ka‐Hou Chan","doi":"10.1145/2505515.2508198","DOIUrl":"https://doi.org/10.1145/2505515.2508198","url":null,"abstract":"One of the biggest challenges for software developers to build real-world predictive applications with machine learning is the steep learning curve of data processing frameworks, learning algorithms and scalable system infrastructure. We present PredictionIO, an open source machine learning server that comes with a step-by-step graphical user interface for developers to (i) evaluate, compare and deploy scalable learning algorithms, (ii) tune hyperparameters of algorithms manually or automatically and (iii) evaluate model training status. The system also comes with an Application Programming Interface (API) to communicate with software applications for data collection and prediction retrieval. The whole infrastructure of PredictionIO is horizontally scalable with a distributed computing component based on Hadoop. The demonstration shows a live example and workflows of building real-world predictive applications with the graphical user interface of PredictionIO, from data collection, algorithm tuning and selection, model training and re-training to real-time prediction querying.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87249397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Triangle counting is a fundamental problem in various domains; it underlies the computation of the clustering coefficient, transitivity, triangular connectivity, trusses, and more. The problem has been studied extensively in the internal-memory setting, but those algorithms do not scale to enormous graphs. In recent years, MapReduce has emerged as a de facto standard framework for processing large data through parallel computing, and a MapReduce algorithm based on graph partitioning was proposed for this problem. However, that algorithm redundantly generates a large amount of intermediate data, which overloads the network and prolongs the processing time. In this paper, we propose a new algorithm, based on graph partitioning with a novel triangle classification idea, for counting the number of triangles in a graph. The algorithm substantially reduces duplication by classifying triangles into three types and processing each triangle differently according to its type. In our experiments, we compare the proposed algorithm with recent existing algorithms on both synthetic and real-world datasets composed of millions of nodes and billions of edges. The proposed algorithm outperforms the other algorithms in most cases; for a Twitter dataset, it is more than twice as fast as existing MapReduce algorithms. Moreover, the performance gap increases as the graph becomes larger and denser.
{"title":"An efficient MapReduce algorithm for counting triangles in a very large graph","authors":"Ha-Myung Park, C. Chung","doi":"10.1145/2505515.2505563","DOIUrl":"https://doi.org/10.1145/2505515.2505563","url":null,"abstract":"Triangle counting problem is one of the fundamental problem in various domains. The problem can be utilized for computation of clustering coefficient, transitivity, trianglular connectivity, trusses, etc. The problem have been extensively studied in internal memory but the algorithms are not scalable for enormous graphs. In recent years, the MapReduce has emerged as a de facto standard framework for processing large data through parallel computing. A MapReduce algorithm was proposed for the problem based on graph partitioning. However, the algorithm redundantly generates a large number of intermediate data that cause network overload and prolong the processing time. In this paper, we propose a new algorithm based on graph partitioning with a novel idea of triangle classification to count the number of triangles in a graph. The algorithm substantially reduces the duplication by classifying triangles into three types and processing each triangle differently according to its type. In the experiments, we compare the proposed algorithm with recent existing algorithms using both synthetic datasets and real-world datasets that are composed of millions of nodes and billions of edges. The proposed algorithm outperforms other algorithms in most cases. Especially, for a twitter dataset, the proposed algorithm is more than twice as fast as existing MapReduce algorithms. Moreover, the performance gap increases as the graph becomes larger and denser.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90566465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding the landscape of opinions on a given topic or issue is important for policy makers, sociologists, and intelligence analysts. The first step in this process is to retrieve relevant opinions. Discussion forums are potentially a good source of this information, but they come with a unique set of retrieval challenges. In this short paper, we test a range of existing techniques for forum retrieval and develop new retrieval models to differentiate between opinionated and factual forum posts. We demonstrate significant performance improvements over the baseline retrieval models, indicating that this is a promising avenue for further study.
{"title":"Retrieving opinions from discussion forums","authors":"Laura Dietz, Ziqi Wang, Samuel Huston, W. Bruce Croft","doi":"10.1145/2505515.2507861","DOIUrl":"https://doi.org/10.1145/2505515.2507861","url":null,"abstract":"Abstract Understanding the landscape of opinions on a given topic or issue is important for policy makers, sociologists, and intelligence analysts. The first step in this process is to retrieve relevant opinions. Discussion forums are potentially a good source of this information, but comes with a unique set of retrieval challenges. In this short paper, we test a range of existing techniques for forum retrieval and develop new retrieval models to differentiate between opinionated and factual forum posts. We are able to demonstrate some significant performance improvements over the baseline retrieval models, demonstrating that this as a promising avenue for further study.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85619324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Negated language is frequently used by medical practitioners to indicate that a patient does not have a given medical condition. Traditionally, information retrieval systems do not distinguish between the positive and negative contexts of terms when indexing documents. For example, when searching for patients with angina, a retrieval system might wrongly consider a patient whose medical record states ``no evidence of angina" to be relevant. While a retrieval system can be enhanced by taking the context of terms into account in the indexing representation of a document, some non-relevant medical records can still be ranked highly if they include some of the query terms, even in a context other than the intended one. In this paper, we propose a novel learning framework that effectively handles negated language. Based on features related to the positive and negative contexts of a term, the framework learns how to appropriately weight occurrences of the opposite context of any query term, thus preventing documents that may not be relevant from being retrieved. We thoroughly evaluate the proposed framework using the TREC 2011 and 2012 Medical Records track test collections. Our results show significant improvements over existing strong baselines. In addition, combined with traditional query expansion and a conceptual representation approach, our framework achieves retrieval effectiveness comparable to the best TREC 2011 and 2012 systems, while not addressing other challenges in medical records search, such as the exploitation of semantic relationships between medical terms.
{"title":"Learning to handle negated language in medical records search","authors":"Nut Limsopatham, C. Macdonald, I. Ounis","doi":"10.1145/2505515.2505706","DOIUrl":"https://doi.org/10.1145/2505515.2505706","url":null,"abstract":"Negated language is frequently used by medical practitioners to indicate that a patient does not have a given medical condition. Traditionally, information retrieval systems do not distinguish between the positive and negative contexts of terms when indexing documents. For example, when searching for patients with angina, a retrieval system might wrongly consider a patient with a medical record stating ``no evidence of angina\" to be relevant. While it is possible to enhance a retrieval system by taking into account the context of terms within the indexing representation of a document, some non-relevant medical records can still be ranked highly, if they include some of the query terms with the intended context. In this paper, we propose a novel learning framework that effectively handles negated language. Based on features related to the positive and negative contexts of a term, the framework learns how to appropriately weight the occurrences of the opposite context of any query term, thus preventing documents that may not be relevant from being retrieved. We thoroughly evaluate our proposed framework using the TREC 2011 and 2012 Medical Records track test collections. Our results show significant improvements over existing strong baselines. In addition, in combination with a traditional query expansion and a conceptual representation approach, our proposed framework could achieve a retrieval effectiveness comparable to the performance of the best TREC 2011 and 2012 systems, while not addressing other challenges in medical records search, such as the exploitation of semantic relationships between medical terms.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85694085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the rise in the amount of information being streamed across networks, there is a growing demand to vet its quality, type and content for purposes such as spam filtering, security and search. In this paper, we develop an energy-efficient, high-performance information filtering system capable of classifying a stream of incoming documents at high speed. The prototype parses the document stream on a multicore CPU and then performs classification on Field-Programmable Gate Arrays (FPGAs). On a large TREC data collection, we implemented a Naive Bayes classifier on our prototype and compared it to an optimized CPU-based baseline. Our empirical findings show that we can classify documents at 10 Gb/s, which is up to 94 times faster than the CPU baseline (and up to 5 times faster than previous FPGA-based implementations). In future work, we aim to increase the throughput by another order of magnitude by implementing both the parser and the filter on the FPGA.
{"title":"High throughput filtering using FPGA-acceleration","authors":"W. Vanderbauwhede, Anton Frolov, L. Azzopardi, S. R. Chalamalasetti, M. Margala","doi":"10.1145/2505515.2507866","DOIUrl":"https://doi.org/10.1145/2505515.2507866","url":null,"abstract":"With the rise in the amount information of being streamed across networks, there is a growing demand to vet the quality, type and content itself for various purposes such as spam, security and search. In this paper, we develop an energy-efficient high performance information filtering system that is capable of classifying a stream of incoming document at high speed. The prototype parses a stream of documents using a multicore CPU and then performs classification using Field-Programmable Gate Arrays (FPGAs). On a large TREC data collection, we implemented a Naive Bayes classifier on our prototype and compared it to an optimized CPU based-baseline. Our empirical findings show that we can classify documents at 10Gb/s which is up to 94 times faster than the CPU baseline (and up to 5 times faster than previous FPGA based implementations). In future work, we aim to increase the throughput by another order of magnitude by implementing both the parser and filter on the FPGA.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"85 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85975588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Microblogging services have revolutionized the way people exchange information. Confronted with ever-increasing numbers of microblogs carrying multimedia content and trending topics, it is desirable to provide visualized summaries that help users quickly grasp the essence of a topic. While existing work mostly focuses on text-based methods, summarization across multiple media types (e.g., text and image) is scarcely explored. In this paper, we propose a multimedia microblog summarization framework that automatically generates visualized summaries for trending topics. Specifically, a novel generative probabilistic model, termed multimodal LDA (MMLDA), is proposed to discover subtopics from microblogs by exploring the correlations among different media types. Based on the information obtained from MMLDA, a multimedia summarizer is designed to separately identify representative textual and visual samples and then form a comprehensive visualized summary. We conduct extensive experiments on a real-world Sina Weibo microblog dataset to demonstrate the superiority of the proposed method over state-of-the-art approaches.
{"title":"Multimedia summarization for trending topics in microblogs","authors":"Jingwen Bian, Yang Yang, Tat-Seng Chua","doi":"10.1145/2505515.2505652","DOIUrl":"https://doi.org/10.1145/2505515.2505652","url":null,"abstract":"Microblogging services have revolutionized the way people exchange information. Confronted with the ever-increasing numbers of microblogs with multimedia contents and trending topics, it is desirable to provide visualized summarization to help users to quickly grasp the essence of topics. While existing works mostly focus on text-based methods only, summarization of multiple media types (e.g., text and image) are scarcely explored. In this paper, we propose a multimedia microblog summarization framework to automatically generate visualized summaries for trending topics. Specifically, a novel generative probabilistic model, termed multimodal-LDA (MMLDA), is proposed to discover subtopics from microblogs by exploring the correlations among different media types. Based on the information achieved from MMLDA, a multimedia summarizer is designed to separately identify representative textual and visual samples and then form a comprehensive visualized summary. We conduct extensive experiments on a real-world Sina Weibo microblog dataset to demonstrate the superiority of our proposed method against the state-of-the-art approaches.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91031837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The widespread use and growing popularity of online collaborative content sites has created rich resources for users to consult when making purchasing decisions on items such as e-commerce products, restaurants, etc. Ideally, a user wants to decide quickly whether an item is desirable from the list of items returned for her search query. This creates new challenges for producers/manufacturers (e.g., Dell) and retailers (e.g., Amazon, eBay) of such items: composing succinct summaries of web item descriptions, henceforth referred to as snippets, that are likely to maximize the items' visibility among users. We exploit the availability of user feedback in collaborative content sites, in the form of tags, to identify the most important item attributes that must be highlighted in an item snippet. We investigate the problem of finding the top-k snippets for an item that are most likely to satisfy the user preference expressed in the search query. Since a search query returns multiple relevant items, we also study the problem of finding the best diverse set of snippets for those items, in order to maximize the probability of a user liking at least one of the top items. We develop an exact top-k algorithm for each of these problems and perform detailed experiments on synthetic and real data crawled from the web to demonstrate the utility of the problems and the effectiveness of our solutions.
{"title":"Generating informative snippet to maximize item visibility","authors":"Mahashweta Das, Habibur Rahman, Gautam Das, Vagelis Hristidis","doi":"10.1145/2505515.2505606","DOIUrl":"https://doi.org/10.1145/2505515.2505606","url":null,"abstract":"The widespread use and growing popularity of online collaborative content sites has created rich resources for users to consult in order to make purchasing decisions on various items such as e-commerce products, restaurants, etc. Ideally, a user wants to quickly decide whether an item is desirable, from the list of items returned as a result of her search query. This has created new challenges for producers/manufacturers (e.g., Dell) or retailers (e.g., Amazon, eBay) of such items to compose succinct summarizations of web item descriptions, henceforth referred to as snippets, that are likely to maximize the items' visibility among users. We exploit the availability of user feedback in collaborative content sites in the form of tags to identify the most important item attributes that must be highlighted in an item snippet. We investigate the problem of finding the top-k best snippets for an item that are likely to maximize the probability that the user preference (available in the form of search query) is satisfied. Since a search query returns multiple relevant items, we also study the problem of finding the best diverse set of snippets for the items in order to maximize the probability of a user liking at least one of the top items. We develop an exact top-k algorithm for each of the problem and perform detailed experiments on synthetic and real data crawled from the web to to demonstrate the utility of our problems and effectiveness of our solutions.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88776178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Capturing sets of closely related vertices in large networks is an essential task in many applications, such as social network analysis, bioinformatics, and web link research. Decomposing a graph into k-core components is a standard and efficient method for this task, but the resulting clusters might not be well connected. The idea of using maximal k-edge-connected subgraphs was recently proposed to address this issue. Although this idea yields better clusters, the state-of-the-art method is not efficient enough to process large networks with millions of vertices. In this paper, we propose a new method for decomposing a graph into maximal k-edge-connected components, based on random contraction of edges. Our method is simple to implement yet improves performance drastically. We show experimentally that it successfully decomposes large networks and is thousands of times faster than the previous method, and we explain theoretically why it is efficient in practice. To underline the importance of maximal k-edge-connected subgraphs, we also conduct experiments on real-world networks showing that many k-core components have small edge-connectivity and can be decomposed into many maximal k-edge-connected subgraphs.
{"title":"Linear-time enumeration of maximal K-edge-connected subgraphs in large networks by random contraction","authors":"Takuya Akiba, Yoichi Iwata, Yuichi Yoshida","doi":"10.1145/2505515.2505751","DOIUrl":"https://doi.org/10.1145/2505515.2505751","url":null,"abstract":"Capturing sets of closely related vertices from large networks is an essential task in many applications such as social network analysis, bioinformatics, and web link research. Decomposing a graph into k-core components is a standard and efficient method for this task, but obtained clusters might not be well-connected. The idea of using maximal k-edge-connected subgraphs was recently proposed to address this issue. Although we can obtain better clusters with this idea, the state-of-the-art method is not efficient enough to process large networks with millions of vertices. In this paper, we propose a new method to decompose a graph into maximal k-edge-connected components, based on random contraction of edges. Our method is simple to implement but improves performance drastically. We experimentally show that our method can successfully decompose large networks and it is thousands times faster than the previous method. Also, we theoretically explain why our method is efficient in practice. To see the importance of maximal k-edge-connected subgraphs, we also conduct experiments using real-world networks to show that many k-core components have small edge-connectivity and they can be decomposed into a lot of maximal k-edge-connected subgraphs.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89051099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}