
Proceedings of the 22nd ACM international conference on Information & Knowledge Management (CIKM '13): latest publications

Computational advertising: the linkedin way
D. Agarwal
LinkedIn is the largest professional social network in the world, with more than 238M members. It provides a platform for advertisers to reach out to professionals and target them using rich profile and behavioral data. Online advertising is thus an important business for LinkedIn. In this talk, I will give an overview of the machine learning and optimization components that power LinkedIn's self-serve display advertising systems. The talk will focus not only on machine learning and optimization methods, but also on the various practical challenges that arise when running such components in a real production environment. I will describe how we overcome some of these challenges to bridge the gap between theory and practice. The major components described in detail include:

Response prediction: The goal of this component is to estimate click-through rates (CTR) when an ad is shown to a user in a given context. Given the data sparseness caused by the low CTRs typical of advertising applications and the curse of dimensionality, estimating such interactions is known to be challenging. Furthermore, the goal of the system is to maximize expected revenue, so this is an explore/exploit problem, not a supervised learning problem. Our approach takes recourse to supervised learning to reduce dimensionality and couples it with classical explore/exploit schemes to balance the explore/exploit tradeoff. In particular, we use a large-scale logistic regression to estimate user-ad interactions. Such interactions comprise two additive terms: a) stable interactions, captured by features of both users and ads whose coefficients change slowly over time; and b) ephemeral interactions, which capture ad-specific residual idiosyncrasies missed by the stable component. Exploration is introduced via Thompson sampling on the ephemeral interactions (sampling coefficients from the posterior distribution), since the stable part is estimated from large amounts of data and is subject to very little statistical variance. Our model training pipeline estimates the stable part with a scatter-and-gather approach via the ADMM algorithm; the ephemeral part is estimated more frequently by learning a per-ad correction through an ad-specific logistic regression. Scoring thousands of ads at runtime under tight latency constraints is a formidable challenge with such models, and the talk will describe methods to scale these computations at runtime.

Automatic format selection: The presentation of ads in a given slot on a page has a significant impact on how users interact with them. Web designers are adept at creating good formats to facilitate ad display, but selecting the best among those automatically is a machine learning task. I will describe the machine learning approach we use to solve this problem. It is again an explore/exploit problem, but its dimensionality is much lower than that of the ad selection problem. I will also provide a detailed description of how we deal with budget pacing, bid forecasting, supply forecasting, and targeting. Throughout, the machine learning components will be illustrated with real examples from production, and evaluation metrics will be reported from live tests. Offline metrics used to evaluate methods before launching them to live traffic will also be discussed.
{"title":"Computational advertising: the linkedin way","authors":"D. Agarwal","doi":"10.1145/2505515.2514690","DOIUrl":"https://doi.org/10.1145/2505515.2514690","url":null,"abstract":"LinkedIn is the largest professional social network in the world with more than 238M members. It provides a platform for advertisers to reach out to professionals and target them using rich profile and behavioral data. Thus, online advertising is an important business for LinkedIn. In this talk, I will give an overview of machine learning and optimization components that power LinkedIn self-serve display advertising systems. The talk will not only focus on machine learning and optimization methods, but various practical challenges that arise when running such components in a real production environment. I will describe how we overcome some of these challenges to bridge the gap between theory and practice. The major components that will be described in details include Response prediction: The goal of this component is to estimate click-through rates (CTR) when an ad is shown to a user in a given context. Given the data sparseness due to low CTR for advertising applications in general and the curse of dimensionality, estimating such interactions is known to be a challenging. Furthermore, the goal of the system is to maximize expected revenue, hence this is an explore/exploit problem and not a supervised learning problem. Our approach takes recourse to supervised learning to reduce dimensionality and couples it with classical explore/exploit schemes to balance the explore/exploit tradeoff. In particular, we use a large scale logistic regression to estimate user and ad interactions. Such interactions are comprised of two additive terms a) stable interactions captured by using features for both users and ads whose coefficients change slowly over time, and b) ephemeral interactions that capture ad-specific residual idiosyncrasies that are missed by the stable component. Exploration is introduced via Thompson sampling on the ephemeral interactions (sample coefficients from the posterior distribution), since the stable part is estimated using large amounts of data and subject to very little statistical variance. Our model training pipeline estimates the stable part using a scatter and gather approach via the ADMM algorithm, ephemeral part is estimated more frequently by learning a per ad correction through an ad-specific logistic regression. Scoring thousands of ads at runtime under tight latency constraints is a formidable challenge when using such models, the talk will describe methods to scale such computations at runtime. Automatic Format Selection: The presentation of ads in a given slot on a page has a significant impact on how users interact with them. Web designers are adept at creating good formats to facilitate ad display but selecting the best among those automatically is a machine learning task. I will describe a machine learning approach we use to solve this problem. It is again an explore/exploit problem but the dimensionality of this problem is much less than the ad selection problem. 
I will also provide a detailed description of how we de","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91348892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
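The response-prediction component described above combines a slowly retrained stable model with per-ad ephemeral corrections explored via Thompson sampling. The sketch below illustrates that idea in miniature; the Gaussian posterior, one-coefficient-per-ad layout, and online Laplace-style update are simplifying assumptions for exposition, not LinkedIn's production design.

```python
import numpy as np

class ThompsonCTRModel:
    """Toy CTR model: a fixed 'stable' logit per (user, ad) pair plus a
    per-ad 'ephemeral' correction explored with Thompson sampling.
    (Illustrative sketch; priors and updates are simplified assumptions.)"""

    def __init__(self, n_ads, prior_var=1.0):
        # Gaussian posterior over each ad's ephemeral correction term.
        self.mean = np.zeros(n_ads)
        self.var = np.full(n_ads, prior_var)

    def score(self, ad_ids, stable_logits):
        # Sample a correction for each candidate ad from its posterior,
        # add it to the stable logit, and convert to a CTR estimate.
        sampled = np.random.normal(self.mean[ad_ids], np.sqrt(self.var[ad_ids]))
        return 1.0 / (1.0 + np.exp(-(stable_logits + sampled)))

    def update(self, ad_id, stable_logit, clicked):
        # One online Laplace-style update of the per-ad posterior.
        p = 1.0 / (1.0 + np.exp(-(stable_logit + self.mean[ad_id])))
        grad = (1.0 if clicked else 0.0) - p
        hess = p * (1.0 - p)
        self.var[ad_id] = 1.0 / (1.0 / self.var[ad_id] + hess)
        self.mean[ad_id] += self.var[ad_id] * grad

# Usage: pick the highest sampled-CTR ad, observe a click, update.
model = ThompsonCTRModel(n_ads=1000)
candidates = np.array([3, 17, 42])
ctrs = model.score(candidates, stable_logits=np.array([-4.0, -3.5, -4.2]))
chosen = candidates[int(np.argmax(ctrs))]
model.update(chosen, stable_logit=-3.5, clicked=True)
```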
An efficient algorithm for approximate betweenness centrality computation
Mostafa Haghir Chehreghani
Betweenness centrality is an important centrality measure widely used in social network analysis, route planning, etc. However, even for mid-size networks, computing exact betweenness scores is practically intractable. In this paper, we propose a generic randomized framework for unbiased approximation of betweenness centrality. The proposed framework can be instantiated with different sampling techniques, yielding diverse methods. We discuss the conditions a promising sampling technique should satisfy to minimize the approximation error and present a sampling method that partially satisfies them. We perform extensive experiments and show the high efficiency and accuracy of the proposed method.
{"title":"An efficient algorithm for approximate betweenness centrality computation","authors":"Mostafa Haghir Chehreghani","doi":"10.1145/2505515.2507826","DOIUrl":"https://doi.org/10.1145/2505515.2507826","url":null,"abstract":"Betweenness centrality is an important centrality measure widely used in social network analysis, route planning etc. However, even for mid-size networks, it is practically intractable to compute exact betweenness scores. In this paper, we propose a generic randomized framework for unbiased approximation of betweenness centrality. The proposed framework can be adapted with different sampling techniques and give diverse methods. We discuss the conditions a promising sampling technique should satisfy to minimize the approximation error and present a sampling method partially satisfying the conditions. We perform extensive experiments and show the high efficiency and accuracy of the proposed method.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86914176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
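A standard way to instantiate such a framework is to sample source vertices uniformly and accumulate Brandes-style shortest-path dependencies, scaling by n over the number of samples to keep the estimate unbiased. The sketch below implements that generic source-sampling estimator; it is one possible sampling technique, not necessarily the one the paper advocates.

```python
import random
from collections import deque, defaultdict

def approx_betweenness(adj, n_samples, seed=0):
    """Unbiased betweenness estimates for an unweighted graph by sampling
    source vertices and accumulating Brandes-style dependencies.
    adj: dict mapping each vertex to a list of neighbours."""
    rng = random.Random(seed)
    nodes = list(adj)
    bc = defaultdict(float)
    for _ in range(n_samples):
        s = rng.choice(nodes)
        # BFS from s, counting shortest paths (sigma) and predecessors.
        sigma, dist = {s: 1.0}, {s: 0}
        preds = defaultdict(list)
        order = []
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] = sigma.get(w, 0.0) + sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in reverse BFS order (Brandes' recursion).
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Scale so the expectation matches exact betweenness.
    n = len(nodes)
    return {v: bc[v] * n / n_samples for v in nodes}
```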
PredictionIO: a distributed machine learning server for practical software development
Simon Chan, T. Stone, Kit Pang Szeto, Ka‐Hou Chan
One of the biggest challenges software developers face in building real-world predictive applications with machine learning is the steep learning curve of data processing frameworks, learning algorithms and scalable system infrastructure. We present PredictionIO, an open source machine learning server that comes with a step-by-step graphical user interface for developers to (i) evaluate, compare and deploy scalable learning algorithms, (ii) tune hyperparameters of algorithms manually or automatically and (iii) evaluate model training status. The system also comes with an Application Programming Interface (API) through which software applications send data and retrieve predictions. The whole infrastructure of PredictionIO is horizontally scalable, with a distributed computing component based on Hadoop. The demonstration shows a live example and workflows of building real-world predictive applications with the graphical user interface of PredictionIO, from data collection, algorithm tuning and selection, model training and re-training, to real-time prediction querying.
{"title":"PredictionIO: a distributed machine learning server for practical software development","authors":"Simon Chan, T. Stone, Kit Pang Szeto, Ka‐Hou Chan","doi":"10.1145/2505515.2508198","DOIUrl":"https://doi.org/10.1145/2505515.2508198","url":null,"abstract":"One of the biggest challenges for software developers to build real-world predictive applications with machine learning is the steep learning curve of data processing frameworks, learning algorithms and scalable system infrastructure. We present PredictionIO, an open source machine learning server that comes with a step-by-step graphical user interface for developers to (i) evaluate, compare and deploy scalable learning algorithms, (ii) tune hyperparameters of algorithms manually or automatically and (iii) evaluate model training status. The system also comes with an Application Programming Interface (API) to communicate with software applications for data collection and prediction retrieval. The whole infrastructure of PredictionIO is horizontally scalable with a distributed computing component based on Hadoop. The demonstration shows a live example and workflows of building real-world predictive applications with the graphical user interface of PredictionIO, from data collection, algorithm tuning and selection, model training and re-training to real-time prediction querying.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87249397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
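A flavor of the client-side workflow the abstract describes (collect events, then query for predictions) is sketched below over HTTP. The server URL, endpoint paths, payload fields, and access-key parameter are hypothetical placeholders used for illustration; consult PredictionIO's own documentation for the real API.

```python
import requests

SERVER = "http://localhost:8000"   # hypothetical server address
ACCESS_KEY = "YOUR_ACCESS_KEY"     # placeholder credential

# Send one user-views-item event to the (assumed) event-collection endpoint.
event = {
    "event": "view",
    "entityType": "user",
    "entityId": "u1",
    "targetEntityType": "item",
    "targetEntityId": "i42",
}
requests.post(f"{SERVER}/events.json",
              params={"accessKey": ACCESS_KEY}, json=event, timeout=5)

# Query the (assumed) prediction endpoint for recommendations.
resp = requests.post(f"{SERVER}/queries.json",
                     json={"user": "u1", "num": 5}, timeout=5)
print(resp.json())
```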
An efficient MapReduce algorithm for counting triangles in a very large graph
Ha-Myung Park, C. Chung
The triangle counting problem is one of the fundamental problems in various domains. It can be utilized for computing the clustering coefficient, transitivity, triangular connectivity, trusses, etc. The problem has been extensively studied in the internal-memory setting, but those algorithms do not scale to enormous graphs. In recent years, MapReduce has emerged as a de facto standard framework for processing large data through parallel computing. A MapReduce algorithm based on graph partitioning was previously proposed for this problem. However, that algorithm redundantly generates a large amount of intermediate data, which overloads the network and prolongs the processing time. In this paper, we propose a new algorithm based on graph partitioning with a novel idea of triangle classification to count the number of triangles in a graph. The algorithm substantially reduces duplication by classifying triangles into three types and processing each triangle differently according to its type. In the experiments, we compare the proposed algorithm with recent existing algorithms using both synthetic datasets and real-world datasets composed of millions of nodes and billions of edges. The proposed algorithm outperforms the other algorithms in most cases. In particular, for a Twitter dataset, the proposed algorithm is more than twice as fast as existing MapReduce algorithms. Moreover, the performance gap increases as the graph becomes larger and denser.
{"title":"An efficient MapReduce algorithm for counting triangles in a very large graph","authors":"Ha-Myung Park, C. Chung","doi":"10.1145/2505515.2505563","DOIUrl":"https://doi.org/10.1145/2505515.2505563","url":null,"abstract":"Triangle counting problem is one of the fundamental problem in various domains. The problem can be utilized for computation of clustering coefficient, transitivity, trianglular connectivity, trusses, etc. The problem have been extensively studied in internal memory but the algorithms are not scalable for enormous graphs. In recent years, the MapReduce has emerged as a de facto standard framework for processing large data through parallel computing. A MapReduce algorithm was proposed for the problem based on graph partitioning. However, the algorithm redundantly generates a large number of intermediate data that cause network overload and prolong the processing time. In this paper, we propose a new algorithm based on graph partitioning with a novel idea of triangle classification to count the number of triangles in a graph. The algorithm substantially reduces the duplication by classifying triangles into three types and processing each triangle differently according to its type. In the experiments, we compare the proposed algorithm with recent existing algorithms using both synthetic datasets and real-world datasets that are composed of millions of nodes and billions of edges. The proposed algorithm outperforms other algorithms in most cases. Especially, for a twitter dataset, the proposed algorithm is more than twice as fast as existing MapReduce algorithms. Moreover, the performance gap increases as the graph becomes larger and denser.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90566465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 75
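The core trick is to classify each triangle by the set of vertex partitions it spans, so that exactly one "reducer" (one triple of partitions) is responsible for counting it. The sketch below mimics that classification on a single machine; the hash partitioning and the lexicographic dedup rule are illustrative choices standing in for the paper's MapReduce rounds.

```python
from itertools import combinations

def part_of(v, p):
    return hash(v) % p  # illustrative hash partitioning

def count_triangles(edges, p=3):
    """Single-machine mock-up of partition-based triangle counting:
    each 'reducer' sees the subgraph induced by a triple of vertex
    partitions and counts a triangle only in the lexicographically
    smallest triple covering it, so every triangle is counted once."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    total = 0
    for triple in combinations(range(p), 3):
        allowed = set(triple)
        # Subgraph induced by the vertices falling in this triple of parts.
        sub = {v: {w for w in nbrs if part_of(w, p) in allowed}
               for v, nbrs in adj.items() if part_of(v, p) in allowed}
        for u in sub:
            for v, w in combinations(sorted(sub[u]), 2):
                if u < v and w in sub.get(v, set()):
                    spanned = {part_of(x, p) for x in (u, v, w)}
                    # Dedupe: extend the spanned parts with the smallest
                    # unused parts; count only in that canonical triple.
                    smallest = tuple(sorted(spanned | set(
                        sorted(set(range(p)) - spanned)[:3 - len(spanned)])))
                    if smallest == triple:
                        total += 1
    return total

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]
print(count_triangles(edges, p=3))  # -> 2
```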
Retrieving opinions from discussion forums
Laura Dietz, Ziqi Wang, Samuel Huston, W. Bruce Croft
Understanding the landscape of opinions on a given topic or issue is important for policy makers, sociologists, and intelligence analysts. The first step in this process is to retrieve relevant opinions. Discussion forums are potentially a good source of this information, but they come with a unique set of retrieval challenges. In this short paper, we test a range of existing techniques for forum retrieval and develop new retrieval models to differentiate between opinionated and factual forum posts. We demonstrate significant performance improvements over the baseline retrieval models, showing that this is a promising avenue for further study.
{"title":"Retrieving opinions from discussion forums","authors":"Laura Dietz, Ziqi Wang, Samuel Huston, W. Bruce Croft","doi":"10.1145/2505515.2507861","DOIUrl":"https://doi.org/10.1145/2505515.2507861","url":null,"abstract":"Abstract Understanding the landscape of opinions on a given topic or issue is important for policy makers, sociologists, and intelligence analysts. The first step in this process is to retrieve relevant opinions. Discussion forums are potentially a good source of this information, but comes with a unique set of retrieval challenges. In this short paper, we test a range of existing techniques for forum retrieval and develop new retrieval models to differentiate between opinionated and factual forum posts. We are able to demonstrate some significant performance improvements over the baseline retrieval models, demonstrating that this as a promising avenue for further study.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85619324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Learning to handle negated language in medical records search
Nut Limsopatham, C. Macdonald, I. Ounis
Negated language is frequently used by medical practitioners to indicate that a patient does not have a given medical condition. Traditionally, information retrieval systems do not distinguish between the positive and negative contexts of terms when indexing documents. For example, when searching for patients with angina, a retrieval system might wrongly consider a patient whose medical record states "no evidence of angina" to be relevant. While it is possible to enhance a retrieval system by taking into account the context of terms within the indexing representation of a document, some non-relevant medical records can still be ranked highly if they include some of the query terms in the intended context. In this paper, we propose a novel learning framework that effectively handles negated language. Based on features related to the positive and negative contexts of a term, the framework learns how to appropriately weight occurrences of the opposite context of any query term, thus preventing documents that may not be relevant from being retrieved. We thoroughly evaluate our proposed framework using the TREC 2011 and 2012 Medical Records track test collections. Our results show significant improvements over existing strong baselines. In addition, in combination with traditional query expansion and a conceptual representation approach, our proposed framework achieves retrieval effectiveness comparable to the best TREC 2011 and 2012 systems, while not addressing other challenges in medical records search, such as the exploitation of semantic relationships between medical terms.
{"title":"Learning to handle negated language in medical records search","authors":"Nut Limsopatham, C. Macdonald, I. Ounis","doi":"10.1145/2505515.2505706","DOIUrl":"https://doi.org/10.1145/2505515.2505706","url":null,"abstract":"Negated language is frequently used by medical practitioners to indicate that a patient does not have a given medical condition. Traditionally, information retrieval systems do not distinguish between the positive and negative contexts of terms when indexing documents. For example, when searching for patients with angina, a retrieval system might wrongly consider a patient with a medical record stating ``no evidence of angina\" to be relevant. While it is possible to enhance a retrieval system by taking into account the context of terms within the indexing representation of a document, some non-relevant medical records can still be ranked highly, if they include some of the query terms with the intended context. In this paper, we propose a novel learning framework that effectively handles negated language. Based on features related to the positive and negative contexts of a term, the framework learns how to appropriately weight the occurrences of the opposite context of any query term, thus preventing documents that may not be relevant from being retrieved. We thoroughly evaluate our proposed framework using the TREC 2011 and 2012 Medical Records track test collections. Our results show significant improvements over existing strong baselines. In addition, in combination with a traditional query expansion and a conceptual representation approach, our proposed framework could achieve a retrieval effectiveness comparable to the performance of the best TREC 2011 and 2012 systems, while not addressing other challenges in medical records search, such as the exploitation of semantic relationships between medical terms.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85694085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
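One simple way to keep negated mentions from matching positive queries, in the spirit of the problem statement above (though far simpler than the paper's learned weighting), is to rewrite terms that fall inside the scope of a negation trigger before indexing. The trigger list and fixed scope window below are illustrative assumptions:

```python
import re

# A few NegEx-style negation triggers; a real list is much longer.
# Longer triggers come first so the alternation prefers them.
NEG_TRIGGERS = re.compile(r"\b(no evidence of|denies|without|no|not)\b", re.I)

def negation_aware_tokens(text, scope=4):
    """Tokenise a clinical sentence and prefix terms within `scope` words
    after a negation trigger with 'NEG_', so 'no evidence of angina'
    indexes as NEG_angina rather than angina. (Illustrative heuristic.)"""
    tokens = text.lower().split()
    out, negated_left, i = [], 0, 0
    while i < len(tokens):
        rest = " ".join(tokens[i:])
        m = NEG_TRIGGERS.match(rest)
        if m:
            negated_left = scope
            i += len(m.group(0).split())   # skip the trigger itself
            continue
        tok = tokens[i]
        out.append(("NEG_" + tok) if negated_left > 0 else tok)
        negated_left = max(0, negated_left - 1)
        i += 1
    return out

print(negation_aware_tokens("Patient reports no evidence of angina today"))
# -> ['patient', 'reports', 'NEG_angina', 'NEG_today']
```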
High throughput filtering using FPGA-acceleration
W. Vanderbauwhede, Anton Frolov, L. Azzopardi, S. R. Chalamalasetti, M. Margala
With the rise in the amount of information being streamed across networks, there is a growing demand to vet its quality, type and content for purposes such as spam filtering, security and search. In this paper, we develop an energy-efficient, high-performance information filtering system that is capable of classifying a stream of incoming documents at high speed. The prototype parses a stream of documents using a multicore CPU and then performs classification using Field-Programmable Gate Arrays (FPGAs). On a large TREC data collection, we implemented a Naive Bayes classifier on our prototype and compared it to an optimized CPU-based baseline. Our empirical findings show that we can classify documents at 10Gb/s, which is up to 94 times faster than the CPU baseline (and up to 5 times faster than previous FPGA-based implementations). In future work, we aim to increase the throughput by another order of magnitude by implementing both the parser and the filter on the FPGA.
{"title":"High throughput filtering using FPGA-acceleration","authors":"W. Vanderbauwhede, Anton Frolov, L. Azzopardi, S. R. Chalamalasetti, M. Margala","doi":"10.1145/2505515.2507866","DOIUrl":"https://doi.org/10.1145/2505515.2507866","url":null,"abstract":"With the rise in the amount information of being streamed across networks, there is a growing demand to vet the quality, type and content itself for various purposes such as spam, security and search. In this paper, we develop an energy-efficient high performance information filtering system that is capable of classifying a stream of incoming document at high speed. The prototype parses a stream of documents using a multicore CPU and then performs classification using Field-Programmable Gate Arrays (FPGAs). On a large TREC data collection, we implemented a Naive Bayes classifier on our prototype and compared it to an optimized CPU based-baseline. Our empirical findings show that we can classify documents at 10Gb/s which is up to 94 times faster than the CPU baseline (and up to 5 times faster than previous FPGA based implementations). In future work, we aim to increase the throughput by another order of magnitude by implementing both the parser and filter on the FPGA.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"85 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85975588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
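Naive Bayes suits a streaming filter because scoring a document reduces to one accumulation of per-term log-probabilities, exactly the kind of loop hardware parallelizes well. Below is a minimal multinomial Naive Bayes of the sort such a pipeline would accelerate; the toy corpus and vocabulary are invented for illustration:

```python
import math
from collections import Counter

def train_nb(docs_by_class, vocab):
    """Multinomial Naive Bayes with add-one smoothing.
    docs_by_class: {label: [list-of-token-lists]}."""
    model = {}
    n_docs = sum(len(d) for d in docs_by_class.values())
    for label, docs in docs_by_class.items():
        counts = Counter(tok for doc in docs for tok in doc)
        total = sum(counts.values())
        model[label] = (
            math.log(len(docs) / n_docs),                       # log prior
            {t: math.log((counts[t] + 1) / (total + len(vocab)))
             for t in vocab},                                    # log likelihoods
        )
    return model

def classify(model, doc):
    # Score = log prior + sum of per-term log likelihoods (an FPGA-friendly
    # accumulate loop); unknown terms are simply skipped in this sketch.
    def score(label):
        prior, loglik = model[label]
        return prior + sum(loglik[t] for t in doc if t in loglik)
    return max(model, key=score)

vocab = {"buy", "cheap", "meeting", "report"}
model = train_nb({"spam": [["buy", "cheap"]],
                  "ham": [["meeting", "report"]]}, vocab)
print(classify(model, ["cheap", "cheap", "meeting"]))  # -> spam
```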
Multimedia summarization for trending topics in microblogs
Jingwen Bian, Yang Yang, Tat-Seng Chua
Microblogging services have revolutionized the way people exchange information. Confronted with the ever-increasing number of microblogs with multimedia content and trending topics, it is desirable to provide visualized summarization to help users quickly grasp the essence of a topic. While existing works mostly focus on text-based methods, summarization across multiple media types (e.g., text and image) is scarcely explored. In this paper, we propose a multimedia microblog summarization framework to automatically generate visualized summaries for trending topics. Specifically, a novel generative probabilistic model, termed multimodal-LDA (MMLDA), is proposed to discover subtopics from microblogs by exploring the correlations among different media types. Based on the information obtained from MMLDA, a multimedia summarizer is designed to separately identify representative textual and visual samples and then form a comprehensive visualized summary. We conduct extensive experiments on a real-world Sina Weibo microblog dataset to demonstrate the superiority of our proposed method against state-of-the-art approaches.
{"title":"Multimedia summarization for trending topics in microblogs","authors":"Jingwen Bian, Yang Yang, Tat-Seng Chua","doi":"10.1145/2505515.2505652","DOIUrl":"https://doi.org/10.1145/2505515.2505652","url":null,"abstract":"Microblogging services have revolutionized the way people exchange information. Confronted with the ever-increasing numbers of microblogs with multimedia contents and trending topics, it is desirable to provide visualized summarization to help users to quickly grasp the essence of topics. While existing works mostly focus on text-based methods only, summarization of multiple media types (e.g., text and image) are scarcely explored. In this paper, we propose a multimedia microblog summarization framework to automatically generate visualized summaries for trending topics. Specifically, a novel generative probabilistic model, termed multimodal-LDA (MMLDA), is proposed to discover subtopics from microblogs by exploring the correlations among different media types. Based on the information achieved from MMLDA, a multimedia summarizer is designed to separately identify representative textual and visual samples and then form a comprehensive visualized summary. We conduct extensive experiments on a real-world Sina Weibo microblog dataset to demonstrate the superiority of our proposed method against the state-of-the-art approaches.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91031837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 69
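Downstream of the topic model, the selection step can be as simple as ranking items by how strongly they load on each subtopic and keeping the top few of each media type. The sketch below shows that step only; the text-topic and image-topic matrices are assumed inputs standing in for MMLDA's output, not an implementation of the model itself:

```python
import numpy as np

def pick_representatives(text_topics, image_topics, k_text=2, k_img=1):
    """Select the k most representative texts and images per subtopic.
    text_topics:  (n_texts,  n_topics) row-normalised topic proportions
    image_topics: (n_images, n_topics) row-normalised topic proportions
    Returns {topic: (text indices, image indices)}."""
    summary = {}
    for z in range(text_topics.shape[1]):
        top_texts = np.argsort(-text_topics[:, z])[:k_text]
        top_imgs = np.argsort(-image_topics[:, z])[:k_img]
        summary[z] = (top_texts.tolist(), top_imgs.tolist())
    return summary

# Toy stand-in for MMLDA output: 4 texts, 3 images, 2 subtopics.
texts = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
imgs = np.array([[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]])
print(pick_representatives(texts, imgs))
```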
Exploiting trustors as well as trustees in trust-based recommendation
Won-Seok Hwang, Shaoyu Li, Sang-Wook Kim, Ho‐Jin Choi
In a trust network, two users who are connected by a trust relationship tend to have similar interests. Based on this observation, existing trust-aware recommendation methods predict ratings for a target user on unseen items by referring to the ratings of users who are reachable from the target user in the forward direction of the trustor-trustee relationship through the trust network. However, these methods have overlooked the possibility of utilizing the ratings of users reachable in the backward direction, who may also have similar interests. In this paper, we investigate this possibility by identifying such users and adding them to the existing methods when predicting ratings for the target user. We perform a series of experiments and observe that our approach improves coverage while preserving accuracy.
{"title":"Exploiting trustors as well as trustees in trust-based recommendation","authors":"Won-Seok Hwang, Shaoyu Li, Sang-Wook Kim, Ho‐Jin Choi","doi":"10.1145/2505515.2507889","DOIUrl":"https://doi.org/10.1145/2505515.2507889","url":null,"abstract":"In a trust network, two users who are connected by a trust relationship tend to have similar interests. Based on this observation, existing trust-aware recommendation methods predict ratings for a target user on unseen items by referencing to ratings of those users who are reachable from the target user in the forward direction of trustor-trustee relationship through the trust network. However, these methods have overlooked the possibility of utilizing the ratings of those users reachable in the backward direction, which may also have similar interests. In this paper, we investigate this possibility by identifying and adding these users to the existing methods when predicting ratings for the target user. We perform a series of experiments and observe that our approach improves the coverage while preserving the accuracy.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73031398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
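The paper's key change is to pool ratings from in-neighbors (trustors) as well as out-neighbors (trustees) of the target user in the trust graph. A minimal sketch of one-hop prediction with uniform weights (an illustrative simplification of trust-aware recommenders):

```python
def predict_rating(user, item, trusts, ratings):
    """Predict user's rating of item from the ratings of both trustees
    (users this user trusts) and trustors (users who trust this user).
    trusts: set of directed (a, b) pairs meaning 'a trusts b'.
    ratings: {(user, item): rating}. Uniform weighting for simplicity."""
    trustees = {b for a, b in trusts if a == user}
    trustors = {a for a, b in trusts if b == user}
    neighbours = trustees | trustors
    known = [ratings[(n, item)] for n in neighbours if (n, item) in ratings]
    return sum(known) / len(known) if known else None

trusts = {("alice", "bob"), ("carol", "alice")}
ratings = {("bob", "film1"): 4.0, ("carol", "film1"): 5.0}
print(predict_rating("alice", "film1", trusts, ratings))  # -> 4.5
```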
UNIK: unsupervised social network spam detection
Enhua Tan, Lei Guo, Songqing Chen, Xiaodong Zhang, Y. Zhao
Social network spam has increased explosively with the rapid development and wide usage of various social networks on the Internet. To detect spam in large social network sites in a timely manner, it is desirable to discover unsupervised schemes that avoid the training cost of supervised schemes. In this work, we first show several limitations of existing unsupervised detection schemes. The main reason behind these limitations is that existing schemes rely heavily on spamming patterns, which constantly change to avoid detection. Motivated by our observations, we first propose a sybil-defense-based spam detection scheme, SD2, which remarkably outperforms existing schemes by taking the social network relationship into consideration. To make detection highly robust against increased levels of spam attacks, we further design an unsupervised spam detection scheme called UNIK. Instead of detecting spammers directly, UNIK works by deliberately removing non-spammers from the network, leveraging both the social graph and the user-link graph. The underpinning of UNIK is that while spammers constantly change their patterns to evade detection, non-spammers do not have to, and thus exhibit a relatively non-volatile pattern. UNIK has performance comparable to SD2 when applied to a large social network site, and outperforms SD2 significantly as the level of spam attacks increases. Based on the detection results of UNIK, we further analyze several identified spam campaigns in this social network site. The results show that different spammer clusters exhibit distinct characteristics, implying the volatility of spamming patterns and the ability of UNIK to automatically extract spam signatures.
{"title":"UNIK: unsupervised social network spam detection","authors":"Enhua Tan, Lei Guo, Songqing Chen, Xiaodong Zhang, Y. Zhao","doi":"10.1145/2505515.2505581","DOIUrl":"https://doi.org/10.1145/2505515.2505581","url":null,"abstract":"Social network spam increases explosively with the rapid development and wide usage of various social networks on the Internet. To timely detect spam in large social network sites, it is desirable to discover unsupervised schemes that can save the training cost of supervised schemes. In this work, we first show several limitations of existing unsupervised detection schemes. The main reason behind the limitations is that existing schemes heavily rely on spamming patterns that are constantly changing to avoid detection. Motivated by our observations, we first propose a sybil defense based spam detection scheme SD2 that remarkably outperforms existing schemes by taking the social network relationship into consideration. In order to make it highly robust in facing an increased level of spam attacks, we further design an unsupervised spam detection scheme, called UNIK. Instead of detecting spammers directly, UNIK works by deliberately removing non-spammers from the network, leveraging both the social graph and the user-link graph. The underpinning of UNIK is that while spammers constantly change their patterns to evade detection, non-spammers do not have to do so and thus have a relatively non-volatile pattern. UNIK has comparable performance to SD2 when it is applied to a large social network site, and outperforms SD2 significantly when the level of spam attacks increases. Based on detection results of UNIK, we further analyze several identified spam campaigns in this social network site. The result shows that different spammer clusters demonstrate distinct characteristics, implying the volatility of spamming patterns and the ability of UNIK to automatically extract spam signatures.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"139 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73298106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 94
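Sybil-defense-based schemes of the kind SD2 builds on typically propagate trust from known-honest seeds with a short random walk and flag accounts the walk barely reaches. The sketch below is a SybilRank-style power iteration in that spirit; the walk length, seed choice, and degree normalization are illustrative, not UNIK's actual procedure, which additionally exploits the user-link graph:

```python
def trust_scores(adj, seeds, n_iter=3):
    """Spread trust from seed (known-honest) users over an undirected social
    graph by short power iteration, then degree-normalise: accounts that the
    walk barely reaches (typically behind a sparse cut) score lowest."""
    score = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    for _ in range(n_iter):
        nxt = dict.fromkeys(adj, 0.0)
        for v, nbrs in adj.items():
            share = score[v] / len(nbrs)
            for w in nbrs:
                nxt[w] += share
        score = nxt
    return {v: score[v] / len(adj[v]) for v in adj}

# Honest clique a-d, sybil triangle s1-s3, one attack edge d-s1.
adj = {
    "a": ["b", "c", "d"], "b": ["a", "c", "d"],
    "c": ["a", "b", "d"], "d": ["a", "b", "c", "s1"],
    "s1": ["s2", "s3", "d"], "s2": ["s1", "s3"], "s3": ["s1", "s2"],
}
ranked = sorted(trust_scores(adj, seeds=["a", "b"]).items(),
                key=lambda kv: kv[1])
print(ranked)  # s1, s2, s3 score lowest -> spam candidates
```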