
Latest publications from the Proceedings of the 21st ACM international conference on Information and knowledge management

Graph classification: a diversified discriminative feature selection approach
Yuanyuan Zhu, J. Yu, Hong Cheng, Lu Qin
A graph models complex structural relationships among objects and is widely used in a broad range of applications. Building an automated graph classification model is therefore important for predicting unknown graphs or understanding the complex structures that distinguish different classes. The widely used graph classification framework consists of two steps: feature selection and classification. The key issue is how to select important subgraph features from a graph database containing a large number of positive and negative graphs. Given the selected features, a generic classification approach can be used to build a classification model. In this paper, we focus on feature selection. We identify two main issues with the most widely used feature selection approach, which relies on a discriminative score to select frequent subgraph features, and introduce a new diversified discriminative score to select features with higher diversity. We analyze the properties of the newly proposed diversified discriminative score and conduct extensive performance studies to demonstrate that it makes positive/negative graphs separable and leads to higher classification accuracy.
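The abstract does not spell out the diversified score itself, so the sketch below is only a plausible reading: a greedy loop that discounts a feature's discriminative score by how redundantly it covers graphs already covered by earlier picks. The scoring inputs and the discount are illustrative assumptions, not the paper's formulas.

```python
# Minimal sketch: greedy selection of subgraph features that discounts
# a feature's discriminative score by how redundantly it covers graphs
# already covered by earlier picks. The discount is an assumption for
# illustration; the paper's exact diversified score is not given here.

def diversified_feature_selection(features, coverage, disc_score, k):
    """features: candidate subgraph feature ids.
    coverage: dict feature -> set of graph ids containing the feature.
    disc_score: dict feature -> discriminative score (e.g., based on
    frequency difference between positive and negative graphs).
    Returns up to k features with diversified coverage."""
    selected, covered = [], set()
    remaining = set(features)
    while remaining and len(selected) < k:
        def diversified(f):
            fresh = len(coverage[f] - covered)  # newly covered graphs
            return disc_score[f] * fresh / max(len(coverage[f]), 1)
        best = max(remaining, key=diversified)
        selected.append(best)
        covered |= coverage[best]
        remaining.remove(best)
    return selected
```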
{"title":"Graph classification: a diversified discriminative feature selection approach","authors":"Yuanyuan Zhu, J. Yu, Hong Cheng, Lu Qin","doi":"10.1145/2396761.2396791","DOIUrl":"https://doi.org/10.1145/2396761.2396791","url":null,"abstract":"A graph models complex structural relationships among objects, and has been prevalently used in a wide range of applications. Building an automated graph classification model becomes very important for predicting unknown graphs or understanding complex structures between different classes. The graph classification framework being widely used consists of two steps, namely, feature selection and classification. The key issue is how to select important subgraph features from a graph database with a large number of graphs including positive graphs and negative graphs. Given the features selected, a generic classification approach can be used to build a classification model. In this paper, we focus on feature selection. We identify two main issues with the most widely used feature selection approach which is based on a discriminative score to select frequent subgraph features, and introduce a new diversified discriminative score to select features that have a higher diversity. We analyze the properties of the newly proposed diversified discriminative score, and conducted extensive performance studies to demonstrate that such a diversified discriminative score makes positive/negative graphs separable and leads to a higher classification accuracy.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134235433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 40
Map to humans and reduce error: crowdsourcing for deduplication applied to digital libraries
Mihai Georgescu, Dang Duc Pham, C. S. Firan, W. Nejdl, Julien Gaugaz
Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities, focusing either on accuracy or on efficiency and speed, but no perfect solution exists yet. We propose a combined layered approach for duplicate detection whose main advantage is using Crowdsourcing as a training and feedback mechanism. By applying Active Learning techniques to human-provided examples, we fine-tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand for borderline cases or inconclusive assessments. We apply our simple and powerful methods to an online publication search system: first, we perform a coarse duplicate detection relying on publication signatures in real time. Then, a second automatic step compares duplicate candidates and increases accuracy, adjusting based on feedback both from our online users and from Crowdsourcing platforms. Our approach shows an improvement of 14% over the untrained setting and comes within 4% of the accuracy of human assessors.
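As a rough illustration of the on-demand training idea, here is a minimal sketch assuming a scikit-learn-style classifier; `crowd_label`, `extract_features`, and the probability thresholds are hypothetical placeholders rather than the paper's components.

```python
# Minimal sketch: gather crowd labels only for borderline candidate
# pairs and fold them back into the model. The thresholds, the feature
# extraction, and crowd_label are illustrative assumptions.

def active_dedup_round(pairs, extract_features, crowd_label, model,
                       training_set, low=0.4, high=0.6, budget=100):
    """pairs: candidate duplicate pairs from the coarse signature pass.
    model: classifier with fit/predict_proba (scikit-learn style).
    training_set: list of (feature_vector, label) accumulated so far."""
    for pair in pairs:
        if budget <= 0:
            break
        x = extract_features(pair)
        p_dup = model.predict_proba([x])[0][1]  # P(pair is a duplicate)
        if low < p_dup < high:                  # inconclusive: ask humans
            training_set.append((x, crowd_label(pair)))
            budget -= 1
    if training_set:                            # retrain on all labels so far
        X, y = zip(*training_set)
        model.fit(list(X), list(y))
    return model
```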
{"title":"Map to humans and reduce error: crowdsourcing for deduplication applied to digital libraries","authors":"Mihai Georgescu, Dang Duc Pham, C. S. Firan, W. Nejdl, Julien Gaugaz","doi":"10.1145/2396761.2398554","DOIUrl":"https://doi.org/10.1145/2396761.2398554","url":null,"abstract":"Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities, while focusing either on accuracy or on efficiency and speed - with still no perfect solution. We propose a combined layered approach for duplicate detection with the main advantage of using Crowdsourcing as a training and feedback mechanism. By using Active Learning techniques on human provided examples, we fine tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand for borderline cases or for inconclusive assessments. We apply our simple and powerful methods to an online publication search system: First, we perform a coarse duplicate detection relying on publication signatures in real time. Then, a second automatic step compares duplicate candidates and increases accuracy while adjusting based on both feedback from our online users and from Crowdsourcing platforms. Our approach shows an improvement of 14% over the untrained setting and is at only 4% difference to the human assessors in accuracy.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134549942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Continuous top-k query for graph streams
Shirui Pan, Xingquan Zhu
In this paper, we propose to query correlated graphs in a data stream scenario, where an algorithm is required to retrieve the top-k graphs most correlated to a query graph q. Due to the dynamically changing nature of the stream data and the inherent complexity of the graph query process, treating graph streams as static datasets is computationally infeasible or ineffective. We propose a novel algorithm, Hoe-PGPL, to identify top-k correlated graphs from a data stream, using a sliding window that covers a number of consecutive batches of stream data records. Our theme is to employ the Hoeffding bound to discover potential candidates and to use two-level candidate checking (one level corresponding to the whole sliding window and one to the local data batch) to accurately estimate the correlation of emerging candidate patterns, without rechecking the historical stream data. Experimental results demonstrate that the proposed algorithm not only achieves good query precision and recall but is also several times, or even an order of magnitude, more efficient than the straightforward algorithm in terms of time and memory consumption. Our method represents the first research endeavor toward data-stream-based top-k correlated graph queries.
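The pruning leans on the Hoeffding bound, a standard concentration inequality. The helper below computes the bound; how Hoe-PGPL gates candidates with it is our reading of the abstract, not a verified detail.

```python
# The candidate pruning relies on the Hoeffding bound: for n i.i.d.
# observations of a quantity with range R, the true mean stays within
# eps of the empirical mean with probability at least 1 - delta.
import math

def hoeffding_epsilon(value_range, n, delta):
    """eps = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2 * n))

# Example: correlation scores in [0, 1], 5000 records in the window,
# 95% confidence. Using eps to discard patterns that trail the current
# top-k by more than the bound is our assumption about the gating.
eps = hoeffding_epsilon(1.0, 5000, 0.05)
```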
{"title":"Continuous top-k query for graph streams","authors":"Shirui Pan, Xingquan Zhu","doi":"10.1145/2396761.2398717","DOIUrl":"https://doi.org/10.1145/2396761.2398717","url":null,"abstract":"In this paper, we propose to query correlated graphs in a data stream scenario, where an algorithm is required to retrieve the top k graphs which are mostly correlated to a query graph q. Due to the dynamic changing nature of the stream data and the inherent complexity of the graph query process, treating graph streams as static datasets is computationally infeasible or ineffective. In the paper, we propose a novel algorithm, Hoe-PGPL, to identify top-k correlated graphs from data stream, by using a sliding window which covers a number of consecutive batches of stream data records. Our theme is to employ Hoeffding bound to discover some potential candidates and use two level candidate checking (one corresponding to the whole sliding window level and one corresponding to the local data batch level) to accurately estimate the correlation of the emerging candidate patterns, without rechecking the historical stream data. Experimental results demonstrate that the proposed algorithm not only achieves good performance in terms of query precision and recall, but also is several times, or even an order of magnitude, more efficient than the straightforward algorithm with respect to the time and the memory consumption. Our method represents the first research endeavor for data stream based top-k correlated graph query.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"142 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133970579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
A comprehensive analysis of parameter settings for novelty-biased cumulative gain
Teerapong Leelanupab, G. Zuccon, J. Jose
In the TREC Web Diversity track, novelty-biased cumulative gain (α-NDCG) is one of the official measures to assess retrieval performance of IR systems. The measure is characterised by a parameter, α, whose effect has not been thoroughly investigated. We find that common settings of α, i.e., α=0.5, may prevent the measure from behaving as desired when evaluating result diversification, because it excessively penalises systems that cover many intents while rewarding those that redundantly cover only a few. This issue is crucial since it strongly influences systems at top ranks. We revisit our previously proposed threshold, suggesting that α be set on a per-query basis. We then study the intuitiveness of the measure by examining actual rankings from TREC 09-10 Web track submissions. By varying α according to our query-based threshold, the discriminative power of α-NDCG is not harmed; in fact, our approach improves α-NDCG's robustness. Experimental results show that the query-based threshold for α makes the measure more intuitive than its common settings.
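For reference, the standard α-NDCG gain discounts each intent by a factor of (1-α) every time it is covered again, which is exactly where the parameter's effect enters. A minimal computation of the unnormalized α-DCG is sketched below; the paper's query-based threshold for α is not reproduced here.

```python
# Unnormalized α-DCG following the standard definition: a document's
# gain for intent i is discounted by (1 - alpha) for every earlier
# document already covering i, so redundant coverage earns less.
import math

def alpha_dcg(ranking, alpha=0.5):
    """ranking: list of sets; element r is the set of intent ids that
    the document at rank r+1 is judged relevant to."""
    times_seen = {}  # intent id -> how often it was covered so far
    score = 0.0
    for rank, intents in enumerate(ranking, start=1):
        gain = sum((1 - alpha) ** times_seen.get(i, 0) for i in intents)
        score += gain / math.log2(rank + 1)
        for i in intents:
            times_seen[i] = times_seen.get(i, 0) + 1
    return score

# alpha = 0 removes the novelty bias (plain DCG with binary gains);
# alpha close to 1 makes repeated coverage of an intent nearly worthless.
```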
{"title":"A comprehensive analysis of parameter settings for novelty-biased cumulative gain","authors":"Teerapong Leelanupab, G. Zuccon, J. Jose","doi":"10.1145/2396761.2398550","DOIUrl":"https://doi.org/10.1145/2396761.2398550","url":null,"abstract":"In the TREC Web Diversity track, novelty-biased cumulative gain (α-NDCG) is one of the official measures to assess retrieval performance of IR systems. The measure is characterised by a parameter, α, the effect of which has not been thoroughly investigated. We find that common settings of α, i.e. α=0.5, may prevent the measure from behaving as desired when evaluating result diversification. This is because it excessively penalises systems that cover many intents while it rewards those that redundantly cover only few intents. This issue is crucial since it highly influences systems at top ranks. We revisit our previously proposed threshold, suggesting α be set on a query-basis. The intuitiveness of the measure is then studied by examining actual rankings from TREC 09-10 Web track submissions. By varying α according to our query-based threshold, the discriminative power of α-NDCG is not harmed and in fact, our approach improves α-NDCG's robustness. Experimental results show that the threshold for α can turn the measure to be more intuitive than using its common settings.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134105398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
User engagement: the network effect matters!
R. Baeza-Yates, M. Lalmas
In the online world, user engagement refers to the quality of the user experience, emphasizing the positive aspects of interacting with a web application and, in particular, the phenomena associated with wanting to use that application longer and more frequently. This definition is motivated by the observation that successful web applications are not just used; they are engaged with. Users invest time, attention, and emotion into them. Online providers aim to engage users not only with each individual service, but across all services in their network. They spend increasing effort to direct users to various services (e.g., using hyperlinks to help users navigate to and explore other services) and thus to increase user traffic between their services. Nothing is known about users engaging across such a network of Web sites, something we call networked user engagement. We address this problem by combining techniques from web analytics and mining, information retrieval evaluation, and existing work on user engagement from the domains of information science, multimodal human-computer interaction, and cognitive psychology. In this way, we can combine insights from big data with deep analysis of human behavior in the lab or through crowd-sourcing experiments.
{"title":"User engagement: the network effect matters!","authors":"R. Baeza-Yates, M. Lalmas","doi":"10.1145/2396761.2396763","DOIUrl":"https://doi.org/10.1145/2396761.2396763","url":null,"abstract":"In the online world, user engagement refers to the quality of the user experience that emphasizes the positive aspects of the interaction with a web application and, in particular, the phenomena associated with wanting to use that application longer and frequently. This definition is motivated by the observation that successful web applications are not just used, but they are engaged with. Users invest time, attention, and emotion into them. Online providers aim not only to engage users with each service, but across all services in their network. They spend increasing effort to direct users to various services (e.g.~using hyperlinks to help users navigate to and explore other services), to increase user traffic between their services. Nothing is known for users engaging across such a network of Web sites, something we call networked user engagement. We address this problem by combining techniques from web analytics and mining, information retrieval evaluation, and existing works on user engagement coming from the domains of information science, multimodal human computer interaction and cognitive psychology. In this way, we can combine insights from big data with deep analysis of human behavior in the lab or through crowd-sourcing experiments.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131620022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
WiSeNet: building a wikipedia-based semantic network with ontologized relations
A. Moro, Roberto Navigli
In this paper we present an approach for building a Wikipedia-based semantic network by integrating Open Information Extraction with Knowledge Acquisition techniques. Our algorithm extracts relation instances from Wikipedia page bodies and ontologizes them by, first, creating sets of synonymous relational phrases, called relation synsets, second, assigning semantic classes to the arguments of these relation synsets and, third, disambiguating the initial relation instances with relation synsets. As a result we obtain WiSeNet, a Wikipedia-based Semantic Network with Wikipedia pages as concepts and labeled, ontologized relations between them.
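As a toy illustration of the first ontologization step only: synonymous relational phrases might be grouped by the argument pairs they share. The overlap heuristic below is an assumption for illustration, not the paper's actual algorithm.

```python
# Toy sketch of grouping relational phrases into relation synsets by
# shared argument pairs. min_shared and the greedy merge are assumptions.
from collections import defaultdict

def build_relation_synsets(instances, min_shared=3):
    """instances: iterable of (arg1, phrase, arg2) triples extracted
    from Wikipedia page bodies. Groups phrases whose argument pairs
    overlap in at least min_shared instances."""
    pairs_by_phrase = defaultdict(set)
    for a1, phrase, a2 in instances:
        pairs_by_phrase[phrase].add((a1, a2))
    synsets = []
    for phrase, pairs in pairs_by_phrase.items():
        placed = False
        for synset in synsets:
            if len(pairs & synset["pairs"]) >= min_shared:
                synset["phrases"].add(phrase)   # treat as synonymous
                synset["pairs"] |= pairs
                placed = True
                break
        if not placed:
            synsets.append({"phrases": {phrase}, "pairs": set(pairs)})
    return synsets
```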
{"title":"WiSeNet: building a wikipedia-based semantic network with ontologized relations","authors":"A. Moro, Roberto Navigli","doi":"10.1145/2396761.2398495","DOIUrl":"https://doi.org/10.1145/2396761.2398495","url":null,"abstract":"In this paper we present an approach for building a Wikipedia-based semantic network by integrating Open Information Extraction with Knowledge Acquisition techniques. Our algorithm extracts relation instances from Wikipedia page bodies and ontologizes them by, first, creating sets of synonymous relational phrases, called relation synsets, second, assigning semantic classes to the arguments of these relation synsets and, third, disambiguating the initial relation instances with relation synsets. As a result we obtain WiSeNet, a Wikipedia-based Semantic Network with Wikipedia pages as concepts and labeled, ontologized relations between them.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133038044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 36
A unified optimization framework for auction and guaranteed delivery in online advertising
Konstantin Salomatin, Tie-Yan Liu, Yiming Yang
This paper proposes a new unified optimization framework combining pay-per-click auctions and guaranteed delivery in sponsored search. Advertisers usually have different (and sometimes mixed) marketing goals: brand awareness and direct response. Different mechanisms are good at addressing different goals; for example, guaranteed delivery has often been used to build brand awareness, while pay-per-click auctions have been widely used for direct marketing. Our new method accommodates both in a unified framework, with search engine revenue as the optimization objective. In this way, we can target a guaranteed number of ad clicks (or impressions) per campaign for advertisers willing to pay a premium, and enable keyword auctions for all others. Specifically, we formulate this joint optimization problem using linear programming with a column generation strategy for efficiency. To select the best column (a ranked list of ads) for a given query, we propose a novel dynamic programming algorithm that takes the special structure of the ad allocation and pricing mechanisms into account. We have tested the proposed framework and algorithms on real ad data obtained from a commercial search engine. The results demonstrate that our approach can outperform several baselines in guaranteeing the number of clicks for the given advertisers and in increasing the total revenue for the search engine.
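A toy instance of the joint allocation LP can be written down directly; the paper solves such programs at scale via column generation, which this sketch omits. All numbers are invented, and scipy is assumed to be available.

```python
# Toy LP: allocate query impressions between a guaranteed-delivery
# advertiser G (paying a premium per click) and an auction advertiser A,
# maximizing revenue subject to inventory and G's click guarantee.
import numpy as np
from scipy.optimize import linprog

volumes = np.array([1000.0, 800.0])   # impression inventory per query
ctr = np.array([[0.05, 0.02],         # row 0: G's CTR per query
                [0.03, 0.04]])        # row 1: A's CTR per query
price = np.array([2.0, 1.5])          # payment per click (G pays a premium)
guarantee = 40.0                      # G is promised at least 40 clicks

# Variables x[i, j] = impressions of query j given to advertiser i,
# flattened row-major: [x_G1, x_G2, x_A1, x_A2].
revenue = (price[:, None] * ctr).ravel()

A_ub = [[1, 0, 1, 0],                         # inventory of query 1
        [0, 1, 0, 1],                         # inventory of query 2
        [-ctr[0, 0], -ctr[0, 1], 0, 0]]       # click guarantee as <=
b_ub = [volumes[0], volumes[1], -guarantee]

res = linprog(-revenue, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
allocation = res.x.reshape(2, 2)  # optimal impression split per advertiser
```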
{"title":"A unified optimization framework for auction and guaranteed delivery in online advertising","authors":"Konstantin Salomatin, Tie-Yan Liu, Yiming Yang","doi":"10.1145/2396761.2398561","DOIUrl":"https://doi.org/10.1145/2396761.2398561","url":null,"abstract":"This paper proposes a new unified optimization framework combining pay-per-click auctions and guaranteed delivery in sponsored search. Advertisers usually have different (and sometimes mixed) marketing goals: brand awareness and direct response. Different mechanisms are good at addressing different goals, e.g., guaranteed delivery was often used to build brand awareness and pay-per-click auctions was widely used for direct marketing. Our new method accommodates both in a unified framework, with the search engine revenue as an optimization objective. In this way, we can target a guaranteed number of ad clicks (or impressions) per campaign for advertisers willing to pay a premium and enable keyword auctions for all others. Specifically, we formulate this joint optimization problem using linear programming and a column generation strategy for efficiency. To select the best column (a ranked list of ads) given a query, we propose a novel dynamic programming algorithm that takes the special structure of the ad allocation and pricing mechanisms into account. We have tested the proposed framework and the algorithms on real ad data obtained from a commercial search engine. The results demonstrate that our proposed approach can outperform several baselines in guaranteeing the number of clicks for the given advertisers, and in increasing the total revenue for the search engine.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133045083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Estimating interleaved comparison outcomes from historical click data
Katja Hofmann, Shimon Whiteson, M. de Rijke
Interleaved comparison methods, which compare rankers using click data, are a promising alternative to traditional information retrieval evaluation methods that require expensive explicit judgments. A major limitation of these methods is that they assume access to live data, meaning that new data must be collected for every pair of rankers compared. We investigate the use of previously collected click data (i.e., historical data) for interleaved comparisons. We start by analyzing to what degree existing interleaved comparison methods can be applied and find that a recent probabilistic method allows such data reuse, even though it is biased when applied to historical data. We then propose an interleaved comparison method that is based on the probabilistic approach but uses importance sampling to compensate for bias. We experimentally confirm that probabilistic methods make the use of historical data for interleaved comparisons possible and effective.
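The bias correction described here is classic importance sampling: outcomes observed under the historical (logging) interleaving distribution are reweighted by the probability ratio under the target comparison. A minimal sketch, with an assumed log-record format:

```python
# Classic importance-sampling estimator: outcomes logged under the
# historical interleaving distribution q are reweighted by p/q to
# estimate the comparison outcome under the target distribution p.
# The log record format here is an assumption for illustration.

def reweighted_outcome(log, target_prob, logging_prob):
    """log: iterable of (interleaved_list, outcome) pairs, where outcome
    is +1 if clicks favored ranker A, -1 if B, and 0 for a tie.
    target_prob / logging_prob: callables returning the probability of
    producing interleaved_list under each distribution."""
    total, n = 0.0, 0
    for interleaved_list, outcome in log:
        weight = target_prob(interleaved_list) / logging_prob(interleaved_list)
        total += weight * outcome
        n += 1
    return total / n if n else 0.0  # > 0 suggests ranker A wins
```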
{"title":"Estimating interleaved comparison outcomes from historical click data","authors":"Katja Hofmann, Shimon Whiteson, M. de Rijke","doi":"10.1145/2396761.2398516","DOIUrl":"https://doi.org/10.1145/2396761.2398516","url":null,"abstract":"Interleaved comparison methods, which compare rankers using click data, are a promising alternative to traditional information retrieval evaluation methods that require expensive explicit judgments. A major limitation of these methods is that they assume access to live data, meaning that new data must be collected for every pair of rankers compared. We investigate the use of previously collected click data (i.e., historical data) for interleaved comparisons. We start by analyzing to what degree existing interleaved comparison methods can be applied and find that a recent probabilistic method allows such data reuse, even though it is biased when applied to historical data. We then propose an interleaved comparison method that is based on the probabilistic approach but uses importance sampling to compensate for bias. We experimentally confirm that probabilistic methods make the use of historical data for interleaved comparisons possible and effective.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133465180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 31
Being picky: processing top-k queries with set-defined selections
A. Stupar, S. Michel
Focusing on the top-K items according to a ranking criterion constitutes an important functionality in many different query answering scenarios. The idea is to read only the necessary information, mostly from secondary storage, with the ultimate goal of achieving low latency. In this work, we consider processing such top-K queries under the constraint that the result items are members of a specific set provided at query time. We call this restriction a set-defined selection criterion. Set-defined selections drastically influence the pros and cons of an id-ordered index vs. a score-ordered index. We present a mathematical model that allows deciding at runtime which index to choose, leading to a combined index. To improve latency around the break-even point of the two indices, we show how to benefit from a partitioned score-ordered index and present an algorithm that creates such partitions based on analyzing query logs. Further performance gains can be obtained using approximate top-K results, with tunable result quality. The presented approaches are evaluated using both real-world and synthetic data.
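A minimal sketch of the runtime decision between the two indexes, under a deliberately crude cost model (the model and its constant are assumptions; the paper derives a proper mathematical one):

```python
# Sketch: choose between scanning the score-ordered index (filtering by
# set membership) and fetching the selection set from the id-ordered
# index. The cost model and its constant are illustrative assumptions.

def topk_with_set_selection(k, sel_set, score_index, id_index, n_total,
                            random_io_penalty=10.0):
    """score_index: list of (item_id, score) sorted by descending score.
    id_index: dict item_id -> score. sel_set: ids results must belong to."""
    selectivity = len(sel_set) / n_total
    expected_scan = k / max(selectivity, 1e-9)  # entries read until k hits
    lookup_cost = len(sel_set) * random_io_penalty
    if expected_scan <= lookup_cost:
        out = []  # early-terminating scan over the score-ordered index
        for item_id, score in score_index:
            if item_id in sel_set:
                out.append((item_id, score))
                if len(out) == k:
                    break
        return out
    # small set: random-access lookups, then sort only those items
    hits = [(i, id_index[i]) for i in sel_set if i in id_index]
    hits.sort(key=lambda t: t[1], reverse=True)
    return hits[:k]
```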
{"title":"Being picky: processing top-k queries with set-defined selections","authors":"A. Stupar, S. Michel","doi":"10.1145/2396761.2396877","DOIUrl":"https://doi.org/10.1145/2396761.2396877","url":null,"abstract":"Focusing on the top-K items according to a ranking criterion constitutes an important functionality in many different query answering scenarios. The idea is to read only the necessary information---mostly from secondary storage---with the ultimate goal to achieve low latency. In this work, we consider processing such top-K queries under the constraint that the result items are members of a specific set, which is provided at query time. We call this restriction a set-defined selection criterion. Set-defined selections drastically influence the pros and cons of an id-ordered index vs. a score-ordered index. We present a mathematical model that allows to decide at runtime which index to choose, leading to a combined index. To improve the latency around the break even point of the two indices, we show how to benefit from a partitioned score-ordered index and present an algorithm to create such partitions based on analyzing query logs. Further performance gains can be enjoyed using approximate top-K results, with tunable result quality. The presented approaches are evaluated using both real-world and synthetic data.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"326 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132663474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Fast approximation of steiner trees in large graphs
Andrey Gubichev, Thomas Neumann
Finding the minimum connected subtree of a graph that contains a given set of nodes (i.e., the Steiner tree problem) is a fundamental operation in keyword search over graphs, yet it is known to be NP-hard. Existing approximation techniques either make use of heavy indexing of the graph or rely entirely on online heuristics. In this paper we bridge the gap between these two extremes and present a scalable landmark-based index structure that, combined with a few lightweight online heuristics, yields a fast and accurate approximation of the Steiner tree. Our solution handles real-world graphs with millions of nodes and provides an approximation error of less than 5% on average.
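For contrast with the paper's landmark-based index, the classic path-greedy baseline (repeatedly attach the nearest remaining terminal via a shortest path) can be sketched for an unweighted graph; this is a baseline heuristic, not the proposed method.

```python
# Baseline sketch (not the paper's landmark index): grow a tree from one
# terminal by repeatedly attaching the nearest remaining terminal via a
# BFS shortest path. Unweighted graph as adjacency dict of node -> set.
from collections import deque

def bfs_path(graph, sources, targets):
    """Shortest path from any node in sources to any node in targets."""
    parent = {s: None for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        if u in targets:
            path = [u]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path
        for v in graph[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

def steiner_approx(graph, terminals):
    """Edge set of an approximate Steiner tree connecting the terminals."""
    terminals = list(terminals)
    tree_nodes, edges = {terminals[0]}, set()
    remaining = set(terminals[1:])
    while remaining:
        path = bfs_path(graph, tree_nodes, remaining)
        if path is None:
            raise ValueError("terminals are not connected")
        edges.update(zip(path, path[1:]))
        tree_nodes.update(path)
        remaining -= tree_nodes
    return edges
```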
{"title":"Fast approximation of steiner trees in large graphs","authors":"Andrey Gubichev, Thomas Neumann","doi":"10.1145/2396761.2398460","DOIUrl":"https://doi.org/10.1145/2396761.2398460","url":null,"abstract":"Finding the minimum connected subtree of a graph that contains a given set of nodes (i.e., the Steiner tree problem) is a fundamental operation in keyword search in graphs, yet it is known to be NP-hard. Existing approximation techniques either make use of the heavy indexing of the graph, or entirely rely on online heuristics. In this paper we bridge the gap between these two extremes and present a scalable landmark-based index structure that, combined with a few lightweight online heuristics, yields a fast and accurate approximation of the Steiner tree. Our solution handles real-world graphs with millions of nodes and provides an approximation error of less than 5% on average.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132709750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22