Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval: latest papers

Query change as relevance feedback in session search
Sicong Zhang, Dongyi Guan, G. Yang
Session search is the Information Retrieval (IR) task of retrieving documents for an entire session. During a session, users often change their queries to explore and investigate their information needs. In this paper, we propose using query change as a new form of relevance feedback for better session search. An evaluation on the TREC 2012 Session Track shows that query change is a highly effective form of feedback compared with existing relevance feedback methods. The proposed method outperforms the state-of-the-art relevance feedback methods for the TREC 2012 Session Track by more than 25%, a significant improvement.
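As a rough illustration of what "query change" can mean in practice, the sketch below splits two consecutive queries of a session into retained, added, and removed terms. The function name `query_change`, the whitespace tokenization, and the example queries are assumptions made for illustration; the abstract does not say how the authors represent or weight the change during ranking.

```python
def query_change(prev_query: str, curr_query: str) -> dict:
    """Split two consecutive session queries into retained, added, and removed terms.

    Hypothetical helper: tokenization is plain lowercase whitespace splitting.
    """
    prev_terms = prev_query.lower().split()
    curr_terms = curr_query.lower().split()
    prev_set, curr_set = set(prev_terms), set(curr_terms)
    return {
        # Terms the user kept from the previous query.
        "retained": [t for t in curr_terms if t in prev_set],
        # Terms that appear only in the new query.
        "added": [t for t in curr_terms if t not in prev_set],
        # Terms the user dropped.
        "removed": [t for t in prev_terms if t not in curr_set],
    }


change = query_change(
    "pocono mountains pennsylvania",
    "pocono mountains camelbeach hotel",
)
```

A feedback method could then, for example, treat the three term groups differently when re-scoring documents; which treatment works best is exactly the kind of question the paper evaluates.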
DOI: 10.1145/2484028.2484171 (https://doi.org/10.1145/2484028.2484171); published 2013-07-28
Citations: 26
Leveraging conceptual lexicon: query disambiguation using proximity information for patent retrieval
Parvaz Mahdabi, Shima Gerani, Xiangji Huang, F. Crestani
Patent prior art search is a task in patent retrieval where the goal is to rank documents that describe prior art related to a patent application. One of the main properties of patent retrieval is that the query topic is a full patent application and does not represent a focused information need. This query-by-document nature of patent retrieval introduces new challenges and requires new investigations specific to this problem. Researchers have addressed this problem by considering different information resources for query reduction and query disambiguation. However, previous work has not fully studied the effect of using proximity information and exploiting domain-specific resources for query disambiguation. In this paper, we first reduce the query document by taking the first claim of the document itself. We then build a query-specific patent lexicon based on definitions of the International Patent Classification (IPC). We study how to expand queries by selecting expansion terms from the lexicon that are focused on the query topic. The key problem is how to determine whether an expansion term is focused on the query topic. We address this problem by exploiting proximity information: we assign high weights to expansion terms that appear close to query terms, based on the intuition that terms closer to query terms are more likely to be related to the query topic. Experimental results on two patent retrieval datasets show that the proposed method is effective and robust for query expansion, significantly outperforming standard pseudo-relevance feedback (PRF) and existing baselines in patent retrieval.
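The proximity intuition in this abstract can be sketched as follows: each occurrence of a candidate expansion term contributes a weight that decays with its distance to the nearest query term. The exponential kernel, the `decay` parameter, the function name `proximity_weights`, and the toy text are all assumptions for illustration; the abstract does not state which weighting function the authors use.

```python
import math
from collections import defaultdict


def proximity_weights(tokens, query_terms, candidates, decay=0.5):
    """Weight candidate expansion terms by closeness to query terms.

    Each candidate occurrence adds exp(-decay * d), where d is the distance
    in tokens to the nearest query-term occurrence (assumed kernel).
    """
    query_positions = [i for i, t in enumerate(tokens) if t in query_terms]
    weights = defaultdict(float)
    if not query_positions:
        return dict(weights)  # no query term in the text, nothing to weight
    for i, term in enumerate(tokens):
        if term in candidates:
            distance = min(abs(i - p) for p in query_positions)
            weights[term] += math.exp(-decay * distance)
    return dict(weights)


text = "a valve controls fluid flow while the housing supports a remote bracket".split()
w = proximity_weights(text, query_terms={"valve"}, candidates={"fluid", "bracket"})
```

In this toy text, "fluid" sits two tokens from the query term and "bracket" ten tokens away, so "fluid" receives the larger weight.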
DOI: 10.1145/2484028.2484056 (https://doi.org/10.1145/2484028.2484056); published 2013-07-28
Citations: 30
Taily
Robin Aly, D. Hiemstra, T. Demeester
Search engines can improve their efficiency by selecting only a few promising shards for each query. State-of-the-art shard selection algorithms first query a central index of sampled documents, and their effectiveness is similar to searching all shards. However, the search in the central index also hurts efficiency. Additionally, we show that the effectiveness of these approaches varies substantially with the sampled documents. This paper proposes Taily, a novel shard selection algorithm that models a query's score distribution in each shard as a Gamma distribution and selects shards with high-scoring documents in the tail of the distribution. Taily estimates the parameters of score distributions based on the mean and variance of the score function's features in the collections and shards. Because Taily operates on term statistics instead of document samples, it is efficient and has deterministic effectiveness. Experiments on large web collections (Gov2, CluewebA and CluewebB) show that Taily achieves similar effectiveness to sample-based approaches, and improves upon their efficiency by roughly 20% in terms of used resources and response time.
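A minimal sketch of the idea, assuming a method-of-moments fit: given the mean and variance of a query's scores in a shard, a Gamma distribution has shape k = mean^2 / variance and scale theta = variance / mean, and the expected number of documents above a score threshold is the shard size times the tail probability. The function names, the trapezoidal integration, the threshold, and the shard statistics below are illustrative assumptions, not the paper's implementation.

```python
import math


def fit_gamma(mean, variance):
    """Method-of-moments Gamma fit: returns (shape k, scale theta)."""
    return mean * mean / variance, variance / mean


def gamma_pdf(x, k, theta):
    """Gamma density, computed in log space for numerical stability."""
    if x <= 0:
        return 0.0
    log_pdf = (k - 1) * math.log(x) - x / theta - math.lgamma(k) - k * math.log(theta)
    return math.exp(log_pdf)


def tail_probability(threshold, k, theta, steps=20000):
    """Approximate P(score > threshold) with the trapezoidal rule.

    Assumes threshold > 0; integrates up to mean + 40 standard deviations.
    """
    upper = k * theta + 40 * math.sqrt(k) * theta
    if threshold >= upper:
        return 0.0
    h = (upper - threshold) / steps
    total = 0.5 * (gamma_pdf(threshold, k, theta) + gamma_pdf(upper, k, theta))
    total += sum(gamma_pdf(threshold + i * h, k, theta) for i in range(1, steps))
    return total * h


# Hypothetical per-shard statistics: (number of documents, score mean, score variance).
shards = {"shard_a": (1000, 2.0, 1.0), "shard_b": (5000, 1.0, 0.5)}
threshold = 4.0

expected_hits = {}
for name, (n_docs, mean, variance) in shards.items():
    k, theta = fit_gamma(mean, variance)
    expected_hits[name] = n_docs * tail_probability(threshold, k, theta)

# Shards with more expected high-scoring documents are searched first.
ranked = sorted(expected_hits, key=expected_hits.get, reverse=True)
```

Here the smaller shard ranks first because its score distribution puts more mass in the tail above the threshold, which is the behavior the abstract describes.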
DOI: 10.1145/2484028.2484033 (https://doi.org/10.1145/2484028.2484033); published 2013-07-28
Citations: 50
The cluster hypothesis in information retrieval
Oren Kurland
DOI: 10.1007/978-3-319-06028-6_105 (https://doi.org/10.1007/978-3-319-06028-6_105); published 2013-07-28
Citations: 20
An LDA-smoothed relevance model for document expansion: a case study for spoken document retrieval
Debasis Ganguly, Johannes Leveling, G. Jones
Document expansion (DE) in information retrieval (IR) involves modifying each document in the collection by introducing additional terms into the document. It is particularly useful to improve retrieval of short and noisy documents where the additional terms can improve the description of the document content. Existing approaches to DE assume that documents to be expanded are from a single topic. In the case of multi-topic documents this can lead to a topic bias in terms selected for DE and hence may result in poor retrieval quality due to the lack of coverage of the original document topics in the expanded document. This paper proposes a new DE technique providing a more uniform selection and weighting of DE terms from all constituent topics. We show that our proposed method significantly outperforms the most recently reported relevance model based DE method on a spoken document retrieval task for both manual and automatic speech recognition transcripts.
DOI: 10.1145/2484028.2484110 (https://doi.org/10.1145/2484028.2484110); published 2013-07-28
Citations: 10
A test collection for entity search in DBpedia
K. Balog, R. Neumayer
We develop and make publicly available an entity search test collection based on the DBpedia knowledge base. This includes a large number of queries and corresponding relevance judgments from previous benchmarking campaigns, covering a broad range of information needs, ranging from short keyword queries to natural language questions. Further, we present baseline results for this collection with a set of retrieval models based on language modeling and BM25. Finally, we perform an initial analysis to shed light on certain characteristics that make this data set particularly challenging.
DOI: 10.1145/2484028.2484165 (https://doi.org/10.1145/2484028.2484165); published 2013-07-28
Citations: 66
Document classification by topic labeling
Swapnil Hingmire, S. Chougule, Girish Keshav Palshikar, Sutanu Chakraborti
In this paper, we propose a Latent Dirichlet Allocation (LDA) [1] based document classification algorithm that does not require any labeled dataset. In our algorithm, we construct a topic model using LDA, assign each topic to one of the class labels, aggregate all topics with the same class label into a single topic using the aggregation property of the Dirichlet distribution, and then automatically assign a class label to each unlabeled document depending on its "closeness" to one of the aggregated topics. We present an extension to our algorithm based on the combination of the Expectation-Maximization (EM) algorithm and a naive Bayes classifier. We show the effectiveness of our algorithm on three real-world datasets.
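The aggregation step can be sketched roughly as follows, assuming the per-document topic proportions have already been inferred by some LDA implementation and that each topic has been given a class label by hand. Summing the proportions of topics that share a label mirrors the aggregation property of the Dirichlet distribution; taking the label with the largest aggregated mass as the "closest" one is an assumption, since the abstract does not define closeness. The function name and the numbers are hypothetical.

```python
from collections import defaultdict


def classify(doc_topics, topic_labels):
    """Assign the class label whose topics hold the most probability mass.

    doc_topics:   topic proportions for one document (sums to 1).
    topic_labels: class label assigned to each topic, same order.
    """
    mass = defaultdict(float)
    for topic, proportion in enumerate(doc_topics):
        # Aggregate all topics with the same class label into one.
        mass[topic_labels[topic]] += proportion
    return max(mass, key=mass.get)


topic_labels = ["sports", "politics", "sports", "politics"]
label = classify([0.30, 0.35, 0.25, 0.10], topic_labels)
```

In this toy case the single largest topic is labeled "politics", yet the two "sports" topics together carry more mass, so aggregation changes the decision.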
DOI: 10.1145/2484028.2484140 (https://doi.org/10.1145/2484028.2484140); published 2013-07-28
Citations: 76
Optimizing top-n collaborative filtering via dynamic negative item sampling
Weinan Zhang, Tianqi Chen, Jun Wang, Yong Yu
Collaborative filtering techniques rely on aggregated user preference data to make personalized predictions. In many cases, users are reluctant to explicitly express their preferences and many recommender systems have to infer them from implicit user behaviors, such as clicking a link in a webpage or playing a music track. The clicks and the plays are good for indicating the items a user liked (i.e., positive training examples), but the items a user did not like (negative training examples) are not directly observed. Previous approaches either randomly pick negative training samples from unseen items or incorporate some heuristics into the learning model, leading to a biased solution and a prolonged training period. In this paper, we propose to dynamically choose negative training samples from the ranked list produced by the current prediction model and iteratively update our model. The experiments conducted on three large-scale datasets show that our approach not only reduces the training time, but also leads to significant performance gains.
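One way to picture dynamic negative sampling is the sketch below: draw a few items the user has not interacted with, score them with the current model, and use the highest-ranked one as the negative example for the next update. This is only one plausible variant under stated assumptions; the abstract says negatives are chosen from the ranked list produced by the current model but gives no further detail. The `score` callable, `n_candidates`, and the toy data are hypothetical.

```python
import random


def sample_negative(user, positives, n_items, score, n_candidates=5, rng=random):
    """Pick a negative item for `user` using the current model's ranking.

    positives: set of item ids the user interacted with.
    score:     callable (user, item) -> float, the current prediction model.
    """
    unseen = [item for item in range(n_items) if item not in positives]
    # Draw a small random candidate pool from the unseen items.
    candidates = rng.sample(unseen, min(n_candidates, len(unseen)))
    # The candidate the model currently ranks highest is the hardest negative.
    return max(candidates, key=lambda item: score(user, item))


def toy_score(user, item):
    """Stand-in for a learned model: higher item ids score higher."""
    return float(item)


negative = sample_negative(
    user=0,
    positives={9},
    n_items=10,
    score=toy_score,
    n_candidates=9,
    rng=random.Random(0),
)
```

Because the negatives are re-selected as the model changes, each update concentrates on items the model currently ranks too high, which is consistent with the shorter training time the abstract reports.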
DOI: 10.1145/2484028.2484126 (https://doi.org/10.1145/2484028.2484126); published 2013-07-28
Citations: 189
Studying page life patterns in dynamical web
Alexey Tikhonov, Ivan Bogatyy, Pavel Burangulov, L. Ostroumova, V. Koshelev, Gleb Gusev
With the ever-increasing speed of content turnover on the web, it is particularly important to understand the patterns that pages' popularity follows. This paper focuses on the dynamical part of the web, i.e. pages that have a limited lifespan and experience a short popularity outburst within it. We classify these pages into five patterns based on how quickly they gain popularity and how quickly they lose it. We study the properties of pages that belong to each pattern and determine content topics that contain disproportionately high fractions of particular patterns. These developments are utilized to create an algorithm that approximates with reasonable accuracy the expected popularity pattern of a web page based on its URL and, if available, prior knowledge about its domain's topics.
DOI: 10.1145/2484028.2484185 (https://doi.org/10.1145/2484028.2484185); published 2013-07-28
Citations: 1
Document identifier reassignment and run-length-compressed inverted indexes for improved search performance
Diego Arroyuelo, Senén González, M. Oyarzún, Victor Sepulveda
Text search engines are a fundamental tool nowadays. Their efficiency relies on a popular and simple data structure: the inverted index. Currently, inverted indexes can be represented very efficiently using index compression schemes. Recent investigations also study how an optimized document ordering can be used to assign document identifiers (docIDs) to the document database. This yields important improvements in index compression and query processing time. In this paper we follow this line of research, yet from a different perspective. We propose a docID reassignment method that allows one to focus on a given subset of inverted lists to improve their performance. We then use run-length encoding to compress these lists (as many consecutive 1s are generated). We show that this approach improves the performance not only of the particular subset of inverted lists but also of the whole inverted index. Our experimental results indicate a reduction of about 10% in the space usage of the whole index when docID reassignment is focused on a subset of lists. Also, decompression is up to 1.22 times faster if the runs must be explicitly decompressed and up to 4.58 times faster if implicit decompression of runs is allowed. Finally, we also improve the Document-at-a-Time query processing time of AND queries (by up to 12%), WAND queries (by up to 23%) and full (non-ranked) OR queries (by up to 86%).
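The connection between docID reassignment and run-length encoding can be sketched as follows: when documents in a list receive consecutive identifiers, the gaps between successive docIDs are mostly 1, and a run of 1s collapses into a single pair. The pair-based representation and the function names are assumptions for illustration and not the encoding used in the paper.

```python
def to_gaps(doc_ids):
    """Convert a strictly increasing docID list into gaps.

    The first gap is the first docID itself (docIDs assumed to start at 1).
    """
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def run_length_encode(gaps):
    """Collapse each run of 1-gaps into (1, run_length); other gaps become (gap, 1)."""
    encoded = []
    for gap in gaps:
        if gap == 1 and encoded and encoded[-1][0] == 1:
            encoded[-1] = (1, encoded[-1][1] + 1)  # extend the current run
        else:
            encoded.append((gap, 1))
    return encoded


def decode(encoded):
    """Rebuild the docID list from (gap, count) pairs."""
    doc_ids, current = [], 0
    for gap, count in encoded:
        for _ in range(count):
            current += gap
            doc_ids.append(current)
    return doc_ids


postings = [3, 4, 5, 6, 7, 20, 21, 22]
encoded = run_length_encode(to_gaps(postings))
```

Eight postings shrink to four pairs here. A query processor that understands runs can also skip over a whole run without expanding it, which is one reading of the "implicit decompression" mentioned in the abstract.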
DOI: 10.1145/2484028.2484079 (https://doi.org/10.1145/2484028.2484079); published 2013-07-28
Citations: 23