The Simple family of codecs is popular for encoding postings lists for a search engine because they are both space effective and time efficient at decoding. These algorithms pack as many integers into a codeword as possible before moving on to the next codeword, a technique known as left-greedy. This contribution proves that left-greedy is not optimal and then introduces a dynamic programming solution that finds the optimal packing. Experiments on .gov2 and INEX Wikipedia 2009 show that although this is an interesting theoretical result, left-greedy is empirically near optimal in effectiveness and efficiency.
{"title":"Optimal Packing in Simple-Family Codecs","authors":"A. Trotman, Michael H. Albert, Blake Burgess","doi":"10.1145/2808194.2809483","DOIUrl":"https://doi.org/10.1145/2808194.2809483","url":null,"abstract":"The Simple family of codecs is popular for encoding postings lists for a search engine because they are both space effective and time efficient at decoding. These algorithms pack as many integers into a codeword as possible before moving on to the next codeword. This technique is known as left-greedy. This contribution proves that left-greedy is not optimal and then goes on to introduce a dynamic programming solution to find the optimal packing. Experiments on .gov2 and INEX Wikipedia 2009 show that although this is an interesting theoretical result, left-greedy is empirically near optimal in effectiveness and efficiency.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133670788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Theoretical frameworks like the Probability Ranking Principle and its more recent Interactive Information Retrieval variant have guided the development of ranking and retrieval algorithms for decades, yet they cannot help us model problems in Dynamic Information Retrieval, which exhibit three properties: an observable user signal, retrieval over multiple stages, and an overall search intent. In this paper a new theoretical framework for retrieval in these scenarios is proposed. We derive a general dynamic utility function for optimizing over these types of tasks that takes into account the utility of each stage and the probability of observing user feedback. We apply our framework to experiments over TREC data in the dynamic multi-page search scenario as a practical demonstration of its effectiveness, to frame the discussion of its use and limitations, and to compare it against existing frameworks.
{"title":"Dynamic Information Retrieval: Theoretical Framework and Application","authors":"Marc Sloan, Jun Wang","doi":"10.1145/2808194.2809457","DOIUrl":"https://doi.org/10.1145/2808194.2809457","url":null,"abstract":"Theoretical frameworks like the Probability Ranking Principle and its more recent Interactive Information Retrieval variant have guided the development of ranking and retrieval algorithms for decades, yet they are not capable of helping us model problems in Dynamic Information Retrieval which exhibit the following three properties; an observable user signal, retrieval over multiple stages and an overall search intent. In this paper a new theoretical framework for retrieval in these scenarios is proposed. We derive a general dynamic utility function for optimizing over these types of tasks, that takes into account the utility of each stage and the probability of observing user feedback. We apply our framework to experiments over TREC data in the dynamic multi page search scenario as a practical demonstration of its effectiveness and to frame the discussion of its use, its limitations and to compare it against the existing frameworks.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132986229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bio: Andrew McCallum is a Professor, Director of the Center for Data Science, and Director of the Information Extraction and Synthesis Laboratory in the College of Information and Computer Sciences at the University of Massachusetts Amherst. He has published over 250 papers in many areas of AI, including natural language processing, machine learning, data mining, and reinforcement learning, and his work has received over 40,000 citations.
{"title":"Embedded Representations of Lexical and Knowledge-Base Semantics","authors":"A. McCallum","doi":"10.1145/2808194.2808195","DOIUrl":"https://doi.org/10.1145/2808194.2808195","url":null,"abstract":"BIO Andrew McCallum is a Professor, Director of the Center for Data Science, and Director of the Information Extraction and Synthesis Laboratory in the College of Information and Computer Sciences at University of Massachusetts Amherst. He has published over 250 papers in many areas of AI, including natural language processing, machine learning, data mining and reinforcement learning, and his work has received over 40,000 citations.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121065419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
JavaScript engines inside modern web browsers are capable of running sophisticated multi-player games, rendering impressive 3D scenes, and supporting complex, interactive visualizations. Can this processing power be harnessed for information retrieval? This paper explores the feasibility of building a JavaScript search engine that runs completely self-contained on the client side within the browser: this includes building the inverted index, gathering term statistics for scoring, and performing query evaluation. The design takes advantage of the IndexedDB API, which is implemented by the LevelDB key-value store inside Google's Chrome browser. Experiments show that although the performance of the JavaScript prototype falls far short of the open-source Lucene search engine, it is sufficiently responsive for interactive applications. This feasibility demonstration opens the door to interesting applications and architectures.
{"title":"Building a Self-Contained Search Engine in the Browser","authors":"Jimmy J. Lin","doi":"10.1145/2808194.2809478","DOIUrl":"https://doi.org/10.1145/2808194.2809478","url":null,"abstract":"JavaScript engines inside modern web browsers are capable of running sophisticated multi-player games, rendering impressive 3D scenes, and supporting complex, interactive visualizations. Can this processing power be harnessed for information retrieval? This paper explores the feasibility of building a JavaScript search engine that runs completely self-contained on the client side within the browser - this includes building the inverted index, gathering terms statistics for scoring, and performing query evaluation. The design takes advantage of the IndexDB API, which is implemented by the LevelDB key{value store inside Google's Chrome browser. Experiments show that although the performance of the JavaScript prototype falls far short of the open-source Lucene search engine, it is sufficiently responsive for interactive applications. This feasibility demonstration opens the door to interesting applications and architectures.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127240902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Theories of search and search behavior can be used to glean insights and generate hypotheses about how people interact with retrieval systems. This paper examines three such theories: the long-standing Information Foraging Theory, along with the more recently proposed Search Economic Theory and the Interactive Probability Ranking Principle. Our goal is to develop a model for ad-hoc topic retrieval using each approach, all within a common framework, in order to (1) determine what predictions each approach makes about search behavior, and (2) show the relationships, equivalences, and differences between the approaches. While each approach takes a different perspective on modeling searcher interactions, we show that under certain assumptions they lead to similar hypotheses regarding search behavior. Moreover, we show that the models are complementary to each other but operate at different levels (i.e., sessions, patches, and situations). We further show how the differences between the approaches lead to new insights into the theories and new models. This contribution will not only lead to further theoretical developments but also enable practitioners to employ one of the three equivalent models depending on the data available.
{"title":"An Analysis of Theories of Search and Search Behavior","authors":"L. Azzopardi, G. Zuccon","doi":"10.1145/2808194.2809447","DOIUrl":"https://doi.org/10.1145/2808194.2809447","url":null,"abstract":"Theories of search and search behavior can be used to glean insights and generate hypotheses about how people interact with retrieval systems. This paper examines three such theories, the long standing Information Foraging Theory, along with the more recently proposed Search Economic Theory and the Interactive Probability Ranking Principle. Our goal is to develop a model for ad-hoc topic retrieval using each approach, all within a common framework, in order to (1) determine what predictions each approach makes about search behavior, and (2) show the relationships, equivalences and differences between the approaches. While each approach takes a different perspective on modeling searcher interactions, we show that under certain assumptions, they lead to similar hypotheses regarding search behavior. Moreover, we show that the models are complementary to each other, but operate at different levels (i.e., sessions, patches and situations). We further show how the differences between the approaches lead to new insights into the theories and new models. This contribution will not only lead to further theoretical developments, but also enables practitioners to employ one of the three equivalent models depending on the data available.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121936182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Several applications in information retrieval rely on asymmetric co-relevance estimation; that is, estimating the relevance of a document to a query under the assumption that another document is relevant. We present a supervised model for learning an asymmetric co-relevance estimate. The model uses different types of similarities with the assumed relevant document and the query, as well as document-quality measures. Empirical evaluation demonstrates the merits of using the co-relevance estimate in various applications, including cluster-based and graph-based document retrieval. Specifically, the resulting performance surpasses that of a wide variety of alternative estimates, most of them the symmetric inter-document similarity measures that dominate past work.
{"title":"Learning Asymmetric Co-Relevance","authors":"Fiana Raiber, Oren Kurland, Filip Radlinski, Milad Shokouhi","doi":"10.1145/2808194.2809454","DOIUrl":"https://doi.org/10.1145/2808194.2809454","url":null,"abstract":"Several applications in information retrieval rely on asymmetric co-relevance estimation; that is, estimating the relevance of a document to a query under the assumption that another document is relevant. We present a supervised model for learning an asymmetric co-relevance estimate. The model uses different types of similarities with the assumed relevant document and the query, as well as document-quality measures. Empirical evaluation demonstrates the merits of using the co-relevance estimate in various applications, including cluster-based and graph-based document retrieval. Specifically, the resultant performance transcends that of using a wide variety of alternative estimates, mostly symmetric inter-document similarity measures that dominate past work.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"206 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132161532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The real-time search problem requires making ingested documents immediately searchable, which presents architectural challenges for systems built around inverted indexing. In this paper, we explore a radical proposition: What if we abandon document inversion and instead adopt an architecture based on brute force scans of document representations? In such a design, "indexing" simply involves appending the parsed representation of an ingested document to an existing buffer, which is simple and fast. Quite surprisingly, experiments with TREC Microblog test collections show that query evaluation with brute force scans is feasible and performance compares favorably to a traditional search architecture based on an inverted index, especially if we take advantage of vectorized SIMD instructions and multiple cores in modern processor architectures. We believe that such a novel design is worth further exploration by IR researchers and practitioners.
{"title":"The Feasibility of Brute Force Scans for Real-Time Tweet Search","authors":"Yulu Wang, Jimmy J. Lin","doi":"10.1145/2808194.2809489","DOIUrl":"https://doi.org/10.1145/2808194.2809489","url":null,"abstract":"The real-time search problem requires making ingested documents immediately searchable, which presents architectural challenges for systems built around inverted indexing. In this paper, we explore a radical proposition: What if we abandon document inversion and instead adopt an architecture based on brute force scans of document representations? In such a design, \"indexing\" simply involves appending the parsed representation of an ingested document to an existing buffer, which is simple and fast. Quite surprisingly, experiments with TREC Microblog test collections show that query evaluation with brute force scans is feasible and performance compares favorably to a traditional search architecture based on an inverted index, especially if we take advantage of vectorized SIMD instructions and multiple cores in modern processor architectures. We believe that such a novel design is worth further exploration by IR researchers and practitioners.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126031686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Freedom of Information legislation in many western democracies, including the United Kingdom (UK) and the United States of America (USA), states that citizens typically have the right to access government documents. However, certain sensitive information is exempt from release into the public domain. For example, in the UK, FOIA Exemption 27 (International Relations) excludes the release of information that might damage the interests of the UK abroad. The process of reviewing government documents for sensitivity is therefore essential to determine whether a document must be redacted before it is archived, or closed until the information is no longer sensitive. With the increased volume of digital government documents in recent years, there is a need for new tools to assist the digital sensitivity review process. In this paper, we therefore propose an automatic approach for identifying sensitive text in documents by measuring the amount of sensitivity in sequences of text. Using government documents reviewed by trained sensitivity reviewers, we focus on an aspect of FOIA Exemption 27 that can have a major impact on international relations, namely information supplied in confidence. We show that our approach leads to markedly increased recall of sensitive text while achieving a very high level of precision, compared to a baseline that has been shown to be effective at identifying sensitive text in other domains.
{"title":"Using Part-of-Speech N-grams for Sensitive-Text Classification","authors":"G. Mcdonald, C. Macdonald, I. Ounis","doi":"10.1145/2808194.2809496","DOIUrl":"https://doi.org/10.1145/2808194.2809496","url":null,"abstract":"Freedom of Information legislations in many western democracies, including the United Kingdom (UK) and the United States of America (USA), state that citizens have typically the right to access government documents. However, certain sensitive information is exempt from release into the public domain. For example, in the UK, FOIA Exemption 27 (International Relations) excludes the release of Information that might damage the interests of the UK abroad. Therefore, the process of reviewing government documents for sensitivity is essential to determine if a document must be redacted before it is archived, or closed until the information is no longer sensitive. With the increased volume of digital government documents in recent years, there is a need for new tools to assist the digital sensitivity review process. Therefore, in this paper we propose an automatic approach for identifying sensitive text in documents by measuring the amount of sensitivity in sequences of text. Using government documents reviewed by trained sensitivity reviewers, we focus on an aspect of FOIA Exemption 27 which can have a major impact on international relations, namely, information supplied in confidence. We show that our approach leads to markedly increased recall of sensitive text, while achieving a very high level of precision, when compared to a baseline that has been shown to be effective at identifying sensitive text in other domains.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114579364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Annotating queries with entities is one of the core problem areas in query understanding. While seemingly similar, the task of entity linking in queries differs from entity linking in documents and requires a methodological departure due to the inherent ambiguity of queries. We differentiate between two specific tasks, semantic mapping and interpretation finding, discuss current evaluation methodology, and propose refinements. We examine publicly available datasets for these tasks and introduce a new manually curated dataset for interpretation finding. To further deepen the understanding of task differences, we present a set of approaches for effectively addressing these tasks and report on experimental results.
{"title":"Entity Linking in Queries: Tasks and Evaluation","authors":"Faegheh Hasibi, K. Balog, Svein Erik Bratsberg","doi":"10.1145/2808194.2809473","DOIUrl":"https://doi.org/10.1145/2808194.2809473","url":null,"abstract":"Annotating queries with entities is one of the core problem areas in query understanding. While seeming similar, the task of entity linking in queries is different from entity linking in documents and requires a methodological departure due to the inherent ambiguity of queries. We differentiate between two specific tasks, semantic mapping and interpretation finding, discuss current evaluation methodology, and propose refinements. We examine publicly available datasets for these tasks and introduce a new manually curated dataset for interpretation finding. To further deepen the understanding of task differences, we present a set of approaches for effectively addressing these tasks and report on experimental results.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"54 20","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113957363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present a theoretical framework for tackling the cold-start collaborative filtering problem, where unknown targets (items or users) keep arriving in the system and only a limited number of resources (users or items) can be allocated and related to them. The solution requires a trade-off between exploitation and exploration: with limited recommendation opportunities we need, on the one hand, to allocate the most relevant resources right away, but, on the other hand, to allocate resources that are useful for learning the target's properties so as to recommend more relevant ones in the future. We study a simple two-stage recommendation that combines a sequential and a batch solution. We first model the problem as a partially observable Markov decision process (POMDP) and provide its exact solution. Then, through an in-depth analysis of the POMDP value iteration solution, we identify that an exact solution can be abstracted as selecting resources that are not only highly relevant to the target according to the initial-stage information, but also highly correlated, either positively or negatively, with other potential resources for the next stage. With this finding, we propose an approximate solution to ease the intractability of the exact one. Our initial results on synthetic data and the MovieLens 100K dataset confirm our theoretical development and analysis.
{"title":"A Theoretical Analysis of Two-Stage Recommendation for Cold-Start Collaborative Filtering","authors":"Xiaoxue Zhao, Jun Wang","doi":"10.1145/2808194.2809459","DOIUrl":"https://doi.org/10.1145/2808194.2809459","url":null,"abstract":"In this paper, we present a theoretical framework for tackling the cold-start collaborative filtering problem, where unknown targets (items or users) keep coming to the system, and there is a limited number of resources (users or items) that can be allocated and related to them. The solution requires a trade-off between exploitation and exploration since with the limited recommendation opportunities, we need to, on one hand, allocate the most relevant resources right away, but, on the other hand, it is also necessary to allocate resources that are useful for learning the target's properties in order to recommend more relevant ones in the future. In this paper, we study a simple two-stage recommendation combining a sequential and a batch solution together. We first model the problem with the partially observable Markov decision process (POMDP) and provide its exact solution. Then, through an in-depth analysis over the POMDP value iteration solution, we identify that an exact solution can be abstracted as selecting resources that are not only highly relevant to the target according to the initial-stage information, but also highly correlated, either positively or negatively, with other potential resources for the next stage. With this finding, we propose an approximate solution to ease the intractability of the exact solution. Our initial results on synthetic data and the MovieLens 100K dataset confirm our theoretical development and analysis.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124169189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}