In this paper, we investigate the effect of heuristic IR constraints on IR term-document scoring functions within the recently proposed function discovery framework. In the earlier study, the constraints were empirically validated only as a whole. Moreover, only the group of form constraints was used; the other prominent group, the adjustment constraints, was not considered. In this work, we investigate all the constraints individually and study them with two different term frequency normalizations, namely the normalization scheme used in DFR models and the relative term count normalization used in language models.
{"title":"Study of Heuristic IR Constraints Under Function Discovery Framework","authors":"Parantapa Goswami, Massih-Reza Amini, Éric Gaussier","doi":"10.1145/2808194.2809479","DOIUrl":"https://doi.org/10.1145/2808194.2809479","url":null,"abstract":"In this paper we investigate the effect of the heuristic IR constraints on IR term-document scoring functions within the recently proposed function discovery framework. In the earlier study the constraints were empirically validated as a whole. Moreover, only the group of form constraints was utilized and the other prominent group, the adjustment constraints, was not considered. In this work we will investigate all the constraints individually and study them with two different term frequency normalization, namely normalization scheme used in DFR models and relative term count normalization used in language models.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133339214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
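The two term frequency normalizations named in the abstract above can be made concrete with a small sketch. It assumes that the "normalization scheme used in DFR models" is the common DFR Normalization 2, tfn = tf * log2(1 + c * avgdl / dl), and that "relative term count" is tf / dl. The function names and the default c = 1.0 are illustrative and are not taken from the paper.

```python
import math


def dfr_normalization_2(tf, doc_len, avg_doc_len, c=1.0):
    """DFR Normalization 2 (assumed variant): tf * log2(1 + c * avgdl / dl)."""
    if doc_len <= 0:
        raise ValueError("doc_len must be positive")
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)


def relative_term_count(tf, doc_len):
    """Relative term count as used in language models: tf / dl."""
    if doc_len <= 0:
        raise ValueError("doc_len must be positive")
    return tf / doc_len


if __name__ == "__main__":
    # A term occurring 3 times in documents of different lengths
    # (the average document length is assumed to be 100 tokens).
    for dl in (50, 100, 200):
        print(dl, dfr_normalization_2(3, dl, 100), relative_term_count(3, dl))
```

Both functions reduce the contribution of a raw count as the document gets longer, but the DFR variant does so logarithmically relative to the average length, whereas the relative count decays as 1 / dl.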
Patent retrieval has some unique features relative to web search. One major task in this domain is finding existing patents that may invalidate new patents, known as prior-art or invalidity search, where search queries can be formulated from query patents (i.e., new patents). Since a patent document generally contains long and complex descriptions, generating effective search queries can be difficult. Typically, these queries must cover diverse aspects of the new patent application in order to retrieve relevant documents that cover the full scope of the patent. Given this context, search diversification techniques can potentially improve the retrieval performance of patent search by introducing diversity into the document ranking. In this paper, we examine the effectiveness of a recent term-based diversification framework for patent search. Using this framework involves developing methods to identify effective phrases related to the topics mentioned in the query patent. In our experiments, we evaluate our diversification approach using standard measures of retrieval effectiveness and diversity, and show significant improvements relative to state-of-the-art baselines.
{"title":"Improving Patent Search by Search Result Diversification","authors":"Youngho Kim, W. Bruce Croft","doi":"10.1145/2808194.2809455","DOIUrl":"https://doi.org/10.1145/2808194.2809455","url":null,"abstract":"Patent retrieval has some unique features relative to web search. One major task in this domain is finding existing patents that may invalidate new patents, known as prior-art or invalidity search, where search queries can be formulated from query patents (i.e., new patents). Since a patent document generally contains long and complex descriptions, generating effective search queries can be complex and difficult. Typically, these queries must cover diverse aspects of the new patent application in order to retrieve relevant documents that cover the full scope of the patent. Given this context, search diversification techniques can potentially improve the retrieval performance of patent search by introducing diversity into the document ranking. In this paper, we examine the effectiveness for patent search of a recent term-based diversification framework. Using this framework involves developing methods to identify effective phrases related to the topics mentioned in the query patent. In our experiments, we evaluate our diversification approach using standard measures of retrieval effectiveness and diversity, and show significant improvements relative to state-of-the-art baselines.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"30 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115603377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large knowledge bases are being developed to describe entities, their attributes, and their relationships to other entities. Prior research mostly focuses on the construction of knowledge bases, while how to use them in information retrieval is still an open problem. This paper presents a simple and effective method of using one such knowledge base, Freebase, to improve query expansion, a classic and widely studied information retrieval task. It investigates two methods of identifying the entities associated with a query, and two methods of using those entities to perform query expansion. A supervised model combines information derived from Freebase descriptions and categories to select terms that are effective for query expansion. Experiments on the ClueWeb09 dataset with TREC Web Track queries demonstrate that these methods are almost 30% more effective than strong, state-of-the-art query expansion algorithms. In addition to improving average performance, some of these methods have better win/loss ratios than baseline algorithms, with 50% fewer queries damaged.
{"title":"Query Expansion with Freebase","authors":"Chenyan Xiong, Jamie Callan","doi":"10.1145/2808194.2809446","DOIUrl":"https://doi.org/10.1145/2808194.2809446","url":null,"abstract":"Large knowledge bases are being developed to describe entities, their attributes, and their relationships to other entities. Prior research mostly focuses on the construction of knowledge bases, while how to use them in information retrieval is still an open problem. This paper presents a simple and effective method of using one such knowledge base, Freebase, to improve query expansion, a classic and widely studied information retrieval task. It investigates two methods of identifying the entities associated with a query, and two methods of using those entities to perform query expansion. A supervised model combines information derived from Freebase descriptions and categories to select terms that are effective for query expansion. Experiments on the ClueWeb09 dataset with TREC Web Track queries demonstrate that these methods are almost 30% more effective than strong, state-of-the-art query expansion algorithms. In addition to improving average performance, some of these methods have better win/loss ratios than baseline algorithms, with 50% fewer queries damaged.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114676385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Manual annotations, e.g. tags and links, of user-generated content in community question answering (CQA) forums and social media play an important role in making the content searchable. During the active phase of a new question entered into a CQA forum, a moderator or an answerer often has to make a significant effort to manually search for related question threads (which we refer to as documents) that they may consider linking to the current question. This manual effort can be greatly reduced by an automated search process that suggests a list of candidate documents to be linked to the new document. We describe our investigation of link recommendation for this task. We approach the problem as an ad-hoc information retrieval (IR) task in which a new document (question) acts as the query and the intention is to retrieve a list of potentially relevant documents (previously asked questions in the forum), which could then be linked (manually) to the new one. In contrast to standard ad-hoc search, two pieces of additional human-annotated information, namely the tags of the documents and the known links between existing document pairs, can potentially be used to improve the search quality for new questions. To utilize this additional information, we propose a generative model of tagged documents which jointly estimates the distribution of topics corresponding to each tag of a document along with the likelihood of a document being linked to another one. The model predictions are then incorporated in the query likelihood estimate of a standard language model (LM) of IR. Experiments conducted on three months of a crawled StackOverflow dataset show that utilizing the tag-specific topic distributions results in a significant improvement in retrieval of the candidate set of related documents.
{"title":"Partially Labeled Supervised Topic Models for Retrieving Similar Questions in CQA Forums","authors":"Debasis Ganguly, G. Jones","doi":"10.1145/2808194.2809460","DOIUrl":"https://doi.org/10.1145/2808194.2809460","url":null,"abstract":"Manual annotations, e.g. tags and links, of user generated content in community question answering forums and social media play an important role in making the content searchable. During the active phase of a new question entered into a CQA forum, a moderator or an answerer often has to make a significant effort to manually search for related question threads (which we refer to as documents), that he may consider linking to the current question. This manual effort can be greatly reduced by an automated search process to suggest a list of candidate documents to be linked to the new document. We described our investigation of link recommendation for this task. We approach the problem as an ad-hoc information retrieval (IR) task in which a new document (question) acts as the query and the intention is to retrieve a list of potentially relevant documents (previously asked questions in the forum), which could then be linked (manually) to the new one. In contrast to standard ad-hoc search, two pieces of human annotated additional information, namely the tags of the documents and the known links between existing document pairs, can potentially be used to improve the search quality for new questions. To utilize this additional information, we propose a generative model of tagged documents which jointly estimates the distribution of topics corresponding to each tag of a document along with the likelihood of a document being linked to another one. The model predictions are then incorporated in the query likelihood estimate of a standard language model (LM) of IR. Experiments conducted on three months of a crawled StackOverflow dataset show that utilizing the tag specific topic distributions results in a significant improvement in retrieval of the candidate set of related documents.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"230 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114990973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
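The "query likelihood estimate of a standard language model" mentioned in the abstract above can be sketched as follows. This is only the baseline estimate, not the proposed topic model. Jelinek-Mercer smoothing is an assumption made here for illustration (the abstract does not say which smoothing is used), and the function name, arguments, and lam = 0.5 are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, collection_counts,
                         collection_len, lam=0.5):
    """log P(q|d) with Jelinek-Mercer smoothing (an assumed choice):
    P(w|d) = (1 - lam) * tf(w, d) / |d| + lam * cf(w) / |C|.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = collection_counts.get(term, 0) / collection_len
        p = (1.0 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Term unseen in both document and collection.
            return float("-inf")
        score += math.log(p)
    return score


if __name__ == "__main__":
    # Toy collection statistics: term -> collection frequency.
    cf = {"python": 3, "list": 2, "sort": 5}
    doc = ["python", "list", "python"]
    print(query_log_likelihood(["python", "sort"], doc, cf, 10))
```

In a setting like the one the abstract describes, the new question would supply `query_terms` and each previously asked question would be scored as `doc_terms`; how the tag-specific topic distributions enter this estimate is not specified in the abstract and is not modelled here.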
In this paper, we propose the theme model, which provides the end user with join and meet operators to define and manipulate themes. These operators have properties that cannot be reduced to the classical logic operators, thus allowing researchers to model the informative content of documents in a novel way and to rank documents in ways other than those provided by classical logic. To this end, we introduce the main definitions and properties of the theme model and we link the model to a number of related techniques, thus suggesting how the model can be implemented and applied.
{"title":"Two Operators to Define and Manipulate Themes of a Document Collection","authors":"E. D. Buccio, M. Melucci","doi":"10.1145/2808194.2809482","DOIUrl":"https://doi.org/10.1145/2808194.2809482","url":null,"abstract":"In this paper, we propose the theme model, which will provide the end user with join and meet operators to define and manipulate themes. These operators have properties that cannot be reduced to the classical logic operators, thus allowing the researchers to model the informative content of documents in a novel way and to rank documents in ways other than those provided by the classical logic. To this end, we introduce the main definitions and properties of the theme model and we link the model to a number of related techniques, thus suggesting how the model can be implemented and applied.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115066263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In multi-class text classification, the performance (effectiveness) of a classifier is usually measured by micro-averaged and macro-averaged F1 scores. However, the scores themselves do not tell us how reliable they are in terms of forecasting the classifier's future performance on unseen data. In this paper, we propose a novel approach to explicitly modelling the uncertainty of average F1 scores through Bayesian reasoning.
{"title":"Estimating the Uncertainty of Average F1 Scores","authors":"Dell Zhang, Jun Wang, Xiaoxue Zhao","doi":"10.1145/2808194.2809488","DOIUrl":"https://doi.org/10.1145/2808194.2809488","url":null,"abstract":"In multi-class text classification, the performance (effectiveness) of a classifier is usually measured by micro-averaged and macro-averaged F1 scores. However, the scores themselves do not tell us how reliable they are in terms of forecasting the classifier's future performance on unseen data. In this paper, we propose a novel approach to explicitly modelling the uncertainty of average F1 scores through Bayesian reasoning.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115649497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
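For reference, the two point estimates named in the abstract above can be computed from per-class counts as in the sketch below. It shows only the scores themselves, not the Bayesian uncertainty model the paper proposes. Macro-averaged F1 is taken here as the mean of per-class F1 scores, which is one of two definitions in use; the function names and the example counts are made up for illustration.

```python
def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Mean of per-class F1 scores; counts maps class -> (tp, fp, fn)."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def micro_f1(counts):
    """F1 computed on counts summed over all classes."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


if __name__ == "__main__":
    # Hypothetical confusion counts for a frequent and a rare class.
    counts = {"frequent": (8, 2, 2), "rare": (1, 1, 3)}
    print(macro_f1(counts), micro_f1(counts))
```

With these counts the micro average is dominated by the frequent class while the macro average weights both classes equally, which is why the two scores are usually reported together.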
Traditional TREC-style pooling methodology relies on relevance predicted by systems to select documents for judgment. This coincides with typical search behaviour (e.g., web search). In temporally ordered streams of documents, however, users encounter documents in temporal order rather than in some predetermined rank order. We investigate a user-oriented pooling methodology focusing on the documents that simulated users would likely read in such temporally ordered streams. Under this user model, many of the relevant documents found in the TREC 2013 Temporal Summarization Track's pooling effort would never be read. Not only does our pooling strategy focus on pooling documents that will be read by (simulated) users, but the resultant pools also differ from the standard TREC pools.
{"title":"Pooling for User-Oriented Evaluation Measures","authors":"G. Baruah, Adam Roegiest, Mark D. Smucker","doi":"10.1145/2808194.2809493","DOIUrl":"https://doi.org/10.1145/2808194.2809493","url":null,"abstract":"Traditional TREC-style pooling methodology relies on using predicted relevance by systems to select documents for judgment. This coincides with typical search behaviour (e.g., web search). In the case of temporally ordered streams of documents, the order that users encounter documents is in this temporal order and not some predetermined rank order. We investigate a user oriented pooling methodology focusing on the documents that simulated users would likely read in such temporally ordered streams. Under this user model, many of the relevant documents found in the TREC 2013 Temporal Summarization Track's pooling effort would never be read. Not only does our pooling strategy focus on pooling documents that will be read by (simulated) users, the resultant pools are different from the standard TREC pools.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127607857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Retrievability is an important and interesting indicator that can be used in a number of ways to analyse Information Retrieval systems and document collections. Rather than focusing solely on relevance, retrievability examines what is retrieved, how often it is retrieved, and whether a user is likely to retrieve it or not. This is important because a document needs to be retrieved before it can be judged for relevance. In this tutorial, we shall explain the concept of retrievability along with a number of retrievability measures, how it can be estimated and how it can be used for analysis. Since retrieval precedes relevance, we shall also provide an overview of how retrievability relates to effectiveness - describing some of the insights that researchers have discovered thus far. We shall also show how retrievability relates to efficiency, and how the theory of retrievability can be used to improve both effectiveness and efficiency. Then we shall provide an overview of the different applications of retrievability such as Search Engine Bias, Corpus Profiling, etc., before wrapping up with challenges and opportunities. The final session will look at example problems and ways to analyse and apply retrievability to other problems and domains. Participants are invited to bring their own problems to be discussed after the tutorial. This half-day tutorial is ideal for: (i) researchers curious about retrievability and wanting to see how it can impact their research, (ii) researchers who would like to expand their set of analysis techniques, and/or (iii) researchers who would like to use retrievability to perform their own analysis.
{"title":"Theory of Retrieval: The Retrievability of Information","authors":"L. Azzopardi","doi":"10.1145/2808194.2809444","DOIUrl":"https://doi.org/10.1145/2808194.2809444","url":null,"abstract":"Retrievability is an important and interesting indicator that can be used in a number of ways to analyse Information Retrieval systems and document collections. Rather than focusing totally on relevance, retrievability examines what is retrieved, how often it is retrieved, and whether a user is likely to retrieve it or not. This is important because a document needs to be retrieved, before it can be judged for relevance. In this tutorial, we shall explain the concept of retrievability along with a number of retrievability measures, how it can be estimated and how it can be used for analysis. Since retrieval precedes relevance, we shall also provide an overview of how retrievability relates to effectiveness - describing some of the insights that researchers have discovered thus far. We shall also show how retrievability relates to efficiency, and how the theory of retrievability can be used to improve both effectiveness and efficiency. Then we shall provide an overview of the different applications of retrievability such as Search Engine Bias, Corpus Profiling, etc., before wrapping up with challenges and opportunities. The final session will look at example problems and ways to analyse and apply retrievability to other problems and domains. Participants are invited to bring their own problems to be discussed after the tutorial. This half-day tutorial is ideal for: (i) researchers curious about retrievability and wanting to see how it can impact their research, (ii) researchers who would like to expand their set of analysis techniques, and/or (iii) researchers who would like to use retrievability to perform their own analysis.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128639189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
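One way to make the retrievability measures mentioned in the tutorial abstract above concrete is the sketch below. It assumes the cumulative measure, in which a document scores 1 for every query that ranks it within a cutoff c, with all queries weighted equally, and it summarizes the resulting inequality with a Gini coefficient, a common choice when analysing retrieval bias. The function names and the toy rankings are hypothetical, and nothing here is taken from the tutorial material itself.

```python
def cumulative_retrievability(rankings, cutoff, doc_ids):
    """r(d) = number of queries whose top-`cutoff` results contain d.

    `rankings` is a list of ranked document-id lists, one per query.
    Query weights are assumed to be 1 for every query.
    """
    scores = {d: 0 for d in doc_ids}
    for ranked in rankings:
        for d in ranked[:cutoff]:
            scores[d] = scores.get(d, 0) + 1
    return scores


def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula over ascending values with 1-based index i.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


if __name__ == "__main__":
    docs = ["d1", "d2", "d3", "d4"]
    runs = [["d1", "d2", "d3"], ["d1", "d3", "d2"], ["d1", "d2", "d4"]]
    r = cumulative_retrievability(runs, cutoff=2, doc_ids=docs)
    print(r, gini(r.values()))
```

A Gini value near 0 would indicate that documents are retrieved about equally often across the query set, while values approaching 1 would indicate that a small set of documents dominates the top ranks.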
Semantic relatedness is essential for different text processing tasks, especially in the cross-lingual setting due to the vocabulary mismatch problem. Many concept-based solutions to semantic relatedness have been proposed, which vary in the notions of concept and document representation. In our contribution, we provide a unified model that generalizes over the existing approaches to cross-lingual semantic relatedness. It shows that the main existing solutions represent different ways of constructing the concept space, which result in different document representations and implications for semantic relatedness computation. In particular, it allows us to provide theoretical justifications of existing solutions. Through the experimental evaluation, we show that the results support our theoretical findings.
{"title":"A Theoretical Analysis of Cross-lingual Semantic Relatedness in Vector Space Models","authors":"Lei Zhang, Thanh Tran, Achim Rettinger","doi":"10.1145/2808194.2809450","DOIUrl":"https://doi.org/10.1145/2808194.2809450","url":null,"abstract":"Semantic relatedness is essential for different text processing tasks, especially in the cross-lingual setting due to the vocabulary mismatch problem. Many concept-based solutions to semantic relatedness have been proposed, which vary in the notions of concept and document representation. In our contribution, we provide a unified model that generalizes over the existing approaches to cross-lingual semantic relatedness. It shows that the main existing solutions represent different ways for constructing the concept space, which result in different document representations and implications for semantic relatedness computation. In particular, it allows us to provide theoretical justifications of existing solutions. Through the experimental evaluation, we show that the results support our theoretical findings.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130879424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jeroen B. P. Vuurens, A. D. Vries, Roi Blanco, P. Mika
Following online news about a specific event can be a difficult task, as new information is often scattered across web pages. In such cases, an up-to-date summary of the event would help to inform users and allow them to navigate to articles that are likely to contain relevant and novel details. We propose a three-step approach to online news tracking for ad-hoc information needs. First, we continuously cluster the titles of all incoming news articles. Then, we select the clusters that best fit a user's ad-hoc information need and identify salient sentences. Finally, we select sentences for the summary based on novelty and relevance to the information seen, without requiring an a-priori model of events of interest. We evaluate this approach using the 2013 TREC Temporal Summarization test set and show that, compared to existing systems, our approach retrieves news facts with significantly higher F-measure and Latency-Discounted Expected Gain.
{"title":"Online News Tracking for Ad-Hoc Information Needs","authors":"Jeroen B. P. Vuurens, A. D. Vries, Roi Blanco, P. Mika","doi":"10.1145/2808194.2809474","DOIUrl":"https://doi.org/10.1145/2808194.2809474","url":null,"abstract":"Following online news about a specific event can be a difficult task as new information is often scattered across web pages. In such cases, an up-to-date summary of the event would help to inform users and allow them to navigate to articles that are likely to contain relevant and novel details. We propose a three-step approach to online news tracking for ad-hoc information needs. First, we continuously cluster the titles of all incoming news articles. Then, we select the clusters that best fit a user's ad-hoc information need and identify salient sentences. Finally, we select sentences for the summary based on novelty and relevance to the information seen, without requiring an a-priori model of events of interest. We evaluate this approach using the 2013 TREC Temporal Summarization test set and show that compared to existing systems our approach retrieves news facts with significantly higher F-measure and Latency-Discounted Expected Gain.","PeriodicalId":440325,"journal":{"name":"Proceedings of the 2015 International Conference on The Theory of Information Retrieval","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125116250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}