Title: Utilizing distinct terms for proximity and phrases in the document for better information retrieval
Authors: M. I. Rafique, M. Hassan
Published in: 2014 International Conference on Emerging Technologies (ICET), 2014-12-01
DOI: 10.1109/ICET.2014.7021024 (https://doi.org/10.1109/ICET.2014.7021024)
Citations: 2
Abstract
The rapid growth of web data has led users to search the web for information, and they expect the retrieved documents to be relevant to their queries. Proximity is used for this purpose: when query terms occur close together in a document, the document is more likely to be relevant. However, almost every retrieval function (e.g., BM25) uses a bag-of-words model and therefore ignores proximity. In this paper, we find short sentences (sub-sentences), or groups of words, formed by distinct terms at their first occurrence in a document, use them to calculate proximity, and incorporate that proximity into a retrieval function to improve document ranking. We show that a significant number of such short sentences exist and can be exploited for proximity and phrase search. Furthermore, by utilizing distinct terms (UDT), we turn the non-positional (record-level) inverted index into a partial-positional inverted index, which is kept separate from the full positional (word-level) inverted index so that proximity among the distinct terms can be calculated efficiently. Whereas calculating proximity with a full positional inverted index is complex and computationally expensive, proximity with the UDT partial-positional inverted index can be computed easily and efficiently. Experiments on various data sets show that the proposed approach improves retrieval precision.
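To make the idea of a partial-positional index concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "partial-positional" means storing only the position of each distinct term's first occurrence per document, and it uses a hypothetical proximity score (the sum of inverse pairwise distances between those first-occurrence positions). The abstract does not specify the scoring function or the tokenization; the names `build_first_occurrence_index` and `proximity_score`, and the whitespace tokenizer, are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_first_occurrence_index(docs):
    """Map term -> {doc_id: position of the term's FIRST occurrence}.

    Assumption: a partial-positional index keeps one position per
    (term, document) pair instead of the full list of positions.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Whitespace tokenization is a simplification for illustration.
        for pos, term in enumerate(text.lower().split()):
            # setdefault keeps the earliest position and ignores repeats.
            index[term].setdefault(doc_id, pos)
    return index


def proximity_score(index, query_terms, doc_id):
    """Hypothetical score: sum of 1/distance over all pairs of query terms.

    Only first-occurrence positions are consulted, so no full
    positional posting lists are needed.
    """
    unique_terms = list(dict.fromkeys(t.lower() for t in query_terms))
    positions = [
        index[t][doc_id]
        for t in unique_terms
        if t in index and doc_id in index[t]
    ]
    # Distinct terms have distinct first positions, so distance >= 1.
    return sum(1.0 / abs(p - q) for p, q in combinations(positions, 2))


docs = {
    "d1": "information retrieval with proximity ranking",
    "d2": "retrieval of web data and other information",
}
index = build_first_occurrence_index(docs)
for doc_id in docs:
    print(doc_id, proximity_score(index, ["information", "retrieval"], doc_id))
```

In this toy collection both documents contain both query terms, so a bag-of-words match treats them alike on term presence, while the proximity score favors `d1`, where the two terms are adjacent. In a real system such a score would be combined with a retrieval function such as BM25; how the paper combines them is not stated in the abstract.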