Utilizing distinct terms for proximity and phrases in the document for better information retrieval

2014 International Conference on Emerging Technologies (ICET). Pub Date: 2014-12-01. DOI: 10.1109/ICET.2014.7021024

M. I. Rafique, M. Hassan


Citations: 2

Abstract

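The rapid growth of web data has made web search a primary way for users to find information. Users want the retrieved documents to be more relevant to their queries. Proximity serves this goal: when query terms occur close together in a document, the document is more likely to be relevant. However, almost every retrieval function (e.g., BM25) uses a bag-of-words model and therefore ignores proximity. In this paper, we identify short sentences (sub-sentences), or groups of words, formed by distinct terms at their first occurrence in a document, use them to calculate proximity, and incorporate that proximity into a retrieval function to improve document ranking. We show that a significant number of such short sentences exist and can be exploited for proximity and phrase search. Furthermore, by utilizing distinct terms (UDT), we turn the non-positional (record-level) inverted index into a partial-positional inverted index, which is kept separate from the full positional (word-level) inverted index so that proximity among distinct terms can be calculated efficiently. Calculating proximity with a full positional inverted index is complex and computationally expensive, whereas with the UDT partial-positional inverted index it can be computed easily and efficiently. Experiments on various data sets show that the proposed approach improves retrieval precision.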
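The idea of a partial-positional index can be illustrated with a minimal Python sketch. The abstract does not give the paper's data structures or scoring formula, so everything here is an assumption made for illustration: the function names (`build_udt_index`, `proximity_bonus`, `rerank`), the whitespace tokenizer, the span-based bonus, and the additive combination with a base score. The base scores stand in for the output of a bag-of-words ranker such as BM25, which is not implemented here.

```python
from collections import defaultdict


def build_udt_index(docs):
    """Map term -> {doc_id: position of the term's first occurrence}.

    Only one position per (term, document) pair is stored, which is what
    makes the index "partial-positional" in this sketch.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            # setdefault keeps the first position and ignores later ones.
            index[term].setdefault(doc_id, pos)
    return index


def proximity_bonus(index, query_terms, doc_id):
    """Hypothetical bonus in [0, 1] based on first-occurrence positions.

    Returns 1.0 when the distinct query terms first occur side by side,
    a smaller value as they spread out, and 0.0 if any term is missing.
    """
    terms = list(dict.fromkeys(query_terms))  # drop repeated query terms
    if not terms:
        return 0.0
    positions = []
    for term in terms:
        pos = index.get(term, {}).get(doc_id)
        if pos is None:
            return 0.0
        positions.append(pos)
    span = max(positions) - min(positions) + 1
    return len(terms) / span


def rerank(base_scores, index, query_terms, weight=1.0):
    """Add the weighted proximity bonus to each base score and sort."""
    scored = {
        doc_id: score + weight * proximity_bonus(index, query_terms, doc_id)
        for doc_id, score in base_scores.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


docs = {
    "d1": "information retrieval systems rank documents",
    "d2": "retrieval of data and other useful information",
}
index = build_udt_index(docs)
query = ["information", "retrieval"]

# Both documents contain both terms, so a bag-of-words ranker could tie them.
base_scores = {"d1": 1.0, "d2": 1.0}
print(rerank(base_scores, index, query))  # ['d1', 'd2']
```

In this toy example the two documents match the same query terms, and only the first-occurrence positions separate them: the terms are adjacent in `d1` but six words apart in `d2`. Because each posting holds a single integer per document, the lookup avoids scanning full position lists, which is the efficiency argument the abstract makes for the UDT index.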



