Utilizing distinct terms for proximity and phrases in the document for better information retrieval

M. I. Rafique, M. Hassan
2014 International Conference on Emerging Technologies (ICET), published 2014-12-01
DOI: 10.1109/ICET.2014.7021024 (https://doi.org/10.1109/ICET.2014.7021024)
Citations: 2

Abstract

The rapid increase in web data has led users to search the web for information, and they want the retrieved documents to be relevant to their queries. Proximity serves this purpose: when query terms occur close together in a document, the document is likely to be more relevant. Yet almost every retrieval function (e.g., BM25) uses a bag-of-words model and therefore ignores proximity. In this paper, we find short sentences (sub-sentences), or groups of words, formed by distinct terms at their first occurrence in a document, use them to calculate proximity, and incorporate that proximity into a retrieval function to improve document ranking. We show that a significant number of short sentences are formed by distinct words and can be exploited for proximity and phrase search. Furthermore, by utilizing distinct terms (UDT), we turn the non-positional (record-level) inverted index into a partial-positional inverted index. This UDT partial-positional inverted index is kept separate from the full positional (word-level) inverted index so that proximity among distinct terms can be calculated efficiently. Whereas calculating proximity with a full positional inverted index is complex and computationally expensive, proximity with the UDT partial-positional inverted index can be computed easily and efficiently. Experiments on various data sets show that the proposed approach improves retrieval precision.
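The abstract does not give the index layout or the scoring formula, so the following is only a minimal Python sketch of the general idea under stated assumptions. It stores one position per posting (the first occurrence of each distinct term in a document), and it scores proximity as the inverse of the smallest gap between the first occurrences of two query terms. The names `build_udt_index`, `proximity`, and `rerank`, the whitespace tokenizer, the inverse-gap score, and the additive `weight` are all hypothetical illustrations, not the authors' method.

```python
from collections import defaultdict
from itertools import combinations


def build_udt_index(docs):
    """Map term -> {doc_id: position of the term's first occurrence}.

    Assumption: whitespace tokenization and lowercasing stand in for
    whatever preprocessing the paper actually uses.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            # Keep only the first position, so each posting holds one integer.
            index[term].setdefault(doc_id, pos)
    return index


def proximity(index, query_terms, doc_id):
    """Hypothetical score: 1 / (smallest gap between first occurrences)."""
    positions = [index[t][doc_id] for t in set(query_terms)
                 if doc_id in index.get(t, {})]
    if len(positions) < 2:
        return 0.0  # proximity needs at least two matched query terms
    # Distinct terms occupy distinct positions, so the gap is at least 1.
    min_gap = min(abs(a - b) for a, b in combinations(positions, 2))
    return 1.0 / min_gap


def rerank(base_scores, index, query_terms, weight=1.0):
    """Add a weighted proximity bonus to a base score (e.g. from BM25)."""
    combined = {doc_id: score + weight * proximity(index, query_terms, doc_id)
                for doc_id, score in base_scores.items()}
    return sorted(combined, key=combined.get, reverse=True)


docs = {
    "d1": "information retrieval ranks documents by relevance",
    "d2": "information about documents and their later retrieval",
}
index = build_udt_index(docs)
query = ["information", "retrieval"]
# Equal base scores: only the proximity bonus separates the two documents.
ranking = rerank({"d1": 1.0, "d2": 1.0}, index, query)
```

In this toy run both documents contain both query terms, so a bag-of-words score alone could not separate them. The query terms are adjacent in `d1` and six positions apart in `d2`, so the proximity bonus ranks `d1` first. Because each posting carries a single integer, the pairwise comparison stays small, which is the efficiency argument the abstract makes against a full positional index.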