To Clean or Not to Clean

Dwaipayan Roy, Mandar Mitra, Debasis Ganguly
Journal of Data and Information Quality (JDIQ), volume 18 6, pages 1-25. Journal article, published 2018-10-29.
DOI: https://doi.org/10.1145/3242180
Citation count: 15

Abstract

Web document collections such as WT10G, GOV2, and ClueWeb are widely used for text retrieval experiments. Documents in these collections contain a fair amount of non-content-related markup in the form of tags, hyperlinks, and so on. Published articles that use these corpora generally do not provide specific details about how this markup is handled during indexing. However, this question turns out to be important: through experiments, we find that including or excluding metadata in the index can produce significantly different results with standard IR models. More importantly, the effect varies across models and collections. For example, metadata filtering is found to be generally beneficial when using BM25 or language modeling with Dirichlet smoothing, but it can significantly reduce retrieval effectiveness if language modeling is used with Jelinek-Mercer smoothing. We also observe that, in general, the performance differences become more noticeable as the amount of metadata in the test collections increases. Given this variability, we believe that the details of document preprocessing are significant from the point of view of reproducibility. In a second set of experiments, we also study the effect of preprocessing on query expansion using RM3. In this case, once again, we find that it is generally better to remove markup before using documents for query expansion.
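To make the "clean" versus "not clean" distinction concrete, the following is a minimal sketch of the two indexing inputs for a single web page. It is not the authors' preprocessing pipeline: the class `TextExtractor`, the functions `clean` and `tokens`, the sample page, and the simple alphanumeric tokenizer are all assumptions chosen for illustration, built only on Python's standard library.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, dropping tags, attributes, scripts, and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering <script> or <style>: ignore everything until it closes.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the collected chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def clean(html):
    """Hypothetical 'clean' variant: keep only the text content of the page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def tokens(text):
    """Illustrative tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical sample page, not taken from any of the collections.
page = (
    "<html><head><title>Retrieval</title><style>p{color:red}</style></head>"
    '<body><p>BM25 <a href="http://example.org/x">ranks</a> documents.</p>'
    "</body></html>"
)

uncleaned_terms = tokens(page)       # markup and URL fragments become index terms
cleaned_terms = tokens(clean(page))  # only content terms remain
print(len(uncleaned_terms), len(cleaned_terms))
```

In the uncleaned variant, tag names and URL fragments such as `href` and `example` enter the index and inflate the document length, which is one plausible way markup could influence length-sensitive ranking functions. The abstract reports only the observed effect, not this mechanism.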