Bagging to find better expansion words

Bingqing Wang, Yaqian Zhou, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Published in: Proceedings of the 6th International Conference on Natural Language Processing and Knowledge Engineering (NLPKE-2010)
Publication date: 2010-09-30
DOI: 10.1109/NLPKE.2010.5587826 (https://doi.org/10.1109/NLPKE.2010.5587826)
Citations: 1

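Abstract

Supervised learning has been applied to query expansion: a model is trained to predict the "goodness" or "utility" of an expansion term for the retrieval system. Many features measure the relatedness between an expansion term and the query, and these can be incorporated into supervised learning to select expansion terms. The training data set is generated automatically by a tricky method, and this method can be affected by many factors. A severe problem, not discussed in previous work, is that the distribution of the features is query-dependent. When the feature distributions differ across queries, it is questionable to merge the training instances and use the whole data set to train a single model. In this paper, we first investigate the statistical distribution of the automatically generated training data and show the problems in the training set. Based on this analysis, we propose using bagging to ensemble several regression models in order to obtain a better supervised model for predicting the utility of expansion terms. We conducted experiments on the TREC benchmark test collections. Our analysis of the training data reveals some interesting phenomena about query expansion techniques, and the experimental results show that the bagging approach achieves state-of-the-art retrieval performance on the standard TREC data set.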
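
To make the idea of bagging regression models concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not specify the base regressor, the feature set, or how the bags are drawn. The sketch assumes a single relatedness feature per term, ordinary least squares as the base learner, and bootstrap resampling over training instances. The names `fit_line`, `bagged_fit`, `bagged_predict`, and the toy data are all hypothetical.

```python
import random


def fit_line(points):
    """Least-squares fit y = a*x + b over (x, y) pairs.

    Falls back to a constant (the mean of y) when every x is identical.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0.0:
        return 0.0, mean_y
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def bagged_fit(points, n_models=25, seed=0):
    """Train n_models regressors, each on a bootstrap resample of points."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(points) instances with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(fit_line(sample))
    return models


def bagged_predict(models, x):
    """Average the predictions of all regressors in the ensemble."""
    return sum(a * x + b for a, b in models) / len(models)


# Toy training data (made up): (relatedness feature, utility of the term).
training = [(0.1, -0.02), (0.2, 0.00), (0.3, 0.01), (0.5, 0.03),
            (0.7, 0.06), (0.8, 0.05), (0.9, 0.08)]
models = bagged_fit(training)

# Rank hypothetical candidate expansion terms by predicted utility.
candidates = {"term_a": 0.85, "term_b": 0.15, "term_c": 0.55}
ranked = sorted(candidates,
                key=lambda t: bagged_predict(models, candidates[t]),
                reverse=True)
print(ranked)
```

Averaging over resampled models reduces the variance of any single regressor, which is one plausible reason an ensemble could cope better with training instances whose feature distributions vary from query to query.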